[Graph ML] Graph Datasets - NetworkX, Network Repository, SNAP (Stanford Network Analysis Platform), OGB (Open Graph Benchmark)

백관구 2024. 3. 12. 17:53

1. NetworkX
2. Network Repository
- 2.1. MTX (Matrix Market Exchange Format) file format
- 2.2. Metric analysis
3. SNAP (Stanford Network Analysis Platform)
4. OGB (Open Graph Benchmark)

※ Source: Graph Machine Learning (Claudio Stamile et al.; Korean translation by 김기성 and 장기식)

1. NetworkX

Graph generators — NetworkX 3.2.1 documentation

networkx.org
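
As a quick illustration of what the graph generators page covers, the sketch below builds three graphs without any external data file. The particular generators and parameter values are chosen only for illustration.

```python
import networkx as nx

# Deterministic generators: the structure is fully determined by the arguments
K5 = nx.complete_graph(5)        # every pair of the 5 nodes is connected
karate = nx.karate_club_graph()  # Zachary's karate club, a classic small social network

# Random generator: passing a seed makes the result reproducible
er = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)

for name, g in [("complete_graph(5)", K5),
                ("karate_club_graph()", karate),
                ("erdos_renyi_graph(100, 0.05)", er)]:
    print(f"{name}: {g.number_of_nodes()} nodes, {g.number_of_edges()} edges")
```

Generated graphs like these are handy for testing code before moving on to the downloaded datasets in the following sections.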

2. Network Repository

Network Data Repository | The First Interactive Network Data Repository

The first interactive network dataset repository with interactive graph visualization and analytics

networkrepository.com

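Here we take a look at the ASTRO-PH graph dataset. The ASTRO-PH graph was built from scientific papers posted to the astrophysics category of the arXiv repository between January 1993 and April 2003. Follow the link above and type "ASTRO-PH" into the search box. On the dataset page, download the compressed file (ZIP) and extract it into your working folder.

Extracting the archive yields a file named "astro-ph.mtx". Set that file aside for a moment and open readme.html first. It contains guidance on how to cite the dataset if you need to credit it as a source.

Now it is time to open astro-ph.mtx. Opening it first in WordPad, which ships with Windows, shows that the data is stored as a very large number of rows (lines). Let's see how to read and work with this mtx file in Python.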

2.1. MTX (Matrix Market Exchange Format) file format

* A file format for specifying dense or sparse matrices of real or complex numbers in an ASCII (American Standard Code for Information Interchange) text file
- The header line starts with %%, as shown below

%%MatrixMarket matrix coordinate real symmetric

* Can be read in Python with the SciPy library

from scipy.io import mmread
import networkx as nx


matrix = mmread("../data/network_repository/astro-ph/astro-ph.mtx")  # read the MTX file as a SciPy sparse matrix
G = nx.from_scipy_sparse_array(matrix)  # convert the SciPy sparse matrix to a NetworkX graph

print(f"Graph order (number of nodes): {G.order()}")
print(f"Graph size (number of edges): {G.size()}")

Graph order (number of nodes): 16706
Graph size (number of edges): 121251
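
To make the file layout more concrete, here is a minimal sketch that parses a tiny coordinate-format file by hand using only the standard library. The function name parse_mtx_coordinate and the three-node sample are hypothetical, and the parser only handles real-valued or pattern entries; in practice mmread is the tool to use.

```python
def parse_mtx_coordinate(text):
    """Parse a coordinate-format Matrix Market string into (header, shape, entries)."""
    lines = text.strip().splitlines()
    header = lines[0].split()  # banner: %%MatrixMarket object format field symmetry
    if header[0] != "%%MatrixMarket" or header[2] != "coordinate":
        raise ValueError("expected a coordinate-format Matrix Market file")
    # Lines starting with a single % are comments
    body = [ln for ln in lines[1:] if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = map(int, body[0].split())  # size line
    entries = []
    for ln in body[1:1 + n_entries]:
        parts = ln.split()
        row, col = int(parts[0]) - 1, int(parts[1]) - 1     # file indices are 1-based
        value = float(parts[2]) if len(parts) > 2 else 1.0  # "pattern" files omit the value
        entries.append((row, col, value))
    return header, (n_rows, n_cols), entries


sample = """%%MatrixMarket matrix coordinate real symmetric
% a 3-node path graph: 1-2 and 2-3 (only the lower triangle is stored)
3 3 2
2 1 1.0
3 2 1.0
"""

header, shape, entries = parse_mtx_coordinate(sample)
print(header[1:])  # ['matrix', 'coordinate', 'real', 'symmetric']
print(shape)       # (3, 3)
print(entries)     # [(1, 0, 1.0), (2, 1, 1.0)]
```

Because the header says symmetric, each stored entry (i, j) also stands for (j, i), which is why an undirected graph can be stored with one line per edge.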

2.2. Metric analysis

① Computing betweenness centrality, local clustering coefficient, and node degree

betweenness_centrality = nx.betweenness_centrality(G)  # betweenness centrality (exact computation; slow on a graph of this size)
clustering = nx.clustering(G)  # local clustering coefficient
degree = dict(nx.degree(G))  # node degree

print(f"Betweenness centrality: {betweenness_centrality}")
print(f"Local clustering coefficient: {clustering}")
print(f"Node degree: {degree}")

Betweenness centrality: {0: 0.003160341116179179, 1: 0.0, 2: 0.0002895806033400033, 3: 0.0003601880676500333, 4: 0.0001708166540733188, 5: 0.00050286583593598, ...}

Local clustering coefficient: {0: 0.06031746031746032, 1: 0, 2: 0.2, 3: 0.16666666666666666, 4: 0.49166666666666664, 5: 0.08088235294117647, ...}

Node degree: {0: 36, 1: 1, 2: 5, 3: 4, 4: 16, 5: 17, ...}
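
To see what degree and the local clustering coefficient actually measure, the following sketch computes both by hand on a hypothetical four-node toy graph, using only the standard library. The adjacency dictionary and the function name local_clustering are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2
adjacency = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs exist; NetworkX also reports 0 in this case
    linked_pairs = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return linked_pairs / (k * (k - 1) / 2)


degree_toy = {node: len(nbrs) for node, nbrs in adjacency.items()}
clustering_toy = {node: local_clustering(adjacency, node) for node in adjacency}

print(degree_toy)
print(clustering_toy)
```

Nodes 0 and 1 sit inside the triangle, so all of their neighbor pairs are connected. Node 2 has three neighbor pairs but only one is connected, and node 3 has a single neighbor, which mirrors the degree-1 nodes with coefficient 0 in the ASTRO-PH output.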

② Converting to a pandas.DataFrame

import os
import pandas as pd


stats = pd.DataFrame({
    "betweenness_centrality": betweenness_centrality,
    "C_i": clustering,
    "degree": degree
})

out_dir = "04_read_mtx"
os.makedirs(out_dir, exist_ok=True)  # create the output directory
stats.to_csv(os.path.join(out_dir, "astro_ph_metrics.csv"))  # save as CSV inside that directory
stats

       betweenness_centrality       C_i  degree
0                    0.003160  0.060317      36
1                    0.000000  0.000000       1
2                    0.000290  0.200000       5
3                    0.000360  0.166667       4
4                    0.000171  0.491667      16
...                       ...       ...     ...
16701                0.000000  0.000000       0
16702                0.000000  1.000000      10
16703                0.000000  1.000000      10
16704                0.000000  1.000000      10
16705                0.000000  1.000000      10

16706 rows × 3 columns

③ Sorting by degree in descending order
- Node 5502 has the highest degree, with 360 neighbors, which suggests that it corresponds to a key figure in this field

stats.sort_values("degree", ascending=False).head()

      betweenness_centrality       C_i  degree
5502                0.011944  0.100155     360
912                 0.016902  0.100856     353
1231                0.015462  0.092242     329
5507                0.008890  0.121367     299
6197                0.008748  0.116170     296
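
The same pattern can be tried on a small scale. The sketch below uses hypothetical metric values for four nodes to show how dictionaries keyed by node ID line up on the DataFrame index before sorting.

```python
import pandas as pd

# Hypothetical per-node metrics, keyed by node ID like the dictionaries computed earlier
toy_degree = {0: 2, 1: 2, 2: 3, 3: 1}
toy_clustering = {0: 1.0, 1: 1.0, 2: 1 / 3, 3: 0.0}

# Dict keys become the row index, so values for the same node land on the same row
toy_stats = pd.DataFrame({"C_i": toy_clustering, "degree": toy_degree})

# Highest-degree nodes first; head(2) keeps the top two rows
top = toy_stats.sort_values("degree", ascending=False).head(2)
print(top)
```

The row index of the sorted frame is the node ID, which is how node 5502 can be read off the first column of the table above.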

3. SNAP (Stanford Network Analysis Platform)

SNAP: Stanford Network Analysis Project

Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes, and billions of edges

snap.stanford.edu

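Here we download and use the "amazon0302" dataset. On the SNAP home page, click [SNAP Datasets] in the left-hand menu, then select "Amazon networks" to see several Amazon-related datasets. Click "amazon0302", download the compressed file (.gz) at the bottom of the page, and decompress it.

This time the graph data comes as a plain text file (txt). Opening it in WordPad shows a few lines describing the data and basic graph information, followed by lines such as 0 1 and 0 2, each of which represents one edge.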
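
The layout of such a file can be sketched with a few lines of plain Python. The sample lines below are a hypothetical excerpt written in the style of a SNAP edge list, not the real file contents, and read_edge_list is an illustrative helper rather than part of any library.

```python
def read_edge_list(lines, separator="\t", comment="#"):
    """Return (source, target) integer pairs, skipping blank and comment lines."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment):
            continue  # description lines start with "#"
        src, dst = line.split(separator)[:2]
        edges.append((int(src), int(dst)))
    return edges


sample_lines = [
    "# hypothetical excerpt in the style of a SNAP edge list",
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "1\t0",
]

edges = read_edge_list(sample_lines)
nodes = {n for edge in edges for n in edge}
print(edges)
print(len(nodes), "nodes,", len(edges), "edges")
```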

3.1. Reading with NetworkX in Python

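import networkx as nx


# amazon0302 is a directed graph with integer node IDs; lines starting with "#" are skipped by default
G = nx.read_edgelist("amazon0302.txt", create_using=nx.DiGraph, nodetype=int)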
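
The create_using and nodetype options of read_edgelist decide what kind of graph comes back. The sketch below uses nx.parse_edgelist, which accepts the same options but reads from a list of strings, so it runs without the downloaded file. The three edges are hypothetical.

```python
import networkx as nx

lines = ["# FromNodeId\tToNodeId", "0\t1", "0\t2", "1\t0"]

# Defaults: an undirected Graph whose node labels are strings
default_G = nx.parse_edgelist(lines)

# Directed graph with integer node IDs
directed_G = nx.parse_edgelist(lines, create_using=nx.DiGraph, nodetype=int)

print(type(default_G).__name__, sorted(default_G.nodes()), default_G.number_of_edges())
print(type(directed_G).__name__, sorted(directed_G.nodes()), directed_G.number_of_edges())
```

With the defaults, the edges 0 → 1 and 1 → 0 collapse into a single undirected edge, so the edge count differs between the two graphs.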

3.2. Reading with SNAP in Python

 python -m pip install snap-stanford

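Running the code below returns a directed graph object (PNGraph) from the SNAP library, but NetworkX functions cannot be used on this object. To use NetworkX functionality, the SNAP graph object must therefore be converted to a NetworkX graph object.

import snap


snap_G = snap.LoadEdgeList(snap.PNGraph,  # graph type to create (PNGraph: directed graph, PUNGraph: undirected graph)
                      "amazon0302.txt",  # input file containing the edge list
                      SrcColId=0, DstColId=1,  # SrcColId / DstColId: index of the column holding the source (Src) / destination (Dst) node ID
                      Separator="\t")  # Separator: character that separates fields in the input file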

3.3. Converting a SNAP graph to a NetworkX graph in Python

def snap2networkx(snap_G):
    """
    Convert a SNAP graph to a NetworkX graph.

    Args:
        snap_G: SNAP graph object (snap.PNGraph, snap.PUNGraph, etc.)

    Returns:
        nx_G: NetworkX graph object
    """

    # create the NetworkX graph object
    if isinstance(snap_G, (snap.PNGraph, snap.TNGraph)):  # directed graph
        nx_G = nx.DiGraph()
    else:  # undirected graph (e.g. snap.PUNGraph)
        nx_G = nx.Graph()

    # add nodes
    for node in snap_G.Nodes():
        nx_G.add_node(node.GetId())

    # add edges
    for edge in snap_G.Edges():
        nx_G.add_edge(edge.GetSrcNId(), edge.GetDstNId())  # source (Src) / destination (Dst) node ID (NId)

    return nx_G


nx_G = snap2networkx(snap_G)  # snap_G: SNAP graph, nx_G: NetworkX graph

4. OGB (Open Graph Benchmark)

Open Graph Benchmark

A collection of benchmark datasets, data-loaders and evaluators for graph machine learning in PyTorch.

ogb.stanford.edu
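
As a rough sketch of how an OGB dataset might be pulled into the same NetworkX workflow, the code below wraps the library-agnostic loader from the ogb package and converts an edge index into a NetworkX graph. The function names, the choice of "ogbn-arxiv", and the root folder are assumptions made for illustration. The loader downloads data on first use, so only the conversion helper is exercised here, on a small stand-in edge index.

```python
import networkx as nx


def load_ogb_node_dataset(name="ogbn-arxiv", root="dataset"):
    """Download (on first use) and load an OGB node property prediction dataset."""
    from ogb.nodeproppred import NodePropPredDataset  # requires: pip install ogb

    dataset = NodePropPredDataset(name=name, root=root)
    graph, label = dataset[0]  # graph is a dict with keys such as "edge_index" and "num_nodes"
    return graph, label


def edge_index_to_networkx(edge_index, num_nodes, directed=True):
    """Build a NetworkX graph from a 2 x E array of (source, target) node IDs."""
    G = nx.DiGraph() if directed else nx.Graph()
    G.add_nodes_from(range(num_nodes))  # keeps isolated nodes that have no edges
    G.add_edges_from((int(s), int(t)) for s, t in zip(edge_index[0], edge_index[1]))
    return G


# Small stand-in for graph["edge_index"], so the conversion can be tried without downloading anything
toy = edge_index_to_networkx([[0, 0, 1], [1, 2, 2]], num_nodes=4)
print(toy.number_of_nodes(), toy.number_of_edges())
```

With the ogb package installed, calling graph, label = load_ogb_node_dataset() and then edge_index_to_networkx(graph["edge_index"], graph["num_nodes"]) should give a graph that the NetworkX functions used in the earlier sections can work on.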

License: Attribution, NonCommercial
