[Text Analysis] Co-occurrence Network

Document-Term Matrix

Imports

import numpy as np
import pandas as pd

Load the data

df = pd.read_excel('yelp.xlsx')

Build the document-term matrix

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english', min_df=0.01, binary=True)

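min_df=0.01: keep only words that occur in at least 1% of the documents
binary=True: a word that appears in a document is recorded as 1, regardless of how many times it occurs there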

dtm = cv.fit_transform(df.review)

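Adjacency Matrix

Adjacency matrix: a matrix that represents the relationships between adjacent nodes (words) in a network. Multiplying the transposed document-term matrix by the document-term matrix gives, for each pair of words, the number of documents in which both words appear; pairs that never appear together get 0.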

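cooccur = dtm.T @ dtm
adj = cooccur.toarray()

.T: transpose (swaps rows and columns)
@: matrix multiplication
.toarray(): converts the sparse result to a dense NumPy array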
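A small worked example may help in reading the product. This is a sketch with a made-up binary matrix of three documents and three words; toy_dtm, toy_cooccur and doc_freq are illustrative names only.

```python
import numpy as np

# Binary document-term matrix: 3 documents (rows) x 3 words (columns)
toy_dtm = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
])

toy_cooccur = toy_dtm.T @ toy_dtm
print(toy_cooccur)

# Diagonal: number of documents that contain each word
doc_freq = np.diag(toy_cooccur)
print(doc_freq)
```

Entry [0, 1] is 2 because words 0 and 1 appear together in two documents, while entry [1, 2] is 0 because words 1 and 2 never share a document.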

Document frequency of each word (the diagonal of the co-occurrence matrix)

n = np.diag(adj)

Total number of documents

total, _ = dtm.shape

Chi-square Test of Independence

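from scipy.stats import chi2_contingency
n_all = np.sum(adj)            # sum of all entries of the co-occurrence matrix
row_sums = adj.sum(axis=1)     # marginal total for each row word
col_sums = adj.sum(axis=0)     # marginal total for each column word
sig = np.zeros_like(adj)       # 1 where a word pair is significantly associated
significance_level = 0.05
for i in range(adj.shape[0]):
    for j in range(i + 1, adj.shape[1]):   # upper triangle only; the matrix is symmetric
        n_ij = adj[i, j]                   # co-occurrence count of words i and j
        n_i = row_sums[i]
        n_j = col_sums[j]
        # 2x2 contingency table for the pair (i, j)
        m = np.array([
            [n_ij, n_i - n_ij],
            [n_j - n_ij, n_all - n_i - n_j + n_ij]])
        chi2, p, dof, ex = chi2_contingency(m)
        sig[i, j] = sig[j, i] = p < significance_level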

sig[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
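The test above takes its marginals from the row and column sums of the co-occurrence matrix. A different option, shown here only as a hedged sketch and not as the method used in this post, is to build each 2x2 table from document counts, using the document frequencies on the diagonal and the total number of documents. The helper document_table and the toy data are hypothetical, and on real data this variant would generally produce different p-values from the ones shown above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def document_table(cooccur, i, j, n_docs):
    """2x2 table of document counts for words i and j (hypothetical helper)."""
    both = cooccur[i, j]
    only_i = cooccur[i, i] - both      # documents with word i but not word j
    only_j = cooccur[j, j] - both      # documents with word j but not word i
    neither = n_docs - both - only_i - only_j
    return np.array([[both, only_i], [only_j, neither]])

# Toy binary document-term matrix: 8 documents x 2 words (made-up data)
toy_dtm = np.array([[1, 1], [1, 1], [1, 1], [1, 0],
                    [0, 1], [0, 0], [0, 0], [0, 0]])
toy_cooccur = toy_dtm.T @ toy_dtm
table = document_table(toy_cooccur, 0, 1, toy_dtm.shape[0])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(p)
```

With this layout the four cells always add up to the number of documents, which makes the table easy to sanity-check.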

Converting to NetworkX

NetworkX is a Python library for network analysis.

import networkx as nx

Turn the adjacency matrix into a network

net = nx.from_numpy_array(sig)

Relabel the nodes with the words

words = cv.get_feature_names_out()
net = nx.relabel_nodes(net, dict(enumerate(words)))
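The two steps can be checked on a tiny matrix. This is a minimal sketch in which toy_sig and toy_words are made-up stand-ins for sig and words.

```python
import numpy as np
import networkx as nx

# Toy significance matrix for three words (made-up values)
toy_sig = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
toy_words = ["food", "service", "slow"]

# from_numpy_array creates nodes 0, 1, 2 and an edge for every nonzero entry
toy_net = nx.from_numpy_array(toy_sig)
# dict(enumerate(...)) maps 0 -> "food", 1 -> "service", 2 -> "slow"
toy_net = nx.relabel_nodes(toy_net, dict(enumerate(toy_words)))
print(sorted(toy_net.neighbors("service")))
```

Zeros in the matrix produce no edge, so "food" and "slow" are connected only through "service".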

View the words connected to service

list(nx.neighbors(net, 'service'))

['excellent', 'place', 'really', 'slow', 'terrible']

Centrality

Centrality measures indicate how important a node is within the network.

Degree centrality: number of connected words / (total number of words - 1)

dc = nx.degree_centrality(net)
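The formula can be reproduced by hand on a small graph. This is a sketch with a made-up network; toy_net and by_hand are illustrative names.

```python
import networkx as nx

# Toy network: "service" is linked to three words, the others to one each
toy_net = nx.Graph([("service", "slow"),
                    ("service", "excellent"),
                    ("service", "place")])

dc = nx.degree_centrality(toy_net)

# Same values by hand: degree / (number of nodes - 1)
n_nodes = toy_net.number_of_nodes()
by_hand = {word: degree / (n_nodes - 1) for word, degree in toy_net.degree()}
print(dc["service"], by_hand["service"])
```

Here "service" is connected to all three of the other words, so its degree centrality is 3 / 3 = 1.0.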

Betweenness centrality: the fraction of shortest paths between pairs of other words that pass through the word

bc = nx.betweenness_centrality(net)
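A three-word chain gives the simplest illustration. This is a sketch on a made-up graph, using the default normalized scores.

```python
import networkx as nx

# Toy chain: food - service - slow
toy_net = nx.Graph([("food", "service"), ("service", "slow")])

bc = nx.betweenness_centrality(toy_net)
print(bc)
```

The only shortest path between "food" and "slow" runs through "service", so "service" scores 1.0 and the two end words score 0.0.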

Closeness centrality: high for words whose average distance to the other words is short

cc = nx.closeness_centrality(net)
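For a connected network the score is the number of other words divided by the sum of shortest-path distances to them. The sketch below checks this on a made-up chain of four words; the hand calculation assumes every word is reachable from the source.

```python
import networkx as nx

# Toy chain: food - service - slow - terrible
toy_net = nx.Graph([("food", "service"),
                    ("service", "slow"),
                    ("slow", "terrible")])

cc = nx.closeness_centrality(toy_net)

# By hand for "service": (number of other words) / (sum of distances)
distances = nx.shortest_path_length(toy_net, source="service")
total_distance = sum(distances.values())   # the distance to itself is 0
by_hand = (len(distances) - 1) / total_distance
print(cc["service"], by_hand)
```

"service" is 1, 1 and 2 steps away from the other words, giving 3 / 4 = 0.75, while "food" at the end of the chain gets 3 / 6 = 0.5.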

Eigenvector centrality: a word is important if it is connected to important words

ec = nx.eigenvector_centrality(net)
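To illustrate the definition, the sketch below computes the leading eigenvector of a small adjacency matrix directly with NumPy. This is a swapped-in technique for illustration: nx.eigenvector_centrality uses an iterative method, and the matrix and names here are made up.

```python
import numpy as np

# Toy adjacency matrix: word 0 is linked to words 1, 2 and 3;
# words 1 and 2 are also linked to each other
toy_adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# For a symmetric matrix, eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(toy_adj)
leading_value = eigenvalues[-1]
scores = np.abs(eigenvectors[:, -1])   # leading eigenvector with the sign removed

# Each score is proportional to the sum of the neighbours' scores
print(scores)
```

Word 0 gets the highest score, and word 1 outranks word 3 even though both are linked to word 0, because word 1 has an additional well-connected neighbour.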

Convert the centrality to a data frame

dcf = pd.DataFrame(dc.items(), columns=['word', 'centrality'])

Sort by centrality

dcf.sort_values('centrality', ascending=False).head(10)

	word	centrality
60	service	0.061728
36	little	0.049383
14	definitely	0.049383
72	ve	0.037037
57	salad	0.037037
18	disappointed	0.037037
66	taste	0.037037
19	don	0.037037
50	place	0.037037
37	ll	0.037037
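Several measures can also be placed side by side in one table. This is a minimal sketch on a made-up graph; the column names and toy_net are illustrative choices.

```python
import networkx as nx
import pandas as pd

# Toy network (made-up edges)
toy_net = nx.Graph([("service", "slow"), ("service", "excellent"),
                    ("service", "place"), ("place", "really")])

# One column per measure; the dictionary keys (words) become the index
centrality = pd.DataFrame({
    "degree": nx.degree_centrality(toy_net),
    "betweenness": nx.betweenness_centrality(toy_net),
    "closeness": nx.closeness_centrality(toy_net),
})
print(centrality.sort_values("degree", ascending=False))
```

Sorting by a different column shows quickly whether the measures agree on which words are central.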

Visualization

pyvis is a Python library for network visualization.

Install:

pip install pyvis

Import:

from pyvis.network import Network

Convert the networkx network to a pyvis network

vis = Network(height='800px', width='1000px')
vis.from_nx(net)
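As far as I know, from_nx carries node attributes over from the NetworkX graph and reads a size attribute for the node size, so one option is to set sizes on the graph before converting it. The sketch below only prepares the attributes on a made-up graph; the 10 + 40 * centrality scaling is an arbitrary assumption, and the pyvis calls are left as comments.

```python
import networkx as nx

# Toy network (made-up edges)
toy_net = nx.Graph([("service", "slow"),
                    ("service", "excellent"),
                    ("service", "place")])

dc = nx.degree_centrality(toy_net)

# Arbitrary scaling chosen for illustration: sizes fall between 10 and 50
sizes = {word: 10 + 40 * value for word, value in dc.items()}
nx.set_node_attributes(toy_net, sizes, name="size")
print(toy_net.nodes["service"])

# Then convert as in the text (requires pyvis):
# vis = Network(height='800px', width='1000px')
# vis.from_nx(toy_net)
```

Setting the attribute before the conversion matters because the pyvis network is a copy and does not follow later changes to the NetworkX graph.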

Add the settings buttons

vis.show_buttons(filter_=True)

Save as an HTML file for viewing

vis.save_graph('nx.html')