[Text Analysis] Search :: 마인드스케일

Sequential search

import pandas as pd
df = pd.read_csv('neurips.zip')
df.head()

	year	title	abstract
0	2007	Competition Adds Complexity	It is known that determinining whether a DEC-P...
1	2007	Efficient Principled Learning of Thin Junction...	We present the first truly polynomial algorith...
2	2007	Regularized Boost for Semi-Supervised Learning	Semi-supervised inductive learning concerns ho...
3	2007	Simplified Rules and Theoretical Analysis for ...	We show that under suitable assumptions (prima...
4	2007	Predicting human gaze using low-level saliency...	Under natural viewing conditions, human observ...

query = {'natural', 'language'}

import re
def tokenize(text):
    text = text.lower()  # convert to lowercase
    return re.findall(r'\w{2,}', text)  # extract words of two or more characters

Go through the rows of the table in order and check whether each abstract contains the query words

%%time
results = []
for row in df.itertuples():
    words = set(tokenize(row.abstract))
    if query <= words:  # the query is a subset of the document's words
        results.append(row.Index)

CPU times: total: 312 ms
Wall time: 341 ms
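
The subset test can be tried on a toy corpus. The sketch below uses three made-up sentences (not taken from the NeurIPS data) and shows one detail worth knowing: on Python sets `<` is the proper-subset operator, so it skips a document that consists of exactly the query words, while `<=` accepts it.

```python
import re

def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r'\w{2,}', text.lower())

# Hypothetical toy corpus, for illustration only
docs = [
    "Natural language processing with neural networks",
    "A natural gradient method for optimization",
    "Natural language",
]
query = {'natural', 'language'}

# `<=` tests for a subset, `<` tests for a proper subset
subset_hits = [i for i, d in enumerate(docs) if query <= set(tokenize(d))]
proper_hits = [i for i, d in enumerate(docs) if query < set(tokenize(d))]
print(subset_hits)  # [0, 2]
print(proper_hits)  # [0]
```

Every document is tokenized on every search, so the cost grows with the size of the corpus.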

Number of rows that match the condition

len(results)

View the matching rows

df.loc[results].head()

	year	title	abstract
49	2007	Discriminative Keyword Selection Using Support...	Many tasks in speech processing involve classi...
269	2008	Modeling the effects of memory on human online...	Language comprehension in humans is significan...
542	2009	Rethinking LDA: Why Priors Matter	Implementations of topic models typically use ...
557	2009	Conditional Neural Fields	Conditional random fields (CRF) are quite succ...
846	2010	Probabilistic Deterministic Infinite Automata	We propose a novel Bayesian nonparametric appr...

Lists and dictionaries

a = list(range(1000000))

Measure the time it takes to find 999999 in the list. The closer an item is to the end of the list, the longer the search takes.

%%time
a.index(999999)

CPU times: total: 15.6 ms
Wall time: 14 ms

b = dict(zip(a, a))

The lookup time is close to zero

%%time
b[999999]

CPU times: total: 0 ns
Wall time: 0 ns
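
A single `%%time` run is too coarse to measure a dictionary lookup, which is why it reports 0 ns. As a rough sketch, the standard `timeit` module can repeat each lookup and report an average; the repeat count of 10 is an arbitrary choice and the absolute numbers will vary by machine.

```python
import timeit

n = 1_000_000
a = list(range(n))
b = dict(zip(a, a))

repeats = 10  # arbitrary repeat count for this sketch

# list.index scans from the front, so the last element is the worst case
t_list = timeit.timeit(lambda: a.index(n - 1), number=repeats)
# a dict lookup hashes the key and goes straight to the entry
t_dict = timeit.timeit(lambda: b[n - 1], number=repeats)

print(f"list: {t_list / repeats:.6f} s per lookup")
print(f"dict: {t_dict / repeats:.9f} s per lookup")
```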

Indexing

from collections import defaultdict
index = defaultdict(set)

for row in df.itertuples():
    words = tokenize(row.abstract)
    for word in words:
        index[word].add(row.Index)

len(index['language'])

%%time
results = list(index['natural'] & index['language'])

CPU times: user 33 µs, sys: 4 µs, total: 37 µs
Wall time: 41.7 µs
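
The same idea on a toy corpus, as a minimal sketch: the inverted index maps each word to the set of documents that contain it, so a two-word query becomes one set intersection. The sentences and the name `inverted` are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r'\w{2,}', text.lower())

# Hypothetical toy corpus
docs = [
    "Natural language processing with neural networks",
    "A natural gradient method for optimization",
    "Language models and natural language understanding",
]

# word -> set of ids of the documents that contain the word
inverted = defaultdict(set)
for doc_id, text in enumerate(docs):
    for word in tokenize(text):
        inverted[word].add(doc_id)

hits = sorted(inverted['natural'] & inverted['language'])
print(hits)  # [0, 2]
```

The index is built once; after that, no document text has to be read at query time.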

TF

from collections import Counter
idxs = list(index['natural'] & index['language'])
results = []
for row in df.loc[idxs].itertuples():  # idxs holds index labels, so select with .loc
    words = tokenize(row.abstract)
    cnt = Counter(words)
    tf = sum(cnt[w] for w in query)
    results.append((tf, row.Index))

Sort in descending order of score

idx = [i for _, i in sorted(results, reverse=True)]

View the sorted documents

df.loc[idx].head()

	year	title	abstract
3445	2017	Emergence of Language with Multi-agent Games: ...	Learning to communicate through interaction, r...
3148	2016	LightRNN: Memory and Computation-Efficient Rec...	Recurrent neural networks (RNNs) have achieved...
1805	2013	A Novel Two-Step Method for Cross Language Rep...	Cross language text classi?cation is an import...
2920	2016	Latent Attention For If-Then Program Synthesis	Automatic translation from natural language de...
2900	2016	Dialog-based Language Learning	A long-term goal of machine learning research ...
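
A small worked example of the TF score, using two made-up sentences: the score is the total number of times the query words occur in a document, and documents are ranked by that count.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r'\w{2,}', text.lower())

# Hypothetical documents keyed by id
docs = {
    0: "Natural language processing with neural networks",
    1: "Language models and natural language understanding of language",
}
query = {'natural', 'language'}

scored = []
for doc_id, text in docs.items():
    cnt = Counter(tokenize(text))
    tf = sum(cnt[w] for w in query)  # total occurrences of the query words
    scored.append((tf, doc_id))

ranking = [doc_id for _, doc_id in sorted(scored, reverse=True)]
print(scored)   # [(2, 0), (4, 1)]
print(ranking)  # [1, 0]
```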

TF-IDF

Document frequency

{k: len(v) for k, v in index.items()}

Total number of documents

N, _ = df.shape

Inverse document frequency (IDF)

import numpy as np
idf = {k: np.log(N / len(v)) for k, v in index.items()}

idxs = list(index['natural'] & index['language'])
results = []

for row in df.loc[idxs].itertuples():  # idxs holds index labels, so select with .loc
    words = tokenize(row.abstract)
    cnt = Counter(words)
    tfidf = sum(cnt[w] * idf[w] for w in query)
    results.append((tfidf, row.Index))

idx = [i for _, i in sorted(results, reverse=True)]
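
To see what IDF changes, here is a sketch on a made-up four-document corpus. Documents 0 and 1 both contain the query words four times in total, so raw TF ties them; because 'language' appears in fewer documents than 'natural', TF-IDF ranks the document with more 'language' first.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r'\w{2,}', text.lower())

# Hypothetical toy corpus
docs = [
    "natural natural natural language models",
    "natural language language language parsing",
    "natural gradient descent",
    "image statistics",
]
query = {'natural', 'language'}
N = len(docs)

inverted = defaultdict(set)
for doc_id, text in enumerate(docs):
    for word in tokenize(text):
        inverted[word].add(doc_id)

# idf(w) = log(N / df(w)): rarer words get a larger weight
idf = {w: math.log(N / len(ids)) for w, ids in inverted.items()}

candidates = inverted['natural'] & inverted['language']
scored = []
for doc_id in candidates:
    cnt = Counter(tokenize(docs[doc_id]))
    scored.append((sum(cnt[w] * idf[w] for w in query), doc_id))

ranking = [doc_id for _, doc_id in sorted(scored, reverse=True)]
print(ranking)  # [1, 0]
```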

BM25

!pip install rank_bm25 kiwipiepy

import pandas as pd
books = pd.read_csv('science_books.csv')

from kiwipiepy import Kiwi
kiwi = Kiwi()

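def tokenize(sent):
    # Keep content words only: common nouns (NNG), proper nouns (NNP),
    # foreign-script words (SL), verbs (VV) and adjectives (VA)
    for token in kiwi.tokenize(sent):
        if token.tag in {'NNG', 'NNP', 'SL', 'VV', 'VA'}:
            yield token.form, token.tag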

tokenized_corpus = []
for title in books.제목:  # 제목 is the book title column
    tokenized_corpus.append(list(tokenize(title)))

from rank_bm25 import BM25Okapi
bm25 = BM25Okapi(tokenized_corpus)

idf_table = pd.DataFrame(bm25.idf.items(), columns=['token', 'idf'])
idf_table.sort_values('idf')

	token	idf
24	(과학, NNG)	1.590378
169	(수학, NNG)	2.076635
9	(이야기, NNG)	2.226424
17	(세상, NNG)	2.396806
31	(양장, NNG)	2.396806
...	...	...
1249	(아이디어, NNG)	6.501790
1248	(보듬, VV)	6.501790
1247	(이웃, NNG)	6.501790
1621	(스타일링, NNG)	6.501790
2389	(트리즈, NNG)	6.501790

2390 rows × 2 columns

query = list(tokenize('다정한 것이 살아남는다'))
bm25.get_top_n(query, books.제목.tolist(), n=5)

['다정한 것이 살아남는다 : 친화력으로 세상을 바꾸는 인류의 진화에 관하여(10만부 기념 스페셜 에디션, 저자 친필 사인 인쇄본)',
 '낙타는 왜 사막으로 갔을까 : 살아남은 동물들의 비밀',
 '무엇이 우리를 다정하게 만드는가 : 타인을 도우려 하는 인간 심리의 뇌과학적 비밀(양장)',
 '우주에서 기다릴게 : 한국 첫 우주인이 펼치는 다정한 호기심의 기록',
 '다정한 물리학 : 거대한 우주와 물질의 기원을 탐구하고 싶을 때']
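
For intuition about what a BM25 score measures, below is a minimal pure-Python sketch of Okapi-style scoring. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the smoothed IDF variant, the parameter values `k1=1.5` and `b=0.75`, and the English toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query`."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N  # average document length
    df = Counter(tok for doc in corpus for tok in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for tok in query:
            if tok not in tf:
                continue
            # smoothed idf that stays positive even for very common tokens
            idf = math.log(1 + (N - df[tok] + 0.5) / (df[tok] + 0.5))
            # term frequency saturates with k1; b penalizes long documents
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical tokenized titles
corpus = [
    ['kind', 'survive'],
    ['kind', 'physics'],
    ['camel', 'desert', 'survive', 'animal'],
    ['math', 'story'],
]
scores = bm25_scores(['kind', 'survive'], corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best)  # 0
```

Unlike the TF and TF-IDF searches above, a document does not need to contain every query token to receive a score, which is why titles that share only one word with the query also appear in the results.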

Embedding-based search

!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
sbert = SentenceTransformer(
    'snunlp/KR-SBERT-V40K-klueNLI-augSTS')

emb = sbert.encode(books.제목)

query_emb = sbert.encode(['다정한 것이 살아남는다'])

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

%%time
sims = cosine_similarity(query_emb, emb)
ids = np.argsort(-sims[0])[:5]

CPU times: total: 62.5 ms
Wall time: 42.5 ms

books.iloc[ids]

	제목
322	다정함의 과학 : 친절, 신뢰, 공감 속에 숨어 있는 건강과 행복의 비밀
6	다정한 것이 살아남는다 : 친화력으로 세상을 바꾸는 인류의 진화에 관하여(10만부 ...
618	모든 것은 그 자리에 : 첫사랑부터 마지막 이야기까지(양장)
392	이토록 다정한 기술 : 지구와 이웃을 보듬는 아이디어(〈희망의 이유〉 사쉐 증정 (...
111	무엇이 우리를 다정하게 만드는가 : 타인을 도우려 하는 인간 심리의 뇌과학적 비밀(양장)
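
Cosine similarity compares the direction of two vectors and ignores their length. The sketch below computes it by hand on made-up three-dimensional vectors (the SBERT embeddings here have 768 dimensions) and picks the top two, mirroring the `argsort` step above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional "embeddings"
emb = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [2.0, 0.1, 0.0],
]
query_emb = [1.0, 0.1, 0.0]

sims = [cosine(query_emb, e) for e in emb]
top = sorted(range(len(emb)), key=lambda i: -sims[i])[:2]
print(top)  # [3, 0]
```

Vector 3 is twice as long as the query but points in almost the same direction, so it ranks first.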

nmslib

!pip install nmslib

import nmslib
index = nmslib.init(method='hnsw', space='cosinesimil')  # nmslib's defaults, spelled out
index.addDataPointBatch(emb)
index.createIndex()

%%time
ids, dist = index.knnQuery(query_emb, k=5)

CPU times: user 1.38 ms, sys: 0 ns, total: 1.38 ms
Wall time: 1.48 ms

books.iloc[ids]

	제목
322	다정함의 과학 : 친절, 신뢰, 공감 속에 숨어 있는 건강과 행복의 비밀
6	다정한 것이 살아남는다 : 친화력으로 세상을 바꾸는 인류의 진화에 관하여(10만부 ...
618	모든 것은 그 자리에 : 첫사랑부터 마지막 이야기까지(양장)
392	이토록 다정한 기술 : 지구와 이웃을 보듬는 아이디어(〈희망의 이유〉 사쉐 증정 (...
111	무엇이 우리를 다정하게 만드는가 : 타인을 도우려 하는 인간 심리의 뇌과학적 비밀(양장)

chroma

!pip install chromadb

import chromadb

client = chromadb.Client()

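from chromadb.api.types import Documents, EmbeddingFunction, Embeddings

# Recent chromadb releases expect an EmbeddingFunction object whose
# __call__ takes a parameter named `input`; a plain function may be rejected
class SBertEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        return sbert.encode(input).tolist()

collection = client.create_collection(
    name="science_books",
    embedding_function=SBertEmbeddingFunction())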

metadatas = books.제목.map(lambda x: {'length': len(x)}).tolist()
ids = books.index.map(str).tolist()
collection.add(
    documents=books.제목.tolist(),
    metadatas=metadatas,
    ids=ids
)

results = collection.query(
    query_texts=["다정한 것이 살아남는다"],
    n_results=5
)

%%time
results = collection.query(
    query_embeddings=[query_emb[0].tolist()],
    n_results=5
)

CPU times: total: 0 ns
Wall time: 999 µs

query_emb.shape

(1, 768)

results

{'ids': [['6', '322', '392', '111', '313']],
 'distances': [[305.27313232421875,
   332.410888671875,
   351.19512939453125,
   361.7748718261719,
   384.9884948730469]],
 'metadatas': [[{'length': 71},
   {'length': 40},
   {'length': 54},
   {'length': 49},
   {'length': 90}]],
 'embeddings': None,
 'documents': [['다정한 것이 살아남는다 : 친화력으로 세상을 바꾸는 인류의 진화에 관하여(10만부 기념 스페셜 에디션, 저자 친필 사인 인쇄본)',
   '다정함의 과학 : 친절, 신뢰, 공감 속에 숨어 있는 건강과 행복의 비밀',
   '이토록 다정한 기술 : 지구와 이웃을 보듬는 아이디어(〈희망의 이유〉 사쉐 증정 (포인트 차감))',
   '무엇이 우리를 다정하게 만드는가 : 타인을 도우려 하는 인간 심리의 뇌과학적 비밀(양장)',
   'ADHD 2.0 : 산만하고 변덕스러운 ‘나’를 뛰어난 ‘창조자’로 바꾸는 특별한 여정!(포함 건강취미분야 2만원↑ 데일리 알약케이스 증정(택1, 포인트 차감))']]}
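
The distances in the hundreds suggest a squared Euclidean (L2) distance on unnormalized embeddings, which is understood to be Chroma's default metric; treat that as an assumption about the installed version. It would also explain why the order differs from the cosine ranking earlier (322 first there, 6 first here). A toy sketch with made-up two-dimensional vectors shows how the two metrics can disagree:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def squared_l2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

query = [1.0, 0.0]
# Hypothetical document vectors
docs = {
    'same direction, long': [3.0, 0.0],
    'nearby, different direction': [1.0, 0.5],
}

by_cosine = max(docs, key=lambda k: cosine(query, docs[k]))   # higher is closer
by_l2 = min(docs, key=lambda k: squared_l2(query, docs[k]))   # lower is closer
print(by_cosine)  # same direction, long
print(by_l2)      # nearby, different direction
```

If the vectors are normalized to unit length before indexing, the two rankings coincide.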

results = collection.query(
    query_texts=["다정한 것이 살아남는다"],
    n_results=5,
    where={"length": {'$lt': 75}},
    where_document={"$contains":"다정"}
)

results

{'ids': [['6', '322', '392', '111', '105']],
 'distances': [[305.27313232421875,
   332.410888671875,
   351.19512939453125,
   361.7748718261719,
   475.3115234375]],
 'metadatas': [[{'length': 71},
   {'length': 40},
   {'length': 54},
   {'length': 49},
   {'length': 35}]],
 'embeddings': None,
 'documents': [['다정한 것이 살아남는다 : 친화력으로 세상을 바꾸는 인류의 진화에 관하여(10만부 기념 스페셜 에디션, 저자 친필 사인 인쇄본)',
   '다정함의 과학 : 친절, 신뢰, 공감 속에 숨어 있는 건강과 행복의 비밀',
   '이토록 다정한 기술 : 지구와 이웃을 보듬는 아이디어(〈희망의 이유〉 사쉐 증정 (포인트 차감))',
   '무엇이 우리를 다정하게 만드는가 : 타인을 도우려 하는 인간 심리의 뇌과학적 비밀(양장)',
   '다정한 물리학 : 거대한 우주와 물질의 기원을 탐구하고 싶을 때']]}