Search Engine Translation: Basic Principles and Importance (Post)


Let's take a look at a search engine translation (Korean-to-English translation)

Search Engine Translation (Korean Original)

현재 검색 엔진에서는 연관 검색어와 같은 검색어 확장(Search query expansion)을 지원하지만 많은 사용자 들이 같이 입력한 단어를 연관 단어로 추천하는 것이다. 이것을 개인화 검색(Personalized search)이라 할 수 없다.
연관 규칙(Association Rule) 마이닝(mining)에서 용어간의 연관관계(association relationship)는 항목(item set)들의 동시 발생(co-occurrence)을 이용하여 하나의 트랜잭션에서 연관단어(Association Word)들을 찾는다. 트랜잭션은 한 단락이나 한문장을 하나의 트랜잭션으로 사용 한다.
남은 항목(item set)들은 단어들의 포함관계를 이용하여 신뢰도(Confidence)를 측정한다. 신뢰도(Confidence)는 단어 A가 발생될 때 단어 B가 발생될 비율이다. 이 신뢰도(Confidence)가 높을수록 두 단어의 연관 관계가 크다. 검색 쿼리(Search query)와 신뢰도(Confidence)가 높은 단어는 검색 쿼리와 밀접한 관련성을 가지는 것이다.
이러한 이유로 개인의 프로파일 데이터를 사용하지 않고 사용자가 방문하는 웹문서(webpage)에서 주로 발생되는 단어들을 분석하여 질의 확장(query expansion)에 사용가능한 단어들을 추출하는 방법을 제시한다.
검색쿼리(Search query)와 검색쿼리의 연관 단어가 총 2개라면 그림 2와 같이 나타낸다. 그림 2의 쿼리는 검색 쿼리와 검색 쿼리의 연관 단어이다. 좌표에서 차원(Dimension)의 개수는 그림 2와 같이 단어의 개수와 동일하다. 검색쿼리의 연관 단어들은 사용자가 방문한 문서(webpage)들에서 아프리오리(Apriori)를 사용하여 찾아낸다. 문서(webpage)는 추출(extraction)한 단어와 그 단어의 좌표를 이용하여 벡터로 표현한다. 문서(webpage)가 d이며, 문서(webpage)에서 발생되는 단어가 t1, t2,…tn이고 t1이 문서(webpage)에서 차지하는 비중이 n(t1)이라면 d={(t1,n(t1)),(t2,n(t2)),…,(tn,n(tn))}으로 표현한다. 벡터 값을 이용하여 문서(webpage)들을 좌표로 나타내고 최근 접점 탐색 알고리즘(nearest neighbor search algorithm)으로 분류한다. 최근 접점 알고리즘으로는 ANN(Approximate Nearest Neighbor)을 이용한다.[14] 분류된 문서(webpage)들은 그림 2 처럼 군집을 이룬다. 각 문서(webpage)에서 아프리오리(Apriori)로 단어를 추출 할 때 얻은 단어들의 정보를 이용하여 군집을 대표하는 단어들을 추출 한다. 대표 단어들은 군집의 문서(webpage)들에서 단어의 정보를 종합하여 아프리오리(Apriori)를 사용하여 추출된 연관 단어(Association Word)들이다.
수식 1을 이용하여 문서(webpage)에서 단어가 가지는 비중을 각각 측정하게 된다.
이렇게 사용자가 방문한 문서(webpage)는 검색 쿼리(Search query)와 쿼리(query)의 연관단어(Association Word)들에 의해 분류된다. 분류된 각 군집의 문서(webpage)들은 같은 연관 단어를 가지고 연관단어의 비중이 유사하다. 즉, 내용이 유사한 문서(webpage)들이 군집(class)을 이루게 되는 것이다.
현재시간 ti에서 가중치가 Wp일 때 이전 Wp의 값을 기반으로 시간당 상승한 인기도(population)를 이용하여 변화 폭을 측정한다. 측정한 변화 폭을 이용하여 가중치를 재조정하는 것이 수식 2의 역할이다.
TF-IDF는 문서(webpage)에서 각 단어들의 가중치(weight)를 수식 5를 사용해서 측정한다. 각 단어는 자신이 발생된 문서(webpage)마다 가중치가 다르다. 실험에서는 사용자가 여러 개의 문서(webpage)를 방문하기 때문에 각 문서(webpage)에서 단어들의 TF-IDF 가중치를 측정하여 합산한다. 예를 들어, 단어 W가 문서(webpage) d1,d3에서 출현하였다면 단어 W의 가중치는 각 문서(webpage)마다 다르다. 실험에서는 문서(webpage) d1에서 단어 w의 가중치와 d3에서의 단어 w의 가중치를 합하여 모든 문서(webpage)들에서 단어 w의 가중치로 사용한다. 단어들을 가중치별로 정렬하고 상위 N개의 단어를 순위별로 사용자에게 추천한다. 표 3과 4의 실험에 사용되는 아프리오리(Apriori)는 3.1.절의 설명과 같이 단어를 추출 한다. 추출한 단어의 신뢰도(Confidence) 값을 이용하여 단어를 정렬하고 신뢰도(Confidence)가 높은 단어 N개를 정렬된 순으로 추천한다.

Search Engine Translation (English Translation)

Current search engines support search query expansion, for example through related search terms, but they simply recommend words that many users have entered together as related words. This cannot be called personalized search.
In association rule mining, association relationships between terms are identified by using the co-occurrence of item sets to find association words within a single transaction. A paragraph or a sentence is used as one transaction.
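As a rough illustration of the transaction idea, the sketch below treats each sentence as one transaction and counts how often pairs of words co-occur. The function names, the naive period-based sentence splitting, and the sample text are assumptions made for this example and do not come from the original paper.

```python
from collections import Counter
from itertools import combinations


def sentence_transactions(text):
    """Split text on periods and treat each sentence as one transaction (a set of words)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [set(s.lower().split()) for s in sentences]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of words."""
    counts = Counter()
    for words in transactions:
        # Sorting makes each pair a canonical (alphabetical) tuple.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample text, for illustration only.
text = "Search engines expand queries. Search engines rank pages. Users visit pages."
transactions = sentence_transactions(text)
counts = pair_counts(transactions)
```

Pairs that co-occur in many transactions, such as ("engines", "search") here, would be the candidates an Apriori-style pass keeps for the next step.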
For the remaining item sets, confidence is measured using the inclusion relationships among the words. Confidence is the proportion of cases in which word B occurs when word A occurs. The higher the confidence, the stronger the association between the two words. A word that has high confidence with respect to the search query is therefore closely related to that query.
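The confidence measure described here can be sketched in a few lines. This is a minimal example that assumes transactions are plain sets of words; the function name and the sample data are hypothetical.

```python
def confidence(transactions, a, b):
    """Estimate confidence(A -> B): the share of transactions containing A that also contain B."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        # A never occurs, so no confidence can be estimated; 0.0 is a convention chosen here.
        return 0.0
    with_both = [t for t in with_a if b in t]
    return len(with_both) / len(with_a)


# Hypothetical transactions (one set of words per sentence or paragraph).
transactions = [
    {"search", "engine", "query"},
    {"search", "query", "expansion"},
    {"search", "engine"},
    {"translation", "engine"},
]

query_to_search = confidence(transactions, "query", "search")
search_to_query = confidence(transactions, "search", "query")
```

Note that confidence is not symmetric: in this sample every transaction containing "query" also contains "search", but the reverse does not hold.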
For these reasons, we propose a method that does not rely on personal profile data. Instead, it analyzes the words that occur most often in the webpages the user visits and extracts the words that can be used for query expansion.
If the search query and its association words total two words, they are represented as in Fig. 2. The queries in Fig. 2 are the search query and its association words. As Fig. 2 shows, the number of dimensions of the coordinate system equals the number of words. The association words of the search query are found by applying the Apriori algorithm to the webpages visited by the user. A webpage is expressed as a vector built from the extracted words and their coordinates. If a webpage is d, the words occurring in it are t1, t2,…tn, and the weight of t1 in the webpage is n(t1), then the webpage is expressed as d={(t1,n(t1)),(t2,n(t2)),…,(tn,n(tn))}. Using the vector values, the webpages are plotted as coordinates and classified with a nearest neighbor search algorithm; Approximate Nearest Neighbor (ANN) is used for this purpose [14]. As shown in Fig. 2, the classified webpages form clusters. The words that represent each cluster are extracted from the word information obtained when Apriori was applied to each webpage. That is, the representative words are the association words that Apriori extracts from the combined word information of the webpages in the cluster.
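The vector representation and the nearest neighbor step can be sketched as follows. This example uses an exact, brute-force nearest neighbor search with Euclidean distance rather than the ANN method the paper cites, and the vocabulary, weights, and function names are all made up for illustration.

```python
import math


def to_vector(weights, vocabulary):
    """Turn d = {(t1, n(t1)), ...} into coordinates, one dimension per vocabulary word."""
    return [weights.get(term, 0.0) for term in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query_vec, labelled_vecs):
    """Return the label of the closest vector (exact search, not approximate)."""
    return min(labelled_vecs, key=lambda item: euclidean(query_vec, item[1]))[0]


# Hypothetical vocabulary: the search query plus its association words.
vocabulary = ["jaguar", "car", "animal"]
pages = {
    "d1": {"jaguar": 0.5, "car": 0.4},
    "d2": {"jaguar": 0.4, "animal": 0.5},
}
vectors = [(name, to_vector(w, vocabulary)) for name, w in pages.items()]

new_page = to_vector({"jaguar": 0.6, "car": 0.3}, vocabulary)
closest = nearest(new_page, vectors)
```

A page about the car ends up next to d1 rather than d2, which is the sense in which pages with the same association words and similar weights fall into the same cluster.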
The weight of each word in a webpage is measured using Eq. 1. In this way, the webpages visited by the user are classified by the search query and the association words of that query. The webpages in each resulting cluster share the same association words, and those words carry similar weights. In other words, webpages with similar content form a cluster.
Suppose the weight is Wp at the current time ti. The amount of change is measured from the popularity gained per unit time, relative to the previous value of Wp. Eq. 2 uses this measured change to readjust the weight.
With TF-IDF, the weight of each word in a webpage is measured using Eq. 5. A word has a different weight in each webpage in which it occurs. Because the user visits multiple webpages in the experiment, the TF-IDF weights of the words are measured in each webpage and then summed. For example, if a word W appears in webpages d1 and d3, its weight differs between the two. In the experiment, the weight of W in d1 and the weight of W in d3 are added together and used as the weight of W across all webpages. The words are then sorted by weight, and the top N words are recommended to the user in rank order. The Apriori algorithm used in the experiments of Tables 3 and 4 extracts words as described in Section 3.1. The extracted words are sorted by confidence, and the N words with the highest confidence are recommended in sorted order.
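The summing and ranking step can be sketched as below. Eq. 5 is not reproduced in this excerpt, so the example assumes a common TF-IDF form (term frequency times log(N / document frequency)); the paper's actual formula may differ. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def summed_tfidf(documents):
    """Sum each word's per-document TF-IDF weight across all visited documents."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    totals = defaultdict(float)
    for doc in documents:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])  # assumed IDF form
            totals[term] += tf * idf
    return dict(totals)


def top_n(scores, n):
    """Return the n highest-weighted words, breaking ties alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


# Hypothetical visited pages, already tokenized.
documents = [
    ["search", "query", "query", "expansion"],
    ["search", "engine"],
    ["search", "query", "ranking"],
]
scores = summed_tfidf(documents)
recommended = top_n(scores, 2)
```

With this IDF form, a word that appears in every visited page, such as "search" here, gets a weight of zero and is never recommended.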

That concludes our look at part of a search engine translation (Korean-to-English) commissioned by Yonsei University.

For translation, 기버 번역