[R]1. Document similarity: tdm, cosine similarity

#cosine_similarity #cosine_distance #tdm #dtm #document_similarity #data_mining

[Research question]

By analyzing the frequency of the words in each document, I want to find out how similar the documents are to one another (word order does not matter).

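[Concept]

Cosine similarity: the degree of similarity between two vectors in an inner product space, measured by the cosine of the angle between them.

In information retrieval and text mining, it is a very useful way to measure how similar two documents are.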

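Each distinct word forms its own dimension, and a document is represented as a vector whose components are the number of times each word appears in that document.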
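As a minimal sketch of this representation in base R (the two toy sentences and the helper name `count_vector` are assumptions made for illustration, not part of the original script):

```r
# Two toy documents (hypothetical examples)
docs <- c("men jacket jacket necktie", "men jacket fashion")

# Split on whitespace to get the tokens of each document
tokens <- strsplit(docs, "\\s+")

# The vocabulary defines the dimensions: one dimension per distinct word
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_vector <- function(tok, vocab) {
  as.integer(table(factor(tok, levels = vocab)))
}

# Rows are words, columns are documents (same orientation as a term-document matrix)
m <- sapply(tokens, count_vector, vocab = vocab)
rownames(m) <- vocab
colnames(m) <- c("doc1", "doc2")
m
```

Here "jacket" occurs twice in the first document, so its component in that column is 2, while words a document does not contain get 0.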

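The value is 1 when the two vectors point in exactly the same direction, 0 when they are 90° apart, and -1 when they point in exactly opposite directions. In information retrieval, however, word frequencies in a document cannot be negative, so the value lies between 0 and 1.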
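The definition can be checked by hand: cosine similarity is the dot product of the two vectors divided by the product of their lengths. A small base-R sketch, where the function name `cosine_similarity` and the example vectors are assumptions chosen for illustration:

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_similarity(c(1, 2), c(2, 4))    # same direction
cosine_similarity(c(1, 0), c(0, 1))    # 90 degrees apart
cosine_similarity(c(1, 2), c(-1, -2))  # opposite direction

# Word counts are never negative, so the result stays within [0, 1]
cosine_similarity(c(0, 2, 1, 1), c(1, 1, 1, 0))
```

The first three calls cover the 1, 0, and -1 cases (up to floating-point rounding); the last one uses two count vectors and gives about 0.71.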

https://ko.wikipedia.org/wiki/%EC%BD%94%EC%82%AC%EC%9D%B8_%EC%9C%A0%EC%82%AC%EB%8F%84

[Required packages]

tm, proxy

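[Code] *In the original post the script is shown in blue and the output in black; below, output appears on lines starting with #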

text <- c("clothing men one jacket portrait necktie adult people outerwear facial_expression menswear confidence outfit business fashion stylish neckwear trendy blazer indoors",

"clothing men one jacket portrait adult necktie confidence facial_expression people outerwear menswear business outfit stylish fashion indoors trendy women music", "clothing men one jacket portrait necktie adult people outerwear menswear confidence facial_expression outfit business fashion stylish indoors blazer shadow neckwear", "clothing men one jacket portrait necktie adult people facial_expression outerwear menswear confidence outfit business blazer neckwear stylish fashion indoors actor ",

"clothing men one jacket portrait adult people outerwear necktie facial_expression menswear confidence outfit business fashion blazer stylish music indoors actor")

library(tm)

view <- factor(rep(c("view 1"), each=5))

df <- data.frame(text, view, stringsAsFactors=FALSE)

corpus <- Corpus(VectorSource(df$text))

#corpus <- tm_map(corpus, content_transformer(tolower))

#corpus <- tm_map(corpus, removePunctuation)

#corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))

#corpus <- tm_map(corpus, stemDocument, language = "english")

corpus

tdm <- TermDocumentMatrix(corpus)

inspect(tdm)

tdm <- as.matrix(tdm)

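# Install proxy only if it is not already available
if (!requireNamespace("proxy", quietly = TRUE)) install.packages("proxy")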

library(proxy)

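# Cosine distance (1 - cosine similarity) between documents; t() puts documents in rows
cosine_dist_mat <- as.matrix(proxy::dist(t(tdm), method = "cosine"))

# Ignore each document's zero distance to itself
diag(cosine_dist_mat) <- NA

# Mean distance from each document to the other four
cosine_dist <- apply(cosine_dist_mat, 2, mean, na.rm=TRUE)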

cosine_dist

#> cosine_dist

# 1 2 3 4 5

#0.0750 0.1250 0.0875 0.0750 0.0875

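# ave() without a grouping variable repeats the overall mean for every element;
# mean(cosine_dist) would return the same value once
ave(cosine_dist)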

#> ave(cosine_dist)

# 1 2 3 4 5

#0.09 0.09 0.09 0.09 0.09
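The values above are distances, not similarities: with `method = "cosine"`, the `dist()` function from proxy reports one minus the cosine similarity, so smaller numbers mean more similar documents. A rough base-R cross-check of that calculation, assuming a small made-up term-document matrix `m` (rows are terms, columns are documents) rather than the `tdm` from the script above:

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
m <- matrix(c(1, 1, 1, 0,
              1, 1, 0, 1,
              1, 0, 1, 1),
            nrow = 4,
            dimnames = list(c("men", "jacket", "blazer", "music"),
                            c("1", "2", "3")))

# Dot products between every pair of document columns
dots <- crossprod(m)

# Euclidean length of each document vector
norms <- sqrt(diag(dots))

# Cosine similarity and the corresponding distance (1 - similarity)
cosine_sim_mat  <- dots / outer(norms, norms)
cosine_dist_mat <- 1 - cosine_sim_mat

# Average distance from each document to the others, as in the script above
diag(cosine_dist_mat) <- NA
colMeans(cosine_dist_mat, na.rm = TRUE)
```

Each pair of toy documents shares two of its three words, so every similarity is 2/3 and every average distance is 1/3. To report similarity instead of distance for the real data, subtracting the distance matrix from 1 should be enough.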

License: Attribution, NonCommercial, NoDerivatives
