shepherd Sinbad :: R 스터디 - 텍스트 마이닝 기초

2016. 11. 18. 14:55

R 스터디 - 텍스트 마이닝 기초 카테고리 없음2016. 11. 18. 14:55

.. .. ..

R을 활용한 텍스트 마이닝 기초_1

한글 텍스트 분석하기

1. 패키지 설치

install.packages("KoNLP")

2. 라이브러리 로드 (Java 필수 필요)

library(KoNLP)

3. 함수

- extractNoun (한글 명사 추출)

- txt1 = readLines("좋아하는 괄일.txt")

- txt2 = sapply(txt1.extractNoun, USE.NAMEs=F

- 사전 사용 방법

- useSejongDic()

- 사전에 새로운 단어 추가

- margeUserDic(data.frame("파인애플","ncn")) # ncn 은 명사,

- 자동 추가 방법

- mergeUserDic(data.frame(readLines("파일명","ncn"))

- 사전 파일 경로

- .libPaths() # 로 경로 확인. (KoNLP 경로 아래 있음)

- 함수 사용시 UTF-8 에러 처리

- e

- 밀정 테스트

예제 1. 영화 밀정의 리뷰를 분석하여 워드 클라우드 작성하기

################################################################

## 영화의 리뷰를 분석하여 워드 클라우드 그리기

## 영화 밀정 워드 클라우드 그리기

################################################################

setwd("c:\\a_temp")

#install.packages(“KoNLP”)

#install.packages(“wordcloud”)

#install.packages("stringr")

#library(KoNLP)

#library(wordcloud)

#library(stringr)

#Step 1. 분석 파일을 불러 옵니다

data1 <- readLines("영화_밀정.txt")

data1

#Step 2. 문장을 단어로 분리합니다.

data1 <- gsub(" ","-",data1)

data1 <- str_split(data1,"-")

data1 <- str_replace_all(unlist(data1),"[^[:alpha:][:blank:]]","")

tran1 <- Map(extractNoun, data1)

tran1

tran11 <- unique(tran1)

tran2 <- sapply(tran11, unique)

tran2

tran3 <- rapply(tran2, function(x) gsub("리뷰", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("영화", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("평점", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("내용", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("제외", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("ㅋㅋㅋ", "", x), how = "replace")

tran3 <- rapply(tran3, function(x) gsub("ㄱㄱㄱ", "", x), how = "replace")

tran3

tran4 <- sapply(tran3, function(x) {Filter(function(y) {nchar(y) <= 6 && nchar(y) > 1},x)} )

tran4

#Step 3. 공백을 제거하기 위해 저장 후 다시 불러 옵니다.

write(unlist(tran4),"밀정_2.txt")

data4 <- read.table("밀정_2.txt")

data4

nrow(data4)

#Step 4. 각 단어별로 집계하여 출현 빈도를 계산합니다 (1차 확인 단계)

wordcount <- table(data4)

wordcount

wordcount <- Filter(function(x) {nchar(x) <= 10} ,wordcount)

head(sort(wordcount, decreasing=T),100)

#Step 5. 필요없는 단어를 제거한 후 공백을 제거하고 다시 집계를 합니다.

# 이 과정을 여러 번 반복하여 필요 없는 단어들은 모두 제거해야 합니다.

txt <- readLines("영화gsub.txt")

txt

cnt_txt <- length(txt)

cnt_txt

for( i in 1:cnt_txt) {

tran3 <- rapply(tran3, function(x) gsub((txt[i]),"", x), how = "replace")

}

tran3

data3 <- sapply(tran3, function(x) {Filter(function(y) { nchar(y) >=2 },x)} )

write(unlist(data3),"밀_2.txt")

data4 <- read.table("밀_2.txt")

data4

nrow(data4)

wordcount <- table(data4)

wordcount

head(sort(wordcount, decreasing=T),100)

#Step 6. 전처리 작업이 모두 완료되면 워드 클라우드를 그립니다.

library(RColorBrewer)

palete <- brewer.pal(7,"Set2")

wordcloud(names(wordcount),freq=wordcount,scale=c(5,1),rot.per=0.25,min.freq=5,

random.order=F,random.color=T,colors=palete)

legend(0.3,1 ,"영화 댓글 분석 - 밀정 ",cex=1.2,fill=NA,border=NA,bg="white" ,

text.col="red",text.font=2,box.col="red")

savePlot("영화_밀정.png",type="png")

저작자표시 비영리 동일조건 (새창열림)

Posted by .07274.

달력

« 2025/7 »

R 스터디 - 텍스트 마이닝 기초 카테고리 없음2016. 11. 18. 14:55

shepherd Sinbad

카테고리

공지사항

태그목록

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

링크

티스토리툴바