Python Natural Language Processing Basics (NLTK)

Image source: Introduction to Natural Language Processing lecture materials (이성주)

PyPI: https://pypi.org/project/nltk/

Seoul National University (KoNLPy): http://konlpy.org/en/latest/ (GPL v3 open-source license)

Google: https://cloud.google.com/natural-language/

import nltk

nltk.download()  # opens the interactive downloader; nltk.download('gutenberg') fetches a single package

from nltk.corpus import gutenberg

gutenberg.fileids()  # list of files in the corpus

# Corpus reader provided by NLTK

raw_text = gutenberg.raw('austen-emma.txt')

print(raw_text[:100])

# Without the raw reader, you have to keep handling file paths yourself, as below

import os

os.path.join(gutenberg.root, 'austen-emma.txt')  # corpus root joined with a file ID

[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a
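To make the point about path handling concrete, here is a minimal sketch of what reading a corpus file by hand looks like when you do not use the raw reader. It assumes the corpus is stored as plain text files under one directory; the helper name `read_raw` and the temporary directory standing in for the corpus root are hypothetical, not part of NLTK.

```python
import os
import tempfile


def read_raw(root, fileid, encoding="utf-8"):
    """Join the corpus root and the file ID by hand, then read the whole file."""
    path = os.path.join(root, fileid)
    with open(path, encoding=encoding) as f:
        return f.read()


# A temporary directory stands in for the corpus root (hypothetical sample data).
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "austen-emma.txt"), "w", encoding="utf-8") as f:
        f.write("[Emma by Jane Austen 1816]\n\nVOLUME I\n")
    sample = read_raw(root, "austen-emma.txt")

print(sample[:26])
```

With the corpus reader, the path join, the open call, and the encoding choice all collapse into a single `gutenberg.raw(fileid)` call.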

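# Tokenization with the words() method; tokens are words, punctuation marks, numbers, ...

field_1 = gutenberg.fileids()  # every file ID, so words() below reads tokens from the whole corpus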

token = gutenberg.words(field_1)

print(token[:10])

# Using the tokenizer function

from nltk.tokenize import word_tokenize

print(word_tokenize(raw_text)[:10])

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
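The ten tokens printed above already contain all three kinds of token mentioned earlier: words, punctuation, and numbers. The sketch below sorts them into those buckets. The function `token_kind` is a hypothetical helper written for this illustration, and its rules are a simplification (anything that is neither purely alphabetic nor purely numeric is lumped together).

```python
from collections import Counter


def token_kind(token):
    """Classify a token with simple string checks (a simplification)."""
    if token.isalpha():
        return "word"
    if token.isdigit():
        return "number"
    if not any(ch.isalnum() for ch in token):
        return "punctuation"
    return "mixed"  # e.g. tokens that combine letters and digits


# The first ten tokens of austen-emma.txt, copied from the output above.
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

kinds = Counter(token_kind(t) for t in tokens)
print(kinds)
```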

koreanTest = '''NLTK로 한국어가 토큰화 될 수 있을까요? 
된다고 하던데...'''
print(word_tokenize(koreanTest))

['NLTK로', '한국어가', '토큰화', '될', '수', '있을까요', '?', '된다고', '하던데', '...']
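The output above shows that the Korean text was only split at whitespace and punctuation: particles stay attached to their nouns ("NLTK로", "한국어가"), so this is not morpheme-level analysis of the kind KoNLPy is built for. As a rough illustration, the same token list can be reproduced with a plain regular expression. This is a stand-in written for this post, not how `word_tokenize` is actually implemented, and the name `simple_tokenize` is hypothetical.

```python
import re

# An ellipsis, a run of word characters, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\.\.\.|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text at whitespace and punctuation only (no morphological analysis)."""
    return TOKEN_PATTERN.findall(text)


korean_test = """NLTK로 한국어가 토큰화 될 수 있을까요?
된다고 하던데..."""

tokens = simple_tokenize(korean_test)
print(tokens)
```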

from nltk import book

print(book.text1)

# The long-form equivalent is the code below

gutenberg.fileids()

tokens_moby_dick = gutenberg.words('melville-moby_dick.txt')

text_moby_dick = nltk.Text(tokens_moby_dick)

print(text_moby_dick)

<Text: Moby Dick by Herman Melville 1851>
<Text: Moby Dick by Herman Melville 1851>

# Find similar words
book.text1.similar('monstrous')

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
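The words listed above are not synonyms of "monstrous"; they are words that appear in the same surrounding contexts. The sketch below shows the general idea on a toy sentence: record the (previous word, next word) pair around every token, then rank other words by how many of the target's contexts they share. It is a simplified illustration under that assumption, not NLTK's actual code, and `similar_words` is a hypothetical name.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=5):
    """Rank words that share (previous word, next word) contexts with `word`."""
    tokens = [t.lower() for t in tokens]
    word = word.lower()

    # Map each word to the contexts it occurs in.
    contexts_of = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "*START*"
        right = tokens[i + 1] if i < len(tokens) - 1 else "*END*"
        contexts_of[tok][(left, right)] += 1

    target_contexts = set(contexts_of[word])

    # Score every other word by its occurrences inside the target's contexts.
    scores = Counter()
    for other, contexts in contexts_of.items():
        if other == word:
            continue
        shared = sum(n for ctx, n in contexts.items() if ctx in target_contexts)
        if shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# Toy data invented for this example.
toy_tokens = ("a monstrous whale and a curious whale met "
              "a monstrous shark and a curious shark").split()

result = similar_words(toy_tokens, "monstrous")
print(result)
```

In the toy sentence, "curious" is the only word that also occurs between "a" and "whale" and between "a" and "shark", so it is the only match. This is why the choice of corpus matters: the same query returns different words on Moby Dick alone than on the whole Gutenberg collection.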

tokens_all = gutenberg.words()

text_gutenberg = nltk.Text(tokens_all)

text_gutenberg.similar('monstrous')

text_gutenberg.similar('happy')

very great so strange mighty one good terrible the real first old true
evil right solemn foolish extremely wise happy
that good and well there all long much now so little great said as not
then here in he right

License: Attribution-NonCommercial-NoDerivatives
