Corpus type - Categories, Tagged, UDHR

import pandas as pd

from nltk.corpus import brown

#CategorizedTaggedCorpusReader

print(brown)

brown.fileids()[:5]

brown.words()

brown.tagged_words()

brown.categories()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

brown.words(categories='news')
brown.fileids(categories='news')[:5]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['ca01', 'ca02', 'ca03', 'ca04', 'ca05']
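
Each item returned by `brown.tagged_words()` is a `(word, tag)` tuple, so tag statistics come down to counting the second element. A minimal sketch, assuming a short hand-typed sample in place of the corpus reader; the `sample` list (including the tags after the first two pairs) and the `count_tags` helper are illustrative and not part of NLTK:

```python
from collections import Counter

# Hand-typed sample in the same (word, tag) shape that brown.tagged_words() yields.
# With NLTK and the Brown corpus installed, brown.tagged_words(categories='news')
# could be passed in instead.
sample = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
    ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
]

def count_tags(tagged_words):
    """Count how often each tag occurs in an iterable of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words)

tag_counts = count_tags(sample)
print(tag_counts.most_common(3))
```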

Comparing vocabulary style across categories

news_tokens = brown.words(categories='news')
fiction_tokens = brown.words(categories='fiction')
news_tokens_series = pd.Series(news_tokens)
news_tokens_series[:5]
news_tokens_series.value_counts()[:10]
the    5580
,      5188
.      4030
of     2849
and    2146
to     2116
a      1993
in     1893
for     943
The     806
dtype: int64

Selecting word tokens only

# pandas Series.str: when the values are strings, .str gives access to each string so that string methods can be applied element-wise
news_tokens_series = news_tokens_series.str.lower()
# Keep only tokens made up entirely of alphabetic characters
isAlpha = news_tokens_series.str.isalpha()
news_words_series = news_tokens_series[isAlpha]
print(news_words_series[:10])
print('\n')
print(news_words_series.value_counts()[:10])
news_words_series_count = news_words_series.value_counts()
0              the
1           fulton
2           county
3            grand
4             jury
5             said
6           friday
7               an
8    investigation
9               of
dtype: object

the     6386
of      2861
and     2186
to      2144
a       2130
in      2020
for      969
that     829
is       733
was      717
dtype: int64
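
`str.isalpha()` is stricter than it may look: it rejects any token that contains a digit, hyphen, or apostrophe, so hyphenated words and contractions drop out along with the punctuation. A small sketch in plain Python, assuming a made-up token list (the tokens and the `keep_words` helper are illustrative only):

```python
def keep_words(tokens):
    """Lower-case the tokens and keep only the purely alphabetic ones."""
    return [t.lower() for t in tokens if t.isalpha()]

# Made-up tokens, not taken from the corpus
tokens = ['The', 'jury', ',', 'said', "didn't", '1961', 'state-owned', 'the', '.']

kept = keep_words(tokens)
print(kept)
```

If contractions or hyphenated words matter for the comparison, a looser filter would be needed in place of `isalpha()`.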

# Same steps as above, applied to the fiction category

fiction_tokens_series = pd.Series(fiction_tokens)

fiction_tokens_series = fiction_tokens_series.str.lower()

isAlpha = fiction_tokens_series.str.isalpha()

fiction_words_series = fiction_tokens_series[isAlpha]

fiction_words_series.value_counts()[:10]

fiction_words_series_count = fiction_words_series.value_counts()

# Words of interest

words = ['can', 'could', 'may']

# Raw counts of the words of interest

print(news_words_series_count[words])

print(fiction_words_series_count[words])

# Relative frequencies of the words of interest

news_words_rate = news_words_series_count / news_words_series_count.sum()

fiction_words_rate = fiction_words_series_count / fiction_words_series_count.sum()

print(news_words_rate[words])

print(fiction_words_rate[words])

can      94
could    87
may      93
dtype: int64
can       39
could    168
may       10
dtype: int64
can      0.001125
could    0.001041
may      0.001113
dtype: float64
can      0.000683
could    0.002943
may      0.000175
dtype: float64
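
Raw counts are not comparable across categories of different sizes, which is why the rates are the figures to compare. As a rough illustration, the printed rates can be turned into fiction-to-news ratios. The numbers below are copied from the output above, and `ratio` is just an illustrative name:

```python
# Relative frequencies copied from the printed output above
news_rate = {'can': 0.001125, 'could': 0.001041, 'may': 0.001113}
fiction_rate = {'can': 0.000683, 'could': 0.002943, 'may': 0.000175}

# A ratio above 1 means the word is relatively more frequent in fiction
ratio = {w: fiction_rate[w] / news_rate[w] for w in news_rate}

for w, r in ratio.items():
    print(f'{w}: {r:.2f}')
```

On these figures `could` is roughly 2.8 times as frequent in fiction as in news, while `may` is far rarer there.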

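# The same steps wrapped in a function

def calWordRate(CorpusReader, cat):
    """Return the relative frequency of each alphabetic word in one category."""
    tokens = CorpusReader.words(categories=cat)
    tokens_series = pd.Series(tokens)
    # Keep purely alphabetic tokens, then fold case
    isAlpha = tokens_series.str.isalpha()
    words_series = tokens_series[isAlpha].str.lower()
    # Divide each count by the total so that the rates sum to 1
    words_count = words_series.value_counts()
    words_rate = words_count / words_count.sum()
    return words_rate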

news_words_rate = calWordRate(brown,'news')
news_words_rate[:10]

the     0.076422
of      0.034238
and     0.026160
to      0.025658
a       0.025490
in      0.024174
for     0.011596
that    0.009921
is      0.008772
was     0.008580
dtype: float64
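
To sanity-check a function like `calWordRate` without loading the corpus, the same filter, count, and normalize steps can be written in plain Python and run on a handful of tokens. A minimal sketch; `word_rates` and its input are illustrative and not part of the original code:

```python
from collections import Counter

def word_rates(tokens):
    """Relative frequency of each lower-cased alphabetic token."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    total = sum(counts.values())
    # most_common() orders the words from most to least frequent
    return {w: c / total for w, c in counts.most_common()}

rates = word_rates(['The', 'jury', 'said', 'the', 'jury', 'met', '.', 'The'])
print(rates)
```

Because every count is divided by the same total, the rates of all words add up to 1, which is a quick check that the normalization is right.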

words_rate_for_category = {}

for cat in brown.categories():
    words_rate = calWordRate(brown, cat)
    words_rate_for_category[cat] = words_rate

#print(words_rate_for_category)

pd.DataFrame(words_rate_for_category).T

                        a        aa       aaa    aaawww       aah     aaron        ab     aback   abandon  abandoned  ...    zoooop
adventure        0.025274       NaN       NaN  0.000018  0.000018  0.000035       NaN  0.000018  0.000018   0.000018  ...       NaN
belles_lettres   0.023154       NaN       NaN       NaN       NaN  0.000007       NaN       NaN  0.000047   0.000027  ...       NaN
editorial        0.022041       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.000019   0.000057  ...       NaN
fiction          0.023456       NaN       NaN       NaN       NaN  0.000018       NaN       NaN       NaN   0.000018  ...       NaN
government       0.016194       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.000017        NaN  ...       NaN
hobbies          0.026620       NaN  0.000014       NaN       NaN       NaN       NaN       NaN  0.000014   0.000014  ...  0.000029
humor            0.029422       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN        NaN  ...       NaN
learned          0.022052       NaN       NaN       NaN       NaN  0.000006  0.000013       NaN       NaN   0.000006  ...       NaN
lore             0.025607       NaN       NaN       NaN       NaN  0.000011       NaN       NaN  0.000021   0.000053  ...       NaN
mystery          0.025661       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.000043   0.000043  ...       NaN
news             0.025490       NaN       NaN       NaN       NaN  0.000012       NaN       NaN       NaN   0.000036  ...       NaN
religion         0.020437       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.000059   0.000059  ...       NaN
reviews          0.027266       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN   0.000059  ...       NaN
romance          0.024324  0.000018       NaN       NaN       NaN       NaN       NaN  0.000018       NaN        NaN  ...       NaN
science_fiction  0.020065       NaN       NaN       NaN       NaN       NaN  0.000085       NaN       NaN        NaN  ...
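
The `NaN` cells appear because each category has its own vocabulary: when the per-category Series are combined, pandas aligns them on the union of all words and leaves a gap wherever a word never occurs in a category. A plain-Python sketch of that alignment, assuming two tiny made-up rate dictionaries (all names and numbers are illustrative). Treating a missing word as rate 0.0 is one possible choice, comparable to calling `fillna(0)` on the table:

```python
# Made-up relative frequencies for two categories
rates = {
    'news': {'the': 0.5, 'jury': 0.5},
    'fiction': {'the': 0.75, 'aah': 0.25},
}

# Union of all vocabularies, sorted like the columns of the table
vocab = sorted(set().union(*rates.values()))

# One row per category; a word absent from a category gets 0.0 instead of a gap
table = {cat: [r.get(w, 0.0) for w in vocab] for cat, r in rates.items()}

print(vocab)
for cat, row in table.items():
    print(cat, row)
```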

words_rate_for_category_table = pd.DataFrame(words_rate_for_category).T

words_rate_for_category_table[words].plot(kind='barh', stacked=True)

from nltk.corpus import udhr

udhr

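# Universal Declaration of Human Rights: the same content written in many different languages, so it can serve as source material for machine translation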

udhr.fileids()[147:148]

fid_korean = udhr.fileids()[147]

print(udhr.raw(fid_korean)[:100])

세 계 인 권 선 언

전 문 

모든 인류 구성원의 천부의 존엄성과 동등하고 양도할 수 없는 권리를 인정하는 것이 세계의 자유 , 정의 및 평화의 기초이며 , 

인권에 대한 무
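
The index `147` depends on the ordering of the file list in the installed copy of the corpus, so it may not point at the Korean text on another machine. Selecting by name is a sturdier alternative. A minimal sketch, assuming the file ids contain the language name; the `find_fileids` helper and the sample list are illustrative:

```python
def find_fileids(fileids, keyword):
    """Return the file ids whose name contains keyword, ignoring case."""
    keyword = keyword.lower()
    return [fid for fid in fileids if keyword in fid.lower()]

# Sample names for illustration; with NLTK installed, pass udhr.fileids() instead
sample_fileids = ['English-Latin1', 'Japanese_Nihongo-UTF8', 'Korean_Hankuko-UTF8']

matches = find_fileids(sample_fileids, 'korean')
print(matches)
```

The first match can then be passed to `udhr.raw()` in place of the hard-coded index.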

