Regular Expressions

str.find() locates an exact substring, while a regular expression locates any text that fits a pattern.
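The difference can be sketched on a made-up sentence (the variable `sentence` is an assumption for illustration, not part of the original data): `str.find()` needs the exact text, while `re.search()` and `re.findall()` only need a description of its shape.

```python
import re

# Hypothetical sample text, used only for illustration.
sentence = 'Turing was born in 1912 and died in 1954.'

# str.find() returns the index of an exact substring, or -1 if it is absent.
exact_index = sentence.find('1912')

# re.search() finds the first span that fits the pattern (any four digits).
first_year = re.search(r'\d{4}', sentence).group()

# re.findall() returns every non-overlapping match.
all_years = re.findall(r'\d{4}', sentence)

print(exact_index, first_year, all_years)
```

With `str.find()` you would have to know each year in advance; the pattern `\d{4}` finds both without naming either.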

with open('wiki/turing.txt', encoding='utf-8') as fp:
    turing = fp.read()

print(turing)

앨런 매티슨 튜링(영어: Alan Mathison Turing, OBE, FRS, 1912년 6월 23일 ~ 1954년 6월 7일)은 영국의 수학자, 암호학자, 논리학자이자 컴퓨터 과학의 선구적 인물이다. 알고리즘과 계산 개념을 튜링 기계라는 추상 모델을 통해 형식화함으로써 컴퓨터 과학의 발전에 지대한 공헌을 했다.[2][3][4] 튜링 테스트의 고안으로도 유명하다. ACM에서 컴퓨터 과학에 중요한 업적을 남긴 사람들에게 매년 시상하는 튜링상은 그의 이름을 따 제정한 것이다. 이론 컴퓨터 과학과 인공지능 분야에 지대한 공헌을 했기 때문에 "컴퓨터 과학의 아버지"라고 불린다.

1945년에 그가 고안한 튜링 머신은 초보적 형태의 컴퓨터로, 복잡한 계산과 논리 문제를 처리할 수 있었다. 하지만 튜링은 1952년에 당시에는 범죄로 취급되던 동성애 혐의로 영국 경찰에 체포돼 유죄 판결을 받았다.[5] 감옥에 가는 대신 화학적 거세를 받아야 했던 그는, 2년 뒤 청산가리를 넣은 사과를 먹고 자살했다.[5]

사후 59년만인 2013년 12월 24일에 엘리자베스 2세 여왕이 크리스 그레일링 법무부 장관의 건의를 받아들여 튜링의 동성애 죄를 사면하였다. 이어서 무죄 판결을 받고 복권되었다.[5]

Extracting dates of the form yyyy년 mm월 dd일

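import re
# \d{4} matches exactly four digits; \d{2} requires a two-digit month and day
date_pattern = r'\d{4}년 \d{2}월 \d{2}일'
print(re.findall(date_pattern, turing))
# {1,2} accepts one or two digits, so single-digit months and days also match
date_pattern = r'\d{4}년 \d{1,2}월 \d{1,2}일'
print(re.findall(date_pattern, turing))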

['2013년 12월 24일']
['1912년 6월 23일', '1954년 6월 7일', '2013년 12월 24일']
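Once the dates are found, capture groups can split each match into its parts. The following is a minimal sketch, assuming the goal is to turn the matches into `datetime.date` objects; the `sample` string is copied from the dates in the text above, and `parsed_dates` is a name introduced here for illustration.

```python
import re
from datetime import date

# Sample string built from two of the dates in the text above.
sample = '1912년 6월 23일 ~ 1954년 6월 7일'

# Parentheses capture the year, month, and day separately.
date_pattern = r'(\d{4})년 (\d{1,2})월 (\d{1,2})일'

# With groups in the pattern, re.findall() returns a list of tuples of strings.
parsed_dates = [date(int(y), int(m), int(d))
                for y, m, d in re.findall(date_pattern, sample)]
print(parsed_dates)
```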

Extracting words that end in -ed
from nltk.corpus import words
en_words = words.words()
pattern = 'ed$'
target_word_list = []
for word in en_words:
    if re.search(pattern,word):
        target_word_list.append(word)
target_word_list[:5]

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised']
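The loop above can also be written as a list comprehension over a pre-compiled pattern. This is only a sketch: `sample_words` is a tiny hypothetical stand-in for the NLTK word list so that the snippet runs without downloading the corpus.

```python
import re

# Hypothetical stand-in for nltk's word list, kept tiny so the example runs anywhere.
sample_words = ['abandon', 'abandoned', 'abased', 'bed', 'editor', 'red']

# Compile once when the same pattern is applied to many strings.
ed_suffix = re.compile(r'ed$')

# search() scans the whole word; $ pins the match to the end of it.
ed_words = [word for word in sample_words if ed_suffix.search(word)]
print(ed_words)
```

Note that 'editor' is skipped: it contains "ed", but not at the end.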

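import pandas as pd

# The same filter, vectorized: str.contains() treats the pattern as a regex by default
token_series = pd.Series(en_words)
isPattern = token_series.str.contains(pattern)
print(token_series[isPattern][:5])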

39     abaissed
47    abandoned
62       abased
69      abashed
85     abatised
dtype: object

def select_token_by_pattern(token_series, pattern):
    isPattern = token_series.str.contains(pattern)
    return token_series[isPattern]

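# $ anchors the match at the end of the word

print(select_token_by_pattern(token_series,pattern='ed$')[:5])

# ^ anchors the match at the start of the word

print(select_token_by_pattern(token_series,pattern='^ir')[:5])

# . matches any single character; with ^ and $ the word must be exactly 8 characters long

print(select_token_by_pattern(token_series,pattern='^..j..t..$')[:5])

# Without anchors the pattern may match anywhere inside the word, so longer words also qualify

print(select_token_by_pattern(token_series,pattern='..j..t..')[:5])

# [ab] matches a or b; here each position takes one character from its set

print(select_token_by_pattern(token_series,pattern='^[ghi][mno][jlk][def]$')[:5])

# Range patterns such as [A-Z] or [a-z0-9]; for Korean use [ㄱ-ㅣ가-힣] or [가-힣]

print(select_token_by_pattern(token_series,pattern='^[A-Z][a-z0-9]$')[:5])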

39     abaissed
47    abandoned
62       abased
69      abashed
85     abatised
dtype: object
97896         iracund
97897      iracundity
97898    iracundulous
97899           irade
97910        irascent
dtype: object
276      abjectly
2620     adjuster
50269    dejected
50274    dejectly
94828    injector
dtype: object
273    abjectedness
274       abjection
275       abjective
276        abjectly
277      abjectness
dtype: object
78596     gold
78655     golf
86476     hold
86492     hole
236192    gold
dtype: object
15       Ab
4264     Ah
4576     Al
11025    Ao
13945    As
dtype: object
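The anchor and character-class patterns above can be checked without pandas or NLTK. The sketch below assumes a hypothetical helper `select_words` and a handful of hand-picked `sample_words`; it mirrors `select_token_by_pattern` using `re.search`, which is the matching behavior `Series.str.contains` relies on.

```python
import re

def select_words(words, pattern):
    """Return the words in which `pattern` matches somewhere (re.search semantics)."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.search(w)]

# Hypothetical sample words chosen to exercise each pattern.
sample_words = ['gold', 'hole', 'golden', 'abjectly', 'abjection', 'irate', 'Ab', 'fair']

starts_with_ir = select_words(sample_words, r'^ir')            # ^ pins the start
exactly_eight = select_words(sample_words, r'^..j..t..$')      # whole word, 8 characters
eight_or_more = select_words(sample_words, r'..j..t..')        # match anywhere in the word
keypad_like = select_words(sample_words, r'^[ghi][mno][jlk][def]$')
upper_then_lower = select_words(sample_words, r'^[A-Z][a-z0-9]$')

print(starts_with_ir, exactly_eight, eight_or_more, keypad_like, upper_then_lower)
```

Comparing `exactly_eight` with `eight_or_more` shows what the anchors contribute: 'abjection' has nine letters, so it only passes the unanchored version.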

from nltk.corpus import nps_chat
chat_token_series = pd.Series(nps_chat.words())

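# Regexes can pick out (or remove) nonstandard word forms such as elongated spellings
# First, look at tokens longer than 10 characters
print(chat_token_series[chat_token_series.str.len()>10][:5])
# + matches one or more repetitions of the preceding character
print(select_token_by_pattern(chat_token_series,pattern='m+i+k+e+').drop_duplicates()[0:5])
print(select_token_by_pattern(chat_token_series,pattern='!+').drop_duplicates()[0:5])
# * matches zero or more repetitions of the preceding character
print(select_token_by_pattern(chat_token_series,pattern='^m*i*k*e*$').drop_duplicates()[0:5])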

208        Considerably
442         considering
448        ihavehotnips
528    iamahotniplickme
543         appearently
dtype: object
3065                             mike
3385    mikeeeeeeeeeeeeeeeeeeeeeeeeee
dtype: object
213                   !
592    !!!!!!!!!!!!!!!!
619               !!!!!
669             !!!!!!!
679                  !!
dtype: object
36        m
45       me
69        i
790       k
2576    mmm
dtype: object
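The difference between + and * shows up clearly on a few hand-made tokens. The snippet below is a sketch with hypothetical `tokens`; the last step, collapsing a character repeated three or more times with a backreference, is one possible way to normalize elongated spellings and is an assumption, not the method used in this post.

```python
import re

# Hypothetical chat-like tokens for illustration.
tokens = ['mike', 'mikeeeee', 'mmm', 'me', 'hi', '!!!', 'ok!']

# + requires every letter at least once, so only forms of "mike" match.
plus_matches = [t for t in tokens if re.search(r'^m+i+k+e+$', t)]

# * lets each letter be absent, so "mmm" and "me" match as well.
star_matches = [t for t in tokens if re.search(r'^m*i*k*e*$', t)]

# Assumed normalization: (.)\1{2,} finds a character followed by two or more
# copies of itself, and the replacement keeps a single copy.
normalized = [re.sub(r'(.)\1{2,}', r'\1', t) for t in tokens]

print(plus_matches, star_matches, normalized)
```

Because every part of `^m*i*k*e*$` is optional, the pattern would also match an empty string, which is worth remembering when filtering real data.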

Negation (NOT)

# Inside brackets, ^ negates the set: this selects tokens that contain no vowels at all

print(select_token_by_pattern(chat_token_series,pattern='^[^AEIOUaeiou]+$').drop_duplicates()[0:5])

7     :P
14     :
21     .
30    :)
34    26
dtype: object
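The two roles of ^ are easy to confuse, so here is a small sketch on hypothetical `tokens`: inside brackets it negates the character set, outside it anchors the start of the string.

```python
import re

# Hypothetical tokens for illustration.
tokens = ['rhythm', ':P', 'aeiou', 'sky', 'tree', '26', 'I']

# [^AEIOUaeiou] matches any single character that is not a vowel;
# with ^...+$ the whole token must consist of such characters.
no_vowel_tokens = [t for t in tokens if re.search(r'^[^AEIOUaeiou]+$', t)]

# Without the inner ^ the set is not negated: tokens made only of vowels.
only_vowel_tokens = [t for t in tokens if re.search(r'^[AEIOUaeiou]+$', t)]

print(no_vowel_tokens, only_vowel_tokens)
```

Punctuation and digits count as "not a vowel", which is why emoticons and numbers appear in the output above.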

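Selecting numbers with a decimal point
# . normally matches any character; escaping it as \. makes it match a literal dot
print(select_token_by_pattern(chat_token_series,pattern=r'^[0-9]+\.[0-9]+$').drop_duplicates()[0:5])
# {n} fixes the number of repetitions: exactly four digits here
print(select_token_by_pattern(chat_token_series,pattern=r'^[0-9]{4}$').drop_duplicates()[0:5])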

1865      1.99
3214      4.20
10512     39.3
10515    121.7
10668     64.8
dtype: object
5916     1200
10529    2006
10585    1980
10933    1299
10984    1900
dtype: object
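To see why the backslash matters, the sketch below runs the pattern with and without the escape on a few hypothetical `tokens`. The unescaped version is included only to show what goes wrong.

```python
import re

# Hypothetical tokens for illustration.
tokens = ['1.99', '1x99', '2006', '39.3', '.5', '12345']

# Unescaped . matches any character, so "1x99" and plain integers slip through.
loose = [t for t in tokens if re.search(r'^[0-9]+.[0-9]+$', t)]

# \. matches only a literal dot.
decimals = [t for t in tokens if re.search(r'^[0-9]+\.[0-9]+$', t)]

# {4} requires exactly four digits between the anchors.
four_digits = [t for t in tokens if re.search(r'^[0-9]{4}$', t)]

print(loose, decimals, four_digits)
```

Writing the patterns as raw strings (`r'...'`) keeps Python from trying to interpret `\.` as a string escape before the regex engine sees it.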

example = '아름다운 우리말과 영어(English) ^O^ ㅋㅋ'
print(' '.join(re.findall('[ㄱ-ㅣ가-힣]+',example)))
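# Replace everything outside the set with '' (delete it); the space inside the brackets keeps spaces
print(re.sub('[^ㄱ-ㅣ가-힣 ]+','',example))
# Same, but keep only complete syllables (가-힣), so standalone jamo such as ㅋㅋ are removed too
print(re.sub('[^가-힣 ]+','',example))
# Remove numeric citation markers like [3]; \[ and \] match literal brackets
print(re.sub(r'\[[0-9]+\]','','test[3][e][1]'))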

아름다운 우리말과 영어 ㅋㅋ
아름다운 우리말과 영어  ㅋㅋ
아름다운 우리말과 영어  
test[e]
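The substitutions above leave double spaces where text was removed. One way to tidy this up is sketched below; the function name `keep_hangul` and the whitespace-collapsing step are assumptions added for illustration, not part of the original code.

```python
import re

def keep_hangul(text):
    """Keep Hangul syllables and spaces, then tidy the leftover whitespace."""
    # Drop everything that is not a complete Hangul syllable or a space.
    hangul_only = re.sub(r'[^가-힣 ]+', '', text)
    # Collapse runs of spaces left behind by the removed text and trim the ends.
    return re.sub(r' +', ' ', hangul_only).strip()

example = '아름다운 우리말과 영어(English) ^O^ ㅋㅋ'
cleaned = keep_hangul(example)
print(cleaned)
```

Swapping the character class for `[^ㄱ-ㅣ가-힣 ]` would keep standalone jamo such as ㅋㅋ as well.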
