Extracting text from Word and PDF documents, file I/O, encoding

[Extracting text from documents]

https://pypi.org/project/pyautomate/

> pip install pyautomate

import pyautomate

from pyautomate.office import Word

docx = Word('test.docx')

from pyautomate.pdf import PDFDocument

pdf = PDFDocument('test.pdf')

본문 = pdf.extract_text()

print(본문)
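
As a dependency-free cross-check for the Word side: a .docx file is a ZIP archive whose body text lives in word/document.xml, so it can be read with the standard library alone. This is only a rough sketch. extract_docx_text is a hypothetical helper name, not part of pyautomate, and it skips headers, footers, and footnotes, which are stored in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):  # every <w:p> paragraph element
        runs = [node.text or "" for node in para.iter(W_NS + "t")]  # <w:t> text runs
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Usage would look like print(extract_docx_text('test.docx')), assuming test.docx exists in the working directory.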

[File I/O]

file = open('test.txt')

file

body = file.read()

file.close()

print(body)

# Pythonic style: the with block closes the file automatically

with open('test.txt') as file:
    body = file.read()

print(body)

file.closed
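
To see the whole cycle without needing an existing test.txt, here is a small sketch that writes a file and reads it back line by line. The file name demo.txt and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file name, created in a temporary directory so nothing is overwritten
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as file:
    file.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as file:
    # Iterating over a file object yields one line at a time
    lines = [line.rstrip("\n") for line in file]

print(lines)
print(file.closed)  # the with block closed the file on exit
```

Iterating over the file object avoids loading a large file into memory all at once, which file.read() would do.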

[Encoding]

open('test_from_jupyter.txt') # without an explicit encoding, open() uses the system default, which is cp949 on Korean Windows

<_io.TextIOWrapper name='test_from_jupyter.txt' mode='r' encoding='cp949'>
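
The default shown above comes from the system locale, so it differs between machines. A minimal sketch for checking it and for pinning the encoding explicitly follows. The temporary directory is an assumption for the example, and the printed default will not be cp949 on most non-Korean systems or when Python's UTF-8 mode is enabled.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when none is given (system dependent)
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Temporary location, assumed for this example only
path = os.path.join(tempfile.mkdtemp(), "test_from_jupyter.txt")

with open(path, "w", encoding="utf-8") as file:
    file.write("홍길동")

# Passing encoding explicitly makes the result independent of the locale
with open(path, encoding="utf-8") as file:
    print(file.encoding)
    text = file.read()

print(text)
```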

name = '홍길동'
unicodepoint = [ord(char) for char in name]
print(unicodepoint)
print([hex(cp) for cp in unicodepoint])
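# Unicode is an abstract mapping that assigns a number (code point) to every character

# An encoding is the concrete byte sequence used to store each character (3 bytes per Hangul syllable in UTF-8)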
print(name.encode('utf-8'))
print(name.encode('utf-16'))
print(name.encode('cp949'))
print(name.encode('euc-kr'))
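
The difference between code points and encoded bytes is easier to see by counting. This sketch rebuilds the string from its code points with chr(), then compares how many bytes each codec needs for the same three characters. The variable names restored and sizes are chosen for the example.

```python
name = "홍길동"

# Code points are plain integers; chr() is the inverse of ord()
code_points = [ord(char) for char in name]
restored = "".join(chr(cp) for cp in code_points)
print(restored == name)

# The same three characters take a different number of bytes per codec
sizes = {}
for codec in ("utf-8", "utf-16", "cp949", "euc-kr"):
    encoded = name.encode(codec)
    assert encoded.decode(codec) == name  # decoding with the same codec round-trips
    sizes[codec] = len(encoded)

print(sizes)
```

UTF-8 uses 3 bytes per Hangul syllable, cp949 and euc-kr use 2, and Python's utf-16 codec uses 2 per syllable plus a 2-byte byte order mark at the start.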

with open("test_write.txt", 'w', encoding='utf-8') as file:
    file.write('파일 utf-8 인코딩하여 쓰기')
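
Reading has to use the same encoding that was used for writing. The sketch below writes the same text once as UTF-8 and once as cp949, then shows that reading the cp949 file as UTF-8 fails. The second file name and the temporary directory are assumptions added for the example.

```python
import os
import tempfile

text = "파일 utf-8 인코딩하여 쓰기"
folder = tempfile.mkdtemp()  # temporary location, assumed for this example
utf8_path = os.path.join(folder, "test_write.txt")
cp949_path = os.path.join(folder, "test_write_cp949.txt")  # hypothetical second file

with open(utf8_path, "w", encoding="utf-8") as file:
    file.write(text)
with open(cp949_path, "w", encoding="cp949") as file:
    file.write(text)

# Matching encoding: the text comes back unchanged
with open(utf8_path, encoding="utf-8") as file:
    round_trip = file.read()
print(round_trip == text)

# Mismatched encoding: cp949 bytes are not valid UTF-8 here
try:
    with open(cp949_path, encoding="utf-8") as file:
        file.read()
    mismatch_detected = False
except UnicodeDecodeError as error:
    mismatch_detected = True
    print("decode failed:", error.reason)
```

A mismatch does not always raise an error. In the other direction, some byte sequences decode into wrong characters silently, so it is safer to record the encoding a file was written with than to rely on exceptions.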

License: Attribution, NonCommercial, NoDerivatives
