Extracting text from Word and PDF documents, file I/O, and encoding


pub-lican01 2018. 12. 10. 13:55

[Extracting text from documents]

https://pypi.org/project/pyautomate/

> pip install pyautomate

import pyautomate

from pyautomate.office import Word

docx = Word('test.docx')

from pyautomate.pdf import PDFDocument

pdf = PDFDocument('test.pdf')

본문 = pdf.extract_text()

print(본문)
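The snippet above creates a Word object but never reads anything from it. As a rough, library-independent sketch of what text extraction from a .docx involves: the file is a zip archive whose main body lives in word/document.xml, with paragraphs in w:p elements and text in w:t elements. The helper name extract_docx_text is hypothetical and is not part of pyautomate; the sketch uses only the standard library and ignores headers, footnotes, and other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; `path` may be a filename or a
    binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph (w:p) stores its text in one or more w:t elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Usage, assuming test.docx exists in the working directory:
# print(extract_docx_text('test.docx'))
```

A dedicated library is usually the better choice for real documents; this only shows where the text sits inside the file.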

[File I/O]

file = open('test.txt')

file

body = file.read()

file.close()

print(body)

# Pythonic style: the with block closes the file automatically

with open('test.txt') as file:
    body = file.read()

print(body)

file.closed  # True: the file was closed when the with block ended
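The main reason to prefer with over a manual close() is that the file is closed even when the body raises an exception. A minimal sketch of that behavior follows; the temporary file and the simulated ValueError are assumptions made so the example runs without test.txt.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on test.txt existing.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write("first line\nsecond line\n")

    try:
        with open(path, encoding="utf-8") as file:
            first_line = file.readline()
            # Simulate a failure in the middle of processing the file.
            raise ValueError("simulated failure while processing")
    except ValueError:
        pass

    # The name `file` is still bound, but the with block has closed the file.
    closed_after_error = file.closed
    print(first_line.strip(), closed_after_error)
```

With the manual open/read/close sequence, the same exception would skip file.close() unless it were wrapped in try/finally.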

[Encoding]

open('test_from_jupyter.txt')  # With no encoding argument, open() uses the locale default (cp949 on Korean Windows)

<_io.TextIOWrapper name='test_from_jupyter.txt' mode='r' encoding='cp949'>
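The cp949 in the output above comes from the operating system locale, not from the file itself. A small sketch for checking which encoding your own environment would pick, and for overriding it explicitly; the temporary directory is an assumption so the example does not touch real files, and the printed default will differ from machine to machine.

```python
import locale
import os
import tempfile

# The locale's preferred encoding; open() in text mode normally falls back
# to it when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test_from_jupyter.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write("hello")

    # No encoding argument: the platform default is used.
    with open(path) as implicit:
        implicit_encoding = implicit.encoding

    # Explicit encoding: the same on every machine.
    with open(path, encoding="utf-8") as explicit:
        explicit_encoding = explicit.encoding

print(implicit_encoding, explicit_encoding)
```

Passing encoding explicitly keeps a script's behavior identical across Windows, macOS, and Linux.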

name = '홍길동'
unicodepoint = [ord(char) for char in name]
print(unicodepoint)
print([hex(cp) for cp in unicodepoint])
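# Unicode is an abstract mapping that assigns a number (code point) to every character

# An encoding is the concrete rule for storing those code points as bytes
# (in UTF-8, each of these Hangul characters takes 3 bytes)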
print(name.encode('utf-8'))
print(name.encode('utf-16'))
print(name.encode('cp949'))
print(name.encode('euc-kr'))
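To make the difference between the four encodings easier to see, the sketch below compares byte counts instead of raw bytes and then decodes the result. For these three Hangul characters, UTF-8 uses 3 bytes each, cp949 and euc-kr use 2 bytes each, and Python's utf-16 codec uses 2 bytes each plus a 2-byte byte order mark. The variable names are illustrative only.

```python
name = '홍길동'

# Same three code points, different byte sequences depending on the encoding.
encodings = ('utf-8', 'utf-16', 'cp949', 'euc-kr')
byte_lengths = {enc: len(name.encode(enc)) for enc in encodings}
print(byte_lengths)

# Decoding with the encoding that produced the bytes restores the string.
cp949_bytes = name.encode('cp949')
round_trip = cp949_bytes.decode('cp949')

# Decoding with a different encoding fails here; with other inputs it can
# instead succeed silently and yield garbled text.
try:
    cp949_bytes.decode('utf-8')
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

print(round_trip, mismatch_failed)
```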

with open("test_write.txt", 'w', encoding='utf-8') as file:
    file.write('파일 utf-8 인코딩하여 쓰기')
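As a quick check that the write above really produces UTF-8, the sketch below writes the same string, reads the raw bytes back in binary mode, and then reads the text back with the matching encoding. It uses a temporary directory (an assumption, so nothing is left on disk) instead of the working directory.

```python
import os
import tempfile

text = '파일 utf-8 인코딩하여 쓰기'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test_write.txt")
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)

    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, 'rb') as file:
        raw = file.read()

    # Reading with the same encoding restores the original text.
    with open(path, encoding='utf-8') as file:
        restored = file.read()

# Character count, byte count on disk, and whether the round trip matched.
print(len(text), len(raw), restored == text)
```

The byte count is larger than the character count because each Hangul character occupies 3 bytes in UTF-8 while ASCII characters occupy 1.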

License: Attribution, NonCommercial, NoDerivatives