Setting Up a Web Crawling Environment on Windows

This guide assumes Python and pip are already installed.

Setting up a virtual environment

C:\>pip install virtualenv virtualenvwrapper

C:\>virtualenv NAME

Enter (activate) the virtual environment:

C:\NAME\Scripts>activate.bat

Leave (deactivate) the virtual environment:

(NAME) C:\NAME\Scripts>deactivate.bat
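
To confirm from inside Python that the interpreter really belongs to the virtual environment, one option is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch: the helper name `in_virtualenv` is made up for illustration, and the `real_prefix` check is included only because older virtualenv releases set that attribute instead.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix inside the environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Running it once before and once after `activate.bat` should show the interpreter path switching to the `Scripts` folder of the environment.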

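When a virtualenv is created, it downloads setuptools, pip, and wheel. If that fails because of a proxy or a similar problem, download the packages separately first, then create the environment with --no-download and point --extra-search-dir at the directory that holds the downloaded files (the path below is an example):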

pip download --no-cache-dir --proxy http://PROXYSERVER:PORT --trusted-host pypi.python.org setuptools wheel pip

virtualenv --no-download --extra-search-dir /opt/pypi/downloads virtualenv
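
The two commands only work together if virtualenv searches the same directory that pip downloaded into. The sketch below builds both command lines from a single directory argument so they stay consistent. The function name, the `C:\wheelhouse` directory, and the proxy URL are placeholders rather than values from this setup; `-d` is pip's option for the download destination.

```python
import sys


def offline_virtualenv_commands(env_name, download_dir, proxy=None):
    """Build the pip and virtualenv command lines for an offline-seeded environment."""
    download = [sys.executable, "-m", "pip", "download",
                "--no-cache-dir",
                "--trusted-host", "pypi.python.org",
                "-d", download_dir]
    if proxy:
        download += ["--proxy", proxy]
    download += ["setuptools", "wheel", "pip"]

    # virtualenv looks in download_dir instead of fetching from the network.
    create = [sys.executable, "-m", "virtualenv",
              "--no-download",
              "--extra-search-dir", download_dir,
              env_name]
    return download, create


if __name__ == "__main__":
    # Placeholder values; pass each list to subprocess.run(..., check=True) to execute it.
    commands = offline_virtualenv_commands("NAME", r"C:\wheelhouse", "http://PROXYSERVER:PORT")
    for cmd in commands:
        print(" ".join(cmd))
```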

#Beautiful Soup - parses HTML documents, converts incoming encodings to Unicode and outputs UTF-8; works with the lxml and html5lib parsers

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[BeautifulSoup]

pip install --trusted-host pypi.python.org --proxy http://PROXYSERVER:PORT lxml html5lib beautifulsoup4

Test

C:\>python

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup('<html></html>', 'html.parser')

>>> print(soup)
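
An empty document only proves that the import works. A slightly larger check, sketched below, parses a small made-up HTML string with the built-in `html.parser` and pulls out the title and the link targets. The markup and the `extract_links` helper are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup used only for this check.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First</a>
    <a href="/second">Second</a>
    <a>No href here</a>
  </body>
</html>
"""


def extract_links(markup):
    """Return (title, hrefs) for an HTML string."""
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True keeps only <a> tags that actually carry an href attribute.
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    return title, hrefs


if __name__ == "__main__":
    title, hrefs = extract_links(SAMPLE_HTML)
    print(title)
    print(hrefs)
```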

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml’s HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient	External C dependency
lxml’s XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
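
In practice the table often reduces to using lxml when it is installed and falling back to the built-in parser otherwise. Beautiful Soup raises `bs4.FeatureNotFound` when the requested parser is missing, so the fallback can be sketched as below. The `make_soup` helper and its default preference order are assumptions for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser from `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


if __name__ == "__main__":
    soup, parser = make_soup("<p>Hello</p><p>World</p>")
    print("parser used:", parser)
    print([p.get_text() for p in soup.find_all("p")])
```

Different parsers can build different trees from malformed markup, so it is worth logging which parser was actually used.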

#Scrapy - a crawling framework with a variety of selectors, pipelines, logging, email, and more

[Scrapy]

C:\>pip install --trusted-host pypi.python.org --proxy http://PROXYSERVER:PORT Scrapy pypiwin32

If installing the Twisted package fails:

Search for the Twisted package at https://www.lfd.uci.edu/~gohlke/pythonlibs/ and download the .whl file that matches your Python version and CPU architecture.

C:\>pip install --trusted-host pypi.python.org --proxy http://PROXYSERVER:PORT Twisted-18.9.0-cp36-cp36m-win_amd64.whl

Install the wheel locally like this first, then run the Scrapy install command above.
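
Picking the right file comes down to reading the tags in the wheel's filename: in `Twisted-18.9.0-cp36-cp36m-win_amd64.whl`, `cp36` means CPython 3.6 and `win_amd64` means 64-bit Windows. The sketch below splits a filename into those fields and reports the matching values for the running interpreter. Both helper names are invented here, and the `cp` prefix assumes CPython.

```python
import struct
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into (name, version, python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    stem = filename[:-len(".whl")]
    # The last three dash-separated fields are the python, ABI and platform tags.
    rest, python_tag, abi_tag, platform_tag = stem.rsplit("-", 3)
    # An optional build tag may follow the version; it is ignored here.
    name, version = rest.split("-")[:2]
    return name, version, python_tag, abi_tag, platform_tag


def interpreter_summary():
    """Return the CPython-style tag and pointer width of the running interpreter."""
    tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # 32 or 64
    return tag, bits


if __name__ == "__main__":
    print(parse_wheel_name("Twisted-18.9.0-cp36-cp36m-win_amd64.whl"))
    print(interpreter_summary())
```

If the python tag or the bit width does not match what `interpreter_summary()` reports, pip will refuse the wheel as unsupported on this platform.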

Create a project with:

C:\>scrapy startproject tutorial

Install PyCharm (Community edition).

Configure the virtual environment in PyCharm:

File - Settings - Project: tutorial - Project Interpreter - gear icon - Add - select python.exe in the virtualenv's Scripts folder and add it

[Proxy environment settings]

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
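
Python's standard library reads the same variables, so a quick way to confirm that they are visible to a script is to ask `urllib.request.getproxies()`. This is a minimal sketch; `proxy_settings` is a name made up here and `proxy.example:8080` is a placeholder address.

```python
import os
import urllib.request


def proxy_settings():
    """Return the proxies Python's urllib will use, keyed by URL scheme."""
    return urllib.request.getproxies()


if __name__ == "__main__":
    # Placeholder address; replace with the real proxy host and port.
    os.environ.setdefault("http_proxy", "http://proxy.example:8080")
    for scheme, url in sorted(proxy_settings().items()):
        print(scheme, "->", url)
```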
