- Autohome (汽车之家): review ("口碑") pages, forum pages
- Dongchedi (懂车帝): content all comes from Toutiao (今日头条), but it has its own comments
- Hupu (虎扑): car enthusiasts forum
- XCar (爱卡汽车)
- Yiche (易车)
- scrapy>=2.4.0
- parsel
- w3lib
- twisted
- cryptography
- pyOpenSSL
- lxml
- itemadapter>=0.2.0
- pymongo>=3.11.0
- requests>=2.24.0
- matplotlib>=3.3.2
- fonttools>=4.16.1
- lxml>=4.6.1
urllib.request.urlopen(url, data=None, [timeout, ]*)
urllib.request.Request(url, data=None, headers={}, method=None)
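A minimal sketch of the `urllib.request.Request` call above: build a request with a custom User-Agent header and inspect it. No network call is made here; pass the object to `urlopen()` to actually fetch the page. The URL and header values are just sample data.

```python
from urllib.request import Request

req = Request(
    "https://httpbin.org/get",
    headers={"User-Agent": "my-app/0.0.1"},
    method="GET",
)

print(req.full_url)                  # https://httpbin.org/get
# urllib stores header keys capitalized, hence "User-agent"
print(req.get_header("User-agent"))  # my-app/0.0.1
```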
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('https://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('https://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('https://httpbin.org/delete')
>>> r = requests.head('https://httpbin.org/get')
>>> r = requests.options('https://httpbin.org/get')
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('https://httpbin.org/get', params=payload)
>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'my-app/0.0.1'}
>>> r = requests.get(url, headers=headers)
import re
re.match
re.search
re.findall
re.sub
re.compile
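A quick sketch of the five `re` functions listed above, applied to an HTML-like string (the links here are made-up sample data).

```python
import re

html = '<a href="/post/1">first</a> <a href="/post/2">second</a>'

print(re.match(r"<a", html))                        # matches only at the start of the string
print(re.search(r'href="([^"]+)"', html).group(1))  # first match anywhere: /post/1
print(re.findall(r'href="([^"]+)"', html))          # all matches: ['/post/1', '/post/2']
print(re.sub(r"</?a[^>]*>", "", html))              # strip the <a> tags: 'first second'
link_re = re.compile(r'href="([^"]+)"')             # precompile a pattern for reuse
print(link_re.findall(html))
```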
soup = BeautifulSoup(html_doc,'lxml')
print(soup.title.string)
print(soup.p.string)
print(soup.title.parent.name)
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id="link2"))
print(soup.get_text())
Drawback: relatively slow
soup = BeautifulSoup(html_doc,'lxml')
print(soup.select("title"))
print(soup.select("body a"))
print(soup.select("p > #link1"))
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree.
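A short sketch of lxml's ElementTree-style API combined with XPath; the HTML snippet is sample data.

```python
from lxml import etree

html = '<html><body><p class="title">Hello</p><a href="/x">link</a></body></html>'
tree = etree.HTML(html)  # lenient HTML parser, returns the root element

print(tree.xpath("//p/text()"))  # ['Hello']
print(tree.xpath("//a/@href"))   # ['/x']
print(tree.xpath('//p[@class="title"]')[0].text)
```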
- pip install selenium, download the matching browser driver, add it to PATH, then restart the IDE
driver = webdriver.Chrome()
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
Note: the find_element_by_* helpers are deprecated in Selenium 4; prefer driver.find_element(By.ID, ...) with By from selenium.webdriver.common.by. To match multiple elements, use find_elements instead of find_element:
driver.find_elements(By.ID, 'xxx')
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
driver.current_url
driver.get_cookies()
driver.page_source
input.text (.text returns an element's visible text; here `input` is a previously located element)
Python object to JSON
json.dumps()
JSON to python object
json.loads()
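A round-trip sketch of the two json functions above; the record is sample data.

```python
import json

record = {"brand": "Tesla", "score": 4.5, "tags": ["ev", "sedan"]}

text = json.dumps(record, ensure_ascii=False)  # Python object -> JSON string
back = json.loads(text)                        # JSON string -> Python object

print(text)
print(back == record)  # True: the round trip preserves the data
```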
Parallelism (并行): multiple processes literally execute at the same instant (requires multiple cores). Concurrency (并发): multiple processes make progress within the same time period, possibly interleaved on one core. Relevant APIs: threading.Thread, multiprocessing.dummy
GIL (Global Interpreter Lock): only one thread executes Python bytecode at a time
Coroutines, so-called "micro-threads" (微线程): gevent with monkey.patch_all()
Modules: _thread, threading, queue. Usually use threading
In Python, a thread pool can be created with ThreadPoolExecutor() (from concurrent.futures)
A bit complicated
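"A bit complicated" is mostly the setup; a minimal ThreadPoolExecutor sketch looks like this. `fetch()` is a placeholder for a real download, and the URLs are sample data.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder for a network request; returns a fake "page" for the url
    return f"content of {url}"

urls = [f"https://example.com/page/{i}" for i in range(5)]

with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(fetch, urls))  # map preserves input order

print(pages[0])
```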
from multiprocessing import Process
from multiprocessing import Pool
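multiprocessing.dummy exposes the same Pool API backed by threads instead of processes, which is handy for I/O-bound crawling. A tiny sketch with a stand-in worker function:

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

def square(n):
    # stand-in for an I/O-bound task such as downloading a page
    return n * n

with Pool(4) as pool:
    results = pool.map(square, range(5))

print(results)  # [0, 1, 4, 9, 16]
```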
- Cookie
- Submit the login form with urllib
- Selenium
- Grab the input boxes' selectors, log in, then call webdriver.get_cookies() to retrieve the cookies
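driver.get_cookies() returns a list of dicts; to reuse the logged-in session outside the browser, they can be folded into a single Cookie header. The cookie names and values below are made-up sample data.

```python
# sample of what driver.get_cookies() returns (extra keys omitted)
cookies = [
    {"name": "sessionid", "value": "abc123"},
    {"name": "csrftoken", "value": "xyz789"},
]

# fold the list into one "name=value; name=value" Cookie header
cookie_header = "; ".join(f'{c["name"]}={c["value"]}' for c in cookies)
print(cookie_header)  # sessionid=abc123; csrftoken=xyz789
```

Pass it along as `headers={"Cookie": cookie_header}` in subsequent requests/urllib calls.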
Tool: Python-tesseract — pip install pytesseract, and also install the tesseract-ocr engine itself
pytesseract.image_to_string()
# grayscale conversion
img.convert('L')
# binarization (threshold to pure black/white)
def convert_img(img, threshold):
    img = img.convert("L")  # grayscale
    pixels = img.load()
    for x in range(img.width):
        for y in range(img.height):
            if pixels[x, y] > threshold:
                pixels[x, y] = 255
            else:
                pixels[x, y] = 0
    return img
# denoising: a black pixel with almost no black neighbours is isolated noise
# (the original looped x over the height, never reset count, and tested
# count > 4, which four neighbours can never reach — fixed below)
data = img.getdata()
w, h = img.size
for x in range(1, w - 1):
    for y in range(1, h - 1):
        mid_pixel = data[w * y + x]
        if mid_pixel == 0:  # black pixel: inspect its four neighbours
            count = 0
            top_pixel = data[w * (y - 1) + x]
            left_pixel = data[w * y + (x - 1)]
            down_pixel = data[w * (y + 1) + x]
            right_pixel = data[w * y + (x + 1)]
            if top_pixel == 0:
                count += 1
            if left_pixel == 0:
                count += 1
            if down_pixel == 0:
                count += 1
            if right_pixel == 0:
                count += 1
            if count < 2:  # fewer than 2 black neighbours -> treat as noise
                img.putpixel((x, y), 255)
Scraping mobile app data
Libraries:
- csv
- import csv
- pandas
- pip install pandas
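A tiny sketch of writing scraped rows with the stdlib csv module; the review rows are sample data, and an in-memory buffer stands in for a file.

```python
import csv
import io

rows = [
    ["brand", "score"],
    ["Tesla", "4.5"],
    ["BYD", "4.2"],
]

# use open("out.csv", "w", newline="") instead of StringIO for a real file
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# reading it back round-trips to the original rows
back = list(csv.reader(io.StringIO(buf.getvalue())))
print(back)
```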
Libraries:
- Pymongo
- from pymongo import MongoClient
Useful libraries:
- matplotlib
- python -m pip install -U pip setuptools
- python -m pip install matplotlib
- matplotlib
- seaborn
- pyecharts
- Baidu's open-source visualization library
- Very flashy visuals
scrapy startproject projectname
scrapy crawl spidername  (takes the spider's name attribute, not the project name)
- ChroPath
- One-click copy of an element's XPath / CSS selector
- Web Scraper
- Scrapes data directly in the browser
- Mobile versions of pages often have fewer anti-scraping measures
- Be restrained: sleep between requests, and crawl in the middle of the night
- Use a User-Agent that robots.txt allows
- Find the sitemap to collect URLs
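The "be restrained" advice can be sketched as a minimal rate limiter: wait at least `delay` seconds between consecutive requests. `polite_crawl` and its placeholder fetch are illustrative names, and the tiny delay is only for the demo — use seconds, not milliseconds, on real sites.

```python
import time

def polite_crawl(urls, delay=0.01):  # use ~1-5 s for real sites
    pages = []
    last = 0.0
    for url in urls:
        # sleep off whatever remains of the per-request delay
        wait = delay - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        pages.append(f"fetched {url}")  # placeholder for a real request
    return pages

start = time.monotonic()
out = polite_crawl(["u1", "u2", "u3"], delay=0.01)
elapsed = time.monotonic() - start
print(out, round(elapsed, 3))
```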
- js2py
- PyV8
- PyExecJS