flyslowly/crawlers-master

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 

Repository files navigation

crawlers-master

Car forums

International forums

Local forums

  • 汽车之家 (Autohome): review ("口碑") pages, forum pages
  • 懂车帝 (Dongchedi): content all comes from 今日头条 (Toutiao), but it has its own comments
  • 虎扑 (Hupu) car community forum
  • 爱卡汽车 (XCAR)
  • 易车 (Yiche / BitAuto)

Learning links

Prerequisites

  • scrapy>=2.4.0
    • parsel
    • w3lib
    • twisted
    • cryptography
    • pyOpenSSL
    • lxml
  • itemadapter>=0.2.0
  • pymongo>=3.11.0
  • requests>=2.24.0
  • matplotlib>=3.3.2
  • fonttools>=4.16.1
  • lxml>=4.6.1

urllib

urllib.request.urlopen(url, data=None, [timeout, ]*)

urllib.request.Request(url, data=None, headers={}, method=None)
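A minimal sketch combining the two calls above; the URL and User-Agent string are placeholders, and the actual fetch is left commented out since it needs network access:

```python
from urllib.request import Request, urlopen

# Build a Request with custom headers; constructing it needs no network I/O.
req = Request(
    "https://httpbin.org/get",
    headers={"User-Agent": "my-crawler/0.1"},
    method="GET",
)
print(req.full_url)                  # https://httpbin.org/get
print(req.get_header("User-agent"))  # my-crawler/0.1 (urllib capitalizes keys)

# Actually fetching requires network access:
# with urlopen(req, timeout=10) as resp:
#     body = resp.read().decode("utf-8")
```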

requests

Requests

Different request methods

>>> r = requests.get('https://api.github.com/events')

>>> r = requests.post('https://httpbin.org/post', data = {'key':'value'})

>>> r = requests.put('https://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('https://httpbin.org/delete')

>>> r = requests.head('https://httpbin.org/get')

>>> r = requests.options('https://httpbin.org/get')

With query parameters

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.get('https://httpbin.org/get', params=payload)

Pretending to be a browser

>>> url = 'https://api.github.com/some/endpoint'

>>> headers = {'user-agent': 'my-app/0.0.1'}

>>> r = requests.get(url, headers=headers)
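One way to inspect the URL that requests builds from params, without any network traffic, is to prepare the request first (the payload values are placeholders):

```python
import requests

payload = {"key1": "value1", "key2": "value2"}
# prepare() builds the final request object, including the encoded query
# string, without sending anything over the network.
prep = requests.Request("GET", "https://httpbin.org/get", params=payload).prepare()
print(prep.url)  # https://httpbin.org/get?key1=value1&key2=value2
```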

Regular expressions

import re

regular expression

re.match
re.search
re.findall
re.sub
re.compile
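A quick tour of the five functions above on a throwaway string:

```python
import re

text = "price: 120, tax: 8"

print(re.match(r"\d+", text))            # None: string doesn't start with digits
print(re.search(r"\d+", text).group())   # first number anywhere: '120'
print(re.findall(r"\d+", text))          # all numbers: ['120', '8']
print(re.sub(r"\d+", "N", text))         # 'price: N, tax: N'

num = re.compile(r"\d+")                 # precompile when reusing a pattern
print(num.findall(text))                 # ['120', '8']
```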

BeautifulSoup

from bs4 import BeautifulSoup

html_doc = """<html><head><title>Demo</title></head>
<body><p>Hello</p>
<a id="link1" href="/a">A</a> <a id="link2" href="/b">B</a></body></html>"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)       # 'Demo'
print(soup.p.string)           # 'Hello'
print(soup.title.parent.name)  # 'head'
print(soup.a)                  # first <a> tag
print(soup.find_all('a'))      # every <a> tag
print(soup.find(id="link2"))   # the tag whose id is "link2"
print(soup.get_text())         # all visible text

Drawback: slow.

CSS select for soup

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select("title"))       # by tag name
print(soup.select("body a"))      # descendant combinator
print(soup.select("p > #link1"))  # direct child with a given id

lxml

lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree.
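A minimal sketch of lxml's XPath interface on an inline HTML snippet (the markup is made up):

```python
from lxml import etree

html = "<html><body><a href='/a'>first</a><a href='/b'>second</a></body></html>"
tree = etree.HTML(html)            # lenient HTML parser
links = tree.xpath("//a/@href")    # attribute values
texts = tree.xpath("//a/text()")   # text nodes
print(links)  # ['/a', '/b']
print(texts)  # ['first', 'second']
```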

Selenium

link

driver = webdriver.Chrome()

- find_element_by_id

- find_element_by_name

- find_element_by_xpath

- find_element_by_link_text

- find_element_by_partial_link_text

- find_element_by_tag_name

- find_element_by_class_name

- find_element_by_css_selector

The find_element_by_* helpers above are the legacy Selenium 3 API; Selenium 4 removed them in favor of find_element(By.ID, ...). When several elements match, use find_elements (plural) instead of find_element:

from selenium.webdriver.common.by import By

driver.find_elements(By.ID, 'xxx')

# values of the By constants:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

driver.current_url    # URL of the current page
driver.get_cookies()  # cookies for the current session
driver.page_source    # HTML of the rendered page
element.text          # visible text of a located WebElement

JSON

Python object to JSON: json.dumps()

JSON to Python object: json.loads()
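A round-trip sketch (the record is made up):

```python
import json

record = {"name": "crawler", "pages": 3, "ok": True}

s = json.dumps(record)   # Python object -> JSON string
print(s)                 # {"name": "crawler", "pages": 3, "ok": true}

back = json.loads(s)     # JSON string -> Python object
print(back["pages"])     # 3
```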

Multithreading and multiprocessing

Parallelism (并行): multiple tasks execute at the same instant, on multiple CPU cores. Concurrency (并发): multiple tasks make progress within the same time period, possibly interleaved on a single core. Relevant modules: threading.Thread and multiprocessing.dummy.

Mutex locks

GIL: the Global Interpreter Lock controls which thread is allowed to execute Python bytecode at any given moment.
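A minimal sketch of a mutex protecting a shared counter; without the lock, the two threads' read-modify-write steps could interleave and lose increments (iteration counts are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:  # only one thread inside this block at a time
            counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000
```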

Coroutines

So-called "micro-threads" (微线程); the gevent library patches blocking I/O via monkey.patch_all().
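gevent is a third-party dependency; the standard library's asyncio offers the same cooperative style. A tiny sketch with placeholder delays standing in for I/O:

```python
import asyncio

async def fetch(name, delay):
    # await yields control so other coroutines can run during the "I/O" wait
    await asyncio.sleep(delay)
    return name

async def main():
    # both "requests" run concurrently; total time ~= the longest delay
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.02))

results = asyncio.run(main())
print(results)  # ['a', 'b']
```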

Multithreading

Modules: _thread, threading, queue. Usually use threading.

In Python, concurrent.futures.ThreadPoolExecutor() provides a ready-made thread pool.
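A minimal ThreadPoolExecutor sketch; fake_fetch is a placeholder for a real download function:

```python
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

def fake_fetch(url):
    # stand-in for a real download; returns the URL's last path segment
    return url.rsplit("/", 1)[-1]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, urls))  # results keep input order
print(results)  # ['1', '2', '3']
```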

Queue

A bit complicated.
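A minimal producer/consumer sketch with the standard queue module; the None sentinel is one common way to signal shutdown:

```python
import queue
import threading

q = queue.Queue()
results = []

def producer():
    for i in range(5):
        q.put(i)
    q.put(None)  # sentinel: tells the consumer to stop

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```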

Multiprocessing

from multiprocessing import Process
from multiprocessing import Pool
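A minimal Pool sketch; the __main__ guard matters on platforms that spawn (rather than fork) worker processes:

```python
from multiprocessing import Pool

def square(x):
    # worker function; must be importable at module top level
    return x * x

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```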

Proxy IP pool

Link

Simulated login

Three login methods

  • Cookies
  • Submit the login form (urllib)
  • Selenium
    • Grab the input boxes' selectors, log in, then call webdriver.get_cookies() to capture the session cookies

CAPTCHAs

Link. Tool: Python-tesseract — pip install pytesseract, and install the tesseract-ocr engine separately.

pytesseract.image_to_string() extracts text from a PIL image.

from PIL import Image

# grayscale
img.convert('L')

# binarization: push every pixel to pure black or pure white
def convert_img(img, threshold):
    img = img.convert("L")  # grayscale
    pixels = img.load()
    for x in range(img.width):
        for y in range(img.height):
            if pixels[x, y] > threshold:
                pixels[x, y] = 255  # white
            else:
                pixels[x, y] = 0    # black
    return img
# denoising: a black pixel with almost no black neighbours is treated as noise
def denoise(img):
    data = img.getdata()
    w, h = img.size
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if data[w * y + x] == 0:  # black pixel
                # count black pixels among the four direct neighbours
                count = 0
                for nx, ny in ((x, y - 1), (x - 1, y), (x, y + 1), (x + 1, y)):
                    if data[w * ny + nx] == 0:
                        count += 1
                if count <= 1:  # nearly isolated -> repaint white
                    img.putpixel((x, y), 255)
    return img
Slider CAPTCHA recognition

Link

Template script for simulated login

Link

Appium

Crawl data from mobile apps.

Data storage

CSV

Libraries:

  • csv
    • import csv
  • pandas
    • pip install pandas
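A minimal sketch with the standard csv module, writing to an in-memory buffer (the rows are made up; swap in a real file handle for actual storage):

```python
import csv
import io

rows = [
    {"brand": "BYD", "score": 4.5},
    {"brand": "NIO", "score": 4.2},
]

buf = io.StringIO()  # use open("cars.csv", "w", newline="") for a real file
writer = csv.DictWriter(buf, fieldnames=["brand", "score"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# reading it back; note that every value comes back as a string
back = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(back[0]["brand"])  # BYD
```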

MySQL

Link

MongoDB

Libraries:

  • Pymongo
    • from pymongo import MongoClient

Data visualization

Link. Useful libraries:

  • matplotlib
    • python -m pip install -U pip setuptools
    • python -m pip install matplotlib
  • seaborn
  • pyecharts
    • open-source visualization library from Baidu
    • very flashy charts
    • Link

Scrapy

scrapy startproject projectname
scrapy crawl spidername  # the spider's name attribute, not the project name

Chrome extensions

  • ChroPath
    • grab an element's XPath / CSS selector with one click
  • Web Scraper
    • scrapes data directly
    • Link

Font and CSS obfuscation

JS encryption

  • Link

Crawling tips

  1. The mobile version of a site often has fewer anti-crawling measures
  2. Be restrained: sleep between requests, and crawl in the middle of the night
  3. Use a User-Agent that appears in the site's robots.txt
  4. Find the sitemap to collect URLs

Libraries for running JS from Python

  • js2py
  • PyV8
  • PyExecJS

About

A Python crawler for info from car forums
