flyslowly/crawlers-master

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 

Repository files navigation

crawlers-master

Car forums

International forums

Local forums

  • 汽车之家 (Autohome): review ("口碑") pages, forum pages
  • 懂车帝 (Dongchedi): content all comes from 今日头条 (Toutiao), but it has its own comments
  • 虎扑 (Hupu) car community forum
  • 爱卡汽车 (XCAR)
  • 易车 (Yiche / BitAuto)

Learning links

Prerequisites

  • scrapy>=2.4.0
    • parsel
    • w3lib
    • twisted
    • cryptography
    • pyOpenSSL
    • lxml
  • itemadapter>=0.2.0
  • pymongo>=3.11.0
  • requests>=2.24.0
  • matplotlib>=3.3.2
  • fonttools>=4.16.1
  • lxml>=4.6.1

urllib

urllib.request.urlopen(url, data=None, [timeout, ]*)

urllib.request.Request(url, data=None, headers={}, method=None)
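A minimal sketch combining the two calls above; the URL and User-Agent string are placeholders, and the actual fetch is left commented out since it needs network access:

```python
from urllib.request import Request, urlopen

# Build a Request with custom headers; constructing it needs no network I/O.
req = Request(
    "https://httpbin.org/get",
    headers={"User-Agent": "my-crawler/0.1"},
    method="GET",
)
print(req.full_url)                  # https://httpbin.org/get
print(req.get_header("User-agent"))  # my-crawler/0.1 (urllib capitalizes keys)

# Actually fetching requires network access:
# with urlopen(req, timeout=10) as resp:
#     body = resp.read().decode("utf-8")
```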

requests

Requests

Different request methods

>>> r = requests.get('https://api.github.com/events')

>>> r = requests.post('https://httpbin.org/post', data = {'key':'value'})

>>> r = requests.put('https://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('https://httpbin.org/delete')

>>> r = requests.head('https://httpbin.org/get')

>>> r = requests.options('https://httpbin.org/get')

With query parameters

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.get('https://httpbin.org/get', params=payload)

Pretending to be a browser

>>> url = 'https://api.github.com/some/endpoint'

>>> headers = {'user-agent': 'my-app/0.0.1'}

>>> r = requests.get(url, headers=headers)
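One way to inspect the URL that requests builds from params, without any network traffic, is to prepare the request first (the payload values are placeholders):

```python
import requests

payload = {"key1": "value1", "key2": "value2"}
# prepare() builds the final request object, including the encoded query
# string, without sending anything over the network.
prep = requests.Request("GET", "https://httpbin.org/get", params=payload).prepare()
print(prep.url)  # https://httpbin.org/get?key1=value1&key2=value2
```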

Regular expressions

import re

regular expression

re.match
re.search
re.findall
re.sub
re.compile
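A quick tour of the five functions above on a throwaway string:

```python
import re

text = "price: 120, tax: 8"

print(re.match(r"\d+", text))            # None: string doesn't start with digits
print(re.search(r"\d+", text).group())   # first number anywhere: '120'
print(re.findall(r"\d+", text))          # all numbers: ['120', '8']
print(re.sub(r"\d+", "N", text))         # 'price: N, tax: N'

num = re.compile(r"\d+")                 # precompile when reusing a pattern
print(num.findall(text))                 # ['120', '8']
```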

BeautifulSoup

from bs4 import BeautifulSoup

html_doc = """<html><head><title>Demo</title></head>
<body><p>Hello</p>
<a id="link1" href="/a">A</a> <a id="link2" href="/b">B</a></body></html>"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)       # 'Demo'
print(soup.p.string)           # 'Hello'
print(soup.title.parent.name)  # 'head'
print(soup.a)                  # first <a> tag
print(soup.find_all('a'))      # every <a> tag
print(soup.find(id="link2"))   # the tag whose id is "link2"
print(soup.get_text())         # all visible text

Drawback: slow.

CSS select for soup

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select("title"))       # by tag name
print(soup.select("body a"))      # descendant combinator
print(soup.select("p > #link1"))  # direct child with a given id

lxml

lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree.
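A minimal sketch of lxml's XPath interface on an inline HTML snippet (the markup is made up):

```python
from lxml import etree

html = "<html><body><a href='/a'>first</a><a href='/b'>second</a></body></html>"
tree = etree.HTML(html)            # lenient HTML parser
links = tree.xpath("//a/@href")    # attribute values
texts = tree.xpath("//a/text()")   # text nodes
print(links)  # ['/a', '/b']
print(texts)  # ['first', 'second']
```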

Selenium

link

driver = webdriver.Chrome()

- find_element_by_id

- find_element_by_name

- find_element_by_xpath

- find_element_by_link_text

- find_element_by_partial_link_text

- find_element_by_tag_name

- find_element_by_class_name

- find_element_by_css_selector

The find_element_by_* helpers above are the legacy Selenium 3 API; Selenium 4 removed them in favor of find_element(By.ID, ...). When several elements match, use find_elements (plural) instead of find_element:

from selenium.webdriver.common.by import By

driver.find_elements(By.ID, 'xxx')

# values of the By constants:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

driver.current_url    # URL of the current page
driver.get_cookies()  # cookies for the current session
driver.page_source    # HTML of the rendered page
element.text          # visible text of a located WebElement

JSON

Python object to JSON: json.dumps()

JSON to Python object: json.loads()
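A round-trip sketch (the record is made up):

```python
import json

record = {"name": "crawler", "pages": 3, "ok": True}

s = json.dumps(record)   # Python object -> JSON string
print(s)                 # {"name": "crawler", "pages": 3, "ok": true}

back = json.loads(s)     # JSON string -> Python object
print(back["pages"])     # 3
```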

Multithreading and multiprocessing

Parallelism (并行): multiple tasks execute at the same instant, on multiple CPU cores. Concurrency (并发): multiple tasks make progress within the same time period, possibly interleaved on a single core. Relevant modules: threading.Thread and multiprocessing.dummy.

Mutex locks

GIL: the Global Interpreter Lock controls which thread is allowed to execute Python bytecode at any given moment.
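A minimal sketch of a mutex protecting a shared counter; without the lock, the two threads' read-modify-write steps could interleave and lose increments (iteration counts are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:  # only one thread inside this block at a time
            counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000
```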

Coroutines

So-called "micro-threads" (微线程); the gevent library patches blocking I/O via monkey.patch_all().
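gevent is a third-party dependency; the standard library's asyncio offers the same cooperative style. A tiny sketch with placeholder delays standing in for I/O:

```python
import asyncio

async def fetch(name, delay):
    # await yields control so other coroutines can run during the "I/O" wait
    await asyncio.sleep(delay)
    return name

async def main():
    # both "requests" run concurrently; total time ~= the longest delay
    return await asyncio.gather(fetch("a", 0.01), fetch("b", 0.02))

results = asyncio.run(main())
print(results)  # ['a', 'b']
```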

Multithreading

Modules: _thread, threading, queue. Usually use threading.

In Python, concurrent.futures.ThreadPoolExecutor() provides a ready-made thread pool.
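A minimal ThreadPoolExecutor sketch; fake_fetch is a placeholder for a real download function:

```python
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

def fake_fetch(url):
    # stand-in for a real download; returns the URL's last path segment
    return url.rsplit("/", 1)[-1]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, urls))  # results keep input order
print(results)  # ['1', '2', '3']
```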

Queue

A bit complicated.
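A minimal producer/consumer sketch with the standard queue module; the None sentinel is one common way to signal shutdown:

```python
import queue
import threading

q = queue.Queue()
results = []

def producer():
    for i in range(5):
        q.put(i)
    q.put(None)  # sentinel: tells the consumer to stop

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```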

Multiprocessing

from multiprocessing import Process
from multiprocessing import Pool
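A minimal Pool sketch; the __main__ guard matters on platforms that spawn (rather than fork) worker processes:

```python
from multiprocessing import Pool

def square(x):
    # worker function; must be importable at module top level
    return x * x

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```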

Proxy IP pool

Link

Simulated login

Three login methods

  • Cookies
  • Submit the login form (urllib)
  • Selenium
    • Grab the input boxes' selectors, log in, then call webdriver.get_cookies() to capture the session cookies

CAPTCHAs

Link. Tool: Python-tesseract — pip install pytesseract, and install the tesseract-ocr engine separately.

pytesseract.image_to_string() extracts text from a PIL image.

from PIL import Image

# grayscale
img.convert('L')

# binarization: push every pixel to pure black or pure white
def convert_img(img, threshold):
    img = img.convert("L")  # grayscale
    pixels = img.load()
    for x in range(img.width):
        for y in range(img.height):
            if pixels[x, y] > threshold:
                pixels[x, y] = 255  # white
            else:
                pixels[x, y] = 0    # black
    return img
# denoising: a black pixel with almost no black neighbours is treated as noise
def denoise(img):
    data = img.getdata()
    w, h = img.size
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if data[w * y + x] == 0:  # black pixel
                # count black pixels among the four direct neighbours
                count = 0
                for nx, ny in ((x, y - 1), (x - 1, y), (x, y + 1), (x + 1, y)):
                    if data[w * ny + nx] == 0:
                        count += 1
                if count <= 1:  # nearly isolated -> repaint white
                    img.putpixel((x, y), 255)
    return img
Slider CAPTCHA recognition

Link

Template script for simulated login

Link

Appium

Crawl data from mobile apps.

Data storage

CSV

Libraries:

  • csv
    • import csv
  • pandas
    • pip install pandas
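A minimal sketch with the standard csv module, writing to an in-memory buffer (the rows are made up; swap in a real file handle for actual storage):

```python
import csv
import io

rows = [
    {"brand": "BYD", "score": 4.5},
    {"brand": "NIO", "score": 4.2},
]

buf = io.StringIO()  # use open("cars.csv", "w", newline="") for a real file
writer = csv.DictWriter(buf, fieldnames=["brand", "score"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# reading it back; note that every value comes back as a string
back = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(back[0]["brand"])  # BYD
```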

MySQL

Link

MongoDB

Libraries:

  • Pymongo
    • from pymongo import MongoClient

Data visualization

Link. Useful libraries:

  • matplotlib
    • python -m pip install -U pip setuptools
    • python -m pip install matplotlib
  • seaborn
  • pyecharts
    • open-source visualization library from Baidu
    • very flashy charts
    • Link

Scrapy

scrapy startproject projectname
scrapy crawl spidername  # the spider's name attribute, not the project name

Chrome extensions

  • ChroPath
    • grab an element's XPath / CSS selector with one click
  • Web Scraper
    • scrapes data directly
    • Link

Font and CSS obfuscation

JS encryption

  • Link

Crawling tips

  1. The mobile version of a site often has fewer anti-crawling measures
  2. Be restrained: sleep between requests, and crawl in the middle of the night
  3. Use a User-Agent that appears in the site's robots.txt
  4. Find the sitemap to collect URLs

Libraries for running JS from Python

  • js2py
  • PyV8
  • PyExecJS

About

A Python crawler for info from car forums
