
Web Scraper

A Node.js and Puppeteer web scraper with auto-scrolling!

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Acknowledgments

About The Project


Purpose: This web scraper extracts data from websites automatically, so users can gather information quickly and efficiently. Web scraping is a form of copying in which specific data is collected from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Features:

  • Login and Authentication: The scraper handles login and authentication flows, giving it access to protected areas of a website.
  • Infinite Scroll Support: It scrolls through infinite-scroll pages, capturing dynamic content as it loads.
  • Data Extraction: It extracts specific data elements, such as names, emails, phone numbers, and addresses, from targeted web pages.
  • Page Pooling: To improve performance, it maintains a pool of browser pages and reuses them.
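
As a rough illustration of two of the features above, the sketch below shows one way infinite-scroll handling and page pooling could look with Puppeteer. It is not the code from this repository: `autoScroll`, `PagePool`, and the `maxRounds` and `pauseMs` options are hypothetical names chosen for the example. Only `page.evaluate`, `browser.newPage`, and `page.close` are Puppeteer calls.

```javascript
// Sketch only: helper names and defaults are illustrative, not taken from the project.

// Scroll to the bottom repeatedly until the page height stops growing
// or maxRounds is reached. `page` is a Puppeteer Page.
async function autoScroll(page, { maxRounds = 50, pauseMs = 500 } = {}) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break; // nothing new loaded since the last scroll
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, pauseMs)); // give content time to load
  }
  return previousHeight;
}

// Fixed-size pool that hands out pages created with browser.newPage()
// and reuses them once they are released.
class PagePool {
  constructor(browser, size) {
    this.browser = browser;
    this.size = size;
    this.created = 0; // pages opened so far
    this.idle = [];   // pages ready for reuse
    this.waiters = []; // callers waiting for a free page
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.size) {
      this.created++;
      try {
        return await this.browser.newPage();
      } catch (err) {
        this.created--; // opening failed, free the slot again
        throw err;
      }
    }
    // Pool exhausted: wait until another caller releases a page.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(page) {
    const waiter = this.waiters.shift();
    if (waiter) waiter(page);
    else this.idle.push(page);
  }

  // Closes idle pages only; pages still in use must be released first.
  async close() {
    const pages = this.idle;
    this.idle = [];
    this.created -= pages.length;
    await Promise.all(pages.map((page) => page.close()));
  }
}
```

A caller would typically `acquire()` a page, navigate and call `autoScroll(page)`, then `release(page)` in a `finally` block so the page returns to the pool even if scraping fails.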

(back to top)

Built With

  • Node.js
  • Puppeteer
  • JavaScript

(back to top)

Getting Started

This web scraping project allows you to extract data from websites (10ksb) efficiently. Follow the steps below to get started.

Prerequisites

Node.js and npm: Ensure you have Node.js (v20 or above) and npm (Node Package Manager) installed on your machine. Check your Node.js version and update npm to the latest release by running the following commands in your terminal:

  • node
    node -v
  • npm
    npm install npm@latest -g
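
If you would rather enforce the Node.js requirement from a script, a minimal sketch could look like the following. `meetsMinimumNode` is a hypothetical helper written for this example, not part of the project. It relies only on `process.versions.node`, which Node.js provides.

```javascript
// Returns true when a Node.js version string such as "20.11.1" or "v20.11.1"
// has a major version of at least minMajor.
function meetsMinimumNode(version, minMajor = 20) {
  const major = Number.parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Example use at the top of an entry script:
//   if (!meetsMinimumNode(process.versions.node)) {
//     console.error(`Node.js v20 or above is required, found v${process.versions.node}`);
//     process.exit(1);
//   }
```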

Chrome Browser: The project uses Puppeteer (v14.2.0), which requires Google Chrome (v103) or Chromium to be installed on your system.

Installation

  1. Clone the repo

    git clone https://github.com/s33chin/web-scraper.git
  2. Install NPM packages

    npm install puppeteer@14.2.0 @puppeteer/browsers cli-progress puppeteer-core   
  3. Adjust the Chrome executable path and other settings in the scrapeData function to match your system.

  4. Run the script

    node indexWithPooling.js
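
For step 3, one way to avoid hard-coding the Chrome path is to read it from the environment. The sketch below is an assumption about how that could be done, not the project's actual `scrapeData` code: `buildLaunchOptions` and the `CHROME_PATH` and `HEADLESS` variable names are hypothetical, while `executablePath`, `headless`, and `defaultViewport` are standard `puppeteer.launch()` options.

```javascript
// Build an options object for puppeteer.launch().
// CHROME_PATH and HEADLESS are hypothetical environment variable names
// chosen for this sketch.
function buildLaunchOptions(env = process.env) {
  const options = {
    headless: env.HEADLESS !== 'false', // headless unless explicitly disabled
    defaultViewport: { width: 1280, height: 800 },
  };
  if (env.CHROME_PATH) {
    // Point Puppeteer at a specific Chrome or Chromium binary.
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

// Usage (requires puppeteer to be installed):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(buildLaunchOptions());
```

Note that `puppeteer-core` does not download a browser of its own, so when launching through it an explicit `executablePath` is needed.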

Note: Ensure that you have proper permissions and authorization to scrape data from the target website. Respect the website's terms of service and policies while scraping.

(back to top)

Usage

[Image: demo]

(back to top)

Roadmap

  • Auto re-login and authentication after session timeout
  • Proxy and User-Agent Rotation
  • Data Storage Options: CSV, JSON, Database
  • User-Friendly CLI Interface
  • Support for Multiple Browsers

(back to top)

Acknowledgments

(back to top)