Basic Setup

Setup

Preamble

First of all, this walk-through has only been tested on Linux systems. It may not work exactly the same way on every Linux distribution.

Some packages installed here may have different (but similar) names on other distributions. All in all, the setup should be very similar on all Linux-based systems.

The setup process should be very similar on Windows and Mac systems as well, but there the dependencies must be installed manually. A link to the respective site will be provided.

All commands shown here are meant to be executed on an Ubuntu system, so for example all apt-get commands must be replaced with the corresponding command of your system.

For some commands sudo may be required.

If both versions of Python (version 2 and version 3) are installed, pay attention to always invoking the correct command for the version you want to run the tool with. For example, on Ubuntu the commands are python and pip, but on Arch they are python2 and pip2. In this walk-through, python and pip will be used throughout.

1. Installing Python 2.7

To run the newscrawler, Python 2.7.* or higher is required.

On some distributions (e.g. Ubuntu) the Python 2.7 package is simply called python. On other distributions (e.g. Arch) the Python 2.7 package is called python2, and installing python will give you Python 3. So keep an eye on which version the package name refers to.

$ apt-get install python

Python also comes with its own package manager called pip, which is required as well. Again, the pip package for Python 2 may be called python-pip on some distributions and python2-pip on others.

$ apt-get install python-pip
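
To verify that both commands point at a Python 2 installation, you can check their versions. The version numbers shown below are only an example; yours will differ:

$ python --version
Python 2.7.12
$ pip --version
pip 8.1.1 from /usr/lib/python2.7/dist-packages (python 2.7)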

2. Installing Python MySQL-Connector

Because the newscrawler saves some metadata about the file structure and the files themselves in a database, the MySQL Connector is required.

Most distributions offer the MySQL Connector as a package in their package manager. Otherwise it can be installed using pip with pip install mysql-connector-python (not recommended if your package manager offers a package):

$ apt-get install python-mysql.connector
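
As a quick sanity check that the connector is importable from Python 2, you can try importing it; the command prints nothing on success and an ImportError otherwise:

$ python -c "import mysql.connector"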

3. Installing git

$ apt-get install git

4. Installing MariaDB

MariaDB is an open-source, community-developed fork of the MySQL database management system. Here, only an installation of MariaDB will be explained.

For installing MariaDB on Mac and Windows systems, XAMPP is recommended. For installing MariaDB on Linux, execute:

$ apt-get install mariadb-server

Afterwards an installation dialogue should appear in which the root password is set.
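
If no dialogue appears, or you want to harden the installation later, most distributions ship MariaDB with an interactive script that (re)sets the root password and removes test accounts:

$ mysql_secure_installation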

5. Installing dependencies

The newscrawler is based on Scrapy, a crawling framework for Python. To install Scrapy, execute:

$ pip install scrapy

Because the newscrawler is able to run under both Python 2 and Python 3, the package future must be installed as well:

$ pip install future

So "Human Readable JSON" can be used instead of normal JSON (JSON is valid HJSON), hjson needs to be installed:

$ pip install hjson
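
To check that all three dependencies landed in the right Python installation, you can import them in one line (the printed Scrapy version is just whatever pip installed on your system):

$ python -c "import scrapy, future, hjson; print(scrapy.__version__)"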

6. Installing the newscrawler

Now switch to the directory where you want to install the newscrawler. Cloning will create a new directory Newscrawler (named after the repository) in this directory.

$ git clone https://github.com/JBH168/Newscrawler

7. Starting and configuring the database

The database server should already be running. To verify, execute:

$ systemctl status mysql

(On some distributions the service may be called mysqld or mysql-server instead, and your system needs to support and have systemd installed.)

The output should contain Active: active (running) somewhere. If it says Loaded: not-found (Reason: No such file or directory) instead, you need to use mysqld or mysql-server as the service name. If it says Active: inactive (dead), just start the server by invoking:

$ systemctl start mysql
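
If you also want the database server to start automatically at boot, you can enable the unit (the service name may differ, as noted above):

$ systemctl enable mysql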

Now, for security reasons, another MySQL user should be created. To do so, open the MySQL command prompt (the -p flag without an argument makes mysql ask for the root password interactively, which avoids leaving it in your shell history):

$ mysql -u root -p

A MySQL command prompt should now open. Create a new user called ccolon (this can be changed, but keep in mind that the user is needed in later steps as well) with password <ccolon-password> (of course you should not use this placeholder as the password, choose one yourself ;-) ):

MariaDB [(none)]> CREATE USER 'ccolon'@'localhost' IDENTIFIED BY '<ccolon-password>';

(Attention: every statement must be terminated with a ;.)

Give the user the needed privileges (unlimited queries per hour, etc.):

MariaDB [(none)]> GRANT USAGE ON *.* TO 'ccolon'@'localhost' IDENTIFIED BY '<ccolon-password>' REQUIRE NONE WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;

Add a database newscrawler (of course you can use a different name as well) for the user:

MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS `newscrawler`;

Now allow this user to do everything on this database:

MariaDB [(none)]> GRANT ALL PRIVILEGES ON `newscrawler`.* TO 'ccolon'@'localhost';
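
To double-check that both grants are in place, you can list them from the same prompt; the output should show the USAGE grant and the ALL PRIVILEGES grant on newscrawler.*:

MariaDB [(none)]> SHOW GRANTS FOR 'ccolon'@'localhost';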

The user has now been created and all privileges have been granted, so the MySQL command prompt can be exited:

MariaDB [(none)]> quit;

Now switch back to the directory where you installed the newscrawler (via git) and enter the Newscrawler directory:

$ cd ./Newscrawler

Now you need to import the tables into the newly created database (here again, newscrawler is the database name). Note that the < before ./init-db.sql is shell input redirection and is in fact meant to be typed, unlike the angle brackets around the password placeholder:

$ mysql -u ccolon --password=<ccolon-password> -D newscrawler < ./init-db.sql
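
To confirm that the schema was imported, you can list the tables (the exact table names depend on init-db.sql):

$ mysql -u ccolon --password=<ccolon-password> -D newscrawler -e "SHOW TABLES;"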

Now that the database schema has been imported, the database connection information must be set in the config.

To do so, open ./newscrawler.cfg with any text editor and set the following settings in the [Database] section (leave the other settings as they are):

host = localhost
db = newscrawler
username = ccolon
password = <ccolon-password>

Save the file and close the editor, or change other settings as documented here.

8. Setting up the sites

To set up the sites, edit the file ./input_data.json.

This file must be valid JSON.

The basic setup is the following:

{
  "base_urls": [

  ]
}

Now, for each site you want to crawl, add an object like this to "base_urls":

    {
      "url": "http://website.com"
    }

So if you add multiple sites, the whole file looks like this:

{
  "base_urls": [
    {
      "url": "http://website.com"
    },
    {
      "url": "http://websitetwo.com"
    }
  ]
}
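
Because a stray comma or a missing quote is easy to overlook, you can validate the file before starting the crawler. This quick check uses Python's built-in json module; it prints nothing on success and a parse error otherwise:

$ python -c "import json; json.load(open('./input_data.json'))"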

For more configuration options, take a look here.

9. Starting the crawler

Now you are ready to go.

You can start the crawler by invoking:

$ python ./start_processes.py

Depending on the Debug option set in newscrawler.cfg, you will see anywhere from no output messages to many.

If daemons are set in the input JSON, the crawler won't terminate on its own. Otherwise it will terminate after all crawlers have finished.

To terminate the crawler, press CTRL-C once or send a single SIGINT. The crawler will need some time to shut down gracefully.

Try to avoid ungraceful shutdowns (e.g. pressing CTRL-C twice).
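
If the crawler runs in another terminal or in the background, the same graceful shutdown can be triggered by sending SIGINT manually (<pid> stands for the process id of start_processes.py, which can be found e.g. with pgrep -f start_processes.py):

$ kill -INT <pid>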
