Basic Setup
First of all, this walk-through has only been tested on Linux systems. It may not work in exactly the same way on every Linux distribution.
Some packages installed here may have different (but similar) names on other distributions. All in all, the setup should be very similar on all Linux-based systems.
The setup process should also be very similar on Windows and Mac systems, but there the dependencies must be installed manually. A link to the respective site will be provided.
All commands shown here are meant to be executed on an Ubuntu system.
So for example all `apt-get` commands must be replaced with the correct command of your system.
For some commands `sudo` may be required.
If both versions of Python (version 2 and version 3) are installed, pay attention to always invoke the correct command for Python 2 or Python 3, depending on the version you want to run the tool with.
For example, on Ubuntu the commands are `python` and `pip`, but on Arch they are `python2` and `pip2`. Here, `python` and `pip` will always be used.
To run the newscrawler, Python 2.7.* or higher is required.
On some distributions (e.g. Ubuntu) the Python 2.7 package is simply called `python`; on other distributions (e.g. Arch) the Python 2.7 package is called `python2`, and installing `python` will install Python 3. So keep an eye on that for the version you want to install.
$ apt-get install python
Python also offers its own package manager, called pip. This package manager is also required. Again, the package for pip 2 may be called `python-pip` on some distributions and `python2-pip` on others.
$ apt-get install python-pip
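To verify that the plain `python` and `pip` commands on your system point to the intended version (an optional sanity check, not part of the original walk-through), you can run:
$ python --version
$ pip --version
Both should report the Python version you intend to run the tool with (here, 2.7.*).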
Because the newscrawler saves some meta-data about the file structure and the files themselves in a database, the MySQL connector is required.
Most distributions offer packages of the MySQL connector in their package manager; otherwise the package can be installed using pip with `pip install mysql-connector-python` (not recommended if the package manager offers a package):
$ apt-get install python-mysql.connector
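To confirm that the connector can actually be imported by the Python interpreter you will use (an optional check), the following should print nothing and exit without an error:
$ python -c "import mysql.connector"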
Git is required to clone the newscrawler repository later on:
$ apt-get install git
MariaDB is an open-source implementation of the MySQL database management system. Here, only the installation of MariaDB will be explained.
For installing MariaDB on Mac and Windows systems, XAMPP is recommended. For installing MariaDB on Linux, execute:
$ apt-get install mariadb-server
Afterwards an installation dialogue should appear in which the root password will be set.
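If no such dialogue appears (this differs between distributions), the root password can usually be set afterwards with the mysql_secure_installation script that ships with MariaDB:
$ mysql_secure_installation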
The newscrawler is based on scrapy, a crawler framework for Python. To install scrapy, execute:
$ pip install scrapy
Because the newscrawler is able to run with Python 2 and Python 3, the package `future` must be installed as well:
$ pip install future
So "Human Readable JSON" can be used instead of normal JSON (JSON is valid HJSON), hjson
needs to be installed:
$ pip install hjson
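As an optional check that all three Python dependencies were installed correctly, the following should exit silently:
$ python -c "import scrapy, future, hjson"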
Now switch to the directory where you want to install the newscrawler.
Another directory, `ccolon_newscrawler`, will be created in this directory.
$ git clone https://github.com/JBH168/Newscrawler
The database server should already be running. To verify, execute:
$ systemctl status mysql
(on some distributions the service may be called `mysqld` or `mysql-server`, and your system needs to have systemd installed)
The output should now contain `Active: active (running)` somewhere.
If it contains `Loaded: not-found (Reason: No such file or directory)`, you need to use `mysqld` or `mysql-server` instead.
If it contains `Active: inactive (dead)`, just start the server by invoking:
$ systemctl start mysql
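Optionally, if the database server should also start automatically on every boot (not required for this walk-through), it can be enabled as well; again substitute mysqld or mysql-server if necessary:
$ systemctl enable mysql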
Now, for security reasons, another MySQL user should be created. Therefore, open the MySQL command prompt:
$ mysql -u root -p
When prompted, enter the root password set earlier; a MySQL command prompt should open.
Now create a new user called `ccolon` (this can be changed, but keep in mind that you changed it, as the user is needed in other steps as well) with password `<ccolon-password>` (of course you should not use this password, choose one yourself ;-) ):
MariaDB [(none)]> CREATE USER 'ccolon'@'localhost' IDENTIFIED BY '<ccolon-password>';
(Attention: at the end of every statement a `;` is needed)
Give the user the needed privileges (unlimited queries per hour, etc):
MariaDB [(none)]> GRANT USAGE ON *.* TO 'ccolon'@'localhost' IDENTIFIED BY '<ccolon-password>' REQUIRE NONE WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;
Add a database `newscrawler` (of course you can use a different name as well) for the user:
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS `newscrawler`;
Now allow this user to do everything he wants to do on this database:
MariaDB [(none)]> GRANT ALL PRIVILEGES ON `newscrawler`.* TO 'ccolon'@'localhost';
Now the user has been created and all privileges have been granted, so the mysql-command-prompt can be exited:
MariaDB [(none)]> quit;
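To double-check that the new user and database work as intended (an optional verification step), log in as ccolon and list the visible databases; newscrawler should show up:
$ mysql -u ccolon -p
MariaDB [(none)]> SHOW DATABASES;
MariaDB [(none)]> quit;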
Now switch back to the directory where you installed the newscrawler (via git).
Afterwards, go into the `ccolon_newscrawler` directory:
$ cd ./ccolon_newscrawler
Now you need to import the tables into the newly created database. Here again, `newscrawler` is the database name, and the second `<` (the one before `./init-db.sql`) is in fact meant to be there:
$ mysql -u ccolon --password=<ccolon-password> -D newscrawler < ./init-db.sql
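As an optional check that the import worked, you can list the tables of the database (the exact table names depend on the shipped init-db.sql):
$ mysql -u ccolon --password=<ccolon-password> -D newscrawler -e "SHOW TABLES;"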
Now that the database scheme has been imported, the database connection information must be set in the config.
Therefore, open `./newscrawler.cfg` with any file editor and set the following settings in the `[Database]` section (leave the other settings as they are):
host = localhost
db = newscrawler
username = ccolon
password = <ccolon-password>
Save it and close the editor or change other settings as documented here.
To set up the sites to crawl, edit the file `./input_data.json`.
This file must conform to the JSON standard.
The basic setup is the following:
{
  "base_urls": [
  ]
}
Now, for each site you want to crawl, add an object like this to `"base_urls"`:
{
  "url": "http://website.com"
}
So if you add multiple sites, the whole file looks like this:
{
  "base_urls": [
    {
      "url": "http://website.com"
    },
    {
      "url": "http://websitetwo.com"
    }
  ]
}
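Since the file has to be valid JSON, a quick optional way to check it for syntax errors before starting the crawler is to parse it with Python; the command prints nothing if the file is valid:
$ python -c "import json; json.load(open('./input_data.json'))"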
For more configuration options, take a look here.
Now you are ready to go.
You can start the crawler by invoking:
$ python ./start_processes.py
Depending on the Debug option set in the `newscrawler.cfg`, you will see anything from no output at all to many output messages.
If daemons are set in the input JSON (`./input_data.json`), the crawler won't terminate on its own. Otherwise it will terminate after all crawlers have finished.
To terminate the crawler, press `CTRL-C` once or just send `SIGINT`.
The crawler will need some time to shut down gracefully.
Try to avoid ungraceful shutdowns (e.g. pressing `CTRL-C` twice).
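If the crawler runs in the background or in another terminal, the same graceful shutdown can be triggered by sending SIGINT to the start_processes.py process (an illustrative example; <pid> is the process id, which can be found e.g. with pgrep -f start_processes.py):
$ kill -INT <pid>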