Home

Newscrawler

Hey there! Who are you?

[I am a developer (I want to edit the source)](I am a developer)	[I am a user (I want to get the tool running)](I am a user)

About the project

Newscrawler is software developed by the CColon team as part of the "Softwareprojekt" lecture at the University of Konstanz in the summer term of 2016.

The team consisted of Jonathan Hassler (@JBH168), Franziska Schlor (@franziscl), Matt Sharinghousen (@msharing), Claudio Spener (@claudeeee) and Moritz Bock (@movabo).

Its goal is to download the HTML source of news articles from multiple sites, given as a list of URLs. In this context, a news article is a collection of multiple articles (as, for example, on most index pages).

It relies heavily on Scrapy 1.1.

Python Version

This program was originally written in Python 2.7 and is tested there. We chose Python 2.7 because Scrapy was only stable with that version at the time. Scrapy's Python 3 support is currently in beta. The program can run with Python 2.7 or Python 3.5, but it is only tested with Python 2.7.

The main problem with Python 3 is its new string handling: a string can be either a byte string (`bytes`) or a normal text string (`str`), and the two cannot be mixed without an explicit conversion.
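The following is a minimal sketch of that difference, assuming Python 3. The `raw_body` value and the `to_text` helper are hypothetical names used only for illustration; they are not part of Newscrawler or Scrapy.

```python
# Hypothetical page source as it might arrive from the network: a byte string.
raw_body = b"<title>News</title>"


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# On Python 3, concatenating a text string and a byte string raises TypeError.
try:
    "Source: " + raw_body
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Decoding explicitly before combining the values avoids the error.
text = "Source: " + to_text(raw_body)
print(mixing_failed)
print(text)
```

Decoding at the point where data enters the program, and working with text strings from then on, is one common way to keep this kind of `TypeError` out of the rest of the code.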

TypeError Troubleshooting

[I am a developer](I am a developer)
[I am a user](I am a user)

Setup

Crawlers / Spiders

System design

[Database System](Database System)
Logging
Output
Troubleshooting
Use-cases

Further Documentation

Anti-crawling Issues
Bottlenecks
[Demo Crawls](Demo Crawls)
IDE
[RSS-Feed Decision](RSS-Feed Decision)
[Thinking Process](Thinking Process)
