JBH168 edited this page Oct 24, 2016 · 52 revisions

Newscrawler

Hey there! Who are you?

[I am a developer (I want to edit the source)](I am a developer) [I am a user (I want to get the tool running)](I am a user)

About the project

Newscrawler is software developed by the CColon team as part of the "Softwareprojekt" course at the University of Konstanz in the summer term of 2016.

The team consisted of Jonathan Hassler (@JBH168), Franziska Schlor (@franziscl), Matt Sharinghousen (@msharing), Claudio Spener (@claudeeee) and Moritz Bock (@movabo).

Its goal is to download the HTML source of news articles from multiple sites, given as a set of URLs. In this context, a news article is a collection of multiple articles (as found, for example, on most index pages).

It relies heavily on Scrapy 1.1.

Python Version

This program was originally written in Python 2.7 and is tested there. We chose Python 2.7 because Scrapy was only stable with that version; Scrapy's Python 3 support is currently in beta. The program can run with Python 2.7 or Python 3.5, but it is only tested with Python 2.7.

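The main problem with Python 3 is its changed string handling: byte strings (bytes) and text strings (str) are distinct types that cannot be mixed implicitly, so code that worked in Python 2.7 can raise a TypeError.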
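As a rough illustration of the string issue, the sketch below shows how concatenating a byte string with a text string fails on Python 3, and one way to work around it. The helper `to_text` and the sample values are hypothetical and not part of Newscrawler; UTF-8 is an assumed encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is a byte string.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


body = b"<html>Konstanz</html>"  # sample byte string, e.g. raw downloaded content
prefix = "Source: "              # ordinary text string

try:
    # Works on Python 2.7 (bytes is str there), raises TypeError on Python 3.
    combined = prefix + body
except TypeError:
    # Decode explicitly so both operands are text strings.
    combined = prefix + to_text(body)
```

Decoding once at the point where data enters the program, and working with text strings afterwards, tends to keep this kind of error from spreading through the code.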

TypeError Troubleshooting

Setup

Crawlers / Spiders

System design

Further Documentation
