| Author: | Timo Schmidt <timo-schmidt@gmx.net> |
|---|---|
| Description: | Crawler based on MapReduce for Apache Hadoop |
Cloudcrawler is a Java-based crawler built on Hadoop MapReduce. The goal of the project was to get familiar with Apache Hadoop.
Alternatives are:
- Apache Nutch
- Heritrix
You can compile a single jar file with the following command:
mvn compile assembly:single
The jar file is then located at:
target/org.cloudcrawler-jar-with-dependencies.jar
The crawler can be used as follows:
1. Create a source file with the crawl start URLs, one per line.
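A minimal example of such a start file (named crawl.txt to match the command below; the URLs are only placeholders):

http://www.example.com/
http://www.example.org/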
Copy the file to HDFS:
hadoop fs -copyFromLocal crawl.txt /cloudcrawler/crawl/start/
Now the crawling can be started:
hadoop jar org.cloudcrawler-jar-with-dependencies.jar crawl /cloudcrawler/crawl/start/ /cloudcrawler/crawl/out1/
The crawl job can be repeated several times; with every pass the crawler discovers more and more pages:
hadoop jar org.cloudcrawler-jar-with-dependencies.jar crawl /cloudcrawler/crawl/out1/ /cloudcrawler/crawl/out2/
hadoop jar org.cloudcrawler-jar-with-dependencies.jar crawl /cloudcrawler/crawl/out2/ /cloudcrawler/crawl/out3/
When the crawling is done, there are several other jobs that can be run. One of them is the linktrust job. It should be executed 3-4 times, where the input of the first run is the last output of the crawling process (see the sketch after the command below for the follow-up runs):
hadoop jar org.cloudcrawler-jar-with-dependencies.jar linktrust /cloudcrawler/crawl/out3/ /cloudcrawler/linktrust/out1/
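Assuming the repeated linktrust runs chain their outputs the same way as the crawl runs (the output paths below are only illustrative), the follow-up executions would look like this:

hadoop jar org.cloudcrawler-jar-with-dependencies.jar linktrust /cloudcrawler/linktrust/out1/ /cloudcrawler/linktrust/out2/
hadoop jar org.cloudcrawler-jar-with-dependencies.jar linktrust /cloudcrawler/linktrust/out2/ /cloudcrawler/linktrust/out3/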
The last step is the indexing process. It can be used to write the documents to Elasticsearch or Solr:
hadoop jar org.cloudcrawler-jar-with-dependencies.jar index /cloudcrawler/linktrust/out1/ /cloudcrawler/index/out1/