- MapReduce Main classes:
1.1. CountURLsPerDomain.java
The main task of the MapReduce job that uses this main class is to count the number of URLs for each host domain.
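A minimal sketch of the counting logic, assuming the input is a plain text file with one URL per line (the real job's input format may differ):

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (host domain, 1) for every URL in the input.
    class DomainCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text domain = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            try {
                domain.set(new URL(line.toString().trim()).getHost());
                context.write(domain, ONE);
            } catch (MalformedURLException e) {
                // Skip lines that are not valid URLs.
            }
        }
    }

    // Reducer: sum the per-domain counts.
    class DomainCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text domain, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(domain, new LongWritable(sum));
        }
    }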
1.2. ArcLookupIndex.java
The main task is to create a lookup index that records, for every URL in the
common-crawl corpus, which ARC file contains it and at what offset.
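The map step might look like the sketch below; ArcRecordStub is a made-up stand-in for whatever record type the job's ARC input format actually delivers, not a Hadoop or common-crawl class:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical record type: stands in for the real ARC record reader output.
    interface ArcRecordStub {
        String getUrl();         // URL of the archived page
        String getArcFileName(); // name of the ARC file holding the record
        long getOffset();        // byte offset of the record in that file
    }

    // Mapper: emit (URL, "arcFile<TAB>offset") as one index entry per URL.
    class ArcIndexMapper extends Mapper<Text, ArcRecordStub, Text, Text> {
        private final Text url = new Text();
        private final Text location = new Text();

        @Override
        protected void map(Text key, ArcRecordStub record, Context context)
                throws IOException, InterruptedException {
            url.set(record.getUrl());
            location.set(record.getArcFileName() + "\t" + record.getOffset());
            context.write(url, location);
        }
    }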
1.3. SearchTextDataForVacancyMatches.java
The main task is to find candidate vacancy URLs in the common-crawl corpus. Any URL that matches at least
two vacancy keywords from a given vacancy filter is considered a candidate vacancy.
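The matching rule itself is small; a sketch, assuming the mapper receives (URL, page text) pairs and the keyword filter has already been loaded into a set (the hard-coded keywords below are only illustrative):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: emit a URL only if its text matches at least two vacancy keywords.
    class VacancyMatchMapper extends Mapper<Text, Text, Text, IntWritable> {
        // Illustrative filter; the real keywords come from the vacancy filter file.
        private final Set<String> keywords = new HashSet<>(
                Arrays.asList("vacancy", "salary", "apply now", "job description"));

        @Override
        protected void map(Text url, Text pageText, Context context)
                throws IOException, InterruptedException {
            String text = pageText.toString().toLowerCase();
            int matches = 0;
            for (String kw : keywords) {
                if (text.contains(kw)) {
                    matches++;
                }
            }
            if (matches >= 2) {
                context.write(url, new IntWritable(matches));
            }
        }
    }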
1.4. CreateSample.java
The main task is to produce a sample list of URLs; this list is used in (1.5) to extract HTML content.
The input for this MapReduce job is a file containing domain IDs and their URLs, i.e. the candidate
vacancy URLs found in (1.3). For each domain ID we pick URLs at random until we reach the required
sample size.
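The per-domain sampling can be done on the reduce side; a sketch, assuming (domain ID, URL) pairs reach the reducer and the sample size is passed through the job configuration ("sample.size" is an invented key):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reducer: keep a uniform random sample of each domain's URLs.
    class SampleReducer extends Reducer<Text, Text, Text, Text> {
        private int sampleSize;

        @Override
        protected void setup(Context context) {
            // "sample.size" is an illustrative configuration key, not the real one.
            sampleSize = context.getConfiguration().getInt("sample.size", 100);
        }

        @Override
        protected void reduce(Text domainId, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            List<String> all = new ArrayList<>();
            for (Text url : urls) {
                all.add(url.toString());
            }
            // Shuffle and take the first N URLs: a uniform random pick.
            Collections.shuffle(all);
            int n = Math.min(sampleSize, all.size());
            for (int i = 0; i < n; i++) {
                context.write(domainId, new Text(all.get(i)));
            }
        }
    }

Buffering every URL of a domain in memory is fine for modest domains; for very large ones, reservoir sampling would give the same uniform sample in constant memory.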
1.5. ExtractHtmlFromArc.java
The main task is to extract the HTML content of a given list of URLs.
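One way to wire this up is to ship the sample list with the job and filter in the mapper; a sketch, assuming the mapper receives (URL, raw HTML) pairs and the list was attached via Job.addCacheFile:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: pass through the HTML of exactly the URLs on the given list.
    class HtmlExtractMapper extends Mapper<Text, Text, Text, Text> {
        private final Set<String> wantedUrls = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles == null) {
                return;
            }
            // Read the URL list (one URL per line) from the distributed cache.
            for (URI uri : cacheFiles) {
                Path path = new Path(uri);
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(path)))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        wantedUrls.add(line.trim());
                    }
                }
            }
        }

        @Override
        protected void map(Text url, Text html, Context context)
                throws IOException, InterruptedException {
            if (wantedUrls.contains(url.toString())) {
                context.write(url, html);
            }
        }
    }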
1.6. ProfilesExtraction.java
The main task is to extract profiles from pages of social sites. As a pre-processing step,
ExtractPagesForGivenHostdomains.java is used to extract all pages from the social host domains of interest.
To restrict profiles to locations of interest, use ExtractProfilesForGivenCountries.java.
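The host-domain restriction in that pre-processing step amounts to a host check per URL; a sketch, assuming (URL, page) input pairs (the two domains listed are only examples):

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: keep pages whose host belongs to a social domain of interest.
    class SocialPageMapper extends Mapper<Text, Text, Text, Text> {
        private static final Set<String> SOCIAL_DOMAINS = new HashSet<>(
                Arrays.asList("linkedin.com", "xing.com")); // illustrative

        @Override
        protected void map(Text url, Text page, Context context)
                throws IOException, InterruptedException {
            try {
                String host = new URL(url.toString()).getHost().toLowerCase();
                for (String d : SOCIAL_DOMAINS) {
                    // Match the domain itself and any of its subdomains.
                    if (host.equals(d) || host.endsWith("." + d)) {
                        context.write(url, page);
                        return;
                    }
                }
            } catch (MalformedURLException e) {
                // Skip malformed URLs.
            }
        }
    }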
1.7. ProfilesLocations
The main task is to extract geolocation information from the profiles extracted in (1.6).
- Run any of the main classes:
  - Create a jar file which contains the main class, the other needed classes, and the libraries.
  - Run the MapReduce job as follows:
    #hadoop jar jar-file [main class name] arguments
    arguments: the usage() method in each class shows the arguments needed for running the corresponding class.
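For example, to run the URL counter (the jar name and paths below are placeholders; the real arguments are whatever usage() prints):
    #hadoop jar crawl-jobs.jar CountURLsPerDomain /input/urls /output/url-counts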