norvigaward/2012-naward15

naward15

  1. MapReduce Main classes:

1.1. CountURLsPerDomain.java

The MapReduce job driven by this main class counts the number of URLs for each host domain.
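
The counting logic can be sketched in plain Java, without the Hadoop API. This is not the code from CountURLsPerDomain.java: the class name `UrlCountSketch` and its methods are hypothetical, and the sketch assumes that "host domain" means the full host name of the URL. The first method plays the role of the map step (URL to key), the second the role of the reduce step (sum per key).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the repository's implementation.
class UrlCountSketch {

    // "Map" step: derive the grouping key (the host) from one URL.
    // Returns null when the URL cannot be parsed or has no host part.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // "Reduce" step: add one count per URL to its host's total.
    static Map<String, Integer> countPerHost(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.com/jobs/1",
            "http://example.com/jobs/2",
            "http://example.org/about");
        // prints {example.com=2, example.org=1}
        System.out.println(countPerHost(urls));
    }
}
```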

1.2. ArcLookupIndex.java

The main task is to create a lookup index covering all the URLs in the Common Crawl corpus. 
For each URL, the index records which ARC file contains it and at what offset. 
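
To make the shape of such an index concrete, here is a minimal in-memory sketch. The README does not describe the on-disk format of the index, so the tab-separated line layout (`url`, `arcFile`, `offset`) is purely an assumption for illustration, as are the names `ArcIndexSketch`, `Location`, `addLine` and `lookup`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the real index format is not documented here.
class ArcIndexSketch {

    // One index entry: the ARC file holding a URL and the record's offset in it.
    static final class Location {
        final String arcFile;
        final long offset;

        Location(String arcFile, long offset) {
            this.arcFile = arcFile;
            this.offset = offset;
        }
    }

    private final Map<String, Location> byUrl = new HashMap<>();

    // Parses one ASSUMED tab-separated line: url <TAB> arcFile <TAB> offset.
    void addLine(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
        }
        byUrl.put(parts[0], new Location(parts[1], Long.parseLong(parts[2])));
    }

    // Returns the location of the URL, or null when the URL is not indexed.
    Location lookup(String url) {
        return byUrl.get(url);
    }
}
```

With an entry in hand, a reader can open the named ARC file and seek directly to the stored offset instead of scanning the whole file.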

1.3. SearchTextDataForVacancyMatches.java

The main task is to find candidate vacancy URLs in the Common Crawl corpus. Any URL whose text matches at least
two vacancy keywords from a given vacancy filter is treated as a candidate vacancy.
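
The "at least two keywords" rule can be sketched as follows. How the actual class tokenizes text and compares keywords is not stated in this README, so this sketch assumes case-insensitive substring matching and counts each distinct keyword once. `VacancyFilterSketch` and the sample keywords in the test data are hypothetical.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; the matching rules of the real job may differ.
class VacancyFilterSketch {

    // Counts how many keywords from the filter occur in the text.
    // ASSUMPTION: case-insensitive substring match, each keyword counted once.
    static int countMatches(String text, List<String> keywords) {
        String lower = text.toLowerCase(Locale.ROOT);
        int matches = 0;
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                matches++;
            }
        }
        return matches;
    }

    // A page is a candidate vacancy when at least two keywords match.
    static boolean isCandidate(String text, List<String> keywords) {
        return countMatches(text, keywords) >= 2;
    }
}
```

Requiring two matches rather than one reduces false positives from pages that mention a single generic word such as "apply" in an unrelated context.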

1.4. CreateSample.java

The main task is to produce a sample list of URLs, which is used in (1.5) to extract HTML content. 
The input for this MapReduce job is a file containing domain IDs and their URLs, that is, the candidate 
vacancy URLs found in (1.3). For each domain ID, URLs are picked at random until the required 
sample size is reached.
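
One way to picture the sampling step is the sketch below. The README does not say whether the sample size applies per domain or to the whole list; this sketch assumes a per-domain cap (`perDomain`) and uses a seeded shuffle so that runs are repeatable. `SampleSketch`, `perDomain` and `seed` are hypothetical names, not taken from CreateSample.java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration; not the repository's implementation.
class SampleSketch {

    // For each domain ID, returns up to perDomain URLs chosen at random.
    // ASSUMPTION: the required sample size is a cap per domain ID.
    static Map<String, List<String>> sample(Map<String, List<String>> urlsByDomain,
                                            int perDomain, long seed) {
        Random rng = new Random(seed);
        Map<String, List<String>> result = new TreeMap<>();
        // Visit domains in sorted order so a given seed always gives the same sample.
        for (Map.Entry<String, List<String>> entry : new TreeMap<>(urlsByDomain).entrySet()) {
            List<String> shuffled = new ArrayList<>(entry.getValue());
            Collections.shuffle(shuffled, rng);
            int size = Math.min(perDomain, shuffled.size());
            result.put(entry.getKey(), new ArrayList<>(shuffled.subList(0, size)));
        }
        return result;
    }
}
```

Shuffling a copy and taking a prefix picks URLs without replacement, so a domain with fewer URLs than the cap simply contributes all of them.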

1.5. ExtractHtmlFromArc.java

The main task is to extract the HTML content for a given list of URLs. 

1.6. ProfilesExtraction.java

Extracts profiles from pages of social sites. As a pre-processing step, we used ExtractPagesForGivenHostdomains.java
to extract all pages from the social host domains of interest. To restrict profiles to locations of interest, we 
use the class ExtractProfilesForGivenCountries.java.

1.7. ProfilesLocations

Extracts the geolocation information from the profiles extracted in (1.6).

Run any of the main classes:

  1. Create a jar file that contains the main class, the other required classes, and the libraries.

  2. Run the MapReduce job as follows:

    hadoop jar jar-file [main class name] arguments

arguments: the usage() method in each class shows the arguments needed to run that class.
