- MapReduce Main classes:
1.1. CountURLsPerDomain.java
The main task of the MapReduce job that uses this main class is to count the number of URLs for each host domain.
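A minimal sketch of the counting logic, assuming the input is a plain text file with one URL per line (the real job's input format may differ):

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (host domain, 1) for every URL in the input.
    class DomainCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text domain = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            try {
                domain.set(new URL(line.toString().trim()).getHost());
                context.write(domain, ONE);
            } catch (MalformedURLException e) {
                // Skip lines that are not valid URLs.
            }
        }
    }

    // Reducer: sum the per-domain counts.
    class DomainCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text domain, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(domain, new LongWritable(sum));
        }
    }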
1.2. ArcLookupIndex.java
The main task is to create a lookup index that records, for every URL in the
common-crawl corpus, which ARC file contains it and at what offset.
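The map step might look like the sketch below; ArcRecordStub is a made-up stand-in for whatever record type the job's ARC input format actually delivers, not a Hadoop or common-crawl class:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical record type: stands in for the real ARC record reader output.
    interface ArcRecordStub {
        String getUrl();         // URL of the archived page
        String getArcFileName(); // name of the ARC file holding the record
        long getOffset();        // byte offset of the record in that file
    }

    // Mapper: emit (URL, "arcFile<TAB>offset") as one index entry per URL.
    class ArcIndexMapper extends Mapper<Text, ArcRecordStub, Text, Text> {
        private final Text url = new Text();
        private final Text location = new Text();

        @Override
        protected void map(Text key, ArcRecordStub record, Context context)
                throws IOException, InterruptedException {
            url.set(record.getUrl());
            location.set(record.getArcFileName() + "\t" + record.getOffset());
            context.write(url, location);
        }
    }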
1.3. SearchTextDataForVacancyMatches.java
The main task is to find candidate vacancy URLs in the common-crawl corpus. Any URL that matches at least
two vacancy keywords from a given vacancy filter is considered a candidate vacancy.
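The matching rule itself is small; a sketch, assuming the mapper receives (URL, page text) pairs and the keyword filter has already been loaded into a set (the hard-coded keywords below are only illustrative):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: emit a URL only if its text matches at least two vacancy keywords.
    class VacancyMatchMapper extends Mapper<Text, Text, Text, IntWritable> {
        // Illustrative filter; the real keywords come from the vacancy filter file.
        private final Set<String> keywords = new HashSet<>(
                Arrays.asList("vacancy", "salary", "apply now", "job description"));

        @Override
        protected void map(Text url, Text pageText, Context context)
                throws IOException, InterruptedException {
            String text = pageText.toString().toLowerCase();
            int matches = 0;
            for (String kw : keywords) {
                if (text.contains(kw)) {
                    matches++;
                }
            }
            if (matches >= 2) {
                context.write(url, new IntWritable(matches));
            }
        }
    }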
1.4. CreateSample.java
The main task is to produce a sample list of URLs; this list is used in (1.5) to extract HTML content.
The input for this MapReduce job is a file containing domain IDs and their URLs, i.e. the candidate
vacancy URLs found in (1.3). For each domain ID we pick URLs at random until we reach the required
sample size.
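The per-domain sampling can be done on the reduce side; a sketch, assuming (domain ID, URL) pairs reach the reducer and the sample size is passed through the job configuration ("sample.size" is an invented key):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reducer: keep a uniform random sample of each domain's URLs.
    class SampleReducer extends Reducer<Text, Text, Text, Text> {
        private int sampleSize;

        @Override
        protected void setup(Context context) {
            // "sample.size" is an illustrative configuration key, not the real one.
            sampleSize = context.getConfiguration().getInt("sample.size", 100);
        }

        @Override
        protected void reduce(Text domainId, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            List<String> all = new ArrayList<>();
            for (Text url : urls) {
                all.add(url.toString());
            }
            // Shuffle and take the first N URLs: a uniform random pick.
            Collections.shuffle(all);
            int n = Math.min(sampleSize, all.size());
            for (int i = 0; i < n; i++) {
                context.write(domainId, new Text(all.get(i)));
            }
        }
    }

Buffering every URL of a domain in memory is fine for modest domains; for very large ones, reservoir sampling would give the same uniform sample in constant memory.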
1.5. ExtractHtmlFromArc.java
The main task is to extract the HTML content of a given list of URLs.
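One way to wire this up is to ship the sample list with the job and filter in the mapper; a sketch, assuming the mapper receives (URL, raw HTML) pairs and the list was attached via Job.addCacheFile:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: pass through the HTML of exactly the URLs on the given list.
    class HtmlExtractMapper extends Mapper<Text, Text, Text, Text> {
        private final Set<String> wantedUrls = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles == null) {
                return;
            }
            // Read the URL list (one URL per line) from the distributed cache.
            for (URI uri : cacheFiles) {
                Path path = new Path(uri);
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(path)))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        wantedUrls.add(line.trim());
                    }
                }
            }
        }

        @Override
        protected void map(Text url, Text html, Context context)
                throws IOException, InterruptedException {
            if (wantedUrls.contains(url.toString())) {
                context.write(url, html);
            }
        }
    }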
1.6. ProfilesExtraction.java
The main task is to extract profiles from pages of social sites. As a pre-processing step,
ExtractPagesForGivenHostdomains.java is used to extract all pages from the social host domains of interest.
To restrict profiles to locations of interest, use ExtractProfilesForGivenCountries.java.
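The host-domain restriction in that pre-processing step amounts to a host check per URL; a sketch, assuming (URL, page) input pairs (the two domains listed are only examples):

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper: keep pages whose host belongs to a social domain of interest.
    class SocialPageMapper extends Mapper<Text, Text, Text, Text> {
        private static final Set<String> SOCIAL_DOMAINS = new HashSet<>(
                Arrays.asList("linkedin.com", "xing.com")); // illustrative

        @Override
        protected void map(Text url, Text page, Context context)
                throws IOException, InterruptedException {
            try {
                String host = new URL(url.toString()).getHost().toLowerCase();
                for (String d : SOCIAL_DOMAINS) {
                    // Match the domain itself and any of its subdomains.
                    if (host.equals(d) || host.endsWith("." + d)) {
                        context.write(url, page);
                        return;
                    }
                }
            } catch (MalformedURLException e) {
                // Skip malformed URLs.
            }
        }
    }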
1.7. ProfilesLocations
The main task is to extract geolocation information from the profiles extracted in (1.6).
- Run any of the main classes:
  - Create a jar file which contains the main class, the other needed classes, and the libraries.
  - Run the MapReduce job as follows:
    #hadoop jar jar-file [main class name] arguments
    arguments: the usage() method in each class shows the arguments needed for running the corresponding class.
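For example, to run the URL counter (the jar name and paths below are placeholders; the real arguments are whatever usage() prints):
    #hadoop jar crawl-jobs.jar CountURLsPerDomain /input/urls /output/url-counts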