Igor Levaja edited this page Aug 26, 2014 · 4 revisions

Introduction

The goal of this project is to analyze a web crawl of about 3 billion web pages. We were free to choose a research topic, and initially we wanted to find, for each country in the world, its most strongly associated terms (for example landmarks, food and national symbols) by counting (country, term) pairs on each web page. However, after some initial implementation and testing, we concluded that we could not continue with this idea, because a significant number of pages list all the countries in the world for unrelated reasons (the user’s location, for example). This created a lot of “noise” in the crawl, which we found hard to filter. A possible solution would be to concentrate only on certain web pages or domains (Wikipedia would likely be the most appropriate), but we wanted to use the whole dataset in our project. Therefore, we came up with a new idea: measuring the popularity of various programming languages in different countries by counting the pages that mention a given language and linking this count to the corresponding country domain.

Implementation

Since we had no previous experience with writing MapReduce jobs, we decided to implement our project in Pig. We first compiled a list of 25 commonly used programming languages from various sources on the internet. We also defined our own version of the Generic Warc File Loader in order to extract information from the WARC records of web pages. For each web page, the loader searches the content for occurrences of the listed languages and returns the top-level domain of that page together with a bag of tuples. Each tuple holds, as a string, the name of one programming language found on the page (C, Java, etc.).
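The per-page extraction step can be illustrated with a minimal Python sketch. This is not the project's loader (which is a Pig load function); the `LANGUAGES` subset, the function names and the case-sensitive matching rule are all assumptions made for illustration, since the text gives neither the full list of 25 languages nor the matching rule.

```python
import re
from urllib.parse import urlparse

# Hypothetical subset of the 25-language list; the actual list is not given in the text.
LANGUAGES = ["C", "C++", "C#", "Java", "JavaScript", "Python", "Ruby", "PHP"]


def _pattern(name):
    # Assumed matching rule: the name must not be glued to word characters,
    # '+' or '#', so "C" does not match inside "C++" or "C#", and "Java"
    # does not match inside "JavaScript".
    return re.compile(r"(?<![\w+#])" + re.escape(name) + r"(?![\w+#])")


PATTERNS = {name: _pattern(name) for name in LANGUAGES}


def top_level_domain(url):
    # Last label of the host name, e.g. "rs" for "http://example.rs/page".
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def extract(url, text):
    # Returns (top-level domain, list of language names mentioned in the text).
    found = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    return top_level_domain(url), found
```

Matching here is case-sensitive, which keeps the single letter "C" from matching ordinary lowercase words but also misses spellings such as "java"; whether that trade-off suits a real crawl is an open design choice.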

In the beginning, we wrote a single script that was supposed to do the whole analysis. However, some very large records caused out-of-memory exceptions, so we had to divide the job into two parts. The sole purpose of the first script is to run map jobs over the whole crawl and extract the relevant information using our Warc File Loader. Its output is stored on the cluster and serves as the input for the second script.

The second script loads the output of the first script. After flattening the bags of tuples (names of programming languages), we group by (domain, language) pairs and count the occurrences of each pair. Finally, we flatten the grouped results so that the output has the form (domain, language, N), where N is the number of pages in that domain which mention that language.
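The flatten, group and count steps of the second script can be mimicked in a few lines of Python. This is only a sketch of the logic on in-memory data, not the Pig script itself; the function name `count_pairs` and the input shape (an iterable of (domain, list of languages) records) are assumptions.

```python
from collections import Counter


def count_pairs(records):
    """Turn (domain, [languages]) records into sorted (domain, language, N) triples."""
    counts = Counter()
    for domain, languages in records:
        # "Flatten": one (domain, language) pair per language found on the page.
        # set() guards against a language appearing twice for the same page.
        for language in set(languages):
            # "Group and count": N is the number of pages per pair.
            counts[(domain, language)] += 1
    return sorted((domain, language, n) for (domain, language), n in counts.items())
```

For example, `count_pairs([("rs", ["Java", "C"]), ("rs", ["Java"]), ("de", ["C"])])` yields one triple per distinct pair, with Java counted on two pages for the `rs` domain.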

Results

As previously mentioned, the results are organized as triplets of a domain, a programming language and the number of pages related to that language. We visualized the results using the Google Charts API. A screenshot of this visualization follows:

[Figure: World Domains]

In this visualization, it is possible to move the cursor over a country to get the distribution of programming languages for that country. We have also listed results for some global top-level domains (com, net, edu and org).

[Figure: Global Domains]

Discussion

Even though our results are mostly consistent with some public rankings of programming language popularity (references 1, 2 and 3), there is a certain level of noise which threatens the validity of our results. The most obvious noise, which we were not able to eliminate, is related to JavaScript. The very large number of pages in the crawl that contain “JavaScript” is probably caused by error and warning messages (“JavaScript is disabled on your browser”, “Please enable JavaScript”, etc.) that were shown to the crawler. Although we tried to filter such messages, their variety and the significant number of pages that contain them prevented us from eliminating this noise. The only solution we can propose is to exclude JavaScript from the analysis, which would lead to results more consistent with other rankings.
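Excluding JavaScript from the analysis amounts to a simple post-processing step over the (domain, language, N) triples. The Python sketch below shows one possible form of it; the function name `rank_without` and the tie-breaking by language name are assumptions, and it is not part of the original scripts.

```python
from collections import defaultdict


def rank_without(triples, excluded=("JavaScript",)):
    """Rank languages per domain by page count, skipping the excluded languages."""
    per_domain = defaultdict(list)
    for domain, language, n in triples:
        if language not in excluded:
            per_domain[domain].append((language, n))
    # Highest page count first; ties are broken alphabetically by language name.
    return {
        domain: sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
        for domain, pairs in per_domain.items()
    }
```

A domain whose only detected language is excluded simply does not appear in the result.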


1 http://langpop.com/

2 http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

3 http://spectrum.ieee.org/static/interactive-the-top-programming-languages
