ties edited this page Jan 15, 2013 · 54 revisions

Introduction

Authors: Dennis Pallett, Marcel Boersma, Niels Visser, Ties de Kock

This report describes our challenge submission for the Norvig Award. Using the SARA Hadoop cluster and the CommonCrawl data, we have examined the use of JavaScript libraries on the web and determined the most popular ones. Section 1 discusses our idea in depth and describes exactly what we want to do. Section 2 describes the method we use to gather the data needed for our results. Section 3 presents our results and explains how we produced them. Finally, section 4 discusses the results and presents some conclusions.

1. Idea

Our idea is to use the CommonCrawl dataset and the MapReduce framework to extract a list of all JavaScript libraries used by the pages in the dataset. After extracting the list of all external JavaScript files, we post-process it to determine usage patterns, such as which external JavaScript libraries are used and over which protocol (http or https) they are loaded. We define a JavaScript library as a linked JavaScript file, i.e. a file referred to by the 'src' attribute of a <script> tag. Furthermore, we scan the data for usage of popular JavaScript libraries such as jQuery, MooTools, etc. The method used for extracting this information is described in the next section.

2. Method

We will use the MapReduce framework to extract the raw data that we need to produce our results. In the map phase we use the JSoup Java library to parse the HTML of each page into a DOM tree and extract all the <script> tags from it. Each <script> tag is analysed and its 'src' attribute is parsed into distinct parts (i.e. host, filename, protocol, etc.). For local files, meaning that the JavaScript file is hosted on the same host as the website, the following key will be emitted:

cn [filename] protocol 1

This allows us to search for commonly used file names in the 'src' attribute. If the host in the 'src' attribute differs from the page host, an extra key is emitted for that file. This key contains the address of the file and the count value.

cn [address of file] protocol 1
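The key construction described above can be sketched roughly as follows. This is a minimal illustration, not the mapper used for the submission: the class and method names are hypothetical, the space-separated key layout is copied from the key templates in the text, and it assumes (from the word "extra") that the file-name key is emitted for every script while the address key is added only for scripts on another host. It uses java.net.URI instead of JSoup so that it runs without dependencies.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the authors' mapper: builds the "cn" keys for one <script src="...">.
class ScriptKeySketch {

    // Returns the keys to emit for a script reference `src` found on the page at `pageUrl`.
    // Note: URI.create throws IllegalArgumentException for malformed input (e.g. spaces).
    static List<String> keysFor(String pageUrl, String src) {
        URI page = URI.create(pageUrl);
        // Resolve relative references such as "js/app.js" against the page URL.
        URI script = page.resolve(src);

        String protocol = script.getScheme();                       // e.g. "http" or "https"
        String path = script.getPath() == null ? "" : script.getPath();
        String filename = path.substring(path.lastIndexOf('/') + 1);

        List<String> keys = new ArrayList<>();
        // File-name key, emitted for every script (assumption, see lead-in).
        keys.add("cn " + filename + " " + protocol + " 1");

        // Extra key with the full address when the script is hosted elsewhere.
        String scriptHost = script.getHost();
        if (scriptHost != null && !scriptHost.equalsIgnoreCase(page.getHost())) {
            String address = protocol + "://" + scriptHost + path;  // query string is dropped
            keys.add("cn " + address + " " + protocol + " 1");
        }
        return keys;
    }
}
```

For a page on example.com that links js/jquery.js, this sketch yields only the file-name key; for a script loaded from apis.google.com it yields both the file-name key and the address key.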

Because the external host is emitted, it is possible to determine commonly used locations for hosting external JavaScript files. Additionally, the page is searched for inline scripts such as those of Google Analytics, Facebook and Twitter. For each inline script found, the following key is emitted:

cn il:[tag] protocol 1

With:
tag = {google-analytics, facebook, twitter}
protocol = {http, https}
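As an illustration of the inline-script keys, the sketch below checks the text of one inline <script> element for a marker string per tag. The marker strings are assumptions (host names that also appear in the result lists of this report), as are the class name and the choice to pass the protocol in as a parameter; the report does not state how the inline scripts were actually recognised.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: emits "cn il:[tag] protocol 1" keys for one inline script body.
class InlineTagSketch {

    // tag -> substring assumed to identify that widget's inline snippet
    static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("google-analytics", "google-analytics.com");
        MARKERS.put("facebook", "connect.facebook.net");
        MARKERS.put("twitter", "platform.twitter.com");
    }

    static List<String> keysFor(String inlineScript, String protocol) {
        String body = inlineScript.toLowerCase();
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> marker : MARKERS.entrySet()) {
            if (body.contains(marker.getValue())) {
                keys.add("cn il:" + marker.getKey() + " " + protocol + " 1");
            }
        }
        return keys;
    }
}
```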

Finally, the co-occurrence of the JavaScript files is analysed. All the JavaScript files found on one page are emitted together as a single key with count 1:

co ([filename],)*[lastfilename] 1
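A co-occurrence key of this shape could be built as in the following sketch. Sorting and de-duplicating the file names is an assumption made here so that pages with the same set of files produce the same key; the report does not say whether the original job did this. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: builds the "co" key for all script file names found on one page.
class CoOccurrenceKeySketch {

    static String keyFor(List<String> filenames) {
        // TreeSet removes duplicates and orders the names (an assumption, see lead-in).
        TreeSet<String> unique = new TreeSet<>(filenames);
        return "co " + String.join(",", unique) + " 1";
    }
}
```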

To produce meaningful results, the output of our MapReduce job needs to be analysed further. This is discussed in the next section.

3. Results

In the following subsections we will present different results based on our raw data. These results have been created using various scripts that analysed our raw data.

3.1. Most commonly used JavaScript files

The following list shows the top 10 external JavaScript files by file name, with their counts in parentheses. In many cases it is likely that different files with the same name are counted as the same script. Unfortunately there is nothing we can do about this, since the actual contents of the JavaScript files are not in the dataset.

  1. show_ads.js (403.141.872)
  2. jquery.js (145.356.570)
  3. jquery.min.js (85.005.493)
  4. addthis_widget.js (72.739.691)
  5. swfobject.js (72.608.284)
  6. urchin.js (69.405.490)
  7. plusone.js (64.328.482)
  8. widgets.js (59.589.858)
  9. prototype.js (48.643.257)
  10. all.js (42.942.077)

3.2. Most commonly used libraries

We have also determined which generic JavaScript frameworks, such as jQuery and MooTools, are most popular. To determine this, we analysed our results by looking for specific library names in the file name of each JavaScript file (e.g. 'jquery' for jQuery). This resulted in the following top 10 libraries:

  1. jQuery* (791.223.025; 82,64%)
  2. Prototype (58.023.086; 6,06%)
  3. MooTools (46.267.439; 4,83%)
  4. Ext (33.257.953; 3,47%)
  5. YUI (17.026.823; 1,78%)
  6. Modernizr (5.621.476; 0,59%)
  7. Dojo (1.985.520; 0,21%)
  8. Ember (1.356.498; 0,14%)
  9. Underscore (1.085.184; 0,11%)
  10. Backbone (865.260; 0,09%)

* Note that this count includes not only the jQuery framework itself but also its plugins. The percentages listed are relative to the top 10, i.e. they are not relative to the whole data set.
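The file-name matching described in this subsection can be approximated as below. This is a sketch under stated assumptions, not the script that produced the list: the matching terms are simply the lowercased library names, a plain substring test is used, and a file is attributed to the first matching library only. A substring test this simple will overcount short names (for instance 'ext' also matches 'text.js'), so the real analysis may well have been stricter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing step: sums per-file counts into per-library totals.
class LibraryTallySketch {

    // Matching terms are an assumption: the lowercased library names from the list above.
    static final List<String> LIBRARIES = List.of(
            "jquery", "prototype", "mootools", "ext", "yui",
            "modernizr", "dojo", "ember", "underscore", "backbone");

    static Map<String, Long> tally(Map<String, Long> countsByFilename) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : countsByFilename.entrySet()) {
            String name = entry.getKey().toLowerCase();
            for (String library : LIBRARIES) {
                if (name.contains(library)) {
                    // Add this file's count to the library total; first match wins.
                    totals.merge(library, entry.getValue(), Long::sum);
                    break;
                }
            }
        }
        return totals;
    }
}
```

With this approach jquery.js and jquery.min.js both contribute to the jQuery total, which is also why plugin files whose names contain 'jquery' end up in that total.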

3.3. External libraries

As described in the method section, a special key is emitted for external libraries. The following list shows the top 10 external libraries.

  1. http://pagead2.googlesyndication.com/pagead/show_ads.js (401.464.766)
  2. http://www.google-analytics.com/urchin.js (65.358.726)
  3. http://platform.twitter.com/widgets.js (58.433.169)
  4. http://s7.addthis.com/js/250/addthis_widget.js (56.714.863)
  5. https://apis.google.com/js/plusone.js (45.895.137)
  6. http://edge.quantserve.com/quant.js (32.537.400)
  7. http://partner.googleadservices.com/gampad/google_service.js (27.902.815)
  8. http://connect.facebook.net/en_us/all.js (27.637.642)
  9. http://www.google.com/jsapi (20.364.272)
  10. http://apis.google.com/js/plusone.js (18.118.039)

3.4. External hosts

Based on the remote JavaScript files, we have also determined the top 10 external hosts. These hosts are often content delivery networks (CDNs) for popular JavaScript libraries (e.g. Google Code). The top 10 list is as follows:

  1. pagead2.googlesyndication.com (412.873.753)
  2. ajax.googleapis.com (98.984.803)
  3. s7.addthis.com (74.236.417)
  4. www.google-analytics.com (72.116.688)
  5. www.google.com (64.871.164)
  6. apis.google.com (64.066.708)
  7. platform.twitter.com (60.497.936)
  8. l.yimg.com (58.755.951)
  9. connect.facebook.net (39.909.476)
  10. s.ytimg.com (35.853.346)
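Going from per-address counts to per-host counts is a small aggregation step, sketched below. The class name and the input shape (a map from file address to count) are assumptions for illustration; the actual post-processing scripts are not shown in this report.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical post-processing step: folds per-address counts into per-host counts.
class HostTallySketch {

    static Map<String, Long> tally(Map<String, Long> countsByAddress) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> entry : countsByAddress.entrySet()) {
            String host = URI.create(entry.getKey()).getHost();
            if (host != null) {
                // http and https variants of the same host are counted together.
                totals.merge(host.toLowerCase(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }
}
```

In this sketch the http and https entries for plusone.js from section 3.3 would both be counted under apis.google.com.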

3.5. Inline JavaScript

We have also analysed inline JavaScript to determine the popularity of some well-known widgets. This has resulted in the following top 3:

  1. Google Analytics (227.163.432; 85,47%)
  2. Facebook (25.835.531; 9,72%)
  3. Twitter (12.786.300; 4,81%)

These results are very much as expected, since Google Analytics is much more widespread than both Facebook and Twitter widgets. Additionally, Facebook widgets are generally assumed to be more popular than Twitter widgets, which is confirmed by our results.
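The percentages in the list above are each count divided by the sum of the three counts. A small sketch of that calculation follows; the class name is hypothetical, and the output uses a decimal point rather than the decimal comma used in this report.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: expresses each count as a percentage of the total of all counts.
class ShareSketch {

    static Map<String, String> shares(Map<String, Long> counts) {
        long total = 0;
        for (long count : counts.values()) {
            total += count;
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            double percentage = 100.0 * entry.getValue() / total;
            // Two decimals, locale-independent formatting (decimal point).
            result.put(entry.getKey(), String.format(Locale.ROOT, "%.2f", percentage));
        }
        return result;
    }
}
```

Applied to the three counts above (227163432, 25835531 and 12786300), this gives 85.47, 9.72 and 4.81, matching the listed percentages.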

3.6. Co-occurrence

[Table: co-occurrence of the libraries from section 3.2]

The table above shows the co-occurrence of the libraries from section 3.2. Each row represents the totals of one library. For example, the combination jQuery-Prototype occurs in 89% of all Prototype mentions.
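One cell of such a row-relative table can be computed as sketched below: among the pages that mention the row library, the share that also mention the other library. The input shape (one set of library names per page) and all names are assumptions for illustration; the report does not describe how the co keys were turned into the table.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper: computes one cell of a row-relative co-occurrence table.
class CoOccurrenceShareSketch {

    // Percentage of pages mentioning `row` that also mention `other`.
    static double share(List<Set<String>> pages, String row, String other) {
        long rowCount = 0;
        long both = 0;
        for (Set<String> libraries : pages) {
            if (libraries.contains(row)) {
                rowCount++;
                if (libraries.contains(other)) {
                    both++;
                }
            }
        }
        return rowCount == 0 ? 0.0 : 100.0 * both / rowCount;
    }
}
```

Because each value is relative to the row library, the table is not symmetric: the share of Prototype pages that also use jQuery generally differs from the share of jQuery pages that also use Prototype.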

4. Discussion

Our results give a clear picture of the usage of JavaScript files by the web pages contained in the CommonCrawl dataset, which we assume to be indicative of the whole internet. Our results do not contain many surprises; they mostly confirm what everyone knows or suspects, e.g. that Google Analytics is widely used. However, our research backs these beliefs with data and thereby provides validation for most of the general beliefs about external JavaScript files.

Additionally we have also uncovered a top 10 list of most commonly used external hosts. These hosts are an integral part of the internet due to their high usage. This also makes them a prime target for hackers. This is something to consider when using any of these hosts. Thankfully most of these hosts are run by professional organizations who have the knowledge and technology to properly maintain and secure these hosts.