
duaraghav8/larry-crawler


larry-crawler


Kayako Twitter challenge

Installation

npm install --save larry-crawler

Usage

Navigate to the node_modules directory which contains larry-crawler.

cd larry-crawler/usage
node get-tweets.js

Test

npm test

Output

The application fetches tweets in batches of 100. Unless forcefully killed (CTRL+C), the app will keep running until all tweets matching the defined criteria have been fetched. See result.

NOTE: A batch might produce fewer than 100 tweets in the output if you've applied a secondary filter (like retweetCount). If 100 tweets were retrieved for the specified hashtag and 30 of them haven't been retweeted, then only 70 tweets are supplied in the response.statuses Array.
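
To make the note above concrete, here is a minimal sketch of how a retweet-count filter could shrink a batch. It is not the module's own SecondaryFilterForTweets class: the function name filterByRetweetCount is hypothetical, only the $gt operator from the criteria example is handled, and the sample batch is made up. The retweet_count field is the one Twitter includes on each status object.

```javascript
// Hypothetical helper, NOT the module's SecondaryFilterForTweets class.
// Supports only the `$gt` operator shown in the criteria example.
function filterByRetweetCount(statuses, retweetCount) {
  if (!retweetCount || typeof retweetCount.$gt !== 'number') {
    return statuses; // no secondary filter requested
  }
  return statuses.filter((status) => status.retweet_count > retweetCount.$gt);
}

// Made-up stand-in for a batch returned by the search endpoint.
const batch = [
  { id_str: '3', retweet_count: 0 },
  { id_str: '2', retweet_count: 5 },
  { id_str: '1', retweet_count: 1 }
];

const kept = filterByRetweetCount(batch, { $gt: 0 });
console.log(kept.map((s) => s.id_str)); // [ '2', '1' ]
```

Three tweets go in and two come out, which is why a batch can contain fewer items than were fetched.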

Module API

To access the class that larry-crawler exposes for crawling Twitter:

const {TwitterCrawler} = require ('./larry-crawler');

Get your app or user credentials from https://dev.twitter.com/, then create a new object like:

const crawler = new TwitterCrawler ({

	consumerKey: process.env.TWITTER_CONSUMER_KEY,
	consumerSecret: process.env.TWITTER_CONSUMER_SECRET,
	accessTokenKey: process.env.TWITTER_ACCESS_TOKEN_KEY,
	accessTokenSecret: process.env.TWITTER_ACCESS_TOKEN_SECRET

});

If you have a Twitter app, use bearerToken instead of accessTokenKey & accessTokenSecret.

The new object exposes a getTweets() method, which fetches tweets matching the given criteria and returns a Promise.

const criteria = { hashtags: ['custserv'], retweetCount: {$gt: 0} };

crawler.getTweets (criteria).then ((response) => {
  console.log (JSON.stringify (response, null, 2));
}).catch ((err) => {
  // Log failures instead of silently swallowing them.
  console.error (err);
});

To set the max_id parameter for pagination,

criteria.maxIdString = status.id_str

where status is an item in the response.statuses Array.

See get-tweets.js for a full example.
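
As a rough illustration of how maxIdString can drive pagination, the sketch below loops until a batch comes back empty. Several things here are assumptions rather than the module's documented behaviour: the function name fetchAllTweets is hypothetical, the loop assumes the last item in response.statuses is the oldest, and it subtracts 1 from that ID (using BigInt) because Twitter's max_id is inclusive. A stub object stands in for TwitterCrawler so the example runs without credentials.

```javascript
// Hypothetical driver loop. `crawler` only needs a getTweets(criteria)
// method that resolves to { statuses: [...] }, as described above.
async function fetchAllTweets(crawler, criteria) {
  const query = Object.assign({}, criteria); // avoid mutating the caller's object
  const all = [];
  for (;;) {
    const { statuses } = await crawler.getTweets(query);
    if (statuses.length === 0) {
      return all; // assumption: an empty batch means nothing is left
    }
    all.push(...statuses);
    // Assumption: the last status in the batch is the oldest one.
    const oldest = statuses[statuses.length - 1];
    // max_id is inclusive, so ask for IDs strictly below the oldest seen.
    query.maxIdString = (BigInt(oldest.id_str) - 1n).toString();
  }
}

// Stub standing in for TwitterCrawler: serves made-up IDs two at a time.
const ids = ['105', '104', '103', '102', '101'];
const stubCrawler = {
  getTweets(criteria) {
    const max = criteria.maxIdString ? BigInt(criteria.maxIdString) : null;
    const page = ids
      .filter((id) => max === null || BigInt(id) <= max)
      .slice(0, 2)
      .map((id) => ({ id_str: id }));
    return Promise.resolve({ statuses: page });
  }
};

fetchAllTweets(stubCrawler, { hashtags: ['custserv'] }).then((all) => {
  console.log(all.length); // 5
});
```

With a secondary filter such as retweetCount in place, a batch can be empty even though older tweets remain, so a real loop would need a different stop condition than the one assumed here.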

Technical Details

The module has only one dependency: twitter.

  1. Searching by hashtag is simple, since the Twitter API has built-in support for it. To further refine tweets by number of retweets, the module contains a class SecondaryFilterForTweets.

See Working with search API

  2. Since a maximum of 100 tweets are sent per request, an effective pagination strategy had to be implemented using the max_id parameter so we can retrieve ALL the tweets since the very beginning. This strategy was followed to achieve pagination.

  3. The primary challenge was dealing with the 64-bit integer ID provided by the Twitter API. JS can only provide precision up to 53 bits. Hence, the application uses the id_str field at all times, and a special decrement function has been written in usage/utils.js to operate on the string ID.

See Working with 64-bit id in Twitter
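
To illustrate the precision problem and one way around it, here is a minimal sketch of decrementing an ID held as a decimal string. This is not the function from usage/utils.js; the name decrementIdString and its error handling are assumptions made for the example.

```javascript
// A Number cannot represent every 64-bit integer exactly:
console.log(Number('9007199254740993')); // 9007199254740992

// Hypothetical helper: subtract 1 from a non-negative decimal string.
function decrementIdString(idStr) {
  if (!/^\d+$/.test(idStr)) {
    throw new TypeError('expected a string of decimal digits');
  }
  const digits = idStr.split('');
  let i = digits.length - 1;
  // Borrow through trailing zeros: "1000" becomes "0999".
  while (i >= 0 && digits[i] === '0') {
    digits[i] = '9';
    i -= 1;
  }
  if (i < 0) {
    throw new RangeError('cannot decrement zero');
  }
  digits[i] = String(Number(digits[i]) - 1);
  // Drop leading zeros but keep a single "0".
  return digits.join('').replace(/^0+(?=\d)/, '');
}

console.log(decrementIdString('9007199254740993')); // 9007199254740992
console.log(decrementIdString('1000'));             // 999
```

On Node.js versions with BigInt support, (BigInt(idStr) - 1n).toString() gives the same result without manual borrowing; the string version works on older runtimes too.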