GitWatch directs potential contributors to projects on GitHub. It does this by predicting the future number of pushes and watches for a given repository. Repositories are ranked by the predicted number of pushes, and the user can toggle the view to show only certain languages and/or only repositories that will become popular.
## Scrape data from the GitHub Archive

This process is very slow.
Create the MySQL tables in the GitWatch database with:

```sql
CREATE TABLE repo (
    id INT UNSIGNED,
    name TINYTEXT,
    private BOOL,
    created_at DATETIME,
    description TEXT,
    language TINYTEXT,
    watchers MEDIUMINT UNSIGNED
);

CREATE TABLE event (
    id INT UNSIGNED,
    type TINYINT UNSIGNED,
    timestamp DATETIME
);

CREATE INDEX id_index ON event (id);
```

(The index on `event` improves lookup performance considerably!)
`runExtractor.sh` downloads the JSON file that contains all the events for a given hour.
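For orientation, here is a minimal Python sketch of the kind of download `runExtractor.sh` performs; the `data.gharchive.org` host and URL scheme are assumptions (the GitHub Archive serves one gzipped JSON file per hour, and older mirrors used `data.githubarchive.org`):

```python
# Not the actual runExtractor.sh, just a sketch of the idea:
# fetch one hour of GitHub Archive events as a gzipped JSON file.
import urllib.request

def download_hour(date, hour):
    """Fetch the archive for a given hour, e.g. date='2015-01-01', hour=15."""
    filename = "%s-%d.json.gz" % (date, hour)
    url = "https://data.gharchive.org/" + filename  # assumed host
    urllib.request.urlretrieve(url, filename)
    return filename

if __name__ == "__main__":
    print(download_hour("2015-01-01", 15))
```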
`extractor.py` processes the JSON and records the relevant information in the GitWatch MySQL db.
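A sketch of that extraction step, not the actual `extractor.py`; the event-type encoding, the JSON field names, and the PyMySQL driver are all assumptions:

```python
# Walk one hour of gzipped events and record pushes/watches in the
# event table created above.
import gzip
import json
import pymysql

EVENT_TYPES = {"PushEvent": 0, "WatchEvent": 1}  # hypothetical encoding

def extract(filename):
    db = pymysql.connect(host="localhost", user="gitwatch",
                         password="...", database="GitWatch")
    cur = db.cursor()
    with gzip.open(filename, "rt") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") not in EVENT_TYPES:
                continue
            repo_id = event.get("repository", {}).get("id")  # assumed field
            # "2015-01-01T15:00:00Z" -> "2015-01-01 15:00:00" for DATETIME
            timestamp = event["created_at"].replace("T", " ").rstrip("Z")
            cur.execute("INSERT INTO event (id, type, timestamp) "
                        "VALUES (%s, %s, %s)",
                        (repo_id, EVENT_TYPES[event["type"]], timestamp))
    db.commit()
    db.close()
```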
`extractor_csv.py` writes its output to a file rather than to SQL, for running remotely. The results are moved into SQL locally with `csv_to_{repo,event}_sql.py`. Use `sort <filename> | uniq -u` to reduce the size of the `repo.csv` file!
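The loading scripts are not reproduced here; a hypothetical equivalent of `csv_to_repo_sql.py`, assuming the CSV columns match the `repo` table definition above and that `LOCAL INFILE` is enabled on the server:

```python
# Bulk-load the deduplicated CSV into the repo table.
import pymysql

def load_repo_csv(path):
    db = pymysql.connect(host="localhost", user="gitwatch",
                         password="...", database="GitWatch",
                         local_infile=True)
    cur = db.cursor()
    cur.execute("LOAD DATA LOCAL INFILE %s INTO TABLE repo "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
                (path,))
    db.commit()
    db.close()
```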
`populateDB.py` populates the rest of the database info in SQL using the GitHub API. The limit is 5k requests/hour with authentication.
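A sketch of the authenticated-API pattern behind that step (not the real `populateDB.py`); the token handling and the `requests` library are assumptions, while the rate-limit headers are standard GitHub API behavior:

```python
# Fetch repo metadata and back off when the 5k/hour limit is exhausted.
import time
import requests

TOKEN = "..."  # a personal access token raises the limit to 5k requests/hour

def fetch_repo(full_name):
    resp = requests.get("https://api.github.com/repos/" + full_name,
                        headers={"Authorization": "token " + TOKEN})
    if resp.headers.get("X-RateLimit-Remaining") == "0":
        reset = int(resp.headers["X-RateLimit-Reset"])
        time.sleep(max(0, reset - time.time()) + 1)  # wait out the window
    resp.raise_for_status()
    data = resp.json()
    return data["description"], data["language"], data["watchers_count"]
```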
## Process the data
`process_training.py` queries SQL and creates a CSV file, either for training or for applying the models produced in step 3 (Training).
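Illustrative only: the real `process_training.py` builds an n(repos) x 60 feature table (described below); this just shows the query-SQL-then-write-CSV shape of the step, with an assumed query and column names:

```python
# Dump per-repo event counts from MySQL to a CSV for the notebook.
import csv
import pymysql

def dump_training_csv(path):
    db = pymysql.connect(host="localhost", user="gitwatch",
                         password="...", database="GitWatch")
    cur = db.cursor()
    cur.execute("SELECT id, COUNT(*) FROM event GROUP BY id")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["repo_id", "n_events"])
        writer.writerows(cur.fetchall())
    db.close()
```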
## Training
This was done with the IPython notebook in the `training` directory. Check it out!
## Populate training to db

First create three new columns in the `repo` table, for `pred1`, `pred2`, and `hot`.
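One way to add those columns; the column types are assumptions, since the README does not specify them:

```python
# Add the prediction columns to the repo table.
import pymysql

db = pymysql.connect(host="localhost", user="gitwatch",
                     password="...", database="GitWatch")
cur = db.cursor()
cur.execute("ALTER TABLE repo "
            "ADD COLUMN pred1 FLOAT, "   # assumed type
            "ADD COLUMN pred2 FLOAT, "   # assumed type
            "ADD COLUMN hot BOOL")
db.commit()
db.close()
```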
`process_training.py` creates a table of dimension n(repos) x 60 onto which training can be applied.

`populateDB_withpred.py` fills in the values.

`maskrepos.py` imposes quality constraints on June and July when applying to October.
## Run the webapp
`run.py` runs the web app on the local machine.

`sudo supervisord -c simple.conf` runs the web app on AWS.
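For context, a hypothetical minimal `run.py`; the web framework is an assumption (the README does not name one), so treat this as a placeholder:

```python
# Placeholder app entry point, assuming a Flask app.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "GitWatch"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```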
This project started with the following mission: assign to a given repository stored on GitHub a probability that the repository contains a bug. The algorithm would have used the previous commit messages and NLP to assign this probability. The scripts for this process are in the directory `old_NLP_stuff`.
Potentially useful scripts that are no longer used in the baseline project are in `oldstuff`.