Introduction
GitHub 3rd Annual Data Challenge is a data analytics project aimed at making sense of millions of data points gathered from the repositories hosted on github.com. GitHub is known to be the home of thousands of sophisticated and widely known repositories, which leaves us wondering: what factors have contributed to their success?
The Questions
Hacker News, a very popular social news website catering to programmers and entrepreneurs, often features posts linking to GitHub repositories, many of which are very popular.
How does linking a GitHub repository on Hacker News change its popularity? Is there such a thing as 'the Hacker News popularity bump'?
Those are the questions we want to answer with this project. We want to determine whether being mentioned on Hacker News significantly contributes to a repository's popularity and, if so, by how much. Popularity is defined as the raw number of 'stars' a repository has. To answer this 'simple' question, we studied the growth rate of each of the 1.5 million repositories a week before and after the date it was mentioned on Hacker News, if it was mentioned at all, then averaged the results across all repositories.
tl;dr
Default slider settings: Stars: 10, Hacker News upvotes: 10
At these settings, a Hacker News mention increases the repo's daily star growth rate by 38.15 percentage points, from -0.71% to 37.45%. The size of the increase varies with your repository's star count and your Hacker News post's upvotes.
The graph shows only the data matching the slider constraints. The red line shows the daily percentage increase in stars during the week before and after the Hacker News mention. The blue line shows the daily percentage increase in stars for all repositories during the week surrounding the expected date of mention.
Understanding tl;dr
The sliders control the two search constraints: the number of upvotes on a Hacker News post and the number of stars on the linked GitHub repository. They represent different tiers of the data, because a mention affects a 50,000-star repository very differently than a 50-star one. The Hacker News upvotes slider sets the minimum upvotes a post needs to count toward the "popularity bump". Note that repositories that were never mentioned on Hacker News are not constrained by this slider.
The red line represents the percentage growth in stars each day, calculated with the generic growth formula. For instance, if a repository has 100 stars today and 120 stars tomorrow, the growth is (120 - 100)/100 = 20%. We processed and visualized the data this way, instead of using raw star counts, because repositories with huge numbers of stars (cough cough Twitter's Bootstrap) skewed the raw data too far.
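The growth calculation above can be sketched in a few lines (a minimal illustration; the star counts are made-up sample values, not our dataset):

```python
# Daily percentage growth in stars, using the generic growth formula:
# growth = (today - yesterday) / yesterday * 100

def daily_growth(star_counts):
    """Return day-over-day percentage growth for a list of daily star totals."""
    return [
        (today - yesterday) / yesterday * 100
        for yesterday, today in zip(star_counts, star_counts[1:])
    ]

# The example from the text: 100 stars today, 120 tomorrow
print(daily_growth([100, 120]))  # [20.0]
```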
The blue line represents the expected growth of a GitHub repository that was never mentioned on Hacker News, which is extremely useful for determining whether Hacker News has any effect at all. The data here is not the same 15-day window, but rather the expected growth absent any Hacker News effect.
Obtaining Data
To gather information about the popularity of repositories over time, we attempted to recreate a timeline of all GitHub events relevant to this study (e.g. a user starring a repository). Unfortunately, this data was not readily available to download; it had to be obtained from alternative sources. There were three options available:
- GitHub's native API: the API hosted by GitHub itself
- Google BigQuery: a RESTful web service enabling interactive analysis of massive datasets
- GitHub Archive: an open-source project by Google engineer Ilya Grigorik
All of these data sources provided the information our study needed. However, we ran into a few hiccups along the way.
We first tried gathering data from GitHub's native API using a Python script and a REST library. Getting the correct queries and responses from the GitHub API was fairly simple. The only drawback we saw with this approach was the limit of 10,000 API calls per hour, until we noticed discrepancies in the data we got back: we did not see the entire event history for some extremely popular repositories. We later found out this was due to a 'pagination limit': even if the history of a repository spanned more than 10 pages, we could only see up to the 10th page. We did not know how to work around this problem, so we turned to the next option, Google BigQuery.
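The pagination problem can be illustrated with a small sketch. The repository name and page count below are made up; the point is that any pages past the cap are simply unreachable:

```python
# Sketch of the pagination limit we hit with GitHub's events API: results
# come back in pages, but the API only serves a fixed number of pages per
# endpoint, so older events fall off the end for very active repositories.

API_PAGE_LIMIT = 10  # pages the API will serve before cutting off

def event_page_urls(repo, total_pages):
    """Return (URLs we would need, URLs the API will actually serve)."""
    wanted = [
        f"https://api.github.com/repos/{repo}/events?page={p}"
        for p in range(1, total_pages + 1)
    ]
    served = wanted[:API_PAGE_LIMIT]
    return wanted, served

# A popular repo whose history spans 25 pages loses everything past page 10.
wanted, served = event_page_urls("twbs/bootstrap", 25)
```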
Google BigQuery really blew us away with its computing power. It could crunch through gigabytes of data in a matter of seconds. The only thing it needed was a SQL-like query. The data could be downloaded in JSON or CSV format once the queries completed. This source proved to be ideal until we found out that the computing power we needed would cost more than ‘Free’. Moving on.
The last option was GitHub Archive. After finding a Ruby script example, we tweaked it to download all the data files for later processing. However, the files were in GZIP format, a compression format used to store large files on the web. We had to tweak the Ruby script further to unzip and read the GZIP files, since Python could not finish this particular task. All the event history data could finally be obtained.
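GitHub Archive publishes one gzipped JSON file per hour, named by timestamp. Enumerating the files to fetch looks roughly like this (the host shown is illustrative; check the project site for the current one):

```python
# Sketch of enumerating GitHub Archive's hourly dump files for a date range.
from datetime import datetime, timedelta

BASE = "http://data.githubarchive.org"  # illustrative host

def hourly_archive_urls(start, end):
    """Yield one URL per hour from start up to (not including) end."""
    t = start
    while t < end:
        # Files are named like 2013-01-01-0.json.gz (hour not zero-padded)
        yield f"{BASE}/{t:%Y-%m-%d}-{t.hour}.json.gz"
        t += timedelta(hours=1)

urls = list(hourly_archive_urls(datetime(2013, 1, 1), datetime(2013, 1, 2)))
# 24 files for a single day; a full year of events is ~8,760 files.
```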
Analyzing Data
Once all the data was collected and stored in 360 GB worth of JSON, we thought reading each file and doing the calculations was just a few steps away. Little did we know that reading through gigabytes of data would take as long as 5 hours per repository. Our research led us to a MySQL database, which promised much better lookup times. A Python script read all the JSON files, collected the relevant information, and stored it in a local MySQL database. This one-time process took seven and a half hours on a MacBook Pro.
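The JSON-to-database step can be sketched as follows. SQLite stands in for MySQL here, and the event fields follow the general shape of GitHub's event feed; both are illustrative, not our exact schema:

```python
# Minimal sketch of the JSON -> database step: keep one row per star
# (Watch) event, discard everything else.
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the local MySQL database
db.execute("CREATE TABLE events (repo_name TEXT, event_time TEXT)")

def load_events(lines):
    """Insert one row per star (Watch) event from raw JSON event lines."""
    for line in lines:
        event = json.loads(line)
        if event["type"] == "WatchEvent":  # a user starred a repository
            db.execute("INSERT INTO events VALUES (?, ?)",
                       (event["repo"]["name"], event["created_at"]))

# Two sample event lines: only the WatchEvent is kept.
sample = [
    '{"type": "WatchEvent", "repo": {"name": "twbs/bootstrap"}, "created_at": "2013-06-01"}',
    '{"type": "PushEvent",  "repo": {"name": "twbs/bootstrap"}, "created_at": "2013-06-01"}',
]
load_events(sample)
```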
Storing the relevant data in a MySQL database improved lookup speed significantly: we could fetch data much faster with SQL queries than by reading raw JSON files every time. But even at this speed, there were well over 17 million events on GitHub to account for. We agreed that repositories below a certain number of stars were not worth studying. The challenge was finding a threshold that would neither exclude too many 'good' repositories nor keep too many 'bad' ones; we settled on 5 stars. Using a complicated SQL query, with a little help from a Stack Overflow user, we created a table of all the repository names that needed to be checked for a Hacker News mention.
The next step was simple: iterate over the newly created table and check which repositories were mentioned on Hacker News. To do this, we used Hacker News' Algolia API, which powers their search engine. The API returned standard JSON responses: if the repository was mentioned on Hacker News, we got back all the posts it appeared in; if not, the response was empty. Some repositories had duplicates (i.e. more than one Hacker News post mentioned them). We decided to drop the 'unimportant' posts by keeping only the post with the most upvotes, using a tweak of the same kind of query we used to limit the GitHub data.
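The de-duplication step amounts to picking the highest-voted hit from each search response. A sketch, assuming the Algolia response's list of matching posts sits under a "hits" key with a "points" field per post (the sample response below is made up):

```python
# Sketch of de-duplicating Hacker News mentions: keep only the post with
# the most upvotes for each repository.
def top_mention(response):
    """Return the hit with the most upvotes, or None if never mentioned."""
    hits = response.get("hits", [])
    return max(hits, key=lambda hit: hit["points"]) if hits else None

# Hypothetical response for a repository mentioned twice.
response = {"hits": [
    {"title": "Show HN: my repo", "points": 12, "created_at": "2013-05-01"},
    {"title": "My repo, revisited", "points": 340, "created_at": "2013-08-09"},
]}
best = top_mention(response)  # the 340-point post wins
```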
At this point, we had enough data to cross-reference and create our final dataset: a table of all repositories mentioned on Hacker News at least once, along with their popularity data for the 7 days before and after the date of mention. We use this table for data visualization, for calculating the averages behind our general conclusion, and for quick statistics about the dataset as a whole.
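The 7-days-before-and-after window boils down to a simple date filter (dates below are illustrative):

```python
# Sketch of the final dataset's window: keep star events that fall within
# 7 days of a repository's Hacker News mention date.
from datetime import date, timedelta

def in_window(event_day, mention_day, days=7):
    """True if event_day falls within +/- `days` of mention_day."""
    return abs(event_day - mention_day) <= timedelta(days=days)

mention = date(2013, 8, 9)
events = [date(2013, 8, 1), date(2013, 8, 8), date(2013, 8, 20)]
windowed = [d for d in events if in_window(d, mention)]
# Only 2013-08-08 survives: the others fall outside the 15-day window.
```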
Visualizing Data
We plan to visualize this data using a line chart backed by the Google Charts API. The line chart will have 15 points on the x-axis, representing the 7 days before, the day of the Hacker News mention, and the 7 days after. The y-axis will measure the popularity unit, be it the raw number of stars or the daily growth rate. The data will then be transferred to a Google Spreadsheet, from which an HTML page will display the data queried via JavaScript and the Google Queries API.
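Shaping the chart rows is straightforward: pair each of the 15 relative days with its y-value (the growth numbers below are placeholders):

```python
# Sketch of shaping the chart's 15-point x-axis: day -7 through day +7,
# with day 0 as the Hacker News mention date.
def chart_rows(values_by_day):
    """Pair each of the 15 relative days with its y-axis value."""
    days = range(-7, 8)  # -7 .. 0 .. +7
    return list(zip(days, values_by_day))

rows = chart_rows([0.5] * 15)  # placeholder growth values
```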
Files Used (in order)
downloadAllData.rb - download all of the data into JSON files
JSONtoMySQL.py - transfer relevant JSON data to a MySQL database
searchDBForCurrentStars.py - responsible for getting most recent entry for each repository
getHNData.py - query Algolia API to get the hn data into DB (takes 55 hours)
hnTabletoHNTableMax.py - gets the most relevant HN event for each repo
getRelevantGHevents.py - gets the most relevant GH events based on HN mentions
getExpectedDateOfMention.py - gets the hn_expected_date_of_mention
getRelevantGHeventAvgCase.py - get GH events within 1 week of hn_expected_date_of_mention
analyzeData.py - analyzes the data and saves results into DB
analyzeDateAvgCase.py - analyzes the average cases data
get_chart_table.py - prepares the data that will be pasted into google spreadsheet
graph.py - prepares the HTML for visualization
Tables Used
chart_table_avg
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| day | int(6) | YES | | NULL | |
| slider_stars | int(6) | YES | | NULL | |
| slider_hn_points | int(6) | YES | | NULL | |
| daily_avg_stars | int(8) | YES | | NULL | |
| daily_growth | double(10,5) | YES | | NULL | |
| change_in_growth | double(10,5) | YES | | NULL | |
| num_data_points | int(6) | YES | | NULL | |
+------------------+--------------+------+-----+---------+-------+
chart_table
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| day | int(6) | YES | | NULL | |
| slider_stars | int(6) | YES | | NULL | |
| slider_hn_points | int(6) | YES | | NULL | |
| daily_total_stars | int(8) | YES | | NULL | |
| daily_growth | double(10,5) | YES | | NULL | |
| change_in_growth | double(10,5) | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
days_after
+-----------------+--------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------+------+-----+---------+-------+
| stars | int(6) | YES | | NULL | |
| hn_points | int(6) | YES | | NULL | |
| avg_days_after | int(6) | YES | | NULL | |
| mode_days_after | int(6) | YES | | NULL | |
| num_data_points | int(6) | YES | | NULL | |
+-----------------+--------+------+-----+---------+-------+
event_table_two
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
| repo_created | date | YES | | NULL | |
+--------------+--------------+------+-----+---------+-------+
event_table_condensed
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
event_table_general_condensed
+--------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
| repo_created | date | YES | | NULL | |
+--------------+--------------+------+-----+---------+-------+
hn_event_max
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| hn_points | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
hn_event_table_two
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| hn_points | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
latest_repo_events
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| hn_points | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
max_stars
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| repo_name | varchar(255) | YES | | NULL | |
| stars | int(6) | YES | | NULL | |
| event_time | date | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+