Our friends and colleagues who are data scientists would describe the ideal predictive modeling project as a situation where:
The above list is aspirational, and data scientists are lucky to encounter a problem that satisfies all of these criteria (the authors feel lucky to find a problem that satisfies even half of them!).
Today, we present a dataset, platform, and domain that we believe satisfies the criteria set forth above!
GH-Archive logs a tremendous amount of data from GitHub by ingesting most of these events from the GitHub REST API. These events are sent from GitHub to GH-Archive in JSON format, referred to as a payload. Below is an example of a payload that is received when an issue is edited:
As you can imagine, there are a large number of payloads, given the number of event types and users on GitHub. Thankfully, this data is stored in BigQuery, which allows for fast retrieval through a SQL interface! It is very economical to acquire this data, as Google gives you $300 in credit when you first sign up for an account, and if you already have one, the costs are very reasonable.
Since the data is in a JSON format, the syntax for un-nesting this data may be a bit unfamiliar. We can use the JSON_EXTRACT function to get the data we need. Below is a toy example of how you might extract data from issue payloads:
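To get a feel for what that extraction does without running BigQuery, here is a toy Python sketch that pulls the same kinds of fields out of an issue payload using the standard json module. The payload below is a trimmed, hypothetical example, not a complete GitHub event:

```python
import json

# A trimmed, hypothetical IssuesEvent payload with the fields you would
# typically extract with JSON_EXTRACT in BigQuery.
payload = json.dumps({
    "action": "edited",
    "issue": {
        "title": "App crashes on startup",
        "body": "Steps to reproduce: ...",
        "labels": [{"name": "bug"}],
    },
})

def extract_issue_fields(raw_payload: str) -> dict:
    """Pull the action, title, body, and label names out of an issue payload."""
    data = json.loads(raw_payload)
    issue = data.get("issue", {})
    return {
        "action": data.get("action"),
        "title": issue.get("title"),
        "body": issue.get("body"),
        "labels": [label["name"] for label in issue.get("labels", [])],
    }

fields = extract_issue_fields(payload)
print(fields["title"])   # App crashes on startup
print(fields["labels"])  # ['bug']
```

In BigQuery, JSON_EXTRACT and JSON_EXTRACT_SCALAR do the equivalent work with JSONPath expressions such as `$.issue.title`.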
A step-by-step explanation of how to extract GitHub issues from BigQuery can be found in the appendix section of this article. However, it is important to note that more than issue data is available: you can retrieve data for almost anything that happens on GitHub! You can even retrieve large corpora of code from public repos in BigQuery.
The GitHub platform allows you to build apps that can perform many actions, such as interacting with issues, creating repositories, or fixing code in pull requests. Since all that is required of your app is to receive payloads from GitHub and make calls to the REST API, you can write the app in any language of your choice, including Python.
Most importantly, the GitHub marketplace gives you a way to list your app on a searchable platform and charge users a monthly subscription. This is a great way to monetize your ideas. You can even host unverified free apps as a way to collect feedback and iterate.
Surprisingly, there are not many GitHub apps that use machine learning, despite the availability of these public datasets! Raising awareness of this is one of the motivations for this blog post.
In order to show you how to create your own apps, we will walk you through the process of creating a GitHub app that can automatically label issues. Note that all of the code for this app, including the model training steps are located in this GitHub repository.
First, you will need to set up your development environment. Complete steps 1–4 of this article. You do not need to read the section on “The Ruby Programming Language”, or any steps beyond step 4. Make sure you set up a Webhook secret even though that part is optional.
Note that there is a difference between GitHub Apps and OAuth Apps. For the purposes of this tutorial, we are interested in GitHub Apps. You don’t need to worry about this too much, but the distinction is good to know in case you are going through the documentation.
Your app will need to interact with the GitHub API in order to perform actions on GitHub. It is useful to use a pre-built client in the programming language of your choice in order to make life easier. While the official docs on GitHub show you how to use a Ruby client, there are third-party clients for many other languages, including Python. For the purposes of this tutorial, we will be using the Github3.py library.
One of the most confusing aspects of interfacing with the GitHub API as an app is authentication. For the following instructions, use the curl commands, not the Ruby examples in the documentation.
First, you must authenticate as an app by signing a JSON Web Token (JWT). Once you have signed a JWT, you may use it to authenticate as an app installation. Upon authenticating as an app installation, you will receive an installation access token which you can use to interact with the REST API.
Note that authenticating as an app is done via a GET request, whereas creating an installation access token is done via a POST request. Even though this is illustrated in the example curl commands, it is a detail that we missed when getting started.
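The flow above can be sketched in Python. The JWT claims below use only the standard library; the commented-out portion shows what signing and exchanging the token might look like, assuming the PyJWT and requests libraries and an app ID and installation ID that are placeholders, not real values:

```python
import time

def build_jwt_claims(app_id: int, ttl_seconds: int = 600) -> dict:
    """Claims for the JWT used to authenticate as a GitHub App.
    GitHub caps the expiration at 10 minutes after issuance."""
    now = int(time.time())
    return {
        "iat": now,                # issued-at time
        "exp": now + ttl_seconds,  # expiration, at most 10 minutes out
        "iss": app_id,             # the GitHub App's ID
    }

claims = build_jwt_claims(app_id=12345)  # 12345 is a placeholder app ID

# Sketch of the remaining steps (requires PyJWT, requests, and the
# private key you downloaded when registering the app):
#
#   token = jwt.encode(claims, private_key, algorithm="RS256")
#   resp = requests.post(
#       f"https://api.github.com/app/installations/{installation_id}/access_tokens",
#       headers={"Authorization": f"Bearer {token}",
#                "Accept": "application/vnd.github.machine-man-preview+json"})
#   installation_token = resp.json()["token"]
```

The installation access token returned by that POST is what you use for subsequent REST API calls on behalf of the installation.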
Knowing the above authentication steps is useful even though you will be using the Github3.py library, as there may be routes the library does not support that you want to implement yourself using the requests library. This was the case for us, so we ended up writing a thin wrapper around the Github3.py library called mlapp to help us interact with issues, which is defined here.
Below is code that can be used to create an issue, make a comment, and apply a label. This code is also available in this notebook.
You can see the issue created by this code here.
As mentioned previously, we can use GH-Archive, hosted on BigQuery, to retrieve examples of issues. Additionally, we can retrieve the labels that people manually applied to each issue. Below is the query we used to build a Pareto chart of all of these labels:
This spreadsheet contains the data for the entire Pareto chart. There is a long tail of issue labels that are not mutually exclusive; for example, the enhancement and feature labels could be grouped together. Furthermore, the quality and meaning of labels may vary greatly by project. Despite these hurdles, we decided to simplify the problem by grouping as many labels as possible into three categories (feature request, bug, and question), using heuristics we constructed after manually inspecting the top ~200 labels. Additionally, we consulted with the maintainers of a large open source project, Kubeflow, as our first customer, to validate our intuitions.
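The real grouping heuristics live in the SQL query linked later in this section, but the idea can be sketched in Python. The patterns below are hypothetical, simplified stand-ins for the actual rules:

```python
import re

# Hypothetical, simplified label-grouping heuristics. The real rules were
# constructed by manually inspecting the top ~200 labels.
CATEGORY_PATTERNS = {
    "feature_request": re.compile(r"feature|enhancement|proposal", re.I),
    "bug": re.compile(r"bug|crash|defect|broken", re.I),
    "question": re.compile(r"question|support|help wanted", re.I),
}

def categorize_label(label: str):
    """Map a raw GitHub label to one of three categories, or None if it
    does not clearly belong to any of them."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(label):
            return category
    return None

print(categorize_label("enhancement"))      # feature_request
print(categorize_label("bug: regression"))  # bug
print(categorize_label("wontfix"))          # None
```

Labels that map to no category (or to more than one, in the full version) are excluded from the training set.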
We experimented with creating a fourth category, other, in order to have negative samples of items not in the first three categories. However, we discovered that this information was noisy, as there were many bugs, feature requests, and questions in this “other” category. Therefore, we limited the training set to issues that we could exclusively categorize as a feature request, bug, or question.
It should be noted that this arrangement of the training data is far from ideal, as we want our training data to resemble the distribution of real issues as closely as possible. However, our goal was to construct a minimum viable product with the least time and expense possible and iterate later, so we moved forward with this approach.
Finally, we took special care to de-duplicate issues. To be conservative, we resolved the following types of duplicates (by arbitrarily choosing one issue in the duplicate set):
The SQL query used to categorize and deduplicate issues can be viewed at this link. You don’t have to run this query, as our friends from the Kubeflow project have run it and are hosting the resulting data as CSV files in a Google Cloud Storage bucket, which you can retrieve by following the code in this notebook. An exploration of the raw data, as well as a description of all the fields in the dataset, is also located in the notebook.
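The de-duplication step above is implemented in that SQL query, but the underlying idea can be sketched in plain Python: for each duplicate key, arbitrarily keep the first issue seen. The key function and the sample issues here are illustrative assumptions:

```python
# Minimal sketch of de-duplication: group issues by a duplicate key
# (here, identical title + body) and arbitrarily keep the first one.
def deduplicate(issues, key=lambda issue: (issue["title"], issue["body"])):
    seen = set()
    unique = []
    for issue in issues:
        k = key(issue)
        if k not in seen:
            seen.add(k)
            unique.append(issue)
    return unique

issues = [
    {"id": 1, "title": "Crash on save", "body": "stack trace ..."},
    {"id": 2, "title": "Crash on save", "body": "stack trace ..."},  # duplicate
    {"id": 3, "title": "Add dark mode", "body": "please"},
]
print([i["id"] for i in deduplicate(issues)])  # [1, 3]
```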
Now that we have the data, the next step is to build and train the model. For this problem, we decided to borrow a text pre-processing pipeline that we built for a similar problem and apply it here. This pre-processing pipeline cleans the raw text, tokenizes the data, builds a vocabulary, and pads the sequences of text to equal length, which are steps that are outlined in the “Prepare and Clean Data” section of our prior blog post. The code that accomplishes this task for issue labeling is outlined in this notebook.
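The actual pipeline is in the notebook linked above; as a rough illustration, here is a toy, standard-library-only version of those steps (clean, tokenize, build a vocabulary, pad to equal length). All the function names and token IDs here are assumptions for illustration, not the repo's implementation:

```python
import re
from collections import Counter

def tokenize(text: str):
    """Crude cleaning + tokenization: lowercase and keep alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(docs, min_count=1):
    """Build a token -> integer-id vocabulary from a corpus."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    vocab = {"<pad>": 0, "<oov>": 1}  # reserve 0 for padding, 1 for OOV
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def vectorize(text, vocab, maxlen=6):
    """Convert text to a fixed-length sequence of token ids."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:maxlen]
    return ids + [0] * (maxlen - len(ids))  # right-pad with zeros

docs = ["App crashes on startup", "Add dark mode please"]
vocab = build_vocab(docs)
print(vectorize("app crashes again", vocab))
```

The padded integer sequences (one for titles, one for bodies) are what get fed into the model's two inputs.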
Our model takes two inputs, the issue title and body, and classifies each issue as either a bug, feature request, or question. Below is our model’s architecture, defined with tensorflow.keras:
A couple of notes about this model:
Evaluating the model
Below is a confusion matrix showing our model’s accuracy on a test set of the three categories. The model really struggles to categorize questions but does a fairly decent job at distinguishing bugs from features.
Note that since our test set is not representative of all issues (as we filtered the dataset to only those that we could categorize), the accuracy metrics above should be taken with a grain of salt. We somewhat mitigate this problem by gathering explicit feedback from our users, which allows us to re-train our model and debug problems very fast. We discuss the explicit feedback mechanism in a later section.
Below are model predictions on toy examples. The full code is available in this notebook.
We wanted to choose reasonable thresholds so the model is not spamming people with too many incorrect predictions (this means that our app may not offer any predictions in some cases). We selected thresholds by testing our system on several repos and consulting with several maintainers on an acceptable false positive rate.
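The thresholding logic can be sketched as follows. The threshold values here are made-up placeholders; the real ones were tuned on several repos with maintainer feedback, as described above:

```python
# Hypothetical per-class probability thresholds (placeholder values, not
# the ones actually tuned with maintainers).
THRESHOLDS = {"bug": 0.70, "feature_request": 0.70, "question": 0.80}

def predict_label(probs: dict):
    """Return the top class if its probability clears that class's
    threshold; otherwise return None, i.e. make no prediction."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= THRESHOLDS[label] else None

print(predict_label({"bug": 0.85, "feature_request": 0.10, "question": 0.05}))  # bug
print(predict_label({"bug": 0.40, "feature_request": 0.35, "question": 0.25}))  # None
```

Returning None in the low-confidence case is what lets the app stay silent rather than spam maintainers with dubious labels.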
Now that you have a model that can make predictions, and a way to programmatically add comments and labels to issues (step 2), all that is left is gluing the pieces together. You can accomplish this with the following steps:
A great way to accomplish this is to use a framework like Flask and a database interface like SQLAlchemy. If you are already familiar with Flask, below is a truncated version of the code that applies predicted issue labels when notified by GitHub that an issue has been opened:
We leave it as an exercise for the reader to go through the rest of the flask code in our GitHub repository.
As illustrated above, explicit feedback is requested by asking users to react with 👍 or 👎 to a prediction. We can store these reactions in a database, which allows us to re-train and debug our models. This is perhaps one of the most exciting and important aspects of launching a data product as a GitHub App!
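The feedback-storage idea can be sketched with the standard library's sqlite3 module. The schema, repo names, and issue numbers below are illustrative assumptions; the app itself uses SQLAlchemy against a real database:

```python
import sqlite3

# Hypothetical minimal schema for storing 👍/👎 reactions to predictions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feedback (
        repo TEXT,
        issue_number INTEGER,
        predicted_label TEXT,
        reaction TEXT CHECK (reaction IN ('+1', '-1'))
    )
""")

def record_feedback(repo, issue_number, predicted_label, reaction):
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?, ?)",
        (repo, issue_number, predicted_label, reaction),
    )
    conn.commit()

# Example rows (made-up issue numbers):
record_feedback("kubeflow/kubeflow", 42, "bug", "+1")
record_feedback("kubeflow/kubeflow", 57, "question", "-1")

# Positive-feedback rate per label, useful when deciding what to re-train:
rows = conn.execute(
    "SELECT predicted_label, AVG(reaction = '+1') FROM feedback "
    "GROUP BY predicted_label"
).fetchall()
print(rows)
```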
You can see more examples of predictions and user feedback on our app’s homepage. For example, this is the page for the kubeflow/kubeflow repo:
If you enjoy what you have read thus far and want to support this project, please install this app on your public repositories (this app will not make predictions on private repos even if installed there), and give our bot feedback when it makes predictions 👍 👎.
One aspect we did not cover is how to serve your app at scale. When you are just starting out, you probably do not need to worry about this and can serve this on a single server with your favorite cloud provider. You can also use a service like Heroku, which is covered in the course on Flask linked in the resources section below.
In Part II, we will cover the following:
We believe there are many opportunities to improve upon the approach we illustrated in this post. Some ideas we have in mind are:
We hope you enjoyed this blog post. Please feel free to get in touch with us:
Any ideas or opinions presented in this article are our own. Any ideas or techniques presented do not necessarily foreshadow future products of any company. This blog is intended for educational purposes only.