GitHub uses AI to identify open issues in projects

Using AI, GitHub will recommend issues that match users' interests. That should make it less intimidating for them to start contributing to a project.

Large open source projects on GitHub often have long lists of issues that need to be addressed. To make it easier for newcomers to find a suitable one to start with, GitHub introduced the “good first issues” feature, which points contributors to issues that are likely to interest them. The first version, launched in May 2019, based its recommendations on labels that project maintainers had applied to certain issues. The updated version, launched last month, adds an AI algorithm that, according to GitHub, surfaces issues in about 70% of the repositories recommended to users.

GitHub notes that this is the first deep-learning product to launch on GitHub.com.

Less manual work

According to Tiferet Gazit, senior machine learning engineer at GitHub, the company conducted an analysis and a manual survey last year to compile a list of about 300 labels used in popular open source repositories. These labels were all synonyms for ‘good first issue’ or ‘documentation’, such as ‘beginner friendly’, ‘easy bug fix’ and ‘low-hanging fruit’. Relying on these labels alone, however, meant that only about 40% of the recommended repositories had issues that could be surfaced this way. Moreover, maintainers still had to label the issues themselves.
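As a rough illustration of the label-based approach, the sketch below checks an issue's labels against a list of synonyms and measures what share of repositories has at least one match. It is not GitHub's implementation: the five-entry label set stands in for the list of roughly 300 labels, and the issue format (a dict with a `labels` list) and the function names are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for the list of about 300 synonyms described in the article.
BEGINNER_LABELS = {
    "good first issue",
    "documentation",
    "beginner friendly",
    "easy bug fix",
    "low-hanging fruit",
}


def normalize(label: str) -> str:
    """Lowercase a label and treat hyphens, underscores and extra spaces alike."""
    cleaned = label.lower().replace("-", " ").replace("_", " ")
    return " ".join(cleaned.split())


NORMALIZED_LABELS = {normalize(label) for label in BEGINNER_LABELS}


def is_beginner_friendly(issue: Dict) -> bool:
    """Return True if any label on the issue matches the synonym list."""
    return any(normalize(label) in NORMALIZED_LABELS for label in issue.get("labels", []))


def coverage(repos: Dict[str, List[Dict]]) -> float:
    """Fraction of repositories with at least one matching issue."""
    if not repos:
        return 0.0
    hits = sum(
        1 for issues in repos.values()
        if any(is_beginner_friendly(issue) for issue in issues)
    )
    return hits / len(repos)
```

A repository whose maintainers never apply such labels contributes nothing to the coverage figure, which is the limitation the label-only approach ran into.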

The new AI system, on the other hand, is largely automatic. To build it, however, a training dataset of hundreds of thousands of samples had to be created first.

GitHub started with issues that carried one of the roughly 300 labels on the list, then supplemented them with a few other sets of issues that were also likely to be beginner-friendly. After detecting and removing duplicate issues, the data was divided into separate training, validation, and test sets, so that the sets would not contaminate one another.
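The preparation described above, removing duplicates and then keeping training, validation, and test data apart, might look roughly like the following. This is a minimal sketch under stated assumptions, not GitHub's pipeline: it only catches duplicates that are identical after lowercasing and whitespace cleanup, it assigns whole repositories to one set by hashing the repository name (an assumption chosen for the example), and the issue fields `repo`, `title` and `body` are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(issue: Dict[str, str]) -> str:
    """Hash the lowercased, whitespace-normalized title and body of an issue."""
    text = " ".join((issue["title"] + " " + issue["body"]).lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(issues: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first issue seen for each fingerprint."""
    seen = set()
    unique = []
    for issue in issues:
        key = fingerprint(issue)
        if key not in seen:
            seen.add(key)
            unique.append(issue)
    return unique


def assign_split(repo: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a repository name to one set, deterministically (assumed scheme)."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_dataset(issues: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Deduplicate, then place each issue in the set its repository maps to."""
    splits: Dict[str, List[Dict[str, str]]] = {"train": [], "validation": [], "test": []}
    for issue in deduplicate(issues):
        splits[assign_split(issue["repo"])].append(issue)
    return splits
```

Because every issue from a given repository lands in the same set, similar issues from one project cannot appear on both the training and the evaluation side.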