As part of an academic study, researchers scanned thirteen percent of all public GitHub repositories. The more than two billion files scanned showed that more than 100,000 of those repositories contained API tokens and cryptographic keys. Thousands of new repositories leak secrets every day.

The scan was performed by a team from North Carolina State University (NCSU), and the results were immediately shared with GitHub. GitHub then decided to accelerate the development of its new security feature, Token Scanning, which is currently in beta.

Broad analysis

The NCSU study is the most in-depth study of GitHub to date. GitHub accounts were scanned over a period of almost six months, between 31 October 2017 and 20 April 2018. The researchers searched specifically for text that resembles API tokens and cryptographic keys.

The researchers not only used the GitHub Search API to search for those text patterns, but also examined snapshots stored in Google’s BigQuery database. In total, 4,493,473 files from 681,784 repositories were examined via the GitHub Search API. This data was supplemented by a further 2,312,763,353 files from 3,374,973 repositories in the BigQuery database.

Analyzing keys

Because not all API tokens and cryptographic keys are written in the same format, the NCSU team included 15 API token formats in the study, covering 15 different services from 11 companies, five of which are in the Alexa Top 50. Four cryptographic key formats were also included.
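To illustrate what searching for such formats can look like, here is a minimal Python sketch. The two patterns are illustrative assumptions (a commonly cited shape for AWS access key IDs and the standard PEM private key header) and are not the pattern list used in the study; the names `PATTERNS` and `find_candidate_secrets` are hypothetical.

```python
import re

# Illustrative patterns only; these are NOT the study's actual pattern list.
# Assumption: AWS access key IDs start with "AKIA" followed by 16 uppercase
# letters or digits, and PEM private keys open with a "BEGIN ... PRIVATE KEY" header.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem_private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_candidate_secrets(text):
    """Return (pattern_name, matched_text) pairs for every match in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\n-----BEGIN RSA PRIVATE KEY-----\n'
print(find_candidate_secrets(sample))
```

A match only means the text resembles a secret. A real scanner would still need to filter out placeholders and test values.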

The researchers found thousands of matches every day. A total of 757,456 API keys were found. Of these, 201,642 were unique and spread over more than 100,000 GitHub projects. Because the investigation lasted six months, the team was also able to track how long these keys remained publicly accessible.
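As a quick check on these totals, the short Python sketch below works out how much duplication they imply. It assumes the 757,456 figure counts every occurrence of a key, including repeats of the same key; the variable names are hypothetical.

```python
# Figures reported in the article.
total_keys_found = 757_456
unique_keys = 201_642

# Share of all found keys that were distinct.
unique_share = unique_keys / total_keys_found

# Average number of times each distinct key appeared, assuming the total
# counts every occurrence, duplicates included.
avg_copies = total_keys_found / unique_keys

print(f"unique share: {unique_share:.1%}")               # unique share: 26.6%
print(f"average copies per unique key: {avg_copies:.2f}")  # 3.76
```

Under that assumption, roughly one in four keys found was distinct, and each distinct key turned up almost four times on average.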

Secrets remain public

According to the researchers, 6 percent of the keys that were tracked were removed within an hour. This implies that the owners of those GitHub repositories became aware of their mistake almost immediately. 12 percent of the keys had been removed within a day and 19 percent within 16 days. This also means that 81 percent of the secrets discovered were not removed, according to the researchers.
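The arithmetic behind these figures can be sketched in Python as follows. This assumes the three percentages are cumulative, which is the reading consistent with the 81 percent figure; the function name `removal_windows` is hypothetical.

```python
def removal_windows(cumulative):
    """Turn cumulative removal percentages into per-window percentages.

    `cumulative` is a list of (checkpoint_label, percent_removed_so_far).
    """
    windows = []
    previous = 0
    for label, percent in cumulative:
        windows.append((label, percent - previous))
        previous = percent
    return windows


# Assumption: the article's figures are cumulative shares of tracked keys.
removed_by = [("1 hour", 6), ("1 day", 12), ("16 days", 19)]

for label, percent in removal_windows(removed_by):
    print(f"newly removed by {label}: {percent}%")

still_public = 100 - removed_by[-1][1]
print(f"still public after 16 days: {still_public}%")  # 81%
```

Under this reading, most of the clean-up happens early: 6 percent within the first hour, another 6 percent by the end of the first day, and only 7 percent more over the following two weeks.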

In response, GitHub has decided to accelerate work on Token Scanning, a project that allows the company to detect such leaks and report them to developers.
