Flaw in Git bloated Microsoft repository by a factor of 35

When Microsoft’s JavaScript repository was growing faster than Office’s, it was clear that something was wrong. A Microsoft engineer explains how the repo slimmed down from 178GB to 5GB.

The Git repository in question is a monorepo: a single repository that holds the code for many different projects. In this case, it holds all of Microsoft’s JavaScript initiatives. There are quite a few: Teams, Visual Studio Code and Office Online are just some of the products that use the language.

Changelog problems

The Git repo has grown enormously in recent years. Microsoft engineer Jonathan Creamer explains that the repo was 2GB when he first cloned it, 4GB a few months later and most recently 178GB. Despite the usual optimizations, the repo remained huge.

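The culprit: name-hash collisions. When Git packs a repository, it uses a hash of each file path to pick which files to compare against one another, and that hash only takes the last 16 characters of the path into account. Files such as changelog.md and changelog.json live in completely different packages, but their paths end the same way, so package differences were not considered and Git treated them as versions of one file. With each commit, it computed differences between unrelated changelogs rather than between successive versions of the same one, which gradually added 173GB of unnecessary bloat.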

A salient detail: this algorithm was added to Git by Linus Torvalds himself, the creator of both Git and Linux. A fix is on its way.
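The collision described above can be illustrated with a deliberately simplified model. The sketch below is not Git's actual hash function: it simply uses the last 16 characters of a path as the grouping key, and the package paths and function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def simplified_name_key(path: str, width: int = 16) -> str:
    """Simplified stand-in for a name hash: keep only the path's last `width` characters."""
    return path[-width:]


def group_by_key(paths):
    """Group paths that share the same simplified key."""
    groups = defaultdict(list)
    for path in paths:
        groups[simplified_name_key(path)].append(path)
    return dict(groups)


# Hypothetical package layout, not taken from Microsoft's repository.
paths = [
    "packages/ui-button/CHANGELOG.json",
    "packages/nav-button/CHANGELOG.json",
    "packages/ui-button/src/index.ts",
]

groups = group_by_key(paths)
for key, members in groups.items():
    print(repr(key), members)
```

In this model, the two changelogs from different packages end up under the same key, so a tool that only looks at the key would compare them with each other even though they are unrelated files.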

Git popularity

To call Git popular is an understatement. In a Stack Overflow survey, 94 percent of developers said they use Git as their version control system. Its reputation matches its reach: some even call Git Linus Torvalds’ greatest achievement, ahead of Linux. The older SVN is used by 5 percent of respondents, possibly only in projects that predate Git itself, which originated in 2005.

For large projects, the discovery may be a welcome development that drastically reduces download sizes. Small repos will see little benefit, as they contain far fewer changelogs and similar files that recur under the same name across packages.
