Google has published a blog post giving technical details of Colossus, the internal file system that powers Google Cloud and many of its consumer services, including the search engine. The software platform manages the storage hardware running in Alphabet's data centers.
It moves information into and out of storage hardware for the applications that depend on it. At Google's scale, ensuring that users can access data reliably requires enormous computational resources, and one of Colossus's main objectives is detecting and repairing technical problems before they affect users.
Failure is natural
At Google's data centers, hardware is failing all the time, not because it is unreliable, but because there is simply so much of it. This was explained by Dean Hildebrand, a technical director in Google Cloud's Office of the Chief Technology Officer, and Denis Serenyi, technical lead for Google's cloud storage.
Failures are inevitable at this scale, so it is essential that the file system provides fault tolerance and transparent recovery. Colossus relies on background programs, which Google engineers have dubbed Custodians, to keep everything operating smoothly.
Custodians and Curators
When one of the drives in the storage system fails, the Custodians can reassemble the lost information from the data remaining on the still-working drives. These programs also handle tasks such as maintaining the durability of Google's storage environments, which reach several exabytes in size and run on thousands of machines.
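Google has not published the encoding Colossus uses internally, but the general principle behind rebuilding a lost drive's data can be sketched with simple XOR parity (production systems typically use stronger schemes such as Reed-Solomon codes). In this hypothetical example, one parity chunk lets us recover any single missing data chunk:

```python
# Illustrative sketch only: this is XOR parity, not Colossus' actual scheme.

def make_parity(chunks):
    """Compute a parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_chunks, parity):
    """Recover the single missing chunk from the survivors plus parity."""
    missing = bytearray(parity)
    for chunk in surviving_chunks:
        for i, b in enumerate(chunk):
            missing[i] ^= b
    return bytes(missing)

# Three equal-sized chunks stand in for data spread across three drives.
data = [b"alpha", b"bravo", b"gamma"]
parity = make_parity(data)

# The second "drive" fails; rebuild its chunk from the survivors and parity.
recovered = rebuild([data[0], data[2]], parity)
assert recovered == b"bravo"
```

Because XOR is its own inverse, XORing the parity with every surviving chunk cancels them out and leaves exactly the missing chunk, which is why the rebuild needs only the remaining drives.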
The system also has Curators, which control the metadata for everything that customers and Google services store in Colossus and determine the best storage options for it. Based on that metadata and the information provided about the data, Colossus can automatically assign it to the most suitable hardware.
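The Curators' actual placement logic is not public, but the idea of metadata-driven placement can be sketched as a simple tiering decision. The `FileMetadata` fields and tier names below are hypothetical, chosen only to illustrate routing hot data to fast media and cold data to cheap media:

```python
# Hypothetical sketch of metadata-driven placement; not Colossus' real logic.
from dataclasses import dataclass

@dataclass
class FileMetadata:
    size_bytes: int
    reads_per_day: float  # assumed access-frequency signal

def choose_tier(meta: FileMetadata) -> str:
    """Pick a storage tier from the metadata: hot data on flash, cold on disk."""
    if meta.reads_per_day >= 100:
        return "flash"    # frequently read: low-latency media
    if meta.reads_per_day >= 1:
        return "disk"     # warm data: spinning disks
    return "archive"      # rarely touched: cheapest media

assert choose_tier(FileMetadata(1 << 20, 5000)) == "flash"
assert choose_tier(FileMetadata(1 << 30, 0.01)) == "archive"
```

The point of the sketch is that the decision is driven entirely by metadata, so it can be made automatically without the application describing the hardware it wants.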
These complexities are hidden behind an abstraction layer, so even non-technical users can make extensive use of Colossus while it handles tasks like data placement behind the scenes.