US-based Clockwork introduces FleetIQ, a software layer designed to make GPU clusters much more efficient. In large-scale AI training, billions in computing power are currently lost due to communication problems between processors.
Clockwork tackles this problem with FleetIQ, which it describes as a Software-Driven Fabric that provides real-time insight into GPU clusters. The system detects bottlenecks within microseconds and automatically reroutes traffic over alternative paths. In addition, stateful fault tolerance prevents entire AI jobs from having to be restarted after a failure. According to Meta, about 10 percent of Llama 3’s training time was lost to synchronization errors, hardware failures, and missing optimizations.
Clockwork’s technology stands out from other approaches. It is hardware-agnostic and works with both Nvidia and AMD processors, which is not the case for the optimizations used by the Chinese team behind DeepSeek. It runs over network technologies such as InfiniBand and Ethernet, both on-premises and in the cloud.
Billions wasted on AI training
AI training has become a communication problem. Whereas raw computing power used to be the bottleneck, the challenge now lies in synchronizing thousands of GPUs in a cluster. If one connection falters, the entire system comes to a standstill. Every such stall idles hardware and power that together cost billions of dollars.
The figures are telling. GPU clusters achieve only 30 to 55 percent of their theoretical performance. For a cluster of 100,000 GPUs, representing an investment of $5 to $7 billion, this means at least $2.25 billion in unused capacity, even in the best case of 55 percent utilization on a $5 billion cluster.
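The arithmetic behind that figure can be reproduced with a short sketch. It assumes, as a simplification not stated in the article, that unused capacity scales linearly with the hardware investment; the function name `unused_capacity_usd` is purely illustrative.

```python
def unused_capacity_usd(investment_usd: float, utilization: float) -> float:
    """Dollar value of the capacity that does no useful work.

    Assumption: waste is simply the unused fraction of the investment.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return investment_usd * (1.0 - utilization)


# Ranges quoted in the article for a 100,000-GPU cluster.
investments = (5e9, 7e9)      # $5 to $7 billion
utilizations = (0.30, 0.55)   # 30 to 55 percent of theoretical performance

# Best case: cheapest cluster at the highest utilization.
best_case = unused_capacity_usd(min(investments), max(utilizations))
# Worst case: most expensive cluster at the lowest utilization.
worst_case = unused_capacity_usd(max(investments), min(utilizations))

print(f"best case:  ${best_case / 1e9:.2f} billion unused")
print(f"worst case: ${worst_case / 1e9:.2f} billion unused")
```

Under these assumptions, the $2.25 billion cited is the lower bound; at 30 percent utilization on a $7 billion cluster the unused share would be considerably larger.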
Practical results
Clockwork’s early adopters are showing promising results. Uber reports significant network improvements in its hybrid multi-cloud environment thanks to the observability tooling. The tech company can now locate problems within minutes instead of hours.
European organizations are also benefiting from the solution. DCAI, the operator of Denmark’s AI supercomputer Gefion, reports that Clockwork helps run workloads more efficiently and reliably. Nebius is seeing improvements in the reliability of its AI infrastructure.
Funding and leadership
The FleetIQ launch is accompanied by new funding. Existing investor NEA led an investment round in which the company achieved a valuation four times higher than two years ago. New investors include Intel CEO Lip-Bu Tan and former Cisco CEO John Chambers.
The company, which was founded in Stanford, USA, has also attracted new leadership. Suresh Vasudevan, known from Nimble Storage and Sysdig, is the new CEO. Joe Tarantino, formerly with neocloud partner GMI Cloud, becomes VP Worldwide Sales.
“Communications is the new Moore’s Law,” says Vasudevan. Clockwork’s Software-Driven Fabric is designed to help organizations get more out of their existing infrastructure. According to him, this will be crucial for economically viable AI in the coming decade. Getting there will require further optimization, and Clockwork may deliver more progress on that front in the future.