How Our Team Optimizes Infrastructure for Minimal AI Video Processing Latency 

Over the past year, AI video diffusion models have delivered dramatic improvements in visual realism, as we’ve seen with OpenAI’s Sora 2, Google’s Veo 3, Runway Gen-4 and others. AI video generation is reaching an inflection point, and the latest models can create stunning clips with lifelike visuals.

However, the way these models are built prevents them from being used interactively and in real time, and when most AI practitioners talk about AI video, their focus is on generating clips to be watched later. To many, taking live video input from a camera and using AI to transform the output instantly still seems years away.

The obstacle here is largely architectural: these models generate chunks of video frames sequentially, through a series of complex, computationally intensive steps. Because each chunk must be fully processed before the model can start on the next one, latency is unavoidable, which makes live-streamed AI video impossible.

Our team at Decart decided to see if we could get around these obstacles, and LSD v2, our recently released model, validates the idea that achieving minimal latency is largely a matter of approach. To make it work, we developed and implemented a number of cutting-edge techniques, which we believe can be applied to various AI models.

Using these techniques, we were able to optimize the underlying infrastructure needed to run our model and maximize GPU utilization, while accelerating the denoising process necessary to prevent error accumulation. LSD v2 is powered by a causal, auto-regressive architecture that generates video instantly and continuously, without any limitations on the duration of its output. 

Here’s how we’re able to do it.

Infinite Generation

For video models to generate outputs on a streaming basis, they need to operate “causally,” producing each new frame based only on the frames that preceded it, which reduces the computational load.

Causal video models use an “auto-regressive” structure to ensure continuity. While this technique works well for short clips, the quality of their outputs degrades over time, due to “error accumulation,” where a small detail such as a shadow that’s slightly out of place becomes exaggerated with each new frame, gradually destroying the output’s coherence.  

Error accumulation has caused significant headaches for video model developers and is the main reason why the leading video models can only generate short sequences of a few seconds. To overcome this, we improved upon a technique known as “diffusion forcing,” which makes it possible to denoise each frame as it’s being generated. We combined diffusion forcing with “history augmentation,” which trains the underlying model to predict and recognize corrupted outputs. 
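
To make the idea concrete, here is a minimal, hypothetical training sketch of how diffusion forcing and history augmentation can fit together: each frame in a window gets its own noise level, and the conditioning history is itself corrupted, so the model learns to denoise the current frame from imperfect context. The tiny denoiser, the frame and prompt dimensions, and the loss below are illustrative assumptions, not our actual implementation.

```python
import torch

# Hypothetical per-frame denoiser: takes the noisy current frame, a (possibly
# corrupted) history of earlier frames, per-frame noise levels and a prompt
# embedding, and predicts the clean current frame.
class TinyFrameDenoiser(torch.nn.Module):
    def __init__(self, frame_dim=64, history_len=8, prompt_dim=32):
        super().__init__()
        in_dim = frame_dim * (history_len + 1) + (history_len + 1) + prompt_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 256), torch.nn.SiLU(),
            torch.nn.Linear(256, frame_dim),
        )

    def forward(self, noisy_frame, history, noise_levels, prompt):
        x = torch.cat([noisy_frame, history.flatten(1), noise_levels, prompt], dim=-1)
        return self.net(x)

def training_step(model, opt, clean_frames, prompt):
    """Diffusion forcing: every frame in the window gets its own noise level.
    History augmentation: the context frames fed to the model are themselves
    noised, so it learns to denoise from imperfect, corrupted history."""
    B, T, _ = clean_frames.shape
    noise_levels = torch.rand(B, T)  # independent noise level per frame
    noisy = clean_frames + noise_levels.unsqueeze(-1) * torch.randn_like(clean_frames)

    # Predict the clean current frame from the noisy current frame,
    # the corrupted history, the noise levels and the prompt.
    pred = model(noisy[:, -1], noisy[:, :-1], noise_levels, prompt)
    loss = torch.nn.functional.mse_loss(pred, clean_frames[:, -1])

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    model = TinyFrameDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    frames = torch.randn(4, 9, 64)   # (batch, history + current frame, features)
    prompt = torch.randn(4, 32)
    print(training_step(model, opt, frames, prompt))
```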

The result is a causal feedback loop: for each new frame, the model considers the earlier frames it has already generated, along with the current input frame and the user’s prompt, which allows it to quickly predict what the next output in the sequence should be.

The model is thereby equipped to identify and then correct any artifacts that appear in the output, preventing the accumulation of errors. It can thus generate high-quality content indefinitely while adapting continuously whenever the user enters a new prompt, making real-time editing and transformation a reality.
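
Below is a hedged sketch of what that feedback loop can look like at inference time: every incoming camera frame is refined over a handful of denoising passes conditioned on the frames the model has already generated (not the raw inputs) plus the prompt, and the result is fed back in as context for the next frame. The denoiser interface matches the training sketch above; the step count and the stand-in model are assumptions for illustration.

```python
import collections
import torch

def causal_stream(model, camera_frames, prompt, history_len=8, denoise_passes=4):
    """Illustrative causal feedback loop: each new input frame is transformed
    conditioned on previously *generated* frames, the current input and the
    prompt, then the output is appended to the history for the next frame."""
    feature_dim = camera_frames.shape[-1]
    history = collections.deque(
        [torch.zeros(feature_dim) for _ in range(history_len)], maxlen=history_len
    )
    outputs = []
    for frame in camera_frames:
        x = frame.clone()
        for step in range(denoise_passes):            # few-step denoising per frame
            context = torch.stack(list(history))      # frames generated so far
            noise_level = torch.full((1, history_len + 1),
                                     1.0 - step / denoise_passes)
            x = model(x.unsqueeze(0), context.unsqueeze(0),
                      noise_level, prompt.unsqueeze(0)).squeeze(0)
        history.append(x.detach())                    # feed the output back as context
        outputs.append(x)
    return torch.stack(outputs)

if __name__ == "__main__":
    # Stand-in denoiser with the same call signature as the training sketch above.
    dummy = lambda frame, hist, levels, prompt: frame * 0.9 + hist.mean(dim=1) * 0.1
    frames = torch.randn(30, 64)     # a short stream of input frames
    prompt = torch.randn(32)
    print(causal_stream(dummy, frames, prompt).shape)   # torch.Size([30, 64])
```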

Subsecond Latency

The hardest problem we faced was not video quality but how to run the causal feedback loop quickly enough for real-time generation.

For AI video to be used interactively, each new frame needs to be generated with less than 40 milliseconds of latency; otherwise, the lag becomes noticeable to the human eye. But causal AI models are computationally intensive, and their design is at odds with the architecture of modern GPUs, which favor large-batch throughput over low latency.
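
To put that number in perspective, the quick calculation below shows what a 40 millisecond budget implies; the number of denoising passes per frame is an illustrative assumption, not a figure from our pipeline.

```python
# Anything above roughly 40 ms per frame reads as lag, so the floor is 25 fps.
FRAME_BUDGET_MS = 40
print(1000 / FRAME_BUDGET_MS)            # 25.0 frames per second, minimum

# If each frame needs, say, 4 denoising passes (an illustrative assumption),
# every pass plus all other per-frame work has to fit inside about 10 ms.
DENOISE_PASSES = 4
print(FRAME_BUDGET_MS / DENOISE_PASSES)  # 10.0 ms per pass
```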

We experimented with several new approaches to get around these obstacles. To speed up processing, we set about optimizing the underlying Nvidia Hopper GPU infrastructure, focusing on the kernels, the small programs that run on the GPU and perform the individual steps involved in a computation. Typically, a single GPU will run several hundred of these small kernels, constantly stopping, starting and moving data back and forth between them. That wastes a lot of time and leaves a large part of the GPU sitting idle.

Our solution was to optimize our kernels for how Hopper works. Essentially, we created a single “mega kernel” that lets the chip process all of a model’s computations in one continuous pass. By doing this, we eliminate the stopping, starting and data movement, allowing more of the GPU to be utilized more of the time and speeding up processing by an order of magnitude.
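
Our mega kernel is hand-written for Hopper, but the underlying idea of collapsing many small kernel launches into one fused pass can be sketched with off-the-shelf tools. The toy example below uses PyTorch’s torch.compile, which can fuse chains of elementwise operations into far fewer kernels; it is only an illustration of the principle, not our implementation.

```python
import torch

def many_small_kernels(x):
    # On a GPU, each of these lines typically dispatches its own kernel, with
    # the intermediate tensors written to and re-read from memory in between.
    x = x * 1.1
    x = torch.nn.functional.silu(x)
    x = x + 0.5
    x = torch.tanh(x)
    return x * x

# torch.compile can fuse chains of elementwise ops into far fewer launches,
# cutting launch overhead and intermediate memory traffic: the same idea,
# taken much further, behind a single hand-written mega kernel.
fused = torch.compile(many_small_kernels)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1_000_000, device=device)
    torch.testing.assert_close(fused(x), many_small_kernels(x))
```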

We like to think of it as analogous to how Henry Ford transformed car production with the assembly line. Instead of one team struggling to integrate all of the components one by one, constantly stopping and starting, vehicles move sequentially from one workstation to the next and are completed far faster.

Pruning and Distilling

Another key innovation we implemented was “architecture-aware pruning,” a series of system-level optimizations that reduce the amount of computation required to generate outputs.

We’re able to do this because neural networks tend to be “over-parameterized,” containing scores of parameters that are not needed to generate the desired outputs. Pruning these unnecessary parameters means less work for the GPU, and it also helps adapt a model’s architecture to that of the underlying hardware.
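
As a rough illustration of what pruning looks like in practice, the sketch below applies PyTorch’s built-in magnitude pruning to a single over-parameterized layer. This is generic unstructured and structured pruning, not our architecture-aware method, which also shapes what remains to fit the target hardware.

```python
import torch
from torch.nn.utils import prune

# An over-parameterized layer standing in for part of a larger network.
layer = torch.nn.Linear(4096, 4096)

# Unstructured pruning: zero out the 30% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")   # bake the pruning mask into the weights

# Structured pruning removes whole rows (output units), which maps more
# naturally onto real hardware than scattered individual zeros.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weights zeroed: {sparsity:.0%}")
```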

Finally, we came up with a trick called “shortcut distillation,” which involves fine-tuning smaller, more lightweight models to match the denoising quality of larger models that require far greater processing power.

By using shortcut models for denoising, it’s possible to generate a coherent video frame in fewer steps, and these incremental gains add up quickly, dramatically shortening the time it takes to create quality outputs.
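
The sketch below captures the spirit of such a distillation step, under toy assumptions: a student denoiser is trained so that one large denoising step lands where a teacher gets after two smaller ones, halving the number of passes needed per frame. The modules, dimensions and step sizes are illustrative, and in practice the student would also be lighter-weight than the teacher.

```python
import torch

# Toy denoiser: maps a noisy frame plus a step size to a less-noisy frame.
class ToyDenoiser(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.SiLU(),
            torch.nn.Linear(128, dim),
        )

    def forward(self, x, step_size):
        s = torch.full((x.shape[0], 1), step_size)
        return self.net(torch.cat([x, s], dim=-1))

def distill_step(student, teacher, opt, noisy, small=0.125, big=0.25):
    """One shortcut-style distillation step (illustrative): the student's
    single big denoising step is trained to land where the teacher gets
    after two small ones, halving the number of passes per frame."""
    with torch.no_grad():
        target = teacher(teacher(noisy, small), small)   # two small teacher steps
    pred = student(noisy, big)                           # one big student step
    loss = torch.nn.functional.mse_loss(pred, target)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    teacher, student = ToyDenoiser(), ToyDenoiser()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    print(distill_step(student, teacher, opt, torch.randn(8, 64)))
```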

A Game Changer for AI Video

Subsecond latency is a significant breakthrough for AI video generation, paving the way for it to be used in interactive scenarios that were previously impossible. Through continuous editing, it’s possible to generate content that evolves as it’s being created, entirely based on the user’s whims. 

A TikTok influencer or Twitch streamer will be able to start broadcasting live video and then enter whatever prompts pop into their head, or incorporate suggestions from their audience, adapting the content as it’s being streamed.

It has implications for live video games, potentially enabling interactive AI-generated sequences that transform based on what actions the player takes. For instance, the gamer could be presented with a series of doors and be asked to choose one, with their choice leading to a unique outcome. The possibilities for use cases in extended reality, immersive education and mass event marketing are likewise exciting.

AI-generated videos can also act as neural rendering engines for engineers, enabling them to completely restyle different products and experiences using prompts. Architects and interior designers can iterate quickly on different themes to see what works best before deciding which direction to take.  

Even more exciting, this reduction in latency, combined with the ability to generate video infinitely, makes it possible for anyone to explore the depths of their imagination and create longform content. They’ll be able to interactively make adjustments to the scene, lighting, camera angles and character expressions as the video is being generated. It opens the door to a more dynamic creative experience that will transform how stories are created.


Kfir Aberman is a founding member at Decart AI, where he leads the San Francisco office and drives research-to-product efforts in real-time generative video. His work focuses on building interactive, personalized, and real-time AI systems that merge research excellence with creative user experiences.