
CraftStory writes the script for image-to-video AI

San Francisco-based CraftStory, logically enough given its name, is a specialist in realistic AI-generated human video. The latest iteration of its platform, CraftStory Model 2.0, generates lifelike, long-form, studio-quality videos featuring “humans” from a single image and a script. Let’s unpack that: this is AI that enables users to generate a five-minute video of a human being talking and moving, starting from nothing more than a single image file and a written script.

The company launched its first video-to-video model in November 2025. That model enabled users to generate up to five minutes of video by animating a still image with motion captured from a “driving” video (i.e. source footage that supplies the base motion).

Model 2.0 builds on CraftStory’s existing suite of models and introduces a new capability that removes the need for source footage. Companies can now create expressive, long-form videos starting from nothing more than a photo and text, while preserving the same realism, continuity, and performance quality previously available only through video-to-video workflows.

Amazing technology, but who would want a service like this? The company says there is a huge market in training videos and demonstration videos that showcase people, places or products that are tough to get into a studio… and where source footage (including the models themselves) is hard to come by.

Video as a primary communication channel 

“As video becomes a primary communication channel for companies, [creative and commercial] teams face a familiar bottleneck: producing consistent, human-led content at scale is still slow, expensive and difficult to update. While short AI clips exist, they often lack expressive motion, break down over time, or fail to sustain realism beyond a few seconds,” states CraftStory, in its technical briefing materials.

The company claims its Image-to-Video model addresses this gap by transforming a single image into a complete performance, driven entirely by a script or audio.

The system generates what are said to be “natural facial expressions” as well as natural-looking body language and gestures that “evolve coherently” over time. These factors make the technology suitable for creating “product explainers” (i.e. how-to videos and demonstrations), training videos, as well as customer communication and educational content.

Script-driven video creation

“Image-to-video is a major step toward fully script-driven video creation,” said Victor Erukhimov, founder and CEO of CraftStory. “You no longer need to record a video to get a realistic human performance. If you have an image and something to say, Model 2.0 can turn that into a believable, long-form video – complete with gestures and expressiveness that match the message.”

With image-to-video, users upload a single image of a person and a script or audio track. CraftStory Model 2.0 then synthesises a full video performance, animating both the person and the environment with realistic lip-sync, expressive gestures and scene motion aligned with speech rhythm and emotional tone.
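
CraftStory has not published a public API, so the workflow above can only be sketched hypothetically: one image, exactly one of a script or an audio track, plus a handful of output options. In the minimal Python sketch below, every name (the GenerationJob class, its fields and its validation rules) is an assumption made for illustration, not CraftStory’s documented schema.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GenerationJob:
    """Hypothetical job description for an image-to-video request;
    field names are illustrative, not CraftStory's actual schema."""
    image: Path                     # a single still image of the presenter
    script: str | None = None      # text to be spoken...
    audio: Path | None = None      # ...or a pre-recorded voice track instead
    resolution: str = "720p"       # article: 480p or 720p, upscalable to 1080p
    orientation: str = "landscape" # article: portrait and landscape supported

    def validate(self) -> None:
        # Exactly one driving input: a script or an audio track, not both.
        if (self.script is None) == (self.audio is None):
            raise ValueError("provide exactly one of script or audio")
        if self.resolution not in {"480p", "720p", "1080p"}:
            raise ValueError(f"unsupported resolution: {self.resolution}")

job = GenerationJob(image=Path("presenter.jpg"), script="Welcome to our demo.")
job.validate()  # passes: one image plus one script, default 720p landscape
```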

The model shares the same core architecture as CraftStory’s video-to-video system, including advanced gesture generation algorithms that infer appropriate hand and body movements directly from audio. A high-fidelity lip-sync function produces natural speech articulation over long sequences, while an identity preservation service maintains consistent appearance, emotion and nuance throughout multi-minute videos.

According to CEO Erukhimov, “Model 2.0 also includes an advanced lip-sync system that turns any script or audio track into a realistic performance. A built-in gesture alignment algorithm ensures that body movements naturally match speech rhythm and emotion – bringing human expressiveness to AI-generated content.”
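
CraftStory has not disclosed how its gesture alignment algorithm works internally, but the underlying idea of timing motion to speech rhythm can be shown with a toy heuristic: compute the audio’s short-time energy envelope and place gesture emphasis at its peaks. The sketch below is an illustrative stand-in, not CraftStory’s method; the function names and the threshold are assumptions.

```python
import numpy as np

def energy_envelope(audio: np.ndarray, sr: int, hop_ms: int = 20) -> np.ndarray:
    """Short-time RMS energy of a mono waveform, one value per hop."""
    hop = int(sr * hop_ms / 1000)
    frames = [audio[i:i + hop] for i in range(0, len(audio) - hop, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def gesture_keyframes(envelope: np.ndarray, hop_ms: int = 20,
                      threshold: float = 0.6) -> list[float]:
    """Timestamps (in seconds) of local energy peaks above a relative
    threshold: toy candidates for moments of gesture emphasis."""
    norm = envelope / (envelope.max() + 1e-9)
    peaks = []
    for i in range(1, len(norm) - 1):
        if norm[i] > threshold and norm[i] >= norm[i - 1] and norm[i] > norm[i + 1]:
            peaks.append(i * hop_ms / 1000)
    return peaks

# Demo: one second of synthetic "speech" where noise bursts stand in
# for stressed syllables; the detected peaks mark gesture timings.
sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.random.randn(sr) * (0.2 + 0.8 * (np.sin(2 * np.pi * 3 * t) > 0.7))
print(gesture_keyframes(energy_envelope(audio, sr)))
```

A production system would map such timestamps onto a learned motion model rather than simple peak-picking, but the coupling between speech energy and gesture timing is the same in spirit.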

CraftStory is also introducing support for moving cameras. Model 2.0 can now generate walk-and-talk videos up to 80 seconds long, where the person moves naturally through the scene while speaking and the camera tracks the motion. This enables dynamic, cinematic shots that stand out from static, on-camera videos. The feature is currently in beta and will be rolled out gradually to existing accounts.

Proprietary parallelised diffusion pipeline

At the core of Model 2.0 is a proprietary parallelised diffusion pipeline, designed to scale human video generation beyond short clips. By processing different temporal segments simultaneously while enforcing global coherence, the system maintains visual consistency across minutes of footage – a key challenge in long-form video synthesis.

TECHNICAL EXPLANATORY NOTE: The specific implementation, source code and unique algorithms are the intellectual property of CraftStory and are thus proprietary. The pipeline is also parallelised, i.e. the computational tasks involved in the diffusion process are broken down and distributed across multiple processing units to run simultaneously rather than sequentially. These units are typically graphics processing units (GPUs) or Tensor Processing Units (TPUs), the latter being a neural processing unit (NPU) in the form of an application-specific integrated circuit (ASIC) developed by Google for neural network workloads. The central goal of such a parallelised diffusion pipeline is to accelerate the inference or training of large diffusion models, reducing the time required to generate high-quality outputs.
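
As a rough sketch of the segment-parallel idea (emphatically not CraftStory’s proprietary pipeline), the example below splits a long one-dimensional signal, standing in for a video’s frame sequence, into overlapping temporal segments, “denoises” each segment in a separate worker process, then crossfades the overlaps so that neighbouring segments agree at their boundaries. The denoise step is a placeholder moving average; in a real pipeline each worker would run diffusion sampling on its own GPU or TPU.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def denoise_segment(segment: np.ndarray) -> np.ndarray:
    """Stand-in for per-segment diffusion sampling: here just a moving
    average; a real pipeline would run a diffusion model per device."""
    kernel = np.ones(5) / 5
    return np.convolve(segment, kernel, mode="same")

def parallel_denoise(frames: np.ndarray, seg_len: int = 120,
                     overlap: int = 24) -> np.ndarray:
    """Split a 1-D frame signal into overlapping segments, process them
    in parallel, then linearly crossfade the overlaps for continuity."""
    starts = list(range(0, max(len(frames) - overlap, 1), seg_len - overlap))
    segments = [frames[s:s + seg_len] for s in starts]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(denoise_segment, segments))
    out = np.zeros_like(frames, dtype=float)
    weight = np.zeros_like(frames, dtype=float)
    for idx, (s, seg) in enumerate(zip(starts, results)):
        w = np.ones(len(seg))
        if idx > 0:                  # fade in, except for the first segment
            w[:overlap] = np.linspace(0, 1, overlap)
        if idx < len(starts) - 1:    # fade out, except for the last segment
            w[-overlap:] = np.linspace(1, 0, overlap)
        out[s:s + len(seg)] += seg * w
        weight[s:s + len(seg)] += w
    return out / np.maximum(weight, 1e-9)

if __name__ == "__main__":
    noisy = np.sin(np.linspace(0, 20, 600)) + 0.3 * np.random.randn(600)
    print(parallel_denoise(noisy).shape)  # (600,): one coherent output
```

The crossfade is the toy version of “enforcing global coherence”; real long-form pipelines typically condition neighbouring segments on shared latents or anchor frames instead.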

The model was trained on high-frame-rate footage of real actors, capturing subtle facial dynamics as well as expressive hand and body motion. This allows Image-to-Video outputs to feel fluid and human, rather than static or robotic. Videos can be generated in both portrait and landscape formats, at 480p and 720p, with optional upscaling to 1080p.

Coming to a screen near you, soon

Looking ahead, CraftStory says it is advancing Model 2.0 towards “fully automated” text-to-video workflows, with a focus on making marketing video creation faster, simpler and more scalable for everyday use. The company’s YouTube pages make entertaining viewing.

Image: a woman in a black strapless top and blue jeans walks across a wet city plaza with tall buildings and statues in the background, showcasing the realism possible with an image-to-video AI service. A walking, talking AI model presents her views via CraftStory (no jacket or coat required).