
Google has presented a new AI model that can analyze long videos. AI solutions focused on text, images or sound have achieved commercial success, but no current tool processes these areas together. With Mirasol3B, Google believes it has found an approach that can.

Few would describe AI development as easy, but applications such as ChatGPT, Midjourney and numerous business-oriented ML solutions have shown that much is already possible with the technology. Great strides have also been made in audio, for example with synthetic singing voices. However, “multimodality,” meaning the combined analysis of video, audio and text, is considerably more difficult.

Combiners

According to Google researchers Isaac Noble and Anelia Angelova, it is difficult to keep modalities in sync when they are processed jointly. They therefore present Mirasol3B, which contains different components for audio and video and splits the input up so that the two stay synchronized. This allows the model to analyze what the researchers describe as “long videos.” They cite 512 frames as the largest input, although not every individual frame of a video is actually analyzed. Other AI models use only 32 to 64 frames per video, even if it is several minutes long, according to the researchers. With Mirasol3B, a video is divided into chunks of 4 to 64 frames, each of which is analyzed together with the matching piece of audio. A “learning module” called a “Combiner” processes the combined data, after which the process repeats for the next chunk. However, each Combiner step after the first concentrates on the changes that have taken place, so duplicate frames do not require the same calculations.
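The chunking step described above can be illustrated with a short Python sketch. This is not Google's code, which has not been released. The names `Chunk` and `partition`, and the figure of 1,600 audio samples per sampled frame, are assumptions made purely for illustration. The sketch only shows how sampled frames might be paired with the audio covering the same time span; the Combiner itself is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical container: one group of frames and its aligned audio span."""
    frames: range            # indices of the sampled video frames in this chunk
    audio: Tuple[int, int]   # [start, end) audio sample indices for the same time span


def partition(num_frames: int, frames_per_chunk: int,
              audio_samples_per_frame: int) -> List[Chunk]:
    """Split sampled frames into consecutive chunks with time-aligned audio.

    Assumes a fixed number of audio samples per sampled frame, which is a
    simplification chosen for this example.
    """
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    chunks = []
    for start in range(0, num_frames, frames_per_chunk):
        end = min(start + frames_per_chunk, num_frames)
        audio_span = (start * audio_samples_per_frame,
                      end * audio_samples_per_frame)
        chunks.append(Chunk(range(start, end), audio_span))
    return chunks


# 512 sampled frames in chunks of 64, the upper bounds cited in the article.
chunks = partition(num_frames=512, frames_per_chunk=64,
                   audio_samples_per_frame=1600)
print(len(chunks))       # 8
print(chunks[1].audio)   # (102400, 204800)
```

In a setup like this, each chunk would be handed to the Combiner in turn, so video and audio from the same moment are always processed together.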

A diagram showing the different parts of a video.
Source: Google

Possible applications include adding video content in an AI search engine, analyzing user-generated content for moderation and QA for professional videos.

For Google itself, AI-powered content moderation will undoubtedly sound appealing: its own YouTube platform receives hundreds of thousands of hours of new content daily, which is already largely moderated by algorithms. False positives can be challenged, and people can also report harmful or banned content themselves. During the Covid-19 pandemic, YouTube was forced to use even fewer people for content moderation. A better AI assistant for this work would lighten the load on human moderators.

Not open-source

While ML experts such as Leo Tronchon of AI platform Hugging Face have been positive about the tool, others are skeptical. For one thing, Google has chosen not to share the model, the training data or the code required to run it. Mirasol3B is thus closed-source, and its details are only accessible via the Google blog post and research paper.
