DeepMind, a sister company of Google under parent Alphabet, has published a paper describing an artificial intelligence (AI) that can generate realistic videos after training on YouTube clips.

The AI, called Dual Video Discriminator GAN (DVD-GAN), can generate coherent 256×256-pixel videos of up to 48 frames with remarkable fidelity, VentureBeat writes.

Generating natural video is an obvious next challenge, but one that is plagued by growing data complexity and computational demands, according to the paper’s authors.

Much prior work on video generation has therefore concentrated on relatively simple datasets, or on tasks where strong temporal conditioning information is available. "We focus on the tasks of video synthesis and video prediction, and aim to bring the strong results of generative image modelling to the video domain," the researchers write.

Specifically, the researchers used GANs: two-part AI systems in which a generator produces samples and a discriminator tries to distinguish the generated samples from real-world ones. They built chiefly on BigGAN, an architecture distinguished by the large batches it is trained on and the millions of parameters it uses.
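To make the generator/discriminator interplay concrete, here is a minimal toy sketch of one adversarial evaluation in numpy. It is not DeepMind's implementation: the linear generator, logistic discriminator, and 2-D data are illustrative stand-ins for the deep networks a real GAN uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    """Map latent noise z to a sample (toy stand-in: a linear layer)."""
    return z @ w

def discriminator(x, v):
    """Score how 'real' a sample looks: sigmoid of a linear projection."""
    return 1.0 / (1.0 + np.exp(-(x @ v)))

# "Real" data: 2-D points clustered around (3, 3); latent noise: standard normal.
real = rng.normal(loc=3.0, scale=0.5, size=(64, 2))
z = rng.normal(size=(64, 2))

w = 0.1 * rng.normal(size=(2, 2))  # generator parameters
v = 0.1 * rng.normal(size=2)       # discriminator parameters

fake = generator(z, w)
d_real = discriminator(real, v)    # the discriminator wants these near 1
d_fake = discriminator(fake, v)    # ... and these near 0

# Standard discriminator loss: reward confidence on real, suspicion on fake.
d_loss = -np.mean(np.log(d_real + 1e-8) + np.log(1.0 - d_fake + 1e-8))
```

In training, the two parts take alternating gradient steps on opposing objectives, which is what drives the generator's samples toward the real data distribution.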

DVD-GAN uses two discriminators. The first, a spatial discriminator, critiques the content and structure of individual frames by randomly sampling frames from the clip and judging each one on its own. The second, a temporal discriminator, provides the learning signal for motion. Finally, a Transformer lets learned information propagate through the entire model.
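The division of labour between the two discriminators can be sketched as follows. This is a toy illustration of the idea only; the function names, the mean-based "scores", and the downsampling factor are assumptions, standing in for the actual convolutional discriminator networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_scores(video, k=8):
    """Spatial-discriminator idea: sample k random frames and score each
    one individually at full resolution (a per-frame content critic)."""
    t = video.shape[0]
    idx = rng.choice(t, size=min(k, t), replace=False)
    frames = video[idx]                      # (k, H, W, C)
    return frames.mean(axis=(1, 2, 3))       # toy per-frame score

def temporal_score(video, factor=2):
    """Temporal-discriminator idea: look at the whole clip, but spatially
    downsampled, to judge motion across frames cheaply."""
    small = video[:, ::factor, ::factor, :]  # cheap spatial downsampling
    return float(small.mean())               # toy whole-clip score

clip = rng.random((48, 64, 64, 3))  # 48 frames of 64x64 RGB (toy sizes)
s = spatial_scores(clip)            # one score per sampled frame
m = temporal_score(clip)            # one score for the motion of the clip
```

The design point is efficiency: neither critic ever processes the full clip at full resolution, yet between them the model still gets feedback on both per-frame quality and motion.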

DVD-GAN was then trained on Kinetics-600, a dataset of natural video composed of 500,000 ten-second, high-resolution YouTube clips, originally compiled for human-action recognition. According to the researchers, the dataset is diverse and unconstrained, which alleviates concerns about overfitting: the phenomenon in which a model fits a specific dataset so closely that it fails to generalise to new observations.

DVD-GAN was trained for between 12 and 96 hours on Google's Tensor Processing Units. Afterwards, the AI could produce videos with object composition, movement and even complicated textures such as the side of an ice rink.

This news article was automatically translated from Dutch to give a head start. All news articles published after September 1, 2019 are written in native English and not translated, as are all our background stories. For more information, read our launch article.