
VideoPoet: The Language Model by Google For Video Generation

VideoPoet is a simple modelling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator.

Sounds complex? Let’s break it down. VideoPoet is basically a model based on a clever technique introduced by Google that turns ordinary language models (like Gemini or ChatGPT) into video creators, with a focus on coherence and camera controls.


This promising tool has been demonstrated through a unique project: a short movie entirely created by the model. The narrative centers on a traveling raccoon, with the storyline crafted by Bard, another text generation model. VideoPoet then transformed this narrative into a series of visually compelling video clips, seamlessly stitched together into a coherent and engaging video. This highlights the model’s ability not only to generate high-quality videos but also to maintain narrative coherence, which is a challenge for many of the video generation tools currently available.

VideoPoet: Key Functionalities

VideoPoet harnesses the power of autoregressive large language models (LLMs), AI systems that can generate human-quality text. By training LLMs on a vast dataset of videos and text descriptions, the model learns to translate words into visual scenes.

  • Text-to-Video: generate realistic and engaging videos from text descriptions.
  • Image-to-Video: generate realistic and engaging videos from images.
  • Video Editing: can be used to edit existing clips, for example by changing the motion of subjects in a video through text instructions.
  • Stylization: can be used to apply styles from existing videos to new ones.
  • Inpainting: can be used to seamlessly fill in missing or obscured portions of videos.

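Under the hood, approaches like this cast video generation as next-token prediction: video and text are mapped into a shared discrete token vocabulary, an autoregressive model predicts video tokens one at a time, and a decoder turns those tokens back into frames. The toy sketch below illustrates that loop only; the tokenizer and "model" are deterministic stand-ins, not Google's actual components or API.

```python
# Toy sketch of an LLM-style video generation loop (hypothetical stand-ins,
# not VideoPoet's real tokenizer or model).

VOCAB_SIZE = 256  # size of the shared discrete token vocabulary

def encode_text(prompt: str) -> list[int]:
    """Stand-in text tokenizer: map each character to a token id."""
    return [ord(c) % VOCAB_SIZE for c in prompt]

def next_video_token(context: list[int]) -> int:
    """Stand-in for the autoregressive LLM: a deterministic toy rule."""
    return (sum(context) + len(context)) % VOCAB_SIZE

def generate_video_tokens(prompt: str, n_tokens: int) -> list[int]:
    """Predict video tokens one at a time, feeding each back into the context."""
    context = encode_text(prompt)
    video_tokens = []
    for _ in range(n_tokens):
        token = next_video_token(context)
        video_tokens.append(token)
        context.append(token)  # autoregressive: condition on past predictions
    return video_tokens

tokens = generate_video_tokens("a raccoon travels the world", n_tokens=8)
print(tokens)  # in a real system, a decoder would render these into frames
```

The key design point is that every modality in the list above (text, image frames, video to edit) can be expressed as a token prefix, so one model serves all the tasks.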
Zero-shot video generation

A large language model (LLM) for zero-shot video generation is a type of artificial intelligence (AI) that can create videos from text descriptions without being explicitly trained on a specific dataset of videos. This means that the LLM can generate videos of scenes that it has never seen before, simply by understanding the text description.

The process is surprisingly simple: you provide VideoPoet with a text description of a video, and it generates a corresponding video clip. The generated videos are incredibly realistic, often indistinguishable from real recordings.

Long Videos

VideoPoet’s applications are as diverse as your imagination. It can be used to create marketing videos, educational tutorials, entertainment content, and even personalized videos for social media. This is also made possible by its ability to generate long, coherent videos. The process involves a model that takes the final one-second segment of a video and predicts what the next second will look like. By repeating this step, the model can not only lengthen the video but also consistently maintain the visual characteristics of all objects throughout the extended duration.

For researchers, VideoPoet offers a powerful tool to study how people perceive and understand videos. It can help us understand how language can be used to effectively convey visual information.

VideoPoet Capabilities

Here are some of the key features of VideoPoet:

  • It is simple to use. You can generate a video from a text description by simply providing the text description to the VideoPoet model.
  • It produces high-quality videos. The videos generated by VideoPoet are very realistic and often indistinguishable from real videos.
  • It is versatile. VideoPoet can be used to generate a wide variety of videos, including movies, documentaries, and music videos.

Camera Controls

One of the most promising features of VideoPoet is the ability to control camera motion by specifying the type of camera shot in the text prompt. For example, you can tell VideoPoet to zoom out to show the entire mountain or to pan left to follow the river’s path.
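Because the camera motion is expressed in natural language, controlling it amounts to prompt construction. The helper below is purely illustrative (the function name and motion list are assumptions, not VideoPoet's interface): it just appends a camera instruction to the scene description, which is the mechanism the paragraph above describes.

```python
# Hypothetical prompt-building helper illustrating text-based camera control.

CAMERA_MOTIONS = {
    "zoom_out": "zoom out",
    "pan_left": "pan left",
    "arc": "arc shot",
}

def build_prompt(scene: str, motion: str) -> str:
    """Append a camera-motion instruction to a scene description."""
    return f"{scene}, camera {CAMERA_MOTIONS[motion]}"

print(build_prompt("a river winding through a mountain valley", "pan_left"))
# → a river winding through a mountain valley, camera pan left
```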


This makes VideoPoet a versatile tool and promising technique for creating videos in a variety of genres, such as adventure games, documentaries, and even virtual reality experiences. This feature addresses a current challenge of Stable Video Diffusion, namely, the difficulty in controlling and predicting camera movements with precision.

Progress in AI video generation is impressive, and everything is moving fast! With Stable Video Diffusion by Stability AI, Gen-2 by Runway ML, or Pika Art, to name a few, making and editing videos using text or images is becoming more and more accessible.

Read more about VideoPoet directly from Google in their blog post.
