
Image to Video with Stable Video Diffusion 1.1

Stable Video Diffusion 1.1 is the new iteration of SVD 1.0, designed to generate videos from images. How does it compare to the original model? Let’s explore some details about the new model and then test its performance with a simple ComfyUI workflow, running a few experiments and a comparison between the two models.

Overview

Stable Video Diffusion (SVD) is an image-to-video model published by Stability AI several months ago, providing an open-source alternative to AI video generation tools such as runwayml or Pika. It impressed many with the quality of its outputs, although limited control over camera and subject movement means it is far from perfect, often requiring multiple attempts to find the right settings.

Stable Video Diffusion (SVD) 1.1 Image-to-Video is a new iteration of SVD, a latent diffusion model designed to generate cinematic video content from a single still image. It works by conditioning on an input frame and producing a short video clip, making it an exciting tool for various research and creative applications.

Stable Video Diffusion 1.1

Getting a little more technical, SVD 1.1 was fine-tuned from SVD Image-to-Video at 25 frames. The model is therefore capable of generating 25 frames at a resolution of 1024×576. Training used fixed conditioning at 6 fps and a motion_bucket_id of 127, ensuring consistent outputs without the need for hyperparameter adjustments. This does not mean that we cannot change these parameters for our own video generations, so there is still flexibility.

The original SVD model was trained to generate 14 frames, and it was then fine-tuned to generate 25 frames, creating SVDXT. In the case of SVD 1.1, we have a single version trained from SVDXT, capable of generating 25 frames. That’s why the model file is named svd_xt_1_1.
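As a side note, these fixed conditioning values map directly onto parameters of the StableVideoDiffusionPipeline in the Hugging Face diffusers library, if you prefer scripting to ComfyUI. Below is a minimal sketch, assuming the 1.1 repository is available in diffusers format and that you have accepted its license; the parameter names follow the diffusers API and the file names are only illustrative.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load SVD 1.1 (gated repo: accept the license on the model page and log in first)
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("input.png")      # conditioning frame (hypothetical file name)
image = image.resize((1024, 576))    # native training resolution

# The values SVD 1.1 was fine-tuned with: 25 frames, 6 fps, motion_bucket_id 127
frames = pipe(
    image,
    num_frames=25,
    fps=6,
    motion_bucket_id=127,
    decode_chunk_size=8,             # lower this if you run out of VRAM during decoding
    generator=torch.manual_seed(42),
).frames[0]

export_to_video(frames, "svd_1_1.mp4", fps=6)
```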

Limitations

Similar to the original version, stability.ai outlines some limitations expected from this model:

  • Video Length: Generated videos are short (<= 4sec), emphasizing brevity over extended sequences.
  • Photorealism: Perfect photorealism is not achieved, and videos can show distortions.
  • Motion Variability: Some videos may lack motion or feature slow camera pans.
  • Control through Text: The model cannot be directly controlled through text prompts.
  • Text Rendering: (legible) text may not be accurately rendered.
  • Faces and People: The model may struggle to generate faces and people.

Hands-on in ComfyUI

Using a simple ComfyUI workflow to run the original SVD model, I downloaded the new weights from stabilityai/stable-video-diffusion-img2vid-xt-1-1 on Hugging Face. You will need to be logged in and accept their terms before you can download the model.
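If you prefer to script the download, a small sketch using the huggingface_hub library might look like the following; the checkpoint filename svd_xt_1_1.safetensors is an assumption based on the naming mentioned above, so check the repository’s file listing.

```python
from huggingface_hub import hf_hub_download

# Requires having accepted the license on the model page and
# being logged in (`huggingface-cli login`) or passing token="hf_...".
hf_hub_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    filename="svd_xt_1_1.safetensors",         # assumed checkpoint name in the repo
    local_dir="ComfyUI/models/checkpoints",    # drop it straight into ComfyUI
)
```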

After this is done, put the model in the ComfyUI/models/checkpoints/ directory and run ComfyUI. For the workflow, simply drag the first image from this GitHub repository into ComfyUI, and it will be loaded.

stable video diffusion 1.1 workflow

Note that the motion_bucket_id is left at 127, the same value Stable Video Diffusion 1.1 was fine-tuned with. You can also change this value to influence the camera movements, but we will get to that later.

Input Image

Now you need to choose an input image to animate. I decided to try with an aerial view of a beach. For this first example, I didn’t change any parameters in the workflow. Here is the result:

stable video diffusion 1.0 animation example
14 video frames


Maybe a bit too fast, but it doesn’t look bad as a first attempt! The composition of the input image probably helps the model find a good camera movement, following the rocks in the lower part of the frame.

Let’s try modifying some parameters, in particular setting fps to 8 and video_frames to 25. Remember that 25 is the number of frames used to fine-tune this model, and using 14 frames naturally makes the video less smooth.

The increased number of frames provides a much smoother animation, keeping the same movements (and same seed).

motion_bucket_id: 127

To influence the movements of the camera, let’s try modifying the motion_bucket_id, which is arguably the most important parameter and, in my opinion, also the least predictable. Generally, reducing the value should result in less motion in the final animation, but I want to try a more dynamic output, so I will increase it.

motion_bucket_id: 154

Notice how increasing the motion bucket to 154 indeed changed the direction of the camera pan. One more remark: the composition of the input image also has an impact on the animation, so the same motion bucket might produce different movements with (very) different images.
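For reference, the same kind of experiment (fixed seed, varying motion_bucket_id) can also be scripted with diffusers; this sketch reuses the hypothetical pipe and image objects from the earlier snippet and is only meant to show how the parameters map.

```python
import torch
from diffusers.utils import export_to_video

# Sweep motion_bucket_id with a fixed seed, reusing `pipe` and `image`
# from the earlier snippet; higher values generally mean more motion.
for bucket in (100, 127, 154):
    frames = pipe(
        image,
        num_frames=25,
        fps=8,                              # same value as the smoother test above
        motion_bucket_id=bucket,
        decode_chunk_size=8,
        generator=torch.manual_seed(42),    # same seed so only the bucket changes
    ).frames[0]
    export_to_video(frames, f"beach_bucket_{bucket}.mp4", fps=8)
```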

1.0 vs 1.1

Stable Video Diffusion 1.0 vs Stable Video Diffusion 1.1 – Is there a significant difference between the two? How much has the new version improved? To get more info about the first version, have a look at this guide.

Keep in mind that version 1.1 can be considered an extension of the original version 1.0, with additional training at specific settings, namely the fixed conditioning at 6 fps and motion_bucket_id 127. This means you should expect better performance if you use the same values to generate your animation, and performance comparable to the original model if you modify them.

Let’s do a quick comparison with a previous model, in particular svd_xt_image_decoder, which is also trained on 25 frames. I will use the same input image and keep the same parameters between runs to better spot the differences.

Comparing svd_xt_image_decoder (1.0) and svd_xt_1_1 (1.1)

For the first two comparisons, I set fps to 6 and motion_bucket_id to 127, whereas for the second two, I set fps to 8 and motion_bucket_id to 135.

Conclusion

Stable Video Diffusion 1.1 seems to produce slightly smoother animations in some cases, with a stable camera view. It still shows some of the limitations of the original version, such as morphing or a lack of movement. As the version number suggests, though, this is only a minor improvement. One interesting remark is that the original model file is much bigger: 9.5 GB for 1.0 compared to 4.78 GB for 1.1. You can also download the fp16 versions of the models, which reduce the size of the weights without (supposedly) reducing quality: a third-party repository on Hugging Face contains the lighter versions of both 1.0 and 1.1.
