
Run Stable Video Diffusion with ComfyUI and Just 12GB of VRAM

Made with Stable Video Diffusion

Stability.ai has released Stable Video Diffusion, yet another remarkable model, this time enabling the generation of videos from images or text! When this model was first released, the VRAM requirements seemed quite high (as much as 40 GB of VRAM). However, as has been the case with other models, numerous optimizations have been implemented over time. I successfully managed to run Stable Video Diffusion on my 3060 GPU with 12 GB of VRAM thanks to ComfyUI. Let’s explore how to do this in a few simple steps.

Preparing ComfyUI

First, you need to obtain the latest version of ComfyUI. You can install it from the repository or, if you already have it, simply run ComfyUI_windows_portable > update > update_comfyui.bat. If you don’t know how to use ComfyUI, check this step-by-step tutorial.

Now that you have the latest version, you need to download the workflows, which can be found here: ComfyUI Video Examples. There is no need to install additional custom nodes; the required nodes are natively available with the latest ComfyUI update.

There are two workflows: one for image-to-video, and another for text-to-video.

Next, you will need to download the model, which can be found on Huggingface: Stable Video Diffusion img2vid. Download the svd.safetensors file (almost 9GB, so it might take some time to download!). You will also see svd_image_decoder.safetensors on the same page; both generate 14 frames, the second simply uses a standard frame-wise decoder. If you want the 25-frame version, grab svd_xt.safetensors from the Stable Video Diffusion img2vid-xt page instead.

Alternatively, you can try an optimized model that is almost half the size, available here: Optimized Stable Video Diffusion img2vid. It won’t improve performance, but it does save disk space.

After downloading the model, place it in the ComfyUI > models > checkpoints folder, as you would with a standard image model.
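
If you prefer to grab the model from a script instead of the browser, here is a minimal Python sketch using the huggingface_hub library. The repo id matches the Huggingface page linked above, but the target folder follows the portable-install layout and is an assumption: adjust it to your setup.

```python
# Minimal sketch: download svd.safetensors straight into ComfyUI's checkpoints folder.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid",  # Hugging Face page linked above
    filename="svd.safetensors",                            # ~9GB download
    local_dir="ComfyUI_windows_portable/ComfyUI/models/checkpoints",  # assumed portable-install path
)
```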

Image to video

Let’s try the image-to-video first. Download the workflow and save it.

Open ComfyUI (double click on run_nvidia_gpu.bat) and load the workflow you downloaded previously. The workflow looks as follows:

stable video diffusion comfyui workflow

Simply add an input image and confirm that you’ve selected the right checkpoint in Image Only Checkpoint Loader. Since my GPU isn’t very powerful, I started by reducing the width and height to 576 x 576 in the SVD_img2vid_Conditioning node. I also reduced video_frames to 10, but you can adjust this for longer videos.

And after a few seconds, I got my animated image!

Some comments about the parameters of the SVD_img2vid_Conditioning node: motion_bucket_id establishes how much motion the resulting video should have, augmentation_level determines how much the output is allowed to deviate from (and add movement to) the input image, and fps can usually be kept at 6. Don’t forget that if you use the svd.safetensors model you should set video_frames to 14, whereas with svd_xt.safetensors you’ll go with 25 video_frames.
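
If you prefer scripting over clicking, here is a rough Python sketch that applies the same parameter tweaks through ComfyUI’s HTTP API. It assumes you exported the image-to-video workflow with “Save (API Format)” (enable the dev mode option in the settings) as workflow_api.json, and that ComfyUI is running on the default port 8188; the file name is just an example.

```python
# Rough sketch: override SVD_img2vid_Conditioning inputs in an exported workflow
# and queue it on a locally running ComfyUI instance.
import json
import urllib.request

with open("workflow_api.json") as f:   # workflow exported with "Save (API Format)"
    workflow = json.load(f)

# Find the conditioning node and override its parameters.
for node in workflow.values():
    if node.get("class_type") == "SVD_img2vid_Conditioning":
        node["inputs"].update({
            "width": 576,
            "height": 576,
            "video_frames": 14,       # 14 for svd.safetensors, 25 for svd_xt.safetensors
            "motion_bucket_id": 127,  # higher = more motion
            "fps": 6,
            "augmentation_level": 0.0,
        })

# Queue the modified workflow.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```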

Text to video

The text-to-video workflow generates an image first and then follows the same process as the previous workflow. Again, I reduced the dimensions of the Empty Latent Image and SVD_img2vid_Conditioning nodes, but feel free to keep higher values depending on your GPU. You will also need a base Stable Diffusion model to run this workflow.

After around 1 minute, I got my short video.

text to video with stable video diffusion workflow

I find this method of running Stable Video Diffusion extremely easy and, most importantly, fast and efficient with my 12GB VRAM GPU. It should even work with 9GB of VRAM; just experiment with different parameters to optimize for your machine.

Improvements

We can obtain even smoother and more detailed results by adding some custom nodes to the workflow, in particular ComfyUI Frame Interpolation.

This is the workflow I will try. If you see errors because of missing nodes, use the ComfyUI Manager to install them, then restart the UI.

comfyui workflow with frame interpolation

I got a very smooth and high-resolution video in about 4 minutes. Definitely time-consuming with my GPU, but generally it’s worth the wait.

svi in comfyui

You can find the workflow on this Google Drive. Thanks to Olivio Sarikas!

Tips

Note that, as mentioned, you can choose any resolution, but the model works best at 1024×576.

If your GPU can’t keep up, try reducing the resolution or the number of video frames. You can also try a different base model (14 frames – svd.safetensors, or 25 frames – svd_xt.safetensors).
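
If you are not sure how much headroom your card has, a quick check like the sketch below (assuming the CUDA-enabled PyTorch that ships with ComfyUI, or any CUDA build) tells you the total VRAM before you pick resolution and frame count.

```python
# Quick sketch: print total VRAM so you can choose resolution and video_frames accordingly.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM: {total_gb:.1f} GB")
    # Rough guide from this article: ~12 GB handled 576x576 at 10-14 frames;
    # with less, lower the resolution and video_frames first.
else:
    print("No CUDA GPU detected.")
```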

Resources

ComfyUI git repository

ComfyUI video workflows

Stable Video Diffusion 14 frame video

Stable Video Diffusion 25 frame video

Stable Video Diffusion (less disk space)
