
How to Run an LLM Locally

[Image: a PC with a chat icon above it, representing chatting with your PC]

If you want to run a Large Language Model (LLM), similar to the ones behind ChatGPT, offline on your local PC, then you should know how to run an LLM locally. LLMs are models trained on massive amounts of text data. This training allows them to generate human-quality text, translate languages, debug code, write different kinds of creative content, and answer your questions in an informative way.

The focus here will be specifically on those LLMs that are open source or have open weights, so that they can be run locally on your PC.

Why a Local LLM?

ChatGPT is an amazing tool, with GPT-4 still being one of the best models on the market. Google is also catching up with Gemini and Bard, with great improvements in recent months. But what do these models have in common? They are closed source and can only be run online, following the rules set by their respective companies. If you are looking for full customization, other companies such as Mistral AI or Meta have released their model weights so that anyone can use them on their own setup (locally or on a personal server).

Running an LLM locally, on a PC or a private server, offers several advantages, including:

Privacy: Running LLMs locally ensures that your data remains private and secure, as your interactions with the model are not shared with any third parties. This is particularly important for sensitive tasks such as medical diagnosis or financial planning.

Control: Running LLMs locally gives you complete control over the content of your conversation, allowing you to chat about topics that might otherwise be blocked or censored. Still, be mindful of the content you generate, because the output and how you use it remain your responsibility.

Performance: While cloud-based LLMs often offer more powerful hardware, they can also experience latency due to network limitations. Running LLMs locally eliminates this latency, providing real-time responsiveness for interactive tasks (even if for most consumer PCs, it is still way faster to use cloud resources with higher tier GPUs and a lot of RAM).

Customization: LLMs can be fine-tuned by anyone on local data, so they better reflect the language patterns and nuances of your specific domain. This customization can lead to more accurate and relevant results for your applications.

And, perhaps most importantly, you do not need to pay for a service: if you already have the hardware available, or use a cheap cloud provider, running locally can be a money-saving solution.

Available Models

There are many open source projects around the development of LLMs, optimized for different use cases, with different licenses and performance. I will share with you some names, then focus on a specific model that I will try to install on my PC.

Vicuna is a chat assistant created by fine-tuning LLaMA on user-shared conversations collected from another LLM, aimed at hobbyists and researchers. LLaMA is an open LLM released by Meta, one of the few big companies to have released its weights, unlike OpenAI or Google.

Mistral 7B is considered one of the best open-weight Large Language Models (LLMs) for its combination of efficiency and capability. The smaller models can run on many consumer GPUs and are flexible in handling conversations and tasks. Find out more about Mistral AI and its latest LLMs.

The size of a model is determined by its parameters. The parameter count is often found in the filename of the model itself, such as 7B, 13B, or 70B. With more parameters, the model can handle a wider range of tasks and situations, making it appear more intelligent; however, it also requires much more computational resources. Conversely, a 7B model may give disappointing results on harder tasks if the base model is not very good.
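As a rough back-of-the-envelope sketch (quantization formats add metadata and mixed-precision layers, so real files are somewhat larger), the download size scales with the parameter count times the bytes stored per weight:

```python
# Rough estimate of model file size: parameters × bits per weight / 8.
# Real GGUF/GPTQ files differ a bit due to metadata and mixed precision.

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate model size in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(approx_size_gb(7, 16))  # 7B model at 16-bit: ~14 GB
print(approx_size_gb(7, 4))   # 7B model at 4-bit: ~3.5 GB
print(approx_size_gb(70, 4))  # 70B model at 4-bit: ~35 GB
```

This is why a 4-bit quantized 7B model fits comfortably on a consumer GPU, while a 70B model does not, even quantized.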

Dolphin and Mistral

Models can be fine-tuned to improve a specific use case or to integrate the original dataset with additional text, and the one that I will try in this article is one of them.

The Dolphin-2.1 Mistral-7B model is an LLM with 7 billion parameters developed by ehartford. It is based on Mistral 7B and released under an Apache-2.0 license, making it suitable for both commercial and non-commercial use. Its 7B parameter count makes it suitable for less powerful machines, but if you have a powerful GPU (16 GB of VRAM or more), feel free to try bigger models. Dolphin-2.1 Mistral-7B was fine-tuned on a large dataset of text and code (the Dolphin dataset, based on Microsoft's Orca approach), and it can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.

Run an LLM Locally

Let’s see how to run Dolphin-2.1 Mistral-7B using a simple UI on Windows.

Choosing the UI

In order to benefit from the capabilities of these assistants, we need a way to interact with them and give them instructions in the form of text, whether we use an LLM locally or as a paid service. There are many interesting user interfaces (UIs) with different features and styles that can simply be downloaded and installed. For this tutorial, I will use one named Text Generation WebUI, which follows the path of the original stable-diffusion-webui for image generation.

I found the UI very clean and simple, sufficient for me to experiment and chat with my LLM locally. Some more UIs worth trying are:

  1. Hugging Face Chat UI – A Clean and Efficient Interface: Notable for its clean design and robust web search capabilities.
  2. Text Generation WebUI – Unparalleled Model Support: Recognized for its extensive model support and supports several extensions.
  3. Lollms WebUI – Stability, PDF Integration, and Web Search: Stands out for stability, PDF integration, and seamless web search.
  4. H2OGPT – File Ingestion Powerhouse: A powerful tool for file ingestion, supporting a wide range of formats, with strong capabilities in PDF management and web search.
  5. SillyTavern – Tailored for Custom Characters and Roleplay: Uniquely designed for custom characters and roleplay enthusiasts.
  6. GPT4All – Basic UI Replicating ChatGPT: Features a basic UI replicating the familiar ChatGPT interface.

The list goes on, so feel free to explore and find the best tool for you. Since I have already made up my mind, I will go directly to the GitHub page of Text Generation WebUI to follow the installation process. You can either clone the repository using git or download the zip file directly.

Installing on Windows

After downloading, unzip the file and go into the text-generation-webui folder. You will find several start_ files; choose the right one for your setup, and the installation will begin. For Windows, I run start_windows.bat.

During the installation, a guided process in the terminal will help you with the setup. Read carefully what it asks to make the installation as smooth as possible.

Type A and press Enter if you have an NVIDIA GPU; otherwise, choose the right letter depending on your system.

If you have an RTX or GTX GPU, type N. You should reply Y only if your GPU is older than these. After that, the installation will continue, and it will have to download some heavy files, so be patient!

If there are no errors, you should see something like the image above. Your UI is ready; just go to a browser and enter this URL: http://localhost:7860/?__theme=dark

Then you should be able to see the UI of the application, very minimal but functional, with a text input below and the content of the chats above. On the top, you can see some important sections that we will use to navigate the UI.

[Screenshot: the Text Generation WebUI running in the browser on Windows]

Now we have to download the model! As mentioned, there are many open source options, similar to the situation with image models: many fine-tunes, variants, and companies providing these models. If you don't know what to choose yet, let's try Dolphin-2.1-Mistral.

The model can be found on Hugging Face, with a nice model description that is always good to read. I will download a 7B version, specifically dolphin-2.1-mistral-7b.Q4_0.gguf; I'm not sure if my 16 GB of RAM and my 3060 GPU can handle more. You can download it here.

A remark about the format of these weights: GGUF stands for GPT-Generated Unified Format. It is a file format for storing language models that was released in August 2023. Models in this format can be downloaded as a single file, unlike 16-bit Transformers models and GPTQ models, which are made up of several files. This makes it a bit easier and faster to get started with an LLM locally.

When your model has downloaded, put the GGUF file in the \text-generation-webui\models\ folder. Then head back to the UI, go to the Model panel, and next to the Model dropdown, click the two-arrows button to refresh the list, since you added the model while the UI was already running.

[Screenshot: the Model panel used to load the local LLM]

You should be able to see dolphin-2.1-mistral in the menu. Once the model is selected, choose Load so that it is loaded into memory; it might take a few moments.

Chat Examples

[Screenshot: a conversation with the LLM running locally]

It took 29 seconds to get my first reply; I was expecting it to take longer! My machine has 16 GB of RAM and 12 GB of VRAM, which is not much compared to what an LLM usually needs to run fast, but this model is small enough to run decently and still give good results. Subsequent replies seem a bit faster.

[Screenshot: chatting with dolphin-2.1-mistral-7b]

I asked some random questions, and the answers seem natural and precise, though I'm not sure they are accurate. I like that the text is shown "live" while it generates, very similar to ChatGPT and Bard.

According to the model itself, its most recent knowledge update was completed on November 21, 2021, so keep this in mind for your tasks!

There it is! You basically have a mini GPT on your local PC that runs offline and (almost) without limits. The UI shown here makes it easy to run an LLM locally and offers many other settings that can improve your chats even further. For example, you can tweak basic parameters below the text prompt, change the theme of the UI, and, more interestingly, control the behavior of the LLM, similar to custom instructions in ChatGPT.
