Host LLM Locally: No GPU Needed | Open Source AI Fun


It’s not the fastest solution, but it’s definitely usable even without a GPU – and at least you retain full control over your own data.

Whether it’s running a D&D campaign, fixing code issues, developing ideas, creating NSFW content, or getting around the restrictions imposed by major providers, KoboldCPP makes it easy to host your own LLM – and it works surprisingly well, even without a GPU.

Commentary articles exclusively reflect the individual opinion of the author listed.

Hardly anyone knows exactly what happens to your data when you send a request to an AI service. One thing is clear, though: whatever happens to it, you no longer really control that data.

As with local image and video generation, hosting your own LLM is surprisingly easy and has a number of advantages over the offerings of the major providers, especially if you want to experiment with Large Language Models without passing your data on to Big Tech.

The most important point: No matter what the model is used for, all data remains under your own control. That alone is a clear advantage if you don’t want to hand over your data to third parties. In addition, practically any model can be used – whether Deepseek, Gemma2 or GPT. Another advantage is being able to use versions that do not restrict certain types of requests.

KoboldCPP is an easy-to-use AI text generation tool consisting of a single executable file and designed for GGUF and GGML models. It supports both GPU and CPU and can serve as a specialized backend for AI storytelling and chats. KoboldCPP can be downloaded from GitHub and is available for Windows, Linux, Mac and Docker.

If the whole thing is hosted in a container, the LLM can be made available to every device in your own network without much effort. There are already ready-made templates for the most important platforms, including Unraid and TrueNAS. The same is possible with other installations as long as the necessary rules are set in the firewall.
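Once the container is reachable on the network, any device can talk to it over HTTP. The following is a minimal sketch, assuming the instance exposes its KoboldAI-compatible endpoint `/api/v1/generate` on port 5001 (the usual default); the address `192.168.1.50` and the helper names are placeholders made up for illustration, so check the API documentation of your installed version.

```python
import json
import urllib.request


def build_generate_request(host: str, prompt: str, max_length: int = 120,
                           port: int = 5001) -> urllib.request.Request:
    """Build (but do not send) a POST request for the text generation endpoint."""
    # Assumption: KoboldAI-style API path and default port.
    url = f"http://{host}:{port}/api/v1/generate"
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(host, prompt)
    # CPU-only generation is slow, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: {"results": [{"text": "..."}]}
    return body["results"][0]["text"]


# Example call from another machine in the same network (placeholder address):
# print(generate("192.168.1.50", "The dragon turns towards the party and"))
```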

Once the desired platform has been determined, the first step is to decide which model should be used. The best place to go is Hugging Face. The models must be in GGUF format.
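If you are unsure whether a downloaded file really is in GGUF format, a quick check is possible: GGUF files begin with the four ASCII bytes `GGUF`. The following sketch only inspects this signature and says nothing about whether the model itself will load; the function name is made up for illustration.

```python
GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF signature."""
    with open(path, "rb") as model_file:
        return model_file.read(4) == GGUF_MAGIC
```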

If you want to host D&D scenarios, you should definitely choose an uncensored model. Otherwise, sooner or later the LLM will refuse to deal damage to a character, which can lead to undesirable results.

Some models, such as Deepseek and Claude, tend to “think”, i.e. they write out their entire thought process before answering a query. With a GPU doing most of the work this may be fine, but on a CPU every additional token costs time, so it slows things down significantly. Ultimately, the only way to find a suitable model is to try several. Gemma2 is a good starting point.

The URL that leads to the GGUF file must then be copied from the respective file page. Many models come in several sizes, so you should choose a variant that fits within the available RAM.
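Whether a variant fits can be estimated roughly before downloading. The sketch below is a rule of thumb rather than an exact calculation: the 20 percent overhead for context and buffers and the 4 GB reserved for the operating system are assumptions that you should adjust to your own system.

```python
def fits_in_ram(model_file_gb: float, total_ram_gb: float,
                reserved_gb: float = 4.0, overhead_factor: float = 1.2) -> bool:
    """Rough check whether a GGUF file of the given size fits into RAM.

    overhead_factor and reserved_gb are assumed values, not measurements.
    """
    needed_gb = model_file_gb * overhead_factor   # weights plus working memory
    available_gb = total_ram_gb - reserved_gb     # what the OS leaves over
    return needed_gb <= available_gb


# Example: a 5 GB file on a 16 GB machine needs about 6 GB of the 12 GB left over.
print(fits_in_ram(5.0, 16.0))
```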

The Unraid Docker template only requires two changes to get started: if you are working without a GPU, remove the GPU flag, and paste in the link to the GGUF file on Hugging Face.
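The link in question is the direct download link, not the address of the file's preview page. On Hugging Face, direct links follow the pattern `https://huggingface.co/<repository>/resolve/<revision>/<filename>`. A small sketch of how such a link is put together; the repository and file name used here are invented placeholders, not a recommendation for a specific model.

```python
from urllib.parse import quote


def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct download link for a file in a Hugging Face repository."""
    # quote() escapes characters such as spaces that are not valid in a URL path.
    return (
        f"https://huggingface.co/{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


# Placeholder repository and file name, for illustration only.
print(gguf_download_url("example-org/example-model-GGUF",
                        "example-model.Q4_K_M.gguf"))
```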

Installation under Windows is largely the same. However, if the model is used without a GPU, the NoCUDA version must be downloaded. It may take some time to start because KoboldCPP first downloads the model before displaying the user interface. This is easy to see under Windows, but with Unraid or TrueNAS the log has to be opened to see the download progress. Under Unraid, it may also be necessary to increase the available storage space for Docker containers – depending on how large the model you choose is.

KoboldCPP offers four different interface modes: Instruct, Story, Chat and Adventure.

Instruct is used to give instructions to the LLM, Chat is similar to a chatbot, Story is good for novel writing, and Adventure is best for RPG-style interactive fiction.


It’s not particularly fast, but text generation is only slightly below average reading speed, which makes it absolutely usable for D&D scenarios on a 16-core AMD 5950X (currently around 300 euros on Amazon). It will probably run even faster on more modern CPUs, and the more cores available, the better. More RAM allows the use of larger models, although 16 GB should usually be sufficient. The size and type of the selected model also have a significant influence on generation speed: a slimmer model is noticeably faster.
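To compare your own setup with reading speed, the tokens per second reported by the backend can be converted into words per minute. The sketch below assumes roughly 0.75 English words per token and a reading speed of about 240 words per minute; both figures are common rules of thumb, not measurements from this test.

```python
def words_per_minute(tokens_per_second: float,
                     words_per_token: float = 0.75) -> float:
    """Convert generation speed into words per minute (assumed token ratio)."""
    return tokens_per_second * words_per_token * 60


def keeps_up_with_reader(tokens_per_second: float,
                         reading_wpm: float = 240.0) -> bool:
    """True if text is generated at least as fast as the assumed reading speed."""
    return words_per_minute(tokens_per_second) >= reading_wpm


# Example: 4 tokens per second is about 180 words per minute.
print(words_per_minute(4.0), keeps_up_with_reader(4.0))
```

Under these assumptions, a little over 5 tokens per second is enough to keep pace with a reader.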

For the best possible experience, running Large Language Models on a GPU is of course the way to go. But if you just want to try out your own LLM, bypass the restrictions of ChatGPT, Claude or Gemini, or avoid entrusting your data to these services, you don’t need any special hardware to get started, and you’ll still get a decently usable experience.
