llama.cpp: Serving Language Models

In this article we will explore how to use llama.cpp to deploy large language models (LLMs) and vision-language models (VLMs) on consumer hardware, and show how the community has been using local models to create and consume AI applications.
What is llama.cpp?
llama.cpp is an open-source framework that makes running large language models (LLMs) and vision-language models (VLMs) practical on consumer hardware. Written in C/C++ and optimized for performance, it transforms resource-hungry AI models into efficient inference engines that can run on everything from laptops to enterprise servers.
The magic behind llama.cpp lies in quantization techniques that reduce model memory requirements by up to 75% while maintaining quality. For example, a 30-billion-parameter model that would normally require 60GB of VRAM can run comfortably on a GPU with 24GB of VRAM. The llama-server provides an OpenAI-compatible API, so you can drop llama.cpp into existing applications without rewriting code.
Key features include:
- Hardware flexibility: Runs on NVIDIA GPUs, Apple Silicon, AMD GPUs, and even CPUs
- Memory efficiency: Multiple quantization formats (Q4, Q5, Q6, Q8) balance quality and size
- Production-ready: OpenAI-compatible API makes integration seamless
- Advanced optimizations: Speculative decoding, flash attention, and continuous batching
- Multimodal support: Handle text, images, and audio through vision-language models
GGUF Quantization
GGUF (GPT-Generated Unified Format) is a binary file format designed for efficient storage and inference of large language models. Developed by the llama.cpp team, GGUF is optimized for rapid model loading and works particularly well with CPU-based inference, though it can leverage GPUs when available.
Large language models typically store weights as 16-bit or 32-bit floating-point numbers, which are computationally expensive and memory-intensive. Quantization converts these high-precision weights into lower-precision representations—such as 8-bit, 4-bit, or even 2-bit integers—dramatically reducing model size and speeding up inference with minimal quality loss.
So when you see a model file labeled Q4_K_M, here’s what each part means:
- The number (Q4, Q5, Q8): Indicates the average number of bits used to represent each weight—more bits mean higher precision and better accuracy, but also larger file sizes
- The “K”: Represents “K-quants,” a significant advancement in GGUF quantization that uses grouped quantization with per-group scale and zero point for improved quality
- The suffix (S, M, L): S (Small) prioritizes minimal size with heavier quantization, M (Medium) balances size and quality, and L (Large) uses higher precision on essential tensors for maximum quality
For example, Q4_K_M is the recommended 4-bit quantization, offering a balanced trade-off between quality and size. Meanwhile, Q8_0 is nearly indistinguishable from the original full-precision model, while Q2_K represents the smallest size but with extreme quality loss.
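As a rough back-of-the-envelope check, you can estimate a model's weight memory from its parameter count and the average bits per weight. The sketch below uses approximate bits-per-weight figures for each format and ignores the KV cache and runtime overhead, so treat the output as a ballpark, not an exact file size:
# Rough weight-memory estimate: parameters * bits-per-weight / 8 bytes.
# The bits-per-weight values are approximate; real GGUF files also carry
# metadata and per-block scales.
def approx_size_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / (1024 ** 3)

params = 30e9  # a 30B-parameter model
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name:7s} ~{approx_size_gib(params, bpw):5.1f} GiB")
For the 30B example this lands around 17 GiB at Q4_K_M, which is why it fits on a 24GB GPU with room left over for the KV cache.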
Setting Up llama.cpp Server
Let's start by installing llama.cpp, then use the REST API endpoint to process a simple query.
We can use a pre-built Docker image and get running instantly:
docker run -p 8080:8080 -v /path/to/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/your-model.gguf \
--host 0.0.0.0 --port 8080 --n-gpu-layers 999
However, I build from source with CUDA support on my Ubuntu Server machine:
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
Regarding models, lately I have been using Qwen3-Coder-30B-A3B-Instruct: a Mixture-of-Experts (MoE) model with 30 billion total parameters but only 3 billion active at any time.
This model excels at coding tasks across multiple programming languages, with particularly strong Python and JavaScript performance. It supports tool calling and function execution, has a 32K-token native context window that can be extended up to 128K, and its instruction following is optimized for development workflows.
Large Language Models: Qwen3-Coder
Download pre-quantized GGUF files from Hugging Face:
huggingface-cli download \
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--local-dir ./models
The Q4_K_M quantization is a good option for my GPU, requiring only 17GB of VRAM while maintaining strong code generation quality.
Launch the server with optimized parameters:
./llama-server \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--jinja \
-ngl 99 \
--threads 2 \
--ctx-size 32684 \
--temp 0.7 \
--top-p 0.8 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.05 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn 'on'
Let’s break down the parameters to understand what’s happening:
- -m: Path to your model file in GGUF format
- --host 0.0.0.0: Allow network access (use 127.0.0.1 for local-only)
- --port 8080: API endpoint port
- --jinja: Apply the Jinja chat template, enabling function-calling support
- -ngl 99: Offload all layers to the GPU
- --threads 2: Number of CPU threads
- --ctx-size 32684: Context window size
- --cache-type-k q8_0: Quantize the K cache to 8-bit
- --cache-type-v q8_0: Quantize the V cache to 8-bit
Counter-intuitively, fewer threads often help GPU-bound inference:
--threads 2 # Optimal for single GPU
--threads -1 # Use all CPU cores
The cache quantization (--cache-type-k q8_0 and --cache-type-v q8_0) allows fitting large contexts into GPU memory without significant quality loss.
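To see why cache quantization matters, here is a rough sketch of how the KV cache grows with context length. The architecture numbers are illustrative placeholders (check the model card or the server log for the real values), and q8_0 is approximated as one byte per element:
# Rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# The layer/head numbers below are illustrative placeholders, not the real
# Qwen3-Coder values; read them from the model card or the llama-server log.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / (1024 ** 3)

layers, kv_heads, head_dim, ctx = 48, 4, 128, 32768  # placeholder values
print("f16 cache :", round(kv_cache_gib(layers, kv_heads, head_dim, ctx, 2), 2), "GiB")
print("q8_0 cache:", round(kv_cache_gib(layers, kv_heads, head_dim, ctx, 1), 2), "GiB")
With the server running, we can query its OpenAI-compatible endpoint using the standard OpenAI Python client: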
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='None'  # Not required for local server
)

response = client.chat.completions.create(
    model='Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf',
    messages=[{
        "role": "user",
        "content": "Write Python code for quicksort"
    }]
)

print(response.choices[0].message.content)
Also, Qwen3-Coder supports function calling using special tokens:
response = client.chat.completions.create(
    model='Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf',
    messages=[{
        "role": "user",
        "content": "What's the current temperature in San Francisco?"
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get the current temperature for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }]
)

print(response.choices[0].message.tool_calls[0])
The model will generate structured tool calls that your application can execute and feed back for final response generation.
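As a minimal sketch of that round trip (get_current_temperature below is a stand-in implementation; a real application would call an actual weather service), the tool result is appended as a "tool" message and sent back for the final answer:
import json

# Parse the tool call the model produced above.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# Stand-in implementation of the declared tool; replace with a real API call.
def get_current_temperature(location: str) -> dict:
    return {"location": location, "temperature_c": 18}

result = get_current_temperature(**args)

# Feed the result back so the model can produce the final, grounded answer.
followup = client.chat.completions.create(
    model='Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf',
    messages=[
        {"role": "user", "content": "What's the current temperature in San Francisco?"},
        response.choices[0].message,  # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)},
    ],
)
print(followup.choices[0].message.content)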
Speculative Decoding
One of the most exciting llama.cpp features is speculative decoding, a technique that can boost inference speed by 1.5-2.5x using a “draft” model.
First, a small “draft” model quickly generates several candidate tokens; then the main model verifies them in parallel. Accepted tokens are kept, a rejected token triggers a fallback to the main model's own prediction, and the process repeats, producing output identical to standard generation.
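Conceptually, the draft/verify loop looks something like the toy sketch below; both "models" here are stand-ins that predict the next character of a fixed string, not llama.cpp internals:
import random

# Toy sketch of speculative decoding's draft/verify loop (not llama.cpp internals).
TARGET = "print('hello from speculative decoding')"

def main_next(prefix: str) -> str:      # the slow, always-correct model
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else ""

def draft_next(prefix: str) -> str:     # the fast, sometimes-wrong model
    return main_next(prefix) if random.random() < 0.8 else "?"

def generate(draft_len: int = 8) -> str:
    out, drafted, accepted = "", 0, 0
    while len(out) < len(TARGET):
        # 1. The draft model cheaply proposes up to draft_len candidate tokens.
        draft = []
        while len(draft) < draft_len and len(out) + len(draft) < len(TARGET):
            draft.append(draft_next(out + "".join(draft)))
        drafted += len(draft)
        # 2. The main model verifies the candidates (in llama.cpp, in one parallel pass).
        for tok in draft:
            if tok == main_next(out):
                out, accepted = out + tok, accepted + 1
            else:
                out += main_next(out)   # 3. On the first mismatch, fall back and re-draft.
                break
    print(f"draft acceptance: {accepted}/{drafted}")
    return out

print(generate())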
For example, we can pair a 0.5B-parameter draft model with a 32B main model. First, download the pre-quantized GGUF for the main Qwen2.5-Coder model:
huggingface-cli download \
unsloth/Qwen2.5-Coder-32B-Instruct-GGUF \
Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
--local-dir ./models
Then the draft model:
huggingface-cli download \
unsloth/Qwen2.5-Coder-0.5B-Instruct-GGUF \
Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
--local-dir ./models
Then run them together:
./llama-server \
-m models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
-md models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf \
--draft-max 16 \
--draft-min 1 \
--host 0.0.0.0 \
--port 8080 \
--jinja \
-ngl 99 \
-ngld 99 \
--threads 2 \
--ctx-size 21790 \
--temp 0.7 \
--top-p 0.8 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.05 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn 'on'
Additional key parameters:
- -md: Draft model path
- -ngld: GPU layers for the draft model
- --draft-max 16: Maximum tokens to draft (4-16 optimal for code)
- --draft-min 1: Minimum draft tokens
- --p-split 0.1: Probability threshold for token splitting
The key to effective speculative decoding is a high acceptance rate (70%+). To achieve that:
- Use same-family models: Qwen draft for Qwen main, Llama draft for Llama main
- Match vocabulary: Different tokenizers break speculation
- Tune draft size: Too many tokens reduce acceptance; too few miss opportunities
- Monitor metrics: Track acceptance rates via API stats
The API returns helpful diagnostics:
"timings": {
"completion_tokens": 478,
"draft_n": 439,
"draft_n_accepted": 330
}
In our example the acceptance rate is 330/439 ≈ 0.75.
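A quick way to keep an eye on this, assuming the server includes a timings object like the one above in its responses, is to read it straight from the JSON:
import requests

# Send a request and read the speculative-decoding stats from the response,
# assuming it carries a "timings" object like the one shown above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "Write Python code for quicksort"}],
    },
).json()

timings = resp.get("timings", {})
drafted = timings.get("draft_n", 0)
accepted = timings.get("draft_n_accepted", 0)
if drafted:
    print(f"draft acceptance rate: {accepted}/{drafted} = {accepted / drafted:.2f}")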
Vision-Language Models: Qwen3-VL
While Qwen3-Coder excels at text and code, vision-language models (VLMs) extend AI capabilities to images and multimodal content. llama.cpp supports VLMs through multimodal projectors that bridge vision encoders with language models.
VLMs require two components:
- Base language model (GGUF format)
- Multimodal projector (mmproj file) that processes images
huggingface-cli download unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF \
--local-dir models/qwen3-vl-30b-a3b-instruct-gguf \
--include "*UD-Q4_K_XL.gguf" "mmproj-F16.gguf"
Serving a VLM:
./llama-server \
--model models/qwen3-vl-30b-a3b-instruct-gguf/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
--mmproj models/qwen3-vl-30b-a3b-instruct-gguf/mmproj-F16.gguf \
--n-gpu-layers 99 \
--jinja \
--top-p 0.8 \
--top-k 20 \
--temp 0.7 \
--min-p 0.0 \
--flash-attn 'on' \
--presence-penalty 1.5 \
--ctx-size 8192
The API accepts base64-encoded images:
import base64
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='None'  # Not required for local server
)

with open('image.jpg', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model='Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf',
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
VLMs consume more memory than text-only models due to image tokens. Images typically consume 300-1000 tokens depending on resolution, so adjust context windows accordingly.
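As a quick budget check (the 300-1000 tokens-per-image figure is a rough range, not an exact cost), you can estimate how many images fit in a given context window:
# Rough context budgeting for multimodal requests; illustrative numbers only.
ctx_size = 8192            # matches --ctx-size above
tokens_per_image = 1000    # worst case from the rough range above
prompt_and_reply = 1500    # leave room for the text prompt and the answer

max_images = (ctx_size - prompt_and_reply) // tokens_per_image
print(f"~{max_images} high-resolution images fit alongside the prompt")  # ~6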
Conclusion
llama.cpp has become a production-ready platform for running AI models. With optimizations like speculative decoding, quantization, and continuous batching, consumer hardware is able to handle serious AI workloads. It's a great option if you're building privacy-focused applications, reducing provider costs, or experimenting with cutting-edge models.
References
- Official Repository: github.com/ggml-org/llama.cpp
- Model Hub: Hugging Face GGUF models
- Documentation: llama.cpp examples
- Unsloth: unsloth.ai
Happy inferencing!