Mozilla’s LlamaFile project is one of the most compelling ways to run large language models locally. Instead of wrestling with Python environments, CUDA versions, and dependency chains, LlamaFile bundles the entire runtime — weights and inference engine — into a single executable file. You download one file, make it executable, and run it. That’s the whole installation process.
What Is LlamaFile?
LlamaFile is built on top of llama.cpp, the highly optimized C++ inference engine for quantized LLMs. Mozilla’s contribution is the packaging layer: using the Cosmopolitan Libc toolchain, the project builds the inference engine as one portable binary and embeds the model weights in the same file. The result runs on Linux, macOS, and Windows (natively for files under Windows’ 4GB executable size limit, otherwise through WSL2) with no installation step.
The project is maintained at github.com/Mozilla-Ocho/llamafile and has first-class support for models in GGUF format.
Key advantages:
- Zero-dependency distribution: share a model as a single file
- Cross-platform: one binary for Linux, Mac, and Windows
- Built-in web UI: every LlamaFile ships with a browser interface
- OpenAI-compatible API server out of the box
Downloading LlamaFile Models
The easiest way to get started is to grab a pre-built LlamaFile from Hugging Face. Mozilla hosts several ready-to-run files:
| Model | Size | Hugging Face Repo |
|---|---|---|
| Mistral 7B Instruct v0.2 | ~4GB | Mozilla/Mistral-7B-Instruct-v0.2-llamafile |
| Phi-3 Mini 128K Instruct | ~2.2GB | Mozilla/Phi-3-mini-128k-instruct-llamafile |
| Llama-3.2 1B Instruct | ~0.8GB | Mozilla/Llama-3.2-1B-Instruct-llamafile |
| Meta Llama 3 8B Instruct | ~4.7GB | Mozilla/Meta-Llama-3-8B-Instruct-llamafile |
Download directly with wget or the Hugging Face CLI:
# Using wget
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile
# Using huggingface-cli (--local-dir . saves the file in the current directory instead of the cache)
pip install huggingface-hub
huggingface-cli download Mozilla/Mistral-7B-Instruct-v0.2-llamafile mistral-7b-instruct-v0.2.Q4_0.llamafile --local-dir .
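The wget URL follows Hugging Face's resolve/main pattern, so the same command can be generated for any repo in the table. A minimal sketch, assuming the file sits on the repo's main branch; hf_resolve_url is a hypothetical helper name, not part of any tool.

```shell
# Build a direct-download URL from a repo ID and a file name.
# Assumption: the file lives on the repo's "main" branch.
hf_resolve_url() {
    printf 'https://huggingface.co/%s/resolve/main/%s\n' "$1" "$2"
}

# Usage (commented out so nothing downloads by accident).
# wget -c resumes a partial download, which helps with multi-gigabyte files:
# wget -c "$(hf_resolve_url Mozilla/Mistral-7B-Instruct-v0.2-llamafile \
#     mistral-7b-instruct-v0.2.Q4_0.llamafile)"
```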
Making the File Executable (Linux/macOS)
On Linux and macOS, you need to set the execute permission before running:
chmod +x mistral-7b-instruct-v0.2.Q4_0.llamafile
On macOS, Gatekeeper may block unsigned binaries. Bypass it with:
# Remove the quarantine attribute
xattr -d com.apple.quarantine mistral-7b-instruct-v0.2.Q4_0.llamafile
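The two preparation steps can be folded into one helper that is safe to run on either platform. This is a sketch rather than an official procedure; prepare_llamafile is a hypothetical name, and the quarantine step is skipped everywhere except macOS.

```shell
# Make a downloaded llamafile runnable on Linux or macOS.
prepare_llamafile() {
    file=$1
    [ -f "$file" ] || { echo "not found: $file" >&2; return 1; }
    chmod +x "$file" || return 1
    # The quarantine attribute only exists on macOS. Ignore the error raised
    # when the attribute was never set on this file.
    if [ "$(uname -s)" = "Darwin" ] && command -v xattr >/dev/null 2>&1; then
        xattr -d com.apple.quarantine "$file" 2>/dev/null || true
    fi
}

# Usage:
# prepare_llamafile mistral-7b-instruct-v0.2.Q4_0.llamafile
```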
Running the Built-in Web Server
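Run the file with no arguments to start the server. Depending on the release, this may also open a chat prompt in the terminal; passing --server requests server mode only: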
./mistral-7b-instruct-v0.2.Q4_0.llamafile
Once the model has loaded, LlamaFile prints the address it is listening on, along the lines of:
llamafile server listening at http://127.0.0.1:8080
Open your browser to http://127.0.0.1:8080 and you’ll find a full chat interface — no additional software needed. The interface supports system prompts, temperature control, and conversation history.
Custom Port and Host
# Listen on a different port
./mistral-7b-instruct-v0.2.Q4_0.llamafile --port 9090
# Expose to local network (use with caution)
./mistral-7b-instruct-v0.2.Q4_0.llamafile --host 0.0.0.0 --port 8080
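When the server is started from a script, the model can take a while to load, so it helps to wait until the port answers before sending requests. A minimal sketch, assuming curl is installed and that the root URL serves the web UI as described above; wait_for_server is a hypothetical helper name.

```shell
# Poll a URL once per second until it answers or the attempts run out.
# Returns 0 as soon as the server responds, 1 otherwise.
wait_for_server() {
    url=$1
    tries=${2:-30}
    i=1
    while [ "$i" -le "$tries" ]; do
        if curl -fsS -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        # Skip the pause after the final attempt.
        [ "$i" -lt "$tries" ] && sleep 1
        i=$((i + 1))
    done
    return 1
}

# Usage:
# ./mistral-7b-instruct-v0.2.Q4_0.llamafile --port 9090 &
# wait_for_server http://127.0.0.1:9090/ 60 && echo "server is up"
```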
Command-Line Inference
For scripted or headless use cases, skip the web server entirely:
# Single prompt, no interactive mode
./mistral-7b-instruct-v0.2.Q4_0.llamafile \
--cli \
-p "Explain quantum entanglement in two sentences"
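# With a system-style instruction: --cli takes the prompt as raw text, so the
# instruction goes inside Mistral's [INST] ... [/INST] template
./mistral-7b-instruct-v0.2.Q4_0.llamafile \
--cli \
-p "[INST] You are a senior Linux engineer. What does the Linux OOM killer do? [/INST]"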
The --cli flag disables the HTTP server and streams the response directly to stdout — useful for shell pipelines.
# Redirect the response to a file; review it before running, since the model may wrap the script in explanatory text
./mistral-7b-instruct-v0.2.Q4_0.llamafile --cli -p "Write a bash script to monitor disk usage" > disk_monitor.sh
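The same pattern extends to batch jobs: one prompt per line in a text file, one model invocation per prompt. This is a sketch under stated assumptions; run_prompts and the LLAMAFILE variable are hypothetical names, and each prompt reloads the model, so the server API is the better choice for large batches.

```shell
# Run every non-empty line of a prompt file through a llamafile in --cli mode.
# LLAMAFILE (hypothetical variable) names the executable to call.
run_prompts() {
    prompt_file=$1
    while IFS= read -r prompt || [ -n "$prompt" ]; do
        [ -n "$prompt" ] || continue
        printf '### %s\n' "$prompt"
        # </dev/null stops the model process from reading the prompt file via stdin
        "$LLAMAFILE" --cli -p "$prompt" </dev/null
    done < "$prompt_file"
}

# Usage:
# LLAMAFILE=./mistral-7b-instruct-v0.2.Q4_0.llamafile
# run_prompts prompts.txt > answers.txt
```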
GPU Acceleration with -ngl
By default, LlamaFile runs on CPU only. To offload layers to an NVIDIA GPU, use the -ngl (number of GPU layers) flag:
# Offload all layers to GPU (best performance)
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999
# Offload partial layers (useful when VRAM is limited)
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 24
LlamaFile carries its GPU kernels inside the file, but it still depends on the system’s NVIDIA driver, and on Linux the project’s documentation also calls for the CUDA SDK. If GPU initialization fails, it falls back to CPU automatically.
For AMD GPUs with ROCm support, GPU offloading is more limited. Check the current llama.cpp ROCm compatibility for your specific card.
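A partial-offload value such as -ngl 24 can be estimated rather than guessed. The sketch below assumes every layer is the same size and ignores the KV cache and other buffers, so treat the result as an upper bound. layers_for_vram is a hypothetical helper, and the sizes in the usage line are illustrative (Mistral 7B has 32 transformer layers).

```shell
# Estimate how many layers fit in a VRAM budget.
# Assumptions: equal-sized layers; KV cache and scratch buffers not counted.
layers_for_vram() {
    model_mb=$1   # size of the weights in MB
    n_layers=$2   # number of transformer layers in the model
    vram_mb=$3    # VRAM to spend on weights, in MB
    fit=$(( vram_mb * n_layers / model_mb ))
    # Never report more layers than the model has.
    [ "$fit" -gt "$n_layers" ] && fit=$n_layers
    echo "$fit"
}

# Usage: a ~4100 MB model with 32 layers and a 3000 MB budget gives 23.
# ./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl "$(layers_for_vram 4100 32 3000)"
```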
Thread Count for CPU Inference
# Use one thread per logical CPU (nproc counts hyperthreads, so this can exceed the physical core count)
./mistral-7b-instruct-v0.2.Q4_0.llamafile --threads "$(nproc)"
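Since nproc reports logical CPUs and is missing on macOS, a portable way to get the physical core count is sometimes useful. A sketch under stated assumptions: physical_cores is a hypothetical helper that relies on lscpu on Linux and sysctl on macOS, and falls back to the logical CPU count when neither works.

```shell
# Print the number of physical CPU cores, falling back to logical CPUs.
physical_cores() {
    n=""
    case "$(uname -s)" in
        Darwin)
            n=$(sysctl -n hw.physicalcpu 2>/dev/null)
            ;;
        Linux)
            if command -v lscpu >/dev/null 2>&1; then
                # One line per unique (core, socket) pair.
                n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
            fi
            ;;
    esac
    # Fall back when the value is empty, zero, or not a number.
    case "$n" in
        ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    echo "$n"
}

# Usage:
# ./mistral-7b-instruct-v0.2.Q4_0.llamafile --threads "$(physical_cores)"
```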
Running on Windows via WSL
A LlamaFile is also a valid Windows executable: renaming the .llamafile extension to .exe lets it run natively, as long as the file stays under Windows’ 4GB executable size limit. For files at or above that limit, WSL2 is the more reliable path:
# Inside WSL2
chmod +x mistral-7b-instruct-v0.2.Q4_0.llamafile
./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999
Then access the web UI from your Windows browser at http://localhost:8080 — WSL2’s network bridge handles the port forwarding automatically.
Size vs. Performance Tradeoffs
The file used throughout this guide is a Q4_0 quantization. Here’s how different quantization levels affect size and quality for a 7B model:
| Quantization | File Size | Quality Loss | Speed |
|---|---|---|---|
| Q4_0 | ~3.8GB | Moderate | Fastest |
| Q4_K_M | ~4.1GB | Low | Fast |
| Q5_K_M | ~4.8GB | Minimal | Moderate |
| Q8_0 | ~7.2GB | Near-lossless | Slower |
| F16 | ~14GB | None | Slowest |
For most users, Q4_K_M hits the sweet spot: small enough to fit in 8GB of VRAM with room left for the context cache, and close enough to full precision that the difference is hard to notice in conversational tasks.
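The file sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below assumes a 6.74-billion-parameter model (the usual count behind a "7B" Llama-family label) and uses the block layouts of Q4_0 (4.5 bits per weight) and Q8_0 (8.5 bits per weight). est_gb is a hypothetical helper, and real files differ slightly because of metadata and tensors stored at other precisions.

```shell
# Estimate a model file's size in GB (10^9 bytes):
#   size = parameters (in billions) * bits per weight / 8
est_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

est_gb 6.74 4.5    # Q4_0 -> 3.8
est_gb 6.74 8.5    # Q8_0 -> 7.2
est_gb 6.74 16     # F16  -> 13.5
```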
OpenAI-Compatible API
The LlamaFile server exposes an OpenAI-compatible REST API, so you can point existing tools at it without code changes:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral",
"messages": [{"role": "user", "content": "What is llama.cpp?"}]
}'
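This means any tool that accepts a custom base_url (LangChain, Open WebUI, Continue.dev, and so on) can use your local LlamaFile as its backend. Point the base URL at http://localhost:8080/v1; the API key can usually be any placeholder string, since the local server does not check it by default.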
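For scripts that build the request dynamically, hand-writing JSON inside a shell string breaks as soon as the prompt contains a quote. A minimal sketch, assuming prompts without newlines or control characters; json_escape and chat_payload are hypothetical helper names, and jq is a sturdier choice when it is available.

```shell
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Limitation: newlines and other control characters are NOT handled.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a chat-completions request body for a single user message.
chat_payload() {
    printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage against a running server (commented out):
# curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload mistral 'What is "llama.cpp"?')"
```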
When to Use LlamaFile
LlamaFile is the right tool when you need:
- Portability: share a working model with colleagues who can’t set up Python
- Air-gapped environments: the single file contains everything
- Quick experiments: no environment setup, just download and run
- Demos: the built-in UI looks professional out of the box
For production inference servers handling many concurrent users, a dedicated serving stack such as vLLM is built for higher throughput, and Ollama is a common choice for managing several models on one machine. But for individual use and prototyping, LlamaFile is hard to beat.