LM Studio is the most user-friendly application for running large language models locally on your own hardware. Unlike Ollama (which is command-line focused), LM Studio provides a polished desktop UI for discovering, downloading, and chatting with models, plus a local server that emulates the OpenAI API so existing tools and scripts work without modification. This guide covers setup, model selection, and performance optimization.
What LM Studio Does
LM Studio handles everything in a single application:
- Model discovery and download: searchable catalog of models from Hugging Face
- Chat interface: ChatGPT-like UI for testing and using models
- Local server: OpenAI-compatible API at localhost:1234
- Hardware acceleration: automatic GPU acceleration via CUDA, Metal, ROCm, or Vulkan
Your conversations and data stay entirely on your machine — no API keys, no cloud requests, no data collection.
System Requirements
LM Studio runs on Windows 10/11, macOS 12.6+, and Linux. The hardware requirements scale with the model you want to run:
| Model Size | RAM Needed | GPU VRAM Needed | Example Models |
|---|---|---|---|
| 1–4B params | 4 GB RAM | 2 GB VRAM | Phi-3 Mini, Gemma 2 2B |
| 7–8B params | 8 GB RAM | 6–8 GB VRAM | Mistral 7B, Llama 3.1 8B |
| 13–14B params | 16 GB RAM | 10–12 GB VRAM | Qwen2.5 14B, Phi-4 |
| 30B+ params | 32+ GB RAM | 20+ GB VRAM | Qwen2.5 32B, Llama 3.1 70B (quantized; roughly 40 GB at 4-bit) |
Models can also run entirely on CPU (RAM only) without a GPU, which is significantly slower but functional. As a rough guide, a 7B model generates 3–8 tokens/second on CPU and 30–80 tokens/second with GPU acceleration; actual speed depends on the hardware and the quantization level.
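As a rough sanity check on the table above, a model's weight footprint is approximately its parameter count multiplied by the bits stored per weight. The sketch below is a back-of-envelope estimate, not how LM Studio itself computes requirements. The function name, the 20% overhead allowance, and the bits-per-weight figures are assumptions made for illustration.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 0.2) -> float:
    """Rough memory needed to hold a model, in GB (1 GB = 1e9 bytes).

    overhead is an assumed allowance for the context cache and runtime buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


if __name__ == "__main__":
    # Assumed average bits per weight: about 4.5 for a 4-bit
    # quantization, 16 for unquantized FP16 weights.
    for label, bits in [("4-bit quantized", 4.5), ("FP16", 16)]:
        print(f"7B at {label}: ~{estimate_memory_gb(7, bits):.1f} GB")
```

Under these assumptions a 4-bit 7B model needs a little under 5 GB, which is consistent with the 6–8 GB VRAM row once longer contexts are accounted for.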
Installing LM Studio
Download the installer from https://lmstudio.ai. Direct downloads are available for all three platforms.
On Windows: run the installer .exe, no configuration needed.
On macOS: drag to Applications.
On Linux: download the .AppImage, make it executable, and run it (substitute the filename you actually downloaded): chmod +x LM_Studio.AppImage && ./LM_Studio.AppImage
Downloading Your First Model
- Open LM Studio → click the Search icon (magnifying glass) in the left sidebar
- Search for a model name, e.g., “Llama 3.1” or “Mistral”
- Browse the results. Each model shows:
  - Parameter count (7B, 14B, etc.)
  - Quantization level (Q4_K_M, Q5_K_M, Q8_0, etc.)
  - File size
- Click the download button next to your chosen variant
Understanding Quantization
Models are available in different quantization levels that trade file size against quality:
| Quantization | Quality | File Size (7B model) |
|---|---|---|
| Q8_0 | Near-lossless | ~7 GB |
| Q5_K_M | Excellent | ~5 GB |
| Q4_K_M | Good (recommended default) | ~4 GB |
| Q3_K_M | Acceptable | ~3 GB |
| Q2_K | Degraded | ~2.5 GB |
For most uses, Q4_K_M offers the best balance. On limited VRAM, drop to Q3_K_M. If quality matters most and you have the space, use Q5_K_M.
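The selection rule above can be sketched as a small lookup: walk the quantization levels from highest to lowest quality and take the first one that fits. The file sizes are the approximate 7B figures from the table. The function name and the 1 GB headroom reserved for context and runtime buffers are assumptions for illustration only.

```python
# Approximate 7B file sizes in GB, taken from the table above,
# ordered from highest to lowest quality.
QUANT_SIZES_GB = [
    ("Q8_0", 7.0),
    ("Q5_K_M", 5.0),
    ("Q4_K_M", 4.0),
    ("Q3_K_M", 3.0),
    ("Q2_K", 2.5),
]


def pick_quant(budget_gb: float, headroom_gb: float = 1.0):
    """Return the highest-quality level whose file fits in the budget.

    headroom_gb is an assumed reserve for context and runtime buffers.
    Returns None when nothing fits.
    """
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + headroom_gb <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (4, 6, 8):
        print(f"{vram} GB VRAM -> {pick_quant(vram)}")
```

This sketch only maximizes quality within a memory budget. It does not account for speed, so Q4_K_M may still be the more practical choice even when a larger file would fit.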
Recommended Models for 2026
General Chat and Reasoning
- Llama 3.1 8B Instruct Q5_K_M — Meta’s excellent general-purpose model; great reasoning, good instruction following
- Mistral 7B Instruct v0.3 Q5_K_M — Fast, efficient, excellent for structured output
- Gemma 2 9B Instruct — Google’s Gemma 2 punches above its weight for a 9B model
Coding
- Qwen2.5 Coder 7B Instruct — Alibaba’s coding-specialized model with strong Python, JS, and Go performance
- DeepSeek-Coder-V2 Lite Instruct — Excellent code completion and explanation
- Codestral Mamba 7B — Mistral’s coding model, fast and accurate
Longer Context / Complex Tasks
- Phi-4 14B Q4_K_M — Microsoft’s Phi-4 model performs far above its 14B size
- Qwen2.5 14B Instruct Q4_K_M — Strong all-round performance if you have 16 GB RAM (Llama 3.1 itself ships only in 8B, 70B, and 405B sizes)
Loading and Chatting with a Model
- Click the Chat icon in the left sidebar
- Click Load a model or use the model selector dropdown at the top
- Select your downloaded model
- LM Studio loads it into RAM/VRAM (takes 5–30 seconds)
- Type your message and press Enter
The chat UI supports system prompts, conversation history, and temperature/top-p controls.
Hardware Acceleration Settings
Go to Settings (gear icon) → Performance:
- GPU Offload Layers: How many transformer layers to run on the GPU. Set this to the maximum your VRAM allows; the more layers on the GPU, the faster the generation
- CPU Threads: For CPU-only inference, set to your physical core count (not logical)
- Context Length: How much conversation history the model retains. Larger context = more RAM used; 4096 is fine for most chats
LM Studio auto-detects and configures your GPU. On NVIDIA, it uses CUDA. On AMD, it uses ROCm (Linux) or Vulkan (Windows). On Apple Silicon, it uses Metal for exceptional performance — even M2/M3/M4 MacBook Air chips run 7B models well.
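To get a starting value for GPU Offload Layers, one rough approach is to divide the usable VRAM by the approximate size of a single layer. The sketch below assumes every layer takes an equal share of the model file and keeps a fixed reserve free. Both are simplifications, and the function name and the 1.5 GB reserve are hypothetical, not LM Studio settings.

```python
import math


def estimate_offload_layers(model_size_gb: float, n_layers: int,
                            vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM.

    Assumes each layer takes an equal share of the model file and that
    reserve_gb of VRAM stays free for the context cache and the desktop.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Illustrative example: a ~4 GB Q4_K_M 7B model with 32 layers.
    for vram in (4, 6, 8):
        layers = estimate_offload_layers(4.0, 32, vram)
        print(f"{vram} GB VRAM -> offload about {layers} of 32 layers")
```

Treat the result as a first guess: if loading fails or generation stalls, lower the layer count and try again.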
Using the Local OpenAI-Compatible API
LM Studio’s local server is its most powerful feature for developers. Enable it:
- Click Local Server in the left sidebar
- Load a model
- Click Start Server (default port: 1234)
Use with Python:
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",
        api_key="not-needed",  # LM Studio doesn't require auth
    )

    response = client.chat.completions.create(
        model="local-model",  # model name doesn't matter with LM Studio
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain LUKS encryption in 3 sentences."},
        ],
    )

    print(response.choices[0].message.content)
Any application that supports custom OpenAI API endpoints works with LM Studio — including Cursor, Obsidian AI plugins, and custom scripts. This is how you integrate local AI into your existing tools without sending data to OpenAI.
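For scripts where installing the openai package is not an option, the same endpoint can be reached with plain HTTP, which is roughly what those tools do under the hood. The following is a minimal sketch using only the Python standard library. The helper names, the default temperature, and the timeout are assumptions for illustration, and the request is only sent if you uncomment the last line while the server is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server address


def build_chat_request(prompt: str, model: str = "local-model",
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request("Say hello in five words.")
    print(req.get_method(), req.full_url)
    # Uncomment once the LM Studio server is running with a model loaded:
    # print(send(req))
```

Separating request construction from sending keeps the payload easy to inspect and log before anything touches the network.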
LM Studio vs. Ollama
Both run local LLMs but with different approaches:
| Feature | LM Studio | Ollama |
|---|---|---|
| UI | Desktop GUI | CLI |
| Model discovery | Built-in browser | Manual pull |
| API | OpenAI-compatible | Ollama API + OpenAI compat |
| Open WebUI support | Yes | Yes |
| Platforms | Win/Mac/Linux | Win/Mac/Linux |
| Best for | Beginners, GUI users | Power users, Docker deployments |
Many users run both: LM Studio for exploring and testing models, Ollama for production deployments behind Open WebUI.
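Because both tools expose an OpenAI-compatible endpoint, switching a script between them can be as simple as changing the base URL. The sketch below assumes default ports on the local machine: 1234 for LM Studio, as used in this guide, and 11434 for Ollama. The dictionary and function names are hypothetical.

```python
# Default local endpoints. The LM Studio port comes from this guide;
# 11434 is Ollama's default port.
BACKENDS = {
    "lmstudio": "http://localhost:1234/v1",
    "ollama": "http://localhost:11434/v1",
}


def chat_completions_url(backend: str) -> str:
    """Return the chat completions URL for a named backend."""
    try:
        base_url = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return f"{base_url}/chat/completions"


if __name__ == "__main__":
    for name in BACKENDS:
        print(f"{name}: {chat_completions_url(name)}")
```

Unlike the placeholder model name used with LM Studio earlier, Ollama expects the model field to name a model you have already pulled.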
LM Studio lowers the barrier to running private AI substantially — if you’ve been hesitant to set up local LLMs due to the command-line complexity, LM Studio’s GUI makes it as approachable as installing any desktop app.