LM Studio is one of the most approachable desktop applications for running large language models locally on your own hardware. Unlike Ollama, which is primarily command-line driven, LM Studio provides a polished desktop UI for discovering, downloading, and chatting with models, plus a local server that emulates the OpenAI API, so existing tools and scripts usually need only a changed base URL. This guide covers setup, model selection, and performance tuning.

What LM Studio Does

LM Studio handles everything in a single application:

Model discovery and download — searchable catalog of models from Hugging Face
Chat interface — ChatGPT-like UI for testing and using models
Local server — OpenAI-compatible API at localhost:1234
Hardware acceleration — automatic GPU acceleration via CUDA, Metal, ROCm, or Vulkan

Your conversations and data stay entirely on your machine — no API keys, no cloud requests, no data collection.

System Requirements

LM Studio runs on Windows 10/11, macOS (Apple Silicon; the official requirements list macOS 13.4 or newer), and Linux. The hardware requirements scale with the model you want to run:

Model Size	RAM Needed	GPU VRAM Needed	Example Models
1–3B params	4 GB RAM	2 GB VRAM	Phi-3 Mini, Gemma 2 2B
7B params	8 GB RAM	6–8 GB VRAM	Mistral 7B, Llama 3.1 8B
13–14B params	16 GB RAM	10–12 GB VRAM	Qwen2.5 14B, Phi-4
30B+ params	32 GB RAM	20+ GB VRAM	Qwen2.5 32B (Llama 3.1 70B at Q4 needs roughly 40 GB or more)

Models can also run entirely on CPU (RAM only) without a GPU, which is significantly slower but functional. As a rough guide, a 7B model generates about 3–8 tokens/second on CPU and 30–80 tokens/second with GPU acceleration, depending on hardware and quantization.
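As a rough illustration of how to read the table above, the sketch below picks the largest tier a machine can handle. The tier numbers are copied from the table (using the lower bound of each VRAM range, which is an assumption), and the function name `largest_tiers` is hypothetical, not part of LM Studio.

```python
from typing import Dict, Optional

# (label, minimum RAM in GB, minimum VRAM in GB), copied from the table above.
# Lower bounds of the VRAM ranges are used; treat them as rough guidance only.
TIERS = [
    ("1-3B", 4, 2),
    ("7B", 8, 6),
    ("13-14B", 16, 10),
    ("30B+", 32, 20),
]


def largest_tiers(ram_gb: float, vram_gb: float = 0.0) -> Dict[str, Optional[str]]:
    """Return the largest tier that fits with GPU acceleration and on CPU only."""
    result: Dict[str, Optional[str]] = {"gpu": None, "cpu": None}
    for label, min_ram, min_vram in TIERS:
        if ram_gb >= min_ram:
            result["cpu"] = label  # fits in system RAM, so CPU-only is possible
            if vram_gb >= min_vram:
                result["gpu"] = label  # also fits the VRAM guideline
    return result


# Example: a laptop with 16 GB RAM and an 8 GB GPU.
print(largest_tiers(ram_gb=16, vram_gb=8))
```

On such a machine the 7B tier fits on the GPU, while the 13–14B tier is only reachable in the slower CPU-only mode.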

Installing LM Studio

Download from https://lmstudio.ai. Available as a direct download for all three platforms.

On Windows: run the installer .exe; no configuration is needed. On macOS: drag the app to Applications. On Linux: download the .AppImage, make it executable, and run it (substitute the filename you actually downloaded): chmod +x LM_Studio.AppImage && ./LM_Studio.AppImage

Downloading Your First Model

Open LM Studio → click the Search icon (magnifying glass) in the left sidebar
Search for a model name, e.g., “Llama 3.1” or “Mistral”
Browse the results — each model shows:
- Parameter count (7B, 14B, etc.)
- Quantization level (Q4_K_M, Q5_K_M, Q8_0, etc.)
- File size
Click the download button next to your chosen variant

Understanding Quantization

Models are available in different quantization levels that trade file size against quality:

Quantization	Quality	File Size (7B model)
Q8_0	Near-lossless	~7 GB
Q5_K_M	Excellent	~5 GB
Q4_K_M	Good (recommended default)	~4 GB
Q3_K_M	Acceptable	~3 GB
Q2_K	Degraded	~2.5 GB

For most uses, Q4_K_M offers the best balance. On limited VRAM, drop to Q3_K_M. If quality matters most and you have the space, use Q5_K_M.
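The file sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces them approximately. The bits-per-weight figures are rounded assumptions based on typical GGUF quantization overhead, not exact values, and `estimate_file_size_gb` is a hypothetical helper.

```python
# Approximate effective bits per weight for common GGUF quantization levels.
# These are rounded assumptions; real files vary by model architecture.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}


def estimate_file_size_gb(n_params: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a quantization level."""
    bits = n_params * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


# Print a rough size for each quantization level of a 7B-parameter model.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_file_size_gb(7e9, quant):.1f} GB")
```

The same function gives a quick sanity check for other sizes: pass 14e9 for a 14B model to see whether a given quantization is likely to fit in your VRAM.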

Recommended Models for 2026

General Chat and Reasoning

Llama 3.1 8B Instruct Q5_K_M — Meta’s excellent general-purpose model; great reasoning, good instruction following
Mistral 7B Instruct v0.3 Q5_K_M — Fast, efficient, excellent for structured output
Gemma 2 9B Instruct — Google’s Gemma 2 punches above its weight for a 9B model

Coding

Qwen2.5 Coder 7B Instruct — Alibaba’s coding-specialized model with strong Python, JS, and Go performance
DeepSeek-Coder-V2 Lite Instruct — Excellent code completion and explanation
Codestral Mamba 7B — Mistral’s coding model, fast and accurate

Longer Context / Complex Tasks

Phi-4 14B Q4_K_M — Microsoft’s Phi-4 model performs far above its 14B size
Qwen2.5 14B Instruct Q4_K_M — Strong all-round performance if you have 16 GB RAM

Loading and Chatting with a Model

Click the Chat icon in the left sidebar
Click Load a model or use the model selector dropdown at the top
Select your downloaded model
LM Studio loads it into RAM/VRAM (takes 5–30 seconds)
Type your message and press Enter

The chat UI supports system prompts, conversation history, and temperature/top-p controls.
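For readers who later move from the chat UI to the API, it can help to see how those same controls look as data. The sketch below models a conversation in the OpenAI chat format that LM Studio's server accepts. The `ChatSession` class and its default values are hypothetical illustrations, not LM Studio's internal representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChatSession:
    """A conversation plus sampling settings, in OpenAI chat format."""

    system_prompt: str = "You are a helpful assistant."
    temperature: float = 0.7  # assumed default; higher means more random output
    top_p: float = 0.95       # assumed default; nucleus sampling cutoff
    history: List[Dict[str, str]] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        """Append one turn; role is 'user' or 'assistant'."""
        self.history.append({"role": role, "content": content})

    def to_payload(self, model: str = "local-model") -> dict:
        """Build a chat-completions request body from the current state."""
        messages = [{"role": "system", "content": self.system_prompt}]
        messages += self.history
        return {
            "model": model,
            "messages": messages,
            "temperature": self.temperature,
            "top_p": self.top_p,
        }


session = ChatSession(temperature=0.2)
session.add("user", "Summarize what quantization does.")
print(session.to_payload())
```

Because the full history is resent on every request, long conversations grow the prompt, which is why the context length setting matters.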

Hardware Acceleration Settings

Go to Settings (gear icon) → Performance:

GPU Offload Layers: How many transformer layers to run on the GPU. Set this to the maximum your VRAM allows; more layers on the GPU means faster generation
CPU Threads: For CPU-only inference, set to your physical core count (not logical)
Context Length: The maximum number of tokens (prompt, conversation history, and reply) the model can attend to at once. A larger context uses more RAM/VRAM; 4096 is fine for most chats
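To make the trade-offs in these settings concrete, here is a back-of-the-envelope sketch of two estimates: the memory a context window consumes (the KV cache) and how many layers fit in a VRAM budget. It assumes a standard transformer with equally sized layers and a 16-bit cache, and ignores embeddings and compute buffers, so LM Studio's own estimates may differ. Both function names are hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values across all layers.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def layers_that_fit(model_bytes: float, n_layers: int,
                    vram_bytes: float, reserved_bytes: float = 0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-size layers."""
    per_layer = model_bytes / n_layers
    usable = max(vram_bytes - reserved_bytes, 0)
    return min(n_layers, int(usable // per_layer))


# Llama 3.1 8B uses 32 layers, 8 KV heads, and a head dimension of 128.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
print(cache / 2**20, "MiB")  # 512.0 MiB

# Hypothetical setup: a ~4.9 GB model file and 4 GB of VRAM, cache reserved first.
print(layers_that_fit(4.9e9, 32, 4e9, reserved_bytes=cache))
```

Doubling the context length doubles the cache, which is memory no longer available for offloaded layers.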

LM Studio auto-detects and configures your GPU. On NVIDIA, it uses CUDA. On AMD, it uses ROCm (Linux) or Vulkan (Windows). On Apple Silicon, it uses Metal, and even M2/M3/M4 MacBook Air machines run 7B models well.

Using the Local OpenAI-Compatible API

LM Studio’s local server is its most powerful feature for developers. Enable it:

Click Local Server in the left sidebar
Load a model
Click Start Server (default port: 1234)
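Once the server is running, a quick way to confirm it is reachable is to list the models it exposes. The sketch below uses only the standard library and assumes the default address `http://localhost:1234/v1` and the OpenAI-style `/models` endpoint. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def model_ids(payload: dict) -> List[str]:
    """Extract model identifiers from an OpenAI-style /models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:1234/v1",
                timeout: float = 5.0) -> List[str]:
    """Query the local server; raises URLError if it is not running."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# With the server started, call list_models() to see the available ids.
# The parser alone can be tried on a sample response (ids are made up):
sample = {"object": "list", "data": [{"id": "example-model-a"}, {"id": "example-model-b"}]}
print(model_ids(sample))
```

If `list_models()` raises a connection error, the server is most likely not started or is listening on a different port.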

Use with Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio doesn't require auth
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; if the request is rejected, use an id listed by GET /v1/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LUKS encryption in 3 sentences."}
    ]
)
print(response.choices[0].message.content)

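Most applications that accept a custom OpenAI-compatible base URL can point at LM Studio, including Obsidian AI plugins and custom scripts; editors such as Cursor may need extra setup because their requests are routed through the vendor's servers rather than sent from your machine. This is how you integrate local AI into your existing tools without sending data to OpenAI.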
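"OpenAI-compatible" here means plain HTTP: a JSON POST to `/v1/chat/completions`. The sketch below builds that request with only the standard library, which is useful when the `openai` package is not available. It assumes the default address and a placeholder model name, and the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:1234/v1",
                       model: str = "local-model") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = {
        "model": model,  # placeholder name, as in the example above
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


request = build_chat_request("Explain LUKS encryption in 3 sentences.")
print(request.full_url)
# With the server running: print(send(request))
```

Any tool that lets you change the host in this URL is effectively doing the same thing.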

LM Studio vs. Ollama

Both run local LLMs but with different approaches:

Feature	LM Studio	Ollama
UI	Desktop GUI	CLI
Model discovery	Built-in browser	Manual pull
API	OpenAI-compatible	Ollama API + OpenAI compat
Open WebUI support	Yes	Yes
Platforms	Win/Mac/Linux	Win/Mac/Linux
Best for	Beginners, GUI users	Power users, Docker deployments

Many users run both: LM Studio for exploring and testing models, Ollama for production deployments behind Open WebUI.
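Because both tools expose an OpenAI-compatible endpoint, switching between them in a script can be as small as changing the base URL. The sketch below assumes each tool's default port (1234 for LM Studio, 11434 for Ollama); the `BACKENDS` mapping and `base_url_for` helper are hypothetical.

```python
# Default OpenAI-compatible endpoints; adjust if you changed either port.
BACKENDS = {
    "lmstudio": "http://localhost:1234/v1",
    "ollama": "http://localhost:11434/v1",
}


def base_url_for(backend: str) -> str:
    """Return the base URL for a named local backend."""
    try:
        return BACKENDS[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None


print(base_url_for("lmstudio"))
print(base_url_for("ollama"))
```

Passing the returned value as `base_url` to the OpenAI client lets the same script target either tool, though the model name must match one the chosen backend actually serves.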

LM Studio substantially lowers the barrier to running private AI. If command-line complexity has kept you from setting up local LLMs, its GUI makes the process about as approachable as installing any other desktop app.