LM Studio Guide 2026: Run Local LLMs on Windows and Mac

Set up LM Studio to run local AI language models on your PC or Mac—download models, configure hardware acceleration, and use the OpenAI-compatible API.

LM Studio is one of the most approachable applications for running large language models locally on your own hardware. Unlike Ollama, which is command-line focused, LM Studio provides a polished desktop UI for discovering, downloading, and chatting with models, plus a local server that emulates the OpenAI API so existing tools and scripts work once you point them at a new base URL. This guide covers setup, model selection, and performance optimization.

What LM Studio Does

LM Studio handles everything in a single application:

  • Model discovery and download — searchable catalog of models from Hugging Face
  • Chat interface — ChatGPT-like UI for testing and using models
  • Local server — OpenAI-compatible API at localhost:1234
  • Hardware acceleration — automatic GPU acceleration via CUDA, Metal, ROCm, or Vulkan

Once a model is downloaded, inference runs entirely on your machine: no API keys, no cloud requests, and no conversation data leaving your computer.

System Requirements

LM Studio runs on Windows 10/11, macOS 12.6+, and Linux. The hardware requirements scale with the model you want to run:

| Model Size | RAM Needed | GPU VRAM Needed | Example Models |
| --- | --- | --- | --- |
| 1–4B params | 4 GB | 2 GB | Phi-3 Mini, Gemma 2 2B |
| 7–8B params | 8 GB | 6–8 GB | Mistral 7B, Llama 3.1 8B |
| 13–14B params | 16 GB | 10–12 GB | Qwen2.5 14B, Phi-4 |
| 30B+ params | 32 GB | 20+ GB | Qwen2.5 32B (quantized) |

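Models can also run entirely on CPU (RAM-only) without a GPU, which is significantly slower but functional. As a rough guide, a 7B model generates about 3–8 tokens/second on CPU and 30–80 tokens/second with GPU acceleration; actual speed depends on your hardware and the quantization level.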
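To make those throughput figures concrete, here is a small sketch that converts tokens per second into waiting time. The 500-token reply length and the function name are assumptions chosen for illustration; the speed ranges are the rough ones quoted above.

```python
def reply_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Return how long generating num_tokens takes at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPLY_TOKENS = 500  # hypothetical length of a long-ish answer

# Rough speed ranges for a 7B model, taken from the paragraph above.
SPEED_RANGES = {"CPU only": (3, 8), "GPU accelerated": (30, 80)}

for label, (slowest, fastest) in SPEED_RANGES.items():
    worst = reply_time_seconds(REPLY_TOKENS, slowest)
    best = reply_time_seconds(REPLY_TOKENS, fastest)
    print(f"{label}: between {best:.1f} and {worst:.1f} seconds")
```

In other words, a reply that takes a few seconds on a GPU can take a minute or more on CPU alone, which is worth knowing before choosing a model size.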

Installing LM Studio

Download the installer from https://lmstudio.ai, which offers a direct download for all three platforms.

  • Windows: run the installer .exe; no configuration is needed
  • macOS: drag the app to Applications
  • Linux: download the .AppImage, make it executable, and run it: chmod +x LM_Studio.AppImage && ./LM_Studio.AppImage (substitute the name of the file you downloaded)

Downloading Your First Model

  1. Open LM Studio → click the Search icon (magnifying glass) in the left sidebar
  2. Search for a model name, e.g., “Llama 3.1” or “Mistral”
  3. Browse the results — each model shows:
    • Parameter count (7B, 14B, etc.)
    • Quantization level (Q4_K_M, Q5_K_M, Q8_0, etc.)
    • File size
  4. Click the download button next to your chosen variant

Understanding Quantization

Models are available in different quantization levels that trade quality for smaller file size:

| Quantization | Quality | File Size (7B model) |
| --- | --- | --- |
| Q8_0 | Near-lossless | ~7 GB |
| Q5_K_M | Excellent | ~5 GB |
| Q4_K_M | Good (recommended default) | ~4 GB |
| Q3_K_M | Acceptable | ~3 GB |
| Q2_K | Degraded | ~2.5 GB |

For most uses, Q4_K_M offers the best balance. On limited VRAM, drop to Q3_K_M. If quality matters most and you have the space, use Q5_K_M.
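The file sizes above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below applies that rule and picks the highest-quality level that fits a memory budget. The bits-per-weight figures are approximations assumed for illustration (real GGUF files vary by architecture and carry some metadata), and the function names are hypothetical.

```python
# Approximate bits per weight for each quantization level.
# These are assumed round figures, not exact values for any given file.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}


def estimate_file_size_gb(params_billion: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a quantization level."""
    bits_per_weight = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def best_quant_for_budget(params_billion: float, budget_gb: float):
    """Return the highest-quality level whose estimate fits, or None."""
    by_quality = sorted(
        APPROX_BITS_PER_WEIGHT, key=APPROX_BITS_PER_WEIGHT.get, reverse=True
    )
    for quant in by_quality:
        if estimate_file_size_gb(params_billion, quant) <= budget_gb:
            return quant
    return None


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"7B at {quant}: about {estimate_file_size_gb(7, quant):.1f} GB")
print("Best fit for a 4.5 GB budget:", best_quant_for_budget(7, 4.5))
```

Treat the result as a starting point only: the model also needs room for the context window and runtime overhead, so leave headroom beyond the file size itself.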

General Chat and Reasoning

  • Llama 3.1 8B Instruct Q5_K_M — Meta’s excellent general-purpose model; great reasoning, good instruction following
  • Mistral 7B Instruct v0.3 Q5_K_M — Fast, efficient, excellent for structured output
  • Gemma 2 9B Instruct — Google’s Gemma 2 punches above its weight for a 9B model

Coding

  • Qwen2.5 Coder 7B Instruct — Alibaba’s coding-specialized model with strong Python, JS, and Go performance
  • DeepSeek-Coder-V2 Lite Instruct — Excellent code completion and explanation
  • Codestral Mamba 7B — Mistral’s coding model, fast and accurate

Longer Context / Complex Tasks

  • Phi-4 14B Q4_K_M — Microsoft’s Phi-4 model performs far above its 14B size
  • Qwen2.5 14B Instruct Q4_K_M: Strong all-round performance if you have 16 GB RAM (Llama 3.1 itself ships in 8B, 70B, and 405B sizes, with no 14B variant)

Loading and Chatting with a Model

  1. Click the Chat icon in the left sidebar
  2. Click Load a model or use the model selector dropdown at the top
  3. Select your downloaded model
  4. LM Studio loads it into RAM/VRAM (takes 5–30 seconds)
  5. Type your message and press Enter

The chat UI supports system prompts, conversation history, and temperature/top-p controls.
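The same three controls (system prompt, conversation history, and temperature/top-p) map directly onto the request body of an OpenAI-style chat completion, which is what LM Studio's local server accepts (see "Using the Local OpenAI-Compatible API" later in this guide). The following is a minimal sketch using only the standard library; the helper names and default sampling values are assumptions, and send_chat only works while the local server is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address


def build_chat_payload(system_prompt, history, user_message,
                       temperature=0.7, top_p=0.95, model="local-model"):
    """Assemble a chat request: system prompt, prior turns, then the new message."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,  # placeholder name, as in the guide's own example
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


history = [
    {"role": "user", "content": "What is quantization?"},
    {"role": "assistant", "content": "Storing weights with fewer bits."},
]
payload = build_chat_payload("Answer briefly.", history,
                             "Why does it shrink files?", temperature=0.2)
print(json.dumps(payload, indent=2))
```

Calling send_chat(payload) with the server running returns the assistant's reply; appending that reply to history before the next call is what gives the model its memory of the conversation.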

Hardware Acceleration Settings

Go to Settings (gear icon) → Performance:

  • GPU Offload Layers: How many transformer layers to run on the GPU. Set it to the maximum your VRAM allows; more layers on the GPU means faster generation
  • CPU Threads: For CPU-only inference, set this to your physical core count (not logical)
  • Context Length: The maximum number of tokens (prompt plus conversation history) the model can attend to at once. A larger context uses more RAM; 4096 is fine for most chats
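The reason a larger context costs memory is the key/value cache, which stores one key and one value vector per layer for every token in the window. A rough estimate, assuming 16-bit cache entries and the published shape of Llama 3.1 8B (32 layers, 8 key/value heads, head dimension 128), looks like this. The function name is hypothetical, and runtimes may add overhead or use a quantized cache.

```python
def kv_cache_bytes(context_length: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys plus values, for every layer and token."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    per_token = keys_and_values * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_length


# Assumed model shape for Llama 3.1 8B.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (4096, 8192, 32768):
    size_mib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 1024**2
    print(f"context {context}: about {size_mib:.0f} MiB of KV cache")
```

Under these assumptions a 4096-token context needs about 512 MiB on top of the model weights, and the cost grows linearly from there, so an eightfold larger context needs roughly 4 GiB.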

LM Studio auto-detects and configures your GPU. On NVIDIA, it uses CUDA. On AMD, it uses ROCm (Linux) or Vulkan (Windows). On Apple Silicon, it uses Metal for exceptional performance — even M2/M3/M4 MacBook Air chips run 7B models well.
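If you want a starting value for the GPU Offload Layers setting, a back-of-the-envelope estimate is to divide the usable VRAM by the average size of one layer. This sketch assumes all layers are roughly equal in size and reserves a fixed amount of VRAM for the context cache and the desktop; the function name and the 1.5 GB reserve are illustrative assumptions, not LM Studio's own logic.

```python
import math


def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: a 4.2 GB model file with 32 layers.
for vram in (4, 6, 8):
    fit = layers_that_fit(4.2, 32, vram)
    print(f"{vram} GB VRAM: try offloading about {fit} of 32 layers")
```

Use the estimate as a first guess, then raise or lower the setting depending on whether the model loads and how fast it generates.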

Using the Local OpenAI-Compatible API

LM Studio’s local server is its most powerful feature for developers. Enable it:

  1. Click Local Server in the left sidebar
  2. Load a model
  3. Click Start Server (default port: 1234)
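Before wiring up a client, it can help to confirm that the server is reachable. OpenAI-compatible servers, LM Studio included, list their available models at the /v1/models endpoint. The sketch below queries it with the standard library only; the function names are hypothetical and the default port from the steps above is assumed.

```python
import json
import urllib.request


def model_ids(models_response: dict) -> list:
    """Extract model identifiers from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_local_models(base_url: str = "http://localhost:1234/v1", timeout: float = 5):
    """Return the model ids the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            return model_ids(json.load(response))
    except OSError:  # covers refused connections, timeouts, and URL errors
        return None


# The response shape, shown with a made-up model id for illustration:
sample = {"object": "list", "data": [{"id": "example-model", "object": "model"}]}
print(model_ids(sample))
```

Calling list_local_models() returns None when nothing is listening on the port, which usually means the server has not been started yet.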

Use with Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio doesn't require auth
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; with a single model loaded, LM Studio typically answers with it
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LUKS encryption in 3 sentences."}
    ]
)
print(response.choices[0].message.content)

Any application that supports custom OpenAI API endpoints works with LM Studio — including Cursor, Obsidian AI plugins, and custom scripts. This is how you integrate local AI into your existing tools without sending data to OpenAI.

LM Studio vs. Ollama

Both run local LLMs but with different approaches:

| Feature | LM Studio | Ollama |
| --- | --- | --- |
| UI | Desktop GUI | CLI |
| Model discovery | Built-in browser | Manual pull |
| API | OpenAI-compatible | Ollama API + OpenAI compat |
| Open WebUI support | Yes | Yes |
| Platforms | Win/Mac/Linux | Win/Mac/Linux |
| Best for | Beginners, GUI users | Power users, Docker deployments |

Many users run both: LM Studio for exploring and testing models, Ollama for production deployments behind Open WebUI.
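Because both tools expose an OpenAI-compatible endpoint, switching a script between them can be as small as changing the base URL. Here is a minimal sketch, assuming each server runs on its default port (1234 for LM Studio, 11434 for Ollama); the helper name and the backend labels are hypothetical.

```python
# Default local endpoints; adjust if you changed either server's port.
BACKENDS = {
    "lmstudio": "http://localhost:1234/v1",
    "ollama": "http://localhost:11434/v1",
}


def client_settings(backend: str) -> dict:
    """Return keyword arguments for an OpenAI-style client for the given backend."""
    if backend not in BACKENDS:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown backend {backend!r}; expected one of: {known}")
    # Neither local server checks the key, but clients usually require a value.
    return {"base_url": BACKENDS[backend], "api_key": "not-needed"}


for name in BACKENDS:
    print(name, "->", client_settings(name)["base_url"])
```

With the openai package installed, OpenAI(**client_settings("ollama")) creates a client for Ollama, and swapping the argument to "lmstudio" targets LM Studio instead. The model name in each request still has to match one that the chosen server actually has available.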

LM Studio lowers the barrier to running private AI substantially — if you’ve been hesitant to set up local LLMs due to the command-line complexity, LM Studio’s GUI makes it as approachable as installing any desktop app.