The Hugging Face Hub is the largest open-source AI model repository on the internet, hosting over 900,000 models, 200,000 datasets, and thousands of interactive demo applications. Whether you want to run a vision model, a text classifier, or a full instruction-tuned LLM, the Hub is the place to start.
Navigating the Hub
Visit huggingface.co/models to browse the catalog. The search interface has powerful filters:
Filter by task:
- text-generation: language models for chat and completion
- text-classification: sentiment analysis, intent detection
- image-classification: CNNs and ViTs for visual tasks
- automatic-speech-recognition: Whisper and friends
- text-to-image: Stable Diffusion, FLUX
- feature-extraction: embedding models
Filter by library:
- transformers: the default HF library
- gguf: quantized models for llama.cpp/Ollama
- diffusers: image generation models
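The task and library filters can also be expressed as a URL, which is handy for bookmarking a search. The sketch below builds one with the standard library only. The query parameter names (pipeline_tag, library, sort) are assumptions based on the URLs the site appears to produce when filters are clicked, so check them against your browser's address bar.

```python
from urllib.parse import urlencode

HUB_MODELS_URL = "https://huggingface.co/models"


def hub_search_url(task=None, library=None, sort=None):
    """Build a filtered model-search URL.

    The parameter names used here are assumptions, not a documented API.
    """
    params = {}
    if task:
        params["pipeline_tag"] = task
    if library:
        params["library"] = library
    if sort:
        params["sort"] = sort
    if not params:
        return HUB_MODELS_URL
    return f"{HUB_MODELS_URL}?{urlencode(params)}"


# Text-generation models available as GGUF, most downloaded first
print(hub_search_url(task="text-generation", library="gguf", sort="downloads"))
```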
Reading Model Cards
Every model has a model card — a README that describes training data, intended use, limitations, and benchmark results. Before downloading, check:
- License — can you use it commercially?
- Model size — will it fit in your VRAM?
- Evaluation results — how does it perform on standard benchmarks?
- Tags — what tasks and frameworks are supported?
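The VRAM question in the checklist above comes down to simple arithmetic: parameter count times bytes per parameter, plus some headroom. The following is a rough sketch of that estimate. The 20% overhead for activations and KV cache is an assumed placeholder, not a measured figure, and real usage depends on context length and batch size.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_vram(num_params, bits_per_param, vram_gb, overhead_fraction=0.2):
    """Rough check: weights plus an assumed overhead must fit in vram_gb."""
    needed = weight_memory_gb(num_params, bits_per_param) * (1 + overhead_fraction)
    return needed <= vram_gb


# A 7B-parameter model stored in 16-bit precision
print(weight_memory_gb(7e9, 16))            # 14.0
print(fits_in_vram(7e9, 16, vram_gb=24))    # True
print(fits_in_vram(7e9, 16, vram_gb=12))    # False
# The same model quantized to roughly 4 bits per weight
print(fits_in_vram(7e9, 4, vram_gb=8))      # True
```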
Installing the Hugging Face Python Libraries
# accelerate is required by the device_map="auto" examples later in this guide
pip install huggingface-hub transformers torch accelerate
For GPU acceleration with NVIDIA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Downloading Models
Using huggingface-cli
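The command-line tool is the fastest way to download individual files or full repositories. Recent huggingface_hub releases also ship the same commands under a shorter hf entry point; if huggingface-cli prints a deprecation warning on your version, hf download and hf auth login are the equivalents of the commands shown here.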
# Install the CLI (included with huggingface-hub)
pip install huggingface-hub
# Download an entire model repository
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3
# Download a specific file (e.g., a GGUF quantized model)
huggingface-cli download \
    bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir ./models
# Download with authentication (for gated models like Llama 3)
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
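It can help to know what a single-file download addresses. Each file in a model repo is reachable at a "resolve" URL made of the repo ID, a revision, and the file path. The sketch below rebuilds that URL with the standard library; the function name resolve_url is hypothetical, and the pattern is assumed to match what the Hub serves for model repos rather than taken from the library's source.

```python
from urllib.parse import quote


def resolve_url(repo_id, filename, revision="main"):
    """Return the assumed download URL for one file in a model repo."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


print(resolve_url(
    "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
))
```

The CLI adds caching, resumable transfers, and authentication on top of this, so prefer it over fetching the URL by hand.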
Using the Python API
from huggingface_hub import hf_hub_download, snapshot_download
# Download a single file
model_path = hf_hub_download(
    repo_id="bartowski/Phi-3-mini-4k-instruct-GGUF",
    filename="Phi-3-mini-4k-instruct-Q4_K_M.gguf",
    local_dir="./models"
)
# Download entire repository
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="./mistral-7b"
)
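When you do not pass local_dir, downloads are cached at ~/.cache/huggingface/hub by default. Set the HF_HOME environment variable to move the whole Hugging Face directory (the cache then lives at $HF_HOME/hub), or set HF_HUB_CACHE to move only the hub cache.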
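To see where a download will land, the lookup order can be sketched in a few lines. This is an illustration of the documented precedence as I understand it (HF_HUB_CACHE, then HF_HOME, then XDG_CACHE_HOME, then ~/.cache), not the library's own code; the helper names are hypothetical.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where hub downloads are assumed to be cached."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(cache_root).expanduser() / "huggingface" / "hub"


def repo_folder_name(repo_id, repo_type="model"):
    """Cache folders are named like models--<org>--<name>."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


cache_dir = resolve_hub_cache()
print(cache_dir / repo_folder_name("mistralai/Mistral-7B-Instruct-v0.3"))
```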
Running Inference with Transformers Pipeline
The pipeline abstraction handles tokenization, model loading, and post-processing in one call:
from transformers import pipeline
# Text generation
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",  # Place the model on a GPU if one is available (requires accelerate)
    torch_dtype="auto"  # Use the dtype stored in the checkpoint
)
result = generator(
    "Explain gradient descent in plain English",
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7
)
print(result[0]["generated_text"])
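The text-generation pipeline returns a list of dicts, and by default generated_text contains the prompt followed by the completion. The helper below is a small sketch for keeping only the completion; it runs on a hand-written stand-in for the pipeline output, so the sample text is hypothetical.

```python
def completion_only(prompt, pipeline_output):
    """Strip the echoed prompt from the first generated sequence."""
    text = pipeline_output[0]["generated_text"]
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    return text


# Hypothetical output with the same shape the pipeline returns
prompt = "Explain gradient descent"
fake_output = [{"generated_text": "Explain gradient descent\nIt is an optimization method."}]

print(completion_only(prompt, fake_output))  # It is an optimization method.
```

Passing return_full_text=False to the generator call should achieve the same result without post-processing.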
Chat-Formatted Inference
For instruction-tuned models, use the chat template:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))
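Conceptually, a chat template turns the list of messages into the single prompt string the model was trained on. The sketch below mimics a simplified [INST]-style layout to show the idea. It is an illustration only: the real template ships with each model's tokenizer and may differ in whitespace and special tokens, so always use apply_chat_template for actual inference.

```python
def render_chat(messages):
    """Render messages into a simplified, hypothetical [INST]-style prompt."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append(f"[INST] {content} [/INST]")
        elif role == "assistant":
            parts.append(f" {content}</s>")
        else:
            raise ValueError(f"unsupported role: {role}")
    return "<s>" + "".join(parts)


messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
print(render_chat(messages))  # <s>[INST] What is the capital of France? [/INST]
```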
GGUF Models for llama.cpp
GGUF is the single-file model format used by llama.cpp, Ollama, LM Studio, and llamafile, and it is the usual way quantized weights are distributed. On the Hub, search for repos with the gguf tag or look for community quantizers such as bartowski, TheBloke, and unsloth.
# Download a GGUF file for use with llama.cpp
huggingface-cli download \
    bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --local-dir ./models
# Run it with llama.cpp's CLI, pointing -m at the downloaded file
./llama-cli \
    -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -p "Write a Python function to parse JSON" \
    -n 300
Common GGUF quantizations and their tradeoffs:
| Format | Size (8B model) | Quality | Use Case |
|---|---|---|---|
| Q4_K_M | ~4.9GB | Good | Best balance for 8GB VRAM |
| Q5_K_M | ~5.7GB | Better | 10-12GB VRAM sweet spot |
| Q8_0 | ~8.5GB | Near-lossless | Max quality, 16GB VRAM |
| IQ3_M | ~3.5GB | Acceptable | Low VRAM or RAM constraints |
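The table can be turned into a simple selection rule: pick the largest file that still leaves room for the KV cache and runtime buffers. The sketch below uses the sizes from the table and assumes that about 65% of VRAM is available for weights. That fraction is a guess chosen to agree with the table's recommendations, not a measured value.

```python
# File sizes in GB for an 8B model, copied from the table above
QUANT_SIZES_GB = {"IQ3_M": 3.5, "Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5}


def pick_quant(vram_gb, usable_fraction=0.65):
    """Return the largest quantization that fits the assumed weight budget."""
    budget = vram_gb * usable_fraction
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None  # Nothing fits; consider CPU offload or a smaller model
    return max(fitting, key=fitting.get)


print(pick_quant(6))   # IQ3_M
print(pick_quant(8))   # Q4_K_M
print(pick_quant(12))  # Q5_K_M
print(pick_quant(16))  # Q8_0
```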
Model Licensing
Licensing varies significantly across Hub models. Always verify before commercial use:
Permissive licenses:
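- Apache 2.0: commercial use, modification, and distribution allowed, with attribution and a patent grant (Mistral 7B, Falcon)
- MIT: very permissive; only the copyright and license notice must be kept (Phi-3-mini)
Restrictive licenses:
- Llama Community License: requires attribution, and organizations with more than 700 million monthly active users must obtain a separate license from Meta
- Gemma Terms of Use: commercial use is allowed, but Google's use restrictions must be passed on when distributing the model or fine-tuned derivatives
- CC BY-NC 4.0: non-commercial use only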
Filter by license on the Hub using the “License” dropdown. Filtering on apache-2.0 or mit returns models whose licenses permit commercial use; still read each model card, because the license tag is set by the uploader.
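When screening many candidates, the grouping above can be encoded as a small lookup keyed on the license tag shown on each model page. This is a sketch of the categories described in this section; the tag spellings are the identifiers I believe the Hub uses, and anything not listed is deliberately reported as unknown.

```python
COMMERCIAL_OK = {"apache-2.0", "mit"}
CUSTOM_TERMS = {"llama3", "llama3.1", "gemma"}
NON_COMMERCIAL = {"cc-by-nc-4.0"}


def license_status(license_tag):
    """Map a license tag to the coarse categories used in this guide."""
    tag = license_tag.strip().lower()
    if tag in COMMERCIAL_OK:
        return "permissive"
    if tag in CUSTOM_TERMS:
        return "custom terms: review before commercial use"
    if tag in NON_COMMERCIAL:
        return "non-commercial only"
    return "unknown: read the model card"


for tag in ["apache-2.0", "llama3", "cc-by-nc-4.0", "other"]:
    print(tag, "->", license_status(tag))
```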
Hugging Face Spaces
Spaces are free hosted demo applications built with Gradio or Streamlit. Before downloading a multi-gigabyte model, try it in a Space:
- Search the model on the Hub
- Check the “Spaces using this model” section on the model card
- Interact with the model in-browser — no download required
Popular Spaces include hosted chat interfaces, image generation demos, and ASR transcription tools. You can also host your own Space for free on basic CPU hardware, with limited access to shared GPUs, or pay to upgrade to dedicated GPU hardware.
Gated Models and Authentication
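Some models (for example Llama 3, Gemma, and Falcon-180B) are gated: you must accept the license agreement on the model's Hub page while signed in before any download will succeed. Once access is granted, authenticate and download: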
# Login with your HF token
huggingface-cli login
# Enter your token from huggingface.co/settings/tokens
# Then download normally
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
Generate your token at huggingface.co/settings/tokens with at least “Read” scope.
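In scripts and CI jobs, interactive login is often impractical. The Hugging Face libraries read a token from the HF_TOKEN environment variable, and Hub requests carry it as a bearer token. The sketch below shows that idea with the standard library only; auth_headers is a hypothetical helper, and the hf_ prefix check is a sanity check based on how user access tokens usually look.

```python
import os


def auth_headers(env=None):
    """Build an Authorization header from HF_TOKEN, or return {} if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return {}
    if not token.startswith("hf_"):
        raise ValueError("HF_TOKEN does not look like a Hugging Face access token")
    return {"Authorization": f"Bearer {token}"}


# Placeholder token for illustration; never hard-code a real one
print(auth_headers({"HF_TOKEN": "hf_example"}))
```

Keeping the token in the environment rather than in source code also makes it easier to rotate.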