AI Tools #hugging-face #transformers #model-hub

Hugging Face Hub: Finding, Downloading, and Running Models

Navigate Hugging Face Hub, download models with huggingface-cli, run local inference with Transformers, and use GGUF models with llama.cpp.


Hugging Face Hub is the largest open-source AI model repository on the internet, hosting over 900,000 models, 200,000 datasets, and thousands of interactive demo applications. Whether you want to run a vision model, a text classifier, or a full instruction-tuned LLM, the Hub is where you start.

Visit huggingface.co/models to browse the catalog. The search interface has powerful filters:

Filter by task:

  • text-generation — language models for chat and completion
  • text-classification — sentiment analysis, intent detection
  • image-classification — CNNs and ViTs for visual tasks
  • automatic-speech-recognition — Whisper and friends
  • text-to-image — Stable Diffusion, FLUX
  • feature-extraction — embedding models

Filter by library:

  • transformers — the default HF library
  • gguf — quantized models for llama.cpp/Ollama
  • diffusers — image generation models
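The same filters can be applied from Python. The sketch below assumes the huggingface_hub package is installed and that task and library names can be passed as tags to HfApi.list_models(filter=...); hub_filter_tags and top_models are hypothetical helper names, not part of the library.

```python
from typing import List, Optional


def hub_filter_tags(task: Optional[str] = None, library: Optional[str] = None) -> List[str]:
    """Collect the chosen task and library names into a list of Hub tags."""
    return [tag for tag in (task, library) if tag]


def top_models(task: Optional[str] = None, library: Optional[str] = None, limit: int = 5) -> List[str]:
    """Return up to `limit` matching model ids (requires network access)."""
    # Imported lazily so hub_filter_tags stays usable without the package installed.
    from huggingface_hub import HfApi

    models = HfApi().list_models(filter=hub_filter_tags(task, library), limit=limit)
    return [m.id for m in models]


print(hub_filter_tags("text-generation", "gguf"))
```

Calling top_models("text-generation", "gguf") would then list a handful of GGUF text-generation repos, which is roughly what ticking both filters in the web interface does.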

Reading Model Cards

Every model has a model card — a README that describes training data, intended use, limitations, and benchmark results. Before downloading, check:

  1. License — can you use it commercially?
  2. Model size — will it fit in your VRAM?
  3. Evaluation results — how does it perform on standard benchmarks?
  4. Tags — what tasks and frameworks are supported?
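For the model-size check, a back-of-the-envelope estimate is often enough. The sketch below assumes memory is dominated by the weights (parameter count times bytes per parameter) and ignores activations and the KV cache, so treat the result as a lower bound; weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Rough footprint of a 7B-parameter model at common precisions.
for label, bits in [("float32", 32), ("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model in {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

By this estimate a 7B model in bfloat16 needs about 14 GB for weights alone, which is why quantized variants are popular on consumer GPUs.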

Installing the Hugging Face Python Libraries

# accelerate is needed for the device_map="auto" option used in the examples below
pip install huggingface-hub transformers torch accelerate

For GPU acceleration with NVIDIA:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
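After installing, it can help to confirm which device PyTorch will actually use. This is a minimal sketch; describe_device is a hypothetical helper, and it reports "torch-missing" instead of failing when PyTorch is not installed.

```python
def describe_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', or 'torch-missing' if PyTorch is absent."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend (Apple Silicon) only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(describe_device())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build.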

Downloading Models

Using huggingface-cli

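The command-line tool is a quick way to download individual files or full repositories. Recent releases of huggingface_hub also install a shorter hf command (for example, hf download) intended to replace huggingface-cli; the commands below use the older name.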

# Install the CLI (included with huggingface-hub)
pip install huggingface-hub

# Download an entire model repository
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3

# Download a specific file (e.g., a GGUF quantized model)
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Download with authentication (for gated models like Llama 3)
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct

Using the Python API

from huggingface_hub import hf_hub_download, snapshot_download

# Download a single file
model_path = hf_hub_download(
    repo_id="bartowski/Phi-3-mini-4k-instruct-GGUF",
    filename="Phi-3-mini-4k-instruct-Q4_K_M.gguf",
    local_dir="./models"
)

# Download entire repository
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="./mistral-7b"
)
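GGUF repositories often hold many quantizations, so downloading the whole repository can pull tens of gigabytes. snapshot_download accepts an allow_patterns argument with shell-style wildcards to limit what is fetched. The sketch below previews which files a pattern list would select; it uses Python's fnmatch as an approximation of the library's matching, and the file listing is made up for illustration.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def preview_allow_patterns(filenames: Iterable[str], allow_patterns: List[str]) -> List[str]:
    """Return the filenames that match at least one shell-style pattern."""
    return [name for name in filenames if any(fnmatch(name, p) for p in allow_patterns)]


# Hypothetical file listing for a GGUF repository.
repo_files = [
    "README.md",
    "model-Q4_K_M.gguf",
    "model-Q5_K_M.gguf",
    "model-Q8_0.gguf",
]
patterns = ["*Q4_K_M.gguf", "README.md"]
print(preview_allow_patterns(repo_files, patterns))

# With network access, the same patterns could be passed to the real call:
# snapshot_download(
#     repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
#     allow_patterns=patterns,
#     local_dir="./models",
# )
```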

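By default, downloads are cached under ~/.cache/huggingface/hub. Set the HF_HOME environment variable to move the whole Hugging Face directory (the cache then lives in $HF_HOME/hub), or set HF_HUB_CACHE to move only the model cache. When you pass local_dir, as in the examples above, the files are written to that directory instead.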
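To see where files will land on a given machine, the lookup can be sketched as follows. This is a simplified model of the precedence huggingface_hub documents (HF_HUB_CACHE first, then HF_HOME/hub, then the default); it deliberately ignores XDG_CACHE_HOME, and resolve_hub_cache is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_hub_cache(env: Mapping[str, str] = os.environ) -> Path:
    """Return the directory where Hub downloads would be cached."""
    # An explicit cache override wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache is the "hub" folder inside the Hugging Face home.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache())
print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
```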

Running Inference with Transformers Pipeline

The pipeline abstraction handles tokenization, model loading, and post-processing in one call:

from transformers import pipeline

# Text generation
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",  # Automatically use GPU if available
    torch_dtype="auto"
)

result = generator(
    "Explain gradient descent in plain English",
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7
)
print(result[0]["generated_text"])
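For a plain string prompt, the text-generation pipeline returns the prompt followed by the continuation in generated_text. The sketch below shows one way to keep only the continuation; completion_only is a hypothetical helper and the result shown is a made-up stand-in for real pipeline output.

```python
from typing import Dict, List


def completion_only(prompt: str, result: List[Dict[str, str]]) -> str:
    """Strip the echoed prompt from the first generated sequence."""
    text = result[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.lstrip()


# Hypothetical pipeline output, shaped like a list with one dict per sequence.
prompt = "Explain gradient descent in plain English"
fake_result = [{"generated_text": prompt + "\n\nGradient descent is ..."}]
print(completion_only(prompt, fake_result))
```

Passing return_full_text=False when calling the pipeline has a similar effect without any post-processing.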

Chat-Formatted Inference

For instruction-tuned models, use the chat template:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

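inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask together
    return_tensors="pt"
).to(model.device)

prompt_length = inputs["input_ids"].shape[1]
output = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))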
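The final line of the example decodes only part of the output because, for decoder-only models, generate() returns the prompt tokens followed by the new ones. The slicing can be illustrated with plain lists; the token ids below are invented and new_tokens_only is a hypothetical helper.

```python
from typing import List


def new_tokens_only(output_ids: List[int], prompt_length: int) -> List[int]:
    """Drop the first `prompt_length` ids, keeping only the generated ones."""
    return output_ids[prompt_length:]


prompt_ids = [1, 3, 1724, 349, 4]          # made-up ids for the formatted prompt
output_ids = prompt_ids + [415, 5565, 2]   # prompt ids followed by generated ids
print(new_tokens_only(output_ids, len(prompt_ids)))
```

Without the slice, the decoded string would repeat the user's question before the answer.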

GGUF Models for llama.cpp

GGUF is the single-file model format used by llama.cpp, Ollama, LM Studio, and llamafile; most GGUF files on the Hub hold quantized weights. Search for repos with the gguf tag or look for community quantizers such as bartowski, TheBloke, and unsloth.

# Download a GGUF file into ./models for use with llama.cpp
# (without --local-dir the file goes to the Hub cache instead)
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Run it with llama.cpp's CLI
./llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Write a Python function to parse JSON" \
  -n 300

Common GGUF quantizations and their tradeoffs:

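| Format | Size (8B model) | Quality | Use case |
|--------|-----------------|---------|----------|
| Q4_K_M | ~4.9 GB | Good | Best balance for 8 GB VRAM |
| Q5_K_M | ~5.7 GB | Better | 10-12 GB VRAM sweet spot |
| Q8_0 | ~8.5 GB | Near-lossless | Max quality, 16 GB VRAM |
| IQ3_M | ~3.5 GB | Acceptable | Low VRAM or RAM constraints |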
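The table can be turned into a simple selection rule: pick the largest file that fits after reserving room for the KV cache and runtime overhead. The sketch below uses the 8B sizes from the table; the 30% headroom figure is an assumption chosen for illustration, not a measured value, and pick_quant is a hypothetical helper.

```python
from typing import Optional

# Approximate file sizes in GB for an 8B model, taken from the table above.
QUANT_SIZES_GB = {"IQ3_M": 3.5, "Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5}


def pick_quant(vram_gb: float, headroom: float = 0.30) -> Optional[str]:
    """Return the largest quantization that fits, or None if none does."""
    budget = vram_gb * (1.0 - headroom)  # memory left for the weights
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for vram in (4, 6, 8, 12, 16):
    print(f"{vram} GB -> {pick_quant(vram)}")
```

Long context windows need more headroom than short prompts, so the headroom argument is worth raising when the model is used for large inputs.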

Model Licensing

Licensing varies significantly across Hub models. Always verify before commercial use:

Permissive licenses:

  • Apache 2.0 — commercial use, modification, distribution allowed (Mistral 7B, Falcon)
  • MIT — maximum permissiveness (Phi-3-mini)

Restrictive licenses:

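  • Llama Community License — requires attribution; organizations with more than 700 million monthly active users must request a separate license from Meta
  • Gemma Terms of Use — includes a prohibited-use policy whose restrictions must be passed on when you distribute the model or derivatives such as fine-tunes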
  • CC BY-NC-4.0 — non-commercial use only

Filter by license on the Hub using the “License” dropdown. Selecting apache-2.0 or mit narrows the results to models whose repository declares a permissive license.
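When screening many candidates, the grouping above can be encoded as a first-pass check. The sketch below assumes the license identifiers shown (apache-2.0, mit, llama3, llama3.1, gemma, cc-by-nc-4.0) match the tags used on model cards; it is a triage aid, not legal advice, and commercial_use_hint is a hypothetical helper.

```python
# Tag groupings mirror the lists above; the exact tag spellings are an assumption.
PERMISSIVE = {"apache-2.0", "mit"}
RESTRICTED = {"llama3", "llama3.1", "gemma", "cc-by-nc-4.0"}


def commercial_use_hint(license_tag: str) -> str:
    """Give a rough first-pass label for a license tag."""
    tag = license_tag.strip().lower()
    if tag in PERMISSIVE:
        return "permissive"
    if tag in RESTRICTED:
        return "restricted: read the terms"
    return "unknown: read the license"


for tag in ("apache-2.0", "gemma", "openrail"):
    print(tag, "->", commercial_use_hint(tag))
```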

Hugging Face Spaces

Spaces are free hosted demo applications built with Gradio or Streamlit. Before downloading a multi-gigabyte model, try it in a Space:

  1. Search the model on the Hub
  2. Check the “Spaces using this model” section on the model card
  3. Interact with the model in-browser — no download required

Popular Spaces include hosted chat interfaces, image generation demos, and ASR transcription tools. You can also host your own Space for free (with limited GPU quotas) or upgrade to persistent GPU instances.

Gated Models and Authentication

Some models (Llama 3 and Gemma, for example) are gated: you must accept the license agreement on the model page while signed in, then authenticate locally before downloading:

# Login with your HF token
huggingface-cli login
# Enter your token from huggingface.co/settings/tokens

# Then download normally
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct

Generate your token at huggingface.co/settings/tokens with at least “Read” scope.
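In scripts and CI jobs an interactive login is impractical, and the token is usually supplied through the HF_TOKEN environment variable instead. The sketch below reads it and passes it explicitly to snapshot_download; huggingface_hub also picks up HF_TOKEN on its own, so the explicit argument is only there to make the flow visible. token_from_env and download_gated are hypothetical helpers.

```python
import os
from typing import Mapping, Optional


def token_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def download_gated(repo_id: str, local_dir: str) -> str:
    """Download a gated repository (requires network access and an accepted license)."""
    # Imported lazily so token_from_env works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir, token=token_from_env())


print("token configured:", token_from_env() is not None)
```

Keeping the token in an environment variable or secret store, rather than in source code, avoids leaking it through version control.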
