LlamaIndex (formerly GPT Index) is one of the most widely used frameworks for building Retrieval-Augmented Generation (RAG) applications. While LangChain is a general-purpose LLM framework, LlamaIndex specializes in connecting LLMs to data, which often makes it the better choice when your primary goal is a system that answers questions from your documents, databases, or APIs.
LlamaIndex vs. LangChain for RAG
| Aspect | LlamaIndex | LangChain |
|---|---|---|
| Primary focus | Data + retrieval | General LLM orchestration |
| RAG quality | More fine-grained control | Good but less specialized |
| Learning curve | Moderate | Steeper |
| Data connectors | 160+ built-in | Fewer native |
| Advanced retrieval | Many built-in strategies | Manual implementation |
For pure RAG use cases, LlamaIndex provides more built-in options for chunking strategies, retrieval algorithms, and query transformations.
Installation
pip install llama-index
pip install llama-index-llms-openai
pip install llama-index-llms-ollama # for local models
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-ollama # for local embeddings
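pip install llama-index-vector-stores-chroma
pip install chromadb
pip install llama-index-readers-web llama-index-readers-database # for the URL and SQL readers
pip install llama-index-retrievers-bm25 # for hybrid search
pip install sentence-transformers # for the cross-encoder reranker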
Core Concepts
LlamaIndex is built around a few key abstractions:
- Documents — Raw content from files, databases, or APIs
- Nodes — Chunks of documents after splitting
- Index — A data structure optimized for retrieval (usually a vector index)
- Retriever — Fetches relevant nodes for a query
- Query Engine — Combines retriever + LLM to answer questions
- Chat Engine — A query engine with conversation memory
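To make these abstractions concrete, here is a minimal pure-Python sketch of how they relate to each other. It is a toy model rather than LlamaIndex's actual classes: `ToyNode`, `split_into_nodes`, `ToyRetriever`, and `toy_query_engine` are hypothetical names, the word-overlap scoring stands in for embedding similarity, and the "LLM" step only assembles the prompt that would be sent.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """A chunk of a source document (hypothetical stand-in for a Node)."""
    doc_id: str
    text: str


def split_into_nodes(doc_id: str, text: str, words_per_node: int = 8) -> list[ToyNode]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [
        ToyNode(doc_id, " ".join(words[i:i + words_per_node]))
        for i in range(0, len(words), words_per_node)
    ]


class ToyRetriever:
    """Ranks nodes by word overlap with the query (stand-in for vector search)."""

    def __init__(self, nodes: list[ToyNode]):
        self.nodes = nodes

    def retrieve(self, query: str, top_k: int = 2) -> list[ToyNode]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.nodes,
            key=lambda n: len(query_words & set(n.text.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def toy_query_engine(retriever: ToyRetriever, query: str) -> str:
    """Retriever + LLM; here the LLM call is replaced by building the prompt."""
    context = "\n".join(node.text for node in retriever.retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


nodes = split_into_nodes(
    "guide",
    "The login service issues tokens after password checks. "
    "The billing service stores invoices in a separate database.",
)
retriever = ToyRetriever(nodes)
top_nodes = retriever.retrieve("where are invoices stored", top_k=1)
prompt = toy_query_engine(retriever, "where are invoices stored")
print(prompt)
```

A chat engine would add one more piece on top of this: a list of previous messages that is consulted before retrieval.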
Basic Setup: Global Settings
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure default LLM and embedding model globally
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 1024
Settings.chunk_overlap = 200
# Alternative: local models via Ollama (these assignments replace the OpenAI settings above)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
Settings.llm = Ollama(model="llama3.2", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
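The chunk_size and chunk_overlap settings control how documents are cut into nodes. The sketch below illustrates the idea of overlapping windows only. It is an assumption-laden simplification: `chunk_words` is a hypothetical helper that counts whitespace-separated words, whereas LlamaIndex measures chunk sizes in tokens and tries to respect sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows that share `chunk_overlap` words with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
chunks = chunk_words(text, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. Larger overlaps improve that coverage at the cost of more chunks to embed and store.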
Loading Documents
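LlamaIndex has 160+ data connectors (called “Readers”) for different sources. SimpleDirectoryReader ships with the core package; most other readers are installed as separate packages: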
from llama_index.core import SimpleDirectoryReader
# Load from a folder — handles PDF, Word, TXT, HTML, CSV, code files
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True,
    required_exts=[".pdf", ".txt", ".md"],
).load_data()
print(f"Loaded {len(documents)} documents")
# Load specific files instead of a whole folder
# Single PDF
docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
# From a URL (requires: pip install llama-index-readers-web)
from llama_index.readers.web import SimpleWebPageReader
web_docs = SimpleWebPageReader(html_to_text=True).load_data(
    urls=["https://docs.example.com/api"]
)
# From a database (SQL) (requires: pip install llama-index-readers-database)
from llama_index.readers.database import DatabaseReader
db_reader = DatabaseReader(
    scheme="postgresql",
    host="localhost",
    port=5432,
    user="postgres",
    password="secret",  # placeholder; load real credentials from the environment
    dbname="mydb",
)
db_docs = db_reader.load_data(query="SELECT title, content FROM articles")
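To see what the input_dir, recursive, and required_exts options amount to, the following standard-library sketch mimics the file-selection step of a directory reader. `find_files` is a hypothetical helper written for illustration and does not reproduce SimpleDirectoryReader's internals; it only shows the filtering logic, using a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_files(input_dir: str, required_exts: list[str], recursive: bool = True) -> list[Path]:
    """Return files under input_dir whose suffix is listed in required_exts."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = set(required_exts)
    return sorted(p for p in candidates if p.is_file() and p.suffix in wanted)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.md").write_text("# Notes")
    (root / "image.png").write_bytes(b"")
    (root / "sub").mkdir()
    (root / "sub" / "report.txt").write_text("Quarterly report")

    found_recursive = [p.name for p in find_files(tmp, [".md", ".txt"])]
    found_flat = [p.name for p in find_files(tmp, [".md", ".txt"], recursive=False)]

print(found_recursive)
print(found_flat)
```

Restricting extensions up front is usually worthwhile: it keeps binary or irrelevant files out of the index, which saves embedding cost and reduces retrieval noise.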
Building a Vector Index
The VectorStoreIndex is the most common index type — it converts documents to embeddings and enables semantic search:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
# Persistent ChromaDB storage
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build index (or load existing)
if collection.count() == 0:
    # First time: embed the documents and write them to Chroma
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )
else:
    # Load existing index; from_vector_store builds its own storage context
    index = VectorStoreIndex.from_vector_store(vector_store)
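Under the hood, semantic search over a vector index boils down to comparing the query embedding against each stored chunk embedding and keeping the closest matches. The sketch below shows that step with cosine similarity. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds or thousands of dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers, not LlamaIndex or Chroma functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int):
    """Score every stored chunk against the query and return the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made 3-dimensional "embeddings" (invented for this example).
indexed_chunks = [
    ("Reset your password from the login page.", [0.9, 0.1, 0.0]),
    ("Invoices are emailed on the first of the month.", [0.1, 0.9, 0.1]),
    ("Two-factor authentication protects your account.", [0.8, 0.0, 0.3]),
]
query_embedding = [1.0, 0.0, 0.1]  # pretend this encodes "How do I secure my login?"

results = top_k(query_embedding, indexed_chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A vector store such as Chroma performs the same comparison, but with approximate nearest-neighbor indexes so that it does not have to scan every vector.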
Query Engines
A query engine combines retrieval and generation:
# Basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,  # retrieve the 5 most relevant chunks
    response_mode="compact",  # pack retrieved chunks into as few prompts as possible
)
response = query_engine.query("What are the main security vulnerabilities found?")
print(response.response)
# Access the source nodes (which chunks were used)
for node in response.source_nodes:
    print(f"Score: {node.score:.3f} | File: {node.metadata.get('file_name')}")
    print(node.text[:200])
    print("---")
Response Modes
| Mode | Description | Use When |
|---|---|---|
| compact | Packs as many chunks as fit into each prompt, refining across prompts if needed | Short docs, fast responses |
| refine | Makes one LLM call per chunk, refining the answer each time | Long docs needing full coverage |
| tree_summarize | Summarizes chunks hierarchically, bottom-up | Large document sets |
| accumulate | Runs the query against each chunk and concatenates the answers | Asking the same question of every chunk |
| no_text | Returns only source nodes, with no LLM call | When you want raw retrieval |
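The practical difference between compact and refine is mostly the number of LLM calls. The sketch below estimates that difference under simplifying assumptions: `pack_chunks` is a hypothetical helper, word counts stand in for token counts, and the 70-word budget is an arbitrary stand-in for the model's context window.

```python
def pack_chunks(chunks: list[str], max_words_per_prompt: int) -> list[str]:
    """Greedily concatenate chunks so each prompt stays within the word budget."""
    prompts, current, current_len = [], [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if current and current_len + n > max_words_per_prompt:
            prompts.append("\n".join(current))
            current, current_len = [], 0
        current.append(chunk)
        current_len += n
    if current:
        prompts.append("\n".join(current))
    return prompts


# Four retrieved chunks of 30 words each.
chunks = [" ".join([word] * 30) for word in ("alpha", "beta", "gamma", "delta")]

refine_calls = len(chunks)                     # refine: one call per retrieved chunk
compact_calls = len(pack_chunks(chunks, 70))   # compact: one call per packed prompt
print(refine_calls, compact_calls)
```

With a realistic context window, the retrieved chunks often fit into a single prompt, which is why compact is usually the faster and cheaper option.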
Chat Engine: Conversational RAG
# Stateful chat over your documents
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=5,
    verbose=True,
)
# Multi-turn conversation
response1 = chat_engine.chat("What vulnerabilities were found in the API layer?")
print(response1.response)
response2 = chat_engine.chat("Which of those would be easiest to exploit?")
print(response2.response)
# Chat history is maintained automatically
chat_engine.reset() # clear history to start fresh
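A follow-up such as "Which of those would be easiest to exploit?" retrieves poorly on its own, because "those" carries no searchable meaning. The condense step of condense_plus_context addresses this by rewriting the follow-up into a standalone question before retrieval. The sketch below illustrates the idea only: `build_condense_prompt` and `fake_llm` are hypothetical, the prompt wording is not LlamaIndex's actual template, and the assistant reply in the history is invented.

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Build a prompt asking an LLM to rewrite a follow-up as a standalone question."""
    transcript = "\n".join(f"{role}: {message}" for role, message in history)
    return (
        "Given the conversation below, rewrite the follow-up message as a "
        "standalone question.\n\n"
        f"{transcript}\n\nFollow-up: {follow_up}\nStandalone question:"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned rewrite."""
    return "Which API-layer vulnerabilities would be easiest to exploit?"


history = [
    ("user", "What vulnerabilities were found in the API layer?"),
    # Invented assistant reply, for illustration only.
    ("assistant", "The audit found missing rate limiting and verbose error messages."),
]
follow_up = "Which of those would be easiest to exploit?"

condense_prompt = build_condense_prompt(history, follow_up)
standalone_question = fake_llm(condense_prompt)  # this is what the retriever would receive
print(standalone_question)
```

The rewritten question is used for retrieval, and the retrieved context is then combined with the chat history to produce the reply.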
Advanced Retrieval Strategies
Hybrid Search (Keyword + Semantic)
# Requires: pip install llama-index-retrievers-bm25
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever
# Vector (semantic) retriever
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
# BM25 (keyword) retriever. With Chroma, nodes live in the vector store and
# index.docstore is empty, so re-split the loaded documents to get the nodes.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
all_nodes = splitter.get_nodes_from_documents(documents)
bm25_retriever = BM25Retriever.from_defaults(nodes=all_nodes, similarity_top_k=5)
# Fuse both with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # don't generate query variants
    mode="reciprocal_rerank",
)
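Reciprocal Rank Fusion combines rankings without needing the retrievers' scores to be comparable: each item earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The sketch below is a minimal standalone version. `reciprocal_rank_fusion` is a hypothetical helper, k = 60 is the conventional constant, and details such as whether ranks start at 0 or 1 may differ from LlamaIndex's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists containing it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


vector_ranking = ["chunk_a", "chunk_b", "chunk_c"]  # semantic retriever output
bm25_ranking = ["chunk_c", "chunk_a", "chunk_d"]    # keyword retriever output

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
for item, score in fused:
    print(f"{item}: {score:.4f}")
```

Items that both retrievers rank highly (chunk_a and chunk_c here) rise above items found by only one of them, which is the behavior hybrid search relies on.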
Contextual Compression (Reranking)
Add a reranker to improve precision after initial retrieval:
# Requires: pip install sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3,  # keep only the 3 best chunks after reranking
)
query_engine = index.as_query_engine(
    similarity_top_k=10,  # retrieve 10 initially
    node_postprocessors=[reranker],  # rerank to top 3
)
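The retrieve-then-rerank pattern is a two-stage pipeline: a cheap, recall-oriented first pass gathers candidates, and a more careful scorer reorders them. The sketch below shows only that control flow. Both scoring functions are toy stand-ins invented for illustration: `first_stage` counts shared words, and `toy_cross_score` (the fraction of a chunk's words that match the query) merely plays the role that a cross-encoder model plays in the real pipeline.

```python
def first_stage(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Cheap recall-oriented retrieval: rank by number of distinct shared words."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:top_k]


def toy_cross_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of the chunk's words found in the query."""
    q = set(query.lower().split())
    words = chunk.lower().split()
    return sum(w in q for w in words) / len(words)


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Reorder first-stage candidates with the more precise scorer and keep top_n."""
    return sorted(candidates, key=lambda c: toy_cross_score(query, c), reverse=True)[:top_n]


query = "token expiry policy"
chunks = [
    "the token expiry policy is thirty minutes",
    "token token token refresh endpoints are documented elsewhere in the policy appendix",
    "expiry policy",
    "billing runs nightly",
]

candidates = first_stage(query, chunks, top_k=3)  # wide net
reranked = rerank(query, candidates, top_n=2)     # precise cut
print(reranked)
```

Retrieving more candidates than you finally keep gives the second stage room to promote chunks that the first stage ranked too low.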
HyDE (Hypothetical Document Embeddings)
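HyDE asks the LLM to draft a hypothetical answer and embeds that draft for retrieval (with include_original=True, the original query is embedded as well). Because the draft tends to resemble the target documents more closely than a short question does, it can outperform direct query embedding, at the cost of one extra LLM call per query: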
from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
hyde_transform = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde_transform)
response = hyde_query_engine.query("How does the authentication system work?")
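The intuition behind HyDE can be shown with a deliberately crude example. In the sketch below, the bag-of-words `embed` function, the `cosine` helper, and the canned `fake_llm_hypothetical_answer` are all hypothetical stand-ins; a real setup would use a neural embedding model and an actual LLM call. Word-count vectors exaggerate the effect, since they have no notion of synonyms, but the direction of the effect is the same.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a real system would use a neural model)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm_hypothetical_answer(query: str) -> str:
    """Hypothetical stand-in for the LLM call that drafts a plausible answer."""
    return "Users sign in through authentication with OAuth tokens."


document = "Authentication uses OAuth tokens issued by the identity provider."
query = "How do users sign in?"

direct_score = cosine(embed(query), embed(document))
hyde_score = cosine(embed(fake_llm_hypothetical_answer(query)), embed(document))
print(f"direct={direct_score:.2f}  hyde={hyde_score:.2f}")
```

The question shares no vocabulary with the document, while the drafted answer does. The same mechanism can also backfire: if the LLM drafts a confidently wrong answer, retrieval is pulled toward the wrong documents.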
Sub-Question Query Engine
For complex questions that span multiple topics or data sources:
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.question_gen import LLMQuestionGenerator
# security_index and arch_index are assumed to be two separate VectorStoreIndex
# instances, each built from its own document set as shown earlier
# Wrap your indexes as tools
security_tool = QueryEngineTool.from_defaults(
    query_engine=security_index.as_query_engine(),
    name="security_docs",
    description="Security audit reports and vulnerability assessments",
)
architecture_tool = QueryEngineTool.from_defaults(
    query_engine=arch_index.as_query_engine(),
    name="architecture_docs",
    description="System architecture and design documents",
)
# Sub-question engine breaks complex queries into sub-queries
sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[security_tool, architecture_tool],
    question_gen=LLMQuestionGenerator.from_defaults(),  # prompt-based; uses Settings.llm
    verbose=True,
)
response = sub_query_engine.query(
    "What are the security implications of the current architecture, "
    "and which components need the most hardening?"
)
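Conceptually, a sub-question engine does three things: it asks an LLM to split the question into sub-questions, routes each one to the tool whose description fits, and synthesizes the partial answers. The sketch below walks through that flow with stubs. `ToyTool`, `fake_decompose`, and `answer_with_sub_questions` are hypothetical names, the decomposition is hard-coded where a real engine would call an LLM, and the final synthesis step is reduced to joining strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTool:
    name: str
    description: str
    answer: Callable[[str], str]  # stand-in for query_engine.query


def fake_decompose(question: str, tools: list[ToyTool]) -> list[tuple[str, str]]:
    """Hypothetical stand-in for the LLM question generator: (tool_name, sub_question) pairs."""
    return [
        ("security_docs", "What vulnerabilities were reported?"),
        ("architecture_docs", "Which components are exposed to the internet?"),
    ]


def answer_with_sub_questions(question: str, tools: list[ToyTool]) -> str:
    """Route each sub-question to its tool and collect the partial answers."""
    by_name = {tool.name: tool for tool in tools}
    partial_answers = []
    for tool_name, sub_question in fake_decompose(question, tools):
        answer = by_name[tool_name].answer(sub_question)
        partial_answers.append(f"[{tool_name}] {sub_question} -> {answer}")
    # A real engine would pass these partial answers to an LLM for final synthesis.
    return "\n".join(partial_answers)


tools = [
    ToyTool("security_docs", "Security audit reports", lambda q: "canned security answer"),
    ToyTool("architecture_docs", "System design documents", lambda q: "canned architecture answer"),
]
combined = answer_with_sub_questions(
    "What are the security implications of the current architecture?", tools
)
print(combined)
```

Because routing depends on the tool descriptions, specific and non-overlapping descriptions matter more here than in a single-index setup.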
Evaluation
LlamaIndex includes built-in evaluators for measuring RAG quality. Each one uses an LLM as the judge (Settings.llm unless you pass one explicitly):
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
faithfulness_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()
# Evaluate a response
result = query_engine.query("What is the main finding?")
faithfulness_score = faithfulness_eval.evaluate_response(response=result)
relevancy_score = relevancy_eval.evaluate_response(
    query="What is the main finding?",
    response=result,
)
print(f"Faithfulness: {faithfulness_score.passing}")
print(f"Relevancy: {relevancy_score.passing}")
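A single pass/fail result says little, so in practice you run the evaluators over a set of test questions and track the aggregate. The sketch below shows one way to summarize such a run. `ToyEvalResult` is a hypothetical stand-in that mirrors only the boolean `passing` field used above, and the queries and outcomes are invented sample data, not measured results.

```python
from dataclasses import dataclass


@dataclass
class ToyEvalResult:
    """Hypothetical stand-in for an evaluation result with a boolean `passing` field."""
    query: str
    passing: bool


def pass_rate(results: list[ToyEvalResult]) -> float:
    """Fraction of evaluated queries that passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(r.passing for r in results) / len(results)


# Invented sample outcomes, for illustration only.
results = [
    ToyEvalResult("What is the main finding?", True),
    ToyEvalResult("Who wrote the report?", True),
    ToyEvalResult("What is the remediation deadline?", False),
    ToyEvalResult("Which systems were in scope?", True),
]

rate = pass_rate(results)
failed = [r.query for r in results if not r.passing]
print(f"Pass rate: {rate:.0%}")
print("Failed queries:", failed)
```

Keeping the list of failed queries alongside the rate makes it easier to tell whether a change to chunking or retrieval fixed the cases you cared about or merely shifted the failures elsewhere.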
LlamaIndex gives you fine-grained control over every step of the RAG pipeline. Start with a simple VectorStoreIndex and query_engine, then layer in hybrid search, reranking, and sub-question decomposition as your accuracy requirements increase, using the evaluators to confirm that each added layer actually improves results. For production RAG systems, it remains one of the most purpose-built frameworks available.