
LlamaIndex Guide: Build RAG Applications in 2026

Learn LlamaIndex to build production RAG applications. Covers document indexing, retrieval strategies, query engines, and advanced RAG pipelines.


LlamaIndex (formerly GPT Index) is a widely used framework for building Retrieval-Augmented Generation (RAG) applications. LangChain is a general-purpose LLM framework, whereas LlamaIndex specializes in connecting LLMs to data. That makes it a strong choice when your primary goal is a system that answers questions from your documents, databases, or APIs.

LlamaIndex vs. LangChain for RAG

| Aspect | LlamaIndex | LangChain |
| --- | --- | --- |
| Primary focus | Data + retrieval | General LLM orchestration |
| RAG quality | More fine-grained control | Good but less specialized |
| Learning curve | Moderate | Steeper |
| Data connectors | 160+ built-in | Fewer native |
| Advanced retrieval | Many built-in strategies | Manual implementation |

For pure RAG use cases, LlamaIndex provides more built-in options for chunking strategies, retrieval algorithms, and query transformations.

Installation

pip install llama-index
pip install llama-index-llms-openai
pip install llama-index-llms-ollama         # for local models
pip install llama-index-embeddings-openai
pip install llama-index-embeddings-ollama   # for local embeddings
pip install llama-index-vector-stores-chroma
pip install chromadb
pip install llama-index-readers-web         # for SimpleWebPageReader
pip install llama-index-readers-database    # for DatabaseReader
pip install llama-index-retrievers-bm25     # for hybrid (BM25) search
pip install sentence-transformers           # for the cross-encoder reranker

Core Concepts

LlamaIndex is built around a few key abstractions:

  • Documents — Raw content from files, databases, or APIs
  • Nodes — Chunks of documents after splitting
  • Index — A data structure optimized for retrieval (usually a vector index)
  • Retriever — Fetches relevant nodes for a query
  • Query Engine — Combines retriever + LLM to answer questions
  • Chat Engine — A query engine with conversation memory

Basic Setup: Global Settings

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure default LLM and embedding model globally
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 1024
Settings.chunk_overlap = 200

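# Alternative: local models via Ollama (use these instead of the OpenAI settings above).
# Use the same embedding model at index time and at query time.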
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.llm = Ollama(model="llama3.2", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
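
To make the chunk_size and chunk_overlap settings above concrete, here is a minimal sketch of sliding-window chunking. It is an illustration only: the helper chunk_words is hypothetical and splits on whitespace, whereas LlamaIndex's own splitters count tokens and try to respect sentence boundaries.

```python
def chunk_words(text, chunk_size=4, chunk_overlap=1):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last word of the previous one, which is what the overlap buys you: a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.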

Loading Documents

LlamaIndex has 160+ data connectors (called “Readers”) for different sources:

from llama_index.core import SimpleDirectoryReader

# Load from a folder — handles PDF, Word, TXT, HTML, CSV, code files
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True,
    required_exts=[".pdf", ".txt", ".md"]
).load_data()

print(f"Loaded {len(documents)} documents")

# Load specific files or other sources (SimpleDirectoryReader is already imported above)

# Single PDF
docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()

# From a URL
from llama_index.readers.web import SimpleWebPageReader
web_docs = SimpleWebPageReader(html_to_text=True).load_data(
    urls=["https://docs.example.com/api"]
)

# From a database (SQL)
from llama_index.readers.database import DatabaseReader
db_reader = DatabaseReader(
    scheme="postgresql",
    host="localhost",
    port=5432,
    user="postgres",
    password="secret",
    dbname="mydb"
)
db_docs = db_reader.load_data(query="SELECT title, content FROM articles")

Building a Vector Index

The VectorStoreIndex is the most common index type — it converts documents to embeddings and enables semantic search:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Persistent ChromaDB storage
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index (or load existing)
if collection.count() == 0:
    # First time — index the documents
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
else:
    # Load existing index; embeddings are read from Chroma, nothing is re-embedded
    index = VectorStoreIndex.from_vector_store(vector_store)
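
The "semantic search" a vector index performs boils down to ranking stored embeddings by similarity to the query embedding. The sketch below shows that idea with hand-written two-dimensional vectors and cosine similarity. The names cosine_similarity and top_k and the toy vectors are assumptions for illustration; a real index uses model-generated embeddings with hundreds of dimensions and an optimized store such as Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": the first axis loosely means auth, the second means billing
doc_vecs = {"auth": [1.0, 0.0], "billing": [0.0, 1.0], "login": [0.9, 0.1]}
for doc_id, score in top_k([1.0, 0.0], doc_vecs, k=2):
    print(f"{doc_id}: {score:.3f}")
```

This is what similarity_top_k controls in the query engine: how many of the best-scoring chunks are passed on to the LLM.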

Query Engines

A query engine combines retrieval and generation:

# Basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,           # retrieve 5 most relevant chunks
    response_mode="compact"       # pack retrieved chunks into as few LLM calls as possible
)

response = query_engine.query("What are the main security vulnerabilities found?")
print(response.response)

# Access the source nodes (which chunks were used)
for node in response.source_nodes:
    print(f"Score: {node.score:.3f} | File: {node.metadata.get('file_name')}")
    print(node.text[:200])
    print("---")

Response Modes

| Mode | Description | Use When |
| --- | --- | --- |
| compact | Packs retrieved chunks into as few prompts as possible | Short docs, fast responses |
| refine | Iterates through each chunk, refining the answer | Long docs needing full coverage |
| tree_summarize | Hierarchical summarization | Large document sets |
| accumulate | Gets answers per chunk, combines | Multi-part questions |
| no_text | Returns only source nodes | When you want raw retrieval |
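
The practical difference between compact and refine is the number of LLM calls. The sketch below illustrates the packing idea behind compact using a character budget. It is a simplification under stated assumptions: pack_chunks and max_chars are hypothetical names, and the real implementation budgets in tokens and accounts for the prompt template.

```python
def pack_chunks(chunks, max_chars):
    """Greedily merge consecutive chunks into prompts of at most max_chars characters.

    A single chunk longer than max_chars is kept as its own oversized prompt.
    """
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n" + chunk
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


chunks = ["aaaa", "bbbb", "cccc"]
prompts = pack_chunks(chunks, max_chars=10)
print(f"refine-style calls: {len(chunks)}, compact-style calls: {len(prompts)}")
# refine-style calls: 3, compact-style calls: 2
```

With a realistic context window, five retrieved chunks usually fit in one packed prompt, which is why compact tends to be the faster and cheaper option.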

Chat Engine: Conversational RAG

# Stateful chat over your documents
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=5,
    verbose=True
)

# Multi-turn conversation
response1 = chat_engine.chat("What vulnerabilities were found in the API layer?")
print(response1.response)

response2 = chat_engine.chat("Which of those would be easiest to exploit?")
print(response2.response)

# Chat history is maintained automatically
chat_engine.reset()  # clear history to start fresh

Advanced Retrieval Strategies

Hybrid Search (Keyword + Semantic)

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Vector (semantic) retriever
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

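# BM25 (keyword) retriever
# With an external vector store such as Chroma, index.docstore is typically empty,
# so build the BM25 corpus by re-splitting the loaded documents instead.
from llama_index.core.node_parser import SentenceSplitter
all_nodes = SentenceSplitter(chunk_size=1024, chunk_overlap=200).get_nodes_from_documents(documents)
bm25_retriever = BM25Retriever.from_defaults(nodes=all_nodes, similarity_top_k=5)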

# Fuse both with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # don't generate query variants
    mode="reciprocal_rerank"
)
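
Reciprocal Rank Fusion itself is simple enough to sketch in plain Python. Each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the fused order. The function name and the constant k=60 (the value commonly used for RRF) are assumptions here; the library's internal implementation may differ in details such as rank offset.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of doc ids into one list of (doc_id, score), best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n] if top_n is not None else fused


vector_ranking = ["a", "b", "c"]  # hypothetical semantic results, best first
bm25_ranking = ["b", "d", "a"]    # hypothetical keyword results, best first
print([doc_id for doc_id, _ in reciprocal_rank_fusion([vector_ranking, bm25_ranking])])
# ['b', 'a', 'd', 'c']
```

Because only ranks are used, RRF does not need the two retrievers' scores to be on comparable scales, which is the main reason it suits mixing BM25 with cosine similarity.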

Contextual Compression (Reranking)

Add a reranker to improve precision after initial retrieval:

from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3  # keep only the 3 best chunks after reranking
)

query_engine = index.as_query_engine(
    similarity_top_k=10,        # retrieve 10 initially
    node_postprocessors=[reranker]  # rerank to top 3
)
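
The retrieve-wide-then-rerank pattern can be sketched without any model. In the toy version below, overlap_score is a hypothetical stand-in for the cross-encoder: it simply measures what fraction of the query's words occur in a candidate. A real reranker scores each (query, chunk) pair with a neural model, but the control flow is the same.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that also appear in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, top_n=3, score_fn=overlap_score):
    """Re-score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


candidates = [
    "billing page layout",
    "token expiry bug in auth",
    "token refresh flow",
]
print(rerank("token expiry bug", candidates, top_n=2))
# ['token expiry bug in auth', 'token refresh flow']
```

The first stage is cheap and tuned for recall (similarity_top_k=10); the second stage is slower per item but more precise, so it only runs over the short candidate list.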

HyDE (Hypothetical Document Embeddings)

HyDE generates a hypothetical answer and embeds that for retrieval. It can outperform direct query embedding, particularly when queries are short or worded very differently from the source documents:

from llama_index.core.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

hyde_transform = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde_transform)

response = hyde_query_engine.query("How does the authentication system work?")

Sub-Question Query Engine

For complex questions that need to query multiple topics:

from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import SubQuestionQueryEngine

# Wrap your indexes as tools
# (security_index and arch_index are assumed to be VectorStoreIndex objects built as shown earlier)
security_tool = QueryEngineTool.from_defaults(
    query_engine=security_index.as_query_engine(),
    name="security_docs",
    description="Security audit reports and vulnerability assessments"
)

architecture_tool = QueryEngineTool.from_defaults(
    query_engine=arch_index.as_query_engine(),
    name="architecture_docs",
    description="System architecture and design documents"
)

# Sub-question engine breaks complex queries into sub-queries
sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[security_tool, architecture_tool],
    verbose=True
)

response = sub_query_engine.query(
    "What are the security implications of the current architecture, "
    "and which components need the most hardening?"
)

Evaluation

LlamaIndex has built-in evaluation to measure RAG quality:

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faithfulness_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()

# Evaluate a response
result = query_engine.query("What is the main finding?")

faithfulness_score = faithfulness_eval.evaluate_response(response=result)
relevancy_score = relevancy_eval.evaluate_response(
    query="What is the main finding?",
    response=result
)

print(f"Faithfulness: {faithfulness_score.passing}")
print(f"Relevancy: {relevancy_score.passing}")
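
A single pass/fail result says little about overall quality, so in practice you would run the evaluators over a set of test questions and aggregate. The sketch below shows one way to do that. EvalOutcome is a hypothetical stand-in that keeps only the query and the passing flag, so the example runs without an LLM.

```python
from dataclasses import dataclass


@dataclass
class EvalOutcome:
    """Hypothetical stand-in for an evaluator result; only the fields used here."""
    query: str
    passing: bool


def pass_rate(outcomes):
    """Fraction of outcomes that passed (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return sum(1 for outcome in outcomes if outcome.passing) / len(outcomes)


def failing_queries(outcomes):
    """Queries whose evaluation failed, in their original order."""
    return [outcome.query for outcome in outcomes if not outcome.passing]


outcomes = [
    EvalOutcome("What is the main finding?", True),
    EvalOutcome("Which endpoints lack auth?", False),
    EvalOutcome("Who approved the audit?", True),
    EvalOutcome("When was the last pen test?", True),
]
print(f"Faithfulness pass rate: {pass_rate(outcomes):.0%}")
print("Needs review:", failing_queries(outcomes))
```

Tracking the pass rate across changes to chunk size, top-k, or the reranker gives you a regression signal for the pipeline as a whole.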

LlamaIndex gives you fine-grained control over every step of the RAG pipeline. Start with a simple VectorStoreIndex and query engine, then layer in hybrid search, reranking, and sub-question decomposition as your accuracy requirements increase. For production RAG systems, it is one of the most purpose-built frameworks available.
