Embeddings and Vector Databases

This is part of the AI Agents series. All code is at github.com/achintmehta/langchain.

What is an embedding?

An embedding is a list of floating-point numbers that represents the meaning of a piece of text. The numbers are produced by a neural network trained on large amounts of text so that semantically similar inputs land close together in a high-dimensional vector space. "Paris is the capital of France" and "France's capital city is Paris" end up very close together. "The stock market fell 2%" ends up far away from both.

This is what makes similarity search possible: you embed a query, embed your document chunks, and then find the chunks whose vectors are closest to the query vector. The closer two vectors are, the more similar the texts are in meaning.
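To make "closest" concrete, here is a minimal sketch of ranking chunks by cosine similarity, one common closeness measure. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration. A real model would produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are made up, not model output.
chunk_vectors = {
    "Paris is the capital of France": [0.9, 0.1, 0.0],
    "France's capital city is Paris": [0.8, 0.2, 0.1],
    "The stock market fell 2%": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stands in for an embedded question about France

# Sort chunk texts from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
    reverse=True,
)

for text in ranked:
    score = cosine_similarity(query_vector, chunk_vectors[text])
    print(f"{score:.3f}  {text}")
```

The two sentences about Paris score close to 1.0, while the stock market sentence scores close to 0 and lands last.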

Generating embeddings with a local model

The example code is in chunking/embedding.py. It uses a HuggingFace model called all-MiniLM-L6-v2 — a small, fast model that runs entirely on your CPU and produces 384-dimensional embeddings. It is a good choice for development and for use cases where you cannot send data to an external API.

from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed a list of strings
texts = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "The stock market fell 2% today."
]

vectors = embeddings_model.embed_documents(texts)
print(len(vectors))       # 3 — one vector per document
print(len(vectors[0]))    # 384 — dimensions

You can also embed a single query string with embed_query:

query_vector = embeddings_model.embed_query("What is the capital of France?")

The reason embed_documents and embed_query are separate methods is that some embedding models apply a slightly different transformation to queries vs. documents for better retrieval performance. With all-MiniLM-L6-v2 the distinction doesn't matter much, but the separation is good practice for when it does.
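As an illustration of when the split matters, some models expect a different text prefix for queries and for documents. The E5 family, for example, is documented to use "query: " and "passage: ". The sketch below shows that idea with a hypothetical wrapper class, PrefixedEmbeddings, and a toy embedding function. Neither is part of LangChain or the example repository, and the toy function is not semantic at all.

```python
from typing import Callable, List


class PrefixedEmbeddings:
    """Hypothetical wrapper that prepends a different prefix for queries and documents."""

    def __init__(
        self,
        embed_fn: Callable[[List[str]], List[List[float]]],
        query_prefix: str = "query: ",
        document_prefix: str = "passage: ",
    ):
        self._embed_fn = embed_fn
        self._query_prefix = query_prefix
        self._document_prefix = document_prefix

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Documents get the document prefix before being embedded.
        return self._embed_fn([self._document_prefix + t for t in texts])

    def embed_query(self, text: str) -> List[float]:
        # A query gets the query prefix, so the same text embeds differently.
        return self._embed_fn([self._query_prefix + text])[0]


def toy_embed(texts: List[str]) -> List[List[float]]:
    # Stand-in for a real model: derives two numbers from the raw characters.
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


model = PrefixedEmbeddings(toy_embed)
same_text = "What is the capital of France?"

print(model.embed_query(same_text))
print(model.embed_documents([same_text])[0])
```

The same string produces two different vectors depending on which method is called, which is the behaviour the two-method interface leaves room for.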

The first time you run this, the model weights are downloaded from the Hugging Face Hub (around 90 MB). After that, they are cached locally.

Why metadata matters

Each chunk you embed should carry metadata alongside the vector: at minimum the source document identifier, the page number or section, and the original text. When you retrieve a chunk at query time, you need to know where it came from.

LangChain's Document objects carry this metadata automatically. When you use a LangChain vector store, it stores both the vector and the metadata together.
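One possible use of that metadata is building a citation for each retrieved chunk. The sketch below uses a small stand-in dataclass, Chunk, with the same two attributes as a LangChain Document (page_content and metadata) so that it runs without any dependencies. The file names, the "page" key, and the format_citation helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(chunk: Chunk) -> str:
    """Build a short citation from whatever metadata the chunk carries."""
    source = chunk.metadata.get("source", "unknown source")
    page = chunk.metadata.get("page")
    return f"{source}, p. {page}" if page is not None else source


# Hypothetical retrieval results.
retrieved = [
    Chunk("Paris is the capital of France.", {"source": "geography.pdf", "page": 12}),
    Chunk("The Eiffel Tower is in Paris.", {"source": "landmarks.md"}),
]

for chunk in retrieved:
    print(f"[{format_citation(chunk)}] {chunk.page_content}")
```

Using .get() with a fallback keeps the helper from failing on chunks whose loader did not record a page number.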

Storing vectors in PostgreSQL with pgvector

For local development, PostgreSQL with the pgvector extension is an excellent vector database. It is easy to run via Docker, supports both exact and approximate nearest-neighbour search, and integrates cleanly with LangChain.

The full example is in chunking/vectorDb.py.

Start pgvector:

docker run --name pgvector-container \
  -e POSTGRES_USER=langchain \
  -e POSTGRES_PASSWORD=langchain \
  -e POSTGRES_DB=langchain \
  -p 6024:5432 \
  -d pgvector/pgvector:pg16

Store and search:

from langchain_postgres.vectorstores import PGVector
from langchain_huggingface import HuggingFaceEmbeddings
import uuid

embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"

# Create the vector store and insert chunks
db = PGVector.from_documents(
    documents=chunks,            # your split Document objects
    embedding=embeddings_model,
    connection=connection,
    collection_name=str(uuid.uuid4())  # unique name per run
)

# Search for relevant chunks
results = db.similarity_search("What is the capital of France?", k=3)

for doc in results:
    print(doc.metadata["source"])
    print(doc.page_content[:200])
    print()

similarity_search embeds your query using the same model you used to embed the documents, then does a cosine similarity search over all stored vectors and returns the k most similar chunks as Document objects.

The uuid.uuid4() collection name is worth noting: if you run the same script multiple times against one collection without clearing the database, you will accumulate duplicate entries. Using a fresh UUID each time keeps each run's results free of duplicates during development, although every run still leaves its own collection behind in the database. In production you will typically use a stable collection name and update documents incrementally.
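For the incremental-update case, one common approach is to give each chunk a deterministic ID so that re-running an ingest produces the same IDs instead of new rows. Here is a minimal sketch using uuid.uuid5. The namespace URL, the chunk_id helper, and the "source#index" key format are all assumptions made for illustration.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-rag-app")


def chunk_id(source: str, chunk_index: int) -> str:
    """The same source and position always yield the same ID."""
    return str(uuid.uuid5(NAMESPACE, f"{source}#{chunk_index}"))


ids = [chunk_id("geography.pdf", i) for i in range(3)]
print(ids)

# With a stable collection name, these IDs could be passed alongside the chunks,
# for example: db.add_documents(chunks, ids=ids)
# (assumption: the vector store overwrites rows whose IDs already exist).
```

Position-based IDs only stay stable while the chunking stays the same. If you change the chunk size or splitter, it is safer to delete the old entries for that source and re-insert.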

Indexing strategies

For small document sets (up to a few thousand chunks), the default flat search in pgvector is fine — it compares the query vector against every stored vector exactly. For larger corpora, you can create an HNSW index which trades a small amount of accuracy for dramatically faster search:

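-- Run this in psql after inserting data.
-- Assumes the default langchain_postgres table name and that the store was created with
-- embedding_length=384, so the embedding column has the fixed dimension an HNSW index needs.
CREATE INDEX ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops);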

HNSW (Hierarchical Navigable Small World) is the algorithm behind most production vector databases. It organises vectors into a multi-layer graph so that search can jump quickly to the approximate neighbourhood of the query rather than scanning everything. It is the right choice once you have tens of thousands of chunks or more.

For very large corpora (hundreds of millions of vectors) where memory is a constraint, IVF+PQ (Inverted File + Product Quantisation) compresses vectors and clusters them, trading more accuracy for much lower memory usage.
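A rough back-of-the-envelope calculation shows why compression matters at that scale. The sketch below compares raw float32 storage with product-quantised codes for 100 million 384-dimensional vectors. The choice of 48 sub-vectors with 8-bit codes is an assumption picked for illustration, and the figures ignore codebooks, inverted lists, and index overhead.

```python
def raw_bytes_per_vector(dimensions: int, bytes_per_float: int = 4) -> int:
    """Size of one uncompressed float32 vector."""
    return dimensions * bytes_per_float


def pq_bytes_per_vector(num_subvectors: int, bits_per_code: int = 8) -> int:
    """Size of one product-quantised vector: one small code per sub-vector."""
    return num_subvectors * bits_per_code // 8


num_vectors = 100_000_000
dimensions = 384
num_subvectors = 48  # assumption: 384 dims split into 48 sub-vectors of 8 dims each

raw_gb = num_vectors * raw_bytes_per_vector(dimensions) / 1e9
pq_gb = num_vectors * pq_bytes_per_vector(num_subvectors) / 1e9

print(f"float32 vectors: {raw_gb:.1f} GB")  # 153.6 GB
print(f"PQ codes:        {pq_gb:.1f} GB")   # 4.8 GB
```

Under these assumptions the codes take 32 times less space than the raw vectors, at the cost of comparing against approximations of the original vectors.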

What's next

You can now embed chunks and retrieve the most relevant ones for a query. The next part covers how to plug retrieval into an LLM call — the full RAG pipeline — and explores more sophisticated RAG patterns beyond the naive baseline.