Goal
Define the fields your app should attach to every embedded chunk so retrieved answers can be traced back to the source document, page, extraction run, and processing configuration.Sample Document
Use the Legal Filing for long-document retrieval testing, or Blackstone Report for a shorter financial report.Use This Workflow
Use this recipe before you choose a vector database. The same metadata shape works with Pinecone, Weaviate, Qdrant, pgvector, OpenSearch, Elasticsearch, or a LangChain vector store.Metadata Shape
Python Normalizer
Upsert Pattern
Checks
- Use deterministic IDs so retries replace records instead of duplicating them.
- Put tenant/customer IDs in metadata if the vector database is multi-tenant.
- Keep source URLs internal when vectors cross trust boundaries.
- Store the extraction config version so search results can be explained later.
- Filter by tenant and document permissions before returning retrieved context to an agent or user.
Related
LangChain RAG Ingestion
End-to-end chunk and vector example.
Security Overview
Production controls for regulated workflows.