Skip to main content

Goal

Define the fields your app should attach to every embedded chunk so retrieved answers can be traced back to the source document, page, extraction run, and processing configuration.

Sample Document

Use the Legal Filing for long-document retrieval testing, or Blackstone Report for a shorter financial report.

Use This Workflow

Use this recipe before you choose a vector database. The same metadata shape works with Pinecone, Weaviate, Qdrant, pgvector, OpenSearch, Elasticsearch, or a LangChain vector store.

Metadata Shape

{
  "id": "extraction_123:semantic:00042",
  "text": "Chunk text to embed...",
  "metadata": {
    "tenant_id": "customer_abc",
    "document_id": "source_system_doc_789",
    "extraction_id": "extraction_123",
    "job_id": "job_456",
    "source_url": "s3://customer-bucket/legal/filing.pdf",
    "filename": "filing.pdf",
    "page": 12,
    "chunk_strategy": "semantic",
    "chunk_index": 42,
    "schema_version": "extract-config-v3",
    "created_at": "2026-06-30T00:00:00Z"
  }
}

Python Normalizer

def vector_records(extraction_result, *, tenant_id, document_id, source_url):
    extraction_id = extraction_result["extraction_id"]
    chunking = extraction_result.get("extensions", {}).get("chunking", {})

    for strategy, chunks in chunking.items():
        if not isinstance(chunks, list):
            continue

        for index, chunk in enumerate(chunks):
            text = chunk.get("text") or chunk.get("content") or chunk.get("markdown")
            if not text:
                continue

            yield {
                "id": f"{extraction_id}:{strategy}:{index:05d}",
                "text": text,
                "metadata": {
                    "tenant_id": tenant_id,
                    "document_id": document_id,
                    "extraction_id": extraction_id,
                    "job_id": extraction_result.get("job_id"),
                    "source_url": source_url,
                    "filename": extraction_result.get("filename"),
                    "page": chunk.get("page") or chunk.get("page_number"),
                    "chunk_strategy": strategy,
                    "chunk_index": index,
                },
            }

Upsert Pattern

records = list(vector_records(
    result,
    tenant_id="customer_abc",
    document_id="loan_file_789",
    source_url="s3://customer-bucket/loan_file.pdf",
))

texts = [record["text"] for record in records]
vectors = embed(texts)  # Use your approved embedding provider.

for record, vector in zip(records, vectors):
    vector_db.upsert(
        id=record["id"],
        vector=vector,
        metadata=record["metadata"],
    )

Checks

  • Use deterministic IDs so retries replace records instead of duplicating them.
  • Put tenant/customer IDs in metadata if the vector database is multi-tenant.
  • Keep source URLs internal when vectors cross trust boundaries.
  • Store the extraction config version so search results can be explained later.
  • Filter by tenant and document permissions before returning retrieved context to an agent or user.

LangChain RAG Ingestion

End-to-end chunk and vector example.

Security Overview

Production controls for regulated workflows.