# Vector Metadata Contract

> Design stable metadata for embeddings, chunks, citations, and retrieval audit trails.

## Goal

Define the fields your app should attach to every embedded chunk so retrieved answers can be traced back to the source document, page, extraction run, and processing configuration.

## Sample Document

Use the [Legal Filing](https://platform.runpulse.com/dashboard/examples/d4dfc1e2-60ac-4776-a5a8-20b88e68bf9f) for long-document retrieval testing, or the [Blackstone Report](https://platform.runpulse.com/dashboard/examples/b63fca41-ee50-4a65-bd71-64188d5bdbf5) for a shorter financial report.

## Use This Workflow

```mermaid theme={null}
flowchart LR
    A["/extract with storage + chunking"] --> B[Chunk list]
    B --> C[Normalize metadata]
    C --> D[Embed text]
    D --> E[Upsert by stable ID]
```

Use this recipe before you choose a vector database. The same metadata shape works with Pinecone, Weaviate, Qdrant, pgvector, OpenSearch, Elasticsearch, or a LangChain vector store.

## Metadata Shape

```json theme={null}
{
  "id": "extraction_123:semantic:00042",
  "text": "Chunk text to embed...",
  "metadata": {
    "tenant_id": "customer_abc",
    "document_id": "source_system_doc_789",
    "extraction_id": "extraction_123",
    "job_id": "job_456",
    "source_url": "s3://customer-bucket/legal/filing.pdf",
    "filename": "filing.pdf",
    "page": 12,
    "chunk_strategy": "semantic",
    "chunk_index": 42,
    "schema_version": "extract-config-v3",
    "created_at": "2026-06-30T00:00:00Z"
  }
}
```
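
One way to enforce this shape before embedding is a small validation helper. The sketch below is illustrative only: `REQUIRED_METADATA_FIELDS` and `validate_record` are hypothetical names, and the choice of which fields are required is an assumption to adapt to your own contract.

```python
# Hypothetical helper: the required-field set is an assumption, not a platform requirement.
REQUIRED_METADATA_FIELDS = {
    "tenant_id": str,
    "document_id": str,
    "extraction_id": str,
    "chunk_strategy": str,
    "chunk_index": int,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    if not isinstance(record.get("id"), str) or not record.get("id"):
        problems.append("id must be a non-empty string")
    if not isinstance(record.get("text"), str) or not record.get("text"):
        problems.append("text must be a non-empty string")

    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata must be an object")
        return problems

    for field, expected_type in REQUIRED_METADATA_FIELDS.items():
        value = metadata.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"metadata.{field} must be of type {expected_type.__name__}")

    return problems
```

Running this check before the upsert step keeps malformed records out of the index, where they are harder to find and repair.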

## Python Normalizer

```python theme={null}
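from datetime import datetime, timezone


def vector_records(extraction_result, *, tenant_id, document_id, source_url, schema_version=None):
    """Yield one embeddable record per chunk, keyed by a deterministic ID."""
    extraction_id = extraction_result["extraction_id"]
    chunking = (extraction_result.get("extensions") or {}).get("chunking") or {}
    # One UTC timestamp per call, so every chunk from the same run shares it.
    created_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    for strategy, chunks in chunking.items():
        # Skip anything that is not a list of chunks.
        if not isinstance(chunks, list):
            continue

        for index, chunk in enumerate(chunks):
            text = chunk.get("text") or chunk.get("content") or chunk.get("markdown")
            if not text:
                continue

            # Check for None explicitly so a page value of 0 is preserved.
            page = chunk.get("page")
            if page is None:
                page = chunk.get("page_number")

            yield {
                "id": f"{extraction_id}:{strategy}:{index:05d}",
                "text": text,
                "metadata": {
                    "tenant_id": tenant_id,
                    "document_id": document_id,
                    "extraction_id": extraction_id,
                    "job_id": extraction_result.get("job_id"),
                    "source_url": source_url,
                    "filename": extraction_result.get("filename"),
                    "page": page,
                    "chunk_strategy": strategy,
                    "chunk_index": index,
                    "schema_version": schema_version,
                    "created_at": created_at,
                },
            }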
```

## Upsert Pattern

```python theme={null}
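# `result` is the extraction result passed to the normalizer; `embed` and
# `vector_db` are placeholders for your embedding provider and database client.
records = list(vector_records(
    result,
    tenant_id="customer_abc",
    document_id="loan_file_789",
    source_url="s3://customer-bucket/loan_file.pdf",
))

texts = [record["text"] for record in records]
vectors = embed(texts)  # Use your approved embedding provider.

# Fail loudly rather than let zip() silently drop records on a length mismatch.
if len(vectors) != len(records):
    raise ValueError(f"Expected {len(records)} vectors, got {len(vectors)}")

for record, vector in zip(records, vectors):
    vector_db.upsert(
        id=record["id"],
        vector=vector,
        metadata=record["metadata"],
    )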
```
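
To see why deterministic IDs matter, the following sketch replays the same batch twice against an in-memory stand-in for a vector database. `InMemoryVectorStore` and `chunk_id` are hypothetical names used only for illustration; a real client will have its own upsert signature.

```python
class InMemoryVectorStore:
    """Dict-backed stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, *, id, vector, metadata):
        # Writing to an existing key replaces the row instead of adding a new one.
        self._rows[id] = {"vector": vector, "metadata": metadata}

    def __len__(self):
        return len(self._rows)


def chunk_id(extraction_id, strategy, index):
    # Same ID format as the normalizer: extraction, strategy, zero-padded index.
    return f"{extraction_id}:{strategy}:{index:05d}"


store = InMemoryVectorStore()
batch = [
    {
        "id": chunk_id("extraction_123", "semantic", i),
        "vector": [0.0, float(i)],
        "metadata": {"chunk_index": i},
    }
    for i in range(3)
]

# Simulate a retry: the second pass overwrites the rows written by the first.
for _ in range(2):
    for row in batch:
        store.upsert(id=row["id"], vector=row["vector"], metadata=row["metadata"])

print(len(store))  # 3 rows, not 6
```

Because the ID is derived only from the extraction ID, strategy, and chunk position, re-running ingestion for the same extraction is safe. A new extraction run produces new IDs, so delete or supersede the old run's records if you re-extract a document.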

## Checks

* Use deterministic IDs so retries replace records instead of duplicating them.
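* Put tenant/customer IDs in metadata if the vector database is multi-tenant, so every query can filter on them.
* Strip or redact `source_url` before records cross a trust boundary; internal storage paths can expose bucket names and folder structure.
* Store the extraction config version (`schema_version`) so search results can be explained later.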
* Filter by tenant and document permissions before returning retrieved context to an agent or user.
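
The last check can be sketched as a post-retrieval filter. This assumes each hit is a dict carrying the metadata shape above; `authorize_hits`, `allowed_document_ids`, and the `redact_source` flag are hypothetical names. Many vector databases also support metadata filters at query time, which is preferable when available; treat this as an application-side safety net rather than a replacement.

```python
def authorize_hits(hits, *, tenant_id, allowed_document_ids, redact_source=True):
    """Keep only hits the caller may see; optionally drop source_url from the returned copy."""
    allowed = set(allowed_document_ids)
    visible = []

    for hit in hits:
        metadata = hit.get("metadata") or {}
        if metadata.get("tenant_id") != tenant_id:
            continue  # Never return another tenant's chunks.
        if metadata.get("document_id") not in allowed:
            continue  # Caller lacks permission for this document.

        safe_metadata = dict(metadata)  # Copy so the stored record is not mutated.
        if redact_source:
            safe_metadata.pop("source_url", None)
        visible.append({**hit, "metadata": safe_metadata})

    return visible


# Example hits, shaped like the records in this recipe (values are made up).
hits = [
    {"id": "extraction_123:semantic:00000", "metadata": {
        "tenant_id": "customer_abc", "document_id": "doc_1",
        "source_url": "s3://customer-bucket/legal/filing.pdf"}},
    {"id": "extraction_123:semantic:00001", "metadata": {
        "tenant_id": "customer_abc", "document_id": "doc_2"}},
    {"id": "extraction_999:semantic:00000", "metadata": {
        "tenant_id": "customer_xyz", "document_id": "doc_1"}},
]

context = authorize_hits(hits, tenant_id="customer_abc", allowed_document_ids=["doc_1"])
```

Only the first hit survives: the second belongs to a document the caller cannot read, and the third belongs to another tenant.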

## Related

<CardGroup cols={2}>
  <Card title="LangChain RAG Ingestion" icon="diagram-project" href="/cookbooks/rag-langchain-vector-store">
    End-to-end chunk and vector example.
  </Card>

  <Card title="Security Overview" icon="shield" href="/security/overview">
    Production controls for regulated workflows.
  </Card>
</CardGroup>
