# LangChain RAG Ingestion

> Extract, chunk, embed, and index a document for retrieval with LangChain.

## Goal

Turn a Pulse extraction into vector-searchable chunks for a RAG application while keeping document provenance attached to every retrieved passage.

## Sample Document

This cookbook uses [Attention Is All You Need](https://platform.runpulse.com/dashboard/examples/3be15d23-d622-4f27-9843-ec2929140eec) as a long-form sample paper. The code below passes its public URL to the API as `file_url`.

## Use This Workflow

```mermaid
flowchart LR
    A[PDF or presigned URL] --> B["Pulse /extract with chunking"]
    B --> C["extensions.chunking"]
    C --> D[LangChain Documents]
    D --> E[Embedding model]
    E --> F[Vector store]
```

Use Pulse chunking when the source layout, page boundaries, tables, and citations need to be preserved before the text reaches your retrieval layer.

## Python

```python
import os
import time

import requests
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

PULSE_API_KEY = os.environ["PULSE_API_KEY"]

# Submit an async extraction with semantic and page chunking enabled.
resp = requests.post(
    "https://api.runpulse.com/extract",
    headers={"x-api-key": PULSE_API_KEY},
    json={
        "file_url": "https://platform.runpulse.com/api/examples/3be15d23-d622-4f27-9843-ec2929140eec/pdf",
        "extensions": {
            "chunking": {
                "chunk_types": ["semantic", "page"],
                "chunk_size": 1200
            },
            "footnote_references": True
        },
        "async": True,
        "storage": {"enabled": True}
    },
    timeout=60,
)
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll until the job reaches a terminal state, pausing between requests
# so the loop does not hammer the API.
while True:
    poll = requests.get(
        f"https://api.runpulse.com/job/{job_id}",
        headers={"x-api-key": PULSE_API_KEY},
        timeout=60,
    )
    poll.raise_for_status()
    job = poll.json()
    if job["status"] == "completed":
        result = job["result"]
        break
    if job["status"] in {"failed", "canceled"}:
        raise RuntimeError(job)
    time.sleep(2)

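def pulse_chunks(extraction_result):
    """Yield one LangChain Document per Pulse chunk, with provenance metadata."""
    chunking = extraction_result.get("extensions", {}).get("chunking", {})
    for strategy, chunks in chunking.items():
        # Only list-valued entries are treated as chunk collections.
        if not isinstance(chunks, list):
            continue
        for index, chunk in enumerate(chunks):
            # Try the likely text fields in order and skip empty chunks.
            text = chunk.get("text") or chunk.get("content") or chunk.get("markdown")
            if not text:
                continue
            # Fall back to `page_number` only when `page` is absent or None.
            page = chunk.get("page")
            if page is None:
                page = chunk.get("page_number")
            yield Document(
                page_content=text,
                metadata={
                    "source": "attention-is-all-you-need",
                    "extraction_id": extraction_result.get("extraction_id"),
                    "chunk_strategy": strategy,
                    "chunk_index": index,
                    "page": page,
                },
            )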

docs = list(pulse_chunks(result))

# Swap this for your approved embedding provider in regulated environments.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
index = FAISS.from_documents(docs, embeddings)

for doc in index.similarity_search("What problem does attention solve?", k=3):
    print(doc.metadata)
    print(doc.page_content[:500])
```
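
The request above asks for both `semantic` and `page` chunks. If the response holds one list per strategy, as `pulse_chunks` assumes, the same passage can end up in the index twice, and a single search may return near-duplicate hits. One option is to keep a separate index per strategy. The sketch below is a minimal, dependency-free illustration: `group_by_strategy` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain `Document` instances, since the helper only relies on a `.metadata` dict carrying the `chunk_strategy` key.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_strategy(docs):
    """Bucket documents by their `chunk_strategy` metadata value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("chunk_strategy", "unknown")].append(doc)
    return dict(groups)


# Stand-ins for LangChain Document objects (illustrative data only).
sample_docs = [
    SimpleNamespace(page_content="a", metadata={"chunk_strategy": "semantic", "chunk_index": 0}),
    SimpleNamespace(page_content="b", metadata={"chunk_strategy": "page", "chunk_index": 0}),
    SimpleNamespace(page_content="c", metadata={"chunk_strategy": "page", "chunk_index": 1}),
]

grouped = group_by_strategy(sample_docs)
for strategy, members in grouped.items():
    print(strategy, len(members))
```

With real documents, each bucket could be passed to `FAISS.from_documents(members, embeddings)` to build one index per strategy, so page-level citation queries and topic-level queries do not compete in the same result list.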

## Checks

* Store `extraction_id`, source URL, chunk strategy, and page metadata with every vector.
* Prefer `page` chunking when an answer must cite exact pages.
* Prefer `semantic` or `header` chunking when topic coherence matters more than fixed size.
* Re-index when the extraction config changes; chunk boundaries are part of your retrieval contract.
* Keep raw extraction output in your storage boundary when retrieval results feed regulated decisions.
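
The re-indexing check above can be made mechanical by fingerprinting the extraction config and storing the fingerprint alongside each vector. The following is a minimal sketch, assuming the config is the JSON-serializable `extensions` dict sent to `/extract`; `config_fingerprint` is a hypothetical helper name, not part of Pulse or LangChain.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a stable SHA-256 hex digest of a JSON-serializable config."""
    # Sorting keys makes the digest independent of dict key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


indexed_with = {
    "chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 1200},
    "footnote_references": True,
}
# Same settings, different key order: the fingerprint should not change.
reordered = {
    "footnote_references": True,
    "chunking": {"chunk_size": 1200, "chunk_types": ["semantic", "page"]},
}
# A different chunk size moves chunk boundaries, so it should trigger a re-index.
changed = {
    "chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 800},
    "footnote_references": True,
}

needs_reindex = config_fingerprint(changed) != config_fingerprint(indexed_with)
print(needs_reindex)
```

Note that list order still matters in this sketch: `["page", "semantic"]` and `["semantic", "page"]` produce different fingerprints. Sort such lists before hashing if the order carries no meaning for your pipeline.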

## Related

<CardGroup cols={3}>
  <Card title="Chunking Parameters" icon="scissors" href="/concepts/processing-parameters-chunking">
    Choose chunking and citation settings.
  </Card>

  <Card title="Sample Documents" icon="file-pdf" href="/cookbooks/sample-documents">
    Try public documents before connecting storage.
  </Card>

  <Card title="Agent Quickstart" icon="robot" href="/mcp/agent-quickstart">
    Give agents document tools over MCP.
  </Card>
</CardGroup>
