LangChain RAG Ingestion

Goal

Turn a Pulse extraction into vector-searchable chunks for a RAG application while keeping document provenance attached to every retrieved passage.

Sample Document

Use "Attention Is All You Need" as the long-form sample paper. The code below passes its public URL in the file_url field of the request.

Use This Workflow

Use Pulse chunking when the source layout, pages, tables, and citations matter before the text reaches your retrieval layer.

Python

import os
import time

import requests
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

PULSE_API_KEY = os.environ["PULSE_API_KEY"]

# Submit the extraction as an async job with chunking enabled.
resp = requests.post(
    "https://api.runpulse.com/extract",
    headers={"x-api-key": PULSE_API_KEY},
    json={
        "file_url": "https://platform.runpulse.com/api/examples/3be15d23-d622-4f27-9843-ec2929140eec/pdf",
        "extensions": {
            "chunking": {
                "chunk_types": ["semantic", "page"],
                "chunk_size": 1200
            },
            "footnote_references": True
        },
        "async": True,
        "storage": {"enabled": True}
    },
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll until the job finishes. The 2-second pause is an arbitrary interval
# that keeps the loop from hammering the API.
while True:
    job_resp = requests.get(
        f"https://api.runpulse.com/job/{job_id}",
        headers={"x-api-key": PULSE_API_KEY},
        timeout=30,
    )
    job_resp.raise_for_status()
    job = job_resp.json()
    if job["status"] == "completed":
        result = job["result"]
        break
    if job["status"] in {"failed", "canceled"}:
        raise RuntimeError(job)
    time.sleep(2)

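def pulse_chunks(extraction_result):
    """Yield one LangChain Document per chunk, with provenance metadata attached."""
    chunking = extraction_result.get("extensions", {}).get("chunking", {})
    for strategy, chunks in chunking.items():
        # Skip any entry under "chunking" that is not a list of chunks.
        if not isinstance(chunks, list):
            continue
        for index, chunk in enumerate(chunks):
            # Take the first non-empty text field among the candidate keys.
            text = chunk.get("text") or chunk.get("content") or chunk.get("markdown")
            if not text:
                continue
            # Compare against None so that a page value of 0 is not discarded.
            page = chunk.get("page")
            if page is None:
                page = chunk.get("page_number")
            yield Document(
                page_content=text,
                metadata={
                    "source": "attention-is-all-you-need",
                    "extraction_id": extraction_result.get("extraction_id"),
                    "chunk_strategy": strategy,
                    "chunk_index": index,
                    "page": page,
                },
            )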

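docs = list(pulse_chunks(result))
if not docs:
    # Fail early with a clear message instead of building an empty index.
    raise ValueError("No chunks found in the extraction result")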

# Swap this for your approved embedding provider in regulated environments.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
index = FAISS.from_documents(docs, embeddings)

for doc in index.similarity_search("What problem does attention solve?", k=3):
    print(doc.metadata)
    print(doc.page_content[:500])
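
The loop above prints raw metadata for each hit. When an answer has to cite its passages, that metadata can be turned into a short provenance label. The following is a minimal sketch, not part of Pulse or LangChain: format_citation is a hypothetical helper, and it assumes only the metadata keys that pulse_chunks sets (source, chunk_strategy, chunk_index, page).

```python
def format_citation(metadata: dict) -> str:
    """Build a short provenance label from a chunk's metadata dict.

    Hypothetical helper: assumes the keys written by pulse_chunks above.
    """
    source = metadata.get("source") or "unknown-source"
    strategy = metadata.get("chunk_strategy") or "unknown"
    index = metadata.get("chunk_index")
    page = metadata.get("page")

    # Compare against None so that page 0 or chunk 0 is still reported.
    page_part = f"p. {page}" if page is not None else "page unknown"
    chunk_part = f"{strategy} chunk {index}" if index is not None else f"{strategy} chunk"
    return f"{source}, {page_part} ({chunk_part})"


# Example metadata shaped like the dicts built in pulse_chunks.
# The extraction_id value here is a placeholder, not a real identifier.
example = {
    "source": "attention-is-all-you-need",
    "extraction_id": "example-extraction-id",
    "chunk_strategy": "page",
    "chunk_index": 2,
    "page": 3,
}
print(format_citation(example))
# prints: attention-is-all-you-need, p. 3 (page chunk 2)
```

Inside the search loop this would be called as format_citation(doc.metadata), so every printed passage carries its source, page, and chunk strategy.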

Checks

Store extraction_id, source URL, chunk strategy, and page metadata with every vector.
Prefer page chunking when an answer must cite exact pages.
Prefer semantic or header chunking when topic coherence matters more than fixed size.
Re-index when the extraction config changes; chunk boundaries are part of your retrieval contract.
Keep raw extraction output in your storage boundary when retrieval results feed regulated decisions.
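
Two of the checks above can be automated before anything is written to the index. The sketch below is one possible approach using only the standard library: missing_provenance and config_fingerprint are hypothetical helpers, REQUIRED_KEYS is an assumed list based on the metadata keys used in the Python sample, and the fingerprint is simply a truncated SHA-256 of the extensions config serialized as canonical JSON.

```python
import hashlib
import json

# Assumed provenance keys, taken from the metadata built in the sample code.
REQUIRED_KEYS = ("source", "extraction_id", "chunk_strategy", "page")


def missing_provenance(metadata: dict) -> list:
    """Return the required keys that are absent or None in a metadata dict."""
    return [key for key in REQUIRED_KEYS if metadata.get(key) is None]


def config_fingerprint(extensions: dict) -> str:
    """Hash the extraction config so a change in it can trigger a re-index.

    Keys are sorted before hashing, so dict ordering does not affect the result.
    """
    canonical = json.dumps(extensions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two configs that differ only in chunk_size produce different fingerprints.
old_config = {"chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 1200}}
new_config = {"chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 800}}

needs_reindex = config_fingerprint(old_config) != config_fingerprint(new_config)
print(needs_reindex)
# prints: True

print(missing_provenance({"source": "attention-is-all-you-need", "page": 3}))
# prints: ['extraction_id', 'chunk_strategy']
```

Storing the fingerprint next to each vector (for example as an extra metadata key) makes it possible to detect chunks that were produced under an older extraction config.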

Related

Chunking Parameters: Choose chunking and citation settings.
Sample Documents: Try public documents before connecting storage.
Agent Quickstart: Give agents document tools over MCP.
