
Goal

Pick a chunking setup that matches how your users ask questions and how reviewers validate answers.

Sample Documents

Use Attention Is All You Need for academic RAG, Legal Filing for page-grounded legal review, or 10-K Annual Report for section routing.

Decision Table

| Retrieval need | Use | Why |
| --- | --- | --- |
| User asks broad conceptual questions | semantic | Keeps related paragraphs together. |
| User navigates reports by headings | header | Preserves section boundaries. |
| Reviewers cite exact pages | page | Keeps provenance simple and auditable. |
| Vector DB has strict token/size limits | recursive | Produces predictable size windows. |
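
The decision table can be encoded as a small lookup so that the chunking choice is made in one place. This is a minimal sketch: the chunk type names come from the table, while the need labels and the `choose_chunk_types` helper are hypothetical names introduced for illustration.

```python
# Chunk type names are taken from the decision table.
# The keys (need labels) are hypothetical and not part of any API.
CHUNK_TYPE_BY_NEED = {
    "broad_conceptual_questions": "semantic",
    "heading_navigation": "header",
    "exact_page_citation": "page",
    "strict_size_limits": "recursive",
}


def choose_chunk_types(needs):
    """Return de-duplicated chunk types, in the order the needs were given."""
    chosen = []
    for need in needs:
        if need not in CHUNK_TYPE_BY_NEED:
            raise ValueError(f"unknown retrieval need: {need!r}")
        chunk_type = CHUNK_TYPE_BY_NEED[need]
        if chunk_type not in chosen:
            chosen.append(chunk_type)
    return chosen


# Users ask broad questions and reviewers cite pages.
print(choose_chunk_types(["broad_conceptual_questions", "exact_page_citation"]))
```

Raising on an unknown need, rather than falling back to a default, keeps a typo from silently changing how documents are chunked.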

Request

curl -X POST https://api.runpulse.com/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_url": "https://platform.runpulse.com/api/examples/d4dfc1e2-60ac-4776-a5a8-20b88e68bf9f/pdf",
    "extensions": {
      "chunking": {
        "chunk_types": ["semantic", "page"],
        "chunk_size": 1200
      }
    },
    "async": true,
    "storage": {"enabled": true}
  }'

| Use case | Chunk types | Notes |
| --- | --- | --- |
| Enterprise search | semantic, page | Use semantic for recall and page chunks for citation. |
| Legal review | page, header | Filter by page and section before answering. |
| Research assistant | semantic, header | Blend topic-level retrieval with section labels. |
| Embedding-only ingestion | recursive | Useful when the vector store enforces strict payload sizes. |
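
The same request body can be assembled in Python from the use-case table above. This is a sketch under stated assumptions: the endpoint, header names, and JSON fields mirror the curl example, while the use-case keys and the `build_extract_payload` and `build_extract_request` helpers are hypothetical. The code only constructs the request and does not send it.

```python
import json
import urllib.request

API_URL = "https://api.runpulse.com/extract"  # endpoint from the curl example

# Chunk types per use case, copied from the table above.
# The dictionary keys are hypothetical labels, not API values.
CHUNK_TYPES_BY_USE_CASE = {
    "enterprise_search": ["semantic", "page"],
    "legal_review": ["page", "header"],
    "research_assistant": ["semantic", "header"],
    "embedding_only_ingestion": ["recursive"],
}


def build_extract_payload(file_url, use_case, chunk_size=1200):
    """Build a request body with the same fields as the curl example."""
    return {
        "file_url": file_url,
        "extensions": {
            "chunking": {
                "chunk_types": list(CHUNK_TYPES_BY_USE_CASE[use_case]),
                "chunk_size": chunk_size,
            }
        },
        "async": True,
        "storage": {"enabled": True},
    }


def build_extract_request(payload, api_key):
    """Wrap the payload in a POST request object without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


payload = build_extract_payload("https://example.com/filing.pdf", "legal_review")
print(json.dumps(payload, indent=2))
```

To submit the request, pass the object returned by `build_extract_request` to `urllib.request.urlopen`. The response format is not covered by the example above, so handling it is left out here.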

Checks

  • Test chunking on three representative documents before locking the config.
  • Store chunk type and chunk index with every embedding.
  • Keep page references if any answer needs a citation or reviewer jump link.
  • Avoid tiny chunks for tables; they often lose headers and context.
  • Use Split before Schema when the document has multiple sections that need different extraction logic.
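
Several of these checks can be enforced in code at ingestion time. The sketch below is illustrative only: `ChunkRecord`, its field names, and the 200-character threshold are assumptions made for this example, not the API's response schema or a recommended value.

```python
from dataclasses import asdict, dataclass
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class ChunkRecord:
    # All field names are hypothetical, not the API's response schema.
    document_id: str
    chunk_type: str             # e.g. "semantic", "page", "header", "recursive"
    chunk_index: int            # position of the chunk within its chunk type
    text: str
    page: Optional[int] = None  # keep when answers need citations


def to_vector_metadata(record: ChunkRecord, require_page: bool = False) -> dict:
    """Build the metadata stored next to the embedding (text excluded)."""
    if require_page and record.page is None:
        raise ValueError(
            f"chunk {record.chunk_index} of {record.document_id} has no page reference"
        )
    metadata = asdict(record)
    del metadata["text"]
    return metadata


def summarize_lengths(records: List[ChunkRecord], min_chars: int = 200) -> dict:
    """Report length statistics and the indexes of suspiciously small chunks."""
    lengths = [len(r.text) for r in records]
    return {
        "count": len(lengths),
        "min": min(lengths, default=0),
        "median": median(lengths) if lengths else 0,
        "max": max(lengths, default=0),
        "too_small": [r.chunk_index for r in records if len(r.text) < min_chars],
    }


records = [
    ChunkRecord("10-k", "page", 0, "Revenue by segment", page=12),
    ChunkRecord("10-k", "page", 1, "x" * 500, page=13),
]
print(to_vector_metadata(records[0], require_page=True))
print(summarize_lengths(records))
```

Running `summarize_lengths` on each of the three representative documents gives a quick view of whether a configuration produces many undersized chunks before the config is locked.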

Chunking Parameters

Full chunking parameter guide.

Vector Metadata Contract

Persist chunk provenance.

LangChain RAG Ingestion

Build a local vector index.