Extract

Overview

The Extract pipeline is the simplest and most common way to use Pulse. Upload a document and get back clean, layout-aware markdown along with extracted tables, figures, bounding boxes, and optional chunks. This is the starting point for every other pipeline — Extract → Schema and Extract → Split → Schema both build on top of this step.

When to Use

RAG ingestion — feed clean markdown into a vector database
Search indexing — convert documents to searchable text
Content migration — pull content out of PDFs into your CMS
Table extraction — grab structured tables from financial reports, invoices, or spreadsheets
General-purpose parsing — convert any supported file type to machine-readable text

Supported File Types

Pulse handles a wide range of document formats out of the box:

Category	Extensions
PDF	`.pdf` — text-based, scanned/image-based, mixed, multi-page
Images	`.jpg`, `.jpeg`, `.png` — scans, photos, screenshots
Office	`.docx`, `.pptx`, `.xlsx` — Word, PowerPoint, Excel
Web	`.html`, `.htm` — saved web pages, HTML emails

For the full breakdown including processing tips per format, see Supported File Types.
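
Before uploading, it can be handy to check a file's extension against the table above. The following is a minimal sketch and not part of the Pulse SDK: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the set simply mirrors the extensions listed on this page.

```python
from pathlib import Path

# Extensions copied by hand from the table above; this is an assumption-based
# local list, not something fetched from Pulse.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".docx", ".pptx", ".xlsx", ".html", ".htm",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the local list (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only looks at the file name, so it cannot tell whether the file contents are actually valid for that format.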

How to Use in the Playground

Upload your document

Drag and drop a file or paste a URL into the upload area. You can also upload multiple documents at once for batch processing.

Configure extraction settings

Adjust settings on the Configuration tab before extracting:

Setting	What it does
Page range	Process only specific pages (e.g. 1-5, 3,7,12)
Extract figures	Pull out embedded images and diagrams
Figure descriptions	Generate AI descriptions of extracted figures
Show images	Include inline images in the markdown output
Return HTML	Get HTML output in addition to markdown
Effort mode	Use more compute for higher accuracy on complex layouts
Chunking	Split output into semantic, header, page, or recursive chunks
Chunk size	Target token count per chunk

Click “Extract All”

The extraction runs synchronously or asynchronously, depending on document size. Progress is shown in the pipeline tabs.

Review results

Results appear across several tabs:

Markdown — Full document text with layout-aware formatting

Tables — Detected tables rendered in a grid view

Bounding Boxes — Visual overlay showing where each element was detected on the page

Chunks — Chunked output (if chunking was enabled)
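
The page range setting in the Configuration tab takes values such as 1-5 or 3,7,12. As a rough illustration of how such a value can be read, the sketch below expands it into explicit page numbers, for instance to validate user input before starting an extraction. The function name `parse_page_range` is hypothetical, and the rules it applies (1-based pages, inclusive ranges, commas as separators) are assumptions rather than documented Pulse behavior.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5" or "3,7,12" into sorted, unique page numbers.

    Assumes 1-based page numbers and inclusive ranges (not confirmed by Pulse).
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or spaces
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Malformed input such as a reversed range raises `ValueError`, so a caller can reject it before any upload happens.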

What You Get Back

Field	Description
`markdown`	Full document text with layout-aware markdown formatting
`html`	HTML output (if `return_html` was enabled)
`chunks`	Object with `semantic`, `header`, `page`, and/or `recursive` arrays
`bounding_boxes`	Coordinates for every text block, table, and figure
`extraction_id`	Saved extraction ID — use this for subsequent `/split` or `/schema` calls
`extraction_url`	Presigned URL to the stored extraction result
`page_count`	Number of pages processed

The `extraction_id` is the key to the rest of the Pulse pipeline. Once you have it, you can run Schema or Split on the same extraction without re-processing the document.
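
As a rough illustration of how these fields fit together, the sketch below summarizes a response that has already been converted to a plain dict. The helper name `summarize_extraction` and the sample payload are hypothetical. Only the field names come from the table above, and nothing is assumed about the shape of individual chunk entries.

```python
from typing import Any


def summarize_extraction(result: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields most callers need after an extraction."""
    chunks = result.get("chunks") or {}
    return {
        "extraction_id": result.get("extraction_id"),
        "page_count": result.get("page_count"),
        "markdown_chars": len(result.get("markdown") or ""),
        "has_html": bool(result.get("html")),
        # Count entries per chunking strategy without inspecting their contents.
        "chunk_counts": {name: len(items) for name, items in chunks.items()},
    }


# Made-up payload for illustration only; real responses will differ.
sample = {
    "markdown": "# Invoice\n\nTotal: 42",
    "chunks": {"page": ["# Invoice", "Total: 42"]},
    "extraction_id": "example-id",
    "page_count": 1,
}
print(summarize_extraction(sample))
```

Storing the `extraction_id` from a summary like this is what lets later Schema or Split calls reuse the same extraction.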

API Usage

Python

from pulse_python_sdk import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Synchronous extraction; the context manager closes the file afterwards
with open("invoice.pdf", "rb") as f:
    result = client.extract(
        file=f,
        extract_figure=True,
        storage={"enabled": True}
    )

print(result.markdown)
print(f"Extraction ID: {result.extraction_id}")

TypeScript

import { PulseClient } from "pulse-ts-sdk";
import fs from "fs";

const client = new PulseClient({
    headers: { "x-api-key": "YOUR_API_KEY" }
});

const result = await client.extract({
    file: fs.createReadStream("invoice.pdf"),
    extractFigure: true,
    storage: { enabled: true }
});

console.log(result.markdown);
console.log("Extraction ID:", result.extractionId);

curl

curl -X POST https://api.runpulse.com/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "extract_figure=true" \
  -F 'storage={"enabled": true}'

For large documents, use async mode and poll for results:

# Async extraction; the context manager closes the file afterwards
with open("large_report.pdf", "rb") as f:
    result = client.extract(
        file=f,
        async_=True,
        storage={"enabled": True}
    )

job_id = result.job_id
# Poll GET /job/{job_id} until status is "completed"

See Async Processing for the full polling flow.
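
The polling step can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the official flow: `wait_for_job` and `fetch_status` are hypothetical names, the job response is assumed to be a dict with a `status` key, and the `"failed"` status is a guess. Only the `"completed"` status and the `GET /job/{job_id}` route come from this page.

```python
import time
from typing import Any, Callable


def wait_for_job(
    fetch_status: Callable[[], dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> dict[str, Any]:
    """Call fetch_status() until it reports "completed" or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") == "completed":
            return job
        # "failed" is an assumed status name; check the API reference.
        if job.get("status") == "failed":
            raise RuntimeError(f"job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not complete in time")
        time.sleep(interval)
```

In practice, `fetch_status` would wrap a `GET /job/{job_id}` request that sends your API key and returns the parsed JSON. Passing it in as a callable keeps the loop independent of any particular HTTP client.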

After Extraction

Once you have your `extraction_id`, you can:

Add Schema

Extract structured data fields with a JSON Schema

Split & Schema

Divide into sections and extract per-section structured data

Export to Excel

Convert detected tables to .xlsx with Meridian

Related

Extract API Reference

Full API documentation for the /extract endpoint

Supported File Types

Detailed breakdown of every supported format
