Overview

The Extract pipeline is the simplest and most common way to use Pulse. Upload a document and get back clean, layout-aware markdown along with extracted tables, figures, bounding boxes, and optional chunks. This is the starting point for every other pipeline — Extract → Schema and Extract → Split → Schema both build on top of this step.

When to Use

  • RAG ingestion — feed clean markdown into a vector database
  • Search indexing — convert documents to searchable text
  • Content migration — pull content out of PDFs into your CMS
  • Table extraction — grab structured tables from financial reports, invoices, or spreadsheets
  • General-purpose parsing — convert any supported file type to machine-readable text

Supported File Types

Pulse handles a wide range of document formats out of the box:

| Category | Extensions |
|----------|------------|
| PDF | .pdf — text-based, scanned/image-based, mixed, multi-page |
| Images | .jpg, .jpeg, .png — scans, photos, screenshots |
| Office | .docx, .pptx, .xlsx — Word, PowerPoint, Excel |
| Web | .html, .htm — saved web pages, HTML emails |

For the full breakdown including processing tips per format, see Supported File Types.
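
Before uploading a batch, it can help to filter out files Pulse is not listed as supporting. The sketch below simply encodes the extensions listed above; the `SUPPORTED_EXTENSIONS` mapping and the `file_category` helper are illustrative names, not part of the Pulse SDK.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported file types listed above, by category.
SUPPORTED_EXTENSIONS = {
    "PDF": {".pdf"},
    "Images": {".jpg", ".jpeg", ".png"},
    "Office": {".docx", ".pptx", ".xlsx"},
    "Web": {".html", ".htm"},
}


def file_category(path: str) -> Optional[str]:
    """Return the category for a path's extension, or None if it is not listed."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A caller could skip or log any file for which `file_category` returns `None` instead of sending it to the API.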

How to Use in the Playground

1. Upload your document
   Drag and drop a file or paste a URL into the upload area. You can also upload multiple documents at once for batch processing.

2. Configure extraction settings
   Adjust settings on the Configuration tab before extracting:

   | Setting | What it does |
   |---------|--------------|
   | Page range | Process only specific pages (e.g. 1-5, 3,7,12) |
   | Extract figures | Pull out embedded images and diagrams |
   | Figure descriptions | Generate AI descriptions of extracted figures |
   | Show images | Include inline images in the markdown output |
   | Return HTML | Get HTML output in addition to markdown |
   | Effort mode | Use more compute for higher accuracy on complex layouts |
   | Chunking | Split output into semantic, header, page, or recursive chunks |
   | Chunk size | Target token count per chunk |

3. Click “Extract All”
   The extraction runs (synchronously or asynchronously depending on document size). Progress is shown in the pipeline tabs.

4. Review results
   Results appear across several tabs:

  • Markdown — Full document text with layout-aware formatting
  • Tables — Detected tables rendered in a grid view
  • Bounding Boxes — Visual overlay showing where each element was detected on the page
  • Chunks — Chunked output (if chunking was enabled)
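
To make the Page range setting from the configuration step concrete, here is a minimal sketch of how a string such as `1-5, 3,7,12` might expand into page numbers. It assumes 1-based pages, comma-separated entries, and inclusive hyphenated ranges; Pulse's actual parsing rules are not documented here, and `parse_page_range` is a hypothetical helper, not an SDK function.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a page-range string into a sorted list of unique page numbers."""
    pages = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas and spaces
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {token!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(token)
            if page < 1:
                raise ValueError(f"invalid page number: {token!r}")
            pages.add(page)
    return sorted(pages)
```

Validating the string locally like this catches typos before a document is submitted.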

What You Get Back

| Field | Description |
|-------|-------------|
| markdown | Full document text with layout-aware markdown formatting |
| html | HTML output (if return_html was enabled) |
| chunks | Object with semantic, header, page, and/or recursive arrays |
| bounding_boxes | Coordinates for every text block, table, and figure |
| extraction_id | Saved extraction ID — use this for subsequent /split or /schema calls |
| extraction_url | Presigned URL to the stored extraction result |
| page_count | Number of pages processed |

The extraction_id is the key to the rest of the Pulse pipeline. Once you have it, you can run Schema or Split on the same extraction without re-processing the document.
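
For RAG ingestion, the `chunks` object usually needs to be flattened into one list before embedding. The sketch below assumes the response has been loaded as a plain dict and that `chunks` maps each strategy name to an array; the shape of individual chunk entries is not specified above, so they are passed through untouched. `collect_chunks` and the sample payload are illustrative, not real API output.

```python
from typing import Any, Dict, List, Tuple

# Strategy names taken from the description of the chunks field.
CHUNK_STRATEGIES = ("semantic", "header", "page", "recursive")


def collect_chunks(response: Dict[str, Any]) -> List[Tuple[str, int, Any]]:
    """Flatten the chunks object into (strategy, index, chunk) tuples."""
    chunks = response.get("chunks") or {}  # chunks is absent when chunking is off
    rows = []
    for strategy in CHUNK_STRATEGIES:
        for index, chunk in enumerate(chunks.get(strategy) or []):
            rows.append((strategy, index, chunk))
    return rows


# Hand-written sample shaped like the fields described above (assumed shape).
sample = {
    "markdown": "# Invoice\n\nTotal due: 120.00",
    "chunks": {"header": ["# Invoice", "Total due: 120.00"], "page": ["page one text"]},
    "page_count": 1,
}

for strategy, index, chunk in collect_chunks(sample):
    print(strategy, index, chunk)
```

Keeping the strategy name and index alongside each chunk makes it easier to trace a retrieved passage back to its position in the document.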

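API Usage

    from pulse_python_sdk import Pulse

    client = Pulse(api_key="YOUR_API_KEY")

    # Synchronous extraction: the call returns once processing finishes.
    # The context manager closes the file handle after the upload.
    with open("invoice.pdf", "rb") as f:
        result = client.extract(
            file=f,
            extract_figure=True,        # pull out embedded images and diagrams
            storage={"enabled": True},  # store the extraction result
        )

    print(result.markdown)
    print(f"Extraction ID: {result.extraction_id}")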
    
For large documents, use async mode and poll for results:

    # Async extraction: the call returns a job reference instead of the finished result
    with open("large_report.pdf", "rb") as f:
        result = client.extract(
            file=f,
            async_=True,
            storage={"enabled": True},
        )

    job_id = result.job_id
    # Poll GET /job/{job_id} until status is "completed"

See Async Processing for the full polling flow.
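
The polling step can be sketched as a small loop with a timeout. This is a rough outline under stated assumptions: `fetch_job` is a hypothetical callable that performs `GET /job/{job_id}` and returns the parsed response as a dict with a `status` key. Only the `"completed"` status comes from the text above; the interval and timeout defaults are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],  # hypothetical: wraps GET /job/{job_id}
    job_id: str,
    interval: float = 2.0,   # seconds between polls (arbitrary default)
    timeout: float = 300.0,  # give up after this many seconds (arbitrary default)
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until its status is "completed" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        if job.get("status") == "completed":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete within {timeout} seconds")
        sleep(interval)
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP client. Production code would likely also stop early on a failure status, whose name is not given here.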

After Extraction

Once you have your extraction_id, you can: