Settings
General document processing settings.
Password to decrypt password-protected documents.
If True, embed OCR metadata into the returned PDF. Defaults to False.
The mode to use for text extraction from PDFs. One of:
ocr - Uses optical character recognition only
hybrid - Combines OCR with embedded PDF text for best accuracy (default)
Force the URL to be downloaded as a specific file extension (e.g. .png).
Force the result to be returned in URL form.
OCR system to use. One of:
standard - Best multilingual OCR system
legacy - Only supports Germanic languages (for backwards compatibility)
The page range to process (1-indexed). By default, the entire document is processed.
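To make the 1-indexed convention concrete, here is a minimal sketch that converts an inclusive, 1-indexed page range into the zero-based indices many PDF libraries use. The helper name `pages_to_indices`, the `(start, end)` tuple form, and the clamping to the document length are all assumptions for illustration; the actual request format for page ranges is not shown in this reference.

```python
from typing import List, Optional, Tuple


def pages_to_indices(
    page_range: Optional[Tuple[int, int]], total_pages: int
) -> List[int]:
    """Hypothetical helper: map a 1-indexed inclusive page range to 0-based indices."""
    if page_range is None:
        # No range given: process the entire document.
        return list(range(total_pages))
    start, end = page_range
    if start < 1 or end < start:
        raise ValueError("pages are 1-indexed and start must not exceed end")
    # Clamp to the document length (an assumption, not documented behavior).
    end = min(end, total_pages)
    return list(range(start - 1, end))
```

For example, `pages_to_indices((2, 4), total_pages=10)` selects the second through fourth pages.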
If True, persist the results indefinitely. Defaults to False.
Whether to return images for the specified block types. Options:
figure, table. By default, no images are returned.
If True, return OCR data in the result. Defaults to False.
The timeout for the job in seconds.
Enhance
Configuration for enhancing extraction accuracy using AI models.
Agentic uses vision language models to enhance the accuracy of the output of different types of extraction. This will incur a cost and latency increase.
If True, summarize figures using a small vision language model. Defaults to True.
TableAgentic
Always set to “table”.
Custom prompt for table agentic processing.
FigureAgentic
Always set to “figure”.
If True, use the advanced chart agent. Defaults to False.
Custom prompt for figure agentic processing.
If True, return overlays for the figure. This allows you to verify the quality of the extraction.
TextAgentic
Always set to “text”.
Custom instructions for agentic text. Note: This only applies to form regions (key-value).
Formatting
Configuration for output formatting options.
If True, add page markers to the output. Defaults to False. Useful for extracting data with page-specific information.
A list of formatting to include in the output. Options:
change_tracking, highlight, comments, hyperlinks, signatures
If True, merge consecutive tables that have the same number of columns. Defaults to False.
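The merge rule can be pictured with a small sketch. It assumes each table is a list of rows, that the input list holds tables in document order with nothing between them, and that the column count is the length of the first row. The function name is hypothetical, and this is not the service's actual implementation (it ignores repeated header rows, for instance).

```python
from typing import List

Table = List[List[str]]  # a table as rows of cell strings


def merge_consecutive_tables(tables: List[Table]) -> List[Table]:
    """Append a table to the previous one when both have the same column count."""
    merged: List[Table] = []
    for table in tables:
        if not table:
            continue  # skip empty tables
        if merged and len(merged[-1][0]) == len(table[0]):
            # Same number of columns as the previous table: extend it.
            merged[-1] = merged[-1] + table
        else:
            merged.append([row[:] for row in table])
    return merged
```

Two adjacent two-column tables would come back as one, while a following three-column table stays separate.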
The mode to use for table output. Options:
html - HTML table format
json - JSON array format
md - Markdown table format
jsonbbox - JSON with bounding boxes
dynamic - Returns md for simpler tables and html for complex tables (default)
csv - CSV format
Retrieval
Configuration for retrieval and chunking behavior.
Chunking configuration for the document.
If True, use embedding optimized mode. Defaults to False.
A list of block types to filter out from ‘content’ and ‘embed’ fields. By default, no blocks are filtered. Options:
Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature
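The effect of filtering block types can be sketched as follows. The dictionary keys `type` and `content` and the function name are hypothetical stand-ins; the real response schema is not shown in this reference.

```python
from typing import Dict, List, Set


def filter_blocks(
    blocks: List[Dict[str, str]], excluded: Set[str]
) -> List[Dict[str, str]]:
    """Keep only blocks whose (hypothetical) "type" key is not excluded."""
    return [block for block in blocks if block["type"] not in excluded]


blocks = [
    {"type": "Header", "content": "ACME Corp"},
    {"type": "Text", "content": "Quarterly results improved."},
    {"type": "Page Number", "content": "3"},
]
# Drop page furniture so that only body text remains.
kept = filter_blocks(blocks, {"Header", "Footer", "Page Number"})
```

Filtering out headers, footers, and page numbers in this way is a common choice when the remaining text is destined for embedding.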
Chunking
Configuration for document chunking behavior.
Choose how to partition chunks. Options:
variable - Chunks by character length and visual context
section - Chunks by section headers
page - Chunks according to pages
page_sections - Chunks first by page, then by sections within each page
disabled - Returns one single chunk
block - Chunks by individual blocks
The approximate size of chunks (in characters) that the document will be split into. Defaults to null, in which case the chunk size is variable between 250-1500 characters.
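To give a feel for what an approximate chunk size means, the sketch below greedily packs text blocks into chunks until adding the next block would exceed the target length. This is only an illustration under simple assumptions (blocks joined by newlines, length as the sole criterion); the function name is hypothetical and the service's variable mode also uses visual context, which is not modeled here.

```python
from typing import List


def pack_blocks(blocks: List[str], chunk_size: int) -> List[str]:
    """Greedily group blocks so each chunk stays near chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for block in blocks:
        candidate = block if not current else current + "\n" + block
        if current and len(candidate) > chunk_size:
            # Adding this block would overshoot: close the current chunk.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single block longer than `chunk_size` is kept whole in this sketch, which is one reason the resulting size is approximate rather than a hard limit.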
ChunkingConfig
Alternate chunking configuration (similar to Chunking).
Choose how to partition chunks. Options:
variable - Chunks by character length and visual context
section - Chunks by section headers
page - Chunks according to pages
page_sections - Chunks first by page, then by sections within each page
disabled - Returns one single chunk
block - Chunks by individual blocks
The approximate size of chunks (in characters) that the document will be split into. Defaults to None, in which case the chunk size is variable between 250-1500 characters.
EnrichConfig
Configuration for content enrichment using AI models.
If enabled, a large language/vision model will be used to postprocess the extracted content. Note: enabling enrich requires tables to be output in Markdown format. Defaults to False.
The mode to use for enrichment. Options:
standard - Standard enrichment (default)
page - Page-level enrichment
table - Table-level enrichment
Add information to the prompt for enrichment.
Spreadsheet
Configuration for spreadsheet processing.
Controls how a spreadsheet that contains several tables is split into separate tables. Options:
accurate - Applies more powerful models for superior accuracy, at 5× the default per-cell rate
fast - Default clustering mode
disabled - Disables clustering; registers as one large table
Elements to exclude from the output. Options:
hidden_sheets, hidden_rows, hidden_cols, styling, spreadsheet_images
Elements to include in the output. Options:
cell_colors, formula
Configuration for splitting large tables.
SplitLargeTables
If True, split large tables into smaller tables. Defaults to True.
The size of the tables to split into. Defaults to 50.
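As a rough picture of table splitting, the sketch below cuts a list of rows into pieces of at most `size` rows. It assumes the size counts rows, which this reference does not state explicitly, and the function name is hypothetical. Whether header rows are repeated in each piece is also not covered here.

```python
from typing import List, Sequence


def split_table(rows: Sequence[Sequence[str]], size: int = 50) -> List[List[Sequence[str]]]:
    """Split rows into consecutive pieces of at most `size` rows each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(rows[i:i + size]) for i in range(0, len(rows), size)]
```

With the default of 50, a 120-row table would yield three pieces of 50, 50, and 20 rows.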
ArrayExtractConfig
Configuration for array extraction of long lists.
Array extraction allows you to extract long lists of information from lengthy documents. It makes parallel calls on overlapping sections of the document.
The array extraction version to use. Options:
auto - Automatically selects the best mode
legacy - Legacy extraction mode
streaming - Streaming extraction mode
no_overlap - No overlap between segments
Length of each segment, in pages, for parallel calls with array extraction.
Number of items to extract in each stream call. Lower numbers will increase quality but be much slower. 50 works well for most documents with tables.
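The idea of parallel calls on overlapping sections can be sketched by computing the page segments such calls might cover. The one-page default overlap, the parameter names, and the function name are assumptions for illustration; the actual overlap used by the service is not stated in this reference. Setting `overlap=0` is analogous to segments with no overlap.

```python
from typing import List, Tuple


def page_segments(
    total_pages: int, pages_per_segment: int, overlap: int = 1
) -> List[Tuple[int, int]]:
    """Return 1-indexed inclusive (start, end) page segments with overlap."""
    if pages_per_segment < 1:
        raise ValueError("pages_per_segment must be at least 1")
    if not 0 <= overlap < pages_per_segment:
        raise ValueError("overlap must be in [0, pages_per_segment)")
    if total_pages < 1:
        return []
    step = pages_per_segment - overlap
    segments: List[Tuple[int, int]] = []
    start = 1
    while True:
        end = min(start + pages_per_segment - 1, total_pages)
        segments.append((start, end))
        if end >= total_pages:
            break
        start += step  # the next segment re-reads the last `overlap` pages
    return segments
```

Overlap lets an item that straddles a segment boundary appear whole in at least one segment, at the cost of deduplicating items seen twice.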
ParseOptions
Complete parsing options combining all configuration types.
Enhancement configuration.
Formatting configuration.
Retrieval configuration.
General settings configuration.
Spreadsheet-specific configuration.
WebhookConfigNew
Configuration for webhook delivery.
A list of Svix channels the message will be delivered to. Omit to send to all channels.
JSON metadata included in webhook request body.
The mode to use for webhook delivery. Options:
disabled - No webhook delivery (default)
svix - Use Svix for webhook delivery (recommended for production)
direct - Direct webhook delivery
The URL to send the webhook to (if using direct webhook).
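A client might sanity-check a webhook configuration before submitting a job. The sketch below is one way to do that, assuming hypothetical field names `mode` and `url` (the real field names are not shown in this reference) and assuming that direct delivery needs a URL, as the description above suggests.

```python
from typing import Any, Dict, List

# The three delivery modes listed above.
VALID_MODES = {"disabled", "svix", "direct"}


def webhook_problems(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a (hypothetical) webhook config dict."""
    problems: List[str] = []
    mode = config.get("mode", "disabled")  # disabled is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    if mode == "direct" and not config.get("url"):
        problems.append("direct mode needs a url")
    return problems
```

An empty list means the configuration passed these two checks; it does not guarantee the service will accept it.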