Settings

General document processing settings.
document_password
string
Password to decrypt password-protected documents.
embed_pdf_metadata
boolean
If True, embed OCR metadata into the returned PDF. Defaults to False.
extraction_mode
enum
The mode to use for text extraction from PDFs. One of:
  • ocr - Uses optical character recognition only
  • hybrid - Combines OCR with embedded PDF text for best accuracy (default)
force_file_extension
string
Treat the file downloaded from the URL as having the given file extension (e.g. .png).
force_url_result
boolean
If True, always return the result as a URL.
ocr_system
enum
OCR system to use. One of:
  • standard - Best multilingual OCR system
  • legacy - Only supports Germanic languages (for backwards compatibility)
page_range
Union[PageRange, List[PageRange], List[int]]
The page range to process (1-indexed). By default, the entire document is processed.
persist_results
boolean
If True, persist the results indefinitely. Defaults to False.
return_images
List[enum]
Block types for which to return images. Options: figure, table. By default, no images are returned.
return_ocr_data
boolean
If True, return OCR data in the result. Defaults to False.
timeout
float
The timeout for the job in seconds.
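
As a rough illustration of how these fields fit together, the sketch below assembles a settings dictionary in Python. The `build_settings` helper is hypothetical and not part of any client library; only the key names and allowed values come from the field list above, and `page_range` is given in its List[int] form.

```python
# Hypothetical helper: builds a plain dict from the field names documented above.
OCR_SYSTEMS = {"standard", "legacy"}
EXTRACTION_MODES = {"ocr", "hybrid"}


def build_settings(pages=None, ocr_system=None, extraction_mode=None, timeout=None):
    """Return a settings dict containing only the fields that were supplied."""
    settings = {}
    if pages is not None:
        if any(page < 1 for page in pages):
            raise ValueError("page_range is 1-indexed; pages must be >= 1")
        settings["page_range"] = list(pages)
    if ocr_system is not None:
        if ocr_system not in OCR_SYSTEMS:
            raise ValueError(f"unknown ocr_system: {ocr_system!r}")
        settings["ocr_system"] = ocr_system
    if extraction_mode is not None:
        if extraction_mode not in EXTRACTION_MODES:
            raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")
        settings["extraction_mode"] = extraction_mode
    if timeout is not None:
        settings["timeout"] = float(timeout)  # seconds
    return settings


settings = build_settings(pages=[1, 2, 5], ocr_system="standard", timeout=120)
print(settings)
```

Fields left unset are omitted from the dictionary, so the documented defaults apply to them.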

Enhance

Configuration for enhancing extraction accuracy using AI models.
agentic
List[Union[TableAgentic, FigureAgentic, TextAgentic]]
Agentic configurations that use vision language models to improve extraction accuracy for tables, figures, and text. Enabling them increases cost and latency.
summarize_figures
boolean
If True, summarize figures using a small vision language model. Defaults to True.

TableAgentic

scope
literal
required
Always set to “table”.
prompt
string
Custom prompt for table agentic processing.

FigureAgentic

scope
literal
required
Always set to “figure”.
advanced_chart_agent
boolean
If True, use the advanced chart agent. Defaults to False.
prompt
string
Custom prompt for figure agentic processing.
return_overlays
boolean
If True, return overlays for the figure. This allows you to verify the quality of the extraction.

TextAgentic

scope
literal
required
Always set to “text”.
prompt
string
Custom instructions for agentic text processing. Note: this applies only to form (key-value) regions.
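
The three agentic types differ only in their `scope` literal and in the extra fields that FigureAgentic accepts. A minimal sketch of building the `agentic` list follows; `agentic_entry` is a hypothetical helper, and the prompt text is an invented example.

```python
# Hypothetical helper; field names follow TableAgentic, FigureAgentic and TextAgentic above.
SCOPES = {"table", "figure", "text"}
FIGURE_ONLY_FIELDS = {"advanced_chart_agent", "return_overlays"}


def agentic_entry(scope, prompt=None, **figure_options):
    """Build one entry for the `agentic` list, rejecting fields the scope does not document."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    unknown = set(figure_options) - FIGURE_ONLY_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if figure_options and scope != "figure":
        raise ValueError(f"{sorted(figure_options)} only apply to scope 'figure'")
    entry = {"scope": scope}
    if prompt is not None:
        entry["prompt"] = prompt
    entry.update(figure_options)
    return entry


enhance = {
    "agentic": [
        agentic_entry("table", prompt="Keep merged header cells together."),
        agentic_entry("figure", advanced_chart_agent=True, return_overlays=True),
    ],
    "summarize_figures": False,
}
print(enhance)
```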

Formatting

Configuration for output formatting options.
add_page_markers
boolean
If True, add page markers to the output. Defaults to False. Useful for extracting data with page-specific information.
include
List[enum]
Formatting elements to include in the output. Options:
  • change_tracking
  • highlight
  • comments
  • hyperlinks
  • signatures
merge_tables
boolean
If True, merge consecutive tables that have the same number of columns. Defaults to False.
table_output_format
enum
The mode to use for table output. Options:
  • html - HTML table format
  • json - JSON array format
  • md - Markdown table format
  • jsonbbox - JSON with bounding boxes
  • dynamic - Returns md for simpler tables and html for complex tables (default)
  • csv - CSV format

Retrieval

Configuration for retrieval and chunking behavior.
chunking
Chunking
Chunking configuration for the document.
embedding_optimized
boolean
If True, use embedding optimized mode. Defaults to False.
filter_blocks
List[enum]
A list of block types to filter out from ‘content’ and ‘embed’ fields. By default, no blocks are filtered. Options:
  • Header
  • Footer
  • Title
  • Section Header
  • Page Number
  • List Item
  • Figure
  • Table
  • Key Value
  • Text
  • Comment
  • Signature

Chunking

Configuration for document chunking behavior.
chunk_mode
enum
Choose how to partition chunks. Options:
  • variable - Chunks by character length and visual context
  • section - Chunks by section headers
  • page - Chunks according to pages
  • page_sections - Chunks first by page, then by sections within each page
  • disabled - Returns one single chunk
  • block - Chunks by individual blocks
chunk_size
integer
The approximate size of chunks (in characters) that the document will be split into. Defaults to None, in which case the chunk size is variable between 250-1500 characters.
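
To show how Chunking nests inside Retrieval, here is a small sketch. `build_chunking` is a hypothetical helper; the option values are taken from the lists above, and leaving `chunk_size` unset keeps the documented variable sizing.

```python
# Hypothetical helper: validates chunk_mode and nests the result under "chunking".
CHUNK_MODES = {"variable", "section", "page", "page_sections", "disabled", "block"}


def build_chunking(chunk_mode, chunk_size=None):
    """Return a chunking dict; chunk_size is omitted when it is None."""
    if chunk_mode not in CHUNK_MODES:
        raise ValueError(f"unknown chunk_mode: {chunk_mode!r}")
    chunking = {"chunk_mode": chunk_mode}
    if chunk_size is not None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be a positive number of characters")
        chunking["chunk_size"] = chunk_size
    return chunking


retrieval = {
    "chunking": build_chunking("section"),
    # Drop page furniture from the 'content' and 'embed' fields.
    "filter_blocks": ["Header", "Footer", "Page Number"],
    "embedding_optimized": False,
}
print(retrieval)
```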

ChunkingConfig

Alternate chunking configuration (similar to Chunking).
chunk_mode
enum
Choose how to partition chunks. Options:
  • variable - Chunks by character length and visual context
  • section - Chunks by section headers
  • page - Chunks according to pages
  • page_sections - Chunks first by page, then by sections within each page
  • disabled - Returns one single chunk
  • block - Chunks by individual blocks
chunk_size
integer
The approximate size of chunks (in characters) that the document will be split into. Defaults to None, in which case the chunk size is variable between 250-1500 characters.

EnrichConfig

Configuration for content enrichment using AI models.
enabled
boolean
If True, a large language/vision model is used to postprocess the extracted content. Note: enabling enrich requires tables to be output in Markdown format. Defaults to False.
mode
enum
The mode to use for enrichment. Options:
  • standard - Standard enrichment (default)
  • page - Page-level enrichment
  • table - Table-level enrichment
prompt
string
Additional information to add to the enrichment prompt.
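
Because enabling enrichment constrains the table output format, a pre-flight check can catch a mismatch before a job is submitted. The sketch below is one possible check; `enrich_problems` is a hypothetical function, and it assumes the requirement means `table_output_format` must be exactly `md` (whether `dynamic` also satisfies it is not stated above).

```python
# Hypothetical pre-flight check. Assumption: "markdown format" means table_output_format == "md".
ENRICH_MODES = {"standard", "page", "table"}


def enrich_problems(enrich, formatting):
    """Return a list of human-readable problems; an empty list means the pair looks consistent."""
    problems = []
    if not enrich.get("enabled", False):  # enrichment is off by default
        return problems
    mode = enrich.get("mode", "standard")
    if mode not in ENRICH_MODES:
        problems.append(f"unknown enrich mode: {mode!r}")
    table_format = formatting.get("table_output_format", "dynamic")
    if table_format != "md":
        problems.append(
            f"enrich is enabled but table_output_format is {table_format!r}, not 'md'"
        )
    return problems


print(enrich_problems({"enabled": True, "mode": "page"}, {"table_output_format": "html"}))
print(enrich_problems({"enabled": True}, {"table_output_format": "md"}))
```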

Spreadsheet

Configuration for spreadsheet processing.
clustering
enum
Controls how a spreadsheet that contains several distinct tables is split. Options:
  • accurate - Applies more powerful models for superior accuracy, at 5× the default per-cell rate
  • fast - Default clustering mode
  • disabled - Disables clustering; the sheet is treated as one large table
exclude
List[enum]
Elements to exclude from the output. Options:
  • hidden_sheets
  • hidden_rows
  • hidden_cols
  • styling
  • spreadsheet_images
include
List[enum]
Elements to include in the output. Options:
  • cell_colors
  • formula
split_large_tables
SplitLargeTables
Configuration for splitting large tables.

SplitLargeTables

enabled
boolean
If True, split large tables into smaller tables. Defaults to True.
size
integer
The size of the tables to split into. Defaults to 50.
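
The spreadsheet options are mostly enumerations, so a typo such as `hidden_columns` for `hidden_cols` is easy to make. The following sketch shows one way to check a configuration against the documented option lists; `invalid_values` is a hypothetical helper, not part of any client library.

```python
# Documented options for the enum-valued Spreadsheet fields.
SPREADSHEET_OPTIONS = {
    "clustering": {"accurate", "fast", "disabled"},
    "exclude": {"hidden_sheets", "hidden_rows", "hidden_cols", "styling", "spreadsheet_images"},
    "include": {"cell_colors", "formula"},
}


def invalid_values(spreadsheet):
    """Map each enum field to the values that are not among its documented options."""
    bad = {}
    for field, allowed in SPREADSHEET_OPTIONS.items():
        if field not in spreadsheet:
            continue
        value = spreadsheet[field]
        values = [value] if isinstance(value, str) else list(value)
        rejected = [v for v in values if v not in allowed]
        if rejected:
            bad[field] = rejected
    return bad


spreadsheet = {
    "clustering": "accurate",
    "exclude": ["hidden_sheets", "hidden_rows"],
    "include": ["formula"],
    "split_large_tables": {"enabled": True, "size": 50},
}
print(invalid_values(spreadsheet))
print(invalid_values({"exclude": ["hidden_columns"]}))
```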

ArrayExtractConfig

Configuration for array extraction of long lists.
enabled
boolean
Array extraction allows you to extract long lists of information from lengthy documents. It makes parallel calls on overlapping sections of the document.
mode
enum
The array extraction version to use. Options:
  • auto - Automatically selects the best mode
  • legacy - Legacy extraction mode
  • streaming - Streaming extraction mode
  • no_overlap - No overlap between segments
pages_per_segment
integer
Length of each segment, in pages, for parallel calls with array extraction.
streaming_extract_item_density
integer
Number of items to extract in each stream call. Lower values improve quality but are much slower. 50 works well for most documents with tables.
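
To make `pages_per_segment` concrete, the sketch below computes the page ranges that segmentation could produce. It is an illustration only: the service's actual segmentation and the amount of overlap it uses are not described above, so `segment_pages` and its `overlap` parameter are hypothetical, with `overlap=0` standing in for the `no_overlap` mode.

```python
def segment_pages(total_pages, pages_per_segment, overlap=0):
    """Return inclusive, 1-indexed (start, end) page ranges.

    `overlap` is a hypothetical knob: the number of pages each segment
    shares with the previous one.
    """
    if pages_per_segment < 1:
        raise ValueError("pages_per_segment must be at least 1")
    if not 0 <= overlap < pages_per_segment:
        raise ValueError("overlap must be >= 0 and smaller than pages_per_segment")
    step = pages_per_segment - overlap
    segments = []
    start = 1
    while start <= total_pages:
        end = min(start + pages_per_segment - 1, total_pages)
        segments.append((start, end))
        if end == total_pages:
            break  # the final page is covered; stop before emitting a redundant tail
        start += step
    return segments


array_extract = {"enabled": True, "mode": "no_overlap", "pages_per_segment": 10}
print(segment_pages(25, array_extract["pages_per_segment"]))
print(segment_pages(25, 10, overlap=2))
```

Smaller segments mean more parallel calls, each over fewer pages.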

ParseOptions

Complete parsing options combining all configuration types.
enhance
Enhance
Enhancement configuration.
formatting
Formatting
Formatting configuration.
retrieval
Retrieval
Retrieval configuration.
settings
Settings
General settings configuration.
spreadsheet
Spreadsheet
Spreadsheet-specific configuration.
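
Putting the pieces together, a complete ParseOptions value is a nested structure with one key per configuration type. The sketch below builds one as a plain dictionary and serializes it with the standard `json` module; how the payload is actually submitted is not covered here, and the specific values are illustrative.

```python
import json

# The five top-level keys documented for ParseOptions.
PARSE_OPTION_KEYS = {"enhance", "formatting", "retrieval", "settings", "spreadsheet"}

parse_options = {
    "settings": {"ocr_system": "standard", "page_range": [1, 2, 3]},
    "formatting": {"table_output_format": "md", "merge_tables": True},
    "retrieval": {"chunking": {"chunk_mode": "section"}},
    "enhance": {"summarize_figures": False},
    "spreadsheet": {"clustering": "fast"},
}

# Guard against misspelled top-level keys before serializing.
unknown = set(parse_options) - PARSE_OPTION_KEYS
if unknown:
    raise ValueError(f"unknown ParseOptions keys: {sorted(unknown)}")

payload = json.dumps(parse_options, indent=2)
print(payload)
```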

WebhookConfigNew

Configuration for webhook delivery.
channels
List[string]
A list of Svix channels to which the message will be delivered. Omit to send to all channels.
metadata
object
JSON metadata to include in the webhook request body.
mode
enum
The mode to use for webhook delivery. Options:
  • disabled - No webhook delivery (default)
  • svix - Use Svix for webhook delivery (recommended for production)
  • direct - Direct webhook delivery
url
string
The URL to send the webhook to (if using direct webhook).
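
The three delivery modes use different fields: `url` matters for direct delivery and `channels` for Svix. A minimal sketch of a builder that enforces this pairing follows. `build_webhook` is hypothetical, the example URL is a placeholder, and treating `url` as mandatory in direct mode is an assumption drawn from the field description rather than a documented rule.

```python
# Hypothetical builder for a webhook configuration dict.
WEBHOOK_MODES = {"disabled", "svix", "direct"}


def build_webhook(mode="disabled", url=None, channels=None, metadata=None):
    """Return a webhook config containing only the fields relevant to the chosen mode."""
    if mode not in WEBHOOK_MODES:
        raise ValueError(f"unknown webhook mode: {mode!r}")
    config = {"mode": mode}
    if mode == "direct":
        if not url:
            # Assumption: direct delivery cannot work without a target URL.
            raise ValueError("direct mode needs a url")
        config["url"] = url
    if mode == "svix" and channels:
        config["channels"] = list(channels)  # omit to send to all channels
    if metadata is not None:
        config["metadata"] = metadata
    return config


print(build_webhook("direct", url="https://example.com/hooks/parse", metadata={"job": "demo"}))
print(build_webhook("svix", channels=["invoices"]))
```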