agent

PDF Workflow

01KFKD1PQ5AGZZQ1ENYSZBP6DY

Properties

actions_required
  • file:view
  • file:create
  • file:update
  • file:download
  • entity:view
  • entity:update
  • relationship:create
description
Processes PDF files: converts to JPEG pages, extracts text (OCR or native for born-digital), extracts structure
endpoint
https://pdf-workflow.arke.institute
endpoint_verified_at
2026-01-22T17:46:50.026Z
input_schema
properties
entity_id
description
PDF file entity ID to process
type
string
options
properties
custom_prompt
description
Custom prompt for structure extraction and description generation
type
string
dpi
description
Render DPI (default: 150)
type
number
extract_images
description
Extract embedded images from born-digital PDFs (default: true)
type
boolean
extraction_mode
description
How to extract content (default: auto-detect)
enum
  • auto
  • born_digital
  • scanned
type
string
label
description
Label for the assembled text file
type
string
quality
description
JPEG quality 1-100 (default: 85)
type
number
skip_image_descriptions
description
Skip derivative image descriptions
type
boolean
skip_text_workflow
description
Skip structure extraction and description generation
type
boolean
type
object
required
  • entity_id
type
object
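The input schema above can be illustrated with a small payload builder. This is only a sketch: `build_input` and its validation rules are hypothetical helpers written from the field descriptions in the schema, and how the payload is actually submitted to the endpoint is not specified here.

```python
# Hypothetical helper, not part of the agent: builds an input payload that
# follows the schema listed above (entity_id required, options optional).
NUMERIC = (int, float)
ALLOWED_OPTIONS = {
    "custom_prompt": str,
    "dpi": NUMERIC,
    "extract_images": bool,
    "extraction_mode": str,
    "label": str,
    "quality": NUMERIC,
    "skip_image_descriptions": bool,
    "skip_text_workflow": bool,
}
EXTRACTION_MODES = {"auto", "born_digital", "scanned"}


def build_input(entity_id, **options):
    """Return a dict shaped like the input schema, or raise ValueError."""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")
    for name, value in options.items():
        if name not in ALLOWED_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        expected = ALLOWED_OPTIONS[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if expected is NUMERIC and isinstance(value, bool):
            raise ValueError(f"{name} must be a number")
        if not isinstance(value, expected):
            raise ValueError(f"{name} has the wrong type")
    mode = options.get("extraction_mode")
    if mode is not None and mode not in EXTRACTION_MODES:
        raise ValueError("extraction_mode must be auto, born_digital, or scanned")
    quality = options.get("quality")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    payload = {"entity_id": entity_id}
    if options:
        payload["options"] = dict(options)
    return payload
```

For example, `build_input("some-entity-id", dpi=300, extraction_mode="scanned")` yields a payload with `entity_id` at the top level and both settings nested under `options`. Omitted options are left out so that the documented defaults (150 DPI, quality 85, auto-detect) apply.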
output_description
After the full pipeline completes, the original PDF file entity sits at the top of a rich entity tree.
Directly below the PDF are JPEG page entities (one per page), each linked via derived_from/has_derivative relationships and carrying properties like page_number, width, height, and text (either from born-digital extraction with text_source='born_digital', or from OCR with text_source='ocr'). Pages are linked to each other with prev/next relationships in page order. Each page entity also has three resized image derivatives (large at 2400px, medium at 1288px, thumbnail at 256px), each linked to its source page via derived_from/has_derivative. If OCR detected embedded images in a page, those are extracted as separate file entities linked to the page via extracted_from/has_extracted.
A combined assembled text file is created that merges all page text into a single annotated document with page boundary markers. This text file has assembled_from relationships pointing to every page, and each page gets a has_assembly backlink to the text file. The text file also has an 'in' relationship to the PDF (its parent).
Structure extraction then analyzes the assembled text and produces a hierarchical entity tree representing the document's logical organization: the root entity (e.g. a 'book' or 'report') contains structural divisions (parts, chapters, sections) determined by the LLM, with leaf sections split into chunk entities (~1024 tokens each). Only leaf entities carry a 'text' property; container entities do not. Every structural entity has a source_file property pointing to the assembled text file and start_line/end_line properties for locating content.
Finally, the description service generates a 'description' and 'description_title' property on each structural entity except chunks, and the image description service generates 'description' and 'label' properties on any derivative images extracted during OCR.
output_relationships
  • PDF -> pages: The PDF file has 'has_derivative' relationships pointing to each JPEG page entity. Each page has a 'derived_from' relationship back to the PDF.
  • Page ordering: Pages are linked sequentially with 'prev' and 'next' relationships. Page 1 has 'next' -> Page 2, Page 2 has 'prev' -> Page 1 and 'next' -> Page 3, etc.
  • Page -> image sizes: Each page has 'has_derivative' relationships to its large, medium, and thumbnail JPEG versions. Each version has 'derived_from' back to the page.
  • Page -> extracted images: If OCR found embedded images in a page, those image entities have 'extracted_from' -> page, and the page has 'has_extracted' -> image.
  • Pages -> assembled text: The assembled text file has 'assembled_from' relationships to every page entity (in page order). Each page has a 'has_assembly' backlink to the text file. The text file also has an 'in' relationship to the PDF.
  • Assembled text -> structure tree: The assembled text file has 'contains' relationships to the root structural entities. Each structural entity has 'extractedFrom' -> assembled text file and an 'in' relationship to its immediate parent (or to the file if it is a root).
  • Structure parent-child: Parent entities have 'contains' relationships to their direct children. Children have 'in' relationships to their parent. Additionally, children have 'partOf' pointing to the root structural entity.
  • Structure sibling ordering: Sibling entities at the same level are linked with 'prev' and 'next' relationships in document order.
  • To navigate from PDF to text content: Follow PDF -> has_derivative -> pages (sorted by page_number) -> has_assembly -> assembled text file -> contains -> root structural entity -> recursively follow 'contains' down to leaf chunks which carry the 'text' property.
  • To get all text quickly: Follow PDF -> has_derivative -> page entities, read each page's 'text' property (present on all pages after OCR or born-digital extraction), sorted by page_number.
  • To get structured text: From the assembled text file, follow 'contains' to find root entities, then recursively follow 'contains' to reach leaf entities (chunks and leaf sections) which hold the 'text' property with start_line/end_line references.
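The two text-retrieval paths in the list above can be sketched over a toy in-memory graph. Everything here is an assumption made for illustration: the `Graph` class, `demo_graph`, and both functions are hypothetical stand-ins, not the service's client API. Only the relationship names (`has_derivative`, `contains`) and the properties (`page_number`, `text`, `start_line`) come from the description.

```python
from collections import defaultdict


class Graph:
    """Toy stand-in for the entity store (hypothetical, for illustration only)."""

    def __init__(self):
        self.props = {}                # entity id -> properties dict
        self.rels = defaultdict(list)  # (source id, predicate) -> target ids

    def add_entity(self, entity_id, **props):
        self.props[entity_id] = props

    def relate(self, source, predicate, target):
        self.rels[(source, predicate)].append(target)

    def targets(self, source, predicate):
        return list(self.rels.get((source, predicate), []))


def quick_text(graph, pdf_id):
    """PDF -> has_derivative -> pages, sorted by page_number, reading 'text'."""
    pages = [p for p in graph.targets(pdf_id, "has_derivative")
             if "page_number" in graph.props[p]]
    pages.sort(key=lambda p: graph.props[p]["page_number"])
    return [graph.props[p]["text"] for p in pages]


def structured_text(graph, text_file_id):
    """Follow 'contains' recursively from the text file down to leaf entities."""
    leaves = []

    def walk(entity_id):
        children = graph.targets(entity_id, "contains")
        if not children:
            text = graph.props[entity_id].get("text")
            if text is not None:
                leaves.append(text)
            return
        # Assumption: start_line gives document order among siblings.
        children.sort(key=lambda c: graph.props[c].get("start_line", 0))
        for child in children:
            walk(child)

    for root in graph.targets(text_file_id, "contains"):
        walk(root)
    return leaves


def demo_graph():
    """Build a tiny example: two pages, one book, one section, two chunks."""
    g = Graph()
    g.add_entity("pdf")
    g.add_entity("page2", page_number=2, text="second page")
    g.add_entity("page1", page_number=1, text="first page")
    g.relate("pdf", "has_derivative", "page2")
    g.relate("pdf", "has_derivative", "page1")
    g.add_entity("txt")
    g.add_entity("book", start_line=1)
    g.add_entity("sec", start_line=1)
    g.add_entity("c2", start_line=21, text="beta")
    g.add_entity("c1", start_line=1, text="alpha")
    g.relate("txt", "contains", "book")
    g.relate("book", "contains", "sec")
    g.relate("sec", "contains", "c2")
    g.relate("sec", "contains", "c1")
    return g
```

With the demo graph, `quick_text(g, "pdf")` returns the page texts in page order, and `structured_text(g, "txt")` returns the chunk texts in line order. Sorting siblings by `start_line` is one option. Following the sibling 'prev'/'next' links would be another.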
output_tree_example
PDF file (original upload)
|-- [has_derivative] Page 1 (JPEG, page_number=1, has text property)
|   |-- [has_derivative] page_0001_large.jpg (2400px)
|   |-- [has_derivative] page_0001_medium.jpg (1288px)
|   |-- [has_derivative] page_0001_thumb.jpg (256px)
|   |-- [has_extracted] figure_1.jpg (embedded image, has description + label)
|   |-- [next] Page 2
|   +-- [has_assembly] assembled-text.txt
|-- [has_derivative] Page 2 (JPEG, page_number=2, has text property)
|   |-- [has_derivative] page_0002_large.jpg
|   |-- [has_derivative] page_0002_medium.jpg
|   |-- [has_derivative] page_0002_thumb.jpg
|   |-- [prev] Page 1
|   |-- [next] Page 3
|   +-- [has_assembly] assembled-text.txt
|-- [has_derivative] Page 3 ... (same pattern)
|
+-- [contains] assembled-text.txt (text/plain, merged text with page markers)
    |-- [assembled_from] Page 1, Page 2, Page 3 ...
    |-- [in] PDF file
    +-- [contains] book 'Document Title' (root structural entity, has description)
        |-- [extractedFrom] assembled-text.txt
        |-- [contains] chapter 'Chapter 1' (lines 1-80, has description)
        |   |-- [contains] section 'Introduction' (lines 1-40, has description)
        |   |   |-- [contains] chunk 1 (lines 1-20, has text property)
        |   |   +-- [contains] chunk 2 (lines 21-40, has text property)
        |   +-- [contains] section 'Background' (lines 41-80, has description)
        |       |-- [contains] chunk 3 (lines 41-60, has text property)
        |       +-- [contains] chunk 4 (lines 61-80, has text property)
        +-- [contains] chapter 'Chapter 2' (lines 81-150, has description)
            +-- ...
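The [prev]/[next] links shown in the tree can also be used to recover page order without relying on page_number. The following is a minimal sketch under stated assumptions: `pages_in_order` is a hypothetical helper, and the `next_of` mapping stands in for whatever lookup returns each page's 'next' target.

```python
def pages_in_order(page_ids, next_of):
    """Order pages by walking 'next' links from the page nothing points to.

    page_ids: iterable of page entity ids (any order).
    next_of:  dict mapping a page id to the id its 'next' relationship targets.
    """
    pointed_to = set(next_of.values())
    starts = [p for p in page_ids if p not in pointed_to]
    if len(starts) != 1:
        raise ValueError("expected exactly one first page (no incoming 'next')")
    ordered, seen = [], set()
    current = starts[0]
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in 'next' links")
        seen.add(current)
        ordered.append(current)
        current = next_of.get(current)  # None once the last page is reached
    return ordered
```

For instance, `pages_in_order(["p2", "p3", "p1"], {"p1": "p2", "p2": "p3"})` walks from `p1` through `p2` to `p3`. The same walk applies to sibling ordering among structural entities, which use the same 'prev'/'next' convention.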
status
active
uses_agents