agent

PDF Workflow

01KFKD1PQ5AGZZQ1ENYSZBP6DY

Properties

actions_required

file:view
file:create
file:update
file:download
entity:view
entity:update
relationship:create

description

Processes PDF files: renders each page to JPEG, extracts text (native extraction for born-digital PDFs, OCR for scanned ones), and extracts document structure

endpoint

https://pdf-workflow.arke.institute

endpoint_verified_at

2026-01-22T17:46:50.026Z

input_schema

properties

entity_id

description: PDF file entity ID to process
type: string

options

properties

custom_prompt

description: Custom prompt for structure extraction and description generation
type: string

dpi

description: Render DPI (default: 150)
type: number

extract_images

description: Extract embedded images from born-digital PDFs (default: true)
type: boolean

extraction_mode

description: How to extract content (default: auto-detect)
enum: auto, born_digital, scanned
type: string

label

description: Label for the assembled text file
type: string

quality

description: JPEG quality 1-100 (default: 85)
type: number

skip_image_descriptions

description: Skip derivative image descriptions
type: boolean

skip_text_workflow

description: Skip structure extraction and description generation
type: boolean

type

object

required

entity_id

type

object
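
A request body matching the schema above might be assembled as in the sketch below. The helper name build_input is hypothetical and not part of the agent; the source does not say how a payload is submitted to the endpoint, so only local validation is shown. The option names, types, enum values, and the 1-100 quality range are taken from the schema.

```python
from typing import Any

# Option names and types copied from input_schema; the helper itself is hypothetical.
OPTION_TYPES = {
    "custom_prompt": str,
    "dpi": (int, float),
    "extract_images": bool,
    "extraction_mode": str,
    "label": str,
    "quality": (int, float),
    "skip_image_descriptions": bool,
    "skip_text_workflow": bool,
}
EXTRACTION_MODES = {"auto", "born_digital", "scanned"}


def build_input(entity_id: str, **options: Any) -> dict:
    """Return a payload with the required entity_id and any validated options."""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")
    for name, value in options.items():
        if name not in OPTION_TYPES:
            raise ValueError(f"unknown option: {name!r}")
        expected = OPTION_TYPES[name]
        # bool is a subclass of int in Python, so reject booleans for numeric options.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            raise TypeError(f"option {name!r} has the wrong type")
    mode = options.get("extraction_mode")
    if mode is not None and mode not in EXTRACTION_MODES:
        raise ValueError(f"extraction_mode must be one of {sorted(EXTRACTION_MODES)}")
    quality = options.get("quality")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    payload = {"entity_id": entity_id}
    if options:
        payload["options"] = dict(options)
    return payload


# EXAMPLE_ENTITY_ID is a placeholder, not a real entity ID.
payload = build_input("EXAMPLE_ENTITY_ID", dpi=300, extraction_mode="scanned")
```

Omitted options are left out of the payload so that the documented defaults (dpi 150, quality 85, auto-detected extraction mode) apply.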

output_description

After the full pipeline completes, the original PDF file entity sits at the top of an entity tree.
Directly below the PDF are JPEG page entities (one per page), each linked via derived_from/has_derivative relationships and carrying properties such as page_number, width, height, and text (either from born-digital extraction with text_source='born_digital', or from OCR with text_source='ocr'). Pages are linked to each other with prev/next relationships in page order.
Each page entity also has three resized image derivatives (large at 2400px, medium at 1288px, thumbnail at 256px), each linked to its source page via derived_from/has_derivative. If OCR detected embedded images in a page, those are extracted as separate file entities linked to the page via extracted_from/has_extracted.
A combined assembled text file merges all page text into a single annotated document with page boundary markers. This text file has assembled_from relationships pointing to every page, and each page gets a has_assembly backlink to the text file. The text file also has an 'in' relationship to the PDF (its parent).
Structure extraction then analyzes the assembled text and produces a hierarchical entity tree representing the document's logical organization: the root entity (e.g. a 'book' or 'report') contains structural divisions (parts, chapters, sections) determined by the LLM, with leaf sections split into chunk entities (~1024 tokens each). Only leaf entities carry a 'text' property; container entities do not. Every structural entity has a source_file property pointing to the assembled text file and start_line/end_line properties for locating content.
Finally, the description service generates a 'description' and 'description_title' property on each structural entity except chunks, and the image description service generates 'description' and 'label' properties on any derivative images extracted during OCR.
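
The relationship between page text, the assembled text file, and the start_line/end_line properties can be illustrated with a small sketch. Both functions are hypothetical. The "--- page N ---" marker format is an assumption, since the source only says that page boundary markers exist. The 1-indexed, inclusive reading of start_line/end_line is also an assumption, inferred from the line ranges in the tree example.

```python
def assemble_pages(pages):
    """Merge page dicts into one text, ordered by page_number, with a marker per page."""
    lines = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        # Hypothetical marker format; the real format is not documented here.
        lines.append(f"--- page {page['page_number']} ---")
        lines.extend(page["text"].splitlines())
    return "\n".join(lines)


def read_lines(assembled, start_line, end_line):
    """Return lines start_line..end_line, assuming 1-indexed inclusive bounds."""
    return "\n".join(assembled.splitlines()[start_line - 1:end_line])


# Pages may arrive out of order; page_number decides the final order.
pages = [
    {"page_number": 2, "text": "c\nd"},
    {"page_number": 1, "text": "a\nb"},
]
assembled = assemble_pages(pages)
first_page_body = read_lines(assembled, 2, 3)
```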

output_relationships

PDF -> pages: The PDF file has 'has_derivative' relationships pointing to each JPEG page entity. Each page has a 'derived_from' relationship back to the PDF.
Page ordering: Pages are linked sequentially with 'prev' and 'next' relationships. Page 1 has 'next' -> Page 2, Page 2 has 'prev' -> Page 1 and 'next' -> Page 3, etc.
Page -> image sizes: Each page has 'has_derivative' relationships to its large, medium, and thumbnail JPEG versions. Each version has 'derived_from' back to the page.
Page -> extracted images: If OCR found embedded images in a page, those image entities have 'extracted_from' -> page, and the page has 'has_extracted' -> image.
Pages -> assembled text: The assembled text file has 'assembled_from' relationships to every page entity (in page order). Each page has a 'has_assembly' backlink to the text file. The text file also has an 'in' relationship to the PDF.
Assembled text -> structure tree: The assembled text file has 'contains' relationships to the root structural entities. Each structural entity has 'extractedFrom' -> assembled text file and an 'in' relationship to its immediate parent (or to the file if it is a root).
Structure parent-child: Parent entities have 'contains' relationships to their direct children. Children have 'in' relationships to their parent. Additionally, children have 'partOf' pointing to the root structural entity.
Structure sibling ordering: Sibling entities at the same level are linked with 'prev' and 'next' relationships in document order.
To navigate from PDF to text content: Follow PDF -> has_derivative -> pages (sorted by page_number) -> has_assembly -> assembled text file -> contains -> root structural entity -> recursively follow 'contains' down to leaf chunks which carry the 'text' property.
To get all text quickly: Follow PDF -> has_derivative -> page entities, read each page's 'text' property (present on all pages after OCR or born-digital extraction), sorted by page_number.
To get structured text: From the assembled text file, follow 'contains' to find root entities, then recursively follow 'contains' to reach leaf entities (chunks and leaf sections) which hold the 'text' property with start_line/end_line references.
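
The two text-retrieval paths above can be sketched over an in-memory copy of the entity graph. The data shape used here (a dict of entity ID to "properties" and "relationships", with "predicate" and "peer" keys) is an assumption made for illustration, not the platform's documented response format, and the function names are hypothetical. Children are sorted by start_line as a stand-in for following the sibling prev/next links.

```python
def peers(entities, entity_id, predicate):
    """IDs reachable from entity_id over relationships with the given predicate."""
    return [
        rel["peer"]
        for rel in entities[entity_id]["relationships"]
        if rel["predicate"] == predicate
    ]


def all_page_text(entities, pdf_id):
    """Quick path: the 'text' of each page under the PDF, sorted by page_number."""
    props = [entities[p]["properties"] for p in peers(entities, pdf_id, "has_derivative")]
    pages = [p for p in props if "page_number" in p]
    pages.sort(key=lambda p: p["page_number"])
    return [p["text"] for p in pages]


def leaf_texts(entities, entity_id):
    """Structured path: depth-first walk over 'contains', yielding leaf text."""
    props = entities[entity_id]["properties"]
    if "text" in props:
        yield props["text"]
    children = peers(entities, entity_id, "contains")
    # Sorting by start_line approximates document order (assumption).
    children.sort(key=lambda c: entities[c]["properties"].get("start_line", 0))
    for child in children:
        yield from leaf_texts(entities, child)


def rel(predicate, peer):
    return {"predicate": predicate, "peer": peer}


# Tiny hand-built graph with made-up IDs.
entities = {
    "pdf": {"properties": {}, "relationships": [rel("has_derivative", "p2"), rel("has_derivative", "p1")]},
    "p1": {"properties": {"page_number": 1, "text": "one"}, "relationships": []},
    "p2": {"properties": {"page_number": 2, "text": "two"}, "relationships": []},
    "txt": {"properties": {}, "relationships": [rel("contains", "book")]},
    "book": {"properties": {"start_line": 1}, "relationships": [rel("contains", "c2"), rel("contains", "c1")]},
    "c1": {"properties": {"start_line": 1, "text": "A"}, "relationships": []},
    "c2": {"properties": {"start_line": 21, "text": "B"}, "relationships": []},
}
```

Calling all_page_text(entities, "pdf") reads page text directly, while list(leaf_texts(entities, "txt")) starts from the assembled text file and descends to the chunks.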

output_tree_example

PDF file (original upload)
|-- [has_derivative] Page 1 (JPEG, page_number=1, has text property)
|   |-- [has_derivative] page_0001_large.jpg (2400px)
|   |-- [has_derivative] page_0001_medium.jpg (1288px)
|   |-- [has_derivative] page_0001_thumb.jpg (256px)
|   |-- [has_extracted] figure_1.jpg (embedded image, has description + label)
|   |-- [next] Page 2
|   +-- [has_assembly] assembled-text.txt
|-- [has_derivative] Page 2 (JPEG, page_number=2, has text property)
|   |-- [has_derivative] page_0002_large.jpg
|   |-- [has_derivative] page_0002_medium.jpg
|   |-- [has_derivative] page_0002_thumb.jpg
|   |-- [prev] Page 1
|   |-- [next] Page 3
|   +-- [has_assembly] assembled-text.txt
|-- [has_derivative] Page 3 ... (same pattern)
|
+-- [contains] assembled-text.txt (text/plain, merged text with page markers)
    |-- [assembled_from] Page 1, Page 2, Page 3 ...
    |-- [in] PDF file
    +-- [contains] book 'Document Title' (root structural entity, has description)
        |-- [extractedFrom] assembled-text.txt
        |-- [contains] chapter 'Chapter 1' (lines 1-80, has description)
        |   |-- [contains] section 'Introduction' (lines 1-40, has description)
        |   |   |-- [contains] chunk 1 (lines 1-20, has text property)
        |   |   +-- [contains] chunk 2 (lines 21-40, has text property)
        |   +-- [contains] section 'Background' (lines 41-80, has description)
        |       |-- [contains] chunk 3 (lines 41-60, has text property)
        |       +-- [contains] chunk 4 (lines 61-80, has text property)
        +-- [contains] chapter 'Chapter 2' (lines 81-150, has description)
            +-- ...

status

active

uses_agents

label: PDF Processor
pi: 01KFFH6ETXGRVD10WPNP3007D6

label: Image Workflow
pi: 01KFFGQJC874G1S3SCSWT60W5Y

Metadata

Version: 8
Created: 1/22/2026
Updated: 1/30/2026
Edited by: ARCHON