agent

File Processing Orchestrator

01KFM74TJDTCQ97KZF151YH9GH

Properties

actions_required
  • file:view
  • file:create
  • file:update
  • file:download
  • entity:view
  • entity:update
  • relationship:create
  • collection:view
description
Orchestrates file-processing workflows for collections and folders: discovers files, groups them by type, and dispatches each group to the appropriate workflow
endpoint
https://file-processing-orchestrator.arke.institute
endpoint_verified_at
2026-01-23T01:22:55.882Z
input_schema
properties
entity_id
description
Collection or folder entity ID to process
type
string
options
properties
custom_prompt
description
Custom prompt passed to downstream workflows for structure extraction and descriptions
type
string
skip_image_descriptions
description
Skip derivative image descriptions
type
boolean
skip_text_workflow
description
Skip structure extraction and descriptions for all files
type
boolean
workflow_concurrency
default
50
description
Maximum parallel workflow dispatches (default: 50)
type
integer
type
object
required
  • entity_id
type
object
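A request body matching the input schema above might be assembled as in the following minimal sketch. The helper name `build_request` and the placeholder entity ID are hypothetical. Rejecting unknown option names and empty IDs is an assumption made here for safety; the schema itself does not say how the service treats either.

```python
from typing import Any, Dict

# Option names and types copied from the input schema above.
ALLOWED_OPTIONS = {
    "custom_prompt": str,
    "skip_image_descriptions": bool,
    "skip_text_workflow": bool,
    "workflow_concurrency": int,
}


def build_request(entity_id: str, **options: Any) -> Dict[str, Any]:
    """Build an input payload; entity_id is the only required field."""
    # Assumption: an empty string is treated as missing.
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")
    for name, value in options.items():
        if name not in ALLOWED_OPTIONS:
            # Assumption: unknown options are rejected client-side to catch typos.
            raise ValueError(f"unknown option: {name}")
        expected = ALLOWED_OPTIONS[name]
        # bool is a subclass of int in Python, so exclude it for integer fields.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{name} must be an integer")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
    payload: Dict[str, Any] = {"entity_id": entity_id}
    if options:
        payload["options"] = dict(options)
    return payload


# Hypothetical ID; omitting workflow_concurrency leaves the documented default of 50.
request = build_request("EXAMPLE_COLLECTION_ID", skip_image_descriptions=True)
```

Omitted options are simply left out of the payload, so the service-side defaults apply.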
output_description
The orchestrator processes an entire collection or folder end-to-end by recursively discovering all files, grouping them by type, and dispatching each group to the appropriate processing workflow. It operates in four stages.
(1) Discovery: breadth-first traversal of the collection/folder hierarchy, finding all files and recording their content types. It respects collection boundaries: it will not traverse into folders that belong to a different collection, and it skips nested collections entirely.
(2) Grouping: files are classified by content type and grouped per folder. PDFs are dispatched individually to the PDF Workflow. JPEGs in the same folder are batched together for the Image Workflow. Non-JPEG images (PNG, WebP, TIFF, AVIF, GIF) are batched for the Generic Image Workflow. Text files are dispatched individually to the Text Workflow. Files with unrecognized content types are skipped. Files within each folder are sorted in natural order (e.g., page_001.jpg before page_010.jpg).
(3) Dispatch: groups are sent to their respective workflows in parallel (up to workflow_concurrency, default 50), with automatic retries (up to 3 attempts) and a 30-minute timeout per workflow.
(4) Finalize: statistics are aggregated and a job log is written.
The orchestrator itself does not create any entities or relationships; all entity creation is delegated to the sub-workflows it invokes. After completion, every supported file in the collection will have been fully processed by its corresponding workflow pipeline. See the individual workflow agent descriptions (PDF Workflow, Image Workflow, Generic Image Workflow, Text Workflow) for the specific entities, properties, and relationships each one produces.
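The dispatch stage described above (bounded parallelism, up to 3 attempts, 30-minute timeout) could look roughly like the sketch below. This is not the orchestrator's actual code: `dispatch_all` and `run_workflow` are hypothetical names, the timeout is assumed to apply per attempt, and no backoff between retries is modeled because the description does not mention one.

```python
import asyncio
from typing import Awaitable, Callable, List


async def dispatch_all(
    groups: List[dict],
    run_workflow: Callable[[dict], Awaitable[str]],  # hypothetical workflow invoker
    concurrency: int = 50,          # workflow_concurrency default from the schema
    max_attempts: int = 3,          # "up to 3 attempts"
    timeout_s: float = 30 * 60,     # 30 minutes; assumed to apply per attempt
) -> List[dict]:
    """Run every group through run_workflow with bounded parallelism."""
    semaphore = asyncio.Semaphore(concurrency)

    async def dispatch_one(group: dict) -> dict:
        # The semaphore caps how many workflows are in flight at once.
        async with semaphore:
            last_error = ""
            for attempt in range(1, max_attempts + 1):
                try:
                    result = await asyncio.wait_for(run_workflow(group), timeout_s)
                    return {"group": group, "ok": True,
                            "attempts": attempt, "result": result}
                except Exception as exc:  # includes timeouts raised by wait_for
                    last_error = repr(exc)
            return {"group": group, "ok": False,
                    "attempts": max_attempts, "error": last_error}

    # gather preserves input order, so results line up with groups.
    return await asyncio.gather(*(dispatch_one(g) for g in groups))
```

A failed group is reported rather than raised, so one bad workflow does not abort the remaining dispatches; whether the real orchestrator behaves this way is an assumption.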
output_relationships
  • The orchestrator does not create any entities or relationships itself — it discovers and dispatches
  • Discovery traverses 'contains' relationships downward from the root collection/folder to find all nested folders and files
  • Collection boundaries are enforced: folders belonging to a different collection are skipped, as are nested collections
  • Each sub-workflow creates its own entities and relationships (see individual workflow agent descriptions for details)
  • Dispatch routing: PDFs → PDF Workflow (one per file), JPEGs → Image Workflow (batched per folder), PNG/WebP/TIFF/AVIF/GIF → Generic Image Workflow (batched per folder), text/* → Text Workflow (one per file)
  • To find processing results after orchestration: for any file in the collection, follow 'has_derivative' to find its processed outputs, then continue navigating the derivative tree as described by each workflow's output_relationships
  • To check orchestrator progress: query GET /status/:job_id for a summary; add ?detail=full for per-folder and per-dispatch breakdowns, or ?errors=N for recent error details
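The status queries in the last item can be composed with a small helper such as the sketch below. The function name `status_url` is hypothetical. It assumes the /status route is served from the endpoint listed in this record and that `detail` and `errors` may be combined in one request; neither is confirmed by the text above.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Endpoint taken from this record; serving /status from it is an assumption.
BASE_URL = "https://file-processing-orchestrator.arke.institute"


def status_url(job_id: str, detail_full: bool = False,
               errors: Optional[int] = None) -> str:
    """Build the URL for GET /status/:job_id with optional query flags."""
    query = {}
    if detail_full:
        query["detail"] = "full"       # per-folder and per-dispatch breakdowns
    if errors is not None:
        query["errors"] = str(errors)  # recent error details
    url = f"{BASE_URL}/status/{quote(job_id, safe='')}"
    return f"{url}?{urlencode(query)}" if query else url


# "EXAMPLE_JOB_ID" is a placeholder, not a real job.
summary_url = status_url("EXAMPLE_JOB_ID")
detail_url = status_url("EXAMPLE_JOB_ID", detail_full=True, errors=5)
```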
output_tree_example
collection 'Research Archive'
├── folder 'Papers'
│   ├── file 'paper.pdf' (application/pdf) → dispatched to PDF Workflow
│   │   ├── [page JPEGs with OCR text, assembled text, structure extraction, descriptions]
│   │   └── (see PDF Workflow output for full detail)
│   └── file 'supplement.pdf' (application/pdf) → dispatched to PDF Workflow
│       └── ...
├── folder 'Photos'
│   ├── file 'photo1.jpg' (image/jpeg) ─┐
│   ├── file 'photo2.jpg' (image/jpeg) ─┤ batched → dispatched to Image Workflow
│   └── file 'photo3.jpg' (image/jpeg) ─┘
│       └── [resized derivatives, OCR text, assembled text, structure, descriptions]
├── folder 'Scans'
│   ├── file 'scan1.png' (image/png)   ─┐ batched → dispatched to Generic Image Workflow
│   └── file 'scan2.tiff' (image/tiff) ─┘
│       └── [converted to JPEG, then full Image Workflow pipeline]
├── folder 'Notes'
│   ├── file 'notes.txt' (text/plain) → dispatched to Text Workflow
│   │   └── [structure extraction, descriptions]
│   └── file 'readme.md' (text/markdown) → dispatched to Text Workflow
│       └── ...
└── folder 'Nested Collection' (belongs to different collection) → SKIPPED
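The discovery and grouping that produce a tree like the one above can be illustrated on an in-memory model. The sketch below is an assumption-laden illustration, not the service's implementation: the `Node` class, its `collection` field, and the function names are all hypothetical, and the real orchestrator follows 'contains' relationships through the API rather than Python objects.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:  # hypothetical stand-in for a collection, folder, or file entity
    name: str
    kind: str                           # "collection", "folder", or "file"
    collection: str = "root"            # id of the owning collection (assumed field)
    content_type: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


GENERIC_IMAGE_TYPES = {"image/png", "image/webp", "image/tiff",
                       "image/avif", "image/gif"}


def route(content_type: Optional[str]) -> Optional[str]:
    """Map a content type to a workflow label; None means the file is skipped."""
    if content_type == "application/pdf":
        return "pdf"
    if content_type == "image/jpeg":
        return "image"
    if content_type in GENERIC_IMAGE_TYPES:
        return "generic_image"
    if content_type and content_type.startswith("text/"):
        return "text"
    return None


def natural_key(name: str) -> list:
    """Sort key that compares digit runs numerically (page_2 before page_10)."""
    return [int(p) if p.isdigit() else p.lower() for p in re.split(r"(\d+)", name)]


def discover_and_group(root: Node) -> List[Tuple[str, str, List[str]]]:
    """Breadth-first walk returning (folder, workflow, file names) dispatch units."""
    units: List[Tuple[str, str, List[str]]] = []
    queue = deque([root])
    while queue:
        container = queue.popleft()
        batches: Dict[str, List[str]] = {}
        for child in sorted(container.children, key=lambda n: natural_key(n.name)):
            if child.kind == "collection":
                continue                      # nested collections are skipped
            if child.kind == "folder":
                if child.collection == root.collection:
                    queue.append(child)       # foreign folders are not traversed
                continue
            workflow = route(child.content_type)
            if workflow is None:
                continue                      # unrecognized content type
            if workflow in ("pdf", "text"):
                units.append((container.name, workflow, [child.name]))  # one per file
            else:
                batches.setdefault(workflow, []).append(child.name)     # per folder
        for workflow, names in batches.items():
            units.append((container.name, workflow, names))
    return units
```

Each returned tuple corresponds to one dispatch: PDFs and text files yield single-file units, while JPEGs and other images in the same folder share one unit.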
status
active
uses_agents