- actions_required
- file:view
- file:create
- file:update
- file:download
- entity:view
- entity:update
- relationship:create
- collection:view
- description
- Orchestrates file-processing workflows for collections and folders: discovers files, groups them by type, and dispatches each group to the appropriate workflow
- endpoint
- https://file-processing-orchestrator.arke.institute
- endpoint_verified_at
- 2026-01-23T01:22:55.882Z
- input_schema
- properties
- entity_id
- description
- Collection or folder entity ID to process
- type
- string
- options
- properties
- custom_prompt
- description
- Custom prompt passed to downstream workflows for structure extraction and descriptions
- type
- string
- skip_image_descriptions
- description
- Skip derivative image descriptions
- type
- boolean
- skip_text_workflow
- description
- Skip structure extraction and descriptions for all files
- type
- boolean
- workflow_concurrency
- default
- 50
- description
- Maximum parallel workflow dispatches (default: 50)
- type
- integer
- type
- object
- type
- object
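A request body matching the input schema above might be assembled as follows. This is a minimal sketch, not the service's client: the helper name `build_request` and the example entity ID are hypothetical. The schema marks no field as required and gives no defaults for the skip flags, so treating `entity_id` as mandatory, defaulting the flags to `False`, and rejecting a concurrency below 1 are all assumptions.

```python
from typing import Any, Optional


def build_request(
    entity_id: str,
    custom_prompt: Optional[str] = None,
    skip_image_descriptions: bool = False,  # assumption: schema gives no default
    skip_text_workflow: bool = False,       # assumption: schema gives no default
    workflow_concurrency: int = 50,         # default stated in the schema
) -> dict[str, Any]:
    """Build an input object shaped like the input_schema above."""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(workflow_concurrency, bool) or not isinstance(workflow_concurrency, int):
        raise TypeError("workflow_concurrency must be an integer")
    if workflow_concurrency < 1:  # assumption: the schema states no minimum
        raise ValueError("workflow_concurrency must be at least 1")

    options: dict[str, Any] = {
        "skip_image_descriptions": skip_image_descriptions,
        "skip_text_workflow": skip_text_workflow,
        "workflow_concurrency": workflow_concurrency,
    }
    # Only include the prompt when the caller supplies one.
    if custom_prompt is not None:
        options["custom_prompt"] = custom_prompt
    return {"entity_id": entity_id, "options": options}


request_body = build_request("example-collection-id", skip_image_descriptions=True)
```

The resulting dictionary can be serialized with `json.dumps` before being sent to the endpoint; how the request is actually submitted (path, method, authentication) is not stated in this description.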
- output_description
- The orchestrator processes an entire collection or folder end-to-end by recursively discovering all files, grouping them by type, and dispatching each group to the appropriate processing workflow. It operates in four stages. (1) Discovery: a breadth-first traversal of the collection/folder hierarchy finds all files and records their content types. It respects collection boundaries: it does not traverse into folders that belong to a different collection, and it skips nested collections entirely. (2) Grouping: files are classified by content type and grouped per folder, and files within each folder are sorted in natural order (e.g., page_001.jpg before page_010.jpg). PDFs are dispatched individually to the PDF Workflow. JPEGs in the same folder are batched together for the Image Workflow. Non-JPEG images (PNG, WebP, TIFF, AVIF, GIF) in the same folder are batched together for the Generic Image Workflow. Text files are dispatched individually to the Text Workflow. Files with unrecognized content types are skipped. (3) Dispatch: groups are sent to their respective workflows in parallel (up to workflow_concurrency, default 50), with automatic retries (up to 3 attempts) and a 30-minute timeout per workflow. (4) Finalize: statistics are aggregated and a job log is written. The orchestrator itself does not create any entities or relationships; all entity creation is delegated to the sub-workflows it invokes. After completion, every supported file in the collection will have been fully processed by its corresponding workflow pipeline. See the individual workflow agent descriptions (PDF Workflow, Image Workflow, Generic Image Workflow, Text Workflow) for the specific entities, properties, and relationships each one produces.
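The grouping stage described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the orchestrator's actual code: the `File` type and the function names are hypothetical, and the exact MIME strings used for WebP, AVIF, and GIF are assumed from the standard media type names.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional

# Assumption: standard MIME names for the non-JPEG image formats listed above.
GENERIC_IMAGE_TYPES = {"image/png", "image/webp", "image/tiff", "image/avif", "image/gif"}
BATCHED_WORKFLOWS = {"Image Workflow", "Generic Image Workflow"}


class File(NamedTuple):  # hypothetical record for a discovered file
    folder: str
    name: str
    content_type: str


def natural_key(name: str) -> list:
    """Compare digit runs as numbers so 'page_2' sorts before 'page_10'."""
    return [int(p) if p.isdigit() else p.lower() for p in re.split(r"(\d+)", name)]


def route(content_type: str) -> Optional[str]:
    """Map a content type to a workflow label, or None if unrecognized."""
    if content_type == "application/pdf":
        return "PDF Workflow"
    if content_type == "image/jpeg":
        return "Image Workflow"
    if content_type in GENERIC_IMAGE_TYPES:
        return "Generic Image Workflow"
    if content_type.startswith("text/"):
        return "Text Workflow"
    return None


def group_files(files: list[File]) -> list[tuple[str, str, list[str]]]:
    """Return dispatch groups as (workflow, folder, file names)."""
    batches: dict[tuple[str, str], list[str]] = defaultdict(list)
    groups: list[tuple[str, str, list[str]]] = []
    for f in sorted(files, key=lambda f: (f.folder, natural_key(f.name))):
        workflow = route(f.content_type)
        if workflow is None:
            continue  # unrecognized content types are skipped
        if workflow in BATCHED_WORKFLOWS:
            batches[(workflow, f.folder)].append(f.name)  # one batch per folder
        else:
            groups.append((workflow, f.folder, [f.name]))  # one dispatch per file
    groups.extend((wf, folder, names) for (wf, folder), names in batches.items())
    return groups
```

Sorting before batching means each image batch already lists its files in natural order, which matters when the files are sequential page scans.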
- output_relationships
- The orchestrator does not create any entities or relationships itself; it only discovers files and dispatches them to sub-workflows
- Discovery traverses 'contains' relationships downward from the root collection/folder to find all nested folders and files
- Collection boundaries are enforced: folders belonging to a different collection are skipped, as are nested collections
- Each sub-workflow creates its own entities and relationships (see individual workflow agent descriptions for details)
- Dispatch routing: PDFs → PDF Workflow (one per file), JPEGs → Image Workflow (batched per folder), PNG/WebP/TIFF/AVIF/GIF → Generic Image Workflow (batched per folder), text/* → Text Workflow (one per file)
- To find processing results after orchestration: for any file in the collection, follow 'has_derivative' to find its processed outputs, then continue navigating the derivative tree as described by each workflow's output_relationships
- To check orchestrator progress: query GET /status/:job_id for a summary; add ?detail=full for per-folder and per-dispatch breakdowns, or ?errors=N for recent error details
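The status queries mentioned in the list above could be built with a small helper like the one below. It is a sketch under assumptions: the function name `status_url` and the job ID are hypothetical, the status route is assumed to be served from the endpoint listed in this description, and combining `detail` and `errors` in a single query is assumed to be allowed.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: the status route lives under the endpoint listed in this description.
BASE_URL = "https://file-processing-orchestrator.arke.institute"


def status_url(job_id: str, detail_full: bool = False, errors: Optional[int] = None) -> str:
    """Build the URL for GET /status/:job_id with optional query parameters."""
    params: dict[str, object] = {}
    if detail_full:
        params["detail"] = "full"   # per-folder and per-dispatch breakdowns
    if errors is not None:
        params["errors"] = errors   # recent error details
    url = f"{BASE_URL}/status/{quote(job_id, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url


summary_url = status_url("job-123")
detailed_url = status_url("job-123", detail_full=True, errors=5)
```

Here `detailed_url` is `https://file-processing-orchestrator.arke.institute/status/job-123?detail=full&errors=5`; the response format is not specified in this description, so parsing is left out.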
- output_tree_example
- collection 'Research Archive'
├── folder 'Papers'
│   ├── file 'paper.pdf' (application/pdf) → dispatched to PDF Workflow
│   │   ├── [page JPEGs with OCR text, assembled text, structure extraction, descriptions]
│   │   └── (see PDF Workflow output for full detail)
│   └── file 'supplement.pdf' (application/pdf) → dispatched to PDF Workflow
│       └── ...
├── folder 'Photos'
│   ├── file 'photo1.jpg' (image/jpeg) ─┐
│   ├── file 'photo2.jpg' (image/jpeg) ─┤ batched → dispatched to Image Workflow
│   └── file 'photo3.jpg' (image/jpeg) ─┘
│       └── [resized derivatives, OCR text, assembled text, structure, descriptions]
├── folder 'Scans'
│   ├── file 'scan1.png' (image/png)   ─┐ batched → dispatched to Generic Image Workflow
│   └── file 'scan2.tiff' (image/tiff) ─┘
│       └── [converted to JPEG, then full Image Workflow pipeline]
├── folder 'Notes'
│   ├── file 'notes.txt' (text/plain) → dispatched to Text Workflow
│   │   └── [structure extraction, descriptions]
│   └── file 'readme.md' (text/markdown) → dispatched to Text Workflow
│       └── ...
└── folder 'Nested Collection' (belongs to different collection) → SKIPPED
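The traversal that produces a tree like the one above, including the skipped branch, can be sketched as a breadth-first walk over 'contains' relationships. Everything here is illustrative: the `Entity` record, its field names, and the in-memory `contains` mapping are hypothetical stand-ins for whatever the entity API actually returns.

```python
from collections import deque
from typing import NamedTuple, Optional


class Entity(NamedTuple):  # hypothetical shape of an entity record
    id: str
    kind: str                            # "collection", "folder", or "file"
    collection_id: Optional[str] = None  # collection this entity belongs to
    content_type: Optional[str] = None   # set for files


def discover(root_id: str, entities: dict[str, Entity], contains: dict[str, list[str]]) -> list[Entity]:
    """Breadth-first walk of 'contains' edges that stays inside one collection."""
    root = entities[root_id]
    # The collection whose boundary is enforced during the walk.
    home = root.id if root.kind == "collection" else root.collection_id
    files: list[Entity] = []
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child_id in contains.get(current, []):
            if child_id in seen:
                continue
            seen.add(child_id)
            child = entities[child_id]
            if child.kind == "file":
                files.append(child)  # record the file and its content type
            elif child.kind == "folder" and child.collection_id == home:
                queue.append(child_id)  # descend only within the same collection
            # Nested collections and folders owned by another collection are skipped.
    return files
```

Using a queue rather than recursion keeps the walk breadth-first and avoids deep call stacks on heavily nested folder hierarchies.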
- status
- active
- uses_agents
- label
- PDF Workflow
- label
- Image Workflow
- label
- Generic Image Workflow
- label
- Text Workflow