agent

File Processing Orchestrator

01KFM74TJDTCQ97KZF151YH9GH

Properties

actions_required

file:view
file:create
file:update
file:download
entity:view
entity:update
relationship:create
collection:view

description

Orchestrates file processing workflows for collections and folders: discovers files, groups them by type, and dispatches each group to the appropriate workflow

endpoint

https://file-processing-orchestrator.arke.institute

endpoint_verified_at

2026-01-23T01:22:55.882Z

input_schema

properties

entity_id

description: Collection or folder entity ID to process
type: string

options

properties

custom_prompt

description: Custom prompt passed to downstream workflows for structure extraction and descriptions
type: string

skip_image_descriptions

description: Skip derivative image descriptions
type: boolean

skip_text_workflow

description: Skip structure extraction and descriptions for all files
type: boolean

workflow_concurrency

default: 50
description: Maximum parallel workflow dispatches (default: 50)
type: integer

type

object

required

entity_id

type

object
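As an illustration of the schema above, the following sketch builds an input payload and checks it against the listed option names and types. The helper name `build_input` and the example entity ID are hypothetical. Rejecting unknown options and empty IDs is an assumption made here for safety; the schema itself does not state either rule.

```python
from typing import Any

# Option names and types copied from the input_schema above.
OPTION_TYPES = {
    "custom_prompt": str,
    "skip_image_descriptions": bool,
    "skip_text_workflow": bool,
    "workflow_concurrency": int,
}


def build_input(entity_id: str, **options: Any) -> dict[str, Any]:
    """Return a payload shaped like input_schema (hypothetical helper)."""
    # Assumption: an empty entity_id is never valid.
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")
    for name, value in options.items():
        expected = OPTION_TYPES.get(name)
        if expected is None:
            # Assumption: options outside the schema are treated as mistakes.
            raise ValueError(f"unknown option: {name}")
        # bool is a subclass of int in Python, so reject it for integer fields.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{name} must be an integer, not a boolean")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
    payload: dict[str, Any] = {"entity_id": entity_id}
    if options:
        payload["options"] = dict(options)
    return payload


# Hypothetical entity ID; omitting workflow_concurrency leaves the default (50) in effect.
example = build_input("01EXAMPLEENTITYID", skip_image_descriptions=True, workflow_concurrency=10)
```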

output_description

The orchestrator processes an entire collection or folder end-to-end by recursively discovering all files, grouping them by type, and dispatching each group to the appropriate processing workflow. It operates in four stages.
(1) Discovery: breadth-first traversal of the collection/folder hierarchy, finding all files and recording their content types. It respects collection boundaries: it will not traverse into folders that belong to a different collection, and it skips nested collections entirely.
(2) Grouping: files are classified by content type and grouped per folder. PDFs are dispatched individually to the PDF Workflow. JPEGs in the same folder are batched together for the Image Workflow. Non-JPEG images (PNG, WebP, TIFF, AVIF, GIF) are batched for the Generic Image Workflow. Text files are dispatched individually to the Text Workflow. Files with unrecognized content types are skipped. Files within each folder are sorted in natural order (e.g., page_001.jpg before page_010.jpg).
(3) Dispatch: groups are sent to their respective workflows in parallel (up to workflow_concurrency, default 50), with automatic retries (up to 3 attempts) and a 30-minute timeout per workflow.
(4) Finalize: statistics are aggregated and a job log is written.
The orchestrator itself does not create any entities or relationships; all entity creation is delegated to the sub-workflows it invokes. After completion, every supported file in the collection will have been fully processed by its corresponding workflow pipeline. See the individual workflow agent descriptions (PDF Workflow, Image Workflow, Generic Image Workflow, Text Workflow) for the specific entities, properties, and relationships each one produces.
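The grouping stage described above can be sketched roughly as follows. This is not the orchestrator's actual code: the function names, the `(folder, name, content_type)` tuple layout, and the exact MIME strings for the non-JPEG image formats are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed MIME strings for the non-JPEG image formats named in the description.
GENERIC_IMAGE_TYPES = {"image/png", "image/webp", "image/tiff", "image/avif", "image/gif"}
BATCHED_WORKFLOWS = {"Image Workflow", "Generic Image Workflow"}


def natural_key(name):
    """Sort key that compares digit runs numerically (page_2 before page_10)."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def route(content_type):
    """Map a content type to a workflow label, or None if the file is skipped."""
    if content_type == "application/pdf":
        return "PDF Workflow"
    if content_type == "image/jpeg":
        return "Image Workflow"
    if content_type in GENERIC_IMAGE_TYPES:
        return "Generic Image Workflow"
    if content_type.startswith("text/"):
        return "Text Workflow"
    return None


def group_files(files):
    """Turn (folder, name, content_type) tuples into (workflow, folder, names) groups."""
    batches = defaultdict(list)
    groups = []
    for folder, name, ctype in sorted(files, key=lambda f: (f[0], natural_key(f[1]))):
        workflow = route(ctype)
        if workflow is None:
            continue  # unrecognized content type: skipped
        if workflow in BATCHED_WORKFLOWS:
            batches[(workflow, folder)].append(name)  # one batch per folder
        else:
            groups.append((workflow, folder, [name]))  # one dispatch per file
    groups.extend((wf, folder, names) for (wf, folder), names in batches.items())
    return groups
```

Keying batches on both workflow and folder keeps JPEGs and other image formats from the same folder in separate dispatches, which matches the routing described above.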

output_relationships

The orchestrator does not create any entities or relationships itself — it discovers and dispatches
Discovery traverses 'contains' relationships downward from the root collection/folder to find all nested folders and files
Collection boundaries are enforced: folders belonging to a different collection are skipped, as are nested collections
Each sub-workflow creates its own entities and relationships (see individual workflow agent descriptions for details)
Dispatch routing: PDFs → PDF Workflow (one per file), JPEGs → Image Workflow (batched per folder), PNG/WebP/TIFF/AVIF/GIF → Generic Image Workflow (batched per folder), text/* → Text Workflow (one per file)
To find processing results after orchestration: for any file in the collection, follow 'has_derivative' to find its processed outputs, then continue navigating the derivative tree as described by each workflow's output_relationships
To check orchestrator progress: query GET /status/:job_id for summary, add ?detail=full for per-folder and per-dispatch breakdowns, add ?errors=N for recent error details
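A minimal sketch of building the status URLs mentioned in the last note. It assumes the `/status/:job_id` route is served from the endpoint listed in this record and that `detail` and `errors` may be combined in one query; neither is confirmed here. The helper name `status_url` and the job ID are hypothetical, and authentication is not shown.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from this record; serving /status from it is an assumption.
BASE_URL = "https://file-processing-orchestrator.arke.institute"


def status_url(job_id, detail_full=False, errors=None):
    """Build a status URL for a job (hypothetical helper, no request is sent)."""
    query = {}
    if detail_full:
        query["detail"] = "full"  # per-folder and per-dispatch breakdowns
    if errors is not None:
        query["errors"] = errors  # number of recent errors to include
    url = f"{BASE_URL}/status/{quote(job_id, safe='')}"
    return f"{url}?{urlencode(query)}" if query else url


# Hypothetical job ID, used only to show the resulting URL shapes.
summary = status_url("JOB_ID")
detailed = status_url("JOB_ID", detail_full=True)
recent_errors = status_url("JOB_ID", errors=5)
```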

output_tree_example

collection 'Research Archive'
├── folder 'Papers'
│   ├── file 'paper.pdf' (application/pdf) → dispatched to PDF Workflow
│   │   ├── [page JPEGs with OCR text, assembled text, structure extraction, descriptions]
│   │   └── (see PDF Workflow output for full detail)
│   └── file 'supplement.pdf' (application/pdf) → dispatched to PDF Workflow
│       └── ...
├── folder 'Photos'
│   ├── file 'photo1.jpg' (image/jpeg) ─┐
│   ├── file 'photo2.jpg' (image/jpeg) ─┤ batched → dispatched to Image Workflow
│   └── file 'photo3.jpg' (image/jpeg) ─┘
│       └── [resized derivatives, OCR text, assembled text, structure, descriptions]
├── folder 'Scans'
│   ├── file 'scan1.png' (image/png)   ─┐ batched → dispatched to Generic Image Workflow
│   └── file 'scan2.tiff' (image/tiff) ─┘
│       └── [converted to JPEG, then full Image Workflow pipeline]
├── folder 'Notes'
│   ├── file 'notes.txt' (text/plain) → dispatched to Text Workflow
│   │   └── [structure extraction, descriptions]
│   └── file 'readme.md' (text/markdown) → dispatched to Text Workflow
│       └── ...
└── folder 'Nested Collection' (belongs to different collection) → SKIPPED
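The discovery behaviour illustrated by the tree can be approximated with a breadth-first walk over an in-memory graph. The node layout used here (`type`, `collection`, `contains`, `content_type` keys) is a hypothetical stand-in for the real entity model, and `discover` is not the orchestrator's own function.

```python
from collections import deque


def discover(nodes, root_id):
    """Breadth-first walk over 'contains' links, returning (file_id, content_type) pairs.

    `nodes` maps an ID to a dict with hypothetical keys: 'type' ('collection',
    'folder' or 'file'), 'collection' (owning collection ID, for folders),
    'contains' (child IDs) and 'content_type' (for files).
    """
    root = nodes[root_id]
    # The collection whose boundary must not be crossed.
    home = root_id if root["type"] == "collection" else root.get("collection")
    found = []
    queue = deque([root_id])
    seen = {root_id}
    while queue:
        current = nodes[queue.popleft()]
        for child_id in current.get("contains", []):
            if child_id in seen:
                continue
            seen.add(child_id)
            child = nodes[child_id]
            if child["type"] == "file":
                found.append((child_id, child["content_type"]))
            elif child["type"] == "folder" and child.get("collection") == home:
                queue.append(child_id)
            # Nested collections and folders owned by another collection are skipped.
    return found
```

Tracking visited IDs in `seen` is a defensive choice in case a node is reachable by more than one path; whether that can occur in the real data is not stated in this record.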

status

active

uses_agents

label
PDF Workflow
pi
01KFKD1PQ5AGZZQ1ENYSZBP6DY
label
Image Workflow
pi
01KFFGQJC874G1S3SCSWT60W5Y
label
Generic Image Workflow
pi
01KFKD1ZCKBNNFB6J0VX5RXNGD
label
Text Workflow
pi
01KFFDRKYYW7M9ZYFZEK4HGBHQ

Metadata

Version: 6
Created: 1/23/2026
Updated: 1/30/2026
Edited by: ARCHON