agent

Text Assembler

01KFFC4A8W8939TXGEXCK439ZK

Properties

actions_required

file:view
file:create
entity:view
entity:create
entity:update

description

Combines OCR text from multiple page images into a single annotated text document

endpoint

https://text-assembler.arke.institute

endpoint_verified_at

2026-01-21T04:13:49.400Z

input_schema

properties

output

description

Output configuration

properties

collection_id

description: Collection to create the text entity in
type: string

label

description: Label for the combined text entity
type: string

parent_id

description: Optional parent entity (e.g., the PDF)
type: string

required

collection_id

type

object

pages

description

Page entities to combine, in order

items

properties

entity_id

description: Page entity ID
type: string

page_number

description: Page number for ordering and annotation
type: number

text_key

description: Optional R2 key for pre-extracted OCR text
type: string

required

entity_id
page_number

type

object

type

array

required

pages
output

type

object
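
A request body matching the input schema above might look like the following. This is a minimal sketch, not the service's own validation: every ID and key is a placeholder, and `validate_input` is a hypothetical helper that checks only the required fields and types listed in the schema.

```python
# Placeholder values for illustration only; real IDs come from your own entities.
example_input = {
    "pages": [
        {"entity_id": "PAGE_ENTITY_ID_1", "page_number": 1},
        {"entity_id": "PAGE_ENTITY_ID_2", "page_number": 2, "text_key": "R2_KEY_PAGE_2"},
    ],
    "output": {
        "collection_id": "COLLECTION_ID",  # required
        "label": "my-document",            # optional
        "parent_id": "PDF_ENTITY_ID",      # optional
    },
}


def validate_input(payload):
    """Hypothetical helper: return a list of schema problems (empty if none)."""
    errors = []
    pages = payload.get("pages")
    if not isinstance(pages, list):
        errors.append("pages: required array")
        pages = []
    for i, page in enumerate(pages):
        if not isinstance(page, dict):
            errors.append(f"pages[{i}]: must be an object")
            continue
        if not isinstance(page.get("entity_id"), str):
            errors.append(f"pages[{i}].entity_id: required string")
        number = page.get("page_number")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(number, bool) or not isinstance(number, (int, float)):
            errors.append(f"pages[{i}].page_number: required number")
        if "text_key" in page and not isinstance(page["text_key"], str):
            errors.append(f"pages[{i}].text_key: must be a string")
    output = payload.get("output")
    if not isinstance(output, dict):
        errors.append("output: required object")
        return errors
    if not isinstance(output.get("collection_id"), str):
        errors.append("output.collection_id: required string")
    for key in ("label", "parent_id"):
        if key in output and not isinstance(output[key], str):
            errors.append(f"output.{key}: must be a string")
    return errors
```

Calling `validate_input(example_input)` returns an empty list; dropping `page_number` from a page or `collection_id` from `output` yields one message per missing field.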

output_description

Creates a single plain-text file entity (content_type 'text/plain; charset=utf-8') that concatenates OCR text from all input page entities in page-number order. Each page's text is preceded by an HTML-comment page marker that encodes both the page number and a URI reference back to the source page entity. Pages that contained no OCR text are represented by a marker with a NO_TEXT suffix. After a page's text, any images extracted from that page appear as Markdown-style links using Arke URIs, e.g. [Figure: Revenue Chart](arke:IMAGE_ENTITY_ID). Tables are linked the same way, and if a table had markdown content, that content is included inline beneath the link. Pages are separated by blank lines. The resulting file does not include line numbers; downstream services such as Structure Extraction add those. The file entity is created in the collection specified by output.collection_id, with a filename of '{label}.txt', where label defaults to 'assembled-text-{timestamp}' if not provided.
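
The ordering, separation, and filename rules described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's implementation: `assemble` and `default_filename` are hypothetical names, the page-marker syntax is left to a caller-supplied `make_marker` function because it is not reproduced here, the newline placement within a page block is a guess, and the timestamp format is assumed.

```python
from datetime import datetime, timezone


def default_filename(label=None, now=None):
    """Build '{label}.txt'; label falls back to 'assembled-text-{timestamp}'."""
    if not label:
        now = now or datetime.now(timezone.utc)
        # Assumption: the real timestamp format is not documented.
        label = "assembled-text-" + now.strftime("%Y%m%dT%H%M%SZ")
    return f"{label}.txt"


def assemble(pages, make_marker):
    """Join page blocks in page-number order, separated by blank lines.

    pages: dicts with 'page_number', 'entity_id', optional 'text', and
        optional 'links' (pre-formatted strings such as
        '[Figure: Revenue Chart](arke:IMAGE_ENTITY_ID)').
    make_marker: hypothetical callback (page, has_text) -> marker string.
    """
    blocks = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        text = (page.get("text") or "").strip()
        lines = [make_marker(page, bool(text))]  # marker precedes the page text
        if text:
            lines.append(text)
        lines.extend(page.get("links", []))      # image/table links follow the text
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)                   # blank line between pages
```

Passing the marker in as a function keeps the sketch independent of the exact marker syntax; a caller would supply whatever format the service actually emits.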

output_relationships

The text file entity has an 'assembled_from' relationship pointing to each source page entity it was built from
If output.parent_id is provided, the text file entity has an 'in' relationship pointing to the parent (e.g., the original PDF entity)
Each source page entity receives a 'has_assembly' backlink relationship pointing to the text file entity, allowing traversal from any page to its assembled text
To find the assembled text from a page: follow the page's 'has_assembly' relationship. To find all source pages from the text: follow the text file's 'assembled_from' relationships
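
The relationships above form a small bidirectional graph. As a rough sketch, they can be modelled as (source, predicate, target) triples held in memory; `build_relationships` and `follow` are hypothetical helpers for illustration and do not correspond to any Arke API.

```python
def build_relationships(text_id, page_ids, parent_id=None):
    """Return the triples described above for one assembled text entity."""
    edges = []
    for page_id in page_ids:
        edges.append((text_id, "assembled_from", page_id))  # text -> page
        edges.append((page_id, "has_assembly", text_id))    # page -> text backlink
    if parent_id is not None:
        edges.append((text_id, "in", parent_id))            # only when parent_id is given
    return edges


def follow(edges, source, predicate):
    """Return every target reachable from source via the given predicate."""
    return [t for s, p, t in edges if s == source and p == predicate]


edges = build_relationships("TEXT_ID", ["PAGE_1", "PAGE_2"], parent_id="PDF_ID")
pages_of_text = follow(edges, "TEXT_ID", "assembled_from")
text_of_page = follow(edges, "PAGE_1", "has_assembly")
```

Here `pages_of_text` lists both page IDs in input order and `text_of_page` contains only the text entity ID, mirroring the two traversal directions described above.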

output_tree_example

Input: 3 page entities from a scanned PDF

Resulting text file content:

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
[Decorative Header](arke:01JIMG0001)

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families...
[Table: Character List](arke:01JTBL0001)
| Name | Role |
|------|------|
| Mr Bennet | Father |
| Mrs Bennet | Mother |

Entity tree:

Collection (output.collection_id)
└── assembled-text.txt (file entity, content_type: text/plain)
    ├── assembled_from → Page 1 (01JABC1111)
    ├── assembled_from → Page 2 (01JABC2222)
    ├── assembled_from → Page 3 (01JABC3333)
    └── in → Parent PDF (if parent_id was provided)
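
A downstream consumer may want to recover the image and table references from assembled text like the example above. The following is a minimal sketch using a regular expression for the Markdown-style `[label](arke:ID)` links; `extract_arke_links` is a hypothetical helper, and the pattern assumes labels contain no `]` and IDs contain no whitespace or `)`.

```python
import re

# Matches Markdown-style links whose target uses the arke: URI scheme.
ARKE_LINK = re.compile(r"\[(?P<label>[^\]]*)\]\(arke:(?P<entity_id>[^)\s]+)\)")


def extract_arke_links(text):
    """Return (label, entity_id) pairs in the order they appear in text."""
    return [(m.group("label"), m.group("entity_id")) for m in ARKE_LINK.finditer(text)]


sample = (
    "It is a truth universally acknowledged...\n"
    "[Decorative Header](arke:01JIMG0001)\n"
    "\n"
    "However little known the feelings or views of such a man may be...\n"
    "[Table: Character List](arke:01JTBL0001)\n"
    "| Name | Role |\n"
)
links = extract_arke_links(sample)
```

With the sample text, `links` holds one pair for the image link and one for the table link, each with its label and entity ID.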

status

active

Metadata

Version: 7
Created: 1/21/2026
Updated: 1/30/2026
Edited by: ARCHON