agent

Text Assembler

01KFFC4A8W8939TXGEXCK439ZK

Properties

actions_required
  • file:view
  • file:create
  • entity:view
  • entity:create
  • entity:update
description
Combines OCR text from multiple page images into a single annotated text document
endpoint
https://text-assembler.arke.institute
endpoint_verified_at
2026-01-21T04:13:49.400Z
input_schema
properties
output
description
Output configuration
properties
collection_id
description
Collection to create the text entity in
type
string
label
description
Label for the combined text entity
type
string
parent_id
description
Optional parent entity (e.g., the PDF)
type
string
required
  • collection_id
type
object
pages
description
Page entities to combine, in order
items
properties
entity_id
description
Page entity ID
type
string
page_number
description
Page number for ordering and annotation
type
number
text_key
description
Optional R2 key for pre-extracted OCR text
type
string
required
  • entity_id
  • page_number
type
object
type
array
required
  • pages
  • output
type
object
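The input schema above can be exercised with a small hand-rolled check. This is only a sketch under stated assumptions: `validate_input` and `example_payload` are hypothetical names, the collection ID is made up for illustration, and the service itself may validate differently.

```python
# Hypothetical helper: checks a payload against the required fields and
# types listed in input_schema. Not part of the Text Assembler service.
def validate_input(payload: dict) -> list[str]:
    errors = []
    for key in ("pages", "output"):
        if key not in payload:
            errors.append(f"missing required field: {key}")

    pages = payload.get("pages", [])
    if not isinstance(pages, list):
        errors.append("pages must be an array")
        pages = []
    for i, page in enumerate(pages):
        if not isinstance(page, dict):
            errors.append(f"pages[{i}] must be an object")
            continue
        if not isinstance(page.get("entity_id"), str):
            errors.append(f"pages[{i}].entity_id must be a string")
        number = page.get("page_number")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(number, bool) or not isinstance(number, (int, float)):
            errors.append(f"pages[{i}].page_number must be a number")
        if "text_key" in page and not isinstance(page["text_key"], str):
            errors.append(f"pages[{i}].text_key must be a string")

    if "output" in payload:
        output = payload["output"]
        if not isinstance(output, dict):
            errors.append("output must be an object")
        else:
            if not isinstance(output.get("collection_id"), str):
                errors.append("output.collection_id must be a string")
            for optional in ("label", "parent_id"):
                if optional in output and not isinstance(output[optional], str):
                    errors.append(f"output.{optional} must be a string")
    return errors


# Example payload; "01JCOLL0001" is an invented collection ID.
example_payload = {
    "pages": [
        {"entity_id": "01JABC1111", "page_number": 1},
        {"entity_id": "01JABC2222", "page_number": 2},
    ],
    "output": {"collection_id": "01JCOLL0001", "label": "assembled-text"},
}
```

Only `pages`, `output`, `entity_id`, `page_number`, and `output.collection_id` are required; `text_key`, `label`, and `parent_id` may be omitted.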
output_description
Creates a single plain-text file entity (content_type 'text/plain; charset=utf-8') that concatenates the OCR text of all input page entities in page-number order. Each page's text is preceded by an HTML-comment page marker of the form <!-- [Page N](arke:PAGE_ENTITY_ID) --> that encodes both the page number and a URI reference back to the source page entity. Pages with no OCR text are represented by a marker with a NO_TEXT suffix: <!-- [Page N](arke:PAGE_ENTITY_ID) NO_TEXT -->. After a page's text, any images extracted from that page appear as Markdown-style links using Arke URIs, e.g. [Figure: Revenue Chart](arke:IMAGE_ENTITY_ID). Tables are linked the same way, and if a table has Markdown content, that content is included inline beneath its link. Pages are separated by blank lines. The resulting file does not include line numbers; downstream services such as Structure Extraction add those. The file entity is created in the collection specified by output.collection_id, with the filename '{label}.txt', where label defaults to 'assembled-text-{timestamp}' if output.label is not provided.
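The assembly format described above could be produced roughly as follows. This is a minimal sketch, not the service's implementation: the `Page` and `Link` types and the `assemble` function are hypothetical, and the exact whitespace between a page's text and its links is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    label: str          # e.g. "Figure: Revenue Chart" or "Table: Character List"
    entity_id: str
    markdown: str = ""  # only meaningful for tables that carry Markdown content


@dataclass
class Page:
    entity_id: str
    page_number: int
    text: str = ""
    images: list[Link] = field(default_factory=list)
    tables: list[Link] = field(default_factory=list)


def assemble(pages: list[Page]) -> str:
    """Concatenate pages in page-number order with HTML-comment markers."""
    blocks = []
    for page in sorted(pages, key=lambda p: p.page_number):
        text = page.text.strip()
        suffix = "" if text else " NO_TEXT"  # marker suffix for empty pages
        lines = [f"<!-- [Page {page.page_number}](arke:{page.entity_id}){suffix} -->"]
        if text:
            lines.append(text)
        for image in page.images:
            lines.append(f"[{image.label}](arke:{image.entity_id})")
        for table in page.tables:
            lines.append(f"[{table.label}](arke:{table.entity_id})")
            if table.markdown:
                lines.append(table.markdown.strip())
        blocks.append("\n".join(lines))
    # Pages are separated by blank lines.
    return "\n\n".join(blocks) + "\n"
```

Sorting inside `assemble` means callers may pass pages in any order; the page_number field alone decides the output order.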
output_relationships
  • The text file entity has an 'assembled_from' relationship pointing to each source page entity it was built from
  • If output.parent_id is provided, the text file entity has an 'in' relationship pointing to the parent (e.g., the original PDF entity)
  • Each source page entity receives a 'has_assembly' backlink relationship pointing to the text file entity, allowing traversal from any page to its assembled text
  • To find the assembled text from a page: follow the page's 'has_assembly' relationship. To find all source pages from the text: follow the text file's 'assembled_from' relationships
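The relationships listed above can be modelled as simple (source, predicate, target) triples. The sketch below is illustrative only: `Relationship`, `build_relationships`, and `follow` are hypothetical names, not part of any Arke API, and the IDs in the usage note are invented.

```python
from typing import NamedTuple, Optional


class Relationship(NamedTuple):
    source: str
    predicate: str
    target: str


def build_relationships(
    text_id: str, page_ids: list[str], parent_id: Optional[str] = None
) -> list[Relationship]:
    """Return the edges the description says are created for one assembly."""
    rels = []
    for page_id in page_ids:
        # Forward edge from the text file to each source page
        rels.append(Relationship(text_id, "assembled_from", page_id))
        # Backlink from each page to the assembled text
        rels.append(Relationship(page_id, "has_assembly", text_id))
    if parent_id is not None:
        rels.append(Relationship(text_id, "in", parent_id))
    return rels


def follow(rels: list[Relationship], source: str, predicate: str) -> list[str]:
    """Traverse one hop: all targets reachable from source via predicate."""
    return [r.target for r in rels if r.source == source and r.predicate == predicate]
```

For example, with `rels = build_relationships("01JTXT0001", ["01JABC1111"], parent_id="01JPDF0001")`, calling `follow(rels, "01JABC1111", "has_assembly")` goes from a page to its assembled text, and `follow(rels, "01JTXT0001", "assembled_from")` goes back to the source pages.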
output_tree_example
Input: 3 page entities from a scanned PDF

Resulting text file content:

<!-- [Page 1](arke:01JABC1111) -->
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
[Decorative Header](arke:01JIMG0001)

<!-- [Page 2](arke:01JABC2222) -->
However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families...
[Table: Character List](arke:01JTBL0001)
| Name | Role |
|------|------|
| Mr Bennet | Father |
| Mrs Bennet | Mother |

<!-- [Page 3](arke:01JABC3333) NO_TEXT -->

Entity tree:

Collection (output.collection_id)
└── assembled-text.txt (file entity, content_type: text/plain)
    ├── assembled_from → Page 1 (01JABC1111)
    ├── assembled_from → Page 2 (01JABC2222)
    ├── assembled_from → Page 3 (01JABC3333)
    └── in → Parent PDF (if parent_id was provided)
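A downstream consumer could recover the page boundaries from an assembled file by matching the page markers. The sketch below assumes the marker format is exactly as shown in the example; `MARKER` and `split_pages` are hypothetical names, and real files may need a more tolerant pattern.

```python
import re

# Matches <!-- [Page N](arke:ID) --> with an optional NO_TEXT suffix.
MARKER = re.compile(r"<!-- \[Page (\d+)\]\(arke:([^)\s]+)\)( NO_TEXT)? -->")


def split_pages(assembled: str) -> list[dict]:
    """Split assembled text into per-page records keyed by the markers."""
    matches = list(MARKER.finditer(assembled))
    pages = []
    for i, match in enumerate(matches):
        # A page's body runs from the end of its marker to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(assembled)
        pages.append(
            {
                "page_number": int(match.group(1)),
                "entity_id": match.group(2),
                "no_text": match.group(3) is not None,
                "body": assembled[match.end():end].strip(),
            }
        )
    return pages
```

The body of each record still contains any image or table links for that page, since those follow the page's text in the same block.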
status
active