- actions_required
- file:view
- file:create
- entity:view
- entity:create
- entity:update
- description
- Combines OCR text from multiple page images into a single annotated text document
- endpoint
- https://text-assembler.arke.institute
- endpoint_verified_at
- 2026-01-21T04:13:49.400Z
- input_schema
- properties
- output
- description
- Output configuration
- properties
- collection_id
- description
- Collection to create the text entity in
- type
- string
- label
- description
- Label for the combined text entity
- type
- string
- parent_id
- description
- Optional parent entity (e.g., the PDF)
- type
- string
- type
- object
- pages
- description
- Page entities to combine, in order
- items
- properties
- entity_id
- description
- Page entity ID
- type
- string
- page_number
- description
- Page number for ordering and annotation
- type
- number
- text_key
- description
- Optional R2 key for pre-extracted OCR text
- type
- string
- type
- object
- type
- array
- type
- object
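As an illustration of the input_schema above, the following is a minimal sketch of building an input object with the `pages` and `output` fields. The helper name `build_assembler_input` and the example IDs are hypothetical. The schema does not say which fields are required or how the object is submitted to the endpoint, so the sketch only builds the dictionary and treats `collection_id` as always supplied.

```python
def build_assembler_input(pages, collection_id, label=None, parent_id=None):
    """Build a dict shaped like the input_schema (hypothetical helper).

    `pages` is an iterable of dicts with `entity_id`, `page_number`,
    and optionally `text_key`.
    """
    items = []
    # The schema asks for pages "in order", so sort by page_number here.
    for page in sorted(pages, key=lambda p: p["page_number"]):
        item = {
            "entity_id": page["entity_id"],
            "page_number": page["page_number"],
        }
        # text_key is optional; include it only when present.
        if page.get("text_key"):
            item["text_key"] = page["text_key"]
        items.append(item)

    output = {"collection_id": collection_id}
    # label and parent_id are left out entirely when not given.
    if label is not None:
        output["label"] = label
    if parent_id is not None:
        output["parent_id"] = parent_id

    return {"pages": items, "output": output}
```

Sorting on the caller's side is a design choice of this sketch; the service description states that it also orders by page number.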
- output_description
- Creates a single plain-text file entity (content_type 'text/plain; charset=utf-8') that concatenates the OCR text of all input page entities in page-number order. Each page's text is preceded by an HTML-comment page marker of the form <!-- [Page N](arke:PAGE_ENTITY_ID) -->, which encodes both the page number and a URI reference back to the source page entity. Pages with no OCR text are represented by a marker with a NO_TEXT suffix: <!-- [Page N](arke:PAGE_ENTITY_ID) NO_TEXT -->. After a page's text, any images extracted from that page appear as Markdown-style links using Arke URIs, e.g. [Figure: Revenue Chart](arke:IMAGE_ENTITY_ID). Tables are linked the same way, and if a table has Markdown content it is included inline beneath its link. Pages are separated by blank lines. The resulting file does not include line numbers; downstream services such as Structure Extraction add those. The file entity is created in the collection specified by output.collection_id, with the filename '{label}.txt', where label defaults to 'assembled-text-{timestamp}' if output.label is not provided.
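A consumer of the assembled file can recover per-page text from the page markers described above. The following is a minimal sketch, not the service's own parser: the function name `split_pages` is hypothetical, and the regular expression assumes the markers use exactly the spacing shown in the description.

```python
import re

# Matches <!-- [Page N](arke:ID) --> and the NO_TEXT variant.
# Assumes exactly the spacing shown in the output description.
PAGE_MARKER = re.compile(r"<!-- \[Page (\d+)\]\(arke:([^)\s]+)\)( NO_TEXT)? -->")


def split_pages(assembled_text):
    """Split assembled text into one record per page marker."""
    markers = list(PAGE_MARKER.finditer(assembled_text))
    pages = []
    for i, marker in enumerate(markers):
        # A page's body runs from the end of its marker to the next marker.
        if i + 1 < len(markers):
            end = markers[i + 1].start()
        else:
            end = len(assembled_text)
        pages.append({
            "page_number": int(marker.group(1)),
            "entity_id": marker.group(2),
            "has_text": marker.group(3) is None,
            "body": assembled_text[marker.end():end].strip(),
        })
    return pages
```

Each record keeps the source page entity ID, so a downstream step can link extracted content back to the page it came from.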
- output_relationships
- The text file entity has an 'assembled_from' relationship pointing to each source page entity it was built from
- If output.parent_id is provided, the text file entity has an 'in' relationship pointing to the parent (e.g., the original PDF entity)
- Each source page entity receives a 'has_assembly' backlink relationship pointing to the text file entity, allowing traversal from any page to its assembled text
- To find the assembled text from a page: follow the page's 'has_assembly' relationship. To find all source pages from the text: follow the text file's 'assembled_from' relationships
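The relationships listed above can be modelled as (source, predicate, target) triples. The sketch below is an in-memory illustration only: the function names and the triple representation are assumptions, and it does not call any Arke API.

```python
def build_relationships(text_id, page_ids, parent_id=None):
    """Return the triples described in output_relationships (illustrative)."""
    triples = []
    for page_id in page_ids:
        # Forward link from the text file to each source page.
        triples.append((text_id, "assembled_from", page_id))
        # Backlink from each source page to the text file.
        triples.append((page_id, "has_assembly", text_id))
    if parent_id is not None:
        # Only present when output.parent_id was provided.
        triples.append((text_id, "in", parent_id))
    return triples


def follow(triples, source, predicate):
    """Return the targets reachable from `source` via `predicate`."""
    return [t for s, p, t in triples if s == source and p == predicate]
```

With these triples, `follow(triples, page_id, "has_assembly")` goes from a page to its assembled text, and `follow(triples, text_id, "assembled_from")` goes from the text back to all source pages.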
- output_tree_example
- Input: 3 page entities from a scanned PDF
Resulting text file content:
<!-- [Page 1](arke:01JABC1111) -->
It is a truth universally acknowledged, that a single man
in possession of a good fortune, must be in want of a wife.
[Decorative Header](arke:01JIMG0001)

<!-- [Page 2](arke:01JABC2222) -->
However little known the feelings or views of such a man
may be on his first entering a neighbourhood, this truth is
so well fixed in the minds of the surrounding families...
[Table: Character List](arke:01JTBL0001)
| Name | Role |
|------|------|
| Mr Bennet | Father |
| Mrs Bennet | Mother |

<!-- [Page 3](arke:01JABC3333) NO_TEXT -->
Entity tree:
Collection (output.collection_id)
└── assembled-text.txt (file entity, content_type: text/plain)
    ├── assembled_from → Page 1 (01JABC1111)
    ├── assembled_from → Page 2 (01JABC2222)
    ├── assembled_from → Page 3 (01JABC3333)
    └── in → Parent PDF (if parent_id was provided)
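The image and table links in the example content can be collected with a short scan. This is a sketch under assumptions: `extract_links` is a hypothetical helper, and it assumes every page marker sits on its own line starting with `<!--`, as in the example.

```python
import re

# Markdown-style link whose target is an Arke URI, e.g. [Label](arke:ID).
ARKE_LINK = re.compile(r"\[([^\]]+)\]\(arke:([^)\s]+)\)")


def extract_links(assembled_text):
    """Return (label, entity_id) pairs for image and table links."""
    links = []
    for line in assembled_text.splitlines():
        # Page markers also contain an arke: link, so skip comment lines.
        if line.startswith("<!--"):
            continue
        links.extend(ARKE_LINK.findall(line))
    return links
```

Applied to the example content, this would yield the `Decorative Header` and `Table: Character List` links while ignoring the three page markers and the table rows.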
- status
- active