agent

OCR Service

01KFFC4ZBD52SY7E4BX6XV8623

Properties

actions_required

entity:view
entity:update
file:view
file:update
file:create
relationship:create

description

Extracts text and images from JPEG files using Mistral OCR

endpoint

https://ocr-service.arke.institute

endpoint_verified_at

2026-01-21T04:14:11.183Z

input_schema

properties

entity_id

description: File entity ID (JPEG) to process
type: string

options

description: Agent-specific options
properties: {}
type: object

required

entity_id

type

object
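
The input schema above can be sketched as a small validator. This is a minimal illustration, not the service's own validation code: the `validate_input` helper and the example entity ID are hypothetical, and only the field names, types, and the required list come from the listing.

```python
from typing import Any

# Input schema transcribed from the listing above.
INPUT_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "entity_id": {
            "type": "string",
            "description": "File entity ID (JPEG) to process",
        },
        "options": {
            "type": "object",
            "properties": {},
            "description": "Agent-specific options",
        },
    },
    "required": ["entity_id"],
}


def validate_input(payload: Any) -> list[str]:
    """Return a list of problems; an empty list means the payload fits the schema."""
    if not isinstance(payload, dict):
        return ["payload must be an object"]
    errors: list[str] = []
    for name in INPUT_SCHEMA["required"]:
        if name not in payload:
            errors.append(f"missing required field: {name}")
    if "entity_id" in payload and not isinstance(payload["entity_id"], str):
        errors.append("entity_id must be a string")
    if "options" in payload and not isinstance(payload["options"], dict):
        errors.append("options must be an object")
    return errors


# Hypothetical entity ID, used only to show the payload shape.
example_payload = {"entity_id": "01EXAMPLEFILEENTITYID00000", "options": {}}
print(validate_input(example_payload))
```

Only `entity_id` is required; `options` may be omitted, and the listing declares no properties inside it.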

output_description

The OCR service writes extracted text directly back onto the input JPEG file entity. After processing, the source entity's 'text' property contains the full OCR output as markdown.
If the page contained embedded images (figures, charts, tables rendered as images), those are extracted as new JPEG file entities and uploaded with their binary content. The markdown text uses arke: URIs to reference these extracted images inline (e.g., '![img-0.jpeg](arke:II...)'), so the text and its images stay linked.
The source entity also receives metadata properties:
'text_source' is set to 'ocr'
'text_extracted_at' records the timestamp
'text_has_content' indicates whether any non-whitespace text was found
'text_images_count' records how many embedded images were detected
'ocr_model' records the model used (mistral-ocr-latest)
If the source entity already has text from born-digital extraction (text_source = 'born_digital'), OCR is skipped unless force_ocr is set.
Each extracted image entity gets properties recording its extraction origin: 'extraction_source', 'source_bbox' with bounding box coordinates, 'extracted_by', and 'extracted_at'.
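
A consumer of the OCR output might pull the inline image references out of the markdown and apply the born-digital skip rule roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the regular expression assumes an arke: identifier runs up to the closing parenthesis with no whitespace, and `force_ocr` is modeled as a plain boolean argument because the listing does not say where that flag is passed.

```python
import re

# Matches markdown image links whose target is an arke: URI,
# e.g. ![img-0.jpeg](arke:IIxyz123).
# Assumption: the identifier contains no whitespace or ')'.
ARKE_IMAGE = re.compile(r"!\[([^\]]*)\]\((arke:[^)\s]+)\)")


def extracted_image_refs(markdown: str) -> list[tuple[str, str]]:
    """Return (alt_text, arke_uri) pairs for every inline arke: image."""
    return ARKE_IMAGE.findall(markdown)


def should_run_ocr(properties: dict, force_ocr: bool = False) -> bool:
    """Skip OCR for born-digital text unless it is explicitly forced."""
    if force_ocr:
        return True
    return properties.get("text_source") != "born_digital"


page_text = (
    "# Chapter 1\n\nThe quick brown fox...\n\n"
    "![img-0.jpeg](arke:IIxyz123)\n\nMore text..."
)
refs = extracted_image_refs(page_text)
print(refs)
print(should_run_ocr({"text_source": "born_digital"}))
```

Comparing `len(refs)` with 'text_images_count' is one way to check that the text and its extracted images are still in step.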

output_relationships

source entity --[has_extracted]--> extracted image: follow 'has_extracted' from the input JPEG to find all images that were pulled out of it during OCR
extracted image --[extracted_from]--> source entity: follow 'extracted_from' from any extracted image back to the page it came from
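
The two relationships form a bidirectional link, which can be illustrated with an in-memory edge list. This is only a sketch: the `follow` helper and the triple representation are hypothetical and do not reflect how the platform stores or queries relationships; the entity names are taken from the tree example in this record.

```python
# Each edge is a (subject, predicate, object) triple.
edges = [
    ("source-page.jpeg", "has_extracted", "source-page_img-0.jpeg"),
    ("source-page_img-0.jpeg", "extracted_from", "source-page.jpeg"),
]


def follow(edges: list[tuple[str, str, str]], start: str, predicate: str) -> list[str]:
    """Return every entity reachable from `start` over one `predicate` edge."""
    return [obj for subj, pred, obj in edges if subj == start and pred == predicate]


# From the page to its extracted images, then back to the page.
images = follow(edges, "source-page.jpeg", "has_extracted")
pages = follow(edges, images[0], "extracted_from")
print(images, pages)
```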

output_tree_example

source-page.jpeg (input entity, updated in place)
├── properties.text = "# Chapter 1\n\nThe quick brown fox...\n\n![img-0.jpeg](arke:IIxyz123)\n\nMore text..."
├── properties.text_source = "ocr"
├── properties.text_has_content = true
├── properties.text_images_count = 1
├── properties.ocr_model = "mistral-ocr-latest"
│
└── [has_extracted] ──► source-page_img-0.jpeg (new file entity)
    ├── properties.extraction_source = "ocr"
    ├── properties.source_bbox = { x1, y1, x2, y2 }
    ├── properties.extracted_by = "ocr-service"
    └── [extracted_from] ──► source-page.jpeg

status

active

Metadata

Version: 6
Created: 1/21/2026
Updated: 1/30/2026
Edited by: ARCHON