agent

OCR Service

01KFFC4ZBD52SY7E4BX6XV8623

Properties

actions_required
  • entity:view
  • entity:update
  • file:view
  • file:update
  • file:create
  • relationship:create
description
Extracts text and images from JPEG files using Mistral OCR
endpoint
https://ocr-service.arke.institute
endpoint_verified_at
2026-01-21T04:14:11.183Z
input_schema
properties
entity_id
description
File entity ID (JPEG) to process
type
string
options
description
Agent-specific options
properties
{}
type
object
required
  • entity_id
type
object
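A minimal sketch, in Python, of building an input object that satisfies the schema above. The helper name `build_ocr_input` is hypothetical, and this record does not describe how the payload is submitted to the endpoint, so the sketch stops at constructing and validating the object.

```python
from typing import Any, Dict, Optional


def build_ocr_input(entity_id: str,
                    options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an input object matching the schema: entity_id is required,
    options is an optional object. (Hypothetical helper, not part of the service.)"""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")
    payload: Dict[str, Any] = {"entity_id": entity_id}
    if options is not None:
        if not isinstance(options, dict):
            raise TypeError("options must be an object (dict)")
        payload["options"] = options
    return payload
```

Leaving `options` out entirely is valid, since only `entity_id` appears under `required`.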
output_description
The OCR service writes extracted text directly back onto the input JPEG file entity. After processing, the source entity's 'text' property contains the full OCR output as markdown.
If the page contained embedded images (figures, charts, tables rendered as images), those are extracted as new JPEG file entities and uploaded with their binary content. The markdown text uses arke: URIs to reference these extracted images inline (e.g., '![img-0.jpeg](arke:II...)'), so the text and its images stay linked.
The source entity also receives metadata properties:
  • 'text_source' is set to 'ocr'
  • 'text_extracted_at' records the timestamp
  • 'text_has_content' indicates whether any non-whitespace text was found
  • 'text_images_count' records how many embedded images were detected
  • 'ocr_model' records the model used (mistral-ocr-latest)
If the source entity already has text from born-digital extraction (text_source = 'born_digital'), OCR is skipped unless force_ocr is set.
Each extracted image entity gets properties recording its extraction origin: 'extraction_source', 'source_bbox' with bounding box coordinates, 'extracted_by', and 'extracted_at'.
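The description above suggests two small client-side checks, sketched here in Python. Both function names are hypothetical. The record does not say where force_ocr is supplied (the input schema lists no option properties), so the sketch takes it as a plain boolean argument; the regular expression assumes the inline image form shown in the description.

```python
import re
from typing import Any, Dict, List, Tuple

# Matches markdown images whose target is an arke: URI,
# e.g. ![img-0.jpeg](arke:IIxyz123)
ARKE_IMAGE = re.compile(r"!\[([^\]]*)\]\(arke:([^)\s]+)\)")


def extract_image_refs(markdown: str) -> List[Tuple[str, str]]:
    """Return (alt_text, id_after_arke_prefix) pairs for each inline arke: image."""
    return ARKE_IMAGE.findall(markdown)


def should_skip_ocr(properties: Dict[str, Any], force_ocr: bool = False) -> bool:
    """Mirror the stated rule: skip when text_source is 'born_digital',
    unless force_ocr is set. (Assumed reading of the description.)"""
    return properties.get("text_source") == "born_digital" and not force_ocr
```

For example, calling `extract_image_refs` on the 'text' value of a processed entity yields one pair per extracted image reference found in the markdown.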
output_relationships
  • source entity --[has_extracted]--> extracted image: follow 'has_extracted' from the input JPEG to find all images that were pulled out of it during OCR
  • extracted image --[extracted_from]--> source entity: follow 'extracted_from' from any extracted image back to the page it came from
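The two relationships above are inverses of each other, which can be illustrated with a small in-memory sketch in Python. The list of triples and the `follow` helper are assumptions for illustration only; this record does not show the actual query interface for relationships. The entity names are illustrative file names.

```python
from typing import List, Tuple

# (subject, predicate, object); an illustrative stand-in for stored relationships
Triple = Tuple[str, str, str]


def follow(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object reachable from subject via predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


triples: List[Triple] = [
    ("source-page.jpeg", "has_extracted", "source-page_img-0.jpeg"),
    ("source-page_img-0.jpeg", "extracted_from", "source-page.jpeg"),
]

# From the input JPEG to the images pulled out of it during OCR
images = follow(triples, "source-page.jpeg", "has_extracted")

# From an extracted image back to the page it came from
pages = follow(triples, "source-page_img-0.jpeg", "extracted_from")
```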
output_tree_example
source-page.jpeg (input entity, updated in place)
├── properties.text = "# Chapter 1\n\nThe quick brown fox...\n\n![img-0.jpeg](arke:IIxyz123)\n\nMore text..."
├── properties.text_source = "ocr"
├── properties.text_has_content = true
├── properties.text_images_count = 1
├── properties.ocr_model = "mistral-ocr-latest"
│
└── [has_extracted] ──► source-page_img-0.jpeg (new file entity)
    ├── properties.extraction_source = "ocr"
    ├── properties.source_bbox = { x1, y1, x2, y2 }
    ├── properties.extracted_by = "ocr-service"
    └── [extracted_from] ──► source-page.jpeg
status
active