agent

PDF to JPEG Processor

01KFFH6ETXGRVD10WPNP3007D6

Properties

actions_required

file:view
file:create
file:update
entity:view
entity:create
entity:update

description

Processes PDFs: detects type (born-digital vs scanned), extracts text and images for born-digital, renders pages to JPEG for scanned

endpoint

https://pdf-processor.arke.institute

endpoint_verified_at

2026-01-21T05:42:22.654Z

input_schema

properties

entity_id

description: Source PDF file entity to process
type: string

options

description

Processing options

properties

dpi

description: Resolution in DPI (default: 300)
type: number

extract_images

description: Extract embedded images from born-digital PDFs (default: true)
type: boolean

extraction_mode

description: Processing mode (default: auto)
enum: auto, born_digital, scanned
type: string

image_min_size

description: Minimum image dimension to extract (default: 100)
type: number

quality

description: JPEG quality 1-100 (default: 85)
type: number

type

object

required

entity_id

type

object
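
As an illustration of the schema above, the following sketch builds an input object with the documented defaults filled in and checks the constraints the option descriptions state (the extraction_mode enum and the 1-100 quality range). The helper name build_input is hypothetical, and how the object is delivered to the endpoint is not described in this record, so the sketch stops at constructing and validating the payload.

```python
from typing import Any

# Defaults copied from the option descriptions in the input schema.
DEFAULT_OPTIONS: dict[str, Any] = {
    "dpi": 300,
    "extract_images": True,
    "extraction_mode": "auto",
    "image_min_size": 100,
    "quality": 85,
}
EXTRACTION_MODES = ("auto", "born_digital", "scanned")


def _is_number(value: Any) -> bool:
    """True for int/float but not bool (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def build_input(entity_id: str, **options: Any) -> dict[str, Any]:
    """Hypothetical helper: merge caller options over the defaults and validate."""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")

    merged = {**DEFAULT_OPTIONS, **options}

    if merged["extraction_mode"] not in EXTRACTION_MODES:
        raise ValueError(f"extraction_mode must be one of {EXTRACTION_MODES}")
    if not isinstance(merged["extract_images"], bool):
        raise ValueError("extract_images must be a boolean")
    for key in ("dpi", "image_min_size", "quality"):
        if not _is_number(merged[key]):
            raise ValueError(f"{key} must be a number")
    if not 1 <= merged["quality"] <= 100:
        raise ValueError("quality must be between 1 and 100")

    return {"entity_id": entity_id, "options": merged}


if __name__ == "__main__":
    # "01EXAMPLEPDFENTITY" is a placeholder id, not a real entity.
    print(build_input("01EXAMPLEPDFENTITY", extraction_mode="scanned", dpi=200))
```

Only entity_id is required by the schema, so calling build_input with no options yields a payload that spells out every default explicitly.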

output_description

For every page in the source PDF, the processor creates a JPEG file entity representing that page.
First, the PDF is classified as 'born_digital' or 'scanned' using a 3-tier detection system: producer/creator metadata, page structure analysis (full-page images vs vector text), and text rendering mode (invisible OCR layer detection). If detection is inconclusive, it defaults to 'scanned'.
Every page is then rendered to JPEG via Ghostscript at the configured DPI (default 300) and quality (default 85), capped to a maximum dimension of 2400px. Each resulting JPEG is uploaded as a new file entity with properties including 'page_number', 'width', 'height', and 'pdf_type'.
For born-digital PDFs, native text is extracted per page and stored directly on the page entity in a 'text' property, along with 'text_source' set to 'born_digital', 'text_extracted_at', 'text_extracted_by', and 'text_has_content'. Scanned pages have 'text_source' set to null, meaning downstream OCR is needed.
For born-digital PDFs, embedded images (figures, diagrams, photos) are also extracted and uploaded as separate JPEG file entities, each with properties 'extraction_source', 'source_page_number', 'source_image_index', 'extracted_by', and 'extracted_at'. Small images below the minimum size threshold (default 100px) and full-page background images on text-heavy pages are filtered out.
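
To make the interaction between DPI and the 2400px cap concrete, here is a minimal sketch of the arithmetic. It assumes the cap is applied by scaling both dimensions proportionally so the longer side does not exceed 2400px, and that pixel sizes are rounded to the nearest integer; the record does not say how the processor actually enforces the cap, and the function name capped_render_size is hypothetical.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def capped_render_size(
    width_pt: float,
    height_pt: float,
    dpi: float = 300,
    max_dim: int = 2400,
) -> tuple[int, int]:
    """Return (width_px, height_px) for a page rendered at `dpi`.

    Assumption: when the longer side would exceed `max_dim`, both sides are
    scaled down by the same factor so the aspect ratio is preserved.
    """
    width_px = width_pt / POINTS_PER_INCH * dpi
    height_px = height_pt / POINTS_PER_INCH * dpi
    scale = min(1.0, max_dim / max(width_px, height_px))
    return round(width_px * scale), round(height_px * scale)


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    print(capped_render_size(612, 792, dpi=300))  # (1855, 2400): capped
    print(capped_render_size(612, 792, dpi=200))  # (1700, 2200): under the cap
```

Under these assumptions, a US Letter page at the default 300 DPI would nominally be 2550 x 3300 pixels and is scaled down to fit the cap, while the same page at 200 DPI fits without scaling.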

output_relationships

Each page JPEG entity has a 'derived_from' relationship pointing to the source PDF entity
The source PDF entity has 'has_derivative' relationships pointing to all page JPEG entities
Page entities are linked sequentially with 'prev' and 'next' relationships (page 1 -> next -> page 2, page 2 -> prev -> page 1, etc.)
For born-digital PDFs: each extracted image entity has an 'extracted_from' relationship pointing to its source page entity
For born-digital PDFs: each page entity has 'has_derivative' relationships pointing to any images extracted from that page
To traverse: start from the source PDF, follow 'has_derivative' to find all page entities, then read 'page_number' to order them. Follow 'next'/'prev' to walk the page sequence. For born-digital pages, follow 'has_derivative' from a page to find its extracted images.
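
The traversal described above can be sketched against a small in-memory graph. The dictionary shape used here (an id mapped to "properties" and a list of "relationships" with "predicate" and "peer" keys) is an assumption made for illustration, not the platform's actual API, and the function names are hypothetical. Only the relationship names and the page_number property come from this record.

```python
from typing import Any, Iterator

# Hypothetical in-memory shape: id -> {"properties": {...}, "relationships": [...]}
Graph = dict[str, dict[str, Any]]


def related(graph: Graph, entity_id: str, predicate: str) -> list[str]:
    """Return ids of peers reachable from entity_id via the given predicate."""
    return [
        rel["peer"]
        for rel in graph[entity_id].get("relationships", [])
        if rel["predicate"] == predicate
    ]


def pages_in_order(graph: Graph, pdf_id: str) -> list[str]:
    """Follow 'has_derivative' from the PDF, then order pages by page_number."""
    page_ids = related(graph, pdf_id, "has_derivative")
    return sorted(page_ids, key=lambda pid: graph[pid]["properties"]["page_number"])


def walk_pages(graph: Graph, first_page_id: str) -> Iterator[str]:
    """Walk the page sequence by following 'next' until it runs out."""
    seen: set[str] = set()
    current: str | None = first_page_id
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        yield current
        following = related(graph, current, "next")
        current = following[0] if following else None


def extracted_images(graph: Graph, page_id: str) -> list[str]:
    """Follow 'has_derivative' from a page to the images extracted from it."""
    return related(graph, page_id, "has_derivative")


def _entity(properties: dict[str, Any], *rels: tuple[str, str]) -> dict[str, Any]:
    """Build a sample entity; rels are (predicate, peer_id) pairs."""
    return {
        "properties": properties,
        "relationships": [{"predicate": p, "peer": t} for p, t in rels],
    }


# Two-page born-digital sample with one image extracted from page 2.
SAMPLE: Graph = {
    "pdf": _entity({}, ("has_derivative", "p2"), ("has_derivative", "p1")),
    "p1": _entity({"page_number": 1}, ("derived_from", "pdf"), ("next", "p2")),
    "p2": _entity(
        {"page_number": 2},
        ("derived_from", "pdf"),
        ("prev", "p1"),
        ("has_derivative", "img1"),
    ),
    "img1": _entity({"source_page_number": 2}, ("extracted_from", "p2")),
}

if __name__ == "__main__":
    print(pages_in_order(SAMPLE, "pdf"))
    print(list(walk_pages(SAMPLE, "p1")))
    print(extracted_images(SAMPLE, "p2"))
```

Sorting by page_number does not depend on the order in which the relationships are returned, which is why the sample deliberately lists page 2 before page 1 on the PDF entity.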

output_tree_example

source_pdf 'research-paper.pdf' (5 pages, born_digital)
├── page_jpeg 'research-paper_page_0001.jpg' (page_number: 1, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Title Page\nAuthors...', text_source: 'born_digital')
├── page_jpeg 'research-paper_page_0002.jpg' (page_number: 2, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Abstract\nThis paper...', text_source: 'born_digital')
│   └── extracted_image 'research-paper_image_p2_i1.jpg' (source_page_number: 2, source_image_index: 1, extraction_source: 'born_digital')
├── page_jpeg 'research-paper_page_0003.jpg' (page_number: 3, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 1\nIntroduction...', text_source: 'born_digital')
├── page_jpeg 'research-paper_page_0004.jpg' (page_number: 4, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 2\nMethods...', text_source: 'born_digital')
│   ├── extracted_image 'research-paper_image_p4_i1.jpg' (source_page_number: 4, source_image_index: 1, extraction_source: 'born_digital')
│   └── extracted_image 'research-paper_image_p4_i2.jpg' (source_page_number: 4, source_image_index: 2, extraction_source: 'born_digital')
└── page_jpeg 'research-paper_page_0005.jpg' (page_number: 5, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'References\n1. Smith...', text_source: 'born_digital')

status

active

Metadata

Version: 7
Created: 1/21/2026
Updated: 1/30/2026
Edited by: ARCHON