agent

PDF to JPEG Processor

01KFFH6ETXGRVD10WPNP3007D6

Properties

actions_required
  • file:view
  • file:create
  • file:update
  • entity:view
  • entity:create
  • entity:update
description
Processes PDFs: detects the type (born-digital vs. scanned), renders every page to JPEG, and additionally extracts native text and embedded images from born-digital PDFs
endpoint
https://pdf-processor.arke.institute
endpoint_verified_at
2026-01-21T05:42:22.654Z
input_schema
properties
entity_id
description
Source PDF file entity to process
type
string
options
description
Processing options
properties
dpi
description
Resolution in DPI (default: 300)
type
number
extract_images
description
Extract embedded images from born-digital PDFs (default: true)
type
boolean
extraction_mode
description
Processing mode (default: auto)
enum
  • auto
  • born_digital
  • scanned
type
string
image_min_size
description
Minimum image dimension, in pixels, for an embedded image to be extracted (default: 100)
type
number
quality
description
JPEG quality 1-100 (default: 85)
type
number
type
object
required
  • entity_id
type
object
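As an illustration of the input schema above, here is a minimal Python sketch that assembles a request payload and fills in the documented defaults. The function name `build_input`, the placeholder entity ID, and the choice to reject unknown option keys are assumptions made for this example; the schema itself does not say how unknown keys are handled, and how the payload is delivered to the endpoint is not described here.

```python
from typing import Any

# Allowed values for options.extraction_mode, taken from the schema enum.
EXTRACTION_MODES = {"auto", "born_digital", "scanned"}

# Defaults as stated in the option descriptions of the schema.
DEFAULT_OPTIONS: dict[str, Any] = {
    "dpi": 300,
    "extract_images": True,
    "extraction_mode": "auto",
    "image_min_size": 100,
    "quality": 85,
}


def build_input(entity_id: str, **options: Any) -> dict[str, Any]:
    """Build an input payload (hypothetical helper, not part of the agent)."""
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError("entity_id is required and must be a non-empty string")

    # Assumption of this sketch: unknown option keys are treated as errors.
    unknown = set(options) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")

    merged = {**DEFAULT_OPTIONS, **options}
    if merged["extraction_mode"] not in EXTRACTION_MODES:
        raise ValueError(f"extraction_mode must be one of {sorted(EXTRACTION_MODES)}")
    if not 1 <= merged["quality"] <= 100:
        raise ValueError("quality must be between 1 and 100")

    return {"entity_id": entity_id, "options": merged}


# Usage: override only the options that differ from the defaults.
payload = build_input("01EXAMPLEENTITYID", dpi=150, extraction_mode="scanned")
```

Only `entity_id` is required by the schema, so `build_input("01EXAMPLEENTITYID")` on its own yields a payload with every option at its documented default.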
output_description
For every page in the source PDF, the processor creates a JPEG file entity representing that page.
First, the PDF is classified as 'born_digital' or 'scanned' using a 3-tier detection system: producer/creator metadata, page structure analysis (full-page images vs vector text), and text rendering mode (invisible OCR layer detection). If detection is inconclusive, it defaults to 'scanned'.
Every page is then rendered to JPEG via Ghostscript at the configured DPI (default 300) and quality (default 85), capped to a maximum dimension of 2400px. Each resulting JPEG is uploaded as a new file entity with properties including 'page_number', 'width', 'height', and 'pdf_type'.
For born-digital PDFs, native text is extracted per page and stored directly on the page entity in a 'text' property, along with 'text_source' set to 'born_digital', 'text_extracted_at', 'text_extracted_by', and 'text_has_content'. Scanned pages have 'text_source' set to null, meaning downstream OCR is needed.
For born-digital PDFs, embedded images (figures, diagrams, photos) are also extracted and uploaded as separate JPEG file entities, each with properties 'extraction_source', 'source_page_number', 'source_image_index', 'extracted_by', and 'extracted_at'. Small images below the minimum size threshold (default 100px) and full-page background images on text-heavy pages are filtered out.
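To make the interaction between the configured DPI and the 2400px cap concrete, the following Python sketch estimates the pixel size of a rendered page. It is not the processor's code: it assumes the cap applies to the longer side, that aspect ratio is preserved, and that sizes are rounded to the nearest pixel. The function name `rendered_size` is hypothetical. Page sizes are given in PDF points (72 per inch).

```python
MAX_DIMENSION_PX = 2400   # cap stated in the output description
POINTS_PER_INCH = 72      # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300,
                  max_dim: int = MAX_DIMENSION_PX) -> tuple[int, int]:
    """Estimate (width, height) in pixels for a page rendered at `dpi`.

    Assumption: if the longer side exceeds `max_dim`, both sides are
    scaled down by the same factor so the longer side equals `max_dim`.
    """
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)

    longest = max(width_px, height_px)
    if longest <= max_dim:
        return width_px, height_px

    return (round(width_px * max_dim / longest),
            round(height_px * max_dim / longest))


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, dpi=200))  # (1700, 2200): under the cap
print(rendered_size(612, 792, dpi=300))  # (1855, 2400): 2550 x 3300 scaled down
```

Under these assumptions, a US Letter page at the default 300 DPI would exceed the cap and be scaled down, while the same page at 200 DPI fits without scaling.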
output_relationships
  • Each page JPEG entity has a 'derived_from' relationship pointing to the source PDF entity
  • The source PDF entity has 'has_derivative' relationships pointing to all page JPEG entities
  • Page entities are linked sequentially with 'prev' and 'next' relationships (page 1 -> next -> page 2, page 2 -> prev -> page 1, etc.)
  • For born-digital PDFs: each extracted image entity has an 'extracted_from' relationship pointing to its source page entity
  • For born-digital PDFs: each page entity has 'has_derivative' relationships pointing to any images extracted from that page
  • To traverse: start from the source PDF, follow 'has_derivative' to find all page entities, then read 'page_number' to order them. Follow 'next'/'prev' to walk the page sequence. For born-digital pages, follow 'has_derivative' from a page to find its extracted images.
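The traversal described in the last bullet can be sketched in Python against a small in-memory model. The `Entity` class, the `store` dictionary, and the helper names are hypothetical stand-ins for whatever client is used to fetch entities; only the relationship names ('has_derivative', 'next') and the 'page_number' property come from the list above.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Entity:
    """Hypothetical in-memory entity: properties plus outgoing relationships."""
    id: str
    properties: dict = field(default_factory=dict)
    relationships: list[tuple[str, str]] = field(default_factory=list)  # (predicate, target_id)


def targets(entity: Entity, predicate: str) -> list[str]:
    """Return the IDs this entity points to via `predicate`."""
    return [target for pred, target in entity.relationships if pred == predicate]


def ordered_pages(store: dict[str, Entity], pdf_id: str) -> list[Entity]:
    """Follow 'has_derivative' from the source PDF, then sort by 'page_number'."""
    pages = [store[t] for t in targets(store[pdf_id], "has_derivative")]
    return sorted(pages, key=lambda e: e.properties["page_number"])


def walk_forward(store: dict[str, Entity], first_page: Entity) -> Iterator[Entity]:
    """Follow 'next' links from a page until the sequence ends."""
    seen: set[str] = set()
    page = first_page
    while page is not None and page.id not in seen:  # guard against cycles
        seen.add(page.id)
        yield page
        nxt = targets(page, "next")
        page = store[nxt[0]] if nxt else None


def extracted_images(store: dict[str, Entity], page: Entity) -> list[Entity]:
    """For born-digital pages, 'has_derivative' leads to the extracted images."""
    return [store[t] for t in targets(page, "has_derivative")]


# Usage with a two-page born-digital PDF and one extracted image.
store = {
    "pdf": Entity("pdf", relationships=[("has_derivative", "p2"), ("has_derivative", "p1")]),
    "p1": Entity("p1", {"page_number": 1}, [("derived_from", "pdf"), ("next", "p2")]),
    "p2": Entity("p2", {"page_number": 2},
                 [("derived_from", "pdf"), ("prev", "p1"), ("has_derivative", "img")]),
    "img": Entity("img", {"source_page_number": 2}, [("extracted_from", "p2")]),
}
pages = ordered_pages(store, "pdf")
print([p.id for p in pages])                                  # ['p1', 'p2']
print([p.id for p in walk_forward(store, pages[0])])          # ['p1', 'p2']
print([i.id for i in extracted_images(store, store["p2"])])   # ['img']
```

Sorting by 'page_number' does not depend on the order in which relationships are returned, which is why the sketch sorts rather than trusting the order of 'has_derivative' links.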
output_tree_example
source_pdf 'research-paper.pdf' (5 pages, born_digital)
├── page_jpeg 'research-paper_page_0001.jpg' (page_number: 1, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Title Page\nAuthors...', text_source: 'born_digital')
├── page_jpeg 'research-paper_page_0002.jpg' (page_number: 2, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Abstract\nThis paper...', text_source: 'born_digital')
│   └── extracted_image 'research-paper_image_p2_i1.jpg' (source_page_number: 2, source_image_index: 1, extraction_source: 'born_digital')
├── page_jpeg 'research-paper_page_0003.jpg' (page_number: 3, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 1\nIntroduction...', text_source: 'born_digital')
├── page_jpeg 'research-paper_page_0004.jpg' (page_number: 4, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'Section 2\nMethods...', text_source: 'born_digital')
│   ├── extracted_image 'research-paper_image_p4_i1.jpg' (source_page_number: 4, source_image_index: 1, extraction_source: 'born_digital')
│   └── extracted_image 'research-paper_image_p4_i2.jpg' (source_page_number: 4, source_image_index: 2, extraction_source: 'born_digital')
└── page_jpeg 'research-paper_page_0005.jpg' (page_number: 5, width: 1700, height: 2200, pdf_type: 'born_digital', text: 'References\n1. Smith...', text_source: 'born_digital')
status
active