AI × InDesign · OpenStax · CC BY 4.0

An AI automation journey · Adobe InDesign 2026

Rebuilding a textbook page in InDesign

An AI agent reproduced five image-heavy pages of an OpenStax chapter five different ways — and recorded the real InDesign canvas doing the work, failures and all.

automation tracks

real failures fixed

93.3 best composite %

0.987 peak SSIM

01 — The challenge

A page never meant to be rebuilt by a machine

The target is one chapter of OpenStax Concepts of Biology — Creative-Commons, so the outputs are safe to share. It has a full-bleed photo, multi-column prose, colored callout boxes and figure captions.

InDesign is hard to automate against: it's proprietary, it has no headless desktop mode, and the book's fonts (IBM Plex Sans, Mulish) aren't installed by default — so the app silently substitutes and the layout drifts.

The brief

Reproduce the first five pages as closely as possible inside InDesign — and measure how close each approach gets, and what it costs.

Extraction overlay: detected **text**, **images**, **fills**.

02 — Extraction

The win starts before InDesign opens

PyMuPDF walks the PDF once and records every text span, image and fill into a single layout_spec.json — 173 text runs, 7 images, 9 fills. The same structured understanding feeds all five engines.
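A minimal sketch of what such an extraction pass could look like. The record layout (the `texts`, `images`, `fills` keys and their fields) is an assumption made for illustration, not the project's actual schema. The PyMuPDF calls used are `page.get_text("dict")` and `page.get_drawings()`.

```python
import json


def spec_from_page_dict(page_dict, drawings, page_no):
    """Flatten one page of PyMuPDF output into text, image and fill records."""
    texts, images, fills = [], [], []
    for block in page_dict.get("blocks", []):
        if block.get("type") == 1:  # image block
            images.append({"page": page_no, "bbox": list(block["bbox"])})
            continue
        for line in block.get("lines", []):  # text block
            for span in line.get("spans", []):
                if span["text"].strip():  # skip whitespace-only spans
                    texts.append({
                        "page": page_no,
                        "bbox": list(span["bbox"]),
                        "text": span["text"],
                        "font": span["font"],
                        "size": span["size"],
                    })
    for d in drawings:
        if d.get("fill") is not None:  # filled vector path
            fills.append({"page": page_no, "bbox": list(d["rect"]),
                          "rgb": list(d["fill"])})
    return {"texts": texts, "images": images, "fills": fills}


def extract(pdf_path, out_path="layout_spec.json", max_pages=5):
    """Walk the PDF once and write a single JSON spec (hypothetical schema)."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    spec = {"texts": [], "images": [], "fills": []}
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            if index >= max_pages:
                break
            part = spec_from_page_dict(
                page.get_text("dict"), page.get_drawings(), index + 1)
            for key in spec:
                spec[key].extend(part[key])
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(spec, fh, indent=2)
    return spec
```

Keeping the flattening step free of PyMuPDF imports means the spec format can be checked with plain dictionaries, without opening a PDF.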

Key insight

Decoupling “understand the source” from “draw it” is what makes the five engines comparable — and lets you swap engines without touching extraction.

03 — The live build

Watch the page assemble itself

Connected over COM, the agent places each element in order — fills, then the photograph, then the headline, the outline, the body. This is the genuine Adobe InDesign 2026 canvas updating after every placement.

Captured by attaching to a visible InDesign instance, building element-by-element with redraw on, and screenshotting the layout window after each step.

Live build — the real canvas, 53 snapshots, one per element placed.

04 — The five engines

Five ways to drive InDesign — same spec, same page

Each track consumes the identical extraction spec. Three scripted engines tie at the top; two specialist approaches trail, and each one taught us something.

Track B1 · PyWin32 COM

Python drives the DOM

90.9/100 · 5 of 5 pages

The most Python-native path: Dispatch("InDesign.Application.2026"), then place rectangles, images and text frames directly. The most controllable engine — and the one that hit the most setup snags before it worked.

Why it scored well: it places images and fills at exact coordinates (IoU 98–100) and applies the real installed fonts, so only fine text reflow separates it from the source.
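A rough sketch of the COM path, under several assumptions: the spec keys (`fills`, `texts`, `bbox`, `text`) are hypothetical, the numeric value given for `idPoints` and the `ScriptPreferences.MeasurementUnit` property name should be checked against the installed type library, and everything is placed on page 1 only. Image placement is left out for brevity.

```python
IDPOINTS = 2054188905  # assumed numeric value of InDesign's idPoints constant


def to_geometric_bounds(bbox):
    """Convert a PDF-style (x0, y0, x1, y1) box to InDesign's
    [top, left, bottom, right] ordering."""
    x0, y0, x1, y1 = bbox
    return [y0, x0, y1, x1]


def build_page(spec, prog_id="InDesign.Application.2026"):
    """Place fills first, then text frames, on page 1 of a new document."""
    import win32com.client  # pywin32, Windows only

    app = win32com.client.Dispatch(prog_id)
    # Set units before any geometry so plain numbers are read as points.
    app.ScriptPreferences.MeasurementUnit = IDPOINTS
    doc = app.Documents.Add()
    page = doc.Pages.Item(1)  # COM collections are 1-based

    for fill in spec["fills"]:
        rect = page.Rectangles.Add()
        rect.GeometricBounds = to_geometric_bounds(fill["bbox"])

    for run in spec["texts"]:
        frame = page.TextFrames.Add()
        frame.GeometricBounds = to_geometric_bounds(run["bbox"])
        frame.Contents = run["text"]
    return doc
```

The axis swap in `to_geometric_bounds` is the step most easily forgotten: PyMuPDF reports x first, while InDesign bounds start with the top edge.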

B1 — element-by-element on the live canvas.

B2 — each element placed by a per-element ExtendScript call.

Track B2 · ExtendScript

The classic .jsx engine

90.9/100 · 5 of 5 pages

The same build logic run inside InDesign via app.doScript. Identical DOM, byte-identical output to B1.

Why it's the easiest: it runs inside the licensed app — no COM ProgID hunt, no makepy, named enums that read like the docs, and the fastest build. The lowest-friction way to reach ~91%.
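To illustrate the per-element idea, here is a small sketch that turns one text record into an ExtendScript snippet, which could then be saved as a .jsx file or handed to app.doScript. The helper name `text_frame_jsx` and the choice of the first page are assumptions made for illustration.

```python
import json


def text_frame_jsx(bbox, text):
    """Return ExtendScript source that adds one text frame on the first page.

    bbox is (x0, y0, x1, y1) in points; InDesign expects
    [top, left, bottom, right], so the axes are swapped here.
    """
    x0, y0, x1, y1 = bbox
    bounds = [y0, x0, y1, x1]
    # json.dumps yields literals that are also valid JavaScript,
    # including escaped quotes and non-ASCII characters.
    return (
        "var f = app.activeDocument.pages[0].textFrames.add();\n"
        f"f.geometricBounds = {json.dumps(bounds)};\n"
        f"f.contents = {json.dumps(text)};\n"
    )


print(text_frame_jsx((10, 20, 110, 70), 'He said "hi"'))
```

Generating the script text from the shared spec keeps the build logic in one place, whichever engine ends up running it.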

Track B3 · UXP

Adobe's modern runtime

90.9/100 · 5 of 5 pages

The same DOM behind an async wall: require('indesign'), enums on the module, no return marshaling. Ported mechanically from B2; the output is byte-identical.

Why choose it: not for fidelity (it matches B1/B2) but for distribution — UXP is Adobe's strategic runtime for shippable panels and plugins.

B3 — per-element UXP doScript on the real canvas.

B5 — tagged template, then filled by pure Python.

Track B5 · simpleidml

Direct IDML — no InDesign in the loop

59.6/100 · 1 of 5 pages

Inject variable content into a fixed, pre-tagged IDML template — in pure Python, with no running InDesign. You see the placeholder frames + fixed photo, then the text injected by import_xml.

Why it trails: it can't reproduce free-form structure — every distinct element would need its own tagged slot. A template tool, not a layout engine: superb for catalogs and mail-merge, wrong for an arbitrary page.

Track B4 · MCP visual loop

The agentic correction loop

43.5/100 · 1 of 5 pages

The Blender-style idea: let the agent see its output, compare to the source, and correct — round after round. Here, rounds 0 → 1 → 2 rebuilt live from a vision-only guess.

Why it trails: it fixes gross layout (it caught a distorted photo) but composite only crawled 0.601 → 0.608. Matching exact text reflow by eye is the wall, and each round costs a 70–180 s rebuild.
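The shape of such a loop can be sketched as below. The `build`, `score` and `revise` callables are hypothetical stand-ins for the InDesign rebuild, the composite metric and the agent's correction step, and the stopping threshold is an arbitrary choice.

```python
def correction_loop(build, score, revise, plan, max_rounds=3, min_gain=0.01):
    """Build, score and revise until the score stops improving.

    Returns the list of scores, one per round that was actually built.
    """
    history = []
    best = None
    for _ in range(max_rounds):
        output = build(plan)      # the expensive step: a full rebuild
        current = score(output)   # compare the render to the source
        history.append(current)
        if best is not None and current - best < min_gain:
            break                 # gain too small to justify another rebuild
        best = current if best is None else max(best, current)
        plan = revise(plan, output)
    return history


# Fake scores shaped like the plateau described above.
fake_scores = iter([0.601, 0.608, 0.9])
rounds = correction_loop(
    build=lambda plan: plan,
    score=lambda output: next(fake_scores),
    revise=lambda plan, output: plan + 1,
    plan=0,
)
print(rounds)
```

With a gain threshold in place, a plateau like 0.601 to 0.608 ends the loop instead of paying for further rebuilds.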

B4 — the loop's three correction rounds.

05 — When things broke

Every track failed first. Here's what we cooked.

Six concrete failures, each with a clean root cause and a fix. This is where the real work was.

Failure 1 — the units trap (COM)

What broke

The page exported 12× too big, with everything crammed into the corner.

Root cause

PageWidth = 612 was read as 612 picas = 7344 pt.

What we cooked

Set MeasurementUnit = idPoints before any geometry.
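The arithmetic behind the trap is short enough to check directly. One pica is 12 points, so a width meant as points but read as picas grows by a factor of 12.

```python
POINTS_PER_PICA = 12  # typographic definition: 1 pica = 12 points


def picas_to_points(picas):
    """Convert a length in picas to points."""
    return picas * POINTS_PER_PICA


intended_width_pt = 612  # US Letter width in points
misread_width_pt = picas_to_points(intended_width_pt)  # 612 taken as picas

print(misread_width_pt)                      # 7344
print(misread_width_pt / intended_width_pt)  # 12.0
```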

Failure 2 — the wrong-enum image bug (COM)

What broke

All seven images failed with a misleading “Invalid value”.

Root cause

It wasn't Place() — it was Fit(idFitContentToFrame), the wrong enum group.

What we cooked

Fit(idContentToFrame) — image-IoU jumped to 98–100%.

Failure 3 — a “more correct” call that hurt (ExtendScript)

What broke

Identical build logic scored SSIM 0.79 here, against 0.96 on the COM track.

Root cause

The first baseline was set by CAP_HEIGHT, which shifted every line up; PyMuPDF bboxes are ascender-derived.

What we cooked

ASCENT_OFFSET: match the baseline to the extractor's convention. SSIM went back to 0.96.
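A small numeric sketch of why the two conventions disagree. The font metrics below are illustrative fractions of the em, not values measured from the book's fonts. The frame top is taken from an ascender-derived bbox, so only an ascent-based offset puts the baseline back where the source had it.

```python
ASCENT = 1.0       # illustrative ascent, as a fraction of the font size
CAP_HEIGHT = 0.75  # illustrative cap height, as a fraction of the font size


def first_baseline_y(frame_top, font_size, metric):
    """Y position of the first baseline when it sits `metric` ems
    below the top edge of the frame (y grows downward)."""
    return frame_top + metric * font_size


frame_top, font_size = 100.0, 8.0
ascent_baseline = first_baseline_y(frame_top, font_size, ASCENT)
cap_baseline = first_baseline_y(frame_top, font_size, CAP_HEIGHT)

print(ascent_baseline)                 # 108.0
print(cap_baseline)                    # 106.0
print(ascent_baseline - cap_baseline)  # 2.0 points too high with cap height
```

The shift applies to every line, which is why a seemingly small baseline choice can move a page-level metric like SSIM.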

Failure 4 — the loop that fixes shape, not text (MCP)

What broke

The suggested MCP repo was macOS-only, and the loop's first guess distorted the photo.

What we tried

Built our own COM tool layer; the loop diagnosed the aspect and switched to fill-proportionally.

Outcome

Composite still only moved 0.601 → 0.608 — use the loop for QA, not generation.

Failure 5 — the placeholder overflow (simpleidml)

What broke

The title overflowed its placeholder (“Photosynthe-”); free-form structure couldn't be reproduced.

What we cooked

Pre-size slots; reframe the tool for templated jobs (catalogs, mail-merge) rather than arbitrary pages.

Failure 6 — the plumbing (UXP & the judge)

What broke

UXP has no global app, and the Gemini 2.0 judge returned HTTP 404 for our key.

app is not defined  // UXP
Gemini 2.0-flash → HTTP 404

What we cooked

require('indesign') for UXP; switched the judge to gemini-2.5-flash.

const idsn = require('indesign');
model = "gemini-2.5-flash"

06 — Pushing the ceiling

We tried the “obvious” fix — and it regressed

Hypothesis: stop placing one frame per line; build real paragraphs (threaded stories) and let InDesign reflow. Surely more faithful? The data said no.

**B1 · per-span** — composite 91.0 · text-IoU ~75

**B6 · block reflow** — composite 80.7 · text-IoU ~48

Why it lost ~10 points: InDesign's composer re-breaks justified prose at different line endings than the original, so every body line lands slightly off (text-IoU fell 75 → 48). Grouping by block also fused distinct elements — the title wrapped wrong, the 5.1/5.2/5.3 outline collapsed into one paragraph, and “INTRODUCTION” lost its bold.
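To see why slightly displaced lines are so costly, consider plain intersection-over-union on bounding boxes. How the project's text-IoU matches and aggregates boxes is not specified here, so this is only the basic building block, with made-up coordinates.

```python
def bbox_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


source_line = (0.0, 0.0, 200.0, 10.0)    # a 10 pt tall line of text
rebuilt_line = (0.0, 2.5, 200.0, 12.5)   # the same line, 2.5 pt lower

print(bbox_iou(source_line, source_line))   # 1.0
print(bbox_iou(source_line, rebuilt_line))  # 0.6
```

Because text lines are thin, a vertical drift of a quarter of the line height already removes 40% of the overlap.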

The lesson

For faithful reproduction, per-span placement was already right — it reproduces the source's exact line breaks, which is the whole game. Threaded stories win when you want an editable, natural document, not a pixel match. It's a fidelity-vs-editability trade-off, not a ceiling you break by reflowing.

The genuine ways past 91% on this task lie outside our five tracks: a dedicated PDF→InDesign converter (Recosoft PDF2ID) that reconstructs matched stories, or Adobe's cloud InDesign API to break the headless limit.

07 — Breaking the ceiling

One real fix took us from 90.9 to 93.3

After the reflow experiment failed, the score sheet pointed at the real culprit: a single mis-rendered vector on page 3 — not the text at all.

**Before (B1)** — the callout border painted as a solid block · page 3 composite 85.7

**After (B7)** — border stroked correctly, image restored · page 3 composite 97.0

The diagnosis: the source's “Everyday Connection” box is a frame (an outer + inner rectangle filled with an even-odd rule). Our extractor had recorded it as one solid magenta fill and painted it over the grocery photo and text — wrecking page 3's colour (ΔE) and SSIM.

The fix (track B7): keep B1's exact per-span text and image placement, but re-classify every drawing — fills (header bars, tints), borders (frames → rendered as a stroke, not a block), and rules (thin bars). Page 3 jumped colour 63→100 and SSIM 0.875→0.994.
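One way such a re-classification could be sketched, working on entries shaped like those returned by PyMuPDF's `page.get_drawings()` (`rect`, `items`, `even_odd`). The three labels follow the text above; the two-rectangle test and the thickness threshold are assumptions, not the project's actual rules.

```python
def classify_drawing(drawing, rule_max_thickness=2.0):
    """Label a vector drawing as 'border', 'rule' or 'fill'.

    drawing: dict with 'rect' as (x0, y0, x1, y1), 'items' as a list of
    path tuples such as ("re", rect, ...), and an optional 'even_odd' flag.
    """
    x0, y0, x1, y1 = drawing["rect"]
    width, height = x1 - x0, y1 - y0
    rects = [item for item in drawing.get("items", []) if item[0] == "re"]

    # Outer plus inner rectangle under the even-odd rule: only the ring
    # between them is painted, so draw it as a stroke, not a block.
    if drawing.get("even_odd") and len(rects) >= 2:
        return "border"
    # Very thin shapes are treated as rules (hypothetical threshold).
    if min(width, height) <= rule_max_thickness:
        return "rule"
    return "fill"


callout = {
    "rect": (50, 100, 550, 400),
    "even_odd": True,
    "items": [("re", (50, 100, 550, 400)), ("re", (52, 102, 548, 398))],
}
print(classify_drawing(callout))  # border
```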

The lesson

The gain wasn't in the text — it was one correctly-classified vector. Let the metrics tell you where the error actually lives before you “improve” anything.

Composite 90.9 → 93.3. With our own tools that's near the ceiling; beyond it lies Recosoft PDF2ID or Adobe's cloud InDesign API.

08 — The actual output

Explore the rebuilt PDF yourself

This is the real export from InDesign — scroll and zoom all five pages. Toggle to compare against the source.

The B7 precision output · composite 93.3% · peak SSIM 0.994 · the residual gap is fine text positioning, not images or colour.

09 — Which performed well, and why

Spec-driven generation reaches ~93% — and improves with one fix

Scored on a documented blend: SSIM · text-bbox IoU · colour ΔE2000 · image IoU · a Gemini vision judge.
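A blend of this kind can be sketched as a weighted mean of per-metric sub-scores. The equal weights, the key names and the assumption that each sub-score is already normalised to 0–100 (with colour ΔE converted so that higher is better) are all placeholders; the documented weights are not reproduced here.

```python
# Hypothetical equal weights; the real blend may weight metrics differently.
WEIGHTS = {"ssim": 1, "text_iou": 1, "colour": 1, "image_iou": 1, "judge": 1}


def composite(sub_scores, weights=WEIGHTS):
    """Weighted mean of sub-scores that are each on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(weights[name] * sub_scores[name] for name in weights) / total_weight


# Made-up sub-scores, for illustration only.
example = {"ssim": 90, "text_iou": 70, "colour": 100, "image_iou": 100, "judge": 90}
print(composite(example))  # 90.0
```

A blend like this explains how a page can score around 90 overall while its text-IoU sits much lower: strong image and colour sub-scores pull the mean up.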

Track	Score	Pages
Precision pass (B7)	93.3	5/5
PyWin32 COM	90.9	5/5
ExtendScript	90.9	5/5
UXP	90.9	5/5
Block reflow (B6 · experiment)	80.7	5/5
simpleidml	59.6	1/5
MCP visual loop	43.5	1/5

✓ Works reliably

Image & fill placement (IoU 98–100), colour (ΔE<3), geometry, PDF/IDML export — once fonts and units are right.

~ Works with correction

Fonts (install the OFL families), per-span placement, the units & baseline conventions — fine once the gotchas are known.

✗ Struggles

Exact text reflow & line breaks, list formatting, the vision loop as a generator, free-form template-fill.

★ Best strategy

Extract with PyMuPDF → shared spec → generate via ExtendScript / COM / UXP. Reserve the loop for QA.

10 — The verdict

Extract first. Then script the live DOM.

ExtendScript for the least setup, PyWin32 COM for a Python-native pipeline, UXP to ship — all three reach ~91% out of the box, and a one-fix precision pass lifts it to 93.3.

External automation? Viable via COM/AppleScript, but InDesign is not headless — volume needs the separate InDesign Server licence. That, not the API, is the gate.

The honest ceiling: ~93% / SSIM 0.994 with our own tools. Images and colour are solved; only sub-pixel text positioning remains — and two experiments (reflow, fit-to-width) proved that chasing it by distorting type makes things worse. Beyond this lies a dedicated converter (Recosoft PDF2ID) or Adobe's cloud InDesign API.