Trajectory Forcing: Structure-First Generation with Controllable Semantic Trajectories

¹University of Tübingen, ²Tübingen AI Center, ³Zuse School ELIZA, ⁴ETH Zürich, ⁵Max Planck Institute for Intelligent Systems, ⁶ELLIS Institute, ⁷KE:SAI

†Project Lead, ‡Shared Last Authorship

ECCV 2026 🎉

Overview

Trajectory Forcing reframes image generation as a structured coarse-to-fine process. Instead of treating intermediate states as hidden computation, it organizes generation into semantic stages that can be decoded, inspected, and edited along the way. Starting from global layout and progressing through parts and subparts to fine detail, the model exposes the generative trajectory as a controllable object rather than a black box.

Artistic training and generative modeling. Human visual artists typically adopt a coarse-to-fine workflow, first establishing global structure and dominant color relationships before refining local details, as illustrated in the figure above (images courtesy of @yerenhb). Across artistic training practices, beginners are repeatedly advised not to chase details and to focus instead on structural abstraction and global coherence. The recurrence of this principle across instructors and contexts suggests that it reflects a stable cognitive regularity: global organization precedes and constrains local articulation.

Motivated by this observation, we build the coarse-to-fine principle into our model design as a structural prior, yielding a hierarchical generation framework in which global consistency guides detail refinement.

Teacher Hierarchies

Teacher hierarchies are built from real images by clustering DINOv2 features into a coarse-to-fine stack: object/background, parts, subparts, and finally the finest tokens. The slider sweeps from the coarse object/background level toward fine detail; drag to reveal the original image.
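To make the idea of a coarse-to-fine stack concrete, here is a minimal sketch of one way such a hierarchy could be assembled from patch features. It is an illustration under stated assumptions, not the procedure used in the paper: random vectors stand in for DINOv2 patch features, the clustering is a simple recursive 2-means split (the page does not name a clustering algorithm), and the helper names `two_means`, `build_hierarchy`, and `cluster_average` are hypothetical. Each coarse level replaces every token with the mean of its cluster, and the last level keeps the raw tokens.

```python
import numpy as np

def two_means(x, rng, iters=20):
    """Split the rows of x into two groups with plain 2-means; returns 0/1 labels."""
    if len(x) < 2:
        return np.zeros(len(x), dtype=int)
    # Initialise the two centers from two distinct rows (fancy indexing copies).
    centers = x[rng.choice(len(x), size=2, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(0)
    return labels

def build_hierarchy(feats, depth=3, seed=0):
    """Return label arrays ordered coarse to fine; each level splits the previous one."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(feats), dtype=int)
    levels = []
    for _ in range(depth):
        new_labels = np.zeros_like(labels)
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            # Child ids 2c and 2c+1 keep the hierarchy nested by construction.
            new_labels[idx] = 2 * c + two_means(feats[idx], rng)
        labels = new_labels
        levels.append(labels.copy())
    return levels

def cluster_average(feats, labels):
    """Replace every token with the mean feature of its cluster."""
    out = np.empty_like(feats)
    for c in np.unique(labels):
        out[labels == c] = feats[labels == c].mean(0)
    return out

# Placeholder input: a 16x16 patch grid with 32-dim features (NOT real DINOv2 output).
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 32))

levels = build_hierarchy(feats, depth=3)  # up to 2, 4, and 8 clusters
stack = [cluster_average(feats, lab) for lab in levels] + [feats]  # finest = raw tokens
```

In this sketch the three clustered levels play the roles of object/background, parts, and subparts. The recursive split is chosen only because it guarantees that every fine cluster sits inside exactly one coarser cluster; the depth and the cluster counts are arbitrary choices for illustration.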

Trajectory Forcing Sampling

Generated samples decoded at every level of the coarse-to-fine trajectory. The slider sweeps from the model's coarse first level toward fine detail; drag to reveal the final image.

Latent Generation Trajectory

The same samples viewed as the generation unfolds: each slider shows the generated latent (left, PCA-colored) beside its RAE decoding (right). Both sides sweep together from the coarse first level toward fine detail; drag to compare the latent structure with its decoded image at each level.
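As a rough illustration of what "PCA-colored" can mean for a latent grid, the sketch below projects each token onto the top three principal components and rescales them to RGB. It is an assumption about a common visualization recipe, not a description of how the page's figures were produced; the function name `pca_colors`, the latent shape, and the random placeholder latent are all hypothetical.

```python
import numpy as np

def pca_colors(latent):
    """Map an (H, W, C) latent to (H, W, 3) RGB values in [0, 1] via the top-3 PCs."""
    h, w, c = latent.shape
    tokens = latent.reshape(-1, c)
    centered = tokens - tokens.mean(0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    # Min-max normalise each component independently so it fits a color channel.
    lo, hi = proj.min(0), proj.max(0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-8)
    return rgb.reshape(h, w, 3)

# Placeholder latent: a 16x16 grid of 32-dim tokens (NOT a real model output).
rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 16, 32))
rgb = pca_colors(latent)
```

When comparing several levels of one trajectory, it may be preferable to fit the principal directions once (for example on the final level) and reuse them, so that colors stay comparable from level to level; the sketch fits them per latent for simplicity.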

BibTeX