We tested Yann LeCun's LeWorldModel on three proxy tasks — warehouse routing, bin packing, and autonomous navigation — to map the boundary between what world models can and cannot do today.
LeWM was trained on four specific simulation environments: Two-Room (2D navigation), Reacher (arm control), Push-T (block manipulation), and OGBench Cube (3D pick-and-place). None of these are warehouses, delivery trucks, or packing lines.
The question is whether the underlying capabilities — planning from pixels, understanding physical constraints, predicting object states — transfer to problems that share structural similarities. We mapped three real-world problems to the closest available environments.
Whether LeWM's planning, physics understanding, and spatial encoding generalize as useful primitives for routing, packing, and navigation problems.
That LeWM can solve production warehouse optimization or drive a car today. These are proxy experiments to assess the architecture's potential.
| Real-World Problem | Proxy Environment | Why This Maps |
|---|---|---|
| Warehouse Routing | Two-Room (2D nav) | Navigate constrained spaces with doorways and walls |
| Bin Packing | Push-T (manipulation) | Move objects to precise target positions and orientations |
| Self-Driving | Two-Room (obstacle nav) | Plan paths around obstacles to reach goals from pixels |
A warehouse worker needs to travel between rooms through doorways to pick items efficiently. We tested LeWM's ability to plan navigation paths through the Two-Room environment — a 2D space with a dividing wall and a single doorway connecting two rooms.
The model had to encode start and goal images, then compute an action sequence to navigate from one room to the other through the doorway — purely from pixel observations.
The model achieves 100% success navigating through the doorway constraint. The latent PC1 plot (right panel) is striking — it clearly separates the two rooms by color, proving the latent space encodes room topology. The model "knows" which room it's in and plans accordingly.
The model was trained on this specific two-room layout. A real warehouse has arbitrary floorplans, multiple floors, dynamic obstacles (forklifts, other workers), and multi-stop routes. Scaling to these would require training on diverse warehouse-like environments.
Bin packing requires placing items into constrained spaces at precise positions and orientations. We tested LeWM's ability to plan object manipulation using the Push-T environment — where a circular agent must push a T-shaped block to match a target position and rotation angle.
This is the core primitive of packing: move an object from where it is to where it needs to be, accounting for shape, orientation, and contact physics.
The model tracks block position at R² = 0.97 — meaning a simple linear regression on a 192-dim vector tells you exactly where the object is. Block angle is harder (R² = 0.79), reflecting the inherent difficulty of rotation estimation. The 90% success rate shows reliable manipulation planning under contact physics.
Real bin packing involves multiple objects of different shapes, gravity, friction constraints, and optimizing for total space utilization. Push-T has a single object in 2D. Multi-object packing would require training on environments with multiple simultaneous objects and 3D physics.
A self-driving system must detect obstacles, plan collision-free paths, and navigate to destinations from visual input. We tested LeWM on the Two-Room environment as an obstacle navigation proxy — the dividing wall forces the agent to plan indirect paths through a doorway rather than straight-line trajectories.
We measured both planning success and prediction surprise — if the model understands obstacles, its surprise should remain low when the agent correctly navigates around walls, indicating the dynamics model has internalized wall constraints.
The contrast between the model's 100% planning success and the raw dataset's 30% goal-reach rate is significant. LeWM's CEM planner actively searches for action sequences that navigate the doorway constraint — it doesn't just replay expert trajectories. Low mean surprise (0.214) confirms the dynamics model has internalized wall physics: it predicts correctly what happens when the agent moves near walls.
Real self-driving involves 3D environments, vehicle dynamics (steering, braking, acceleration), multi-agent interactions (other cars, pedestrians), weather, traffic rules, and real-time decisions at 30+ FPS. This 2D proof-of-concept validates the planning mechanism but not the perceptual or physical complexity of actual driving.
| Experiment | Planning Success | Key Metric | Proxy Strength | Production Gap |
|---|---|---|---|---|
| Warehouse Routing | 100% | Latent space encodes room topology | Strong | Multi-room, dynamic obstacles |
| Bin Packing | 90% | Block position R² = 0.97 | Moderate | Multi-object, 3D, gravity |
| Self-Driving | 100% | Wall-aware dynamics model | Moderate | 3D, multi-agent, real-time |
Across all three experiments, the planning mechanism succeeds at 90–100%. The model encodes spatial structure, respects physical constraints, and computes goal-directed action sequences from raw pixels. The bottleneck is not the architecture — it's the training environments. To solve real warehouse routing, packing, or driving, you would train the same ~15M parameter JEPA on domain-specific simulation data. The architecture scales; the data determines the domain.
Three strategic implications emerge: First, world models are not competing with LLMs — they solve a fundamentally different class of problems (physical planning vs. text generation). Second, the ~15M parameter footprint and single-GPU training make this accessible for domain-specific deployment, not just research labs. Third, the path from these proofs-of-concept to production is a data problem (build better simulators), not an architecture problem (build bigger models).
All three experiments, including environment setup, data download, checkpoint building, and evaluation, are packaged in run_all.sh. Running the script on a fresh Ubuntu 24.04 machine with an NVIDIA GPU reproduces every plot and metric in this report.
Ubuntu 24.04 · Python 3.10 · NVIDIA GPU (any) · ~60GB disk · Internet access for model downloads
Data download: ~60 min · Checkpoint build: ~1 min · All 3 experiments: ~20 min · Total: ~90 min end-to-end