The panda has been watching world models for eighteen months. They started as cute video toys that hallucinated frame by frame. They are now training simulators for humanoid robots, with GPU bills that would make a small country flinch. That is worth pausing on before another decentralized compute pitch deck claims to have solved the problem.
What is a world model, and why does it matter now?
A world model is a neural network that learns to predict the next state of a scene. Feed it pixels, actions and a bit of physics, and it hallucinates what should happen one second from now, two seconds, ten seconds. It is the closest thing modern AI has to an internal physics engine, learned rather than coded.
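The prediction loop described above can be sketched in a few lines. This is a toy, not any lab's architecture: `toy_world_model` is a hypothetical hand-written stand-in for a learned network, and the two-number state is an assumption standing in for a compressed video latent. What it does show is the autoregressive part, where every predicted state becomes the input to the next prediction.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical state: (position, velocity). A real world model would use a
# learned latent of the whole scene, not two floats.
State = Tuple[float, float]


def toy_world_model(state: State, action: float, dt: float = 0.1) -> State:
    """Stand-in for a learned network: predict the next state from state + action."""
    position, velocity = state
    velocity = velocity + action * dt    # the action nudges the velocity
    position = position + velocity * dt  # the velocity moves the position
    return (position, velocity)


def rollout(
    model: Callable[[State, float], State],
    initial_state: State,
    actions: Sequence[float],
) -> List[State]:
    """Autoregressive rollout: each prediction is fed back in as the next input."""
    states = [initial_state]
    for action in actions:
        states.append(model(states[-1], action))
    return states


if __name__ == "__main__":
    # Ten steps of constant push, starting at rest.
    trajectory = rollout(toy_world_model, (0.0, 0.0), [1.0] * 10)
    for step, (position, velocity) in enumerate(trajectory):
        print(f"step {step}: position={position:.3f} velocity={velocity:.3f}")
```

Because each step consumes the previous step's output, errors compound over the rollout. That is one reason a full minute of consistent prediction is treated as a milestone.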
According to DeepMind's research team, Genie 2 was the first model to generate playable 3D environments from a single image prompt, with up to a minute of consistent rollout. The marketing pictures looked like a Minecraft demo. The actual implication was different: a free training environment for whatever agent you point at it.
Robotics labs noticed first. Training a real-world humanoid is slow, expensive, and tends to break the humanoid. A world model gives you infinite synthetic episodes for the cost of GPU hours. Cheap, in robotics terms. Expensive, in any other terms.
From Genie 2 to industrial robot stacks
NVIDIA pitched the same idea at CES 2025, but aimed at industry. Cosmos is a family of foundation models that generate physics-aware video to train robots and self-driving stacks. The framing was blunt: you do not need a million real miles, you need a million simulated miles that look real enough to transfer.
Eighteen months later, the pattern is everywhere. Wayve trains driving policies in latent worlds. Physical Intelligence shipped Pi-Zero, an open generalist robot policy that learned from a mix of real and synthetic data. 1X Technologies and Figure both quietly admit, in interview after interview, that their humanoid stacks are half real data and half simulator rollouts.
The interesting shift is not the existence of one big model. It is that the stack now assumes synthetic experience as a primary input. Pixels are training data. Actions are training data. The world model itself has become training infrastructure, sitting one layer below the policy a robotics team actually wants to ship.
The compute tax nobody pencils in
Here is the part that gets glossed over in keynotes. Generating a token of world model rollout costs roughly an order of magnitude more than generating a token of text from an LLM. The system is predicting compressed video at high frame rates, conditioned on actions, with temporal consistency stretched over a full minute. That is a lot of FLOPs per pixel.
Industry write-ups from late 2025 put the inference-cost gap between frontier video generators and equivalent text LLMs at multiples, not percentages, and training-cost gaps are wider still. Training a single robot policy that actually generalizes is now firmly in the seven-figure GPU-hour range, before counting the simulator that produced its data.
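To put "seven-figure GPU-hour range" into dollars, a back-of-envelope sketch helps. The hourly rate below is a hypothetical placeholder, not a quoted price from any provider. The GPU-hour bounds are simply the smallest and largest seven-figure numbers.

```python
from typing import Tuple

# "Seven-figure" read literally: 1,000,000 to 9,999,999 GPU-hours.
SEVEN_FIGURE_GPU_HOURS: Tuple[int, int] = (1_000_000, 9_999_999)

# Assumption for illustration only; swap in a real quote before trusting the output.
HYPOTHETICAL_USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: int, usd_per_gpu_hour: float) -> float:
    """Rental cost of a training run, ignoring storage, networking and staff."""
    return gpu_hours * usd_per_gpu_hour


def cost_range_usd(
    gpu_hour_range: Tuple[int, int], usd_per_gpu_hour: float
) -> Tuple[float, float]:
    """Low and high cost bounds for a range of GPU-hours at one hourly rate."""
    low_hours, high_hours = gpu_hour_range
    return (
        training_cost_usd(low_hours, usd_per_gpu_hour),
        training_cost_usd(high_hours, usd_per_gpu_hour),
    )


if __name__ == "__main__":
    low, high = cost_range_usd(SEVEN_FIGURE_GPU_HOURS, HYPOTHETICAL_USD_PER_GPU_HOUR)
    print(f"Policy training alone: ${low:,.0f} to ${high:,.0f}")
```

The simulator that produced the training data is a separate line item on top of whatever this prints.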
The macro context does not help. Total crypto market cap sits at $2.57 trillion as of June 1, with Bitcoin still 57.26% dominant. Those are headline numbers that have nothing to do with chip allocation in Santa Clara. The real auction is whether a robotics lab can outbid a hedge fund or a sovereign cloud for the same H200 cluster. Spoiler: the hedge fund usually wins, and the sovereign cloud wins the rest.
That auction creates an obvious opening for crypto compute networks. Whether they fit through that opening is the harder question, and the part most decks skip.
Where DePIN compute actually fits
Render, Akash, io.net and Bittensor compute subnets all pitch the same line: idle GPUs around the world, rented cheaper than AWS or CoreWeave. For inference workloads, that is sometimes true. We covered the Cerebras and Groq inference economics debate last week, and the same logic applies to world models at inference time. For training a foundation world model, however, the line is mostly fiction.
Training requires high-bandwidth interconnect between GPUs, low-latency NVLink or InfiniBand fabrics, and clusters that stay coherent for weeks. Renting twenty H100s from twenty different basements does not produce a usable training run. It produces twenty paperweights with a shared Discord channel.
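The paperweight problem is mostly bandwidth arithmetic. In data-parallel training, every GPU has to exchange gradients on every step. With a ring all-reduce, each GPU sends roughly 2(N-1)/N times the gradient size. The sketch below estimates that time and ignores latency entirely, which flatters the basement scenario. Every number in it is an assumption: a hypothetical 2 GB of gradients, a 50 GB/s link standing in for a datacenter fabric, and a 100 Mbit/s home uplink.

```python
def ring_allreduce_seconds(
    gradient_bytes: float, num_gpus: int, link_bytes_per_sec: float
) -> float:
    """Bandwidth-only estimate of one ring all-reduce; latency is ignored."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nobody to synchronize with
    # Each GPU sends 2 * (N - 1) / N times the gradient size around the ring.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
    return traffic_per_gpu / link_bytes_per_sec


# All values below are illustrative assumptions, not measurements.
GRADIENT_BYTES = 2e9           # e.g. 1B parameters with 2-byte gradients
NUM_GPUS = 20                  # the twenty basements from the text
DATACENTER_LINK = 50e9         # 50 GB/s, a stand-in for a fast fabric
HOME_UPLINK = 100e6 / 8        # 100 Mbit/s converted to bytes per second


if __name__ == "__main__":
    fast = ring_allreduce_seconds(GRADIENT_BYTES, NUM_GPUS, DATACENTER_LINK)
    slow = ring_allreduce_seconds(GRADIENT_BYTES, NUM_GPUS, HOME_UPLINK)
    print(f"datacenter fabric: {fast:.3f} s per gradient sync")
    print(f"home uplinks:      {slow:.1f} s per gradient sync")
    print(f"slowdown:          {slow / fast:,.0f}x")
```

A training run performs this sync many thousands of times, so a gap of this size decides whether the run finishes at all.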
Akash Network publishes workload breakdowns on its blog and is honest in roadmap discussions: GPU marketplaces win on inference and on fine-tuning, not on pre-training. Where DePIN compute fits today is the long tail. Researchers running ablations, indie game studios generating assets, on-chain agents needing bursty inference for a few seconds at a time. That tail is real, and it is growing. It is also not the foundation training market.
The split matters for AI-gaming projects. Synthetic worlds for a game studio are a near-perfect decentralized compute use case, because each scene is independent and latency-tolerant. Generalist humanoid policy training is not. We argued in our cheap-chains thesis for AI agents that the right fit between AI workloads and crypto rails is workload-specific. World models prove the point again, this time at higher resolution.
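The "independent and latency-tolerant" property is easy to see in code. In the sketch below, each scene depends only on its own seed, so jobs can be handed to any worker in any order, with no traffic between them. `generate_scene` is a hypothetical placeholder for a real generator call, and the thread pool stands in for a pool of rented GPUs.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Scene = List[Tuple[float, float]]  # hypothetical: a scene is a list of object positions


def generate_scene(seed: int, num_objects: int = 3) -> Scene:
    """Placeholder for a world model call: output depends only on the seed."""
    rng = random.Random(seed)  # private RNG, so no state is shared across jobs
    return [
        (round(rng.uniform(0, 100), 2), round(rng.uniform(0, 100), 2))
        for _ in range(num_objects)
    ]


def generate_scenes(seeds: Sequence[int], max_workers: int = 4) -> List[Scene]:
    """Fan independent scene jobs out to workers; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_scene, seeds))


if __name__ == "__main__":
    for seed, scene in zip(range(5), generate_scenes(range(5))):
        print(f"scene {seed}: {scene}")
```

A slow or dropped worker delays only its own scene. In a synchronized training run, the same worker would stall every other GPU at the next gradient exchange.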
What to watch next
Three signals over the next quarter. First, whether any major lab publishes full training-cost numbers for a robot world model, not just inference cost. Second, whether DePIN networks start publishing utilization data broken down by workload type, instead of one aggregate figure that flatters everyone. Third, whether AI-gaming projects building on-chain economies, including Zentrix-style platforms tied to on-chain assets, start sourcing simulator capacity from crypto compute rails for the slices that genuinely fit.
The panda is not betting on a clean answer this quarter. World models are real, the compute tax is real, and the marketing layer on top of both is exactly what it has always been.



