gmziven 2 hours ago

Text in these models isn't generated as language; it's generated as texture. The video diffusion model never explicitly learns characters or spelling. It learns 'what regions that look like writing tend to look like' from a training set that's heavily multilingual. So when it needs to fill a sign or UI element, it samples from that whole distribution, and your English prompt only conditions the scene, not the glyphs. It's the same reason early image models produced gibberish text: there's no character-level grounding, just visual priors. The ones that get English right usually bolt on a separate text-rendering path rather than leaving it to the diffusion process.
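
To make the "separate text-rendering path" idea concrete, here's a toy sketch, not how any particular product does it. It assumes the frame is just a grid of grayscale values, and the tiny `FONT` table, `render_text`, and `composite` are all hypothetical names made up for illustration. The point is only that each character is looked up explicitly, so the spelling is fixed before any pixels are touched, and the result is stamped on top of the generated frame.

```python
# Toy 3x5 bitmap font (made-up illustrative data): five rows of three pixels per glyph.
FONT = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}

def render_text(text):
    """Return a 5-row bitmap (list of '0'/'1' strings) that spells `text` exactly."""
    rows = [""] * 5
    for ch in text:
        glyph = FONT[ch]  # raises KeyError for unknown characters instead of guessing
        for r in range(5):
            rows[r] += glyph[r] + "0"  # one blank column after each glyph
    return rows

def composite(frame, bitmap, top, left, ink=255):
    """Stamp the text bitmap onto a copy of a grayscale frame (list of lists)."""
    out = [row[:] for row in frame]  # leave the original frame untouched
    for r, line in enumerate(bitmap):
        for c, bit in enumerate(line):
            if bit == "1":
                out[top + r][left + c] = ink
    return out

# Stand-in for one generated video frame: 7 rows by 12 columns, all black.
frame = [[0] * 12 for _ in range(7)]
result = composite(frame, render_text("HI"), top=1, left=1)

for row in result:
    print("".join("#" if v else "." for v in row))
```

A real pipeline would presumably use an actual font rasterizer and blend the text into the scene (perspective, lighting, motion across frames), but the character-level lookup is the part the diffusion process alone doesn't have.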