I keep wondering: is the diffusion/flow formulation alone already enough to build a virtual world as real as Ready Player One, with intelligent agents as alive as in Her?
From my work, you can see that current video diffusion models excel at photorealistic generation and exhibit emerging 3D structure.
Yet I’m still exploring what the best generative formulation and 3D representation might be, and how they can be applied to virtual reality, robotics, and other practical scenarios.
CogNVS is a video diffusion model for dynamic novel-view synthesis, trained in a self-supervised manner using only 2D videos! We reformulate novel-view synthesis as a structured inpainting task.
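To make that recipe concrete, here is a minimal, hypothetical sketch of this kind of self-supervised inpainting objective: holes are simulated on ordinary 2D videos, and a diffusion model is trained to fill them in. The `TinyDenoiser`, the mask simulator, and the cosine schedule are all illustrative stand-ins, not CogNVS's actual architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Toy stand-in for a video diffusion backbone (not the real model)."""
    def __init__(self, ch=3):
        super().__init__()
        # Input channels: noisy video + masked conditioning video + mask.
        self.net = nn.Conv3d(2 * ch + 1, ch, kernel_size=3, padding=1)

    def forward(self, noisy, cond, mask, t):
        # (b, t, c, h, w) -> (b, c, t, h, w) for Conv3d, then back.
        x = torch.cat([noisy, cond, mask], dim=2).transpose(1, 2)
        return self.net(x).transpose(1, 2)  # toy net ignores timestep t

def random_occlusion_mask(frames, hole_frac=0.4):
    """Simulate view-change holes on a plain 2D video: 1 = visible, 0 = hole."""
    b, t, c, h, w = frames.shape
    coarse = (torch.rand(b * t, 1, h // 8, w // 8) > hole_frac).float()
    return F.interpolate(coarse, size=(h, w), mode="nearest").view(b, t, 1, h, w)

def train_step(denoiser, frames, optimizer, num_timesteps=1000):
    """One epsilon-prediction diffusion step, conditioned on the masked clip."""
    mask = random_occlusion_mask(frames)
    cond = frames * mask                               # partially visible video
    noise = torch.randn_like(frames)
    t = torch.randint(0, num_timesteps, (frames.shape[0],), device=frames.device)
    a = (torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2).view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * frames + (1 - a).sqrt() * noise  # cosine noise schedule
    loss = F.mse_loss(denoiser(noisy, cond, mask, t), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

denoiser = TinyDenoiser()
frames = torch.randn(1, 8, 3, 64, 64)                  # one fake 2D video clip
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
print(train_step(denoiser, frames, opt))
```

The point of the sketch is that supervision comes for free: every 2D video already contains the ground truth for its own simulated holes, so no multi-view or 3D data is required.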
Given a modal (visible) object sequence in a video, we develop a two-stage method that generates its amodal (visible + occluded) masks and RGB content via video diffusion.
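As a rough illustration of the two-stage idea (not the paper's actual models or interfaces), the sketch below wires together two placeholder diffusion callables: one that extends modal masks into amodal masks, and one that inpaints RGB in the newly revealed occluded region.

```python
import torch

def amodal_completion(frames, modal_masks, mask_diffusion, rgb_diffusion):
    """frames: (t, 3, h, w) video; modal_masks: (t, 1, h, w), 1 = visible pixel."""
    # Stage 1: extend each visible mask to the object's full
    # (visible + occluded) extent.
    amodal_masks = mask_diffusion(frames, modal_masks)  # (t, 1, h, w) in [0, 1]
    # Stage 2: hallucinate RGB only where the object is occluded;
    # visible pixels stay fixed.
    occluded = (amodal_masks > 0.5).float() * (1 - modal_masks)
    completed = rgb_diffusion(frames * modal_masks, occluded)
    return amodal_masks, completed

# Dummy stand-ins so the sketch runs end to end; the real versions would be
# video diffusion models finetuned for each stage.
mask_diffusion = lambda f, m: m.clamp(0, 1)     # placeholder: mask unchanged
rgb_diffusion = lambda visible, holes: visible  # placeholder: no hallucination
frames = torch.rand(8, 3, 64, 64)
modal = (torch.rand(8, 1, 64, 64) > 0.5).float()
masks, rgb = amodal_completion(frames, modal, mask_diffusion, rgb_diffusion)
print(masks.shape, rgb.shape)
```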