Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop

Zhaofang Qian1, Abolfazl Sharifi2, Tucker Carroll1, Ser-Nam Lim1
1University of Central Florida, 2University of Kashan

Abstract

Video generation has achieved impressive quality, but it still suffers from artifacts such as temporal inconsistency and violations of physical laws. Leveraging 3D scenes can fundamentally resolve these issues by providing precise control over scene entities. To facilitate the easy generation of diverse photorealistic scenes, we propose Scene Copilot, a framework combining large language models (LLMs) with a procedural 3D scene generator. Specifically, Scene Copilot consists of Scene Codex, BlenderGPT, and a human in the loop. Scene Codex translates textual user input into commands understandable by the 3D scene generator. BlenderGPT provides users with an intuitive and direct way to precisely control the generated 3D scene and the final output video. Furthermore, users can use the Blender UI to receive instant visual feedback. We have also curated a procedural dataset of objects in code format to further enhance our system's capabilities. These components work seamlessly together to support users in generating the desired 3D scenes. Extensive experiments demonstrate the capability of our framework in customizing 3D scenes and video generation.

The Pipeline

Overview of Scene Copilot with the procedural dataset. Starting from the user's textual input, Scene Codex combines an LLM with a RAG database of the Infinigen code and generates an executable Infinigen Python command. Infinigen initially creates a coarse scene, which is converted into textual format so that LLMs can comprehend the objects and metadata in the scene. This scene file is condensed into a coarse RAG database. BlenderGPT, which integrates Blender with an LLM, uses this database to edit and modify the 3D contents of the scene, with the user involved through either textual or visual interaction. The updated coarse scene is fed back to Infinigen to create a fine scene. As with the coarse scene, the fine scene is condensed into a fine RAG database, and BlenderGPT consults both databases while editing the final scene. Meanwhile, the procedural dataset provides any requested procedural asset code. The finished scene is then rendered by Infinigen to produce the requested video.
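The staged loop above can be sketched in a few lines of Python. This is an illustrative toy, not the actual Scene Copilot API: `RAGDatabase`, `scene_codex`, and `run_pipeline` are hypothetical names, the retrieval is plain substring matching standing in for embedding search, the LLM calls are elided, and the emitted command string is a placeholder rather than Infinigen's real CLI.

```python
# Toy sketch of the Scene Copilot pipeline. All names are hypothetical
# placeholders; LLM calls and the real Infinigen CLI are stubbed out.
from dataclasses import dataclass, field


@dataclass
class RAGDatabase:
    """Stand-in for a retrieval database over scene/code text."""
    entries: list = field(default_factory=list)

    def condense(self, scene_text: str) -> None:
        # Real system: chunk and embed the textual scene description.
        self.entries.extend(scene_text.splitlines())

    def retrieve(self, query: str) -> list:
        # Real system: embedding similarity search, not substring match.
        return [e for e in self.entries if query.lower() in e.lower()]


def scene_codex(user_prompt: str, infinigen_code_db: RAGDatabase) -> str:
    """Translate user text into an Infinigen command.

    The real system feeds retrieved Infinigen code to an LLM; here the
    LLM call is elided and a placeholder command template is returned.
    """
    _context = infinigen_code_db.retrieve(user_prompt)  # grounding for the LLM
    return f"python generate.py --task coarse --prompt '{user_prompt}'"


def run_pipeline(user_prompt: str) -> str:
    code_db = RAGDatabase(entries=["beach scene config", "desert scene config"])
    command = scene_codex(user_prompt, code_db)

    # 1) Infinigen builds a coarse scene from the command (stubbed here)
    #    and the scene is serialized to text for the LLM.
    coarse_scene_text = f"coarse scene for: {user_prompt}\nobjects: placeholder cubes"
    coarse_db = RAGDatabase()
    coarse_db.condense(coarse_scene_text)

    # 2) BlenderGPT edits the coarse scene using the coarse RAG database,
    #    with the user in the loop (textual or visual interaction).
    # 3) Infinigen refines to a fine scene; a fine RAG database is built,
    #    and BlenderGPT consults both databases for final edits.
    # 4) Infinigen renders the finished scene into the output video.
    return command


print(run_pipeline("beach scene"))
```

The point of the sketch is the data flow: each stage's output is re-serialized into a retrieval database so the next LLM stage can ground its edits in the current scene state.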



The Demo of Scene Co-pilot


Blender is designed for 3D editing, and interacting with 3D scenes through its GUI is intuitive and efficient. Therefore, to allow users to control the objects in the scene precisely, we believe it is essential to preserve the user interface. With both visual and textual interaction available, users can select an object and then prompt BlenderGPT with a textual request. The video demonstrates an example of asking the camera to follow a snake's movement through the scene. Three snakes are shown as white cubic placeholders in this coarse scene. Because there are multiple snakes, describing the desired one with textual input alone would be tedious and error-prone. Instead, the user can directly click on the snake object and provide the textual input "follow the selected object during the whole animation".
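The interaction described above, combining a GUI click with a textual prompt, can be sketched as follows. This is a hedged illustration, not the BlenderGPT implementation: `resolve_target` and `follow_edit` are hypothetical helpers, and the emitted edit merely mimics what a Blender "Track To" camera constraint would express (in Blender itself this would go through the `bpy` API).

```python
# Hypothetical sketch: resolving "the selected object" in a prompt.
# Plain data structures stand in for Blender's scene and constraint API.

def resolve_target(prompt, selected, scene_objects):
    """Prefer the clicked object when the prompt refers to the selection."""
    if "selected" in prompt.lower() and selected is not None:
        return selected
    # Fallback: text matching, which is ambiguous when several
    # objects of the same kind (e.g. three snakes) are in the scene.
    matches = [o for o in scene_objects if o.split(".")[0].lower() in prompt.lower()]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or missing target: {matches}")
    return matches[0]


def follow_edit(target):
    """Emit an edit that makes the camera track the target every frame."""
    return {"op": "add_constraint", "object": "Camera",
            "type": "TRACK_TO", "target": target}


scene = ["snake.001", "snake.002", "snake.003", "rock.001"]
edit = follow_edit(resolve_target(
    "follow the selected object during the whole animation",
    selected="snake.002", scene_objects=scene))
print(edit)  # the clicked snake wins despite three snakes in the scene
```

The design point is that the GUI selection acts as an unambiguous pointer, so the textual prompt only has to express the intent ("follow"), not the identity of the object.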


Comparison of Infinigen and Scene Co-pilot


Infinigen: Graveyard at sunset


Scene Copilot: Graveyard at sunset


Infinigen: Relaxing scenery of beach view under cloudy sky


Scene Copilot: Relaxing scenery of beach view under cloudy sky


Infinigen: Time lapse of a sunset sky in the countryside


Scene Copilot: Time lapse of a sunset sky in the countryside



We compare the direct output videos from Infinigen with the results after editing with Scene Copilot. Note that we optimized Scene Codex specifically for two different tasks. Because of Infinigen's randomness, the direct output videos have a high probability of not focusing on the main subjects, or may even fail to generate the requested assets. In contrast, with Scene Copilot the user, acting as a "director", has more control over the scene and the output video. For example, in the graveyard video, since Infinigen does not include a "graveyard" in its asset library, the camera points in a random direction. Using BlenderGPT, however, we generated a church and gravestones with a fixed camera animation.


Long Videos Generated by Scene Copilot



We rendered additional examples to demonstrate our long-video generation capabilities: one 10-minute video, two 2-minute videos, three 1-minute videos, and three 30-second videos. These examples showcase the utility of our procedural dataset and the capabilities of Scene Copilot beyond the creation of generic environments and scenery.

BibTeX

@inproceedings{Zhaofang2025scenecopilot,
  author    = {Qian, Zhaofang and Sharifi, Abolfazl and Carroll, Tucker and Lim, Ser-Nam},
  title     = {Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop},
  booktitle = {CVPR},
  year      = {2025},
}