[ By Hillbot Team • Jan 7, 2025 ]
Hillbot Accelerates Robotic Training with NVIDIA Cosmos Video Generation
World foundation models enable Hillbot to push the boundaries of robotic training data generation.
Using NVIDIA Cosmos, a platform comprising state-of-the-art generative world foundation models, advanced tokenizers and an accelerated video processing pipeline, Hillbot is able to jumpstart its data pipeline with terabytes of AI-generated high-fidelity 3D environments. This technology has helped us refine our robotic training philosophy and operations to enable faster, more efficient robotic skilling with markedly improved performance.
But how do world foundation models help Hillbot bridge the Sim2Real gap—the countless subtle differences between simulation and the real world?
In a Nutshell: The Robotic Training Data Landscape
Today, foundation models for robots are typically trained using real-world data, which requires physically replicating environments and tasks the robots are expected to perform. Robots are recorded performing their assigned tasks, a process that is not always feasible. This approach is costly and constrained by the human imagination, which cannot anticipate and replicate every potential scenario or challenge the robot might face.
While real-world data is highly transferable, collecting it falls short as robots demand increasingly large and diverse training datasets to tackle more complex tasks effectively. That’s why developers have turned to simulators, in the hope that they can deliver practically unlimited synthetic training data and diversity for foundation model training.
But the Sim2Real gap has proven stubborn. As a result, developers continue to rely on real-world training data generation, reinforcing a seemingly insurmountable training data bottleneck.
NVIDIA Cosmos offers a brilliant solution to this hurdle: fast, high-fidelity video generation of diverse environments with incredible detail. While training our robots to pick canned beverages from a trolley and place them in a refrigerator, Hillbot used Cosmos to generate terabytes of rich, diverse video data depicting kitchen environments, jumpstarting our data pipeline by several months.
Here’s a step-by-step breakdown of how Cosmos helped us teach our robot a brand new skill.
Step 1: Generating Video Clips with NVIDIA Cosmos
Training robots to master manipulation tasks requires diverse, high-quality data. For Hillbot, a video generation model is our first step.
Hillbot engineers start with NVIDIA NeMo Curator, an AI-accelerated video preprocessing and curation toolkit that helps them prepare high-quality training datasets from visual data. From there, Cosmos generates realistic video clips of diverse kitchen environments, accounting for different refrigerator designs, sizes, and kitchen layouts. These clips give our robots a wide range of scenarios to learn from.
Cosmos’ autoregressive models, compression algorithm, and tokenizers ensure that these video clips are highly efficient to process. This helps us save time and money and scale up our data pipeline while maintaining exceptional quality.
Step 2: Reconstructing 3D Scenes for Simulation Training
With these environments, Hillbot uses its proprietary robot simulation suites to reconstruct detailed 3D scenes and objects, so that robots can be simulated performing tasks with specific objects, such as placing cans on a shelf. Hillbot owns state-of-the-art robot simulation techniques, which are incorporated in our simulation products Sapien and ManiSkill. For instance, the simulation shows the Hillbot Alpha robot picking up cans from a nearby trolley and stocking them across various refrigerator designs.
To be considered a success, the foundation model must distinguish between soda brands and place them on their corresponding shelves. The simulation is repeated at massive scale, producing training data that covers a vast array of edge cases, obstacles, and other potential variables.
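The success criterion described above can be illustrated with a minimal sketch. Everything named here (`Placement`, `SHELF_FOR_BRAND`, `episode_succeeded`, and the brand-to-shelf mapping) is hypothetical and is not part of Sapien or ManiSkill; the sketch only shows the kind of check a simulated episode might need to pass.

```python
from dataclasses import dataclass

# Hypothetical mapping from soda brand to target shelf index (assumed for illustration).
SHELF_FOR_BRAND = {"cola": 0, "lemon_lime": 1, "orange": 2}


@dataclass(frozen=True)
class Placement:
    brand: str  # brand of the can that was placed
    shelf: int  # index of the shelf where the robot put it


def episode_succeeded(placements):
    """Return True only if at least one can was placed and every can is on its brand's shelf."""
    if not placements:
        return False
    # An unknown brand maps to None, which never equals a shelf index, so it counts as a failure.
    return all(SHELF_FOR_BRAND.get(p.brand) == p.shelf for p in placements)


good = [Placement("cola", 0), Placement("orange", 2)]
bad = [Placement("cola", 0), Placement("orange", 1)]
print(episode_succeeded(good), episode_succeeded(bad))
```

Treating an empty episode and an unrecognized brand as failures is a design choice of this sketch; a real pipeline could score partial success instead.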
In just a few hours, Sapien and ManiSkill leverage virtual environments to generate millions of task simulations, complete with corresponding successes and failures. By simulating different kitchens and refrigerator designs, we enable our robots to practice manipulation tasks in a wide array of settings. This approach not only accelerates training but also minimizes the risks and costs associated with real-world testing.
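As a rough illustration of how varied task scenarios and their outcomes might be organized, here is a minimal sketch in plain Python. It does not use the Sapien or ManiSkill APIs. The `Scenario` fields, the value ranges, and the `top_freezer` and `galley` options are assumptions made for the example; the article only states that refrigerator designs, sizes, and kitchen layouts vary.

```python
import random
from dataclasses import dataclass

# Hypothetical variation axes (assumed for illustration).
FRIDGE_DESIGNS = ["side_by_side", "bottom_freezer", "top_freezer"]
KITCHEN_LAYOUTS = ["compact", "galley", "spacious"]


@dataclass(frozen=True)
class Scenario:
    fridge_design: str
    kitchen_layout: str
    num_cans: int
    trolley_offset_m: float  # trolley-to-refrigerator distance in metres


def sample_scenarios(n, seed=0):
    """Sample n randomized scenarios; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [
        Scenario(
            fridge_design=rng.choice(FRIDGE_DESIGNS),
            kitchen_layout=rng.choice(KITCHEN_LAYOUTS),
            num_cans=rng.randint(1, 6),
            trolley_offset_m=round(rng.uniform(0.5, 2.0), 2),
        )
        for _ in range(n)
    ]


def success_rate_by_design(outcomes):
    """Compute the success rate per fridge design from (scenario, succeeded) pairs."""
    totals, wins = {}, {}
    for scenario, succeeded in outcomes:
        key = scenario.fridge_design
        totals[key] = totals.get(key, 0) + 1
        wins[key] = wins.get(key, 0) + int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


scenarios = sample_scenarios(5, seed=42)
for s in scenarios:
    print(s)
```

In a real pipeline each sampled scenario would be handed to the simulator, and the recorded successes and failures would feed a summary such as `success_rate_by_design` to show which designs still need more training data.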
Step 3: Transferring Skills to Physical Robots
After rigorous training in simulation, the mobile manipulation policies are transferred to Hillbot’s physical robots. Equipped with these refined skills, our robots demonstrate remarkable adaptability in real-world scenarios. Whether it’s a compact kitchen with a side-by-side refrigerator or a spacious kitchen with a bottom-freezer model, Hillbot robots can navigate the environment and accurately place drinks into the refrigerator.
Why World Foundation Models Are a Game-Changer
World foundation models such as those provided by NVIDIA Cosmos have been instrumental in transforming our approach to robotic training. Their promptable video generation ability and accelerated data pipelines enable us to generate and process massive amounts of environmental data with unprecedented speed and efficiency. And when combined with controllable 3D scenarios built in Sapien, ManiSkill, and NVIDIA Isaac Sim, we generate a very large number of realistic simulations that help bridge the gap between virtual training and real-world performance.
The Future of Robotics with Hillbot
Hillbot sees video generation models as a perfect complement to our training and tech ecosystem. The diversity of simulation data ensures that our robots are prepared to handle variations in kitchen layouts and appliance designs. This approach accelerates training while substantially reducing the risks and costs associated with real-world testing.
The ability to train robots in diverse, physics-aware simulations is unlocking new possibilities for automation in households, warehouses, and beyond. From putting drinks into refrigerators to tackling more complex tasks, Hillbot is setting new standards for what robots can achieve.