Availability and Disaster Recovery for NVIDIA Omniverse Enterprise Nucleus

by Kellan Cartledge, Isidro Hernandez, and Kelly Williamson

NVIDIA Omniverse is a revolutionary platform that allows creators and organizations to collaborate in real-time on 3D designs and simulations. It offers a wide range of integrations and tools that help teams work together and bring their ideas to life.

One of the fundamental aspects of NVIDIA Omniverse is the ability to author content in the applications you already use. Omniverse has connections to popular CAD tools such as Autodesk Revit and PTC Creo, as well as content creation tools such as Autodesk 3ds Max, Autodesk Maya, and Blender. For a full list, see Connecting to Omniverse. This breadth of support allows multi-functional teams working with incompatible data formats to collaborate in a common digital space in real-time.

NVIDIA Omniverse Nucleus is the database and collaboration engine of the Omniverse platform. With Omniverse Nucleus, teams can have multiple live users connected using different applications at once. Nucleus enables efficient live synchronization between NVIDIA Omniverse applications. Changes to Universal Scene Description (USD) files, the core Omniverse data format, are transmitted in real-time between connected Omniverse clients.

As companies look to leverage NVIDIA Omniverse to drive their digital innovation, it is important to consider where and how the Nucleus server is configured. With many teams and companies spread throughout a country, or across the globe, it helps to understand why the cloud is a good place to deploy Nucleus and how to ensure quick recovery in the event of a server failure.

Deploying Omniverse Enterprise Nucleus on Amazon Web Services with SoftServe

As a member of the NVIDIA Service Delivery Partner – Professional Services (SDP-PS) program, SoftServe has an experienced team of AI, ML, and DevOps experts. Amazon Web Services and SoftServe have developed this Nucleus reference architecture to help customers accelerate their digital transformation and reduce the time to deploy Nucleus on Amazon Web Services.

The SoftServe professional services team works with customers to set up Nucleus cloud deployments by automating the provisioning of Amazon Web Services resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Simple Storage Service (Amazon S3) buckets, Amazon Web Services Identity and Access Management (IAM) roles, networking, auto scaling, and load balancers. SoftServe delivers these Amazon Web Services resources as a managed deployment, and customers receive the solution, documentation, and training. The Nucleus deployment on Amazon Web Services can be customized and extended with additional cloud resources as required.

Solution Overview

Architecture for Availability and Disaster Recovery for NVIDIA Omniverse Enterprise Nucleus

  1. End users of the Omniverse tools are supported by on-premises graphics workstations. These workstations have high-end NVIDIA GPUs, the Omniverse clients, and additional Digital Content Creation tools connected using Nucleus Connectors.
  2. Depending on network security requirements, the Amazon Web Services component of this hybrid deployment can be privately connected to the on-premises network via VPN connection or Amazon Web Services Direct Connect connection. A managed private certificate authority can be deployed with Amazon Route 53 for private DNS resolution. Amazon Virtual Private Cloud (Amazon VPC) private link endpoints maintain private communication between Amazon EC2 instances and services such as Amazon Web Services Systems Manager Agent (SSM Agent), Amazon S3, and Amazon CloudWatch.
  3. An Application Load Balancer (ALB) is deployed in public subnets to redirect client requests from HTTP to HTTPS and then to the NGINX reverse proxy servers. The ALB also balances traffic load across the reverse proxy servers if multiple have been provisioned.
  4. The reverse proxy is an NGINX server deployed in a highly available, multi-AZ Auto Scaling group. The reverse proxy routes requests, based on their paths, to the specific Nucleus ports.
  5. The Nucleus server is composed of Docker containers orchestrated by a Docker Compose stack provided by NVIDIA. Nucleus data is stored on Amazon Elastic Block Store (Amazon EBS) volumes.
  6. When deployed, Amazon Web Services Systems Manager Run Commands pull the Nucleus Docker container images from the NVIDIA Container Registry and configure the Nucleus instance on Amazon EC2.
  7. Access to the NVIDIA Container Registry is required for Docker to pull the appropriate images.
  8. Auto Scaling lifecycle hooks, backed by Amazon Web Services Lambda, support runtime configuration of the NGINX proxy instances when they scale up and when the instances terminate.
  9. Triggered by the Nucleus ASG On Terminate Lifecycle Hook, the Nucleus failover procedure uses Amazon Web Services Step Functions to pull the Nucleus backup data from Amazon S3 and reconfigure the newly launched EC2 instance. Expect a few minutes of downtime while the new EC2 instance is launched and configured.
  10. Triggered periodically by Amazon EventBridge, the Nucleus backup procedure uses Amazon Web Services Step Functions and the NVIDIA nucleus-tools to perform incremental backups of the Nucleus data to Amazon S3.
  11. CloudWatch aggregates logs from the Amazon EC2 instances and facilitates metric monitoring and alarms. The Nucleus stack also exposes metrics about its load characteristics (such as the number of requests per user and per request type) in a form that Prometheus can consume.
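
To make steps 3 and 4 above more concrete, here is a minimal Python sketch of the two routing decisions: the ALB redirecting HTTP to HTTPS, and the reverse proxy choosing a Nucleus port from the request path. The path prefixes, port numbers, and function names are illustrative assumptions, not the values from the solution's NGINX configuration.

```python
from typing import Optional, Tuple

# Hypothetical path-prefix to port table. The real mapping lives in the
# NGINX configuration that ships with the solution.
ROUTES = {
    "/api": 3009,
    "/lft": 3030,
    "/auth": 3100,
    "/discovery": 3333,
}


def alb_listener(scheme: str, host: str, path: str) -> Tuple[int, str]:
    """Mimic the ALB listener: redirect HTTP to HTTPS, else forward."""
    if scheme == "http":
        return 301, f"https://{host}{path}"
    # Simplification: a real ALB forwards the request to a target group.
    return 200, "forward-to-proxy"


def proxy_port(path: str) -> Optional[int]:
    """Return the port for the longest matching path prefix, or None."""
    matches = [
        prefix for prefix in ROUTES
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]


print(alb_listener("http", "nucleus.example.com", "/api/status"))
print(proxy_port("/api/status"))
```

Matching on the longest prefix keeps the table unambiguous if one route is nested under another, which is one reasonable way to model path-based routing.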

High Availability

Production teams expect reliable and consistent access to the data stored in Nucleus. To address this expectation, features of high availability have been implemented in this solution.

Using Route 53 and an ALB, requests are sent to a single DNS host name and dynamically routed across multiple Availability Zones (AZs). To ensure encrypted connections, an Amazon Web Services Certificate Manager (ACM) SSL/TLS certificate is associated with the ALB, which terminates the front-end connection and decrypts the requests.

NGINX reverse proxy servers route traffic to specific ports on the Nucleus server. An Amazon EC2 Auto Scaling group ensures the reverse proxy instances are deployed across multiple AZs and that the number of instances scales up or down depending on the current request load. By default, this solution scales out based on the CPU usage of the reverse proxy instances.

The maximum number of reverse proxy instances, the scaling mechanism, and the number of AZs to scale across are configurable to ensure high availability for each use case.
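
As a rough illustration of CPU-based scaling with a configurable maximum, the sketch below sizes the proxy group so that average CPU lands near a target value. The proportional rule, the target percentage, and the function name are assumptions made for this example. The article does not state which scaling policy type the solution uses.

```python
import math


def desired_capacity(current: int, cpu_percent: float, target_percent: float,
                     min_size: int, max_size: int) -> int:
    """Proportional sizing: scale the group so average CPU nears the target."""
    if current <= 0:
        return min_size
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Clamp to the configured group bounds.
    return max(min_size, min(max_size, wanted))


# Two proxies at 90% CPU with a 60% target: the group wants a third instance.
print(desired_capacity(current=2, cpu_percent=90.0, target_percent=60.0,
                       min_size=1, max_size=4))
```

The clamp is where the configurable maximum from the paragraph above takes effect: however high the load, the group never exceeds `max_size`.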

Nucleus Backup Procedure for Availability and Disaster Recovery for NVIDIA Omniverse Enterprise Nucleus

Backup and Restore

The Omniverse Nucleus on Amazon Web Services solution implements backup procedures at different levels:

  • Snapshots of Amazon EBS volumes
  • Copy and transfer of the Nucleus data to an Amazon S3 bucket

These backup features are configurable and automated by using an Amazon Web Services Step Functions state machine, which is triggered by a Lambda function on a configurable schedule. Using the NVIDIA nucleus-tools, incremental copies of the Nucleus data are synchronized with the Amazon S3 bucket. Because the backup is incremental, frequent backups are recommended: each run transfers less data, and less recent work is at risk if the server fails between backups.
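
The idea behind incremental backup can be sketched in a few lines of Python: keep a manifest of checksums from the previous run and transfer only the files whose content changed. This is a generic illustration, not how nucleus-tools works internally. The manifest format, file names, and function names are all hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def checksum(data: bytes) -> str:
    """Content fingerprint used to detect changed files."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental_backup(
    files: Dict[str, bytes], previous_manifest: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Return (paths to upload, paths to delete, new manifest)."""
    current = {path: checksum(data) for path, data in files.items()}
    to_upload = sorted(
        path for path, digest in current.items()
        if previous_manifest.get(path) != digest
    )
    to_delete = sorted(path for path in previous_manifest if path not in current)
    return to_upload, to_delete, current


# First run: nothing has been backed up yet, so every file is uploaded.
files = {"scene.usd": b"v1", "materials.usd": b"v1"}
uploads, deletes, manifest = plan_incremental_backup(files, {})
print(uploads, deletes)

# Second run: only the file that changed since the last run is uploaded.
files["scene.usd"] = b"v2"
uploads, deletes, manifest = plan_incremental_backup(files, manifest)
print(uploads, deletes)
```

The shorter the interval between runs, the fewer files differ from the manifest, which is why frequent backups keep each transfer small.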

Disaster Recovery

When managing centralized datastores such as the Omniverse Nucleus collaboration engine for digital assets, companies need to protect the continuity of the business and avoid work disruptions.

To maintain a Recovery Time Objective (RTO) of a few minutes, this solution implements incremental Nucleus data backups and automated configuration procedures. This includes not only periodic, incremental backups of the Nucleus data to an Amazon S3 bucket, but also serverless processes that use Amazon Web Services Lambda, Auto Scaling groups, and Amazon Web Services Step Functions to automatically launch and reconfigure Nucleus instances running on Amazon EC2.
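
A back-of-the-envelope calculation can help when choosing a backup schedule. The sketch below estimates the worst-case window of lost work and a recovery time built from the phases described here. Every number and phase name is a made-up placeholder for illustration, not a measurement from this solution.

```python
from typing import Dict


def worst_case_data_loss_minutes(backup_interval: float,
                                 backup_duration: float) -> float:
    """A failure just before a backup finishes loses one interval plus one run."""
    return backup_interval + backup_duration


def estimated_recovery_minutes(phases: Dict[str, float]) -> float:
    """Recovery time is the sum of the sequential recovery phases."""
    return sum(phases.values())


# Placeholder values: backups every 15 minutes, each taking about 2 minutes.
print(worst_case_data_loss_minutes(backup_interval=15, backup_duration=2))

# Placeholder phase durations, in minutes.
print(estimated_recovery_minutes({
    "launch_instance": 2.0,
    "pull_images_and_configure": 1.5,
    "restore_backup_from_s3": 3.0,
}))
```

Plugging in measured values from your own deployment shows quickly whether a given schedule keeps recovery within a few minutes.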

When the Nucleus Auto Scaling group detects an instance failure, it automatically launches a new instance and the failover Step Functions procedure starts. The procedure pulls the Nucleus backup from Amazon S3 and, using Amazon Web Services Systems Manager and the NVIDIA nucleus-tools, loads the data into the new Nucleus instance.
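
The failover flow can be pictured as a small state machine. The sketch below writes one in Amazon States Language as a Python dictionary and walks it to show the order of the steps. The state names, retry settings, and placeholder Lambda ARNs are hypothetical. The actual definition in the solution may be structured differently.

```python
import json
from typing import Any, Dict, List

# Hypothetical failover definition; REGION and ACCOUNT_ID are placeholders.
FAILOVER_DEFINITION: Dict[str, Any] = {
    "Comment": "Illustrative Nucleus failover sketch",
    "StartAt": "WaitForInstanceReady",
    "States": {
        "WaitForInstanceReady": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:wait-for-instance",
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 15,
                "MaxAttempts": 8,
                "BackoffRate": 1.5,
            }],
            "Next": "PullBackupFromS3",
        },
        "PullBackupFromS3": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:pull-backup",
            "Next": "RestoreNucleusData",
        },
        "RestoreNucleusData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:restore-data",
            "End": True,
        },
    },
}


def execution_order(definition: Dict[str, Any]) -> List[str]:
    """Follow Next pointers from StartAt until a state marked End."""
    order: List[str] = []
    name = definition["StartAt"]
    while True:
        if name in order:
            raise ValueError(f"cycle detected at state {name}")
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


print(execution_order(FAILOVER_DEFINITION))
print(json.dumps(FAILOVER_DEFINITION, indent=2)[:80])
```

Putting the retry on the first state reflects that a freshly launched instance may need some time before it can accept commands.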

Nucleus Failover Procedure for Availability and Disaster Recovery for NVIDIA Omniverse Enterprise Nucleus

This approach allows customers to recover quickly from unexpected incidents that affect the availability of the Nucleus server. The recovery process is configurable and works with a health check and Lambda functions to implement the failover process.

Infrastructure as Code

One of the key objectives of building the Omniverse Nucleus on Amazon Web Services reference architecture is to allow customers to provision the Nucleus server in an automated fashion by using Amazon Web Services Cloud Development Kit (Amazon Web Services CDK). With Infrastructure as Code (IaC), customers receive the source code of the solution, which can be deployed in a repeatable way. Customers that require customizations can use Amazon Web Services CDK to add Amazon Web Services resources or modify the solution to suit their needs.

This solution also deploys an Amazon Web Services CodeCommit repository and an Amazon Web Services CodePipeline CI/CD pipeline that is used to automate modifications to the Nucleus deployment on Amazon Web Services.

Conclusion

With Amazon Web Services, customers can connect distributed users all over the globe to NVIDIA Omniverse Enterprise Nucleus. With the breadth and depth of Amazon Web Services, high availability and disaster recovery techniques can be implemented for Nucleus deployed on Amazon Web Services. These include load balancing, auto scaling, and backup and restore of the data in Nucleus. All of this ensures teams can collaborate in real-time with reliable access to their data.

Working alongside SoftServe professional services teams, customers can quickly deploy Nucleus in their Amazon Web Services accounts and customize the solution for their business needs.

For a technical deep dive, please review this open-source solution from Amazon Web Services and SoftServe:
NVIDIA Omniverse Nucleus on Amazon EC2

SoftServe – Amazon Web Services Premier Partner

As an Amazon Web Services Premier Tier Services Partner, SoftServe consistently helps customers to implement repeatable solutions in the Amazon Web Services cloud through deep industry experience, innovation, and advanced technologies.

SoftServe can help you to transform your 3D workflows and enable your teams to achieve a new level of collaboration in 3D production quality with NVIDIA Omniverse Enterprise.

For more information about SoftServe and NVIDIA Omniverse Enterprise, please go to our website:
SoftServe – NVIDIA Omniverse Enterprise

Kellan Cartledge

Kellan Cartledge is a Spatial Computing Specialist Solutions Architect at Amazon Web Services. At Amazon Web Services, he defines and explores the art of the possible with immersive technology on the cloud. He has over a decade of experience at the intersection of architecture, entertainment, and computer graphics. Kellan is passionate about the combination of the virtual and the physical worlds, and the future of spatial experiences.

Isidro Hernandez

Isidro Hernandez is a Solution Architect in the CoE Critical Services Amazon Web Services Cluster at SoftServe, with experience in Cloud, Solution Architecture, Containers, and DevOps. In his current role, Isidro helps customers adopt cloud solutions, design and plan for Amazon Web Services migrations, and build repeatable solutions on the Amazon Web Services cloud.

Kelly Williamson

Kelly Williamson is a Sr. Account Executive at SoftServe, serving as a strategic solutions consultant who helps customers realize strategic goals with advanced technology. He leads teams that drive digital products and services for the world's most innovative organizations, focusing on software development driven by design thinking and on Metaverse, AI, Big Data, IoT, and Security solutions that transform clients and prepare them for the future.

