There’s no denying that GPUs have incredible potential for accelerating workloads of all kinds, but developing applications that can scale on two, four, or even more GPUs continues to be a prohibitively expensive proposition.
The cloud has certainly made renting compute resources more accessible. A few CPU cores and some memory can be had for just a few dollars a month. But renting GPU resources is a different matter entirely.
Unlike CPU cores, which can be divided up and shared by multiple users, GPUs have only recently gained this kind of virtualization. Traditionally, a GPU was accessible to only a single tenant at a time. As a result, customers have had to pay thousands of dollars a month for a dedicated GPU even when they need only a fraction of its performance.
For large development teams building AI/ML frameworks, this might not be a big deal, but it limits the ability of smaller developers to build accelerated applications, especially those designed to scale across multiple GPUs.
Their options have been to spend a lot of money upfront to buy and manage their own infrastructure, or to spend even more on renting the compute by the minute. However, with improvements in virtualization technology, this is starting to change.
In May, Vultr became one of the first cloud providers to slice an Nvidia A100 into fractional GPU instances with the launch of its Talon virtual machine instances. Customers can now rent 1/20 of an A100 for as little as $0.13 per hour or $90 per month. To put that into perspective, a VM with a single dedicated A100 would cost you $2.60 per hour or $1,750 per month at Vultr.
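The arithmetic behind those figures is worth a quick sanity check. A minimal sketch, using only the prices quoted above (this is not Vultr's billing logic, just the math):

```python
# Fractional vs. dedicated A100 pricing at Vultr (figures from the article).
FRACTION = 1 / 20           # smallest Talon slice of an A100
frac_hourly = 0.13          # $/hour for a 1/20 slice
frac_monthly = 90.0         # $/month for a 1/20 slice
full_hourly = 2.60          # $/hour for a dedicated A100 VM
full_monthly = 1750.0       # $/month for a dedicated A100 VM

# Hourly pricing is exactly linear: twenty slices cost one full GPU.
print(round(frac_hourly / FRACTION, 2))   # 2.6

# Monthly pricing is not: twenty slices would run $1,800 against $1,750
# for the dedicated card, so the full GPU carries a small bulk discount.
print(round(frac_monthly / FRACTION, 2))  # 1800.0
```

The point of fractional pricing is not that it is cheaper per unit of compute, but that the minimum spend drops from $1,750 a month to $90.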
“For some of the less compute-intensive workloads like AI inference or edge AI, often those don’t really need the full power of a full GPU and they can work on smaller GPU plans,” Vultr CEO JJ Kardwell told The Next Platform.
Slice and dice a GPU
Today, most accelerated cloud instances have one or more GPUs passed through wholesale to the virtual machine. While this means the customer has access to the GPU’s full performance, it also means cloud providers cannot achieve the same utilization and tenant density they do with CPUs.
To circumvent this limitation, Vultr used a combination of Nvidia’s vGPU Manager and Multi-Instance GPU functionality, which allows a single GPU to behave like several less powerful ones.
vGPUs use a technique called time slicing. This involves loading multiple workloads into GPU memory and then rapidly switching between them until they complete. Each workload technically has access to the full compute resources of the GPU – excluding memory – but its performance is bounded by its allocated execution time: the more vGPU instances carved from a card, the less time each one gets.
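That trade-off is easy to see in a toy model. The sketch below simulates a round-robin scheduler, which is a simplification of Nvidia's actual vGPU scheduling policies, to show how a workload's completion time stretches as more tenants share one GPU:

```python
from collections import deque

def time_slice(workloads, quantum=1):
    """Round-robin one GPU's compute across several workloads.

    workloads: dict mapping a name to its units of work remaining.
    Returns a dict mapping each name to the tick at which it finished.
    """
    queue = deque(workloads.items())
    finished, tick = {}, 0
    while queue:
        name, remaining = queue.popleft()
        # The GPU runs this workload for at most one quantum...
        tick += min(quantum, remaining)
        remaining -= quantum
        if remaining <= 0:
            finished[name] = tick            # workload complete
        else:
            queue.append((name, remaining))  # ...then context-switches away
    return finished

# Alone, an 8-unit job takes 8 ticks; sharing the GPU with three other
# identical tenants stretches the same job to 29 ticks.
solo = time_slice({"a": 8})
shared = time_slice({"a": 8, "b": 8, "c": 8, "d": 8})
print(solo["a"], shared["a"])  # 8 29
```

The model also ignores the context-switching cost discussed next, which only makes the sharing penalty worse in practice.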
These vGPUs are not without their challenges, context-switching overhead being the primary concern, since the GPU stops and starts each workload in rapid succession. If a vGPU is like one big machine that is very good at multitasking, MIG – introduced in 2020 with Nvidia’s GA100 GPU – takes a divide-and-conquer approach. (Or, perhaps more accurately, it’s a multi-core GPU on a monolithic chip that can pretend to be one big core when needed.) MIG allows a single A100 to be split into as many as seven separate GPU instances, each with 10 GB of video memory. And unlike vGPU, MIG is not defined in the hypervisor.
“It’s true hardware partitioning, where the hardware itself is memory-mapped to vGPUs and has a direct allocation of those resources,” Vultr COO David Gucker told The Next Platform. “That means there’s no possibility of noisy neighbors, and it’s as close as you can get to a literal physical map per virtual instance.”
In other words, while vGPU uses software to make one powerful GPU behave like many less powerful ones, MIG carves it up into several smaller ones at the hardware level.
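On supported hardware, the partitioning Gucker describes is driven by Nvidia's `nvidia-smi` tool. A rough sketch of the workflow (profile names and IDs vary by card, so list them first; these commands require root and an A100-class GPU):

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports (e.g. 1g.10gb)
sudo nvidia-smi mig -lgip

# Create two GPU instances from a profile, plus their compute
# instances (-C); "1g.10gb" is the smallest slice on an 80 GB A100
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb -C

# Confirm the resulting MIG devices are enumerated
nvidia-smi -L
```

Each resulting MIG device gets its own memory, cache, and compute slices, which is why Vultr can hand them to separate tenants without the noisy-neighbor risk of time slicing.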
Vultr is among the first to use either technology in the cloud to serve multi-tenant workloads. Its cheapest GPU instances, for example, use Nvidia’s vGPU Manager to divide each card into 10 or 20 individually addressable instances.
Meanwhile, its larger fractional instances take advantage of MIG, which Vultr claims provides stronger memory isolation and better quality of service. This is likely because, unlike vGPUs, MIG instances are not the product of software trickery but are effectively dedicated GPUs in their own right.
Virtualization for multi-GPU software development
Currently, Vultr Talon instances are limited to a single fractional GPU per instance, but according to Kardwell, there’s nothing stopping the cloud provider from deploying VMs with multiple vGPU or MIG instances attached.
“It’s a natural extension of what we’re doing in beta,” he said. “As we deploy the next wave of physical capability, we plan to deliver this capability as well.”
The ability to provision a virtual machine with two or more vGPU or MIG instances would significantly lower the barrier to entry for developers working on software designed to scale across large accelerated compute clusters.
And, at least according to a study recently published by VMware, GPU virtualization doesn’t seem to carry a significant performance penalty. The virtualization giant recently demonstrated “near or better than bare metal performance” using vGPUs running in vSphere. Testing showed this held even when scaling vGPU workloads across multiple physical GPUs connected by Nvidia’s NVLink interconnect. In theory, this means a large workload could be scaled to run on 1.5 GPUs, 10.5 GPUs, or 100.5 GPUs, for example, without leaving half a GPU idle.
So while Vultr is among the first to deploy this technology in a public cloud environment, the fact that it’s based on Nvidia’s AI Enterprise suite means it won’t be the last vendor to do so.