Tanzu / VCF · January 30, 2025

Accelerating AI/ML Workloads on VCF 5.2.1 – GPU-as-a-Service in Manufacturing

One of my most rewarding experiences was helping a semiconductor manufacturing client run GPU workloads efficiently in a private cloud. The client's GPUs were statically assigned, which left a significant share of that capacity underutilized.

What Changed with VCF 5.2.1: The release offered improved support for GPU resource pools and profiles. We deployed a workload domain dedicated to AI/ML workloads, with GPU passthrough and NVIDIA vGPU integration. Tanzu Kubernetes Grid supported the containerized pipelines, with Kubernetes clusters consuming GPU profiles through VM class definitions (https://docs.vmware.com/en/VMware-vSphere/8.0/vmware-vsphere-with-tanzu/GUID-B95EC2BC-9B89-4DAF-8D4B-46F4383AF76B.html).

TAM Engagement: I helped them create a vGPU-backed service catalog in VMware Cloud Director (VCD), exposing GPU profiles as tenant-consumable services with quotas, policies, and chargeback tracking.
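
To make the quota and chargeback idea concrete, here is a minimal sketch in plain Python. It does not call the VCD API, and it is not the client's actual catalog: the profile names, hourly rates, and tenant name are hypothetical placeholders chosen for illustration. It only shows the two checks a GPU service catalog needs, namely "is this request within the tenant's quota?" and "what does this usage cost?".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    name: str              # hypothetical catalog entry, not a real vGPU profile name
    rate_per_hour: float   # hypothetical internal price per vGPU-hour


@dataclass
class Tenant:
    name: str
    quota: dict            # profile name -> max concurrent vGPUs allowed


def within_quota(tenant, profile_name, in_use, requested):
    """Return True if the request keeps the tenant at or below its quota."""
    limit = tenant.quota.get(profile_name, 0)  # no entry means no entitlement
    return in_use + requested <= limit


def chargeback(usage_hours, catalog):
    """Sum the cost of a tenant's usage, given vGPU-hours per profile."""
    return sum(catalog[name].rate_per_hour * hours
               for name, hours in usage_hours.items())


# Hypothetical catalog and tenant, for illustration only.
catalog = {
    "small-inference": GpuProfile("small-inference", 0.5),
    "large-training": GpuProfile("large-training", 2.0),
}
tenant = Tenant("yield-analytics", {"small-inference": 8, "large-training": 2})

ok = within_quota(tenant, "large-training", in_use=1, requested=1)
denied = within_quota(tenant, "large-training", in_use=2, requested=1)
cost = chargeback({"small-inference": 4, "large-training": 10}, catalog)
```

In a real deployment the quota and metering data would come from the platform rather than from in-memory dictionaries; the sketch only fixes the shape of the logic.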

Aria Operations for Applications and Aria Operations for Logs were used to monitor resource trends, abnormal usage patterns, and potential bottlenecks.

Impact:

  • 80%+ GPU utilization across workloads
  • 50% faster deployment of AI training pipelines
  • Real-time visibility into per-tenant GPU usage and cost
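
For readers who want to see how a utilization figure like the one above can be derived, the following is a rough sketch, assuming utilization is defined as busy GPU-hours divided by allocated GPU-hours. The sample records and tenant names are invented for illustration and are not the client's data; in practice the inputs would come from the monitoring stack.

```python
from collections import defaultdict


def summarize_utilization(samples):
    """Aggregate (tenant, busy_gpu_hours, allocated_gpu_hours) records.

    Returns (overall_utilization, {tenant: utilization}), each as a
    fraction between 0 and 1.
    """
    busy = defaultdict(float)
    allocated = defaultdict(float)
    for tenant, busy_hours, allocated_hours in samples:
        busy[tenant] += busy_hours
        allocated[tenant] += allocated_hours

    # Skip tenants with no allocation to avoid dividing by zero.
    per_tenant = {t: busy[t] / allocated[t]
                  for t in allocated if allocated[t] > 0}

    total_allocated = sum(allocated.values())
    overall = sum(busy.values()) / total_allocated if total_allocated else 0.0
    return overall, per_tenant


# Invented sample records, for illustration only.
samples = [
    ("fab-a", 8.0, 10.0),
    ("fab-b", 3.0, 4.0),
    ("fab-a", 6.0, 6.0),
]
overall, per_tenant = summarize_utilization(samples)
```

Weighting by allocated hours, rather than averaging the per-tenant ratios, keeps a small tenant from skewing the overall number.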

Reference Documentation: