Overview

An overview of Kubeflow Trainer

What is Kubeflow Trainer

Kubeflow Trainer is a Kubernetes-native distributed AI platform for scalable training and fine-tuning of AI models, including large language models (LLMs), across a wide range of frameworks such as PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

Kubeflow Trainer brings MPI to Kubernetes, orchestrating multi-node, multi-GPU distributed jobs efficiently across high-performance computing (HPC) clusters. This enables high-throughput communication between processes, making it ideal for large-scale AI training that requires ultra-fast synchronization between GPU nodes.

Kubeflow Trainer seamlessly integrates with the Cloud Native AI ecosystem, including Kueue for topology-aware scheduling and multi-cluster job dispatching, as well as JobSet and LeaderWorkerSet for AI workload orchestration.

Kubeflow Trainer provides a distributed data cache designed to stream large-scale data with zero-copy transfer directly to GPU nodes. This ensures memory-efficient training jobs while maximizing GPU utilization.

With the Kubeflow Python SDK, AI practitioners can effortlessly develop and fine-tune LLMs while leveraging the Kubeflow Trainer APIs: TrainJob and Runtimes.
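To make the TrainJob API concrete, the sketch below shows what a minimal TrainJob manifest might look like. This is an illustrative example, not a verbatim spec: the runtime name `torch-distributed`, the node count, and the resource fields are assumptions; consult the Kubeflow Trainer API reference for the exact schema.

```yaml
# Hypothetical TrainJob manifest (field names are illustrative).
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-distributed-example
spec:
  runtimeRef:
    name: torch-distributed   # assumed name of an installed Training Runtime
  trainer:
    numNodes: 2               # scale training out across two nodes
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1     # one GPU per node
```

A Runtime (referenced via `runtimeRef`) encapsulates the framework-specific orchestration, so AI practitioners only describe what to train and at what scale.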

Kubeflow Trainer Tech Stack

Who is this for

User Personas

Kubeflow Trainer documentation is organized around the following user personas:

Kubeflow Trainer Personas

Kubeflow Trainer Introduction

Watch the following KubeCon + CloudNativeCon 2024 talk, which provides an overview of Kubeflow Trainer:

Check out the following KubeCon + CloudNativeCon talks for Kubeflow Trainer capabilities:

Additional talks:

Why use Kubeflow Trainer

Kubeflow Trainer supports key phases of the AI lifecycle, including model training and LLM fine-tuning, as shown in the diagram below:

AI Lifecycle Trainer

Key Benefits

  • 🚀 Simple, Scalable, and Built for LLM Fine-Tuning

Effortlessly scale from single-machine training to large, distributed Kubernetes clusters with Kubeflow’s Python APIs and supported Training Runtimes. Perfect for modern AI workloads.

  • 🔧 Extensible and Portable

Run Kubeflow Trainer on any cloud or on-premises Kubernetes cluster. Easily integrate your own ML frameworks, regardless of language or runtime, through a flexible, extensible API layer.

  • ⚡️ Distributed AI Data Caching

Powered by Apache Arrow and Apache DataFusion, Kubeflow Trainer streams tensors directly to GPU nodes via a distributed cache layer, enabling seamless access to large datasets, minimizing I/O overhead, and cutting GPU costs.

  • 🧠 LLM Fine-Tuning Blueprints

Accelerate your generative AI use-cases with ready-to-use Kubeflow LLM blueprints designed for efficient fine-tuning and deployment of LLMs on Kubernetes.

  • 💰 Optimized for GPU Efficiency

Reduce GPU costs through intelligent dataset streaming and model initialization. Kubeflow Trainer offloads data preprocessing and I/O to CPU workloads, ensuring GPUs stay focused on training.

  • ☸️ Native Kubernetes Integrations

Achieve optimal GPU utilization and coordinated scheduling for large-scale AI workloads. Kubeflow Trainer seamlessly integrates with Kubernetes ecosystem projects such as Kueue, Coscheduling, Volcano, and YuniKorn.
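As an example of one such integration, admitting a workload through Kueue typically amounts to labeling it with a local queue via the `kueue.x-k8s.io/queue-name` label. The sketch below assumes a queue named `team-a-queue` and illustrative TrainJob fields:

```yaml
# Illustrative sketch: admitting a TrainJob through a Kueue LocalQueue.
# The queue name and TrainJob fields are assumptions for illustration.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: gpu-finetune-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue  # Kueue admission label
spec:
  runtimeRef:
    name: torch-distributed
```

With this label in place, the job waits in the queue until Kueue admits it based on available quota, so GPUs are only claimed when the whole job can run.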

Next steps

Run your first Kubeflow TrainJob by following the Getting Started guide.
