The Road to Seamless Cluster Upgrades

You just finished cluster upgrades last week, and it’s already time to start planning for the next version. Will you ever get out of the constant upgrade cycle and back to your real work? Justin will show you practical tips to make your workloads and clusters easier to upgrade. You may not be able to eliminate upgrades, but you can minimize their potential impact and reduce the amount of coordination they require.

Benefits to the ecosystem

There are many gaps in content and documentation around running GPU-enabled workloads beyond the basics. Many examples show a single workload but not how to scale it to the job sizes that real production ML requires.

Additional resources

I will look at clusters in complex environments with storage, many workloads, and more advanced components such as service mesh and multi-cluster setups. I will cover different options for cluster management, but the focus is on process, workload, and cluster improvements that reduce upgrade overhead.


