Justin Garrison

The Road to Seamless Cluster Upgrades

Posted on June 1, 2023  •  2 minutes  • 286 words

Description

You just finished cluster upgrades last week, and it’s already time to start planning for the next version. Will you ever get out of the constant upgrade cycle and back to your real work? Justin will show you practical tips to make your workloads and clusters easier to upgrade. You may not be able to eliminate upgrades, but you can minimize their potential impact and reduce the amount of coordination required.

Benefits to the ecosystem

There are a lot of gaps in content and documentation around running GPU-enabled workloads beyond the basics. Many examples show one workload but not how to scale it to the job sizes that real production ML requires.

Additional resources

I will be looking at clusters in complex environments with storage, lots of workloads, and more advanced things like service mesh and multi-cluster setups. I will look at different options for cluster management, but the focus is on process, workload, and cluster improvements that reduce upgrade overhead.

Notes

Dear talk reviewer. I wrote you a short story to break up the monotony of reviews. There once was a lamp who sat on a desk. It was electrifying, hot work, and the lamp had a bright future ahead. Until, one day, the poor lamp’s bulb burnt out. It was dark times. Mostly between 8 PM and 7 AM. But the lamp sat proudly until one day, the lamp got an LED replacement bulb. The lamp was full of new energy and the work required less resistance and wasn’t hot anymore. The lamp was so happy, and let their light shine doing their best work. Thank you for doing your best work and reviewing this (and many other) talks.

Status

Rejected
