Monitoring and controlling network congestion is now critical to operating large-scale HPC, ML, and AI clusters. Amdahl’s law bounds the speedup a parallel workload can achieve by the fraction of the workload that must execute serially, which sets an effective maximum scale. However, Amdahl’s law treats that serial fraction as fixed and so ignores the cost of inter-process communication. In practice, the network congestion that comes with greater parallelism adds to the serial part of a workload, limiting scale further.
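To make the Amdahl's law argument concrete, here is a minimal sketch comparing the classic speedup bound with a variant in which congestion adds serial time as workers are added. The linear overhead term, the function names, and the parameter values are illustrative assumptions, not a model taken from the roundtable or from measured data.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Classic Amdahl's law: the serial fraction is fixed regardless of scale."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def congested_speedup(serial_fraction: float, workers: int,
                      congestion_per_worker: float) -> float:
    """Amdahl's law plus a hypothetical congestion term.

    Assumption: each additional worker adds a fixed amount of serialized
    waiting time (as a fraction of single-worker runtime). Real congestion
    behavior depends on topology, traffic pattern, and co-scheduled jobs.
    """
    overhead = congestion_per_worker * (workers - 1)
    return 1.0 / (serial_fraction + overhead + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    s = 0.05    # assumed serial fraction
    k = 0.0001  # assumed congestion cost per additional worker
    for n in (1, 16, 64, 256, 1024):
        ideal = amdahl_speedup(s, n)
        actual = congested_speedup(s, n, k)
        print(f"workers={n:5d}  amdahl={ideal:6.2f}  with_congestion={actual:6.2f}")
```

Under these assumed values, the classic curve flattens toward 1/s = 20, while the congested curve peaks at roughly a hundred workers and then declines, which is the sense in which congestion further limits workload scale.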
This roundtable, moderated by Hyperion Research, will offer real-world insights into the challenges of congestion in multi-workload environments. We will discuss the root causes of congestion and its ripple effects on performance and scale. We will share benchmarking approaches that help measure, and even predict, the impact of congestion in production. Finally, we will explore the strategies and techniques available to mitigate or eliminate that impact.