Kubernetes Scheduling with Checkpoint/Restore: Challenges and Open Problems
| Authors | |
|---|---|
| Year of publication | 2026 |
| Type | Article in proceedings |
| Conference | Job Scheduling Strategies for Parallel Processing |
| MU Faculty or unit | |
| Citation | |
| DOI | https://doi.org/10.1007/978-3-032-10507-3_3 |
| Keywords | Checkpoint and Restore; Kubernetes; Containers; Resource Management; Scheduling |
| Description | Efficient resource management and scheduling have been persistent challenges since the early days of computing and remain critical to this day. The widespread adoption of containers managed by orchestrators like Kubernetes has introduced new dimensions to this challenge. Despite the lightweight nature and minimal overhead of containers, they still suffer from utilization inefficiencies due to overprovisioning. Existing scheduling techniques are not enough to meet these demands, and there is a growing need for orchestration and scheduling policies that support advanced preemption, migration, and fault tolerance. Well-established container checkpoint/restore (C/R) mechanisms, implemented through tools like CRIU, offer a promising solution for improving resource scheduling efficiency. However, these mechanisms remain only partially integrated with platforms like Kubernetes. In this paper, we explore the use cases for general C/R, examine the current state, and delve into the open problems and challenges associated with native integration into Kubernetes. We propose potential solutions to these challenges, offering a pathway towards more efficient resource management to better meet the needs of today's computational landscape. While scheduling efficiency is considered critical in HPC clusters, serverless and deep learning platforms also benefit directly from these optimizations. |
| Related projects: |