Opportunistic Resource Reclamation in Kubernetes: From Aggressive Resizing to Flash Jobs
| Authors | |
|---|---|
| Year of publication | 2026 |
| Type | Article in Proceedings |
| Conference | Job Scheduling Strategies for Parallel Processing |
| MU Faculty or unit | |
| Citation | |
| Keywords | Kubernetes; Resource Management; Resource Utilization; In-place Resizing |
| Description | Modern cloud data centers suffer from chronic resource under-utilization. The gap between static resource allocations and dynamic workload demand creates systemic inefficiency that current orchestration platforms fail to address adequately. In this work, we explore resource reclamation strategies in production Kubernetes clusters using two emerging infrastructure-level primitives: in-place resource resizing and transparent checkpoint/restore (C/R). For CPU resources, we analyze a production workload trace, which we release publicly, and reveal significant allocation-utilization gaps. Through trace-driven simulation, we demonstrate that aggressive in-place resizing substantially increases both resource utilization and workload evictions. We find a balanced strategy for in-place resizing and identify C/R as the missing primitive that makes aggressive resizing safe by enabling graceful termination and resumable migrations instead of progress loss. For GPU resources, where dynamic resizing is infeasible, we propose a C/R-enabled sharing strategy that allocates reserved-but-idle GPU memory to secondary workloads (flash jobs) with safety guarantees for reclamation. Our work demonstrates how the same infrastructure primitives address resource reclamation across different resource types, each with distinct technical constraints, validated through real production cluster deployments. |