Towards On-the-Fly Snapshot Memory Compression for Low-Latency Elastic Inference Serving Systems
| Authors | |
|---|---|
| Year of publication | 2026 |
| Type | Article in proceedings |
| Conference | The 6th Workshop on Machine Learning and Systems (EuroMLSys) |
| Faculty / MU workplace | |
| Citation | |
| DOI | https://doi.org/10.1145/3805621.3807612 |
| Keywords | Container Checkpoint/Restore; GPU Memory Compression; CRIU; Cold-start Latency; LLM Inference Serving |
| Description | In-memory model caching and startup latency are key bottlenecks in large-scale AI serving systems, especially for GPU-accelerated large language model (LLM) inference in elastic, serverless environments. While container checkpointing enables hot starts, it introduces new challenges in memory footprint, storage bandwidth, and restore latency. Existing offline snapshot compression methods reduce snapshot size but add extra I/O, storage duplication, and decompression overhead. In this paper, we present CRIU-LZ4, a restore-optimized method for on-the-fly compression integrated directly into the CPU–GPU checkpoint and restore pipelines. Built atop CRIUgpu, CRIU-LZ4 performs page-level compression during memory transfer, eliminating intermediate artifacts and minimizing latency on the restore critical path. Our evaluation shows that CRIU-LZ4 reduces cold-start latency by 46–59% and produces snapshots up to 6× smaller than uncompressed GPU-aware checkpointing. It also eliminates the decompression bottleneck of offline compression, significantly reducing both end-to-end restore time and peak disk usage. |
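
The idea of page-level, on-the-fly compression described in the abstract can be illustrated with a minimal sketch. This is not the CRIU-LZ4 implementation: it assumes a 4096-byte page size, uses Python's standard `zlib` as a stand-in for LZ4 (which is not in the standard library), and the length-prefixed framing and the function names `checkpoint_pages` and `restore_pages` are hypothetical. The point it shows is that each page is compressed as it is written to the snapshot stream and decompressed as it is read back, so no uncompressed intermediate snapshot is ever stored.

```python
import io
import struct
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only
HEADER = struct.Struct("<II")  # hypothetical framing: (raw length, compressed length)


def checkpoint_pages(memory: bytes, out) -> None:
    """Compress memory page by page and stream each page directly to `out`."""
    for offset in range(0, len(memory), PAGE_SIZE):
        page = memory[offset:offset + PAGE_SIZE]
        # zlib level 1 stands in for a fast codec such as LZ4 (assumption).
        compressed = zlib.compress(page, 1)
        out.write(HEADER.pack(len(page), len(compressed)))
        out.write(compressed)


def restore_pages(inp) -> bytes:
    """Read framed pages from `inp`, decompressing each one as it arrives."""
    restored = bytearray()
    while True:
        header = inp.read(HEADER.size)
        if not header:
            break  # end of snapshot stream
        raw_len, comp_len = HEADER.unpack(header)
        page = zlib.decompress(inp.read(comp_len))
        if len(page) != raw_len:
            raise ValueError("corrupt page in snapshot stream")
        restored += page
    return bytes(restored)


if __name__ == "__main__":
    # Toy "process memory": highly compressible, and not a multiple of PAGE_SIZE.
    memory = bytes(10000) + b"abc" * 1000
    snapshot = io.BytesIO()
    checkpoint_pages(memory, snapshot)
    snapshot.seek(0)
    print(restore_pages(snapshot) == memory)
```

Because compression happens inside the transfer loop, the only artifact written is the compressed stream itself; an offline approach would first write the raw snapshot and compress it in a separate pass.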