Towards On-the-Fly Snapshot Memory Compression for Low-Latency Elastic Inference Serving Systems

Warning

The publication does not fall under the Institute of Computer Science, but under the Faculty of Informatics. The official publication page is on the muni.cz website.
Authors

STOYANOV Radostin, SPIŠAKOVÁ Viktória, REBER Adrian, VAGIN Andrei, BRUNO Rodrigo

Year of publication 2026
Type Article in Proceedings
Conference The 6th Workshop on Machine Learning and Systems (EuroMLSys)
MU Faculty or unit

Faculty of Informatics

Citation
DOI https://doi.org/10.1145/3805621.3807612
Keywords Container Checkpoint/Restore; GPU Memory Compression; CRIU; Cold-start Latency; LLM Inference Serving
Description In-memory model caching and startup latency are key bottlenecks in large-scale AI serving systems, especially for GPU-accelerated large language model (LLM) inference in elastic, serverless environments. While container checkpointing enables hot starts, it introduces new challenges in memory footprint, storage bandwidth, and restore latency. Existing offline snapshot compression methods reduce snapshot size but add extra I/O, storage duplication, and decompression overhead. In this paper, we present CRIU-LZ4, a restore-optimized method for on-the-fly compression integrated directly into the CPU–GPU checkpoint and restore pipelines. Built atop CRIUgpu, CRIU-LZ4 performs page-level compression during memory transfer, eliminating intermediate artifacts and minimizing latency on the restore critical path. Our evaluation shows that CRIU-LZ4 reduces cold-start latency by 46–59% and achieves up to 6× smaller snapshots compared to uncompressed GPU-aware checkpointing. It also eliminates the decompression bottleneck of offline compression, significantly reducing both end-to-end restore time and peak disk usage.
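
The idea of page-level, on-the-fly compression described above can be illustrated with a minimal sketch. This is not the CRIU-LZ4 implementation: the page size, the 8-byte per-page header, and the function names `checkpoint` and `restore` are assumptions made for illustration, and Python's standard `zlib` stands in for LZ4 so that the example runs without third-party packages. Each page is compressed as it is written to the image stream, so no uncompressed intermediate snapshot is ever stored, and restore decompresses pages as they are read.

```python
import io
import struct
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def checkpoint(memory: bytes, out) -> int:
    """Compress each page as it is transferred and write it straight to the image.

    Returns the number of bytes written. zlib is a stand-in for LZ4 here.
    """
    written = 0
    for off in range(0, len(memory), PAGE_SIZE):
        page = memory[off:off + PAGE_SIZE]
        blob = zlib.compress(page, 1)  # fastest level, favouring low latency
        # Hypothetical framing: raw length and compressed length, little-endian.
        out.write(struct.pack("<II", len(page), len(blob)))
        out.write(blob)
        written += 8 + len(blob)
    return written


def restore(inp) -> bytes:
    """Read framed pages from the image and decompress them one at a time."""
    pages = []
    while True:
        header = inp.read(8)
        if not header:
            break  # end of image
        raw_len, comp_len = struct.unpack("<II", header)
        page = zlib.decompress(inp.read(comp_len))
        if len(page) != raw_len:
            raise ValueError("corrupt page in image")
        pages.append(page)
    return b"".join(pages)


if __name__ == "__main__":
    # Synthetic "memory": mostly zero pages plus a small non-zero region.
    memory = bytes(64 * PAGE_SIZE) + b"weights" * 1000
    image = io.BytesIO()
    size = checkpoint(memory, image)
    image.seek(0)
    assert restore(image) == memory
    print(f"raw={len(memory)} bytes, image={size} bytes")
```

Because every page is framed independently, a restore path built this way can begin decompressing as soon as the first page arrives instead of waiting for a separate whole-snapshot decompression pass.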
