Towards On-the-Fly Snapshot Memory Compression for Low-Latency Elastic Inference Serving Systems

Authors

STOYANOV Radostin; SPIŠAKOVÁ Viktória; REBER Adrian; VAGIN Andrei; BRUNO Rodrigo

Year of publication 2026
Type Paper in proceedings
Conference The 6th Workshop on Machine Learning and Systems (EuroMLSys)
MU Faculty or unit

Faculty of Informatics

Citation
Doi https://doi.org/10.1145/3805621.3807612
Keywords Container Checkpoint/Restore; GPU Memory Compression; CRIU; Cold-start Latency; LLM Inference Serving
Description In-memory model caching and startup latency are key bottlenecks in large-scale AI serving systems, especially for GPU-accelerated large language model (LLM) inference in elastic, serverless environments. While container checkpointing enables hot starts, it introduces new challenges in memory footprint, storage bandwidth, and restore latency. Existing offline snapshot compression methods reduce snapshot size but add extra I/O, storage duplication, and decompression overhead. In this paper, we present CRIU-LZ4, a restore-optimized method for on-the-fly compression integrated directly into the CPU–GPU checkpoint and restore pipelines. Built atop CRIUgpu, CRIU-LZ4 performs page-level compression during memory transfer, eliminating intermediate artifacts and minimizing the latency on the restore critical path. Our evaluation results show that CRIU-LZ4 reduces cold-start latency by 46–59% and achieves up to 6× smaller snapshots compared to uncompressed GPU-aware checkpointing, while eliminating the decompression bottleneck of offline compression, significantly reducing both end-to-end restore time and peak disk usage.
