
OpenZL: A Universal Open Source Framework for Format-Aware Data Compression with High Performance and Flexibility


Today we are proud to announce the public release of OpenZL, an open source data compression framework designed for structured data. OpenZL delivers lossless compression with performance rivaling specialized compressors by intelligently applying a configurable sequence of transformations tailored to a file's structure. Although different data types take different transformation paths, every OpenZL file can be decompressed with a single universal decompressor.

The journey began with Zstandard, which revolutionized datacenter compression by combining strong entropy coding with modern CPU optimization. While Zstandard has evolved significantly, the returns on further improvements within its framework are diminishing. This led us to explore a new direction: leveraging the inherent structure of data to unlock greater compression gains. General-purpose compressors treat data as raw bytes, missing opportunities hidden in format, type, and repetition. Format-aware compressors can outperform them, but at the cost of complexity: each new format requires a custom compressor and decompressor. OpenZL resolves this tension by enabling format-specific compression while maintaining a single, universal decompressor.

OpenZL makes structure explicit. Users provide a data shape, via a preset or a lightweight format description, and an offline trainer generates an optimized compression configuration. At encode time this plan is resolved into a concrete execution recipe, embedded directly in the compressed data. The universal decoder reads this recipe and executes it without needing external metadata.

For example, when compressing sao from the Silesia Corpus, a structured file of star records, OpenZL outperforms both zstd and xz. On an M1 CPU, it achieves a compression ratio of 2.06×, beating zstd's 1.31× and xz's 1.64×, while compressing at 340 MB/s and decompressing at 1,200 MB/s, significantly faster than xz.
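The "recipe embedded in the frame" idea can be illustrated with a toy scheme. This is a minimal sketch only, not OpenZL's actual frame format or API: a tiny header names the transforms that were applied, so one generic decoder can invert any pipeline by running the inverses in reverse order. The transform names and framing here are hypothetical.

```python
# Toy "recipe in the frame" scheme (illustrative only, NOT OpenZL's format):
# the encoder stores the list of applied transforms in a small header, so a
# single generic decoder can undo any pipeline without external metadata.
import json
import zlib


def delta_enc(b: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out, prev = bytearray(len(b)), 0
    for i, v in enumerate(b):
        out[i] = (v - prev) & 0xFF
        prev = v
    return bytes(out)


def delta_dec(b: bytes) -> bytes:
    """Invert delta_enc via a running prefix sum (mod 256)."""
    out, prev = bytearray(len(b)), 0
    for i, v in enumerate(b):
        prev = (prev + v) & 0xFF
        out[i] = prev
    return bytes(out)


# Registry of reversible transforms: name -> (forward, inverse).
TRANSFORMS = {
    "delta": (delta_enc, delta_dec),
    "zlib": (zlib.compress, zlib.decompress),
}


def compress(data: bytes, recipe: list) -> bytes:
    """Apply each transform in order, then prepend the recipe as a header."""
    for name in recipe:
        data = TRANSFORMS[name][0](data)
    header = json.dumps(recipe).encode()
    return len(header).to_bytes(2, "big") + header + data


def decompress(frame: bytes) -> bytes:
    """Universal decoder: read the recipe from the frame and invert it."""
    hlen = int.from_bytes(frame[:2], "big")
    recipe = json.loads(frame[2 : 2 + hlen])
    data = frame[2 + hlen :]
    for name in reversed(recipe):
        data = TRANSFORMS[name][1](data)
    return data
```

The decoder never needs to know which recipe the encoder chose; it simply follows what the frame says, which is the property that lets OpenZL ship one decompressor for all plans.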
The process starts by separating the header and converting the record array into a structure of arrays. Each field becomes a homogeneous stream, enabling targeted compression strategies. The trainer then explores optimal transformation sequences, using clustering to group similar fields and graph search to evaluate configurations. Users can describe data using SDDL (Simple Data Description Language), or write custom parsers in supported languages. From this description, the trainer generates a compression plan with tradeoffs across speed, ratio, and decompression performance.

At encode time, the plan becomes a resolved graph, with optional control points that make dynamic decisions based on lightweight statistics, such as string repetition or delta variance, without sacrificing speed. This enables runtime adaptation: the system can respond to data shifts, outliers, or seasonal patterns without unbounded exploration. The chosen path is recorded in the frame, so the decoder simply follows it; no coordination is needed.

The universal decoder is a key innovation. It supports all OpenZL-compressed data, regardless of format or plan. This means:

- Security and correctness reviews focus on one binary.
- Performance and safety updates benefit all data, old and new.
- Operations remain simple: one CLI, one set of metrics, one rollout process.
- Compression plans can be continuously retrained and deployed without breaking backward compatibility.

Results show OpenZL excels on structured data: tabular, columnar, and nested formats like Parquet, CSV, and time-series. On the ERA5 Flux and Binance datasets, it achieves higher compression ratios at faster speeds than general-purpose tools. For CSV, parsing overhead limits speed, but OpenZL falls back to zstd, ensuring performance is always bounded. OpenZL does not help with unstructured data like plain text (e.g., enwik or dickens), where it defaults to zstd. It is ideal for vector data, ML tensors, database tables, and time-series.
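The first step of the pipeline above, turning an array of fixed-size records into one homogeneous stream per field (array-of-structs to struct-of-arrays), can be sketched as follows. The field layout is hypothetical, not the actual sao record format, and this is illustrative code rather than OpenZL's implementation.

```python
# Illustrative struct-of-arrays transform (NOT OpenZL's implementation):
# transpose packed fixed-size records into per-field streams so each field
# can get its own targeted compression strategy. Field layout is made up.
import struct

# Hypothetical record: id:int32, magnitude:float32, right_ascension:float64.
RECORD = struct.Struct("<i f d")  # 16 bytes per record, little-endian


def to_streams(blob: bytes):
    """Split packed records into three homogeneous per-field byte streams."""
    ids, mags, ras = [], [], []
    for rid, mag, ra in RECORD.iter_unpack(blob):
        ids.append(rid)
        mags.append(mag)
        ras.append(ra)
    return (
        struct.pack(f"<{len(ids)}i", *ids),
        struct.pack(f"<{len(mags)}f", *mags),
        struct.pack(f"<{len(ras)}d", *ras),
    )


def from_streams(ids_b: bytes, mags_b: bytes, ras_b: bytes) -> bytes:
    """Inverse transform: interleave the field streams back into records."""
    n = len(ids_b) // 4
    ids = struct.unpack(f"<{n}i", ids_b)
    mags = struct.unpack(f"<{n}f", mags_b)
    ras = struct.unpack(f"<{n}d", ras_b)
    return b"".join(RECORD.pack(i, m, r) for i, m, r in zip(ids, mags, ras))
```

After this split, each stream holds values of a single type with similar statistics, which is what makes follow-on transforms like delta coding and per-field entropy coding effective.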
OpenZL works best when data has exploitable structure. Future work includes expanding the transform libraries for time-series and grid data, improving training efficiency, enhancing SDDL for nested formats, and building smarter plan explorers. We invite the community to try OpenZL on structured datasets, contribute new plans, parsers, or codecs, and help expand the ecosystem. The source code, documentation, and examples are available on GitHub. Join us in shaping the future of intelligent, format-aware compression.
