zlib for R
Description
The zlib package for R aims to offer an R-based equivalent of Python's built-in zlib module for data compression and decompression. This package provides a suite of functions for working with zlib compression, including utilities for compressing and decompressing data streams, manipulating compressed files, and working with the gzip, zlib, and deflate formats.
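As a quick taste of the API, the window-bits argument selects the container format. This is a minimal sketch using the compressobj interface shown later in this README; the wbits values follow zlib's standard windowBits convention, and negative wbits for raw deflate is an assumption carried over from Python's zlib module:

library(zlib)
payload <- charToRaw("hello, world!")
# gzip container: windowBits = MAX_WBITS + 16
gz <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(gz$compress(payload), gz$flush())
# zlib container: windowBits = MAX_WBITS
zl <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS)
zlib_bytes <- c(zl$compress(payload), zl$flush())
# raw deflate: windowBits = -MAX_WBITS (assumption: negative wbits are accepted, as in Python's zlib)
df <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, -zlib$MAX_WBITS)
deflate_bytes <- c(df$compress(payload), df$flush())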
Usage
This example demonstrates how to use the zlib Rcpp module for chunked compression and decompression. We will take a string, write it to a temporary file, and then read it back into a raw vector. Then we will compress and decompress the data using the zlib Rcpp module.
Installation
To install the zlib package, use the following command:
install.packages("zlib")  # assuming zlib is available on CRAN or a similar repository
Code Example
First, make sure to load the zlib package:
library(zlib)
Writing and Reading a String to/from a Temporary File
# Create a temporary file
temp_file <- tempfile(fileext = ".txt")
# Generate example data and write to the temp file
example_data <- "This is an example string. It contains more than just 'hello, world!'"
writeBin(charToRaw(example_data), temp_file)
# Read data from the temp file into a raw vector
file_con <- file(temp_file, "rb")
raw_data <- readBin(file_con, "raw", file.info(temp_file)$size)
close(file_con)
Compressing the Data
# Create a Compressor object
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
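# (wbits = MAX_WBITS + 16 selects the gzip container, following zlib's windowBits
#  convention; MAX_WBITS alone would produce a zlib stream instead)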
# Initialize variables for chunked compression
chunk_size <- 1024
compressed_data <- raw(0)
# Compress the data in chunks
for (i in seq(1, length(raw_data), by = chunk_size)) {
chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]
compressed_chunk <- compressor$compress(chunk)
compressed_data <- c(compressed_data, compressed_chunk)
}
# Flush the compressor buffer
compressed_data <- c(compressed_data, compressor$flush())
Decompressing the Data
# Create a Decompressor object
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
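# (MAX_WBITS + 16 tells the decompressor to expect a gzip wrapper, matching the compressor above)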
# Initialize variable for decompressed data
decompressed_data <- raw(0)
# Decompress the data in chunks
for (i in seq(1, length(compressed_data), by = chunk_size)) {
chunk <- compressed_data[i:min(i + chunk_size - 1, length(compressed_data))]
decompressed_chunk <- decompressor$decompress(chunk)
decompressed_data <- c(decompressed_data, decompressed_chunk)
}
# Flush the decompressor buffer
decompressed_data <- c(decompressed_data, decompressor$flush())
Verifying the Results
# Convert decompressed raw vector back to string
decompressed_str <- rawToChar(decompressed_data)
# Should print TRUE
print(decompressed_str == example_data)
By following these steps, you can successfully compress and decompress data in chunks using the zlib Rcpp module.
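If you use this pattern often, it is easy to wrap in small helpers. Below is a minimal sketch built only on the compressobj/decompressobj API shown above; the helper names gzip_raw and gunzip_raw are our own illustration, not part of the package, and the helpers assume non-empty raw input:

gzip_raw <- function(data, chunk_size = 1024) {
  compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
  out <- raw(0)
  # Feed the input to the compressor chunk by chunk
  for (i in seq(1, length(data), by = chunk_size)) {
    out <- c(out, compressor$compress(data[i:min(i + chunk_size - 1, length(data))]))
  }
  c(out, compressor$flush())
}

gunzip_raw <- function(data, chunk_size = 1024) {
  decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
  out <- raw(0)
  # Feed the compressed bytes to the decompressor chunk by chunk
  for (i in seq(1, length(data), by = chunk_size)) {
    out <- c(out, decompressor$decompress(data[i:min(i + chunk_size - 1, length(data))]))
  }
  c(out, decompressor$flush())
}

# Round trip: should print TRUE
print(rawToChar(gunzip_raw(gzip_raw(charToRaw("round trip")))) == "round trip")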
Why Another Zlib Package for R?
You might wonder, "Why do we need another zlib package when R already has built-in methods for compression and decompression?" Let me clarify why I decided to develop this package.
The Problems with Built-in Methods
R's built-in functions like memDecompress and memCompress are fine for simple tasks, but they lack the robustness and flexibility needed for advanced use cases:

Handling Corrupt Data: Functions like memDecompress are unstable when dealing with gzip bytes that may be corrupt. If the data contains multiple header blocks or a truncated chunk, the function doesn't just fail; it can even crash your machine.

Incomplete Data Stream: A truncated chunk can hang your system:
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
rawToChar(memDecompress(compressed_data[1:300], type = "gzip"))
# Caused a hang-up
readLines(gzcon(rawConnection(compressed_data[1:300])))
# Warning in readLines(gzcon(rawConnection(compressed_data[1:300]))):
#   incomplete final line found in 'gzcon(compressed_data[1:300])'
# [1] "x\x9c\...."

Multiple Header Blocks: R's memDecompress doesn't handle gzip data with multiple headers well.
multi_header_compressed_data_15wbits <- c(
  memCompress(charToRaw("Hello World"), type = "gzip"),
  memCompress(charToRaw("Hello World"), type = "gzip")
)
rawToChar(memDecompress(multi_header_compressed_data_15wbits, type = "gzip"))
# Returns the content of only the first gzip member

Whereas using pigz or gzip through a pipe returns the expected concatenated string:
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_15wbits, tmp_file)
readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null", tmp_file), open = "rb"))
# works correctly, but with a warning:
# Warning in readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null",
#   incomplete final line found in 'pigz -c -d /tmp/Rtmp5IRHIZ/file44665a0e2eb2.gzip --verbose 2>/dev/null'
# [1] "Hello WorldHello World"
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb"))
# not working because of the wrong wbits (see next point)
# character(0)

GZIP File Format Specification: R's memCompress doesn't adhere strictly to the GZIP File Format Specification (RFC 1952), particularly regarding the usage of window bits.
memCompress("Hello World", type = "gzip")
# Only 15 wbits -> no gzip header or checksum
# [1] 78 9c f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 18 0b 04 1d
See the official GZIP File Format Specification.

Incorrect Behavior with Different wbits: The behavior of memCompress is inconsistent when different wbits are used for compression and decompression.
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
multi_header_compressed_data_31wbits <- c(
  c(compressor$compress(charToRaw("Hello World")), compressor$flush()),
  c(compressor$compress(charToRaw("Hello World")), compressor$flush())
)
readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits)))
# returns a single line if the gzip wbits are correct!
# Warning in readLines(gzcon(rawConnection(wrong_compressed_data_31wbit))):
#   incomplete final line found in 'gzcon(wrong_compressed_data_31wbit)'
# [1] "Hello World"
readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
# Warning in readLines(gzcon(rawConnection(wrong_compressed_data))):
#   line 1 appears to contain a nul character
# Warning in readLines(gzcon(rawConnection(wrong_compressed_data))):
#   incomplete final line found in 'gzcon(wrong_compressed_data)'
# [1] "x\x9c\xf3H\xcd\xc9\xc9W\b\xcf/\xcaI\001"
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_31wbits, tmp_file)
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb"))
# the gzip pipe works with the correct wbits:
# Warning in readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file),
#   incomplete final line found in 'gzip -d /tmp/RtmpPIZPMP/file6eed29844dc4.gzip --verbose --stdout'
# [1] "Hello WorldHello World"

No Streaming Support: There's no native way to handle gzip streams from REST APIs or other data streams without creating temporary files or implementing cumbersome workarounds (e.g. with pipes and temp files).
What My Package Offers
Robustness: Built to handle even corrupted or incomplete gzip data efficiently without causing system failures.
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
decompressor <- zlib$decompressobj(zlib$MAX_WBITS)
rawToChar(c(decompressor$decompress(compressed_data[1:300]), decompressor$flush()))
# Still working

Compliance: Strict adherence to the GZIP File Format Specification, ensuring compatibility across systems.
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
c(compressor$compress(charToRaw("Hello World")), compressor$flush())
# Correct 31 wbits (or custom wbits you provide)
# [1] 1f 8b 08 00 00 00 00 00 00 03 f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00

Flexibility: Ability to manage gzip streams from REST APIs without the need for temporary files or other workarounds.
# Byte-range request and decompression in chunks
# Initialize the decompressor
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
# Define the URL and initial byte ranges
url <- "https://example.com/api/data.gz"
range_start <- 0
range_increment <- 5000  # Adjust based on desired chunk size
# Placeholder for the decompressed content
decompressed_content <- character(0)
# Loop to make multiple requests and decompress chunk by chunk
for (i in 1:5) {  # Adjust the loop count based on the number of chunks you want to retrieve
  range_end <- range_start + range_increment
  # Make a byte-range request
  response <- httr::GET(url, httr::add_headers(`Range` = paste0("bytes=", range_start, "-", range_end)))
  # Check if the request was successful
  if (httr::http_type(response) != "application/octet-stream" || httr::http_status(response)$category != "Success") {
    stop("Failed to retrieve data.")
  }
  # Decompress the received chunk
  compressed_data <- httr::content(response, "raw")
  decompressed_chunk <- decompressor$decompress(compressed_data)
  decompressed_content <- c(decompressed_content, rawToChar(decompressed_chunk))
  # Update the byte range for the next request
  range_start <- range_end + 1
}
# Flush the decompressor after all chunks have been processed
final_data <- decompressor$flush()
decompressed_content <- c(decompressed_content, rawToChar(final_data))
In summary, while R's built-in methods may someday catch up in functionality, for now my zlib package fills an important gap by providing a more robust and flexible way to handle compression and decompression tasks.
A Little Benchmark
The following benchmark compares the performance of the zlib package with the built-in memCompress and memDecompress functions.
The benchmark was run on a Latitude 7430 with a 12th Gen Intel(R) Core(TM) i5-1245U (12) @ 4.4 GHz Processor and 32 GB of RAM.
library(zlib)
example_data <- charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = "\n"))
microbenchmark::microbenchmark({
compressed_data <- memCompress(example_data, type="gzip")
decompressed_data <- memDecompress(compressed_data, type="gzip")
}, {
compressor <- zlib$compressobj()
compressed_data <- c(compressor$compress(example_data), compressor$flush())
decompressor <- zlib$decompressobj()
decompressed_data <- c(decompressor$decompress(compressed_data), decompressor$flush())
}, times = 5000)
# Row 1: memCompress()/memDecompress(); Row 2: zlib compressobj()/decompressobj()
#     min       lq     mean   median       uq      max neval
# 277.041 323.6640 408.4731 363.7165 395.5025 7931.280  5000
# 203.626 255.7815 308.8654 297.2095 337.2320 6864.512  5000
Future Enhancements
We've identified some exciting opportunities for extending the capabilities of this library. While these features are not currently planned for immediate development, we're open to collaboration or feature requests to bring these ideas to life.
Gztool
Gztool specializes in indexing, compressing, and data retrieval for GZIP files. With Gztool integration, you could create lightweight indexes for your gzipped files, enabling you to extract data more quickly and randomly. This would eliminate the need to decompress large gzip files entirely just to access specific data at the end of the file.
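For illustration only, such an integration might begin by shelling out to the gztool binary from R. This is a hypothetical sketch: it assumes gztool is installed on the PATH, the file name is made up, and the -i/-b flags should be verified against your gztool version:

# Hypothetical sketch: build an index once, then read from an arbitrary
# uncompressed byte offset without decompressing the whole file.
gz_file <- "big_logs.gz"                      # example file name (assumption)
system2("gztool", c("-i", gz_file))           # create an index file alongside the gzip
# Extract text starting at uncompressed byte offset 1000000 using the index
chunk <- system2("gztool", c("-b", "1000000", gz_file), stdout = TRUE)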
Pugz
Pugz offers parallel decompression of gzipped text files. It employs a truly parallel algorithm that works in two passes, significantly accelerating the decompression process. This could be a valuable addition for users dealing with large datasets and seeking more efficient data processing.
If any of these feature enhancements interest you, or if you have other suggestions for improving the library, feel free to reach out for collaboration.
Dependencies
Software Requirements
Libraries
- zlib
Development
Installing Dependencies on Ubuntu
sudo apt-get update
sudo apt-get install cmake ninja-build r-base libblas-dev liblapack-dev build-essential
Installing Dependencies on Red Hat
sudo yum update
sudo yum install cmake ninja-build R libblas-devel liblapack-devel gcc-c++
Building
- Clone the repository:
git clone https://github.com/yourusername/zlib.git
- Install the package locally:
make install
- Build the package locally:
make build
License
This project is licensed under the MIT License - see the LICENSE file for details.