Compressing Strings Using GZip in C#

Compressing data is a great way to reduce its size. This helps us reduce storage requirements as well as the bandwidth and latency of network transmissions.

There are many different compression algorithms, but here, we’ll focus on GZip. We will use the .NET Framework’s own GZipStream class (in the System.IO.Compression namespace), although it is also possible to use a third-party library such as SharpZipLib. We’ll also focus explicitly on compressing and decompressing strings; the steps to deal with other types (such as byte arrays or streams) will be a little different.

Compressing Data with GZipStream

In its simplest form, GZipStream takes an underlying stream and a compression mode as parameters. The compression mode determines whether you want to compress or decompress; the underlying stream is written to or read from according to that compression mode.

            // Needs: using System.IO; using System.IO.Compression; using System.Text;
            string inputStr = "Hello world!";
            byte[] inputBytes = Encoding.UTF8.GetBytes(inputStr);

            using (var outputStream = new MemoryStream())
            {
                // Disposing the GZipStream flushes the remaining compressed data
                using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                    gZipStream.Write(inputBytes, 0, inputBytes.Length);

                // TODO do something with the outputStream
            }

In the code above, we are using a memory stream as our underlying output stream. The GZipStream effectively wraps the output stream. When we write our input data into the GZipStream, it goes into the output stream as compressed data. By wrapping the write operation in a using block by itself, we ensure that the GZipStream is disposed, and therefore that all of the compressed data has been flushed to the output stream, before we go on to use that stream.
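To see why the inner using block matters, here is a minimal sketch that compares how many bytes have reached the output stream before and after the GZipStream is disposed. The class and method names (FlushDemo, LengthBeforeDispose, LengthAfterDispose) are made up for illustration, and the exact numbers printed may differ between .NET versions.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class FlushDemo
{
    // Length of the output stream while the GZipStream is still open.
    public static long LengthBeforeDispose(string text)
    {
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);

        using (var outputStream = new MemoryStream())
        using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
        {
            gZipStream.Write(inputBytes, 0, inputBytes.Length);

            // Some of the compressed data is still buffered inside the GZipStream
            return outputStream.Length;
        }
    }

    // Length of the output once the GZipStream has been disposed.
    public static long LengthAfterDispose(string text)
    {
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);

        using (var outputStream = new MemoryStream())
        {
            using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                gZipStream.Write(inputBytes, 0, inputBytes.Length);

            // Disposing the GZipStream also closed the MemoryStream,
            // but ToArray still works on a closed MemoryStream
            return outputStream.ToArray().Length;
        }
    }

    static void Main()
    {
        Console.WriteLine("Before dispose: " + LengthBeforeDispose("Hello world!"));
        Console.WriteLine("After dispose:  " + LengthAfterDispose("Hello world!"));
    }
}
```

The first number should be smaller than the second, because the end of the compressed data is only written when the GZipStream is disposed.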

Let’s add some code to take the bytes from the output stream and write them to the console window:

            string inputStr = "Hello world!";
            byte[] inputBytes = Encoding.UTF8.GetBytes(inputStr);

            using (var outputStream = new MemoryStream())
            {
                using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                    gZipStream.Write(inputBytes, 0, inputBytes.Length);

                var outputBytes = outputStream.ToArray();

                var outputStr = Encoding.UTF8.GetString(outputBytes);
                Console.WriteLine(outputStr);

                Console.ReadLine();
            }

The output of this may be a little bit surprising: instead of readable text, the console shows a handful of garbled characters.

The bytes resulting from the GZip compression are actually binary data. They are not intelligible when rendered, and decoding them as UTF-8 text is lossy, because many byte sequences are not valid UTF-8. They may also cause problems when transmitted through anything that expects text (a JSON or XML document, for instance). One way to deal with this is to encode the compressed bytes in base64:

            string inputStr = "Hello world!";
            byte[] inputBytes = Encoding.UTF8.GetBytes(inputStr);

            using (var outputStream = new MemoryStream())
            {
                using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                    gZipStream.Write(inputBytes, 0, inputBytes.Length);

                var outputBytes = outputStream.ToArray();

                var outputBase64 = Convert.ToBase64String(outputBytes);
                Console.WriteLine(outputBase64);

                Console.ReadLine();
            }

Base64, however, is far from a compact representation: it produces four characters for every three bytes. In this specific example, the output grows from 32 bytes (binary) to 44 characters (base64), reducing the effectiveness of compression. However, for larger strings, this still represents significant savings over the plain, uncompressed string.
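The difference between the two text representations can be checked with a short sketch. It assumes a hypothetical EncodingDemo class (not part of the original code) that compresses a string, then tests whether the compressed bytes survive a round trip through UTF-8 text and through base64.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static class EncodingDemo
{
    // Same compression steps as in the article, wrapped in a method.
    public static byte[] Compress(string text)
    {
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);

        using (var outputStream = new MemoryStream())
        {
            using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                gZipStream.Write(inputBytes, 0, inputBytes.Length);

            return outputStream.ToArray();
        }
    }

    // Bytes -> UTF-8 string -> bytes. Invalid sequences get replaced on the way.
    public static bool SurvivesUtf8(byte[] data)
    {
        byte[] roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));
        return roundTripped.SequenceEqual(data);
    }

    // Bytes -> base64 string -> bytes. Base64 can represent any byte value.
    public static bool SurvivesBase64(byte[] data)
    {
        byte[] roundTripped = Convert.FromBase64String(Convert.ToBase64String(data));
        return roundTripped.SequenceEqual(data);
    }

    static void Main()
    {
        byte[] compressed = Compress("Hello world!");

        Console.WriteLine("Survives UTF-8 round trip:  " + SurvivesUtf8(compressed));
        Console.WriteLine("Survives base64 round trip: " + SurvivesBase64(compressed));
        Console.WriteLine("Binary length: " + compressed.Length);
        Console.WriteLine("Base64 length: " + Convert.ToBase64String(compressed).Length);
    }
}
```

The UTF-8 round trip fails because the second byte of every GZip stream (0x8B) is not valid on its own in UTF-8, so it is replaced during decoding. Base64 preserves every byte, at the cost of the extra length.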

Which brings us to the next question: why is our compressed data (32 bytes) so much larger than our uncompressed data (12 bytes)? Without going into how the GZip algorithm works internally, the format itself adds a header and trailer of at least 18 bytes, and compression algorithms generally work best on larger data where there is a lot of repetition. On a very small string, the overhead required by the compressed format dwarfs the data itself, negating the benefits of compression. Thus, compression should typically be applied only to data whose length exceeds some threshold, which is best chosen by measuring your own data.
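A quick way to get a feel for that threshold is to measure. The sketch below assumes a hypothetical SizeComparison class (a name chosen for this example) and compares the compressed size of our tiny string against a long, highly repetitive one. The exact sizes depend on the .NET version, so none are shown here.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static class SizeComparison
{
    // Number of bytes in the UTF-8 encoding of the text.
    public static int RawLength(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes after GZip compression.
    public static int CompressedLength(string text)
    {
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);

        using (var outputStream = new MemoryStream())
        {
            using (var gZipStream = new GZipStream(outputStream, CompressionMode.Compress))
                gZipStream.Write(inputBytes, 0, inputBytes.Length);

            return outputStream.ToArray().Length;
        }
    }

    static void Main()
    {
        string small = "Hello world!";
        string large = string.Concat(Enumerable.Repeat("Hello world! ", 1000));

        Console.WriteLine("Small: " + RawLength(small) + " -> " + CompressedLength(small));
        Console.WriteLine("Large: " + RawLength(large) + " -> " + CompressedLength(large));
    }
}
```

The small string grows when compressed, while the repetitive one shrinks dramatically. Real data usually falls somewhere in between.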

Decompressing Data with GZipStream

When decompressing, the underlying stream is an input stream. The GZipStream still wraps it, but the flow is inverted so that when you read data from the GZipStream, it translates compressed data into uncompressed data.

The basic workflow looks something like this:

            string inputStr = "H4sIAAAAAAAAC/NIzcnJVyjPL8pJUQQAlRmFGwwAAAA=";
            byte[] inputBytes = Convert.FromBase64String(inputStr);

            using (var inputStream = new MemoryStream(inputBytes))
            using (var gZipStream = new GZipStream(inputStream, CompressionMode.Decompress))
            {
                // TODO read the gZipStream
            }

There are different ways to implement this, even if we just focus on decompressing from a string to a string. However, a low-level buffer read such as the following will not work:
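As a sketch of the kind of read meant here, assume code that sizes its buffer from the GZipStream’s Length property. The LengthReadDemo class and its method are hypothetical names for this illustration; the exception is caught only so that the failure is easy to observe.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class LengthReadDemo
{
    // Tries to read the whole decompressed payload in one go,
    // sizing the buffer from gZipStream.Length.
    public static string TryReadUsingLength(string base64)
    {
        byte[] inputBytes = Convert.FromBase64String(base64);

        using (var inputStream = new MemoryStream(inputBytes))
        using (var gZipStream = new GZipStream(inputStream, CompressionMode.Decompress))
        {
            try
            {
                // GZipStream does not support Length, so this line throws
                var buffer = new byte[gZipStream.Length];
                gZipStream.Read(buffer, 0, buffer.Length);

                return Encoding.UTF8.GetString(buffer);
            }
            catch (NotSupportedException)
            {
                return "NotSupportedException";
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(TryReadUsingLength("H4sIAAAAAAAAC/NIzcnJVyjPL8pJUQQAlRmFGwwAAAA="));
    }
}
```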

The Length property is not supported by GZipStream, so the above code throws a NotSupportedException at runtime. We cannot use the length of the inputStream in its stead, because that is the compressed length, which will generally not be the same as the decompressed length (a buffer of that size happens to be big enough for this “Hello world!” example, but it won’t be if you try a longer string). Rather than read the entire length of the buffer, you could read block by block until you reach the end of the stream. But that’s more work than you need, and I’m lazy.
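For completeness, the block-by-block read could look something like the following sketch. The BlockReader class name, the 4096-byte buffer size and the use of a List<byte> to collect the output are all choices made for this example rather than anything GZipStream requires.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

static class BlockReader
{
    public static string Decompress(byte[] compressed)
    {
        var result = new List<byte>();

        using (var inputStream = new MemoryStream(compressed))
        using (var gZipStream = new GZipStream(inputStream, CompressionMode.Decompress))
        {
            var buffer = new byte[4096];
            int bytesRead;

            // Read returns 0 once the end of the stream has been reached.
            // It may return fewer bytes than requested, so always use bytesRead.
            while ((bytesRead = gZipStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                    result.Add(buffer[i]);
            }
        }

        return Encoding.UTF8.GetString(result.ToArray());
    }

    static void Main()
    {
        byte[] inputBytes = Convert.FromBase64String("H4sIAAAAAAAAC/NIzcnJVyjPL8pJUQQAlRmFGwwAAAA=");
        Console.WriteLine(Decompress(inputBytes));
    }
}
```

The bytes are decoded into a string only after everything has been read, so a multi-byte UTF-8 character that happens to straddle two blocks is not split.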

One way to get this working with very little effort is to introduce a third stream, and copy the GZipStream into it:

            string inputStr = "H4sIAAAAAAAAC/NIzcnJVyjPL8pJUQQAlRmFGwwAAAA=";
            byte[] inputBytes = Convert.FromBase64String(inputStr);

            using (var inputStream = new MemoryStream(inputBytes))
            using (var gZipStream = new GZipStream(inputStream, CompressionMode.Decompress))
            using (var outputStream = new MemoryStream())
            {
                gZipStream.CopyTo(outputStream);
                var outputBytes = outputStream.ToArray();

                string decompressed = Encoding.UTF8.GetString(outputBytes);

                Console.WriteLine(decompressed);
                Console.ReadLine();
            }

An even more concise approach is to use StreamReader:

            string inputStr = "H4sIAAAAAAAAC/NIzcnJVyjPL8pJUQQAlRmFGwwAAAA=";
            byte[] inputBytes = Convert.FromBase64String(inputStr);

            using (var inputStream = new MemoryStream(inputBytes))
            using (var gZipStream = new GZipStream(inputStream, CompressionMode.Decompress))
            using (var streamReader = new StreamReader(gZipStream))
            {
                var decompressed = streamReader.ReadToEnd();

                Console.WriteLine(decompressed);
                Console.ReadLine();
            }

…and without too much effort, we have our decompressed output: the original “Hello world!” string.

Now again, your mileage may vary depending on what you’re doing. For instance, you might opt to use the asynchronous versions of stream manipulation methods if you’re dealing with streams that aren’t memory streams (e.g. a file). Or you might want to work exclusively with bytes rather than converting back to a string. In any case, hopefully the code in this article will give you a head start when you need to compress and decompress some data.
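As a rough illustration of the asynchronous route, the sketch below compresses one file into another and reads a compressed file back as a string. The AsyncGZipFiles class, its method names and the file paths passed to it are assumptions made for this example; only CopyToAsync and ReadToEndAsync come from the framework.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Threading.Tasks;

static class AsyncGZipFiles
{
    // Streams the source file through a GZipStream into the destination file.
    public static async Task CompressFileAsync(string sourcePath, string destinationPath)
    {
        using (var source = File.OpenRead(sourcePath))
        using (var destination = File.Create(destinationPath))
        using (var gZipStream = new GZipStream(destination, CompressionMode.Compress))
        {
            await source.CopyToAsync(gZipStream);
        }
    }

    // Reads a GZip-compressed file and decodes its contents as UTF-8 text.
    public static async Task<string> DecompressFileToStringAsync(string path)
    {
        using (var source = File.OpenRead(path))
        using (var gZipStream = new GZipStream(source, CompressionMode.Decompress))
        using (var streamReader = new StreamReader(gZipStream, Encoding.UTF8))
        {
            return await streamReader.ReadToEndAsync();
        }
    }

    static void Main()
    {
        // Temporary files stand in for real input and output paths
        string plainPath = Path.GetTempFileName();
        string compressedPath = Path.GetTempFileName();

        File.WriteAllText(plainPath, "Hello world!");

        CompressFileAsync(plainPath, compressedPath).GetAwaiter().GetResult();
        Console.WriteLine(DecompressFileToStringAsync(compressedPath).GetAwaiter().GetResult());

        File.Delete(plainPath);
        File.Delete(compressedPath);
    }
}
```

In a real application you would await these methods from asynchronous code rather than block on them as Main does here.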