Thursday, September 23, 2004

"Perpetual motion" in data compression?

Roger Howorth writes:


But there is a danger that people could obtain patents for technologies that they have not yet created. This scenario is not as unlikely as one might think.

Well, I became interested in the patent-free Gzip compression algorithm once I noticed that Microsoft had begun using it. It was at a press briefing about Exchange 2003 recently that I found out that the new version of Microsoft's groupware uses Gzip compression to speed up network performance from the email server to certain types of client device.

Now, I had come across Gzip on various Linux and Unix systems, and in popular Windows tools, such as WinZip, and I guess Microsoft must have first used it for the Compressed Folder feature in Windows XP. So I was interested that Microsoft's use of open compression is growing.

I was particularly interested because Gzip was developed specifically as an open compression format (IETF RFC 1952) that does not contain any patented technologies. Even though Microsoft did not simply go off and grab a ready-to-run version of Gzip from the web, it surely chose Gzip because it's an open standard that does not include patented technologies.

Well good for Microsoft.

I guess it chose Gzip so that both Microsoft-designed devices and third-party devices could connect to Exchange using the same compression. And if there is irony in Microsoft choosing a patent-free compression algorithm, it is not the only irony here.

Among the rather dubious patents issued in the past by the US are several for perpetual motion machines. Eventually the folks at the US patent office came to their senses and stopped issuing patents for what is commonly accepted to be something that is impossible to invent.

But this has not stopped them from allowing several patents on compression algorithms that claim to be able to compress random data. For the record, storage experts agree that it's not possible to compress random data.
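Roger's closing claim is easy to check empirically with the very algorithm he mentions. A minimal sketch in Python (my illustration, not from the article; the 100,000-byte sample size is arbitrary): feeding random bytes to gzip reliably makes them *larger*, because the format's header, trailer, and block overhead cost more than any structure the compressor can find in random input.

```python
import gzip
import os

# A block of random bytes -- by definition, no exploitable structure.
data = os.urandom(100_000)

compressed = gzip.compress(data)

# gzip (RFC 1952) adds a header, trailer, and per-block overhead,
# so random input comes out slightly larger than it went in.
print(len(data), len(compressed))
assert len(compressed) > len(data)

# The round trip is still lossless, of course.
assert gzip.decompress(compressed) == data
```

Any lossless compressor behaves the same way on random input: some inputs must grow, or the mapping could not be reversed.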


As comments to Roger's text, I note that

** if it is truly impossible to compress random data, then 1) no one has to worry about infringing the claims of a patent that asserts compression of random data, because it is impossible to infringe them, and 2) the USPTO does eventually get around to impossible patents (see BlackLight).

** if it is possible to compress random data and someone got a patent claim to do it, but didn't really describe the invention and/or enable it, then the patent will be found invalid. See for example, the demise of the University of Rochester's COX-2 patent in University of Rochester v. Searle.


Blogger Mark Bednarczyk said...

I did an experiment on "perpetual compression" for a project in my statistics class some years ago. The concept was to match series of numbers against the digits of pi and record the offsets and lengths of the matches. Then, through multiple passes, recompress the prior pass's output using the same technique until you can no longer achieve a benefit. Then simply record the number of passes so that the process can be reversed and the data, in effect, decompressed.

Obviously I wasn't worried about performance here. The result? The random data compressed to exactly a 1:1 ratio. In other words, the amount of storage (memory) needed to record the offset/length information about the matched data blocks took exactly the same amount of storage as the original data. No compression gain. This was with an algorithm that tried to save every single bit, but the matches occurred farther and farther into the pi sequence, so that eventually there was no benefit.
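Mark's 1:1 result can be sketched concretely. The code below is my reconstruction of the idea, not his program: the 2-digit pattern width, the 1,000-digit search window, and the use of Gibbons' spigot algorithm to generate pi are all my assumptions. It records where each fixed-width pattern first appears in pi, and shows that writing down an offset costs about as many digits as the pattern it locates, which is why the scheme nets out to no gain.

```python
from itertools import islice

def pi_digits():
    # Gibbons' unbounded spigot algorithm: yields decimal digits of pi (3, 1, 4, ...).
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

# Take the first 1000 digits of pi as the search string.
PI = "".join(str(d) for d in islice(pi_digits(), 1000))

# For every 2-digit pattern, record the offset of its first match in pi.
offsets = {}
for p in range(100):
    pattern = f"{p:02d}"
    pos = PI.find(pattern)
    if pos >= 0:
        offsets[pattern] = pos

# The catch: the first offset of a 2-digit pattern is ~100 on average,
# so recording it takes about as many digits as the pattern itself.
avg_offset_digits = sum(len(str(pos)) for pos in offsets.values()) / len(offsets)
print(f"patterns found: {len(offsets)}, avg digits per offset: {avg_offset_digits:.2f}")
```

Longer patterns only push the first matches proportionally deeper into pi (a k-digit pattern first appears around offset 10^k on average), which is exactly the effect Mark describes.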

You can't beat the laws of statistics. It did, however, prove to be somewhat useful as an encryption algorithm: not too hard to implement, and you only needed to know how many passes were used to encrypt the data.

The intriguing idea is that if we were able to save even one bit between each consecutive pass, we'd eventually reach perpetual compression, where a data set of any size would be compressible down to a much smaller, finite compressed state. The laws of statistics prevent this from happening, though.
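The reason even a one-bit-per-pass saving is impossible is the pigeonhole principle: a lossless compressor that shortened *every* n-bit input would have to map 2^n distinct inputs onto the 2^n - 1 bit-strings of all shorter lengths combined, forcing two inputs to share an output and making decompression ambiguous. A worked count (my illustration):

```python
# Pigeonhole count: 2**n distinct n-bit inputs, but only 2**n - 1
# bit-strings of all shorter lengths (0 through n-1) combined.
n = 16
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))  # 2**0 + ... + 2**(n-1)
print(inputs, shorter_outputs)  # 65536 vs 65535
assert shorter_outputs == inputs - 1
```

One string always gets left out, so at least one input cannot shrink; iterating the compressor over multiple passes never escapes this bound.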

Lastly, I didn't pursue it any further, but there could be big potential for non-random data. If you could match against a number, something like pi, that fits the data better, you could at that point achieve compression. All that needs to happen is for the data sequences to be matched sooner rather than later in the number sequence, and you would immediately see positive compression results. Of course, that is easier said than done.

11:36 AM  
