- tags
- Compression, Complexity
- papers
- (Lempel and Ziv 1976; Ziv and Lempel 1977, 1978; 1984; Storer and Szymanski 1982)
Context
The LZW algorithm was originally designed as a complexity (“randomness”) metric for finite sequences (Lempel and Ziv 1976). It was then extended as a compression algorithm by the same authors to LZ77 (Ziv and Lempel 1977) and LZ78 (Ziv and Lempel 1978). Those last two are the basis of many well known and widely used compression utilities such as GIF
, compress
(LZW (1984) ) or DEFLATE
, gzip
(LZSS (Storer and Szymanski 1982)), etc.
Description
The base idea of these algorithms is to build a dictionary of words in the input sequence. A given sequence is scanned and an empty dictionary is initialized. Each new word is the shortest next word not in the dictionary. It is registered by using a previous word and adding a single digit in the form \((j, s_{last})\). This encoding allows a unique decoding of the original sequence. When encoding or decoding the string the same dictionary of words is built. For sequences with string repetitions, the length of each new word increases fast and the numbers of needed pairs to encode increases very slowly.
Example implementation of the LZ algorithm in Python:
def compress(string):
dic = {'0': 0, '1': 1}
word = ""
result = []
for c in string:
wc = word + c
if wc in dic:
word = wc
else:
result.append(dic[word])
dic[wc] = len(dic)
word = c
if word:
result.append(dic[word])
return result
print(compress("1101010101011101010001110100"))
print(compress("1010101010101010101010101010"))
Results:
[1, 1, 0, 3, 5, 4, 4, 2, 7, 3, 0, 8, 6, 0]
[1, 0, 2, 4, 3, 6, 5, 8, 7, 6]
Bibliography
- Lempel, A., and J. Ziv. January 1976. "On the Complexity of Finite Sequences". IEEE Transactions on Information Theory 22 (1):75–81. DOI.
- Storer, James A., and Thomas G. Szymanski. October 1982. “Data Compression via Textual Substitution”. Journal of the ACM (JACM) 29 (4):928–51. DOI.
- Ziv, J., and A. Lempel. May 1977. “A Universal Algorithm for Sequential Data Compression”. IEEE Transactions on Information Theory 23 (3):337–43. DOI.
- ———. September 1978. “Compression of Individual Sequences via Variable-Rate Coding”. IEEE Transactions on Information Theory 24 (5):530–36. DOI.
- June 1984. “A Technique for High-Performance Data Compression”. Computer 17 (6):8–19. DOI.