data compression

Process of reducing the amount of data needed for storage or transmission of a given piece of information (text, graphics, video, sound, etc.), typically by use of encoding techniques.

Data compression is characterized as either lossy or lossless, depending on whether some data is discarded or not. Lossless compression scans the data for repetitive sequences or regions and replaces them with a single "token." For example, every occurrence of the word "the," or every region of the colour red, might be converted to "$." ZIP and GIF are the most common lossless formats for text and graphics, respectively. Lossy compression is frequently used for photographs, video, and sound files, where the loss of some detail is generally unnoticeable. JPEG and MPEG (see MP3) are the most common lossy formats.

* * *

also called compaction

      the process of reducing the amount of data needed for the storage or transmission of a given piece of information, typically by the use of encoding techniques. Compression predates digital technology, having been used in Morse code, which assigns the shortest codes to the most common characters, and in telephony, which cuts off high frequencies in voice transmission. Today, when an uncompressed digital image may require 20 megabytes, data compression is important in storing information digitally on computer disks and in transmitting it over communications networks.

      Information is digitally encoded as a pattern of 0s and 1s, or bits (binary digits). A four-letter alphabet (a, e, r, t) would require two bits per character if all characters were equally probable. All the letters in the sentence "A rat ate a tart at a tea" could thus be encoded with 2 × 18 = 36 bits. Because a is most frequent in this text, with t the second most common, assigning a variable-length binary code—a: 0, t: 10, r: 110, e: 111—would result in a compressed message of only 32 bits. This encoding has the important property that no code is a prefix of any other. That is, no extra bits are required to separate letter codes: 010111 decodes unambiguously as ate.
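
      The prefix property is easy to check in code. Here is a minimal Python sketch (ours, not part of the original article) that applies the code table above to the letters of the example sentence and confirms both the 32-bit total and the unambiguous decoding:

      # Illustrative sketch of the article's four-letter prefix code.
      CODE = {'a': '0', 't': '10', 'r': '110', 'e': '111'}  # prefix-free

      def encode(text):
          return ''.join(CODE[c] for c in text)

      def decode(bits):
          inverse = {v: k for k, v in CODE.items()}
          out, buf = [], ''
          for b in bits:
              buf += b
              if buf in inverse:        # prefix property: first full match is correct
                  out.append(inverse[buf])
                  buf = ''
          return ''.join(out)

      text = 'aratateatartatatea'       # the letters of "A rat ate a tart at a tea"
      bits = encode(text)
      print(len(bits))                  # 32, versus 2 * 18 = 36 fixed-length bits
      assert decode(bits) == text
      assert encode('ate') == '010111'  # decodes unambiguously as "ate"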

      Data compression may be lossless (exact) or lossy (inexact). Lossless compression can be reversed to yield the original data, while lossy compression loses detail or introduces small errors upon reversal. Lossless compression is necessary for text, where every character is important, while lossy compression may be acceptable for images or voice (the limitation of the frequency spectrum in telephony being an example of lossy compression). The three most common compression programs for general data are Zip (on computers using the Windows operating system), StuffIt (on Apple computers), and gzip (on computers running UNIX); all use lossless compression. A common format for compressing static images, especially for display over the Internet, is GIF (graphics interchange format), which is also lossless except that its images are limited to 256 colours. A greater range of colours can be used with the JPEG (joint photographic experts group) formatting standard, which uses both lossless and lossy techniques, as do the various standards of MPEG (moving picture experts group) for videos.
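
      As a small illustration of a lossless round trip, the following sketch uses Python's standard gzip module (our choice of tool; the article refers to the gzip program, not to any particular library):

      # Illustrative sketch; any lossless compressor would behave similarly.
      import gzip

      text = b'A rat ate a tart at a tea. ' * 40     # repetitive sample text
      packed = gzip.compress(text)
      print(len(text), '->', len(packed))            # repetition compresses well
      assert gzip.decompress(packed) == text         # lossless: exact round trip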

      For compression programs to work, they must have a model of the data that describes the distribution of characters, words, or other elements, such as the frequency with which individual characters occur in English. Fixed models such as the simple example of the four-character alphabet, above, may not characterize a single text very well, particularly if the text contains tabular data or uses a specialized vocabulary. In these cases, adaptive models, derived from the text itself, may be superior. Adaptive models estimate the distribution of characters or words based on what they have processed so far. An important property of adaptive modeling is that if the compression and decompression programs use precisely the same rules for forming the model and the same table of codes that they assign to its elements, then the model itself need not be sent to the decompression program. For example, if the compressing program gives the next available code to "the" when it is seen for the third time, decompression will follow the same rule and expect that code for "the" after its second occurrence.
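
      A minimal sketch of how this synchronization works, under a simplified rule of our own devising (a word is given the next code number after its second literal occurrence, so from its third occurrence on it is sent as a code):

      # Illustrative sketch: encoder and decoder build the same code table
      # from the stream itself, so no model ever needs to be transmitted.
      from collections import Counter

      ASSIGN_AFTER = 2   # give a word a code once it has appeared twice

      def adaptive_encode(words):
          seen, codes, out = Counter(), {}, []
          for w in words:
              if w in codes:
                  out.append(codes[w])        # known word: emit its short code
              else:
                  out.append(w)               # still sent as a literal
                  seen[w] += 1
                  if seen[w] == ASSIGN_AFTER:
                      codes[w] = len(codes)   # next available code number
          return out

      def adaptive_decode(tokens):
          seen, codes, out = Counter(), {}, []
          for t in tokens:
              if isinstance(t, int):
                  out.append(codes[t])        # code numbers index the table
              else:
                  out.append(t)
                  seen[t] += 1
                  if seen[t] == ASSIGN_AFTER:
                      codes[len(codes)] = t   # same rule, same table
          return out

      msg = 'the cat saw the dog and the dog saw the cat'.split()
      assert adaptive_decode(adaptive_encode(msg)) == msg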

      Coding may work with individual symbols or with words. Huffman codes use a static model and construct codes like the one illustrated earlier for the four-letter alphabet. Arithmetic coding encodes strings of symbols as ranges of real numbers and achieves more nearly optimal codes. It is slower than Huffman coding but is suitable for adaptive models. Run-length encoding (RLE) is good for repetitive data, replacing a run of repeated items by a count and one copy of the item. Adaptive dictionary methods build a table of strings and then replace occurrences of them by shorter codes. The Lempel-Ziv algorithm, invented by Israeli computer scientists Abraham Lempel and Jacob Ziv, uses the text itself as the dictionary, replacing later occurrences of a string by numbers indicating where it occurred before and its length. Zip and gzip use variations of the Lempel-Ziv algorithm.
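
      A toy illustration of the Lempel-Ziv idea (our sketch; the variants actually used in Zip and gzip are far more elaborate), in which a later occurrence of a string is replaced by a pair saying how far back it occurred and how long the match is:

      # Illustrative sketch only: brute-force search, no window limit.
      def lz_compress(data, min_match=3):
          i, out = 0, []
          while i < len(data):
              best_back, best_len = 0, 0
              for j in range(i):                      # search earlier text
                  length = 0
                  while (i + length < len(data)
                         and data[j + length] == data[i + length]):
                      length += 1
                  if length > best_len:
                      best_back, best_len = i - j, length
              if best_len >= min_match:               # long enough to pay off
                  out.append((best_back, best_len))
                  i += best_len
              else:
                  out.append(data[i])                 # literal character
                  i += 1
          return out

      def lz_decompress(tokens):
          out = []
          for t in tokens:
              if isinstance(t, tuple):
                  back, length = t
                  for _ in range(length):
                      out.append(out[-back])          # copy from earlier output
              else:
                  out.append(t)
          return ''.join(out)

      s = 'abracadabra abracadabra'
      assert lz_decompress(lz_compress(s)) == s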

      Lossy compression extends these techniques by removing detail. In particular, digital images are composed of pixels that represent gray-scale or colour information. When a pixel differs only slightly from its neighbours, its value may be replaced by theirs, after which the “smoothed” image can be compressed using RLE. While smoothing out a large section of an image would be glaringly evident, the change is far less noticeable when spread over small scattered sections. The most common method uses the discrete cosine transform, a mathematical formula related to the Fourier transform, which breaks the image into separate parts of differing levels of importance for image quality. This technique, as well as fractal techniques, can achieve excellent compression ratios. While the performance of lossless compression is measured by its degree of compression, lossy compression is also evaluated on the basis of the error it introduces. There are mathematical methods for calculating error, but the measure of error also depends on how the data are to be used: discarding high-frequency tones produces little loss for spoken recordings, for example, but an unacceptable degradation for music.
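
      A sketch of the smoothing-then-RLE idea on a single row of grey-scale pixel values (a toy example of ours; transform-based codecs such as JPEG instead quantize discrete cosine transform coefficients):

      # Illustrative sketch: a lossy smoothing pass makes RLE effective.
      def smooth(pixels, tolerance=4):
          # Lossy step: a pixel close to its left neighbour takes its value.
          out = [pixels[0]]
          for p in pixels[1:]:
              out.append(out[-1] if abs(p - out[-1]) <= tolerance else p)
          return out

      def rle(values):
          # Run-length encode into (value, run length) pairs.
          runs = []
          for v in values:
              if runs and runs[-1][0] == v:
                  runs[-1][1] += 1
              else:
                  runs.append([v, 1])
          return [tuple(r) for r in runs]

      row = [200, 201, 199, 200, 50, 52, 51, 50, 50, 198]
      print(rle(row))           # many short runs: little compression
      print(rle(smooth(row)))   # [(200, 4), (50, 5), (198, 1)]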

      Video images may be compressed by storing only the slight differences between successive frames. MPEG-1 is common in compressing video for CD-ROMs; it is also the basis for the MP3 format used to compress music. MPEG-2 is a higher "broadcast" quality format used for DVDs and some television networking devices. MPEG-4 is designed for "low bandwidth" applications and is common for broadcasting video over the World Wide Web. (MPEG-3 was subsumed into MPEG-2.) Video compression can achieve compression ratios approaching 20-to-1 with minimal distortion.
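
      A sketch of interframe differencing on two tiny frames of pixel values (ours; MPEG's motion-compensated prediction is far more sophisticated):

      # Illustrative sketch: store only what changed between frames.
      def frame_delta(prev, curr):
          # Record only the pixels that differ from the previous frame.
          return {i: v for i, (p, v) in enumerate(zip(prev, curr)) if p != v}

      def apply_delta(prev, delta):
          # Rebuild the current frame from the previous one plus the changes.
          curr = list(prev)
          for i, v in delta.items():
              curr[i] = v
          return curr

      frame1 = [10, 10, 10, 10, 10, 10]
      frame2 = [10, 10, 40, 10, 10, 10]        # one pixel changed
      delta = frame_delta(frame1, frame2)
      print(delta)                             # {2: 40}: far smaller than a frame
      assert apply_delta(frame1, delta) == frame2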

      There is a trade-off between the time and memory that compression algorithms require and the compression that they achieve. English text can generally be compressed to one-half or one-third of its original size. Images can often be compressed by factors of 10 to 20 or more. Despite the growth of computer storage capacity and network speeds, data compression remains an essential tool for storing and transmitting ever-larger collections of data. See also information theory: Data compression; telecommunication: Source encoding.

David Hemmendinger

Additional Reading
Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed. (1999), surveys modeling and lossless compression techniques. Khalid Sayood, Introduction to Data Compression, 2nd ed. (2000), is a general textbook on lossless and lossy compression.
