Canterbury Corpus

The Canterbury Corpus is a collection of files for measuring the performance and compression ratio of lossless data compression methods. It was developed in 1997 at the University of Canterbury and is intended to replace the Calgary Corpus, which was developed in 1987.

Purpose

The Canterbury Corpus was developed as a basis for applying metrics to newly developed data compression methods, and it is primarily used to create test cases for checking algorithms during the development cycle. Although it can in principle also be used to compare different compression techniques, the authors explicitly distance themselves from that use and refer instead to similar collections and resources. In addition, the Canterbury Corpus is provided exclusively for testing lossless compression methods.
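As an illustration of the kind of development-cycle test case described above, the following is a minimal sketch of a lossless round-trip check. It uses Python's standard `zlib` module as a stand-in for the method under test, and the `samples` dictionary holds small made-up inputs rather than actual corpus files; both choices are assumptions made for the example.

```python
import zlib


def roundtrip_ok(data: bytes) -> bool:
    """Return True if compressing and then decompressing reproduces the input exactly."""
    return zlib.decompress(zlib.compress(data)) == data


# Hypothetical stand-in inputs; a real test would read the corpus files instead.
samples = {
    "text": b"To be, or not to be, that is the question.\n" * 50,
    "binary": bytes(range(256)) * 8,
    "empty": b"",
}

# Map each sample name to whether the round trip was lossless.
results = {name: roundtrip_ok(data) for name, data in samples.items()}
```

Any entry in `results` that is `False` would indicate that the method under test is not lossless for that input.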

Packages

The Canterbury Corpus consists of several packages that, depending on the purpose of the test, contain different and in part highly specialized data. The package named The Canterbury Corpus offers eleven files in text and binary formats, including an excerpt from a work by William Shakespeare, and serves primarily to compare the algorithm under test with other existing compression methods. The packages Artificial, Large and Miscellaneous offer files with synthetically generated content, especially large files (such as the full content of the CIA World Factbook), or purely numerical content. These packages are meant for testing a compression method in special situations.
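A per-file measurement over one such package could be sketched as follows. This is only an illustration under stated assumptions: `zlib` stands in for the compression method under test, the ratio is taken as compressed size divided by original size (other sources use the inverse), and `measure_directory` and its `corpus_dir` argument are hypothetical names for a helper that reads every file in a local directory holding an unpacked package.

```python
import zlib
from pathlib import Path


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    if not data:
        # Avoid division by zero; an empty input is reported as ratio 1.0 by convention here.
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


def measure_directory(corpus_dir: str) -> dict[str, float]:
    """Return a mapping from file name to compression ratio for each regular file in corpus_dir."""
    return {
        path.name: compression_ratio(path.read_bytes())
        for path in sorted(Path(corpus_dir).iterdir())
        if path.is_file()
    }
```

Calling `measure_directory` on a directory of unpacked corpus files would yield one ratio per file, which can then be tracked from one development iteration to the next.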
