Jaccard index

The Jaccard coefficient, or Jaccard index, named after the Swiss botanist Paul Jaccard (1868-1944), is a measure of the similarity of sets.

Definition

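To calculate the Jaccard coefficient of two sets, one divides the number of common elements (the size of the intersection) by the size of the union:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

for sets $A$ and $B$ that are not both empty.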

The closer the Jaccard coefficient is to 1, the greater the similarity between the sets. The minimum value of the Jaccard coefficient is 0, which is attained when the sets have no elements in common.
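A minimal Python sketch of this definition follows. The function name `jaccard` is an illustrative choice, and returning 1.0 for two empty sets is an assumed convention, since the quotient is undefined in that case.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (similarity 1).
        return 1.0
    return len(a & b) / len(union)
```

Identical sets give 1.0 and disjoint sets give 0.0, matching the range described above.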

Example

The two sets $A$ and $B$ have the Jaccard coefficient $J(A, B) = |A \cap B| / |A \cup B|$.
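As a worked illustration, the arithmetic can be carried out for two small sets. The sets $A$ and $B$ below are assumed purely for illustration.

```latex
\begin{align*}
A &= \{1, 2, 3\}, \qquad B = \{2, 3, 4\} \\
A \cap B &= \{2, 3\}, \qquad A \cup B = \{1, 2, 3, 4\} \\
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5
\end{align*}
```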

Jaccard metric

The Jaccard metric can be derived from the Jaccard coefficient. It is calculated according to the formula

$J_\delta(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
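The Jaccard metric is commonly defined as one minus the Jaccard coefficient. The following Python sketch works under that definition; the name `jaccard_distance` and the value 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are at distance 0 from each other.
        return 0.0
    # The symmetric difference a ^ b has |a ∪ b| - |a ∩ b| elements.
    return len(a ^ b) / len(union)
```

The distance is 0.0 for identical sets and 1.0 for disjoint sets.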

General:

Applications

In text mining, and in particular in duplicate detection, the Jaccard similarity is a well-known measure of the similarity between two elements. Two strings are decomposed into tokens (for example, by splitting at spaces or by using n-grams). The resulting sets of substrings are then used, as described above, to calculate the similarity of the two strings.
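The tokenization step can be sketched in Python as follows. The helper names `word_tokens` and `char_ngrams`, the lowercasing, and the default n-gram length of 3 are all assumptions made for illustration; practical duplicate-detection systems may tokenize differently.

```python
def word_tokens(text):
    """Split a string at whitespace into a set of lowercase tokens."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string (n=3 is an assumed default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical (assumption)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


s1 = "the quick brown fox"
s2 = "the quick red fox"

word_similarity = jaccard(word_tokens(s1), word_tokens(s2))    # 3 shared of 5 distinct tokens
ngram_similarity = jaccard(char_ngrams(s1), char_ngrams(s2))   # based on character trigrams
print(word_similarity, ngram_similarity)
```

Word tokens compare whole words only, while character n-grams also give partial credit to strings that differ by small spelling changes.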
