Apriori algorithm

The Apriori algorithm is a method for association analysis, a portion of the data mining. He is the discovery of meaningful and useful relationships in transaction-based databases that are represented in the form of so-called association rules. A common application of the Apriori algorithm is the market basket analysis. Items here are offered products and a shopping represents a transaction that contains the items purchased. The algorithm then determines correlations of the form:

A suitable data base consists of a table of transactions ( lines) in which any binary item (column ) are combined. The Apriori algorithm finds connections between sets of items that occur in a large part of the transactions. The output association rules have the form, there are and sets of items and the rule states that if in a large part of the transactions, the Itemmenge occurs, then the Itemmenge is often included.

  • 3.1 Find common Itemmengen
  • 3.2 Apriori - Gen 3.2.1 Example

Requirements

The Apriori algorithm is applied to data bases of a certain shape. The form the data base must be as follows:

  • Is a set of possible items
  • Is the data base, consisting of transactions
  • A transaction holds a lot of items together

Typically an amount of more than 500,000 operations on a large item basis is analyzed. The representation of the data base is done in a de - normalized database table, which has a column for each possible item. The rows in each represents a transaction, are marked with a 0 in the transaction contained items with a 1, not included. A transaction can thus be regarded as a vector with dimensions.

An association rule is of the form

Which applies

Review of rules

Association rules are evaluated with two probabilistic measurements: support and confidence. The Apriori algorithm is expected, among other things, the values ​​and which represent the minimum support and the minimum confidence of a rule so that they are considered as input.

Support

The support is a Itemmenge the probability that this occurs in a Itemmenge transaction.

Be Itemmenge.

The support of an association rule, with and, is defined as

The support of a rule there is the relative frequency with which the control is found in the database. Usually a high support is desirable to find statements about majorities.

Confidence

Be an association rule, with and.

The confidence of a rule corresponds to the probability of the conclusion under condition of the premise:

So

Thus, the confidence measures the relative frequency of occurrence of the conclusion of the condition under premise. Also a high value for the confidence level is desirable. Or in simpler terms: The confidence measure for how much of the transactions in which occurs also occurs. To calculate the confidence the total number of rule- settled transactions, is (ie the support), divided by the number of transactions that contain.

The algorithm

The Apriori algorithm receives as inputs

  • The database
  • The minimum support
  • The minimum confidence

And outputs a set of association rules, which both meet.

The algorithm works in two steps both of which use a common step Apriori - Gen:

Find more often Itemmengen

The search for frequent Itemmengen starts with 1 -element sets and iteratively continued with n elements amounts to no Itemmengen with sufficient support are found. In this case, in each iteration, a set of candidate sets generated by Apriori - Gen and checked a lot on the property back. Can not add new volumes more are found, the algorithm stops and returns the amounts found out.

The returned set contains all frequent Itemmengen.

Apriori - Gen

The sub- routine priori gene is used both for calculating common volumes, as well as in the generation of association rules. Instead of calculating directly the support for all possible Itemmengen, is generating a lot of candidates for further review by a priori gene on the basis of the already found frequent amounts.

The routine takes as input a set of frequent - Itemmengen () and returns a set of - Itemmengen () returns as a possible candidate. It is based on the principle that all the subsets of a common Itemmenge are often all supersets of a non- common Itemmenge but also non- common. Unnecessary support calculations are avoided.

Example

The input to Apriori - Gen is:

Step 1 of Apriori - Gen routine now calculates the following set of candidates:

Step 2 removed from the quantities and again since and are not included. Both quantities are therefore not often and your supersets need not be considered.

The result of a priori gene is therefore

Generation of association rules

Only Itemmengen which are already common in itself, the algorithm must be considered for this step. Such Itemmengen were calculated from step 1 of the Apriori algorithm. The routine priori gene used in step 1 is used again in the generation of association rules.

For each found Itemmenge frequent attempts to generate association rules. It is started with the shortest possible (1 -element ) conclusions, which can be increased iteratively. The following pseudo code is performed for each found Itemmenge:

The rules generated and satisfy all.

73464
de