MapReduce

MapReduce is a programming model introduced by the company Google Inc. for concurrent computations over large amounts of data (up to several petabytes) on computer clusters. MapReduce is also the name of an implementation of this programming model in the form of a software library.

In the MapReduce method, the data is processed in two phases. The computation can be parallelized and distributed over several computers. For very large amounts of data, parallelization may already be a necessity, because the data is too large for a single process (and for the executing host system).

The programming model was inspired by the functions map and reduce, which are commonly used in functional programming, even though the library's way of working deviates from them. In 2010, a U.S. patent on MapReduce was granted.

MapReduce implementations have been realized in C, Erlang, Java, Python and many other programming languages.


Operation

Illustration of the data flow

The picture above illustrates the data flow for the MapReduce computation.

  • The input data (D, A, T, A) is distributed to a number of Map processes (colored rectangles), each of which computes the map function supplied by the user.
  • The Map processes are ideally executed in parallel.
  • Each Map instance deposits intermediate results (represented by the pink stars).
  • From each Map instance, data may flow into several different intermediate result stores.
  • Once all intermediate results have been computed, this so-called Map phase ends and the Reduce phase begins.
  • For each set of intermediate results, exactly one Reduce process (purple rectangles) computes the Reduce function supplied by the user and thus the output data (purple circles X, Y and Z).
  • The Reduce processes ideally also run in parallel.

Definition of the MapReduce function

The MapReduce library implements a function that computes a new list of key-value pairs (output list) from a list of key-value pairs (input list):

  MapReduce : (K × V)* → (L × X)*

Explanation:

  • The sets K and L contain keys, the sets V and X contain values.
  • All keys k ∈ K are of the same type, e.g. strings.
  • All keys l ∈ L are of the same type, e.g. integers.
  • All values v ∈ V are of the same type, e.g. atoms.
  • All values x ∈ X are of the same type, e.g. floating point numbers.
  • If A and B are sets, then A × B denotes the set of all pairs (a, b) (Cartesian product).
  • If M is a set, then M* denotes the set of all finite lists with elements from M (based on the Kleene star); the list may also be empty.
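The notation can be illustrated with a short Python sketch (the set contents and variable names are chosen purely for illustration):

```python
from itertools import product

A = {"k1", "k2"}   # a set of keys
B = {1, 2}         # a set of values

# A x B: the set of all pairs (a, b), i.e. the Cartesian product
pairs = set(product(A, B))
print(len(pairs))            # 4 pairs

# M*: the finite lists over a set M (Kleene star), so an element
# of (A x B)* is for example:
input_list = [("k1", 1), ("k1", 2), ("k2", 1)]
empty_list = []              # the empty list is also in (A x B)*
```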

Definition of Map and Reduce functions

The user configures the library by providing the map and reduce functions, which are defined as follows:

  map : K × V → (L × X)*
  reduce : L × X* → X*

or

  reduce : L × X* → X

Map phase

  • Map maps one pair, consisting of a key k ∈ K and a value v ∈ V, to a list of new pairs (l, x), which play the role of intermediate results; the values x are of the same type as the final results.
  • In a new pair (l, x), the key l assigned by Map refers to a list T_l of intermediate results, in which the value x computed by Map is collected.
  • The library calls the function map for each pair (k, v) in the input list.
  • All these map computations are independent of each other, so they can be executed concurrently and distributed over a computer cluster.
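The Map phase described above can be imitated in a few lines of Python. This is only a sequential sketch; the function name run_map_phase and the dictionary representation of the T_l lists are choices of this sketch, not part of the library:

```python
from collections import defaultdict

def run_map_phase(map_fn, input_list):
    """Call map_fn for every (k, v) pair of the input list and collect
    the emitted (l, x) pairs into per-key intermediate result lists T_l.
    Sequential sketch; since the calls are independent, a real library
    can run them concurrently on a cluster."""
    intermediate = defaultdict(list)          # key l -> list T_l
    for k, v in input_list:
        for l, x in map_fn(k, v):             # map : K x V -> (L x X)*
            intermediate[l].append(x)
    return dict(intermediate)

# Example: a map function that emits (word, 1) for every word
result = run_map_phase(lambda k, v: [(w, 1) for w in v.split()],
                       [("doc", "a b a")])
print(result)    # {'a': [1, 1], 'b': [1]}
```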

Reduce phase

  • Once all map calls have been made, i.e. all intermediate results are available, the library calls the function reduce for each intermediate value list T_l, which computes from it a list of result values x; these are collected by the library in the output list as pairs (l, x).
  • The calls of reduce can likewise be distributed independently over the different processes of the computer cluster.

Note: This presentation is somewhat simplified, because in general the control logic of the MapReduce process will aim at a fixed number of Reduce processes, so that, if there are intermediate results for more keys than there are Reduce processes, intermediate results with different keys end up stored in a single list. The corresponding pairs are sorted by key before the reduce computation.
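This note can be made concrete with a small sketch: assuming num_reducers Reduce processes, keys are assigned to partitions by a (here deliberately simplistic, deterministic) hash, and each partition is sorted by key before Reduce runs. All names are illustrative:

```python
def partition(intermediate, num_reducers):
    """Distribute intermediate result lists over num_reducers partitions
    and sort each partition by key, as happens before the reduce
    computation when there are more keys than Reduce processes."""
    parts = [[] for _ in range(num_reducers)]
    for key, t_list in intermediate.items():
        # toy deterministic hash: sum of the key's byte values
        r = sum(key.encode()) % num_reducers
        parts[r].append((key, t_list))
    for p in parts:
        p.sort(key=lambda pair: pair[0])   # sort pairs by key
    return parts

parts = partition({"und": [1, 1], "die": [1], "ob": [1]}, 2)
print(parts)   # [[('die', [1])], [('ob', [1]), ('und', [1, 1])]]
```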

Combine phase

Optionally, a Combine phase can take place between the Map and the Reduce phase. It usually has the same functionality as the reduce function, but it is executed on the same node as the Map phase. The goal is to reduce the network load. The point of the Combine phase becomes immediately clear when looking at the WordCount example: because of the different frequencies of words in natural language, a German text, for example, would very often emit a message of the form ("and", 1) (the same applies to articles and auxiliary verbs). The Combine phase turns, say, 100 messages of the form ("and", 1) into a single message of the form ("and", 100). This can significantly reduce the network load, but it is not possible in all applications.
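A minimal sketch of such a combiner in Python (function and variable names are this sketch's choices); it performs the same aggregation as reduce, but locally on the Map node:

```python
def combine(emitted):
    """Collapse locally emitted (word, count) messages into one message
    per word before they are sent over the network."""
    totals = {}
    for word, n in emitted:
        totals[word] = totals.get(word, 0) + n
    return list(totals.items())

# 100 messages of the form ("and", 1) plus one ("bell", 1) ...
messages = [("and", 1)] * 100 + [("bell", 1)]
combined = combine(messages)
print(combined)   # [('and', 100), ('bell', 1)] -- 101 messages become 2
```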

Example: Distributed Frequency Analysis with MapReduce

Problem

For extensive texts, one would like to find out which words occur in them and how often.

Specification of the map and reduce functions

map(String name, String document):
  // name: document name ("key")
  // document: document contents ("value")
  for each word w in document:
    EmitIntermediate(w, 1);

reduce(String word, Iterator partialCounts):
  // word: a word ("key")
  // partialCounts: a list of aggregated partial counts ("values") for 'word'
  int result = 0;
  for each v in partialCounts:
    result += v;
  Emit(word, result);

Map phase

  • Each Map call is given a document name name and a document document passed as a string.
  • Map traverses the document word by word.
  • Each time a word w is encountered, a 1 wanders into the intermediate result list for w (if it does not exist yet, it is created).
  • When all words have been processed and the text contains n different words, the Map phase ends with n intermediate result lists, each collecting for a different word and containing as many 1 entries as the corresponding word was found in the document.
  • Possibly many Map instances run simultaneously, if several words and documents were passed to the library.

Reduce phase

  • For each word word, reduce is called with the intermediate result list partialCounts for that word.
  • Reduce traverses the intermediate result list and adds up all the numbers found in it.
  • The sum result is returned to the library; it states how often the word word was found in the documents.
  • The intermediate results are computed in parallel by simultaneous reduce calls.

Overall

  • From a list of document names and documents, a list of words and word frequencies is generated.
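The whole WordCount example can be reproduced as a sequential, single-machine Python sketch. It follows the pseudocode above, but the function names and the driver map_reduce are inventions of this sketch, not the library's API:

```python
from collections import defaultdict

def map_fn(name, document):
    # name: document name ("key"); document: document contents ("value")
    for w in document.split():
        yield (w, 1)                      # EmitIntermediate(w, 1)

def reduce_fn(word, partial_counts):
    return (word, sum(partial_counts))    # Emit(word, result)

def map_reduce(input_list):
    intermediate = defaultdict(list)      # word -> list of 1 entries
    for name, document in input_list:     # Map phase
        for w, one in map_fn(name, document):
            intermediate[w].append(one)
    return [reduce_fn(w, counts)          # Reduce phase
            for w, counts in sorted(intermediate.items())]

output = map_reduce([("doc", "a rose is a rose")])
print(output)   # [('a', 2), ('is', 1), ('rose', 2)]
```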

Exemplary calculation

For example, the following computation on a classic text would be conceivable:

The text is split into sentences; a normalization suggests itself, writing everything in lower case and removing the punctuation:

Input list = [ (satz_1, "fest gemauert in der erden steht die form aus lehm gebrannt"),
               (satz_2, "heute muß die glocke werden frisch gesellen seid zur hand"),
               (satz_3, "von der stirne heiß rinnen muß der schweiß soll das werk den meister loben doch der segen kommt von oben") ]

The input list has three pairs as elements, so we can start three Map processes:

P1 = Map(satz_1, "fest gemauert in der erden steht die form aus lehm gebrannt")
P2 = Map(satz_2, "heute muß die glocke werden frisch gesellen seid zur hand")
P3 = Map(satz_3, "von der stirne heiß rinnen muß der schweiß soll das werk den meister loben doch der segen kommt von oben")

The Map calls generate these intermediate result pairs:

P1 = [ ("fest", 1), ("gemauert", 1), ("in", 1), ("der", 1), ("erden", 1),
       ("steht", 1), ("die", 1), ("form", 1), ("aus", 1), ("lehm", 1),
       ("gebrannt", 1) ]
P2 = [ ("heute", 1), ("muß", 1), ("die", 1), ("glocke", 1), ("werden", 1),
       ("frisch", 1), ("gesellen", 1), ("seid", 1), ("zur", 1), ("hand", 1) ]
P3 = [ ("von", 1), ("der", 1), ("stirne", 1), ("heiß", 1), ("rinnen", 1),
       ("muß", 1), ("der", 1), ("schweiß", 1), ("soll", 1), ("das", 1),
       ("werk", 1), ("den", 1), ("meister", 1), ("loben", 1), ("doch", 1),
       ("der", 1), ("segen", 1), ("kommt", 1), ("von", 1), ("oben", 1) ]

The Map processes deliver their pairs to the MapReduce library, which collects them in the intermediate result lists. In parallel, the following could happen (the identical timing of the three Map processes is unrealistic; in reality the executions overlap. The T_word lists exist locally per Map process and are *not* synchronized between steps):

1st iteration:
   P1: T_fest = [1] (new)
   P2: T_heute = [1] (new)
   P3: T_von = [1] (new)
2nd iteration:
   P1: T_gemauert = [1] (new)
   P2: T_muß = [1] (new)
   P3: T_der = [1] (new)
3rd iteration:
   P1: T_in = [1] (new)
   P2: T_die = [1] (new)
   P3: T_stirne = [1] (new)

In the fourth step we see that the intermediate result lists exist locally for each Map process and cannot be reused globally:

4th iteration:
   P1: T_der = [1] (new; the first Map process does not yet have T_der, only P3 does)
   P2: T_glocke = [1] (new)
   P3: T_heiß = [1] (new)
5th iteration:
   P1: T_erden = [1] (new)
   P2: T_werden = [1] (new)
   P3: T_rinnen = [1] (new)
6th iteration:
   P1: T_steht = [1] (new)
   P2: T_frisch = [1] (new)
   P3: T_muß = [1] (new; the third Map process does not yet have T_muß, only P2 does)

In the seventh step it happens for the first time that a further occurrence is collected in an already created intermediate result list:

7th iteration:
   P1: T_die = [1] (new; the first Map process does not yet have T_die)
   P2: T_gesellen = [1] (new)
   P3: T_der = [1, 1] (the list created in iteration 2 of the third Map process is reused)

and so on.

After 21 steps all three Map processes are finished with their work, the Map phase ends and the Reduce phase begins. Intermediate result lists that were created by different Map processes for the same word are merged. For each of the resulting intermediate result lists (listed here in sorted order)

reduce
   T_der = [1, 1, 1, 1] -> ("der", 4)
   T_die = [1, 1] -> ("die", 2)
   T_fest = [1] -> ("fest", 1)
   T_gemauert = [1] -> ("gemauert", 1)
   T_glocke = [1] -> ("glocke", 1)
   T_heiß = [1] -> ("heiß", 1)
   T_heute = [1] -> ("heute", 1)
   T_in = [1] -> ("in", 1)
   T_muß = [1, 1] -> ("muß", 2)
   T_stirne = [1] -> ("stirne", 1)
   T_von = [1, 1] -> ("von", 2)
   .
   .
   . (for all the different T lists)

we can start a reduce process in parallel, which in each case counts up the elements. The result of MapReduce looks roughly like this:

Output list = [ ("fest", 1), ("heute", 1), ("von", 2), ("gemauert", 1),
                ("muß", 2), ("der", 4), ("in", 1), ("die", 2), ... ]
