Apache Hadoop

Apache Hadoop is a free, written in Java Framework for scalable, distributed working software. It is based on the well-known MapReduce algorithm of Google Inc. as well as proposals of the Google file system and allows intensive computing processes with large amounts of data ( big data, petabytes ) perform on computer clusters. Hadoop was originally initiated by the Lucene creator Doug Cutting. On 23 January 2008, it became the top-level project of the Apache Software Foundation. Users include Facebook, a9.com, AOL, Baidu, IBM, ImageShack and Yahoo.

  • 2.1 HBase
  • 2.2 Hive
  • 2.3 Pig
  • 2.4 Chukwa
  • 2.5 ZooKeeper


Hadoop Distributed File System ( HDFS )

HDFS is a highly available, high-performance file system for storing very large amounts of data on the file systems of multiple computers (nodes). Files are divided into data blocks with a fixed length and distributes them redundantly on the participating nodes. HDFS is pursuing a master-slave approach. A master node, called the NameNode, processes incoming data requests, organized storage of files in the slave node and stores resulting metadata. HDFS supports file systems with several 100 million files. Both file block length and degree of redundancy are configurable.


Hadoop implements the MapReduce algorithm with configurable classes for Map, Reduce and Combine phases.



HBase is a scalable, simple database to manage very large amounts of data within a Hadoop cluster. The HBase database is based on a free implementation of Google 's BigTable. This data structure is suitable for data which is rarely changed, but very often updated. With HBase billions of rows can be distributed and managed efficiently.


Hadoop Hive extended to data warehouse functionalities, namely the query language HiveQL and indexes. HiveQL is an SQL -based query language and allows the developer to use a SQL - like syntax.

In the summer of 2008, Facebook, the original developer of Hive, the project of the open source community is available. The Hadoop database used by Facebook is one with a little more than 100 petabytes (August 2012) the largest in the world ..


With Pig MapReduce programs can be written in high -level language Pig Latin for Hadoop. Pig is characterized by the following properties:

  • Simplicity. The parallel execution of complex analysis is easy to understand and simple.
  • Optimization. Pig optimized independently performing more complex operations after Carsten method.
  • Extensibility. Pig can be extended with custom functionality and thus be adapted to individual applications.


Chukwa enables real-time monitoring of very large distributed systems.


ZooKeeper is the (distributed) configuration of distributed systems.


A system based on Apache Hadoop cluster system has the price Terabyte Sort won in 2008 and 2009 benchmark. It was among the systems tested as IT Benchmark fastest large amounts of data ( in 2009 one hundred terabytes Integer) Sort distributed - but with a much larger number of nodes than the competitors because it is not regulated in the benchmark statutes. It was thus the first Java - and also the first open source program that could decide this benchmark for itself.

Commercial Support and commercial Forks

As the use of Hadoop is particularly interesting for companies, there are a number of companies that commercial support or Forks of Hadoop offer:

  • Cloudera provides CDH with an "enterprise ready" Open Source Distribution for Hadoop
  • Hortonworks is Hadoop distributor with the Hortonworks Data Platform, the second large enterprise open source. It is an extraction of Yahoo! and Benchmark Capital.
  • Microsoft integrates Hadoop currently in Windows Azure and SQL Server
  • The Google App Engine MapReduce Hadoop support programs.
  • The IBM InfoSphere product BigInsights based on Hadoop.
  • EMC ² Greenplum HD Hadoop offers as part of a product package.

There are also further suppliers.

Pictures of Apache Hadoop