There is a lot of hype in the Data Science community about Hadoop and MapReduce, as if merely mentioning the two keywords lifts a company to some mysterious new level of Data Science or Big Data. The purpose of this very short tutorial is to cut through the hype and get down to an understandable list of frequently asked questions.
What is MapReduce?
MapReduce is a programming model for processing Big Data in parallel across distributed servers. It can be seen as a simplified way to access and process data held in distributed storage.
What are the main functions of MapReduce?
In a nutshell, MapReduce consists of two main functions: Map() and Reduce(). Map applies a user-defined function to each input record, in parallel across many servers, and emits intermediate key-value pairs. Reduce aggregates the intermediate values that share the same key, collecting the results computed on many servers into the final output.
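As a rough single-machine analogy (not a distributed implementation), the two ideas can be sketched with Python's built-in `map` and `functools.reduce`. The `records` list is a hypothetical stand-in for a large dataset, and squaring and summing are arbitrary choices made only for illustration.

```python
from functools import reduce

# Hypothetical input: a handful of numbers standing in for a large dataset.
records = [1, 2, 3, 4]

# "Map" idea: apply the same function to every record independently.
squared = list(map(lambda x: x * x, records))

# "Reduce" idea: fold the mapped values into a single aggregated result.
total = reduce(lambda acc, x: acc + x, squared, 0)

print(squared)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each record is mapped independently, a framework is free to run that step on many servers at once; only the aggregation has to bring the results together.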
What makes MapReduce better than traditional distributed computing?
MapReduce provides a higher level of abstraction. Programmers do not need to handle system-level details such as synchronization, shared memory, deadlocks, race conditions and many others that usually arise in traditional parallel and distributed programming with tools such as OpenMP, MPI, etc.
What are the most common programming languages used to implement MapReduce?
Java, Python, C++.
Who developed MapReduce?
A team of engineers at Google. The model was described by Jeffrey Dean and Sanjay Ghemawat in a paper published in 2004.
What is the connection between MapReduce and Hadoop?
Apache Hadoop is an open-source implementation of the MapReduce model. Within the Hadoop ecosystem, you can use Pig (dataflow style) and Hive (SQL style) for processing, and Mahout for machine learning and data mining algorithms.
Where can I get MapReduce?
Use Apache Hadoop.
How large is Big Data?
As a rough guide, on the order of terabytes to petabytes of data per day.
What is the basic data structure accepted in MapReduce?
You can use primitives such as integers, floating-point numbers, strings and bytes, or more complex data structures such as lists, tuples and associative arrays, or you can build your own custom data type. The simplest way to think about it is to treat the basic data structure as a list of key-value pairs (key, value).
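A minimal sketch of the key-value view in Python, assuming plain tuples stand in for (key, value) pairs. The fruit names and the `doc-42` record are hypothetical sample data, not anything from a real system.

```python
from collections import defaultdict

# Hypothetical records: each one is a (key, value) tuple.
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]

# Values can be richer than integers, for example a dict (hypothetical record).
complex_pair = ("doc-42", {"title": "hypothetical", "length": 120})

# Grouping by key yields an associative array: key -> list of values.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(dict(grouped))  # {'apple': [1, 1], 'banana': [1]}
```

The grouped form, one key with a list of values, is the shape of data that a reduce function receives.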
Can you give an example?
The following example is from the Google Research team.
Input: A large number of text documents
Task: Compute the word count across all the documents
Map step: for every word in a document, output (word, "1"), with the word as key and value = 1
Reduce step: sum all occurrences of each word and output (word, total_count)
Pseudocode for word counting:
map(String key, String value):
    // key: document name or document ID
    // value: document contents
    for each word w in value: // for each word in a document
        EmitIntermediate(w, "1"); // emits an intermediate key-value pair for each word
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts [c1, c2, c3, ...]
    int word_count = 0;
    for each v in values: // sum up the partial counts
        word_count += ParseInt(v);
    Emit(key, AsString(word_count)); // output the final count for the word
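The pseudocode above can be translated into runnable Python as a rough sketch. The names `map_fn`, `reduce_fn` and `run` are hypothetical, and `run` is only a single-process stand-in for what a real framework does across many servers. Counts are kept as strings to mirror the pseudocode.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield word, "1"  # intermediate pair, count kept as a string


def reduce_fn(key, values):
    """key: a word; values: an iterable of string counts."""
    word_count = 0
    for v in values:
        word_count += int(v)
    yield key, str(word_count)


def run(documents):
    """Hypothetical single-process driver standing in for the framework."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    result = {}
    for k in sorted(intermediate):
        for out_key, out_value in reduce_fn(k, intermediate[k]):
            result[out_key] = out_value
    return result


# Hypothetical sample documents.
docs = {"d1": "the cat sat", "d2": "the dog sat"}
counts = run(docs)
print(counts)  # {'cat': '1', 'dog': '1', 'sat': '2', 'the': '2'}
```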
Suppose you want to count the number of occurrences of each word in a given input file. How do you transform this task into a MapReduce task? Internally, the job runs in four steps:
- Partition: The input data is split into records (by row).
- Map: Each record is processed to produce a key-value pair for each word.
- Shuffle: The results of the map function are merged, grouped by key and sorted.
- Reduce (combine): The values for each key are aggregated to produce the final output.
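The four steps listed above can be sketched as four small Python functions chained together. This is an in-memory illustration on one machine, assuming one record per line. The function names and the sample text are hypothetical.

```python
from collections import defaultdict


def partition(text):
    """Split the input into records, one per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


def map_records(records):
    """Emit a (word, 1) pair for every word in every record."""
    return [(word.lower(), 1) for record in records for word in record.split()]


def shuffle(pairs):
    """Group values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def aggregate(groups):
    """Sum each key's values to produce the final counts."""
    return [(key, sum(values)) for key, values in groups]


# Hypothetical input file contents.
text = "Deer Bear River\nCar Car River\nDeer Car Bear"

records = partition(text)
pairs = map_records(records)
groups = shuffle(pairs)
final = aggregate(groups)
print(final)  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

In a real cluster the records would live on different servers and the shuffle would move data over the network, but the data shapes at each stage are the same.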
The four steps above are represented in the diagram below:
How do I use MapReduce for my specific applications?
From a Data Science point of view, programming directly against MapReduce is still a low level of abstraction. To apply it, you need a higher level of abstraction on top of this computational power and technology. Identify the common tasks and analyses in your work, then build higher-level libraries on top of MapReduce.
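One way to picture such a higher-level library is a pair of helpers that hide the mapper and reducer plumbing behind task-oriented names. Everything here is hypothetical: `map_reduce` is a single-process stand-in for a real engine such as Hadoop, and `count_by`, `average_by` and the sales records are invented for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for a MapReduce engine."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Higher-level helpers: callers describe the analysis, not the plumbing.
def count_by(records, key_fn):
    return map_reduce(records, lambda r: [(key_fn(r), 1)], lambda k, vs: sum(vs))


def average_by(records, key_fn, value_fn):
    return map_reduce(
        records,
        lambda r: [(key_fn(r), value_fn(r))],
        lambda k, vs: sum(vs) / len(vs),
    )


# Hypothetical sales records: (region, amount).
sales = [("east", 10.0), ("west", 20.0), ("east", 30.0)]
print(count_by(sales, lambda r: r[0]))                    # {'east': 2, 'west': 1}
print(average_by(sales, lambda r: r[0], lambda r: r[1]))  # {'east': 20.0, 'west': 20.0}
```

An analyst using `count_by` or `average_by` never writes a mapper or reducer, which is the kind of abstraction that tools like Pig and Hive provide on top of Hadoop.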