MapReduce
satya - 2/19/2019, 3:10:38 PM
Understand MapReduce
satya - 2/19/2019, 3:19:18 PM
Input: (Key, Complex-Value)
Write an output key
Write an output value that is a selection, interpretation, or transformation of the complex value, or a combination thereof
(Key2, Value)
The framework will group and sort the output by Key2
So each Key2 will have multiple values before the data is sent to reduce
Reduce will produce
(Key3, Value)
satya - 2/19/2019, 3:22:15 PM
High level
Select - Mapper
Sort and Group - Framework
Process sorted sets - Reduce
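For instance, here is how the classic word count example (used purely as an illustration, not taken from these notes) moves through those three stages:

Map (select):               "to be or not to be"  ->  (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
Framework (sort and group): (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
Reduce (process sets):      (be,2) (not,1) (or,1) (to,2)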
satya - 2/19/2019, 3:23:05 PM
How is parallelism achieved in mapreduce?
Search for: How is parallelism achieved in mapreduce?
To be clarified.
satya - 2/19/2019, 3:27:13 PM
Pseudocode of a mapper, per record
void map(key, value, context)
{
   // extract values from the complex input value
   val1 = pickfrom(value);
   val2 = pickfrom(value);
   newvalue = join(val1, val2);
   // figure out a new key if needed
   newkey = getmeSomekey(value);
   // send it back to the framework
   context.write(newkey, newvalue);
}
satya - 2/19/2019, 3:32:50 PM
Notice
How only one key and value is passed to the mapper.
The value can be complex, such as a string with many columns in it, or an XML document, etc.
See how it doesn't return anything; instead it writes its output back through the context object supplied by the framework
This works on ONE record
So it can be parallelized
The framework will orchestrate the calls to the mapper in parallel
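To make this concrete, here is a minimal sketch of such a mapper in Java using Hadoop's newer org.apache.hadoop.mapreduce API. The class name, the comma-separated input layout, and the choice of columns are hypothetical, purely for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads comma-separated lines and
// emits (first column, third column) pairs.
public class ColumnSelectMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // the incoming key is the byte offset of the line; we ignore it
        String[] cols = value.toString().split(",");
        if (cols.length >= 3) {
            // new key and new value selected from the complex input value
            context.write(new Text(cols[0]), new Text(cols[2]));
        }
    }
}

Note how it works on one record at a time and hands its output back through the context, exactly as the pseudocode above does.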
satya - 2/19/2019, 3:50:59 PM
Similarly, here is the pseudocode for reduce
void reduce(key, value-set, context)
{
   // make some meaning out of the set of values gathered for this key
   newvalue = process(value-set);
   newkey = sameOrNewKey(key);
   context.write(newkey, newvalue);
}
satya - 2/19/2019, 3:52:54 PM
Notice that the reduction is per key emitted by the mapper
the received key is set up by the mapper
each key has many values
the reducer is expected to make some meaning out of these multiple values for that key
in a way, a reduce call is a reduction of that key (the records belonging to that key)
There are no other records for this key outside of this set
this is ensured by the framework's grouping and sorting of keys after the map phase
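Again, a minimal sketch in Java (Hadoop's newer API), assuming a hypothetical job whose mapper emits (Text, IntWritable) pairs; the reducer here simply sums the values gathered for each key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums all the values gathered for a key.
public class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // every value for this key arrives here; no other record
        // for this key exists outside this iterable
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}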
satya - 2/20/2019, 2:43:20 PM
Tell me now about the job object/code
class Job.main
{
   inputpath1 = file;
   inputpath2 = directory;
   inputpath3 = another-file-or-directory;

   // you can use any of the previous inputs
   job.setInputPath(inputpath1);
   // you can also call it multiple times

   outputpath = directory; // a directory only
   // the directory shouldn't already exist
   job.setOutputPath(outputpath);

   job.setMapper(mapper.class);
   job.setReducer(reducer.class);

   job.setOutputKeyType(Text.class);   // for example, text
   job.setOutputValueType(....);       // an allowed type
   job.waitForCompletion(....);
}
satya - 2/20/2019, 2:46:27 PM
A few additional notes
So multiple files can be read
The mapper output types default to the output types declared for the job (which are the reducer's output types), but they can be specified independently if they differ.
By default the input to the Mapper comes through a TextInputFormat, which passes text records that are broken at newlines in each file.
** I am assuming the TextInputFormat knows how to pass the key as well in some meaningful nomenclature **
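For comparison with the pseudocode above, here is a minimal driver sketch using the actual class names from Hadoop's newer API. MyMapper and MyReducer are placeholders for classes that emit and consume (Text, IntWritable) pairs; the paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my job");
        job.setJarByClass(MyJobDriver.class);

        // input can be a file or a directory; addInputPath can be called multiple times
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output must be a directory that does not exist yet
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}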
satya - 2/20/2019, 2:47:45 PM
Behavior of TextInputFormat in HDFS
satya - 2/20/2019, 2:57:09 PM
Read this and see if this clarifies how files are processed to be sent to mapper
satya - 2/21/2019, 2:31:41 PM
The short answer for the default TextInputFormat behavior is
Pass the records to the mapper one line at a time, using the newline as the record break
the key passed is the byte offset of the record in the file
You can configure MapReduce to not pass the key, in which case the mapper will only get the value
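For example (the byte offsets shown are illustrative, not taken from a real run):

File contents:        Records passed to the mapper:
hello world           (0,  "hello world")
goodbye world         (12, "goodbye world")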
satya - 2/21/2019, 2:35:44 PM
How do I run a Hadoop MapReduce job?
// point HADOOP_CLASSPATH at your jar file (use "set" instead of "export" on Windows)
export HADOOP_CLASSPATH=your-jar-file
// hadoop is a script that ships with the Hadoop installation
hadoop your-class-name input/somedir/sample.txt some-output-dir
satya - 2/21/2019, 2:40:10 PM
Additional notes
In this case the sample text file is a file with 5 or 6 records, copied as a single file into the Hadoop file system.
The TextInputFormat will read this file and the framework will run a single mapper instance
the framework will then run a single reducer instance
Apparently the number of mappers matches the number of "splits" in the input file.
The number of reducers is decided by what are called partitions. The keys output by the mappers are sorted and grouped into partitions. A key is never present in 2 partitions. The default partitioner does this with a hash of the key modulo the number of reducers.
Each partition of keys is sent to a separate reducer. Each reducer writes one file into the output directory.
Because a directory of such part files can be treated as a single logical input in Hadoop, a downstream process can consume it as one dataset if needed.
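For reference, the default hash-based partitioning is essentially the following. This is a simplified sketch of what Hadoop's built-in HashPartitioner does, not its actual source:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning: a given key always maps to
// the same partition, so it can never land in two reducers.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}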
satya - 2/21/2019, 2:41:50 PM
Because the output is a directory
you will see files like output/part-r-00000 and output/part-r-00001, etc.
satya - 2/21/2019, 2:44:10 PM
MapReduce can be run on a simple file in a local file system
But it typically runs on files sitting in HDFS
This allows Hadoop to allocate mappers close to their data
It uses a resource manager called YARN for this
satya - 2/21/2019, 2:54:54 PM
How do input splits work in HDFS?
satya - 2/21/2019, 3:00:40 PM
Input splits
Hadoop divides the input data to a MapReduce job into fixed-size pieces called splits.
Each split is processed by its own map task
the output of a map task is written to the local disk of the machine where that task is running
So this data is not written to HDFS with its replication capabilities
However, if the machine goes down before this data is shuffled and sorted for the reducer, the map output is recalculated by rerunning the map task
Reduce tasks don't have the luxury of data locality, on the other hand
the sorted outputs of the mappers are transferred to the reducer node, merged and grouped again, and passed to the reducer
the reducer output is stored in HDFS
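The number of reduce tasks (and hence partitions) is not derived from the input; it is set on the job. A small sketch, assuming the hypothetical driver shown earlier:

// in the driver, before submitting the job:
// ask for 2 reduce tasks; the map output is then split into
// 2 partitions and the output directory will hold
// part-r-00000 and part-r-00001
job.setNumReduceTasks(2);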
satya - 2/21/2019, 3:01:40 PM
Map reduce data flow with a single reduce task
Show images for: Map reduce data flow with a single reduce task
See this link to see various good things on how data is processed and gathered in a mapreduce flow
satya - 2/21/2019, 3:05:04 PM
See an image
satya - 2/25/2019, 11:47:40 AM
A few more observations
Each mapper, as input, receives a collection of keys that are incidental, meaning there is no guarantee of order or uniqueness
The mapper emits output with the same or equivalent keys, carrying a selected or transformed value
Each "set" of mapper output is "independently" sorted and sent to the reducer machine across the network
where the sets are merged so that all the same primary keys are adjacent to each other (and sorted)
All instances of a given primary key will have their values gathered into a bucket belonging to that primary key
A single instance of each primary key is sent to the reducer. However, that single primary key, which is unique from everything else, carries a collection of values
The reducer then writes the output to HDFS (meaning it is replicated for fault tolerance)
satya - 2/25/2019, 11:52:27 AM
How about multiple reducers!!!
Mappers and reducers are not one to one
When output from mappers is directed to (a smaller set of) reducers, output from multiple mappers may reach the same reducer
To do this, the output of a single mapper is first classified into multiple buckets (partitions), each targeted at a specific reducer. So this ONE-mapper-to-multiple-targeted-reducers split happens on the mapper's local machine
The buckets then travel to their respective reducers, where they are merged so that all values for a key come together, as stated above
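A small illustration with hypothetical keys, two mappers, and two reducers (keys a and c go to reducer 1, key b goes to reducer 2):

Mapper 1 emits: (a,1) (b,1) (a,1)   ->  bucket for reducer 1: (a,1) (a,1)   bucket for reducer 2: (b,1)
Mapper 2 emits: (b,1) (c,1)         ->  bucket for reducer 1: (c,1)         bucket for reducer 2: (b,1)
Reducer 1 receives after the merge: (a,[1,1]) (c,[1])
Reducer 2 receives after the merge: (b,[1,1])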
satya - 2/25/2019, 11:58:15 AM
So the partitioning of keys headed for a reducer takes place on the mapper side
satya - 3/21/2019, 5:11:43 PM
Role of record readers and input records in Hadoop
Search for: Role of record readers and input records in Hadoop
satya - 3/21/2019, 5:26:23 PM
Is it possible to run MapReduce locally, without HDFS and Hadoop cluster?
Search for: Is it possible to run MapReduce locally, without HDFS and Hadoop cluster?