MapReduce

Understand MapReduce

Search for: Understand MapReduce

The mapper writes an output key

and an output value that is a selection, interpretation, transformation, or a combination thereof, derived from the complex input value

(key2, value2)

The framework will group and sort the mapper output by key2

So each key2 will have multiple values before the data is sent to reduce

Reduce will produce

(key3, value3)

Select - Mapper

Sort and Group - Framework

Process sorted sets - Reduce
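
To see the Select / Sort and Group / Process split end to end, here is a minimal in-memory sketch in plain Java (no Hadoop involved; the word-count logic and all names are illustrative assumptions, not anything the framework prescribes):

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("a b a", "b c");

        // Map (Select): each input record emits one or more (key2, value) pairs.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split(" ")) {
                mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Framework (Sort and Group): gather every value emitted for the same key2.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce (Process sorted sets): one call per key2, over all of its values.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            System.out.println(entry.getKey() + "\t" + sum);   // (key3, value3)
        }
    }
}

Running it prints a 2, b 2, c 1: each distinct key2 reaches the reduce step exactly once, together with all of its values.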

How is parallelism achieved in MapReduce?

Search for: How is parallelism achieved in MapReduce?

To be clarified.


void map(key, value, context)
{
   //extract values from the (possibly complex) input value
   val1 = pickfrom(value);
   val2 = pickfrom(value);
   newvalue = join(val1, val2);

   //figure out a new key if needed
   newkey = getmeSomekey(value);

   //send it back to the framework
   context.write(newkey, newvalue);
}

Note how only one key and one value are passed to the mapper.

The value can be complex, such as a string with many columns in it, or an XML document, etc.

See how it doesn't return anything; instead it emits its output through the context object provided by the framework

This works on ONE record

So it can be parallelized

The framework will orchestrate the calls to the mapper in parallel
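
Against the real Hadoop Java API (org.apache.hadoop.mapreduce), the map sketch above might look roughly like this. It is only an illustrative sketch: the class name, the tab-separated three-column record layout, and the choice of which columns become the new key and value are assumptions, not anything Hadoop prescribes.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: byte offset and one line of text (from TextInputFormat).
// Output key/value: one field chosen as the new key, a value built from other fields.
public class FieldExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: the record is a tab-separated line with at least three columns.
        String[] cols = value.toString().split("\t");
        if (cols.length < 3) {
            return; // skip malformed records; nothing is emitted for them
        }
        String newKey = cols[0];                   // figure out a new key
        String newValue = cols[1] + "," + cols[2]; // a selection/combination of the value
        // No return value: output goes back to the framework through the context.
        context.write(new Text(newKey), new Text(newValue));
    }
}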


void reduce(key, valueSet, context)
{
     //make some meaning out of the multiple values gathered for this key
     newvalue = process(valueSet);
     newkey = sameOrNewKey(key);
     context.write(newkey, newvalue);
}

the received key was set up by the mapper

each key has many values

reducer is expected to make some meaning out of these multiple values for that key

in a way, a reducer call is a reduction over that key (over the records belonging to that key)

There are no other records for this key outside of this set

this is ensured by the framework, which groups and sorts the keys after the map phase
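
Continuing the FieldExtractMapper sketch above (still an assumption, not a prescribed design), here is what a matching reducer could look like against the Hadoop API; it simply counts how many values were gathered under each key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one key plus ALL the values the framework grouped and sorted under it.
public class ValueCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // "Make some meaning" out of the value set; here we simply count the values.
        int count = 0;
        for (Text ignored : values) {
            count++;
        }
        // The key is reused as-is here, but a new key could be written instead.
        context.write(key, new IntWritable(count));
    }
}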


class Job.main
{
    inputpath = file;
    inputpath1 = directory;
    inputpath3 = anotherfile-or-directory;

    //You can use any of the previous inputs
    job.setInputPath(inputpath);
    //you can also call it multiple times

    outputpath = directory; //must be a directory, not a file
    //the directory shouldn't already exist
    job.setOutputPath(outputpath);

    job.setMapper(mapper.class);
    job.setReducer(reducer.class);

    job.setOutputKeyType(Text.class);// text
    job.setOutputValueType(....); //an allowed type

    job.waitForCompletion(....);

}

So multiple files can be read

The mapper output types are, by default, assumed to match the output types set for the reducer, but they can also be specified independently.
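
Here is roughly what the driver sketch corresponds to in the real Hadoop API, wiring together the two sketch classes from above. The paths and the job name are illustrative; note that addInputPath can be called multiple times, the output path must be a directory that does not yet exist, and the map output types are declared separately from the (reducer) output types because in this example they differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FieldCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "field count");
        job.setJarByClass(FieldCountJob.class);

        // Input can be a file or a directory, and addInputPath can be called repeatedly.
        FileInputFormat.addInputPath(job, new Path("input/somedir"));
        FileInputFormat.addInputPath(job, new Path("input/anotherfile.txt"));

        // Output is always a directory, and it must not already exist.
        FileOutputFormat.setOutputPath(job, new Path("some-output-dir"));

        job.setMapperClass(FieldExtractMapper.class);
        job.setReducerClass(ValueCountReducer.class);

        // Reducer (job) output types...
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ...and, since the mapper's value type differs, the map output types set independently.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}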

Input to the Mapper is, by default, provided by TextInputFormat, which passes text records that are broken at newlines in each file.

** I am assuming the TextInputFormat knows how to pass the key as well in some meaningful nomenclature **

Behavior of TextInputFormat in HDFS

Search for: Behavior of TextInputFormat in HDFS

Read this and see if it clarifies how files are processed and sent to the mapper

It passes records to the mapper one line at a time, using the newline as the break

the key passed is the byte offset of the record in the file

You can configure MapReduce to not pass the key, in which case the mapper will only get the value
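
As a concrete illustration of what TextInputFormat hands to the mapper, here is a small sketch (the file contents and the class name are made up) that simply echoes the (offset, line) pairs it receives:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For a file containing (byte offsets shown on the left):
//    0: first line
//   11: second line
//   23: third line
// TextInputFormat calls map() once per line with (offset, line) pairs:
//    (0, "first line"), (11, "second line"), (23, "third line")
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Re-emit exactly what TextInputFormat handed us, so the (key, value)
        // pairs become visible in the job output. Most real mappers ignore the offset.
        context.write(offset, line);
    }
}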


# set the classpath to point at your jar file
export HADOOP_CLASSPATH=your-jar-file

# hadoop is a script
hadoop your-class-name input/somedir/sample.txt some-output-dir

In this case the sample text file is a file with 5 or 6 records, copied as a single file into the Hadoop file system.

The TextInputFormat will read this file as a single split, so a single mapper instance is used

The framework will then run a single reducer instance (one reducer is the default)

Apparently the number of mappers matches the number of "splits" of the input.

The number of reducers used is decided by what are called partitions. The mapper output keys are sorted and grouped into partitions. A key is never present in two partitions. A hash function on the key (Hadoop's default HashPartitioner) decides which partition a key goes to.

Each partition of keys is sent to a separate reducer. Each reducer writes one file into the output directory.

Because a directory can be treated as a single input in Hadoop (its files supplying the splits), a downstream process can consume it as a single input as needed.

you will see files like output/part-r-00000 and output/part-r-00001, etc.
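
The default partitioner is HashPartitioner, which takes the key's hash modulo the number of reducers, so a given key always lands in the same partition. A custom partitioner can be substituted; the sketch below (an illustration only, partitioning by the key's first character) just shows the shape of that API.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustration only: route keys to reducers by their first character.
// The default HashPartitioner does essentially the same thing with key.hashCode().
public class FirstCharPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        int firstChar = k.isEmpty() ? 0 : k.charAt(0);
        // Mask off the sign bit so the modulo result is never negative.
        return (firstChar & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver this would be enabled with, for example, job.setNumReduceTasks(2) and job.setPartitionerClass(FirstCharPartitioner.class), which is what produces multiple part-r-NNNNN files.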

But it typically runs on files sitting in HDFS

This allows Hadoop to allocate mappers close to their data

It uses a resource manager called YARN for this

How do input splits work in HDFS?

Search for: How do input splits work in HDFS?

Hadoop divides the input data to a MapReduce job into fixed-size pieces called splits.

Each split is processed by its own map task
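
The split size for FileInputFormat-based jobs normally defaults to the HDFS block size, and since each split becomes one map task, it effectively controls how many mappers run. It can be nudged with min/max hints; the sketch below is just an illustration, and the 64 MB figure is an arbitrary example, not a recommendation.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeHints {
    // Pin the split size to roughly 64 MB (values are in bytes).
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}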

the output of a map task is written to the local disk of the node where the map task is running

So this data is not written to HDFS with replication capabilities

However, if the machine goes down before the map output has been shuffled and sorted for the reducer, the map task is rerun and its output recalculated

Reduce tasks, on the other hand, don't have the luxury of data locality

the sorted output of the mappers is transferred across the network to the reducer node, merged, and then passed to the reducer

the reducer output is stored in HDFS

Show images for: Map reduce data flow with a single reduce task

See this link for a good picture of how data is processed and gathered in a MapReduce flow

Each mapper receives, as input, a collection of keys that are incidental, meaning nothing is guaranteed about their order or uniqueness (or lack thereof)

The mapper emits output records whose keys may be the same as or derived from the input, with a selected or transformed value

Each "set" of the mapper output is "indpendently" sorted and sent to the reducer machine across the network

where they are merged so that all the same primary keys are adjacent to each other (and sorted)

All instances of a given primary key will have their values gathered into a bucket belonging to that primary key

A single instance of each primary key is sent to the reducer. However, that single primary key, now unique from everything else, carries the whole collection of values gathered for it

The reducer then writes the output to HDFS (meaning it is replicated for fault tolerance)

Mappers and reducers are not one to one

When output from mappers is directed to (a smaller set of) reducers, output from multiple mappers may reach the same reducer

To do this, the output of a single mapper is first classified into multiple buckets, each targeted at a different reducer. So the split from ONE mapper to multiple targeted reducers happens on the mapper's local machine

The buckets then travel to their respective reducers, where they are merged so that keys are unique, as stated above

MapReduce data flow with multiple reduce tasks [37]

So the partitioning of keys headed for a reducer takes place on the mapper side

Role of record readers and input records in Hadoop

Search for: Role of record readers and input records in Hadoop

Is it possible to run MapReduce locally, without HDFS and Hadoop cluster?

Search for: Is it possible to run MapReduce locally, without HDFS and Hadoop cluster?