Apache Pig, Grunt

Pig is a set of commands to read, manipulate, and write HDFS/HBase/Hive data sets.

It is like a set of select, aggregate, and join functions that can be tied together in a workflow:

Statement 1 (an operation on a dataset that produces a new named dataset)

Statement 2

Statement 3

etc.
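The statement chain above can be sketched as a short Pig Latin script. This is a minimal illustration, assuming the same 'student' file and schema used later in these notes; the relation names and the output path 'avg_gpa_by_age' are made up for the example.

```pig
-- Statement 1: load the data into a named relation
students = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Statement 2: filter it, producing a new named relation
adults = FILTER students BY age >= 19;

-- Statement 3: group the filtered rows by age
by_age = GROUP adults BY age;

-- Statement 4: aggregate each group
avg_gpa = FOREACH by_age GENERATE group AS age, AVG(adults.gpa) AS mean_gpa;

-- Statement 5: write the result (hypothetical output path)
STORE avg_gpa INTO 'avg_gpa_by_age' USING PigStorage(',');
```

Each statement only names a new relation built from an earlier one; nothing is computed until a STORE or DUMP asks for output.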

Grunt is Pig's interactive command-line shell

It is not the same thing as Grunt, the extensible node.js-based command line task runner originally built for JavaScript build/compile tasks

You can use the "sh" command to invoke shell commands from Grunt itself

You can use the "fs" command to invoke a number of HDFS commands
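A short Grunt session might look like the sketch below. It assumes a running Pig Grunt shell and a local file named 'student'; the directory /tmp/pig_demo is a hypothetical path chosen for illustration.

```pig
grunt> sh ls
grunt> fs -ls /
grunt> fs -mkdir /tmp/pig_demo
grunt> fs -copyFromLocal student /tmp/pig_demo/student
```

The arguments after "fs" are the same ones accepted by the Hadoop file system shell, so anything that works there should work here.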

Pig commands are often entered interactively in Grunt

Relation - like a table (a bag of tuples, that is, a bag of rows)

Tuple - like a row in a table; an ordered set of fields


-- A is a named relation
-- LOAD is a command
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- DUMP is a command; it prints the relation to the console
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
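To check what a relation contains without printing its rows, DESCRIBE can be used. A minimal sketch, assuming the relation A loaded above:

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Print the schema of A (field names and types) instead of its data
DESCRIBE A;
```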

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Select the column named "name"
-- Also select the field at position $2 (positions start at $0, so this is the third field, gpa)
-- Create a new relation called X
X = FOREACH A GENERATE name,$2;

-- Print it
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
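Fields can be referenced either by position or by name. The sketch below, which assumes the same 'student' file and schema, builds the same projection both ways; the relation names X_by_position and X_by_name are made up for the example.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Positions are zero-based: $0 is name, $1 is age, $2 is gpa
X_by_position = FOREACH A GENERATE $0, $2;

-- The same projection using the field names from the schema
X_by_name = FOREACH A GENERATE name, gpa;

DUMP X_by_name;
```

Names are usually easier to read; positions are useful when a relation was loaded without a schema.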

Pig is a set of commands to manipulate data sets

The commands are generally saved together in a file, called a script, and run as a unit
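A saved script can be run from inside the Grunt shell. In this sketch the file name student_report.pig is hypothetical; "exec" runs the script in batch mode, while "run" executes it as if its statements had been typed into the shell, so its relations remain available afterwards.

```pig
grunt> exec student_report.pig
grunt> run student_report.pig
```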

Hive, by contrast, merely provides standalone SQL-like queries against those data sets.