Apache Pig, Grunt
satya - 7/14/2019, 12:58:52 PM
Briefly
A set of commands to read, manipulate, write HDFS/HBase/Hive data sets.
Like a set of select, aggregate, and join functions that can be tied together in a workflow
Statement 1 (an operation on a data set that produces a new named data set)
Statement 2
Statement 3
etc.
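A minimal sketch of such a chain of statements, assuming the 'student' file and schema used in the examples in these notes. The output path 'gpa_by_age' and the relation names B, C, D are made up for illustration.

```pig
-- Statement 1: read the data set into a named relation
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Statement 2: select rows (like a WHERE clause)
B = FILTER A BY age >= 18;

-- Statement 3: group rows by age
C = GROUP B BY age;

-- Statement 4: aggregate each group into one row
D = FOREACH C GENERATE group AS age, COUNT(B) AS students, AVG(B.gpa) AS avg_gpa;

-- Statement 5: write the result back out (hypothetical output path)
STORE D INTO 'gpa_by_age' USING PigStorage(',');
```

Each statement takes a relation produced by an earlier statement and names a new one, which is what ties the steps into a workflow.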
satya - 7/14/2019, 1:06:24 PM
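So Grunt
Grunt is Pig's interactive shell (you get the "grunt>" prompt when you run "pig" without a script)
Not to be confused with the node.js based command line task runner of the same name, which was built for Javascript build/compile tasks
You can use the "sh" command to invoke shell commands from Grunt itself
You can use the "fs" command to invoke a number of HDFS commands
Pig Latin statements are often typed and tested interactively in Grunt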
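A short sketch of what a session might look like, assuming the Grunt in question is Pig's interactive shell and that a 'student' file exists in the HDFS home directory. Each line is typed at the "grunt>" prompt; the particular shell and HDFS commands are just examples.

```pig
-- run a local shell command from inside Grunt
sh date

-- run an HDFS command from inside Grunt (lists the HDFS home directory)
fs -ls

-- Pig Latin statements are entered at the same prompt
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
```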
satya - 7/14/2019, 1:10:09 PM
Key data structures in pig
Relation - like a table (a bag of tuples, that is, a bag of rows)
Tuple - like a row in a table, an ordered set of fields
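A small sketch of how these structures show up in practice, again assuming the 'student' file and schema from the examples in these notes. DESCRIBE prints the schema of a relation, and GROUP produces tuples that themselves contain a bag.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- A is a relation: a bag of (name, age, gpa) tuples
DESCRIBE A;

-- each tuple of B holds the group key plus a bag of the matching tuples from A
B = GROUP A BY age;
DESCRIBE B;
```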
satya - 7/14/2019, 1:11:22 PM
Example
-- A is a named relation
-- LOAD is a command
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
-- DUMP is a command that prints the relation
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
satya - 7/14/2019, 1:13:08 PM
Example 2
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
-- select the column named "name"
-- also select the field at position 2 ($2 is zero-based, so this is the third field, gpa)
-- create a new relation called X
X = FOREACH A GENERATE name,$2;
-- print it
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
satya - 7/15/2019, 12:11:40 PM
Pig is
It is a set of commands to manipulate data sets
The commands are generally collected into a script that runs them as a sequence of steps
Hive, by contrast, provides SQL-like standalone queries against those data sets.
satya - 7/15/2019, 12:14:01 PM
Hadoop ecosystem, one view: from Hadooptutorial.info