Apache Pig, Grunt

satya - 7/14/2019, 12:58:52 PM

Briefly

A set of commands to read, manipulate, and write HDFS/HBase/Hive data sets.

Like a set of select, aggregate, and join functions that can be tied together in a workflow

Statement 1 (an operation on a dataset that produces a new named dataset)

Statement 2

Statement 3

etc.
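
As a rough sketch of such a workflow in Pig Latin, each statement below takes a relation and produces a new named one. The 'student' input and its schema come from the examples in these notes; the age filter, the grouping, and the 'avg_gpa_by_age' output path are made up for illustration.

```pig
-- Statement 1: load the data set (schema as in the student examples)
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Statement 2: select rows (hypothetical filter)
B = FILTER A BY age >= 18;

-- Statement 3: group rows by age
C = GROUP B BY age;

-- Statement 4: aggregate each group
D = FOREACH C GENERATE group AS age, AVG(B.gpa) AS avg_gpa;

-- Statement 5: write the result (hypothetical output path)
STORE D INTO 'avg_gpa_by_age' USING PigStorage();
```

Nothing runs until a STORE or DUMP is reached; the earlier statements only define the data flow.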

satya - 7/14/2019, 12:59:07 PM

Apache

Pig

Pig Latin

Grunt

satya - 7/14/2019, 1:06:24 PM

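So Grunt

Grunt here is Pig's interactive shell, not Grunt.js (the node.js based command line task runner originally built for Javascript build/compile tasks, which only shares the name)

Start it by running "pig" with no script argument; Pig Latin statements are then entered at the "grunt>" prompt

You can use the "sh" subcommand to invoke shell commands from grunt itself

You can use the "fs" subcommand to invoke HDFS file system commands

Pig Latin statements are often run interactively from Grunt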
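
A short session sketch, assuming Grunt here means Pig's own interactive shell (the "grunt>" prompt). The /user path is hypothetical, and the 'student' file is the one used in the examples in these notes.

```pig
-- start the Grunt shell in local mode from the OS prompt:  pig -x local

-- run a local shell command from inside Grunt
grunt> sh ls

-- run a file system command from inside Grunt (hypothetical path)
grunt> fs -ls /user

-- Pig Latin statements are typed at the same prompt
grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
grunt> DUMP A;
```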

satya - 7/14/2019, 1:10:09 PM

Key data structures in pig

Relation - like a table (a bag of tuples, i.e. a bag of rows)

Tuple - like a row in a table: an ordered set of fields
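
One way to look at these structures is DESCRIBE, which prints a relation's schema. A minimal sketch, reusing the student relation from the examples in these notes; the GROUP step is added only to show a bag nested inside a tuple.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- each tuple of A has three fields: name, age, gpa
DESCRIBE A;

-- GROUP yields one tuple per age: the key (named "group")
-- plus a bag holding the matching tuples from A
B = GROUP A BY age;
DESCRIBE B;
```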

satya - 7/14/2019, 1:11:22 PM

Example


-- A is a named relation
-- LOAD is a command
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- DUMP is a command
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

satya - 7/14/2019, 1:13:08 PM

Example 2


A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- select the column named "name"
-- also select field $2: positions are zero-based, so $2 is the third field (gpa)
-- create a new relation called X
X = FOREACH A GENERATE name,$2;

-- print it
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)

satya - 7/15/2019, 12:11:40 PM

Pig is

It is a set of commands to manipulate data sets

The commands are generally written together as a script (in Pig Latin) and run step by step as a data flow

Hive, by contrast, merely provides standalone SQL-like queries against those data sets.

satya - 7/15/2019, 12:14:01 PM

Hadoop ecosystem, one view: from Hadooptutorial.info