Hive

A mechanism to see distributed big data as a set of tables

Run SQL (HiveQL) against that data

Has drivers such as JDBC, so programs in other languages can execute SQL queries while taking advantage of distributed data and distributed processing
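A minimal sketch of the client side in Python, for illustration only. It assumes the third-party PyHive package and a HiveServer2 instance listening on localhost:10000. Neither is mentioned in these notes, and the table name `web_logs` and its `page` column are hypothetical. A Java client would use the JDBC driver instead.

```python
def page_view_counts_query(table: str, limit: int) -> str:
    """Build a HiveQL statement; `table` and its `page` column are hypothetical."""
    return (
        f"SELECT page, COUNT(*) AS views FROM {table} "
        f"GROUP BY page ORDER BY views DESC LIMIT {int(limit)}"
    )


def run_hive_query(sql: str, host: str = "localhost", port: int = 10000,
                   database: str = "default"):
    """Send one statement to HiveServer2 and return all result rows."""
    # Imported lazily so this file still loads where PyHive is not installed.
    from pyhive import hive  # assumption: pip install "pyhive[hive]"

    conn = hive.connect(host=host, port=port, database=database)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)       # Hive plans and runs the distributed job
        return cursor.fetchall()  # the client only sees ordinary rows
    finally:
        conn.close()
```

Usage would look like `run_hive_query(page_view_counts_query("web_logs", 5))`. The client code looks like any other SQL client, and the distribution stays hidden behind the driver.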

Uses a Hive metastore, kept in a relational database, to register table schemas and the physical paths where each table's data lives
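To make the metastore idea concrete, here is a toy model using an in-memory SQLite database. The two-table layout (`hive_tables`, `hive_columns`) and the `web_logs` example are simplifications invented for this sketch, not the real metastore schema.

```python
import sqlite3

# Toy stand-in for the relational database behind the metastore.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hive_tables  (name TEXT PRIMARY KEY, location TEXT NOT NULL);
    CREATE TABLE hive_columns (table_name TEXT, position INTEGER,
                               col_name TEXT, col_type TEXT);
""")


def register_table(name, location, columns):
    """Record a table's physical path and its ordered (column, type) pairs."""
    db.execute("INSERT INTO hive_tables VALUES (?, ?)", (name, location))
    db.executemany(
        "INSERT INTO hive_columns VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def lookup(name):
    """Return (location, [(column, type), ...]) for a registered table."""
    (location,) = db.execute(
        "SELECT location FROM hive_tables WHERE name = ?", (name,)
    ).fetchone()
    cols = db.execute(
        "SELECT col_name, col_type FROM hive_columns "
        "WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    return location, cols


register_table(
    "web_logs",
    "hdfs:///warehouse/web_logs",
    [("page", "string"), ("ts", "timestamp")],
)
```

The point of the sketch: only metadata lives in the relational database. The rows themselves stay in files under the registered path, and a query engine calls something like `lookup` to learn where to read and how to interpret the bytes.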

Show images for: Apache Hive architecture

FYI: unlike Pig, Hive is not a scripting language

Hive and Pig both run on YARN underneath

They use MapReduce

They are fundamentally batch jobs/operations

Good for long-running jobs/processes
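As a rough illustration of why these are batch operations, the sketch below shows how a query like `SELECT page, COUNT(*) FROM web_logs GROUP BY page` could break down into map, shuffle and reduce phases. It is a single-process toy with made-up rows, not how Hive actually compiles or schedules a job.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into one count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows standing in for files on distributed storage.
rows = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(rows)))
```

Every phase has to see all of its input before the next one can finish, which is why this model suits long-running batch work better than interactive queries.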

The newer engines listed below are inspired by the need for real-time queries

Inspired by Google's BigQuery

They have a flavor of the MPP (massively parallel processing) architecture of relational databases, as in Postgres-based MPP systems such as Greenplum

These use their own "distributed processing" protocols and may diverge from YARN

They may use a lot of memory

Cloudera Impala - An MPP architecture that uses its own clustering (not YARN or MapReduce) for processing

Apache Tez - An efficient application processing framework abstraction on top of YARN. Can be used as an engine for Hive or Pig

Apache HAWQ - A Pivotal solution that grew out of Greenplum, a Postgres-based MPP database

Apache Drill - An open-source engine modeled on Dremel, the system behind Google's BigQuery

Presto - The new kid on the block

They all support federated queries across various sources