How are HDFS and HBase different?


Search for: How are HDFS and HBase different?

Here is the same question posed on Quora

A few quotes from that topic on Quora:

HBase is a non-relational, column-oriented distributed database that runs on top of HDFS. It is an open source NoSQL database in which data is stored in rows and columns. A cell is the intersection of a row and a column.

To track changes, each cell is versioned, making it possible to retrieve any prior version of its contents. This versioning is a key difference between HBase tables and an RDBMS.

Each cell value includes a "version" attribute, which is nothing more than a timestamp uniquely identifying that version of the cell. Each value in the map is an uninterpreted array of bytes.

The map is indexed by a row key, a column key, and a timestamp. HBase implements a highly scalable, sparse, distributed, persistent, multidimensional sorted map.
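The "sorted map indexed by row key, column key, and timestamp" description above can be sketched in plain Python. This is a conceptual model only, not the HBase client API; the table, rows, and column names are invented for illustration:

```python
# Conceptual model of an HBase table: a map indexed by
# (row key, column key, timestamp). Values are uninterpreted bytes,
# and every put creates a new timestamped version of the cell.
class ConceptualHBaseTable:
    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self.map = {}

    def put(self, row, column, timestamp, value):
        versions = self.map.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        versions = self.map.get(row, {}).get(column, [])
        if timestamp is None:
            return versions[0][1] if versions else None  # latest version
        for ts, value in versions:
            if ts <= timestamp:
                return value  # newest version at or before the timestamp
        return None

t = ConceptualHBaseTable()
t.put("row1", "info:city", 100, b"Austin")
t.put("row1", "info:city", 200, b"Dallas")
print(t.get("row1", "info:city"))        # latest version: b"Dallas"
print(t.get("row1", "info:city", 150))   # as of timestamp 150: b"Austin"
```

Note how a read without a timestamp returns the newest version, while a read with a timestamp travels back through the cell's history; that is the versioning behavior that distinguishes HBase tables from an RDBMS.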

What are the various HDFS file storage types?

Search for: What are the various HDFS file storage types?


Splittable into records
Records have a primary key
Storage optimized for parallel processing

A single data set is stored on multiple machines
Fault tolerance
Compressible
Higher-level schema on read
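The "schema on read" property in the list above can be illustrated with a small Python sketch: the stored data is just splittable lines of bytes, and structure is imposed by the reader rather than by the storage layer. The field names and sample contents here are invented:

```python
# Schema on read: storage holds raw, splittable records of bytes;
# each reader applies its own schema at read time.
raw = b"1,alice,30\n2,bob,25\n"  # invented sample data

def read_with_schema(data, schema):
    """Parse raw bytes into records using a (name, cast) schema."""
    records = []
    for line in data.decode().splitlines():
        fields = line.split(",")
        records.append({name: cast(value)
                        for (name, cast), value in zip(schema, fields)})
    return records

people = read_with_schema(raw, [("id", int), ("name", str), ("age", int)])
print(people[0])  # {'id': 1, 'name': 'alice', 'age': 30}
```

Two different readers could apply two different schemas to the same stored bytes, which is the point: the storage layer never enforced one.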

What kind of HDFS storage type does HBase use?

Search for: What kind of HDFS storage type does HBase use?

Before I veer off, here is an article on HDFS storage types.

Revisiting the topic in 2019

Although these files are stored in HDFS, HBase does not use YARN for its clustering. It has its own clustering framework, and likely its own JDBC-like read framework for APIs.

What is cell level security in HBase?

Search for: What is cell level security in HBase?

HBase clustering maximizes resource usage among its nodes. So there is a warning against sharing an HBase cluster with other big data jobs.

Both HBase and Accumulo are based on Google's Bigtable design.

Machines are clustered using the YARN processing protocol and the HDFS storage protocol.

Although HBase uses its own clustering rather than YARN, it ultimately stores files as HDFS files (HFile is its format).
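Conceptually, an HFile is a file of key-value pairs written in sorted order, with an index recording where each block starts so a reader can seek without scanning everything. A toy sketch of that idea in Python follows; this is not the real HFile binary format, and the keys and block size are invented:

```python
# Toy "HFile": entries are written in sorted order in fixed-size blocks,
# and a sparse index records the first key of each block.
BLOCK_SIZE = 2  # entries per block, tiny for illustration

def write_hfile(entries):
    sorted_entries = sorted(entries.items())
    blocks, index = [], []
    for i in range(0, len(sorted_entries), BLOCK_SIZE):
        block = sorted_entries[i:i + BLOCK_SIZE]
        index.append(block[0][0])  # first key of the block
        blocks.append(block)
    return blocks, index

def read_key(blocks, index, key):
    # Find the last block whose first key is <= key, then scan only it.
    candidate = 0
    for i, first_key in enumerate(index):
        if first_key <= key:
            candidate = i
    for k, v in blocks[candidate]:
        if k == key:
            return v
    return None

blocks, index = write_hfile({"row3": b"c", "row1": b"a", "row2": b"b"})
print(read_key(blocks, index, "row2"))  # b"b"
```

The sorted layout is what makes range scans by row key cheap, and the sparse index is what keeps point reads from touching the whole file.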

Parquet is another HDFS format

MapReduce, Tez, and Spark are three ways YARN is used as a general-purpose distributed processing framework.
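As a reminder of the model these engines execute, here is the classic MapReduce word count reduced to plain Python. This is a single-machine conceptual sketch of the map, shuffle, and reduce phases, not an actual YARN job; the input lines are made up:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) pairs for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key, as the framework would across nodes
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hbase runs on hdfs", "yarn runs mapreduce"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["runs"])  # 2
```

MapReduce materializes the shuffle to disk between phases; Tez and Spark run the same logical pattern but keep intermediate data in a DAG of tasks (and, for Spark, in memory), which is where their speedup comes from.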

Pig is a way to process HDFS files using a dataflow script (Pig Latin) over relational data sets. Underneath, it can use MapReduce or Tez as its processing engine.

Hive is a way to process the same files using the Hive metastore and HiveQL, which is closer to SQL, along with JDBC drivers to access the tables so defined.

Sqoop also uses MapReduce to batch-load relational databases into HDFS or HBase.

Impala, Presto, and Drill may use YARN to provide federated SQL (fast, near-real-time) queries and processing.

Spark uses YARN and in-memory processing of data, using Spark SQL as the data access language or directly using its RDDs (resilient distributed datasets).

An ApplicationMaster schedules tasks in containers using NodeManagers.

Notice how Hue provides a high-level UI for working with all of these tools, including SQL queries and exploring data.

How do you access HBase tables in Pig or Hive?

Search for: How do you access HBase tables in Pig or Hive?

Hive metadata and HBase

Search for: Hive metadata and HBase

HBase and JDBC architecture

Search for: HBase and JDBC architecture