How are HDFS and HBase different?


Search for: How are HDFS and HBase different?

Here is the same question posed on Quora

A few quotes from that topic on Quora:

HBase is a non-relational, column-oriented distributed database that runs on top of HDFS. It is an open source NoSQL database in which data is stored in rows and columns. A cell is the intersection of a row and a column.

To track changes, each cell is versioned, making it possible to retrieve any prior version of its contents. This versioning is a key difference between HBase tables and an RDBMS.

Each cell value includes a "version" attribute, which is nothing more than a timestamp uniquely identifying that version of the cell. Each value in the map is an uninterpreted array of bytes.

The map is indexed by a row key, a column key, and a timestamp. HBase implements a highly scalable, sparse, distributed, persistent, multidimensional sorted map.
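The "sorted map indexed by row key, column key, and timestamp" description above can be sketched in plain Python. This is a conceptual model only, not the HBase client API; the table, rows, and column names are invented for illustration:

```python
# Conceptual model of an HBase table: a map indexed by
# (row key, column key, timestamp). Values are uninterpreted bytes,
# and every put creates a new timestamped version of the cell.
class ConceptualHBaseTable:
    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self.map = {}

    def put(self, row, column, timestamp, value):
        versions = self.map.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        versions = self.map.get(row, {}).get(column, [])
        if timestamp is None:
            return versions[0][1] if versions else None  # latest version
        for ts, value in versions:
            if ts <= timestamp:
                return value  # newest version at or before the timestamp
        return None

t = ConceptualHBaseTable()
t.put("row1", "info:city", 100, b"Austin")
t.put("row1", "info:city", 200, b"Dallas")
print(t.get("row1", "info:city"))        # latest version: b"Dallas"
print(t.get("row1", "info:city", 150))   # as of timestamp 150: b"Austin"
```

Note how a read without a timestamp returns the newest version, while a read with a timestamp travels back through the cell's history; that is the versioning behavior that distinguishes HBase tables from an RDBMS.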

What are the various HDFS file storage types?

Search for: What are the various HDFS file storage types?


Splittable into records
Records have a primary key
Storage optimized for parallel processing

A single data set is stored on multiple machines
Fault tolerance
Compressible
Higher-level schema on read
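The "schema on read" property in the list above can be illustrated with a small Python sketch: the stored data is just splittable lines of bytes, and structure is imposed by the reader rather than by the storage layer. The field names and sample contents here are invented:

```python
# Schema on read: storage holds raw, splittable records of bytes;
# each reader applies its own schema at read time.
raw = b"1,alice,30\n2,bob,25\n"  # invented sample data

def read_with_schema(data, schema):
    """Parse raw bytes into records using a (name, cast) schema."""
    records = []
    for line in data.decode().splitlines():
        fields = line.split(",")
        records.append({name: cast(value)
                        for (name, cast), value in zip(schema, fields)})
    return records

people = read_with_schema(raw, [("id", int), ("name", str), ("age", int)])
print(people[0])  # {'id': 1, 'name': 'alice', 'age': 30}
```

Two different readers could apply two different schemas to the same stored bytes, which is the point: the storage layer never enforced one.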

What kind of HDFS storage type does HBase use?

Search for: What kind of HDFS storage type does HBase use?

Before I veer off, here is an article on HDFS storage types.

Revisiting the topic in 2019

Although these files are stored in HDFS, HBase does not use YARN for its clustering. It has its own clustering framework, and likely its own JDBC-like read framework for APIs.

What is cell level security in HBase?

Search for: What is cell level security in HBase?

HBase clustering maximizes resource usage among its nodes. So there is a warning against sharing an HBase cluster with other big data jobs.

Both HBase and Accumulo are based on Google's Bigtable design.

Machines are clustered using the YARN processing protocol and the HDFS storage protocol.

Although HBase uses its own clustering rather than YARN, it ultimately stores files as HDFS files (HFile is its format).
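Conceptually, an HFile is a file of key-value pairs written in sorted order, with an index recording where each block starts so a reader can seek without scanning everything. A toy sketch of that idea in Python follows; this is not the real HFile binary format, and the keys and block size are invented:

```python
# Toy "HFile": entries are written in sorted order in fixed-size blocks,
# and a sparse index records the first key of each block.
BLOCK_SIZE = 2  # entries per block, tiny for illustration

def write_hfile(entries):
    sorted_entries = sorted(entries.items())
    blocks, index = [], []
    for i in range(0, len(sorted_entries), BLOCK_SIZE):
        block = sorted_entries[i:i + BLOCK_SIZE]
        index.append(block[0][0])  # first key of the block
        blocks.append(block)
    return blocks, index

def read_key(blocks, index, key):
    # Find the last block whose first key is <= key, then scan only it.
    candidate = 0
    for i, first_key in enumerate(index):
        if first_key <= key:
            candidate = i
    for k, v in blocks[candidate]:
        if k == key:
            return v
    return None

blocks, index = write_hfile({"row3": b"c", "row1": b"a", "row2": b"b"})
print(read_key(blocks, index, "row2"))  # b"b"
```

The sorted layout is what makes range scans by row key cheap, and the sparse index is what keeps point reads from touching the whole file.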

Parquet is another HDFS format

MapReduce, Tez, and Spark are three ways YARN is used as a general-purpose distributed processing framework.
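As a reminder of the model these engines execute, here is the classic MapReduce word count reduced to plain Python. This is a single-machine conceptual sketch of the map, shuffle, and reduce phases, not an actual YARN job; the input lines are made up:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) pairs for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key, as the framework would across nodes
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hbase runs on hdfs", "yarn runs mapreduce"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["runs"])  # 2
```

MapReduce materializes the shuffle to disk between phases; Tez and Spark run the same logical pattern but keep intermediate data in a DAG of tasks (and, for Spark, in memory), which is where their speedup comes from.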

Pig is a way to process HDFS files using a dataflow script (Pig Latin) over relational data sets. Underneath, it can use MapReduce or Tez as its processing engine.

Hive is a way to process the same files using the Hive metastore and HiveQL, which is closer to SQL, along with JDBC drivers to access the tables so defined.

Sqoop also uses MapReduce to batch-load relational databases into HDFS or HBase.

Impala, Presto, and Drill may use YARN to provide federated SQL (fast, near-real-time) queries and processing.

Spark uses YARN and in-memory processing of data, using Spark SQL as the data access language or directly using its RDDs (resilient distributed datasets).

An ApplicationMaster schedules tasks in containers using NodeManagers.

Notice how Hue provides a high-level UI for working with all of these tools, including SQL queries and exploring data.

How do you access HBase tables in Pig or Hive?

Search for: How do you access HBase tables in Pig or Hive?

Hive metadata and HBase

Search for: Hive metadata and HBase

HBase and JDBC architecture

Search for: HBase and JDBC architecture