How are HDFS and HBase different?
satya - 11/23/2018, 3:56:10 PM
How are HDFS and HBase different?
satya - 11/23/2018, 4:08:09 PM
Here is the same question posed on Quora
satya - 11/23/2018, 4:10:04 PM
A few quotes from that topic on Quora
satya - 11/23/2018, 4:14:41 PM
Start with a picture - Maya Singh
satya - 11/23/2018, 4:16:28 PM
Quote 1
HBase is a non-relational, column-oriented, distributed database that runs on top of HDFS. It is an open source NoSQL database in which data is stored in rows and columns. A cell is the intersection of a row and a column.
To track changes in a cell, versioning makes it possible to retrieve any version of its contents. Versioning is one thing that sets HBase tables apart from RDBMS tables.
Each cell value includes a "version" attribute, which is nothing more than a timestamp uniquely identifying that version of the cell. Each value in the map is an uninterpreted array of bytes.
The map is indexed by a row key, column key, and a timestamp. Implementations of HBase are highly scalable, sparse, distributed, persistent, and multidimensional-sorted maps.
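To make the "sorted map indexed by row key, column key, and timestamp" idea concrete, here is a minimal in-memory sketch. The class name `VersionedTable` and its methods are made up for illustration; this is not the HBase client API, only a toy model of the data structure the quote describes.

```python
import bisect


class VersionedTable:
    """Toy sorted map keyed by (row key, column key, timestamp).

    Hypothetical illustration only; real HBase persists this structure
    in HFiles and serves it through region servers.
    """

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> uninterpreted bytes

    def put(self, row, column, timestamp, value: bytes):
        # Negate the timestamp so the newest version sorts first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        bound = float("-inf") if timestamp is None else -timestamp
        i = bisect.bisect_left(self._keys, (row, column, bound))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def versions(self, row, column):
        """Return all (timestamp, value) pairs for a cell, newest first."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        found = []
        while i < len(self._keys) and self._keys[i][:2] == (row, column):
            key = self._keys[i]
            found.append((-key[2], self._values[key]))
            i += 1
        return found


table = VersionedTable()
table.put("row1", "info:name", 1, b"alice")
table.put("row1", "info:name", 5, b"alicia")
latest = table.get("row1", "info:name")               # newest version
older = table.get("row1", "info:name", timestamp=3)   # version as of time 3
```

Keeping every version under the same row and column, sorted newest first, is what lets a read ask either for the latest value or for the value as of an earlier timestamp.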
satya - 11/23/2018, 4:17:46 PM
What are the various HDFS file storage types?
satya - 11/23/2018, 4:25:31 PM
Key nature of a file in HDFS
Splittable into records
Records have a primary key
Storage optimized for parallel processing
A single data set is stored on multiple machines
Fault tolerance
Compressible
Higher-level schema on read
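The "split, spread across machines, fault tolerant" points in the list above can be sketched in a few lines. This is a simplified model under stated assumptions: the function names are hypothetical, the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware), and the block size here is tiny, whereas HDFS blocks default to 128 MB with a replication factor of 3.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin; not the actual HDFS placement policy.
    """
    replication = min(replication, len(nodes))
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica after a node fails."""
    return {
        block
        for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    }


blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
still_readable = readable_blocks(placement, failed_node="n1")
```

Because every block lives on several machines, losing one node leaves every block readable, and separate blocks can be processed in parallel on the nodes that hold them.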
satya - 11/23/2018, 4:28:29 PM
File storage in HDFS
satya - 11/23/2018, 4:29:14 PM
What kind of HDFS storage type does HBase use?
satya - 11/23/2018, 4:31:02 PM
Before I veer off, here is an article on HDFS storage types
satya - 7/26/2019, 2:09:00 PM
Revisiting the topic in 2019
satya - 7/26/2019, 2:10:23 PM
HBase stores its data in HDFS in the HFile format
Although these files are stored in HDFS, HBase does not use YARN for its clustering. It has its own clustering framework (an HMaster and RegionServers, coordinated through ZooKeeper), and likely its own JDBC-like client framework for read APIs.
satya - 7/26/2019, 2:10:47 PM
What is cell level security in HBase?
satya - 7/26/2019, 2:11:27 PM
HBase and resources
An HBase cluster tries to use as much of its nodes' resources as it can. So there is a warning against sharing an HBase cluster with other big data jobs.
satya - 7/26/2019, 2:11:51 PM
Both HBase and Accumulo are based on Google's Bigtable design
satya - 7/26/2019, 2:14:32 PM
See this image of high level tool breakdown (there could be errors)
satya - 7/26/2019, 2:19:30 PM
A few things about this image
Machines are clustered using YARN for processing and HDFS for storage
Although HBase uses its own clustering rather than YARN, in the end it stores its files in HDFS (with HFile as its format)
Parquet is another file format stored in HDFS
MapReduce, Tez, and Spark are three ways YARN is used as a general purpose distributed processing framework.
Pig is a way to process HDFS files using Pig Latin scripts (relational operations on data sets). Underneath, it can use MapReduce or Tez as its processing engine.
Hive is a way to process the same files using the Hive metastore and HiveQL, which is closer to SQL, along with JDBC drivers to access the tables so defined.
Sqoop also uses MapReduce to batch load data from relational databases into HDFS or HBase.
Impala, Presto, and Drill may use YARN to provide fast, near-real-time federated SQL queries and processing.
Spark can run on YARN and processes data in memory, using Spark SQL as the data access language or directly using its RDDs (resilient distributed datasets)
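Since MapReduce sits underneath several of the tools in the list above, here is a single-process sketch of its three phases (map, shuffle, reduce) using a word count. The function names are hypothetical and nothing here is distributed; on a cluster the map and reduce calls would run as separate tasks on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["HBase on HDFS", "hdfs files"])
```

Pig and Hive can generate jobs of this shape from a script or a query, which is why they can use MapReduce (or Tez) as their engine.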
satya - 7/26/2019, 2:21:15 PM
Here is how YARN distributes its work as Tasks in containers
satya - 7/26/2019, 2:21:31 PM
An ApplicationMaster schedules tasks in containers through NodeManagers
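A toy model of that idea, as a sketch only: the classes below borrow the YARN role names but are hypothetical Python objects, not the YARN API. One loud simplification: in real YARN the ApplicationMaster first asks the ResourceManager for containers; that step is left out here and each node simply advertises how many free containers it has.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical stand-in for one machine's node manager."""
    name: str
    free_containers: int
    running: list = field(default_factory=list)

    def launch(self, task) -> bool:
        """Start a task in a container if one is free."""
        if self.free_containers == 0:
            return False
        self.free_containers -= 1
        self.running.append(task)
        return True


class ApplicationMaster:
    """Hypothetical per-application scheduler (ResourceManager omitted)."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def schedule(self, tasks):
        """Place each task in a container; return (assignments, waiting)."""
        assignments, waiting = {}, []
        for task in tasks:
            # Pick the node with the most free containers.
            node = max(self.node_managers, key=lambda n: n.free_containers)
            if node.launch(task):
                assignments[task] = node.name
            else:
                waiting.append(task)  # no capacity anywhere right now
        return assignments, waiting


nodes = [NodeManager("nm1", 2), NodeManager("nm2", 1)]
master = ApplicationMaster(nodes)
assignments, waiting = master.schedule(["t1", "t2", "t3", "t4"])
```

With three containers in total, three tasks get placed and the fourth waits until a container frees up.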
satya - 7/26/2019, 2:23:16 PM
Here is another view of the tool set
Notice how Hue provides the high level UI for working with all the tools including SQL queries and exploring data
satya - 7/26/2019, 2:25:11 PM
HBase and Accumulo as Bigtable implementations
satya - 7/26/2019, 2:26:35 PM
How YARN allows multiple approaches to distributed processing
satya - 7/26/2019, 2:30:23 PM
How do you access HBase tables in Pig or Hive?
satya - 7/26/2019, 2:30:57 PM
Hive metadata and HBase
satya - 7/26/2019, 2:31:06 PM
HBase and JDBC architecture