Apache HBase is a non-relational distributed database that can host very large tables with billions of rows and millions of columns. It provides fast random queries for workloads such as e-commerce and high-frequency trading platforms.
HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3).
Amazon EMR 5.2.0 and later provide an option to run Apache HBase on Amazon S3. Running HBase on Amazon S3 provides benefits such as lower costs, data durability, and easier scalability.
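As a rough illustration of what enabling this option can look like, the sketch below builds the configuration classifications that Amazon EMR accepts at cluster creation. It assumes the `hbase.emr.storageMode` and `hbase.rootdir` properties; the helper name and the bucket path are hypothetical placeholders, not values from this document.

```python
import json


def hbase_on_s3_configurations(root_dir: str) -> list:
    """Build EMR configuration classifications that point HBase at Amazon S3.

    root_dir is the S3 location HBase should use as its root directory.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")
    return [
        # Tells EMR to run HBase in S3 storage mode.
        {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
        # Sets the HBase root directory to the given S3 location.
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
    ]


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(hbase_on_s3_configurations("s3://example-bucket/hbase-root"), indent=2))
```

The resulting JSON could be supplied as the cluster's configurations when it is created; the exact mechanism depends on how the cluster is provisioned.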
For existing HBase clusters, we recommend using HBase snapshot and replication technologies to migrate to Apache HBase on Amazon EMR without significant downtime of service.
HBase snapshots let you capture the state of a table with little impact on region servers, and exporting a snapshot to another cluster likewise has little impact on them. HBase replication is a way to copy data between HBase clusters.
During HBase migration, you can export the snapshot files to S3 and use them for recovery.
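The export step can be sketched as a small helper that assembles the command line for HBase's `ExportSnapshot` tool. This is only a sketch: the snapshot name, bucket path, and mapper count are hypothetical, and it assumes the snapshot was already taken (for example with `snapshot 'my_table', 'my_table_snapshot'` in the HBase shell).

```python
import shlex


def export_snapshot_command(snapshot: str, destination: str, mappers: int = 4) -> list:
    """Return the argument list for exporting an HBase snapshot.

    snapshot:    name of an existing snapshot on the source cluster
    destination: target root directory, e.g. an s3:// or hdfs:// URI
    mappers:     number of MapReduce mappers used for the copy
    """
    if mappers < 1:
        raise ValueError("mappers must be at least 1")
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", destination,
        "-mappers", str(mappers),
    ]


if __name__ == "__main__":
    # Hypothetical snapshot name and bucket, for illustration only.
    cmd = export_snapshot_command("my_table_snapshot", "s3://example-bucket/hbase-snapshots", 8)
    print(shlex.join(cmd))  # pass cmd to subprocess.run on a cluster node to execute it
```

Returning an argument list rather than a single string avoids shell quoting problems if the command is later run with `subprocess.run`.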
Customers can use BucketCache in file mode to cache data on local disk and thus improve HBase’s read performance. The cache can be cleared by restarting the region servers.
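For illustration, a file-mode BucketCache is typically enabled through two `hbase-site` properties, `hbase.bucketcache.ioengine` and `hbase.bucketcache.size`. The sketch below builds such a configuration; the cache file path and size are hypothetical and would need tuning to the instance's local storage.

```python
def bucketcache_file_mode_config(cache_file: str, size_mb: int) -> dict:
    """Build an hbase-site classification enabling BucketCache in file mode.

    cache_file: absolute path of the cache file on a local disk
    size_mb:    cache capacity in megabytes
    """
    if not cache_file.startswith("/"):
        raise ValueError("cache_file must be an absolute path")
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return {
        "Classification": "hbase-site",
        "Properties": {
            # The "file:" prefix selects the file-backed IO engine.
            "hbase.bucketcache.ioengine": f"file:{cache_file}",
            # Capacity of the bucket cache, in megabytes.
            "hbase.bucketcache.size": str(size_mb),
        },
    }


if __name__ == "__main__":
    # Hypothetical path and size, for illustration only.
    print(bucketcache_file_mode_config("/mnt1/hbase/bucketcache", 8192))
```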
We recommend choosing a more recent minor version when migrating to Amazon EMR while keeping the major version unchanged.
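One way to read this rule is as a simple filter over the versions a target release offers. The sketch below assumes plain dotted numeric version strings; the function name and the version lists are hypothetical examples, not a statement of what any particular Amazon EMR release ships.

```python
from typing import List, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.4.13' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def pick_target_version(current: str, available: List[str]) -> Optional[str]:
    """Pick the newest available version with the same major version as current.

    Versions older than current are ignored. Returns None if nothing qualifies.
    """
    cur = parse_version(current)
    candidates = [
        v for v in available
        if parse_version(v)[0] == cur[0] and parse_version(v) >= cur
    ]
    return max(candidates, key=parse_version) if candidates else None


if __name__ == "__main__":
    # Hypothetical source version and candidate list, for illustration only.
    print(pick_target_version("1.2.6", ["1.1.2", "1.4.13", "2.4.4"]))
```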
Users sometimes see high response latency when accessing HBase. This latency can be reduced by adding host name and IP address mappings to the /etc/hosts file on the HBase client host.
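A minimal sketch of preparing such entries is shown below. It works on the text of a hosts file rather than writing to /etc/hosts directly, so the result can be reviewed before it is applied; the function name, host names, and IP addresses are hypothetical.

```python
from typing import Dict


def add_host_mappings(hosts_text: str, mappings: Dict[str, str]) -> str:
    """Return hosts-file text with missing 'IP hostname' lines appended.

    mappings maps host name -> IP address. Host names that already appear
    in hosts_text (as a name or alias) are left untouched.
    """
    lines = hosts_text.splitlines()
    known = set()
    for line in lines:
        # Drop trailing comments, then split into IP followed by names.
        fields = line.split("#", 1)[0].split()
        known.update(fields[1:])
    for hostname, ip in mappings.items():
        if hostname not in known:
            lines.append(f"{ip} {hostname}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical region server host and address, for illustration only.
    current = "127.0.0.1 localhost\n"
    print(add_host_mappings(current, {"ip-10-0-0-12.ec2.internal": "10.0.0.12"}), end="")
```

Because existing names are skipped, running the helper repeatedly on its own output does not create duplicate entries.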
This piece provides best practices for online HBase migration to Amazon EMR using HBase snapshots and replication, and covers the key challenges faced during migration.