Exploring Iceberg in Cloudera Data Platform



In a recent blog post, we presented Apache Iceberg, an open-source data table format originally developed at Netflix that is quickly becoming the new standard in modern, open data lakes.

In this new post, we will take a deeper dive into its impressive capabilities, exploring how and why a giant like Cloudera adopted it within its Cloudera Data Platform (CDP).

As of today, Iceberg V2 is available in all the CDP form factors: Private Cloud Base (7.1.9), Data Services and Public Cloud. In our case, we will leverage the robust infrastructure of the CDP Public Cloud to explore Iceberg through Cloudera Data Warehouse (CDW), showing how it can unlock a new level of scalability, flexibility, and performance for your data-intensive tasks.

Integration with the Cloudera Ecosystem: A Tech Marvel

Within the Cloudera environment, Iceberg comes with all its valuable features, bringing additional scalability, performance, data consistency, and security for managing and analysing large datasets in CDP. The great thing is that it integrates seamlessly with the existing CDP core components without requiring any setup. Let’s look at some of the advantages this offers.

ACID Compliance

Iceberg V2 tables, when used with Impala, are ACID compliant with serialisable isolation and an optimistic concurrency model, ensuring data consistency and integrity. Furthermore, with the introduction of the V2 specification, it became crucial for implementations to identify newly created tables accurately to ensure the correct behaviour for both readers and writers. As a result, the value assigned to the format-version property for Iceberg tables gained significant importance, a factor that had been largely overlooked prior to the release of the V2 specification.
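As a quick sketch (the table name is ours, and we assume the table already exists), the property can be inspected directly:

SHOW TBLPROPERTIES ICEBERG_DB.t_table('format-version');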

Integration with Hive Metastore

Iceberg integrates perfectly with the Hive Metastore, which serves as a repository for key metadata, including pivotal table location details. This enables rapid and hassle-free scaling without any adverse effects on the Metastore performance, opening up a new world of possibilities (such as time travel) from the get-go.

Security Integration

In terms of data security and access management, Iceberg has also advanced by integrating with Ranger in the CDP environment. This offers fine-grained control over access to sensitive data within Iceberg tables and allows the implementation of rigorous security policies, ensuring that your valuable information is safeguarded like never before.

Integration with Data Visualisation Tools

Iceberg also integrates with visualisation tools, allowing data analysts, data scientists and business users to easily incorporate Iceberg data into their dashboards, reports, and visualisations.

Iceberg at Work: Real-World Applications and Functionalities

In this section, we’ll test how the main Iceberg features behave in CDP. As mentioned, we chose to use CDW within Public Cloud, but the same steps apply to all the other CDP form factors (for Private Cloud Base, you need version 7.1.9).

(If you want to know how to set up a Cloudera Public Cloud environment, check out our previous blog post).

Now that our environment is ready, let’s use the Iceberg framework in the context of both Hive and Impala. We’ll explore fundamental operations that can be performed with Iceberg tables, leveraging its features (already described in our previous blog post on Iceberg) for improved data management, query performance, and more!

Unearthing Historical Insights

Time travel is undoubtedly one of the most appreciated Iceberg features, so let’s see how it can be harnessed from Hive and Impala.

Creating a Table

First of all, we need to create an Iceberg table; all we need to do is append STORED BY ICEBERG to a typical CREATE TABLE statement:

Figure 1: Creating an Iceberg table in Hive

The same syntax applies in Impala, and as with any other Hive table, the usual EXTERNAL/MANAGED table concepts are applicable (more information here).
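For reference, a minimal creation statement looks like this (database, table, and column names are ours); note that the format-version property discussed earlier can be pinned at creation time:

CREATE TABLE ICEBERG_DB.t_table (
  id INT,
  name STRING
)
STORED BY ICEBERG
TBLPROPERTIES ('format-version' = '2');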

Table History

In Hive, every Iceberg table comes with its own history. By selecting the table history, we get access to the following metadata:

  • Timestamp: When a record was inserted.
  • Snapshot_ID: Associated with the specific insertion, allowing you to track and reference this snapshot.
  • Parent_snapshot_ID: Provides a link to the previous snapshot. This relationship between snapshots helps to trace the data lineage.
  • is_current_ancestor: Indicates whether the snapshot is an ancestor of the table’s current snapshot. This information is valuable for understanding the hierarchical structure of snapshots.

This level of detail empowers us to better manage, analyse, and comprehend the historical changes and relationships within our Iceberg tables.

In the following screenshot, we can see how the history is readily available in Hive, accessible through a simple SELECT statement:

Figure 2: Table History in Hive
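For reference, the statement behind the screenshot has this shape (the table name is ours); in Hive, a table’s history is exposed as a queryable metadata table:

SELECT * FROM ICEBERG_DB.t_table.history;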
Time Travel

Thanks to the availability of a table’s history, Iceberg allows us to perform time travel. Imagine being able to view historical records with ease and thus gain a dynamic perspective on updates, changes, and trends!

In Iceberg, we can explore data based on timestamp and version, and of course, we can do the same in Hive. Timestamp and version values can be obtained from the history, as we just saw (with the version being the snapshot_id).

Below you can see two template queries (and two sample screenshots) to perform time travel on a specific table using both these methods:

SELECT *
FROM ICEBERG_DB.your_table
FOR SYSTEM_TIME AS OF '<timestamp>';

SELECT *
FROM ICEBERG_DB.your_table
FOR SYSTEM_VERSION AS OF <version_no>;
Figure 3: Time Travel via Timestamp
Figure 4: Time Travel via Version
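For instance, plugging in values taken from the history output (both the timestamp and the snapshot id below are purely illustrative):

SELECT * FROM ICEBERG_DB.t_table
FOR SYSTEM_TIME AS OF '2023-10-15 10:00:00';

SELECT * FROM ICEBERG_DB.t_table
FOR SYSTEM_VERSION AS OF 6351696359964929524;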
Time Travel with Impala

Unlike Hive, Impala introduces its own distinctive method to access historical details. To reveal a table’s history using Impala, we need to use the DESCRIBE HISTORY command:

DESCRIBE HISTORY ICEBERG_DB.impala_table;
Figure 5: Table History in Impala
Rollback

Iceberg’s Rollback function employs both the history and time travel capabilities to allow us to effortlessly return our data to previous versions. To use this feature, we need a command like this:

ALTER TABLE ICEBERG_DB.t_table EXECUTE ROLLBACK(<snapshot_id>);
Figure 6: Rollback example

You can also perform a rollback based on a timestamp:

ALTER TABLE ICEBERG_DB.t_table EXECUTE ROLLBACK('<timestamp>');

The result of this operation is clearly visible in the table’s history. The is_current_ancestor value of the latest version is set to False (record 3 in the screenshot below), and a new record is added as the last current ancestor (record 4), with the same snapshot_id and parent_id as the snapshot we rolled back to (record 2):

Figure 7: Updated History after Rollback

Note that we cannot initiate a rollback to a snapshot whose is_current_ancestor flag in the history metadata is set to False, meaning that we can only use snapshots marked as current ancestors for rollback operations. In other words, we cannot “rollback a rollback”.

Rollback operations are available in Hive as well as in Impala.

Efficient Metadata Management: Storing and Organising Data Insights

In our previous blog post, we discussed the architecture of Iceberg extensively, examining crucial components like the catalogue, metadata, and data files. Now it’s time to look at the implementation of such concepts within the CDP environment. Assuming we’ve already created an Iceberg table, we can now delve into the specifics of where its metadata and data files are located.

By using the familiar SHOW CREATE TABLE command, we can see these interesting properties:

  • Metadata_location
  • Previous_metadata_location

Let’s try it:

SHOW CREATE TABLE ICEBERG_DB.t_table;

The same information is also easily accessible in Hue thanks to the table browser. In the screenshot below, the metadata location values are highlighted:

Figure 8: Iceberg table metadata

This is the physical location where all the Iceberg metadata layers for this table are stored (metadata files, manifests, etc.). Refer to our previous post if you need a reminder!

In our case, we use Azure Blob Storage locations because our cluster is a Public Cloud instance on Azure. For AWS, these would be S3 locations, and for Iceberg on Private Cloud they would be HDFS directories. However, the underlying concept remains the same across these platforms.

We can access this location to explore these files. Next to them we find, of course, the /data directory, corresponding to the Location property of the table:

Figure 9: Metadata and Data directories for an Iceberg table

Navigating to these directories, we can see the table data and its metadata files:

Figure 10: Iceberg table data
Figure 11: Iceberg table metadata

Optimising Data Management: Partitioning in Iceberg

In our previous blog post, we also explored the fascinating concept of partition evolution, and as we might expect, this works in exactly the same way in Cloudera. Let’s create a partitioned table in Hive:

Figure 12: Iceberg partitioned table in Hive
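As a sketch (table and column names are ours), an identity-partitioned Iceberg table uses the familiar Hive syntax, and the partition spec can later be evolved without rewriting the existing data:

CREATE TABLE ICEBERG_DB.t_part (
  id INT,
  name STRING
)
PARTITIONED BY (country STRING)
STORED BY ICEBERG;

-- partition evolution: switch the same table to a different spec
ALTER TABLE ICEBERG_DB.t_part SET PARTITION SPEC (country, bucket(8, id));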

As we proceed to insert records into this table, we can observe how additional sub-directories are created within the original /data location. The behaviour is exactly as expected, demonstrating once again how Iceberg is seamlessly embedded in the structure of CDW, providing a fully transparent integration for users:

Figure 13: Partitioned data in Iceberg
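For instance, inserts like the following (values are illustrative) produce one sub-directory per distinct country value under /data:

INSERT INTO ICEBERG_DB.t_part VALUES (1, 'Alice', 'IT'), (2, 'Bob', 'FR');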

Seamless Migration: Transitioning Your Hive Table with Ease

The final key concept to explore in today’s post is the transition from Hive to Iceberg. For users with an existing CDP instance containing significant data stored as Hive tables, understanding this process and the feasibility of migrating these tables to Iceberg is crucial.

There are two ways to migrate such tables:

  • In-place migrations
  • Shadow migrations

Both are very straightforward, but with different implications (as their names suggest). Here are the details.

In-place Migrations

In an in-place migration, the existing data is transformed directly within the current storage system.

For instance, consider a scenario where we have a Hive table called employees_parquet in a Hadoop HDFS storage system. To migrate this data to Iceberg using an in-place migration, we have to apply Iceberg’s format to the existing data within the same storage infrastructure. This approach minimises the need for additional storage resources, as the data transformation occurs within the same storage location:

CREATE TABLE employees_parquet (
  id INT,
  name STRING,
  salary DOUBLE)
STORED AS PARQUET;

ALTER TABLE employees_parquet
SET TBLPROPERTIES (
  'storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler',
  'format-version' = '2'
);

Following the execution of the given query, a subsequent review with the SHOW CREATE TABLE statement confirms the table’s successful migration to the Iceberg format:
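SHOW CREATE TABLE employees_parquet;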

Figure 14: In-place migration

Shadow Migrations

Shadow migration involves creating a copy of the data in the Iceberg format without changing the original data.

To illustrate this, imagine that we have an Employees table in a relational database. Using the following CTAS statement, we can duplicate it as an Iceberg table (note the use of the STORED BY clause):

CREATE TABLE employee1
STORED BY ICEBERG
AS SELECT *
   FROM Employees;

You can see the result in the screenshot below. All the typical Iceberg properties are present, confirming that this is indeed a new Iceberg table:

Figure 15: Shadow migration

In the screenshot, we can see various pieces of information related to the migration process. One noteworthy detail is the specific snapshot used for the migration, which serves as a key reference point for tracking and managing changes within your data.

Comprehensive Evaluation

In this article, we’ve delved deep into Iceberg-supported functionalities within Hive and Impala using Cloudera Data Platform. In the table below, we’ve summarised the compatibility of these services for different Iceberg operations:

Comprehensive evaluation

Summary

In this article, our second about Apache Iceberg, we’ve explored its powerful functionalities within the context of Cloudera Data Platform, and more specifically, using CDW in CDP Public Cloud. By integrating perfectly with CDW, Apache Iceberg provides CDP-based organisations with enhanced capabilities for managing their data, enabling the seamless scaling of data lakes, improved data governance, and streamlined data operations.

If you are interested in exploring the world of Iceberg or CDP, don’t hesitate to contact us with any questions you might have. Our team of certified experts has extensive experience and is ready to help you fully leverage Apache Iceberg within your data infrastructure. We look forward to helping you optimise your data management strategies and achieve greater efficiency and performance in your sector.