In the context of data warehousing, ETL plays a crucial role in consolidating, cleansing, transforming, and organising data from diverse sources into a unified, consistent dataset. Beyond ensuring data reliability and accuracy, it is also important that the whole process is performance-optimised and timely.

In this article, we’ll explore some optimisation techniques specifically tailored for Oracle Data Integrator (ODI), a leading ETL tool.

We’ll cover the following:

  1. Choice of Knowledge Module
  2. Customising Knowledge Modules to disable/enable indexes
  3. Gather Statistics

Choice of Knowledge Module

Knowledge Modules are a special (and perhaps the most powerful) category of component in ODI. They consist of code templates that can be used to perform a wide range of tasks during the data integration process; they are reusable, customisable, and named according to the tasks they are designed to perform.

In this section we’ll look at Loading Knowledge Modules (LKMs) and the advantages they can offer when chosen correctly.

Firstly, it is important to remember that LKMs are used to load data from a source datastore to a staging table, which means that not all mappings will require this component. LKMs are particularly useful in scenarios where the source and target datastores use different technologies, or are hosted on separate servers.

Secondly, it is vital to select the most suitable LKM for each mapping. The typical naming convention for LKMs is as follows:

LKM <source technology> to <target technology> (<loading method>)

It should be relatively easy to choose one if we know the source and target technologies. For example, we could use the LKM SQL to Oracle (Built-in).GLOBAL LKM when loading data from a generic SQL-compliant source into an Oracle environment. The right choice means a more efficient loading process that will likely reduce execution time.

However, choosing the right technologies is not the only thing that can speed up the loading process: the loading method can also play an important role. For example, if both the source and target technologies are Oracle, a database link (DBLINK) can be used instead of the default method to achieve a faster load (using, for example, the LKM Oracle to Oracle Pull (DB Link).GLOBAL LKM). This can be particularly helpful when loading massive amounts of data.
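
To see why this is faster, note that with a DB link-based LKM the generated load step is conceptually a single set-based insert that pulls rows straight across the link, rather than routing them through the Agent. The sketch below illustrates the idea; the table, column, and link names are purely illustrative:

-- Direct-path insert over a database link: the rows never leave the database layer
INSERT /*+ APPEND */ INTO stg_customers (cust_id, cust_name, country)
SELECT cust_id, cust_name, country
FROM   customers@src_db_link;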

ODI supports multiple methods for performing data upload operations across different technologies. For instance, when working with Big Data, ODI offers a variety of LKMs to facilitate data uploads between HDFS, Hive, Spark, HBase, and Pig. Each of these tools provides distinct performance advantages, with some enabling faster data uploads than others, so the choice of LKM is a key factor in determining the overall efficiency and performance of your data upload processes.

Customising Knowledge Modules to Disable/Enable Indexes

Indexes enhance the performance of SQL queries by enabling faster data retrieval, but they can negatively impact the performance of certain operations, particularly those involving data insertion or deletion. This is because whenever rows are inserted or deleted, all indexes associated with the affected table must be updated, which can significantly slow these processes down.

To mitigate this, a best practice when inserting new data into a table with existing indexes is to disable the indexes before the data load, then re-enable them afterwards. In ODI this can be done using Knowledge Modules, specifically Integration Knowledge Modules (IKMs).

Whilst LKMs are typically used for data loading, IKMs handle data integration tasks in the target system. These tasks may include strategies such as Insert, Update, or Slowly Changing Dimensions (SCDs), and IKMs can be customised to handle the process of disabling and enabling indexes during data load operations.

Disabling and enabling can also be extended to table constraints, as constraint validation can significantly impact performance, especially when inserting large volumes of data into a table.

It is important to note that unique indexes cannot be disabled directly. However, this limitation can be worked around if the unique index is generated through a unique constraint, using the following syntax:

CONSTRAINT constraint_name UNIQUE (column1, column2, ..., column_n) USING INDEX

If a unique index is created directly (i.e. without a constraint), attempts to disable it will lead to errors.
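
To make the pattern concrete, the DDL that a customised Knowledge Module issues around the load boils down to something like the following sketch (the object names here are purely illustrative):

-- Before the load: mark non-unique indexes unusable and let DML skip them
ALTER INDEX sales_cust_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;
-- For a unique index backed by a constraint, disable the constraint instead
ALTER TABLE sales DISABLE CONSTRAINT sales_uk;

-- ... bulk load into SALES ...

-- After the load: re-enable the constraint and rebuild the index
ALTER TABLE sales ENABLE CONSTRAINT sales_uk;
ALTER INDEX sales_cust_idx REBUILD;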

In ODI, the option to enable or disable indexes and constraints is typically available as part of the IKM configuration in the Physical layer of a mapping. This flexibility lets you control index behaviour during data loads effectively, avoiding potential errors while optimising performance.

There are, however, some IKMs that do not offer these actions out of the box, such as IKM SQL Control Append.

Fortunately, a KM can be customised to include new tasks or to change existing ones. To do so, we access the Knowledge Modules list in the Projects or the Global Objects menu in the Designer tab (depending on whether you want the IKM to be specific to a project or available for use across all projects), and duplicate the IKM we want to use as a base.

If we double-click on the newly created IKM, we can access the general configuration, but more importantly, the tasks it will perform when executed.

The task list now includes options for disabling and enabling indexes and constraints. It also allows you to define the PL/SQL command to be executed and to configure other execution-related options for each task, as shown in the sketch below.
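
For instance, a custom ‘Disable Indexes’ task could execute a PL/SQL block like this sketch, which assumes the task runs as the target table’s owner and that the table name is resolved through ODI’s odiRef substitution API:

BEGIN
    -- Mark every non-unique index on the target table unusable;
    -- unique indexes are left alone, as they cannot be disabled directly
    FOR idx IN (SELECT index_name
                FROM   user_indexes
                WHERE  table_name = '<%=odiRef.getTargetTable("RES_NAME")%>'
                AND    uniqueness = 'NONUNIQUE') LOOP
        EXECUTE IMMEDIATE 'ALTER INDEX ' || idx.index_name || ' UNUSABLE';
    END LOOP;
END;

A matching ‘Enable Indexes’ task would loop over the same indexes after the load and issue ALTER INDEX ... REBUILD.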

Gather Statistics

Oracle provides the DBMS_STATS package to facilitate the collection of various types of statistics to enhance performance. Typically, Oracle automatically gathers statistics for database objects with missing or outdated statistics during a predefined maintenance window. However, in a data warehouse context it may be necessary to gather statistics manually for specific use cases, such as when data changes in a table that is then used as a source to load another table.

The operations included in this package can be applied to statistics stored either in the dictionary or in a table created for this purpose; the package also supports user-defined statistics for tables and domain indexes.

In this section we’ll focus on the GATHER_*_STATS procedures, specifically GATHER_TABLE_STATS.

These procedures can be crucial for optimising SQL execution plans, and they can be applied in ODI in the following situations:

  1. Interfaces and Mappings: To ensure efficient ETL processes by collecting statistics on source and target tables.
  2. Packages: As procedural steps within ODI packages to automate the gathering of statistics at key stages in a data integration workflow.
  3. Scenarios: As part of generated scenarios to ensure that statistics are gathered during the automated deployment of ETL processes.

Below is the syntax for the GATHER_TABLE_STATS procedure, along with an explanation of the parameters involved:

DBMS_STATS.GATHER_TABLE_STATS (
    ownname     => 'SCHEMA_NAME',
    tabname     => 'TABLE_NAME',
    granularity => 'GRANULARITY',
    method_opt  => 'METHOD_OPT',
    cascade     => CASCADE,   -- BOOLEAN, e.g. TRUE
    degree      => DEGREE     -- NUMBER, e.g. 4
);

The parameters represent the following (a complete worked example follows the list):

1. SCHEMA_NAME: The schema that owns the table whose statistics are to be gathered. (To gather statistics for every table in a schema, use the companion GATHER_SCHEMA_STATS procedure instead.)

    • Example: 'HR'

2. TABLE_NAME: The name of the table for which statistics are to be gathered; the schema is passed separately via ownname. This parameter is required.

    • Example: 'EMPLOYEES'

3. GRANULARITY: Specifies the level of granularity of the statistics. It can be set to 'GLOBAL', 'PARTITION', 'SUBPARTITION', etc.

    • Example: 'GLOBAL'

4. METHOD_OPT: Defines the method options for collecting statistics on columns. This parameter also controls whether histograms should be gathered, and if so, on which columns.

    • Example: 'FOR ALL COLUMNS SIZE AUTO'

5. CASCADE: Determines whether statistics should also be gathered for the indexes on the table. This is a BOOLEAN parameter; its default (DBMS_STATS.AUTO_CASCADE) lets Oracle decide whether index statistics are needed.

    • Values: TRUE or FALSE
    • Example: TRUE

6. DEGREE: Specifies the degree of parallelism to be used for gathering statistics, as a NUMBER. Increasing this value can speed up the process but requires more resources.

    • Example: 4

7. OPTIONS: Additional collection options, not shown in the syntax above. For GATHER_TABLE_STATS the valid values are 'GATHER' (the default) and 'GATHER AUTO'; values such as 'GATHER STALE' and 'GATHER EMPTY' apply to the schema- and database-level procedures.

    • Example: 'GATHER'
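
Putting it all together, a complete call using the example values above is a short PL/SQL block (runnable as-is, assuming the HR.EMPLOYEES table from earlier):

BEGIN
    DBMS_STATS.GATHER_TABLE_STATS (
        ownname     => 'HR',
        tabname     => 'EMPLOYEES',
        granularity => 'GLOBAL',
        method_opt  => 'FOR ALL COLUMNS SIZE AUTO',
        cascade     => TRUE,
        degree      => 4
    );
END;
/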

This command can be run directly in a mapping, normally in the End Mapping Command option in the Physical layer.

However, this is not the only way to include statistics gathering in a mapping. You can also customise the IKM to include this command as a new task. This approach ensures that statistics for the table are gathered during mapping execution, further improving performance.

The new ‘Gather Stats’ step would be included in the customised IKM, running a PL/SQL command along the lines of the sketch below, which assumes the target schema and table name are resolved through ODI’s odiRef substitution API:
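
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS (
        -- Both values are resolved by ODI at code-generation time
        ownname => '<%=odiRef.getInfo("DEST_SCHEMA")%>',
        tabname => '<%=odiRef.getTargetTable("RES_NAME")%>',
        cascade => TRUE
    );
END;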

If we use the custom IKM in a mapping, we can see that the new task has been added.

To conclude this section, the benefits of using GATHER_*_STATS in ODI are:

  • Optimised SQL Execution: Up-to-date statistics help the Oracle Optimizer to generate more efficient execution plans.
  • Improved Performance: Better execution plans lead to faster query execution and improved overall ETL performance.
  • Automation: Integrating GATHER_*_STATS into ODI workflows automates the statistics collection process, reducing manual maintenance effort.

By using the GATHER_*_STATS procedures effectively within ODI, you can ensure that your ETL processes run optimally, with the most efficient execution plans.

Conclusion

ODI typically delivers excellent performance thanks to its E-LT architecture, which harnesses native DBMS SQL and utilities. However, real-world scenarios often bring performance challenges, and in this article we’ve aimed to address some of them with targeted solutions.

There are additional optimisation techniques to consider, including:

  • Install and run ODI Studio on the same subnet as the repository database server.
  • Run the Agent on the same machine as the database to take advantage of database utilities.
  • Use Java EE Agents to benefit from the load-balancing capabilities available in Oracle WebLogic.
  • Whilst the staging area is typically located on the target server, moving it to the source server can enhance performance in some cases.
  • Enable the Use Unique Temporary Object Names option when multiple mappings load the same target table.
  • Increase the Array Fetch Size and Batch Update Size in the Data Server configuration to optimise data fetching and insertion by the Agent.
  • Execute filters on source servers.

Here at ClearPeaks, our certified experts are ready to support you in extracting the maximum value from your data and in meeting your strategic business needs. Connect with us today to unlock the full potential of your data integration processes!