In our rapidly evolving world of data integration and transformation, Matillion has taken a big step forward by introducing AI-driven features into its platform, revolutionising the way businesses manage their data pipelines. With the introduction of several cutting-edge tools, Matillion is enhancing traditional ETL/ELT processes, making them smarter, faster, and more efficient.

In this blog post we’ll explore Matillion’s new AI capabilities, their impact on modern data workflows, and how they compare to conventional data integration methods. We’ll dive deep into specific use cases of their AI Prompt component, discover the Auto Documentation feature, discuss the many benefits of RAG, and explain how their Copilot acts as an intelligent assistant to streamline operations. Finally, we’ll be testing these AI features in two real-world scenarios, showcasing their transformative potential. So let’s take a closer look at how Matillion is harnessing AI to redefine data transformation!

Matillion’s New AI Capabilities

Matillion has introduced four powerful AI-driven features designed to enhance productivity and efficiency in data engineering:

  • AI Prompt Component: This new capability empowers data engineers to transform data processing and analysis across OpenAI, Azure, and AWS environments. By leveraging LLM technology, pipelines can generate insightful responses to user prompts, adding valuable context to data workflows. Later we’ll look at some different cases using these brand-new Matillion components.

New OpenAI, Amazon, and Azure components

  • Auto Documentation: Matillion’s new AI Auto Documentation feature helps to generate automatic documentation for data pipelines. By right-clicking on any component and selecting the Add note using AI option, users can produce descriptions of a pipeline’s logic. The feature enables markdown formatting, allowing users to highlight key sections in bold or with background colours. It also offers options to regenerate or modify the documentation generated, making it easier for teams to collaborate and understand pipeline processes without too much technical knowledge. This feature boosts productivity whilst also offering flexible customisation:

Note created with Auto Documentation

  • Retrieval-Augmented Generation (RAG) components: Typically, an LLM operates like a student taking an exam without access to external resources, offering educated guesses or, at times, failing to provide answers altogether. However, with RAG, the LLM gets an “open book”, allowing users to supply it with specific, relevant information to enhance both the accuracy and precision of its responses. An LLM may struggle to provide the most up-to-date or detailed insights, especially when it comes to recent events or very specific topics—like the intricacies of your company’s proprietary product line—even if there is plenty of publicly available data. This is where RAG plays a critical role: by integrating external, often private, data into the LLM’s response process, users can significantly improve the model’s performance. Matillion’s new RAG components enable users to load data into popular vector stores, such as Pinecone or PostgreSQL (which is open-source), allowing the system to leverage private, unstructured data, enriching the context for more accurate and relevant responses. Whether it’s product specifications, internal documents, or other proprietary knowledge, RAG ensures that the LLM can tap into the right data sources to generate tailored, insightful answers. What’s more, this capability helps to close the knowledge gap, letting users generate responses that are not only more reliable but also specific to their business needs. A minimal, tool-agnostic sketch of this pattern follows the image below.

New components to leverage RAG
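To make the pattern more concrete, here is a minimal, tool-agnostic sketch of RAG in plain Python. It is not Matillion code: the documents, model names, and helper functions are assumptions chosen purely for illustration.

```python
# Minimal RAG sketch (illustrative only): embed private documents, retrieve the
# most relevant one for a question, and hand it to the LLM as extra context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A toy "private knowledge base"; in practice this would come from your own documents.
documents = [
    "Product X supports incremental loads from PostgreSQL.",
    "Product X licences are priced per pipeline, not per user.",
]

def embed(texts):
    # Turn each text into a vector; any embedding model could be used here.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

def retrieve(question, top_k=1):
    # Cosine similarity against the tiny in-memory "vector store".
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

question = "How is Product X licensed?"
context = "\n".join(retrieve(question))

# The retrieved context is the "open book" handed to the LLM along with the question.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```

Matillion’s RAG components handle the equivalent loading of data into vector stores such as Pinecone or PostgreSQL within the pipeline itself, so this plumbing does not have to be hand-written.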

  • Copilot: This powerful feature revolutionises how data engineers build pipelines. Copilot allows users to create complex data workflows using natural language inputs, reducing the need for manual coding. Not only does it accelerate pipeline development, but it also offers real-time support, helping users to optimise their data transformations and integrations. By simplifying the process, Copilot lowers entry barriers, making data engineering more accessible. It empowers a wider range of team members—from data engineers to analysts—to actively contribute to data initiatives, fostering greater collaboration and innovation across the organisation. Currently, Copilot is only available for transformation pipelines, but Matillion’s offering is continually evolving. Future updates are expected to expand its functionality, making it an even more robust and versatile feature. As Copilot improves, it promises to play a key role in democratising data engineering, transforming the way teams approach data management.

Building a pipeline with Copilot

These cutting-edge features are going to significantly enhance productivity and drive innovation in data engineering. We considered a few features to be of particular interest, so we decided to explore them hands-on to see if we could identify any practical applications. Take a look at what we discovered!

AI Prompt Component

The new AI Prompt components in Matillion are an exciting addition, enabling you to use advanced language models, such as ChatGPT, within your data integration workflows. We used our own OpenAI account to check how the tool behaves in certain scenarios. First, let’s list some common applications in which we can test this brand-new feature:

  1. Sentiment analysis: Use the AI Prompt component to analyse comments or reviews, such as product feedback or social media posts. This can help the marketing team to monitor brand sentiment in real time, providing insights into customer perceptions and facilitating reputation management.
  2. Automatic summarisation: Transform extensive reports, news articles, or documents into concise summaries. By speeding up the analysis of long texts, you get quicker decision-making and better productivity.
  3. Classification of unstructured text: Categorise large volumes of unstructured text, such as emails or support tickets, into predefined categories, optimising customer service workflows and improving content management and classification.
  4. Translation of multilingual content: When dealing with global customers, use this component to automatically translate content into different languages. This facilitates the internationalisation of products and services, improving the customer experience across different regions.
  5. Key data extraction: Extract key information, such as names, locations, and dates, from long or complex texts, improving data accessibility and usability.
  6. Data correction and normalisation: Address data issues such as grammatical errors or inconsistencies in units of measurement. This component improves data quality by automating the correction and normalisation processes, reducing the need for manual data cleaning (see the short sketch after this list).
  7. Content generation from structured data: Utilise structured data, such as sales statistics, to automatically generate insightful content—like conclusions or recommendations—for reports and dashboards. This enhances data interpretation and decision-making.
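To give a flavour of the kind of call that sits behind a component like this, here is a minimal sketch of use case 6 (data correction and normalisation) in plain Python. The model name, prompt wording, and example value are our own assumptions, not Matillion settings.

```python
# Minimal sketch of the kind of call an AI prompt wraps: normalising an
# inconsistent unit of measurement in a single field.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

raw_value = "approx. 2,5 KG"  # an inconsistent unit/format coming from a source system

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Rewrite the value in kilograms as a plain number with a dot "
                    "as the decimal separator. Answer with the number only."},
        {"role": "user", "content": raw_value},
    ],
    temperature=0,  # keep the correction as deterministic as possible
)
print(response.choices[0].message.content)  # expected to print something like "2.5"
```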

Now we’ll test the OpenAI Prompt component by exploring the summarisation and classification options, evaluating performance and effectiveness in real-world scenarios.

Use Case 1: Classification Of Unstructured Text

In this use case we loaded a list of blog post titles with the aim of automatically categorising them based solely on the information provided by the title. To achieve this, we used an OpenAI prompt as one of three distinct components in the process. This prompt was configured to classify the titles according to a predefined list of twelve categories. Additionally, we gave the model the option to assign a “New Category” if it determined that a post title did not fit any of the existing categories. This setup enabled us to evaluate how effectively the model could both refine its categorisations and adhere to our specific categorisation guidelines:

Pipeline loading and analysing data with an OpenAI prompt

With Matillion’s handy Sample data button, you can retrieve a sample of the data passing through each component. The following image shows the data sample for a specific component, allowing us to review and debug the data easily:

Sampling our data to see what the loaded data looks like

After setting up the Matillion connection with OpenAI through the API key, the component options allow you to choose the AI model and configure its parameters, as shown in the following image:

OpenAI Prompt component settings

Once configuration has been completed, the prompt will need to be defined. In this case, the prompt we used (with models gpt-3.5-turbo and gpt-4o-mini) to generate the CATEGORY column was:

According to the Post Title (“PostTitle” column), give a category. You can only select one from this list:

[Science & Environment, Home & Garden, Health & Wellness, Technology & Innovation, Food & Cooking, Travel & Exploration, Fitness & Personal Growth, Finance & Investment, Books & Literature, Pets & Animal Care, Arts & Crafts, Photography & Visual Arts].

Just in case you think there is no way to include the Post Title in the category list, write “New Category”.

Here you can add as many columns/outputs as needed
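To illustrate what the component is doing for each row, the sketch below applies a lightly adapted version of the prompt above to a small column of post titles and writes the answer to a new CATEGORY column. It is an illustrative approximation in pandas, not Matillion internals, and the example titles are invented.

```python
# Illustrative only: apply the classification prompt row by row, the way the
# OpenAI Prompt component populates a CATEGORY column.
import pandas as pd
from openai import OpenAI

client = OpenAI()

CATEGORIES = [
    "Science & Environment", "Home & Garden", "Health & Wellness",
    "Technology & Innovation", "Food & Cooking", "Travel & Exploration",
    "Fitness & Personal Growth", "Finance & Investment", "Books & Literature",
    "Pets & Animal Care", "Arts & Crafts", "Photography & Visual Arts",
]

PROMPT = (
    "According to the Post Title, give a category. You can only select one "
    f"from this list: {', '.join(CATEGORIES)}. "
    'If the title fits none of them, write "New Category". '
    "Answer with the category only."
)

def classify(title: str) -> str:
    # One API call per row, mirroring how the component fills the new column.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": title},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

posts = pd.DataFrame({"PostTitle": ["10 Easy Herb Gardens for Small Flats",
                                    "How to Photograph the Night Sky"]})
posts["CATEGORY"] = posts["PostTitle"].apply(classify)
print(posts)
```

Each row triggers its own API call, a detail that matters both for cost and for the consistency limitation discussed below.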

Our choice of this particular prompt allowed us to observe various limitations and potential optimisations. Initially, we tasked the AI with categorising post titles without a predefined list and requested that it avoid duplicating categories with similar wording (e.g., not creating “Gardening” as a new category if “Home and Gardening” already existed). However, this approach revealed a significant limitation: the prompt operates independently for each title, without any memory of previous classifications, and this lack of continuity leads to inconsistent categorisation, as the model cannot remember which categories it has already used.

We included the option to generate a “New Category” to evaluate the AI’s reasoning capabilities, granting it the flexibility to deviate from standard patterns. However, this slightly increases the risk of AI hallucinations, a well-known challenge in the field.

Below we can see some results, comparing the gpt-3.5-turbo and the gpt-4o-mini models:

A. gpt-3.5-turbo results:

Categorisation results using gpt-3.5-turbo model

As you can see, 3.5-turbo failed at categorising posts. In just these eleven post titles, there are two issues: in red, the prompt given to OpenAI is written as the category, whilst in blue, an incorrect categorisation is highlighted, showing instances where “New Category” was selected despite correct options already being available in the predefined list.

B. gpt-4o-mini results:

Categorisation results using gpt-4o-mini model

As shown above, the categories match the post titles perfectly. Furthermore, as displayed in the Snowflake statistics, the categories were effectively distributed across the dataset of 1,000 posts. Interestingly, the “Arts & Crafts” category was not used, likely absorbed by other closely related categories; there was no need to create any “New Category”.

Compared with the gpt-3.5-turbo model, the improvement in quality is even more evident.

Category distribution: gpt-4o-mini (left) and gpt-3.5-turbo (right)

Use Case 2: Automatic Summarisation

In this use case we loaded a list of CNN and Daily Mail articles, and we wanted the OpenAI Prompt component to summarise them into a few lines, extract three hashtags, and find the country in which the events took place, if possible.

Pipeline loading and analysing data with an OpenAI prompt

Sample data loaded for review

We asked the AI to fill three new columns using the following prompts (only using gpt-4o-mini):

  • Summary: Outline the highlights of each article in 4 sentences maximum.
  • Hashtags: Extract 3 hashtags from the article.
  • Country: State the country where it happens; if that can’t be located, write “undefined”.

We must remember that each new column means more API calls, so costs increase as you request new outputs. One option to optimise costs would be to create a prompt that encompasses all three and returns a JSON object to be parsed. However, this option would have to be tested first to make sure the model understands the three requests separately and returns an ordered output; a rough, untested sketch of this idea follows.
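Purely as an untested sketch of that idea, the snippet below asks for all three fields in a single JSON object and parses the answer afterwards, so each article needs one API call instead of three. The key names and prompt wording are our own assumptions.

```python
# Illustrative sketch of the "one prompt, one JSON answer" cost optimisation:
# a single API call per article returns summary, hashtags, and country.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Analyse the article and return a JSON object with exactly these keys: "
    '"summary" (4 sentences maximum), "hashtags" (a list of 3 hashtags), and '
    '"country" (the country where it happens, or "undefined" if it cannot be located).'
)

def analyse(article_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # ask the API for a valid JSON answer
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

result = analyse("Experts are raising concerns about the shrinking size of plane seats ...")
print(result["summary"], result["hashtags"], result["country"], sep="\n")
```

The parsed keys could then be split back into separate columns downstream; it would still be worth validating that all three keys are present, since the model can occasionally deviate from the requested structure.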

Here you can add as many columns/outputs as needed

Results obtained with the specified outputs

These AI results align with what we were expecting: the summaries and hashtags are correctly related to the articles, although some values in the Country column were “undefined”.

Countries distribution

The top two rows show 27 news stories from the UK (which makes sense, as the dataset comes from CNN and the Daily Mail) and 23 undefined values.

All other entries in the list are actual countries, indicating no hallucinations. However, it is notable that the model used the term “West Indies”, which refers to a region rather than a country. On reviewing the article, “West Indies” is indeed the term used, but it would be advisable to clarify in the prompt whether exact country names are needed, as any variation in terminology could disrupt a join operation with another table.
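If exact country names matter for a downstream join, a small normalisation step can absorb this kind of variation before the tables are matched; the mapping below is a hypothetical illustration rather than part of the pipeline we built.

```python
import pandas as pd

# Hypothetical mapping from region-level or variant names returned by the model
# to the canonical country names used in the reference table we want to join on.
CANONICAL = {
    "West Indies": "undefined",   # a region, not a country: pick whatever policy fits
    "UK": "United Kingdom",
    "U.S.": "United States",
}

articles = pd.DataFrame({"Country": ["United States", "West Indies", "UK"]})
articles["Country"] = articles["Country"].replace(CANONICAL)

countries = pd.DataFrame({"Country": ["United States", "United Kingdom"],
                          "Region": ["Americas", "Europe"]})

# After normalisation the join behaves as expected for the mapped values.
joined = articles.merge(countries, on="Country", how="left")
print(joined)
```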

Reviewing the 23 articles where the country was undefined shows that these results were equally well categorised, as news about movies, technology, or generic topics does not mention specific locations.

Here’s an example summary extracted from an article using AI:

Summary:

Experts are raising concerns about the shrinking size of plane seats, suggesting it could jeopardize passenger health and safety. A U.S consumer advisory group criticized the lack of minimum space regulations for humans compared to animals. United Airlines, for example, has reduced its seat pitch to as little as 30 inches, causing discomfort for travelers. The Federal Aviation Administration has been conducting tests based on an outdated standard of 31 inches, prompting calls for a review of these practices.

Hashtags: #FlightSafety #AirlineSeats #PassengerRights

Country: United States

Through the AI Prompt component, we have seen how using language models can simplify complex tasks like text classification or summarisation. These real-world examples have proved its potential to automate traditionally time-consuming tasks, unlocking a wealth of possibilities.

In the first use case, we utilised the OpenAI Prompt component to categorise various post titles. We ran the component using two different models—gpt-3.5-turbo and gpt-4o-mini—and compared the results from each model to assess their performance and accuracy.

In the second case, we used the same component to perform three tasks on a dataset of articles: to generate summaries, to extract three relevant hashtags, and to identify the country where the event took place.

Pros:

  • Seamless integration with OpenAI, enhancing accessibility.
  • Ideal for automating repetitive, cumbersome, and time-consuming tasks that are simple but labour-intensive when done manually.
  • Unlocks vast potential and new opportunities for improving data engineering workflows and processes.
  • Fine-tuning parameters allows for a highly efficient and cost-effective solution, tailored to specific business needs.

Cons:

  • AI efficiency is highly dependent on specific use cases, limiting broader application.
  • Costs can escalate rapidly if the AI pipelines are not properly optimised or configured.
  • Managing AI for high-frequency, data-heavy pipelines can be challenging in terms of both performance and cost-efficiency.

With reference to the AI models themselves, users must remember that each request has a price. As we observed in the first real use case, the effectiveness of the gpt-4o-mini model is much greater than that of the gpt-3.5-turbo, and it’s also significantly cheaper on a per-token basis, whilst offering a much larger context capacity (128K tokens vs. 4K) and additional vision capabilities, making it ideal for complex tasks involving large datasets or multimedia. Although gpt-3.5-turbo is faster for simple text-based tasks, it comes at a higher cost per token, making gpt-4o-mini a more cost-effective and versatile option for most use cases.

However, it’s important to note that the use of AI on large datasets should be carefully managed and reserved for specific events, as the computational costs can become significant when applied to extensive datasets on a daily basis.

Cost table per model
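As a back-of-the-envelope aid before scheduling a pipeline daily, a rough token-spend estimate can be made with a sketch like the one below. It is generic: the per-million-token prices are placeholders to be replaced with the provider’s current rates (such as those in the table above), and the token counts per row are guesses.

```python
# Rough cost estimator for a batch AI pipeline. Prices are PLACEHOLDERS:
# always substitute the provider's current per-token rates before relying on this.
def estimate_cost(rows, input_tokens_per_row, output_tokens_per_row,
                  price_in_per_1m, price_out_per_1m):
    """Approximate cost of one batch run; prices are per million tokens."""
    input_cost = rows * input_tokens_per_row / 1_000_000 * price_in_per_1m
    output_cost = rows * output_tokens_per_row / 1_000_000 * price_out_per_1m
    return input_cost + output_cost

# Example: 1,000 articles, ~800 input tokens and ~150 output tokens each,
# with hypothetical rates of $0.15 / $0.60 per million tokens.
daily = estimate_cost(1_000, 800, 150, 0.15, 0.60)
print(f"~${daily:.2f} per run, ~${daily * 30:.2f} per month if run daily")
```

Even a rough figure like this makes it easier to decide whether a prompt belongs in a daily load or should be reserved for specific events.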

Auto Documentation

Now that the OpenAI component has been tested, let’s see how to use the Auto Documentation feature in the same pipeline as before.

All components from the previous pipeline were selected for this use case, but feel free to choose as many as you need. Once your components have been selected, simply right-click on the canvas. Below the standard Add note option, you’ll see Add note using AI, so click on that to let the AI generate helpful documentation for your pipeline.

How Auto Documentation works

Initially, the AI provides a comprehensive explanation of the process. However, by using the Refine button located in the bottom-left corner, we can condense the content to better suit our needs. You also have the option to further elaborate or regenerate the documentation as necessary.

After adding the note with the button in the bottom-right corner, you can customise its appearance by changing its colour or editing the text manually. The notes support all standard markdown syntax, offering a high level of customisation to tailor the documentation to your preferences.

Conclusions

To sum up, Matillion’s new AI capabilities represent a big step in the right direction when it comes to data integration and transformation. The introduction of the elements mentioned in this blog post – AI Prompt Components, Auto Documentation, RAG Components, and Copilot – speaks volumes about Matillion’s commitment to bringing the latest AI capabilities to users, transforming traditional ETL/ELT dataflows into AI pipelines.

Whilst traditional pipelines are designed to handle structured data, AI pipelines excel at processing unstructured data, such as text or PDFs. Nevertheless, even though AI offers immense potential, several challenges must be addressed to fully harness its benefits. For AI models to perform effectively, data must be properly prepared and cleaned, particularly when dealing with large volumes. Without a streamlined ingestion process, AI models may struggle to deliver meaningful results.

In our opinion, Matillion is clearly ahead of the competition.

  • Copilot is a powerful tool, and it will be even more impressive with upcoming improvements to support orchestration pipelines and improve accuracy for complex tasks.
  • Auto Documentation stands out for its ability to effortlessly generate comprehensive documentation, saving time and effort.
  • The AI Prompt component unlocks new possibilities, especially for industries that deal with unstructured data, like marketing or healthcare. Whilst costs may be a factor, it is best leveraged strategically or as a fallback when pipeline execution faces challenges. For businesses with limited programming expertise, this feature simplifies complex tasks and streamlines operations, especially with tedious workflows.
  • Finally, RAG components provide a powerful way to integrate external data into LLM-driven processes, significantly boosting performance. As with the AI Prompt component, it’s essential to factor in the associated costs when planning their use.

Overall, the addition of these AI-driven functionalities enhances Matillion’s intuitiveness and capabilities. When the right use cases arise, these tools offer a high degree of flexibility and efficiency.

AI pipelines are undoubtedly revolutionising the way engineers interact with data. Instead of relying on SQL, Python, or other programming languages, users can leverage natural language queries, lowering the technical barrier and making data interaction more efficient. Given the challenges of data ingestion and database management, one might wonder: will AI eventually overcome these obstacles and become reliable enough to fundamentally reshape the role of data engineers?

Here at ClearPeaks, our expert team is ready to guide you in harnessing the full potential of Matillion’s latest AI-powered features. Connect with us today to unlock the future of AI-driven pipelines—building smarter, faster, and with the latest technology at your fingertips. Let’s shape your data strategy for success!