Metadata-Driven Insights: Enabling Smart Operations in Data Mesh



An introduction to the SmartOps system of the Tchibo data platform. SmartOps is a core component of the platform, designed to drive intelligent operations through metadata-driven insights. By integrating governance, observability, and automation across data domains, it enables scalable, efficient, and responsive data management in a modern data mesh architecture.

The amount of data is growing, and so are the technological structures that process it. Infrastructures and processes grow into mighty data platforms hosted in cloud environments. To keep the upper hand in these systems, challenges in transparency, coordination, and governance need to be solved.

Our data platform at Tchibo has existed for five years and is constantly evolving, so these issues are in our focus as well. We have developed a key component that helps us stay on top despite increasing complexity and provides optimal support for all stakeholders involved.

Our technological and organizational approach is called SmartOps (Smart Operations). This article presents ideas and visions for more transparency through metadata collection and visualization in the data mesh. The system was implemented in the Tchibo data platform in 2024 with the professional help of synvert Datadrivers, and Wolfgang Wangerin in particular.

Photo by Robert Wiedemann on Unsplash

Glossary — terms defined for a common understanding

Data Mesh — a sociotechnical approach to building a decentralized data architecture by leveraging a domain-oriented, self-serve design.

Data Service — a manifested data domain as an independent node in the mesh

Mesh Board — an architectural expert group that governs the data mesh

Data Service Profile — the single point of entry for exploring a Data Service's documentation

BDAP — Big Data Analytics Platform

Our initial situation

At Tchibo, the famous German coffee and lifestyle company, we work in a Google Cloud environment. More than 50 people work on the well-caffeinated platform for data analytics, and the team is growing. Following the data mesh principle, we work in thematically or technologically separate data services that are loosely organized and monitored by a mesh board.

The BDAP (Big Data Analytics Platform) hosts a huge number of projects, where internal and external colleagues produce a vast amount of knowledge. To give an estimate of the platform's size: there are currently ~50 Data Services with over 500 code repositories, 70,000 database tables, and 800 TB of BigQuery data. The platform has undergone a couple of changes over the last five years, but these metrics have grown quickly. The projects themselves deliver data in different ways, starting with reports and aggregations, but also moving strongly into machine learning projects and recommendation systems. At such a scale, new challenges arise concerning transparency.

Figure 1: This is what AI imagines a Data Mesh looks like

The problem of transparency

A vibrant data platform brings together not only a lot of data, but also people with different knowledge, skills, and responsibilities. Transparency does not mean that everyone is aware of everything, but that each person has access to the relevant and crucial information. Why is this a particular challenge if your data platform follows the data mesh architecture?

Invisible nodes and overlaps

The data mesh at Tchibo is formed by many data services. Due to the decentralized nature of data services, we gain speed and autonomy but also risk a lack of transparency in developments and decisions. Has a certain problem already been solved in another service? Who is consuming my data? While a data catalog could (mostly) answer what the data lineage looks like and what the implications of changes are, e.g. in table structures, the question of the users and interrelationships of used services cannot be answered on the platform side.

This applies regardless of whether it is a shared CI/CD template (or component) that needs to be changed, or a new cloud-native service that needs to go through the same set-up phase in different teams. The information is often gathered by asking around in communities of practice or Slack channels, with no guaranteed success. It is precisely these invisible nodes and overlaps that could make work more effective and efficient.

Growth and changes in personnel

Data and artificial intelligence are still investment topics, so it is not surprising that the teams tend to grow. As data scientists often have a background in statistics rather than software architecture, cloud skills, and general coding best practices, personal knowledge in those fields varies greatly. Rapid onboarding of new employees is therefore tough without standards and reference solutions. However, very few standards can be quickly checked for adherence across the entire platform. The use of customer-managed encryption for tables is one example. To ensure that a standard is also effective for new members of the data platform, it should not only be recorded in the internal wiki (nothing is older than yesterday's wiki entry) but also continuously reviewed to ensure rapid feedback, especially for new people on board.

Many tools, many possibilities

The data-driven orientation of business processes not only leads to many tools that need to be integrated into the data platform to provide new data; internal systems also need to be linked so that the processed data output can work actively, e.g. in the form of forecasts forwarded to logistics planning systems (as in our demand forecasting services). This increases the complexity of the individual data service. When a new person starts to work on a service, or covers it as a vacation substitute, familiarizing themselves with all the processes, data pools, tools, and cloud services behind it is time-consuming. In the best case, everything is documented, but in reality, there is a lot of manual browsing.

Of course, all of these problems can be solved today, but a lot of manual work lies behind that. With DevOps in mind, we at Tchibo wanted to focus on automation and continuity instead. Incidentally, we also speak the language that everyone on the platform understands: data tables and visualizations.

From PoC to vision

The idea of using the power of metadata to provide more transparency in our Big Data Analytics Platform (BDAP) started from the fact that we wanted to review unwritten laws. Assumptions are often made to simplify things like monitoring and deployment, e.g. that code is versioned via Git and that productive data products have at least a "dev" and a "live" stage. When supporting the teams, discrepancies were occasionally detected. For a realistic assessment of where we needed to adjust, we wanted to take a script-driven and data-driven approach. Once we had gained a benefit from this approach, we wanted to extend it to various technologies and build it into a small framework. SmartOps as such was born.

At the beginning of the actual project, we identified three perspectives to be served by the collected metadata. Firstly, the unwritten laws should be converted into written, declared standards and continuously reviewed (platform standards). Secondly, for the individual data services, it should be clear from the outside what exactly happens in each project, i.e. the technologies used as well as peculiarities such as the processing of personal data should be recorded (data service profiles). Thirdly, if certain technologies are to be replaced or security risks are to be checked, a technical view helps determine which data services are affected by a problem or new requirement (technology overview).

Figure 2: schematic view of the web interface of the views

To better understand the purpose of the views, here are a few specific questions they answer:

Data Service Profile

  • Who is responsible for the service?
  • Which cloud services are used in the project?

Platform Standards

  • Which data service projects fulfill the guideline for encryption of BigQuery tables, and to what degree?
  • Why was a standard introduced, and when?

Technology Overview

  • Which data service projects still use first-generation Cloud Run services?
  • Which Airflow DAGs use operators that are marked as deprecated?
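Once the collected metadata sits in a queryable store, questions like these reduce to simple filters. The following minimal sketch illustrates the idea with the technology-overview question about first-generation Cloud Run services; the record fields (`data_service`, `generation`) are illustrative assumptions, not the actual SmartOps schema.

```python
from dataclasses import dataclass


@dataclass
class CloudRunRecord:
    """One collected metadata record for a Cloud Run service (hypothetical shape)."""
    name: str
    data_service: str
    generation: int  # 1 = first-generation execution environment


records = [
    CloudRunRecord("ingest-api", "sales", 1),
    CloudRunRecord("forecast-api", "demand", 2),
]


def first_generation_services(records):
    """Answer: which data services still use first-generation Cloud Run?"""
    return sorted({r.data_service for r in records if r.generation == 1})


print(first_generation_services(records))  # ['sales']
```

In practice the same filter would be a one-line SQL query against the metadata store; the point is that the question becomes answerable by machine instead of by asking around.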

How it works

For SmartOps, we believe in a comprehensive approach that goes beyond mere technical implementation in the Google Cloud. Our strategy encompasses organizational measures and processes to ensure that problems are addressed holistically.

Organizational Level

To facilitate accessibility and usability, key views such as the Data Service Profile, Platform Standards, and Technology Overview are made available as web interfaces in the internal network for their respective target groups. These resources are actively integrated into the onboarding process and regular exchange groups to ensure they are utilized effectively.

The platform standards, in particular, have a strong community aspect. Tasks are distributed among the responsible Data Service Owners, and each new standard is therefore introduced and explained within the Community of Practice (CoP) of our data platform. After the presentation, the development and fulfillment of these standards are closely monitored in further community meetings, with visualizations tracking each data service's conformity.

After a standard is presented, flagship projects have the opportunity to engage with the standard-setting board to discuss deviations, which may be approved as exceptions for their innovative projects. Exceptions can be marked with labels, which then do not affect the total score of the data service.
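The scoring idea can be sketched as follows: resources checked against a standard count toward a service's score unless they carry an approved exception label. The field names and the label value `exception-approved` are assumptions for illustration, not the real SmartOps schema.

```python
def compliance_score(resources):
    """Fraction of scored resources that fulfill the standard.

    Resources labeled as approved exceptions are excluded from the total,
    so an exception does not drag down the data service's score.
    """
    scored = [r for r in resources if "exception-approved" not in r.get("labels", [])]
    if not scored:
        return 1.0  # nothing to score counts as fully compliant
    compliant = sum(1 for r in scored if r["compliant"])
    return compliant / len(scored)


tables = [
    {"name": "orders", "compliant": True, "labels": []},
    {"name": "scratch", "compliant": False, "labels": ["exception-approved"]},
    {"name": "events", "compliant": False, "labels": []},
]
print(compliance_score(tables))  # 0.5
```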

However, it is not only the one-off path to compliance that is interesting to track; future deterioration can also be traced through historicization. Notification chains can be started, especially for security-relevant standards. For example, the reasons for a delay in implementation can be discussed in team feedback, and in the event of non-compliance, further escalation steps can be taken along the responsibility hierarchy.

By embedding these practices into our organizational framework, we ensure that our approach to problem-solving is both comprehensive and collaborative, leveraging the collective expertise of our community.

Technical Implementation

Technically, a SmartOps SDK takes over the collection and storage of the metadata, and the three views presented use this SDK to pull the data for their analyses. Roughly speaking, the metadata flow is as follows.

Figure 3: metadata flow within the SmartOps Framework

Metadata is collected from multiple sources. For each source, we provide a command line tool that can be run by any orchestration framework. The scope of each tool is to connect, collect the metadata, and replace or append it in the data store. Current metadata sources are:

  • Google Cloud projects (used for discovery of data services)
  • GitLab repositories
  • BigQuery resources
  • Cloud Run services
  • Artifact Registry contents
  • Steampipe results
  • Pub/Sub subscriptions
  • Cloud Composer DAGs and operators
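To make the "one command line tool per source" idea concrete, here is a minimal sketch of what such a collector could look like: connect, collect, replace or append in the store. The module layout, flag names, and the stubbed `collect_bigquery_metadata` function are hypothetical; a real collector would call the Google Cloud client libraries and write to the shared metadata store instead of stdout.

```python
import argparse
import json
import sys


def collect_bigquery_metadata(project: str) -> list:
    """Stub for a source-specific collector; the real tool would query
    the BigQuery API here and return one record per resource."""
    return [{"project": project, "dataset": "sales", "table": "orders"}]


def main(argv=None):
    parser = argparse.ArgumentParser(description="SmartOps metadata collector (sketch)")
    parser.add_argument("--project", required=True, help="Google Cloud project to scan")
    parser.add_argument("--mode", choices=["replace", "append"], default="replace")
    args = parser.parse_args(argv)

    records = collect_bigquery_metadata(args.project)
    # In the real setup this would go through the SDK's storage layer.
    json.dump({"mode": args.mode, "records": records}, sys.stdout)


if __name__ == "__main__":
    main(["--project", "demo-project", "--mode", "replace"])
```

Because every collector shares this shape, any orchestration framework that can run a container and pass flags can schedule them.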

The metadata is then assigned to the individual data services and their stages (dev, live). In the case of shared services or GitLab repositories, structures and labels are designed in such a way that assignment remains possible. This keeps accountability visible.

Technology selection

We access the metadata via APIs or the corresponding Python clients. Since Python is also the language of data in our data platform, onboarding new contributors is low effort. To retrieve the metadata, these Python clients are structured via our SmartOps SDK and ultimately provided as command line interfaces.

The command line tools that gather the data are released in a Docker image, run with Cloud Run Jobs, and orchestrated with a Google Workflow, a robust, cost-effective, and simple setup.

As the data store is the same for all sources but may be replaced over time, an abstraction layer has been placed in front of the storage. We moved from Firestore to BigQuery: at the time, Firestore offered us a fast backend, but the maximum size of its objects quickly became an obstacle. We also wanted to improve the analysis options (preferably with simple SQL queries) in addition to the predefined visualizations of the three views. Another move, to AlloyDB, is in the works for a speed-up, but will come with some additional financial costs.
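A storage abstraction like the one described can be sketched as a small interface that collectors write through, so the backend (Firestore, BigQuery, or later AlloyDB) can be swapped without touching the collectors. The class and method names here are illustrative, not the actual SDK API; the in-memory backend stands in for a real one.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Backend-agnostic interface for the metadata store (sketch)."""

    @abstractmethod
    def replace(self, source: str, records: list) -> None:
        """Replace all records for one metadata source."""

    @abstractmethod
    def query(self, source: str) -> list:
        """Return all records for one metadata source."""


class InMemoryStore(MetadataStore):
    """Stand-in backend for tests; a BigQuery-backed implementation would
    use load jobs and SQL behind the same interface."""

    def __init__(self):
        self._data = {}

    def replace(self, source, records):
        self._data[source] = list(records)

    def query(self, source):
        return self._data.get(source, [])


store = InMemoryStore()
store.replace("cloud_run", [{"name": "ingest-api"}])
print(store.query("cloud_run"))  # [{'name': 'ingest-api'}]
```

The migration from Firestore to BigQuery then only required a new implementation of this interface, not changes in every collector or view.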

The data service profile, the standards view, and the technology view use the SmartOps SDK to pull the data via the abstraction layer. The data service profile and the tech view are built as Flask applications. As the standards view requires quite some text to explain the standards, Sphinx (Read the Docs) is used to create its web interface, since the explanatory content can be written in Markdown and the search functionality can be used. For hosting them in our restricted network, App Engine was selected.

In the current setup, with daily data updates and a community just starting to use the system, all components together cost only around €15 per month (not including BigQuery slot usage).

Photo by Dawid Zaw­iła on Unsplash

Success Stories

Let’s explore three scenarios in which our approach proved to enable faster problem-solving.

Blue-Green Migration of Cloud Composers

When we upgrade the Airflow version in our Cloud Composer (the hosted Airflow service in the Google Cloud) to benefit from the latest features, we prepare a new system with the new Airflow version, due to possible incompatibility of the old workflows with the new system. Our CI/CD pipeline offers easy transfer from blue to green or the other way around, as well as parallel deployment to both instances. The workflows are then tested and migrated one by one to the new system without any interruption of productive processes.

But this transfer of approx. 200 DAGs, spread over multiple teams, can grow from a task into a demanding project. Very often, the questions arise: how far along are we? Who is still on the old system? Which data service do these DAGs belong to? The first couple of upgrades, major version changes included, took half a year. Now, with clear visibility, expediting is an easy task. Our SmartOps-assisted transfer took only about a month, with significantly less project management. Reminders could be more specific, and people could track the progress on their own.
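With DAG metadata collected per environment, the "how far along are we?" question becomes a trivial aggregation. This sketch assumes each DAG record carries its data service and the Composer environment it is deployed to; the field names and environment labels are hypothetical.

```python
from collections import defaultdict


def migration_progress(dags, old_env="composer-blue"):
    """Per data service, count DAGs still deployed on the old environment."""
    remaining = defaultdict(int)
    for dag in dags:
        if dag["environment"] == old_env:
            remaining[dag["data_service"]] += 1
    return dict(remaining)


dags = [
    {"dag_id": "load_sales", "data_service": "sales", "environment": "composer-blue"},
    {"dag_id": "forecast", "data_service": "demand", "environment": "composer-green"},
]
print(migration_progress(dags))  # {'sales': 1}
```

A report like this, refreshed daily, is what lets teams track their own progress and makes reminders specific instead of broadcast.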

Cloud Run hardening

The Platform Infrastructure Team realised that there could be a potential security risk with a specific type of Cloud Run configuration. With the SmartOps technology overview, we were able to address and mitigate it on the same day: by simply querying the gathered metadata, we knew all affected Cloud Run services across approx. 100 Google Cloud projects. Owners were informed immediately, and changes were made on day one.

Monitoring BigQuery Table encryption

In addition to the special encryption of personal data, which has been audited since the beginning of the platform, all other business data should also be encrypted with an extended customer-managed encryption key (CMEK). By default, this encryption is not enforced. Enforcing it would also restrict the teams (e.g. no wildcard queries in BigQuery) when working with test data, which we don't want to do at the moment. However, we want to know how our best practice is accepted and implemented in productive processes. SmartOps allows us to identify projects where encryption has not been implemented as we expect (but where it should be), and we can also see when this changes. Aggregating the metadata of all projects helps us monitor this across 75,000 tables.
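The check itself can be sketched over collected table metadata: BigQuery tables encrypted with CMEK carry a KMS key reference in their encryption configuration, while tables without one fall back to Google-managed encryption. The record shape and the `test-data` exemption label below mirror that idea but are assumptions, not the exact SmartOps schema.

```python
def tables_missing_cmek(tables, exempt_labels=("test-data",)):
    """Return names of tables that should use CMEK but do not.

    Tables carrying an exemption label (e.g. test data that we deliberately
    leave unenforced) are skipped.
    """
    return [
        t["name"]
        for t in tables
        if t.get("kms_key_name") is None
        and not set(t.get("labels", [])) & set(exempt_labels)
    ]


tables = [
    {"name": "orders", "kms_key_name": "projects/p/locations/eu/keyRings/r/cryptoKeys/k"},
    {"name": "scratch", "kms_key_name": None, "labels": ["test-data"]},
    {"name": "customers", "kms_key_name": None},
]
print(tables_missing_cmek(tables))  # ['customers']
```

Run daily against the aggregated metadata and historicized, this is also what reveals when a previously compliant project deteriorates.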

Conclusion

A good seven months have passed since the idea of SmartOps took shape. In the course of development and the integration of new data sources, we have been able to reveal some weaknesses and quickly rectify them. The three different views are gratefully received by the community, and the transparency regarding unwritten laws is very welcome. In particular, knowledge gaps become visible here and can be explained and directly addressed in smaller sessions. The continuous monitoring and automated maintenance of data service profiles ensure that the status is always up to date, which is particularly appreciated at the management level.

Of course, the project has not come to an end, and there are many ideas as to where it can continue to grow. For example, Terraform states can be used to identify resources that have been created as infrastructure as code. This links the individual metadata in a way that allows us to understand how close we are to a complete, recoverable code representation of the platform.

With every new standard, the acceptance and benefits within the community also increase. Questions that previously had to be resolved manually can now be answered automatically, continuously, and, at best, even be visualized. And that is our success!