Metadata-Driven Insights: Enabling Smart Operations in Data Mesh

An introduction to the SmartOps system of the Tchibo data platform. SmartOps is a core component designed to drive intelligent operations through metadata-driven insights. By integrating governance, observability, and automation across data domains, it enables scalable, efficient, and responsive data management in a modern data mesh architecture.
The amount of data keeps growing, and so do the technological structures that process it. Infrastructures and processes evolve into powerful data platforms hosted in cloud environments. To keep the upper hand in these systems, challenges around transparency, coordination, and governance need to be solved.
Our data platform at Tchibo has existed for five years and is constantly evolving, so these issues are very much in our focus. We have developed a key component that helps us stay on top of the increasing complexity and provide optimal support for all stakeholders involved.
Our technological and organizational approach is called SmartOps (Smart Operations). This article presents ideas and visions for more transparency through metadata collection and visualization in the data mesh. The system was implemented in the Tchibo data platform in 2024 with the professional help of synvert Datadrivers, and of Wolfgang Wangerin in particular.
Glossary — words to be defined for common understanding
Data Mesh — a sociotechnical approach to building a decentralized data architecture by leveraging a domain-oriented, self-serve design.
Data Service — a data domain manifested as an independent node in the mesh
Mesh Board — architectural expert group that governs the data mesh
Data Service Profile — single point of entry for exploring a Data Service's documentation
BDAP — Big Data Analytics Platform
Our initial situation
Tchibo, the well-known German coffee and lifestyle company, runs its data environment in the Google Cloud. More than 50 people are working on the well-caffeinated platform for data analytics, and the team is growing. Following the data mesh principle, we work in thematically or technologically separate data services that are loosely organized and monitored by a mesh board.
The BDAP (Big Data Analytics Platform) hosts a huge number of projects. Internal and external colleagues produce a vast amount of knowledge here. To give an estimate of the platform's size, there are currently ~50 Data Services with over 500 code repositories, 70,000 database tables, and 800 TB of BigQuery data. The platform underwent a couple of changes over the last five years, but these metrics have grown quickly. The projects themselves deliver data in different ways, starting with reports and aggregations, but also moving strongly into machine learning projects and recommendation systems. At this scale, new challenges arise concerning transparency.
The problem of transparency
A vibrant data platform not only brings together a lot of data, but also people with different knowledge, skills, and responsibilities. The goal of transparency is not that everyone is aware of everything, but that each person has access to the information that is relevant and crucial for them. Why is this a particular challenge if your data platform is oriented towards the data mesh architecture?
Invisible nodes and overlaps
The data mesh at Tchibo is formed by many data services. Due to the decentralized nature of data services, we gain speed and autonomy but also risk a lack of transparency in developments and decisions. Has a certain problem already been solved in another service? Who is consuming my data? While a data catalog can largely answer what the data lineage looks like and what the implications of changes are, e.g. in table structures, the question of who uses which services and how they interrelate cannot be answered on the platform side.
This applies regardless of whether it is a shared CI/CD template (or component) that needs to be changed, or a new cloud-native service that has to go through the same set-up phase in different teams. The information is often gathered by asking around in communities of practice or Slack channels, with no guaranteed success. Making precisely these invisible nodes and overlaps visible is what can make work more effective and efficient.
Growth and changes in personnel
Data and Artificial Intelligence are still investment topics, so it is not surprising that the teams tend to grow. As data scientists often have a background in statistics rather than software architecture, cloud skills, and general coding best practices, individual knowledge in those fields varies greatly. Rapid onboarding of new employees is therefore tough without standards and reference solutions. However, very few standards can be checked quickly across the entire platform to see whether they are being adhered to. The use of customer-managed encryption for tables is one example. To ensure that a standard is also effective for new members of the data platform, it should not only be recorded in the internal wiki (nothing is older than yesterday's wiki entry) but also continuously reviewed, so that feedback is fast, especially for new people on board.
Many tools, many possibilities
The data-driven orientation of business processes not only leads to many tools that need to be integrated into the data platform to provide new data; internal systems also need to be linked so that the processed data output can work actively, for example as forecasts forwarded to logistics planning systems (as in our demand forecasting services). This increases the complexity of the individual data service. When a new person starts to work on a service, or covers it as a vacation substitute, familiarizing themselves with all the processes, data pools, tools, and cloud services behind it is time-consuming. In the best-case scenario, everything is documented, but in reality there is a lot of manual browsing.
Of course, all of these problems can be solved today, but there is a lot of manual work behind doing so. With DevOps in mind, we at Tchibo wanted to focus on automation and continuity instead. Incidentally, we also speak the language that everyone on the platform understands: data tables and visualizations.
From PoC to vision
The idea of using the power of metadata to provide more transparency in our Big Data Analytics Platform (BDAP) started from the fact that we wanted to review unwritten laws. Assumptions are often made to simplify things like monitoring and deployment, e.g. that code is versioned via Git and that productive data products have at least a "dev" and a "live" stage. When supporting the teams, discrepancies were occasionally detected. For a realistic assessment of where we needed to adjust, we wanted to take a script-driven and data-driven approach. Once this approach had proven its benefit, we wanted to extend it to various technologies and build it into a small framework. Thus, SmartOps was born.
At the beginning of the actual project, we identified the three perspectives to be served by the collected metadata. Firstly, the unwritten laws should be converted into written, declared standards and continuously reviewed (platform standards). For the individual data services, it should also be clear from the outside what exactly happens in the project, i.e. the technologies used as well as peculiarities such as the processing of personal data should be recorded (data service profiles). Finally, if certain technologies are to be replaced or security risks are to be checked, it is worth taking a technical view to determine which data services are affected by a problem or new requirement (technology overview).
To better understand the purpose of the views, here are a few specific questions they answer:
Data Service Profile
- Who is responsible for the service?
- Which cloud services are used in the project?
Platform Standards
- Which data service projects fulfill the guideline for encryption of BigQuery tables and to what degree?
- Why was a standard introduced and when?
Technology Overview
- Which data service projects still use first-generation Cloud Run services?
- Which Airflow DAGs use operators that are marked as deprecated?
How it works
For SmartOps, we believe in a comprehensive approach that goes beyond mere technical implementation in the Google Cloud. Our strategy encompasses organizational measures and processes to ensure that problems are addressed holistically.
Organizational Level
To facilitate accessibility and usability, key views such as the Data Service Profile, Platform Standards, and Technology Overview are made available as web interfaces in the internal network for respective target groups. These resources are actively integrated into the onboarding process and regular exchange groups to ensure they are utilized effectively.
The platform standards, in particular, have a strong community aspect. Tasks are distributed among the responsible Data Service Owners, and each new standard is therefore introduced and explained within the Community of Practice (CoP) of our data platform. After the presentation, which includes visualizations to monitor each data service's conformity, the development and fulfillment of these standards are closely tracked in further community meetings.
Once a standard is presented, flagship projects have the opportunity to engage with the standard-setting board and discuss deviations, which may be approved as exceptions for their innovative projects. Exceptions can be marked with labels, which then do not affect the total score of the data service.
However, it is not only the one-off path to compliance that is interesting to track; thanks to historicization, future deterioration can also be traced. Notification chains can be triggered, especially for security-relevant standards. For example, the reasons for a delayed implementation can be discussed in team feedback, and further escalation steps along the responsibility hierarchy can be taken in the event of non-compliance.
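To make the scoring idea concrete, here is a minimal sketch of how exception-labeled resources could be excluded from a data service's score. The class, function, and label names (such as "smartops-exception") are illustrative assumptions, not the actual SmartOps conventions.

```python
# Hypothetical sketch: computing a standard's fulfillment score while
# ignoring resources that carry an approved exception label.
from dataclasses import dataclass, field


@dataclass
class ResourceCheck:
    resource: str                      # e.g. a BigQuery table or Cloud Run service
    compliant: bool                    # result of the automated standard check
    labels: dict = field(default_factory=dict)


def fulfillment_score(checks: list[ResourceCheck],
                      exception_label: str = "smartops-exception") -> float:
    """Share of compliant resources, excluding approved exceptions."""
    relevant = [c for c in checks if exception_label not in c.labels]
    if not relevant:
        return 1.0  # nothing left to check: the standard is trivially fulfilled
    return sum(c.compliant for c in relevant) / len(relevant)
```

Storing such scores per day is what enables the historicization and notification chains described above.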
By embedding these practices into our organizational framework, we ensure that our approach to problem-solving is both comprehensive and collaborative, leveraging the collective expertise of our community.
Technical Implementation
Technically, a SmartOps SDK takes over the collection and storage of the metadata, and the three views presented above use this SDK to pull the data for their analyses. Roughly speaking, the metadata flow is as follows.
Metadata is collected from multiple sources. For each source, we provide a command line tool that can be run by any orchestration framework. The scope of such a tool is to connect to the source, collect the metadata, and replace or append it in the data store (a minimal collector sketch follows the list below). Current metadata sources are:
- Google Cloud projects (used for discovery of data services)
- GitLab repositories
- BigQuery resources
- Cloud Run services
- Artifact Registry contents
- Steampipe results
- Pub/Sub subscriptions
- Cloud Composer DAGs and operators
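As an illustration of what one of these collectors looks like, here is a minimal sketch of a per-source command line tool, using the Cloud Run source as an example. The function names and output handling are assumptions; the actual SmartOps SDK interfaces are not shown here.

```python
# Illustrative sketch of a per-source collector CLI
# (names and structure are assumptions, not the actual SmartOps SDK).
import argparse
import datetime

from google.cloud import run_v2  # official Cloud Run admin client


def collect_cloud_run_services(project: str, region: str) -> list[dict]:
    """Collect basic metadata for all Cloud Run services in one project/region."""
    client = run_v2.ServicesClient()
    parent = f"projects/{project}/locations/{region}"
    snapshot_time = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        {
            "source": "cloud_run",
            "project": project,
            "service": service.name,
            "collected_at": snapshot_time,
        }
        for service in client.list_services(parent=parent)
    ]


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Cloud Run metadata")
    parser.add_argument("--project", required=True)
    parser.add_argument("--region", default="europe-west1")
    args = parser.parse_args()

    records = collect_cloud_run_services(args.project, args.region)
    # In the real setup, the SmartOps SDK would replace/append these records
    # in the central metadata store; here we just print them.
    for record in records:
        print(record)


if __name__ == "__main__":
    main()
```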
The metadata is then assigned to the individual data services and their stages (dev, live). For shared services or GitLab repositories, structures and labels are designed in such a way that this assignment is possible. This keeps accountability visible.
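Conceptually, the assignment can be thought of as reading labels from a project or repository and mapping them to a data service and stage. The label keys below ("data-service", "stage") are assumptions for illustration, not the actual conventions.

```python
# Simplified sketch of label-based assignment (label keys are assumptions).
def assign_to_data_service(labels: dict[str, str]) -> tuple[str, str]:
    """Map resource labels to a (data service, stage) pair."""
    data_service = labels.get("data-service", "unassigned")
    stage = labels.get("stage", "unknown")
    return data_service, stage


# Example: a Google Cloud project labeled for a demand forecasting service
print(assign_to_data_service({"data-service": "demand-forecast", "stage": "live"}))
# -> ('demand-forecast', 'live')
```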
Technology selection
We access the metadata via APIs or the corresponding Python clients. Since Python is also the language of data on our platform, onboarding new contributors takes little effort. To retrieve the metadata, these Python clients are wrapped by our SmartOps SDK and ultimately provided as command line interfaces.
The command line tools that gather the data are released in a Docker image, run with Cloud Run Jobs, and orchestrated with a Google Workflow as a robust, cost-effective, and simple setup.
As the data store is the same for all sources but may be replaced over time, an abstraction layer has been placed in front of the storage. We moved from Firestore to BigQuery: at the time, Firestore offered us a fast backend, but the maximum document size quickly became an obstacle, and we also wanted better analysis options (preferably simple SQL queries) in addition to the predefined visualizations of the three views. Another move to AlloyDB is in the works for a speed-up, but will come with some additional financial costs.
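A minimal sketch of such an abstraction layer could define one write interface with interchangeable backends. The class and method names are assumptions; a Firestore or AlloyDB backend would implement the same interface.

```python
# Sketch of a storage abstraction layer with interchangeable backends
# (class and method names are assumptions, not the actual SDK).
from abc import ABC, abstractmethod

from google.cloud import bigquery


class MetadataStore(ABC):
    @abstractmethod
    def replace(self, source: str, records: list[dict]) -> None:
        """Replace all records of one metadata source."""


class BigQueryStore(MetadataStore):
    def __init__(self, dataset: str):
        self.client = bigquery.Client()
        self.dataset = dataset

    def replace(self, source: str, records: list[dict]) -> None:
        table_id = f"{self.client.project}.{self.dataset}.{source}"
        # WRITE_TRUNCATE overwrites the previous snapshot for this source.
        job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
        self.client.load_table_from_json(records, table_id, job_config=job_config).result()
```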
The data service profile, the standards view, and the technology view use the SmartOps SDK to pull the data via the abstraction layer. The data service profile and the technology view are built as Flask applications. As the standards view requires quite some text to explain the standards, Sphinx (Read the Docs) is used to create its web interface, since the explanatory content can be written as Markdown and the search functionality can be used. For hosting all of them in our restricted network, App Engine was selected.
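In essence, each Flask view is a thin layer over the SDK. The following sketch is illustrative only: the route and the load_profile function stand in for the actual SmartOps SDK call through the storage abstraction.

```python
# Minimal Flask sketch of a data service profile view
# (load_profile is a placeholder for the actual SmartOps SDK call).
from flask import Flask, jsonify

app = Flask(__name__)


def load_profile(service_name: str) -> dict:
    # Placeholder for a query through the SmartOps storage abstraction.
    return {"name": service_name, "owner": "unknown", "cloud_services": []}


@app.route("/profile/<service_name>")
def profile(service_name: str):
    return jsonify(load_profile(service_name))


if __name__ == "__main__":
    app.run(debug=True)
```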
In the current setup, with daily data updates and the community just starting to use the views, all components together cost only around €15 per month (not including BigQuery slot usage).
Success Stories
Let’s explore three scenarios in which our approach demonstrably enabled faster problem-solving.
Blue-Green Migration of Cloud Composers
When we upgrade the Airflow version in our Cloud Composer (the hosted Airflow service in the Google Cloud) to benefit from the latest features, we prepare a new system with the new Airflow version, because the old workflows may be incompatible with it. Our CI/CD pipeline offers easy transfer from blue to green or the other way around, as well as parallel deployment to both instances. The workflows are then tested and migrated one by one to the new system without any interruption of productive processes.
But this transfer of approx. 200 DAGs, spread over multiple teams, can grow from a task into a demanding project. Very often the questions arise: how far along are we? Who is still on the old system? Which data service do these DAGs belong to? The first couple of upgrades, major version changes included, took half a year. Now, with clear visibility, expediting is an easy task: our SmartOps-assisted transfer took only about a month, with significantly less project management. Reminders could be more specific, and people could track the progress on their own.
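The progress tracking boils down to a simple aggregation over the collected Composer metadata. The table and column names below are assumptions about the metadata model, not the actual schema.

```python
# Sketch: tracking Composer migration progress from the collected DAG metadata
# (table and column names are assumptions about the metadata model).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT data_service, composer_environment, COUNT(*) AS dag_count
    FROM `smartops_metadata.composer_dags`
    GROUP BY data_service, composer_environment
    ORDER BY data_service
"""
for row in client.query(query).result():
    print(row.data_service, row.composer_environment, row.dag_count)
```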
Cloud Run hardening
The Platform Infrastructure Team realized that a specific type of Cloud Run configuration posed a potential security risk. With the SmartOps technology overview, we were able to address and mitigate it on the same day: by simply querying the gathered metadata, we knew all affected Cloud Run services across approx. 100 Google Cloud projects. Owners were informed immediately, and changes were made on day one.
Monitoring BigQuery Table encryption
In addition to the special encryption of personal data, which has been audited since the beginning of the platform, all other business data should also be encrypted with a customer-managed encryption key (CMEK). By default, this encryption is not enforced. Enforcing it would also restrict the teams (e.g. no wildcard queries in BigQuery) even for test data, which we don’t want to do at the moment. However, we want to know how well our best practice is accepted and implemented in productive processes. SmartOps allows us to identify projects where encryption has not been implemented as we expect (but where it should be), and we can also see when this changes. Aggregating the metadata of all projects helps us to monitor this across 75,000 tables.
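The underlying check is straightforward, because the BigQuery client exposes a table's encryption configuration. The sketch below is illustrative only (the project name is hypothetical, and the real collector writes to the metadata store rather than printing).

```python
# Sketch: recording CMEK usage for BigQuery tables in one project
# (illustrative only; "example-data-service-project" is a hypothetical project).
from google.cloud import bigquery

client = bigquery.Client(project="example-data-service-project")

for dataset in client.list_datasets():
    for table_item in client.list_tables(dataset.dataset_id):
        table = client.get_table(table_item.reference)
        enc = table.encryption_configuration
        kms_key = enc.kms_key_name if enc else None
        print(table.full_table_id, "CMEK" if kms_key else "default encryption")
```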
Conclusion
A good seven months have passed since the idea of SmartOps took shape. In the course of development and the integration of new data sources, we have been able to reveal some weaknesses and quickly rectify them. The three different views are gratefully received by the community, and the transparency regarding unwritten laws is very welcome. In particular, knowledge gaps become visible and can be explained and addressed directly in smaller sessions. The continuous monitoring and automated maintenance of the data service profiles ensures that their status is always up to date, which is particularly appreciated at the management level.
Of course, the project has not come to an end, and there are many ideas as to where it can continue to grow. For example, Terraform states can be used to identify resources that were created as infrastructure as code. This links the individual metadata to its code representation and lets us understand how close we are to a fully recoverable, code-based description of the platform.
With every standard, the acceptance and benefits in the community also increase. Questions that previously had to be resolved manually can now be answered automatically, continuously, and, at best, even visualized. And that is our success!