Is Data Transformation Poorly Understood?



Introducing category theory as a foundation for a new paradigm in data transformation


Data pipelines are made of software. But unlike traditional software, they seem to pose unique challenges when it comes to testing — as suggested in "The challenge of testing data pipelines" (ETL, ELT, etc.). The distinctive burden associated with data pipelines, which the article points to, is the need for data quality checks:

Data Pipelines often have 'tests' built into the data pipeline itself, executed as data is processed, to ensure data quality.

When these tests are not built into the actual data pipeline, detection of data quality issues is simply deferred — e.g. until a business user creates a ticket for resolving a reporting bug. The pursuit of data quality testing is therefore usually regarded as an imposition on top of the more traditional kind — let's call it logic testing.

With roughly 40% of enterprise IT budgets spent on data integration, the challenge of information management is arguably one of the biggest in IT, and thus a major societal issue¹. Managing data quality is a central part of that challenge. For instance, on the order of 80% of data migration projects are said to either fail or exceed their budgets, in no small measure on account of poor data quality. On this view, we might ask whether something is amiss — on a deep level — in the way we do data transformation, and whether data quality testing is simply an inherent aspect of building data pipelines. Is there an alternative view perhaps — a new way of thinking about data transformation — which could obviate this practice?

In this article, I'll introduce an approach to data transformation — data integration, data migration, and so on — based on category theory. According to the U.S. National Institute of Standards and Technology, category theory is a potential mathematical foundation for "next-generation data integration"¹³. It offers a new language for talking about data transformation, introducing crucial concepts such as that of a functor — a concept acutely missing from current thinking¹². At the outset, I'll suggest that we can view this approach as giving rise to a distinct programming paradigm, one which can spare us the costly and often inadequate effort of data quality testing, all while guaranteeing data integrity.

Programming paradigms

To get a sense of the kind of paradigm shift this new way of thinking about data might entail, let's draw inspiration from other computing paradigms. Robert C. Martin identifies three paradigms in the history of computing²: structured programming, object-oriented programming, and functional programming. He defines a paradigm in this context as follows:

Paradigms are ways of programming, relatively unrelated to languages. A paradigm tells you which programming structures to use, and when to use them².

He observes a deep connection that lies at the heart of each paradigm; they each take something away from the programmer. "Each of the paradigms removes capabilities from the programmer. None of them adds new capabilities. Each imposes some kind of extra discipline":

  • Structured programming imposes discipline on direct transfer of control
  • Object-oriented programming imposes discipline on indirect transfer of control
  • Functional programming imposes discipline upon assignment

Scope and constraints

Together, the three aforementioned paradigms remove goto statements, function pointers, and assignment². At first glance, it might seem surprising that we'd gain something so profound through a constraint (were that not the case, these paradigms would hardly have been so widely adopted). But scope and constraints are in fact two sides of the same coin. The capacity (or the scope, or the freedom) that we gain — for instance, the ability to build large and intricate yet robust and extensible applications — rests on certain limitations that we accept — such as no longer being free to be completely unrestrained and cavalier about introducing coupling among software components wherever we please.

To go on a bit of a philosophical detour, let's look to Noam Chomsky, who illuminated the concept playing out in the organic world while discussing the limits of human understanding:

Far from bewailing the existence of mysteries-for-humans, we should be extremely grateful for it. With no limits to growth and development, our cognitive capacities would also have no scope. Similarly, if the genetic endowment imposed no constraints on growth and development of an organism it could become only a shapeless amoeboid creature, reflecting accidents of an unanalyzed environment, each quite unlike the next. Classical aesthetic theory recognized the same relation between scope and limits. Without rules, there can be no genuinely creative activity, even when creative work challenges and revises prevailing rules.

In her enlightening book "The Joy of Abstraction"³, Eugenia Cheng introduces the reader to the "Zero World". In doing so, she also conveys the weightiness of constraints, showing that if you loosen the wrong ones, it can cause worlds to collapse:

[The Zero World] is the world you end up in if you decide to try and declare 1 + 1 = 1, but still want some other usual rules of arithmetic to hold: in that case we could subtract 1 from both sides and get 1 = 0. If that's true then everything will be 0… You can't really do anything in [the Zero World]. Or rather, you can do anything in it and it's all equally valid. It turns out that a world in which everything is equally valid is not very interesting, and is more or less the same as a world in which nothing is possible³.

Imposing discipline on data transformation

Given the above, we might ask: are there specific constraints that could usher in the new paradigm I've been alluding to? At this point I should mention that Robert C. Martin believes there will be no more than the three paradigms we've highlighted — that we've seen them all — the reason being that there is nothing "left to take away"². The evidence seems to bear this out; there have been no new paradigms since 1968².

But what about the ability to scramble the meaning of data when we move it from one database to another? This might not be your prototypical programming task — one reason it might have been overlooked by Robert C. Martin — but wouldn't we want to take that ability away from the programmer if we could? Wouldn't we want to eliminate the need for all that expensive data quality testing and cleaning, and instead "automatically guarantee that data integrity is preserved as it's transformed (migrated, integrated, composed, queried, viewed, etc) throughout the enterprise, so that data and programs that depend on that data need not constantly be re-validated for every particular use [emphasis added]"⁴?

If so, we're essentially looking for a paradigm that respects the integrity of data. What would that look like? In short, we'd first need a way to encode rules that distinguish sensible from non-sensible data — that is, encoded data integrity constraints. And, second, we'd need a way to make use of — indeed, a way that imposes the use of — such encoded "programming structures"² to ensure that we can't transform sensible data into non-sensible data.

Isn't the use of data integrity constraints quite common when working with structured data, though?

Constraints are often built in to database schemas; for example, in SQL, a primary key constraint (stating for example that if two people have the same social security numbers, they must be the same person) can be given when defining the columns of a table. Other example constraints include range constraints (that for example an age be > 21) and join decompositions (that for example state that a table is the join of two other tables)⁴.

So yes, there are ways to specify some aspects of data integrity. But the language of constraints could be richer, and we're still lacking means to transform data while ensuring that we respect the constraints without ex-post-facto data validation.

The authors of [4] bemoan the "failure of tooling" used for data transformation (ETL tooling such as Informatica PowerCenter and IBM DataStage) to support expressive constraints. While a wealth of theoretical work exists "on applying constraints to data transformation, integration, cleaning, schema mapping, and more", there is a complicated "trade-off between how expressive constraint languages are, and how computationally difficult it is to reason about them"⁴. As a result, "in the enterprise, data quality degradation due to loss of integrity during transformation is a systemic problem" which is simply "resistant to current techniques"⁴.

Recently, however, a tool has emerged to address this failing: a tool which embodies this new paradigm, and whose quintessence is a weird and wonderful formalism called category theory.

Getting to know category theory

You might call category theory a branch of mathematics, or, even, a way of thinking about mathematics. Bolder yet, some claim category theory is the "mathematics of mathematics"³.

And in case that's not flexing it enough, how about "a central hub for all of pure mathematics [which] is unmatched in its ability to organize and layer abstractions, to find commonalities between structures of all sorts, and to facilitate communication between different mathematical communities"¹.

Yet it's not just "abstract nonsense", as it's referred to by some. Category theory may have, for too long, "stayed in the realm of mind", but "it is ripe to be brought to hand"¹. In fact, by focusing on its use in the area of data engineering, I'm leaving out many other potential applications.

In this attempt to introduce the subject to the nonprofessional, and being merely an enthusiast myself, I'll have to tiptoe through the field, while leaning heavily on [1] and [3]. That being said, I'll start by — chances are, blasphemously — claiming that category theory is about categories. And any category is made up of objects and relationships³. Visually, a category is like a graph: objects correspond to the nodes, relationships correspond to the edges — the relationships are more commonly referred to as arrows or morphisms.

The objects can be anything — numbers, people, shapes, mathematical structures — anything really³. They could also just be unspecified and abstract. Formally, objects have no properties or internal structure. In a way, they serve no other purpose than to identify the ends of arrows. But it's often helpful to fill them with content when thinking about the relationships.

The relationships — or arrows — could be things like "less than", "is the mother of", or a function between sets³. Or they could be more abstract, departing from the usual notion of a relationship². There can be any number of arrows between two objects a and b. In the case of a relationship such as "less than", there would be one or none. When there are more arrows, each represents a different sense in which the two objects can be related² — e.g. there are 3² = 9 different possible functions from a set a with two elements to a set b with three elements; represented categorically, each function would be a distinct arrow from object a to object b.

Doing category theory is about studying the networks of relationships in categories and finding patterns within them. Hence the notion of an object can be devoid of any internal structure. Rather than focusing on its intrinsic properties, we're focussed on its relationships and the role it plays within its context.

With the following examples from [5], I'll try to motivate the idea that simply looking at structures of relationships may reveal important concepts, thereby hopefully conveying a sense of promise for this kind of approach. Specifically, we'll look at a fairly basic categorical notion — that of a product — and we'll see how it captures some well-known mathematical notions in several different settings (categories). Loosely speaking, we can view the product of two objects (a and b) as "the last thing that goes into both a and b"⁵ — we'll see what that means…

GCD and LCM

The following diagram is an example of a simple category. The objects are numbers, and we've drawn an arrow between two objects if one divides evenly into the other. Notice, I haven't explicitly drawn all arrows for which this relationship exists. For a start, there should be a looping arrow from each object to itself (as every number divides itself evenly). Such arrows are called identities in category theory and essentially say that everything is the same as itself — they're usually just implied in diagrams. Secondly, by composing arrows — putting them in a 'nose-to-tail' configuration — you'll see that indeed all possible relationships have been captured. From 3 to 30, for example, there are two paths (composite arrows): via 6 and via 15.

Now what is the product of two objects a and b — in other words, what is the last number to go into both? If we take 6 and 15 as an example, the candidates would be 1 and 3, as they both divide evenly into 6 and 15. However, since 1 also divides 3, the path from 1 to 6 and 15 is longer, so the last thing to go into both would be 3. Hence,

product(6,15) = 3

How about the product of 6 and 30? In this case, the answer would be 6 — remember that every object also goes into itself. If you keep playing this game, you might notice that the notion of product in this particular category corresponds exactly to the mathematical notion of "greatest common divisor" (gcd).

Now let's imagine we've reversed all of the arrows in the above category. By doing so, we've actually created a new category — known as the dual category. What does the product in this new category correspond to? In this case, the 'last thing to go into both' would correspond to the notion of "least common multiple" (lcm).
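
To make "the last thing that goes into both" concrete, here is a small sketch in Python (my own illustration, not code from [5]). It models a category with at most one arrow between any two objects, which is all these examples need, as a set of objects plus an arrow relation, and finds the product of a and b as the candidate that every other candidate maps into.

# A simple category given by a set of objects and an arrow relation:
# arrows(x, y) == True means there is an arrow from x to y.
# Identity arrows (x -> x) and composites are implied by the relation.

def categorical_product(objects, arrows, a, b):
    # Candidates: objects with an arrow into both a and b.
    candidates = [c for c in objects if arrows(c, a) and arrows(c, b)]
    # The product is the "last" candidate: the one every other candidate maps into.
    for p in candidates:
        if all(arrows(c, p) for c in candidates):
            return p
    return None  # no product exists for this pair

# Divisibility category: an arrow x -> y whenever x divides y evenly.
numbers = [1, 2, 3, 5, 6, 10, 15, 30]
divides = lambda x, y: y % x == 0

print(categorical_product(numbers, divides, 6, 15))   # 3  (the gcd)
print(categorical_product(numbers, divides, 6, 30))   # 6  (the gcd)

# Reversing all arrows gives the dual category; its product is the lcm.
divided_by = lambda x, y: divides(y, x)
print(categorical_product(numbers, divided_by, 6, 15))  # 30  (the lcm)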

Min and Max

Let's try the same thing for the following simple category (again, I've left out composite and identity arrows). What is the last thing to go into two objects here?

I've listed a few examples to suggest the pattern:

product(0,0) = 0
product(0,1) = 0
product(1,2) = 1
product(2,4) = 2
...

In this category, the product corresponds to the "min" function over two numbers. Similarly, if we generate the reverse category (by reversing the arrows), the product embodies the "max" function.
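
The same categorical_product helper from the earlier sketch reproduces this pattern; only the arrow relation changes (again, just my own illustration):

# Arrows now encode the ordering x <= y on a few numbers.
nums = [0, 1, 2, 3, 4]
leq = lambda x, y: x <= y

print(categorical_product(nums, leq, 0, 1))  # 0  (min)
print(categorical_product(nums, leq, 2, 4))  # 2  (min)

# Reversed arrows (the dual category) give max instead.
geq = lambda x, y: x >= y
print(categorical_product(nums, geq, 2, 4))  # 4  (max)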

Intersections and Unions

The following category is one in which the objects are sets. There is a relationship between two sets — drawn on the right — if one set is a subset of the other. In this case, the product defines the intersection of two sets. An example in the diagram is the black criss-crossed area where the blue and the red (discontiguous) sets overlap. It's the last thing to go into both.

Once again, we could reverse the arrows to generate a new category in which the arrows denote superset relationships. The product in that category corresponds to the union of two sets.
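
Once more with the same helper, now over a handful of concrete sets ordered by inclusion (an illustration, not the sets in the diagram):

# Arrows encode the subset relation A ⊆ B.
blue, red = frozenset({1, 2, 3}), frozenset({2, 3, 4})
sets = [frozenset(), frozenset({2}), frozenset({2, 3}), blue, red, blue | red]
subset = lambda a, b: a <= b

print(categorical_product(sets, subset, blue, red))    # frozenset({2, 3})  (intersection)

# Reversed arrows (the superset relation) yield the union.
superset = lambda a, b: a >= b
print(categorical_product(sets, superset, blue, red))  # frozenset({1, 2, 3, 4})  (union)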

And and Or

Finally, let's have a logic example using the following category — usually referred to as Bool — which depicts an ordering relationship: false ≤ true.

The product here corresponds to the logical AND operator — the last thing to go into both is always 'false', unless both product elements are 'true' (which corresponds precisely to the behaviour of the AND operator). Now, instead of reversing arrows (because true ≤ false would be a strange relationship), let's instead look for the least element that is greater than both. And voilà, we find a pattern corresponding to the logical OR operator.
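
And the Bool case, in the same style (my own sketch):

# Bool as a category: an arrow x -> y whenever x <= y, i.e. false <= true.
bools = [False, True]
implies = lambda x, y: (not x) or y

print(categorical_product(bools, implies, True, True))    # True   (AND)
print(categorical_product(bools, implies, True, False))   # False  (AND)

# The dual (least element greater than both) behaves like OR.
implied_by = lambda x, y: implies(y, x)
print(categorical_product(bools, implied_by, True, False))  # True  (OR)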

Structure and preservation

Unlike a mere graph, a category must obey certain rules beyond just looking like one. In a way, that might seem like a small deal. But to hark back to the section on scope and constraints, many times it's the "small doors that open up into large rooms".

So what rules must a category observe? As alluded to in the previous section, there are two basic structural stipulations any category must satisfy:

  1. Identity arrows: every object must be related to itself. This can be a weak condition³.
  2. Composition: given two arrows in a nose-to-tail configuration, we want to be able to combine them into a single arrow³. If there's a sense in which a is related to b, and a sense in which b is related to c, then we can combine those to get a sense in which a is related to c³. For example, if a is the mother of b, and b is the mother of c, there is a sense in which a is the grandmother of c. To give a counter-example: if document a references document b, and document b references document c, we can't compose these two relationships, as it's not implied that document a references document c.

Composition is the thing that elevates a category beyond a mere graph, or "just a load of arrows"³. But to ensure that the structures which can be built "behave in a most basically sensible way"³, there are two conditions which are imposed:

  1. Unitality: stipulating this property ensures that the identity arrows "do nothing" with respect to composition of arrows³. As an analogy, 0 is the identity element which "does nothing" in the binary operation + (which can be represented by a monoidal category, by the way). Given an arrow f that goes from s to t, and identity arrows on s (id_s) and t (id_t), unitality is expressed as follows (using ○ as the composition operator):
    f ○ id_s = id_t ○ f = f.
  2. Associativity: given three labelled arrows (h, g, and f, say) in a nose-to-tail configuration, the associativity law is stated thus: h ○ (g ○ f) = (h ○ g) ○ f.
    Note that while this condition specifically addresses three composable arrows, the law really entails that a string of composable arrows of any length has one unambiguous composite³. And because of associativity, this composite can be regarded as the "sum" of its parts — the order in which things are combined doesn't affect the outcome (see the sketch below).
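
As a rough, non-authoritative sketch of what these stipulations amount to, here is a tiny category of finite sets and functions in Python, with the unitality and associativity laws checked on some sample arrows (the objects and arrows are made up for illustration):

# Arrows are dicts read as functions between finite sets;
# compose(g, f) is g ○ f, i.e. "first f, then g".
def compose(g, f):
    return {x: g[f[x]] for x in f}

def identity(objects):
    return {x: x for x in objects}

s, t, u, v = {1, 2}, {"a", "b"}, {10, 20}, {True, False}
f = {1: "a", 2: "b"}        # f : s -> t
g = {"a": 10, "b": 10}      # g : t -> u
h = {10: True, 20: False}   # h : u -> v

# Unitality: f ○ id_s = id_t ○ f = f
assert compose(f, identity(s)) == f == compose(identity(t), f)

# Associativity: h ○ (g ○ f) = (h ○ g) ○ f
assert compose(h, compose(g, f)) == compose(compose(h, g), f)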

So those are the ingredients for building structures in category theory. But what's it got to do with preserving data integrity? One clue is that a central theme of category theory is the study of structure-preserving maps¹. In fact, category theory was invented as a means to translate theorems and mathematical structures from one branch of mathematics to another. Thus it's a language that's purpose-built for relating structures, while respecting the integrity of structures in source and target. A second important clue concerns the relation between categories and structured data.

Databases as categories

The foundational idea of what (I am suggesting) might develop into a new programming paradigm is that database schemas are categories.

From that idea, an entire mathematical and algorithmic theory and practice of data transformation emerges⁴.

The figure below from [6] is an example. It is at once a database schema and a category. As you can see, it looks like a graph. But it obeys all the rules mentioned above. In addition, it expresses two data integrity constraints (the equations at the top). These are called path equations in the language of category theory (as we'll see, they're not the only kind of constraint we can express using this algorithmic theory).

Not every category has path equations, hence they weren't mentioned in the preceding section as part of the definition of a category — they're a bit of extra expressivity (i.e. scope, which, notice, relies on a constraint: that a composition of arrows can be identified with an unambiguous path — cf. the associativity law). The first one in the example reads: an employee must work in the same department as their manager. The second one tells us that a department's secretary must be an employee of the same department.

A database schema presented categorically. Source: [6]

A path equation essentially encodes a fact⁷ — or a business rule, as it's called in the enterprise context. If we have no means to encode such facts as constraints the data must satisfy, then we risk losing knowledge when data quality degrades.

You can check that the path equations hold true by inspecting the following tables (e.g. 101.Manager.isIn = 103.isIn = q10 and 101.isIn = q10). Each table corresponds to an object of the schema above. More generally, each vertex in the schema represents an entity, and each arrow a foreign key relationship; the arrows out of an entity are its columns. The vertices/objects labelled "String" can be thought of as "pure data" entities with a single column holding a (potentially infinite) set of strings.
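
As a quick illustration, the path equations can be read as equalities of composite lookups over the instance tables. In the Python fragment below, only the facts quoted above (101's manager is 103, and both work in q10) come from the example; the remaining rows are made up.

# A toy fragment of the Employee/Department instance, as plain dicts.
employee = {
    101: {"Manager": 103, "isIn": "q10"},
    102: {"Manager": 102, "isIn": "x02"},
    103: {"Manager": 103, "isIn": "q10"},
}
department = {
    "q10": {"Secr": 101},
    "x02": {"Secr": 102},
}

# Path equation 1: Employee.Manager.isIn = Employee.isIn
assert all(employee[employee[e]["Manager"]]["isIn"] == employee[e]["isIn"] for e in employee)

# Path equation 2: a department's secretary works in that very department
assert all(employee[department[d]["Secr"]]["isIn"] == d for d in department)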

The tables of data represent instances on the schema category, and these instances themselves are categories in a technical sense — they are categories in which the objects represent sets and the foreign key relations represent functions. Now hold on tight: a mapping between two schemas A and B induces mappings between instances of data on A and B⁶. These induced mappings are called functors — data migration functors in this context — and this is the sense in which the practice of data transformation naturally emerges, as alluded to in the opening quote.

The mapping between schemas is also a functor. It's a mapping from one category to another. If one views a category as a kind of language, then a functor would act as a kind of translating dictionary between languages⁷. It's a bridge between two domains, which respects the rules of both. To program a data migration is to express how two schemas are related — a functor between them automatically induces data migrations.

Concretely, a functor takes objects to objects and arrows to arrows, but in doing so it must preserve structure — i.e., composition and identities (see the section on structure and preservation). So, for instance, I couldn't map the above schema to one in which everything stays the same except that the secretary arrow points not to "Employee" but to a separate entity (e.g. "Persons"); I would no longer be able to compose Secr and isIn in the target schema, and that would be disrespecting the structure of composition in the source. Note, the idea of preserving structure doesn't mean I can't drop data when translating from a richer schema to a more basic one. It just means that the transformation must be done in a way that respects both the target and the source structure. Incidentally, this also protects those who share data from having its semantics misinterpreted!
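
To give a flavour of this, here is a crude sketch (not CQL's actual machinery; the schema and column names are hypothetical): a schema mapping is recorded as a renaming of entities and columns, and the induced migration, roughly what [6] writes as Δ_F, reads source-shaped data straight off the target instance by composing with that mapping.

# A hypothetical source schema T with entity "Person" and column "worksIn",
# mapped by a functor F into a target schema S with entity "Employee" and column "isIn".
F_objects = {"Person": "Employee"}
F_arrows = {"worksIn": "isIn"}

# An instance on S, as tables of rows keyed by id.
S_instance = {"Employee": {101: {"isIn": "q10"}, 103: {"isIn": "q10"}}}

def pullback(F_objects, F_arrows, instance):
    # Populate each source entity by reading the corresponding target entity
    # through the functor: every column is looked up via its image under F.
    return {
        src_entity: {
            row_id: {src_col: row[tgt_col] for src_col, tgt_col in F_arrows.items()}
            for row_id, row in instance[tgt_entity].items()
        }
        for src_entity, tgt_entity in F_objects.items()
    }

print(pullback(F_objects, F_arrows, S_instance))
# {'Person': {101: {'worksIn': 'q10'}, 103: {'worksIn': 'q10'}}}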

The precise mathematical machinery behind these concepts has been detailed in several places, e.g. [6, 1]. There'd be little point in sipping from those firehoses of technicality here. Let's take it as a given, and rather see what we can do with this formalism, which not only subsumes most current approaches, such as SQL (by virtue of the fact that category theory is a kind of meta-theory for mathematics), but which gives rise to a practice of data transformation that automatically guarantees that data integrity is preserved as it's transformed⁴. Let's explore this practice more closely.

Practical nonsense?

In this section we'll look at some practical use cases. These have been implemented in an as yet little-known tool called CQL (categorical query language). CQL — largely to be credited to Ryan Wisnesky — is an open-source functional programming language (written in Java) which uses category theory to perform data management tasks, such as querying, integrating, migrating, and evolving databases. Its primary value proposition, in terms we've used before, is that it imposes discipline on data transformation. Or, as it's put on the official website:

Preserve data quality. High-quality data is expensive to obtain, so it is important to preserve that quality throughout the data life-cycle. CQL programs evolve and migrate data in a mathematically universal way, with zero degradation.

We'll take off with a demo in an aviation setting.

High-assurance user-defined functions

For CQL to properly impose discipline on data transformation, we'd expect code which violates data integrity constraints to throw an error — that is, it shouldn't even compile. Errors are thus detected much earlier, without costly runtime checking. And indeed, this is one of the key features of CQL, which ships with a few built-in examples of this.

One of the examples concerns a violation of foreign key constraints: "In CQL, queries that target schemas containing foreign keys are guaranteed, at compile time, and without accessing any source data, to always materialize target instances with correctly populated foreign keys"⁸. The example we'll walk through here similarly demonstrates a compile-time error, but in this case one raised when a data type conversion — associated with a user-defined function — is violated. The demo uses a metric conversion task, somewhat reminiscent of the metric conversion error that famously resulted in NASA's Mars Climate Orbiter disaster.

The barebones example defines a source schema about American airplanes, which have wing spans in inches, and a target schema about European airplanes, which have wing spans in centimetres. The typeside specifies how to convert inches to centimetres, and a query which does not convert inches to centimetres is rejected. In CQL, user-defined functions are first-class schema elements. This allows CQL's automated theorem proving technology to provide compile-time aid to schemas with user-defined functions.

In the first screenshot, we see the entire code base for this example — in this case, a working version. Glossing over the details, what we have are two schema definitions (one for American airplanes, one for European), a query (which defines a data migration from the American to the European schema), an instance definition which generates some data for the American airplane schema, and, finally, an instance definition which migrates data from the American airplane instance to the European one — by evaluating the query "AmericanToEuropean".

If you compare the two schemas, you'll notice the only difference is that the American wingLength attribute maps to an "in" (inches) data type (defined in the typeside), while the European one maps to a "cm" (centimetre) type. The user-defined function (UDF) — which converts inches to centimetres — is specified in the typeside. In this working version, the UDF is correctly included to convert the wingLength to the European metric system — the invocation has been highlighted in yellow. The following two tables show the resulting American and European instances respectively.

The counter-example is presented in the following screenshot. If you look at the query definition (starting at line 24), you'll see that the UDF invocation has been removed. As a result, the program no longer compiles; the output window at the bottom shows an error message, indicating that the type conversion has not been respected in the query.
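
The CQL code itself lives in the screenshots, but the flavour of the guarantee can be imitated with a static type checker in Python (entirely my own sketch, not CQL): giving inches and centimetres distinct types turns a missing conversion into a type error before any data is touched.

from typing import NewType

Inches = NewType("Inches", float)
Centimetres = NewType("Centimetres", float)

def in_to_cm(length: Inches) -> Centimetres:
    # The "user-defined function" converting between unit types.
    return Centimetres(length * 2.54)

def european_wing_length(length: Inches) -> Centimetres:
    return in_to_cm(length)   # OK: the conversion is applied

def broken_wing_length(length: Inches) -> Centimetres:
    return length             # a checker such as mypy rejects this line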

Con­straints > Tests

Imagine the following scenario: you're operating an enterprise data warehouse with many hundreds of tables. A new client is about to be folded into business operations, and hence into the enterprise's data processes. You want to insert this client into a bunch of reference and control tables which are integrated into a multitude of reporting processes in complicated ways. If the client is missing in some of these tables, data might go missing in reports, or worse, get silently corrupted.

The point of this example is to make the case for data constraints when they can't be enforced at compile time, because your input is a database without constraints. The advantage of data constraints, as they're operationalized in CQL, is that they make conditions of data integrity very explicit — they make data validation transparent and easy to reason about. Without explicitly encoded constraints, knowing what to expect is either loosely documented somewhere, or left as an implied property of some comparatively verbose data quality test. These are factors that give way to mistakes in testing, or to validation tests that go out of sync with schema changes.

On top of these potential benefits, the manner in which data constraints are encoded — using category-theoretic concepts — makes them suitable for the chase algorithm. We can leverage the chase algorithm to "repair" data which does not satisfy the modelled constraints. Given a chase engine, "we have a general means of propagating information implied by the properties of a model"⁹.
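
Before walking through the CQL screenshots, here is a toy picture of a single chase step (a deliberately naive sketch of the general idea, not CQL's chase engine; the table and column names are made up): a dependency says "for every client there must be a matching row in the new table", and chasing adds the missing rows, inventing placeholder ("labelled null") values where the constraint doesn't determine them.

t_client = [{"CLIENT_ID": "C1"}, {"CLIENT_ID": "C2"}]
t_new_table = [{"CLIENT_ID": "C1", "REPORTING_FLAG": "Y"}]

def chase_step(clients, new_table):
    # Dependency: for every c in T_CLIENT there exists a row in T_NEW_TABLE with
    # the same CLIENT_ID. Where one is missing, add a row with a labelled null.
    out = list(new_table)
    for i, c in enumerate(clients):
        if not any(r["CLIENT_ID"] == c["CLIENT_ID"] for r in out):
            out.append({"CLIENT_ID": c["CLIENT_ID"], "REPORTING_FLAG": f"?v{i}"})
    return out

print(chase_step(t_client, t_new_table))
# [{'CLIENT_ID': 'C1', 'REPORTING_FLAG': 'Y'},
#  {'CLIENT_ID': 'C2', 'REPORTING_FLAG': '?v1'}]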

The first screenshot lays out the data we'll be playing with in this example — two reference tables, T_CLIENT and T_CURRENCY, each containing two rows. Incidentally, the data is loaded into an in-memory database (H2) via the JDBC API.

The next step sees us loading the data from step 1, which had no special constraints attached, into a schema that does. Notice that we've added a new entity — T_NEW_TABLE. The data constraints that are imposed on the "ClientReferenceTables" schema are shown starting on line 53. The syntax is fairly self-explanatory: in the first block we require that T_NEW_TABLE contains all the clients that exist in T_CLIENT. In the second block, we simply have extra where conditions; for every row in T_CURRENCY, we must find a row in T_NEW_TABLE that satisfies all conditions in the where clause.

Here we are performing the actual load into ClientReferenceTables and naming the instance I1. Notice that T_NEW_TABLE is not being populated — this will become relevant in the next step.

Now we'll perform a "check" command to see if the constraints — which we've called "EDs" (embedded dependencies) — are satisfied. As we might expect, since T_NEW_TABLE is empty, the check fails, reporting what the failing triggers are.

Now let's add some dummy data to T_NEW_TABLE. We can do so by first importing the instance I1, and then generating some data through equational logic.

I2 now looks as follows:

Instance I2

As suggested in the opening, CQL allows us to run the chase algorithm on the basis of our specified data constraints. In the next step, we'll construct a valid instance (I3) by "chasing" the data constraints on the instance I2. While I3 is a valid instance, as the check command on line 95 will attest, we'll invoke one more category-theoretic operation in creating instance I4 — the quotient query. This is an operation used to identify equivalences within a dataset, allowing you to create a new dataset wherein certain linked elements are merged together.
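
The quotient idea itself is easy to picture outside CQL. In this simplified sketch (my own, with made-up rows), rows declared equivalent by a key are merged into a single representative, preferring real values over the placeholders introduced by the chase:

from itertools import groupby

rows = [
    {"CLIENT_ID": "C2", "REPORTING_FLAG": "?v1"},  # placeholder row from the chase
    {"CLIENT_ID": "C2", "REPORTING_FLAG": "Y"},    # dummy row added by hand
    {"CLIENT_ID": "C1", "REPORTING_FLAG": "Y"},
]

def quotient(rows, key):
    # Merge each equivalence class (rows sharing the same key) into one row.
    merged = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        best = {}
        for col in group[0]:
            real = [r[col] for r in group if not str(r[col]).startswith("?")]
            best[col] = real[0] if real else group[0][col]
        merged.append(best)
    return merged

print(quotient(rows, key=lambda r: r["CLIENT_ID"]))
# [{'CLIENT_ID': 'C1', 'REPORTING_FLAG': 'Y'}, {'CLIENT_ID': 'C2', 'REPORTING_FLAG': 'Y'}]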

What results is the following database, in which the new table has now been populated (1) prior to record-linking, and (2) post record-linking:

(1) Instance I3: the "chased" version of I2
(2) Instance I4: I3 post record-linking

As alluded to earlier, the check command now also checks out:

Universal data warehouse

The quotient operation also plays a prominent role in our final example. This one concerns a proposal to correct the "backward" and often failing way of doing data warehousing projects. A detailed case study can be found in [10] or [11]. The fundamental insight in the proposal is that risks of misalignment between reporting needs and the available data are not identified early enough in the traditional data warehousing approach.

Loosely speaking, we might compare data warehouse construction to tunnel construction, wherein the ends of the tunnel would represent source and target schemas — the target schema being a representation of the business need. The identified problem of traditional data warehouse construction would thus be analogous to misalignment issues in the construction of tunnels. Misalignment refers to deviations from the planned path or design of the tunnel, which is often a cause of increased costs, delays and also structural weaknesses.

In this vein, one of the key features needed to innovate data warehouse construction would be a kind of surveillance technology: one which tells us about the alignment between a source schema and a desired target schema. In this way, risks — e.g. the fact that the target schema may not be constructible from the available data — may be identified and mitigated early. "The traditional data warehousing approach makes the early identification of risk nearly impossible, because users typically cannot find out what can go wrong during data integration until they actually integrate data"¹⁰.

In a sense, this approach — if it doesn't impose, at least promotes — more discipline in architecting data pipelines. We have a tool that can help us avoid spending a great deal of effort trying to construct a schema whose integrity constraints cannot be satisfied by the available data.

The technology at the core of this feature is the aforementioned quotient operation. It can operate at both the schema level and the data level. Operating at the schema level, it generates an integrated schema which represents the best possible result of combining source schemas (e.g. a set of heterogeneous source databases) — "CQL computes the unique optimally integrated schema that takes into account all of schema matches, constraints, etc."¹⁰ — a universal schema, so to speak. I'll leave you to explore the inner workings elsewhere; suffice it to say that universal constructions are a common theme in category theory.

Finally, the following screenshot shows this idea of an integrated schema — "S" — generated by the quotient operation (the green arrow), thereby permitting an early "gap analysis" between S and the business's desired target schema, before any ETL work has even begun.

Outlook

Are we at a historical moment? A founder of Conexus, and progenitor of this paradigmatic approach, David Spivak, believes so. He diagnoses a desperate need for the channelling of information in our age. He likens the "uncontrolled flows of information" and data waste that we are grappling with to the engineering challenges of sewage in the streets and runoff in the rivers before the industrial revolution¹². "There are always these trends of 'what is the best way to store data', but the multiplicity of perspectives is not going away!"¹² We need to focus instead on how best to transform and integrate information¹². Category theory holds the promise to unlock a new paradigm for just that — "to do for data and information what Newton's calculus has done for physics"¹³.

[1]: Fong, B. & Spivak, D. I. (2019). An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press.
[2]: Martin, R. (2017). Clean Architecture. Pearson Education. https://elibrary.pearson.de/book/99.150005/9780134494333
[3]: Cheng, E. (2022). The Joy of Abstraction: An Exploration of Math, Category Theory, and Life. Cambridge University Press.
[4]: Informal Data Transformation Considered Harmful (2019)
[5]: https://youtu.be/cJ46AOEOc14?feature=shared
[6]: Functorial Data Migration (2013)
[7]: Ologs: A Categorical Framework for Knowledge Representation (2011)
[8]: https://categoricaldata.net/
[9]: https://blog.algebraicjulia.org/post/2022/06/chase/#ref-spivak2012
[10]: FinanceIntegration.pdf
[11]: https://conexus.com/cql-demo/
[12]: https://www.youtube.com/watch?v=fTporauBJEs&t=33s
[13]: https://dspivak.net/grants/NSF_IIS.pdf