Is Data Transformation Poorly Understood?



Introducing category theory as a foundation for a new paradigm in data transformation


Data pipelines are made of software. But unlike traditional software, they seem to pose unique challenges when it comes to testing — as suggested in "The challenge of testing data pipelines" (ETL, ELT, etc.). The distinctive burden associated with data pipelines, which the article points to, is the need for data quality checks:

Data Pipelines often have 'tests' built into the data pipeline itself, executed as data is processed, to ensure data quality.

When these tests are not built into the actual data pipeline, detection of data quality issues is simply deferred — e.g. until a business user creates a ticket for resolving a reporting bug. The pursuit of data quality testing is therefore usually regarded as an imposition on top of the more traditional kind — let's call it logic testing.

With roughly 40% of enterprise IT budgets spent on data integration, the challenge of information management is arguably one of the biggest in IT, and thus a major societal issue¹. Managing data quality is a central part of that challenge. For instance, on the order of 80% of data migration projects are said to either fail or exceed their budgets, in no small measure on account of poor data quality. On this view, we might ask whether something is amiss — on a deep level — in the way we do data transformation, and whether data quality testing is simply an inherent aspect of building data pipelines. Is there an alternative view perhaps — a new way of thinking about data transformation — which could obviate this practice?

In this article, I'll introduce an approach to data transformation — data integration, data migration, and so on — based on category theory. According to the U.S. National Institute of Standards and Technology, category theory is a potential mathematical foundation for "next-generation data integration"¹³. It offers a new language for talking about data transformation, introducing crucial concepts such as that of a functor — a concept acutely missing from current thinking¹². At the outset, I'll suggest that we can view this approach as giving rise to a distinct programming paradigm, one which can spare us the costly and often inadequate effort of data quality testing, all while guaranteeing data integrity.

Programming paradigms

To get a sense of the kind of paradigm shift this new way of thinking about data might entail, let's draw inspiration from other computing paradigms. Robert C. Martin identifies three paradigms in the history of computing²: structured programming, object-oriented programming, and functional programming. He defines a paradigm in this context as follows:

Paradigms are ways of programming, relatively unrelated to languages. A paradigm tells you which programming structures to use, and when to use them².

He observes a deep connection that lies at the heart of each paradigm; they each take something away from the programmer. "Each of the paradigms removes capabilities from the programmer. None of them adds new capabilities. Each imposes some kind of extra discipline":

  • Structured programming imposes discipline on direct transfer of control
  • Object-oriented programming imposes discipline on indirect transfer of control
  • Functional programming imposes discipline upon assignment

Scope and constraints

Together, the three aforementioned paradigms remove goto statements, function pointers, and assignment². At first glance, it might seem surprising that we'd gain something so profound through a constraint (were that not the case, these paradigms would hardly have been so widely adopted). But scope and constraints are in fact two sides of the same coin. The capacity (or the scope, or the freedom) that we gain — for instance, the ability to build large and intricate yet robust and extensible applications — rests on certain limitations that we accept — such as no longer being free to be completely unrestrained and cavalier about introducing coupling among software components wherever we please.

To go on a bit of a philosophical detour, let's look to Noam Chomsky, who illuminated the concept playing out in the organic world while discussing the limits of human understanding:

Far from bewailing the existence of mysteries-for-humans, we should be extremely grateful for it. With no limits to growth and development, our cognitive capacities would also have no scope. Similarly, if the genetic endowment imposed no constraints on growth and development of an organism it could become only a shapeless amoeboid creature, reflecting accidents of an unanalyzed environment, each quite unlike the next. Classical aesthetic theory recognized the same relation between scope and limits. Without rules, there can be no genuinely creative activity, even when creative work challenges and revises prevailing rules.

In her enlightening book "The Joy of Abstraction"³, Eugenia Cheng introduces the reader to the "Zero World". In doing so, she also conveys the weightiness of constraints, showing that if you loosen the wrong ones, it can cause worlds to collapse:

[The Zero World] is the world you end up in if you decide to try and declare 1 + 1 = 1, but still want some other usual rules of arithmetic to hold: in that case we could subtract 1 from both sides and get 1 = 0. If that's true then everything will be 0… You can't really do anything in [the Zero World]. Or rather, you can do anything in it and it's all equally valid. It turns out that a world in which everything is equally valid is not very interesting, and is more or less the same as a world in which nothing is possible³.

Imposing discipline on data transformation

Given the above, we might ask: are there specific constraints that could usher in the new paradigm I've been alluding to? At this point I should mention that Robert C. Martin believes there will be no more than the three paradigms we've highlighted — that we've seen them all — the reason being that there is nothing "left to take away"². The evidence seems to bear this out; there have been no new paradigms since 1968².

But what about the ability to scramble the meaning of data when we move it from one database to another? This might not be your prototypical programming task — one reason it might have been overlooked by Robert C. Martin — but wouldn't we want to take that ability away from the programmer if we could? Wouldn't we want to eliminate the need for all that expensive data quality testing and cleaning, and instead "automatically guarantee that data integrity is preserved as it's transformed (migrated, integrated, composed, queried, viewed, etc) throughout the enterprise, so that data and programs that depend on that data need not constantly be re-validated for every particular use [emphasis added]"⁴?

If so, we're essentially looking for a paradigm that respects the integrity of data. What would that look like? In short, we'd first need a way to encode rules that distinguish sensible from non-sensible data — that is, encoded data integrity constraints. And, second, we'd need a way to make use of — indeed, a way that imposes the use of — such encoded "programming structures"² to ensure that we can't transform sensible data into non-sensible data.

Isn't the use of data integrity constraints quite common when working with structured data, though?

Constraints are often built in to database schemas; for example, in SQL, a primary key constraint (stating for example that if two people have the same social security numbers, they must be the same person) can be given when defining the columns of a table. Other example constraints include range constraints (that for example an age be > 21) and join decompositions (that for example state that a table is the join of two other tables)⁴.

So yes, there are ways to specify some aspects of data integrity. But the language of constraints could be richer, and we're still lacking means to transform data while ensuring that we respect the constraints without ex-post-facto data validation.

The authors of [4] bemoan the "failure of tooling" used for data transformation (ETL tooling such as Informatica PowerCenter and IBM DataStage) to support expressive constraints. While a wealth of theoretical work exists "on applying constraints to data transformation, integration, cleaning, schema mapping, and more", there is a complicated "trade-off between how expressive constraint languages are, and how computationally difficult it is to reason about them"⁴. As a result, "in the enterprise, data quality degradation due to loss of integrity during transformation is a systemic problem" which is simply "resistant to current techniques"⁴.

Recently, however, a tool has emerged to address this failing: a tool which embodies this new paradigm, and whose quintessence is a weird and wonderful formalism called category theory.

Getting to know category theory

You might call category theory a branch of mathematics, or, even, a way of thinking about mathematics. Bolder yet, some claim category theory is the "mathematics of mathematics"³.

And in case that's not flexing it enough, how about "a central hub for all of pure mathematics [which] is unmatched in its ability to organize and layer abstractions, to find commonalities between structures of all sorts, and to facilitate communication between different mathematical communities"¹.

Yet it's not just "abstract nonsense", as it's referred to by some. Category theory may have, for too long, "stayed in the realm of mind", but "it is ripe to be brought to hand"¹. In fact, by focusing on its use in the area of data engineering, I'm leaving out many other potential applications.

In this attempt to introduce the subject to the nonprofessional, and being merely an enthusiast myself, I'll have to tiptoe through the field, while leaning heavily on [1] and [3]. That being said, I'll start by — chances are, blasphemously — claiming that category theory is about categories. And any category is made up of objects and relationships³. Visually, a category is like a graph: objects correspond to the nodes, relationships correspond to the edges — the relationships are more commonly referred to as arrows or morphisms.

The objects can be anything — numbers, people, shapes, mathematical structures — anything really³. They could also just be unspecified and abstract. Formally, objects have no properties or internal structure. In a way, they serve no other purpose than to identify the ends of arrows. But it's often helpful to fill them with content when thinking about the relationships.

The relationships — or arrows — could be things like "less than", "is the mother of", or a function between sets³. Or they could be more abstract, departing from the usual notion of a relationship². There can be any number of arrows between two objects a and b. In the case of a relationship such as "less than", there would be one or none. When there are more arrows, each represents a different sense in which the two objects can be related² — e.g. there are 3² = 9 different possible functions from a set a with two elements to a set b with three elements; represented categorically, each function would be a distinct arrow from object a to object b.

Doing category theory is about studying the networks of relationships in categories and finding patterns within them. Hence the notion of an object can be devoid of any internal structure. Rather than focusing on its intrinsic properties, we're focussed on its relationships and the role it plays within its context.

With the following examples from [5], I'll try to motivate the idea that simply looking at structures of relationships may reveal important concepts, thereby hopefully conveying a sense of promise for this kind of approach. Specifically, we'll look at a fairly basic categorical notion — that of a product — and we'll see how it captures some well-known mathematical notions in several different settings (categories). Loosely speaking, we can view the product of two objects (a and b) as "the last thing that goes into both a and b"⁵ — we'll see what that means…

GCD and LCM

The following diagram is an example of a simple category. The objects are numbers, and we've drawn an arrow between two objects if one divides evenly into the other. Notice, I haven't explicitly drawn all arrows for which this relationship exists. For a start, there should be a looping arrow from each object to itself (as every number divides itself evenly). Such arrows are called identities in category theory and essentially say that everything is the same as itself — they're usually just implied in diagrams. Secondly, by composing arrows — putting them in a 'nose-to-tail' configuration — you'll see that indeed all possible relationships have been captured. From 3 to 30, for example, there are two paths (composite arrows): via 6 and via 15.

Now what is the product of two objects a and b — in other words, what is the last number to go into both? If we take 6 and 15 as an example, the candidates would be 1 and 3, as they both divide evenly into 6 and 15. However, since 1 also divides 3, the path from 1 to 6 and 15 is longer, so the last thing to go into both would be 3. Hence,

product(6,15) = 3

How about the product of 6 and 30? In this case, the answer would be 6 — remember that every object also goes into itself. If you keep playing this game, you might notice that the notion of product in this particular category corresponds exactly to the mathematical notion of "greatest common divisor" (gcd).

Now let's imagine we've reversed all of the arrows in the above category. By doing so, we've actually created a new category — known as the dual category. What does the product in this new category correspond to? In this case, the 'last thing to go into both' would correspond to the notion of "least common multiple" (lcm).
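
To make "the last thing that goes into both" concrete, here is a small sketch in Python (my own illustration, not code from [5]). It models a category with at most one arrow between any two objects, which is all these examples need, as a set of objects plus an arrow relation, and finds the product of a and b as the candidate that every other candidate maps into.

# A simple category given by a set of objects and an arrow relation:
# arrows(x, y) == True means there is an arrow from x to y.
# Identity arrows (x -> x) and composites are implied by the relation.

def categorical_product(objects, arrows, a, b):
    # Candidates: objects with an arrow into both a and b.
    candidates = [c for c in objects if arrows(c, a) and arrows(c, b)]
    # The product is the "last" candidate: the one every other candidate maps into.
    for p in candidates:
        if all(arrows(c, p) for c in candidates):
            return p
    return None  # no product exists for this pair

# Divisibility category: an arrow x -> y whenever x divides y evenly.
numbers = [1, 2, 3, 5, 6, 10, 15, 30]
divides = lambda x, y: y % x == 0

print(categorical_product(numbers, divides, 6, 15))   # 3  (the gcd)
print(categorical_product(numbers, divides, 6, 30))   # 6  (the gcd)

# Reversing all arrows gives the dual category; its product is the lcm.
divided_by = lambda x, y: divides(y, x)
print(categorical_product(numbers, divided_by, 6, 15))  # 30  (the lcm)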

Min and Max

Let's try the same thing for the following simple category (again, I've left out composite and identity arrows). What is the last thing to go into two objects here?

I've listed a few examples to suggest the pattern:

product(0,0) = 0
product(0,1) = 0
product(1,2) = 1
product(2,4) = 2
...

In this category, the product corresponds to the "min" function over two numbers. Similarly, if we generate the reverse category (by reversing the arrows), the product embodies the "max" function.
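
The same categorical_product helper from the earlier sketch reproduces this pattern; only the arrow relation changes (again, just my own illustration):

# Arrows now encode the ordering x <= y on a few numbers.
nums = [0, 1, 2, 3, 4]
leq = lambda x, y: x <= y

print(categorical_product(nums, leq, 0, 1))  # 0  (min)
print(categorical_product(nums, leq, 2, 4))  # 2  (min)

# Reversed arrows (the dual category) give max instead.
geq = lambda x, y: x >= y
print(categorical_product(nums, geq, 2, 4))  # 4  (max)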

Intersections and Unions

The following category is one in which the objects are sets. There is a relationship between two sets — drawn on the right — if one set is a subset of the other. In this case, the product defines the intersection of two sets. An example in the diagram is the black criss-crossed area where the blue and the red (discontiguous) sets overlap. It's the last thing to go into both.

Once again, we could reverse the arrows to generate a new category in which the arrows denote superset relationships. The product in that category corresponds to the union of two sets.
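
Once more with the same helper, now over a handful of concrete sets ordered by inclusion (an illustration, not the sets in the diagram):

# Arrows encode the subset relation A ⊆ B.
blue, red = frozenset({1, 2, 3}), frozenset({2, 3, 4})
sets = [frozenset(), frozenset({2}), frozenset({2, 3}), blue, red, blue | red]
subset = lambda a, b: a <= b

print(categorical_product(sets, subset, blue, red))    # frozenset({2, 3})  (intersection)

# Reversed arrows (the superset relation) yield the union.
superset = lambda a, b: a >= b
print(categorical_product(sets, superset, blue, red))  # frozenset({1, 2, 3, 4})  (union)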

And and Or

Finally, let's have a logic example using the following category — usually referred to as Bool — which depicts an ordering relationship: false ≤ true.

The product here corresponds to the logical AND operator — the last thing to go into both is always 'false', unless both product elements are 'true' (which corresponds precisely to the behaviour of the AND operator). Now, instead of reversing arrows (because true ≤ false would be a strange relationship), let's instead look for the least element that is greater than both. And voilà, we find a pattern corresponding to the logical OR operator.
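
And the Bool case, in the same style (my own sketch):

# Bool as a category: an arrow x -> y whenever x <= y, i.e. false <= true.
bools = [False, True]
implies = lambda x, y: (not x) or y

print(categorical_product(bools, implies, True, True))    # True   (AND)
print(categorical_product(bools, implies, True, False))   # False  (AND)

# The dual (least element greater than both) behaves like OR.
implied_by = lambda x, y: implies(y, x)
print(categorical_product(bools, implied_by, True, False))  # True  (OR)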

Structure and preservation

Unlike a mere graph, a category must obey certain rules beyond just looking like one. In a way, that might seem like a small deal. But to hark back to the section on scope and constraints, many times it's the "small doors that open up into large rooms".

So what rules must a category observe? As alluded to in the previous section, there are two basic structural stipulations any category must satisfy:

  1. Identity arrows: every object must be related to itself. This can be a weak condition³.
  2. Composition: given two arrows in a nose-to-tail configuration, we want to be able to combine them into a single arrow³. If there's a sense in which a is related to b, and a sense in which b is related to c, then we can combine those to get a sense in which a is related to c³. For example, if a is the mother of b, and b is the mother of c, there is a sense in which a is the grandmother of c. To give a counter-example: if document a references document b, and document b references document c, we can't compose these two relationships, as it's not implied that document a references document c.

Composition is the thing that elevates a category beyond a mere graph, or "just a load of arrows"³. But to ensure that the structures which can be built "behave in a most basically sensible way"³, there are two conditions which are imposed:

  1. Unitality: stipulating this property ensures that the identity arrows "do nothing" with respect to composition of arrows³. As an analogy, 0 is the identity element which "does nothing" in the binary operation + (which can be represented by a monoidal category, by the way). Given an arrow f that goes from s to t, and identity arrows on s (id_s) and t (id_t), unitality is expressed as follows (using ○ as the composition operator):
    f ○ id_s = id_t ○ f = f.
  2. Associativity: given three labelled arrows (h, g, and f, say) in a nose-to-tail configuration, the associativity law is stated thus: h ○ (g ○ f) = (h ○ g) ○ f.
    Note that while this condition specifically addresses three composable arrows, the law really entails that a string of composable arrows of any length has one unambiguous composite³. And because of associativity, this composite can be regarded as the "sum" of its parts — the order in which things are combined doesn't affect the outcome (see the sketch below).
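
As a rough, non-authoritative sketch of what these stipulations amount to, here is a tiny category of finite sets and functions in Python, with the unitality and associativity laws checked on some sample arrows (the objects and arrows are made up for illustration):

# Arrows are dicts read as functions between finite sets;
# compose(g, f) is g ○ f, i.e. "first f, then g".
def compose(g, f):
    return {x: g[f[x]] for x in f}

def identity(objects):
    return {x: x for x in objects}

s, t, u, v = {1, 2}, {"a", "b"}, {10, 20}, {True, False}
f = {1: "a", 2: "b"}        # f : s -> t
g = {"a": 10, "b": 10}      # g : t -> u
h = {10: True, 20: False}   # h : u -> v

# Unitality: f ○ id_s = id_t ○ f = f
assert compose(f, identity(s)) == f == compose(identity(t), f)

# Associativity: h ○ (g ○ f) = (h ○ g) ○ f
assert compose(h, compose(g, f)) == compose(compose(h, g), f)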

So those are the ingredients for building structures in category theory. But what's it got to do with preserving data integrity? One clue is that a central theme of category theory is the study of structure-preserving maps¹. In fact, category theory was invented as a means to translate theorems and mathematical structures from one branch of mathematics to another. Thus it's a language that's purpose-built for relating structures, while respecting the integrity of structures in source and target. A second important clue concerns the relation between categories and structured data.

Databases as categories

The foundational idea of what (I am suggesting) might develop into a new programming paradigm is that database schemas are categories.

From that idea, an entire mathematical and algorithmic theory and practice of data transformation emerges⁴.

The figure below from [6] is an example. It is at once a database schema and a category. As you can see, it looks like a graph. But it obeys all the rules mentioned above. In addition, it expresses two data integrity constraints (the equations at the top). These are called path equations in the language of category theory (as we'll see, they're not the only kind of constraint we can express using this algorithmic theory).

Not every category has path equations, hence they weren't mentioned in the preceding section as part of the definition of a category — they're a bit of extra expressivity (i.e. scope, which, notice, relies on a constraint: that a composition of arrows can be identified with an unambiguous path — cf. the associativity law). The first one in the example reads: an employee must work in the same department as their manager. The second one tells us that a department's secretary must be an employee of the same department.

A database schema presented categorically. Source: [6]

A path equation essentially encodes a fact⁷ — or a business rule, as it's called in the enterprise context. If we have no means to encode such facts as constraints the data must satisfy, then we risk losing knowledge when data quality degrades.

You can check that the path equations hold true by inspecting the following tables (e.g. 101.Manager.isIn = 103.isIn = q10 and 101.isIn = q10). Each table corresponds to an object of the schema above. More generally, each vertex in the schema represents an entity, and each arrow a foreign key relationship; the arrows out of an entity are its columns. The vertices/objects labelled "String" can be thought of as "pure data" entities with a single column holding a (potentially infinite) set of strings.
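
As a quick illustration, the path equations can be read as equalities of composite lookups over the instance tables. In the Python fragment below, only the facts quoted above (101's manager is 103, and both work in q10) come from the example; the remaining rows are made up.

# A toy fragment of the Employee/Department instance, as plain dicts.
employee = {
    101: {"Manager": 103, "isIn": "q10"},
    102: {"Manager": 102, "isIn": "x02"},
    103: {"Manager": 103, "isIn": "q10"},
}
department = {
    "q10": {"Secr": 101},
    "x02": {"Secr": 102},
}

# Path equation 1: Employee.Manager.isIn = Employee.isIn
assert all(employee[employee[e]["Manager"]]["isIn"] == employee[e]["isIn"] for e in employee)

# Path equation 2: a department's secretary works in that very department
assert all(employee[department[d]["Secr"]]["isIn"] == d for d in department)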

The tables of data represent instances on the schema category, and these instances themselves are categories in a technical sense — they are categories in which the objects represent sets and the foreign key relations represent functions. Now hold on tight: a mapping between two schemas A and B induces mappings between instances of data on A and B⁶. These induced mappings are called functors — data migration functors in this context — and this is the sense in which the practice of data transformation naturally emerges, as alluded to in the opening quote.

The mapping between schemas is also a functor. It's a mapping from one category to another. If one views a category as a kind of language, then a functor would act as a kind of translating dictionary between languages⁷. It's a bridge between two domains, which respects the rules of both. To program a data migration is to express how two schemas are related — a functor between them automatically induces data migrations.

Concretely, a functor takes objects to objects and arrows to arrows, but in doing so it must preserve structure — i.e., composition and identities (see the section on structure and preservation). So, for instance, I couldn't map the above schema to one in which everything stays the same except that the secretary arrow points not to "Employee" but to a separate entity (e.g. "Persons"); I would no longer be able to compose Secr and isIn in the target schema, and that would be disrespecting the structure of composition in the source. Note, the idea of preserving structure doesn't mean I can't drop data when translating from a richer schema to a more basic one. It just means that the transformation must be done in a way that respects both the target and the source structure. Incidentally, this also protects those who share data from having its semantics misinterpreted!
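
To give a flavour of this, here is a crude sketch (not CQL's actual machinery; the schema and column names are hypothetical): a schema mapping is recorded as a renaming of entities and columns, and the induced migration, roughly what [6] writes as Δ_F, reads source-shaped data straight off the target instance by composing with that mapping.

# A hypothetical source schema T with entity "Person" and column "worksIn",
# mapped by a functor F into a target schema S with entity "Employee" and column "isIn".
F_objects = {"Person": "Employee"}
F_arrows = {"worksIn": "isIn"}

# An instance on S, as tables of rows keyed by id.
S_instance = {"Employee": {101: {"isIn": "q10"}, 103: {"isIn": "q10"}}}

def pullback(F_objects, F_arrows, instance):
    # Populate each source entity by reading the corresponding target entity
    # through the functor: every column is looked up via its image under F.
    return {
        src_entity: {
            row_id: {src_col: row[tgt_col] for src_col, tgt_col in F_arrows.items()}
            for row_id, row in instance[tgt_entity].items()
        }
        for src_entity, tgt_entity in F_objects.items()
    }

print(pullback(F_objects, F_arrows, S_instance))
# {'Person': {101: {'worksIn': 'q10'}, 103: {'worksIn': 'q10'}}}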

The precise mathematical machinery behind these concepts has been detailed in several places, e.g. [6, 1]. There'd be little point in sipping from those firehoses of technicality here. Let's take it as a given, and rather see what we can do with this formalism, which not only subsumes most current approaches, such as SQL (by virtue of the fact that category theory is a kind of meta-theory for mathematics), but which gives rise to a practice of data transformation that automatically guarantees that data integrity is preserved as it's transformed⁴. Let's explore this practice more closely.

Practical nonsense?

In this section we'll look at some practical use cases. These have been implemented in an as yet little-known tool called CQL (categorical query language). CQL — largely to be credited to Ryan Wisnesky — is an open-source functional programming language (written in Java) which uses category theory to perform data management tasks, such as querying, integrating, migrating, and evolving databases. Its primary value proposition, in terms we've used before, is that it imposes discipline on data transformation. Or, as it's put on the official website:

Preserve data quality. High-quality data is expensive to obtain, so it is important to preserve that quality throughout the data life-cycle. CQL programs evolve and migrate data in a mathematically universal way, with zero degradation.

We'll take off with a demo in an aviation setting.

High-assurance user-defined functions

For CQL to properly impose discipline on data transformation, we'd expect code which violates data integrity constraints to throw an error — that is, it shouldn't even compile. Errors are thus detected much earlier, without costly runtime checking. And indeed, this is one of the key features of CQL, which ships with a few built-in examples of this.

One of the examples concerns a violation of foreign key constraints: "In CQL, queries that target schemas containing foreign keys are guaranteed, at compile time, and without accessing any source data, to always materialize target instances with correctly populated foreign keys"⁸. The example we'll walk through here similarly demonstrates a compile-time error, but in this case one raised when a data type conversion — associated with a user-defined function — is violated. The demo uses a metric conversion task, somewhat reminiscent of the metric conversion error that famously resulted in NASA's Mars Climate Orbiter disaster.

The barebones example defines a source schema about American airplanes, which have wing spans in inches, and a target schema about European airplanes, which have wing spans in centimetres. The typeside specifies how to convert inches to centimetres, and a query which does not convert inches to centimetres is rejected. In CQL, user-defined functions are first-class schema elements. This allows CQL's automated theorem proving technology to provide compile-time aid to schemas with user-defined functions.

In the first screenshot, we see the entire code base for this example — in this case, a working version. Glossing over the details, what we have are two schema definitions (one for American airplanes, one for European), a query (which defines a data migration from the American to the European schema), an instance definition which generates some data for the American airplane schema, and, finally, an instance definition which migrates data from the American airplane instance to the European one — by evaluating the query "AmericanToEuropean".

If you compare the two schemas, you'll notice the only difference is that the American wingLength attribute maps to an "in" (inches) data type (defined in the typeside), while the European one maps to a "cm" (centimetre) type. The user-defined function (UDF) — which converts inches to centimetres — is specified in the typeside. In this working version, the UDF is correctly included to convert the wingLength to the European metric system — the invocation has been highlighted in yellow. The following two tables show the resulting American and European instances respectively.

The counter-example is presented in the following screenshot. If you look at the query definition (starting at line 24), you'll see that the UDF invocation has been removed. As a result, the program no longer compiles; the output window at the bottom shows an error message, indicating that the type conversion has not been respected in the query.
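
The CQL code itself lives in the screenshots, but the flavour of the guarantee can be imitated with a static type checker in Python (entirely my own sketch, not CQL): giving inches and centimetres distinct types turns a missing conversion into a type error before any data is touched.

from typing import NewType

Inches = NewType("Inches", float)
Centimetres = NewType("Centimetres", float)

def in_to_cm(length: Inches) -> Centimetres:
    # The "user-defined function" converting between unit types.
    return Centimetres(length * 2.54)

def european_wing_length(length: Inches) -> Centimetres:
    return in_to_cm(length)   # OK: the conversion is applied

def broken_wing_length(length: Inches) -> Centimetres:
    return length             # a checker such as mypy rejects this line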

Con­straints > Tests

Imagine the following scenario: you're operating an enterprise data warehouse with many hundreds of tables. A new client is about to be folded into business operations, and hence into the enterprise's data processes. You want to insert this client into a bunch of reference and control tables which are integrated into a multitude of reporting processes in complicated ways. If the client is missing in some of these tables, data might go missing in reports, or worse, get silently corrupted.

The point of this example is to make the case for data constraints when they can't be enforced at compile time, because your input is a database without constraints. The advantage of data constraints, as they're operationalized in CQL, is that they make conditions of data integrity very explicit — they make data validation transparent and easy to reason about. Without explicitly encoded constraints, knowing what to expect is either loosely documented somewhere, or left as an implied property of some comparatively verbose data quality test. These are factors that give way to mistakes in testing, or to validation tests that go out of sync with schema changes.

On top of these potential benefits, the manner in which data constraints are encoded — using category-theoretic concepts — makes them suitable for the chase algorithm. We can leverage the chase algorithm to "repair" data which does not satisfy the modelled constraints. Given a chase engine, "we have a general means of propagating information implied by the properties of a model"⁹.
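
Before walking through the CQL screenshots, here is a toy picture of a single chase step (a deliberately naive sketch of the general idea, not CQL's chase engine; the table and column names are made up): a dependency says "for every client there must be a matching row in the new table", and chasing adds the missing rows, inventing placeholder ("labelled null") values where the constraint doesn't determine them.

t_client = [{"CLIENT_ID": "C1"}, {"CLIENT_ID": "C2"}]
t_new_table = [{"CLIENT_ID": "C1", "REPORTING_FLAG": "Y"}]

def chase_step(clients, new_table):
    # Dependency: for every c in T_CLIENT there exists a row in T_NEW_TABLE with
    # the same CLIENT_ID. Where one is missing, add a row with a labelled null.
    out = list(new_table)
    for i, c in enumerate(clients):
        if not any(r["CLIENT_ID"] == c["CLIENT_ID"] for r in out):
            out.append({"CLIENT_ID": c["CLIENT_ID"], "REPORTING_FLAG": f"?v{i}"})
    return out

print(chase_step(t_client, t_new_table))
# [{'CLIENT_ID': 'C1', 'REPORTING_FLAG': 'Y'},
#  {'CLIENT_ID': 'C2', 'REPORTING_FLAG': '?v1'}]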

The first screenshot lays out the data we'll be playing with in this example — two reference tables, T_CLIENT and T_CURRENCY, each containing two rows. Incidentally, the data is loaded into an in-memory database (H2) via the JDBC API.

The next step sees us loading the data from step 1, which had no special constraints attached, into a schema that does. Notice that we've added a new entity — T_NEW_TABLE. The data constraints that are imposed on the "ClientReferenceTables" schema are shown starting on line 53. The syntax is fairly self-explanatory: in the first block we require that T_NEW_TABLE contains all the clients that exist in T_CLIENT. In the second block, we simply have extra where conditions; for every row in T_CURRENCY, we must find a row in T_NEW_TABLE that satisfies all conditions in the where clause.

Here we are performing the actual load into ClientReferenceTables and naming the instance I1. Notice that T_NEW_TABLE is not being populated — this will become relevant in the next step.

Now we'll perform a "check" command to see if the constraints — which we've called "EDs" (embedded dependencies) — are satisfied. As we might expect, since T_NEW_TABLE is empty, the check fails, reporting what the failing triggers are.

Now let's add some dummy data to T_NEW_TABLE. We can do so by first importing the instance I1, and then generating some data through equational logic.

I2 now looks as follows:

Instance I2

As suggested in the opening, CQL allows us to run the chase algorithm on the basis of our specified data constraints. In the next step, we'll construct a valid instance (I3) by "chasing" the data constraints on the instance I2. While I3 is a valid instance, as the check command on line 95 will attest, we'll invoke one more category-theoretic operation in creating instance I4 — the quotient query. This is an operation used to identify equivalences within a dataset, allowing you to create a new dataset wherein certain linked elements are merged together.
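
The quotient idea itself is easy to picture outside CQL. In this simplified sketch (my own, with made-up rows), rows declared equivalent by a key are merged into a single representative, preferring real values over the placeholders introduced by the chase:

from itertools import groupby

rows = [
    {"CLIENT_ID": "C2", "REPORTING_FLAG": "?v1"},  # placeholder row from the chase
    {"CLIENT_ID": "C2", "REPORTING_FLAG": "Y"},    # dummy row added by hand
    {"CLIENT_ID": "C1", "REPORTING_FLAG": "Y"},
]

def quotient(rows, key):
    # Merge each equivalence class (rows sharing the same key) into one row.
    merged = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        best = {}
        for col in group[0]:
            real = [r[col] for r in group if not str(r[col]).startswith("?")]
            best[col] = real[0] if real else group[0][col]
        merged.append(best)
    return merged

print(quotient(rows, key=lambda r: r["CLIENT_ID"]))
# [{'CLIENT_ID': 'C1', 'REPORTING_FLAG': 'Y'}, {'CLIENT_ID': 'C2', 'REPORTING_FLAG': 'Y'}]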

What results is the following database, in which the new table has now been populated (1) prior to record-linking, and (2) post record-linking:

(1) Instance I3: the "chased" version of I2
(2) Instance I4: I3 post record-linking

As alluded to earlier, the check command now also checks out:

Universal data warehouse

The quotient operation also plays a prominent role in our final example. This one concerns a proposal to correct the "backward" and often failing way of doing data warehousing projects. A detailed case study can be found in [10] or [11]. The fundamental insight in the proposal is that risks of misalignment between reporting needs and the available data are not identified early enough in the traditional data warehousing approach.

Loosely speaking, we might compare data warehouse construction to tunnel construction, wherein the ends of the tunnel would represent source and target schemas — the target schema being a representation of the business need. The identified problem of traditional data warehouse construction would thus be analogous to misalignment issues in the construction of tunnels. Misalignment refers to deviations from the planned path or design of the tunnel, which is often a cause of increased costs, delays and also structural weaknesses.

In this vein, one of the key features needed to innovate data warehouse construction would be a kind of surveillance technology: one which tells us about the alignment between a source schema and a desired target schema. In this way, risks — e.g. the fact that the target schema may not be constructible from the available data — may be identified and mitigated early. "The traditional data warehousing approach makes the early identification of risk nearly impossible, because users typically cannot find out what can go wrong during data integration until they actually integrate data"¹⁰.

In a sense, this approach — if it doesn't impose, at least promotes — more discipline in architecting data pipelines. We have a tool that can help us avoid spending a great deal of effort trying to construct a schema whose integrity constraints cannot be satisfied by the available data.

The technology at the core of this feature is the aforementioned quotient operation. It can operate at both the schema level and the data level. Operating at the schema level, it generates an integrated schema which represents the best possible result of combining source schemas (e.g. a set of heterogeneous source databases) — "CQL computes the unique optimally integrated schema that takes into account all of schema matches, constraints, etc."¹⁰ — a universal schema, so to speak. I'll leave you to explore the inner workings elsewhere; suffice it to say that universal constructions are a common theme in category theory.

Finally, the following screenshot shows this idea of an integrated schema — "S" — generated by the quotient operation (the green arrow), thereby permitting an early "gap analysis" between S and the business's desired target schema, before any ETL work has even begun.

Outlook

Are we at a historical moment? A founder of Conexus, and progenitor of this paradigmatic approach, David Spivak, believes so. He diagnoses a desperate need for the channelling of information in our age. He likens the "uncontrolled flows of information" and data waste that we are grappling with to the engineering challenges of sewage in the streets and runoff in the rivers before the industrial revolution¹². "There are always these trends of 'what is the best way to store data', but the multiplicity of perspectives is not going away!"¹² We need to focus instead on how best to transform and integrate information¹². Category theory holds the promise to unlock a new paradigm for just that — "to do for data and information what Newton's calculus has done for physics"¹³.

[1]: Fong, B. & Spivak, D. I. (2019). An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press.
[2]: Martin, R. (2017). Clean Architecture. Pearson Education. https://elibrary.pearson.de/book/99.150005/9780134494333
[3]: Cheng, E. (2022). The Joy of Abstraction: An Exploration of Math, Category Theory, and Life. Cambridge University Press.
[4]: Informal Data Transformation Considered Harmful (2019)
[5]: https://youtu.be/cJ46AOEOc14?feature=shared
[6]: Functorial Data Migration (2013)
[7]: Ologs: A Categorical Framework for Knowledge Representation (2011)
[8]: https://categoricaldata.net/
[9]: https://blog.algebraicjulia.org/post/2022/06/chase/#ref-spivak2012
[10]: FinanceIntegration.pdf
[11]: https://conexus.com/cql-demo/
[12]: https://www.youtube.com/watch?v=fTporauBJEs&t=33s
[13]: https://dspivak.net/grants/NSF_IIS.pdf