Event-Driven Architecture



As engineers, we are sometimes faced with architectural challenges when trying to anticipate the long-term needs of a business. This is especially tough in a hyper-scaling environment, where we often need to plan for the unknown: all we know is that the business is growing at a very fast pace, which means we must prepare for unforeseen challenges. In this type of engineering environment, there are a couple of options we can follow to structure our architecture and teams. In this article we are going to delve deeper into one of the possible patterns: event-driven architecture.

In this architecture, a system reacts to specific events or changes in the environment and triggers corresponding actions. The system is built around a set of loosely coupled services that communicate with each other through events. The services are isolated from one another, which allows them to evolve and scale independently. This allows for a highly scalable and responsive system, where services can react to changes and events in real time.
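To make the idea concrete, here is a minimal in-process sketch of publish/subscribe communication between loosely coupled services. The service names and the `OrderCreated` event are hypothetical, for illustration only; in a real system the bus would be a managed service such as EventBridge rather than a dictionary.

```python
from collections import defaultdict

# Registry of handlers per event type: the "bus" in this toy sketch.
subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    # The producer does not know (or care) who is listening.
    for handler in subscribers[event_type]:
        handler(payload)

received = []
# Two independent "teams" react to the same event.
subscribe("OrderCreated", lambda e: received.append(("crm", e["order_id"])))
subscribe("OrderCreated", lambda e: received.append(("logistics", e["order_id"])))

publish("OrderCreated", {"order_id": "o-123"})
```

The key property is that adding a new consumer requires no change to the producer, which is what lets new teams join the communication channels later.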

This pattern shines brightest in distributed tech environments, where responsibilities are spread across multiple teams that need to communicate with each other, while also leaving room for future teams to join these communication channels. That makes it ideal for a hyper-scaling company, where the future is not always fully predictable and where we have to plan for a larger, and not always known, dimension.

Our scenario

In this article we will talk about one of our past experiences: an engineering ecosystem of over 400 engineers working on the same product, a large-scale e-commerce platform with a huge logistics and operations component to acquire, prepare, re-sell and deliver its products, all powered by internal- and external-facing web applications that integrate with multiple third-party services. The engineers working on this project were split into vertical teams of around 10 people each, with each team fully owning the pieces of software they worked on, from conception all the way down to development and maintenance.

One of these teams was responsible for managing the life cycle of third-party contracts between the client and the company, namely extended warranties and insurance. Managing these contracts meant synchronizing information with multiple other external teams; one example would be informing the CRM team to handle customer communication when these contracts change state.

One good practical application of event-driven communication is this cross-team sync, especially when it is not time-critical. After a customer purchases a product, there are multiple stages before it gets delivered, so most of the work needed to fulfil the order can be done asynchronously. In fact, since some of these stages involve third-party work, it would sometimes actually be beneficial to introduce some delay, to ensure systems are able to create their artefacts and have them ready for others to pick up. As such, multiple teams (e.g. finance, logistics, operations) would be working in parallel on different verticals to move the order to a finalized state.

In this specific case, the solution was fully architected on AWS to provide the cloud computing capability it required. We used Amazon EventBridge to receive, filter, transform, route, and deliver events between different applications and teams, Lambda functions to consume these events, and CloudWatch for monitoring and alerting.
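As a rough sketch of this setup, an EventBridge entry and a Lambda consumer might look like the following. The source, detail-type, and field names are hypothetical; an entry like this would actually be published with `boto3.client("events").put_events(Entries=[entry])`, which we only describe here rather than call.

```python
import json

# Shape of an event entry as submitted to EventBridge (names are illustrative).
entry = {
    "Source": "orders.service",
    "DetailType": "OrderCreated",
    "Detail": json.dumps({"order_id": "o-123", "total": 49.90}),
    "EventBusName": "default",
}

def lambda_handler(event, context):
    # EventBridge delivers the event to the Lambda with "detail" already
    # deserialized into an object.
    detail = event["detail"]
    return {"processed_order": detail["order_id"]}

# Simulate delivery the way EventBridge would hand it to the Lambda.
delivered = {"detail-type": "OrderCreated", "detail": json.loads(entry["Detail"])}
result = lambda_handler(delivered, None)
```

EventBridge rules would then filter and route entries by `Source` and `DetailType`, so each team's Lambda only receives the events it subscribed to.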

Advantages over synchronous API calls

One of the biggest advantages of events is that we can easily track the entire life cycle of an entity (usually orders), including all actions and systems it went through. This provides a clear and easy-to-understand view of the entity's progress.

As mentioned previously, most of what we needed to do was not time-critical, so synchronous API calls were not needed, or even ideal: a team would not block the progress of another team that did not directly depend on it, and each team would wait only for exactly what it needed. One advantage of this is that if one of the teams required more time to handle specific operations (like waiting for third-party services) or to recover from a failure while processing the event on their end, it would not block other teams from doing what they needed to do, and all this work could run in parallel.

Additionally, it is very simple to re-trigger the same event globally in case something goes wrong, or to re-run the Lambda processing the event in a specific team that might have had a temporary issue. For this to work, a few conditions have to be met, for example:

  • Each listener has to be able to process the event at any given time;
  • Each team has to implement fault tolerance, logging any errors that occur while processing an event so the team can react and fix any issues;
  • Each event processor has to ensure idempotency, being able to detect and ignore duplicate events.
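The idempotency condition above can be sketched as a simple deduplication check keyed on a unique event id. This is an illustrative minimum; in production the set of seen ids would live in a durable store (for example a DynamoDB table with a conditional write), not in memory.

```python
# Ids of events already handled; a stand-in for a durable store.
processed_ids = set()

def handle_event(event, results):
    # Idempotency: detect and ignore duplicate deliveries by event id.
    if event["id"] in processed_ids:
        return False  # duplicate, safely ignored
    processed_ids.add(event["id"])
    results.append(event["payload"])  # the actual side effect
    return True

results = []
handle_event({"id": "evt-1", "payload": "create-contract"}, results)
# A redelivery of the same event (e.g. after a global re-trigger) is a no-op.
handle_event({"id": "evt-1", "payload": "create-contract"}, results)
```

Because duplicates are harmless, operators can re-trigger events globally without worrying about which consumers already processed them.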

This architecture greatly contributes to the stability and scalability of the project, as each team can listen to the events they need and trigger their actions accordingly. This decoupling gives teams more freedom and makes it easier to maintain, improve and extend the system as a whole.

The following pictures exemplify the same flow for a created order in synchronous and asynchronous architectures.

Figure 1: An overview of synchronous architecture for when an order is created.

The image above shows an example of a synchronous architecture for when an order is created. We can see that the Orders team takes on the task of informing every other team, which concentrates a big delay and workload in a single team. Now let's take a look at the same example using an event-driven architecture.

Figure 2: An overview of asynchronous architecture.

As you can see in the image above, the asynchronous example allows multiple teams to do their work independently of each other. The Orders team is not forced to wait for them to finish and does not need to worry about who to contact, greatly simplifying its workload and business logic.

Problems faced

Of course, no architecture is perfect, and one of the main issues we had while working with event-driven architecture was missing or poor event versioning, together with unexpected changes that were not properly communicated or well thought through. When changing an event, it is important to ensure those changes do not break existing consumers. This can be achieved by maintaining backward compatibility and providing appropriate migration paths for existing consumers. If a given event has to change drastically, potentially breaking consumers, it should be versioned, either by including a version number in the event payload or by creating a new event type altogether.
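A consumer can support such a version number by dispatching on it, so old producers keep working while new ones roll out. The field names below (`price` renamed to `total_price` between versions) are hypothetical, purely to illustrate the pattern.

```python
def handle_order_created(event):
    # Treat a missing version field as v1, the original payload shape.
    version = event.get("version", 1)
    if version == 1:
        total = event["price"]           # hypothetical v1 field name
    elif version == 2:
        total = event["total_price"]     # hypothetical v2 rename
    else:
        raise ValueError(f"unsupported OrderCreated version: {version}")
    return {"order_id": event["order_id"], "total": total}

# Both payload generations are handled by the same consumer.
v1 = handle_order_created({"order_id": "o-1", "price": 10.0})
v2 = handle_order_created({"order_id": "o-2", "version": 2, "total_price": 12.5})
```

Failing loudly on an unknown version is deliberate: a silently misread payload is much harder to detect than a logged error.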

Another big problem was understanding exactly what kind of events specific teams needed to listen to. For example, for a long time one of our teams had issues with third-party contracts that were not automatically cancelled when an order was cancelled or returned. This was caused by a failure to listen to specific events related to order cancellation, and it went undiscovered for some time before it was detected and fixed.

To help mitigate this issue, we had a company-wide repository that held information on all events every team was emitting, as well as who was consuming those events. All these events were listed with example payloads, so everyone could be aware not only of the events being emitted but also of the data they carried. Each team would list themselves as consumers of events so that, in case of changes, they would be informed. We also had RFCs for changes that would be critical for a lot of other teams, as well as planned sunsets for events that were going to be deprecated, giving teams time to start using the new ones.

These solutions are not perfect by themselves; after all, people being people, someone will forget to update this kind of data. One way to potentially fix this would be to set up alerts triggered whenever an event that is not properly documented in the catalogue is fired, notifying a specific team that would then be responsible for asking the event's producer to update the event data in the catalogue.
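The catalogue check behind such an alert could be as small as the sketch below. The catalogue contents and event names are hypothetical; in practice the check might run in a Lambda subscribed to all events, pushing its findings to a CloudWatch alarm or a chat channel.

```python
# Hypothetical set of documented event types, loaded from the catalogue repo.
catalogue = {"OrderCreated", "OrderCancelled"}

def check_event(event_type, alerts):
    # Flag any event that fires without being documented in the catalogue.
    if event_type not in catalogue:
        alerts.append(f"undocumented event: {event_type}")

alerts = []
check_event("OrderCreated", alerts)   # documented, no alert
check_event("OrderReturned", alerts)  # missing from the catalogue
```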

Fat vs thin events

Fat and thin events refer to the amount of data contained in an event. Fat events are usually self-sufficient and contain a large amount of data, while thin events contain minimal data, which means that listeners will most likely need to contact other teams for the necessary extra details.
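The contrast is easiest to see side by side. Both payloads below are hypothetical shapes for the same order; the thin version keeps only identifiers and a total, leaving customer and item data behind the owning teams' APIs.

```python
# Fat: self-sufficient, carries everything any consumer might want.
fat_event = {
    "type": "OrderCreated",
    "order_id": "o-123",
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},  # PII!
    "items": [{"sku": "TV-55", "price": 499.0}, {"sku": "HDMI", "price": 9.9}],
    "total_price": 508.9,
}

# Thin: identifiers only; consumers fetch the rest on demand.
thin_event = {
    "type": "OrderCreated",
    "order_id": "o-123",
    "total_price": 508.9,
}
```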

Initially, most, if not all, events in the company were fat events, as teams had not yet built the endpoints needed to fetch the information a thin event structure would leave out. This allowed for the quick delivery of new features. However, it can lead to some issues, as we enumerate below:

  • Performance
    Fat events can have a negative impact on performance, as they increase the amount of data that needs to be transmitted and processed, potentially leading to delays and slowing down the entire system.
  • Security
    This was only minor in our scenario, but it was still an issue, as teams ended up being able to access data that was out of their remit, opening the way for potential security breaches.
  • Versioning
    Versioning would also be much more challenging, as the data structures would constantly be changing in such a fast-growing business. As time went by, these changes kept getting harder and harder to manage without introducing breaking changes.

As we grew, our fat events became harder and harder to manage, and we eventually started moving towards thin events. Over time, we replaced our initial catalogue of events with new versions that would send only the bare minimum data, including identifiers that teams could use to fetch additional information if needed. This made the events much easier to manage, and security was less of an issue, as access could now be controlled when teams requested additional data from the respective systems that owned it.

As an example, in the OrderCreated event we no longer send all the product details, as most consumers won't require them; we simply reduce the payload to an order identifier and the total price. When consuming this event, the add-ons team would request additional information about the order items from the logistics team, enabling them to create contracts with third-party systems like insurance companies.
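That enrichment step could look roughly like this. The `fetch_order_items` function and the `warranty_eligible` field are hypothetical stand-ins for an authenticated call to the logistics team's API; only the overall flow matches what we did.

```python
def fetch_order_items(order_id):
    # Stand-in for an authenticated HTTP call to the logistics team's API.
    return [
        {"sku": "TV-55", "warranty_eligible": True},
        {"sku": "HDMI", "warranty_eligible": False},
    ]

def handle_order_created(event):
    # The thin event only carries the order id; fetch the details we need.
    items = fetch_order_items(event["order_id"])
    # Create third-party contracts only for eligible items.
    return [item["sku"] for item in items if item["warranty_eligible"]]

contracts = handle_order_created({"order_id": "o-123", "total_price": 508.9})
```

The extra round trip is the price of the thin event; in return, the logistics team controls exactly who can read the item data and what shape it takes.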

The biggest advantage of this is that we can easily control who has access to which information, which is critical for clients' PII (Personally Identifiable Information), and we reduce the amount of business logic transferred in the event payloads. This way, most changes to the data structure do not require a new version; after all, we can easily add more information to an event without breaking anything.

But thin events mean that teams are required to contact other teams for the specific information they need, which is not always possible. As such, not all events were thin events, and we currently use a mixture of both, depending on the results of discussions among the teams involved.

Final thoughts

In conclusion, event-driven architecture is a powerful approach to building distributed systems, but it requires careful planning, design, and implementation to be successful. By understanding the challenges and using the right tools and strategies, you can ensure that your architecture is highly scalable, resilient, and responsive, and serves as a tool to help your organization scale faster.

We hope this article shed some light on some of the things to keep in mind when implementing this type of architecture at a hyper-scaling company, and that it can aid you in your future developments.

Thank you for reading.