Companies nowadays must process vast quantities of data using Big Data technologies in tight timeframes. In addition to its enormous volume, this data must be processed in real time, so the architecture must be flexible, distributed, and scalable. This scenario is particularly relevant to the logistics sector, for example, where IoT (Internet of Things) devices monitor a series of indicators to help companies assess drivers’ behaviour for subsequent processing and analysis. In this blog post, we’ll explore some modern software solutions that allow businesses to design distributed and scalable systems that perform ETL (Extract, Transform, Load) operations in real time.

Scalable and Distributed Processing

A common bottleneck in data processing is the ability of systems to ingest, transform, and deliver raw data that is subsequently used to offer powerful insights into key aspects of the business. This bottleneck can be resolved by enhancing scalability (adding additional processing power) and distribution (spreading the data processing tasks across multiple systems); a system that is both scalable and distributed can balance load and enhance fault tolerance.

Kappa architectures are one of the software solutions that meet these needs and challenges within the Big Data sector. They consist of two layers: a stream processing layer that processes the data flowing into the system in real time, and a serving layer that presents the end user with the results of all this processing.

By definition, Kappa architectures are:

  • Distributed: A distributed system is one in which data and processing tasks are spread across multiple machines or nodes within a network. A distributed system offers benefits such as: 
    1. Fault tolerance: The ability of a system to operate reliably without interruption even if one of its parts fails.
    2. Load balancing: The capability of efficiently distributing the load of a task across the available nodes.
    3. Parallel processing: Tasks can be executed across multiple parts of the system, leading to faster data processing.
  • Scalable: A system’s ability to manage growing data volumes and increasing workloads. As data or processing requirements increase, the system can adapt and expand its resources to meet these new demands.

The use of a data processing architecture that implements the benefits presented in this blog post can already be seen in industry. For example, the transportation sector has benefited from employing this architecture to address its specific needs and challenges.

Today we’ll explore an architectural solution to this problem that can enhance data processing for businesses, and review three open-source technologies that can be integrated to achieve this improvement.

RabbitMQ

RabbitMQ is open-source messaging software that can be used as a real-time data queue. It was launched in 2007 and is continuously developed and updated by Pivotal Software. Operating as a message broker, RabbitMQ relays messages from a source to a processing machine reliably and efficiently. It can connect to various information sources, for example IoT devices, which are typically small sensors such as accelerometers or pressure sensors that can send large amounts of raw data.

RabbitMQ is scalable thanks to its ability to form clusters, joining multiple servers on a local network. These clusters work collectively to optimise load balancing and fault tolerance.

It also offers the added benefit of being compatible with multiple platforms and programming languages. For instance, users proficient in Python can use the Pika library to create message queues that enable RabbitMQ to route messages across different systems.
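As a minimal sketch of what this looks like, the snippet below uses Pika to publish a JSON-encoded sensor reading to a queue. The queue name, broker address, and message fields are illustrative assumptions rather than part of any specific deployment:

```python
import json

import pika

# Connect to a RabbitMQ broker (assumes a broker running on localhost
# with the default credentials).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare a durable queue for raw sensor readings (hypothetical name).
channel.queue_declare(queue="sensor_readings", durable=True)

# Publish one reading as JSON; a real IoT gateway would do this continuously.
reading = {"sensor_id": "accel-42", "timestamp": 1700000000, "value": 9.81}
channel.basic_publish(
    exchange="",
    routing_key="sensor_readings",
    body=json.dumps(reading),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)

connection.close()
```

A downstream consumer, such as the Flink job discussed next, would then read from the same queue and acknowledge each message once it has been processed.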

Apache Flink

Apache Flink is open-source software that provides libraries and a distributed processing engine, written in Java and Scala, for processing data streams. Flink is adept at handling large volumes of data, whether in batches or as a continuous stream, and can be configured as a cluster of processing units. Flink applications are fault-tolerant and offer the possibility of establishing a quality of service, such as exactly-once processing guarantees.

Flink can easily be configured to connect with messaging software like RabbitMQ and other information sources, enabling simultaneous data processing from multiple sources. What’s more, Flink facilitates the extraction of information from raw data through ETL techniques, subsequently outputting the results to other messaging software or storing the transformed, insightful data in a database.
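To give a flavour of the API, here is a minimal PyFlink sketch of the transform step. To keep it self-contained it reads from an in-memory collection rather than a RabbitMQ connector, and prints instead of writing to a database; the field names and the speed threshold are invented for illustration:

```python
import json

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in source: in production this would be a RabbitMQ connector
# feeding a continuous stream of raw sensor messages.
raw = env.from_collection([
    '{"sensor_id": "accel-42", "speed_kmh": 87}',
    '{"sensor_id": "accel-43", "speed_kmh": 112}',
])

# Extract: parse the JSON payloads. Transform: keep only readings above
# a (hypothetical) speed threshold and reshape them. Load: print as a
# stand-in for a database sink such as Cassandra.
(raw
    .map(json.loads)
    .filter(lambda reading: reading["speed_kmh"] > 100)
    .map(lambda reading: (reading["sensor_id"], reading["speed_kmh"]))
    .print())

env.execute("speeding_events")
```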

Cassandra

Apache Cassandra is a distributed, scalable, and highly available NoSQL database. It is a leading technology in its field, used to manage some of the world’s largest datasets in clusters that can span multiple data centres. The most common use cases for Cassandra include product catalogues, IoT sensor data, messaging and social networks, recommendations, personalisation, fraud detection, and other applications such as time series data.

A Cassandra cluster comprises a group of nodes that store data in a way that optimises both reading from and writing to the database, ensuring even distribution across the nodes within the cluster. Data storage is managed through a primary key, which acts as a unique identifier for each piece of data. The primary key in Cassandra is made up of a partition key and a clustering key. The partition key determines which database server contains our data, whilst the clustering key organises this data within each server, thereby optimising storage for easier retrieval later.
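As a sketch of how this looks in practice, the CQL below (sent through the DataStax Python driver) defines a table whose partition key is the sensor and whose clustering key is the reading time; the keyspace, table, and column names are all illustrative assumptions:

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node; contact points are illustrative.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# sensor_id (the partition key) decides which node stores a row;
# reading_time (the clustering key) orders rows within that partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS telemetry.readings (
        sensor_id text,
        reading_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

session.execute(
    "INSERT INTO telemetry.readings (sensor_id, reading_time, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("accel-42", 9.81),
)

cluster.shutdown()
```

With this layout, all readings for a given sensor live on the same partition and are already sorted by time, so a query like “the latest readings for sensor X” is cheap.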

Real Use Case

An example of an implementation that uses the Kappa architecture to process data in real time from IoT devices can be seen in this thesis, where a data stream emitted by thousands of IoT sensors fitted to a fleet of delivery vehicles is processed in real time. Before implementing a distributed architecture, the company had to wait a whole day before it could process all the information coming from these IoT sensors to obtain statistics on their drivers’ performance. After implementation, they can obtain these statistics immediately thanks to the real-time data stream processing engine.

An additional problem that can be solved by a distributed architecture is scalability, and this was also the case in this implementation: the original processing system would not have been able to deal with a significant increase in sensor data, but now the company can, simply by adding additional processing power when necessary.

Conclusion

As we’ve seen, the technologies and the Kappa architecture presented in this blog post give companies the ability to enhance their data extraction process. These technologies can deliver more insights from data at a quicker and more efficient pace, but they also come with a higher degree of complexity. The following lists summarise the advantages and disadvantages of a scalable and distributed architecture:

Advantages

  • Speed and Performance: Parallel computing can significantly improve speed and performance.
  • Scalability: Distributed computing allows you to scale by adding more machines or nodes as needed.
  • Fault Tolerance: Even if one machine fails, the system can continue to operate.
  • Resource Utilisation: Distributed computing makes efficient use of multi-core processors.
  • Cost-effectiveness: Distributing workloads across multiple low-cost machines can be more cost-effective than investing in a single high-end machine.

Disadvantages

  • Code Complexity: Parallel code is significantly more complex.
  • Communication Overhead: Communication between nodes can introduce latency and overhead.
  • Architecture Complexity: There are challenges like load balancing, data consistency, and fault tolerance.

A distributed and scalable Kappa architecture offers the right tools to deal with modern Big Data processing. What’s more, a scalable system allows you to provision resources as necessary, reducing infrastructure costs during periods of lower demand. A flexible, distributed, and scalable architecture gives companies the ability to process huge amounts of data in a limited amount of time and to deliver insightful performance indicators derived directly from raw data.

A significant data transformation to an architecture like this is an intricate process and requires a lot of thought and meticulous preparation. For further information and guidance, take a look at this dedicated previous blog post. Should you have any questions, don’t wait a minute longer – just contact us and we’ll be happy to help!