How rigorous testing ensures systems withstand real-world pressures, featuring a deep dive into Ab Initio implementations.

Introduction

Often overlooked, or the first activity to be cut as deadlines approach (an all too common occurrence, unfortunately), testing is one of the most important parts of the software lifecycle, helping to ensure quality and adherence to the requirements.

Beyond unit, acceptance, integration and system testing, today we are going to look at a form of performance-related testing, taken from a real-world implementation using Ab Initio.

Stress Testing

What is it?

Stress testing is a method of pushing an interval's worth of data (for example, an hour, or a day) through a running system at a higher throughput than in the real production environment. In some sectors, such as finance, it can be a regulatory requirement.

Running production data through QA or development environments has implications for GDPR: it may not be appropriate, or legal, to copy that data to test or development teams without first masking or simulating the sensitive data. Whilst this is an important consideration and a topic in its own right, in this post we are demonstrating the mechanism for stress testing, so we will assume that such masking has already been performed.

Stress testing provides the ability to push an interval's worth of data (a number of hours, or a whole day) through a system at double, triple or higher rates, but it does not preclude replaying the data at the original rate to simulate a full time interval.

The intention, then, is to demonstrate that the system can handle unusual load levels without failing or producing incorrect responses.

The problems that must be solved are:

  1. Capturing the data
  2. Replaying the data in the original sequence, but at a higher rate
  3. Recording and comparing the two runs

Background

The system under stress testing in this example is a continuous system, processing incoming reference data and market depth (multiple levels of bid and ask prices, with quantities for each). This data contains a mixture of high-frequency updates (market data) and low-frequency updates (sanctioned and permitted instruments).

This data must be maintained at the same time as processing requests for prices throughout the day, based on the instantaneous values of the market depth.

The system has to respond to requests in under 500 ms under various load conditions, with a regulatory requirement to be able to do so accurately and consistently at double and triple the normal input rates across a whole day.

In order to replay a time period's worth of data (a number of hours, or a whole day) and still produce the same responses, the market depth data must be available at the corresponding time that it was available in the original run.

Below is an overview of the system under stress testing:

Here we can see that each topic from the message bus is served by a single reader process.

A message bus is a one-to-many model of distribution. The destination in this model is usually called a topic or subject. In the above diagram, each topic is a separate stream of data.

A reader process is a piece of code that reads a topic and processes the messages it receives into a format suitable to be processed by a server process.

A server process is a piece of code that receives messages and, in this case, stores the contents of those messages in a form that can be used for later retrieval by the request-processing component of the diagram above.

Each reader process sends its data to an associated server process. Each server process either stores, modifies and updates reference data, does the same for market depth, or processes requests.

We need to maintain the exact arrival time of each message across all topics, so that during stress testing each message is replayed in the original time sequence, even at faster rates.

Implementation

Capturing the Data

As each message from each topic is read by its reader process, that process can optionally persist a copy of the data along with a timestamp. As this can generate very large data sets, the option to write such files is turned off by default; it can be switched on by setting a simple flag, even whilst the process is running. Writing the files to disk whilst the process is running does not impact performance, due to the parallel nature of the processes.

The result of setting the flag is to create one data file per topic, containing data for the period of time that the flag is active. Clearing the flag stops the recording of data.
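As an illustration, the capture step can be sketched as follows. This is a hypothetical Python sketch, not Ab Initio code: the `CapturingReader` class, the one-JSON-record-per-line file format and the flag name are all assumptions made for the example.

```python
import json
import time


class CapturingReader:
    """Sketch of a reader process that can optionally persist each
    message, with its arrival timestamp, to a per-topic capture file.

    Hypothetical illustration only; in the real system the readers
    are Ab Initio graphs configured via psets.
    """

    def __init__(self, topic):
        self.topic = topic
        self.capture_enabled = False  # off by default; can be toggled at runtime

    def handle(self, payload, capture_file):
        record = {"topic": self.topic, "ts": time.time(), "payload": payload}
        if self.capture_enabled:
            # One JSON record per line, in arrival order.
            capture_file.write(json.dumps(record) + "\n")
        return record  # passed on to the server process as normal
```

Keeping the capture path a side effect of the normal read path means the recorded timestamps reflect actual arrival times, which is exactly what the replay step later depends on.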

In this Ab Initio implementation, all the reader processes are instances of a single graph, with each reader process configured using Ab Initio's parameter sets (psets).

Converting the Data Flow to Higher Rates

The data for each topic has been recorded into a series of files, one per topic. Each record in each file also contains a timestamp of when that message was received.

What is required now is a two-step process:

  1. Corral all the data from all the recorded data files into a single file, sorted on the timestamp, so that we can recreate the exact sequence of messages as they occurred in real time across all the topics.
  2. Calculate the time gap between each message.
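The two steps above can be sketched as a single merge pass, since each per-topic file is already in arrival order. This continues the hypothetical one-JSON-record-per-line format assumed earlier; the function name and record fields are illustrative.

```python
import heapq
import json


def merge_captures(capture_lines_per_topic):
    """Merge per-topic capture streams (each already sorted by arrival
    time) into one timestamp-ordered stream, attaching to each record
    the gap in seconds since the previous message across all topics."""
    streams = [
        (json.loads(line) for line in lines)
        for lines in capture_lines_per_topic
    ]
    prev_ts = None
    # heapq.merge performs an n-way merge of pre-sorted inputs.
    for rec in heapq.merge(*streams, key=lambda r: r["ts"]):
        rec["gap"] = 0.0 if prev_ts is None else rec["ts"] - prev_ts
        prev_ts = rec["ts"]
        yield rec
```

Because `heapq.merge` is lazy, a whole day's capture can be processed without loading every file into memory at once.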

Once we have the messages and the gap information, we can change the gap: halving each gap, for instance, doubles the throughput. Using this, it is now a simple matter to generate data files for different throughput levels.
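Rescaling the gaps is then trivial; a minimal sketch, assuming the per-record `gap` field produced by the merge step:

```python
def scale_gaps(records, factor):
    """Divide every inter-message gap by `factor`: factor=2 halves each
    gap (doubling throughput); factor=1 preserves the original rate."""
    for rec in records:
        # Copy each record so the original capture data is untouched.
        yield {**rec, "gap": rec["gap"] / factor}
```

Running this once per desired rate (factor 1, 2, 3, ...) yields the set of replay files for the different throughput levels.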

Replaying the Data

We've now created a number of data files, each containing data for the whole test period, with all messages in correct time order and a calculated gap between each message providing the higher-rate throughput.

All we need to do now is read that data file, maintaining the calculated gap between each message, and publish the messages to the appropriate server process. A single publishing process publishes each message at the required time intervals.
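A minimal replayer can be sketched as below; the `publish` callback, standing in for routing each message to the appropriate server process, is an assumption of the example.

```python
import time


def replay(records, publish):
    """Replay records in order, sleeping the (possibly rescaled) gap
    before each message, then handing it to the publish callback."""
    for rec in records:
        if rec["gap"] > 0:
            time.sleep(rec["gap"])
        publish(rec["topic"], rec["payload"])
```

Because the gaps were pre-computed across all topics, a single sequential loop is enough to preserve the original inter-topic ordering at any replay rate.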

Recording and Comparing the Data

Whilst all the code in the system is being tested at the higher throughput, in this example it is the output of the request-processing code that is of most interest. We need the output of this code, taken at the time the input data was originally recorded, to use as our baseline for stress testing at higher rates. It can be recorded at the same time as the input data, using the same flag mechanism.

However, this means having code in production whose only purpose is to record stress-test data, which may not be appropriate for performance reasons. The alternative, then, is to first run the stress-test data through at the original timings and record the code output to use as the baseline. The test can then be repeated at higher throughput rates and compared against that baseline.
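The comparison itself can be as simple as matching responses by request identifier; a sketch, where the keyed-by-request-id shape of the recorded output is an assumption:

```python
def compare_runs(baseline, stressed):
    """Return the request ids whose stressed-run response differs
    from, or is missing against, the baseline run."""
    return sorted(
        req_id
        for req_id, expected in baseline.items()
        if stressed.get(req_id) != expected
    )
```

An empty result means the system produced identical responses at the higher rate, which is exactly the property the regulatory requirement demands.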

Conclusion

Stress testing provides a means of performance testing under heavier-than-normal loads, using actual data from either the QA test environment or from production (with suitable consideration given to GDPR). In some sectors, such as finance, this kind of testing is a regulatory requirement. The approach described here is designed for continuous, real-time processing environments.

This stress-testing design pattern can be incorporated into a wide range of real-time and near-real-time systems, and we at synvert would be happy to provide a more in-depth demonstration or assist with your implementation.