The Journey of Data-Driven Transformation at a Leading Vehicle Manufacturer



In an era where data plays a crucial role within any company, Data Engineering has become one of the most important topics for industries looking to modernize and streamline their operations and decision-making processes. This is the story of our journey to a vehicle manufacturer in Europe, a company that sets the standard in vehicle excellence, where we were presented with a challenge.

Our challenge: to automate a number of manual tasks within their finance department in order to increase operational efficiency, improve data quality, reduce human error, and lay a solid foundation for the company to build on.

In this article we will go through:

  1. The Challenge: What is the problem?
  2. The Solution: What do we want to do?
  3. The Implementation: How do we want to do it?
  4. The Outcomes: What does success look like?

The Challenge

The finance department was spending extensive hours manually manipulating and validating data. Not only were its processes labor-intensive, they were also prone to errors, making financial reporting and analysis a time-consuming administrative challenge. Key challenges included:

  • Manual Extraction: The process of extracting data was fully dependent on employee actions, which could lead to delays and mistakes.
  • Inconsistent Data Quality: With no automated process for data cleaning and transformation, data quality was vulnerable to inconsistencies, leading to delays and affecting the accuracy of future financial reports.
  • Limited Scalability: The manual processes were not scalable, constraining the department’s capability to handle the increasing volume and complexity of the data.

Because this project was conducted for a company that is a prominent player in the vehicle distribution industry, there were several security considerations we needed to account for. One of them, which we would only find out later, was the highly confidential nature of the data; because of it, developing the solution remotely was simply not viable. In the end, we were asked to travel to their headquarters to sync with their team and fully deploy the solution, with the limitation of using their machines and tools.

The Solution

Our team’s solution was centered on automating the ETL (Extract, Transform, Load) processes using a data engineering pipeline tailored specifically to meet the unique needs of the finance department. The solution included several key components:

  1. Automated Data Extraction: Extract the data in a streamlined and automated fashion, ensuring a timely and error-free procedure.
  2. Data Validation: Develop a standardized data validation protocol that generates error reports and forwards them to the respective department. This primarily involved migrating the ‘hard-coded’ Excel formulas, which served as validations and required manual checks with each new Excel datasheet, to PySpark data manipulation methods (see the validation sketch after this list).
  3. Data Correction and Restructuring: Develop a suite of algorithms to clean and restructure the data into a consistent, standardized format. This step was essential to ensure data quality and reliability for downstream usage.
  4. Automated Data Loading: Automatically load the treated data into the database, shortening the time necessary for the data to become accessible.
  5. Task Scheduler: Working with monthly financial reports gave us the opportunity to build a scheduler that automatically retrieves the corresponding monthly datasheet on day X of each month. We also accounted for potential delays in the availability of these datasheets, implementing safety measures to ensure that no files would be skipped and that the data collection process could never reach a ‘deadlock’ (see the scheduler sketch below).
  6. The Deliverable: The project’s deliverable was to fetch data from a raw Excel file, load it into DataFrames using PySpark and SQL, and then check for errors, missing data, and typos. Once these issues were addressed, we uploaded the enhanced Excel file to their FileShare, allowing the financial team to review their data without quality concerns.
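
To make the validation step more concrete, the sketch below shows how ‘hard-coded’ Excel checks can be re-expressed as PySpark column expressions that split clean rows from an error report. It is a minimal illustration, assuming a CSV export of the datasheet and hypothetical column names (invoice_id, amount, cost_center) rather than the client’s actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("finance-validation").getOrCreate()

# Load the monthly datasheet. Reading Excel directly needs a plugin such as
# spark-excel, so this sketch assumes the sheet was first exported to CSV.
raw = spark.read.csv("raw/monthly_report.csv", header=True, inferSchema=True)

# Each former Excel formula becomes a named boolean expression
# (column names here are illustrative assumptions).
checks = {
    "missing_invoice_id": F.col("invoice_id").isNull(),
    "non_positive_amount": F.col("amount") <= 0,
    "unknown_cost_center": ~F.col("cost_center").rlike(r"^CC-\d{4}$"),
}

# Tag every row with the list of rules it violates.
error_flags = [F.when(cond, F.lit(name)) for name, cond in checks.items()]
validated = raw.withColumn(
    "errors", F.filter(F.array(*error_flags), lambda e: e.isNotNull())
)

# Clean rows continue to the correction, restructuring and loading steps;
# rows with problems go into the error report for the respective department.
clean_rows = validated.filter(F.size("errors") == 0).drop("errors")
error_report = validated.filter(F.size("errors") > 0)

(error_report
    .withColumn("errors", F.concat_ws("; ", "errors"))
    .write.mode("overwrite")
    .option("header", True)
    .csv("reports/validation_errors"))
```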
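
The scheduler’s ‘no skipped files, no deadlock’ behaviour can also be illustrated with a small catch-up loop: instead of fetching only the current month, the job walks every month since the last successfully processed one, stops (and retries on the next run) when a sheet is late, and caps the number of months handled in a single run. The file paths, state file, and function names below are illustrative assumptions, not the client’s implementation.

```python
import json
from datetime import date
from pathlib import Path

STATE_FILE = Path("state/last_processed.json")  # remembers the last month fully processed
MAX_MONTHS_PER_RUN = 12                         # safety valve so one run never loops forever


def next_month(month: date) -> date:
    """First day of the month following `month`."""
    return date(month.year + 1, 1, 1) if month.month == 12 else date(month.year, month.month + 1, 1)


def load_last_processed() -> date:
    """Read the checkpoint; on the very first run, start with the current month."""
    if STATE_FILE.exists():
        return date.fromisoformat(json.loads(STATE_FILE.read_text())["month"])
    this_month = date.today().replace(day=1)
    return date(this_month.year - 1, 12, 1) if this_month.month == 1 else this_month.replace(month=this_month.month - 1)


def save_last_processed(month: date) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps({"month": month.isoformat()}))


def datasheet_available(month: date) -> bool:
    """Hypothetical check that the monthly Excel sheet has landed on the FileShare."""
    return Path(f"fileshare/in/report_{month:%Y_%m}.xlsx").exists()


def process_month(month: date) -> None:
    """Placeholder for the ETL steps: extract, validate, correct, load."""
    print(f"processing {month:%Y-%m}")


def run_monthly_job() -> None:
    month = next_month(load_last_processed())
    target = date.today().replace(day=1)
    processed = 0
    while month <= target and processed < MAX_MONTHS_PER_RUN:
        if not datasheet_available(month):
            break  # sheet is late: stop and retry on the next run, so nothing is ever skipped
        process_month(month)
        save_last_processed(month)  # checkpoint each month so a crash never skips or repeats work
        month = next_month(month)
        processed += 1


if __name__ == "__main__":
    run_monthly_job()
```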

The Implementation Plan

The implementation phase was carefully planned to be delivered in stages to minimize any possible complications and ensure a smooth transition. Key steps included:

  • Requirements Evaluation: Identify the needs and constraints of the task at hand, and analyze the infrastructure currently in place to define a proper plan for integration.
  • Solution Design: Design a custom ETL pipeline and choose the right tools to achieve the established goals effectively.
  • Testing: Implement the solution in a controlled environment to assess its effectiveness and make the necessary modifications.
  • Support: Give clear instructions to the finance department staff on using the new system, and remain available to resolve any future issues.

The Actual Implementation

Preparation before arrival

Knowing that there were several infrastructure barriers to tackle once we arrived, and that we would essentially be going in blind, we approached the on-site week with some premeditated concepts. The backbone of the program was already prepared, including most of the data transformations that needed to be done. This allowed for more flexibility when it came to integrating our solution into the client’s architecture, as well as conducting the mandatory tests to ensure that everything ran smoothly.

With that said, all work also had to be performed on their machines; we had to communicate in advance all the tools we would need, so they could be vetted for potential security vulnerabilities that might threaten their system and swapped for alternatives if needed.

Arrival and Execution

The first step of the process depended on our ability to adapt our pre-made program to their system, which, we ended up discovering, involved moving Excel files to and from FileShare folders via Microsoft Access and validating the data through manual queries and checks. Documentation was lacking, so we frequently had to reach out to the client’s team (which at times was insufficient due to the extensive number of checks and the depth of knowledge required) and undertake exhaustive trials.

Figure 1: Original Data Lifecycle Process (a path diagram starting at a manual task and ending at a FileShare).

We faced two main setbacks, which required us to devise creative and unconventional solutions:

Integration with the pre-existing system

The first challenge was establishing a proper connection to their on-premise system, which relied on outdated technology, to ensure smooth integration with our data processing methods. Installing all the necessary tools to successfully deploy and run our program was not straightforward. Although we were using Python and SQL without any ‘fancy’ libraries, the required installations ran into obstacles, caused by their system’s versions and policies, that we needed to overcome.

Building the bridge between the new system and the pre-existing infrastructure highlighted the complexity of the task, particularly because of the company’s size, which involved numerous moving parts and connections. Additionally, the documentation was either nonexistent or outdated. In this context, interaction with the client-side team became not just advantageous but essential for the success of the project. It also ensured that the solutions developed were closely aligned with the actual needs and workflows of the client.

Handling Data Privacy and Security

Now, the second challenge. Given the sensitive nature of financial data, privacy and security were of high importance, and problems naturally arose in ensuring that the systems complied with data protection regulations. Addressing these challenges required extra effort from our team, which had to adapt the solution to the new project demands.

The Full Circle

Following up on our initial integration steps, this journey became even more unique when we realized that we had to relocate to their headquarters (as previously mentioned) to fully deploy the project. This move introduced several unpredictable variables for which we could not prepare (aside from maintaining an open mind and a fearless approach) due to the ‘black box’ nature of their architecture and software.

Furthermore, we found ourselves constantly in limbo due to a lack of permissions and compatibility issues between the versions they could install ‘in-house’ and the latest versions of the prerequisites needed for the project’s deployment. This unconventional scenario posed interesting challenges in understanding and adapting to the client-side architecture, and their dependency on an on-premise system further complicated the development of our solution. However, despite all of these obstacles, we managed to deliver by the set deadline.

Figure 2: Final Data Lifecycle Process (a path diagram starting at Excel and ending at a FileShare).

Conclusion

This journey of data-driven transformation at a leading vehicle manufacturer marks a significant milestone in leveraging technology to enhance efficiency and decision-making capabilities. It sets a precedent for further improvements as industries continue to embrace digital transformation, propelling the efficiency and quality of their services.

Our team’s experience with this project highlights the transformative potential of automation in revamping traditional business processes. It’s crucial to remember that, despite being in an era of digital transformation, the majority of the tasks that companies prioritize most highly still rely on ‘old-school’ technology. This underscores how important it is for data engineers to adapt swiftly to unexpected situations; adaptability is one of your greatest assets in aiding the migration of these systems into the ‘new world’ of data automation.

A special thank you to Cátia Antunes and Paulo Souza for their insightful collaboration in this project.