Central Repository of CVs with Textract and Neptune



Working at a consultancy has the advantage of participating in several projects, where different technologies are used to solve a variety of use cases.
This broadens our perspective on problem solving, which is great, but over time it becomes hard to keep up with everything our colleagues have worked on, because we are in different teams and projects, or simply because we are focused on our own customers and don’t have much time to chat.
But since we have a knowledge-sharing culture at Data Insights, we thought about possible solutions. One was already covered by my colleague Hsiao-Ching in this blog post. Here we present another one, which can be seen as complementary.
Since our CVs already contain all the necessary information, we need a process to read them and cluster their content into relevant groups, such as the technologies used, the customers we have worked with so far, etc. After that, we can store the result in our database of choice and query it later.

For this process, Amazon Textract is a natural choice to extract the data from the CVs, which are already in PDF format (Textract also supports image formats such as PNG, JPEG, and TIFF). The first part of our process is to upload the files to S3. From there, a Lambda function is triggered to start the Textract job.
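The upload-and-trigger step can be sketched as a small Lambda handler. This is a minimal sketch rather than our exact function: it assumes the Lambda is subscribed to the bucket's object-created events and uses Textract's asynchronous `start_document_analysis` call with the `FORMS` feature. The function names and the bucket and key in the usage note are hypothetical.

```python
import urllib.parse


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def build_textract_request(bucket, key):
    """Build the arguments for Textract's asynchronous document analysis."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS"],  # form (key-value) extraction
    }


def handler(event, context):
    """Lambda entry point: start one Textract job per uploaded CV."""
    import boto3  # imported lazily so the helpers above work without the AWS SDK

    textract = boto3.client("textract")
    job_ids = []
    for bucket, key in extract_s3_objects(event):
        request = build_textract_request(bucket, key)
        response = textract.start_document_analysis(**request)
        job_ids.append(response["JobId"])
    return {"job_ids": job_ids}
```

For an upload of `cvs/jane doe.pdf` to a bucket named `cv-uploads`, `extract_s3_objects` yields `("cv-uploads", "cvs/jane doe.pdf")`, and the handler returns the job ID that is later used to fetch the analysis result.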

Using the Forms extraction feature, Textract automatically detects the information that the CVs have in common about each person, such as the profile, customers, and technologies used, and returns it as key-value pairs. After this, the function stores the result in another S3 bucket in CSV format.
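To illustrate what "key-value pairs" means in practice: Textract returns a flat list of blocks, in which `KEY_VALUE_SET` blocks point to their value and to their words through relationships. The sketch below shows one way to turn that list into a dictionary. It assumes the blocks have already been fetched (for example with `get_document_analysis`), and the helper names are ours, not part of the Textract API.

```python
def text_of(block, blocks_by_id):
    """Join the words (and checkbox states) that are children of a block."""
    parts = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks):
    """Map the text of each detected form key to the text of its value."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        values = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    values.append(text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[text_of(block, blocks_by_id)] = " ".join(values)
    return pairs
```

A CV section labelled "Technologies:" would then show up as a dictionary entry whose key is the label and whose value is the text next to it, ready to be written out as CSV.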

For the second part of the process, we use Amazon Neptune, as it lets us query the data in terms of its relationships, for example: “give me all the colleagues who have worked with this technology or with this customer”.
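Such a question maps naturally onto a Gremlin traversal. The sketch below builds the query text in Python. It assumes a hypothetical graph model with `person`, `technology`, and `customer` vertices that carry a `name` property, and a `worked_with` edge pointing from a person to a technology or customer. The post does not fix these labels, so treat them as placeholders.

```python
def gremlin_string(value):
    """Quote a Python string as a single-quoted Gremlin string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def colleagues_linked_to(label, name):
    """Build a traversal for people with a 'worked_with' edge to a vertex."""
    return (
        f"g.V().has({gremlin_string(label)}, 'name', {gremlin_string(name)})"
        ".in('worked_with').hasLabel('person').dedup().values('name')"
    )


# Same traversal shape for a technology and for a customer.
print(colleagues_linked_to("technology", "Amazon Neptune"))
print(colleagues_linked_to("customer", "Example Corp"))
```

The resulting text can be pasted into a Gremlin notebook cell or sent to the cluster's Gremlin endpoint. The traversal starts at the technology or customer vertex and walks the incoming edges back to the people, which is the kind of relationship query a graph database is designed for.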

To ingest the data into Neptune, another Lambda function is triggered when the result from Textract arrives in S3. The ingestion is possible because the CSV data is in the Gremlin load data format (other formats are also supported). This lets us ingest the data as a bulk load instead of sending one insert at a time, and by checking the status of the load job we can see how many files failed to be ingested and need more preprocessing.
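As a rough illustration of this step, the sketch below shows the shape of Gremlin load-format CSV files and of a bulk-load request to the Neptune loader endpoint. Several things are assumptions: the vertex and edge labels are hypothetical, the Lambda is assumed to run inside the cluster's VPC, IAM database authentication is assumed to be disabled (otherwise the request would need to be signed), and the status helper assumes the `feedCount` layout described in the loader's Get-Status documentation.

```python
import csv
import io
import json
import urllib.request

# Vertex files need ~id and ~label; edge files need ~id, ~from, ~to, ~label.
VERTEX_HEADER = ["~id", "~label", "name:String"]
EDGE_HEADER = ["~id", "~from", "~to", "~label"]


def gremlin_csv(header, rows):
    """Render rows as CSV text under a Gremlin load-format header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_load_request(source_uri, role_arn, region):
    """Build the JSON body for POST https://<endpoint>:8182/loader."""
    return {
        "source": source_uri,  # S3 prefix that holds the CSV files
        "format": "csv",  # Gremlin CSV format
        "iamRoleArn": role_arn,  # role Neptune assumes to read from S3
        "region": region,
        "failOnError": "FALSE",  # keep loading the other files after an error
    }


def start_bulk_load(endpoint, body):
    """Submit the load job and return its loadId (needs access to Neptune)."""
    request = urllib.request.Request(
        f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["payload"]["loadId"]


def failed_feed_count(status):
    """Count failed files in a GET /loader/<loadId> status response."""
    feeds = status["payload"]["feedCount"]
    return sum(entry.get("LOAD_FAILED", 0) for entry in feeds)
```

A vertex file built with `gremlin_csv(VERTEX_HEADER, [["p1", "person", "Jane Doe"]])` starts with the header line `~id,~label,name:String`. One bulk-load request covers every file under the S3 prefix, and the status response then tells us which of them need more preprocessing.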

Once the data is loaded, we can use a SageMaker notebook to easily identify who can support us whenever we have a question about a specific topic, find common interests, or check which projects we can use as references for other ones.

Finally, note that this solution is completely serverless: we don’t manage any infrastructure, and the process starts automatically as soon as a file is uploaded. As a next step, we can further clean the information provided by Textract in order to produce a standardized set of values.