Get some data insights from Synvert based on Graph Data­base Neo4j



Traditional relational databases have dominated data modeling for the past decades. However, NoSQL (“Not only SQL”) databases, and graph databases in particular, have been attracting attention in the last few years. In this blog post, we introduce a small internal application that explores data insights within Data Insights, built on one of the most prominent graph databases: Neo4j.

Back­story

At the end of last year (2020), Data Insights hos­ted the first internal data com­pet­i­tion. The scen­ario was pro­posed by our col­league Marin (who can always be depended on for inspir­ing ideas). This time, Marin sug­ges­ted explor­ing and devel­op­ing some data-driven applic­a­tions based on the data we have on our internal Slack channels.

Therefore, my colleagues Christian, Mario, and I started brainstorming: which data-driven applications could be useful, and which problems could they solve? We immediately thought of the tedious quarterly updating of personal CVs, because sometimes we simply forget what we have been doing for the last quarter. Another potential application is finding the right person to approach when we are stuck on a very specific technical problem.

To solve these annoy­ances, we decided to build a small pro­to­type of the cent­ral Data Insights Data­base based on our internal Slack mes­sages. The pro­posed solu­tion enables us to have:

  • A clear pic­ture of each person’s profile
  • A big pic­ture about how our pro­files are linked with one another
  • Iden­ti­fied domain experts for vari­ous technologies
  • Rough ideas about what we have done over a given period (weeks, months, quarters, even years)
  • The buzzwords we are talk­ing about

This is the rough proposal for our internal project LinkeDIn (Data Insights’ version of LinkedIn). In this blog post, I would like to explain how we approached this problem and the tech stack we used, with a focus on the graph database (not so surprising given the title of this blog post :p).

Graph Data­base

What is a Graph Database?

Traditionally, data are stored in row-and-column tables in a relational database. A data record is a row in a table, and the table can have multiple columns (attributes). However, it is wasteful and chaotic to store everything in just one table, so there are usually multiple tables for different categories (for example customers, orders, items, etc.). Under this architecture, if we would like to ask a question (for example, which customers have bought which items), we need to join different tables (customers and items) to retrieve the results.

The idea of a graph database is different. Graph databases focus on the links between data, in other words, on how the data are connected. Therefore, in a graph database, the data are stored as Nodes (e.g., Customers, Items, Orders, etc.) and Relationships (e.g., BOUGHT, ORDERED, etc.), which capture how the data are linked. Besides, we can also add Properties to the nodes, for example, first_name and last_name for the Customer. Of course, we can also add Constraints to the graph database, for example to state which properties must exist.
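To make the vocabulary of nodes, relationships, and properties concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the data model, not how Neo4j stores data internally; the class names and the sample customer and item are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # e.g. "Customer" or "Item"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str                                # e.g. "BOUGHT"
    end: Node


# Hypothetical sample data, invented for this sketch.
alice = Node("Customer", {"first_name": "Alice", "last_name": "Example"})
book = Node("Item", {"title": "Graph Databases"})
relationships = [Relationship(alice, "BOUGHT", book)]


def bought_items(customer, rels):
    """Follow BOUGHT relationships outward from a customer node."""
    return [r.end for r in rels if r.start is customer and r.rel_type == "BOUGHT"]


def buyers_of(item, rels):
    """Read the same relationships in the opposite direction, from item to customers."""
    return [r.start for r in rels if r.end is item and r.rel_type == "BOUGHT"]
```

Both questions are answered from the same list of relationships; only the direction in which each relationship is read changes.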

Why a Graph Database?

Now that we have a rough idea about what a graph data­base looks like, the next ques­tion is: why do we need a graph data­base? As men­tioned above, to query the res­ult of a com­plex ques­tion in a rela­tional data­base, we might have to do mul­tiple joins of tables or index look­ups. This kind of oper­a­tion could be quite expens­ive. How­ever, in a graph data­base (for example, Neo4j), the data­base engine just needs to fol­low the point­ers to nav­ig­ate the data.

Besides, a graph database can answer a reversed question as fast as the original one, which is usually not the case in a relational database. For example, “which items did this person buy?” or, the other way around, “who bought this item?”

Query latency (how long a query takes to run) in a relational database depends heavily on the size of the data. In a graph database, on the other hand, it is proportional to the length of the path traversed in the graph.

Unlike a relational database, where the schema needs to be explicitly defined when creating a table, a graph database is schema-less. Constraints only need to be added when required as the data grows. Since constraints serve as the schema, this makes schemas in graph databases more flexible.

As previously mentioned, the value of graph databases lies in the links between nodes, which includes indirect connections as well. To explore the indirect connections, we simply follow the paths in the graph. This characteristic brings some astonishing advantages, for example, uncovering hidden patterns (people may share interests with the friends of their friends).
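The friends-of-friends idea can be sketched as a hop-limited breadth-first search. This is a plain Python illustration of following paths, under the assumption of a small, made-up friendship graph; a graph database performs this kind of traversal natively.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: hop_count} for nodes within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # the start node is not its own connection
    return distances


# Hypothetical friendship graph, stored as adjacency lists.
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cem"],
    "cem": ["bob", "dee"],
    "dee": ["cem"],
}

# People exactly two hops away from ann: friends of her friends.
friends_of_friends = [p for p, d in within_hops(friends, "ann", 2).items() if d == 2]
```

The work done depends on how many nodes lie within the hop limit, not on the total size of the graph.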

All these intriguing prop­er­ties provide a basis for mul­tiple graph data sci­ence applic­a­tions. These include using a machine learn­ing approach for fraud detec­tion, clas­si­fic­a­tion, pat­tern recog­ni­tion, recom­mend­a­tion sys­tems, graph embed­dings, and so on.

Work­flow

After this (short) introduction to graph databases, let’s go back to our data-driven application. Let’s first get the big picture of the workflow, which consists of five steps:

  • Slack Scraper: scrape data from the Slack channels
  • Data Preprocessing: extract and prepare the data we are going to use
  • Named Entity Recognition (NER): retrieve the terms that we are interested in from the data
  • Graph Database: insert the data, which includes nodes and relationships, into the Neo4j database
  • Cypher Query: use Cypher queries to analyze the data in the Neo4j database

Slack Data

Now we are familiar with graph databases, and we have also seen the workflow of the application, but what about the data? Which data should we collect to solve the problems described above? Well, luckily, at Data Insights, almost everybody writes down their weekly update in the Slackbot thread, and this is exactly the data we intend to extract. For example:

[Figure: example of a weekly update in the Slackbot thread]

And the raw data in JSON format from the Slack channel looks like this:

[Figure: raw Slack message data in JSON format]

After pre­pro­cessing, map­ping the user name, and extract­ing the mes­sages we are inter­ested in, we get the pro­cessed data resem­bling the following:

[Figure: processed message data]
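The preprocessing step can be sketched roughly as follows. This is a minimal sketch and not our actual pipeline: the sample messages are invented, the field names (`type`, `subtype`, `user`, `ts`, `text`) follow the usual shape of Slack message JSON, and the `user_names` mapping is hypothetical.

```python
import json

# Made-up sample in the shape of Slack message JSON; not real data.
raw = json.loads("""
[
  {"type": "message", "user": "U001", "ts": "1609459200.000200",
   "text": "Weekly update: migrated the SQL jobs to Amazon Web Services"},
  {"type": "message", "subtype": "channel_join", "user": "U002",
   "ts": "1609459300.000100", "text": "<@U002> has joined the channel"}
]
""")

# Hypothetical mapping from Slack user IDs to readable names.
user_names = {"U001": "hsiaoching", "U002": "mario"}


def preprocess(messages, names):
    """Keep ordinary user messages and replace user IDs with names."""
    processed = []
    for msg in messages:
        if msg.get("type") != "message" or "subtype" in msg:
            continue  # skip channel joins and other system events
        processed.append({
            "author": names.get(msg["user"], msg["user"]),
            "timestamp": msg["ts"],
            "text": msg["text"],
        })
    return processed


records = preprocess(raw, user_names)
```

Each resulting record keeps only the author name, the message timestamp, and the text, which is all the later steps need.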

Then we feed this processed data into our NLP pipeline to get the named entities, and we insert the nodes and the corresponding relationships between them into our graph in Neo4j. Would you like to know more about what kind of nodes and relationships we insert into our graph? Here it comes!
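To give a feeling for how a processed message becomes nodes and relationships, here is a rough Python sketch. A simple vocabulary lookup stands in for the real NER step, so this is not our NLP pipeline; `TECH_TERMS`, `extract_tech`, and `to_triples` are hypothetical names, and the sample record is invented.

```python
# Hypothetical vocabulary of technology terms; a real NER model would replace this lookup.
TECH_TERMS = ["Amazon Web Services", "SQL", "Docker"]


def extract_tech(text, vocabulary=TECH_TERMS):
    """Return the vocabulary terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def to_triples(record):
    """Turn one processed message into (start, relationship, end) triples."""
    # The message is identified by its timestamp, as in our data model.
    triples = [(record["author"], "WROTE", record["timestamp"])]
    for term in extract_tech(record["text"]):
        triples.append((record["timestamp"], "CONTAINS", term))
        triples.append((record["author"], "MENTIONED", term))
    return triples


record = {
    "author": "hsiaoching",
    "timestamp": "1609459200.000200",
    "text": "Migrated the SQL jobs to Amazon Web Services",
}
triples = to_triples(record)
```

Plain substring matching is deliberately naive (for instance, "SQL" would also match inside "NoSQL"), which is one reason a proper NER step is preferable.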

MetaGraph

In our implementation, we chose a graph database (Neo4j) as our data model. Now, we are going to show our graph and how we can analyze the data within it.

As the wisdom goes, understand your data before doing any further analysis. So let’s take a look at the schema of our data. In a graph database, the schema can be viewed as a meta graph, which is shown below:

// Show meta-graph
CALL db.schema.visualization()
[Figure: meta graph of the nodes and relationships]


From this meta graph, we can observe the nodes and the rela­tion­ships between them. For example:

  • (Author) -[WROTE]-> (Message)
  • (Message) -[CONTAINS]-> (Tech)
  • (Author) -[MENTIONED]-> (Person), etc.

Query Graph — Cypher

You probably also noticed that I use some unusual notation to describe the relationships between entities. It might look a bit weird at first glance, but still, it’s quite comprehensible without further explanation, right?

This is exactly the basic idea of Cypher, the equivalent of SQL for graph databases: a straightforward, declarative way to analyze the data in a graph database. Let us go through some scenarios.

Per­sonal Profile

Say you would like to get a cer­tain pat­tern from your graph data­base. For example, to get a cer­tain per­sonal pro­file (in this case, mine), you can use the MATCH clause. The syn­tax will look like the following:

MATCH (a:AUTHOR {name:"hsiaoching"}) --> (t:tech)
RETURN a, t
[Figure: graph of the author hsiaoching and the tech terms she mentioned]


You just need to spe­cify the pat­tern you would like to observe. In this case, the author (hsiaoch­ing), and the tech­nical terms she’s men­tion­ing. And voilà, here comes the answer after you ask your ques­tion in Cypher lan­guage. Appar­ently, I talk a lot about Amazon Web Ser­vices and SQL. 👀

How our per­sonal pro­files are linked with each other

Now I would like to know which technology terms are mentioned by our LinkeDIn colleagues, so this time I use the WHERE clause to specify the authors. Let’s say I want the authors to include christian.p and mario in addition to myself. Now the syntax looks like this:

MATCH (author:AUTHOR)-[]-(item)
WHERE item:tech AND author.name IN ["hsiaoching", "mario", "christian.p"]
RETURN author, item
[Figure: graph of the three authors and the tech terms they mentioned]

Appar­ently, Chris­tian (Paul) and I were both men­tion­ing Amazon Web Ser­vices and Con­flu­ence in our weekly update. And Mario and I were doing Test­ing and SQL at some point in the past weeks. 🤔

Domain Expert

Now, let’s say, I had some ques­tions about Ab Ini­tio, and I’d like to know which internal domain expert I should approach. So this time, instead of nam­ing the spe­cific author, I am spe­cify­ing the tech­no­logy term as Ab Ini­tio.

MATCH (a:AUTHOR) --> (tech:tech {name: "Ab Initio"})
RETURN a, tech
[Figure: graph of the authors connected to Ab Initio]


It looks like we have a lot of experts with knowledge of Ab Initio. But of course, if we approach Pavel (the leader of our Ab Initio team), we should always be able to get our questions answered! 💡

Then what if my problems are about Amazon Web Services?

MATCH (a:AUTHOR) --> (tech:tech {name: "Amazon Web Services"})
RETURN a, tech
[Figure: graph of the authors connected to Amazon Web Services]


This time, Pasquale, Socrates, and Evan might be the right people for me to talk to! 🤓

Let’s see what Pasquale talks about over the years

So far so good? Then let’s play around with some more com­plex examples. Let’s say we would like to know the buzzwords Pasquale (the CEO and Founder of Data Insights) talks about over the years.

Since, in our data model, the name of the Message node is the timestamp (the exact timestamp when the Author wrote this message), we need to parse the timestamp string to get the year. In Cypher, we can approach this much as we would in other programming languages: we split the timestamp string (split) and convert the first part to an integer (toInteger). After that, we convert this integer to the datetime type (datetime) and then retrieve the year of this timestamp.
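For comparison, the same split, convert, and extract-year steps can be sketched in Python. The function name `message_year` and the sample timestamp string are made up for illustration; the sketch assumes the message name has the form "seconds.fraction" and interprets the seconds as UTC, which is also how `datetime({epochSeconds: ...})` behaves in Cypher.

```python
from datetime import datetime, timezone


def message_year(message_name):
    """Extract the calendar year (UTC) from a 'seconds.fraction' timestamp string."""
    # Keep the part before the dot and convert it to integer epoch seconds.
    epoch_seconds = int(message_name.split(".")[0])
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year


year = message_year("1609459200.000200")
```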

You might notice that the MATCH clause is now a bit more complex, because we would like to retrieve buzzwords. These are technical terms that not only Pasquale mentioned, but that other people were also using in the same year. The condition a <> a2 makes sure that, when we look at the messages from the author (a), we only pair them with messages from authors (a2) other than a.

The query is shown below, followed by Pasquale’s buzzwords from 2018 to 2021.

MATCH (a:AUTHOR {name:"Pasquale"})-->(msg:MESSAGE)-->(tech:tech)<--(msg2:MESSAGE)<--(a2:AUTHOR)
WITH a, a2, msg, msg2, tech, 
             toInteger(split(msg.name, ".")[0]) AS ts,
             toInteger(split(msg2.name, ".")[0]) AS ts2
WHERE a <> a2 
             AND datetime({epochSeconds:ts}).year = 2018 // 2019, 2020, 2021
             AND datetime({epochSeconds:ts2}).year = 2018 // 2019, 2020, 2021
RETURN a, tech, COUNT(tech) AS cnt
ORDER BY cnt DESC
LIMIT 10

2018

[Figures: graph and count table of Pasquale’s buzzwords in 2018]

2019

[Figures: graph and count table of Pasquale’s buzzwords in 2019]

2020

[Figures: graph and count table of Pasquale’s buzzwords in 2020]

2021

[Figures: graph and count table of Pasquale’s buzzwords in 2021]

Here, you may notice that the count shown in the graph between Pasquale and Amazon Web Services is always 32. The reason is that the graph shows the nodes Author and Amazon Web Services, and the relationship between them is MENTIONED. This relationship does not depend on the timestamp in our data model (as mentioned before, only the Message node carries the timestamp information). We deliberately avoid showing Message nodes in the graph in order to keep the illustration clean.

As you can see, the cnt in the table actually comes from COUNT(tech) AS cnt. For each tech term, the query performs a Cartesian join between Pasquale’s messages and the messages of the other authors (everyone except Pasquale) from the specified year, and counts the resulting pairs. It’s a rough approximation that lets us grasp which tech terms people were writing about within the specified year.
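To see why the count behaves like a Cartesian join, consider this small Python sketch. The mention data is invented, and `buzzword_counts` is a hypothetical helper that mimics the pairing logic rather than the query itself: every message of the chosen person is paired with every message of another author that contains the same tech term.

```python
from collections import Counter
from itertools import product

# Made-up (author, tech) mentions for one year; each tuple stands for one message.
mentions = [
    ("Pasquale", "Amazon Web Services"),
    ("Pasquale", "Amazon Web Services"),
    ("Pasquale", "Docker"),
    ("evan", "Amazon Web Services"),
    ("socrates", "Amazon Web Services"),
    ("evan", "Amazon Web Services"),
    ("mario", "Docker"),
]


def buzzword_counts(mentions, person):
    """Count (person message, other message) pairs that share a tech term."""
    own = [tech for author, tech in mentions if author == person]
    others = [tech for author, tech in mentions if author != person]
    # Cartesian product of both message lists, keeping pairs with the same term.
    return Counter(t1 for t1, t2 in product(own, others) if t1 == t2)


counts = buzzword_counts(mentions, "Pasquale")
```

In this toy data, two messages from Pasquale and three from others mention Amazon Web Services, so the count is 2 × 3 = 6 rather than 5. The number grows with both sides, which is why it is only a rough indicator.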

Buzzwords within Data Insights

We could also ask a similar question about Data Insights’ buzzwords using Cypher. After seeing the previous examples, I believe this syntax should now be really straightforward for you.

MATCH (author:AUTHOR)-[e1]-(item)-[e2]-(author2:AUTHOR) 
WHERE e1.count > 2 AND e2.count > 2 and item:tech
RETURN author, item, author2
[Figure: graph of the authors and the buzzwords they share]

It looks like, at Data Insights, we indeed talk a lot about Big Data buzzwords, such as Big­Data, Amazon Web Ser­vices, Ab Ini­tio, DevOps, Dat­ab­ricks, Test­ing, Docker, Jen­kins, etc. 📈
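The logic of this buzzword query can be paraphrased in Python as follows. This is only a sketch with invented numbers: `edge_counts` is a hypothetical stand-in for the count property on the relationships between authors and tech terms, and `shared_buzzwords` is an illustrative helper.

```python
from itertools import permutations

# Hypothetical (author, tech) -> mention count, standing in for the
# `count` property on the relationships in the graph.
edge_counts = {
    ("hsiaoching", "SQL"): 5,
    ("mario", "SQL"): 3,
    ("christian.p", "SQL"): 1,
    ("mario", "Docker"): 4,
}


def shared_buzzwords(edges, threshold=2):
    """Return (author, tech, other_author) rows where both edges exceed the threshold."""
    strong = [(author, tech) for (author, tech), n in edges.items() if n > threshold]
    return sorted(
        (a1, t1, a2)
        for (a1, t1), (a2, t2) in permutations(strong, 2)
        if t1 == t2  # both authors are linked to the same tech term
    )


rows = shared_buzzwords(edge_counts)
```

A term only shows up when at least two different authors mention it more than twice, so Docker (one strong edge) and christian.p's single SQL mention are filtered out here.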

Sum­mary

In this blog post, we have explained how to store data in a graph database and how to analyze these data in Neo4j by using Cypher. Based on the characteristics of a graph database, tons of graph data science approaches can be applied to explore more relations between different entities. I hope you enjoyed reading this article, and feel free to let us know if you have any feedback 🙂