Part 1 – What is Meant by Computer Vision?

The First Sparks

One famous anecdote tells of how, in 1966, MIT Professor Marvin Minsky set up a summer project for some of his students. They were asked to explore creating artificial vision systems that would ultimately recognize objects. By the end of the summer, the students had been unable to meet the goals. This is no reflection on the calibre of the students; rather, it represents a first glimpse into the difficulty of the Computer Vision problem. To those unfamiliar with the technology, this can seem odd. Why is something so intuitive for us so difficult for a computer?

Source: https://xkcd.com/1425/

To be blunt: human eyes are complicated. Working out the intricacies of the photoreceptors, the nerves, and the manner in which vision manifests within the neurons of the brain has been a slow process spanning decades. And we are still nowhere near finished! Recent research is still uncovering fascinating aspects of our own brain-eye interplay, such as the fact that “vision” itself may happen more in the brain than in the eye (see Chariker et al. 2018 and Joglekar et al. 2019).

All this is to say, it should come as no surprise that primitive visual-cortex models from half a century ago gave underwhelming results when translated into computer architectures (which were themselves in their infancy).

Another issue is that rule-based “step-by-step” algorithms fall flat when asked to tackle Computer Vision problems. In the case of Object Recognition, the many angles, colours, and backgrounds in which an object can appear mean that any viable traditional (non-ML) code would be monstrously long and inefficient.

Therefore, instead of bothering with the tedious “step-by-step” approach, the code must somehow be “let loose” to learn on its own.

Convolutions to the Rescue

The solution arrived in the form of an ingenious new neural network architecture called the “Convolutional Neural Network”, or CNN. Perhaps the most seminal moment in the birth of CNNs was a 1989 paper by Yann LeCun in which he “taught” a network to recognize handwritten digits. This represented a very big step forward in Computer Vision, especially in that the network could be repeatedly shown labelled images and gradually learn to identify them. No rules ever need be defined.

Instead of one pixel per neuron, the neurons in a CNN are each fed a “grouping” of neighbouring pixels from the image. This grouping is combined mathematically as a weighted sum, with weights that are themselves learned during training, before being input into the neuron. This process of combining pixels is referred to as a “convolution”, hence the name. (Strictly speaking, taking the average or max of a grouping is the job of the “pooling” layers that typically sit between convolutions and shrink the image down.)
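
To make the idea concrete, here is a minimal sketch in Python/NumPy of a single convolution sliding over an image. The edge-detecting kernel and the 28×28 image size are purely illustrative; in a real CNN the kernel weights are learned from data rather than hand-picked.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a small kernel over a 2-D image, producing one weighted
    sum of a pixel grouping per position ("valid" mode, no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value combines a grouping of neighbouring pixels.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted edge-detecting kernel; a CNN learns such weights itself.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])
image = np.random.rand(28, 28)    # stand-in for a 28x28 greyscale digit
feature_map = convolve2d(image, kernel)
print(feature_map.shape)          # (26, 26)
```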

Image from Yann LeCun’s 1989 paper showing the framework of an early Convolutional Neural Network.

Although groundbreaking, the first CNNs were still very restricted. On exceptionally straightforward tasks (such as digit recognition) they performed well, but they were unable to generalize to the plethora of situations humans encounter in our day-to-day lives.

Going Deeper

The next big leap forward would not come for another 23 years, in 2012.
At the time, ImageNet was the benchmark dataset for Computer Vision tasks. Its associated challenge covered 1000 classes of objects (a much more difficult problem than 10 simple digits!). Large research bodies around the world converged each year to attempt to lower the error on the dataset – that is to say, to reduce the number of falsely identified objects when a neural network is run on the images.

In 2012, a team led by Alex Krizhevsky designed an unusually deep neural network (although still quite shallow by today’s standards). Until that point, it had been assumed that such a deep neural network would be too computationally expensive to reasonably train, and that any gains in performance would be outweighed by this fact.

However, at the ImageNet competition in 2012, Krizhevsky’s network, named AlexNet, blew the competition out of the water. In the above graph of ImageNet performance over time, notice how one team lowered the error rate by around 10% in 2012. This was AlexNet, and its success is part of a larger group of successes ushering in the beginning of the Deep Learning Revolution.

In subsequent years, the error rate was pushed even lower with further refinements.

Where We Are Now

We now find ourselves nearly a decade on from the introduction of AlexNet. That decade has witnessed a breathtaking wave of AI innovation. Even if huge hurdles remain on the way to true human-level AI, tasks that were traditionally considered “Human-only territory” are rapidly being knocked into the “AI-possible” domain. As research pushes AI boundaries, state-of-the-art technologies are finding themselves applied in industry.

This is great news for Data Insights, as it means fascinating Computer Vision projects.
And with that brief intro to the domain, let’s take a moment to enumerate a few examples!

Part 2 – Computer Vision in Data Insights

1 – Car Recognition

Data Insights was approached by a company seeking to identify car models. Specifically, the imagined AI system would be shown a photo and would need to recognize the model of the car in it. To accomplish this task, a 150-layer CNN was trained in the cloud on over ten thousand photos of cars. Gradually it learned to recognize different car models, eventually achieving roughly 85% accuracy across 197 different models. Just as powerful is that each prediction could be made in a fraction of a second – much faster than is possible for humans.

More information about the architecture is available here.
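
As an illustration of the general approach (not the exact production code), here is a minimal PyTorch sketch of fine-tuning a pretrained deep CNN for 197 car-model classes. The choice of ResNet-152 as the roughly 150-layer network is an assumption on our part, and the training loop details are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a ResNet-152 pretrained on ImageNet
# (an assumed stand-in for the "150-layer CNN").
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)

# Replace the final 1000-class ImageNet head with a 197-class head.
model.fc = nn.Linear(model.fc.in_features, 197)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient update on a batch of labelled car photos."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Starting from pretrained weights rather than from scratch is what makes ten thousand photos enough: the network already “knows” generic visual features and only has to learn the car-specific ones.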

Link to Car Recognition video

2 – Handwriting Recognition

Another fascinating avenue is text extraction. Specifically, we have been approached by a number of companies looking to create programs that parse files and extract text. In the simplest case, these are forms containing computer-generated text. A Computer Vision technology named Optical Character Recognition (OCR) can be used to parse such documents and extract the text.
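
For computer-generated text, a few lines of off-the-shelf OCR are often enough. A minimal sketch using the open-source Tesseract engine via its pytesseract wrapper (the file name is a placeholder, and Tesseract itself must be installed on the machine):

```python
from PIL import Image
import pytesseract  # thin wrapper; requires the Tesseract OCR engine

# Extract computer-generated text from a scanned form (placeholder file).
text = pytesseract.image_to_string(Image.open("scanned_form.png"))
print(text)
```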

Somewhat more challenging, but now possible with state-of-the-art CNNs, are handwritten documents. As handwriting comes in a huge variety of shapes and styles, research groups have only recently been able to reliably extract handwritten text. Cloud solutions have also started to appear (e.g. Azure OCR).

One final “cherry-on-top” extension is the addition of Natural Language Processing (NLP). With NLP, it is possible to apply language logic to the extracted text to “fix up” typos and grammatical errors. For example, if the OCR has extracted “the fool was super tasty!”, NLP can infer that the sentence was more likely written as “the food was super tasty!”. Adding NLP as a final step therefore leads to a fantastic boost in extraction accuracy.
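
One simple way to sketch this idea is with a pretrained masked language model: mask the suspect word and let the model propose a more plausible replacement in context. This uses the Hugging Face transformers library and is purely illustrative; a production pipeline would typically combine OCR confidence scores with the language model rather than masking words blindly.

```python
from transformers import pipeline

# A pretrained masked language model scores plausible words in context.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The OCR read "fool"; mask it and ask the model what fits the sentence.
for candidate in fill("the [MASK] was super tasty!", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
# "food" should score far higher than "fool", justifying the correction.
```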

An example architecture for handwritten text extraction. The final layer (LSTM) is a popular NLP architecture which can apply language logic to text data.

Once the handwritten documents have been digitized and retouched via NLP, additional ML layers can be applied for a whole range of use cases. For example, documents can be automatically sorted into different document types via clustering algorithms, as sketched below. Or anomaly detection can be applied to identify suspicious oddities within the documents (perhaps, for example, in the case of an audit).
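
A sketch of the clustering step, assuming the documents have already been digitized to plain text. The toy documents, the TF-IDF representation, and the choice of K-Means are all illustrative; any vectorizer/clusterer pair would fit the same pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder texts standing in for the digitized documents.
documents = [
    "Invoice number 4521, total due 300 EUR",
    "Dear Sir, thank you for your letter",
    "Invoice number 4522, total due 120 EUR",
]

# Turn each document into a TF-IDF vector, then group similar ones.
vectors = TfidfVectorizer().fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(labels)  # documents sharing a label belong to the same inferred type
```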

3 – GANs

One fascinating Computer Vision technology to have emerged in the past few years is the Generative Adversarial Network (GAN). These networks work exceptionally well when photos need to be manipulated in a manner more complicated than what traditional CNNs allow – for example, colourizing old black-and-white photos, automatically blurring faces, or turning satellite photos into maps.
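
The “adversarial” part refers to two networks trained against each other: a generator that produces images and a discriminator that tries to tell them apart from real ones. A minimal sketch of one training step in PyTorch; `generator` and `discriminator` are placeholder models (the discriminator is assumed to end in a sigmoid), and the network definitions themselves are omitted.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(generator, discriminator, g_opt, d_opt, real_images, noise):
    """One adversarial update: the discriminator learns to spot fakes,
    then the generator learns to fool it."""
    real_label = torch.ones(real_images.size(0), 1)
    fake_label = torch.zeros(real_images.size(0), 1)

    # 1. Train the discriminator on real vs. generated images.
    d_opt.zero_grad()
    d_loss = (bce(discriminator(real_images), real_label)
              + bce(discriminator(generator(noise).detach()), fake_label))
    d_loss.backward()
    d_opt.step()

    # 2. Train the generator to make the discriminator say "real".
    g_opt.zero_grad()
    g_loss = bce(discriminator(generator(noise)), real_label)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```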

Data Insights wanted a demo GAN to explore this new technology. We deployed a GAN on Databricks and linked it with GitHub so that changes would be registered in a remote repository. Additionally, Databricks was connected to an Amazon EC2 instance, allowing training to be visualized using the Facebook application Visdom. Below you can see what Visdom looks like while the application is training. The images on the top show what the GAN is “seeing”, and the orange curve shows how well the GAN is “learning”.

We taught the GAN to identify people within photos and colour them out of the photos. This could be used for automatic data anonymization.

Below are some of the results when applied to real photos which the GAN had not seen before:

The GAN is able to consistently identify where the person is within the photo and blur them.

4 – Pill Defect Detection

A pharmaceutical company approached Data Insights to help with detecting defects within batches of medication. Computer Vision is now exceptionally good at such tasks. One advantage is that modern AI systems are not only much faster than humans, but also immune to fatigue and distraction.

An added bonus is that such tasks can incorporate something known as Variational AutoEncoders (VAEs). This technology allows Computer Vision systems to automatically encode the features of what an object “should look like”. Once this is done (using thousands of photos of pills, for example), the AI system can notice subtle variations that stand out against what it has learned as “normal”. This way, we do not need to explicitly and tediously label images as “normal” and “defective” – the AI system can work that out on its own.
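
A sketch of how the final defect check might look, assuming a VAE has already been trained on photos of normal pills. The `vae` model (assumed here to return a reconstruction from its forward pass) and the error threshold are placeholders; real VAEs often return the latent statistics as well.

```python
import torch

def is_defective(vae, pill_image: torch.Tensor, threshold: float) -> bool:
    """Flag a pill whose reconstruction error stands out from "normal".

    The VAE was trained only on normal pills, so it reconstructs them
    well; a defective pill reconstructs poorly, and its per-pixel error
    exceeds the threshold calibrated on normal examples.
    """
    with torch.no_grad():
        reconstruction = vae(pill_image)  # placeholder forward signature
        error = torch.mean((pill_image - reconstruction) ** 2).item()
    return error > threshold
```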