Developing a GAN “Invisibility Cloak”



What the heck is a GAN?

One longstanding problem in AI is figuring out the best way to generate realistic content. How can you teach a computer to create images, videos, text, or music indistinguishable from human-generated equivalents? In 2014, a group of researchers at the Université de Montréal had a fairly quirky idea for addressing this problem: why not create two AI programs, one that repeatedly tries to generate human-level output, and another that tries to distinguish between human-generated and machine-generated content? The resulting architecture was given the name “Generative Adversarial Network”, or GAN for short (https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf).
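At its heart, a GAN is a two-player game between a generator and a discriminator. Below is a minimal sketch of that training loop in PyTorch, using toy stand-in networks and random data rather than anything from the paper, just to make the adversarial structure concrete:

    # Minimal GAN training loop sketch (PyTorch). The tiny MLPs and the random
    # "real" batches are placeholders, not the architecture from the paper.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    for step in range(1000):
        real = torch.randn(64, 784)        # stand-in for a batch of real images
        fake = G(torch.randn(64, 100))     # generator maps noise to candidate images

        # Discriminator step: learn to label real samples 1 and generated samples 0
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: try to make the discriminator label its output as real
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()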

The concept worked splendidly: within a few years GANs were able to generate stunningly realistic images, for example human faces that are indistinguishable from real ones (https://thispersondoesnotexist.com/).

The advent of GANs was especially fortuitous for another reason. In 2018 the world was introduced to Edge TPUs (Tensor Processing Units). Analogous to how GPUs excel at graphics processing, and CPUs at general-purpose computing, TPUs are designed to facilitate tensor operations. This is entirely inspired by ML: tensor operations form the core of ML processes. On a larger scale, the outcome of Edge TPUs is that ML operations can now be run on edge devices (think phones, tablets, personal computers, and small-scale on-prem computing facilities in hospitals, warehouses, etc.). With Edge TPUs, ML training is not only faster, but also more energy efficient.

This shift towards “ML on the edge” is partly the result of Edge TPU infrastructure, but it is also a reaction to recent data privacy developments. The adoption of the GDPR in 2016 made it clear that amassing huge corpora of user-surrendered training data may no longer be legally feasible, and companies will have to adapt to these new limitations. There are two main ways to do so. One is Federated Learning, in which small models are trained on edge devices and then aggregated at a central point, so that the data never has to leave the devices.
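To make that idea concrete, here is a toy sketch of federated averaging in PyTorch; the model, the per-device datasets, and the training settings are all placeholders rather than anything from a real deployment:

    # Toy federated averaging sketch: each edge device trains a local copy of the
    # model, and only the weights (never the raw data) are sent back and averaged.
    import copy
    import torch

    def local_update(global_model, data, targets, epochs=1, lr=0.01):
        model = copy.deepcopy(global_model)            # local copy on the device
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(data), targets).backward()
            opt.step()
        return model.state_dict()                      # only weights leave the device

    def federated_average(global_model, device_datasets):
        states = [local_update(global_model, x, y) for x, y in device_datasets]
        avg = {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}
        global_model.load_state_dict(avg)
        return global_model

    # Hypothetical usage with three "devices" holding random data
    model = torch.nn.Linear(10, 1)
    devices = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]
    model = federated_average(model, devices)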

The other method is the creation of synthetic data. GANs can take a small sample of data – collected from either consenting users or a dev team – and use it as a blueprint for generating much more. This synthetic GAN-generated data can then be used to train subsequent ML models (see, for example, https://arxiv.org/abs/1909.13403).
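In practice, that last step is little more than sampling the trained generator. A rough sketch, assuming a trained PyTorch generator G that maps latent noise vectors to samples, could look like this:

    # Sketch: sample a trained generator to build a synthetic dataset for a
    # downstream model. `G` is assumed to be a trained torch.nn.Module.
    import torch

    @torch.no_grad()
    def make_synthetic_dataset(G, n_samples=10_000, latent_dim=100, batch=256):
        chunks = []
        for i in range(0, n_samples, batch):
            z = torch.randn(min(batch, n_samples - i), latent_dim)
            chunks.append(G(z))
        return torch.cat(chunks)        # tensor of synthetic training examples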

The uses of GANs do not end at synthetic data generation. They can automate many complex data manipulation processes, leading to a number of potential use cases. To list a few examples:

  • Google Maps uses GANs to generate simple map images from complex satellite data.
  • Fashion retailers can use GANs to automatically dress potential customers in prospective clothing purchases using an uploaded image.
  • GANs can identify faces in images and learn to blur them for data anonymization applications.
  • GANs can improve video, image, or audio quality. This can include colouring old photos, improving the frame rate of old videos, or even pushing the resolution of media beyond that at which it was captured (super-resolution).

Data Insights was therefore eager to try out this new and promising technology. So we took it for a spin.

The Idea

The internet is littered with videos of techies showing off “ML Invisibility Cloaks”.

[Image: an “ML invisibility cloak” demo]

A cloth is held up, and amazingly the person holding it vanishes! As magical as this seems, it’s nothing new, and does not incorporate ML. This is the timeless “green screen” effect, which has been used in movie production since the 1950s, back when ML was nothing more than a distant pipe dream. In a nutshell, an identically framed video in which the person walks around holding a green cloth is overlaid on a picture of the background without the person. A filter eliminates all green pixels, and the background image comes through wherever the cloth was.
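For the curious, the core of that trick is only a few lines. Here is a rough sketch using OpenCV and NumPy; the file names and the green hue range are placeholders, not values from any particular production pipeline:

    # Rough "green screen" sketch: wherever the frame is sufficiently green, show
    # the pre-recorded background instead. File names and hue range are placeholders.
    import cv2
    import numpy as np

    frame = cv2.imread("frame_with_green_cloth.jpg")   # person holding the green cloth
    background = cv2.imread("empty_background.jpg")    # same framing, nobody in shot

    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Pixels whose hue falls in the "green" band become the mask
    mask = cv2.inRange(hsv, np.array([40, 60, 60]), np.array([80, 255, 255]))

    composite = frame.copy()
    composite[mask > 0] = background[mask > 0]         # replace green pixels with background
    cv2.imwrite("composite.jpg", composite)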

The point here is that the program already knows what is behind the person, making the claim of a Harry Potter-style Invisibility Cloak feel a little overblown. But then, one may ask, could it be possible to somehow “know” what is behind a person, without having any information? That is: true invisibility?

Well, ultimately, no. You can never be truly sure what lies behind somebody without having seen it previously (barring perhaps gravitational lensing, and a few other extreme cases). But, as Photoshop has taught us, we can do a pretty good job at guessing. In fact, long before Photoshop, expert photo manipulators could convincingly edit somebody out of a photo. Take this infamous example of Soviet secret police official Nikolai Yezhov.

[Image: doctored Soviet photo with Nikolai Yezhov removed]

After falling out of favour with Stalin during the purges of the 1930s, Yezhov was executed, and his presence in photos was carefully doctored away. Putting aside the macabre nature of the example for the more technical considerations, one can see how the editor has inferred the continuation of the low wall on the right and repeated the pattern of ripples in the water. The question arises: could a GAN learn to automate the same process?

Training a GAN Invisibility Cloak

To attempt this, we opted for the pix2pix GAN architecture introduced in 2017: https://phillipi.github.io/pix2pix/

As training data, pix2pix expects pairs of photos with an exact pixel-to-pixel (hence the name) correspondence between the input and the desired output. To create such a dataset, we scoured the web for stock photos of people with a transparent background, and applied a script to randomly re-scale the figures and place them on photos of various locations. This gave us matching photos with and without the person. An example photo pair is shown below:
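The pairing script itself might look roughly like the following Pillow sketch; the paths, output size, and scaling range are placeholders rather than the exact values we used:

    # Sketch of generating one (input, target) pair: paste a transparent-background
    # cutout of a person onto a background photo at a random position and scale.
    import random
    from PIL import Image

    def make_pair(person_png, background_jpg, out_size=(256, 256)):
        background = Image.open(background_jpg).convert("RGB").resize(out_size)
        person = Image.open(person_png).convert("RGBA")

        # Randomly re-scale the figure relative to the frame height
        scale = random.uniform(0.3, 0.8) * out_size[1] / person.size[1]
        person = person.resize((max(1, int(person.size[0] * scale)),
                                max(1, int(person.size[1] * scale))))

        # Random placement within the frame
        x = random.randint(0, max(0, out_size[0] - person.size[0]))
        y = random.randint(0, max(0, out_size[1] - person.size[1]))

        with_person = background.copy()
        with_person.paste(person, (x, y), mask=person)  # alpha channel acts as the mask

        return with_person, background                  # photo with and without the person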

[Image: example training photo pair, with and without the person]

One cool thing about the pix2pix architecture is that it is sufficiently light to be run on an everyday laptop (granted, one with a suitable GPU and drivers). However, the training is much faster if run in the cloud. Therefore, we uploaded the GAN architecture to Databricks.

One other noteworthy aspect of GANs is that defining the loss function is non-trivial. Unlike computer vision classification tasks, ML forecasting, and basic NLP processes, quantifying when a GAN is doing a “good job” can be especially tricky (and, in fact, remains an active area of research). As such, the best approach is to carefully watch the output images being produced by the GAN during training, to be sure that things are going as expected. To do this, we used Visdom (https://github.com/facebookresearch/visdom), a tool that visualizes metrics and image output in a web browser in real time while tracking ML applications. Visdom was run on a micro EC2 instance, with a port exposed through which information could be sent from Databricks.
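Pushing metrics and sample images from the training loop to that remote Visdom server takes only a few calls. A rough sketch is below; the server address, port, and window names are placeholders:

    # Sketch of logging losses and sample images to a remote Visdom server.
    import numpy as np
    import visdom

    vis = visdom.Visdom(server="http://<ec2-public-ip>", port=8097)

    def log_step(epoch, g_loss, d_loss, input_img, fake_img, target_img):
        # Append generator/discriminator losses to a live line plot
        vis.line(X=np.array([[epoch, epoch]]), Y=np.array([[g_loss, d_loss]]),
                 win="losses", update="append",
                 opts=dict(title="GAN losses", legend=["G", "D"]))

        # Show input / generated / target side by side (each a 3xHxW uint8 array)
        vis.images(np.stack([input_img, fake_img, target_img]),
                   win="samples", opts=dict(title="epoch %d" % epoch))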

With the addition of git integration, the final architecture was as follows:

[Image: final architecture diagram (training on Databricks, Visdom on EC2, git integration)]

With everything set up, we booted up the training routine, and (popcorn bucket in hand) watched the magic through Visdom.

At first, as expected, the GAN clearly has no idea what it is doing. It simply plays with the image hues and makes images blurry. The screen capture below gives an example. The picture on the left is the input image, the centre picture is what the GAN generator has output, and the picture on the right is the true target photo (which the Generator is not allowed to see until after it has output its attempt).

[Image: early training output (input, generator output, target)]

However, progress is relatively fast. After only a few epochs (each taking around 10 seconds on Databricks, and comprising a pass through roughly 200 training photos), the GAN begins to consistently find the human figure in each photo.

[Image: output after a few epochs, with the figure located but merely made pale]

However, we see that the GAN is still not quite clear on what to do with the person: it simply makes their figure pale. Yet after a few hundred epochs, the GAN begins to understand what needs to be done, and learns to use the colours and features of the surroundings to properly obscure human figures. Here are some examples after 200 epochs of training.

[Images: example outputs after 200 epochs of training]

What about on Test Data?

It’s important to remember that training results always look more refined than test results – that is, results obtained by applying the program to photos it has not yet seen. Our case is no exception. Here is what the Generator outputs when given a real photo it has not seen before (that is, a photo of a real person in a real place, as opposed to the script-generated stock photos shown above).

[Image: generator output on unseen test photos]

Looking at this, we see that we still have some work to do. But the results are still encouraging. First off, the GAN is able to locate the person within the image fairly consistently. Secondly, if we zoom in on one of the photos, we can see that the GAN is still trying to use surrounding colours to edit out the figure:

[Image: zoomed-in test output showing surrounding colours used to obscure the figure]
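As an aside, producing such a test output is just a single forward pass through the trained generator. A rough sketch of what that might look like, with placeholder paths and pix2pix-style normalisation (and assuming the whole generator module was saved with torch.save), is below:

    # Sketch of applying a trained generator to an unseen photo.
    # Checkpoint path, image path, and normalisation values are placeholders.
    import torch
    from PIL import Image
    from torchvision import transforms

    generator = torch.load("checkpoints/generator.pth", weights_only=False)
    generator.eval()

    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # map to [-1, 1]
    ])

    img = preprocess(Image.open("unseen_photo.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        output = generator(img)

    # Map back from [-1, 1] to [0, 1] and save the result
    transforms.ToPILImage()((output.squeeze(0) + 1) / 2).save("unseen_photo_edited.jpg")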

Even if the test images are not perfectly handled, we shouldn’t be too hard on our GAN. That is to say, we need to keep in mind that the scope of this fun prototype was limited:

  • The training data consisted of only 100 images, plus 100 flipped copies (data augmentation). A proper attempt would involve a few thousand images.
  • The training was run for 200 epochs (100 at a constant learning rate, and 100 with learning rate decay; see the sketch after this list). Again, a proper attempt would involve thousands.
  • In the interest of time, hyperparameter tuning was not nearly as comprehensive as it could have been.
  • The Generator architecture was also quite shallow (15 layers). There is no reason this couldn’t be pushed deeper.
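For reference, the schedule mentioned in the second bullet can be expressed as a short sketch; the optimizer settings and the placeholder generator are assumptions, while the split of 100 constant plus 100 decaying epochs follows the description above:

    # Sketch of the learning-rate schedule: constant for the first 100 epochs,
    # then linearly decayed towards zero over the next 100.
    import torch

    generator = torch.nn.Linear(1, 1)          # placeholder for the pix2pix generator
    optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4)

    def linear_decay(epoch, n_constant=100, n_decay=100):
        # Multiplier applied to the base learning rate at each epoch
        return 1.0 - max(0, epoch - n_constant) / float(n_decay)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)

    for epoch in range(200):
        # ... one training pass over the photo pairs ...
        scheduler.step()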

Ultimately, and to put things in perspective, it’s still quite humbling that, in only an hour of training, a GAN with absolutely no idea what it’s supposed to be doing, nor any concept of what a ‘person’ is, can learn to identify a figure and use the surroundings to edit that figure away. Another fun thought: as far as we know, this is the first ever deployment of a full GAN on Databricks. Whoohoo!

Hopefully you’ve enjoyed this neat little application, and it has conjured up some childhood memories of the initial wonder of the Harry Potter Invisibility Cloak.