Engines of Wow, Part III: Opportunities and Pitfalls

This is the third in a three-part series introducing revolutionary changes in AI-generated art. In Part I: AI Art Comes of Age, we traced back through some of the winding path that brought us to this point. Part II: Deep Learning and The Diffusion Revolution, 2014-present, introduced three basic methods for generating art via deep-learning networks: GANs, VAEs and Diffusion models.

But what does it all mean? What’s at stake? In this final installment, let’s discuss some of the opportunities, legal and ethical questions presented by these new Engines of Wow.

Opportunities and Disruptions

We suddenly have robots which can turn text prompts into relevant, engaging, surprising images for pennies in a matter of seconds. They can compete with custom-created art taking illustrators and designers days or weeks to create.

Anywhere an image is needed, a robot can now help. We might even see side-by-side image creation with spoken words or written text, in near real-time.

  • Videogame designers have an amazing new tool to envision worlds.
  • Bloggers, web streamers and video producers can instantly and automatically create background graphics to describe their stories.
  • Graphic design firms can quickly make logos or background imagery for presentations, and accelerate their work. Authors can bring their stories to life.
  • Architects and storytellers can get inspired by worlds which don’t exist.
  • Entire graphic novels can now be generated from a text script which describes the scenes without any human intervention. (The stories themselves can even be created by new Chat models from OpenAI and other players.)
  • Storyboards for movies, which once cost hundreds of thousands of dollars to assemble, can soon be generated quickly, just by ingesting a script.

It’s already happening. In the Midjourney chat room, user Marcio84 writes: “I wrote a story 10 years ago, and finally have a tool to represent its characters.” With a few text prompts, the Midjourney Diffusion Engine created these images for him for just a few pennies:

Industrial designers, too, have a magical new tool. Inspiration for new vehicles can appear by the dozens and be voted up or down by a community:

Motorcycle concept generated by Midjourney, 2022

These engines are capable of competing with humans. In some surveys, as many as 87% of respondents incorrectly felt an AI-generated image was that of a real person. Think you can do better? Take the quiz.

I bet you could sell the art below, generated by Midjourney from a “street scene in the city, snow” prompt, in an art gallery or Etsy shop online. If I spotted it framed on a wall somewhere, or on a book cover or movie poster, I’d have no clue it was computer-generated:


A group of images stitched together becomes a video. One Midjourney user has tried to envision the aging and destruction of a room, via successive video frames generated from ever-more-decayed descriptions:

These are just a few of the things we can now do with these new AI art generation tools. Anywhere an image is useful, AI tools will have an impact, by lowering cost, blending concepts and styles, and envisioning many more options.

Where do images have a role? Well, that’s pretty much every field: architecture, graphic design, music, movies, industrial design, journalism, advertising, photography, painting, illustration, logo design, training, software, medicine, marketing, education and more.

Disruption

The first obvious impact is that many millions of employed or employable people may soon have far fewer opportunities.

Looking just at graphic design: there are approximately half a million designers employed globally, about 265,000 of whom are in the United States (source: Matt Moran of Colorlib). The total market size for graphic design is about $43 billion per year. 90% of graphic designers work freelance, and the Fortune 500 accounts for nearly one-fifth of graphic design employment.

That’s just the graphic design segment. Then, there are photographers, painters, landscape architects and more.

But don’t count out the designers yet. These are merely tools, just as cameras are. And while the Internet disrupted (or “disintermediated”) brokers in certain markets in the ’90s and ’00s (particularly travel, in-person retail, financial services and real estate), I don’t expect AI-generation tools to make these experts obsolete.

But the AI revolution is very likely to reduce the dollars available and reshape their roles. For instance, travel agents and financial advisers very much do still exist, though their numbers are far lower. The ones who have survived, and even thrived, have used the new tools to imagine new businesses and have moved up the value-creation chain.

Who Owns the Ingested Data? Who Compensates the Artists?

Is this all plagiarism of sorts? There are sure to be lawsuits.

These algorithms rely upon massive image training sets, and there isn’t much dotting of i’s and crossing of t’s to secure digital rights. Recently, an artist found her own private medical records in a publicly available training dataset used by Stability AI. You can check whether your own images have been part of the training datasets at haveibeentrained.com.

But unlike most plagiarism and “derivative work” lawsuits up until about 2020, these lawsuits will need to contend with the difficulty of firmly identifying just how the works are directly derivative. Current caselaw around derivative works generally requires some degree of sameness or likeness from input to final result. But the body of imagery which goes into training the models is vast. A given creative end-product might be associated with hundreds of thousands or even millions of inputs. So how do the original artists get compensated, and how much, if at all?

No matter the algorithm, all generative AI models rely upon enormous datasets, as discussed in Part II. That’s their food. They go nowhere without creative input. And these datasets are the collective work of millions of artists and photographers. While some AI researchers go to great lengths to ensure that the images are copyright-free, many (most?) do not. Web scraping is often used to fetch and assemble images, and then a lot of human effort is put into data-cleansing and labeling.

The sheer scale of “original art” that goes into these engines is vast. It’s not unusual for a model to be trained on 5 million images. So these generative models learn patterns in art from millions of samples, not just by staring at one or two paintings. Are they “clones”? No. Are they even “derivative”? Probably, but not in the same way that George Harrison’s “My Sweet Lord” was derivative of Ronnie Mack’s “He’s So Fine.”

In the art world, American artist Jeff Koons created a collection called Banality, which featured sculptures from pop culture: mostly three dimensional representations of advertisements and kitsch. Fait d’Hiver (Fact of Winter) was one such work, which sold for approximately $4.3 million in a Christie’s auction in 2007:

Koons acknowledged that his sculpture was both inspired by and derived from this advertisement, created by Franck Davidovici:


It’s plain to the eye the work is derivative.

And in fact, that was the whole point: Koons brought to three dimensions some of the banality of everyday kitsch. In a legal battle spanning four years, Koons’ lawyers argued unsuccessfully that such derivative work was still unique, on several grounds: he had turned it into three dimensions, added a penguin, put goggles on the woman, applied color, changed her jacket and the material representing snow, changed the scale, and much more.

While derivative, with all these new attributes, was the work not then brand new? The French court said non. Koons was found liable in 2018. And it wasn’t the first time: of the five lawsuits which sprang from the Banality collection, Koons lost three, and another settled out of court.

Unlike the “derivative works” lawsuits of the past, generative AI models rely not upon one work of a given artist, but upon an entire body of millions of images from hundreds of thousands of creators. Photographs are often lumped in with artistic sketches, oil paintings, graphic novel art and more to fashion new styles.

And, while it’s possible to look into the latent layers of AI models and see vectors of numbers, it’s impossible to translate that into something akin to “this new image is 2% based on image A64929, and 1.3% based on image B3929, etc.” An AI model learns patterns from enormous datasets, and those patterns are not well articulated.

Potential Approaches

It would be possible, it seems to me, to pass laws requiring that AI generative models use properly licensed (i.e., copyright-free or royalty-paid images), and then divvy up that royalty amongst its creators. Each artist has a different value for their work, so presumably they’d set the prices and AI model trainers would either pay for those or not.

Compliance is another matter entirely; perhaps certification technologies would issue valid tokens once ownership is verified. Similar to the blockchain concept, perhaps all images would have to be traceable to some payment or royalty-agreement license. Or perhaps Non-Fungible Tokens (NFTs) could be used to license out ownership for ingestion during the training phase. Obviously this would have to be scalable, which suggests automation and a de facto standard must emerge.

Or will we see new kinds of art comparison or “plagiarism” tools, letting artists compare similarity and influence between generated works and their own creation? Perhaps if a generated piece of art is found to be more than 95% similar (or some such threshold) to an existing work, it will not retain copyright and/or require licensing of the underlying work. It’s possible to build such comparative tools today.
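As a toy illustration of how simple a first-pass comparison tool could be, here is a sketch using perceptual hashing. The ImageHash library and the filenames are my own assumptions for the example; a real tool would more likely compare learned embeddings, but the principle (a numeric similarity score with a threshold) is the same:

from PIL import Image
import imagehash  # pip install ImageHash

def similarity(path_a: str, path_b: str) -> float:
    """Rough 0-1 similarity score between two images via 64-bit perceptual hashes."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    distance = hash_a - hash_b          # Hamming distance between the two hashes
    return 1.0 - distance / 64.0

# Hypothetical filenames:
print(f"{similarity('generated.png', 'original_artwork.png'):.0%} similar")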

In the meantime, it’s a Wild West of sorts. As has often happened in the past, technology’s rapid pace of advancement has gotten ahead of legislation and of how the money flows.

What’s Ahead

If you’ve come with me on this journey into AI-generated art in 2022, or have seen these tools up close, you’re like the viewer who’s seen the world wide web in 1994. You’re on the early end of a major wave. This is a revolution in its nascent stage. We don’t know all that’s ahead, and the incredible capabilities of these tools are known only to a tiny fraction of society at the moment. It is hard to predict all the ramifications.

But if prior disintermediation moments are any guide, I’d expect change to happen along a few axes.

First, advancement and adoption will spread from horizontal tools to many more specialized verticals. Right now, there’s great advantage to being a skilled “image prompter.” I suspect that, like photography, which initially required real expertise to create even passable results, the engines will get better at delivering remarkable images on the first pass. Time and again in technology, generalized “horizontal” applications have concentrated into an oligopoly of a few tools (e.g., spreadsheets), yet also launched a thousand flowers in much more specialized “vertical” ones (accounting systems and other domain-specific applications). I expect the same pattern here.

We’re still in the generalist period: these horizontal engines stun with their potential, apply to a wide variety of uses, and are getting better and better, yet only a tiny fraction of people know about them. I’d expect thousands of specialty, domain-specific applications and brand names to emerge (book illustration, logo design, storyboard design, automated graphic design for blog posts, and so on). One set of players might generate sketches for book covers, another graphic novels, another video-streaming backgrounds. Not only will this make the training datasets more specific and the outputs even more relevant, it will also let each engine’s brand penetrate a specific customer base and respond to its needs.

Second, many artists will demand compensation and seek to restrict rights to their work. Perhaps new guilds will emerge. A new technology and payments system will likely emerge to allow this to scale. Content generally carries many ancillary rights, and one of those rights will likely become “ingestion rights” or “training-model rights.” I would expect micropayment solutions, or perhaps some form of blockchain-based technology, to let photographers, illustrators and artists protect their work from being ingested into models without compensation. This might emerge as a kind of paywall around individual imagery. As is happening in the world of music, the most powerful and influential creative artists may initiate this trend by cordoning off their entire collective body of work. For instance, the Ansel Adams estate might decide to disallow all ingestion into training models; right now, however, it’s very difficult to prove whether or not particular images were used to train a given model.

Third, regulation might be necessary to protect vital creative ecosystems. If an AI generative machine can create works auctioned at Christie’s for $5 million (and it may well, soon), what does this do to the ecosystem of creators? Regulators may need to protect the ecosystem of creators that feeds these derivative engines, restricting AI model-makers from simply fetching and ingesting any old image.

Fourth, in the near term, skilled “image prompters” are like skilled photographers, web graphic designers, or painters. Today, there is a noticeable difference between those who know how to get the most out of these new tools and those who do not. For the short term, this is likely to “gatekeep” the technology and validate the expertise of designers. I do not expect this to be especially durable, however; the quality of output from very unskilled prompters (e.g., yours truly) already meets or exceeds a lot of the royalty-free art out there from the likes of Envato or Shutterstock.

Conclusion

Machines now seem capable of visual creativity. While their output is often stunning, under the covers, they’re just learning patterns from data and semi-randomly assembling results. The shocking advancements since just 2015 suggest much more change is on the way: human realism, more styles, video, music, dialogue… we are likely to see these engines pass the artistic “Turing test” across more dimensions and domains.

For now, you need to be plugged into geeky circles of Reddit and Discord to try them out. And skill in crafting just the right prompts separates talented jockeys from the pack. But it’s likely that the power will fan out to the masses, with engines of wow built directly into several consumer end-user products and apps over the next three to five years.

We’re in a new era, where it costs pennies to envision dozens of new images visualizing anything from text. Expect some landmark lawsuits to arrive soon on what is and is not derivative work, and whether such machine-learning output can even be copyrighted. For now, especially if you’re in a creative field, it’s good advice to get acquainted with these new tools, because they’re here to stay.

Engines of Wow: Part II: Deep Learning and The Diffusion Revolution, 2014-present

A revolutionary insight in 2015, plus AI work on natural language, unleashed a new wave of generative AI models.

In Part I of this series on AI-generated art, we introduced how deep learning systems can be used to “learn” from a well-labeled dataset. In other words, algorithmic tools can “learn” patterns from data to reliably predict or label things. Now on their way to being “solved” via better and better tweaks and rework, these predictive engines are magical power-tools with intriguing applications in pretty much every field.

Here, we’re focused on media generation, specifically images, but it bears a note that many of the same basic techniques described below can apply to songwriting, video, text (e.g., customer service chatbots, poetry and story-creation), financial trading strategies, personal counseling and advice, text summarization, computer coding and more.

Generative AI in Art: GANs, VAEs and Diffusion Models

From Part I of this series, we know at a high level how we can use deep-learning neural networks to predict things or add meaning to data (e.g., translate text, or recognize what’s in a photo.) But we can also use deep learning techniques to generate new things. This type of neural network system, often comprised of multiple neural networks, is called a Generative Model. Rather than just interpreting things passively or searching through existing data, AI engines can now generate highly relevant and engaging new media.

How? The three most common types of Generative Models in AI are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Diffusion Models. Sometimes these techniques are combined. They aren’t the only approaches, but they are currently the most popular. Today’s star products in art-generating AI are Midjourney by Midjourney.com (diffusion-based), DALL-E by OpenAI (VAE-based), and Stable Diffusion by Stability AI (diffusion-based). It’s important to understand that each of these algorithmic techniques was conceived just in the past six years or so.

My goal is to describe these three methods at a cocktail-party chat level. The intuition behind each is an incredibly clever way of thinking about the problem. There are lots of resources on the Internet which go much further into each methodology, listed at the end of each section.

Generative Adversarial Networks

The first strand of generative-AI models, Generative Adversarial Networks (GANs), have been very fruitful for single-domain image generation. For instance, visit thispersondoesnotexist.com. Refresh the page a few times.

Each time, you’ll see highly* convincing images like this, but never the same one twice:

As the domain name suggests, these people do not exist. This is the computer creating a convincing image, using a Generative Adversarial Network (GAN) trained to construct a human-like photograph.

*Note that for the adult male, it only rendered half his glasses. This GAN doesn’t really understand the concept of “glasses,” simply a series of pixels that need to be adjacent to one another.

Generative Adversarial Networks were introduced in a 2014 paper by Ian Goodfellow et al. That was just eight years ago! The basic idea is that you have two deep-learning neural networks: a Generator and a Discriminator. You can think of them as a Counterfeiter and a Detective, respectively. The Discriminator (Detective) learns to distinguish between genuine articles and counterfeits, and penalizes the Generator for producing implausible results. Meanwhile, the Generator (Counterfeiter) learns to produce plausible data; whenever it “fools” the Discriminator, that output becomes negative training data for the Discriminator. They play a zero-sum game against each other (thus “adversarial”) thousands and thousands of times, and with each adjustment to their weights, the Generator gets better and better at constructing something to fool the Discriminator, and the Discriminator gets better and better at detecting fakes.

The whole system looks like this:

Generative Adversarial Network, source: Google
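To make the adversarial loop concrete, here is a minimal, schematic training step in PyTorch. This is my own sketch rather than code from any particular GAN paper or library; the networks G and D, their optimizers, and real_batch are assumed to be defined elsewhere:

import torch
import torch.nn as nn

def gan_train_step(G, D, real_batch, opt_G, opt_D, latent_dim=100):
    bce = nn.BCEWithLogitsLoss()
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Train the Discriminator (the "Detective"): real images should score 1,
    #    generated counterfeits should score 0.
    z = torch.randn(batch_size, latent_dim)
    fake_batch = G(z).detach()
    d_loss = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Train the Generator (the "Counterfeiter"): it is rewarded whenever the
    #    Discriminator mistakes its output for the real thing.
    z = torch.randn(batch_size, latent_dim)
    g_loss = bce(D(G(z)), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

Each call nudges the Detective a little, then the Counterfeiter a little; repeat over the whole dataset for many epochs and the generated images become increasingly convincing.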

GANs have delivered pretty spectacular results, but in fairly narrow domains. For instance, GANs have been pretty good at mimicking artistic styles (called “Neural Style Transfer“) and Colorizing Black and White Images.

GANs are cool and a major area of generative AI research.

More reading on GANs:

Variational Autoencoders (VAE)

An encoder can be thought of as a compressor of data, and a decoder as a decompressor, something which does the opposite. You’ve probably compressed an image down to a smaller size without losing recognizability. It turns out you can use AI models to compress an image; data scientists call this reducing its dimensionality.

What if you built two neural network models, an Encoder and a Decoder? It might look like this, going from x, the original image, to x’, the “compressed and then decompressed” image:

Variational Autoencoder, high-level diagram. Images go in on the left and come out on the right. If you train the networks to minimize the difference between output and input, you get a compression algorithm of sorts. What’s left in red is a lower-dimensional representation of the images.

So conceptually, you could train an Encoder neural network to “compress” images into vectors, and then a Decoder neural network to “decompress” the image back into something close to the original.

Then, you could consider the red “latent space” in the middle as basically the Rosetta Stone for what a given image means. Run that algorithm over many images, encoding each with the text of its label, and you end up with a condensed encoding of how to render various kinds of images. If you did this across many, many images and subjects, these red vectors would overlap in n-dimensional space, and could be sampled, mixed, and then run through the decoder to generate new images.

With some mathematical tricks (specifically, forcing the latent variables in red to conform to a normal distribution), you can build a system which can generate images that never existed before, but which have some very similar properties to the dataset which was used to train the encoder.
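Here is a heavily simplified VAE sketch in PyTorch; it is an illustration of the idea rather than any production model, and the layer sizes and the 784-pixel flattened “image” are arbitrary assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, image_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent distribution
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, image_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        x_recon = self.decoder(z)
        # Reconstruction loss plus KL divergence from a standard normal distribution
        recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return x_recon, recon + kl

The reparameterization line and the KL term are the “mathematical trick” mentioned above: they keep the latent vectors close to a normal distribution, so that after training you can sample a random z and decode it into a brand-new image.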

More reading on VAEs:

2015: “Diffusion” Arrives

Is there another method entirely? What else could you do with a deep learning system which can “learn” how to predict things?

In March 2015, a revolutionary paper came out from researchers Sohl-Dickstein, Weiss, Maheswaranathan and Ganguli. It was inspired by the physics of non-equilibrium systems: for instance, dropping a drop of food coloring into a glass of water. Imagine you filmed that process of “destruction” and could step through it frame by frame. Could you build a neural network to reliably predict what the reverse might look like?

Let’s think about a massive training set of animal images. Imagine you take an image in your training dataset and create multiple copies of it, each time systematically adding graphic “noise.” Step by step, more noise is added to your image (x), via what mathematicians call a Markov chain (incremental steps). The distortion you apply is normally distributed, i.e., Gaussian noise.

In a forward direction, from left to right, it might look something like this. At each step from left to right, you’re going from data (the image) to pure noise:

Adding noise to an image, left to right. Credit: image from “AI Summer”: How diffusion models work: the math from scratch | AI Summer (theaisummer.com)

But here’s the magical insight behind Diffusion models. Once you’ve done this, what if you trained a deep learning model to predict frames in the reverse direction? Could you predict a “de-noised” image X(t) from its noisier version, X(t+1)? Could you read each step backward, from right to left, and learn the best way to remove noise at each step?

This was the insight in the 2015 paper, albeit with much more mathematics behind it. It turns out you can train a deep learning system to learn how to “undo” noise in an image, with pretty good results. For instance, if you input the pure-noise image from the last step, x(T), and train a deep learning network so that its output should be the previous step, x(T-1), and do this over and over again with many images, you can “train” the network to subtract noise in an image, all the way back to an original image.
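A toy version of that training loop might look like the sketch below. This is my own simplification: real diffusion models use a carefully tuned noise schedule, jump straight to x(t) in closed form rather than looping, and predict the added noise with a U-Net, but the core idea of “learn to undo one noising step” is the same:

import torch
import torch.nn.functional as F

T = 1000                                   # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)      # how much noise to add at each step

def add_noise(x_prev, step):
    """One forward Markov-chain step: mostly keep the image, mix in Gaussian noise."""
    return torch.sqrt(1 - betas[step]) * x_prev + torch.sqrt(betas[step]) * torch.randn_like(x_prev)

def training_step(denoiser, x0, optimizer):
    """Teach the network to predict x(t-1) from the noisier x(t)."""
    t = torch.randint(1, T, (1,)).item()   # pick a random point along the chain
    x_prev = x0
    for step in range(t - 1):              # walk the chain forward to x(t-1)
        x_prev = add_noise(x_prev, step)
    x_t = add_noise(x_prev, t - 1)         # one step noisier: x(t)
    loss = F.mse_loss(denoiser(x_t), x_prev)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()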

Do this enough times, with enough terrier images, say. And then, ask your trained model to divine a “terrier” from random noise. Gradually, step by step, it removes noise from an image to synthesize a “terrier”, like this:

Screen captured video of using the Midjourney chatroom (on Discord) to generate: “terrier, looking up, cute, white background”

Images generated from the current Midjourney model:

“terrier looking up, cute, white background” entered into Midjourney. Unretouched, first-pass output with v3 model.

Wow! Just slap “No One Hates a Terrier” on any of these images above, print 100 t-shirts, and sell it on Amazon. Profit! I’ll touch on some of the legal and ethical controversies and ramifications in the final post in this series.

Training the Text Prompts: Embeddings

How did Midjourney know to produce a “terrier”, and not some other object or scene or animal?

This relied upon another major parallel track in deep learning: natural language processing. In particular, word “embeddings” can be used to get from keywords to meanings. And during the image model training, these embeddings were applied by Midjourney to enhance each noisy-image with meaning.

An “embedding” is a mapping of a chunk of text into a vector of continuous numbers; think of a word as a list of numbers. A textual variable could be a word, a node in a graph, or a relation between nodes in a graph. By ingesting massive amounts of text, you can train a deep learning network to understand relationships between words and entities, and numerically pull out how closely associated some words and phrases are with others. Embeddings can capture the sentiment of an expression in mathematical terms a computer can work with, and embedding models can even interpret semantic relationships between words, the classic example being “king - man + woman ≈ queen.”
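If you want to see that arithmetic for yourself, pretrained word vectors make it nearly a one-liner. The sketch below uses GloVe vectors via the gensim library; the library, the model name and the download size are assumptions about one convenient setup, and have nothing to do with Midjourney’s own pipeline:

import gensim.downloader as api

# Downloads roughly 100-150 MB of pretrained GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman: "queen" typically appears at or near the top of the results.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))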

An example on Google Colab took a vocabulary of 50,000 words in a collection of movie reviews, and learned over 100 different attributes from words used with them, based on their adjacency to one another:


Source: Movie Sentiment Word Embeddings

So, if you simultaneously injected into the “de-noising” diffusion-based learning process the information that this is about a “dog, looking up, on white background, terrier, smiling, cute,” you can get a deep learning network to “learn” how to go from random noise (x(T)) to a very faint outline of a terrier (x(T-1)), to even less faint (x(T-2)) and so on, all the way back to x(0). If you do this over thousands of images, and thousands of keyword embeddings, you end up with a neural network that can construct an image from some keywords.
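How does the text actually steer the de-noiser? Schematically, the prompt’s embedding is simply handed to the network alongside the noisy image at every step. The sketch below is my own illustration, not Midjourney’s architecture (which isn’t public); production diffusion models typically inject the text embedding through attention layers inside a U-Net rather than a simple concatenation:

import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy de-noiser that sees both the noisy image and the prompt embedding."""
    def __init__(self, image_dim=4096, text_dim=512, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, image_dim))

    def forward(self, noisy_image, text_embedding):
        # Concatenate "what it looks like now" with "what it is supposed to be"
        return self.net(torch.cat([noisy_image, text_embedding], dim=-1))

# denoised = model(x_t, embed("terrier, looking up, cute, white background"))
# where embed() is some text-embedding model (a hypothetical helper here).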

Incidentally, researchers have found that T=1000 steps is about all you need in this process, but millions of input images and enormous amounts of computing power are needed to learn how to “undo” noise at high resolution.

Let’s step back a moment to note that this revelation about Diffusion Models was only really put forward in 2015, and improved upon in 2018 and 2020. So we are just at the very beginning of understanding what might be possible here.

In 2021, Dhariwal and Nichol convincingly note that diffusion models can achieve image quality superior to the existing state-of-the-art GAN models.

Up next, Part III: Ramifications and Questions

That’s it for now. In the final Part III of Engines of Wow, we’ll explore some of the ramifications, controversies and make some predictions about where this goes next.

Engines of Wow: AI Art Comes of Age

Advancements in AI-generated art test our understanding of human creativity and laws around derivative art.

While most of us were focused on Ukraine, the midterm elections, or simply returning to normal as best we can, Artificial Intelligence (AI) took a gigantic leap forward in 2022. Seemingly all of a sudden, computers are now eerily capable of human-level creativity. Natural language agents like GPT-3 are able to carry on an intelligent conversation. GitHub CoPilot is able to write major blocks of software code. And new AI-assisted art engines with names like Midjourney, DALL-E and Stable Diffusion delight our eyes, but threaten to disrupt entire creative professions. They raise important questions about artistic ownership, derivative work and compensation.

In this three-part blog series, I’m going to dive in to the brave new world of AI-generated art. How did we get here? How do these engines work? What are some of the ramifications?

This series is divided into three parts:

[featured image above: “God of Storm Clouds” created by Midjourney AI algorithm]

But first, why should we care? What kind of output are we talking about?

Let’s try one of the big players, the Midjourney algorithm. Midjourney lets you play around in their sandbox for free for about 25 queries. You can register for free at Midjourney.com; they’ll invite you to a Discord chat server. After reading a “Getting Started” agreement and accepting some terms, you can type in a prompt. You might go with: “/imagine portrait of a cute leopard, beautiful happy, Gryffindor outfit, super detailed, hyper realism.”

Wait about 60 seconds, choose one of the four samples generated for you, click the “upscale” button for a bigger image, and voila:

image created by the Midjourney image generation engine, version 4.0. Full prompt used to create it was “portrait of a cute leopard, Beautiful happy, Gryffindor Outfit, white background, biomechanical intricate details, super detailed, hyper realism, heavenly, unreal engine, rtx, magical lighting, HD 8k, 4k”

The Leopard of Gryffindor was created without any human retouching. This is final Midjourney output. The algorithm took the text prompt, and then did all the work.

I look at this image, and I think: Stunning.

Looking at it, I get the kind of “this changes everything” feeling, like the first time I browsed the world-wide web, spoke to Siri or Alexa, rode in an electric vehicle, or did a live video chat with friends across the country for pennies. It’s the kind of revolutionary step-function that causes you to think “this will cause a huge wave of changes and opportunities,” though it’s not even clear what they all are.

Are artists, graphic designers and illustrators doomed? Will these engines ultimately help artists or hurt them? How will the creative ecosystem change when it becomes nearly free to go from idea to visual image?

Once mainly focused on processing existing images, computers are now extremely capable of generating brand-new things. Before diving into a high-level overview of these new generative AI art algorithms, let me emphasize a few things: First, no artist has ever created exactly the above image before, nor will it likely be generated again. That is, Midjourney and its competitors (notably DALL-E and Stable Diffusion) aren’t search engines: they are media creation engines.

In fact, if you typed this exact prompt into Midjourney again, you’d get an entirely different image, yet one which is also likely to deliver on the prompt fairly well.

There is an old joke within Computer Science circles that “Artificial Intelligence is what we call things that aren’t working yet.” That’s now sounding quaint. AI is all around us, making better and better recommendations, completing sentences, curating our media feeds, “optimizing” the prices of what we buy, helping us with driving assistance on the road, defending our computer networks and detecting spam.

Part I: The Artists in the Machine, 1950-2015+

How did this revolutionary achievement come about? Two ways, just as bankruptcy came about for Mike Campbell in Hemingway’s The Sun Also Rises: First gradually. Then suddenly.

Computer scientists have spent more than fifty years trying to perfect art generation algorithms. These five decades can be roughly divided into two distinct eras, each with entirely different approaches: “Procedural” and “Deep Learning.” And, as we’ll see in Part II, the Deep-Learning era had three parallel but critical deep learning efforts which all converged to make it the clear winner: Natural Language, Image Classifiers, and Diffusion Models.

But first, let’s rewind the videotape. How did we get here?

Procedural Era: 1970’s-1990’s

If you asked most computer users, the naive approach to generating computer art would be to try to encode various “rules of painting” into software programs, via the very “if this then that” kind of logic that computers excel at. And that’s precisely how it began.

In 1973, American computer scientist Harold Cohen, resident at Stanford University’s Artificial Intelligence Laboratory (SAIL), created AARON, the first computer program dedicated to generating art. Cohen was both an accomplished, talented artist and a computer scientist. He thought it would be intriguing to try to “teach” a computer how to draw and paint.

His thinking was to encode various “rules about drawing” into software components and then have them work together to compose a complete piece of art. Cohen relied upon his skill as an exceptional artist, and coded his own “style” into his software.

AARON was an artificial intelligence program first written in the C programming language (a low-level language compiled for speed), and later in LISP (a language designed for symbolic manipulation). AARON knew about various rules of drawing, such as how to “draw a wavy blue line, intersecting with a black line.” Later, constructs were added to combine these primitives together to “draw an adult human face, smiling.” By 1995, Cohen had added rules for painting color within the drawn lines.

Though there were aspects of AARON which were artificially intelligent, by and large computer scientists would call his a procedural approach. Do this, then that. Pick up a brush, pick an ink color, and draw from point A to B. Construct an image from its components. Join the lines. And after a few decades of work, Cohen created some really nice pieces, worthy of hanging on a wall. You can see some of them at the Computer History Museum in Mountain View, California.

In 1980, AARON was able to generate this:

Detail from an untitled AARON drawing, ca. 1980, via Computer History Museum

By 1995, Cohen had encoded rules of color, and AARON was generating images like this:

img
The first color image created by AARON, 1995. via Computer Museum, Boston, MA

Until quite recently, other attempts at AI-generated art were flat-looking and derivative, like this image from 2019:


Twenty-seven years after AARON’s first AI-generated color painting, algorithms like Midjourney are quickly rendering photorealistic images from text prompts. But to accomplish this, the primary method is completely different.

Deep Learning Era (1986-Present)

Algorithms which can create photorealistic images-on-demand are the culmination of multiple parallel academic research threads in learning systems dating back several decades.

We’ll get to the generative models which are key to this new wave of “engines of wow” in the next post, but first, it’s helpful to understand a bit about their central component: neural networks.

Since about 2000, you have probably noticed everyday computer services making massive leaps in predictive capabilities; that’s because of neural networks. Turn on Netflix or YouTube, and these services will serve up ever-better recommendations for you. Or, literally speak to Siri, and she will largely understand what you’re saying. Tap on your iPhone’s keyboard, and it’ll automatically suggest which letters or words might follow.

Each of these systems relies upon trained prediction models built by neural networks. And to envision them, computer scientists and mathematicians had to radically shift their thinking away from the procedural approach. A branch of them did so first in the 1950s and ’60s, and then again in a machine-learning renaissance which began in earnest in the mid-1980s.

The key insight: these researchers speculated that instead of procedural coding, perhaps something akin to “intelligence” could be fashioned from general purpose software models, which would algorithmically “learn” patterns from a massive body of well-labeled training data. This is the field of “machine learning,” specifically supervised machine learning, because it’s using accurately pre-labeled data to train a system. That is, rather than “Computer, do this step first, then this step, then that step”, it became “Computer: learn patterns from this well-labeled training dataset; don’t expect me to tell you step-by-step which sequence of operations to do.”

The first big step began in 1958. Frank Rosenblatt, a researcher at Cornell University, created a simplistic precursor to neural networks, the “Perceptron,” basically a one-layer network consisting of visual sensor inputs and software outputs. The Perceptron system was fed a series of punchcards. After 50 trials, the computer “taught” itself to distinguish those cards which were marked on the left from cards marked on the right. The computer which ran this program was a five-ton IBM 704, the size of a room. By today’s standards, it was an extremely simple task, but it worked.

A single-layer perceptron is the basic component of a neural network. A perceptron consists of input values, weights and a bias, a weighted sum and activation function:

Frank Rosenblatt and the Perceptron system, 1958

Rosenblatt described it as the “first machine capable of having an original idea.” But the Perceptron was extremely simplistic; it merely added up the optical signals it detected to “perceive” dark marks on one side of the punchcard versus the other.
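In modern terms, Rosenblatt’s machine boils down to a few lines of arithmetic: weight the inputs, sum them with a bias, and apply a step activation. Here is a plain-Python sketch, purely illustrative, since the 1958 version was wired hardware rather than software:

def perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus a bias, passed through a step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

def train(samples, labels, weights, bias, lr=0.1, epochs=50):
    """Classic perceptron learning rule: nudge weights whenever a prediction is wrong."""
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - perceptron(x, weights, bias)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

That is enough machinery to learn “marked on the left” versus “marked on the right,” which is essentially what the punchcard demonstration did.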

In 1969, MIT’s Marvin Minsky, whose father was an eye surgeon, wrote convincingly that neural networks needed multiple layers (like the optical neuron fabric in our eyes) to really do complex things. But his book Perceptrons, though well-respected in hindsight, didn’t spur much multi-layer progress at the time. That’s partially because, during the intervening decades, the computing power required to “learn” more complex things via multi-layer networks was out of reach. But time marched on, and over the next three decades, computing power, storage, languages and networks all improved dramatically.

From the 1950’s through the early 1980’s, many researchers doubted that computing power would be sufficient for intelligent learning systems via a neural network style approach. Skeptics also wondered if models could ever get to a level of specificity to be worthwhile. Early experiments often “overfit” the training data and simply output the input data. Some would get stuck on local maxima or minima from a training set. There were reasons to doubt this would work.

And then, in 1986, Carnegie Mellon Professor Geoffrey Hinton, whom many consider the “Godfather of Deep Learning” (go Tartans!), demonstrated that neural networks could learn to predict shapes and words by statistically “learning” from a large, labeled dataset. Hinton’s revolutionary 1986 breakthrough was the concept of “backpropagation,” which adds multiple (hidden) layers to the model and iterates through the network, using the output of one or more mathematical functions to adjust the weights and minimize the “loss,” or distance from the expected output.

This is rather like the golfer who adjusts each successive golf swing, having observed how far off their last shots were. Eventually, with enough adjustments, they calculate the optimal way to hit the ball to minimize its resting distance from the hole. (This is where terms like “loss function” and “gradient descent” come in.)
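The golfer analogy fits in a few lines of code. This toy sketch (my own illustration, not Hinton’s formulation) repeatedly nudges a single weight downhill against the gradient of a loss; backpropagation applies the same correction, in parallel, to every weight in a multi-layer network:

def gradient_descent(loss_grad, w=0.0, learning_rate=0.1, steps=100):
    """Repeatedly adjust w in proportion to how far off it is (the gradient)."""
    for _ in range(steps):
        w -= learning_rate * loss_grad(w)   # step downhill, against the gradient
    return w

# Example: minimize the loss (w - 3)^2, whose gradient is 2 * (w - 3).
# Each "swing" overshoots less than the last; w converges toward 3.
print(gradient_descent(lambda w: 2 * (w - 3)))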

In 1986-87, around the time of the 1986 Hinton-Rumelhart-Williams paper on backpropagation, the whole AI field was in flux between these procedural and learning approaches, and I was earning a Master’s in Computer Science at Stanford, concentrating in “Symbolic and Heuristic Computation.” I had classes which dove into AARON-style symbolic, procedural AI, and a few classes touching on neural networks and learning systems. (My master’s thesis was on getting a neural network to “learn” how to win the Tower of Hanoi game, which requires apparent backtracking to win.)

In essence, you can think of a neural network as a fabric of software-represented units (neurons) waiting to soak up patterns in data. The methodology to train them is: “here is some input data and the output I expect, learn it. Here’s some more input and its expected output, adjust your weights and assumptions. Got it? Keep updating your priors. OK, let’s keep doing that.” Like a dog learning what “sit” means (do this, get a treat / don’t do this, don’t get a treat), neural networks are able to “learn” over iterations, by adjusting the software model’s weights and thresholds.

Do this enough times, and what you end up with is a trained model that’s able to “recognize” patterns in the input data, outputting predictions, or labels, or anything you’d like classified.

A neural network, and in particular the special type of multi-layered network called a deep learning system, is “trained” on a very large, well-labeled dataset (i.e., with inputs and correct labels.) The training process uses Hinton’s “backpropagation” idea to adjust the weights of the various neuron thresholds in the statistical model, getting closer and closer to “learning” the underlying pattern in the data.

For much more detail on Deep Learning and the mathematics involved, see this excellent overview:

Deep Learning Revolutionizes AI Art

We’ll rely heavily upon this background of neural networks and deep learning for Part II: The Diffusion Revolution. The AI art revolution uses deep learning networks to interpret natural language (text to meaning), classify images, and “learn” how to synthetically build an image from random noise.

Gallery

Before leaving, here are a few more images created from text prompts on Midjourney:

You get the idea. We’ll check in on how deep learning enabled new kinds of generative approaches to AI art, called Generative Adversarial Networks, Variational Autoencoders and Diffusion, in Part II: Engines of Wow: Deep Learning and The Diffusion Revolution, 2014-present.

Neural Style Transfer – Current Models

I’m working on a neural-style transfer project, and have several machine learning models trained to render input photos in particular styles.

The current set is below; input image on the left, output image on the right, with model name in lower right hand corner.

I’ve got a few clear favorites, but I’d love to see if they match yours. Which 3-5 do you like?

[Gallery of 42 numbered examples follows, each showing the input photo on the left and the stylized output on the right, with the model name in the lower right-hand corner.]

Elektro, the Smoking Robot of 1937

I’ve always been fascinated by past visions of the future. Science fiction uses the future to tell us something about ourselves, so looking back on past visions of the future, we can learn something about that age and the values, myopia, optimism and fears of the time. It’s also healthy to continually do cross-checks on “how accurate was that prediction” and “what did we miss?” so that we can improve the accuracy of futuristic predictions over time.

Lost in the drama and bloodshed of the WWII age is the story of Elektro, the Smoking Robot.

In an era when we should have been much more focused on the rise of authoritarianism and threats to freedom, we human beings actually built, at great time and expense, a robot that could respond to basic voice commands, talk, distinguish red from green, do confined movements and smoke a cigarette.

Built by Westinghouse in Mansfield, Ohio, in 1937, Elektro was a 7-foot, 250-pound star of the 1939 World’s Fair. Elektro responded to the operator’s voice commands using basic syllabic recognition, and his chest cavity lit up as each word was recognized. Each word set up vibrations which were converted into electrical impulses, which in turn operated the relays controlling eleven motors. What mattered was how many impulses were sent by the operator, not what was actually said.

Check out this video to see a full demonstration of what Elektro could do, from The Middleton Family at the New York World’s Fair:

The Tin Man was to make his appearance on film that year, in the 1939 release The Wizard of Oz.

Meanwhile, across the Atlantic, Hitler was set to invade Poland. Alan Turing was off taking mathematics seminars with Wittgenstein in Cambridge, England. His Enigma decoding efforts had not yet begun. But those efforts would, within three years, help usher in the age of computing and the programmable machine.

Elektro didn’t house any real software, aside from pre-recorded audio. He also didn’t learn anything — what Elektro could do was entirely predetermined by engineers through circuitry, relays and actuators.

Elektro could:

  • “Recognize” basic spoken words — actually, just distinguish between the number of impulses
  • Do basic audio output (via 78rpm record player)
  • “Walk” and move his hands (thanks to nine motors)
  • Recognize red or green
  • …and of course, smoke

A series of properly spaced words selected the movement Elektro was to make. His fingers, arms and turntable for talking were operated by nine motors, while another small motor worked the bellows so the giant could smoke. The eleventh motor drove the four rubber rollers under each foot, enabling him to walk. He relied on a series of record players, photovoltaic cells, motors and telephone relays to carry out his actions. He was capable of performing 26 routines (movements), and had a vocabulary of 700 words. Sentences were formulated by a series of 78 RPM record players connected to relay switches.

Elektro did his talking by means of recordings, thanks to eight embedded turntables, each of which could be used to give 10-minute talks. Except for an opening talk of about a minute, his other speeches were only a few seconds long. A solenoid, activated by electrical impulses in proportion to the harshness or softness of the spoken words, made Elektro’s aluminum lips move in rhythm with his speech-making.

Millions stood in line for as many as three hours to watch Elektro during his 20-minute performances at the 1939-40 World’s Fair in New York City.

The hole in Elektro’s chest was deliberate, since Westinghouse wanted visual proof that no one was inside. As commands were spoken to him, one of two lightbulbs in his chest would flash, letting the operator know he was receiving the signals. He could turn his head side to side and up and down. He talked and his mouth opened and closed. His arms moved independently with articulated fingers.

He also smoked. An embedded bellows system let him puff on a cigarette, which was lit by his operators. Apparently, one of the operators trained to work Elektro (John Angel, shown below) used to smoke a pipe, but then quit when he saw how much buildup was in Elektro during the cleaning after each day.

Elektro was later joined by a robotic dog, Sparko:

After the World’s Fair, the two embarked on a cross-country journey. Apparently, a female companion was planned for Elektro, but when World War II broke out, aluminum was in short supply, Westinghouse was needed on many projects, and the plans to build one were cancelled.

Monitor Shell Status Remotely with Seashells.io

Now that I’m knee-deep in machine learning models, I’m finding there are several times where I need to let my CPU/GPU crank away on a long-running “training” task for hours at a time, and I’d like to be able to check their status from afar.

The handy, free and cleverly-named tool seashells.io (“See shells”) makes this easy.

You can pipe any terminal output to it as follows, via netcat:

echo 'Hello, Seashells!' | nc seashells.io 1337

You’ll get back a short seashells.io URL, and if you visit that URL from any web browser (including on your mobile phone of course), you can see the output status.

Even simpler, get the handy seashells client written by Anish Athalye through pip install:

pip install seashells

And then, for instance, a long-running command like:

python train.py | seashells

will send the training log to a URL that you can bookmark, and monitor status. Before you leave your machine, simply grab the URL via the handy QR Code Generator.

Applying Artist Styles to Photographs with Neural Style Transfer

In 2015, a research paper by Gatys, Ecker and Bethge posited that you could use a deep neural network to apply the artistic style of a painting to an existing image and get amazing results, as though the artist had rendered the image in question.

Soon after, a terrific and fun app was released to the app store called Prisma, which lets you do this on your phone.

How do they work?

There’s a comprehensive explanation of two different methods of Neural Style Transfer here on Medium; I won’t attempt to reproduce the explanation here, because he does such a thorough job.  The author, Subhang Desai, explains that there are two basic approaches, the slow “optimization forward” approach (2015) and the much faster “feedforward” approach where styles are precomputed (2016.)

On the first “straightforward” approach, there are two main projects that I’ve found — one based on Pytorch and one based on Tensorflow. Frankly, I found the Pytorch-based project insanely difficult to configure on a Windows machine (I also tried on a Mac) — so many missing libraries and things that had to be compiled. The project was originally built for specific Linux-based configurations and made a lot of assumptions about how to get the local machine up and running.

But the second project (the one linked above) is based on Google’s Tensorflow library, and is much easier to set up, though from Github message board comments I conclude it’s quite a bit slower than the Pytorch-based project.

On-the-fly “Optimization” Approach

As Desai explains, the most straightforward approach is to do an on-the-fly paired learning of two images — the style image and the photograph.

The neural network learning algorithm pays attention to two loss scores, which it mathematically tries to minimize by adjusting weights:

  • (a) How close the generated image is to the style of the artist, and
  • (b) How close the generated image is to the original photograph.

In this way, by iterating multiple times over newly generated images, the code generates images that are similar to both the artistic style and the original image — that is, it renders details of the photograph in the “style” of the image.
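In code, the heart of the approach is a single weighted objective. The sketch below is schematic and my own simplification; the Gatys et al. and Athalye implementations derive these terms from VGG feature maps and Gram matrices, and the weights shown are arbitrary assumptions:

import torch.nn.functional as F

def total_loss(gen_content_feat, photo_content_feat,
               gen_style_grams, painting_style_grams,
               content_weight=1.0, style_weight=100.0):
    # (a) stay close to the photograph's content features...
    content_loss = F.mse_loss(gen_content_feat, photo_content_feat)
    # (b) ...while matching the painting's style (Gram matrices of feature maps)
    style_loss = sum(F.mse_loss(g, s)
                     for g, s in zip(gen_style_grams, painting_style_grams))
    # The optimizer adjusts the generated image's pixels to shrink this weighted sum.
    return content_weight * content_loss + style_weight * style_loss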

I can confirm that this “optimization” approach of iterating through generated images takes a long time. To get reasonable results, it took about 500+ iterations. The example image below took 1 hour and 23 minutes to render on a fast machine equipped with a 6GB NVIDIA Titan 780 GPU.

I’ve used the neural-style transfer Tensorflow code written by Anish Athalye to transform this photo:

…and this artistic style:

…and, with 1,000 iterations, it renders this:

Faster “Feedforward” Precompute-the-Style Approach

The second and much faster approach is to precompute the filter based on artist styles (paper). That appears to be the way that Prisma works, since it’s a whole lot faster.

I’ve managed to get Pytorch installed and configured properly, and don’t need any of the luarocks dependencies and hassle of the main Torch library. In fact, a fast_neural_style transfer example is available via the Pytorch install, in the examples directory.

Wow! It worked in about 10 seconds (on Windows)!

Applying the image with the “Candy” artistic style rendered this image:

Here’s a Mosaic render:

…also took about 5 seconds or so. Amazing. The pre-trained model is so much faster! But on Windows, I had a devil of a time trying to get the actual training of new style models working.

Training New Models (new Artist Styles)

This whole project (as well as other deep learning and data science projects) inspired me to get a working Ubuntu setup going. After a couple of hours, I had Ubuntu 18.04 installed successfully, and I’m now dual-booting my desktop machine.

The deep learning community and libraries are mostly Linux-first.

After setting up Ubuntu on an NVIDIA-powered machine, installing PyTorch and various libraries, I can now run the faster version of this neural encoding.

Training “Red Balloon” by Paul Klee

To train a new model, you have to take a massive set of input training images, a “style” painting, and you tell the script to effectively “learn the style”. This iteratively tries to minimize the weighted losses between the original input image and output image and the “style” image and the output image.

During the training of new models (by default, two “epochs”, or iterations through the image dataset), you can see the loss score for content and style (as well as a weighted total). Notice that the total is declining on the right — the result of the training using gradient-descent in successive iterations to minimize the overall loss.


I had to install CUDA, the parallel-processing library written by the clever folks at NVIDIA. It allows tensor code (matrix math) to be parallelized, harnessing the incredible power of the GPU and dramatically speeding up the process. So far, CUDA is the de facto “machine learning for the masses” GPU library; none of the other major graphics chip makers have widely used equivalents.

Amazingly, once you have a trained artist-style model — which took about 3.5 hours per input style on my machine — each rendered image in the “style” of an artist takes about a second to render, as you can see in the demo video below. Cool!

For instance, I’ve “trained” the algorithm to learn the following style (Paul Klee’s Red Balloon):


And now, I can take any input image — say, this photo of the Space Needle:


And run it through the pytorch-based script, and get the following output image:


Total time:

(One-time) Model training learning the “Paul Klee Red Balloon Style”: 3.4 hours

Application of Space Needle Transform: ~1 second

Another Example

Learning from this style:

rendering the Eiffel Tower:

looks like this:

Training the Seurat artist model took about 2.4 hours, but once done, it took about 2 seconds to render that stylized Eiffel Tower image.

I built a simple test harness in Angular with a Flask (Python) back-end to demonstrate these new trained models, and a bash script to let me train new models from images in a fairly hands-off way.

Note how fast the rendering is once the model is complete. Each image is generated on the fly from a Python-powered API based on a learned model, and the final images are not pre-cached:

Really very cool!

 

original image:

style:

“Rain Princess”

Output Image:

Survivorship Bias

In WWII, researcher Abraham Wald was assigned the task of figuring out where to place more reinforcing armor on bombers. Since every extra pound meant reduced range and agility, optimizing these decisions was crucial. So he and his team looked at a ton of data from returning bombers, noting the bullet hole placement.

They came up with numerous diagrams that looked like this:


Most of his team members observed “Wow! Look at all those bullet holes in the center of the fuselage and on the wing tips! The armor clearly ought to go there, because those are the areas that are most marked-up in red!”

But Wald realized that they were only looking at the bombers which SURVIVED. He correctly argued that the marked-up areas were precisely the damage a plane could survive, while the areas NOT marked by bullet holes were the ones where hits proved fatal: those planes never made it back to be counted. In doing so, he helped us understand “survivorship bias”: if we only sample from the successful outcomes, we never see the crucial factors that caused failure, which in many cases are the most important factors of all.

Such survivorship bias can lead to conclusions and strategies which are precisely the opposite of optimal, so pay attention to the datapoints that you may have already artificially and incorrectly eliminated. 

Machine Learning/AI for Kids: Resources

I’m on a parent advisory committee at my daughter’s school. The committee is taking a look at the school’s existing Computational Thinking curriculum and where it might want to head in the future.

Luckily for us, the faculty is already doing a very good job with the curriculum. So our role as advisers is to provide a sounding board and perhaps additional guidance regarding ways they might want to augment the program. Key topics not yet addressed much in the existing computing curriculum are Machine Learning, Deep Learning and Artificial Intelligence. These are pretty advanced fields, but becoming so essential to both the world we live in today and the one we will experience in the future. So what kinds of things might be useful to introduce and explore at the middle school (grades 6-8) and high school (grades 9-12) level?

What’s There, What Might Be Missing

At the school, they’re already introducing many central concepts of computing, like breaking down problems into smaller problems, basic algorithms, data modeling, abstraction, and testing. They’re teaching basic circuits, robotics, website creation, JavaScript, HTML, Python basics and more.

In the current era, understanding data is absolutely essential: what makes a good dataset, how to gather data, the ethics involved in data gathering, basic statistics, the difference between correlation and causation, how to “clean” a dataset, how to separate out a “training” and a “validation” dataset, what signals help us make an educated guess as to whether we should trust a dataset or not, and more.

Next, a basic understanding of how machine learning works is useful, because it builds a better picture of both what’s possible and what the limitations might be. Understanding at a very basic level what we mean by terms like “deep learning”, “machine learning” and “artificial intelligence” is helpful, because these terms come up in the news a great deal, and they might also make great career choices for many students (and also hint at fields ripe for disruption and potential decline).

One participant also pointed out that an understanding of current agile development practices is helpful. Agile is a development process and philosophy that emphasizes flexibility, all-team focus, constant feedback and continuous updates. It typically includes components like source control, short work bursts called “sprints” where everyone works toward a specific set of short/medium-term goals, regular reviews and adaptability. Some of these tools and techniques (e.g., version control, “stand-up” reviews, “minimum viable product”, iteration and measurement via A/B testing) can be quite useful in group projects outside of the computing world. And simply browsing through GitHub and seeing what people are working on can open up a world of possibilities, so it’s good to know how to explore it, and that it’s right there and available to anyone with access to a computer.

Another basic piece of the conceptual puzzle: Application Programming Interfaces (APIs). APIs are how computer services and devices talk to one another “across the wire,” and because they usually have small services behind them (“microservices”), they can best be thought of as the LEGO building blocks of today’s applications and the “Internet of Things.” My hunch is that once students fully understand that pretty much any of the APIs they run across can be composed together to form one big solution, that conceptual understanding unlocks a gigantic world of possibility. (Just a few of the many APIs in the machine learning field are listed below.)
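To make the “LEGO blocks” idea concrete, here is a tiny sketch of what calling a machine-learning API over the wire generally looks like in Python. The URL, API key and response fields are hypothetical placeholders, not a real service; most commercial APIs follow this same pattern of POSTing some JSON and getting JSON back:

```python
# Hypothetical example of calling a sentiment-analysis API "across the wire".
# The URL, API key and response fields are placeholders, not a real service.
import requests

API_URL = "https://api.example.com/v1/sentiment"   # placeholder endpoint
API_KEY = "your-api-key-here"

def get_sentiment(text):
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["sentiment"]            # e.g. "positive" / "negative"

print(get_sentiment("I loved this class!"))
```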

Machine Learning Resources

As for actually getting your hands dirty and building out an intelligent algorithm or two inside or outside of class, the likelihood of success certainly depends upon the students and their interest. Are there canonical, interesting and accessible examples to introduce these topics? We in the committee (including the key faculty members) certainly think so. With that in mind, here’s a short running list of projects and videos in the computing world that might be interesting to educators and kids alike.

Introductory Resources

Google Quick Draw

This is a fun Pictionary-style game where it’s the computer that does the guessing, and it introduces how repeated input data is used to train a machine to “predict” (classify) output based on input. It might be a fun way to raise questions about how it’s done — it seems magical.

Questions for class:

  • How does it work?
  • How do you think they built this? What data and tools might you need if you wanted to make your own?
Authors: Jonas Jongejan and Henry Rowley, @kawahima_san @cmiscm and @n1ckfg

Microsoft AI Demos Area

Microsoft Corporation has several great interactive playgrounds. You can experiment with text analytics (including sentiment analysis), speech authentication, face and emotion recognition, route planning, language understanding and more.

ML-Playground

A terrific playground to experiment with classification algorithms (k-means clustering, support vector machines and more) is at http://ml-playground.com/. You can plot two colors of points on a 2D (x, y) graph, and then apply a few algorithms to visually see how well they recognize “clusters” of like-points. Excellent and free.
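You can recreate the same kind of experiment in a few lines of Python with scikit-learn (one of the libraries mentioned later in this post). This is just a sketch of the idea, not ml-playground’s own code: generate two clouds of 2D points and see how well a support vector machine separates them.

```python
# Sketch of the ml-playground idea: two colors of 2D points, one classifier.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two clusters of (x, y) points, labeled 0 and 1
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

clf = SVC(kernel="rbf")          # support vector machine classifier
clf.fit(X, y)

# Ask the model to classify a few new points
print(clf.predict([[0, 0], [5, 5]]))
print("training accuracy:", clf.score(X, y))
```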

ML Showcase

A fun meta-site that rolls up a list of machine learning resources is the ML Showcase. Check it out.

Amazon Machine Learning APIs

Amazon also has a very large set of useful machine learning APIs, but in my cursory look they are short on free “playground” demo areas (most sit behind a paywall), so they might not be the best fit for a classroom as of this writing.

Create Music with Machine Learning

Fun app: For those musically inclined, check out Humtap on iOS. Hum into your phone and tap the phone, and the AI will create a song based on your input.

Programming Tools

Machine Learning for Kids (Scratch + IBM Watson, free tier). Scratch is a great, free programming environment for kids which grew out of the Media Lab at MIT. This Machine Learning for Kids project is a very clever and surprisingly powerful extension to the Scratch programming language written by Dale Lane, an interested parent. It brings the power of the IBM Watson engine into Scratch by presenting machine-learning building blocks such as text classifiers and image recognizers. These visual drag-and-drop blocks can then be connected into a Scratch program. Fun examples include:

  • An insult vs. compliments recognizer (video below)
  • Rock-scissors-paper guessing game
  • A dog vs. cat picture recognizer

I’ve wired up the compliments vs. insult recognizer on my own desktop, and it’s a very good overview of the promise and pitfalls when trying to build out a machine learning (classifier) model. I was impressed with the design and documentation of the free add-on, and it makes playing around with these tools a natural extension to any curriculum that’s already incorporating Scratch. I can imagine that for many middle schoolers and high schoolers, coming up with a list of insults to “train” the model would be quite fun.
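To see the same idea outside of Scratch, here is a small Python sketch of an insult-vs-compliment text classifier using scikit-learn. This is not the Scratch extension itself, just the same concept; the handful of training phrases below are invented, and a real classroom model would need many more examples, which is exactly the lesson.

```python
# Tiny text classifier sketch: compliments vs. insults.
# The training phrases are invented examples; more (and better) data = better model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

phrases = ["you are wonderful", "great job today", "what a kind thing to say",
           "you are terrible", "that was a dumb idea", "nobody likes you"]
labels  = ["compliment", "compliment", "compliment",
           "insult", "insult", "insult"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(phrases, labels)                # "train" the classifier on labeled examples

print(model.predict(["you did a great job", "that was dumb"]))
```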

Real-World Examples

Perhaps you’d like to begin with a list of real-world examples for machine learning? Examples abound:

  • Voice devices like Amazon Alexa (Echo), Siri, Cortana and Google Assistant
  • Netflix, Spotify and Pandora Recommendations
  • Spam/ham email detection (how does your computer know it’s junk email?)
  • Automatic colorization of B&W Images
  • Amazon product recommendations
  • Machine translation — Check out the amazing new Skype Translator
  • Synthetic video
  • Twitter, Facebook, Snapchat news feeds
  • Weather forecasting
  • Optical character recognition, and more specifically ZIP code recognition (the canonical MNIST example of machines which recognize handwritten digits)
  • Videogame automated opponents
  • Self-driving technology
  • Google Search (which results to show you first, text analysis, etc.)
  • Antivirus software

What these solutions all have in common is a Machine Learning engine that ingests vast amounts of data, has known-good outcomes, a training set, a validation set, and a set of algorithms used to programmatically guess the best output given a set of inputs.
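That common pattern (learn from a training set, then check the guesses against a held-out validation set) fits in a few lines of scikit-learn. This is a sketch on synthetic data, not the code behind any of the products above:

```python
# The common pattern: train on one slice of the data, validate on another.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                                # learn from the training set
print("validation accuracy:", model.score(X_val, y_val))   # check on held-out data
```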

Data Science

I’ll likely do a separate set of posts on introducing Data Science to kids, but in the meantime, I wanted to mention one dataset here.

“Hello World” for Data Science: Titanic Survival

Machine learning is about learning from data, so Data Science is a direct cousin to (and overlaps heavily with) both Machine Learning and Deep Learning. A machine learning algorithm is only as good as its training and validation data, and students need to become familiar with how to recognize valid vs. invalid data, what data is the right kind to include vs. exclude, how to clean and augment data, and more. Tools of the trade vary, but the Python data analysis stack (libraries such as pandas, numpy and scikit-learn) is becoming a lingua franca of the field.

There are several datasets that are interesting ways to introduce data analysis, but one of my favorites is the Titanic dataset (for high schoolers: data science, predictions and AI). Few events in history can match the drama, scale and both social and engineering lessons contained in the Titanic disaster. Would you have survived? What would your odds have been? What is the difference between correlation and causation? You can actually make a prediction as to a passenger’s likelihood of survival based on their class of service, gender, age, point of embarkation and more. While not strictly “machine learning” per se, this dataset introduces the basic building blocks of machine learning: data, features, and labeled outcomes.

By doing this exercise, you lay the groundwork for much better insight into how machines can use lots of data, and features in that data, to begin to make predictions. Machine learning is about training computers to recognize patterns from data, and this is a great “Hello World” for data science. (For schools that want to introduce topics of privilege, diversity and the social structure of an era, it’s also an avenue to discuss those issues using data.)
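Here is a sketch of what a first pass at that exercise can look like in pandas. The CSV path is a placeholder; the commonly distributed copies of the dataset use column names like Survived, Pclass, Sex and Age, but verify them in your own file before running.

```python
# "Hello World" for data science: survival rates in the Titanic dataset.
# 'titanic.csv' is a placeholder path; column names follow the commonly-used copy
# of the dataset (Survived, Pclass, Sex, Age); verify them in your own file.
import pandas as pd

df = pd.read_csv("titanic.csv")

# What fraction of passengers survived overall?
print("overall survival rate:", df["Survived"].mean())

# Break it down by gender and by class of service
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby(["Pclass", "Sex"])["Survived"].mean())

# A first "feature": children under 10 vs. everyone else
df["is_child"] = df["Age"] < 10
print(df.groupby("is_child")["Survived"].mean())
```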

Machine Learning Explained Simply

Terrific Hands-on Lab (Intermediate Learners): Google Machine Learning Recipes

What is Machine Learning? (Google)

What is a Neural Network?

This superb and accurate video takes the classic MNIST dataset (which is about getting the computer to correctly “recognize” handwritten digits) and walks through how it’s done. About 3/4 of the way through, it starts getting into matrix/vector math, which is likely beyond most high school curricula, but it’s very thorough in its explanation:
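If the matrix/vector math in that video feels abstract, the core operation is small enough to sketch in a few lines of numpy: flatten a 28×28 image into a vector of 784 pixel values, multiply by a weight matrix, add a bias, and squash the result into probabilities. This is only the shape of one layer’s computation with random weights, not a trained digit recognizer:

```python
# One layer of an MNIST-style network, with random (untrained) weights:
# 784 pixel inputs -> 10 outputs, one per digit.
import numpy as np

image = np.random.rand(28, 28)        # stand-in for a handwritten digit
x = image.reshape(784)                # flatten to a 784-long vector

W = np.random.randn(10, 784) * 0.01   # weights: 10 rows of 784 values
b = np.zeros(10)                      # one bias per output

z = W @ x + b                         # the matrix/vector math from the video
scores = np.exp(z) / np.exp(z).sum()  # softmax: turn raw scores into probabilities

print(scores.argmax(), scores.round(3))
```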

A Pioneer of Modern Machine Learning

Advanced (but Fascinating) Videos and Projects

Got advanced students interested in more? There’s so much out there. I’m currently going through the amazing, free Fast.ai course, a really good overview that includes fun projects like a cats vs. dogs recognizer, text sentiment analysis and more. There are also fun projects on GitHub like DeOldify, which attempts to programmatically colorize black & white photos.

What do (Convolutional) Neural Networks “see”?  

Neural networks “learn” to pay attention to certain kinds of features. What do these look like? This video does a nice job letting you see into the “black box” of one type of neural network recognizer:

Neural Style Project

It took a little while to set up on my machine, but the Neural Style project is pretty amazing. If you’ve used the app called “Prisma,” you know that it’s possible to take an input photograph and render it in the style of a famous painting. It works with a neural network, and code that does basically the same thing is available on the web in a couple of projects, one of which is neural-style. Fair warning: getting this up and running is not for the faint of heart. You’ll need a high-powered computer with an NVIDIA graphics card (GPU) and several steps of setup (it took about 30 minutes to get running on my machine). But when you run it, you can play around with input and output that looks like this:

Input

Taking this photograph of mine and having the neural-style project render it in a Van Gogh style:
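Under the hood, the “style” half of the trick comes down to comparing Gram matrices of convolutional feature maps (the Gatys et al. approach that neural-style implements). Here is a small PyTorch sketch of just that piece, not the project’s actual code; the random tensors stand in for real VGG activations:

```python
# The heart of neural style transfer: compare Gram matrices of feature maps.
# (A sketch of the style-loss idea, not the neural-style project's actual code.)
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (channels, height, width) activations from one conv layer
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.t() / (c * h * w)   # channel-by-channel correlations

def style_loss(generated_features, style_features):
    # How far apart are the "textures" of the two images at this layer?
    return F.mse_loss(gram_matrix(generated_features), gram_matrix(style_features))

# Example with random feature maps standing in for real VGG activations
gen = torch.rand(64, 128, 128)
sty = torch.rand(64, 128, 128)
print(style_loss(gen, sty).item())
```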

Great AI Podcast

The AI Podcast has a lot of great interviews.

Generative Adversarial Networks (GANs)

One of the most interesting things going on in machine learning these days is the so-called Generative Adversarial Network (GAN), which uses a “counterfeiter vs. police” adversarial contest to train an algorithm to actually synthesize new things. It’s a very recent idea (the research paper by Goodfellow et al. which set it off was only published in 2014). The idea is that you create a game of sorts between two algorithms: a “Generator” and a “Discriminator”. The Generator can be thought of as a counterfeiter, and the Discriminator can be thought of as the “police”.

Basically, the counterfeiter tries to create realistic-looking fakes and “wins” when it fools the police. The police, in turn, “win” when they catch the generator in the act. Played tens of thousands, even millions, of times, these models eventually optimize themselves, and you’re left with a counterfeiter that is pretty good at churning out realistic-looking fakes. Check out the hashtag #BigGAN on Twitter to see some interesting things going on in the field — or at the very least, some very strange computer-generated images and videos.
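For students who want to see the counterfeiter/police game spelled out, here is a toy PyTorch sketch. To keep it tiny, the generator learns to fake samples from a simple bell-curve distribution rather than images, but the training loop has the same shape as the real thing:

```python
# Toy GAN: a "counterfeiter" (G) learns to mimic samples drawn from N(4, 1.25),
# while the "police" (D) learn to tell real samples from fakes.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))               # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()) # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(5000):
    real = torch.randn(64, 1) * 1.25 + 4.0     # genuine samples
    fake = G(torch.randn(64, 8))               # counterfeits made from random noise

    # Police turn: call real samples 1, fakes 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Counterfeiter turn: try to make the police call the fakes real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())   # should drift toward ~4.0
```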

There’s a great overview of using a GAN to generate pictures of people who don’t exist. (Tons of ethical questions to discuss there, no?) For instance, these two people do not exist, but rather, were synthesized from a GAN which had ingested a lot of celebrity photos:

Another researcher used a GAN to train a neural network to synthesize photos of things which do not exist in the real world. Homes on a hillside:

Mostly thatched huts in mountains or forests

Audio equipment:

Tourist attractions:

And how about this incredible work from the AI team at the University of Washington?

https://www.youtube.com/watch?v=AmUC4m6w1wo

https://www.youtube.com/watch?v=UCwbJxW-ZRg

Interesting Topic for Mature Audiences

One of the pitfalls of machine learning and AI is that bad data used in training can lead to “learned” bad outcomes. An emblematic story which illustrates this, and which might be of interest to some high-school audiences, is the time Microsoft unleashed Tay, a chatbot which learned from the people it talked with and was quickly trained into a sex-crazed Nazi. On second thought…

Suggestions Welcome

Do you have suggestions for this list? Please be sure to add them in the comments section below.