Engines of Wow, Part III: Opportunities and Pitfalls

This is the third in a three-part series introducing revolutionary changes in AI-generated art. In Part I: AI Art Comes of Age, we traced the winding path that brought us to this point. Part II: Deep Learning and The Diffusion Revolution, 2014-present, introduced three basic methods for generating art via deep-learning networks: GANs, VAEs and Diffusion models.

But what does it all mean? What’s at stake? In this final installment, let’s discuss some of the opportunities, legal and ethical questions presented by these new Engines of Wow.

Opportunities and Disruptions

We suddenly have robots which can turn text prompts into relevant, engaging, surprising images for pennies in a matter of seconds. They can compete with custom art that takes illustrators and designers days or weeks to create.

Anywhere an image is needed, a robot can now help. We might even see side-by-side image creation with spoken words or written text, in near real-time.

  • Videogame designers have an amazing new tool to envision worlds.
  • Bloggers, web streamers and video producers can instantly and automatically create background graphics to describe their stories.
  • Graphic design firms can quickly make logos or background imagery for presentations, and accelerate their work. Authors can bring their stories to life.
  • Architects and storytellers can get inspired by worlds which don’t exist.
  • Entire graphic novels can now be generated from a text script which describes the scenes without any human intervention. (The stories themselves can even be created by new Chat models from OpenAI and other players.)
  • Storyboards for movies, which once cost hundreds of thousands of dollars to assemble, can soon be generated quickly, just by ingesting a script.

It’s already happening. In the Midjourney chat room, user Marcio84 writes: “I wrote a story 10 years ago, and finally have a tool to represent its characters.” With a few text prompts, the Midjourney Diffusion Engine created these images for him for just a few pennies:

Industrial designers, too, have a magical new tool. Inspiration for new vehicles can appear by the dozens and be voted up or down by a community:

Motorcycle concept generated by Midjourney, 2022

These engines are capable of competing with humans. In some surveys, as many as 87% of respondents incorrectly believed an AI-generated image depicted a real person. Think you can do better? Take the quiz.

I bet you could sell the art below, generated by Midjourney from a “street scene in the city, snow” prompt, in an art gallery or Etsy shop online. If I spotted it framed on a wall somewhere, or on a book cover or movie poster, I’d have no clue it was computer-generated:


A group of images stitched together becomes a video. One Midjourney user has tried to envision the aging and destruction of a room, via successive video frames generated from ever-more-decayed descriptions:

These are just a few of the things we can now do with these new AI art generation tools. Anywhere an image is useful, AI tools will have an impact, by lowering cost, blending concepts and styles, and envisioning many more options.

Where do images have a role? Well, that’s pretty much every field: architecture, graphic design, music, movies, industrial design, journalism, advertising, photography, painting, illustration, logo design, training, software, medicine, marketing, education and more.

Disruption

The first obvious impact is that many millions of employed or employable people may soon have far fewer opportunities.

Looking just at graphic design, there are approximately half a million designers employed globally, about 265,000 of whom are in the United States (source: Matt Moran of Colorlib). The total market for graphic design is about $43 billion per year. Roughly 90% of graphic designers work freelance, and the Fortune 500 accounts for nearly one-fifth of graphic design employment.

That’s just the graphic design segment. Then, there are photographers, painters, landscape architects and more.

But don't count out the designers yet. These are merely tools, just as cameras are for photographers. And, while the Internet disrupted (or "disintermediated") brokers in certain markets in the '90s and '00s (particularly in travel, in-person retail, financial services and real estate), I don't expect AI-generation tools to make these experts obsolete.

But the AI revolution is very likely to reduce the dollars available and reshape their roles. For instance, travel agents and financial advisers still exist, though their numbers are far lower. The ones who have survived, and even thrived, have used the new tools to imagine new businesses and move up the value-creation chain.

Who Owns the Ingested Data? Who Compensates the Artists?

Is this all plagiarism of sorts? There are sure to be lawsuits.

These algorithms rely upon massive image training sets, and there isn't much dotting of i's and crossing of t's to secure digital rights. Recently, an artist found her own private medical records in a publicly available training dataset used by Stability AI. You can check whether your own images have been part of the training datasets at haveibeentrained.com.

But unlike most plagiarism and "derivative work" lawsuits up until about 2020, these lawsuits will have to contend with not being able to firmly identify just how the works are directly derivative. Current caselaw around derivative works generally requires some degree of sameness or likeness from input to final result. But the body of imagery which goes into training the models is vast. A given creative end-product might be associated with hundreds of thousands or even millions of inputs. So how do the original artists get compensated, and how much, if at all?

No matter the algorithm, all generative AI models rely upon enormous datasets, as discussed in Part II. That’s their food. They go nowhere without creative input. And these datasets are the collective work of millions of artists and photographers. While some AI researchers go to great lengths to ensure that the images are copyright-free, many (most?) do not. Web scraping is often used to fetch and assemble images, and then a lot of human effort is put into data-cleansing and labeling.

The sheer scale of "original art" that goes into these engines is vast. It's not unusual for a model to be trained on 5 million images. So these generative models learn patterns in art from millions of samples, not just by staring at one or two paintings. Are they "clones?" No. Are they even "derivative?" Probably, but not in the same way that George Harrison's "My Sweet Lord" was derivative of Ronnie Mack's "He's So Fine."

In the art world, American artist Jeff Koons created a collection called Banality, which featured sculptures from pop culture: mostly three dimensional representations of advertisements and kitsch. Fait d’Hiver (Fact of Winter) was one such work, which sold for approximately $4.3 million in a Christie’s auction in 2007:

The sculpture was both inspired by and derived from this advertisement, created by French art director Franck Davidovici, who later sued:


It's plain to the eye that the work is derivative.

And in fact, that was the whole point: Koons brought to three dimensions some of the banality of everyday kitsch. In a legal battle spanning four years, Koons' lawyers argued unsuccessfully that such derivative work was still unique, on several grounds. For instance, he had turned it into three dimensions, added a penguin and put goggles on the woman, applied color, changed her jacket and the material representing the snow, changed the scale, and much more.

While derivative, with all these new attributes, was the work not then brand new? The French court said non. Koons was found guilty in 2018. And it wasn't the first time he had been found guilty: of the five lawsuits which sprang from the Banality collection, Koons lost three, and another settled out of court.

Unlike other “derivative works” lawsuits of the past, generative models in AI are relying not upon one work of a given artist, but an entire body of millions of images from hundreds of thousands of creators. Photographs are often lumped in with artistic sketches, oil paintings, graphic novel art and more to fashion new styles.

And, while it’s possible to look into the latent layers of AI models and see vectors of numbers, it’s impossible to translate that into something akin to “this new image is 2% based on image A64929, and 1.3% based on image B3929, etc.” An AI model learns patterns from enormous datasets, and those patterns are not well articulated.

Potential Approaches

It would be possible, it seems to me, to pass laws requiring that AI generative models use properly licensed (i.e., copyright-free or royalty-paid) images, and then divvy up those royalties among their creators. Each artist places a different value on their work, so presumably they'd set the prices, and AI model trainers would either pay them or not.

Compliance is another matter entirely; perhaps certification technologies would issue valid tokens once ownership is verified. Similar to the blockchain concept, perhaps all images would have to be traceable to some payment or royalty-agreement license. Or perhaps Non-Fungible Tokens (NFTs) could be used to license ownership for ingestion during the training phase. Obviously this would have to be scalable, which suggests automation and a de-facto standard will need to emerge.

Or will we see new kinds of art-comparison or "plagiarism" tools, letting artists measure similarity and influence between generated works and their own creations? Perhaps if a generated piece of art is found to be more than 95% similar (or some such threshold) to an existing work, it will not retain copyright and/or will require licensing of the underlying work. It's possible to build crude versions of such comparative tools today.
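As a purely hypothetical sketch of what such a check could look like, here's a crude pixel-level comparison in Python. The file names and the threshold are made up, and a real tool would compare perceptual hashes or learned embeddings rather than raw pixels:

```python
# Toy "similarity check" between a generated image and an artist's original.
# Purely illustrative: real plagiarism tools would use perceptual hashes or
# learned embeddings, not raw pixel statistics.
import numpy as np
from PIL import Image

def pixel_vector(path, size=(64, 64)):
    """Load an image, shrink it, convert to grayscale, flatten and normalize."""
    img = Image.open(path).convert("L").resize(size)
    v = np.asarray(img, dtype=np.float32).flatten()
    return (v - v.mean()) / (v.std() + 1e-8)

def similarity(path_a, path_b):
    """Cosine similarity between two images, in [-1, 1]."""
    a, b = pixel_vector(path_a), pixel_vector(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical file names; the 0.95 threshold echoes the idea above.
score = similarity("generated_artwork.png", "original_artwork.png")
if score > 0.95:
    print(f"Flag for licensing review: similarity {score:.2f}")
```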

In the meantime, it's a Wild West of sorts. As has often happened in the past, technology's rapid pace of advancement has gotten well ahead of legislation and the flow of money.

What’s Ahead

If you've come with me on this journey into AI-generated art in 2022, or have seen these tools up close, you're like someone who first browsed the World Wide Web in 1994. You're on the early end of a major wave. This is a revolution in its nascent stage. We don't know all that's ahead, and the incredible capabilities of these tools are known only to a tiny fraction of society at the moment. It is hard to predict all the ramifications.

But if prior disintermediation moments are any guide, I’d expect change to happen along a few axes.

First, advancement and adoption will spread from horizontal tools to many more specialized verticals. Right now, there's great advantage to being a skilled "image prompter," but I suspect that, like photography, which initially required real expertise to produce even passable results, these engines will get better at delivering remarkable images on the first pass. Time and again in technology, generalized "horizontal" applications have consolidated into an oligopoly of a few tools (e.g., spreadsheets), while launching a thousand flowers in much more specialized "vertical" ones (accounting systems and other domain-specific applications). I expect the same pattern here. These tools have come of age, but only a tiny fraction of people know about them; we're still in the generalist period, and the engines keep getting better. Today's horizontal applications stun with their potential and apply to a huge variety of uses, but I'd expect thousands of specialty, domain-specific applications and brand names to emerge: one set of players might generate book covers or storyboards, another graphic novels, another video-streaming backgrounds, another automated blog-post graphics or logo design. Not only will this make the training datasets more specific and the outputs even more relevant, it will also let each engine's brand penetrate a specific customer base and respond to its needs.

Second, many artists will demand compensation and seek to restrict rights to their work. Perhaps new guilds will emerge. A new technology and payments system might well emerge to allow this to scale. Content generally carries many ancillary rights, and one of them will likely be "ingestion rights" or "training-model rights." I would expect micropayment solutions, or perhaps some form of blockchain-based technology, to allow photographers, illustrators and artists to protect their work from being ingested into models without compensation. This might emerge as a kind of paywall for individual imagery. As is happening in the world of music, the most powerful and influential creative artists may initiate this trend by cordoning off their entire body of work. For instance, the Ansel Adams estate might decide to disallow all ingestion into training models; right now, however, it's very difficult to prove whether particular images were used to train a given model.

Third, regulation might be necessary to protect vital creative ecosystems. If an AI generative machine is able to create works auctioned at Christie's for $5 million, and it may well be able to soon, what does this do to the ecosystem of creators? Regulators may need to protect the ecosystem of creators that feeds these derivative engines, restricting AI model-makers from simply fetching and ingesting any old image.

Fourth, in the near term, skilled "image prompters" are like skilled photographers, web graphic designers, or painters. Today, there is a noticeable difference between those who know how to get the most out of these new tools and those who do not. For the short term, this is likely to "gatekeep" the technology and validate the expertise of designers. I do not expect this to be especially durable, however; the quality of output from very unskilled prompters (e.g., yours truly) already meets or exceeds a lot of the royalty-free art available from the likes of Envato or Shutterstock.

Conclusion

Machines now seem capable of visual creativity. While their output is often stunning, under the covers, they’re just learning patterns from data and semi-randomly assembling results. The shocking advancements since just 2015 suggest much more change is on the way: human realism, more styles, video, music, dialogue… we are likely to see these engines pass the artistic “Turing test” across more dimensions and domains.

For now, you need to be plugged into geeky circles of Reddit and Discord to try them out. And skill in crafting just the right prompts separates talented jockeys from the pack. But it’s likely that the power will fan out to the masses, with engines of wow built directly into several consumer end-user products and apps over the next three to five years.

We’re in a new era, where it costs pennies to envision dozens of new images visualizing anything from text. Expect some landmark lawsuits to arrive soon on what is and is not derivative work, and whether such machine-learning output can even be copyrighted. For now, especially if you’re in a creative field, it’s good advice to get acquainted with these new tools, because they’re here to stay.

Engines of Wow: Part II: Deep Learning and The Diffusion Revolution, 2014-present

A revolutionary insight in 2015, plus AI work on natural language, unleashed a new wave of generative AI models.

In Part I of this series on AI-generated art, we introduced how deep learning systems can be used to “learn” from a well-labeled dataset. In other words, algorithmic tools can “learn” patterns from data to reliably predict or label things. Now on their way to being “solved” via better and better tweaks and rework, these predictive engines are magical power-tools with intriguing applications in pretty much every field.

Here, we’re focused on media generation, specifically images, but it bears a note that many of the same basic techniques described below can apply to songwriting, video, text (e.g., customer service chatbots, poetry and story-creation), financial trading strategies, personal counseling and advice, text summarization, computer coding and more.

Generative AI in Art: GANs, VAEs and Diffusion Models

From Part I of this series, we know at a high level how we can use deep-learning neural networks to predict things or add meaning to data (e.g., translate text, or recognize what’s in a photo.) But we can also use deep learning techniques to generate new things. This type of neural network system, often comprised of multiple neural networks, is called a Generative Model. Rather than just interpreting things passively or searching through existing data, AI engines can now generate highly relevant and engaging new media.

How? The three most common types of Generative Models in AI are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and Diffusion Models. Sometimes these techniques are combined. They aren't the only approaches, but they are currently the most popular. Today's star products in art-generating AI are Midjourney by Midjourney.com (diffusion-based), DALL-E by OpenAI (VAE-based), and Stable Diffusion by Stability AI (diffusion-based). It's important to understand that each of these algorithmic techniques was conceived within just the past decade.

My goal is to describe these three methods at a cocktail-party chat level. The intuition behind each of them is an incredibly clever way of thinking about the problem. There are lots of resources on the Internet which go much further into each methodology, listed at the end of each section.

Generative Adversarial Networks

The first strand of generative-AI models, Generative Adversarial Networks (GANs), has been very fruitful for single-domain image generation. For instance, visit thispersondoesnotexist.com. Refresh the page a few times.

Each time, you’ll see highly* convincing images like this, but never the same one twice:

As the domain name suggests, these people do not exist. This is the computer creating a convincing image, using a Generative Adversarial Network (GAN) trained to construct a human-like photograph.

*Note that for the adult male, it only rendered half his glasses. This GAN doesn't really understand the concept of "glasses," only which pixels tend to appear adjacent to one another.

Generative Adversarial Networks were introduced in a 2014 paper by Ian Goodfellow et al. That was just eight years ago! The basic idea is that you have two deep-learning neural networks: a Generator and a Discriminator. You can think of them as a Counterfeiter and a Detective, respectively. The Discriminator (Detective) learns to distinguish between genuine articles and counterfeits, penalizing the Generator for producing implausible results. Meanwhile, the Generator (Counterfeiter) learns to produce plausible data; the fakes that fool the Discriminator become negative training examples for the Discriminator. They play this zero-sum game against each other (thus "adversarial") thousands and thousands of times, and with each adjustment to the Generator's and Discriminator's weights, the Generator gets better and better at constructing something that fools the Discriminator, and the Discriminator gets better and better at detecting fakes.

The whole system looks like this:

Generative Adversarial Network, source: Google
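To make that adversarial loop concrete, here is a toy sketch in PyTorch (my choice of framework for illustration, not necessarily what any particular product uses). Instead of images, the Generator learns to mimic a simple one-dimensional "real" distribution of numbers, but the structure of the game is the same:

```python
# Minimal GAN sketch: a Generator (counterfeiter) learns to mimic a simple
# 1-D "real data" distribution while a Discriminator (detective) learns to
# tell real from fake. A toy illustration of the adversarial loop, not an
# image-scale model.
import torch
import torch.nn as nn

def real_data(n):
    return torch.randn(n, 1) * 1.5 + 4.0     # "genuine articles": mean ~4.0

def noise(n):
    return torch.randn(n, 8)                 # raw randomness fed to the Generator

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(2000):
    # 1) Train the Discriminator: real samples should score 1, fakes 0.
    fake = G(noise(64)).detach()
    loss_D = bce(D(real_data(64)), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Train the Generator: make the Discriminator score its fakes as 1.
    loss_G = bce(D(G(noise(64))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(noise(1000)).mean().item())  # drifts toward ~4.0, the "real" mean
```

After enough rounds, the Generator's samples become hard to distinguish from the real distribution, which is the same dynamic that produces the faces at thispersondoesnotexist.com, just at vastly larger scale.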

GANs have delivered pretty spectacular results, but in fairly narrow domains. For instance, GANs have been pretty good at mimicking artistic styles (called "Neural Style Transfer") and Colorizing Black and White Images.

GANs are cool and a major area of generative AI research.

More reading on GANs:

Variational Autoencoders (VAE)

An encoder can be thought of as a compressor of data, and a decoder as something which does the opposite. You've probably compressed an image down to a smaller size without losing recognizability. It turns out you can use AI models to compress an image; data scientists call this reducing its dimensionality.

What if you built two neural network models, an Encoder and a Decoder? It might look like this, going from x, the original image, to x’, the “compressed and then decompressed” image:

Variational Autoencoder, high-level diagram. Images go in on the left, and come out on the right. If you train the networks to minimize the difference between output and input, you get a compression algorithm of sorts. What's left in red is a lower-dimensional representation of the images.

So conceptually, you could train an Encoder neural network to “compress” images into vectors, and then a Decoder neural network to “decompress” the image back into something close to the original.

Then, you could consider the red "latent space" in the middle as basically the Rosetta Stone for what a given image means. Run that algorithm over many images, encoding each along with the text of its label, and you end up with a condensed encoding of how to render various kinds of images. If you did this across many, many images and subjects, these red vectors would overlap in n-dimensional space, and could be sampled, mixed, and then run through the decoder to generate new images.

With some mathematical tricks (specifically, forcing the latent variables in red to conform to a normal distribution), you can build a system which can generate images that never existed before, but which have some very similar properties to the dataset which was used to train the encoder.
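As a rough sketch of those two networks and the "mathematical trick," here's a minimal VAE in PyTorch. The 784-pixel flattened image and the layer sizes are arbitrary stand-ins for illustration, nothing like the scale of a production art model:

```python
# Minimal Variational Autoencoder sketch: encoder -> latent vector -> decoder.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, image_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)       # mean of the latent distribution
        self.to_logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, image_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # "Reparameterization trick": sample a latent vector z ~ N(mu, sigma^2)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = TinyVAE()
x = torch.rand(32, 784)                     # a stand-in batch of flattened images
x_recon, mu, logvar = vae(x)

# Loss = reconstruction error + a KL term forcing latents toward N(0, 1),
# which is the trick that makes sampling brand-new images possible.
recon = nn.functional.binary_cross_entropy(x_recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl

# To generate a new image: decode a random point from the latent space.
new_image = vae.decoder(torch.randn(1, 16))
```

The last line is the payoff: sampling a random latent vector and decoding it yields an image that never existed, but which resembles the training data.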

More reading on VAEs:

2015: “Diffusion” Arrives

Is there another method entirely? What else could you do with a deep learning system which can “learn” how to predict things?

In March 2015, a revolutionary paper came out from researchers Sohl-Dickstein, Weiss, Maheswaranathan and Ganguli. It was inspired by the physics of non-equilibrium systems: for instance, dropping a drop of food coloring into a glass of water. Imagine you saw a film of that process of "destruction" and could stop it frame by frame. Could you build a neural network to reliably predict what the reverse might look like?

Let's think about a massive training set of animal images. Imagine you take an image in your training dataset and create multiple copies of it, each time systematically adding graphic "noise." Step by step, more noise is added to your image (x), via what mathematicians call a Markov chain (incremental steps). At each step you apply a normally-distributed distortion: Gaussian noise.

In a forward direction, from left to right, it might look something like this. At each step from left to right, you’re going from data (the image) to pure noise:

Adding noise to an image, left to right. Credit: image from “AI Summer”: How diffusion models work: the math from scratch | AI Summer (theaisummer.com)

But here's the magical insight behind Diffusion models. Once you've done this, what if you trained a deep learning model to try to predict frames in the reverse direction? Could you predict a "de-noised" image x(t) from its noisier version, x(t+1)? Could you read each step backward, from right to left, and try to predict the best way to remove noise at each step?

This was the insight in the 2015 paper, albeit with much more mathematics behind it. It turns out you can train a deep learning system to learn how to "undo" noise in an image, with pretty good results. For instance, if you input the pure-noise image from the last step, x(T), and train a deep learning network so that its output is the previous step, x(T-1), and do this over and over again with many images, you can "train" a deep learning network to subtract noise in an image, all the way back to an original image.
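A toy version of that training loop might look like the following PyTorch sketch. It treats images as flat vectors and uses a tiny fully-connected network purely for illustration; real systems use much larger U-Net-style models, but the essence is the same: add noise on a schedule, then train the network to predict the noise that was added so it can later be subtracted back out.

```python
# Highly simplified diffusion training sketch: forward noising + noise prediction.
import torch
import torch.nn as nn

T = 1000                                    # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)       # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: jump straight to noise level t for each image."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1)
    return torch.sqrt(a) * x0 + torch.sqrt(1 - a) * noise, noise

# The "denoiser" learns to predict the noise that was added at step t.
denoiser = nn.Sequential(nn.Linear(784 + 1, 512), nn.ReLU(), nn.Linear(512, 784))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.rand(32, 784)                       # stand-in training images
    t = torch.randint(0, T, (32,))
    noisy, noise = add_noise(x0, t)
    t_input = (t.float() / T).view(-1, 1)          # tell the model how noisy the input is
    pred_noise = denoiser(torch.cat([noisy, t_input], dim=1))
    loss = nn.functional.mse_loss(pred_noise, noise)
    opt.zero_grad(); loss.backward(); opt.step()
```

Generation then runs the learned de-noiser in reverse, starting from pure noise, which is exactly what the terrier example below illustrates.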

Do this enough times, with enough terrier images, say. And then, ask your trained model to divine a “terrier” from random noise. Gradually, step by step, it removes noise from an image to synthesize a “terrier”, like this:

Screen captured video of using the Midjourney chatroom (on Discord) to generate: “terrier, looking up, cute, white background”

Images generated from the current Midjourney model:

“terrier looking up, cute, white background” entered into Midjourney. Unretouched, first-pass output with v3 model.

Wow! Just slap “No One Hates a Terrier” on any of these images above, print 100 t-shirts, and sell it on Amazon. Profit! I’ll touch on some of the legal and ethical controversies and ramifications in the final post in this series.

Training the Text Prompts: Embeddings

How did Midjourney know to produce a “terrier”, and not some other object or scene or animal?

This relied upon another major parallel track in deep learning: natural language processing. In particular, word "embeddings" can be used to get from keywords to meanings. And during the image model training, these embeddings were applied by Midjourney to enhance each noisy image with meaning.

An "embedding" is a mapping of a chunk of text into a vector of continuous numbers; think of a word as a list of numbers. A textual variable could be a word, a node in a graph, or a relation between nodes in a graph. By ingesting massive amounts of text, you can train a deep learning network to understand relationships between words and entities, and to numerically capture how closely associated some words and phrases are with others. Embeddings can express the meaning or sentiment of an expression in mathematical terms a computer can appear to understand. For instance, embedding models can capture semantic relationships between words, like "king − man + woman ≈ queen."
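Here is a small sketch of that vector arithmetic using the gensim library and its downloadable pretrained GloVe word vectors. It illustrates the general idea; it is not the embedding model Midjourney itself uses, which hasn't been published:

```python
# Word-embedding arithmetic with pretrained 50-dimensional GloVe vectors.
# The first call downloads the vectors via gensim's dataset helper.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Start at "king", subtract "man", add "woman"; the nearest word is typically "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Embeddings also quantify how related two words are:
print(vectors.similarity("terrier", "dog"))
```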

An example on Google Colab took a vocabulary of 50,000 words in a collection of movie reviews, and learned over 100 different attributes from words used with them, based on their adjacency to one another:


Source: Movie Sentiment Word Embeddings

So, if you simultaneously injected into the “de-noising” diffusion-based learning process the information that this is about a “dog, looking up, on white background, terrier, smiling, cute,” you can get a deep learning network to “learn” how to go from random noise (x(T)) to a very faint outline of a terrier (x(T-1)), to even less faint (x(T-2)) and so on, all the way back to x(0). If you do this over thousands of images, and thousands of keyword embeddings, you end up with a neural network that can construct an image from some keywords.
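Continuing the toy diffusion sketch from earlier, text conditioning can be pictured as feeding the prompt's embedding into the denoiser alongside the noisy image at every reverse step. The dimensions, the stand-in prompt embedding, and the crude subtraction below are all illustrative assumptions, not how any production sampler actually works:

```python
# Toy text-conditioned reverse (generation) loop: the denoiser sees the noisy
# image, the current noise level, and the prompt embedding, and predicts the
# noise to remove. Everything here is a simplified stand-in.
import torch
import torch.nn as nn

EMB_DIM = 64                                        # size of the text embedding
denoiser = nn.Sequential(
    nn.Linear(784 + 1 + EMB_DIM, 512), nn.ReLU(), nn.Linear(512, 784))

def denoise_step(noisy_image, t_fraction, prompt_embedding):
    """One reverse step: predict the noise, guided by the prompt, and remove it."""
    inp = torch.cat([noisy_image, t_fraction, prompt_embedding], dim=1)
    predicted_noise = denoiser(inp)
    return noisy_image - predicted_noise            # crude; real samplers rescale carefully

# Generation starts from pure random noise and walks back toward an image.
prompt_embedding = torch.randn(1, EMB_DIM)          # stand-in for "terrier, looking up, cute"
x = torch.randn(1, 784)
for t in reversed(range(1000)):
    x = denoise_step(x, torch.tensor([[t / 1000.0]]), prompt_embedding)
```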

Incidentally, researchers have found that T=1000 steps is about all you need in this process, but millions of input images and enormous amounts of computing power are needed to learn how to "undo" noise at high resolution.

Let’s step back a moment to note that this revelation about Diffusion Models was only really put forward in 2015, and improved upon in 2018 and 2020. So we are just at the very beginning of understanding what might be possible here.

In 2021, Dhariwal and Nichol convincingly note that diffusion models can achieve image quality superior to the existing state-of-the-art GAN models.

Up next, Part III: Ramifications and Questions

That’s it for now. In the final Part III of Engines of Wow, we’ll explore some of the ramifications, controversies and make some predictions about where this goes next.

Engines of Wow: AI Art Comes of Age

Advancements in AI-generated art test our understanding of human creativity and laws around derivative art.

While most of us were focused on Ukraine, the midterm elections, or simply returning to normal as best we can, Artificial Intelligence (AI) took a gigantic leap forward in 2022. Seemingly all of a sudden, computers are now eerily capable of human-level creativity. Natural language agents like GPT-3 are able to carry on an intelligent conversation. GitHub CoPilot is able to write major blocks of software code. And new AI-assisted art engines with names like Midjourney, DALL-E and Stable Diffusion delight our eyes, but threaten to disrupt entire creative professions. They raise important questions about artistic ownership, derivative work and compensation.

In this three-part blog series, I'm going to dive into the brave new world of AI-generated art. How did we get here? How do these engines work? What are some of the ramifications?

This series is divided into three parts:

  • Part I: AI Art Comes of Age
  • Part II: Deep Learning and The Diffusion Revolution, 2014-present
  • Part III: Opportunities and Pitfalls

[featured image above: “God of Storm Clouds” created by Midjourney AI algorithm]

But first, why should we care? What kind of output are we talking about?

Let’s try one of the big players, the Midjourney algorithm. Midjourney lets you play around in their sandbox for free for about 25 queries. You can register for free at Midjourney.com; they’ll invite you to a Discord chat server. After reading a “Getting Started” agreement and accepting some terms, you can type in a prompt. You might go with: “/imagine portrait of a cute leopard, beautiful happy, Gryffindor outfit, super detailed, hyper realism.”

Wait about 60 seconds, choose one of the four samples generated for you, click the “upscale” button for a bigger image, and voila:

image created by the Midjourney image generation engine, version 4.0. Full prompt used to create it was “portrait of a cute leopard, Beautiful happy, Gryffindor Outfit, white background, biomechanical intricate details, super detailed, hyper realism, heavenly, unreal engine, rtx, magical lighting, HD 8k, 4k”

The Leopard of Gryffindor was created without any human retouching. This is final Midjourney output. The algorithm took the text prompt, and then did all the work.

I look at this image, and I think: Stunning.

Looking at it, I get the kind of “this changes everything” feeling, like the first time I browsed the world-wide web, spoke to Siri or Alexa, rode in an electric vehicle, or did a live video chat with friends across the country for pennies. It’s the kind of revolutionary step-function that causes you to think “this will cause a huge wave of changes and opportunities,” though it’s not even clear what they all are.

Are artists, graphic designers and illustrators doomed? Will these engines ultimately help artists or hurt them? How will the creative ecosystem change when it becomes nearly free to go from idea to visual image?

Once mainly focused on processing existing images, computers are now extremely capable of generating brand-new things. Before diving into a high-level overview of these new generative AI art algorithms, let me emphasize a few things: First, no artist has ever created exactly the above image before, nor will it likely ever be generated again. That is, Midjourney and its competitors (notably DALL-E and Stable Diffusion) aren't search engines: they are media creation engines.

In fact, if you typed this same exact prompt into Midjourney again, you'd get an entirely different image, yet one which is also likely to deliver on the prompt fairly well.

There is an old joke within Computer Science circles that “Artificial Intelligence is what we call things that aren’t working yet.” That’s now sounding quaint. AI is all around us, making better and better recommendations, completing sentences, curating our media feeds, “optimizing” the prices of what we buy, helping us with driving assistance on the road, defending our computer networks and detecting spam.

Part I: The Artists in the Machine, 1950-2015+

How did this revolutionary achievement come about? Two ways, just as bankruptcy came about for Mike Campbell in Hemingway’s The Sun Also Rises: First gradually. Then suddenly.

Computer scientists have spent more than fifty years trying to perfect art generation algorithms. These five decades can be roughly divided into two distinct eras, each with entirely different approaches: “Procedural” and “Deep Learning.” And, as we’ll see in Part II, the Deep-Learning era had three parallel but critical deep learning efforts which all converged to make it the clear winner: Natural Language, Image Classifiers, and Diffusion Models.

But first, let’s rewind the videotape. How did we get here?

Procedural Era: 1970’s-1990’s

If you asked most computer users how they'd generate computer art, the naive approach would be to encode various "rules of painting" into software programs, via the very "if this, then that" kind of logic that computers excel at. And that's precisely how it began.

In 1973, the British-born artist Harold Cohen, then in residence at Stanford University's Artificial Intelligence Laboratory (SAIL), created AARON, the first computer program dedicated to generating art. Cohen was an accomplished, talented artist as well as a capable programmer, and he thought it would be intriguing to try to "teach" a computer how to draw and paint.

His thinking was to encode various “rules about drawing” into software components and then have them work together to compose a complete piece of art. Cohen relied upon his skill as an exceptional artist, and coded his own “style” into his software.

AARON was an artificial intelligence program first written in the C programming language (a low-level language compiled for speed), and later in LISP (a language designed for symbolic manipulation). AARON knew about various rules of drawing, such as how to "draw a wavy blue line, intersecting with a black line." Later, constructs were added to combine these primitives to "draw an adult human face, smiling." By 1995, Cohen had added rules for painting color within the drawn lines.

Though there were aspects of AARON which were artificially intelligent, computer scientists would by and large call this a procedural approach. Do this, then that. Pick up a brush, pick an ink color, and draw from point A to B. Construct an image from its components. Join the lines. And you know what? After a few decades of work, Cohen created some really nice pieces, worthy of hanging on a wall. You can see some of them at the Computer History Museum in Mountain View, California.
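To give a flavor of the procedural mindset, here's how one of those hand-coded drawing rules might look in Python with matplotlib. This is purely my own illustration of the style of approach; AARON's actual code was vastly more sophisticated:

```python
# Hand-coded drawing rules, in the spirit of the procedural era:
# "draw a wavy blue line, intersecting with a black line."
import numpy as np
import matplotlib.pyplot as plt

def wavy_line(x_start, x_end, y_center, amplitude, color):
    """Rule: a wavy line is a sine curve around a center height."""
    x = np.linspace(x_start, x_end, 200)
    plt.plot(x, y_center + amplitude * np.sin(3 * x), color=color, linewidth=2)

def straight_line(p1, p2, color):
    """Rule: a straight line connects two points."""
    plt.plot([p1[0], p2[0]], [p1[1], p2[1]], color=color, linewidth=2)

wavy_line(0, 10, 5, 1.0, "blue")          # a wavy blue line...
straight_line((2, 0), (8, 10), "black")   # ...intersecting with a black line
plt.axis("off")
plt.savefig("procedural_sketch.png")
```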

In 1980, AARON was able to generate this:

Detail from an untitled AARON drawing, ca. 1980, via Computer History Museum

By 1995, Cohen had encoded rules of color, and AARON was generating images like this:

The first color image created by AARON, 1995. via Computer Museum, Boston, MA

Until just a few years ago, other attempts at AI-generated art were flat-looking and derivative, like this image from 2019:


Twenty-seven years after AARON's first AI-generated color painting, algorithms like Midjourney are quickly rendering photorealistic images from text prompts. But the primary method used to accomplish this is completely different.

Deep Learning Era (1986-Present)

Algorithms which can create photorealistic images-on-demand are the culmination of multiple parallel academic research threads in learning systems dating back several decades.

We’ll get to the generative models which are key to this new wave of “engines of wow” in the next post, but first, it’s helpful to understand a bit about their central component: neural networks.

Since about 2000, you have probably noticed everyday computer services making massive leaps in predictive capabilities; that’s because of neural networks. Turn on Netflix or YouTube, and these services will serve up ever-better recommendations for you. Or, literally speak to Siri, and she will largely understand what you’re saying. Tap on your iPhone’s keyboard, and it’ll automatically suggest which letters or words might follow.

Each of these systems relies upon trained prediction models built by neural networks. And to envision them, a group of computer scientists and mathematicians had to radically shift their thinking away from the procedural approach. They did so first in the 1950's and 60's, and then again in a machine-learning renaissance which began in earnest in the mid-1980's.

The key insight: these researchers speculated that instead of procedural coding, perhaps something akin to “intelligence” could be fashioned from general purpose software models, which would algorithmically “learn” patterns from a massive body of well-labeled training data. This is the field of “machine learning,” specifically supervised machine learning, because it’s using accurately pre-labeled data to train a system. That is, rather than “Computer, do this step first, then this step, then that step”, it became “Computer: learn patterns from this well-labeled training dataset; don’t expect me to tell you step-by-step which sequence of operations to do.”

The first big step began in 1958. Frank Rosenblatt, a researcher at Cornell University, created a simplistic precursor to neural networks, the “Perceptron,” basically a one-layer network consisting of visual sensor inputs and software outputs. The Perceptron system was fed a series of punchcards. After 50 trials, the computer “taught” itself to distinguish those cards which were marked on the left from cards marked on the right. The computer which ran this program was a five-ton IBM 704, the size of a room. By today’s standards, it was an extremely simple task, but it worked.

A single-layer perceptron is the basic component of a neural network. A perceptron consists of input values, weights and a bias, a weighted sum and activation function:

Frank Rosenblatt and the Perceptron system, 1958

Rosenblatt described it as the “first machine capable of having an original idea.” But the Perceptron was extremely simplistic; it merely added up the optical signals it detected to “perceive” dark marks on one side of the punchcard versus the other.
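For the curious, a single perceptron is small enough to write out in a few lines of Python. This toy version, with made-up two-feature "punchcards" (a mark on the left or a mark on the right), uses the classic perceptron update rule:

```python
# A single perceptron, in the spirit of Rosenblatt's 1958 machine: a weighted
# sum of inputs plus a bias, passed through a step activation.
import numpy as np

X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # [left mark, right mark]
y = np.array([0, 0, 1, 1])                                    # 0 = marked left, 1 = marked right

weights = np.zeros(2)
bias = 0.0

def predict(x):
    return 1 if np.dot(weights, x) + bias > 0 else 0          # step activation

for trial in range(50):                                        # "after 50 trials..."
    for x, target in zip(X, y):
        error = target - predict(x)
        weights += 0.1 * error * x                             # nudge weights toward the correct answer
        bias += 0.1 * error

print([predict(x) for x in X])   # -> [0, 0, 1, 1]
```

Fifty passes over four toy examples is obviously nothing like Rosenblatt's room-sized IBM 704, but the weighted sum, bias and step activation are the same ingredients.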

In 1969, MIT's Marvin Minsky, whose father was an eye surgeon, wrote convincingly that neural networks needed multiple layers (like the optical neuron fabric in our eyes) to really do complex things. But his book Perceptrons, though well-respected in hindsight, got little traction at the time. That's partially because, during the intervening decades, the computing power required to "learn" more complex things via multi-layer networks was simply out of reach. But time marched on, and over the next three decades, computing power, storage, languages and networks all improved dramatically.

From the 1950's through the early 1980's, many researchers doubted that computing power would be sufficient for intelligent learning systems built with a neural-network-style approach. Skeptics also wondered whether models could ever reach a level of specificity that made them worthwhile. Early experiments often "overfit" the training data and simply regurgitated the input. Some would get stuck in local maxima or minima of a training set. There were reasons to doubt this would work.

And then, in 1986, Carnegie Mellon Professor Geoffrey Hinton, whom many consider the "Godfather of Deep Learning" (go Tartans!), demonstrated with colleagues David Rumelhart and Ronald Williams that neural networks could learn to predict shapes and words by statistically "learning" from a large, labeled dataset. The revolutionary 1986 breakthrough was the popularization of "backpropagation": add multiple hidden layers to the model, then iterate through the network, using a mathematical loss function to adjust the weights and minimize the distance from the expected output.

This is rather like the golfer who adjusts each successive golf swing, having observed how far off their last shots were. Eventually, with enough adjustments, they calculate the optimal way to hit the ball to minimize its resting distance from the hole. (This is where terms like “loss function” and “gradient descent” come in.)
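In code, that golfer's adjustment is gradient descent. Here's a minimal sketch that fits a single weight; backpropagation applies the same idea across millions of weights in many layers:

```python
# The golfer analogy in code: gradient descent repeatedly measures how far off
# the last "shot" was (the loss) and adjusts in the direction that reduces it.
# Here we fit a single weight w so that w * x approximates y = 3x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                      # the "hole" we are aiming for

w = 0.0                          # initial guess
learning_rate = 0.01

for step in range(200):
    predictions = w * x
    loss = np.mean((predictions - y) ** 2)         # how far off were we?
    gradient = np.mean(2 * (predictions - y) * x)  # which direction reduces the miss?
    w -= learning_rate * gradient                  # adjust the swing

print(round(w, 3))  # converges toward 3.0
```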

In 1986-87, around the time of the 1986 Hinton-Rumelhart-Williams paper on backpropagation, the whole AI field was in flux between these procedural and learning approaches, and I was earning a Masters in Computer Science at Stanford, concentrating in "Symbolic and Heuristic Computation." I had classes which dove into the AARON-style type of symbolic, procedural AI, and a few classes touching on neural networks and learning systems. (My master's thesis was on getting a neural network to "learn" how to win the Tower of Hanoi game, which requires apparent backtracking to win.)

In essence, you can think of a neural network as a fabric of software-represented units (neurons) waiting to soak up patterns in data. The methodology to train them is: “here is some input data and the output I expect, learn it. Here’s some more input and its expected output, adjust your weights and assumptions. Got it? Keep updating your priors. OK, let’s keep doing that.” Like a dog learning what “sit” means (do this, get a treat / don’t do this, don’t get a treat), neural networks are able to “learn” over iterations, by adjusting the software model’s weights and thresholds.

Do this enough times, and what you end up with is a trained model that’s able to “recognize” patterns in the input data, outputting predictions, or labels, or anything you’d like classified.

A neural network, and in particular the special type of multi-layered network called a deep learning system, is “trained” on a very large, well-labeled dataset (i.e., with inputs and correct labels.) The training process uses Hinton’s “backpropagation” idea to adjust the weights of the various neuron thresholds in the statistical model, getting closer and closer to “learning” the underlying pattern in the data.

For much more detail on Deep Learning and the mathematics involved, see this excellent overview:

Deep Learning Revolutionizes AI Art

We'll rely heavily upon this background of neural networks and deep learning for Part II: The Diffusion Revolution. The AI revolution uses deep learning networks to interpret natural language (text to meaning), classify images, and learn how to synthetically build an image from random noise.

Gallery

Before leaving, here are a few more images created from text prompts on Midjourney:

You get the idea. We'll check in on how deep learning enabled new kinds of generative approaches to AI art, called Generative Adversarial Networks, Variational Autoencoders and Diffusion, in Part II: Engines of Wow: Deep Learning and The Diffusion Revolution, 2014-present.