Is Generative AI Just Hype?

There was no doubt that OpenAI was onto something. In late 2021, a small team of researchers at the company’s San Francisco office was experimenting with an idea. They had built a new version of OpenAI’s text-to-image model, DALL-E, an AI that converts short written descriptions into pictures, such as a Van Gogh-inspired fox or a corgi made of pizza. All they had to do now was figure out what to do with it.

Sam Altman, OpenAI’s cofounder and CEO, tells MIT Technology Review that the usual pattern is to build something and then spend a while using it internally, trying to work out what it is and what it will be good for.

Not this time. As they fiddled with the model, everyone involved realized they had something special. According to Altman, it was very obvious that this was it: this was the product. There was no debate. The team didn’t even hold a meeting to discuss it.

Yet neither Altman nor the DALL-E team anticipated how big a splash this product would make. Altman says it is the first AI technology to catch on with ordinary people.

In April 2022, DALL-E 2 was released. Google unveiled two text-to-image models in May under the names Imagen and Parti, but did not release them. Midjourney, a text-to-image model created for artists, then appeared. And August brought Stable Diffusion, a free open-source model made available to the public by the UK-based startup Stability AI.

The floodgates were open. A million users signed up for DALL-E in just 2.5 months. In less than half that time, more than a million people began using Stable Diffusion through its paid service, DreamStudio; many more used Stable Diffusion through third-party apps or installed the free version on their own computers. (Emad Mostaque, the founder of Stability AI, says his goal is a billion users.)

Following that, a wave of text-to-video models from Google, Meta, and other companies appeared in October. These can produce brief video clips, animations, and 3D images in addition to still images.

Development has moved at an astounding pace. In just a few short months, the technology has inspired a run of newspaper and magazine covers, filled social media with memes, sent a hype machine into overdrive, and provoked a serious backlash.

Mike Cook, an AI researcher at King’s College London who studies computational creativity, says the technology’s ability to astound users is “amazing” and that it is also fun, which is how new technology should be. But the pace of change has been so fast that first impressions are already being revised. He believes society will need time to digest it.

Artists are facing the biggest upheaval in a generation. Some will lose work, and some will find new opportunities. A few are heading to court over what they see as the misappropriation of their images to train models that could replace them.

Don Allen Stevenson III, a California-based digital artist who has worked at visual effects studios such as DreamWorks, says creators were caught off guard. For technically trained people like him, he says, the technology is very frightening: the immediate reaction is “Oh my god, that’s my entire job.” He spent his first month using DALL-E in an existential crisis.

While some people are still in shock, many people, including Stevenson, are figuring out how to use these tools and predict what will happen next.

The intriguing thing is that no one really knows. The creative industries, from entertainment media to fashion, architecture, and marketing, will feel the impact first, because this technology gives creative superpowers to anyone who wants them. In the longer term, it could be used to generate designs for almost anything, from new kinds of drugs to clothes and buildings. The generative revolution has begun.

A magical revolution

For digital creator Chad Nelson, who has worked on video games and TV shows, text-to-image models are a once-in-a-lifetime breakthrough. This technology takes you from the lightbulb in your head to a first sketch in seconds, he says. The speed at which you can create and explore, he adds, is beyond anything he has encountered in 30 years.

Within weeks of their release, people were using these tools to prototype and brainstorm everything from marketing layouts and video game environments to magazine illustrations and movie concepts. Many created fan art, including entire comic books, and shared it online. Altman even used DALL-E to generate designs for sneakers, which someone made for him after he tweeted the image.

Amy Smith, a computer scientist at Queen Mary University of London and a tattoo artist, has used DALL-E to design tattoos. You can sit down with the client and develop designs together, she says. We are experiencing a revolution in media generation.

Paul Trillo, a digital and video artist based in California, believes the technology will make it easier and faster to brainstorm ideas for visual effects. People say this is the end for effects artists or fashion designers, he notes, but he doesn’t think anything is dying. If anything, he says, it means not having to work nights and weekends.

Stock image companies are taking different stances. Getty has banned AI-generated images. Shutterstock has signed a deal with OpenAI to embed DALL-E in its website and says it will set up a fund to reimburse artists whose work was used to train the models.

Stevenson says he has tried DALL-E at every step of an animation studio’s production process, including designing characters and environments. With it, he did the work of several departments in a fraction of the time. For all the people who have never been able to create because it was too expensive or too complicated, he says, it’s uplifting. But if you’re not open to change, it’s terrifying.

Nelson expects more to come. He envisions this technology eventually being adopted by major media companies as well as architecture and design firms. Yet he says it isn’t ready for that.

Right now it’s like you have a little magic box, a little wizard, he says. That’s great if all you want to do is keep generating images, but not if you need a creative partner. If he wants it to create stories and build worlds, he says, it needs far more awareness of what he is making.

The issue is that these models still don’t have a clue what they’re doing.

Inside the black box

To understand why, let’s look at how these programs work. From the outside, the software is a black box. You type in a short prompt and wait a few seconds. What you get back is a handful of images that fit the prompt, more or less. You may have to tweak your text to coax the model toward something closer to what you had in mind, or to refine a happy accident. This practice has become known as prompt engineering.

Prompts for the most detailed, stylized images can be hundreds of words long, and finding the right words has become a valuable skill. Prompts known to produce desirable results can now be bought and sold on online marketplaces.

Prompts can contain phrases that instruct the model to go for a particular style: “trending on ArtStation” tells the AI to mimic the (typically very detailed) style of images popular on ArtStation, a website where thousands of artists showcase their work; “Unreal engine” invokes the familiar graphic style of certain video games; and so on. Users can even enter the names of specific artists to have the AI produce pastiches of their work, which has caused some artists to be extremely unhappy.
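To give a concrete sense of what that looks like in practice, here is a minimal sketch of prompt engineering in code, using the image endpoint of OpenAI’s legacy Python library. The prompt, the style phrases, and the parameter values are illustrative assumptions, not a recipe for any particular result.

```python
# A rough sketch of prompt engineering against DALL-E 2, using the legacy
# openai Python library's image endpoint. The style phrases appended to the
# base prompt are the kind of modifiers described above; they are
# illustrative, not guaranteed to produce any particular look.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

base_prompt = "a red fox in a wheat field"
style_phrases = "in the style of Van Gogh, highly detailed, trending on ArtStation"

response = openai.Image.create(
    prompt=f"{base_prompt}, {style_phrases}",
    n=2,                # ask for a couple of variations to choose from
    size="1024x1024",
)

for item in response["data"]:
    print(item["url"])  # links to the generated images
```

Swapping the style phrases, or the artist named in them, is exactly the kind of tweak that prompt engineers trade in.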

Text-to-image models have two main components: one neural network trained to pair an image with text that describes it, and another trained to generate images from scratch. The basic idea is to get the second network to produce an image that the first network accepts as a match for the prompt.
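Here is a rough illustration of the first of those two parts, assuming the openly released CLIP model on Hugging Face stands in for the image-text matcher; the production systems use their own variants, and the model name and example files are assumptions.

```python
# Scoring how well a candidate image matches different prompts with CLIP,
# the kind of image-text matching network described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate.png")  # an image produced by the generator
prompts = ["a corgi made of pizza", "a bowl of fruit on a table"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.softmax(dim=-1)

# The prompt with the higher score is the one CLIP thinks the image matches best.
print(dict(zip(prompts, scores[0].tolist())))
```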

The big breakthrough behind the new models is the way images get made. The first version of DALL-E extended the technology behind OpenAI’s language model GPT-3, generating images by predicting the next pixels in an image as if they were words in a sentence. It worked, but not well. According to Altman, the experience “was not magical. The fact that it worked at all is amazing.”

DALL-E 2 instead uses something called a diffusion model. Diffusion models are neural networks trained to clean up images by removing the pixelated noise that the training process adds. The process changes a few pixels at a time, over many steps, until the original images are erased and nothing but random pixels is left. According to Björn Ommer, who studies generative AI at the University of Munich in Germany and helped develop the diffusion model that underpins Stable Diffusion, if you do this a thousand times, eventually the image looks as if you had pulled the antenna cable out of your television.

The neural network is then trained to reverse that process, predicting what a less noisy version of a given image would look like. The upshot is that if you hand a diffusion model a mess of pixels, it will try to produce something a little cleaner. Feed the cleaned-up image back in, and it will produce something cleaner still. Repeat this enough times and the model can take you from TV snow to a high-resolution picture.
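For the technically curious, here is a heavily simplified toy sketch of that train-then-denoise loop in PyTorch-style Python. The linear noise schedule, the step size, and the `model(noisy, t)` interface are assumptions made for brevity; real diffusion models use far more careful schedules and samplers.

```python
# A toy sketch of the diffusion idea: add noise to images during training,
# teach a network to predict that noise, then generate by repeatedly
# denoising pure static. Deliberately crude; real models (DDPM and friends)
# are more careful about schedules and update rules.
import torch
import torch.nn.functional as F

T = 1000  # number of noising steps

def add_noise(x0, t):
    """Forward process: blend a clean image x0 with Gaussian noise at step t."""
    keep = 1.0 - t / T                 # how much of the original survives (toy linear schedule)
    noise = torch.randn_like(x0)
    return keep * x0 + (1.0 - keep) * noise, noise

def training_step(model, x0, optimizer):
    """Train the network to guess the noise that was mixed in."""
    t = torch.randint(1, T, (1,)).item()
    noisy, noise = add_noise(x0, t)
    loss = F.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape):
    """Reverse process: start from 'TV snow' and clean it up step by step."""
    x = torch.randn(shape)
    for t in reversed(range(1, T)):
        predicted_noise = model(x, t)
        x = x - predicted_noise / T    # remove a small slice of the predicted noise
    return x
```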

The trick with text-to-image models is that this process is guided by the language model, which tries to match the prompt to the images the diffusion model is producing. This nudges the diffusion model toward images that the language model considers a good match.
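One common mechanism for that steering, used by Stable Diffusion among others, is known as classifier-free guidance: the denoising network makes two predictions, one conditioned on the prompt and one not, and the difference between them is amplified. A rough sketch with placeholder names follows; it assumes a text-conditioned version of the denoiser from the sketch above.

```python
def guided_noise_prediction(denoiser, x, t, prompt_embedding, empty_embedding,
                            guidance_scale=7.5):
    """Classifier-free guidance: nudge the denoiser toward images that fit the prompt.

    `denoiser` is a text-conditioned noise-prediction network; `prompt_embedding`
    comes from the text encoder and `empty_embedding` from an empty prompt.
    A guidance_scale around 7.5 is a typical default in Stable Diffusion.
    """
    eps_unconditioned = denoiser(x, t, empty_embedding)   # what the model would do anyway
    eps_conditioned = denoiser(x, t, prompt_embedding)    # what the prompt suggests
    # Push the prediction further in the direction the text conditioning points.
    return eps_unconditioned + guidance_scale * (eps_conditioned - eps_unconditioned)
```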

Those links between text and images don’t come out of nowhere, though. Most of today’s text-to-image models are trained on LAION, a giant data set of billions of image-text pairs scraped from the internet. That means the images you get from a text-to-image model are a distillation of how the world is depicted online, skewed by prejudice (and pornography).

One more thing: there is a small but crucial difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, by contrast, uses a technique called latent diffusion, developed by Ommer and his colleagues. It works on compressed versions of images encoded within the neural network in what is known as a latent space, where only the essential features of an image are retained.

This means Stable Diffusion needs less computing muscle. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new apps comes down to the fact that Stable Diffusion is both open source, so programmers are free to change it, build on it, and make money from it, and light enough for people to run at home.
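That lightness is easy to see in practice. Here is a minimal sketch of a local run using Hugging Face’s diffusers library, which packages the latent diffusion approach; the model ID, hardware note, and prompt are illustrative assumptions.

```python
# Running Stable Diffusion locally with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")  # a single decent consumer GPU is typically enough

# The heavy denoising work happens on a small latent (roughly 64x64x4 for a
# 512x512 image) rather than on full-resolution pixels, which is why this
# fits on a home machine; a decoder turns the latent back into an image.
image = pipe("a lighthouse on a cliff at dawn, detailed digital painting").images[0]
image.save("lighthouse.png")
```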

Redefining creativity

Some see these models as a step toward artificial general intelligence, or AGI, an overhyped buzzword for a future AI with all-purpose or even human-like capabilities. OpenAI has been explicit about its goal of building AGI. That is why Altman doesn’t mind that DALL-E 2 now competes with a raft of similar tools, some of them free. We’re here to make AGI, not image generators, he says. Image generation fits into a broader product road map; it is just one small slice of what an AGI will do.

That is overly optimistic; many experts think today’s AI will never reach that level. In terms of basic intelligence, text-to-image models are no smarter than the language-generating AIs that underpin them. Tools like GPT-3 and Google’s PaLM reproduce patterns of text learned from the billions of documents they are trained on. Likewise, DALL-E and Stable Diffusion reproduce associations between text and images found across billions of examples taken from the internet.

The results are stunning, but poke too hard and the illusion falls apart. These models make simple howlers: responding to “salmon in a river” with a picture of chopped-up fillets floating downstream, or to “a bat flying over a baseball stadium” with a picture of both the flying mammal and the wooden club. That’s because they are built on technology that comes nowhere near matching human (or even most animal) levels of understanding of the world.

Even so, it may not be long before these models learn more sophisticated tricks. Whatever people say, it’s obvious the technology isn’t very good at certain things right now, Cook notes. But that shortcoming may only be a hundred million dollars of further investment away from disappearing.

Undoubtedly, that is OpenAI’s strategy.

We already know how to make it ten times better, says Altman. We know it mishandles some logical reasoning tasks. We are working through a list of problems, and we will release a new version that fixes the current ones.

If claims about intelligence and understanding are overblown, what about creativity? When asked for examples of creativity, people often point to artists, mathematicians, businesspeople, kindergarten students, and their teachers. But it is hard to pin down what all these people have in common.

For some, results are what matter most. For others, what counts is the way something is made, and whether there was intent behind it.

Still, many fall back on a definition from Margaret Boden, an influential AI researcher and philosopher at the University of Sussex in the UK, who boils the concept down to three criteria: an idea or artifact is creative if it is new, surprising, and valuable.

Beyond that, you usually just know it when you see it. Researchers in computational creativity describe their work as using computers to produce results that would be considered creative if humans had produced them alone.

So despite their stupidity, Smith is happy to call this new breed of generative models creative. It is clear, she says, that there is innovation in these images that was not directed by any human. The translation from text to image is often surprising and beautiful.

Maria Teresa Llano, who studies computational creativity at Monash University in Melbourne, Australia, agrees that text-to-image models are going beyond what was previously thought possible. But she hesitates to call them original: the results from these programs can quickly begin to repeat themselves, she says, which means they fall short of Boden’s criteria. And that may be the technology’s fundamental limitation. By design, a text-to-image model produces new images in the likeness of the billions of images it was trained on. Machine learning may only ever generate images that mimic what it has seen before.

That may not matter for computer graphics. Adobe is already building text-to-image generation into Photoshop, and Photoshop’s open-source cousin Blender has a Stable Diffusion plug-in. OpenAI is also working with Microsoft on a text-to-image widget for Office.

It is in this kind of interaction, in future versions of these familiar tools, that the real impact may be felt: machines that don’t replace human creativity but enhance it. According to Llano, “the creativity we see today comes from the use of the systems, rather than from the systems themselves”: from the call-and-response needed to get the desired result.

Other computational creativity researchers share this view. It is not just what these machines do that matters, but how they do it. To become true creative partners, they need to take more initiative, be given creative responsibility, and learn to curate as well as create.

Some of that is already starting to happen. One program, called CLIP Interrogator, analyzes an image and generates a prompt that will produce more images like it. Others are automating prompt engineering itself, a craft only a few months old, by bolting phrases onto simple prompts that are meant to boost the quality and fidelity of the image.
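That kind of automated prompt enhancement can be as simple as string manipulation. A toy sketch follows, with a phrase list that is purely illustrative rather than a documented recipe.

```python
# A toy version of automated prompt enhancement: bolt on phrases that are
# widely believed to nudge models toward more detailed output.
QUALITY_PHRASES = [
    "highly detailed",
    "sharp focus",
    "dramatic lighting",
    "trending on ArtStation",
]

def enhance(prompt: str) -> str:
    return ", ".join([prompt] + QUALITY_PHRASES)

print(enhance("a lighthouse on a cliff at dawn"))
# -> a lighthouse on a cliff at dawn, highly detailed, sharp focus, ...
```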

Other foundations are being laid as the flood of images continues. Cook says AI-generated images have now contaminated the internet for good: every model built from now on will absorb the images that were created in 2022.

It will take time to see exactly what impact these tools have on the creative industries, and on the field of AI as a whole. But generative AI has already become a new tool for expression. Altman says he now uses generated images instead of emoji in private messages. Some of his friends, he adds, don’t even bother generating the image; they just type the prompt.

However, text-to-image models might only be the beginning. In the future, generative AI could be used to create designs for anything from new structures to pharmaceuticals—just think of text-to-X.

According to Nelson, people will realize that skill or craft is no longer the barrier; the only limit is their capacity for imagination.

Many industries already use computers to generate enormous numbers of candidate designs, which are then sifted for the ones that might work. Text-to-X models would let a human designer fine-tune that generative process from the start, using words to steer the computer through an endless space of options toward results that are not just possible but desirable.

Computers can conjure up spaces of endless possibility. Text-to-X will let us explore those spaces with words.

Altman thinks that is the real legacy. Eventually, he says, everything will be generated: images, video, audio. He believes it will simply seep into everything.
