DALL·E: A creative step towards general Artificial Intelligence

DALL·E is an artificial intelligence program created by OpenAI that generates images based on short text phrases. CLIP, another AI model, curates DALL·E’s creations to present the most suitable images as results. DALL·E has shown interesting and seemingly intelligent behaviours, some of them quite unexpected, and the program may eventually prove to have many useful applications.

The Technology

DALL·E is an artificial intelligence program that visualises concepts by generating images of both realistic and unrealistic objects from short natural language prompts or text descriptions. The program, announced by OpenAI in January 2021, takes its name from a portmanteau¹ of WALL-E, the Pixar film about an eco-friendly robot, and Salvador Dalí, the surrealist painter whose work merges the imagined with the rational world.

DALL·E uses a 12-billion-parameter version of the GPT-3 Transformer model² to interpret natural language inputs, and it works alongside another model called CLIP (Contrastive Language-Image Pre-training)³. Trained on text-image pairs from the Internet, the transformer learns to turn text inputs into pixel outputs. For each text description, DALL·E generates multiple images, which CLIP, an image recognition system, then ranks so that only the best matches are presented. By relating language to imagery, DALL·E suggests a way to give artificial intelligence models a better grasp of how humans understand and interpret the world through reference and relation, and of how new ideas and concepts are formed, which is a step towards more general artificial intelligence.
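The generate-then-rank step described above can be illustrated with a small sketch. It assumes that each image and the prompt have already been turned into numeric embedding vectors and that candidates are ranked by cosine similarity to the prompt, which is how contrastive models such as CLIP are typically used. The function names and the toy three-number vectors are hypothetical stand-ins, not OpenAI's code or real CLIP embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def rank_candidates(
    text_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs for the top_k best matches."""
    scored = [
        (index, cosine_similarity(text_embedding, embedding))
        for index, embedding in enumerate(image_embeddings)
    ]
    # Highest similarity first, so the best-matching images come out on top.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings (assumed values, chosen only for illustration).
prompt_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.0],  # candidate 0: unrelated to the prompt
    [0.9, 0.1, 0.0],  # candidate 1: close match
    [0.5, 0.5, 0.0],  # candidate 2: partial match
]

best = rank_candidates(prompt_embedding, candidate_embeddings, top_k=2)
print([index for index, _ in best])  # prints [1, 2]
```

In this toy run, the close match (candidate 1) is ranked first and the unrelated image is dropped, which mirrors the idea of keeping only the generated images that best fit the prompt.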

The Features

DALL·E⁴ is an image generator that has shown it can create humanised animals and objects, combine unrelated concepts into a plausible image and apply transformations to existing images. Some of its other features are listed below:

The AI program can work out from the prompt which of an object’s attributes should be modified and how many times the object should appear in a generated image.
It is also capable of creating multiple objects within an image while keeping their attributes and spatial relationships sensible, for example through associations, relative positioning and stacking.⁵
DALL·E can visualise perspective and three-dimensionality; in one test it repeatedly drew a figure’s head at equally spaced angles, and the frames combined into an animation of a rotating head.
The model can also visualise internal and external structures such as cross-sectional views and macro photographs.
DALL·E can infer contextual details for changing styles, settings and time, and can also render optical distortions and reflections. It can draw the same object in different situations as well as generate images with specific text on them.

The exact same cat on the top as a sketch on the bottom

It has also learned geographic facts, landmarks and neighbourhoods, and has demonstrated temporal knowledge.
One of DALL·E’s most interesting abilities is filling in the blanks when a prompt leaves details unspecified, such as the number of objects, their attributes, arrangement, angle, lighting and location.
It can perform image to image translation tasks and can also sensibly combine unrelated concepts.

An armchair in the shape of an avocado

The Results

DALL·E’s strengths lie in its ability to understand natural language, grasp the relations and references that underpin human understanding, and generate images in styles ranging from photorealism to paintings and emojis. Some of its seemingly intelligent behaviours came as a surprise even to its creators at OpenAI. One of the most exciting is that DALL·E has learned visual reasoning skills⁶ that are said to be sufficient to solve Raven’s Matrices.⁷ The model’s intelligence is also reflected in how it manipulates and places objects in the images it produces. Another striking feature is its creativity, which bears a remarkable resemblance to human imagination and allows it to blend concepts coherently. Other key features include its ability to infer appropriate contextual details and its understanding of visual and design trends, which allows it to create images appropriate to specific periods of time. Taken together, these achievements are a step towards general artificial intelligence.

The Applications

OpenAI mentioned that they did not have a specific application in mind while creating DALL·E. However, the program could have many applications. VentureBeat has called DALL·E “a visual idea generator”, which may be the most apt description of it. OpenAI has noted the potential for bias and ethical challenges and, equally importantly, for a widespread impact on society, including on work processes and professions. It may therefore be a while before we see DALL·E in action. Even so, there could be a plethora of applications, including but not limited to the following:

DALL·E could contribute significantly to the fields of journalism and content writing as well as advertising.
Design is, of course, one of the first areas that comes to mind, from illustration and graphic design all the way to digital art.
DALL·E has shown an aptitude for fashion design, furniture design and interior design. For quick concept visuals, it could prove much faster than the 3D rendering software currently in use.

A female mannequin dressed in a black leather jacket and gold pleated skirt

Architects could use it to visualise buildings and, in the future, potentially complete cityscapes. Archaeologists could recreate ancient structures.

A loft bedroom with a white bed next to a nightstand. There is a fish tank standing beside the bed

Other applications could include video games, education and medicine. Some of DALL·E’s images have already been said to induce feelings of joy, which opens up the field of mental health; at the very least, the program could serve as a mental health aid.
With a little more progress, DALL·E could be used for more immersive experiences, for producing synthetic video on demand and even for producing storyboards.

As seen above, DALL·E has great potential for numerous and widespread applications. It has demonstrated a large number of seemingly intelligent features that edge close to general artificial intelligence and to human imagination and creativity. In summary, DALL·E is a leap towards the future of general artificial intelligence, and once OpenAI has addressed its potential for bias and the ethical challenges it presents, further development could make it very useful.

_____________________________________________________________________________________________

¹Portmanteau is a word that combines the sounds and meanings of two words

²DALL·E uses a scaled-down version of GPT-3. The full GPT-3 model has 175 billion parameters

³CLIP is an AI model that curates image outputs from DALL·E to present the highest quality images for any prompt. CLIP was trained on 400 million image and text pairs

⁴With DALL·E, OpenAI has refined GPT-3 to focus on visual concepts through language

⁵As the number of objects in a prompt increases, DALL·E begins to get confused

⁶DALL·E uses zero-shot learning, which means that the program handles inputs that were not used during training and that it is observing for the first time

⁷Raven’s Matrices is an intelligence test usually used to measure abstract reasoning

__________________________________________________________________________________________

References:

1. https://openai.com/blog/dall-e/

2. https://venturebeat.com/2021/01/16/openais-text-to-image-engine-dall-e-is-a-powerful-visual-idea-generator/
