
 "Landscape" "Self" "Human" "A drawing of a cat" "Walt Disney World" "A picture of Tokyo" "The Tokyo Tower is ..." "The Tokyo Tower is a communications and observation tower in the Shiba-koen district of Minato Tokyo Japan." "The United States ..." "The United States of America commonly known as the United States or America is a country primarily located in North America."

# Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts

## Abstract

Evolutionary algorithms have been used in the digital art scene since the 1970s. A popular application of genetic algorithms is to optimize the procedural placement of vector graphic primitives to resemble a given painting. In recent years, deep learning-based approaches have also been proposed to generate procedural drawings, which can be optimized using gradient descent.

In this work, we revisit the use of evolutionary algorithms for computational creativity. We find that modern evolution strategies (ES) algorithms, when tasked with the placement of shapes, offer large improvements in both quality and efficiency compared to traditional genetic algorithms, and are even comparable to gradient-based methods. We demonstrate that ES is also well suited to optimizing the placement of shapes to fit the CLIP model, and can produce diverse, distinct geometric abstractions that are aligned with human interpretation of language.

## Introduction

Starting from the early 20th century, in the wider context of modernism, a series of avant-garde art movements abandoned the depiction of objects according to traditional rules of perspective and instead adopted revolutionary, abstract points of view. The Cubism movement, popularized by influential artists including Pablo Picasso, proposed that objects be analyzed by the artist, broken up, and reassembled in an abstract form consisting of geometric representations. This naturally developed into geometric abstraction, where pioneering abstractionists like Wassily Kandinsky and Piet Mondrian represented the world using compositions of primitives that are either purely geometric or elementary. The impact is far-reaching: the use of simple geometry can be seen as one of the styles found in abstract expressionism, where artists expressed their subconscious or impulsive feelings. It also helped shape the minimalist art and minimalist architecture movements, in which everything is stripped down to its essential quality to achieve simplicity.

The idea of minimalist art has also been explored in computer art, with roots in mathematical art. In the 1990s, Schmidhuber proposed an art form called low-complexity art, a minimal art for the computer age that attempts to depict the essence of an object using ideas from algorithmic complexity. Similarly, algorithmic art generates art using an algorithm designed by the artist. In a broad sense, algorithmic art can be said to include genetic algorithms, in which the artist determines the rules governing how images evolve iteratively; these are a popular method for approximating images with simple shapes, often producing an abstract art style. As one example, a basic genetic algorithm has been proposed to represent a target image using semi-transparent, overlapping triangles (see "Basic" in Figure: Compare choices of evolution algorithm for an example). This approach has gained popularity over the years within the creative coding community, resulting in a number of sophisticated extensions. Interestingly, since these methods are iterative, they also echo process art, which emphasizes the process of making.

With the recent resurgence of interest in evolution strategies (ES) in the machine learning community, we revisit in this work the use of ES for creativity applications as an alternative to gradient-based methods. For approximating an image with shapes, we find that modern ES algorithms offer large improvements in both quality and efficiency compared to traditional genetic algorithms and, as we will also demonstrate, are even comparable to state-of-the-art differentiable rendering methods.

*Figure: Evolved results for the prompts "Self", "Walt Disney World", and "The corporate headquarters complex of Google located at 1600 Amphitheatre Parkway in Mountain View, California."*

We show that ES is also well suited to optimizing the placement of shapes to fit the CLIP model, and can produce diverse, distinct geometric abstractions that are aligned with human interpretation of language. Such alignment is due to the use of the CLIP model, which is trained on an aligned real-world text–image dataset. Interestingly, the results produced by our method resemble abstract expressionism and minimalist art. We provide a reference code implementation of our approach online so that it can be a useful tool in the computational artist's toolbox.

## Modern Evolution Strategies based Creativity

The architecture of our proposed method is shown in Figure: The architecture of our method above. Our method synthesizes a painting by placing transparent triangles using an evolution strategy (ES). Overall, we represent a configuration of triangles in a parameter space consisting of the positions and colors of the triangles, render the configuration onto a canvas, and calculate its fitness based on how well the rendered canvas fits a target image or a concept in the form of a text prompt. The ES algorithm keeps a pool of candidate configurations and uses mutations to evolve better ones, as measured by this fitness. To obtain better creative results, we use a modern ES algorithm, PGPE, optimized with the ClipUp optimizer. Engineering-wise, we use the pgpelib implementation of PGPE and ClipUp.
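To make the loop concrete, below is a minimal NumPy sketch of such an ask-and-tell ES loop. It is a stand-in for PGPE with ClipUp (our pipeline uses the pgpelib implementation, which exposes a similar ask/tell interface); the antithetic sampling scheme, hyper-parameters, and function names here are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def evolve(fitness_fn, n_params, popsize=64, iters=100, sigma=0.1, lr=0.05, seed=0):
    """Minimal ask-and-tell ES loop (a sketch standing in for PGPE + ClipUp).

    fitness_fn maps a parameter vector (e.g., flattened triangle parameters)
    to a scalar fitness to be maximized; it need not be differentiable.
    """
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.0, 1.0, n_params)  # center of the search distribution
    half = popsize // 2
    for _ in range(iters):
        # "ask": sample antithetic pairs (mu + sigma*eps, mu - sigma*eps)
        eps = rng.normal(0.0, 1.0, (half, n_params))
        fit_plus = np.array([fitness_fn(mu + sigma * e) for e in eps])
        fit_minus = np.array([fitness_fn(mu - sigma * e) for e in eps])
        # "tell": black-box gradient estimate from fitness differences
        grad = (fit_plus - fit_minus) @ eps / (popsize * sigma)
        mu = mu + lr * grad  # move the distribution center uphill
    return mu
```

On a toy quadratic fitness, this loop converges to the maximizer within the default budget, illustrating that only fitness values, never gradients of the renderer, are required.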

As we choose to follow the spirit of minimalist art, we use transparent triangles as the parameter space. Concretely, a configuration of $N$ triangles is parameterized by a collection of $(x_1, y_1, x_2, y_2, x_3, y_3, r, g, b, a)$ tuples, one per triangle, specifying its vertex coordinates and its RGBA (Red, Green, Blue, and Alpha, i.e., transparency) color, making $10N$ parameters in total. The ES updates all parameters, while the number of triangles $N$ is a fixed hyper-parameter. Note that $N$ is better understood as an upper bound on the number of triangles: although $N$ is fixed, the algorithm can still effectively use "fewer" triangles by making unwanted ones transparent.
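As a sketch, the flat $10N$-dimensional parameter vector can be unpacked into per-triangle geometry and color as follows; the exact ordering of the ten values per triangle is an assumption for illustration.

```python
import numpy as np

N = 50  # number of triangles (an upper bound; unwanted ones can evolve to transparent)

def decode(params, n=N):
    """Unpack a flat 10*n parameter vector into per-triangle data.

    Assumed layout per triangle: (x1, y1, x2, y2, x3, y3, r, g, b, a).
    """
    tri = params.reshape(n, 10)
    vertices = tri[:, :6].reshape(n, 3, 2)  # three (x, y) vertices per triangle
    colors = tri[:, 6:]                     # RGBA values, assumed in [0, 1]
    return vertices, colors

# example: a random configuration of N triangles
params = np.random.rand(10 * N)
vertices, colors = decode(params)
assert vertices.shape == (N, 3, 2) and colors.shape == (N, 4)
```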

As the ES is orthogonal to the concrete fitness evaluation, we are left with many free choices regarding what counts as fitting. In particular, we consider two kinds of fitness: fitting a concrete image and fitting a concept (the lower and upper branches in Figure: The architecture of our method above, respectively). Fitting a concrete image is straightforward: we simply use the negative pixel-wise L2 loss between the rendered canvas and the target image as the fitness. Fitting a concept requires more elaboration. We represent the concept as a text prompt and embed it using the text encoder in CLIP, which we discuss in detail in Related Works. We then embed the rendered canvas using the image encoder, also available in CLIP. Since the CLIP models are trained so that embedded images and texts are comparable under cosine similarity, we use this similarity as the fitness. We note that since the ES algorithm provides black-box optimization, the renderer, like the fitness computation, does not need to be differentiable.
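Both kinds of fitness amount to a few lines of array code. The sketch below assumes the canvas and target are float arrays in $[0, 1]$ and uses placeholder vectors where the real pipeline would use CLIP's image and text embeddings.

```python
import numpy as np

def l2_fitness(canvas, target):
    """Fitness for fitting a concrete image: negative pixel-wise L2 loss
    (higher is better, maximum 0 for a perfect match)."""
    return -float(np.mean((canvas - target) ** 2))

def cosine_fitness(image_emb, text_emb):
    """Fitness for fitting a concept: cosine similarity between the CLIP
    image embedding and text embedding (placeholder vectors here)."""
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
```

Note that neither function requires the canvas to be produced by a differentiable renderer, which is exactly what the black-box ES exploits.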

We find in practice that a few design decisions are needed for the whole pipeline to work reasonably well. First, we augment the rendered canvas by random cropping when calculating the fitness and average the fitness over the augmented canvases, following the practice of prior work. This prevents the rendered canvas from overfitting and increases the stability of the optimization. Second, we render the triangles on top of a background filled with uniformly distributed noise. Mathematically, this is equivalent to modeling the uncertainty of the parts of the canvas not covered by triangles under a maximum-entropy assumption, approximated with a Monte Carlo method. Finally, we limit the maximal alpha value of each triangle to $0.1$, which prevents front triangles from (overly) shadowing the back ones.
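These tricks can be sketched as follows: compositing capped-alpha layers over a fresh noise background, and averaging fitness over random crops. The layer format, canvas size, and crop settings are illustrative assumptions, and rasterizing triangles into coverage masks is outside this sketch.

```python
import numpy as np

ALPHA_MAX = 0.1  # cap per-triangle opacity so front triangles cannot fully occlude back ones

def composite_over_noise(layers, h=64, w=64, rng=None):
    """Alpha-composite (color, alpha, mask) layers back-to-front over a fresh
    uniform-noise background (one Monte Carlo sample of the max-entropy model
    for uncovered pixels). `mask` is a boolean per-pixel coverage map."""
    rng = rng or np.random.default_rng(0)
    canvas = rng.uniform(0.0, 1.0, (h, w, 3))  # noise background
    for color, alpha, mask in layers:
        a = min(alpha, ALPHA_MAX)  # alpha capping
        canvas[mask] = (1.0 - a) * canvas[mask] + a * np.asarray(color, dtype=float)
    return canvas

def augmented_fitness(canvas, fitness_fn, n_crops=8, crop=48, rng=None):
    """Average fitness over random crops to reduce overfitting to one view."""
    rng = rng or np.random.default_rng(0)
    h, w = canvas.shape[:2]
    scores = []
    for _ in range(n_crops):
        y = rng.integers(0, h - crop + 1)
        x = rng.integers(0, w - crop + 1)
        scores.append(fitness_fn(canvas[y:y + crop, x:x + crop]))
    return float(np.mean(scores))
```

Because the background is resampled on every evaluation, triangles are rewarded for actually covering the regions that matter rather than relying on a fixed backdrop.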

## Fitting Concrete Target Image

In this section, we show the performance of our proposed method on fitting a concrete target image; here, the model takes the lower branch in the architecture of our method. We show the result of fitting the famous painting "Mona Lisa" with $50$ triangles, running evolution for $10,000$ steps, in Figure: Our method fitting painting "Mona Lisa" earlier in the text. The results show a distinctive art style represented by well-placed triangles that capture both fine-grained textures and large backgrounds. The evolution process also demonstrates a coarse-to-fine adjustment of the triangles' positions and colors.

Fitness when fitting target images with different numbers of triangles (higher is better):

| Target Image | 10 Triangles | 25 Triangles | 50 Triangles | 200 Triangles |
|---|---|---|---|---|
| "Darwin" | 96.82% | 99.25% | 99.51% | 99.75% |
| "Mona Lisa" | 98.02% | 99.30% | 99.62% | 99.80% |
| "Anime Face" | 94.97% | 98.17% | 98.80% | 99.07% |
| "Landscape" | 97.07% | 98.83% | 99.08% | 99.25% |
| "Impressionism" | 98.82% | 99.23% | 99.34% | 99.48% |

Number of triangles and parameters. Our proposed pipeline is able to fit arbitrary target images and can handle a wide range of parameter counts, since PGPE runs efficiently, i.e., in time linear in the number of parameters. We demonstrate this by applying our method to fit several target images with $10$, $25$, $50$, and $200$ triangles, corresponding to $100$, $250$, $500$, and $2000$ parameters respectively. As shown in Figure: Qualitative and quantitative results from fitting target images with different number of triangles above, our pipeline works well for a wide range of target images, and the ES algorithm is capable of using the number of triangles as a "computational budget", where extra triangles are always utilized to gain fitness. This allows a human artist to use the number of triangles to find the right balance between abstractness and detail in the produced art.

*Figure: Compare choices of evolution algorithm. Columns: Target Image; Ours (10k iters); Basic (10k iters); Basic (560k iters).*

Choice of ES Algorithm. We compare two choices of evolution algorithm: ours, which uses the recent PGPE with ClipUp, and a basic, traditional one adopted in earlier work, which consists of mutation and simulated annealing. As shown in Figure: Compare choices of evolution algorithm above, our choice of a more recent algorithm leads to better results than the basic one under the same parameter budget. Subjectively, our final results are visually closer to the target image with a smoother evolution process, and quantitatively, our method reaches much better fitness ($99.62\%$ vs. $97.23\%$). Furthermore, even allowing the basic algorithm $56$ times more iterations does not yield results better than ours.
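For reference, the basic baseline can be sketched as a single-solution mutation loop with a simulated-annealing acceptance rule; the mutation scale and temperature schedule below are assumptions for illustration, not the exact settings of the original.

```python
import math
import numpy as np

def basic_anneal(fitness_fn, n_params, iters=10_000, t0=0.1, seed=0):
    """Sketch of the 'basic' baseline: mutate one candidate and accept with a
    simulated-annealing criterion (worse moves accepted with probability
    exp(delta / temperature), temperature decaying linearly to ~0)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_params)
    fx = fitness_fn(x)
    for i in range(iters):
        t = t0 * (1.0 - i / iters) + 1e-8              # linear cooling schedule
        y = np.clip(x + rng.normal(0.0, 0.02, n_params), 0.0, 1.0)  # small mutation
        fy = fitness_fn(y)
        if fy > fx or rng.random() < math.exp((fy - fx) / t):
            x, fx = y, fy                              # accept the mutant
    return x, fx
```

Unlike PGPE, this loop carries no search-distribution statistics between iterations, which is one intuition for why it needs far more evaluations to reach comparable fitness.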

Comparison with Gradient-based Optimization. While our proposed approach is ES-based, it is interesting to investigate how it compares to gradient-based optimization, since the latter has been commonly adopted recently. We therefore build a gradient-based setup by implementing the rendering of composed triangles with nvdiffrast, a point-sampling-based differentiable renderer, using the same processing as our ES approach. As shown in Figure: Evolution strategies vs. differentiable renderer above, our ES-based method achieves similar, even slightly higher, fitness than the gradient-optimized differentiable renderer. Perhaps more interestingly, the two methods produce artworks with different styles: our method adaptively allocates large triangles for the background and small ones for detailed textures, whereas the differentiable renderer tends to introduce textures unseen in the target image (especially in the background). We argue that, due to the difference in optimization mechanism, our method focuses more on the placement of triangles while the differentiable renderer pays more attention to the compositing of transparent colors.

## Fitting Abstract Concept with CLIP

In this section, we show the performance of our method configured to fit an abstract concept represented by language; here, the model takes the upper branch in Figure: The architecture of our method above. Formally, the parameter space remains the same, but the fitness is calculated as the cosine similarity between the text prompt and the rendered canvas, both embedded by CLIP. Since the model is given more freedom to decide what to paint, this is arguably a much harder yet more interesting problem than fitting concrete images in the previous section.

In Figure: ES and CLIP fit the concept represented in text prompt earlier in the text, we show the result and process of evolving toward an abstract concept represented as a text prompt, using $50$ triangles and running evolution for $2,000$ steps. We find that, unlike fitting a concrete image, $2,000$ steps is enough for fitting a concept to converge.
Our method can handle text prompts ranging from a single word to a phrase to a long sentence, even though the task itself is arguably more challenging than the previous one. The results show a creative art concept that is abstract, not resembling any particular image, yet correlated with humans' interpretation of the text. The evolution process also demonstrates iterative adjustment, such as the human shape in the first two examples, the shape of the castles in Disney World, and, in the final example, the corporate-themed headquarters. Also, compared to fitting concrete images in the previous section, our method cares more about the placement of triangles.

*Figure: Qualitative results with 10, 25, 50, and 200 triangles for the prompts "Self", "Human", "Walt Disney World", "A picture of Tokyo", "The corporate headquarters complex of Google located at 1600 Amphitheatre Parkway in Mountain View, California.", and "The United States of America commonly known as the United States or America is a country primarily located in North America."*

Number of triangles and parameters. Like fitting a concrete image, we can also fit an abstract concept with a wide range of parameter counts, since the PGPE algorithm and the way we represent the canvas remain the same. In Figure: Qualitative results from ES and CLIP fitting several text prompt with different numbers of triangles above, we apply our method to fit several concepts (text prompts) with $10$, $25$, $50$, and $200$ triangles, corresponding to $100$, $250$, $500$, and $2000$ parameters respectively. The results show that our pipeline is capable of leveraging the number of triangles as a "budget for fitting" to balance between detail and the level of abstraction. As in the previous task, this allows a human artist to control the abstractness of the produced art.

We observe that while the model can comfortably handle at least up to $50$ triangles, more triangles ($200$) sometimes pose challenges: for example, with $200$ triangles, "corporate headquarters ..." gets a better result while "a picture of Tokyo" leads to a poor one. This may be due to the difficulty of compositing overly shadowed triangles, and we leave it for future study.

*Figure: 4 individual runs for each of the prompts "Self", "Human", "Walt Disney World", "A picture of Tokyo", "The corporate headquarters complex of Google located at 1600 Amphitheatre Parkway in Mountain View, California.", and "The United States of America commonly known as the United States or America is a country primarily located in North America."*

Multiple Runs. Since the target is an abstract concept rather than a concrete image, our method is given much freedom in arranging the configuration of triangles, which means random initialization and noise in the optimization can lead to drastically different solutions. In Figure: 4 individual runs above, we show 4 separate runs of our method on several text prompts, each using $50$ triangles and $2,000$ iterations, the same as in the previous examples. As shown, our method creates distinctive abstractions aligned with human interpretation of language while producing diverse results from the same text prompt. This, again, is a desired property for computer-assisted art creation, where human creators can be put "in the loop", not only tweaking the text prompt but also picking from multiple candidates produced by our method.

Comparison with Gradient-based Optimization. With CLIP in mind, we are also interested in how our ES-based approach compares to gradient-based optimization, especially since many existing works have proposed leveraging CLIP to guide generation using gradients. Arguably, this is a more challenging task due to the interplay of two drastically different gradient dynamics from the renderer and CLIP. Usually, making such a combination work well requires further study that warrants a manuscript of its own. Nonetheless, we have built a reasonably working version for comparison. As in fitting a target image, we implement the rendering process of composing triangles with nvdiffrast. In the forward pass, we render the canvas from the parameters, feed the canvas to the CLIP image encoder, and use the cosine distance between the encoded image and the encoded text prompt as the loss. We then back-propagate all the way to the parameters of the triangles for gradient-based optimization, using the same processing as our ES approach.

As shown in Figure: Evolution strategies vs. differentiable renderer for CLIP, while both our ES method and the differentiable method produce images that are aligned with human interpretation of the text prompt, ours produces clearer abstraction and clearer boundaries between shapes and objects. More interestingly, since ours represents an art style closely resembling abstract expressionism, the difference between ours and the differentiable renderer is similar to that between post-impressionism and impressionism, where bolder geometric forms and colors are used. As in the counterpart comparison for fitting a concrete image, we argue that such results are intrinsically rooted in the optimization mechanism, and that our proposed method leads to a unique art style through our design choices.

## Related Works and Background of our Work

### Related Works

Generating procedural drawings by optimizing with gradient descent using deep learning has been attracting attention in recent years. A growing list of works also tackles the problem of approximating pixel images with a simulated paint medium, and differentiable rendering methods enable computer graphics to be optimized directly using gradient descent. For learning abstract representations, probabilistic generative models have been proposed to sample procedural drawings directly from a latent space, without any given input images, similar to their pixel-image counterparts. To interface with natural language, methods have been proposed to procedurally generate drawings of image categories and word embeddings, enabling an algorithm to draw what's written. This combination of NLP and pixel image generation is explored at a larger scale in CLIP, and its procedural sketch counterpart CLIPDraw.

Perhaps the closest to our approach among the related works is , which, similar to our work, uses a CLIP-like dual-encoder model pre-trained on the ALIGN dataset to judge the similarity between the generated art and the text prompt, and leverages evolutionary algorithms to optimize a non-differentiable rendering process. However, there are several key differences: it parameterizes the rendering process with a hierarchical neural Lindenmayer system powered by multi-layer LSTMs and, as a result, models patterns with complex spatial relations well, whereas our work favors a drastically simpler parameterization that simply places triangles individually on the canvas, facilitating a different, minimalist art style that is complementary to theirs (See). Moreover, while it uses a simple binary-tournament genetic algorithm, we opt for a modern, state-of-the-art evolution strategy, PGPE with ClipUp, which performs well enough to produce interesting results within a few thousand steps.

### Background of our Work

Evolution Strategies (ES) have been applied to optimization problems for a long time. A straightforward implementation of ES iteratively perturbs the parameters in a pool and keeps the fittest candidates, which is simple yet inefficient, and applying such a straightforward algorithm can lead to sub-optimal performance for art creation. To overcome this generic issue, recent advances have improved the performance of ES algorithms. One such improvement is Policy Gradients with Parameter-Based Exploration (PGPE), which estimates gradients in a black-box fashion, so the fitness computation does not need to be differentiable.
Since each PGPE iteration runs in time linear in the number of parameters, it is efficient and a go-to algorithm in many scenarios. With the estimated gradients, gradient-based optimizers such as Adam can be used, while works such as ClipUp offer a simpler, more efficient optimizer specifically tailored to PGPE. Another representative ES algorithm is the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which in practice is often considered more performant than PGPE. However, each of its iterations runs in time quadratic in the number of parameters, which limits its use in problems with larger numbers of parameters where PGPE remains feasible.
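The two ingredients can be sketched as follows: a PGPE-style antithetic gradient estimate computed from fitness values alone, and a ClipUp-style update that normalizes the gradient direction, applies momentum, and clips the velocity norm. PGPE also adapts the sampling spread $\sigma$, which we omit here, and all constants below are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def pgpe_step(mu, sigma, fitness_fn, popsize, rng):
    """One PGPE-style iteration: sample antithetic perturbations around mu and
    form a black-box estimate of the fitness gradient w.r.t. mu."""
    half = popsize // 2
    deltas = rng.normal(0.0, sigma, (half, mu.size))        # perturbations ~ N(0, sigma^2)
    f_plus = np.array([fitness_fn(mu + d) for d in deltas])
    f_minus = np.array([fitness_fn(mu - d) for d in deltas])
    # antithetic estimator: E[ ((f+ - f-)/2) * delta / sigma^2 ] = grad_mu
    return ((f_plus - f_minus) / 2.0) @ deltas / (half * sigma ** 2)

def clipup_update(velocity, grad, lr=0.15, momentum=0.9, max_speed=0.3):
    """ClipUp-style update: fixed-length step along the gradient direction,
    heavy-ball momentum, and a hard cap on the velocity norm."""
    step = lr * grad / (np.linalg.norm(grad) + 1e-12)  # normalize the direction
    velocity = momentum * velocity + step
    speed = np.linalg.norm(velocity)
    if speed > max_speed:
        velocity = velocity * (max_speed / speed)      # clip to max_speed
    return velocity
```

In a full loop, the center would be updated as `mu += clipup_update(velocity, pgpe_step(...))` each iteration; the normalization and clipping make step sizes insensitive to the raw scale of the fitness values.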

Language-derived Image Generation has seen a very recent trend in creativity settings, with several directions leveraging CLIP, a pre-trained model with two encoders, one for images and one for text, that converts images and text into the same comparable, low-dimensional embedding space. As the image encoder is a differentiable neural network, it can provide a gradient for the output of a differentiable generative model, which can be further back-propagated through that model to its parameters. For example, one line of work uses CLIP's gradient to guide a GAN generator, such as guiding BigGAN, VQGAN, or Siren, or a GAN with a genetic-algorithm-generated latent space. Another line of work applies CLIP to differentiable renderers: CLIPDraw generates images with diffvg, a differentiable SVG renderer. Although all these methods use the same pre-trained CLIP model for guidance, they show drastically different artistic properties, for which we hypothesize that the art style is determined by the intrinsic properties of the "painter", i.e., the GAN generator or the renderer.

## Discussion and Conclusion

In this work, we revisit evolutionary algorithms for computational creativity by combining modern evolution strategies (ES) algorithms with the drawing primitive of triangles, inspired by the minimalist art style. Our proposed method offers considerable improvements in both quality and efficiency compared to traditional genetic algorithms and is comparable to gradient-based methods. Furthermore, we demonstrate that the ES algorithm can produce diverse, distinct geometric abstractions aligned with human interpretation of language and images. Our findings suggest that ES methods produce very different and sometimes better results compared to gradient-based methods, arguably due to the intrinsic behavior of the optimization mechanism. However, it remains an open problem to understand how ES methods compare with gradient-based methods in general settings. We expect future works to investigate a broader spectrum of art forms beyond the minimalism explored here.

Our work with evolutionary algorithms offers insight into a different paradigm that can be applied to computational creativity. Widely adopted gradient-based methods are fine-tuned for specific domains, e.g., differentiable rendering for edges and parameterized shapes, or data-driven techniques for rendering better textures. Each application requires domain-specific tuning and tweaks that are hard to transfer. In contrast, ES is agnostic to the domain, i.e., to how the renderer works. We envision that ES-inspired approaches could potentially unify various domains with significantly less adaptation effort in the future.

## Acknowledgements

We thank Toru Lin, Jerry Li, Yujin Tang, Yanghua Jin, Jesse Engel, Yifu Zhao, and Yuxuan Shui for their valuable comments and suggestions. We especially thank Yanghua Jin for his kind help with nvdiffrast.

Any errors here are our own and do not reflect opinions of our proofreaders and colleagues. If you see mistakes or want to suggest changes, feel free to contribute feedback by participating in the discussion forum for this article.

The experiments in this work were performed on multi-GPU Linux virtual machines provided by Google Cloud Platform.

Vision icon by artist monkik.

## Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.