Stable Diffusion: da testo ad immagine

Sep 2, 2022 · 2 min read

[UPDATE] Nuovo articolo in inglese che lo spiega in modo approfondito il funzionamento di Stable Diffusion.

Scrivo questo breve articolo per illustrarvi Stable Diffusion, un tool in grado di creare una immagine partendo da un testo qualsiasi che la descriva.

Questo sistema è simile a DALL·E ma richiede molto meno risorse per girare (lo proveremo a breve su un MacMini…) e le immagini che genera sono libere da Copyright.

Stable Diffusion funziona incastonando tre sistemi di apprendimento (Machine Learning): uno che è in grado di “comprimere” un notevole numero di immagini e memorizzarle nella rete neurale, ed uno che è in grado di associare delle parole e una semantica a tale immagini. La “compressione” utilizza un “rumore” di fondo per diciamo “sfocare” l’immagine ad ogni passo, ma è pefettamente reversibile. Una volta che queste immagini sono “qualunquizzate”, e associate ad una semantica, il sistema è in grado di usarle per creare altre immagini. Questo articolo spiega il meccanismo nel dettaglio, e conclude:

So the real question that you are asking yourselves, of course, is: where does the magic come from?
As I’ve described, it’s a complex system composed of three parts - the autoencoder, the language model for text embeddings and the latent diffusion model.

All of these parts are trained on a huge amount of images or image/text pairs, so the embeddings for the autoencoder and the language model are quite sophisticated and cover most of our human semantic space. […] The latent diffusion model itself is trained to uncover an image out of noise, but guided by this embedding, so it drives the creative embedding concept towards an image representation. Then finally, the decoder helps to bring the latent representation to a more upscaled and human visible version (and it also is trained on millions of images!).

Da quello che si può capire il tool crea “patchwork” partendo da immagini di base: ma le immagini risultanti sono “pseudo-originali”.

Non tutte le immagini vengono bene, ma alcune sono veramente impressionanti:

A spooky witch dressed in white, in front of a dark house during the sunset:

“A tir in the sunset in a desert road, with a gigant spider in the background awaiting for eating it”

“A fairy in the space near saturn, fighting a gigant nippon robot”

A gigant spider on a skyscreaper in the middle of San Francisco