UniTune's command of semantic composition is outstanding. Note how, in the uppermost row of pictures, the faces of the two people have not been distorted by the extraordinary transformation applied to the rest of the source image (right). Source:

As Stable Diffusion fans will have learned by now, applying edits to partial sections of a picture without adversely altering the rest of the image can be a tricky, sometimes impossible operation. Though popular distributions such as AUTOMATIC1111 can create masks for local and restricted edits, the process is tortuous and frequently unpredictable.

The obvious answer, at least to a computer vision practitioner, is to interpose a layer of semantic segmentation capable of recognizing and isolating objects in an image without user intervention, and, indeed, there have been several new initiatives along this line of thought lately.

Another possibility for locking down messy and entangled neural image-editing operations is to leverage OpenAI's influential Contrastive Language–Image Pre-training (CLIP) module, which is at the heart of latent diffusion models such as DALL-E 2 and Stable Diffusion, to act as a filter at the point at which a text-to-image model is ready to send an interpreted render back to the user. In this context, CLIP should act as a sentinel and quality-control module, rejecting malformed or otherwise unsuitable renders. This is about to be instituted (Discord link) at Stability.ai's DreamStudio API-driven portal.
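By way of illustration, a minimal sketch of this kind of CLIP-based filtering, using the Hugging Face transformers implementation of CLIP, might look as follows. The model name, the similarity threshold and the clip_score / filter_renders helpers are illustrative assumptions, not anything specified by DreamStudio or the papers discussed here.

```python
# Minimal sketch: score generated renders against the prompt with CLIP and
# discard low-scoring ones. Model choice and threshold are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """Return CLIP's image-text similarity logit for a single render."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()

def filter_renders(prompt: str, renders: list, threshold: float = 25.0) -> list:
    """Act as a 'sentinel': keep only renders whose similarity clears the threshold."""
    return [img for img in renders if clip_score(prompt, img) >= threshold]
```

In practice the threshold would have to be calibrated, since raw CLIP similarity logits are not normalized across different styles of prompt.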
However, since CLIP is arguably both the culprit and the solution in such a scenario (because it essentially also informed the way in which the image was evolved), and since the hardware requirements may exceed what is likely to be available locally to an end-user, this approach may not be ideal.

The proposed UniTune instead 'fine-tunes' an existing diffusion model – in this case, Google's own Imagen, though the researchers state that the method is compatible with other latent diffusion architectures – so that a unique token is injected into it, which can be summoned up by including it in a text prompt.

At face value, this sounds like Google's DreamBooth, currently an obsession among Stable Diffusion fans and developers, which can inject novel characters or objects into an existing checkpoint, often in less than an hour, based on a mere handful of source pictures; or else like Textual Inversion, which creates 'sidecar' files for a checkpoint that are then treated as if they were originally trained into the model, and which can take advantage of the model's own vast resources by modifying its text classifier, resulting in a tiny file (compared to the minimum 2GB pruned checkpoints of DreamBooth).

In fact, the researchers assert, UniTune rejected both of these approaches. They found that Textual Inversion omitted too many important details, while DreamBooth 'performed worse and took longer' than the solution they finally settled on.

Nonetheless, UniTune uses the same encapsulated semantic 'metaprompt' approach as DreamBooth, with trained changes summoned up by unique words chosen by the trainer that will not clash with any terms that currently exist in a laboriously-trained public release model.

The Process

'To perform the edit operation, we sample the fine-tuned models with the prompt “[rare_tokens] edit_prompt” (e.g. “beikkpic two dogs in a restaurant” or “beikkpic a minion”).'

Though it is mystifying why two almost identical papers, in terms of their end functionality, should arrive from Google in the same week, there is, despite a huge number of similarities between the two initiatives, at least one clear difference between UniTune and Imagic – the latter uses 'uncompressed' natural language prompts to guide image-editing operations, whereas UniTune trains in unique DreamBooth-style tokens.

From the UniTune paper – UniTune sets itself against Google's favorite rival neural editing framework, SDEdit. UniTune's results are on the far right, while the estimated mask is seen in the second image from the left.

Therefore, if you were editing with Imagic and wished to effect a transformation of this nature, you would input 'the third person, sitting in the background, as a cute furry monster'. The equivalent UniTune command would be 'Guy at the back as [x]', where [x] is whatever weird and unique word was bound to the fine-tuned concept associated with the furry monster character.
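To make the token mechanics above concrete, here is a minimal sketch of how a checkpoint fine-tuned in this DreamBooth-like fashion would be sampled with a rare-token edit prompt. The checkpoint path and the use of the Hugging Face diffusers pipeline are assumptions made for illustration – UniTune itself was demonstrated on Imagen, which has no public checkpoint – so this is an analogy for the prompt construction, not the authors' implementation.

```python
# Minimal sketch: sampling a fine-tuned checkpoint with a rare-token edit prompt.
# The checkpoint path is a hypothetical stand-in; UniTune fine-tunes Imagen,
# which is not publicly released.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./unitune-style-finetune",            # illustrative path, not a real release
    torch_dtype=torch.float16,
).to("cuda")

rare_token = "beikkpic"                    # the unique word chosen at fine-tuning time
edit_prompt = "two dogs in a restaurant"   # the natural-language edit

# The edit is invoked by prepending the rare token to the edit prompt,
# mirroring the paper's "beikkpic two dogs in a restaurant" example.
image = pipe(f"{rare_token} {edit_prompt}", guidance_scale=7.5).images[0]
image.save("edited.png")
```

This covers only the prompt mechanics described in the quoted passage; the rest of the UniTune editing pipeline is not reproduced here.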