Toward unified text-guided I2I! Peking University's Wangxuan Institute proposes FCDiffusion: an end-to-end framework for diverse image translation tasks
We place great importance on original articles. To respect intellectual property and avoid potential copyright issues, we provide a summary here for an initial overview. For more detailed content, please visit the author's official account page for the complete article.
Summary
The article introduces the Frequency-Controlled Diffusion Model (FCDiffusion), a novel text-guided image-to-image (I2I) translation framework that leverages a diffusion-based approach from a frequency domain perspective. FCDiffusion is designed to manage a variety of I2I tasks by filtering source image features in the Discrete Cosine Transform (DCT) domain and using them as control signals for a pre-trained Latent Diffusion Model (LDM).
Key Features of FCDiffusion
- Applies different DCT filters to create control signals suitable for various I2I tasks.
- Incorporates multiple scalable frequency control branches within a single model for flexible task switching during inference.
- Has a straightforward learning objective, requires modest computational resources, offers fast inference, and is competitive in I2I visual quality.
Methodology
FCDiffusion consists of three main components: a pre-trained LDM, a Frequency Filtering Module (FFM), and a FreqControlNet (FCNet). The FFM filters the encoded image features in the DCT spectrum to produce control signals, which the FCNet feeds into the LDM to guide its denoising sampling process under the given textual prompt during inference.
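To make the frequency-filtering idea concrete, here is a minimal sketch of DCT band-pass filtering on a latent feature map, assuming a numpy array. The band boundaries and the `u + v` frequency-radius rule are illustrative assumptions, not the paper's exact FFM configuration.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_band_filter(feat, lo_frac=0.0, hi_frac=0.25):
    """Keep DCT coefficients whose normalized frequency index lies in
    [lo_frac, hi_frac); zero out the rest, then transform back."""
    h, w = feat.shape[-2:]
    coeffs = dctn(feat, axes=(-2, -1), norm="ortho")
    # Frequency index of each DCT coefficient: u + v (small = low frequency)
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    radius = (u + v) / (h + w - 2)            # normalized to [0, 1]
    mask = (radius >= lo_frac) & (radius < hi_frac)
    return idctn(coeffs * mask, axes=(-2, -1), norm="ortho")

# Hypothetical 4-channel LDM latent; a low-pass band roughly preserves the
# source image's layout, while higher bands carry texture and style cues.
latent = np.random.randn(4, 32, 32)
low_pass_signal = dct_band_filter(latent, 0.0, 0.2)
```

Selecting different `(lo_frac, hi_frac)` bands is what lets a single framework switch between I2I tasks: low-frequency control retains structure for semantic manipulation, while passing only high frequencies supports style-driven content creation.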
Experimental Results
FCDiffusion has been shown to produce high-quality results across diverse I2I scenarios, including style-driven content creation, semantic image manipulation, image scene transition, and image style conversion. It outperforms other state-of-the-art methods in qualitative comparisons and demonstrates superior capabilities in maintaining the desired associations between source and generated images.
Ablation Studies and Quantitative Evaluations
Ablation studies validate the importance of time embedding, text embedding, and the Equifrequency Shuffle operation in FCNet. Quantitative evaluations highlight FCDiffusion's ability to balance text fidelity and structural similarity in generated images, as well as its competitive performance in both content description and spatial separation of image transformations. Furthermore, FCDiffusion offers significant advantages in inference speed compared to other text-guided I2I methods.
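The Equifrequency Shuffle ablated above can be sketched as follows: DCT coefficients sharing the same frequency index `u + v` are randomly permuted within their group. The grouping rule here is an assumption inferred from the operation's name; the paper's exact implementation may differ.

```python
import numpy as np

def equifrequency_shuffle(coeffs, rng=None):
    """Randomly permute 2D DCT coefficients within each equal-frequency
    (anti-diagonal) group, leaving the per-band energy distribution intact."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = coeffs.shape
    out = coeffs.copy()
    freq = np.add.outer(np.arange(h), np.arange(w))  # u + v per coefficient
    for f in range(h + w - 1):
        rows, cols = np.nonzero(freq == f)
        perm = rng.permutation(len(rows))
        out[rows, cols] = coeffs[rows[perm], cols[perm]]
    return out
```

Because the shuffle only reorders coefficients inside a band, it perturbs spatial detail while preserving each band's coefficient values, which is a plausible way to regularize the control branch against overfitting to exact spatial layouts.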
Conclusion and Discussion
FCDiffusion presents a versatile text-guided I2I solution with advantages in inference time and generation quality. It adapts a pre-trained text-to-image model for I2I translation by controlling the reverse diffusion process using filtered image features from different DCT spectral bands. The approach is scalable and can be further developed to allow for continuous control effects without the need for additional training of frequency control branches.
For further information and technical exchange, readers are encouraged to join the AIGC technology exchange group; references to the source material are provided at the end of the original article.