Generate a Good Story! StoryDiffusion: Consistent Self-Attention and a Semantic Motion Predictor Are Essential (Nankai & ByteDance)
We place great value on original articles; to respect intellectual property and avoid potential copyright issues, we provide a summary of the article here as an initial overview. For the full details, please visit the author's official account page to read the complete article.
Article Summary:
Introduction
Diffusion models have shown exceptional potential in content generation, including images, 3D objects, and videos. However, maintaining a consistent subject, such as characters with consistent identity and attire, across the images and videos that tell a story remains challenging. The objective of this paper is to find a method that generates images and videos with characters consistent in identity and attire, while preserving maximal controllability through the user's text prompts.
Related Work
Diffusion models have come to dominate generative modeling thanks to their ability to produce realistic images, with applications spanning image and video generation, 3D modeling, and low-level vision tasks. Text-to-image generation has seen significant advances with models such as Latent Diffusion, DiT, and Stable Diffusion XL (SDXL).
Method
The paper introduces two novel components: Consistent Self-Attention, which generates subject-consistent images without any training, and a Semantic Motion Predictor for video generation. The resulting framework, StoryDiffusion, can narrate a text-based story with consistent images or videos covering a wide range of content.
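Based on the paper's high-level description, Consistent Self-Attention augments each image's self-attention with tokens sampled from the other images generated in the same batch, so every image in the story attends to shared visual context and stays consistent in identity and attire. Below is a minimal, single-head PyTorch sketch of that idea; the function name, the sample_rate parameter, and the random-sampling scheme are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the Consistent Self-Attention idea (assumptions noted above).
import torch
import torch.nn.functional as F

def consistent_self_attention(tokens, to_q, to_k, to_v, sample_rate=0.5):
    """tokens: (B, N, C) hidden states of B images generated in one batch."""
    B, N, C = tokens.shape
    outputs = []
    for i in range(B):
        own = tokens[i]                                            # (N, C) tokens of the current image
        # Sample tokens from the *other* images in the batch so this image
        # attends to shared, cross-image context (the source of consistency).
        others = torch.cat([tokens[j] for j in range(B) if j != i], dim=0)
        keep = torch.randperm(others.shape[0])[: int(sample_rate * others.shape[0])]
        kv_tokens = torch.cat([own, others[keep]], dim=0)          # (N + M, C)

        q = to_q(own).unsqueeze(0)                                 # (1, N, C)
        k = to_k(kv_tokens).unsqueeze(0)                           # (1, N + M, C)
        v = to_v(kv_tokens).unsqueeze(0)                           # (1, N + M, C)
        out = F.scaled_dot_product_attention(q, k, v)              # (1, N, C)
        outputs.append(out.squeeze(0))
    return torch.stack(outputs, dim=0)                             # (B, N, C)

# Toy usage with hypothetical projection layers:
# B, N, C = 4, 16, 64
# out = consistent_self_attention(torch.randn(B, N, C),
#                                 torch.nn.Linear(C, C),
#                                 torch.nn.Linear(C, C),
#                                 torch.nn.Linear(C, C))
```

Because the extra key/value tokens come from images already being generated, this sketch needs no extra training; in practice it would replace the self-attention of an existing text-to-image U-Net, with multi-head handling and masking left out here for brevity.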
Experiments
The proposed StoryDiffusion outperforms recent methods in generating consistent images and stable long videos. The method is evaluated qualitatively and quantitatively against other state-of-the-art methods. A user study further confirms StoryDiffusion's superior performance.
Conclusion
StoryDiffusion is a new method that generates consistent images for storytelling without training and transforms these images into videos. It aims to inspire future efforts in controllable image and video generation.
References
[1] StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation