arxiv:2312.02139

DiffiT: Diffusion Vision Transformers for Image Generation

Published on Dec 4, 2023
· Submitted by AK on Dec 5, 2023

Abstract

A novel diffusion model using vision transformers with a hierarchical architecture and time-dependent self-attention achieves state-of-the-art performance in image generation.

AI-generated summary

Diffusion models with their powerful expressivity and high sample quality have enabled many new applications and use cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and it achieves state-of-the-art (SOTA) benchmarks on a variety of class-conditional and unconditional synthesis tasks. In latent space, DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
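
The time-dependent self-attention described in the abstract can be made concrete with a short sketch. Below is a minimal PyTorch implementation in the spirit of that description: the time-step embedding contributes to the queries, keys, and values alongside the spatial tokens, so the attention pattern can shift across denoising steps. All names (`TimeDependentSelfAttention`, `qkv_spatial`, `qkv_time`, `t_emb`) and shapes are illustrative assumptions, not the official NVlabs/DiffiT code, which additionally includes a relative positional bias omitted here.

```python
# Minimal sketch of a time-dependent self-attention layer (illustrative,
# not the official DiffiT implementation). The time embedding is projected
# and summed into the q/k/v computation so attention itself is conditioned
# on the denoising step, not only the features.
import math
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections for spatial tokens and the time token;
        # their contributions are summed before attention.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) spatial tokens; t_emb: (batch, dim) time embedding
        b, n, d = x.shape
        qkv = self.qkv_spatial(x) + self.qkv_time(t_emb).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim) for multi-head attention.
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

The design point worth noting is that the time signal enters the attention computation itself, rather than only scaling or shifting activations as in AdaLN-style conditioning, which is what lets the attention maps, and not just the features, vary over the denoising trajectory.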

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space



Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2312.02139 in a model README.md to link it from this page.
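
For instance, a model card that mentions the paper's arXiv URL anywhere in its README.md would be picked up here; a hypothetical excerpt:

```markdown
<!-- README.md of a hypothetical model repository -->
This model is a reimplementation of DiffiT
([arXiv:2312.02139](https://arxiv.org/abs/2312.02139)).
```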

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2312.02139 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2312.02139 in a Space README.md to link it from this page.

Collections including this paper 7
