FiT: Flexible Vision Transformer for Diffusion Model
Paper: https://huggingface.co/papers/2402.12376 (published 2024-02-19)
Authors: Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai

Comments

**AK** (2024-02-20):

github link: https://github.com/whlzy/FiT

**Edmond Jacoupeau** (2024-03-05):

> github link: https://github.com/whlzy/FiT

Code is not present xD
\n","updatedAt":"2024-03-05T09:41:28.028Z","author":{"_id":"62af665424488e6adfa9b8e2","avatarUrl":"/avatars/2bdb4a26fde4cbe5b4673e53e0d44540.svg","fullname":"Edmond Jacoupeau","name":"edmond","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8501057028770447},"editors":["edmond"],"editorAvatarUrls":["/avatars/2bdb4a26fde4cbe5b4673e53e0d44540.svg"],"reactions":[],"isReport":false,"parentCommentId":"65d48b1d2e5a6964e0699f74"}}]},{"id":"65d55054c81e8c3cffcce646","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2024-02-21T01:22:28.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers](https://huggingface.co/papers/2401.11605) (2024)\n* [Cross-view Masked Diffusion Transformers for Person Image Synthesis](https://huggingface.co/papers/2402.01516) (2024)\n* [Scalable Diffusion Models with State Space Backbone](https://huggingface.co/papers/2402.05608) (2024)\n* [Latte: Latent Diffusion Transformer for Video Generation](https://huggingface.co/papers/2401.03048) (2024)\n* [Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation](https://huggingface.co/papers/2402.10491) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
\n","updatedAt":"2024-06-08T22:27:25.928Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":143}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5212690234184265},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2402.12376","authors":[{"_id":"65d4455bf08891235afc6390","user":{"_id":"6522ab765e7247c29172067e","avatarUrl":"/avatars/ea8d64a50d55456bd034166170a67748.svg","isPro":false,"fullname":"Zeyu Lu","user":"zeyulu","type":"user"},"name":"Zeyu Lu","status":"admin_assigned","statusLastChangedAt":"2024-02-20T09:53:20.006Z","hidden":false},{"_id":"65d4455bf08891235afc6391","user":{"_id":"64b7aa374df206a3ed1947d2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64b7aa374df206a3ed1947d2/Ostk72ehOR6yUX-PhUvyQ.jpeg","isPro":false,"fullname":"wzd","user":"GoodEnough","type":"user"},"name":"Zidong Wang","status":"claimed_verified","statusLastChangedAt":"2024-10-21T16:13:40.719Z","hidden":false},{"_id":"65d4455bf08891235afc6392","user":{"_id":"646dd0654ad7f907279e4e96","avatarUrl":"/avatars/23e3e39f8ffae2f31d6b64cdbb44d47a.svg","isPro":false,"fullname":"dihuang","user":"Kellan3327","type":"user"},"name":"Di Huang","status":"claimed_verified","statusLastChangedAt":"2024-02-20T11:02:17.512Z","hidden":false},{"_id":"65d4455bf08891235afc6393","user":{"_id":"617526c9de8feb54b0ce45ad","avatarUrl":"/avatars/7faf8c6f71fc318a0113d780d376c381.svg","isPro":false,"fullname":"Wu Chengyue","user":"WuChengyue","type":"user"},"name":"Chengyue Wu","status":"admin_assigned","statusLastChangedAt":"2024-02-20T09:54:03.704Z","hidden":false},{"_id":"65d4455bf08891235afc6394","user":{"_id":"65d5ec74cd05bc1eaa125040","avatarUrl":"/avatars/2de1b1539a86452c2c89570eeb02f5ab.svg","isPro":false,"fullname":"Xihui Liu","user":"XihuiLiu","type":"user"},"name":"Xihui Liu","status":"claimed_verified","statusLastChangedAt":"2024-06-03T18:29:12.877Z","hidden":false},{"_id":"65d4455bf08891235afc6395","name":"Wanli Ouyang","hidden":false},{"_id":"65d4455bf08891235afc6396","name":"Lei Bai","hidden":false}],"publishedAt":"2024-02-19T18:59:07.000Z","submittedOnDailyAt":"2024-02-20T03:53:23.847Z","title":"FiT: Flexible Vision Transformer for Diffusion Model","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Nature is infinitely resolution-free. In the context of this reality,\nexisting diffusion models, such as Diffusion Transformers, often face\nchallenges when processing image resolutions outside of their trained domain.\nTo overcome this limitation, we present the Flexible Vision Transformer (FiT),\na transformer architecture specifically designed for generating images with\nunrestricted resolutions and aspect ratios. Unlike traditional methods that\nperceive images as static-resolution grids, FiT conceptualizes images as\nsequences of dynamically-sized tokens. This perspective enables a flexible\ntraining strategy that effortlessly adapts to diverse aspect ratios during both\ntraining and inference phases, thus promoting resolution generalization and\neliminating biases induced by image cropping. 
Enhanced by a meticulously\nadjusted network structure and the integration of training-free extrapolation\ntechniques, FiT exhibits remarkable flexibility in resolution extrapolation\ngeneration. Comprehensive experiments demonstrate the exceptional performance\nof FiT across a broad range of resolutions, showcasing its effectiveness both\nwithin and beyond its training resolution distribution. Repository available at\nhttps://github.com/whlzy/FiT.","upvotes":48,"discussionId":"65d4455bf08891235afc63b9","ai_summary":"The Flexible Vision Transformer adapts to varied image resolutions and aspect ratios through dynamic tokenization and extrapolation techniques, outperforming traditional methods.","ai_keywords":["diffusion models","Diffusion Transformers","Flexible Vision Transformer","transformer architecture","dynamically-sized tokens","resolution generalization","image cropping","training-free extrapolation","resolution extrapolation generation"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"652b83b73b5997ed71a310f2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/652b83b73b5997ed71a310f2/ipCpdeHUp4-0OmRz5z8IW.png","isPro":false,"fullname":"Rui Zhao","user":"ruizhaocv","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"635626a8ec32331b227f407b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/635626a8ec32331b227f407b/KRkAEN3eXN_mYzzJQ_8dO.jpeg","isPro":false,"fullname":"LuZeyu","user":"whlzy","type":"user"},{"_id":"645ac72ec35da9c7afd833cf","avatarUrl":"/avatars/fe15dd9bf42b8fb67236f1f8bad0df53.svg","isPro":false,"fullname":"Jaleel Akinyemi","user":"JAkinyemi","type":"user"},{"_id":"6303fe7beedc089484c73e5b","avatarUrl":"/avatars/9b1ad68461590539d2be50b986dcd67a.svg","isPro":false,"fullname":"Toqi Tahamid Sarker","user":"toqi","type":"user"},{"_id":"64df20dc22d604b137270864","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64df20dc22d604b137270864/C-1_EzY0tnrb-Cyn6lh93.jpeg","isPro":false,"fullname":"TA","user":"AIIAR","type":"user"},{"_id":"646dd0654ad7f907279e4e96","avatarUrl":"/avatars/23e3e39f8ffae2f31d6b64cdbb44d47a.svg","isPro":false,"fullname":"dihuang","user":"Kellan3327","type":"user"},{"_id":"63653b1e25aa3bd177d06f8b","avatarUrl":"/avatars/405eb83d5171f90f6cc00da4d51a28ab.svg","isPro":false,"fullname":"Federico Minutoli","user":"DiTo97","type":"user"},{"_id":"6479f8335f3450e1ded40774","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/iOjE2dcSUS-XYE0WJe6U8.jpeg","isPro":false,"fullname":"Andrei Semenov","user":"Andron00e","type":"user"},{"_id":"62aaaaf55a99fb2669bcd0e3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1655352046059-noauth.jpeg","isPro":false,"fullname":"GaggiX","user":"GaggiX","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"64f3d15b77b0eb97ea1ec8b2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64f3d15b77b0eb97ea1ec8b2/y_3DjdOr5reXzTvHwn-xT.jpeg","isPro":false,"fullname":"Christopher Snyder","user":"csnyder","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">
AI-generated summary

The Flexible Vision Transformer adapts to varied image resolutions and aspect ratios through dynamic tokenization and extrapolation techniques, outperforming traditional methods.

Abstract
Nature is infinitely resolution-free. In the context of this reality,
existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To overcome this limitation, we present the Flexible Vision Transformer (FiT),
a transformer architecture specifically designed for generating images with
unrestricted resolutions and aspect ratios. Unlike traditional methods that
perceive images as static-resolution grids, FiT conceptualizes images as
sequences of dynamically-sized tokens. This perspective enables a flexible
training strategy that effortlessly adapts to diverse aspect ratios during both
training and inference phases, thus promoting resolution generalization and
eliminating biases induced by image cropping.
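
To make the token-sequence view concrete, here is a minimal sketch of how a latent of arbitrary resolution could be flattened into a padded, masked token sequence so that images of different aspect ratios can share a batch. All names and sizes here (patchify_flexible, PATCH_SIZE, MAX_TOKENS) are assumptions for illustration, not the authors' implementation; the repository is the reference.

```python
# Hypothetical sketch of dynamically-sized token sequences; NOT the official
# FiT code. PATCH_SIZE and MAX_TOKENS are assumed values.
import torch

PATCH_SIZE = 2      # assumed latent-patch size
MAX_TOKENS = 256    # assumed training sequence-length budget

def patchify_flexible(latent: torch.Tensor):
    """Turn a (C, H, W) latent of arbitrary H x W into a fixed-length,
    padded token sequence plus a boolean mask, so attention can ignore
    the padding and one batch can mix aspect ratios."""
    c, h, w = latent.shape
    gh, gw = h // PATCH_SIZE, w // PATCH_SIZE
    n = gh * gw
    assert n <= MAX_TOKENS, "image exceeds the token budget"
    # (C, gh, P, gw, P) -> (gh*gw, C*P*P): one token per spatial patch
    tokens = (latent
              .reshape(c, gh, PATCH_SIZE, gw, PATCH_SIZE)
              .permute(1, 3, 0, 2, 4)
              .reshape(n, c * PATCH_SIZE ** 2))
    # pad to a fixed length so arbitrary resolutions batch together
    pad = tokens.new_zeros(MAX_TOKENS - n, tokens.shape[1])
    mask = torch.zeros(MAX_TOKENS, dtype=torch.bool)
    mask[:n] = True
    return torch.cat([tokens, pad], dim=0), mask

# A 2:1 latent and a square latent produce same-shaped outputs:
tok_a, mask_a = patchify_flexible(torch.randn(4, 32, 16))
tok_b, mask_b = patchify_flexible(torch.randn(4, 16, 16))
```
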
Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation
techniques, FiT exhibits remarkable flexibility in resolution extrapolation
generation.
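
The abstract does not spell out which training-free extrapolation techniques are used. One widely used option for RoPE-style position encodings is NTK-aware rescaling of the frequency base when the sampled grid exceeds the training grid; the sketch below illustrates that generic idea under assumed sizes (train_pos, dim) and should not be read as FiT's exact method.

```python
# Generic sketch of training-free positional extrapolation via NTK-aware
# RoPE rescaling; assumed constants, not FiT's published configuration.
import torch

def rope_freqs(num_pos: int, dim: int, train_pos: int = 16,
               base: float = 10000.0) -> torch.Tensor:
    """Complex RoPE rotations for num_pos positions along one grid axis.
    When num_pos exceeds the (assumed) training grid size train_pos, the
    base is enlarged NTK-style so unseen positions fall back into a phase
    range the model has effectively seen, instead of hard extrapolation."""
    if num_pos > train_pos:
        # NTK-aware scaling: stretch the base by the extrapolation ratio
        base *= (num_pos / train_pos) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(num_pos).float(), inv_freq)
    return torch.polar(torch.ones_like(angles), angles)  # e^{i * theta}

# Train on a 16x16 token grid, sample a 32x32 grid without retraining:
freqs_h = rope_freqs(32, dim=64)
freqs_w = rope_freqs(32, dim=64)
```
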
Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both
within and beyond its training resolution distribution. Repository available at
https://github.com/whlzy/FiT.