Demo Space: https://huggingface.co/spaces/YupengZhou/StoryDiffusion
TYPO in Algorithm 1. reshape_featrue -> reshape_feature
\n","updatedAt":"2024-05-03T09:25:23.308Z","author":{"_id":"63b4147f7af2e415f2599659","avatarUrl":"/avatars/7d8989ddefab16d31b377870e56e0550.svg","fullname":"hakkyu kim","name":"HAKKYU","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9402732253074646},"editors":["HAKKYU"],"editorAvatarUrls":["/avatars/7d8989ddefab16d31b377870e56e0550.svg"],"reactions":[{"reaction":"๐","users":["g0ster"],"count":1}],"isReport":false}},{"id":"66358e225824a07240282c84","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2024-05-04T01:23:46.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation](https://huggingface.co/papers/2403.02827) (2024)\n* [From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation](https://huggingface.co/papers/2404.15267) (2024)\n* [OneActor: Consistent Character Generation via Cluster-Conditioned Guidance](https://huggingface.co/papers/2404.10267) (2024)\n* [Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition](https://huggingface.co/papers/2403.14148) (2024)\n* [PoseAnimate: Zero-shot high fidelity pose controllable character animation](https://huggingface.co/papers/2404.13680) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2024-05-04T01:23:46.893Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7235580086708069},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2405.01434","authors":[{"_id":"663443af2adb91d8d9fa2b9a","user":{"_id":"64b54d5799ba6b130245da20","avatarUrl":"/avatars/c5281b17a4d4a565d6d189a007ac934e.svg","isPro":false,"fullname":"zhou","user":"YupengZhou","type":"user"},"name":"Yupeng Zhou","status":"admin_assigned","statusLastChangedAt":"2024-05-03T07:39:20.759Z","hidden":false},{"_id":"663443af2adb91d8d9fa2b9b","user":{"_id":"63fe0b160c1bbe8e29d2dd32","avatarUrl":"/avatars/bc574036287170a77057893efaa48e2d.svg","isPro":false,"fullname":"Zhou","user":"DaQuan21","type":"user"},"name":"Daquan Zhou","status":"admin_assigned","statusLastChangedAt":"2024-05-03T07:39:31.405Z","hidden":false},{"_id":"663443af2adb91d8d9fa2b9c","user":{"_id":"64e496ae0195913c7fa91c66","avatarUrl":"/avatars/23f274f0a3b4ef6ded35205df9bfb564.svg","isPro":false,"fullname":"chengmingming","user":"mingming8688","type":"user"},"name":"Ming-Ming Cheng","status":"admin_assigned","statusLastChangedAt":"2024-05-03T07:39:51.152Z","hidden":false},{"_id":"663443af2adb91d8d9fa2b9d","name":"Jiashi Feng","hidden":false},{"_id":"663443af2adb91d8d9fa2b9e","name":"Qibin Hou","hidden":false}],"publishedAt":"2024-05-02T16:25:16.000Z","submittedOnDailyAt":"2024-05-03T00:23:55.234Z","title":"StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video\n Generation","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"For recent diffusion-based generative models, maintaining consistent content\nacross a series of generated images, especially those containing subjects and\ncomplex details, presents a significant challenge. In this paper, we propose a\nnew way of self-attention calculation, termed Consistent Self-Attention, that\nsignificantly boosts the consistency between the generated images and augments\nprevalent pretrained diffusion-based text-to-image models in a zero-shot\nmanner. To extend our method to long-range video generation, we further\nintroduce a novel semantic space temporal motion prediction module, named\nSemantic Motion Predictor. It is trained to estimate the motion conditions\nbetween two provided images in the semantic spaces. This module converts the\ngenerated sequence of images into videos with smooth transitions and consistent\nsubjects that are significantly more stable than the modules based on latent\nspaces only, especially in the context of long video generation. By merging\nthese two novel components, our framework, referred to as StoryDiffusion, can\ndescribe a text-based story with consistent images or videos encompassing a\nrich variety of contents. 
The proposed StoryDiffusion encompasses pioneering\nexplorations in visual story generation with the presentation of images and\nvideos, which we hope could inspire more research from the aspect of\narchitectural modifications. Our code is made publicly available at\nhttps://github.com/HVision-NKU/StoryDiffusion.","upvotes":56,"discussionId":"663443b32adb91d8d9fa2c32","ai_summary":"StoryDiffusion, combining consistent self-attention and semantic motion prediction, enables generation of coherent and stable images and videos from textual descriptions.","ai_keywords":["diffusion-based generative models","self-attention","Consistent Self-Attention","latent spaces","Semantic Motion Predictor","visual story generation"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"63fe0b160c1bbe8e29d2dd32","avatarUrl":"/avatars/bc574036287170a77057893efaa48e2d.svg","isPro":false,"fullname":"Zhou","user":"DaQuan21","type":"user"},{"_id":"6385d86181fe8c678a345ff0","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6385d86181fe8c678a345ff0/ETvVOtBAQ-RpwfT4TCWsv.jpeg","isPro":false,"fullname":"Yilin Zhao","user":"ermu2001","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"64403d8d7663594a1263fdd4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64403d8d7663594a1263fdd4/9faL_ocHf6W2Jm6vR1zWl.png","isPro":false,"fullname":"Ahmed Khalil","user":"antiquesordo","type":"user"},{"_id":"63c5d43ae2804cb2407e4d43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1673909278097-noauth.png","isPro":false,"fullname":"xziayro","user":"xziayro","type":"user"},{"_id":"65d9903fdceb54d42011a98d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d9903fdceb54d42011a98d/5jnLeCY9sDtS98JyO9qzX.jpeg","isPro":false,"fullname":"meng shao","user":"meng-shao","type":"user"},{"_id":"64ba49cb6f69abce6e00b824","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/DEqT2xTv6Qz7EISAi-yoB.png","isPro":false,"fullname":"Joshua McVay","user":"magejosh11","type":"user"},{"_id":"634b059629bfb40824530d7a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/634b059629bfb40824530d7a/K6mSstDv9g-zFPlvmRiCC.jpeg","isPro":false,"fullname":"Cesar Romero","user":"Zaesar","type":"user"},{"_id":"63ddc7b80f6d2d6c3efe3600","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63ddc7b80f6d2d6c3efe3600/RX5q9T80Jl3tn6z03ls0l.jpeg","isPro":false,"fullname":"J","user":"dashfunnydashdash","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"63a369d98c0c89dcae3b8329","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63a369d98c0c89dcae3b8329/AiH2zjy1cnt9OADAAZMLD.jpeg","isPro":true,"fullname":"Adina Yakefu","user":"AdinaY","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

StoryDiffusion, combining consistent self-attention and semantic motion prediction, enables generation of coherent and stable images and videos from textual descriptions.

Abstract
For recent diffusion-based generative models, maintaining consistent content
across a series of generated images, especially those containing subjects and
complex details, presents a significant challenge. In this paper, we propose a
new self-attention computation, termed Consistent Self-Attention, that
significantly boosts the consistency between the generated images and augments
prevalent pretrained diffusion-based text-to-image models in a zero-shot
manner. To extend our method to long-range video generation, we further
introduce a novel semantic space temporal motion prediction module, named
Semantic Motion Predictor. It is trained to estimate the motion conditions
between two provided images in the semantic space. This module converts the
generated sequence of images into videos with smooth transitions and consistent
subjects, and is significantly more stable than modules that operate only in
latent space, especially for long video generation. By merging
these two novel components, our framework, referred to as StoryDiffusion, can
describe a text-based story with consistent images or videos encompassing a
rich variety of content. StoryDiffusion represents a pioneering exploration of
visual story generation through images and videos, and we hope it inspires
further research on architectural modifications. Our code is publicly available at
https://github.com/HVision-NKU/StoryDiffusion.
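To make the two components above more concrete, here is a minimal, hedged PyTorch sketch of the Consistent Self-Attention idea: each image's keys and values are augmented with a pool of tokens sampled from the other images in the batch, so the images of a story attend to shared subject features. The `to_q`/`to_k`/`to_v` projections, the sampling ratio, and the single-head attention are illustrative assumptions, not the paper's exact implementation (see the official repo for that).

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(hidden_states, to_q, to_k, to_v, sample_ratio=0.5):
    """Hedged sketch of Consistent Self-Attention (not the official implementation).

    hidden_states: (batch, tokens, dim) image tokens for one story, one image per batch entry.
    to_q / to_k / to_v: hypothetical linear projection layers of the attention block.
    """
    b, n, d = hidden_states.shape

    # Randomly sample a fraction of tokens from every image and pool them across the batch.
    num_sampled = max(1, int(n * sample_ratio))
    idx = torch.randperm(n, device=hidden_states.device)[:num_sampled]
    shared = hidden_states[:, idx, :].reshape(1, b * num_sampled, d).expand(b, -1, -1)

    # Queries come from each image alone; keys and values also see the shared token pool,
    # which is what ties subject appearance together across the generated images.
    q = to_q(hidden_states)
    kv_input = torch.cat([hidden_states, shared], dim=1)
    k = to_k(kv_input)
    v = to_v(kv_input)

    # Single-head scaled dot-product attention for brevity; output is (batch, tokens, dim).
    return F.scaled_dot_product_attention(q, k, v)
```

For instance, with `hidden_states` of shape `(4, 1024, 320)` and `torch.nn.Linear(320, 320)` projections, each image's 1024 queries attend to its own 1024 tokens plus a shared pool of 2048 tokens sampled across the batch.

The Semantic Motion Predictor is described only at a high level in the abstract; the sketch below shows one plausible reading, in which the semantic embeddings of two key frames condition a small transformer that predicts embeddings for the intermediate frames. The dimensions, the learned frame queries, and the transformer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictorSketch(nn.Module):
    """Speculative sketch: predict per-frame semantic embeddings between two key frames."""

    def __init__(self, dim=768, num_frames=16, depth=4):
        super().__init__()
        # One learned query per intermediate frame to be predicted.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (batch, dim) semantic embeddings of the two endpoint images,
        # e.g. from a frozen image encoder (an assumption, not specified in the abstract).
        b = start_emb.shape[0]
        queries = self.frame_queries.unsqueeze(0).expand(b, -1, -1)
        # Let the frame queries attend to both endpoints (and to each other) to infer motion.
        tokens = torch.cat([start_emb.unsqueeze(1), queries, end_emb.unsqueeze(1)], dim=1)
        out = self.transformer(tokens)
        # Drop the two endpoint slots; the remaining slots are the predicted intermediate-frame
        # embeddings, which a video decoder could then turn into frames.
        return out[:, 1:-1, :]  # (batch, num_frames, dim)
```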