\n","updatedAt":"2024-07-22T10:37:59.013Z","author":{"_id":"64df3ad6a9bcacc18bc0606a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/s3kpJyOf7NwO-tHEpRcok.png","fullname":"Carlos","name":"Carlosvirella100","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":5}},"numEdits":0,"identifiedLanguage":{"language":"hu","probability":0.19420602917671204},"editors":["Carlosvirella100"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/s3kpJyOf7NwO-tHEpRcok.png"],"reactions":[],"isReport":false}},{"id":"669f080fcca85726b8e16571","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2024-07-23T01:31:59.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [ARTIST: Improving the Generation of Text-rich Images by Disentanglement](https://huggingface.co/papers/2406.12044) (2024)\n* [SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models](https://huggingface.co/papers/2406.01062) (2024)\n* [Improving Text Generation on Images with Synthetic Captions](https://huggingface.co/papers/2406.00505) (2024)\n* [ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models](https://huggingface.co/papers/2405.15199) (2024)\n* [Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation](https://huggingface.co/papers/2406.09305) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [ARTIST: Improving the Generation of Text-rich Images by Disentanglement](https://huggingface.co/papers/2406.12044) (2024)
* [SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models](https://huggingface.co/papers/2406.01062) (2024)
* [Improving Text Generation on Images with Synthetic Captions](https://huggingface.co/papers/2406.00505) (2024)
* [ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models](https://huggingface.co/papers/2405.15199) (2024)
* [Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation](https://huggingface.co/papers/2406.09305) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2024-07-23T01:31:59.714Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6879870891571045},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2407.14138","authors":[{"_id":"669dc4af811696bba807562c","user":{"_id":"627d2723401f42c57b6b7c0c","avatarUrl":"/avatars/6ff754e56aaee63d8572881a6a966171.svg","isPro":false,"fullname":"Yuanzhi Zhu","user":"Yuanzhi","type":"user"},"name":"Yuanzhi Zhu","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:06:34.349Z","hidden":true},{"_id":"669dc4af811696bba807562d","user":{"_id":"64230048a73327caad9d8241","avatarUrl":"/avatars/ac115f1a21c743db7c925f0e18451145.svg","isPro":false,"fullname":"Jiawei Liu","user":"JiaweiLIU","type":"user"},"name":"Jiawei Liu","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:06:52.161Z","hidden":false},{"_id":"669dc4af811696bba807562e","name":"Feiyu Gao","hidden":false},{"_id":"669dc4af811696bba807562f","name":"Wenyu Liu","hidden":false},{"_id":"669dc4af811696bba8075630","user":{"_id":"62600de6d47e3dbae32ce1ce","avatarUrl":"/avatars/a536417cfec6e10ac415091bd1829426.svg","isPro":false,"fullname":"Xinggang Wang","user":"xinggangw","type":"user"},"name":"Xinggang Wang","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:10:14.794Z","hidden":false},{"_id":"669dc4af811696bba8075631","name":"Peng Wang","hidden":false},{"_id":"669dc4af811696bba8075632","user":{"_id":"635b8b6a37c6a2c12e2cce00","avatarUrl":"/avatars/229fb72180529141515d1df797b33709.svg","isPro":false,"fullname":"Fei Huang","user":"hzhwcmhf","type":"user"},"name":"Fei Huang","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:11:58.528Z","hidden":false},{"_id":"669dc4af811696bba8075633","user":{"_id":"6638f898c168eec02354abeb","avatarUrl":"/avatars/0a0b7a1b056463fb4ccf6bfe6456a7d1.svg","isPro":false,"fullname":"Cong Yao","user":"bridgetop3young","type":"user"},"name":"Cong Yao","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:12:05.099Z","hidden":false},{"_id":"669dc4af811696bba8075634","user":{"_id":"66051de91b8220d71becb448","avatarUrl":"/avatars/f2b507f5821f8baa90b60f62ae4de389.svg","isPro":false,"fullname":"杨智博","user":"zhiboyang","type":"user"},"name":"Zhibo Yang","status":"admin_assigned","statusLastChangedAt":"2024-07-22T10:12:13.078Z","hidden":false}],"publishedAt":"2024-07-19T09:08:20.000Z","submittedOnDailyAt":"2024-07-22T01:02:29.461Z","title":"Visual Text Generation in the Wild","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Recently, with the rapid advancements of generative models, the field of\nvisual text generation has witnessed significant progress. 
However, it is still\nchallenging to render high-quality text images in real-world scenarios, as\nthree critical criteria should be satisfied: (1) Fidelity: the generated text\nimages should be photo-realistic and the contents are expected to be the same\nas specified in the given conditions; (2) Reasonability: the regions and\ncontents of the generated text should cohere with the scene; (3) Utility: the\ngenerated text images can facilitate related tasks (e.g., text detection and\nrecognition). Upon investigation, we find that existing methods, either\nrendering-based or diffusion-based, can hardly meet all these aspects\nsimultaneously, limiting their application range. Therefore, we propose in this\npaper a visual text generator (termed SceneVTG), which can produce high-quality\ntext images in the wild. Following a two-stage paradigm, SceneVTG leverages a\nMultimodal Large Language Model to recommend reasonable text regions and\ncontents across multiple scales and levels, which are used by a conditional\ndiffusion model as conditions to generate text images. Extensive experiments\ndemonstrate that the proposed SceneVTG significantly outperforms traditional\nrendering-based methods and recent diffusion-based methods in terms of fidelity\nand reasonability. Besides, the generated images provide superior utility for\ntasks involving text detection and text recognition. Code and datasets are\navailable at AdvancedLiterateMachinery.","upvotes":9,"discussionId":"669dc4b1811696bba80756b0","ai_summary":"SceneVTG combines a Multimodal Large Language Model and a conditional diffusion model to generate high-quality, photo-realistic, and contextually coherent text images, surpassing existing methods in fidelity, reasonability, and utility for text detection and recognition.","ai_keywords":["Multimodal Large Language Model","conditional diffusion model"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"63470b9f3ea42ee2cb4f3279","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/Xv8-IxM4GYM91IUOkRnCG.png","isPro":false,"fullname":"NG","user":"SirRa1zel","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"65ba471ad88a65abb9328ee2","avatarUrl":"/avatars/956238ce5034091e64d026b0272c4400.svg","isPro":false,"fullname":"Dazhi Jiang","user":"thuzhizhi","type":"user"},{"_id":"6514107324b8829bb6e2ab35","avatarUrl":"/avatars/5dcdb916d5d6eee79a4c55eec001b1df.svg","isPro":false,"fullname":"shuo zhang","user":"shuozhang2","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"654b547a21672d7c20105f7e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/654b547a21672d7c20105f7e/Zhax-YCu-UieQIkt2xSou.png","isPro":false,"fullname":"gasanov marat","user":"ggassannovv","type":"user"},{"_id":"633e570be7d5ce7bfe037a53","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/633e570be7d5ce7bfe037a53/zV8ULv4Mu7YIGZ8D3JtmK.jpeg","isPro":false,"fullname":"Zhaocheng 
Liu","user":"zhaocheng","type":"user"},{"_id":"60c8d264224e250fb0178f77","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60c8d264224e250fb0178f77/i8fbkBVcoFeJRmkQ9kYAE.png","isPro":true,"fullname":"Adam Lee","user":"Abecid","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

SceneVTG combines a Multimodal Large Language Model and a conditional diffusion model to generate high-quality, photo-realistic, and contextually coherent text images, surpassing existing methods in fidelity, reasonability, and utility for text detection and recognition.

Abstract
Recently, with the rapid advancement of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria must be satisfied: (1) Fidelity: the generated text images should be photo-realistic, and their contents should match those specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition).
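As a hedged illustration of the fidelity criterion, the sketch below scores generated text regions by reading them back with a recognizer and comparing the result against the intended contents via normalized edit distance. The `recognize` function is a hypothetical stand-in for any off-the-shelf OCR model (stubbed here so the example runs); this is not the paper's actual evaluation code.

```python
# Minimal sketch: scoring "fidelity" as read-back accuracy of intended text.
# `recognize` is a hypothetical placeholder for an off-the-shelf OCR model;
# it is stubbed so the example runs end to end.

def recognize(image_region) -> str:
    """Stub for an OCR model that reads text from a cropped image region."""
    return image_region  # pretend the region already is its text

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fidelity_score(regions, intended_texts) -> float:
    """Returns 1.0 when every region reads back exactly as specified."""
    total = sum(edit_distance(recognize(r), t) / max(len(t), 1)
                for r, t in zip(regions, intended_texts))
    return 1.0 - total / max(len(intended_texts), 1)

print(fidelity_score(["OPEN 24 HOURS"], ["OPEN 24 HOURS"]))  # -> 1.0
```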
Upon investigation, we find that existing methods, whether rendering-based or diffusion-based, can hardly satisfy all three criteria simultaneously, which limits their range of application. In this paper, we therefore propose a visual text generator, termed SceneVTG, that can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which a conditional diffusion model then uses as conditions to generate text images. Extensive experiments demonstrate that SceneVTG significantly outperforms both traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Moreover, the generated images provide superior utility for downstream tasks such as text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
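To make the two-stage paradigm concrete, here is a minimal sketch of the data flow under stated assumptions: `recommend_regions` stands in for the MLLM stage and `render_text` for the conditional-diffusion stage. The function names, the `TextRegion` format, and the stubbed bodies are illustrative assumptions, not the authors' API; the real implementation lives in the AdvancedLiterateMachinery repository.

```python
# Illustrative sketch of the two-stage paradigm described in the abstract.
# All names and formats are assumptions for illustration, not SceneVTG's API.

from dataclasses import dataclass
from typing import List

@dataclass
class TextRegion:
    bbox: tuple   # (x0, y0, x1, y1) in image coordinates
    text: str     # content recommended for this region
    level: str    # e.g. "line" or "word" (multi-scale, multi-level output)

def recommend_regions(scene_description: str) -> List[TextRegion]:
    """Stage 1 (hypothetical): an MLLM proposes where text should appear and
    what it should say, so regions and contents cohere with the scene."""
    # Stub standing in for an MLLM call.
    return [TextRegion(bbox=(40, 300, 220, 340), text="OPEN 24 HOURS", level="line")]

def render_text(scene_description: str, regions: List[TextRegion]) -> str:
    """Stage 2 (hypothetical): a conditional diffusion model renders the text
    into the scene, conditioned on the recommended regions and contents."""
    # Stub standing in for a diffusion sampling loop.
    return f"image of '{scene_description}' with {len(regions)} text region(s) rendered"

if __name__ == "__main__":
    scene = "storefront at night"
    regions = recommend_regions(scene)   # stage 1: layout + content proposal
    print(render_text(scene, regions))   # stage 2: conditional generation
```

The design point the sketch tries to capture is the separation of concerns: stage one reasons about *where* and *what* (a language-model task), while stage two handles *how it looks* (an image-synthesis task conditioned on stage one's output).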