
https://polyai-ldn.github.io/pheme/

arxiv:2401.02839

Pheme: Efficient and Conversational Speech Generation

Published on Jan 5, 2024
· Submitted by AK on Jan 8, 2024

Abstract

The Pheme model series achieves compact and high-quality voice generation with parallel processing, efficient training on smaller datasets, and improved voice quality through distillation.

AI-generated summary

In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. In order to mitigate the current limitations of the state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series that 1) offers compact yet high-performing models, 2) allows for parallel speech generation of 3) natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of the autoregressive TTS models. We also show that through simple teacher-student distillation we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.
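The teacher-student distillation mentioned above can be illustrated with a toy sketch: a large frozen "teacher" model generates synthetic targets, and a compact student is fit to reproduce them. This is a hypothetical illustration only, not the paper's implementation (the actual Pheme models predict discrete acoustic codes); here both models are stand-in linear maps over random toy features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "text features" for 256 utterances, 8-dimensional.
X = rng.normal(size=(256, 8))

# Frozen teacher: a fixed linear map standing in for a large pretrained TTS.
W_teacher = rng.normal(size=(8, 4))
Y_synthetic = X @ W_teacher  # synthetic "speech" targets from the teacher

# Compact student, trained only on the teacher's synthetic outputs --
# no ground-truth human recordings are used, as in the distillation setup.
W_student = np.zeros((8, 4))
lr = 0.01
losses = []
for _ in range(200):
    pred = X @ W_student
    err = pred - Y_synthetic
    losses.append(float(np.mean(err ** 2)))
    W_student -= lr * (X.T @ err) / len(X)  # gradient step on the MSE

print(f"distillation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The point of the sketch is that the student's training signal comes entirely from teacher-generated data, which is why much larger teacher models can lift the quality of a small pretrained checkpoint.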

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Audiobox: Unified Audio Generation with Natural Language Prompts (2023)
- Boosting Large Language Model for Speech Synthesis: An Empirical Study (2023)
- Efficient Parallel Audio Generation using Group Masked Language Modeling (2024)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations (2023)
- ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations (2023)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2401.02839 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 3

Collections including this paper 2
