\n","updatedAt":"2025-06-21T01:34:32.148Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6849551200866699},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2506.09827","authors":[{"_id":"685519bb4f1add9d4c5c5cbd","user":{"_id":"61a24fc72101184cfb29c965","avatarUrl":"/avatars/e32aa61016caef50de28c16b30196799.svg","isPro":false,"fullname":"Christoph Schuhmann","user":"ChristophSchuhmann","type":"user"},"name":"Christoph Schuhmann","status":"extracted_confirmed","statusLastChangedAt":"2025-06-20T10:22:26.676Z","hidden":false},{"_id":"685519bb4f1add9d4c5c5cbe","name":"Robert Kaczmarczyk","hidden":false},{"_id":"685519bb4f1add9d4c5c5cbf","user":{"_id":"64ac21f11cacea8d4b8f2b3f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ac21f11cacea8d4b8f2b3f/asQOf8wFZ4vmqIeyxfvUR.jpeg","isPro":false,"fullname":"Gollam Rabby","user":"tourist800","type":"user"},"name":"Gollam Rabby","status":"admin_assigned","statusLastChangedAt":"2025-06-20T12:29:47.779Z","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc0","user":{"_id":"62e7dd4036a8e8a82700041c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e7dd4036a8e8a82700041c/Dgk9mXYLVd4LpiNLWjn-q.jpeg","isPro":false,"fullname":"Felix Friedrich","user":"felfri","type":"user"},"name":"Felix Friedrich","status":"claimed_verified","statusLastChangedAt":"2025-06-20T08:57:36.090Z","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc1","name":"Maurice Kraus","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc2","name":"Kourosh Nadi","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc3","user":{"_id":"5fc6879e1c5ee87b1164876d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5fc6879e1c5ee87b1164876d/Tjnm_lv0Bq0gPbFOTDH6E.jpeg","isPro":false,"fullname":"Huu Nguyen","user":"huu-ontocord","type":"user"},"name":"Huu Nguyen","status":"claimed_verified","statusLastChangedAt":"2025-06-23T08:15:57.375Z","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc4","name":"Kristian Kersting","hidden":false},{"_id":"685519bb4f1add9d4c5c5cc5","user":{"_id":"62cd8f74342b1d5dab8da3a6","avatarUrl":"/avatars/51c237653aadc98c73df207d9d054597.svg","isPro":false,"fullname":"Sören Auer","user":"soeren1611","type":"user"},"name":"Sören Auer","status":"admin_assigned","statusLastChangedAt":"2025-06-20T12:30:06.624Z","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/62e7dd4036a8e8a82700041c/tvkMYrIKiGhbIVAuaxvlt.png"],"publishedAt":"2025-06-11T15:06:59.000Z","submittedOnDailyAt":"2025-06-20T06:53:47.262Z","title":"EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech\n Emotion Detection","submittedOnDailyBy":{"_id":"62e7dd4036a8e8a82700041c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e7dd4036a8e8a82700041c/Dgk9mXYLVd4LpiNLWjn-q.jpeg","isPro":false,"fullname":"Felix Friedrich","user":"felfri","type":"user"},"summary":"The advancement of text-to-speech and audio generation models necessitates\nrobust benchmarks for evaluating the emotional understanding capabilities of 
AI\nsystems. Current speech emotion recognition (SER) datasets often exhibit\nlimitations in emotional granularity, privacy concerns, or reliance on acted\nportrayals. This paper introduces EmoNet-Voice, a new resource for speech\nemotion detection, which includes EmoNet-Voice Big, a large-scale pre-training\ndataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions,\nand 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human\nexpert annotations. EmoNet-Voice is designed to evaluate SER models on a\nfine-grained spectrum of 40 emotion categories with different levels of\nintensities. Leveraging state-of-the-art voice generation, we curated synthetic\naudio snippets simulating actors portraying scenes designed to evoke specific\nemotions. Crucially, we conducted rigorous validation by psychology experts who\nassigned perceived intensity labels. This synthetic, privacy-preserving\napproach allows for the inclusion of sensitive emotional states often absent in\nexisting datasets. Lastly, we introduce Empathic Insight Voice models that set\na new standard in speech emotion recognition with high agreement with human\nexperts. Our evaluations across the current model landscape exhibit valuable\nfindings, such as high-arousal emotions like anger being much easier to detect\nthan low-arousal states like concentration.","upvotes":18,"discussionId":"685519bb4f1add9d4c5c5cc6","ai_summary":"EmoNet-Voice, a new resource with large pre-training and benchmark datasets, advances speech emotion recognition by offering fine-grained emotion evaluation with synthetic, privacy-preserving audio.","ai_keywords":["speech emotion recognition","SER","EmoNet-Voice","EmoNet-Voice Big","EmoNet-Voice Bench","human expert annotations","synthetic audio snippets","psychology experts","high-arousal emotions","low-arousal states","Empathic Insight Voice models"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"62e7dd4036a8e8a82700041c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e7dd4036a8e8a82700041c/Dgk9mXYLVd4LpiNLWjn-q.jpeg","isPro":false,"fullname":"Felix Friedrich","user":"felfri","type":"user"},{"_id":"635ba0c637c6a2c12e2daef9","avatarUrl":"/avatars/9fc2932d9ace2715f540f896754ec7d2.svg","isPro":false,"fullname":"Ollie McCarthy","user":"ollieollie","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"64273c8c5bca6f17a3187617","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/YbUO2zchO3ZTL2hViKzlQ.jpeg","isPro":false,"fullname":"Luan Dopke","user":"luandopke","type":"user"},{"_id":"61fc3f7a87117e8015dd1166","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61fc3f7a87117e8015dd1166/v8D6S9bh0BS88BHajQPb6.png","isPro":false,"fullname":"Marian 
Basti","user":"marianbasti","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"},{"_id":"62e54f0eae9d3f10acb95cb9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e54f0eae9d3f10acb95cb9/VAyk05hqB3OZWXEZW-B0q.png","isPro":true,"fullname":"mrfakename","user":"mrfakename","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"63e749fbe02ee67e8e4e9101","avatarUrl":"/avatars/8b4406f5831df17eccb9bf18a97afb90.svg","isPro":false,"fullname":"Jeff Gao","user":"jeff-gao","type":"user"},{"_id":"67a3446691b7a8e1b11c1cb7","avatarUrl":"/avatars/46e7d3629d81d1b231f1df7bf6eef273.svg","isPro":false,"fullname":"Umair Abbas","user":"mianumairsiddiquie","type":"user"},{"_id":"66f85f4ab720048bbc5487e1","avatarUrl":"/avatars/384196d5676e9c8d8970e5b84914e328.svg","isPro":false,"fullname":"Ali Ihsan Nergiz","user":"nergizai","type":"user"},{"_id":"5fc6879e1c5ee87b1164876d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5fc6879e1c5ee87b1164876d/Tjnm_lv0Bq0gPbFOTDH6E.jpeg","isPro":false,"fullname":"Huu Nguyen","user":"huu-ontocord","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3}">
AI-generated summary: EmoNet-Voice, a new resource with large pre-training and benchmark datasets, advances speech emotion recognition by offering fine-grained emotion evaluation with synthetic, privacy-preserving audio.
The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories at different levels of intensity. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape yield valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.
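To make the evaluation setup concrete, the sketch below shows how a benchmark of this kind could be loaded with the Hugging Face `datasets` library and used to score a SER model against expert perceived-intensity labels. The repository id `laion/EmoNet-Voice-Bench`, the column names `audio`, `emotion`, and `intensity`, and the three-level intensity scale are illustrative assumptions, not details confirmed by the abstract.

```python
# Minimal sketch (not the authors' code): loading an EmoNet-Voice-style benchmark
# and measuring exact agreement between a SER model and expert intensity labels.
# The dataset id and column names below are illustrative assumptions.
from datasets import Audio, load_dataset

bench = load_dataset("laion/EmoNet-Voice-Bench", split="test")  # hypothetical repo id
bench = bench.cast_column("audio", Audio(sampling_rate=16_000))  # decode audio to 16 kHz arrays

def predict_intensity(waveform, emotion: str) -> int:
    """Stand-in for a SER model: return a perceived-intensity rating for `emotion`
    (assumed scale: 0 = not present, 1 = mild, 2 = intense)."""
    return 0  # trivial baseline; replace with a real model's prediction

matches = 0
for example in bench:
    pred = predict_intensity(example["audio"]["array"], example["emotion"])
    matches += int(pred == example["intensity"])

print(f"Exact agreement with expert labels: {matches / len(bench):.3f}")
```

Reported per-emotion, such an agreement score would also surface the arousal effect mentioned above, since high-arousal categories like anger tend to be recognized more reliably than low-arousal states like concentration.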
An exciting frontier in technology today is the quest for artificial intelligence that truly understands and interacts with humans on a deeper level. While AI has made remarkable progress in language processing and complex problem-solving, one critical dimension has yet to be fully realized: true emotional intelligence.
Can our AI systems perceive the subtle joy in a crinkled eye, the faint tremor of anxiety in a voice, or the complex blend of emotions that color our everyday interactions? We believe this is not just a fascinating academic pursuit but a fundamental necessity for the future of human-AI collaboration.
Today, we're proud to release EmoNet – a suite of new, open and freely available models and tools designed to support global research and innovation in the emerging field of emotionally intelligent AI. Our contributions are multi-faceted, addressing critical gaps in current research and providing powerful new tools for the global AI community.