demo: https://talkpl-ai.github.io/talkplay-demo/#
dataset: https://huggingface.co/datasets/talkpl-ai/talkplay-db-v1

arxiv:2502.13713

TALKPLAY: Multimodal Music Recommendation with Large Language Models

Published on Feb 19, 2025

Authors:

Seungheon Doh, Keunwoo Choi, Juhan Nam

Abstract

TalkPlay reformulates music recommendations as a large language model token generation task, achieving high performance by leveraging multimodal music representations and end-to-end learning.

AI-generated summary

We present TalkPlay, a multimodal music recommendation system that reformulates the recommendation task as large language model token generation. TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities: audio, lyrics, metadata, semantic tags, and playlist co-occurrence. Using these rich representations, the model learns to generate recommendations through next-token prediction on music recommendation conversations, which requires learning the associations between natural language queries and responses, as well as music items. In other words, the formulation transforms music recommendation into a natural language understanding task, where the model's ability to predict conversation tokens directly optimizes query-item relevance. Our approach eliminates the complexity of traditional recommendation-dialogue pipelines, enabling end-to-end learning of query-aware music recommendations. In experiments, TalkPlay is successfully trained and outperforms baseline methods in various aspects, demonstrating strong context understanding as a conversational music recommender.
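
To make the formulation concrete, the following is a minimal sketch of how a recommendation conversation could be flattened into a single token sequence over an expanded vocabulary, and how next-token training pairs follow from it. It is not the authors' implementation: the token naming scheme (`<audio_2>` and so on), the choice of one token per modality per track, the vocabulary sizes, and the helper names `build_vocab`, `item_tokens`, and `next_token_pairs` are all assumptions made for illustration.

```python
# Toy illustration only: token names, vocabulary sizes, and the
# conversation format are assumptions, not TalkPlay's actual design.
MODALITIES = ["audio", "lyrics", "metadata", "tags", "playlist"]


def build_vocab(words, codes_per_modality):
    """Map ordinary words plus per-modality music tokens to integer IDs."""
    vocab = {word: idx for idx, word in enumerate(words)}
    for modality in MODALITIES:
        for code in range(codes_per_modality):
            vocab[f"<{modality}_{code}>"] = len(vocab)
    return vocab


def item_tokens(codes):
    """Represent one track as one (hypothetical) token per modality."""
    return [f"<{m}_{codes[m]}>" for m in MODALITIES]


def next_token_pairs(token_ids):
    """Pair each prefix with the token that follows it (the training target)."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


words = ["<user>", "<assistant>", "play", "something", "calm", "try", "this"]
vocab = build_vocab(words, codes_per_modality=4)

# A made-up track, described by one code index per modality.
track = {"audio": 2, "lyrics": 0, "metadata": 3, "tags": 1, "playlist": 2}
conversation = (
    ["<user>", "play", "something", "calm"]
    + ["<assistant>", "try", "this"]
    + item_tokens(track)
)
token_ids = [vocab[token] for token in conversation]
pairs = next_token_pairs(token_ids)
```

Because the music tokens share one vocabulary with the text tokens, predicting the track after the query is the same next-token objective as predicting any other word, which is the sense in which the abstract says conversation-token prediction directly optimizes query-item relevance.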