
🌐Project Page: https://sais-fuxi.github.io/projects/cockatiel/
📖Paper: https://arxiv.org/abs/2503.09279
💥Code: https://github.com/Fr0zenCrane/Cockatiel
🤗Captioner Model: https://huggingface.co/Fr0zencr4nE/Cockatiel-13B
🤗Scorer Model: https://huggingface.co/Fr0zencr4nE/Cockatiel-Scorer
🤗Dataset: https://huggingface.co/datasets/Fr0zencr4nE/Cockatiel-4K
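
For convenience, the released checkpoints and dataset above can be pulled locally with the huggingface_hub client. The following is a minimal sketch using the repo IDs linked above; it assumes the standard Hub repository layout and is not the official setup script (see the GitHub repository for the supported workflow).

# Minimal sketch: download the Cockatiel checkpoints and dataset from the Hub.
# Assumes `pip install huggingface_hub`; repo IDs are taken from the links above.
from huggingface_hub import snapshot_download

captioner_dir = snapshot_download(repo_id="Fr0zencr4nE/Cockatiel-13B")    # captioner weights
scorer_dir = snapshot_download(repo_id="Fr0zencr4nE/Cockatiel-Scorer")    # caption scorer weights
dataset_dir = snapshot_download(
    repo_id="Fr0zencr4nE/Cockatiel-4K",
    repo_type="dataset",                                                  # human-annotated dataset
)

print("Checkpoints:", captioner_dir, scorer_dir)
print("Dataset:", dataset_dir)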

\n","updatedAt":"2025-03-17T02:16:46.428Z","author":{"_id":"66a9b3533d417b0baa9220a6","avatarUrl":"/avatars/adc372bd24df1d3bf43258833411e8af.svg","fullname":"Luozheng Qin","name":"Fr0zencr4nE","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.35154101252555847},"editors":["Fr0zencr4nE"],"editorAvatarUrls":["/avatars/adc372bd24df1d3bf43258833411e8af.svg"],"reactions":[],"isReport":false}},{"id":"67d8cdd047c260e46b48d0c0","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264},"createdAt":"2025-03-18T01:35:12.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models](https://huggingface.co/papers/2502.15393) (2025)\n* [Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning](https://huggingface.co/papers/2503.07906) (2025)\n* [Fine-Grained Video Captioning through Scene Graph Consolidation](https://huggingface.co/papers/2502.16427) (2025)\n* [Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos](https://huggingface.co/papers/2502.21314) (2025)\n* [Pretrained Image-Text Models are Secretly Video Captioners](https://huggingface.co/papers/2502.13363) (2025)\n* [VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation](https://huggingface.co/papers/2502.12782) (2025)\n* [MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation](https://huggingface.co/papers/2502.01719) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-03-18T01:35:12.622Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":264}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.708280622959137},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.09279","authors":[{"_id":"67d2bd340860f2d7ff10e3dc","user":{"_id":"66a9b3533d417b0baa9220a6","avatarUrl":"/avatars/adc372bd24df1d3bf43258833411e8af.svg","isPro":false,"fullname":"Luozheng Qin","user":"Fr0zencr4nE","type":"user"},"name":"Luozheng Qin","status":"admin_assigned","statusLastChangedAt":"2025-03-17T08:56:11.189Z","hidden":false},{"_id":"67d2bd340860f2d7ff10e3dd","name":"Zhiyu Tan","hidden":false},{"_id":"67d2bd340860f2d7ff10e3de","user":{"_id":"6304d630dae2eb7d084148c7","avatarUrl":"/avatars/7d7a6ca99334bdae3ed1752ff40a8d94.svg","isPro":false,"fullname":"mengping yang","user":"Kobeshegu","type":"user"},"name":"Mengping Yang","status":"admin_assigned","statusLastChangedAt":"2025-03-17T08:56:25.200Z","hidden":false},{"_id":"67d2bd340860f2d7ff10e3df","user":{"_id":"658ea92268d0b7633176b4ed","avatarUrl":"/avatars/40173c9126dccfe78bc46b12c6ced8c8.svg","isPro":false,"fullname":"xiaomeng yang","user":"xiaomengyang","type":"user"},"name":"Xiaomeng Yang","status":"admin_assigned","statusLastChangedAt":"2025-03-17T08:56:33.077Z","hidden":false},{"_id":"67d2bd340860f2d7ff10e3e0","name":"Hao Li","hidden":false}],"publishedAt":"2025-03-12T11:25:04.000Z","submittedOnDailyAt":"2025-03-17T00:46:46.368Z","title":"Cockatiel: Ensembling Synthetic and Human Preferenced Training for\n Detailed Video Caption","submittedOnDailyBy":{"_id":"66a9b3533d417b0baa9220a6","avatarUrl":"/avatars/adc372bd24df1d3bf43258833411e8af.svg","isPro":false,"fullname":"Luozheng Qin","user":"Fr0zencr4nE","type":"user"},"summary":"Video Detailed Captioning (VDC) is a crucial task for vision-language\nbridging, enabling fine-grained descriptions of complex video content. In this\npaper, we first comprehensively benchmark current state-of-the-art approaches\nand systematically identified two critical limitations: biased capability\ntowards specific captioning aspect and misalignment with human preferences. To\naddress these deficiencies, we propose Cockatiel, a novel three-stage training\npipeline that ensembles synthetic and human-aligned training for improving VDC\nperformance. In the first stage, we derive a scorer from a meticulously\nannotated dataset to select synthetic captions high-performing on certain\nfine-grained video-caption alignment and human-preferred while disregarding\nothers. Then, we train Cockatiel-13B, using this curated dataset to infuse it\nwith assembled model strengths and human preferences. Finally, we further\ndistill Cockatiel-8B from Cockatiel-13B for the ease of usage. 
Extensive\nquantitative and qualitative experiments reflect the effectiveness of our\nmethod, as we not only set new state-of-the-art performance on VDCSCORE in a\ndimension-balanced way but also surpass leading alternatives on human\npreference by a large margin as depicted by the human evaluation results.","upvotes":5,"discussionId":"67d2bd370860f2d7ff10e4da","githubRepo":"https://github.com/Fr0zenCrane/Cockatiel","ai_summary":"A three-stage training pipeline combining synthetic and human-aligned data improves Video Detailed Captioning performance, outperforming existing methods in both quantitative metrics and human preference.","ai_keywords":["video detailed captioning","vision-language bridging","captioning aspect","human preferences","three-stage training pipeline","synthetic captions","human-aligned training","VDCSCORE","model strengths","human evaluation"],"githubStars":37},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66a9b3533d417b0baa9220a6","avatarUrl":"/avatars/adc372bd24df1d3bf43258833411e8af.svg","isPro":false,"fullname":"Luozheng Qin","user":"Fr0zencr4nE","type":"user"},{"_id":"66715d32ce46539c1a82e989","avatarUrl":"/avatars/3be3f5a8bd31588e8811610fe63d1d0e.svg","isPro":false,"fullname":"Fudan-FUXI","user":"Fudan-FUXI","type":"user"},{"_id":"637c7503fe115289cfecbe6b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1676361945047-637c7503fe115289cfecbe6b.jpeg","isPro":false,"fullname":"Wenhao Chai","user":"wchai","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2503.09279

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Published on Mar 12, 2025 · Submitted by Luozheng Qin on Mar 17, 2025
Authors: Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
Abstract

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that score highly on fine-grained video-caption alignment and human preference, discarding the rest. We then train Cockatiel-13B on this curated dataset to infuse it with the assembled models' strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way, but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.

AI-generated summary

A three-stage training pipeline combining synthetic and human-aligned data improves Video Detailed Captioning performance, outperforming existing methods in both quantitative metrics and human preference.
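
To make the first, curation stage concrete, the sketch below shows the general pattern the abstract describes: an ensemble of captioners proposes captions for each clip, the scorer rates each video-caption pair on fine-grained alignment and human preference, and only the best-scoring caption per clip is kept for training. The function names and the threshold are illustrative placeholders, not the released Cockatiel API.

# Illustrative sketch of the stage-one caption curation described above
# (hypothetical interfaces, not the official Cockatiel code).
from typing import Callable, List, Optional, Tuple

Captioner = Callable[[str], str]        # video path -> synthetic caption
Scorer = Callable[[str, str], float]    # (video path, caption) -> quality score

def curate_captions(
    videos: List[str],
    captioners: List[Captioner],
    scorer: Scorer,
    min_score: float = 0.7,             # assumed quality threshold
) -> List[Tuple[str, str]]:
    """Keep, for each video, the highest-scoring synthetic caption from the
    ensemble; drop clips whose best caption still falls below the threshold."""
    curated: List[Tuple[str, str]] = []
    for video in videos:
        best: Optional[Tuple[str, float]] = None
        for captioner in captioners:
            caption = captioner(video)
            score = scorer(video, caption)
            if best is None or score > best[1]:
                best = (caption, score)
        if best is not None and best[1] >= min_score:
            curated.append((video, best[0]))
    return curated

The resulting (video, caption) pairs would then serve as the fine-tuning data for Cockatiel-13B, from which Cockatiel-8B is subsequently distilled.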

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models (2025) - https://huggingface.co/papers/2502.15393
* Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning (2025) - https://huggingface.co/papers/2503.07906
* Fine-Grained Video Captioning through Scene Graph Consolidation (2025) - https://huggingface.co/papers/2502.16427
* Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos (2025) - https://huggingface.co/papers/2502.21314
* Pretrained Image-Text Models are Secretly Video Captioners (2025) - https://huggingface.co/papers/2502.13363
* VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation (2025) - https://huggingface.co/papers/2502.12782
* MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation (2025) - https://huggingface.co/papers/2502.01719

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


