PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Abstract
A fully transparent Perception Language Model (PLM) for image and video understanding is developed without relying on proprietary models, using large-scale synthetic and human-labeled data and introducing PLM-VideoBench for evaluation.
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models.
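As a minimal sketch of how one might load the released video QA data with the Hugging Face `datasets` library, assuming the release is published as a standard dataset repo (the repo ID `facebook/PLM-Video-Human` below is an assumption for illustration; check the dataset link in the Community section for the authoritative names and configurations):

```python
# Minimal sketch: loading the released PLM video QA data via the Hugging Face
# datasets library. The repo ID and split are assumptions for illustration;
# consult https://ai.meta.com/datasets/plm-data/ for the authoritative
# dataset names and configurations.
from datasets import load_dataset

# Hypothetical repo ID for the fine-grained video question-answer pairs.
plm_video_qa = load_dataset("facebook/PLM-Video-Human", split="train")

# Each record is expected to pair a video reference with a question and answer.
print(plm_video_qa[0])
```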
Community
Datasets: https://ai.meta.com/datasets/plm-data/
Blog: https://ai.meta.com/blog/meta-fair-updates-perception-localization-reasoning/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- [TULIP: Towards Unified Language-Image Pretraining](https://huggingface.co/papers/2503.15485) (2025)
- [Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding](https://huggingface.co/papers/2504.10465) (2025)
- [How Can Objects Help Video-Language Understanding?](https://huggingface.co/papers/2504.07454) (2025)
- [On the Limitations of Vision-Language Models in Understanding Image Transforms](https://huggingface.co/papers/2503.09837) (2025)
- [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://huggingface.co/papers/2502.14786) (2025)
- [A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models](https://huggingface.co/papers/2502.13942) (2025)
- [FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding](https://huggingface.co/papers/2503.14935) (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
A video & written summary - https://aipapersacademy.com/perception-language-models/
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/perceptionlm-open-access-data-and-models-for-detailed-visual-understanding
Models citing this paper: 7
Datasets citing this paper: 4
Spaces citing this paper: 0