arxiv:2410.23218

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Published on Oct 30, 2024 · Submitted by Zhiyong Wu on Nov 4, 2024
#1 Paper of the day
Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

AI-generated summary

OS-Atlas improves GUI agent performance through a large open-source GUI grounding dataset and model innovations.

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
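
The core task OS-Atlas is built around, GUI grounding, maps a screenshot and a natural-language instruction to the coordinates of the on-screen element an agent should act on. The Python sketch below illustrates that task interface only; the function name, result type, and placeholder output are hypothetical assumptions for illustration, not the authors' released API (see the GitHub repo linked in the comments for the actual models and code).

# Minimal sketch of the GUI grounding task described in the abstract:
# screenshot + instruction in, target-element coordinates out.
# All names and the placeholder output here are hypothetical.

from dataclasses import dataclass


@dataclass
class GroundingResult:
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates
    confidence: float


def ground_element(screenshot: bytes, instruction: str) -> GroundingResult:
    """Stand-in for a grounding model such as OS-Atlas.

    A real implementation would run a vision-language model over the
    screenshot and decode the bounding box of the element that matches
    the instruction (e.g. "click the Save button").
    """
    # Placeholder: a trained model would predict these values.
    return GroundingResult(bbox=(0, 0, 0, 0), confidence=0.0)


# A downstream agent would turn the grounded box into a concrete action:
result = ground_element(b"<png bytes>", "click the Save button")
x = (result.bbox[0] + result.bbox[2]) // 2
y = (result.bbox[1] + result.bbox[3]) // 2
print(f"CLICK at ({x}, {y})")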

Community

Paper author Paper submitter

https://osatlas.github.io/
Github: https://github.com/OS-Copilot/OS-Atlas

Paper summary: https://www.aimodels.fyi/papers/arxiv/os-atlas-foundation-action-model-generalist-gui

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (https://huggingface.co/papers/2410.05243) (2024)
* Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (https://huggingface.co/papers/2410.18967) (2024)
* EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (https://huggingface.co/papers/2410.19461) (2024)
* OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning (https://huggingface.co/papers/2410.18963) (2024)
* Harnessing Webpage UIs for Text-Rich Visual Understanding (https://huggingface.co/papers/2410.13824) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 8


Datasets citing this paper 3

Spaces citing this paper 8

Collections including this paper 20
