https://scicode-bench.github.io

Papers
arxiv:2407.13168

SciCode: A Research Coding Benchmark Curated by Scientists

Published on Jul 18, 2024
· Submitted by Yanxin Lu on Jul 22, 2024
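Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng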

Abstract

SciCode, a scientist-curated coding benchmark, evaluates language models' ability to solve scientific research problems across various fields, highlighting current capabilities and future challenges in AI-assisted science.

AI-generated summary

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode demonstrates both contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
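As a rough illustration of the structure the abstract describes (main problems that factorize into subproblems, each with optional background, a gold-standard solution, and test cases), here is a minimal Python sketch. The class and field names are hypothetical and are not taken from the SciCode release; only the counts 338 and 80 come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subproblem:
    # All field names are hypothetical, chosen for illustration only.
    description: str
    background: Optional[str] = None  # optional scientific background text
    gold_solution: str = ""           # scientist-annotated reference code
    test_cases: List[str] = field(default_factory=list)


@dataclass
class MainProblem:
    title: str
    subproblems: List[Subproblem] = field(default_factory=list)


def mean_subproblems(problems: List[MainProblem]) -> float:
    """Average number of subproblems per main problem."""
    if not problems:
        return 0.0
    return sum(len(p.subproblems) for p in problems) / len(problems)


# Figures quoted in the abstract: 338 subproblems across 80 main problems,
# i.e. a little over four subproblems per main problem on average.
AVERAGE_FROM_ABSTRACT = 338 / 80
```

The averaging helper is included only to make the 338-to-80 ratio concrete; it says nothing about how the benchmark itself scores a model.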

Community


Hi @amber1120 congrats on this work!

Are you planning to share the dataset on the hub? Here's a guide: https://huggingface.co/docs/datasets/loading.

The dataset could then be loaded in 2 lines of code, like so:

from datasets import load_dataset

dataset = load_dataset("your-hf-organization/scicode")

It could then also be linked to this paper, as explained here: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.

We could also set up a Space using Gradio for the leaderboard.

Let me know if you need any help!

Cheers,
Niels
Open-source @ HF

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

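The following papers were recommended by the Semantic Scholar API:

* BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions (2024), https://huggingface.co/papers/2406.15877
* PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models (2024), https://huggingface.co/papers/2406.06887
* CodeRAG-Bench: Can Retrieval Augment Code Generation? (2024), https://huggingface.co/papers/2406.14497
* AICoderEval: Improving AI Domain Code Generation of Large Language Models (2024), https://huggingface.co/papers/2406.04712
* A Survey on Large Language Models for Code Generation (2024), https://huggingface.co/papers/2406.00515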

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

