Papers
arxiv:2504.13677

Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Published on Apr 18
· Submitted by Andrea Santilli on Apr 21
Authors:
Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson

Abstract

AI-generated summary

Evaluations of Uncertainty Quantification (UQ) in language models are biased by correctness functions, affecting UQ methods' performance, with LLM-as-a-judge approaches identified as less biased.

Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions -- from lexical-based and embedding-based metrics to LLM-as-a-judge approaches -- across 4 datasets × 4 models × 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
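
To make the evaluation protocol concrete, here is a minimal sketch (not the authors' code) of scoring a length-biased UQ method, negative sequence log-probability, against a correctness function with AUROC. All responses, log-probabilities, and correctness labels below are hypothetical toy data.

```python
# Minimal sketch of the UQ evaluation protocol described in the abstract.
# All data below is hypothetical; this is not the authors' implementation.
from sklearn.metrics import roc_auc_score

def neg_seq_logprob(token_logprobs):
    # UQ score: negative sequence log-probability. Higher = more uncertain.
    # Note the length bias: every extra token adds more negative mass,
    # so longer responses look more "uncertain" regardless of quality.
    return -sum(token_logprobs)

# Each entry: per-token log-probs of a model response, plus a binary
# correctness label from some correctness function (e.g. thresholded
# ROUGE-L or an LLM-as-a-judge verdict).
responses = [
    {"token_logprobs": [-0.2, -0.1, -0.3],             "correct": 1},
    {"token_logprobs": [-0.1, -0.2],                   "correct": 1},
    {"token_logprobs": [-0.5, -0.9, -1.2, -0.8, -1.1], "correct": 0},
    {"token_logprobs": [-0.7, -0.6, -1.0, -0.9, -1.3], "correct": 0},
]

uncertainty = [neg_seq_logprob(r["token_logprobs"]) for r in responses]
is_error = [1 - r["correct"] for r in responses]

# AUROC measures how well the UQ score ranks erroneous responses above
# correct ones. If the correctness function's own errors also fall
# disproportionately on long responses, it agrees with the length-biased
# UQ score for the wrong reason, inflating this number.
print("AUROC:", roc_auc_score(is_error, uncertainty))
```

In these toy numbers the AUROC comes out perfect, but partly because the incorrect responses are also the longest; the paper's point is that such agreement can reflect shared length bias rather than genuine uncertainty calibration.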

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.13677 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.13677 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.13677 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.