Announcing LMEval: An Open Source Framework for Cross-Model Evaluation

Wednesday, May 14, 2025

Authors: Elie Bursztein - Distinguished Research Scientist & David Tao - Software Engineer, Applied Security and Safety Research

Simplifying Cross-Provider Model Benchmarking

At InCyber Forum Europe in April, we open sourced LMEval, a large model evaluation framework, to help others accurately and efficiently compare how models from various providers perform across benchmark datasets. This announcement coincided with a joint talk with Giskard about our collaboration to increase trust in model safety and security. Giskard uses LMEval to run the Phare benchmark, which independently evaluates the security and safety of popular models.

Results from the Phare benchmark that leverages LMEval for evaluation

Rapid Changes in the Landscape of Large Models

New Large Language Models (LLMs) are released constantly, often promising improvements and new features. To keep up with this fast-paced lifecycle, developers, researchers, and organizations must quickly and reliably evaluate whether those newer models are better suited to their specific applications. So far, rapid model evaluation has proven difficult, as it requires tools that allow scalable, accurate, easy-to-use, cross-provider benchmarking.

Introducing LMEval: Simplifying Cross-Provider Model Benchmarking

To address this challenge, we are excited to introduce LMEval (Large Model Evaluator), an open source framework that Google developed to streamline the evaluation of LLMs across diverse benchmark datasets and model providers. LMEval is designed from the ground up to be accurate, multimodal, and easy to use. Its key features include:

Multi-Provider Compatibility: Evaluating models shouldn't require wrestling with different APIs for each provider. LMEval leverages the LiteLLM framework to offer out-of-the-box compatibility with major model providers including Google, OpenAI, Anthropic, Ollama, and Hugging Face. You can define your benchmark once and run it consistently across various models with minimal code changes.

Incremental & Efficient Evaluation: Re-running an entire benchmark suite every time a new model or version is released is slow, inefficient and costly. LMEval's intelligent evaluation engine plans and executes evaluations incrementally. It runs only the necessary evaluations for new models, prompts, or questions, saving significant time and compute resources. Its multi-threaded engine further accelerates this process.

Multimodal & Multi-Metric Support: Modern foundation models go beyond text. LMEval is designed for multimodal evaluation, supporting benchmarks that include text, images, and code. Adding new modalities is straightforward. Furthermore, it offers various scoring metrics to cover a wide range of benchmark formats, from boolean questions to multiple choice to free-form generation. Additionally, LMEval provides support for safety/punting detection.

Scalable & Secure Storage: To store benchmark results in a secure and efficient manner, LMEval utilizes a self-encrypting SQLite database. This approach protects benchmark data and results from inadvertent crawling/indexing while they stay easily accessible through LMEval.
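
The incremental planning idea described in the feature list above can be illustrated with a minimal Python sketch. It does not use the LMEval API: plan_evaluations and the (model, prompt, question) tuple layout are hypothetical names chosen for illustration. The sketch simply compares the requested combinations against those already stored and keeps only the ones that still need to run.

```python
from itertools import product


def plan_evaluations(models, prompts, questions, completed):
    """Return the (model, prompt, question) triples that still need to run.

    `completed` is a set of triples already stored from earlier runs.
    All names here are hypothetical; this is not the LMEval API.
    """
    requested = product(models, prompts, questions)
    return [triple for triple in requested if triple not in completed]


# Results already stored for an older model.
completed = {
    ("model-a", "prompt-1", "q1"),
    ("model-a", "prompt-1", "q2"),
}

# Adding a second model only schedules work for that model.
todo = plan_evaluations(
    models=["model-a", "model-b"],
    prompts=["prompt-1"],
    questions=["q1", "q2"],
    completed=completed,
)
print(todo)
```

In this toy run, only the two combinations involving the newly added model are scheduled; the work already done for the older model is skipped.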

Getting Started with LMEval

Creating and running evaluations with LMEval is designed to be intuitive. Here's a simplified example demonstrating how to evaluate two Gemini model versions on a benchmark:

Example of LMEval running on a multimodal benchmark across two models.

The LMEval GitHub repository includes example notebooks to help you get started.

Visualizing Results with LMEvalboard

Understanding benchmark results requires more than just summary statistics. To help with this, LMEval includes LMEvalboard, a companion dashboard tool that offers an interactive visualization of how models stack up against each other. LMEvalboard provides valuable insights into model strengths and weaknesses, complementing traditional raw evaluation data.

The LMEvalboard UI lets you quickly analyze how models compare on a given benchmark

LMEvalboard allows you to:

View Overall Performance: Quickly compare all models' accuracy across the entire benchmark.
Analyze a Single Model: Dive deep into a specific model's performance characteristics across different categories using radar charts, and drill down into specific examples of failures.
Perform Head-to-Head Comparisons: Directly compare two models, visualizing their performance differences across categories and examining specific questions where they disagree.

Try LMEval Today!

We invite you to explore LMEval, use it for your own evaluations, and contribute to its development by heading to the LMEval GitHub repository: https://github.com/google/lmeval

Acknowledgements

LMEval would not have been possible without the help of many people, including: Luca Invernizzi, Lenin Simicich, Marianna Tishchenko, Amanda Walker, and many other Googlers.

Kubernetes 1.33 is available on GKE!

Friday, May 9, 2025

Kubernetes 1.33 is now available in the Google Kubernetes Engine (GKE) Rapid Channel! For more information about the content of Kubernetes 1.33, read the official Kubernetes 1.33 Release Notes and the specific GKE 1.33 Release Notes.

Enhancements in 1.33:

In-place Pod Resizing

Workloads can be scaled horizontally by updating the Pod replica count, or vertically by updating the resources required by the Pod's container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement, impacting the service's reliability.

In-place Pod Resizing (IPPR, Public Preview) allows you to change the CPU and memory requests and limits assigned to containers within a running Pod through the new /resize pod subresource, often without requiring a container restart, which reduces service disruption.

This opens up various possibilities for vertical scale-up of stateful processes without any downtime, seamless scale-down when the traffic is low, and even allocating larger resources during startup, which can then be reduced once the initial setup is complete.

Review Resize CPU and Memory Resources assigned to Containers for detailed guidance on using the new API.
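
As a rough illustration of what a resize request might look like, the Python sketch below builds a patch body for one container and prints a kubectl command that targets the resize subresource. The helper build_resize_patch, the container name "app", and the Pod name "my-pod" are hypothetical placeholders, and the kubectl invocation assumes a kubectl version recent enough to support the resize subresource; the linked guide remains the reference for the exact workflow.

```python
import json
import shlex


def build_resize_patch(container, cpu=None, memory=None):
    """Build a patch body that sets requests and limits for one container."""
    resources = {}
    if cpu is not None:
        resources["cpu"] = cpu
    if memory is not None:
        resources["memory"] = memory
    if not resources:
        raise ValueError("specify cpu, memory, or both")
    return {
        "spec": {
            "containers": [
                {
                    "name": container,
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                }
            ]
        }
    }


patch = build_resize_patch("app", cpu="800m")

# Assumed invocation: kubectl's --subresource flag targeting "resize".
# "my-pod" and "app" are placeholders for your own Pod and container.
command = [
    "kubectl", "patch", "pod", "my-pod",
    "--subresource", "resize",
    "--patch", json.dumps(patch),
]
print(shlex.join(command))
```

Setting requests and limits to the same value is only a simplification for this sketch; they can be patched independently.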

DRA

Kubernetes Dynamic Resource Allocation (DRA), in beta as of v1.33, offers a more flexible API for requesting devices than Device Plugin. (See the instructions for opt-in beta features in GKE.)

Recent updates include the promotion of driver-owned resource claim status to beta. New alpha features introduced are partitionable devices, device taints and tolerations for managing device availability, prioritized device lists for versatile workload allocation, and enhanced admin access controls. Preparations for general availability include a new v1beta2 API to improve user experience and simplify future feature integration, alongside improved RBAC rules and support for seamless driver upgrades. DRA is anticipated to reach general availability in Kubernetes v1.34.

containerd 2.0

With GKE 1.33, we are excited to introduce support for containerd 2.0. This marks the first major version update for the underlying container runtime used by GKE. Adopting this version ensures that GKE continues to leverage the latest advancements and security enhancements from the upstream containerd community.

It's important to note that as a major version update, containerd 2.0 introduces many new features and enhancements while also deprecating others. To ensure a smooth transition and maintain compatibility for your workloads, we strongly encourage you to review your Cloud Recommendations. These recommendations will help identify any workloads that may be affected by these changes. Please see "Migrate nodes to containerd 2" for detailed guidance on making your workloads forward-compatible.

Multiple Service CIDRs

This enhancement introduced a new implementation of allocation logic for Service IPs. The updated IP address allocator logic uses two newly stable API objects: ServiceCIDR and IPAddress. Now generally available, these APIs allow cluster administrators to dynamically increase the number of IP addresses available for Services by creating new ServiceCIDR objects.
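
To make this more concrete, here is a small Python sketch that validates a CIDR string and builds a ServiceCIDR object as a dictionary. The object name "extra-services" and the range 10.100.0.0/24 are placeholders, and the apiVersion networking.k8s.io/v1 is assumed from the API being described as stable; check the API reference for your cluster version before applying anything like this.

```python
import ipaddress
import json


def service_cidr_manifest(name, cidrs):
    """Build a ServiceCIDR object after validating each CIDR string."""
    # ip_network raises ValueError on malformed or non-network input.
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return {
        "apiVersion": "networking.k8s.io/v1",  # assumed stable API version
        "kind": "ServiceCIDR",
        "metadata": {"name": name},
        "spec": {"cidrs": [str(network) for network in networks]},
    }


# Placeholder name and range, chosen only for illustration.
manifest = service_cidr_manifest("extra-services", ["10.100.0.0/24"])
print(json.dumps(manifest, indent=2))

# Size of the added range; not every address is necessarily allocatable.
added = ipaddress.ip_network("10.100.0.0/24").num_addresses
print(added)
```

The printed JSON could be saved to a file and applied with the usual tooling; the address count gives a quick sense of how much room the extra range adds.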

Highlight of Googlers' contributions in 1.33 cycle:

Coordinated Leader Election

The Coordinated Leader Election feature progressed to beta, with significant enhancements to how a lease candidate's availability is determined for an election. Specifically, the ping-acknowledgement check now runs fully concurrently instead of sequentially. This makes detection of unresponsive candidates faster and more efficient, which is essential for promptly identifying truly available lease candidates and maintaining the reliability of the leader election process.
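
The benefit of checking candidates concurrently can be seen in a toy Python simulation. This is not the Kubernetes implementation: responds and available_candidates are hypothetical helpers, and acknowledgement delays are simulated with sleeps. The point is that when all checks run at once, the total wait is bounded by roughly one timeout rather than the sum of every candidate's wait.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def responds(delay, timeout):
    """Simulate waiting for a ping acknowledgement from one candidate."""
    time.sleep(min(delay, timeout))
    return delay <= timeout


def available_candidates(delays, timeout=0.2):
    """Check all candidates at once and return those that answered in time."""
    workers = max(1, len(delays))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(responds, delay, timeout)
            for name, delay in delays.items()
        }
        return sorted(name for name, fut in futures.items() if fut.result())


# Simulated acknowledgement delays in seconds; "c" never answers in time.
delays = {"a": 0.01, "b": 0.05, "c": 1.0}
start = time.monotonic()
alive = available_candidates(delays)
elapsed = time.monotonic() - start
print(alive, f"{elapsed:.2f}s")
```

A sequential loop over the same candidates would wait for each one in turn, so a single unresponsive candidate delays the result for everyone behind it.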

Compatibility Versions

New CLI flags were added to the apiserver as options for adjusting API enablement with respect to an apiserver's emulated version:
--emulation-forward-compatible implicitly enables all APIs that are introduced after the emulation version and have higher priority than APIs of the same group resource enabled at the emulation version.
--runtime-config-emulation-forward-compatible explicitly enables specific APIs introduced after the emulation version through the runtime-config.

zPages

ComponentStatusz and ComponentFlagz alpha features are now available to be turned on for all control plane components.
Components now expose two new HTTP endpoints, /statusz and /flagz, providing enhanced visibility into their internal state. /statusz details the component's uptime and its Go, binary, and emulation version information, while /flagz reveals the command-line arguments used at startup.

Streaming List Responses

To improve cluster stability when handling large datasets, streaming encoding for List responses was introduced as a new Beta feature. Previously, serializing entire List responses into a single memory block could strain kube-apiserver memory. The new streaming encoder processes and transmits each item in a list individually, preventing large memory allocations. This significantly reduces memory spikes, improves API server reliability, and enhances overall cluster performance, especially for clusters with large resources, all while maintaining backward compatibility and requiring no client-side changes.
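
The difference between encoding a whole list at once and encoding it item by item can be sketched in a few lines of Python. This is an analogy rather than the kube-apiserver encoder (which is written in Go); stream_list is a hypothetical helper that yields a JSON list response in small chunks, so no single buffer ever has to hold the full response.

```python
import json


def stream_list(items, kind="PodList"):
    """Yield a JSON list response one item at a time."""
    yield '{"kind": ' + json.dumps(kind) + ', "items": ['
    for index, item in enumerate(items):
        if index:
            yield ", "
        # Each item is encoded on its own, so memory use tracks item size.
        yield json.dumps(item)
    yield "]}"


# The source can itself be a generator; items are never all held at once.
items = ({"metadata": {"name": f"pod-{i}"}} for i in range(3))
chunks = list(stream_list(items))

# A client that reassembles the chunks sees an ordinary JSON document.
decoded = json.loads("".join(chunks))
print(decoded["kind"], [item["metadata"]["name"] for item in decoded["items"]])
```

The chunks are collected into a list here only so the result can be checked; a real server would write each chunk to the connection as it is produced. Because the reassembled output is a normal JSON document, clients do not need to change.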

Snapshottable API server cache

Further enhancing API server performance and stability, a new Alpha feature introduces snapshotting to the watchcache. This allows serving LIST requests for historical or paginated data directly from its in-memory cache. Previously, these types of requests would query etcd directly, and the data had to be piped through multiple encoding, decoding, and validation stages. This process often led to increased memory pressure, unpredictable performance, and potential stability issues, especially with large resources. By leveraging efficient B-tree based snapshotting within the watchcache, this enhancement significantly reduces direct etcd load and minimizes memory allocations on the API server. This results in more predictable performance, increased API server reliability, and better overall resource utilization, while incorporating mechanisms to ensure data consistency between the cache and etcd.

Declarative Validation

Kubernetes thrives on its large, vibrant community of contributors. We're constantly looking for ways to make it easier to maintain and contribute to this project. For years, one area that posed challenges was how the Kubernetes API itself was validated: using hand-written Go code. This traditional method has proven difficult to author, challenging to review, and cumbersome to document, impacting overall maintainability and the contributor experience. To address these pain points, the declarative validation project was initiated.
In 1.33, the foundational infrastructure was established to transition Kubernetes API validation from handwritten Go code to a declarative model using IDL tags. This release introduced the validation-gen code generator, designed to parse these IDL tags and produce Go validation functions.
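
The general idea of moving from hand-written checks to declared rules can be sketched in Python. This is only an analogy: the real project generates Go functions from IDL tags, whereas the rule names below ("required", "minimum"), the field names, and generate_validator are hypothetical and are not actual Kubernetes tags or validation-gen output.

```python
def generate_validator(rules):
    """Turn declarative per-field rules into a validation function."""
    checks = []
    for field, rule in rules.items():
        if rule.get("required"):
            checks.append((field, lambda v: v is not None, "is required"))
        if "minimum" in rule:
            low = rule["minimum"]
            # Bind `low` as a default so each check keeps its own bound.
            checks.append(
                (field, lambda v, low=low: v is None or v >= low,
                 f"must be >= {low}")
            )

    def validate(obj):
        """Return a list of error strings; an empty list means valid."""
        return [
            f"{field}: {message}"
            for field, check, message in checks
            if not check(obj.get(field))
        ]

    return validate


# Hypothetical rules for illustration; not actual Kubernetes tags.
validate_spec = generate_validator({
    "replicas": {"required": True, "minimum": 0},
    "minReadySeconds": {"minimum": 0},
})
print(validate_spec({"replicas": -1}))
print(validate_spec({"replicas": 3, "minReadySeconds": 5}))
```

Because the rules are plain data, they can be reviewed and documented alongside the field definitions instead of being buried in imperative code.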

Ordered Namespace Deletion

The current namespace deletion process is semi-random, which may lead to security gaps or unintended behavior, such as Pods persisting after the deletion of their associated NetworkPolicies. With an opinionated deletion mechanism, Pods are deleted before other resources, respecting logical and security dependencies. This design enhances the security and reliability of Kubernetes by mitigating risks arising from the non-deterministic deletion order.
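
A toy Python sketch shows the ordering rule in isolation. It is not the namespace controller's logic: deletion_order and the (kind, name) pairs are hypothetical, and the only rule modeled is that Pods go first so they never outlive the NetworkPolicies that protect them.

```python
def deletion_order(resources):
    """Order (kind, name) pairs so that Pods are deleted first."""
    # sorted() is stable, so non-Pod resources keep their relative order.
    return sorted(resources, key=lambda resource: resource[0] != "Pod")


# Hypothetical namespace contents, listed in arbitrary order.
resources = [
    ("NetworkPolicy", "deny-all"),
    ("Pod", "web-0"),
    ("ConfigMap", "settings"),
    ("Pod", "web-1"),
]
ordered = deletion_order(resources)
print(ordered)
```

With this ordering, both Pods are removed before the NetworkPolicy, which avoids the window in which a Pod would run without its policy.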

Acknowledgements

As always, we want to thank all the Googlers who provide their time, passion, talent and leadership to keep making Kubernetes the best container orchestration platform. We would especially like to mention the Googlers who helped drive the contributions mentioned in this blog: Tim Allclair, Natasha Sarkar, Vivek Bansal, Anish Shah, Dawn Chen, Tim Hockin, John Belamaric, Morten Torkildsen, Yu Liao, Cici Huang, Samuel Karp, Chris Henzie, Luiz Oliveira, Piotr Betkier, Alex Curtis, Jonah Peretz, Brad Hoekstra, Yuhan Yao, Ray Wainman, Richa Banker, Marek Siarkowicz, Siyuan Zhang, Jeffrey Ying, Henry Wu, Yuchen Zhou, Jordan Liggitt, Benjamin Elder, Antonio Ojea, Yongrui Lin, Joe Betz, Aaron Prindle and the Googlers who helped bring 1.33 to GKE!

- Benjamin Elder & Sen Lu, Google Kubernetes Engine

GSoC 2025: We have our Contributors!

Thursday, May 8, 2025

Congratulations to the 1272 Contributors from 68 countries accepted for GSoC 2025! Our 185 Mentoring Orgs have been very busy this past month - reviewing 23,559 proposals, having countless discussions with applicants, and finally, completing the rigorous selection process to find the right Contributors for their community.

Here are some highlights of the 2025 GSoC applicants:

15,240 applicants from 130 countries submitting 23,559 proposals
Over 2,350 mentors and organization administrators
66.3% of applicants have no prior open source experience

Now that the 2025 GSoC Contributors have been announced, the Organizations and Contributors will spend 3 weeks together in the Community Bonding period. This time is a very important part of the GSoC program. It is designed to get new contributors up to speed quickly: Mentors will use these three weeks to introduce GSoC Contributors to their community, help them understand the codebase and norms of their project, adjust project deliverables, and understand the impact and reach of their summer project.

Contributors will begin writing code for Organizations on June 2nd - the official beginning of a totally new adventure! We're absolutely delighted to kick off another year alongside our amazing community.

A huge thanks to all the enthusiastic applicants who participated and, of course, to our phenomenal volunteer Mentors and Organization Administrators. Your weeks of thoughtful proposal reviews and proactive engagement with participants have been invaluable in introducing them to the world of open source.

And congratulations once again to our 2025 GSoC Contributors! Our goal is that GSoC serves as the catalyst for Contributors to become long term participants (and maybe even maintainers!) of open source communities of every shape and size. Now is their chance to dive in and learn more about open source and connect with these amazing communities.

opensource.google.com

Google Open Source Blog