Statistics
Showing new listings for Tuesday, 29 July 2025
- [1] arXiv:2507.19540 [pdf, html, other]
-
Title: Bayesian symbolic regression: Automated equation discovery from a physicists' perspective
Subjects: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Symbolic regression automates the process of learning closed-form mathematical models from data. Standard approaches to symbolic regression, as well as newer deep learning approaches, rely on heuristic model selection criteria, heuristic regularization, and heuristic exploration of model space. Here, we discuss the probabilistic approach to symbolic regression, an alternative to such heuristic approaches with direct connections to information theory and statistical physics. We show how the probabilistic approach establishes model plausibility from basic considerations and explicit approximations, and how it provides guarantees of performance that heuristic approaches lack. We also discuss how the probabilistic approach compels us to consider model ensembles, as opposed to single models.
- [2] arXiv:2507.19564 [pdf, html, other]
-
Title: Consistency and Central Limit Results for the Maximum Likelihood Estimator in the Admixture Model
Subjects: Applications (stat.AP)
In the Admixture Model, the probability of an individual having a certain number of alleles at a specific marker depends on the allele frequencies in $K$ ancestral populations and the fraction of the individual's genome originating from these ancestral populations.
This study investigates consistency and central limit results of maximum likelihood estimators (MLEs) for the ancestry and the allele frequencies in the Admixture Model, complementing previous work by \cite{pfaff2004information, pfaffelhuber2022central}. Specifically, we prove consistency of the MLE when estimating the allele frequencies and the ancestries. Furthermore, we prove central limit theorems when estimating the ancestry of a finite number of individuals and the allele frequencies of finitely many markers, also addressing the case where the true ancestry lies on the boundary of the parameter space.
Finally, we use the new theory to quantify the uncertainty of the MLEs for the data of \citet{10002015global}.
- [3] arXiv:2507.19607 [pdf, html, other]
-
Title: Inference with weights: Residualization produces short, valid intervals for varying estimands and varying resampling processes
Subjects: Methodology (stat.ME)
Weighting procedures are used in observational causal inference to adjust for covariate imbalance within the sample. Common practice for inference is to estimate robust standard errors from a weighted regression of outcome on treatment. However, it is well known that weighting can inflate variance estimates, sometimes significantly, leading to standard errors and confidence intervals that are overly conservative. We instead examine and recommend the use of robust standard errors from a weighted regression that additionally includes the balancing covariates and their interactions with treatment. We show that these standard errors are more precise and asymptotically correct for weights that achieve exact balance under multiple common resampling frameworks, including design-based and model-based inference, as well as superpopulation sampling with a finite sample correction. Gains to precision can be quite significant when the balancing weights adjust for prognostic covariates. For procedures that balance only approximately or in expectation, such as inverse propensity weighting or approximate balancing weights, our proposed method improves precision by reducing residuals through augmentation with the parametric model. We demonstrate our approach through simulation and re-analysis of multiple empirical studies.
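The recommended procedure is simple to carry out with standard regression software. Below is a minimal sketch (not the authors' code; the variable names, the HC2 covariance choice, and the centering of the covariates are illustrative assumptions) of a weighted regression of the outcome on treatment, the balancing covariates, and their interactions with treatment, with robust standard errors.

```python
# Minimal sketch of the recommended inference: robust standard errors from a weighted
# regression of the outcome on treatment, the (centered) balancing covariates, and their
# interactions with treatment. Variable names (y, treat, X, w) are placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))            # balancing covariates
treat = rng.integers(0, 2, size=n)     # binary treatment indicator
w = rng.uniform(0.5, 2.0, size=n)      # balancing weights (e.g., from a weighting procedure)
y = 1.0 * treat + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

Xc = X - np.average(X, axis=0, weights=w)                    # center covariates so the
design = np.column_stack([treat, Xc, treat[:, None] * Xc])   # treatment coefficient targets
design = sm.add_constant(design)                             # the weighted average effect

fit = sm.WLS(y, design, weights=w).fit(cov_type="HC2")       # robust (sandwich) standard errors
print(fit.params[1], fit.bse[1])       # treatment effect estimate and its standard error
```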
- [4] arXiv:2507.19611 [pdf, other]
-
Title: State evolution beyond first-order methods I: Rigorous predictions and finite-sample guarantees
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We develop a toolbox for exact analysis of iterative algorithms on a class of high-dimensional nonconvex optimization problems with random data. While prior work has shown that low-dimensional statistics of (generalized) first-order methods can be predicted by a deterministic recursion known as state evolution, our focus is on developing such a prediction for a more general class of algorithms. We provide a state evolution for any method whose iterations are given by (possibly interleaved) first-order and saddle point updates, showing two main results. First, we establish a rigorous state evolution prediction that holds even when the updates are not coordinate-wise separable. Second, we establish finite-sample guarantees bounding the deviation of the empirical updates from the established state evolution. In the process, we develop a technical toolkit that may prove useful in related problems. One component of this toolkit is a general Hilbert space lifting technique to prove existence and uniqueness of a convenient parameterization of the state evolution. Another component of the toolkit combines a generic application of Bolthausen's conditioning method with a sequential variant of Gordon's Gaussian comparison inequality, and provides additional ingredients that enable a general finite-sample analysis.
- [5] arXiv:2507.19623 [pdf, html, other]
-
Title: Adaptive Proximal Causal Inference with Some Invalid Proxies
Subjects: Methodology (stat.ME)
Proximal causal inference (PCI) is a recently proposed framework to identify and estimate the causal effect of an exposure on an outcome in the presence of hidden confounders, using observed proxies. Specifically, PCI relies on two types of proxies: a treatment-inducing confounding proxy, related to the outcome only through its association with unmeasured confounders (given treatment and covariates), and an outcome-inducing confounding proxy, related to the treatment only through such association (given covariates). These proxies must satisfy stringent exclusion restrictions: namely, the treatment proxy must not affect the outcome, and the outcome proxy must not be affected by the treatment. To improve identification and potentially efficiency, multiple proxies are often used, raising concerns about bias from exclusion violations. To address this, we introduce necessary and sufficient conditions for identifying causal effects in the presence of many proxies, some potentially invalid. Under a canonical proximal linear structural equations model, we propose a LASSO-based median estimator that jointly selects valid proxies and estimates the causal effect, with theoretical guarantees. Recognizing LASSO's limitations in consistently selecting valid treatment proxies, we develop an adaptive LASSO-based estimator with differential penalization. We show that it is root-n consistent and yields valid confidence intervals when a valid outcome proxy is available. We also extend the approach to settings with many potentially invalid outcome proxies. Theoretical results are supported by simulations and an application assessing the effect of right heart catheterization on 30-day survival in ICU patients.
- [6] arXiv:2507.19633 [pdf, html, other]
-
Title: Uniform inference in linear mixed models
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
We provide finite-sample distribution approximations that are uniform in the parameter for inference in linear mixed models. Focus is on variances and covariances of random effects in cases where existing theory fails because their covariance matrix is nearly or exactly singular, and hence near or at the boundary of the parameter set. Quantitative bounds on the differences between the standard normal density and those of linear combinations of the score function enable, for example, the assessment of sufficient sample size. The bounds also lead to useful asymptotic theory in settings where both the number of parameters and the number of random effects grow with the sample size. We consider models with independent clusters and ones with a possibly diverging number of crossed random effects, which are notoriously complicated. Simulations indicate the theory leads to practically relevant methods. In particular, the studied confidence regions, which are straightforward to implement, have near-nominal coverage in finite samples even when some random effects have variances near or equal to zero, or correlations near or equal to $\pm 1$.
- [7] arXiv:2507.19650 [pdf, html, other]
-
Title: A direct approach to tree-guided feature aggregation for high-dimensional regression
Subjects: Methodology (stat.ME)
In high-dimensional linear models, sparsity is often exploited to reduce variability and achieve parsimony. Equi-sparsity, where one assumes that predictors can be aggregated into groups sharing the same effects, is an alternative parsimonious structure that can be more suitable in certain applications. Previous work has clearly demonstrated the benefits of exploiting equi-sparsity in the presence of ``rare features'' (Yan and Bien 2021). In this work, we propose a new tree-guided regularization scheme for simultaneous estimation and feature aggregation. Unlike existing methods, our estimator avoids synthetic overparameterization and its detrimental effects. Even though our penalty is applied to hierarchically overlapped groups, we show that its proximal operator can be solved with a one-pass, non-iterative algorithm. Novel techniques are developed to study the finite-sample error bound of this seminorm-induced regularizer under least squares and binomial deviance losses. Theoretically, compared to existing methods, the proposed method offers a faster or equivalent rate depending on the true equi-sparsity structure. Extensive simulation studies verify these findings. Finally, we illustrate the usefulness of the proposed method with an application to a microbiome dataset, where we conduct post-selection inference on the aggregated features' effects.
- [8] arXiv:2507.19663 [pdf, html, other]
-
Title: Adaptive Bayesian Data-Driven Design of Reliable Solder Joints for Micro-electronic Devices
Comments: data-driven design, adaptive hyperparameters, Bayesian optimization, solder joint reliability, micro-electronics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Solder joint reliability related to failures due to thermomechanical loading is a critically important yet physically complex engineering problem. As a result, simulated behavior is oftentimes computationally expensive. In an increasingly data-driven world, the usage of efficient data-driven design schemes is a popular choice. Among them, Bayesian optimization (BO) with Gaussian process regression is one of the most important representatives. The authors argue that computational savings can be obtained from exploiting thorough surrogate modeling and selecting a design candidate based on multiple acquisition functions. This is feasible due to the relatively low computational cost of these steps compared to the expensive simulation objective. This paper addresses the shortcomings in the adjacent literature by providing and implementing a novel heuristic framework to perform BO with adaptive hyperparameters across the various optimization iterations. Adaptive BO is subsequently compared to regular BO when faced with synthetic objective minimization problems. The results show the efficiency of adaptive BO when compared to the worst-performing regular Bayesian schemes. As an engineering use case, the solder joint reliability problem is tackled by minimizing the accumulated non-linear creep strain under a cyclic thermal load. Results show that adaptive BO outperforms regular BO by 3% on average at any given computational budget threshold, critically saving half of the computational expense budget. This practical result underlines the methodological potential of the adaptive Bayesian data-driven methodology to achieve better results and cut optimization-related expenses. Lastly, in order to promote the reproducibility of the results, the data-driven implementations are made available on an open-source basis.
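To make the two ingredients concrete, here is a simplified Bayesian optimization loop in which the Gaussian process hyperparameters are re-fit at every iteration and the next design point is chosen among candidates proposed by multiple acquisition functions. The synthetic objective, kernel, acquisition functions, and selection rule are illustrative assumptions, not the paper's implementation.

```python
# Simplified BO loop illustrating (i) surrogate hyperparameters re-fit ("adapted") at every
# iteration and (ii) a design candidate picked from several acquisition functions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                      # synthetic minimization problem (illustrative)
    return np.sin(3 * x) + 0.5 * x**2

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(5, 1))
y = objective(X).ravel()
grid = np.linspace(-2, 2, 400).reshape(-1, 1)

for it in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True,
                                  n_restarts_optimizer=5)   # hyperparameters re-fit each step
    gp.fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sd, 1e-12)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)        # expected improvement
    lcb = mu - 2.0 * sd                                      # lower confidence bound
    # one candidate per acquisition function, then keep the one with the best surrogate mean
    candidates = [grid[np.argmax(ei)], grid[np.argmin(lcb)]]
    x_next = min(candidates, key=lambda c: gp.predict(c.reshape(1, -1))[0])
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmin(y)], "best value:", y.min())
```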
- [9] arXiv:2507.19685 [pdf, html, other]
-
Title: A Comparison of the Bayesian Posterior Probability and the Frequentist $p$-Value in Testing Equivalence Hypotheses
Subjects: Methodology (stat.ME)
Equivalence tests, otherwise known as parity or similarity tests, are frequently used in ``bioequivalence studies'' to establish practical equivalence rather than the usual statistically significant difference. In this article, we propose an equivalence test using both the $p$-value and a Bayesian procedure by computing the posterior probability that the null hypothesis is true. Since these posterior probabilities follow the uniform $[0,1]$ distribution under the null hypothesis, we use them in a Two One-Sided Test (TOST) procedure to perform equivalence tests. For certain specifications of the prior parameters, tests based on these posterior probabilities are more powerful and less conservative than those based on the $p$-value. We compare the parameter values that maximize the power functions of tests based on these two measures of evidence when using different equivalence margins. We also derive the correlation coefficient between these two measures of evidence. Furthermore, we also consider the effect of the prior variance on the conservativity and power function of the test based on the posterior probabilities. Finally, we provide examples and a small-scale simulation study to compare their performance in terms of type I error rate control and power in a single test, as well as in multiple testing, considering the power of the false discovery rate procedure.
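For context, the frequentist building block is the standard TOST procedure; a minimal sketch for a two-sample mean difference with equivalence margin $\delta$ follows (the data, margin, and pooled degrees of freedom are illustrative, and the paper's variant uses posterior probabilities of the one-sided nulls in place of the $p$-values).

```python
# Minimal sketch of the frequentist Two One-Sided Tests (TOST) procedure for equivalence
# of two means with margin delta. Data and margin are illustrative.
import numpy as np
from scipy import stats

def tost(x, y, delta, alpha=0.05):
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    df = len(x) + len(y) - 2                        # pooled df (Welch df would be more careful)
    p_lower = stats.t.sf((diff + delta) / se, df)   # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)  # H0: diff >= +delta
    p = max(p_lower, p_upper)                       # equivalence declared if p < alpha
    return diff, p, p < alpha

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(0.1, 1.0, 40)
print(tost(x, y, delta=0.5))
```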
- [10] arXiv:2507.19696 [pdf, html, other]
-
Title: Location Tests with Noisy Proxies for Latent Variables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
We investigate inference in a latent binary variable model where a noisy proxy of the latent variable is available, motivated by the variable perturbation effectiveness problem in single-cell CRISPR screens. The baseline approach is to ignore the perturbation effectiveness problem, while a recent proposal employs a weighted average based on the proxies. Our main goals are to determine how accurate the proxies must be in order for a weighted test to gain power over the unweighted baseline, and to develop tests that are powerful regardless of the accuracy of the proxies. To address the first goal, we compute the Pitman relative efficiency of the weighted test relative to the unweighted test, yielding an interpretable quantification of proxy quality that drives the power of the weighted test. To address the second goal, we propose two strategies. First, we propose a maximum-likelihood based approach that adapts the proxies to the data. Second, we propose an estimator of the Pitman efficiency if a "positive control outcome variable" is available (as is often the case in single-cell CRISPR screens), which facilitates an adaptive choice of whether to use the proxies at all. Our numerical simulations support the Pitman efficiency as the key quantity for determining whether the weighted test gains power over the baseline, and demonstrate that the two proposed adaptive tests can improve on both existing approaches across a range of proxy qualities.
- [11] arXiv:2507.19774 [pdf, html, other]
-
Title: Bag of Coins: A Statistical Probe into Neural Confidence Structures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce a novel, non-parametric statistical probe, the Bag-of-Coins (BoC) test, that examines the internal consistency of a classifier's logits. The BoC test reframes confidence estimation as a frequentist hypothesis test: does the model's top-ranked class win 1-v-1 contests against random competitors at a rate consistent with its own stated softmax probability? When applied to modern deep learning architectures, this simple probe reveals a fundamental dichotomy. On Vision Transformers (ViTs), the BoC output serves as a state-of-the-art confidence score, achieving near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. Conversely, on Convolutional Neural Networks (CNNs) like ResNet, the probe reveals a deep inconsistency between the model's predictions and its internal logit structure, a property missed by traditional metrics. We posit that BoC is not merely a calibration method, but a new diagnostic tool for understanding and exposing the differing ways that popular architectures represent uncertainty.
- [12] arXiv:2507.19787 [pdf, html, other]
-
Title: Sparse-mode Dynamic Mode Decomposition for Disambiguating Local and Global Structures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The dynamic mode decomposition (DMD) is a data-driven approach that extracts the dominant features from spatiotemporal data. In this work, we introduce sparse-mode DMD, a new variant of the optimized DMD framework that specifically leverages sparsity-promoting regularization in order to approximate DMD modes which have localized spatial structure. The algorithm maintains the noise-robust properties of optimized DMD while disambiguating between modes which are spatially local versus global in nature. In many applications, such modes are associated with discrete and continuous spectra respectively, thus allowing the algorithm to explicitly construct, in an unsupervised manner, the distinct portions of the spectrum. We demonstrate this by analyzing synthetic and real-world systems, including examples from optical waveguides, quantum mechanics, and sea surface temperature data.
- [13] arXiv:2507.19848 [pdf, html, other]
-
Title: A Bayesian Additive Regression Trees Model for zero and one inflated data for Predicting Individual Treatment Effects in Alcohol Use Disorder Trials
Pamela Solano, M Lee Van Horn, Kyle Walters, Philipp Besendorfer, Alena Kuhlemeier, Manel Martínez-Ramón, Thomas Jaki
Subjects: Applications (stat.AP)
Alcohol Use Disorder (AUD) treatment presents high individual-level heterogeneity, with outcomes ranging from complete abstinence to persistent heavy drinking. This variability, driven by complex behavioral, social, and environmental factors, poses major challenges for treatment evaluation and individualized decision-making. In particular, accurately modeling bounded semicontinuous outcomes and estimating predictive individual treatment effects (PITEs) remains methodologically demanding.
For the pre-registered PITE analysis of Project MATCH, we developed HOBZ-BART, a novel Bayesian nonparametric model tailored for semicontinuous outcomes concentrated at clinically meaningful boundary values (0 and 1). The model decomposes the outcome into three components (abstinence, partial drinking, and persistent use) via a sequential hurdle structure, offering interpretability aligned with clinical reasoning. A shared Bayesian Additive Regression Tree (BART) ensemble captures nonlinear effects and covariate interactions across components, while a scalable Beta-likelihood approximation enables efficient, conjugate-friendly posterior computation.
Through extensive simulations, we demonstrate that HOBZ-BART outperforms the traditional zero-one inflated Beta (ZOIB) model in predictive accuracy, computational efficiency, and PITE estimation. We then present the primary PITE analysis of the MATCH trial using HOBZ-BART, which enables clinically meaningful comparisons of Cognitive Behavioral Therapy (CBT), Motivational Enhancement Therapy (MET), and Twelve Step Facilitation (TSF), offering personalized treatment insights.
HOBZ-BART combines statistical rigor with clinical interpretability, addressing a critical need in addiction research for models that support individualized, data-driven care.
- [14] arXiv:2507.19868 [pdf, html, other]
-
Title: Temporal network analysis via a degree-corrected Cox model
Comments: This paper supersedes arXiv article arXiv:2301.04296v1 titled "A degree-corrected Cox model for dynamic networks" by Yuguo Chen, Lianqiang Qu, Jinfeng Xu, Ting Yan, Yunpeng Zhou
Subjects: Methodology (stat.ME)
Temporal dynamics, characterised by time-varying degree heterogeneity and homophily effects, are often exhibited in many real-world networks. As observed in an MIT Social Evolution study, the in-degree and out-degree of the nodes show considerable heterogeneity that varies with time. Concurrently, homophily effects, which explain why nodes with similar characteristics are more likely to connect with each other, are also time-dependent. To facilitate the exploration and understanding of these dynamics, we propose a novel degree-corrected Cox model for directed networks, where the way for degree-heterogeneity or homophily effects to change with time is left completely unspecified. Because each node has individual-specific in- and out-degree parameters that vary over time, the number of unknown parameters grows with the number of nodes, leading to a high-dimensional estimation problem. Therefore, it is highly nontrivial to make inference. We develop a local estimating equations approach to estimate the unknown parameters and establish the consistency and asymptotic normality of the proposed estimators in the high-dimensional regime. We further propose test statistics to check whether temporal variation or degree heterogeneity is present in the network and develop a graphically diagnostic method to evaluate goodness-of-fit for dynamic network models. Simulation studies and two real data analyses are provided to assess the finite sample performance of the proposed method and illustrate its practical utility.
- [15] arXiv:2507.19889 [pdf, html, other]
-
Title: Causal Inference for Circular Data
Subjects: Methodology (stat.ME); Applications (stat.AP)
In causal inference, a fundamental task is to estimate the effect resulting from a specific treatment, which is often handled with inverse probability weighting. Despite an abundance of attention to the advancement of this task, most articles have focused on linear data rather than circular data, which are measured in angles. In this article, we extend the causal inference framework to accommodate circular data. Specifically, two new treatment effects, average direction treatment effect (ADTE) and average length treatment effect (ALTE), are introduced to offer a proper causal explanation for these data. As the average direction and average length describe the location and concentration of a random sample of circular data, the ADTE and ALTE measure the change in direction and length between two counterfactual outcomes. With inverse probability weighting, we propose estimators that exhibit ideal theoretical properties, which are validated by a simulation study. To illustrate the practical utility of our estimator, we analyze the effect of different job types on dispatchers' sleep patterns using data from the Federal Railroad Administration.
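The two estimands can be computed directly from weighted circular summaries. The sketch below is illustrative, not the authors' estimator: the Hájek-style normalization, the wrapping of the direction difference, and the simulated data are my own choices, and the propensity scores are assumed known or estimated elsewhere.

```python
# Illustrative computation of the two circular estimands: IPW-weighted mean direction and
# mean resultant length in each arm; ADTE = wrapped difference of directions, ALTE =
# difference of lengths.
import numpy as np

def weighted_circular_summary(theta, w):
    C = np.sum(w * np.cos(theta)) / np.sum(w)
    S = np.sum(w * np.sin(theta)) / np.sum(w)
    direction = np.arctan2(S, C)             # average direction
    length = np.hypot(C, S)                  # average (mean resultant) length in [0, 1]
    return direction, length

def circular_effects(theta, treat, propensity):
    w1 = treat / propensity                  # inverse probability weights, treated arm
    w0 = (1 - treat) / (1 - propensity)      # inverse probability weights, control arm
    d1, l1 = weighted_circular_summary(theta, w1)
    d0, l0 = weighted_circular_summary(theta, w0)
    adte = np.angle(np.exp(1j * (d1 - d0)))  # wrap the direction difference to (-pi, pi]
    alte = l1 - l0
    return adte, alte

rng = np.random.default_rng(4)
n = 1000
treat = rng.integers(0, 2, n)
propensity = np.full(n, 0.5)
theta = rng.vonmises(mu=0.3 * treat, kappa=2.0, size=n)  # circular outcomes in radians
print(circular_effects(theta, treat, propensity))
```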
- [16] arXiv:2507.19893 [pdf, html, other]
-
Title: Retrospective score tests versus prospective score tests for genetic association with case-control data
Journal-ref: Liu Y., Li P., Song L., Yu K., Qin J. (2021) Retrospective score tests versus prospective score tests for genetic association with case-control data. Biometrics, 77, 102-112
Subjects: Methodology (stat.ME)
Since the seminal work by Prentice and Pyke (1979), the prospective logistic likelihood has become the standard method of analysis for retrospectively collected case-control data, in particular for testing the association between a single genetic marker and a disease outcome in genetic case-control studies. When studying multiple genetic markers with relatively small effects, especially those with rare variants, various aggregated approaches based on the same prospective likelihood have been developed to integrate subtle association evidence among all considered markers. In this paper we show that using the score statistic derived from a prospective likelihood is not optimal in the analysis of retrospectively sampled genetic data. We develop the locally most powerful genetic aggregation test derived through the retrospective likelihood under a random effect model assumption. In contrast to the fact that the disease prevalence information cannot be used to improve the efficiency for the estimation of odds ratio parameters in logistic regression models, we show that it can be utilized to enhance the testing power in genetic association studies. Extensive simulations demonstrate the advantages of the proposed method over the existing ones. One real genome-wide association study is analyzed for illustration.
- [17] arXiv:2507.19915 [pdf, html, other]
-
Title: Effective Bayesian Modeling of Large Spatiotemporal Count Data Using Autoregressive Gamma Processes
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
We put forward a new Bayesian modeling strategy for spatiotemporal count data that enables efficient posterior sampling. Most previous models for such data decompose logarithms of the response Poisson rates into fixed effects and spatial random effects, where the latter is typically assumed to follow a latent Gaussian process, the conditional autoregressive model, or the intrinsic conditional autoregressive model. Since log-Gaussian is not conjugate to Poisson, such implementations must resort to either approximation methods like INLA or Metropolis moves on latent states in MCMC algorithms for model fitting and exhibit several approximation and posterior sampling challenges. Instead of modeling logarithms of spatiotemporal frailties jointly as a Gaussian process, we construct a spatiotemporal autoregressive gamma process guaranteed stationary across the time dimension. We decompose latent Poisson variables to permit fully conjugate Gibbs sampling of spatiotemporal frailties and design a sparse spatial dependence structure to get a linear computational complexity that facilitates efficient posterior computation. Our model permits convenient Bayesian predictive machinery based on posterior samples that delivers satisfactory performance in predicting at new spatial locations and time intervals. We have performed extensive simulation experiments and real data analyses, which corroborated our model's accurate parameter estimation, model fitting, and out-of-sample prediction capabilities.
- [18] arXiv:2507.19978 [pdf, html, other]
-
Title: Extreme value theory for singular subspace estimation in the matrix denoising model
Comments: 64 pages, 8 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
This paper studies fine-grained singular subspace estimation in the matrix denoising model where a deterministic low-rank signal matrix is additively perturbed by a stochastic matrix of Gaussian noise. We establish that the maximum Euclidean row norm (i.e., the two-to-infinity norm) of the aligned difference between the leading sample and population singular vectors approaches the Gumbel distribution in the large-matrix limit, under suitable signal-to-noise conditions and after appropriate centering and scaling. We apply our novel asymptotic distributional theory to test hypotheses of low-rank signal structure encoded in the leading singular vectors and their corresponding principal subspace. We provide de-biased estimators for the corresponding nuisance signal singular values and show that our proposed plug-in test statistic has desirable properties. Notably, compared to using the Frobenius norm subspace distance, our test statistic based on the two-to-infinity norm has higher power to detect structured alternatives that differ from the null in only a few matrix entries or rows. Our main results are obtained by a novel synthesis of and technical analysis involving entrywise matrix perturbation analysis, extreme value theory, saddle point approximation methods, and random matrix theory. Our contributions complement the existing literature for matrix denoising focused on minimaxity, mean squared error analysis, unitarily invariant distances between subspaces, component-wise asymptotic distributional theory, and row-wise uniform error bounds. Numerical simulations illustrate our main results and demonstrate the robustness properties of our testing procedure to non-Gaussian noise distributions.
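As a small numerical illustration of the statistic under study, the snippet below computes the two-to-infinity norm (the maximum Euclidean row norm) of the difference between the leading sample and population singular vectors after an orthogonal Procrustes alignment. The matrix dimensions, rank, and signal strengths are arbitrary choices, and no Gumbel limit is verified here.

```python
# Two-to-infinity norm of the aligned singular-subspace difference in the matrix denoising
# model M + noise; setup is illustrative only.
import numpy as np

rng = np.random.default_rng(5)
n, p, r = 400, 300, 3
U = np.linalg.qr(rng.normal(size=(n, r)))[0]          # population left singular vectors
V = np.linalg.qr(rng.normal(size=(p, r)))[0]
M = U @ np.diag([60.0, 50.0, 40.0]) @ V.T             # deterministic low-rank signal
A = M + rng.normal(scale=1.0, size=(n, p))            # additive Gaussian noise

U_hat = np.linalg.svd(A, full_matrices=False)[0][:, :r]

# orthogonal alignment W = argmin || U_hat W - U ||_F, solved via the SVD of U_hat^T U
Q1, _, Q2 = np.linalg.svd(U_hat.T @ U)
W = Q1 @ Q2

two_to_inf = np.max(np.linalg.norm(U_hat @ W - U, axis=1))   # max Euclidean row norm
frobenius = np.linalg.norm(U_hat @ W - U)
print(two_to_inf, frobenius)
```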
- [19] arXiv:2507.20001 [pdf, html, other]
-
Title: Computation of Optimal Type-II Progressing Censoring Scheme Using Genetic Algorithm Approach
Subjects: Applications (stat.AP); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
The experimenter must perform a legitimate search over the entire set of feasible censoring schemes to identify the optimal Type-II progressive censoring scheme when applied to a life-testing experiment. Current recommendations are limited to small sample sizes. Exhaustive search strategies are not practically feasible for large sample sizes. This paper proposes a meta-heuristic algorithm based on the genetic algorithm for large sample sizes. The algorithm is found to provide optimal or near-optimal solutions for both small and large sample sizes. Our suggested optimality criterion is based on the cost function and is scale-invariant for both location-scale and log-location-scale distribution families. To investigate how inaccurate parameter values or cost coefficients may affect the optimal solution, a sensitivity analysis is also carried out.
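A bare-bones version of such a genetic algorithm searches over schemes $R = (R_1, \ldots, R_m)$ with $R_i \ge 0$ and $\sum_i R_i = n - m$. The sketch below is only illustrative: the fitness function is a placeholder, not the paper's cost-based, scale-invariant criterion, and the population size, crossover, and mutation rules are arbitrary choices.

```python
# Bare-bones genetic algorithm over Type-II progressive censoring schemes R with
# sum(R) = n - m. The fitness function is a PLACEHOLDER for the paper's cost criterion.
import numpy as np

rng = np.random.default_rng(6)
n, m = 30, 10              # n units on test, m observed failures
pop_size, n_gen = 50, 200

def random_scheme():
    # random composition of n - m into m nonnegative parts
    cuts = np.sort(rng.integers(0, n - m + 1, size=m - 1))
    return np.diff(np.concatenate(([0], cuts, [n - m])))

def fitness(R):
    # PLACEHOLDER objective: prefer removals late in the experiment (illustration only)
    return float(np.sum(R * np.arange(1, m + 1)))

def crossover(a, b):
    child = (a + b) // 2
    child[-1] += (n - m) - child.sum()        # repair so the scheme stays feasible
    return child

def mutate(R):
    R = R.copy()
    i, j = rng.integers(0, m, size=2)
    if R[i] > 0:
        R[i] -= 1
        R[j] += 1                             # move one removal between stages
    return R

pop = [random_scheme() for _ in range(pop_size)]
for _ in range(n_gen):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: pop_size // 2]
    children = [mutate(crossover(parents[rng.integers(len(parents))],
                                 parents[rng.integers(len(parents))]))
                for _ in range(pop_size - len(parents))]
    pop = parents + children

pop.sort(key=fitness, reverse=True)
print("best scheme:", pop[0], "total removals:", pop[0].sum())
```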
- [20] arXiv:2507.20024 [pdf, html, other]
-
Title: Discrete Gaussian Vector Fields On Meshes
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Though the underlying fields associated with vector-valued environmental data are continuous, observations themselves are discrete. For example, climate models typically output grid-based representations of wind fields or ocean currents, and these are often downscaled to a discrete set of points. By treating the area of interest as a two-dimensional manifold that can be represented as a triangular mesh and embedded in Euclidean space, this work shows that discrete intrinsic Gaussian processes for vector-valued data can be developed from discrete differential operators defined with respect to a mesh. These Gaussian processes account for the geometry and curvature of the manifold whilst also providing a flexible and practical formulation that can be readily applied to any two-dimensional mesh. We show that these models can capture harmonic flows, incorporate boundary conditions, and model non-stationary data. Finally, we apply these models to downscaling stationary and non-stationary gridded wind data on the globe, and to inference of ocean currents from sparse observations in bounded domains.
- [21] arXiv:2507.20058 [pdf, html, other]
-
Title: Predicting Parkinson's Disease Progression Using Statistical and Neural Mixed Effects Models: A Comparative Study on Longitudinal Biomarkers
Comments: 20 pages, 3 figures, currently under review
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Predicting Parkinson's Disease (PD) progression is crucial, and voice biomarkers offer a non-invasive method for tracking symptom severity (UPDRS scores) through telemonitoring. Analyzing this longitudinal data is challenging due to within-subject correlations and complex, nonlinear patient-specific progression patterns. This study benchmarks linear mixed models (LMMs) against two advanced hybrid approaches: the Generalized Neural Network Mixed Model (GNMM) (Mandel 2021), which embeds a neural network within a GLMM structure, and the Neural Mixed Effects (NME) model (Wortwein 2023), which allows nonlinear subject-specific parameters throughout the network. Using the Oxford Parkinson's telemonitoring voice dataset, we evaluate these models' performance in predicting Total UPDRS to offer practical guidance for PD research and clinical applications.
- [22] arXiv:2507.20079 [pdf, html, other]
-
Title: Lasso Penalization for High-Dimensional Beta Regression Models: Computation, Analysis, and Inference
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
Beta regression is commonly employed when the outcome variable is a proportion. Since its conception, the approach has been widely used in applications spanning various scientific fields. A series of extensions have been proposed over time, several of which address variable selection and penalized estimation, e.g., with an $\ell_1$-penalty (LASSO). However, a theoretical analysis of this popular approach in the context of Beta regression with high-dimensional predictors is lacking. In this paper, we aim to close this gap. A particular challenge arises from the non-convexity of the associated negative log-likelihood, which we address by resorting to a framework for analyzing stationary points in a neighborhood of the target parameter. Leveraging this framework, we derive a non-asymptotic bound on the $\ell_1$-error of such stationary points. In addition, we propose a debiasing approach to construct confidence intervals for the regression parameters. A proximal gradient algorithm is devised for optimizing the resulting penalized negative log-likelihood function. Our theoretical analysis is corroborated via simulation studies, and a real data example concerning the prediction of county-level proportions of incarceration is presented to showcase the practical utility of our methodology.
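A compact sketch of such a proximal gradient (ISTA) iteration follows, under simplifying assumptions not taken from the paper: the precision parameter $\phi$ is treated as fixed and known, the mean uses a logit link, and a constant step size is used.

```python
# Proximal gradient (ISTA) sketch for l1-penalized Beta regression with a logit mean link
# and fixed known precision phi; data, penalty level, and step size are illustrative.
import numpy as np
from scipy.special import digamma, expit

rng = np.random.default_rng(7)
n, p, phi, lam, step = 300, 50, 20.0, 0.05, 0.05
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.8, 0.5]                                # sparse truth
mu_true = expit(X @ beta_true)
y = rng.beta(mu_true * phi, (1 - mu_true) * phi)
y = np.clip(y, 1e-6, 1 - 1e-6)                                  # keep y in the open interval

def neg_loglik_grad(beta):
    # gradient of the (averaged) negative Beta log-likelihood under a logit link
    mu = expit(X @ beta)
    y_star = np.log(y / (1 - y))
    mu_star = digamma(mu * phi) - digamma((1 - mu) * phi)
    return -X.T @ (phi * (y_star - mu_star) * mu * (1 - mu)) / n

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta = np.zeros(p)
for _ in range(1000):                                           # ISTA iterations
    beta = soft_threshold(beta - step * neg_loglik_grad(beta), step * lam)

print("nonzero coefficients:", np.flatnonzero(beta))
```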
- [23] arXiv:2507.20092 [pdf, html, other]
-
Title: Bayesian Mixed-Effects Models for Multilevel Two-way Functional Data: Applications to EEG Experiments
Subjects: Methodology (stat.ME)
In multi-condition EEG experiments, brain activity is recorded as subjects perform various tasks or are exposed to different stimuli. The recorded signals are commonly transformed into time-frequency representations, which often display smooth variations across time and frequency dimensions. These representations are naturally structured as two-way functional data, with experimental conditions nested within subjects. Existing analytical methods fail to jointly account for the data's multilevel structure, functional nature, and dependence on subject-level covariates. To address these limitations, we propose a Bayesian mixed-effects model for two-way functional data that incorporates covariate-dependent fixed effects at the condition level and multilevel random effects. For enhanced model interpretability and parsimony, we introduce a novel covariate-dependent CANDECOMP/PARAFAC (CP) decomposition for the fixed effects, with marginally interpretable time and frequency patterns. We further propose a sparsity-inducing prior for CP rank selection and an efficient algorithm for posterior sampling. The proposed method is evaluated through extensive simulations and applied to EEG data collected to investigate the effects of alcoholism on cognitive processing in response to visual stimuli. Our analysis reveals distinct patterns of time-frequency activity associated with alcoholism, offering new insights into the neural processing differences between subject groups and experimental conditions.
- [24] arXiv:2507.20153 [pdf, html, other]
-
Title: A Markov switching discrete-time Hawkes process: application to the monitoring of bats behavior
Subjects: Methodology (stat.ME); Applications (stat.AP)
Over the past few decades, the Hawkes process has become a popular framework for modeling temporal events thanks to its flexibility to capture different dependency structures. The objective of this work is to model call sequences emitted by bats for echolocation, whose patterns are known to change depending on the animal's activity. The novelty of the model lies in the combination of a Hawkes-type dependency on past events with a latent variable that encodes changes in bat behavior. More precisely, we consider a discrete-time version of the Hawkes process, with an exponential kernel, where the immigration term varies according to a latent Markov chain. We prove that this model is identifiable and can be reformulated in terms of a Hidden Markov Model with Poisson emissions. Based on these properties, we show that maximum likelihood inference of the model parameters can be performed using an EM algorithm, which involves a recursive M-step. A simulation study demonstrates the performance of our approach for estimating the parameters, recovering the number of hidden states and classifying each bin of the trajectory. Finally, we illustrate the use of the proposed modeling to distinguish different behaviors of bats, based on recordings of their cries.
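The generative model described above is straightforward to simulate. The snippet below, with illustrative parameter values rather than the paper's, draws counts from a discrete-time Hawkes process with an exponential kernel whose immigration term follows a latent two-state Markov chain.

```python
# Simulation of a discrete-time Hawkes process with exponential kernel and
# Markov-switching immigration term; parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(8)
T = 500
mu = np.array([0.2, 2.0])                 # immigration rate per hidden state
P = np.array([[0.95, 0.05],               # transition matrix of the latent chain
              [0.10, 0.90]])
alpha, beta = 0.4, 0.7                    # self-excitation weight and exponential decay

Z = np.zeros(T, dtype=int)                # latent states
N = np.zeros(T, dtype=int)                # event counts per time bin
h = 0.0                                   # running sum_{s<t} exp(-beta (t - s)) N_s
for t in range(T):
    if t > 0:
        Z[t] = rng.choice(2, p=P[Z[t - 1]])
        h = np.exp(-beta) * (h + N[t - 1])
    lam = mu[Z[t]] + alpha * h            # conditional intensity of bin t
    N[t] = rng.poisson(lam)

print("total events:", N.sum(), "fraction of time in state 1:", (Z == 1).mean())
```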
- [25] arXiv:2507.20231 [pdf, html, other]
-
Title: Causal Inference when Intervention Units and Outcome Units Differ
Subjects: Methodology (stat.ME)
We study causal inference in settings characterized by interference with a bipartite structure. There are two distinct sets of units: intervention units to which an intervention can be applied and outcome units on which the outcome of interest can be measured. Outcome units may be affected by interventions on some, but not all, intervention units, as captured by a bipartite graph. Examples of this setting can be found in analyses of the impact of pollution abatement in plants on health outcomes for individuals, or the effect of transportation network expansions on regional economic activity. We introduce and discuss a variety of old and new causal estimands for these bipartite settings. We do not impose restrictions on the functional form of the exposure mapping and the potential outcomes, thus allowing for heterogeneity, non-linearity, non-additivity, and potential interactions in treatment effects. We propose unbiased weighting estimators for these estimands from a design-based perspective, based on the knowledge of the bipartite network under general experimental designs. We derive their variance and prove consistency as the number of outcome units increases. Using the Chinese high-speed rail construction study, analyzed in Borusyak and Hull [2023], we discuss non-trivial positivity violations that depend on the estimands, the adopted experimental design, and the structure of the bipartite graph.
- [26] arXiv:2507.20288 [pdf, html, other]
-
Title: A nonparametric approach to practical identifiability of nonlinear mixed effects models
Tyler Cassidy, Stuart T. Johnston, Michael Plank, Imke Botha, Jennifer A. Flegg, Ryan J. Murphy, Sara Hamis
Subjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
Mathematical modelling is a widely used approach to understand and interpret clinical trial data. This modelling typically involves fitting mechanistic mathematical models to data from individual trial participants. Despite the widespread adoption of this individual-based fitting, it is becoming increasingly common to take a hierarchical approach to parameter estimation, where modellers characterize the population parameter distributions, rather than considering each individual independently. This hierarchical parameter estimation is standard in pharmacometric modelling. However, many of the existing techniques for parameter identifiability do not immediately translate from the individual-based fitting to the hierarchical setting. Here, we propose a nonparametric approach to study practical identifiability within a hierarchical parameter estimation framework. We focus on the commonly used nonlinear mixed effects framework and investigate two well-studied examples from the pharmacometrics and viral dynamics literature to illustrate the potential utility of our approach.
- [27] arXiv:2507.20329 [pdf, html, other]
-
Title: Clustering data with values missing at random using scale mixtures of multivariate skew-normal distributions
Comments: Keywords: Mixture Models, skew-normal distribution, missing values at random. 32 pages, 14 figures
Subjects: Methodology (stat.ME); Computation (stat.CO)
Handling missing data is a major challenge in model-based clustering, especially when the data exhibit skewness and heavy tails. We address this by extending the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) family to accommodate incomplete data under a missing at random (MAR) mechanism. Unlike previous work that is limited to one of the special cases of the FMSMSN family, our method offers a cluster analysis methodology for the entire family that accounts for skewness and excess kurtosis amidst data with missing values. The multivariate skew-normal distribution, as parameterised by \cite{azzalini1996} and \cite{arnoldbeaver} includes the normal distribution as a special case, which ensures that our method is flexible toward existing symmetric model-based clustering techniques under a normality assumption. We derive the distributional properties of the missing components of the data and propose an augmented EM-type algorithm tailored for incomplete observations. The modified E-step yields closed-form expressions for the conditional expectations of the missing values. The simulation experiments showcase the flexibility of the FMSMSN family in both clustering performance and parameter recovery for varying percentages of missing values, while incorporating the effects of sample size and cluster proximity. Finally, we illustrate the practical utility of the proposed method by applying special cases of the FMSMSN family to global CO2 emissions data.
- [28] arXiv:2507.20379 [pdf, html, other]
-
Title: A global Lipschitz stability perspective for understanding approximate approaches in Bayesian sequential learning
Subjects: Statistics Theory (math.ST); Numerical Analysis (math.NA)
We establish a general, non-asymptotic error analysis framework for understanding the effects of incremental approximations made by practical approaches for Bayesian sequential learning (BSL) on their long-term inference performance. Our setting covers inverse problems, state estimation, and parameter-state estimation. In these settings, we bound the difference, termed the learning error, between the unknown true posterior and the approximate posterior computed by these approaches, using three widely used distribution metrics: total variation, Hellinger, and Wasserstein distances. This framework builds on our establishment of the global Lipschitz stability of the posterior with respect to the prior across these settings. To the best of our knowledge, this is the first work to establish such global Lipschitz stability under the Hellinger and Wasserstein distances and the first general error analysis framework for approximate BSL methods.
Our framework offers two sets of upper bounds on the learning error. The first set demonstrates the stability of general approximate BSL methods with respect to the incremental approximation process, while the second set is estimable in many practical scenarios.
Furthermore, as an initial step toward understanding the phenomenon of learning error decay, which is sometimes observed, we identify sufficient conditions under which data assimilation leads to learning error reduction.
- [29] arXiv:2507.20396 [pdf, html, other]
-
Title: Recurrent Event Analysis with Ordinary Differential Equations
Subjects: Methodology (stat.ME)
This paper introduces a general framework for analyzing recurrent event data by modeling the conditional mean function of the recurrent event process as the solution to an Ordinary Differential Equation (ODE). This approach not only accommodates a wide range of semi-parametric recurrent event models, including both non-homogeneous Poisson processes (NHPPs) and non-Poisson processes, but also is scalable and easy to implement. Based on this framework, we propose a Sieve Maximum Pseudo-Likelihood Estimation (SMPLE) method, employing the NHPP as a working model. We establish the consistency and asymptotic normality of the proposed estimator, demonstrating that it achieves semi-parametric efficiency when the NHPP working model is valid. Furthermore, we develop an efficient resampling procedure to estimate the asymptotic covariance matrix. To assess the statistical efficiency and computational scalability of the proposed method, we conduct extensive numerical studies, including simulations under various settings and an application to a real-world dataset analyzing risk factors associated with Intensive Care Unit (ICU) readmission frequency.
- [30] arXiv:2507.20547 [pdf, other]
-
Title: Exploring Causal Mediation Analysis in Bacterial Vaginosis Challenges
Comments: 35 pages, 8 figures
Subjects: Applications (stat.AP)
Bacterial Vaginosis (BV) affects nearly 23-29% of women worldwide and increases risk of miscarriage, preterm birth, and sexually transmitted infections. It involves a shift in the vaginal microbiome from Lactobacillus dominance to a diverse bacterial composition. Understanding causal pathways linking behavioral factors to BV risk is essential for effective intervention. Observational studies have identified pathogenic bacteria associated with BV, and causal mediation analysis can clarify how behaviors like sexual activity influence the microbiome. Analyzing microbiome data is complex due to its high-dimensional and compositional nature, often challenging traditional statistical methods, especially with small samples. This article presents various approaches to measure causal mediation effects, emphasizing the benefits of an empirical distribution method for small samples, and outlines models for mediators, exposure, and outcomes, aiming to identify taxa that mediate the exposure-outcome relationship in BV, concluding with a revisit of the motivational example and model identification.
- [31] arXiv:2507.20558 [pdf, other]
-
Title: Time-to-Event Modeling with Pseudo-Observations in Federated Settings
Comments: 30 pages, 5 figures
Subjects: Applications (stat.AP)
In multi-center clinical studies, concerns about patient privacy often prohibit pooling individual-level time-to-event data. We propose a non-iterative, one-shot federated framework using distributed pseudo-observations, derived from a sequentially updated Kaplan-Meier estimator and fitted with renewable generalized linear models. This framework enables the estimation of survival probabilities at specified landmark times and accommodates both time-invariant and time-varying covariate effects. To capture site-level heterogeneity, we introduce a soft-thresholding debiasing procedure that adaptively shrinks local estimates toward the global fit. Through extensive simulations across varying event rates and site-size distributions, our method demonstrates performance comparable to pooled Cox and the one-shot Optimal Distributed Aggregation (ODAC) models, with added flexibility to capture non-proportional hazards. We applied the method to pediatric obesity data from the Chicago Area Patient-Centered Outcomes Research Network (CAPriCORN), which comprises four sites and a total of 45,865 patients. The federated pseudo-value regression model produced estimates of both time-constant and time-varying hazard ratios that closely aligned with those obtained from the pooled analysis, demonstrating its utility as a robust and privacy-preserving alternative for collaborative survival research. To further address potential heterogeneity across sites, we applied a covariate-wise debiasing algorithm, enabling site-level adjustments while preserving consistency with the global model.
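The basic building block, jackknife pseudo-observations of the survival probability at a landmark time computed from a Kaplan-Meier estimator, is sketched below on simulated data; these values can then be passed to a (renewable) GLM. The sequential federated updating and the soft-thresholding debiasing step are not reproduced here.

```python
# Jackknife pseudo-observations of S(t0) from a Kaplan-Meier estimator; simulated data.
import numpy as np

def km_survival(time, event, t0):
    # Kaplan-Meier estimate of S(t0), one subject removed from the risk set at a time
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time)
    surv = 1.0
    for ti, di in zip(time, event):
        if ti > t0:
            break
        if di:
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return surv

def pseudo_observations(time, event, t0):
    n = len(time)
    s_full = km_survival(time, event, t0)
    mask = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):                       # leave-one-out ("jackknife") estimates
        mask[i] = False
        pseudo[i] = n * s_full - (n - 1) * km_survival(time[mask], event[mask], t0)
        mask[i] = True
    return pseudo

rng = np.random.default_rng(9)
n = 200
latent = rng.exponential(5.0, n)
censor = rng.exponential(7.0, n)
time = np.minimum(latent, censor)
event = latent <= censor
po = pseudo_observations(time, event, t0=3.0)
print(po.mean(), km_survival(time, event, 3.0))   # pseudo-obs mean tracks the KM estimate
```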
- [32] arXiv:2507.20560 [pdf, other]
-
Title: Statistical Inference for Differentially Private Stochastic Gradient Descent
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Privacy preservation in machine learning, particularly through Differentially Private Stochastic Gradient Descent (DP-SGD), is critical for sensitive data analysis. However, existing statistical inference methods for SGD predominantly focus on cyclic subsampling, while DP-SGD requires randomized subsampling. This paper first bridges this gap by establishing the asymptotic properties of SGD under the randomized rule and extending these results to DP-SGD. For the output of DP-SGD, we show that the asymptotic variance decomposes into statistical, sampling, and privacy-induced components. Two methods are proposed for constructing valid confidence intervals: the plug-in method and the random scaling method. We also perform extensive numerical analysis, which shows that the proposed confidence intervals achieve nominal coverage rates while maintaining privacy.
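To fix the setting, the sketch below shows a schematic DP-SGD iteration for least squares with randomized subsampling: per-sample gradients are clipped and Gaussian noise is added before the update. The clipping norm, noise scale, step-size schedule, and Polyak-Ruppert averaging are illustrative choices; the plug-in and random scaling interval constructions themselves are not shown.

```python
# Schematic DP-SGD update for least squares: randomized subsampling, per-sample gradient
# clipping, Gaussian noise. Hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(10)
n, p = 2000, 5
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
y = X @ theta_true + rng.normal(size=n)

clip, sigma, batch = 1.0, 0.8, 64
theta = np.zeros(p)
iterates = []
for t in range(1, 3001):
    idx = rng.choice(n, size=batch, replace=False)          # randomized subsampling
    grads = (X[idx] @ theta - y[idx])[:, None] * X[idx]     # per-sample gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip each gradient
    noise = rng.normal(scale=sigma * clip, size=p)          # Gaussian mechanism
    g = grads.sum(axis=0) / batch + noise / batch
    theta = theta - (0.5 / t**0.51) * g                     # Robbins-Monro step sizes
    iterates.append(theta.copy())

theta_bar = np.mean(iterates, axis=0)                        # Polyak-Ruppert average
print(theta_bar.round(3))
```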
- [33] arXiv:2507.20598 [pdf, other]
-
Title: Nullstrap-DE: A General Framework for Calibrating FDR and Preserving Power in DE Methods, with Applications to DESeq2 and edgeR
Subjects: Methodology (stat.ME); Genomics (q-bio.GN); Applications (stat.AP)
Differential expression (DE) analysis is a key task in RNA-seq studies, aiming to identify genes with expression differences across conditions. A central challenge is balancing false discovery rate (FDR) control with statistical power. Parametric methods such as DESeq2 and edgeR achieve high power by modeling gene-level counts using negative binomial distributions and applying empirical Bayes shrinkage. However, these methods may suffer from FDR inflation when model assumptions are mildly violated, especially in large-sample settings. In contrast, non-parametric tests like Wilcoxon offer more robust FDR control but often lack power and do not support covariate adjustment. We propose Nullstrap-DE, a general add-on framework that combines the strengths of both approaches. Designed to augment tools like DESeq2 and edgeR, Nullstrap-DE calibrates FDR while preserving power, without modifying the original method's implementation. It generates synthetic null data from a model fitted under the gene-specific null (no DE), applies the same test statistic to both observed and synthetic data, and derives a threshold that satisfies the target FDR level. We show theoretically that Nullstrap-DE asymptotically controls FDR while maintaining power consistency. Simulations confirm that it achieves reliable FDR control and high power across diverse settings, where DESeq2, edgeR, or Wilcoxon often show inflated FDR or low power. Applications to real datasets show that Nullstrap-DE enhances statistical rigor and identifies biologically meaningful genes.
- [34] arXiv:2507.20609 [pdf, other]
-
Title: Independence Testing for Mixed Data
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO)
We consider the problem of testing independence in mixed-type data that combine count variables with positive, absolutely continuous variables. We first introduce two distinct classes of test statistics in the bivariate setting, designed to test independence between the components of a bivariate mixed-type vector. These statistics are then extended to the multivariate context to accommodate: (i) testing independence between vectors of different types and possibly different dimensions, and (ii) testing total independence among all components of vectors with different types. The construction is based on the recently introduced Baringhaus-Gaigall transformation, which characterizes the joint distribution of such data. We establish the asymptotic properties of the resulting tests and, through an extensive power study, demonstrate that the proposed approach is both competitive and flexible.
- [35] arXiv:2507.20799 [pdf, other]
-
Title: Permutation Tests Based on the Copula-Graphic Estimator and Their Use for Survival Tree Construction
Subjects: Methodology (stat.ME)
Survival trees are popular alternatives to Cox or Aalen regression models that offer both modelling flexibility and graphical interpretability. This paper introduces a new algorithm for survival trees that relaxes the assumption of independent censoring. To this end, we use the copula-graphic estimator to estimate survival functions. This allows us to flexibly specify the shape and strength of the dependence between survival and censoring times within survival trees. For splitting, we present a permutation test for the null hypothesis of equal survival. Our test statistic consists of the integrated absolute distance between the groups' copula-graphic estimators. A first simulation study shows good type I error and power behavior of the new test. We thereby assess simulation settings with various group sizes, censoring percentages and grades of dependence generated by Clayton and Frank copulas. Using this test as a splitting criterion, a second simulation study examines the performance of the resulting trees and compares it with that of the usual logrank-based tree. Lastly, the tree algorithm is applied to real-world clinical trial data.
- [36] arXiv:2507.20941 [pdf, html, other]
-
Title: Multivariate Conformal Prediction via Conformalized Gaussian Scoring
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
While achieving exact conditional coverage in conformal prediction is unattainable without making strong, untestable regularity assumptions, the promise of conformal prediction hinges on finding approximations to conditional guarantees that are realizable in practice. A promising direction for obtaining conditional dependence for conformal sets, in particular capturing heteroskedasticity, is through estimating the conditional density $\mathbb{P}_{Y|X}$ and conformalizing its level sets. Previous work in this vein has focused on nonconformity scores based on the empirical cumulative distribution function (CDF). Such scores are, however, computationally costly, typically requiring expensive sampling methods. To avoid the need for sampling, we observe that the CDF-based score reduces to a Mahalanobis distance in the case of Gaussian scores, yielding a closed-form expression that can be directly conformalized. Moreover, the use of a Gaussian-based score opens the door to a number of extensions of the basic conformal method; in particular, we show how to construct conformal sets with missing output values, refine conformal sets as partial information about $Y$ becomes available, and construct conformal sets on transformations of the output space. Finally, empirical results indicate that our approach produces conformal sets that more closely approximate conditional coverage in multivariate settings compared to alternative methods.
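A minimal split-conformal sketch of the Gaussian-score idea follows: the nonconformity score is the Mahalanobis distance of the multivariate response from a predicted mean under an estimated covariance, and the conformal quantile of the calibration scores defines an ellipsoidal prediction set. The per-output linear mean model and the global residual covariance used here are illustrative simplifications, not the paper's estimator.

```python
# Split-conformal prediction with a Gaussian (Mahalanobis) nonconformity score for a
# bivariate response; mean/covariance model is deliberately simple for illustration.
import numpy as np

rng = np.random.default_rng(11)
n, p, d = 1200, 4, 2                                   # d-dimensional output
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, d))
Y = X @ B + rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=n)

tr, cal = slice(0, 600), slice(600, 1200)              # split: model fitting vs. calibration
coef, *_ = np.linalg.lstsq(X[tr], Y[tr], rcond=None)   # mean model mu(x) = x @ coef
resid = Y[tr] - X[tr] @ coef
Sigma = np.cov(resid, rowvar=False)                    # plug-in covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(y, mu):
    diff = y - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff))

scores = mahalanobis(Y[cal], X[cal] @ coef)            # calibration nonconformity scores
alpha = 0.1
k = int(np.ceil((1 - alpha) * (scores.size + 1))) - 1  # conformal quantile index
q = np.sort(scores)[k]
# Prediction set for a new x: the ellipsoid { y : mahalanobis(y, mu(x)) <= q }
print("conformal Mahalanobis radius:", round(float(q), 3))
```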
- [37] arXiv:2507.20944 [pdf, html, other]
-
Title: A multivariate spatial model for ordinal survey-based data
Subjects: Methodology (stat.ME)
Health surveys provide valuable information for monitoring population health, identifying risk factors and informing public health policies. Most of the questions included are coded as ordinal variables and organized into thematic blocks. Accordingly, multivariate modeling provides a natural framework for considering these variables as true groups, thereby accounting for potential dependencies among the responses within each block. In this paper, we propose a multivariate spatial analysis of ordinal survey-based data. This multivariate approach enables the joint analysis of sets of ordinal responses that are likely to be correlated, accounting for individual-level effects, while simultaneously improving the estimation of the geographical patterns for each variable and capturing their interdependencies. We apply this methodology to describe the spatial distribution of several mental health indicators from the Health Survey of the Region of Valencia (Spain) for the year 2022. Specifically, we analyze the block of questions from the 12-item General Health Questionnaire included in the survey.
- [38] arXiv:2507.20975 [pdf, html, other]
-
Title: Locally Adaptive Conformal Inference for Operator Models
Comments: 9 pages, 2 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Operator models are regression algorithms for functional data and have become a key tool for emulating large-scale dynamical systems. Recent advances in deep neural operators have dramatically improved the accuracy and scalability of operator modeling, but lack an inherent notion of predictive uncertainty. We introduce Local Spectral Conformal Inference (LSCI), a new framework for locally adaptive, distribution-free uncertainty quantification for neural operator models. LSCI uses projection-based depth scoring and localized conformal inference to generate function-valued prediction sets with statistical guarantees. We prove approximate finite-sample marginal coverage under local exchangeability, and demonstrate significant gains in adaptivity and coverage across synthetic and real-world operator learning tasks.
- [39] arXiv:2507.21022 [pdf, html, other]
-
Title: A Generalized Cramér-Rao Bound Using Information GeometryComments: Presented at the IEEE International Symposium on Information Theory (ISIT 2025)Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Other Statistics (stat.OT)
In information geometry, statistical models are considered as differentiable manifolds, where each probability distribution represents a unique point on the manifold. A Riemannian metric can be systematically obtained from a divergence function using Eguchi's theory (1992); the well-known Fisher-Rao metric is obtained from the Kullback-Leibler (KL) divergence. The geometric derivation of the classical Cramér-Rao Lower Bound (CRLB) by Amari and Nagaoka (2000) is based on this metric. In this paper, we study a Riemannian metric obtained by applying Eguchi's theory to the Basu-Harris-Hjort-Jones (BHHJ) divergence (1998) and derive a generalized Cramér-Rao bound using Amari-Nagaoka's approach. There are potential applications for this bound in robust estimation.
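For reference, the BHHJ divergence mentioned above is the density power divergence of Basu, Harris, Hjort and Jones (1998); in the commonly used notation (true density $g$, model density $f$, tuning parameter $\alpha > 0$):
$$ d_{\alpha}(g, f) \;=\; \int \Big\{ f^{1+\alpha}(x) - \Big(1 + \tfrac{1}{\alpha}\Big) g(x)\, f^{\alpha}(x) + \tfrac{1}{\alpha}\, g^{1+\alpha}(x) \Big\}\, dx, $$
which recovers the Kullback-Leibler divergence in the limit $\alpha \to 0$. Applying Eguchi's construction to $d_\alpha$ in place of KL is what yields the alternative Riemannian metric studied in the paper.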
New submissions (showing 39 of 39 entries)
- [40] arXiv:2507.19531 (cross-list from eess.SY) [pdf, html, other]
-
Title: A safety governor for learning explicit MPC controllers from dataSubjects: Systems and Control (eess.SY); Methodology (stat.ME)
We use neural networks (NNs) to approximate model predictive control (MPC) laws. We propose a novel learning-based explicit MPC structure, reformulated into a dual-mode scheme over the maximal constrained feasible set. The scheme ensures that the learning-based explicit MPC reduces to linear feedback control once the state enters a neighborhood of the origin. We construct a safety governor to guarantee that the learning-based explicit MPC satisfies all state and input constraints. Compared to existing approaches, ours is computationally easier to implement, even for high-dimensional systems. We give a proof of recursive feasibility for the safety governor and demonstrate the approach on numerical examples.
- [41] arXiv:2507.19539 (cross-list from cs.LG) [pdf, html, other]
-
Title: Swift-Sarsa: Fast and Robust Linear ControlComments: Presented at RLDM 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Javed, Sharifnassab, and Sutton (2024) introduced a new algorithm for TD learning -- SwiftTD -- that augments True Online TD($\lambda$) with step-size optimization, a bound on the effective learning rate, and step-size decay. In their experiments SwiftTD outperformed True Online TD($\lambda$) and TD($\lambda$) on a variety of prediction tasks derived from Atari games, and its performance was robust to the choice of hyper-parameters. In this extended abstract we extend SwiftTD to work for control problems. We combine the key ideas behind SwiftTD with True Online Sarsa($\lambda$) to develop an on-policy reinforcement learning algorithm called $\textit{Swift-Sarsa}$.
We propose a simple benchmark for linear on-policy control called the $\textit{operant conditioning benchmark}$. The key challenge in the operant conditioning benchmark is that a very small subset of input signals are relevant for decision making. The majority of the signals are noise sampled from a non-stationary distribution. To learn effectively, the agent must learn to differentiate between the relevant signals and the noisy signals, and minimize prediction errors by assigning credit to the weight parameters associated with the relevant signals.
Swift-Sarsa, when applied to the operant conditioning benchmark, learned to assign credit to the relevant signals without any prior knowledge of the structure of the problem. It opens the door for solution methods that learn representations by searching over hundreds of millions of features in parallel without performance degradation due to noisy or bad features.
- [42] arXiv:2507.19603 (cross-list from econ.EM) [pdf, html, other]
-
Title: Uniform Critical Values for Likelihood Ratio Tests in Boundary ProblemsSubjects: Econometrics (econ.EM); Methodology (stat.ME)
Limit distributions of likelihood ratio statistics are well-known to be discontinuous in the presence of nuisance parameters at the boundary of the parameter space, which leads to size distortions when standard critical values are used for testing. In this paper, we propose a new and simple way of constructing critical values that yields uniformly correct asymptotic size, regardless of whether nuisance parameters are at, near or far from the boundary of the parameter space. Importantly, the proposed critical values are trivial to compute and at the same time provide powerful tests in most settings. In comparison to existing size-correction methods, the new approach exploits the monotonicity of the two components of the limiting distribution of the likelihood ratio statistic, in conjunction with rectangular confidence sets for the nuisance parameters, to gain computational tractability. Uniform validity is established for likelihood ratio tests based on the new critical values, and we provide illustrations of their construction in two key examples: (i) testing a coefficient of interest in the classical linear regression model with non-negativity constraints on control coefficients, and (ii) testing for the presence of exogenous variables in autoregressive conditional heteroskedasticity (ARCH) models with exogenous regressors. Simulations confirm that the tests have desirable size and power properties. A brief empirical illustration demonstrates the usefulness of our proposed test in relation to testing for spill-overs and ARCH effects.
- [43] arXiv:2507.19627 (cross-list from cs.LG) [pdf, html, other]
-
Title: Federated Calculation of the Free-Support Transportation Barycenter by Single-Loop Dual DecompositionSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We propose an efficient federated dual decomposition algorithm for calculating the Wasserstein barycenter of several distributions, including choosing the support of the solution. The algorithm does not access local data and uses only highly aggregated information. It also does not require repeated solutions to mass transportation problems. Because of the absence of any matrix-vector operations, the algorithm exhibits a very low complexity of each iteration and significant scalability. We illustrate its virtues and compare it to the state-of-the-art methods on several examples of mixture models.
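For orientation, the underlying problem is the free-support Wasserstein barycenter: given distributions $\mu_1, \dots, \mu_N$ and weights $w_i \ge 0$ with $\sum_i w_i = 1$, find
$$ \nu^{\star} \;\in\; \arg\min_{\nu \,=\, \sum_{j=1}^{m} p_j \delta_{x_j}} \;\sum_{i=1}^{N} w_i\, W_2^2(\mu_i, \nu), $$
where both the weights $p_j$ and the support points $x_j$ of $\nu$ are chosen. The federated single-loop dual decomposition itself is the paper's contribution and is not reproduced here.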
- [44] arXiv:2507.19654 (cross-list from econ.EM) [pdf, html, other]
-
Title: Binary Classification with the Maximum Score Model and Linear ProgrammingSubjects: Econometrics (econ.EM); Methodology (stat.ME)
This paper presents a computationally efficient method for binary classification using Manski's (1975,1985) maximum score model when covariates are discretely distributed and parameters are partially but not point identified. We establish conditions under which it is minimax optimal to allow for either non-classification or random classification and derive finite-sample and asymptotic lower bounds on the probability of correct classification. We also describe an extension of our method to continuous covariates. Our approach avoids the computational difficulty of maximum score estimation by reformulating the problem as two linear programs. Compared to parametric and nonparametric methods, our method balances extrapolation ability with minimal distributional assumptions. Monte Carlo simulations and empirical applications demonstrate its effectiveness and practical relevance.
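For context, the classical maximum score setup (not the paper's linear-programming reformulation) posits a binary outcome
$$ y_i \;=\; \mathbf{1}\{x_i^{\top}\beta + \varepsilon_i \ge 0\}, \qquad \operatorname{Med}(\varepsilon_i \mid x_i) = 0, $$
and Manski's estimator maximizes the score
$$ \hat{\beta} \;\in\; \arg\max_{\beta} \sum_{i=1}^{n} (2 y_i - 1)\, \mathbf{1}\{x_i^{\top}\beta \ge 0\}, $$
a nonconvex, combinatorial objective. The computational appeal of the proposed method is that classification can instead be carried out by solving two linear programs.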
- [45] arXiv:2507.19672 (cross-list from cs.AI) [pdf, html, other]
-
Title: Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging ChallengesHaoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping MaComments: 119 pages, 10 figures, 7 tablesSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Due to the remarkable capabilities and growing impact of large language models (LLMs), they have been deeply integrated into many aspects of society. Thus, ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ), highlighting their approaches to balancing quality and efficiency. We review existing evaluation frameworks and benchmarking datasets, emphasizing limitations such as reward misspecification, distributional robustness, and scalable oversight. We summarize strategies adopted by leading AI labs to illustrate the current state of practice. We conclude by outlining open problems in oversight, value pluralism, robustness, and continuous alignment. This survey aims to inform both researchers and practitioners navigating the evolving landscape of LLM alignment.
- [46] arXiv:2507.19680 (cross-list from cs.LG) [pdf, html, other]
-
Title: Feature learning is decoupled from generalization in high capacity neural networksNiclas Alexander Göring, Charles London, Abdurrahman Hadi Erturk, Chris Mingard, Yoonsoo Nam, Ard A. LouisSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks outperform kernel methods, sometimes by orders of magnitude, e.g. on staircase functions. This advantage stems from the ability of neural networks to learn features, adapting their hidden representations to better capture the data. We introduce a concept we call feature quality to measure this performance improvement. We examine existing theories of feature learning and demonstrate empirically that they primarily assess the strength of feature learning, rather than the quality of the learned features themselves. Consequently, current theories of feature learning do not provide a sufficient foundation for developing theories of neural network generalization.
- [47] arXiv:2507.19701 (cross-list from cs.RO) [pdf, html, other]
-
Title: PhysVarMix: Physics-Informed Variational Mixture Model for Multi-Modal Trajectory PredictionSubjects: Robotics (cs.RO); Machine Learning (stat.ML)
Accurate prediction of future agent trajectories is a critical challenge for ensuring safe and efficient autonomous navigation, particularly in complex urban environments characterized by multiple plausible future scenarios. In this paper, we present a novel hybrid approach that integrates learning-based methods with physics-based constraints to address the multi-modality inherent in trajectory prediction. Our method employs a variational Bayesian mixture model to effectively capture the diverse range of potential future behaviors, moving beyond traditional unimodal assumptions. Unlike prior approaches that predominantly treat trajectory prediction as a data-driven regression task, our framework incorporates physical realism through sector-specific boundary conditions and Model Predictive Control (MPC)-based smoothing. These constraints ensure that predicted trajectories are not only data-consistent but also physically plausible, adhering to kinematic and dynamic principles. Furthermore, our method produces interpretable and diverse trajectory predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems. We evaluate our approach on two benchmark datasets, demonstrating superior performance compared to existing methods. Comprehensive ablation studies validate the contributions of each component and highlight their synergistic impact on prediction accuracy and reliability. By balancing data-driven insights with physics-informed constraints, our approach offers a robust and scalable solution for navigating the uncertainties of real-world urban environments.
- [48] arXiv:2507.19873 (cross-list from cs.LG) [pdf, html, other]
-
Title: RestoreAI - Pattern-based Risk Estimation Of Remaining ExplosivesSubjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Landmine removal is a slow, resource-intensive process affecting over 60 countries. While AI has been proposed to enhance explosive ordnance (EO) detection, existing methods primarily focus on object recognition, with limited attention to prediction of landmine risk based on spatial pattern information. This work aims to answer the following research question: How can AI be used to predict landmine risk from landmine patterns to improve clearance time efficiency? To that effect, we introduce RestoreAI, an AI system for pattern-based risk estimation of remaining explosives. RestoreAI is the first AI system that leverages landmine patterns for risk prediction, improving the accuracy of estimating the residual risk of missing EO prior to land release. We particularly focus on the implementation of three instances of RestoreAI, respectively, linear, curved and Bayesian pattern deminers. First, the linear pattern deminer uses linear landmine patterns from a principal component analysis (PCA) for the landmine risk prediction. Second, the curved pattern deminer uses curved landmine patterns from principal curves. Finally, the Bayesian pattern deminer incorporates prior expert knowledge by using a Bayesian pattern risk prediction. Evaluated on real-world landmine data, RestoreAI significantly boosts clearance efficiency. The top-performing pattern-based deminers achieved a 14.37 percentage point increase in the average share of cleared landmines per timestep and required 24.45% less time than the best baseline deminer to locate all landmines. Interestingly, linear and curved pattern deminers showed no significant performance difference, suggesting that more efficient linear patterns are a viable option for risk prediction.
- [49] arXiv:2507.19898 (cross-list from cs.HC) [pdf, html, other]
-
Title: TS-Insight: Visualizing Thompson Sampling for Verification and XAIComments: Accepted as a poster at IEEE VIS 2025 ("TS-Insight: Visual Fingerprinting of Multi-Armed Bandits"). Open-source tool available at this https URLSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Thompson Sampling (TS) and its variants are powerful Multi-Armed Bandit algorithms used to balance exploration and exploitation strategies in active learning. Yet, their probabilistic nature often turns them into a ``black box'', hindering debugging and trust. We introduce TS-Insight, a visual analytics tool explicitly designed to shed light on the internal decision mechanisms of Thompson Sampling-based algorithms, for model developers. It comprises multiple plots, tracing for each arm the evolving posteriors, evidence counts, and sampling outcomes, enabling the verification, diagnosis, and explainability of exploration/exploitation dynamics. This tool aims at fostering trust and facilitating effective debugging and deployment in complex binary decision-making scenarios, especially in sensitive domains requiring interpretable decision-making.
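For readers unfamiliar with the underlying algorithm, the quantities such a tool traces (per-arm posteriors, evidence counts, sampling outcomes) come from updates of the following form. This is a minimal, generic Bernoulli Thompson Sampling sketch with Beta(1,1) priors, not TS-Insight's own code; `reward_fns` is a hypothetical list of callables returning 0/1 rewards.

```python
import numpy as np

def thompson_bernoulli(reward_fns, horizon, seed=0):
    """Generic Bernoulli Thompson Sampling; logs the per-arm Beta posteriors that
    a visual-analytics tool could trace over time."""
    rng = np.random.default_rng(seed)
    k = len(reward_fns)
    alpha, beta = np.ones(k), np.ones(k)     # Beta posterior parameters per arm
    history = []
    for t in range(horizon):
        theta = rng.beta(alpha, beta)        # one posterior sample per arm
        a = int(np.argmax(theta))            # play the arm with the largest sample
        r = reward_fns[a]()                  # observe a 0/1 reward
        alpha[a] += r
        beta[a] += 1 - r                     # conjugate posterior update
        history.append((t, a, r, alpha.copy(), beta.copy()))
    return history
```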
- [50] arXiv:2507.19968 (cross-list from cs.LG) [pdf, html, other]
-
Title: Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network TrainingComments: 8 pages, 2 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
First-order optimization methods, such as SGD and Adam, are widely used for training large-scale deep neural networks due to their computational efficiency and robust performance. However, relying solely on gradient information, these methods often struggle to navigate complex loss landscapes with flat regions, plateaus, and saddle points. Second-order methods, which use curvature information from the Hessian matrix, can address these challenges but are computationally infeasible for large models. The Dimer method, a first-order technique that constructs two closely spaced points to probe the local geometry of a potential energy surface, efficiently estimates curvature using only gradient information. Inspired by its use in molecular dynamics simulations for locating saddle points, we propose Dimer-Enhanced Optimization (DEO), a novel framework to escape saddle points in neural network training. DEO adapts the Dimer method to explore a broader region of the loss landscape, approximating the Hessian's smallest eigenvector without computing the full matrix. By periodically projecting the gradient onto the subspace orthogonal to the minimum curvature direction, DEO guides the optimizer away from saddle points and flat regions, enhancing training efficiency with non-stepwise updates. Preliminary experiments on a Transformer toy model show DEO achieves competitive performance compared to standard first-order methods, improving navigation of complex loss landscapes. Our work repurposes physics-inspired, first-order curvature estimation to enhance neural network training in high-dimensional spaces.
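A rough first-order sketch of the Dimer idea, following the abstract's description (two nearby gradient evaluations approximate a Hessian-vector product, and the gradient is then projected away from the estimated minimum-curvature direction). This is an illustrative reading, not the authors' DEO update; `grad_fn` and the step constants are assumptions, and parameters are taken as a flat vector.

```python
import numpy as np

def dimer_projected_step(params, grad_fn, lr=1e-2, eps=1e-3, v_lr=0.1, seed=0):
    """One illustrative optimization step with a Dimer-style curvature probe."""
    g = grad_fn(params)
    v = np.random.default_rng(seed).standard_normal(params.shape)
    v /= np.linalg.norm(v)
    # finite-difference Hessian-vector product from two gradient calls (the "dimer")
    hv = (grad_fn(params + eps * v) - g) / eps
    # nudge v toward the minimum-curvature direction and renormalize
    v -= v_lr * (hv - (hv @ v) * v)
    v /= np.linalg.norm(v)
    # project the gradient onto the subspace orthogonal to v before descending
    g_proj = g - (g @ v) * v
    return params - lr * g_proj
```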
- [51] arXiv:2507.20039 (cross-list from q-fin.PM) [pdf, html, other]
-
Title: Dependency Network-Based Portfolio Design with Forecasting and VaR ConstraintsSubjects: Portfolio Management (q-fin.PM); Econometrics (econ.EM); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
This study proposes a novel portfolio optimization framework that integrates statistical social network analysis with time series forecasting and risk management. Using daily stock data from the S&P 500 (2020-2024), we construct dependency networks via Vector Autoregression (VAR) and Forecast Error Variance Decomposition (FEVD), transforming influence relationships into a cost-based network. Specifically, FEVD decomposes the VAR's forecast error variance to quantify how much each stock's shocks contribute to another stock's uncertainty; we invert this influence information to form the edge weights of our cost-based network. By applying the Minimum Spanning Tree (MST) algorithm, we extract the core inter-stock structure and identify central stocks through degree centrality. A dynamic portfolio is constructed using the top-ranked stocks, with capital allocated based on Value at Risk (VaR). To refine stock selection, we incorporate forecasts from ARIMA and Neural Network Autoregressive (NNAR) models. Trading simulations over a one-year period demonstrate that the MST-based strategies outperform a buy-and-hold benchmark, with the tuned NNAR-enhanced strategy achieving a 63.74% return versus 18.00% for the benchmark. Our results highlight the potential of combining network structures, predictive modeling, and risk metrics to improve adaptive financial decision-making.
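The network-extraction step described above can be sketched as follows; the FEVD-based cost matrix is assumed given (hypothetical input), and only the MST and degree-centrality ranking are shown.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_central_stocks(cost, names, top_k=10):
    """Extract an MST from a (strictly positive) cost matrix and rank nodes by
    degree centrality on the resulting tree."""
    mst = minimum_spanning_tree(cost).toarray()   # note: zero entries are treated as "no edge"
    adj = (mst + mst.T) > 0                       # symmetrize the tree edges
    degree = adj.sum(axis=1)
    order = np.argsort(degree)[::-1][:top_k]
    return [(names[i], int(degree[i])) for i in order]
```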
- [52] arXiv:2507.20048 (cross-list from cs.LG) [pdf, html, other]
-
Title: Irredundant k-Fold Cross-ValidationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
In traditional k-fold cross-validation, each instance is used ($k\!-\!1$) times for training and once for testing, leading to redundancy that lets many instances disproportionately influence the learning phase. We introduce Irredundant $k$--fold cross-validation, a novel method that guarantees each instance is used exactly once for training and once for testing across the entire validation procedure. This approach ensures a more balanced utilization of the dataset, mitigates overfitting due to instance repetition, and enables sharper distinctions in comparative model analysis. The method preserves stratification and remains model-agnostic, i.e., compatible with any classifier. Experimental results demonstrate that it delivers consistent performance estimates across diverse datasets --comparable to $k$--fold cross-validation-- while providing less optimistic variance estimates because training partitions are non-overlapping, and significantly reducing the overall computational cost.
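One plausible instantiation of the stated property (each instance used exactly once for training and once for testing across the whole procedure) is to cycle the folds, training on fold $i$ and testing on fold $(i+1) \bmod k$. This is only an illustrative reading of the abstract, not necessarily the paper's algorithm, and it ignores stratification.

```python
import numpy as np

def irredundant_folds(n, k, seed=0):
    """Illustrative fold pairing: each fold (hence each instance) appears exactly
    once as training data and exactly once as test data across the k rounds."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(folds[i], folds[(i + 1) % k]) for i in range(k)]  # (train_idx, test_idx) pairs
```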
- [53] arXiv:2507.20068 (cross-list from cs.LG) [pdf, html, other]
-
Title: PERRY: Policy Evaluation with Confidence Intervals using Auxiliary DataAishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma BrunskillSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state $V^{\pi}(s_0)$-- such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.
- [54] arXiv:2507.20072 (cross-list from cs.LG) [pdf, html, other]
-
Title: Sparse Equation Matching: A Derivative-Free Learning for General-Order Dynamical SystemsSubjects: Machine Learning (cs.LG); Methodology (stat.ME)
Equation discovery is a fundamental learning task for uncovering the underlying dynamics of complex systems, with wide-ranging applications in areas such as brain connectivity analysis, climate modeling, gene regulation, and physical system simulation. However, many existing approaches rely on accurate derivative estimation and are limited to first-order dynamical systems, restricting their applicability to real-world scenarios. In this work, we propose sparse equation matching (SEM), a unified framework that encompasses several existing equation discovery methods under a common formulation. SEM introduces an integral-based sparse regression method using Green's functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems. The effectiveness of SEM is demonstrated through extensive simulations, benchmarking its performance against derivative-based approaches. We then apply SEM to electroencephalographic (EEG) data recorded during multiple oculomotor tasks, collected from 52 participants in a brain-computer interface experiment. Our method identifies active brain regions across participants and reveals task-specific connectivity patterns. These findings offer valuable insights into brain connectivity and the underlying neural mechanisms.
- [55] arXiv:2507.20088 (cross-list from cs.LG) [pdf, html, other]
-
Title: Feed-anywhere ANN (I) Steady Discrete $\to$ Diffusing on Graph Hidden StatesComments: 11 pages, 1 algorithmSubjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Optimization and Control (math.OC); Machine Learning (stat.ML)
We propose a novel framework for learning hidden graph structures from data using geometric analysis and nonlinear dynamics. Our approach: (1) Defines discrete Sobolev spaces on graphs for scalar/vector fields, establishing key functional properties; (2) Introduces gauge-equivalent nonlinear Schrödinger and Landau--Lifshitz dynamics with provable stable stationary solutions smoothly dependent on input data and graph weights; (3) Develops a stochastic gradient algorithm over graph moduli spaces with sparsity regularization. Theoretically, we guarantee: topological correctness (homology recovery), metric convergence (Gromov--Hausdorff), and efficient search space utilization. Our dynamics-based model achieves stronger generalization bounds than standard neural networks, with complexity dependent on the data manifold's topology.
- [56] arXiv:2507.20089 (cross-list from cs.LG) [pdf, html, other]
-
Title: Meta Fusion: A Unified Framework For Multimodality Fusion with Mutual LearningSubjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Developing effective multimodal data fusion strategies has become increasingly essential for improving the predictive power of statistical machine learning methods across a wide range of applications, from autonomous driving to medical diagnosis. Traditional fusion methods, including early, intermediate, and late fusion, integrate data at different stages, each offering distinct advantages and limitations. In this paper, we introduce Meta Fusion, a flexible and principled framework that unifies these existing strategies as special cases. Motivated by deep mutual learning and ensemble learning, Meta Fusion constructs a cohort of models based on various combinations of latent representations across modalities, and further boosts predictive performance through soft information sharing within the cohort. Our approach is model-agnostic in learning the latent representations, allowing it to flexibly adapt to the unique characteristics of each modality. Theoretically, our soft information sharing mechanism reduces the generalization error. Empirically, Meta Fusion consistently outperforms conventional fusion strategies in extensive simulation studies. We further validate our approach on real-world applications, including Alzheimer's disease detection and neural decoding.
- [57] arXiv:2507.20108 (cross-list from cs.LG) [pdf, other]
-
Title: Graded Transformers: A Symbolic-Geometric Approach to Structured LearningSubjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
We introduce the Graded Transformer framework, a novel class of sequence models that embeds algebraic inductive biases through grading transformations on vector spaces. Extending the theory of Graded Neural Networks (GNNs), we propose two architectures: the Linearly Graded Transformer (LGT) and the Exponentially Graded Transformer (EGT). These models apply parameterized scaling operators, governed by fixed or learnable grading tuples (and, for EGT, exponential factors), to infuse hierarchical structure into attention and representation layers, enhancing efficiency for structured data.
We derive rigorous theoretical guarantees, including universal approximation theorems for continuous and Sobolev functions, reduced sample complexity via effective VC dimension bounds, Lipschitz continuity of graded operations, and robustness to adversarial perturbations. A graded loss function ensures gradient stability and alignment with domain priors during optimization. By treating grades as differentiable parameters, the framework enables adaptive feature prioritization, overcoming limitations of fixed grades in prior work.
The Graded Transformer holds transformative potential for hierarchical learning and neurosymbolic reasoning, with applications spanning algebraic geometry (e.g., moduli spaces and zeta functions), physics (e.g., multiscale simulations), natural language processing (e.g., syntactic parsing), biological sequence analysis (e.g., variant prediction), and emerging areas like graph neural networks and financial modeling. This work advances structured deep learning by fusing geometric and algebraic principles with attention mechanisms, offering a mathematically grounded alternative to data-driven models and paving the way for interpretable, efficient systems in complex domains.
- [58] arXiv:2507.20112 (cross-list from cs.LG) [pdf, html, other]
-
Title: Online Learning with Probing for Sequential User-Centric SelectionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We formalize sequential decision-making with information acquisition as the probing-augmented user-centric selection (PUCS) framework, where a learner first probes a subset of arms to obtain side information on resources and rewards, and then assigns $K$ plays to $M$ arms. PUCS covers applications such as ridesharing, wireless scheduling, and content recommendation, in which both resources and payoffs are initially unknown and probing is costly. For the offline setting with known distributions, we present a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting with unknown distributions, we introduce OLPA, a stochastic combinatorial bandit algorithm that achieves a regret bound $\mathcal{O}(\sqrt{T} + \ln^{2} T)$. We also prove a lower bound $\Omega(\sqrt{T})$, showing that the upper bound is tight up to logarithmic factors. Experiments on real-world data demonstrate the effectiveness of our solutions.
- [59] arXiv:2507.20126 (cross-list from cs.CV) [pdf, html, other]
-
Title: An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Fragmentation AssessmentSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
We introduce an end-to-end pipeline that leverages a fine-tuned YOLO12l-seg model -- trained on over 500 annotated post-blast images -- to deliver real-time instance segmentation (Box mAP@0.5 ~ 0.769, Mask mAP@0.5 ~ 0.800 at ~ 15 FPS). High-fidelity masks are converted into normalized 3D coordinates, from which we extract multi-metric spatial descriptors: principal component directions, kernel density hotspots, size-depth regression, and Delaunay edge statistics. We present four representative examples to illustrate key fragmentation patterns. Experimental results confirm the framework's accuracy, robustness to small-object crowding, and feasibility for rapid, automated blast-effect assessment in field conditions.
- [60] arXiv:2507.20157 (cross-list from cs.IT) [pdf, html, other]
-
Title: Sparse Regression Codes for Secret Key Agreement: Achieving Strong Secrecy and Near-Optimal Rates for Gaussian SourcesComments: 15 pages, 5 figuresSubjects: Information Theory (cs.IT); Probability (math.PR); Applications (stat.AP)
Secret key agreement from correlated physical layer observations is a cornerstone of information-theoretic security. This paper proposes and rigorously analyzes a complete, constructive protocol for secret key agreement from Gaussian sources using Sparse Regression Codes (SPARCs). Our protocol systematically leverages the known optimality of SPARCs for both rate-distortion and Wyner-Ziv (WZ) coding, facilitated by their inherent nested structure. The primary contribution of this work is a comprehensive end-to-end analysis demonstrating that the proposed scheme achieves near-optimal secret key rates with strong secrecy guarantees, as quantified by a vanishing variational distance. We explicitly characterize the gap to the optimal rate, revealing a fundamental trade-off between the key rate and the required public communication overhead, which is governed by a tunable quantization parameter. Furthermore, we uncover a non-trivial constrained optimization for this parameter, showing that practical constraints on the SPARC code parameters induce a peak in the achievable secret key rate. This work establishes SPARCs as a viable and theoretically sound framework for secure key generation, providing a compelling low-complexity alternative to existing schemes and offering new insights into the practical design of such protocols.
- [61] arXiv:2507.20268 (cross-list from cs.LG) [pdf, html, other]
-
Title: Data-Efficient Prediction-Powered Calibration via Cross-ValidationSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Calibration data are necessary to formally quantify the uncertainty of the decisions produced by an existing artificial intelligence (AI) model. To overcome the common issue of scarce calibration data, a promising approach is to employ synthetic labels produced by a (generally different) predictive model. However, fine-tuning the label-generating predictor on the inference task of interest, as well as estimating the residual bias of the synthetic labels, demand additional data, potentially exacerbating the calibration data scarcity problem. This paper introduces a novel approach that efficiently utilizes limited calibration data to simultaneously fine-tune a predictor and estimate the bias of the synthetic labels. The proposed method yields prediction sets with rigorous coverage guarantees for AI-generated decisions. Experimental results on an indoor localization problem validate the effectiveness and performance gains of our solution.
- [62] arXiv:2507.20272 (cross-list from cs.LG) [pdf, other]
-
Title: Approximating Full Conformal Prediction for Neural Network Regression with Gauss-Newton InfluenceComments: Accepted at the 13th International Conference on Learning Representations (ICLR 2025)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Uncertainty quantification is an important prerequisite for the deployment of deep learning models in safety-critical areas. Yet, this hinges on the uncertainty estimates being useful to the extent the prediction intervals are well-calibrated and sharp. In the absence of inherent uncertainty estimates (e.g. pretrained models predicting only point estimates), popular approaches that operate post-hoc include Laplace's method and split conformal prediction (split-CP). However, Laplace's method can be miscalibrated when the model is misspecified and split-CP requires sample splitting, and thus comes at the expense of statistical efficiency. In this work, we construct prediction intervals for neural network regressors post-hoc without held-out data. This is achieved by approximating the full conformal prediction method (full-CP). Whilst full-CP nominally requires retraining the model for every test point and candidate label, we propose to train just once and locally perturb model parameters using Gauss-Newton influence to approximate the effect of retraining. Coupled with linearization of the network, we express the absolute residual nonconformity score as a piecewise linear function of the candidate label allowing for an efficient procedure that avoids the exhaustive search over the output space. On standard regression benchmarks and bounding box localization, we show the resulting prediction intervals are locally-adaptive and often tighter than those of split-CP.
- [63] arXiv:2507.20333 (cross-list from cs.AI) [pdf, html, other]
-
Title: The Blessing and Curse of Dimensionality in Safety AlignmentComments: Published as a conference paper at COLM 2025Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and the growth in parameter count is accompanied by larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge because the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the representations of the model onto a lower-dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures. Empirical results confirm that such dimensionality reduction significantly reduces susceptibility to jailbreaking through representation engineering. Building on our empirical validations, we provide theoretical insights into these linear jailbreaking methods relative to a model's hidden dimensions. Broadly speaking, our work posits that the high dimensions of a model's internal representations can be both a blessing and a curse in safety alignment.
- [64] arXiv:2507.20349 (cross-list from cs.LG) [pdf, html, other]
-
Title: From Observations to Causations: A GNN-based Probabilistic Prediction Framework for Causal DiscoverySubjects: Machine Learning (cs.LG); Methodology (stat.ME)
Causal discovery from observational data is challenging, especially with large datasets and complex relationships. Traditional methods often struggle with scalability and capturing global structural information. To overcome these limitations, we introduce a novel graph neural network (GNN)-based probabilistic framework that learns a probability distribution over the entire space of causal graphs, unlike methods that output a single deterministic graph. Our framework leverages a GNN that encodes both node and edge attributes into a unified graph representation, enabling the model to learn complex causal structures directly from data. The GNN model is trained on a diverse set of synthetic datasets augmented with statistical and information-theoretic measures, such as mutual information and conditional entropy, capturing both local and global data properties. We frame causal discovery as a supervised learning problem, directly predicting the entire graph structure. Our approach demonstrates superior performance, outperforming both traditional and recent non-GNN-based methods, as well as a GNN-based approach, in terms of accuracy and scalability on synthetic and real-world datasets without further training. This probabilistic framework significantly improves causal structure learning, with broad implications for decision-making and scientific discovery across various fields.
- [65] arXiv:2507.20353 (cross-list from math.PR) [pdf, html, other]
-
Title: A Theory of $θ$-ExpectationsSubjects: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
The canonical theory of stochastic calculus under ambiguity, founded on sub-additivity, is insensitive to non-convex uncertainty structures, leading to an identifiability impasse. This paper develops a mathematical framework for an identifiable calculus sensitive to non-convex geometry. We introduce the $\theta$-BSDE, a class of backward stochastic differential equations where the driver is determined by a pointwise maximization over a primitive, possibly non-convex, uncertainty set. The system's tractability is predicated not on convexity, but on a global analytic hypothesis: the existence of a unique and globally Lipschitz maximizer map for the driver function. Under this hypothesis, which carves out a tractable class of models, we establish well-posedness via a fixed-point argument. For a distinct, geometrically regular class of models, we prove a result of independent interest: under non-degeneracy conditions from Malliavin calculus, the maximizer is unique along any solution path, ensuring the model's internal consistency. We clarify the fundamental logical gap between this pathwise property and the global regularity required by our existence proof. The resulting valuation operator defines a dynamically consistent expectation, and we establish its connection to fully nonlinear PDEs via a Feynman-Kac formula.
- [66] arXiv:2507.20459 (cross-list from cs.LG) [pdf, html, other]
-
Title: Diagonally-Weighted Generalized Method of Moments Estimation for Gaussian Mixture ModelingSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Since Pearson [Philosophical Transactions of the Royal Society of London. A, 185 (1894), pp. 71-110] first applied the method of moments (MM) for modeling data as a mixture of one-dimensional Gaussians, moment-based estimation methods have proliferated. Among these methods, the generalized method of moments (GMM) improves the statistical efficiency of MM by weighting the moments appropriately. However, the computational complexity and storage complexity of MM and GMM grow exponentially with the dimension, making these methods impractical for high-dimensional data or when higher-order moments are required. Such computational bottlenecks are more severe in GMM since it additionally requires estimating a large weighting matrix. To overcome these bottlenecks, we propose the diagonally-weighted GMM (DGMM), which achieves a balance among statistical efficiency, computational complexity, and numerical stability. We apply DGMM to study the parameter estimation problem for weakly separated heteroscedastic low-rank Gaussian mixtures and design a computationally efficient and numerically stable algorithm that obtains the DGMM estimator without explicitly computing or storing the moment tensors. We implement the proposed algorithm and empirically validate the advantages of DGMM: in numerical studies, DGMM attains smaller estimation errors while requiring substantially shorter runtime than MM and GMM. The code and data will be available upon publication at this https URL.
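For context, all three estimators mentioned above fit the weighted moment-matching template
$$ \hat{\theta}_{W} \;=\; \arg\min_{\theta}\; \big(\widehat{m}_n - m(\theta)\big)^{\top} W\, \big(\widehat{m}_n - m(\theta)\big), $$
where $\widehat{m}_n$ collects empirical moments and $m(\theta)$ the corresponding model moments: MM takes $W = I$, GMM takes $W$ to be an estimate of the inverse covariance of the moments, and the diagonally-weighted variant restricts $W$ to a diagonal matrix, for example inverse moment variances (the precise diagonal choice stated here is an illustrative assumption), which removes the need to estimate and store a full weighting matrix.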
- [67] arXiv:2507.20501 (cross-list from math.OC) [pdf, other]
-
Title: Post-estimation Adjustments in Data-driven Decision-making with Applications in PricingSubjects: Optimization and Control (math.OC); Applications (stat.AP)
The predict-then-optimize (PTO) framework is a standard approach in data-driven decision-making, where a decision-maker first estimates an unknown parameter from historical data and then uses this estimate to solve an optimization problem. While widely used for its simplicity and modularity, PTO can lead to suboptimal decisions because the estimation step does not account for the structure of the downstream optimization problem. We study a class of problems where the objective function, evaluated at the PTO decision, is asymmetric with respect to estimation errors. This asymmetry causes the expected outcome to be systematically degraded by noise in the parameter estimate, as the penalty for underestimation differs from that of overestimation. To address this, we develop a data-driven post-estimation adjustment that improves decision quality while preserving the practicality and modularity of PTO. We show that when the objective function satisfies a particular curvature condition, based on the ratio of its third and second derivatives, the adjustment simplifies to a closed-form expression. This condition holds for a broad range of pricing problems, including those with linear, log-linear, and power-law demand models. Under this condition, we establish theoretical guarantees that our adjustment uniformly and asymptotically outperforms standard PTO, and we precisely characterize the resulting improvement. Additionally, we extend our framework to multi-parameter optimization and settings with biased estimators. Numerical experiments demonstrate that our method consistently improves revenue, particularly in small-sample regimes where estimation uncertainty is most pronounced. This makes our approach especially well-suited for pricing new products or in settings with limited historical price variation.
- [68] arXiv:2507.20542 (cross-list from cs.LG) [pdf, html, other]
-
Title: Improving Group Fairness in Tensor Completion via Imbalance Mitigating Entity AugmentationJournal-ref: 29th PAKDD, 2025, 29--41Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Group fairness is important to consider in tensor decomposition to prevent discrimination based on social grounds such as gender or age. Although a few works have studied group fairness in tensor decomposition, they suffer from performance degradation. To address this, we propose STAFF (Sparse Tensor Augmentation For Fairness) to improve group fairness by minimizing the gap in completion errors of different groups while reducing the overall tensor completion error. Our main idea is to augment a tensor with augmented entities including sufficient observed entries to mitigate imbalance and group bias in the sparse tensor. We evaluate STAFF on tensor completion with various datasets under conventional and deep learning-based tensor models. STAFF consistently shows the best trade-off between completion error and group fairness; at most, it yields 36% lower MSE and 59% lower MADE than the second-best baseline.
- [69] arXiv:2507.20708 (cross-list from cs.LG) [pdf, other]
-
Title: Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation AttacksSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
Proving the compliance of AI algorithms has become an important challenge with the growing deployment of such algorithms for real-life applications. Inspecting possible biased behaviors is mandatory to satisfy the constraints of the EU Artificial Intelligence Act. Regulation-driven audits increasingly rely on global fairness metrics, with Disparate Impact being the most widely used. Yet such global measures depend highly on the distribution of the sample on which they are computed. We first investigate how to manipulate data samples to artificially satisfy fairness criteria, creating minimally perturbed datasets that remain statistically indistinguishable from the original distribution while satisfying prescribed fairness constraints. We then study how to detect such manipulation. Our analysis (i) introduces mathematically sound methods for modifying empirical distributions under fairness constraints using entropic or optimal transport projections, (ii) examines how an auditee could potentially circumvent fairness inspections, and (iii) offers recommendations to help auditors detect such data manipulations. These results are validated through experiments on classical tabular datasets in bias detection.
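For reference, the Disparate Impact metric mentioned above is conventionally defined, for a binary decision $\widehat{Y}$ and a sensitive attribute $S$ (with $S = 0$ denoting the protected group), as
$$ \mathrm{DI}(\widehat{Y}, S) \;=\; \frac{\Pr(\widehat{Y} = 1 \mid S = 0)}{\Pr(\widehat{Y} = 1 \mid S = 1)}, $$
with the usual "80% rule" flagging $\mathrm{DI} < 0.8$. Because both probabilities are estimated from the empirical distribution of the audited sample, reweighting or minimally perturbing that sample can push the estimated DI toward 1 without changing the model, which is precisely the manipulation the paper studies.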
- [70] arXiv:2507.20714 (cross-list from cs.LG) [pdf, other]
-
Title: Prostate Cancer Classification Using Multimodal Feature Fusion and Explainable AIAsma Sadia Khan, Fariba Tasnia Khan, Tanjim Mahmud, Salman Karim Khan, Rishita Chakma, Nahed Sharmen, Mohammad Shahadat Hossain, Karl AnderssonSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Prostate cancer, the second most prevalent male malignancy, requires advanced diagnostic tools. We propose an explainable AI system combining BERT (for textual clinical notes) and Random Forest (for numerical lab data) through a novel multimodal fusion strategy, achieving superior classification performance on PLCO-NIH dataset (98% accuracy, 99% AUC). While multimodal fusion is established, our work demonstrates that a simple yet interpretable BERT+RF pipeline delivers clinically significant improvements - particularly for intermediate cancer stages (Class 2/3 recall: 0.900 combined vs 0.824 numerical/0.725 textual). SHAP analysis provides transparent feature importance rankings, while ablation studies prove textual features' complementary value. This accessible approach offers hospitals a balance of high performance (F1=89%), computational efficiency, and clinical interpretability - addressing critical needs in prostate cancer diagnostics.
- [71] arXiv:2507.20838 (cross-list from cs.LG) [pdf, other]
-
Title: BuildSTG: A Multi-building Energy Load Forecasting Method using Spatio-Temporal Graph Neural NetworkSubjects: Machine Learning (cs.LG); Applications (stat.AP)
Due to the extensive availability of operation data, data-driven methods show strong capabilities in predicting building energy loads. Buildings with similar features often share energy patterns, reflected by spatial dependencies in their operational data, which conventional prediction methods struggle to capture. To overcome this, we propose a multi-building prediction approach using spatio-temporal graph neural networks, comprising graph representation, graph learning, and interpretation. First, a graph is built based on building characteristics and environmental factors. Next, a multi-level graph convolutional architecture with attention is developed for energy prediction. Lastly, a method interpreting the optimized graph structure is introduced. Experiments on the Building Data Genome Project 2 dataset confirm superior performance over baselines such as XGBoost, SVR, FCNN, GRU, and Naive, highlighting the method's robustness, generalization, and interpretability in capturing meaningful building similarities and spatial relationships.
- [72] arXiv:2507.20846 (cross-list from astro-ph.IM) [pdf, other]
-
Title: Precision spectral estimation at sub-Hz frequencies: closed-form posteriors and Bayesian noise projectionComments: This work has been submitted to the IEEE for possible publicationSubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Signal Processing (eess.SP); Applications (stat.AP)
We present a Bayesian method for estimating spectral quantities in multivariate Gaussian time series. The approach, based on periodograms and Wishart statistics, yields closed-form expressions at any given frequency for the marginal posterior distributions of the individual power spectral densities, the pairwise coherence, and the multiple coherence, as well as for the joint posterior distribution of the full cross-spectral density matrix. In the context of noise projection - where one series is modeled as a linear combination of filtered versions of the others, plus a background component - the method also provides closed-form posteriors for both the susceptibilities, i.e., the filter transfer functions, and the power spectral density of the background. Originally developed for the analysis of the data from the European Space Agency's LISA Pathfinder mission, the method is particularly well-suited to very-low-frequency data, where long observation times preclude averaging over large sets of periodograms, which would otherwise allow these to be treated as approximately normally distributed.
- [73] arXiv:2507.20980 (cross-list from cs.CV) [pdf, html, other]
-
Title: LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view ClusteringComments: 10 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
- [74] arXiv:2507.20982 (cross-list from math.PR) [pdf, html, other]
-
Title: Bernstein-type dimension-free concentration for self-normalised martingalesSubjects: Probability (math.PR); Statistics Theory (math.ST)
We introduce a dimension-free Bernstein-type tail inequality for self-normalised martingales normalised by their predictable quadratic variation. As applications of our result, we propose solutions to the recent open problems posed by Mussi et al. (2024), providing computationally efficient confidence sequences for logistic regression with adaptively chosen RKHS-valued covariates, and establishing instance-adaptive regret bounds in the corresponding kernelised bandit setting.
- [75] arXiv:2507.20993 (cross-list from cs.LG) [pdf, html, other]
-
Title: Personalized Treatment Effect Estimation from Unstructured DataSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Existing methods for estimating personalized treatment effects typically rely on structured covariates, limiting their applicability to unstructured data. Yet, leveraging unstructured data for causal inference has considerable application potential, for instance in healthcare, where clinical notes or medical images are abundant. To this end, we first introduce an approximate 'plug-in' method trained directly on the neural representations of unstructured data. However, when these fail to capture all confounding information, the method may be subject to confounding bias. We therefore introduce two theoretically grounded estimators that leverage structured measurements of the confounders during training, but allow estimating personalized treatment effects purely from unstructured inputs, while avoiding confounding bias. When these structured measurements are only available for a non-representative subset of the data, these estimators may suffer from sampling bias. To address this, we further introduce a regression-based correction that accounts for the non-uniform sampling, assuming the sampling mechanism is known or can be well-estimated. Our experiments on two benchmark datasets show that the plug-in method, directly trainable on large unstructured datasets, achieves strong empirical performance across all settings, despite its simplicity.
- [76] arXiv:2507.21040 (cross-list from cs.LG) [pdf, html, other]
-
Title: Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential ImprovementsComments: Initial versionSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform "linear" dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
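The attention modification mentioned in the last sentence is simple to state in code. The sketch below is a single-head, NumPy-only illustration (projection matrices, residual connections and normalization are omitted and are assumptions), contrasting the plain attention output $AV$ with the graph-diffusion-style variant $(A - I)V$ suggested by the abstract.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv, diffusion=False):
    """Single-head scaled dot-product attention; with diffusion=True the identity
    is subtracted from the attention matrix, treating it as a graph adjacency."""
    d = Wq.shape[1]
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic attention matrix
    V = X @ Wv
    return ((A - np.eye(A.shape[0])) @ V) if diffusion else (A @ V)
```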
Cross submissions (showing 37 of 37 entries)
- [77] arXiv:2007.00736 (replaced) [pdf, other]
-
Title: Tensor Completion with Nearly Linear Samples Given Weak Side InformationJournal-ref: Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 6, Issue 2, Article 39 (June 2022), 35 pagesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only $\Theta(tn)$ degrees of freedom in a $t$-order tensor with $n^t$ entries, the best known polynomial time algorithm requires $O(n^{t/2})$ samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to $O(n)$. The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with $O(n^{1+\kappa})$ samples for any small constant $\kappa > 0$. We also provide experiments on both synthetic and real-world datasets that validate our theoretical insights.
- [78] arXiv:2110.01950 (replaced) [pdf, html, other]
-
Title: Classification of high-dimensional data with spiked covariance matrix structure
Comments: 40 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the classification problem for high-dimensional data with $n$ observations on $p$ features where the $p \times p$ covariance matrix $\Sigma$ exhibits a spiked eigenvalue structure and the vector $\zeta$, given by the difference between the whitened mean vectors, is sparse with sparsity at most $s$. We propose an adaptive classifier (adaptive with respect to the sparsity $s$) that first performs dimension reduction on the feature vectors prior to classification in the dimensionally reduced space: the classifier whitens the data, then screens the features by keeping only those corresponding to the $s$ largest coordinates of $\zeta$, and finally applies Fisher's linear discriminant on the selected features. Leveraging recent results on entrywise matrix perturbation bounds for covariance matrices, we show that the resulting classifier is Bayes optimal whenever $n \rightarrow \infty$ and $s \sqrt{n^{-1} \ln p} \rightarrow 0$. Experimental results on real and synthetic data sets indicate that the proposed classifier is competitive with existing state-of-the-art methods while also selecting a smaller number of features.
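A rough sketch of the whiten, screen, and Fisher-discriminant pipeline is given below, using a pooled within-class sample covariance for whitening; the constants and the whitening step are illustrative simplifications of the estimator analyzed in the paper.

```python
# Rough sketch of the whiten -> screen -> Fisher-LDA pipeline, whitening with a
# pooled within-class sample covariance (the paper's estimator and its entrywise
# perturbation analysis are more refined than this illustration).
import numpy as np

def fit_sparse_whitened_lda(X0, X1, s):
    p = X0.shape[1]
    Sigma = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2 + 1e-6 * np.eye(p)
    w, V = np.linalg.eigh(Sigma)
    W = V @ np.diag(w ** -0.5) @ V.T               # whitening matrix Sigma^{-1/2}
    zeta = W @ (X1.mean(0) - X0.mean(0))           # whitened mean difference
    keep = np.argsort(np.abs(zeta))[-s:]           # screen: keep s largest coordinates
    mid = W @ (X1.mean(0) + X0.mean(0)) / 2        # midpoint in the whitened space
    def predict(Xnew):
        Z = Xnew @ W                               # whiten new observations (W is symmetric)
        return ((Z - mid)[:, keep] @ zeta[keep] > 0).astype(int)
    return predict

rng = np.random.default_rng(1)
p, n = 50, 200
mu = np.zeros(p); mu[:5] = 1.5                     # sparse mean difference
X0 = rng.normal(size=(n, p))
X1 = rng.normal(size=(n, p)) + mu
predict = fit_sparse_whitened_lda(X0, X1, s=5)
print(predict(rng.normal(size=(200, p)) + mu).mean())   # mostly classified as class 1
```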
- [79] arXiv:2112.07755 (replaced) [pdf, other]
-
Title: Separate Exchangeability as Modeling Principle in Bayesian NonparametricsSubjects: Methodology (stat.ME); Statistics Theory (math.ST)
We argue for the use of separate exchangeability as a modeling principle in Bayesian nonparametric (BNP) inference. Separate exchangeability is de facto widely applied in the Bayesian parametric case, e.g., it naturally arises in simple mixed models. However, while in some areas, such as random graphs, separate and (closely related) joint exchangeable models are widely used, they are curiously underused for several other applications in BNP. We briefly review the definition of separate exchangeability, focusing on the implications of such a definition in Bayesian modeling. We then discuss two tractable classes of models that implement separate exchangeability, which are the natural counterparts of familiar partially exchangeable BNP models.
The first is nested random partitions for a data matrix, defining a partition of columns and nested partitions of rows, nested within column clusters. Many recent models for nested partitions implement partially exchangeable models related to variations of the well-known nested Dirichlet process. We argue that inference under such models in some cases ignores important features of the experimental setup. We obtain the separately exchangeable counterpart of such partially exchangeable partition structures.
The second class is about setting up separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved. We highlight how a Dirichlet process mixture of linear models, known as ANOVA DDP, can naturally implement separate exchangeability in such regression problems. Finally, we illustrate how to perform inference under such models in two real data examples.
- [80] arXiv:2209.00102 (replaced) [pdf, other]
-
Title: Bayesian Mixed Multidimensional Scaling for Auditory ProcessingSubjects: Methodology (stat.ME); Applications (stat.AP)
The human brain distinguishes speech sounds by mapping acoustic signals into a latent perceptual space. This space can be estimated via multidimensional scaling (MDS), preserving the similarity structure in lower dimensions. However, individual and group-level heterogeneity, especially between native and non-native listeners, remains poorly understood. Prior approaches often ignore such variability or cannot capture shared structure, limiting principled comparison. Moreover, the literature typically focuses on latent distances rather than the underlying features themselves. To address these issues, we develop a Bayesian mixed MDS method that accounts for both subject- and group-level heterogeneity, enabling recovery of biologically interpretable latent features. Simulations and an auditory neuroscience application demonstrate how these features reconstruct observed distances and vary with individual and language background, revealing novel insights.
- [81] arXiv:2303.03520 (replaced) [pdf, html, other]
-
Title: The Effect of Alcohol intake on Brain White Matter Microstructural Integrity: A New Causal Inference Framework for Incomplete Phenomic DataSubjects: Methodology (stat.ME)
Although substance use, such as alcohol intake, is known to be associated with cognitive decline during aging, its direct influence on the central nervous system remains incompletely understood. In this study, we investigate the influence of alcohol intake frequency on reduction of brain white matter microstructural integrity in the fornix, a brain region considered a promising marker of age-related microstructural degeneration, using a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive lifestyle profile. Two major challenges arise: 1) potentially nonlinear confounding effects from phenomic variables and 2) a limited proportion of participants with complete phenomic data. To address these challenges, we develop a novel ensemble learning framework tailored for robust causal inference and introduce a data integration step to incorporate information from UKB participants with incomplete phenomic data, improving estimation efficiency. Our analysis reveals that daily alcohol intake may significantly reduce fractional anisotropy, a neuroimaging-derived measure of white matter structural integrity, in the fornix and increase systolic and diastolic blood pressure levels. Moreover, extensive numerical studies demonstrate the superiority of our method over competing approaches in terms of estimation bias, while outcome regression-based estimators may be preferred when minimizing mean squared error is prioritized.
- [82] arXiv:2306.10189 (replaced) [pdf, html, other]
-
Title: MOCK: an Algorithm for Learning Nonparametric Differential Equations via Multivariate Occupation Kernel Functions
Comments: 29 pages, 6 figures. Accepted at Transactions in Machine Learning Research (TMLR)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Learning a nonparametric system of ordinary differential equations from trajectories in a $d$-dimensional state space requires learning $d$ functions of $d$ variables. Explicit formulations often scale quadratically in $d$ unless additional knowledge about system properties, such as sparsity and symmetries, is available. In this work, we propose a linear approach, the multivariate occupation kernel method (MOCK), using the implicit formulation provided by vector-valued reproducing kernel Hilbert spaces. The solution for the vector field relies on multivariate occupation kernel functions associated with the trajectories and scales linearly with the dimension of the state space. We validate our approach through experiments on a variety of simulated and real datasets ranging from 2 to 1024 dimensions. MOCK outperforms all other comparators on 3 of the 9 datasets on full trajectory prediction and on 4 of the 9 datasets on next-point prediction.
- [83] arXiv:2309.03782 (replaced) [pdf, other]
-
Title: A semi-parametric model for assessing the effect of temperature on ice accumulation rate from Antarctic ice core dataSubjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
In this paper, we present a semiparametric model for describing the effect of temperature on Antarctic ice accumulation on a paleoclimatic time scale. The model is motivated by sharp ups and downs in the rate of ice accumulation apparent from ice core data records, which are synchronous with movements of temperature. We prove strong consistency of the estimators under reasonable conditions. We conduct extensive simulations to assess the performance of the estimators and bootstrap based standard errors and confidence limits for the requisite range of sample sizes. Analysis of ice core data from two Antarctic locations over several hundred thousand years shows a reasonable fit. The apparent accumulation rate exhibits a thinning pattern that should facilitate the understanding of ice condensation, transformation and flow over the ages. There is a very strong linear relationship between temperature and the apparent accumulation rate adjusted for thinning.
- [84] arXiv:2312.11283 (replaced) [pdf, other]
-
Title: A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census
John M. Abowd, Tamara Adams, Robert Ashmead, David Darais, Sourya Dey, Simson L. Garfinkel, Nathan Goldschlag, Michael B. Hawes, Daniel Kifer, Philip Leclerc, Ethan Lew, Scott Moore, Rolando A. Rodríguez, Ramy N. Tadros, Lars Vilhuber
Comments: This is the accepted Harvard Data Science Review paper. The accepted supplemental text is here: https://arxiv.org/abs/2312.11283v2
Journal-ref: Abowd, J. M., et al. (2025). A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census. Harvard Data Science Review
Subjects: Applications (stat.AP); Cryptography and Security (cs.CR); Econometrics (econ.EM)
We show that individual, confidential microdata records from the 2010 U.S. Census of Population and Housing can be accurately reconstructed from the published tabular summaries. Ninety-seven million person records (every resident in 70% of all census blocks) are exactly reconstructed with provable certainty using only public information. We further show that a hypothetical attacker using our methods can reidentify with 95% accuracy population unique individuals who are perfectly reconstructed and not in the modal race and ethnicity category in their census block (3.4 million persons)--a result that is only possible because their confidential records were used in the published tabulations. Finally, we show that the methods used for the 2020 Census, based on a differential privacy framework, provide better protection against this type of attack, with better published data accuracy, than feasible alternatives.
- [85] arXiv:2404.03867 (replaced) [pdf, html, other]
-
Title: Dimension-free Relaxation Times of Informed MCMC Samplers on Discrete Spaces
Comments: Accepted by Bernoulli
Subjects: Computation (stat.CO); Probability (math.PR); Machine Learning (stat.ML)
The importance of convergence analysis for Markov chain Monte Carlo methods in high-dimensional statistical applications is increasingly recognized. In this paper, we develop general mixing time bounds for Metropolis-Hastings algorithms on discrete spaces by building upon and refining some recent theoretical advancements in Bayesian model selection problems. We establish sufficient conditions for a class of informed Metropolis-Hastings algorithms to attain relaxation times that are independent of the problem dimension. These conditions are grounded in high-dimensional statistical theory and allow for possibly multimodal posterior distributions. We obtain our results through two independent techniques: the multicommodity flow method and single-element drift condition analysis; we find that the latter yields a slightly tighter mixing time bound. Our results are readily applicable to a broad spectrum of statistical problems with discrete parameter spaces, as we demonstrate using both theoretical and numerical examples.
- [86] arXiv:2404.06995 (replaced) [pdf, html, other]
-
Title: Model-free Change-point Detection using AUC of a ClassifierSubjects: Methodology (stat.ME)
In contemporary data analysis, it is increasingly common to work with non-stationary complex data sets. These data sets typically extend beyond the classical low-dimensional Euclidean space, making it challenging to detect shifts in their distribution without relying on strong structural assumptions. This paper proposes a novel offline change-point detection method that leverages classifiers developed in the statistics and machine learning community. With suitable data splitting, the test statistic is constructed through sequential computation of the Area Under the Curve (AUC) of a classifier, which is trained on data segments on both ends of the sequence. It is shown that the resulting AUC process attains its maxima at the true change-point location, which facilitates the change-point estimation. The proposed method is characterized by its complete nonparametric nature, high versatility, considerable flexibility, and absence of stringent assumptions on the underlying data or any distributional shifts. Theoretically, we derive the limiting pivotal distribution of the proposed test statistic under null, as well as the asymptotic behaviors under both local and fixed alternatives. The localization rate of the change-point estimator is also provided. Extensive simulation studies and the analysis of two real-world data sets illustrate the superior performance of our approach compared to existing model-free change-point detection methods.
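The following simplified sketch conveys the classifier-AUC idea: for each candidate location, label observations by which side of the split they fall on, train a classifier, and record its AUC. It omits the paper's sample-splitting scheme and sequential construction, so it is illustrative only.

```python
# Simplified classifier-AUC statistic: for each candidate split t, label points
# by side, fit a classifier, and record its AUC; the curve peaks near the true
# change point. (The paper's sample-splitting and sequential construction are
# omitted here, so this is illustrative only.)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, tau = 400, 250
X = np.vstack([rng.normal(0.0, 1.0, size=(tau, 5)),
               rng.normal(0.8, 1.0, size=(n - tau, 5))])      # mean shift at tau

aucs = {}
for t in range(50, n - 50, 10):
    y = (np.arange(n) >= t).astype(int)                       # left vs right of candidate t
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    aucs[t] = roc_auc_score(y, clf.predict_proba(X)[:, 1])

print(max(aucs, key=aucs.get))                                # estimate, close to tau = 250
```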
- [87] arXiv:2408.06642 (replaced) [pdf, html, other]
-
Title: Quantifying uncertainty in climate projections with conformal ensembles
Comments: 25 pages, 8 figures, 2 tables
Subjects: Applications (stat.AP); Machine Learning (stat.ML)
Ensembles of General Circulation Models (GCMs) are the primary tools for investigating climate sensitivity, projecting future climate states, and quantifying uncertainty. GCM ensembles are subject to substantial uncertainty due to model inadequacies, resolution limits, internal variability, and inter-model variability, meaning rigorous climate risk assessments and informed decision-making require reliable and accurate uncertainty quantification (UQ). We introduce conformal ensembles (CE), a new approach to climate UQ that quantifies and constrains projection uncertainty with conformal prediction sets and observational data. CE seamlessly integrates climate model ensembles and observational data across a range of scales to generate statistically rigorous, easy-to-interpret uncertainty estimates. CE can be applied to any climatic variable using any ensemble analysis method and outperforms existing inter-model variability methods in uncertainty quantification across all time horizons and most spatial locations under SSP2-4.5. CE is also computationally efficient, requires minimal assumptions, and is highly robust to the conformity measure. Experiments show that it is effective when conditioning future projections on historical reanalysis data compared with standard ensemble averaging approaches, yielding more physically consistent projections.
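A minimal split-conformal sketch in the spirit of conformal ensembles is shown below: the ensemble mean serves as the point projection and absolute residuals against historical observations calibrate the interval width. The conformity measure and data are illustrative, not the paper's full CE procedure.

```python
# Minimal split-conformal sketch: ensemble mean as point projection, interval
# width calibrated from absolute residuals against historical observations.
import numpy as np

rng = np.random.default_rng(0)
n_hist, n_models, alpha = 200, 12, 0.1
obs_hist = 14.0 + 0.1 * rng.normal(size=n_hist).cumsum()              # "observed" series
ens_hist = obs_hist + rng.normal(0.0, 0.8, size=(n_models, n_hist))   # historical ensemble

scores = np.abs(obs_hist - ens_hist.mean(axis=0))                     # conformity scores
q = np.quantile(scores, np.ceil((n_hist + 1) * (1 - alpha)) / n_hist) # calibrated radius

ens_future = 15.0 + rng.normal(0.0, 0.8, size=(n_models, 50))         # future projections
point = ens_future.mean(axis=0)
lower, upper = point - q, point + q                                   # conformal interval
print(round(q, 3), lower[:3].round(2), upper[:3].round(2))
```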
- [88] arXiv:2408.13276 (replaced) [pdf, other]
-
Title: Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity
Comments: 64 pages
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground-truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank $r$ of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth at a linear rate as soon as the number of samples scales with $ \Omega (rd\kappa^2)$, where $d$ is the dimension, and $\kappa$ is the condition number of the ground truth matrix. This improves the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Furthermore, we extend our theory to the noisy setting, where we show that with noisy measurements, factorized gradient descent with spectral initialization converges to the minimax optimal error up to a factor linear in $\kappa$. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique is of independent interest for other non-convex problems.
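The sketch below implements a toy version of factorized gradient descent with spectral initialization for a noiseless PSD instance; the dimensions, sample size, and step size are illustrative and do not reflect the constants in the stated guarantee.

```python
# Toy factorized gradient descent with spectral initialization for recovering a
# PSD rank-r matrix from noiseless Gaussian measurements y_i = <A_i, X*>.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 30, 2, 1200
U_star = rng.normal(size=(d, r))
X_star = U_star @ U_star.T
A = rng.normal(size=(m, d, d))                       # measurement matrices
y = np.einsum('mij,ij->m', A, X_star)                # linear measurements

# spectral initialization: top-r eigenpairs of the symmetrized backprojection
M = np.einsum('m,mij->ij', y, A) / m
M = (M + M.T) / 2
vals, vecs = np.linalg.eigh(M)
U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0.0))

eta = 0.2 / d                                        # illustrative step size
for _ in range(300):
    resid = np.einsum('mij,ij->m', A, U @ U.T) - y
    grad = np.einsum('m,mij->ij', resid, A + A.transpose(0, 2, 1)) @ U / m
    U -= eta * grad

print(np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))   # small relative error
```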
- [89] arXiv:2410.03229 (replaced) [pdf, html, other]
-
Title: Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting
Soon Hoe Lim, Yijin Wang, Annan Yu, Emma Hart, Michael W. Mahoney, Xiaoye S. Li, N. Benjamin Erichson
Comments: 35 pages
Journal-ref: Transactions on Machine Learning Research (TMLR), 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.
- [90] arXiv:2410.05634 (replaced) [pdf, html, other]
-
Title: Identification and estimation for matrix time series CP-factor modelsSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)
We propose a new method for identifying and estimating the CP-factor models for matrix time series. Unlike the generalized eigenanalysis-based method of Chang et al. (2023) for which the convergence rates of the associated estimators may suffer from small eigengaps as the asymptotic theory is based on some matrix perturbation analysis, the proposed new method enjoys faster convergence rates which are free from any eigengaps. It achieves this by turning the problem into a joint diagonalization of several matrices whose elements are determined by a basis of a linear system, and by choosing the basis carefully to avoid near co-linearity (see Proposition 5 and Section 4.3). Furthermore, unlike Chang et al. (2023) which requires the two factor loading matrices to be full-ranked, the proposed new method can handle rank-deficient factor loading matrices. Illustration with both simulated and real matrix time series data shows the advantages of the proposed new method.
- [91] arXiv:2410.19568 (replaced) [pdf, other]
-
Title: Prediction of microstructural representativity from a single image
Journal-ref: Advanced Science 2025
Subjects: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
In this study, we present a method for predicting the representativity of the phase fraction observed in a single image (2D or 3D) of a material. Traditional approaches often require large datasets and extensive statistical analysis to estimate the Integral Range, a key factor in determining the variance of microstructural properties. Our method leverages the Two-Point Correlation function to directly estimate the variance from a single image, thereby enabling phase fraction prediction with associated confidence levels. We validate our approach using open-source datasets, demonstrating its efficacy across diverse microstructures. This technique significantly reduces the data requirements for representativity analysis, providing a practical tool for material scientists and engineers working with limited microstructural data. To make the method easily accessible, we have created a web-application, this http URL, for quick, simple and informative use of the method.
- [92] arXiv:2501.16517 (replaced) [pdf, html, other]
-
Title: Symmetric Perceptrons, Number Partitioning and LatticesSubjects: Statistics Theory (math.ST); Computational Complexity (cs.CC); Mathematical Physics (math-ph); Probability (math.PR)
The symmetric binary perceptron ($\mathrm{SBP}_{\kappa}$) problem with parameter $\kappa : \mathbb{R}_{\geq1} \to [0,1]$ is an average-case search problem defined as follows: given a random Gaussian matrix $\mathbf{A} \sim \mathcal{N}(0,1)^{n \times m}$ as input where $m \geq n$, output a vector $\mathbf{x} \in \{-1,1\}^m$ such that $$|| \mathbf{A} \mathbf{x} ||_{\infty} \leq \kappa(m/n) \cdot \sqrt{m}~.$$ The number partitioning problem ($\mathrm{NPP}_{\kappa}$) corresponds to the special case of setting $n=1$. There is considerable evidence that both problems exhibit large computational-statistical gaps.
In this work, we show (nearly) tight average-case hardness for these problems, assuming the worst-case hardness of standard approximate shortest vector problems on lattices.
For $\mathrm{SBP}$, for large $n$, the best that efficient algorithms have been able to achieve is $\kappa(x) = \Theta(1/\sqrt{x})$ (Bansal and Spencer, Random Structures and Algorithms 2020), which is a far cry from the statistical bound. The problem has been extensively studied in the TCS and statistics communities, and Gamarnik, Kizildag, Perkins and Xu (FOCS 2022) conjecture that Bansal-Spencer is tight: namely, $\kappa(x) = \widetilde{\Theta}(1/\sqrt{x})$ is the optimal value achieved by computationally efficient algorithms. We prove their conjecture assuming the worst-case hardness of approximating the shortest vector problem on lattices.
For $\mathrm{NPP}$, Karmarkar and Karp's classical differencing algorithm achieves $\kappa(m) = 2^{-O(\log^2 m)}~.$ We prove that Karmarkar-Karp is nearly tight: namely, no polynomial-time algorithm can achieve $\kappa(m) = 2^{-\Omega(\log^3 m)}$, once again assuming the worst-case subexponential hardness of approximating the shortest vector problem on lattices to within a subexponential factor.
- [93] arXiv:2502.04709 (replaced) [pdf, html, other]
-
Title: Early Stopping for Regression TreesSubjects: Statistics Theory (math.ST)
We develop early stopping rules for growing regression tree estimators. The fully data-driven stopping rule is based on monitoring the global residual norm. The best-first search and the breadth-first search algorithms together with linear interpolation give rise to generalized projection or regularization flows. A general theory of early stopping is established. Oracle inequalities for the early-stopped regression tree are derived without any smoothness assumption on the regression function, assuming the original CART splitting rule, yet with a much broader scope. The remainder terms are of smaller order than the best achievable rates for Lipschitz functions in dimension $d\ge 2$. In real and synthetic data the early stopping regression tree estimators attain the statistical performance of cost-complexity pruning while significantly reducing computational costs.
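A minimal sketch of residual-norm monitoring is given below, using scikit-learn's best-first tree growth (activated by max_leaf_nodes) and a discrepancy-style threshold at the noise variance; the paper's fully data-driven rule and interpolation step are not reproduced.

```python
# Sketch of early stopping via the global residual norm: grow a best-first CART
# tree (scikit-learn grows best-first when max_leaf_nodes is set) and stop once
# the mean squared residual falls below a discrepancy-style threshold.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, sigma = 2000, 0.5
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sign(X[:, 0]) * np.abs(X[:, 1]) + rng.normal(0.0, sigma, size=n)

threshold = sigma ** 2                      # assumes the noise level is known
for leaves in range(2, 200):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X, y)
    resid = np.mean((y - tree.predict(X)) ** 2)
    if resid <= threshold:
        break

print(leaves, round(resid, 3))              # number of leaves at the stopping time
```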
- [94] arXiv:2502.07285 (replaced) [pdf, html, other]
-
Title: Negative Dependence as a toolbox for machine learning : review and new developments
Comments: Dedicated to the memory of Prof K.R. Parthasarathy: visionary, guru, and scientist par excellence
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Negative dependence is becoming a key driver in advancing learning capabilities beyond the limits of traditional independence. Recent developments have evidenced support towards negatively dependent systems as a learning paradigm in a broad range of fundamental machine learning challenges including optimization, sampling, dimensionality reduction and sparse signal recovery, often surpassing the performance of current methods based on statistical independence. The most popular negatively dependent model has been that of determinantal point processes (DPPs), which have their origins in quantum theory. However, other models, such as perturbed lattice models, strongly Rayleigh measures, zeros of random functions have gained salience in various learning applications. In this article, we review this burgeoning field of research, as it has developed over the past two decades or so. We also present new results on applications of DPPs to the parsimonious representation of neural networks. In the limited scope of the article, we mostly focus on aspects of this area to which the authors contributed over the recent years, including applications to Monte Carlo methods, coresets and stochastic gradient descent, stochastic networks, signal processing and connections to quantum computation. However, starting from basics of negative dependence for the uninitiated reader, extensive references are provided to a broad swath of related developments which could not be covered within our limited scope. While existing works and reviews generally focus on specific negatively dependent models (e.g. DPPs), a notable feature of this article is that it addresses negative dependence as a machine learning methodology as a whole. In this vein, it covers within its span an array of negatively dependent models and their applications well beyond DPPs, thereby putting forward a very general and rather unique perspective.
- [95] arXiv:2502.20608 (replaced) [pdf, html, other]
-
Title: Analysis of multivariate event times under informative censoring using vine copulaSubjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
The study of times to nonterminal events of different types and their interrelation is a compelling area of interest. The primary challenge in analyzing such multivariate event times is the presence of informative censoring by the terminal event. While numerous statistical methods have been proposed for a single nonterminal event, i.e., semi-competing risks data, there remains a dearth of tools for analyzing times to multiple nonterminal events. This article introduces a novel analysis framework that leverages the vine copula to directly estimate the joint density of multivariate times to nonterminal and terminal events. Unlike the few existing methods based on multivariate or nested copulas, the developed approach excels in capturing the heterogeneous dependence between each pair of event times (nonterminal-terminal and between-nonterminal) in terms of strength and structure. We propose a likelihood-based estimation and inference procedure, which can be implemented efficiently in sequential stages. Through extensive simulation studies, we demonstrate the satisfactory finite-sample performance of our proposed stage-wise estimators and analytical variance estimators, as well as their advantages over existing methods. We apply the developed approach to data from a crowdfunding platform to investigate the relationship between various types of creator-backer interactions and a creator's lifetime on the platform.
- [96] arXiv:2503.11599 (replaced) [pdf, html, other]
-
Title: Quantifying sleep apnea heterogeneity using hierarchical Bayesian modelingSubjects: Applications (stat.AP)
Obstructive Sleep Apnea (OSA) is a breathing disorder during sleep that affects millions of people worldwide. The diagnosis of OSA often occurs through an overnight polysomnogram (PSG) sleep study that generates a massive amount of physiological data. However, despite the evidence of substantial heterogeneity in the expression and symptoms of OSA, diagnosis and scientific analysis of severity typically focus on a single summary statistic, the Apnea-Hypopnea Index (AHI). We address the limitations of this approach through hierarchical Bayesian modeling of PSG data. Our approach produces interpretable random effects for each patient, which govern sleep-stage dynamics, rates of OSA events, and impacts of OSA events on subsequent sleep-stage dynamics. We propose a novel approach for using these random effects to produce a Bayes optimal clustering of patients. We use the proposed approach to analyze data from the APPLES study. Our analysis produces clinically interesting groups of patients with sleep apnea and a novel finding of an association between OSA expression and cognitive performance that is missed by an AHI-based analysis.
- [97] arXiv:2503.15382 (replaced) [pdf, other]
-
Title: The information mismatch, and how to fix itSubjects: Other Statistics (stat.OT)
We live in unprecedented times in terms of our ability to use evidence to inform medical care. For example, we can perform data-driven post-test probability calculations. However, there is work to do. As has been previously noted, sensitivity and specificity, which play a key role in post-test probability calculations, are defined as unadjusted for patient covariates. In light of this, there have been multiple recommendations that sensitivity and specificity be adjusted for covariates. However, there is less work on the downstream clinical impact of unadjusted sensitivity and specificity. We discuss this here. We argue that unadjusted sensitivity and specificity, when mixed with covariate-dependent pre-test probability scores (which are more easily available nowadays given the multitude of online calculators), can lead to a post-test probability that contains an ``information mismatch.'' We write the equations behind such an information mismatch and discuss the steps that can be taken to fix it.
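For concreteness, here is the standard post-test probability calculation the discussion refers to, with illustrative numbers: a pre-test probability is converted to odds, multiplied by the likelihood ratio built from unadjusted sensitivity and specificity, and converted back.

```python
# Standard (unadjusted) post-test probability via likelihood ratios, with
# illustrative numbers: pre-test probability 0.20, sensitivity 0.90,
# specificity 0.80, positive test result.
def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    lr = sensitivity / (1 - specificity) if positive else (1 - sensitivity) / specificity
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

print(round(post_test_probability(0.20, 0.90, 0.80, positive=True), 3))   # ~0.529
```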
- [98] arXiv:2504.19450 (replaced) [pdf, html, other]
-
Title: Signal detection from spiked noise via asymmetrization
Comments: We further included the heavy-tailed case and some eigenvector result. As a byproduct, we also proved the main result in arXiv:1012.4818 under the minimal second moment condition
Subjects: Statistics Theory (math.ST)
The signal plus noise model $H=S+Y$ is a fundamental model in signal detection when a low rank signal $S$ is polluted by noise $Y$. In the high-dimensional setting, one often uses the leading singular values and corresponding singular vectors of $H$ to conduct the statistical inference of the signal $S$. Especially, when $Y$ consists of iid random entries, the singular values of $S$ can be estimated from those of $H$ as long as the signal $S$ is strong enough. However, when the $Y$ entries are heteroscedastic or heavy-tailed, this standard approach may fail. Especially in this work, we consider a situation that can easily arise with heteroscedastic or heavy-tailed noise but is particularly difficult to address using the singular value approach, namely, when the noise $Y$ itself may create spiked singular values. It has been a recurring question how to distinguish the signal $S$ from the spikes in $Y$, as this seems impossible by examining the leading singular values of $H$. Inspired by the work \cite{CCF21}, we turn to study the eigenvalues of an asymmetrized model when two samples $H_1=S+Y_1$ and $H_2=S+Y_2$ are available. We show that by looking into the leading eigenvalues (in magnitude) of the asymmetrized model $H_1H_2^*$, one can easily detect $S$. We will primarily discuss the heteroscedastic case and then discuss the extension to the heavy-tailed case. As a byproduct, we also derive the fundamental result regarding the outlier of non-Hermitian random matrix in \cite{Tao} under the minimal 2nd moment condition.
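A toy numerical illustration of the asymmetrization idea is given below: with two independent noisy copies of the signal, the leading eigenvalues (in magnitude) of $H_1H_2^*$ track the signal strength even under row-wise heteroscedastic noise. All parameters are illustrative.

```python
# Toy illustration: with two independent noisy copies H1 = S + Y1, H2 = S + Y2,
# the leading eigenvalue (in magnitude) of H1 @ H2^* tracks the squared signal
# strength even when the noise is heteroscedastic across rows.
import numpy as np

rng = np.random.default_rng(0)
n, signal = 300, 8.0
u = rng.normal(size=n); u /= np.linalg.norm(u)
v = rng.normal(size=n); v /= np.linalg.norm(v)
S = signal * np.outer(u, v)                               # rank-one signal

scale = rng.uniform(0.2, 2.0, size=n)                     # row-wise noise levels
Y1 = rng.normal(size=(n, n)) * scale[:, None] / np.sqrt(n)
Y2 = rng.normal(size=(n, n)) * scale[:, None] / np.sqrt(n)

eig = np.linalg.eigvals((S + Y1) @ (S + Y2).conj().T)
top = eig[np.argsort(-np.abs(eig))][:3]
print(np.round(np.abs(top), 2))                           # leading value close to signal**2 = 64
```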
- [99] arXiv:2505.09619 (replaced) [pdf, other]
-
Title: Machine Learning Solutions Integrated in an IoT Healthcare Platform for Heart Failure Risk Stratification
Aiman Faiz, Claudio Pascarelli, Gianvito Mitrano, Gianluca Fimiani, Marina Garofano, Mariangela Lazoi, Claudio Passino, Alessia Bramanti
Subjects: Other Statistics (stat.OT); Artificial Intelligence (cs.AI)
The management of chronic Heart Failure (HF) presents significant challenges in modern healthcare, requiring continuous monitoring, early detection of exacerbations, and personalized treatment strategies. In this paper, we present a predictive model founded on Machine Learning (ML) techniques to identify patients at HF risk. This model is an ensemble learning approach, a modified stacking technique, that uses two specialized models leveraging clinical and echocardiographic features and then a meta-model to combine the predictions of these two models. We initially assess the model on a real dataset, and the obtained results suggest that it performs well in the stratification of patients at HF risk. Specifically, we obtained high sensitivity (95%), ensuring that nearly all high-risk patients are identified. As for accuracy, we obtained 84%, which can be considered moderate in some ML contexts. However, it is acceptable given our priority of identifying patients at risk of HF, because they will be asked to participate in the telemonitoring program of the PrediHealth research project on which some of the authors of this paper are working. The initial findings also suggest that ML-based risk stratification models can serve as valuable decision-support tools not only in the PrediHealth project but also for healthcare professionals, aiding in early intervention and personalized patient management. To better understand the value and potential of our predictive model, we also contrasted its results with those obtained using three baseline models. The preliminary results indicate that our predictive model outperforms these baselines, which consider the features flatly, i.e., without grouping them into clinical and echocardiographic features.
- [100] arXiv:2505.15328 (replaced) [pdf, html, other]
-
Title: A covariate-adaptive test for replicability across multiple studies with false discovery rate controlSubjects: Methodology (stat.ME)
Replicability is a lynchpin for credible discoveries. The partial conjunction (PC) $p$-value, which combines individual base $p$-values from multiple similar studies, can gauge whether a feature of interest exhibits replicated signals across studies. However, when a large set of features is examined, as in high-throughput experiments, testing for their replicated signals simultaneously can be severely underpowered, due to both the multiplicity burden and inherent limitations of PC $p$-values. This power deficiency is markedly severe when replication is demanded for all studies under consideration, which is nonetheless the most natural and appealing benchmark for scientific generalizability that a practitioner may request.
We propose ParFilter, a general framework that marries the ideas of filtering and covariate-adaptiveness to power up large-scale testing for replicated signals as described above. It reduces the multiplicity burden by partitioning studies into smaller groups and borrowing the cross-group information to filter out unpromising features. Moreover, harnessing side information offered by auxiliary covariates whenever they are available, it can train informative hypothesis weights to encourage rejections of features more likely to exhibit replicated signals. We prove its finite-sample control on the false discovery rate, under both independence and arbitrary dependence among the base $p$-values across features. In simulations as well as a real case study on autoimmunity based on RNA-Seq data obtained from thymic cells, the ParFilter has demonstrated competitive performance against other existing methods for such replicability analyses.
- [101] arXiv:2505.22518 (replaced) [pdf, other]
-
Title: IGNIS: A Robust Neural Network Framework for Constrained Parameter Estimation in Archimedean Copulas
Comments: Under review
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Classical estimators, the cornerstones of statistical inference, face insurmountable challenges when applied to important emerging classes of Archimedean copulas. These models exhibit pathological properties, including numerically unstable densities, non-monotonic parameter-to-dependence mappings, and vanishingly small likelihood gradients, rendering methods like Maximum Likelihood (MLE) and Method of Moments (MoM) inconsistent or computationally infeasible. We introduce IGNIS, a unified neural estimation framework that sidesteps these barriers by learning a direct, robust mapping from data-driven dependency measures to the underlying copula parameter $\theta$. IGNIS utilizes a multi-input architecture and a theory-guided output layer ($\mathrm{softplus}(z) + 1$) to automatically enforce the domain constraint $\hat{\theta} \geq 1$. Trained and validated on four families (Gumbel, Joe, and the numerically challenging A1/A2), IGNIS delivers accurate and stable estimates for real-world financial and health datasets, demonstrating its necessity for reliable inference in modern, complex dependence models where traditional methods fail.
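The constraint layer described above reduces to a one-line transformation; the sketch below shows it in NumPy (the multi-input architecture and training loop are not reproduced).

```python
# The theory-guided output layer: map a raw network output z through
# softplus(z) + 1 so the estimated copula parameter satisfies theta_hat >= 1.
import numpy as np

def constrained_theta(z):
    softplus = np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))   # numerically stable softplus
    return softplus + 1.0                                          # enforce theta_hat >= 1

print(constrained_theta(np.array([-5.0, 0.0, 3.0])))               # every output is >= 1
```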
- [102] arXiv:2506.02394 (replaced) [pdf, html, other]
-
Title: Joint modeling for learning decision-making dynamics in behavioral experimentsSubjects: Methodology (stat.ME); Machine Learning (cs.LG)
Major depressive disorder (MDD), a leading cause of disability and mortality, is associated with reward-processing abnormalities and concentration issues. Motivated by the probabilistic reward task from the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) study, we propose a novel framework that integrates the reinforcement learning (RL) model and drift-diffusion model (DDM) to jointly analyze reward-based decision-making with response times. To account for emerging evidence suggesting that decision-making may alternate between multiple interleaved strategies, we model latent state switching using a hidden Markov model (HMM). In the ''engaged'' state, decisions follow an RL-DDM, simultaneously capturing reward processing, decision dynamics, and temporal structure. In contrast, in the ''lapsed'' state, decision-making is modeled using a simplified DDM, where specific parameters are fixed to approximate random guessing with equal probability. The proposed method is implemented using a computationally efficient generalized expectation-maximization (EM) algorithm with forward-backward procedures. Through extensive numerical studies, we demonstrate that our proposed method outperforms competing approaches across various reward-generating distributions, under both strategy-switching and non-switching scenarios, as well as in the presence of input perturbations. When applied to the EMBARC study, our framework reveals that MDD patients exhibit lower overall engagement than healthy controls and experience longer decision times when they do engage. Additionally, we show that neuroimaging measures of brain activities are associated with decision-making characteristics in the ''engaged'' state but not in the ''lapsed'' state, providing evidence of brain-behavior association specific to the ''engaged'' state.
- [103] arXiv:2506.11424 (replaced) [pdf, html, other]
-
Title: Local empirical Bayes correction for Bayesian modeling
Journal-ref: Osaka Keidai Ronshu, vol.68, no.4, pp.161-172, 2017
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
The James-Stein estimator has attracted much interest as a shrinkage estimator that yields better estimates than the maximum likelihood estimator. The James-Stein estimator is also very useful as an argument in favor of empirical Bayesian methods. However, for problems involving large-scale data, such as differential gene expression data, the distribution is considered a mixture distribution with different means that cannot be considered sufficiently close. Therefore, it is not appropriate to apply the James-Stein estimator. Efron (2011) proposed a local empirical Bayes correction that attempted to correct a selection bias for large-scale data.
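For reference, the classical James-Stein shrinkage estimator discussed above can be written in a few lines; the positive-part variant is shown. Efron's local empirical Bayes correction itself is not reproduced here.

```python
# Classical (positive-part) James-Stein estimator for x ~ N(mu, sigma^2 I), p >= 3:
# shrink the observation vector toward zero by a data-dependent factor.
import numpy as np

def james_stein(x, sigma2=1.0):
    p = x.size
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / np.sum(x ** 2))
    return shrink * x

rng = np.random.default_rng(0)
mu = np.full(10, 0.5)
x = mu + rng.normal(size=10)
print(np.sum((james_stein(x) - mu) ** 2) <= np.sum((x - mu) ** 2))   # typically True
```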
- [104] arXiv:2506.14822 (replaced) [pdf, other]
-
Title: Analysis and conditional optimization of projection estimates for distribution of random variable using Legendre polynomials
Journal-ref: Algorithms 2025, 18(8), 466
Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Probability (math.PR)
Algorithms for jointly obtaining projection estimates of the density and distribution function of a random variable using Legendre polynomials are proposed. For these algorithms, a problem of the conditional optimization is solved. Such optimization allows one to increase the approximation accuracy with minimum computational costs. The proposed algorithms are tested on examples with different degrees of smoothness of the density. A projection estimate of the density is compared to a histogram that is often used in applications to estimate distributions.
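A minimal sketch of the projection (orthogonal series) density estimate with Legendre polynomials on $[-1,1]$ is given below, with the coefficients $c_k = \frac{2k+1}{2}\,\mathbb{E}[P_k(X)]$ estimated by sample means; the paper's conditional optimization of the expansion order is not reproduced.

```python
# Projection (orthogonal series) density estimate with Legendre polynomials on
# [-1, 1]: c_k = (2k+1)/2 * E[P_k(X)], with the expectation replaced by a sample mean.
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=5000) * 2 - 1        # sample supported on [-1, 1]

K = 8                                        # expansion order (chosen by hand here)
coeffs = np.zeros(K + 1)
for k in range(K + 1):
    basis = np.zeros(K + 1); basis[k] = 1.0
    coeffs[k] = (2 * k + 1) / 2 * legendre.legval(x, basis).mean()

grid = np.linspace(-1, 1, 5)
print(legendre.legval(grid, coeffs).round(3))   # density estimate on a coarse grid
```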
- [105] arXiv:2506.16872 (replaced) [pdf, html, other]
-
Title: Unveiling Complex Territorial Socio-Economic Dynamics: A Statistical Mechanics Approach
Comments: This version includes minor corrections of typographical errors, improved clarity and accessibility of the text for readers in the fields of geographical systems and social indicators, and updated references to recent literature
Subjects: Methodology (stat.ME)
This study proposes a novel approach based on the Ising model for analyzing the observed territorial configuration of a network of municipalities classified as being central hubs or peripheral areas. This configuration is interpreted as a reference state of a system of interacting binary territorial units. The socio-economic structure of the municipalities is synthesized into interpretable composite indices, which are further aggregated by means of Principal Components Analysis in order to reduce dimensionality and construct a univariate external field compatible with the Ising framework. Monte Carlo simulations via parallel computing are conducted adopting a Simulated Annealing variant of the classic Metropolis-Hastings algorithm. This ensures an efficient local exploration of the configuration space in the neighbourhood of the reference configuration of the system. Model consistency is assessed both in terms of energy stability and the likelihood of these configurations. The comparison between the observed configuration and the simulated ones is crucial in the analysis of multivariate phenomena, concomitantly accounting for territorial interactions. Model uncertainty in estimating the probability of each municipality being a central hub or peripheral area is quantified by adopting the model-agnostic Conformal Prediction framework, which yields adaptive intervals with guaranteed coverage. The innovative use of geographical maps of the prediction intervals renders this approach an effective tool. It combines statistical mechanics, multivariate analysis and uncertainty quantification, providing a robust and interpretable framework for modeling socio-economic territorial dynamics, with potential applications in Official Statistics.
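A schematic simulated-annealing Metropolis sweep for a binary (Ising) configuration with an external field is sketched below; the interaction graph, coupling, field, and cooling schedule are illustrative stand-ins for the quantities constructed in the paper.

```python
# Schematic simulated-annealing Metropolis sweeps for an Ising configuration of
# binary units with an external field h (e.g. a PCA-based composite index).
import numpy as np

rng = np.random.default_rng(0)
n, J = 100, 0.5
adj = np.triu(rng.random((n, n)) < 0.05, 1).astype(int)
adj = adj + adj.T                                 # toy symmetric interaction graph
h = rng.normal(size=n)                            # external field values
s = rng.choice([-1, 1], size=n)                   # initial spin configuration

def delta_energy(s, i):
    # energy change of flipping spin i for E(s) = -J * sum_{i<j} a_ij s_i s_j - sum_i h_i s_i
    return 2 * s[i] * (J * adj[i] @ s + h[i])

for T in np.geomspace(2.0, 0.05, 200):            # cooling schedule
    for i in rng.permutation(n):
        dE = delta_energy(s, i)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            s[i] = -s[i]

print(int((s == 1).sum()), "units end in the +1 state")
```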
- [106] arXiv:2507.14666 (replaced) [pdf, html, other]
-
Title: What Quality Engineers Need to Know about Degradation Models
Jared M. Clark, Jie Min, Mingyang Li, Richard L. Warr, Stephanie P. DeHart, Caleb B. King, Lu Lu, Yili Hong
Comments: 37 pages, 16 figures
Subjects: Applications (stat.AP)
Degradation models play a critical role in quality engineering by enabling the assessment and prediction of system reliability based on data. The objective of this paper is to provide an accessible introduction to degradation models. We explore commonly used degradation data types, including repeated measures degradation data and accelerated destructive degradation test data, and review modeling approaches such as general path models and stochastic process models. Key inference problems, including reliability estimation and prediction, are addressed. Applications across diverse fields, including material science, renewable energy, civil engineering, aerospace, and pharmaceuticals, illustrate the broad impact of degradation models in industry. We also discuss best practices for quality engineers, software implementations, and challenges in applying these models. This paper aims to provide quality engineers with a foundational understanding of degradation models, equipping them with the knowledge necessary to apply these techniques effectively in real-world scenarios.
- [107] arXiv:2507.17867 (replaced) [pdf, html, other]
-
Title: Spatialize v1.0: A Python/C++ Library for Ensemble Spatial Interpolation
Felipe Navarro, Alvaro F. Egaña, Alejandro Ehrenfeld, Felipe Garrido, María Jesús Valenzuela, Juan F. Sánchez-Pérez
Subjects: Methodology (stat.ME)
In this paper, we present Spatialize, an open-source library that implements ensemble spatial interpolation, a novel method that combines the simplicity of basic interpolation methods with the power of classical geostatistical tools, like Kriging. It leverages the richness of stochastic modelling and ensemble learning, making it robust, scalable and suitable for large datasets. In addition, Spatialize provides a powerful framework for uncertainty quantification, offering both point estimates and empirical posterior distributions. It is implemented in Python 3.x, with a C++ core for improved performance, and is designed to be easy to use, requiring minimal user intervention. This library aims to bridge the gap between expert and non-expert users of geostatistics by providing automated tools that rival traditional geostatistical methods. Here, we present a detailed description of Spatialize along with a wealth of examples of its use.
- [108] arXiv:2507.18737 (replaced) [pdf, other]
-
Title: Robust Tail Index Estimation under Random Censoring via Minimum Density Power DivergenceSubjects: Statistics Theory (math.ST)
We introduce a robust estimator for the tail index of a Pareto-type distribution under random right censoring, developed within the framework of the minimum density power divergence. To the best of our knowledge, this is the first approach to integrate density power divergence into the context of randomly censored extreme value models, thus opening a new path for robust inference in this setting. Under general regularity conditions, the proposed estimator is shown to be consistent and asymptotically normal. Its finite-sample behavior is thoroughly assessed through an extensive simulation study, which highlights its improved robustness and efficiency compared to existing methods. Finally, the practical relevance of the method is illustrated through an application to a real AIDS survival dataset.
- [109] arXiv:2507.18749 (replaced) [pdf, html, other]
-
Title: Tree-structured Ising models under mean parameterizationSubjects: Statistics Theory (math.ST); Probability (math.PR)
We assess advantages of expressing tree-structured Ising models via their mean parameterization rather than their commonly chosen canonical parameterization. This includes fixedness of marginal distributions, often convenient for dependence modeling, and the dispelling of the intractable normalizing constant otherwise hindering Ising models. We derive an analytic expression for the joint probability generating function of mean-parameterized tree-structured Ising models, conferring efficient computation methods for the distribution of the sum of its constituent random variables. The mean parameterization also allows for a stochastic representation of Ising models, providing straightforward sampling methods. We furthermore show that Markov random fields with fixed Poisson marginal distributions may act as an efficient and accurate approximation for tree-structured Ising models, in the spirit of Poisson approximation.
- [110] arXiv:2507.19028 (replaced) [pdf, html, other]
-
Title: Nonparametric Linear Discriminant Analysis for High Dimensional Matrix-Valued Data
Comments: 23 pages, 12 figures, 3 tables
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML)
This paper addresses classification problems with matrix-valued data, which commonly arises in applications such as neuroimaging and signal processing. Building on the assumption that the data from each class follows a matrix normal distribution, we propose a novel extension of Fisher's Linear Discriminant Analysis (LDA) tailored for matrix-valued observations. To effectively capture structural information while maintaining estimation flexibility, we adopt a nonparametric empirical Bayes framework based on Nonparametric Maximum Likelihood Estimation (NPMLE), applied to vectorized and scaled matrices. The NPMLE method has been shown to provide robust, flexible, and accurate estimates for vector-valued data with various structures in the mean vector or covariance matrix. By leveraging its strengths, our method is effectively generalized to the matrix setting, thereby improving classification performance. Through extensive simulation studies and real data applications, including electroencephalography (EEG) and magnetic resonance imaging (MRI) analysis, we demonstrate that the proposed method consistently outperforms existing approaches across a variety of data structures.
- [111] arXiv:2209.10675 (replaced) [pdf, html, other]
-
Title: A Validation Approach to Over-parameterized Matrix and Image Recovery
Comments: 32 pages and 10 figures
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
This paper studies the problem of recovering a low-rank matrix from several noisy random linear measurements. We consider the setting where the rank of the ground-truth matrix is unknown a priori and use an objective function built from a rank-overspecified factored representation of the matrix variable, where the global optimal solutions overfit and do not correspond to the underlying ground truth. We then solve the associated nonconvex problem using gradient descent with small random initialization. We show that as long as the measurement operators satisfy the restricted isometry property (RIP) with its rank parameter scaling with the rank of the ground-truth matrix rather than scaling with the overspecified matrix rank, gradient descent iterations are on a particular trajectory towards the ground-truth matrix and achieve nearly information-theoretically optimal recovery when it is stopped appropriately. We then propose an efficient stopping strategy based on the common hold-out method and show that it detects a nearly optimal estimator provably. Moreover, experiments show that the proposed validation approach can also be efficiently used for image restoration with deep image prior, which over-parameterizes an image with a deep network.
- [112] arXiv:2302.00982 (replaced) [pdf, html, other]
-
Title: Stochastic optimal transport in Banach Spaces for regularized estimation of multivariate quantiles
Comments: 32 pages, 6 figures
Subjects: Probability (math.PR); Machine Learning (stat.ML)
We introduce a new stochastic algorithm for solving entropic optimal transport (EOT) between two absolutely continuous probability measures $\mu$ and $\nu$. Our work is motivated by the specific setting of Monge-Kantorovich quantiles where the source measure $\mu$ is either the uniform distribution on the unit hypercube or the spherical uniform distribution. Using the knowledge of the source measure, we propose to parametrize a Kantorovich dual potential by its Fourier coefficients. In this way, each iteration of our stochastic algorithm reduces to two Fourier transforms that enables us to make use of the Fast Fourier Transform (FFT) in order to implement a fast numerical method to solve EOT. We study the almost sure convergence of our stochastic algorithm that takes its values in an infinite-dimensional Banach space. Then, using numerical experiments, we illustrate the performances of our approach on the computation of regularized Monge-Kantorovich quantiles. In particular, we investigate the potential benefits of entropic regularization for the smooth estimation of multivariate quantiles using data sampled from the target measure $\nu$.
- [113] arXiv:2305.15612 (replaced) [pdf, html, other]
-
Title: Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning
Comments: Accepted at the 42nd International Conference on Machine Learning (ICML 2025)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool, and analyze the validity of our methods in diverse experiments.
- [114] arXiv:2306.15000 (replaced) [pdf, html, other]
-
Title: Identifying Socially Disruptive Policies
Comments: An R package for implementation can be found at this https URL
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
Social disruption occurs when a policy creates or destroys many network connections between agents. It is a costly side effect of many interventions and so a growing empirical literature recommends measuring and accounting for social disruption when evaluating the welfare impact of a policy. However, there is currently little work characterizing what can actually be learned about social disruption from data in practice. In this paper, we consider the problem of identifying social disruption in an experimental setting. We show that social disruption is not generally point identified, but informative bounds can be constructed by rearranging the eigenvalues of the marginal distribution of network connections between pairs of agents identified from the experiment. We apply our bounds to the setting of Banerjee et al. (2021) and find large disruptive effects that the authors miss by only considering regression estimates.
- [115] arXiv:2309.10370 (replaced) [pdf, html, other]
-
Title: Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization
Comments: AMS Latex, 25 pages. Exposition has been streamlined
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Optimization and Control (math.OC); Machine Learning (stat.ML)
In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $\mathcal{L}^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.
- [116] arXiv:2310.04115 (replaced) [pdf, html, other]
-
Title: Markov chain entropy games and the geometry of their Nash equilibria
Comments: 29 pages, 2 figures
Subjects: Probability (math.PR); Information Theory (cs.IT); Optimization and Control (math.OC); Computation (stat.CO)
We introduce and study a two-player zero-sum game between a probabilist and Nature defined by a convex function $f$, a finite collection $\mathcal{B}$ of Markov generators (or its convex hull), and a target distribution $\pi$. The probabilist selects a mixed strategy $\mu \in \mathcal{P}(\mathcal{B})$, the set of probability measures on $\mathcal{B}$, while Nature adopts a pure strategy and selects a $\pi$-reversible Markov generator $M$. The probabilist receives a payoff equal to the $f$-divergence $D_f(M \| L)$, where $L$ is drawn according to $\mu$. We prove that this game always admits a mixed strategy Nash equilibrium and satisfies a minimax identity. In contrast, a pure strategy equilibrium may fail to exist. We develop a projected subgradient method to compute approximate mixed strategy equilibria with provable convergence guarantees. Connections to information centroids, Chebyshev centers, and Bayes risk are discussed. This paper extends earlier minimax results on $f$-divergences to the context of Markov generators.
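As a rough illustration of the projected subgradient step, the sketch below (Python/NumPy) treats the game as a finite matrix game in which `P[i, j]` would hold a precomputed divergence $D_f(M_i \| L_j)$; the divergence evaluation itself, the step-size schedule, and the finite discretization of Nature's strategies are assumptions made only for this sketch, not the paper's construction.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def approx_mixed_equilibrium(P, steps=2000, lr=0.5):
    """Projected supergradient ascent on mu -> min_i (P @ mu)_i for a payoff matrix P."""
    m, n = P.shape
    mu = np.full(n, 1.0 / n)
    mu_avg = np.zeros(n)
    for t in range(1, steps + 1):
        i_star = np.argmin(P @ mu)                               # Nature's pure best response
        mu = project_simplex(mu + lr / np.sqrt(t) * P[i_star])   # supergradient step
        mu_avg += mu
    return mu_avg / steps                                        # averaged iterate as approximate equilibrium
```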
- [117] arXiv:2312.09857 (replaced) [pdf, html, other]
-
Title: Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark
Comments: Published in Data Mining and Knowledge Discovery
Journal-ref: Ismail Fawaz, H., Del Grosso, G., Kerdoncuff, T., Boisbunon, A., & Saffar, I. (2025). Deep unsupervised domain adaptation for time series classification: a benchmark: HI Fawaz et al. Data Mining and Knowledge Discovery, 39(4), 39
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Unsupervised Domain Adaptation (UDA) aims to harness labeled source data to train models for unlabeled target data. Despite extensive research in domains like computer vision and natural language processing, UDA remains underexplored for time series data, which has widespread real-world applications ranging from medicine and manufacturing to earth observation and human activity recognition. Our paper addresses this gap by introducing a comprehensive benchmark for evaluating UDA techniques for time series classification, with a focus on deep learning methods. We provide seven new benchmark datasets covering various domain shifts and temporal dynamics, facilitating fair and standardized UDA method assessments with state-of-the-art neural network backbones (e.g., Inception) for time series data. This benchmark offers insights into the strengths and limitations of the evaluated approaches while preserving the unsupervised nature of domain adaptation, making it directly applicable to practical problems. Our paper serves as a vital resource for researchers and practitioners, advancing domain adaptation solutions for time series data and fostering innovation in this critical field. The implementation code of this benchmark is available at this https URL.
- [118] arXiv:2403.16459 (replaced) [pdf, html, other]
-
Title: On the rates of convergence for learning with convolutional neural networks
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
We study the approximation and learning capacities of convolutional neural networks (CNNs) with one-sided zero-padding and multiple channels. Our first result proves a new approximation bound for CNNs with a certain constraint on the weights. Our second result gives a new analysis of the covering number of feed-forward neural networks with CNNs as special cases. The analysis carefully takes into account the size of the weights and hence gives better bounds than the existing literature in some situations. Using these two results, we derive rates of convergence for estimators based on CNNs in many learning problems. In particular, we establish minimax optimal convergence rates of least squares estimators based on CNNs for learning smooth functions in the nonparametric regression setting. For binary classification, we derive convergence rates for CNN classifiers with hinge loss and logistic loss. We also show that the obtained rates for classification are minimax optimal in some common settings.
- [119] arXiv:2406.09069 (replaced) [pdf, html, other]
-
Title: On the Robustness of Global Feature Effect Explanations
Comments: Accepted at ECML PKDD 2024
Journal-ref: Machine Learning and Knowledge Discovery in Databases, vol. 2, pp. 125-142, 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.
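For readers unfamiliar with the explanation methods being bounded, the short sketch below (Python/NumPy; function and argument names are illustrative) computes a plain partial dependence curve, the kind of object whose sensitivity to data and model perturbations the paper quantifies.

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average prediction when the chosen feature is clamped to each grid value."""
    curve = []
    for value in grid:
        X_clamped = X.copy()
        X_clamped[:, feature] = value          # intervene on one feature, keep the rest
        curve.append(model.predict(X_clamped).mean())
    return np.array(curve)
```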
- [120] arXiv:2407.21420 (replaced) [pdf, other]
-
Title: Whitney extensions on symmetric spaces
Comments: 28 pages
Subjects: Representation Theory (math.RT); Machine Learning (stat.ML)
In 1934, H. Whitney introduced the problem of extending a function on a set of points in $\mathbb{R}^n$ to an analytic function on the ambient space. In this article we prove Whitney type extension theorems for data on some homogeneous spaces. We use harmonic analysis on the homogeneous spaces and representation theory of compact as well as noncompact reductive groups.
- [121] arXiv:2411.16715 (replaced) [pdf, html, other]
-
Title: PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification
Comments: arXiv admin note: text overlap with arXiv:2409.06111
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Convolutional neural networks (CNNs) are extremely popular and effective for image classification tasks but tend to be overly confident in their predictions. Various works have sought to quantify uncertainty associated with these models, detect out-of-distribution (OOD) inputs, or identify anomalous regions in an image, but limited work has sought to develop a holistic approach that can accurately estimate perception model confidence across various sources of uncertainty. We develop a probabilistic and reconstruction-based competency estimation (PaRCE) method and compare it to existing approaches for uncertainty quantification and OOD detection. We find that our method can best distinguish between correctly classified, misclassified, and OOD samples with anomalous regions, as well as between samples with visual image modifications resulting in high, medium, and low prediction accuracy. We describe how to extend our approach for anomaly localization tasks and demonstrate the ability of our approach to distinguish between regions in an image that are familiar to the perception model from those that are unfamiliar. We find that our method generates interpretable scores that most reliably capture a holistic notion of perception model confidence.
- [122] arXiv:2412.11174 (replaced) [pdf, html, other]
-
Title: Semi-Supervised Risk Control via Prediction-Powered Inference
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The risk-controlling prediction sets (RCPS) framework is a general tool for transforming the output of any machine learning model to design a predictive rule with rigorous error rate control. The key idea behind this framework is to use labeled hold-out calibration data to tune a hyper-parameter that affects the error rate of the resulting prediction rule. However, the limitation of such a calibration scheme is that with limited hold-out data, the tuned hyper-parameter becomes noisy and leads to a prediction rule with an error rate that is often unnecessarily conservative. To overcome this sample-size barrier, we introduce a semi-supervised calibration procedure that leverages unlabeled data to rigorously tune the hyper-parameter without compromising statistical validity. Our procedure builds upon the prediction-powered inference framework, carefully tailoring it to risk-controlling tasks. We demonstrate the benefits and validity of our proposal through two real-data experiments: few-shot image classification and early time series classification.
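As background, a minimal labeled-data-only RCPS calibration looks roughly like the sketch below (Python/NumPy; the Hoeffding bound, the monotone-risk assumption, and the variable names are illustrative). The paper's contribution is to replace the purely labeled risk estimate with a prediction-powered one that also uses unlabeled data, which this sketch does not implement.

```python
import numpy as np

def rcps_calibrate(losses_by_lambda, lambdas, alpha=0.1, delta=0.05):
    """Return the first lambda whose risk upper-confidence bound falls below alpha."""
    n = losses_by_lambda.shape[1]                         # number of labeled hold-out examples
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))      # Hoeffding slack for losses in [0, 1]
    for lam, losses in zip(lambdas, losses_by_lambda):    # lambdas ordered so risk decreases
        if losses.mean() + slack <= alpha:
            return lam
    return None                                           # no lambda can be certified at this alpha
```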
- [123] arXiv:2502.05719 (replaced) [pdf, other]
-
Title: Extended Histogram-based Outlier Score (EHBOS)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Histogram-Based Outlier Score (HBOS) is a widely used outlier or anomaly detection method known for its computational efficiency and simplicity. However, its assumption of feature independence limits its ability to detect anomalies in datasets where interactions between features are critical. In this paper, we propose the Extended Histogram-Based Outlier Score (EHBOS), which enhances HBOS by incorporating two-dimensional histograms to capture dependencies between feature pairs. This extension allows EHBOS to identify contextual and dependency-driven anomalies that HBOS fails to detect. We evaluate EHBOS on 17 benchmark datasets, demonstrating its effectiveness and robustness across diverse anomaly detection scenarios. EHBOS outperforms HBOS on several datasets, particularly those where feature interactions are critical in defining the anomaly structure, achieving notable improvements in ROC AUC. These results highlight that EHBOS can be a valuable extension to HBOS, with the ability to model complex feature dependencies. EHBOS offers a powerful new tool for anomaly detection, particularly in datasets where contextual or relational anomalies play a significant role.
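One plausible reading of the pairwise extension, written as a short NumPy sketch; the bin count, the smoothing constant, and the simple sum over all feature pairs are assumptions made for illustration, and the paper's exact construction may differ.

```python
import numpy as np
from itertools import combinations

def ehbos_scores(X, bins=10, eps=1e-12):
    """Sum of negative log 2-D histogram densities over all feature pairs (higher = more anomalous)."""
    n, d = X.shape
    scores = np.zeros(n)
    for i, j in combinations(range(d), 2):
        hist, xe, ye = np.histogram2d(X[:, i], X[:, j], bins=bins, density=True)
        xi = np.clip(np.digitize(X[:, i], xe[1:-1]), 0, bins - 1)   # 2-D bin index of each point
        yj = np.clip(np.digitize(X[:, j], ye[1:-1]), 0, bins - 1)
        scores += -np.log(hist[xi, yj] + eps)                       # eps guards empty bins
    return scores
```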
- [124] arXiv:2502.10505 (replaced) [pdf, html, other]
-
Title: Preference learning made easy: Everything should be understood through win rate
Comments: ICML 2025
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices that affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
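For reference, the win-rate evaluation the paper centers on can be estimated as simply as the sketch below (Python); the `judge` callable returning +1 / 0 / -1 is a hypothetical stand-in for the preference data or annotator.

```python
def win_rate(judge, model_samples, reference_samples):
    """Fraction of pairwise comparisons the model wins; ties count as half a win."""
    wins = 0.0
    for ours, theirs in zip(model_samples, reference_samples):
        pref = judge(ours, theirs)      # +1 if ours preferred, -1 if theirs, 0 if tied
        if pref > 0:
            wins += 1.0
        elif pref == 0:
            wins += 0.5
    return wins / len(model_samples)
```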
- [125] arXiv:2502.18826 (replaced) [pdf, html, other]
-
Title: Adversarial Combinatorial Semi-bandits with Graph Feedback
Comments: To appear in ICML 2025
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde\Theta(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD) that only realizes convexified actions in expectation is suboptimal. In addition, we describe the problem of \emph{combinatorial semi-bandits with general capacity} and apply our results to derive an improved regret upper bound, which may be of independent interest.
- [126] arXiv:2503.00810 (replaced) [pdf, html, other]
-
Title: Minimax Optimal Reinforcement Learning with Quasi-Optimism
Comments: Minor corrections to constant factors
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In our quest for a reinforcement learning (RL) algorithm that is both practical and provably optimal, we introduce EQO (Exploration via Quasi-Optimism). Unlike existing minimax optimal approaches, EQO avoids reliance on empirical variances and employs a simple bonus term proportional to the inverse of the state-action visit count. Central to EQO is the concept of quasi-optimism, where estimated values need not be fully optimistic, allowing for a simpler yet effective exploration strategy. The algorithm achieves the sharpest known regret bound for tabular RL under the mildest assumptions, proving that fast convergence can be attained with a practical and computationally efficient approach. Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, offering the best of both worlds: theoretical soundness and practical effectiveness.
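To illustrate the shape of such a bonus (not the authors' algorithm or its regret analysis), the NumPy sketch below runs value iteration on an estimated tabular MDP with a quasi-optimistic bonus proportional to the inverse visit count; the discounted setting, the constant `c`, and the array layout are assumptions made only for this sketch.

```python
import numpy as np

def quasi_optimistic_q(R_hat, P_hat, N, c=1.0, gamma=0.95, iters=500):
    """Value iteration with a bonus of c / N(s, a) added to estimated rewards.

    R_hat: (S, A) empirical mean rewards, P_hat: (S, A, S) empirical transitions,
    N: (S, A) state-action visit counts.
    """
    bonus = c / np.maximum(N, 1)              # inverse-count bonus, never dividing by zero
    Q = np.zeros(R_hat.shape)
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R_hat + bonus + gamma * (P_hat @ V)   # (S, A, S) @ (S,) -> (S, A)
    return Q
```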
- [127] arXiv:2503.09722 (replaced) [pdf, html, other]
-
Title: The Pitfalls of Imitation Learning when Actions are Continuous
Comments: 98 pages, 2 figures, updated proof sketch
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics satisfy a control-theoretic property called exponential stability (i.e. the effects of perturbations decay exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to any algorithm which learns solely from expert data, including both behavior cloning and offline-RL algorithms, unless the algorithm produces highly "improper" imitator policies--those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity--or unless the expert trajectory distribution is sufficiently "spread." We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and diffusion policies). We also establish a host of complementary negative and positive results for imitation in control systems.
- [128] arXiv:2505.18300 (replaced) [pdf, html, other]
-
Title: Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs
Comments: Accepted at ICML 2025 (Oral)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a history-driven target (HDT) framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from a target distribution $\boldsymbol{\mu}$. With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW's reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of directly modifying the transition kernel, HDT introduces a history-dependent target distribution $\boldsymbol{\pi}[\mathbf{x}]$ to replace the original target $\boldsymbol{\mu}$ in any graph sampler, where $\mathbf{x}$ represents the empirical measure of past visits. This design preserves a lightweight implementation by requiring only local information about the current and proposed states and achieves compatibility with both reversible and non-reversible MCMC samplers, while retaining unbiased samples with target distribution $\boldsymbol{\mu}$ and near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.
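A rough sketch of the plug-in nature of the idea (Python/NumPy): an ordinary Metropolis-Hastings sampler on a graph, with the fixed target replaced by a history-dependent reweighting of $\boldsymbol{\mu}$. The specific polynomial reweighting `(x_i / mu_i)^(-alpha)` is borrowed from the self-repellent random walk literature purely for illustration; the paper's HDT construction may differ.

```python
import numpy as np

def hdt_metropolis(neighbors, mu, steps, alpha=1.0, seed=0):
    """Metropolis-Hastings against a history-dependent target pi_x(i) ~ mu_i * (x_i / mu_i)^(-alpha)."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    x = np.full(n, 1.0 / n)                   # smoothed empirical measure of past visits
    cur = rng.integers(n)
    visits = []
    for t in range(1, steps + 1):
        pi = mu * (x / mu) ** (-alpha)        # history-driven target (unnormalized)
        prop = rng.choice(neighbors[cur])     # uniform proposal over the current node's neighbors
        accept = (pi[prop] * len(neighbors[cur])) / (pi[cur] * len(neighbors[prop]))
        if rng.random() < min(1.0, accept):
            cur = prop
        visits.append(cur)
        x *= t / (t + 1)                      # update empirical visit frequencies
        x[cur] += 1.0 / (t + 1)
    return visits                             # per the abstract, long-run frequencies track mu
```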
- [129] arXiv:2506.09853 (replaced) [pdf, other]
-
Title: Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME)
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
- [130] arXiv:2506.16550 (replaced) [pdf, html, other]
-
Title: A Free Probabilistic Framework for Analyzing the Transformer-based Language Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present a formal operator-theoretic framework for analyzing Transformer-based language models using free probability theory. By modeling token embeddings and attention mechanisms as self-adjoint operators in a tracial \( W^* \)-probability space, we reinterpret attention as non-commutative convolution and describe representation propagation via free additive convolution. This leads to a spectral dynamic system interpretation of deep Transformers. We derive entropy-based generalization bounds under freeness assumptions and provide insight into positional encoding, spectral evolution, and representational complexity. This work offers a principled, though theoretical, perspective on structural dynamics in large language models.
- [131] arXiv:2507.03511 (replaced) [pdf, html, other]
-
Title: Nonparametric regression for cost-effectiveness analyses with observational data - a tutorial
Subjects: Econometrics (econ.EM); Applications (stat.AP)
Healthcare decision-making often requires selecting among treatment options under budget constraints, particularly when one option is more effective but also more costly. Cost-effectiveness analysis (CEA) provides a framework for evaluating whether the health benefits of a treatment justify its additional costs. A key component of CEA is the estimation of treatment effects on both health outcomes and costs, which becomes challenging when using observational data, due to potential confounding. While advanced causal inference methods exist for use in such circumstances, their adoption in CEAs remains limited, with many studies relying on overly simplistic methods such as linear regression or propensity score matching. We believe that this is mainly due to health economists being generally unfamiliar with superior methodology. In this paper, we address this gap by introducing cost-effectiveness researchers to modern nonparametric regression models, with a particular focus on Bayesian Additive Regression Trees (BART). We provide practical guidance on how to implement BART in CEAs, including code examples, and discuss its advantages in producing more robust and credible estimates from observational data.
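For readers who want to see the overall shape of such an analysis, here is a heavily simplified g-computation sketch in Python. Scikit-learn's gradient boosting is used as a stand-in for BART purely so the snippet runs without extra dependencies; the tutorial itself works with BART and with observational-data adjustments that this sketch glosses over, and all function and variable names here are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def incremental_cost_effectiveness(X, a, effect, cost):
    """Contrast predicted outcomes under treatment vs. control and form an ICER."""
    def average_contrast(y):
        model = GradientBoostingRegressor().fit(np.column_stack([a, X]), y)
        treated = np.column_stack([np.ones_like(a), X])    # set everyone to treated
        control = np.column_stack([np.zeros_like(a), X])   # set everyone to control
        return (model.predict(treated) - model.predict(control)).mean()

    delta_effect = average_contrast(effect)   # incremental health effect
    delta_cost = average_contrast(cost)       # incremental cost
    return delta_cost / delta_effect, delta_effect, delta_cost

# Usage sketch: icer, d_eff, d_cost = incremental_cost_effectiveness(X, treatment, qalys, costs)
```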
- [132] arXiv:2507.09061 (replaced) [pdf, other]
-
Title: Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction
Comments: Exposition and experiments have been deemed insufficient. Major long-term revisions are desired.
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
We study the problem of imitating an expert demonstrator in a continuous state-and-action dynamical system. While imitation learning in discrete settings such as autoregressive language modeling has seen immense success and popularity in recent years, imitation in physical settings such as autonomous driving and robot learning has proven comparably more complex due to the compounding errors problem, often requiring elaborate set-ups to perform stably. Recent work has demonstrated that even in benign settings, exponential compounding errors are unavoidable when learning solely from expert-controlled trajectories, suggesting the need for more advanced policy parameterizations or data augmentation. To this end, we present minimal interventions that provably mitigate compounding errors in continuous state-and-action imitation learning. When the system is open-loop stable, we prescribe "action chunking," i.e., predicting and playing sequences of actions in open-loop; when the system is possibly unstable, we prescribe "noise injection," i.e., adding noise during expert demonstrations. These interventions align with popular choices in modern robot learning, though the benefits we derive are distinct from the effects they were designed to target. Our results draw insights and tools from both control theory and reinforcement learning; however, our analysis reveals novel considerations that do not naturally arise when either literature is considered in isolation.
- [133] arXiv:2507.11274 (replaced) [pdf, other]
-
Title: Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime
Comments: 30 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting -- particularly with large (constant) stepsizes -- has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $\beta$-smooth convex loss functions with stepsize $0 < \eta < 2/\beta$, the last iterate exhibits expected excess risk $\widetilde{O}(\frac{1}{\eta (2-\beta \eta) T^{1-\beta\eta/2}} + \frac{\eta}{(2-\beta\eta)^2} T^{\beta\eta/2} \sigma_\star^2)$, where $\sigma_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $\sigma_\star=0$ we obtain a rate of $\smash{O(1/\sqrt T)}$ with $\eta=1/\beta$, improving upon the best-known $\smash{O(T^{-1/4})}$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.