Quantitative Biology
Showing new listings for Tuesday, 29 July 2025
- [1] arXiv:2507.19553 [pdf, html, other]
Title: Theoretical modeling and quantitative research on aquatic ecosystems driven by multiple factors
Comments: 9 pages, 7 figures
Subjects: Quantitative Methods (q-bio.QM)
Understanding the complex interactions between water temperature, nutrient levels, and chlorophyll-a dynamics is essential for addressing eutrophication and the proliferation of harmful algal blooms in freshwater ecosystems. However, many existing studies tend to oversimplify these relationships, often neglecting the non-linear effects and long-term temporal variations that influence chlorophyll-a growth. Here, we conducted multi-year field monitoring (2020-2024) of the key environmental factors, including total nitrogen (TN), total phosphorus (TP), water temperature, and chlorophyll-a, across three water bodies in Guangdong Province, China: Tiantangshan Reservoir (S1), Baisha River Reservoir (S2), and Meizhou Reservoir (S3). Based on the collected data, we developed a multi-factor interaction model to quantitatively assess the spatiotemporal dynamics of chlorophyll-a and its environmental drivers. Our research reveals significant temporal and spatial variability in chlorophyll-a concentrations, with strong positive correlations to TN, TP, and water temperature. Long-term data from S1 and S2 demonstrate a clear trend of increasing eutrophication, with TN emerging as a more influential factor than TP in chlorophyll-a proliferation. The developed model accurately reproduces observed patterns, offering a robust theoretical basis for future predictive and management-oriented studies of aquatic ecosystem health.
- [2] arXiv:2507.19565 [pdf, other]
Title: Review of Deep Learning Applications to Structural Proteomics Enabled by Cryogenic Electron Microscopy and Tomography
Comments: 16 pages
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The past decade's "cryoEM revolution" has produced exponential growth in high-resolution structural data through advances in cryogenic electron microscopy (cryoEM) and tomography (cryoET). Deep learning integration into structural proteomics workflows addresses longstanding challenges including low signal-to-noise ratios, preferred orientation artifacts, and missing-wedge problems that historically limited efficiency and scalability. This review examines AI applications across the entire cryoEM pipeline, from automated particle picking using convolutional neural networks (Topaz, crYOLO, CryoSegNet) to computational solutions for preferred orientation bias (spIsoNet, cryoPROS) and advanced denoising algorithms (Topaz-Denoise). In cryoET, tools like IsoNet employ U-Net architectures for simultaneous missing-wedge correction and noise reduction, while TomoNet streamlines subtomogram averaging through AI-driven particle detection. The workflow culminates with automated atomic model building using sophisticated tools like ModelAngelo, DeepTracer, and CryoREAD that translate density maps into interpretable biological structures. These AI-enhanced approaches have achieved near-atomic resolution reconstructions with minimal manual intervention, resolved previously intractable datasets suffering from severe orientation bias, and enabled successful application to diverse biological systems from HIV virus-like particles to in situ ribosomal complexes. As deep learning evolves, particularly with large language models and vision transformers, the future promises sophisticated automation and accessibility in structural biology, potentially revolutionizing our understanding of macromolecular architecture and function.
- [3] arXiv:2507.19659 [pdf, other]
Title: Posterior bounds on divergence time of two sequences under dependent-site evolutionary models
Subjects: Populations and Evolution (q-bio.PE); Probability (math.PR)
Let x and y be two length n DNA sequences, and suppose we would like to estimate the divergence time T. A well-known, simple but crude estimate of T is p := d(x,y)/n, the fraction of mutated sites (the p-distance). We establish a posterior concentration bound on T, showing that the posterior distribution of T concentrates within a logarithmic factor of p when d(x,y)log(n)/n = o(1). Our bounds hold under a large class of evolutionary models, including many standard models that incorporate site dependence. As a special case, we show that T exceeds p with vanishingly small posterior probability as n increases under models with constant mutation rates, complementing the result of Mihaescu and Steel (Appl Math Lett 23(9):975--979, 2010). Our approach is based on bounding sequence transition probabilities in various convergence regimes of the underlying evolutionary process. Our result may be useful for improving the efficiency of iterative optimization and sampling schemes for estimating divergence times in phylogenetic inference.
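The p-distance at the heart of this bound is straightforward to compute. A minimal sketch follows; the Jukes-Cantor correction is included only as a standard companion estimate and is not part of the paper's result.

```python
import math

def p_distance(x: str, y: str) -> float:
    """Fraction of mismatched sites between two aligned sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be nonempty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jc_distance(p: float) -> float:
    """Jukes-Cantor correction of the p-distance (illustrative companion;
    valid for p < 0.75 under the JC69 model)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For example, `p_distance("ACGT", "ACGA")` is 0.25, and the JC correction inflates it to account for multiple substitutions at the same site.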
- [4] arXiv:2507.19711 [pdf, html, other]
Title: Pre-exposure prophylaxis and syphilis in men who have sex with men: a network analysis
Subjects: Populations and Evolution (q-bio.PE)
Pre-exposure prophylaxis (PrEP) has been established as an effective tool for preventing HIV infection among men who have sex with men (MSM). However, PrEP usage may lead to an increased number of sexual partners and, in turn, increased transmission of non-HIV sexually transmitted infections such as syphilis. Here we take a network perspective to examine this possibility, using data on sexual partnerships, demographics, PrEP usage, and syphilis among MSM in Columbus, Ohio. We use a recently developed community detection algorithm, an adaptation of the community detection algorithm InfoMap to absorbing random walks, to identify clusters of people (`communities') that may drive syphilis transmission. Our community detection approach takes into account both sexual partnerships as well as syphilis treatment rates when detecting communities. We apply this algorithm to sexual networks fitted to empirical data from the Network Epidemiology of Syphilis Transmission (NEST) study in Columbus, Ohio. We assume that PrEP usage is associated with regular visits to a sexual health provider, and thus is correlated with syphilis detection and treatment rates. We examine how PrEP usage can affect community structure in the sexual networks fitted to the NEST data. We identify two types of PrEP users, those belonging to a large, highly connected community and tending to have a large number of sexual partners, versus those with a small number of sexual partners and belonging to smaller communities. A stochastic syphilis model indicates that PrEP users in the large community may play an important role in sustaining syphilis transmission.
- [5] arXiv:2507.19772 [pdf, html, other]
Title: External light schedules can induce nighttime sleep disruptions in a Homeostat-Circadian-Light Model for sleep in young children
Comments: 19 pages, 8 figures
Subjects: Neurons and Cognition (q-bio.NC)
Sleep disturbances, particularly nighttime waking, are highly prevalent in young children and can significantly disrupt not only the child's well-being but also family functioning. Behavioral and environmental strategies, including the regulation of light exposure, are typically recommended treatments for nighttime waking. Using the Homeostatic-Circadian-Light (HCL) mathematical model for sleep timing based on the interaction of the circadian rhythm, the homeostatic sleep drive and external light, we analyze how external light schedules can influence the occurrence of nighttime waking in young children. We fitted the model to data on sleep homeostasis and sleep behavior in 2-3.5-year-olds and identified subsets of parameter ranges that fit the data but indicated a susceptibility to nighttime waking. This suggests that as children develop they may exhibit more or less propensity to awaken during the night. Notably, parameter sets exhibiting earlier sleep timing were more susceptible to nighttime waking. For a model parameter set susceptible to, but not exhibiting, nighttime waking, we analyze how external light schedules affect sleep patterns. We find that low daytime light levels can induce nighttime sleep disruptions and that extended bright-light exposure also promotes nighttime waking. Further results suggest that consistent daily routines are essential; irregular schedules, particularly during weekends, markedly worsen the consolidation of nighttime sleep. Specifically, weekend delays in morning lights-on and evening lights-off times result in nighttime sleep disruptions and can influence sleep timing during the week. These results highlight how external light, daily rhythms, and parenting routines interact to shape children's sleep health, providing a useful framework for improving sleep management practices.
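The interaction of a homeostatic sleep drive with circadian thresholds that the HCL model builds on can be illustrated with a classic two-process sketch: pressure S rises during wake, decays during sleep, and state switches when S crosses sinusoidally modulated thresholds. All constants and the threshold form below are illustrative placeholders, not the paper's fitted HCL model.

```python
import math

def two_process_sleep(days=3, dt=0.01):
    """Toy two-process model: homeostatic pressure S rises during wake and
    decays during sleep; sleep onset/offset occur when S crosses upper/lower
    thresholds modulated by a circadian sinusoid. All constants illustrative."""
    S, asleep, t = 0.3, False, 0.0
    tau_w, tau_s = 18.0, 4.2   # hours; Borbely-style time constants
    transitions = []
    while t < 24.0 * days:
        c = 0.1 * math.sin(2.0 * math.pi * t / 24.0)  # circadian modulation
        upper, lower = 0.60 + c, 0.17 + c
        if asleep:
            S -= dt * S / tau_s            # pressure decays during sleep
            if S < lower:
                asleep = False
                transitions.append(("wake", round(t, 2)))
        else:
            S += dt * (1.0 - S) / tau_w    # pressure saturates toward 1 in wake
            if S > upper:
                asleep = True
                transitions.append(("sleep", round(t, 2)))
        t += dt
    return transitions
```

Shifting the phase or amplitude of the sinusoid (a crude proxy for a changed light schedule) moves the crossing times, which is the mechanism by which light schedules reshape sleep timing in richer models like HCL.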
- [6] arXiv:2507.19805 [pdf, other]
Title: Sequence-based protein-protein interaction prediction and its applications in drug discovery
Comments: 32 pages, 6 figures, 3 tables
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Aberrant protein-protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state-of-the-art for sequence-based PPI prediction methods and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity-based approaches, and deep learning-based approaches with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in systems-level proteomics analyses, target identification, and design of therapeutic peptides and antibodies. We also take the opportunity to showcase the potential of PPI-aware drug discovery models in accelerating therapeutic development.
- [7] arXiv:2507.19944 [pdf, html, other]
Title: Attractive and Repulsive Perceptual Biases Naturally Emerge in Generative Adversarial Inference
Subjects: Neurons and Cognition (q-bio.NC)
Human perceptual estimates exhibit a striking reversal in bias depending on uncertainty: they shift toward prior expectations under high sensory uncertainty, but away from them when internal noise is dominant. While Bayesian inference combined with efficient coding can explain this dual bias, existing models rely on handcrafted priors or fixed encoders, offering no account of how such representations and inferences could emerge through learning. We introduce a Generative Adversarial Inference (GAI) network that simultaneously learns sensory representations and inference strategies directly from data, without assuming explicit likelihoods or priors. Through joint reconstruction and adversarial training, the model learns a representation that approximates an efficient code consistent with information-theoretic predictions. Trained on Gabor stimuli with varying signal-to-noise ratios, GAI spontaneously reproduces the full transition from prior attraction to repulsion, and recovers the Fisher information profile predicted by efficient coding theory. It also captures the characteristic bias reversal observed in human perception more robustly than supervised or variational alternatives. These results show that a single adversarially trained network can jointly acquire an efficient sensory code and support Bayesian-consistent behavior, providing a neurally plausible, end-to-end account of perceptual bias that unifies normative theory and deep learning.
- [8] arXiv:2507.19979 [pdf, html, other]
Title: Inference for stochastic reaction networks via logistic regression
Comments: 44 pages, 14 figures
Subjects: Quantitative Methods (q-bio.QM)
Identifying network structure and inferring parameters are central challenges in modeling chemical reaction networks. In this study, we propose likelihood-based methods grounded in multinomial logistic regression to infer both stoichiometries and network connectivity structure from full time-series trajectories of stochastic chemical reaction networks. When complete molecular count trajectories are observed for all species, stoichiometric coefficients are identifiable, provided each reaction occurs at least once during the observation window. However, identifying catalytic species remains difficult, as their molecular counts remain unchanged before and after each reaction event. Through three illustrative stochastic models involving catalytic interactions in open networks, we demonstrate that the logistic regression framework, when applied properly, can recover the full network structure, including stoichiometric relationships. We further apply Bayesian logistic regression to estimate model parameters in real-world epidemic settings, using the COVID-19 outbreak in the Greater Seoul area of South Korea as a case study. Our analysis focuses on a Susceptible--Infected--Recovered (SIR) network model that incorporates demographic effects. To address the challenge of partial observability, particularly the availability of data only for the infectious subset of the population, we develop a method that integrates Bayesian logistic regression with differential equation models. This approach enables robust inference of key SIR parameters from observed COVID-19 case trajectories. Overall, our findings demonstrate that simple, likelihood-based techniques such as logistic regression can recover meaningful mechanistic insights from both synthetic and empirical time-series data.
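The identifiability claim above -- stoichiometries are recoverable from a fully observed trajectory provided each reaction fires at least once -- can be illustrated directly, since every reaction event appears as a distinct jump in the count vector. The sketch below shows only that step; the paper's multinomial logistic regression for inferring connectivity and rates is not reproduced here.

```python
def infer_stoichiometries(trajectory):
    """Collect the distinct nonzero jump vectors in a fully observed
    molecular-count trajectory; if every reaction fires at least once,
    these are exactly the net stoichiometric update vectors.
    (Catalytic species leave no trace in the jumps, mirroring the
    identifiability difficulty noted in the abstract.)"""
    jumps = set()
    for prev, curr in zip(trajectory, trajectory[1:]):
        jump = tuple(c - p for c, p in zip(curr, prev))
        if any(jump):  # skip steps with no reaction event
            jumps.add(jump)
    return sorted(jumps)
```

For a two-species trajectory containing both A -> B and B -> A events, the function returns the two net update vectors (-1, 1) and (1, -1).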
- [9] arXiv:2507.19992 [pdf, other]
Title: NIRS: An Ontology for Non-Invasive Respiratory Support in Acute Care
Comments: Submitted to the Journal of the American Medical Informatics Association (JAMIA)
Subjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Objective: Develop a Non-Invasive Respiratory Support (NIRS) ontology to support knowledge representation in acute care settings.
Materials and Methods: We developed the NIRS ontology using Web Ontology Language (OWL) semantics and Protege to organize clinical concepts and relationships. To enable rule-based clinical reasoning beyond hierarchical structures, we added Semantic Web Rule Language (SWRL) rules. We evaluated logical reasoning by adding 17 hypothetical patient clinical scenarios. We used SPARQL queries and data from the Electronic Intensive Care Unit (eICU) Collaborative Research Database to retrieve and test targeted inferences.
Results: The ontology has 132 classes, 12 object properties, and 17 data properties across 882 axioms that establish concept relationships. To standardize clinical concepts, we added 350 annotations, including descriptive definitions based on controlled vocabularies. SPARQL queries successfully validated all test cases (rules) by retrieving appropriate patient outcomes, for instance, a patient treated with HFNC (high-flow nasal cannula) for 2 hours due to acute respiratory failure may avoid endotracheal intubation.
Discussion: The NIRS ontology formally represents domain-specific concepts, including ventilation modalities, patient characteristics, therapy parameters, and outcomes. SPARQL query evaluations on clinical scenarios confirmed the ability of the ontology to support rule based reasoning and therapy recommendations, providing a foundation for consistent documentation practices, integration into clinical data models, and advanced analysis of NIRS outcomes.
Conclusion: We unified NIRS concepts into an ontological framework and demonstrated its applicability through the evaluation of hypothetical patient scenarios and alignment with standardized vocabularies.
- [10] arXiv:2507.20205 [pdf, html, other]
Title: Signed Higher-Order Interactions for Brain Disorder Diagnosis via Multi-Channel Transformers
Subjects: Neurons and Cognition (q-bio.NC); Graphics (cs.GR)
Accurately characterizing higher-order interactions of brain regions and extracting interpretable organizational patterns from Functional Magnetic Resonance Imaging data is crucial for brain disease diagnosis. Current graph-based deep learning models primarily focus on pairwise or triadic patterns while neglecting signed higher-order interactions, limiting comprehensive understanding of brain-wide communication. We propose HOI-Brain, a novel computational framework leveraging signed higher-order interactions and organizational patterns in fMRI data for brain disease diagnosis. First, we introduce a co-fluctuation measure based on Multiplication of Temporal Derivatives to detect higher-order interactions with temporal resolution. We then distinguish positive and negative synergistic interactions, encoding them in signed weighted simplicial complexes to reveal brain communication insights. Using Persistent Homology theory, we apply two filtration processes to these complexes to extract signed higher-dimensional neural organizations spatiotemporally. Finally, we propose a multi-channel brain Transformer to integrate heterogeneous topological features. Experiments on Alzheimer's disease, Parkinson's syndrome, and autism spectrum disorder datasets demonstrate our framework's superiority, effectiveness, and interpretability. The identified key brain regions and higher-order patterns align with neuroscience literature, providing meaningful biological insights.
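Multiplication of Temporal Derivatives is commonly defined as the pointwise product of normalized first differences of regional signals, which extends naturally from pairs to triads and beyond. A minimal sketch under that assumption; the paper's signed, simplicial-complex extension is not reproduced here.

```python
import math

def temporal_derivatives(series):
    """First differences of a signal, normalized by their standard deviation
    (falling back to 1.0 for constant-difference signals)."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    mu = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - mu) ** 2 for d in diffs) / len(diffs)) or 1.0
    return [d / sd for d in diffs]

def cofluctuation(regions):
    """Pointwise product of the normalized temporal derivatives of several
    regions -- a pairwise, triadic, or higher-order co-fluctuation course."""
    derivs = [temporal_derivatives(r) for r in regions]
    return [math.prod(vals) for vals in zip(*derivs)]
```

Passing two regions recovers the classic pairwise MTD time course; passing three or more yields the higher-order co-fluctuations the framework builds its simplicial complexes from.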
- [11] arXiv:2507.20304 [pdf, other]
Title: Ligand Pose Generation via QUBO-Based Hotspot Sampling and Geometric Triplet Matching
Subjects: Biomolecules (q-bio.BM)
We propose a framework based on Quadratic Unconstrained Binary Optimization (QUBO) for generating plausible ligand binding poses within protein pockets, enabling efficient structure-based virtual screening. The method discretizes the binding site into a grid and solves a QUBO problem to select spatially distributed, energetically favorable grid points. Each ligand is represented by a three-atom geometric contour, which is aligned to the selected grid points through rigid-body transformation, producing from hundreds to hundreds of thousands of candidate poses. Using a benchmark of 169 protein-ligand complexes, we generated an average of 110 to 600000 poses per ligand, depending on QUBO parameters and matching thresholds. Evaluation against crystallographic structures revealed that a larger number of candidates increases the likelihood of recovering near-native poses, with recovery rates reaching 100 percent for root mean square deviation (RMSD) values below 1.0 angstrom and 95.9 percent for RMSD values below 0.6 angstrom. Since the correct binding pose is not known in advance, we apply AutoDock-based scoring to select the most plausible candidates from the generated pool, achieving recovery rates of up to 82.8 percent for RMSD < 2.0 angstrom, 81.7 percent for RMSD < 1.5 angstrom, and 75.2 percent for RMSD < 1.0 angstrom. When poses with misleading scores are excluded, performance improves further, with recovery rates reaching up to 97.8 percent for RMSD < 2.0 angstrom and 1.5 angstrom, and 95.4 percent for RMSD < 1.0 angstrom. This modular and hardware-flexible framework offers a scalable solution for pre-filtering ligands and generating high-quality binding poses before affinity prediction, making it well-suited for large-scale virtual screening pipelines.
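The grid-point selection step can be sketched as a small QUBO: diagonal terms reward energetically favorable points, off-diagonal penalties discourage selecting crowded neighbors. The brute-force solver and toy instance below are illustrative stand-ins for the paper's actual formulation and solver hardware.

```python
from itertools import product

def solve_qubo_bruteforce(Q, n):
    """Exhaustively minimize x^T Q x over binary vectors x; feasible only
    for toy sizes, standing in for an annealer or heuristic on real grids."""
    best_x, best_e = None, float("inf")
    for bits in product((0, 1), repeat=n):
        e = sum(coef * bits[i] * bits[j] for (i, j), coef in Q.items())
        if e < best_e:
            best_x, best_e = bits, e
    return best_x, best_e

# Toy instance with 4 grid points: diagonal terms reward energetically
# favorable points, off-diagonal penalties discourage adjacent selections.
Q = {(0, 0): -1.0, (1, 1): -1.0, (2, 2): -1.0, (3, 3): -0.2,
     (0, 1): 2.0, (1, 2): 2.0}
```

On this instance the optimum selects points 0, 2, and 3 and skips point 1, whose favorable energy is outweighed by its adjacency penalties, illustrating how the QUBO trades energy against spatial distribution.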
- [12] arXiv:2507.20401 [pdf, html, other]
Title: Mathematical model of blood coagulation during endovenous laser therapy
Authors: Anna A. Andreeva (1), Konstantin A. Klochkov (1), Alexey I. Lobanov (1) ((1) Moscow Institute of Physics and Technology)
Subjects: Quantitative Methods (q-bio.QM)
Endovenous laser therapy (ELT), although a minimally invasive procedure for ablation of large superficial veins, can nevertheless cause complications of a thrombotic nature. In this regard, studying the main patterns of thrombus formation during ELT and modelling endovenous heat-induced thrombosis (EHIT) is relevant. Based on the assumption that the biochemical processes occurring during blood coagulation are diffusion-limited, and by recalculating the reaction rates according to the Stokes-Einstein equation, we built a simple point model of blood coagulation during ELT. Using this model, we demonstrated that heating the blood increases the rate of thrombin production and decreases the time to peak thrombin concentration by a factor of 5-6, while the peak amplitude remains almost constant. Heating leads to the rapid formation of fibrin clusters and the appearance of a fibrin-polymer network with a smaller cell size. We also show a quantitative dependence on the selected rheological model. All the data necessary for using the model are given in this article for reproducibility.
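The rate recalculation rests on the Stokes-Einstein relation D = kB*T / (6*pi*eta*r): for a diffusion-limited reaction, the rate constant scales as T/eta(T). A minimal sketch assuming that standard scaling; the paper's exact parameterization and viscosity model may differ.

```python
def stokes_einstein_rate(k_ref, T_ref, T, eta_ref, eta):
    """Rescale a diffusion-limited rate constant with temperature using the
    Stokes-Einstein relation D = kB*T / (6*pi*eta*r), which implies
    k(T) / k(T_ref) = (T / T_ref) * (eta_ref / eta)."""
    return k_ref * (T / T_ref) * (eta_ref / eta)
```

Because blood viscosity drops sharply with heating, the eta_ref/eta factor dominates the linear temperature factor, which is consistent with the model's prediction of markedly faster thrombin production in heated blood.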
- [13] arXiv:2507.20406 [pdf, html, other]
Title: A Topology-Based Machine Learning Model Decisively Outperforms Flux Balance Analysis in Predicting Metabolic Gene Essentiality
Subjects: Molecular Networks (q-bio.MN)
Background: The rational identification of essential genes is a cornerstone of drug discovery, yet standard computational methods like Flux Balance Analysis (FBA) often struggle to produce accurate predictions in complex, redundant metabolic networks. Hypothesis: We hypothesized that the topological structure of a metabolic network contains a more robust predictive signal for essentiality than functional simulations alone. Methodology: To test this hypothesis, we developed a machine learning pipeline by first constructing a reaction-reaction graph from the e_coli_core metabolic model. Graph-theoretic features, including betweenness centrality and PageRank, were engineered to describe the topological role of each gene. A RandomForestClassifier was trained on these features, and its performance was rigorously benchmarked against a standard FBA single-gene deletion analysis using a curated ground-truth dataset. Results: Our machine learning model achieved a solid predictive performance with an F1-Score of 0.400 (Precision: 0.412, Recall: 0.389). In profound contrast, the standard FBA baseline method failed to correctly identify any of the known essential genes, resulting in an F1-Score of 0.000. Conclusion: This work demonstrates that a "structure-first" machine learning approach is a significantly superior strategy for predicting gene essentiality compared to traditional FBA on the E. coli core network. By learning the topological signatures of critical network roles, our model successfully overcomes the known limitations of simulation-based methods in handling biological redundancy. While the performance of topology-only models is expected to face challenges on more complex genome-scale networks, this validated framework represents a significant step forward and highlights the primacy of network architecture in determining biological function.
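One of the engineered topological features, PageRank, reduces to a short power iteration over the graph. The sketch below is a generic implementation on an adjacency dictionary, not the authors' pipeline; the reaction-reaction graph construction and the RandomForestClassifier training are omitted.

```python
def pagerank(adj, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [neighbors]}.
    d is the damping factor; dangling nodes spread their mass uniformly."""
    nodes = list(adj)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = d * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:  # dangling node
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr
```

On a symmetric structure such as a directed 3-cycle the scores are uniform (1/3 each); in a reaction-reaction graph, reactions that many pathways route through accumulate higher scores, which is the topological signal the classifier learns from.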
- [14] arXiv:2507.20601 [pdf, other]
Title: Comparing and Scaling fMRI Features for Brain-Behavior Prediction
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Predicting behavioral variables from neuroimaging modalities such as magnetic resonance imaging (MRI) has the potential to allow the development of neuroimaging biomarkers of mental and neurological disorders. A crucial processing step to this aim is the extraction of suitable features. These can differ in how well they predict the target of interest, and how this prediction scales with sample size and scan time. Here, we compare nine feature subtypes extracted from resting-state functional MRI recordings for behavior prediction, ranging from regional measures of functional activity to functional connectivity (FC) and metrics derived with graph signal processing (GSP), a principled approach for the extraction of structure-informed functional features. We study 979 subjects from the Human Connectome Project Young Adult dataset, predicting summary scores for mental health, cognition, processing speed, and substance use, as well as age and sex. The scaling properties of the features are investigated for different combinations of sample size and scan time. FC comes out as the best feature for predicting cognition, age, and sex. Graph power spectral density is the second best for predicting cognition and age, while for sex, variability-based features show potential as well. When predicting sex, the low-pass graph filtered coupled FC slightly outperforms the simple FC variant. None of the other targets were predicted significantly. The scaling results point to higher performance reserves for the better-performing features. They also indicate that it is important to balance sample size and scan time when acquiring data for prediction studies. The results confirm FC as a robust feature for behavior prediction, but also show the potential of GSP and variability-based measures. We discuss the implications for future prediction studies in terms of strategies for acquisition and sample composition.
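The FC feature that performs best here is conventionally the vectorized upper triangle of the region-by-region Pearson correlation matrix; a minimal sketch under that assumption (the study's preprocessing and GSP-based variants are not reproduced):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fc_features(timeseries):
    """Vectorize the upper triangle of the region-by-region correlation
    matrix -- the standard FC feature vector for behavior prediction."""
    n = len(timeseries)
    return [pearson(timeseries[i], timeseries[j])
            for i in range(n) for j in range(i + 1, n)]
```

For R regions this yields R*(R-1)/2 features per subject, which are then fed to a predictive model; scan time enters through the length of each regional time series.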
New submissions (showing 14 of 14 entries)
- [15] arXiv:2507.19615 (cross-list from math.PR) [pdf, html, other]
Title: Population dynamics under random switching
Comments: 72 pages, 3 figures
Subjects: Probability (math.PR); Populations and Evolution (q-bio.PE)
Populations interact non-linearly and are influenced by environmental fluctuations. In order to have realistic mathematical models, one needs to take into account that the environmental fluctuations are inherently stochastic. Often, environmental stochasticity is modeled by systems of stochastic differential equations. However, this type of stochasticity is not always the best suited for ecological modeling. Instead, biological systems can be modeled using piecewise deterministic Markov processes (PDMP). For a PDMP the process follows the flow of a system of ordinary differential equations for a random time, after which the environment switches to a different state, where the dynamics are given by a different system of differential equations. This process then repeats. The current paper is devoted to the study of the dynamics of $n$ populations described by $n$-dimensional Kolmogorov PDMP. We provide sharp conditions for persistence and extinction, based on the invasion rates (Lyapunov exponents) of the ergodic probability measures supported on the boundary of the positive orthant. In order to showcase the applicability of our results, we apply the theory to several ecological examples.
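A PDMP of the kind studied here can be simulated by integrating one ODE flow up to an exponentially distributed switching time, then swapping flows. The toy below switches a single-species logistic growth rate between two environments; the dynamics, rates, and parameters are illustrative, not taken from the paper.

```python
import random

def simulate_pdmp(x0, switch_rates, growth, T, dt=0.001, seed=0):
    """Piecewise deterministic Markov process: follow the logistic flow
    dx/dt = g_e * x * (1 - x) of the current environment e, switching
    environments at exponentially distributed random times."""
    rng = random.Random(seed)
    x, env, t = x0, 0, 0.0
    next_switch = rng.expovariate(switch_rates[env])
    while t < T:
        if t >= next_switch:
            env = 1 - env
            next_switch = t + rng.expovariate(switch_rates[env])
        x += dt * growth[env] * x * (1.0 - x)  # Euler step of current flow
        t += dt
    return x
```

With one favorable and one unfavorable environment, long runs of such simulations give a numerical feel for the persistence-versus-extinction dichotomy that the paper characterizes analytically via boundary Lyapunov exponents.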
- [16] arXiv:2507.19637 (cross-list from physics.med-ph) [pdf, other]
Title: Quantifying lower-limb muscle coordination during cycling using electromyography-informed muscle synergies
Authors: Reza Ahmadi (1), Shahram Rasoulian (2), Hamidreza Heidary (1), Saied Jalal Aboodarda (2), Thomas K. Uchida (3), Walter Herzog (1 and 2), Amin Komeili (1 and 2) ((1) Department of Mechanical and Manufacturing Engineering, University of Calgary, Calgary, Canada (2) Human Performance Laboratory, Faculty of Kinesiology, University of Calgary, Calgary, Canada (3) Department of Mechanical Engineering, University of Ottawa, Ottawa, Canada)
Comments: 21 pages, 6 figures
Subjects: Medical Physics (physics.med-ph); Quantitative Methods (q-bio.QM)
Assessment of muscle coordination during cycling may provide insight into motor control strategies and movement efficiency. This study evaluated muscle synergies and coactivation patterns as indicators of neuromuscular coordination in the lower limbs across three power levels of cycling. Twenty recreational cyclists performed a graded cycling test on a stationary bicycle ergometer. Electromyography was recorded bilaterally from seven lower-limb muscles and muscle synergies were extracted using non-negative matrix factorization. The Coactivation Index (CI), Synergy Index (SI), and Synergy Coordination Index (SCI) were calculated to assess muscle coordination patterns. Four muscle synergies were identified consistently across power levels, with changes in synergy composition and activation timing correlated with increased muscular demands. As power level increased, the CI showed reduced muscle coactivation at the knee and greater muscle coactivation at the ankle. The SI revealed a greater contribution of the synergy weights of the extensor muscles than those of the flexor muscles at the knee. In contrast, the relative EMG contribution of hip extensor and flexor muscles remained consistent with increasing power levels. The SCI increased significantly with increasing power level, suggesting a reduction in the size of the synergy space and improved neuromuscular coordination. These findings provide insight into how the central nervous system modulates its response to increasing mechanical demands. Combining synergy and coactivation indices offers a promising approach to assess motor control, inform rehabilitation, and optimize performance in cycling tasks.
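Non-negative matrix factorization, used here to extract synergies, factors the muscle-by-time EMG matrix V into W * H, where columns of W play the role of synergy weights and rows of H their activation time courses. A self-contained multiplicative-update sketch follows; the study presumably used a standard library implementation, so this is only a didactic stand-in.

```python
import random

def nmf(V, k, iters=300, seed=0):
    """Multiplicative-update NMF: V (m x n, nonnegative) ~ W (m x k) @ H (k x n).
    In the muscle-synergy setting, m = muscles, n = time samples, k = synergies."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # H <- H * (W^T V) / (W^T W H)
        for a in range(k):
            for j in range(n):
                num = sum(W[i][a] * V[i][j] for i in range(m))
                den = sum(W[i][a] * WH[i][j] for i in range(m)) + eps
                H[a][j] *= num / den
        WH = [[sum(W[i][a] * H[a][j] for a in range(k)) for j in range(n)]
              for i in range(m)]
        # W <- W * (V H^T) / (W H H^T)
        for i in range(m):
            for a in range(k):
                num = sum(V[i][j] * H[a][j] for j in range(n))
                den = sum(WH[i][j] * H[a][j] for j in range(n)) + eps
                W[i][a] *= num / den
    return W, H
```

The number of synergies k (four in this study) is typically chosen by the variance accounted for in the reconstruction W * H.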
- [17] arXiv:2507.19734 (cross-list from eess.IV) [pdf, html, other]
Title: A Metabolic-Imaging Integrated Model for Prognostic Prediction in Colorectal Liver Metastases
Comments: 8 pages, 4 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Prognostic evaluation in patients with colorectal liver metastases (CRLM) remains challenging due to suboptimal accuracy of conventional clinical models. This study developed and validated a robust machine learning model for predicting postoperative recurrence risk. Preliminary ensemble models achieved exceptionally high performance (AUC $>$ 0.98) but incorporated postoperative features, introducing data leakage risks. To enhance clinical applicability, we restricted input variables to preoperative baseline clinical parameters and radiomic features from contrast-enhanced CT imaging, specifically targeting recurrence prediction at 3, 6, and 12 months postoperatively. The 3-month recurrence prediction model demonstrated optimal performance with an AUC of 0.723 in cross-validation. Decision curve analysis revealed that across threshold probabilities of 0.55-0.95, the model consistently provided greater net benefit than "treat-all" or "treat-none" strategies, supporting its utility in postoperative surveillance and therapeutic decision-making. This study successfully developed a robust predictive model for early CRLM recurrence with confirmed clinical utility. Importantly, it highlights the critical risk of data leakage in clinical prognostic modeling and proposes a rigorous framework to mitigate this issue, enhancing model reliability and translational value in real-world settings.
- [18] arXiv:2507.19755 (cross-list from cs.LG) [pdf, html, other]
Title: Modeling enzyme temperature stability from sequence segment perspective
Authors: Ziqi Zhang, Shiheng Chen, Runze Yang, Zhisheng Wei, Wei Zhang, Lei Wang, Zhanzhi Liu, Fengshan Zhang, Jing Wu, Xiaoyong Pan, Hongbin Shen, Longbing Cao, Zhaohong Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Developing enzymes with desired thermal properties is crucial for a wide range of industrial and research applications, and determining temperature stability is an essential step in this process. Experimental determination of thermal parameters is labor-intensive, time-consuming, and costly. Moreover, existing computational approaches are often hindered by limited data availability and imbalanced distributions. To address these challenges, we introduce a curated temperature stability dataset designed for model development and benchmarking in enzyme thermal modeling. Leveraging this dataset, we present the Segment Transformer, a novel deep learning framework that enables efficient and accurate prediction of enzyme temperature stability. The model achieves state-of-the-art performance with an RMSE of 24.03, an MAE of 18.09, and Pearson and Spearman correlations both equal to 0.33. These results highlight the effectiveness of incorporating segment-level representations, grounded in the biological observation that different regions of a protein sequence contribute unequally to thermal behavior. As a proof of concept, we applied the Segment Transformer to guide the engineering of a cutinase enzyme. Experimental validation demonstrated a 1.64-fold improvement in relative activity following heat treatment, achieved through only 17 mutations and without compromising catalytic function.
- [19] arXiv:2507.19956 (cross-list from cs.CV) [pdf, html, other]
-
Title: Predicting Brain Responses To Natural Movies With Multimodal LLMsCesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. ScottiComments: Code available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
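The linear projection from pretrained features to a parcel time series is, at its core, a ridge-regression encoding model. A toy sketch on simulated data (not the MedARC pipeline; the dimensions and `lam` penalty are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: T timepoints, D pretrained-feature dimensions, one cortical parcel.
T, D = 200, 16
X = rng.standard_normal((T, D))                  # time-aligned stimulus features
w_true = rng.standard_normal(D)
y = X @ w_true + 0.1 * rng.standard_normal(T)    # simulated parcel fMRI signal

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

w = ridge_fit(X[:150], y[:150], lam=1.0)         # fit on "training movies"
pred = X[150:] @ w
r = np.corrcoef(pred, y[150:])[0, 1]             # held-out Pearson correlation
```

Evaluating held-out Pearson correlation per parcel, as here, mirrors the metric used for the challenge leaderboard.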
- [20] arXiv:2507.20130 (cross-list from cs.LG) [pdf, html, other]
-
Title: Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug DesignSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Recent advances in generative models, particularly diffusion and auto-regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure-based drug design (SBDD) remains limited due to critical data constraints. To address the limitation of training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between billion-scale small-molecule datasets and the scarce protein-ligand complex datasets, and effectively increases the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high-fidelity VQ-VAE for molecule representation in latent space, a diffusion model for pharmacophore-guided molecule generation, and a pocket-aware evolutionary strategy for molecule optimization with a physics-based scoring function. This framework efficiently generates high-affinity binders for various protein targets, validated with predicted binding affinities using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors of KRAS$^{\textrm{G12D}}$, a challenging target in cancer therapeutics, with affinity similar to that of a known highly active inhibitor, as evaluated by FEP calculations. With high versatility and generalizability, MEVO offers an effective and data-efficient model for various tasks in structure-based ligand design.
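MEVO's pocket-aware evolutionary strategy is not spelled out in the abstract; the generic mutate-score-select loop such strategies build on can be sketched with a toy string "molecule" and a stand-in scoring function (all names and parameters here are illustrative, not the paper's method):

```python
import random

random.seed(42)

TARGET = "CCOCN"     # stand-in "molecule" string; purely illustrative
ALPHABET = "CONH"

def score(s):
    # Toy surrogate for a physics-based scoring function:
    # higher is better, maximal when s matches TARGET.
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s):
    # Point mutation: replace one random position with a random symbol.
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

def evolve(pop_size=20, generations=60):
    pop = ["".join(random.choice(ALPHABET) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]           # elitist selection: keep top half
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=score)

best = evolve()
```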
- [21] arXiv:2507.20189 (cross-list from eess.SP) [pdf, html, other]
-
Title: NeuroCLIP: A Multimodal Contrastive Learning Method for rTMS-treated Methamphetamine Addiction AnalysisChengkai Wang, Di Wu, Yunsheng Liao, Wenyao Zheng, Ziyi Zeng, Xurong Gao, Hemmings Wu, Zhoule Zhu, Jie Yang, Lihua Zhong, Weiwei Cheng, Yun-Hsuan Chen, Mohamad SawanSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Methamphetamine dependence poses a significant global health challenge, yet its assessment and the evaluation of treatments like repetitive transcranial magnetic stimulation (rTMS) frequently depend on subjective self-reports, which may introduce uncertainties. While objective neuroimaging modalities such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) offer alternatives, their individual limitations and the reliance on conventional, often hand-crafted, feature extraction can compromise the reliability of derived biomarkers. To overcome these limitations, we propose NeuroCLIP, a novel deep learning framework integrating simultaneously recorded EEG and fNIRS data through a progressive learning strategy. This approach offers a robust and trustworthy biomarker for methamphetamine addiction. Validation experiments show that NeuroCLIP significantly improves discriminative capability between methamphetamine-dependent individuals and healthy controls compared to models using either EEG or fNIRS alone. Furthermore, the proposed framework facilitates objective, brain-based evaluation of rTMS treatment efficacy, demonstrating measurable shifts in neural patterns towards healthy control profiles after treatment. Critically, we establish the trustworthiness of the multimodal data-driven biomarker by showing its strong correlation with psychometrically validated craving scores. These findings suggest that the biomarker derived from EEG-fNIRS data via NeuroCLIP offers enhanced robustness and reliability over single-modality approaches, providing a valuable tool for addiction neuroscience research and potentially improving clinical assessments.
- [22] arXiv:2507.20288 (cross-list from stat.ME) [pdf, html, other]
-
Title: A nonparametric approach to practical identifiability of nonlinear mixed effects modelsTyler Cassidy, Stuart T. Johnston, Michael Plank, Imke Botha, Jennifer A. Flegg, Ryan J. Murphy, Sara HamisSubjects: Methodology (stat.ME); Quantitative Methods (q-bio.QM)
Mathematical modelling is a widely used approach to understand and interpret clinical trial data. This modelling typically involves fitting mechanistic mathematical models to data from individual trial participants. Despite the widespread adoption of this individual-based fitting, it is becoming increasingly common to take a hierarchical approach to parameter estimation, where modellers characterize the population parameter distributions, rather than considering each individual independently. This hierarchical parameter estimation is standard in pharmacometric modelling. However, many of the existing techniques for parameter identifiability do not immediately translate from the individual-based fitting to the hierarchical setting. Here, we propose a nonparametric approach to study practical identifiability within a hierarchical parameter estimation framework. We focus on the commonly used nonlinear mixed effects framework and investigate two well-studied examples from the pharmacometrics and viral dynamics literature to illustrate the potential utility of our approach.
- [23] arXiv:2507.20426 (cross-list from cs.LG) [pdf, html, other]
-
Title: ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT EmbeddingsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Biomolecules (q-bio.BM)
DNA-binding proteins (DBPs) are integral to gene regulation and cellular processes, making their accurate identification essential for understanding biological functions and disease mechanisms. Experimental methods for DBP identification are time-consuming and costly, driving the need for efficient computational prediction techniques. In this study, we propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network (1D-CapsNet) to predict DBPs directly from raw protein sequences. Our architecture incorporates dilated convolutions within residual blocks to mitigate vanishing gradient issues and extract rich sequence features, while capsule layers with dynamic routing capture hierarchical and spatial relationships within the learned feature space. We conducted comprehensive ablation studies comparing global and local embeddings from ProteinBERT and conventional one-hot encoding. Results show that ProteinBERT embeddings substantially outperform other representations on large datasets. Although one-hot encoding showed marginal advantages on smaller datasets, such as PDB186, it struggled to scale effectively. Extensive evaluations on four pairs of publicly available benchmark datasets demonstrate that our model consistently outperforms current state-of-the-art methods. It achieved AUC scores of 98.0% and 89.5% on PDB14189 and PDB1075, respectively. On independent test sets PDB2272 and PDB186, the model attained top AUCs of 83.2% and 83.3%, while maintaining competitive performance on larger datasets such as PDB20000. Notably, the model maintains well-balanced sensitivity and specificity across datasets. These results demonstrate the efficacy and generalizability of integrating global protein representations with advanced deep learning architectures for reliable and scalable DBP prediction in diverse genomic contexts.
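For context, the one-hot baseline that the ablation compares against is straightforward to reproduce generically (a minimal sketch, not the paper's encoder; `max_len` is an illustrative choice):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq, max_len=8):
    """Encode a protein sequence as a (max_len x 20) binary matrix,
    truncating or zero-padding to max_len."""
    mat = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        mat[pos][AA_INDEX[aa]] = 1
    return mat

m = one_hot("MKTAYIAK")  # toy 8-residue sequence
```

The fixed-length, sparse nature of this representation is what limits its scalability relative to learned embeddings such as ProteinBERT's.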
- [24] arXiv:2507.20440 (cross-list from cs.LG) [pdf, html, other]
-
Title: BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis ToolVicente Ramos (1), Sundous Hussein (1), Mohamed Abdel-Hafiz (1), Arunangshu Sarkar (2), Weixuan Liu (2), Katerina J. Kechris (2), Russell P. Bowler (3), Leslie Lange (4), Farnoush Banaei-Kashani (1) ((1) Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA, (2) Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA, (3) Genomic Medicine Institute, Cleveland Clinic, Cleveland, USA, (4) Division of Biomedical Informatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, USA)Comments: 6 pages, 1 figure, 2 tables; Software available on PyPI as BioNeuralNet. For documentation, tutorials, and workflows see this https URLSubjects: Machine Learning (cs.LG); Genomics (q-bio.GN)
Multi-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.
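The GNN embeddings at the heart of such tools reduce, per layer, to neighborhood aggregation over the molecular network. A minimal mean-aggregation step in NumPy (a generic GCN-style sketch, not BioNeuralNet's API):

```python
import numpy as np

def mean_aggregate(A, H):
    """One message-passing step: each node averages its neighbors'
    features (including itself) -- the core of a GCN-style layer."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return (A_hat @ H) / deg

# Toy multi-omics network: 4 molecular entities, 2 features each.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
H1 = mean_aggregate(A, H)   # low-dimensional, neighborhood-smoothed features
```

Stacking such layers (with learned weight matrices and nonlinearities in a real GNN) yields the embeddings that downstream analyses consume.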
- [25] arXiv:2507.20598 (cross-list from stat.ME) [pdf, other]
-
Title: Nullstrap-DE: A General Framework for Calibrating FDR and Preserving Power in DE Methods, with Applications to DESeq2 and edgeRSubjects: Methodology (stat.ME); Genomics (q-bio.GN); Applications (stat.AP)
Differential expression (DE) analysis is a key task in RNA-seq studies, aiming to identify genes with expression differences across conditions. A central challenge is balancing false discovery rate (FDR) control with statistical power. Parametric methods such as DESeq2 and edgeR achieve high power by modeling gene-level counts using negative binomial distributions and applying empirical Bayes shrinkage. However, these methods may suffer from FDR inflation when model assumptions are mildly violated, especially in large-sample settings. In contrast, non-parametric tests like Wilcoxon offer more robust FDR control but often lack power and do not support covariate adjustment. We propose Nullstrap-DE, a general add-on framework that combines the strengths of both approaches. Designed to augment tools like DESeq2 and edgeR, Nullstrap-DE calibrates FDR while preserving power, without modifying the original method's implementation. It generates synthetic null data from a model fitted under the gene-specific null (no DE), applies the same test statistic to both observed and synthetic data, and derives a threshold that satisfies the target FDR level. We show theoretically that Nullstrap-DE asymptotically controls FDR while maintaining power consistency. Simulations confirm that it achieves reliable FDR control and high power across diverse settings, where DESeq2, edgeR, or Wilcoxon often show inflated FDR or low power. Applications to real datasets show that Nullstrap-DE enhances statistical rigor and identifies biologically meaningful genes.
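The core calibration idea, scoring synthetic null data with the same statistic and choosing the smallest threshold whose estimated FDR meets the target, can be sketched generically; this illustrates the principle only, not the Nullstrap-DE implementation:

```python
def fdr_threshold(observed, null_stats, target_fdr=0.05):
    """Pick the smallest cutoff t such that the estimated FDR,
    (#null stats >= t) / (#observed stats >= t), is <= target_fdr."""
    for t in sorted(observed):
        discoveries = sum(s >= t for s in observed)
        false_est = sum(s >= t for s in null_stats)
        if discoveries and false_est / discoveries <= target_fdr:
            return t
    return float("inf")

# Toy example: three clearly "signal" genes among mostly null statistics.
observed = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0, 7.0]
null_stats = [0.15, 0.25, 0.35, 0.1, 0.2, 0.05, 0.3]
t = fdr_threshold(observed, null_stats, target_fdr=0.1)
discoveries = [s for s in observed if s >= t]
```

Because the null statistics are generated from a model fitted under each gene's null, the threshold adapts to whatever test statistic the wrapped method (DESeq2, edgeR) actually uses.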
- [26] arXiv:2507.20644 (cross-list from cs.LG) [pdf, other]
-
Title: Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage IncorporationComments: 10 pages, 5 figuresSubjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
The investigation of allele frequency trajectories in populations evolving under controlled environmental pressures has become a popular approach to study evolutionary processes on the molecular level. Statistical models based on well-defined evolutionary concepts can be used to validate different hypotheses about empirical observations. Despite their popularity, classic statistical models like the Wright-Fisher model suffer from simplified assumptions such as the independence of selected loci along a chromosome and uncertainty about the parameters. Deep generative neural networks offer a powerful alternative known for the integration of multivariate dependencies and noise reduction. Due to their high data demands and challenging interpretability, they have, so far, not been widely considered in the area of population genomics. To address the challenges in the area of Evolve and Resequencing experiments (E&R) based on pooled sequencing (Pool-Seq) data, we introduce a deep generative neural network that aims to model a concept of evolution based on empirical observations over time. The proposed model estimates the distribution of allele frequency trajectories by embedding the observations from single nucleotide polymorphisms (SNPs) with information from neighboring loci. Evaluation on simulated E&R experiments demonstrates the model's ability to capture the distribution of allele frequency trajectories and illustrates the representational power of deep generative models on the example of linkage disequilibrium (LD) estimation. Inspecting the internally learned representations enables estimating pairwise LD, which is typically inaccessible in Pool-Seq data. Our model provides competitive LD estimation in Pool-Seq data with a high degree of LD when compared to existing methods.
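The classic Wright-Fisher dynamics that the paper contrasts against can be simulated in a few lines (the standard textbook model with selection; the parameters below are illustrative):

```python
import random

random.seed(1)

def wright_fisher(p0, n_pop, s, generations):
    """Allele-frequency trajectory under the Wright-Fisher model with
    selection coefficient s: each generation, 2N allele copies are
    resampled around the selection-adjusted frequency (drift)."""
    traj = [p0]
    p = p0
    for _ in range(generations):
        # Selection shifts the expected frequency before drift acts.
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        copies = sum(random.random() < p_sel for _ in range(2 * n_pop))
        p = copies / (2 * n_pop)
        traj.append(p)
    return traj

traj = wright_fisher(p0=0.2, n_pop=500, s=0.05, generations=50)
```

Each locus here evolves independently, which is exactly the simplifying assumption (no linkage between selected loci) that the proposed deep generative model relaxes.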
- [27] arXiv:2507.20714 (cross-list from cs.LG) [pdf, other]
-
Title: Prostate Cancer Classification Using Multimodal Feature Fusion and Explainable AIAsma Sadia Khan, Fariba Tasnia Khan, Tanjim Mahmud, Salman Karim Khan, Rishita Chakma, Nahed Sharmen, Mohammad Shahadat Hossain, Karl AnderssonSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Prostate cancer, the second most prevalent male malignancy, requires advanced diagnostic tools. We propose an explainable AI system combining BERT (for textual clinical notes) and Random Forest (for numerical lab data) through a novel multimodal fusion strategy, achieving superior classification performance on PLCO-NIH dataset (98% accuracy, 99% AUC). While multimodal fusion is established, our work demonstrates that a simple yet interpretable BERT+RF pipeline delivers clinically significant improvements - particularly for intermediate cancer stages (Class 2/3 recall: 0.900 combined vs 0.824 numerical/0.725 textual). SHAP analysis provides transparent feature importance rankings, while ablation studies prove textual features' complementary value. This accessible approach offers hospitals a balance of high performance (F1=89%), computational efficiency, and clinical interpretability - addressing critical needs in prostate cancer diagnostics.
- [28] arXiv:2507.20925 (cross-list from cs.LG) [pdf, html, other]
-
Title: Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein InteractionSubjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero-shot compound-protein interaction (CPI) prediction better reflects the practical challenges and requirements of real-world drug development. Although existing methods perform adequately during certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large-scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model's effectiveness and zero-shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model's performance on the CPI task, especially in the challenging zero-shot scenario. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios where training samples are limited. Our implementation is available at this https URL.
- [29] arXiv:2507.21016 (cross-list from cs.LG) [pdf, html, other]
-
Title: Predicting Cognition from fMRI:A Comparative Study of Graph, Transformer, and Kernel Models Across Task and Rest ConditionsJagruti Patel (1), Mikkel Schöttner (1), Thomas A. W. Bolton (1), Patric Hagmann (1) ((1) Department of Radiology, Lausanne University Hospital and University of Lausanne (CHUV-UNIL), Lausanne, Switzerland)Comments: Preliminary version; a revised version will be uploaded laterSubjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Predicting cognition from neuroimaging data in healthy individuals offers insights into the neural mechanisms underlying cognitive abilities, with potential applications in precision medicine and early detection of neurological and psychiatric conditions. This study systematically benchmarked classical machine learning (Kernel Ridge Regression (KRR)) and advanced deep learning (DL) models (Graph Neural Networks (GNN) and Transformer-GNN (TGNN)) for cognitive prediction using Resting-state (RS), Working Memory, and Language task fMRI data from the Human Connectome Project Young Adult dataset.
Our results, based on R2 scores, Pearson correlation coefficient, and mean absolute error, revealed that task-based fMRI, eliciting neural responses directly tied to cognition, outperformed RS fMRI in predicting cognitive behavior. Among the methods compared, a GNN combining structural connectivity (SC) and functional connectivity (FC) consistently achieved the highest performance across all fMRI modalities; however, its advantage over KRR using FC alone was not statistically significant. The TGNN, designed to model temporal dynamics with SC as a prior, performed competitively with FC-based approaches for task-fMRI but struggled with RS data, where its performance aligned with the lower-performing GNN that directly used fMRI time-series data as node features. These findings emphasize the importance of selecting appropriate model architectures and feature representations to fully leverage the spatial and temporal richness of neuroimaging data.
This study highlights the potential of multimodal graph-aware DL models to combine SC and FC for cognitive prediction, as well as the promise of Transformer-based approaches for capturing temporal dynamics. By providing a comprehensive comparison of models, this work serves as a guide for advancing brain-behavior modeling using fMRI, SC and DL.
- [30] arXiv:2507.21035 (cross-list from cs.AI) [pdf, html, other]
-
Title: GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression AnalysisComments: 51 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Genomics (q-bio.GN)
Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data.
On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at this https URL.
Cross submissions (showing 16 of 16 entries)
- [31] arXiv:2405.06500 (replaced) [pdf, other]
-
Title: Advantageous and disadvantageous inequality aversion can be taught through vicarious learning of others' preferencesComments: 40 pages for main text 15 pages for Supplemental 5 figuresSubjects: Neurons and Cognition (q-bio.NC)
While enforcing egalitarian social norms is critical for human society, punishing social norm violators often incurs a cost to the self. This cost looms even larger when one can benefit from an unequal distribution of resources (i.e. advantageous inequity), as in receiving a higher salary than a colleague with the identical role. In the Ultimatum Game, a classic test bed for fairness norm enforcement, individuals rarely reject (punish) such unequal proposed divisions of resources because doing so entails a sacrifice of one's own benefit. Recent work has demonstrated that observing another's punitive responses to unfairness can efficiently alter the punitive preferences of an observer. It remains an open question, however, whether such contagion is powerful enough to impart advantageous inequity aversion to individuals. Using a variant of the Ultimatum Game in which participants are tasked with responding to fairness violations on behalf of another 'Teacher' - whose aversion to advantageous (versus disadvantageous) inequity was systematically manipulated - we probe whether individuals subsequently increase their punishment of unfairness after experiencing fairness violations on their own behalf. In two experiments, we found individuals can acquire aversion to advantageous inequity 'vicariously' through observing (and implementing) the Teacher's preferences. Computationally, these learning effects were best characterized by a model which learns the latent structure of the Teacher's preferences, rather than a simple Reinforcement Learning account. In summary, our study is the first to demonstrate that people can swiftly and readily acquire another's preferences for advantageous inequity, suggesting in turn that behavioral contagion may be one promising mechanism through which social norm enforcement - which people rarely implement in the case of advantageous inequality - can be enhanced.
- [32] arXiv:2411.07986 (replaced) [pdf, other]
-
Title: Maximum likelihood estimation of log-affine models using detailed-balanced reaction networksComments: 25 pages, 5 figures. Improved exposition. To appear in J. Math. BiolSubjects: Molecular Networks (q-bio.MN); Algebraic Geometry (math.AG); Dynamical Systems (math.DS)
A fundamental question in the field of molecular computation is what computational tasks a biochemical system can carry out. In this work, we focus on the problem of finding the maximum likelihood estimate (MLE) for log-affine models. We revisit a construction due to Gopalkrishnan of a mass-action system with the MLE as its unique positive steady state, which is based on choosing a basis for the kernel of the design matrix of the model. We extend this construction to allow for any finite spanning set of the kernel, and explore how the choice of spanning set influences the dynamics of the resulting network, including the existence of boundary steady states, the deficiency of the network, and the rate of convergence. In particular, we prove that using a Markov basis as the spanning set guarantees global stability of the MLE steady state.
- [33] arXiv:2411.11513 (replaced) [pdf, html, other]
-
Title: A Modular Open Source Framework for Genomic Variant CallingSubjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Variant calling is a fundamental task in genomic research, essential for detecting genetic variations such as single nucleotide polymorphisms (SNPs) and insertions or deletions (indels). This paper presents an enhancement to DeepChem, a widely used open-source drug discovery framework, through the integration of DeepVariant. In particular, we introduce a variant calling pipeline that leverages DeepVariant's convolutional neural network (CNN) architecture to improve the accuracy and reliability of variant detection. The implemented pipeline includes stages for realignment of sequencing reads, candidate variant detection, and pileup image generation, followed by variant classification using a modified Inception v3 model. Our work adds a modular and extensible variant calling pipeline to DeepChem and enables future work integrating DeepChem's drug discovery infrastructure more tightly with bioinformatics pipelines.
- [34] arXiv:2412.07815 (replaced) [pdf, html, other]
-
Title: Mask prior-guided denoising diffusion improves inverse protein foldingPeizhen Bai, Filip Miljković, Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, Haiping LuSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Inverse protein folding generates valid amino acid sequences that can fold into a desired protein structure, with recent deep-learning advances showing strong potential and competitive performance. However, challenges remain, such as predicting elements with high structural uncertainty, including disordered regions. To tackle such low-confidence residue prediction, we propose a Mask-prior-guided denoising Diffusion (MapDiff) framework that accurately captures both structural information and residue interactions for inverse protein folding. MapDiff is a discrete diffusion probabilistic model that iteratively generates amino acid sequences with reduced noise, conditioned on a given protein backbone. To incorporate structural information and residue interactions, we develop a graph-based denoising network with a mask-prior pre-training strategy. Moreover, in the generative process, we combine the denoising diffusion implicit model with Monte-Carlo dropout to reduce uncertainty. Evaluation on four challenging sequence design benchmarks shows that MapDiff substantially outperforms state-of-the-art methods. Furthermore, the in silico sequences generated by MapDiff closely resemble the physico-chemical and structural characteristics of native proteins across different protein families and architectures.
- [35] arXiv:2501.02634 (replaced) [pdf, html, other]
-
Title: Optimal Inference of Asynchronous Boolean Network ModelsSubjects: Molecular Networks (q-bio.MN)
The network inference problem arises in biological research when one needs to quantitatively choose the best protein-interaction model for explaining a phenotype. The diverse nature of the data and nonlinear dynamics pose significant challenges in the search for the best methodology. In addition to balancing fit and model complexity, computational efficiency must be considered. In this paper, we present a novel approach that finds a solution that fits the observed dataset while fitting as few unobserved datasets as possible. We present algorithms for computing Boolean networks that optimally satisfy this criterion and allow for asynchronous network dynamics. Furthermore, we show that using our methodology, a solution to the pseudo-time inference problem, which is pertinent to the analysis of single-cell data, can be intertwined with network inference. Results are described for real and simulated datasets.
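Asynchronous Boolean dynamics, the setting this inference method targets, update one node per step, so a single state can branch into several successors. A minimal sketch with a toy three-node network (the rules are illustrative, not from the paper):

```python
# Toy 3-node Boolean network: each node's update rule is a function of the state.
rules = {
    0: lambda s: s[1] and not s[2],   # node 0: activated by node 1, repressed by node 2
    1: lambda s: s[0],                # node 1 follows node 0
    2: lambda s: s[0] or s[1],        # node 2 is an OR gate
}

def async_successors(state):
    """Under asynchronous dynamics, each step updates exactly one node,
    so a state can have up to len(state) distinct successors."""
    succ = set()
    for i, rule in rules.items():
        new = list(state)
        new[i] = rule(state)
        succ.add(tuple(new))
    return succ

succ = async_successors((True, False, False))
```

This branching is what makes asynchronous networks harder to fit than synchronous ones: an observed transition only constrains which successor set must contain the next measured state.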
- [36] arXiv:2505.14730 (replaced) [pdf, other]
-
Title: Predicting Neoadjuvant Chemotherapy Response in Triple-Negative Breast Cancer Using Pre-Treatment Histopathologic ImagesHikmat Khan, Ziyu Su, Huina Zhang, Yihong Wang, Bohan Ning, Shi Wei, Hua Guo, Zaibo Li, Muhammad Khalid Khan NiaziSubjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Triple-negative breast cancer (TNBC) remains a major clinical challenge due to its aggressive behavior and lack of targeted therapies. Accurate early prediction of response to neoadjuvant chemotherapy (NACT) is essential for guiding personalized treatment strategies and improving patient outcomes. In this study, we present an attention-based multiple instance learning (MIL) framework designed to predict pathologic complete response (pCR) directly from pre-treatment hematoxylin and eosin (H&E)-stained biopsy slides. The model was trained on a retrospective in-house cohort of 174 TNBC patients and externally validated on an independent cohort (n = 30). It achieved a mean area under the curve (AUC) of 0.85 during five-fold cross-validation and 0.78 on external testing, demonstrating robust predictive performance and generalizability. To enhance model interpretability, attention maps were spatially co-registered with multiplex immuno-histochemistry (mIHC) data stained for PD-L1, CD8+ T cells, and CD163+ macrophages. The attention regions exhibited moderate spatial overlap with immune-enriched areas, with mean Intersection over Union (IoU) scores of 0.47 for PD-L1, 0.45 for CD8+ T cells, and 0.46 for CD163+ macrophages. The presence of these biomarkers in high-attention regions supports their biological relevance to NACT response in TNBC. This not only improves model interpretability but may also inform future efforts to identify clinically actionable histological biomarkers directly from H&E-stained biopsy slides, further supporting the utility of this approach for accurate NACT response prediction and advancing precision oncology in TNBC.
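Attention-based MIL aggregates patch embeddings into a slide-level representation via learned softmax weights; a minimal untrained sketch of that pooling step (toy embeddings and scoring weights, not the authors' model):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(instances, score_w):
    """Attention-based MIL pooling: score each instance (patch) embedding,
    softmax the scores into weights, and return the weighted average
    (the slide-level representation) plus the weights (the attention map)."""
    scores = [sum(w * x for w, x in zip(score_w, inst)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [sum(a * inst[d] for a, inst in zip(weights, instances))
              for d in range(dim)]
    return pooled, weights

# Toy bag of three 2-d patch embeddings; the second patch scores highest.
patches = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
pooled, weights = attention_pool(patches, score_w=[1.0, 1.0])
```

The per-patch weights are exactly what gets visualized as an attention map and, in the study above, co-registered with mIHC-defined immune regions.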
- [37] arXiv:2506.06191 (replaced) [pdf, other]
-
Title: Functional Architecture of the Human Hypothalamus: Cortical Coupling and Subregional Organization Using 7-Tesla fMRI
Kent M. Lee, Joshua Rodriguez, Ludger Hartley, Philip A. Kragel, Lorena Chanes, Tor D. Wager, Karen S. Quigley, Lawrence L. Wald, Marta Bianciardi, Lisa Feldman Barrett, Jordan E. Theriault, Ajay B. Satpute
Comments: 36 pages, 1 table, 4 figures, 1 supplementary figure
Subjects: Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
The hypothalamus plays an important role in the regulation of the body's metabolic state and behaviors related to survival. Despite its importance, however, many questions remain regarding the intrinsic and extrinsic connections of the hypothalamus in humans, especially its relationship with the cortex. As a heterogeneous structure, the hypothalamus may be composed of different subregions, each with its own distinct relationship with the cortex. Previous work on functional connectivity in the human hypothalamus has either treated it as a unitary structure or relied on methodological approaches that are limited in modeling its intrinsic functional architecture. Here, we used resting-state data from ultrahigh-field 7 Tesla fMRI and a data-driven analytical approach to identify functional subregions of the human hypothalamus. Our approach identified four functional hypothalamic subregions based on intrinsic functional connectivity, which in turn showed distinct patterns of functional connectivity with the cortex. Overall, all hypothalamic subregions showed stronger connectivity with one cortical network, Cortical Network 1, composed primarily of frontal, midline, and limbic cortical areas, and weaker connectivity with a second cortical network, Cortical Network 2, composed largely of posterior sensorimotor regions. Of the hypothalamic subregions, the anterior hypothalamus showed the strongest connection to Cortical Network 1, while a more ventral subregion, containing the anterior hypothalamus and extending to the tuberal region, showed the weakest connectivity. The findings support the use of ultrahigh-field, high-resolution imaging in providing a more incisive investigation of the human hypothalamus that respects its complex internal structure and extrinsic functional architecture.
- [38] arXiv:2507.17274 (replaced) [pdf, other]
-
Title: Practitioner forecasts of technological progress in biostasis
Andrew T. McKenzie, Michael Cerullo, Navid Farahani, Jordan S. Sparks, Taurus Londoño, Aschwin de Wolf, Suzan Dziennis, Borys Wróbel, Alexander German, Emil F. Kendziorra, João Pedro de Magalhães, Wonjin Cho, R. Michael Perry, Max More
Subjects: Neurons and Cognition (q-bio.NC)
Biostasis has the potential to extend human lives by offering a bridge to powerful life extension technologies that may be developed in the future. However, key questions in the field remain unresolved, including which biomarkers reliably indicate successful preservation, what technical obstacles pose the greatest barriers, and whether different proposed revival methods are theoretically feasible. To address these gaps, we conducted a collaborative forecasting exercise with 22 practitioners in biostasis, including individuals with expertise in neuroscience, cryobiology, and clinical care. Our results reveal substantial consensus in some areas, for example that synaptic connectivity can serve as a reliable surrogate biomarker for information preservation quality. Practitioners identified the three most likely failure modes in contemporary biostasis: inadequate preservation quality even under ideal conditions, geographic barriers preventing timely preservation, and poor procedural execution. Regarding revival strategies, most respondents believe that provably reversible cryopreservation of whole mammalian organisms is most likely decades away, with provably reversible human cryopreservation expected even later, if it is ever achieved. Among revival strategies from contemporary preservation methods, whole brain emulation was considered the most likely to be developed first, though respondents were divided on the metaphysical question of whether it could constitute genuine revival. Molecular nanotechnology was viewed as nearly as likely to be technically feasible, and compatible with both pure cryopreservation and aldehyde-based methods. Taken together, these findings delineate current barriers to high-quality preservation, identify future research priorities, and provide baseline estimates for key areas of uncertainty.
- [39] arXiv:2507.18557 (replaced) [pdf, other]
-
Title: Deep Learning for Blood-Brain Barrier Permeability Prediction
Comments: Author list updated to reflect current contribution
Subjects: Quantitative Methods (q-bio.QM)
Predicting whether a molecule can cross the blood-brain barrier (BBB) is a key step in early-stage neuropharmaceutical development, directly influencing both research efficiency and success rates in drug discovery. Traditional empirical methods based on physicochemical properties are prone to systematic misjudgements due to their reliance on static rules. Early machine learning models, although data-driven, often suffer from limited capacity, poor generalization, and insufficient interpretability. In recent years, artificial intelligence (AI) methods have become essential tools for predicting BBB permeability and guiding related drug design, owing to their ability to model molecular structures and capture complex biological mechanisms. This article systematically reviews the evolution of this field, from deep neural networks to graph-based structural modeling, highlighting the advantages of multi-task and multimodal learning strategies in identifying mechanism-relevant variables. We further explore the emerging potential of generative models and causal inference methods for integrating permeability prediction with mechanism-aware drug design. BBB modeling is transitioning from static classification toward mechanistic perception and structure-function modeling. This paradigm shift provides a methodological foundation and future roadmap for the integration of AI into neuropharmacological development.
- [40] arXiv:2507.18595 (replaced) [pdf, html, other]
-
Title: Investigating Mobility in Spatial Biodiversity Models through Recurrence Quantification Analysis
Subjects: Populations and Evolution (q-bio.PE); Applied Physics (physics.app-ph)
Recurrence plots and their associated quantifiers provide a robust framework for detecting and characterising complex patterns in non-linear time-series. In this paper, we employ recurrence quantification analysis to investigate the dynamics of the cyclic, non-hierarchical May-Leonard model, also referred to as the rock-paper-scissors system, which describes competitive interactions among three species. A crucial control parameter in these systems is the species' mobility $m$, which governs the spatial displacement of individuals and profoundly influences the resulting dynamics. By systematically varying $m$ and constructing suitable recurrence plots from numerical simulations, we explore how recurrence quantifiers reflect distinct dynamical features associated with different ecological states. We then introduce an ensemble-based approach that leverages statistical distributions of recurrence quantifiers, computed from numerous independent realisations, allowing us to identify dynamical outliers as significant deviations from typical system behaviour. Through detailed numerical analyses, we demonstrate that these outliers correspond to divergent ecological regimes associated with specific mobility values, also providing a robust way to infer the mobility parameter from observed numerical data. Our results highlight the potential of recurrence-based methods as diagnostic tools for analysing spatial ecological systems and extracting ecologically relevant information from their non-linear dynamical patterns.
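For readers unfamiliar with recurrence plots, the basic construction is a thresholded pairwise-distance matrix. The sketch below is illustrative only (a toy sine series and an arbitrary threshold `eps`, not the authors' code), with the recurrence rate as the simplest example of a recurrence quantifier:

```python
import numpy as np

def recurrence_matrix(x: np.ndarray, eps: float) -> np.ndarray:
    """Binary recurrence matrix: R[i, j] = 1 iff |x_i - x_j| < eps."""
    d = np.abs(x[:, None] - x[None, :])  # pairwise distances
    return (d < eps).astype(int)

def recurrence_rate(R: np.ndarray) -> float:
    """Fraction of recurrent points, a basic RQA quantifier."""
    return float(R.mean())

# Toy example: a periodic series revisits its states, so the
# recurrence plot shows diagonal line structures.
t = np.linspace(0, 4 * np.pi, 100)
R = recurrence_matrix(np.sin(t), eps=0.1)
print(recurrence_rate(R))
```

Other quantifiers (determinism, laminarity, entropy of diagonal-line lengths) are derived from the line structures of `R`; the ensemble approach in the paper studies the distribution of such quantifiers over many independent simulation runs.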
- [41] arXiv:2503.22939 (replaced) [pdf, other]
-
Title: Interpretable Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification using Multi-Omics Data
Fadi Alharbi, Nishant Budhiraja, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Harshith Guduru, Mohanad Mohammed
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
The integration of heterogeneous multi-omics datasets at a systems level remains a central challenge for developing analytical and computational models in precision cancer diagnostics. This paper introduces the Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN), a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples together with Protein-Protein Interaction (PPI) networks for cancer classification across 31 different cancer types. The proposed approach combines differential gene expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi-omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov-Arnold representation theorem and uses trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves a classification accuracy of 96.28 percent and exhibits low experimental variability in comparison to related deep learning-based models. The biomarkers identified by MOGKAN were validated as cancer-related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability with potential to enhance the translation of complex multi-omics data into clinically actionable cancer diagnostics.
- [42] arXiv:2505.00316 (replaced) [pdf, html, other]
-
Title: Surrogate modeling of Cellular-Potts Agent-Based Models as a segmentation task using the U-Net neural network architecture
Tien Comlekoglu, J. Quetzalcóatl Toledo-Marín, Tina Comlekoglu, Douglas W. DeSimone, Shayn M. Peirce, Geoffrey Fox, James A. Glazier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model, such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive CPMs of biological processes at greater spatial and temporal scales.
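Recursive surrogate evaluation of the kind described amounts to repeatedly feeding the model its own output. The sketch below uses an invented stand-in (a periodic smoothing step) in place of the trained U-Net, so it only illustrates the rollout loop and the periodic-boundary handling, not the paper's model:

```python
import numpy as np

def rollout(surrogate, state: np.ndarray, n_steps: int) -> list:
    """Recursively apply a surrogate that jumps the lattice state
    forward (e.g. 100 MCS per call), collecting the trajectory."""
    traj = [state]
    for _ in range(n_steps):
        state = surrogate(state)
        traj.append(state)
    return traj

def toy_surrogate(state: np.ndarray) -> np.ndarray:
    # Stand-in for the U-Net: average of the four neighbors with
    # periodic (wrap-around) boundary conditions via np.roll.
    return 0.25 * (np.roll(state, 1, axis=0) + np.roll(state, -1, axis=0)
                   + np.roll(state, 1, axis=1) + np.roll(state, -1, axis=1))

grid = np.random.default_rng(0).random((64, 64))
traj = rollout(toy_surrogate, grid, n_steps=5)
print(len(traj), traj[-1].shape)  # 6 (64, 64)
```

The recursive setting is what makes error accumulation a concern for such surrogates: each prediction becomes the next input, so small per-step errors can compound over long rollouts.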
- [43] arXiv:2505.02022 (replaced) [pdf, other]
-
Title: NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Nanobodies -- single-domain antibody fragments derived from camelid heavy-chain-only antibodies -- exhibit unique advantages such as compact size, high stability, and strong binding affinity, making them valuable tools in therapeutics and diagnostics. While recent advances in pretrained protein and antibody language models (PPLMs and PALMs) have greatly enhanced biomolecular understanding, nanobody-specific modeling remains underexplored and lacks a unified benchmark. To address this gap, we introduce NbBench, the first comprehensive benchmark suite for nanobody representation learning. Spanning eight biologically meaningful tasks across nine curated datasets, NbBench encompasses structure annotation, binding prediction, and developability assessment. We systematically evaluate eleven representative models -- including general-purpose protein LMs, antibody-specific LMs, and nanobody-specific LMs -- in a frozen setting. Our analysis reveals that antibody language models excel in antigen-related tasks, while performance on regression tasks such as thermostability and affinity remains challenging across all models. Notably, no single model consistently outperforms others across all tasks. By standardizing datasets, task definitions, and evaluation protocols, NbBench offers a reproducible foundation for assessing and advancing nanobody modeling.
- [44] arXiv:2505.04823 (replaced) [pdf, other]
-
Title: Guide your favorite protein sequence generative model
Junhao Xiong, Hunter Nisonoff, Maria Lukarska, Ishan Gaur, Luke M. Oltrogge, David F. Savage, Jennifer Listgarten
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Generative machine learning models on sequences are transforming protein engineering. However, no principled framework exists for conditioning these models on auxiliary information, such as experimental data, in a plug-and-play manner. Herein, we present ProteinGuide -- a principled and general method for conditioning -- by unifying a broad class of protein generative models under a single framework. We demonstrate the applicability of ProteinGuide by guiding two protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences, conditioned on several user-specified properties such as enhanced stability, enzyme classes, and CATH-labeled folds. We also used ProteinGuide with inverse folding models and our own experimental assay to design adenine base editor sequences for high activity.
- [45] arXiv:2505.06834 (replaced) [pdf, html, other]
-
Title: Local stabilizability implies global controllability in catalytic reaction systems
Subjects: Biological Physics (physics.bio-ph); Subcellular Processes (q-bio.SC)
Controlling complex reaction networks is a fundamental challenge in the fields of physics, biology, and systems engineering. Here, we prove a general principle for catalytic reaction systems with kinetics where the reaction order and the stoichiometric coefficient match: the local stabilizability of a given state implies global controllability within its stoichiometric compatibility class. In other words, if a target state can be maintained against small perturbations, the system can be controlled from any initial condition to that state. This result highlights a tight link between the local and global dynamics of nonlinear chemical reaction systems, providing a mathematical criterion for global reachability that is often elusive in high-dimensional systems. These findings illuminate the robustness of biochemical systems and offer a way to control catalytic reaction systems in a generic framework.
- [46] arXiv:2505.13940 (replaced) [pdf, other]
-
Title: DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery
Comments: 29 pages, 8 figures, 2 tables
Subjects: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution. However, applying LLM agents to drug discovery is still constrained by challenges such as large-scale multimodal data processing, limited task automation, and poor support for domain-specific tools. To overcome these limitations, we introduce DrugPilot, an LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific workflows in drug discovery. DrugPilot enables multi-stage research processes by integrating structured tool use with a novel parameterized memory pool. The memory pool converts heterogeneous data from both public sources and user-defined inputs into standardized representations. This design supports efficient multi-turn dialogue, reduces information loss during data exchange, and enhances complex scientific decision-making. To support training and benchmarking, we construct a drug instruction dataset covering eight core drug discovery tasks. Under the Berkeley function-calling benchmark, DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively. These results highlight DrugPilot's potential as a versatile agent framework for computational science domains requiring automated, interactive, and data-integrated reasoning.
- [47] arXiv:2507.16674 (replaced) [pdf, other]
-
Title: GASPnet: Global Agreement to Synchronize Phases
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
In recent years, Transformer architectures have revolutionized most fields of artificial intelligence, relying on an attentional mechanism based on the agreement between keys and queries to select and route information in the network. In previous work, we introduced a novel, brain-inspired architecture that leverages a similar implementation to achieve a global 'routing by agreement' mechanism. Such a system modulates the network's activity by matching each neuron's key with a single global query, pooled across the entire network. Acting as a global attentional system, this mechanism improves noise robustness over baseline levels but is insufficient for multi-classification tasks. Here, we improve on this work by proposing a novel mechanism that combines aspects of the Transformer attentional operations with a compelling neuroscience theory, namely, binding by synchrony. This theory proposes that the brain binds together features by synchronizing the temporal activity of neurons encoding those features. This allows the binding of features from the same object while efficiently disentangling those from distinct objects. We drew inspiration from this theory and incorporated angular phases into all layers of a convolutional network. After achieving phase alignment via Kuramoto dynamics, we use this approach to enhance operations between neurons with similar phases and suppress those with opposite phases. We test the benefits of this mechanism on two datasets: one composed of pairs of digits and one composed of a combination of an MNIST item superimposed on a CIFAR-10 image. Our results show better accuracy than comparable CNNs, along with greater robustness to noise and better generalization. Overall, we propose a novel mechanism that addresses the visual binding problem in neural networks by leveraging the synergy between neuroscience and machine learning.
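Phase alignment via Kuramoto dynamics can be illustrated in a few lines. The following is a minimal sketch with uniform all-to-all coupling and no intrinsic frequencies, not the GASPnet implementation (where the coupling would be learned and layer-dependent); positively coupled phases are pulled toward a common value:

```python
import numpy as np

def kuramoto_step(theta: np.ndarray, K: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """One Euler step of Kuramoto dynamics: d(theta_i)/dt =
    sum_j K[i, j] * sin(theta_j - theta_i)."""
    dtheta = (K * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    return theta + dt * dtheta

def coherence(theta: np.ndarray) -> float:
    """Kuramoto order parameter |mean(exp(i*theta))|, in [0, 1]:
    0 for scattered phases, 1 for perfect synchrony."""
    return float(np.abs(np.exp(1j * theta).mean()))

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=8)  # random initial phases
K = np.ones((8, 8)) / 8                    # all-to-all positive coupling
for _ in range(200):
    theta = kuramoto_step(theta, K)
print(coherence(theta))  # approaches 1 as phases synchronize
```

In a binding-by-synchrony scheme, neurons encoding the same object would receive positive mutual coupling (and so align), while negative coupling between groups would push their phases apart, letting phase act as a soft object tag.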
- [48] arXiv:2507.19218 (replaced) [pdf, other]
-
Title: Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness
Sebastian Dohnány, Zeb Kurth-Nelson, Eleanor Spens, Lennart Luettgau, Alastair Reid, Iason Gabriel, Christopher Summerfield, Murray Shanahan, Matthew M Nour
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Artificial intelligence chatbots have achieved unprecedented adoption, with millions now using these systems for emotional support and companionship in contexts of widespread social isolation and capacity-constrained mental health services. While some users report psychological benefits, concerning edge cases are emerging, including reports of suicide, violence, and delusional thinking linked to perceived emotional relationships with chatbots. To understand this new risk profile we need to consider the interaction between human cognitive and emotional biases, and chatbot behavioural tendencies such as agreeableness (sycophancy) and adaptability (in-context learning). We argue that individuals with mental health conditions face increased risks of chatbot-induced belief destabilization and dependence, owing to altered belief-updating, impaired reality-testing, and social isolation. Current AI safety measures are inadequate to address these interaction-based risks. To address this emerging public health concern, we need coordinated action across clinical practice, AI development, and regulatory frameworks.