Publications | Matt Raymond

2025

ICML FM4LS

Joint Diffusion Sampling via Positive-Unlabeled Guidance for Multi-Modal Data

Matt Raymond, Yilun Zhu, Jianxin Zhang, Angela Violi, and Clayton Scott

In Proceedings of the ICML 2025 Workshop on Foundation Models for Life Sciences (FM4LS), Jul 2025

Abs OpenReview

Multi-modal generative models typically require abundant training data from multi-modal joint distributions, which is often unavailable in the life sciences. We propose to treat each modality as a marginal distribution and correct their independent diffusion processes to sample from their joint distribution. Specifically, we introduce "joint diffusion sampling," a method that generates a sample from joint distributions using pre-trained models for individual (uni-modal) marginal distributions and minimal data from the (multi-modal) joint distribution. We demonstrate preliminary uni- and multi-modal results for images, molecules, and Boolean values, and discuss multi-modal applications of our approach.
PSST

Machine learning models for Si nanoparticle growth in nonthermal plasma

Matt Raymond, Paolo Elvati, Jacob C. Saldinger, Jonathan Lin, Xuetao Shi, and Angela Violi

Plasma Sources Science and Technology, Mar 2025

Abs arXiv DOI Talk

Nanoparticles formed in nonthermal plasmas (NTPs) can have unique properties and applications. However, modeling their growth in these environments presents significant challenges due to the non-equilibrium nature of NTPs, making them computationally expensive to describe. In this work, we address the challenges associated with accelerating the estimation of parameters needed for these models. Specifically, we explore how different machine learning models can be tailored to improve prediction outcomes. We apply these methods to reactive classical molecular dynamics data, which capture the processes associated with colliding silane fragments in NTPs. These reactions exemplify processes where qualitative trends are clear, but their quantification is challenging, hard to generalize, and requires time-consuming simulations. Our results demonstrate that good prediction performance can be achieved when appropriate loss functions are implemented and correct invariances are imposed. While the diversity of molecules used in the training set is critical for accurate prediction, our findings indicate that only a fraction (15%–25%) of the energy and temperature sampling is required to achieve high levels of accuracy. This suggests a substantial reduction in computational effort is possible for similar systems.

2024

IEEE MLSP

Joint Optimization of Piecewise Linear Ensembles

Matt Raymond, Angela Violi, and Clayton Scott

In IEEE 34th International Workshop on Machine Learning for Signal Processing, Sep 2024

Abs DOI HTML press

Tree ensembles achieve state-of-the-art performance despite being greedily optimized. Global refinement (GR) reduces greediness by jointly and globally optimizing all constant leaves. We propose Joint Optimization of Piecewise Linear ENsembles (JOPLEN), a piecewise-linear extension of GR. Compared to GR, JOPLEN improves model flexibility and can apply common penalties, including sparsity-promoting matrix norms and subspace-norms, to nonlinear prediction. We evaluate the Frobenius norm, ℓ2,1 norm, and Laplacian regularization for 146 regression and classification datasets; JOPLEN, combined with GB trees and RF, achieves superior performance in both settings. Additionally, JOPLEN with a nuclear norm penalty empirically learns smooth and subspace-aligned functions. Finally, we perform multitask feature selection by extending the Dirty LASSO. JOPLEN Dirty LASSO achieves a superior feature sparsity/performance tradeoff to linear and gradient boosted approaches. We anticipate that JOPLEN will improve regression, classification, and feature selection across many fields.
arXiv

Universal Feature Selection for Simultaneous Interpretability of Multitask Datasets

Matt Raymond, Jacob Charles Saldinger, Paolo Elvati, Clayton Scott, and Angela Violi

Mar 2024

Abs arXiv DOI

Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS’s general and scalable feature selection algorithm surpasses these limitations to identify both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS’s universal features enable domain-specific knowledge transfer between datasets, and suggest deep connections in seemingly-disparate chemical datasets. We expect these results to have important repercussions in manually-guided inverse problems. Beyond its current application, BoUTS holds immense potential for elucidating data-poor systems by leveraging information from similar data-rich systems. BoUTS represents a significant leap in cross-domain feature selection, potentially leading to advancements in various scientific fields.

2023

Nature CS Cover Article
Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles

Jacob Saldinger, Matt Raymond, Paolo Elvati, and Angela Violi

Nature Computational Science, May 2023

Abs DOI Bib HTML press

Although challenging, the accurate and rapid prediction of nanoscale interactions has broad applications for numerous biological processes and material properties. While several models have been developed to predict the interaction of specific biological components, they use system-specific information that hinders their application to more general materials. Here we present NeCLAS, a general and efficient machine learning pipeline that predicts the location of nanoscale interactions, providing human-intelligible predictions. NeCLAS outperforms current nanoscale prediction models for generic nanoparticles up to 10–20\thinspacenm, reproducing interactions for biological and non-biological systems. Two aspects contribute to these results: a low-dimensional representation of nanoparticles and molecules (to reduce the effect of data uncertainty), and environmental features (to encode the physicochemical neighborhood at multiple scales). This framework has several applications, from basic research to rapid prototyping and design in nanobiotechnology.
@article{saldinger2023, author = {Saldinger, Jacob and Raymond, Matt and Elvati, Paolo and Violi, Angela}, title = {Domain-agnostic predictions of nanoscale interactions in proteins and nanoparticles}, journal = {Nature Computational Science}, year = {2023}, month = may, day = {01}, issn = {2662-8457}, doi = {10.1038/s43588-023-00438-x}, cover = {https://nature.com/natcomputsci/volumes/3/issues/5}, press = {https://news.umich.edu/nanobiotics-ai-for-discovering-where-and-how-nanoparticles-bind-with-proteins/} }