Library

A collection of papers (and code) that were at one time or another deemed interesting enough to hang on to…

Deep Learning

LLMs and Agents

  • Collective Intelligence of Specialized Language Models Guides Realization of de novo Chemical Synthesis + GitHub Repo
    Li, Haote; Sarkar, Sumon; Lu, Wenxin; Loftus, Patrick; Qiu, Tianyin; Shee, Yu; Cuomo, Abbigayle; Webster, John-Paul; Kelly, H. Ray; Manee, Vidhyadhar; Sreekumar, Sanil; Buono, Frederic; Crabtree, Robert; Newhouse, Timothy; Batista, Victor
    Organic Chemistry on ChemRxiv 2025
     The paper introduces MOSAIC, a framework built on the Llama3.1-8B-instruct architecture that uses 2,489 specialized models to analyze chemical reactions. It predicts novel transformations, and over 35 new compounds from diverse categories were successfully synthesized. This approach enhances the utilization of existing chemical knowledge, fostering advances in computational and experimental chemistry.

  • The Hitchhiker’s Guide to Socratic Methods in Prompting Large Language Models for Chemistry Applications
    Hassan Harb, Yunkai Sun, Rajeev Surendran Assary
    Theoretical and Computational Chemistry on ChemRxiv 2025
     The paper discusses the application of the Socratic method in prompting large language models (LLMs) for chemistry, focusing on iterative questioning to improve hypothesis refinement and problem-solving. It illustrates how integrating Socratic principles enhances LLM performance, adaptability, and model interpretability in scientific reasoning through examples from chemistry and materials research.
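
A toy version of the loop the paper describes might look like the following (a minimal sketch; `llm` is a placeholder text-completion function and the prompts are illustrative, not taken from the paper):

```python
# Socratic refinement loop (sketch): the model critiques its own hypothesis
# with one probing question per round, then revises. `llm` is assumed to be
# a text-in/text-out completion callable; prompts are illustrative only.
def socratic_refine(problem, llm, rounds=3):
    hypothesis = llm(f"Propose a hypothesis for: {problem}")
    for _ in range(rounds):
        question = llm(f"Ask one probing question that tests: {hypothesis}")
        answer = llm(f"Problem: {problem}\nQuestion: {question}\nAnswer briefly:")
        hypothesis = llm(
            "Revise the hypothesis given this exchange.\n"
            f"Hypothesis: {hypothesis}\nQ: {question}\nA: {answer}"
        )
    return hypothesis
```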

  • Instruction-Following Pruning for Large Language Models
    Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei
    cs.CL on arXiv 2025
     The paper proposes “instruction-following pruning,” a dynamic structured pruning method for large language models (LLMs). It utilizes a sparse mask predictor that adapts based on user instructions, optimizing both the predictor and the LLM using instruction-following data. Results show that a 3B activated model outperforms a 3B dense model by 5-8 points in specific domains, matching a 9B model’s performance.
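
The core mechanism lends itself to a short sketch: a small predictor maps a pooled instruction embedding to a per-channel gate on each FFN. Everything below (module names, dimensions, the straight-through trick) is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MaskedFFN(nn.Module):
    """Instruction-conditioned structured pruning (sketch): a predictor maps
    an instruction embedding to a per-channel gate on the FFN hidden layer."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # Mask predictor conditioned on a pooled instruction embedding.
        self.mask_predictor = nn.Sequential(nn.Linear(d_model, d_ff), nn.Sigmoid())

    def forward(self, x, instruction_emb, threshold=0.5):
        # x: (batch, seq, d_model); instruction_emb: (batch, d_model)
        gate = self.mask_predictor(instruction_emb)   # soft scores in (0, 1)
        mask = (gate > threshold).float()             # hard channel mask
        # Straight-through estimator so the predictor still receives gradients.
        mask = mask + gate - gate.detach()
        h = torch.relu(self.up(x)) * mask.unsqueeze(1)  # prune channels per instruction
        return self.down(h)
```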

  • Inconsistency of LLMs in Molecular Representations
    Bing Yan, Angelica Chen, Kyunghyun Cho
    Theoretical and Computational Chemistry on ChemRxiv 2024
     The paper investigates the consistency of large language models (LLMs) in molecular representations like SMILES and IUPAC names. Despite finetuning with a dual representation dataset and applying a Kullback-Leibler divergence loss for training, the models exhibited less than 1% consistency and no improvement in accuracy. Findings highlight the limitations of LLMs in understanding chemistry.
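
The consistency objective is easy to picture: penalize divergence between the distributions the model predicts from the two representations of the same molecule. A minimal sketch (the symmetric form is an assumption; the paper's exact loss may differ):

```python
import torch.nn.functional as F

def consistency_kl_loss(logits_smiles, logits_iupac):
    """Symmetric KL between distributions predicted from the SMILES and
    IUPAC representations of the same molecule (illustrative sketch)."""
    p = F.log_softmax(logits_smiles, dim=-1)
    q = F.log_softmax(logits_iupac, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```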

  • MemGPT: Towards LLMs as Operating Systems + GitHub Repo
    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez
    arXiv 2024
      Infinite context for language models. Now packaged as part of Letta.

  • AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning + GitHub Repo
    Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec, and James Zou
    NeurIPS 2024

  • ReAct: Synergizing Reasoning and Acting in Language Models + Project Site
    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao
    ICLR 2023
      ReAct integrates reasoning and acting in LLMs by interleaving reasoning traces with task-specific actions. It reduces hallucinations in QA (HotpotQA) and fact verification (Fever) via Wikipedia API interactions and outperforms imitation and RL methods in ALFWorld (+34%) and WebShop (+10%). ReAct enhances interpretability and decision-making with minimal in-context examples.
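
The control flow is simple enough to sketch in a few lines (`llm` and `tools` are placeholders for a completion function and a tool registry; the Thought/Action/Observation format follows the paper's convention):

```python
def react(question, llm, tools, max_steps=8):
    """Minimal ReAct loop (sketch). `llm` is a text-completion callable and
    `tools` maps tool names to callables (e.g. a Wikipedia search)."""
    transcript = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        step = llm(transcript)                       # continues with thought + action
        transcript += step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        actions = [l for l in step.splitlines() if l.startswith("Action:")]
        if not actions:
            break                                    # model failed to act
        name, arg = actions[-1].split("Action:", 1)[1].strip().split("[", 1)
        observation = tools[name](arg.rstrip("]"))   # e.g. search[Colorado orogeny]
        transcript += f"\nObservation: {observation}\nThought:"
    return None
```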

  • LoRA: Low-Rank Adaptation of Large Language Models + GitHub Repo
    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen
    ICLR 2022
      LoRA is now wrapped into the 🤗 PEFT library
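
Minimal PEFT usage looks like this (real `peft` API, though the module names to target vary by architecture; the values shown are common defaults, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                          # rank of the update matrices A and B
    lora_alpha=16,                # scaling: deltaW = (alpha / r) * B @ A
    target_modules=["c_attn"],    # GPT-2's fused QKV projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters train
```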

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela
    NeurIPS 2020

Neural Reasoning & Decision Making

Contrastive Learning

Recommender Systems

Molecular Generation

Generative models for molecules. Inputs are most typically text-based (SMILES/SELFIES) or graph representations (parallel models over atom and bond matrices). Most have some property-optimization ability (latent-space search/interpolation, reinforcement learning, guided genetic exploration). These methods are most commonly autoregressive, but non-autoregressive molecular generation methods have recently begun to appear.
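
Since most of these generators are token-level autoregressive language models over SMILES, the core sampling loop is worth sketching (`model` and `tokenizer` are placeholders for any trained SMILES LM; the RDKit validity filter is the standard post-hoc check):

```python
import torch
from rdkit import Chem

@torch.no_grad()
def sample_smiles(model, tokenizer, n=100, max_len=120, temperature=1.0):
    """Sample SMILES token by token; keep only strings RDKit can parse."""
    valid = []
    for _ in range(n):
        tokens = [tokenizer.bos_id]
        for _ in range(max_len):
            logits = model(torch.tensor([tokens]))[0, -1] / temperature
            next_id = torch.multinomial(torch.softmax(logits, -1), 1).item()
            if next_id == tokenizer.eos_id:
                break
            tokens.append(next_id)
        smiles = tokenizer.decode(tokens[1:])
        if Chem.MolFromSmiles(smiles) is not None:  # validity filter
            valid.append(smiles)
    return valid
```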

Reviews

Diffusion / Flow Matching Models

Normalizing Flows

  • MolGrow: A Graph Normalizing Flow for Hierarchical Molecular Generation (No implementation available)
    Maksim Kuznetsov and Daniil Polykovskiy
    Proceedings of the AAAI Conference on Artificial Intelligence 2021, 35 (9), 8226-8234
      Hierarchical normalizing flow for molecular graphs; autoregressive. Builds graphs either BFS- or fragment-based (the latter works better). The model is composed of “plug-and-play” modules. Trained on MOSES, QM9, and ZINC250k. Property-constrained optimization is based on a genetic algorithm.

  • FastFlows: Flow-Based Models for Molecular Graph Generation
    Nathan C. Frey, Vijay Gadepally, and Bharath Ramsundar
    ELLIS Machine Learning for Molecule Discovery Workshop 2021
      Framework for normalizing flows from SELFIES. Uses substructure filtering to speed up training and to work from small training sets. Built-in MPO functionality.
    TDS article

  • MoFlow: An Invertible Flow Model for Generating Molecular Graphs + GitHub Repo
    Chengxi Zang and Fei Wang
    in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020
      Non-autoregressive normalizing flow for molecular graphs; a two-stage flow (a GLOW-based flow for bonds, then a bond-conditioned flow for atoms). Similar to GraphNVP. Trained (NLL) on QM9 and ZINC250k. Developed a new architecture. Excellent results.
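
The invertible building block behind GLOW-style flows like MoFlow is the affine coupling layer; a generic sketch (illustrative, not the paper's exact layer):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half the features are transformed with a scale/shift predicted from
    the other half, so the Jacobian is triangular and the inverse is
    closed-form. Dimensions and the hidden net are illustrative."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs log-scale and shift
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)          # log|det J| for the NLL objective
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=-1)
```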

  • GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation + GitHub repo
    Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang
    ICLR 2020
      How to explain this better than reviewer #1…

"This paper proposes a generative model architecture for molecular graph generation based on autoregressive flows. The main contribution of this paper is to combine existing techniques (auto-regressive BFS-ordered generation of graphs, normalizing flows, dequantization by Gaussian noise, fine-tuning based on reinforcement learning for molecular property optimization, and validity constrained sampling). Most of these techniques are well-established either for data generation with normalizing flows or for molecular graph generation and the novelty lies in the combination of these building blocks into a framework."

GANs

Other

  • Growing and Linking Optimizers: Synthesis-driven Molecule Design
    Clarisse Descamps, Vincent Bouttier, Juan Sanz García, Quentin Perron, Hamza Tajmouati
    Biological and Medicinal Chemistry on ChemRxiv 2025
     The paper introduces two reaction-based generative models, Growing Optimizer and Linking Optimizer, for molecular design, focusing on drug discovery. These models sequentially select building blocks and simulate reactions, offering advantages in restricting chemistry to specific pathways. Compared to REINVENT 4, they generate more synthetically accessible molecules with desired properties.

  • Scaffold Hopping with Generative Reinforcement Learning
    Luke Rossen, Francesca Grisoni, Finton Sirockin, Nadine Schneider
    Biological and Medicinal Chemistry on ChemRxiv 2025
     The paper explores scaffold hopping with generative reinforcement learning to design novel scaffolds for lead candidates. Presents improvements on the REINVENT and LinkINVENT methods, using RL for unconstrained scaffold hopping. Essentially uses a ROCS-based reward to steer scaffold generation toward scaffolds with similar 3D shape and pharmacophore properties.

  • Targeted Molecular Generation With Latent Reinforcement Learning
    Ragy Haddad, Eleni Litsa, Zhen Liu, Xin Yu, Daniel Burkhardt, Govinda Bhisetti
    ChemRxiv 2025
     The paper presents a novel approach for targeted molecular generation using Reinforcement Learning with proximal policy optimization (PPO) in the latent space of pre-trained deep learning generative models. The method shows superior performance on benchmark datasets and can generate molecules with specific substructures while optimizing for desired properties, aiding drug discovery.
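
A stripped-down version of the idea, with REINFORCE standing in for PPO to keep the sketch short (`decoder` and `score` are placeholders for the pre-trained generator and the property oracle):

```python
import torch

def optimize_latent(decoder, score, d=64, iters=500, batch=32):
    """Policy-gradient optimization in a generator's latent space (sketch):
    the policy emits latent vectors, which are decoded and scored."""
    policy = torch.nn.Sequential(torch.nn.Linear(d, 128), torch.nn.ReLU(),
                                 torch.nn.Linear(128, 2 * d))  # mean and log-std
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    state = torch.zeros(batch, d)                      # fixed dummy context
    for _ in range(iters):
        mu, log_std = policy(state).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.exp())
        z = dist.sample()                              # actions = latent vectors
        rewards = torch.tensor([score(decoder(zi)) for zi in z])
        advantage = rewards - rewards.mean()           # simple baseline
        loss = -(dist.log_prob(z).sum(-1) * advantage).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```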

  • TamGen: drug design with target-aware molecule generation through a chemical language model + GitHub Repo
    Kehan Wu, Yingce Xia, Pan Deng, Renhe Liu, Yuan Zhang, Han Guo, Yumeng Cui, Qizhi Pei, Lijun Wu, Shufang Xie, Si Chen, Xi Lu, Song Hu, Jinzhi Wu, Chi-Kin Chan, Shawn Chen, Liangliang Zhou, Nenghai Yu, Enhong Chen, Haiguang Liu, Jinjiang Guo, Tao Qin & Tie-Yan Liu
    Nat. Commun. 2024, 15, 9360

  • TurboHopp: Accelerated Molecule Scaffold Hopping with Consistency Models
    Kiwoong Yoo, Owen Oertell, Junhyun Lee, Sanghoon Lee & Jaewoo Kang
    NeurIPS 2024

  • DrugSynthMC: An Atom-Based Generation of Drug-like Molecules with Monte Carlo Search
    Milo Roucairol, Alexios Georgiou, Tristan Cazenave, Filippo Prischi & Olivier E. Pardo
    J. Chem. Inf. Model. 2024, 64, 18, 7097

  • Enabling target-aware molecule generation to follow multi objectives with Pareto MCTS + GitHub Repo
    Yaodong Yang, Guangyong Chen, Jinpeng Li, Junyou Li, Odin Zhang, Xujun Zhang, Lanqing Li, Jianye Hao, Ercheng Wang & Pheng-Ann Heng
    Commun. Biol. 2024, 7, 1074

  • Llamol: a dynamic multi-conditional generative transformer for de novo molecular design
    Niklas Dobberstein, Astrid Maass & Jan Hamaekers
    J. of Cheminf., 2024, 16, 73
      Transformer based on Llama2, tweaked for molgen. Not the most impressive paper, but some interesting tidbits scattered throughout (e.g., SCL, etc.).

  • REINVENT4: Modern AI–driven generative molecule design + GitHub Repo
    Hannes H. Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H. Mervin & Ola Engkvist
    J. of Cheminf., 2024, 16, 20
      AstraZeneca’s molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.

  • Masked graph modeling for molecule generation + GitHub Repo
    Omar Mahmood, Elman Mansimov, Richard Bonneau, and Kyunghyun Cho
    Nat. Commun. 2021, 12, 3156
      MPNN for molecular graphs. Generation proceeds by iteratively sampling subsets of graph components, with further generation steps conditioned on the rest of the graph (see the sketch after this list). Trained on QM9 and ChEMBL. The paper provides an analysis of GuacaMol benchmark metrics, particularly their independence. Conclusions:

    1. Validity, KL-divergence and Fréchet Distance scores correlate highly with each other
    2. These three metrics correlate negatively with the novelty score
    3. Uniqueness does not correlate strongly with any other metric
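
A sketch of that iterative sampling loop (the graph methods and `model.predict` are placeholder assumptions, not the authors' API):

```python
import random

def iterative_sample(model, graph, n_steps=100, frac=0.1):
    """Gibbs-like masked-graph generation (sketch): repeatedly mask a random
    subset of graph components (atoms/bonds) and resample each conditioned
    on the rest of the graph."""
    components = list(graph.components())          # atoms + bonds (assumed API)
    for _ in range(n_steps):
        subset = random.sample(components, max(1, int(frac * len(components))))
        for c in subset:
            graph.mask(c)
        for c in subset:
            graph.assign(c, model.predict(graph, c))  # sample from conditional
    return graph
```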

HTE

  • Expediting hit-to-lead progression in drug discovery through reaction prediction and multi-objective molecular optimization
    Kenneth, Atz; David F., Nippa;… Gisbert, Schneider
    Organic Chemistry on ChemRxiv 2025
     The paper presents an integrated medicinal chemistry workflow that accelerates hit-to-lead optimization in drug discovery. Using high-throughput experimentation, a dataset of 13,490 reaction outcomes was generated, training deep graph neural networks. A virtual library of 26,375 molecules led to 212 candidate MAGL inhibitors, with 14 achieving subnanomolar activity, improving potency up to 4500 times over original compounds.

Reaction Product Prediction

These models predict mechanisms for chemical reactions, ideally similar to how we teach second-years to push arrows. There are relatively few examples of this task, but they fall into three major categories: electron flows, graph edits, and reaction networks. At inference, these models are used for forward synthesis prediction, potentially including prediction of chemo-/regioselectivity. They are largely trained on pattern recognition from atom-mapped inputs (USPTO), though there are exceptions (e.g., the Baldi papers below).
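
For reference, the atom-mapped inputs look like this in practice; RDKit reads the map numbers that let these models align reactant and product atoms (the example reaction is illustrative):

```python
from rdkit import Chem

# USPTO-style atom-mapped reaction SMILES: acetic acid + ammonia >> acetamide + water
rxn_smiles = "[CH3:1][C:2](=[O:3])[OH:4].[NH3:5]>>[CH3:1][C:2](=[O:3])[NH2:5].[OH2:4]"
reactants, _, products = rxn_smiles.split(">")

for smi in reactants.split("."):
    mol = Chem.MolFromSmiles(smi)
    for atom in mol.GetAtoms():
        print(atom.GetSymbol(), atom.GetAtomMapNum())  # map number links to product atom
```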

Electron Flow Prediction

Sources and Sinks

The Baldi papers map e- sources and sinks, then combinatorially generate a probability distribution over electron flows. The described classifiers are used to filter source-sink pairs before evaluation. Trained on in-house (unavailable) data. The papers don't have available source code, but ready-to-use programs are available on ChemDB.
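
The combinatorial scheme reduces to a few lines (sketch; all four callables are placeholder assumptions):

```python
from itertools import product

def enumerate_flows(sources, sinks, source_filter, sink_filter, scorer):
    """Filter candidate electron sources and sinks with trained classifiers,
    score the surviving pairs, and normalize into a distribution over
    arrow-pushing steps."""
    pairs = [(so, si) for so, si in product(sources, sinks)
             if so != si and source_filter(so) and sink_filter(si)]
    scores = [scorer(so, si) for so, si in pairs]
    total = sum(scores) or 1.0                     # guard against empty set
    return [(p, s / total) for p, s in zip(pairs, scores)]
```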

Reaction Network Graphs

Other

Atom Mapping

Computer-Aided Retrosynthesis Planning

Publication Parsing

ML Driven Drug Design

Property/Activity Prediction

Active Learning Methods

Synthetic Accessibility

Molecular Optimization

  • A Zero-Shot Single-point Molecule Optimization Model: Mimicking Medicinal Chemists’ Expertise
    Peng Gao, Jie Zhang, Zhilian Dai, Yangyang Deng, Dan Zhang, Jiawei Fu, Songyou Zhong, Yichao Liu
    Theoretical and Computational Chemistry on ChemRxiv 2024
     The paper presents the Single-point Chemical Language Model (SpCLM), a framework for molecular design that mimics medicinal chemists’ expertise. Using a few hundred generated compounds, SpCLM predicts 60%-80% of active compounds in tests, correlating well with experimental data. This method reduces the need for extensive screening, offering a data-driven approach to optimize drug activity and selectivity.

  • Projecting Molecules into Synthesizable Chemical Spaces
    Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, and Jianzhu Ma
    arXiv 2024
      Interesting new approach to making generated virtual hits more synthesizable; essentially cleaning out the chaff. Describes a new postfix notation (A B +) for synthetic transformations and a transformer-based model that translates graphs into this notation. The model is capable of synthesis planning, generating similar but more synthesizable analogues, and exploring chemical space along the synthesizability dimension.
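
The postfix idea itself is just stack evaluation; a sketch (the `RXN_` token convention and the `react` callable are assumptions for illustration, not the paper's format):

```python
def eval_postfix(tokens, react):
    """Stack evaluation of a postfix synthesis plan: building-block tokens
    push molecules, operator tokens pop their reactants and push the product."""
    stack = []
    for tok in tokens:
        if tok.startswith("RXN_"):           # operator, e.g. an amide coupling
            b, a = stack.pop(), stack.pop()  # b was on top of the stack
            stack.append(react(tok, a, b))
        else:                                # operand: a building-block SMILES
            stack.append(tok)
    assert len(stack) == 1, "malformed postfix plan"
    return stack[0]

# e.g. ["CC(=O)O", "NCc1ccccc1", "RXN_amide"] evaluates like "A B +"
```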

  • Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space + GitHub Repo
    Xin Xia, Yiping Liu, Chunhou Zheng, Xingyi Zhang, Qingwen Wu, Xin Gao, Xiangxiang Zeng, and Yansen Su
    J. Chem. Inf. Model. 2024, 64, (13), 5161
      Multiobjective Molecule Optimization (MOMO) is a Pareto-based MPO framework that evolves molecules into better molecules: a genetic/evolutionary algorithm operating in the latent (implicit) space encoded by a VAE.
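
The evolutionary loop reduces to mutate-decode-score-select on latent vectors (sketch; `vae.decode` and the objective callables are placeholders):

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated rows (maximization on every objective)."""
    return [i for i, s in enumerate(scores)
            if not any(np.all(t >= s) and np.any(t > s) for t in scores)]

def evolve(population, vae, objectives, n_gen=50, sigma=0.1):
    """Evolutionary MPO in a VAE latent space (sketch): mutate latent vectors,
    decode, score on several objectives, keep the Pareto set."""
    z = np.asarray(population)
    for _ in range(n_gen):
        children = z + sigma * np.random.randn(*z.shape)   # Gaussian mutation
        pool = np.concatenate([z, children])
        scores = np.array([[f(vae.decode(p)) for f in objectives] for p in pool])
        z = pool[pareto_front(scores)]
    return z
```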

Large-scale Virtual Screening

Cheminformatics

Reviews

General

Δ-machine learning

Protein Structure Prediction

Chemistry

Med Chem

My papers