Paper Library

Molecular Generation

Generative models for molecules. Inputs are most typically text-based (SMILES/SELFIES) or graph representations (parallel models over atom and bond matrices). Most include some property-optimization capability (latent-space search/interpolation, reinforcement learning, guided genetic exploration). These methods are most commonly autoregressive, but non-autoregressive molecular generation methods have recently begun to appear.
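The two input families mentioned above can be sketched in a few lines. Below, a hand-written example (not toolkit output) shows the same molecule, ethanol, as a SMILES string and as an atom list plus bond (adjacency) matrix:

```python
# Two common molecule input representations, sketched for ethanol (CCO).
# The SMILES string and the atom/bond matrices are illustrative
# hand-written examples, not the output of any toolkit.

smiles = "CCO"  # text-based representation: C-C-O chain

# Graph representation: an atom list plus a bond (adjacency) matrix.
atoms = ["C", "C", "O"]
bonds = [
    [0, 1, 0],  # C1 bonded to C2
    [1, 0, 1],  # C2 bonded to C1 and O
    [0, 1, 0],  # O bonded to C2
]

def degree(bond_matrix, i):
    """Number of explicit bonds on atom i (heavy-atom degree)."""
    return sum(bond_matrix[i])

print(degree(bonds, 1))  # → 2 (the central carbon has two heavy-atom neighbors)
```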

Reviews

Diffusion Models

  • Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation + GitHub Repo
    Ian Dunn and David Ryan Koes
    ArXiv 2024
      Extends the flow matching framework to categorical data by constructing flows constrained to lie on a continuous representation of categorical data known as the probability simplex. Finds that, in practice, a simpler approach that makes no accommodations for the categorical nature of the data yields equivalent or superior performance. Presents FlowMol, a flow matching model for 3D de novo molecule generation that improves on prior flow matching methods.
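A minimal sketch of the simplex constraint at the heart of the categorical flows: linearly interpolating between a uniform prior and a one-hot atom-type vector keeps every intermediate point on the probability simplex. Function and variable names here are ours, not the paper's:

```python
# Straight-line interpolation between a uniform prior and a one-hot
# atom-type vector. Every intermediate point stays on the probability
# simplex (non-negative entries summing to 1), which is the constraint
# the paper's categorical flows are built around.

def interpolate(p0, p1, t):
    """Point at time t on the straight-line path from p0 to p1."""
    return [(1 - t) * a + t * b for a, b in zip(p0, p1)]

K = 4                          # number of atom-type categories
prior = [1.0 / K] * K          # uniform point on the simplex
target = [0.0, 1.0, 0.0, 0.0]  # one-hot "data" endpoint

for t in (0.0, 0.5, 1.0):
    pt = interpolate(prior, target, t)
    # stays on the simplex at every time step
    assert all(x >= 0 for x in pt) and abs(sum(pt) - 1.0) < 1e-9
```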

  • GeoLDM: Geometric Latent Diffusion Models for 3D Molecule Generation + GitHub Repo
    Minkai Xu, Alexander Powers, Ron Dror, Stefano Ermon, and Jure Leskovec
    ICML 2023
      Stable-diffusion-style (latent) diffusion model for 3D point clouds and 2D graphs. Capable of unconditional and property-conditioned generation (split-train-condition).

  • Equivariant Diffusion for Molecule Generation in 3D + GitHub Repo
    Emiel Hoogeboom, Vı́ctor Garcia Satorras, Clément Vignac, and Max Welling
    in Proceedings of the 39th International Conference on Machine Learning, PMLR 162:8867-8887, 2022
      Non-autoregressive, E(3)-equivariant diffusion model (rotation-invariant likelihood). Reps: \(x = (x_1, \ldots, x_M) \in \mathbb{R}^{M \times 3}\) (atom position matrix) with corresponding feature vectors \(h = (h_1, \ldots, h_M) \in \mathbb{R}^{M \times d}\), where \(d\) is the number of atom features.
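A toy illustration of why this representation supports rotation-invariant modeling: rotating the position matrix \(x\) leaves all pairwise distances unchanged, which is what distance-based equivariant networks exploit. Coordinates and features below are made up:

```python
import math

# Toy version of the EDM state: x is an M x 3 position matrix, h a list
# of per-atom feature vectors. Rotating x leaves all pairwise distances
# unchanged, which is why distance-based (E(3)-equivariant) networks
# give rotation-invariant likelihoods.

x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # atom positions
h = [[1, 0], [1, 0], [0, 1]]                              # one-hot features

def rotate_z(p, theta):
    """Rotate a 3D point about the z axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]

def pairwise_dists(pts):
    """All unique pairwise Euclidean distances."""
    return [math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]

x_rot = [rotate_z(p, 0.7) for p in x]
orig, rot = pairwise_dists(x), pairwise_dists(x_rot)
assert all(abs(a - b) < 1e-9 for a, b in zip(orig, rot))
```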

Normalizing Flows

  • MolGrow: A Graph Normalizing Flow for Hierarchical Molecular Generation (No implementation available)
    Maksim Kuznetsov and Daniil Polykovskiy
    Proceedings of the AAAI Conference on Artificial Intelligence 2021, 35 (9), 8226-8234
      Hierarchical, autoregressive normalizing flow for molecular graphs. Builds graphs either BFS- or fragment-based (the latter performs better). The model is composed of “plug-and-play” modules. Trained on MOSES, QM9, and ZINC250k. Property-constrained optimization is based on a genetic algorithm.

  • FastFlows: Flow-Based Models for Molecular Graph Generation
    Nathan C. Frey, Vijay Gadepally, and Bharath Ramsundar
    ELLIS Machine Learning for Molecule Discovery Workshop 2021
      Framework for normalizing flows from SELFIES. Uses substructure filtering to speed up training and to work from small training sets. Built-in MPO functionality.
    TDS article

  • MoFlow: An Invertible Flow Model for Generating Molecular Graphs + GitHub Repo
    Chengxi Zang and Fei Wang
    in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020
      Non-autoregressive normalizing flow for molecular graphs; two-stage flow: a flow for bonds (based on OpenAI's Glow architecture), then a bond-conditioned flow for atoms. Similar to GraphNVP. Trained (NLL) on QM9 and ZINC250k. Introduces a new architecture. Excellent results.

  • GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation + GitHub repo
    Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang
    ICLR 2020
      How to explain this better than reviewer #1…

"This paper proposes a generative model architecture for molecular graph generation based on autoregressive flows. The main contribution of this paper is to combine existing techniques (auto-regressive BFS-ordered generation of graphs, normalizing flows, dequantization by Gaussian noise, fine-tuning based on reinforcement learning for molecular property optimization, and validity constrained sampling). Most of these techniques are well-established either for data generation with normalizing flows or for molecular graph generation and the novelty lies in the combination of these building blocks into a framework."

GANs

Other

  • Llamol: a dynamic multi-conditional generative transformer for de novo molecular design
    Niklas Dobberstein, Astrid Maass & Jan Hamaekers
    J. of Cheminf., 2024, 16, 73
      Transformer based on Llama 2, tweaked for molecular generation. Not the most impressive paper, but some interesting tidbits scattered throughout (e.g., SCL).

  • REINVENT4: Modern AI–driven generative molecule design + GitHub Repo
    Hannes H. Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H. Mervin & Ola Engkvist
    J. of Cheminf., 2024, 16, 20
      AstraZeneca’s molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.

  • Masked graph modeling for molecule generation + GitHub Repo
    Omar Mahmood, Elman Mansimov, Richard Bonneau, and Kyunghyun Cho
    Nat. Commun. 2021, 12, 3156
      MPNN for molecular graphs. Generation by iterative sampling of subsets of graph components; further generation steps are conditioned on the rest of the graph. Trained on QM9 and ChEMBL. The paper provides an analysis of GuacaMol benchmark metrics, particularly their independence. Conclusions:

    1. Validity, KL-divergence and Fréchet Distance scores correlate highly with each other
    2. These three metrics correlate negatively with the novelty score
    3. Uniqueness does not correlate strongly with any other metric
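The iterative masked-sampling loop described above can be sketched generically; the "predictor" below is a stand-in (majority vote over unmasked tokens), not the paper's MPNN:

```python
import random

# Toy sketch of masked-graph-style generation: repeatedly mask a random
# subset of components and re-predict them conditioned on the rest.
# The predictor here is a dummy stand-in, not a trained network.

random.seed(0)
tokens = ["C", "C", "O", "C", "N"]  # stand-in for graph components

def predict(context):
    """Dummy conditional predictor: most common unmasked token."""
    return max(set(context), key=context.count)

for _ in range(3):                                  # a few sampling rounds
    idx = random.sample(range(len(tokens)), k=2)    # mask a subset
    context = [t for i, t in enumerate(tokens) if i not in idx]
    for i in idx:                                   # resample masked slots
        tokens[i] = predict(context)

assert len(tokens) == 5  # graph size is preserved across rounds
```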

Reaction Informatics

These models predict mechanisms for chemical reactions, ideally similar to how we teach 2nd years to push arrows. There are relatively few examples of this task, but they fall into three major categories: electron flows, graph edits, and reaction networks. At inference these models are used for forward synthesis prediction, potentially including prediction of chemo-/regio-selectivity. Largely trained on pattern recognition from atom-mapped inputs (USPTO), though there are exceptions (e.g., the Baldi papers below).

Electron Flow Prediction

Sources and Sinks

The Baldi papers map electron sources and sinks, then combinatorially generate a probability distribution over electron flows. The described classifiers are used to filter source-sink pairs before evaluation. Trained on in-house (unavailable) data. The papers don’t provide source code, but ready-to-use programs are available on ChemDB.
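The combinatorial source-sink pipeline can be sketched as follows; the atom labels, scores, and threshold are invented for illustration and the classifier is a stub:

```python
from itertools import product

# Sketch of the Baldi-style pipeline: enumerate candidate electron
# source/sink pairs combinatorially, filter with a classifier, then
# normalize the surviving scores into a probability distribution over
# electron flows. All labels and numbers here are invented.

sources = ["N_lone_pair", "enol_C"]
sinks = ["carbonyl_C", "H_acid"]

def classifier_score(src, snk):
    """Stand-in for the trained filter; returns a plausibility score."""
    table = {("N_lone_pair", "carbonyl_C"): 0.8,
             ("enol_C", "carbonyl_C"): 0.5,
             ("N_lone_pair", "H_acid"): 0.3,
             ("enol_C", "H_acid"): 0.1}
    return table[(src, snk)]

# filter source-sink pairs before evaluation
pairs = [(s, k) for s, k in product(sources, sinks)
         if classifier_score(s, k) > 0.2]
total = sum(classifier_score(s, k) for s, k in pairs)
probs = {(s, k): classifier_score(s, k) / total for s, k in pairs}
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a proper distribution
```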

Reaction Network Graphs

Other

Atom Mapping

Computer-Aided Retrosynthesis Planning

Publication Parsing

ML Driven Drug Design

General

Property/Activity Prediction

Active Learning Methods

Synthetic Accessibility

Molecular Optimization

  • Projecting Molecules into Synthesizable Chemical Spaces
    Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, and Jianzhu Ma
    ArXiv Preprint, 2024
      Interesting new approach to making generated virtual hits more synthesizable (separating the wheat from the chaff). Describes a new postfix notation (A B +) for synthetic transformations and a transformer-based model that translates graphs into this notation. The model is capable of synthesis planning, generating similar but more synthesizable analogues, and exploring chemical space along the synthesizability dimension.
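The postfix idea uses the same stack discipline as arithmetic postfix notation: operands are building blocks, operators are reactions. A minimal reader (with an illustrative "+" join, not the paper's actual reaction vocabulary) looks like:

```python
# Stack-based reader for a postfix ("A B +") route encoding. Operands
# are building blocks; "+" is a binary "react" operator that joins the
# top two stack entries. The tokens and join syntax are illustrative.

def eval_postfix(tokens):
    stack = []
    for tok in tokens:
        if tok == "+":                     # binary "react" operator
            b, a = stack.pop(), stack.pop()
            stack.append(f"({a}+{b})")     # combine the two precursors
        else:
            stack.append(tok)              # building block
    assert len(stack) == 1, "malformed route"
    return stack[0]

print(eval_postfix(["A", "B", "+", "C", "+"]))  # → ((A+B)+C)
```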

  • Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space + GitHub Repo
    Xin Xia, Yiping Liu, Chunhou Zheng, Xingyi Zhang, Qingwen Wu, Xin Gao, Xiangxiang Zeng, and Yansen Su
    J. Chem. Inf. Model. 2024, ASAP
      Multiobjective molecule optimization framework (MOMO): a Pareto-based MPO tool that evolves molecules into better molecules. Genetic/evolutionary algorithm operating in a latent (implicit) space encoded by a VAE.
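The Pareto selection step in such a multiobjective optimizer can be sketched as non-dominated filtering (both objectives maximized here; the candidate scores are invented):

```python
# Minimal sketch of the Pareto selection step in a multiobjective
# optimizer like MOMO: keep only candidates not dominated on any
# objective (here, both objectives are maximized).

def dominates(a, b):
    """a dominates b if it is >= everywhere and > somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Candidates not dominated by any other candidate."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
print(pareto_front(candidates))  # (0.4, 0.4) is dominated by (0.5, 0.5)
```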

Virtual Screening

Cheminformatics

Protein Structure Prediction

  • Accurate structure prediction of biomolecular interactions with AlphaFold 3 - No code released
    Josh Abramson, Jonas Adler, Jack Dunger, … & John M. Jumper
    Nature 2024, 630, 493–500
      AlphaFold 3, a diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. Blog at Isomorphic

  • State-specific protein–ligand complex structure prediction with a multiscale deep generative model + GitHub Repo
    Zhuoran Qiao, Weili Nie, Arash Vahdat, Thomas F. Miller III & Animashree Anandkumar
    Nat. Mach. Intell. 2024 6, 195–208
      NeuralPLexer, a computational approach that can directly predict protein–ligand complex structures solely using protein sequence and ligand molecular graph inputs. Owing to its specificity in sampling both ligand-free-state and ligand-bound-state ensembles, NeuralPLexer consistently outperforms AlphaFold2 in terms of global protein structure accuracy on both representative structure pairs with large conformational changes and recently determined ligand-binding proteins.

  • DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking + GitHub Repo
    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola
    ICLR 2023
      Cool paper that treats docking as a generative task instead of a search/regression task. DiffDock is a diffusion model over the non-Euclidean manifold of ligand poses. Really interesting way of thinking of things.

  • Structure-based Drug Design with Equivariant Diffusion Models (DiffSBDD) + GitHub Repo
    Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, and Bruno Correia
    ArXiv Preprint 2022
      Diffusion model for SBDD. Serious issues with the results in this paper; see OpenReview.

  • Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure + GitHub Repo
    Ian Dunn and David Ryan Koes
    NeurIPS 2023
      GNN-based architecture for learning latent representations of molecular structure. Encodes the protein representation into a reduced set of key points. When trained end-to-end with a diffusion model (DiffSBDD) for de novo ligand design, achieves performance comparable to an all-atom protein representation while exhibiting a 3-fold reduction in inference time. Unclear whether the original issues with DiffSBDD were addressed in this implementation…

Deep Learning

Chemistry

Med Chem

My papers