Paper Library

Molecular Generation

Generative models for molecules. Inputs are most typically text-based (SMILES/SELFIES) or graph representations (parallel models over atom and bond matrices). Most have some property-optimization ability (latent-space search/interpolation, reinforcement learning, guided genetic exploration). These methods have most commonly been autoregressive, but non-autoregressive molecular generation methods have recently started to appear.
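
As a quick illustration of the text-based representations mentioned above, a minimal sketch (assuming the selfies and rdkit packages; not tied to any particular paper):

    import selfies as sf
    from rdkit import Chem

    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
    tokens = sf.encoder(smiles)       # SMILES -> SELFIES
    roundtrip = sf.decoder(tokens)    # SELFIES -> SMILES
    # Any syntactically valid SELFIES string decodes to a valid molecule,
    # which is why many generative models prefer it over raw SMILES.
    assert Chem.MolFromSmiles(roundtrip) is not None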

Reviews

Diffusion Models

Normalizing Flows

  • MolGrow: A Graph Normalizing Flow for Hierarchical Molecular Generation (No implementation available)
    Maksim Kuznetsov and Daniil Polykovskiy
    Proceedings of the AAAI Conference on Artificial Intelligence 2021, 35 (9), 8226-8234
      Hierarchical, autoregressive normalizing flow for molecular graphs. Builds graphs either BFS- or fragment-based (the latter works better). The model is composed of “plug-and-play” modules. Trained on MOSES, QM9, and ZINC250k. Property-constrained optimization is based on a genetic algorithm.

  • FastFlows: Flow-Based Models for Molecular Graph Generation
    Nathan C. Frey, Vijay Gadepally, and Bharath Ramsundar
    ELLIS Machine Learning for Molecule Discovery Workshop 2021
      Framework for normalizing flows from SELFIES. Uses substructure filtering to speed up training and to work from small training sets (see the sketch below). Built-in MPO functionality.
    TDS article
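
      A rough sketch of the substructure-filtering idea (the SMARTS patterns here are illustrative placeholders, not the paper's actual filter set; assumes rdkit):

        from rdkit import Chem

        # Hypothetical "unwanted substructure" patterns
        unwanted = [Chem.MolFromSmarts(p) for p in ("[N+](=O)[O-]", "C(=O)Cl")]

        def passes_filters(smiles: str) -> bool:
            mol = Chem.MolFromSmiles(smiles)
            return mol is not None and not any(
                mol.HasSubstructMatch(q) for q in unwanted)

        print([s for s in ("CCO", "CC(=O)Cl") if passes_filters(s)])  # ['CCO']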

  • MoFlow: An Invertible Flow Model for Generating Molecular Graphs + GitHub Repo
    Chengxi Zang and Fei Wang
    Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020
      Non-autoregressive normalizing flow for molecular graphs; two-stage flow (a Glow-based flow for bonds, then a bond-conditioned flow for atoms; see the coupling-layer sketch below). Similar to GraphNVP. Trained (NLL) on QM9 and ZINC250k. New architecture with excellent results.
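
      A toy affine coupling layer, the invertible building block behind Glow-style flows such as MoFlow's bond flow (a PyTorch sketch; the real layers act on bond/atom tensors with graph-specific masking):

        import torch
        import torch.nn as nn

        class AffineCoupling(nn.Module):
            def __init__(self, dim: int, hidden: int = 64):
                super().__init__()
                self.half = dim // 2
                self.net = nn.Sequential(
                    nn.Linear(self.half, hidden), nn.ReLU(),
                    nn.Linear(hidden, 2 * (dim - self.half)))

            def forward(self, x):  # x -> z plus exact log|det J|
                x1, x2 = x[:, :self.half], x[:, self.half:]
                log_s, t = self.net(x1).chunk(2, dim=1)
                log_s = torch.tanh(log_s)  # keep scales bounded
                return torch.cat([x1, x2 * log_s.exp() + t], 1), log_s.sum(1)

            def inverse(self, z):  # exact inversion, no iterative solver
                z1, z2 = z[:, :self.half], z[:, self.half:]
                log_s, t = self.net(z1).chunk(2, dim=1)
                log_s = torch.tanh(log_s)
                return torch.cat([z1, (z2 - t) * (-log_s).exp()], 1)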

  • GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation + GitHub repo
    Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang
    ICLR 2020
      How to explain this better than reviewer #1…

"This paper proposes a generative model architecture for molecular graph generation based on autoregressive flows. The main contribution of this paper is to combine existing techniques (auto-regressive BFS-ordered generation of graphs, normalizing flows, dequantization by Gaussian noise, fine-tuning based on reinforcement learning for molecular property optimization, and validity constrained sampling). Most of these techniques are well-established either for data generation with normalizing flows or for molecular graph generation and the novelty lies in the combination of these building blocks into a framework."

GANs

Other

Reaction Informatics

These models predict mechanisms for chemical reactions, ideally similar to how we teach second-years to push arrows. There are relatively few examples of this task, but they fall into three major categories: electron flows, graph edits, and reaction networks. At inference these models are used for forward synthesis prediction, and potentially for predicting chemo-/regioselectivity. Largely trained on pattern recognition from atom-mapped inputs (USPTO), though there are exceptions (e.g., the Baldi papers below).
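
For reference, atom-mapped reaction SMILES (the USPTO-style inputs mentioned above) carry the mapping as per-atom integers; a minimal sketch with rdkit (the toy reaction string is illustrative):

    from rdkit import Chem

    # Toy atom-mapped reaction SMILES (oxygen exchange between acid and water)
    rxn = "[CH3:1][C:2](=[O:3])[OH:4].[OH2:5]>>[CH3:1][C:2](=[O:3])[OH:5].[OH2:4]"
    reactants, products = rxn.split(">>")
    for side in (reactants, products):
        for smi in side.split("."):
            mol = Chem.MolFromSmiles(smi)
            print([(a.GetSymbol(), a.GetAtomMapNum()) for a in mol.GetAtoms()])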

Electron Flow Prediction

Sources and Sinks

The Baldi papers map electron sources and sinks, then combinatorially generate a probability distribution over electron flows. The described classifiers are used to filter source-sink pairs before evaluation (see the sketch below). Trained on in-house (unavailable) data. The papers don’t include source code, but ready-to-use programs are available on ChemDB.
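
The overall pipeline shape is easy to sketch (a generic, hypothetical rendering; source_score and sink_score stand in for the papers' trained classifiers):

    from itertools import product

    def rank_electron_flows(atoms, source_score, sink_score, top_k=5):
        # 1) Filter candidate sources/sinks with the trained classifiers.
        sources = [a for a in atoms if source_score(a) > 0.5]
        sinks = [a for a in atoms if sink_score(a) > 0.5]
        # 2) Combinatorially pair them and rank by joint score, yielding a
        #    ranked distribution over possible electron flows.
        pairs = [(src, snk, source_score(src) * sink_score(snk))
                 for src, snk in product(sources, sinks) if src != snk]
        return sorted(pairs, key=lambda p: -p[2])[:top_k]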

Reaction Network Graphs

Other

Atom Mapping

Computer-Aided Retrosynthesis Planning

Publication Parsing

ML Driven Drug Design

Property/Activity Prediction

Active Learning Methods

Synthetic Accessibility

Molecular Optimization

  • Projecting Molecules into Synthesizable Chemical Spaces
    Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, and Jianzhu Ma
    ArXiv Preprint, 2024
      Interesting new approach to making generated virtual hits more synthesizable; cleaning-the-chaff energy. Describes a new postfix notation (A B +) for synthetic transformations (see the sketch below). Transformer-based model that translates graphs to postfix notation. The model is capable of synthesis planning, generating similar but more synthesizable analogues, and exploring chemical space along the synthesizability dimension.
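
      The postfix (reverse Polish) encoding is the classic stack trick; a toy evaluator for building-block/reaction sequences (the token names are hypothetical, not the paper's vocabulary):

        def eval_postfix(tokens, run_reaction):
            # Postfix like "A B +": push building blocks, and on a reaction
            # token pop two operands and push the product back.
            stack = []
            for tok in tokens:
                if tok.startswith("rxn:"):
                    b, a = stack.pop(), stack.pop()
                    stack.append(run_reaction(tok, a, b))
                else:
                    stack.append(tok)  # a building block
            (result,) = stack          # well-formed programs leave one item
            return result

        # e.g. eval_postfix(["acid", "amine", "rxn:amide"], my_reaction_fn)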

  • Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space + GitHub Repo
    Xin Xia, Yiping Liu, Chunhou Zheng, Xingyi Zhang, Qingwen Wu, Xin Gao, Xiangxiang Zeng, and Yansen Su
    J. Chem. Inf. Model. 2024, 64 (13), 5161
      The multiobjective molecule optimization framework (MOMO) is a Pareto-based MPO tool that evolves molecules into better molecules: a genetic/evolutionary algorithm in the latent (implicit) space encoded by a VAE (see the sketch below).
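
      The Pareto-based selection at MOMO's core reduces to a dominance check (a minimal sketch assuming all objectives are to be maximized):

        def dominates(a, b):
            # a, b: tuples of objective scores (higher is better).
            return (all(x >= y for x, y in zip(a, b))
                    and any(x > y for x, y in zip(a, b)))

        def pareto_front(population, scores):
            # Keep molecules whose score vector no other molecule dominates.
            return [mol for mol, s in zip(population, scores)
                    if not any(dominates(t, s) for t in scores if t != s)]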

Virtual Screening

Cheminformatics

Reviews

General

Δ-machine learning

Protein Structure Prediction

  • Accurate structure prediction of biomolecular interactions with AlphaFold 3 - No code released
    Josh Abramson, Jonas Adler, Jack Dunger, … & John M. Jumper
    Nature 2024, 630, 493
      AlphaFold 3, a diffusion-based architecture capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues. Blog at Isomorphic

  • State-specific protein-ligand complex structure prediction with a multiscale deep generative model + GitHub Repo
    Zhuoran Qiao, Weili Nie, Arash Vahdat, Thomas F. Miller III & Animashree Anandkumar
    Nat. Mach. Intell. 2024, 6, 195
      NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures solely using protein sequence and ligand molecular graph inputs. Owing to its specificity in sampling both ligand-free-state and ligand-bound-state ensembles, NeuralPLexer consistently outperforms AlphaFold2 in terms of global protein structure accuracy on both representative structure pairs with large conformational changes and recently determined ligand-binding proteins.

  • DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking + GitHub Repo
    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola
    ICLR 2023
      Cool paper that treats docking as a generative task instead of a search/regression task. DiffDock is a diffusion model over the non-Euclidean manifold of ligand poses. Really interesting way of thinking about the problem.

  • Structure-based Drug Design with Equivariant Diffusion Models (DiffSBDD) + GitHub Repo
    Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, and Bruno Correia
    ArXiv Preprint 2022
      Diffusion model for SBDD. There are serious issues with the results in this paper; see OpenReview.

  • Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure + GitHub Repo
    Ian Dunn and David Ryan Koes
    NeurIPS 2023
      GNN-based architecture for learning latent representations of molecular structure. Encodes the protein representation into a reduced set of key points. When trained end-to-end with a diffusion model (DiffSBDD) for de novo ligand design, it achieves performance comparable to an all-atom protein representation while exhibiting a 3-fold reduction in inference time. Unclear whether the original issues with DiffSBDD were addressed in this implementation…

Deep Learning

Contrastive Learning

Chemistry

Med Chem

My papers