🥇 First-place solution of the Samsung AI Challenge 2021 🥇

Overview

MoT - Molecular Transformer

Large-scale Pretraining for Molecular Property Prediction

Samsung AI Challenge for Scientific Discovery

This repository is the official implementation of the model that won first place in the Samsung AI Challenge for Scientific Discovery and was introduced at SAIF 2021. The results of the challenge were announced in this video.

Introduction

MoT is a transformer-based model that predicts the properties of a molecule from its 3D structure. It was first introduced to calculate the excitation energy gap between the S1 and T1 states from the molecular structure.

Requirements

Before running this project, install the following libraries:

  • numpy
  • pandas
  • torch==1.9.0+cu111
  • tqdm
  • wandb
  • dataclasses
  • requests
  • omegaconf
  • pytorch_lightning==1.4.8
  • rdkit-pypi
  • scikit_learn

This project supports NVIDIA Apex. When Apex is installed, it is detected automatically and used to accelerate training. Apex reduces the training time by up to 50%.

setup.sh installs the required libraries and Apex in one step. Run the script as follows:

$ bash setup.sh

About Molecular Transformer

There are many approaches to predicting molecular properties. However, to calculate excitation energy gaps (e.g. between the S1 and T1 states), it is necessary to consider the entire 3D structure and the charges of the atoms in the compound. Many transformer-based molecular models instead use text formats such as SMILES or InChI. We also tried text-based methods in the competition, but the graph-based models performed better.

The important thing is to consider all connections between the atoms in the compound. However, the atoms are placed in a 3D coordinate system, and it is almost impossible to feed 3D positional information to the model directly (adding 3D positional embeddings performed worse than the baseline). So we designed a new attention method inspired by the disentangled attention in DeBERTa.

First, the atom types and their charges are embedded into vectors and summed. Note that positional embeddings are not added to the input, because the attention layers compute the attention scores from relative information. Thanks to the absence of positional embeddings, there is no limit on the number of atoms.
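The input step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: the table sizes, the hidden size, the random initialization, and the name `embed_atoms` are all hypothetical.

```python
import random

# Hypothetical sizes; the real vocabulary and hidden size are not stated here.
NUM_ATOM_TYPES, NUM_CHARGES, HIDDEN_DIM = 8, 5, 4

random.seed(0)
# One embedding vector per atom type and one per charge value.
atom_table = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
              for _ in range(NUM_ATOM_TYPES)]
charge_table = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
                for _ in range(NUM_CHARGES)]


def embed_atoms(atom_ids, charge_ids):
    """Return one vector per atom: atom-type embedding + charge embedding.

    No positional embedding is added, so the result does not depend on
    where an atom appears in the input list.
    """
    return [
        [a + c for a, c in zip(atom_table[atom_id], charge_table[charge_id])]
        for atom_id, charge_id in zip(atom_ids, charge_ids)
    ]


# Four atoms described by hypothetical type and charge indices.
hidden = embed_atoms(atom_ids=[1, 3, 3, 2], charge_ids=[2, 2, 1, 2])
```

Because nothing in `embed_atoms` depends on the position of an atom in the list, the same two tables serve molecules of any size.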

The hidden representations are then processed by the attention layers. Similar to the disentangled attention introduced in DeBERTa, our relative attention is computed not only between contents, but also between the relative information and the contents. The relative information includes the relative distances and the types of bonds between the atoms.
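One way to read the paragraph above is as a sum of three dot products for each pair of atoms. The sketch below follows the DeBERTa-style decomposition and is an assumption about the general shape of the score, not the exact formula used in this repository; scaling, softmax, and multiple heads are omitted, and all names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def relative_attention_score(q_content, k_content, q_relative, k_relative):
    """Unnormalized attention score for one pair of atoms (i, j).

    q_content:  query projection of atom i's hidden state
    k_content:  key projection of atom j's hidden state
    q_relative: query projection of the relative information R_ij
    k_relative: key projection of the relative information R_ij
    """
    content_to_content = dot(q_content, k_content)
    content_to_relative = dot(q_content, k_relative)
    relative_to_content = dot(q_relative, k_content)
    return content_to_content + content_to_relative + relative_to_content


score = relative_attention_score(
    q_content=[1.0, 2.0],
    k_content=[3.0, 4.0],
    q_relative=[0.5, 0.5],
    k_relative=[1.0, 1.0],
)
```

In this example the three terms are 11.0, 3.0, and 3.5, so the pair's score depends on the relative information as well as on the contents of the two atoms.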

The relative information R is calculated as above. The Euclidean distances are encoded through sinusoidal encoding with a modified period (100 instead of 10000). The bond type embeddings can be described as below:

Importantly, disconnections (i.e. pairs of atoms with no bond between them) should be embedded as index 0 rather than excluded from attention. Also, [CLS] tokens are separated from the normal bond-type embeddings in relative attention.
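The two ingredients of the relative information, encoded distances and bond-type indices, can be sketched in plain Python as follows. This is an illustration only: the function names, the encoding dimension, the sine-then-cosine layout, and every bond index other than 0 for "no bond" are assumptions made for the example.

```python
import math


def encode_distance(distance, dim=8, base=100.0):
    """Sinusoidal encoding of a scalar distance.

    Uses the usual transformer frequencies base**(-2k/dim), with base 100
    in place of 10000. The layout (all sines, then all cosines) is assumed.
    """
    half = dim // 2
    freqs = [base ** (-2.0 * k / dim) for k in range(half)]
    return ([math.sin(distance * f) for f in freqs]
            + [math.cos(distance * f) for f in freqs])


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D atom coordinates."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in coords] for p in coords]


# Hypothetical index scheme; only "0 = no bond" and "a separate [CLS]
# entry" come from the text above.
NO_BOND, SINGLE, DOUBLE, TRIPLE, AROMATIC, CLS = 0, 1, 2, 3, 4, 5


def bond_type_matrix(num_atoms, bonds):
    """Build a (num_atoms + 1) x (num_atoms + 1) matrix of bond indices.

    Row and column 0 belong to [CLS]. Unbonded atom pairs keep NO_BOND
    (index 0) instead of being masked out. `bonds` holds
    (i, j, bond_type) triples with 0-based atom indices.
    """
    size = num_atoms + 1
    matrix = [[NO_BOND] * size for _ in range(size)]
    for k in range(size):
        matrix[0][k] = CLS
        matrix[k][0] = CLS
    for i, j, bond_type in bonds:
        matrix[i + 1][j + 1] = bond_type
        matrix[j + 1][i + 1] = bond_type
    return matrix


coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (1.2, 1.1, 0.0)]
distances = distance_matrix(coords)
bond_types = bond_type_matrix(3, [(0, 1, DOUBLE), (1, 2, SINGLE)])
```

Here atoms 0 and 2 are not bonded, so their entry in `bond_types` stays at index 0 and the pair still takes part in attention through its distance encoding.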

With this architecture, the model successfully focuses on the relations between atoms. As with other transformer-based models, pretraining on a large-scale dataset achieves better performance, even with few finetuning samples. We pretrained our model on PubChem3D (50M) and PubChemQC (3M). For PubChem3D, the model was trained to predict conformer RMSD, MMFF94 energy, shape self-overlap, and feature self-overlap. For PubChemQC, the model was trained to predict the singlet excitation energies of the S1 to S10 states.

Reproduction

To reproduce our competition results or pretrain a new model, follow the steps below. A large disk and high-performance GPUs (e.g. A100s) are required.

Download PubChem3D and PubChemQC

First, download the PubChem3D and PubChemQC datasets. The following commands download the datasets and convert them to the dataset structure used by this project.

$ python utilities/download_pubchem.py
$ python utilities/download_pubchemqc.py

Although we used 50M PubChem3D compounds, you can use the full 100M samples if your network connection and client remain available throughout the download.

After downloading all datasets, create the index files, which store the seek position of each sample. Because the datasets are too large to load entirely into memory, the dataset class uses these index files to access samples randomly.

$ python utilities/create_dataset_index.py pubchem-compound-50m.csv
$ python utilities/create_dataset_index.py pubchemqc-excitations-3m.csv

Check that pubchem-compound-50m.index and pubchemqc-excitations-3m.index have been created.
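The idea behind the index files, storing a seek position per sample so that a single row can be read without loading the whole file, can be sketched as follows. This is a minimal illustration, not the format produced by utilities/create_dataset_index.py: the helper names and the tiny demo file are hypothetical.

```python
import os
import tempfile


def build_line_offsets(path):
    """Return the byte offset at which every line of the file starts."""
    offsets = []
    with open(path, "rb") as fp:
        while True:
            position = fp.tell()
            line = fp.readline()
            if not line:
                break
            offsets.append(position)
    return offsets


def read_line_at(path, offset):
    """Seek directly to a stored offset and read exactly one line."""
    with open(path, "rb") as fp:
        fp.seek(offset)
        return fp.readline().decode("utf-8").rstrip("\r\n")


# Tiny stand-in for a large CSV file (hypothetical columns).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"cid,structure\n1,AAA\n2,BBB\n")

offsets = build_line_offsets(tmp.name)
third_line = read_line_at(tmp.name, offsets[2])
os.remove(tmp.name)
```

Only the list of offsets has to stay in memory, which is why this approach scales to datasets with tens of millions of rows.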

Training and Finetuning

Now we are ready to train MoT. Using the datasets, we will pretrain a new model. Move the datasets to the pretrain directory and change the working directory to pretrain. Then run the commands below to pretrain on the PubChem3D and PubChemQC datasets, respectively. Note that PubChemQC pretraining uses the PubChem3D-pretrained model weights.

$ python src/train.py config/mot-base-pubchem.yaml
$ python src/train.py config/mot-base-pubchemqc.yaml

Check that mot-base-pubchem.pth and mot-base-pubchemqc.pth have been created. Next, move the final weights file (mot-base-pubchemqc.pth) to the finetune directory. Place the competition dataset samsung-ai-challenge-for-scientific-discovery in the same directory and start finetuning with the command below:

$ python src/train.py config/train/mot-base-pubchemqc.yaml  \
        data.fold_index=[fold index]                        \
        model.random_seed=[random seed]

We recommend training the model on 5 folds with various random seeds. It is well known that the random seed is critical to transformer finetuning, so you can tune the random seed to achieve better results.

After finetuning the models, use the following command to predict the energy gaps for the test dataset.
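To run the fold and seed combinations systematically, the finetuning command can be generated for each pair. The sketch below only builds the command strings; the zero-based fold indices and the seed values are assumptions, and `finetune_commands` is a hypothetical helper that is not part of this repository.

```python
import itertools


def finetune_commands(num_folds=5, seeds=(0, 1, 2)):
    """Build one finetuning command per (fold, seed) combination.

    Assumes fold indices start at 0; check the finetuning configuration
    for the convention actually used.
    """
    template = (
        "python src/train.py config/train/mot-base-pubchemqc.yaml "
        "data.fold_index={fold} model.random_seed={seed}"
    )
    return [
        template.format(fold=fold, seed=seed)
        for fold, seed in itertools.product(range(num_folds), seeds)
    ]


commands = finetune_commands()
for command in commands:
    print(command)
```

Each printed line can be run from the finetune directory, one after another or distributed across several GPUs.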

$ python src/predict.py config/predict/mot-base-pubchemqc.yaml \
        model.pretrained_model_path=[finetuned model path]

The prediction file is saved under the same name as the model. You can submit the predictions of a single model or average several to get an ensembled result.

$ python utilities/simple_ensemble.py finetune/*.csv [output file name]
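The averaging step can be sketched with the standard library alone. This is not the code of utilities/simple_ensemble.py: the column names `uid` and `prediction` and the in-memory files are hypothetical and stand in for the real prediction files.

```python
import csv
import io


def average_predictions(files, id_column="uid", value_column="prediction"):
    """Average the predicted value of every sample across several CSV files."""
    totals, counts = {}, {}
    for fp in files:
        for row in csv.DictReader(fp):
            key = row[id_column]
            totals[key] = totals.get(key, 0.0) + float(row[value_column])
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Two small in-memory "prediction files" from different models.
model_a = io.StringIO("uid,prediction\nmol_0,0.50\nmol_1,1.00\n")
model_b = io.StringIO("uid,prediction\nmol_0,0.70\nmol_1,0.80\n")
ensembled = average_predictions([model_a, model_b])
```

Keying by sample id rather than by row position keeps the average correct even if the files list the samples in different orders.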

Finetune with custom dataset

To finetune with a custom dataset, all you need to do is rewrite the configuration file. Note that the finetune directory is intended only for the competition dataset, so its training code is tied to the competition data structure. Instead, finetune the model with your custom dataset in the pretrain directory. Let's look at the configuration file for the PubChemQC dataset, located at pretrain/config/mot-base-pubchemqc.yaml.

data:
  dataset_file:
    label: pubchemqc-excitations-3m.csv
    index: pubchemqc-excitations-3m.index
  input_column: structure
  label_columns: [s1_energy, s2_energy, s3_energy, s4_energy, s5_energy, s6_energy, s7_energy, s8_energy, s9_energy, s10_energy]
  labels_mean_std:
    s1_energy: [4.56093558, 0.8947327]
    s2_energy: [4.94014921, 0.8289951]
    s3_energy: [5.19785427, 0.78805644]
    s4_energy: [5.39875606, 0.75659831]
    s5_energy: [5.5709758, 0.73529373]
    s6_energy: [5.71340364, 0.71889017]
    s7_energy: [5.83764871, 0.70644563]
    s8_energy: [5.94665475, 0.6976438]
    s9_energy: [6.04571037, 0.69118142]
    s10_energy: [6.13691953, 0.68664366]
  max_length: 128
  bond_drop_prob: 0.1
  validation_ratio: 0.05
  dataloader_workers: -1

model:
  pretrained_model_path: mot-base-pubchem.pth
  config: ...

The data.dataset_file field in the configuration file can be changed to the desired finetuning dataset and its index file. Do not forget to create the index file with utilities/create_dataset_index.py. data.input_column specifies the column that contains the encoded 3D structures, and data.label_columns indicates which columns are predicted. The values are normalized with data.labels_mean_std. Simply copy this file and rename it for your own dataset, then change the name and statistics of each label. Here is an example for predicting toxicity values:

data:
  dataset_file:
    label: toxicity.csv
    index: toxicity.index
  input_column: structure
  label_columns: [toxicity]
  labels_mean_std:
    toxicity: [0.92, 1.85]
  max_length: 128
  bond_drop_prob: 0.0
  validation_ratio: 0.1
  dataloader_workers: -1

model:
  pretrained_model_path: mot-base-pubchemqc.pth
  config:
    num_layers: 12
    hidden_dim: 768
    intermediate_dim: 3072
    num_attention_heads: 12
    hidden_dropout_prob: 0.1
    attention_dropout_prob: 0.1
    position_scale: 100.0
    initialize_range: 0.02

train:
  name: mot-base-toxicity
  optimizer:
    lr: 1e-4
    betas: [0.9, 0.999]
    eps: 1e-6
    weight_decay: 0.01
  training_steps: 100000
  warmup_steps: 10000
  batch_size: 256
  accumulate_grads: 1
  max_grad_norm: 1.0
  validation_interval: 1.0
  precision: 16
  gpus: 1
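The `[mean, std]` pairs under data.labels_mean_std have to be computed from your own labels. A minimal sketch of how such statistics could be obtained and applied is shown below; whether the original statistics use the population or the sample standard deviation is not stated, so the population form used here is an assumption, and the helper names and sample values are hypothetical.

```python
import statistics


def label_mean_std(values):
    """Return [mean, std] for one label column (population std assumed)."""
    return [statistics.mean(values), statistics.pstdev(values)]


def normalize(value, mean, std):
    """Map a raw label to the normalized scale used as a training target."""
    return (value - mean) / std


def denormalize(value, mean, std):
    """Map a normalized prediction back to the original label scale."""
    return value * std + mean


# Hypothetical toxicity labels from a custom dataset.
toxicity = [1.0, 2.0, 3.0, 4.0]
mean, std = label_mean_std(toxicity)
restored = denormalize(normalize(3.0, mean, std), mean, std)
```

The resulting pair would go under the matching label name in data.labels_mean_std, in the same `[mean, std]` order as in the configuration above.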

Results on Competition Dataset

| Model | PubChem | PubChemQC | Competition LB (Public/Private) |
|---|---|---|---|
| ELECTRA | 0.0493 | − | 0.1508/− |
| BERT Regression | 0.0074 | 0.0497 | 0.1227/− |
| MoT-Base (w/o PubChem) | − | 0.0188 | 0.0877/− |
| MoT-Base (PubChemQC 150k) | 0.0086 | 0.0151 | 0.0666/− |
| + PubChemQC 300k | " | 0.0917 | 0.0526/− |
| + 5Fold CV | " | " | 0.0507/− |
| + Ensemble | " | " | 0.0503/− |
| + Increase Maximum Atoms | " | " | 0.0497/0.04931 |

Description: Comparison of various models. ELECTRA and BERT Regression are SMILES-based models trained on PubChem-100M (and on PubChemQC-3M for BERT Regression only). ELECTRA is trained to distinguish fake SMILES tokens (i.e., the ELECTRA approach), while BERT Regression is trained to predict the labels directly, without unsupervised learning. PubChemQC 150k and 300k denote that the model is trained for 150k and 300k steps in the PubChemQC stage.

Utilities

This repository provides some useful utility scripts.

  • create_dataset_index.py: As mentioned above, creates the seek positions of the samples in a dataset for random access.
  • download_pubchem.py and download_pubchemqc.py: Download the PubChem3D and PubChemQC datasets.
  • find_test_compound_cids.py: Finds the CIDs of the compounds in the test dataset so that they can be excluded from training. Training on them may cause data leakage.
  • simple_ensemble.py: Performs a simple ensemble by averaging the predictions from various models.

License

This repository is released under the Apache License 2.0. The license can be found in the LICENSE file.
