Reaction SMILES-AA mapping via language modelling

Last update: Dec 13, 2022

rxn-aa-mapper

Reactions SMILES-AA sequence mapping

setup

conda env create -f conda.yml
conda activate rxn_aa_mapper

The following examples, based on the sample files provided in the repository, show how to use RXNAAMapper.

generate a vocabulary to be used with the `EnzymaticReactionBertTokenizer`

Create a vocabulary compatible with the enzymatic reaction tokenizer:

create-enzymatic-reaction-vocabulary ./examples/data-samples/biochemical ./examples/token_75K_min_600_max_750_500K.json /tmp/vocabulary.txt "*.csv"

use the tokenizer

Using the example vocabulary and AA tokenizer provided, we can observe the enzymatic reaction tokenizer in action:

from rxn_aa_mapper.tokenization import EnzymaticReactionBertTokenizer

tokenizer = EnzymaticReactionBertTokenizer(
    vocabulary_file="./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    aa_sequence_tokenizer_filepath="./examples/token_75K_min_600_max_750_500K.json"
)
tokenizer.tokenize("NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]")
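The string passed to the tokenizer above appears to follow the pattern `reactants|AA sequence>>products`, with individual molecules separated by `.`. The following is a minimal sketch of splitting such a string into its parts, for inspection or sanity checks before tokenization. It is not part of rxn_aa_mapper: the helper name `split_enzymatic_reaction` and the `EnzymaticReaction` tuple are hypothetical, and the format is inferred from the example only.

```python
from typing import List, NamedTuple


class EnzymaticReaction(NamedTuple):
    """Hypothetical container for the three parts of an input string."""
    reactants: List[str]
    aa_sequence: str
    products: List[str]


def split_enzymatic_reaction(rxn: str) -> EnzymaticReaction:
    """Split 'reactants|AA_SEQUENCE>>products' into its components.

    Assumes the format seen in the README example: one '>>' between
    precursors and products, and one '|' between reactants and enzyme.
    """
    if rxn.count(">>") != 1:
        raise ValueError("expected exactly one '>>' separator")
    precursors, products = rxn.split(">>")
    if precursors.count("|") != 1:
        raise ValueError("expected exactly one '|' before '>>'")
    reactants, aa_sequence = precursors.split("|")
    return EnzymaticReaction(
        reactants=reactants.split("."),
        aa_sequence=aa_sequence,
        products=products.split("."),
    )


# Shortened version of the README example (one reactant, AA sequence truncated
# for readability); it is only meant to illustrate the layout of the string.
example = "O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDG>>O=C([O-])CCC(=O)C(=O)[O-]"
parsed = split_enzymatic_reaction(example)
print(parsed.reactants, parsed.aa_sequence, parsed.products)
```

Raising on a malformed string, rather than guessing, keeps badly formatted rows from silently reaching the tokenizer.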

train the model

The mlm-trainer script can be used to train a model via MTL:

# Sample invocation: the same folder is reused for training and validation.
# In practice, split the data into a train and a validation folder.
# For a more realistic config see ./examples/config.json
mlm-trainer \
    ./examples/data-samples/biochemical ./examples/data-samples/biochemical \
    ./examples/vocabulary_token_75K_min_600_max_750_500K.txt /tmp/mlm-trainer-log \
    ./examples/sample-config.json "*.csv" 1 \
    ./examples/data-samples/organic ./examples/data-samples/organic \
    ./examples/token_75K_min_600_max_750_500K.json
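The sample command reuses one folder for both training and validation. One possible way to prepare separate folders is sketched below: it copies whole CSV files into a train and a validation directory. This is an illustration under assumptions, not the project's own tooling: the helper `split_csv_files` is hypothetical, and splitting at file level (rather than row level) may or may not suit a given dataset.

```python
import random
import shutil
import tempfile
from pathlib import Path
from typing import List, Tuple


def split_csv_files(
    source_dir: str,
    train_dir: str,
    valid_dir: str,
    valid_fraction: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Path], List[Path]]:
    """Copy the CSV files of source_dir into a train and a validation folder."""
    files = sorted(Path(source_dir).glob("*.csv"))
    if len(files) < 2:
        raise ValueError("need at least two CSV files to split")
    # Seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(files)
    # Keep at least one file on each side of the split.
    n_valid = min(max(1, int(len(files) * valid_fraction)), len(files) - 1)
    valid_files, train_files = files[:n_valid], files[n_valid:]
    for target, group in ((train_dir, train_files), (valid_dir, valid_files)):
        Path(target).mkdir(parents=True, exist_ok=True)
        for csv_file in group:
            shutil.copy2(csv_file, Path(target) / csv_file.name)
    return train_files, valid_files


# Demo on throwaway placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "all"
    src.mkdir()
    for i in range(10):
        (src / f"part_{i}.csv").write_text("placeholder\n")
    train, valid = split_csv_files(
        str(src), f"{tmp}/train", f"{tmp}/valid", valid_fraction=0.2
    )
    n_train_copied = len(list((Path(tmp) / "train").glob("*.csv")))
    n_valid_copied = len(list((Path(tmp) / "valid").glob("*.csv")))
```

The two resulting folders would then take the place of the repeated folder arguments in the command.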

Checkpoints are stored in /tmp/mlm-trainer-log for later use in identifying active sites.

The checkpoints can be turned into a Hugging Face model by running:

checkpoint-to-hf-model /path/to/model.ckpt /tmp/rxnaamapper-pretrained-model ./examples/vocabulary_token_75K_min_600_max_750_500K.txt ./examples/sample-config.json ./examples/token_75K_min_600_max_750_500K.json
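The `/path/to/model.ckpt` argument is a placeholder for a checkpoint written during training. A small sketch for locating the most recently modified checkpoint under the log directory follows. The helper `latest_checkpoint` is hypothetical, and it assumes checkpoints carry the `.ckpt` extension, as the placeholder suggests; the actual directory layout produced by the trainer is not described here.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(log_dir: str) -> Optional[Path]:
    """Return the most recently modified *.ckpt file under log_dir, if any."""
    checkpoints = list(Path(log_dir).rglob("*.ckpt"))
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda path: path.stat().st_mtime)


# Demo with two empty placeholder files whose modification times are set by hand.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    for name, mtime in (("older.ckpt", 1_000_000), ("newer.ckpt", 2_000_000)):
        path = run_dir / name
        path.touch()
        os.utime(path, (mtime, mtime))
    found = latest_checkpoint(tmp)
    found_name = found.name if found else None

    empty_dir = Path(tmp) / "empty"
    empty_dir.mkdir()
    empty_result = latest_checkpoint(str(empty_dir))
```

Selecting by modification time is only one choice; picking the checkpoint with the best validation metric would need information from the training logs.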

predict active site

The trained model can be used to map reactant atoms to AA sequence locations that potentially represent the active site.

from rxn_aa_mapper.aa_mapper import RXNAAMapper

config_mapper = {
    "vocabulary_file": "./examples/vocabulary_token_75K_min_600_max_750_500K.txt",
    "aa_sequence_tokenizer_filepath": "./examples/token_75K_min_600_max_750_500K.json",
    "model_path": "/tmp/rxnaamapper-pretrained-model",
    "head": 3,
    "layers": [11],
    "top_k": 1,
}
mapper = RXNAAMapper(config=config_mapper)
mapper.get_reactant_aa_sequence_attention_guided_maps(["NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.O=C([O-])CC(C(=O)[O-])C(O)C(=O)[O-]|AGGVKTVTLIPGDGIGPEISAAVMKIFDAAKAPIQANVRPCVSIEGYKFNEMYLDTVCLNIETACFATIKCSDFTEEICREVAENCKDIK>>O=C([O-])CCC(=O)C(=O)[O-]"])
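The mapper config refers to three locations on disk: the vocabulary file, the AA sequence tokenizer file, and the converted model directory. A quick pre-flight check, sketched below, can report which of them are missing before the mapper is constructed. The helper `missing_config_paths` is hypothetical and not part of rxn_aa_mapper; only the config keys are taken from the example above.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Config keys from the example above whose values are expected to be paths.
PATH_KEYS = ("vocabulary_file", "aa_sequence_tokenizer_filepath", "model_path")


def missing_config_paths(config: Dict[str, object]) -> List[str]:
    """Return the path-valued keys that are absent or point to nothing on disk."""
    missing = []
    for key in PATH_KEYS:
        value = config.get(key)
        if value is None or not Path(str(value)).exists():
            missing.append(key)
    return missing


# Demo: only the vocabulary file exists (with placeholder content).
with tempfile.TemporaryDirectory() as tmp:
    vocab = Path(tmp) / "vocabulary.txt"
    vocab.write_text("placeholder\n")
    demo_config = {
        "vocabulary_file": str(vocab),
        "aa_sequence_tokenizer_filepath": str(Path(tmp) / "absent.json"),
        "model_path": str(Path(tmp) / "absent-model"),
        "head": 3,
        "layers": [11],
        "top_k": 1,
    }
    missing = missing_config_paths(demo_config)
    print("missing paths:", missing)
```

An empty list means all three paths exist; it says nothing about whether their contents are valid.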

citation

@article{dassi2021identification,
  title={Identification of Enzymatic Active Sites with Unsupervised Language Modeling},
  author={Dassi, Lo{\"\i}c Kwate and Manica, Matteo and Probst, Daniel and Schwaller, Philippe and Teukam, Yves Gaetan Nana and Laino, Teodoro},
  year={2021},
  conference={AI for Science: Mind the Gaps at NeurIPS 2021, ELLIS Machine Learning for Molecule Discovery Workshop 2021}
}
