BMS-Molecular-Translation

Introduction

This is a pipeline for the Bristol-Myers Squibb – Molecular Translation competition by Vadim Timakin and Maksim Zhdanov. We got bronze medals in this competition. A significant part of the code originated from Y.Nakama's notebook.

This competition was about image-to-text translation: converting images of molecular skeletal structures into InChI chemical identifiers, for example:

InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)
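To make the target format concrete, here is a minimal sketch that splits an InChI string like the one above into its slash-separated layers. The helper name `split_inchi_layers` is hypothetical and is not part of the pipeline. It assumes the second field is the chemical formula, which holds for typical organic molecules but not for every InChI.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers.

    Simplified illustration: if a one-letter prefix occurs more than once,
    the later layer overwrites the earlier one.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Assumption: the first layer after the version is the formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Remaining layers start with a one-letter prefix,
            # e.g. 'c' (connections) or 'h' (hydrogens).
            layers[part[0]] = part[1:]
    return layers


example = (
    "InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)"
    "6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)"
)
for name, value in split_inchi_layers(example).items():
    print(name, value)
```

The model has to produce every layer correctly, character by character, which is why small decoding errors are common in the long connection layer.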

Solution

General Encoder-Decoder concept

Most participants used a CNN encoder to extract image features and a decoder (LSTM/GRU/Transformer) to generate the text sequence. This is the standard approach to the image captioning problem.

Pseudo-labelling with InChI validation using RDKit

RDKit is an open-source toolkit for cheminformatics, and it was quite useful for this problem. Our first trained model scored around 7-8 on the public leaderboard, and we decided to apply pseudo-labelling to the test data. Ordinarily, pseudo-labelling adds a significant number of wrong predictions to the extended training set. With RDKit we validated all of our predicted formulas and selected around 800k correct samples. Keeping wrong labels out of the pseudo-labels improved the score.
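The filtering step described above can be sketched roughly as follows. This is an illustration under our own assumptions, not the exact code from the pipeline: the function names are hypothetical, and "valid" here means only that RDKit's `Chem.MolFromInchi` can parse the string into a molecule.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_is_valid(inchi: str) -> bool:
    """Return True if RDKit can parse the InChI string into a molecule."""
    # Imported lazily so the rest of the module works without RDKit installed.
    from rdkit import Chem

    try:
        return Chem.MolFromInchi(inchi) is not None
    except Exception:
        return False


def select_pseudo_labels(
    predictions: Iterable[Tuple[str, str]],
    is_valid: Callable[[str], bool] = rdkit_is_valid,
) -> List[Tuple[str, str]]:
    """Keep only (image_id, inchi) pairs whose prediction passes validation."""
    return [(image_id, inchi) for image_id, inchi in predictions if is_valid(inchi)]
```

Passing the validator as an argument keeps the selection logic easy to test with a stub. Note that a parseable InChI is not guaranteed to be the right label for its image; validation only removes predictions that cannot be a real structure.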

Predictions normalization

This notebook describes InChI normalization.

Blending

Finally, we blended ~20 predictions from 2 models (mostly from different epochs), using RDKit validation to keep only formulas that correspond to a possible InChI structure.
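One way such a blend could work is sketched below. The voting rule is our assumption, since the text only says that validation was used to choose among predictions: for each image, take the most frequent candidate among those that pass validation, and fall back to the most frequent candidate overall when none is valid. The name `blend_predictions` and the `is_valid` callable (for example an RDKit-based check) are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def blend_predictions(candidates: Sequence[str], is_valid: Callable[[str], bool]) -> str:
    """Pick one prediction for a single image from several model outputs."""
    if not candidates:
        raise ValueError("need at least one candidate prediction")
    counts = Counter(candidates)
    # Prefer candidates that pass validation; otherwise use all of them.
    valid = [c for c in counts if is_valid(c)]
    pool = valid if valid else list(counts)
    # Counter keeps first-seen order and max returns the first of any ties,
    # so earlier candidates win when vote counts are equal.
    return max(pool, key=lambda c: counts[c])
```

Because ties go to the earliest candidate, ordering the inputs from the strongest checkpoint to the weakest gives a simple priority rule.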

Final private LB score 1.79
