To be a next-generation DL-based phenotype prediction from genome mutations.

Overview
Sequence -----------+--> 3D_structure --> 3D_module --+                                      +--> ?
|                   |                                 |                                      +--> ?
|                   |                                 +--> Joint_module --> Hierarchical_CLF +--> ?
|                   |                                 |                                      +--> ?
+-> NLP_embeddings -+-------> Embedding_module -------+                                      +--> ?

ClynMut: Predicting the Clynical Relevance of Genome Mutations (wip)

To be a next-generation DL-based phenotype prediction from genome mutations. Will use sota NLP and structural techniques.

Planned modules will likely be:

  • 3D learning module
  • NLP embeddings
  • Joint module + Hierarchical classification

The main idea is for the model to learn the prediction in an end-to-end fashion.

Install

$ pip install clynmut

Example Usage:

import torch
from clynmut import *

hier_graph = {"class": "all", 
              "children": [
                {"class": "effect_1", "children": [
                  {"class": "effect_12", "children": []},
                  {"class": "effect_13", "children": []}
                ]},
                {"class": "effect_2", "children": []},
                {"class": "effect_3", "children": []},
              ]}

model = MutPredict(
    seq_embedd_dim = 512,
    struct_embedd_dim = 256, 
    seq_reason_dim = 512, 
    struct_reason_dim = 256,
    hier_graph = hier_graph,
    dropout = 0.0,
    use_msa = False,
    device = None)

seqs = ["AFTQRWHDLKEIMNIDALTWER",
        "GHITSMNWILWVYGFLE"]

pred_dicts = model(seqs, pred_format="dict")

Important topics:

3D structure learning

There are a couple architectures that can be used here. I've been working on 2 of them, which are likely to be used here:

Hierarchical classification

  • A simple custom helper class has been developed for it.

Testing

$ python setup.py test

Datasets:

This package will use the awesome work by Jonathan King at this repository.

To install

$ pip install git+https://github.com/jonathanking/sidechainnet.git

Or

$ git clone https://github.com/jonathanking/sidechainnet.git
$ cd sidechainnet && pip install -e .

Citations:

@article{pejaver_urresti_lugo-martinez_pagel_lin_nam_mort_cooper_sebat_iakoucheva et al._2020,
    title={Inferring the molecular and phenotypic impact of amino acid variants with MutPred2},
    volume={11},
    DOI={10.1038/s41467-020-19669-x},
    number={1},
    journal={Nature Communications},
    author={Pejaver, Vikas and Urresti, Jorge and Lugo-Martinez, Jose and Pagel, Kymberleigh A. and Lin, Guan Ning and Nam, Hyun-Jun and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. et al.},
    year={2020}
@article{rehmat_farooq_kumar_ul hussain_naveed_2020, 
    title={Predicting the pathogenicity of protein coding mutations using Natural Language Processing},
    DOI={10.1109/embc44109.2020.9175781},
    journal={2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)},
    author={Rehmat, Naeem and Farooq, Hammad and Kumar, Sanjay and ul Hussain, Sibt and Naveed, Hammad},
    year={2020}
@article{pagel_antaki_lian_mort_cooper_sebat_iakoucheva_mooney_radivojac_2019,
    title={Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome},
    volume={15},
    DOI={10.1371/journal.pcbi.1007112},
    number={6},
    journal={PLOS Computational Biology},
    author={Pagel, Kymberleigh A. and Antaki, Danny and Lian, AoJie and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. and Mooney, Sean D. and Radivojac, Predrag},
    year={2019},
    pages={e1007112}
You might also like...
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow.  This is part of the CASL project: http://casl-project.ai/
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Python generation script for BitBirds

BitBirds generation script Intro This is published under MIT license, which means you can do whatever you want with it - entirely at your own risk. Pl

TTS is a library for advanced Text-to-Speech generation.
TTS is a library for advanced Text-to-Speech generation.

TTS is a library for advanced Text-to-Speech generation. It's built on the latest research, was designed to achieve the best trade-off among ease-of-training, speed and quality. TTS comes with pretrained models, tools for measuring dataset quality and already used in 20+ languages for products and research projects.

Comments
  • TO DO LIST

    TO DO LIST

    • [x] Add embeddings functionality
    • [ ] Add 3d structure module (likely-to-be GVP/... based)
    • [x] Add classifier
    • [x] Hierarchical classification helper based on differentiability
    • [x] End-to-end code
    • [ ] data collection
    • [ ] data formatting
    • [ ] Run featurization for all data points (esm1b + af2 structs)
    • [ ] Perform a sample training
    • [ ] Perform sample evaluation
    • [ ] Iterate - improve
    • [ ] ...
    • [ ] idk, will see as we go
    opened by hypnopump 0
Releases(0.0.2)
Owner
Eric Alcaide
For he today that sheds his blood with me; Shall be my brother
Eric Alcaide
A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode

Bloxflip Smart Bet A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode. https://bloxflip.com/crash. THIS

43 Jan 05, 2023
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering(ODQA)는 다양한 주제에 대한 문서 집합으로부터 자연어 질의에 대한 답변을 찾아오는 task입니다. 이때 사용자 질의에 답변하기 위해 주어지는 지문이 따로 존재하지 않습니다. 따라서 사전에 구축되어있는 Knowl

VUMBLEB 69 Nov 04, 2022
⚖️ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

About CappuccinoJs This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini! Este conversor criar

Arthur Ottoni Ribeiro 48 Nov 15, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
Simple NLP based project without any use of AI

Simple NLP based project without any use of AI

Shripad Rao 1 Apr 26, 2022
Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

Speech_38_ru_commands Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR Программа умеет распознавать 38 ключевы

Andrey 9 May 05, 2022
C.J. Hutto 3.8k Dec 30, 2022
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 02, 2023
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
This repo contains simple to use, pretrained/training-less models for speaker diarization.

PyDiar This repo contains simple to use, pretrained/training-less models for speaker diarization. Supported Models Binary Key Speaker Modeling Based o

12 Jan 20, 2022
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

fastNLP 342 Jan 05, 2023
Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

Sidharth Pal 20 Dec 14, 2022
Nested Named Entity Recognition

Nested Named Entity Recognition Training Dataset: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark url: https://tianchi.aliyun.

8 Dec 25, 2022
This Project is based on NLTK It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its antonyms, its synonyms

This Project is based on NLTK(Natural Language Toolkit) It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its

SaiVenkatDhulipudi 2 Nov 17, 2021
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021