Structural basis for solubility in protein expression systems

Overview

Structural basis for solubility in protein expression systems

Twitter Follow GitHub repo size

Large-scale protein production for biotechnology and biopharmaceutical applications rely on high protein solubility in expression systems. Solubility has been measured for a significant fraction of E. coli and S. cerevisiae proteomes and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure-determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

It is the objective of this project to use our provided dataset of protein structure and solubility value pairs in order to produce a solubility predictor with comparable accuracy to sequence-based predictors reported in the literature. The provided dataset to be used in this project is created by following the dataset curation procedure described in the SOLart paper, and this hackathon project has a similar aim to this manuscript.

The dataset

The process of generating the dataset is described in the SOLArt manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or high sequence similarity homology models. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for the training and testing of these models were prepared and disclosed in the supplemental material of this paper as a list of (Uniprot,PDB,Chain,Solubility) pairs. The PDB files were not included in this work so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was present, it was used, assuming high coverage over the Uniprot sequence. In some cases, the original PDB templates used within the original SOLArt paper had been superceded by improved templates, and we opted to take the highest resolution, highest sequence identity, models in our updated dataset. We stripped away all irrelevant chains and heteroatoms.

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

whereby the protein column contains the Uniprot ID (corresponds to the filename of the PDB files) and the solubility column contains the predicted solubility value (can be int or float).

Note, that there are three (!) test subsets but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml will automatically build the Dockerfile on every commit to the main branch. This docker image will be published as your hackathon submission to https://biolib.com//. For this to work, make sure you set the BIOLIB_TOKEN and BIOLIB_PROJECT_URI accordingly as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star: GitHub Repo stars

Star the ProteinQure org on Github: GitHub Org's stars

Owner
ProteinQure
ProteinQure
一个IDA脚本,可以检测出哈希算法(无论是否魔改常数)并生成frida hook 代码。

findhash 在哈希算法上,比Findcrypt更好的检测工具,同时生成Frida hook代码。 使用方法 把findhash.xml和findhash.py扔到ida plugins目录下 ida -edit-plugin-findhash 试图解决的问题 哈希函数的初始化魔数被修改 想快速

266 Dec 29, 2022
This repo holds custom callback plugin, so your Ansible could write everything in the PostgreSQL database.

English What is it? This is callback plugin that dumps most of the Ansible internal state to the external PostgreSQL database. What is this for? If yo

Sergey Pechenko 19 Oct 21, 2022
Jogo em redes similar ao clássico pedra papel e tesoura

Batalha Tática Tecnologias de Redes de Computadores-A-N-JOGOS DIGITAIS Professor Fabio Henrique Cabrini Alunos: Eric Henrique de Oliveira Silva - RA 1

Eric Henrique de Oliveira Silva 1 Dec 01, 2021
This app is to use algorithms to find the root of the equation

In this repository, I made an amazing app with tkinter python language and other libraries the idea of this app is to use algorithms to find the root of the equation I used three methods from numeric

Mohammad Al Jadallah 3 Sep 16, 2022
This is where I learn machine learning

This is where I learn machine learning🤷‍ This means that this repo covers no specific topic of machine learning or a project - I work in here when I want to learn/try something

Wilhelm Berghammer 47 Nov 16, 2022
Adjust the white point, gamma or make your XDR display darker without losing HDR peak luminance or the ability to adjust display brightness

XDR Tuner Adjust the white point, gamma or make your XDR display darker without losing HDR peak luminance or the ability to adjust display brightness

François Simond 16 Dec 28, 2022
FileTransfer - to exchange files from phone to laptop

A small website I locally host on my network to exchange files from my phone and other devices to my laptop.

Ronak Badhe 4 Feb 15, 2022
Transform Python source code into it's most compact representation

Python Minifier Transforms Python source code into it's most compact representation. Try it out! python-minifier currently supports Python 2.7 and Pyt

Daniel Flook 403 Jan 02, 2023
Python library for creating PEG parsers

PyParsing -- A Python Parsing Module Introduction The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the t

Pyparsing 1.7k Jan 03, 2023
GUI tool to manage the contents of chests in Botw

Botw chest manager is a small gui tool allowing to easily manage chests. Sometimes Ice Spear can be very time consuming when adding a simple chest. The purpose of this light tool is to add a new ches

3 Aug 25, 2022
Stori QA Automation Challenge

Stori-QA-Automation-Challenge This is the repository is created for the Stori QA Intern Automation Engineer Challenge! In this you can find the Requir

Daniel Castañeda 0 Feb 20, 2022
Create an application to visualize single/multiple Xandar Kardian people counting sensors detection result for a indoor area.

Program Design Purpose: We want to create an application to visualize single/multiple Xandar Kardian people counting sensors detection result for a indoor area.

2 Dec 28, 2022
Possible solutions to Wordscapes, a mobile game for the android operating system, downloadable from the play store

Possible solutions to Wordscapes, a mobile game for the android operating system, downloadable from the play store

Clifford Onyonka 2 Feb 23, 2022
Customizable-menu-python - User customizable menu in Python

Menu personalizável pelo usuário em Python A minha ideia com esse projeto pessoa

Renan Barbosa 4 Oct 28, 2022
My collection of mini-projects in various languages

Mini-Projects My collection of mini-projects in various languages About: This repository consists of a number of small projects. Most of these "mini-p

Siddhant Attavar 1 Jul 11, 2022
Number calculator application.

Number calculator application.

Michael J Bailey 3 Oct 08, 2021
A Python package that provides physical constants.

PhysConsts A Python package that provides physical constants. The code is being developed by Marc van der Sluys of the department of Astrophysics at t

Marc van der Sluys 1 Jan 05, 2022
RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.

RDFLib RDFLib is a pure Python package for working with RDF. RDFLib contains most things you need to work with RDF, including: parsers and serializers

RDFLib 1.8k Jan 02, 2023
This is a a CSMA/CA simulator written in Python based on simulator of the same type

This is a a CSMA/CA simulator written in Python based on simulator of the same type found the link https://github.com/StevenSLXie/CSMA-Simulator with

M. Ismail 4 Nov 22, 2022
This is a Python script to detect rapid upwards price changes (pumps) in a cryptocurrency pairing

A python script to detect a rapid upwards price brekout (pump) in a cryptocurrency pairing, through pandas and Binance API.

3 May 25, 2022