A pipeline that creates consensus sequences from Nanopore reads.

Last update: May 15, 2022

Overview

Authors:
Ada Madejska, MCDB, UCSB (contact: [email protected])
Nick Noll, UCSB

This pipeline takes error-prone Nanopore reads and builds consensus sequences from them, with the aim
of increasing the percent identity of the BLAST hits used to identify species. Reads in FASTQ format
are put through the pipeline, which consists of the following steps.
1. Quality control
- very short and very long reads (reads whose length deviates strongly from the usual length of the
16S sequence) are dropped.
2. K-mer frequency matrix
- build a k-mer frequency matrix from the reads that pass the quality control step. The value of k
can be changed (k=5 or 6 is recommended).
3. UMAP projection and HDBSCAN clustering
- the k-mer frequency matrix is used to create a UMAP projection, which is then clustered with HDBSCAN.
The default parameters for the UMAP and HDBSCAN functions were chosen based on a mock dataset but can
be changed.
4. Refinement
- based on our tests on mock datasets, reads from different species sometimes cluster together.
To prevent that, we include a refinement step that runs a Clustal Omega multiple sequence alignment
(MSA) on each cluster. The alignment outputs a guide tree, which is used to divide the cluster into
smaller subclusters. The distance threshold can be changed to suit each dataset.
5. Consensus making
- lastly, a consensus sequence is created for each of the defined clusters by majority calling.
The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and the
consensus is created using em_cons. The consensus sequences are then run through BLASTN to identify
each cluster.
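As an illustration of the quality control step, the length filter could be sketched as below. This is not the pipeline's own code: the function names and the min_len/max_len defaults are assumptions chosen for illustration (full-length 16S is roughly 1500 bp), and the parser assumes plain four-line FASTQ records.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        # Drop the leading '@' and any description after the first whitespace.
        yield header[1:].split()[0], seq, qual


def filter_by_length(records, min_len=1200, max_len=1700):
    """Keep records whose sequence length lies in [min_len, max_len].

    The default bounds are hypothetical; tune them to the expected amplicon.
    """
    return [rec for rec in records if min_len <= len(rec[1]) <= max_len]


# Tiny in-memory example with artificially small bounds.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n")
kept = filter_by_length(read_fastq(example), min_len=5, max_len=20)
print([read_id for read_id, _, _ in kept])  # ['r2']
```

For real data the in-memory handle would be replaced by an opened FASTQ file.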
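The k-mer frequency matrix of step 2 can be pictured as one row per read and one column per possible k-mer. The sketch below is a minimal pure-Python version under stated assumptions: the function name is hypothetical, counts are normalised to frequencies per read, and k-mers containing characters other than A, C, G, T are skipped. The pipeline itself may count or normalise differently.

```python
from itertools import product


def kmer_frequency_matrix(seqs, k=5):
    """Return (kmers, matrix), where matrix[i][j] is the frequency of kmers[j] in seqs[i]."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    matrix = []
    for seq in seqs:
        seq = seq.upper()
        counts = [0] * len(kmers)
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:  # skip k-mers with N or other ambiguous symbols
                counts[j] += 1
        total = sum(counts)
        # Normalise so reads of different lengths are comparable.
        matrix.append([c / total if total else 0.0 for c in counts])
    return kmers, matrix


kmers, matrix = kmer_frequency_matrix(["ACGTACGT", "TTTTTTTT"], k=5)
print(len(kmers), len(matrix))  # 1024 2
```

With k=5 each read becomes a 1024-dimensional vector (4096 for k=6), which is the kind of input UMAP then projects to a low-dimensional space.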
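To show what majority calling means in step 5, here is a simplified stand-in that takes already aligned reads (for example rows of a MAFFT alignment) and picks the most common symbol in each column. It is not em_cons, which applies its own scoring and plurality settings; the function name and the gap handling are assumptions made for this sketch.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Return the per-column majority symbol of equal-length aligned sequences.

    Columns where the gap is the most common symbol are dropped from the
    consensus. Ties are resolved in favour of the symbol seen first in the column.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    consensus = []
    for column in zip(*aligned):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


print(majority_consensus(["ACGT-A", "ACGTTA", "AC-T-A"]))  # ACGTA
```

Random Nanopore errors tend to fall at different positions in different reads, so the column-wise majority recovers the underlying sequence when enough reads of the same species are aligned.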

Software Dependencies:

To run the pipeline successfully, the following software needs to be installed.
1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2)
2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/)
3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html)
4. BLASTN (NCBI BLAST+) - for identifying the consensus sequences against a database
(https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required)
5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/)
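
Before running the pipeline, it can help to check that these tools are reachable on the PATH. The snippet below is a small sketch using only the Python standard library; the executable names in REQUIRED_TOOLS are assumptions about how the tools are installed on a given system and may need adjusting (for example, EMBOSS cons is sometimes installed as cons rather than em_cons).

```python
import shutil

# Assumed executable names; adjust to match the local installation.
REQUIRED_TOOLS = ["minimap2", "mafft", "em_cons", "blastn", "clustalo"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```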

Specifications:

This pipeline runs on Python 3.8.10 and Julia 1.4.1.

The following Python libraries are also required:
BioPython
hdbscan
matplotlib
pandas
sklearn (PyPI package: scikit-learn)
umap (PyPI package: umap-learn)

The following Julia packages are also required:
Pkg
DataFrames
CSV
