HG002 QC Snakemake

To Run

Resources and data are specified within the snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.

Warning: Several steps of this workflow require minimum coverage. It's recommended that this workflow not be run when yield in base pairs is insufficient to produce at least 15X coverage (i.e. run only when yield/3099922541 >= 15).
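The coverage rule above can be checked before submitting a job. The following is a minimal sketch, not part of the workflow: the function names are hypothetical, and the divisor 3099922541 is taken directly from the warning (presumably the reference size in base pairs).

```python
# Divisor from the warning above; assumed to be the reference size in bp.
GENOME_SIZE_BP = 3_099_922_541
# Minimum fold coverage recommended by the warning.
MIN_COVERAGE = 15.0


def estimated_coverage(yield_bp, genome_size_bp=GENOME_SIZE_BP):
    """Return the expected fold coverage for a yield given in base pairs."""
    return yield_bp / genome_size_bp


def has_sufficient_yield(yield_bp, min_coverage=MIN_COVERAGE):
    """Return True when the yield reaches the recommended minimum coverage."""
    return estimated_coverage(yield_bp) >= min_coverage


if __name__ == "__main__":
    example_yield = 60_000_000_000  # hypothetical yield of 60 Gb
    print(f"{estimated_coverage(example_yield):.1f}X",
          has_sufficient_yield(example_yield))
```

A check like this could be run on the summed read lengths of a condition before adding it to the workflow.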

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]

smrtcell_stats/all_movies.read_length_and_quality.tsv
- outputs 3 columns (read name, read length, read quality)
- boxplots of read length and quality
hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
- all stats below should be collected for both primary (p_ctg) and alternate (a_ctg) assemblies
- assembly size awk '$1=="SZ" {print $2}' <filename>
- auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
- NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2, $3}' <filename> ($2 is x-axis, $3 y-axis) like this: example plot
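For plotting, the three awk extractions above can be done in one pass. This is a sketch under the assumption that the stats file is whitespace-delimited with the tag in the first field, as the awk commands imply; the function name and the sample lines are hypothetical.

```python
def parse_assembly_stats(lines):
    """Collect assembly size (SZ), auN (AU) and NGx points (NL) from stats lines."""
    stats = {"size": None, "auN": None, "ngx": []}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "SZ":
            stats["size"] = int(fields[1])
        elif tag == "AU":
            stats["auN"] = float(fields[1])
        elif tag == "NL":
            # Field 2 is the x value (e.g. 10 for NG10), field 3 the length.
            stats["ngx"].append((int(fields[1]), int(fields[2])))
    return stats


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample = ["SZ\t3000", "NL\t10\t900\t1", "NL\t50\t400\t3", "AU\t512"]
    print(parse_assembly_stats(sample))
```

The `ngx` list can be passed straight to a line plot, once for the primary and once for the alternate assembly.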
hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
- adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
truvari/summary.txt
- structural variant recall jq .recall <filename>
- structural variant precision jq .precision <filename>
- structural variant f1 jq .f1 <filename>
- number of calls jq '."call cnt"' <filename>
- FP jq .FP <filename>
- TP-call jq '."TP-call"' <filename>
- FN jq .FN <filename>
- TP-base jq '."TP-base"' <filename>
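The same fields can be pulled out in one step without jq. A minimal sketch, assuming the summary file is a single JSON object containing the keys named in the list above; the function name and the sample values are hypothetical.

```python
import json

# Keys listed above; names with a space or hyphen need no special quoting here.
TRUVARI_KEYS = ["recall", "precision", "f1", "call cnt",
                "FP", "TP-call", "FN", "TP-base"]


def truvari_metrics(summary_text):
    """Return the selected metrics from the text of a truvari summary."""
    summary = json.loads(summary_text)
    # Missing keys come back as None rather than raising.
    return {key: summary.get(key) for key in TRUVARI_KEYS}


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    text = '{"recall": 0.9, "precision": 0.8, "f1": 0.85, "call cnt": 10, "FP": 2}'
    print(truvari_metrics(text))
```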
pbsv/all_chroms.pbsv.vcf.gz
- counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
- can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
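Where bcftools is not available, the per-type counts can be approximated by reading the VCF text directly. This sketch assumes standard tab-delimited VCF columns (FILTER in column 7, INFO in column 8) and counts only records whose FILTER is exactly PASS; the function names and sample records are hypothetical.

```python
import gzip
from collections import Counter


def info_value(info, key):
    """Return the value of one key in a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def count_pass_svtypes(vcf_lines):
    """Count PASS records per INFO/SVTYPE from an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[6] != "PASS":
            continue
        svtype = info_value(cols[7], "SVTYPE")
        if svtype:
            counts[svtype] += 1
    return counts


def count_pass_svtypes_in_file(path):
    """Same as count_pass_svtypes, reading a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return count_pass_svtypes(handle)
```

The returned `Counter` maps each SV type to its count and can feed a bar chart per condition.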
deepvariant/deepvariant.vcf.stats.txt
- several values in lines starting with 'SN' awk '$1=="SN"' <filename>
  - number of SNPs
  - number of INDELs
  - number of multi-allelic sites
  - number of multi-allelic SNP sites
- ratio of transitions to transversions awk '$1=="TSTV" {print$5}' <filename>
- can monitor substitution types awk '$1=="ST"' <filename>
- SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
- SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
- Number of heterozygous insertions : number of homozygous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
- Number of heterozygous deletions : number of homozygous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
- Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>
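The PSC and PSI ratios above can be computed together. This is a sketch that mirrors the awk field numbers (awk's $6 is index 5 here) and assumes tab-delimited lines with a single sample; the function name and sample counts are hypothetical, and no guard against zero denominators is included.

```python
def het_hom_ratios(stats_lines):
    """Compute SNP and indel ratios from PSC and PSI lines of a stats file."""
    ratios = {}
    for line in stats_lines:
        f = line.rstrip("\n").split("\t")
        if f[0] == "PSC":
            # awk: $6/$5 (het : non-ref hom) and $7/$8 (ts : tv)
            ratios["snp_het_hom"] = int(f[5]) / int(f[4])
            ratios["snp_ts_tv"] = int(f[6]) / int(f[7])
        elif f[0] == "PSI":
            # awk: $8, $9 are het ins/del; $10, $11 are hom alt ins/del
            ins_het, del_het, ins_hom, del_hom = (int(x) for x in f[7:11])
            ratios["ins_het_hom"] = ins_het / ins_hom
            ratios["del_het_hom"] = del_het / del_hom
            ratios["indel_het_hom"] = (ins_het + del_het) / (ins_hom + del_hom)
    return ratios
```

Header lines in the stats file begin with `#`, so they never match the `PSC` or `PSI` tags and are skipped implicitly.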
whatshap/deepvariant.phased.tsv
- phase block N50 awk '$2=="ALL" {print $22}' <filename>
- bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
whatshap/deepvariant.phased.blocklist
- calculate phase block size (to - from) and sort in descending order (awk 'NR>1 {print $5-$4}' <filename> | sort -nr), then plot as a cumulative line graph like for the assembly, N0 to N90: example plot
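The cumulative curve described above can be sketched as follows. The block list is assumed to be whitespace-delimited with one header line and the from/to coordinates in fields 4 and 5, as the awk command implies. Nx is computed relative to the total phased length, which is an assumption about the intended denominator; the function names are hypothetical.

```python
def block_sizes(blocklist_lines):
    """Return phase block sizes (to - from), skipping the header line."""
    sizes = []
    for index, line in enumerate(blocklist_lines):
        fields = line.split()
        if index == 0 or len(fields) < 5:
            continue
        sizes.append(int(fields[4]) - int(fields[3]))
    return sizes


def nx_curve(sizes, percents=range(0, 100, 10)):
    """Return (x, Nx) pairs: the block size at which the running sum of
    descending block sizes first reaches x percent of the total."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    curve = []
    for pct in percents:
        threshold = total * pct / 100
        running = 0
        for size in ordered:
            running += size
            if running >= threshold:
                curve.append((pct, size))
                break
    return curve
```

With the default `percents`, the result covers N0 (the largest block) through N90 and can be drawn as a step or line plot per condition.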
happy/all.summary.csv + happy/cmrg.summary.csv
- stats should be collected for all variants and for CMRG (challenging medically relevant genes)
  - SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
  - SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
  - SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
  - INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
  - INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
  - INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
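Selecting by column name instead of position makes the extraction less sensitive to column order. This sketch assumes the summary CSV has a header row with the columns `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision` and `METRIC.F1_Score`; check the header of your own file before relying on it. The function name is hypothetical.

```python
import csv
import io


def happy_summary_metrics(csv_text, variant_type, filter_name="PASS"):
    """Return recall, precision and F1 for one variant type, or None."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Type"] == variant_type and row["Filter"] == filter_name:
            return {
                "recall": float(row["METRIC.Recall"]),
                "precision": float(row["METRIC.Precision"]),
                "f1": float(row["METRIC.F1_Score"]),
            }
    return None
```

Calling it once with `"SNP"` and once with `"INDEL"` on both `all.summary.csv` and `cmrg.summary.csv` yields the twelve values listed above.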
happy/all.extended.csv + happy/cmrg.extended.csv
- there are many stratifications that can be examined, and Aaron Wenger might have opinions on which are most important. The commands below are for just one stratification, "GRCh38_lowmappabilityall.bed.gz".
- SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
- INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
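To survey all stratifications rather than one, the F1 score can be collected per subset. This sketch assumes the extended CSV has a header row with the columns `Type`, `Subtype`, `Subset`, `Filter` and `METRIC.F1_Score`, and applies the same Subtype `*` and Filter `PASS` selection as the awk commands; the function name is hypothetical.

```python
import csv
import io


def f1_by_stratification(csv_text, variant_type):
    """Map each stratification (Subset) to its F1 score for one variant type."""
    scores = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row["Type"] == variant_type
                and row["Subtype"] == "*"
                and row["Filter"] == "PASS"):
            try:
                scores[row["Subset"]] = float(row["METRIC.F1_Score"])
            except ValueError:
                # Skip rows where the metric is empty or not numeric.
                continue
    return scores
```

Sorting the result, for example with `sorted(scores.items(), key=lambda kv: kv[1])`, lists the weakest stratifications first, which may help decide which ones are worth plotting.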
mosdepth/coverage.mosdepth.summary.txt
- mean aligned coverage in "coverage.mosdepth.summary.txt" - 4th column of final row, can grep 'total_region'
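The mean coverage lookup can be sketched as below. It assumes the summary is tab-delimited with the mean in the fourth column, prefers the `total_region` row mentioned above, and falls back to a `total` row if that is all the file contains; the function name is hypothetical.

```python
def mean_coverage(summary_lines):
    """Return mean coverage from the total_region row (or total as fallback)."""
    means = {}
    for line in summary_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[0] in ("total", "total_region"):
            means[fields[0]] = float(fields[3])
    if "total_region" in means:
        return means["total_region"]
    return means.get("total")
```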
mosdepth/mosdepth.M2_ratio.txt
- outputs single value: ratio of chr2 coverage to chrM coverage
- bar chart of m2 ratio
mosdepth/gc_coverage.summary.txt
- outputs 5 columns: gc percentage bin, q1, median, q3, count
- q1, median, q3 columns are statistics for coverage at different gc percentages (e.g. median coverage at 30% GC)
- "count" refers to # of 500 bp windows that fall in that bin
- can pick a couple of key GC coverage bins and make box plots out of them
mosdepth/coverage.thresholds.summary.txt
- outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
- maybe a line chart comparing the different coverage thresholds among conditions

Owner

Juniper A. Lake
