Pipeline to convert a haploid assembly into diploid


HapDup

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes preserve heterozygous structural variants (in addition to small variants) and are locally phased.

Version 0.4

Input requirements

HapDup takes as input a haploid long-read assembly, such as one produced with Flye or Shasta. Currently, ONT reads (Guppy 5+ recommended) and PacBio HiFi reads are supported.

HapDup is currently designed for low-heterozygosity genomes (such as human). The expectation is that the assembly has most of the diploid genome collapsed into a single haplotype. For assemblies with partially resolved haplotypes, alternative alleles can be removed prior to running the pipeline using purge_dups. We expect to add better support for highly heterozygous genomes in the future.
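
If alternative haplotigs do need to be removed first, the standard purge_dups workflow looks roughly like the sketch below. File names and paths are placeholders, and the exact commands should be checked against the purge_dups documentation:

```shell
# 1. Map reads to the assembly and compute read-depth cutoffs
minimap2 -x map-ont assembly.fasta reads.fastq | gzip -c > reads.paf.gz
pbcstat reads.paf.gz                 # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs 2> calcuts.log

# 2. Self-align the split assembly
split_fa assembly.fasta > assembly.split.fa
minimap2 -x asm5 -DP assembly.split.fa assembly.split.fa | gzip -c > assembly.split.self.paf.gz

# 3. Flag duplicated haplotigs and extract the purged primary assembly
purge_dups -2 -T cutoffs -c PB.base.cov assembly.split.self.paf.gz > dups.bed 2> purge_dups.log
get_seqs -e dups.bed assembly.fasta  # writes purged.fa (primary) and hap.fa
```

The purged.fa output can then be used as HapDup's input assembly.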

The first stage is to realign the original long reads to the assembly using minimap2. We recommend using the latest minimap2 release.

minimap2 -ax map-ont -t 30 assembly.fasta reads.fastq | samtools sort -@ 4 -m 4G > lr_mapping.bam
samtools index -@ 4 lr_mapping.bam

Quick start using Docker

HapDup is available on the Docker Hub.

If Docker is not installed on your system, you need to set it up first following this guide.

Next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If that is not the case, you may need to bind additional directories using Docker's -v / --volume argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

cd directory_with_assembly_and_alignment
HD_DIR=`pwd`
docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.4 \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont

Quick start using Singularity

Alternatively, you can use Singularity. First, you will need to install the client as described in the manual. One way to do it is through conda:

conda install singularity

Next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If that is not the case, you may need to bind additional directories using the --bind argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

singularity pull docker://mkolmogo/hapdup:0.4
HD_DIR=`pwd`
singularity exec --bind $HD_DIR hapdup_0.4.sif \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont

Output files

The output directory will contain:

  • haplotype_{1,2}.fasta - final assembled haplotypes
  • phased_blocks_hp{1,2}.bed - phased blocks coordinates

Haplotypes generated by the pipeline contain homozygous and heterozygous variants (small and structural). Because the pipeline only uses long-read (ONT) data, it does not achieve chromosome-level phasing. Fully phased blocks are given in the phased_blocks* files.
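
Since the phased_blocks files are plain BED, block statistics are easy to compute. A minimal sketch (the records below are made-up, not real HapDup output):

```python
def phased_block_n50(bed_lines):
    """N50 of phased-block lengths from BED records (chrom<TAB>start<TAB>end)."""
    lengths = sorted((int(end) - int(start) for _, start, end, *_ in
                      (line.split() for line in bed_lines if line.strip())),
                     reverse=True)
    total, acc = sum(lengths), 0
    for length in lengths:
        acc += length
        if 2 * acc >= total:
            return length

# hypothetical phased_blocks_hp1.bed contents
records = ["ctg1\t0\t1000000", "ctg1\t1200000\t1500000", "ctg2\t0\t2000000"]
print(phased_block_n50(records))  # -> 2000000
```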

Pipeline overview

  1. HapDup starts by filtering out alignments that likely originate from unassembled parts of the genome. Such alignments may later create false haplotypes if not removed (e.g., reads from a segmental duplication with two assembled copies could otherwise create four haplotypes).

  2. Afterwards, PEPPER is used to call SNPs from the filtered alignment file.

  3. Then Margin is used to phase SNPs and haplotag reads.

  4. Flye is then used to polish the initial assembly with the reads from each of the two haplotypes independently.

  5. Finally, we find (heterozygous) breakpoints in long-read alignments and apply the corresponding structural changes to each polished haplotype. Currently, this recovers large heterozygous inversions.
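
The filtering idea in step 1 can be illustrated with a toy predicate. The thresholds and inputs below are hypothetical, chosen for illustration, and are not HapDup's actual values:

```python
def keep_alignment(read_len, aligned_len, mapq, divergence,
                   max_clip_frac=0.2, min_mapq=10, max_divergence=0.1):
    """Drop alignments that are heavily clipped, low-quality, or too divergent --
    typical signs of reads originating from unassembled or repetitive regions.
    All thresholds here are illustrative, not HapDup's real parameters."""
    clip_frac = 1.0 - aligned_len / read_len
    return (clip_frac <= max_clip_frac and mapq >= min_mapq
            and divergence <= max_divergence)

print(keep_alignment(10000, 9500, 60, 0.03))  # True: clean full-length alignment
print(keep_alignment(10000, 5000, 60, 0.03))  # False: half the read is clipped
```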

Benchmarks

We evaluated HapDup haplotypes in terms of reconstructed structural variant signatures (heterozygous and homozygous) using the HG002 sample, for which a curated set of SVs is available. We used recent ONT data basecalled with Guppy 5.

Given HapDup haplotypes, we called SVs using dipdiff. We also compared the SV set against hifiasm assemblies, even though they were produced from HiFi rather than ONT reads. Evaluation was done using truvari with the -r 2000 option. GT refers to genotype-aware benchmarks.

Method           Precision  Recall  F1-score  GT Precision  GT Recall  GT F1-score
Shasta+HapDup    0.9500     0.9551  0.9525    0.9340        0.9543     0.9405
Sniffles         0.9294     0.9143  0.9219    0.8284        0.9051     0.8605
cuteSV           0.9324     0.9428  0.9376    0.9119        0.9416     0.9265
hifiasm          0.9512     0.9734  0.9622    0.9129        0.9723     0.9417
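
The F1-score columns are the harmonic mean of precision and recall, which can be checked against any row of the table (here the Shasta+HapDup row):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9500, 0.9551), 4))  # -> 0.9525, matching the table
```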

Yak k-mer based evaluations:

Hap  QV  Switch err  Hamming err
1    35  0.0389      0.1862
2    35  0.0385      0.1845
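
QV is a Phred-scaled per-base error rate (QV = -10·log10(error)), so the conversion is a one-liner; QV 35 corresponds to roughly one consensus error per 3,000 bases:

```python
import math

def qv_to_error(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)

def error_to_qv(err):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(err)

print(qv_to_error(35))  # ~3.16e-4, i.e. about 1 error per 3,160 bases
```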

Given a minimap2 alignment, HapDup runs in ~400 CPU hours and uses ~80 GB of RAM.

Source installation

If you prefer, you can install from source as follows:

#create a new conda environment and activate it
conda create -n hapdup python=3.8
conda activate hapdup

#get HapDup source
git clone https://github.com/fenderglass/hapdup
cd hapdup
git submodule update --init --recursive

#build and install Flye
pushd submodules/Flye/ && python setup.py install && popd

#build and install Margin
pushd submodules/margin/ && mkdir build && cd build && cmake .. && make && cp ./margin $CONDA_PREFIX/bin/ && popd

#build and install PEPPER and its dependencies
pushd submodules/pepper/ && python -m pip install . && popd

To run, ensure that the conda environment is activated and then execute:

conda activate hapdup
./hapdup.py --assembly assembly.fasta --bam lr_mapping.bam --out-dir hapdup -t 64 --rtype ont

Acknowledgements

The major parts of the HapDup pipeline are PEPPER, Margin, and Flye.

Authors

The pipeline was developed at the UC Santa Cruz Genomics Institute, Benedict Paten's lab.

Pipeline code contributors:

  • Mikhail Kolmogorov

PEPPER/Margin/Shasta support:

  • Kishwar Shafin
  • Trevor Pesout
  • Paolo Carnevali

Citation

If you use HapDup in your research, the most relevant papers to cite are:

Kishwar Shafin, Trevor Pesout, Pi-Chuan Chang, Maria Nattestad, Alexey Kolesnikov, Sidharth Goel, Gunjan Baid et al. "Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks." bioRxiv (2021). doi:10.1101/2021.03.04.433952

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin and Pavel Pevzner, "Assembly of Long Error-Prone Reads Using Repeat Graphs", Nature Biotechnology, 2019 doi:10.1038/s41587-019-0072-8

License

HapDup is distributed under a BSD license. See the LICENSE file for details. Other software included in this distribution is released under either MIT or BSD licenses.

How to get help

The preferred way to report problems or ask questions is the issue tracker.

In case you prefer personal communication, please contact Mikhail at [email protected].

Comments
  • pthread_setaffinity_np failed Error while running pepper


    Hi,

    I'm trying to run HapDup on my assembly from Flye. However, an error occurred while running PEPPER:

        RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:173 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

    There is also a warning before this runtime error:

        /usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py:2095: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
          warnings.warn("Exporting a model to ONNX with a batch_size other than 1, " +

    Do you have any idea why this happens?

    The commands that I use are like:

    reads=NA24385_ONT_Promethion.fastq
    outdir=`pwd`
    assembly=${outdir}/assembly.fasta
    hapdup_sif=../HapDup/hapdup_0.4.sif
    
    time minimap2 -ax map-ont -t 30 ${assembly} ${reads} | samtools sort -@ 4 -m 4G > assembly_lr_mapping.bam
    samtools index -@ 4 assembly_lr_mapping.bam
    
    time singularity exec --bind ${outdir} ${hapdup_sif}\
    	hapdup --assembly ${assembly} --bam ${outdir}/assembly_lr_mapping.bam --out-dir ${outdir}/hapdup -t 64 --rtype ont
    

    Thank you

    bug 
    opened by LYC-vio 11
  • Docker fails to mount HD_DIR


    Hi! The following error occurs when I try to run:

    sudo docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.2   hapdup --assembly $HD_DIR/barcode11CBS1878_v2.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup
    
    docker: Error response from daemon: error while creating mount source path '/home/user/data/volume_2/hapdup_results': mkdir /home/user/data: file exists.
    ERRO[0000] error waiting for container: context canceled``
    
    opened by alkminion1 8
  • Incorrect genotype for large deletion


    Hi,

    I have used Hapdup to make a haplotype-resolved assembly from Illumina-corrected ONT reads (haploid assembly made with Flye 2.9) and I am particularly interested in a large 32kb deletion. Here is a screenshot of IGV (from top to bottom: HAP1, HAP2 and haploid assembly):

    hapdup

    I believe the position and size of the deletion are nearly correct. However, the deletion is called homozygous while it should be heterozygous. I have assembled this proband and its parents with Hifiasm using 30x PacBio HiFi: the three assemblies support a heterozygous call in the proband. I can also see from the corrected ONT reads that there is support for a heterozygous call. Finally, we can see this additional contig in the haploid assembly, which I guess also supports a heterozygous call.

    Hence, my question is: even if MARGIN manages to correctly separate reads with the deletion from reads without the deletion, can the polishing of Flye actually "fix" such a large event in one of the haplotype assembly?

    Thanks, Guillaume

    opened by GuillaumeHolley 7
  • invalid contig


    Hi, I got an error when I ran the third step. Here is the error output:

        Skipped filtering phase
        Skipped pepper phase
        Skipped margin phase
        Skipped Flye phase
        Finding breakpoints
        Parsed 304552 reads
        14590 split reads
        Running: flye-minimap2 -ax asm5 -t 64 -K 5G /usr_storage/zyl/SY_haplotype/ZSP192L/ZSP192L.fasta /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/structural/liftover_hp1.bam
        [bam_sort_core] merging from 0 files and 4 in-memory blocks...
        Traceback (most recent call last):
          File "/usr/local/bin/hapdup", line 8, in <module>
            sys.exit(main())
          File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 173, in main
            bed_liftover(inversions_bed, minimap_out, open(inversions_hp, "w"))
          File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 76, in bed_liftover
            proj_start_chr, proj_start_pos, proj_start_sign = project(bam_file, chr_id, chr_start)
          File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 9, in project
            name, pos, sign = project_flank(bam_path, ref_seq, ref_pos, 1)
          File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 23, in project_flank
            for pileup_col in samfile.pileup(ref_seq, max(0, ref_pos - flank), ref_pos + flank, truncate=True,
          File "pysam/libcalignmentfile.pyx", line 1335, in pysam.libcalignmentfile.AlignmentFile.pileup
          File "pysam/libchtslib.pyx", line 685, in pysam.libchtslib.HTSFile.parse_region
        ValueError: invalid contig Contig125

    bug 
    opened by tongyin121 7
  • hapdup fails with multiple primary alignments


    I've got one sample in a series which is failing in hapdup with the following error.

    Any suggestions?

    Thanks

    Starting merge
    Expected three tokens in header line, got 2
    This usually means you have multiple primary alignments with the same read ID. You can identify whether this is the case with this command:

        samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

    The run then fails with:

        [2022-12-13 17:15:57] ERROR: Missing output: hapdup/margin/MARGIN_PHASED.haplotagged.bam
        Traceback (most recent call last):
          File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in <module>
            sys.exit(main())
          File "/data/test_data/GIT/hapdup/hapdup/main.py", line 206, in main
            file_check(haplotagged_bam)
          File "/data/test_data/GIT/hapdup/hapdup/main.py", line 114, in file_check
            raise Exception("Missing output")
        Exception: Missing output

    opened by mattloose 4
  • ZeroDivisionError


    Hi,

    I'm running hapdup version 0.8 on a number of human genomes.

    It appears to fail fairly regularly with an error in Flye:

    [2022-10-20 15:33:48] INFO: Running Flye polisher
    [2022-10-20 15:33:48] INFO: Polishing genome (1/1)
    [2022-10-20 15:33:48] INFO: Polishing with provided bam
    [2022-10-20 15:33:48] INFO: Separating alignment into bubbles
    [2022-10-20 15:37:12] ERROR: Thread exception
    [2022-10-20 15:37:12] ERROR: Traceback (most recent call last):
      File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 79, in _thread_worker
        indels_profile = _get_indel_clusters(ctg_aln, profile, ctg_region.start)
      File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 419, in _get_indel_clusters
        get_clusters(deletions, add_last=True)
      File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 410, in get_clusters
        support = len(reads) / region_coverage
    ZeroDivisionError: division by zero

    The flye version is 2.9-b1778

    Has anyone else seen this and have any suggestions on how to fix?

    Thanks.

    bug 
    opened by mattloose 4
  • Super cool tool! Maybe in the README mention that FASTQ files are required if using the container


    As far as I know, inputting FASTA reads aligned to the reference as the BAM file will result in PEPPER failing to find any variants using the default settings of the Docker/Singularity container, as the config for PEPPER requires a minimum base quality setting.

    opened by jelber2 3
  • Option to set minimap2 -I flag


    Hi,

    Ran into this error while trying to run hapdup:

    [2022-03-08 10:23:36] INFO: Running: flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > <PATH>/hapdup/structural/liftover_hp1.bam
    [E::sam_parse1] missing SAM header
    [W::sam_read1] Parse error at line 2
    samtools sort: truncated file. Aborting
    Traceback (most recent call last):
      File "/usr/local/bin/hapdup", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 245, in main
        subprocess.check_call(" ".join(minimap_cmd), shell=True)
      File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command 'flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > <PATH>/hapdup/structural/liftover_hp1.bam' returned non-zero exit status 1.
    

    I suspect it could be because of the default minimap2 -I flag being too small (4G)? If this is the case, maybe an option to specify this could be added, or adjust it automatically depending on genome size?

    Thanks!

    bug 
    opened by fellen31 3
  • Using Pepper-MARGIN r0.7?


    Hi,

    Would it be possible to include the latest version of Pepper-MARGIN (r0.7) in hapdup? I haven't been able to run hapdup on my data so far because of some issues in Pepper-MARGIN r0.6 (now solved in r0.7).

    Thank you!

    Guillaume

    enhancement 
    opened by GuillaumeHolley 3
  • Please add CLI option to specify location of BAM index


    Could you add an additional command line parameter, allowing the specification of the BAM index location? Eg,

    hapdup --bam /some/where/abam.bam --bam-index /other/location/a_bam_index.bam.csi

    And then pass that optional value to pysam.AlignmentFile() argument filepath_index?

    Motivation: I'm wrapping hapdup and some other steps in a WDL script, and need to pass each file separately (ie, they are localized as individual files, and there is no guarantee they end up in the same directory when hapdup is invoked). The current hapdup assumes the index and bam are in the same directory, and fails.

    Thanks!

    CC: @0seastar0

    enhancement 
    opened by bkmartinjr 3
  • Singularity


    Hi,

    Thank you for this tool, I am really excited to try it! Would it be possible to have hapdup available as a Singularity image or to have the Docker image available online in a container repo (such that it can be converted to a Singularity image with a singularity pull)?

    Thanks, Guillaume

    opened by GuillaumeHolley 3
  • --overwrite fails


    [2022-11-03 20:16:22] INFO: Filtering alignments
    Traceback (most recent call last):
      File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in <module>
        sys.exit(main())
      File "/data/test_data/GIT/hapdup/hapdup/main.py", line 153, in main
        filter_alignments_parallel(args.bam, filtered_bam, min(args.threads, 30),
      File "/data/test_data/GIT/hapdup/hapdup/filter_misplaced_alignments.py", line 188, in filter_alignments_parallel
        pysam.merge("-@", str(num_threads), bam_out, *bams_to_merge)
      File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/pysam/utils.py", line 69, in call
        raise SamtoolsError(
    pysam.utils.SamtoolsError: "samtools returned with error 1: stdout=, stderr=[bam_merge] File 'hapdup/filtered.bam' exists. Please apply '-f' to overwrite. Abort.\n"

    Looks as though when you run with --overwrite the command is not being correctly passed through to sub processes.

    bug 
    opened by mattloose 2
  • phase block number compared to Whatshap


    Hello, thank you for the great tool!

    I was just testing HapDup v0.7 on our fish genome. Comparing the output with phasing done with WhatsHap (WH), I wondered why there is such a big difference in phased block size and block number between HapDup and the WH pipeline?

    For the fish chromosomes, WH was generating 679 blocks using 2'689'114 phased SNPs. Margin (HapDup pipeline) was generating 5352 blocks using 3'862'108 phased SNPs.

    The main difference seems to be the prior read filtering and usage of MarginPhase for the phasing in HapDup, but does this explain such a big difference?

    I was wondering if phase blocks of HapDup could be concatenated using WhatsHap SNP and block information to increase continuity? I imagine it would be a straightforward approach: overlap SNP positions between Margin and WH with phase block ids and lift over phase ids from WH. I will do some visual inspections and scripting to test if there is overlap of called SNPs and agreement on block borders.

    Cheers, Michel

    opened by MichelMoser 3
Releases (0.10)
  • 0.10 (Oct 29, 2022)

  • 0.6 (Mar 4, 2022)

    • Fixed issue with PEPPER model quantization causing the pipeline to hang on some systems
    • Speed-up of the last breakpoint analysis part of the pipeline, causing bottlenecks on some datasets
  • 0.5 (Feb 6, 2022)

    • Update to PEPPER 0.7
    • Added new option --pepper-model for custom pepper models
    • Added new option --bam-index to provide a non-standard path to alignment index file
  • 0.4 (Nov 19, 2021)

Owner
Mikhail Kolmogorov
Postdoc @ UCSC CGL, Paten lab. I work on building algorithms for computational genomics.