Python script for transferring data between three drives in two separate stages

Overview

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

The use-case Waterlock was designed for is moving files from one computer (i.e. your home server) to a intermediary drive (i.e. a portable hard drive), and then from the hard drive to another computer (i.e. an offsite backup server).

  • It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
  • It performs blake2 checksums with every file copy, comparing it to the initial hash value stored in the SQLite database to ensure that data is not corrupted.
  • It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
  • Every time Waterlock is run on the source location, it will check for any files that have been recently modified (based on timestamp, not hash). Any modified files will have their hash & modification timestamps updated in the database, in addition to being marked as unmoved such that they are transferred again and updated. Note that Waterlock does not version files. Nevertheless, silently corrupted files should theoretically not be transferred over unless their modification timestamp has been adjusted.
  • Every time Waterlock is run on the source location, it will check for any files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive due to accidental deletion for instance, Waterlock will move those files to the intermediary drive again.

Example Use Case: I use Waterlock to transfer large files that are too large to transfer over the network to an offsite backup location at a relatives house. Each time I visit I run the script on my home server to load the external drive, then run it again on the offsite-backup server.

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that after the final transfer to the destination, Waterlock will not delete data on the intermediary drive.

python waterlock.py

If you are familiar with Python, you can also fully verify all the files on the middle or destination drives to ensure that the hashes match what is stored in the database. This is done using two additional class functions called verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

if __name__ == "__main__":
    wl = Waterlock( source_directory=source_directory, 
                    end_directory=end_direcotry, 
                    reserved_space=reserved_space
                    )
    wl.start()
    wl.verify_destination()

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.

You might also like...
Catalogue data - A Python Scripts to prepare catalogue data

catalogue_data Scripts to prepare catalogue data. Setup Clone this repo. Install

This is a python script to navigate and extract the FSD50K dataset

FSD50K navigator This is a script I use to navigate the sound dataset from FSK50K.

Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

topas-create-graphs A script to automatically plot the results of a topas simulation Works for percentage depth dose (pdd) and dose profiles (dp). Dep

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

A data parser for the internal syncing data format used by Fog of World.
A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data structure.

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter. Fancy data functions that will make your life as a data scientist easier.
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.
Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Releases(latest)
Owner
David Swanlund
PhD student at SFU studying spatial privacy and spatial data anonymization
David Swanlund
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Conduits - A Declarative Pipelining Tool For Pandas

Conduits - A Declarative Pipelining Tool For Pandas Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can some

Kale Miller 7 Nov 21, 2021
t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology.

tree-SNE t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology. Building on recent advances in s

Isaac Robinson 61 Nov 21, 2022
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
Important dataframe statistics with a single command

quick_eda Receiving dataframe statistics with one command Project description A python package for Data Scientists, Students, ML Engineers and anyone

Sven Eschlbeck 2 Dec 19, 2021
Python Library for learning (Structure and Parameter) and inference (Statistical and Causal) in Bayesian Networks.

pgmpy pgmpy is a python library for working with Probabilistic Graphical Models. Documentation and list of algorithms supported is at our official sit

pgmpy 2.2k Dec 25, 2022
Python package for analyzing behavioral data for Brain Observatory: Visual Behavior

Allen Institute Visual Behavior Analysis package This repository contains code for analyzing behavioral data from the Allen Brain Observatory: Visual

Allen Institute 16 Nov 04, 2022
A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset

xwrf A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset. The primary objective of

National Center for Atmospheric Research 43 Nov 29, 2022
Maximum Covariance Analysis in Python

xMCA | Maximum Covariance Analysis in Python The aim of this package is to provide a flexible tool for the climate science community to perform Maximu

Niclas Rieger 39 Jan 03, 2023
Airflow ETL With EKS EFS Sagemaker

Airflow ETL With EKS EFS & Sagemaker (en desarrollo) Diagrama de la solución Imp

1 Feb 14, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 01, 2022
NFCDS Workshop Beginners Guide Bioinformatics Data Analysis

Genomics Workshop FIXME: overview of workshop Code of Conduct All participants s

Elizabeth Brooks 2 Jun 13, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Implementation in Python of the reliability measures such as Omega.

OmegaPy Summary Simple implementation in Python of the reliability measures: Omega Total, Omega Hierarchical and Omega Hierarchical Total. Name Link O

Rafael Valero Fernández 2 Apr 27, 2022
A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

Andre Pretorius 9 Sep 15, 2022
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
Falcon: Interactive Visual Analysis for Big Data

Falcon: Interactive Visual Analysis for Big Data Crossfilter millions of records without latencies. This project is work in progress and not documente

Vega 803 Dec 27, 2022
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022