Python script for transferring data between three drives in two separate stages

Last update: Nov 10, 2021

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

Waterlock was designed for moving files from one computer (e.g. your home server) to an intermediary drive (e.g. a portable hard drive), and then from that drive to another computer (e.g. an offsite backup server).

It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
It computes a BLAKE2 checksum on every file copy and compares it to the initial hash value stored in the SQLite database to ensure that the data has not been corrupted.
It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
Every time Waterlock is run on the source location, it checks for files that have been modified since the last run (based on timestamp, not hash). Any modified file has its hash and modification timestamp updated in the database and is marked as unmoved so that it is transferred again. Note that Waterlock does not version files. Because change detection relies on timestamps, a silently corrupted file should in theory not be transferred again unless its modification timestamp has also changed.
Every time Waterlock is run on the source location, it also checks for files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive (due to accidental deletion, for instance), Waterlock moves them to the intermediary drive again.
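
The reserve-space check and hash-verified copy described in the list above can be sketched roughly as follows. This is a minimal illustration, not Waterlock's actual code: the function names are hypothetical, the choice of BLAKE2b (rather than BLAKE2s) is an assumption, and the reserve is assumed to be expressed in bytes.

```python
import hashlib
import shutil
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read files in 1 MiB chunks to keep memory use flat


def blake2_hexdigest(path: Path) -> str:
    """Return the BLAKE2b hex digest of a file (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fits_with_reserve(file_size: int, free_bytes: int, reserved_bytes: int) -> bool:
    """True if copying file_size bytes still leaves reserved_bytes free."""
    return free_bytes - file_size >= reserved_bytes


def copy_verified(src: Path, dst: Path, expected_hash: str, reserved_bytes: int) -> bool:
    """Copy src to dst if space allows, then confirm the hash of the copy.

    Returns False when the file is skipped for lack of space, and raises
    IOError when the copied bytes do not match expected_hash.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    free_bytes = shutil.disk_usage(dst.parent).free
    if not fits_with_reserve(src.stat().st_size, free_bytes, reserved_bytes):
        return False
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    if blake2_hexdigest(dst) != expected_hash:
        dst.unlink()  # do not leave a corrupted copy behind
        raise IOError(f"hash mismatch after copying {src}")
    return True
```

Hashing the copy rather than trusting the copy operation is what lets a bad write to the intermediary drive be caught before the file is recorded as moved.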

Example Use Case: I use Waterlock to transfer files that are too large to send over the network to an offsite backup location at a relative's house. Each time I visit, I run the script on my home server to load the external drive, then run it again on the offsite backup server.
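
A two-stage workflow like this depends on the database remembering where each file currently is. The sketch below shows one way such tracking could look with the standard sqlite3 module. The table layout, the stage names ('unmoved', 'middle', 'destination') and the function names are all hypothetical and are not taken from Waterlock's real schema.

```python
import sqlite3

# Hypothetical schema: one row per file. The stage column holds
# 'unmoved', 'middle', or 'destination'.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path  TEXT PRIMARY KEY,
    hash  TEXT NOT NULL,
    mtime REAL NOT NULL,
    stage TEXT NOT NULL DEFAULT 'unmoved'
)
"""


def open_db(db_path: str) -> sqlite3.Connection:
    """Open (or create) the tracking database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record_file(conn: sqlite3.Connection, path: str, file_hash: str, mtime: float) -> None:
    """Insert a new file, or reset it to 'unmoved' if its timestamp changed."""
    row = conn.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO files (path, hash, mtime) VALUES (?, ?, ?)",
            (path, file_hash, mtime),
        )
    elif row[0] != mtime:
        conn.execute(
            "UPDATE files SET hash = ?, mtime = ?, stage = 'unmoved' WHERE path = ?",
            (file_hash, mtime, path),
        )
    conn.commit()


def mark_stage(conn: sqlite3.Connection, path: str, stage: str) -> None:
    """Record that a file has reached the given stage."""
    conn.execute("UPDATE files SET stage = ? WHERE path = ?", (stage, path))
    conn.commit()


def pending(conn: sqlite3.Connection, stage: str) -> list:
    """List the paths currently recorded at the given stage."""
    rows = conn.execute("SELECT path FROM files WHERE stage = ? ORDER BY path", (stage,))
    return [row[0] for row in rows]
```

Because the state lives in a file on the intermediary drive, a run on either computer can pick up where the previous run left off.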

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error-prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that Waterlock does not delete data from the intermediary drive after the final transfer to the destination.

python waterlock.py

If you are familiar with Python, you can also fully verify all the files on the middle or destination drives to ensure that their hashes match what is stored in the database. This is done using two additional methods, verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

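# source_directory, end_directory and reserved_space are the settings
# defined at the top of the script.
if __name__ == "__main__":
    wl = Waterlock(
        source_directory=source_directory,
        end_directory=end_directory,
        reserved_space=reserved_space,
    )
    wl.start()
    # Re-hash the files on the destination and compare them to the database.
    wl.verify_destination()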
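
For readers curious what a full verification pass involves, here is a minimal, self-contained sketch. It is not the implementation of verify_destination(): the function names are hypothetical, the expected hashes are passed in as a plain dictionary rather than read from waterlock.db, and BLAKE2b is assumed.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the BLAKE2b hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root: Path, expected: dict) -> dict:
    """Compare files under root against expected {relative_path: hash}.

    Returns {relative_path: "missing" or "mismatch"} for each problem
    found. An empty dictionary means every file verified cleanly.
    """
    problems = {}
    for relative_path, expected_hash in expected.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif file_hash(target) != expected_hash:
            problems[relative_path] = "mismatch"
    return problems
```

A full pass reads every byte on the drive, so it is much slower than a normal incremental run and is best reserved for occasional checks.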

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.

Owner

David Swanlund
