Research using python - Guide for development of research code (using Anaconda Python)

Overview

Guide for development of research code
(using Anaconda Python)

TL;DR:

One time setup

  1. Install git and go through its one time setup, bare minimum:
    git config --global user.name “First Last”
    git config --global user.email “first[email protected]”
    git config --global core.editor editor_of_choice
    
    Editor option for the few folks on windows (haven't tried it myself):
    git config --global core.editor "'input/path/to/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
    
  2. Install git-lfs and run git lfs install.
  3. Install miniconda.
  4. Sign up for a GitHub account.
  5. Generate an SSH key and add it to your GitHub account.

Once per repository setup

  1. Create empty repository on GitHub, lets call it my_project.
  2. Initial commit into local repository and push to remote: 0. Create local repository (also creates new directory) git init my_project
    1. Create a markdown file, README.md describing the project.
    2. Create an environment_dev.yml file based on this example. Change the environment name to an appropriate one and add relevant packages.
    3. Copy this pre-commit configuration file.
    4. Copy this .gitignore file and add file types you want git to ignore.
    5. Add file types to be tracked by git-lfs based on file extension, creates the .gitattributes file (e.g. git lfs track "*.pth")
    6. Copy this .flake8 file to customize the tool settings.
git add README.md environment_dev.yml .pre-commit-config.yaml .gitattributes .gitignore .flake8
git commit
git branch -M main
git remote add origin [email protected]:user_name/my_project.git
git push -u origin main
  1. Create virtual environment activate it and set up pre-commit:
    conda env create -f environment_dev.yml
    conda activate my_project_dev
    pre-commit install
    

Start working

  1. Activate virtual environment conda activate my_project_dev
  2. Create new branch off of main:
git checkout main
git checkout -b my_new_branch
  1. Work.
  2. Commit locally and push to remote (origin can be either a fork, if using a triangular workflow, or the original repository if using a centralized workflow):
git add file1 file2 file3
git commit
git push origin my_new_branch
  1. Create a pull request on GitHub and after tests pass merge into main branch.

If code is not in the remote repository, consider it lost.

Long version

Why should you care?

Most scientists need to write code as part of their research. This is a "physical" embodiment of the underlying algorithmic and mathematical theory. Traditionally the software engineering standards applied to code written as part of research have been rather low (rampant code duplication...). In the past decade we have seen this change. Primarily because it is now much more common for researchers to share their code (often due to the "encouragement" of funding agencies) in all its glory.

When sharing code, we expect it to comply with some minimal software engineering standards including design, readability, and testing.

I strive to follow the guidance below, but don't always. Still, it's important to have a goal to strive towards. To quote Lewis Carol (If you don't know where you're going, any road will take you there). From Alice's Adventures in Wonderland:

“Would you tell me, please, which way I ought to go from here?” “That depends a good deal on where you want to get to,” said the Cat. “I don’t much care where-” said Alice. “Then it doesn’t matter which way you go,” said the Cat. "-so long as I get somewhere,” Alice added as an explanation.“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Personal pet peeves, in no particular order:

  • A single commit of all the code in the GitHub repository. Yes, you're sharing code but it did not magically materialize in its final form, be transparent so that we can trust the code and see how it developed over time. We can learn from paths that did not pan out almost as much as from the path that did. By providing all of the history we can see which algorithmic paths were attempted and did not work out. Help others avoid going down dead-end paths.
  • Repository contains .DS_Store files. Yes, we know you are proud of your Mac. I like OSX too, but seriously, you should have added this file type to the .gitignore file when setting up the repository.
  • Deep learning code sans-data, sans-weight files. This is completely useless in terms of reproducibility. Don't "share" like this.
  • Code duplication with minor, hard to detect, differences between copies.

Version control

  1. Use a version control system, currently Git is the VCS of the day. Learn how to use it (introduction to git slide deck).
  2. Use a remote repository, your cloud backup. Keep it private during development and then make it public upon publication acceptance. Free services GitHub, BitBucket.
  3. Do not commit binary or large files into the repository. Use git-lfs. Beware the Jupyter notebook. Do not commit notebooks with output as this will cause the repository size to blow up, particularly if output includes images. Clear the output before committing.
  4. Use the pre-commit framework to improve (1) compliance to code style (2) avoid commits of large/binary files, AWS credentials and private keys. We all need a little help complying with our self imposed constraints (example configuration file). Note that git pre-commit hooks do not preclude non-compliant commits, as a determined user can go around the hooks, git commit --no-verify.

Writing code (Python as a use case)

Many languages have style, testing and documentation tools and conventions. Here we focus on Python, but the concepts are similar for all languages.

  1. Style - Use consistent style and enforce it. Other human beings need to read the code and readily understand it. Write code that is compliant with PEP8 (the Python style guide):
    • Use flake8 to enforce PEP8.
    • Use the Black code formatter, works for scripts and Jupyter notebooks (for Jupyter notebook support pip install black[jupyter] instead of the regular pip install black). It does not completely agree with flake8, so use both?
    • Some folks don't like the Black formatting, it isn't all roses. An alternative is autopep8.
  2. Testing - Write nominal regression tests at the same time you implement the functionality. Non-rigorous regression testing is acceptable in a research setting as we explore various solutions. The more rigorous the testing the easier it will be for a development team to get code into production. Use pytest for this task.
  3. Documentation - Write the documentation while you are implementing. Start by adding a README file to your repository (use markdown or restructured text). It should include a general description of the repository contents, how to run the programs and possibly instructions on how to build them from source code. Generally, when we postpone writing documentation we will likely never do it. That's fine too, as long as you are willing to admit to yourself that you are consciously choosing to not document your code. In Python, use a consistent Docstring format. Two popular ones are Google style and NumPy style.
  4. Reproducible environment - include instructions or configuration files to reproduce the environment in which the code is expected to work. In Python you provide files listing all package dependencies enabling the creation of the appropriate virtual environment in which to run the program. A requirements.txt for plain Python, or an environment.yml for the anaconda Python distribution. For development we often rely on additional packages not required for usage (e.g. pytest). Consequentially we include a requirements_dev.txt (environment_dev.yml) in addition to the requirements.txt (environment.yml) files. Sample requirements.txt, requirements_dev.txt and environment.yml, environment_dev.yml files.
  5. Your code is a mathematical multi-parametric function that depends on many parameters beyond the input. These parameters are either:
  • Hard coded - best avoided if they need to be changed for different inputs.
  • Given as arguments on the command-line, appropriate when you have a few, less than five. Several popular Python modules/packages that support parsing command-line arguments: argparse, click and docopt. Personally I use argparse (example usage available here).
  • Specified in a configuration file. These usually use XML or JSON formats. I use JSON (example configuration file and short script that reads it). The parameters file is given on the command-line so we also get to use argparse.

Continuous integration

Automate testing and possibly delivery using continuous integration. There are many CI services that readily integrate with remote hosted git services. In the past I've used TravisCI and CircleCI. Currently using GitHub Actions. All of these rely on a yaml based configuration files to define workflows.

An example GitHub actions workflow which runs the same tests as the pre-commit defined above is available here.

Owner
Ziv Yaniv
Ziv Yaniv
Calculator in command line using python programming language

Calculator in command line using python programming language University of the People Python fundamental Chapter 5 Conditionals and recursion The main

mark sikaundi 3 Dec 09, 2021
A(Sync) Interface for internal Audible API written in pure Python.

Audible Audible is a Python low-level interface to communicate with the non-publicly Audible API. It enables Python developers to create there own Aud

mkb79 192 Jan 03, 2023
Open slidebook .sldy files in Python

Work in progress slidebook-python Open slidebook .sldy files in Python To install slidebook-python requires Python = 3.9 pip install slidebook-python

The Institute of Cancer Research 2 May 04, 2022
An app about keyboards, originating from the design of u/Sonnenschirm

keebapp-backend An app about keyboards, originating from the design of u/Sonnenschirm Setup Firstly, ensure that the environment for python is install

8 Sep 04, 2022
This is the repo for Uncertainty Quantification 360 Toolkit.

UQ360 The Uncertainty Quantification 360 (UQ360) toolkit is an open-source Python package that provides a diverse set of algorithms to quantify uncert

International Business Machines 207 Dec 30, 2022
An implementation to rank your favourite songs from World of Walker

World-Of-Walker-Elo An implementation to rank your favourite songs from Alan Walker's 2021 album World of Walker. Uses the Elo rating system, which is

1 Nov 26, 2021
An unofficial opensource Pokemon cursor theme for Windows and Linux.

pokemon-cursor An unofficial opensource Pokemon cursor theme for Windows and Linux. Cursor Sizes 22 24 28 32 40 48 56 64 72 80 88 96 Colors Quick inst

Kaiz Khatri 72 Dec 26, 2022
Taking the fight to the establishment.

Throwdown Taking the fight to the establishment. Wat? I wanted a simple markdown interpreter in python and/or javascript to output html for my website

Trevor van Hoof 1 Feb 01, 2022
Learn to code in any language. If

Learn to Code It is an intiiative undertaken by Student Ambassadors Club, Jamshoro for students who are absolute begineers in programming and want to

Student Ambassadors' Club at Mehran UET 15 Oct 19, 2022
Cute study buddy that helps you study with the Pomodoro technique!

study-buddy Cute study buddy that helps you study with the Pomodoro (or Animedoro) technique! Kirby The Kirby folder has a Kirby, pink-themed Pomodoro

Ethan Emmanuel 1 Jan 19, 2022
A compiler for ARM, X86, MSP430, xtensa and more implemented in pure Python

A compiler for ARM, X86, MSP430, xtensa and more implemented in pure Python

Windel Bouwman 277 Dec 26, 2022
Covid-19-Trends - A project that me and my friends created as the CSC110 Final Project at UofT

Covid-19-Trends Introduction The COVID-19 pandemic has caused severe financial s

1 Jan 07, 2022
A male and female dog names python package

A male and female dog names python package

Fayas Noushad 3 Dec 12, 2021
Repositório contendo atividades no curso de desenvolvimento de sistemas no SENAI

SENAI-DES Este é um repositório contendo as atividades relacionadas ao curso de desenvolvimento de sistemas no SENAI. Se é a primeira vez em contato c

Abe Hidek 4 Dec 06, 2022
Bitflip Fault Simulation Platform by Daniele Rizzieri (2021)

BFSP [v1.05] Bitflip Fault Simulation Platform by Daniele Rizzieri (2021) The platform injects a random bitflip in each of N copies of a binary file.

Daniele Rizzieri 2 Nov 05, 2022
Chemical equation balancer

Chemical equation balancer Balance your chemical equations with ease! Installation $ git clone

Marijan Smetko 4 Nov 26, 2022
Bitflip Fault Simulation Platform by Daniele Rizzieri (2021)

SEE Injection Framework 2021 This repository contains two Single Event Effect (SEE) injection platforms. The first one is called BFSP - "Bitflip Fault

Daniele Rizzieri 2 Nov 05, 2022
program to store and update pokemons using SQL and Flask

Pokemon SQL and Flask Pokemons api in python. Technologies flask pymysql Description PokeCorp is a company that tracks pokemon and their trainers arou

Sara Hindy Salfer 1 Oct 20, 2021
Collie is for uncovering RDMA NIC performance anomalies

Collie is for uncovering RDMA NIC performance anomalies. Overview Prerequ

Bytedance Inc. 34 Dec 11, 2022
Simple but maybe too simple config management through python data classes. We use it for machine learning.

👩‍✈️ Coqpit Simple, light-weight and no dependency config handling through python data classes with to/from JSON serialization/deserialization. Curre

coqui 67 Nov 29, 2022