A Lightweight Experiment & Resource Monitoring Tool 📺

Last update: Dec 28, 2022

Overview

"Did I already run this experiment before? How many resources are currently available on my cluster?" If these are questions you regularly ask yourself as a researcher, then mle-monitor is made for you. It provides a lightweight API for tracking your experiments in a pickle-based protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). It also comes with built-in resource monitoring for Slurm/Grid Engine clusters and local machines/servers.

mle-monitor provides three core functionalities:

MLEProtocol: A composable protocol database API for ML experiments.
MLEResource: A tool for obtaining server/cluster usage statistics.
MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

To get started, I recommend checking out the colab notebook and an example workflow.

`MLEProtocol`: Keeping Track of Your Experiments 📝

from mle_monitor import MLEProtocol

# Load protocol database or create new one -> print summary
protocol_db = MLEProtocol("mle_protocol.db", verbose=False)
protocol_db.summary(tail=10, verbose=True)

# Draft data to store in protocol & add it to the protocol
meta_data = {
    "purpose": "Grid search",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "experiment_type": "hyperparameter-search",  # Type of experiment
    "experiment_dir": "experiments/logs",  # Experiment directory
    "num_total_jobs": 10,  # Number of total jobs to run
    ...
}
new_experiment_id = protocol_db.add(meta_data)

# ... train your 10 (pseudo) networks/complete respective jobs
for i in range(10):
    protocol_db.update_progress_bar(new_experiment_id)

# Wrap up an experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)
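To make the idea of a pickle-backed protocol concrete, here is a minimal sketch using only the standard library. It is not the `MLEProtocol` implementation: the class `TinyProtocol`, its `already_run` method, and the `status` field are hypothetical names chosen for illustration. It only shows how a dictionary of experiments can be persisted with `pickle` and queried to answer "did I already run this experiment before?".

```python
import pickle
from pathlib import Path


class TinyProtocol:
    """Hypothetical stand-in for a pickle-backed experiment log (not MLEProtocol)."""

    def __init__(self, fname):
        self.path = Path(fname)
        # Load existing entries if the file exists, otherwise start empty
        if self.path.exists():
            with self.path.open("rb") as f:
                self.entries = pickle.load(f)
        else:
            self.entries = {}

    def add(self, meta_data):
        # Use the next free integer as experiment id and persist right away
        new_id = max(self.entries, default=0) + 1
        self.entries[new_id] = dict(meta_data, status="running")
        self._save()
        return new_id

    def complete(self, experiment_id):
        # Mark the experiment as finished and write the change to disk
        self.entries[experiment_id]["status"] = "completed"
        self._save()

    def already_run(self, project_name, purpose):
        # True if any stored experiment matches both fields
        return any(
            e.get("project_name") == project_name and e.get("purpose") == purpose
            for e in self.entries.values()
        )

    def _save(self):
        with self.path.open("wb") as f:
            pickle.dump(self.entries, f)
```

Because every change is written back to a single file, a second process that opens the same path sees the experiments logged so far. The sketch ignores concurrent writers, which a real tool would have to handle.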

The meta data can contain the following keys:

Key	Description	Default
`purpose`	Purpose of experiment	`'None provided'`
`project_name`	Project name of experiment	`'default'`
`exec_resource`	Resource jobs are run on	`'local'`
`experiment_dir`	Experiment log storage directory	`'experiments'`
`experiment_type`	Type of experiment to run	`'single'`
`base_fname`	Main code script to execute	`'main.py'`
`config_fname`	Config file path of experiment	`'base_config.yaml'`
`num_seeds`	Number of evaluation seeds	1
`num_total_jobs`	Number of total jobs to run	1
`num_job_batches`	Number of sequential job batches	1
`num_jobs_per_batch`	Number of jobs in single batch	1
`time_per_job`	Expected duration (days:hours:minutes)	`'00:01:00'`
`num_cpus`	Number of CPUs used in job	1
`num_gpus`	Number of GPUs used in job	0
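One way to read the table: any key you leave out of `meta_data` falls back to its default. The sketch below spells that out with a plain dictionary merge. The helper `fill_defaults` and the constant `DEFAULT_META` are hypothetical names, and how mle-monitor fills in the defaults internally is an assumption here, not something the text above specifies.

```python
# Defaults copied from the table above
DEFAULT_META = {
    "purpose": "None provided",
    "project_name": "default",
    "exec_resource": "local",
    "experiment_dir": "experiments",
    "experiment_type": "single",
    "base_fname": "main.py",
    "config_fname": "base_config.yaml",
    "num_seeds": 1,
    "num_total_jobs": 1,
    "num_job_batches": 1,
    "num_jobs_per_batch": 1,
    "time_per_job": "00:01:00",
    "num_cpus": 1,
    "num_gpus": 0,
}


def fill_defaults(meta_data):
    """Hypothetical helper: user-supplied keys win, missing keys get the default."""
    return {**DEFAULT_META, **meta_data}


full_meta = fill_defaults({"purpose": "Grid search", "num_total_jobs": 10})
```

Here `full_meta["purpose"]` is `"Grid search"` while `full_meta["exec_resource"]` keeps the default `"local"`.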

Additionally, you can synchronize the protocol with a Google Cloud Storage (GCS) bucket by providing `cloud_settings`. In this case, the results stored in `experiment_dir` are also uploaded to the GCS bucket when you call `protocol_db.complete()`.

# Define GCS settings - requires 'GOOGLE_APPLICATION_CREDENTIALS' env var.
cloud_settings = {
    "project_name": "mle-toolbox",  # GCP project name
    "bucket_name": "mle-protocol",  # GCS bucket name
    "use_protocol_sync": True,  # Whether to sync the protocol to GCS
    "use_results_storage": True,  # Whether to sync experiment_dir to GCS
}
protocol_db = MLEProtocol("mle_protocol.db", cloud_settings, verbose=True)
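Since the GCS sync depends on the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, it can be useful to check for it before enabling the sync. The following is a minimal sketch of such a check. The helper `usable_cloud_settings` is a hypothetical name and is not part of mle-monitor.

```python
import os


def usable_cloud_settings(cloud_settings, environ=os.environ):
    """Hypothetical helper: return the settings only if credentials are configured."""
    cred_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    # The variable should point to an existing service account key file
    if cred_path and os.path.isfile(cred_path):
        return cloud_settings
    # No usable credentials: signal that the protocol should stay local
    return None
```

The return value could then be passed in place of `cloud_settings`, assuming `MLEProtocol` treats `None` as "no cloud sync" (an assumption that the text above does not confirm).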

The `MLEResource`: Keeping Track of Your Resources 📉

On Your Local Machine

from mle_monitor import MLEResource

# Instantiate local resource and get usage data
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()
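The structure of `resource_data` is not described here. As a rough picture of the kind of numbers a local monitor gathers, the sketch below collects a few of them with the standard library only. It is not how `MLEResource` obtains its data, and the function name `local_snapshot` and the dictionary keys are made up for this illustration.

```python
import os
import shutil


def local_snapshot(path="."):
    """Collect a few usage numbers for the local machine (illustrative only)."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "num_cpus": os.cpu_count(),
        "disk_total_gb": disk.total / 1e9,
        "disk_used_gb": disk.used / 1e9,
    }
    # os.getloadavg only exists on Unix-like systems and may fail there too
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1min"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot
```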

On a Slurm Cluster

resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)

On a Grid Engine Cluster

resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)

The `MLEDashboard`: Dashboard Visualization 🎞️

from mle_monitor import MLEDashboard

# Instantiate dashboard with protocol and resource
dashboard = MLEDashboard(protocol_db, resource)

# Get a static snapshot of the protocol & resource utilisation printed in console
dashboard.snapshot()

# Run monitoring in a while loop - live dashboard
dashboard.live()
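A live view has to re-query the scheduler repeatedly, and Slurm's documentation advises against calling `squeue` from tight loops. One common way to stay on the safe side is to cache the result and refresh it only after a minimum interval. The sketch below illustrates that pattern. It does not describe how `MLEDashboard` polls the cluster, and the names `ThrottledQuery` and `count_job_states` are hypothetical.

```python
import subprocess
import time


class ThrottledQuery:
    """Cache a query result and refresh it at most once every `min_interval` seconds."""

    def __init__(self, fetch, min_interval=30.0, clock=time.monotonic):
        self.fetch = fetch
        self.min_interval = min_interval
        self.clock = clock
        self._last_time = None
        self._last_result = None

    def get(self):
        now = self.clock()
        # Only hit the scheduler if nothing is cached or the cache is too old
        if self._last_time is None or now - self._last_time >= self.min_interval:
            self._last_result = self.fetch()
            self._last_time = now
        return self._last_result


def count_job_states():
    """Call squeue once and count jobs per state (requires a Slurm installation)."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%T"], capture_output=True, text=True, check=True
    ).stdout
    counts = {}
    for state in out.split():
        counts[state] = counts.get(state, 0) + 1
    return counts


# All callers share one cached result instead of each triggering a new call
job_states = ThrottledQuery(count_job_states, min_interval=30.0)
```

A render loop can then call `job_states.get()` as often as it likes; `squeue` itself runs at most once per 30 seconds. The interval is an arbitrary example value and should follow your cluster's guidelines.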

Installation ⏳

A PyPI installation is available via:

pip install mle-monitor

Alternatively, you can clone this repository and afterwards 'manually' install it:

git clone https://github.com/mle-infrastructure/mle-monitor.git
cd mle-monitor
pip install -e .

Development & Milestones for Next Release

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗.


Comments

Is the dashboard polling squeue?

Hey, Thanks for publishing the library, the dashboard looks great!

However, I was a bit concerned to see you are using squeue since the official documentation says

"Executing squeue sends a remote procedure call to slurmctld. If enough calls from squeue or other Slurm client commands that send remote procedure calls to the slurmctld daemon come in at once, it can result in a degradation of performance of the slurmctld daemon, possibly resulting in a denial of service.

Do not run squeue or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather."

Do you poll squeue or is there some other, smarter management of it that I missed?

Thanks, Eliahu

opened by eliahuhorwitz

Releases(v0.0.1)

v0.0.1(Dec 9, 2021)

Basic API for MLEProtocol, MLEResource & MLEDashboard.
