Catalogue data - Python scripts to prepare catalogue data

Last update: Mar 03, 2022

Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) at Hugging Face Hub: https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
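
Before running the scripts, it can help to confirm that every variable in the .env file has a value. The following is a minimal sketch using only the standard library; the helper names read_env_file and missing_keys are hypothetical and not part of this repo, and the parser assumes plain KEY=VALUE lines without quoting.

```python
from pathlib import Path

# The four variables named in the setup instructions.
REQUIRED_KEYS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def read_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(read_env_file(env_path))
        print("Missing values:", missing or "none")
    else:
        print("No .env file found in the current directory.")
```

Run it from the root directory of the repo; any key it reports still needs a value in .env.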

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with your repository ID, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict with dataset names (keys) and their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: directory path to save the aggregated dataset
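
As an illustration of the ratios file described above, here is a minimal sketch that validates a dict of dataset names and ratios and writes it as JSON. The helper names, the output file name dataset_ratios.json, and the second dataset name are hypothetical; how aggregate_datasets.py interprets each ratio is defined by the script itself and is not assumed here.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that ratios is a dict of dataset name -> number between 0 and 1."""
    if not isinstance(ratios, dict):
        raise ValueError("expected a dict mapping dataset names to ratios")
    for name, ratio in ratios.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise ValueError(f"ratio for {name!r} is not a number")
        if not 0 <= ratio <= 1:
            raise ValueError(f"ratio for {name!r} must be between 0 and 1, got {ratio}")
    return ratios


def write_ratios_file(ratios, path):
    """Validate the ratios and save them as a JSON object at the given path."""
    validate_ratios(ratios)
    Path(path).write_text(json.dumps(ratios, indent=2), encoding="utf-8")


if __name__ == "__main__":
    ratios = {
        "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
        "example-org/example_dataset": 0.25,  # hypothetical dataset name
    }
    # Print the JSON that write_ratios_file(ratios, "dataset_ratios.json") would save.
    print(json.dumps(validate_ratios(ratios), indent=2))
```

The path of the saved file is what would then be passed to --dataset_ratios_path.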


Owner

BigScience Workshop
