OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated Circuit Synthesis

Related tags

Deep LearningOpenABC
Overview

OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated Circuit Synthesis

Version License

Overview

OpenABC-D is a large-scale labeled dataset generated by synthesizing open source hardware IPs using state-of-art logic synthesis tool yosys-abc. We consider 29 open-source hardware IP designs collected from various sources (MIT-CEP, IWLS, OpenROAD, OpenPiton etc) and synthesized them with 1500 random synthesis flows (we call them synthesis recipes).

Each synthesis flow has a predefined length L (L=20, in our case). We preserved all AIGs: starting, intermediate and final AIGs with labels like number of nodes, longest path, sequence of atomic synthesis transofrmations (rewrite, refactor, balance etc.) along with graph statistics, area and delay of final AIG.

We converted the AIGs in pytorch data format that can be directly used by a machine learning engineer lessening the effort of costly labeled data generation and pre-processing. OpenABC-D can be used for a variety of learning tasks on logic synthesis such as

  1. Predicting quality of result (QoR) performance of a synthesis recipe on a hardware IP.
  2. Area and delay prediction post techonolgy mapping.
  3. Learn functional and structural features of AIG using self-supervised labels (useful for tasks like RL-based logic synthesis)

Our dataset can easily be used for graph-based machine learning framework like Pytorch-Geometric. The data generation pipeline of OpenABC-D is shown as follows:

Installing dependencies

We recommend using venv or Anaconda environment to install pre-requisites packages for running our framework and models. We list down the packages which we used on our side for experimentations. We recommend installing the packages using requirements.txt file provided in our repository.

  • cudatoolkit = 10.1
  • numpy >= 1.20.1
  • pandas >= 1.2.2
  • pickleshare >= 0.7.5
  • python >=3.9
  • pytorch = 1.8.1
  • scikit-learn = 0.24.1
  • torch-geometric=1.7.0
  • tqdm >= 4.56
  • seaborn >= 0.11.1
  • networkx >= 2.5
  • joblib >= 1.1.0

Here are few resources to install the packages (if not using requirements.txt)

Make sure that that the cudatoolkit version in the gpu matches with the pytorch-geometric's (and dependencies) CUDA version.

Organisation

Dataset directory structure

├── OPENABC_DATASET
│   ├── bench			# Original and synthesized bench files. Log reports post technology mapping
│   ├── graphml                # Graphml files
│   ├── lib			# Nangate 15nm technology library
│   ├── ptdata			# pytorch-geometric compatible data
│   ├── statistics		# Area, delay, number of nodes, depth of final AIGs for all designs
│   └── synScripts		# 1500 synthesis scripts customized for each design
  1. In bench directory, each design has a subfolder containing original bench file: design_orig.bench, a log folder containing log runs of 1500 synthesis recipes, and syn.zip file containing all bench files synthesized with synthesis recipe N.

  2. In graphml directory, each design has subfolder containing zipped graphml files corresponding to the bench files created for each synthesis runs.

  3. In lib directory, Nangate15nm.lib file is present. This is used for technology mapping post logic minimization.

  4. In ptdata directory, we have subfolders for each design having zipped pytorch file of the format designIP_synthesisID_stepID.pt. Also, we kept train-test split csv files for each learning tasks in subdirectories with naming convention lr_ID.

  5. In statistics diretcory, we have two subfolders: adp and finalAig. In adp, we have csv files for all designs with information about area and delay of final AIG post tech-mapping. In finalAig, csv files have information about graph characteristics of final AIGs obtained post optimization. Also, there is another file named synthesisstastistics.pickle which have all the above information in dictionary format. This file is used for labelling purpose in ML pipeline for various tasks.

  6. In synScripts directory, we have subfolders of each design having 1500 synthesis scripts.

Data generation

├── datagen
│   ├── automation 			      # Scripts for automation (Bulk/parallel runs for synthesis, AIG2Graph conversions etc.)
│   │   ├── automate_bulkSynthesis.py         # Shell script for each design to perform 1500 synthesis runs
│   │   ├── automate_finalDataCollection.py   # Script file to collect graph statistics, area and delay of final AIG
│   │   ├── automate_synbench2Graphml.py      # Shell script file generation to involking andAIG2Graphml.py
│   │   └── automate_synthesisScriptGen.py    # Script to generate 1500 synthesis script customized for each design
│   └── utilities
│       ├── andAIG2Graphml.py		      # Python utility to convert AIG BENCH file to graphml format
│       ├── collectAreaAndDelay.py            # Python utility to parse log and collect area and delay numbers
│       ├── collectGraphStatistics.py         # Python utility to for computing final AIG statistics
│       ├── pickleStatsForML.py               # Pickled file containing labels of all designs (to be used to assign labels in ML pipeline)
│       ├── PyGDataAIG.py		      # Python utility to convert synthesized graphml files to pytorch data format
│       └── synthID2SeqMapping.py	      # Python utility to annotate synthesis recipe using numerical encoding and dump in pickle form
  1. automation directory contains python scripts for automating bulk data generation (e.g. synthesis runs, graphml conversion, pytorch data generation etc.). utilities folder have utility scripts performing various tasks and called from automation scripts.

  2. Step 1: Run automate_synthesisScriptGen.py to generate customized synthesis script for 1500 synthesis recipes. One can see the template of a synthesis recipe in referenceDir.

  3. Step 2: Run automate_bultkSynthesis.py to generate a shell script for a design. Run the shell script to perform the synthesis runs. Make sure yosys-abc is available in PATH.

  4. Step 3: Run automate_synbench2Graphml.py to generate a shell script for generating graphml files. The shell script invokes andAIG2Graphml.py using 21 parallel threads processing data of each synthesis runs in sequence.

  5. Step 4: Run PyGDataAIG.py to generate pytorch data for each graphml file of the format designIP_synthesisID_stepID.pt.

  6. Step 5: Run collectAreaAndDelay.py and collectGraphStatistics.py to collect information about final AIG's statistics. Post that, run pickleStatsForML.py which will output synthesisStatistics.pickle file.

  7. Step 6: Run synthID2SeqMapping.py utility to generate synthID2Vec.pickle file containing numerically encoded data of synthesis recipes.

Benchmarking models: Training and evaluation

├── models
│   ├── classification
│   │   └── ClassNetV1
│   │       ├── model.py			# Graph convolution network based architecture model
│   │       ├── netlistDataset.py		# Dataset loader
│   │       ├── train.py			# Train and evaluation utility
│   │       └── utils.py			# Utitility functions
│   └── qor
│       ├── NetV1
│       │   ├── evaluate.py
│       │   ├── model.py
│       │   ├── netlistDataset.py
│       │   ├── train.py
│       │   └── utils.py
│       ├── NetV2
│       │   ├── evaluate.py
│       │   ├── model.py
│       │   ├── netlistDataset.py
│       │   ├── train.py
│       │   └── utils.py
│       └── NetV3
│           ├── evaluate.py
│           ├── model.py
│           ├── netlistDataset.py
│           ├── train.py
│           └── utils.py

models directory contains the benchmarked model described in details in our paper. The names of the python utilities are self explainatory.

Case 1: Prediction QoR of a synthesis recipe

We recommend creating a following folder hierarchy before training/evaluating a model using our dataset and model codes:

├── OPENABC-D
│   ├── lp1
│   │   ├── test_data_set1.csv
│   │   ├── test_data_set2.csv
│   │   ├── test_data_set3.csv
│   │   ├── train_data_set1.csv
│   │   ├── train_data_set2.csv
│   │   └── train_data_set3.csv
│   ├── lp2
│   │   ├── test_data_set1.csv
│   │   └── train_data_set1.csv
│   ├── processed
│   ├── synthesisStatistics.pickle
│   └── synthID2Vec.pickle

OPENABC-D is the top level directory containing the dataset, train-test split files, and labeled data available. Transfer all the relevant zipped pytorch data in the subdirectory processed.

The user can now go the models directory and run codes for training and evaluation. An example run for dataset split strategy 1 (Train on first 1000 synthesis recipe, predict QoR of next 500 recipe)

python train.py --datadir $HOME/OPENABC-D --rundir $HOMEDIR/NETV1_set1 --dataset set1 --lp 1 --lr 0.001 --epochs 60 --batch-size 32

Setting lp=1 and dataset=set1 will pick appropriate train-test split strategy dataset for QoR regression problem. The model will run for 60 epochs and report the training, validation and test performance on the dataset outputing appropriate plots.

Similarly for split-strategy 2 and 3, one can set the dataset as set2 and set3 respectively.

For evaluating performance of specific model on a custom curated dataset, a user can create appropriate csv file with dataset instances and add it to dictionary entry in train.py. For evaluating existing dataset split, one can run the following code.

python evaluate.py --datadir $HOME/OPENABC-D --rundir $HOMEDIR/NETV1_set1 --dataset set1 --lp 1 --model "gcn-epoch20-loss-0.813.pt" --batch-size 32

The test-MSE performance we obtained on our side are as follows:

Net Type Case-I Case-II Case-III
NetV1 0.648+-0.05 10.59+-2.78 0.588+-0.04
NetV2 0.815+-0.02 1.236+-0.15 0.538+-0.01
NetV3 0.579+-0.02 1.470+-0.14 0.536+-0.03
You might also like...
Large scale embeddings on a single machine.

Marius Marius is a system under active development for training embeddings for large-scale graphs on a single machine. Training on large scale graphs

A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.
A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

About This repository provides data and code for the paper: Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (subm

A large dataset of 100k Google Satellite and matching Map images, resembling pix2pix's Google Maps dataset.
A large dataset of 100k Google Satellite and matching Map images, resembling pix2pix's Google Maps dataset.

Larger Google Sat2Map dataset This dataset extends the aerial ⟷ Maps dataset used in pix2pix (Isola et al., CVPR17). The provide script download_sat2m

Exemplo de implementação do padrão circuit breaker em python
Exemplo de implementação do padrão circuit breaker em python

fast-circuit-breaker Circuit breakers existem para permitir que uma parte do seu sistema falhe sem destruir todo seu ecossistema de serviços. Michael

Multi-Scale Geometric Consistency Guided Multi-View Stereo

ACMM [News] The code for ACMH is released!!! [News] The code for ACMP is released!!! About ACMM is a multi-scale geometric consistency guided multi-vi

City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Code

City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Requirements Python 3.8 or later with all requirements.txt dependencies installed,

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

The SLIDE package contains the source code for reproducing the main experiments in this paper. Dataset The Datasets can be downloaded in Amazon-

DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021)
DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021)

DeepLM DeepLM: Large-scale Nonlinear Least Squares on Deep Learning Frameworks using Stochastic Domain Decomposition (CVPR 2021) Run Please install th

Comments
  • Dataset for QoR prediction

    Dataset for QoR prediction

    Hi,

    We want to use your dataset to do QoR predictions. We followed your README.md file and downloaded the 19GB dataset for ML. We extracted the .pt files and read them, but are struggling to make sense of the numbers. The excel image shows the data we extracted from synthesisStatistics.pickle. Can you tell us what the numbers for each design indicate?

    image

    The snippet below shows the data we got from the 'ac97_ctrl_syn706_step0.pt'. This data is below: {'and_nodes': tensor(11464), 'desName': ['ac97_ctrl'], 'edge_index': tensor([[ 2339, 2340, 2340, ..., 15938, 15938, 15939], [ 2338, 2, 3, ..., 2336, 15932, 15938]]), 'edge_type': tensor([1, 0, 1, ..., 1, 0, 0]), 'longest_path': tensor(11), 'node_id': ['ys__n0', 'ys__n1', 'ys__n3', 'ys__n4', 'ys__n7', ...], 'node_type': tensor([0, 0, 0, ..., 1, 2, 1]), 'not_edges': tensor(14326), 'num_inverted_predecessors': tensor([0, 0, 0, ..., 1, 1, 0]), 'pi': tensor(2339), 'po': tensor(2137), 'stepID': [0], 'synID': [706], 'synVec': tensor([1, 4, 5, 6, 0, 3, 0, 1, 6, 1, 1, 4, 6, 1, 1, 3, 0, 1, 0, 5])} Can you explain how do we get the timing numbers from the above?

    We were wondering if we need to download the 1.4TB zipped files to generate the ML training and testing datasets, or is there a better way to go about this? Can you guide us on how to use the useful datapoints from the 1.4TB dataset without downloading it all? We are students of UC San Diego and trying to maximize our resource utilization.

    Thank you!

    opened by ankursharma129 4
  • Zip file of initial AIG

    Zip file of initial AIG

    The origin full dataset is too huge, we have trouble to download it. Since, we only want the the initial AIG file. We want to know, is there any zip file/link that only have the initial AIG files? Thanks.

    opened by lirui-shanghaitech 2
Releases(v1.0)
Owner
NYU Machine-Learning guided Design Automation (MLDA)
Machine-learning aided Chip design
NYU Machine-Learning guided Design Automation (MLDA)
Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains

Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains This is an accompanying repository to the ICAIL 2021 pap

4 Dec 16, 2021
This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm and CNN.

Vietnamese sign lagnuage recognition using MHI and CNN This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm

Phat Pham 3 Feb 24, 2022
PyTorch implementation of the paper Dynamic Data Augmentation with Gating Networks

Dynamic Data Augmentation with Gating Networks This is an official PyTorch implementation of the paper Dynamic Data Augmentation with Gating Networks

九州大学 ヒューマンインタフェース研究室 3 Oct 26, 2022
This is the research repository for Vid2Doppler: Synthesizing Doppler Radar Data from Videos for Training Privacy-Preserving Activity Recognition.

Vid2Doppler: Synthesizing Doppler Radar Data from Videos for Training Privacy-Preserving Activity Recognition This is the research repository for Vid2

Future Interfaces Group (CMU) 26 Dec 24, 2022
Making a music video with Wav2CLIP and VQGAN-CLIP

music2video Overview A repo for making a music video with Wav2CLIP and VQGAN-CLIP. The base code was derived from VQGAN-CLIP The CLIP embedding for au

Joel Jang | 장요엘 163 Dec 26, 2022
PyTorch implementation of paper: AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer, ICCV 2021.

AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer [Paper] [PyTorch Implementation] [Paddle Implementation] Overview This reposit

148 Dec 30, 2022
HMLLDB is a collection of LLDB commands to assist in the debugging of iOS apps.

HMLLDB is a collection of LLDB commands to assist in the debugging of iOS apps. 中文介绍 Features Non-intrusive. Your iOS project does not need to be modi

mao2020 47 Oct 22, 2022
Efficient Two-Step Networks for Temporal Action Segmentation (Neurocomputing 2021)

Efficient Two-Step Networks for Temporal Action Segmentation This repository provides a PyTorch implementation of the paper Efficient Two-Step Network

8 Apr 16, 2022
This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"

Differentiable Volumetric Rendering Paper | Supplementary | Spotlight Video | Blog Entry | Presentation | Interactive Slides | Project Page This repos

697 Jan 06, 2023
An addon uses SMPL's poses and global translation to drive cartoon character in Blender.

Blender addon for driving character The addon drives the cartoon character by passing SMPL's poses and global translation into model's armature in Ble

犹在镜中 153 Dec 14, 2022
a dnn ai project to classify which food people are eating on audio recordings

Deep Learning - EAT Challenge About This project is part of an AI challenge of the DeepLearning course 2021 at the University of Augsburg. The objecti

Marco Tröster 1 Oct 24, 2021
DLL: Direct Lidar Localization

DLL: Direct Lidar Localization Summary This package presents DLL, a direct map-based localization technique using 3D LIDAR for its application to aeri

Service Robotics Lab 127 Dec 16, 2022
The official implementation of "Rethink Dilated Convolution for Real-time Semantic Segmentation"

RegSeg The official implementation of "Rethink Dilated Convolution for Real-time Semantic Segmentation" Paper: arxiv D block Decoder Setup Install the

Roland 61 Dec 27, 2022
Accelerated deep learning R&D

Accelerated deep learning R&D PyTorch framework for Deep Learning research and development. It focuses on reproducibility, rapid experimentation, and

Catalyst-Team 3.1k Jan 06, 2023
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis (CVPR2022)

Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis Multi-View Consistent Generative Adversarial Networks for 3D-aware

Xuanmeng Zhang 78 Dec 10, 2022
A visualization tool to show a TensorFlow's graph like TensorBoard

tfgraphviz tfgraphviz is a module to visualize a TensorFlow's data flow graph like TensorBoard using Graphviz. tfgraphviz enables to provide a visuali

44 Nov 09, 2022
Using Clinical Drug Representations for Improving Mortality and Length of Stay Predictions

Using Clinical Drug Representations for Improving Mortality and Length of Stay Predictions Usage Clone the code to local. https://github.com/tanlab/MI

Computational Biology and Machine Learning lab @ TOBB ETU 3 Oct 18, 2022
In this repo we reproduce and extend results of Learning in High Dimension Always Amounts to Extrapolation by Balestriero et al. 2021

In this repo we reproduce and extend results of Learning in High Dimension Always Amounts to Extrapolation by Balestriero et al. 2021. Balestriero et

Sean M. Hendryx 1 Jan 27, 2022
Tensorflow Implementation of SMU: SMOOTH ACTIVATION FUNCTION FOR DEEP NETWORKS USING SMOOTHING MAXIMUM TECHNIQUE

SMU A Tensorflow Implementation of SMU: SMOOTH ACTIVATION FUNCTION FOR DEEP NETWORKS USING SMOOTHING MAXIMUM TECHNIQUE arXiv https://arxiv.org/abs/211

Fuhang 5 Jan 18, 2022
Search and filter videos based on objects that appear in them using convolutional neural networks

Thingscoop: Utility for searching and filtering videos based on their content Description Thingscoop is a command-line utility for analyzing videos se

Anastasis Germanidis 354 Dec 04, 2022