Several simple examples for popular neural network toolkits calling custom CUDA operators.

Last update: Jan 01, 2023

Overview

Neural Network CUDA Example

Several simple examples for neural network toolkits (PyTorch, TensorFlow, etc.) calling custom CUDA operators.

The CUDA kernels and their C++ wrappers can be compiled in three ways: JIT, setuptools, and CMake.

Python scripts that call the CUDA kernels are also provided, covering kernel timing and model training.

For more accurate timings, run the code under a profiler such as nvprof or nsys.
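
When a profiler is not at hand, a host-side timer can still give a rough estimate. CUDA kernel launches are asynchronous, so the device has to be synchronized before each clock read. The sketch below shows that pattern with warm-up runs. The helper name time_call and its parameters are illustrative and are not taken from this repository; with PyTorch one would pass torch.cuda.synchronize as the sync argument.

```python
import time


def time_call(fn, sync=lambda: None, warmup=3, repeat=10):
    """Return a list of per-call wall-clock times in microseconds.

    fn   -- zero-argument callable that launches the work to be timed
    sync -- zero-argument callable that blocks until the device is idle
            (assumption: torch.cuda.synchronize when timing CUDA kernels)
    """
    # Warm-up runs absorb one-time costs such as context creation.
    for _ in range(warmup):
        fn()

    times = []
    for _ in range(repeat):
        sync()  # make sure earlier work has finished
        start = time.perf_counter()
        fn()
        sync()  # wait for the launched work itself
        times.append((time.perf_counter() - start) * 1e6)
    return times


# CPU-only usage example; no GPU is needed to try the helper.
cpu_times = time_call(lambda: sum(range(1000)))
print("mean time (us):", sum(cpu_times) / len(cpu_times))
```

Host-side timing includes launch overhead, so the numbers will generally be larger than the kernel durations reported by nvprof or nsys.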

Environments

NVIDIA Driver: 418.116.00
CUDA: 11.0
Python: 3.7.3
PyTorch: 1.7.0+cu110
TensorFlow: 2.4.1
CMake: 3.16.3
Ninja: 1.10.0
GCC: 8.3.0

Other environments are untested and may not work.

Code structure

├── include
│   └── add2.h # header file of add2 cuda kernel
├── kernel
│   └── add2_kernel.cu # add2 cuda kernel
├── pytorch
│   ├── add2_ops.cpp # torch wrapper of add2 cuda kernel
│   ├── time.py # time comparison of cuda kernel and torch
│   ├── train.py # training using custom cuda kernel
│   ├── setup.py
│   └── CMakeLists.txt
├── tensorflow
│   ├── add2_ops.cpp # tensorflow wrapper of add2 cuda kernel
│   ├── time.py # time comparison of cuda kernel and tensorflow
│   ├── train.py # training using custom cuda kernel
│   └── CMakeLists.txt
├── LICENSE
└── README.md

PyTorch

Compile cpp and cuda

JIT
No separate build step is needed. Run the Python scripts directly and the extension is compiled on first use.
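
For orientation, JIT compilation in PyTorch typically goes through torch.utils.cpp_extension.load. The sketch below assumes the file layout from the code structure above, a module name of add2, and a C++ wrapper that defines a pybind11 module. It is not copied from the repository's scripts, and the helper names are hypothetical.

```python
import os


def add2_sources(repo_root="."):
    """Source files of the add2 operator, per the code structure above."""
    return [
        os.path.join(repo_root, "pytorch", "add2_ops.cpp"),
        os.path.join(repo_root, "kernel", "add2_kernel.cu"),
    ]


def load_add2_jit(repo_root="."):
    """Compile the extension on first call and return it as a module."""
    # Imported lazily so this file can be imported without PyTorch installed.
    from torch.utils.cpp_extension import load

    return load(
        name="add2",  # assumed module name
        sources=add2_sources(repo_root),
        extra_include_paths=[os.path.join(repo_root, "include")],
        verbose=True,  # print the compiler commands as they run
    )
```

The first call to load_add2_jit compiles the sources and caches the result, so later runs start faster as long as the sources are unchanged.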

Setuptools

python3 pytorch/setup.py install
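
A setup script for a CUDA extension usually combines CUDAExtension and BuildExtension from torch.utils.cpp_extension. The following is a minimal sketch under stated assumptions, not the repository's actual setup.py: the package name add2 and the relative paths are inferred from the code structure above, and the paths assume the command is run from the repository root.

```python
def add2_extension_spec():
    """Name, sources and include directories of the extension (assumed)."""
    return {
        "name": "add2",
        "sources": ["pytorch/add2_ops.cpp", "kernel/add2_kernel.cu"],
        "include_dirs": ["include"],
    }


def run_setup():
    """Build and install the extension with setuptools."""
    # Imported lazily so the spec above can be inspected without PyTorch.
    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    spec = add2_extension_spec()
    setup(
        name=spec["name"],
        ext_modules=[
            CUDAExtension(
                spec["name"],
                spec["sources"],
                include_dirs=spec["include_dirs"],
            )
        ],
        # BuildExtension routes .cu files to nvcc and adds the PyTorch flags.
        cmdclass={"build_ext": BuildExtension},
    )


# In a real setup.py, call run_setup() at module level.
```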

CMake

mkdir build
cd build
cmake ../pytorch
make
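
A shared library built this way can be loaded from Python with torch.ops.load_library, provided the C++ wrapper registers its operators with the PyTorch dispatcher. In the sketch below, the library file name libadd2.so and the operator namespace add2 are assumptions. Check the build directory and the wrapper source for the real names.

```python
import os


def cmake_library_path(build_dir="build", target="add2"):
    """Path of the shared library produced by make (assumed naming)."""
    return os.path.join(build_dir, "lib" + target + ".so")


def load_add2_cmake(build_dir="build"):
    """Load the CMake-built library and return its operator namespace."""
    # Imported lazily so this file can be imported without PyTorch installed.
    import torch

    torch.ops.load_library(cmake_library_path(build_dir))
    # Assumption: the wrapper registers its operators under the "add2" namespace.
    return torch.ops.add2
```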

Run python

Compare kernel running time

python3 pytorch/time.py --compiler jit
python3 pytorch/time.py --compiler setup
python3 pytorch/time.py --compiler cmake
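
The --compiler flag selects which of the three builds a script uses. A command-line parser for such a flag could look like the sketch below. The default value jit and the help text are assumptions, not taken from the repository's scripts.

```python
import argparse

# The three build methods described above.
COMPILERS = ("jit", "setup", "cmake")


def parse_args(argv=None):
    """Parse the --compiler flag; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Run the add2 example with a chosen build method."
    )
    parser.add_argument(
        "--compiler",
        choices=COMPILERS,
        default="jit",  # assumed default
        help="how the CUDA extension was (or will be) compiled",
    )
    return parser.parse_args(argv)


args = parse_args(["--compiler", "cmake"])
print("selected build method:", args.compiler)
```

Restricting the flag with choices makes a typo fail at startup rather than after the model has been built.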

Train model

python3 pytorch/train.py --compiler jit
python3 pytorch/train.py --compiler setup
python3 pytorch/train.py --compiler cmake

TensorFlow

Compile cpp and cuda

CMake

mkdir build
cd build
cmake ../tensorflow
make
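
On the TensorFlow side, a compiled operator library is normally loaded with tf.load_op_library, which returns a Python module exposing the registered operators. The library name libadd2.so in this sketch is an assumption. The actual name depends on the target defined in tensorflow/CMakeLists.txt.

```python
import os


def add2_library_path(build_dir="build"):
    """Path of the TensorFlow operator library (assumed file name)."""
    return os.path.join(build_dir, "libadd2.so")


def load_add2_tf(build_dir="build"):
    """Load the operator library and return the generated Python module."""
    # Imported lazily so this file can be imported without TensorFlow installed.
    import tensorflow as tf

    return tf.load_op_library(add2_library_path(build_dir))
```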

Run python

Compare kernel running time

python3 tensorflow/time.py --compiler cmake

Train model

python3 tensorflow/train.py --compiler cmake

Implementation details (in Chinese)

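Tutorial on custom CUDA operators in PyTorch, with running-time analysis (PyTorch自定义CUDA算子教程与运行时间分析)
Three ways to compile and call custom CUDA operators in PyTorch, explained in detail (详解PyTorch编译并调用自定义CUDA算子的三种方式)
Custom backpropagation in PyTorch in three minutes (三分钟教你如何PyTorch自定义反向传播)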

F.A.Q

Q. ImportError: libc10.so: cannot open shared object file: No such file or directory

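A. Import torch before importing add2. The extension links against PyTorch's shared libraries, such as libc10.so, which are loaded into the process when torch is imported.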

Owner

WeiYang
