AI and Machine Learning workflows on Anthos Bare Metal.

Overview

Hybrid and Sovereign AI on Anthos Bare Metal

Overview

This repository covers AI and machine learning workflows using TensorFlow on Anthos Bare Metal (ABM). TensorFlow is one of the most popular ML frameworks in use today (10M+ downloads per month), but it presents many challenges in setup (GPUs, CUDA drivers, TF Serving, etc.), performance tuning, cluster provisioning, maintenance, and model serving. This work showcases easy-to-use guides for ML model serving, training, infrastructure, ML notebooks, and more on Anthos Bare Metal.

Terraform as IaC Substrate

Terraform is an open-source infrastructure-as-code (IaC) tool, and one of the ways in which enterprise IT teams create, manage, and update infrastructure resources such as physical machines, VMs, switches, containers, and more. Provisioning the hardware or resources is always the first step, so these guides use Terraform as a common substrate to create the infrastructure for AI/ML apps. Check out our upstream contribution to the Google Terraform Provider for GPU support in the instance_template module.

Serving TensorFlow ResNet Model on ABM

In this walkthrough you'll see how to create an end-to-end TensorFlow ResNet model serving installation on ABM using Google Compute Engine (GCE). Once the setup is complete, you'll be able to send image classification requests from a gRPC client to the ABM ML serving cluster.

Requirements

  • Google Cloud Platform access, with the gcloud SDK installed
  • Service account JSON key file
  • Terraform, Git, and a container image
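
As a quick pre-flight check for the requirements above, the following sketch reports which command-line tools are not on PATH. The tool list (gcloud, terraform, git, docker) is an assumption drawn from the commands used later in this guide, and missing_tools is a hypothetical helper name. Terraform is only needed on the demo host, so it may legitimately be missing on your workstation.

```shell
# Print the name of every listed tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any SDK.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed tool list, based on the commands used in this guide.
missing=$(missing_tools gcloud terraform git docker)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```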

ResNet SavedModel Image on GCR

Let's create a local directory and download the Deep residual network (ResNet) model.

rm -rf /tmp/resnet
mkdir /tmp/resnet
curl -s http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz

Verify the SavedModel

ls /tmp/resnet/*
saved_model.pb variables
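
The same check can be scripted. This is a minimal sketch, assuming the archive unpacks into a single version directory under /tmp/resnet that holds saved_model.pb and a variables directory, as the listing above suggests. has_savedmodel is a hypothetical helper name.

```shell
# Return 0 if some version directory under $1 contains a SavedModel
# (a saved_model.pb file next to a variables/ directory).
has_savedmodel() {
  local model_dir=$1 version_dir
  for version_dir in "$model_dir"/*/; do
    if [ -f "${version_dir}saved_model.pb" ] && [ -d "${version_dir}variables" ]; then
      return 0
    fi
  done
  return 1
}

if has_savedmodel /tmp/resnet; then
  echo "SavedModel found under /tmp/resnet"
else
  echo "No SavedModel under /tmp/resnet" >&2
fi
```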

Now we will commit the ResNet serving docker image:

docker run -d --name serving_base tensorflow/serving
docker cp /tmp/resnet serving_base:/models/resnet
docker commit --change "ENV MODEL_NAME resnet" serving_base $USER/resnet_serving
docker kill serving_base
docker rm serving_base

Tag the local Docker image and push it to gcr.io (GCP_PROJECT must be set to your project ID):

export GCR_IMAGE_PATH="gcr.io/$GCP_PROJECT/abm_serving/resnet"
docker tag $USER/resnet_serving $GCR_IMAGE_PATH
docker push $GCR_IMAGE_PATH
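
An empty project ID silently produces an invalid image path, so it can be worth guarding against. The sketch below builds the same path as above and fails if the project ID is empty. gcr_image_path is a hypothetical helper and my-project is a placeholder.

```shell
# Build the GCR image path used in this guide; fail if the project ID is empty.
gcr_image_path() {
  local project=$1
  if [ -z "$project" ]; then
    echo "GCP project ID is not set" >&2
    return 1
  fi
  printf 'gcr.io/%s/abm_serving/resnet\n' "$project"
}

# my-project is a placeholder used only when GCP_PROJECT is unset.
gcr_image_path "${GCP_PROJECT:-my-project}"
```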

ABM GCE Cluster using Terraform

Create a GCE demo host and perform a few steps to set it up:

export SERVICE_ACCOUNT_FILE=<FILE_LOCATION>

export DEMO_HOST="abm-demo-host-live"
gcloud compute instances create $DEMO_HOST --zone=us-central1-a
gcloud compute scp $SERVICE_ACCOUNT_FILE $USER@$DEMO_HOST:
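
Before copying the key to the demo host, it can help to confirm that the file really is a service account key. This sketch relies on the fact that downloaded service account key files contain a "type": "service_account" field. is_service_account_key is a hypothetical helper name.

```shell
# Return 0 if the file exists and declares "type": "service_account".
is_service_account_key() {
  local key_file=$1
  [ -f "$key_file" ] || return 1
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"
}

if is_service_account_key "${SERVICE_ACCOUNT_FILE:-}"; then
  echo "Key file looks like a service account key"
else
  echo "Not a service account key: ${SERVICE_ACCOUNT_FILE:-<unset>}" >&2
fi
```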

SSH into the demo host and follow the steps below:

gcloud compute ssh $DEMO_HOST --zone=us-central1-a

# Activate the service account (SERVICE_ACCOUNT_FILE must point to the key file copied to this host)
gcloud auth activate-service-account --key-file=$SERVICE_ACCOUNT_FILE

# Install Git
sudo apt-get install git

# Terraform version to install (v0.14.10)
export TERRAFORM_VERSION="0.14.10"

List the current Anthos/GKE clusters using hub memberships, so that you can compare the existing clusters with the newly created ones later.

# List Anthos BM clusters
gcloud container hub memberships list

Install Terraform, then make a few minor changes to the configuration files:

# Remove any previous versions. You can skip if this is a new instance
sudo apt remove terraform

sudo apt-get install software-properties-common

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform=$TERRAFORM_VERSION

terraform -version
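
To confirm that the pinned version was actually installed, the version output can be compared against TERRAFORM_VERSION. This is a sketch that assumes the first line of the output has the form "Terraform v0.14.10". terraform_version_matches is a hypothetical helper name.

```shell
# Return 0 if the first line of the given `terraform -version` output
# names the expected version, e.g. "Terraform v0.14.10".
terraform_version_matches() {
  local expected=$1 version_output=$2 first_line
  first_line=$(printf '%s\n' "$version_output" | head -n 1)
  [ "$first_line" = "Terraform v$expected" ]
}

expected=${TERRAFORM_VERSION:-0.14.10}
if command -v terraform >/dev/null 2>&1; then
  if terraform_version_matches "$expected" "$(terraform -version)"; then
    echo "Terraform $expected is installed"
  else
    echo "Terraform is installed but is not version $expected" >&2
  fi
fi
```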

Let's set up the ABM infrastructure on GCE using Terraform:

# Git clone ABM Terraform setup
git clone https://github.com/GoogleCloudPlatform/anthos-samples.git
cd anthos-samples
git checkout abm-gcp-tf-demo
cd anthos-bm-gcp-terraform

# Copy the sample variables file, then edit the cluster names and other values
cp terraform.tfvars.sample terraform.tfvars

Edit variables.tf and terraform.tfvars, and make sure abm_cluster_id is changed to a unique name:

# Generate a unique cluster ID, then set abm_cluster_id and the service account name in variables.tf
export CLUSTER_ID="abm-tensorflow-$(date +"%m%d%H%M")"
echo $CLUSTER_ID
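
The edit can also be scripted instead of done by hand. The sketch below assumes terraform.tfvars contains a line of the form abm_cluster_id = "..."; check the sample file, since that layout is not confirmed here. set_tfvar is a hypothetical helper, and it does not escape values containing | or &.

```shell
# Replace the value on a `key = "value"` line in a tfvars file.
# Assumes one assignment per line; other lines are left unchanged.
set_tfvar() {
  local file=$1 key=$2 value=$3 tmp status=0
  tmp=$(mktemp) || return 1
  sed -E "s|^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*|\\1\"${value}\"|" "$file" > "$tmp" \
    && cat "$tmp" > "$file" || status=1
  rm -f "$tmp"
  return "$status"
}

CLUSTER_ID=${CLUSTER_ID:-abm-tensorflow-$(date +"%m%d%H%M")}
if [ -f terraform.tfvars ]; then
  set_tfvar terraform.tfvars abm_cluster_id "$CLUSTER_ID"
fi
```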

Create GCE resources using Terraform and verify

# Terraform init and apply
terraform init && terraform plan
terraform apply

# Verify resources using gcloud
gcloud compute instances list

# Let's create cluster using bmctl and perform pre-flight checks and verify
export KUBECONFIG=$HOME/bmctl-workspace/$CLUSTER_ID/$CLUSTER_ID-kubeconfig

# List ABM clusters
gcloud container hub memberships list

# Listing the details of live-cluster
gcloud container hub memberships describe $LIVE_CLUSTER_NAME

Verify the Kubernetes cluster details and check a few outputs:

kubectl get nodes
kubectl get deployments
kubectl get pods
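
When scripting this verification, the node list can be filtered for nodes that are not ready. This sketch reads the output of kubectl get nodes --no-headers, where the second column is STATUS. not_ready_nodes is a hypothetical helper, and it treats any status other than exactly "Ready" (including "Ready,SchedulingDisabled") as not ready.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the
# names of nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

if command -v kubectl >/dev/null 2>&1; then
  pending=$(kubectl get nodes --no-headers 2>/dev/null | not_ready_nodes)
  if [ -n "$pending" ]; then
    echo "Nodes not ready:" $pending
  fi
fi
```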

TensorFlow ResNet model service on ABM Cluster

git clone https://github.com/GoogleCloudPlatform/anthos-ai
cd anthos-ai

kubectl create -f serving/resnet_k8s.yaml

# Let's view deployments and pods
kubectl get deployments
kubectl get pods

kubectl get services
kubectl describe service resnet-abm-service

# Let's send a prediction request to the ResNet service on ABM
# (set IMAGE_URL to the image to classify; replace 10.200.0.51 with your service address)
git clone https://github.com/puneith/serving.git
cd serving
sudo tools/run_in_docker.sh python tensorflow_serving/example/resnet_client_grpc.py $IMAGE_URL --server=10.200.0.51:8500
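
The server address is hard-coded in the request above. As a sketch, it could instead be looked up from the service, assuming resnet-abm-service is exposed as a LoadBalancer service with an ingress IP; the service type is not confirmed here. resnet_endpoint is a hypothetical helper, and port 8500 is the one used in the request above.

```shell
# Print "<ip>:<port>" for a LoadBalancer service, or fail if no IP is assigned.
# Assumes the service publishes its address under status.loadBalancer.ingress.
resnet_endpoint() {
  local service=${1:-resnet-abm-service} port=${2:-8500} ip
  ip=$(kubectl get service "$service" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || return 1
  [ -n "$ip" ] || return 1
  printf '%s:%s\n' "$ip" "$port"
}
```

The result could then be passed to the client as --server="$(resnet_endpoint)".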

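Return to the demo host, destroy the Terraform resources, and then delete the demo host itself:

# Destroy the Terraform-managed resources (run from the anthos-bm-gcp-terraform directory on the demo host)
terraform destroy

# Delete the demo host (run from your local machine)
gcloud compute instances delete $DEMO_HOST --zone=us-central1-a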
Owner
Google Cloud Platform