OpenDILab RL Kubernetes Custom Resource and Operator Lib

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

  • A well-prepared kubernetes cluster. Follow the instructions to create a kubernetes cluster, or create a local kubernetes node referring to kind or minikube
  • Cert-manager. Installation on kubernetes please refer to cert-manager docs. Or you can install it by the following command.
kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refers to user-guide. For Chinese version, please refer to 中文手册

Contributing

Refers to developer-guide.

Contact us throw [email protected]

Comments
  • 在 Pod 内增加集群信息

    在 Pod 内增加集群信息

    希望以 dijob replica 方式提交时,每个 pod 都能见到整个 replica 的 host 信息和自己的启动顺序,增加以下几个环境变量:

    1. replica 中所有 pod 的 FQDN,依据启动顺序排序
    2. 当前 pod 的 FQDN
    3. 当前 pod 的顺序编号

    DI-engine 中会根据这些变量实现对应的网络连接,attach-to 的生成逻辑可以从 di-orchestrator 中移除

    enhancement 
    opened by sailxjx 3
  • add tasks to dijob spec

    add tasks to dijob spec

    1. goal

    There is only one pod template defined in a dijob, which results in that we can not define different commands or resources for different componets of di-engine such as collector, learner and evaluator. So we are supposed to find a more general way to define a custom resource of dijob.

    2. design *

    Inspired by VolcanoJob, we define the spec.tasks to describe different componets of di-engine. spec.tasks is a list, which allows us to define multiple tasks. We can specify different task.type to label the task as one of collector, learner, evaluator and none. none means the task is a general task, which is the default value.

    After change, the dijob can be defined as follow:

    apiVersion: diengine.opendilab.org/v2alpha1
    kind: DIJob
    metadata:
      name: job-with-tasks
    spec:
      priority: "normal"  # job priority, which is a reserved field for allocator
      backoffLimit: 0  # restart count
      cleanPodPolicy: "Running"  # the policy to clean pods after job completion
      preemptible: false  # job is preemtible or not
      minReplicas: 2  
      maxReplicas: 5
      tasks:
      - replicas: 1
        name: "learner"
        type: learner
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label learner xxx
              resources:
                requests:
                  cpu: "1"
                  nvidia.com/gpu: 1
            restartPolicy: Never
      - replicas: 1
        name: "evaluator"
        type: evaluator
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label evaluator xxx
            restartPolicy: Never
      - replicas: 2
        name: "collector"
        type: collector
        template:
          metadata:
            name: di
          spec:
            containers:
            - image: registry.sensetime.com/xlab/ding:nightly
              imagePullPolicy: IfNotPresent
              name: pydi
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              command: ["/bin/bash", "-c",]
              args: 
              - |
                ditask --label collector xxx
            restartPolicy: Never
    status:
      conditions:
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job created.
        reason: JobPending
        status: "False"
        type: Pending
      - lastTransitionTime: "2022-05-26T07:25:11Z"
        lastUpdateTime: "2022-05-26T07:25:11Z"
        message: job is starting since all pods are created.
        reason: JobStarting
        status: "False"
        type: Starting
      phase: Starting
      profilings: {}
      readyReplicas: 0
      replicas: 4
      taskStatus:
        learner:
          Pending: 1
        evaluator:
          Pending: 1
        collector:
          Pending: 2
      reschedules: 0
      restarts: 0
    

    task definition:

    type Task struct {
    	Name string `json:"name,omitempty"`
    
    	Type TaskType `json:"type,omitempty"`
    
    	Replicas int32 `json:"replicas,omitempty"`
    
    	Template corev1.PodTemplateSpec `json:"template,omitempty"`
    }
    
    type TaskType string
    
    const (
    	TaskTypeLearner TaskType = "learner"
    
    	TaskTypeCollector TaskType = "collector"
    
    	TaskTypeEvaluator TaskType = "evaluator"
    
    	TaskTypeNone TaskType = "none"
    )
    
    

    status.taskStatus definition:

    type DIJobStatus struct {
      // Phase defines the observed phase of the job
      // +kubebuilder:default=Pending
      Phase Phase `json:"phase,omitempty"`
    
      // ...
      
      // map for different task statuses. key: task.name, value: TaskStatus
      TaskStatus map[string]TaskStatus
    
      // ...
    }
    
    // count of different pod phases
    type TaskStatus map[corev1.PodPhase]int32
    
    enhancement 
    opened by konnase 1
  • new version for di-engine new architecture

    new version for di-engine new architecture

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 1
  • v0.2.0

    v0.2.0

    • [x] split webhook and operator
    • [x] add dockerfile.dev
    • [x] update CleanPolicyALL to CleanPolicyAll
    • [x] remove k8s service related operations from server, and operator is responsible for managing services
    • [x] add e2e test
    enhancement 
    opened by konnase 1
  • refactor job spec

    refactor job spec

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure
    enhancement 
    opened by konnase 0
  • Release/v1.0

    Release/v1.0

    release notes

    features

    • v1.0.0 for DI-engine new architecture
    • remove webhook
    • manage commands with cobra
    • refactor orchestrator architecture inspired from adaptdl
    • use gin to rewrite di-server
    • update di-server http interface
    enhancement 
    opened by konnase 0
  • fix: job failed submit when collector/learner missed

    fix: job failed submit when collector/learner missed

    job failed submit when collector/learner missed because webhook create an empty dijob, and golang builder add some default value to some feilds of collector/learner, which result in invalid type error. solved by make coordinator/collector/learner as pointers.

    bug 
    opened by konnase 0
  • Feat/job create event

    Feat/job create event

    • add event handler for dijob, and mark job as Created when job submitted
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    • version -> v0.2.1
    enhancement 
    opened by konnase 0
  • allocate的一些问题

    allocate的一些问题

    1.目前的allocator的逻辑,对于不可被抢占的job的初始分配,仅利用minreplicas修改replicas属性,那job的pods部署到哪个节点是完全由K8S决定吗?而且Release1.13代码的allocator.go中对不可被抢占job的初始分配部分貌似还没有写。 2.job是否可以被抢占的含义具体是什么?和是否能被调度是不是等价的? 3.调度策略的FitPolicy的Allocate和Optimize方法也没有进行实现,这部分内容什么时候可以补充? 4.文档中存在许多与最新代码不符合的地方,比如DIJob.Spec.Group属性在代码中已经被移除,文档中提到的job.spec.minreplicas属性代码中也没有,而是在JobInfo中。可以更新一下文档吗? 感谢!

    opened by RZ-Q 3
Releases(v1.1.3)
  • v1.1.3(Aug 22, 2022)

  • v1.1.2(Jul 21, 2022)

    bugs fix

    • global cmd flag error(https://github.com/opendilab/DI-orchestrator/pull/23)
    • wrong pod subdomain(https://github.com/opendilab/DI-orchestrator/pull/24)
    • incorrect to get global rank(https://github.com/opendilab/DI-orchestrator/pull/25)
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(445.36 KB)
  • v1.1.1(Jul 4, 2022)

  • v1.1.0(Jun 30, 2022)

    • refactor job spec definition and add spec.tasks to support multi tasks #20
    • add DI_RANK to pod env and remove engineFields in job.spec #16
    • add e2e test
    • add validator to validate the correctness of dijob spec
    • change job.phase to Pending when job replicas scaled to 0
    • implement a processor to process di-server requests
    • refactor project structure

    see details in https://github.com/opendilab/DI-orchestrator/pull/21

    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(374.01 KB)
  • v1.0.0(Mar 23, 2022)

  • v0.2.2(Dec 15, 2021)

  • v0.2.1(Oct 12, 2021)

    feature

    • add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)
    • mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)
    • mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted
    Source code(tar.gz)
    Source code(zip)
    di-manager.yaml(1.38 MB)
  • v0.2.0(Sep 28, 2021)

  • v0.2.0-rc.0(Sep 6, 2021)

    • split webhook and operator
    • add dockerfile.dev
    • update CleanPolicyALL to CleanPolicyAll
    • remove k8s service related operations from server, and operator is responsible for managing services
    • add e2e test
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jul 8, 2021)

    Features

    • Define DIJob CRD to support DI jobs' submission
    • Define AggregatorConfig CRD to support aggregator definition
    • Add webhook to validate DIJob submission
    • Provide http service for DI jobs to request for DI modules
    • Docs to introduce DI-orchestrator architecture
    Source code(tar.gz)
    Source code(zip)
Owner
OpenDILab
Open sourced Decision Intelligence (DI)
OpenDILab
The official implementation of the IEEE S&P`22 paper "SoK: How Robust is Deep Neural Network Image Classification Watermarking".

Watermark-Robustness-Toolbox - Official PyTorch Implementation This repository contains the official PyTorch implementation of the following paper to

49 Dec 19, 2022
A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available.

Use this instead: https://github.com/facebookresearch/maskrcnn-benchmark A Pytorch Implementation of Detectron Example output of e2e_mask_rcnn-R-101-F

Roy 2.8k Dec 29, 2022
[ICLR 2021] "Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective" by Wuyang Chen, Xinyu Gong, Zhangyang Wang

Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective [PDF] Wuyang Chen, Xinyu Gong, Zhangyang Wang In ICLR 2

VITA 156 Nov 28, 2022
Implémentation en pyhton de l'article Depixelizing pixel art de Johannes Kopf et Dani Lischinski

Implémentation en pyhton de l'article Depixelizing pixel art de Johannes Kopf et Dani Lischinski

TableauBits 3 May 29, 2022
Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Portrait Photo Retouching with PPR10K Paper | Supplementary Material PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask an

184 Dec 11, 2022
基于PaddleOCR搭建的OCR server... 离线部署用

开头说明 DangoOCR 是基于大家的 CPU处理器 来运行的,CPU处理器 的好坏会直接影响其速度, 但不会影响识别的精度 ,目前此版本识别速度可能在 0.5-3秒之间,具体取决于大家机器的配置,可以的话尽量不要在运行时开其他太多东西。需要配合团子翻译器 Ver3.6 及其以上的版本才可以使用!

胖次团子 131 Dec 25, 2022
Aerial Imagery dataset for fire detection: classification and segmentation (Unmanned Aerial Vehicle (UAV))

Aerial Imagery dataset for fire detection: classification and segmentation using Unmanned Aerial Vehicle (UAV) Title FLAME (Fire Luminosity Airborne-b

79 Jan 06, 2023
Code for Private Recommender Systems: How Can Users Build Their Own Fair Recommender Systems without Log Data? (SDM 2022)

Private Recommender Systems: How Can Users Build Their Own Fair Recommender Systems without Log Data? (SDM 2022) We consider how a user of a web servi

joisino 20 Aug 21, 2022
TagLab: an image segmentation tool oriented to marine data analysis

TagLab: an image segmentation tool oriented to marine data analysis TagLab was created to support the activity of annotation and extraction of statist

Visual Computing Lab - ISTI - CNR 49 Dec 29, 2022
🧮 Matrix Factorization for Collaborative Filtering is just Solving an Adjoint Latent Dirichlet Allocation Model after All

Accompanying source code to the paper "Matrix Factorization for Collaborative Filtering is just Solving an Adjoint Latent Dirichlet Allocation Model A

Florian Wilhelm 39 Dec 03, 2022
验证码识别 深度学习 tensorflow 神经网络

captcha_tf2 验证码识别 深度学习 tensorflow 神经网络 使用卷积神经网络,对字符,数字类型验证码进行识别,tensorflow使用2.0以上 目前项目还在更新中,诸多bug,欢迎提出issue和PR, 希望和你一起共同完善项目。 实例demo 训练过程 优化器选择: Adam

5 Apr 28, 2022
Imaging, analysis, and simulation software for radio interferometry

ehtim (eht-imaging) Python modules for simulating and manipulating VLBI data and producing images with regularized maximum likelihood methods. This ve

Andrew Chael 5.2k Dec 28, 2022
Code for paper "Context-self contrastive pretraining for crop type semantic segmentation"

Code for paper "Context-self contrastive pretraining for crop type semantic segmentation" Setting up a python environment Follow the instruction in ht

Michael Tarasiou 11 Oct 09, 2022
deep-table implements various state-of-the-art deep learning and self-supervised learning algorithms for tabular data using PyTorch.

deep-table implements various state-of-the-art deep learning and self-supervised learning algorithms for tabular data using PyTorch.

63 Oct 17, 2022
Ensembling Off-the-shelf Models for GAN Training

Data-Efficient GANs with DiffAugment project | paper | datasets | video | slides Generated using only 100 images of Obama, grumpy cats, pandas, the Br

MIT HAN Lab 1.2k Dec 26, 2022
HW3 ― GAN, ACGAN and UDA

HW3 ― GAN, ACGAN and UDA In this assignment, you are given datasets of human face and digit images. You will need to implement the models of both GAN

grassking100 1 Dec 13, 2021
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [CVPR 2021]

Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [BCNet, CVPR 2021] This is the official pytorch implementation of BCNet built on

Lei Ke 434 Dec 01, 2022
LERP : Label-dependent and event-guided interpretable disease risk prediction using EHRs

LERP : Label-dependent and event-guided interpretable disease risk prediction using EHRs This is the code for the LERP. Dataset The dataset used is MI

5 Jun 18, 2022
An efficient PyTorch implementation of the evaluation metrics in recommender systems.

recsys_metrics An efficient PyTorch implementation of the evaluation metrics in recommender systems. Overview • Installation • How to use • Benchmark

Xingdong Zuo 12 Dec 02, 2022
Based on Stockfish neural network(similar to LcZero)

MarcoEngine Marco Engine - interesnaya neyronnaya shakhmatnaya set', kotoraya ispol'zuyet metod samoobucheniya(dostizheniye khoroshoy igy putem proboy

Marcus Kemaul 4 Mar 12, 2022