PyTorch Operator

Kubeflow PyTorch-Job Training Operator

PyTorch is a Python package that provides two high-level features:

  • Tensor computation (like NumPy) with strong GPU acceleration

  • Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed. More information is available at https://github.com/kubeflow/pytorch-operator or on the PyTorch site at https://pytorch.org/

Quick Start

As usual, let's deploy PyTorch with a single command

k3ai apply pytorch-op
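
Before running a job, you can verify the operator installed correctly by checking that the PyTorchJob CRD is registered and the operator pod is up. A minimal sketch (the label selector below is an assumption and may differ depending on your operator version):

# Confirm the PyTorchJob custom resource definition is registered
kubectl get crd pytorchjobs.kubeflow.org

# Confirm the operator pod is running (label is an assumption; adjust if needed)
kubectl get pods -n kubeflow -l name=pytorch-operator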

Test your PyTorch-Job installation

We will use the MNIST example from the Kubeflow PyTorch-Job repo at https://github.com/kubeflow/pytorch-operator/tree/master/examples/mnist

As usual, we want to avoid complexity, so we reworked the sample a bit to make it much easier.

Step 1

In the original example, a container image has to be built before running the sample. We merged the container commands directly into the YAML file, so it is now a one-click job: each replica installs its dependencies, downloads mnist.py, and runs it at startup.

For CPU only

kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]

    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
EOF
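
Once the job is submitted, you can follow its overall status through the PyTorchJob resource itself, for example:

# Show the job and its current state
kubectl get pytorchjobs -n kubeflow

# Inspect events and per-replica status for this job
kubectl describe pytorchjob pytorch-dist-mnist-gloo -n kubeflow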

If you have GPUs enabled, you may run it this way

kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
              # Change the value of nvidia.com/gpu based on your configuration
              resources:
                limits:
                  nvidia.com/gpu: 1 
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
              # Change the value of nvidia.com/gpu based on your configuration
              resources:
                limits:
                  nvidia.com/gpu: 1 
EOF
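
Before launching the GPU variant, it is worth confirming that your nodes actually advertise GPU resources to Kubernetes (this assumes the NVIDIA device plugin is installed), for example:

# A node with GPUs exposed lists nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"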

Step 2

Check that the pods are deployed correctly with

kubectl get pod -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow

It should output something like this

NAME                               READY   STATUS    RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0   1/1     Running   0          2m26s
pytorch-dist-mnist-gloo-worker-0   1/1     Running   0          2m26s
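
If a pod stays in Pending or CrashLoopBackOff instead, describing it usually reveals the cause (image pull, scheduling, or GPU availability issues), for example:

# Inspect events for the Master pod
kubectl describe pod pytorch-dist-mnist-gloo-master-0 -n kubeflow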

Step 3

Check the log output of your training job

kubectl logs -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow

You should observe output similar to this (since we are running one Master and one Worker, the training log appears twice, once per replica)

Train Epoch: 1 [55680/60000 (93%)]      loss=0.0341
Train Epoch: 1 [56320/60000 (94%)]      loss=0.0357
Train Epoch: 1 [56960/60000 (95%)]      loss=0.0774
Train Epoch: 1 [57600/60000 (96%)]      loss=0.1186
Train Epoch: 1 [58240/60000 (97%)]      loss=0.1927
Train Epoch: 1 [58880/60000 (98%)]      loss=0.2050
Train Epoch: 1 [59520/60000 (99%)]      loss=0.0642

accuracy=0.9660

Train Epoch: 1 [55680/60000 (93%)]      loss=0.0341
Train Epoch: 1 [56320/60000 (94%)]      loss=0.0357
Train Epoch: 1 [56960/60000 (95%)]      loss=0.0774
Train Epoch: 1 [57600/60000 (96%)]      loss=0.1186
Train Epoch: 1 [58240/60000 (97%)]      loss=0.1927
Train Epoch: 1 [58880/60000 (98%)]      loss=0.2050
Train Epoch: 1 [59520/60000 (99%)]      loss=0.0642

accuracy=0.9660
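
The selector-based command above interleaves the output of both replicas. To stream a single replica's logs, or to clean up once training completes, you can use:

# Follow the Master replica's logs in real time
kubectl logs -f pytorch-dist-mnist-gloo-master-0 -n kubeflow

# Delete the PyTorchJob (and its pods) when you are done
kubectl delete pytorchjob pytorch-dist-mnist-gloo -n kubeflow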
