TensorFlow Operator

Kubeflow TensorFlow-Job Training Operator

TFJob provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed TensorFlow jobs on Kubernetes.

More on the TensorFlow Operator at https://github.com/kubeflow/tf-operator
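
To give an idea of the resource the operator manages, here is a minimal TFJob sketch (illustrative only: the name, image, and command are placeholders; the MNIST example deployed in Step 2 below is a complete, working job):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob              # placeholder name, for illustration
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow     # the operator expects this container name
              image: registry.example.com/train:latest   # placeholder image
              command: ["python", "/opt/train.py"]       # placeholder command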

Quick Start

All you have to run is:

k3ai apply tensorflow-op
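
Before testing, you can check that the operator is up. Assuming the plugin installs into the kubeflow namespace (the same namespace the examples below use), the TFJob CRD and the operator pod should be visible:

kubectl get crd tfjobs.kubeflow.org
kubectl get pods -n kubeflow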

Test your installation

We present here a sample from the TensorFlow Operator repository at https://github.com/kubeflow/tf-operator

Step 1

We first need to add a persistent volume and a claim. To do so, apply the two YAML manifests below, copying and pasting each command in order.

kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfevent-volume
  labels:
    type: local
    app: tfjob
spec:
  capacity:
    storage: 10Gi
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/data
EOF

Now we add the PVC.

kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow 
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce # must match the PV's access mode; see the note on ReadWriteMany below
  resources:
    requests:
      storage: 10Gi
EOF
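
You can verify that both objects were created. With the k3s local-path provisioner the claim may stay Pending until the training pod consumes it, since that storage class binds volumes on first consumer:

kubectl get pv tfevent-volume
kubectl get pvc tfevent-volume -n kubeflow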

Step 2

Now we deploy the example:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tf_job_mnist.yaml
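
The job can take a few minutes to pull its images and run to completion. You can follow its progress with the commands below (the job name, mnist, comes from the example manifest):

kubectl get tfjob mnist -n kubeflow
kubectl get pods -l tf-job-name=mnist -n kubeflow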

You can observe the result of the example with:

kubectl logs -l tf-job-name=mnist -n kubeflow --tail=-1

It should output something similar to this (only part of the output is shown here):

...
Adding run metadata for 799
Accuracy at step 800: 0.957
Accuracy at step 810: 0.9698
Accuracy at step 820: 0.9676
Accuracy at step 830: 0.9676
Accuracy at step 840: 0.9677
Accuracy at step 850: 0.9673
Accuracy at step 860: 0.9676
Accuracy at step 870: 0.9654
Accuracy at step 880: 0.9694
Accuracy at step 890: 0.9708
Adding run metadata for 899
Accuracy at step 900: 0.9737
Accuracy at step 910: 0.9708
Accuracy at step 920: 0.9721
Accuracy at step 930: 0.972
Accuracy at step 940: 0.9639
Accuracy at step 950: 0.966
Accuracy at step 960: 0.9654
Accuracy at step 970: 0.9683
Accuracy at step 980: 0.9685
Accuracy at step 990: 0.9666
Adding run metadata for 999

Note: Because we are using local-path as the storage volume and we are on a single-node cluster, we can't use ReadWriteMany, as per Rancher local-path provisioner issue https://github.com/rancher/local-path-provisioner/issues/70#issuecomment-574390050
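
Once you are done, you can optionally remove the example and the volume objects created in Step 1:

kubectl delete -f https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tf_job_mnist.yaml
kubectl delete pvc tfevent-volume -n kubeflow
kubectl delete pv tfevent-volume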
