Running kubeflow and hit this too? Notebook Server: 1 pod has unbound immediate PersistentVolumeClaims?!

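Yet another problem that made me question my life choices. After surviving a local kubeflow install (see: "Installing kubeflow on a Mac? Not as simple as it sounds"), I thought it would be smooth sailing. But when I created my first Jupyter Server (named playground-0), it never finished coming up, and its Status showed: playground-0 default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims. My first instinct was to check what k8s itself was reporting (I had no other tricks anyway...):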

Check the pod status

# If you use microk8s, replace 'kubectl' with 'microk8s kubectl' in every command below.
$ kubectl get pods -n admin
NAME           READY   STATUS    RESTARTS   AGE
playground-0   1/1     Running   0          6h11m

Hold on, nothing looks wrong here. Time to look closer and describe the pod:

$ kubectl describe pod playground-0 -n admin
Name:         playground-0
Namespace:    admin
Priority:     0
Node:         microk8s-vm/192.168.64.2
Start Time:   Sat, 23 Oct 2021 17:25:05 +0800
Labels:       app=playground
...
workspace-playground:
Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName:  workspace-playground
ReadOnly:   false
...
Events:
Type     Reason            Age   From               Message
---------------------------------------------------------------------
Warning  FailedScheduling  25m   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
Normal   Scheduled         24m   default-scheduler  Successfully assigned admin/playground-0 to microk8s-vm
Normal   Pulled            24m   kubelet            Container image "gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0" already present on machine
Normal   Created           24m   kubelet            Created container playground
Normal   Started           24m   kubelet            Started container playground

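The Events section has the matching message. At first (Age 25m: 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims) the scheduler could not place the pod because its PersistentVolumeClaim (PVC) was not bound yet. About a minute later (Age 24m: Successfully assigned admin/playground-0 to microk8s-vm) scheduling succeeded, which means the PVC had been bound by then.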
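To confirm that reading of the Events, it can help to ask k8s directly whether the claim is bound now and which warnings are still recorded for the pod. The following is a minimal sketch, not part of kubeflow or kubectl: the helper names `pvc_phase` and `pod_warnings` and the `KUBECTL` variable are made up for illustration, while the underlying `kubectl get pvc ... -o jsonpath` and `kubectl get events --field-selector` calls are standard.

```shell
# Hypothetical helpers (names are illustrative only).
# With microk8s, set KUBECTL="microk8s kubectl" before calling them.
KUBECTL="${KUBECTL:-kubectl}"

# Print the phase of a PVC, for example Pending or Bound.
# Usage: pvc_phase <pvc-name> <namespace>
pvc_phase() {
  $KUBECTL get pvc "$1" -n "$2" -o jsonpath='{.status.phase}'
}

# List only the Warning events recorded for one object (here: a pod).
# Usage: pod_warnings <pod-name> <namespace>
pod_warnings() {
  $KUBECTL get events -n "$2" \
    --field-selector "involvedObject.name=$1,type=Warning"
}
```

With the names from this post, that would be `pvc_phase workspace-playground admin` and `pod_warnings playground-0 admin`. If the phase is already Bound while a FailedScheduling warning is still listed, the warning is most likely left over from the short window before the claim was bound.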

Restart the Pod

It looks like this stale warning is what makes the Dashboard show an error. My fix is the everyday one: restart the pod so the error is washed out of its Events:

$ kubectl delete pod playground-0 -n admin
pod "playground-0" deleted

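Note that this restart is not done by clicking the trash can icon in the Dashboard. The trash can uproots the Pod entirely, and so far I have not managed to get past this problem that way.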
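Deleting the pod with kubectl works as a restart because something recreates it under the same name (the -0 suffix suggests a StatefulSet-managed pod, though that is an assumption and not something checked here). Instead of re-running describe by hand until the new pod is up, a small polling loop can wait for it. This is a sketch under those assumptions: `wait_pod_ready` and its retry parameters are hypothetical, and only the `kubectl get pod ... -o jsonpath` call is standard.

```shell
# Hypothetical helper (name and defaults are illustrative only).
# With microk8s, set KUBECTL="microk8s kubectl" before calling it.
KUBECTL="${KUBECTL:-kubectl}"

# Poll until the pod's Ready condition is True, or give up.
# Usage: wait_pod_ready <pod-name> <namespace> [max-checks] [seconds-between-checks]
wait_pod_ready() {
  local pod="$1" ns="$2" tries="${3:-30}" delay="${4:-2}"
  local i=0 ready
  while [ "$i" -lt "$tries" ]; do
    # Errors are silenced because the pod may briefly not exist while it is recreated.
    ready=$($KUBECTL get pod "$pod" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
    if [ "$ready" = "True" ]; then
      echo "$pod is Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$pod not Ready after $tries checks" >&2
  return 1
}
```

For the pod in this post: `wait_pod_ready playground-0 admin`. The built-in `kubectl wait --for=condition=Ready pod/playground-0 -n admin --timeout=60s` is an alternative, but it may fail right away if the pod object does not exist at the moment it is called, which is why the sketch polls instead.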

Once the pod has restarted, describe it again:

ubuntu@microk8s-vm:~$ kubectl describe pod playground-0 -n admin
Name:           playground-0
Namespace:      admin
Priority:       0
Node:           microk8s-vm/192.168.64.2
Start Time:     Sat, 23 Oct 2021 17:51:21 +0800
Labels:         app=playground
...
workspace-playground:
Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName:  workspace-playground
ReadOnly:   false
...
Events:
Type    Reason     Age   From               Message
---------------------------------------------------------------------
Normal  Scheduled  3s    default-scheduler  Successfully assigned admin/playground-0 to microk8s-vm
Normal  Pulled     2s    kubelet            Container image "gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0" already present on machine
Normal  Created    1s    kubelet            Created container playground
Normal  Started    1s    kubelet            Started container playground

The error no longer appears in this Pod's Events. The Status on the Dashboard now shows a check mark, and clicking Connect opens the Server!

Summary

Events on a Pod can affect the status shown on the kubeflow Dashboard. If the Dashboard reports something abnormal but the Pod itself looks healthy, an error recorded in its Events may be the cause.