Kubernetes - Health Check

Feb 2, 2020 | Ops | 28 min read (about 4,206 words)

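Overview

This post shows how to configure liveness, readiness, and startup probes. The kubelet uses liveness probes to decide when to restart a container, for example when the application is still running but can no longer handle requests.

The kubelet uses readiness probes to decide whether a container is ready to accept traffic. A Pod is considered ready only when all of its containers are ready. A Pod that is not ready is removed from the backend list of Service load balancers.
The kubelet uses startup probes to decide whether the application in a container has started. Liveness and readiness probes do not run until the startup probe succeeds. This helps applications that take a long time to start, because they are not killed by a failing liveness probe before startup completes.

This post is mainly my personal study notes on Kubernetes health checks. You will see a lot of Q&A below, because I (Ray) like to break long theory into small Q&A pairs. It suits the way I absorb material, and it gets me 100 on every exam!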

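Define a liveness command

Many applications that run for a long time eventually end up in a broken state that only a restart can fix. Kubernetes provides liveness probes to detect and remedy this situation.
Let's start a Pod that runs the k8s.gcr.io/busybox image. Here is the Pod's configuration file: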

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

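Notes on the YAML file above:

periodSeconds: how often the liveness probe runs (every 5 seconds here)
initialDelaySeconds: how long to wait after the container starts before running the first probe (5 seconds here)
cat /tmp/healthy: the command the liveness probe runs. Exit code 0 means success; any other exit code means failure.
When the container starts, it runs /bin/sh -c "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600". It first creates /tmp/healthy and then sleeps for 30 seconds, so the file exists during those 30 seconds. After that it deletes /tmp/healthy and sleeps for 600 seconds, and during those 600 seconds the probe command returns an error.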
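
To make the timing above concrete, here is a minimal Go sketch of when a restart would be triggered. It is a simplified model, not the kubelet's actual scheduling: it assumes probes fire exactly at initialDelaySeconds and then every periodSeconds (the real kubelet adds jitter), that a probe fails as soon as the file is gone, and that failureThreshold keeps its default of 3. The function name restartSecond is made up for this example.

```go
package main

import "fmt"

// restartSecond returns the second (counted from container start) at which
// the liveness probe has failed failureThreshold times in a row.
// Assumptions: probes run at initialDelay, initialDelay+period, ... with no
// jitter, and a probe fails when t >= removedAt (the file no longer exists).
// period must be positive, otherwise the loop never ends.
func restartSecond(initialDelay, period, failureThreshold, removedAt int) int {
	failures := 0
	for t := initialDelay; ; t += period {
		if t >= removedAt {
			failures++
		} else {
			failures = 0
		}
		if failures >= failureThreshold {
			return t
		}
	}
}

func main() {
	// initialDelaySeconds: 5, periodSeconds: 5, default failureThreshold: 3,
	// and the file is removed 30 seconds after the container starts.
	fmt.Println(restartSecond(5, 5, 3, 30)) // prints 40
}
```

Under these assumptions the probes at 30s, 35s, and 40s fail back to back, so the restart is triggered at roughly the 40-second mark rather than right at 30 seconds.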

Now let's try it out:

Create the Pod

kubectl apply -f https://k8s.io/examples/pods/probe/exec-liveness.yaml

Within the first 30 seconds, run the following command to view the Pod events:
kubectl describe pod liveness-exec

The output shows no liveness probe failures yet

FirstSeen    LastSeen    Count   From            SubobjectPath           Type        Reason      Message
--------- --------    -----   ----            -------------           --------    ------      -------
24s       24s     1   {default-scheduler }                    Normal      Scheduled   Successfully assigned liveness-exec to worker0
23s       23s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Pulling     pulling image "k8s.gcr.io/busybox"
23s       23s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Pulled      Successfully pulled image "k8s.gcr.io/busybox"
23s       23s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Created     Created container with docker id 86849c15382e; Security:[seccomp=unconfined]
23s       23s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Started     Started container with docker id 86849c15382e

After 35 seconds, view the events again
kubectl describe pod liveness-exec

The output now shows that the liveness probe has failed

FirstSeen LastSeen    Count   From            SubobjectPath           Type        Reason      Message
--------- --------    -----   ----            -------------           --------    ------      -------
37s       37s     1   {default-scheduler }                    Normal      Scheduled   Successfully assigned liveness-exec to worker0
36s       36s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Pulling     pulling image "k8s.gcr.io/busybox"
36s       36s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Pulled      Successfully pulled image "k8s.gcr.io/busybox"
36s       36s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Created     Created container with docker id 86849c15382e; Security:[seccomp=unconfined]
36s       36s     1   {kubelet worker0}   spec.containers{liveness}   Normal      Started     Started container with docker id 86849c15382e
2s        2s      1   {kubelet worker0}   spec.containers{liveness}   Warning     Unhealthy   Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory

Wait another 30 seconds and check whether the container has been restarted
kubectl get pod liveness-exec

It has been restarted, and the restart count has gone up by one

NAME            READY     STATUS    RESTARTS   AGE
liveness-exec   1/1       Running   1          1m

Define a liveness HTTP request

Another kind of liveness probe uses an HTTP request. The following configuration file is for a Pod that runs the k8s.gcr.io/liveness image

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3

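Notes on the configuration above:

periodSeconds: probe every 3 seconds
initialDelaySeconds: wait 3 seconds after the container starts before the first probe
This time the probe sends an HTTP GET request to port 8080 of the server
If /healthz returns a status code from 200 to 399, the probe succeeds; anything else is a failure

If you are interested, you can read the source code of this image. As the code below shows, it returns 200 for the first 10 seconds and 500 after that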

http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    duration := time.Now().Sub(started)
    if duration.Seconds() > 10 {
        w.WriteHeader(500)
        w.Write([]byte(fmt.Sprintf("error: %v", duration.Seconds())))
    } else {
        w.WriteHeader(200)
        w.Write([]byte("ok"))
    }
})

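The kubelet starts probing 3 seconds after the container starts, so the first few health checks succeed. After 10 seconds the checks start to fail, and the kubelet kills and restarts the container.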
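
The interaction between the handler and the probe can be sketched in a few lines of Go. This is only an illustration: healthStatus mirrors the handler shown above, probeSucceeded encodes the 200 to 399 rule, and the probe times assume initialDelaySeconds: 3 and periodSeconds: 3 with no jitter. Both function names are made up for this example.

```go
package main

import "fmt"

// healthStatus mirrors the /healthz handler above: 200 for the first
// 10 seconds after startup, 500 afterwards.
func healthStatus(elapsedSeconds float64) int {
	if elapsedSeconds > 10 {
		return 500
	}
	return 200
}

// probeSucceeded applies the HTTP probe rule: any status code that is
// at least 200 and below 400 counts as a success.
func probeSucceeded(code int) bool {
	return code >= 200 && code < 400
}

func main() {
	// Assumed probe times: 3s, 6s, 9s, ... (no jitter).
	for s := 3; s <= 18; s += 3 {
		code := healthStatus(float64(s))
		fmt.Printf("t=%2ds status=%d success=%v\n", s, code, probeSucceeded(code))
	}
}
```

In this model the probes at 3s, 6s, and 9s succeed and every probe from 12s onward fails, which matches the behavior described above.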

Let's try the HTTP liveness check

Create the Pod

kubectl apply -f https://k8s.io/examples/pods/probe/http-liveness.yaml

After 10 seconds, view the Pod events. You can see that the liveness probe has failed and the container has been restarted
kubectl describe pod liveness-http

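In releases up to and including v1.13, if the http_proxy (or HTTP_PROXY) environment variable is set on the node, the HTTP liveness probe uses that proxy. In releases after v1.13, these proxy environment variables no longer affect the HTTP liveness probe.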

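Define a TCP liveness probe

The third kind of liveness probe uses a TCP socket. With the configuration below, the kubelet tries to open a socket to your container on the specified port. If the connection can be established, the probe succeeds; otherwise it fails.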

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: k8s.gcr.io/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20

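Step by step, here is what the configuration above does:

The TCP check is very similar to the HTTP check
This example uses both a readiness probe and a liveness probe
The kubelet sends the first readiness probe 5 seconds after the container starts
The readiness probe tries to connect to port 8080 of the goproxy container; if it succeeds, the container is marked ready
The kubelet keeps running this check every 10 seconds
The kubelet sends the first liveness probe 15 seconds after the container starts
The liveness probe tries to connect to port 8080 of the goproxy container; if it fails, the container is restarted
The liveness probe runs every 20 seconds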
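
What "try to open a socket" means can be sketched with Go's standard net package. This is not the kubelet's implementation, only a minimal illustration of a TCP check: tcpCheck and demoTCPCheck are hypothetical names, the 1-second timeout is an assumption that matches the default timeoutSeconds, and the demo uses a throwaway listener on a loopback port instead of a real container.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// tcpCheck reports whether a TCP connection to addr can be opened
// within the timeout. The connection is closed right away.
func tcpCheck(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demoTCPCheck opens a listener on a free loopback port, checks it,
// closes it, and checks the same address again.
func demoTCPCheck() (whileListening, afterClose bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()
	whileListening = tcpCheck(addr, time.Second)
	ln.Close()
	afterClose = tcpCheck(addr, time.Second)
	return whileListening, afterClose, nil
}

func main() {
	up, down, err := demoTCPCheck()
	if err != nil {
		fmt.Println("could not open a local listener:", err)
		return
	}
	fmt.Println("while listening:", up, "after close:", down)
}
```

The check only proves that something accepts connections on the port. It says nothing about whether the application behind it answers correctly, which is the main difference from the HTTP check.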

Try it out:

Create the Pod

kubectl apply -f https://k8s.io/examples/pods/probe/tcp-liveness-readiness.yaml

After 15 seconds, view the Pod events to check the liveness probe status:
kubectl describe pod goproxy

Use a named port

For HTTP or TCP liveness checks, you can use a named port, as shown below

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port

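Protect slow-starting containers with startup probes

Sometimes you have to deal with applications that need extra time on their first startup. In that case you do not need to raise the failure threshold of the liveness probe; you can use a startup probe instead. The key is to set failureThreshold * periodSeconds to a value longer than the worst-case startup time of the application, as in this example: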

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

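Notes on the configuration above:

The application has up to 5 minutes (30 * 10 = 300s) to finish starting
Once the startup probe succeeds, the liveness probe takes over. Detection switches from startup checking with a high failure tolerance to deadlock checking that restarts the container after a single failure
If the startup probe never succeeds, the container is killed after 300s and then handled according to the Pod's restartPolicy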
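
The budget arithmetic above is simple enough to check in code. The following Go sketch computes the startup budget and tests a worst-case startup time against it. The function names and the 240-second worst case are made up for illustration, and the sketch ignores initialDelaySeconds, which the example configuration does not set.

```go
package main

import "fmt"

// startupBudgetSeconds is the longest time a container may take to start
// before the startup probe gives up: failureThreshold * periodSeconds.
func startupBudgetSeconds(failureThreshold, periodSeconds int) int {
	return failureThreshold * periodSeconds
}

// fitsBudget reports whether a worst-case startup time is strictly
// shorter than the startup budget.
func fitsBudget(worstCaseStartupSeconds, failureThreshold, periodSeconds int) bool {
	return worstCaseStartupSeconds < startupBudgetSeconds(failureThreshold, periodSeconds)
}

func main() {
	fmt.Println(startupBudgetSeconds(30, 10)) // prints 300
	fmt.Println(fitsBudget(240, 30, 10))      // prints true (240s is a hypothetical worst case)
}
```

If the worst case does not fit, raise failureThreshold on the startup probe rather than loosening the liveness probe, so that a hung application is still detected quickly once it is running.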

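Define readiness probes

Sometimes an application is temporarily unable to serve traffic, for example while it loads a large amount of data or configuration files during startup, or while it depends on external services after startup. In these cases you do not want to kill the container, but you do want to stop sending requests to it for a while. Kubernetes provides readiness probes for this: a Pod whose readiness probe reports not ready receives no traffic from Kubernetes Services.
Readiness probes keep running for the whole lifetime of the container
Readiness probes are configured almost the same way as liveness probes. The only difference is that you use the readinessProbe field instead of livenessProbe, as shown below: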

readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

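Readiness and liveness probes can be used together. This avoids situations such as a container that temporarily cannot handle traffic but still receives requests from the Service, and is then killed outright when its liveness probe fails.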
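
The readiness rules from this post (a Pod is ready only when all of its containers are ready, and only ready Pods stay in the Service backend list) can be modeled in a few lines of Go. This is a toy model for illustration, not how Kubernetes stores this state; the function names and the Pod and container names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// podReady reports whether every container in the Pod is ready.
func podReady(containerReady map[string]bool) bool {
	for _, ready := range containerReady {
		if !ready {
			return false
		}
	}
	return true
}

// serviceBackends returns the sorted names of the Pods that would stay
// in a Service's backend list, that is, the Pods that are ready.
func serviceBackends(pods map[string]map[string]bool) []string {
	var backends []string
	for name, containers := range pods {
		if podReady(containers) {
			backends = append(backends, name)
		}
	}
	sort.Strings(backends)
	return backends
}

func main() {
	pods := map[string]map[string]bool{
		"web-1": {"app": true, "sidecar": true},
		"web-2": {"app": true, "sidecar": false}, // one container is not ready
	}
	fmt.Println(serviceBackends(pods)) // prints [web-1]
}
```

Note that web-2 is only taken out of rotation. Nothing is restarted, which is the key difference from a failing liveness probe.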

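Configure Probes

The following Probe fields give you finer control over how liveness and readiness checks behave

initialDelaySeconds: how long to wait after the container starts before liveness or readiness probing begins. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: how often to probe. Defaults to 10 seconds. Minimum value is 1.
timeoutSeconds: the number of seconds without a response after which the probe counts as failed. Defaults to 1 second. Minimum value is 1.
successThreshold: the number of consecutive successes required, after a failure, for the probe to count as successful again. Defaults to 1. Must be 1 for liveness probes. Minimum value is 1.
failureThreshold: when a probe fails, Kubernetes retries until it has failed this many times before giving up. For a liveness probe, giving up means restarting the container; for a readiness probe, it means marking the container as unready. Defaults to 3. Minimum value is 1.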
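
How successThreshold and failureThreshold interact is easier to see when replayed over a sequence of probe results. The Go sketch below implements the counting logic as described above. It is an illustration under stated assumptions, not the kubelet's code: applyThresholds is a made-up name, the starting state is passed in by the caller, and successThreshold 2 in the demo is an arbitrary value for a readiness-style probe.

```go
package main

import "fmt"

// applyThresholds replays probe results (true = success) and returns the
// probe state after each result. The state flips to failed only after
// failureThreshold consecutive failures, and back to healthy only after
// successThreshold consecutive successes.
func applyThresholds(results []bool, healthy bool, successThreshold, failureThreshold int) []bool {
	states := make([]bool, 0, len(results))
	successes, failures := 0, 0
	for _, ok := range results {
		if ok {
			successes++
			failures = 0
			if !healthy && successes >= successThreshold {
				healthy = true
			}
		} else {
			failures++
			successes = 0
			if healthy && failures >= failureThreshold {
				healthy = false
			}
		}
		states = append(states, healthy)
	}
	return states
}

func main() {
	// One success, three failures in a row, then two successes.
	results := []bool{true, false, false, false, true, true}
	// Start healthy, successThreshold 2 (made up), failureThreshold 3 (default).
	fmt.Println(applyThresholds(results, true, 2, 3))
	// prints [true true true false false true]
}
```

The first two failures are tolerated, the third one flips the state, and a single success afterwards is not enough to recover because successThreshold is 2 in this example.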

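The following fields can be set on httpGet for HTTP probes:

host: the host name to connect to. Defaults to the Pod IP. You will usually want to set the Host header in httpHeaders instead of using this field.
scheme: the scheme used to connect to the host (HTTP or HTTPS). Defaults to HTTP.
path: the path to access on the HTTP server
httpHeaders: custom headers to set on the request. HTTP allows repeated headers.
port: the number or name of the port to access. A port number must be in the range 1 to 65535.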

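Some additional notes on how HTTP probes behave:

For an HTTP probe, the kubelet sends an HTTP request to the specified path and port to perform the check
The kubelet sends the probe to the Pod's IP address unless the host field overrides it
If the scheme field is set to HTTPS, the kubelet sends an HTTPS request and skips certificate verification
In most cases you do not need to set the host field. Two cases deserve attention:
- If your container listens on 127.0.0.1 and the Pod's hostNetwork field is true, set host under httpGet to 127.0.0.1
- If your Pod relies on virtual hosts, do not use the host field; set the Host header in httpHeaders instead
For a TCP probe, the kubelet opens the connection from the node, not from inside the Pod, so you cannot use a Service name in the host field because the kubelet cannot resolve it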
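
The defaults described above (Pod IP when host is empty, HTTP when scheme is empty) can be summarized as a small URL-building function. This Go sketch only illustrates how the fields combine; probeURL is a hypothetical helper, not a Kubernetes API, and the IP addresses and port 8443 are made-up values.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// probeURL builds the address an HTTP probe would target. An empty host
// falls back to the Pod IP and an empty scheme falls back to HTTP.
func probeURL(scheme, host, podIP string, port int, path string) string {
	if scheme == "" {
		scheme = "HTTP"
	}
	if host == "" {
		host = podIP
	}
	return strings.ToLower(scheme) + "://" + net.JoinHostPort(host, strconv.Itoa(port)) + path
}

func main() {
	// Defaults only: the probe goes to the Pod IP over plain HTTP.
	fmt.Println(probeURL("", "", "10.0.0.5", 8080, "/healthz"))
	// hostNetwork case: the container listens on 127.0.0.1, so host is set explicitly.
	fmt.Println(probeURL("HTTPS", "127.0.0.1", "10.0.0.5", 8443, "/healthz"))
}
```

The first call yields http://10.0.0.5:8080/healthz and the second https://127.0.0.1:8443/healthz. For virtual hosts the URL stays on the Pod IP and only the Host header changes, which is why that case is handled through httpHeaders rather than host.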

Q&A

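In Kubernetes, why can't I use a Service name in the host field when I use a TCP probe?
Because the kubelet opens the connection from the node, not from inside the Pod, so it cannot resolve the Service name

In Kubernetes, what happens if I set the scheme field under httpGet to HTTPS?
The kubelet sends an HTTPS request and skips certificate verification

In Kubernetes, what is the default host under httpGet?
The Pod's IP

In Kubernetes, what is the default failureThreshold?
3

In Kubernetes, what must successThreshold be set to for a liveness probe?
1

In Kubernetes, what is the shortest interval you can set between health checks?
1 second

In Kubernetes, what is the default interval between health checks?
10 seconds

In Kubernetes, what problem does a startup probe solve?
An application with a long startup time gets a high failure tolerance while it starts. Once startup completes, checking switches to a liveness probe with a low failure tolerance

In Kubernetes, what problem does a liveness probe solve?
A container that is running but is not working properly

In Kubernetes, if one container in a Pod fails its readiness probe, is the Pod considered ready?
No

In Kubernetes, what is a readiness probe for?
Deciding whether a container is ready to start accepting traffic

In Kubernetes, is a container that has not yet passed its readiness probe listed as a backend of Service load balancers?
No

In Kubernetes, what is a startup probe for?
Deciding when a container has finished starting

In Kubernetes, if the startup probe has not passed yet, do the readiness and liveness probes start running?
No

Explain each directive in the following Kubernetes YAML file.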

yaml file:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Answer:

# API version
apiVersion: v1
# The kind of object is Pod
kind: Pod
# Metadata of the Pod
metadata:
  # Labels of the Pod: key/value pairs that selectors can match
  labels:
    test: liveness
  # Name of the Pod
  name: liveness-exec
# Specification the Pod runs with
spec:
  # Container definitions
  containers:
    # Container name
  - name: liveness
    # Image name
    image: k8s.gcr.io/busybox
    # Arguments run when the container starts
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    # Liveness probe
    livenessProbe:
      # Probe type is exec: run the given command; a non-zero exit code counts as failure
      exec:
        # The command to run
        command:
        - cat
        - /tmp/healthy
      # Wait 5 seconds after the container first starts before probing, so everything has time to get ready
      initialDelaySeconds: 5
      # Probe every 5 seconds
      periodSeconds: 5

In Kubernetes, which three probes are available for health checks?
- startup probe
- liveness probe
- readiness probe

Explain each directive in the following Kubernetes YAML file

yaml file:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 3
      periodSeconds: 3

Answer:

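# API version
apiVersion: v1
# The kind of object is Pod
kind: Pod
# Metadata of the Pod
metadata:
  # Labels of the Pod: user-defined key/value pairs that selectors can match
  labels:
    test: liveness
  # Name of the Pod
  name: liveness-http
# Specification the Pod runs with
spec:
  # Container definitions
  containers:
    # Container name
  - name: liveness
    # Image name
    image: k8s.gcr.io/liveness
    # Arguments run when the container starts
    args:
    - /server
    # Liveness probe
    livenessProbe:
      # Probe type is httpGet: call an HTTP endpoint and judge health from the response
      httpGet:
        # Path of the endpoint
        path: /healthz
        # Port of the endpoint
        port: 8080
        # Custom request headers
        httpHeaders:
          # Header name
        - name: Custom-Header
          # Header value
          value: Awesome
      # Wait 3 seconds after the container first starts, so everything has time to get ready
      initialDelaySeconds: 3
      # Probe every 3 seconds
      periodSeconds: 3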

For a Kubernetes liveness probe that uses httpGet, which responses count as success?
200 <= status code < 400

Explain each directive in the following Kubernetes YAML file

yaml file:

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: k8s.gcr.io/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20

Answer:

# API version
apiVersion: v1
# The kind of object is Pod
kind: Pod
# Metadata of the Pod
metadata:
  # Name of the Pod
  name: goproxy
  # Labels of the Pod: user-defined key/value pairs that selectors can match
  labels:
    app: goproxy
# Specification the Pod runs with
spec:
  # Container definitions
  containers:
    # Container name
  - name: goproxy
    # Image name
    image: k8s.gcr.io/goproxy:0.1
    # Port definitions
    ports:
      # The container port is 8080
    - containerPort: 8080
    # Readiness probe
    readinessProbe:
      # Probe type is tcpSocket
      tcpSocket:
        # Port to probe
        port: 8080
      # Wait 5 seconds after the container first starts
      initialDelaySeconds: 5
      # Probe every 10 seconds
      periodSeconds: 10
    # Liveness probe
    livenessProbe:
      # Probe type is tcpSocket
      tcpSocket:
        # Port to probe is 8080
        port: 8080
      # Wait 15 seconds after the container first starts
      initialDelaySeconds: 15
      # Probe every 20 seconds
      periodSeconds: 20

Explain the directives in the following Kubernetes YAML file

yaml file:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port

Answer:

# Port definitions
ports:
  # Port name
- name: liveness-port
  # Container port number
  containerPort: 8080
  # Host port number
  hostPort: 8080

# Liveness probe
livenessProbe:
  # Probe type is httpGet
  httpGet:
    # Path of the endpoint
    path: /healthz
    # Use the named port defined above
    port: liveness-port

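In Kubernetes, suppose my application needs a longer startup time, for example it needs a while after the container starts before it works properly. Which probe lets the liveness probe begin checking as soon as the application has finished starting?
A startup probe

With the following Kubernetes YAML file, what does Kubernetes do with this Pod if the startup probe has not succeeded after 300 seconds?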

yaml file:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Answer:
The container is killed

With the following Kubernetes YAML file, what does Kubernetes do with this Pod once the startupProbe succeeds?

yaml file:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Answer:
The livenessProbe starts running and takes over the checking

Explain each directive in the following Kubernetes YAML file

yaml file:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Answer:

# Port definitions
ports:
  # Port name
- name: liveness-port
  # Container port
  containerPort: 8080
  # Host port
  hostPort: 8080

# Liveness probe
livenessProbe:
  # Probe type is httpGet
  httpGet:
    # Path to call
    path: /healthz
    # Port to call, using the named port defined above
    port: liveness-port
  # A single failed probe counts as failure
  failureThreshold: 1
  # Probe every 10 seconds
  periodSeconds: 10

# Startup probe
startupProbe:
  # Probe type is httpGet
  httpGet:
    # Path to probe
    path: /healthz
    # Port to probe
    port: liveness-port
  # Only 30 failed probes in a row count as failure
  failureThreshold: 30
  # Probe every 10 seconds
  periodSeconds: 10

In Kubernetes, what happens when a readiness probe fails?
The Pod stops receiving traffic from Kubernetes Services

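In Kubernetes, when I use the httpGet probe type, where does the probe request go?
The kubelet sends the request to the Pod's IP address by default, unless the host field overrides it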

In Kubernetes, when I use the tcpSocket probe type, is the connection made from the Pod or from the node?
From the node

In Kubernetes, what happens when a startup probe fails?
The container is killed and then handled according to the Pod's restartPolicy

#Kubernetes #Kubernetes health check
