A survey of health check features in open-source systems

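1. Purpose

Simplify troubleshooting. Like the "health report" in 360: open a page and quickly see whether the system is healthy and, if not, why.
I built one of these myself at work years ago.
Monitoring and alerting: detect problems promptly.
Automated operations:
For example, a deployment system uses health checks to control traffic ramp-up automatically and to halt a batched rollout.
For example, a service registry or lb uses health checks to shift traffic and evict nodes.
For example, a resource scheduler uses health checks to restart containers automatically.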

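2. Categories

https://mozillazg.com/2019/01/notes-about-design-health-checking.html
Health checks exist at many layers; the technical layer and the business layer each have their own definition of "healthy".
A framework should therefore let the business layer implement its own health check plugins.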
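
The plugin idea can be sketched roughly as below. This is only an illustration and is not taken from any framework surveyed here; HealthRegistry, register, run_all, and the two sample checks are hypothetical names.

```python
from typing import Callable, Dict, Tuple

# A check returns (healthy, reason). This signature is an assumption of the sketch.
Check = Callable[[], Tuple[bool, str]]


class HealthRegistry:
    """Hypothetical registry that technical and business layers both plug into."""

    def __init__(self) -> None:
        self._checks: Dict[str, Check] = {}

    def register(self, name: str, check: Check) -> None:
        self._checks[name] = check

    def run_all(self) -> Dict[str, dict]:
        report = {}
        for name, check in self._checks.items():
            try:
                healthy, reason = check()
            except Exception as exc:  # a crashing plugin is reported as unhealthy
                healthy, reason = False, f"check raised {exc!r}"
            report[name] = {"healthy": healthy, "reason": reason}
        return report


registry = HealthRegistry()
# Technical-layer plugin (the result is hard-coded for illustration).
registry.register("disk", lambda: (True, "enough free space"))
# Business-layer plugin: what "healthy" means is decided by the business code.
registry.register("orders", lambda: (False, "order queue backlog too long"))

report = registry.run_all()
```

The report keeps a reason next to each result, which is what makes the "health report" page useful for troubleshooting.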

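3. Health-check-based monitoring/scheduling systems vs log-based monitoring systems

Health-check-based monitoring and log-based monitoring appear to do the same job, so what is the difference?
One way to look at it: a health check can be connected to the infrastructure through several kinds of interface:
a. Expose an API
b. Call the infra's API to report status
c. Write logs
d. exec

During a release, for example, the deployment system has to control traffic ramp-up and halt the batched rollout based on health checks. Logs are a poor fit here; use an API.
At runtime any of these works. If a log-based monitoring system already exists, emitting health checks as logs is worth considering so that the existing infrastructure is reused.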
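
To make the interface options concrete, here is a small sketch in which one check result is surfaced three ways: as an API response, as a log line, and as an exit code for exec. Option b (reporting to the infra's API) is left out because it depends on the infra. All function names and the log line format are assumptions made for this sketch.

```python
import json
import logging


def check_health() -> dict:
    """Hypothetical check; a real one would inspect the process and its resources."""
    return {"status": "UP", "reason": ""}


def as_api_response(result: dict) -> tuple:
    """(a) Expose an API: the caller gets an answer right away, e.g. during a rollout."""
    code = 200 if result["status"] == "UP" else 503
    return code, json.dumps(result)


def as_log_line(result: dict) -> str:
    """(c) Write logs: an existing log-based monitoring pipeline picks the line up later."""
    return "health_check " + json.dumps(result, sort_keys=True)


def as_exit_code(result: dict) -> int:
    """(d) exec: the infra runs a command and reads the exit code (0 means healthy)."""
    return 0 if result["status"] == "UP" else 1


result = check_health()
status_code, body = as_api_response(result)
logging.getLogger("health").info(as_log_line(result))
exit_code = as_exit_code(result)
```

The check itself stays the same; only the adapter toward the infrastructure changes.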

4. Product design survey

4.1. Spring Boot Actuator

videos:
https://www.bilibili.com/video/BV1Nf4y117W6?p=4

blogs:
https://bigjar.github.io/2018/08/19/Spring-Boot-Actuator-健康检查、审计、统计和监控/
https://www.liaoxuefeng.com/wiki/1252599548343744/1282386381766689
https://www.baeldung.com/spring-boot-actuators

It exposes many kinds of HTTP endpoints for querying runtime information. For the list of endpoints, see:
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-endpoints

OOD (object-oriented design):
endpoint---(group)---indicator

Data model:
/actuator/health
https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot

// http://localhost:8080/actuator/health
// HTTP/1.1 200 OK

{
  "status": "UP",
  "components": {
    "diskSpace": {
      "status": "UP",
      "details": { //...
      }
    },
    "livenessProbe": {
      "status": "UP"
    },
    "ping": {
      "status": "UP"
    },
    "readinessProbe": {
      "status": "UP"
    }
  },
  "groups": [
    "liveness",
    "readiness"
  ]
}

/actuator/health/liveness
https://segmentfault.com/a/1190000022515968

{
  "status": "UP",
  "components": {
    "livenessProbe": {
      "status": "UP"
    }
  }
}
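
The endpoint---(group)---indicator relationship behind these two responses can be sketched as below. This is a Python illustration of the data model only, not Spring's implementation; the worst-first status ordering and the hard-coded indicator results are assumptions of the sketch.

```python
from typing import Callable, Dict, List, Optional

# Worst-first ordering used to pick the overall status (an assumption of this sketch).
SEVERITY = ["DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN"]

# Indicators: each one reports the status of a single component.
INDICATORS: Dict[str, Callable[[], str]] = {
    "diskSpace": lambda: "UP",
    "livenessProbe": lambda: "UP",
    "ping": lambda: "UP",
    "readinessProbe": lambda: "UP",
}

# Groups: named subsets of indicators, each served under its own path.
GROUPS: Dict[str, List[str]] = {
    "liveness": ["livenessProbe"],
    "readiness": ["readinessProbe"],
}


def aggregate(statuses: List[str]) -> str:
    """The overall status is the worst status among the components."""
    if not statuses:
        return "UNKNOWN"
    return min(statuses, key=SEVERITY.index)


def health(group: Optional[str] = None) -> dict:
    """Endpoint: the whole view when group is None, otherwise one group's view."""
    names = GROUPS[group] if group else list(INDICATORS)
    components = {name: {"status": INDICATORS[name]()} for name in names}
    body = {
        "status": aggregate([c["status"] for c in components.values()]),
        "components": components,
    }
    if group is None:
        body["groups"] = sorted(GROUPS)
    return body


liveness_view = health("liveness")
full_view = health()
```

Here health("liveness") has the same shape as the /actuator/health/liveness sample, and health() has the shape of the /actuator/health sample minus the details field.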

/actuator/info
The data structure in version 1.x is as follows:
https://www.baeldung.com/spring-boot-actuators#4-info-endpoint

{
    "app" : {
        "version" : "1.0.0",
        "description" : "This is my first spring boot application",
        "name" : "Spring Sample Application"
    }
}

UI:

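spring-boot-admin: graphical monitoring
A dashboard project written by the community
https://segmentfault.com/a/1190000017816452
Spring Boot tutorial series (video on bilibili)
Spring Boot Actuator can feed its data into Prometheus + Grafana
This simply means exposing an HTTP endpoint for Prometheus to scrape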
https://bigjar.github.io/2018/08/19/Spring-Boot-Metrics%E7%9B%91%E6%8E%A7%E4%B9%8BPrometheus-Grafana/#%E5%A2%9E%E5%8A%A0Micrometer-Prometheus-Registry%E5%88%B0%E4%BD%A0%E7%9A%84Spring-Boot%E5%BA%94%E7%94%A8
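
As a rough idea of what such a scrape endpoint returns, the sketch below renders component health in the Prometheus text exposition format. The metric name app_health_status and the component label are hypothetical, and this is not what Actuator or Micrometer actually emit.

```python
from typing import Dict


def render_health_metrics(components: Dict[str, str]) -> str:
    """Render one gauge sample per component: 1 when UP, 0 otherwise."""
    lines = [
        "# HELP app_health_status 1 if the component is UP, 0 otherwise.",
        "# TYPE app_health_status gauge",
    ]
    for name, status in sorted(components.items()):
        value = 1 if status == "UP" else 0
        # Label values are assumed to need no escaping in this sketch.
        lines.append(f'app_health_status{{component="{name}"}} {value}')
    return "\n".join(lines) + "\n"


text = render_health_metrics({"diskSpace": "UP", "db": "DOWN"})
```

An HTTP handler would serve this text on the scrape path, and Prometheus would pull it on its own schedule.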

4.2. k8s

https://open.163.com/newview/movie/free?pid=NFVMPP3I7&mid=DFVMPV196
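Health checks in Kubernetes are implemented with liveness probes and readiness probes.
The probe mechanisms currently supported include:

HTTP

Kubernetes requests a path. If the response status code is at least 200 and below 400, the application is marked healthy; otherwise it is marked unhealthy.

TCP
Exec command

Kubernetes also supports graceful shutdown.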
https://aijishu.com/a/1060000000024274
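
The three probe mechanisms can be imitated with the standard library as below. This is a sketch of the success rules only (status code range, connection established, exit code 0), not the kubelet's code; function names and timeouts are assumptions.

```python
import socket
import subprocess
import sys
import urllib.error
import urllib.request
from typing import List


def http_status_is_success(code: int) -> bool:
    """HTTP rule: any status code from 200 up to but excluding 400 passes."""
    return 200 <= code < 400


def http_probe(url: str, timeout: float = 1.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return http_status_is_success(resp.status)
    except urllib.error.HTTPError as err:  # raised by urllib for 4xx/5xx responses
        return http_status_is_success(err.code)
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """TCP rule: the probe passes if a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exec_probe(command: List[str], timeout: float = 1.0) -> bool:
    """Exec rule: the probe passes if the command exits with code 0."""
    try:
        return subprocess.run(command, timeout=timeout).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


exec_ok = exec_probe([sys.executable, "-c", "raise SystemExit(0)"], timeout=10.0)
```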

Q: liveness vs readiness? They seem to overlap.
https://segmentfault.com/a/1190000022053869
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/#fnref2
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-revisited-how-to-avoid-shooting-yourself-in-the-other-foot/
They carry two different semantics: liveness governs restarts, readiness governs traffic routing.

readiness.gif

liveness.gif

Q: How do you avoid a cluster-wide storm of automatic restarts?
A: Do not check dependencies in the liveness probe.
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/#fnref2
https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-revisited-how-to-avoid-shooting-yourself-in-the-other-foot/
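
A minimal sketch of that split: liveness looks only at the process's own state, so a dependency outage removes instances from traffic instead of restarting all of them at once. Whether readiness should check dependencies is a separate design decision; it does so here purely for illustration. AppState and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    event_loop_responsive: bool  # internal to the process
    database_reachable: bool     # external dependency


def liveness(state: AppState) -> bool:
    """Internal state only: a restart cannot repair a broken dependency."""
    return state.event_loop_responsive


def readiness(state: AppState) -> bool:
    """Can this instance usefully take traffic right now?"""
    return state.event_loop_responsive and state.database_reachable


def orchestrator_action(state: AppState) -> str:
    """The two semantics: liveness governs restarts, readiness governs routing."""
    if not liveness(state):
        return "restart container"
    if not readiness(state):
        return "remove from load balancing"
    return "serve traffic"


# The database is down for every instance, but no instance gets restarted.
outage = AppState(event_loop_responsive=True, database_reachable=False)
action = orchestrator_action(outage)
```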

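Q: Exposing a liveness endpoint vs letting the process exit abnormally
A: Both lead to an automatic restart by k8s, but I personally prefer the latter: the former design depends heavily on the k8s probe logic, while the latter has no such dependency.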

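Q: What should the liveness check return while the application is still starting?
A: Failure. There is no need to worry about being killed during startup: the initialDelaySeconds parameter tells the kubelet how long to wait before running the first probe. Probing starts only after that delay; if the application is still starting by then, it counts as a startup timeout and the container can be killed and restarted.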
https://jimmysong.io/kubernetes-handbook/guide/configure-liveness-readiness-probes.html
https://blog.csdn.net/cainiaofly/article/details/84324321
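
The effect of initialDelaySeconds can be seen in a small simulation. initialDelaySeconds, periodSeconds, and failureThreshold are real probe fields, but the loop below is a simplified model of the schedule and not the kubelet's algorithm; the 20-second startup time is made up.

```python
from typing import Callable, Optional


def first_restart_time(
    is_alive_at: Callable[[int], bool],
    initial_delay_seconds: int,
    period_seconds: int,
    failure_threshold: int,
    horizon_seconds: int,
) -> Optional[int]:
    """Return the second at which the container would be restarted, or None."""
    consecutive_failures = 0
    for t in range(initial_delay_seconds, horizon_seconds + 1, period_seconds):
        if is_alive_at(t):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return t
    return None


def app_is_alive(t: int) -> bool:
    """Hypothetical app whose liveness check fails until startup finishes at t=20."""
    return t >= 20


# The first probe runs after startup has finished: never restarted.
with_delay = first_restart_time(app_is_alive, 30, 10, 3, 120)
# No delay: probes at t=0, 5, 10 all fail, so the container is restarted mid-startup.
without_delay = first_restart_time(app_is_alive, 0, 5, 3, 120)
```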

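4.3. Spring on k8s

https://segmentfault.com/a/1190000022515968
The probes are available at "/actuator/health/liveness" and "/actuator/health/readiness".
After receiving a graceful shutdown notification, the embedded web server refuses new requests:

Graceful shutdown is supported by all four embedded web servers (Jetty, Reactor Netty, Tomcat, and Undertow) and by both reactive and Servlet-based web applications. When enabled, application shutdown includes a grace period of configurable duration. During the grace period, existing requests are allowed to complete but no new requests are permitted. The exact way in which new requests are refused depends on the web server in use: Jetty, Reactor Netty, and Tomcat stop accepting requests, while Undertow accepts requests but responds immediately with a service unavailable (503) response.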
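
The grace-period behaviour can be sketched as a small gate in front of request handling. This is an illustration of the idea, not how Spring or any of the four servers implement it; GracefulGate and its methods are hypothetical, and new requests are refused Undertow-style with a 503.

```python
import threading


class GracefulGate:
    """Hypothetical request gate: refuse new work, let in-flight work finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._shutting_down = False

    def try_begin_request(self) -> int:
        """Return 200 if the request is admitted, 503 once shutdown has begun."""
        with self._cond:
            if self._shutting_down:
                return 503
            self._in_flight += 1
            return 200

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, grace_period_seconds: float) -> bool:
        """Stop admitting requests, then wait up to the grace period. True if drained."""
        with self._cond:
            self._shutting_down = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=grace_period_seconds
            )


gate = GracefulGate()
first = gate.try_begin_request()                   # admitted before shutdown
finisher = threading.Timer(0.05, gate.end_request)  # the in-flight request ends shortly
finisher.start()
drained = gate.shutdown(grace_period_seconds=2.0)
late = gate.try_begin_request()                    # refused after shutdown began
finisher.join()
```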

4.4. Sidecars in a service mesh

envoy

Health-checking others

The sidecar acts as the lb and checks the machines it forwards traffic to
https://www.qikqiak.com/envoy-book/detect-service-health-with-healthchecks/
https://www.servicemesher.com/envoy/intro/arch_overview/health_checking.html
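- Active
Has a fast-failure mechanism
https://www.servicemesher.com/envoy/intro/arch_overview/health_checking.html
Has a filter mechanism
- Passive
Between sidecars?
The check volume can get too high; should the local service be protected from it?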
https://www.servicemesher.com/envoy/intro/arch_overview/health_checking.html
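
The difference between active and passive checking can be sketched with a per-host tracker. This is a simplified model and not Envoy's implementation: the thresholds, the fail_fast flag, and the consecutive-5xx rule are assumptions chosen for illustration, and how a real proxy decides to fail fast is out of scope.

```python
class HostHealth:
    """Tracks one upstream host from the load balancer's point of view."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2,
                 consecutive_5xx_limit: int = 5) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.consecutive_5xx_limit = consecutive_5xx_limit
        self.healthy = True
        self._fails = 0
        self._passes = 0
        self._consecutive_5xx = 0

    def on_active_check(self, passed: bool, fail_fast: bool = False) -> None:
        """Active: the proxy itself sends a periodic health request to the host."""
        if passed:
            self._fails = 0
            self._passes += 1
            if not self.healthy and self._passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._passes = 0
            self._fails += 1
            # Fast failure skips the threshold and marks the host down at once.
            if fail_fast or self._fails >= self.unhealthy_threshold:
                self.healthy = False

    def on_real_response(self, status_code: int) -> None:
        """Passive: infer health from live traffic without sending extra requests."""
        if status_code >= 500:
            self._consecutive_5xx += 1
            if self._consecutive_5xx >= self.consecutive_5xx_limit:
                self.healthy = False
        else:
            self._consecutive_5xx = 0


host = HostHealth()
host.on_active_check(passed=False)
after_one_failure = host.healthy            # threshold of 3 not reached yet
host.on_active_check(passed=False, fail_fast=True)
after_fail_fast = host.healthy              # marked down immediately
host.on_active_check(passed=True)
host.on_active_check(passed=True)
recovered = host.healthy                    # two passes reach healthy_threshold
for _ in range(5):
    host.on_real_response(502)
after_5xx_run = host.healthy                # passive detection marks it down
```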

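Exposed for infra to check

Inside the sidecar
In Istio, the Envoy sidecar's health check endpoint is localhost:15020/healthz/ready
It returns 200 only after xDS configuration initialization has completed; otherwise it returns 503
https://zhaohuabing.com/post/2020-09-05-istio-sidecar-dependency/
The app served by the sidecar
It looks like the infra (k8s) check is simply passed through to the local service; a filter can be configured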
https://www.servicemesher.com/envoy/intro/arch_overview/health_checking.html#arch-overview-health-checking-filter
https://zhuanlan.zhihu.com/p/335008284

Dapr

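It does not health-check others; it only exposes an endpoint for infra to check.

Sidecar internal state
It just exposes an HTTP endpoint; for example, curl http://localhost:3500/v1.0/healthz returns an HTTP status code
It can modify the k8s configuration automatically to integrate with k8s probes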
https://docs.dapr.io/developing-applications/building-blocks/observability/sidecar-health/
https://docs.dapr.io/zh-hans/developing-applications/building-blocks/observability/sidecar-health/

API:
https://docs.dapr.io/zh-hans/reference/api/health_api/

The app served by the sidecar
There is an endpoint that Dapr calls on the app to check actor health
https://docs.dapr.io/reference/api/actors_api/#health-check
There is also "Querying actor state externally"

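4.5. Which infra acts on health checks?

Resource schedulers
k8s
Monitoring systems
For example, Spring Boot Actuator can feed its data into Prometheus
https://bigjar.github.io/2018/08/19/Spring-Boot-Metrics%E7%9B%91%E6%8E%A7%E4%B9%8BPrometheus-Grafana/#%E5%A2%9E%E5%8A%A0Micrometer-Prometheus-Registry%E5%88%B0%E4%BD%A0%E7%9A%84Spring-Boot%E5%BA%94%E7%94%A8
Load balancers / service registries
Either the load balancer calls the endpoint itself to perform health checks, or an automated operations system performs them and, on detecting a failure, calls the load balancer to shift traffic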
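
Both arrangements can be sketched with one small round-robin balancer: refresh() is the case where the load balancer runs the checks itself, and mark() is the case where an external operations system reports the result. The class and method names are hypothetical.

```python
from typing import Callable, List


class LoadBalancer:
    """Round-robin over the nodes currently considered healthy."""

    def __init__(self, nodes: List[str]) -> None:
        self._nodes = list(nodes)
        self._healthy = set(nodes)
        self._next = 0

    def mark(self, node: str, healthy: bool) -> None:
        """Push model: an ops system ran the check and tells the lb the result."""
        if healthy:
            self._healthy.add(node)
        else:
            self._healthy.discard(node)

    def refresh(self, check: Callable[[str], bool]) -> None:
        """Pull model: the lb calls each node's health endpoint itself."""
        for node in self._nodes:
            self.mark(node, check(node))

    def healthy_nodes(self) -> List[str]:
        return [n for n in self._nodes if n in self._healthy]

    def pick(self) -> str:
        candidates = self.healthy_nodes()
        if not candidates:
            raise RuntimeError("no healthy nodes")
        node = candidates[self._next % len(candidates)]
        self._next += 1
        return node


lb = LoadBalancer(["a", "b", "c"])
lb.mark("b", healthy=False)              # an ops system evicts node b
picks = [lb.pick() for _ in range(4)]    # traffic only reaches a and c
lb.refresh(lambda node: node != "c")     # the lb's own check: b is back, c fails
after_refresh = lb.healthy_nodes()
```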

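4.6. What data specifications exist in the industry?

Specifications for collecting and reporting monitoring metrics

Micrometer

A metrics API for code running on the JVM. Think SLF4J, but for application metrics

Micrometer is a metrics instrumentation library for JVM applications. Applications collect performance metrics by calling its common API, and it connects to many popular monitoring systems such as Prometheus and Datadog. Because it implements the adapters for the different monitoring systems internally, switching monitoring systems becomes easy. Its design goal is to improve portability while adding almost no overhead to metrics collection. It bills itself as the SLF4J of monitoring and is convenient for measuring SLA metrics.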
https://www.freesion.com/article/55201066755/

https://www.cnblogs.com/cjsblog/p/11556029.html
https://www.tony-bro.com/posts/1386774700/index.html
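
The "SLF4J for metrics" idea is a facade: application code talks to one neutral interface and the backend adapter is swapped underneath. The sketch below shows that shape in Python; it is not Micrometer's API, and MetricsFacade, InMemoryBackend, LineBackend, and the line format are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MetricsFacade(ABC):
    """Backend-neutral interface; application code depends on this only."""

    @abstractmethod
    def increment(self, name: str, amount: float = 1.0) -> None:
        ...


class InMemoryBackend(MetricsFacade):
    """Adapter that keeps counters in the process, e.g. for a pull-based system."""

    def __init__(self) -> None:
        self.counters: Dict[str, float] = {}

    def increment(self, name: str, amount: float = 1.0) -> None:
        self.counters[name] = self.counters.get(name, 0.0) + amount


class LineBackend(MetricsFacade):
    """Adapter that formats each data point as a line, e.g. for a push-based system."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def increment(self, name: str, amount: float = 1.0) -> None:
        self.lines.append(f"{name} +{amount}")


def handle_request(metrics: MetricsFacade) -> None:
    """Application code: unaware of which monitoring system is behind the facade."""
    metrics.increment("http.requests")


memory = InMemoryBackend()
handle_request(memory)
handle_request(memory)

lines = LineBackend()
handle_request(lines)
```

Switching monitoring systems means constructing a different backend; handle_request does not change.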

Prometheus and similar systems each have their own data collection specification

https://segmentfault.com/a/1190000023491231

Health check specifications

k8s has its own: it looks at the HTTP status
spring-boot has its own as well
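
The two conventions meet at the HTTP status code: a status-in-the-body convention has to be mapped to a code before a probe that only reads the code can use it. A sketch of such a bridge follows; the mapping table is an assumption made for illustration.

```python
import json
from typing import Tuple

# Assumed mapping from a body-level status to an HTTP status code.
HTTP_CODE_BY_STATUS = {
    "UP": 200,
    "UNKNOWN": 200,
    "DOWN": 503,
    "OUT_OF_SERVICE": 503,
}


def to_http(status: str) -> Tuple[int, str]:
    """Application side: put the status in the body and mirror it in the code."""
    code = HTTP_CODE_BY_STATUS.get(status, 200)  # unmapped statuses default to 200 here
    return code, json.dumps({"status": status})


def probe_passes(http_code: int) -> bool:
    """Probe side: only the status code is inspected, never the body."""
    return 200 <= http_code < 400


up_code, up_body = to_http("UP")
down_code, down_body = to_http("DOWN")
```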

Service discovery specifications (unrelated to health checks)

xDS
In envoy the xDS API is called the Data plane API
https://skyao.io/learning-envoy/xds/
https://www.servicemesher.com/blog/the-universal-data-plane-api/
https://cloudnative.to/envoy/intro/arch_overview/operations/dynamic_configuration.html
HDS

// HDS is Health Discovery Service. It compliments Envoy’s health checking
// service by designating this Envoy to be a healthchecker for a subset of hosts
// in the cluster. The status of these health checks will be reported to the
// management server, where it can be aggregated etc and redistributed back to
// Envoy through EDS.
https://github.com/envoyproxy/data-plane-api/blob/main/envoy/service/discovery/v2/hds.proto

Envoy is responsible for health-checking other nodes; HDS appears to be the policy that controls this capability.