Today I logged into an old cluster of mine and found that kubectl get nodes errored out. After systemctl restart kubelet, systemctl status showed the kubelet still failing to start, so I ran journalctl -u kubelet and found the first clue:
Nov 25 11:21:21 master01 kubelet[4699]: F1125 11:21:21.705999 4699 server.go:273] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
The root cause: the kubelet's certificates had expired, and when it tried to rotate them automatically (bootstrap), it could not find the bootstrap config file it needed.
My first move was kubeadm init phase kubeconfig kubelet to replace the kubelet's credentials. Afterwards systemctl status kubelet showed a green running state, but the log kept repeating:
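This kind of expiry can be confirmed directly with `openssl x509 -checkend`, no running kubelet needed. A minimal sketch, using a throwaway self-signed certificate as a stand-in; on a typical kubeadm node you would point it at /var/lib/kubelet/pki/kubelet-client-current.pem instead:

```shell
# Generate a throwaway cert as a stand-in; on a real node, inspect
# /var/lib/kubelet/pki/kubelet-client-current.pem instead.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/kubelet-demo.key \
  -out /tmp/kubelet-demo.crt -days 365 -nodes -subj "/CN=demo" 2>/dev/null

# -checkend N exits 0 if the cert is still valid N seconds from now.
if openssl x509 -checkend 0 -noout -in /tmp/kubelet-demo.crt; then
  echo "certificate still valid"
else
  echo "certificate expired"
fi
```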
Nov 25 11:31:45 master01 kubelet[8807]: E1125 11:31:45.773667 8807 kubelet.go:2263] node "master01" not found
Nov 25 11:31:45 master01 kubelet[8807]: E1125 11:31:45.874075 8807 kubelet.go:2263] node "master01" not found
Nov 25 11:31:45 master01 kubelet[8807]: E1125 11:31:45.974426 8807 kubelet.go:2263] node "master01" not found
Nov 25 11:31:46 master01 kubelet[8807]: E1125 11:31:46.074729 8807 kubelet.go:2263] node "master01" not found
Nov 25 11:31:46 master01 kubelet[8807]: E1125 11:31:46.175350 8807 kubelet.go:2263] node "master01" not found
This was because my node had been offline too long, since August when the certificates expired; the Kubernetes Controller Manager judged the node dead and removed it from the cluster.
Running
kubectl get nodes --kubeconfig /etc/kubernetes/admin.conf
failed with
Unable to connect to the server: EOF
which means the API Server process is crashing or refusing connections. Combined with the expired kubelet certificate, it was now all but certain that the certificates of every control-plane component (API Server, Controller Manager, Scheduler) had expired as well.
Run this to check whether the certificates really had expired:
kubeadm alpha certs check-expiration
The output confirmed it:
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration
W1125 11:37:40.755202 15501 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1125 11:37:40.755392 15501 validation.go:28] Cannot validate kubelet config - no validator is available
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Aug 07, 2025 06:49 UTC <invalid>
apiserver Aug 07, 2025 06:49 UTC <invalid> ca
apiserver-etcd-client Aug 07, 2025 06:49 UTC <invalid> etcd-ca
apiserver-kubelet-client Aug 07, 2025 06:49 UTC <invalid> ca
controller-manager.conf Aug 07, 2025 06:49 UTC <invalid>
etcd-healthcheck-client Aug 07, 2025 06:49 UTC <invalid> etcd-ca
etcd-peer Aug 07, 2025 06:49 UTC <invalid> etcd-ca
etcd-server Aug 07, 2025 06:49 UTC <invalid> etcd-ca
front-proxy-client Aug 07, 2025 06:49 UTC <invalid> front-proxy-ca
scheduler.conf Aug 07, 2025 06:49 UTC <invalid>
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Aug 05, 2034 06:49 UTC 8y no
etcd-ca Aug 05, 2034 06:49 UTC 8y no
front-proxy-ca Aug 05, 2034 06:49 UTC 8y no
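The expiry columns above can also be read straight off the certificate files with openssl, which is handy when kubeadm itself misbehaves. A sketch, demonstrated on a throwaway directory; on a real master the directory would be /etc/kubernetes/pki:

```shell
# PKI_DIR is a stand-in; on a kubeadm master use /etc/kubernetes/pki.
PKI_DIR=/tmp/demo-pki
mkdir -p "$PKI_DIR"
# One throwaway cert so the loop has something to print.
openssl req -x509 -newkey rsa:2048 -keyout "$PKI_DIR/apiserver-demo.key" \
  -out "$PKI_DIR/apiserver-demo.crt" -days 365 -nodes -subj "/CN=demo" 2>/dev/null

for c in "$PKI_DIR"/*.crt; do
  printf '%-24s ' "$(basename "$c")"
  openssl x509 -enddate -noout -in "$c"   # prints notAfter=<date>
done
```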
The complete fix is below. My setup has three masters, so this must be run on every master:
# Renew all certificates
kubeadm alpha certs renew all
# Regenerate the controller-manager kubeconfig
rm -f /etc/kubernetes/controller-manager.conf
kubeadm init phase kubeconfig controller-manager
# Regenerate the scheduler kubeconfig
rm -f /etc/kubernetes/scheduler.conf
kubeadm init phase kubeconfig scheduler
# Regenerate the admin kubeconfig (the one you use yourself)
rm -f /etc/kubernetes/admin.conf
kubeadm init phase kubeconfig admin
# Make the API Server pick up the new certificates (restart). Static Pods
# (API Server etc.) do not hot-reload certificates, so they must be restarted.
# The safest way is to temporarily move the manifest files away.
# Move the manifests away (this stops the control-plane components)
mkdir -p /tmp/k8s-manifests-backup
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests-backup/
# Wait ~20 seconds to make sure the containers have stopped
# Confirm the containers are gone (if there is still output, wait a bit longer)
docker ps | grep kube-apiserver
# Move the manifests back (this restarts the components)
mv /tmp/k8s-manifests-backup/*.yaml /etc/kubernetes/manifests/
# Wait ~60 seconds for the components to start
# Update the kubectl credentials
cp /etc/kubernetes/admin.conf ~/.kube/config
chown $(id -u):$(id -g) ~/.kube/config
Then the next problem: after running the commands above, kubectl get nodes failed with
The connection to the server 192.168.239.100:6443 was refused - did you specify the right host or port?
which means the API Server process never came up.
Find the corresponding container id, then:
docker logs xxxx --tail 20
The log showed:
panic: context deadline exceeded, with the error located in etcd.go
So the API Server was failing to start because it could not reach the etcd database. I shifted attention to etcd: found the etcd container id and ran docker logs on it:
2025-11-25 03:57:22.461205 I | embed: rejected connection from "192.168.239.102:52020" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2025-11-25 03:57:22.489233 I | embed: rejected connection from "192.168.239.101:43752" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
2025-11-25 03:57:22.489416 I | embed: rejected connection from "192.168.239.102:52034" (error "tls: failed to verify client's certificate: x509: certificate has expired or is not yet valid", ServerName "")
At this point it was obvious: <strong>because I had not yet renewed the certificates on the other masters</strong>, their certificates were also expired, so etcd had lost quorum and was down.
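What etcd is doing in those rejection lines is verifying each client certificate against its CA (etcd-ca), a check that also fails once the cert's validity window has passed. The same chain check can be reproduced offline with `openssl verify`. A sketch with throwaway certs standing in for etcd-ca and apiserver-etcd-client:

```shell
# Throwaway CA standing in for etcd-ca.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/etcd-ca.key \
  -out /tmp/etcd-ca.crt -days 365 -nodes -subj "/CN=etcd-ca" 2>/dev/null
# Client cert signed by that CA (standing in for apiserver-etcd-client).
openssl req -newkey rsa:2048 -keyout /tmp/client.key -out /tmp/client.csr \
  -nodes -subj "/CN=kube-apiserver-etcd-client" 2>/dev/null
openssl x509 -req -in /tmp/client.csr -CA /tmp/etcd-ca.crt \
  -CAkey /tmp/etcd-ca.key -CAcreateserial -days 365 \
  -out /tmp/client.crt 2>/dev/null
# etcd's TLS check is equivalent to verifying the chain, validity dates
# included; an expired client cert fails here too.
openssl verify -CAfile /tmp/etcd-ca.crt /tmp/client.crt
```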
Yet even after I finished renewing on all masters, kubectl get nodes still failed:
Unable to connect to the server: x509: certificate is valid for 10.96.0.1, 192.168.239.103, 192.168.239.103, not 192.168.239.100
<strong>The certificate does not include the IP being accessed, 192.168.239.100. kubectl therefore treats the server as an impostor and refuses to connect, so the apiserver certificate has to be regenerated and the restart procedure above repeated.</strong>
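To see exactly which SANs a certificate carries, print its subjectAltName extension with openssl; a missing IP there is precisely what triggers this kubectl error. A sketch with a throwaway cert (requires OpenSSL 1.1.1+ for -addext); on a master you would instead inspect /etc/kubernetes/pki/apiserver.crt:

```shell
# Throwaway cert with explicit SANs, standing in for apiserver.crt.
openssl req -x509 -newkey rsa:2048 -keyout /tmp/san-demo.key \
  -out /tmp/san-demo.crt -days 1 -nodes -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=IP:10.96.0.1,IP:192.168.239.103,DNS:master01" \
  2>/dev/null
# Print the SAN list; if the IP you connect to is absent here,
# kubectl refuses the connection with the x509 error above.
openssl x509 -noout -text -in /tmp/san-demo.crt | grep -A1 'Subject Alternative Name'
```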
# Delete the old API Server certificate
rm /etc/kubernetes/pki/apiserver.crt
rm /etc/kubernetes/pki/apiserver.key
# Regenerate a certificate containing all the IPs
kubeadm init phase certs apiserver \
--apiserver-cert-extra-sans=192.168.239.100,192.168.239.101,192.168.239.102,192.168.239.103,master01,master02,master03,127.0.0.1
# Restart the API Server
docker rm -f $(docker ps -a -q --filter "label=component=kube-apiserver")
# Update admin.conf and verify
# 1. Regenerate admin.conf
rm /etc/kubernetes/admin.conf
kubeadm init phase kubeconfig admin
# 2. Overwrite the default kubectl config
cp /etc/kubernetes/admin.conf ~/.kube/config
chown $(id -u):$(id -g) ~/.kube/config
# 3. Verify
kubectl get nodes
A short summary (on the `Unable to connect to the server: x509: certificate is valid for 10.96.0.1, 192.168.239.103, 192.168.239.103, not 192.168.239.100` error)
A "chicken and egg" deadlock:
- The config really exists: when the cluster was first installed, kubeadm faithfully wrote the specified parameters (VIP, master IPs) into the kubeadm-config ConfigMap. After the cluster recovered, I checked the ConfigMap and they were all there.
- But the cluster was "blind": by the time I discovered the expired certificates, the API Server was already down (for the reasons above).
- Failure at the critical moment: when I ran kubeadm alpha certs renew, kubeadm tried to connect to the API Server to read that ConfigMap and learn which IPs belong in the certificate. The connection failed (the certificates had expired, so nothing was reachable), and kubeadm reported "Error reading configuration from the Cluster".
- Fallback strategy: with the ConfigMap unreadable, kubeadm degraded so it could still sign something: it ignored the ConfigMap (unreachable) and probed the local machine instead, looking only at the local hostname (master01) and NIC IP (192.168.239.100), and from that limited information issued a bare-minimum certificate.
- Consequence: that bare-minimum certificate naturally lacks the VIP (103) and the other nodes' IPs, because that information lived only in the unreachable ConfigMap, not on the local NIC.
- In hindsight: once I had renewed the certificates on the other masters too, etcd recovered and so did the API Server; at that point a plain certs renew would have succeeded, and falling back to init phase certs would not have been necessary.
Done!