Notes on the problems hit while upgrading an HA cluster from v1.9.6 to v1.13 with kubeadm

In this project kubeadm was used to upgrade Kubernetes from v1.9.6 to v1.13.4. Since kubeadm cannot skip minor versions, the rough flow was:

  • v1.9.6 -> v1.10.8
  • v1.10.8 -> v1.11.5
  • v1.11.5 -> v1.12.4
  • v1.12.4 -> v1.13.4
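
Each hop followed the same basic pattern. Below is a rough sketch of a single hop (v1.9.6 -> v1.10.8 shown), assuming an rpm-based install; the exact package names and how nodes are drained depend on the environment:

# upgrade the kubeadm binary first (package naming depends on the repo in use)
yum install -y kubeadm-1.10.8
# check that the cluster can be upgraded and list the available target versions
kubeadm upgrade plan
# upgrade the control plane components on this master
kubeadm upgrade apply v1.10.8 -y
# then upgrade kubelet/kubectl on every node and restart the kubelet
yum install -y kubelet-1.10.8 kubectl-1.10.8
systemctl daemon-reload && systemctl restart kubelet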

While upgrading cluster A from v1.9.6 to v1.10.8, running the upgrade produced the following:

[root@cloud-cn-master-1 ~]# kubeadm upgrade apply v1.10.8 -y
...
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.8"...
Static pod: kube-apiserver-rz-dev-master01 hash: a82830fd687fdabd030b65ee6c4b4fd4
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
Static pod: kube-scheduler-rz-dev-master01 hash: a0adc2bf23e7d5336ecd4677ce95938c
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests500670946"
[controlplane] Adding extra host path mount "k8s" to "kube-controller-manager"
[upgrade/staticpods] current and new manifests of kube-apiserver are equal, skipping upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-04-22-13-04-58/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
Static pod: kube-controller-manager-rz-dev-master01 hash: 1a23c184fadb64c889a41831476c56e8
...

The upgrade kept hanging at this hash-verification step. Nothing like this had shown up in earlier tests, and the only difference between this environment and the test environment was that its certificates had been renewed: the default validity of one year had been extended to 30 years. So the certificates were rolled back first and the upgrade was run again, which succeeded; the underlying cause was not investigated further.
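
As a side note, a quick way to see what a certificate renewal actually changed is to dump the subject and validity window of the control-plane certificates. A minimal check, assuming the default kubeadm certificate layout (shown for reference, not what was run at the time):

# print subject and validity dates for the main control-plane certificates
for c in /etc/kubernetes/pki/apiserver.crt \
         /etc/kubernetes/pki/apiserver-kubelet-client.crt \
         /etc/kubernetes/pki/front-proxy-client.crt; do
    echo "== ${c} =="
    openssl x509 -in "${c}" -noout -subject -dates
done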

In another environment, upgrading from v1.12 to v1.13.4 hit the following:

[root@cloud-cn-master-1 ~]# kubeadm upgrade apply v1.13.4 -y
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
FATAL: failed to get node registration: node doesn't have kubeadm.alpha.kubernetes.io/cri-socket annotation
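
One way to check which nodes actually carry this annotation, using nothing but standard kubectl output (a minimal sketch):

# print the kubeadm cri-socket annotation (if any) for every node
for n in $(kubectl get nodes -o name); do
    echo "== ${n} =="
    kubectl get "${n}" -o yaml | grep 'kubeadm.alpha.kubernetes.io/cri-socket' || echo "  annotation missing"
done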

The error says the node is missing an annotation. The master on which kubeadm was being run does have the cri-socket annotation, but the other two masters do not, so the annotation was added to them by hand and the upgrade was run again:

[root@cloud-cn-master-1 ~]# kubectl annotate node <nodename> kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
[root@cloud-cn-master-1 ~]# kubeadm upgrade apply v1.13.4 -y
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/config] FATAL: failed to getAPIEndpoint: failed to get APIEndpoint information for this node

This time it fails to find the APIEndpoint, so let's look at the data in the kubeadm-config ConfigMap:

[root@cloud-cn-master-1 ~]# kubectl -n kube-system get cm kubeadm-config -oyaml|grep -A5 cloud-cn-master-1
  ClusterStatus: |
    apiEndpoints:
      cloud-cn-master-1:
        advertiseAddress: 10.0.128.251
        bindPort: 6443
    apiVersion: kubeadm.k8s.io/v1beta1
    kind: ClusterStatus

So the apiEndpoints entry does exist, but only for this one node. Editing the ConfigMap to add apiEndpoint entries for the other two masters and re-running the upgrade just brought back the hash-verification failure, so the next step was to read the kubeadm source and work out what this config phase actually does. The relevant part is quoted below:

// if this isn't a new controlplane instance (e.g. in case of kubeadm upgrades)
// get nodes specific information as well
if !newControlPlane {
    // gets the nodeRegistration for the current node from the Node object
    if err := getNodeRegistration(kubeconfigDir, client, &initcfg.NodeRegistration); err != nil {
        return nil, errors.Wrap(err, "failed to get node registration")
    }
    // gets the APIEndpoint for the current node from the ClusterStatus in the kubeadm-config ConfigMap
    if err := getAPIEndpoint(configMap.Data, initcfg.NodeRegistration.Name, &initcfg.LocalAPIEndpoint); err != nil {
        return nil, errors.Wrap(err, "failed to getAPIEndpoint")
    }
}

This is exactly the path taken during an upgrade: getNodeRegistration is called first, then getAPIEndpoint. Let's look at getAPIEndpoint:

// getAPIEndpoint returns the APIEndpoint for the current node
func getAPIEndpoint(data map[string]string, nodeName string, apiEndpoint *kubeadmapi.APIEndpoint) error {
    // gets the ClusterStatus from kubeadm-config
    clusterStatus, err := UnmarshalClusterStatus(data)
    if err != nil {
        return err
    }

    // gets the APIEndpoint for the current machine from the ClusterStatus
    e, ok := clusterStatus.APIEndpoints[nodeName]
    if !ok {
        return errors.New("failed to get APIEndpoint information for this node")
    }

    apiEndpoint.AdvertiseAddress = e.AdvertiseAddress
    apiEndpoint.BindPort = e.BindPort
    return nil
}

It looks up nodeName in the data of the kubeadm-config ConfigMap to get the APIEndpoint, i.e. the AdvertiseAddress and BindPort. The APIEndpoint entry had already been confirmed to exist, so the next question was whether the nodeName being passed in is correct. Since nodeName comes from NodeRegistration, let's look at how NodeRegistration is obtained:

// getNodeRegistration returns the nodeRegistration for the current node
func getNodeRegistration(kubeconfigDir string, client clientset.Interface, nodeRegistration *kubeadmapi.NodeRegistrationOptions) error {
    // gets the name of the current node
    nodeName, err := getNodeNameFromKubeletConfig(kubeconfigDir)
    if err != nil {
        return errors.Wrap(err, "failed to get node name from kubelet config")
    }

    // gets the corresponding node and retrieves attributes stored there.
    node, err := client.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
    if err != nil {
        return errors.Wrap(err, "failed to get corresponding node")
    }

    criSocket, ok := node.ObjectMeta.Annotations[constants.AnnotationKubeadmCRISocket]
    if !ok {
        return errors.Errorf("node %s doesn't have %s annotation", nodeName, constants.AnnotationKubeadmCRISocket)
    }

    // returns the nodeRegistration attributes
    nodeRegistration.Name = nodeName
    nodeRegistration.CRISocket = criSocket
    nodeRegistration.Taints = node.Spec.Taints
    // NB. currently nodeRegistration.KubeletExtraArgs isn't stored at node level but only in the kubeadm-flags.env
    //     that isn't modified during upgrades
    //     in future we might reconsider this thus enabling changes to the kubeadm-flags.env during upgrades as well
    return nil
}

So nodeName comes from getNodeNameFromKubeletConfig, which means it is read from the kubelet.conf file. Here is getNodeNameFromKubeletConfig:

// getNodeNameFromKubeletConfig gets the node name from a kubelet config file
// TODO: in future we want to switch to a more canonical way for doing this e.g. by having this
//       information in the local kubelet config.yaml
func getNodeNameFromKubeletConfig(kubeconfigDir string) (string, error) {
    // loads the kubelet.conf file
    fileName := filepath.Join(kubeconfigDir, constants.KubeletKubeConfigFileName)
    config, err := clientcmd.LoadFromFile(fileName)
    if err != nil {
        return "", err
    }

    // gets the info about the current user
    authInfo := config.AuthInfos[config.Contexts[config.CurrentContext].AuthInfo]

    // gets the X509 certificate with current user credentials
    var certs []*x509.Certificate
    if len(authInfo.ClientCertificateData) > 0 {
        // if the config file uses an embedded x509 certificate (e.g. kubelet.conf created by kubeadm), parse it
        if certs, err = certutil.ParseCertsPEM(authInfo.ClientCertificateData); err != nil {
            return "", err
        }
    } else if len(authInfo.ClientCertificate) > 0 {
        // if the config file links an external x509 certificate (e.g. kubelet.conf created by TLS bootstrap), load it
        if certs, err = certutil.CertsFromFile(authInfo.ClientCertificate); err != nil {
            return "", err
        }
    } else {
        return "", errors.New("invalid kubelet.conf. X509 certificate expected")
    }

    // We are only putting one certificate in the certificate pem file, so it's safe to just pick the first one
    // TODO: Support multiple certs here in order to be able to rotate certs
    cert := certs[0]

    // gets the node name from the certificate common name
    return strings.TrimPrefix(cert.Subject.CommonName, constants.NodesUserPrefix), nil
}

So kubeadm loads the local kubelet.conf, takes the client-certificate-data of the user configured in the current context, parses the certificate, and uses its Subject CommonName (minus the system:node: prefix) as the node name, which is then used to look up the apiEndpoint in kubeadm-config. So let's decode the certificate on this machine and check its CN:

# extract client-certificate-data from kubelet.conf and base64-decode it to recover the certificate
[root@cloud-cn-master-1 ~]# grep client-certificate-data /etc/kubernetes/kubelet.conf | awk '{print $2}' | base64 -d > kubelet.crt
[root@cloud-cn-master-1 ~]# openssl x509 -in kubelet.crt -text |grep -i Subject
        Subject: O=system:nodes, CN=system:node:cloud-cn-master-2
        Subject Public Key Info:

👆 OK, the cause is located: the CommonName extracted from the certificate does not match this machine. On cloud-cn-master-1, kubeadm ends up with cloud-cn-master-2 as the nodeName, so of course it cannot find the annotation for that node. Manually annotating only papers over one step: the apiEndpoint lookup in the ConfigMap then fails for the same reason, and even if that were hand-edited as well, the later static pod hash verification would still never pass. The root cause is that, when the certificates were renewed, the kubelet client certificate in kubelet.conf on this master was overwritten with one issued for another node, and the whole chain of failures followed from that.
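
One possible way to repair the affected master is to regenerate kubelet.conf for the correct node name with the kubeadm binary already installed for the upgrade (v1.13.4 here). This is only a sketch under those assumptions; check kubeadm init phase kubeconfig kubelet --help for the exact flags of your version before running it:

# back up the kubelet kubeconfig that carries the wrong CommonName, then remove it
# so kubeadm will generate a fresh one
cp /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.bak
rm /etc/kubernetes/kubelet.conf

# regenerate kubelet.conf with a client certificate issued for this node's name
# (add --apiserver-advertise-address if the default server address is not the desired one)
kubeadm init phase kubeconfig kubelet --node-name cloud-cn-master-1

# restart the kubelet so it picks up the new credentials
systemctl restart kubelet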