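name: gke-expert
description: Provides expert guidance for Google Kubernetes Engine (GKE) operations, including cluster management, workload deployment, scaling, monitoring, troubleshooting, and optimization. Use when working with GKE clusters, Kubernetes deployments on GCP, or container orchestration, or when the user needs help with kubectl commands, GKE networking, autoscaling, Workload Identity, or GKE-specific features such as Autopilot, Binary Authorization, or Config Sync.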
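GKE Expert

Initial assessment

When a user asks for GKE help, determine:

- Cluster type: Autopilot or Standard?
- Task: create, deploy, scale, troubleshoot, or optimize?
- Environment: development, staging, or production?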
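Quick-start workflows

Create a cluster

Autopilot (recommended for most users):

    gcloud container clusters create-auto CLUSTER_NAME \
      --region=REGION \
      --release-channel=regular

Standard (for specific node requirements):

    gcloud container clusters create CLUSTER_NAME \
      --zone=ZONE \
      --num-nodes=3 \
      --enable-autoscaling \
      --min-nodes=2 \
      --max-nodes=10

Always fetch credentials after creating the cluster (for a zonal cluster such as the Standard example, pass --zone=ZONE instead of --region=REGION):

    gcloud container clusters get-credentials CLUSTER_NAME --region=REGION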
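The location flag has to match how the cluster was created: regional clusters take `--region`, zonal clusters take `--zone`. A minimal sketch of picking the flag from the location name follows. `location_flag` is a hypothetical helper, not part of gcloud, and it assumes that zone names are a region name followed by a hyphen and a single letter (for example `us-central1-a`).

```shell
# Hypothetical helper (not a gcloud feature): print the location flag
# that matches a GCP location name.
# Assumption: zones look like "<region>-<letter>", regions end in a digit.
location_flag() {
  case "$1" in
    *-[a-z]) printf '%s=%s\n' '--zone' "$1" ;;
    *)       printf '%s=%s\n' '--region' "$1" ;;
  esac
}

# Example use (requires gcloud; shown as a comment so the sketch has no side effects):
#   gcloud container clusters get-credentials CLUSTER_NAME "$(location_flag us-central1-a)"
```

Keeping the choice in one function avoids the common mismatch of creating a cluster with `--zone` and then fetching credentials with `--region`.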
Deploy an application

Create a Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: APP_NAME
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: APP_NAME
      template:
        metadata:
          labels:
            app: APP_NAME
        spec:
          containers:
            - name: APP_NAME
              image: gcr.io/PROJECT_ID/IMAGE:TAG
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 512Mi

Apply it and expose the Service:

    kubectl apply -f deployment.yaml
    kubectl expose deployment APP_NAME --type=LoadBalancer --port=80 --target-port=8080
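If the Service should live in version control next to the Deployment, the `kubectl expose` call can be written as a manifest instead. The sketch below prints a roughly equivalent Service. `service_manifest` is a hypothetical helper, and it assumes the Pods carry the `app: APP_NAME` label and listen on port 8080, as in the Deployment manifest in this section.

```shell
# Hypothetical helper: print a LoadBalancer Service manifest for an app.
# Assumption: Pods are labelled "app: <name>" and serve on port 8080.
service_manifest() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $1
spec:
  type: LoadBalancer
  selector:
    app: $1
  ports:
    - port: 80
      targetPort: 8080
EOF
}

# Example use (requires kubectl and cluster credentials):
#   service_manifest APP_NAME | kubectl apply -f -
```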
Set up autoscaling

HPA for Pods:

    kubectl autoscale deployment APP_NAME --cpu-percent=70 --min=2 --max=100

Cluster autoscaler (Standard only):

    gcloud container clusters update CLUSTER_NAME \
      --enable-autoscaling --min-nodes=2 --max-nodes=10 --zone=ZONE
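The Pod-level autoscaler in this section can also be declared as a manifest, which makes the CPU target and replica bounds reviewable. This is a sketch, assuming the `autoscaling/v2` API is available on the cluster and that the HPA is named after the Deployment. `hpa_manifest` is a hypothetical helper.

```shell
# Hypothetical helper: print an autoscaling/v2 HPA manifest.
# Arguments: deployment name, min replicas, max replicas, target CPU percent.
hpa_manifest() {
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: $1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $1
  minReplicas: $2
  maxReplicas: $3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: $4
EOF
}

# Example use (requires kubectl and cluster credentials):
#   hpa_manifest APP_NAME 2 100 70 | kubectl apply -f -
```

CPU utilization is measured against the container's CPU request, so the Deployment needs `resources.requests.cpu` set for this target to work.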
Configure Workload Identity

Enable it on the cluster:

    gcloud container clusters update CLUSTER_NAME \
      --workload-pool=PROJECT_ID.svc.id.goog

Link the service accounts:

    # Create the GCP service account
    gcloud iam service-accounts create GSA_NAME

    # Create the Kubernetes service account
    kubectl create serviceaccount KSA_NAME

    # Bind them
    gcloud iam service-accounts add-iam-policy-binding \
      GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
      --role roles/iam.workloadIdentityUser \
      --member "serviceAccount:PROJECT_ID.svc.id.goog[default/KSA_NAME]"

    # Annotate the Kubernetes service account
    kubectl annotate serviceaccount KSA_NAME \
      iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
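The member string and the annotation value are easy to mistype, and the namespace in the member must match the namespace of the Kubernetes service account. A small sketch that builds both strings from their parts follows. `wi_member` and `wi_annotation` are hypothetical helpers that follow the formats used in the commands in this section.

```shell
# Hypothetical helpers: build the Workload Identity strings from their parts.

# Arguments: project ID, Kubernetes namespace, Kubernetes service account name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Arguments: GCP service account name, project ID.
wi_annotation() {
  printf 'iam.gke.io/gcp-service-account=%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Example use (requires gcloud and kubectl):
#   gcloud iam service-accounts add-iam-policy-binding \
#     GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
#     --role roles/iam.workloadIdentityUser \
#     --member "$(wi_member PROJECT_ID default KSA_NAME)"
#   kubectl annotate serviceaccount KSA_NAME "$(wi_annotation GSA_NAME PROJECT_ID)"
```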
Troubleshooting guide

Pod issues

    # Pod not starting: check events
    kubectl describe pod POD_NAME
    kubectl get events --field-selector involvedObject.name=POD_NAME

Common fixes:

- ImagePullBackOff: check that the image exists and that pull secrets are configured
- CrashLoopBackOff: kubectl logs POD_NAME --previous
- Pending: kubectl describe nodes (check resources)
- OOMKilled: increase the memory limit
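The list of common fixes can be read as a lookup from Pod status to next step. The sketch below encodes only the four cases listed in this section and falls back to `kubectl describe pod` for anything else. `next_step` is a hypothetical helper, and it prints the suggestion without running anything.

```shell
# Hypothetical helper: suggest the next troubleshooting step for a Pod.
# Arguments: Pod status or reason, Pod name.
# It only prints a suggestion; it does not call kubectl.
next_step() {
  case "$1" in
    ImagePullBackOff) echo "check that the image exists and that pull secrets are configured" ;;
    CrashLoopBackOff) echo "kubectl logs $2 --previous" ;;
    Pending)          echo "kubectl describe nodes" ;;
    OOMKilled)        echo "increase the memory limit" ;;
    *)                echo "kubectl describe pod $2" ;;
  esac
}

# Example use:
#   next_step CrashLoopBackOff POD_NAME
```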
Service issues

    # No endpoints
    kubectl get endpoints SERVICE_NAME
    kubectl get pods -l app=APP_NAME  # check that Pods match the selector

    # Test connectivity
    kubectl run test --image=busybox -it --rm -- wget -O- SERVICE_NAME

Performance issues

    # Check resource usage
    kubectl top nodes
    kubectl top pods --all-namespaces

    # Find bottlenecks
    kubectl describe resourcequotas
    kubectl describe limitranges

Production patterns

Ingress with HTTPS

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: APP_NAME-ingress
      annotations:
        networking.gke.io/managed-certificates: "CERT_NAME"
    spec:
      rules:
        - host: example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: APP_NAME
                    port:
                      number: 80

Pod disruption budget

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: APP_NAME-pdb
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: APP_NAME

Security context

    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: app
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]

Cost optimization
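- Use Autopilot for automatic resource right-sizing
- Enable the cluster autoscaler with appropriate limits
- Use Spot VMs for non-critical workloads:

      gcloud container node-pools create spot-pool \
        --cluster=CLUSTER_NAME \
        --spot \
        --num-nodes=2

- Set resource requests and limits appropriately
- Use VPA for recommendations: kubectl describe vpa APP_NAME-vpa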
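`kubectl describe vpa` only returns recommendations if a VerticalPodAutoscaler object already exists for the workload. A minimal sketch of one in recommendation-only mode follows, assuming vertical Pod autoscaling is enabled on the cluster and the target is a Deployment. `vpa_manifest` is a hypothetical helper, and the `-vpa` suffix follows the name used in the list above.

```shell
# Hypothetical helper: print a VPA manifest that only produces recommendations.
# Argument: Deployment name. updateMode "Off" means no Pods are evicted or resized.
vpa_manifest() {
  cat <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: $1-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $1
  updatePolicy:
    updateMode: "Off"
EOF
}

# Example use (requires kubectl and cluster credentials):
#   vpa_manifest APP_NAME | kubectl apply -f -
#   kubectl describe vpa APP_NAME-vpa
```

Recommendation-only mode is the cautious choice when an HPA already scales the same Deployment on CPU, since it leaves the actual requests under manual control.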
Essential commands

    # Cluster management
    gcloud container clusters list
    kubectl config get-contexts
    kubectl cluster-info

    # Deployment
    kubectl rollout status deployment/APP_NAME
    kubectl rollout undo deployment/APP_NAME
    kubectl scale deployment APP_NAME --replicas=5

    # Debugging
    kubectl logs -f POD_NAME --tail=50
    kubectl exec -it POD_NAME -- /bin/bash
    kubectl port-forward pod/POD_NAME 8080:80

    # Monitoring
    kubectl top nodes
    kubectl top pods
    kubectl get events --sort-by='.lastTimestamp'
External documentation

For detailed documentation beyond this skill:

- Official GKE documentation: https://cloud.google.com/kubernetes-engine/docs
- kubectl reference: https://kubernetes.io/docs/reference/kubectl/
- GKE best practices: https://cloud.google.com/kubernetes-engine/docs/best-practices
- Workload Identity: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
- GKE pricing calculator: https://cloud.google.com/products/calculator
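Cleanup

    kubectl delete all -l app=APP_NAME
    kubectl drain NODE_NAME --ignore-daemonsets

Advanced topics reference

For complex scenarios, refer to:

- Stateful workloads: use StatefulSets with persistent volumes
- Batch jobs: use Jobs/CronJobs with an appropriate backoff policy
- Multi-region: use Multi Cluster Ingress or Traffic Director
- Service mesh: install Anthos Service Mesh for advanced networking features
- GitOps: implement Config Sync or Flux for declarative management
- Monitoring: integrate with Cloud Monitoring or install Prometheus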