---
name: karpenter
description: Kubernetes node autoscaling and cost optimization with Karpenter. Use when implementing node provisioning, spot instance management, cluster right-sizing, node consolidation, or reducing compute costs. Covers NodePool configuration, EC2NodeClass setup, disruption budgets, spot/on-demand mixing strategies, multi-architecture support, and capacity type selection.
triggers:
- karpenter
- node autoscaling
- nodepool
- ec2nodeclass
- provisioner
- spot instances
- on-demand instances
- node consolidation
- node termination
- cluster autoscaling
- right-sizing
- capacity type
- node disruption
- compute costs
- instance selection
- graviton
- arm64
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---
# Karpenter

## Overview

Karpenter is a Kubernetes node autoscaler that provisions right-sized compute in response to changing application load. Unlike Cluster Autoscaler, which scales predefined node groups, Karpenter provisions nodes based on aggregate pod resource requests, enabling better bin packing and cost optimization.

### Key Differences from Cluster Autoscaler

- **Direct provisioning**: Talks to the cloud provider API directly (no node groups required)
- **Fast scaling**: Provisions nodes in seconds rather than minutes
- **Flexible instance selection**: Automatically chooses from all available instance types
- **Consolidation**: Proactively replaces nodes with cheaper alternatives
- **Spot instance optimization**: First-class support with automatic fallback

### When to Use Karpenter

- Running workloads with diverse resource requirements
- Needing fast scaling (sub-minute response)
- Optimizing costs with spot instances and Graviton (ARM64)
- Consolidating to reduce cluster waste and overprovisioning
- Clusters with unpredictable or bursty workloads
- Right-sizing infrastructure based on actual usage patterns
- Automatically managing mixed capacity types (spot/on-demand)
## Instructions

### 1. Installation and Setup

- Install the Karpenter controller in the cluster
- Configure cloud provider credentials (IAM roles)
- Set up instance profiles and security groups
- Create NodePools for different workload types
- Define an EC2NodeClass (AWS) or the equivalent for your provider

### 2. Design the NodePool Strategy

- Separate NodePools for different workload classes
- Define instance type families and sizes
- Configure the spot/on-demand mix
- Set resource limits per NodePool
- Plan multi-AZ distribution

### 3. Configure Disruption Management

- Set disruption budgets to control churn
- Configure the consolidation policy
- Define expiry windows for node lifecycle
- Handle workload-specific disruption constraints
- Test disruption scenarios

### 4. Optimize Cost and Performance

- Enable consolidation for cost savings
- Use spot instances with a fallback strategy
- Set accurate resource requests on pods (Karpenter depends on accurate requests)
- Monitor node utilization and waste
- Adjust instance type restrictions based on usage
- Leverage Graviton (ARM64) instances for roughly 20% lower cost
- Configure capacity type weights to prefer spot over on-demand
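Capacity-type preference can be sketched with NodePool `spec.weight`: Karpenter evaluates higher-weight pools first, so a spot-only pool with a high weight is tried before a lower-weight on-demand pool. A minimal sketch (the pool names and weight values are illustrative assumptions):

```yaml
# Preferred pool: spot only; higher weight means Karpenter considers it first
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-preferred   # illustrative name
spec:
  weight: 100
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
# Fallback pool: on-demand only; used when the spot pool cannot satisfy pods
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback   # illustrative name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

Splitting capacity types across weighted pools also lets each pool carry its own limits and disruption settings, which a single mixed pool cannot express.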
### 5. Cost Optimization Strategies

- **Spot instances**: Configure a 70-90% spot mix for fault-tolerant workloads
- **Graviton (ARM64)**: Use the c7g, m7g, and r7g families for lower cost
- **Consolidation**: Enable the WhenUnderutilized policy to replace expensive nodes
- **Instance diversity**: A broad instance family selection improves spot availability
- **Right-sizing**: Let Karpenter bin-pack efficiently instead of overprovisioning

### 6. Spot Instance Management

- Use a broad instance type selection (10+ families) for better spot availability
- Configure automatic fallback to on-demand when spot is unavailable
- Implement Pod Disruption Budgets to bound the blast radius
- Add graceful termination handlers (preStop hooks) to applications
- Monitor spot interruption rates and adjust instance selection
- Spread across availability zones to reduce correlated failures

### 7. Node Consolidation

- **WhenUnderutilized**: Proactively replaces nodes with cheaper or smaller alternatives
- **WhenEmpty**: Only consolidates completely empty nodes (conservative)
- Configure a consolidateAfter delay to prevent churn (typically 30s-600s)
- Use disruption budgets to cap the consolidation rate (5-20% per window)
- Pod Disruption Budgets are respected during consolidation
- Set expiry windows to force periodic node refresh
## Best Practices

- **Start conservative**: Begin with a restrictive set of instance types and widen it based on observation
- **Use disruption budgets**: Prevent too many nodes from being disrupted at once
- **Set pod resource requests**: Karpenter relies on accurate requests for scheduling
- **Enable consolidation**: Let Karpenter optimize node utilization automatically
- **Separate workload classes**: Use multiple NodePools for different requirements
- **Monitor provisioning**: Track provisioning latency and failures
- **Test spot interruptions**: Ensure spot instance terminations are handled gracefully
- **Use topology spread**: Combine with pod topology constraints for availability
## Examples

### Example 1: Basic NodePool with Multiple Instance Types

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6a", "c6i", "c7i", "m6a", "m6i", "m7i", "r6a", "r6i", "r7i"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b", "us-west-2c"]
      kubelet:
        maxPods: 110
        systemReserved:
          cpu: 100m
          memory: 100Mi
          ephemeral-storage: 1Gi
        evictionHard:
          memory.available: 5%
          nodefs.available: 10%
        imageGCHighThresholdPercent: 85
        imageGCLowThresholdPercent: 80
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule
    metadata:
      labels:
        workload-type: general
        managed-by: karpenter
  limits:
    cpu: 1000
    memory: 1000Gi
  weight: 10
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: 10%
```
### Example 2: EC2NodeClass for AWS-Specific Configuration

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        kubernetes.io/role/internal-elb: "1"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
    - name: my-cluster-node-security-group
  userData: |
    #!/bin/bash
    echo "Custom node initialization"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  detailedMonitoring: true
  tags:
    Name: karpenter-node
    Environment: production
    ManagedBy: karpenter
    ClusterName: my-cluster
```
### Example 3: Specialized NodePools for Different Workloads

```yaml
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: gpu-nodes
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6", "p4", "p5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
    metadata:
      labels:
        workload-type: gpu
        nvidia.com/gpu: "true"
  limits:
    cpu: 500
    memory: 2000Gi
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6a", "c6i", "c7i", "m6a", "m6i"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["2xlarge", "4xlarge", "8xlarge"]
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
    metadata:
      labels:
        workload-type: batch
        spot-interruption-handler: enabled
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
    budgets:
      - nodes: 20%
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: stateful-workloads
spec:
  template:
    spec:
      nodeClassRef:
        name: stateful-nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["r6i", "r7i"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
      kubelet:
        maxPods: 50
      taints:
        - key: workload-type
          value: stateful
          effect: NoSchedule
    metadata:
      labels:
        workload-type: stateful
        storage-optimized: "true"
  limits:
    cpu: 200
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 600s
    budgets:
      - nodes: 1
```
### Example 4: Disruption Budgets and Consolidation Policies

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: production-apps
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "m6i", "r6i"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
    expireAfter: 720h
    # When multiple budgets are active at once, Karpenter applies the
    # most restrictive one
    budgets:
      # Weekday business hours: at most 5% of nodes disrupted
      - nodes: 5%
        schedule: "0 8 * * MON-FRI"
        duration: 8h
      # Weekday evenings and nights: allow 20%
      - nodes: 20%
        schedule: "0 18 * * MON-FRI"
        duration: 16h
      # Weekends: allow 30%
      - nodes: 30%
        schedule: "0 0 * * SAT"
        duration: 48h
      # Default budget covering any remaining time
      - nodes: 10%
```
### Example 5: Pod Scheduling with Karpenter

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule
      nodeSelector:
        workload-type: general
        karpenter.sh/capacity-type: spot
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-application
                topologyKey: topology.kubernetes.io/zone
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]
            - weight: 30
              preference:
                matchExpressions:
                  - key: karpenter.k8s.aws/instance-size
                    operator: In
                    values: ["2xlarge", "4xlarge"]
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application
      containers:
        - name: app
          image: my-app:latest
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - sleep 15
      terminationGracePeriodSeconds: 30
```
### Example 6: Spot Instance Handling with Fallback

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-with-fallback
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        # Listing both capacity types lets Karpenter fall back to
        # on-demand when no spot capacity is available
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # A broad family selection improves spot availability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - "c5a"
            - "c6a"
            - "c6i"
            - "c7i"
            - "m5a"
            - "m6a"
            - "m6i"
            - "m7i"
            - "r5a"
            - "r6a"
            - "r6i"
            - "r7i"
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
    metadata:
      labels:
        spot-enabled: "true"
      annotations:
        karpenter.sh/spot-to-spot-consolidation: "true"
  weight: 5
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: 25%
```
### Example 7: Karpenter with Pod Disruption Budgets

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule
      containers:
        - name: app
          image: critical-service:latest
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 4Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  # With 6 replicas, at most 2 pods may be disrupted at a time
  minAvailable: 4
  selector:
    matchLabels:
      app: critical-service
```
### Example 8: Multi-Architecture NodePool

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Graviton (ARM64) families alongside their x86 equivalents
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - "c6g"
            - "m6g"
            - "r6g"
            - "c7g"
            - "m7g"
            - "r7g"
            - "c6i"
            - "m6i"
            - "r6i"
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
    metadata:
      labels:
        multi-arch: "true"
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 60s
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```
## Monitoring and Troubleshooting

### Key Metrics to Monitor

- `karpenter_nodes_created_total`
- `karpenter_nodes_terminated_total`
- `karpenter_provisioner_scheduling_duration_seconds`
- `karpenter_disruption_replacement_node_initialized_seconds`
- `karpenter_disruption_consolidation_actions_performed_total`
- `karpenter_disruption_budgets_allowed_disruptions`
- `karpenter_provisioner_instance_type_price_estimate`
- `karpenter_cloudprovider_instance_type_offering_price_estimate`
- `karpenter_pods_state`
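These metrics can feed alerting. A PrometheusRule sketch built on two of the metrics above, assuming the Prometheus Operator is installed; the rule name, namespace, thresholds, and severities are illustrative assumptions to adjust for your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts   # illustrative name
  namespace: monitoring    # illustrative namespace
spec:
  groups:
    - name: karpenter
      rules:
        # Slow provisioning often points to capacity, quota, or subnet problems
        - alert: KarpenterSlowProvisioning
          expr: >
            histogram_quantile(0.95,
              rate(karpenter_provisioner_scheduling_duration_seconds_bucket[10m])) > 60
          for: 15m
          labels:
            severity: warning
        # Zero allowed disruptions for a long period blocks consolidation
        # and node refresh entirely
        - alert: KarpenterDisruptionsBlocked
          expr: karpenter_disruption_budgets_allowed_disruptions == 0
          for: 1h
          labels:
            severity: info
```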
### Common Issues and Solutions

**Issue: Pods stuck in Pending**

- Check that NodePool requirements match the pods' node selectors/tolerations
- Verify cloud provider limits have not been exceeded
- Check instance type availability in the selected zones
- Ensure subnet capacity is available

**Issue: Excessive node churn**

- Increase the consolidation delay (consolidateAfter)
- Review disruption budgets
- Check that pod resource requests are accurate
- Consider WhenEmpty instead of WhenUnderutilized

**Issue: Costs still high despite Karpenter**

- Enable consolidation if it is not active
- Verify that spot instances are actually being used
- Check for pods with unnecessarily large resource requests
- Review instance type restrictions (allow more variety)

**Issue: Spot interruptions causing outages**

- Implement Pod Disruption Budgets
- Diversify instance types to improve spot availability
- Configure adequate replica counts
- Implement graceful shutdown in applications
## Terraform Integration

```hcl
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "v0.33.0"

  values = [
    <<-EOT
    settings:
      clusterName: ${var.cluster_name}
      clusterEndpoint: ${var.cluster_endpoint}
      interruptionQueue: ${var.interruption_queue_name}
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: ${var.karpenter_irsa_arn}
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 2
          memory: 2Gi
    EOT
  ]

  depends_on = [
    aws_iam_role_policy_attachment.karpenter_controller
  ]
}

resource "kubectl_manifest" "karpenter_nodepool_default" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          nodeClassRef:
            name: default
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["c6i", "m6i", "r6i"]
      limits:
        cpu: 1000
        memory: 1000Gi
      disruption:
        consolidationPolicy: WhenUnderutilized
        consolidateAfter: 30s
  YAML

  depends_on = [helm_release.karpenter]
}
```
## Migrating from Cluster Autoscaler

1. **Plan the migration**
   - Identify current node groups and their characteristics
   - Map workloads to new NodePool configurations
   - Plan a coexistence period

2. **Deploy Karpenter alongside Cluster Autoscaler**
   - Install Karpenter in the cluster
   - Create NodePools with distinct labels
   - Test with non-critical workloads first
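During the coexistence period, a tainted NodePool keeps Karpenter-provisioned capacity opt-in: only pods that tolerate the taint land on Karpenter nodes, while Cluster Autoscaler keeps serving everything else. A sketch, assuming the `migrated-to-karpenter` taint key and the pool name are illustrative choices for your migration:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: migration-canary   # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # Only workloads that explicitly tolerate this taint are scheduled
      # onto Karpenter nodes during the migration
      taints:
        - key: migrated-to-karpenter   # illustrative taint key
          value: "true"
          effect: NoSchedule
    metadata:
      labels:
        managed-by: karpenter
```

As workloads gain the matching toleration and prove stable, the taint can be dropped and the pool widened.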
3. **Migrate workloads incrementally**
   - Update pod specs with Karpenter tolerations/node selectors
   - Monitor provisioning and consolidation behavior
   - Validate cost and performance metrics

4. **Remove Cluster Autoscaler**
   - After all workloads have migrated, scale down the CA node groups
   - Remove the Cluster Autoscaler deployment
   - Clean up CA-specific resources