diff --git a/blog/2023-08-16-release/index.md b/blog/2023-08-16-release/index.md new file mode 100644 index 000000000..1139fd22f --- /dev/null +++ b/blog/2023-08-16-release/index.md @@ -0,0 +1,184 @@ +--- +slug: release-v1.3.0 +title: "Koordinator v1.3: 增强资源预留,支持 NRI,提供节点画像的 Mid 资源超卖" +authors: [saintube] +tags: [release] +--- + +## 背景 + +Koordinator 是一个开源项目,旨在基于阿里巴巴在容器调度领域的多年经验,提供一个完整的混部解决方案,包含混部工作负载编排、资源调度、资源隔离及性能调优等多方面能力,来帮助用户优化容器性能,充分发掘空闲物理资源,提升资源效率,增强延迟敏感型工作负载和批处理作业的运行效率和可靠性。 + +在此,我们很高兴地向各位宣布 Koordinator v1.3.0 版本的发布。自 2022 年 4 月发布 v0.1.0 版本以来,Koordinator 迄今迭代发布了共 11 个版本,吸引了了包括阿里巴巴、Intel、小米、小红书、爱奇艺、360、有赞等企业在内的大量优秀工程师参与贡献。在 v1.3.0 版本中,Koordinator 带来了 NRI (Node Resource Interface) 支持、Mid 资源超卖等新特性,并在资源预留、负载感知调度、DeviceShare 调度、负载感知重调度、调度器框架、单机指标采集和资源超卖框架等特性上进行了稳定性修复、性能优化与功能增强。 + +在 v1.3.0 版本中,共有 12 位新加入的开发者参与到了 Koordinator 社区的建设,他们是 @bowen-intel,@BUPT-wxq,@Gala-R,@haoyann,@kangclzjc,@Solomonwisdom,@stulzq,@TheBeatles1994,@Tiana2018,@VinceCui,@wenchezhao,@zhouzijiang,感谢期间各位社区同学的积极参与和贡献,也感谢所有同学在社区的持续投入。 + +## 版本功能特性解读 + +### 资源预留增强 + +资源预留(Reservation)能力自 v0.5.0 版本提出后,经历了一年的打磨和迭代,在 v1.3.0 版本中针对抢占、设备预留、Coscheduling 等场景增强了预留机制,新增 allocatePolicy 字段用于定义不同的预留资源分配策略。最新的资源预留 API 如下: + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo +spec: + # template字段填写reservation对象的资源需求和affinity信息,就像调度pod一样. + template: + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 1Gi + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: topology.kubernetes.io/zone + operator: In + values: + - cn-hangzhou-i + schedulerName: koord-scheduler # 指定koord-scheduler来负责reservation对象的调度. + # 指定可分配预留资源的owners. + owners: + - labelSelector: + matchLabels: + app: app-demo + ttl: 1h + # 指定预留资源是否仅支持一次性的分配. + allocateOnce: true + # 指定预留资源的分配策略,当前支持以下策略: + # - Default: 缺省配置,不限制对预留资源的分配,pod优先分配自节点上的预留资源;如果预留资源不足,则继续分配节点空闲资源。 + # - Aligned: pod优先分配自节点上的预留资源;如果预留资源不足,则继续分配节点空闲资源,但要求这部分资源满足Pod需求。该策略可用于规避pod同时分配多个reservation的资源。 + # - Restricted: 对于预留资源包含的各个资源维度,pod必须分配自预留资源;其余资源维度可以分配节点空闲资源。包含了Aligned策略的语义。 + # 同一节点尚不支持Default策略和Aligned策略或Restricted策略共存。 + allocatePolicy: "Aligned" + # 控制预留资源是否可以使用 + unschedulable: false +``` + +此外,资源预留在 v1.3.0 中还包含了如下兼容性和性能优化: + +1. 增强 Reservation 的抢占,允许 Reservation 内的 Pod 间抢占,拒绝 Reservation 外的 Pod 抢占 Reservation 内的 Pod。 +2. 增强设备预留场景,如果节点上设备资源被部分预留并被 pod 使用,支持剩余资源的分配。 +3. 支持 Reservation 使用 Coscheduling。 +4. 新增 Reservation Affinity协议,允许用户一定从Reservation内分配资源。 +5. 优化 Reservation 兼容性,修复因 Reservation 导致原生打分插件失效的问题。 +6. 优化因引入 Reservation 导致的调度性能回归问题。 +7. 修复 Reservation 预留端口误删除的问题。 + +关于资源预留的设计,详见[Designs - Resource Reservation](/docs/designs/resource-reservation)。 + +### 其他调度增强 + +在 v1.3.0 中,koordinator 在调度和重调度方面还包含如下增强: + +1. DeviceShare 调度 + + - 更改 GPU 资源使用方式,使用 GPU Share API 时,必须声明`koordinator.sh/gpu-memory`或`koordinator.sh/gpu-memory-ratio`,允许不声明`koordinator.sh/gpu-core`。 + - 支持打分,可用于实现 GPU Share 场景和整卡分配场景的 bin-packing 或 spread,并支持卡粒度 binpacking 或 spread。 + - 修复用户误删除 Device CRD 导致调度器内部状态异常重复分配设备的问题。 + +2. 负载感知调度:修复对仅填写 Request 的 Pod 的调度逻辑。 + +3. 调度器框架:优化 PreBind 阶段的 Patch 操作,将多个插件的 Patch 操作合并为一次提交,提升操作效率,降低 APIServer 压力。 + +4. 
重调度 + + - LowNodeLoad 支持按节点池设置不同的负载水位和参数等。自动兼容原有配置。 + - 跳过 schedulerName 不是 koord-scheduler 的Pod,支持配置不同的 schedulerName。 + +### NRI 资源管理模式 + +Koordinator 的 runtime hooks 支持两种模式,standalone 和 CRI proxy,然而这两种模式各自有着一些限制。当前,尽管在 standalone 模式做了很多优化,但当想获得更加及时的 Pod 或容器的事件或者环境变量的注入时还是需要依赖 proxy 模式。然而, proxy 模式要求单独部署 koord-runtime-proxy 组件来代理 CRI (Container Runtime Interface) 请求, 同时需要更改 Kubelet 的启动参数并重启 Kubelet。 + +NRI(Node Resource Interface),即节点资源接口,是 CRI 兼容的容器运行时插件扩展的通用框架,独立于具体的容器运行时(e.g. containerd, cri-o), 提供不同生命周期事件的接口,允许用户在不修改容器运行时源代码的情况下添加自定义逻辑。特别的是,2.0 版本 NRI 只需要运行一个插件实例用于处理所有 NRI 事件和请求,容器运行时通过 Unix-Domain Socket 与插件通信,使用基于 Protobuf 的协议数据,和 1.0 版本 NRI 相比拥有更高的性能,能够实现有状态的 NRI 插件。 + +通过 NRI 的引入,既能及时的订阅 Pod 或者容器的生命周期事件,又避免了对 Kubelet 的侵入修改。在 Koordinator v1.3.0 中,我们引入 NRI 这种社区推荐的方式来管理 runtime hooks 来解决之前版本遇到的问题,大大提升了 Koordinator 部署的灵活性和处理的时效性,提供了一种优雅的云原生系统的资源管理标准化模式。 + +![nri](/img/nri-proposal.png) + +> 注:NRI 模式不支持 docker 的容器运行时,使用 docker 的用户请继续使用 standalone 模式或 proxy 模式。 + +关于 Koordinator 启用 NRI 的部署方式,请见[Installation - Enable NRI Mode Resource Management](/docs/installation#enable-nri-mode-resource-management)。 + +### 节点画像和 Mid 资源超卖 + +Koordinator 中将节点资源分为4种资源优先级模型 Prod、Mid、Batch 和 Free,低优先级资源可以复用高优先级已分配但未使用的物理资源,以提升物理资源利用率;同时,资源优先级越高,提供的资源也越稳定,例如 Batch 资源采用高优先级资源短期(short-term)已分配但未使用的超卖资源,而 Mid 资源采用高优先级资源长周期(long-term)已分配但未使用的超卖资源。不同资源优先级模型如下图所示: + +![resource-priority-model](/img/resource-model.png) + +Koordinator v1.3.0 新增了节点画像能力,基于 Prod 的历史资源用量进行峰值预测,以支持 Mid-tier 的资源超卖调度。Mid 资源的超卖计算公式如下: + +``` +MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio) +ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin)) +``` + +- `ProdPeak`:通过节点画像,预估的节点上已调度 Prod Pod 在中长周期内(e.g. 12h)的用量峰值。 +- `ProdReclaimable`:基于节点画像结果,预估在中长周期内可复用的 Prod 资源。 +- `MidAllocatable`:节点上可分配的 Mid 资源。 + +此外,Mid 资源的单机隔离保障将在下个版本得到完善,相关动态敬请关注[Issue #1442](https://github.com/koordinator-sh/koordinator/issues/1442)。 +在 v1.3.0 版本中,用户可以查看和提交 Mid-tier 的超卖资源,也可以通过以下 Prometheus metrics 来观测节点画像的趋势变化。 + +```bash +# 查看节点的CPU资源画像,reclaimable指标表示预测的可回收资源量,predictor对应不同的预测模型 +koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="cpu", unit="core"} +# 查看节点的内存资源画像,reclaimable指标表示预测的可回收资源量,predictor对应不同的预测模型 +koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="memory", unit="byte"} +``` + +```bash +$ kubectl get node test-node -o yaml +apiVersion: v1 +kind: Node +metadata: + name: test-node +status: + # ... + allocatable: + cpu: '32' + memory: 129636240Ki + pods: '110' + kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods + kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods + capacity: + cpu: '32' + memory: 129636240Ki + pods: '110' + kubernetes.io/mid-cpu: '16000' + kubernetes.io/mid-memory: 64818120Ki +``` + +关于 Koordinator 节点画像的设计,详见[Design - Node Prediction](/docs/designs/node-prediction)。 + +### 其他功能 + +通过 [v1.3.0 Release](https://github.com/koordinator-sh/koordinator/releases/tag/v1.3.0) 页面,可以看到更多包含在 v1.3.0 版本的新增功能。 + +## 未来计划 + +在接下来的版本中,Koordinator 目前规划了以下功能: + +- 硬件拓扑感知调度,综合考虑节点 CPU、内存、GPU 等多个资源维度的拓扑关系,在集群范围内进行调度优化。 +- 提供节点可分配资源的放大机制。 +- NRI 资源管理模式的完善和增强。 + +更多信息,敬请关注 [Milestone v1.4.0](https://github.com/koordinator-sh/koordinator/milestone/12)。 + +## 结语 + +最后,Koordinator 是一个开放的社区,欢迎广大云原生爱好者们随时通过各种方式参与共建,无论您在云原生领域是初学乍到还是驾轻就熟,我们都非常期待听到您的声音! 
diff --git a/blog/authors.yml b/blog/authors.yml index 20a0a1841..8adb5c242 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -33,3 +33,9 @@ zwzhang0107: title: Koordinator maintainer url: https://github.com/zwzhang0107 image_url: https://github.com/zwzhang0107.png + +saintube: + name: Rougang Han + title: Koordinator member + url: https://github.com/saintube + image_url: https://github.com/saintube.png diff --git a/docs/designs/node-prediction.md b/docs/designs/node-prediction.md new file mode 100644 index 000000000..9bda2cc8a --- /dev/null +++ b/docs/designs/node-prediction.md @@ -0,0 +1,278 @@ +# Node Prediction + +## Summary + +The *node prediction* is proposed to both improve the node utilization and avoid overloading. By profiling the +tendency of the node metrics, we can estimate the peak usage and implement more efficient over-commitment policy. + +## Motivation + +Scheduling pods with setting appropriate resource requirements is truly hard to follow. Underestimating requests can +bring performance issues. However, overvaluing requests is likely to cause resource waste and low efficiency. One +common approach is using Vertical Pod Autoscaler (VPA) to autopilot the resource requirements for the pods of the same +workload. The VPA optimizes the resource requirements of the pod according to the pod metrics of the same workload. It +estimates the pod usage and specifies proper resource requirements. It works well when we want to optimize the resource +requirements of workloads. However, most VPA approaches try to abandon the time series attribute from the pod metrics +and generate a relatively static requests/limits that should guarantee to make no bad ignoring the timing. It leaves +the usage-to-limit gap, i.e. the gap between the recommended pod request with the real-time pod usage, and the +well-known pooling effect, i.e. the gap between the sum of the pod usages with the node usage. Inspired by +[Google's work](#references) in the EuroSys'21, we propose the node prediction in Koordinator to conquer these two +gaps. + +### Goals + +- Define the node prediction API. +- Propose an online history-based-optimized (HBO) prediction model. +- Clarify how the Mid-tier resources are calculated with the prediction. + +### Non-Goals/Future Work + +- Propose a time-series-forecasting-based or offline prediction model. + +## User Stories + +### Story 1 + +As a cluster administrator, there are many web service pods allocating almost node resources. Whereas, the node +utilization is low since most allocated resources are not actually used. To improve node utilization, I want to reclaim +the unused resources to submit some low-priority online-service pods and Flink jobs. However, I am concerned with the +risks of over-utilization bringing machine overload which may cause the performance degradation and hurt the pod QoS. + +### Story 2 + +As a Kubernetes developer, I want to support the long-term load balancing in the scheduler. Thus, I need the information +that which nodes should be idle for a long time. + +## Design + +### Design Principles + +- The node prediction is low-cost and can be implemented in the Koordlet. +- The node prediction is pluggable. Users can replace the default model to customize the prediction. + +### Architecture + +The node prediction is implemented mainly in the Koordlet and Koord-Manager. The architecture is as below: + +![image](/img/node-prediction.svg) + +- Koordlet: The agent runs on the node. 
It implements the metrics collection, metrics storage, and predict server. + - Metrics Advisor: It collects the cpu/memory usage of the node and running pods. It stores the collected metrics in the Metric Cache. + - Metric Cache: It stores the node and pod metrics in a TSDB, which allows other modules to query the metrics later. + - Predict Server: With the node and pod metrics retrieved from the Metric Cache, it calculates and checkpoints the predicted result based on the prediction model. + - States Informer: It maintains the metadata of the node and the pods. It also reports the latest prediction periodically to the kube-apiserver. +- Koord-Manager: The controller runs on a master node. + - Configuration delivery: It maintains the prediction and colocation strategies and distributes the node strategy onto the NodeMetric. + - Resource Calculator: It fetches the node prediction result, and calculates the resource allocatable of the reclaimed resources (i.e. Mid-tier resource). +- Koord-Scheduler: It schedules the pod with different priority bands (e.g. Prod, Mid, Batch). It can enable load-aware scheduling to balance the over-committed nodes' utilization. + +#### Workflow + +In the koordlet, stages to update the node prediction are as follows: + +1. Histogram initialization: The predict server initializes a set of histograms for CPU and memory. For implementing `N-Sigma_v1`, it initializes decayed histograms only for the node and priority classes. While implementing `N-Sigma_v2`, it initializes histograms both for the node and every running pod. +2. Metrics collection: The metrics advisor collects the usage statistics of node and pods and stores them as metric points into the metric cache every CollectInterval (e.g. 1s). +3. Histogram updating: The predict server fetches the node metrics and pod metrics of latest HistogramUpdateInterval (e.g. 30s). Then it uses the aggregated result to update the decayed histograms. +4. Periodical reporting: The states informer fetches node metrics and the last histograms for the node and priority classes every ReportingInterval (e.g. 60s). Then it reports the complete NodeMetric status with last node prediction info to the kube-apiserver. +5. Fast reporting: The states informer fetches the last histograms every CheckPredictionInterval (e.g. 20s). It checks if the predicted result is too small or too larger than the last updated prediction exceeding the ResourceDiffThreshold (e.g. 5%), or the updated duration is longer than ForceUpdateInterval (e.g. 600s). If the check result is true, It updates the latest node prediction to the kube-apiserver. + +In the koord-manager, stages to update the Mid-tier resources allocatable are as follows: + +1. NodeMetric lifecycle management: The koord-manager list-watches the Node and the ConfigMap slo-controller-config, and maintains the lifecycle of the NodeMetric CR. Once the colocation strategy in the slo-controller-config updated, the koord-manager parses the config data and updates the node prediction policy and mid colocation policy into the NodeMetric.Spec. +2. Mid resource updating: The koord-manager list-watches the NodeMetric. Once the NodeMetric status is updated, the koord-manager gets the latest node metrics and node prediction, and calculates the Mid allocatable resources based on the Mid over-commitment formula. Finally, it updates the Mid allocatable resources into the Node status as the extended resources (`kubernetes.io/mid-cpu`, `kubernetes.io/mid-memory`). 
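To make step 2 more concrete, the sketch below shows roughly how the Mid allocatable could be derived from the reported prediction and the colocation thresholds. It is a minimal illustration rather than the actual koord-manager code: the function name, the plain int64 arithmetic (milli-cores for CPU, bytes for memory), and the 50% threshold in the usage comment are assumptions for readability.

```go
// midAllocatable applies the Mid over-commitment formula described in the
// Theoretical Model section below:
//
//	Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdPercent / 100)
//
// nodeAllocatable and prodReclaimable share the same unit (e.g. milli-cores or bytes);
// prodReclaimable corresponds to NodeMetric.Status.ProdReclaimableMetric, and
// thresholdPercent to MidCPUThresholdPercent / MidMemoryThresholdPercent.
func midAllocatable(nodeAllocatable, prodReclaimable, thresholdPercent int64) int64 {
	if prodReclaimable < 0 {
		prodReclaimable = 0
	}
	capped := nodeAllocatable * thresholdPercent / 100
	if prodReclaimable < capped {
		return prodReclaimable
	}
	return capped
}

// Example with illustrative values: a 32-core node with 18 reclaimable cores and a
// 50% threshold yields 16000 milli-cores, i.e. kubernetes.io/mid-cpu: '16000'.
// midCPU := midAllocatable(32000, 18000, 50) // 16000
```

The computed values are then written into the Node status as the extended resources shown in the API section below.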
+ +#### Scheduling Optimization + +The results of the node prediction on the NodeMetric, the Mid extended resources on the Node and the scheduling Pod +in the scheduler are updated in different time. It is inevitable to find that the scheduler schedules a pod with an +older version of the node prediction, which may cause the schedule result "lagged". + +To relief the lagged prediction, the koordlet and koord-manager try both updating earlier when the +prediction/NodeMetric differs from the previous result than a threshold and set a resource buffer which should +tolerant most of the result changes between synchronizations. + +For the worst case in which the prediction could be lagged too much (e.g. 1 hour), we can maintain a lower bound of +the real Mid allocatable resources inside the scheduler. This part is not planned in the first version of the Mid-tier +over-commitment. + +### API + +#### Node Prediction + +##### Predict Policy + +```go +// ColocationStrategy defines the colocation strategy in slo-controller-config ConfigMap. +type ColocationStrategy struct { + // ... + NodePredictPolicy *slov1alpha1.PredictPolicy `json:"nodePredictPolicy,omitempty"` +} + +type NodeMetricSpec struct { + // ... + PredictPolicy *PredictPolicy `json:"predictPolicy,omitempty"` +} + +// PredictPolicy defines the policy for the node prediction. +type PredictPolicy struct { + ResourceDiffThresholdPercent *int64 `json:"resourceDiffThresholdPercent,omitempty"` + ColdStartPeriodSeconds *int64 `json:"coldStartPeriodSeconds,omitempty"` +} +``` + +##### Predicted Result + +```go +type NodeMetricStatus struct { + // ... + // ProdReclaimableMetric is the estimated reclaimable resources for the Prod-type pods. + ProdReclaimableMetric *ReclaimableMetric `json:"prodReclaimableMetric,omitempty"` +} + +type ReclaimableMetric struct { + // Resource is the resource usage of the prediction. + Resource ResourceMap `json:"resource,omitempty"` +} +``` + +#### Mid Overcommitment + +##### Colocation Strategy + +```go +type ColocationStrategy struct { + // ... + // MidCPUThresholdPercent defines the maximum percentage of the Mid-tier cpu resource dividing the node allocatable. + // MidCPUAllocatable <= NodeCPUAllocatable * MidCPUThresholdPercent / 100. + MidCPUThresholdPercent *int64 `json:"midCPUThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` + // MidMemoryThresholdPercent defines the maximum percentage of the Mid-tier memory resource dividing the node allocatable. + // MidMemoryAllocatable <= NodeMemoryAllocatable * MidMemoryThresholdPercent / 100. + MidMemoryThresholdPercent *int64 `json:"midMemoryThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` +} +``` + +##### Extended Resources + +```yaml +apiVersion: v1 +kind: Node +metadata: + name: test-node +status: + allocatable: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods + kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods + capacity: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' + kubernetes.io/mid-memory: 64818120Ki +``` + +### Theoretical Model + +#### Node Peak Prediction + +Before elaborating the peak prediction algorithm, let's formalize the node peak prediction problem. + +Let's denote the usage of a Pod `p` at the time `t` is `U(p, t)`. + +Then the usage of a Node `M` which schedules a set of Pods is `MU(Pods, t) = sum[p in Pods](U(p, t))`. 
+ +> Note that the non-Pod usage of the node can be regarded as the usage of a special pod `S`. + +When we want to predict the node peak at the time `T`, we are calculating +`Peak(Pods, T) = max[t >= T](sum[p in Pods](U(p, t)))`. + +The predicted peak `Peak(Pods, T)` is our node prediction result at `T`. + +#### N-sigma Prediction + +There are several [statistical peak prediction models](#alternatives) which are practical to implement in the online +scheduler. [*N-sigma*](#references) is the picked peak prediction model in the current implementation. It assumes the +timing node metrics follow the Gaussian distribution, which allows us to estimate the node peak with the mean and +standard deviation (stdev): + +`Peak_N-Sigma_v1(Pods, T) = mean[T0 <= t <= T](MU(Pods, t)) + N * stdev[T0 <= t <= T](MU(Pods, t))` + +The `Peak_N-Sigma_v1` is the predicted node peak. It is implemented as the first version of node prediction, which is +calculated based on node-level metrics. + +Moreover, we can calculate with the pods' metrics: + +`Peak_Pods-N-Sigma'(Pods, T) = sum[p in Pods](mean[T0 <= t <= T](U(p, t)) + N * stdev[T0 <= t <= T](U(p, t)))` + +A more conservative is derived from their maximal. The `Peak_N-sigma_v2` is the second version of node prediction, +which also considers the pod-level metrics. + +`Peak_N-Sigma_v2(Pods, T) = max(Peak_N-Sigma_v1(Pods, T), Peak_Pods-N-Sigma(Pods, T))`. + +#### Mid-tier Overcommitment + +In the first version, the Mid-tier resource contains the reclaimable resources which are probably unused in the +long-term by the high-priority (i.e. Prod) pods. +The resource calculation for the Mid-tier resources can be described as follows: + +``` +Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) +``` + +- `Reclaimable[Mid] := max(0, reclaimRatio * Allocated[Prod] - Peak[Prod])`. The peak prediction model is used for estimating the future usage of the running Prod pods. The Mid pods can allocate a proportion of reclaimed resources from running Prod pods. +- `NodeAllocatable * thresholdRatio` is the maximal co-located Mid-tier resource setting from a ratio of the node allocatable. + +In next versions, the Mid-tier resource is planned to mix with the default node allocatable (i.e. the Prod allocatable), +which means a Mid pod can allocate the unallocated node allocatable resource, and an idle node is able to schedule Mid +pods. The Prod pods can preempt the Mid pods when the mixed allocatable is exhausted by the Mid pods, so that the +Prod-tier resource is still more stable and guaranteed than the Mid-tier. +Then the resource calculation for the mixed Mid-tier resources can be described as follows: + +``` +Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid] +Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0) +``` + +## Alternatives + +### Peak Prediction Models + +There are several different peak prediction and time series forecasting models which can estimate the future peak +based on the historical node metrics, including statistical methods and machine learning methods. In this proposal, +statistical peak prediction models are preferred since they are practical to implement in the online scheduling system, +have less overhead of metrics collection than the ML approaches, and more simple to analyze and debug. + +Here are some common statistical peak prediction models: + +1. 
[Borg-default](#references) + +Borg-default simply over-commits the machine resources in a fixed rate `a`, which means the peak usage is regarded as +the result of the requests dividing `a`. + +Let's denote the resource request of the Pod `p` at the time `t` is `R(p, t)`, where `R(p, t) = 0` when `p` is not +running. Then we have, + +`Peak_Borg-default(Pods, T) = 1/a * sum[p in Pods](R(p, T))`, `a = 1.1` by default. + +2. [Resource Central](#references) + +Resource Central considers the peak of the machine as the sum of the peak of individual pods (or VMs). And a simple +peak prediction of a pod is the percentile of the historical usages, e.g. `percentile[t in [T-C, T]](U(p, t))`. + +`Peak_ResourceCentral(Pods, T) = sum[p in Pods](percentile[t in [T-C, T]](U(p, t)))` + +3. [Max](#references) + +The Max prediction model does not use the historical metrics directly, but takes the maximal of any known peak results. +It gets the more conservative result than the input models. For example, we have a `Max_Borg-default_ResourceCentral` +model calculated from the Borg-default and Resource Central models: + +`Peak_Max_Borg-default_ResourceCentral(Pods, T) = max(Peak_Borg-default(Pods, T), Peak_ResourceCentral(Pods, T))` + +## References + +1. Vertical Pod Autoscaler: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler +2. Bashir, Noman, et al. "Take it to the limit: peak prediction-driven resource overcommitment in datacenters." Proceedings of the Sixteenth European Conference on Computer Systems. 2021. +3. Cortez, Eli, et al. "Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms." Proceedings of the 26th Symposium on Operating Systems Principles. 2017. diff --git a/docs/designs/nri-mode-resource-management.md b/docs/designs/nri-mode-resource-management.md index 1b9b4b533..507c7e30c 100644 --- a/docs/designs/nri-mode-resource-management.md +++ b/docs/designs/nri-mode-resource-management.md @@ -146,7 +146,7 @@ There are a little difference in execution timing between `NRI` and `proxy` mode ## Upgrade Strategy -- Need to upgrade containerd to 1.7.0+ or CRIO to 1.25.0+ +- Need to upgrade containerd to 1.7.0+ or CRIO to 1.26.0+ - Need to enable NRI diff --git a/docs/installation.md b/docs/installation.md index efd8048f8..91c0795a6 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -20,7 +20,7 @@ $ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ $ helm repo update # Install the latest version. -$ helm install koordinator koordinator-sh/koordinator --version 1.2.0 +$ helm install koordinator koordinator-sh/koordinator --version 1.3.0 ``` ## Upgrade with helm @@ -33,7 +33,7 @@ $ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ $ helm repo update # Upgrade the latest version. 
-$ helm upgrade koordinator koordinator-sh/koordinator --version 1.2.0 [--force] +$ helm upgrade koordinator koordinator-sh/koordinator --version 1.3.0 [--force] ``` Note that: @@ -73,7 +73,7 @@ For pods that do not want hook servers processing (such as addon pods), you can Download from github releases: ```bash $ # select the version -$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.2.0/koord-runtime-proxy_1.2.0_linux_x86_64 -O koord-runtime-proxy +$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.3.0/koord-runtime-proxy_1.3.0_linux_x86_64 -O koord-runtime-proxy $ chmod +x koord-runtime-proxy ``` @@ -148,7 +148,7 @@ The following table lists the configurable parameters of the chart and their def | `manager.log.level` | Log level that koord-manager printed | `4` | | `manager.replicas` | Replicas of koord-manager deployment | `2` | | `manager.image.repository` | Repository for koord-manager image | `koordinatorsh/koord-manager` | -| `manager.image.tag` | Tag for koord-manager image | `v1.2.0` | +| `manager.image.tag` | Tag for koord-manager image | `v1.3.0` | | `manager.resources.limits.cpu` | CPU resource limit of koord-manager container | `1000m` | | `manager.resources.limits.memory` | Memory resource limit of koord-manager container | `1Gi` | | `manager.resources.requests.cpu` | CPU resource request of koord-manager container | `500m` | @@ -163,7 +163,7 @@ The following table lists the configurable parameters of the chart and their def | `scheduler.log.level` | Log level that koord-scheduler printed | `4` | | `scheduler.replicas` | Replicas of koord-scheduler deployment | `2` | | `scheduler.image.repository` | Repository for koord-scheduler image | `koordinatorsh/koord-scheduler` | -| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.2.0` | +| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.3.0` | | `scheduler.resources.limits.cpu` | CPU resource limit of koord-scheduler container | `1000m` | | `scheduler.resources.limits.memory` | Memory resource limit of koord-scheduler container | `1Gi` | | `scheduler.resources.requests.cpu` | CPU resource request of koord-scheduler container | `500m` | @@ -175,7 +175,7 @@ The following table lists the configurable parameters of the chart and their def | `scheduler.hostNetwork` | Whether koord-scheduler pod should run with hostnetwork | `false` | | `koordlet.log.level` | Log level that koordlet printed | `4` | | `koordlet.image.repository` | Repository for koordlet image | `koordinatorsh/koordlet` | -| `koordlet.image.tag` | Tag for koordlet image | `v1.2.0` | +| `koordlet.image.tag` | Tag for koordlet image | `v1.3.0` | | `koordlet.resources.limits.cpu` | CPU resource limit of koordlet container | `500m` | | `koordlet.resources.limits.memory` | Memory resource limit of koordlet container | `256Mi` | | `koordlet.resources.requests.cpu` | CPU resource request of koordlet container | `0` | diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/best-practices/fine-grained-cpu-orchestration.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/best-practices/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..92851eb8c --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/best-practices/fine-grained-cpu-orchestration.md @@ -0,0 +1,259 @@ +# Coordinated sharing of CPU resources in Colocation Scenarios - Fine-grained CPU Orchestration + +## Introduction + +In a cloud-native environment, users often deploy different types of workloads in 
the same cluster, leveraging different peak effects of different services to achieve time-sharing multiplexing of resources and avoid resource waste. However, colocation of different types of workloads often leads to resource competition and mutual interference. The most typical scenario is the colocation of online and offline workloads. When more computing resources are occupied by offline workloads, the response time of online loads will be affected; when more computing resources are occupied by online workloads for a long time, the task completion time of offline workloads cannot be guaranteed. This phenomenon belongs to the Noisy Neighbor problem. + +Depending on the degree of colocation and resource types, there are many different ways to solve this problem. Quota management can limit the resource usage of loads from the entire cluster dimension, and Koordinator provides multi-level elastic quota management functions in this regard. From the single-node level, CPU, memory, disk IO, and network resources may be shared by different loads. Koordinator has provided some resource isolation and guarantee capabilities on CPU and memory, and related capabilities on disk IO and network resources are under construction. + +This article mainly introduces how Koordinator helps loads (online and online, online and offline) share CPU resources collaboratively when different types of workloads are colocated on the same node. + +## Problem Description + +The essence of CPU resource Noisy Neighbor is that different workloads share CPU resources without coordination. +1. The default resource model of Kubernetes uses cgroup (cfs quota) to limit the access of different loads to CPU resources in terms of CPU time usage. In this case, some workloads may be switched to CPU cores by the operating system scheduler. Since different CPU cores have different memory access time to different physical locations, switching cpu cores will result in longer memory access time, thus affecting load performance, thereby affecting load performance. +2. In NUMA architecture, SMT threads (logical cores) share execution units and L2 cache of physical cores. +When there are multiple workloads on the same physical core, resource contention will happen between different workloads, resulting in load performance degradation. + +Kubernetes provides topology manager and CPU manager on node level to solve the above problems. However, this feature will only attempt to take effect after the Pod has been scheduled on the machine. This may lead to the situation where Pods are scheduled to nodes with sufficient CPU resources but topology requirements are not met. + +## Solutions + +### Application-Oriented CPU Orchestration QoS Semantics + +In response to the above problems and deficiencies, Koordinator designed an application-oriented QoS semantics and CPU orchestration protocol, as shown in the figure below. + +![img](/img/qos-cpu-orchestration.png) + +LS (Latency Sensitive) is applied to typical microservice loads, and Koordinator isolates it from other latency-sensitive loads to ensure its performance. LSR (Latency Sensitive Reserved) is similar to Kubernetes' Guaranteed. On the basis of LS, it adds the semantics that applications require reserved binding cores. LSE (Latency Sensitive Exclusive) is common in applications that are particularly sensitive to CPU, such as middleware. 
In addition to satisfying its semantics similar to LSR's requirement to bind cores, Koordinator also ensures that the allocated CPU is not shared with any other load. + +Also, to improve resource utilization, BE workloads can share CPU with LSR and LS. To ensure that latency-sensitive applications shared with BE are not disturbed by it, Koordinator provides strategies such as interference detection and BE suppression. The focus of this article is not here, readers can pay attention to follow-up articles. + +### Rich CPU scheduling strategies + +For LSE applications, when the machine is a hyper-threaded architecture, only logical cores can be guaranteed to be exclusive to the load. In this way, when there are other loads on the same physical core, application performance will still be disturbed. +To this end, Koordinator supports users to configure rich CPU scheduling policies on pod annotation to improve performance. + +CPU orchestration policies are divided into CPU-binding policies and CPU-exclusive policies. The CPU binding strategy determines the distribution of logical cores assigned to the application among physical cores, which can be spread or stacked among physical cores. Stacking (FullPCPU) refers to allocating complete physical cores to applications, which can effectively alleviate the Noisy Neighbor problem. SpreadByPCPU is mainly used in some delay-sensitive applications with different peak and valley characteristics, allowing the application to fully use the CPU at a specific time. The CPU exclusive policy determines the exclusive level of logical cores assigned to the application, and it can try to avoid physical cores or NUMANodes that have been applied for with the exclusive policy. + +### Enhanced CPU Scheduling Capabilities + +Koordinator supports the configuration of NUMA allocation strategies to determine how to select satisfactory NUMA nodes during scheduling. MostAllocated indicates allocation from the NUMA node with the least available resources, which can reduce fragmentation as much as possible and leave more allocation space for subsequent loads. However, this approach may cause the performance of parallel code that relies on Barriers to suffer. DistributeEvenly means that evenly distributing CPUs on NUMA nodes can improve the performance of the above parallel code. LeastAllocated indicates allocation from the NUMA node with the most available resources. + +In addition, Koordinator's CPU allocation logic is completed in the central scheduler. In this way, there will be a global perspective, avoiding the dilemma of single-node solution, where CPU resources may be sufficient but topology requirements are not met. + +## Best Practices +As can be seen from the above, Koordinator's fine-grained CPU orchestration capability can significantly improve the performance of CPU-sensitive workloads in multi-application colocation scenarios. In order to allow readers to use Koordinator’s fine-grained CPU scheduling capabilities more clearly and intuitively, this article deploys online applications to clusters in different ways, and observes the latency of services in stress testing to judge the effect of CPU scheduling capabilities. + +In this article, multiple online applications will be deployed on the same machine and pressure tested for 10 minutes to fully simulate the CPU core switching scenarios that may occur in production practice. For the colocation of online and offline applications, Koordinator provides strategies such as interference detection and BE suppression. 
The focus of this article is not here, and readers can pay attention to the practice in subsequent articles. + +|Group Number|Deployment Mode|Description|Scenarios| +|-|-|-|-| +|A|10 online applications are deployed on the nodes, and each node applies for 4 CPUs, all using kubernetes guaranteed QoS|Koordinator does not provide fine-grained CPU orchestration capabilities for applications|Due to CPU core switching, applications share logical cores, application performance will be affected, and it is not recommended to use| +|B|Deploy 10 online applications on the nodes, each application node has 4 CPUs, all adopt LSE QoS, CPU binding strategy adopts physical core binpacking(FullPCPUs)|Koordinator provides CPU core binding capability for LSE Pod and online applications will not share physical cores|Particularly sensitive online scenarios, application cannot accept CPU sharing at the physical core level| +|C|Deploy 10 online applications on the node, each application node has 4 CPUs, all adopt LSR QoS, CPU binding strategy adopts physical core split (SpreadByPCPUs), use CPU exclusively by physical cpu level|Koordinator provides CPU binding core capability for LSR Pod and online application logical core can use more physical core capacity|It is often used to share physical cores with offline Pods and implement time-sharing multiplexing at the physical core level. This article does not focus on the mixed deployment of online and offline applications, so it only tests the overuse of online applications| + +This experiment uses the following performance indicators to evaluate the performance of Nginx applications under different deployment methods: + +- RT (Response Time) quantile value: RT is a performance indicator that online applications usually focus on. The lower the RT, the better the online service performance. The RT indicator is obtained by collecting the information printed after the wrk pressure tests. In the experiment, it reflects the time it takes for the Nginx application to respond to the wrk request. For example, RT-p50 indicates the maximum time (median) it takes for Nginx to respond to the first 50% of wrk requests, and RT-p90 indicates the maximum time it takes for Nginx to respond to the first 90% of wrk requests. +- RPS (Request Per Second): RPS is the number of requests served by an online application per second. The more RPS a service bears, the better the performance of the online service. + + +The experimental results are as follows: + +|Performance Indicators/Deployment Mode| A(colocation of two online applications, Guaranteed)|B(colocation of two online applications, LSE、FullPCPU)|C(colocation of two online applications, LSR、SpreadByPCPU、PCPULevel| +|-|-|-|-| +|RPS| 114778.29|114648.19|115268.50| +|RT-avg (ms)|3.46 ms|3.33 ms|3.25 ms| +|RT-p90 (ms)|5.27 ms|5.11 ms|5.06 ms| +|RT-p99 (ms)|15.22 ms|12.61 ms|12.14 ms| + +- Comparing B and A, it can be found that after adopting LSE QoS to bind the core, the service response time P99 is significantly reduced, and the long tail phenomenon is well alleviated +- Comparing C and B, it can be found that after using LSR QoS to bind cores and allowing logical cores to occupy more physical core resources, more requests can be tolerated with better service response time + +In summary, in the scenario where online services are deployed on the same machine, using Koordinator to refine the CPU arrangement can effectively suppress the Noisy Neighbor problem and reduce the performance degradation caused by CPU core switching. 
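The experiment below injects these protocols with a `ClusterColocationProfile`, but the same policies can also be declared directly on a Pod. The following is a minimal sketch of the annotation-based form for the Group C policy from the table above; the label key `koordinator.sh/qosClass`, the image, and the resource sizes are illustrative and should be checked against the Koordinator version in use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-lsr-example
  labels:
    # Declare the Koordinator QoS class; LSR asks for reserved, bound CPUs.
    koordinator.sh/qosClass: LSR
  annotations:
    # Spread bound logical cores across physical cores and keep those
    # physical cores exclusive to this pod (the Group C policy).
    scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"SpreadByPCPUs", "preferredCPUExclusivePolicy":"PCPULevel"}'
spec:
  schedulerName: koord-scheduler   # koord-scheduler performs the CPU allocation centrally
  containers:
  - name: nginx
    image: nginx:1.18
    resources:
      requests:
        cpu: '4'
        memory: 8Gi
      limits:
        cpu: '4'
        memory: 8Gi
```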
+ +### Environemnt + +First, prepare a Kubernetes cluster and install Koordinator. This article chooses two nodes of a Kubernetes cluster to do the experiment, one of the nodes is used as a test machine, which will run the Nginx online server; the other node is used as a pressure test machine, which will run the client's wrk, request the Nginx online server, and make pressure test requests . + +### Online application deployment + +1. Inject fine-grained CPU orchestration protocols into applications using ColocationProfile + + Group B fine-grained CPU orchestration protocol + + ```yaml + apiVersion: config.koordinator.sh/v1alpha1 + kind: ClusterColocationProfile + metadata: + name: colocation-profile-example + spec: + selector: + matchLabels: + app: nginx + # 采用 LSE QoS + qosClass: LSE + annotations: + # 采用物理核间堆叠 + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"FullPCPUs"}' + priorityClassName: koord-prod + ``` + + Group C fine-grained CPU orchestration protocol + + ```yaml + apiVersion: config.koordinator.sh/v1alpha1 + kind: ClusterColocationProfile + metadata: + name: colocation-profile-example + spec: + selector: + matchLabels: + app: nginx + # 采用 LSR QoS + qosClass: LSR + annotations: + # 采用物理核间打散且独占物理核 + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"SpreadByPCPUs", "preferredCPUExclusivePolicy":"PCPULevel"}' + priorityClassName: koord-prod + ``` + +2. This article uses Nginx server as Online Service , Pod YAML is as follows: + + ```yaml + --- + # nginx应用配置 + apiVersion: v1 + data: + config: |- + user nginx; + worker_processes 4; # Nginx的Worker个数,影响Nginx Server的并发。 + + events { + worker_connections 1024; # 默认值为1024。 + } + + http { + server { + listen 8000; + + gzip off; + gzip_min_length 32; + gzip_http_version 1.0; + gzip_comp_level 3; + gzip_types *; + } + } + + #daemon off; + kind: ConfigMap + metadata: + name: nginx-conf-0 + --- + # Nginx实例,作为在线类型服务应用。 + apiVersion: v1 + kind: Pod + metadata: + labels: + app: nginx + name: nginx-0 + namespace: default + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + - "${node_name}" + schedulerName: koord-scheduler + priorityClassName: koord-prod + containers: + - image: 'koordinatorsh/nginx:v1.18-koord-exmaple' + imagePullPolicy: IfNotPresent + name: nginx + ports: + - containerPort: 8000 + hostPort: 8000 # 压测请求访问的端口。 + protocol: TCP + resources: + limits: + cpu: '4' + memory: 8Gi + requests: + cpu: '4' + memory: 8Gi + volumeMounts: + - mountPath: /apps/nginx/conf + name: config + hostNetwork: true + restartPolicy: Never + volumes: + - configMap: + items: + - key: config + path: nginx.conf + name: nginx-conf-0 + name: config + ``` + +3. Execute the following command to deploy the Nginx application. + + ```bash + kubectl apply -f nginx.yaml + ``` + +4. Execute the following command to view the Pod status of the Nginx application. + + ```bash + kubectl get pod -l app=nginx -o wide + ``` + + You can see output similar to the following, indicating that the Nginx application has been running normally on the test machine. + + ``` + NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES + nginx-0 1/1 Running 0 2m46s 10.0.0.246 test-machine-name + + ``` + +### Load Test + +1. On the testing machine, execute the following command to deploy the stress testing tool wrk. 
+ + ```bash + wget -O wrk-4.2.0.tar.gz https://github.com/wg/wrk/archive/refs/tags/4.2.0.tar.gz && tar -xvf wrk-4.2.0.tar.gz + cd wrk-4.2.0 && make && chmod +x ./wrk + ``` + +2. On the testing machine, execute the following command to deploy the load testing tool wrk + + ```bash + # node_ip填写测试机的IP地址,用于wrk向测试机发起压测;8000是Nginx暴露到测试机的端口。 + taskset -c 32-45 ./wrk -t120 -c400 -d600s --latency http://${node_ip}:8000/ + ``` + +3. After waiting for wrk to finish running, obtain the pressure test results of wrk. The output format of wrk is as follows. Repeat the test several times to obtain relatively stable results. + + ``` + Running 10m test @ http://192.168.0.186:8000/ + 120 threads and 400 connections + Thread Stats Avg Stdev Max +/- Stdev + Latency 3.29ms 2.49ms 352.52ms 91.07% + Req/Sec 0.96k 321.04 3.28k 62.00% + Latency Distribution + 50% 2.60ms + 75% 3.94ms + 90% 5.55ms + 99% 12.40ms + 68800242 requests in 10.00m, 54.46GB read + Requests/sec: 114648.19 + Transfer/sec: 92.93MB + ``` + +## Conclusion + +In a Kubernetes cluster, there may be competition for resources such as CPU and memory among different business loads, which affects the performance and stability of the business. In the face of the Noisy Neighbor phenomenon, users can use Koordinator to configure more refined CPU scheduling policies for applications, so that different applications can share CPU resources collaboratively. We have shown through experiments that Koordinator's fine-grained CPU scheduling capability can effectively suppress the competition for CPU resources and improve application performance. \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/node-prediction.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/node-prediction.md new file mode 100644 index 000000000..9bda2cc8a --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/node-prediction.md @@ -0,0 +1,278 @@ +# Node Prediction + +## Summary + +The *node prediction* is proposed to both improve the node utilization and avoid overloading. By profiling the +tendency of the node metrics, we can estimate the peak usage and implement more efficient over-commitment policy. + +## Motivation + +Scheduling pods with setting appropriate resource requirements is truly hard to follow. Underestimating requests can +bring performance issues. However, overvaluing requests is likely to cause resource waste and low efficiency. One +common approach is using Vertical Pod Autoscaler (VPA) to autopilot the resource requirements for the pods of the same +workload. The VPA optimizes the resource requirements of the pod according to the pod metrics of the same workload. It +estimates the pod usage and specifies proper resource requirements. It works well when we want to optimize the resource +requirements of workloads. However, most VPA approaches try to abandon the time series attribute from the pod metrics +and generate a relatively static requests/limits that should guarantee to make no bad ignoring the timing. It leaves +the usage-to-limit gap, i.e. the gap between the recommended pod request with the real-time pod usage, and the +well-known pooling effect, i.e. the gap between the sum of the pod usages with the node usage. Inspired by +[Google's work](#references) in the EuroSys'21, we propose the node prediction in Koordinator to conquer these two +gaps. + +### Goals + +- Define the node prediction API. +- Propose an online history-based-optimized (HBO) prediction model. 
+- Clarify how the Mid-tier resources are calculated with the prediction. + +### Non-Goals/Future Work + +- Propose a time-series-forecasting-based or offline prediction model. + +## User Stories + +### Story 1 + +As a cluster administrator, there are many web service pods allocating almost node resources. Whereas, the node +utilization is low since most allocated resources are not actually used. To improve node utilization, I want to reclaim +the unused resources to submit some low-priority online-service pods and Flink jobs. However, I am concerned with the +risks of over-utilization bringing machine overload which may cause the performance degradation and hurt the pod QoS. + +### Story 2 + +As a Kubernetes developer, I want to support the long-term load balancing in the scheduler. Thus, I need the information +that which nodes should be idle for a long time. + +## Design + +### Design Principles + +- The node prediction is low-cost and can be implemented in the Koordlet. +- The node prediction is pluggable. Users can replace the default model to customize the prediction. + +### Architecture + +The node prediction is implemented mainly in the Koordlet and Koord-Manager. The architecture is as below: + +![image](/img/node-prediction.svg) + +- Koordlet: The agent runs on the node. It implements the metrics collection, metrics storage, and predict server. + - Metrics Advisor: It collects the cpu/memory usage of the node and running pods. It stores the collected metrics in the Metric Cache. + - Metric Cache: It stores the node and pod metrics in a TSDB, which allows other modules to query the metrics later. + - Predict Server: With the node and pod metrics retrieved from the Metric Cache, it calculates and checkpoints the predicted result based on the prediction model. + - States Informer: It maintains the metadata of the node and the pods. It also reports the latest prediction periodically to the kube-apiserver. +- Koord-Manager: The controller runs on a master node. + - Configuration delivery: It maintains the prediction and colocation strategies and distributes the node strategy onto the NodeMetric. + - Resource Calculator: It fetches the node prediction result, and calculates the resource allocatable of the reclaimed resources (i.e. Mid-tier resource). +- Koord-Scheduler: It schedules the pod with different priority bands (e.g. Prod, Mid, Batch). It can enable load-aware scheduling to balance the over-committed nodes' utilization. + +#### Workflow + +In the koordlet, stages to update the node prediction are as follows: + +1. Histogram initialization: The predict server initializes a set of histograms for CPU and memory. For implementing `N-Sigma_v1`, it initializes decayed histograms only for the node and priority classes. While implementing `N-Sigma_v2`, it initializes histograms both for the node and every running pod. +2. Metrics collection: The metrics advisor collects the usage statistics of node and pods and stores them as metric points into the metric cache every CollectInterval (e.g. 1s). +3. Histogram updating: The predict server fetches the node metrics and pod metrics of latest HistogramUpdateInterval (e.g. 30s). Then it uses the aggregated result to update the decayed histograms. +4. Periodical reporting: The states informer fetches node metrics and the last histograms for the node and priority classes every ReportingInterval (e.g. 60s). Then it reports the complete NodeMetric status with last node prediction info to the kube-apiserver. +5. 
Fast reporting: The states informer fetches the last histograms every CheckPredictionInterval (e.g. 20s). It checks if the predicted result is too small or too larger than the last updated prediction exceeding the ResourceDiffThreshold (e.g. 5%), or the updated duration is longer than ForceUpdateInterval (e.g. 600s). If the check result is true, It updates the latest node prediction to the kube-apiserver. + +In the koord-manager, stages to update the Mid-tier resources allocatable are as follows: + +1. NodeMetric lifecycle management: The koord-manager list-watches the Node and the ConfigMap slo-controller-config, and maintains the lifecycle of the NodeMetric CR. Once the colocation strategy in the slo-controller-config updated, the koord-manager parses the config data and updates the node prediction policy and mid colocation policy into the NodeMetric.Spec. +2. Mid resource updating: The koord-manager list-watches the NodeMetric. Once the NodeMetric status is updated, the koord-manager gets the latest node metrics and node prediction, and calculates the Mid allocatable resources based on the Mid over-commitment formula. Finally, it updates the Mid allocatable resources into the Node status as the extended resources (`kubernetes.io/mid-cpu`, `kubernetes.io/mid-memory`). + +#### Scheduling Optimization + +The results of the node prediction on the NodeMetric, the Mid extended resources on the Node and the scheduling Pod +in the scheduler are updated in different time. It is inevitable to find that the scheduler schedules a pod with an +older version of the node prediction, which may cause the schedule result "lagged". + +To relief the lagged prediction, the koordlet and koord-manager try both updating earlier when the +prediction/NodeMetric differs from the previous result than a threshold and set a resource buffer which should +tolerant most of the result changes between synchronizations. + +For the worst case in which the prediction could be lagged too much (e.g. 1 hour), we can maintain a lower bound of +the real Mid allocatable resources inside the scheduler. This part is not planned in the first version of the Mid-tier +over-commitment. + +### API + +#### Node Prediction + +##### Predict Policy + +```go +// ColocationStrategy defines the colocation strategy in slo-controller-config ConfigMap. +type ColocationStrategy struct { + // ... + NodePredictPolicy *slov1alpha1.PredictPolicy `json:"nodePredictPolicy,omitempty"` +} + +type NodeMetricSpec struct { + // ... + PredictPolicy *PredictPolicy `json:"predictPolicy,omitempty"` +} + +// PredictPolicy defines the policy for the node prediction. +type PredictPolicy struct { + ResourceDiffThresholdPercent *int64 `json:"resourceDiffThresholdPercent,omitempty"` + ColdStartPeriodSeconds *int64 `json:"coldStartPeriodSeconds,omitempty"` +} +``` + +##### Predicted Result + +```go +type NodeMetricStatus struct { + // ... + // ProdReclaimableMetric is the estimated reclaimable resources for the Prod-type pods. + ProdReclaimableMetric *ReclaimableMetric `json:"prodReclaimableMetric,omitempty"` +} + +type ReclaimableMetric struct { + // Resource is the resource usage of the prediction. + Resource ResourceMap `json:"resource,omitempty"` +} +``` + +#### Mid Overcommitment + +##### Colocation Strategy + +```go +type ColocationStrategy struct { + // ... + // MidCPUThresholdPercent defines the maximum percentage of the Mid-tier cpu resource dividing the node allocatable. + // MidCPUAllocatable <= NodeCPUAllocatable * MidCPUThresholdPercent / 100. 
+ MidCPUThresholdPercent *int64 `json:"midCPUThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` + // MidMemoryThresholdPercent defines the maximum percentage of the Mid-tier memory resource dividing the node allocatable. + // MidMemoryAllocatable <= NodeMemoryAllocatable * MidMemoryThresholdPercent / 100. + MidMemoryThresholdPercent *int64 `json:"midMemoryThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` +} +``` + +##### Extended Resources + +```yaml +apiVersion: v1 +kind: Node +metadata: + name: test-node +status: + allocatable: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods + kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods + capacity: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' + kubernetes.io/mid-memory: 64818120Ki +``` + +### Theoretical Model + +#### Node Peak Prediction + +Before elaborating the peak prediction algorithm, let's formalize the node peak prediction problem. + +Let's denote the usage of a Pod `p` at the time `t` is `U(p, t)`. + +Then the usage of a Node `M` which schedules a set of Pods is `MU(Pods, t) = sum[p in Pods](U(p, t))`. + +> Note that the non-Pod usage of the node can be regarded as the usage of a special pod `S`. + +When we want to predict the node peak at the time `T`, we are calculating +`Peak(Pods, T) = max[t >= T](sum[p in Pods](U(p, t)))`. + +The predicted peak `Peak(Pods, T)` is our node prediction result at `T`. + +#### N-sigma Prediction + +There are several [statistical peak prediction models](#alternatives) which are practical to implement in the online +scheduler. [*N-sigma*](#references) is the picked peak prediction model in the current implementation. It assumes the +timing node metrics follow the Gaussian distribution, which allows us to estimate the node peak with the mean and +standard deviation (stdev): + +`Peak_N-Sigma_v1(Pods, T) = mean[T0 <= t <= T](MU(Pods, t)) + N * stdev[T0 <= t <= T](MU(Pods, t))` + +The `Peak_N-Sigma_v1` is the predicted node peak. It is implemented as the first version of node prediction, which is +calculated based on node-level metrics. + +Moreover, we can calculate with the pods' metrics: + +`Peak_Pods-N-Sigma'(Pods, T) = sum[p in Pods](mean[T0 <= t <= T](U(p, t)) + N * stdev[T0 <= t <= T](U(p, t)))` + +A more conservative is derived from their maximal. The `Peak_N-sigma_v2` is the second version of node prediction, +which also considers the pod-level metrics. + +`Peak_N-Sigma_v2(Pods, T) = max(Peak_N-Sigma_v1(Pods, T), Peak_Pods-N-Sigma(Pods, T))`. + +#### Mid-tier Overcommitment + +In the first version, the Mid-tier resource contains the reclaimable resources which are probably unused in the +long-term by the high-priority (i.e. Prod) pods. +The resource calculation for the Mid-tier resources can be described as follows: + +``` +Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) +``` + +- `Reclaimable[Mid] := max(0, reclaimRatio * Allocated[Prod] - Peak[Prod])`. The peak prediction model is used for estimating the future usage of the running Prod pods. The Mid pods can allocate a proportion of reclaimed resources from running Prod pods. +- `NodeAllocatable * thresholdRatio` is the maximal co-located Mid-tier resource setting from a ratio of the node allocatable. + +In next versions, the Mid-tier resource is planned to mix with the default node allocatable (i.e. 
the Prod allocatable), +which means a Mid pod can allocate the unallocated node allocatable resource, and an idle node is able to schedule Mid +pods. The Prod pods can preempt the Mid pods when the mixed allocatable is exhausted by the Mid pods, so that the +Prod-tier resource is still more stable and guaranteed than the Mid-tier. +Then the resource calculation for the mixed Mid-tier resources can be described as follows: + +``` +Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid] +Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0) +``` + +## Alternatives + +### Peak Prediction Models + +There are several different peak prediction and time series forecasting models which can estimate the future peak +based on the historical node metrics, including statistical methods and machine learning methods. In this proposal, +statistical peak prediction models are preferred since they are practical to implement in the online scheduling system, +have less overhead of metrics collection than the ML approaches, and more simple to analyze and debug. + +Here are some common statistical peak prediction models: + +1. [Borg-default](#references) + +Borg-default simply over-commits the machine resources in a fixed rate `a`, which means the peak usage is regarded as +the result of the requests dividing `a`. + +Let's denote the resource request of the Pod `p` at the time `t` is `R(p, t)`, where `R(p, t) = 0` when `p` is not +running. Then we have, + +`Peak_Borg-default(Pods, T) = 1/a * sum[p in Pods](R(p, T))`, `a = 1.1` by default. + +2. [Resource Central](#references) + +Resource Central considers the peak of the machine as the sum of the peak of individual pods (or VMs). And a simple +peak prediction of a pod is the percentile of the historical usages, e.g. `percentile[t in [T-C, T]](U(p, t))`. + +`Peak_ResourceCentral(Pods, T) = sum[p in Pods](percentile[t in [T-C, T]](U(p, t)))` + +3. [Max](#references) + +The Max prediction model does not use the historical metrics directly, but takes the maximal of any known peak results. +It gets the more conservative result than the input models. For example, we have a `Max_Borg-default_ResourceCentral` +model calculated from the Borg-default and Resource Central models: + +`Peak_Max_Borg-default_ResourceCentral(Pods, T) = max(Peak_Borg-default(Pods, T), Peak_ResourceCentral(Pods, T))` + +## References + +1. Vertical Pod Autoscaler: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler +2. Bashir, Noman, et al. "Take it to the limit: peak prediction-driven resource overcommitment in datacenters." Proceedings of the Sixteenth European Conference on Computer Systems. 2021. +3. Cortez, Eli, et al. "Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms." Proceedings of the 26th Symposium on Operating Systems Principles. 2017. diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/nri-mode-resource-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/nri-mode-resource-management.md new file mode 100644 index 000000000..507c7e30c --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/nri-mode-resource-management.md @@ -0,0 +1,152 @@ +# NRI Mode Resource Management + +## Glossary + +NRI, node resource interface. See: https://github.com/containerd/nri + +## Summary + +We hope to enable NRI mode resource management for koordinator for easy deployment and in-time control. 
+ +## Motivation + +Koordinator as a QoS based scheduling system for hybrid workloads orchestration on Kubernetes and its runtime hooks support two working [modes](https://github.com/koordinator-sh/koordinator/blob/main/docs/design-archive/koordlet-runtime-hooks.md) for different scenarios: `Standalone` and `Proxy`. However, both of them have some [constraints](https://shimo.im/docs/m4kMLdgO1LIma9qD). NRI (Node Resource Interface), which is a public interface for controlling node resources is a general framework for CRI-compatible container runtime plug-in extensions. It provides a mechanism for extensions to track the state of pod/containers and make limited modifications to their configuration. We'd like to integrate NRI framework to address `Standalone` and `Proxy` constraints based on this community recommend mechanism. + +### Goals + +- Support NRI mode resource management for koordinator. +- Support containerd container runtime. + +### Non-Goals/Future Work + +- Support docker runtime + +## Proposal + +Different from standalone and proxy mode, Koodlet will start an NRI plugin to subscribe pod/container lifecycle events from container runtime (e.g. containerd, crio), and then koordlet NRI plugin will call runtime hooks to adjust pod resources or OCI spec. The flow should be: + +- Get pod/container lifecycle events and OCI format information from container runtime (e.g. containerd, crio). +- Transform the OCI format information into internal protocols. (e.g. PodContext, ContainerContext) to re-use existing runtime hook plugins. +- Transform the runtime hook plugins' response into OCI spec format +- Return OCI spec format response to container runtime(e.g. containerd, crio). + +![nri-proposal.png](/img/nri-proposal.png) + +### User Stories + +#### Story 1 +As a cluster administrator, I want to apply QoS policy before pod's status become running. + +#### Story 2 +As a cluster administrator, I want to deploy koordinator cluster without restart. + +#### Story 3 +As a cluster administrator, I want to adjust resources' policies at runtime. + +#### Story 4 +As a GPU user, I want to inject environment before pod running. + +### Requirements + +- Need to upgrade containerd to >= 1.7.0, crio to >= v1.25.0 + +#### Functional Requirements + +NRI mode should support all existing functionalities supported by standalone and Proxy mode. + +#### Non-Functional Requirements + +Non-functional requirements are user expectations of the solution. Include +considerations for performance, reliability and security. + +### Implementation Details/Notes/Constraints +1. koordlet [NRI plugin](https://github.com/containerd/nri/blob/main/plugins/template/plugin.go) +```go +type nriServer struct { + stub stub.Stub + mask stub.EventMask + options Options // server options +} + +// Enable 3 hooks (RunPodSandbox, CreateContainer, UpdateContainer) in NRI +func (p *nriServer) Configure(config, runtime, version string) (stub.EventMask, error) { +} + +// Sync all pods/containers information before koordlet nri plugin run +func (p *nriServer) Synchronize(pods []*api.PodSandbox, containers []*api.Container) ([]*api.ContainerUpdate, error) { +} + +func (p *nriServer) RunPodSandbox(pod *api.PodSandbox) error { + podCtx.FromNri(pod) + RunHooks(...) + podCtx.NriDone() +} + +func (p *nriServer) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) { + containerCtx.FromNri(pod, container) + RunHooks(...) 
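    // NriDone() converts the changes that the hook plugins made on the container
    // context back into an NRI ContainerAdjustment/ContainerUpdate response for
    // the container runtime.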
+ containCtx.NriDone() +} + +func (p *nriServer) UpdateContainer(pod *api.PodSandbox, container *api.Container) ([]*api.ContainerUpdate, error) { + containerCtx.FromNri(pod, container) + RunHooks(...) + containCtx.NriDone() +} +``` +2. koordlet enhancement for NRI +- PodContext +```go +// fill PodContext from OCI spec +func (p *PodContext) FromNri(pod *api.PodSandbox) { +} + +// apply QoS resource policies for pod +func (p *PodContext) NriDone() { +} +``` +- ContainerContext +```go +// fill ContainerContext from OCI spec +func (c *ContainerContext) FromNri(pod *api.PodSandbox, container *api.Container) { +} + +// apply QoS resource policies for container +func (c *ContainerContext) NriDone() (*api.ContainerAdjustment, []*api.ContainerUpdate, error) { +} +``` + +### Risks and Mitigations + +## Alternatives +There are several approaches to extending the Kubernetes CRI (Container Runtime Interface) to manage container resources such as `standalone` and `proxy`. Under `standalone` running mode, resource isolation parameters will be injected asynchronously. Under `proxy` running mode, proxy can hijack CRI requests from kubelet for pods and then apply resource policies in time. However, `proxy` mode needs to configure and restart kubelet. + +There are a little difference in execution timing between `NRI` and `proxy` modes. Hook points (execution timing) are not exactly same. The biggest difference is `proxy` call koordlet hooks between kubelet and containerd. However, NRI will call NRI plugin (koodlet hooks) in containerd, that means containerd still could do something before or after containerd call NRI plugin (koordlet hooks). For example, under `NRI` running mode, containerd setup pod network first and then call NRI plugin (koordlet hooks) in RunPodSanbox, but under `proxy` running mode, containerd couldn't do anything before koordlet hooks running when `proxy` handle RunPodSandbox CRI request. + +- Standalone + + - kubelet -- CRI Request -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers + - kubelet -> Node Agent -> CRI Runtime / containers + +![standalone.png](/img/standalone.png) + +- Proxy + + - kubelet -- CRI Request -> CRI Proxy -- CRI Request (hooked) -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers + +![proxy.png](/img/proxy.png) + +- NRI + + - kubelet -- CRI Request -> CRI Runtime -- OCI Spec --> OCI compatible runtime -> containers +                  ↘   ↗ +                Koordlet NRI plugin + +![nri.png](/img/nri.png) + +## Upgrade Strategy + +- Need to upgrade containerd to 1.7.0+ or CRIO to 1.26.0+ +- Need to enable NRI + + diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/installation.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/installation.md index 3336deedd..15a71e334 100644 --- a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/installation.md +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/installation.md @@ -20,7 +20,7 @@ $ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ $ helm repo update # Install the latest version. -$ helm install koordinator koordinator-sh/koordinator --version 1.2.0 +$ helm install koordinator koordinator-sh/koordinator --version 1.3.0 ``` ## 使用 Helm 升级 @@ -33,7 +33,7 @@ $ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ $ helm repo update # Upgrade the latest version. 
-$ helm upgrade koordinator koordinator-sh/koordinator --version 1.2.0 [--force] +$ helm upgrade koordinator koordinator-sh/koordinator --version 1.3.0 [--force] ``` 注意: @@ -49,6 +49,17 @@ $ helm upgrade koordinator koordinator-sh/koordinator --version 1.2.0 [--force] $ helm install/upgrade koordinator /PATH/TO/CHART ``` +## 启用 NRI 资源管理模式 + +### 前置条件 + +- Containerd >= 1.7.0 且配置启用 NRI。请确保 NRI 已在 containerd 中启用,否则请参考 [Enable NRI in Containerd](https://github.com/containerd/containerd/blob/main/docs/NRI.md)。 +- Koordinator >= 1.3 + +### 配置方式 + +NRI 资源管理模式是*默认启用*的。你无需修改 koordlet 配置就可以使用它,也可以通过设置 `enable-nri-runtime-hook=false` 的 koordlet 启动参数来禁用它。当它的前置条件不满足时,启用也不会影响其他功能。 + ## 安装 koord-runtime-proxy koord-runtime-proxy 充当 Kubelet 和 Containerd 之间的代理(Dockershim 场景下的 Dockerd),旨在拦截 CRI 请求, 并应用一些资源管理策略,比如在混合工作负载编排场景下通过 Pod 优先级设置不同的 CGroup 参数,为最新的 Linux 内核应用新的隔离策略, CPU 架构等等。 @@ -58,7 +69,7 @@ koord-runtime-proxy 充当 Kubelet 和 Containerd 之间的代理(Dockershim 从 Github 下载: ```bash $ # select the version -$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.2.0/koord-runtime-proxy_1.2.0_linux_x86_64 -O koord-runtime-proxy +$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.3.0/koord-runtime-proxy_1.3.0.linux_x86_64 -O koord-runtime-proxy $ chmod +x koord-runtime-proxy ``` @@ -133,7 +144,7 @@ kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/run | `manager.log.level` | Log level that koord-manager printed | `4` | | `manager.replicas` | Replicas of koord-manager deployment | `2` | | `manager.image.repository` | Repository for koord-manager image | `koordinatorsh/koord-manager` | -| `manager.image.tag` | Tag for koord-manager image | `v1.2.0` | +| `manager.image.tag` | Tag for koord-manager image | `v1.3.0` | | `manager.resources.limits.cpu` | CPU resource limit of koord-manager container | `1000m` | | `manager.resources.limits.memory` | Memory resource limit of koord-manager container | `1Gi` | | `manager.resources.requests.cpu` | CPU resource request of koord-manager container | `500m` | @@ -148,7 +159,7 @@ kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/run | `scheduler.log.level` | Log level that koord-scheduler printed | `4` | | `scheduler.replicas` | Replicas of koord-scheduler deployment | `2` | | `scheduler.image.repository` | Repository for koord-scheduler image | `koordinatorsh/koord-scheduler` | -| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.2.0` | +| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.3.0` | | `scheduler.resources.limits.cpu` | CPU resource limit of koord-scheduler container | `1000m` | | `scheduler.resources.limits.memory` | Memory resource limit of koord-scheduler container | `1Gi` | | `scheduler.resources.requests.cpu` | CPU resource request of koord-scheduler container | `500m` | @@ -160,7 +171,7 @@ kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/run | `scheduler.hostNetwork` | Whether koord-scheduler pod should run with hostnetwork | `false` | | `koordlet.log.level` | Log level that koordlet printed | `4` | | `koordlet.image.repository` | Repository for koordlet image | `koordinatorsh/koordlet` | -| `koordlet.image.tag` | Tag for koordlet image | `v1.2.0` | +| `koordlet.image.tag` | Tag for koordlet image | `v1.3.0` | | `koordlet.resources.limits.cpu` | CPU resource limit of koordlet container | `500m` | | `koordlet.resources.limits.memory` | Memory resource limit of koordlet container | `256Mi` | | `koordlet.resources.requests.cpu` | CPU resource request of koordlet 
container | `0` | diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/cpu-burst.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/cpu-burst.md new file mode 100644 index 000000000..315ab8661 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/cpu-burst.md @@ -0,0 +1,197 @@ +# CPU Burst + +## Introduction + +CPU Burst is a service level objective (SLO)-aware resource scheduling feature provided by Koordinator. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. The koordlet component automatically detects CPU throttling events and automatically adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications. + +### How CPU Burst works + +Kubernetes allows you to specify CPU limits, which can be reused based on time-sharing. If you specify a CPU limit for a container, the OS limits the amount of CPU resources that can be used by the container within a specific time period. For example, you set the CPU limit of a container to 2. The OS kernel limits the CPU time slices that the container can use to 200 milliseconds within each 100-millisecond period. + +CPU utilization is a key metric that is used to evaluate the performance of a container. In most cases, the CPU limit is specified based on CPU utilization. CPU utilization on a per-millisecond basis shows more spikes than on a per-second basis. If the CPU utilization of a container reaches the limit within a 100-millisecond period, CPU throttling is enforced by the OS kernel and threads in the container are suspended for the rest of the time period, as shown in the following figure. + +![image](/img/cpu-throttles.png) + +The following figure shows the thread allocation of a web application container that runs on a node with four vCPUs. The CPU limit of the container is set to 2. The overall CPU utilization within the last second is low. However, Thread 2 cannot be resumed until the third 100-millisecond period starts because CPU throttling is enforced somewhere in the second 100-millisecond period. This increases the response time (RT) and causes long-tail latency problems in containers. + +![image](/img/cpu-throttles-1.png) + +Upstream Linux kernel >=5.14 and Anolis OS both provide [Burstable CFS Controller](https://github.com/torvalds/linux/commit/f4183717b370ad28dd0c0d74760142b20e6e7931#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73e), namely *CPU Burst* feature. It allows a container to accumulate CPU time slices when the container is idle. The container can use the accumulated CPU time slices to burst above the CPU limit when CPU utilization spikes. This improves performance and reduces the RT of the container. + +![image](/img/cpu-throttles-2.png) + +For kernel versions that do not support CPU Burst, koordlet detects CPU throttling events and dynamically adjusts the CPU limit to achieve the same effect as CPU Burst. + +For more information about CPU Burst, see the presentation at KubeCon 2021: [CPU Burst: Getting Rid of Unnecessary Throttling, Achieving High CPU Utilization and Application Performance at the Same Time](https://kccncosschn21.sched.com/event/pcdF?spm=a2c63.p38356.0.0.2ec3731dhQbCIe&iframe=no). 
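
As a quick illustration of the kernel interface involved, the sketch below reads the CFS bandwidth settings of a container cgroup on a node whose kernel provides the Burstable CFS Controller. The cgroup paths are only an assumption (cgroup v1 layout); the exact location depends on your cgroup driver and container runtime.

```bash
# Sketch only: inspect the CFS bandwidth settings of a container cgroup (cgroup v1 layout assumed).
# With a CPU limit of 2, the quota is 200000us per 100000us period, as described above.
CGROUP=/sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/<container-id>   # placeholder path
cat $CGROUP/cpu.cfs_period_us   # enforcement period, e.g. 100000 (100ms)
cat $CGROUP/cpu.cfs_quota_us    # CPU time allowed per period, e.g. 200000 for a limit of 2
# On kernels with the Burstable CFS Controller (upstream >= 5.14 or Anolis OS),
# cpu.cfs_burst_us is the extra quota a container may accumulate while idle and
# spend later when its usage spikes; 0 means bursting is disabled.
cat $CGROUP/cpu.cfs_burst_us
```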
+ +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled CPU Burst feature (`-feature-gates=AllAlpha=true`). If not, please enable it manually by updating the feature gate in the koordlet daemonset. + +NOTE: CPU Burst is not available for `LSR` and `BE` pods since it targets on burstable cpu usages. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CPUBurst=true # enable CPU Burst feature + ... +``` + +## Use CPU Burst + +### Use an annotation to enable CPU Burst for the pod + +Add the following annotation to the pod configuration to enable CPU Burst: + +```yaml +apiVersion: apps/v1 +kind: Pod +metadata: + name: demo-pod-xxx + annotations: + # Set the value to auto to enable CPU Burst for the pod. + koordinator.sh/cpuBurst: '{"policy": "auto"}' + # To disable CPU Burst for the pod, set the value to none. + #koordinator.sh/cpuBurst: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable CPU Burst for all pods in a cluster + +Modify the slo-controller-config ConfigMap based on the following content to enable CPU Burst for all pods in a cluster: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + cpu-burst-config: '{"clusterStrategy": {"policy": "auto"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "cpuBurstOnly"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "none"}}' +``` + +### (Optional) Advanced Settings + +The following code block shows the pod annotations and ConfigMap fields that you can use for advanced configurations: + +```yaml +# Example of the slo-controller-config ConfigMap. +data: + cpu-burst-config: | + { + "clusterStrategy": { + "policy": "auto", + "cpuBurstPercent": 1000, + "cfsQuotaBurstPercent": 300, + "sharePoolThresholdPercent": 50, + "cfsQuotaBurstPeriodSeconds": -1 + } + } + + # Example of pod annotations. + koordinator.sh/cpuBurst: '{"policy": "auto", "cpuBurstPercent": 1000, "cfsQuotaBurstPercent": 300, "cfsQuotaBurstPeriodSeconds": -1}' +``` + +The following table describes the ConfigMap fields that you can use for advanced configurations of CPU Burst. + +| Field | Data type | Description | +| ---------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| policy | string |
  • none: disables CPU Burst. If you set the value to none, the related fields are reset to their original values. This is the default value.
  • cpuBurstOnly: enables only the kernel-level CPU Burst feature, which requires a kernel that supports it natively, such as Anolis OS or an upstream Linux kernel >= 5.14.
  • cfsQuotaBurstOnly: enables only the automatic adjustment of CFS quotas, which works on general kernel versions that do not support kernel-level CPU Burst.
  • auto: enables CPU Burst and all the related features.
| +| cpuBurstPercent | int | Default value:`1000`. Unit: %. This field specifies the percentage to which the CPU limit can be increased by CPU Burst. If the CPU limit is set to `1`, CPU Burst can increase the limit to 10 by default. | +| cfsQuotaBurstPercent | int | Default value:`300`. Unit: %. This field specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. By default, the value of cfs_quota can be increased to at most three times. | +| cfsQuotaBurstPeriodSeconds | int | Default value:`-1`. Unit: seconds. This indicates that the time period in which the container can run with an increased CFS quota is unlimited. This field specifies the time period in which the container can run with an increased CFS quota, which cannot exceed the upper limit specified by `cfsQuotaBurstPercent`. | +| sharePoolThresholdPercent | int | Default value:`50`. Unit: %. This field specifies the CPU utilization threshold of the node. If the CPU utilization of the node exceeds the threshold, the value of cfs_quota in cgroup parameters is reset to the original value. | + +### Verify CPU Burst + +1. Use the following YAML template to create an apache-demo.yaml file. + +> To enable CPU Burst for a pod, specify an annotation in the annotations parameter of the metadata section of the pod configuration. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: apache-demo + annotations: + koordinator.sh/cpuBurst: '{"policy": "auto"}' # Use this annotation to enable or disable CPU Burst. +spec: + containers: + - command: + - httpd + - -D + - FOREGROUND + image: koordinatorsh/apache-2-4-51-for-slo-test:v0.1 + imagePullPolicy: Always + name: apache + resources: + limits: + cpu: "4" + memory: 10Gi + requests: + cpu: "4" + memory: 10Gi + nodeName: # $nodeName Set the value to the name of the node that you use. + hostNetwork: False + restartPolicy: Never + schedulerName: default-scheduler +``` + +2. Run the following command to create an application by using Apache HTTP Server. + +```bash +kubectl apply -f apache-demo.yaml +``` + +3. Use the wrk2 tool to perform stress tests. + +```bash +# Download, decompress, and then install the wrk2 package. +# The Gzip module is enabled in the configuration of the Apache application. The Gzip module is used to simulate the logic of processing requests on the server. +# Run the following command to send requests. Replace the IP address in the command with the IP address of the application. +./wrk -H "Accept-Encoding: deflate, gzip" -t 2 -c 12 -d 120 --latency --timeout 2s -R 24 http://$target_ip_address:8010/static/file.1m.test +``` + +4. Check the results of CPU Burst enabled and disabled. + +e.g. We may have the following results: + +| CentOS 7 | Disabled | Enabled | +| ----------------------------- | ----------- | ------------------- | +| apache RT-p99 | 111.69 ms | 71.30 ms (-36.2%) | +| CPU Throttled Ratio | 33% | 0% | +| Average pod CPU utilization | 32.5% | 33.8% | + +The preceding metrics indicate the following information: + +- After CPU Burst is enabled, the P99 latency of apache is greatly reduced. +- After CPU Burst is enabled, CPU throttling is stopped and the average pod CPU utilization remains approximately at the same value. 
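
The `CPU Throttled Ratio` in the table above can also be checked directly on the tested node by reading the container's `cpu.stat` file. This is only a sketch with an assumed cgroup v1 path; adjust it to your node's cgroup layout.

```bash
# Sketch only: the cgroup path is an assumption and depends on the runtime and QoS class.
cat /sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/<container-id>/cpu.stat
# Relevant fields:
#   nr_periods     - number of elapsed CFS enforcement periods
#   nr_throttled   - number of periods in which the container hit its quota
#   throttled_time - total time (ns) the container spent throttled
# The throttled ratio is roughly nr_throttled / nr_periods.
```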
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/memory-qos.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/memory-qos.md new file mode 100644 index 000000000..66f5e60f9 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/user-manuals/memory-qos.md @@ -0,0 +1,355 @@ +# Memory QoS + +## Introduction + +The Koordlet provides the *Memory Quality of Service* (QoS) feature for containers. You can use this feature to +optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This +topic describes how to enable the memory QoS feature for containers. + +### Background + +The following memory limits apply to containers: + +- The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about + to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result, + the application in the container may not be able to request or release memory resources as normal. +- The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, + the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. + This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is + downgraded. In extreme cases, the node cannot run as normal. + +To improve the performance of applications and the stability of nodes, Koordinator provides the memory QoS feature for +containers. We recommend that you use Anolis OS as the node OS. For other OS, we will try our best to adapt, and users +can still enable it without side effects. After you enable the memory QoS feature for a container, Koordlet +automatically configures the memory control group (memcg) based on the configuration of the container. This helps you +optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node. + +Memory QoS provides the following optimizations to improve the memory utilization of pods: + +- When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclaim for a specific amount of memory. This prevents the reclaim of all the memory that the pod uses and therefore minimizes the adverse impact on the application performance caused by direct memory reclaim. +- Memory reclaim is performed in a fairer manner among pods. When the available memory on a node becomes insufficient, memory reclaim is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory. +- If the BestEffort pods on a node use more memory than their memory requests, the system prioritizes the memory requirements of Guaranteed pods and Burstable pods over the memory requirements of BestEffort pods. + +![image](/img/memory-qos.png) + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled Memory QoS feature (`-feature-gates=AllAlpha=true`). +If not, please enable it manually by updating the feature gate in the koordlet daemonset. 
+ +> NOTE: Memory QoS is controlled by the `CgroupReconcile` feature-gate. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CgroupReconcile=true # enable CPU Burst feature + ... +``` + +## Use Memory QoS + +When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified +ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps. + +### Use an annotation to enable Memory QoS for the pod + +Add the following annotations to enable memory QoS for the containers in a pod: + +```yaml +annotations: + # To enable memory QoS for the containers in a pod, set the value to auto. + koordinator.sh/memoryQOS: '{"policy": "auto"}' + # To disable memory QoS for the containers in a pod, set the value to none. + #koordinator.sh/memoryQOS: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable memory QoS for all the containers in a cluster + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + resource-qos-config: |- + { + "clusterStrategy": { + "lsClass": { + "memoryQOS": { + "enable": true + } + }, + "beClass": { + "memoryQOS": { + "enable": true + } + } + } + } +``` + +### (Optional) Advanced Settings + +The following table describes the advanced parameters that you can use to configure fine-grained memory QoS +configurations at the pod level and cluster level. + +| Parameter | Data type | Valid value | Description | +| ------------------- | ----------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| enable | Boolean |
  • true
  • false
|
  • true: enables memory QoS for all the containers in a cluster. The default memory QoS settings for the QoS class of the containers are used.
  • false: disables memory QoS for all the containers in a cluster. The memory QoS settings are restored to the original settings for the QoS class of the containers.
| +| policy | String |
  • auto
  • default
  • none
|
  • auto: enables memory QoS for the containers in the pod and uses the recommended memory QoS settings. The recommended memory QoS settings are prioritized over the cluster-wide memory QoS settings.
  • default: specifies that the pod inherits the cluster-wide memory QoS settings.
  • none: disables memory QoS for the pod. The relevant memory QoS settings are restored to the original settings. The original settings are prioritized over the cluster-wide memory QoS settings.
| +| minLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: `Value of memory.min = Memory request × Value of minLimitPercent/100`. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify Memory `Request=100MiB` and `minLimitPercent=100` for a container, `the value of memory.min is 104857600`. | +| lowLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: `Value of memory.low = Memory request × Value of lowLimitPercent/100`. For example, if you specify `Memory Request=100MiB` and `lowLimitPercent=100` for a container, `the value of memory.low is 104857600`. | +| throttlingPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the memory throttling threshold for the ratio of the memory usage of a container to the memory limit of the container. The memory throttling threshold for memory usage is calculated based on the following formula: `Value of memory.high = Memory limit × Value of throttlingPercent/100`. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container will be reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to cgroups from triggering OOM. For example, if you specify `Memory Limit=100MiB` and `throttlingPercent=80` for a container, `the value of memory.high is 83886080`, which is equal to 80 MiB. | +| wmarkRatio | Int | 0~100 | Unit: %. Default value:`95`. A value of `0` indicates that this parameter is disabled. This parameter specifies the threshold of the usage of the memory limit or the value of `memory.high` that triggers asynchronous memory reclaim. If `throttlingPercent` is disabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Memory limit × wmarkRatio/100`. If `throttlingPercent` is enabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Value of memory.high × wmarkRatio/100`. If the usage of the memory limit or the value of memory.high exceeds the threshold, the memcg backend asynchronous reclaim feature is triggered. For example, if you specify `Memory Limit=100MiB`for a container, the memory throttling setting is`memory.high=83886080`, the reclaim ratio setting is `memory.wmark_ratio=95`, and the reclaim threshold setting is `memory.wmark_high=79691776`. | +| wmarkMinAdj | Int | -25~50 | Unit: %. The default value is `-25` for the `LS`/ `LSR` QoS class and `50` for the `BE` QoS class. A value of 0 indicates that this parameter is disabled. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. 
A positive value increases the global minimum watermark and therefore antedates memory reclaim for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is `memory.wmark_min_adj=-25`, which indicates that the minimum watermark is decreased by 25% for the containers in the pod. | + +### Example + +0. The testing environment is shown below: + +- Kubernetes: 1.20 +- Nodes: + - Stress Node: an ECS instance (8 vCPU, 32GB RAM) for performing stress tests. + - Tested Node: an ECS instance (8 vCPU, 32GB RAM) runs the workload and serves. + +1. Create a file named redis-demo.yaml with the following YAML template: + +```yaml +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: redis-demo-config +data: + redis-config: | + appendonly yes + appendfsync no +--- +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS +spec: + containers: + - name: redis + image: redis:5.0.4 + command: + - redis-server + - "/redis-master/redis.conf" + env: + - name: MASTER + value: "true" + ports: + - containerPort: 6379 + resources: + limits: + cpu: "2" + memory: "6Gi" + requests: + cpu: "2" + memory: "2Gi" + volumeMounts: + - mountPath: /redis-master-data + name: data + - mountPath: /redis-master + name: config + volumes: + - name: data + emptyDir: {} + - name: config + configMap: + name: redis-demo-config + items: + - key: redis-config + path: redis.conf + nodeName: # Set nodeName to the name of the tested node +--- +apiVersion: v1 +kind: Service +metadata: + name: redis-demo +spec: + ports: + - name: redis-port + port: 6379 + protocol: TCP + targetPort: 6379 + selector: + name: redis-demo + type: ClusterIP +``` + +2. Run the following command to deploy Redis Server as the test application. + +You can access the redis-demo Service from within the cluster. + +```bash +kubectl apply -f redis-demo.yaml +``` + +3. Simulate the scenario of memory overcommitment. + +Use the Stress tool to increase the load on memory and trigger memory reclaim. The sum of the memory limits of all pods +on the node exceeds the physical memory of the node. + + a. Create a file named stress-demo.yaml with the following YAML template: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE +spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - 11G + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed +``` + + b. Run the following command to deploy stress-demo: + +```bash +kubectl apply -f stress-demo.yaml +``` + +4. Run the following command to query the global minimum watermark of the node: + +> Note In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, OOM +> killers may be triggered for all pods on the node even before memory reclaim is performed. Therefore, we recommend +> that you set the global minimum watermark to a high value. 
In this example, the global minimum watermark is set +> to 4,000,000 KB for the tested node that has 32 GiB of memory. + +```bash +cat /proc/sys/vm/min_free_kbytes +``` + +Expected output: + +```bash +4000000 +``` + +5. Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + name: memtier-demo + name: memtier-demo +spec: + containers: + - command: + - memtier_benchmark + - '-s' + - 'redis-demo' + - '--data-size' + - '200000' + - "--ratio" + - "1:4" + image: 'redislabs/memtier_benchmark:1.3.0' + name: memtier + restartPolicy: Never + nodeName: # Set nodeName to the name of the stress node that is used to send requests. +``` + +6. Run the following command to query the test results from memtier-benchmark: + +```bash +kubectl logs -f memtier-demo +``` + +7. Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests +again and compare the results. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'LS' +spec: + ... + +--- +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'BE' +``` + +8. Check the results of Memory QoS enabled and disabled. + +- Disabled: Set the memory QoS policy of the pod to `none`. +- Enabled: Set the memory QoS policy of the pod to `auto` (the recommended parameters of memory QoS are used). + +| Metric | Disabled | Enabled | +| ----------------- | ------------- | ------------- | +| Latency-avg | 51.32 ms | 47.25 ms | +| Throughput-avg | 149.0 MB/s | 161.9 MB/s | + +The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased +by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of +applications in memory overcommitment scenarios. 
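
After enabling or disabling memory QoS, you can also double-check the memcg parameters that koordlet applied for the Redis pod on the tested node. The sketch below assumes a cgroup v1 layout; `memory.wmark_ratio` is only available on kernels that provide the asynchronous reclaim interfaces (such as Anolis OS), and the expected values follow the formulas in the Advanced Settings table above.

```bash
# Sketch only: the cgroup path is an assumption; adjust it to your node's cgroup layout.
CGROUP=/sys/fs/cgroup/memory/kubepods/pod<pod-uid>/<container-id>
cat $CGROUP/memory.min          # expected: memory request * minLimitPercent / 100
cat $CGROUP/memory.low          # expected: memory request * lowLimitPercent / 100
cat $CGROUP/memory.high         # expected: memory limit * throttlingPercent / 100 (if enabled)
cat $CGROUP/memory.wmark_ratio  # asynchronous reclaim ratio (Anolis OS kernels only)
```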
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3.json b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3.json new file mode 100644 index 000000000..6057e117d --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3.json @@ -0,0 +1,22 @@ +{ + "sidebar.docs.category.Getting Started": { + "message": "快速开始", + "description": "快速开始" + }, + "sidebar.docs.category.Architecture": { + "message": "架构", + "description": "The label for category Architecture in sidebar docs" + }, + "sidebar.docs.category.User Manuals": { + "message": "用户手册", + "description": "The label for category User Manuals in sidebar docs" + }, + "sidebar.docs.category.Design Details": { + "message": "设计", + "description": "The label for category Design Details in sidebar docs" + }, + "sidebar.docs.category.Best Practices": { + "message": "最佳实践", + "description": "The label for category Best Practices in sidebar docs" + } +} diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/overview.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/overview.md new file mode 100644 index 000000000..8e62edf0e --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/overview.md @@ -0,0 +1,60 @@ +# 概述 + +本节描述了 Koordinator 部署到 Kubernetes 集群相关的架构、组件和核心概念。Koordinator 由两个控制面([Koordinator Scheduler](#koordinator-scheduler)/[Koordinator Manager](#koordinator-manager))和一个 DaemonSet 组件([Koordlet](#koordlet))组成。 +Koordinator 在 Kubernetes 原有的能力基础上增加了混部功能,并兼容了 Kubernetes 原有的工作负载。 + +![架构](/img/architecture.png) + +## Koord-Scheduler + +Koord-Scheduler 以 Deployment 的形式部署在集群中,用于增强 Kubernetes 在 QoS-aware,差异化 SLO 以及任务调度场景的资源调度能力,具体包括: + +- QoS-aware 调度,包括负载感知调度让节点间负载更佳平衡,资源超卖的方式支持运行更多的低优先级工作负载。 +- 差异化 SLO,包括 CPU 精细化编排,为不同的工作负载提供不同的 QoS 隔离策略(cfs,LLC,memory 带宽,网络带宽,磁盘io)。 +- 任务调度,包括弹性额度管理,Gang 调度,异构资源调度等,以支持更好的运行大数据和 AI 工作负载。 + +为了更好的支持不同类型的工作负载,Koord-scheduler 还包括了一些通用性的能力增强: + +- Reservation,支持为特定的 Pod 或者工作负载预留节点资源。资源预留特性广泛应用于重调度,资源抢占以及节点碎片整理等相关优化过程。 +- Node Reservation,支持为 kubernetes 之外的工作负载预留节点资源,一般应用于节点上运行着非容器化的负载场景。 + +## Koord-Descheduler + +Koord-Decheduler 以 Deployment 的形式部署在集群中,它是 kubernetes 上游社区的增强版本,当前包含: + +- 重调度框架, Koord-Decheduler 重新设计了全新重调度框架,在可扩展性、资源确定性以及安全性上增加了诸多的加强,更多的[细节](../designs/descheduler-framework). +- 负载感知重调度,基于新框架实现的一个负载感知重调度插件,支持用户配置节点的安全水位,以驱动重调度器持续优化集群编排,从而规避集群中出现局部节点热点. 
+ +## Koord-Manager + +Koord-Manager 以 Deployment 的形式部署,通常由两个实例组成,一个 leader 实例和一个 backup 实例。Koordinator Manager 由几个控制器和 webhooks 组成,用于协调混部场景下的工作负载,资源超卖(resource overcommitment)和 SLO 管理。 + +目前,提供了三个组件: + +- Colocation Profile,用于支持混部而不需要修改工作负载。用户只需要在集群中做少量的配置,原来的工作负载就可以在混部模式下运行,了解更多关于[Colocation Profile](../user-manuals/colocation-profile.md)。 +- SLO 控制器,用于资源超卖(resource overcommitment)管理,根据节点混部时的运行状态,动态调整集群的超发(overcommit)配置比例。该控制器的核心职责是管理混部时的 SLO,如智能识别出集群中的异常节点并降低其权重,动态调整混部时的水位和压力策略,从而保证集群中 pod 的稳定性和吞吐量。 +- Recommender(即将推出),它使用 histograms 来统计和预测工作负载的资源使用细节,用来预估工作负载的峰值资源需求,从而支持更好地分散热点,提高混部的效率。此外,资源 profiling 还将用于简化用户资源规范化配置的复杂性,如支持 VPA。 + +## Koordlet + +Koordlet 以 DaemonSet 的形式部署在 Kubernetes 集群中,用于支持混部场景下的资源超卖(resource overcommitment)、干扰检测、QoS 保证等。 + +在Koordlet内部,它主要包括以下模块: + +- 资源 Profiling,估算 Pod 资源的实际使用情况,回收已分配但未使用的资源,用于低优先级 Pod 的 overcommit。 +- 资源隔离,为不同类型的 Pod 设置资源隔离参数,避免低优先级的 Pod 影响高优先级 Pod 的稳定性和性能。 +- 干扰检测,对于运行中的 Pod,动态检测资源争夺,包括 CPU 调度、内存分配延迟、网络、磁盘 IO 延迟等。 +- QoS 管理器,根据资源剖析、干扰检测结果和 SLO 配置,动态调整混部节点的水位,抑制影响服务质量的 Pod。 +- 资源调优,针对混部场景进行容器资源调优,优化容器的 CPU Throttle、OOM 等,提高服务运行质量。 + +## Koord-RuntimeProxy + +Koord-RuntimeProxy 以 systemd service 的形式部署在 Kubernetes 集群的节点上,用于代理 Kubelet 与 containerd/docker 之间的 CRI 请求。这一个代理被设计来支持精细化的资源管理策略,比如为不同 QoS Pod 设置不同的 cgroup 参数,包括内核 cfs quota,resctl 等等技术特性,以改进 Pod 的运行时质量。。 + +## 下一步 + +以下是推荐下一步阅读的内容: + +- 学习 Koordinator 的[资源模型](./resource-model)。 +- 学习 Koordinator 的[Priority](./priority)。 +- 学习 Koordinator 的[QoS](./qos)。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/priority.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/priority.md new file mode 100644 index 000000000..da2f081ed --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/priority.md @@ -0,0 +1,88 @@ +# 优先级 + +Koordinator 在 Kubernetes 优先级类型的基础上定义了一套规范,并扩展了优先级的一个维度以对混部场景的细粒度支持。 + +## 定义 + +优先级用数字表示,目前定义了四个类: + +PriorityClass|优先级范围|描述 +----- | ----------- | -------- +koord-prod | [9000, 9999] | 需要提前规划资源配额,并且保证在配额内成功。 +koord-mid | [7000, 7099] | 需要提前规划资源配额,并且保证在配额内成功。 +koord-batch | [5000, 5999] | 需要提前规划资源配额,一般允许借用配额。 +koord-free | [3000, 3999] | 不保证资源配额,可分配的资源总量取决于集群的总闲置资源。 + +PriorityClass 目前留有一些暂未使用的区间,以支持未来可能的扩展。 + +## 约束 + +Koordinator 将不同类型的工作负载匹配到不同的优先级: + +- koord-prod,运行典型的延迟敏感型服务,一般是指需要 "实时 "响应的服务类型,比如通过点击移动APP中的按钮调用的典型服务。 +- koord-mid,对应于长周期的可用资源,一般用于运行一些实时计算、人工智能训练任务/作业,如 tensorflow/pytorch 等。 +- koord-batch,对应于的短周期可用资源,运行典型的离线批处理作业,一般指离线分析类作业,如日级大数据报告、非交互式 SQL 查询。 +- koord-free,运行低优先级的离线批处理作业,一般指不做资源预算,利用闲置资源尽量完成,如开发人员为测试目提交的作业。 + +## Koordinator 优先级与 Kubernetes优先级的对比 + +Koordinator 在 Kubernetes 集群中部署时会初始化这四个 PriorityClass。 + +``` +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-prod +value: 9000 +description: "This priority class should be used for prod service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-mid +value: 7000 +description: "This priority class should be used for mid service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-batch +value: 5000 +description: "This priority class should be used for batch service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-free +value: 3000 +description: "This priority class should be used for free service pods only." 
+``` + +在每个 PriorityClass 内,Koordinator 允许用户为精细化资源调度设置混部 Pod 的优先级。 + +## 示例 + +下面的 YAML 是一个 Pod 配置的例子,它使用了前面例子中创建的 PriorityClass 和优先级。 + +``` +apiVersion: v1 +kind: Pod +metadata: + name: nginx + labels: + env: test + koordinator.sh/priority: "5300" +spec: + containers: + - name: nginx + image: nginx + imagePullPolicy: IfNotPresent + priorityClassName: koord-batch +``` + +## 下一步是什么 + +以下是推荐下一步阅读的内容: + +- 学习 Koordinator 的[资源模型](./resource-model)。 +- 学习 Koordinator 的[QoS](./qos)。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/qos.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/qos.md new file mode 100644 index 000000000..057b565a1 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/qos.md @@ -0,0 +1,35 @@ +# QoS + +QoS 用于表达节点上 Pod 的运行质量,如获取资源的方式、获取资源的比例、QoS 保障策略等。 + +## 定义 + +Koordinator 调度系统支持的 QoS 有五种类型: + +QoS | 特点 | 说明 +--- | ---- | ------------- +SYSTEM | 系统进程,资源受限 | 对于 DaemonSets 等系统服务,虽然需要保证系统服务的延迟,但也需要限制节点上这些系统服务容器的资源使用,以确保其不占用过多的资源 +LSE(Latency Sensitive Exclusive) | 保留资源并组织同 QoS 的 pod 共享资源 | 很少使用,常见于中间件类应用,一般在独立的资源池中使用 +LSR(Latency Sensitive Reserved) | 预留资源以获得更好的确定性 | 类似于社区的 Guaranteed,CPU 核被绑定 +LS(Latency Sensitive) | 共享资源,对突发流量有更好的弹性 | 微服务工作负载的典型QoS级别,实现更好的资源弹性和更灵活的资源调整能力 +BE(Best Effort) | 共享不包括 LSE 的资源,资源运行质量有限,甚至在极端情况下被杀死 | 批量作业的典型 QoS 水平,在一定时期内稳定的计算吞吐量,低成本资源 + +## QoS CPU 编排隔离与共享 + +![img](/img/qos-cpu-orchestration.png) + +## Koordinator QoS与 Kubernetes QoS 的对比 + +从[定义](#定义)部分可以看出,Koordinator 的 QoS 比 Kubernetes 的 QoS 更复杂,因为在混部场景下,我们需要对延迟敏感的工作负载的 QoS 进行微调,以满足混部时性能的需求。 + +Koordinator 和 Kubernetes QoS 之间是有对应关系的: + +Koordinator QoS | Kubernetes QoS +--------------- | -------------- +SYSTEM | --- +LSE | Guaranteed +LSR | Guaranteed +LS | Guaranteed/Burstable +BE | BestEffort + +Koordlet 根据 Pod 的优先级和 QoS 定义,触发相应的资源隔离和 QoS 保障。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/resource-model.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/resource-model.md new file mode 100644 index 000000000..099ce0607 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/architecture/resource-model.md @@ -0,0 +1,37 @@ +# 资源模型 + +混部是一套资源调度解决方案,用于对延迟敏感的工作负载与大数据计算工作负载进行精细化编排。它需要解决两个主要问题: + +1. 如何为延迟敏感的工作负载调度资源,以满足性能和长尾延迟的要求。**这里涉及到的**关键点是资源调度策略和 QoS 感知策略。 +2. 如何调度和编排大数据计算工作负载,以较低的成本满足任务对计算资源的需求。**这里涉及到的**关键是如何在极端异常情况下实现合理的资源超额配置和 QoS 保障。 + +## 定义 + +![Resource Model](/img/resource-model.png) + +上图是 Koordinator 的混部资源模型,其基本思想是利用那些已分配但未使用的资源来运行低优先级的 pod。如图所示,有四条线: + +1. limit:灰色,高优先级 Pod 所请求的资源量,对应于 Kubernetes 的 Pod 请求。 +2. usage:红色,Pod 实际使用的资源量,横轴为时间线,红线为 Pod 负载随时间变化的波动曲线。 +3. short-term reservation:深蓝色,这是基于过去(较短)时期内的资源使用量,对未来一段时间内其资源使用量的估计。预留和限制的区别在于,分配的未使用(未来不会使用的资源)可以用来运行短期执行的批处理 Pod。 +4. 
long-term reservation:浅蓝色,与 short-term reservation 类似,但估计的历史使用期更长。从保留到限制的资源可以用于生命周期较长的Pod,与短期的预测值相比,可用的资源较少,但更稳定。 + +整个混部资源调度是基于上图所示的资源模型构建的,不仅可以满足各种工作负载的资源需求,还可以充分利用集群的闲置资源。 + +## SLO描述 + +在集群中运行的 Pod 资源 SLO 由两个概念组成,即优先级和 QoS。 + +* 优先级,即资源的优先级,代表了请求资源被调度的优先级。通常情况下,优先级会影响 Pod 在调度器待定队列中的相对位置。 + +* QoS,代表 Pod 运行时的服务质量。如cgroups cpu share、cfs 配额、LLC、内存、OOM 优先级等等。 + +需要注意的是,Priority 和 QoS 是两个维度的概念,但在实际业务场景中,两者之间会有一些约束(不是所有的组合都是合法的)。 + +## 下一步是什么 + +以下是推荐下一步阅读的内容: + +* 学习 Koordinator 的[优先级](./priority)。 +* 学习 Koordinator 的[QoS](./qos.md)。 + diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/anolis_plugsched.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/anolis_plugsched.md new file mode 100644 index 000000000..185b2633a --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/anolis_plugsched.md @@ -0,0 +1,37 @@ +--- +sidebar_position: 2 +--- + +# Anolis Plugsched + +为了提升CentOS 7.9操作系统内核在CPU资源维度的混部效果,龙蜥社区提供了一种插件化的解决方案,即利用 plugsched 调度器热升级技术提供一种 CPU 混部技术的调度器插件包。该插件可直接安装到 CentOS 7.9,不需要停机和业务迁移等工作。了解更多信息,请参阅[Blog](https://koordinator.sh/blog/anolis-CPU-Co-location) + +## Prerequisites + +- Kernel: 必须使用官方CentOS 7.9的内核。 +- version == 3.10.0 +- release >= 1160.81.1 + +## 使用 Plugsched + +### 安装插件 + + ``` + # rpm -ivh https://github.com/koordinator-sh/koordinator/releases/download/v1.1.1/scheduler-bvt-noise-clean-$(uname -r).rpm + ``` + +如果你更新内核版本,你可以使用如下命令安装新的插件。 + + ``` + # rpm -ivh https://github.com/koordinator-sh/koordinator/releases/download/v1.1.1/scheduler-bvt-noise-clean-$(uname -r).rpm --oldpackage + ``` + +安装完成后,你可以在cpu cgroup目录下看到 `cpu.bvt_warp_ns` ,其使用方法与Group Identity特性兼容。 + +### 移除插件 + +移除插件可以使用 `rpm -e` 命令,然后 `cpu.bvt_warp_ns` 将也不再存在。请确保卸载插件前没有任何任务还在使用 `cpu.bvt_warp_ns` 。 + +## 使用Koordinator的CPU QoS功能 + +请参阅对应的[用户文档](../user-manuals/cpu-qos.md)。 \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/colocation-of-spark-jobs.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/colocation-of-spark-jobs.md new file mode 100644 index 000000000..ee27d1389 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/colocation-of-spark-jobs.md @@ -0,0 +1,101 @@ +--- +sidebar_position: 1 +--- + +# Colocation of Spark Jobs +Apache Spark is an analysis engine for large-scale data processing, which is widely used in Big Data, SQL Analysis and Machine Learning scenarios. This tutorial provides a quick practice guide about running Spark jobs in colocation mode with other latency sensitive applications by Koordinator, which is helpful for improving cluster resource utilization. For more details about how to use, compose, and work with Koordinator colocation, please refer to the [Introduction](../) + +## Requirements +### Koordinator Components +Before submitting Spark jobs as colocate mode, you need to ensure all Koordinator components have already been successfully installed. Please follow the step in [Installation](../installation) guide. + +### Install Kubernetes Operator for Apache Spark +To simplify running of Spark jobs in Cluster, we import the Kubernetes Operator for Apache Spark in this practice, which uses Kubernetes custom resource for managing Spark applications. + +With the help of Helm [chart](https://github.com/koordinator-sh/koordinator/tree/main/examples/spark-operator-chart), Kubernetes Operator for Apache Spark can be easily installed using the command below. 
+``` +$ helm install koord-spark-operator ./spark-operator-chart/ --namespace spark-operator +``` + +Installing the chart will create a namespace `spark-operator` and if doesn't exist, besides, helm will create a spark-operator Deployment and set up RBAC role for it. After the installation, you should see the operator in running successfully by checking the status of helm release. +``` +$ helm status --namespace spark-operator koord-spark-operator +``` + +## Run Spark Applications with Koordinator +Due to the mechanism that Spark driver pod needs a Kubernetes service account to manage executor pods, the service account must be authorized with appropriate permissions. Run the following command to create namespace `spark-demo` and service account `spark` before submitting jobs. +``` +$ kubectl apply -f examples/spark-jobs/service-account.yaml +``` + +Next, run the following command to create Colocation Profile so that all pods submitted following in namespace `spark-demo` will run in colocation mode. See this [tutorial](../user-manuals/colocation-profile) to learn more about Colocation Profile. +``` +$ kubectl apply -f examples/spark-jobs/cluster-colocation-profile.yaml +``` + +Submit a Spark TC example job to namespace `spark-demo` with the command: +``` +$ kubectl apply -f examples/spark-jobs/spark-tc-complex.yaml +``` + +Then, check the status of Spark application by running the following command. +``` +$ kubectl get sparkapplication -n spark-demo spark-tc-complex +``` + +This will show similar content as following: +``` +NAME STATUS ATTEMPTS START FINISH AGE +spark-tc-complex RUNNING 1 2022-03-30T09:11:22Z 14s +``` +Now, all pods submitted to namespace `spark-demo` will be switched to colocation mode, check spark-driver pod as below for example. We can see the protocols like`koordinator.sh/qosClass: BE` and `kubernetes.io/batch-cpu` are successfully injected to pod by Colocation Profile. +``` +apiVersion: v1 +kind: Pod +metadata: + labels: + koordinator.sh/qosClass: BE + spark-role: driver + ... +spec: + containers: + - args: + - driver + - --properties-file + - /opt/spark/conf/spark.properties + - --class + - org.apache.spark.examples.SparkTC + - local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1-tc1.2.jar + resources: + limits: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + requests: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + ... +``` + +## Evaluation +With the help of Koordinator, when pods resource usage is idle, resources already requested can be reallocated to other colocation pods by the overcommitment model, which can significantly improve the resource utilization of cluster. + +In our experiment environment, before the Spark job submitted, we can see the cluster allocatable resources run out while the actual resource usage is in low level. +``` +$ kubectl describe node + Allocated resources: + Resource Requests + cpu 7620m (95.25%) + +$ kubectl top node + NAME CPU(cores) CPU% + cn-hangzhou.your-node-1 1190m 14.8% + cn-hangzhou.your-node-2 1620m 20.25% +``` + +After submit the Spark job in colocation mode, those unused resources will be reallocated through `batch priority` to Spark pods, so that we can make the cluster a higher utilization level. 
+``` +$ kubectl top node +NAME CPU(cores) CPU% +cn-hangzhou.your-node-1 4077m 52% +cn-hangzhou.your-node-2 3830m 49% +``` \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..92851eb8c --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md @@ -0,0 +1,259 @@ +# Coordinated sharing of CPU resources in Colocation Scenarios - Fine-grained CPU Orchestration + +## Introduction + +In a cloud-native environment, users often deploy different types of workloads in the same cluster, leveraging different peak effects of different services to achieve time-sharing multiplexing of resources and avoid resource waste. However, colocation of different types of workloads often leads to resource competition and mutual interference. The most typical scenario is the colocation of online and offline workloads. When more computing resources are occupied by offline workloads, the response time of online loads will be affected; when more computing resources are occupied by online workloads for a long time, the task completion time of offline workloads cannot be guaranteed. This phenomenon belongs to the Noisy Neighbor problem. + +Depending on the degree of colocation and resource types, there are many different ways to solve this problem. Quota management can limit the resource usage of loads from the entire cluster dimension, and Koordinator provides multi-level elastic quota management functions in this regard. From the single-node level, CPU, memory, disk IO, and network resources may be shared by different loads. Koordinator has provided some resource isolation and guarantee capabilities on CPU and memory, and related capabilities on disk IO and network resources are under construction. + +This article mainly introduces how Koordinator helps loads (online and online, online and offline) share CPU resources collaboratively when different types of workloads are colocated on the same node. + +## Problem Description + +The essence of CPU resource Noisy Neighbor is that different workloads share CPU resources without coordination. +1. The default resource model of Kubernetes uses cgroup (cfs quota) to limit the access of different loads to CPU resources in terms of CPU time usage. In this case, some workloads may be switched to CPU cores by the operating system scheduler. Since different CPU cores have different memory access time to different physical locations, switching cpu cores will result in longer memory access time, thus affecting load performance, thereby affecting load performance. +2. In NUMA architecture, SMT threads (logical cores) share execution units and L2 cache of physical cores. +When there are multiple workloads on the same physical core, resource contention will happen between different workloads, resulting in load performance degradation. + +Kubernetes provides topology manager and CPU manager on node level to solve the above problems. However, this feature will only attempt to take effect after the Pod has been scheduled on the machine. This may lead to the situation where Pods are scheduled to nodes with sufficient CPU resources but topology requirements are not met. 
+ +## Solutions + +### Application-Oriented CPU Orchestration QoS Semantics + +In response to the above problems and deficiencies, Koordinator designed an application-oriented QoS semantics and CPU orchestration protocol, as shown in the figure below. + +![img](/img/qos-cpu-orchestration.png) + +LS (Latency Sensitive) is applied to typical microservice loads, and Koordinator isolates it from other latency-sensitive loads to ensure its performance. LSR (Latency Sensitive Reserved) is similar to Kubernetes' Guaranteed. On the basis of LS, it adds the semantics that applications require reserved binding cores. LSE (Latency Sensitive Exclusive) is common in applications that are particularly sensitive to CPU, such as middleware. In addition to satisfying its semantics similar to LSR's requirement to bind cores, Koordinator also ensures that the allocated CPU is not shared with any other load. + +Also, to improve resource utilization, BE workloads can share CPU with LSR and LS. To ensure that latency-sensitive applications shared with BE are not disturbed by it, Koordinator provides strategies such as interference detection and BE suppression. The focus of this article is not here, readers can pay attention to follow-up articles. + +### Rich CPU scheduling strategies + +For LSE applications, when the machine is a hyper-threaded architecture, only logical cores can be guaranteed to be exclusive to the load. In this way, when there are other loads on the same physical core, application performance will still be disturbed. +To this end, Koordinator supports users to configure rich CPU scheduling policies on pod annotation to improve performance. + +CPU orchestration policies are divided into CPU-binding policies and CPU-exclusive policies. The CPU binding strategy determines the distribution of logical cores assigned to the application among physical cores, which can be spread or stacked among physical cores. Stacking (FullPCPU) refers to allocating complete physical cores to applications, which can effectively alleviate the Noisy Neighbor problem. SpreadByPCPU is mainly used in some delay-sensitive applications with different peak and valley characteristics, allowing the application to fully use the CPU at a specific time. The CPU exclusive policy determines the exclusive level of logical cores assigned to the application, and it can try to avoid physical cores or NUMANodes that have been applied for with the exclusive policy. + +### Enhanced CPU Scheduling Capabilities + +Koordinator supports the configuration of NUMA allocation strategies to determine how to select satisfactory NUMA nodes during scheduling. MostAllocated indicates allocation from the NUMA node with the least available resources, which can reduce fragmentation as much as possible and leave more allocation space for subsequent loads. However, this approach may cause the performance of parallel code that relies on Barriers to suffer. DistributeEvenly means that evenly distributing CPUs on NUMA nodes can improve the performance of the above parallel code. LeastAllocated indicates allocation from the NUMA node with the most available resources. + +In addition, Koordinator's CPU allocation logic is completed in the central scheduler. In this way, there will be a global perspective, avoiding the dilemma of single-node solution, where CPU resources may be sufficient but topology requirements are not met. 
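
For illustration, the following sketch shows how these policies can be declared on a single Pod, combining the `koordinator.sh/qosClass` label with the `scheduling.koordinator.sh/resource-spec` annotation that also appears in the ClusterColocationProfile examples later in this article. The Pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-demo   # placeholder name
  labels:
    koordinator.sh/qosClass: LSR   # reserve and bind CPUs for this pod
  annotations:
    # Spread the bound logical cores across physical cores and avoid sharing those
    # physical cores with other workloads that also request exclusivity.
    scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"SpreadByPCPUs","preferredCPUExclusivePolicy":"PCPULevel"}'
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi
```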
+ +## Best Practices +As can be seen from the above, Koordinator's fine-grained CPU orchestration capability can significantly improve the performance of CPU-sensitive workloads in multi-application colocation scenarios. In order to allow readers to use Koordinator’s fine-grained CPU scheduling capabilities more clearly and intuitively, this article deploys online applications to clusters in different ways, and observes the latency of services in stress testing to judge the effect of CPU scheduling capabilities. + +In this article, multiple online applications will be deployed on the same machine and pressure tested for 10 minutes to fully simulate the CPU core switching scenarios that may occur in production practice. For the colocation of online and offline applications, Koordinator provides strategies such as interference detection and BE suppression. The focus of this article is not here, and readers can pay attention to the practice in subsequent articles. + +|Group Number|Deployment Mode|Description|Scenarios| +|-|-|-|-| +|A|10 online applications are deployed on the nodes, and each node applies for 4 CPUs, all using kubernetes guaranteed QoS|Koordinator does not provide fine-grained CPU orchestration capabilities for applications|Due to CPU core switching, applications share logical cores, application performance will be affected, and it is not recommended to use| +|B|Deploy 10 online applications on the nodes, each application node has 4 CPUs, all adopt LSE QoS, CPU binding strategy adopts physical core binpacking(FullPCPUs)|Koordinator provides CPU core binding capability for LSE Pod and online applications will not share physical cores|Particularly sensitive online scenarios, application cannot accept CPU sharing at the physical core level| +|C|Deploy 10 online applications on the node, each application node has 4 CPUs, all adopt LSR QoS, CPU binding strategy adopts physical core split (SpreadByPCPUs), use CPU exclusively by physical cpu level|Koordinator provides CPU binding core capability for LSR Pod and online application logical core can use more physical core capacity|It is often used to share physical cores with offline Pods and implement time-sharing multiplexing at the physical core level. This article does not focus on the mixed deployment of online and offline applications, so it only tests the overuse of online applications| + +This experiment uses the following performance indicators to evaluate the performance of Nginx applications under different deployment methods: + +- RT (Response Time) quantile value: RT is a performance indicator that online applications usually focus on. The lower the RT, the better the online service performance. The RT indicator is obtained by collecting the information printed after the wrk pressure tests. In the experiment, it reflects the time it takes for the Nginx application to respond to the wrk request. For example, RT-p50 indicates the maximum time (median) it takes for Nginx to respond to the first 50% of wrk requests, and RT-p90 indicates the maximum time it takes for Nginx to respond to the first 90% of wrk requests. +- RPS (Request Per Second): RPS is the number of requests served by an online application per second. The more RPS a service bears, the better the performance of the online service. 
+
+
+The experimental results are as follows:
+
+|Performance Indicators / Deployment Mode|A (colocation of two online applications, Guaranteed)|B (colocation of two online applications, LSE, FullPCPUs)|C (colocation of two online applications, LSR, SpreadByPCPUs, PCPULevel)|
+|-|-|-|-|
+|RPS|114778.29|114648.19|115268.50|
+|RT-avg (ms)|3.46 ms|3.33 ms|3.25 ms|
+|RT-p90 (ms)|5.27 ms|5.11 ms|5.06 ms|
+|RT-p99 (ms)|15.22 ms|12.61 ms|12.14 ms|
+
+- Comparing B with A, it can be found that after adopting LSE QoS to bind cores, the P99 response time is significantly reduced and the long-tail latency is well alleviated.
+- Comparing C with B, it can be found that after using LSR QoS to bind cores and allowing logical cores to use more physical-core capacity, more requests can be served with a better response time.
+
+In summary, in scenarios where online services are colocated on the same machine, using Koordinator's fine-grained CPU orchestration can effectively suppress the Noisy Neighbor problem and reduce the performance degradation caused by CPU core switching.
+
+### Environment
+
+First, prepare a Kubernetes cluster and install Koordinator. This article chooses two nodes of a Kubernetes cluster for the experiment: one node is used as the test machine, which runs the Nginx online server; the other node is used as the pressure test machine, which runs the wrk client to send stress-test requests to the Nginx server.
+
+### Online application deployment
+
+1. Inject fine-grained CPU orchestration protocols into applications using ColocationProfile
+
+   Group B fine-grained CPU orchestration protocol
+
+   ```yaml
+   apiVersion: config.koordinator.sh/v1alpha1
+   kind: ClusterColocationProfile
+   metadata:
+     name: colocation-profile-example
+   spec:
+     selector:
+       matchLabels:
+         app: nginx
+     # 采用 LSE QoS
+     qosClass: LSE
+     annotations:
+       # 采用物理核间堆叠
+       scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"FullPCPUs"}'
+     priorityClassName: koord-prod
+   ```
+
+   Group C fine-grained CPU orchestration protocol
+
+   ```yaml
+   apiVersion: config.koordinator.sh/v1alpha1
+   kind: ClusterColocationProfile
+   metadata:
+     name: colocation-profile-example
+   spec:
+     selector:
+       matchLabels:
+         app: nginx
+     # 采用 LSR QoS
+     qosClass: LSR
+     annotations:
+       # 采用物理核间打散且独占物理核
+       scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"SpreadByPCPUs", "preferredCPUExclusivePolicy":"PCPULevel"}'
+     priorityClassName: koord-prod
+   ```
+
+2.
This article uses Nginx server as Online Service , Pod YAML is as follows: + + ```yaml + --- + # nginx应用配置 + apiVersion: v1 + data: + config: |- + user nginx; + worker_processes 4; # Nginx的Worker个数,影响Nginx Server的并发。 + + events { + worker_connections 1024; # 默认值为1024。 + } + + http { + server { + listen 8000; + + gzip off; + gzip_min_length 32; + gzip_http_version 1.0; + gzip_comp_level 3; + gzip_types *; + } + } + + #daemon off; + kind: ConfigMap + metadata: + name: nginx-conf-0 + --- + # Nginx实例,作为在线类型服务应用。 + apiVersion: v1 + kind: Pod + metadata: + labels: + app: nginx + name: nginx-0 + namespace: default + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + - "${node_name}" + schedulerName: koord-scheduler + priorityClassName: koord-prod + containers: + - image: 'koordinatorsh/nginx:v1.18-koord-exmaple' + imagePullPolicy: IfNotPresent + name: nginx + ports: + - containerPort: 8000 + hostPort: 8000 # 压测请求访问的端口。 + protocol: TCP + resources: + limits: + cpu: '4' + memory: 8Gi + requests: + cpu: '4' + memory: 8Gi + volumeMounts: + - mountPath: /apps/nginx/conf + name: config + hostNetwork: true + restartPolicy: Never + volumes: + - configMap: + items: + - key: config + path: nginx.conf + name: nginx-conf-0 + name: config + ``` + +3. Execute the following command to deploy the Nginx application. + + ```bash + kubectl apply -f nginx.yaml + ``` + +4. Execute the following command to view the Pod status of the Nginx application. + + ```bash + kubectl get pod -l app=nginx -o wide + ``` + + You can see output similar to the following, indicating that the Nginx application has been running normally on the test machine. + + ``` + NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES + nginx-0 1/1 Running 0 2m46s 10.0.0.246 test-machine-name + + ``` + +### Load Test + +1. On the testing machine, execute the following command to deploy the stress testing tool wrk. + + ```bash + wget -O wrk-4.2.0.tar.gz https://github.com/wg/wrk/archive/refs/tags/4.2.0.tar.gz && tar -xvf wrk-4.2.0.tar.gz + cd wrk-4.2.0 && make && chmod +x ./wrk + ``` + +2. On the testing machine, execute the following command to deploy the load testing tool wrk + + ```bash + # node_ip填写测试机的IP地址,用于wrk向测试机发起压测;8000是Nginx暴露到测试机的端口。 + taskset -c 32-45 ./wrk -t120 -c400 -d600s --latency http://${node_ip}:8000/ + ``` + +3. After waiting for wrk to finish running, obtain the pressure test results of wrk. The output format of wrk is as follows. Repeat the test several times to obtain relatively stable results. + + ``` + Running 10m test @ http://192.168.0.186:8000/ + 120 threads and 400 connections + Thread Stats Avg Stdev Max +/- Stdev + Latency 3.29ms 2.49ms 352.52ms 91.07% + Req/Sec 0.96k 321.04 3.28k 62.00% + Latency Distribution + 50% 2.60ms + 75% 3.94ms + 90% 5.55ms + 99% 12.40ms + 68800242 requests in 10.00m, 54.46GB read + Requests/sec: 114648.19 + Transfer/sec: 92.93MB + ``` + +## Conclusion + +In a Kubernetes cluster, there may be competition for resources such as CPU and memory among different business loads, which affects the performance and stability of the business. In the face of the Noisy Neighbor phenomenon, users can use Koordinator to configure more refined CPU scheduling policies for applications, so that different applications can share CPU resources collaboratively. 
We have shown through experiments that Koordinator's fine-grained CPU scheduling capability can effectively suppress the competition for CPU resources and improve application performance.
\ No newline at end of file
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/descheduler-framework.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/descheduler-framework.md
new file mode 100644
index 000000000..e054a557a
--- /dev/null
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/descheduler-framework.md
@@ -0,0 +1,84 @@
+# Descheduler Framework
+
+## Summary
+
+This proposal is based on the K8s community's [descheduler](https://github.com/kubernetes-sigs/descheduler) and designs and implements the descheduler framework required by Koordinator.
+
+## Motivation
+
+The existing [descheduler](https://github.com/kubernetes-sigs/descheduler) in the community can solve some problems, but we think that there are still many aspects of it that can be improved. For example, it only supports the periodic execution mode and does not support an event-triggered mode; it is not possible to extend and configure custom descheduling strategies without invading the existing descheduler code, as kube-scheduler allows; and it does not support implementing a custom evictor.
+
+We also noticed that the K8s descheduler community has found these problems and proposed corresponding solutions, such as [#753 Descheduler framework Proposal](https://github.com/kubernetes-sigs/descheduler/issues/753) and [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781). The K8s descheduler community is trying to implement a descheduler framework similar to the k8s scheduling framework. This coincides with our thinking.
+
+On the whole, these solutions solve most of our problems, but we also noticed that the related implementations have not been merged into the main branch. Having reviewed these implementations and discussions, we believe this is the right direction. Considering that Koordinator has clear milestones for descheduler-related features, we will implement Koordinator's own descheduler independently of the upstream community. We will use some of the designs in the [#753 PR](https://github.com/kubernetes-sigs/descheduler/issues/753) proposed by the community, and we will follow Koordinator's compatibility principle with K8s to maintain compatibility with the upstream community descheduler in the implementation. Such an independent implementation can also drive the evolution of the upstream community's work on the descheduler framework. And when the upstream community has new changes or switches to an architecture that Koordinator deems appropriate, Koordinator will follow up promptly and actively.
+
+### Goals
+
+1. Implement Koordinator Descheduler following part of the design in [#753](https://github.com/kubernetes-sigs/descheduler/issues/753) proposed by the community
+
+### Non-Goals/Future Work
+
+1. Break any existing use cases of the Descheduler.
+
+## Proposal
+
+### Implementation Details/Notes/Constraints
+
+#### Descheduler profile
+
+The current descheduler configuration is too simple to support disabling or enabling plugins or custom plugin configurations. [PR #587](https://github.com/kubernetes-sigs/descheduler/pull/587) introduces descheduler profiles with the v1alpha2 API version. We will use this proposal as Koordinator Descheduler's configuration API.
+
+- The descheduler profile API supports users specifying which extension points are enabled or disabled, alongside specifying plugin configuration, including the ability to configure multiple descheduling profiles.
+- The descheduling framework configuration can be converted into an internal representation.
+- To reduce the need to specify a value for every possible configuration, defaulting serves as a recommended/opinionated set of settings for the plugins.
+
+#### Abstract PodEvictor interface
+
+Currently, the descheduler has split `Pod Evictor` and `Evictor Filter`. Users can inject an `Evictor Filter` on demand; when selecting abnormal Pods, a plugin calls the `Evictor Filter` to pick the Pods that meet the requirements and then calls `Pod Evictor` to initiate eviction. At present, `Pod Evictor` has not been abstracted as an interface. We adopt the solution in [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781) to abstract an `Evictor` interface, and refer to [PR #885](https://github.com/kubernetes-sigs/descheduler/pull/885) to add an `EvictOptions` parameter. We can implement a custom Evictor based on [PodMigrationJob](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md).
+
+The `Evictor` interface is defined as follows:
+
+```go
+type EvictOptions struct {
+    // PluginName represents the initiator of the eviction operation
+    PluginName string
+    // Reason allows for passing details about the specific eviction for logging.
+    Reason string
+    // DeleteOptions holds the arguments used to delete
+    DeleteOptions *metav1.DeleteOptions
+}
+
+// Plugin is the parent type for all the descheduling framework plugins.
+type Plugin interface {
+    Name() string
+}
+
+type Evictor interface {
+    Plugin
+    // Filter checks if a pod can be evicted
+    Filter(pod *corev1.Pod) bool
+    // Evict evicts a pod (no pre-check performed)
+    Evict(ctx context.Context, pod *corev1.Pod, evictOptions EvictOptions) bool
+}
+```
+
+#### Plug-in descheduler strategy
+
+The current descheduler has a number of strategies. In [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781), they are converted into `Plugin`s and executed periodically. In this `periodic execution mode`, it is appropriate to abstract the policies for the Pod and Node dimensions as `DeschedulePlugin` or `BalancePlugin`. The load hotspot descheduling capability that we will implement later can also implement the `BalancePlugin` interface.
+
+The `DeschedulePlugin` and `BalancePlugin` interfaces are defined as follows:
+
+```go
+type DeschedulePlugin interface {
+    Plugin
+    Deschedule(ctx context.Context, nodes []*corev1.Node) *Status
+}
+
+type BalancePlugin interface {
+    Plugin
+    Balance(ctx context.Context, nodes []*corev1.Node) *Status
+}
+```
+
+We also need to support an `event-triggered mode`, which means that descheduling is performed in the form of a Controller.
+In some scenarios, CRD-oriented descheduling needs to be implemented. For example, different descheduling configurations are provided according to the workload, and when some abnormality is detected in the workload, descheduling is triggered. We can think of a Controller as a special form of Plugin. When the descheduler is initialized, an instance is constructed through the plugin factory function like a normal Plugin, and then a similar Run method is called to start execution.
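+
+As a sketch of this idea (the type and method names below, apart from `Plugin` and `Evictor`, are assumptions for illustration rather than part of the proposal), an event-triggered plugin could look like this:
+
+```go
+// MigrationController is a hypothetical event-triggered descheduling plugin.
+// It is registered through the plugin factory like any other Plugin, but
+// instead of being invoked periodically via Deschedule/Balance, the framework
+// calls Run once after initialization and the controller reacts to events
+// (e.g. changes on a workload-related CRD) on its own.
+type MigrationController struct {
+    evictor Evictor // reuse the Evictor abstraction defined above
+}
+
+func (c *MigrationController) Name() string { return "MigrationController" }
+
+// Run blocks until stopCh is closed, watching the relevant objects and
+// calling c.evictor.Filter/Evict when an abnormality is detected.
+func (c *MigrationController) Run(stopCh <-chan struct{}) {
+    // event-handling loop elided in this sketch
+    <-stopCh
+}
+```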
\ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/enhanced-scheduler-extension.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/enhanced-scheduler-extension.md new file mode 100644 index 000000000..8c61c719d --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/enhanced-scheduler-extension.md @@ -0,0 +1,232 @@ +# Enhanced Scheduler Extension + +## Summary + +This proposal describes how to extend the kubernetes scheduling framework without modify upstream codes to support the scheduling features that Koordinator needs to develop. + +## Motivation + +Although Kubernetes Scheduler provides the scheduling framework to help developer to extend scheduling features. However, it cannot support the features that Koordinator needs to develop, such as Reservation, problem diagnosis and analysis, etc. + +### Goals + +1. Provides scheduling extension point hook mechanism +1. Provides scheduling plugins expose state mechanism to help diagnose analysis problems + +### Non-Goals/Future Work + + +## Proposal + +### User stories + +#### Story 1 + +Koordiantor supports users to use `Reservation` CRD to reserve resources. We expect Reservation CRD objects to be scheduled like Pods. In this way, the native scheduling capabilities of Kubernetes and other extended scheduling capabilities can be reused. This requires a mechanism to disguise the Reservation CRD object as a Pod, and to extend some scheduling framework extension points to support updating the Reservation Status. + +#### Story 2 + +Koordinator provides some scheduling plugins, such as Fine-grained CPU Scheduling, Device Share Scheduling, Coscheduling, ElasticQuota, etc. These plugins are brand new, and the supported scenarios are relatively rich, and the internal logic and state of the plugins are also a bit complicated. When we may encounter some problems in the production environment and need to be diagnosed and analyzed, we need to confirm the cause of the problem based on the internal status of the plugin. But currently the kubernetes scheduling framework does not provide a mechanism to export the internal state of the plugin. + +#### Story 3 + +The scheduler provides many plugins, and most plugins implement Scoring Extension Point. How to configure the weights of these plugins needs to be decided in combination with specific problems. When the optimal node is selected according to the scoring results, the results may not meet expectations. At this point we need to be able to trace or debug these scoring results in some way. But there is currently no good way. + +### Design Details + +#### Enhancement Kubernetes Scheduling Framework principles + +At present, the kube-scheduler provided by Kubernetes can be divided into several parts. The outermost layer is `k8s.io/kubernetes/cmd/kube-scheduler`, which is the entrance of kube-scheduler; `k8s.io/kubernetes/pkg/scheduler` is responsible for integrating the framework And execute scheduling workflow, including initializing framework and plugins, scheduling Pod, etc. The core module is `k8s.io/kubernetes/pkg/scheduler/framwork`, which is the **Kubernetes Scheduling Framework**. + +Each layer provides some interfaces or methods to support developers to extend some capabilities, and the evolution speed of each layer is also different. 
Generally speaking, the more core the module, the slower it should evolve, and core modules tend to evolve by extension rather than by modifying existing interfaces or extension mechanisms; otherwise it would impose very large cost and reliability problems on external dependencies. However, each layer, for various reasons, does not support implementing some of the features we need. As far as the problems Koordinator currently faces are concerned, there are still workarounds, but some principles need to be followed to reduce future conflicts with the evolution of the upstream Kubernetes community.
+
+1. DO NOT modify the Kubernetes Scheduling Framework. The scheduling framework is the core module of kube-scheduler and is still evolving; avoiding changes to it avoids conflicts between Koordinator's enhanced capabilities and the upstream community.
+1. DO NOT modify `k8s.io/kubernetes/pkg/scheduler`, but implement the supported interfaces or higher-order functions, such as `ScheduleAlgorithm`, `NextPod`, `Error` and `Profiles`. The `Profiles` contains an instance of the Framework interface corresponding to each KubeSchedulerProfile. We can implement the Framework and replace the instances in Profiles to get the opportunity to participate in the scheduling process and do something.
+1. Extend `k8s.io/kubernetes/cmd/kube-scheduler` as simply as possible.
+
+#### Custom Extension Overview
+
+![image](/img/scheduler-extension.jpg)
+
+#### ExtendedHandle
+
+ExtendedHandle extends the k8s scheduling framework `Handle` interface to facilitate plugins accessing Koordinator's resources and states.
+Before constructing the `k8s.io/kubernetes/pkg/scheduler.Scheduler` object, we should build an ExtendedHandle object and pass it to each custom plugin.
+
+```go
+type ExtendedHandle interface {
+    framework.Handle
+    KoordinatorClientSet() koordinatorclientset.Interface
+    KoordinatorSharedInformerFactory() koordinatorinformers.SharedInformerFactory
+    SnapshotSharedLister() framework.SharedLister
+}
+```
+
+#### Intercept plugin initialization process
+
+In order to pass the `ExtendedHandle` object to each custom plugin, we should intercept the plugin initialization process.
+We also expect that any customized plugin can be directly and seamlessly integrated into the koordinator scheduler, so the `PluginFactory` of the plugin will not be changed. Therefore, we can modify the prototype of `k8s.io/kubernetes/cmd/kube-scheduler/app.Option` and the implementation of `k8s.io/kubernetes/cmd/kube-scheduler/app.WithPlugin` as follows to get the opportunity to intercept the plugin initialization process.
+
+When a custom plugin is registered to the out-of-tree registry using `WithPlugin`, it will use `frameworkext.PluginFactoryProxy` to wrap the plugin's original `PluginFactory`. We finally complete the interception of the plugin initialization process in `frameworkext.PluginFactoryProxy`.
+
+Of course, we will not modify `k8s.io/kubernetes/cmd/kube-scheduler` directly. Considering that the logic of `k8s.io/kubernetes/cmd/kube-scheduler` itself is not complicated, maintaining a copy will basically not bring us additional cost, so we will copy the relevant code into Koordinator and maintain it separately.
+
+
+```go
+
+// Option configures a framework.Registry.
+type Option func(frameworkext.ExtendedHandle, runtime.Registry) error
+
+// WithPlugin creates an Option based on plugin name and factory.
Please don't remove this function: it is used to register out-of-tree plugins, +// hence there are no references to it from the kubernetes scheduler code base. +func WithPlugin(name string, factory runtime.PluginFactory) Option { + return func(handle frameworkext.ExtendedHandle, registry runtime.Registry) error { + return registry.Register(name, frameworkext.PluginFactoryProxy(handle, factory)) + } +} + +// frameworkext.PluginFactoryProxy +func PluginFactoryProxy(extendHandle ExtendedHandle, factoryFn frameworkruntime.PluginFactory) frameworkruntime.PluginFactory { + return func(args runtime.Object, handle framework.Handle) (framework.Plugin, error) { + impl := extendHandle.(*frameworkExtendedHandleImpl) + impl.once.Do(func() { + impl.Handle = handle + }) + return factoryFn(args, extendHandle) + } +} +``` + +#### Expose the internal state of plugins + +We will define a new extension interface to help the plugin expose the internal state through the Restful API, and provide some built-in Restful APIs to query which APIs are exposed by the current scheduler and some commonly used internal data, such as NodeInfo, etc. + +The new extension interface named `APIServiceProvider`. The plugins can implement this interface to register the API to be exposed as needed. When the plugin is initialized, `frameworkext.PluginFactoryProxy` will check whether the newly constructed plugin implements `APIServiceProvider`, and if so, it will call the `RegisterEndpoints` method of the interface to register the API. The Restful APIs exposed by these plugins will be bound to the URL path `/apis/v1/plugins/` and will be prefixed with the name of each plugin. For example, the API `/availableCPUs/:nodeName` exposed by the plugin `NodeNUMAResource` will be converted to `/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName`. + + +```go +type APIServiceProvider interface { + RegisterEndpoints(group *gin.RouterGroup) +} + +type ErrorMessage struct { + Message string `json:"message,omitempty"` +} + +func ResponseErrorMessage(c *gin.Context, statusCode int, format string, args ...interface{}) { + var e ErrorMessage + e.Message = fmt.Sprintf(format, args...) + c.JSON(statusCode, e) +} +``` + +Users can use the built-in API `/apis/v1/__services__` to query how many Restful APIs are provided by the current scheduler. The response as the follows: + +```json +{ + "GET": [ + "/apis/v1/__services__", + "/apis/v1/nodes/:nodeName", + "/apis/v1/plugins/Coscheduling/gang/:namespace/:name", + "/apis/v1/plugins/Coscheduling/gangs", + "/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName", + "/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/:nodeName" + ] +} +``` + +Koordinator scheduler also provides `/apis/v1/nodes/:nodeNa` to expose internal `NodeInfo` to developers. + + +#### Support plugin to create Controllers + +Similar to Coscheduling/ElasticQuota Scheduling, these scheduling plugins have a matching Controller to synchronize the status of the related CRD. The most common way is to deploy these controllers independently of the scheduler. This method will not only bring additional maintenance costs and resource costs, but also if there are more states in the scheduling plugin that need to be synchronized to the CRD Status, the logic in the Controller and the logic in the plugin need to be more closely coordinated. The best way is that the Controller and the scheduling plugin are in the same process. + +We can define a new interface called `ControllerProvider`. 
When the plugin is initialized, `frameworkext.PluginFactoryProxy` will check whether the newly constructed plugin implements `ControllerProvider`, and if so, it will call the `NewControllers` method of the interface to get the instances of Controllers, and save these instances in the `ExtendedHandle`. When the scheduler gets the leader role, it can trigger the `ExtendedHandle` to start these controllers. + +```go +type ControllerProvider interface { + NewControllers() ([]Controller, error) +} + +type Controller interface { + Start() + Name() string +} +``` + + +#### Debug Scoring Result + +If we want to support debug scoring results, the easiest way is to directly modify `Framework.RunScorePlugins` and print the results after scoring. But this goes against the extend principles we laid out earlier. But we can think differently. When `scheduler.Scheduler` executes `scheduleOne`, it obtains an instance of the `framework.Framework` interface from `Profiles` and calls the method `RunScorePlugins`. At the same time, considering that we have maintained the initialization code of scheduler separately, then we can customize the implementation of the `framework.Framework` interface, implement the method `RunScorePlugins` and take over the `Profiles` in `scheduler.Scheduler`. In this way, we can first call the `RunScorePlugins` method of the original `framework.Framework` interface instance in the custom implemented `RunScorePlugins`, and then print the result. + +For the processing of the results, we can simply print it to the log in markdown format. When needed, enable Scoring Result debugging capability through the HTTP interface `/debug/flags/s` like `/debug/flags/v`. The developers also enable the capability via flags `--debug-scores`. + +```bash +# print top 100 score results. 
+$ curl -X POST schedulerIP:10251/debug/flags/s --data '100' +successfully set debugTopNScores to 100 +``` + +The following are the specific scoring results: + + +``` +| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration | +| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| +| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 | +| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 | +| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 | +| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 | +``` + +| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration | +| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| +| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 | +| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 | +| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 | +| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 | + + +#### Custom Hook Extend Points to Support Reservation Scheduling + +If we want to schedule the Reservation CRD object in the form of Pod, we need to solve several problems: + +1. Before calling `PreFilter`, check whether the Pod has a matching Reservation. If there is a matching Reservation, and if the Pod is set with `Pod Affinity/AntiAffinity` or `TopologySpreadConstraints`, we need to modify the Pod to remove these fields. The reason is that when the Reservation CRD object is created, the user generally sets these fields, and expects to find suitable nodes to reserve resources according to these scheduling constraints. Therefore, if the Pod is scheduled with the same fields, it will cause the scheduling to fail. To do this, it cannot be achieved by implementing the `PreFilter` extension point, because the scheduler has already obtained the appropriate Pod to start executing when calling `PreFilter`, and we have lost the opportunity to modify the Pod to affect other plugins. +1. In the `Filter` phase, we also need to update the NodeInfo. If there is a Reservation CRD object on NodeInfo, and the current Pod matches the Reservation CRD object, then the resources applied for by the Reservation CRD object should be returned to NodeInfo. Only in this way can it pass the resource check of the scheduler, including the network port check. + +To solve these problems, we define the `Hook` interface. The plugin can be implemented on demand, and the Pod or NodeInfo can be modified when the PreFilter/Filter is executed. Similar to the custom implementation method `RunScorePlugins` mentioned above, we can customize the implementation methods `RunPreFilterPlugins` and `RunFilterPluginsWithNominatedPods`. 
Before executing the real extension point logic, first execute the `Hook` interface and modify the Pod and NodeInfo. + +If necessary, you can modify the Pod or Node before executing the Score Extension Point by implementing ScorePhaseHook. + +Considering that there may be multiple different Hooks to modify the Pod or NodeInfo requirements, when the Hook is called, the Hook will be called cyclically, and the modification result of the previous Hook and the input of the next Hook will continue to be executed. + +Here are some additional explanations for the scenarios in which these new extension points should be used. If you can complete the scheduling function through the extension points such as Filter/Score provided by the K8s Scheduling Framework without modifying the incoming NodeInfo/Pod and other objects, you do not need to use these new extension points. + +```go +type SchedulingPhaseHook interface { + Name() string +} + +type PreFilterPhaseHook interface { + SchedulingPhaseHook + PreFilterHook(handle ExtendedHandle, state *framework.CycleState, pod *corev1.Pod) (*corev1.Pod, bool) +} + +type FilterPhaseHook interface { + SchedulingPhaseHook + FilterHook(handle ExtendedHandle, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) (*corev1.Pod, *framework.NodeInfo, bool) +} + +type ScorePhaseHook interface { + SchedulingPhaseHook + ScoreHook(handle ExtendedHandle, cycleState *framework.CycleState, pod *corev1.Pod, nodes []*corev1.Node) (*corev1.Pod, []*corev1.Node, bool) +} + +``` + +## Alternatives + +### Use Filter instead of Filter Hook + +We can change the order of Filter plugins to support Reservation Scheduling to update NodeInfo earlier, which can replace Filter Hook. Subsequent implementations can be implemented as an optimization. diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-cpu-orchestration.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..929cba386 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-cpu-orchestration.md @@ -0,0 +1,451 @@ +# 精细化 CPU 编排 + +## 摘要 + +该提案详细定义了 Koordinator QoS 的细粒度 CPU 编排,以及如何兼容 K8s 现有的设计原则和实现, 描述了 koordlet、koord-runtime-proxy 和 koord-scheduler 需要增强的功能。 + +## 动机 + +越来越多的系统利用 CPU 和硬件加速器的组合来支持延迟关键性的执行和高吞吐量并行计算。其中包括电信、科学计算、机器学习、金融服务和数据分析等领域的工作负载。这种混合系统构成高性能环境。 + +为了获得最佳性能,需要实现 CPU 隔离、NUMA-locality 相关的优化。 + +### 目标 + +1. 改进 Koordinator QoS 的 CPU 编排定义。 +1. 明确兼容 kubelet CPU Manager Policy的策略。 +1. 阐明 koordlet 应如何增强 CPU 调度机制。 +1. 为应用和集群管理员提供一套 API 支持复杂的CPU编排场景,例如 CPU 绑定策略、CPU 独占策略、NUMA 拓扑对齐策略和NUMA 拓扑信息等 +1. 提供优化 CPU 编排的 API。 + +### 非目标/未来工作 + +1. 描述 koordlet/koordlet-runtime-proxy 的具体设计细节。 +1. 
描述 CPU 重调度机制的具体设计细节。 + + +## 设计概述 + +![image](/img/cpu-orchestration-seq-uml.svg) + +当 koordlet 启动时,koordlet 从 kubelet 收集 NUMA 拓扑信息,包括 NUMA 拓扑、CPU 拓扑、kubelet CPU 管理策略、kubelet 为 Guaranteed Pod 分配的 CPU 等,并更新到节点资源拓扑 CRD。当延迟敏感的应用程序扩容时,可以为新Pod设置 Koordinator QoS LSE/LSR、CPU绑定策略和 CPU独占策略,要求 koord-scheduler 分配最适合的 CPU 以获得最佳性能。当 koord-scheduler 调度 Pod 时,koord-scheduler 会过滤满足 NUMA 拓扑对齐策略的节点,并通过评分选择最佳节点,在 Reserve 阶段分配 CPU,并在 PreBinding 时将结果记录到 Pod Annotation。koordlet 通过 Hook kubelet CRI 请求,替换通过 koord-scheduler 调度的 CPU 配置参数到运行时,例如配置 cgroup。 + +## 用户故事 + +### 故事 1 + +兼容 kubelet 现有的 CPU 管理策略。CPU 管理器 `static` 策略允许具有某些资源特征的 Pod 在节点中被授予更高的 CPU 亲和性和排他性。如果启用 `static` 策略,集群管理员必须配置 kubelet 保留一些 CPU。 `static` 策略有一些选项,如果指定了 full-pcpus-only(beta, 默认可见) 策略选项,则 `static` 策略将始终分配完整的物理内核。如果指定了 distribute-cpus-across-numa(alpha, 默认不可见) 选项,在需要多个 NUMA 节点来满足分配的情况下, `static` 策略将在 NUMA 节点之间平均分配 CPU。 + +### 故事 2 + +同样,应该兼容社区中现有的 K8s Guaranteed Pod 的语义。静态策略分配给 K8s Guaranteed Pod 的 CPU 不会共享给默认的 BestEffort Pod,所以相当于 LSE。但是当节点的负载比较低时,LSR Pod 分配的 CPU 应该与 BestEffort 的工作负载共享,以获得经济效益。 + +### 故事 3 + +拓扑管理器是一个 kubelet 组件,旨在协调负责这些优化的组件集。引入拓扑管理器后,在工作节点具有不同的 NUMA 拓扑,并且该拓扑中具有不同资源量的集群中启动 Pod 的问题成为现实。Pod 可以调度在资源总量足够的节点上,但是资源分布不能满足合适的拓扑策略。 + +### 故事 4 + +调度器支持协调编排多个延迟敏感的应用程序。例如,支持延迟敏感的应用程序多个实例在 CPU 维度上互斥,并且延迟敏感的应用和一般应用在 CPU 维度亲和。这样可以降低成本并保证运行质量。 + +### 故事 5 + +在基于 NUMA 拓扑分配 CPU 时,用户希望有不同的分配策略。例如 bin-packing 优先,或者分配最空闲的 NUMA 节点。 + +### 故事 6 + +随着应用程序的伸缩或滚动,最适合的可分配空间会逐渐变得碎片化,这会导致一些策略的分配效果不好,影响应用程序的运行时效果。 + +## 设计细节 + +### CPU 编排基本原则 + +1. 仅支持 Pod 维度的 CPU 分配机制。 +1. Koordinator 将机器上的 CPU 分为 `CPU Shared Pool`,`statically exclusive CPUs` 和 `BE CPU Shared Pool`。 + 1. `CPU Shared Pool` 是一组共享 CPU 池,K8s Burstable 和 Koordinator LS Pod 中的任何容器都可以在其上运行。K8s Guaranteed `fractional CPU requests` 的 Pod 也可以运行在 `CPU Shared Pool` 中。`CPU Shared Pool` 包含节点中所有未分配的 CPU,但不包括由 K8s Guaranteed、LSE 和 LSR Pod 分配的 CPU。如果 kubelet 保留 CPU,则 `CPU Shared Pool` 包括保留的 CPU。 + 1. `statically exclusive CPUs` 是指分配给 K8s Guaranteed、Koordinator LSE/LSR Pods 使用的一组独占 CPU。当 K8s Guaranteed、LSE 和 LSR Pods 谁申请 CPU 时,koord-scheduler 将从 `CPU Shared Pool` 中分配。 + 1. `BE CPU Shared pool` 是一组 `K8s BestEffort` 和 `Koordinator BE` 的 Pod 都可运行的 CPU 池。`BE CPU Shared pool` 包含节点中除 K8s Guaranteed 和 Koordinator LSE Pod 分配的之外的所有 CPU。 + +### Koordinator QoS CPU 编排原则 + +1. LSE/LSR Pod 的 Request 和 Limit 必须相等,CPU 值必须是 1000 的整数倍。 +1. LSE Pod 分配的 CPU 是完全独占的,不得共享。如果节点是超线程架构,只保证逻辑核心维度是隔离的,但是可以通过 `CPUBindPolicyFullPCPUs` 策略获得更好的隔离。 +1. LSR Pod 分配的 CPU 只能与 BE Pod 共享。 +1. LS Pod 绑定了与 LSE/LSR Pod 独占之外的共享 CPU 池。 +1. BE Pod 绑定使用节点中除 LSE Pod 独占之外的所有 CPU 。 +1. 如果 kubelet 的 CPU 管理器策略为 static 策略,则已经运行的 K8s Guaranteed Pods 等价于 Koordinator LSR。 +1. 如果 kubelet 的 CPU 管理器策略为 none 策略,则已经运行的 K8s Guaranteed Pods 等价于 Koordinator LS。 +1. 新创建但未指定 Koordinator QoS 的 K8s Guaranteed Pod 等价于 Koordinator LS。 + +![img](/img/qos-cpu-orchestration.png) + +### kubelet CPU Manager Policy 兼容原则 + +1. 如果 kubelet 设置 CPU 管理器策略选项 `full-pcpus-only=true/distribute-cpus-across-numa=true`,并且节点中没有 Koordinator 定义的新 CPU 绑定策略,则遵循 kubelet 定义的这些参数的定义。 +1. 如果 kubelet 设置了拓扑管理器策略,并且节点中没有 Koordinator 定义的新的 NUMA Topology Alignment 策略,则遵循 kubelet 定义的这些参数的定义。 + +### 接管 kubelet CPU 管理策略 + +kubelet 预留的 CPU 主要服务于 K8s BestEffort 和 Burstable Pods。但 Koordinator 不会遵守该策略。K8s Burstable Pods 应该使用 `CPU Shared Pool`,而 K8s BestEffort Pods 应该使用 `BE CPU Shared Pool`。Koordinator LSE 和 LSR Pod 不会从被 kubelet 预留的 CPU 中分配。 + + +1. 对于 K8s Burstable 和 Koordinator LS Pod: + 1. 
当 koordlet 启动时,计算 `CPU Shared Pool` 并将共享池应用到节点中的所有 Burstable 和 LS Pod,即更新它们的 cpu cgroups, 设置 cpuset。在创建或销毁 LSE/LSR Pod 时执行相同的逻辑。 + 1. koordlet 会忽略 kubelet 预留的 CPU,将其替换为 Koordinator 定义的 `CPU Shared Pool`。 +1. 对于 K8s BestEffort 和 Koordinator BE Pod: + 1. 如果 kubelet 预留了 CPU,BestEffort Pod 会首先使用预留的 CPU。 + 1. koordlet 可以使用节点中的所有 CPU,但不包括由具有整数 CPU 的 K8s Guaranteed 和 Koordinator LSE Pod 分配的 CPU。这意味着如果 koordlet 启用 CPU Suppress 功能,则应遵循约束以保证不会影响 LSE Pod。同样,如果 kubelet 启用了静态 CPU 管理器策略,则也应排除 K8s Guaranteed Pod。 +1. 对于 K8s Guaranteed Pod: + 1. 如果 Pod 的 annotations 中有 koord-scheduler 更新的 `scheduling.koordinator.sh/resource-status`,在 Sandbox/Container 创建阶段,则会替换 kubelet CRI 请求中的 CPUSet。 + 1. kubelet 有时会调用 CRI 中定义的 Update 方法来更新容器 cgroup 以设置新的 CPU,因此 koordlet 和 koord-runtime-proxy 需要 Hook 该方法。 +1. 自动调整 `CPU Shared Pool` 大小 + 1. koordlet 会根据 Pod 创建/销毁等变化自动调整 `CPU Shared Pool` 的大小。如果 `CPU Shared Pool` 发生变化,koordlet 应该更新所有使用共享池的 LS/K8s Burstable Pod 的 cgroups。 + 1. 如果 Pod 的 annotations`scheduling.koordinator.sh/resource-status` 中指定了对应的 `CPU Shared Pool`,koordlet 在配置 cgroup 时只需要绑定对应共享池的 CPU 即可。 + +接管逻辑要求 koord-runtime-proxy 添加新的扩展点并且 koordlet 实现新的运行时插件的 Hook 。当没有安装 koord-runtime-proxy 时,这些接管逻辑也将能够实现。 + +## CPU 编排 API + +### 应用程序 CPU 编排 API + +#### Resource Spec + +Annotation `scheduling.koordinator.sh/resource-spec` 是 Koordinator 定义的资源分配 API。用户通过设置 annotation 来指定所需的 CPU 编排策略。未来,我们还可以根据需要扩展和添加需要支持的资源类型。Annotation Value 对应的定义如下: + +```go +// ResourceSpec describes extra attributes of the compute resource requirements. +type ResourceSpec struct { + PreferredCPUBindPolicy CPUBindPolicy `json:"preferredCPUBindPolicy,omitempty"` + PreferredCPUExclusivePolicy CPUExclusivePolicy `json:"preferredCPUExclusivePolicy,omitempty"` +} + +type CPUBindPolicy string + +const ( + // CPUBindPolicyDefault performs the default bind policy that specified in koord-scheduler configuration + CPUBindPolicyDefault CPUBindPolicy = "Default" + // CPUBindPolicyFullPCPUs favor cpuset allocation that pack in few physical cores + CPUBindPolicyFullPCPUs CPUBindPolicy = "FullPCPUs" + // CPUBindPolicySpreadByPCPUs favor cpuset allocation that evenly allocate logical cpus across physical cores + CPUBindPolicySpreadByPCPUs CPUBindPolicy = "SpreadByPCPUs" + // CPUBindPolicyConstrainedBurst constrains the CPU Shared Pool range of the Burstable Pod + CPUBindPolicyConstrainedBurst CPUBindPolicy = "ConstrainedBurst" +) + +type CPUExclusivePolicy string + +const ( + // CPUExclusivePolicyDefault performs the default exclusive policy that specified in koord-scheduler configuration + CPUExclusivePolicyDefault CPUExclusivePolicy = "Default" + // CPUExclusivePolicyPCPULevel represents mutual exclusion in the physical core dimension + CPUExclusivePolicyPCPULevel CPUExclusivePolicy = "PCPULevel" + // CPUExclusivePolicyNUMANodeLevel indicates mutual exclusion in the NUMA topology dimension + CPUExclusivePolicyNUMANodeLevel CPUExclusivePolicy = "NUMANodeLevel" +) +``` + +- `CPUBindPolicy` 定义CPU绑定策略。具体取值定义如下: + - `CPUBindPolicyDefault` 或空值不执行任何绑定策略。它完全由调度器插件配置决定。 + - `CPUBindPolicyFullPCPUs` 是一种 bin-packing 策略,类似于 kubelet 定义的 `full-pcpus-only=true` 选项,用于分配完整的物理内核。但是,如果节点中剩余的逻辑 CPU 数量足够,但完整的物理核心数量不足,则继续分配。该策略可以有效避免扰邻(noisy neighbor)问题。 + - `CPUBindPolicySpreadByPCPUs` 是一种打散(Spread)策略。如果节点启用了超线程,当采用该策略时,调度器将在物理内核之间均匀的分配逻辑 CPU。例如,当前节点有 8 个物理内核和 16 个逻辑 CPU。当一个 Pod 需要 8 个逻辑 CPU 并且采用 `CPUBindPolicySpreadByPCPUs` 策略时,调度器会从每个物理核中分配一个逻辑 CPU。该策略主要用于一些具有多种不同峰谷特性的延迟敏感型应用程序。它不仅可以让应用程序在特定时间充分使用 CPU,而且不会被同一物理内核上的应用程序所干扰。所以在使用这个策略时可能会出现扰邻(noisy neighbor)问题。 + - 
`CPUBindPolicyConstrainedBurst` 主要帮助 K8s Burstable/Koordinator LS Pod 获得更好性能的特殊策略。使用该策略时,koord-scheduler 会根据 Pod 限制过滤掉具有合适 CPU 共享池的 NUMA 节点的节点。调度成功后,调度器会更新 Pod 中的 `scheduling.koordinator.sh/resource-status`,声明要绑定的 `CPU Shared Pool`。koordlet 根据 `CPU Shared Pool` 绑定对应 NUMA Node 的 `CPU Shared Pool`。 + - 如果 `NodeResourceTopology` 中的 `kubelet.koartiator.sh/cpu-manager-policy` 选项为 `full-pcpus-only=true`,或者 Node 中的 `node.koordator.sh/cpubind-policy` 的值为 `FullPCPUsOnly`,则 koord-scheduler 会检查实例的 CPU 请求数是否满足 SMT 对齐要求,以避免调度后被 kubelet 拒绝。如果 Pod 使用 `CPUBindPolicySpreadByPCPUs` 策略或映射到物理核心数的逻辑 CPU 数量不是整数,koord-scheduler 将避免调度此类节点。 +- `CPUExclusivePolicy` 定义了 CPU 独占策略,它可以帮助解决扰邻(noisy neighbor)问题。具体值定义如下 + - `CPUExclusivePolicyDefault` 或空值不执行任何隔离策略。它完全由调度器插件配置决定。 + - `CPUExclusivePolicyPCPULevel` 在分配逻辑CPU时,尽量避开已经被同一个独占策略申请的物理核。它是对 `CPUBindPolicySpreadByPCPUs` 策略的补充。 + - `CPUExclusivePolicyNUMANodeLevel` 在分配逻辑 CPU 时,尽量避免 NUMA 节点已经被相同的独占策略申请。如果没有满足策略的 NUMA 节点,则降级为 `CPUExclusivePolicyPCPULevel` 策略。 + +对于ARM架构,`CPUBindPolicy` 只支持 `CPUBindPolicyFullPCPUs` ,`CPUExclusivePolicy` 只支持 `CPUExclusivePolicyNUMANodeLevel` 。 + +#### Resource Status + +Annotation `scheduling.koordinator.sh/resource-status` 表示资源分配结果。 koord-scheduler 在绑定 Pod 到节点之前修改 annotation。 koordlet 使用结果来配置 cgroup。 + +Annotation value 对应的定义如下: + +```go +type ResourceStatus struct { + CPUSet string `json:"cpuset,omitempty"` + CPUSharedPools []CPUSharedPool `json:"cpuSharedPools,omitempty"` +} +``` + +- `CPUSet` 表示分配的 CPU。当 LSE/LSR Pod 请求时,koord-scheduler 将更新该字段。它是 Linux CPU 列表格式的字符串。更多详细信息,[请参阅文档](http://man7.org/linux/man-pages/man7/cpuset.7.html#FORMATS) 。 +- `CPUSharedPools` 表示 LS Pod 使用的所需 CPU 共享池。如果节点的标签 `node.koordinator.sh/numa-topology-alignment-policy` 带有 `Restricted/SingleNUMANode`,koord-scheduler 将为 LS Pod 找到最适合的 NUMA 节点,并更新需要 koordlet 使用指定 `CPU Shared Pool` 的字段。需要注意的是,调度器不会更新 `CPU Shared Pool` 中的 CPUSet 字段,koordlet 根据 `CPU Shared Pool` 中的 `Socket` 和 `Node` 字段绑定对应 NUMA 节点的 `CPU Shared Pool`。 + +#### 例子 + +具体例子: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + scheduling.koordinator.sh/resource-spec: |- + { + "preferredCPUBindPolicy": "SpreadByPCPUs", + "preferredCPUExclusivePolicy": "PCPULevel" + } + scheduling.koordinator.sh/resource-status: |- + { + "cpuset": "0-3" + } + name: test-pod + namespace: default +spec: + ... 
+``` + +### 节点 CPU 编排 API + +从集群管理员的角度来看,需要提供一些 API 来控制节点的 CPU 编排行为。 + +#### CPU 绑定策略 + +标签 `node.koordinator.sh/cpu-bind-policy` 限制了调度时如何绑定 CPU、逻辑 CPU。 + +具体的取值定义: +- `None` 或空值不执行任何策略 +- `FullPCPUsOnly` 要求调度器必须分配完整的物理内核。等效于 kubelet CPU 管理器策略选项 `full-pcpus-only=true`。 +- `SpreadByPCPUs` 要求调度器必须按照物理核维度均匀的分配CPU。 + +如果 Node 的 Label 中没有 `node.koordinator.sh/cpu-bind-policy`,则按照 Pod 或 koord-scheduler 配置的策略执行。 + +#### NUMA 分配策略 + +标签 `node.koordinator.sh/numa-allocate-strategy` 表示在调度时如何选择满意的 NUMA 节点。下面是具体的值定义: +- `MostAllocated` 表示从可用资源最少的 NUMA 节点分配。 +- `LeastAllocated` 表示从可用资源最多的 NUMA 节点分配。 +- `DistributeEvenly` 表示在 NUMA 节点上平均分配 CPU。 + +如果集群管理员没有在Node上设置标签 `node.koordinator.sh/numa-allocate-strategy`,但是 `NodeResourceTopology` 中的 `kubelet.koordinator.sh/cpu-manager-policy` 有选项 `distribute-cpus-across-numa=true`,然后按照 `distribute-cpus-across-numa` 的定义分配。 + +如果节点的标签中没有 `node.koordinator.sh/numa-allocate-strategy` 并且 `NodeResourceTopology` 中没有带有 `Distribute-cpus-across-numa` 选项的 `kubelet.koordinator.sh/cpu-manager-policy`,它将根据 koord-scheduler 配置的策略执行。 + +如果同时定义了 `node.koordinator.sh/numa-allocate-strategy` 和 `kubelet.koordinator.sh/cpu-manager-policy`,则首先使用 `node.koordinator.sh/numa-allocate-strategy`。 + +#### NUMA 拓扑对齐策略 + +标签 `node.koordinator.sh/numa-topology-alignment-policy` 表示如何根据 NUMA 拓扑对齐资源分配。策略语义遵循 K8s 社区。相当于 `NodeResourceTopology` 中的 `TopologyPolicies` 字段,拓扑策略 `SingleNUMANodePodLevel` 和 `SingleNUMANodeContainerLevel` 映射到 `SingleNUMANode` 策略。 + +- `None` 是默认策略,不执行任何拓扑对齐。 +- `BestEffort` 表示优先选择拓扑对齐的 NUMA Node,如果没有,则继续为 Pods 分配资源。 +- `Restricted` 表示每个 Pod 在 NUMA 节点上请求的资源是拓扑对齐的,如果不是,koord-scheduler 会在调度时跳过该节点。 +- `SingleNUMANode` 表示一个 Pod 请求的所有资源都必须在同一个 NUMA 节点上,如果不是,koord-scheduler 调度时会跳过该节点。 + +如果节点的 Label 中没有 `node.koordinator.sh/numa-topology-alignment-policy`,并且 `NodeResourceTopology中的TopologyPolicies=None`,则按照 koord-scheduler 配置的策略执行。 + +如果同时定义了 Node 中的 `node.koordinator.sh/numa-topology-alignment-policy` 和 `NodeResourceTopology` 中的 `TopologyPolicies=None`,则首先使用 `node.koordinator.sh/numa-topology-alignment-policy`。 + +#### 例子 + +具体例子: + +```yaml +apiVersion: v1 +kind: Node +metadata: + labels: + node.koordinator.sh/cpu-bind-policy: "FullPCPUsOnly" + node.koordinator.sh/numa-topology-alignment-policy: "BestEffort" + node.koordinator.sh/numa-allocate-strategy: "MostAllocated" + name: node-0 +spec: + ... 
+``` + +### 节点资源拓扑 CRD + +需要上报的节点资源信息主要包括以下几类: + +- NUMA Topology,包括资源信息、CPU 信息如逻辑 CPU ID、物理 Core ID、NUMA Socket ID 和 NUMA Node ID 等。 +- kubelet 配置的拓扑管理器范围和策略。 +- kubelet 配置的 CPU 管理器策略和选项。 +- 由 kubelet 或 koord-scheduler 分配的 Pod 绑定 CPU,包括 K8s Guaranteed Pod、Koordinator LSE/LSR Pod,但 LS/BE 除外。 +- kubelet 定义的 `CPU Shared Pool`。 + +以上信息可以指导 koord-scheduler 更好地兼容 kubelet 的 CPU 管理逻辑,做出更合适的调度决策,帮助用户快速排查问题。 + +#### CRD 字段定义 + +我们使用 [NodeResourceTopology](https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/master/pkg/apis/topology/v1alpha1/types.go) CRD 来描述 NUMA 拓扑。社区定义的 NodeResourceTopology CRD 主要用于以下考虑: + +- NodeResourceTopology 已经包含了基本的 NUMA 拓扑信息和 kubelet TopologyManager 的 Scope 和 Policies 信息。我们可以重用现有的代码。 +- 跟上社区的发展,影响社区做出更多的改变。 + +#### 兼容 + +koordlet 周期性的创建或者更新 NodeResourceTopology 实例。NodeResourceTopology 实例名与节点名保持一致。并通过添加标签 `app.kubernetes.io/managed-by=Koordinator` 描述节点由 Koordinator 管理。 + +#### 扩展 + +目前 `NodeResourceTopology` 缺少一些信息,暂时以 annotation 或 label 的形式写在 `NodeResourceTopology` 中: + +- Annotation `kubelet.koordinator.sh/cpu-manger-policy` 描述了 kubelet CPU 管理器策略和选项。方案定义如下 + +```go +const ( + FullPCPUsOnlyOption string = "full-pcpus-only" + DistributeCPUsAcrossNUMAOption string = "distribute-cpus-across-numa" +) + +type kubeletCPUManagerPolicy struct { + Policy string `json:"policy,omitempty"` + Options map[string]string `json:"options,omitempty"` +} + +``` + +- Annotation `node.koordinator.sh/cpu-topology` 描述了详细的 CPU 拓扑。精细化的管理机制需要更详细的 CPU 拓扑信息。该方案定义如下: + +```go +type CPUTopology struct { + Detail []CPUInfo `json:"detail,omitempty"` +} + +type CPUInfo struct { + ID int32 `json:"id"` + Core int32 `json:"core"` + Socket int32 `json:"socket"` + Node int32 `json:"node"` +} +``` + +- Annotation `node.koordinator.sh/pod-cpu-allocs` 描述了 Koordinator LSE/LSR 和 K8s Guaranteed Pods 分配的 CPU。Annotation Value 定义如下: + +```go +type PodCPUAlloc struct { + Namespace string `json:"namespace,omitempty"` + Name string `json:"name,omitempty"` + UID types.UID `json:"uid,omitempty"` + CPUSet string `json:"cpuset,omitempty"` + ManagedByKubelet bool `json:"managedByKubelet,omitempty"` +} + +type PodCPUAllocs []PodCPUAlloc +``` + +- Annotation `node.koordinator.sh/cpu-shared-pools` 描述了 Koordinator 定义的 CPU 共享池。共享池主要由 Koordinator LS Pods 或 K8s Burstable Pods 使用。该方案定义如下: + +```go +type NUMACPUSharedPools []CPUSharedPool + +type CPUSharedPool struct { + Socket int32 `json:"socket"` + Node int32 `json:"node"` + CPUSet string `json:"cpuset,omitempty"` +} +``` +`CPUSet` 字段是 Linux CPU 列表格式的字符串。更多详细信息,[请参阅文档](http://man7.org/linux/man-pages/man7/cpuset.7.html#FORMATS) 。 + + +#### 创建/更新 NodeResourceTopology + +- koordlet 负责创建/更新 `NodeResourceTopology` +- 建议 koordlet 通过解析 CPU 状态检查点文件来获取现有 K8s Guaranteed Pod 的 CPU 分配信息。或者通过 kubelet 提供的 CRI 接口和 gRPC 获取这些信息。 +- 当 koord-scheduler 分配 Pod 的 CPU 时,替换 kubelet 状态检查点文件中的 CPU。 +- 建议 koordlet 从 [kubeletConfiguration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/) 获取 CPU 管理器策略和选项。 + +#### 例子 + +完整的 `NodeResourceTopology` 示例: + +```yaml +apiVersion: topology.node.k8s.io/v1alpha1 +kind: NodeResourceTopology +metadata: + annotations: + kubelet.koordinator.sh/cpu-manager-policy: |- + { + "policy": "static", + "options": { + "full-pcpus-only": "true", + "distribute-cpus-across-numa": "true" + } + } + node.koordinator.sh/cpu-topology: |- + { + "detail": [ + { + "id": 0, + "core": 0, + "socket": 0, + "node": 0 + }, + { + "id": 1, + "core": 1, + "socket": 1, + "node": 1 + } + ] + } + node.koordinator.sh/cpu-shared-pools: |- + [ + { + "socket": 0, + 
"node": 0, + "cpuset": "0-3" + } + ] + node.koordinator.sh/pod-cpu-allocs: |- + [ + { + "namespace": "default", + "name": "static-guaranteed-pod", + "uid": "32b14702-2efe-4be9-a9da-f3b779175846", + "cpu": "4-8", + "managedByKubelet": "true" + } + ] + labels: + app.kubernetes.io/managed-by: Koordinator + name: node1 +topologyPolicies: ["SingleNUMANodePodLevel"] +zones: + - name: node-0 + type: Node + resources: + - name: cpu + capacity: 20 + allocatable: 15 + available: 10 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 + - name: node-1 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic2 + capacity: 6 + allocatable: 6 + available: 6 + - name: node-2 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 + - name: node-3 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-device-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-device-scheduling.md new file mode 100644 index 000000000..e27e8a951 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/fine-grained-device-scheduling.md @@ -0,0 +1,408 @@ +# Fine-grained device scheduling + +## Summary + +This proposal provides a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA, defines a set of APIs to describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names to flexibly support users to apply at a finer granularity GPU resources. This mechanism is the basis for subsequent other GPU scheduling capabilities such as GPU Share, GPU Overcommitment, etc. + +## Motivation + +GPU devices have very strong computing power, but are expensive. How to make better use of GPU equipment, give full play to the value of GPU and reduce costs is a problem that needs to be solved. In the existing GPU allocation mechanism of the K8s community, the GPU is allocated by the kubelet, and it is a complete device allocation. This method is simple and reliable, but similar to the CPU and memory, the GPU will also be wasted. Therefore, some users expect to use only a portion of the GPU's resources and share the rest with other workloads to save costs. Moreover, GPU has particularities. For example, the NVLink and oversold scenarios supported by NVIDIA GPU mentioned below both require a central decision through the scheduler to obtain globally optimal allocation results. + +![image](/img/nvlink.jpg) + +From the picture, we can find that although the node has 8 GPU instances whose model is A100/V100, the data transmission speed between GPU instances is different. When a Pod requires multiple GPU instances, we can assign the Pod the GPU instances with the maximum data transfer speed combined relationship. In addition, when we want the GPU instances among a group of Pods to have the maximum data transfer speed combined relationship, the scheduler should batch allocate the best GPU instances to these Pods and assign them to the same node. + +### Goals + +1. Definition Device CRD and the Resource API. +1. Provides a reporter component in koordlet to report Device information and resource capacities. +1. 
Provides a scheduler plugin to support users to apply at a finer granularity GPU resources. +1. Provider a new runtime hook plugin in koordlet to support update the environments of containers with GPUs that be allocated by scheduler. + +### Non-goals/Future work + +1. Define flexible allocation strategies, such as implementing BinPacking or Spread according to GPU resources + +## Proposal + +### API + +#### Device resource dimensions + +Due to GPU is complicated, we will introduce GPU first. As we all know there is compute and GPU Memory capability for the GPU device. Generally user apply GPU like "I want 1/2/4/8 GPUs", but if node support GPU level isolation mechanism, user may apply GPU like "I want 0.5/0.25 GPU resources". Moreover, user may set different compute capability and GPU memory capability for best resource utilization, so the user want apply GPU like "I want X percent of "compute capability and Y percent of memory capability". + +We abstract GPU resources into different dimensions: + +- `kubernetes.io/gpu-core` represents the computing capacity of the GPU. Similar to K8s MilliCPU, we abstract the total computing power of GPU into one hundred, and users can apply for the corresponding amount of GPU computing power according to their needs. +- `kubernetes.io/gpu-memory` represents the memory capacity of the GPU in bytes. +- `kubernetes.io/gpu-memory-ratio` represents the percentage of the GPU's memory. + +Assuming that node A has 4 GPU instances, and the total memory of each instance is 8GB, when device reporter reports GPU capacity information to `Node.Status.Allocatable`, it no longer reports nvidia.com/gpu=4, but reports the following information: + +```yaml +status: + capacity: + kubernetes.io/gpu-core: 400 + kubernetes.io/gpu-memory: "32GB" + kubernetes.io/gpu-memory-ratio: 400 + allocatable: + kubernetes.io/gpu-core: 400 + kubernetes.io/gpu-memory: "32GB" + kubernetes.io/gpu-memory-ratio: 400 +``` + +For the convenience of users, an independent resource name `kubernetes.io/gpu` is defined. For example, when a user wants to use half of the computing resources and memory resources of a GPU instance, the user can directly declare `kubernetes.io/gpu: 50`, and the scheduler will convert it to `kubernetes.io/gpu-core: 50, kubernetes.io/gpu-memory-ratio: 50` + +For other devices like RDMA and FPGA, the node has 1 RDMA and 1 FGPA, will report the following information: + +```yaml +status: + capacity: + kubernetes.io/rdma: 100 + kubernetes.io/fpga: 100 + allocatable: + kubernetes.io/rdma: 100 + kubernetes.io/fpga: 100 +``` + +Why do we need `kubernetes.io/gpu-memory-ratio` and `kubernetes.io/gpu-memory` ? +When user apply 0.5/0.25 GPU, the user don't know the exact memory total bytes per GPU, only wants to use +half or quarter percentage of memory, so user can request the GPU memory with `kubernetes.io/gpu-memory-ratio`. +When scheduler assigned Pod on concrete node, scheduler will translate the `kubernetes.io/gpu-memory-ratio` to `kubernetes.io/gpu-memory` by the formulas: ***allocatedMemory = totalMemoryOf(GPU) * `kubernetes.io/gpu-memory-ratio`***, so that the GPU isolation can work. + +During the scheduling filter phase, the scheduler will do special processing for `kubernetes.io/gpu-memory` and `kubernetes.io/gpu-memory-ratio`. When a Pod specifies `kubernetes.io/gpu-memory-ratio`, the scheduler checks each GPU instance on each node for unallocated or remaining resources to ensure that the remaining memory on each GPU instance meets the ratio requirement. 
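+
+The sketch below (not the actual plugin code; the helper names are made up for illustration) shows the conversion and the per-instance check described above, where `totalMemory` and `freeMemory` stand for one GPU instance's total and unallocated memory:
+
+```go
+// memoryFromRatio converts a kubernetes.io/gpu-memory-ratio request into bytes
+// for one GPU instance: allocatedMemory = totalMemoryOf(GPU) * ratio / 100.
+func memoryFromRatio(totalMemory resource.Quantity, ratio int64) int64 {
+    return totalMemory.Value() * ratio / 100
+}
+
+// ratioFits reports whether one GPU instance still has enough free memory for
+// the requested ratio; the Filter phase applies such a check per GPU instance.
+func ratioFits(freeMemory, totalMemory resource.Quantity, ratio int64) bool {
+    return freeMemory.Value() >= memoryFromRatio(totalMemory, ratio)
+}
+```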
+ +If the user knows exactly or can roughly estimate the specific memory consumption of the workload, he can apply for GPU memory through `kubernetes.io/gpu-memory`. All details can be seen below. + +Besides, when dimension's value > 100, means Pod need multi-devices. now only allow the value can be divided by 100. + +#### User apply device resources scenarios + +##### Compatible with `nvidia.com/gpu` + +```yaml +resources: + requests: + nvidia.com/gpu: "2" + cpu: "4" + memory: "8Gi" +``` + +The scheduler translates the `nvida.com/gpu: 2` to the following spec: + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory-ratio: "200" + kubernetes.io/gpu-memory: "16Gi" # assume 8G memory in bytes per GPU + cpu: "4" + memory: "8Gi" +``` + +##### Apply whole resources of GPU or part resources of GPU + +```yaml +resources: + requests: + kubernetes.io/gpu: "50" + cpu: "4" + memory: "8Gi" +``` + +The scheduler translates the `kubernetes.io/gpu: "50"` to the following spec: + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "50" + kubernetes.io/gpu-memory-ratio: "50" + kubernetes.io/gpu-memory: "4Gi" # assume 8G memory in bytes for the GPU + cpu: "4" + memory: "8Gi" +``` + +##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory-ratio` separately + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "50" + kubernetes.io/gpu-memory-ratio: "60" + cpu: "4" + memory: "8Gi" +``` + +##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory` separately + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "60" + kubernetes.io/gpu-memory: "4Gi" + cpu: "4" + memory: "8Gi" +``` + +##### Apply RDMA + +```yaml +resources: + requests: + kubernetes.io/rdma: "100" + cpu: "4" + memory: "8Gi" +``` + +### Implementation Details + +#### Scheduling + +1. Abstract new data structure to describe resources and healthy status per device on the node. +2. Implements the Filter/Reserve/PreBind extenstion points. +3. Automatically recognize different kind devices. When a new device added, we don't need modify any code + +##### DeviceAllocation + +In the PreBind stage, the scheduler will update the device (including GPU) allocation results, including the device's Minor and resource allocation information, to the Pod in the form of annotations. + +```go +/* +{ + "gpu": [ + { + "minor": 0, + "resouurces": { + "kubernetes.io/gpu-core": 100, + "kubernetes.io/gpu-mem-ratio": 100, + "kubernetes.io/gpu-mem": "16Gi" + } + }, + { + "minor": 1, + "resouurces": { + "kubernetes.io/gpu-core": 100, + "kubernetes.io/gpu-mem-ratio": 100, + "kubernetes.io/gpu-mem": "16Gi" + } + } + ] +} +*/ +type DeviceAllocation struct { + Minor int32 + Resources map[string]resource.Quantity +} + +type DeviceAllocations map[DeviceType][]*DeviceAllocation +``` + +##### NodeDevicePlugin + +```go +var ( + _ framework.PreFilterPlugin = &NodeDevicePlugin{} + _ framework.FilterPlugin = &NodeDevicePlugin{} + _ framework.ReservePlugin = &NodeDevicePlugin{} + _ framework.PreBindPlugin = &NodeDevicePlugin{} +) + +type NodeDevicePlugin struct { + frameworkHandler framework.Handle + nodeDeviceCache *NodeDeviceCache +} + +type NodeDeviceCache struct { + lock sync.Mutex + nodeDevices map[string]*nodeDevice +} + +type nodeDevice struct { + lock sync.Mutex + DeviceTotal map[DeviceType]deviceResource + DeviceFree map[DeviceType]deviceResource + DeviceUsed map[DeviceType]deviceResource + AllocateSet map[DeviceType]*corev1.PodList +} + +// We use `deviceResource` to present resources per device. 
+// "0": {kubernetes.io/gpu-core:100, kubernetes.io/gpu-memory-ratio:100, kubernetes.io/gpu-memory: 16GB} +// "1": {kubernetes.io/gpu-core:100, kubernetes.io/gpu-memory-ratio:100, kubernetes.io/gpu-memory: 16GB} +type deviceResources map[int]corev1.ResourceList + +``` + +We will register node and device event handler to maintain device account. + +- In Filter, we will make-up each device request by a node(the gpu-memory example), and try compare each device free resource and Pod device request. +- In Reserve/Unreserve, we will update nodeDeviceCache's used/free resource and allocateSet. Now device selection rule just based on device minor id order. +- In PreBind, we will write DeviceAllocations to Pod's annotation. +- In Init stage, we should list all Node/Device/Pods to recover device accounts. + +#### Device Reporter + +Implements a new component called `Device Reporter` in koordlet to create or update `Device` CRD instance with the resources information and healthy status per device including GPU, RDMA and FPGA, etc. This version we only support GPU. It will execution `nccl` commands to get each minor resource just like k8s-gpu-device-plugins. We will apply community health check logic. + +#### Device CRD Scheme definition +```go +type DeviceType string + +const ( + GPU DeviceType = "gpu" + FPGA DeviceType = "fpga" + RDMA DeviceType = "rdma" +) + +type DeviceSpec struct { + Devices []DeviceInfo `json:"devices"` +} + +type DeviceInfo struct { + // UUID represents the UUID of device + UUID string `json:"id,omitempty"` + // Minor represents the Minor number of Device, starting from 0 + Minor int32 `json:"minor,omitempty"` + // Type represents the type of device + Type DeviceType `json:"deviceType,omitempty"` + // Health indicates whether the device is normal + Health bool `json:"health,omitempty"` + // Resources represents the total capacity of various resources of the device + Resources map[string]resource.Quantity `json:"resource,omitempty"` +} + +type DeviceStatus struct {} + +type Device struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec DeviceSpec `json:"spec,omitempty"` + Status DeviceStatus `json:"status,omitempty"` +} + +type DeviceList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata,omitempty"` + + Items []Device `json:"items"` +} +``` + +##### Compatible + +Considering that some users already have many existing GPU Pods in their clusters, it is necessary to ensure that Koordinator GPU Scheduling does not repeatedly allocate the GPU devices held by these GPU Pods. Therefore, koord-scheduler needs to obtain the GPU devices's information held by these existing Pods. These GPU devices are allocated by the kubelet and recorded in the local file `/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint`, so the device reporter will parse the file to obtain the GPU Device ID assigned to each Pod. When parsing, it needs to exclude the Pod that allocates GPU through koord-scheduler, and finally update it to Device CRD in the form of annotation. 
The corresponding annotation key is `node.koordinator.sh/devices-checkpoints`, and the annotation value is defined as follows: + +```go +type PodDevicesEntry struct { + PodUID string `json:"podUID,omitempty"` + ContainerName string `json:"containerName,omitempty"` + ResourceName string `json:"resourceName,omitempty"` + DeviceIDs []string `json:"deviceIDs,omitempty"` + AllocResp []byte `json:"allocResp,omitempty"` +} + +type PodDevicesEntries []PodDevicesEntry +``` + +#### CRD Example +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Device +metadata: + name: node-1 + annotations: + node.koordinator.sh/gpu-checkpoints: |- + [ + { + "podUID": "fa8983dc-bb76-4eeb-8dcc-556fbd44d7ce", + "containerName": "cuda-container", + "resourceName": "nvidia.com/gpu", + "deviceIDs": ["GPU-36b27e44-b086-46f7-f2dc-73c36dc65991"] + } + ] +spec: + devices: + - health: true + id: GPU-98583a5c-c155-9cf6-f955-03c189d3dbfb + minor: 0 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 15472384Ki + kubernetes.io/gpu-memory-ratio: "100" + type: gpu + - health: true + id: GPU-7f6410b9-bdf7-f9a5-de09-aa5ec31a7124 + minor: 1 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 15472384Ki + kubernetes.io/gpu-memory-ratio: "100" + type: gpu +status: {} +``` + +#### koordlet and koord-runtime-proxy + +Our target is to work compatible with origin k8s kubelet and k8s device plugins, so: + +1. We still allow kubelet and device plugin to allocate concrete device, which means no matter there's a k8s device +plugin or not, our design can work well. + +2. In koord-runtime-proxy, we will use Pod's `DeviceAllocation` in annotation to replace the step1's result of container's +args and envs. + +We should modify protocol between koord-runtime-proxy and koordlet to add container env: + +```go +type ContainerResourceHookRequest struct { + .... + Env map[string]string +} + +type ContainerResourceHookResponse struct { + .... + Env map[string]string +} +``` + +Then we will add a new `gpu-hook` in koordlet's runtimehooks, registered to `PreCreateContainer` stage. +We will generate new GPU env `NVIDIA_VISIBLE_DEVICES` by Pod GPU allocation result in annotation. + +The koord-runtime-proxy can see these Pod's env, we need koord-runtime-proxy to pass these environments to koordlet, and koordlet parse the GPU related env to find the concrete device ids. + +Besides, the koordlet should report GPU model to node labels same as device plugin, this is in-case Koordinator working without device-plugin. + +Finally, we should modify `ContainerResourceExecutor`'s `UpdateRequest` function in koord-runtime-proxy, and let new GPU env covering old GPU env. + +When we handle hot-update processing, we can handle the existing scheduled Pods without device allocation in Pod's annotation. If GPU allocation info is not in annotation, we will find the GPU allocations from `ContainerResourceHookRequest`'s `Env`, and we will update all GPU allocations to Device CRD instance. + +### Compatibility + +As we know, the GPU scheduling in kube-scheduler side has no any different with other scalar resources. The concrete device-level assigning is done by kubelet and GPU device plugin, which will generate container's GPU env. + +Our design has no conflict with the above process. Our device reporter reports Koordinator GPU resources for kubelet +updating node resources. Then we schedule device request in our new plugin with new device resource account. 
In pre-bind +stage, we will update container resources with Koordinator GPU resources, this is for kubelet to check resource limitation. +We will also add device allocation information to Pod's annotation. In node side, the k8s device plugin will first patch +container env, but we will overwrite these envs in runtimeproxy by allocation result in Pod's annotation. + +### Upgrade strategy + +If using Koordinator GPU Scheduling to schedule GPU Pods in a brand new cluster, simply install Koordinator components. + +However, if you want to upgrade to Koordinator GPU Scheduing in an existing cluster, you need to avoid GPU devices being repeatedly allocated because of switching between different scheduling mechanisms. You need to pay attention to the order when upgrading: +1. Install the Koordinator components. In particular, make sure that the koordlets are all started successfully. +2. Stop the system or platform that creates the new GPU Pod. +3. Stop the scheduler currently responsible for the GPU Pod and ensure that there are no pending GPU Pods in the current cluster. +3. Wait a few minutes to ensure that each node's koordlet creates and updates the Device CRD. +4. Modify all components that create GPU Pods to switch the schedulerName of the Pod to koord-scheduler +5. Start trying to create a GPU Pod and verify the koord-scheduler GPU Scheduling scheduling result. +6. Restore the system or platform that created the GPU Pod and the old scheduler. + +In the future Koordinator will provide a webhook to solve the upgrade existing cluster problem. The webhook will identify the GPU Pod and modify the schedulerName of the newly created GPU Pod to koord-scheduler. At the same time, the webhook will take over the Binding operation of the GPU Pod. If the Binding is not initiated by koord-scheduler, it will be rejected. + +## Unsolved Problems + +## Alternatives + +1. User can choose whether use k8s-device plugin. as mentioned above, we can compatible in both cases. diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/gang-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/gang-scheduling.md new file mode 100644 index 000000000..6122b6efc --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/gang-scheduling.md @@ -0,0 +1,347 @@ +# GangScheduling + +## 概览 +Koord-dscheduler 提供了 Gang Scheduling 满足 All-or-Nothing 调度需求。用户可以声明最小资源集合数(resource-collection-minimum),只有当已经完成调度资源数(assigned-resources)超过前面声明当前最小资源集合数才能触发节点绑定。 +同时提供 `Strict` 和 `NonStrict` 两个参数用于控制 resource-accumulation-process ,区别于其他社区方案将提供 two-level Gang 描述用于更好匹配真实场景。 + +## 动机 +在 AI 场景中很多任务都需要使用 Gang scheduling,社区已经有很多相关实现,比如 `Coscheduling` 、 `Vocalno`,设计过程中我们从社区项目中得到了很多灵感。 + +### 竞品对标 + +#### Coscheduling +1. `Coscheduling` 主要通过实现新型队列排序(queue-sort)接口以及其他方法将一组 Gang pod 尽量有序的出队。 + 举个🌰 ,我们有 10 个任务需要进行 Gang 调度,前面 5 个任务已经调度成功,此时第 6 个任务调度失败,`Coscheduling` 将会回滚前面 5 个已经完成调度的任务,同时会跳过后面 4 个待调度中的任务。 + +2. `Coscheduling` 会简单得使用一个全局间隔时间作为 Gang 调度周期。该设计会带来两个问题: + 1. 问题一,如果配置间隔太长会带来无效等待,如果太短会带来无效调度。 + 2. 问题二,如果待调度分组任务很多,此时大概率会出现周期内无法完成调度,出现调度超时的情况。 + + 对于上面的场景,我们的设计中称为 `Strict`,此场景下调度会严格按照既定配置的周期时间进行工作。 + +3. 有些任务需要复杂的 Gang 要求。例如,一个任务有几个规则,每个规则都有几个 pod 以及自身的 Gang 条件,任务也需要不同的规则来组成不同的 GangGroups。 +一个 GangGroup 中的所有 pod 只有在 GangGroup 中的所有规则都满足 Gang 条件后才触发绑定过程。上游标准的 `Coscheduling` 不能满足这个需求。 + +### 目标 +1. 定义 Gang 调度配置。 + +2. 提供调度器插件实现 Gang 调度。 + +### 非目标/未来工作 +1. 
提供使用 `NonStrict` 解决 Gang 资源死锁问题的能力。 + +## 方案 + +### 核心概念 + +#### Strict / NonStrict + +`Strict` 模式,如果其中一个 pod 调度失败,当前调度周期内,其他已经调度成功的 pod 将会被取消调度,同时正在调度中的 pod 将会在 PreFilter 阶段被拒绝调度。 + +`NonStrict` 模式,如果其中一个 pod 调度失败,并不会影响其他 pod 参与调度,会继续累计已经被调度的 pod 直到符合 Gang 调度条件。此模式对于 pod 比较多的情况比较友好,但是会增加不同 Gang 调度之间资源死锁的风险。 +> 举个🌰 ,如果当前资源配额为 10,此时用户提交三组 Gang 调度任务 pod 数都为 5,由于各种条件限制,Gang 调度 1/2/3 任务分别调度起来 pod 数量为 3/3/4, +> 此时当前资源组配额已经耗尽,不会有新的 pod 完成调度,三组 Gang 调度任务就会一直出于等待状态,这就是上面说到到资源死锁情况,目前还没有解决这个问题。 + +#### GangGroup + +`GangGroup`,有些任务需要复杂的 Gang 要求。例如,一个任务有几个规则,每个规则都有几个 pod 以及自身的 Gang 条件,任务也需要不同的规则来组成不同的 GangGroups。 +一个 GangGroup 中的所有 pod 只有在 GangGroup 中的所有规则都满足 Gang 条件后才触发绑定过程。`GangGroup` 则允许我们将不同 Gangs 进行聚合。 + +#### After Gang + +注意⚠️,如果满足 Gang 调度资源积累条件,随后一些 pod 在 binding 阶段失败,或者一些已经绑定的 pod 被抢占或者重新调度,这种情况下 Gang 的约束在资源重新分配过程中是否依然有效? + +答案:应该有效。因为 Gang 的设计初衷要求所有 pod 需要同时被拉起,如果只有其中一些 pod 被拉起,那么后续操作继续执行 Gang 调度策略将失去意义。因此,一旦 Gang 策略已经满足,后续所有的资源分配将不受 Gang 规则约束,后续将使用默认调度进行 pod 调度。 + +#### WaitTime + +`WaitTime` 自第一个 pod 进入 permit 阶段依赖的最大等待时间。如果 `WaitTime` 已经超时,调度器将会回滚所有已经调度完成的 pod,并且更新所有 pod annotation `gang.scheduling.koordinator.sh/timeout=true`,调度器将不会再调度这些 pod。用户需要注意这种情况并及时删除此类 pod。 + +### API +#### 定义 + +我们设计的初衷是优化以及增强社区原有的 `PodGroup` 能力,所以我们的 `PodGroup` 定义会兼容社区设计。我们会提供通过使用更新 annotation 方式使用 Gang 调度特性。 + +#### CRD 方式 +用户可以使用社区 `PodGroup` CRD 声明 Gang: +```go +type PodGroup struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec PodGroupSpec `json:"spec,omitempty"` + Status PodGroupStatus `json:"status,omitempty"` +} +type PodGroupSpec struct { + MinMember int32 `json:"minMember,omitempty"` + MinResources *v1.ResourceList `json:"minResources,omitempty"` + + ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"` +} +``` +Pod 需要添加 label `pod-group.scheduling.sigs.k8s.io` 来关联 `PodGroup` 配置。 + +同时,我们也可以使用以下可选配置: +```yaml +gang.scheduling.koordinator.sh/total-number +gang.scheduling.koordinator.sh/mode +gang.scheduling.koordinator.sh/groups +``` +- `gang.scheduling.koordinator.sh/name` 配置 Gang 调度器名称, 名称需要符合 RFC 1123 规范。 + +- `gang.scheduling.koordinator.sh/total-number` 当前配置仅作用于 `Strict` 模式, 详情请参考 `Data-Structure` 部分。默认与 `gang.scheduling.koordinator.sh/min-available` 一致。 + +- `gang.scheduling.koordinator.sh/mode` 选项 `Strict` 或者 `NonStrict`。 默认配置为 `Strict`。 + +- `gang.scheduling.koordinator.sh/groups` 用于配置 GangGroups 名称。默认为空,表示不需要与其他资源合并到 GangGroups,同一个 GangGroups 的 Gangs 可以来自于不同的 namespace。 + +`PodGroup` annotation 可以包含 `gang.scheduling.koordinator.sh/total-number`, `gang.scheduling.koordinator.sh/mode`, `gang.scheduling.koordinator.sh/gang-groups`。 + +##### 示例 +基础 Gang 调度配置如下: +```yaml +apiVersion: v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-07-11T18:26:33Z" + name: gang-a + namespace: default +spec: + minMember: 5 + minResources: + cpu: "5" + memory: "2048Mi" + scheduleTimeoutSeconds: 600 +``` + +创建一个任务包含两个策略:A 和 B,每个策略包含一些 pod。PodA 属于 roleA,PodB 属于 roleB。roleA、roleB 归属于同一个 GangGroup,示例如下: +```yaml +apiVersion: v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-07-11T18:26:33Z" + name: gang-a + namespace: namespaceA + annotations: + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +spec: + minMember: 5 + minResources: + cpu: "5" + memory: "2048Mi" + scheduleTimeoutSeconds: 600 +``` + +注意:如果用户使用 `CRD way`,需要集群管理员提前将 PodGroup 策略部署到集群,否则会出现带有 Gang 配置的 Pod 进行调度时,找不到对应的 Gang 策略 PodGroup 配置。 
+此外,从调度的角度来看,调度应该处理 Gang CRD 和 Pod 之间的任务顺序问题。 例如,如果 Pod 在 Gang CRD 之前到达调度,我们必须构建一个假 Gang 数据结构 +临时收集所有相关的 Pod,需要暂停 Pod 的调度,直到从真正的 Gang CRD 解析配置。 + +#### Annotation 方式 +```yaml +gang.scheduling.koordinator.sh/name +gang.scheduling.koordinator.sh/min-available +``` + +以上配置为必填,同时我们兼容社区 annotation `pod-group.scheduling.sigs.k8s.io`, `pod-group.scheduling.sigs.k8s.io/name`以及 `pod-group.scheduling.sigs.k8s.io/min-available` 。 + + +此外,我们还支持以下可选配置: +```yaml +gang.scheduling.koordinator.sh/waiting-time +gang.scheduling.koordinator.sh/total-number +gang.scheduling.koordinator.sh/mode +gang.scheduling.koordinator.sh/groups +``` + +- `gang.scheduling.koordinator.sh/waiting-time` 自第一个 pod 进入 permit 阶段依赖的最大等待时间。默认值可以在全局配置中设置。 + +- `gang.scheduling.koordinator.sh/total-number` 当前配置仅作用于 `Strict` 模式, 详情请参考 `Data-Structure` 部分。默认与 `gang.scheduling.koordinator.sh/min-available` 一致。 + +- `gang.scheduling.koordinator.sh/mode` 选项 `Strict` 或者 `NonStrict`。 默认配置为 `Strict`。 + +- `gang.scheduling.koordinator.sh/groups` 用于配置 GangGroups 名称。默认为空,表示不需要与其他资源合并到 GangGroups,同一个 GangGroups 的 Gangs 可以来自于不同的 namespace。 + +注意⚠️,如果同时通过 CRD 和 annotation 方式进行配置,该 annotation 配置将会覆盖 CRD 配置。同时, GangGroup 名称格式为 " gangNamespace" + "/" + "gangName " + +##### 示例 +基础 Gang 调度配置如下: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/min-available: 5 +``` + +创建一个任务包含两个策略:A 和 B,每个策略包含一些 Pod。PodA 属于 roleA,PodB 属于 roleB。roleA、roleB 归属于同一个 GangGroup,示例如下: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-b + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +``` + +创建一个任务包含两个策略:A 和 B,每个策略包含一些 Pod。PodA 属于 roleA,PodB 属于 roleB。roleA、roleB 归属于不同 GangGroup,示例如下: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: "" +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-b + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: "" +``` + +### 详细设计 +#### QueueSortPlugin + +我们单独设计调度器插件用于实现 `QueueSort` 拓展点,这样就可以将队列排序逻辑集成到所有插件,并且只需要注册一次。 + +当前方案中,我们实现 Less 方法汇总属于相同 Gang 的 pod。具体排序规则为: + +1. 比较两个 pod 的优先级配置,优先级越高的 pod 优先入队。 +2. 比较两个 pod 的创建时间戳,如果 pod 归属于同一个 Gang 配置,我们比较 Gang 配置创建时间,谁先创建则优先入队。 +3. 
比较 pod 的 namespace,如果 pod 归属某一个 Gang 配置,则比较 Gang 名称。 + +```go +type QueueSortPlugin interface{ + QueueSort(*QueuedPodInfo, *QueuedPodInfo) bool +} +``` + +#### GangSchedulingPlugin +##### Data-Structure +###### Gang +```go +type Gang struct { + Name string + WaitTime time.Duration + Mode string //Strict or NonStrict + GangGroup []string + MinRequiredNumber int + TotalChildrenNum int + Children map[string]*PodInfo + BoundChildren map[string]*PodInfo + WaitingForBindChildren map[string]*PodInfo + ResourceSatisfied bool + ScheduleCycle int + ScheduleCycleValid bool + ChildrenScheduleRoundMap map[string]int +} +``` + +Gang,用于记录 Gang 调度状态到调度器缓存。 + +- `Children`,用于记录归属于当前 Gang 的 pod 列表。 +- `BoundChildren`,`WaitingForBindChildren` 用于记录已经出于 binding 状态的 pod,用于检查 pod 是否已经通过 permit 阶段。 +- `ResourceSatisfied`,用于标记当前 pod 是否通过调度 Permit 阶段,如果通过则为 true。该字段主要用于判断当前 Gang 调度是否满足条件。 +- `scheduleCycle`,`childrenScheduleRoundMap`,前面两个字段主要用于控制 Gang 调度周期。 +> 举个🌰 ,调度伊始 `scheduleCycle` 字段为 1,`childrenScheduleRoundMap` 中所有 pod 值为 0。 +> 所有 pod 进入 PreFilter 阶段时,将会判断 `childrenScheduleRoundMap` 中 pod 值是否小于 `scheduleCycle` 值; +> 如果上一步校验通过,则将 `childrenScheduleRoundMap` 值设置为 `scheduleCycle` 的值,并通过当前校验; +> 反之则说明当前 pod 在本轮调度周期内已经完成调度,需要拒绝本次调度。 +> 根据 `totalChildrenNum` 字段,当所有 pod 都通过 PreFilter 阶段,说明当前调度周期所有 pod 已经完成调度,`scheduleCycle` 需要累加 1,说明开启新一轮调度周期。 +- `scheduleCycleValid`,当前 Gang 中任意 pod 在 Filter 阶段失败,scheduleCycleValid 将设置为 true,只有所有 pod 全部通过 Filter 阶段,该字段才会设置为 true。 + `scheduleCycleValid=false` 此场景下所有 pod 将不会进行调度,同时所有调度中都 pod 将被在 PreFilter 阶段被拒绝,当新一轮调度周期开启时,`scheduleCycleValid` 才会被设置为 true。 + +注意⚠️ ,`scheduleCycle\scheduleCycleValid\childrenScheduleRoundMap` 仅作用于 `Strict` 模式。 + +##### GangPlugin + +在调度器框架 Plugin 结构提基础上,增加 gangCache 用于缓存 Gang 信息。 +```go +type GangPlugin struct { + frameworkHandler framework.Handle + gangClient gangClient.Interface + podLister listerv1.PodLister + snapshotSharedLister framework.SharedLister + gangCache map[string]*Gang +} +``` +当启动 kubernetes 调度器时,我们仅需要将我们当逻辑挂载到以下 4 个扩展点: +```go +var( + _ framework.PreFilterPlugin = &GangScheduling{} + _ framework.PostFilterPlugin = &GangScheduling{} + _ framework.PermitPlugin = &GangScheduling{} + _ framework.ReservePlugin = &Coscheduling{} +) +type GangScheduling interface{ + ActiveGang(pod *corev1.Pod, state *framework.CycleState) + PreFilter(context.Context, *corev1.Pod) error + PostFilter(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status) + Permit(context.Context, *corev1.Pod) Status + Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) +} +``` +###### **PreFilter** + +`NonStrict` 模式,我们仅处理 步骤一和二: + +- 校验 Gang 下包含所有 pod 是否符合最小数,如果不符合则拒绝当前 pod。 + +- 校验 Gang 是否超时,如果超时则拒绝当前 pod。 + +- 校验 Gang scheduleCycleValid 字段是否为 true,如果为 false 则拒绝当前 pod。 + +- 尝试更新 `scheduleCycle`, `scheduleCycleValid`, `childrenScheduleRoundMap` 字段。 + +###### **PostFilter** + +到达当前阶段说明 pod 没有通过 Filter 校验,操作如下: + +- 如果 `Strict` 模式,设置 `scheduleCycleValid` 字段为 false,同时释放所有已经完成调度的 pod。 + +- 如果 `NonStrict` 模式则不做任何操作。 + +###### **Permit** + +到达当前阶段说明 pod 已经通过 Filter 校验,调度器插件将会计算 GangGroup 下所有 Gang 已经完成调度 pod 数量是否满足 Gang 最小值。 + +- 如果 Gang 不符合 bind 条件,我们会将 pod 状态修改为 "Wait" 并配置超时时间,同时 bind 协程一直保持等待直到超时或者通过校验。 + 随后,我们会执行 `ActiveGang` 操作,该操作会将归属于 Gang 的 pod 从 `schedulableQueue` 或者 `backoffQueue` 队列中迁移到 `activeQueue` 队列, + 如此操作之后,pod 将会被尽快尽享调度。 + +> 注意⚠️ ,社区调度器中,调度周期最长不能超过 15 分钟,我们则需要通过改写 RunPermitPlugins 将调度周期配置超过 15 分钟。 + +- 如果 Gang 符合 bind 条件,我们将等待中 pod 状态修改为 "Success",此时 bind 
协程将结束等待并执行后续操作,并将 Gang 对象中 `ResourceSatisfied` 设置为 true。 + +###### **Un-reserve** + +如果 permit 阶段超时且 binding 阶段失败,此时调度阶段将会流转到 un-reserve 阶段,我们通过 Gang 对象中 `ResourceSatisfied` 值判断,如果此时值为 true 说明 binding 阶段失败,反之则说明 Gang 超时。 + +- 如果 permit 阶段超时,我们将在所有 Gang 下所有 pod annotation 中增加 `gang.scheduling.koordinator.sh/timeout=true`,同时释放所有已经调度成功的 pod。 + 此时,Gang 下所有 pod 将永远不会再进行调度,用户需要手动处理 permit 超时问题。 + +- 如果 binding 阶段失败,Gang 资源累计操作将会结束,随后会回滚所有失败的 pod 。 + +###### **Init** + +我们将 watch pod 事件,并根据事件类型持续更新 Gang。 + +## 未解问题 + +## 可选性 + +用户可以根据具体场景选择使用 Gang `Strict` 或者 `NonStrict` 模式。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/koordlet-overview.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/koordlet-overview.md new file mode 100644 index 000000000..0f334fa1e --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/koordlet-overview.md @@ -0,0 +1,42 @@ +# Koordlet + + +## 摘要 +Koordlet 是部署在 Kubernetes 节点中的 DaemonSet,用于混部资源超卖、干扰检测、QoS 保障等。它由几个模块组成,分别负责信息收集、数据分析和 QoS 管理。 +一些模块还提供了框架脚手架,提供了一组插件进行扩展(如"QoS Manager"),以便于添加新策略。 + + +## 架构 +![image](/img/koordlet-arch.svg) + +## 模块 + +### Metrics Advisor +Metric Advisor 提供节点、Pod 和容器的资源使用和性能特征的基本信息。 +它是一个独立的模块,定期收集、处理和导出资源画像。它还检测运行容器的干扰,例如 CPU 调度、内存分配延迟和压力阻塞信息(Pressure Stall Information, PSI)。 +该信息将广泛用于资源超卖和 QoS 保障插件。 + +### Storage +Storage 管理来自 Metrics Advisor 和 States Informer 的信息,提供一系列增删改查的API,并对过期数据定期清理。 +它有两种类型的数据:静态和时间序列。时间序列类型存储历史数据用于统计目的,例如 CPU 和内存使用情况。静态类型包括节点、Pod 和容器的状态信息,例如节点的 CPU 信息、Pod 的元数据。 + +### States Informer +States Informer 从 kube-apiserver 和 kubelet 同步节点和 Pod 状态,并将数据作为 `static` 类型保存到 Storage 中。与其他模块相比,该模块在开发迭代中应该保持相对稳定。 + +### QoS Manager +QoS Manager 协调一组插件,这些插件负责按优先级保障 SLO,减少 Pod 之间的干扰。插件根据资源分析、干扰检测以及 SLO 策略配置,在不同场景下动态调整资源参数配置。通常来说,每个插件都会在资源调参过程中生成对应的执行计划。 + +QoS Manager 可能是迭代频率最高的模块,扩展了新的插件,更新了策略算法并添加了策略执行方式。 +一个新的插件应该实现包含一系列标准API的接口,确保 QoS Manager 的核心部分简单且具有较好的可维护性。 +高级插件(例如用于干扰检测的插件)会随着时间的推移变得更加复杂,在孵化已经稳定在 QoS Manager 中之后,它可能会成为一个独立的模块。 + +### Metrics Reporter +Metrics Reporter 从 Storage 中读取历史指标和状态数据,然后将它们合并并发送到 ApiServer,这些数据将被 Koordinator Manager 用于资源超卖模型管理。 +Metrics Reporter 还支持针对不同混部场景的多种处理算法。 + +### Runtime Hooks +Runtime Hooks 充当运行时 Hook 管理器的后端服务器。 Runtime Hook 管理器是一个 CRI 代理,它拦截CRI请求,调用后端服务器注入策略,如通过 Pod 优先级设置资源隔离参数,应用资源分配策略。 +Runtime Hooks 提供了一个框架来维护不同类型的策略,并在容器的生命周期中提供灵活的扩展点。 + +#### 例如 Pod 生命周期中的 LLC 隔离注入 +![image](/img/llc-isolation.svg) diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/load-aware-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/load-aware-scheduling.md new file mode 100644 index 000000000..a233834c6 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/load-aware-scheduling.md @@ -0,0 +1,114 @@ +# 负载感知调度 + +## 摘要 + +虽然 Koordiantor 通过超卖机制超卖资源可以提高节点的利用率,但也会因为 BestEffort 类型的工作负载干扰延迟敏感型应用,尤其是当节点负载水位较高时,这种干扰带来的影响会放大,不仅可能导致延迟敏感型应用的服务质量,也可能导致 BestEffort 类型的工作负载本身也不能很快的完成任务。 + +## 动机 + +Koordinator 通过超卖机制超卖一些资源。尽管它可以提高节点的利用率,但 BestEffort 工作负载也可能会干扰对延迟敏感的应用程序。 + +### 目标 + +1. 提供可配置的调度插件来帮助控制集群资源利用率。 +2. 资源利用率控制机制支持多种资源。 +3. 将资源利用率控制在安全阈值。 + +### 非目标/未来工作 + +1. 
通过应用画像帮助插件实现更合理的评估机制并获得更好的均衡效果。这是一项后续工作,将在不同的提案下完成。 + +## 用户故事 + +### 故事 1 + +当节点的资源利用率达到高阈值时,节点上正在运行的工作负载之间会发生严重的资源争用。例如,由于更高优先级的应用程序需要资源,因此经常 BestEffort 的工作负载。结果,BestEffort 的工作负载超时甚至被迫结束;或者对延迟敏感的应用程序将在高利用率下遭受严重的性能下降,无法满足外部 SLA。应该避免这种情况。 + +### 故事 2 + +混部集群中的工作负载具有不同的资源需求。典型的 CPU 密集型工作负载预计会使用更多 CPU,而其他类型的工作负载可能会使用更多内存。有可能 CPU 资源的利用率比较高,而内存资源的利用率比较低。在这种情况下,资源的不平衡利用会影响调度的效果,甚至可能导致资源空闲但 Pod 无法调度的问题。 + +### 故事 3 + +Koordinator 定义 NodeMetric CRD 来描述节点的资源使用情况,并由 Koordlet 定期更新。但是,如果在更新周期中有很多 Pod 调度到冷节点(即资源利用率低的节点),当这些 Pod 开始运行时,这些节点的资源利用率可能会超过预期的阈值。结果,这些 Pod 的运行时质量并没有预期的那么好。 +### 故事 4 + +由于节点异常,Koordlet 可能无法报告最新的资源使用情况。在调度过程中应避免此类节点,以防止出现意外异常。 + +## 实施细节 + +![image](/img/load-aware-scheduling-arch.svg) + +调度插件过滤异常节点并根据资源使用情况对其进行评分。这个调度插件扩展了 Kubernetes 调度框架中定义的 Filter/Score/Reserve/Unreserve 扩展点。 + +### 过滤不健康的节点 + +默认过滤异常节点,但是用户可以根据需要通过配置来决定是否开启。 + +- 过滤 Koordlet 无法更新 NodeMetric 的节点。如果配置启用,插件将排除 nodeMetrics.status.updateTime >= LoadAwareSchedulingArgs.nodeMetricExpirationSeconds 的节点。 + +- 按利用率阈值过滤节点。如果配置启用,插件将排除 latestUsageUtilization >= 利用率阈值的节点。 在过滤阶段,仅从最新的 NodeMetric 中获取资源利用率,已分配但尚未统计的 Pod 的资源利用率不参与计算,以便为新创建的 Pod 分配资源,避免因估算不合理而导致调度失败。 + +### 评分算法 + +评分算法的核心逻辑是选择资源使用量最小的节点。但是考虑到资源使用上报的延迟和 Pod 启动时间的延迟,时间窗口内已经调度的 Pod 和当前正在调度的 Pod 的资源请求也会被估算出来,并且估算值将参与计算。 + +### 插件配置 + +```go + +type LoadAwareSchedulingArgs struct { + metav1.TypeMeta + + FilterExpiredNodeMetrics *bool `json:"filterExpiredNodeMetrics,omitempty"` + NodeMetricExpirationSeconds *int64 `json:"nodeMetricExpirationSeconds,omitempty"` + ResourceWeights map[corev1.ResourceName]int64 `json:"resourceWeights,omitempty"` + UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + EstimatedScalingFactors map[corev1.ResourceName]int64 `json:"estimatedScalingFactors,omitempty"` +} + +``` + +- `FilterExpiredNodeMetrics` 指定是否过滤 Koordlet 无法更新 NodeMetric 的节点。 +- `NodeMetricExpirationSeconds` 表示 NodeMetric 过期时间,单位为秒;当NodeMetric过期时,节点被认为异常。默认为180秒。 +- `ResourceWeights` 表示资源的权重。默认情况下,CPU 和内存的权重都为1。 +- `UsageThresholds` 表示资源利用率阈值,CPU 的默认值为65%,内存的默认值为95%。 +- `EstimatedScalingFactors` 表示估计资源使用情况时的系数。CPU 的默认值为85%,内存的默认值为70%。 + +`FilterExpiredNodeMetrics` 控制 Filter 行为,如果值为 `false`,`NodeMetricExpirationSeconds` 在计分时仍然可以使用。 + +### 自定义节点指标更新周期 + +此插件依赖于 NodeMetric 的报告周期。需要根据不同的场景和工作量设置不同的报告周期。如果报告周期比较长,Koordlet 需要在报告周期内进行汇总,以保证指标的效果。因此,NodeMetricSpec 需要扩展以支持用户自定义的报告周期和聚合周期。用户可以修改 `slo-controller-config` 来完成相应的配置,Koord-Manager 中的控制器会负责更新相关节点的 NodeMetrics 的上报周期和聚合周期字段。 + +```go +// NodeMetricSpec defines the desired state of NodeMetric +type NodeMetricSpec struct { + // CollectPolicy defines the Metric collection policy + CollectPolicy *NodeMetricCollectPolicy `json:"metricCollectPolicy,omitempty"` +} + +// NodeMetricCollectPolicy defines the Metric collection policy +type NodeMetricCollectPolicy struct { + // AggregateDurationSeconds represents the aggregation period in seconds + AggregateDurationSeconds *int64 `json:"aggregateDurationSeconds,omitempty"` + // ReportIntervalSeconds represents the report period in seconds + ReportIntervalSeconds *int64 `json:"reportIntervalSeconds,omitempty"` +} +``` + +### 自定义节点使用阈值 + +目前,节点的资源利用率阈值是根据经验配置的,以保证节点的运行质量。但也有一些方法可以评估节点上运行的工作负载,以达到更合适的资源利用率阈值。例如,在分时场景中,可以设置更高的阈值以允许调度在延迟敏感的应用程序的低谷期间运行更多的 BestEffort 工作负载。当对延迟敏感的应用程序的峰值出现时,降低阈值并驱逐一些 BestEffort 工作负载。此外,可以使用 3-sigma 来分析集群中的利用率水平,以获得更合适的阈值。 + +支持用户通过 Annotation 自定义节点资源利用率阈值。 + +```go +const ( + AnnotationCustomUsageThresholds = "scheduling.koordinator.sh/usage-thresholds" +) + +type CustomUsageThresholds struct { + UsageThresholds 
map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` +} +``` \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md new file mode 100644 index 000000000..6c8cebc88 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md @@ -0,0 +1,342 @@ +# Multi Hierarchy Elastic Quota Management + +## Summary +When several users or teams share a cluster, fairness of resource allocation is very important. This proposal provides +multi-hierarchy elastic quota management mechanism for the scheduler. +- It supports configuring quota groups in a tree structure, which is similar to the organizational structure of most companies. +- It supports the borrowing / returning of resources between different quota groups, for better resource utilization efficiency. +The busy quota groups can automatically temporarily borrow the resources from the idle quota groups, which can improve the +utilization of the cluster. At the same time, when the idle quota group turn into the busy quota group, it can also automatically +take back the "lent-to" resources. +- It considers the resource fairness between different quota groups. When the busy quota groups borrow the +resources from the idle quota groups, the resources can be allocated to the busy quota groups under some fair rules. + +## Motivation + +### Compared with competitors + +#### Resource Quotas +[Resource Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/) provides the ability to restrain the upper +limit of resource usage in one quota group. The quota group resource usage aggregated based on the pod resource configurations. +Suppose there are still free resources in the cluster, but the resource usage of this quota group is close to the limit. +The quota group cannot flexibly borrow the idle resources from the cluster. The only possible way is to manually adjust the +limit of the quota group, but it is difficult to determine the timing and value of the adjustment when there are lots of +quota groups. + +#### Elastic Quota +[Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +proposed concepts of "max" and "min". "Max" is the upper bound of the resource consumption of the consumers. "Min" is the minimum +resources that are guaranteed to ensure the functionality/performance of the consumers. This mechanism allows the workloads +from one quota group to "borrow" unused reserved "min" resources from other quota groups. The unused "min" of one quota group +can be used by other quota groups, under the condition that there is a mechanism to guarantee the "victim" quota group can +consume its "min" resource whenever it needs. + +If multiple quota groups need borrow unused reserved "min" resources from other quota groups at the same time, +the implementation strategy is FIFO, which means that one quota group may occupy all "borrowed-from "resources, +while other quota groups cannot borrow any resources at all from the cluster. + +Neither of the above support multi hierarchy quota management. + +### Goals +1. Define API to announce multi hierarchy quota configuration. + +2. Provides a scheduler plugin to achieve multi hierarchy quota management ability. 
+ +### Non-goals/Future work +Users have two ways to manage GPU quotas. One is to only declare the number of GPU cards in the quota group, but do not +care about the specific card type assigned. The other is to specify the quotas required by different card types. For example, +suppose user A\B both has 10 GPU quota, and cluster has two GPU types A100\V100. quotaA only declare 10 GPU quota, so in the +scheduling process, as long as the total number of GPU cards allocated to A is 10, no matter what the allocation ratio of +a100\v100 is, it will meet the expectation. QuotaB also declare 10 GPU quota, but has more details with V100 is 5 and A100 is 5, +so the maximum allocation of V100 is 5 and A100 is 5 in the scheduling will meet the expectation. + +We know that the GPU card type reflected by the label or annotation on the node, not in the resource dimension, so we can't +simply configure nvidia.com/gpu-v100, nvidia.com/gpu-a100 directly into the quota group's resource dimension. + +What's more complicated is that in a cluster, there will be multiple quota groups like A\B at the same time, +These two modes will conflict. Suppose that the cluster resource has 20 cards, including 10 cards for A100 and 10 cards for V100. +If the scheduler first assigns 10 cards to quota groupA with all V100, then quota group B's V100 resource has no way to be guaranteed, +which obviously does not meet expectations. Therefore, we need to solve the problem that if the above two modes coexist, +the quota mechanism can still work normally. + +The above problems will be solved in the next proposal. + +## Proposal + +### Key Concept\User Stories +1. Each quota group declares its own "min" and "max". The semantics of "min" is the quota group's guaranteed resources, +if quota group's "request" less than or equal to "min", the quota group can obtain equivalent resources to the "request". +The semantics of "max" is the quota group's upper limit of resources. We require "min" to be less than or equal to max. + +2. We define "request" as the sum pod's request in the quota group. When some quota groups "request" is less than "min", and some +quota groups "request" is more than "min", the unused resources of the former can be lent to (or you can choose not to share) the +latter. The latter should use these resources according to the fair rule. When the former needs to use the "lent-to" resources, +the latter should also return the "borrowed-from" resources according to the fair rule. + +3. We define the "runtime" as the current actual resource that can be used by the quota group. For a quota group whose "request" +is less than min, the value of "runtime" is equal to "request". That is to say "request" should be unconditionally satisfied +if the "request" is less than "min". For a quota group whose "request" is greater than "min", the value of "runtime" is between +"min" and "max", and the part exceeding "min" is based on its own "request", the "lent-to" resources, and the ability of +other quota groups to compete for "lent-to" resources. This will be described in detail below. + +4. Hierarchy is very important in a resource-shared cluster. Suppose that the cluster shared by multiple departments, and +each department has multiple teams. If each team is a quota group, we naturally hope that the relationship between departments +and teams is tree shaped. In this way, no matter how to add, delete or adjust quota groups within the department, it is an +internal matter of the department. 
The cluster administrator only needs to be responsible for the quota configuration at the +level of departments, and the quota group's configuration can delegate power to the department itself. Moreover, tree can +help us easily see the summary of resources from the perspective of departments when there are lots of teams in one department. + +Another advantage of tree structure is that we can control the scope of the "lent-to" resource. For example, a department only +wants to its quota groups can borrow resources from each other, while the resources of the department do not want to be lent +to other departments. This is very convenient for the tree structure. It should be pointed out that although two levels can +meet most scenarios (the more levels, the higher the maintenance complexity), we will support that the height of the quota-tree +is arbitrary. + +### Implementation Details + +#### Calculate RuntimeQuota + +We use an example to introduce how to calculate "runtime". Suppose the cluster total resource is 100, and has 4 quotas, +the configuration and "request" of each quota group described as below: + +![image](/img/runtimequota1.jpg) + +We first calculate the "min" part of "runtime". It should be like as below: + +![image](/img/runtimequota2.jpg) + +Then we find quota groupA can lent 5 quotas to B\C\D, and the cluster has 40 quotas to allocate, so the sum is 45 for B\C\D +to share. We introduce a new field to represent the allocation fairness, which is called "shared-weight". "shared-weight" determines +the ability of quota groups to compete for shared resources. That is to say, B/C/D will allocate resources in the cluster according +to its "shared-weight". + +For example, assuming that the weights of B\C\D are 60\50\80 + +- B can get 45 * 60 / (60 + 50 + 80) = 14 + +- C can get 45 * 50 / (60 + 50 + 80) = 12 + +- D can get 45 * 80 / (60 + 50 + 80) = 19 + +However, quota group B only need 5 more due to request is 20 and min is 15, and quota group C and D are still hungry, +so quota group B can share 14 - 5 = 9 to C and D. + +![image](/img/runtimequota3.jpg) + +quota group C and D can still share the remained quota of 9 by allocation proportion, which C get 9 * 50 / (50 + 80) = 3, +D get 9 * 80 / (50 + 80) = 6, and we get the runtime of each quota group finally. + +![image](/img/runtimequota4.jpg) + +The whole process can be summarized as follows: + +1. The quota divided into two categories, one is whose "request" is less than "min", we call it "lent-to-quotas". The other is +whose "request" is greater than "min", we call it "borrowed-quotas". + +2. Calculate the "runtime" of each quota group not exceed "min", so we can get how many resources can be lent to "borrowed-quotas". + +3. The "borrowed-quotas" share the resources by allocation proportion. + +4. If the new "runtime" is larger than "request", there will be new resources which can be lent to the rest "borrowed-quotas". + +It is very difficult to manage the weight of thousands of quota groups in a company. Therefore, we need to set a default value +for the "shared-weight". According to our experience in online operations, using max as the default "shared-weight" of the quota +group can satisfy most scenarios. In this way, "max" has both the meaning of resource ceiling and allocation proportion: the +larger the "max" is, the more resources it wants. For individual special scenarios, the resource administrator can adjust the weight. 
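+
+To make the fair-sharing process above concrete, here is a minimal, single-resource sketch of the weight-proportional redistribution from the example (weights 60/50/80, 45 quotas to share). It is only an illustration, not the scheduler's implementation: the type and function names are invented, integer arithmetic is used, and the "min"/"request" values of quota groups C and D are assumed.
+
+```go
+package main
+
+import "fmt"
+
+type quotaGroup struct {
+	name    string
+	min     int64
+	request int64
+	weight  int64 // shared-weight; the real design defaults it to "max"
+	runtime int64
+}
+
+func minInt(a, b int64) int64 {
+	if a < b {
+		return a
+	}
+	return b
+}
+
+// shareToBorrowers distributes `shared` (lent "min" plus free cluster resources)
+// among the groups whose "request" exceeds "min", proportionally to weight,
+// re-distributing any surplus until everyone is satisfied or nothing is left.
+func shareToBorrowers(groups []*quotaGroup, shared int64) {
+	for _, g := range groups {
+		g.runtime = minInt(g.request, g.min) // the guaranteed part first
+	}
+	for shared > 0 {
+		var totalWeight int64
+		for _, g := range groups {
+			if g.request > g.runtime {
+				totalWeight += g.weight
+			}
+		}
+		if totalWeight == 0 {
+			break // every borrower is satisfied
+		}
+		progressed := false
+		remain := shared
+		for _, g := range groups {
+			if g.request <= g.runtime {
+				continue
+			}
+			grant := minInt(shared*g.weight/totalWeight, g.request-g.runtime)
+			g.runtime += grant
+			remain -= grant
+			if grant > 0 {
+				progressed = true
+			}
+		}
+		shared = remain
+		if !progressed {
+			break // only integer remainders left; stop to avoid looping forever
+		}
+	}
+}
+
+func main() {
+	groups := []*quotaGroup{
+		{name: "B", min: 15, request: 20, weight: 60},
+		{name: "C", min: 10, request: 40, weight: 50}, // min/request assumed
+		{name: "D", min: 20, request: 60, weight: 80}, // min/request assumed
+	}
+	shareToBorrowers(groups, 45) // 40 free in the cluster + 5 lent by group A
+	for _, g := range groups {
+		fmt.Printf("%s runtime=%d\n", g.name, g.runtime)
+	}
+}
+```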
+ +It must be pointed out that if the cluster resources suddenly decrease due to node failure, the sum of "min" may be +greater than the total resources of the cluster. If this case happens, we can't grantee "min" of each quota group actually. +So we will reduce the "min" of each quota group in a moderate proportion, which is to ensure that the sum of +"min" actually in effect is less than the total resources of the cluster. + +We need to introduce the concept of "sys-group". "sys-group" means that the "min" of this quota group is infinite, +and its request will never be bound by the quota. It is usually used for system level pods. When the scheduler starts, +the "sys-group" will be created by default not only in scheduler memory, but also try create the quota group crd. +Its "min" and "max" are INT_MAX. At the same time, its "min" will not be reduced in proportion to the above process. +The real available total resource of normal quota groups is the cluster total resource minus the "used" of the "sys-group". + +We also need to introduce the concept of "default-group". If the pod cannot find a matching quota group, it will be +matched to the "default-group". the "default-group" will be created by default not only in scheduler memory, but also try +create the quota group crd. Its "min" and "max" has default value, users can modify them on demand. + +#### Hierarchy +We can organize quota groups using quota-tree, each quota group has its own configuration. Currently, we only allow leaf +nodes to submit jobs. An example is as below: + +![image](/img/quotatree1.jpg) + +When we calculate the "request" of each quota group. We first count the requests of each parent group from the bottom up, +which is the accumulation of mathematical min(child group request, child group max). + +![image](/img/quotatree2.jpg) + +Then we calculate the "runtime" from top to bottom. The "runtime" of the parent quota group is the total resources of the +child quota groups. First we calculate parent quota group's "runtime". + +![image](/img/quotatree3.jpg) + +Then we calculate child quota group's "runtime". + +![image](/img/quotatree4.jpg) + +#### Min Guarantee and Preemption +Considering the following situations, suppose that the cluster has two quotas group A\B. At t0 time, only quota groupA has job +submission, it can borrow from quota group B's resource, and the "request" and "used" of quota group are both 100 as below: + +![image](/img/quotaguarantee1.jpg) + +At t1 time, quota groupB has job submission too, so the "runtime" of quota group A\B is both 50. However, if quota +groupA don't return resource back, quota groupB can't assign any resource cause node resource occupied by the quota groupA. + +![image](/img/quotaguarantee2.jpg) + +The solution is that we will monitor the relationship between "used" and "runtime" of each quota group in the background thread. +If quota group's "used" continues to be greater than "runtime", we will start the forced recycling mechanism to kill +several pods in the order of priority from low to high until the "used" is less than or equal to "runtime". If some pods +in the quota group do not want to be recycled, we require such pods can only use resource up to "min". By default, we +assume all pods can use resource beyond "min" if "runtime" larger than "min". + +We do not adopt the cross quota preemption method to solve the problem that when quota group "used" is less than "runtime" +(to preempt the quota group whose "used" is greater than the "runtime"). 
Because each quota group has an accurate "runtime", we can accurately recycle the overused resources of each quota group. This is more direct than preemption.
+
+In addition, we do not think that cross-quota preemption is worth recommending. In principle, the priorities of different
+quota groups are not comparable, because they may come from different business lines. A high priority in one business line
+is not more important than a low priority in another business line. Only priorities within a quota group are comparable.
+So we will not support cross-quota preemption for now. Moreover, in inner-quota preemption, we require that
+existUsed - preempted + preempt stays smaller than "runtime".
+
+It can be seen from the above that if the "min" of a quota group is not equal to its "max", the part of "runtime" exceeding "min" may be
+recycled by the scheduler.
+
+#### Configuration Limit
+We introduce several constraints to ensure that the quota mechanism works properly.
+
+1. Except for the first-level quota groups, we require that the sum of the "min" of all sub quota groups is less than or
+equal to the "min" of the parent group. The reason for excluding the first-level quota groups is that the cluster resources
+cannot avoid jitter. If the cluster resources shrink, we don't want to hinder the update of the quota groups.
+
+2. The "max" of a child quota group can be larger than the "max" of its parent group. Consider the following scenario: there are
+2 subtrees in the cluster, "dev-parent" and "production-parent", and each subtree has several quota groups. When "production"
+is busy, we can limit the resource use of "dev" by only decreasing the "max" of "dev-parent", instead of decreasing
+the "max" of each sub quota group of "dev-parent".
+
+3. A parent group cannot run pods. We did receive a request to allow the parent group to submit jobs. The priority of the
+parent group's own jobs would be higher than that of all the sub-groups, which means that the parent group's own jobs could
+preempt the "runtime" of the sub-groups' jobs at any time. This is somewhat similar to the hierarchical relationship of
+"town, city, province". Due to the complexity, we do not support this for now.
+
+4. The parent of a quota-tree node can only be a parent group, not a child group.
+
+5. A quota group cannot be converted between the parent-group and child-group attributes.
+
+6. We allow a node on the quota tree to freely change its parent node, as long as it does not break the existing detection rules.
+
+We will introduce a new webhook to check these configuration limitations.
+
+#### Extension Point
+
+##### PreFilter
+We will check whether (Pod.request + Quota.Used) is less than Quota.Runtime. If not, the scheduling cycle of the Pod fails.
+
+##### PostFilter
+We will re-implement the method selectVictimsOnNode in defaultPreempt. The original selectVictimsOnNode method selects all
+the pods with a lower priority than the preemptor's priority as potential victims on a node. For now, we only allow
+inner-quota-group preemption.
+
+##### Cache and Controller
+1. We will watch the events of quota groups and pods to calculate the "runtime" of each quota group.
+2. We will create a thread to update the quota group CRD to display "request\used\runtime" periodically.
+3. We will create a thread to monitor the "used" and "runtime" of each quota group.
If quota group's "used" continues to be +greater than "runtime", we will start the forced recycling mechanism to kill several pods in the order of priority from +low to high until the "used" is less than or equal to "runtime". + +### API + +#### Quota +We will reuse [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +'s crd to declare quota group. + +```go +type ElasticQuota struct { + metav1.TypeMeta + metav1.ObjectMeta + Spec ElasticQuotaSpec + Status ElasticQuotaStatus +} + +type ElasticQuotaSpec struct { + Min v1.ResourceList + Max v1.ResourceList +} + +type ElasticQuotaStatus struct { + Used v1.ResourceList +} +``` + +we will also add new annotation and labels to achieve our desired functionality. +```yaml +annotations: + quota.scheduling.koordinator.sh/runtime: {cpu:4, memory: 8Gi} + quota.scheduling.koordinator.sh/shared-weight: {cpu:4, memory: 8Gi} +labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent-quota-name: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true +``` +- `quota.scheduling.koordinator.sh/runtime` is updated by the scheduler. It reflects the "runtime" of the quota group. +- `quota.scheduling.koordinator.sh/is-parent` is disposed by the user. It reflects the "child\parent" attribute of the quota group. Default is child. +- `quota.scheduling.koordinator.sh/parent-quota-name` is disposed by the user. It reflects the parent quota name. Default is root. +- `quota.scheduling.koordinator.sh/shared-weight` is disposed by the user. It reflects the ability to share the "lent to" resource. Default equals to "max". +- `quota.scheduling.koordinator.sh/allow-lent-resource` is disposed by the user. It reflects whether quota group allows lent unused "min" to others. + +Here is a example: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: test + namespace: test + annotations: + quota.scheduling.koordinator.sh/runtime: {cpu:4, memory: 8Gi} + quota.scheduling.koordinator.sh/shared-weight: {cpu:4, memory: 8Gi} + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent-quota-name: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true +spec: + max: + cpu: 20 + memory: 40Gi + nvidia.com/gpu: 2 + min: + cpu: 10 + memory: 20Gi + nvidia.com/gpu: 1 +``` + +#### Pod +We introduce a new label on the pod to associate pod with quota group: +```yaml +labels: + quota.scheduling.koordinator.sh/quota-name: "test1" +``` + +if pod's don't have the label, we will follow [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +using namespace to associate pod with quota group. + +### Compatibility +We are fully compatible with [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) 's interface. +If pod's don't have the "quota-name" label, we will use the namespace to associate pod with quota group. If the pod has +the "quota-name" label, we will use it to associate pod with quota group instead of namespace. If we can't find the +matched quota group, we force the pod to associate with the "default-group". + +## Unsolved Problems +Please see Non-goals/Future work. 
+ +## Alternatives + +## Implementation History + +## References diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/node-prediction.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/node-prediction.md new file mode 100644 index 000000000..9bda2cc8a --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/node-prediction.md @@ -0,0 +1,278 @@ +# Node Prediction + +## Summary + +The *node prediction* is proposed to both improve the node utilization and avoid overloading. By profiling the +tendency of the node metrics, we can estimate the peak usage and implement more efficient over-commitment policy. + +## Motivation + +Scheduling pods with setting appropriate resource requirements is truly hard to follow. Underestimating requests can +bring performance issues. However, overvaluing requests is likely to cause resource waste and low efficiency. One +common approach is using Vertical Pod Autoscaler (VPA) to autopilot the resource requirements for the pods of the same +workload. The VPA optimizes the resource requirements of the pod according to the pod metrics of the same workload. It +estimates the pod usage and specifies proper resource requirements. It works well when we want to optimize the resource +requirements of workloads. However, most VPA approaches try to abandon the time series attribute from the pod metrics +and generate a relatively static requests/limits that should guarantee to make no bad ignoring the timing. It leaves +the usage-to-limit gap, i.e. the gap between the recommended pod request with the real-time pod usage, and the +well-known pooling effect, i.e. the gap between the sum of the pod usages with the node usage. Inspired by +[Google's work](#references) in the EuroSys'21, we propose the node prediction in Koordinator to conquer these two +gaps. + +### Goals + +- Define the node prediction API. +- Propose an online history-based-optimized (HBO) prediction model. +- Clarify how the Mid-tier resources are calculated with the prediction. + +### Non-Goals/Future Work + +- Propose a time-series-forecasting-based or offline prediction model. + +## User Stories + +### Story 1 + +As a cluster administrator, there are many web service pods allocating almost node resources. Whereas, the node +utilization is low since most allocated resources are not actually used. To improve node utilization, I want to reclaim +the unused resources to submit some low-priority online-service pods and Flink jobs. However, I am concerned with the +risks of over-utilization bringing machine overload which may cause the performance degradation and hurt the pod QoS. + +### Story 2 + +As a Kubernetes developer, I want to support the long-term load balancing in the scheduler. Thus, I need the information +that which nodes should be idle for a long time. + +## Design + +### Design Principles + +- The node prediction is low-cost and can be implemented in the Koordlet. +- The node prediction is pluggable. Users can replace the default model to customize the prediction. + +### Architecture + +The node prediction is implemented mainly in the Koordlet and Koord-Manager. The architecture is as below: + +![image](/img/node-prediction.svg) + +- Koordlet: The agent runs on the node. It implements the metrics collection, metrics storage, and predict server. + - Metrics Advisor: It collects the cpu/memory usage of the node and running pods. It stores the collected metrics in the Metric Cache. 
+ - Metric Cache: It stores the node and pod metrics in a TSDB, which allows other modules to query the metrics later. + - Predict Server: With the node and pod metrics retrieved from the Metric Cache, it calculates and checkpoints the predicted result based on the prediction model. + - States Informer: It maintains the metadata of the node and the pods. It also reports the latest prediction periodically to the kube-apiserver. +- Koord-Manager: The controller runs on a master node. + - Configuration delivery: It maintains the prediction and colocation strategies and distributes the node strategy onto the NodeMetric. + - Resource Calculator: It fetches the node prediction result, and calculates the resource allocatable of the reclaimed resources (i.e. Mid-tier resource). +- Koord-Scheduler: It schedules the pod with different priority bands (e.g. Prod, Mid, Batch). It can enable load-aware scheduling to balance the over-committed nodes' utilization. + +#### Workflow + +In the koordlet, stages to update the node prediction are as follows: + +1. Histogram initialization: The predict server initializes a set of histograms for CPU and memory. For implementing `N-Sigma_v1`, it initializes decayed histograms only for the node and priority classes. While implementing `N-Sigma_v2`, it initializes histograms both for the node and every running pod. +2. Metrics collection: The metrics advisor collects the usage statistics of node and pods and stores them as metric points into the metric cache every CollectInterval (e.g. 1s). +3. Histogram updating: The predict server fetches the node metrics and pod metrics of latest HistogramUpdateInterval (e.g. 30s). Then it uses the aggregated result to update the decayed histograms. +4. Periodical reporting: The states informer fetches node metrics and the last histograms for the node and priority classes every ReportingInterval (e.g. 60s). Then it reports the complete NodeMetric status with last node prediction info to the kube-apiserver. +5. Fast reporting: The states informer fetches the last histograms every CheckPredictionInterval (e.g. 20s). It checks if the predicted result is too small or too larger than the last updated prediction exceeding the ResourceDiffThreshold (e.g. 5%), or the updated duration is longer than ForceUpdateInterval (e.g. 600s). If the check result is true, It updates the latest node prediction to the kube-apiserver. + +In the koord-manager, stages to update the Mid-tier resources allocatable are as follows: + +1. NodeMetric lifecycle management: The koord-manager list-watches the Node and the ConfigMap slo-controller-config, and maintains the lifecycle of the NodeMetric CR. Once the colocation strategy in the slo-controller-config updated, the koord-manager parses the config data and updates the node prediction policy and mid colocation policy into the NodeMetric.Spec. +2. Mid resource updating: The koord-manager list-watches the NodeMetric. Once the NodeMetric status is updated, the koord-manager gets the latest node metrics and node prediction, and calculates the Mid allocatable resources based on the Mid over-commitment formula. Finally, it updates the Mid allocatable resources into the Node status as the extended resources (`kubernetes.io/mid-cpu`, `kubernetes.io/mid-memory`). + +#### Scheduling Optimization + +The results of the node prediction on the NodeMetric, the Mid extended resources on the Node and the scheduling Pod +in the scheduler are updated in different time. 
It is inevitable to find that the scheduler schedules a pod with an +older version of the node prediction, which may cause the schedule result "lagged". + +To relief the lagged prediction, the koordlet and koord-manager try both updating earlier when the +prediction/NodeMetric differs from the previous result than a threshold and set a resource buffer which should +tolerant most of the result changes between synchronizations. + +For the worst case in which the prediction could be lagged too much (e.g. 1 hour), we can maintain a lower bound of +the real Mid allocatable resources inside the scheduler. This part is not planned in the first version of the Mid-tier +over-commitment. + +### API + +#### Node Prediction + +##### Predict Policy + +```go +// ColocationStrategy defines the colocation strategy in slo-controller-config ConfigMap. +type ColocationStrategy struct { + // ... + NodePredictPolicy *slov1alpha1.PredictPolicy `json:"nodePredictPolicy,omitempty"` +} + +type NodeMetricSpec struct { + // ... + PredictPolicy *PredictPolicy `json:"predictPolicy,omitempty"` +} + +// PredictPolicy defines the policy for the node prediction. +type PredictPolicy struct { + ResourceDiffThresholdPercent *int64 `json:"resourceDiffThresholdPercent,omitempty"` + ColdStartPeriodSeconds *int64 `json:"coldStartPeriodSeconds,omitempty"` +} +``` + +##### Predicted Result + +```go +type NodeMetricStatus struct { + // ... + // ProdReclaimableMetric is the estimated reclaimable resources for the Prod-type pods. + ProdReclaimableMetric *ReclaimableMetric `json:"prodReclaimableMetric,omitempty"` +} + +type ReclaimableMetric struct { + // Resource is the resource usage of the prediction. + Resource ResourceMap `json:"resource,omitempty"` +} +``` + +#### Mid Overcommitment + +##### Colocation Strategy + +```go +type ColocationStrategy struct { + // ... + // MidCPUThresholdPercent defines the maximum percentage of the Mid-tier cpu resource dividing the node allocatable. + // MidCPUAllocatable <= NodeCPUAllocatable * MidCPUThresholdPercent / 100. + MidCPUThresholdPercent *int64 `json:"midCPUThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` + // MidMemoryThresholdPercent defines the maximum percentage of the Mid-tier memory resource dividing the node allocatable. + // MidMemoryAllocatable <= NodeMemoryAllocatable * MidMemoryThresholdPercent / 100. + MidMemoryThresholdPercent *int64 `json:"midMemoryThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` +} +``` + +##### Extended Resources + +```yaml +apiVersion: v1 +kind: Node +metadata: + name: test-node +status: + allocatable: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods + kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods + capacity: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' + kubernetes.io/mid-memory: 64818120Ki +``` + +### Theoretical Model + +#### Node Peak Prediction + +Before elaborating the peak prediction algorithm, let's formalize the node peak prediction problem. + +Let's denote the usage of a Pod `p` at the time `t` is `U(p, t)`. + +Then the usage of a Node `M` which schedules a set of Pods is `MU(Pods, t) = sum[p in Pods](U(p, t))`. + +> Note that the non-Pod usage of the node can be regarded as the usage of a special pod `S`. + +When we want to predict the node peak at the time `T`, we are calculating +`Peak(Pods, T) = max[t >= T](sum[p in Pods](U(p, t)))`. 
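+
+To make the notation concrete, the following minimal sketch computes `MU(Pods, t)` and the observed peak over a sampled window. The sampling layout and function name are illustrative only; the actual prediction model, described in the next subsection, estimates this peak statistically rather than enumerating future samples.
+
+```go
+package main
+
+import "fmt"
+
+// usage[p][t] holds the sampled usage U(p, t) of pod p at sample index t
+// (e.g. CPU in milli-cores). Non-pod usage can be modeled as one extra "pod".
+func nodePeak(usage [][]float64) float64 {
+	if len(usage) == 0 || len(usage[0]) == 0 {
+		return 0
+	}
+	peak := 0.0
+	for t := 0; t < len(usage[0]); t++ {
+		mu := 0.0 // MU(Pods, t) = sum over pods of U(p, t)
+		for p := range usage {
+			mu += usage[p][t]
+		}
+		if mu > peak {
+			peak = mu
+		}
+	}
+	return peak
+}
+
+func main() {
+	usage := [][]float64{
+		{1000, 1200, 900}, // pod a
+		{500, 400, 800},   // pod b
+	}
+	fmt.Println(nodePeak(usage)) // 1700
+}
+```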
+
+The predicted peak `Peak(Pods, T)` is our node prediction result at `T`.
+
+#### N-sigma Prediction
+
+There are several [statistical peak prediction models](#alternatives) which are practical to implement in the online
+scheduler. [*N-sigma*](#references) is the peak prediction model picked in the current implementation. It assumes the
+time-series node metrics follow a Gaussian distribution, which allows us to estimate the node peak with the mean and
+standard deviation (stdev):
+
+`Peak_N-Sigma_v1(Pods, T) = mean[T0 <= t <= T](MU(Pods, t)) + N * stdev[T0 <= t <= T](MU(Pods, t))`
+
+`Peak_N-Sigma_v1` is the predicted node peak. It is implemented as the first version of node prediction, which is
+calculated based on node-level metrics.
+
+Moreover, we can calculate it with the pods' metrics:
+
+`Peak_Pods-N-Sigma(Pods, T) = sum[p in Pods](mean[T0 <= t <= T](U(p, t)) + N * stdev[T0 <= t <= T](U(p, t)))`
+
+A more conservative result is derived from their maximum. `Peak_N-Sigma_v2` is the second version of node prediction,
+which also considers the pod-level metrics:
+
+`Peak_N-Sigma_v2(Pods, T) = max(Peak_N-Sigma_v1(Pods, T), Peak_Pods-N-Sigma(Pods, T))`.
+
+#### Mid-tier Overcommitment
+
+In the first version, the Mid-tier resource contains the reclaimable resources which are probably unused in the
+long term by the high-priority (i.e. Prod) pods.
+The resource calculation for the Mid-tier resources can be described as follows:
+
+```
+Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio)
+```
+
+- `Reclaimable[Mid] := max(0, reclaimRatio * Allocated[Prod] - Peak[Prod])`. The peak prediction model is used for estimating the future usage of the running Prod pods. The Mid pods can allocate a proportion of the resources reclaimed from running Prod pods.
+- `NodeAllocatable * thresholdRatio` is the maximal co-located Mid-tier resource, set as a ratio of the node allocatable.
+
+In later versions, the Mid-tier resource is planned to be mixed with the default node allocatable (i.e. the Prod allocatable),
+which means a Mid pod can allocate the unallocated node allocatable resources, and an idle node is able to schedule Mid
+pods. The Prod pods can preempt the Mid pods when the mixed allocatable is exhausted by the Mid pods, so that the
+Prod-tier resource is still more stable and better guaranteed than the Mid-tier.
+Then the resource calculation for the mixed Mid-tier resources can be described as follows:
+
+```
+Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
+Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)
+```
+
+## Alternatives
+
+### Peak Prediction Models
+
+There are several different peak prediction and time-series forecasting models which can estimate the future peak
+based on the historical node metrics, including statistical methods and machine learning methods. In this proposal,
+statistical peak prediction models are preferred since they are practical to implement in the online scheduling system,
+have less overhead of metrics collection than the ML approaches, and are simpler to analyze and debug.
+
+Here are some common statistical peak prediction models:
+
+1. [Borg-default](#references)
+
+Borg-default simply over-commits the machine resources at a fixed rate `a`, which means the peak usage is regarded as
+the result of dividing the requests by `a`.
+
+Let's denote the resource request of the Pod `p` at the time `t` is `R(p, t)`, where `R(p, t) = 0` when `p` is not running. 
Then we have,
+
+`Peak_Borg-default(Pods, T) = 1/a * sum[p in Pods](R(p, T))`, `a = 1.1` by default.
+
+2. [Resource Central](#references)
+
+Resource Central considers the peak of the machine as the sum of the peaks of the individual pods (or VMs). A simple
+peak prediction of a pod is a percentile of its historical usages, e.g. `percentile[t in [T-C, T]](U(p, t))`.
+
+`Peak_ResourceCentral(Pods, T) = sum[p in Pods](percentile[t in [T-C, T]](U(p, t)))`
+
+3. [Max](#references)
+
+The Max prediction model does not use the historical metrics directly, but takes the maximum of any known peak results.
+It yields a more conservative result than the input models. For example, we have a `Max_Borg-default_ResourceCentral`
+model calculated from the Borg-default and Resource Central models:
+
+`Peak_Max_Borg-default_ResourceCentral(Pods, T) = max(Peak_Borg-default(Pods, T), Peak_ResourceCentral(Pods, T))`
+
+## References
+
+1. Vertical Pod Autoscaler: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
+2. Bashir, Noman, et al. "Take it to the limit: peak prediction-driven resource overcommitment in datacenters." Proceedings of the Sixteenth European Conference on Computer Systems. 2021.
+3. Cortez, Eli, et al. "Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms." Proceedings of the 26th Symposium on Operating Systems Principles. 2017.
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/nri-mode-resource-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/nri-mode-resource-management.md
new file mode 100644
index 000000000..507c7e30c
--- /dev/null
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/nri-mode-resource-management.md
@@ -0,0 +1,152 @@
+# NRI Mode Resource Management
+
+## Glossary
+
+NRI, Node Resource Interface. See: https://github.com/containerd/nri
+
+## Summary
+
+We hope to enable NRI mode resource management for koordinator for easy deployment and in-time control.
+
+## Motivation
+
+Koordinator, a QoS-based scheduling system for hybrid workload orchestration on Kubernetes, supports two working [modes](https://github.com/koordinator-sh/koordinator/blob/main/docs/design-archive/koordlet-runtime-hooks.md) for its runtime hooks, targeting different scenarios: `Standalone` and `Proxy`. However, both of them have some [constraints](https://shimo.im/docs/m4kMLdgO1LIma9qD). NRI (Node Resource Interface), which is a public interface for controlling node resources, is a general framework for CRI-compatible container runtime plug-in extensions. It provides a mechanism for extensions to track the state of pods/containers and make limited modifications to their configuration. We'd like to integrate the NRI framework to address the `Standalone` and `Proxy` constraints based on this community-recommended mechanism.
+
+### Goals
+
+- Support NRI mode resource management for koordinator.
+- Support the containerd container runtime.
+
+### Non-Goals/Future Work
+
+- Support the docker runtime
+
+## Proposal
+
+Different from the standalone and proxy modes, Koordlet will start an NRI plugin to subscribe to pod/container lifecycle events from the container runtime (e.g. containerd, crio), and then the koordlet NRI plugin will call the runtime hooks to adjust pod resources or the OCI spec. The flow should be:
+
+- Get pod/container lifecycle events and OCI format information from the container runtime (e.g. containerd, crio).
+- Transform the OCI format information into internal protocols 
(e.g. PodContext, ContainerContext) to re-use the existing runtime hook plugins.
+- Transform the runtime hook plugins' responses into the OCI spec format.
+- Return the OCI spec format response to the container runtime (e.g. containerd, crio).
+
+![nri-proposal.png](/img/nri-proposal.png)
+
+### User Stories
+
+#### Story 1
+As a cluster administrator, I want to apply QoS policies before a pod's status becomes Running.
+
+#### Story 2
+As a cluster administrator, I want to deploy a koordinator cluster without restarting the kubelet.
+
+#### Story 3
+As a cluster administrator, I want to adjust resource policies at runtime.
+
+#### Story 4
+As a GPU user, I want to inject environment variables before the pod starts running.
+
+### Requirements
+
+- Need to upgrade containerd to >= 1.7.0, crio to >= v1.25.0
+
+#### Functional Requirements
+
+NRI mode should support all existing functionalities supported by the standalone and proxy modes.
+
+#### Non-Functional Requirements
+
+Non-functional requirements are user expectations of the solution. Include
+considerations for performance, reliability and security.
+
+### Implementation Details/Notes/Constraints
+1. koordlet [NRI plugin](https://github.com/containerd/nri/blob/main/plugins/template/plugin.go)
+```go
+type nriServer struct {
+    stub    stub.Stub
+    mask    stub.EventMask
+    options Options // server options
+}
+
+// Enable 3 hooks (RunPodSandbox, CreateContainer, UpdateContainer) in NRI
+func (p *nriServer) Configure(config, runtime, version string) (stub.EventMask, error) {
+}
+
+// Sync all pod/container information before the koordlet NRI plugin runs
+func (p *nriServer) Synchronize(pods []*api.PodSandbox, containers []*api.Container) ([]*api.ContainerUpdate, error) {
+}
+
+func (p *nriServer) RunPodSandbox(pod *api.PodSandbox) error {
+    podCtx.FromNri(pod)
+    RunHooks(...)
+    podCtx.NriDone()
+}
+
+func (p *nriServer) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
+    containerCtx.FromNri(pod, container)
+    RunHooks(...)
+    containerCtx.NriDone()
+}
+
+func (p *nriServer) UpdateContainer(pod *api.PodSandbox, container *api.Container) ([]*api.ContainerUpdate, error) {
+    containerCtx.FromNri(pod, container)
+    RunHooks(...)
+    containerCtx.NriDone()
+}
+```
+2. koordlet enhancement for NRI
+- PodContext
+```go
+// fill PodContext from the OCI spec
+func (p *PodContext) FromNri(pod *api.PodSandbox) {
+}
+
+// apply QoS resource policies for the pod
+func (p *PodContext) NriDone() {
+}
+```
+- ContainerContext
+```go
+// fill ContainerContext from the OCI spec
+func (c *ContainerContext) FromNri(pod *api.PodSandbox, container *api.Container) {
+}
+
+// apply QoS resource policies for the container
+func (c *ContainerContext) NriDone() (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
+}
+```
+
+### Risks and Mitigations
+
+## Alternatives
+There are several approaches to extending the Kubernetes CRI (Container Runtime Interface) to manage container resources, such as `standalone` and `proxy`. Under the `standalone` running mode, resource isolation parameters are injected asynchronously. Under the `proxy` running mode, the proxy can hijack CRI requests from the kubelet for pods and then apply resource policies in time. However, the `proxy` mode requires configuring and restarting the kubelet.
+
+There is a small difference in execution timing between the `NRI` and `proxy` modes: the hook points are not exactly the same. The biggest difference is that `proxy` calls the koordlet hooks between the kubelet and containerd. 
However, NRI calls the NRI plugin (koordlet hooks) inside containerd, which means containerd can still do something before or after it calls the NRI plugin (koordlet hooks). For example, under the `NRI` running mode, containerd sets up the pod network first and then calls the NRI plugin (koordlet hooks) in RunPodSandbox, while under the `proxy` running mode, containerd cannot do anything before the koordlet hooks run when `proxy` handles the RunPodSandbox CRI request.
+
+- Standalone
+
+  - kubelet -- CRI Request -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers
+  - kubelet -> Node Agent -> CRI Runtime / containers
+
+![standalone.png](/img/standalone.png)
+
+- Proxy
+
+  - kubelet -- CRI Request -> CRI Proxy -- CRI Request (hooked) -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers
+
+![proxy.png](/img/proxy.png)
+
+- NRI
+
+  - kubelet -- CRI Request -> CRI Runtime -- OCI Spec --> OCI compatible runtime -> containers
+                  ↘   ↗
+                Koordlet NRI plugin
+
+![nri.png](/img/nri.png)
+
+## Upgrade Strategy
+
+- Need to upgrade containerd to 1.7.0+ or CRIO to 1.26.0+
+- Need to enable NRI
+
+
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/pod-migration-job.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/pod-migration-job.md
new file mode 100644
index 000000000..47a94aba8
--- /dev/null
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/pod-migration-job.md
@@ -0,0 +1,374 @@
+# PodMigrationJob
+
+## Summary
+
+This proposal defines a CRD-based Pod migration API, through which the descheduler or other automatic fault recovery components can evict or delete Pods more safely. At the same time, the proposal also describes the specific implementation details of the API.
+
+## Motivation
+
+Migrating Pods is an important capability that many components (such as deschedulers) rely on, and it can be used to optimize scheduling or help resolve workload runtime quality issues. We believe that pod migration is a complex process, involving steps such as auditing, resource allocation, and application startup, and it is mixed with application upgrading and scaling scenarios as well as resource operation and maintenance operations by cluster administrators. Therefore, how to manage the stability risk of this process to ensure that the application does not fail due to the migration of Pods is a very critical issue that must be resolved.
+
+Therefore, it is necessary to realize a final-state-oriented migration capability based on a CRD, track the status of each step in the migration, and perceive scenarios such as upgrading and scaling of the application.
+
+### Goals
+
+1. Define a CRD-based Pod Migration Job API, through which the descheduler can evict or delete Pods more safely.
+2. Describe the design details behind the API.
+
+### Non-Goals/Future Work
+
+1. A new descheduler framework
+2. Descheduling capability for different scenarios such as load-aware descheduling, defragmentation, etc.
+3. The details about Deterministic preemption that preempts other Pods for Reservation.
+
+## Proposal
+
+### User Stories
+
+#### Story 1
+
+The descheduler in the K8s community evicts pods to be rescheduled according to different strategies. However, it does not guarantee whether the evicted Pod has resources available after re-creation. If a large number of new Pods are in the Pending state when the resources in the cluster are tight, the availability of the applications may be lowered. 
+ +#### Story 2 + +The descheduler evicts the Pod through the Eviction API, and the Eviction API decides whether to delete the Pod according to the PDB status. However, it is unable to perceive workload upgrades, scaling and other scenarios in which Pods are deleted, which will also bring security risks. + +#### Story 3 + +The Pod migration capability itself can be provided to users as a service. Users can integrate this API in their own systems to achieve safe migration, and are no longer limited to deschedulers. + + +### Basic Migration API + +These APIs provide cluster administrators with more fine-grained migration control capabilities, which can better reduce risks. + +- `scheduling.koordinator.sh/eviction-cost` indicates the eviction cost. It can be used to set to an int32. The implicit eviction cost for pods that don't set the annotation is 0, negative values are permitted. If set the cost ith `math.MaxInt32`, it means the Pod will not be evicted. Pods with lower eviction cost are preferred to be evicted before pods with higher eviction cost. If a batch of Pods to be evicted have the same priority, they will be sorted by cost, and the Pod with the smallest cost will be evicted. Although the K8s community has [Pod Deletion Cost #2255](https://github.com/kubernetes/enhancements/issues/2255), it is not a general mechanism. To avoid conflicts with components that use `Pod Deletion Cost`, users can individually mark the eviction cost for Pods. + + +### Pod Migration Job CRD + +In order to support the above user stories, a Custom Resource Definition(CRD) named `PodMigrationJob` is proposed to ensure the migration process safely. + +#### Migration Job Spec + +```go + +// PodMigrationJob is the Schema for the PodMigrationJob API +// +k8s:openapi-gen=true +// +kubebuilder:resource:scope=Cluster +type PodMigrationJob struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec PodMigrationJobSpec `json:"spec,omitempty"` + Status PodMigrationJobStatus `json:"status,omitempty"` +} + +type PodMigrationJobSpec struct { + // Paused indicates whether the PodMigrationJob should to work or not. + // Default is false + // +optional + Paused bool `json:"paused,omitempty"` + + // TTL controls the PodMigrationJob timeout duration. + // +optional + TTL *metav1.Duration `json:"ttl,omitempty"` + + // Mode represents the operating mode of the Job + // Default is PodMigrationJobModeReservationFirst + // +optional + Mode PodMigrationJobMode `json:"mode,omitempty"` + + // PodRef represents the Pod that be migrated + // +required + PodRef *corev1.ObjectReference `json:"podRef"` + + // ReservationOptions defines the Reservation options for migrated Pod + // +optional + ReservationOptions *PodMigrateReservationOptions `json:"reservationOptions,omitempty"` + + // DeleteOptions defines the deleting options for the migrated Pod and preempted Pods + // +optional + DeleteOptions *metav1.DeleteOptions `json:"deleteOptions,omitempty"` +} + +type PodMigrationJobMode string + +const ( + PodMigrationJobModeReservationFirst PodMigrationJobMode = "ReservationFirst" + PodMigrationJobModeEvictionDirectly PodMigrationJobMode = "EvictDirectly" +) + +type PodMigrateReservationOptions struct { + // ReservationRef if specified, PodMigrationJob will check if the status of Reservation is available. 
+ // ReservationRef if not specified, PodMigrationJob controller will create Reservation by Template, + // and update the ReservationRef to reference the Reservation + // +optional + ReservationRef *corev1.ObjectReference `json:"reservationRef,omitempty"` + + // Template is the object that describes the Reservation that will be created if not specified ReservationRef + // +optional + Template *ReservationTemplateSpec `json:"template,omitempty"` + + // PreemptionOption decides whether to preempt other Pods. + // The preemption is safe and reserves resources for preempted Pods. + // +optional + PreemptionOptions *PodMigrationJobPreemptionOptions `json:"preemptionOptions,omitempty"` +} + +type PodMigrationJobPreemptionOptions struct { + // Reserved object. +} +``` + +- `Paused` indicates whether the PodMigrationJob should to work or not. In some scenarios, the user does not expect the PodMigrationJob Controller to process the PodMigrationJob immediately, but rather to decide whether to execute it after completing some operations similar to auditing. +- `TimeoutInSeconds` controls the PodMigrationJob timeout duration. +- The `PodMigrationJob` support two modes defined by the field `Mode`: + - `PodMigrationJobModeReservationFirst` means that before migrating a Pod, try to reserve resources through the `Reservation` API, delete the Pod to be migrated after successfully reserved, and observe the status of the `Reservation` to ensure that the `Reservation` is consumed. + - `PodMigrationJobModeEvictionDirectly` indicates that the user clearly knows the risk of evicting the Pod and decides to evict the Pod directly. + - If `Mode` is not specified, `PodMigrationJobModeReservationFirst` is used by default +- `PodRef` represents the Pod that be migrated. The field is required. +- `ReservationOptions` defines options for how to reserve resource through `Reservation` API: + - `ReservationRef` if is specified, the referenced `Reservation` instance is used first. In some scenarios, such as defragmentation, in order to ensure the reliability of the upper-layer logic, resources may have been reserved on the target node. In this case, the specified `Reservation` can be used directly. + - `Template` describes the spec of `Reservation`. It is often not necessary to set this field. When neither `ReservationRef` nor `Template` is specified, the `PodMigrationJob controller` will construct the `ReservationSpec` reserved resources according to the Spec of the migrated Pod. If `Template` is set, the `ReservationTemplateSpec` and the Spec of the migrated Pod will be merged to construct the `ReservationSpec` reserved resources. + - `PreemptionOptions` decides whether to preempt other Pods if reserved resources failed. The specific details of preemption will be submitted in a separate proposal description in future work, and will not be expanded here for the time being. +- `DeleteOptions` defines the options of delete operation. Whether to delete a Pod through the `K8s Delete API` or evict a Pod through the `K8s Eviction API` depends on how the user configures the parameters of the `PodMigrationJob Controller`. Users only need to set `DeleteOptions` according to the workload in their own cluster. + +#### Migration Job Status + +```go +type PodMigrationJobStatus struct { + // PodMigrationJobPhase represents the phase of a PodMigrationJob is a simple, high-level summary of where the PodMigrationJob is in its lifecycle. + // e.g. 
Pending/Running/Failed + Phase PodMigrationJobPhase `json:"phase,omitempty"` + // Status represents the current status of PodMigrationJob + // e.g. ReservationCreated + Status string `json:"status,omitempty"` + // Reason represents a brief CamelCase message indicating details about why the PodMigrationJob is in this state. + Reason string `json:"reason,omitempty"` + // Message represents a human-readable message indicating details about why the PodMigrationJob is in this state. + Message string `json:"message,omitempty"` + // Conditions records the stats of PodMigrationJob + Conditions []PodMigrationJobCondition `json:"conditions,omitempty"` + // NodeName represents the node's name of migrated Pod + NodeName string `json:"nodeName,omitempty"` + // PodRef represents the newly created Pod after being migrated + PodRef *corev1.ObjectReference `json:"podRef,omitempty"` + // PreemptedPodsRef represents the Pods that be preempted + PreemptedPodsRef []corev1.ObjectReference `json:"preemptedPodsRef,omitempty"` + // PreemptedPodsReservations records information about Reservations created due to preemption + PreemptedPodsReservations []PodMigrationJobPreemptedReservation `json:"preemptedPodsReservation,omitempty"` +} + +type PodMigrationJobPreemptedReservation struct { + // Namespace represents the namespace of Reservation + Namespace string `json:"namespace,omitempty"` + // Name represents the name of Reservation + Name string `json:"name,omitempty"` + // NodeName represents the assigned node for Reservation by scheduler + NodeName string `json:"nodeName,omitempty"` + // Phase represents the Phase of Reservation + Phase string `json:"phase,omitempty"` + // PreemptedPodRef represents the Pod that be preempted + PreemptedPodRef *corev1.ObjectReference `json:"preemptedPodRef,omitempty"` + // PodsRef represents the newly created Pods after being preempted + PodsRef []corev1.ObjectReference `json:"podsRef,omitempty"` +} + +type PodMigrationJobCondition struct { + // Type is the type of the condition. + Type PodMigrationJobConditionType `json:"type"` + // Status is the status of the condition. + // Can be True, False, Unknown. + Status PodMigrationJobConditionStatus `json:"status"` + // Last time we probed the condition. + // +nullable + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + // Last time the condition transitioned from one status to another. + // +nullable + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` + // Unique, one-word, CamelCase reason for the condition's last transition. + Reason string `json:"reason,omitempty"` + // Human-readable message indicating details about last transition. + Message string `json:"message,omitempty"` +} + +type PodMigrationJobPhase string + +const ( + // PodMigrationJobPending represents the initial status + PodMigrationJobPending PodMigrationJobPhase = "Pending" + // PodMigrationJobRunning represents the PodMigrationJob is being processed + PodMigrationJobRunning PodMigrationJobPhase = "Running" + // PodMigrationJobSucceed represents the PodMigrationJob processed successfully + PodMigrationJobSucceed PodMigrationJobPhase = "Succeed" + // PodMigrationJobFailed represents the PodMigrationJob process failed caused by Timeout, Reservation failed, etc. + PodMigrationJobFailed PodMigrationJobPhase = "Failed" + // PodMigrationJobAborted represents the user forcefully aborted the PodMigrationJob. + PodMigrationJobAborted PodMigrationJobPhase = "Aborted" +) + +// These are valid conditions of PodMigrationJob. 
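+// During a typical ReservationFirst migration, the controller records ReservationCreated and
+// ReservationScheduled first, then Eviction once the old Pod is removed, and finally
+// PodScheduled/PodBoundReservation/ReservationBound when the new Pod consumes the Reservation.
+// (This ordering is an illustration inferred from the controller workflow described below,
+// not an API guarantee.)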
+const ( + PodMigrationJobConditionReservationCreated PodMigrationJobConditionType = "ReservationCreated" + PodMigrationJobConditionReservationScheduled PodMigrationJobConditionType = "ReservationScheduled" + PodMigrationJobConditionPreemption PodMigrationJobConditionType = "Preemption" + PodMigrationJobConditionEviction PodMigrationJobConditionType = "Eviction" + PodMigrationJobConditionPodScheduled PodMigrationJobConditionType = "PodScheduled" + PodMigrationJobConditionReservationPodBoundReservation PodMigrationJobConditionType = "PodBoundReservation" + PodMigrationJobConditionReservationBound PodMigrationJobConditionType = "ReservationBound" +) + +// These are valid reasons of PodMigrationJob. +const ( + PodMigrationJobReasonTimeout = "Timeout" + PodMigrationJobReasonFailedCreateReservation = "FailedCreateReservation" + PodMigrationJobReasonUnschedulable = "Unschedulable" + PodMigrationJobReasonMissingPod = "MissingPod" + PodMigrationJobReasonMissingReservation = "MissingReservation" + PodMigrationJobReasonPreempting = "Preempting" + PodMigrationJobReasonPreemptComplete = "PreemptComplete" + PodMigrationJobReasonEvicting = "Evicting" + PodMigrationJobReasonFailedEvict = "FailedEvict" + PodMigrationJobReasonEvictComplete = "EvictComplete" + PodMigrationJobReasonWaitForPodBindReservation = "WaitForPodBindReservation" +) + +type PodMigrationJobConditionStatus string + +const ( + PodMigrationJobConditionStatusTrue PodMigrationJobConditionStatus = "True" + PodMigrationJobConditionStatusFalse PodMigrationJobConditionStatus = "False" + PodMigrationJobConditionStatusUnknown PodMigrationJobConditionStatus = "Unknown" +) +``` + +### Implementation Details/Notes/Constraints + +#### PodMigrationJob Controller + +The difference between `PodMigrationJobController` and general controller is that `PodMigrationJobController` will evaluate all pending PodMigrationJobs together (ie PodMigrationJob.Phase is Pending) and select a batch of PodMigrationJob and reconcile them. This selection process is called the arbitration mechanism. The reason why the arbitration mechanism is introduced is mainly to control the stability risk and control the cost of migrating Pods. The arbitration mechanism includes three stages: `Group`, `Filter` and `Sort`. + +##### Group PodMigrationJob + +Aggregate according to different workloads to facilitate the processing of subsequent processes + +- Aggregate PodMigrationJob by workload +- Aggregate PodMigrationJob by Node +- Aggregate PodMigrationJob by Namespace + +##### Filter PodMigrationJob + +- Check how many PodMigrationJob of each workload are in the Running state, and record them as ***migratingReplicas***. If the ***migratingReplicas*** reach a certain threshold, they will be excluded. The detailed algorithm of this threshold is described later. +- Check the number of ***unavailableReplicas*** of each workload, and determine whether the ***unavailableReplicas + migratingReplicas*** conform to the corresponding [PDB(Pod Disruption Budget)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) or [PUB(Pod Unavailable Budget)](https://openkruise.io/docs/user-manuals/podunavailablebudget). If there is no PDB or PUB, use the algorithm to calculate dynamically. If not, exclude the corresponding PodMigrationJob. +- Check the number of Pods being migrated on the node where each target Pod is located. If it exceeds the maximum migration amount for a single node, exclude it. +- Check the number of Pods being migrated in the Namespace where each target Pod is located. 
If it exceeds the maximum migration amount for a single Namespace, exclude it + +The detailed algorithm of Workload Max Migrating/Unavailable Replicas: + +```go +func GetMaxMigrating(replicas int, intOrPercent *intstr.IntOrString) (int, error) { + return GetMaxUnavailable(replicas, intOrPercent) +} + +func GetMaxUnavailable(replicas int, intOrPercent *intstr.IntOrString) (int, error) { + if intOrPercent == nil { + if replicas > 10 { + s := intstr.FromString("10%") + intOrPercent = &s + } else if replicas >= 4 && replicas <= 10 { + s := intstr.FromInt(2) + intOrPercent = &s + } else { + s := intstr.FromInt(1) + intOrPercent = &s + } + } + return intstr.GetValueFromIntOrPercent(intOrPercent, replicas, true) +} +``` + +##### Sort PodMigrationJob + +- Pods with higher QoS requirements are given priority, LSE > LSR > LS > BE +- Pods with higher priority will be processed first +- The higher migration priority will be processed first +- If the Pod has already initiated a migration job in the past and it fails, sort by the number of times. The lower the number of times, the priority will be given to processing +- If the workload where the Pod is located has been descheduled for a certain number of times in the past, it is sorted according to the number of times. The lower the number of times, the priority will be processed. +- Sort by the number of replicas being migrated by the workload. The lower the number of replicas being migrated, the priority will be given to processing. + +##### Execute PodMigrationJob + +- Update PodMigrationJobStatus.Phase to Running to trigger the PodMigrationJob controller reconcile these jobs +- PodMigrationJob controller reconciles process: + - If the mode of PodMigrationJob is `EvictionDirectly`, just delete the Pod through the delete method that configured in PodMigrationJob controller. And update the phase of PodMigrationJob to Success. + - If not specified ReservationOptions.ReservationRef, create the Reservation instance by the reservation template or Pod spec to reserve resources. And updates the created Reservation instance to the ReservationOptions.ReservationRef. + - Check the status of Reservation to determine whether reserve resource successfully. + - If failed to reserve, abort the PodMigrationJob and update the phase of PodMigrationJob to Fail + - If successfully reserve, delete the Pod through the delete method that configured in PodMigrationJob controller. + - Check the Reservation status to determine whether the Reservation consumed. + - If Reservation consumed, tracks the status of Reservation and update the status to PodMigrationJob + - Update phase of PodMigrationJob to Success. + +##### Migration Stability mechanism + +- Support for disabling this capability by configuration +- Supports a simple central flow control mechanism to limit the number of migrations over a period of time. + +See the Configuration section for more details + +#### Controller Configuration + +User can configure the `MigrationControllerArgs` through Koordinator Descheduler ConfigMap. + +```go +// MigrationControllerArgs holds arguments used to configure the MigrationController +type MigrationControllerArgs struct { + metav1.TypeMeta + + // DryRun means only execute the entire migration logic except create Reservation or Delete Pod + // Default is false + DryRun bool `json:"dryRun,omitempty"` + + // EvictFailedBarePods allows pods without ownerReferences and in failed phase to be evicted. 
+	EvictFailedBarePods bool `json:"evictFailedBarePods"`
+
+	// EvictLocalStoragePods allows pods using local storage to be evicted.
+	EvictLocalStoragePods bool `json:"evictLocalStoragePods"`
+
+	// EvictSystemCriticalPods allows eviction of pods of any priority (including Kubernetes system pods)
+	EvictSystemCriticalPods bool `json:"evictSystemCriticalPods"`
+
+	// IgnorePVCPods prevents pods with PVCs from being evicted.
+	IgnorePvcPods bool `json:"ignorePvcPods"`
+
+	// LabelSelector sets whether to apply label filtering when evicting.
+	// Any pod matching the label selector is considered evictable.
+	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
+
+	// FlowControlQPS controls the number of arbitrations per second
+	FlowControlQPS string `json:"flowControlQPS,omitempty"`
+	// FlowControlBurst is the maximum number of tokens
+	FlowControlBurst int32 `json:"flowControlBurst,omitempty"`
+
+	// MaxMigratingPerNode represents the maximum number of pods that can be migrating during migration per node.
+	MaxMigratingPerNode *int32 `json:"maxMigratingPerNode,omitempty"`
+
+	// MaxMigratingPerNamespace represents the maximum number of pods that can be migrating during migration per namespace.
+	MaxMigratingPerNamespace *int32 `json:"maxMigratingPerNamespace,omitempty"`
+
+	// MaxMigratingPerWorkload represents the maximum number of pods that can be migrating during migration per workload.
+	// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
+	MaxMigratingPerWorkload *intstr.IntOrString `json:"maxMigratingPerWorkload,omitempty"`
+
+	// MaxUnavailablePerWorkload represents the maximum number of pods that can be unavailable during migration per workload.
+	// The unavailable state includes NotRunning/NotReady/Migrating/Evicting
+	// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
+	MaxUnavailablePerWorkload *intstr.IntOrString `json:"maxUnavailablePerWorkload,omitempty"`
+
+	// EvictionPolicy represents how to delete a Pod; it supports "Delete" and "Eviction", and the default value is "Eviction"
+	EvictionPolicy string `json:"evictionPolicy,omitempty"`
+	// DefaultDeleteOptions defines options when deleting migrated pods and preempted pods through the method specified by EvictionPolicy
+	DefaultDeleteOptions *metav1.DeleteOptions `json:"defaultDeleteOptions,omitempty"`
+}
+
+```
\ No newline at end of file
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/resource-reservation.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/resource-reservation.md
new file mode 100644
index 000000000..7fa73c84f
--- /dev/null
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/resource-reservation.md
@@ -0,0 +1,245 @@
+# Resource Reservation
+
+## Summary
+
+A scheduling mechanism and its API are provided to reserve node resources for pods that may not be created yet.
+
+## Motivation
+
+Pods are fundamental units for allocating node resources in Kubernetes, which bind resource requirements with business logic. The scheduler is not able to reserve node resources for specific pods or workloads. We may try using a [fake pod](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler) to prepare resources by the preemption mechanism. However, fake pods can be preempted by any scheduled pods with higher priorities, which makes resources get scrambled unexpectedly. 
+ +In Koordinator, a resource reservation mechanism is proposed to enhance scheduling and especially benefits scenarios below: + +1. Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority. +2. De-scheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted. +3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas. +4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost. + +### Goals + +- Define the basic API of resource reservation for *Motivations<1,2,3>*, extensible for supporting *Motivation<4>* in the future. +- Provide a scheduler plugin that implements above reservation mechanism. + +### Non-Goals/Future Work + +- Detailed design of reservative preemption/descheduler/horizontal scaler/pre-allocation. +- Modify kubelet admission control for reservation objects. + +## Proposal + +### User Stories + +#### Story 1 + +As a Kubernetes developer, I want to enhance the current **preemption** mechanism since preempted resources may be allocated by pods other than the preemptor. The scheduler can create a reservation for the preempting pods, so the ownership of preempted resources can be guaranteed, making the preemption more reliable. + +#### Story 2 + +As a cluster administrator, I want to use **descheduler** to migrate pods that are placed abnormally to somewhere they could "live better" and fulfill orchestration requirements of the app. e.g. Move pods on a over-utilized node to idler nodes and bind CPUs of same NUMA node. Reservations can be created before rescheduling pods, helping ensure there are sufficient resources and well placement. + +#### Story 3 + +As an application administrator, I want to make the **horizontal scaling** of my app more deterministic by submitting reservations before a scale-up. Besides, I can also reserve resources after a scale-down for future demands. It is useful especially when we want a guaranteed scale-up of applications for the coming business peak. + +#### Story 4 + +As a cluster administrator, I want to **pre-allocate** node resources for future usage no matter whether they are available now or not. I want to allocate the future free resources but do not disrupt the running of scheduled pods. Reservation can be made to pre-allocate resources since it makes no physical cost to the node. It may be in a `Waiting` state. When there is enough space for the reservation, it will become `Available` for the owner pods' scheduling. + +### API + +In this section, a Custom Resource Definition (CRD) named `Reservation` is proposed to allow the scheduler to reserve node resources for specific pods. + +![image](/img/resource-reservation.svg) + +```go +// Reservation objects are non-namespaced. +// It can reserve resources for pods of any namespace. Any affinity/anti-affinity of reservation scheduling can be +// specified in the pod template. 
+type Reservation struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec ReservationSpec `json:"spec,omitempty"` + Status ReservationStatus `json:"status,omitempty"` +} + +type ReservationSpec struct { + // Template defines the scheduling requirements (resources, affinities, images, ...) processed by the scheduler just + // like a normal pod. + // If the `template.spec.nodeName` is specified, the scheduler will not choose another node but reserve resources on + // the specified node. + Template *corev1.PodTemplateSpec `json:"template,omitempty"` + // Specify the owners who can allocate the reserved resources. + // Multiple owner selectors and ORed. + Owners []ReservationOwner `json:"owners,omitempty"` + // By default, the resources requirements of reservation (specified in `template.spec`) is filtered by whether the + // node has sufficient free resources (i.e. ReservationRequest < NodeFree). + // When `preAllocation` is set, the scheduler will skip this validation and allow overcommitment. The scheduled + // reservation would be waiting to be available until free resources are sufficient. + // NOTE: Not supported in v0.6. + PreAllocation bool `json:"preAllocation,omitempty"` + // Time-to-Live period for the reservation. + // `expires` and `ttl` are mutually exclusive. If both `ttl` and `expires` are not specified, a very + // long TTL will be picked as default. Set 0 to disable the expiration. + TTL *metav1.Duration `json:"ttl,omitempty"` + // Expired timestamp when the reservation expires. + // `expires` and `ttl` are mutually exclusive. Defaults to being set dynamically at runtime based on the `ttl`. + Expires *metav1.Time `json:"expires,omitempty"` +} + +type ReservationStatus struct { + // The `phase` indicates whether is reservation is waiting for process (`Pending`), available to allocate + // (`Available`) or timeout/expired to get cleanup (Failed). + Phase ReservationPhase `json:"phase,omitempty"` + // The `conditions` indicate the messages of reason why the reservation is still pending. + Conditions []ReservationCondition `json:"conditions,omitempty"` + // Current resource owners which allocated the reservation resources. + CurrentOwners []corev1.ObjectReference `json:"currentOwners,omitempty"` + // Name of node the reservation is scheduled on. + NodeName string `json:"nodeName,omitempty"` + // Resource reserved and allocatable for owners. + Allocatable corev1.ResourceList `json:"allocatable,omitempty"` + // Resource allocated by current owners. + Allocated corev1.ResourceList `json:"allocated,omitempty"` +} + +type ReservationOwner struct { + // Multiple field selectors are ANDed. + Object *corev1.ObjectReference `json:"object,omitempty"` + Controller *ReservationControllerReference `json:"controller,omitempty"` + LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"` +} + +type ReservationControllerReference struct { + // Extend with a `namespace` field for reference different namespaces. + metav1.OwnerReference `json:",inline"` + Namespace string `json:"namespace,omitempty"` +} + +type ReservationPhase string + +const ( + // ReservationPending indicates the Reservation has not been processed by the scheduler or is unschedulable for + // some reasons (e.g. the resource requirements cannot get satisfied). + ReservationPending ReservationPhase = "Pending" + // ReservationAvailable indicates the Reservation is both scheduled and available for allocation. 
+ ReservationAvailable ReservationPhase = "Available" + // ReservationWaiting indicates the Reservation is scheduled, but the resources to reserve are not ready for + // allocation (e.g. in pre-allocation for running pods). + ReservationWaiting ReservationPhase = "Waiting" + // ReservationFailed indicates the Reservation is failed to reserve resources, due to expiration or marked as + // unavailable, which the object is not available to allocate and will get cleaned in the future. + ReservationFailed ReservationPhase = "Failed" +) + +type ReservationCondition struct { + Type ReservationConditionType `json:"type,omitempty"` + Status ConditionStatus `json:"status,omitempty"` + Reason string `json:"reason,omitempty"` + Message string `json:"message,omitempty"` + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` +} + +type ReservationConditionType string + +const ( + ReservationConditionScheduled ReservationConditionType = "Scheduled" + ReservationConditionReady ReservationConditionType = "Ready" +) + +type ConditionStatus string + +const ( + ConditionStatusTrue ConditionStatus = "True" + ConditionStatusFalse ConditionStatus = "False" + ConditionStatusUnknown ConditionStatus = "Unknown" +) + +const ( + ReasonReservationScheduled = "Scheduled" + ReasonReservationUnschedulable = "Unschedulable" + ReasonReservationAvailable = "Available" + ReasonReservationExpired = "Expired" +) +``` + +### Implementation Details + +#### Reservation Plugin + +##### Schedule Reservations + +A `Reservation` object has its scheduling requirements like a pod. Ideally, A `Reservation` object should get processed directly by the scheduler like a pod. However, it can require a series of modifications on [scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/), losing the compatibility with standard kube-scheduler, kubelet, autoscaler, etc. In the reservation plugin, we fake one *reservation pod* for one `Reservation` inside the scheduler to fulfill general scheduling plugins (noderesources, nodeaffinity, tainttolerations, ...). The scheduling framework can handle `Reservation` objects by processing fake pods in both [scheduling cycle and binding cycle](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#scheduling-cycle-binding-cycle). + +A fake pod inside the scheduler can construct the same affinity/anti-affinity constraints as owner pods, which may change the reservation result. To handle this problem, koord-scheduler extends the framework to skip check of pod affinity for existing reservations in the `Filter` phase. + +A reservation specified `PreAllocation` intends to pre-allocate resources on nodes. The scheduler will skip its filtering of node resources in the scheduling cycle. However, the scheduled reservation will be `Waiting` to be `Available` until there are enough resources to fulfill its requests. + +If all nodes are unscheduled for the reservation, the scheduler keeps its status as `Pending` and sets `Conditions` with the failure message. + +Once the scheduling decision has been made, the corresponding `Reservation` object is updated with a new status indicating whether the reservation succeeded or not. The fake pod does not expose to other components, and the kubelet without modification does not perceive a `Reservation` assigned. 
Fortunately, a `Reservation` does not need to be executable on the node, so existing containers can keep running as usual without additional admissions. + +If a reservation has set the `nodeName` (inside the `template` field), the scheduler is responsible for checking if the node can fulfill the reservation since kubelet does not do admissions for the reservation. + +##### Allocate Reserved Resources + +Let's call the reservation is *allocatable* for a pod if: + +1. The reservation is available. +2. The pod matches the reservation owner spec. +3. There are sufficient free resources in the reservation to fulfill the pod. + +When the reservation plugin is enabled, the scheduler checks for every scheduling pod if there are allocatable reservations on a node. With a `Score` plugin implemented, the scheduler prefers pods to schedule on nodes which have more allocatable reserved resources. + +When a pod is scheduled on a node with allocatable reservations, it allocates resources belonging to one of reservations. To pick one of reservations, we choose the one which can get most reserved resources allocated (i.e. MostAllocated). And the scheduler also annotates the pod with the reservation info. + +##### Expiration and Cleanup + +When a reservation has been created for a long time exceeding the `TTL` or `Expires`, the scheduler updates its status as `Expired`. For expired reservations, the scheduler will cleanup them with a custom garbage collection period. + +When a node is deleted, the available and waiting reservations on the node should be marked as `Failed` since they are not allocatable any more. + +#### Use Cases + +To generally reserve node resources, submit a `Reservation` and set the pod template in the field `spec.template`. Then the koord-scheduler will update this `Reservation` with the scheduling result and the resources will get reserved. + +To be more specific, + +- `spec.template` specifies the fundamental resource requirements of a reservation. The scheduler will schedule the fake pod based on the template. +- `spec.owners` specifies which kinds of pods can use the reservation. +- `spec.ttl` and `expires` specifies the expiration for the reservation. +- `spec.preAllocation` indicates whether the scheduler should filter with its resource requirements. Otherwise, the pre-allocation of node resources is allowed, and the reservation will become available until there are sufficient resources. +- `status.phase` is marked as `Pending` when the Reservation is created. And it is marked as `Available` when the Reservation is successfully scheduled. +- `status.conditions` shows why the reservation is unscheduled or failed. +- When a Reservation is `Available` on the node, only specified pods can allocate the reserved resources. + +##### Usage in Preemption + +The [Priority Preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption) happens in the PostFilter phase trying to make preemptive pods schedulable by evicting low-priority pods. When a pod succeeds the preemption, the pod `status` will be patched with a *nominated node* where the scheduler do the eviction. However, the preemptor's nominated node is not always the same as the scheduled node, since the scheduler does not reserve resources for the preemptor. +To ensure the preemptive resources are for the preemptor, firstly the scheduler can create a reservation that both sets `owners` with the preemptor pod and relevant affinity rules for reserving resources of the preempts. 
Then the scheduler evict pods, and the reservation will become `Available` once the resources are released. Finally, the preemptor pods can get scheduled on the nodes with preemptive resource reserved. + +##### Usage in Descheduling + +Before a pod is rescheduled, the descheduler can create a reservation that sets `template` and `owners` for the candidate. When the reservation becomes `Available`, the descheduler can assign the pod to allocate the reserved resources. This solves the problem in which the rescheduled pod has stopped at the old node but cannot run on the new node. Moreover, the descheduler can migrate resources between pods by setting the `preAllocation` field. + +##### Usage in Pre-allocation + +Reservations with `preAllocation` specified allow users to pre-allocate the node resources from running pods. The `status.phase` of the reservation is set as `Waiting` until the resources are released, indicating that its availability is conditional. Once the referenced pods have terminated, the `phase` is `Available` for owners, and the pre-allocation succeeds. + +### Risks and Mitigations + +Kubelet without any modification possibly ignore `Reservation` objects in predicate admission, which increases the chance of unexpected overcommitment at nodes. `Reservation` does not require any physical resources to be executable, so the overcommitment is mainly a problem only when pods get scheduled with `Reservation` and start to run, which is somewhat easier to mitigate since Kubelet do admit these pods. To further descrease the possibility of unexpected overcommitment or pods admit failures, we could use resource estimation for in-flight pods, balance pods to the nodes with less reserved resources, etc. + +## Unsolved Problems + +As stated above, `Reservation` can generate the same pod affinity/anti-affinity rules as the owner pods. The problem gets resolved in the koord-scheduler by extending scheduling framework, but it still limits the standard kube-scheduler. + +## Alternatives + +### Use a `pause` pod with a low priority to reserve resources + +Reserving resources with [`pause` pods with very low assigned priority](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler) does work when the preemption can be precisely enabled for specific pods. In the example of cluster autoscaler, `pause` pods are helpful when we need to overprovision resources to prevent idle nodes from scaling down by CA. However, a `pause` pod has no reservation guarantee except `priority`. As declared above, many scenarios require reservations to rely on other pod characteristics (e.g. names, namespaces, labels, priorityClass), where `pause` pods cannot meet the demands. + +## References + +1. 
[Kueue Pod Resource Reservation](https://docs.google.com/document/d/1sbFUA_9qWtorJkcukNULr12FKX6lMvISiINxAURHNFo) diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/runtime-proxy.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/runtime-proxy.md new file mode 100644 index 000000000..ab26955e9 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/designs/runtime-proxy.md @@ -0,0 +1,136 @@ +# RuntimeProxy + +## 摘要 + +KoordRuntimeProxy 充当 Kubelet 和 Containerd 之间的代理( Dockershim 场景下是 Dockerd ),它用于拦截 CRI 请求,并应用一些资源管理策略, +如混合工作负载编排场景下按实例优先级设置不同的 cgroup 参数,针对最新的 Linux 内核、CPU 架构应用新的隔离策略等。 + +这里涉及两个组件,KoordRuntimeProxy 和 RuntimePlugins。 + +![image](/img/koord-runtime-proxy-architecture.svg) + +## 目标 + +- 增强基于 QoS 的调度的资源管理。 +- 为 CRI 不支持的新隔离特性提供接口。 + +## 组件 + +### KoordRuntimeProxy + +KoordRuntimeProxy 负责在 Pod 的生命周期内拦截请求,例如 RunPodSandbox、CreateContainer 等, +在请求从后端 Containerd(Dockerd) 到 Kubelet 之间的传输过程中,会调用 RuntimePlugins 做资源隔离策略。 +KoordRuntimeProxy 提供了一个隔离策略执行框架,允许注册的自定义插件执行指定的隔离策略,这些插件称为 RuntimePlugins。 KoordRuntimeProxy 本身不执行任何隔离策略。 + +### RuntimePlugins + +RuntimePlugins 将事件(如 RunPodSandbox 等)注册到 KoordRuntimeProxy 并在事件发生时接收通知。 +RuntimePlugins 应该根据通知消息完成资源隔离策略,然后响应给 KoordRuntimeProxy,KoordRuntimeProxy 将根据插件的响应决定将请求转移到后端 Containerd 或丢弃。 + +如果没有注册 RuntimePlugins,KoordRuntimeProxy 将成为 Kubelet 和 Containerd 之间的透明代理。 + +## 架构 + +![image](/img/koord-runtime-proxy-design.svg) + +KoordRounmeProxy 有4个主要组件。 + +### CRI Server + +KoordRuntimeProxy 作为 Kubelet 和 Containerd 之间的代理,充当 Kubelet 的 CRI 服务器(Dockershim 场景下的 Http 服务器)。它应该拦截来自 Kubelet 的所有请求,并在与后端 Containerd(Dockerd) 调用之前和之后生成与插件调用的协议。 + +### Plugins Manager + +PluginsManager 负责动态解析来自 `/etc/runtime/hookserver.d` 的插件注册信息。 + +### Runtime Dispatcher + +RuntimeDispatcher 旨在管理与插件的通信。 + +### Store + +作为代理,KoordRuntimeProxy 最好设计为无状态,但有时候现实并不完美。 +以 StartContainer hook 为例,CRI StartContainerRequest 中只有 ContainerID,这不足以让插件调整策略,因为插件可能不会在本地存储 Pod/Container 信息(如 Meta、Priority)。所以 KoordRuntimeProxy 应该在 RunPodSandbox/CreateContainer 阶段存储 Pod/Container 信息。当 StartContainer 请求到来时,KoordRuntimeProxy 可以通过 ContainerID 获取 Pod/Container 信息,然后使用 Pod/Container 信息调用插件。 + +有了 Store,每次 KoordRuntimeProxy 调用插件都会有 Pod/Container 信息,所以插件不需要特别存储 Pod/Container 信息,插件可以设计成无状态的。 + +考虑到性能,Store 位于内存中,不会产生外部 IO 到磁盘。 + +## Runtime Plugins + +### 如何注册插件 +所有的插件配置文件都应该放在 `/etc/runtime/hookserver.d` 并带有 `.json` 后缀。您可以使用 RuntimeProxy 注册 Koordlet 实现的插件: + +1. touch /etc/runtime/hookserver.d/koordlet.json +2. 将以下内容复制到 /etc/runtime/hookserver.d/koordlet.json +``` +{ + "remote-endpoint": "/var/run/koordlet/koordlet.sock", + "failure-policy": "Ignore", + "runtime-hooks": [ + "PreRunPodSandbox", + "PreCreateContainer", + "PreStartContainer" + ] +} +``` + + +涉及3个字段: +- remote-endpoint: KoordRuntimeProxy 与插件对话端点,由插件生成。 +- failure-policy: 调用插件失败时的策略,失败或忽略,默认为忽略。 +- runtime-hooks: 目前有7个钩点: + 1. PreRunPodSandbox + 2. PreCreateContainer + 3. PreStartContainer + 4. PostStartContainer + 5. PreUpdateContainerResources + 6. PostStopContainer + 7. 
PostStopPodSandbox + +带有前缀 “Pre” 的挂钩点表示在将请求传输到 Contianerd(Dockerd) 之前调用插件。带有前缀 “Post“ 的挂钩点意味着在收到来自 Containerd(Dockerd) 的响应后调用插件。插件提供者可以将任何钩子组合设置为“运行时钩子”。 + +### KoordRunmeProxy 和 Plugins 之间的协议 +[Protocols](https://github.com/koordinator-sh/koordinator/blob/main/apis/runtime/v1alpha1/api.proto) + +### Runtime Plugins 例子 +[koordlet-runtime-plugin-design](https://github.com/koordinator-sh/koordinator/blob/main/docs/design-archive/koordlet-runtime-hooks.md) + +## 安装 + +### 源代码安装 +获取源代码:`git clone https://github.com/koordinator-sh/koordinator.git` + +构建:`cd koordinator; make build-koord-runtime-proxy` + +### 包安装 +下载最新发布的程序包:`https://github.com/koordinator-sh/koordinator/releases` + +### 配置 Kubelet +在 Containerd 场景下,为了让 koord-runtime-proxy 成为 Kubelet 和 Containerd 之间的代理,Kubelet 的参数需要修改如下: +``` +kubelet --container-runtime=remote --container-runtime-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +在 Docker 场景下,为了让 koord-runtime-proxy 成为 Kubelet 和 Dockerd 之间的代理,Kubelet 的参数需要修改如下: +``` +kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +### 配置 KoordRuntimeProxy +首先,请确保您的运行时后端是 Containerd 或 Dockerd。 + +在 Containerd 场景下,koord-runtime-proxy 可以使用以下命令设置: +``` +koord-runtime-proxy --remote-runtime-service-endpoint= + --remote-image-service-endpoint= +``` +如果 Containerd 在默认 `/var/run/koord-runtimeproxy/runtimeproxy.sock` 上监听 CRI 请求,koord-runtime-proxy 可以通过以下方式设置: +``` +koord-runtime-proxy +``` + +在 Docker 场景下,koord-runtime-proxy 应该使用附加参数 `--backend-runtime-mode Docker` 设置,并且没有 `remote-image-service-endpoint`: +``` +koord-runtime-proxy --backend-runtime-mode=Docker --remote-runtime-service-endpoint= +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/installation.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/installation.md new file mode 100644 index 000000000..15a71e334 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/installation.md @@ -0,0 +1,231 @@ +# 安装 + +Koordinator 依赖 **Kubernetes version >= 1.18**。 + +Koordinator 需要从 kubelet 只读端口收集指标(默认设置为禁用)。 +更多信息 [here](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/). + +为了最好的体验,koordinator 推荐 **linux kernel 4.19** 或者更高版本。 + + +## 使用 Helm 安装 + +Koordinator 可以使用 Helm v3.5+ 安装, Helm 是一个简单的命令行工具,更多信息 [here](https://github.com/helm/helm/releases). + +```bash +# Firstly add koordinator charts repository if you haven't do this. +$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ + +# [Optional] +$ helm repo update + +# Install the latest version. +$ helm install koordinator koordinator-sh/koordinator --version 1.3.0 +``` + +## 使用 Helm 升级 + +```bash +# Firstly add koordinator charts repository if you haven't do this. +$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ + +# [Optional] +$ helm repo update + +# Upgrade the latest version. +$ helm upgrade koordinator koordinator-sh/koordinator --version 1.3.0 [--force] +``` + +注意: + +1. 升级前,为确保你了解新版本中的重大更改,你 **必须** 先阅读 [变更日志](https://github.com/koordinator-sh/koordinator/blob/master/CHANGELOG.md)。 +2. 
如果你想删除或者新增旧版本中的 Chart 参数,推荐在 `helm upgrade` 命令中添加参数 `--reset-values` 。否则,你应该使用 `--reuse-values` 参数来恢复上一个版本的值。 + +## 可选:手动下载 Charts + +如果你在生产环境中连接到 `https://koordinator-sh.github.io/charts/` 时遇到问题,你可能需要从 [此处](https://github.com/koordinator-sh/charts/releases) 手动下载 Charts 进行安装或升级。 + +```bash +$ helm install/upgrade koordinator /PATH/TO/CHART +``` + +## 启用 NRI 资源管理模式 + +### 前置条件 + +- Containerd >= 1.7.0 且配置启用 NRI。请确保 NRI 已在 containerd 中启用,否则请参考 [Enable NRI in Containerd](https://github.com/containerd/containerd/blob/main/docs/NRI.md)。 +- Koordinator >= 1.3 + +### 配置方式 + +NRI 资源管理模式是*默认启用*的。你无需修改 koordlet 配置就可以使用它,也可以通过设置 `enable-nri-runtime-hook=false` 的 koordlet 启动参数来禁用它。当它的前置条件不满足时,启用也不会影响其他功能。 + +## 安装 koord-runtime-proxy + +koord-runtime-proxy 充当 Kubelet 和 Containerd 之间的代理(Dockershim 场景下的 Dockerd),旨在拦截 CRI 请求, 并应用一些资源管理策略,比如在混合工作负载编排场景下通过 Pod 优先级设置不同的 CGroup 参数,为最新的 Linux 内核应用新的隔离策略, CPU 架构等等。 + +### 1、下载二进制文件 + +从 Github 下载: +```bash +$ # select the version +$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.3.0/koord-runtime-proxy_1.3.0.linux_x86_64 -O koord-runtime-proxy +$ chmod +x koord-runtime-proxy +``` + +或者,你可以从源代码开始构建: +```bash +$ git clone https://github.com/koordinator-sh/koordinator.git +$ cd koordinator +$ make build-koord-runtime-proxy +``` + +### 2、设置 koord-runtime-proxy + +首先,请确保你的运行时后端是 Containerd 或 Dockerd。 + +在 Containerd 场景下,如果 Containerd 在默认的 `/var/run/containerd/containerd.sock` 监听 CRI 请求,koord-runtime-proxy 可以这样设置(无需任何参数): + +``` +koord-runtime-proxy +``` + +或者使用以下命令设置: + +``` +koord-runtime-proxy \ + --remote-runtime-service-endpoint= \ + --remote-image-service-endpoint= +``` + +在 Docker 的场景下,koord-runtime-proxy 应该使用附加参数设置 `--backend-runtime-mode Docker`,无需 `remote-image-service-endpoint`: + +``` +koord-runtime-proxy \ + --backend-runtime-mode=Docker \ + --remote-runtime-service-endpoint= +``` + +koord-runtime-proxy 将监听 `/var/run/koord-runtimeproxy/runtimeproxy.sock`。 + +### 3、设置 Kubelet + +要使 koord-runtime-proxy 成为 Kubelet 和 Containerd 之间的代理,应修改 Kubelet 参数,如下所示: + +``` +kubelet \ + --container-runtime=remote \ + --container-runtime-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +在 Docker 的场景下, 应修改 Kubelet 参数如下: + +``` +kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + + +## 可选 + +请注意,直接安装这个 Chart 意味着使用 Koordinator 的默认模板值。 + +如果将其部署到生产集群中,或者你想要配置 `feature-gates`,你可能需要设置特定配置。 + +### 可选: Chart 参数 + +下表列出了 Chart 可配置参数及其默认值。 + +| Parameter | Description | Default | +| ----------------------------------------- | ---------------------------------------------------------------- |---------------------------------| +| `featureGates` | Feature gates for Koordinator, empty string means all by default | ` ` | +| `installation.namespace` | namespace for Koordinator installation | `koordinator-system` | +| `installation.createNamespace` | Whether to create the installation.namespace | `true` | +| `imageRepositoryHost` | Image repository host | `ghcr.io` | +| `manager.log.level` | Log level that koord-manager printed | `4` | +| `manager.replicas` | Replicas of koord-manager deployment | `2` | +| `manager.image.repository` | Repository for koord-manager image | `koordinatorsh/koord-manager` | +| `manager.image.tag` | Tag for koord-manager image | `v1.3.0` | +| `manager.resources.limits.cpu` | CPU resource limit of koord-manager container | `1000m` | +| `manager.resources.limits.memory` | Memory resource limit of koord-manager container | `1Gi` | +| `manager.resources.requests.cpu` | CPU resource 
request of koord-manager container | `500m` | +| `manager.resources.requests.memory` | Memory resource request of koord-manager container | `256Mi` | +| `manager.metrics.port` | Port of metrics served | `8080` | +| `manager.webhook.port` | Port of webhook served | `9443` | +| `manager.nodeAffinity` | Node affinity policy for koord-manager pod | `{}` | +| `manager.nodeSelector` | Node labels for koord-manager pod | `{}` | +| `manager.tolerations` | Tolerations for koord-manager pod | `[]` | +| `manager.resyncPeriod` | Resync period of informer koord-manager, defaults no resync | `0` | +| `manager.hostNetwork` | Whether koord-manager pod should run with hostnetwork | `false` | +| `scheduler.log.level` | Log level that koord-scheduler printed | `4` | +| `scheduler.replicas` | Replicas of koord-scheduler deployment | `2` | +| `scheduler.image.repository` | Repository for koord-scheduler image | `koordinatorsh/koord-scheduler` | +| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.3.0` | +| `scheduler.resources.limits.cpu` | CPU resource limit of koord-scheduler container | `1000m` | +| `scheduler.resources.limits.memory` | Memory resource limit of koord-scheduler container | `1Gi` | +| `scheduler.resources.requests.cpu` | CPU resource request of koord-scheduler container | `500m` | +| `scheduler.resources.requests.memory` | Memory resource request of koord-scheduler container | `256Mi` | +| `scheduler.port` | Port of metrics served | `10251` | +| `scheduler.nodeAffinity` | Node affinity policy for koord-scheduler pod | `{}` | +| `scheduler.nodeSelector` | Node labels for koord-scheduler pod | `{}` | +| `scheduler.tolerations` | Tolerations for koord-scheduler pod | `[]` | +| `scheduler.hostNetwork` | Whether koord-scheduler pod should run with hostnetwork | `false` | +| `koordlet.log.level` | Log level that koordlet printed | `4` | +| `koordlet.image.repository` | Repository for koordlet image | `koordinatorsh/koordlet` | +| `koordlet.image.tag` | Tag for koordlet image | `v1.3.0` | +| `koordlet.resources.limits.cpu` | CPU resource limit of koordlet container | `500m` | +| `koordlet.resources.limits.memory` | Memory resource limit of koordlet container | `256Mi` | +| `koordlet.resources.requests.cpu` | CPU resource request of koordlet container | `0` | +| `koordlet.resources.requests.memory` | Memory resource request of koordlet container | `0` | +| `koordlet.enableServiceMonitor` | Whether to enable ServiceMonitor for koordlet | `false` | +| `webhookConfiguration.failurePolicy.pods` | The failurePolicy for pods in mutating webhook configuration | `Ignore` | +| `webhookConfiguration.timeoutSeconds` | The timeoutSeconds for all webhook configuration | `30` | +| `crds.managed` | Koordinator will not install CRDs with chart if this is false | `true` | +| `imagePullSecrets` | The list of image pull secrets for koordinator image | `false` | + +使用 `helm install` 或 `helm upgrade` 的 `--set key=value[,key=value]` 参数指定每个参数。 + +### 可选: feature-gate + +Feature-Gate 控制 Koordinator 中的一些有影响力的功能: + +| Name | Description | Default | Effect (if closed) | +| ------------------------- | ---------------------------------------------------------------- | ------- | -------------------------------------- | +| `PodMutatingWebhook` | Whether to open a mutating webhook for Pod **create** | `true` | Don't inject koordinator.sh/qosClass, koordinator.sh/priority and don't replace koordinator extend resources ad so on | +| `PodValidatingWebhook` | Whether to open a validating webhook for Pod **create/update** | 
`true` | It is possible to create some Pods that do not conform to the Koordinator specification, causing some unpredictable problems | + + +如果要配置 feature-gate ,只需在安装或升级时设置参数即可。如: + +```bash +$ helm install koordinator https://... --set featureGates="PodMutatingWebhook=true\,PodValidatingWebhook=true" +``` + +如果要启用所有 feature-gates ,请将参数设置为 `featureGates=AllAlpha=true` 。 + +### 可选: 中国本地镜像 + +如果你在中国并且无法从官方 DockerHub 拉取镜像,你可以使用托管在阿里云上的镜像仓库: + +```bash +$ helm install koordinator https://... --set imageRepositoryHost=registry.cn-beijing.aliyuncs.com +``` + +## 最佳实践 + +### AWS EKS 的安装参数 + +在 EKS 上使用自定义 CNI(例如 Weave 或 Calico)时,默认情况下无法访问 webhook。发生这种情况是因为在 EKS 上控制平面无法配置运行自定义 CNI ,因此控制平面和工作节点之间的 CNI 不同。 + +为了解决这个问题,使用 helm install 或 upgrade 时设置 `--set manager.hostNetwork=true`,webhook 可以在主机网络中运行。 + +## 卸载 + +请注意,这将导致 Koordinator 创建的所有资源,包括 Webhook 配置、Services、Namespace、CRD 和由 Koordinator 控制器管理的 CR 实例,都被删除! +请在充分了解后果的情况下才这样做。 + +卸载通过 Chart 安装的 Koordinator : + +```bash +$ helm uninstall koordinator +release "koordinator" uninstalled +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/introduction.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/introduction.md new file mode 100644 index 000000000..784b88edd --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/introduction.md @@ -0,0 +1,48 @@ +--- +title: 简介 +slug: / +--- + +# 简介 + +欢迎来到 Koordinator! + +## 概述 + +Koordinator 是一个基于 QoS 的 Kubernetes 混合工作负载调度系统。它旨在提高对延迟敏感的工作负载和批处理作业的运行时效率和可靠性,简化与资源相关的配置调整的复杂性,并增加 Pod 部署密度以提高资源利用率。 + + +## 关键特性 + +Koordinator 通过提供以下功能增强了在 Kubernetes 中管理工作负载的用户体验: + +- 精心设计的优先级和 QoS 机制,可将不同类型的工作负载混跑在集群中,并在单个节点上运行不同类型的 Pod 。 +- 允许资源超卖以实现高资源利用率,但仍通过利用应用程序分析机制来满足 QoS 保证。 +- 细粒度的资源协调和隔离机制,以提高延迟敏感的工作负载和批处理作业的效率。 +- 灵活的作业调度机制,支持特定领域的工作负载,例如大数据、人工智能、音频和视频。 +- 一整套用于监控、故障排除和操作的工具。 + + +## Koordinator vs 其他概念 + +### Koordinator QoS vs Kubernetes QoS + +Kubernetes 提供三种类型的 QoS: Guaranteed/Burstable/BestEffort,其中 Guaranteed/Burstable 被广泛使用 BestEffort 很少使用。Koordinator 与 Kubernetes QoS 兼容,并且对每种类型都有许多增强功能。为了避免干扰原生 QoS 语义,Koordinator 引入了一个独立的字段 `koordinator.sh/qosClass` 来描述混部 QoS。该 QoS 描述了在混部场景中节点上运行的 Pod 的服务质量。它是混合系统最关键的语义。 + +Koordinator 与 Kubernetes QoS 兼容,并且对每种类型都有许多增强功能。 + +### Koordinator scheduler vs kube-scheduler + +Koordinator 调度器并非旨在取代 kube-scheduler,而是为了让混部的工作负载在 kubernetes 上运行得 **更好**。 + +Koordinator 调度器是基于 schedule-framework 开发的,在原生调度能力之上增加了与混部和优先级抢占相关的调度插件。Koordinator 将致力于推动相关的增强进入 Kubernetes 的上游社区,推动混部技术的标准化。 + + +## 接下来 + +推荐后续步骤: + +- 开始 [安装 Koordinator ](./installation). +- 学习 Koordinator 的 [架构](architecture/overview). + + diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/colocation-profile.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/colocation-profile.md new file mode 100644 index 000000000..6ae88d637 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/colocation-profile.md @@ -0,0 +1,136 @@ +--- +sidebar_position: 1 +--- + +# Colocation Profile + +## Motivation + +如果现有集群中的工作负载想要通过 Koordinator 进行混合部署,则需要修改现有的 Controller/Operator 以支持 Koordinator 定义的 QoS Class、优先级和资源模型等协议。为了降低 Koordinator 混部系统的使用门槛,让大家可以简单快速的使用混部技术获得收益,因此 Koordinator 提供了一个 `ClusterColocationProfile` CRD 和 对应的 Webhook 修改和验证新创建的 Pod,注入 `ClusterColocationProfile` 中描述的字段。 + + +## 构架 + +![image](/img/clustercolocationprofile-arch.png) + +## Feature Gates + +ClusterColocationProfile mutating/validating 功能默认是打开的,如果想要关闭,请设置 Feature Gates: + +```bash +$ helm install koordinator https://... 
--set featureGates="PodMutatingWebhook=false\,PodValidatingWebhook=false" +``` + + +## 规格定义 + +如果您对 Kubernetes 资源不熟悉,请参考页面 [了解 Kubernetes 对象](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/)。 + +- **namespaceSelector**: 如果命名空间与选择器匹配,则决定是否改变/验证 Pod。 LabelSelector 默认为空,它将匹配所有 Namespace。 + +- **selector**: 如果 Pod 与选择器匹配,则决定是否改变/验证 Pod。 默认为空的 LabelSelector,它将匹配所有 Pod。 + +- **qosClass** (*required*): 描述了 Pod 的 Koordinator QoSClass。该值以标签 `koordinator.sh/qosClass` 的形式更新到 Pod 中。对应的选项为 `LSE`、`LSR`、`LS`、`BE` 和 `SYSTEM`。 有关更多信息,请查看页面[此处](../architecture/qos)。 + +- **priorityClassName** (*required*): 指定要写入到 Pod.Spec.PriorityClassName 中的 Kubenretes PriorityClass. 选项为 `koord-prod`、`koord-mid`、`koord-batch` 和 `koord-free`。有关更多信息,请查看 [此处](../architecture/priority)。 + +- **koordinatorPriority**: Koordinator 还提供了 Pod 级别的子优先级 sub-priority。 优先级值将作为标签 `koordinator.sh/priority` 更新到 Pod。 各个 Koordinator 组件通过 KoordinatorPriority 和 PriorityClassName 中的优先级值来确定 Koordinator 中 Pod 的优先级,值越高,优先级越高。 + +- **labels**: 描述需要注入到 `Pod.Labels` 的 k/v 键值对。 + +- **annotations**: 描述了需要注入到 `Pod.Annotations` 的 k/v 键值对。 + +- **schedulerName**: 如果指定,则 Pod 将由指定的调度器调度。 + +- **patch**: 表示用户想要注入 Pod 的 Pod 模板补丁。 + + +## 例子 + +### 创建 ClusterColocationProfile + +下面的 `profile.yaml` 文件描述了对所有含有标签 `koordinator.sh/enable-colocation=true` 的 Namespace 下的所有含有标签 `koordinator.sh/enable-colocation=true` 的 Pod 进行修改,注入 Koordinator QoSClass、Koordinator Priority 等。 + +```yaml +apiVersion: config.koordinator.sh/v1alpha1 +kind: ClusterColocationProfile +metadata: + name: colocation-profile-example +spec: + namespaceSelector: + matchLabels: + koordinator.sh/enable-colocation: "true" + selector: + matchLabels: + koordinator.sh/enable-colocation: "true" + qosClass: BE + priorityClassName: koord-batch + koordinatorPriority: 1000 + schedulerName: koord-scheduler + labels: + koordinator.sh/mutated: "true" + annotations: + koordinator.sh/intercepted: "true" + patch: + spec: + terminationGracePeriodSeconds: 30 +``` + +基于 YAML 文件创建 ClusterColocationProfile: + +```bash +$ kubectl apply -f profile.yaml +``` + +### 验证 ClusterColocationProfile 是否生效 + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + koordinator.sh/enable-colocation: "true" + name: test-pod +spec: + containers: + - name: app + image: nginx:1.15.1 + resources: + limits: + cpu: "1" + memory: "3456Mi" + requests: + cpu: "1" + memory: "3456Mi" +``` + +创建这个 Pod,现在你会发现该 Pod 被注入了 Koordinator QoSClass、Koordinator Priority 等。 + +```bash +$ kubectl get pod test-pod -o yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + koordinator.sh/intercepted: true + labels: + koordinator.sh/qosClass: BE + koordinator.sh/priority: 1000 + koordinator.sh/mutated: true + ... 
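+  # 注:以上 labels/annotations 由 ClusterColocationProfile webhook 注入;
+  # 下方 spec 中原生的 cpu/memory 资源已被替换为 kubernetes.io/batch-cpu、kubernetes.io/batch-memory 扩展资源。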
+spec: + terminationGracePeriodSeconds: 30 + priority: 5000 + priorityClassName: koord-batch + schedulerName: koord-scheduler + containers: + - name: app + image: nginx:1.15.1 + resources: + limits: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + requests: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi +``` \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-burst.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-burst.md new file mode 100644 index 000000000..315ab8661 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-burst.md @@ -0,0 +1,197 @@ +# CPU Burst + +## Introduction + +CPU Burst is a service level objective (SLO)-aware resource scheduling feature provided by Koordinator. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. The koordlet component automatically detects CPU throttling events and automatically adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications. + +### How CPU Burst works + +Kubernetes allows you to specify CPU limits, which can be reused based on time-sharing. If you specify a CPU limit for a container, the OS limits the amount of CPU resources that can be used by the container within a specific time period. For example, you set the CPU limit of a container to 2. The OS kernel limits the CPU time slices that the container can use to 200 milliseconds within each 100-millisecond period. + +CPU utilization is a key metric that is used to evaluate the performance of a container. In most cases, the CPU limit is specified based on CPU utilization. CPU utilization on a per-millisecond basis shows more spikes than on a per-second basis. If the CPU utilization of a container reaches the limit within a 100-millisecond period, CPU throttling is enforced by the OS kernel and threads in the container are suspended for the rest of the time period, as shown in the following figure. + +![image](/img/cpu-throttles.png) + +The following figure shows the thread allocation of a web application container that runs on a node with four vCPUs. The CPU limit of the container is set to 2. The overall CPU utilization within the last second is low. However, Thread 2 cannot be resumed until the third 100-millisecond period starts because CPU throttling is enforced somewhere in the second 100-millisecond period. This increases the response time (RT) and causes long-tail latency problems in containers. + +![image](/img/cpu-throttles-1.png) + +Upstream Linux kernel >=5.14 and Anolis OS both provide [Burstable CFS Controller](https://github.com/torvalds/linux/commit/f4183717b370ad28dd0c0d74760142b20e6e7931#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73e), namely *CPU Burst* feature. It allows a container to accumulate CPU time slices when the container is idle. The container can use the accumulated CPU time slices to burst above the CPU limit when CPU utilization spikes. This improves performance and reduces the RT of the container. + +![image](/img/cpu-throttles-2.png) + +For kernel versions that do not support CPU Burst, koordlet detects CPU throttling events and dynamically adjusts the CPU limit to achieve the same effect as CPU Burst. 
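+
+These throttling events and the burst state can also be inspected manually from the pod's CPU cgroup. The following is a minimal sketch for cgroup v1; the slice path is a placeholder and depends on the pod's QoS class and UID.
+
+```bash
+# Placeholder path: the real slice name contains the pod's QoS class and UID.
+POD_CGROUP=/sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod<pod-uid>.slice
+
+# nr_throttled / nr_periods shows how often the container hit its CFS quota;
+# throttled_time is the accumulated throttled time in nanoseconds.
+cat ${POD_CGROUP}/cpu.stat
+
+# On kernels with CPU Burst support (upstream >= 5.14 or Anolis OS), a non-zero
+# cfs_burst_us means the container is allowed to burst above its quota.
+cat ${POD_CGROUP}/cpu.cfs_burst_us
+```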
+ +For more information about CPU Burst, see the presentation at KubeCon 2021: [CPU Burst: Getting Rid of Unnecessary Throttling, Achieving High CPU Utilization and Application Performance at the Same Time](https://kccncosschn21.sched.com/event/pcdF?spm=a2c63.p38356.0.0.2ec3731dhQbCIe&iframe=no). + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled CPU Burst feature (`-feature-gates=AllAlpha=true`). If not, please enable it manually by updating the feature gate in the koordlet daemonset. + +NOTE: CPU Burst is not available for `LSR` and `BE` pods since it targets on burstable cpu usages. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CPUBurst=true # enable CPU Burst feature + ... +``` + +## Use CPU Burst + +### Use an annotation to enable CPU Burst for the pod + +Add the following annotation to the pod configuration to enable CPU Burst: + +```yaml +apiVersion: apps/v1 +kind: Pod +metadata: + name: demo-pod-xxx + annotations: + # Set the value to auto to enable CPU Burst for the pod. + koordinator.sh/cpuBurst: '{"policy": "auto"}' + # To disable CPU Burst for the pod, set the value to none. + #koordinator.sh/cpuBurst: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable CPU Burst for all pods in a cluster + +Modify the slo-controller-config ConfigMap based on the following content to enable CPU Burst for all pods in a cluster: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + cpu-burst-config: '{"clusterStrategy": {"policy": "auto"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "cpuBurstOnly"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "none"}}' +``` + +### (Optional) Advanced Settings + +The following code block shows the pod annotations and ConfigMap fields that you can use for advanced configurations: + +```yaml +# Example of the slo-controller-config ConfigMap. +data: + cpu-burst-config: | + { + "clusterStrategy": { + "policy": "auto", + "cpuBurstPercent": 1000, + "cfsQuotaBurstPercent": 300, + "sharePoolThresholdPercent": 50, + "cfsQuotaBurstPeriodSeconds": -1 + } + } + + # Example of pod annotations. + koordinator.sh/cpuBurst: '{"policy": "auto", "cpuBurstPercent": 1000, "cfsQuotaBurstPercent": 300, "cfsQuotaBurstPeriodSeconds": -1}' +``` + +The following table describes the ConfigMap fields that you can use for advanced configurations of CPU Burst. + +| Field | Data type | Description | +| ---------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| policy | string |
  • none: disables CPU Burst. If you set the value to none, the related fields are reset to their original values. This is the default value.
  • cpuBurstOnly: enables the CPU Burst feature only for the kernel of Anolis OS or upstream linux kernel >= 5.14.
  • cfsQuotaBurstOnly: enables automatic adjustment of CFS quotas of general kernel versions.
  • auto: enables CPU Burst and all the related features.
| +| cpuBurstPercent | int | Default value:`1000`. Unit: %. This field specifies the percentage to which the CPU limit can be increased by CPU Burst. If the CPU limit is set to `1`, CPU Burst can increase the limit to 10 by default. | +| cfsQuotaBurstPercent | int | Default value:`300`. Unit: %. This field specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. By default, the value of cfs_quota can be increased to at most three times. | +| cfsQuotaBurstPeriodSeconds | int | Default value:`-1`. Unit: seconds. This indicates that the time period in which the container can run with an increased CFS quota is unlimited. This field specifies the time period in which the container can run with an increased CFS quota, which cannot exceed the upper limit specified by `cfsQuotaBurstPercent`. | +| sharePoolThresholdPercent | int | Default value:`50`. Unit: %. This field specifies the CPU utilization threshold of the node. If the CPU utilization of the node exceeds the threshold, the value of cfs_quota in cgroup parameters is reset to the original value. | + +### Verify CPU Burst + +1. Use the following YAML template to create an apache-demo.yaml file. + +> To enable CPU Burst for a pod, specify an annotation in the annotations parameter of the metadata section of the pod configuration. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: apache-demo + annotations: + koordinator.sh/cpuBurst: '{"policy": "auto"}' # Use this annotation to enable or disable CPU Burst. +spec: + containers: + - command: + - httpd + - -D + - FOREGROUND + image: koordinatorsh/apache-2-4-51-for-slo-test:v0.1 + imagePullPolicy: Always + name: apache + resources: + limits: + cpu: "4" + memory: 10Gi + requests: + cpu: "4" + memory: 10Gi + nodeName: # $nodeName Set the value to the name of the node that you use. + hostNetwork: False + restartPolicy: Never + schedulerName: default-scheduler +``` + +2. Run the following command to create an application by using Apache HTTP Server. + +```bash +kubectl apply -f apache-demo.yaml +``` + +3. Use the wrk2 tool to perform stress tests. + +```bash +# Download, decompress, and then install the wrk2 package. +# The Gzip module is enabled in the configuration of the Apache application. The Gzip module is used to simulate the logic of processing requests on the server. +# Run the following command to send requests. Replace the IP address in the command with the IP address of the application. +./wrk -H "Accept-Encoding: deflate, gzip" -t 2 -c 12 -d 120 --latency --timeout 2s -R 24 http://$target_ip_address:8010/static/file.1m.test +``` + +4. Check the results of CPU Burst enabled and disabled. + +e.g. We may have the following results: + +| CentOS 7 | Disabled | Enabled | +| ----------------------------- | ----------- | ------------------- | +| apache RT-p99 | 111.69 ms | 71.30 ms (-36.2%) | +| CPU Throttled Ratio | 33% | 0% | +| Average pod CPU utilization | 32.5% | 33.8% | + +The preceding metrics indicate the following information: + +- After CPU Burst is enabled, the P99 latency of apache is greatly reduced. +- After CPU Burst is enabled, CPU throttling is stopped and the average pod CPU utilization remains approximately at the same value. 
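+
+Besides the RT metrics above, you can roughly confirm the effect at the cgroup level on the node. The snippet below is a sketch; the cgroup path is a placeholder that depends on the apache pod's UID, and the expected values follow the default advanced settings described earlier.
+
+```bash
+# Placeholder: locate the apache pod's CPU cgroup on the node (the slice name contains the pod UID).
+POD_CGROUP=/sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod<apache-pod-uid>.slice
+
+cat ${POD_CGROUP}/cpu.cfs_period_us  # typically 100000 (100 ms)
+cat ${POD_CGROUP}/cpu.cfs_quota_us   # 400000 for cpu: "4"; while koordlet compensates throttling,
+                                     # it may be raised up to cfsQuotaBurstPercent (300% => 1200000)
+```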
diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-evict.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-evict.md new file mode 100644 index 000000000..1e4f7b562 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-evict.md @@ -0,0 +1,135 @@ +# 基于CPU资源满足度的驱逐策略 + +## 简介 + +Koordinator提供了CPU的[动态压制能力](/docs/user-manuals/cpu-suppress),在混部场景下可以根据高优先级Pod(LS)的资源用量情况, +动态调整低优先级Pod(BE)可以使用的CPU资源上限,当LS Pod的资源用量上升时,koordlet将缩减BE Pod可使用的CPU核心。然而,当LS Pod负载突增时, +可能会导致大量BE Pod被压制在少量CPU上,使得这部分Pod的资源满足度较低,应用运行及其缓慢,甚至额外引入一些内核资源的竞争。 + +事实上,大部分BE Pod的离线任务都有较好的重试能力,可以接受一定程度的驱逐而换取更高的资源质量。Koordlet提供了基于CPU资源满足度的驱逐策略, +计算被压制部分的CPU利用率和资源满足度,当利用率和资源满足度同时超过配置的阈值时,会依次按更低优先级、更高的Pod CPU利用率对BE Pod进行驱逐, +直至CPU资源满足度恢复到阈值以上。 + +![image](/img/cpu-evict.svg) + +## 使用限制 +请确保Koordinator已正确安装在你的集群中。若未安装,请参考[安装文档](https://koordinator.sh/docs/installation)。 +该功能需开启Batch资源动态超卖,并和CPU动态压制能力配合使用,请参考[使用文档](/docs/user-manuals/cpu-suppress)。所需的版本要求情况如下: + +| 组件 | 版本要求 | +| --- | ------- | +| Kubernetes | ≥v1.18 | +| koordinator | ≥v0.4.0 | + +该功能由单机组件Koordlet提供,对应的feature-gate默认关闭,使用前请确保koordlet的启动参数`-feature-gates`中已经添加了`BECPUEvict=true`, +详见[参考示例](https://github.com/koordinator-sh/charts/blob/main/versions/v1.2.0/templates/koordlet.yaml#L36)。 + +## 操作步骤 + +1. 使用以下ConfigMap,创建configmap.yaml文件 + ```yaml + #ConfigMap slo-controller-config 样例。 + apiVersion: v1 + kind: ConfigMap + metadata: + name: slo-controller-config # 以koord-manager实际配置的名字为准,例如ack-slo-config + namespace: koordinator-system # 命名空间以环境中实际安装的情况为准,例如kube-system + data: + # 开启基于CPU资源满足度的驱逐功能。 + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuEvictBESatisfactionLowerPercent": 60, + "cpuEvictBESatisfactionUpperPercent": 80, + "cpuEvictBEUsageThresholdPercent": 90, + "CPUEvictTimeWindowSeconds": 60 + } + } + ``` + + | 参数 | 类型 | 取值范围 | 说明 | + | :-------------- | :------ | :-------- | :----------------------------------------------------------- | + | `enable` | Boolean | true; false | true:集群全局开启CPU资源满足度的驱逐策略。false(默认值):集群全局关闭策略。 | + | `cpuEvictBESatisfactionLowerPercent` | Int | 0~60 | BE CPU资源满足度的驱逐阈值,低于该值时将触发对BE Pod的驱逐。 | + | `cpuEvictBESatisfactionUpperPercent` | Int | cpuEvictBESatisfactionLowerPercent~100 | BE CPU资源满足度的安全阈值,高于该值时将停止对BE Pod的驱逐。 | + | `cpuEvictBEUsageThresholdPercent` | Int | 0~100 | BE CPU利用率阈值,当BE Pod在CPU被压制范围内的利用率高于该值时,才会触发驱逐,默认值为90。 | + | `cpuEvictTimeWindowSeconds` | Int | >=2 | CPU资源满足度和BE CPU利用率计算的时间窗口,单位为秒 | + +2. 查看安装的命名空间下是否存在ConfigMap,以命名空间`koordinator-system`和ConfigMap名字`slo-controller-config`为例,具体以实际安装配置为准。 + + - 若存在ConfigMap `slo-controller-config`,请使用PATCH方式进行更新,避免干扰ConfigMap中其他配置项。 + + ```bash + kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" + ``` + + - 若不存在ConfigMap `slo-controller-config`,请执行以下命令进行创建ConfigMap。 + + ```bash + kubectl apply -f configmap.yaml + ``` + +3. 使用以下YAML内容,创建be-pod-demo.yaml文件。 + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: be-pod-demo + labels: + koordinator.sh/qosClass: 'BE' #指定Pod的QoS级别为BE。 + spec: + containers: + - args: + - '-c' + - '4' + - '--vm' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + kubernetes.io/batch-cpu: 4k + kubernetes.io/batch-memory: 4Gi + requests: + kubernetes.io/batch-cpu: 4k + kubernetes.io/batch-memory: 4Gi + restartPolicy: Always + schedulerName: default-scheduler + ``` + +4. 
执行以下命令,将be-pod-demo部署到集群。 + + ```bash + $ kubectl apply -f be-pod-demo.yaml + ``` + +5. 执行以下命令,查看be-pod-demo状态,等待Pod启动完成。 + + ```bash + $ kubectl get pod be-pod-demo + NAME READY STATUS RESTARTS AGE + be-pod-demo 1/1 Running 0 7s + ``` + +6. 在节点执行以下命令,使用[stress工具](https://linux.die.net/man/1/stress)启动进程, +确保整机内存资源用量被提升到驱逐水位以上,其中`--cpu`参数表示stress进程占用的CPU资源量10核,测试时可根据实际机型情况进行调整。 + + ```bash + $ stress --cpu 10 --vm 1 + ``` +7. 观察be-pod-demo运行情况,可以发现be-pod-demo已经不存在,驱逐信息可以通过event查看到。 + + ```bash + $ kubectl get pod be-pod-demo + Error from server (NotFound): pods "be-pod-demo" not found + + $ kubectl get event + LAST SEEN TYPE REASON OBJECT MESSAGE + 44s Normal Killing pod/be-pod-demo Stopping container stress + 44s Warning evictPodSuccess ${your-pod-object} evict Pod:be-pod-demo, reason: EvictPodByBECPUSatisfaction, message: killAndEvictBEPodsRelease for node(${your-node-id}), need realase CPU : 1200 + ``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-qos.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-qos.md new file mode 100644 index 000000000..2d844c504 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-qos.md @@ -0,0 +1,188 @@ +# CPU QoS + +## 简介 + +Kubernetes支持将多种类型的应用以容器化的方式部署在同一台宿主机上运行,不同优先级的应用可能会竞争CPU资源,导致应用服务受损。Koordinator支持基于容器的QoS等级,优先保障高优先级应用的CPU性能。本文介绍如何使用容器CPU QoS功能。 + +## 背景 + +为了充分利用机器中的资源,通常会将高优先延迟敏感性LS(Latency-Sensitive)和低优先级BE(Best-Effort)的任务部署在同一台机器上,导致两种不同优先级任务之间存在资源竞争问题。Kubernetes根据应用的CPU Request/Limit,为容器设置物理资源限制,但仍存在容器间对CPU资源的竞争。例如,BE应用和LS应用共享物理核或逻辑核时,当BE应用负载较高时,会干扰LS应用的运行,导致服务响应延迟变高。 + +为了提高LS应用使用CPU资源的稳定性,降低BE应用的干扰,Koordinator基于Alibaba Cloud Linux 2和Anolis OS,提供了容器CPU QoS功能。Koordinator基于Group Identity提供的Linux调度优先级,差异化保障不同优先级应用的CPU调度,将LS应用标识为高优,BE应用标识为低优,在混合部署场景中有效改善LS应用的服务质量。更多信息,请参见[Group Identity功能说明](https://help.aliyun.com/document_detail/338407.htm#task-2129392)。 + +通过启用CPU QoS功能,您可以获取以下功能特性: + +- LS应用的任务唤醒延迟最小化。 +- BE应用的任务唤醒不会对LS容器造成性能影响。 +- BE应用的任务不会通过同时多线程SMT(Simultaneous MultiThreading)调度器共享物理核而对LS应用造成性能影响。 + +## 设置 + +### 前提条件 + +- Kubernetes >= 1.18 + +- Koordinator >= 0.4 + +- 操作系统: + + - Alibaba Cloud Linux 2(版本号详情,请参见[Group Identity功能说明](https://help.aliyun.com/document_detail/338407.htm#task-2129392)) + + - Anolis OS >= 8.6 + - CentOS 7.9 (需要安装龙蜥社区的 CPU 混部调度器插件,请参阅[最佳实践](../best-practices/anolis_plugsched.md)) + +### 安装 + +请确保Koordinator组件已正确安装在你的集群中。如果没有,请参考[安装文档](https://koordinator.sh/docs/installation)。 + +## 使用CPU QoS + +1. 
使用以下ConfigMap,创建configmap.yaml文件。 + + ```yaml + #ConfigMap slo-controller-config 样例。 + apiVersion: v1 + kind: ConfigMap + metadata: + name: slo-controller-config + namespace: koordinator-system + data: + #开启容器CPU QoS功能。 + resource-qos-config: | + { + "clusterStrategy": { + "lsClass": { + "cpuQOS": { + "enable": true, + "groupIdentity": 2 + } + }, + "beClass": { + "cpuQOS": { + "enable": true, + "groupIdentity": -1 + } + } + } + } + ``` + + `lsClass`、`beClass`分别用于配置QoS等级为LS、BE的Pod,`cpuQOS`用于配置容器CPU QoS功能。关键参数说明如下: + +| 参数 | 类型 | 取值范围 | 说明 | +| :-------------- | :------ | :-------- | :----------------------------------------------------------- | +| `enable` | Boolean | truefalse | true:集群全局开启容器CPU QoS功能。false:集群全局关闭容器CPU QoS功能。 | +| `groupIdentity` | Int | -1~2 | 表示CPU Group Identity的优先级。默认值依据QoS,LS对应2,BE对应-1。0表示关闭。`groupIdentity`值越大,表示容器在内核调度的优先级越高。例如,按默认配置,QoS等级为LS的容器Group Identity接口配置为`cpu.bvt_warp_ns=2`,BE容器配置为`cpu.bvt_warp_ns=-1`。更多信息,请参见[Group Identity功能说明](https://help.aliyun.com/document_detail/338407.htm#task-2129392)。 | + + + **说明** 对于未指定`koordinator.sh/qosClass`的Pod,Koordinator将参考Pod原生的QoSClass来设置参数,其中Besteffort使用ConfigMap中BE的配置,其他QoSClass使用ConfigMap中LS的配置。 + +2. 查看命名空间`koordinator-system`下是否存在ConfigMap `slo-controller-config`。 + + - 若存在ConfigMap `slo-controller-config`,请使用PATCH方式进行更新,避免干扰ConfigMap中其他配置项。 + + ```bash + kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" + ``` + + - 若不存在ConfigMap `slo-controller-config`,请执行以下命令进行创建Configmap。 + + ```bash + kubectl apply -f configmap.yaml + ``` + +3. 使用以下YAML内容,创建ls-pod-demo.yaml文件。 + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: ls-pod-demo + labels: + koordinator.sh/qosClass: 'LS' #指定Pod的QoS级别为LS。 + spec: + containers: + - command: + - "nginx" + - "-g" + - "daemon off; worker_processes 4;" + image: docker.io/koordinatorsh/nginx:v1.18-koord-example + imagePullPolicy: Always + name: nginx + resources: + limits: + cpu: "4" + memory: 10Gi + requests: + cpu: "4" + memory: 10Gi + restartPolicy: Never + schedulerName: default-scheduler + ``` + +4. 执行以下命令,将ls-pod-demo部署到集群。 + + ```bash + kubectl apply -f ls-pod-demo.yaml + ``` + +5. 执行以下命令,在单机端的Cgroup分组中查看LS Pod的内核Group Identity生效情况。 + + ```bash + cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod1c20f2ad****.slice/cpu.bvt_warp_ns + ``` + + 预期输出: + + ```bash + #LS Pod的Group Identity优先级为2(高优)。 + 2 + ``` + +6. 使用以下YAML内容,创建be-pod-demo.yaml文件。 + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: be-pod-demo + labels: + koordinator.sh/qosClass: 'BE' #指定Pod的QoS级别为BE。 + spec: + containers: + - args: + - '-c' + - '1' + - '--vm' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + schedulerName: default-scheduler + priorityClassName: koord-batch + ``` + +7. 执行以下命令,将be-pod-demo部署到集群。 + + ```bash + kubectl apply -f be-pod-demo.yaml + ``` + +8. 
执行以下命令,在单机端的Cgroup分组中查看BE Pod的内核Group Identity生效情况。 + + ```bash + cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8****.slice/cpu.bvt_warp_ns + ``` + + 预期输出: + + ```bash + #BE Pod的Group Identity优先级为-1(低优)。 + -1 + ``` + + 由预期输出得到,LS容器为Group Identity高优先级,BE容器为Group Identity低优先级,表示LS容器的CPU服务质量将被优先保障。 + diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-suppress.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-suppress.md new file mode 100644 index 000000000..7077acefd --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/cpu-suppress.md @@ -0,0 +1,103 @@ +# CPU Suppress + +## Introduction +In order to ensure the runtime quality of different workloads in co-located scenarios, Koordinator uses the CPU Suppress +mechanism provided by koordlet on the node side to suppress workloads of the Best Effort type when the load increases. +Or increase the resource quota for Best Effort type workloads when the load decreases. + +In the [Dynamic resource overcommitment model](/architecture/resource-model.md) that is provided by +Koordinator, the total amount of reclaimed resources dynamically changes based on the actual amount of resources used +by latency-sensitive (LS/LSR/LSE) pods. Reclaimed resources can be used by BE pods. You can use the dynamic resource +overcommitment feature to improve the resource utilization of a cluster by deploying both LS pods and BE pods in the +cluster. To ensure sufficient CPU resources for the LS pods on a node, you can use koordinator to limit the CPU +usage of the BE pods on the node. The elastic resource limit feature can maintain the resource utilization of a node +below a specified threshold and limit the amount of CPU resources that can be used by BE pods. This ensures the +stability of the containers on the node. + +CPU Threshold indicates the CPU utilization threshold of a node. Pod (LS).Usage indicates the CPU usage of LS pods. +CPU Restriction for BE indicates the CPU usage of BE pods. The amount of CPU resources that can be used by BE pods +is adjusted based on the increase or decrease of the CPU usage of LS pods. We recommend that you use the same value +for CPU Threshold and the reserved CPU watermark in the dynamic resource overcommitment model. +This ensures a consistent level of CPU resource utilization. + +![CPU-Suppress](/img/cpu-suppress-demo.svg) + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations +When installing through the helm chart, the ConfigMap slo-controller-config will be created in the koordinator-system +namespace, and the CPU Suppress mechanism is enabled by default. If it needs to be closed, refer to the configuration +below, and modify the configuration of the resource-threshold-config section to take effect. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: {{ .Values.installation.namespace }} +data: + ... + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuSuppressThresholdPercent": 65 + } + } +``` + +#### (Optional) Advanced Settings +Also, the `CPU Suppress` feature allows you to configure the CPU utilization threshold in a fine-grained manner. +The following table describes the parameters. 
+ +| Parameter | Data type | Valid value | Description | +| --------- | --------- | ----------- | ----------- | +| enable | Boolean | true; false | true: enables the elastic resource limit feature; false: disables the elastic resource limit feature. | +| cpuSuppressThresholdPercent | Int | 0~100 | The CPU utilization threshold of the node. Unit: %. Default value: 65. | + +## Use CPU Suppress + +1. Create a configmap.yaml file based on the following ConfigMap content: +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + # Enable the elastic resource limit feature. + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true + } + } +``` + +2. Run the following command to update the ConfigMap. +To avoid changing other settings in the ConfigMap, we commend that you run the kubectl patch command to update the ConfigMap. + +```bash +kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" +``` + +3. Run the following command to query the CPU cores that are allocated to the BE pods on the node: +```bash +cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/cpuset.cpus +``` +Expected output: +```bash +10-25,35-51,62-77,87-103 +``` +The output shows that the following CPU cores are allocated to the BE pods on the node: 10-25, 35-51, 62-77, and 87-103, +which will be changed dynamically according to the load of latency-sensitve pods. \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..4aa55eafc --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md @@ -0,0 +1,251 @@ +# 精细化 CPU 编排 + +koord-scheduler 为了提升 CPU 密集型工作负载的性能提供了精细化 CPU 编排能力。 + +## Introduction + +越来越多的系统利用 CPU 和硬件加速器的组合来支持实时计算和高吞吐的并行计算。 许多应用程序都需要高性能环境,包括电信、科学计算、机器学习、金融服务和数据分析。 + +但是,Kubernetes 集群中的 Pod 在多种资源维度上都是共享的,存在相互干扰的问题。 CPU 资源的共享几乎是不可避免的,例如 SMT 线程(即逻辑处理器)共享同一个物理核,同一个芯片中的物理核共享同一个 L3 缓存。 资源竞争会减慢这些对 CPU 敏感的工作负载的运行质量,从而导致延迟升高。 + +为了提高对 CPU 敏感的工作负载的性能,koord-scheduler 提供了一种精细化的 CPU 编排机制。 它增强了 Kubernetes 的 CPU 管理,并支持详细的 NUMA 局部性和 CPU 排除。 + +有关详细信息,请参阅[设计:细粒度 CPU 编排](/docs/designs/fine-grained-cpu-orchestration)。 + +## 设置 + +### 前置条件 + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### 安装 + +请确保 Koordinator 组件已正确安装在你的集群中。 如果没有,请参考[安装文档](/docs/installation)。 + +### 配置全局参数 + +精细化 CPU 编排能力是默认开启的。用户不需要额外的配置即可使用。 + +对于需要深入定制的用户,可以按需修改 Helm Chart 中的配置文件 `koord-scheduler-config` 设置精细化 CPU 编排的参数。 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + ... +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + profiles: + - schedulerName: koord-scheduler + - pluginConfig: + - name: NodeNUMAResource + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: NodeNUMAResourceArgs + # The default CPU Binding Policy. The default is FullPCPUs + # If the Pod belongs to LSE/LSR Prod Pods, and if no specific CPU binding policy is set, + # the CPU will be allocated according to the default core binding policy. 
+ defaultCPUBindPolicy: FullPCPUs + # the scoring strategy + scoringStrategy: + # the scoring strategy ('MostAllocated', 'LeastAllocated') + # - MostAllocated(default): prefer the node with the least available resources + # - LeastAllocated: prefer the node with the most available resources + type: MostAllocated + # the weights of each resource type + resources: + - name: cpu + weight: 1 + plugins: + # enable the NodeNUMAResource plugin + preFilter: + enabled: + - name: NodeNUMAResource + filter: + enabled: + - name: NodeNUMAResource + ... + score: + enabled: + - name: NodeNUMAResource + weight: 1 + ... + reserve: + enabled: + - name: NodeNUMAResource + preBind: + enabled: + - name: NodeNUMAResource +``` + +koord-descheduler 是通过 Configmap 加载[调度器配置](https://kubernetes.io/docs/reference/scheduling/config/)的。因此需要通过重启调度器才能使用最新的配置。 + + +| 字段 | 说明 | 版本 | +|-------|-------------|---------| +| defaultCPUBindPolicy | 默认的 CPU 绑定策略。 默认值为 FullPCPUs。 如果 Pod 属于 LSE/LSR Prod Pod,并且没有设置具体的 CPU 绑定策略,CPU 则会按照默认的 CPU 绑定策略进行分配。 可选值为 FullPCPUs 和 SpreadByPCPUs | >= v0.6.0 | +| scoringStrategy | 打分策略,可选值为 MostAllocated 和 LeastAllocated | >= v0.6.0 | + +### 按节点配置 + +用户可以单独的为节点设置不同的 CPU 绑定策略和 NUMA Node 选择策略。 + +#### CPU 绑定策略 + +Label `node.koordinator.sh/cpu-bind-policy` 约束了调度时如何按照指定的策略分配和绑定CPU。具体的值定义如下: + +| 值 | 描述 | 版本 | +|-------|-------------|---------| +| None or empty | 不执行任何策略。 | >= v0.6.0 | +| FullPCPUsOnly | 要求调度器必须分配完整的物理核。等价于 kubelet CPU manager policy option full-pcpus-only=true. | >= v0.6.0 | +| SpreadByPCPUs | 要求调度器必须按照物理核维度均匀的分配逻辑核。 | >= v1.1.0 | + +如果节点 Label 上没有 `node.koordinator.sh/cpu-bind-policy`,调度器将会按照 Pod 指定的 CPU 绑定策略或者调度器配置的默认策略分配 CPU。 + +#### NUMA Node 选择策略 + +Label `node.koordinator.sh/numa-allocate-strategy` 表示调度时应该如何选择 NUMA Node。具体的值定义如下: + +| 值 | 描述 | 版本 | +|-------|-------------|---------| +| MostAllocated | MostAllocated 表示选择资源剩余最少的 NUMA Node。| >= v.0.6.0 | +| LeastAllocated | LeastAllocated 表示选择资源剩余最多的NUMA Node。| >= v.0.6.0 | + +如果 `node.koordinator.sh/numa-allocate-strategy` 和 `kubelet.koordinator.sh/cpu-manager-policy` 都设置了, 优先使用 `node.koordinator.sh/numa-allocate-strategy`。 + + +## 使用精细化 CPU 编排 + +1. 按照下面的 YAM了 创建 Deployment `nginx`。 + +> 使用精细化 CPU 编排时,Pod 需要在 Label 中指定具体的 [QoSClass](/docs/architecture/qos#definition) 并指定具体的绑定策略。 + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-lsr + labels: + app: nginx-lsr +spec: + replicas: 3 + selector: + matchLabels: + app: nginx-lsr + template: + metadata: + name: nginx-lsr + labels: + app: nginx-lsr + koordinator.sh/qosClass: LSR # set the QoS class as LSR, the binding policy is FullPCPUs by default + # in v0.5, binding policy should be specified. + # e.g. to set binding policy as FullPCPUs (prefer allocating full physical CPUs of the same core): + #annotations: + #scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "FullPCPUs"}' + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: '2' + requests: + cpu: '2' + priorityClassName: koord-prod +``` + +2. 创建 `nginx` deployment 并检查调度结果。 + +```bash +$ kubectl create -f nginx-deployment.yaml +deployment/nginx-lsr created +$ kubectl get pods -o wide | grep nginx +nginx-lsr-59cf487d4b-jwwjv 1/1 Running 0 21s 172.20.101.35 node-0 +nginx-lsr-59cf487d4b-4l7r4 1/1 Running 0 21s 172.20.101.79 node-1 +nginx-lsr-59cf487d4b-nrb7f 1/1 Running 0 21s 172.20.106.119 node-2 +``` + +3. 检查 Pod 的 CPU 分配结果 `scheduling.koordinator.sh/resource-status`. 
+ +```bash +$ kubectl get pod nginx-lsr-59cf487d4b-jwwjv -o jsonpath='{.metadata.annotations.scheduling\.koordinator\.sh/resource-status}' +{"cpuset":"2,54"} +``` + +我们可以看到 Pod `nginx-lsr-59cf487d4b-jwwjv` 绑定了 2 个逻辑核,对应的逻辑核 ID 分别是 2 和 54,这两个逻辑核属于同一个物理核。 + +4. 更改 `nginx` deployment 的 CPU 绑定策略。 + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-lsr + labels: + app: nginx-lsr +spec: + replicas: 3 + selector: + matchLabels: + app: nginx-lsr + template: + metadata: + name: nginx-lsr + labels: + app: nginx-lsr + koordinator.sh/qosClass: LSR # set the QoS class as LSR + annotations: + # set binding policy as SpreadByPCPUs (prefer allocating physical CPUs of different cores) + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "SpreadByPCPUs"}' + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: '2' + requests: + cpu: '2' + priorityClassName: koord-prod +``` + +5. 更新 `nginx` deployment 并检查调度结果。 + +```bash +$ kubectl apply -f nginx-deployment.yaml +deployment/nginx-lsr created +$ kubectl get pods -o wide | grep nginx +nginx-lsr-7fcbcf89b4-rkrgg 1/1 Running 0 49s 172.20.101.35 node-0 +nginx-lsr-7fcbcf89b4-ndbks 1/1 Running 0 49s 172.20.101.79 node-1 +nginx-lsr-7fcbcf89b4-9v8b8 1/1 Running 0 49s 172.20.106.119 node-2 +``` + +6. 检查 Pod 最新的 CPU 分配结果 `scheduling.koordinator.sh/resource-status`。 + +```bash +$ kubectl get pod nginx-lsr-7fcbcf89b4-rkrgg -o jsonpath='{.metadata.annotations.scheduling\.koordinator\.sh/resource-status}' +{"cpuset":"2-3"} +``` + +现在我们可以看到 Pod `nginx-lsr-59cf487d4b-jwwjv` 绑定了两个逻辑核,对应的 ID 分别是 2,3, 属于两个不同的物理核。 + +7. (可选) 高级配置. + +```yaml + labels: + # koordinator QoS class of the pod. (use 'LSR' or 'LSE' for binding CPUs) + koordinator.sh/qosClass: LSR + annotations: + # `resource-spec` indicates the specification of resource scheduling, here we need to set `preferredCPUBindPolicy`. + # `preferredCPUBindPolicy` indicating the CPU binding policy of the pod ('None', 'FullPCPUs', 'SpreadByPCPUs') + # - None: perform no exclusive policy + # - FullPCPUs(default): a bin-packing binding policy, prefer allocating full physical cores (SMT siblings) + # - SpreadByPCPUs: a spread binding policy, prefer allocating logical cores (SMT threads) evenly across physical cores (SMT siblings) + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "FullPCPUs"}' +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md new file mode 100644 index 000000000..b4a93a337 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md @@ -0,0 +1,318 @@ +# Device Scheduling +We provide a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA, defines a set of APIs to +describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names to flexibly support +users to apply at a finer granularity GPU resources. This mechanism is the basis for subsequent other GPU scheduling +capabilities such as GPU Share, GPU Overcommitment, etc. + +## Introduction +GPU devices have very strong computing power, but are expensive. How to make better use of GPU equipment, give full play +to the value of GPU and reduce costs is a problem that needs to be solved. 
In the existing GPU allocation mechanism of +the K8s community, the GPU is allocated by the kubelet, and it is a complete device allocation. This method is simple +and reliable, but similar to the CPU and memory, the GPU will also be wasted. Therefore, some users expect to use only +a portion of the GPU's resources and share the rest with other workloads to save costs. Moreover, GPU has particularities. +For example, the NVLink and oversold scenarios supported by NVIDIA GPU mentioned below both require a central decision +through the scheduler to obtain globally optimal allocation results. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.71 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Configurations + +DeviceScheduling is *Enabled* by default. You can use it without any modification on the koord-scheduler config. + +## Use DeviceScheduling + +### Quick Start + +1.check device crd: + +```bash +$ kubectl get device host04 -o yaml +``` + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Device +metadata: + creationTimestamp: "2022-10-08T09:26:42Z" + generation: 1 + managedFields: + - apiVersion: scheduling.koordinator.sh/v1alpha1 + fieldsType: FieldsV1 + fieldsV1: + f:metadata: + f:ownerReferences: {} + f:spec: + .: {} + f:devices: {} + f:status: {} + manager: koordlet + operation: Update + time: "2022-10-08T09:26:42Z" + name: host04 + ownerReferences: + - apiVersion: v1 + blockOwnerDeletion: true + controller: true + kind: Node + name: host04 + uid: 09c4f912-6026-467a-85d2-6b2147c9557e + resourceVersion: "39011943" + selfLink: /apis/scheduling.koordinator.sh/v1alpha1/devices/host04 + uid: 5a498e1f-1357-4518-b74c-cab251d6c18c +spec: + devices: + - health: true + id: GPU-04cea5cd-966f-7116-1d58-1ac34421541b + minor: 0 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 16Gi + kubernetes.io/gpu-memory-ratio: "100" + type: gpu + - health: true + id: GPU-3680858f-1753-371e-3c1a-7d8127fc7113 + minor: 1 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 16Gi + kubernetes.io/gpu-memory-ratio: "100" + type: gpu +status: {} +``` +We can find this node has two gpu cards, we can find the detail info of each gpu card here. + +2.check node allocatable resource: + +```bash +$ kubectl get node host04 -o yaml +``` + +```yaml +apiVersion: v1 +kind: Node +metadata: + annotations: + flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"5a:69:48:10:29:25"}' + creationTimestamp: "2022-08-29T09:12:55Z" + labels: + beta.kubernetes.io/os: linux + status: + addresses: + - address: 10.15.0.37 + type: InternalIP + - address: host04 + type: Hostname + allocatable: + cpu: "6" + ephemeral-storage: "200681483926" + kubernetes.io/gpu: "200" + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory: 32Gi + kubernetes.io/gpu-memory-ratio: "200" + memory: 59274552Ki + nvidia.com/gpu: "2" + pods: "220" + capacity: + cpu: "8" + kubernetes.io/gpu: "200" + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory: 32Gi + kubernetes.io/gpu-memory-ratio: "200" + memory: 61678904Ki + nvidia.com/gpu: "2" + pods: "220" +``` +We can find the node allocatable resource has merged each gpu card resource. 
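+
+As a quick cross-check, the merged GPU quantities can also be read directly from the node object. The jsonpath queries below are a small sketch against the example node `host04`; the values correspond to the two 16Gi GPU cards reported by the Device CRD in step 1.
+
+```bash
+$ kubectl get node host04 -o jsonpath='{.status.allocatable.kubernetes\.io/gpu-core}'
+200
+$ kubectl get node host04 -o jsonpath='{.status.allocatable.kubernetes\.io/gpu-memory}'
+32Gi
+```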
+ +3.apply pod: +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: default +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu: "100" + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default pod-example -o yaml +``` + +```yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":0,"resources":{"kubernetes.io/gpu-core":"100","kubernetes.io/gpu-memory":"12508288Ki","kubernetes.io/gpu-memory-ratio":"100"}}]}' + creationTimestamp: "2022-10-08T09:33:07Z" + name: pod-example + namespace: default + resourceVersion: "39015044" + selfLink: /api/v1/namespaces/xlf/pods/gpu-pod7 + uid: 6bf1ac3c-0c9f-472a-8b86-de350bbfa795 +spec: + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: "1" + kubernetes.io/gpu: "100" + memory: 256Mi + requests: + cpu: "1" + kubernetes.io/gpu: "100" + memory: 256Mi +status: + conditions: + ... + hostIP: 10.0.0.149 + phase: Running + podIP: 10.244.2.45 + podIPs: + - ip: 10.244.2.45 + qosClass: Guaranteed + startTime: "2022-10-08T09:33:07Z" +``` +You can find the concrete device allocate result through annotation `scheduling.koordinator.sh/device-allocated`. + +4.more apply protocol: +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... + resources: + requests: + cpu: 40m + memory: 40Mi + nvidia.com/gpu: "100" +``` + +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... + resources: + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory-ratio: "100" +``` + +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... 
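+  # This variant requests GPU memory by absolute quantity (kubernetes.io/gpu-memory)
+  # instead of by percentage (kubernetes.io/gpu-memory-ratio) as in the previous example.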
+ resources: + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: "16Mi" +``` + +4.device resource debug api: +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries +$ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/host04 +``` + +```json +{ + "allocateSet": { + "gpu": { + "xlf/gpu-pod7": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + } + }, + "deviceFree": { + "kubernetes.io/gpu-core": "0", + "kubernetes.io/gpu-memory": "0", + "kubernetes.io/gpu-memory-ratio": "0" + }, + "deviceFreeDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "0", + "kubernetes.io/gpu-memory": "0", + "kubernetes.io/gpu-memory-ratio": "0" + } + } + }, + "deviceTotal": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + }, + "deviceTotalDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + }, + "deviceUsed": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + }, + "deviceUsedDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + } +} +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/gang-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/gang-scheduling.md new file mode 100644 index 000000000..b439200ef --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/gang-scheduling.md @@ -0,0 +1,364 @@ +# GangScheduling + +## 简介 +Koord-dscheduler 提供了 Gang Scheduling 满足 All-or-Nothing 调度需求。用户可以声明最小资源集合数,只有当已经完成调度资源数超过前面声明当前最小资源集合数才能触发节点绑定。 +同时提供 `Strict` 和 `NonStrict` 两个参数用于控制资源累积过程,区别于其他社区方案将提供 two-level Gang 描述用于更好匹配真实场景。 + +## 设置 + +### 前置条件 + +- Kubernetes >= 1.18 +- Koordinator >= 0.70 + +### 安装 + +请确保 Kubernetes 集群已经安装 Koordinator 组件,如果没有安装,请参阅 [安装](/docs/installation)。 + +### 配置 + +GangScheduling 特性默认*开启*,无需修改 koord-scheduler 配置进行开启。 + +## GangScheduling 使用手册 + +### 快速开始 + +#### Gang CRD 方式 + +1.创建 pod-group 资源 +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: gang-example + namespace: default +spec: + scheduleTimeoutSeconds: 100 + minMember: 2 +``` + +```bash +$ kubectl get pgs -n default + NAME AGE + gang-example 13s +``` + +2.创建子资源 pod1 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example1 + namespace: default + labels: + pod-group.scheduling.sigs.k8s.io: gang-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 0/1 Pending 0 7s +``` + +3.创建子资源 pod2 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: 
pod-example2 + namespace: default + labels: + pod-group.scheduling.sigs.k8s.io: gang-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 1/1 Running 0 53s + pod-example2 1/1 Running 0 5s +``` + +```bash +$ kubectl get pg gang-example -n default -o yaml +``` + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-10-09T09:08:17Z" + generation: 6 +spec: + minMember: 1 + scheduleTimeoutSeconds: 100 +status: + phase: Running + running: 2 + scheduled: 2 +``` + +#### Pod Annotaion 方式 +1.创建子资源 pod1 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example1 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example" + gang.scheduling.koordinator.sh/min-available: "2" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 0/1 Pending 0 7s +``` + +2.创建子资源 pod2 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example2 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example" + gang.scheduling.koordinator.sh/min-available: "2" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 1/1 Running 0 53s + pod-example2 1/1 Running 0 5s +``` + +```bash +$ kubectl get pg gang-example -n default -o yaml +``` + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-10-09T09:08:17Z" + generation: 6 +spec: + minMember: 1 + scheduleTimeoutSeconds: 100 +status: + phase: Running + running: 2 + scheduled: 2 +``` + +#### Gang 调度调试接口: +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/Coscheduling/gang/default/gang-example +``` + +```json +{ + "boundChildren": { + "default/pod-example1": {}, + "default/pod-example2": {} + }, + "children": { + "default/pod-example1": {}, + "default/pod-example2": {} + }, + "childrenScheduleRoundMap": { + "default/pod-example1": 2, + "default/pod-example2": 2 + }, + "createTime": "2022-10-09T07:31:53Z", + "gangFrom": "GangFromPodAnnotation", + "gangGroup": null, + "hasGangInit": true, + "minRequiredNumber": 2, + "mode": "Strict", + "name": "default/gang-example", + "onceResourceSatisfied": true, + "scheduleCycle": 2, + "scheduleCycleValid": true, + 
"totalChildrenNum": 2, + "waitTime": 600000000000, + "waitingForBindChildren": {} +} +``` + +#### Gang 调度高级配置 +1.PodGroup Annotation 方式 + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: gang-example1 + namespace: default + annotations: + gang.scheduling.koordinator.sh/total-number: "3" + gang.scheduling.koordinator.sh/mode: "NonStrict" + gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]" + +spec: + scheduleTimeoutSeconds: 100 + minMember: 2 + +``` + +- `gang.scheduling.koordinator.sh/total-number` 用于配置 gang 内子资源总数。如果未配置,则使用 `minMember` 配置。 +- `gang.scheduling.koordinator.sh/mode` 用于配置 Gang 调度失败处理策略。支持 `Strict\NonStrict` 两种模式,默认为 `Strict` 。 +- `gang.scheduling.koordinator.sh/groups` 用于配置支持多个 gang 为一组完成 Gang 调度,用于支持多个 gang 之间有依赖关系的场景。 + +2.Pod Annotation 方式 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example2 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example1" + gang.scheduling.koordinator.sh/min-available: "2" + gang.scheduling.koordinator.sh/total-number: "3" + gang.scheduling.koordinator.sh/mode: "Strict\NonStrict" + gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]" + gang.scheduling.koordinator.sh/waiting-time: "100s" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +- `gang.scheduling.koordinator.sh/total-number` 用于配置 gang 内子资源总数。如果未配置,则使用 `gang.scheduling.koordinator.sh/min-available` 配置。 +- `gang.scheduling.koordinator.sh/mode` 用于配置 Gang 调度失败处理策略。支持 `Strict\NonStrict` 两种模式,默认为 `Strict` 。 +- `gang.scheduling.koordinator.sh/groups` 用于配置支持多个 gang 为一组完成 Gang 调度,用于支持多个 gang 之间有依赖关系的场景。 +- `gang.scheduling.koordinator.sh/waiting-time` 用于配置自第一个 Pod 进入 Permit 阶段依赖的最大等待时间。 + +#### 调度器高级配置 +您可以在 helm 中修改 `koord-scheduler-config.yaml` 来调整 `Coscheduling` 配置,如下所示: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + namespace: {{ .Values.installation.namespace }} +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: true + resourceLock: leases + resourceName: koord-scheduler + resourceNamespace: {{ .Values.installation.namespace }} + profiles: + - pluginConfig: + - name: Coscheduling + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: CoschedulingArgs` + defaultTimeout: 600s + controllerWorkers: 1 + - name: ElasticQuota + ... 
+``` + diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-descheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-descheduling.md new file mode 100644 index 000000000..ea6247ce5 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-descheduling.md @@ -0,0 +1,229 @@ +# 负载感知重调度 + + +调度器中支持的负载感知调度能够在调度时选择负载较低的节点运行新的Pod,但随着时间、集群环境变化以及工作负载面对的流量/请求的变化时,节点的利用率会动态的发生变化,集群内节点间原本负载均衡的情况被打破,甚至有可能出现极端负载不均衡的情况,影响到工作负载运行时质量。 + +koord-descheduler 感知集群内节点负载的变化,自动的优化超过负载水位安全阈值的节点,防止出现极端负载不均衡的情况。 + +## 简介 + +koord-descheduler 组件中 `LowNodeLoad` 插件负责感知负载水位完成热点打散重调度工作。`LowNodeLoad` 插件 与 Kubernetes 原生的 descheduler 的插件 LowNodeUtilization 不同的是,`LowNodeLoad` 是根据节点真实利用率的情况决策重调度,而 LowNodeUtilization 是根据资源分配率决策重调度。 + +LowNodeLoad插件有两个最重要的参数: +- `highThresholds` 表示负载水位的目标安全阈值,超过该阈值的节点上的 Pod 将参与重调度; +- `lowThresholds` 表示负载水位的空闲安全水位。低于该阈值的节点上的 Pod 不会被重调度。 + +以下图为例,`lowThresholds` 为45%,`highThresholds` 为 70%,我们可以把节点归为三类: + +1. 空闲节点(Idle Node)。资源利用率低于 45% 的节点; +2. 正常节点(Normal Node)。资源利用率高于 45% 但低于 70% 的节点,这个负载水位区间是我们期望的合理的区间范围 +3. 热点节点(Hotspot Node)。如果节点资源利用率高于70%,这个节点就会被判定为不安全了,属于热点节点,应该驱逐一部分 Pod,降低负载水位,使其不超过 70%。 + +![image](/img/low-node-load.png) + +在识别出哪些节点是热点后,koord-descheduler 将会执行迁移驱逐操作,驱逐热点节点中的部分 Pod 到空闲节点上。 + +如果一个集群中空闲节点的总数并不是很多时会终止重调度。这在大型集群中可能会有所帮助,在大型集群中,一些节点可能会经常或短时间使用不足。默认情况下,`numberOfNodes` 设置为零。可以通过设置参数 `numberOfNodes` 来开启该能力。 + +在迁移前,koord-descheduler 会计算出实际空闲容量,确保要迁移的 Pod 的实际利用率之和不超过集群内空闲总量。这些实际空闲容量来自于空闲节点,一个空闲节点实际空闲容量 = `(highThresholds - 节点当前负载) * 节点总容量`。假设节点 A 的负载水位是20%,highThresholdss是 70%,节点 A 的 CPU 总量为96C,那么 `(70%-20%) * 96 = 48C`,这 48C 就是可以承载的空闲容量了。 + +另外,在迁移热点节点时,会过滤筛选节点上的Pod,目前 koord-descheduler 支持多种筛选参数,可以避免迁移驱逐非常重要的 Pod: + +- 按 namespace 过滤。可以配置成只筛选某些 namespace 或者过滤掉某些 namespace +- 按 pod selector 过滤。可以通过 label selector 筛选出 Pod,或者排除掉具备某些 Label 的 Pod +- 配置 nodeFit 检查调度规则是否有备选节点。当开启后,koord-descheduler 根据备选 Pod 对应的 Node Affinity/Node Selector/Toleration ,检查集群内是否有与之匹配的 Node,如果没有的话,该 Pod 将不会去驱逐迁移。如果设置 `nodeFit` 为 false,此时完全由 koord-descheduler 底层的迁移控制器完成容量预留,确保有资源后开始迁移。 + +当筛选出 Pod 后,从 QoSClass、Priority、实际用量和创建时间等多个维度对这些 Pod 排序。 + +筛选 Pod 并完成排序后,开始执行迁移操作。迁移前会检查剩余空闲容量是否满足和当前节点的负载水位是否高于目标安全阈值,如果这两个条件中的一个不能满足,将停止重调度。每迁移一个 Pod 时,会预扣剩余空闲容量,同时也会调整当前节点的负载水位,直到剩余容量不足或者水位达到安全阈值。 + +## 设置 + +### 前置条件 + +- Kubernetes >= 1.18 +- Koordinator >= 1.1.1 + +### 安装 + +请确保 Koordinator 组件已正确安装在你的集群中。 如果没有,请参考[安装文档](/docs/installation)。 + +### 配置 + +负载感知重调度默认是禁用的。可以通过修改配置 ConfigMap `koord-descheduler-config` 启用该能力。 + +对于需要深入定制的用户,可以按照需要更改 Helm Chart 中的 ConfigMap `koord-descheduler-config` 设置参数。修改配置后需要重启 koord-descheduler 才能应用最新的配置。 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-descheduler-config + ... +data: + koord-descheduler-config: | + apiVersion: descheduler/v1alpha2 + kind: DeschedulerConfiguration + ... + # Execute the LowNodeLoad plugin every 60s + deschedulingInterval: 60s + profiles: + - name: koord-descheduler + plugins: + deschedule: + disabled: + - name: "*" + balance: + enabled: + - name: LowNodeLoad # Configure to enable the LowNodeLoad plugin + .... + pluginConfig: + - name: LowNodeLoad + args: + apiVersion: descheduler/v1alpha2 + kind: LowNodeLoadArgs + evictableNamespaces: + # include and exclude are mutually exclusive, only one of them can be configured. 
+ # include indicates that only the namespace configured below will be processed + # include: + # - test-namespace + # exclude means to only process namespaces other than those configured below + exclude: + - "kube-system" + - "koordinator-system" + # lowThresholds defines the low usage threshold of resources + lowThresholds: + cpu: 20 + memory: 30 + # highThresholds defines the target usage threshold of resources + highThresholds: + cpu: 50 + memory: 60 + .... +``` + +| 字段 | 说明 | 版本 | +|-------|-------------|--------| +| paused | Paused 控制 LowNodeLoad 插件是否工作. | >= v1.1.1 | +| dryRun | DryRun 表示只执行重调度逻辑,但不重复啊迁移/驱逐 Pod | >= v1.1.1 | +| numberOfNodes | NumberOfNodes 可以配置为仅当未充分利用的节点数高于配置值时才激活该策略。 这在大型集群中可能会有所帮助,在大型集群中,一些节点可能会经常或短时间使用不足。 默认情况下,NumberOfNodes 设置为零。 | >= v1.1.1 | +| evictableNamespaces | 可以参与重调度的Namespace。可以配置 include和exclude两种,但两种策略只能二选一。include 表示只处理指定的 namespace;exclude 表示只处理指定之外的namespace。| >= v1.1.1 | +| nodeSelector | 通过 label selector 机制选择目标节点。 | >= v1.1.1 | +| podSelectors | 通过 label selector 选择要处理的Pod。 | >= v1.1.1 | +| nodeFit | 表示是否按照备选要迁移的Pod中指定的 Node Affinity/Node Selector/Resource Requests/TaintToleration 判断是否有空闲节点。没有则不参与调度。默认开启。可以设置为 false 禁用该能力。 | >= v1.1.1 | +| useDeviationThresholds | 如果 useDeviationThresholds 设置为 true,则阈值被视为与平均资源使用率的百分比偏差。lowThresholds 将从所有节点的平均值中减去,highThresholds 将添加到平均值中。高于此窗口的资源消耗被视为过度利用的,即热点节点。 | >= v1.1.1 | +| highThresholds | 表示负载水位的目标安全阈值,超过该阈值的节点上的Pod将参与重调度。 | >= v1.1.1 | +| lowThresholds | 表示负载水位的空闲安全水位。低于该阈值的节点上的Pod不会被重调度。 | >= v1.1.1 | + +## 使用负载感知重调度 + +本文示例的集群有3台 4核16GiB 节点。 + +1. 使用下面的 YAML 创建两个 stress Pod + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: stress-demo + namespace: default + labels: + app: stress-demo +spec: + replicas: 2 + selector: + matchLabels: + app: stress-demo + template: + metadata: + name: stress-demo + labels: + app: stress-demo + spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - '1600M' + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '2' + memory: 4Gi + requests: + cpu: '2' + memory: 4Gi + restartPolicy: Always + schedulerName: koord-scheduler # use the koord-scheduler +``` + +```bash +$ kubectl create -f stress-demo.yaml +deployment.apps/stress-demo created +``` + +2. 观察 Pod 的状态,直到它们开始运行。 + +```bash +$ kubectl get pod -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +stress-demo-7fdd89cc6b-lml7k 1/1 Running 0 21m 10.0.2.83 cn-beijing.10.0.2.54 +stress-demo-7fdd89cc6b-xr5dl 1/1 Running 0 4m40s 10.0.2.77 cn-beijing.10.0.2.53 +``` + +这些 Pod 调度到了节点 `cn-beijing.10.0.2.53` 和 `cn-beijing.10.0.2.54`. + +3. 检查每个node节点的负载。 + +```bash +$ kubectl top node +NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% +cn-beijing.10.0.2.53 3825m 98% 4051Mi 31% +cn-beijing.10.0.2.54 2155m 55% 4500Mi 35% +cn-beijing.10.0.2.58 182m 4% 1367Mi 10% +``` + +按照输出结果显示, 节点 `cn-beijing.10.0.2.53` 和 `cn-beijing.10.0.2.54` 负载比较高, 节点 `cn-beijing.10.0.2.58` 负载最低。 + +4. 更新配置 `koord-descheduler-config` 启用插件 `LowNodeLoad`。 + +5. 
观察 Pod 变化,等待重调度器执行驱逐迁移操作。 + +```bash +$ kubectl get pod -w +NAME READY STATUS RESTARTS AGE +stress-demo-7fdd89cc6b-lml7k 1/1 Running 0 22m +stress-demo-7fdd89cc6b-xr5dl 1/1 Running 0 5m45s +stress-demo-7fdd89cc6b-xr5dl 1/1 Terminating 0 5m59s +stress-demo-7fdd89cc6b-8k8wq 0/1 Pending 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 Pending 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 ContainerCreating 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 ContainerCreating 0 1s +stress-demo-7fdd89cc6b-8k8wq 1/1 Running 0 3s +``` + +6. 观察Event,可以看到如下迁移记录 + +```bash +$ kubectl get event |grep stress-demo-7fdd89cc6b-xr5dl +74s Normal Evicting podmigrationjob/e54863dc-b651-47e3-9ffd-08b6b4ff64d5 Pod "default/stress-demo-7fdd89cc6b-xr5dl" evicted from node "cn-beijing.10.0.2.53" by the reason "node is overutilized, cpu usage(56.13%)>threshold(50.00%)" +41s Normal EvictComplete podmigrationjob/e54863dc-b651-47e3-9ffd-08b6b4ff64d5 Pod "default/stress-demo-7fdd89cc6b-xr5dl" has been evicted +7m12s Normal Scheduled pod/stress-demo-7fdd89cc6b-xr5dl Successfully assigned default/stress-demo-7fdd89cc6b-xr5dl to cn-beijing.10.0.2.53 +7m12s Normal AllocIPSucceed pod/stress-demo-7fdd89cc6b-xr5dl Alloc IP 10.0.2.77/24 +7m12s Normal Pulling pod/stress-demo-7fdd89cc6b-xr5dl Pulling image "polinux/stress" +6m59s Normal Pulled pod/stress-demo-7fdd89cc6b-xr5dl Successfully pulled image "polinux/stress" in 12.685405843s +6m59s Normal Created pod/stress-demo-7fdd89cc6b-xr5dl Created container stress +6m59s Normal Started pod/stress-demo-7fdd89cc6b-xr5dl Started container stress +74s Normal Descheduled pod/stress-demo-7fdd89cc6b-xr5dl Pod evicted from node "cn-beijing.10.0.2.53" by the reason "node is overutilized, cpu usage(56.13%)>threshold(50.00%)" +73s Normal Killing pod/stress-demo-7fdd89cc6b-xr5dl Stopping container stress +7m13s Normal SuccessfulCreate replicaset/stress-demo-7fdd89cc6b Created pod: stress-demo-7fdd89cc6b-xr5dl +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-scheduling.md new file mode 100644 index 000000000..ff33e8560 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/load-aware-scheduling.md @@ -0,0 +1,311 @@ +# 负载感知调度 + +负载感知调度(Load Aware Scheduling) 是 koord-scheduler 提供的一种调度能力,调度 Pod 时根据节点的负载情况选择合适的节点,均衡节点间的负载情况。 + +## 简介 + +负载均衡是资源调度中的常见问题。资源未充分利用的节点会带来很大的资源浪费,而过度使用的节点可能会导致性能下降。这些问题都不能高效的管理和使用资源。 +原生 Kubernetes Scheduler 根据 Requests 和节点可分配总量来调度 Pod,既不考虑实时负载,也不估计使用量。 当我们期望使用原生调度器均匀的打散 Pod 并保持节点间的负载均衡,我们需要为应用程序设置精确的资源规格。此外,当 Koordinator 通过超卖机制提升资源使用效率时,我们需要一种机制尽量避免性能回退,并避免负载过高的问题。 + +koord-scheduler 参考 koordlet 上报的资源利用率数据平衡在线 Pod(LSE/LSR/LS)和离线 Pod(BE)的调度。 + +![图片](/img/load-aware-scheduling-arch.svg) + +想要了解更多信息,请参阅 [设计:负载感知调度](/docs/designs/load-aware-scheduling)。 + +## 设置 + +### 前提条件 + +- Kubernetes >= 1.18 +- Koordinator >= 0.4 + +### 安装 + +请确保 Koordinator 组件已正确安装在你的集群中。 如果没有,请参考[安装文档](/docs/installation)。 + +### 配置全局策略 + +负载感知调度是默认启用的,不需要修改调度器的配置即可使用。 + +对于需要深入定制的用户,可以通过修改 Helm Chart 中的 ConfigMap `koord-scheduler-config` 规则来配置负载感知调度。 + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + ... +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + profiles: + - schedulerName: koord-scheduler + plugins: + # enable the LoadAwareScheduling plugin + filter: + enabled: + - name: LoadAwareScheduling + ... 
+ score: + enabled: + - name: LoadAwareScheduling + weight: 1 + ... + reserve: + enabled: + - name: LoadAwareScheduling + ... + pluginConfig: + # configure the thresholds and weights for the plugin + - name: LoadAwareScheduling + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: LoadAwareSchedulingArgs + # whether to filter nodes where koordlet fails to update NodeMetric + filterExpiredNodeMetrics: true + # the expiration threshold seconds when using NodeMetric + nodeMetricExpirationSeconds: 300 + # weights of resources + resourceWeights: + cpu: 1 + memory: 1 + # thresholds (%) of resource utilization + usageThresholds: + cpu: 75 + memory: 85 + # thresholds (%) of resource utilization of Prod Pods + prodUsageThresholds: + cpu: 55 + memory: 65 + # enable score according Prod usage + scoreAccordingProdUsage: true + # the factor (%) for estimating resource usage + estimatedScalingFactors: + cpu: 80 + memory: 70 + # enable resource utilization filtering and scoring based on percentile statistics + aggregated: + usageThresholds: + cpu: 65 + memory: 75 + usageAggregationType: "p99" + scoreAggregationType: "p99" +``` + +koord-descheduler 是通过 Configmap 加载[调度器配置](https://kubernetes.io/docs/reference/scheduling/config/)的。因此需要通过重启调度器才能使用最新的配置。 + +| 字段 | 说明 | 版本 | +|-------|-------------| --------| +| filterExpiredNodeMetrics | filterExpiredNodeMetrics 表示是否过滤koordlet更新NodeMetric失败的节点。 默认情况下启用,但在 Helm chart 中,它被禁用。| >= v0.4.0 | +| nodeMetricExpirationSeconds | nodeMetricExpirationSeconds 指示 NodeMetric 过期时间(以秒为单位)。 当 NodeMetrics 过期时,节点被认为是异常的。 默认为 180 秒。| >= v0.4.0 | +| resourceWeights | resourceWeights 表示资源的权重。 CPU 和 Memory 的权重默认都是 1。| >= v0.4.0 | +| usageThresholds | usageThresholds 表示整机的资源利用率阈值。 CPU 的默认值为 65%,内存的默认值为 95%。| >= v0.4.0 | +| estimatedScalingFactors | estimatedScalingFactors 表示估计资源使用时的因子。 CPU 默认值为 85%,Memory 默认值为 70%。| >= v0.4.0 | +| prodUsageThresholds| prodUsageThresholds 表示 Prod Pod 相对于整机的资源利用率阈值。 默认情况下不启用。 | >= v1.1.0 | +| scoreAccordingProdUsage | scoreAccordingProdUsage 控制是否根据 Prod Pod 的利用率进行评分。| >= v1.1.0 | +| aggregated | aggregated 支持基于百分位数统计的资源利用率过滤和评分。| >= v1.1.0 | + +Aggregated 支持的字段: + +| 字段 | 说明 | 版本 | +|-------|-------------| --------| +| usageThresholds | usageThresholds 表示机器基于百分位统计的资源利用率阈值。| >= v1.1.0| +| usageAggregationType | usageAggregationType 表示过滤时机器利用率的百分位类型。 目前支持 `avg`、`p50`、`p90`、`p95` 和 `p99`。 | >= v1.1.0 | +| usageAggregatedDuration | usageAggregatedDuration 表示过滤时机器利用率百分位数的统计周期。不设置该字段时,调度器默认使用 NodeMetrics 中最大周期的数据。| >= v1.1.0| +| scoreAggregationType | scoreAggregationType 表示评分时机器利用率的百分位类型。 目前支持 `avg`、`p50`、`p90`、`p95` 和 `p99`。| >= v1.1.0 +| scoreAggregatedDuration | scoreAggregatedDuration 表示打分时 Prod Pod 利用率百分位的统计周期。 不设置该字段时,调度器默认使用 NodeMetrics 中最大周期的数据。| >= v1.1.0 | + +### 按照节点配置过滤阈值 + +通过插件的配置可以作为集群默认的全局配置,用户也可以通过在节点上附加 annotation 来设置节点维度的负载阈值。 当节点上存在 annotation 时,会根据注解指定的参数进行过滤。 + +Annotation 定义如下: + +```go +const ( + AnnotationCustomUsageThresholds = "scheduling.koordinator.sh/usage-thresholds" +) + +// CustomUsageThresholds supports user-defined node resource utilization thresholds. +type CustomUsageThresholds struct { + // UsageThresholds indicates the resource utilization threshold of the whole machine. 
+ UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + // ProdUsageThresholds indicates the resource utilization threshold of Prod Pods compared to the whole machine + ProdUsageThresholds map[corev1.ResourceName]int64 `json:"prodUsageThresholds,omitempty"` + // AggregatedUsage supports resource utilization filtering and scoring based on percentile statistics + AggregatedUsage *CustomAggregatedUsage `json:"aggregatedUsage,omitempty"` +} + +type CustomAggregatedUsage struct { + // UsageThresholds indicates the resource utilization threshold of the machine based on percentile statistics + UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + // UsageAggregationType indicates the percentile type of the machine's utilization when filtering + UsageAggregationType slov1alpha1.AggregationType `json:"usageAggregationType,omitempty"` + // UsageAggregatedDuration indicates the statistical period of the percentile of the machine's utilization when filtering + UsageAggregatedDuration *metav1.Duration `json:"usageAggregatedDuration,omitempty"` +} +``` + +## 使用负载感知调度 + +### 感知整机负载进行调度 + +本文示例的集群有3台 4核16GiB 节点。 + +1. 使用下面的 YAML 创建一个 `stress` Pod + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: stress-demo + namespace: default + labels: + app: stress-demo +spec: + replicas: 1 + selector: + matchLabels: + app: stress-demo + template: + metadata: + name: stress-demo + labels: + app: stress-demo + spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - '1600M' + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '2' + memory: 4Gi + requests: + cpu: '2' + memory: 4Gi + restartPolicy: Always + schedulerName: koord-scheduler # use the koord-scheduler +``` + +```bash +$ kubectl create -f stress-demo.yaml +deployment.apps/stress-demo created +``` + +2. 观察 Pod 的状态,直到它开始运行。 + +```bash +$ kubectl get pod -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +stress-demo-7fdd89cc6b-gcnzn 1/1 Running 0 82s 10.0.3.114 cn-beijing.10.0.3.112 +``` + +Pod `stress-demo-7fdd89cc6b-gcnzn` 调度在 `cn-beijing.10.0.3.112`。 + +3. 检查每个node节点的负载。 + +```bash +$ kubectl top node +NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% +cn-beijing.10.0.3.110 92m 2% 1158Mi 9% +cn-beijing.10.0.3.111 77m 1% 1162Mi 9% +cn-beijing.10.0.3.112 2105m 53% 3594Mi 28% +``` +按照输出结果显示,节点 `cn-beijing.10.0.3.111` 负载最低,节点`cn-beijing.10.0.3.112` 的负载最高。 + +4. 使用下面的 YAML 文件部署 `nginx` deployment。 + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-with-loadaware + labels: + app: nginx +spec: + replicas: 6 + selector: + matchLabels: + app: nginx + template: + metadata: + name: nginx + labels: + app: nginx + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: 500m + requests: + cpu: 500m +``` + +```bash +$ kubectl create -f nginx-with-loadaware.yaml +deployment/nginx-with-loadawre created +``` + +5. 
检查 `nginx` Pods 的调度结果。 + +```bash +$ kubectl get pods | grep nginx +nginx-with-loadaware-5646666d56-224jp 1/1 Running 0 18s 10.0.3.118 cn-beijing.10.0.3.110 +nginx-with-loadaware-5646666d56-7glt9 1/1 Running 0 18s 10.0.3.115 cn-beijing.10.0.3.110 +nginx-with-loadaware-5646666d56-kcdvr 1/1 Running 0 18s 10.0.3.119 cn-beijing.10.0.3.110 +nginx-with-loadaware-5646666d56-qzw4j 1/1 Running 0 18s 10.0.3.113 cn-beijing.10.0.3.111 +nginx-with-loadaware-5646666d56-sbgv9 1/1 Running 0 18s 10.0.3.120 cn-beijing.10.0.3.111 +nginx-with-loadaware-5646666d56-z79dn 1/1 Running 0 18s 10.0.3.116 cn-beijing.10.0.3.111 +``` + +现在我们可以看到 `nginx` pods 被调度在 `cn-beijing.10.0.3.112` (负载最高的节点) 以外的节点上。 + +### 感知 Prod Pods 的负载进行调度 + +如果一个 Node 中调度了很多 BestEffort Pod,可能会因为节点的负载已达到使用限制而导致延迟敏感的 Pod 无法调度。 在 Koordinator v1.1.0 中,负载感知调度针对这种场景进行了优化。 对于延迟敏感(LSE/LSR/LS)的 Pod,优先调度到 Prod Pod 总利用率较低的节点,而 BestEffort(BE) Pod 根据整机利用率水平进行调度。 + +通过设置以下参数启用相关优化: + +| 字段 | 说明 | 版本 | +|-------|-------------| --------| +| prodUsageThresholds| prodUsageThresholds 表示 Prod Pod 相对于整机的资源利用率阈值。 默认情况下不启用。 | >= v1.1.0 | +| scoreAccordingProdUsage | scoreAccordingProdUsage 控制是否根据 Prod Pod 的利用率进行评分。| >= v1.1.0 | + +### 感知基于百分位数统计的利用率进行调度 + +Koordinator v1.0及以前的版本都是按照 koordlet 上报的平均利用率数据进行过滤和打分。但平均值隐藏了比较多的信息,因此在 Koordinator v1.1 中 koordlet 新增了根据百分位数统计的利用率聚合数据。调度器侧也跟着做了相应的适配。 + +通过设置以下参数启用相关优化: + +| 字段 | 说明 | 版本 | +|-------|-------------| --------| +| aggregated | aggregated 支持基于百分位数统计的资源利用率过滤和评分。| >= v1.1.0 | + +Aggregated 支持的字段: + +| 字段 | 说明 | 版本 | +|-------|-------------| --------| +| usageThresholds | usageThresholds 表示机器基于百分位统计的资源利用率阈值。| >= v1.1.0| +| usageAggregationType | usageAggregationType 表示过滤时机器利用率的百分位类型。 目前支持 `avg`、`p50`、`p90`、`p95` 和 `p99`。 | >= v1.1.0 | +| usageAggregatedDuration | usageAggregatedDuration 表示过滤时机器利用率百分位数的统计周期。不设置该字段时,调度器默认使用 NodeMetrics 中最大周期的数据。| >= v1.1.0| +| scoreAggregationType | scoreAggregationType 表示评分时机器利用率的百分位类型。 目前支持 `avg`、`p50`、`p90`、`p95` 和 `p99`。| >= v1.1.0 +| scoreAggregatedDuration | scoreAggregatedDuration 表示打分时 Prod Pod 利用率百分位的统计周期。 不设置该字段时,调度器默认使用 NodeMetrics 中最大周期的数据。| >= v1.1.0 | + +`aggregated` 和 `usageThresholds` 参数是互斥的。 当两者都配置时,将使用 `aggregated`。此外,目前不支持 Pod 类型感知。 \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-evict.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-evict.md new file mode 100644 index 000000000..e722cfdd3 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-evict.md @@ -0,0 +1,120 @@ +# 基于内存用量的驱逐策略 + +## 简介 + +Koordinator支持了将节点空闲资源动态超卖给低优先级Pod,在混部场景下,节点实际的内存资源用量时刻在变化,对于内存这类不可压缩类型的资源, +当节点资源用量较高时,可能会引发整机内存OOM,导致高优先级Pod的进程被kill。为防止这一情况发生,Koordiantor提供了基于单机内存用量的驱逐策略。 +单机组件Koordlet会以秒级粒度持续探测整机内存的用量情况(Total-Available),当整机资源内存用量较高时,会将低优先级的BE类型Pod驱逐, +保障高优先级Pod的服务质量。在驱逐过程中会首先选择优先级(Pod.Spec.Priority)更低的Pod进行驱逐,若优先级相同, +则优先驱逐内存资源用量更多的Pod,直至整机内存用量降低到配置的安全水位(evictThreshold)以下。 + +![image](/img/memory-evict.svg) + +## 使用限制 +请确保Koordinator已正确安装在你的集群中。若未安装,请参考[安装文档](https://koordinator.sh/docs/installation),所需的版本要求情况如下: + +| 组件 | 版本要求 | +| --- | ------- | +| Kubernetes | ≥v1.18 | +| koordinator | ≥v0.3.0 | + +该功能由单机组件Koordlet提供,对应的feature-gate默认关闭,使用前请确保koordlet的启动参数`-feature-gates`中已经添加了`BEMemoryEvict=true`, +详见[参考示例](https://github.com/koordinator-sh/charts/blob/main/versions/v1.2.0/templates/koordlet.yaml#L36)。 + +## 操作步骤 + +1. 
使用以下ConfigMap,创建configmap.yaml文件 + ```yaml + #ConfigMap slo-controller-config 样例。 + apiVersion: v1 + kind: ConfigMap + metadata: + name: slo-controller-config # 以koord-manager实际配置的名字为准,例如ack-slo-config + namespace: koordinator-system # 命名空间以环境中实际安装的情况为准,例如kube-system + data: + # 开启基于内存用量的驱逐功能。 + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "memoryEvictThresholdPercent": 70 + } + } + ``` + + | 参数 | 类型 | 取值范围 | 说明 | + | :-------------- | :------ | :-------- | :----------------------------------------------------------- | + | `enable` | Boolean | true; false | true:集群全局开启单机内存驱逐策略。false(默认值):集群全局关闭单机内存驱逐策略。 | + | `memoryEvictThresholdPercent` | Int | 0~100 | 整机内存资源用量百分比水位,表示触发驱逐的内存阈值,默认值为70。 | + +2. 查看安装的命名空间下是否存在ConfigMap,以命名空间`koordinator-system`和ConfigMap名字`slo-controller-config`为例,具体以实际安装配置为准。 + + - 若存在ConfigMap `slo-controller-config`,请使用PATCH方式进行更新,避免干扰ConfigMap中其他配置项。 + + ```bash + kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" + ``` + + - 若不存在ConfigMap `slo-controller-config`,请执行以下命令进行创建Configmap。 + + ```bash + kubectl apply -f configmap.yaml + ``` + +3. 使用以下YAML内容,创建be-pod-demo.yaml文件。 + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: be-pod-demo + labels: + koordinator.sh/qosClass: 'BE' #指定Pod的QoS级别为BE。 + spec: + containers: + - args: + - '-c' + - '1' + - '--vm' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + schedulerName: default-scheduler + ``` + +4. 执行以下命令,将be-pod-demo部署到集群。 + + ```bash + $ kubectl apply -f be-pod-demo.yaml + ``` + +5. 执行以下命令,查看be-pod-demo状态,等待Pod启动完成。 + + ```bash + $ kubectl get pod be-pod-demo + NAME READY STATUS RESTARTS AGE + be-pod-demo 1/1 Running 0 7s + ``` + +6. 在节点执行以下命令,使用[stress工具](https://linux.die.net/man/1/stress)启动进程, +确保整机内存资源用量被提升到驱逐水位以上,其中`--vm-bytes`参数表示stress进程占用的内存量10GB,测试时可根据实际机型情况进行调整。 + + ```bash + $ stress --cpu 1 --vm 1 --vm-bytes 10G --vm-keep + ``` + +7. 观察be-pod-demo运行情况,可以发现be-pod-demo已经不存在,驱逐信息可以通过event查看到。 + + ```bash + $ kubectl get pod be-pod-demo + Error from server (NotFound): pods "be-pod-demo" not found + + $ kubectl get event + LAST SEEN TYPE REASON OBJECT MESSAGE + 46s Normal Killing pod/be-pod-demo Stopping container stress + 48s Warning evictPodSuccess $you-pod-object evict Pod:be-pod-demo, reason: EvictPodByNodeMemoryUsage, message: killAndEvictBEPods for node(${your-node-id}), need to release memory: 8077889699 + ``` \ No newline at end of file diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-qos.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-qos.md new file mode 100644 index 000000000..66f5e60f9 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/memory-qos.md @@ -0,0 +1,355 @@ +# Memory QoS + +## Introduction + +The Koordlet provides the *Memory Quality of Service* (QoS) feature for containers. You can use this feature to +optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This +topic describes how to enable the memory QoS feature for containers. + +### Background + +The following memory limits apply to containers: + +- The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about + to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. 
As a result, + the application in the container may not be able to request or release memory resources as normal. +- The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, + the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. + This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is + downgraded. In extreme cases, the node cannot run as normal. + +To improve the performance of applications and the stability of nodes, Koordinator provides the memory QoS feature for +containers. We recommend that you use Anolis OS as the node OS. For other OS, we will try our best to adapt, and users +can still enable it without side effects. After you enable the memory QoS feature for a container, Koordlet +automatically configures the memory control group (memcg) based on the configuration of the container. This helps you +optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node. + +Memory QoS provides the following optimizations to improve the memory utilization of pods: + +- When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclaim for a specific amount of memory. This prevents the reclaim of all the memory that the pod uses and therefore minimizes the adverse impact on the application performance caused by direct memory reclaim. +- Memory reclaim is performed in a fairer manner among pods. When the available memory on a node becomes insufficient, memory reclaim is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory. +- If the BestEffort pods on a node use more memory than their memory requests, the system prioritizes the memory requirements of Guaranteed pods and Burstable pods over the memory requirements of BestEffort pods. + +![image](/img/memory-qos.png) + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled Memory QoS feature (`-feature-gates=AllAlpha=true`). +If not, please enable it manually by updating the feature gate in the koordlet daemonset. + +> NOTE: Memory QoS is controlled by the `CgroupReconcile` feature-gate. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CgroupReconcile=true # enable CPU Burst feature + ... +``` + +## Use Memory QoS + +When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified +ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps. + +### Use an annotation to enable Memory QoS for the pod + +Add the following annotations to enable memory QoS for the containers in a pod: + +```yaml +annotations: + # To enable memory QoS for the containers in a pod, set the value to auto. 
+ koordinator.sh/memoryQOS: '{"policy": "auto"}' + # To disable memory QoS for the containers in a pod, set the value to none. + #koordinator.sh/memoryQOS: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable memory QoS for all the containers in a cluster + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + resource-qos-config: |- + { + "clusterStrategy": { + "lsClass": { + "memoryQOS": { + "enable": true + } + }, + "beClass": { + "memoryQOS": { + "enable": true + } + } + } + } +``` + +### (Optional) Advanced Settings + +The following table describes the advanced parameters that you can use to configure fine-grained memory QoS +configurations at the pod level and cluster level. + +| Parameter | Data type | Valid value | Description | +| ------------------- | ----------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| enable | Boolean |
`true` / `false` | true: enables memory QoS for all the containers in a cluster; the default memory QoS settings for the QoS class of the containers are used. false: disables memory QoS for all the containers in a cluster; the memory QoS settings are restored to the original settings for the QoS class of the containers. |
+| policy | String | `auto` / `default` / `none` | auto: enables memory QoS for the containers in the pod and uses the recommended memory QoS settings, which are prioritized over the cluster-wide memory QoS settings. default: the pod inherits the cluster-wide memory QoS settings. none: disables memory QoS for the pod; the relevant memory QoS settings are restored to the original settings, which are prioritized over the cluster-wide memory QoS settings.
| +| minLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: `Value of memory.min = Memory request × Value of minLimitPercent/100`. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify Memory `Request=100MiB` and `minLimitPercent=100` for a container, `the value of memory.min is 104857600`. | +| lowLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: `Value of memory.low = Memory request × Value of lowLimitPercent/100`. For example, if you specify `Memory Request=100MiB` and `lowLimitPercent=100` for a container, `the value of memory.low is 104857600`. | +| throttlingPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the memory throttling threshold for the ratio of the memory usage of a container to the memory limit of the container. The memory throttling threshold for memory usage is calculated based on the following formula: `Value of memory.high = Memory limit × Value of throttlingPercent/100`. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container will be reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to cgroups from triggering OOM. For example, if you specify `Memory Limit=100MiB` and `throttlingPercent=80` for a container, `the value of memory.high is 83886080`, which is equal to 80 MiB. | +| wmarkRatio | Int | 0~100 | Unit: %. Default value:`95`. A value of `0` indicates that this parameter is disabled. This parameter specifies the threshold of the usage of the memory limit or the value of `memory.high` that triggers asynchronous memory reclaim. If `throttlingPercent` is disabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Memory limit × wmarkRatio/100`. If `throttlingPercent` is enabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Value of memory.high × wmarkRatio/100`. If the usage of the memory limit or the value of memory.high exceeds the threshold, the memcg backend asynchronous reclaim feature is triggered. For example, if you specify `Memory Limit=100MiB`for a container, the memory throttling setting is`memory.high=83886080`, the reclaim ratio setting is `memory.wmark_ratio=95`, and the reclaim threshold setting is `memory.wmark_high=79691776`. | +| wmarkMinAdj | Int | -25~50 | Unit: %. The default value is `-25` for the `LS`/ `LSR` QoS class and `50` for the `BE` QoS class. A value of 0 indicates that this parameter is disabled. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. 
A positive value increases the global minimum watermark and therefore antedates memory reclaim for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is `memory.wmark_min_adj=-25`, which indicates that the minimum watermark is decreased by 25% for the containers in the pod. | + +### Example + +0. The testing environment is shown below: + +- Kubernetes: 1.20 +- Nodes: + - Stress Node: an ECS instance (8 vCPU, 32GB RAM) for performing stress tests. + - Tested Node: an ECS instance (8 vCPU, 32GB RAM) runs the workload and serves. + +1. Create a file named redis-demo.yaml with the following YAML template: + +```yaml +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: redis-demo-config +data: + redis-config: | + appendonly yes + appendfsync no +--- +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS +spec: + containers: + - name: redis + image: redis:5.0.4 + command: + - redis-server + - "/redis-master/redis.conf" + env: + - name: MASTER + value: "true" + ports: + - containerPort: 6379 + resources: + limits: + cpu: "2" + memory: "6Gi" + requests: + cpu: "2" + memory: "2Gi" + volumeMounts: + - mountPath: /redis-master-data + name: data + - mountPath: /redis-master + name: config + volumes: + - name: data + emptyDir: {} + - name: config + configMap: + name: redis-demo-config + items: + - key: redis-config + path: redis.conf + nodeName: # Set nodeName to the name of the tested node +--- +apiVersion: v1 +kind: Service +metadata: + name: redis-demo +spec: + ports: + - name: redis-port + port: 6379 + protocol: TCP + targetPort: 6379 + selector: + name: redis-demo + type: ClusterIP +``` + +2. Run the following command to deploy Redis Server as the test application. + +You can access the redis-demo Service from within the cluster. + +```bash +kubectl apply -f redis-demo.yaml +``` + +3. Simulate the scenario of memory overcommitment. + +Use the Stress tool to increase the load on memory and trigger memory reclaim. The sum of the memory limits of all pods +on the node exceeds the physical memory of the node. + + a. Create a file named stress-demo.yaml with the following YAML template: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE +spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - 11G + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed +``` + + b. Run the following command to deploy stress-demo: + +```bash +kubectl apply -f stress-demo.yaml +``` + +4. Run the following command to query the global minimum watermark of the node: + +> Note In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, OOM +> killers may be triggered for all pods on the node even before memory reclaim is performed. Therefore, we recommend +> that you set the global minimum watermark to a high value. 
In this example, the global minimum watermark is set +> to 4,000,000 KB for the tested node that has 32 GiB of memory. + +```bash +cat /proc/sys/vm/min_free_kbytes +``` + +Expected output: + +```bash +4000000 +``` + +5. Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + name: memtier-demo + name: memtier-demo +spec: + containers: + - command: + - memtier_benchmark + - '-s' + - 'redis-demo' + - '--data-size' + - '200000' + - "--ratio" + - "1:4" + image: 'redislabs/memtier_benchmark:1.3.0' + name: memtier + restartPolicy: Never + nodeName: # Set nodeName to the name of the stress node that is used to send requests. +``` + +6. Run the following command to query the test results from memtier-benchmark: + +```bash +kubectl logs -f memtier-demo +``` + +7. Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests +again and compare the results. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'LS' +spec: + ... + +--- +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'BE' +``` + +8. Check the results of Memory QoS enabled and disabled. + +- Disabled: Set the memory QoS policy of the pod to `none`. +- Enabled: Set the memory QoS policy of the pod to `auto` (the recommended parameters of memory QoS are used). + +| Metric | Disabled | Enabled | +| ----------------- | ------------- | ------------- | +| Latency-avg | 51.32 ms | 47.25 ms | +| Throughput-avg | 149.0 MB/s | 161.9 MB/s | + +The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased +by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of +applications in memory overcommitment scenarios. diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md new file mode 100644 index 000000000..e5de06fcb --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md @@ -0,0 +1,621 @@ +# Multi Hierarchy Elastic Quota Management + +Multi Hierarchy ElasticQuota Management is an ability of koord-scheduler to manage different user's resource usage in a shared-cluster. + +## Introduction +When several users or teams share a cluster, fairness of resource allocation is very important. the Koordinator provides +multi-hierarchy elastic quota management mechanism for the scheduler. +- It supports configuring quota groups in a tree structure, which is similar to the organizational structure of most companies. +- It supports the borrowing / returning of resources between different quota groups, for better resource utilization efficiency. +The busy quota groups can automatically temporarily borrow the resources from the idle quota groups, which can improve the +utilization of the cluster. At the same time, when the idle quota group turn into the busy quota group, it can also automatically +take back the "lent-to" resources. 
+- It considers the resource fairness between different quota groups. When the busy quota groups borrow the +resources from the idle quota groups, the resources can be allocated to the busy quota groups under some fair rules. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.71 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Configurations + +Multi-Hierarchy-ElasticQuota-Management is *Enabled* by default. You can use it without any modification on the koord-descheduler config. + +## Use Multi-Hierarchy-ElasticQuota-Management + +### Quick Start by Label + +1.Create a Deployment `quota-example` with the YAML file below. + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/parent: "" + quota.scheduling.koordinator.sh/is-parent: "false" +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + elasticquota.scheduling.sigs.k8s.io/quota-example created + +$ kubectl get eqs -n default + NAME AGE + test-d 2s +``` + +2.Create a pod `pod-example` with the YAML file below. +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: default + labels: + quota.scheduling.koordinator.sh/name: "quota-example" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl apply -f pod-example.yaml + pod/pod-example created +``` + +3.Verify `quota-example` has changed. +```bash +$ kubectl get eqs -n default quota-example -o yaml +``` +```yaml +kind: ElasticQuota +metadata: + annotations: + quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}' + creationTimestamp: "2022-10-08T09:26:38Z" + generation: 2 + labels: + quota.scheduling.koordinator.sh/is-parent: "false" + quota.scheduling.koordinator.sh/parent: root + manager: koord-scheduler + operation: Update + time: "2022-10-08T09:26:50Z" + name: quota-example + namespace: default + resourceVersion: "39012008" +spec: + max: + cpu: "40" + memory: 40Gi + min: + cpu: "10" + memory: 20Mi +status: + used: + cpu: 40m + memory: 40Mi +``` + +### Quick Start by Namespace +1.Create namespace +```bash +$ kubectl create ns quota-example + namespace/quota-example created +``` + +2.Create a Deployment `quota-example` with the YAML file below. + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: quota-example + labels: + quota.scheduling.koordinator.sh/parent: "" + quota.scheduling.koordinator.sh/is-parent: "false" +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + elasticquota.scheduling.sigs.k8s.io/quota-example created + +$ kubectl get eqs -n quota-example + NAME AGE + test-d 2s +``` + +2.Create a pod `pod-example` with the YAML file below. 
+```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: quota-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl apply -f pod-example.yaml + pod/pod-example created +``` + +3.Verify `quota-example` has changed. +```bash +$ kubectl get eqs -n quota-example quota-example -o yaml +``` +```yaml +kind: ElasticQuota +metadata: + annotations: + quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}' + creationTimestamp: "2022-10-08T09:26:38Z" + generation: 2 + labels: + quota.scheduling.koordinator.sh/is-parent: "false" + quota.scheduling.koordinator.sh/parent: root + manager: koord-scheduler + operation: Update + time: "2022-10-08T09:26:50Z" + name: quota-example + namespace: quota-example + resourceVersion: "39012008" +spec: + max: + cpu: "40" + memory: 40Gi + min: + cpu: "10" + memory: 20Mi +status: + used: + cpu: 40m + memory: 40Mi +``` + +### Quota Debug Api. +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/ElasticQuota/quota/quota-example +``` + +```json +{ + "allowLentResource": true, + "autoScaleMin": { + "cpu": "10", + "memory": "20Mi", + }, + "isParent": false, + "max": { + "cpu": "40", + "memory": "40Gi", + }, + "min": { + "cpu": "10", + "memory": "20Mi", + }, + "name": "quota-example", + "parentName": "root", + "podCache": { + "pod-example": { + "isAssigned": true, + "resource": { + "cpu": "40m", + "memory": "40Mi" + } + } + }, + "request": { + "cpu": "40m", + "memory": "40Mi" + }, + "runtime": { + "cpu": "40m", + "memory": "41943040", + }, + "runtimeVersion": 39, + "sharedWeight": { + "cpu": "40", + "memory": "40Gi", + }, + "used": { + "cpu": "40m", + "memory": "40Mi" + } +} +``` +The main different with yaml is that we can find all quota's pods and its status in `podCache`. + +### Advanced Configurations +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true + quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}' +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +- `quota.scheduling.koordinator.sh/is-parent` is disposed by the user. It reflects the "child\parent" attribute of the quota group. Default is child. +- `quota.scheduling.koordinator.sh/parent` is disposed by the user. It reflects the parent quota name. Default is root. +- `quota.scheduling.koordinator.sh/shared-weight` is disposed by the user. It reflects the ability to share the "lent to" resource. Default equals to "max". +- `quota.scheduling.koordinator.sh/allow-lent-resource` is disposed by the user. It reflects whether quota group allows lent unused "min" to others. 
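+
+Putting these labels together, a minimal two-level hierarchy might look like the sketch below. The quota names (`quota-parent-demo`, `quota-child-demo`) are illustrative; the child group declares its parent via `parent`, allows its unused "min" to be lent out via `allow-lent-resource`, and scales its share of borrowed resources via `shared-weight`.
+
+```yaml
+# Illustrative parent quota group (a parent group cannot run pods itself).
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-parent-demo
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/is-parent: "true"
+spec:
+  max:
+    cpu: 40
+    memory: 40Gi
+  min:
+    cpu: 10
+    memory: 20Mi
+---
+# Illustrative child quota group under quota-parent-demo.
+# Its min does not exceed the parent's min, and its max/min resource keys match the parent's.
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-child-demo
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/is-parent: "false"
+    quota.scheduling.koordinator.sh/parent: "quota-parent-demo"
+    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
+    quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"20","memory":"20Gi"}'
+spec:
+  max:
+    cpu: 20
+    memory: 20Gi
+  min:
+    cpu: 5
+    memory: 10Mi
+```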
+ +### WebHook Verify +1.Except for the first level quota group, we require that the sum of "min" of all sub quota groups should be less than or +equal to the "min" of parent group. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 20 + memory: 20Mi +``` + +```bash +kubectl apply -f quota-example.yaml +Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkMinQuotaSum allChildren SumMinQuota > parentMinQuota, parent: quota-parent-example +``` + +2.Parent and child's min\max resource key must same. +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + test: 200 + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkSubAndParentGroupMaxQuotaKeySame failed: quota-parent-example's key is not the same with quota-example +``` + +3.Parent group cannot run pod. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create pod: +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: default + labels: + quota.scheduling.koordinator.sh/name: "quota-parent-example" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl apply -f pod-example_xb.yaml + Error from server: error when creating "pod-example.yaml": admission webhook "vpod.kb.io" denied the request: pod can not be linked to a parentQuotaGroup,quota:quota-parent-example, pod:pod-example +``` + +4.The parent of node can only be parent group, not child group. 
+ +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + test: 200 + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + Error from server: error when creating "elastic-quota-example_xb.yaml": admission webhook "vquota.kb.io" denied the request: quota-example has parentName quota-parent-example but the parentQuotaInfo's IsParent is false +``` + +5.A quota group can't be converted on the attribute of parent group\child group. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then modify `quota.scheduling.koordinator.sh/is-parent:false`: +```bash +$ kubectl apply -f quota-parent-example.yaml + elastic-quota-example_xb_parent.yaml": admission webhook "vquota.kb.io" denied the request: IsParent is forbidden modify now, quotaName:quota-parent-example +``` + +### used > runtime revoke +We offer a config to control if quota's used > runtime, we allow the scheduler to delete over-resource-used pod from +low priority to high priority. you should follow the below config of `koord-scheduler-config.yaml` in helm. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + namespace: {{ .Values.installation.namespace }} +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: true + resourceLock: leases + resourceName: koord-scheduler + resourceNamespace: {{ .Values.installation.namespace }} + profiles: + - pluginConfig: + - name: ElasticQuota + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: ElasticQuotaArgs + quotaGroupNamespace: {{ .Values.installation.namespace }} + monitorAllQuotas: true + revokePodInterval: 60s + delayEvictTime: 300s + plugins: + queueSort: + disabled: + - name: "*" + enabled: + - name: Coscheduling + preFilter: + enabled: + - name: NodeNUMAResource + - name: DeviceShare + - name: Reservation + - name: Coscheduling + - name: ElasticQuota + filter: + ... +``` +- `monitorAllQuotas` enable "used > runtime revoke" logic. Default is false. +- `revokePodInterval` check loop time interval. +- `delayEvictTime` when "used > runtime" continues over `delayEvictTime` will really trigger eviction. + +To let scheduler can really delete the pod successfully, you should config the `rbac/koord-scheduler.yaml` as below in helm. 
+ +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: koord-scheduler-role +rules: +{{- if semverCompare "<= 1.20-0" .Capabilities.KubeVersion.Version }} +- apiGroups: + - "" + resources: + - namespaces + verbs: + - get + - list + - watch +{{- end }} +- apiGroups: + - coordination.k8s.io + resources: + - leases + verbs: + - create + - get + - update +- apiGroups: + - "" + resources: + - pods + verbs: + - patch + - update + - delete +- apiGroups: + ... +``` diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/performance-collector.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/performance-collector.md new file mode 100644 index 000000000..846d2dd6c --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/performance-collector.md @@ -0,0 +1,188 @@ +# Performance Collector + +## 背景 + +在真实的生产环境下,单机的运行时状态是一个“混沌系统”,资源竞争产生的应用干扰无法绝对避免。Koordinator正在建立干扰检测与优化的能力,通过提取应用运行状态的指标,进行实时的分析和检测,在发现干扰后对目标应用和干扰源采取更具针对性的策略。 +Koordinator已经实现了一系列`Performance Collector`,在单机侧采集与应用运行状态高相关性的底层指标,并通过`Prometheus`暴露出来,为干扰检测能力和集群应用调度提供支持。 + +## 使用方法 + +### 准备条件 + +- Kubernetes >= 1.18 + +- Koordinator >= 1.0 + +- 若您使用CPI Collector,请确保您的机器支持获取Cycles、Instructions这两个Kernel PMU(Performance Monitoring Unit)事件。 + + > 使用如下命令检查是否支持 + + ```shell + $ perf list + List of pre-defined events (to be used in -e): + + branch-instructions OR branches [Hardware event] + branch-misses [Hardware event] + bus-cycles [Hardware event] + ... + + cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event] + ... + instructions OR cpu/instructions/ [Kernel PMU event] + ``` + +- 若您使用PSI Collector,您需要在Anolis OS中开启PSI功能,您可以参考[文档](https://help.aliyun.com/document_detail/155464.html)获取开启方法。 + +### 安装 + +请确保Koordinator的相关组件已被正确安装于您的集群中。您可以参考文档[Installation](https://koordinator.sh/zh-Hans/docs/installation)来获取相关的安装方法。 + +### feature-gates + +Performance Collector由多个feature-gate进行控制,Koordinator目前提供一下几个指标采集器: + +- `CPICollector`:用于控制CPI指标采集器。CPI:Cycles Per Instruction。 +- `PSICollector`:用于控制PSI指标采集器。PSI:Pressure Stall Information。 + +### 配置 + +Performance Collector目前是默认关闭的。您可以通过修改Koordlet的feature-gates项来使用它,此项修改不会影响其他feature-gate + +```shell +kubectl edit ds koordlet -n koordinator-system +``` + +```shell +... +spec: + ... + spec: + containers: + - args: + ... + # modify here + # - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true + - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true + ... 
+``` + +## 开销对比 + +Koordinator Performance Collector作为干扰检测的重要工具,其核心目标之一为在低成本、无自身干扰引入的情况下采集相关指标。下文展示了开启Performance Collector前后Koordinator引入的系统开销。用户可参考此测试结果使用Performance Collector功能。 + +### 测试环境 + +- 阿里云容器服务Kubernetes版(ACK)托管版集群: + - Kubernetes版本:1.24.6-aliyun.1 + - 容器运行时:containerd 1.5.13 + - 节点规格:ecs.ebmg6.26xlarge,104 vCPU 384 GiB,操作系统Alibaba Cloud Linux 2.1903 +- 节点负载: + - 测试Pod镜像:nginx:1.14.2 + - 单节点Pod数量:100 test Pod + 50 system Pod + - 单节点容器数量:150 + - 系统CPU usage水位:25%左右,使用lookbusy-1.4工具在每个CPU核上生产压力 +- 其他条件: + - 100个nginx Pod由Linux cronjob管理,每五分钟删除一次。Deployment控制器将会随之进行重建。 + - CPI Collector的运行时间窗口为每60秒一次,每次持续时长10秒。 + - PSI Collector每10秒采集一次。 + - 测试在Performance Collector开启前后均运行一小时。 + +### 测试结论 + +#### Case 1:Koordlet容器运行Performance Collector前后开销对比 + +Performance Collector运行于Koordinator的Koordlet组件,现将其对该组件的开销对比如下: + +- 总体开销无明显上升: + + | 关键指标 | 开启前 | 开启后 | + | :--------------: | :------: | :--------: | + | RSS Memory usage | 341MiB | 366MiB | + | CPU usage | 0.5 core | 0.6 core | + | 网络I/O | - | 无明显变化 | + +- 性能开销原因分析: + - 新增Container维度的CPI、Container和Pod维度的PSI数据表 + - 每cgroup唯一的采集器goroutine带来的性能消耗 + - Prometheus上报数据仪表盘带来的少量内存消耗 + +#### Case 2:运行Performance Collector后节点开销对比 + +Performance Collector使用了perf_event_open(2)系统调用,并开启了节点上的PSI功能,现将其对节点影响对比如下: + +- 无明显开销增长: + + | 关键指标 | 开启前 | 开启后 | + | :-------------: | :----: | :----: | + | 内核态CPU使用率 | 0.94% | 0.96% | + | 用户态CPU使用率 | 24.51% | 25.19% | + +- 性能开销原因分析: + - perf_event_open(2)的使用 + - PSI功能的开启 + +## 实例 + +1. 打开想要使用的Performance Collector: +```shell +helm install koordinator https://... --set featureGates="CPICollector=true,PSICollector=true" +``` + +2. 使用如下flag配置指标采集器的时间窗口、采集间隔等: + + | flag名称 | 默认值 | 含义 | + | :-----------------------------: | :----: | :-----------------------------: | + | -cpi-collector-interval-seconds | 60 | CPI指标采集的时间间隔,单位为秒 | + | -collect-cpi-timewindow-seconds | 10 | CPI指标采集的时间窗口,单位为秒 | + | -psi-collector-interval-seconds | 10 | PSI指标采集的时间间隔,单位为秒 | + +3. 您可以在Prometheus指标暴露端口(默认为9316)处观察到采集到的指标,查询 API为`/metrics`,CPI指标以*cycles*和*instructions*两条记录分开展示: +```shell +$ curl http://localhost:9316/metrics + +# HELP koordlet_container_cpi Container cpi collected by koordlet +# TYPE koordlet_container_cpi gauge +koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09 +koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09 +``` + +4. 
同时,我们提供ServiceMonitor用于暴露Koordlet采集的指标: + + ```yaml + apiVersion: v1 + kind: Service + metadata: + labels: + koord-app: koordlet + name: koordlet + namespace: koordinator-system + spec: + clusterIP: None + ports: + - name: koordlet-service + port: 9316 + targetPort: 9316 + selector: + koord-app: koordlet + --- + apiVersion: monitoring.coreos.com/v1 + kind: ServiceMonitor + metadata: + labels: + koord-app: koordlet + name: koordlet + namespace: koordinator-system + spec: + endpoints: + - interval: 30s + port: koordlet-service + scheme: http + jobLabel: koord-app + selector: + matchLabels: + koord-app: koordlet + ``` + + 您可以在部署后于Prometheus的Targets中找到并使用: + + ![koordlet-servicemonitor-prometheus](/img/koordlet-servicemonitor-prometheus.png) diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/pod-migration-job.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/pod-migration-job.md new file mode 100644 index 000000000..37c9d5d21 --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/pod-migration-job.md @@ -0,0 +1,254 @@ +# PodMigrationJob + +Koordinator定义了一个基于 CRD 的 Pod 迁移 API,称为 `PodMigrationJob`,通过此 API,重调度器(descheduler)或其他自动故障恢复组件可以更安全地将 Pod 驱逐或删除。 + +## 介绍 + +迁移 Pods 是许多组件(如descheduler)依赖的重要能力,可用于优化调度或帮助解决工作负载运行时质量问题。我们认为,Pod 迁移是一个复杂的过程,涉及诸如审计(auditing)、资源分配和应用程序启动等步骤,并与应用程序升级、伸缩等场景以及集群管理员的资源操作和维护操作混合在一起。因此,如何管理此过程的稳定性风险,以确保应用程序不会因为 Pod 迁移而失败,是必须解决的关键的问题。 + +基于 PodMigrationJob CRD 的最终状态导向迁移能力,我们可以跟踪迁移过程中每个过程的状态,感知应用程序升级和扩展等场景,以确保工作负载的稳定性。 + +## 设置 + +### 前置条件 + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### Installation + +请确保Koordinator组件已正确安装在您的集群中。如果未安装,请参考[安装](/docs/installation). + +### Configurations + +PodMigrationJob 已默认启用。您可以在koord-descheduler配置中无需任何修改即可使用它。 + +## 使用 PodMigrationJob + +### 快速开始 + +1. 使用下面的YAML文件创建一个名为`pod-demo`的Deployment + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: pod-demo + namespace: default +spec: + progressDeadlineSeconds: 600 + replicas: 1 + revisionHistoryLimit: 10 + selector: + matchLabels: + app: pod-demo + strategy: + rollingUpdate: + maxSurge: 25% + maxUnavailable: 25% + type: RollingUpdate + template: + metadata: + creationTimestamp: null + labels: + app: pod-demo + name: stress + spec: + containers: + - args: + - -c + - "1" + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: "2" + memory: 4Gi + requests: + cpu: 200m + memory: 400Mi + restartPolicy: Always + schedulerName: koord-scheduler +``` + +```bash +$ kubectl create -f pod-demo.yaml +deployment.apps/pod-demo created +``` + +2. 检查Pod `pod-demo-0` 的调度结果 + +```bash +$ kubectl get pod -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +pod-demo-5f9b977566-c7lvk 1/1 Running 0 41s 10.17.0.9 node-0 +``` + +`pod-demo-5f9b977566-c7lvk` 被调度在节点 `node-0`上 + +3. 使用下面的YAML文件创建一个 `PodMigrationJob` 来迁移 `pod-demo-0` + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: PodMigrationJob +metadata: + name: migrationjob-demo +spec: + paused: false + ttl: 5m + mode: ReservationFirst + podRef: + namespace: default + name: pod-demo-5f9b977566-c7lvk +status: + phase: Pending +``` + +```bash +$ kubectl create -f migrationjob-demo.yaml +podmigrationjob.scheduling.koordinator.sh/migrationjob-demo created +``` + +5. 
查看迁移状态 + +```bash +$ kubectl get podmigrationjob migrationjob-demo +NAME PHASE STATUS AGE NODE RESERVATION PODNAMESPACE POD NEWPOD TTL +migrationjob-demo Succeed Complete 37s node-1 d56659ab-ba16-47a2-821d-22d6ba49258e default pod-demo-5f9b977566-c7lvk pod-demo-5f9b977566-nxjdf 5m0s +``` + +从上述结果可以观察到: +- **PHASE** 为 `Succeed`, **STATUS** 为 `Complete`, 表明迁移成功; +- **NODE** `node-1` 表示迁移后新Pod所调度的节点; +- **RESERVATION** `d56659ab-ba16-47a2-821d-22d6ba49258e` 是在迁移期间创建的 Reservation。PodMigrationJob Controller 将在开始驱逐 Pod 之前尝试为 Reservation 创建预留资源。在成功预留资源后,将启动驱逐操作,这可以确保新 Pod 必须被驱逐,因为已有资源可用; +- **PODNAMESPACE** `default` 表示待迁移 Pod 所在的命名空间; +- **POD** `pod-demo-5f9b977566-c7lvk` 表示待迁移的 Pod; +- **NEWPOD** `pod-demo-5f9b977566-nxjdf` 表示迁移后新创建的 Pod; +- **TTL** 表示当前作业的 TTL 周期。 + +6. 查看迁移事件 + +PodMigrationJob Controller 将在迁移过程的重要步骤中创建事件,以帮助用户诊断迁移问题 + +```bash +$ kubectl describe podmigrationjob migrationjob-demo +... +Events: + Type Reason Age From Message + ---- ------ ---- ---- ------- + Normal ReservationCreated 8m33s koord-descheduler Successfully create Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" + Normal ReservationScheduled 8m33s koord-descheduler Assigned Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" to node "node-1" + Normal Evicting 8m33s koord-descheduler Try to evict Pod "default/pod-demo-5f9b977566-c7lvk" + Normal EvictComplete 8m koord-descheduler Pod "default/pod-demo-5f9b977566-c7lvk" has been evicted + Normal Complete 8m koord-descheduler Bind Pod "default/pod-demo-5f9b977566-nxjdf" in Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" +``` + +### 高级配置 + +> 最新的API可以查看[`pod_migration_job_types.go`](https://github.com/koordinator-sh/koordinator/blob/main/apis/scheduling/v1alpha1/pod_migration_job_types.go). + +### 示例: 手动确认是否允许迁移 + +驱逐或迁移操作会带来稳定性风险,因此希望在启动迁移操作之前手动检查和确认没有错误,然后再启动迁移。 + +因此,在创建 PodMigrationJob 时,将 `spec.paused` 设置为 `true`,手动确认允许执行后再将 `spec.paused` 设置为 `false`。如果拒绝执行,则可以更新 `status.phase=Failed` 立即终止PodMigrationJob 的执行,或者等待 PodMigrationJob 自动过期。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: PodMigrationJob +metadata: + name: migrationjob-demo +spec: + # paused indicates whether the PodMigrationJob should to work or not. + paused: true + # ttl controls the PodMigrationJob timeout duration. 
+ ttl: 5m + mode: ReservationFirst + podRef: + namespace: default + name: pod-demo-5f9b977566-c7lvk +status: + phase: Pending +``` + +### 示例: 只想驱逐 Pods, 无需预留资源 + +PodMigrationJob 提供两种迁移模式: +- `EvictDirectly` 直接驱逐 Pod,无需预留资源, +- `ReservationFirst` 先预留资源,以确保在开始驱逐之前可以分配资源。 + +如果你只想驱逐 Pod,只需将 `spec.mode` 设置为 `EvictDirectly`。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: PodMigrationJob +metadata: + name: migrationjob-demo +spec: + paused: false + ttl: 5m + mode: EvictDirectly + podRef: + namespace: default + name: pod-demo-5f9b977566-c7lvk +status: + phase: Pending +``` + +### 示例: 在迁移中使用预留资源 + +在某些情况下,首先预留资源,然后在成功后创建一个 PodMigrationJob,以重复使用 PodMigrationJob Controller 提供的仲裁机制(在v0.7中实现)以确保工作负载的稳定性。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: PodMigrationJob +metadata: + name: migrationjob-demo +spec: + paused: false + ttl: 5m + mode: ReservationFirst + podRef: + namespace: default + name: pod-demo-5f9b977566-c7lvk + reservationOptions: + # the reservation-0 created before creating PodMigrationJob + reservationRef: + name: reservation-0 +status: + phase: Pending +``` + +### 示例: 优雅驱逐 Pods + +PodMigrationJob 支持 Pod 的优雅驱逐。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: PodMigrationJob +metadata: + name: migrationjob-demo +spec: + paused: true + ttl: 5m + mode: ReservationFirst + podRef: + namespace: default + name: pod-demo-5f9b977566-c7lvk + deleteOptions: + # The duration in seconds before the object should be deleted. Value must be non-negative integer. + # The value zero indicates delete immediately. If this value is nil, the default grace period for the + # specified type will be used. + # Defaults to a per object value if not specified. zero means delete immediately. + gracePeriodSeconds: 60 +status: + phase: Pending +``` + + +### 已知问题 +- 当前不支持[Arbitration mechanism](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md#filter-podmigrationjob),v0.6版本仅实现了基于资源预留的迁移能力。 +- 目前不支持[Basic Migration API](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md#basic-migration-api) 。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/resource-reservation.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/resource-reservation.md new file mode 100644 index 000000000..403ad20fa --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/resource-reservation.md @@ -0,0 +1,443 @@ +# 资源预留 + +资源预留是koord-scheduler的一种为某些特定Pod或负载预留节点资源的能力。 + +## 介绍 + +Pod是kubernetes节点资源分配的基础载体,他根据业务逻辑绑定对应的资源需求。但是我们可能分为一些还没创建的特定Pod和负载分配资源,例如: + +1. 抢占:已经存在的抢占规则不能保证只有正在抢占中的Pod才能分配抢占的资源,我们期望调度器能锁定资源,防止这些资源被有相同或更高优先级的其他Pod抢占。 +2. 重调度:在重调度场景下,最好能保证在Pod被重调度之前保留足够的资源。否则,被重调度的Pod可能再也没法运行,然后对应的应用可能就会崩溃。 +3. 水平扩容:为了能更精准地进行水平扩容,我们希望能为扩容的Pod副本分配节点资源。 +4. 资源预分配:即使当前的资源还不可用,我们可能想为将来的资源需求提前预留节点资源。 + +为了增强kubernetes的资源调度能力,koord-scheduler提供了一个名字叫`Reservation`的调度API,允许我们为一些当前还未创建的特定的Pod和负载,提前预留节点资源。 + +![image](/img/resource-reservation.svg) + +更多信息,请看 [设计文档:资源预留](../designs/resource-reservation)。 + +## 设置 + +### 前提 + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### 安装步骤 + +请确保Koordinator的组件已经在你的集群中正确安装,如果还未正确安装,请参考[安装说明](/docs/installation)。 + +### 配置 + +资源预留功能默认*启用*,你无需对koord-scheduler配置做任何修改,即可使用。 + +## 使用指南 + +### 快速上手 + +1. 
使用如下yaml文件预留资源:`reservation-demo`。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo +spec: + template: # set resource requirements + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: # reserve 500m cpu and 800Mi memory + requests: + cpu: 500m + memory: 800Mi + schedulerName: koord-scheduler # use koord-scheduler + owners: # set the owner specifications + - object: # owner pods whose name is `default/pod-demo-0` + name: pod-demo-0 + namespace: default + ttl: 1h # set the TTL, the reservation will get expired 1 hour later +``` + +```bash +$ kubectl create -f reservation-demo.yaml +reservation.scheduling.koordinator.sh/reservation-demo created +``` + +2. 跟踪reservation-demo的状态,直到它变成可用状态。 + +```bash +$ kubectl get reservation reservation-demo -o wide +NAME PHASE AGE NODE TTL EXPIRES +reservation-demo Available 88s node-0 1h +``` + +3. 使用如下YAML文件部署一个Pod:`Pod-demo-0`。 + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-demo-0 # match the owner spec of `reservation-demo` +spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '1' + memory: 1Gi + requests: + cpu: 200m + memory: 400Mi + restartPolicy: Always + schedulerName: koord-scheduler # use koord-scheduler +``` + +```bash +$ kubectl create -f pod-demo-0.yaml +pod/pod-demo-0 created +``` + +4. 检查`Pod-demo-0`的调度状态。 + +```bash +$ kubectl get pod pod-demo-0 -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +pod-demo-0 1/1 Running 0 32s 10.17.0.123 node-0 +``` + +`Pod-demo-0`将会和`reservation-demo`被调度到同一个节点。 + +5. 检查`reservation-demo`的状态。 + +```bash +$ kubectl get reservation reservation-demo -oyaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo + creationTimestamp: "YYYY-MM-DDT05:24:58Z" + uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + ... +spec: + owners: + - object: + name: pod-demo-0 + namespace: default + template: + spec: + containers: + - args: + - -c + - "1" + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 800Mi + schedulerName: koord-scheduler + ttl: 1h +status: + allocatable: # total reserved + cpu: 500m + memory: 800Mi + allocated: # current allocated + cpu: 200m + memory: 400Mi + conditions: + - lastProbeTime: "YYYY-MM-DDT05:24:58Z" + lastTransitionTime: "YYYY-MM-DDT05:24:58Z" + reason: Scheduled + status: "True" + type: Scheduled + - lastProbeTime: "YYYY-MM-DDT05:24:58Z" + lastTransitionTime: "YYYY-MM-DDT05:24:58Z" + reason: Available + status: "True" + type: Ready + currentOwners: + - name: pod-demo-0 + namespace: default + uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy + nodeName: node-0 + phase: Available +``` + +现在我们可以看到`reservation-demo`预留了500m cpu和 800Mi内存, `Pod-demo-0`从预留的资源中分配了200m cpu and 400Mi内存。 + +6. 
清理`reservation-demo`的预留资源。 + +```bash +$ kubectl delete reservation reservation-demo +reservation.scheduling.koordinator.sh "reservation-demo" deleted +$ kubectl get pod pod-demo-0 +NAME READY STATUS RESTARTS AGE +pod-demo-0 1/1 Running 0 110s +``` + +在预留资源被删除后,`Pod-demo-0`依然正常运行。 + +### 高级特性 + +> 最新的API可以在这里查看: [`reservation_types`](https://github.com/koordinator-sh/koordinator/blob/main/apis/scheduling/v1alpha1/reservation_types.go)。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo +spec: + # pod template (required): Reserve resources and play pod/node affinities according to the template. + # The resource requirements of the pod indicates the resource requirements of the reservation + template: + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 800Mi + # scheduler name (required): use koord-scheduler to schedule the reservation + schedulerName: koord-scheduler + # owner spec (required): Specify what kinds of pods can allocate resources of this reservation. + # Currently support three kinds of owner specifications: + # - object: specify the name, namespace, uid of the owner pods + # - controller: specify the owner reference of the owner pods, e.g. name, namespace(extended by koordinator), uid, kind + # - labelSelector: specify the matching labels are matching expressions of the owner pods + owners: + - object: + name: pod-demo-0 + namespace: default + - labelSelector: + matchLabels: + app: app-demo + # TTL (optional): Time-To-Live duration of the reservation. The reservation will get expired after the TTL period. + # If not set, use `24h` as default. + ttl: 1h + # Expires (optional): Expired timestamp when the reservation is expected to expire. + # If both `expires` and `ttl` are set, `expires` is checked first. + expires: "YYYY-MM-DDTHH:MM:SSZ" +``` + + + +### 案例:多个属主在同一个节点预留资源 + +1. 检查每个节点的可分配资源。 + +```bash +$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory +NAME CPU MEMORY +node-0 7800m 28625036Ki +node-1 7800m 28629692Ki +... +$ kubectl describe node node-1 | grep -A 8 "Allocated resources" + Allocated resources: + (Total limits may be over 100 percent, i.e., overcommitted.) + Resource Requests Limits + -------- -------- ------ + cpu 780m (10%) 7722m (99%) + memory 1216Mi (4%) 14044Mi (50%) + ephemeral-storage 0 (0%) 0 (0%) + hugepages-1Gi 0 (0%) 0 (0%) + hugepages-2Mi 0 (0%) 0 (0%) +``` + +如上图,`node-1`节点还保留7.0 cpu and 26Gi memory未分配。 + +2. 
用如下YAML文件预留资源:`reservation-demo-big`。 + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo-big +spec: + template: + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: # reserve 6 cpu and 20Gi memory + requests: + cpu: 6 + memory: 20Gi + nodeName: node-1 # set the expected node name to schedule at + schedulerName: koord-scheduler + owners: # set multiple owners + - object: # owner pods whose name is `default/pod-demo-0` + name: pod-demo-1 + namespace: default + - labelSelector: # owner pods who have label `app=app-demo` can allocate the reserved resources + matchLabels: + app: app-demo + ttl: 1h +``` + +```bash +$ kubectl create -f reservation-demo-big.yaml +reservation.scheduling.koordinator.sh/reservation-demo-big created +``` + +3. 跟踪`reservation-demo-big`的状态,直到他变成可用状态。 + +```bash +$ kubectl get reservation reservation-demo-big -o wide +NAME PHASE AGE NODE TTL EXPIRES +reservation-demo-big Available 37s node-1 1h +``` + +`reservation-demo-big`将被调度到Pod模板中设置的nodeName属性节点:`node-1`。 + +4. 用如下YAML文件创建一次部署:`app-demo`。 + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: app-demo +spec: + replicas: 2 + selector: + matchLabels: + app: app-demo + template: + metadata: + name: stress + labels: + app: app-demo # match the owner spec of `reservation-demo-big` + spec: + schedulerName: koord-scheduler # use koord-scheduler + containers: + - name: stress + image: polinux/stress + args: + - '-c' + - '1' + command: + - stress + resources: + requests: + cpu: 2 + memory: 10Gi + limits: + cpu: 4 + memory: 20Gi +``` + +```bash +$ kubectl create -f app-demo.yaml +deployment.apps/app-demo created +``` + +5. 检查`app-demo`的Pod调度结果. + +```bash +k get pod -l app=app-demo -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +app-demo-798c66db46-ctnbr 1/1 Running 0 2m 10.17.0.124 node-1 +app-demo-798c66db46-pzphc 1/1 Running 0 2m 10.17.0.125 node-1 +``` + +`app-demo`的Pod将会被调度到`reservation-demo-big`所在的节点。 + +6. 检查`reservation-demo-big`的状态。 + +```bash +$ kubectl get reservation reservation-demo-big -oyaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo-big + creationTimestamp: "YYYY-MM-DDT06:28:16Z" + uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + ... 
+spec: + owners: + - object: + name: pod-demo-0 + namespace: default + template: + spec: + containers: + - args: + - -c + - "1" + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 800Mi + schedulerName: koord-scheduler + ttl: 1h +status: + allocatable: + cpu: 6 + memory: 20Gi + allocated: + cpu: 4 + memory: 20Gi + conditions: + - lastProbeTime: "YYYY-MM-DDT06:28:17Z" + lastTransitionTime: "YYYY-MM-DDT06:28:17Z" + reason: Scheduled + status: "True" + type: Scheduled + - lastProbeTime: "YYYY-MM-DDT06:28:17Z" + lastTransitionTime: "YYYY-MM-DDT06:28:17Z" + reason: Available + status: "True" + type: Ready + currentOwners: + - name: app-demo-798c66db46-ctnbr + namespace: default + uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy + - name: app-demo-798c66db46-pzphc + namespace: default + uid: zzzzzzzz-zzzz-zzzz-zzzzzzzzzzzz + nodeName: node-1 + phase: Available +``` + +现在我们能看到`reservation-demo-big`预留了6 cpu和20Gi内存,`app-demo`从预留的资源中分配了4 cpu and 20Gi内存,预留资源的分配不会增加节点资源的请求容量,否则`node-1`的请求资源总容量将会超过可分配的资源容量。而且当有足够的未分配的预留资源时,这些预留资源可以被同时分配给多个属主。 diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/slo-config.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/slo-config.md new file mode 100644 index 000000000..c565856ef --- /dev/null +++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/version-v1.3/user-manuals/slo-config.md @@ -0,0 +1,406 @@ +# SLO 配置 + +## 简介 + +Koordinator 使用一个 ConfigMap 管理 SLO 配置。该 ConfigMap 被 slo-controller 所使用,它的名字和命名空间可以在 koord-manager 的启 +动参数中指定(默认为 `koordinator-system/slo-controller-config`)。它分别包含了以下键值: + +- `colocation-config`:混部配置。例如,是否开启混部 Batch 资源,混部水位线。 +- `resource-threshold-config`:基于阈值的压制/驱逐策略的配置。例如,CPU 压制的阈值,内存驱逐的阈值。 +- `resource-qos-config`:QoS 特性的配置。例如,BE pods 的 Group Identity,LS pods 的内存 QoS,BE pods 的末级缓存划分。 +- `cpu-burst-config`:CPU Burst 特性的配置。例如,pod 的最大 burst 比例。 +- `system-config`:系统设定的配置。例如,全局内存最低水位线系数 `min_free_kbytes`。 + +### 配置层级 + +每个配置定义为集群级别和节点级别的形式。 + +例如, + +```go +type ColocationCfg struct { +ColocationStrategy `json:",inline"` +NodeConfigs []NodeColocationCfg `json:"nodeConfigs,omitempty"` +} + +type ResourceQOSCfg struct { +ClusterStrategy *slov1alpha1.ResourceQOSStrategy `json:"clusterStrategy,omitempty"` +NodeStrategies []NodeResourceQOSStrategy `json:"nodeStrategies,omitempty"` +} +``` + +集群级别配置用于设置全局配置,而节点级别则供用户调整部分节点的配置,特别是灰度部署的情况。 + +请注意,大部分可配置的字段都在组件内部(koordlet、koord-manager)有默认值,所以通常仅需要编辑变更的参数。 + +### NodeSLO + +SLO 配置的 data 字段会被 koord-manager 解析。Koord-manager 会检查配置数据是否合法,然后用解析后的配置更新到每个节点的 NodeSLO 对象中。 +如果解析失败,koord-manager 会在 ConfigMap 对象上记录 Events,以警示 unmarshal 错误。对于 agent 组件 koordlet,它会 watch NodeSLO +的 Spec,并对节点的 QoS 特性进行调谐。 + +```yaml +apiVersion: slo.koordinator.sh/v1alpha1 +kind: NodeSLO +metadata: + name: test-node +spec: + cpuBurstStrategy: {} + extensions: {} + resourceQOSStrategy: {} + systemStrategy: {} + # parsed from the `resource-threshold-config` data + resourceUsedThresholdWithBE: + cpuSuppressPolicy: cpuset + cpuSuppressThresholdPercent: 65 + enable: true + memoryEvictThresholdPercent: 70 +``` + +## 配置 + +> 参考版本:Koordinator v1.2 + +SLO 配置的模板如下: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + # colocation-config is the configuration for colocation. + # Related features: Dynamic resource over-commitment, Load-aware scheduling, Load-aware descheduling. + # - enable: whether to enable the colocation. 
If false, the reclaimed resources of the node allocatable (e.g. `kubernetes.io/batch-cpu`) will be removed. + # - metricAggregateDurationSeconds: the aggregated duration of node metrics reporting. + # - metricReportIntervalSeconds: the reporting interval of the node metrics. + # - metricAggregatePolicy: policies of reporting node metrics in different durations. + # - cpuReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed cpu resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold. + # - memoryReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed memory resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold. + # - memoryCalculatePolicy: the policy for calculating the reclaimable memory resource. If set to `request`, only unallocated memory resource of high-priority pods are reclaimable, and no allocated memory can be reclaimed. + # - degradeTimeMinutes: the threshold duration to degrade the colocation for which the node metrics has not been updated. + # - updateTimeThresholdSeconds: the threshold duration to force updating the reclaimed resources with the latest calculated result. + # - resourceDiffThreshold: the threshold to update the reclaimed resources than which the calculated reclaimed resources is different from the current. + # - nodeConfigs: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + colocation-config: | + { + "enable": false, + "metricAggregateDurationSeconds": 300, + "metricReportIntervalSeconds": 60, + "metricAggregatePolicy": { + "durations": [ + "5m", + "10m", + "15m" + ] + }, + "cpuReclaimThresholdPercent": 60, + "memoryReclaimThresholdPercent": 65, + "memoryCalculatePolicy": "usage", + "degradeTimeMinutes": 15, + "updateTimeThresholdSeconds": 300, + "resourceDiffThreshold": 0.1, + "nodeConfigs": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "updateTimeThresholdSeconds": 360, + "resourceDiffThreshold": 0.2 + } + ] + } + # The configuration for threshold-based strategies. + # Related features: BECPUSuppress, BEMemoryEvict, BECPUEvict. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - enable: whether to enable the threshold-based strategies or not. If false, all threshold-based strategies are disabled. If set to true, CPU Suppress and Memory Evict are enabled by default. + # - cpuSuppressThresholdPercent: the node cpu utilization threshold to suppress BE pods' usage. + # - cpuSuppressPolicy: the policy of cpu suppression. If set to `cpuset`, the BE pods' `cpuset.cpus` will be reconciled when suppression. If set to `cfsQuota`, the BE pods' `cpu.cfs_quota_us` will be reconciled. + # - memoryEvictThresholdPercent: the node memory utilization threshold to evict BE pods. + # - memoryEvictLowerPercent: the node memory utilization threshold to stop the memory eviction. By default, `lowerPercent = thresholdPercent - 2`. + # - cpuEvictBESatisfactionLowerPercent: the cpu satisfaction threshold to start the cpu eviction (also require to meet the BE util threshold). + # - cpuEvictBEUsageThresholdPercent: the BE utilization (BEUsage / BERealLimit) threshold to start the cpu eviction (also require to meet the cpu satisfaction threshold). 
+ # - cpuEvictBESatisfactionUpperPercent: the cpu satisfaction threshold to stop the cpu eviction. + # - cpuEvictTimeWindowSeconds: the time window of the cpu metrics for the cpu eviction. + resource-threshold-config: | + { + "clusterStrategy": { + "enable": false, + "cpuSuppressThresholdPercent": 65, + "cpuSuppressPolicy": "cpuset", + "memoryEvictThresholdPercent": 70, + "memoryEvictLowerPercent": 65, + "cpuEvictBESatisfactionUpperPercent": 90, + "cpuEvictBESatisfactionLowerPercent": 60, + "cpuEvictBEUsageThresholdPercent": 90 + }, + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "cpuEvictBEUsageThresholdPercent": 80 + } + ] + } + # The configuration for QoS-based features. + # Related features: CPUQoS (GroupIdentity), MemoryQoS (CgroupReconcile), ResctrlQoS. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - lsrClass/lsClass/beClass: the configuration for pods of QoS LSR/LS/BE respectively. + # - cpuQOS: the configuration of CPU QoS. + # - enable: whether to enable CPU QoS. If set to `false`, the related cgroup configs will be reset to the system default. + # - groupIdentity: the priority level of the Group Identity ([-1, 2]). `2` means the highest priority, while `-1` means the lowest priority. Anolis OS required. + # - memoryQOS: the configuration of Memory QoS. + # - enable: whether to enable Memory QoS. If set to `false`, the related cgroup configs will be reset to the system default. + # - minLimitPercent: the scale percentage for setting the `memory.min` based on the container's request. It enables the memory protection from the Linux memory reclaim. + # - lowLimitPercent: the scale percentage for setting the `memory.low` based on the container's request. It enables the memory soft protection from the Linux memory reclaim. + # - throttlingPercent: the scale percentage for setting the `memory.high` based on the container's limit. It enables the memory throttling in cgroup level. + # - wmarkRatio: the ratio of container-level asynchronous memory reclaim based on the container's limit. Anolis OS required. + # - wmarkScalePermill: the per-mill of container memory to reclaim in once asynchronous memory reclaim. Anolis OS required. + # - wmarkMinAdj: the adjustment percentage of global memory min watermark. It affects the reclaim priority when the node memory free is quite a few. Anolis OS required. + # - resctrlQOS: the configuration of Resctrl (Intel RDT) QoS. + # - enable: whether to enable Resctrl QoS. + # - catRangeStartPercent: the starting percentage of the L3 Cache way partitioning. L3 CAT required. + # - catRangeEndPercent: the ending percentage of the L3 Cache way partitioning. L3 CAT required. + # - mbaPercent: the allocation percentage of the memory bandwidth. MBA required. 
+ resource-qos-config: | + { + "clusterStrategy": { + "lsrClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": 2 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": -25, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 100, + "mbaPercent": 100 + } + }, + "lsClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": 2 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": -25, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 100, + "mbaPercent": 100 + } + }, + "beClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": -1 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": 50, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 30, + "mbaPercent": 100 + } + } + }, + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "beClass": { + "memoryQOS": { + "wmarkRatio": 90 + } + } + } + ] + } + # The configuration for the CPU Burst. + # Related features: CPUBurst. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - policy: the policy of CPU Burst. If set to `none`, the CPU Burst is disabled. If set to `auto`, the CPU Burst is fully enabled. If set to `cpuBurstOnly`, only the Linux CFS Burst feature is enabled. + # - cpuBurstPercent: the percentage of Linux CFS Burst. It affects the value of `cpu.cfs_burst_us` of pod/container cgroups. It specifies the percentage to which the CPU limit can be increased by CPU Burst. + # - cfsQuotaBurstPercent: the percentage of cfs quota burst. It affects the scaled ratio of `cpu.cfs_quota_us` of pod/container cgroups. It specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. + # - cfsQuotaBurstPeriodSeconds: the maximum period of once cfs quota burst. It indicates that the time period in which the container can run with an increased CFS quota is unlimited. + # - sharePoolThresholdPercent: the threshold of share pool utilization. If the share pool utilization is too high, CPU Burst will be stopped and reset to avoid machine overload. + cpu-burst-config: | + { + "clusterStrategy": { + "policy": "none", + "cpuBurstPercent": 1000, + "cfsQuotaBurstPercent": 300, + "cfsQuotaBurstPeriodSeconds": -1, + "sharePoolThresholdPercent": 50 + }, + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "policy": "cfsQuotaBurstOnly", + "cfsQuotaBurstPercent": 400 + } + ] + } + # The configuration for system-level settings. + # Related features: SystemConfig. + # - clusterStrategy: the cluster-level configuration. 
+ # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - minFreeKbytesFactor: the factor for calculating the global minimum memory free watermark `/proc/sys/vm/min_free_kbytes`. `min_free_kbytes = minFreeKbytesFactor * nodeTotalMemory / 10000`. + # - watermarkScaleFactor: the reclaim factor `/proc/sys/vm/watermark_scale_factor` in once global memory reclaim. + # - memcgReapBackGround: whether to enable the reaper for orphan memory cgroups. + system-config: |- + { + "clusterStrategy": { + "minFreeKbytesFactor": 100, + "watermarkScaleFactor": 150, + "memcgReapBackGround": 0 + } + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "minFreeKbytesFactor": 100, + "watermarkScaleFactor": 150 + } + ] + } +``` + +对于更多信息,请查看相关特性的用户手册和设计文档。 + +## 快速开始 + +1. 通过 ConfigMap `koordinator-system/slo-controller-config` 检查当前的 SLO 配置。 + +```bash +$ kubectl get configmap -n koordinator-system slo-controller-config -o yaml +apiVersion: v1 +kind: ConfigMap +metadata: + annotations: + meta.helm.sh/release-name: koordinator + meta.helm.sh/release-namespace: default + labels: + app.kubernetes.io/managed-by: Helm + name: slo-controller-config + namespace: koordinator-system +data: + colocation-config: | + { + "enable": false, + "metricAggregateDurationSeconds": 300, + "metricReportIntervalSeconds": 60, + "cpuReclaimThresholdPercent": 60, + "memoryReclaimThresholdPercent": 65, + "memoryCalculatePolicy": "usage", + "degradeTimeMinutes": 15, + "updateTimeThresholdSeconds": 300, + "resourceDiffThreshold": 0.1 + } + resource-threshold-config: | + { + "clusterStrategy": { + "enable": false + } + } +``` + +2. 编辑 ConfigMap `koordinator-system/slo-controller-config` 来修改 SLO 配置。 + +```bash +$ kubectl edit configmap -n koordinator-system slo-controller-config +``` + +例如,ConfigMap 编辑如下: + +```yaml +data: + # ... + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuSuppressThresholdPercent": 60, + "cpuSuppressPolicy": "cpuset", + "memoryEvictThresholdPercent": 60 + } + } +``` + +3. 确认 NodeSLO 是否成功下发。 + +> 注意:默认值会在 NodeSLO 中省略。 + +```bash +$ kubectl get nodeslo.slo.koordinator.sh test-node -o yaml +apiVersion: slo.koordinator.sh/v1alpha1 +kind: NodeSLO +metadata: + name: test-node +spec: + # ... + extensions: {} + resourceUsedThresholdWithBE: + cpuSuppressPolicy: cpuset + cpuSuppressThresholdPercent: 60 + enable: true + memoryEvictThresholdPercent: 60 +``` diff --git a/sidebars.js b/sidebars.js index d366332c1..c939b8a36 100644 --- a/sidebars.js +++ b/sidebars.js @@ -66,6 +66,7 @@ const sidebars = { 'designs/koordlet-overview', 'designs/runtime-proxy', 'designs/nri-mode-resource-management', + 'designs/node-prediction', 'designs/enhanced-scheduler-extension', 'designs/load-aware-scheduling', 'designs/fine-grained-cpu-orchestration', diff --git a/static/img/node-prediction.svg b/static/img/node-prediction.svg new file mode 100644 index 000000000..44e73090a --- /dev/null +++ b/static/img/node-prediction.svg @@ -0,0 +1,4 @@ + + + +
[node-prediction.svg: draw.io architecture diagram. Elements: Koordlet (Metrics Advisor, Metric Cache, States Informer, Predict Server), Koord Manager, Koord Scheduler, Kube APIServer, Node and Pods, and a Pod to be scheduled. Edges: collect stats, query metrics, Update NodeMetric, Update Node, Get NodeMetric, Get Node and Pod, Bind pod.]
\ No newline at end of file diff --git a/versioned_docs/version-v1.3/architecture/overview.md b/versioned_docs/version-v1.3/architecture/overview.md new file mode 100644 index 000000000..ad47a718a --- /dev/null +++ b/versioned_docs/version-v1.3/architecture/overview.md @@ -0,0 +1,58 @@ +# Overview + +This topic describes the architecture, components, and core concepts associated with Koordinator deployments to Kubernetes. Koordinator consists of two control planes ([Koordinator Scheduler](#koordinator-scheduler)/[Koordinator Manager](#koordinator-manager)) and one DaemonSet component ([Koordlet](#koordlet)). +Koordinator adds co-location capabilities on top of the original kubernetes, and maintains the compatibility of the original kubernetes workloads. + +![Architecture](/img/architecture.png) + +## Koord-Scheduler + +The Koordinator Scheduler is deployed as a ```Deployment```, which is used to enhance the resource scheduling capabilities of kubernetes in QoS-aware, differentiated SLO management, and job scheduling. Specifically including: + +- QoS-aware scheduling, including load-aware scheduling to make node load more balanced, resource overcommitment to run more computing workloads with low priority. +- Differentiated SLO management, including fine-grained CPU orchestration, different QoS policy(cfs/LLC/memory bw/net bw/blkio) for diffenent workloads. +- Job scheduling, including elastic quota scheduling, gang scheduling, heterogeneous resource scheduling, to support big-data and AI workloads. + +In order to better support diffenent workloads, the scheduler also provides a series of general capability enhancements: +- Reservation, an ability for reserving node resources for specific pods or workloads, which is widely used in descheduling, resource preemption and fragmentation optimization. +- Node reservation, an ability for reserving node resources for workloads out of kubernetes, which is typically used for non-containerized workloads. + +## Koord-Descheduler + +The Koordinator Descheduler is deployed as a ```Deployment```, which is an enhanced version of the community descheduler: + +- Framework, a descheduling framework with better scalability, determinism and security, for more [details](../designs/descheduler-framework). +- Load-aware descheduling, a descheduling plugins to support node load rebalancing, which supports user-defined CPU load level of nodes to avoids hotspot nodes. + +## Koord-Manager + +The Koordinator Manager is deployed as a ``` Deployment ```, usually consists of two instances, one leader and one backup. The Koordinator Manager consists of several controllers and webhooks, which are used to orchestrate co-located workloads and support resource overcommitment scheduling and SLO management. + +Currently, three components are provided: +- Colocation Profile, which used to support colocation without requiring modification of workloads. Users only need to do a small amount of configuration in the cluster, and the original workload can be run in a colocation mode, learn more about [Colocation Profile](../user-manuals/colocation-profile.md). +- SLO Controller, which is used for resource overcommitment model management, and dynamically adjusts the overcommitment ratio of the cluster according to the running status of the node co-location. 
The core responsibility of this controller is to manage co-located SLOs, such as intelligently identifying abnormal nodes in the cluster and lowering their weights, and dynamically adjusting the water level and suppression strategy of co-located, so as to ensure the stability and efficiency of Pods in the cluster. +- Recommender(coming soon), it uses histograms to count and predict the resource usage details of the workloads, which are used to estimate the peak resource requirements of the workloads, thereby supporting better hotspot dispersion and improving the efficiency of co-location. In addition, resource profiling will also be used to simplify the complexity of user resource specification configuration, such as to support automatic specification hosting (VPA). + + +## Koordlet + +The Koordlet is deployed as a ``` DaemonSet ``` in kubernetes cluster, which is used to support colocation resource overcommitment, interference detection, QoS guarantee, etc. + +Inside Koordlet, it mainly includes the following modules: +- Resource Profiling, which estimates the actual usage of Pod resources, and reclaims allocated but unused resources for overcommit low-priority pods according to the reclaimed resource. +- Resource Isolation, set resource isolation parameters for different types of Pods to avoid low-priority pods affecting the stability and performance of high-priority pods. +- Interference detection, for running Pods, dynamically detect resource contention, including CPU scheduling, memory allocation delay, network, disk IO delay, etc. +- QoS Manager, which dynamically adjusts the water level of node colocation based on resource profiling, interference detection results and SLO configuration, suppressing Pods that affect service quality. +- Resource Tuning, container resource tuning for co-located scenarios, optimize the container's CPU Throttle, OOM, etc., to improve the quality of service operation. + +## Koord-RuntimeProxy + +The Koord-RuntimeProxy is deployed as a ``` systemd service ``` in kubernetes node, which is designed to intercept CRI request, and apply some resource management policies, such as setting different cgroup parameters by pod priorities under hybrid workload orchestration scenario, applying new isolation policies for latest Linux kernel, CPU architecture, and etc. + +## What's Next + +Here are some recommended next steps: + +- Learn Koordinator's [Resource Model](./resource-model). +- Learn Koordinator's [Priority](./priority). +- Learn Koordinator's [QoS](./qos). diff --git a/versioned_docs/version-v1.3/architecture/priority.md b/versioned_docs/version-v1.3/architecture/priority.md new file mode 100644 index 000000000..950d5de5a --- /dev/null +++ b/versioned_docs/version-v1.3/architecture/priority.md @@ -0,0 +1,87 @@ +# Priority + +Koordinator defines a set of specifications on top of kubernetes priority class, and extends a dimension of priority to support fine-grained colocation. 
+ +## Definition + +Priority is represented by numbers, and four classes are currently defined: + +PriorityClass | Priority Ranges | Description +----- | ----------- | -------- +koord-prod | [9000, 9999] | Selling requires planning resources quota in advance, and success is guaranteed within quota +koord-mid | [7000, 7999] | Selling requires planning resources quota in advance, and success is guaranteed within quota +koord-batch | [5000, 5999] | Selling requires planning resources quota in advance, and quota borrowing is allowed generally +koord-free | [3000, 3999] | Resource quota is not guaranteed, and the total allocatable resource depends on the total idle resources of the cluster + +There is some white space between PriorityClass to support possible future extensions. + + +## Constraints + +Koordinator matches different types of workloads to different priorities: +- koord-prod, running typical latency sensitive services, generally referring to types of services that require a "real-time" response, such as a typical service called by clicking a button in the mobile APP. +- koord-mid, corresponding to the available resources estimated by the appeal long-term reservation, typically used to run some real-time computing, AI training jobs, such as tensorflow/pytorch, etc. +- koord-batch, corresponding to the available resources estimated by the appeal short-term reservation, run typical offline batch jobs, generally referring to offline analysis type jobs, such as day-level big data reports, non-interactive SQL queries. +- koord-free, run low-priority offline batch jobs, generally refers to not making resource budgets, using idle resources to complete as much as possible, such as developers submitting a job for testing purposes. + +## Koordinator Priority vs. Kubernetes Priority + +Koordinator initializes four PriorityClasses in the kubernetes cluster: +``` +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-prod +value: 9000 +description: "This priority class should be used for prod service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-mid +value: 7000 +description: "This priority class should be used for mid service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-batch +value: 5000 +description: "This priority class should be used for batch service pods only." +--- +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: koord-free +value: 3000 +description: "This priority class should be used for free service pods only." +``` + +Inside each PriorityClass, Koordinator allows users to set Pod colocation priorities for fine-grained resource scheduling. + +## Examples + +The following YAML is an example of a Pod configuration that uses the PriorityClass and Priority created in the preceding example. + +``` +apiVersion: v1 +kind: Pod +metadata: + name: nginx + labels: + env: test + koordinator.sh/priority: "5300" +spec: + containers: + - name: nginx + image: nginx + imagePullPolicy: IfNotPresent + priorityClassName: koord-batch +``` + +## What's Next + +Here are some recommended next steps: + +- Learn Koordinator's [Resource Model](./resource-model). +- Learn Koordinator's [QoS](./qos). 
diff --git a/versioned_docs/version-v1.3/architecture/qos.md b/versioned_docs/version-v1.3/architecture/qos.md new file mode 100644 index 000000000..2de95fc66 --- /dev/null +++ b/versioned_docs/version-v1.3/architecture/qos.md @@ -0,0 +1,36 @@ +# QoS + +QoS is used to express the running quality of the Pod on the node, such as the way to obtain resources, the proportion of resources obtained, and the QoS guarantee policy. + +## Definition + +There are five types of QoS supported by the Koordinator scheduling system: + +QoS | feature | Description +--- | ---- | ------------- +SYSTEM | system process, resource constrained | For system services such as DaemonSets, the latency needs to be guaranteed but it needs to limit the resource usage limit of all containers running on the node to ensure that system processes do not occupy too many resources +LSE(Latency Sensitive Exclusive) | reserve resources and organizing co-located pods to share resources | Rarely used, common in middleware-type applications, generally in independent resource pools +LSR(Latency Sensitive Reserved) | reserve resources for better certainty | Similar to Guaranteed by the community, CPU cores are bound +LS(Latency Sensitive) | share resources for better resilience to burst traffic | Typical QoS level for microservice workloads to achieve better resource elasticity and more flexible resource adjustment capabilities +BE(Best Effort) | share resource exclude LSE, the quality of resource operation is limited, or even killed in extreme cases | Typical QoS level for batch jobs, stable computing throughput within a certain period, low-cost resources + +## QoS CPU Orchestration + +![img](/img/qos-cpu-orchestration.png) + + +## Koordinator QoS vs. Kubernetes QoS + +As seen in the [Definition](#definition) section, Koordinator's QoS is more complicated than Kubernetes QoS, because in colocation scenarios, we need to fine-tune the QoS for latency-sensitive workloads to meet the needs of co-located performance. + +There is a correspondence between Koordinator and Kubernetes QoS: + +Koordinator QoS | Kubernetes QoS +--------------- | -------------- +SYSTEM | --- +LSE | Guaranteed +LSR | Guaranteed +LS | Guaranteed/Burstable +BE | BestEffort + +Koordlet triggers corresponding resource isolation and QoS guarantee according to Pod's Priority and QoS definition. diff --git a/versioned_docs/version-v1.3/architecture/resource-model.md b/versioned_docs/version-v1.3/architecture/resource-model.md new file mode 100644 index 000000000..c4296fbda --- /dev/null +++ b/versioned_docs/version-v1.3/architecture/resource-model.md @@ -0,0 +1,34 @@ +# Resource Model + +Colocation is a set of resource scheduling solutions for the fine grained orchestration of latency sensitive workloads with the big data computing workloads. It needs to solve two major problems: + +1. How to schedule resources for latency sensitive workloads to meet performance and long-tail latency requirements, the key points are resource scheduling strategies and QoS-aware strategies. +2. How to schedule and arrange big data computing workloads to meet the needs of jobs for computing resources at a lower cost. The key is how to achieve reasonable resource overcommitment and QoS protection in extreme abnormal scenarios. + + +## Definition + +![Resource Model](/img/resource-model.png) + +The above figure is the Koordinator colocation resource model, the basic idea is to use those allocated but unused resources to run low-priority pods. Four lines as shown: +1. 
limit: gray, the amount of resources requested by the high-priority Pod, corresponding to the Pod request of kubernetes. +2. usage: red, the amount of resources actually used by the Pod, the horizontal axis is the time line, and the red line is the fluctuation curve of the Pod load over time. +3. short-term reservation: dark blue, which is based on the resource usage of usage in the past (shorter) period, and the estimation of its resource usage in the future period of time. The difference between reservation and limit is the allocated unused ( resources that will not be used in the future) can be used to run short-lived batch pods. +4. long-term reservation: light blue, similar to short-term reservation but the estimated historical period of use is longer. The resources from reservation to limit can be used for pods with a longer life cycle, compared with the predicted value of short-term, fewer resources available but more stability. + +The entire co-located resource scheduling building is constructed based on the resource model shown above, which can not only meet the resource requirements of various workloads, but also make full use of the idle resources of the cluster. + +## SLO Description + +A Pod resource SLO running in a cluster consists of two concepts, Priority and QoS: +- Priority, the resource priority, represents the priority of the request being scheduled. Typically, Priority affects the relative position of the request in the scheduler pending queue. +- QoS, which represents the quality of service when the Pod runs. Such as cgroups cpu shares, cfs quota, LLC, memory bandwidth, OOM Priority, etc. + +It should be noted that Priority and QoS are two-dimensional concepts, but in real business scenarios, there will be some constraints between the two (not all combinations are legal). + +## What's Next + +Here are some recommended next steps: + +- Learn Koordinator's [Priority](./priority). +- Learn Koordinator's [QoS](./qos). diff --git a/versioned_docs/version-v1.3/best-practices/anolis_plugsched.md b/versioned_docs/version-v1.3/best-practices/anolis_plugsched.md new file mode 100644 index 000000000..08d4bc17f --- /dev/null +++ b/versioned_docs/version-v1.3/best-practices/anolis_plugsched.md @@ -0,0 +1,36 @@ +--- +sidebar_position: 2 +--- + +# Anolis Plugsched + +In order to improve the colocation effect on CentOS 7.9 operating system kernel in the CPU resource dimension, Anolis community provides a plug-in solution, which is to use the plugsched to provide a scheduler plug-in package for CPU colocation technology. This plug-in can be directly installed on CentOS 7.9 without downtime and business migration. For more details about plugsched, please refer to the [Blog](https://koordinator.sh/blog/anolis-CPU-Co-location). + +## Prerequisites + +- Kernel: The kernel must be the official CentOS 7.9 kernel. +- version == 3.10.0 +- release >= 1160.81.1 + +## Use Plugsched + +### Install the plug-in + + ``` + # rpm -ivh https://github.com/koordinator-sh/koordinator/releases/download/v1.1.1/scheduler-bvt-noise-clean-$(uname -r).rpm + ``` + +If you update the kernel version, you can use the following command to install the new plug-in. + ``` + # rpm -ivh https://github.com/koordinator-sh/koordinator/releases/download/v1.1.1/scheduler-bvt-noise-clean-$(uname -r).rpm --oldpackage + ``` + +After installation, you can see the `cpu.bvt_warp_ns` in cpu cgroup directory and the usage of it is compatible with Group Identity. 
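
For a quick sanity check after installing the plug-in, the command below (a minimal sketch; it assumes the default cgroup v1 CPU hierarchy mounted at `/sys/fs/cgroup/cpu`, as on a stock CentOS 7.9 node) lists a few of the newly exposed interface files:

```
# find /sys/fs/cgroup/cpu/ -name cpu.bvt_warp_ns | head -n 3   # one file per cpu cgroup is expected
```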
+ +### Removing plug-in + +Removing the plug-in can use the `rpm -e` command and the `cpu.bvt_warp_ns` doesn't exist either. Please make sure that no tasks are still using `cpu.bvt_warp_ns` before uninstalling. + +## Use Koordinator CPU QoS feature + +Please refer to [User Manual](../user-manuals/cpu-qos.md). \ No newline at end of file diff --git a/versioned_docs/version-v1.3/best-practices/colocation-of-spark-jobs.md b/versioned_docs/version-v1.3/best-practices/colocation-of-spark-jobs.md new file mode 100644 index 000000000..ee27d1389 --- /dev/null +++ b/versioned_docs/version-v1.3/best-practices/colocation-of-spark-jobs.md @@ -0,0 +1,101 @@ +--- +sidebar_position: 1 +--- + +# Colocation of Spark Jobs +Apache Spark is an analysis engine for large-scale data processing, which is widely used in Big Data, SQL Analysis and Machine Learning scenarios. This tutorial provides a quick practice guide about running Spark jobs in colocation mode with other latency sensitive applications by Koordinator, which is helpful for improving cluster resource utilization. For more details about how to use, compose, and work with Koordinator colocation, please refer to the [Introduction](../) + +## Requirements +### Koordinator Components +Before submitting Spark jobs as colocate mode, you need to ensure all Koordinator components have already been successfully installed. Please follow the step in [Installation](../installation) guide. + +### Install Kubernetes Operator for Apache Spark +To simplify running of Spark jobs in Cluster, we import the Kubernetes Operator for Apache Spark in this practice, which uses Kubernetes custom resource for managing Spark applications. + +With the help of Helm [chart](https://github.com/koordinator-sh/koordinator/tree/main/examples/spark-operator-chart), Kubernetes Operator for Apache Spark can be easily installed using the command below. +``` +$ helm install koord-spark-operator ./spark-operator-chart/ --namespace spark-operator +``` + +Installing the chart will create a namespace `spark-operator` and if doesn't exist, besides, helm will create a spark-operator Deployment and set up RBAC role for it. After the installation, you should see the operator in running successfully by checking the status of helm release. +``` +$ helm status --namespace spark-operator koord-spark-operator +``` + +## Run Spark Applications with Koordinator +Due to the mechanism that Spark driver pod needs a Kubernetes service account to manage executor pods, the service account must be authorized with appropriate permissions. Run the following command to create namespace `spark-demo` and service account `spark` before submitting jobs. +``` +$ kubectl apply -f examples/spark-jobs/service-account.yaml +``` + +Next, run the following command to create Colocation Profile so that all pods submitted following in namespace `spark-demo` will run in colocation mode. See this [tutorial](../user-manuals/colocation-profile) to learn more about Colocation Profile. +``` +$ kubectl apply -f examples/spark-jobs/cluster-colocation-profile.yaml +``` + +Submit a Spark TC example job to namespace `spark-demo` with the command: +``` +$ kubectl apply -f examples/spark-jobs/spark-tc-complex.yaml +``` + +Then, check the status of Spark application by running the following command. 
+``` +$ kubectl get sparkapplication -n spark-demo spark-tc-complex +``` + +This will show similar content as following: +``` +NAME STATUS ATTEMPTS START FINISH AGE +spark-tc-complex RUNNING 1 2022-03-30T09:11:22Z 14s +``` +Now, all pods submitted to namespace `spark-demo` will be switched to colocation mode, check spark-driver pod as below for example. We can see the protocols like`koordinator.sh/qosClass: BE` and `kubernetes.io/batch-cpu` are successfully injected to pod by Colocation Profile. +``` +apiVersion: v1 +kind: Pod +metadata: + labels: + koordinator.sh/qosClass: BE + spark-role: driver + ... +spec: + containers: + - args: + - driver + - --properties-file + - /opt/spark/conf/spark.properties + - --class + - org.apache.spark.examples.SparkTC + - local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1-tc1.2.jar + resources: + limits: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + requests: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + ... +``` + +## Evaluation +With the help of Koordinator, when pods resource usage is idle, resources already requested can be reallocated to other colocation pods by the overcommitment model, which can significantly improve the resource utilization of cluster. + +In our experiment environment, before the Spark job submitted, we can see the cluster allocatable resources run out while the actual resource usage is in low level. +``` +$ kubectl describe node + Allocated resources: + Resource Requests + cpu 7620m (95.25%) + +$ kubectl top node + NAME CPU(cores) CPU% + cn-hangzhou.your-node-1 1190m 14.8% + cn-hangzhou.your-node-2 1620m 20.25% +``` + +After submit the Spark job in colocation mode, those unused resources will be reallocated through `batch priority` to Spark pods, so that we can make the cluster a higher utilization level. +``` +$ kubectl top node +NAME CPU(cores) CPU% +cn-hangzhou.your-node-1 4077m 52% +cn-hangzhou.your-node-2 3830m 49% +``` \ No newline at end of file diff --git a/versioned_docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md b/versioned_docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..92851eb8c --- /dev/null +++ b/versioned_docs/version-v1.3/best-practices/fine-grained-cpu-orchestration.md @@ -0,0 +1,259 @@ +# Coordinated sharing of CPU resources in Colocation Scenarios - Fine-grained CPU Orchestration + +## Introduction + +In a cloud-native environment, users often deploy different types of workloads in the same cluster, leveraging different peak effects of different services to achieve time-sharing multiplexing of resources and avoid resource waste. However, colocation of different types of workloads often leads to resource competition and mutual interference. The most typical scenario is the colocation of online and offline workloads. When more computing resources are occupied by offline workloads, the response time of online loads will be affected; when more computing resources are occupied by online workloads for a long time, the task completion time of offline workloads cannot be guaranteed. This phenomenon belongs to the Noisy Neighbor problem. + +Depending on the degree of colocation and resource types, there are many different ways to solve this problem. Quota management can limit the resource usage of loads from the entire cluster dimension, and Koordinator provides multi-level elastic quota management functions in this regard. 
From the single-node level, CPU, memory, disk IO, and network resources may be shared by different loads. Koordinator has provided some resource isolation and guarantee capabilities on CPU and memory, and related capabilities on disk IO and network resources are under construction.
+
+This article mainly introduces how Koordinator helps loads (online with online, online with offline) share CPU resources collaboratively when different types of workloads are colocated on the same node.
+
+## Problem Description
+
+The essence of the CPU resource Noisy Neighbor problem is that different workloads share CPU resources without coordination.
+1. The default resource model of Kubernetes uses cgroup (cfs quota) to limit the access of different loads to CPU resources in terms of CPU time usage. In this case, some workloads may be switched between CPU cores by the operating system scheduler. Since different CPU cores have different memory access times to different physical locations, switching CPU cores results in longer memory access time and thus degrades load performance.
+2. In the NUMA architecture, SMT threads (logical cores) share the execution units and L2 cache of physical cores.
+When there are multiple workloads on the same physical core, resource contention happens between them, resulting in load performance degradation.
+
+Kubernetes provides the Topology Manager and CPU Manager at the node level to solve the above problems. However, these features only attempt to take effect after the Pod has been scheduled onto the machine. This may lead to the situation where Pods are scheduled to nodes with sufficient CPU resources but the topology requirements are not met.
+
+## Solutions
+
+### Application-Oriented CPU Orchestration QoS Semantics
+
+In response to the above problems and deficiencies, Koordinator designed application-oriented QoS semantics and a CPU orchestration protocol, as shown in the figure below.
+
+![img](/img/qos-cpu-orchestration.png)
+
+LS (Latency Sensitive) is applied to typical microservice loads, and Koordinator isolates them from other latency-sensitive loads to ensure their performance. LSR (Latency Sensitive Reserved) is similar to Kubernetes' Guaranteed: building on LS, it adds the semantics that the application requires reserved, bound cores. LSE (Latency Sensitive Exclusive) is common in applications that are particularly sensitive to CPU, such as middleware. In addition to semantics similar to LSR's core-binding requirement, Koordinator also ensures that the allocated CPUs are not shared with any other load.
+
+Also, to improve resource utilization, BE workloads can share CPUs with LSR and LS. To ensure that the latency-sensitive applications sharing CPUs with BE are not disturbed by it, Koordinator provides strategies such as interference detection and BE suppression. These are not the focus of this article; readers can refer to follow-up articles.
+
+### Rich CPU scheduling strategies
+
+For LSE applications, when the machine has a hyper-threaded architecture, only logical cores can be guaranteed to be exclusive to the load. In this case, when there are other loads on the same physical core, application performance will still be disturbed.
+To this end, Koordinator supports users in configuring rich CPU scheduling policies through pod annotations to improve performance.
+
+CPU orchestration policies are divided into CPU-binding policies and CPU-exclusive policies.
The CPU binding strategy determines the distribution of logical cores assigned to the application among physical cores, which can be spread or stacked among physical cores. Stacking (FullPCPU) refers to allocating complete physical cores to applications, which can effectively alleviate the Noisy Neighbor problem. SpreadByPCPU is mainly used in some delay-sensitive applications with different peak and valley characteristics, allowing the application to fully use the CPU at a specific time. The CPU exclusive policy determines the exclusive level of logical cores assigned to the application, and it can try to avoid physical cores or NUMANodes that have been applied for with the exclusive policy. + +### Enhanced CPU Scheduling Capabilities + +Koordinator supports the configuration of NUMA allocation strategies to determine how to select satisfactory NUMA nodes during scheduling. MostAllocated indicates allocation from the NUMA node with the least available resources, which can reduce fragmentation as much as possible and leave more allocation space for subsequent loads. However, this approach may cause the performance of parallel code that relies on Barriers to suffer. DistributeEvenly means that evenly distributing CPUs on NUMA nodes can improve the performance of the above parallel code. LeastAllocated indicates allocation from the NUMA node with the most available resources. + +In addition, Koordinator's CPU allocation logic is completed in the central scheduler. In this way, there will be a global perspective, avoiding the dilemma of single-node solution, where CPU resources may be sufficient but topology requirements are not met. + +## Best Practices +As can be seen from the above, Koordinator's fine-grained CPU orchestration capability can significantly improve the performance of CPU-sensitive workloads in multi-application colocation scenarios. In order to allow readers to use Koordinator’s fine-grained CPU scheduling capabilities more clearly and intuitively, this article deploys online applications to clusters in different ways, and observes the latency of services in stress testing to judge the effect of CPU scheduling capabilities. + +In this article, multiple online applications will be deployed on the same machine and pressure tested for 10 minutes to fully simulate the CPU core switching scenarios that may occur in production practice. For the colocation of online and offline applications, Koordinator provides strategies such as interference detection and BE suppression. The focus of this article is not here, and readers can pay attention to the practice in subsequent articles. 
+ +|Group Number|Deployment Mode|Description|Scenarios| +|-|-|-|-| +|A|10 online applications are deployed on the nodes, and each node applies for 4 CPUs, all using kubernetes guaranteed QoS|Koordinator does not provide fine-grained CPU orchestration capabilities for applications|Due to CPU core switching, applications share logical cores, application performance will be affected, and it is not recommended to use| +|B|Deploy 10 online applications on the nodes, each application node has 4 CPUs, all adopt LSE QoS, CPU binding strategy adopts physical core binpacking(FullPCPUs)|Koordinator provides CPU core binding capability for LSE Pod and online applications will not share physical cores|Particularly sensitive online scenarios, application cannot accept CPU sharing at the physical core level| +|C|Deploy 10 online applications on the node, each application node has 4 CPUs, all adopt LSR QoS, CPU binding strategy adopts physical core split (SpreadByPCPUs), use CPU exclusively by physical cpu level|Koordinator provides CPU binding core capability for LSR Pod and online application logical core can use more physical core capacity|It is often used to share physical cores with offline Pods and implement time-sharing multiplexing at the physical core level. This article does not focus on the mixed deployment of online and offline applications, so it only tests the overuse of online applications| + +This experiment uses the following performance indicators to evaluate the performance of Nginx applications under different deployment methods: + +- RT (Response Time) quantile value: RT is a performance indicator that online applications usually focus on. The lower the RT, the better the online service performance. The RT indicator is obtained by collecting the information printed after the wrk pressure tests. In the experiment, it reflects the time it takes for the Nginx application to respond to the wrk request. For example, RT-p50 indicates the maximum time (median) it takes for Nginx to respond to the first 50% of wrk requests, and RT-p90 indicates the maximum time it takes for Nginx to respond to the first 90% of wrk requests. +- RPS (Request Per Second): RPS is the number of requests served by an online application per second. The more RPS a service bears, the better the performance of the online service. + + +The experimental results are as follows: + +|Performance Indicators/Deployment Mode| A(colocation of two online applications, Guaranteed)|B(colocation of two online applications, LSE、FullPCPU)|C(colocation of two online applications, LSR、SpreadByPCPU、PCPULevel| +|-|-|-|-| +|RPS| 114778.29|114648.19|115268.50| +|RT-avg (ms)|3.46 ms|3.33 ms|3.25 ms| +|RT-p90 (ms)|5.27 ms|5.11 ms|5.06 ms| +|RT-p99 (ms)|15.22 ms|12.61 ms|12.14 ms| + +- Comparing B and A, it can be found that after adopting LSE QoS to bind the core, the service response time P99 is significantly reduced, and the long tail phenomenon is well alleviated +- Comparing C and B, it can be found that after using LSR QoS to bind cores and allowing logical cores to occupy more physical core resources, more requests can be tolerated with better service response time + +In summary, in the scenario where online services are deployed on the same machine, using Koordinator to refine the CPU arrangement can effectively suppress the Noisy Neighbor problem and reduce the performance degradation caused by CPU core switching. + +### Environemnt + +First, prepare a Kubernetes cluster and install Koordinator. 
This article chooses two nodes of a Kubernetes cluster to do the experiment, one of the nodes is used as a test machine, which will run the Nginx online server; the other node is used as a pressure test machine, which will run the client's wrk, request the Nginx online server, and make pressure test requests . + +### Online application deployment + +1. Inject fine-grained CPU orchestration protocols into applications using ColocationProfile + + Group B fine-grained CPU orchestration protocol + + ```yaml + apiVersion: config.koordinator.sh/v1alpha1 + kind: ClusterColocationProfile + metadata: + name: colocation-profile-example + spec: + selector: + matchLabels: + app: nginx + # 采用 LSE QoS + qosClass: LSE + annotations: + # 采用物理核间堆叠 + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"FullPCPUs"}' + priorityClassName: koord-prod + ``` + + Group C fine-grained CPU orchestration protocol + + ```yaml + apiVersion: config.koordinator.sh/v1alpha1 + kind: ClusterColocationProfile + metadata: + name: colocation-profile-example + spec: + selector: + matchLabels: + app: nginx + # 采用 LSR QoS + qosClass: LSR + annotations: + # 采用物理核间打散且独占物理核 + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy":"SpreadByPCPUs", "preferredCPUExclusivePolicy":"PCPULevel"}' + priorityClassName: koord-prod + ``` + +2. This article uses Nginx server as Online Service , Pod YAML is as follows: + + ```yaml + --- + # nginx应用配置 + apiVersion: v1 + data: + config: |- + user nginx; + worker_processes 4; # Nginx的Worker个数,影响Nginx Server的并发。 + + events { + worker_connections 1024; # 默认值为1024。 + } + + http { + server { + listen 8000; + + gzip off; + gzip_min_length 32; + gzip_http_version 1.0; + gzip_comp_level 3; + gzip_types *; + } + } + + #daemon off; + kind: ConfigMap + metadata: + name: nginx-conf-0 + --- + # Nginx实例,作为在线类型服务应用。 + apiVersion: v1 + kind: Pod + metadata: + labels: + app: nginx + name: nginx-0 + namespace: default + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + - "${node_name}" + schedulerName: koord-scheduler + priorityClassName: koord-prod + containers: + - image: 'koordinatorsh/nginx:v1.18-koord-exmaple' + imagePullPolicy: IfNotPresent + name: nginx + ports: + - containerPort: 8000 + hostPort: 8000 # 压测请求访问的端口。 + protocol: TCP + resources: + limits: + cpu: '4' + memory: 8Gi + requests: + cpu: '4' + memory: 8Gi + volumeMounts: + - mountPath: /apps/nginx/conf + name: config + hostNetwork: true + restartPolicy: Never + volumes: + - configMap: + items: + - key: config + path: nginx.conf + name: nginx-conf-0 + name: config + ``` + +3. Execute the following command to deploy the Nginx application. + + ```bash + kubectl apply -f nginx.yaml + ``` + +4. Execute the following command to view the Pod status of the Nginx application. + + ```bash + kubectl get pod -l app=nginx -o wide + ``` + + You can see output similar to the following, indicating that the Nginx application has been running normally on the test machine. + + ``` + NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES + nginx-0 1/1 Running 0 2m46s 10.0.0.246 test-machine-name + + ``` + +### Load Test + +1. On the testing machine, execute the following command to deploy the stress testing tool wrk. + + ```bash + wget -O wrk-4.2.0.tar.gz https://github.com/wg/wrk/archive/refs/tags/4.2.0.tar.gz && tar -xvf wrk-4.2.0.tar.gz + cd wrk-4.2.0 && make && chmod +x ./wrk + ``` + +2. 
On the testing machine, execute the following command to deploy the load testing tool wrk + + ```bash + # node_ip填写测试机的IP地址,用于wrk向测试机发起压测;8000是Nginx暴露到测试机的端口。 + taskset -c 32-45 ./wrk -t120 -c400 -d600s --latency http://${node_ip}:8000/ + ``` + +3. After waiting for wrk to finish running, obtain the pressure test results of wrk. The output format of wrk is as follows. Repeat the test several times to obtain relatively stable results. + + ``` + Running 10m test @ http://192.168.0.186:8000/ + 120 threads and 400 connections + Thread Stats Avg Stdev Max +/- Stdev + Latency 3.29ms 2.49ms 352.52ms 91.07% + Req/Sec 0.96k 321.04 3.28k 62.00% + Latency Distribution + 50% 2.60ms + 75% 3.94ms + 90% 5.55ms + 99% 12.40ms + 68800242 requests in 10.00m, 54.46GB read + Requests/sec: 114648.19 + Transfer/sec: 92.93MB + ``` + +## Conclusion + +In a Kubernetes cluster, there may be competition for resources such as CPU and memory among different business loads, which affects the performance and stability of the business. In the face of the Noisy Neighbor phenomenon, users can use Koordinator to configure more refined CPU scheduling policies for applications, so that different applications can share CPU resources collaboratively. We have shown through experiments that Koordinator's fine-grained CPU scheduling capability can effectively suppress the competition for CPU resources and improve application performance. \ No newline at end of file diff --git a/versioned_docs/version-v1.3/designs/descheduler-framework.md b/versioned_docs/version-v1.3/designs/descheduler-framework.md new file mode 100644 index 000000000..e054a557a --- /dev/null +++ b/versioned_docs/version-v1.3/designs/descheduler-framework.md @@ -0,0 +1,84 @@ +# Descheduler Framework + +## Summary + +This proposal is based on the K8s community's [descheduler](https://github.com/kubernetes-sigs/descheduler) to design and implement the descheduler framework required by the koordinator. + +## Motivation + +The existing [descheduler](https://github.com/kubernetes-sigs/descheduler) in the community can solve some problems, but we think that there are still many aspects of the descheduler that can be improved, for example, it only supports the mode of periodic execution, and does not support the event-triggered mode. It is not possible to extend and configure custom rescheduling strategies without invading the existing code of descheduler like kube-scheduler; it also does not support implementing custom evictor. + +We also noticed that the K8s descheduler community also found these problems and proposed corresponding solutions such as [#753 Descheduler framework Proposal](https://github.com/kubernetes-sigs/descheduler/issues/753) and [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781). The K8s descheduler community tries to implement a descheduler framework similar to the k8s scheduling framework. This coincides with our thinking. + +On the whole, these solutions solved most of our problems, but we also noticed that the related implementations were not merged into the main branch. But we review these implementations and discussions, and we believe this is the right direction. Considering that Koordiantor has clear milestones for descheduler-related features, we will implement Koordinator's own descheduler independently of the upstream community. 
We try to use some of the designs in the [#753 PR](https://github.com/kubernetes-sigs/descheduler/issues/753) proposed by the community, and we will follow Koordinator's compatibility principle with K8s to maintain compatibility with the upstream community descheduler when implementing. Such an independent implementation can also drive the evolution of the upstream community's work on the descheduler framework. And when the upstream community has new changes or switches to an architecture that Koordinator deems appropriate, Koordinator will follow up promptly and actively.
+
+### Goals
+
+1. Implement Koordinator Descheduler following part of the design in [#753](https://github.com/kubernetes-sigs/descheduler/issues/753) proposed by the community
+
+### Non-Goals/Future Work
+
+1. Break any existing use cases of the Descheduler.
+
+## Proposal
+
+### Implementation Details/Notes/Constraints
+
+#### Descheduler profile
+
+The current descheduler configuration is too simple to support disabling or enabling plugins or custom plugin configurations. The [PR #587](https://github.com/kubernetes-sigs/descheduler/pull/587) introduces descheduler profiles with the v1alpha2 API version. We will use this proposal as Koordinator Descheduler's configuration API.
+
+- The descheduler profile API supports users specifying which extension points are enabled/disabled, alongside specifying plugin configuration, including the ability to configure multiple descheduling profiles.
+- The descheduling framework configuration can be converted into an internal representation.
+- To reduce the need to specify a value for every possible configuration, defaulting also serves as recommended/opinionated settings for the plugins.
+
+#### Abstract PodEvictor interface
+
+Currently, the descheduler has split `Pod Evictor` and `Evictor Filter`. Users can inject an `Evictor Filter` on demand; a plug-in calls the `Evictor Filter` when selecting abnormal Pods to pick the Pods that meet the requirements, and then calls `Pod Evictor` to initiate eviction. At present, `Pod Evictor` has not been abstracted as an interface. We adopt the solution in [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781) to abstract an `Evictor` interface, and refer to [PR #885](https://github.com/kubernetes-sigs/descheduler/pull/885) to add an `EvictOptions` parameter. We can implement a custom Evictor based on [PodMigrationJob](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md).
+
+The `Evictor` interface is defined as follows:
+
+```go
+type EvictOptions struct {
+	// PluginName represents the initiator of the eviction operation
+	PluginName string
+	// Reason allows for passing details about the specific eviction for logging.
+	Reason string
+	// DeleteOptions holds the arguments used to delete the Pod
+	DeleteOptions *metav1.DeleteOptions
+}
+
+// Plugin is the parent type for all the descheduling framework plugins.
+type Plugin interface {
+	Name() string
+}
+
+type Evictor interface {
+	Plugin
+	// Filter checks if a pod can be evicted
+	Filter(pod *corev1.Pod) bool
+	// Evict evicts a pod (no pre-check performed)
+	Evict(ctx context.Context, pod *corev1.Pod, evictOptions EvictOptions) bool
+}
+```
+
+#### Plug-in descheduler strategy
+
+The current descheduler has some strategies. In [PoC #781](https://github.com/kubernetes-sigs/descheduler/pull/781), they are converted into `Plugin`s and executed periodically.
In this `periodic execution mode`, it is appropriate to abstract the policy for Pod and Node dimensions as `DeschedulePlugin` or `BalancePlugin`. The load hotspot descheduling capability that we will implement later can also implement the BalancePlugin interface. + +The `DeschedulePlugin` and `BalancePlugin` interfaces defined as follows: + +```go +type DeschedulePlugin interface { + Plugin + Deschedule(ctx context.Context, nodes []*corev1.Node) *Status +} + +type BalancePlugin interface { + Plugin + Balance(ctx context.Context, nodes []*corev1.Node) *Status +} +``` + +We also need to support the `event-triggered mode`, which means that descheduling is performed in the form of a Controller. +In some scenarios, CRD-oriented descheduling needs to be implemented. For example, different descheduling configurations are provided according to the workload. When some abnormality is detected in the workload, descheduling will be triggered. We can think of Controller as a special form of Plugin. When the descheduler is initialized, an instance is constructed through the plugin factory function like a normal Plugin, and then a similar Run method is called to start execution. \ No newline at end of file diff --git a/versioned_docs/version-v1.3/designs/enhanced-scheduler-extension.md b/versioned_docs/version-v1.3/designs/enhanced-scheduler-extension.md new file mode 100644 index 000000000..8c61c719d --- /dev/null +++ b/versioned_docs/version-v1.3/designs/enhanced-scheduler-extension.md @@ -0,0 +1,232 @@ +# Enhanced Scheduler Extension + +## Summary + +This proposal describes how to extend the kubernetes scheduling framework without modify upstream codes to support the scheduling features that Koordinator needs to develop. + +## Motivation + +Although Kubernetes Scheduler provides the scheduling framework to help developer to extend scheduling features. However, it cannot support the features that Koordinator needs to develop, such as Reservation, problem diagnosis and analysis, etc. + +### Goals + +1. Provides scheduling extension point hook mechanism +1. Provides scheduling plugins expose state mechanism to help diagnose analysis problems + +### Non-Goals/Future Work + + +## Proposal + +### User stories + +#### Story 1 + +Koordiantor supports users to use `Reservation` CRD to reserve resources. We expect Reservation CRD objects to be scheduled like Pods. In this way, the native scheduling capabilities of Kubernetes and other extended scheduling capabilities can be reused. This requires a mechanism to disguise the Reservation CRD object as a Pod, and to extend some scheduling framework extension points to support updating the Reservation Status. + +#### Story 2 + +Koordinator provides some scheduling plugins, such as Fine-grained CPU Scheduling, Device Share Scheduling, Coscheduling, ElasticQuota, etc. These plugins are brand new, and the supported scenarios are relatively rich, and the internal logic and state of the plugins are also a bit complicated. When we may encounter some problems in the production environment and need to be diagnosed and analyzed, we need to confirm the cause of the problem based on the internal status of the plugin. But currently the kubernetes scheduling framework does not provide a mechanism to export the internal state of the plugin. + +#### Story 3 + +The scheduler provides many plugins, and most plugins implement Scoring Extension Point. How to configure the weights of these plugins needs to be decided in combination with specific problems. 
When the optimal node is selected according to the scoring results, the results may not meet expectations. At this point we need to be able to trace or debug these scoring results in some way. But there is currently no good way. + +### Design Details + +#### Enhancement Kubernetes Scheduling Framework principles + +At present, the kube-scheduler provided by Kubernetes can be divided into several parts. The outermost layer is `k8s.io/kubernetes/cmd/kube-scheduler`, which is the entrance of kube-scheduler; `k8s.io/kubernetes/pkg/scheduler` is responsible for integrating the framework And execute scheduling workflow, including initializing framework and plugins, scheduling Pod, etc. The core module is `k8s.io/kubernetes/pkg/scheduler/framwork`, which is the **Kubernetes Scheduling Framework**. + +Each layer provides some interfaces or methods to support developers to extend some capabilities, and the evolution speed of each layer is also different. Generally speaking, the evolution speed of the more core modules should be slower, and the evolution of core modules tends to extend rather than modify the existing interface or extension mechanism, otherwise it will bring very large cost and reliability to external dependencies. question. But each layer does not support implementing some features for some reason. But as far as the problems Koordinator is currently experiencing, there are still some workarounds. However, some principles need to be followed to reduce future conflicts with the evolution of the upstream Kubernetes community. + +1. DO NOT modify the Kubernetes Scheduling Framework. The scheduling framework is the core module of kube-scheduler and is still evolving. In order to avoid conflict with the upstream community between Koordinator's enhanced capabilities. +1. DO NOT modify the `k8s.io/kubernetes/pkg/scheduler` but can implements supported interfaces or high-order functions, such as `ScheduleAlgorithm`, `NextPod`, `Error` and `Profiles`. The `Profiles` contains an instance of the Framework interface corresponding to each KubeSchedulerProfile. We can implement the Framework and replace the instances in Profiles to get the opportunity to participate in the scheduling process to do something. +1. Extend `k8s.io/kubernetes/cmd/kube-scheduler` as simply as possible. + +#### Custom Extension Overview + +![image](/img/scheduler-extension.jpg) + +#### ExtendedHandle + +ExtendedHandle extends the k8s scheduling framework `Handle` interface to facilitate plugins to access Koordinator's resources and states. +Before constructs the `k8s.io/kubernetes/pkg/scheduler.Scheduler` object, we should build an ExtendedHandle object and pass the object to each custom plugins. + +```go +type ExtendedHandle interface { + framework.Handle + KoordinatorClientSet() koordinatorclientset.Interface + KoordinatorSharedInformerFactory() koordinatorinformers.SharedInformerFactory + SnapshotSharedLister() framework.SharedLister +} +``` + +#### Intercept plugin initialization process + +In order to pass the `ExtendedHandle` object to each custom plugins, we should intercept the plugin initialization process. +And we expect that any customized plugins can be directly and seamlessly integrated into the koordinator scheduler, so the `PluginFactory` of the plugin will not be changed. 
Therefore, we can modify the prototype of `k8s.io/kubernetes/cmd/kube-scheduler/app.Option` and the implementation of `k8s.io/kubernetes/cmd/kube-scheduler/app.WithPlugin` as the follows to get the opportunity to intercept the plugin initialization process. + +When the custom plugin is registered to the out-of registry using `WithPlugin`, it will use `frameworkext.PluginFactoryProxy` to wrap the plugin's original `PluginFactory`. We finally complete the interception of the plugin initialization process in `frameworkext.PluginFactoryProxy`. + +Of course, we will not modify `k8s.io/kubernetes/cmd/kube-scheduler` directly. Considering that the logic of `k8s.io/kubernetes/cmd/kube-scheduler` itself is not complicated, it will basically not bring us additional maintenance costs, so we will copy the relevant code to Koordinator for separate maintenance. + + +```go + +// Option configures a framework.Registry. +type Option func(frameworkext.ExtendedHandle, runtime.Registry) error + +// WithPlugin creates an Option based on plugin name and factory. Please don't remove this function: it is used to register out-of-tree plugins, +// hence there are no references to it from the kubernetes scheduler code base. +func WithPlugin(name string, factory runtime.PluginFactory) Option { + return func(handle frameworkext.ExtendedHandle, registry runtime.Registry) error { + return registry.Register(name, frameworkext.PluginFactoryProxy(handle, factory)) + } +} + +// frameworkext.PluginFactoryProxy +func PluginFactoryProxy(extendHandle ExtendedHandle, factoryFn frameworkruntime.PluginFactory) frameworkruntime.PluginFactory { + return func(args runtime.Object, handle framework.Handle) (framework.Plugin, error) { + impl := extendHandle.(*frameworkExtendedHandleImpl) + impl.once.Do(func() { + impl.Handle = handle + }) + return factoryFn(args, extendHandle) + } +} +``` + +#### Expose the internal state of plugins + +We will define a new extension interface to help the plugin expose the internal state through the Restful API, and provide some built-in Restful APIs to query which APIs are exposed by the current scheduler and some commonly used internal data, such as NodeInfo, etc. + +The new extension interface named `APIServiceProvider`. The plugins can implement this interface to register the API to be exposed as needed. When the plugin is initialized, `frameworkext.PluginFactoryProxy` will check whether the newly constructed plugin implements `APIServiceProvider`, and if so, it will call the `RegisterEndpoints` method of the interface to register the API. The Restful APIs exposed by these plugins will be bound to the URL path `/apis/v1/plugins/` and will be prefixed with the name of each plugin. For example, the API `/availableCPUs/:nodeName` exposed by the plugin `NodeNUMAResource` will be converted to `/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName`. + + +```go +type APIServiceProvider interface { + RegisterEndpoints(group *gin.RouterGroup) +} + +type ErrorMessage struct { + Message string `json:"message,omitempty"` +} + +func ResponseErrorMessage(c *gin.Context, statusCode int, format string, args ...interface{}) { + var e ErrorMessage + e.Message = fmt.Sprintf(format, args...) + c.JSON(statusCode, e) +} +``` + +Users can use the built-in API `/apis/v1/__services__` to query how many Restful APIs are provided by the current scheduler. 
The response as the follows: + +```json +{ + "GET": [ + "/apis/v1/__services__", + "/apis/v1/nodes/:nodeName", + "/apis/v1/plugins/Coscheduling/gang/:namespace/:name", + "/apis/v1/plugins/Coscheduling/gangs", + "/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName", + "/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/:nodeName" + ] +} +``` + +Koordinator scheduler also provides `/apis/v1/nodes/:nodeNa` to expose internal `NodeInfo` to developers. + + +#### Support plugin to create Controllers + +Similar to Coscheduling/ElasticQuota Scheduling, these scheduling plugins have a matching Controller to synchronize the status of the related CRD. The most common way is to deploy these controllers independently of the scheduler. This method will not only bring additional maintenance costs and resource costs, but also if there are more states in the scheduling plugin that need to be synchronized to the CRD Status, the logic in the Controller and the logic in the plugin need to be more closely coordinated. The best way is that the Controller and the scheduling plugin are in the same process. + +We can define a new interface called `ControllerProvider`. When the plugin is initialized, `frameworkext.PluginFactoryProxy` will check whether the newly constructed plugin implements `ControllerProvider`, and if so, it will call the `NewControllers` method of the interface to get the instances of Controllers, and save these instances in the `ExtendedHandle`. When the scheduler gets the leader role, it can trigger the `ExtendedHandle` to start these controllers. + +```go +type ControllerProvider interface { + NewControllers() ([]Controller, error) +} + +type Controller interface { + Start() + Name() string +} +``` + + +#### Debug Scoring Result + +If we want to support debug scoring results, the easiest way is to directly modify `Framework.RunScorePlugins` and print the results after scoring. But this goes against the extend principles we laid out earlier. But we can think differently. When `scheduler.Scheduler` executes `scheduleOne`, it obtains an instance of the `framework.Framework` interface from `Profiles` and calls the method `RunScorePlugins`. At the same time, considering that we have maintained the initialization code of scheduler separately, then we can customize the implementation of the `framework.Framework` interface, implement the method `RunScorePlugins` and take over the `Profiles` in `scheduler.Scheduler`. In this way, we can first call the `RunScorePlugins` method of the original `framework.Framework` interface instance in the custom implemented `RunScorePlugins`, and then print the result. + +For the processing of the results, we can simply print it to the log in markdown format. When needed, enable Scoring Result debugging capability through the HTTP interface `/debug/flags/s` like `/debug/flags/v`. The developers also enable the capability via flags `--debug-scores`. + +```bash +# print top 100 score results. 
+$ curl -X POST schedulerIP:10251/debug/flags/s --data '100' +successfully set debugTopNScores to 100 +``` + +The following are the specific scoring results: + + +``` +| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration | +| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| +| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 | +| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 | +| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 | +| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 | +``` + +| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration | +| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| +| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 | +| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 | +| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 | +| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 | + + +#### Custom Hook Extend Points to Support Reservation Scheduling + +If we want to schedule the Reservation CRD object in the form of Pod, we need to solve several problems: + +1. Before calling `PreFilter`, check whether the Pod has a matching Reservation. If there is a matching Reservation, and if the Pod is set with `Pod Affinity/AntiAffinity` or `TopologySpreadConstraints`, we need to modify the Pod to remove these fields. The reason is that when the Reservation CRD object is created, the user generally sets these fields, and expects to find suitable nodes to reserve resources according to these scheduling constraints. Therefore, if the Pod is scheduled with the same fields, it will cause the scheduling to fail. To do this, it cannot be achieved by implementing the `PreFilter` extension point, because the scheduler has already obtained the appropriate Pod to start executing when calling `PreFilter`, and we have lost the opportunity to modify the Pod to affect other plugins. +1. In the `Filter` phase, we also need to update the NodeInfo. If there is a Reservation CRD object on NodeInfo, and the current Pod matches the Reservation CRD object, then the resources applied for by the Reservation CRD object should be returned to NodeInfo. Only in this way can it pass the resource check of the scheduler, including the network port check. + +To solve these problems, we define the `Hook` interface. The plugin can be implemented on demand, and the Pod or NodeInfo can be modified when the PreFilter/Filter is executed. Similar to the custom implementation method `RunScorePlugins` mentioned above, we can customize the implementation methods `RunPreFilterPlugins` and `RunFilterPluginsWithNominatedPods`. 
Before executing the real extension point logic, first execute the `Hook` interface and modify the Pod and NodeInfo. + +If necessary, you can modify the Pod or Node before executing the Score Extension Point by implementing ScorePhaseHook. + +Considering that there may be multiple different Hooks to modify the Pod or NodeInfo requirements, when the Hook is called, the Hook will be called cyclically, and the modification result of the previous Hook and the input of the next Hook will continue to be executed. + +Here are some additional explanations for the scenarios in which these new extension points should be used. If you can complete the scheduling function through the extension points such as Filter/Score provided by the K8s Scheduling Framework without modifying the incoming NodeInfo/Pod and other objects, you do not need to use these new extension points. + +```go +type SchedulingPhaseHook interface { + Name() string +} + +type PreFilterPhaseHook interface { + SchedulingPhaseHook + PreFilterHook(handle ExtendedHandle, state *framework.CycleState, pod *corev1.Pod) (*corev1.Pod, bool) +} + +type FilterPhaseHook interface { + SchedulingPhaseHook + FilterHook(handle ExtendedHandle, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) (*corev1.Pod, *framework.NodeInfo, bool) +} + +type ScorePhaseHook interface { + SchedulingPhaseHook + ScoreHook(handle ExtendedHandle, cycleState *framework.CycleState, pod *corev1.Pod, nodes []*corev1.Node) (*corev1.Pod, []*corev1.Node, bool) +} + +``` + +## Alternatives + +### Use Filter instead of Filter Hook + +We can change the order of Filter plugins to support Reservation Scheduling to update NodeInfo earlier, which can replace Filter Hook. Subsequent implementations can be implemented as an optimization. diff --git a/versioned_docs/version-v1.3/designs/fine-grained-cpu-orchestration.md b/versioned_docs/version-v1.3/designs/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..c250091a5 --- /dev/null +++ b/versioned_docs/version-v1.3/designs/fine-grained-cpu-orchestration.md @@ -0,0 +1,452 @@ +# Fine-grained CPU orchestration + +## Summary + +This proposal defines the fine-grained CPU orchestration for Koordinator QoS in detail, and how to be compatible with the existing design principles and implementations of K8s. This proposal describes the functionality that koordlet, koord-runtime-proxy and koord-scheduler need to enhance. + +## Motivation + +An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment. + +In order to extract the best performance, optimizations related to CPU isolation, NUMA-locality are required. + +### Goals + +1. Improve the CPU orchestration semantics of Koordinator QoS. +1. Determine compatible kubelet policies. +1. Clarify how koordlet should enhance CPU scheduling mechanism. +1. Provide a set of API such as CPU bind policies, CPU exclusive policies, NUMA topology alignment policies, NUMA topology information, etc. for applications and cluster administrator to support complex CPU orchestration scenarios. +1. Provide the CPU orchestration optimization API. + +### Non-Goals/Future Work + +1. Describe specific design details of koordlet/koord-runtime-proxy. +1. 
Describe specific design details of CPU descheduling mechanism. + +## Proposal + +### Design Overview + +![image](/img/cpu-orchestration-seq-uml.svg) + +When koordlet starts, koordlet gather the NUMA topology information from kubelet include NUMA Topology, CPU Topology, kubelet cpu management policy, kubelet allocated CPUs for Guaranteed Pods etc., and update to the NodeResourceTopology CRD. The latency-sensitive applications are scaling, the new Pod can set Koordinator QoS with LSE/LSR, CPU Bind policy and CPU exclusive policy to require koord-scheduler to allocate best-fit CPUs to get the best performance. When koord-scheduler scheduling the Pod, koord-scheduler will filter Nodes that satisfied NUMA Topology Alignment policy, and select the best Node by scoring, allocating the CPUs in Reserve phase, and records the result to Pod annotation when PreBinding. koordlet hooks the kubelet CRI request to replace the CPUs configuration parameters with the koord-scheduler scheduled result to the runtime such as configure the cgroup. + +### User stories + +#### Story 1 + +Compatible with kubelet's existing CPU management policies. The CPU manager policy `static` allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity in the node. If enabled the `static` policy, the cluster administrator must configure the kubelet reserve some CPUs. There are some options for `static` policy. If the `full-pcpus-only(beta, visible by default)` policy option is specified, the `static` policy will always allocate full physical cores. If the `distribute-cpus-across-numa(alpha, hidden by default)` option is specified, the `static` policy will evenly distribute CPUs across NUMA nodes in cases where more than one NUMA node is required to satisfy the allocation. + +#### Story 2 + +Similarly, the semantics of the existing K8s Guaranteed Pods in the community should be compatible. The cpu cores allocated to K8s Guaranteed Pods with `static` policy will not share to the default best effort Pods, so it is equivalent to LSE. But when the load in the node is relatively low, the CPUs allocated by LSR Pods should be shared with best effort workloads to obtain economic benefits. + +#### Story 3 + +The Topology Manager is a kubelet component that aims to coordinate the set of components that are responsible for these optimizations. After Topology Manager was introduced the problem of launching pod in the cluster where worker nodes have different NUMA topology and different amount of resources in that topology became actual. The Pod could be scheduled in the node where the total amount of resources is enough, but resource distribution could not satisfy the appropriate Topology policy. + +#### Story 4 + +The scheduler can coordinate the arrangement between latency-sensitive applications. For example, the same latency-sensitive applications can be mutually exclusive in the CPU dimension, and latency-sensitive applications and general applications can be deployed in the CPU dimension affinity. Costs can be reduced and runtime quality can be guaranteed. + +#### Story 5 + +When allocating CPUs based on NUMA topology, users want to have different allocation strategies. For example, bin-packing takes precedence, or assigns the most idle NUMA Node. + +#### Story 6 + +As the application scaling or rolling deployment, the best-fit allocatable space will gradually become fragmented, which will lead to the bad allocation effect of some strategies and affect the runtime effect of the application. 
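+
+For reference, the kubelet CPU Manager and Topology Manager behaviors mentioned in Story 1 and Story 3 above are usually enabled through the kubelet configuration. The following is only an illustrative sketch using upstream `KubeletConfiguration` (v1beta1) fields; the concrete values are assumptions for the example, not Koordinator requirements:
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+# The static policy grants integer-CPU Guaranteed pods exclusive CPUs (Story 1 and Story 2).
+cpuManagerPolicy: static
+cpuManagerPolicyOptions:
+  full-pcpus-only: "true"        # always allocate whole physical cores
+# CPUs reserved for system daemons when the static policy is enabled (example value).
+reservedSystemCPUs: "0,1"
+# Align allocations with the NUMA topology (Story 3).
+topologyManagerPolicy: single-numa-node
+topologyManagerScope: pod
+```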
+ +## Design Details + +### Basic CPU orchestration principles + +1. Only supports the CPU allocation mechanism of the Pod dimension. +1. Koordinator divides the CPU on the machine into `CPU Shared Pool`, `statically exclusive CPUs` and `BE CPU Shared Pool`. + 1. The `CPU Shared Pool` is the set of CPUs on which any containers in `K8s Burstable` and `Koordinator LS` Pods run. Containers in `K8s Guaranteed` pods with `fractional CPU requests` also run on CPUs in the shared pool. The shared pool contains all unallocated CPUs in the node but excluding CPUs allocated by K8s Guaranteed, LSE and LSR Pods. If kubelet reserved CPUs, the shared pool includes the reserved CPUs. + 1. The `statically exclusive CPUs` are the set of CPUs on which any containers in `K8s Guaranteed`, `Koordinator LSE` and `LSR` Pods that have integer CPU run. When K8s Guaranteed, LSE and LSR Pods request CPU, koord-scheduler will be allocated from the `CPU Shared Pool`. + 1. The `BE CPU Shared pool` is the set of CPUs on which any containers in `K8s BestEffort` and `Koordinator BE` Pods run. The `BE CPU Shared Pool` contains all CPUs in the node but excluding CPUs allocated by `K8s Guaranteed` and `Koordinator LSE` Pods. + +### Koordinator QoS CPU orchestration principles + +1. The Request and Limit of LSE/LSR Pods **MUST** be equal and the CPU value **MUST** be an integer multiple of 1000. +1. The CPUs allocated by the LSE Pod are completely **exclusive** and **MUST NOT** be shared. If the node is hyper-threading architecture, only the logical core dimension is guaranteed to be isolated, but better isolation can be obtained through the `CPUBindPolicyFullPCPUs` policy. +1. The CPUs allocated by the LSR Pod only can be shared with BE Pods. +1. LS Pods bind the CPU shared pool, **excluding** CPUs allocated by LSE/LSR Pods. +1. BE Pods bind all CPUs in the node, **excluding** CPUs allocated by LSE Pods. +1. The K8s Guaranteed Pods already running is equivalent to Koordinator LSR if kubelet enables the CPU manager `static` policy. +1. The K8s Guaranteed Pods already running is equivalent to Koordinator LS if kubelet enables the CPU manager `none` policy. +1. Newly created K8s Guaranteed Pod without Koordinator QoS specified will be treated as LS. + +![img](/img/qos-cpu-orchestration.png) + +### Compatible kubelet CPU management policies + +1. If kubelet set the CPU manager policy options `full-pcpus-only=true` / `distribute-cpus-across-numa=true`, and there is no new CPU bind policy defined by Koordinator in the node, follow the semantics of these parameters defined by the kubelet. +1. If kubelet set the Topology manager policy, and there is no new NUMA Topology Alignment policy defined by Koordinator in the node, follow the semantics of these parameters defined by the kubelet. + +### Take over kubelet CPU management policies + +Because the CPU reserved by kubelet mainly serves K8s BestEffort and Burstable Pods. But Koordinator will not follow the policy. The K8s Burstable Pods should use the CPU Shared Pool, and the K8s BestEffort Pods should use the `BE CPU Shared Pool`. + +1. For K8s Burstable and Koordinator LS Pods: + 1. When the koordlet starts, calculates the `CPU Shared Pool` and applies the shared pool to all Burstable and LS Pods in the node, that is, updating their cgroups to set cpuset. The same logic is executed when LSE/LSR Pods are creating or destroying. + 1. koordlet ignore the CPUs reserved by kubelet, and replace them with CPU Shared Pool defined by Koordinator. +1. 
For K8s BestEffort and Koordinator BE Pods: + 1. If kubelet reserved CPUs, the best effort Pods use the reserved CPUs first. + 1. koordlet can use all CPUs in the node but exclude the CPUs allocated by K8s Guaranteed and Koordinator LSE Pods that have integer CPU. It means that if koordlet enables the CPU Suppress feature should follow the constraint to guarantee not affecting LSE Pods. Similarly, if kubelet enables the CPU manager policy with `static`, the K8s Guaranteed Pods should also be excluded. +1. For K8s Guaranteed Pods: + 1. If there is `scheduling.koordinator.sh/resource-status` updated by koord-scheduler in the Pod annotation, then replace the CPUSet in the kubelet CRI request, including Sandbox/Container creating stage. + 1. kubelet sometimes call `Update` method defined in CRI to update container cgroup to set new CPUs, so koordlet and koord-runtime-proxy need to hook the method. +1. Automatically resize CPU Shared Pool + 1. koordlet automatically resize `CPU Shared Pool` based on the changes such as Pod creating/destroying. If `CPU Shared Pool` changed, koordlet should update cgroups of all LS/K8s Burstable Pods with the CPUs of shared pool. + 1. If the corresponding `CPU Shared Pool` is specified in the annotation `scheduling.koordinator.sh/resource-status` of the Pod, koordlet need to bind only the CPUs of the corresponding pool when configuring the cgroup. + +The takeover logic will require koord-runtime-proxy to add new extension points, and require koordlet to implement a new runtime hook plugin. When koord-runtime-proxy is not installed, these takeover logic will also be able to be implemented. + +## CPU orchestration API + +### Application CPU CPU orchestration API + +#### Resource spec + +The annotation `scheduling.koordinator.sh/resource-spec` is a resource allocation API defined by Koordinator. The user specifies the desired CPU orchestration policy by setting the annotation. In the future, we can also extend and add resource types that need to be supported as needed. The scheme corresponding to the annotation value is defined as follows: + +```go +// ResourceSpec describes extra attributes of the compute resource requirements. 
+type ResourceSpec struct { + PreferredCPUBindPolicy CPUBindPolicy `json:"preferredCPUBindPolicy,omitempty"` + PreferredCPUExclusivePolicy CPUExclusivePolicy `json:"preferredCPUExclusivePolicy,omitempty"` +} + +type CPUBindPolicy string + +const ( + // CPUBindPolicyDefault performs the default bind policy that specified in koord-scheduler configuration + CPUBindPolicyDefault CPUBindPolicy = "Default" + // CPUBindPolicyFullPCPUs favor cpuset allocation that pack in few physical cores + CPUBindPolicyFullPCPUs CPUBindPolicy = "FullPCPUs" + // CPUBindPolicySpreadByPCPUs favor cpuset allocation that evenly allocate logical cpus across physical cores + CPUBindPolicySpreadByPCPUs CPUBindPolicy = "SpreadByPCPUs" + // CPUBindPolicyConstrainedBurst constrains the CPU Shared Pool range of the Burstable Pod + CPUBindPolicyConstrainedBurst CPUBindPolicy = "ConstrainedBurst" +) + +type CPUExclusivePolicy string + +const ( + // CPUExclusivePolicyDefault performs the default exclusive policy that specified in koord-scheduler configuration + CPUExclusivePolicyDefault CPUExclusivePolicy = "Default" + // CPUExclusivePolicyPCPULevel represents mutual exclusion in the physical core dimension + CPUExclusivePolicyPCPULevel CPUExclusivePolicy = "PCPULevel" + // CPUExclusivePolicyNUMANodeLevel indicates mutual exclusion in the NUMA topology dimension + CPUExclusivePolicyNUMANodeLevel CPUExclusivePolicy = "NUMANodeLevel" +) +``` + +- The `CPUBindPolicy` defines the CPU binding policy. The specific values are defined as follows: + - `CPUBindPolicyDefault` or empty value performs the default bind policy that specified in koord-scheduler configuration. + - `CPUBindPolicyFullPCPUs` is a bin-packing policy, similar to the `full-pcpus-only=true` option defined by the kubelet, that allocate full physical cores. However, if the number of remaining logical CPUs in the node is sufficient but the number of full physical cores is insufficient, the allocation will continue. This policy can effectively avoid the noisy neighbor problem. + - `CPUBindPolicySpreadByPCPUs` is a spread policy. If the node enabled Hyper-Threading, when this policy is adopted, the scheduler will evenly allocate logical CPUs across physical cores. For example, the current node has 8 physical cores and 16 logical CPUs. When a Pod requires 8 logical CPUs and the `CPUBindPolicySpreadByPCPUs` policy is adopted, the scheduler will allocate an logical CPU from each physical core. This policy is mainly used by some latency-sensitive applications with multiple different peak-to-valley characteristics. It can not only allow the application to fully use the CPU at certain times, but will not be disturbed by the application on the same physical core. So the noisy neighbor problem may arise when using this policy. + - `CPUBindPolicyConstrainedBurst` a special policy that mainly helps K8s Burstable/Koordinator LS Pod get better performance. When using the policy, koord-scheduler is filtering out Nodes that have NUMA Nodes with suitable CPU Shared Pool by Pod Limit. After the scheduling is successful, the scheduler will update `scheduling.koordinator.sh/resource-status` in the Pod, declaring the `CPU Shared Pool` to be bound. 
The koordlet binds the CPU Shared Pool of the corresponding NUMA Node according to the `CPU Shared Pool` + - If `kubelet.koordinator.sh/cpu-manager-policy` in `NodeResourceTopology` has option `full-pcpus-only=true`, or `node.koordinator.sh/cpu-bind-policy` in the Node with the value `PCPUOnly`, the koord-scheduler will check whether the number of CPU requests of the Pod meets the `SMT-alignment` requirements, so as to avoid being rejected by the kubelet after scheduling. koord-scheduler will avoid such nodes if the Pod uses the `CPUBindPolicySpreadByPCPUs` policy or the number of logical CPUs mapped to the number of physical cores is not an integer. +- The `CPUExclusivePolicy` defines the CPU exclusive policy, it can help users to avoid noisy neighbor problems. The specific values are defined as follows: + - `CPUExclusivePolicyDefault` or empty value performs the default exclusive policy that specified in koord-scheduler configuration. + - `CPUExclusivePolicyPCPULevel`. When allocating logical CPUs, try to avoid physical cores that have already been applied for by the same exclusive policy. It is a supplement to the `CPUBindPolicySpreadByPCPUs` policy. + - `CPUExclusivePolicyNUMANodeLevel`. When allocating logical CPUs, try to avoid NUMA Nodes that has already been applied for by the same exclusive policy. If there is no NUMA Node that satisfies the policy, downgrade to `CPUExclusivePolicyPCPULevel` policy. + +For the ARM architecture, `CPUBindPolicy` only support `CPUBindPolicyFullPCPUs`, and `CPUExclusivePolicy` only support `CPUExclusivePolicyNUMANodeLevel`. + +#### Resource status + +The annotation `scheduling.koordinator.sh/resource-status` represents resource allocation result. koord-scheduler patch Pod with the annotation before binding to node. koordlet uses the result to configure cgroup. + +The scheme corresponding to the annotation value is defined as follows: + +```go +type ResourceStatus struct { + CPUSet string `json:"cpuset,omitempty"` + CPUSharedPools []CPUSharedPool `json:"cpuSharedPools,omitempty"` +} +``` + +- `CPUSet` represents the allocated CPUs. When LSE/LSR Pod requested, koord-scheduler will update the field. It is Linux CPU list formatted string. For more details, please refer to [doc](http://man7.org/linux/man-pages/man7/cpuset.7.html#FORMATS). +- `CPUSharedPools` represents the desired CPU Shared Pools used by LS Pods. If the Node has the label `node.koordinator.sh/numa-topology-alignment-policy` with `Restricted/SingleNUMANode`, koord-scheduler will find the best-fit NUMA Node for the LS Pod, and update the field that requires koordlet uses the specified CPU Shared Pool. It should be noted that the scheduler does not update the `CPUSet` field in the `CPUSharedPool`, koordlet binds the CPU Shared Pool of the corresponding NUMA Node according to the `Socket` and `Node` fields in the `CPUSharedPool`. + +#### Example + +The following specific example: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + scheduling.koordinator.sh/resource-spec: |- + { + "preferredCPUBindPolicy": "SpreadByPCPUs", + "preferredCPUExclusivePolicy": "PCPULevel" + } + scheduling.koordinator.sh/resource-status: |- + { + "cpuset": "0-3" + } + name: test-pod + namespace: default +spec: + ... +``` + +### Node CPU orchestration API + +From the perspective of cluster administrators, it is necessary to support some APIs to control the CPU orchestration behavior of nodes. 
+ +#### CPU bind policy + +The label `node.koordinator.sh/cpu-bind-policy` constrains how to bind CPU logical CPUs when scheduling. + +The following is the specific value definition: +- `None` or empty value does not perform any policy. +- `FullPCPUsOnly` requires that the scheduler must allocate full physical cores. Equivalent to kubelet CPU manager policy option `full-pcpus-only=true`. +- `SpreadByPCPUs` requires that the schedler must evenly allocate logical CPUs across physical cores. + +If there is no `node.koordinator.sh/cpu-bind-policy` in the node's label, it will be executed according to the policy configured by the Pod or koord-scheduler. + +#### NUMA allocate strategy + +The label `node.koordinator.sh/numa-allocate-strategy` indicates how to choose satisfied NUMA Nodes when scheduling. The following is the specific value definition: +- `MostAllocated` indicates that allocates from the NUMA Node with the least amount of available resource. +- `LeastAllocated` indicates that allocates from the NUMA Node with the most amount of available resource. +- `DistributeEvenly` indicates that evenly distribute CPUs across NUMA Nodes. + +If the cluster administrator does not set label `node.koordinator.sh/numa-allocate-strategy` on Node, but `kubelet.koordinator.sh/cpu-manager-policy` in `NodeResourceTopology` has option `distribute-cpus-across-numa=true`, then follow the semantic allocation of `distribute-cpus-across-numa`. + +If there is no `node.koordinator.sh/numa-allocate-strategy` in the node's label and no `kubelet.koordinator.sh/cpu-manager-policy` with `distribute-cpus-across-numa` option in `NodeResourceTopology`, it will be executed according to the policy configured by the koord-scheduler. + +If both `node.koordinator.sh/numa-allocate-strategy` and `kubelet.koordinator.sh/cpu-manager-policy` are defined, `node.koordinator.sh/numa-allocate-strategy` is used first. + +#### NUMA topology alignment policy + +The label `node.koordinator.sh/numa-topology-alignment-policy` represents that how to aligning resource allocation according to the NUMA topology. The policy semantic follow the K8s community. Equivalent to the field `TopologyPolicies` in `NodeResourceTopology`, and the topology policies `SingleNUMANodePodLevel` and `SingleNUMANodeContainerLevel` are mapping to `SingleNUMANode` policy. + +- `None` is the default policy and does not perform any topology alignment. +- `BestEffort` indicates that preferred select NUMA Node that is topology alignment, and if not, continue to allocate resources to Pods. +- `Restricted` indicates that each resource requested by a Pod on the NUMA Node that is topology alignment, and if not, koord-scheduler will skip the node when scheduling. +- `SingleNUMANode` indicates that all resources requested by a Pod must be on the same NUMA Node, and if not, koord-scheduler will skip the node when scheduling. + +If there is no `node.koordinator.sh/numa-topology-alignment-policy` in the node's label and `TopologyPolicies=None` in `NodeResourceTopology`, it will be executed according to the policy configured by the koord-scheduler. + +If both `node.koordinator.sh/numa-topology-alignment-policy` in Node and `TopologyPolicies=None` in `NodeResourceTopology` are defined, `node.koordinator.sh/numa-topology-alignment-policy` is used first. 
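+
+To make the precedence rules above concrete, the effective alignment policy could be resolved roughly as follows. This is only a sketch of the semantics described in this section; the function and parameter names are illustrative and not the actual koord-scheduler implementation:
+
+```go
+// effectiveNUMATopologyAlignmentPolicy sketches the precedence described above:
+// the node label wins if it is set; otherwise follow the policy reported in
+// NodeResourceTopology; otherwise fall back to the koord-scheduler configuration.
+func effectiveNUMATopologyAlignmentPolicy(nodeLabels map[string]string, nrtTopologyPolicy, schedulerDefault string) string {
+	if p, ok := nodeLabels["node.koordinator.sh/numa-topology-alignment-policy"]; ok && p != "" {
+		return p
+	}
+	switch nrtTopologyPolicy {
+	case "SingleNUMANodePodLevel", "SingleNUMANodeContainerLevel":
+		// The community pod/container-level policies map to SingleNUMANode.
+		return "SingleNUMANode"
+	case "BestEffort", "Restricted":
+		return nrtTopologyPolicy
+	}
+	// No node label and TopologyPolicies=None (or unset): use the policy
+	// configured in koord-scheduler.
+	return schedulerDefault
+}
+```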
+ +#### Example + +The following specific example: + +```yaml +apiVersion: v1 +kind: Node +metadata: + labels: + node.koordinator.sh/cpu-bind-policy: "FullPCPUsOnly" + node.koordinator.sh/numa-topology-alignment-policy: "BestEffort" + node.koordinator.sh/numa-allocate-strategy: "MostAllocated" + name: node-0 +spec: + ... +``` + +### NodeResourceTopology CRD + +The node resource information to be reported mainly includes the following categories: + +- NUMA Topology, including resources information, CPU information such as logical CPU ID, physical Core ID, NUMA Socket ID and NUMA Node ID and etc. +- The topology manager scopes and policies configured by kubelet. +- The CPU manager policies and options configured by kubelet. +- Pod bound CPUs allocated by kubelet or koord-scheduler, including K8s Guaranteed Pods, Koordinator LSE/LSR Pods but except the LS/BE. +- CPU Shared Pool defined by koordlet + +The above information can guide koord-scheduler to better be compatible with the kubelet's CPU management logic, make more appropriate scheduling decisions and help users quickly troubleshoot. + +#### CRD Scheme definition + +We use [NodeResourceTopology](https://github.com/k8stopologyawareschedwg/noderesourcetopology-api/blob/master/pkg/apis/topology/v1alpha1/types.go) CRD to describe the NUMA Topology. The community-defined NodeResourceTopology CRD is mainly used for the following considerations: + +- NodeResourceTopology already contains basic NUMA topology information and kubelet TopologyManager's Scope and Policies information. We can reuse the existing codes. +- Keep up with the evolution of the community and influence the community to make more changes. + +#### Compatible + +The koordlet creates or updates NodeResourceTopology periodically. The name of NodeResourceTopology is same as the name of Node. and add the label `app.kubernetes.io/managed-by=Koordinator` describes the node is managed by Koordinator. + +#### Extension + +At present, the NodeResourceTopology lacks some information, and it is temporarily written in the NodeResourceTopology in the form of annotations or labels: + +- The annotation `kubelet.koordinator.sh/cpu-manager-policy` describes the kubelet CPU manager policy and options. The scheme is defined as follows: + +```go +const ( + FullPCPUsOnlyOption string = "full-pcpus-only" + DistributeCPUsAcrossNUMAOption string = "distribute-cpus-across-numa" +) + +type KubeletCPUManagerPolicy struct { + Policy string `json:"policy,omitempty"` + Options map[string]string `json:"options,omitempty"` + ReservedCPUs string `json:"reservedCPUs,omitempty"` +} + +``` + +- The annotation `node.koordinator.sh/cpu-topology` describes the detailed CPU topology. Fine-grained management mechanism needs more detailed CPU topology information. The scheme is defined as follows: + +```go +type CPUTopology struct { + Detail []CPUInfo `json:"detail,omitempty"` +} + +type CPUInfo struct { + ID int32 `json:"id"` + Core int32 `json:"core"` + Socket int32 `json:"socket"` + Node int32 `json:"node"` +} +``` + +- annotation `node.koordinator.sh/pod-cpu-allocs` describes the CPUs allocated by Koordinator LSE/LSR and K8s Guaranteed Pods. 
The scheme corresponding to the annotation value is defined as follows: + +```go +type PodCPUAlloc struct { + Namespace string `json:"namespace,omitempty"` + Name string `json:"name,omitempty"` + UID types.UID `json:"uid,omitempty"` + CPUSet string `json:"cpuset,omitempty"` + ManagedByKubelet bool `json:"managedByKubelet,omitempty"` +} + +type PodCPUAllocs []PodCPUAlloc +``` + +- The annotation `node.koordinator.sh/cpu-shared-pools` describes the CPU Shared Pool defined by Koordinator. The shared pool is mainly used by Koordinator LS Pods or K8s Burstable Pods. The scheme is defined as follows: + +```go +type NUMACPUSharedPools []CPUSharedPool + +type CPUSharedPool struct { + Socket int32 `json:"socket"` + Node int32 `json:"node"` + CPUSet string `json:"cpuset,omitempty"` +} +``` +The `CPUSet` field is Linux CPU list formatted string. For more details, please refer to [doc](http://man7.org/linux/man-pages/man7/cpuset.7.html#FORMATS). + + +#### Create/Update NodeResourceTopology + +- koordlet is responsible for creating/updating NodeResourceTopology +- It is recommended that koordlet obtain the CPU allocation information of the existing K8s Guaranteed Pod by parsing the CPU state checkpoint file. Or obtain this information through the CRI interface and gRPC provided by kubelet. +- When the CPU of the Pod is allocated by koord-scheduler, replace the CPUs in the kubelet state checkpoint file. +- It is recommended that koordlet obtain the CPU manager policy and options from [kubeletConfiguration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/). + +#### Example + +A complete NodeResourceTopology example: + +```yaml +apiVersion: topology.node.k8s.io/v1alpha1 +kind: NodeResourceTopology +metadata: + annotations: + kubelet.koordinator.sh/cpu-manager-policy: |- + { + "policy": "static", + "options": { + "full-pcpus-only": "true", + "distribute-cpus-across-numa": "true" + } + } + node.koordinator.sh/cpu-topology: |- + { + "detail": [ + { + "id": 0, + "core": 0, + "socket": 0, + "node": 0 + }, + { + "id": 1, + "core": 1, + "socket": 1, + "node": 1 + } + ] + } + node.koordinator.sh/cpu-shared-pools: |- + [ + { + "socket": 0, + "node": 0, + "cpuset": "0-3" + } + ] + node.koordinator.sh/pod-cpu-allocs: |- + [ + { + "namespace": "default", + "name": "static-guaranteed-pod", + "uid": "32b14702-2efe-4be9-a9da-f3b779175846", + "cpu": "4-8", + "managedByKubelet": "true" + } + ] + labels: + app.kubernetes.io/managed-by: Koordinator + name: node1 +topologyPolicies: ["SingleNUMANodePodLevel"] +zones: + - name: node-0 + type: Node + resources: + - name: cpu + capacity: 20 + allocatable: 15 + available: 10 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 + - name: node-1 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic2 + capacity: 6 + allocatable: 6 + available: 6 + - name: node-2 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 + - name: node-3 + type: Node + resources: + - name: cpu + capacity: 30 + allocatable: 25 + available: 15 + - name: vendor/nic1 + capacity: 3 + allocatable: 3 + available: 3 +``` \ No newline at end of file diff --git a/versioned_docs/version-v1.3/designs/fine-grained-device-scheduling.md b/versioned_docs/version-v1.3/designs/fine-grained-device-scheduling.md new file mode 100644 index 000000000..e27e8a951 --- /dev/null +++ 
b/versioned_docs/version-v1.3/designs/fine-grained-device-scheduling.md @@ -0,0 +1,408 @@ +# Fine-grained device scheduling + +## Summary + +This proposal provides a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA, defines a set of APIs to describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names to flexibly support users to apply at a finer granularity GPU resources. This mechanism is the basis for subsequent other GPU scheduling capabilities such as GPU Share, GPU Overcommitment, etc. + +## Motivation + +GPU devices have very strong computing power, but are expensive. How to make better use of GPU equipment, give full play to the value of GPU and reduce costs is a problem that needs to be solved. In the existing GPU allocation mechanism of the K8s community, the GPU is allocated by the kubelet, and it is a complete device allocation. This method is simple and reliable, but similar to the CPU and memory, the GPU will also be wasted. Therefore, some users expect to use only a portion of the GPU's resources and share the rest with other workloads to save costs. Moreover, GPU has particularities. For example, the NVLink and oversold scenarios supported by NVIDIA GPU mentioned below both require a central decision through the scheduler to obtain globally optimal allocation results. + +![image](/img/nvlink.jpg) + +From the picture, we can find that although the node has 8 GPU instances whose model is A100/V100, the data transmission speed between GPU instances is different. When a Pod requires multiple GPU instances, we can assign the Pod the GPU instances with the maximum data transfer speed combined relationship. In addition, when we want the GPU instances among a group of Pods to have the maximum data transfer speed combined relationship, the scheduler should batch allocate the best GPU instances to these Pods and assign them to the same node. + +### Goals + +1. Definition Device CRD and the Resource API. +1. Provides a reporter component in koordlet to report Device information and resource capacities. +1. Provides a scheduler plugin to support users to apply at a finer granularity GPU resources. +1. Provider a new runtime hook plugin in koordlet to support update the environments of containers with GPUs that be allocated by scheduler. + +### Non-goals/Future work + +1. Define flexible allocation strategies, such as implementing BinPacking or Spread according to GPU resources + +## Proposal + +### API + +#### Device resource dimensions + +Due to GPU is complicated, we will introduce GPU first. As we all know there is compute and GPU Memory capability for the GPU device. Generally user apply GPU like "I want 1/2/4/8 GPUs", but if node support GPU level isolation mechanism, user may apply GPU like "I want 0.5/0.25 GPU resources". Moreover, user may set different compute capability and GPU memory capability for best resource utilization, so the user want apply GPU like "I want X percent of "compute capability and Y percent of memory capability". + +We abstract GPU resources into different dimensions: + +- `kubernetes.io/gpu-core` represents the computing capacity of the GPU. Similar to K8s MilliCPU, we abstract the total computing power of GPU into one hundred, and users can apply for the corresponding amount of GPU computing power according to their needs. +- `kubernetes.io/gpu-memory` represents the memory capacity of the GPU in bytes. 
+- `kubernetes.io/gpu-memory-ratio` represents the percentage of the GPU's memory.
+
+Assuming that node A has 4 GPU instances, and the total memory of each instance is 8GB, when the device reporter reports GPU capacity information to `Node.Status.Allocatable`, it no longer reports nvidia.com/gpu=4, but reports the following information:
+
+```yaml
+status:
+  capacity:
+    kubernetes.io/gpu-core: 400
+    kubernetes.io/gpu-memory: "32GB"
+    kubernetes.io/gpu-memory-ratio: 400
+  allocatable:
+    kubernetes.io/gpu-core: 400
+    kubernetes.io/gpu-memory: "32GB"
+    kubernetes.io/gpu-memory-ratio: 400
+```
+
+For the convenience of users, an independent resource name `kubernetes.io/gpu` is defined. For example, when a user wants to use half of the computing resources and memory resources of a GPU instance, the user can directly declare `kubernetes.io/gpu: 50`, and the scheduler will convert it to `kubernetes.io/gpu-core: 50, kubernetes.io/gpu-memory-ratio: 50`.
+
+For other devices such as RDMA and FPGA, a node that has 1 RDMA device and 1 FPGA device will report the following information:
+
+```yaml
+status:
+  capacity:
+    kubernetes.io/rdma: 100
+    kubernetes.io/fpga: 100
+  allocatable:
+    kubernetes.io/rdma: 100
+    kubernetes.io/fpga: 100
+```
+
+Why do we need both `kubernetes.io/gpu-memory-ratio` and `kubernetes.io/gpu-memory`?
+When a user applies for 0.5/0.25 of a GPU, the user does not know the exact total memory bytes per GPU and only wants to use
+half or a quarter of the memory, so the user can request the GPU memory with `kubernetes.io/gpu-memory-ratio`.
+When the scheduler assigns the Pod to a concrete node, it will translate `kubernetes.io/gpu-memory-ratio` to `kubernetes.io/gpu-memory` by the formula: ***allocatedMemory = totalMemoryOf(GPU) * `kubernetes.io/gpu-memory-ratio`***, so that the GPU isolation can work.
+
+During the scheduling filter phase, the scheduler will do special processing for `kubernetes.io/gpu-memory` and `kubernetes.io/gpu-memory-ratio`. When a Pod specifies `kubernetes.io/gpu-memory-ratio`, the scheduler checks each GPU instance on each node for unallocated or remaining resources to ensure that the remaining memory on each GPU instance meets the ratio requirement.
+
+If the user knows exactly or can roughly estimate the specific memory consumption of the workload, they can apply for GPU memory through `kubernetes.io/gpu-memory`. All details can be seen below.
+
+Besides, when a dimension's value is greater than 100, it means the Pod needs multiple devices. Currently, the value must be divisible by 100.
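+
+As an illustration of the translation rules above, the conversion from the convenience resource and from `kubernetes.io/gpu-memory-ratio` to concrete values could be sketched as follows (the function names are hypothetical; the formula is the one stated above):
+
+```go
+// expandGPUResource expands the convenience resource "kubernetes.io/gpu" into the
+// fine-grained dimensions, e.g. kubernetes.io/gpu: 50 becomes
+// kubernetes.io/gpu-core: 50 and kubernetes.io/gpu-memory-ratio: 50.
+func expandGPUResource(gpu int64) (gpuCore, gpuMemoryRatio int64) {
+	return gpu, gpu
+}
+
+// memoryFromRatio applies allocatedMemory = totalMemoryOf(GPU) * gpu-memory-ratio,
+// where the ratio is a percentage of a single GPU instance. With 8Gi per GPU,
+// a ratio of 50 yields 4Gi, and a ratio of 200 (two full GPUs) yields 16Gi.
+func memoryFromRatio(totalMemoryBytesPerGPU, gpuMemoryRatio int64) int64 {
+	return totalMemoryBytesPerGPU * gpuMemoryRatio / 100
+}
+```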
+ +#### User apply device resources scenarios + +##### Compatible with `nvidia.com/gpu` + +```yaml +resources: + requests: + nvidia.com/gpu: "2" + cpu: "4" + memory: "8Gi" +``` + +The scheduler translates the `nvida.com/gpu: 2` to the following spec: + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory-ratio: "200" + kubernetes.io/gpu-memory: "16Gi" # assume 8G memory in bytes per GPU + cpu: "4" + memory: "8Gi" +``` + +##### Apply whole resources of GPU or part resources of GPU + +```yaml +resources: + requests: + kubernetes.io/gpu: "50" + cpu: "4" + memory: "8Gi" +``` + +The scheduler translates the `kubernetes.io/gpu: "50"` to the following spec: + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "50" + kubernetes.io/gpu-memory-ratio: "50" + kubernetes.io/gpu-memory: "4Gi" # assume 8G memory in bytes for the GPU + cpu: "4" + memory: "8Gi" +``` + +##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory-ratio` separately + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "50" + kubernetes.io/gpu-memory-ratio: "60" + cpu: "4" + memory: "8Gi" +``` + +##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory` separately + +```yaml +resources: + requests: + kubernetes.io/gpu-core: "60" + kubernetes.io/gpu-memory: "4Gi" + cpu: "4" + memory: "8Gi" +``` + +##### Apply RDMA + +```yaml +resources: + requests: + kubernetes.io/rdma: "100" + cpu: "4" + memory: "8Gi" +``` + +### Implementation Details + +#### Scheduling + +1. Abstract new data structure to describe resources and healthy status per device on the node. +2. Implements the Filter/Reserve/PreBind extenstion points. +3. Automatically recognize different kind devices. When a new device added, we don't need modify any code + +##### DeviceAllocation + +In the PreBind stage, the scheduler will update the device (including GPU) allocation results, including the device's Minor and resource allocation information, to the Pod in the form of annotations. + +```go +/* +{ + "gpu": [ + { + "minor": 0, + "resouurces": { + "kubernetes.io/gpu-core": 100, + "kubernetes.io/gpu-mem-ratio": 100, + "kubernetes.io/gpu-mem": "16Gi" + } + }, + { + "minor": 1, + "resouurces": { + "kubernetes.io/gpu-core": 100, + "kubernetes.io/gpu-mem-ratio": 100, + "kubernetes.io/gpu-mem": "16Gi" + } + } + ] +} +*/ +type DeviceAllocation struct { + Minor int32 + Resources map[string]resource.Quantity +} + +type DeviceAllocations map[DeviceType][]*DeviceAllocation +``` + +##### NodeDevicePlugin + +```go +var ( + _ framework.PreFilterPlugin = &NodeDevicePlugin{} + _ framework.FilterPlugin = &NodeDevicePlugin{} + _ framework.ReservePlugin = &NodeDevicePlugin{} + _ framework.PreBindPlugin = &NodeDevicePlugin{} +) + +type NodeDevicePlugin struct { + frameworkHandler framework.Handle + nodeDeviceCache *NodeDeviceCache +} + +type NodeDeviceCache struct { + lock sync.Mutex + nodeDevices map[string]*nodeDevice +} + +type nodeDevice struct { + lock sync.Mutex + DeviceTotal map[DeviceType]deviceResource + DeviceFree map[DeviceType]deviceResource + DeviceUsed map[DeviceType]deviceResource + AllocateSet map[DeviceType]*corev1.PodList +} + +// We use `deviceResource` to present resources per device. 
+// "0": {kubernetes.io/gpu-core:100, kubernetes.io/gpu-memory-ratio:100, kubernetes.io/gpu-memory: 16GB} +// "1": {kubernetes.io/gpu-core:100, kubernetes.io/gpu-memory-ratio:100, kubernetes.io/gpu-memory: 16GB} +type deviceResources map[int]corev1.ResourceList + +``` + +We will register node and device event handler to maintain device account. + +- In Filter, we will make-up each device request by a node(the gpu-memory example), and try compare each device free resource and Pod device request. +- In Reserve/Unreserve, we will update nodeDeviceCache's used/free resource and allocateSet. Now device selection rule just based on device minor id order. +- In PreBind, we will write DeviceAllocations to Pod's annotation. +- In Init stage, we should list all Node/Device/Pods to recover device accounts. + +#### Device Reporter + +Implements a new component called `Device Reporter` in koordlet to create or update `Device` CRD instance with the resources information and healthy status per device including GPU, RDMA and FPGA, etc. This version we only support GPU. It will execution `nccl` commands to get each minor resource just like k8s-gpu-device-plugins. We will apply community health check logic. + +#### Device CRD Scheme definition +```go +type DeviceType string + +const ( + GPU DeviceType = "gpu" + FPGA DeviceType = "fpga" + RDMA DeviceType = "rdma" +) + +type DeviceSpec struct { + Devices []DeviceInfo `json:"devices"` +} + +type DeviceInfo struct { + // UUID represents the UUID of device + UUID string `json:"id,omitempty"` + // Minor represents the Minor number of Device, starting from 0 + Minor int32 `json:"minor,omitempty"` + // Type represents the type of device + Type DeviceType `json:"deviceType,omitempty"` + // Health indicates whether the device is normal + Health bool `json:"health,omitempty"` + // Resources represents the total capacity of various resources of the device + Resources map[string]resource.Quantity `json:"resource,omitempty"` +} + +type DeviceStatus struct {} + +type Device struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec DeviceSpec `json:"spec,omitempty"` + Status DeviceStatus `json:"status,omitempty"` +} + +type DeviceList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata,omitempty"` + + Items []Device `json:"items"` +} +``` + +##### Compatible + +Considering that some users already have many existing GPU Pods in their clusters, it is necessary to ensure that Koordinator GPU Scheduling does not repeatedly allocate the GPU devices held by these GPU Pods. Therefore, koord-scheduler needs to obtain the GPU devices's information held by these existing Pods. These GPU devices are allocated by the kubelet and recorded in the local file `/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint`, so the device reporter will parse the file to obtain the GPU Device ID assigned to each Pod. When parsing, it needs to exclude the Pod that allocates GPU through koord-scheduler, and finally update it to Device CRD in the form of annotation. 
The corresponding annotation key is `node.koordinator.sh/devices-checkpoints`, and the annotation value is defined as follows: + +```go +type PodDevicesEntry struct { + PodUID string `json:"podUID,omitempty"` + ContainerName string `json:"containerName,omitempty"` + ResourceName string `json:"resourceName,omitempty"` + DeviceIDs []string `json:"deviceIDs,omitempty"` + AllocResp []byte `json:"allocResp,omitempty"` +} + +type PodDevicesEntries []PodDevicesEntry +``` + +#### CRD Example +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Device +metadata: + name: node-1 + annotations: + node.koordinator.sh/gpu-checkpoints: |- + [ + { + "podUID": "fa8983dc-bb76-4eeb-8dcc-556fbd44d7ce", + "containerName": "cuda-container", + "resourceName": "nvidia.com/gpu", + "deviceIDs": ["GPU-36b27e44-b086-46f7-f2dc-73c36dc65991"] + } + ] +spec: + devices: + - health: true + id: GPU-98583a5c-c155-9cf6-f955-03c189d3dbfb + minor: 0 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 15472384Ki + kubernetes.io/gpu-memory-ratio: "100" + type: gpu + - health: true + id: GPU-7f6410b9-bdf7-f9a5-de09-aa5ec31a7124 + minor: 1 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 15472384Ki + kubernetes.io/gpu-memory-ratio: "100" + type: gpu +status: {} +``` + +#### koordlet and koord-runtime-proxy + +Our target is to work compatible with origin k8s kubelet and k8s device plugins, so: + +1. We still allow kubelet and device plugin to allocate concrete device, which means no matter there's a k8s device +plugin or not, our design can work well. + +2. In koord-runtime-proxy, we will use Pod's `DeviceAllocation` in annotation to replace the step1's result of container's +args and envs. + +We should modify protocol between koord-runtime-proxy and koordlet to add container env: + +```go +type ContainerResourceHookRequest struct { + .... + Env map[string]string +} + +type ContainerResourceHookResponse struct { + .... + Env map[string]string +} +``` + +Then we will add a new `gpu-hook` in koordlet's runtimehooks, registered to `PreCreateContainer` stage. +We will generate new GPU env `NVIDIA_VISIBLE_DEVICES` by Pod GPU allocation result in annotation. + +The koord-runtime-proxy can see these Pod's env, we need koord-runtime-proxy to pass these environments to koordlet, and koordlet parse the GPU related env to find the concrete device ids. + +Besides, the koordlet should report GPU model to node labels same as device plugin, this is in-case Koordinator working without device-plugin. + +Finally, we should modify `ContainerResourceExecutor`'s `UpdateRequest` function in koord-runtime-proxy, and let new GPU env covering old GPU env. + +When we handle hot-update processing, we can handle the existing scheduled Pods without device allocation in Pod's annotation. If GPU allocation info is not in annotation, we will find the GPU allocations from `ContainerResourceHookRequest`'s `Env`, and we will update all GPU allocations to Device CRD instance. + +### Compatibility + +As we know, the GPU scheduling in kube-scheduler side has no any different with other scalar resources. The concrete device-level assigning is done by kubelet and GPU device plugin, which will generate container's GPU env. + +Our design has no conflict with the above process. Our device reporter reports Koordinator GPU resources for kubelet +updating node resources. Then we schedule device request in our new plugin with new device resource account. 
In the pre-bind
+stage, we will update the container resources with Koordinator GPU resources, so that the kubelet can check the resource limitation.
+We will also add the device allocation information to the Pod's annotation. On the node side, the k8s device plugin will first patch
+the container env, but we will overwrite these envs in the runtime proxy with the allocation result in the Pod's annotation.
+
+### Upgrade strategy
+
+If using Koordinator GPU Scheduling to schedule GPU Pods in a brand new cluster, simply install the Koordinator components.
+
+However, if you want to upgrade to Koordinator GPU Scheduling in an existing cluster, you need to avoid GPU devices being repeatedly allocated because of switching between different scheduling mechanisms. You need to pay attention to the order when upgrading:
+1. Install the Koordinator components. In particular, make sure that the koordlets are all started successfully.
+2. Stop the system or platform that creates new GPU Pods.
+3. Stop the scheduler currently responsible for the GPU Pods and ensure that there are no pending GPU Pods in the current cluster.
+4. Wait a few minutes to ensure that each node's koordlet creates and updates the Device CRD.
+5. Modify all components that create GPU Pods to switch the schedulerName of the Pod to koord-scheduler.
+6. Start trying to create a GPU Pod and verify the koord-scheduler GPU scheduling result.
+7. Restore the system or platform that creates GPU Pods and the old scheduler.
+
+In the future, Koordinator will provide a webhook to solve the problem of upgrading an existing cluster. The webhook will identify the GPU Pod and modify the schedulerName of the newly created GPU Pod to koord-scheduler. At the same time, the webhook will take over the Binding operation of the GPU Pod. If the Binding is not initiated by koord-scheduler, it will be rejected.
+
+## Unsolved Problems
+
+## Alternatives
+
+1. Users can choose whether to use the k8s device plugin. As mentioned above, we are compatible with both cases.
diff --git a/versioned_docs/version-v1.3/designs/gang-scheduling.md b/versioned_docs/version-v1.3/designs/gang-scheduling.md
new file mode 100644
index 000000000..dbe2762e9
--- /dev/null
+++ b/versioned_docs/version-v1.3/designs/gang-scheduling.md
@@ -0,0 +1,385 @@
+# GangScheduling
+
+## Summary
+This proposal provides a Gang mechanism for the scheduler to control the binding opportunity of pods. Users can declare a minimum number of resources to be collected,
+and the binding is triggered only when the assigned resources reach the given threshold. We provide `Strict` and `NonStrict` modes to
+control the resource-accumulation process through configuration. We also provide a two-level Gang description to better match
+real scenarios, which is different from the community implementation.
+
+## Motivation
+In AI scenarios, lots of jobs need Gang scheduling. The community has lots of related implementations such as `Coscheduling` or `Volcano`.
+We received lots of inspiration from them during the design process.
+
+### Compared with competitors
+
+#### Coscheduling
+1. `Coscheduling` implements a new queue-sort interface and other methods to let one Gang's pods get out of the queue in order as much as possible.
+If a pod fails to be scheduled, the requests that have been successfully scheduled in this round of the Gang scheduling cycle will be rolled back,
+and the remaining pods waiting for scheduling will be rejected in the PreFilter check until this scheduling cycle has passed.
+For example, there is a Gang requires 10 tasks to be scheduled, if first 5 tasks allocated, the 6th task failed to be scheduled, +`Coscheduling` will roll-back first 5 tasks and ignore the remaining 4 tasks in this Gang scheduling cycle. `Coscheduling` simply use a +global time interval to control the Gang scheduling cycle. The first defect is that the uniform time interval will cause +some problems. If the time configuration is too long, it will lead to useless waiting; If the time configuration is too short, +it will lead to useless scheduling. Secondly, it is very difficult for a large job to meet all resource requests at one time. +This mechanism will lead to a very low probability of full resources, and eventually make the job starve to death. We call this process as `Strict`. + +2. Some jobs have complex Gang requirements. For example, a job has several roles. Each role will have several pods +and its own Gang conditions. Jobs also need different roles to form different GangGroups. All pods in a GangGroup can +trigger the bind process only after all roles in a GangGroup meet their Gang conditions. The `Coscheduling` can't meet +this requirement. + +### Goals +1. Define API to announce Gang scheduling configuration. + +2. Provides a scheduler plugin to achieve Gang scheduling ability. + +### Non Goals and Future Work +1. Provide ability to solve Gang resource deadlock problems with `NonStrict`. + +## Proposal + +### Key concept + +#### Strict and NonStrict + +As mentioned above, in `Strict`, if a pod failed to be scheduled, the pods that have been successfully scheduled in +this scheduling cycle will be rolled back, and the remaining pods waiting for scheduling will be rejected in +PreFilter check util this scheduling cycle passed. We call this mode is `Strict`. + +In `NonStrict`, if a pod failed to be scheduled, it has no impact on any other pod. We will continue to accumulate +the allocated pod until the condition of Gang is met. This process is friendly to Gangs with large number of pods, but it +will increase the risk of resource deadlock between Gangs. For example, the quota of the quota group is 10(quota will be proposed later), +and the user submits three Gangs with 5 pods. Due to various plugin constraints, Gang1\2\3 may allocate resources of 3\3\4 respectively. +Since the quota group's quota is full, there will be no new resource scheduling. We call this is resource deadlock of resource Gang. +In future proposal, we will try to fix this problem. + +#### GangGroup +As mentioned above, Some jobs have complex Gang requirements. For example, a job has several roles. Each role will have several pods +and its own Gang conditions. Jobs also need different roles to form different GangGroups. All pods in a GangGroup can +trigger the bind process only after all roles in a GangGroup meet their Gang conditions. So we introduce `GangGroup` concept, +which allow user to bundle different Gangs together. + +#### After Gang +It should be noted that, if the resource accumulation conditions of Gang are met, then some pods failed in the process of binding, +or some bound pods are preempted\rescheduled, should the constraints of Gang still be effective in the process of resource reallocation? +Because the initial purpose of Gang is to require pods to be pulled up at the same time, if some pods have been pulled up, +then the subsequent Gang behavior is meaningless. 
Therefore, when once Gang has been satisfied, all subsequent resource allocations +are no longer constrained by Gang rules, and their performance is similar to ordinary pod. + +As mentioned above, `WaitTime` is the max wait time since first pod comes to permit stage. If `WaitTime` is timeout, +scheduler will roll back all assumed pods, update each pod's annotation with `gang.scheduling.koordinator.sh/timeout=true`, and +won't schedule these pods anymore. User should pay attention to this status and delete pods timely. + +### API +#### Definition + +Our original intention is to improve and enhance the ability of the community's original `PodGroup`, so we will be +compatible with the way the community declares the `PodGroup`. We also provide a lighting way to just use annotations to +use Gang feature. + +#### CRD way +User can use `PodGroup` CRD in community to declare a gang: +```go +type PodGroup struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec PodGroupSpec `json:"spec,omitempty"` + Status PodGroupStatus `json:"status,omitempty"` +} +type PodGroupSpec struct { + MinMember int32 `json:"minMember,omitempty"` + MinResources *v1.ResourceList `json:"minResources,omitempty"` + + ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"` +} +``` +Pod should use `pod-group.scheduling.sigs.k8s.io` in label to associate with `PodGroup`. + +Also, we introduce some optional definitions as below: +```yaml +gang.scheduling.koordinator.sh/total-number +gang.scheduling.koordinator.sh/mode +gang.scheduling.koordinator.sh/groups +``` +- `gang.scheduling.koordinator.sh/name` indicates the gang's name, it should be emphasized that the name should be in the form of RFC 1123 + +- `gang.scheduling.koordinator.sh/total-number` helps to calculate Gang scheduling cycle in `strict mode`, you can +find more detail in `Data-Structure` chapter. Default equals to `gang.scheduling.koordinator.sh/min-available`. + +- `gang.scheduling.koordinator.sh/mode` determines `Strict` or `NonStrict`. Default is `Strict`. + +- `gang.scheduling.koordinator.sh/groups` describes GangGroups. Default is empty, which means don't need to form a `GangGroup` with others, +and the gangs in one gangGroup can from different namespaces. + +`gang.scheduling.koordinator.sh/total-number`, `gang.scheduling.koordinator.sh/mode`, `gang.scheduling.koordinator.sh/gang-groups` should be found in +`PodGroup`'s annotation if needed. + +##### Example +When user apply a basic gang, the example is as follows: +```yaml +apiVersion: v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-07-11T18:26:33Z" + name: gang-a + namespace: default +spec: + minMember: 5 + minResources: + cpu: "5" + memory: "2048Mi" + scheduleTimeoutSeconds: 600 +``` + +Let's assume a job has two roles: A and B, each role has several pods. podA belongs to roleA, podB belongs to roleB. 
+roleA and roleB belongs to one GangGroup, the example is as follows: +```yaml +apiVersion: v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-07-11T18:26:33Z" + name: gang-a + namespace: namespaceA + annotations: + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +spec: + minMember: 5 + minResources: + cpu: "5" + memory: "2048Mi" + scheduleTimeoutSeconds: 600 +``` + +It should be noted that, if use Gang feature by `CRD way`, user should let high level operator maintain Gang CRD life circle +like handling `update/create/delete` events. Also, from a Scheduler perspective, scheduler should handle receive-order-issue's +between Gang CRD and pod. For example, if pods arrive to scheduler before Gang CRD, we have to build a fake Gang data structure +temporarily to collect all related pods, and need to suspend the scheduling of pods until parse the configuration from real Gang CRD. + +#### Annotation way +```yaml +gang.scheduling.koordinator.sh/name +gang.scheduling.koordinator.sh/min-available +``` + +The upper definitions are indispensable. We are compatible with `pod-group.scheduling.sigs.k8s.io`, `pod-group.scheduling.sigs.k8s.io/name` +and `pod-group.scheduling.sigs.k8s.io/min-available` in community. We also support new definitions to declare Gang's name and minimum number. + +Also, we introduce some optional definitions as below, most are mentioned above: +```yaml +gang.scheduling.koordinator.sh/waiting-time +gang.scheduling.koordinator.sh/total-number +gang.scheduling.koordinator.sh/mode +gang.scheduling.koordinator.sh/groups +``` + +- `gang.scheduling.koordinator.sh/waiting-time` represents max wait time since first pod comes to permit stage. Default is a global config. + +- `gang.scheduling.koordinator.sh/total-number` helps to calculate Gang scheduling cycle in `strict mode`, you can +find more detail in `Data-Structure` chapter. Default equals to `gang.scheduling.koordinator.sh/min-available`. + +- `gang.scheduling.koordinator.sh/mode` determines `Strict` or `NonStrict`. Default is `Strict`. + +- `gang.scheduling.koordinator.sh/groups` describes GangGroups. Default is empty, which means don't need to form a `GangGroup` with others. + +It should be noted that, the annotation mode's parameter will overwrite CRD's mode if both exist. +And gangGroup should be announced with " gangNamespace" + "/" + "gangName " + +##### Example +When user apply a basic gang, the example is as follows: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/min-available: 5 +``` + +Let's assume a job has two roles: A and B, each role has several pods. PodA belongs to roleA, podB belongs to roleB. 
+roleA and roleB belongs to one GangGroup, the example is as follows: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-b + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: ["namespaceA/gang-a", "namespaceB/gang-b"] +``` + +Assuming a job has two roles: A and B, each role has several pods. podA belongs to roleA, podB belongs to roleB. +roleA and roleB belongs to different GangGroup, the example as follows: +```yaml +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-a + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: "" +metadata: + annotations: + gang.scheduling.koordinator.sh/name: gang-b + gang.scheduling.koordinator.sh/waiting-time: 3600s + gang.scheduling.koordinator.sh/min-available: 5 + gang.scheduling.koordinator.sh/total-number: 5 + gang.scheduling.koordinator.sh/mode: Strict + gang.scheduling.koordinator.sh/groups: "" +``` + +### Implementation Details +#### QueueSortPlugin + +We design an independent plugin to implement the `QueueSort` extension point separately, so that we can integrate +queue sort logic of all plugins, and register them at one time. + +In this proposal, we implement the Less function to gather pods belong to same Gang. The specific queuing rule is: + +1. Firstly, compare the priorities of the two pods, the higher priority is at the front of the queue. + +2. Secondly, compare creationTimestamp of two pods, if pod belongs to a Gang, then we compare creationTimestamp of the Gang, +the one created first will be at the front of the queue. + +3. Finally, compare pod's namespace, if pod belongs to a Gang, then we compare Gang name. + +```go +type QueueSortPlugin interface{ + QueueSort(*QueuedPodInfo, *QueuedPodInfo) bool +} +``` + +#### GangSchedulingPlugin +##### Data-Structure +###### Gang +```go +type Gang struct { + Name string + WaitTime time.Duration + Mode string //Strict or NonStrict + GangGroup []string + MinRequiredNumber int + TotalChildrenNum int + Children map[string]*PodInfo + BoundChildren map[string]*PodInfo + WaitingForBindChildren map[string]*PodInfo + ResourceSatisfied bool + ScheduleCycle int + ScheduleCycleValid bool + ChildrenScheduleRoundMap map[string]int +} +``` + +We design the Gang to record Gang status in scheduler memory. We can get the children pods from "Children" field, and the +`BoundChildren, WaitingForBindChildren` store the pods binding status, which is used to check if the pods can pass permit stage. + +Once Permit stage passed, we will set `ResourceSatisfied=true`, as mentioned above in `After Gang` chapter, this variable is +used for judging whether gang has been satisfied. when handle failover case, if any pod in Gang has been bound, we set `ResourceSatisfied=true`. + +We especially explain `scheduleCycle` and `childrenScheduleRoundMap` field. 
These fields control Gang's scheduling cycle. For example, +at the beginning, `scheduleCycle` is 1, and each pod's cycle in `childrenScheduleRoundMap` is 0. When each pod comes to PreFilter, +we will check if the pod's value in `childrenScheduleRoundMap` is smaller than Gang's `scheduleCycle`, If result is positive, +we set the pod's cycle in `childrenScheduleRoundMap` equal with `scheduleCycle` and pass the check. If result is negative, means +the pod has been scheduled in this cycle, so we should reject it. With `totalChildrenNum`'s help, when the last pod comes to make all +`childrenScheduleRoundMap`'s values equal to `scheduleCycle`, Gang's `scheduleCycle` will be added by 1, which means a new schedule cycle. + +We continue to explain `scheduleCycleValid` field, during the scheduling, When a pod failed at Filter stage, we will set ScheduleCycleValid to +false in PostFilter stage, which means any pod in this Gang shouldn't be scheduled until it is set to "true", +and the remaining pods should be rejected in PreFilter stage. Only When `scheduleCycle` added by 1, we will reset the `scheduleCycleValid` to true. + +It should be emphasized that `scheduleCycle\scheduleCycleValid\childrenScheduleRoundMap` only work in `Strict`. + +##### GangPlugin + +this is the framework of the Plugin,we cache the Gang info above in the gangCache. +```go +type GangPlugin struct { + frameworkHandler framework.Handle + gangClient gangClient.Interface + podLister listerv1.PodLister + snapshotSharedLister framework.SharedLister + gangCache map[string]*Gang +} +``` +during the whole kubernetes shceduling process,we only need to realize our logic into four extention points as below: +```go +var( + _ framework.PreFilterPlugin = &GangScheduling{} + _ framework.PostFilterPlugin = &GangScheduling{} + _ framework.PermitPlugin = &GangScheduling{} + _ framework.ReservePlugin = &Coscheduling{} +) +type GangScheduling interface{ + ActiveGang(pod *corev1.Pod, state *framework.CycleState) + PreFilter(context.Context, *corev1.Pod) error + PostFilter(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status) + Permit(context.Context, *corev1.Pod) Status + Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) +} +``` +###### **PreFilter** + +if `NonStrict`, we only do step1 and step2: + +- Check whether childes in Gang has met the requirements of minimum number under each Gang, and reject the pod if negative. + +- Check whether the Gang has been timeout(check the pod's annotation,later introduced at Permit section), and reject the pod if positive. + +- Check whether the Gang has met the `scheduleCycleValid` check, and reject the pod if negative. + +- Try update `scheduleCycle`, `scheduleCycleValid`, `childrenScheduleRoundMap` as mentioned above. + + +###### **PostFilter** + +At this point means the pod didn't pass the Filter Plugin, we should: + +- If `Strict`, we will set `scheduleCycleValid` to false and release all assumed pods. + +- If `NonStrict`, we will do nothing. + +###### **Permit** + +Any pod passes Filter stage will come to this stage. Scheduler will calculate all Gangs in GangGroup whether the current +number of assumed-pods in each Gang meets the Gang's minimum requirement. + +- If Gang don't meet the bind-condition, we will give the pod a "Wait" Status with a timeout duration, and the bind +goroutine will keep waiting until the wait is timeout or passed. 
Then we will run the `ActiveGang` method, it can put all +the pods belong to the Gang which in `schedulableQueue` or `backoffQueue` back to `activeQueue`, so that the pod of Gang +can be continuously scheduled as much as possible. + +It should be noted that, in community, scheduler limit maximum timeout value under 15 min, we may need to hook RunPermitPlugins +to enlarge the timeout when 15 minutes is not enough. Now we record as a known-issue. + +- If Gang meet the bind-condition, we will give every waiting pod a "Success" status, which will let the bind goroutine of +each pod leave the waiting status and continue to run. Also, as mentioned above, we will set Gang's `ResourceSatisfied` to true. + +###### **Un-reserve** + +Both permit stage is timeout and binding failed will lead the pod to un-reserve stage, we can distinguish from Gang's "ResourceSatisfied" field, +if the field is true means binding failed, else means the Gang is timeout. + +- When permit stage is timeout, we will give an annotation like `gang.scheduling.koordinator.sh/timeout=true` to all the pods +belong to the Gang and will release the resource of all the assumed pods. The Gang will not be scheduled anymore, +user should manually handle the timeout event. + +- When binding failed, as mentioned above, the collection of Gang's resource is over, we will do nothing except roll back +the failed pod resource. + +###### **Init** + +We will register pod's event handler to watch pod event for updating Gang. + +## Unsolved Problems + +## Alternatives +User can choose use Gang by `Strict` and `NonStrict` case by case. diff --git a/versioned_docs/version-v1.3/designs/koordlet-overview.md b/versioned_docs/version-v1.3/designs/koordlet-overview.md new file mode 100644 index 000000000..2ea0cbdfc --- /dev/null +++ b/versioned_docs/version-v1.3/designs/koordlet-overview.md @@ -0,0 +1,56 @@ +# Koordlet + + +## Summary +Koordlet is a DaemonSet deployed in Kubernetes node, which is used for co-location resource overcommitment, interference +detection, QoS guarantee, etc. It is composed of several modules which are responsible for information collection, +data profiling and QoS management independent. Some modules also provides a framework scaffold, which provides a set +of plugin for extension (such as the "QoS Manager"), so that new strategies can be easily added. + +## Architecture +![image](/img/koordlet-arch.svg) + +## Modules + +### Metrics Advisor +Metric Advisor provides the basic information of resource usage and performance characteristic of node, pods and containers. +It is an independent module that collects, processes and exports resource profile periodically. It also detects the +interference of running containers such as CPU scheduling, memory allocation latency and Pressure Stall Information(PSI). +The information will be widely used for resource overcommitment and QoS guaranteed plugins. + +### Storage +Storage manages the information from Metrics Advisor and States Informer, providing APIs for CURD and GC outdated data +periodically. There are two types of data: `static` and `time-series`. Time-series type keeps historical data for +statistics purpose, such as CPU and memory usage. Static type includes the of status information node, pod and container, +such as CPU info of node, metadata of pod. + +### States Informer +States Informer syncs node and pod status from kube-apiserver and kubelet, and saves data into Storage as `static` type. 
+This module should remain relatively stable over developing iterations compared with others. + +### QoS Manager +QoS Manager coordinates a set of plugins which are responsible for SLO guarantee by priority, mitigating interference +among pods. Plugins dynamically tunes the "knobs" of resource parameters on different scenarios, according to resource +profiling, interference detection results and SLO configuration. For each plugin, it always produces execution plans for +"knobs" tuning. QoS Manager also act as an arbitrator among multiple execution plans, consolidating the duplicates and +resolving the conflicts. + +QoS Manager could be the most frequently iterated module, with new plugins extended, strategies algorithm updated and +policy execution ways added. A new plugin should implement the interface which contains a series of standard APIs, so +that the "core" can be kept simple and maintainable. Advanced plugins such as those for interference detection purpose +will get more complex as time goes by, which might becomes an independent module after the incubation has been already +stabled in QoS Manager. + +### Metrics Reporter +Metrics Reporter reads historical metric and state data from Storage, then merges and sends them to apiserver, +which will be consumed by Koordinator Manager for resource overcommitment model management. Metrics Reporter also +supports multiple processing algorithms for different co-location scenarios. + +### Runtime Hooks +Runtime Hooks act as the back-end server of Runtime Hook Manager. Runtime Hook Manager is a CRI Proxy, which +intercepting the CRI request, calling back-end server to inject policies, such as setting resource isolation +parameters by pod priorities, applying resource allocation policies. Runtime Hooks provide a framework to maintain +different kinds of policies, and provides flexible extension points during the lifecycle of containers. + +#### e.g. LLC Isolation Injections during Pod Lifecycle +![image](/img/llc-isolation.svg) diff --git a/versioned_docs/version-v1.3/designs/load-aware-scheduling.md b/versioned_docs/version-v1.3/designs/load-aware-scheduling.md new file mode 100644 index 000000000..3ca5bdc56 --- /dev/null +++ b/versioned_docs/version-v1.3/designs/load-aware-scheduling.md @@ -0,0 +1,115 @@ +# Load-aware Scheduling + +## Summary + +Although Koordinator provides the co-location mechanism to improve the resource utilization of the cluster and reduce costs, it does not yet have the ability to control the utilization level of the cluster dimension. This proposal defines a scheduling plugin to help Koordinator achieve this capability. + +## Motivation + +Koordinator oversells some resources through the co-location mechanism. Although it can improve the utilization of nodes, Best Effort workloads may also interfere with latency-sensitive applications. + +### Goals + +1. Provides a configurable scheduling plugin to help control cluster resource utilization. +2. Utilization control mechanism that supports multiple resources. +3. Control resource utilization at a safe threshold. + +### Non-Goals/Future Work + +1. Help the plugin to achieve more reasonable estimates and better results through application profiles. This is left as a follow-up work that will be done under a different proposal. + +## User stories + +### Story 1 + +When the resource utilization of the node has reached a high threshold, serious resource contention will occur between the running workloads on the node. 
For example, best effort workloads are frequently suppressed due to higher-priority applications requiring resources. As a result, best effort workloads are timed out or even forced to end; or a latency-sensitive application will suffer severe performance degradation under high utilization, failing to meet external SLAs. This should be avoided. + +### Story 2 + +Workloads in a co-located cluster have different resource requirements. Typical CPU-bound workloads expect to use more CPU, while other types of workloads may use more memory. It is possible that the utilization of CPU resources is relatively high, while the utilization of memory resources is relatively low. In this scenario, the unbalanced utilization of resources will affect the effect of scheduling, and may even lead to the problem that resources are idle but Pods cannot be scheduled. + +### Story 3 + +Koordinator defines NodeMetric CRD to describe the resource usage of nodes and is regularly updated by koordlet. However, if there are many Pods scheduled to cold nodes (that is, nodes with low resource utilization) during the update cycle, when these Pods start running, the resource utilization of these nodes may exceed the expected threshold. As a result, the runtime quality of these pods is not as good as expected. + +### Story 4 + +The koordlet may not be able to report the latest resource usage due to node exception. Such nodes should be avoided during scheduling to prevent unexpected exceptions. + +## Implementation Details + +![image](/img/load-aware-scheduling-arch.svg) + +The scheduling plugin filters abnormal nodes and scores them according to resource usage. This scheduling plugin extends the Filter/Score/Reserve/Unreserve extension points defined in the Kubernetes scheduling framework. + +### Filter unhealthy nodes + +By default, abnormal nodes are filtered, and users can decide whether to enable or not by configuring as needed. + +- Filter nodes where koordlet fails to update NodeMetric. If the configuration enables, the plugin will exclude nodes with *nodeMetrics.status.updateTime >= LoadAwareSchedulingArgs.nodeMetricExpirationSeconds*. + +- Filter nodes by utilization thresholds. If the configuration enables, the plugin will exclude nodes with *latestUsageUtilization >= utilizationThreshold*. In the filtering phase, only the resource utilization is obtained from the latest NodeMetric, and the resource usage of the allocated but not yet counted Pods does not participate in the calculation, so as to allocate resources to the newly created Pods and avoid scheduling failure due to unreasonable estimates. + +### Score algorithm + +The core logic of the scoring algorithm is to select the node with the smallest resource usage. However, considering the delay of resource usage reporting and the delay of Pod startup time, the resource requests of the Pods that have been scheduled and the Pods currently being scheduled within the time window will also be estimated, and the estimated values will be involved in the calculation. 
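+
+The scoring idea above could be sketched as follows. This is only an illustration of "prefer the node with the smallest estimated utilization" with hypothetical helper names; the per-resource weights and the scaling applied to recently assigned, not-yet-reported Pods (e.g. request multiplied by the scaling factor) correspond to the `ResourceWeights` and `EstimatedScalingFactors` arguments described in the next section:
+
+```go
+// scoreNode favors nodes with lower estimated utilization. The estimated usage of a
+// resource is the usage reported in NodeMetric plus the scaled requests of Pods that
+// were assigned recently but are not yet reflected in NodeMetric.
+func scoreNode(reportedUsage, estimatedPendingUsage, allocatable, weights map[string]int64, maxScore int64) int64 {
+	var weightedScore, totalWeight int64
+	for resourceName, weight := range weights {
+		alloc := allocatable[resourceName]
+		if alloc <= 0 {
+			continue
+		}
+		estimated := reportedUsage[resourceName] + estimatedPendingUsage[resourceName]
+		if estimated > alloc {
+			estimated = alloc
+		}
+		// Less estimated usage yields a higher score for this resource dimension.
+		weightedScore += weight * (alloc - estimated) * maxScore / alloc
+		totalWeight += weight
+	}
+	if totalWeight == 0 {
+		return 0
+	}
+	return weightedScore / totalWeight
+}
+```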
+ +### Plugin configuration + +```go + +type LoadAwareSchedulingArgs struct { + metav1.TypeMeta + + FilterExpiredNodeMetrics *bool `json:"filterExpiredNodeMetrics,omitempty"` + NodeMetricExpirationSeconds *int64 `json:"nodeMetricExpirationSeconds,omitempty"` + ResourceWeights map[corev1.ResourceName]int64 `json:"resourceWeights,omitempty"` + UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + EstimatedScalingFactors map[corev1.ResourceName]int64 `json:"estimatedScalingFactors,omitempty"` +} + +``` + +- `FilterExpiredNodeMetrics` indicates whether to filter nodes where koordlet fails to update NodeMetric. +- `NodeMetricExpirationSeconds` indicates the NodeMetric expiration in seconds. When NodeMetrics expired, the node is considered abnormal.Default is 180 seconds. +- `ResourceWeights` indicates the weights of resources. The weights of CPU and Memory are both 1 by default. +- `UsageThresholds` indicates the resource utilization threshold, the default for CPU is 65%, and the default for memory is 95%. +- `EstimatedScalingFactors` indicates the factor when estimating resource usage. The default value of CPU is 85%, and the default value of Memory is 70%. + +`FilterExpiredNodeMetrics` controls the filter behavior, if it is false, `NodeMetricExpirationSeconds` can still be used when scoring. + +### Custom NodeMetric update Period + +This plugin is dependent on NodeMetric's reporting period. Different reporting periods need to be set according to different scenarios and workloads. If the reporting period is relatively long, koordlet needs to aggregate within the reporting period to ensure the effect of the metrics. Therefore, NodeMetricSpec needs to be extended to support user-defined reporting period and aggregation period. Users can modify `slo-controller-config` to complete the corresponding configuration, and the controller in `koord-manager` will be responsible for updating the reporting period and aggregation period fields of NodeMetrics of related nodes. + +```go +// NodeMetricSpec defines the desired state of NodeMetric +type NodeMetricSpec struct { + // CollectPolicy defines the Metric collection policy + CollectPolicy *NodeMetricCollectPolicy `json:"metricCollectPolicy,omitempty"` +} + +// NodeMetricCollectPolicy defines the Metric collection policy +type NodeMetricCollectPolicy struct { + // AggregateDurationSeconds represents the aggregation period in seconds + AggregateDurationSeconds *int64 `json:"aggregateDurationSeconds,omitempty"` + // ReportIntervalSeconds represents the report period in seconds + ReportIntervalSeconds *int64 `json:"reportIntervalSeconds,omitempty"` +} +``` + +### Custom node usage thresholds + +Currently, the resource utilization thresholds of nodes are configured based on experience to ensure the runtime quality of nodes. But there are also ways to evaluate the workload running on the node to arrive at a more appropriate threshold for resource utilization. For example, in a time-sharing scenario, a higher threshold can be set to allow scheduling to run more best effort workloads during the valley of latency-sensitive applications. When the peak of latency-sensitive applications comes up, lower the threshold and evict some best effort workloads. In addition, 3-sigma can be used to analyze the utilization level in the cluster to obtain a more appropriate threshold. + +Define Annotation supports user-defined node resource utilization thresholds. 
+ +```go +const ( + AnnotationCustomUsageThresholds = "scheduling.koordinator.sh/usage-thresholds" +) + +type CustomUsageThresholds struct { + UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` +} +``` \ No newline at end of file diff --git a/versioned_docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md b/versioned_docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md new file mode 100644 index 000000000..6c8cebc88 --- /dev/null +++ b/versioned_docs/version-v1.3/designs/multi-hierarchy-elastic-quota-management.md @@ -0,0 +1,342 @@ +# Multi Hierarchy Elastic Quota Management + +## Summary +When several users or teams share a cluster, fairness of resource allocation is very important. This proposal provides +multi-hierarchy elastic quota management mechanism for the scheduler. +- It supports configuring quota groups in a tree structure, which is similar to the organizational structure of most companies. +- It supports the borrowing / returning of resources between different quota groups, for better resource utilization efficiency. +The busy quota groups can automatically temporarily borrow the resources from the idle quota groups, which can improve the +utilization of the cluster. At the same time, when the idle quota group turn into the busy quota group, it can also automatically +take back the "lent-to" resources. +- It considers the resource fairness between different quota groups. When the busy quota groups borrow the +resources from the idle quota groups, the resources can be allocated to the busy quota groups under some fair rules. + +## Motivation + +### Compared with competitors + +#### Resource Quotas +[Resource Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/) provides the ability to restrain the upper +limit of resource usage in one quota group. The quota group resource usage aggregated based on the pod resource configurations. +Suppose there are still free resources in the cluster, but the resource usage of this quota group is close to the limit. +The quota group cannot flexibly borrow the idle resources from the cluster. The only possible way is to manually adjust the +limit of the quota group, but it is difficult to determine the timing and value of the adjustment when there are lots of +quota groups. + +#### Elastic Quota +[Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +proposed concepts of "max" and "min". "Max" is the upper bound of the resource consumption of the consumers. "Min" is the minimum +resources that are guaranteed to ensure the functionality/performance of the consumers. This mechanism allows the workloads +from one quota group to "borrow" unused reserved "min" resources from other quota groups. The unused "min" of one quota group +can be used by other quota groups, under the condition that there is a mechanism to guarantee the "victim" quota group can +consume its "min" resource whenever it needs. + +If multiple quota groups need borrow unused reserved "min" resources from other quota groups at the same time, +the implementation strategy is FIFO, which means that one quota group may occupy all "borrowed-from "resources, +while other quota groups cannot borrow any resources at all from the cluster. + +Neither of the above support multi hierarchy quota management. + +### Goals +1. Define API to announce multi hierarchy quota configuration. + +2. 
Provide a scheduler plugin that implements multi-hierarchy quota management.
+
+### Non-goals/Future work
+Users have two ways to manage GPU quotas. One is to declare only the number of GPU cards in the quota group without caring which card type is assigned. The other is to specify separate quotas for different card types. For example, suppose users A\B each have a quota of 10 GPUs, and the cluster has two GPU types, A100\V100. Quota group A only declares 10 GPUs, so during scheduling any combination of A100\V100 is acceptable as long as the total number of cards allocated to A is 10. Quota group B also declares 10 GPUs, but with more detail: 5 V100 and 5 A100, so the expectation is that at most 5 V100 and 5 A100 cards are allocated to B.
+
+The GPU card type is reflected by a label or annotation on the node, not as a resource dimension, so we cannot simply configure nvidia.com/gpu-v100 or nvidia.com/gpu-a100 directly as resource dimensions of the quota group.
+
+What makes it more complicated is that quota groups like A\B will exist in the same cluster at the same time, and the two modes conflict. Suppose the cluster has 20 cards, 10 A100 and 10 V100. If the scheduler first assigns 10 cards to quota group A, all of them V100, then quota group B's V100 quota can no longer be guaranteed, which obviously does not meet expectations. Therefore, we need to ensure that the quota mechanism still works correctly when the two modes coexist.
+
+The above problems will be solved in a follow-up proposal.
+
+## Proposal
+
+### Key Concept\User Stories
+1. Each quota group declares its own "min" and "max". The semantics of "min" is the quota group's guaranteed resources: if the quota group's "request" is less than or equal to "min", the quota group can obtain resources equivalent to its "request". The semantics of "max" is the quota group's upper limit of resources. We require "min" to be less than or equal to "max".
+
+2. We define "request" as the sum of the pod requests in the quota group. When some quota groups' "request" is less than "min" and other quota groups' "request" is more than "min", the unused resources of the former can be lent to the latter (or the former can choose not to share). The latter should use these resources according to a fair rule, and when the former needs its "lent-to" resources back, the latter should also return the "borrowed-from" resources according to the fair rule.
+
+3. We define "runtime" as the actual amount of resources the quota group can currently use. For a quota group whose "request" is less than "min", the value of "runtime" equals "request"; that is, a "request" below "min" should be satisfied unconditionally. For a quota group whose "request" is greater than "min", the value of "runtime" is between "min" and "max", and the part exceeding "min" depends on its own "request", the available "lent-to" resources, and the ability of other quota groups to compete for those "lent-to" resources. This is described in detail below.
+
+4. Hierarchy is very important in a shared cluster. Suppose the cluster is shared by multiple departments, and each department has multiple teams. If each team is a quota group, we naturally want the relationship between departments and teams to be tree shaped.
In this way, no matter how quota groups are added, deleted, or adjusted within a department, it remains an internal matter of that department. The cluster administrator only needs to be responsible for the quota configuration at the department level, and the configuration of the lower-level quota groups can be delegated to the department itself. Moreover, the tree makes it easy to see a resource summary from the perspective of departments when one department contains many teams.
+
+Another advantage of the tree structure is that it lets us control the scope of "lent-to" resources. For example, a department may want its own quota groups to borrow resources from each other, while the department's resources must not be lent to other departments. This is easy to express with a tree. It should be pointed out that although two levels can cover most scenarios (the more levels, the higher the maintenance complexity), we will support quota trees of arbitrary height.
+
+### Implementation Details
+
+#### Calculate RuntimeQuota
+
+We use an example to show how "runtime" is calculated. Suppose the cluster's total resource is 100 and there are 4 quota groups, with the configuration and "request" of each quota group as below:
+
+![image](/img/runtimequota1.jpg)
+
+We first calculate the "min" part of "runtime". It looks like this:
+
+![image](/img/runtimequota2.jpg)
+
+Then we find that quota group A can lend 5 quotas to B\C\D, and the cluster still has 40 quotas to allocate, so in total 45 quotas are available for B\C\D to share. We introduce a new field to represent allocation fairness, called "shared-weight". The "shared-weight" determines a quota group's ability to compete for shared resources; that is, B\C\D divide the shared resources of the cluster according to their "shared-weight".
+
+For example, assuming that the weights of B\C\D are 60\50\80:
+
+- B can get 45 * 60 / (60 + 50 + 80) = 14
+
+- C can get 45 * 50 / (60 + 50 + 80) = 12
+
+- D can get 45 * 80 / (60 + 50 + 80) = 19
+
+However, quota group B only needs 5 more (its request is 20 and its min is 15), while quota groups C and D are still hungry, so quota group B can hand 14 - 5 = 9 back to C and D.
+
+![image](/img/runtimequota3.jpg)
+
+Quota groups C and D then share the remaining 9 quotas in the same proportion: C gets 9 * 50 / (50 + 80) = 3 and D gets 9 * 80 / (50 + 80) = 6, and we finally obtain the runtime of each quota group.
+
+![image](/img/runtimequota4.jpg)
+
+The whole process can be summarized as follows:
+
+1. The quota groups are divided into two categories: those whose "request" is less than "min", which we call "lent-to-quotas", and those whose "request" is greater than "min", which we call "borrowed-quotas".
+
+2. Calculate the "runtime" of each quota group without exceeding "min", which tells us how many resources can be lent to the "borrowed-quotas".
+
+3. The "borrowed-quotas" share these resources in proportion to their "shared-weight".
+
+4. If a new "runtime" exceeds a group's "request", the surplus becomes new resources that can be lent to the remaining "borrowed-quotas".
+
+It is very difficult to manage the weights of thousands of quota groups in a company, so we need a default value for the "shared-weight". According to our experience in production, using "max" as the default "shared-weight" of a quota group satisfies most scenarios. In this way, "max" carries both the meaning of resource ceiling and allocation proportion: the larger the "max", the more resources the group wants.
For individual special scenarios, the resource administrator can still adjust the weight.
+
+It must be pointed out that if the cluster resources suddenly decrease, for example due to node failures, the sum of "min" may become greater than the total resources of the cluster. When this happens, we cannot actually guarantee the "min" of each quota group. In that case we reduce the "min" of each quota group by a moderate proportion, so that the sum of the "min" values actually in effect stays below the total resources of the cluster.
+
+We need to introduce the concept of a "sys-group". The "sys-group" is a quota group whose "min" is infinite, so its requests are never bounded by the quota. It is usually used for system-level pods. When the scheduler starts, the "sys-group" is created by default, both in scheduler memory and, where possible, as a quota group CRD. Its "min" and "max" are INT_MAX, and its "min" is not reduced proportionally by the process described above. The real total resource available to normal quota groups is the cluster's total resource minus the "used" of the "sys-group".
+
+We also need to introduce the concept of a "default-group". If a pod cannot be matched to any quota group, it is matched to the "default-group". The "default-group" is also created by default, both in scheduler memory and, where possible, as a quota group CRD. Its "min" and "max" have default values, and users can modify them on demand.
+
+#### Hierarchy
+We can organize quota groups in a quota-tree, where each quota group has its own configuration. Currently, only leaf nodes are allowed to submit jobs. An example is shown below:
+
+![image](/img/quotatree1.jpg)
+
+When we calculate the "request" of each quota group, we first aggregate the requests of each parent group from the bottom up, accumulating min(child group request, child group max) over its children.
+
+![image](/img/quotatree2.jpg)
+
+Then we calculate the "runtime" from top to bottom. The "runtime" of a parent quota group is the total resource available to its child quota groups. First we calculate the parent quota group's "runtime".
+
+![image](/img/quotatree3.jpg)
+
+Then we calculate the child quota groups' "runtime".
+
+![image](/img/quotatree4.jpg)
+
+#### Min Guarantee and Preemption
+Consider the following situation: suppose the cluster has two quota groups A\B. At time t0, only quota group A has submitted jobs, so it can borrow quota group B's resources, and its "request" and "used" are both 100 as below:
+
+![image](/img/quotaguarantee1.jpg)
+
+At time t1, quota group B submits jobs too, so the "runtime" of quota groups A\B is 50 each. However, if quota group A does not return the borrowed resources, quota group B cannot be assigned any resources because the node resources are still occupied by quota group A.
+
+![image](/img/quotaguarantee2.jpg)
+
+The solution is that a background thread monitors the relationship between "used" and "runtime" of each quota group. If a quota group's "used" stays greater than its "runtime", we start a forced recycling mechanism that kills pods in order of priority, from low to high, until "used" is less than or equal to "runtime". If some pods in the quota group must not be recycled, we require that such pods only use resources up to "min". By default, we assume all pods may use resources beyond "min" as long as "runtime" is larger than "min". A simplified sketch of the recycling loop is shown below.
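+
+Below is a minimal, illustrative sketch of this forced recycling loop, assuming hypothetical `QuotaGroup` and `Pod` types; it is not the controller's real code and omits details such as the pods that are protected up to "min".
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// QuotaGroup is a hypothetical, simplified view of a quota group's accounting state.
+type QuotaGroup struct {
+    Runtime int64
+    Used    int64
+    Pods    []Pod
+}
+
+// Pod is a hypothetical pod record with its priority and total resource usage.
+type Pod struct {
+    Name     string
+    Priority int32
+    Usage    int64
+}
+
+// recycleOveruse evicts pods from the lowest priority upwards until the group's
+// "used" no longer exceeds its "runtime". It returns the names of the evicted pods.
+func recycleOveruse(qg *QuotaGroup) []string {
+    if qg.Used <= qg.Runtime {
+        return nil
+    }
+    // Kill pods in order of priority, from low to high.
+    sort.Slice(qg.Pods, func(i, j int) bool {
+        return qg.Pods[i].Priority < qg.Pods[j].Priority
+    })
+    var evicted []string
+    for _, p := range qg.Pods {
+        if qg.Used <= qg.Runtime {
+            break
+        }
+        qg.Used -= p.Usage
+        evicted = append(evicted, p.Name)
+    }
+    return evicted
+}
+
+func main() {
+    qg := &QuotaGroup{
+        Runtime: 50,
+        Used:    80,
+        Pods: []Pod{
+            {Name: "pod-a", Priority: 100, Usage: 20},
+            {Name: "pod-b", Priority: 10, Usage: 25},
+            {Name: "pod-c", Priority: 50, Usage: 35},
+        },
+    }
+    fmt.Println("evicted:", recycleOveruse(qg), "remaining used:", qg.Used)
+}
+```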
+
+We do not adopt cross-quota preemption to solve the problem where one quota group's "used" is below its "runtime" while another group's "used" exceeds its "runtime" (that is, preempting the over-used group). Because each quota group has an accurate runtime, we can accurately recycle the overused resources of each quota group, which is more direct than preemption.
+
+In addition, we do not think cross-quota preemption is worth recommending. In principle, the priorities of different quota groups are not comparable, because they may come from different business lines: a high priority in one business line is not more important than a low priority in another. Only priorities within a quota group are meaningful to compare. So we will not support cross-quota preemption for now. Moreover, for preemption inside a quota group, we require that existUsed - preempted + preempt stays smaller than runtime.
+
+It follows from the above that if a quota group's "min" is not equal to its "max", the part of "runtime" exceeding "min" may be recycled by the scheduler.
+
+#### Configuration Limit
+We introduce several constraints to ensure that the quota mechanism works properly.
+
+1. Except for first-level quota groups, we require that the sum of "min" of all sub quota groups is less than or equal to the "min" of the parent group. The first-level quota groups are excluded because cluster resources cannot avoid jitter; if the cluster resources shrink, we do not want this to block updates of the quota groups.
+
+2. The "max" of a child quota group can be larger than the "max" of its parent group. Consider the following scenario: there are 2 subtrees in the cluster, "dev-parent" and "production-parent", and each subtree has several quota groups. When "production" is busy, we can limit the resource use of "dev" by decreasing only the "max" of "dev-parent", instead of decreasing the "max" of every sub quota group under "dev-parent".
+
+3. A parent group cannot run pods. We did receive a request to allow parent groups to submit jobs, where the parent group's own jobs would have higher priority than all of its sub-groups and could preempt the "runtime" of the sub-groups' jobs at any time; this is somewhat similar to the hierarchical relationship of "town, city, province". Due to its complexity, we do not support this for now.
+
+4. The parent of a node can only be a parent group, not a child group.
+
+5. A quota group cannot be converted between the parent-group and child-group attributes.
+
+6. We allow a node on the quota tree to freely change its parent node, as long as this does not break the existing validation rules.
+
+We will introduce a new webhook to check these configuration limitations.
+
+#### Extension Point
+
+##### PreFilter
+We will check whether (Pod.request + Quota.Used) is less than Quota.Runtime. If not, the scheduling cycle of the Pod fails.
+
+##### PostFilter
+We will re-implement the method selectVictimsOnNode of defaultPreempt. The original selectVictimsOnNode selects all pods on a node with lower priority than the preemptor as potential victims. For now, we only allow inner-quota-group preemption.
+
+##### Cache and Controller
+1. We will watch the events of quota groups and pods to calculate the "runtime" of each quota group.
+2. We will create a thread that periodically updates the quota group CRD to display "request\used\runtime".
+3. We will create a thread to monitor "used" and "runtime" of each quota group.
If quota group's "used" continues to be +greater than "runtime", we will start the forced recycling mechanism to kill several pods in the order of priority from +low to high until the "used" is less than or equal to "runtime". + +### API + +#### Quota +We will reuse [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +'s crd to declare quota group. + +```go +type ElasticQuota struct { + metav1.TypeMeta + metav1.ObjectMeta + Spec ElasticQuotaSpec + Status ElasticQuotaStatus +} + +type ElasticQuotaSpec struct { + Min v1.ResourceList + Max v1.ResourceList +} + +type ElasticQuotaStatus struct { + Used v1.ResourceList +} +``` + +we will also add new annotation and labels to achieve our desired functionality. +```yaml +annotations: + quota.scheduling.koordinator.sh/runtime: {cpu:4, memory: 8Gi} + quota.scheduling.koordinator.sh/shared-weight: {cpu:4, memory: 8Gi} +labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent-quota-name: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true +``` +- `quota.scheduling.koordinator.sh/runtime` is updated by the scheduler. It reflects the "runtime" of the quota group. +- `quota.scheduling.koordinator.sh/is-parent` is disposed by the user. It reflects the "child\parent" attribute of the quota group. Default is child. +- `quota.scheduling.koordinator.sh/parent-quota-name` is disposed by the user. It reflects the parent quota name. Default is root. +- `quota.scheduling.koordinator.sh/shared-weight` is disposed by the user. It reflects the ability to share the "lent to" resource. Default equals to "max". +- `quota.scheduling.koordinator.sh/allow-lent-resource` is disposed by the user. It reflects whether quota group allows lent unused "min" to others. + +Here is a example: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: test + namespace: test + annotations: + quota.scheduling.koordinator.sh/runtime: {cpu:4, memory: 8Gi} + quota.scheduling.koordinator.sh/shared-weight: {cpu:4, memory: 8Gi} + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent-quota-name: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true +spec: + max: + cpu: 20 + memory: 40Gi + nvidia.com/gpu: 2 + min: + cpu: 10 + memory: 20Gi + nvidia.com/gpu: 1 +``` + +#### Pod +We introduce a new label on the pod to associate pod with quota group: +```yaml +labels: + quota.scheduling.koordinator.sh/quota-name: "test1" +``` + +if pod's don't have the label, we will follow [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) +using namespace to associate pod with quota group. + +### Compatibility +We are fully compatible with [Elastic Quota](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/9-capacity-scheduling/README.md#goals) 's interface. +If pod's don't have the "quota-name" label, we will use the namespace to associate pod with quota group. If the pod has +the "quota-name" label, we will use it to associate pod with quota group instead of namespace. If we can't find the +matched quota group, we force the pod to associate with the "default-group". + +## Unsolved Problems +Please see Non-goals/Future work. 
+
+## Alternatives
+
+## Implementation History
+
+## References
diff --git a/versioned_docs/version-v1.3/designs/node-prediction.md b/versioned_docs/version-v1.3/designs/node-prediction.md
new file mode 100644
index 000000000..9bda2cc8a
--- /dev/null
+++ b/versioned_docs/version-v1.3/designs/node-prediction.md
@@ -0,0 +1,278 @@
+# Node Prediction
+
+## Summary
+
+The *node prediction* is proposed to improve node utilization while avoiding overload. By profiling the tendency of the node metrics, we can estimate the peak usage and implement a more efficient over-commitment policy.
+
+## Motivation
+
+Setting appropriate resource requirements when scheduling pods is genuinely hard. Underestimating requests can cause performance issues, while overestimating them is likely to cause resource waste and low efficiency. One common approach is to use the Vertical Pod Autoscaler (VPA) to autopilot the resource requirements for the pods of the same workload. The VPA optimizes a pod's resource requirements according to the metrics of pods from the same workload: it estimates the pod usage and specifies proper resource requirements. It works well when we want to optimize the resource requirements of workloads. However, most VPA approaches discard the time-series attribute of the pod metrics and generate relatively static requests/limits that are meant to be safe regardless of timing. This leaves the usage-to-limit gap, i.e. the gap between the recommended pod request and the real-time pod usage, and the well-known pooling effect, i.e. the gap between the sum of the pod usages and the node usage. Inspired by [Google's work](#references) in EuroSys'21, we propose the node prediction in Koordinator to close these two gaps.
+
+### Goals
+
+- Define the node prediction API.
+- Propose an online history-based-optimized (HBO) prediction model.
+- Clarify how the Mid-tier resources are calculated with the prediction.
+
+### Non-Goals/Future Work
+
+- Propose a time-series-forecasting-based or offline prediction model.
+
+## User Stories
+
+### Story 1
+
+As a cluster administrator, I have many web service pods that allocate almost all of the node resources, yet the node utilization is low since most of the allocated resources are not actually used. To improve node utilization, I want to reclaim the unused resources to run some low-priority online-service pods and Flink jobs. However, I am concerned about the risk that over-utilization overloads the machines, degrading performance and hurting pod QoS.
+
+### Story 2
+
+As a Kubernetes developer, I want to support long-term load balancing in the scheduler. Thus, I need to know which nodes are expected to stay idle for a long time.
+
+## Design
+
+### Design Principles
+
+- The node prediction is low-cost and can be implemented in the Koordlet.
+- The node prediction is pluggable. Users can replace the default model to customize the prediction.
+
+### Architecture
+
+The node prediction is implemented mainly in the Koordlet and Koord-Manager. The architecture is as below:
+
+![image](/img/node-prediction.svg)
+
+- Koordlet: The agent that runs on the node. It implements the metrics collection, metrics storage, and the predict server.
+  - Metrics Advisor: It collects the cpu/memory usage of the node and the running pods. It stores the collected metrics in the Metric Cache.
+  - Metric Cache: It stores the node and pod metrics in a TSDB, which allows other modules to query the metrics later.
+ - Predict Server: With the node and pod metrics retrieved from the Metric Cache, it calculates and checkpoints the predicted result based on the prediction model. + - States Informer: It maintains the metadata of the node and the pods. It also reports the latest prediction periodically to the kube-apiserver. +- Koord-Manager: The controller runs on a master node. + - Configuration delivery: It maintains the prediction and colocation strategies and distributes the node strategy onto the NodeMetric. + - Resource Calculator: It fetches the node prediction result, and calculates the resource allocatable of the reclaimed resources (i.e. Mid-tier resource). +- Koord-Scheduler: It schedules the pod with different priority bands (e.g. Prod, Mid, Batch). It can enable load-aware scheduling to balance the over-committed nodes' utilization. + +#### Workflow + +In the koordlet, stages to update the node prediction are as follows: + +1. Histogram initialization: The predict server initializes a set of histograms for CPU and memory. For implementing `N-Sigma_v1`, it initializes decayed histograms only for the node and priority classes. While implementing `N-Sigma_v2`, it initializes histograms both for the node and every running pod. +2. Metrics collection: The metrics advisor collects the usage statistics of node and pods and stores them as metric points into the metric cache every CollectInterval (e.g. 1s). +3. Histogram updating: The predict server fetches the node metrics and pod metrics of latest HistogramUpdateInterval (e.g. 30s). Then it uses the aggregated result to update the decayed histograms. +4. Periodical reporting: The states informer fetches node metrics and the last histograms for the node and priority classes every ReportingInterval (e.g. 60s). Then it reports the complete NodeMetric status with last node prediction info to the kube-apiserver. +5. Fast reporting: The states informer fetches the last histograms every CheckPredictionInterval (e.g. 20s). It checks if the predicted result is too small or too larger than the last updated prediction exceeding the ResourceDiffThreshold (e.g. 5%), or the updated duration is longer than ForceUpdateInterval (e.g. 600s). If the check result is true, It updates the latest node prediction to the kube-apiserver. + +In the koord-manager, stages to update the Mid-tier resources allocatable are as follows: + +1. NodeMetric lifecycle management: The koord-manager list-watches the Node and the ConfigMap slo-controller-config, and maintains the lifecycle of the NodeMetric CR. Once the colocation strategy in the slo-controller-config updated, the koord-manager parses the config data and updates the node prediction policy and mid colocation policy into the NodeMetric.Spec. +2. Mid resource updating: The koord-manager list-watches the NodeMetric. Once the NodeMetric status is updated, the koord-manager gets the latest node metrics and node prediction, and calculates the Mid allocatable resources based on the Mid over-commitment formula. Finally, it updates the Mid allocatable resources into the Node status as the extended resources (`kubernetes.io/mid-cpu`, `kubernetes.io/mid-memory`). + +#### Scheduling Optimization + +The results of the node prediction on the NodeMetric, the Mid extended resources on the Node and the scheduling Pod +in the scheduler are updated in different time. It is inevitable to find that the scheduler schedules a pod with an +older version of the node prediction, which may cause the schedule result "lagged". 
+ +To relief the lagged prediction, the koordlet and koord-manager try both updating earlier when the +prediction/NodeMetric differs from the previous result than a threshold and set a resource buffer which should +tolerant most of the result changes between synchronizations. + +For the worst case in which the prediction could be lagged too much (e.g. 1 hour), we can maintain a lower bound of +the real Mid allocatable resources inside the scheduler. This part is not planned in the first version of the Mid-tier +over-commitment. + +### API + +#### Node Prediction + +##### Predict Policy + +```go +// ColocationStrategy defines the colocation strategy in slo-controller-config ConfigMap. +type ColocationStrategy struct { + // ... + NodePredictPolicy *slov1alpha1.PredictPolicy `json:"nodePredictPolicy,omitempty"` +} + +type NodeMetricSpec struct { + // ... + PredictPolicy *PredictPolicy `json:"predictPolicy,omitempty"` +} + +// PredictPolicy defines the policy for the node prediction. +type PredictPolicy struct { + ResourceDiffThresholdPercent *int64 `json:"resourceDiffThresholdPercent,omitempty"` + ColdStartPeriodSeconds *int64 `json:"coldStartPeriodSeconds,omitempty"` +} +``` + +##### Predicted Result + +```go +type NodeMetricStatus struct { + // ... + // ProdReclaimableMetric is the estimated reclaimable resources for the Prod-type pods. + ProdReclaimableMetric *ReclaimableMetric `json:"prodReclaimableMetric,omitempty"` +} + +type ReclaimableMetric struct { + // Resource is the resource usage of the prediction. + Resource ResourceMap `json:"resource,omitempty"` +} +``` + +#### Mid Overcommitment + +##### Colocation Strategy + +```go +type ColocationStrategy struct { + // ... + // MidCPUThresholdPercent defines the maximum percentage of the Mid-tier cpu resource dividing the node allocatable. + // MidCPUAllocatable <= NodeCPUAllocatable * MidCPUThresholdPercent / 100. + MidCPUThresholdPercent *int64 `json:"midCPUThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` + // MidMemoryThresholdPercent defines the maximum percentage of the Mid-tier memory resource dividing the node allocatable. + // MidMemoryAllocatable <= NodeMemoryAllocatable * MidMemoryThresholdPercent / 100. + MidMemoryThresholdPercent *int64 `json:"midMemoryThresholdPercent,omitempty" validate:"omitempty,min=0,max=100"` +} +``` + +##### Extended Resources + +```yaml +apiVersion: v1 +kind: Node +metadata: + name: test-node +status: + allocatable: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods + kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods + capacity: + cpu: '32' + memory: 129636240Ki + pods: '213' + kubernetes.io/mid-cpu: '16000' + kubernetes.io/mid-memory: 64818120Ki +``` + +### Theoretical Model + +#### Node Peak Prediction + +Before elaborating the peak prediction algorithm, let's formalize the node peak prediction problem. + +Let's denote the usage of a Pod `p` at the time `t` is `U(p, t)`. + +Then the usage of a Node `M` which schedules a set of Pods is `MU(Pods, t) = sum[p in Pods](U(p, t))`. + +> Note that the non-Pod usage of the node can be regarded as the usage of a special pod `S`. + +When we want to predict the node peak at the time `T`, we are calculating +`Peak(Pods, T) = max[t >= T](sum[p in Pods](U(p, t)))`. + +The predicted peak `Peak(Pods, T)` is our node prediction result at `T`. 
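+
+To make the notation concrete, here is a small, purely illustrative sketch that evaluates `MU(Pods, t)` and `Peak(Pods, T)` over a sampled trace; it is not the prediction model itself, which is described next.
+
+```go
+package main
+
+import "fmt"
+
+// nodeUsage returns MU(Pods, t) = sum over pods of U(p, t) for each time index t,
+// where usage[p][t] is the sampled usage U(p, t) of pod p.
+func nodeUsage(usage [][]float64) []float64 {
+    if len(usage) == 0 {
+        return nil
+    }
+    mu := make([]float64, len(usage[0]))
+    for _, podSeries := range usage {
+        for t, u := range podSeries {
+            mu[t] += u
+        }
+    }
+    return mu
+}
+
+// peak returns the maximum of the node usage series over t >= start, i.e. Peak(Pods, T).
+func peak(mu []float64, start int) float64 {
+    max := 0.0
+    for t := start; t < len(mu); t++ {
+        if mu[t] > max {
+            max = mu[t]
+        }
+    }
+    return max
+}
+
+func main() {
+    // Two pods sampled at four time points (CPU cores).
+    usage := [][]float64{
+        {1.0, 2.0, 1.5, 0.5}, // U(p1, t)
+        {0.5, 1.0, 2.5, 1.0}, // U(p2, t)
+    }
+    mu := nodeUsage(usage)
+    fmt.Println("MU(Pods, t):", mu)            // [1.5 3 4 1.5]
+    fmt.Println("Peak(Pods, 0):", peak(mu, 0)) // 4
+}
+```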
+
+#### N-sigma Prediction
+
+There are several [statistical peak prediction models](#alternatives) which are practical to implement in an online scheduler. [*N-sigma*](#references) is the peak prediction model picked in the current implementation. It assumes the time-series node metrics follow a Gaussian distribution, which allows us to estimate the node peak with the mean and standard deviation (stdev):
+
+`Peak_N-Sigma_v1(Pods, T) = mean[T0 <= t <= T](MU(Pods, t)) + N * stdev[T0 <= t <= T](MU(Pods, t))`
+
+The `Peak_N-Sigma_v1` is the predicted node peak. It is implemented as the first version of node prediction and is calculated from node-level metrics.
+
+Moreover, we can calculate it from the pods' metrics:
+
+`Peak_Pods-N-Sigma(Pods, T) = sum[p in Pods](mean[T0 <= t <= T](U(p, t)) + N * stdev[T0 <= t <= T](U(p, t)))`
+
+A more conservative estimate is derived from the maximum of the two. The `Peak_N-Sigma_v2` is the second version of node prediction, which also considers the pod-level metrics:
+
+`Peak_N-Sigma_v2(Pods, T) = max(Peak_N-Sigma_v1(Pods, T), Peak_Pods-N-Sigma(Pods, T))`.
+
+#### Mid-tier Overcommitment
+
+In the first version, the Mid-tier resource contains the reclaimable resources which are probably unused in the long term by the high-priority (i.e. Prod) pods.
+The resource calculation for the Mid-tier resources can be described as follows:
+
+```
+Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio)
+```
+
+- `Reclaimable[Mid] := max(0, reclaimRatio * Allocated[Prod] - Peak[Prod])`. The peak prediction model is used to estimate the future usage of the running Prod pods. The Mid pods can allocate a proportion of the resources reclaimed from running Prod pods.
+- `NodeAllocatable * thresholdRatio` is the maximum co-located Mid-tier resource, set as a ratio of the node allocatable.
+
+In later versions, the Mid-tier resource is planned to be mixed with the default node allocatable (i.e. the Prod allocatable), which means a Mid pod can allocate the unallocated node allocatable resources, and an idle node is able to schedule Mid pods. The Prod pods can preempt the Mid pods when the mixed allocatable is exhausted by the Mid pods, so the Prod-tier resource is still more stable and better guaranteed than the Mid-tier.
+The resource calculation for the mixed Mid-tier resources can then be described as follows:
+
+```
+Allocatable[Mid]' := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
+Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)
+```
+
+## Alternatives
+
+### Peak Prediction Models
+
+There are several peak prediction and time-series forecasting models which can estimate the future peak based on historical node metrics, including statistical methods and machine learning methods. In this proposal, statistical peak prediction models are preferred since they are practical to implement in an online scheduling system, have less metrics-collection overhead than the ML approaches, and are simpler to analyze and debug.
+
+Here are some common statistical peak prediction models:
+
+1. [Borg-default](#references)
+
+Borg-default simply over-commits the machine resources at a fixed rate `a`, which means the peak usage is regarded as the requests divided by `a`.
+
+Let's denote the resource request of Pod `p` at time `t` as `R(p, t)`, where `R(p, t) = 0` when `p` is not running. Then we have
+
+`Peak_Borg-default(Pods, T) = 1/a * sum[p in Pods](R(p, T))`, `a = 1.1` by default.
+
+2.
[Resource Central](#references) + +Resource Central considers the peak of the machine as the sum of the peak of individual pods (or VMs). And a simple +peak prediction of a pod is the percentile of the historical usages, e.g. `percentile[t in [T-C, T]](U(p, t))`. + +`Peak_ResourceCentral(Pods, T) = sum[p in Pods](percentile[t in [T-C, T]](U(p, t)))` + +3. [Max](#references) + +The Max prediction model does not use the historical metrics directly, but takes the maximal of any known peak results. +It gets the more conservative result than the input models. For example, we have a `Max_Borg-default_ResourceCentral` +model calculated from the Borg-default and Resource Central models: + +`Peak_Max_Borg-default_ResourceCentral(Pods, T) = max(Peak_Borg-default(Pods, T), Peak_ResourceCentral(Pods, T))` + +## References + +1. Vertical Pod Autoscaler: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler +2. Bashir, Noman, et al. "Take it to the limit: peak prediction-driven resource overcommitment in datacenters." Proceedings of the Sixteenth European Conference on Computer Systems. 2021. +3. Cortez, Eli, et al. "Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms." Proceedings of the 26th Symposium on Operating Systems Principles. 2017. diff --git a/versioned_docs/version-v1.3/designs/nri-mode-resource-management.md b/versioned_docs/version-v1.3/designs/nri-mode-resource-management.md new file mode 100644 index 000000000..507c7e30c --- /dev/null +++ b/versioned_docs/version-v1.3/designs/nri-mode-resource-management.md @@ -0,0 +1,152 @@ +# NRI Mode Resource Management + +## Glossary + +NRI, node resource interface. See: https://github.com/containerd/nri + +## Summary + +We hope to enable NRI mode resource management for koordinator for easy deployment and in-time control. + +## Motivation + +Koordinator as a QoS based scheduling system for hybrid workloads orchestration on Kubernetes and its runtime hooks support two working [modes](https://github.com/koordinator-sh/koordinator/blob/main/docs/design-archive/koordlet-runtime-hooks.md) for different scenarios: `Standalone` and `Proxy`. However, both of them have some [constraints](https://shimo.im/docs/m4kMLdgO1LIma9qD). NRI (Node Resource Interface), which is a public interface for controlling node resources is a general framework for CRI-compatible container runtime plug-in extensions. It provides a mechanism for extensions to track the state of pod/containers and make limited modifications to their configuration. We'd like to integrate NRI framework to address `Standalone` and `Proxy` constraints based on this community recommend mechanism. + +### Goals + +- Support NRI mode resource management for koordinator. +- Support containerd container runtime. + +### Non-Goals/Future Work + +- Support docker runtime + +## Proposal + +Different from standalone and proxy mode, Koodlet will start an NRI plugin to subscribe pod/container lifecycle events from container runtime (e.g. containerd, crio), and then koordlet NRI plugin will call runtime hooks to adjust pod resources or OCI spec. The flow should be: + +- Get pod/container lifecycle events and OCI format information from container runtime (e.g. containerd, crio). +- Transform the OCI format information into internal protocols. (e.g. PodContext, ContainerContext) to re-use existing runtime hook plugins. 
+- Transform the runtime hook plugins' response into OCI spec format +- Return OCI spec format response to container runtime(e.g. containerd, crio). + +![nri-proposal.png](/img/nri-proposal.png) + +### User Stories + +#### Story 1 +As a cluster administrator, I want to apply QoS policy before pod's status become running. + +#### Story 2 +As a cluster administrator, I want to deploy koordinator cluster without restart. + +#### Story 3 +As a cluster administrator, I want to adjust resources' policies at runtime. + +#### Story 4 +As a GPU user, I want to inject environment before pod running. + +### Requirements + +- Need to upgrade containerd to >= 1.7.0, crio to >= v1.25.0 + +#### Functional Requirements + +NRI mode should support all existing functionalities supported by standalone and Proxy mode. + +#### Non-Functional Requirements + +Non-functional requirements are user expectations of the solution. Include +considerations for performance, reliability and security. + +### Implementation Details/Notes/Constraints +1. koordlet [NRI plugin](https://github.com/containerd/nri/blob/main/plugins/template/plugin.go) +```go +type nriServer struct { + stub stub.Stub + mask stub.EventMask + options Options // server options +} + +// Enable 3 hooks (RunPodSandbox, CreateContainer, UpdateContainer) in NRI +func (p *nriServer) Configure(config, runtime, version string) (stub.EventMask, error) { +} + +// Sync all pods/containers information before koordlet nri plugin run +func (p *nriServer) Synchronize(pods []*api.PodSandbox, containers []*api.Container) ([]*api.ContainerUpdate, error) { +} + +func (p *nriServer) RunPodSandbox(pod *api.PodSandbox) error { + podCtx.FromNri(pod) + RunHooks(...) + podCtx.NriDone() +} + +func (p *nriServer) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) { + containerCtx.FromNri(pod, container) + RunHooks(...) + containCtx.NriDone() +} + +func (p *nriServer) UpdateContainer(pod *api.PodSandbox, container *api.Container) ([]*api.ContainerUpdate, error) { + containerCtx.FromNri(pod, container) + RunHooks(...) + containCtx.NriDone() +} +``` +2. koordlet enhancement for NRI +- PodContext +```go +// fill PodContext from OCI spec +func (p *PodContext) FromNri(pod *api.PodSandbox) { +} + +// apply QoS resource policies for pod +func (p *PodContext) NriDone() { +} +``` +- ContainerContext +```go +// fill ContainerContext from OCI spec +func (c *ContainerContext) FromNri(pod *api.PodSandbox, container *api.Container) { +} + +// apply QoS resource policies for container +func (c *ContainerContext) NriDone() (*api.ContainerAdjustment, []*api.ContainerUpdate, error) { +} +``` + +### Risks and Mitigations + +## Alternatives +There are several approaches to extending the Kubernetes CRI (Container Runtime Interface) to manage container resources such as `standalone` and `proxy`. Under `standalone` running mode, resource isolation parameters will be injected asynchronously. Under `proxy` running mode, proxy can hijack CRI requests from kubelet for pods and then apply resource policies in time. However, `proxy` mode needs to configure and restart kubelet. + +There are a little difference in execution timing between `NRI` and `proxy` modes. Hook points (execution timing) are not exactly same. The biggest difference is `proxy` call koordlet hooks between kubelet and containerd. 
However, NRI will call NRI plugin (koodlet hooks) in containerd, that means containerd still could do something before or after containerd call NRI plugin (koordlet hooks). For example, under `NRI` running mode, containerd setup pod network first and then call NRI plugin (koordlet hooks) in RunPodSanbox, but under `proxy` running mode, containerd couldn't do anything before koordlet hooks running when `proxy` handle RunPodSandbox CRI request. + +- Standalone + + - kubelet -- CRI Request -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers + - kubelet -> Node Agent -> CRI Runtime / containers + +![standalone.png](/img/standalone.png) + +- Proxy + + - kubelet -- CRI Request -> CRI Proxy -- CRI Request (hooked) -> CRI Runtime -- OCI Spec -> OCI compatible runtime -> containers + +![proxy.png](/img/proxy.png) + +- NRI + + - kubelet -- CRI Request -> CRI Runtime -- OCI Spec --> OCI compatible runtime -> containers +                  ↘   ↗ +                Koordlet NRI plugin + +![nri.png](/img/nri.png) + +## Upgrade Strategy + +- Need to upgrade containerd to 1.7.0+ or CRIO to 1.26.0+ +- Need to enable NRI + + diff --git a/versioned_docs/version-v1.3/designs/pod-migration-job.md b/versioned_docs/version-v1.3/designs/pod-migration-job.md new file mode 100644 index 000000000..47a94aba8 --- /dev/null +++ b/versioned_docs/version-v1.3/designs/pod-migration-job.md @@ -0,0 +1,374 @@ +# PodMigrationJob + +## Summary + +This proposal defines a CRD-based Pod migration API, through which the descheduler or other automatic fault recovery components can evict or delete Pods more safely. At the same time, the proposal also describes the specific implementation details of the API. + +## Motivation + +Migrating Pods is an important capability that many components (such as deschedulers) rely on, and can be used to optimize scheduling or help resolve workload runtime quality issues. We believe that pod migration is a complex process, involving steps such as auditing, resource allocation, and application startup, and is mixed with application upgrading, scaling scenarios, and resource operation and maintenance operations by cluster administrators. Therefore, how to manage the stability risk of this process to ensure that the application does not fail due to the migration of Pods is a very critical issue that must be resolved. + +Therefore, it is necessary to realize a final state-oriented migration capability based on CRD, track the status of each process in the migration, and perceive scenarios such as upgrading and scaling of the application. + +### Goals + +1. Defines a CRD-based Pod Migration Job API, through which the descheduler can evict or delete Pods more safely. +2. Describe in detail the design details behind the API. + +### Non-Goals/Future Work + +1. A new descheduler framework +2. Descheduling capability for different scenarios such as load-aware descheduling, defragemention, etc. +3. The details about Deterministic preemption that preempts other Pods for Reservation. + +## Proposal + +### User Stories + +#### Story 1 + +The descheduler in the K8s community evicts pods to be rescheduled according to different strategies. However, it does not guarantee whether the evicted Pod has resources available after re-creation. If a large number of new Pods are in the Pending state when the resources in the cluster are tight, may lower the application availabilities. 
+ +#### Story 2 + +The descheduler evicts the Pod through the Eviction API, and the Eviction API decides whether to delete the Pod according to the PDB status. However, it is unable to perceive workload upgrades, scaling and other scenarios in which Pods are deleted, which will also bring security risks. + +#### Story 3 + +The Pod migration capability itself can be provided to users as a service. Users can integrate this API in their own systems to achieve safe migration, and are no longer limited to deschedulers. + + +### Basic Migration API + +These APIs provide cluster administrators with more fine-grained migration control capabilities, which can better reduce risks. + +- `scheduling.koordinator.sh/eviction-cost` indicates the eviction cost. It can be used to set to an int32. The implicit eviction cost for pods that don't set the annotation is 0, negative values are permitted. If set the cost ith `math.MaxInt32`, it means the Pod will not be evicted. Pods with lower eviction cost are preferred to be evicted before pods with higher eviction cost. If a batch of Pods to be evicted have the same priority, they will be sorted by cost, and the Pod with the smallest cost will be evicted. Although the K8s community has [Pod Deletion Cost #2255](https://github.com/kubernetes/enhancements/issues/2255), it is not a general mechanism. To avoid conflicts with components that use `Pod Deletion Cost`, users can individually mark the eviction cost for Pods. + + +### Pod Migration Job CRD + +In order to support the above user stories, a Custom Resource Definition(CRD) named `PodMigrationJob` is proposed to ensure the migration process safely. + +#### Migration Job Spec + +```go + +// PodMigrationJob is the Schema for the PodMigrationJob API +// +k8s:openapi-gen=true +// +kubebuilder:resource:scope=Cluster +type PodMigrationJob struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec PodMigrationJobSpec `json:"spec,omitempty"` + Status PodMigrationJobStatus `json:"status,omitempty"` +} + +type PodMigrationJobSpec struct { + // Paused indicates whether the PodMigrationJob should to work or not. + // Default is false + // +optional + Paused bool `json:"paused,omitempty"` + + // TTL controls the PodMigrationJob timeout duration. + // +optional + TTL *metav1.Duration `json:"ttl,omitempty"` + + // Mode represents the operating mode of the Job + // Default is PodMigrationJobModeReservationFirst + // +optional + Mode PodMigrationJobMode `json:"mode,omitempty"` + + // PodRef represents the Pod that be migrated + // +required + PodRef *corev1.ObjectReference `json:"podRef"` + + // ReservationOptions defines the Reservation options for migrated Pod + // +optional + ReservationOptions *PodMigrateReservationOptions `json:"reservationOptions,omitempty"` + + // DeleteOptions defines the deleting options for the migrated Pod and preempted Pods + // +optional + DeleteOptions *metav1.DeleteOptions `json:"deleteOptions,omitempty"` +} + +type PodMigrationJobMode string + +const ( + PodMigrationJobModeReservationFirst PodMigrationJobMode = "ReservationFirst" + PodMigrationJobModeEvictionDirectly PodMigrationJobMode = "EvictDirectly" +) + +type PodMigrateReservationOptions struct { + // ReservationRef if specified, PodMigrationJob will check if the status of Reservation is available. 
+ // ReservationRef if not specified, PodMigrationJob controller will create Reservation by Template, + // and update the ReservationRef to reference the Reservation + // +optional + ReservationRef *corev1.ObjectReference `json:"reservationRef,omitempty"` + + // Template is the object that describes the Reservation that will be created if not specified ReservationRef + // +optional + Template *ReservationTemplateSpec `json:"template,omitempty"` + + // PreemptionOption decides whether to preempt other Pods. + // The preemption is safe and reserves resources for preempted Pods. + // +optional + PreemptionOptions *PodMigrationJobPreemptionOptions `json:"preemptionOptions,omitempty"` +} + +type PodMigrationJobPreemptionOptions struct { + // Reserved object. +} +``` + +- `Paused` indicates whether the PodMigrationJob should to work or not. In some scenarios, the user does not expect the PodMigrationJob Controller to process the PodMigrationJob immediately, but rather to decide whether to execute it after completing some operations similar to auditing. +- `TimeoutInSeconds` controls the PodMigrationJob timeout duration. +- The `PodMigrationJob` support two modes defined by the field `Mode`: + - `PodMigrationJobModeReservationFirst` means that before migrating a Pod, try to reserve resources through the `Reservation` API, delete the Pod to be migrated after successfully reserved, and observe the status of the `Reservation` to ensure that the `Reservation` is consumed. + - `PodMigrationJobModeEvictionDirectly` indicates that the user clearly knows the risk of evicting the Pod and decides to evict the Pod directly. + - If `Mode` is not specified, `PodMigrationJobModeReservationFirst` is used by default +- `PodRef` represents the Pod that be migrated. The field is required. +- `ReservationOptions` defines options for how to reserve resource through `Reservation` API: + - `ReservationRef` if is specified, the referenced `Reservation` instance is used first. In some scenarios, such as defragmentation, in order to ensure the reliability of the upper-layer logic, resources may have been reserved on the target node. In this case, the specified `Reservation` can be used directly. + - `Template` describes the spec of `Reservation`. It is often not necessary to set this field. When neither `ReservationRef` nor `Template` is specified, the `PodMigrationJob controller` will construct the `ReservationSpec` reserved resources according to the Spec of the migrated Pod. If `Template` is set, the `ReservationTemplateSpec` and the Spec of the migrated Pod will be merged to construct the `ReservationSpec` reserved resources. + - `PreemptionOptions` decides whether to preempt other Pods if reserved resources failed. The specific details of preemption will be submitted in a separate proposal description in future work, and will not be expanded here for the time being. +- `DeleteOptions` defines the options of delete operation. Whether to delete a Pod through the `K8s Delete API` or evict a Pod through the `K8s Eviction API` depends on how the user configures the parameters of the `PodMigrationJob Controller`. Users only need to set `DeleteOptions` according to the workload in their own cluster. + +#### Migration Job Status + +```go +type PodMigrationJobStatus struct { + // PodMigrationJobPhase represents the phase of a PodMigrationJob is a simple, high-level summary of where the PodMigrationJob is in its lifecycle. + // e.g. 
Pending/Running/Failed + Phase PodMigrationJobPhase `json:"phase,omitempty"` + // Status represents the current status of PodMigrationJob + // e.g. ReservationCreated + Status string `json:"status,omitempty"` + // Reason represents a brief CamelCase message indicating details about why the PodMigrationJob is in this state. + Reason string `json:"reason,omitempty"` + // Message represents a human-readable message indicating details about why the PodMigrationJob is in this state. + Message string `json:"message,omitempty"` + // Conditions records the stats of PodMigrationJob + Conditions []PodMigrationJobCondition `json:"conditions,omitempty"` + // NodeName represents the node's name of migrated Pod + NodeName string `json:"nodeName,omitempty"` + // PodRef represents the newly created Pod after being migrated + PodRef *corev1.ObjectReference `json:"podRef,omitempty"` + // PreemptedPodsRef represents the Pods that be preempted + PreemptedPodsRef []corev1.ObjectReference `json:"preemptedPodsRef,omitempty"` + // PreemptedPodsReservations records information about Reservations created due to preemption + PreemptedPodsReservations []PodMigrationJobPreemptedReservation `json:"preemptedPodsReservation,omitempty"` +} + +type PodMigrationJobPreemptedReservation struct { + // Namespace represents the namespace of Reservation + Namespace string `json:"namespace,omitempty"` + // Name represents the name of Reservation + Name string `json:"name,omitempty"` + // NodeName represents the assigned node for Reservation by scheduler + NodeName string `json:"nodeName,omitempty"` + // Phase represents the Phase of Reservation + Phase string `json:"phase,omitempty"` + // PreemptedPodRef represents the Pod that be preempted + PreemptedPodRef *corev1.ObjectReference `json:"preemptedPodRef,omitempty"` + // PodsRef represents the newly created Pods after being preempted + PodsRef []corev1.ObjectReference `json:"podsRef,omitempty"` +} + +type PodMigrationJobCondition struct { + // Type is the type of the condition. + Type PodMigrationJobConditionType `json:"type"` + // Status is the status of the condition. + // Can be True, False, Unknown. + Status PodMigrationJobConditionStatus `json:"status"` + // Last time we probed the condition. + // +nullable + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + // Last time the condition transitioned from one status to another. + // +nullable + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` + // Unique, one-word, CamelCase reason for the condition's last transition. + Reason string `json:"reason,omitempty"` + // Human-readable message indicating details about last transition. + Message string `json:"message,omitempty"` +} + +type PodMigrationJobPhase string + +const ( + // PodMigrationJobPending represents the initial status + PodMigrationJobPending PodMigrationJobPhase = "Pending" + // PodMigrationJobRunning represents the PodMigrationJob is being processed + PodMigrationJobRunning PodMigrationJobPhase = "Running" + // PodMigrationJobSucceed represents the PodMigrationJob processed successfully + PodMigrationJobSucceed PodMigrationJobPhase = "Succeed" + // PodMigrationJobFailed represents the PodMigrationJob process failed caused by Timeout, Reservation failed, etc. + PodMigrationJobFailed PodMigrationJobPhase = "Failed" + // PodMigrationJobAborted represents the user forcefully aborted the PodMigrationJob. + PodMigrationJobAborted PodMigrationJobPhase = "Aborted" +) + +// These are valid conditions of PodMigrationJob. 
+const ( + PodMigrationJobConditionReservationCreated PodMigrationJobConditionType = "ReservationCreated" + PodMigrationJobConditionReservationScheduled PodMigrationJobConditionType = "ReservationScheduled" + PodMigrationJobConditionPreemption PodMigrationJobConditionType = "Preemption" + PodMigrationJobConditionEviction PodMigrationJobConditionType = "Eviction" + PodMigrationJobConditionPodScheduled PodMigrationJobConditionType = "PodScheduled" + PodMigrationJobConditionReservationPodBoundReservation PodMigrationJobConditionType = "PodBoundReservation" + PodMigrationJobConditionReservationBound PodMigrationJobConditionType = "ReservationBound" +) + +// These are valid reasons of PodMigrationJob. +const ( + PodMigrationJobReasonTimeout = "Timeout" + PodMigrationJobReasonFailedCreateReservation = "FailedCreateReservation" + PodMigrationJobReasonUnschedulable = "Unschedulable" + PodMigrationJobReasonMissingPod = "MissingPod" + PodMigrationJobReasonMissingReservation = "MissingReservation" + PodMigrationJobReasonPreempting = "Preempting" + PodMigrationJobReasonPreemptComplete = "PreemptComplete" + PodMigrationJobReasonEvicting = "Evicting" + PodMigrationJobReasonFailedEvict = "FailedEvict" + PodMigrationJobReasonEvictComplete = "EvictComplete" + PodMigrationJobReasonWaitForPodBindReservation = "WaitForPodBindReservation" +) + +type PodMigrationJobConditionStatus string + +const ( + PodMigrationJobConditionStatusTrue PodMigrationJobConditionStatus = "True" + PodMigrationJobConditionStatusFalse PodMigrationJobConditionStatus = "False" + PodMigrationJobConditionStatusUnknown PodMigrationJobConditionStatus = "Unknown" +) +``` + +### Implementation Details/Notes/Constraints + +#### PodMigrationJob Controller + +The difference between `PodMigrationJobController` and general controller is that `PodMigrationJobController` will evaluate all pending PodMigrationJobs together (ie PodMigrationJob.Phase is Pending) and select a batch of PodMigrationJob and reconcile them. This selection process is called the arbitration mechanism. The reason why the arbitration mechanism is introduced is mainly to control the stability risk and control the cost of migrating Pods. The arbitration mechanism includes three stages: `Group`, `Filter` and `Sort`. + +##### Group PodMigrationJob + +Aggregate according to different workloads to facilitate the processing of subsequent processes + +- Aggregate PodMigrationJob by workload +- Aggregate PodMigrationJob by Node +- Aggregate PodMigrationJob by Namespace + +##### Filter PodMigrationJob + +- Check how many PodMigrationJob of each workload are in the Running state, and record them as ***migratingReplicas***. If the ***migratingReplicas*** reach a certain threshold, they will be excluded. The detailed algorithm of this threshold is described later. +- Check the number of ***unavailableReplicas*** of each workload, and determine whether the ***unavailableReplicas + migratingReplicas*** conform to the corresponding [PDB(Pod Disruption Budget)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) or [PUB(Pod Unavailable Budget)](https://openkruise.io/docs/user-manuals/podunavailablebudget). If there is no PDB or PUB, use the algorithm to calculate dynamically. If not, exclude the corresponding PodMigrationJob. +- Check the number of Pods being migrated on the node where each target Pod is located. If it exceeds the maximum migration amount for a single node, exclude it. +- Check the number of Pods being migrated in the Namespace where each target Pod is located. 
If it exceeds the maximum migration amount for a single Namespace, exclude it + +The detailed algorithm of Workload Max Migrating/Unavailable Replicas: + +```go +func GetMaxMigrating(replicas int, intOrPercent *intstr.IntOrString) (int, error) { + return GetMaxUnavailable(replicas, intOrPercent) +} + +func GetMaxUnavailable(replicas int, intOrPercent *intstr.IntOrString) (int, error) { + if intOrPercent == nil { + if replicas > 10 { + s := intstr.FromString("10%") + intOrPercent = &s + } else if replicas >= 4 && replicas <= 10 { + s := intstr.FromInt(2) + intOrPercent = &s + } else { + s := intstr.FromInt(1) + intOrPercent = &s + } + } + return intstr.GetValueFromIntOrPercent(intOrPercent, replicas, true) +} +``` + +##### Sort PodMigrationJob + +- Pods with higher QoS requirements are given priority, LSE > LSR > LS > BE +- Pods with higher priority will be processed first +- The higher migration priority will be processed first +- If the Pod has already initiated a migration job in the past and it fails, sort by the number of times. The lower the number of times, the priority will be given to processing +- If the workload where the Pod is located has been descheduled for a certain number of times in the past, it is sorted according to the number of times. The lower the number of times, the priority will be processed. +- Sort by the number of replicas being migrated by the workload. The lower the number of replicas being migrated, the priority will be given to processing. + +##### Execute PodMigrationJob + +- Update PodMigrationJobStatus.Phase to Running to trigger the PodMigrationJob controller reconcile these jobs +- PodMigrationJob controller reconciles process: + - If the mode of PodMigrationJob is `EvictionDirectly`, just delete the Pod through the delete method that configured in PodMigrationJob controller. And update the phase of PodMigrationJob to Success. + - If not specified ReservationOptions.ReservationRef, create the Reservation instance by the reservation template or Pod spec to reserve resources. And updates the created Reservation instance to the ReservationOptions.ReservationRef. + - Check the status of Reservation to determine whether reserve resource successfully. + - If failed to reserve, abort the PodMigrationJob and update the phase of PodMigrationJob to Fail + - If successfully reserve, delete the Pod through the delete method that configured in PodMigrationJob controller. + - Check the Reservation status to determine whether the Reservation consumed. + - If Reservation consumed, tracks the status of Reservation and update the status to PodMigrationJob + - Update phase of PodMigrationJob to Success. + +##### Migration Stability mechanism + +- Support for disabling this capability by configuration +- Supports a simple central flow control mechanism to limit the number of migrations over a period of time. + +See the Configuration section for more details + +#### Controller Configuration + +User can configure the `MigrationControllerArgs` through Koordinator Descheduler ConfigMap. + +```go +// MigrationControllerArgs holds arguments used to configure the MigrationController +type MigrationControllerArgs struct { + metav1.TypeMeta + + // DryRun means only execute the entire migration logic except create Reservation or Delete Pod + // Default is false + DryRun bool `json:"dryRun,omitempty"` + + // EvictFailedBarePods allows pods without ownerReferences and in failed phase to be evicted. 
+ EvictFailedBarePods bool `json:"evictFailedBarePods"` + + // EvictLocalStoragePods allows pods using local storage to be evicted. + EvictLocalStoragePods bool `json:"evictLocalStoragePods"` + + // EvictSystemCriticalPods allows eviction of pods of any priority (including Kubernetes system pods) + EvictSystemCriticalPods bool `json:"evictSystemCriticalPods"` + + // IgnorePVCPods prevents pods with PVCs from being evicted. + IgnorePvcPods bool `json:"ignorePvcPods"` + + // LabelSelector sets whether to apply label filtering when evicting. + // Any pod matching the label selector is considered evictable. + LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"` + + // FlowControlQPS controls the number of arbitrations per second + FlowControlQPS string `json:"flowControlQPS,omitempty"` + // FlowControlBurst is the maximum number of tokens + FlowControlBurst int32 `json:"flowControlBurst,omitempty"` + + // MaxMigratingPerNode represents he maximum number of pods that can be migrating during migrate per node. + MaxMigratingPerNode *int32 `json:"maxMigratingPerNode,omitempty"` + + // MaxMigratingPerNamespace represents he maximum number of pods that can be migrating during migrate per namespace. + MaxMigratingPerNamespace *int32 `json:"maxMigratingPerNamespace,omitempty"` + + // MaxMigratingPerWorkload represents he maximum number of pods that can be migrating during migrate per workload. + // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). + MaxMigratingPerWorkload *intstr.IntOrString `json:"maxMigratingPerWorkload,omitempty"` + + // MaxUnavailablePerWorkload represents he maximum number of pods that can be unavailable during migrate per workload. + // The unavailable state includes NotRunning/NotReady/Migrating/Evicting + // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). + MaxUnavailablePerWorkload *intstr.IntOrString `json:"maxUnavailablePerWorkload,omitempty"` + + // EvictionPolicy represents how to delete Pod, support "Delete" and "Eviction", default value is "Eviction" + EvictionPolicy string `json:"evictionPolicy,omitempty"` + // DefaultDeleteOptions defines options when deleting migrated pods and preempted pods through the method specified by EvictionPolicy + DefaultDeleteOptions *metav1.DeleteOptions `json:"defaultDeleteOptions,omitempty"` +} + +``` \ No newline at end of file diff --git a/versioned_docs/version-v1.3/designs/resource-reservation.md b/versioned_docs/version-v1.3/designs/resource-reservation.md new file mode 100644 index 000000000..7fa73c84f --- /dev/null +++ b/versioned_docs/version-v1.3/designs/resource-reservation.md @@ -0,0 +1,245 @@ +# Resource Reservation + +## Summary + +A scheduling mechanism and its API is provided to reserve node resources for pods may not be created yet. + +## Motivation + +Pods are fundamental units for allocating node resources in Kubernetes, which bind resource requirements with business logic. The scheduler is not able to reserve node resources for specific pods or workloads. We may try using a [fake pod](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler) to prepare resources by the preemption mechanism. However, fake pods can be preempted by any scheduled pods with higher priorities, which make resources get scrambled unexpectedly. + +In Koordinator, a resource reservation mechanism is proposed to enhance scheduling and especially benefits scenarios below: + +1. 
Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority. +2. De-scheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted. +3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas. +4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost. + +### Goals + +- Define the basic API of resource reservation for *Motivations<1,2,3>*, extensible for supporting *Motivation<4>* in the future. +- Provide a scheduler plugin that implements above reservation mechanism. + +### Non-Goals/Future Work + +- Detailed design of reservative preemption/descheduler/horizontal scaler/pre-allocation. +- Modify kubelet admission control for reservation objects. + +## Proposal + +### User Stories + +#### Story 1 + +As a Kubernetes developer, I want to enhance the current **preemption** mechanism since preempted resources may be allocated by pods other than the preemptor. The scheduler can create a reservation for the preempting pods, so the ownership of preempted resources can be guaranteed, making the preemption more reliable. + +#### Story 2 + +As a cluster administrator, I want to use **descheduler** to migrate pods that are placed abnormally to somewhere they could "live better" and fulfill orchestration requirements of the app. e.g. Move pods on a over-utilized node to idler nodes and bind CPUs of same NUMA node. Reservations can be created before rescheduling pods, helping ensure there are sufficient resources and well placement. + +#### Story 3 + +As an application administrator, I want to make the **horizontal scaling** of my app more deterministic by submitting reservations before a scale-up. Besides, I can also reserve resources after a scale-down for future demands. It is useful especially when we want a guaranteed scale-up of applications for the coming business peak. + +#### Story 4 + +As a cluster administrator, I want to **pre-allocate** node resources for future usage no matter whether they are available now or not. I want to allocate the future free resources but do not disrupt the running of scheduled pods. Reservation can be made to pre-allocate resources since it makes no physical cost to the node. It may be in a `Waiting` state. When there is enough space for the reservation, it will become `Available` for the owner pods' scheduling. + +### API + +In this section, a Custom Resource Definition (CRD) named `Reservation` is proposed to allow the scheduler to reserve node resources for specific pods. + +![image](/img/resource-reservation.svg) + +```go +// Reservation objects are non-namespaced. +// It can reserve resources for pods of any namespace. Any affinity/anti-affinity of reservation scheduling can be +// specified in the pod template. 
+type Reservation struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + Spec ReservationSpec `json:"spec,omitempty"` + Status ReservationStatus `json:"status,omitempty"` +} + +type ReservationSpec struct { + // Template defines the scheduling requirements (resources, affinities, images, ...) processed by the scheduler just + // like a normal pod. + // If the `template.spec.nodeName` is specified, the scheduler will not choose another node but reserve resources on + // the specified node. + Template *corev1.PodTemplateSpec `json:"template,omitempty"` + // Specify the owners who can allocate the reserved resources. + // Multiple owner selectors and ORed. + Owners []ReservationOwner `json:"owners,omitempty"` + // By default, the resources requirements of reservation (specified in `template.spec`) is filtered by whether the + // node has sufficient free resources (i.e. ReservationRequest < NodeFree). + // When `preAllocation` is set, the scheduler will skip this validation and allow overcommitment. The scheduled + // reservation would be waiting to be available until free resources are sufficient. + // NOTE: Not supported in v0.6. + PreAllocation bool `json:"preAllocation,omitempty"` + // Time-to-Live period for the reservation. + // `expires` and `ttl` are mutually exclusive. If both `ttl` and `expires` are not specified, a very + // long TTL will be picked as default. Set 0 to disable the expiration. + TTL *metav1.Duration `json:"ttl,omitempty"` + // Expired timestamp when the reservation expires. + // `expires` and `ttl` are mutually exclusive. Defaults to being set dynamically at runtime based on the `ttl`. + Expires *metav1.Time `json:"expires,omitempty"` +} + +type ReservationStatus struct { + // The `phase` indicates whether is reservation is waiting for process (`Pending`), available to allocate + // (`Available`) or timeout/expired to get cleanup (Failed). + Phase ReservationPhase `json:"phase,omitempty"` + // The `conditions` indicate the messages of reason why the reservation is still pending. + Conditions []ReservationCondition `json:"conditions,omitempty"` + // Current resource owners which allocated the reservation resources. + CurrentOwners []corev1.ObjectReference `json:"currentOwners,omitempty"` + // Name of node the reservation is scheduled on. + NodeName string `json:"nodeName,omitempty"` + // Resource reserved and allocatable for owners. + Allocatable corev1.ResourceList `json:"allocatable,omitempty"` + // Resource allocated by current owners. + Allocated corev1.ResourceList `json:"allocated,omitempty"` +} + +type ReservationOwner struct { + // Multiple field selectors are ANDed. + Object *corev1.ObjectReference `json:"object,omitempty"` + Controller *ReservationControllerReference `json:"controller,omitempty"` + LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"` +} + +type ReservationControllerReference struct { + // Extend with a `namespace` field for reference different namespaces. + metav1.OwnerReference `json:",inline"` + Namespace string `json:"namespace,omitempty"` +} + +type ReservationPhase string + +const ( + // ReservationPending indicates the Reservation has not been processed by the scheduler or is unschedulable for + // some reasons (e.g. the resource requirements cannot get satisfied). + ReservationPending ReservationPhase = "Pending" + // ReservationAvailable indicates the Reservation is both scheduled and available for allocation. 
+ ReservationAvailable ReservationPhase = "Available" + // ReservationWaiting indicates the Reservation is scheduled, but the resources to reserve are not ready for + // allocation (e.g. in pre-allocation for running pods). + ReservationWaiting ReservationPhase = "Waiting" + // ReservationFailed indicates the Reservation is failed to reserve resources, due to expiration or marked as + // unavailable, which the object is not available to allocate and will get cleaned in the future. + ReservationFailed ReservationPhase = "Failed" +) + +type ReservationCondition struct { + Type ReservationConditionType `json:"type,omitempty"` + Status ConditionStatus `json:"status,omitempty"` + Reason string `json:"reason,omitempty"` + Message string `json:"message,omitempty"` + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` +} + +type ReservationConditionType string + +const ( + ReservationConditionScheduled ReservationConditionType = "Scheduled" + ReservationConditionReady ReservationConditionType = "Ready" +) + +type ConditionStatus string + +const ( + ConditionStatusTrue ConditionStatus = "True" + ConditionStatusFalse ConditionStatus = "False" + ConditionStatusUnknown ConditionStatus = "Unknown" +) + +const ( + ReasonReservationScheduled = "Scheduled" + ReasonReservationUnschedulable = "Unschedulable" + ReasonReservationAvailable = "Available" + ReasonReservationExpired = "Expired" +) +``` + +### Implementation Details + +#### Reservation Plugin + +##### Schedule Reservations + +A `Reservation` object has its scheduling requirements like a pod. Ideally, A `Reservation` object should get processed directly by the scheduler like a pod. However, it can require a series of modifications on [scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/), losing the compatibility with standard kube-scheduler, kubelet, autoscaler, etc. In the reservation plugin, we fake one *reservation pod* for one `Reservation` inside the scheduler to fulfill general scheduling plugins (noderesources, nodeaffinity, tainttolerations, ...). The scheduling framework can handle `Reservation` objects by processing fake pods in both [scheduling cycle and binding cycle](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#scheduling-cycle-binding-cycle). + +A fake pod inside the scheduler can construct the same affinity/anti-affinity constraints as owner pods, which may change the reservation result. To handle this problem, koord-scheduler extends the framework to skip check of pod affinity for existing reservations in the `Filter` phase. + +A reservation specified `PreAllocation` intends to pre-allocate resources on nodes. The scheduler will skip its filtering of node resources in the scheduling cycle. However, the scheduled reservation will be `Waiting` to be `Available` until there are enough resources to fulfill its requests. + +If all nodes are unscheduled for the reservation, the scheduler keeps its status as `Pending` and sets `Conditions` with the failure message. + +Once the scheduling decision has been made, the corresponding `Reservation` object is updated with a new status indicating whether the reservation succeeded or not. The fake pod does not expose to other components, and the kubelet without modification does not perceive a `Reservation` assigned. 
Fortunately, a `Reservation` does not need to be executable on the node, so existing containers can keep running as usual without additional admissions. + +If a reservation has set the `nodeName` (inside the `template` field), the scheduler is responsible for checking if the node can fulfill the reservation since kubelet does not do admissions for the reservation. + +##### Allocate Reserved Resources + +Let's call the reservation is *allocatable* for a pod if: + +1. The reservation is available. +2. The pod matches the reservation owner spec. +3. There are sufficient free resources in the reservation to fulfill the pod. + +When the reservation plugin is enabled, the scheduler checks for every scheduling pod if there are allocatable reservations on a node. With a `Score` plugin implemented, the scheduler prefers pods to schedule on nodes which have more allocatable reserved resources. + +When a pod is scheduled on a node with allocatable reservations, it allocates resources belonging to one of reservations. To pick one of reservations, we choose the one which can get most reserved resources allocated (i.e. MostAllocated). And the scheduler also annotates the pod with the reservation info. + +##### Expiration and Cleanup + +When a reservation has been created for a long time exceeding the `TTL` or `Expires`, the scheduler updates its status as `Expired`. For expired reservations, the scheduler will cleanup them with a custom garbage collection period. + +When a node is deleted, the available and waiting reservations on the node should be marked as `Failed` since they are not allocatable any more. + +#### Use Cases + +To generally reserve node resources, submit a `Reservation` and set the pod template in the field `spec.template`. Then the koord-scheduler will update this `Reservation` with the scheduling result and the resources will get reserved. + +To be more specific, + +- `spec.template` specifies the fundamental resource requirements of a reservation. The scheduler will schedule the fake pod based on the template. +- `spec.owners` specifies which kinds of pods can use the reservation. +- `spec.ttl` and `expires` specifies the expiration for the reservation. +- `spec.preAllocation` indicates whether the scheduler should filter with its resource requirements. Otherwise, the pre-allocation of node resources is allowed, and the reservation will become available until there are sufficient resources. +- `status.phase` is marked as `Pending` when the Reservation is created. And it is marked as `Available` when the Reservation is successfully scheduled. +- `status.conditions` shows why the reservation is unscheduled or failed. +- When a Reservation is `Available` on the node, only specified pods can allocate the reserved resources. + +##### Usage in Preemption + +The [Priority Preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#preemption) happens in the PostFilter phase trying to make preemptive pods schedulable by evicting low-priority pods. When a pod succeeds the preemption, the pod `status` will be patched with a *nominated node* where the scheduler do the eviction. However, the preemptor's nominated node is not always the same as the scheduled node, since the scheduler does not reserve resources for the preemptor. +To ensure the preemptive resources are for the preemptor, firstly the scheduler can create a reservation that both sets `owners` with the preemptor pod and relevant affinity rules for reserving resources of the preempts. 
Then the scheduler evict pods, and the reservation will become `Available` once the resources are released. Finally, the preemptor pods can get scheduled on the nodes with preemptive resource reserved. + +##### Usage in Descheduling + +Before a pod is rescheduled, the descheduler can create a reservation that sets `template` and `owners` for the candidate. When the reservation becomes `Available`, the descheduler can assign the pod to allocate the reserved resources. This solves the problem in which the rescheduled pod has stopped at the old node but cannot run on the new node. Moreover, the descheduler can migrate resources between pods by setting the `preAllocation` field. + +##### Usage in Pre-allocation + +Reservations with `preAllocation` specified allow users to pre-allocate the node resources from running pods. The `status.phase` of the reservation is set as `Waiting` until the resources are released, indicating that its availability is conditional. Once the referenced pods have terminated, the `phase` is `Available` for owners, and the pre-allocation succeeds. + +### Risks and Mitigations + +Kubelet without any modification possibly ignore `Reservation` objects in predicate admission, which increases the chance of unexpected overcommitment at nodes. `Reservation` does not require any physical resources to be executable, so the overcommitment is mainly a problem only when pods get scheduled with `Reservation` and start to run, which is somewhat easier to mitigate since Kubelet do admit these pods. To further descrease the possibility of unexpected overcommitment or pods admit failures, we could use resource estimation for in-flight pods, balance pods to the nodes with less reserved resources, etc. + +## Unsolved Problems + +As stated above, `Reservation` can generate the same pod affinity/anti-affinity rules as the owner pods. The problem gets resolved in the koord-scheduler by extending scheduling framework, but it still limits the standard kube-scheduler. + +## Alternatives + +### Use a `pause` pod with a low priority to reserve resources + +Reserving resources with [`pause` pods with very low assigned priority](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler) does work when the preemption can be precisely enabled for specific pods. In the example of cluster autoscaler, `pause` pods are helpful when we need to overprovision resources to prevent idle nodes from scaling down by CA. However, a `pause` pod has no reservation guarantee except `priority`. As declared above, many scenarios require reservations to rely on other pod characteristics (e.g. names, namespaces, labels, priorityClass), where `pause` pods cannot meet the demands. + +## References + +1. 
[Kueue Pod Resource Reservation](https://docs.google.com/document/d/1sbFUA_9qWtorJkcukNULr12FKX6lMvISiINxAURHNFo) diff --git a/versioned_docs/version-v1.3/designs/runtime-proxy.md b/versioned_docs/version-v1.3/designs/runtime-proxy.md new file mode 100644 index 000000000..47775c107 --- /dev/null +++ b/versioned_docs/version-v1.3/designs/runtime-proxy.md @@ -0,0 +1,153 @@ +# RuntimeProxy + +## Summary + +KoordRuntimeProxy acts as a proxy between kubelet and containerd(dockerd under dockershim scenario), which is designed to +intercept CRI request, and apply some resource management policies, such as setting different cgroup parameters by pod +priorities under hybrid workload orchestration scenario, applying new isolation policies for latest Linux kernel, +CPU architecture, and etc. + +There are two components involved, KoordRuntimeProxy and RuntimePlugins. + +![image](/img/koord-runtime-proxy-architecture.svg) + +## Goals + +- Enhance resource management for QoS based Scheduling. +- Provide interface for new isolation features which are not supported by CRI. + +## Components + +### KoordRuntimeProxy + +KoordRuntimeProxy is in charge of intercepting request during pod's lifecycle, such as RunPodSandbox, CreateContainer etc., +and then calling RuntimePlugins to do resource isolation policies before transferring request to backend containerd(dockerd) +and after transferring response to kubelet. KoordRuntimeProxy provides an isolation-policy-execution framework which allows +customized plugins registered to do specified isolation policies, these plugins are called RuntimePlugins. +KoordRuntimeProxy itself does NOT do any isolation policies. + +### RuntimePlugins + +RuntimePlugins register events(RunPodSandbox etc.) to KoordRuntimeProxy and would receive notifications when events happen. +RuntimePlugins should complete resource isolation policies basing on the notification message, and then response +KoordRuntimeProxy, KoordRuntimeProxy would decide to transfer request to backend containerd or discard request according to +plugins' response. + +If no RuntimePlugins registered, KoordRuntimeProxy would become a transparent proxy between kubelet and containerd. + +## Architecture + +![image](/img/koord-runtime-proxy-design.svg) + +There are 4 main components for KoordRuntimeProxy. + +### CRI Server + +As a proxy between kubelet and containerd, KoordRuntimeProxy acts as a CRI server for kubelet(http server under dockershim +scenario). It should intercept all request from kubelet, and generate protocols for talking with plugins before and +after talking with backend containerd(dockerd) + +### Plugins Manager + +PluginsManager is in charge of parsing registered plugin info from `/etc/runtime/hookserver.d` dynamically. + +### Runtime Dispatcher + +RuntimeDispatcher is designed to manage communications with plugins. + +### Store + +As a proxy, KoordRuntimeProxy had better be designed as stateless, but sometimes it does NOT work. Take StartContainer hook +for example, there exists only containerID in CRI StartContainerRequest, which is not enough for plugins to adapt policies +since plugins may not store pod/container info(such as meta, priority) locally. So KoordRuntimeProxy should store pod/container +info during RunPodSandbox/CreateContainer Stage. When StartContainer request comes, KoordRuntimeProxy can get pod/container info +by containerID, and then call plugins with pod/container info. 
+ +With store, there would be pod/container info everytime KoordRuntimeProxy calls plugins, so there is no need for plugins to +store pod/container info exceptionally, plugins can be designed as stateless. + +Considering performance, store locates in memory and does not generate external io to disk. + +## Runtime Plugins + +### How to Register Plugins +All the plugin config files should be put to `/etc/runtime/hookserver.d` with `.json` suffix. You can register the plugin implemented by koordlet with RuntimeProxy: + +1. touch /etc/runtime/hookserver.d/koordlet.json +2. Copy the following content into /etc/runtime/hookserver.d/koordlet.json +``` +{ + "remote-endpoint": "/var/run/koordlet/koordlet.sock", + "failure-policy": "Ignore", + "runtime-hooks": [ + "PreRunPodSandbox", + "PreCreateContainer", + "PreStartContainer" + ] +} +``` + + +There are 3 fields involved: +- remote-endpoint: endpoint KoordRuntimeProxy talking with plugin, generated by plugin. +- failure-policy: policy when calling plugin fail, Fail or Ignore, default to Ignore. +- runtime-hooks: currently 7 hook points: + 1. PreRunPodSandbox + 2. PreCreateContainer + 3. PreStartContainer + 4. PostStartContainer + 5. PreUpdateContainerResources + 6. PostStopContainer + 7. PostStopPodSandbox + +hook points with prefix 'Pre' means calling plugins before transferring request to contianerd(dockerd). +hook points with prefix 'Post' means calling plugins after receiving response from containerd(dockerd). +plugin provider can set any hook combinations to "runtime-hooks". + +### Protocols between KoordRuntimeProxy and Plugins +[Protocols](https://github.com/koordinator-sh/koordinator/blob/main/apis/runtime/v1alpha1/api.proto) + +### Examples for Runtime Plugins +[koordlet-runtime-plugin-design](https://github.com/koordinator-sh/koordinator/blob/main/docs/design-archive/koordlet-runtime-hooks.md) + +## Installation + +### Installing from sources +get sources: `git clone https://github.com/koordinator-sh/koordinator.git` + +build: `cd koordinator; make build-koord-runtime-proxy` + +### Installing from packages +Download latest released package from: https://github.com/koordinator-sh/koordinator/releases + +### Setup Kubelet +Under containerd scenario, to make koord-runtime-proxy a proxy between kubelet and containerd, kubelet parameters should be altered as shown +below: +``` +kubelet --container-runtime=remote --container-runtime-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +Under docker scenario, to make koord-runtime-proxy a proxy between kubelet and dockerd, kubelet parameters should be altered as shown +below: +``` +kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +### Setup KoordRuntimeProxy +Firstly, please make sure your runtime backend is containerd or dockerd. 
+ +Under containerd scenario, koord-runtime-proxy can be setup with command: +``` +koord-runtime-proxy --remote-runtime-service-endpoint= + --remote-image-service-endpoint= +``` +If containerd listening CRI request on default /var/run/koord-runtimeproxy/runtimeproxy.sock, koord-runtime-proxy can be setup by: +``` +koord-runtime-proxy +``` + +Under docker scenario, koord-runtime-proxy should be setup with the additional parameter `--backend-runtime-mode Docker`, +and without `remote-image-service-endpoint`: +``` +koord-runtime-proxy --backend-runtime-mode=Docker --remote-runtime-service-endpoint= +``` diff --git a/versioned_docs/version-v1.3/installation.md b/versioned_docs/version-v1.3/installation.md new file mode 100644 index 000000000..91c0795a6 --- /dev/null +++ b/versioned_docs/version-v1.3/installation.md @@ -0,0 +1,236 @@ +# Installation + +Koordinator requires **Kubernetes version >= 1.18**. + +Koordinator need collect metrics from kubelet read-only port(default is disabled). +you can get more info form [here](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/). + +For the best experience, koordinator recommands **linux kernel 4.19** or higher. + + +## Install with helms + +Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool and you can get it from [here](https://github.com/helm/helm/releases). + +```bash +# Firstly add koordinator charts repository if you haven't do this. +$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ + +# [Optional] +$ helm repo update + +# Install the latest version. +$ helm install koordinator koordinator-sh/koordinator --version 1.3.0 +``` + +## Upgrade with helm + +```bash +# Firstly add koordinator charts repository if you haven't do this. +$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/ + +# [Optional] +$ helm repo update + +# Upgrade the latest version. +$ helm upgrade koordinator koordinator-sh/koordinator --version 1.3.0 [--force] +``` + +Note that: + +1. Before upgrade, you **must** firstly read the [Change Log](https://github.com/koordinator-sh/koordinator/blob/master/CHANGELOG.md) + to make sure that you have understand the breaking changes in the new version. +2. If you want to drop the chart parameters you configured for the old release or set some new parameters, + it is recommended to add `--reset-values` flag in `helm upgrade` command. + Otherwise you should use `--reuse-values` flag to reuse the last release's values. + +## Optional: download charts manually + +If you have problem with connecting to `https://koordinator-sh.github.io/charts/` in production, you might need to download the chart from [here](https://github.com/koordinator-sh/charts/releases) manually and install or upgrade with it. + +```bash +$ helm install/upgrade koordinator /PATH/TO/CHART +``` + +## Enable NRI Mode Resource Management + +### Prerequisite + +- Containerd >= 1.7.0 and enable NRI. Please make sure NRI is enabled in containerd. If not, please refer to [Enable NRI in Containerd](https://github.com/containerd/containerd/blob/main/docs/NRI.md) +- Koordinator >= 1.3 + +### Configurations + +NRI mode resource management is *Enabled* by default. You can use it without any modification on the koordlet config. You can also disable it to set `enable-nri-runtime-hook=false` in koordlet start args. It doesn't matter if all prerequisites are not meet. You can use all other features as expected. 
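
For reference, the following is a minimal sketch of how NRI mode could be switched off by editing the koordlet DaemonSet start args. Only the fields relevant to the change are shown, the surrounding spec is abbreviated, and the exact flag syntax should be verified against your koordlet version.

```yaml
# Hypothetical, abbreviated excerpt of the koordlet DaemonSet:
# disable NRI mode resource management via the start args mentioned above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: koordlet
  namespace: koordinator-system
spec:
  template:
    spec:
      containers:
      - name: koordlet
        command:
        - /koordlet
        args:
        - -CgroupRootDir=/host-cgroup/
        - -enable-nri-runtime-hook=false  # turn off NRI mode; keep the other args unchanged
```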
+ +## Install koord-runtime-proxy + +koord-runtime-proxy acts as a proxy between kubelet and containerd(dockerd under dockershim scenario), which is designed to intercept CRI request, and apply some resource management policies, such as setting different cgroup parameters by pod priorities under hybrid workload orchestration scenario, applying new isolation policies for latest Linux kernel, CPU architecture, and etc. +For pods that do not want hook servers processing (such as addon pods), you can skip them by adding `runtimeproxy.koordinator.sh/skip-hookserver=true` to the pod label. + +### 1、Get binary + +Download from github releases: +```bash +$ # select the version +$ wget https://github.com/koordinator-sh/koordinator/releases/download/v1.3.0/koord-runtime-proxy_1.3.0_linux_x86_64 -O koord-runtime-proxy +$ chmod +x koord-runtime-proxy +``` + +Or you can build from source: +```bash +$ git clone https://github.com/koordinator-sh/koordinator.git +$ cd koordinator +$ make build-koord-runtime-proxy +``` + +### 2、Setup koord-runtime-proxy + +Firstly, please make sure your runtime backend is containerd or dockerd. + +Under containerd scenario, if your containerd listening CRI request on default `/var/run/containerd/containerd.sock`, koord-runtime-proxy can be setup by(no need to set any parameters): + +``` +koord-runtime-proxy +``` + +Or koord-runtime-proxy can be setup with command: + +``` +koord-runtime-proxy \ + --remote-runtime-service-endpoint= \ + --remote-image-service-endpoint= +``` + +Under docker scenario, koord-runtime-proxy should be setup with the additional parameter `--backend-runtime-mode Docker`, and without `remote-image-service-endpoint`: + +``` +koord-runtime-proxy \ + --backend-runtime-mode=Docker \ + --remote-runtime-service-endpoint= +``` + +koord-runtime-proxy will listen on `/var/run/koord-runtimeproxy/runtimeproxy.sock`. + +### 3、Setup Kubelet + +To make koord-runtime-proxy a proxy between kubelet and containerd, kubelet parameters should be altered as shown below: + +``` +kubelet \ + --container-runtime=remote \ + --container-runtime-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + +Under docker scenario, to make koord-runtime-proxy a proxy between kubelet and dockerd, kubelet parameters should be altered as shown below: + +``` +kubelet --docker-endpoint=unix:///var/run/koord-runtimeproxy/runtimeproxy.sock +``` + + +## Options + +Note that installing this chart directly means it will use the default template values for Koordinator. + +You may have to set your specific configurations if it is deployed into a production cluster, or you want to configure feature-gates. + +### Optional: chart parameters + +The following table lists the configurable parameters of the chart and their default values. 
+ +| Parameter | Description | Default | +| ----------------------------------------- | ---------------------------------------------------------------- |---------------------------------| +| `featureGates` | Feature gates for Koordinator, empty string means all by default | ` ` | +| `installation.namespace` | namespace for Koordinator installation | `koordinator-system` | +| `installation.createNamespace` | Whether to create the installation.namespace | `true` | +| `imageRepositoryHost` | Image repository host | `ghcr.io` | +| `manager.log.level` | Log level that koord-manager printed | `4` | +| `manager.replicas` | Replicas of koord-manager deployment | `2` | +| `manager.image.repository` | Repository for koord-manager image | `koordinatorsh/koord-manager` | +| `manager.image.tag` | Tag for koord-manager image | `v1.3.0` | +| `manager.resources.limits.cpu` | CPU resource limit of koord-manager container | `1000m` | +| `manager.resources.limits.memory` | Memory resource limit of koord-manager container | `1Gi` | +| `manager.resources.requests.cpu` | CPU resource request of koord-manager container | `500m` | +| `manager.resources.requests.memory` | Memory resource request of koord-manager container | `256Mi` | +| `manager.metrics.port` | Port of metrics served | `8080` | +| `manager.webhook.port` | Port of webhook served | `9443` | +| `manager.nodeAffinity` | Node affinity policy for koord-manager pod | `{}` | +| `manager.nodeSelector` | Node labels for koord-manager pod | `{}` | +| `manager.tolerations` | Tolerations for koord-manager pod | `[]` | +| `manager.resyncPeriod` | Resync period of informer koord-manager, defaults no resync | `0` | +| `manager.hostNetwork` | Whether koord-manager pod should run with hostnetwork | `false` | +| `scheduler.log.level` | Log level that koord-scheduler printed | `4` | +| `scheduler.replicas` | Replicas of koord-scheduler deployment | `2` | +| `scheduler.image.repository` | Repository for koord-scheduler image | `koordinatorsh/koord-scheduler` | +| `scheduler.image.tag` | Tag for koord-scheduler image | `v1.3.0` | +| `scheduler.resources.limits.cpu` | CPU resource limit of koord-scheduler container | `1000m` | +| `scheduler.resources.limits.memory` | Memory resource limit of koord-scheduler container | `1Gi` | +| `scheduler.resources.requests.cpu` | CPU resource request of koord-scheduler container | `500m` | +| `scheduler.resources.requests.memory` | Memory resource request of koord-scheduler container | `256Mi` | +| `scheduler.port` | Port of metrics served | `10251` | +| `scheduler.nodeAffinity` | Node affinity policy for koord-scheduler pod | `{}` | +| `scheduler.nodeSelector` | Node labels for koord-scheduler pod | `{}` | +| `scheduler.tolerations` | Tolerations for koord-scheduler pod | `[]` | +| `scheduler.hostNetwork` | Whether koord-scheduler pod should run with hostnetwork | `false` | +| `koordlet.log.level` | Log level that koordlet printed | `4` | +| `koordlet.image.repository` | Repository for koordlet image | `koordinatorsh/koordlet` | +| `koordlet.image.tag` | Tag for koordlet image | `v1.3.0` | +| `koordlet.resources.limits.cpu` | CPU resource limit of koordlet container | `500m` | +| `koordlet.resources.limits.memory` | Memory resource limit of koordlet container | `256Mi` | +| `koordlet.resources.requests.cpu` | CPU resource request of koordlet container | `0` | +| `koordlet.resources.requests.memory` | Memory resource request of koordlet container | `0` | +| `koordlet.enableServiceMonitor` | Whether to enable ServiceMonitor for 
koordlet | `false` | +| `webhookConfiguration.failurePolicy.pods` | The failurePolicy for pods in mutating webhook configuration | `Ignore` | +| `webhookConfiguration.timeoutSeconds` | The timeoutSeconds for all webhook configuration | `30` | +| `crds.managed` | Koordinator will not install CRDs with chart if this is false | `true` | +| `imagePullSecrets` | The list of image pull secrets for koordinator image | `false` | + +Specify each parameter using the `--set key=value[,key=value]` argument to `helm install` or `helm upgrade`. + +### Optional: feature-gate + +Feature-gate controls some influential features in Koordinator: + +| Name | Description | Default | Effect (if closed) | +| ------------------------- | ---------------------------------------------------------------- | ------- | -------------------------------------- | +| `PodMutatingWebhook` | Whether to open a mutating webhook for Pod **create** | `true` | Don't inject koordinator.sh/qosClass, koordinator.sh/priority and don't replace koordinator extend resources ad so on | +| `PodValidatingWebhook` | Whether to open a validating webhook for Pod **create/update** | `true` | It is possible to create some Pods that do not conform to the Koordinator specification, causing some unpredictable problems | + + +If you want to configure the feature-gate, just set the parameter when install or upgrade. Such as: + +```bash +$ helm install koordinator https://... --set featureGates="PodMutatingWebhook=true\,PodValidatingWebhook=true" +``` + +If you want to enable all feature-gates, set the parameter as `featureGates=AllAlpha=true`. + +### Optional: the local image for China + +If you are in China and have problem to pull image from official DockerHub, you can use the registry hosted on Alibaba Cloud: + +```bash +$ helm install koordinator https://... --set imageRepositoryHost=registry.cn-beijing.aliyuncs.com +``` + +## Best Practices + +### Installation parameters for AWS EKS + +When using a custom CNI (such as Weave or Calico) on EKS, the webhook cannot be reached by default. This happens because the control plane cannot be configured to run on a custom CNI on EKS, so the CNIs differ between control plane and worker nodes. + +To address this, the webhook can be run in the host network so it can be reached, by setting `--set manager.hostNetwork=true` when use helm install or upgrade. + +## Uninstall + +Note that this will lead to all resources created by Koordinator, including webhook configurations, services, namespace, CRDs and CR instances managed by Koordinator controller, to be deleted! + +Please do this ONLY when you fully understand the consequence. + +To uninstall koordinator if it is installed with helm charts: + +```bash +$ helm uninstall koordinator +release "koordinator" uninstalled +``` diff --git a/versioned_docs/version-v1.3/introduction.md b/versioned_docs/version-v1.3/introduction.md new file mode 100644 index 000000000..e0625ab55 --- /dev/null +++ b/versioned_docs/version-v1.3/introduction.md @@ -0,0 +1,48 @@ +--- +title: Introduction +slug: / +--- + +# Introduction + +Welcome to Koordinator! + +## Overview + +Koordinator is a QoS based scheduling system for hybrid workloads orchestration on Kubernetes. It aims to improve the runtime efficiency and reliability of both latency sensitive workloads and batch jobs, simplify the complexity of resource-related configuration tuning, and increase pod deployment density to improve resource utilizations. 
+ + +## Key Features + +Koordinator enhances the kubernetes user experiences in the workload management by providing the following: + +- Well-designed priority and QoS mechanism to co-locate different types of workloads in a cluster and run different types of pods on a single node. +- Allowing for resource overcommitments to achieve high resource utilizations but still satisfying the QoS guarantees by leveraging an application profiling mechanism. +- Fine-grained resource orchestration and isolation mechanism to improve the efficiency of latency-sensitive workloads and batch jobs. +- Flexible job scheduling mechanism to support workloads in specific areas, e.g., big data, AI, audio and video. +- A set of tools for monitoring, troubleshooting and operations. + + +## Koordinator vs. Other Concept + +### Koordinator QoS vs Kubernetes QoS + +Kubernetes provides three types of QoS: Guaranteed/Burstable/BestEffort, of which Guaranteed/Burstable is widely used and BestEffort is rarely used. Koordinator is compatible with Kubernetes QoS and has numerous enhancements on each type. In order to avoid interfering with the native QoS semantics, Koordinator introduces an independent field ```koordinator.sh/qosClass``` to describe the co-location QoS. This QoS describes the service quality of the Pod running on the node in the co-location scenario. It is the most critical semantics of the mixed system. + +Koordinator is compatible with Kubernetes QoS and has numerous enhancements on each type. + +### Koordinator scheduler vs kube-scheduler + +Koordinator scheduler is **not** designed to replace kube-scheduler, but to make co-located workloads run **better** on kubernetes. + +Koordinator scheduler is developed based on schedule-framework, adding scheduling plugins related to co-location and priority preemption on top of native scheduling capabilities. Koordinator will be committed to promoting related enhancements into the upstream community of kubernetes and promoting the standardization of co-location technology. + + +## What's Next + +Here are some recommended next steps: + +- Start to [install Koordinator](./installation). +- Learn Koordinator's [Overview](architecture/overview). + + diff --git a/versioned_docs/version-v1.3/user-manuals/colocation-profile.md b/versioned_docs/version-v1.3/user-manuals/colocation-profile.md new file mode 100644 index 000000000..3d3c3c540 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/colocation-profile.md @@ -0,0 +1,137 @@ +--- +sidebar_position: 1 +--- + +# Colocation Profile + +## Motivation + +If the workloads in the existing cluster want to be co-located through Koordinator, you need to modify the existing Controller/Operator to support protocols such as the QoS class, priority, and resource model defined by Koordinator. +In order to avoid repeated construction and make it easier for everyone to obtain the benefits of co-location technology, Koordinator defines `ClusterColocationProfile` CRD, and implements webhook modify and verify newly created Pods, inject the fields described in `ClusterColocationProfile`. + + +## Architecture + +![image](/img/clustercolocationprofile-arch.png) + +## feature-gates + +ClusterColocationProfile mutating/validating feature is turned on by default, if you want to turn it off set feature-gates: + +```bash +$ helm install koordinator https://... 
--set featureGates="PodMutatingWebhook=false\,PodValidatingWebhook=false" +``` + + +## Spec definition + +If you are not familiar with Kubernetes resources please refer to the page [Understanding Kubernetes Objects](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/). + +- **namespaceSelector**: decides whether to mutate/validate Pods if the namespace matches the selector. Default to the empty LabelSelector, which will match everything. + +- **selector**: decides whether to mutate/validate Pods if the Pod matches the selector. Default to the empty LabelSelector, which will match everything. + +- **qosClass** (*required*): describes the type of Koordinator QoS that the Pod is running. The value will be injected into Pod as label koordinator.sh/qosClass. Options are `LSE`, `LSR`, `LS`, `BE`, and `SYSTEM`. For more information, please check [here](../architecture/qos). + +- **priorityClassName** (*required*): the priorityClassName and the priority value defined in PriorityClass will be injected into the Pod. Options are `koord-prod`, `koord-mid`, `koord-batch`, and `koord-free`. For more information, please check [here](../architecture/priority). + +- **koordinatorPriority**: defines the Pod sub-priority in Koordinator. The priority value will be injected into Pod as label koordinator.sh/priority. Various Koordinator components determine the priority of the Pod in the Koordinator through KoordinatorPriority and the priority value in PriorityClassName. Higher the value, higher the priority. + +- **labels**: describes the k/v pair that needs to inject into `Pod.Labels`. + +- **annotations**: describes the k/v pair that needs to inject into `Pod.Annotations`. + +- **schedulerName**: if specified, the pod will be dispatched by specified scheduler. + +- **patch**: indicates Pod Template patching that user would like to inject into the Pod. + + +## Example + +### Create ClusterColocationProfile + +The `profile.yaml` file below describes to modify Pod in Namepspace with label `koordinator.sh/enable-colocation=true` and inject Koordinator QoS, Koordinator Priority etc. + +```yaml +apiVersion: config.koordinator.sh/v1alpha1 +kind: ClusterColocationProfile +metadata: + name: colocation-profile-example +spec: + namespaceSelector: + matchLabels: + koordinator.sh/enable-colocation: "true" + selector: + matchLabels: + koordinator.sh/enable-colocation: "true" + qosClass: BE + priorityClassName: koord-batch + koordinatorPriority: 1000 + schedulerName: koord-scheduler + labels: + koordinator.sh/mutated: "true" + annotations: + koordinator.sh/intercepted: "true" + patch: + spec: + terminationGracePeriodSeconds: 30 +``` + +Create a ClusterColocationProfile based on the YAML file: + +```bash +$ kubectl apply -f profile.yaml +``` + +### Verify ClusterColocationProfile works + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + koordinator.sh/enable-colocation: "true" + name: test-pod +spec: + containers: + - name: app + image: nginx:1.15.1 + resources: + limits: + cpu: "1" + memory: "3456Mi" + requests: + cpu: "1" + memory: "3456Mi" +``` + +Create this pod and now you will find it's injected with Koordinator QoS, Koordinator Priority etc. + +```bash +$ kubectl get pod test-pod -o yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + koordinator.sh/intercepted: true + labels: + koordinator.sh/qosClass: BE + koordinator.sh/priority: 1000 + koordinator.sh/mutated: true + ... 
+spec: + terminationGracePeriodSeconds: 30 + priority: 5000 + priorityClassName: koord-batch + schedulerName: koord-scheduler + containers: + - name: app + image: nginx:1.15.1 + resources: + limits: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi + requests: + kubernetes.io/batch-cpu: "1000" + kubernetes.io/batch-memory: 3456Mi +``` diff --git a/versioned_docs/version-v1.3/user-manuals/cpu-burst.md b/versioned_docs/version-v1.3/user-manuals/cpu-burst.md new file mode 100644 index 000000000..315ab8661 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/cpu-burst.md @@ -0,0 +1,197 @@ +# CPU Burst + +## Introduction + +CPU Burst is a service level objective (SLO)-aware resource scheduling feature provided by Koordinator. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. The koordlet component automatically detects CPU throttling events and automatically adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications. + +### How CPU Burst works + +Kubernetes allows you to specify CPU limits, which can be reused based on time-sharing. If you specify a CPU limit for a container, the OS limits the amount of CPU resources that can be used by the container within a specific time period. For example, you set the CPU limit of a container to 2. The OS kernel limits the CPU time slices that the container can use to 200 milliseconds within each 100-millisecond period. + +CPU utilization is a key metric that is used to evaluate the performance of a container. In most cases, the CPU limit is specified based on CPU utilization. CPU utilization on a per-millisecond basis shows more spikes than on a per-second basis. If the CPU utilization of a container reaches the limit within a 100-millisecond period, CPU throttling is enforced by the OS kernel and threads in the container are suspended for the rest of the time period, as shown in the following figure. + +![image](/img/cpu-throttles.png) + +The following figure shows the thread allocation of a web application container that runs on a node with four vCPUs. The CPU limit of the container is set to 2. The overall CPU utilization within the last second is low. However, Thread 2 cannot be resumed until the third 100-millisecond period starts because CPU throttling is enforced somewhere in the second 100-millisecond period. This increases the response time (RT) and causes long-tail latency problems in containers. + +![image](/img/cpu-throttles-1.png) + +Upstream Linux kernel >=5.14 and Anolis OS both provide [Burstable CFS Controller](https://github.com/torvalds/linux/commit/f4183717b370ad28dd0c0d74760142b20e6e7931#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73e), namely *CPU Burst* feature. It allows a container to accumulate CPU time slices when the container is idle. The container can use the accumulated CPU time slices to burst above the CPU limit when CPU utilization spikes. This improves performance and reduces the RT of the container. + +![image](/img/cpu-throttles-2.png) + +For kernel versions that do not support CPU Burst, koordlet detects CPU throttling events and dynamically adjusts the CPU limit to achieve the same effect as CPU Burst. 
+ +For more information about CPU Burst, see the presentation at KubeCon 2021: [CPU Burst: Getting Rid of Unnecessary Throttling, Achieving High CPU Utilization and Application Performance at the Same Time](https://kccncosschn21.sched.com/event/pcdF?spm=a2c63.p38356.0.0.2ec3731dhQbCIe&iframe=no). + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled CPU Burst feature (`-feature-gates=AllAlpha=true`). If not, please enable it manually by updating the feature gate in the koordlet daemonset. + +NOTE: CPU Burst is not available for `LSR` and `BE` pods since it targets on burstable cpu usages. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CPUBurst=true # enable CPU Burst feature + ... +``` + +## Use CPU Burst + +### Use an annotation to enable CPU Burst for the pod + +Add the following annotation to the pod configuration to enable CPU Burst: + +```yaml +apiVersion: apps/v1 +kind: Pod +metadata: + name: demo-pod-xxx + annotations: + # Set the value to auto to enable CPU Burst for the pod. + koordinator.sh/cpuBurst: '{"policy": "auto"}' + # To disable CPU Burst for the pod, set the value to none. + #koordinator.sh/cpuBurst: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable CPU Burst for all pods in a cluster + +Modify the slo-controller-config ConfigMap based on the following content to enable CPU Burst for all pods in a cluster: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + cpu-burst-config: '{"clusterStrategy": {"policy": "auto"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "cpuBurstOnly"}}' + #cpu-burst-config: '{"clusterStrategy": {"policy": "none"}}' +``` + +### (Optional) Advanced Settings + +The following code block shows the pod annotations and ConfigMap fields that you can use for advanced configurations: + +```yaml +# Example of the slo-controller-config ConfigMap. +data: + cpu-burst-config: | + { + "clusterStrategy": { + "policy": "auto", + "cpuBurstPercent": 1000, + "cfsQuotaBurstPercent": 300, + "sharePoolThresholdPercent": 50, + "cfsQuotaBurstPeriodSeconds": -1 + } + } + + # Example of pod annotations. + koordinator.sh/cpuBurst: '{"policy": "auto", "cpuBurstPercent": 1000, "cfsQuotaBurstPercent": 300, "cfsQuotaBurstPeriodSeconds": -1}' +``` + +The following table describes the ConfigMap fields that you can use for advanced configurations of CPU Burst. + +| Field | Data type | Description | +| ---------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| policy | string |
  • none: disables CPU Burst. If you set the value to none, the related fields are reset to their original values. This is the default value.<br>• cpuBurstOnly: enables CPU Burst only for the Anolis OS kernel or upstream Linux kernel >= 5.14.<br>• cfsQuotaBurstOnly: enables automatic adjustment of CFS quotas for general kernel versions.<br>• auto: enables CPU Burst and all the related features.
| +| cpuBurstPercent | int | Default value:`1000`. Unit: %. This field specifies the percentage to which the CPU limit can be increased by CPU Burst. If the CPU limit is set to `1`, CPU Burst can increase the limit to 10 by default. | +| cfsQuotaBurstPercent | int | Default value:`300`. Unit: %. This field specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. By default, the value of cfs_quota can be increased to at most three times. | +| cfsQuotaBurstPeriodSeconds | int | Default value:`-1`. Unit: seconds. This indicates that the time period in which the container can run with an increased CFS quota is unlimited. This field specifies the time period in which the container can run with an increased CFS quota, which cannot exceed the upper limit specified by `cfsQuotaBurstPercent`. | +| sharePoolThresholdPercent | int | Default value:`50`. Unit: %. This field specifies the CPU utilization threshold of the node. If the CPU utilization of the node exceeds the threshold, the value of cfs_quota in cgroup parameters is reset to the original value. | + +### Verify CPU Burst + +1. Use the following YAML template to create an apache-demo.yaml file. + +> To enable CPU Burst for a pod, specify an annotation in the annotations parameter of the metadata section of the pod configuration. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: apache-demo + annotations: + koordinator.sh/cpuBurst: '{"policy": "auto"}' # Use this annotation to enable or disable CPU Burst. +spec: + containers: + - command: + - httpd + - -D + - FOREGROUND + image: koordinatorsh/apache-2-4-51-for-slo-test:v0.1 + imagePullPolicy: Always + name: apache + resources: + limits: + cpu: "4" + memory: 10Gi + requests: + cpu: "4" + memory: 10Gi + nodeName: # $nodeName Set the value to the name of the node that you use. + hostNetwork: False + restartPolicy: Never + schedulerName: default-scheduler +``` + +2. Run the following command to create an application by using Apache HTTP Server. + +```bash +kubectl apply -f apache-demo.yaml +``` + +3. Use the wrk2 tool to perform stress tests. + +```bash +# Download, decompress, and then install the wrk2 package. +# The Gzip module is enabled in the configuration of the Apache application. The Gzip module is used to simulate the logic of processing requests on the server. +# Run the following command to send requests. Replace the IP address in the command with the IP address of the application. +./wrk -H "Accept-Encoding: deflate, gzip" -t 2 -c 12 -d 120 --latency --timeout 2s -R 24 http://$target_ip_address:8010/static/file.1m.test +``` + +4. Check the results of CPU Burst enabled and disabled. + +e.g. We may have the following results: + +| CentOS 7 | Disabled | Enabled | +| ----------------------------- | ----------- | ------------------- | +| apache RT-p99 | 111.69 ms | 71.30 ms (-36.2%) | +| CPU Throttled Ratio | 33% | 0% | +| Average pod CPU utilization | 32.5% | 33.8% | + +The preceding metrics indicate the following information: + +- After CPU Burst is enabled, the P99 latency of apache is greatly reduced. +- After CPU Burst is enabled, CPU throttling is stopped and the average pod CPU utilization remains approximately at the same value. 
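+
+Besides the benchmark metrics above, you can also confirm on the node that throttling has stopped by reading the CFS statistics from the pod's cgroup. The following is a minimal sketch assuming a cgroup v1 layout with the systemd cgroup driver; the pod UID in the path is a placeholder and must be replaced with the UID of your apache-demo pod:
+
+```bash
+# Read the CFS bandwidth statistics of the apache-demo pod (Guaranteed QoS, so it sits directly under kubepods.slice).
+cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod<pod-uid>.slice/cpu.stat
+# Expected tendency while the wrk2 stress test is running with CPU Burst enabled:
+#   nr_periods      keeps increasing
+#   nr_throttled    stays close to 0
+#   throttled_time  stays close to 0
+```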
diff --git a/versioned_docs/version-v1.3/user-manuals/cpu-evict.md b/versioned_docs/version-v1.3/user-manuals/cpu-evict.md new file mode 100644 index 000000000..992bdd3a2 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/cpu-evict.md @@ -0,0 +1,142 @@ +# Eviction Strategy base on CPU Satisfaction + +## Introduction +Koordinator supports [CPU Suppress](/docs/user-manuals/cpu-suppress) strategy, which is used for limiting the available +CPU Cores of low-priority Pods (BE) according to the usage of high-priority Pods (LS) under during co-location. When the +resource usage of LS Pods increases, `Koordlet` will reduce the CPU cores that can be used by BE Pods. However, when the +LS Pod utilization increases suddenly, large number of BE Pods could be suppressed on small number of CPUs, resulting in +the low resource satisfaction of BE pods, moreover, there might be some additional competition on kernel resources. + +In fact, most BE pods are batch computing type, which have well failover abilities, and the eviction is acceptable for +them since they can retry with better resource quality on other nodes. `Koordlet` provides an eviction strategy based +on CPU resource satisfaction. When the utilization and resource satisfaction exceed the threshold at the same time, +pods with lower priority and higher CPU utilization will be evicted first until the CPU satisfaction has returned +above the threshold. + +![image](/img/cpu-evict.svg) + +### Prerequisite +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). +Batch resource overcommitment and cpu suppress strategy must be enabled first, see this [manual](/docs/user-manuals/cpu-suppress) +for more details. + +| Component | Version Requirement | +| --- | ------- | +| Kubernetes | ≥v1.18 | +| koordinator | ≥v0.3.0 | + +The eviction strategy is provided by `Koordlet`, which is disabled by default in feature-gate. +Please make sure the `BECPUEvict=true` field has been added in the `-feature-gates` arguments of `Koordlet` +as the [example](https://github.com/koordinator-sh/charts/blob/main/versions/v1.2.0/templates/koordlet.yaml#L36)。 + +## Use Eviction Strategy base on CPU Satisfaction + +1. Create a configmap.yaml file based on the following ConfigMap content: + ```yaml + #ConfigMap slo-controller-config example。 + apiVersion: v1 + kind: ConfigMap + metadata: + name: slo-controller-config # name should be set as the configuration of koord-manager, e.g. ack-slo-config + namespace: koordinator-system # namespace should be set as the configuration of installation, e.g. kube-system + data: + # enable the eviction strategy base on CPU satisfaction + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuEvictBESatisfactionLowerPercent": 60, + "cpuEvictBESatisfactionUpperPercent": 80, + "cpuEvictBEUsageThresholdPercent": 90, + "CPUEvictTimeWindowSeconds": 60 + } + } + ``` + + | Configuration item | Parameter | Valid values | Description | + | :-------------- | :------ | :-------- | :----------------------------------------------------------- | + | `enable` | Boolean | true; false | true:enable the eviction.; false(default): disable the eviction. | + | `cpuEvictBESatisfactionLowerPercent` | Int | 0~60 | eviction threshold percent of BE CPU satisfaction. BE pods will be evicted if the satisfaction less than the threshold. 
| + | `cpuEvictBESatisfactionUpperPercent` | Int | cpuEvictBESatisfactionLowerPercent~100 | threshold percent of BE CPU satisfaction. eviction will be stopped if the satisfaction greater than the threshold. | + | `cpuEvictBEUsageThresholdPercent` | Int | 0~100 | threshold of BE CPU utilization. Pods will be evicted if the BE utilization under the suppressed CPU greater than the threshold. default value is 90. | + | `cpuEvictTimeWindowSeconds` | Int | >=2 | time window by seconds during calculating the CPU satisfaction and BE CPU utilization. | + +2. Check whether a ConfigMap named `slo-controller-config` exists in the `koordinator-system` namespace. + + - If a ConfigMap named `slo-controller-config` exists, we commend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap. + + ```bash + kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" + ``` + + - If no ConfigMap named `slo-controller-config` exists, run the kubectl patch command to create a ConfigMap named ack-slo-config: + + ```bash + kubectl apply -f configmap.yaml + ``` + +3. Create a file named be-pod-demo.yaml based on the following YAML content: + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: be-pod-demo + labels: + koordinator.sh/qosClass: 'BE' # set Pod QoS as BE + spec: + containers: + - args: + - '-c' + - '4' + - '--vm' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + kubernetes.io/batch-cpu: 4k + kubernetes.io/batch-memory: 4Gi + requests: + kubernetes.io/batch-cpu: 4k + kubernetes.io/batch-memory: 4Gi + restartPolicy: Always + schedulerName: default-scheduler + ``` + +4. Run the following command to deploy the be-pod-demo pod in the cluster: + + ```bash + kubectl apply -f be-pod-demo.yaml + ``` + +5. Run the following command to check the be-pod-demo pod in Running state: + + ```bash + $ kubectl get pod be-pod-demo + NAME READY STATUS RESTARTS AGE + be-pod-demo 1/1 Running 0 7s + +6. Run the following command through [stress tool](https://linux.die.net/man/1/stress) +make sure the memory usage of node is above the threshold config, and the argument `--cpu` +means the process will consume 10 cores, this should be adjusted according to the node capacity. + + ```bash + $ stress --cpu 1 --vm 1 --vm-bytes 10G --vm-keep + ``` + +7. Check the running state of be-pod-demo, then you can find the be-pod-demo pod is not exist, +and the eviction information can be found in events. + + ```bash + $ kubectl get pod be-pod-demo + Error from server (NotFound): pods "be-pod-demo" not found + + $ kubectl get event + LAST SEEN TYPE REASON OBJECT MESSAGE + 44s Normal Killing pod/be-pod-demo Stopping container stress + 44s Warning evictPodSuccess ${your-pod-object} evict Pod:be-pod-demo, reason: EvictPodByBECPUSatisfaction, message: killAndEvictBEPodsRelease for node(${your-node-id}), need realase CPU : 1200 + ``` diff --git a/versioned_docs/version-v1.3/user-manuals/cpu-qos.md b/versioned_docs/version-v1.3/user-manuals/cpu-qos.md new file mode 100644 index 000000000..b1d1a589e --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/cpu-qos.md @@ -0,0 +1,182 @@ +# CPU QoS + +## Introduction + +Kubernetes allows you to deploy various types of containerized applications on the same node. This causes applications with different priorities to compete for CPU resources. As a result, the performance of the applications with high priorities cannot be guaranteed. 
Koordinator allows you to use quality of service (QoS) classes to guarantee CPU resources for applications with high priorities. This topic describes how to configure the CPU QoS feature for pods. + +## Background + +To fully utilize computing resources, workloads of different priorities are usually deployed on the same node. For example, latency-sensitive (LS) workloads (with high priorities) and best-effort (BE) workloads (with low priorities) can be deployed on the same node. However, this may cause these workloads to compete for computing resources. In Kubernetes, CPU requests and CPU limits are used to control the amount of CPU resources that pods can use. However, pods may still compete for CPU resources. For example, BE pods and LS pods can share CPU cores or vCPU cores. When the loads of the BE pods increase, the performance of the LS pods is compromised. As a result, the response latency of the application that uses the LS pods increases. + +To reduce the performance impact on the BE pods in this scenario, you can use the CPU QoS feature provided by Koordinator to limit the CPU usage of the LS pods. The CPU QoS feature is based on Alibaba Cloud Linux 2 and Anolis OS. Koordinator allows you to use the group identity feature available in Alibaba Cloud Linux 2 to configure Linux scheduling priorities for pods. In an environment where both LS pods and BE pods are deployed, you can set the priority of LS pods to high and the priority of BE pods to low to avoid resource contention. The LS pods are prioritized to use the limited CPU resources to ensure the service quality of the corresponding application. For more information, see [Group identity feature](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/group-identity-feature). + +You can gain the following benefits after you enable the CPU QoS feature: + +- The wake-up latency of tasks for LS workloads is minimized. +- Waking up tasks for BE workloads does not adversely impact the performance of LS pods. +- Tasks for BE workloads cannot use the simultaneous multithreading (SMT) scheduler to share CPU cores. This further reduces the impact on the performance of LS pods. + +## Setup + +### Prerequisites + +- Kubernetes >= 1.18 +- Koordinator >= 0.4 +- Operating System: + - Alibaba Cloud Linux 2(For more information, see [Group identity feature](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/group-identity-feature)) + - Anolis OS >= 8.6 + - CentOS 7.9 (Need to install the CPU Co-location scheduler plug-in from OpenAnolis community, see [best practice](../best-practices/anolis_plugsched.md)). + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](https://koordinator.sh/docs/installation). + +## Use CPU QoS + +1. Create a configmap.yaml file based on the following ConfigMap content: + + ```yaml + # Example of the slo-controller-config ConfigMap. + apiVersion: v1 + kind: ConfigMap + metadata: + name: slo-controller-config + namespace: koordinator-system + data: + # Enable the CPU QoS feature. + resource-qos-config: | + { + "clusterStrategy": { + "lsClass": { + "cpuQOS": { + "enable": true, + "groupIdentity": 2 + } + }, + "beClass": { + "cpuQOS": { + "enable": true, + "groupIdentity": -1 + } + } + } + } + ``` + + Specify `lsClass` and `beClass` to assign the LS and BE classes to different pods. `cpuQOS` includes the CPU QoS parameters. The following table describes the parameters. 
+ +| Configuration item | Parameter | Valid values | Description | +| :----------------- | :-------- | :----------- | :----------------------------------------------------------- | +| `enable` | Boolean | truefalse | true: enables the CPU QoS feature for all containers in the cluster.false: disables the CPU QoS feature for all containers in the cluster. | +| `groupIdentity` | Int | -1~2 | Specify group identities for CPU scheduling. By default, the group identity of LS pods is 2 and the group identity of BE pods is -1. A value of 0 indicates that no group identity is assigned.A greater `group identity` value indicates a higher priority in CPU scheduling. For example, you can set `cpu.bvt_warp_ns=2` for LS pods and set `cpu.bvt_warp_ns=-1` for BE pods because the priority of LS pods is higher than that of BE pods. For more information, see [Group identity feature](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/group-identity-feature#task-2129392). | + + **Note** If `koordinator.sh/qosClass` is not specified for a pod, Koordinator configures the pod based on the original QoS class of the pod. The component uses the BE settings in the preceding ConfigMap if the original QoS class is BE. The component uses the LS settings in the preceding ConfigMap if the original QoS class is not BE + +2. Check whether a ConfigMap named `slo-controller-config` exists in the `koordinator-system` namespace. + + - If a ConfigMap named `slo-controller-config` exists, we commend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap. + + ```bash + kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" + ``` + + - If no ConfigMap named `slo-controller-config` exists, run the kubectl patch command to create a ConfigMap named ack-slo-config: + + ```bash + kubectl apply -f configmap.yaml + ``` + +3. Create a file named ls-pod-demo.yaml based on the following YAML content: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: ls-pod-demo + labels: + koordinator.sh/qosClass: 'LS' # Set the QoS class of the pod to LS + spec: + containers: + - command: + - "nginx" + - "-g" + - "daemon off; worker_processes 4;" + image: docker.io/koordinatorsh/nginx:v1.18-koord-example + imagePullPolicy: Always + name: nginx + resources: + limits: + cpu: "4" + memory: 10Gi + requests: + cpu: "4" + memory: 10Gi + restartPolicy: Never + schedulerName: default-scheduler + ``` + +4. Run the following command to deploy the ls-pod-demo pod in the cluster: + + ```bash + kubectl apply -f ls-pod-demo.yaml + ``` + +5. Run the following command to check whether the CPU group identity of the LS pod in the control group (cgroup) of the node takes effect: + + ```bash + cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-pod1c20f2ad****.slice/cpu.bvt_warp_ns + ``` + + Expected output: + + ```bash + #The group identity of the LS pod is 2 (high priority). + 2 + ``` + +6. Create a file named be-pod-demo.yaml based on the following content: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: be-pod-demo + labels: + koordinator.sh/qosClass: 'BE' # Set the QoS class of the pod to BE. + spec: + containers: + - args: + - '-c' + - '1' + - '--vm' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + schedulerName: default-scheduler + priorityClassName: koord-batch + ``` + +7. 
Run the following command to deploy the be-pod-demo pod in the cluster: + + ```bash + kubectl apply -f be-pod-demo.yaml + ``` + +8. Run the following command to check whether the CPU group identity of the BE pod in the cgroup of the node takes effect: + + ```bash + cat /sys/fs/cgroup/cpu/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8****.slice/cpu.bvt_warp_ns + ``` + + Expected output: + + ```bash + #The group identity of the BE pod is -1 (low priority). + -1 + ``` + + The output shows that the priority of the LS pod is high and the priority of the BE pod is low. CPU resources are preferably scheduled to the LS pod to ensure the service quality. diff --git a/versioned_docs/version-v1.3/user-manuals/cpu-suppress.md b/versioned_docs/version-v1.3/user-manuals/cpu-suppress.md new file mode 100644 index 000000000..7077acefd --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/cpu-suppress.md @@ -0,0 +1,103 @@ +# CPU Suppress + +## Introduction +In order to ensure the runtime quality of different workloads in co-located scenarios, Koordinator uses the CPU Suppress +mechanism provided by koordlet on the node side to suppress workloads of the Best Effort type when the load increases. +Or increase the resource quota for Best Effort type workloads when the load decreases. + +In the [Dynamic resource overcommitment model](/architecture/resource-model.md) that is provided by +Koordinator, the total amount of reclaimed resources dynamically changes based on the actual amount of resources used +by latency-sensitive (LS/LSR/LSE) pods. Reclaimed resources can be used by BE pods. You can use the dynamic resource +overcommitment feature to improve the resource utilization of a cluster by deploying both LS pods and BE pods in the +cluster. To ensure sufficient CPU resources for the LS pods on a node, you can use koordinator to limit the CPU +usage of the BE pods on the node. The elastic resource limit feature can maintain the resource utilization of a node +below a specified threshold and limit the amount of CPU resources that can be used by BE pods. This ensures the +stability of the containers on the node. + +CPU Threshold indicates the CPU utilization threshold of a node. Pod (LS).Usage indicates the CPU usage of LS pods. +CPU Restriction for BE indicates the CPU usage of BE pods. The amount of CPU resources that can be used by BE pods +is adjusted based on the increase or decrease of the CPU usage of LS pods. We recommend that you use the same value +for CPU Threshold and the reserved CPU watermark in the dynamic resource overcommitment model. +This ensures a consistent level of CPU resource utilization. + +![CPU-Suppress](/img/cpu-suppress-demo.svg) + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations +When installing through the helm chart, the ConfigMap slo-controller-config will be created in the koordinator-system +namespace, and the CPU Suppress mechanism is enabled by default. If it needs to be closed, refer to the configuration +below, and modify the configuration of the resource-threshold-config section to take effect. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: {{ .Values.installation.namespace }} +data: + ... 
+ resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuSuppressThresholdPercent": 65 + } + } +``` + +#### (Optional) Advanced Settings +Also, the `CPU Suppress` feature allows you to configure the CPU utilization threshold in a fine-grained manner. +The following table describes the parameters. + +| Parameter | Data type | Valid value | Description | +| --------- | --------- | ----------- | ----------- | +| enable | Boolean | true; false | true: enables the elastic resource limit feature; false: disables the elastic resource limit feature. | +| cpuSuppressThresholdPercent | Int | 0~100 | The CPU utilization threshold of the node. Unit: %. Default value: 65. | + +## Use CPU Suppress + +1. Create a configmap.yaml file based on the following ConfigMap content: +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + # Enable the elastic resource limit feature. + resource-threshold-config: | + { + "clusterStrategy": { + "enable": true + } + } +``` + +2. Run the following command to update the ConfigMap. +To avoid changing other settings in the ConfigMap, we commend that you run the kubectl patch command to update the ConfigMap. + +```bash +kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)" +``` + +3. Run the following command to query the CPU cores that are allocated to the BE pods on the node: +```bash +cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/cpuset.cpus +``` +Expected output: +```bash +10-25,35-51,62-77,87-103 +``` +The output shows that the following CPU cores are allocated to the BE pods on the node: 10-25, 35-51, 62-77, and 87-103, +which will be changed dynamically according to the load of latency-sensitve pods. \ No newline at end of file diff --git a/versioned_docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md b/versioned_docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md new file mode 100644 index 000000000..53f24a5e5 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/fine-grained-cpu-orchestration.md @@ -0,0 +1,262 @@ +# Fine-grained CPU Orchestration + +Fine-grained CPU Orchestration is an ability of koord-scheduler for improving the performance of CPU-sensitive workloads. + +## Introduction + +There is an increasing number of systems that leverage a combination of CPUs and hardware accelerators to support +latency-critical execution and high-throughput parallel computation. A high-performance environment is expected in +plenty of applications including in telecommunications, scientific computing, machine learning, financial services, and +data analytics. + +However, pods in the Kubernetes cluster may interfere with others' running when they share the same physical resources +and both demand many resources. The sharing of CPU resources is almost inevitable. e.g. SMT threads (i.e. logical +processors) share execution units of the same core, and cores in the same chip share one last-level cache. The resource +contention can slow down the running of these CPU-sensitive workloads, resulting in high response latency (RT). + +To improve the performance of CPU-sensitive workloads, koord-scheduler provides a mechanism of fine-grained CPU +orchestration. It enhances the CPU management of Kubernetes and supports detailed NUMA-locality and CPU exclusions. + +For more information, please see [Design: Fine-grained CPU orchestration](/docs/designs/fine-grained-cpu-orchestration). 
+ +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Global Configuration via plugin args + +Fine-grained CPU orchestration is *Enabled* by default. You can use it without any modification on the koord-scheduler config. + +For users who need deep insight, please configure the rules of fine-grained CPU orchestration by modifying the ConfigMap +`koord-scheduler-config` in the helm chart. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + ... +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + profiles: + - schedulerName: koord-scheduler + - pluginConfig: + - name: NodeNUMAResource + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: NodeNUMAResourceArgs + # The default CPU Binding Policy. The default is FullPCPUs + # If the Pod belongs to LSE/LSR Prod Pods, and if no specific CPU binding policy is set, + # the CPU will be allocated according to the default core binding policy. + defaultCPUBindPolicy: FullPCPUs + # the scoring strategy + scoringStrategy: + # the scoring strategy ('MostAllocated', 'LeastAllocated') + # - MostAllocated(default): prefer the node with the least available resources + # - LeastAllocated: prefer the node with the most available resources + type: MostAllocated + # the weights of each resource type + resources: + - name: cpu + weight: 1 + plugins: + # enable the NodeNUMAResource plugin + preFilter: + enabled: + - name: NodeNUMAResource + filter: + enabled: + - name: NodeNUMAResource + ... + score: + enabled: + - name: NodeNUMAResource + weight: 1 + ... + reserve: + enabled: + - name: NodeNUMAResource + preBind: + enabled: + - name: NodeNUMAResource +``` + +The koord-scheduler takes this ConfigMap as [scheduler Configuration](https://kubernetes.io/docs/reference/scheduling/config/). +New configurations will take effect after the koord-scheduler restarts. + +| Field | Description | Version | +|-------|-------------|---------| +| defaultCPUBindPolicy | The default CPU Binding Policy. The default is FullPCPUs. If the Pod belongs to LSE/LSR Prod Pods, and if no specific CPU binding policy is set, the CPU will be allocated according to the default CPU binding policy. The optional values are FullPCPUs and SpreadByPCPUs | >= v0.6.0 | +| scoringStrategy | the scoring strategy, including MostAllocated and LeastAllocated | >= v0.6.0 | + +### Configure by Node + +Users can set CPU binding policy and NUMA Node selection policy separately for Node. + +#### CPU bind policy + +The label `node.koordinator.sh/cpu-bind-policy` constrains how to bind CPU logical CPUs when scheduling. +The following is the specific value definition: + +| Value | Description | Version | +|-------|-------------|---------| +| None or empty value | does not perform any policy| >= v0.6.0 | +| FullPCPUsOnly | requires that the scheduler must allocate full physical cores. Equivalent to kubelet CPU manager policy option full-pcpus-only=true. | >= v0.6.0 | +| SpreadByPCPUs | requires that the schedler must evenly allocate logical CPUs across physical cores. | >= v1.1.0 | + +If there is no `node.koordinator.sh/cpu-bind-policy` in the node's label, it will be executed according to the policy configured by the Pod or koord-scheduler. 
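+
+For example, to require that pods scheduled to a node are always allocated full physical cores, you can set the policy label directly on that node. This is a minimal sketch; replace the node name with one in your cluster:
+
+```bash
+# Require the scheduler to allocate full physical cores on this node.
+kubectl label node <node-name> node.koordinator.sh/cpu-bind-policy=FullPCPUsOnly
+
+# Remove the label so the policy configured by the Pod or koord-scheduler applies again.
+kubectl label node <node-name> node.koordinator.sh/cpu-bind-policy-
+```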
+ +#### NUMA allocate strategy + +The label `node.koordinator.sh/numa-allocate-strategy` indicates how to choose satisfied NUMA Nodes when scheduling. +The following is the specific value definition: + +| Value | Description | Version | +|-------|-------------|---------| +| MostAllocated | MostAllocated indicates that allocates from the NUMA Node with the least amount of available resource.| >= v.0.6.0 | +| LeastAllocated | LeastAllocated indicates that allocates from the NUMA Node with the most amount of available resource.| >= v.0.6.0 | + +If both `node.koordinator.sh/numa-allocate-strategy` and `kubelet.koordinator.sh/cpu-manager-policy` are defined, `node.koordinator.sh/numa-allocate-strategy` is used first. + +## Use Fine-grained CPU Orchestration + +1. Create an `nginx` deployment with the YAML file below. + +> Fine-grained CPU Orchestration allows pods to bind CPUs exclusively. To use fine-grained CPU orchestration, pods should set a label of [QoS Class](/docs/architecture/qos#definition)) and specify the cpu binding policy. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-lsr + labels: + app: nginx-lsr +spec: + replicas: 3 + selector: + matchLabels: + app: nginx-lsr + template: + metadata: + name: nginx-lsr + labels: + app: nginx-lsr + koordinator.sh/qosClass: LSR # set the QoS class as LSR, the binding policy is FullPCPUs by default + # in v0.5, binding policy should be specified. + # e.g. to set binding policy as FullPCPUs (prefer allocating full physical CPUs of the same core): + #annotations: + #scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "FullPCPUs"}' + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: '2' + requests: + cpu: '2' + priorityClassName: koord-prod +``` + +2. Deploy the `nginx` deployment and check the scheduling result. + +```bash +$ kubectl create -f nginx-deployment.yaml +deployment/nginx-lsr created +$ kubectl get pods -o wide | grep nginx +nginx-lsr-59cf487d4b-jwwjv 1/1 Running 0 21s 172.20.101.35 node-0 +nginx-lsr-59cf487d4b-4l7r4 1/1 Running 0 21s 172.20.101.79 node-1 +nginx-lsr-59cf487d4b-nrb7f 1/1 Running 0 21s 172.20.106.119 node-2 +``` + +3. Check the CPU binding results of pods on `scheduling.koordinator.sh/resource-status` annotations. + +```bash +$ kubectl get pod nginx-lsr-59cf487d4b-jwwjv -o jsonpath='{.metadata.annotations.scheduling\.koordinator\.sh/resource-status}' +{"cpuset":"2,54"} +``` + +We can see that the pod `nginx-lsr-59cf487d4b-jwwjv` binds 2 CPUs, and the IDs are 2,54, which are the logical +processors of the **same** core. + +4. Change the binding policy in the `nginx` deployment with the YAML file below. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-lsr + labels: + app: nginx-lsr +spec: + replicas: 3 + selector: + matchLabels: + app: nginx-lsr + template: + metadata: + name: nginx-lsr + labels: + app: nginx-lsr + koordinator.sh/qosClass: LSR # set the QoS class as LSR + annotations: + # set binding policy as SpreadByPCPUs (prefer allocating physical CPUs of different cores) + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "SpreadByPCPUs"}' + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: '2' + requests: + cpu: '2' + priorityClassName: koord-prod +``` + +5. Update the `nginx` deployment and check the scheduling result. 
+ +```bash +$ kubectl apply -f nginx-deployment.yaml +deployment/nginx-lsr created +$ kubectl get pods -o wide | grep nginx +nginx-lsr-7fcbcf89b4-rkrgg 1/1 Running 0 49s 172.20.101.35 node-0 +nginx-lsr-7fcbcf89b4-ndbks 1/1 Running 0 49s 172.20.101.79 node-1 +nginx-lsr-7fcbcf89b4-9v8b8 1/1 Running 0 49s 172.20.106.119 node-2 +``` + +6. Check the new CPU binding results of pods on `scheduling.koordinator.sh/resource-status` annotations. + +```bash +$ kubectl get pod nginx-lsr-7fcbcf89b4-rkrgg -o jsonpath='{.metadata.annotations.scheduling\.koordinator\.sh/resource-status}' +{"cpuset":"2-3"} +``` + +Now we can see that the pod `nginx-lsr-59cf487d4b-jwwjv` binds 2 CPUs, and the IDs are 2,3, which are the logical +processors of the **different** core. + +7. (Optional) Advanced configurations. + +```yaml + labels: + # koordinator QoS class of the pod. (use 'LSR' or 'LSE' for binding CPUs) + koordinator.sh/qosClass: LSR + annotations: + # `resource-spec` indicates the specification of resource scheduling, here we need to set `preferredCPUBindPolicy`. + # `preferredCPUBindPolicy` indicating the CPU binding policy of the pod ('None', 'FullPCPUs', 'SpreadByPCPUs') + # - None: perform no exclusive policy + # - FullPCPUs(default): a bin-packing binding policy, prefer allocating full physical cores (SMT siblings) + # - SpreadByPCPUs: a spread binding policy, prefer allocating logical cores (SMT threads) evenly across physical cores (SMT siblings) + scheduling.koordinator.sh/resource-spec: '{"preferredCPUBindPolicy": "FullPCPUs"}' +``` diff --git a/versioned_docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md b/versioned_docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md new file mode 100644 index 000000000..b4a93a337 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/fine-grained-device-scheduling.md @@ -0,0 +1,318 @@ +# Device Scheduling +We provide a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA, defines a set of APIs to +describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names to flexibly support +users to apply at a finer granularity GPU resources. This mechanism is the basis for subsequent other GPU scheduling +capabilities such as GPU Share, GPU Overcommitment, etc. + +## Introduction +GPU devices have very strong computing power, but are expensive. How to make better use of GPU equipment, give full play +to the value of GPU and reduce costs is a problem that needs to be solved. In the existing GPU allocation mechanism of +the K8s community, the GPU is allocated by the kubelet, and it is a complete device allocation. This method is simple +and reliable, but similar to the CPU and memory, the GPU will also be wasted. Therefore, some users expect to use only +a portion of the GPU's resources and share the rest with other workloads to save costs. Moreover, GPU has particularities. +For example, the NVLink and oversold scenarios supported by NVIDIA GPU mentioned below both require a central decision +through the scheduler to obtain globally optimal allocation results. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.71 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Configurations + +DeviceScheduling is *Enabled* by default. You can use it without any modification on the koord-scheduler config. 
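+
+Before walking through the quick start, it may help to confirm that koordlet has reported a `Device` object for each node that has GPU devices. This is only a sanity check; the node name and the output shape below are illustrative:
+
+```bash
+# List the Device CRs reported by koordlet; one object named after each device-equipped node is expected.
+$ kubectl get device
+NAME     AGE
+host04   2d
+```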
+ +## Use DeviceScheduling + +### Quick Start + +1.check device crd: + +```bash +$ kubectl get device host04 -o yaml +``` + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Device +metadata: + creationTimestamp: "2022-10-08T09:26:42Z" + generation: 1 + managedFields: + - apiVersion: scheduling.koordinator.sh/v1alpha1 + fieldsType: FieldsV1 + fieldsV1: + f:metadata: + f:ownerReferences: {} + f:spec: + .: {} + f:devices: {} + f:status: {} + manager: koordlet + operation: Update + time: "2022-10-08T09:26:42Z" + name: host04 + ownerReferences: + - apiVersion: v1 + blockOwnerDeletion: true + controller: true + kind: Node + name: host04 + uid: 09c4f912-6026-467a-85d2-6b2147c9557e + resourceVersion: "39011943" + selfLink: /apis/scheduling.koordinator.sh/v1alpha1/devices/host04 + uid: 5a498e1f-1357-4518-b74c-cab251d6c18c +spec: + devices: + - health: true + id: GPU-04cea5cd-966f-7116-1d58-1ac34421541b + minor: 0 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 16Gi + kubernetes.io/gpu-memory-ratio: "100" + type: gpu + - health: true + id: GPU-3680858f-1753-371e-3c1a-7d8127fc7113 + minor: 1 + resources: + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: 16Gi + kubernetes.io/gpu-memory-ratio: "100" + type: gpu +status: {} +``` +We can find this node has two gpu cards, we can find the detail info of each gpu card here. + +2.check node allocatable resource: + +```bash +$ kubectl get node host04 -o yaml +``` + +```yaml +apiVersion: v1 +kind: Node +metadata: + annotations: + flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"5a:69:48:10:29:25"}' + creationTimestamp: "2022-08-29T09:12:55Z" + labels: + beta.kubernetes.io/os: linux + status: + addresses: + - address: 10.15.0.37 + type: InternalIP + - address: host04 + type: Hostname + allocatable: + cpu: "6" + ephemeral-storage: "200681483926" + kubernetes.io/gpu: "200" + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory: 32Gi + kubernetes.io/gpu-memory-ratio: "200" + memory: 59274552Ki + nvidia.com/gpu: "2" + pods: "220" + capacity: + cpu: "8" + kubernetes.io/gpu: "200" + kubernetes.io/gpu-core: "200" + kubernetes.io/gpu-memory: 32Gi + kubernetes.io/gpu-memory-ratio: "200" + memory: 61678904Ki + nvidia.com/gpu: "2" + pods: "220" +``` +We can find the node allocatable resource has merged each gpu card resource. 
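+
+If you only want to inspect the merged quantities rather than the whole Node object, a jsonpath query is a convenient shortcut (the node name follows the example above):
+
+```bash
+# Print the allocatable map of the node, including the merged kubernetes.io/gpu-* quantities.
+kubectl get node host04 -o jsonpath='{.status.allocatable}'; echo
+```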
+ +3.apply pod: +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: default +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu: "100" + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default pod-example -o yaml +``` + +```yaml +apiVersion: v1 +kind: Pod +metadata: + annotations: + scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":0,"resources":{"kubernetes.io/gpu-core":"100","kubernetes.io/gpu-memory":"12508288Ki","kubernetes.io/gpu-memory-ratio":"100"}}]}' + creationTimestamp: "2022-10-08T09:33:07Z" + name: pod-example + namespace: default + resourceVersion: "39015044" + selfLink: /api/v1/namespaces/xlf/pods/gpu-pod7 + uid: 6bf1ac3c-0c9f-472a-8b86-de350bbfa795 +spec: + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: "1" + kubernetes.io/gpu: "100" + memory: 256Mi + requests: + cpu: "1" + kubernetes.io/gpu: "100" + memory: 256Mi +status: + conditions: + ... + hostIP: 10.0.0.149 + phase: Running + podIP: 10.244.2.45 + podIPs: + - ip: 10.244.2.45 + qosClass: Guaranteed + startTime: "2022-10-08T09:33:07Z" +``` +You can find the concrete device allocate result through annotation `scheduling.koordinator.sh/device-allocated`. + +4.more apply protocol: +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... + resources: + requests: + cpu: 40m + memory: 40Mi + nvidia.com/gpu: "100" +``` + +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... + resources: + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory-ratio: "100" +``` + +```yaml +apiVersion: v1 +kind: Pod +... +spec: + ... 
+ resources: + requests: + cpu: 40m + memory: 40Mi + kubernetes.io/gpu-core: "100" + kubernetes.io/gpu-memory: "16Mi" +``` + +4.device resource debug api: +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries +$ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/host04 +``` + +```json +{ + "allocateSet": { + "gpu": { + "xlf/gpu-pod7": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + } + }, + "deviceFree": { + "kubernetes.io/gpu-core": "0", + "kubernetes.io/gpu-memory": "0", + "kubernetes.io/gpu-memory-ratio": "0" + }, + "deviceFreeDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "0", + "kubernetes.io/gpu-memory": "0", + "kubernetes.io/gpu-memory-ratio": "0" + } + } + }, + "deviceTotal": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + }, + "deviceTotalDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + }, + "deviceUsed": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + }, + "deviceUsedDetail": { + "gpu": { + "0": { + "kubernetes.io/gpu-core": "100", + "kubernetes.io/gpu-memory": "12508288Ki", + "kubernetes.io/gpu-memory-ratio": "100" + } + } + } +} +``` diff --git a/versioned_docs/version-v1.3/user-manuals/gang-scheduling.md b/versioned_docs/version-v1.3/user-manuals/gang-scheduling.md new file mode 100644 index 000000000..6a1ddc371 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/gang-scheduling.md @@ -0,0 +1,364 @@ +# GangScheduling + +## Introduction +We provide Gang mechanism for the scheduler to control pods binding opportunity. User can declare a resource-collection-minimum number, +only when assigned-resources reach the given limitation can trigger the binding. We provide `Strict` and `NonStrict` to +control the resource-accumulation-process by a configuration. We also provide a two-level Gang description for better matching +the real scenario, which is different from community. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.70 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Configurations + +GangScheduling is *Enabled* by default. You can use it without any modification on the koord-scheduler config. 
+ +## Use GangScheduling + +### Quick Start + +#### apply gang through gang crd +1.create pod-group +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: gang-example + namespace: default +spec: + scheduleTimeoutSeconds: 100 + minMember: 2 +``` + +```bash +$ kubectl get pgs -n default + NAME AGE + gang-example 13s +``` + +2.create child pod1 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example1 + namespace: default + labels: + pod-group.scheduling.sigs.k8s.io: gang-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 0/1 Pending 0 7s +``` + +3.create child pod2 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example2 + namespace: default + labels: + pod-group.scheduling.sigs.k8s.io: gang-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 1/1 Running 0 53s + pod-example2 1/1 Running 0 5s +``` + +```bash +$ kubectl get pg gang-example -n default -o yaml +``` + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + creationTimestamp: "2022-10-09T09:08:17Z" + generation: 6 +spec: + minMember: 1 + scheduleTimeoutSeconds: 100 +status: + phase: Running + running: 2 + scheduled: 2 +``` + +#### apply gang through annotation +1.create child pod1 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example1 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example" + gang.scheduling.koordinator.sh/min-available: "2" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 0/1 Pending 0 7s +``` + +2.create child pod2 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example2 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example" + gang.scheduling.koordinator.sh/min-available: "2" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl get pod -n default + NAME READY STATUS RESTARTS AGE + pod-example1 1/1 Running 0 53s + pod-example2 1/1 Running 0 5s +``` + +```bash +$ kubectl get pg gang-example -n default -o yaml +``` + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: 
PodGroup +metadata: + creationTimestamp: "2022-10-09T09:08:17Z" + generation: 6 +spec: + minMember: 1 + scheduleTimeoutSeconds: 100 +status: + phase: Running + running: 2 + scheduled: 2 +``` + +#### device resource debug api: +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/Coscheduling/gang/default/gang-example +``` + +```json +{ + "boundChildren": { + "default/pod-example1": {}, + "default/pod-example2": {} + }, + "children": { + "default/pod-example1": {}, + "default/pod-example2": {} + }, + "childrenScheduleRoundMap": { + "default/pod-example1": 2, + "default/pod-example2": 2 + }, + "createTime": "2022-10-09T07:31:53Z", + "gangFrom": "GangFromPodAnnotation", + "gangGroup": null, + "hasGangInit": true, + "minRequiredNumber": 2, + "mode": "Strict", + "name": "default/gang-example", + "onceResourceSatisfied": true, + "scheduleCycle": 2, + "scheduleCycleValid": true, + "totalChildrenNum": 2, + "waitTime": 600000000000, + "waitingForBindChildren": {} +} +``` + +#### advanced configuration for gang +1.apply through pod-group. + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: gang-example1 + namespace: default + annotations: + gang.scheduling.koordinator.sh/total-number: "3" + gang.scheduling.koordinator.sh/mode: "NonStrict" + gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]" + +spec: + scheduleTimeoutSeconds: 100 + minMember: 2 + +``` + +- `gang.scheduling.koordinator.sh/total-number` specifies the total children number of the gang. If not specified,it will be set with the `minMember` +- `gang.scheduling.koordinator.sh/mode` defines the Gang Scheduling operation when failed scheduling. Support `Strict\NonStrict`, default is `Strict` +- `gang.scheduling.koordinator.sh/groups` defines which gangs are bundled as a group. The gang will go to bind only all gangs in one group meet the conditions + +2.apply through pod annotations. +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example2 + namespace: default + annotations: + gang.scheduling.koordinator.sh/name: "gang-example1" + gang.scheduling.koordinator.sh/min-available: "2" + gang.scheduling.koordinator.sh/total-number: "3" + gang.scheduling.koordinator.sh/mode: "Strict\NonStrict" + gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]" + gang.scheduling.koordinator.sh/waiting-time: "100s" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` +- `gang.scheduling.koordinator.sh/total-number` specifies the total children number of the gang. If not specified,it will be set with the `gang.scheduling.koordinator.sh/min-available` +- `gang.scheduling.koordinator.sh/mode` defines the Gang Scheduling operation when failed scheduling. Support `Strict\NonStrict`, default is `Strict` +- `gang.scheduling.koordinator.sh/groups` defines which gangs are bundled as a group. 
The gang will go to bind only all gangs in one group meet the conditions +- `gang.scheduling.koordinator.sh/waiting-time` specifies gang's max wait time in Permit Stage. + +#### advanced configuration for scheduler +you can modify `koord-scheduler-config.yaml` in helm to adjust `Coscheduling` configuration as below: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + namespace: {{ .Values.installation.namespace }} +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: true + resourceLock: leases + resourceName: koord-scheduler + resourceNamespace: {{ .Values.installation.namespace }} + profiles: + - pluginConfig: + - name: Coscheduling + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: CoschedulingArgs` + defaultTimeout: 600s + controllerWorkers: 1 + - name: ElasticQuota + ... +``` + diff --git a/versioned_docs/version-v1.3/user-manuals/load-aware-descheduling.md b/versioned_docs/version-v1.3/user-manuals/load-aware-descheduling.md new file mode 100644 index 000000000..c1773dc60 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/load-aware-descheduling.md @@ -0,0 +1,229 @@ +# Load Aware Descheduling + +The load-aware scheduling supported in the scheduler can select nodes with lower loads to run new Pods during scheduling, but as time, cluster environment changes, and changes in traffic/requests faced by workloads, the utilization of nodes will change dynamically Changes in the cluster will break the original load balance between nodes in the cluster, and even extreme load imbalance may occur, affecting the runtime quality of the workload. + +koord-descheduler perceives changes in the load of nodes in the cluster, automatically optimizes nodes that exceed the safety load to prevents extreme load imbalance. + +## Introduction + +The LowNodeLoad plugin in the koord-descheduler is responsible for sensing the load of the node, and reducing the load hotspot by evict/migrate Pod. The `LowNodeLoad` plugin is different from the Kubernetes native descheduler plugin LowNodeUtilization in that `LowNodeLoad` decides to deschedule based on the actual utilization of nodes, while LowNodeUtilization decides to deschedule based on the resource allocation. + +The `LowNodeLoad` plugin has two most important parameters: + +- `highThresholds` defines the target usage threshold of resources. The Pods on nodes exceeding this threshold will be evicted/migrated. +- `lowThresholds` defines the low usage threshold of resources. The Pods on nodes below this threshold will not be evicted/migrated. + +Take the following figure as an example, `lowThresholds` is 45%, `highThresholds` is 70%, we can classify nodes into three categories: + +1. Idle Node. Nodes with resource utilization below 45%; +2. Normal Node. For nodes whose resource utilization is higher than 45% but lower than 70%, this load water level range is a reasonable range we expect +3. Hotspot Node. If the node resource utilization rate is higher than 70%, the node will be judged as unsafe and belongs to the hotspot node, and some pods should be expelled to reduce the load level so that it does not exceed 70%. + +![image](/img/low-node-load.png) + +After identifying which nodes are hotspots, descheduler will perform a eviction/migration operation to evict some Pods from hotspot nodes to idle nodes. + +If the total number of idle nodes in a cluster is not many, descheduling will be terminated. 
This can be helpful in large clusters where some nodes may be underutilized frequently or for short periods of time. By default, `numberOfNodes` is set to zero. This capability can be enabled by setting the parameter `numberOfNodes`. +Before migration, descheduler will calculate the actual free capacity to ensure that the sum of the actual utilization of the Pods to be migrated does not exceed the total free capacity in the cluster. These actual free capacities come from idle nodes, and the actual free capacity of an idle node = `(highThresholds - current load of the node) * total capacity of the node`. Suppose the load level of node A is 20%, the highThresholdss is 70%, and the total CPU of node A is 96C, then `(70%-20%) * 96 = 48C`, and this 48C is the free capacity that can be carried. + +In addition, when migrating hotspot nodes, the Pods on the nodes will be filtered. Currently, descheduler supports multiple filtering parameters, which can avoid migration and expulsion of very important Pods: + +- Filter by namespace. Can be configured to filter only certain namespaces or filter out certain namespaces +- Filter by pod selector. Pods can be filtered out through the label selector, or Pods with certain Labels can be excluded +- Configure `nodeFit` to check whether the scheduling rules have candidate nodes. When enabled, descheduler checks whether there is a matching Node in the cluster according to the Node Affinity/Node Selector/Toleration corresponding to the candidate Pod. If not, the Pod will not be evicted for migration. If you set `nodeFit` to false, the migration controller in the descheduler will complete the capacity reservation at this time, and start the migration after ensuring that there are resources. + +After the Pods are filtered out, these Pods are sorted from multiple dimensions such as QoSClass, Priority, actual usage, and creation time. + +After pods have been filtered and sorted, the migration operation begins. Before migration, it will check whether the remaining free capacity is satisfied and whether the load the current node is higher than the target safety threshold. If one of these two conditions cannot be met, descheduling will stop. Every time a Pod is migrated, the remaining free capacity will be withheld, and the load of the current node will be adjusted at the same time until the remaining capacity is insufficient or the load reaches the safety threshold. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 1.1.1 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Global Configuration via plugin args + +Load-aware descheduling is *Disabled* by default. You can modify the ConfigMap `koord-descheduler-config` to enable the plugin. + +For users who need deep insight, please configure the rules of load-aware descheduling by modifying the ConfigMap +`koord-descheduler-config` in the helm chart. New configurations will take effect after the koord-descheduler restarts. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-descheduler-config + ... +data: + koord-descheduler-config: | + apiVersion: descheduler/v1alpha2 + kind: DeschedulerConfiguration + ... + # Execute the LowNodeLoad plugin every 60s + deschedulingInterval: 60s + profiles: + - name: koord-descheduler + plugins: + deschedule: + disabled: + - name: "*" + balance: + enabled: + - name: LowNodeLoad # Configure to enable the LowNodeLoad plugin + .... 
+ pluginConfig: + - name: LowNodeLoad + args: + apiVersion: descheduler/v1alpha2 + kind: LowNodeLoadArgs + evictableNamespaces: + # include and exclude are mutually exclusive, only one of them can be configured. + # include indicates that only the namespace configured below will be processed + # include: + # - test-namespace + # exclude means to only process namespaces other than those configured below + exclude: + - "kube-system" + - "koordinator-system" + # lowThresholds defines the low usage threshold of resources + lowThresholds: + cpu: 20 + memory: 30 + # highThresholds defines the target usage threshold of resources + highThresholds: + cpu: 50 + memory: 60 + .... +``` + +| Field | Description | Version | +|-------|-------------| --------| +| paused | Paused indicates whether the LowNodeLoad should to work or not. | >= v1.1.1 | +| dryRun | DryRun means only execute the entire deschedule logic but don't migrate Pod | >= v1.1.1 | +| numberOfNodes | NumberOfNodes can be configured to activate the strategy only when the number of under utilized nodes are above the configured value. This could be helpful in large clusters where a few nodes could go under utilized frequently or for a short period of time. By default, NumberOfNodes is set to zero. | >= v1.1.1 | +| evictableNamespaces | Naming this one differently since namespaces are still considered while considering resources used by pods but then filtered out before eviction. | >= v1.1.1 | +| nodeSelector | NodeSelector selects the nodes that matched labelSelector. | >= v1.1.1 | +| podSelectors | PodSelectors selects the pods that matched labelSelector. | >= v1.1.1 | +| nodeFit | NodeFit if enabled, it will check whether the candidate Pods have suitable nodes, including NodeAffinity, TaintTolerance, and whether resources are sufficient. By default, NodeFit is set to true. | >= v1.1.1 | +| useDeviationThresholds | If UseDeviationThresholds is set to `true`, the thresholds are considered as percentage deviations from mean resource usage. `lowThresholds` will be deducted from the mean among all nodes and `highThresholds` will be added to the mean. A resource consumption above (resp. below) this window is considered as overutilization (resp. underutilization). | >= v1.1.1 | +| highThresholds | HighThresholds defines the target usage threshold of resources | >= v1.1.1 | +| lowThresholds | LowThresholds defines the low usage threshold of resources | >= v1.1.1 | + +## Use Load Aware Descheduling + +The example cluster in this article has three 4-core 16GiB nodes. + +1. Deploy two `stress` pod with the YAML file below. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: stress-demo + namespace: default + labels: + app: stress-demo +spec: + replicas: 2 + selector: + matchLabels: + app: stress-demo + template: + metadata: + name: stress-demo + labels: + app: stress-demo + spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - '1600M' + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '2' + memory: 4Gi + requests: + cpu: '2' + memory: 4Gi + restartPolicy: Always + schedulerName: koord-scheduler # use the koord-scheduler +``` + +```bash +$ kubectl create -f stress-demo.yaml +deployment.apps/stress-demo created +``` + +2. Watch the pod status util they become running. 
+ +```bash +$ kubectl get pod -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +stress-demo-7fdd89cc6b-lml7k 1/1 Running 0 21m 10.0.2.83 cn-beijing.10.0.2.54 +stress-demo-7fdd89cc6b-xr5dl 1/1 Running 0 4m40s 10.0.2.77 cn-beijing.10.0.2.53 +``` + +The stress pods are scheduled on `cn-beijing.10.0.2.53` and `cn-beijing.10.0.2.54`. + +3. Check the load of each node. + +```bash +$ kubectl top node +NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% +cn-beijing.10.0.2.53 3825m 98% 4051Mi 31% +cn-beijing.10.0.2.54 2155m 55% 4500Mi 35% +cn-beijing.10.0.2.58 182m 4% 1367Mi 10% +``` + +In above order, `cn-beijing.10.0.2.53` and `cn-beijing.10.0.2.54` have the highest load, while `cn-beijing.10.0.2.58` has the lowest load. + +4. Update `koord-descheduler-config` to enable `LowNodeLoad` plugin. + +5. Observe the Pod changes and wait for the koord-descheduler to execute the eviction/migration operation. + +```bash +$ kubectl get pod -w +NAME READY STATUS RESTARTS AGE +stress-demo-7fdd89cc6b-lml7k 1/1 Running 0 22m +stress-demo-7fdd89cc6b-xr5dl 1/1 Running 0 5m45s +stress-demo-7fdd89cc6b-xr5dl 1/1 Terminating 0 5m59s +stress-demo-7fdd89cc6b-8k8wq 0/1 Pending 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 Pending 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 ContainerCreating 0 0s +stress-demo-7fdd89cc6b-8k8wq 0/1 ContainerCreating 0 1s +stress-demo-7fdd89cc6b-8k8wq 1/1 Running 0 3s +``` + +5. Observe the Event, you can see the following migration records + +```bash +$ kubectl get event |grep stress-demo-7fdd89cc6b-xr5dl +74s Normal Evicting podmigrationjob/e54863dc-b651-47e3-9ffd-08b6b4ff64d5 Pod "default/stress-demo-7fdd89cc6b-xr5dl" evicted from node "cn-beijing.10.0.2.53" by the reason "node is overutilized, cpu usage(56.13%)>threshold(50.00%)" +41s Normal EvictComplete podmigrationjob/e54863dc-b651-47e3-9ffd-08b6b4ff64d5 Pod "default/stress-demo-7fdd89cc6b-xr5dl" has been evicted +7m12s Normal Scheduled pod/stress-demo-7fdd89cc6b-xr5dl Successfully assigned default/stress-demo-7fdd89cc6b-xr5dl to cn-beijing.10.0.2.53 +7m12s Normal AllocIPSucceed pod/stress-demo-7fdd89cc6b-xr5dl Alloc IP 10.0.2.77/24 +7m12s Normal Pulling pod/stress-demo-7fdd89cc6b-xr5dl Pulling image "polinux/stress" +6m59s Normal Pulled pod/stress-demo-7fdd89cc6b-xr5dl Successfully pulled image "polinux/stress" in 12.685405843s +6m59s Normal Created pod/stress-demo-7fdd89cc6b-xr5dl Created container stress +6m59s Normal Started pod/stress-demo-7fdd89cc6b-xr5dl Started container stress +74s Normal Descheduled pod/stress-demo-7fdd89cc6b-xr5dl Pod evicted from node "cn-beijing.10.0.2.53" by the reason "node is overutilized, cpu usage(56.13%)>threshold(50.00%)" +73s Normal Killing pod/stress-demo-7fdd89cc6b-xr5dl Stopping container stress +7m13s Normal SuccessfulCreate replicaset/stress-demo-7fdd89cc6b Created pod: stress-demo-7fdd89cc6b-xr5dl +``` diff --git a/versioned_docs/version-v1.3/user-manuals/load-aware-scheduling.md b/versioned_docs/version-v1.3/user-manuals/load-aware-scheduling.md new file mode 100644 index 000000000..ef59b32d1 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/load-aware-scheduling.md @@ -0,0 +1,324 @@ +# Load Aware Scheduling + +Load Aware Scheduling is an ability of koord-scheduler for balancing pods scheduling based on the real-time load of each node. + +## Introduction + +Load balancing is a common issue in resource scheduling. Under-utilized nodes bring much resource waste to the +cluster, while over-utilized nodes are likely to cause performance degradation. 
Neither of them is suitable for +efficient resource management. + +The native Kubernetes scheduler schedules pods based on the requests and the allocation of nodes, considering neither +the real-time load nor the estimated usage. When we want to balance the pod scheduling on each node and make the loads +even with the native scheduler, we need to set precise resource requirements for the applications. Moreover, since +Koordinator enables resource overcommitment to achieve better resource efficiency, we need a mechanism to reduce the +probability of performance degradation and avoid over-utilization. + +Koord-scheduler can retrieve node metrics by cooperating with the koordlet. It provides the ability to balance the +scheduling of both the online (LSE/LSR/LS) pods and offline (BE) pods based on node utilization. + +![image](/img/load-aware-scheduling-arch.svg) + +For more information, please see [Design: Load Aware Scheduling](/docs/designs/load-aware-scheduling). + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.4 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Global Configuration via plugin args + +Load-aware scheduling is *Enabled* by default. You can use it without any modification on the koord-scheduler config. + +For users who need deep insight, please configure the rules of load-aware scheduling by modifying the ConfigMap +`koord-scheduler-config` in the helm chart. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + ... +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + profiles: + - schedulerName: koord-scheduler + plugins: + # enable the LoadAwareScheduling plugin + filter: + enabled: + - name: LoadAwareScheduling + ... + score: + enabled: + - name: LoadAwareScheduling + weight: 1 + ... + reserve: + enabled: + - name: LoadAwareScheduling + ... + pluginConfig: + # configure the thresholds and weights for the plugin + - name: LoadAwareScheduling + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: LoadAwareSchedulingArgs + # whether to filter nodes where koordlet fails to update NodeMetric + filterExpiredNodeMetrics: true + # the expiration threshold seconds when using NodeMetric + nodeMetricExpirationSeconds: 300 + # weights of resources + resourceWeights: + cpu: 1 + memory: 1 + # thresholds (%) of resource utilization + usageThresholds: + cpu: 75 + memory: 85 + # thresholds (%) of resource utilization of Prod Pods + prodUsageThresholds: + cpu: 55 + memory: 65 + # enable score according Prod usage + scoreAccordingProdUsage: true + # the factor (%) for estimating resource usage + estimatedScalingFactors: + cpu: 80 + memory: 70 + # enable resource utilization filtering and scoring based on percentile statistics + aggregated: + usageThresholds: + cpu: 65 + memory: 75 + usageAggregationType: "p99" + scoreAggregationType: "p99" +``` + +The koord-scheduler takes this ConfigMap as [scheduler Configuration](https://kubernetes.io/docs/reference/scheduling/config/). +New configurations will take effect after the koord-scheduler restarts. + +| Field | Description | Version | +|-------|-------------| --------| +| filterExpiredNodeMetrics | filterExpiredNodeMetrics indicates whether to filter nodes where koordlet fails to update NodeMetric. Enabled by default but in Helm chart, it's disabled. 
| >= v0.4.0 | +| nodeMetricExpirationSeconds | nodeMetricExpirationSeconds indicates the NodeMetric expiration in seconds. When NodeMetrics expired, the node is considered abnormal. Default is 180 seconds.| >= v0.4.0 | +| resourceWeights | resourceWeights indicates the weights of resources. The weights of CPU and Memory are both 1 by default.| >= v0.4.0 | +| usageThresholds | usageThresholds indicates the resource utilization threshold of the whole machine. The default for CPU is 65%, and the default for memory is 95%.| >= v0.4.0 | +| estimatedScalingFactors | estimatedScalingFactors indicates the factor when estimating resource usage. The default value of CPU is 85%, and the default value of Memory is 70%. | >= v0.4.0 | +| prodUsageThresholds| prodUsageThresholds indicates the resource utilization threshold of Prod Pods compared to the whole machine. Not enabled by default. | >= v1.1.0 | +| scoreAccordingProdUsage | scoreAccordingProdUsage controls whether to score according to the utilization of Prod Pod. | >= v1.1.0 | +| aggregated | aggregated supports resource utilization filtering and scoring based on percentile statistics. | >= v1.1.0 | + +The fields of Aggregated: + +| Field | Description | Version | +|-------|-------------| --------| +| usageThresholds | usageThresholds indicates the resource utilization threshold of the machine based on percentile statistics. | >= v1.1.0| +| usageAggregationType | usageAggregationType indicates the percentile type of the machine's utilization when filtering. Currently supports `avg`, `p50`, `p90`, `p95` and `p99`. | >= v1.1.0 | +| usageAggregatedDuration | usageAggregatedDuration indicates the statistical period of the percentile of the machine's utilization when filtering. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default. | >= v1.1.0| +| scoreAggregationType | scoreAggregationType indicates the percentile type of the machine's utilization when scoring. Currently supports `avg`, `p50`, `p90`, `p95` and `p99`. | >= v1.1.0 +| scoreAggregatedDuration | scoreAggregatedDuration indicates the statistical period of the percentile of Prod Pod's utilization when scoring. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default. | >= v1.1.0 | + +### Configure filter thresholds by Node + +The configuration through the plugin can be used as the default global configuration of the cluster, and users can also set the load thresholds of the node dimension by appending annotation to the node. When the annotation exists on the node, it will be filtered according to the parameters specified by the annotation. + +The annotation is defined as follows: + +```go +const ( + AnnotationCustomUsageThresholds = "scheduling.koordinator.sh/usage-thresholds" +) + +// CustomUsageThresholds supports user-defined node resource utilization thresholds. +type CustomUsageThresholds struct { + // UsageThresholds indicates the resource utilization threshold of the whole machine. 
+ UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + // ProdUsageThresholds indicates the resource utilization threshold of Prod Pods compared to the whole machine + ProdUsageThresholds map[corev1.ResourceName]int64 `json:"prodUsageThresholds,omitempty"` + // AggregatedUsage supports resource utilization filtering and scoring based on percentile statistics + AggregatedUsage *CustomAggregatedUsage `json:"aggregatedUsage,omitempty"` +} + +type CustomAggregatedUsage struct { + // UsageThresholds indicates the resource utilization threshold of the machine based on percentile statistics + UsageThresholds map[corev1.ResourceName]int64 `json:"usageThresholds,omitempty"` + // UsageAggregationType indicates the percentile type of the machine's utilization when filtering + UsageAggregationType slov1alpha1.AggregationType `json:"usageAggregationType,omitempty"` + // UsageAggregatedDuration indicates the statistical period of the percentile of the machine's utilization when filtering + UsageAggregatedDuration *metav1.Duration `json:"usageAggregatedDuration,omitempty"` +} +``` + +## Use Load Aware Scheduling + +### Load-aware scheduling by the whole machine load + +The example cluster in this article has three 4-core 16GiB nodes. + +1. Deploy a `stress` pod with the YAML file below. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: stress-demo + namespace: default + labels: + app: stress-demo +spec: + replicas: 1 + selector: + matchLabels: + app: stress-demo + template: + metadata: + name: stress-demo + labels: + app: stress-demo + spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - '1600M' + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '2' + memory: 4Gi + requests: + cpu: '2' + memory: 4Gi + restartPolicy: Always + schedulerName: koord-scheduler # use the koord-scheduler +``` + +```bash +$ kubectl create -f stress-demo.yaml +deployment.apps/stress-demo created +``` + +2. Watch the pod status util it becomes running. + +```bash +$ kubectl get pod -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +stress-demo-7fdd89cc6b-gcnzn 1/1 Running 0 82s 10.0.3.114 cn-beijing.10.0.3.112 +``` + +The pod `stress-demo-7fdd89cc6b-gcnzn` is scheduled on `cn-beijing.10.0.3.112`. + +3. Check the load of each node. + +```bash +$ kubectl top node +NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% +cn-beijing.10.0.3.110 92m 2% 1158Mi 9% +cn-beijing.10.0.3.111 77m 1% 1162Mi 9% +cn-beijing.10.0.3.112 2105m 53% 3594Mi 28% +``` + +In above order, `cn-beijing.10.0.3.112` has the highest load, while `cn-beijing.10.0.3.111` has the lowest load. + +4. Deploy an `nginx` deployment with the YAML file below. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-with-loadaware + labels: + app: nginx +spec: + replicas: 6 + selector: + matchLabels: + app: nginx + template: + metadata: + name: nginx + labels: + app: nginx + spec: + schedulerName: koord-scheduler # use the koord-scheduler + containers: + - name: nginx + image: nginx + resources: + limits: + cpu: 500m + requests: + cpu: 500m +``` + +```bash +$ kubectl create -f nginx-with-loadaware.yaml +deployment/nginx-with-loadawre created +``` + +5. Check the scheduling results of `nginx` pods. 
+
+```bash
+$ kubectl get pods -o wide | grep nginx
+nginx-with-loadaware-5646666d56-224jp   1/1     Running   0     18s   10.0.3.118   cn-beijing.10.0.3.110
+nginx-with-loadaware-5646666d56-7glt9   1/1     Running   0     18s   10.0.3.115   cn-beijing.10.0.3.110
+nginx-with-loadaware-5646666d56-kcdvr   1/1     Running   0     18s   10.0.3.119   cn-beijing.10.0.3.110
+nginx-with-loadaware-5646666d56-qzw4j   1/1     Running   0     18s   10.0.3.113   cn-beijing.10.0.3.111
+nginx-with-loadaware-5646666d56-sbgv9   1/1     Running   0     18s   10.0.3.120   cn-beijing.10.0.3.111
+nginx-with-loadaware-5646666d56-z79dn   1/1     Running   0     18s   10.0.3.116   cn-beijing.10.0.3.111
+```
+
+Now we can see that the `nginx` pods are scheduled on nodes other than `cn-beijing.10.0.3.112` (the node with the highest load).
+
+
+### Load-aware scheduling by the Prod Pods
+
+If there are many BestEffort Pods scheduled on one Node, the latency-sensitive Pods may fail to be scheduled because the load of the node has reached its usage limit. In Koordinator v1.1.0, load-aware scheduling is optimized for this scenario. For latency-sensitive (LSE/LSR/LS) Pods, priority is given to scheduling to the nodes with a low total utilization of Prod Pods. BestEffort (BE) Pods are scheduled according to the utilization level of the whole node.
+
+Enable the relevant optimizations by setting the following parameters:
+
+| Field | Description | Version |
+|-------|-------------| --------|
+| prodUsageThresholds| prodUsageThresholds indicates the resource utilization threshold of Prod Pods compared to the whole machine. Not enabled by default. | >= v1.1.0 |
+| scoreAccordingProdUsage | scoreAccordingProdUsage controls whether to score according to the utilization of Prod Pods. | >= v1.1.0 |
+
+### Load-aware scheduling based on percentile statistics
+
+In Koordinator v1.0 and previous versions, load-aware scheduling filters and scores nodes according to the average utilization data reported by koordlet. But the average value hides a lot of information, so in Koordinator v1.1, koordlet adds utilization aggregation data based on percentile statistics, and corresponding adaptations have been made on the scheduler side.
+
+Enable the relevant optimizations by setting the following parameters:
+
+| Field | Description | Version |
+|-------|-------------| --------|
+| aggregated | aggregated supports resource utilization filtering and scoring based on percentile statistics. | >= v1.1.0 |
+
+The fields of Aggregated:
+
+| Field | Description | Version |
+|-------|-------------| --------|
+| usageThresholds | usageThresholds indicates the resource utilization threshold of the machine based on percentile statistics. | >= v1.1.0 |
+| usageAggregationType | usageAggregationType indicates the percentile type of the machine's utilization when filtering. Currently supports `avg`, `p50`, `p90`, `p95` and `p99`. | >= v1.1.0 |
+| usageAggregatedDuration | usageAggregatedDuration indicates the statistical period of the percentile of the machine's utilization when filtering. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default. | >= v1.1.0 |
+| scoreAggregationType | scoreAggregationType indicates the percentile type of the machine's utilization when scoring. Currently supports `avg`, `p50`, `p90`, `p95` and `p99`. | >= v1.1.0 |
+| scoreAggregatedDuration | scoreAggregatedDuration indicates the statistical period of the percentile of Prod Pod's utilization when scoring. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default. | >= v1.1.0 |
+
+The `aggregated` and the `usageThresholds` parameters are mutually exclusive. When both are configured, `aggregated` takes effect.
+In addition, Pod type awareness is not currently supported.
\ No newline at end of file
diff --git a/versioned_docs/version-v1.3/user-manuals/memory-evict.md b/versioned_docs/version-v1.3/user-manuals/memory-evict.md
new file mode 100644
index 000000000..d2930dc03
--- /dev/null
+++ b/versioned_docs/version-v1.3/user-manuals/memory-evict.md
@@ -0,0 +1,130 @@
+# Eviction Strategy based on Memory Usage
+
+## Introduction
+
+Koordinator supports the dynamic overcommitment of idle node resources to low-priority
+Pods such as Batch-priority Pods. In co-location scenarios, the actual memory usage of
+nodes is constantly changing. For incompressible resources such as memory, high resource
+usage of a node may cause OOM, which can result in high-priority Pods getting killed. Koordinator
+provides an eviction strategy based on node memory usage. `Koordlet` continuously
+detects the memory usage of the node (Total - Available) at second-level granularity.
+When the memory usage of the node is high, it evicts low-priority BE Pods to
+ensure the QoS of high-priority Pods until the memory usage of the node falls below the
+threshold (evictThreshold). During the eviction process, Pods with lower priority (Pod.Spec.Priority)
+are selected first, and if the priority is the same, Pods which consume more memory are
+evicted first.
+
+
+![image](/img/memory-evict.svg)
+
+### Prerequisite
+Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to
+[Installation](/docs/installation).
+
+| Component | Version Requirement |
+| --- | ------- |
+| Kubernetes | ≥v1.18 |
+| koordinator | ≥v0.3.0 |
+
+The eviction strategy is provided by `Koordlet`, and it is disabled by default in the feature gates.
+Please make sure `BEMemoryEvict=true` has been added to the `-feature-gates` argument of `Koordlet`,
+as in the [example](https://github.com/koordinator-sh/charts/blob/main/versions/v1.2.0/templates/koordlet.yaml#L36).
+
+## Use Eviction Strategy based on Memory Usage
+
+1. Create a configmap.yaml file based on the following ConfigMap content:
+    ```yaml
+    # ConfigMap slo-controller-config example.
+    apiVersion: v1
+    kind: ConfigMap
+    metadata:
+      name: slo-controller-config # name should be set as the configuration of koord-manager, e.g. ack-slo-config
+      namespace: koordinator-system # namespace should be set as the configuration of installation, e.g. kube-system
+    data:
+      # enable the eviction strategy based on memory usage
+      resource-threshold-config: |
+        {
+          "clusterStrategy": {
+            "enable": true,
+            "memoryEvictThresholdPercent": 70
+          }
+        }
+    ```
+
+    | Configuration item | Parameter | Valid values | Description |
+    | :-------------- | :------ | :-------- | :----------------------------------------------------------- |
+    | `enable` | Boolean | true; false | true: enable the eviction; false (default): disable the eviction. |
+    | `memoryEvictThresholdPercent` | Int | 0~100 | eviction threshold percent of node memory usage, default is 70. |
+
+2. Check whether a ConfigMap named `slo-controller-config` exists in the `koordinator-system` namespace.
+
+    - If a ConfigMap named `slo-controller-config` exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
+
+      ```bash
+      kubectl patch cm -n koordinator-system slo-controller-config --patch "$(cat configmap.yaml)"
+      ```
+
+    - If no ConfigMap named `slo-controller-config` exists, run the kubectl apply command to create the ConfigMap:
+
+      ```bash
+      kubectl apply -f configmap.yaml
+      ```
+
+3. Create a file named be-pod-demo.yaml based on the following YAML content:
+
+    ```yaml
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: be-pod-demo
+      labels:
+        koordinator.sh/qosClass: 'BE' # set Pod QoS as BE
+    spec:
+      containers:
+        - args:
+            - '-c'
+            - '1'
+            - '--vm'
+            - '1'
+          command:
+            - stress
+          image: polinux/stress
+          imagePullPolicy: Always
+          name: stress
+      restartPolicy: Always
+      schedulerName: default-scheduler
+    ```
+
+4. Run the following command to deploy the be-pod-demo pod in the cluster:
+
+    ```bash
+    kubectl apply -f be-pod-demo.yaml
+    ```
+
+5. Run the following command to check that the be-pod-demo pod is in the Running state:
+
+    ```bash
+    $ kubectl get pod be-pod-demo
+    NAME          READY   STATUS    RESTARTS   AGE
+    be-pod-demo   1/1     Running   0          7s
+    ```
+6. Run the following command with the [stress tool](https://linux.die.net/man/1/stress)
+to make sure the memory usage of the node rises above the configured threshold. The argument `--vm-bytes`
+means the process will consume 10GB of memory; adjust it according to the node capacity.
+
+    ```bash
+    $ stress --cpu 1 --vm 1 --vm-bytes 10G --vm-keep
+    ```
+
+7. Check the running state of be-pod-demo. You can find that the be-pod-demo pod no longer exists,
+and the eviction information can be found in the events.
+
+    ```bash
+    $ kubectl get pod be-pod-demo
+    Error from server (NotFound): pods "be-pod-demo" not found
+
+    $ kubectl get event
+    LAST SEEN   TYPE      REASON            OBJECT                MESSAGE
+    46s         Normal    Killing           pod/be-pod-demo       Stopping container stress
+    48s         Warning   evictPodSuccess   ${your-pod-object}    evict Pod:be-pod-demo, reason: EvictPodByNodeMemoryUsage, message: killAndEvictBEPods for node(${your-node-id}), need to release memory: 8077889699
+    ```
\ No newline at end of file
diff --git a/versioned_docs/version-v1.3/user-manuals/memory-qos.md b/versioned_docs/version-v1.3/user-manuals/memory-qos.md
new file mode 100644
index 000000000..66f5e60f9
--- /dev/null
+++ b/versioned_docs/version-v1.3/user-manuals/memory-qos.md
@@ -0,0 +1,355 @@
+# Memory QoS
+
+## Introduction
+
+The Koordlet provides the *Memory Quality of Service* (QoS) feature for containers. You can use this feature to
+optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This
+topic describes how to enable the memory QoS feature for containers.
+
+### Background
+
+The following memory limits apply to containers:
+
+- The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about
+  to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result,
+  the application in the container may not be able to request or release memory resources normally.
+- The memory limit of the node. If the memory limit of a container is greater than the memory request of the container,
+  the container can overcommit memory resources. In this case, the available memory on the node may become insufficient.
+  This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is
+  degraded. In extreme cases, the node cannot run normally. A memory-overcommitted container of this kind is sketched right after this list.
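+
+To make the node-level case concrete, the sketch below (not one of the workloads used later in this document; the name and sizes are illustrative) shows a container whose memory limit is larger than its memory request, i.e. a memory-overcommitted container:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: overcommit-demo        # illustrative name
+spec:
+  containers:
+    - name: app
+      image: nginx
+      resources:
+        requests:
+          memory: "1Gi"        # what the scheduler accounts for on the node
+        limits:
+          memory: "4Gi"        # what the container is allowed to grow to
+```
+
+The 3Gi gap between the request and the limit is exactly the kind of usage that the kernel may need to reclaim when the node runs short of memory.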
+ +To improve the performance of applications and the stability of nodes, Koordinator provides the memory QoS feature for +containers. We recommend that you use Anolis OS as the node OS. For other OS, we will try our best to adapt, and users +can still enable it without side effects. After you enable the memory QoS feature for a container, Koordlet +automatically configures the memory control group (memcg) based on the configuration of the container. This helps you +optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node. + +Memory QoS provides the following optimizations to improve the memory utilization of pods: + +- When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclaim for a specific amount of memory. This prevents the reclaim of all the memory that the pod uses and therefore minimizes the adverse impact on the application performance caused by direct memory reclaim. +- Memory reclaim is performed in a fairer manner among pods. When the available memory on a node becomes insufficient, memory reclaim is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory. +- If the BestEffort pods on a node use more memory than their memory requests, the system prioritizes the memory requirements of Guaranteed pods and Burstable pods over the memory requirements of BestEffort pods. + +![image](/img/memory-qos.png) + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.3 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to +[Installation](/docs/installation). + +### Configurations + +Koordlet has already enabled Memory QoS feature (`-feature-gates=AllAlpha=true`). +If not, please enable it manually by updating the feature gate in the koordlet daemonset. + +> NOTE: Memory QoS is controlled by the `CgroupReconcile` feature-gate. + +```yaml +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: koordlet +spec: + selector: + matchLabels: + koord-app: koordlet + template: + metadata: + labels: + koord-app: koordlet + spec: + containers: + - command: + - /koordlet + args: + - -CgroupRootDir=/host-cgroup/ + - -feature-gates=XXXX,CgroupReconcile=true # enable CPU Burst feature + ... +``` + +## Use Memory QoS + +When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified +ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps. + +### Use an annotation to enable Memory QoS for the pod + +Add the following annotations to enable memory QoS for the containers in a pod: + +```yaml +annotations: + # To enable memory QoS for the containers in a pod, set the value to auto. + koordinator.sh/memoryQOS: '{"policy": "auto"}' + # To disable memory QoS for the containers in a pod, set the value to none. 
+ #koordinator.sh/memoryQOS: '{"policy": "none"}' +``` + +### Use a ConfigMap to enable memory QoS for all the containers in a cluster + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + resource-qos-config: |- + { + "clusterStrategy": { + "lsClass": { + "memoryQOS": { + "enable": true + } + }, + "beClass": { + "memoryQOS": { + "enable": true + } + } + } + } +``` + +### (Optional) Advanced Settings + +The following table describes the advanced parameters that you can use to configure fine-grained memory QoS +configurations at the pod level and cluster level. + +| Parameter | Data type | Valid value | Description | +| ------------------- | ----------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| enable | Boolean |
  • true
  • false
|
  • true: enables memory QoS for all the containers in a cluster. The default memory QoS settings for the QoS class of the containers are used.
  • false: disables memory QoS for all the containers in a cluster. The memory QoS settings are restored to the original settings for the QoS class of the containers.
| +| policy | String |
  • auto
  • default
  • none
|
  • auto: enables memory QoS for the containers in the pod and uses the recommended memory QoS settings. The recommended memory QoS settings are prioritized over the cluster-wide memory QoS settings.
  • default: specifies that the pod inherits the cluster-wide memory QoS settings.
  • none: disables memory QoS for the pod. The relevant memory QoS settings are restored to the original settings. The original settings are prioritized over the cluster-wide memory QoS settings.
| +| minLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: `Value of memory.min = Memory request × Value of minLimitPercent/100`. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify Memory `Request=100MiB` and `minLimitPercent=100` for a container, `the value of memory.min is 104857600`. | +| lowLimitPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: `Value of memory.low = Memory request × Value of lowLimitPercent/100`. For example, if you specify `Memory Request=100MiB` and `lowLimitPercent=100` for a container, `the value of memory.low is 104857600`. | +| throttlingPercent | Int | 0~100 | Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the memory throttling threshold for the ratio of the memory usage of a container to the memory limit of the container. The memory throttling threshold for memory usage is calculated based on the following formula: `Value of memory.high = Memory limit × Value of throttlingPercent/100`. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container will be reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to cgroups from triggering OOM. For example, if you specify `Memory Limit=100MiB` and `throttlingPercent=80` for a container, `the value of memory.high is 83886080`, which is equal to 80 MiB. | +| wmarkRatio | Int | 0~100 | Unit: %. Default value:`95`. A value of `0` indicates that this parameter is disabled. This parameter specifies the threshold of the usage of the memory limit or the value of `memory.high` that triggers asynchronous memory reclaim. If `throttlingPercent` is disabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Memory limit × wmarkRatio/100`. If `throttlingPercent` is enabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Value of memory.high × wmarkRatio/100`. If the usage of the memory limit or the value of memory.high exceeds the threshold, the memcg backend asynchronous reclaim feature is triggered. For example, if you specify `Memory Limit=100MiB`for a container, the memory throttling setting is`memory.high=83886080`, the reclaim ratio setting is `memory.wmark_ratio=95`, and the reclaim threshold setting is `memory.wmark_high=79691776`. | +| wmarkMinAdj | Int | -25~50 | Unit: %. The default value is `-25` for the `LS`/ `LSR` QoS class and `50` for the `BE` QoS class. A value of 0 indicates that this parameter is disabled. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. 
A positive value increases the global minimum watermark and therefore antedates memory reclaim for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is `memory.wmark_min_adj=-25`, which indicates that the minimum watermark is decreased by 25% for the containers in the pod. | + +### Example + +0. The testing environment is shown below: + +- Kubernetes: 1.20 +- Nodes: + - Stress Node: an ECS instance (8 vCPU, 32GB RAM) for performing stress tests. + - Tested Node: an ECS instance (8 vCPU, 32GB RAM) runs the workload and serves. + +1. Create a file named redis-demo.yaml with the following YAML template: + +```yaml +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: redis-demo-config +data: + redis-config: | + appendonly yes + appendfsync no +--- +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS +spec: + containers: + - name: redis + image: redis:5.0.4 + command: + - redis-server + - "/redis-master/redis.conf" + env: + - name: MASTER + value: "true" + ports: + - containerPort: 6379 + resources: + limits: + cpu: "2" + memory: "6Gi" + requests: + cpu: "2" + memory: "2Gi" + volumeMounts: + - mountPath: /redis-master-data + name: data + - mountPath: /redis-master + name: config + volumes: + - name: data + emptyDir: {} + - name: config + configMap: + name: redis-demo-config + items: + - key: redis-config + path: redis.conf + nodeName: # Set nodeName to the name of the tested node +--- +apiVersion: v1 +kind: Service +metadata: + name: redis-demo +spec: + ports: + - name: redis-port + port: 6379 + protocol: TCP + targetPort: 6379 + selector: + name: redis-demo + type: ClusterIP +``` + +2. Run the following command to deploy Redis Server as the test application. + +You can access the redis-demo Service from within the cluster. + +```bash +kubectl apply -f redis-demo.yaml +``` + +3. Simulate the scenario of memory overcommitment. + +Use the Stress tool to increase the load on memory and trigger memory reclaim. The sum of the memory limits of all pods +on the node exceeds the physical memory of the node. + + a. Create a file named stress-demo.yaml with the following YAML template: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS + koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE +spec: + containers: + - args: + - '--vm' + - '2' + - '--vm-bytes' + - 11G + - '-c' + - '2' + - '--vm-hang' + - '2' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + restartPolicy: Always + nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed +``` + + b. Run the following command to deploy stress-demo: + +```bash +kubectl apply -f stress-demo.yaml +``` + +4. Run the following command to query the global minimum watermark of the node: + +> Note In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, OOM +> killers may be triggered for all pods on the node even before memory reclaim is performed. Therefore, we recommend +> that you set the global minimum watermark to a high value. 
In this example, the global minimum watermark is set +> to 4,000,000 KB for the tested node that has 32 GiB of memory. + +```bash +cat /proc/sys/vm/min_free_kbytes +``` + +Expected output: + +```bash +4000000 +``` + +5. Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + name: memtier-demo + name: memtier-demo +spec: + containers: + - command: + - memtier_benchmark + - '-s' + - 'redis-demo' + - '--data-size' + - '200000' + - "--ratio" + - "1:4" + image: 'redislabs/memtier_benchmark:1.3.0' + name: memtier + restartPolicy: Never + nodeName: # Set nodeName to the name of the stress node that is used to send requests. +``` + +6. Run the following command to query the test results from memtier-benchmark: + +```bash +kubectl logs -f memtier-demo +``` + +7. Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests +again and compare the results. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: redis-demo + labels: + name: redis-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'LS' +spec: + ... + +--- +apiVersion: v1 +kind: Pod +metadata: + name: stress-demo + labels: + name: stress-demo + annotations: + koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. + koordinator.sh/qosClass: 'BE' +``` + +8. Check the results of Memory QoS enabled and disabled. + +- Disabled: Set the memory QoS policy of the pod to `none`. +- Enabled: Set the memory QoS policy of the pod to `auto` (the recommended parameters of memory QoS are used). + +| Metric | Disabled | Enabled | +| ----------------- | ------------- | ------------- | +| Latency-avg | 51.32 ms | 47.25 ms | +| Throughput-avg | 149.0 MB/s | 161.9 MB/s | + +The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased +by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of +applications in memory overcommitment scenarios. diff --git a/versioned_docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md b/versioned_docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md new file mode 100644 index 000000000..ad483fa8a --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/multi-hierarchy-elastic-quota-management.md @@ -0,0 +1,629 @@ +# Multi Hierarchy Elastic Quota Management + +Multi Hierarchy ElasticQuota Management is an ability of koord-scheduler to manage different user's resource usage in a shared-cluster. + +## Introduction +When several users or teams share a cluster, fairness of resource allocation is very important. the Koordinator provides +multi-hierarchy elastic quota management mechanism for the scheduler. +- It supports configuring quota groups in a tree structure, which is similar to the organizational structure of most companies. +- It supports the borrowing / returning of resources between different quota groups, for better resource utilization efficiency. +The busy quota groups can automatically temporarily borrow the resources from the idle quota groups, which can improve the +utilization of the cluster. At the same time, when the idle quota group turn into the busy quota group, it can also automatically +take back the "lent-to" resources. +- It considers the resource fairness between different quota groups. 
When the busy quota groups borrow the resources from the idle quota groups, the resources can be allocated to the busy quota groups under some fair rules.
+
+## Setup
+
+### Prerequisite
+
+- Kubernetes >= 1.18
+- Koordinator >= 0.7.1
+
+### Installation
+
+Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation).
+
+### Configurations
+
+Multi-Hierarchy-ElasticQuota-Management is *Enabled* by default. You can use it without any modification on the koord-scheduler config.
+
+## Use Multi-Hierarchy-ElasticQuota-Management
+
+### Quick Start by Label
+
+1. Create an ElasticQuota `quota-example` with the YAML file below.
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-example
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/parent: ""
+    quota.scheduling.koordinator.sh/is-parent: "false"
+spec:
+  max:
+    cpu: 40
+    memory: 40Gi
+  min:
+    cpu: 10
+    memory: 20Mi
+```
+
+```bash
+$ kubectl apply -f quota-example.yaml
+  elasticquota.scheduling.sigs.k8s.io/quota-example created
+
+$ kubectl get eqs -n default
+  NAME            AGE
+  quota-example   2s
+```
+
+2. Create a pod `pod-example` with the YAML file below.
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pod-example
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/name: "quota-example"
+spec:
+  schedulerName: koord-scheduler
+  containers:
+  - command:
+    - sleep
+    - 365d
+    image: busybox
+    imagePullPolicy: IfNotPresent
+    name: curlimage
+    resources:
+      limits:
+        cpu: 40m
+        memory: 40Mi
+      requests:
+        cpu: 40m
+        memory: 40Mi
+    terminationMessagePath: /dev/termination-log
+    terminationMessagePolicy: File
+  restartPolicy: Always
+```
+
+```bash
+$ kubectl apply -f pod-example.yaml
+  pod/pod-example created
+```
+
+3. Verify that `quota-example` has changed.
+```bash
+$ kubectl get eqs -n default quota-example -o yaml
+```
+```yaml
+kind: ElasticQuota
+metadata:
+  annotations:
+    quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}'
+    quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}'
+    quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}'
+  creationTimestamp: "2022-10-08T09:26:38Z"
+  generation: 2
+  labels:
+    quota.scheduling.koordinator.sh/is-parent: "false"
+    quota.scheduling.koordinator.sh/parent: root
+    manager: koord-scheduler
+    operation: Update
+    time: "2022-10-08T09:26:50Z"
+  name: quota-example
+  namespace: default
+  resourceVersion: "39012008"
+spec:
+  max:
+    cpu: "40"
+    memory: 40Gi
+  min:
+    cpu: "10"
+    memory: 20Mi
+status:
+  used:
+    cpu: 40m
+    memory: 40Mi
+```
+
+### Quick Start by Namespace
+1. Create a namespace.
+```bash
+$ kubectl create ns quota-example
+  namespace/quota-example created
+```
+
+2. Create an ElasticQuota `quota-example` with the YAML file below.
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-example
+  namespace: quota-example
+  labels:
+    quota.scheduling.koordinator.sh/parent: ""
+    quota.scheduling.koordinator.sh/is-parent: "false"
+spec:
+  max:
+    cpu: 40
+    memory: 40Gi
+  min:
+    cpu: 10
+    memory: 20Mi
+```
+
+```bash
+$ kubectl apply -f quota-example.yaml
+  elasticquota.scheduling.sigs.k8s.io/quota-example created
+
+$ kubectl get eqs -n quota-example
+  NAME            AGE
+  quota-example   2s
+```
+
+2. Create a pod `pod-example` with the YAML file below.
+```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: quota-example +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl apply -f pod-example.yaml + pod/pod-example created +``` + +3.Verify `quota-example` has changed. +```bash +$ kubectl get eqs -n quota-example quota-example -o yaml +``` +```yaml +kind: ElasticQuota +metadata: + annotations: + quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}' + quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}' + creationTimestamp: "2022-10-08T09:26:38Z" + generation: 2 + labels: + quota.scheduling.koordinator.sh/is-parent: "false" + quota.scheduling.koordinator.sh/parent: root + manager: koord-scheduler + operation: Update + time: "2022-10-08T09:26:50Z" + name: quota-example + namespace: quota-example + resourceVersion: "39012008" +spec: + max: + cpu: "40" + memory: 40Gi + min: + cpu: "10" + memory: 20Mi +status: + used: + cpu: 40m + memory: 40Mi +``` + +### Quota Debug Api. +```bash +$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}' + 10.244.0.64 + +$ curl 10.244.0.64:10251/apis/v1/plugins/ElasticQuota/quota/quota-example +``` + +```json +{ + "allowLentResource": true, + "autoScaleMin": { + "cpu": "10", + "memory": "20Mi", + }, + "isParent": false, + "max": { + "cpu": "40", + "memory": "40Gi", + }, + "min": { + "cpu": "10", + "memory": "20Mi", + }, + "name": "quota-example", + "parentName": "root", + "podCache": { + "pod-example": { + "isAssigned": true, + "resource": { + "cpu": "40m", + "memory": "40Mi" + } + } + }, + "request": { + "cpu": "40m", + "memory": "40Mi" + }, + "runtime": { + "cpu": "40m", + "memory": "41943040", + }, + "runtimeVersion": 39, + "sharedWeight": { + "cpu": "40", + "memory": "40Gi", + }, + "used": { + "cpu": "40m", + "memory": "40Mi" + } +} +``` +The main different with yaml is that we can find all quota's pods and its status in `podCache`. + +### Advanced Configurations +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "parent" + quota.scheduling.koordinator.sh/allow-lent-resource: true + quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}' +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +- `quota.scheduling.koordinator.sh/is-parent` is disposed by the user. It reflects the "child\parent" attribute of the quota group. Default is child. +- `quota.scheduling.koordinator.sh/parent` is disposed by the user. It reflects the parent quota name. Default is root. +- `quota.scheduling.koordinator.sh/shared-weight` is disposed by the user. It reflects the ability to share the "lent to" resource. Default equals to "max". +- `quota.scheduling.koordinator.sh/allow-lent-resource` is disposed by the user. It reflects whether quota group allows lent unused "min" to others. 
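+
+Putting the annotations above together, a minimal two-level hierarchy could be declared as in the following sketch (the quota names and resource numbers are illustrative and are not part of the examples above; `shared-weight` is left unset, so it defaults to the quota's `max`):
+
+```yaml
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-parent-demo          # illustrative parent quota group
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/is-parent: "true"
+spec:
+  max:
+    cpu: 40
+    memory: 40Gi
+  min:
+    cpu: 20
+    memory: 20Gi
+---
+apiVersion: scheduling.sigs.k8s.io/v1alpha1
+kind: ElasticQuota
+metadata:
+  name: quota-child-demo           # illustrative child quota group
+  namespace: default
+  labels:
+    quota.scheduling.koordinator.sh/is-parent: "false"
+    quota.scheduling.koordinator.sh/parent: "quota-parent-demo"
+    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
+spec:
+  max:
+    cpu: 40
+    memory: 40Gi
+  min:
+    cpu: 10                        # the sum of the children's min must not exceed the parent's min
+    memory: 10Gi
+```
+
+The webhook checks described in the next section reject configurations that violate these relationships.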
+ +### WebHook Verify +1.Except for the first level quota group, we require that the sum of "min" of all sub quota groups should be less than or +equal to the "min" of parent group. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 20 + memory: 20Mi +``` + +```bash +kubectl apply -f quota-example.yaml +Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkMinQuotaSum allChildren SumMinQuota > parentMinQuota, parent: quota-parent-example +``` + +2.Parent and child's min\max resource key must same. +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + test: 200 + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkSubAndParentGroupMaxQuotaKeySame failed: quota-parent-example's key is not the same with quota-example +``` + +3.Parent group cannot run pod. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create pod: +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-example + namespace: default + labels: + quota.scheduling.koordinator.sh/name: "quota-parent-example" +spec: + schedulerName: koord-scheduler + containers: + - command: + - sleep + - 365d + image: busybox + imagePullPolicy: IfNotPresent + name: curlimage + resources: + limits: + cpu: 40m + memory: 40Mi + requests: + cpu: 40m + memory: 40Mi + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always +``` + +```bash +$ kubectl apply -f pod-example_xb.yaml + Error from server: error when creating "pod-example.yaml": admission webhook "vpod.kb.io" denied the request: pod can not be linked to a parentQuotaGroup,quota:quota-parent-example, pod:pod-example +``` + +4.The parent of node can only be parent group, not child group. 
+ +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then create child quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: false + quota.scheduling.koordinator.sh/parent: "quota-parent-example" +spec: + max: + cpu: 40 + memory: 40Gi + test: 200 + min: + cpu: 10 + memory: 20Mi +``` + +```bash +$ kubectl apply -f quota-example.yaml + Error from server: error when creating "elastic-quota-example_xb.yaml": admission webhook "vquota.kb.io" denied the request: quota-example has parentName quota-parent-example but the parentQuotaInfo's IsParent is false +``` + +5.A quota group can't be converted on the attribute of parent group\child group. + +first create parent quota: +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: ElasticQuota +metadata: + name: quota-parent-example + namespace: default + labels: + quota.scheduling.koordinator.sh/is-parent: true +spec: + max: + cpu: 40 + memory: 40Gi + min: + cpu: 10 + memory: 20Mi +``` + +then modify `quota.scheduling.koordinator.sh/is-parent:false`: +```bash +$ kubectl apply -f quota-parent-example.yaml + elastic-quota-example_xb_parent.yaml": admission webhook "vquota.kb.io" denied the request: IsParent is forbidden modify now, quotaName:quota-parent-example +``` + +### used > runtime revoke +We offer a config to control if quota's used > runtime, we allow the scheduler to delete over-resource-used pod from +low priority to high priority. you should follow the below config of `koord-scheduler-config.yaml` in helm. + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: koord-scheduler-config + namespace: {{ .Values.installation.namespace }} +data: + koord-scheduler-config: | + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: true + resourceLock: leases + resourceName: koord-scheduler + resourceNamespace: {{ .Values.installation.namespace }} + profiles: + - pluginConfig: + - name: ElasticQuota + args: + apiVersion: kubescheduler.config.k8s.io/v1beta2 + kind: ElasticQuotaArgs + quotaGroupNamespace: {{ .Values.installation.namespace }} + enableCheckParentQuota: true + monitorAllQuotas: true + revokePodInterval: 60s + delayEvictTime: 300s + plugins: + queueSort: + disabled: + - name: "*" + enabled: + - name: Coscheduling + preFilter: + enabled: + - name: NodeNUMAResource + - name: DeviceShare + - name: Reservation + - name: Coscheduling + - name: ElasticQuota + filter: + ... +``` +- `enableCheckParentQuota` check parentQuotaGroups' used and runtime Quota. Default is false. +- `monitorAllQuotas` enable "used > runtime revoke" logic. Default is false. +- `revokePodInterval` check loop time interval. +- `delayEvictTime` when "used > runtime" continues over `delayEvictTime` will really trigger eviction. + +To let scheduler can really delete the pod successfully, you should config the `rbac/koord-scheduler.yaml` as below in helm. 
+ +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: koord-scheduler-role +rules: +{{- if semverCompare "<= 1.20-0" .Capabilities.KubeVersion.Version }} +- apiGroups: + - "" + resources: + - namespaces + verbs: + - get + - list + - watch +{{- end }} +- apiGroups: + - coordination.k8s.io + resources: + - leases + verbs: + - create + - get + - update +- apiGroups: + - "" + resources: + - pods + verbs: + - patch + - update + - delete +- apiGroups: + - "" + resources: + - pods/eviction + verbs: + - create +- apiGroups: + ... +``` diff --git a/versioned_docs/version-v1.3/user-manuals/performance-collector.md b/versioned_docs/version-v1.3/user-manuals/performance-collector.md new file mode 100644 index 000000000..4cc3a3971 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/performance-collector.md @@ -0,0 +1,184 @@ +# Performance Collector + +## Motivation + +In real production environment, the runtime state of a node is a "chaotic system", and application interference caused by resource contention cannot be absolutely avoided. Koordinator is building interference detection and optimization capabilities. By extracting metrics of application running status, real-time analysis and detection are carried out, and more targeted strategies are adopted for target applications and interference sources after interference is discovered. +Koordinator implements a series of `Performance Collectors` to collect low-level metrics highly correlated with application running status on one node, and expose them through `Prometheus` to provide support for interference detection capabilities and cluster application scheduling. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 + +- Koordinator >= 1.0 + +- To use CPI Collector, make sure your node machine supports Cycles and Instructions Kernel PMU(Performance Monitoring Unit) events. + + > Use belowing command to check. + + ```shell + $ perf list + List of pre-defined events (to be used in -e): + + branch-instructions OR branches [Hardware event] + branch-misses [Hardware event] + bus-cycles [Hardware event] + ... + + cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event] + ... + instructions OR cpu/instructions/ [Kernel PMU event] + ``` + +- To use PSI Collector, your Anolis OS needs to enable PSI feature. Please refer to this [document](https://www.alibabacloud.com/help/en/elastic-compute-service/latest/enable-the-psi-feature-for-cgroup-v1). + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](https://koordinator.sh/zh-Hans/docs/installation). + +### Feature-gates +Performance Collector is managed by several feature-gates. Koordinator currently supports following collectors: + +- `CPICollector`: manages CPI collector. CPI: Cycles Per Instruction. +- `PSICollector`:manages PSI collector. PSI: Pressure Stall Information. + +### Configuration + +Performance Collectors are _Disabled_ currently by default. To enable them, just edit Koordlet's `feature-gates` args. + +```shell +kubectl edit ds koordlet -n koordinator-system +``` + +```shell +spec: + ... + spec: + containers: + - args: + ... + # modify here + # - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true + - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true + ... 
+``` + +## Overhead + +Koordinator Performance Collector is an important tool for interference detection, and one of its core goals is to collect relevant indicators at low cost. The following shows the system overhead introduced by Koordinator before and after enabling Performance Collector. Users can refer to this test result to use the Performance Collector feature. + +### Testing Context + +- Alibaba Cloud Container Service for Kubernetes (ACK) Managed Kubernetes Cluster: + - Kubernetes version:1.24.6-aliyun.1 + - Container Runtime:containerd 1.5.13 + - Node Spec:ecs.ebmg6.26xlarge,104 vCPU 384 GiB, OS: Alibaba Cloud Linux 2.1903 +- Node pressure: + - Test Pod image:nginx:1.14.2 + - Number of Pods on single Node:100 test Pod + 50 system Pod + - Number of Containers on single Node:150 + - Node CPU usage: about 25%, use lookbusy-1.4 to generate on each CPU +- Others: + - 100 nginx Pods are managed by a Linux cronjob, which is deleted every five minutes. The Deployment controller rebuild these Pods in time. + - CPI Collector runs in a window of 10 seconds every 60 seconds. + - PSI Collector runs every 10 seconds. + - The test lasts for 1 hour before and after Performance Collector is enabled. + +### Conclusion + +#### Case 1:Overhead comparison of Koordlet container before and after enabling Performance Collector + +Performance Collector runs on the Koordlet component of Koordinator, and the cost of the component is compared as follows: + +- No significant increase in overall overhead: + + | Metrics | Disable | Enable | + | :--------------: | :------: | :-------------------: | + | RSS Memory usage | 341MiB | 366MiB | + | CPU usage | 0.5 core | 0.6 core | + | Network I/O | - | no significant change | +- Possible cause of the overhead: + - The new CPI data table of per Container dimension, and new PSI data table of both per Container and per Pod dimension. + - The consumption caused by the collector's goroutine per cgroup. + - The consumption caused by Prometheus Gauge. + +#### Case 2:Overhead comparison of Node before and after enabling Performance Collector + +Performance Collector uses the perf_event_open(2) system call, and its impact on the node is compared as follows: + +- No significant increase in overall overhead: + + | Metrics | Disable | Enable | + | :-------------------: | :-----: | :----: | + | System Mode CPU usage | 0.94% | 0.96% | + | User Mode CPU usage | 24.51% | 25.19% | + +- Possible cause of the overhead: + - Usage of perf_event_open(2) + - Enabling of PSI feature on OS + +## Example + +1. To enable Performance Collector: +```shell +helm install koordinator https://... --set featureGates="CPICollector=true,PSICollector=true" +``` + +2. Use belowing flags to config collectors' time window or collect intervals: + + | Flag | Default | Definition | + | :-----------------------------: | :-----: | :--------------------------------: | + | -cpi-collector-interval-seconds | 60 | Collect cpi interval by seconds | + | -collect-cpi-timewindow-seconds | 10 | Collect cpi time window by seconds | + | -psi-collector-interval-seconds | 10 | Collect psi interval by seconds | +3. 
We can see reported metric values at Prometheus port(9316 as default), the API path is `/metrics`, e.g., CPI is shown as two records of *cycles* and *instructions*: +```shell +$ curl http://localhost:9316/metrics + +# HELP koordlet_container_cpi Container cpi collected by koordlet +# TYPE koordlet_container_cpi gauge +koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09 +koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09 +``` + +4. Notice that we also provide ServiceMonitor for Koordlet to evict those metrics: + + ```yaml + apiVersion: v1 + kind: Service + metadata: + labels: + koord-app: koordlet + name: koordlet + namespace: koordinator-system + spec: + clusterIP: None + ports: + - name: koordlet-service + port: 9316 + targetPort: 9316 + selector: + koord-app: koordlet + --- + apiVersion: monitoring.coreos.com/v1 + kind: ServiceMonitor + metadata: + labels: + koord-app: koordlet + name: koordlet + namespace: koordinator-system + spec: + endpoints: + - interval: 30s + port: koordlet-service + scheme: http + jobLabel: koord-app + selector: + matchLabels: + koord-app: koordlet + ``` + + You can find it in Promethues Targets: + + ![koordlet-servicemonitor-prometheus](/img/koordlet-servicemonitor-prometheus.png) diff --git a/versioned_docs/version-v1.3/user-manuals/pod-migration-job.md b/versioned_docs/version-v1.3/user-manuals/pod-migration-job.md new file mode 100644 index 000000000..e1a708902 --- /dev/null +++ b/versioned_docs/version-v1.3/user-manuals/pod-migration-job.md @@ -0,0 +1,256 @@ +# PodMigrationJob + +Koordinator defines a CRD-based Pod migration API called `PodMigrationJob`, through which the descheduler or other automatic fault recovery components can evict or delete Pods more safely. + +## Introduction + +Migrating Pods is an important capability that many components (such as deschedulers) rely on, and can be used to optimize scheduling or help resolve workload runtime quality issues. We believe that pod migration is a complex process, involving steps such as auditing, resource allocation, and application startup, and is mixed with application upgrading, scaling scenarios, and resource operation and maintenance operations by cluster administrators. Therefore, how to manage the stability risk of this process to ensure that the application does not fail due to the migration of Pods is a very critical issue that must be resolved. + +Based on the final state-oriented migration capability of the PodMigrationJob CRD, we can track the status of each process during the migration process, perceive scenarios such as application upgrades and scaling to ensure the stability of the workload. + +## Setup + +### Prerequisite + +- Kubernetes >= 1.18 +- Koordinator >= 0.6 + +### Installation + +Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation). + +### Configurations + +PodMigrationJob is *Enabled* by default. You can use it without any modification on the koord-descheduler config. 
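+
+Before creating migration jobs, it can also help to confirm that the PodMigrationJob API is being served in your cluster. The commands below are only a sketch: the CRD name and the Deployment name/namespace are assumptions based on a default Helm installation and may differ in your environment.
+
+```bash
+# Check that the PodMigrationJob CRD is registered (the name is assumed from the API group and kind).
+kubectl get crd podmigrationjobs.scheduling.koordinator.sh
+
+# Check that the koord-descheduler, which reconciles PodMigrationJobs, is running (names assumed).
+kubectl get deployment koord-descheduler -n koordinator-system
+```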
+
+## Use PodMigrationJob
+
+### Quick Start
+
+1. Create a Deployment `pod-demo` with the YAML file below.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: pod-demo
+  namespace: default
+spec:
+  progressDeadlineSeconds: 600
+  replicas: 1
+  revisionHistoryLimit: 10
+  selector:
+    matchLabels:
+      app: pod-demo
+  strategy:
+    rollingUpdate:
+      maxSurge: 25%
+      maxUnavailable: 25%
+    type: RollingUpdate
+  template:
+    metadata:
+      creationTimestamp: null
+      labels:
+        app: pod-demo
+      name: stress
+    spec:
+      containers:
+        - args:
+            - -c
+            - "1"
+          command:
+            - stress
+          image: polinux/stress
+          imagePullPolicy: Always
+          name: stress
+          resources:
+            limits:
+              cpu: "2"
+              memory: 4Gi
+            requests:
+              cpu: 200m
+              memory: 400Mi
+      restartPolicy: Always
+      schedulerName: koord-scheduler
+```
+
+```bash
+$ kubectl create -f pod-demo.yaml
+deployment.apps/pod-demo created
+```
+
+2. Check the scheduled result of the pod created by the deployment `pod-demo`.
+
+```bash
+$ kubectl get pod -o wide
+NAME                        READY   STATUS    RESTARTS   AGE   IP          NODE
+pod-demo-5f9b977566-c7lvk   1/1     Running   0          41s   10.17.0.9   node-0
+```
+
+`pod-demo-5f9b977566-c7lvk` is scheduled on the node `node-0`.
+
+3. Create a `PodMigrationJob` with the YAML file below to migrate `pod-demo-5f9b977566-c7lvk`.
+
+```yaml
+apiVersion: scheduling.koordinator.sh/v1alpha1
+kind: PodMigrationJob
+metadata:
+  name: migrationjob-demo
+spec:
+  paused: false
+  ttl: 5m
+  mode: ReservationFirst
+  podRef:
+    namespace: default
+    name: pod-demo-5f9b977566-c7lvk
+status:
+  phase: Pending
+```
+
+```bash
+$ kubectl create -f migrationjob-demo.yaml
+podmigrationjob.scheduling.koordinator.sh/migrationjob-demo created
+```
+
+4. Query the migration status.
+
+```bash
+$ kubectl get podmigrationjob migrationjob-demo
+NAME                PHASE     STATUS     AGE   NODE     RESERVATION                            PODNAMESPACE   POD                         NEWPOD                      TTL
+migrationjob-demo   Succeed   Complete   37s   node-1   d56659ab-ba16-47a2-821d-22d6ba49258e   default        pod-demo-5f9b977566-c7lvk   pod-demo-5f9b977566-nxjdf   5m0s
+```
+
+From the above results, it can be observed that:
+- **PHASE** is `Succeed` and **STATUS** is `Complete`, indicating that the migration succeeded.
+- **NODE** `node-1` indicates the node where the new Pod is scheduled after the migration.
+- **RESERVATION** `d56659ab-ba16-47a2-821d-22d6ba49258e` is the Reservation created during the migration. The PodMigrationJob Controller tries to reserve resources through this Reservation before starting to evict the Pod, and initiates the eviction only after the reservation succeeds. This ensures that the new Pod has resources available once the old Pod is evicted.
+- **PODNAMESPACE** `default` represents the namespace where the migrated Pod is located.
+- **POD** `pod-demo-5f9b977566-c7lvk` represents the Pod to be migrated.
+- **NEWPOD** `pod-demo-5f9b977566-nxjdf` is the newly created Pod after the migration.
+- **TTL** indicates the TTL period of the current Job.
+
+5. Query the migration events.
+
+The PodMigrationJob Controller creates Events for important steps in the migration process to help users diagnose migration problems.
+
+```bash
+$ kubectl describe podmigrationjob migrationjob-demo
+...
+
+Events:
+  Type    Reason                Age    From               Message
+  ----    ------                ----   ----               -------
+  Normal  ReservationCreated    8m33s  koord-descheduler  Successfully create Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
+  Normal  ReservationScheduled  8m33s  koord-descheduler  Assigned Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" to node "node-1"
+  Normal  Evicting              8m33s  koord-descheduler  Try to evict Pod "default/pod-demo-5f9b977566-c7lvk"
+  Normal  EvictComplete         8m     koord-descheduler  Pod "default/pod-demo-5f9b977566-c7lvk" has been evicted
+  Normal  Complete              8m     koord-descheduler  Bind Pod "default/pod-demo-5f9b977566-nxjdf" in Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
+```
+
+### Advanced Configurations
+
+> The latest API can be found in [`pod_migration_job_types.go`](https://github.com/koordinator-sh/koordinator/blob/main/apis/scheduling/v1alpha1/pod_migration_job_types.go).
+
+### Example: Manually confirm whether the migration is allowed
+
+Eviction and migration operations bring risks to stability, so you may want to manually check and confirm that nothing is wrong before the migration is actually initiated.
+
+To do so, create the PodMigrationJob with `spec.paused` set to `true`, and set `spec.paused` to `false` only after manually confirming that execution is allowed.
+If you want to reject the migration, you can update `status.phase=Failed` to terminate the execution of the PodMigrationJob immediately, or just wait for the PodMigrationJob to expire automatically.
+
+```yaml
+apiVersion: scheduling.koordinator.sh/v1alpha1
+kind: PodMigrationJob
+metadata:
+  name: migrationjob-demo
+spec:
+  # paused indicates whether the PodMigrationJob should work or not.
+  paused: true
+  # ttl controls the PodMigrationJob timeout duration.
+  ttl: 5m
+  mode: ReservationFirst
+  podRef:
+    namespace: default
+    name: pod-demo-5f9b977566-c7lvk
+status:
+  phase: Pending
+```
+
+### Example: Just want to evict Pods, no need to reserve resources
+
+PodMigrationJob provides two migration modes:
+- `EvictDirectly`: evicts the Pod directly without reserving resources.
+- `ReservationFirst`: reserves resources first to ensure that resources can be allocated before initiating the eviction.
+
+If you just want to evict Pods, set `spec.mode` to `EvictDirectly`:
+
+```yaml
+apiVersion: scheduling.koordinator.sh/v1alpha1
+kind: PodMigrationJob
+metadata:
+  name: migrationjob-demo
+spec:
+  paused: false
+  ttl: 5m
+  mode: EvictDirectly
+  podRef:
+    namespace: default
+    name: pod-demo-5f9b977566-c7lvk
+status:
+  phase: Pending
+```
+
+### Example: Use reserved resources when migrating
+
+In some scenarios, resources are reserved first, and a PodMigrationJob is created only after the reservation succeeds.
+The arbitration mechanism provided by the PodMigrationJob Controller (to be implemented in v0.7) is reused to ensure workload stability.
+
+```yaml
+apiVersion: scheduling.koordinator.sh/v1alpha1
+kind: PodMigrationJob
+metadata:
+  name: migrationjob-demo
+spec:
+  paused: false
+  ttl: 5m
+  mode: ReservationFirst
+  podRef:
+    namespace: default
+    name: pod-demo-5f9b977566-c7lvk
+  reservationOptions:
+    # reservation-0 is the Reservation created before creating the PodMigrationJob
+    reservationRef:
+      name: reservation-0
+status:
+  phase: Pending
+```
+
+### Example: Evicting Pods Gracefully
+
+PodMigrationJob supports graceful eviction of pods.
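+
+Graceful eviction reuses the standard Kubernetes delete options, so `gracePeriodSeconds` behaves like the grace period of a normal Pod deletion. As a rough way to observe it (a sketch that simply reuses the Pod and job names from the earlier examples), you can watch the old Pod terminate while the job progresses:
+
+```bash
+# Watch the migrated Pod enter Terminating; it may stay there for up to the configured grace period.
+kubectl get pod pod-demo-5f9b977566-c7lvk -w
+
+# Follow the PodMigrationJob events to confirm when the eviction completes.
+kubectl describe podmigrationjob migrationjob-demo
+```
+
+The following PodMigrationJob sets a 60-second grace period via `deleteOptions`: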
+
+```yaml
+apiVersion: scheduling.koordinator.sh/v1alpha1
+kind: PodMigrationJob
+metadata:
+  name: migrationjob-demo
+spec:
+  paused: true
+  ttl: 5m
+  mode: ReservationFirst
+  podRef:
+    namespace: default
+    name: pod-demo-5f9b977566-c7lvk
+  deleteOptions:
+    # The duration in seconds before the object should be deleted. The value must be a non-negative integer.
+    # The value zero indicates delete immediately. If this value is nil, the default grace period for the
+    # specified type will be used.
+    gracePeriodSeconds: 60
+status:
+  phase: Pending
+```
+
+
+### Known Issues
+- [Arbitration mechanism](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md#filter-podmigrationjob) is not currently supported. The v0.6 version only implements the migration capability based on resource reservation.
+- [Basic Migration API](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20220701-pod-migration-job.md#basic-migration-api) is not currently supported.
\ No newline at end of file
diff --git a/versioned_docs/version-v1.3/user-manuals/resource-reservation.md b/versioned_docs/version-v1.3/user-manuals/resource-reservation.md
new file mode 100644
index 000000000..88cd888ab
--- /dev/null
+++ b/versioned_docs/version-v1.3/user-manuals/resource-reservation.md
@@ -0,0 +1,449 @@
+# Resource Reservation
+
+Resource Reservation is an ability of koord-scheduler for reserving node resources for specific pods or workloads.
+
+## Introduction
+
+Pods are the fundamental unit for allocating node resources in Kubernetes, binding resource requirements to business logic.
+However, we may want to allocate resources for specific pods or workloads that are not created yet, as in the scenarios below:
+
+1. Preemption: Existing preemption does not guarantee that only the preempting pods can allocate the preempted resources. We expect the scheduler to "lock" the preempted resources, preventing other pods from allocating them even if they have the same or higher priority.
+2. De-scheduling: For the descheduler, it is better to ensure sufficient resources before pods get rescheduled. Otherwise, the rescheduled pods may no longer be runnable, which disrupts the owning application.
+3. Horizontal scaling: To achieve more deterministic horizontal scaling, we expect to allocate node resources for the replicas before they are scaled out.
+4. Resource Pre-allocation: We may want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable.
+
+To enhance the resource scheduling of Kubernetes, koord-scheduler provides a scheduling API named `Reservation`, which allows us to reserve node resources for specified pods or workloads even if they have not been created yet.
+
+![image](/img/resource-reservation.svg)
+
+For more information, please see [Design: Resource Reservation](../designs/resource-reservation).
+
+## Setup
+
+### Prerequisite
+
+- Kubernetes >= 1.18
+- Koordinator >= 0.6
+
+### Installation
+
+Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to [Installation](/docs/installation).
+
+### Configurations
+
+Resource Reservation is *Enabled* by default. You can use it without any modification on the koord-scheduler config.
+
+## Use Resource Reservation
+
+### Quick Start
+
+1. Deploy a reservation `reservation-demo` with the YAML file below.
+ +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo +spec: + template: # set resource requirements + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: # reserve 500m cpu and 800Mi memory + requests: + cpu: 500m + memory: 800Mi + schedulerName: koord-scheduler # use koord-scheduler + owners: # set the owner specifications + - object: # owner pods whose name is `default/pod-demo-0` + name: pod-demo-0 + namespace: default + ttl: 1h # set the TTL, the reservation will get expired 1 hour later +``` + +```bash +$ kubectl create -f reservation-demo.yaml +reservation.scheduling.koordinator.sh/reservation-demo created +``` + +2. Watch the reservation status util it becomes available. + +```bash +$ kubectl get reservation reservation-demo -o wide +NAME PHASE AGE NODE TTL EXPIRES +reservation-demo Available 88s node-0 1h +``` + +3. Deploy a pod `pod-demo-0` with the YAML file below. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-demo-0 # match the owner spec of `reservation-demo` +spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + limits: + cpu: '1' + memory: 1Gi + requests: + cpu: 200m + memory: 400Mi + restartPolicy: Always + schedulerName: koord-scheduler # use koord-scheduler +``` + +```bash +$ kubectl create -f pod-demo-0.yaml +pod/pod-demo-0 created +``` + +4. Check the scheduled result of the pod `pod-demo-0`. + +```bash +$ kubectl get pod pod-demo-0 -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +pod-demo-0 1/1 Running 0 32s 10.17.0.123 node-0 +``` + +`pod-demo-0` is scheduled at the same node with `reservation-demo`. + +5. Check the status of the reservation `reservation-demo`. + +```bash +$ kubectl get reservation reservation-demo -oyaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo + creationTimestamp: "YYYY-MM-DDT05:24:58Z" + uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + ... +spec: + owners: + - object: + name: pod-demo-0 + namespace: default + template: + spec: + containers: + - args: + - -c + - "1" + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 800Mi + schedulerName: koord-scheduler + ttl: 1h +status: + allocatable: # total reserved + cpu: 500m + memory: 800Mi + allocated: # current allocated + cpu: 200m + memory: 400Mi + conditions: + - lastProbeTime: "YYYY-MM-DDT05:24:58Z" + lastTransitionTime: "YYYY-MM-DDT05:24:58Z" + reason: Scheduled + status: "True" + type: Scheduled + - lastProbeTime: "YYYY-MM-DDT05:24:58Z" + lastTransitionTime: "YYYY-MM-DDT05:24:58Z" + reason: Available + status: "True" + type: Ready + currentOwners: + - name: pod-demo-0 + namespace: default + uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy + nodeName: node-0 + phase: Available +``` + +Now we can see the reservation `reservation-demo` has reserved 500m cpu and 800Mi memory, and the pod `pod-demo-0` +allocates 200m cpu and 400Mi memory from the reserved resources. + +6. Cleanup the reservation `reservation-demo`. 
+ +```bash +$ kubectl delete reservation reservation-demo +reservation.scheduling.koordinator.sh "reservation-demo" deleted +$ kubectl get pod pod-demo-0 +NAME READY STATUS RESTARTS AGE +pod-demo-0 1/1 Running 0 110s +``` + +After the reservation deleted, the pod `pod-demo-0` is still running. + +### Advanced Configurations + +> The latest API can be found in [`reservation_types`](https://github.com/koordinator-sh/koordinator/blob/main/apis/scheduling/v1alpha1/reservation_types.go). + +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo +spec: + # pod template (required): Reserve resources and play pod/node affinities according to the template. + # The resource requirements of the pod indicates the resource requirements of the reservation + template: + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: + requests: + cpu: 500m + memory: 800Mi + # scheduler name (required): use koord-scheduler to schedule the reservation + schedulerName: koord-scheduler + # owner spec (required): Specify what kinds of pods can allocate resources of this reservation. + # Currently support three kinds of owner specifications: + # - object: specify the name, namespace, uid of the owner pods + # - controller: specify the owner reference of the owner pods, e.g. name, namespace(extended by koordinator), uid, kind + # - labelSelector: specify the matching labels are matching expressions of the owner pods + owners: + - object: + name: pod-demo-0 + namespace: default + - labelSelector: + matchLabels: + app: app-demo + # TTL (optional): Time-To-Live duration of the reservation. The reservation will get expired after the TTL period. + # If not set, use `24h` as default. + ttl: 1h + # Expires (optional): Expired timestamp when the reservation is expected to expire. + # If both `expires` and `ttl` are set, `expires` is checked first. + expires: "YYYY-MM-DDTHH:MM:SSZ" +``` + + + +### Example: Reserve on Specified Node, with Multiple Owners + +1. Check the resources allocatable of each node. + +```bash +$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory +NAME CPU MEMORY +node-0 7800m 28625036Ki +node-1 7800m 28629692Ki +... +$ kubectl describe node node-1 | grep -A 8 "Allocated resources" + Allocated resources: + (Total limits may be over 100 percent, i.e., overcommitted.) + Resource Requests Limits + -------- -------- ------ + cpu 780m (10%) 7722m (99%) + memory 1216Mi (4%) 14044Mi (50%) + ephemeral-storage 0 (0%) 0 (0%) + hugepages-1Gi 0 (0%) 0 (0%) + hugepages-2Mi 0 (0%) 0 (0%) +``` + +As above, the node `node-1` has about 7.0 cpu and 26Gi memory unallocated. + +2. Deploy a reservation `reservation-demo-big` with the YAML file below. 
+ +```yaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo-big +spec: + template: + namespace: default + spec: + containers: + - args: + - '-c' + - '1' + command: + - stress + image: polinux/stress + imagePullPolicy: Always + name: stress + resources: # reserve 6 cpu and 20Gi memory + requests: + cpu: 6 + memory: 20Gi + nodeName: node-1 # set the expected node name to schedule at + schedulerName: koord-scheduler + owners: # set multiple owners + - object: # owner pods whose name is `default/pod-demo-0` + name: pod-demo-1 + namespace: default + - labelSelector: # owner pods who have label `app=app-demo` can allocate the reserved resources + matchLabels: + app: app-demo + ttl: 1h +``` + +```bash +$ kubectl create -f reservation-demo-big.yaml +reservation.scheduling.koordinator.sh/reservation-demo-big created +``` + +3. Watch the reservation status util it becomes available. + +```bash +$ kubectl get reservation reservation-demo-big -o wide +NAME PHASE AGE NODE TTL EXPIRES +reservation-demo-big Available 37s node-1 1h +``` + +The reservation `reservation-demo-big` is scheduled at the node `node-1`, which matches the nodeName set in pod template. + +4. Deploy a deployment `app-demo` with the YAML file below. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: app-demo +spec: + replicas: 2 + selector: + matchLabels: + app: app-demo + template: + metadata: + name: stress + labels: + app: app-demo # match the owner spec of `reservation-demo-big` + spec: + schedulerName: koord-scheduler # use koord-scheduler + containers: + - name: stress + image: polinux/stress + args: + - '-c' + - '1' + command: + - stress + resources: + requests: + cpu: 2 + memory: 10Gi + limits: + cpu: 4 + memory: 20Gi +``` + +```bash +$ kubectl create -f app-demo.yaml +deployment.apps/app-demo created +``` + +5. Check the scheduled result of the pods of deployment `app-demo`. + +```bash +k get pod -l app=app-demo -o wide +NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES +app-demo-798c66db46-ctnbr 1/1 Running 0 2m 10.17.0.124 node-1 +app-demo-798c66db46-pzphc 1/1 Running 0 2m 10.17.0.125 node-1 +``` + +Pods of deployment `app-demo` are scheduled at the same node with `reservation-demo-big`. + +6. Check the status of the reservation `reservation-demo-big`. + +```bash +$ kubectl get reservation reservation-demo-big -oyaml +apiVersion: scheduling.koordinator.sh/v1alpha1 +kind: Reservation +metadata: + name: reservation-demo-big + creationTimestamp: "YYYY-MM-DDT06:28:16Z" + uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx + ... 
+spec:
+  owners:
+  - object:
+      name: pod-demo-1
+      namespace: default
+  - labelSelector:
+      matchLabels:
+        app: app-demo
+  template:
+    spec:
+      containers:
+      - args:
+        - -c
+        - "1"
+        command:
+        - stress
+        image: polinux/stress
+        imagePullPolicy: Always
+        name: stress
+        resources:
+          requests:
+            cpu: 6
+            memory: 20Gi
+      nodeName: node-1
+      schedulerName: koord-scheduler
+  ttl: 1h
+status:
+  allocatable:
+    cpu: 6
+    memory: 20Gi
+  allocated:
+    cpu: 4
+    memory: 20Gi
+  conditions:
+  - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
+    lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
+    reason: Scheduled
+    status: "True"
+    type: Scheduled
+  - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
+    lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
+    reason: Available
+    status: "True"
+    type: Ready
+  currentOwners:
+  - name: app-demo-798c66db46-ctnbr
+    namespace: default
+    uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
+  - name: app-demo-798c66db46-pzphc
+    namespace: default
+    uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
+  nodeName: node-1
+  phase: Available
+```
+
+Now we can see the reservation `reservation-demo-big` has reserved 6 cpu and 20Gi memory, and the pods of the deployment
+`app-demo` allocate 4 cpu and 20Gi memory from the reserved resources.
+The allocation of reserved resources does not increase the requested resources of the node; otherwise, the total requests of
+`node-1` would exceed the node allocatable.
+Moreover, a reservation can be allocated by multiple owners when there are enough reserved resources unallocated.
diff --git a/versioned_docs/version-v1.3/user-manuals/slo-config.md b/versioned_docs/version-v1.3/user-manuals/slo-config.md
new file mode 100644
index 000000000..076a10527
--- /dev/null
+++ b/versioned_docs/version-v1.3/user-manuals/slo-config.md
@@ -0,0 +1,411 @@
+# SLO Configuration
+
+## Introduction
+
+Koordinator uses a ConfigMap to manage the SLO configurations. The ConfigMap is used by the slo-controller, whose name
+and namespace can be specified via the startup arguments of the koord-manager
+(`koordinator-system/slo-controller-config` by default). It has the following keys:
+
+- `colocation-config`: The configuration for colocation. For example, whether to enable the colocated batch resources or not, the colocated watermark.
+- `resource-threshold-config`: The configuration for threshold-based suppression or eviction. For example, the threshold for cpu suppression, the threshold for memory eviction.
+- `resource-qos-config`: The configuration for QoS-based features. For example, Group Identity for BE pods, Memory QoS for LS pods, Last-Level-Cache partitioning for BE pods.
+- `cpu-burst-config`: The configuration for the CPU Burst feature. For example, the maximum burst ratio of the pod.
+- `system-config`: The configuration for system-level settings. For example, the global minimum memory free factor (`min_free_kbytes`).
+
+### Configuration Levels
+
+Each config is defined at both the cluster level and the node level.
+
+e.g.
+
+```go
+type ColocationCfg struct {
+	ColocationStrategy `json:",inline"`
+	NodeConfigs        []NodeColocationCfg `json:"nodeConfigs,omitempty"`
+}
+
+type ResourceQOSCfg struct {
+	ClusterStrategy *slov1alpha1.ResourceQOSStrategy `json:"clusterStrategy,omitempty"`
+	NodeStrategies  []NodeResourceQOSStrategy        `json:"nodeStrategies,omitempty"`
+}
+```
+
+The cluster-level config is for setting the global configurations, while the node-level config is for users to adjust the
+configurations of some nodes, especially for a gray-scale deployment.
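+
+For a gray-scale rollout, a common pattern is to keep the cluster-level strategy unchanged and add a node-level entry that only matches nodes carrying a canary label. The sketch below assumes the default ConfigMap name and namespace, and the `slo-canary` label key is just a hypothetical example:
+
+```bash
+# Label the canary nodes; a nodeConfigs/nodeStrategies entry whose nodeSelector matches this label
+# will then override the cluster-level strategy on these nodes only (the label key is an example).
+kubectl label node test-node slo-canary=true
+
+# Edit the SLO ConfigMap and add a node-level entry with:
+#   "nodeSelector": { "matchLabels": { "slo-canary": "true" } }
+kubectl edit configmap -n koordinator-system slo-controller-config
+```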
+ +Please note that most configured fields have default values inside the components (koordlet, koord-manager), so editing +the changed parameters is usually enough. + +### NodeSLO + +The data in SLO config is parsed by the koord-manager. The koord-manager checks if the config data is legal, and then +updates the parsed configs into NodeSLO objects for every node. If the parsing fails, the koord-manager records events +to the ConfigMap object to warn the unmarshal errors. For the agent component koordlet, it watches the specifications +in the NodeSLO and reconciles the node QoS features. + +```yaml +apiVersion: slo.koordinator.sh/v1alpha1 +kind: NodeSLO +metadata: + name: test-node +spec: + cpuBurstStrategy: {} + extensions: {} + resourceQOSStrategy: {} + systemStrategy: {} + # parsed from the `resource-threshold-config` data + resourceUsedThresholdWithBE: + cpuSuppressPolicy: cpuset + cpuSuppressThresholdPercent: 65 + enable: true + memoryEvictThresholdPercent: 70 + +``` + +## Configurations + +> Referred version: Koordinator v1.2 + +The SLO Config template is as follows: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: slo-controller-config + namespace: koordinator-system +data: + # colocation-config is the configuration for colocation. + # Related features: Dynamic resource over-commitment, Load-aware scheduling, Load-aware descheduling. + # - enable: whether to enable the colocation. If false, the reclaimed resources of the node allocatable (e.g. `kubernetes.io/batch-cpu`) will be removed. + # - metricAggregateDurationSeconds: the aggregated duration of node metrics reporting. + # - metricReportIntervalSeconds: the reporting interval of the node metrics. + # - metricAggregatePolicy: policies of reporting node metrics in different durations. + # - cpuReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed cpu resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold. + # - memoryReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed memory resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold. + # - memoryCalculatePolicy: the policy for calculating the reclaimable memory resource. If set to `request`, only unallocated memory resource of high-priority pods are reclaimable, and no allocated memory can be reclaimed. + # - degradeTimeMinutes: the threshold duration to degrade the colocation for which the node metrics has not been updated. + # - updateTimeThresholdSeconds: the threshold duration to force updating the reclaimed resources with the latest calculated result. + # - resourceDiffThreshold: the threshold to update the reclaimed resources than which the calculated reclaimed resources is different from the current. + # - nodeConfigs: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. 
+ colocation-config: | + { + "enable": false, + "metricAggregateDurationSeconds": 300, + "metricReportIntervalSeconds": 60, + "metricAggregatePolicy": { + "durations": [ + "5m", + "10m", + "15m" + ] + }, + "cpuReclaimThresholdPercent": 60, + "memoryReclaimThresholdPercent": 65, + "memoryCalculatePolicy": "usage", + "degradeTimeMinutes": 15, + "updateTimeThresholdSeconds": 300, + "resourceDiffThreshold": 0.1, + "nodeConfigs": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "updateTimeThresholdSeconds": 360, + "resourceDiffThreshold": 0.2 + } + ] + } + # The configuration for threshold-based strategies. + # Related features: BECPUSuppress, BEMemoryEvict, BECPUEvict. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - enable: whether to enable the threshold-based strategies or not. If false, all threshold-based strategies are disabled. If set to true, CPU Suppress and Memory Evict are enabled by default. + # - cpuSuppressThresholdPercent: the node cpu utilization threshold to suppress BE pods' usage. + # - cpuSuppressPolicy: the policy of cpu suppression. If set to `cpuset`, the BE pods' `cpuset.cpus` will be reconciled when suppression. If set to `cfsQuota`, the BE pods' `cpu.cfs_quota_us` will be reconciled. + # - memoryEvictThresholdPercent: the node memory utilization threshold to evict BE pods. + # - memoryEvictLowerPercent: the node memory utilization threshold to stop the memory eviction. By default, `lowerPercent = thresholdPercent - 2`. + # - cpuEvictBESatisfactionLowerPercent: the cpu satisfaction threshold to start the cpu eviction (also require to meet the BE util threshold). + # - cpuEvictBEUsageThresholdPercent: the BE utilization (BEUsage / BERealLimit) threshold to start the cpu eviction (also require to meet the cpu satisfaction threshold). + # - cpuEvictBESatisfactionUpperPercent: the cpu satisfaction threshold to stop the cpu eviction. + # - cpuEvictTimeWindowSeconds: the time window of the cpu metrics for the cpu eviction. + resource-threshold-config: | + { + "clusterStrategy": { + "enable": false, + "cpuSuppressThresholdPercent": 65, + "cpuSuppressPolicy": "cpuset", + "memoryEvictThresholdPercent": 70, + "memoryEvictLowerPercent": 65, + "cpuEvictBESatisfactionUpperPercent": 90, + "cpuEvictBESatisfactionLowerPercent": 60, + "cpuEvictBEUsageThresholdPercent": 90 + }, + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "cpuEvictBEUsageThresholdPercent": 80 + } + ] + } + # The configuration for QoS-based features. + # Related features: CPUQoS (GroupIdentity), MemoryQoS (CgroupReconcile), ResctrlQoS. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - lsrClass/lsClass/beClass: the configuration for pods of QoS LSR/LS/BE respectively. + # - cpuQOS: the configuration of CPU QoS. + # - enable: whether to enable CPU QoS. If set to `false`, the related cgroup configs will be reset to the system default. + # - groupIdentity: the priority level of the Group Identity ([-1, 2]). `2` means the highest priority, while `-1` means the lowest priority. Anolis OS required. + # - memoryQOS: the configuration of Memory QoS. 
+ # - enable: whether to enable Memory QoS. If set to `false`, the related cgroup configs will be reset to the system default. + # - minLimitPercent: the scale percentage for setting the `memory.min` based on the container's request. It enables the memory protection from the Linux memory reclaim. + # - lowLimitPercent: the scale percentage for setting the `memory.low` based on the container's request. It enables the memory soft protection from the Linux memory reclaim. + # - throttlingPercent: the scale percentage for setting the `memory.high` based on the container's limit. It enables the memory throttling in cgroup level. + # - wmarkRatio: the ratio of container-level asynchronous memory reclaim based on the container's limit. Anolis OS required. + # - wmarkScalePermill: the per-mill of container memory to reclaim in once asynchronous memory reclaim. Anolis OS required. + # - wmarkMinAdj: the adjustment percentage of global memory min watermark. It affects the reclaim priority when the node memory free is quite a few. Anolis OS required. + # - resctrlQOS: the configuration of Resctrl (Intel RDT) QoS. + # - enable: whether to enable Resctrl QoS. + # - catRangeStartPercent: the starting percentage of the L3 Cache way partitioning. L3 CAT required. + # - catRangeEndPercent: the ending percentage of the L3 Cache way partitioning. L3 CAT required. + # - mbaPercent: the allocation percentage of the memory bandwidth. MBA required. + resource-qos-config: | + { + "clusterStrategy": { + "lsrClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": 2 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": -25, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 100, + "mbaPercent": 100 + } + }, + "lsClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": 2 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": -25, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 100, + "mbaPercent": 100 + } + }, + "beClass": { + "cpuQOS": { + "enable": false, + "groupIdentity": -1 + }, + "memoryQOS": { + "enable": false, + "minLimitPercent": 0, + "lowLimitPercent": 0, + "throttlingPercent": 0, + "wmarkRatio": 95, + "wmarkScalePermill": 20, + "wmarkMinAdj": 50, + "priorityEnable": 0, + "priority": 0, + "oomKillGroup": 0 + }, + "resctrlQOS": { + "enable": false, + "catRangeStartPercent": 0, + "catRangeEndPercent": 30, + "mbaPercent": 100 + } + } + }, + "nodeStrategies": [ + { + "name": "anolis", + "nodeSelector": { + "matchLabels": { + "kubernetes.io/kernel": "anolis" + } + }, + "beClass": { + "memoryQOS": { + "wmarkRatio": 90 + } + } + } + ] + } + # The configuration for the CPU Burst. + # Related features: CPUBurst. + # - clusterStrategy: the cluster-level configuration. + # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration. + # - policy: the policy of CPU Burst. If set to `none`, the CPU Burst is disabled. If set to `auto`, the CPU Burst is fully enabled. If set to `cpuBurstOnly`, only the Linux CFS Burst feature is enabled. 
+  # - cpuBurstPercent: the percentage of Linux CFS Burst. It affects the value of `cpu.cfs_burst_us` of pod/container cgroups. It specifies the percentage to which the CPU limit can be increased by CPU Burst.
+  # - cfsQuotaBurstPercent: the percentage of cfs quota burst. It affects the scaled ratio of `cpu.cfs_quota_us` of pod/container cgroups. It specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased.
+  # - cfsQuotaBurstPeriodSeconds: the maximum duration of a single cfs quota burst. A value of `-1` indicates that the time period in which the container can run with an increased CFS quota is unlimited.
+  # - sharePoolThresholdPercent: the threshold of share pool utilization. If the share pool utilization is too high, CPU Burst will be stopped and reset to avoid machine overload.
+  cpu-burst-config: |
+    {
+      "clusterStrategy": {
+        "policy": "none",
+        "cpuBurstPercent": 1000,
+        "cfsQuotaBurstPercent": 300,
+        "cfsQuotaBurstPeriodSeconds": -1,
+        "sharePoolThresholdPercent": 50
+      },
+      "nodeStrategies": [
+        {
+          "name": "anolis",
+          "nodeSelector": {
+            "matchLabels": {
+              "kubernetes.io/kernel": "anolis"
+            }
+          },
+          "policy": "cfsQuotaBurstOnly",
+          "cfsQuotaBurstPercent": 400
+        }
+      ]
+    }
+  # The configuration for system-level settings.
+  # Related features: SystemConfig.
+  # - clusterStrategy: the cluster-level configuration.
+  # - nodeStrategies: the node-level configurations which matches the nodes via the node selector and overrides the cluster configuration.
+  # - minFreeKbytesFactor: the factor for calculating the global minimum memory free watermark `/proc/sys/vm/min_free_kbytes`. `min_free_kbytes = minFreeKbytesFactor * nodeTotalMemory / 10000`.
+  # - watermarkScaleFactor: the reclaim factor `/proc/sys/vm/watermark_scale_factor` in one global memory reclaim.
+  # - memcgReapBackGround: whether to enable the reaper for orphan memory cgroups.
+  system-config: |-
+    {
+      "clusterStrategy": {
+        "minFreeKbytesFactor": 100,
+        "watermarkScaleFactor": 150,
+        "memcgReapBackGround": 0
+      },
+      "nodeStrategies": [
+        {
+          "name": "anolis",
+          "nodeSelector": {
+            "matchLabels": {
+              "kubernetes.io/kernel": "anolis"
+            }
+          },
+          "minFreeKbytesFactor": 100,
+          "watermarkScaleFactor": 150
+        }
+      ]
+    }
+```
+
+For more information, please check the user manuals and designs of the related features.
+
+## Quick Start
+
+1. Check the current SLO configurations via the ConfigMap `koordinator-system/slo-controller-config`.
+
+```bash
+$ kubectl get configmap -n koordinator-system slo-controller-config -o yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  annotations:
+    meta.helm.sh/release-name: koordinator
+    meta.helm.sh/release-namespace: default
+  labels:
+    app.kubernetes.io/managed-by: Helm
+  name: slo-controller-config
+  namespace: koordinator-system
+data:
+  colocation-config: |
+    {
+      "enable": false,
+      "metricAggregateDurationSeconds": 300,
+      "metricReportIntervalSeconds": 60,
+      "cpuReclaimThresholdPercent": 60,
+      "memoryReclaimThresholdPercent": 65,
+      "memoryCalculatePolicy": "usage",
+      "degradeTimeMinutes": 15,
+      "updateTimeThresholdSeconds": 300,
+      "resourceDiffThreshold": 0.1
+    }
+  resource-threshold-config: |
+    {
+      "clusterStrategy": {
+        "enable": false
+      }
+    }
+```
+
+2. Edit the ConfigMap `koordinator-system/slo-controller-config` to change the SLO config.
+
+```bash
+$ kubectl edit configmap -n koordinator-system slo-controller-config
+```
+
+For example, the configmap is edited as follows:
+
+```yaml
+data:
+  # ...
+ resource-threshold-config: | + { + "clusterStrategy": { + "enable": true, + "cpuSuppressThresholdPercent": 60, + "cpuSuppressPolicy": "cpuset", + "memoryEvictThresholdPercent": 60 + } + } +``` + +3. Verify if the NodeSLO is successfully dispatched. + +> NOTE: The default values will be omitted in the NodeSLO. + +```bash +$ kubectl get nodeslo.slo.koordinator.sh test-node -o yaml +apiVersion: slo.koordinator.sh/v1alpha1 +kind: NodeSLO +metadata: + name: test-node +spec: + # ... + extensions: {} + resourceUsedThresholdWithBE: + cpuSuppressPolicy: cpuset + cpuSuppressThresholdPercent: 60 + enable: true + memoryEvictThresholdPercent: 60 +``` diff --git a/versioned_sidebars/version-v1.3-sidebars.json b/versioned_sidebars/version-v1.3-sidebars.json new file mode 100644 index 000000000..4ba1f3b34 --- /dev/null +++ b/versioned_sidebars/version-v1.3-sidebars.json @@ -0,0 +1,78 @@ +{ + "docs": [ + { + "type": "category", + "label": "Getting Started", + "collapsed": false, + "items": [ + "introduction", + "installation" + ] + }, + { + "type": "category", + "label": "Architecture", + "collapsed": false, + "items": [ + "architecture/overview", + "architecture/resource-model", + "architecture/priority", + "architecture/qos" + ] + }, + { + "type": "category", + "label": "User Manuals", + "collapsed": true, + "items": [ + "user-manuals/colocation-profile", + "user-manuals/load-aware-scheduling", + "user-manuals/load-aware-descheduling", + "user-manuals/fine-grained-cpu-orchestration", + "user-manuals/resource-reservation", + "user-manuals/pod-migration-job", + "user-manuals/fine-grained-device-scheduling", + "user-manuals/gang-scheduling", + "user-manuals/multi-hierarchy-elastic-quota-management", + "user-manuals/slo-config", + "user-manuals/cpu-suppress", + "user-manuals/cpu-burst", + "user-manuals/cpu-qos", + "user-manuals/cpu-evict", + "user-manuals/memory-qos", + "user-manuals/memory-evict", + "user-manuals/performance-collector" + ] + }, + { + "type": "category", + "label": "Design Details", + "collapsed": true, + "items": [ + "designs/koordlet-overview", + "designs/runtime-proxy", + "designs/nri-mode-resource-management", + "designs/node-prediction", + "designs/enhanced-scheduler-extension", + "designs/load-aware-scheduling", + "designs/fine-grained-cpu-orchestration", + "designs/resource-reservation", + "designs/pod-migration-job", + "designs/descheduler-framework", + "designs/fine-grained-device-scheduling", + "designs/gang-scheduling", + "designs/multi-hierarchy-elastic-quota-management" + ] + }, + { + "type": "category", + "label": "Best Practices", + "collapsed": true, + "items": [ + "best-practices/colocation-of-spark-jobs", + "best-practices/anolis_plugsched", + "best-practices/fine-grained-cpu-orchestration" + ] + } + ] +} diff --git a/versions.json b/versions.json index a9dbce256..15ed85c51 100644 --- a/versions.json +++ b/versions.json @@ -1,4 +1,5 @@ [ + "v1.3", "v1.2", "v1.1", "v1.0",