Azure Scheduler
The Azure Scheduler works with the default ClusterSizingConfiguration resource and the HostedClusterSizing controller.
ClusterSizingConfiguration
The ClusterSizingConfiguration is an API used for setting tshirt sizes based on the number of nodes a HostedCluster has. Each tshirt size can configure different effects that control various aspects of the cluster, such as the Kube API Server (KAS) and etcd. Additionally, it allows controlling the frequency of transitions between cluster sizes.
Effects
- kasGoMemLimit: Specifies the memory limit for the Kube API Server.
- controlPlanePriorityClassName: The priority class for most control plane pods.
- etcdPriorityClassName: The priority class for etcd pods.
- apiCriticalPriorityClassName: The priority class for pods in the API request serving path, including Kube API Server and OpenShift APIServer.
- resourceRequests: Allows specifying resource requests for control plane pods.
- machineHealthCheckTimeout: Specifies an optional timeout for machine health checks created for HostedClusters with this specific size.
- maximumRequestsInFlight: Specifies the maximum requests in flight for Kube API Server.
- maximumMutatingRequestsInflight: Specifies the maximum mutating requests in flight for Kube API Server.
ConcurrencyConfiguration
The ConcurrencyConfiguration defines the bounds of allowed behavior for clusters transitioning between sizes. It includes:
- SlidingWindow: The window over which the concurrency bound is enforced. This is a duration (e.g., 10m for 10 minutes) that specifies the time frame within which the concurrency limit is applied.
- Limit: The maximum allowed number of cluster size transitions during the sliding window. This is an integer (e.g., 5) that specifies how many transitions can occur within the sliding window.
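To make the interaction between the two fields concrete, here is a minimal Python sketch of a sliding-window limit. It is an illustration only, not the HyperShift implementation; the `TransitionLimiter` class and its method names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class TransitionLimiter:
    """Hypothetical helper: allows at most `limit` transitions per sliding window."""

    def __init__(self, limit: int, sliding_window: timedelta):
        self.limit = limit
        self.sliding_window = sliding_window
        self._timestamps = deque()  # times of transitions still inside the window

    def try_transition(self, now: datetime) -> bool:
        """Record a transition at `now` if the limit allows it; return whether it was allowed."""
        cutoff = now - self.sliding_window
        # Forget transitions that have slid out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.limit:
            return False
        self._timestamps.append(now)
        return True
```

In this sketch, a window of zero length discards every earlier transition before each check, so the limit never blocks anything. Whether the real controller treats a zero window the same way is not stated here.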
TransitionDelayConfiguration
The TransitionDelayConfiguration defines the lag between cluster size changing and the assigned tshirt size class being applied. It includes:
- Increase: The minimum period of time to wait between a cluster's size increasing and the tshirt size assigned to it being updated to reflect the new size. This is a duration (e.g., 30s for 30 seconds).
- Decrease: The minimum period of time to wait between a cluster's size decreasing and the tshirt size assigned to it being updated to reflect the new size. This is a duration (e.g., 10m for 10 minutes).
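The asymmetric delays can be sketched as a single check. This is a simplified Python illustration rather than the controller's code; the function name, the integer size ranks, and the `observed_at` timestamp are assumptions introduced for the example.

```python
from datetime import datetime, timedelta


def transition_due(current_rank: int, target_rank: int,
                   observed_at: datetime, now: datetime,
                   increase: timedelta, decrease: timedelta) -> bool:
    """Return True once the configured delay for this direction has elapsed.

    Ranks are hypothetical integers ordering the sizes (e.g. small=0, medium=1).
    `observed_at` is when the cluster was first seen at the new size.
    """
    if target_rank == current_rank:
        return False  # nothing to transition to
    # Pick the delay that matches the direction of the change.
    delay = increase if target_rank > current_rank else decrease
    return now - observed_at >= delay
```

With `increase=30s` and `decrease=10m`, a cluster that grows is re-sized quickly, while one that shrinks keeps its larger size class for longer, which avoids flapping when node counts dip briefly.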
HostedClusterSizing Controller
The HostedClusterSizing controller determines the number of nodes associated with a HostedCluster either from the HostedControlPlane.Status or by iterating through the NodePools associated with the HostedCluster and counting their nodes. It then compares the node count against the minimum and maximum sizes set for each tshirt size in the ClusterSizingConfiguration and applies a label with the matching tshirt size to the HostedCluster. Depending on the settings in the ClusterSizingConfiguration, it can wait a specified amount of time before transitioning between tshirt sizes, and it uses a sliding window to ensure that only a limited number of transitions (e.g., 5 transitions) can occur within a specified time frame (e.g., 20 minutes).
The controller also updates the status of the HostedCluster, reporting the computed cluster size, whether a tshirt size transition is pending, and whether the cluster requires a transition to a different size.
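The comparison step can be sketched in a few lines of Python. This is a simplified illustration, not the controller's code: the `SIZES` table mirrors the `criteria` ranges used in the example later in this document, the function name is hypothetical, and both bounds are assumed to be inclusive.

```python
from typing import Optional

# Size classes mirroring the `criteria` ranges used later in this document;
# `to: None` means the range has no upper bound.
SIZES = [
    {"name": "small", "from": 0, "to": 2},
    {"name": "medium", "from": 3, "to": 4},
    {"name": "large", "from": 5, "to": None},
]


def size_class_for(node_count: int, sizes=SIZES) -> Optional[str]:
    """Return the name of the first size whose range contains node_count.

    Bounds are assumed inclusive; returns None if no range matches.
    """
    for size in sizes:
        upper = size["to"]
        if node_count >= size["from"] and (upper is None or node_count <= upper):
            return size["name"]
    return None
```

For example, `size_class_for(3)` returns `"medium"` with the table above, which matches the scale-up from 2 to 3 nodes in the walkthrough.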
Azure Scheduler Controller
The Azure Scheduler controller is straightforward. It checks the label set by the HostedClusterSizing controller and retrieves the cluster sizing configuration associated with the tshirt size. Based on the configuration, it can modify the HostedCluster with annotations for the specified fields. These annotations are then used by different controllers to propagate the required changes to the appropriate pods and containers.
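As an illustration of the annotation step, the Python sketch below turns `resourceRequests` entries into annotations shaped like the `resource-request-override.hypershift.openshift.io/...` keys shown in the example later in this document. The function name is hypothetical, and two details are assumptions: that the key is ordered `<deploymentName>.<containerName>` (the example uses identical names for both), and that multiple resources are joined with commas.

```python
ANNOTATION_PREFIX = "resource-request-override.hypershift.openshift.io"


def resource_request_annotations(resource_requests):
    """Build override annotations from a size's `resourceRequests` effect entries."""
    annotations = {}
    for entry in resource_requests:
        # Assumed key order: <deploymentName>.<containerName>.
        key = f"{ANNOTATION_PREFIX}/{entry['deploymentName']}.{entry['containerName']}"
        # Collect whichever resources the entry sets, e.g. "cpu=3".
        # Joining several resources with "," is an assumption of this sketch.
        values = [f"{res}={entry[res]}" for res in ("cpu", "memory") if res in entry]
        annotations[key] = ",".join(values)
    return annotations
```

Called with the two `resourceRequests` entries of the `medium` size from the walkthrough, the function returns one `cpu=3` annotation per deployment.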
How to Use
Prerequisites
- An AKS cluster with the cluster autoscaler enabled (the --enable-cluster-autoscaler flag when creating the AKS cluster, with --min-count 2 --max-count 6). This example uses Standard_D4s_v4 VMs.
- The HyperShift operator installed with size tagging enabled (the --enable-size-tagging flag when installing the HyperShift operator).
- A ClusterSizingConfiguration resource. (The HyperShift operator creates a default ClusterSizingConfiguration resource.)
- A HostedCluster in the Completed state.
- A NodePool with 2 nodes associated with the HostedCluster.
Steps
The example below uses a HostedCluster named 'pstefans-3' and a NodePool named 'pstefans-3', both in the 'clusters' namespace.
-
The AKS cluster should have only 2 nodes at this point.
oc get nodes
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-11371333-vmss000000   Ready    <none>   3h43m   v1.31.1
aks-nodepool1-11371333-vmss000002   Ready    <none>   3h43m   v1.31.1
-
Edit the ClusterSizingConfiguration resource with the following spec:
oc edit clustersizingconfiguration cluster
spec:
  concurrency:
    limit: 5
    slidingWindow: 0s
  sizes:
  - criteria:
      from: 0
      to: 2
    name: small
  - criteria:
      from: 3
      to: 4
    effects:
      resourceRequests:
      - containerName: kube-apiserver
        cpu: 3
        deploymentName: kube-apiserver
      - containerName: control-plane-operator
        cpu: 3
        deploymentName: control-plane-operator
    name: medium
  - criteria:
      from: 5
    name: large
  transitionDelay:
    decrease: 0s
    increase: 0s
-
Scale nodepool up to 3 nodes:
oc scale nodepool pstefans-3 \
  --namespace clusters \
  --replicas 3
-
Once the NodePool has scaled successfully, the HostedCluster is updated with the new tshirt size label and the resource request override annotations. The relevant controllers then pick up these annotations and set the requests on the specified pods.
oc get deployment kube-apiserver -n clusters-pstefans-3 -o json | jq '.spec.template.spec.containers[] | select(.name == "kube-apiserver") | .resources'
{
  "requests": {
    "cpu": "3",
    "memory": "2Gi"
  }
}
oc get deployment control-plane-operator -n clusters-pstefans-3 -o json | jq '.spec.template.spec.containers[] | select(.name == "control-plane-operator") | .resources'
{
  "requests": {
    "cpu": "3",
    "memory": "80Mi"
  }
}
oc get hc pstefans-3 -n clusters -o yaml | grep resource-request-override.hypershift.openshift.io
    resource-request-override.hypershift.openshift.io/control-plane-operator.control-plane-operator: cpu=3
    resource-request-override.hypershift.openshift.io/kube-apiserver.kube-apiserver: cpu=3
-
You should now see that the autoscaler has scaled the AKS cluster to 3 nodes, because we requested 3 CPU cores each for the kube-apiserver and the control-plane-operator on nodes with at most 4 cores. Each deployment therefore requests nearly a full node to itself.
oc get nodes
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-11371333-vmss000000   Ready    <none>   4h8m    v1.31.1
aks-nodepool1-11371333-vmss000002   Ready    <none>   4h8m    v1.31.1
aks-nodepool1-11371333-vmss000003   Ready    <none>   9m31s   v1.31.1
-
You should now see that each of the deployments whose resource requests we changed is running on a different node with sufficient compute.
kubectl get pods --all-namespaces --field-selector spec.nodeName=aks-nodepool1-11371333-vmss000003
NAMESPACE             NAME                              READY   STATUS    RESTARTS   AGE
clusters-pstefans-3   kube-apiserver-549c75cb99-jj964   4/4     Running   0          12m
kubectl get pods --all-namespaces --field-selector spec.nodeName=aks-nodepool1-11371333-vmss000002
NAMESPACE             NAME                                      READY   STATUS    RESTARTS   AGE
clusters-pstefans-3   control-plane-operator-69b894d9dd-cxv2z   1/1     Running   0          14m
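The reasoning behind the scale-up in the steps above can be checked with a toy first-fit calculation in Python. This is only a sketch: the `nodes_needed` function is hypothetical, and it ignores memory, system reservations, and every other pod on the cluster, all of which the real scheduler and autoscaler take into account.

```python
def nodes_needed(cpu_requests, node_cpu):
    """First-fit packing: count the nodes needed to place every CPU request."""
    free = []  # remaining CPU on each node opened so far
    for request in cpu_requests:
        if request > node_cpu:
            raise ValueError(f"request {request} does not fit on any node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_cpu - request)
    return len(free)
```

With two requests of 3 CPUs and 4-core nodes, `nodes_needed([3, 3], 4)` returns 2: the two pods cannot share a node, which is why the kube-apiserver and control-plane-operator pods end up on different nodes.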