How Kubernetes Scheduler Scores Nodes Under Pod Burst

Learn the exact scoring algorithm kube-scheduler uses when 1000 Pods land at once, why it's not just 'most free CPU', and how percentageOfNodesToScore shapes placement.

When one thousand Pods appear in a cluster within seconds, kube-scheduler does not greedily pick the node with the most free memory. It runs a serial pipeline for each Pod. Inside that pipeline, node evaluation is parallelized. The placement pattern that emerges is the result of a weighted scoring algorithm, not a simple heuristic. That scoring algorithm is what we are going to unpack.

Think of the scheduler as a dispatcher at a busy airport. One thousand flights land at once. The dispatcher assigns each flight to a gate. Each flight has hard requirements: a gate must be long enough, must have a jet bridge, must not be blocked by another aircraft. The dispatcher first filters out gates that cannot physically accommodate the flight. Then, among the remaining gates, the dispatcher scores them. A gate closer to the baggage claim gets a higher score. A gate with a shorter taxi time gets a higher score. The dispatcher picks the gate with the highest combined score. Now imagine the dispatcher does this one flight at a time. As gates get assigned, later flights see fewer options and may be forced to less convenient gates. This is exactly how kube-scheduler handles a burst of Pods.

The Scheduling Pipeline

For each unscheduled Pod, the scheduler runs a two-phase scheduling context: a serial scheduling cycle that selects a node, and an asynchronous binding cycle that applies the decision via the API server. The scheduling cycle is strictly one Pod at a time. Inside that cycle, Filter and Score operations run across nodes in parallel. Binding cycles for multiple Pods can run concurrently. This design keeps throughput high while keeping the core placement logic simple and race-free.

When a new Pod appears, it enters the active queue (activeQ), ordered by Pod priority. The scheduler pops the highest-priority Pod and begins the scheduling cycle. First, PreFilter plugins precompute Pod-specific data and store it in CycleState. This avoids recomputing resource sums or affinity selectors for every node. Then the Filter phase checks every candidate node against hard constraints. A node fails if any filter plugin rejects it. The result is a set of feasible nodes.

If no nodes are feasible, PostFilter plugins attempt preemption: evicting lower-priority Pods to make room. If preemption succeeds, the Pod is retried later. If it fails, the Pod goes to the unschedulable pool and waits for cluster events.

With a non-empty feasible set, the scheduler moves to PreScore, which precomputes data for scoring, and then to the Score phase. Each Score plugin rates every feasible node, and after the plugin’s optional NormalizeScore step each rating must be an integer in the range [0, 100]. The framework then multiplies each normalized score by the plugin’s configured weight and sums the results per node. The node with the highest total wins. The scheduler then records the Pod against that node in its in-memory cache and runs the Reserve plugins, so the Pod’s resources are already deducted from that node when subsequent Pods in the same burst are evaluated. This prevents overcommit within the burst.
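The cycle described so far can be sketched in a few lines of Python. This is a toy model, not kube-scheduler code: `Node`, `fits`, `free_cpu_score`, and `schedule_one` are hypothetical names, only CPU is tracked, and a single score function stands in for the whole plugin set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_capacity: int       # millicores
    cpu_requested: int = 0  # millicores already claimed in the in-memory cache


def fits(node: Node, pod_cpu: int) -> bool:
    """Filter: hard constraint, the request must fit in the remaining capacity."""
    return node.cpu_requested + pod_cpu <= node.cpu_capacity


def free_cpu_score(node: Node, pod_cpu: int) -> int:
    """Score: share of CPU still free after placing the Pod, on a 0-100 scale."""
    if node.cpu_capacity == 0:
        return 0
    free = node.cpu_capacity - node.cpu_requested - pod_cpu
    return free * 100 // node.cpu_capacity


def schedule_one(pod_cpu: int, nodes: list[Node]) -> Optional[Node]:
    """One scheduling cycle: filter, score, pick, then update the cache."""
    feasible = [n for n in nodes if fits(n, pod_cpu)]
    if not feasible:
        return None  # a real scheduler would try PostFilter (preemption) here
    winner = max(feasible, key=lambda n: free_cpu_score(n, pod_cpu))
    winner.cpu_requested += pod_cpu  # in-memory deduction, before any API call
    return winner
```

Calling `schedule_one` in a loop mimics a burst: each call sees the deductions made by the calls before it.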

Feasibility Search and the Node Slice

In large clusters, checking every node for every Pod would be too expensive. The scheduler stops filtering early once it has found “enough” feasible nodes. The key parameter is percentageOfNodesToScore, which sets how many feasible nodes the scheduler tries to find, expressed as a percentage of cluster size. Only those feasible nodes go on to the scoring phase. During Filter, the scheduler scans nodes in a round-robin order and stops when the number of feasible nodes reaches the target computed from the percentage and cluster size. If the target is never reached, all nodes are scanned. The target also has an absolute floor (100 nodes in the upstream source), so small clusters are effectively scanned in full.

The default value is not a fixed number. Kubernetes derives it from cluster size using a linear function that yields roughly 50% for a 100-node cluster and 10% for a 5000-node cluster, with a floor of 5%. You can override it to any value between 1 and 100. Setting it to 100 forces full scans.
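Here is a minimal Python sketch of that calculation. It follows the shape described above (about 50% at 100 nodes, 10% at 5000 nodes, floor of 5%). The slope of one percentage point per 125 nodes and the absolute floor of 100 feasible nodes are assumptions based on a reading of the upstream source, so check them against your Kubernetes version. The function and constant names are hypothetical.

```python
MIN_FEASIBLE_NODES = 100   # assumed absolute floor on the feasible-node target
MIN_PERCENTAGE = 5         # floor for the adaptive percentage
BASE_PERCENTAGE = 50       # starting point of the linear function
NODES_PER_POINT = 125      # assumed slope: one point less per 125 nodes


def default_percentage(num_nodes: int) -> int:
    """Adaptive default used when percentageOfNodesToScore is left unset."""
    return max(BASE_PERCENTAGE - num_nodes // NODES_PER_POINT, MIN_PERCENTAGE)


def num_feasible_nodes_to_find(num_nodes: int, percentage: int = 0) -> int:
    """How many feasible nodes to collect before Filter stops scanning.

    percentage == 0 means "not configured, use the adaptive default".
    """
    if num_nodes < MIN_FEASIBLE_NODES:
        return num_nodes  # small cluster: look at everything
    if percentage == 0:
        percentage = default_percentage(num_nodes)
    return max(num_nodes * percentage // 100, MIN_FEASIBLE_NODES)
```

Under these assumptions a 5000-node cluster gets a target of 500 feasible nodes per Pod, and a 1000-node cluster gets 420.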

To avoid always starting from the same node, the scheduler maintains a round-robin pointer into its node list. For the first Pod, it starts at node 0. For the next Pod, it resumes from the node after the last one checked, wrapping around at the end of the list. The node list itself is ordered so that nodes from different zones are interleaved, which means even a short scan touches several zones. Over many Pods, every node gets a fair chance to be in the scanned subset.
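The pointer logic is easy to model. The sketch below is a sequential simplification: the real scheduler filters nodes in parallel, so the exact stopping point is less precise there. `find_feasible` and its parameters are hypothetical names.

```python
from typing import Callable


def find_feasible(
    nodes: list[str],
    start: int,
    target: int,
    is_feasible: Callable[[str], bool],
) -> tuple[list[str], int]:
    """Scan `nodes` circularly from `start` until `target` feasible nodes are found.

    Returns the feasible nodes and the index where the next Pod's scan begins.
    """
    feasible: list[str] = []
    checked = 0
    total = len(nodes)
    while checked < total and len(feasible) < target:
        node = nodes[(start + checked) % total]
        checked += 1
        if is_feasible(node):
            feasible.append(node)
    # Resume after the last node checked, wrapping around the list.
    next_start = (start + checked) % total if total else 0
    return feasible, next_start
```

Feeding `next_start` back in as `start` for the next Pod gives each Pod a different slice of the cluster.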

Under a 1000-Pod surge, this means each Pod sees only a slice of the cluster. The first few Pods might consider nodes at the beginning of the array. Later Pods see different slices. Combined with in-memory reservations from earlier Pods, the resource landscape shifts. Early Pods land on the most preferred nodes. Later Pods may be forced to less ideal nodes or become unschedulable.

Scoring: Turning Preferences into a Winner

Once feasible nodes are identified, the scheduler must pick the best one. It runs the enabled Score plugins against the feasible nodes, evaluating nodes in parallel. Each plugin’s score for a node must end up as an integer in [0, 100] (the framework’s MinNodeScore and MaxNodeScore). If a score is still outside this range after normalization, the framework returns an error and the scheduling cycle for that Pod fails. The default set of plugins includes NodeResourcesFit, ImageLocality, InterPodAffinity, and others. Each plugin has a weight that you can adjust in the scheduler configuration. A plugin enabled without an explicit weight gets 1, but the default profile gives some plugins, such as TaintToleration and InterPodAffinity, a higher weight.

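NodeResourcesFit is the workhorse. Its scoring strategy is one of LeastAllocated (the default), MostAllocated, or RequestedToCapacityRatio. LeastAllocated favors nodes with the most free resources, spreading Pods across the cluster. MostAllocated favors nodes with the least free resources, bin-packing Pods to keep nodes densely utilized. RequestedToCapacityRatio scores nodes along a utilization curve that you define. The strategy is set once for the plugin. What you tune per resource type is the weight each resource (CPU, memory, extended resources) carries in the score. For a 1000-Pod burst, LeastAllocated pushes the scheduler toward spreading Pods across nodes, while MostAllocated pushes it toward filling a few nodes first and then spilling over. Either way, the other Score plugins still contribute to the final result.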
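To make the two strategies concrete, here is a small Python approximation of how they turn utilization into a 0-100 score. It assumes the score is a weighted average over resources and that `requested` already includes the incoming Pod's request. The function names are hypothetical and integer rounding may differ from the real plugin.

```python
MAX_NODE_SCORE = 100


def least_allocated(requested: int, capacity: int) -> int:
    """Higher score for more free capacity (spreading)."""
    if capacity == 0 or requested > capacity:
        return 0
    return (capacity - requested) * MAX_NODE_SCORE // capacity


def most_allocated(requested: int, capacity: int) -> int:
    """Higher score for less free capacity (bin-packing)."""
    if capacity == 0:
        return 0
    return min(requested, capacity) * MAX_NODE_SCORE // capacity


def resource_score(strategy, requested: dict, capacity: dict, weights: dict) -> int:
    """Weighted average of per-resource scores for one node."""
    total = sum(
        strategy(requested[name], capacity[name]) * weight
        for name, weight in weights.items()
    )
    return total // sum(weights.values())
```

A node with 1000 of 4000 millicores requested scores 75 under `least_allocated` and 25 under `most_allocated`, which is why the same burst produces opposite placement patterns under the two strategies.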

After all plugins have scored each node, the framework runs NormalizeScore for any plugin that implements it. NormalizeScore lets a plugin rescale its scores across all nodes. For example, a plugin might apply min-max normalization to stretch scores to the full 0-100 range. Then the framework computes the final score for each node as a weighted sum: final = Σ(weight_i * normalizedScore_i). The node with the highest sum wins. When several nodes tie for the highest sum, the scheduler picks one of them at random.
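The normalize-then-weight step can be sketched as follows. The plugin names, scores, and helper functions here are made up for illustration, and min-max scaling is only one normalization a plugin might choose.

```python
def min_max_normalize(scores: dict[str, int]) -> dict[str, int]:
    """Stretch one plugin's raw scores to the full 0-100 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # No node stands out for this plugin, so it expresses no preference.
        return {node: 0 for node in scores}
    return {node: (s - lo) * 100 // (hi - lo) for node, s in scores.items()}


def final_scores(
    plugin_scores: dict[str, dict[str, int]],
    weights: dict[str, int],
) -> dict[str, int]:
    """final = sum over plugins of weight * normalized score, per node."""
    totals: dict[str, int] = {}
    for plugin, scores in plugin_scores.items():
        weight = weights.get(plugin, 1)  # unspecified weight counts as 1
        for node, score in scores.items():
            totals[node] = totals.get(node, 0) + weight * score
    return totals


def top_nodes(totals: dict[str, int]) -> list[str]:
    """All nodes sharing the highest final score."""
    best = max(totals.values())
    return [node for node, score in totals.items() if score == best]
```

With `fit` scores of 0, 100, 50 and `image` scores of 100, 0, 50 for nodes a, b, c, a weight of 2 on `fit` gives totals of 100, 200, 150, so node b wins even though it has the worst image score.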

Reserve: In-Memory Accounting

After scoring picks a winner, the scheduler immediately updates its internal node cache: it marks the Pod as assumed on the winning node and runs the Reserve plugins. From that point the Pod’s requests (CPU, memory, and any other requested resources) count against the winning node’s available resources. This reservation is purely in-memory and takes effect before the API server is updated. Subsequent Pods in the same scheduling burst “see” this updated view. This prevents the scheduler from placing multiple Pods on the same node beyond its capacity, even though binding is asynchronous.

If binding later fails, Unreserve rolls back the reservation. But under normal operation, Reserve provides fast, consistent accounting without waiting for API server round-trips. This is critical when 1000 Pods are being scheduled back-to-back. Without it, the scheduler could overcommit a node because the API server’s Pod object updates would not be visible in time.
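A minimal model of this bookkeeping, assuming a single CPU dimension and hypothetical names (`NodeCache`, `reserve`, `unreserve`, `confirm`), might look like this:

```python
class NodeCache:
    """Toy in-memory view of node capacity, tracking CPU in millicores."""

    def __init__(self, capacity: dict[str, int]) -> None:
        self.capacity = dict(capacity)
        self.requested = {node: 0 for node in capacity}
        self.pending: dict[str, tuple[str, int]] = {}  # pod -> (node, cpu)

    def free(self, node: str) -> int:
        return self.capacity[node] - self.requested[node]

    def reserve(self, pod: str, node: str, cpu: int) -> None:
        """Deduct the request right after a node is chosen, before binding."""
        if cpu > self.free(node):
            raise ValueError(f"{node} cannot fit {pod}")
        self.requested[node] += cpu
        self.pending[pod] = (node, cpu)

    def unreserve(self, pod: str) -> None:
        """Binding failed: give the resources back."""
        node, cpu = self.pending.pop(pod)
        self.requested[node] -= cpu

    def confirm(self, pod: str) -> None:
        """Binding succeeded: the deduction becomes permanent."""
        self.pending.pop(pod)
```

The point of the model is ordering: `reserve` runs inside the serial scheduling cycle, while `confirm` or `unreserve` runs later, whenever the asynchronous bind finishes.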

What Happens Under a 1000-Pod Surge

When 1000 Pods arrive at once, the scheduler’s informers enqueue them into activeQ. They are ordered by priority, then by timestamp. The scheduler pops one Pod, runs the scheduling cycle, and if successful, starts a binding cycle. While binding is in flight, the next Pod is popped and its scheduling cycle begins. This means the scheduling cycles are serial but binding cycles are concurrent.
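The queue ordering is a plain priority heap. Here is a sketch using Python's `heapq`; the `ActiveQueue` class is a hypothetical stand-in for activeQ and ignores backoff and the unschedulable pool.

```python
import heapq
import itertools


class ActiveQueue:
    """Pops the highest-priority Pod first; the earlier timestamp wins a tie."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, float, int, str]] = []
        self._seq = itertools.count()  # final tie-breaker keeps pops deterministic

    def push(self, pod: str, priority: int, timestamp: float) -> None:
        # heapq is a min-heap, so negate priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, timestamp, next(self._seq), pod))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[-1]

    def __len__(self) -> int:
        return len(self._heap)
```

In a burst where all 1000 Pods share one priority, this degenerates to first-in, first-out by timestamp.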

Each Pod’s scheduling cycle sees a different set of feasible nodes due to round-robin scanning and in-memory reservations. Early Pods get the “best” nodes according to the scoring plugins. As nodes fill up, later Pods may see those nodes as infeasible or score them lower. If a Pod cannot be scheduled, it goes to the unschedulable pool and is retried when a relevant event occurs (a node is added, a Pod is deleted, etc.) or, as a fallback, after it has waited there for a maximum duration. This avoids thrashing.

The overall placement pattern is an emergent property of the scoring weights, the feasibility search threshold, and the Pod priority ordering. It is not a simple round-robin or a greedy “most free CPU” algorithm. To predict or control the outcome, you must understand the scoring plugins you have enabled and the percentageOfNodesToScore setting.

Quick Reference

Property	Value
Score range per plugin	[0, 100] (MinNodeScore, MaxNodeScore)
Default scoring plugins	NodeResourcesFit, ImageLocality, InterPodAffinity, etc.
NodeResourcesFit strategies	LeastAllocated (spread, default), MostAllocated (bin-pack), RequestedToCapacityRatio (custom curve)
Default `percentageOfNodesToScore`	Linear function: ~50% at 100 nodes, ~10% at 5000 nodes, floor 5%
Minimum feasible nodes to find	`minFeasibleNodesToFind` (internal constant, 100 in the upstream source; the per-Pod target is derived from cluster size and percentage)
Scheduling cycle concurrency	1 Pod at a time (serial)
Node evaluation inside cycle	Parallel across nodes
Binding cycle concurrency	Multiple Pods in parallel
In-memory reservation	Reserve phase updates node cache before API bind
Queue ordering	Pod priority (higher first), then timestamp

Frequently Asked Questions

Q: How does the scheduler avoid overcommitting a node when binding is asynchronous? The Reserve phase updates the scheduler’s in-memory node cache immediately after scoring, before the Pod is bound. Subsequent Pods see the updated available resources. If binding fails, Unreserve rolls back the reservation. This prevents overcommit from the scheduler’s perspective. Actual node usage can still exceed capacity, because the scheduler accounts for resource requests rather than real consumption: Pods with no requests, or with limits well above their requests, can use more than the scheduler set aside for them.

Q: What happens if I set percentageOfNodesToScore to 100 in a 5000-node cluster? Every Pod will run Filter against all 5000 nodes, and every node that passes will then be scored. This increases scheduling latency and CPU usage substantially. For a burst of 1000 Pods, the scheduler may become CPU-bound and scheduling throughput will drop. Use 100 only in small clusters or when you absolutely need every node considered for every Pod.

Q: How does the scheduler decide which scoring plugin wins when scores conflict? Plugins do not “win” individually. The framework computes a weighted sum of normalized scores from all plugins. The node with the highest weighted sum is chosen. You control the influence of each plugin by adjusting its weight in the scheduler configuration. A weight of 0 is not a way to mute a plugin. To remove a plugin’s contribution to the final score, disable it at the score extension point.

Q: Does the scheduler guarantee that all 1000 Pods will be placed eventually? No. If cluster resources are insufficient, some Pods will remain unschedulable. The scheduler places them in the unschedulable pool and retries them when relevant cluster events occur. Preemption can help if the pending Pods have higher priority than Pods already running, because the scheduler can evict the lower-priority ones. But if capacity is permanently short, Pods will stay pending until more nodes are added or existing workloads are removed.

Q: Can I customize the scoring algorithm without writing code? Yes. You can change the weights of built-in plugins and switch the NodeResourcesFit scoring strategy (for example between LeastAllocated and MostAllocated) via the scheduler configuration. For more complex logic, you can write a custom plugin against the scheduling framework. Plugins are compiled into the scheduler binary, so you deploy the result either as an additional scheduler alongside the default one or as a replacement for it.

If you want this kind of breakdown every week — how real systems actually work under the hood — subscribe to Internals Decoded at internalsdecoded.com.