The Mathematics of AI Reliability: SLA-Based Capacity Planning Using Engset's Formula
When your CFO asks “How many AI servers do we need?”, most vendors shrug and say “It depends.” We built a system that gives you a mathematically precise answer.
This white paper explains the queueing theory behind our LLM Hardware Cost Calculator and why it matters for enterprises planning on-premise AI deployments.
Executive Summary
Most AI capacity planning today relies on guesswork: “We have 50 users, so maybe 2 servers?” This approach leads to either under-provisioning (frustrated users, slow responses) or over-provisioning (wasted budget, idle hardware).
We solve this using Engset’s formula—mathematics developed in 1915 that has ensured reliable private telephone networks for over a century. By treating AI inference requests like phone calls to a corporate exchange, we can calculate precisely how many “lines” (hardware capacity) you need to guarantee a specific service level.
Why Engset? Unlike the better-known Erlang formulas (designed for public networks with unlimited callers), Engset was specifically created for finite populations—exactly like your enterprise with a known number of employees. Using Erlang B for enterprise capacity planning leads to systematic over-provisioning; Engset gives you the mathematically correct answer.
Key Benefits:
- Mathematical guarantees instead of educated guesses
- SLA-based sizing: “95% of users get 40 tokens/second” becomes a solvable equation
- Cost optimization: Neither too much nor too little hardware (Engset avoids Erlang B’s over-provisioning)
- Predictable performance: Know exactly what to promise stakeholders
- Century-proven: The same math that dimensions corporate phone systems worldwide
Part 1: The Problem with Traditional Sizing
Existing Approaches and Their Limitations
While there are multiple online proposals for SME-scale LLM hardware sizing, they mostly rely on crude heuristics rather than rigorous mathematical models. Current computation methods fall into three categories:
VRAM-Centric Sizing Guides
Several vendors focus on “can I fit this model and context in GPU memory” with only informal concurrency rules of thumb:
- Lenovo’s LLM Sizing Guide gives formulas to estimate GPU memory from model size, sequence length, attention heads, precision, and batch size, then treats “number of concurrent users” essentially as batch size. It does not provide a probabilistic concurrency or SLA model; instead it suggests collecting customer requirements and stress testing.
- Basebox’s Hardware Sizing Guide gives explicit concurrency heuristics for SMEs (e.g., for “intensive usage” 50 total users → 15–25 concurrent users, 100 total users → 30–50 concurrent users), then back-solves a single GPU recommendation. This is rule-of-thumb peak concurrency, not queueing-based dimensioning.
- Puget Systems’ VRAM sizing guide chooses GPU counts by checking model+context+batch VRAM use and then adding 1 extra GPU “to be safe” if more users or tokens show up, again without a formal SLA probability model.
The limitation: These guides answer “what GPU spec do I need for my model and batch?” but they do not answer “for 50 office users, what hardware guarantees X tokens/second at P95?”
Cloud Quota / TPM-Based Planning
Other guides assume SaaS APIs and dimension around provider quotas (TPM/RPM), not on-premise hardware capacity:
- Azure OpenAI / Databricks capacity guides explain tokens-per-minute (TPM) and requests-per-minute (RPM) quotas, then walk through estimating TPM from user count, requests per user, internal agent calls per request, and average prompt/response token lengths. They add a buffer (e.g., 20%) and choose an enrollment tier that provides sufficient TPM.
- Palantir’s LLM capacity management describes reserving TPM and RPM for interactive queries vs batch jobs and enforcing a fixed split (e.g., 20% capacity reserved for interactive), but treats capacity as a hard quota and does not use Engset/Erlang-type models.
The limitation: These approaches are close conceptually (tokens/time as the unit, interactive vs batch), but they dimension against a quota table, not against a closed-form blocking probability for a finite user base.
SME-Oriented “Local LLM” Calculators
A few tools explicitly target “SME sized” or “local LLM” deployments, but they are mainly heuristic:
- Generic VRAM calculators like “Can You Run This LLM?” let you pick a model and context and tell you how much VRAM you need, but they do not model user populations or SLAs.
- Local-LLM hardware guides (Introl, OpenMetal, etc.) map example organizations (10, 50, 100 users) to one or two GPU SKUs based on assumed model size and broad “light/medium/heavy” usage categories, without a probabilistic concurrency or blocking model.
The limitation: These tools provide rough mappings but lack the mathematical rigor to guarantee specific SLA percentiles.
What Makes Our Approach Different
Our LLM Hardware Cost Calculator is one of the very few public tools that explicitly frames SME hardware sizing as an SLA-based, probability-driven capacity-planning problem with finite users, rather than just VRAM and peak concurrency guesswork.
In short: While there are multiple online proposals for SME-scale LLM hardware sizing, they mostly use VRAM-plus-heuristics or TPM-quota arithmetic. The Engset-based finite-population SLA dimensioning in this whitepaper is genuinely more rigorous and comparatively unique—it provides mathematical guarantees rather than educated guesses.
The Stakes
For enterprises, getting this wrong has real costs:
- Under-provisioned: Users wait 30+ seconds for responses → productivity loss, frustration, shadow IT (users switch to cloud APIs, defeating the purpose of on-premise)
- Over-provisioned: €50,000+ in idle GPUs → wasted capex, harder to justify future AI investments
Part 2: A Solution from Telecommunications
The Origins: 1915–1918
Tore Olaus Engset, a Norwegian engineer, developed his finite-source loss model in a 1915 manuscript and published it in 1918 to dimension early automatic telephone exchanges. His work predates or parallels the better-known Erlang formulas—both became foundational pillars of queueing and teletraffic theory.
Engset’s problem was identical to ours: How many telephone lines does a private exchange (PBX) need to ensure callers rarely hear a busy signal?
Like AI requests, phone calls are:
- Sporadic: Users don’t call continuously
- Variable duration: Some calls are short, others long
- From a finite population: Your company has a fixed number of employees
Engset’s breakthrough was modeling finite populations where capacity is explicitly allocated per source. Unlike public networks (where demand can arrive from anywhere), a private exchange serves a known set of subscribers, each with a defined service entitlement. This enabled precise “grade of service” engineering: given N users and target blocking probability, calculate the exact number of lines needed.
The Analogy
| Telephone Exchange | AI Inference System |
|---|---|
| Telephone lines | GPU compute capacity |
| Callers | Users making AI requests |
| Call duration | Request processing time |
| Busy signal | Request waiting/timing out |
| Blocking probability | Probability user doesn’t get immediate service |
| Private branch exchange (PBX) | On-premise AI deployment |
The key insight: Your AI system is a “private exchange” with a known, finite number of potential users—exactly what Engset’s formula was designed for over a century ago.
Part 3: Why Engset, Not Erlang B?
This is perhaps the most important technical decision in our model. Let us explain.
The Erlang Family
Most engineers are familiar with Erlang B (for blocking probability) and Erlang C (for queuing). These formulas, developed by Danish mathematician A.K. Erlang around 1917, assume an infinite population of potential callers—appropriate for public telephone networks where the number of possible callers is essentially unlimited.
The Critical Difference
| Model | Population Assumption | Best For |
|---|---|---|
| Erlang B | Infinite sources | Public networks, cloud APIs, large-scale services |
| Engset | Finite sources | Private exchanges, enterprise systems, known user counts |
Why does this matter? The key distinction is about demand modeling, not physical constraints.
Engset models the probability distribution of concurrent demand from a known population. Given N users with known usage patterns, it predicts: “What’s the probability that X users need capacity simultaneously?”
This enables provisioning based on probabilities:
- At P95: Size so concurrent demand exceeds capacity only 5% of the time
- At P99: Size so concurrent demand exceeds capacity only 1% of the time
Erlang B, assuming infinite sources, overestimates peak demand for small populations—leading to over-provisioning. Engset gives the statistically correct answer for enterprise-scale deployments.
The Cost of Using the Wrong Model
When the ratio of users to channels is small (typical for enterprise deployments), Erlang B systematically overestimates blocking probability:
Scenario: 20 Users, 5 Channels, Moderate Load
| Model | Blocking Probability |
|---|---|
| Erlang B prediction | ~12% blocking |
| Engset prediction | ~6% blocking |
| Reality | Closer to Engset |
Using Erlang B for enterprise capacity planning leads to over-provisioning: you buy hardware you don’t need because the model assumes a worst case that can’t actually happen with a finite user base.
Convergence at Scale
For very large user populations, Engset and Erlang B converge to the same result—which is why Erlang B became dominant in public telecom planning. But for enterprise AI deployments with 5, 50, or even 500 known users, the difference is significant and Engset is the mathematically correct choice.
Not a Historical Curiosity
Engset’s formula is over a century old but still the correct tool for [finite-source systems](https://repositum.tuwien.at/bitstream/20.500.12708/13123/2/Dombacher Christian - 2010 - Queueing models for call centres.pdf). It hasn’t been “overhauled” so much as embedded into modern queueing theory. Contemporary teletraffic textbooks present Engset alongside Erlang as the standard finite-source loss formula. Recent research continues to develop efficient numerical methods for computing Engset probabilities and extending the model to modern networking contexts.
In practical engineering, Engset-type models still appear in:
- Corporate telephony (PBX systems)
- [Call center capacity planning](https://repositum.tuwien.at/bitstream/20.500.12708/13123/2/Dombacher Christian - 2010 - Queueing models for call centres.pdf)
- Enterprise IT systems with known user counts
- On-premise AI deployments ← this is us
Optional Enhancement: QoS Traffic Shaping
The Engset model sizes capacity so that demand rarely exceeds supply. For most deployments, this is sufficient—users naturally experience good performance because the system isn’t overloaded.
However, for high-utilization or mission-critical deployments, adding a QoS (Quality of Service) layer provides additional guarantees:
| Scenario | Without QoS | With QoS |
|---|---|---|
| Low load | All users get excellent speed | Same |
| Peak load | First-come-first-served (some users may get less) | Fair distribution to all users |
| Overload | Degradation for everyone | Graceful throttling, guaranteed minimums |
When to consider QoS:
- High SLA percentiles (P99, P100) where fairness during peaks is critical
- Multi-tenant deployments where isolation between user groups matters
- Regulated industries requiring auditable service guarantees
For most enterprise deployments, correct sizing via Engset provides the primary guarantee. QoS is an architectural enhancement, not a prerequisite.
Part 4: The Mathematical Model
Defining the System
We model an on-premise LLM deployment as an M/M/c/c/N loss system:
- $N$ users (finite population: your employees)
- $c$ channels (service capacity at the guaranteed rate)
- No queue (if busy, request is “blocked”—user experiences delay)
A channel represents the capacity to serve one user at the guaranteed rate:
$$c = \left\lfloor \frac{\text{total system throughput (tokens/s)}}{\text{guaranteed rate per user (tokens/s)}} \right\rfloor$$
For example: A system with 120 tokens/second total capacity, guaranteeing 40 tokens/second per user, has $c = \lfloor 120/40 \rfloor = 3$ channels (it can serve 3 users simultaneously at full speed).
Traffic Intensity (Erlangs)
Erlang is the standard unit for telecommunications traffic. It represents the average number of simultaneous “calls” (requests) in progress:
$$A = \frac{\text{average aggregate demand (tokens/s)}}{\text{guaranteed rate per user (tokens/s)}}$$
If your 50 users collectively generate an average of 100 tokens/second of demand, and you guarantee 40 tokens/second per user:
$$A = \frac{100}{40} = 2.5 \text{ Erlangs}$$
This means that, on average, 2.5 “channels” are in use at any moment.
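As a quick concrete check, here is how those two quantities fall out in code—a minimal JavaScript sketch using the worked numbers above; the variable names are illustrative:

```javascript
// Worked example from the text: 50 users generating an average of 100 tokens/s of demand,
// 40 tokens/s guaranteed per user, 120 tokens/s of total system capacity.
const totalDemandTps   = 100; // average aggregate demand
const guaranteedTps    = 40;  // per-user SLA rate
const totalCapacityTps = 120; // hardware throughput

const trafficErlangs = totalDemandTps / guaranteedTps;         // offered load in Erlangs
const channels = Math.floor(totalCapacityTps / guaranteedTps); // capacity in "channels"

console.log({ trafficErlangs, channels }); // { trafficErlangs: 2.5, channels: 3 }
```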
The Engset Formula
For $N$ users, $c$ channels, and per-source traffic intensity $a = A/N$, the blocking probability is:
$$P_B = \frac{\binom{N}{c}\,\beta^{c}}{\displaystyle\sum_{i=0}^{c}\binom{N}{i}\,\beta^{i}}, \qquad \beta = \frac{a}{1-a}$$
Where the state probabilities follow:
$$p_i = \frac{\binom{N}{i}\,\beta^{i}}{\displaystyle\sum_{j=0}^{c}\binom{N}{j}\,\beta^{j}}, \qquad i = 0, 1, \dots, c$$
Don’t worry if this looks complex. The key takeaway: Given your user count, their usage patterns, and your hardware capacity, we can calculate the exact probability that a user will experience degraded service.
SLA Compliance
An SLA like “P95: 40 tokens/second minimum” means:
95% of the time, every user receives at least 40 tokens per second.
Here, tokens per second (TPS) serves as the core throughput metric for LLM inference. The relationship between throughput, latency, and user-perceived performance is well-established in LLM serving literature.
Mathematically:
$$P_B(N, c, a) \;\le\; 1 - 0.95 = 0.05$$
Which means: the blocking probability—the chance that a user’s request cannot immediately be served at the guaranteed rate—must not exceed 5%.
Our solver finds the minimum hardware that achieves this guarantee. This probabilistic SLA-based approach contrasts with dynamic batching optimizations, providing closed-form guarantees rather than runtime scheduling decisions.
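In code, this compliance check is a thin wrapper around the blocking probability. A minimal sketch (the name meetsSla is illustrative; calculateBlockingProbability is the function listed in the Technical Appendix):

```javascript
// Does a given configuration satisfy a P-percentile SLA?
// e.g. percentile = 95 allows at most 5% blocking.
function meetsSla(percentile, users, trafficErlangs, channels) {
  const allowedBlocking = 1 - percentile / 100;
  const pb = calculateBlockingProbability(users, trafficErlangs, channels);
  return pb <= allowedBlocking;
}
```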
Part 5: SLA Tiers Explained
Different business contexts require different reliability levels:
| SLA Tier | Blocking Allowed | Typical Use Case | Hardware Impact |
|---|---|---|---|
| P90 | 10% | Internal tools, non-critical apps | Most cost-efficient |
| P95 | 5% | Professional office environments | Standard choice |
| P99 | 1% | Business-critical systems | Higher redundancy required |
| P100 | 0% | Mission-critical, regulated industries | Maximum capacity |
Real-World Interpretation
- P90: “9 out of 10 requests get full speed.” Acceptable for dev tools, internal chat.
- P95: “19 out of 20 requests get full speed.” Good for general office productivity.
- P99: “99 out of 100 requests get full speed.” Required for customer-facing, financial, medical.
- P100: “Every single request gets full speed.” For systems where any delay is unacceptable.
The cost curve is non-linear: going from P95 to P99 can require substantially more hardware, and going from P99 to P100 more still. Choose wisely.
Part 6: The Serial Throughput Constraint
A Critical Limitation
There’s one constraint that pure queueing theory doesn’t capture: No single user can receive more tokens per second than one hardware unit can produce.
Even with 10 GPUs providing 1,200 tokens/second total capacity, if each GPU maxes out at 120 tokens/second, no individual user can ever receive more than 120 tokens/second.
Why This Matters
| Scenario | Total Capacity | Per-Unit Speed | User Experience |
|---|---|---|---|
| 4× AMD (30 TPS each) | 120 TPS | 30 TPS | Each user gets max 30 TPS |
| 1× NVIDIA (120 TPS) | 120 TPS | 120 TPS | Each user gets max 120 TPS |
Same total capacity, very different user experience. For interactive use cases (chat, code completion), per-unit speed matters as much as aggregate capacity.
SLA Impossibility
If your SLA requires 40 tokens/second minimum, but your hardware only produces 30 tokens/second per unit, the SLA is impossible to meet—no matter how many units you deploy.
Our calculator detects this and warns you to either:
- Lower the guaranteed rate
- Choose faster hardware
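As a sketch, the feasibility check itself is a one-liner (the function name is illustrative):

```javascript
// An SLA guaranteeing more tokens/second per user than a single unit can
// produce is unreachable, no matter how many units are deployed.
function slaIsFeasible(guaranteedTpsPerUser, tpsPerUnit) {
  return guaranteedTpsPerUser <= tpsPerUnit;
}
```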
Part 7: Practical Application
The Solver Algorithm
Our calculator implements a simple solver loop: starting from one hardware unit, it adds units until the configured SLA percentile is satisfied.
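A minimal sketch of that loop (illustrative JavaScript, not the production code; unitsRequired and its parameters are assumptions, and it reuses calculateBlockingProbability from the Technical Appendix):

```javascript
// Sizing loop: add hardware units until the SLA percentile is met.
function unitsRequired(users, demandTps, guaranteedTps, tpsPerUnit, percentile, maxUnits = 64) {
  if (guaranteedTps > tpsPerUnit) {
    throw new Error("SLA impossible: guaranteed rate exceeds per-unit speed");
  }
  const allowedBlocking = 1 - percentile / 100;      // e.g. P95 -> 0.05
  const trafficErlangs  = demandTps / guaranteedTps; // offered load in Erlangs

  for (let units = 1; units <= maxUnits; units++) {
    const channels = Math.floor((units * tpsPerUnit) / guaranteedTps);
    const pb = calculateBlockingProbability(users, trafficErlangs, channels);
    if (pb <= allowedBlocking) return units;         // smallest unit count meeting the SLA
  }
  return null; // SLA not reachable within maxUnits
}
```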
Example: Software Startup
Scenario:
- 3 office workers (5,000 tokens/hour each)
- 6 developers (50,000 tokens/hour each)
- SLA: P95 at 40 tokens/second
- Hardware: AMD Ryzen AI Max+ 395 (30 TPS per unit)
Calculation:
For 3 AMD units (90 TPS total, 2 channels at 40 TPS): the P95 target is not met.
For 4 AMD units (120 TPS total, 3 channels at 40 TPS): the P95 target is met.
Result: 4 AMD units required
Example: Medical Practice
Scenario:
- 8 office workers (8,000 tokens/hour each)
- SLA: P99 at 30 tokens/second (higher reliability for medical)
- Hardware: NVIDIA RTX Pro 6000 (120 TPS per unit)
Calculation:
For 1 NVIDIA unit (120 TPS total, 4 channels at 30 TPS): the P99 target is met.
Result: 1 NVIDIA unit required
Part 8: Utilization vs. Capacity
The Paradox of Low Utilization
A common question: “If I only need 4 units for my SLA, why is average utilization only 15%?”
Answer: SLAs are about worst-case guarantees, not average performance.
| Metric | Optimizes For | Result |
|---|---|---|
| Average utilization | Cost efficiency | Users wait during peaks |
| SLA compliance | User experience | Capacity available when needed |
Think of it like fire extinguishers: They sit unused 99.9% of the time, but you’d never remove them to “improve utilization.”
When Low Utilization Is Expected
- High SLA percentiles (P99, P100): Must handle rare peak events
- Bursty usage patterns: Developers may generate far above their average load during coding sprints
- Small user populations: Statistical smoothing requires more headroom
Part 9: Visualization and Monitoring
SLA Compliance Chart
Shows headroom (percentage above/below SLA) across all percentile levels:
- Positive headroom: System exceeds requirements (green zone)
- Zero headroom: System exactly meets SLA (threshold)
- Negative headroom: System fails to meet SLA (red zone)
TPS per User Chart
Shows the maximum tokens/second each user can receive at each percentile:
- Above “Required” line: SLA is met
- Below “Required” line: SLA is not met
- Capped at unit speed: Serial throughput constraint visible
These charts help visualize not just whether you meet your SLA, but how much margin you have.
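One plausible way to compute the headroom metric plotted above (an assumption about the exact definition; the function name is illustrative, and maxTpsPerUser would come from findMaxTpsPerUser in the Technical Appendix):

```javascript
// Headroom as percentage above (+) or below (-) the SLA requirement.
function headroomPercent(maxTpsPerUser, requiredTps) {
  return ((maxTpsPerUser - requiredTps) / requiredTps) * 100;
}
```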
Frequently Asked Questions
Why use Engset instead of Erlang B or Erlang C?
This is the core technical decision explained in Part 3 above. In brief:
- Erlang B/C assume infinite sources → designed for public networks where demand can arrive from anywhere
- Engset assumes finite sources → designed for enterprise systems where you provision capacity per known user
For enterprise AI deployments (5–500 users), Erlang B overestimates required capacity by 50–100%, leading to expensive over-provisioning. Engset models a provisioning strategy where each user has a defined service entitlement, enforced by QoS mechanisms.
Both formulas were developed around 1915–1918 and remain foundational in teletraffic engineering. For large populations (thousands of users), they converge. For enterprise scale, Engset is mathematically correct.
Can users make multiple parallel requests?
Yes—users can have multiple tabs, make parallel API calls, etc. The Engset model doesn’t assume physical constraints; it models the statistical probability of concurrent demand from a known population. Given usage patterns (tokens/hour per user type), it predicts how often total demand will reach various levels. We size hardware so demand exceeds capacity only at the acceptable rate (e.g., 5% for P95).
Do I need QoS software for this to work?
No. Engset sizes capacity based on demand probability. If sized correctly, the system isn’t overloaded at your target percentile—users naturally get good performance. QoS is an optional enhancement for mission-critical deployments where fair distribution during rare peak events is essential, or for multi-tenant isolation.
What if my users have very different usage patterns?
The model aggregates traffic from all user types. A mix of light users (office workers) and heavy users (developers) is handled by summing their expected load. The Engset calculation uses total users and total traffic, which provides a conservative estimate.
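As a sketch of that aggregation, using the startup example from Part 7 (the data structure and names are illustrative):

```javascript
// Convert per-group tokens/hour into aggregate tokens/second of demand.
const userGroups = [
  { count: 3, tokensPerHour: 5000 },  // office workers
  { count: 6, tokensPerHour: 50000 }, // developers
];
const totalUsers = userGroups.reduce((n, g) => n + g.count, 0);        // 9
const totalDemandTps = userGroups.reduce(
  (tps, g) => tps + (g.count * g.tokensPerHour) / 3600, 0);            // ~87.5 TPS
// Dividing by the guaranteed per-user rate then yields the offered traffic in Erlangs.
```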
Can I use this for cloud API capacity planning?
The math applies, but cloud APIs typically have per-request pricing, not fixed capacity. The value of this model is for fixed-capacity systems (on-premise, reserved instances) where you pay for hardware regardless of utilization.
What about request queuing instead of blocking?
Our model assumes “loss” systems (no queue). For systems with queuing, different formulas apply (Erlang-C). However, for interactive AI use cases, queuing often defeats the purpose—users want immediate responses, not “your request is #47 in line.”
How does model quantization affect this?
Quantization changes the tokens/second per unit (model speed), which feeds into the channel calculation. More quantization = faster inference = more channels per unit = potentially fewer units needed for the same SLA.
Is the Engset model too conservative or too aggressive?
For mixed user populations, it’s slightly conservative (recommends marginally more capacity than strictly necessary). This is appropriate for production capacity planning where under-provisioning is more costly than over-provisioning.
Technical Appendix: Implementation Details
Blocking Probability Calculation
```javascript
function calculateBlockingProbability(users, trafficErlangs, channels) {
  if (channels <= 0) return 1.0;
  if (users <= channels) return 0.0; // More channels than users
  if (users === 0) return 0.0;

  const a = trafficErlangs / users; // Per-source traffic intensity
  if (a <= 0) return 0.0;
  if (a >= 1) return 1.0; // Overloaded

  const beta = a / (1 - a);

  // Calculate state probabilities iteratively
  let probabilities = [1.0];
  for (let i = 1; i <= channels; i++) {
    const ratio = ((users - i + 1) / i) * beta;
    probabilities.push(probabilities[i - 1] * ratio);
  }

  const sum = probabilities.reduce((a, b) => a + b, 0);
  return probabilities[channels] / sum;
}
```
Binary Search for Max TPS per User
To find the maximum throughput achievable at a given percentile:
```javascript
function findMaxTpsPerUser(capacity, targetBlockingProb, users, totalTPS) {
  let low = 0.1;
  let high = capacity;

  for (let i = 0; i < 50; i++) {
    const mid = (low + high) / 2;
    const traffic = totalTPS / mid; // Traffic relative to rate being tested
    const channels = capacity / mid;
    const pb = interpolatedBlockingProbability(users, traffic, channels);

    if (pb <= targetBlockingProb) {
      low = mid; // Can achieve this rate
    } else {
      high = mid; // Rate too high
    }
  }
  return low;
}
```
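The binary search above calls interpolatedBlockingProbability, which is not listed in this appendix. A minimal sketch of one reasonable implementation (an assumption, not necessarily the production code) linearly interpolates the Engset result between the neighbouring integer channel counts:

```javascript
// Engset blocking for a possibly fractional channel count, by linear
// interpolation between the two surrounding integer channel counts.
function interpolatedBlockingProbability(users, trafficErlangs, channels) {
  const lower = Math.floor(channels);
  const upper = Math.ceil(channels);
  const pbLower = calculateBlockingProbability(users, trafficErlangs, lower);
  if (lower === upper) return pbLower;
  const pbUpper = calculateBlockingProbability(users, trafficErlangs, upper);
  return pbLower + (pbUpper - pbLower) * (channels - lower);
}
```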
References
Classical Teletraffic Theory
- Domańska, J., et al. “Stationary Queueing Models with Aspects of Customer Impatience.” Telecommunications Review, 2010. Available online
- Engset, T. O. (1918). “Finite-source loss model for automatic telephone exchanges.” Reprinted and discussed in modern queueing surveys. Historical reference
- Virtamo, J. “Finite source population: the M/M/s/s/n system (Engset’s system).” Lecture notes, Aalto University. Available online
- “What is the Engset calculation?” Teletraffic engineering, Wikiversity. Available online
- Cisco Systems. “Traffic Analysis – Engset formula for finite sources.” Cisco VoIP Traffic Analysis guide. Available online
- “Traffic Engineering Techniques in Telecommunications.” Tutorial PDF (covers Erlang B/C and Engset, with trunk dimensioning examples). Available online
- “Traffic Engineering Tutorial.” Available online
- ITAM. “An Introduction to Erlang B and Erlang C.” Lecture notes. Available online
- Columbia University. “Erlang B and C Formulas.” Lecture notes. Available online
Modern Finite-Source Applications
- Dombacher, C. “Queueing Models for Call Centres.” Dissertation, TU Wien, 2010. [Available online](https://repositum.tuwien.at/bitstream/20.500.12708/13123/2/Dombacher Christian - 2010 - Queueing models for call centres.pdf)
- Domańska, J., et al. “Stationary Queueing Models with Aspects of Customer Impatience.” Discusses Engset extensions. Available online
- Modeling Wireless Local Loop with finite subscribers, showing Engset-type loss models still used. ACM Digital Library. Available online
- “Finite-source queueing models in wireless communications.” Available online
LLM Inference Metrics and Performance
- BentoML. “Key metrics for LLM inference.” Covers TPS, TTFT, latency per token. Available online
- Databricks. “LLM Inference Performance Engineering: Best Practices.” Discusses the relationship between TTFT, per-token latency, overall response latency, and the impact of batch size and tensor parallelism. Available online
- AIMultiple. “LLM Latency Benchmark by Use Cases.” Empirical latency benchmarks across models and use cases. Available online
SLA-Based LLM Serving
- Pang, B., et al. “Optimizing LLM Inference Throughput via Memory-aware Dynamic Batching under SLA Constraints.” arXiv:2503.05248, 2025. Available online
- “Scaling Up Throughput-oriented LLM Inference.” arXiv:2509.13201, 2025. Discusses throughput-oriented LLM inference with relaxed latency/SLAs. Available online
Existing Hardware Sizing Methods
- Lenovo. “LLM Sizing Guide.” Provides formulas to estimate GPU memory requirements from model parameters but lacks probabilistic concurrency or SLA models. Available online
- Basebox. “Hardware Sizing for On-Premise AI: The VRAM Calculation Guide.” Uses rule-of-thumb peak concurrency heuristics for SME deployments without queueing-based dimensioning. Available online
- Puget Systems. “Sizing VRAM to Generative AI and LLM Workloads.” Chooses GPU counts based on VRAM requirements with informal safety margins, without formal SLA probability models. Available online
- Microsoft Azure / Databricks. “Provisioned Throughput Tokens.” Capacity planning guides for cloud APIs using TPM/RPM quotas with buffer-based sizing, not probabilistic models for finite user populations. Available online
- Palantir. “LLM Capacity Management.” Describes TPM/RPM reservation strategies for interactive vs batch workloads but treats capacity as hard quotas without Engset/Erlang-type models. Available online
- “Can You Run This LLM?” VRAM calculator. Generic tool for estimating VRAM requirements but does not model user populations or SLAs. Available online
- Introl. “Local LLM Hardware Pricing Guide 2025.” Maps example organizations to GPU SKUs based on usage categories without probabilistic concurrency models. Available online
- OpenMetal. “AI Model Performance: Tokens per Second.” Hardware guides mapping user counts to GPU recommendations without probabilistic blocking models. Available online
Try It Yourself
Experience SLA-based capacity planning with our interactive calculator:
→ LLM Hardware Cost Calculator
The calculator implements everything described in this white paper:
- Real-time Engset calculations
- Hardware comparisons under identical SLAs
- Visual SLA compliance charts
- Industry presets for common scenarios
Related Resources
- Transparently Calculate On-Premises AI Costs — Overview of the calculator’s features and how to use it
- On-Premise AI Solutions for every SME — Practical guide to models, hardware, and operations
- AI Costs by Office Role — Token consumption patterns across different business roles
Citation:
onprem.ai (2025). The Mathematics of AI Reliability: SLA-Based Capacity Planning
Using Engset's Formula. Retrieved from https://onprem.ai/knowhow/llm-sla-capacity-planning-whitepaper