The Mathematics of AI Reliability: SLA-Based Capacity Planning Using Engset's Formula

When your CFO asks “How many AI servers do we need?”, most vendors shrug and say “It depends.” We built a system that gives you a mathematically precise answer.

This white paper explains the queueing theory behind our LLM Hardware Cost Calculator and why it matters for enterprises planning on-premise AI deployments.


Executive Summary

Most AI capacity planning today relies on guesswork: “We have 50 users, so maybe 2 servers?” This approach leads to either under-provisioning (frustrated users, slow responses) or over-provisioning (wasted budget, idle hardware).

We solve this using Engset’s formula—mathematics developed in 1915 that has ensured reliable private telephone networks for over a century. By treating AI inference requests like phone calls to a corporate exchange, we can calculate precisely how many “lines” (hardware capacity) you need to guarantee a specific service level.

Why Engset? Unlike the better-known Erlang formulas (designed for public networks with unlimited callers), Engset was specifically created for finite populations—exactly like your enterprise with a known number of employees. Using Erlang B for enterprise capacity planning leads to systematic over-provisioning; Engset gives you the mathematically correct answer.

Key Benefits:

  • Mathematical guarantees instead of educated guesses
  • SLA-based sizing: “95% of users get 40 tokens/second” becomes a solvable equation
  • Cost optimization: Neither too much nor too little hardware (Engset avoids Erlang B’s over-provisioning)
  • Predictable performance: Know exactly what to promise stakeholders
  • Century-proven: The same math that dimensions corporate phone systems worldwide

Part 1: The Problem with Traditional Sizing

Existing Approaches and Their Limitations

While there are multiple online proposals for SME-scale LLM hardware sizing, they mostly rely on crude heuristics rather than rigorous mathematical models. Current computation methods fall into three categories:

VRAM-Centric Sizing Guides

Several vendors focus on “can I fit this model and context in GPU memory” with only informal concurrency rules of thumb:

  • Lenovo’s LLM Sizing Guide gives formulas to estimate GPU memory from model size, sequence length, attention heads, precision, and batch size, then treats “number of concurrent users” basically as batch size. It does not provide a probabilistic concurrency or SLA model; instead it suggests collecting customer requirements and stress testing.

  • Basebox’s Hardware Sizing Guide gives explicit concurrency heuristics for SMEs (e.g., for “intensive usage” 50 total users → 15–25 concurrent users, 100 total users → 30–50 concurrent users), then back-solves a single GPU recommendation. This is essentially rule-of-thumb peak concurrency, not a queueing-based dimensioning.

  • Puget Systems’ VRAM sizing guide chooses GPU counts by checking model+context+batch VRAM use and then adding 1 extra GPU “to be safe” if more users or tokens show up, again without a formal SLA probability model.

The limitation: These guides answer “what GPU spec do I need for my model and batch?” but they do not answer “for 50 office users, what hardware guarantees X tokens/second at P95?”

Cloud Quota / TPM-Based Planning

Other guides assume SaaS APIs and dimension around provider quotas (TPM/RPM), not on-premise hardware capacity:

  • Azure OpenAI / Databricks capacity guides explain tokens-per-minute (TPM) and requests-per-minute (RPM) quotas, then walk through estimating TPM from user count, requests per user, internal agent calls per request, and average prompt/response token lengths. They add a buffer (e.g., 20%) and choose an enrollment tier that provides sufficient TPM.

  • Palantir’s LLM capacity management describes reserving TPM and RPM for interactive queries vs batch jobs and enforcing a fixed split (e.g., 20% capacity reserved for interactive), but treats capacity as a hard quota and does not use Engset/Erlang-type models.

The limitation: These approaches are close conceptually (tokens/time as the unit, interactive vs batch), but they dimension against a quota table, not against a closed-form blocking probability for a finite user base.

SME-Oriented “Local LLM” Calculators

A few tools explicitly target “SME sized” or “local LLM” deployments, but they are mainly heuristic:

  • Generic VRAM calculators like “Can You Run This LLM?” let you pick a model and context and tell you how much VRAM you need, but they do not model user populations or SLAs.

  • Local-LLM hardware guides (Introl, OpenMetal, etc.) map example organizations (10, 50, 100 users) to one or two GPU SKUs based on assumed model size and broad “light/medium/heavy” usage categories, without a probabilistic concurrency or blocking model.

The limitation: These tools provide rough mappings but lack the mathematical rigor to guarantee specific SLA percentiles.

What Makes Our Approach Different

Our LLM Hardware Cost Calculator is one of the very few public tools that explicitly frames SME hardware sizing as an SLA-based, probability-driven capacity-planning problem with finite users, rather than just VRAM and peak concurrency guesswork.

In short: While there are multiple online proposals for SME-scale LLM hardware sizing, they mostly use VRAM-plus-heuristics or TPM-quota arithmetic. The Engset-based finite-population SLA dimensioning in this whitepaper is genuinely more rigorous and comparatively unique—it provides mathematical guarantees rather than educated guesses.

The Stakes

For enterprises, getting this wrong has real costs:

  • Under-provisioned: Users wait 30+ seconds for responses → productivity loss, frustration, shadow IT (users switch to cloud APIs, defeating the purpose of on-premise)
  • Over-provisioned: €50,000+ in idle GPUs → wasted capex, harder to justify future AI investments

Part 2: A Solution from Telecommunications

The Origins: 1915–1918

Tore Olaus Engset, a Norwegian engineer, developed his finite-source loss model in a 1915 manuscript and published it in 1918 to dimension early automatic telephone exchanges. His work parallels, and partly predates, the better-known Erlang formulas; both became foundational pillars of queueing and teletraffic theory.

Engset’s problem was identical to ours: How many telephone lines does a private exchange (PBX) need to ensure callers rarely hear a busy signal?

Like AI requests, phone calls are:

  • Sporadic: Users don’t call continuously
  • Variable duration: Some calls are short, others long
  • From a finite population: Your company has a fixed number of employees

Engset’s breakthrough was modeling finite populations where capacity is explicitly allocated per source. Unlike public networks (where demand can arrive from anywhere), a private exchange serves a known set of subscribers, each with a defined service entitlement. This enabled precise “grade of service” engineering: given N users and target blocking probability, calculate the exact number of lines needed.

The Analogy

| Telephone Exchange | AI Inference System |
|---|---|
| Telephone lines | GPU compute capacity |
| Callers | Users making AI requests |
| Call duration | Request processing time |
| Busy signal | Request waiting/timing out |
| Blocking probability | Probability user doesn't get immediate service |
| Private branch exchange (PBX) | On-premise AI deployment |

The key insight: Your AI system is a “private exchange” with a known, finite number of potential users—exactly what Engset’s formula was designed for over a century ago.


Part 3: Why Engset, Not Erlang B?

This is perhaps the most important technical decision in our model. Let us explain.

The Erlang Family

Most engineers are familiar with Erlang B (for blocking probability) and Erlang C (for queuing). These formulas, developed by Danish mathematician A.K. Erlang around 1917, assume an infinite population of potential callers—appropriate for public telephone networks where the number of possible callers is essentially unlimited.

The Critical Difference

| Model | Population Assumption | Best For |
|---|---|---|
| Erlang B | Infinite sources | Public networks, cloud APIs, large-scale services |
| Engset | Finite sources | Private exchanges, enterprise systems, known user counts |

Why does this matter? The key distinction is about demand modeling, not physical constraints.

Engset models the probability distribution of concurrent demand from a known population. Given N users with known usage patterns, it predicts: “What’s the probability that X users need capacity simultaneously?”

This enables provisioning based on probabilities:

  • At P95: Size so concurrent demand exceeds capacity only 5% of the time
  • At P99: Size so concurrent demand exceeds capacity only 1% of the time

Erlang B, assuming infinite sources, overestimates peak demand for small populations—leading to over-provisioning. Engset gives the statistically correct answer for enterprise-scale deployments.

The Cost of Using the Wrong Model

When the ratio of users to channels is small (typical for enterprise deployments), Erlang B systematically overestimates blocking probability:

Scenario: 20 Users, 5 Channels, Moderate Load

| Model | Blocking Probability |
|---|---|
| Erlang B prediction | ~12% blocking |
| Engset prediction | ~6% blocking |
| Reality | Closer to Engset |

Using Erlang B for enterprise capacity planning leads to over-provisioning: you buy hardware you don't need because the model assumes a worst case that cannot actually occur with a finite user base.

Convergence at Scale

For very large user populations, Engset and Erlang B converge to the same result—which is why Erlang B became dominant in public telecom planning. But for enterprise AI deployments with 5, 50, or even 500 known users, the difference is significant and Engset is the mathematically correct choice.

Not a Historical Curiosity

Engset’s formula is over a century old but still the correct tool for [finite-source systems](https://repositum.tuwien.at/bitstream/20.500.12708/13123/2/Dombacher Christian - 2010 - Queueing models for call centres.pdf). It hasn’t been “overhauled” so much as embedded into modern queueing theory. Contemporary teletraffic textbooks present Engset alongside Erlang as the standard finite-source loss formula. Recent research continues to develop efficient numerical methods for computing Engset probabilities and extending the model to modern networking contexts.

In practical engineering, Engset-type models still appear in call-centre dimensioning, wireless local loop modeling, and finite-source queueing for wireless communications (see the Modern Finite-Source Applications references).

Optional Enhancement: QoS Traffic Shaping

The Engset model sizes capacity so that demand rarely exceeds supply. For most deployments, this is sufficient—users naturally experience good performance because the system isn’t overloaded.

However, for high-utilization or mission-critical deployments, adding a QoS (Quality of Service) layer provides additional guarantees:

| Scenario | Without QoS | With QoS |
|---|---|---|
| Low load | All users get excellent speed | Same |
| Peak load | First-come-first-served (some users may get less) | Fair distribution to all users |
| Overload | Degradation for everyone | Graceful throttling, guaranteed minimums |

When to consider QoS:

  • High SLA percentiles (P99, P100) where fairness during peaks is critical
  • Multi-tenant deployments where isolation between user groups matters
  • Regulated industries requiring auditable service guarantees

For most enterprise deployments, correct sizing via Engset provides the primary guarantee. QoS is an architectural enhancement, not a prerequisite.


Part 4: The Mathematical Model

Defining the System

We model an on-premise LLM deployment as an M/M/m/m/N loss system:

  • N users (finite population: your employees)
  • m channels (service capacity at the guaranteed rate)
  • No queue (if busy, a request is "blocked"—the user experiences delay)

A channel represents the capacity to serve one user at the guaranteed rate:

$$\text{channels} = \left\lfloor \frac{\text{total\_system\_capacity}}{\text{guaranteed\_rate\_per\_user}} \right\rfloor$$

For example: A system with 120 tokens/second total capacity, guaranteeing 40 tokens/second per user, has 3 channels (can serve 3 users simultaneously at full speed).

Traffic Intensity (Erlangs)

Erlang is the standard unit for telecommunications traffic. It represents the average number of simultaneous “calls” (requests) in progress:

$$\text{Traffic (Erlangs)} = \frac{\text{total\_tokens\_requested\_per\_second}}{\text{guaranteed\_tokens\_per\_user}}$$

If your 50 users collectively generate an average of 100 tokens/second demand, and you guarantee 40 tokens/second per user:

$$\text{Traffic} = \frac{100}{40} = 2.5 \text{ Erlangs}$$

This means on average, 2.5 “channels” are in use at any moment.
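
To make the two inputs concrete, here is the same arithmetic as a small code sketch (variable names are illustrative, not the calculator's internals):

```javascript
// Sizing inputs from the examples above.
const totalCapacityTps = 120;   // aggregate system throughput (tokens/second)
const guaranteedTps = 40;       // per-user guaranteed rate (tokens/second)
const channels = Math.floor(totalCapacityTps / guaranteedTps);   // 3 channels

const demandTps = 100;          // average aggregate demand from all users (tokens/second)
const trafficErlangs = demandTps / guaranteedTps;                // 2.5 Erlangs
```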

The Engset Formula

For N users, m channels, and per-user traffic intensity a (total traffic in Erlangs divided by N), the blocking probability is:

$$P_b = \frac{P(m)}{\sum_{i=0}^{m} P(i)}$$

Where state probabilities follow:

$$\frac{P(i)}{P(i-1)} = \frac{N - i + 1}{i} \cdot \beta \quad \text{where} \quad \beta = \frac{a}{1 - a}$$

Don’t worry if this looks complex. The key takeaway: Given your user count, their usage patterns, and your hardware capacity, we can calculate the exact probability that a user will experience degraded service.
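
Unrolling the recursion (with P(0) = 1) gives an equivalent closed form, which is what the iterative implementation in the Technical Appendix effectively computes:

$$P_b = \frac{\binom{N}{m}\,\beta^{m}}{\sum_{i=0}^{m} \binom{N}{i}\,\beta^{i}}$$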

SLA Compliance

An SLA like “P95: 40 tokens/second minimum” means:

95% of the time, every user receives at least 40 tokens per second.

Here, tokens per second (TPS) serves as the core throughput metric for LLM inference. The relationship between throughput, latency, and user-perceived performance is well-established in LLM serving literature.

Mathematically:

$$(1 - P_b) \geq 95\%$$

Which means:

$$P_b \leq 5\%$$

Our solver finds the minimum hardware that achieves this guarantee. This probabilistic SLA-based approach contrasts with dynamic batching optimizations, providing closed-form guarantees rather than runtime scheduling decisions.


Part 5: SLA Tiers Explained

Different business contexts require different reliability levels:

| SLA Tier | Blocking Allowed | Typical Use Case | Hardware Impact |
|---|---|---|---|
| P90 | 10% | Internal tools, non-critical apps | Most cost-efficient |
| P95 | 5% | Professional office environments | Standard choice |
| P99 | 1% | Business-critical systems | Higher redundancy required |
| P100 | 0% | Mission-critical, regulated industries | Maximum capacity |

Real-World Interpretation

  • P90: “9 out of 10 requests get full speed.” Acceptable for dev tools, internal chat.
  • P95: “19 out of 20 requests get full speed.” Good for general office productivity.
  • P99: “99 out of 100 requests get full speed.” Required for customer-facing, financial, medical.
  • P100: “Every single request gets full speed.” For systems where any delay is unacceptable.

The cost curve is non-linear. Going from P95 to P99 might require 2× the hardware. Going from P99 to P100 might require 3× more. Choose wisely.


Part 6: The Serial Throughput Constraint

A Critical Limitation

There’s one constraint that pure queueing theory doesn’t capture: No single user can receive more tokens per second than one hardware unit can produce.

Even with 10 GPUs providing 1,200 tokens/second total capacity, if each GPU maxes out at 120 tokens/second, no individual user can ever receive more than 120 tokens/second.

Why This Matters

| Scenario | Total Capacity | Per-Unit Speed | User Experience |
|---|---|---|---|
| 4× AMD (30 TPS each) | 120 TPS | 30 TPS | Each user gets max 30 TPS |
| 1× NVIDIA (120 TPS) | 120 TPS | 120 TPS | Each user gets max 120 TPS |

Same total capacity, very different user experience. For interactive use cases (chat, code completion), per-unit speed matters as much as aggregate capacity.

SLA Impossibility

If your SLA requires 40 tokens/second minimum, but your hardware only produces 30 tokens/second per unit, the SLA is impossible to meet—no matter how many units you deploy.

$$\text{If } v_{\text{unit}} < v_{\text{guaranteed}} \Rightarrow \text{SLA impossible}$$

Our calculator detects this and warns you to either:

  1. Lower the guaranteed rate
  2. Choose faster hardware
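
A minimal feasibility guard corresponding to the serial-throughput condition above might look as follows (a sketch; the function and parameter names are illustrative, not the calculator's actual code):

```javascript
// Serial-throughput feasibility check: no number of units can raise a single
// user's rate above what one unit can produce.
function slaFeasible(unitTps, guaranteedTps) {
    return unitTps >= guaranteedTps;
}

slaFeasible(30, 40);   // false: 30 TPS hardware cannot guarantee 40 TPS per user
slaFeasible(120, 40);  // true
```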

Part 7: Practical Application

The Solver Algorithm

Our calculator implements this logic:

$$
\begin{aligned}
&\textbf{For } u = 1 \text{ to } u_{\max}: \\
&\quad 1.\; C = u \times v_{\text{unit}} \\
&\quad 2.\; m = \left\lfloor C / v_{\text{guaranteed}} \right\rfloor \\
&\quad 3.\; P_b = \text{Engset}(N, m, a) \\
&\quad 4.\; \textbf{If } (1 - P_b) \geq \text{SLA}_{\text{target}}: \textbf{ Return } u \\
&\textbf{Return } u_{\max} \text{ (SLA impossible at this scale)}
\end{aligned}
$$
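
In code, the loop might look like this (a minimal sketch that reuses calculateBlockingProbability from the Technical Appendix; parameter names are illustrative, not the calculator's production code):

```javascript
// Smallest number of hardware units that meets the SLA, or null if unreachable.
function unitsRequired({ users, demandTps, unitTps, guaranteedTps, slaTarget, maxUnits = 100 }) {
    const trafficErlangs = demandTps / guaranteedTps;        // offered load in Erlangs
    for (let u = 1; u <= maxUnits; u++) {
        const capacity = u * unitTps;                        // total TPS of u units
        const channels = Math.floor(capacity / guaranteedTps);
        const pb = calculateBlockingProbability(users, trafficErlangs, channels);
        if (1 - pb >= slaTarget) return u;                   // SLA met with u units
    }
    return null;                                             // SLA not achievable within maxUnits
}
```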

Example: Software Startup

Scenario:

  • 3 office workers (5,000 tokens/hour each)
  • 6 developers (50,000 tokens/hour each)
  • SLA: P95 at 40 tokens/second
  • Hardware: AMD Ryzen AI Max+ 395 (30 TPS per unit)

Calculation:

$$
\begin{aligned}
N &= 9 \text{ users} \\
\text{Demand} &= 315{,}000 \text{ tokens/hour} = 87.5 \text{ TPS} \\
\text{Traffic} &= \frac{87.5}{40} = 2.19 \text{ Erlangs}
\end{aligned}
$$

For 3 AMD units (C = 90 TPS, m = 2 channels at 40 TPS):

$$P_b \approx 8\% \quad \Rightarrow \quad \text{Success rate} = 92\% < 95\%$$ ❌

For 4 AMD units (C = 120 TPS, m = 3 channels at 40 TPS):

$$P_b \approx 2\% \quad \Rightarrow \quad \text{Success rate} = 98\% \geq 95\%$$ ✅

Result: 4 AMD units required

Example: Medical Practice

Scenario:

  • 8 office workers (8,000 tokens/hour each)
  • SLA: P99 at 30 tokens/second (higher reliability for medical)
  • Hardware: NVIDIA RTX Pro 6000 (120 TPS per unit)

Calculation:

$$
\begin{aligned}
N &= 8 \text{ users} \\
\text{Demand} &= 64{,}000 \text{ tokens/hour} = 17.8 \text{ TPS} \\
\text{Traffic} &= \frac{17.8}{30} = 0.59 \text{ Erlangs}
\end{aligned}
$$

For 1 NVIDIA unit (C = 120 TPS, m = 4 channels at 30 TPS):

$$P_b \approx 0.1\% \quad \Rightarrow \quad \text{Success rate} = 99.9\% \geq 99\%$$ ✅

Result: 1 NVIDIA unit required


Part 8: Utilization vs. Capacity

The Paradox of Low Utilization

A common question: “If I only need 4 units for my SLA, why is average utilization only 15%?”

Answer: SLAs are about worst-case guarantees, not average performance.

| Metric | Optimizes For | Result |
|---|---|---|
| Average utilization | Cost efficiency | Users wait during peaks |
| SLA compliance | User experience | Capacity available when needed |

Think of it like fire extinguishers: They sit unused 99.9% of the time, but you’d never remove them to “improve utilization.”

When Low Utilization Is Expected

  • High SLA percentiles (P99, P100): Must handle rare peak events
  • Bursty usage patterns: Developers may generate 50× average load during coding sprints
  • Small user populations: Statistical smoothing requires more headroom

Part 9: Visualization and Monitoring

SLA Compliance Chart

Shows headroom (percentage above/below SLA) across all percentile levels:

  • Positive headroom: System exceeds requirements (green zone)
  • Zero headroom: System exactly meets SLA (threshold)
  • Negative headroom: System fails to meet SLA (red zone)

TPS per User Chart

Shows the maximum tokens/second each user can receive at each percentile:

  • Above “Required” line: SLA is met
  • Below “Required” line: SLA is not met
  • Capped at unit speed: Serial throughput constraint visible

These charts help visualize not just whether you meet your SLA, but how much margin you have.


Frequently Asked Questions

Why use Engset instead of Erlang B or Erlang C?

This is the core technical decision explained in Part 3 above. In brief:

  • Erlang B/C assume infinite sources → designed for public networks where demand can arrive from anywhere
  • Engset assumes finite sources → designed for enterprise systems where you provision capacity per known user

For enterprise AI deployments (5–500 users), Erlang B overestimates required capacity by 50–100%, leading to expensive over-provisioning. Engset models a provisioning strategy where each user has a defined service entitlement, which QoS mechanisms can optionally enforce (see Part 3).

Both formulas were developed around 1915–1918 and remain foundational in teletraffic engineering. For large populations (thousands of users), they converge. For enterprise scale, Engset is mathematically correct.

Can users make multiple parallel requests?

Yes—users can have multiple tabs, make parallel API calls, etc. The Engset model doesn’t assume physical constraints; it models the statistical probability of concurrent demand from a known population. Given usage patterns (tokens/hour per user type), it predicts how often total demand will reach various levels. We size hardware so demand exceeds capacity only at the acceptable rate (e.g., 5% for P95).

Do I need QoS software for this to work?

No. Engset sizes capacity based on demand probability. If sized correctly, the system isn’t overloaded at your target percentile—users naturally get good performance. QoS is an optional enhancement for mission-critical deployments where fair distribution during rare peak events is essential, or for multi-tenant isolation.

What if my users have very different usage patterns?

The model aggregates traffic from all user types. A mix of light users (office workers) and heavy users (developers) is handled by summing their expected load. The Engset calculation uses total users and total traffic, which provides a conservative estimate.
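
As an illustration, aggregating mixed user types into the model's two inputs could look like this (a sketch using the rates from the startup example in Part 7):

```javascript
// Aggregate total users and average demand from heterogeneous user types.
const userTypes = [
    { count: 3, tokensPerHour: 5000 },    // office workers
    { count: 6, tokensPerHour: 50000 },   // developers
];
const totalUsers = userTypes.reduce((sum, t) => sum + t.count, 0);                          // 9
const demandTps = userTypes.reduce((sum, t) => sum + t.count * t.tokensPerHour, 0) / 3600;  // ≈ 87.5 TPS
```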

Can I use this for cloud API capacity planning?

The math applies, but cloud APIs typically have per-request pricing, not fixed capacity. The value of this model is for fixed-capacity systems (on-premise, reserved instances) where you pay for hardware regardless of utilization.

What about request queuing instead of blocking?

Our model assumes "loss" systems (no queue). For systems with queuing, different formulas apply (Erlang C). However, for interactive AI use cases, queuing often defeats the purpose—users want immediate responses, not "your request is #47 in line."

How does model quantization affect this?

Quantization changes the tokens/second per unit (model speed), which feeds into the channel calculation. More quantization = faster inference = more channels per unit = potentially fewer units needed for the same SLA.

Is the Engset model too conservative or too aggressive?

For mixed user populations, it’s slightly conservative (recommends marginally more capacity than strictly necessary). This is appropriate for production capacity planning where under-provisioning is more costly than over-provisioning.


Technical Appendix: Implementation Details

Blocking Probability Calculation

```javascript
// Engset blocking probability for `users` sources offering `trafficErlangs`
// of total load to `channels` servers.
function calculateBlockingProbability(users, trafficErlangs, channels) {
    if (channels <= 0) return 1.0;
    if (users <= channels) return 0.0;  // At least as many channels as users: never blocked
    if (users === 0) return 0.0;

    const a = trafficErlangs / users;  // Per-source traffic intensity
    if (a <= 0) return 0.0;
    if (a >= 1) return 1.0;  // Overloaded

    const beta = a / (1 - a);

    // Calculate state probabilities iteratively: P(i) = P(i-1) * ((N - i + 1) / i) * beta
    let probabilities = [1.0];
    for (let i = 1; i <= channels; i++) {
        const ratio = ((users - i + 1) / i) * beta;
        probabilities.push(probabilities[i - 1] * ratio);
    }

    const sum = probabilities.reduce((a, b) => a + b, 0);
    return probabilities[channels] / sum;
}
```
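
Example usage, with the offered load and channel count from the medical-practice example in Part 7:

```javascript
// 8 users offering 0.59 Erlangs to 4 channels (medical-practice scenario).
const pb = calculateBlockingProbability(8, 0.59, 4);
console.log(`Blocking probability: ${(pb * 100).toFixed(2)}%`);
```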

Binary Search for Max TPS per User

To find the maximum throughput achievable at a given percentile:

```javascript
function findMaxTpsPerUser(capacity, targetBlockingProb, users, totalTPS) {
    let low = 0.1;
    let high = capacity;

    for (let i = 0; i < 50; i++) {
        const mid = (low + high) / 2;
        const traffic = totalTPS / mid;  // Traffic relative to rate being tested
        const channels = capacity / mid;
        const pb = interpolatedBlockingProbability(users, traffic, channels);

        if (pb <= targetBlockingProb) {
            low = mid;  // Can achieve this rate
        } else {
            high = mid;  // Rate too high
        }
    }

    return low;
}
```
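
The helper interpolatedBlockingProbability is not defined above. One plausible sketch (an assumption, not necessarily the production implementation) evaluates the Engset formula at the two nearest integer channel counts and interpolates linearly:

```javascript
// Assumed helper: Engset blocking at fractional channel counts via linear interpolation.
// Relies on calculateBlockingProbability defined earlier in this appendix.
function interpolatedBlockingProbability(users, trafficErlangs, channels) {
    const lower = Math.floor(channels);
    const upper = Math.ceil(channels);
    const pbLower = calculateBlockingProbability(users, trafficErlangs, lower);
    if (lower === upper) return pbLower;
    const pbUpper = calculateBlockingProbability(users, trafficErlangs, upper);
    return pbLower + (pbUpper - pbLower) * (channels - lower);
}
```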

References

Classical Teletraffic Theory

  • Domańska, J., et al. “Stationary Queueing Models with Aspects of Customer Impatience.” Telecommunications Review, 2010. Available online

  • Engset, T. O. (1918). “Finite-source loss model for automatic telephone exchanges.” Reprinted and discussed in modern queueing surveys. Historical reference

  • Virtamo, J. “Finite source population: the M/M/s/s/n system (Engset’s system).” Lecture notes, Aalto University. Available online

  • “What is the Engset calculation?” Teletraffic engineering, Wikiversity. Available online

  • Cisco Systems. “Traffic Analysis – Engset formula for finite sources.” Cisco VoIP Traffic Analysis guide. Available online

  • “Traffic Engineering Techniques in Telecommunications.” Tutorial PDF (covers Erlang B/C and Engset, with trunk dimensioning examples). Available online

  • “Traffic Engineering Tutorial.” Available online

  • ITAM. “An Introduction to Erlang B and Erlang C.” Lecture notes. Available online

  • Columbia University. “Erlang B and C Formulas.” Lecture notes. Available online

Modern Finite-Source Applications

  • Dombacher, C. “Queueing Models for Call Centres.” Dissertation, TU Wien, 2010. [Available online](https://repositum.tuwien.at/bitstream/20.500.12708/13123/2/Dombacher Christian - 2010 - Queueing models for call centres.pdf)

  • Domańska, J., et al. “Stationary Queueing Models with Aspects of Customer Impatience.” Discusses Engset extensions. Available online

  • Modeling Wireless Local Loop with finite subscribers, showing Engset-type loss models still used. ACM Digital Library. Available online

  • “Finite-source queueing models in wireless communications.” Available online

LLM Inference Metrics and Performance

  • BentoML. “Key metrics for LLM inference.” Covers TPS, TTFT, latency per token. Available online

  • Databricks. “LLM Inference Performance Engineering: Best Practices.” Discusses relationship between TTFT, per-token latency, overall response latency, and impact of batch size and tensor parallelism. Available online

  • AIMultiple. “LLM Latency Benchmark by Use Cases.” Empirical latency benchmarks across models and use cases. Available online

SLA-Based LLM Serving

  • Pang, B., et al. “Optimizing LLM Inference Throughput via Memory-aware Dynamic Batching under SLA Constraints.” arXiv:2503.05248, 2025. Available online

  • “Scaling Up Throughput-oriented LLM Inference.” arXiv:2509.13201, 2025. Discusses throughput-oriented LLM inference with relaxed latency/SLAs. Available online

Existing Hardware Sizing Methods

  • Lenovo. “LLM Sizing Guide.” Provides formulas to estimate GPU memory requirements from model parameters but lacks probabilistic concurrency or SLA models. Available online

  • Basebox. “Hardware Sizing for On-Premise AI: The VRAM Calculation Guide.” Uses rule-of-thumb peak concurrency heuristics for SME deployments without queueing-based dimensioning. Available online

  • Puget Systems. “Sizing VRAM to Generative AI and LLM Workloads.” Chooses GPU counts based on VRAM requirements with informal safety margins, without formal SLA probability models. Available online

  • Microsoft Azure / Databricks. “Provisioned Throughput Tokens.” Capacity planning guides for cloud APIs using TPM/RPM quotas with buffer-based sizing, not probabilistic models for finite user populations. Available online

  • Palantir. “LLM Capacity Management.” Describes TPM/RPM reservation strategies for interactive vs batch workloads but treats capacity as hard quotas without Engset/Erlang-type models. Available online

  • “Can You Run This LLM?” VRAM calculator. Generic tool for estimating VRAM requirements but does not model user populations or SLAs. Available online

  • Introl. “Local LLM Hardware Pricing Guide 2025.” Maps example organizations to GPU SKUs based on usage categories without probabilistic concurrency models. Available online

  • OpenMetal. “AI Model Performance: Tokens per Second.” Hardware guides mapping user counts to GPU recommendations without probabilistic blocking models. Available online


Try It Yourself

Experience SLA-based capacity planning with our interactive calculator:

→ LLM Hardware Cost Calculator

The calculator implements everything described in this white paper:

  • Real-time Engset calculations
  • Hardware comparisons under identical SLAs
  • Visual SLA compliance charts
  • Industry presets for common scenarios


Citation:

onprem.ai (2025). The Mathematics of AI Reliability: SLA-Based Capacity Planning
Using Engset's Formula. Retrieved from https://onprem.ai/knowhow/llm-sla-capacity-planning-whitepaper