On-Premise AI Solutions for Every SME
With the right choices, any SME can run a professional AI server on‑premise. We explain what matters and give concrete recommendations based on our practical experience.

Introduction
Many customers ask us: Are the top‑tier AIs only available in the cloud? Aren’t there affordable on‑premise options? While cloud giants bombard us daily with tempting AI offers, on‑prem solutions for SMEs still seem to lag behind.
For Swiss companies in fields like private banking, fiduciary services, law, or medical technology, a professional on‑premise infrastructure is often the only viable option: Risks in the cloud—such as data leaks or industrial espionage—can devastate a hard‑won reputation. Understandably, many SMEs prefer not to take that risk.
Uncertainty is high: Which concrete on‑prem solutions for SMEs are succeeding today? How do they compare to the cloud in performance, and what do they cost? We see too many offerings that run on unsuitable hardware. These lead to disappointment and fuel the misconception that professional AI cannot be run on‑premise.
In this article, we want to bring clarity and, based on our many years of practical experience, spell out exactly what matters for a successful on‑premise AI solution.
Contents
- Choosing the right AI for professional on‑premise applications
- AI server vs. running AI directly on your workstation
- Matching hardware with the ideal price‑performance ratio
- Modular server software for smooth operations
Choosing the Right AI for Professional On‑Premise Applications
Are the best AIs only in the cloud? It can seem that way when listening to cloud providers. The fact is: many of today’s leading AI models are available as open source, with licenses that even allow commercial use at no cost. The selection is almost too large, and new open‑source models appear continuously. So which model should you choose? What should you optimize for?
One factor has proven to be the most reliable starting point in practice: model size in GB (checkpoint). Important: model size in GB is not the same as parameter count. Parameter counts can be misleading, for example when models are heavily quantized. Model size in GB is a decisive factor both for a model’s intelligence and for hardware selection.
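As a rough guide (assuming a dense model; mixture-of-experts checkpoints behave somewhat differently): checkpoint size in GB ≈ parameter count × bits per weight ÷ 8. A 20-billion-parameter model therefore occupies roughly 40 GB at 16-bit precision, 20 GB at 8-bit, and 10–12 GB when quantized to 4 bits. This is why a large parameter count alone says little about how capable the downloaded model actually is.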
Model sizes at a glance
| Category | Small | Medium | Large |
|---|---|---|---|
| Model size | 2–5 GB | 14–20 GB | 60 GB+ |
| Intelligence | unreliable 😟 | usable 🙂 | top‑class 😁 |
| MMLU benchmark (open‑source model) | 63 (Mistral‑7B) | 85 (gpt‑oss:20b) | 90 (gpt‑oss:120b) |
| Comparable closed‑source AIs | better than GPT‑3, worse than GPT‑3.5 | o4‑mini, Claude 3.5 Sonnet, Gemini 1.5 Pro | GPT‑4.1, GPT‑5 |
| Typical hardware | CPU, small GPU | NVIDIA RTX 5090 | NVIDIA RTX Pro 6000 |
| Speed | ~2 words/s | ~30 words/s | ~60 words/s |
Models that are too small
Small models repeatedly make headlines: an AI that runs on a mini computer sounds fascinating. Unfortunately, such pieces are mostly about clicks. In serious use, you quickly notice: They often fail to grasp the task, talk a lot without saying anything concrete (the “politician syndrome”), go off on tangents, or—in the worst case—make up facts (often called “hallucinations”).
The consequence: You spend far too much time re‑explaining the task or correcting results, and end up disappointed. We therefore advise against under‑sized models. For on‑prem AI products, always check the model size: Anything below 14 GB or below 20B parameters is generally not fit for sustained professional use.
Medium‑sized models
By now, these are quite practical in professional settings. The price‑performance ratio is compelling. While some post‑editing is still required, these models deliver real value and noticeably boost productivity.
They shine particularly in creative work, marketing, or sales. They reliably handle tasks of moderate complexity, such as generating smart templates or drafting responses.
Large models
For demanding domains like private banking, fiduciary services, medical technology, or defense, large models are indispensable. In these highly sensitive areas—where factual accuracy and reliability are top priorities—only the large models provide sufficient assurance.
Their key advantage is superior cognitive capability. They process complex relationships, think ahead, and proactively consider critical details. When unsure, they ask targeted follow‑up questions to avoid misunderstandings—an essential safety mechanism for business‑critical applications.
AI server vs. running AI directly on your workstation
Professional AI use in a company requires a stable and high‑performance infrastructure. We therefore recommend running your AI workloads on a dedicated server configured specifically for these compute‑intensive tasks and operating independently of everyday work devices.
Separation of concerns
Stability is crucial in professional environments: That’s why we advise our clients to run their AI on a dedicated server with suitable hardware. This reduces risks such as system overloads and ensures other business‑critical applications can continue running undisturbed.
Office PCs or laptops are designed for user interfaces and low‑compute software such as office applications. Large AI models, by contrast, present immense computational demands and require a different software and hardware environment.
A clean separation is both more stable and more flexible. If the AI runs on its own server, updates, maintenance, and backups can be performed centrally and independently of the workstation. You keep control over office computers while the AI continues to work reliably.
Noise and heat
Compute means heat, heat means cooling, and cooling in most cases means noise. Similar to gaming PCs, medium and large AI models produce a lot of heat and therefore require noticeable cooling. Gamers with headphones may not mind, but in an office you need to concentrate: the cooling of a professional 600 W AI system sounds roughly like a small vacuum cleaner and can be distracting.
Ideally, you have a server room or a lockable room such as a basement or even a storage closet to place the AI server. Smaller server clusters often get by without room air‑conditioning. If you must place an AI server in a work area, water cooling is advisable. It still uses fans and is somewhat more expensive, but is significantly quieter at the same performance because the heat can be dissipated over a larger surface area.
Background tasks
A dedicated AI server can continuously do preparatory work in the background. There are various types of such background tasks. One is document indexing, which can make search via AI significantly more efficient and successful. Documents are embedded into a vector space so that semantic similarities can be recognized and queries can be answered more precisely.
Another form of background processing that boosts SME efficiency is automated document handling. For example, incoming emails can be analyzed to automatically generate structured reports. Ongoing quality and compliance checks can also run in the background, saving resources while ensuring standards are met.
Matching hardware with the ideal price‑performance ratio
Professional AI models require a well‑thought‑out combination of abundant memory and high bandwidth for the large data volumes transferred between memory and processors.
For on-premise AI solutions, there are essentially two options today: UMA (Unified Memory Architecture, explained below) or GPU-based systems. In the following table, we present concrete examples of both architectures for professional AI use in an SME context:
| | Beelink GTR9 Pro | Apple Mac Studio 96 GB | Gaming PC (RX 7900 XTX) | Gaming PC (RTX 5090) | Workstation |
|---|---|---|---|---|---|
| Architecture | UMA | UMA | GPU | GPU | GPU |
| GPU model | Radeon 8060S iGPU | Integrated | AMD RX 7900 XTX | NVIDIA RTX 5090 | NVIDIA RTX Pro 6000 |
| Framework | ROCm | Metal | ROCm | CUDA | CUDA |
| Bandwidth | 256 GB/s | 409 GB/s | 960 GB/s | 1’792 GB/s | 1’792 GB/s |
| Price (system) | € 2’000 | € 3’800 | € 2’500 (GPU only: € 1’000) | € 3’600 (GPU only: € 2’100) | € 10’500 (GPU only: € 9’000) |
| VRAM / unified memory | 128 GB | 96 GB | 24 GB | 32 GB | 96 GB |
| Suitable for | Large AI 😁 | Large AI 😁 | Medium AI 😑 | Medium AI 😑 | Large AI 😁 |
| LLM benchmark (max MMLU) | 90 (gpt‑oss:120b) | 90 (gpt‑oss:120b) | 85 (gpt‑oss:20b) | 85 (gpt‑oss:20b) | 90 (gpt‑oss:120b) |
| AI speed | Usable 😑 | Tight 😟 | Usable 😑 | Fast 😁 | Fast 😁 |
| Words/s (≈ tokens/2) | 12 | 9 | 16 | 95 | 95 |
| Time to first token (10 pages of context) | 1 s | 30 s | 3 s | 1 s | 1 s |
| Comparable to | GPT‑4.1 | GPT‑4.1 | o4‑mini | o4‑mini | GPT‑4.1 |
| Cost per MMLU point (lower is better) | 16 | 31 | 12 | 25 | 100 |
| Cost per GB/s bandwidth (lower is better) | 5.5 | 6.8 | 1.0 | 1.2 | 5.0 |
| Cooling noise | Quiet | Quiet | Loud | Loud | Loud |
| Power draw | 230 W | 270 W | 350 W | 575 W | 600 W |
| Conclusion | Best for interactive use (best price/performance for a large LLM) | Quiet and compact for large LLMs, but slow with long contexts | Best price for medium‑size LLMs | Best for batch processing (best price/performance for a medium‑size LLM) | Best overall, but also a big price tag |
Graphics cards
Graphics cards are the undisputed winners for cost‑effective hardware suited to medium and large AI models. It may sound confusing: Why graphics cards when chatbots are text‑based?
AI models and 3D rendering share something fundamental: matrix math. No worries, we won’t bore you with the details. In short, the same computations used for perspective transformations also power neural networks. GPUs were originally built for rendering, but this shared math makes them just as well suited to AI workloads today.
Major vendors are pushing hard on hardware specifically optimized for neural networks, but lifecycles for fundamental architectural changes take decades. Until then, modern GPUs will likely remain ahead. A frontrunner here could be Unified Memory Architecture, which we explain next.
Unified Memory Architecture (UMA)
Most computers have two separate memories: RAM for the processor (CPU) and VRAM for the graphics unit (GPU). This separation prevailed because it offers economically attractive flexibility: components can be mass‑produced and assembled as needed.
The classic split between RAM and VRAM also has downsides: Components are connected via sockets and therefore sit “relatively far” apart. We’re talking centimeters, but at the signal speeds involved, the distance between memory and compute units matters.
In a Unified Memory Architecture (UMA), RAM and VRAM are combined into a single memory. This reduces the distance between memory and processors and increases bandwidth. To push bandwidth even further, memory and processors are soldered onto the mainboard, as with Apple processors or AI‑optimized UMA architectures.
In practice, soldered UMA architectures achieve roughly 4× the bandwidth of conventional, socket‑based components, yet they are still about 4× slower than GPUs. Bandwidth has a direct effect on AI speed measured in words per second.
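A back-of-the-envelope calculation makes the relationship concrete (an upper bound that ignores compute and other overhead, and assumes a dense model): generation speed ≈ memory bandwidth ÷ bytes read per generated token, which is roughly the checkpoint size. A 20 GB model on a 256 GB/s UMA system therefore tops out around 12 tokens per second, while the same model on a 960 GB/s graphics card can approach 48. Mixture-of-experts models such as gpt-oss read only their active experts per token, which is why they run faster than their total checkpoint size suggests.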
Clusters
When configuring an AI cluster with consumer hardware, the lowest common denominator is decisive. Overall system performance is determined by the weakest component. While compute can be multiplied by adding server nodes, each component must meet minimum requirements.
For example, it would not make sense to connect 20 Raspberry Pis into a cluster to run a large AI model. Even with theoretically added compute, the system would be unusably slow in practice because each node fails to meet the minimum requirements for efficient AI computation.
As a rule of thumb: With consumer hardware, the weakest component in the AI cluster determines which model size can be run at which speed. To serve multiple users in parallel, you can effectively cluster these base components and use them together, scaling throughput linearly—but not single‑request speed.
Modular server software for smooth operations
Everyone knows the problem: Windows needs another update and then suddenly something stops working. The reason is that programs from different developers must coexist on a computer without interfering with each other. Dependencies can get out of sync during updates.
Operating systems like Windows, Linux, or macOS have been continuously improved for decades. In professional environments, however, there is generally zero tolerance for incompatibilities. Additional measures are therefore used to maximize stability and process safety—and thereby reduce maintenance effort and outages.
Docker Compose

Probably the best‑known and most widely used platform for modular operation of complex software is “Docker Compose.” Docker is based on the concept of “containers,” analogous to shipping containers in global trade. Thanks to strict standards for dimensions and load‑bearing capacity, shipping containers fit on ships, trains, and trucks worldwide and can be stacked regardless of their contents. They can be deployed anywhere.
Software containers are likewise independent of their environment. For companies, this means different applications can run isolated and safely without conflicts. Docker Compose is particularly suitable for smaller SME setups and for quickly entering the world of professional AI.
With a compact docker-compose.yml file, multiple services—such as an AI API, a database, and a web frontend—can be started with a single command. This significantly reduces complexity and enables IT teams without deep DevOps experience to set up a stable on‑premise infrastructure. Particularly convincing: updates and rollbacks are very easy with Compose, which significantly increases operational reliability in SMEs.
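To illustrate, here is a minimal sketch of such a file, assuming Ollama as the local AI API and Open WebUI as the web frontend on an NVIDIA GPU; image names, ports, and the GPU reservation are assumptions you would adapt to your own setup:

```yaml
# docker-compose.yml - minimal sketch; adapt images, ports, and GPU settings
services:
  ollama:                               # local AI API that serves the model
    image: ollama/ollama:latest
    volumes:
      - ollama-models:/root/.ollama     # persist downloaded models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia            # requires the NVIDIA Container Toolkit on the host
              count: 1
              capabilities: [gpu]

  webui:                                # web frontend for employees
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                     # reachable at http://<server>:3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama-models:
```

A single `docker compose up -d` starts both services; `docker compose pull` followed by another `up -d` performs an update, and pinning image tags instead of `latest` keeps rollbacks just as simple.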
Typical use cases in SMEs include:
- Provisioning single AI instances for internal use
- Document search with Elasticsearch or MeiliSearch combined with an AI interface
- Small pilot projects that can later scale to Kubernetes
Kubernetes
As soon as dozens of users access an AI infrastructure simultaneously, or different models need to run in parallel, Kubernetes is the professional solution. Kubernetes provides automatic scaling, load distribution, and self‑healing when a service fails. This allows stable operation of larger setups with multiple GPU servers in a cluster.
For SMEs, getting started with Kubernetes is more complex, but it brings clear advantages:
- Central management of clusters with multiple AI servers
- Rolling out updates without downtime
- Integration of load balancers, secrets, and monitoring tools
- Flexible extension with additional services such as vector databases or API gateways
Especially in sensitive industries like banking and healthcare—where high availability is mandatory—Kubernetes is a solid foundation. In smaller environments, a hybrid approach often makes sense: start with Docker Compose during the pilot phase, then scale to Kubernetes later.
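For orientation, here is a minimal sketch of what such a workload can look like: one GPU-backed model server with two replicas. The image and names are placeholders, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed in the cluster:

```yaml
# deployment.yaml - minimal sketch; adapt name, image, and GPU count
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2                            # two model servers, e.g. behind one Service
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest    # assumed runtime; substitute your own
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1          # one GPU per replica via the NVIDIA device plugin
```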
Remote maintenance
For on‑premise systems, professional maintenance is highly recommended. Outages can massively disrupt business operations. Remote maintenance tools ensure that system administrators always have an overview of performance and can intervene early in the event of problems.
Important building blocks for remote maintenance are:
- Prometheus + Grafana for detailed metrics (GPU utilization, memory usage, network load); a minimal configuration sketch follows this list
- Alertmanager with email, SMS, or Teams/Slack notifications for critical states
- Remote logging solutions such as Loki or the ELK stack to analyze root causes after the fact
- VPN tunnels for secure access to dashboards and maintenance systems, even outside the office
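To illustrate the metrics part, a minimal Prometheus scrape configuration could look like the sketch below; the host name and the exporters (NVIDIA DCGM exporter for GPU metrics, node_exporter for system metrics) are assumptions you would adapt to your environment:

```yaml
# prometheus.yml - minimal sketch; host names, ports, and exporters are assumptions
global:
  scrape_interval: 15s                  # collect metrics every 15 seconds

scrape_configs:
  - job_name: "gpu"
    static_configs:
      - targets: ["ai-server:9400"]     # NVIDIA DCGM exporter (GPU utilization, VRAM)
  - job_name: "system"
    static_configs:
      - targets: ["ai-server:9100"]     # node_exporter (CPU, RAM, disk, network)
```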
A modular maintenance setup provides not only security but also transparency: Decision‑makers can track the efficiency of their on‑premise AI at any time and thus make a stronger case to leadership that the investment is worthwhile.