API Standards for Large Language Models (LLM) AI

The Common Denominator: “OpenAI-Compatible API”

In the LLM world, there is currently (Dec. 2025) no formal standard ratified by a governing body, as there is for HTTP or SQL. Instead, a de facto standard has emerged that most providers support: the “OpenAI-Compatible” API.

In this practical article, we will:

  1. Explain the most important API, “Chat Completion”, in OpenAI format
  2. Examine the best-known LLM models and their control parameters
  3. Use LiteLLM to access various models with a unified API
  4. Touch on advanced topics such as multimodality, structured outputs, reasoning, tool calling, MCP, etc.

1. The Most Important API: “Chat Completion” in OpenAI Format

For “simply generating text”, the central interface today is the Chat Completion API in OpenAI style. LiteLLM mirrors this format 1:1 and translates it internally for various providers. This allows a user to access different models with the same API call and receive the output in a unified format.

Basic Principle

The API is HTTP + JSON and typically works like this:

  • Endpoint: POST http://localhost:4000/chat/completions
  • Authentication: Authorization: Bearer <API-KEY> (API-KEY is configured in LiteLLM)
  • Body (minimal request without optional parameters):
{
  "model": "ollama/qwen2.5:32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Describe a neural network in two sentences." }
  ],
  "stream": false
}

The response looks like this:

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "A neural network is a mathematical model consisting of artificial neurons. It is trained to perform specific tasks, such as recognizing images or understanding language."
    },
    "finish_reason": "stop"
  }
  ],
  "usage": { "prompt_tokens": 10, "completion_tokens": 15, "total_tokens": 25 }
}
  • id: technical request ID
  • choices[0].message.content: the actual answer text
  • usage: token statistics (prompt tokens, completion tokens, total). Tokens are an extensive topic for another article. Here it is sufficient to know that 1,000 tokens correspond to approximately 750 words.
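For orientation, here is a minimal Python sketch that sends exactly this request and reads the answer; URL, API key, and model name are placeholders for your own setup:

# Minimal sketch: call an OpenAI-compatible /chat/completions endpoint
# (assumed here: a local LiteLLM proxy on port 4000, placeholder API key).
import requests

resp = requests.post(
    "http://localhost:4000/chat/completions",
    headers={"Authorization": "Bearer sk-local-example"},
    json={
        "model": "ollama/qwen2.5:32b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Describe a neural network in two sentences."},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

print(data["choices"][0]["message"]["content"])  # the actual answer text
print(data["usage"])                             # token statistics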

Core Request Fields

The most important fields in the OpenAI-compatible quasi-standard:

  • model (Required) String describing the desired model. The exact syntax varies by provider (e.g., ollama/qwen2.5:32b or ollama/gpt-oss:20b for on-premise models), but the field is present everywhere.

  • messages (Required) – The Heart of the API

The messages array is the central mechanism for communicating with the model. It is a chronological list of messages that forms a conversation history. This messages structure with its three roles (system, user, assistant) is a universal common denominator that virtually all modern LLM APIs understand.

Even for use cases that are not true conversations (such as one-time document analysis, code generation, or data processing), this format is used. In such cases, the array typically consists of only two messages: a system message for instructions and a user message with the actual task.

The Three Core Roles:

The Instruction Level: "system"

  • What: Defines the role, context, and behavioral rules for the model
  • When: Usually as the first message in the array (optional, but recommended)
  • Practical Use:
    • Role Definition: "You are a helpful assistant for financial analysis."
    • Set Context: "You analyze quarterly reports for a Swiss SME."
    • Control Behavior: "Always answer in German, be precise and cite sources."
    • Format Specifications: "Always answer in JSON format with the fields 'Summary' and 'Risks'."
  • Important: System messages are treated by the model as “meta-instructions” – they shape the entire conversation but are not considered part of the actual dialogue.

The Request Level: "user"

  • What: The actual questions, tasks, or inputs from the user
  • When: Whenever the user asks something or assigns a task
  • Practical Use:
    • Simple Questions: "What is Docker?"
    • Tasks: "Analyze this contract for risks: [Text...]"
    • Multi-Turn Conversations: Multiple user messages in sequence for follow-up questions
    • Context Enrichment: "Here is the code: [Code...]. Explain the function."

The Response Level: "assistant"

  • What: Previous responses from the model
  • When: In multi-turn conversations, to show the model the history so far
  • Practical Use:
    • Conversation Memory: When you ask a second question, include the first assistant response so the model retains context
    • Corrections: If the model makes an error, you can mark the incorrect answer as assistant and send a new user message with the correction
    • Iterative Improvement: assistant → user (“That’s good, but add…”) → assistant → …

Examples of Applying the 3 Core Roles

Example: Simple Single-Turn Request

{
  "model": "ollama/qwen2.5:32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Name three advantages of Docker." }
  ]
}

Example: Multi-Turn Conversation

{
  "model": "ollama/qwen2.5:32b",
  "messages": [
    { "role": "system", "content": "You are an expert in cloud infrastructure." },
    { "role": "user", "content": "What is Docker?" },
    { "role": "assistant", "content": "Docker is a containerization platform..." },
    { "role": "user", "content": "How does it differ from Kubernetes?" }
  ]
}
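The mechanism behind such a history is simply to append each assistant reply to the messages array before the next user turn. A small sketch with the litellm Python library (model name is an example):

# Sketch of a multi-turn loop: every assistant reply is appended to the
# history so that the model keeps the conversational context.
import litellm

messages = [{"role": "system", "content": "You are an expert in cloud infrastructure."}]

for question in ["What is Docker?", "How does it differ from Kubernetes?"]:
    messages.append({"role": "user", "content": question})
    response = litellm.completion(model="ollama/qwen2.5:32b", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(answer)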

Example: Practical Business Use Case – Contract Analysis

{
  "model": "ollama/gpt-oss:20b",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal advisor. Analyze contracts for risks and summarize the key points. Answer structured in German."
    },
    {
      "role": "user",
      "content": "Analyze this supplier contract:\n\n[50 pages of contract text...]"
    }
  ]
}

The Fourth Role: "tool" (Advanced)

  • Used in function calling when the model calls an external function
  • Tool calling is very extensive and would exceed the scope here, but it will be covered in more detail in Section 4.

Response Structure

The response follows the familiar OpenAI style:

  • choices: List of possible answers; by default there is exactly one.
  • Each choice contains:
    • message.role: usually "assistant"
    • message.content: the text
    • optionally message.tool_calls: if tools were used.
  • usage:
    • prompt_tokens
    • completion_tokens
    • total_tokens

Example:

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Docker is a containerization platform that runs applications in isolated environments."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 25,
    "total_tokens": 45
  }
}

2. Well-Known LLM Models and Their Control Parameters

Even though the OpenAI-compatible format provides a common denominator, models differ significantly in support for control parameters and technical limits. This is particularly relevant for on-premise models, as they often use different architectures, quantizations, and runtime implementations. Control parameters vary by model and backend.

Control Parameters: What is Supported Where?

The most important control parameters in OpenAI format are:

  • temperature: Controls creativity/randomness of output (low = more deterministic, higher = more creative).
  • top_p: Alternative to temperature (nucleus sampling).
  • max_tokens: Maximum answer length.
  • n: Number of variants to generate (not implemented by all backends).
  • stop: String or list of strings at which output is terminated.
  • tools / tool_choice: Tool/function calling in OpenAI style (availability varies by backend).
  • response_format: E.g., {"type": "json_object"} for strictly JSON-formatted responses (primarily guaranteed in newer OpenAI-compatible APIs, often only partially implemented by other providers/pipelines).
  • logprobs, top_logprobs: Probabilities per token (not provided or only limited by many open-source backends).
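For illustration, a request that combines several of these parameters (values are examples; whether advanced fields such as response_format are honored depends on the backend, as the table below shows):

{
  "model": "ollama/qwen2.5:32b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Summarize the advantages of Docker as JSON." }
  ],
  "temperature": 0.2,
  "top_p": 0.9,
  "max_tokens": 300,
  "stop": ["###"],
  "response_format": { "type": "json_object" }
}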

Tabular Overview

The following table provides a quick overview of which control parameters are typically supported by various on-premise model families:

Parameter | GPT‑OSS | DeepSeek (Reasoning) | Llama | DeepSeek (Chat) | Qwen
--- | --- | --- | --- | --- | ---
temperature | ✅ | ⚠️ (no effect) | ✅ | ✅ | ✅
top_p | ✅ | ⚠️ (no effect) | ✅ | ✅ | ✅
max_tokens | ✅ | ✅ | ✅ | ✅ | ✅
stop | ✅ | ✅ | ✅ | ✅ | ✅
n (variants) | ⚠️ | ⚠️ | ⚠️ | ✅ | ⚠️
tools / tool_choice | ⚠️ | ⚠️ | ⚠️ | ✅ | ⚠️
response_format | ⚠️ | ⚠️ | ⚠️ | ✅ | ⚠️
logprobs | ⚠️ | ❌ (errors possible) | ⚠️ | ✅ | ⚠️

Legend:

  • ✅ = Reliably supported
  • ⚠️ = Support varies by backend/implementation
  • ❌ = Usually not available

Notes:

  • Universal Parameters: max_tokens and stop work reliably practically everywhere
  • Reasoning Models: For reasoning models (like DeepSeek-Reasoner), temperature and top_p have no effect, because the reasoning pipeline ignores these sampling parameters
  • Backend Dependency: Many features (especially tools, response_format, n) depend heavily on the inference server used (vLLM, Ollama, TGI, etc.), not just the model itself

Typical Differences in On-Premise Models

GPT-OSS / Open-Source GPT Variants

  • Supported: Generally the standard parameters temperature, top_p, max_tokens, stop; many distributions explicitly follow the OpenAI API format.
  • Varies Strongly: Support for tools / tool_choice, response_format, logprobs depends on the specific server implementation (e.g., OpenAI-compatible gateway, LiteLLM, Ollama, vLLM-API), not the model weight itself.
  • Special Feature: GPT-OSS is explicitly designed for OpenAI compatibility and local or self-hosted deployment, which is why high interface alignment with the OpenAI API is pursued.

Llama Family (Meta)

  • Supported: Parameters like temperature, top_p, max_tokens, and stop work reliably in common servers (vLLM, TGI, Ollama, etc.).
  • Limited: n (multiple variants) and tool calls are not implemented at the model level but – if at all – by the respective serving layer; depending on the framework, n, tools, or tool_choice may be missing or only partially available.
  • Logprobs: Many popular Llama backends offer logprobs either not at all or only with performance penalties; support is implementation-dependent.
  • Special Feature: Llama models ship with fixed context windows (e.g., 4k, 8k, 32k tokens) that can be extended through fine-tuning or special variants, but otherwise set a hard limit on input plus output.

Qwen Family (Alibaba)

  • Supported: temperature, top_p, max_tokens, stop are supported by official Qwen servers and typical OpenAI-compatible gateways.
  • response_format/structured JSON: Strict JSON schemas are definable via response_format for OpenAI models; with Qwen deployments, structured responses are usually achieved through careful prompting or additional schema validators in the application layer.
  • Special Feature: Qwen models are considered strong for structured outputs; many users employ explicit format instructions (system/user prompts) instead of native response_format.

DeepSeek Models (DeepSeek-V3, DeepSeek-Reasoner/R1)

  • Supported: DeepSeek provides an explicitly OpenAI-compatible Chat Completion API; parameters like temperature, top_p, max_tokens, n, stop, tools/function calling, and response_format = {"type": "json_object"} are supported analogously to the OpenAI API.
  • Logprobs: For standard chat models (e.g., deepseek-chat / DeepSeek-V3), logprobs and top_logprobs are explicitly supported and described in the API documentation; the reasoning model path (deepseek-reasoner) documents, however, that sampling parameters like temperature/top_p have no effect and logprobs/top_logprobs can lead to errors.
  • Special Feature:
    • DeepSeek models offer large context windows (typically 64k–128k tokens, depending on version), with separate limits for total context and maximum output length (e.g., 128k context, 4k–8k output).
    • The API is deliberately designed as a “drop-in replacement” for the OpenAI API, including use of official OpenAI SDKs, so existing OpenAI integrations can be switched to DeepSeek with minimal changes.

Practical Implications

What does this mean in practice?

  1. Basic Parameters are (Almost) Universal: temperature, max_tokens, and stop work reliably with most OpenAI-compatible backends and on-premise setups, as long as they replicate the chat/completion schema of the OpenAI API.
  2. Check Advanced Features Specifically: If tools, response_format, n, or logprobs are needed, check in advance in the model or server documentation or verify in testing against the specific API whether these parameters are really supported.
  3. Fallback Strategies for Structured Outputs: If response_format is not implemented, structured responses can usually still be achieved through precise system prompts and downstream validation (e.g., JSON parser + retry); the security implications of such validation should also be considered.
  4. LiteLLM as Abstraction Layer: LiteLLM provides a mechanism with get_supported_openai_params(model=..., custom_llm_provider=...) to query per model/provider which OpenAI parameters are supported and cleanly mapped. Optionally, unsupported parameters can be automatically dropped (drop_params=True) instead of causing an error.
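A small sketch of this check with LiteLLM (model and provider names are examples):

# Ask LiteLLM which OpenAI parameters it maps for a given model/provider,
# and drop unsupported parameters instead of raising an error.
import litellm

supported = litellm.get_supported_openai_params(
    model="qwen2.5:32b", custom_llm_provider="ollama"
)
print(supported)  # e.g. ["temperature", "max_tokens", "stream", ...]

litellm.drop_params = True  # global switch; drop_params=True also works per call
response = litellm.completion(
    model="ollama/qwen2.5:32b",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,  # silently dropped if the backend does not support it
)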

Limits and Defaults

Beyond control parameters, models also differ in other technical properties:

  • Context Window: Typically ranges from a few thousand tokens for smaller or older models to 128k+ tokens for large, modern systems; the exact value depends on the respective model checkpoint.
  • Token Limits: Maximum input and output length are usually limited by the same context limit and are additionally restricted by the inference server or gateway.
  • Default Temperature: Defaults vary by provider and server configuration; typical defaults are in the range 0.0–0.7 and can usually be overridden per request.
  • Safety Settings: Many on-premise deployments run without integrated content filters or moderation pipelines; security, compliance, and data protection filters must therefore often be added as external guardrails (e.g., own moderation models, content safety services, or rule-based filters).

Important: Abstraction libraries like LiteLLM can unify parameter compatibility, defaults, and error handling, but they do not change the real physical limits of the underlying model. A model with a 4k context cannot suddenly process 32k tokens just because a proxy or API adapter is inserted.


3. LiteLLM: Accessing Various Models with a Unified API

LiteLLM is not an LLM itself, but rather a proxy server or gateway that exposes the unified OpenAI Chat Format externally and internally talks to many different LLM models with varying APIs.

Role of LiteLLM

Important tasks of LiteLLM:

  • API Alignment: Different providers speak different dialects – LiteLLM translates all to the OpenAI Chat Format (request and response).
  • Routing & Load Balancing: With config.yaml, multiple models/providers can be bundled behind a common alias (e.g., “team-default”).
  • Auth & Quotas: Management of API keys, key rotation, rate limits, cost limits, etc.
  • Monitoring & Metrics: Token consumption, latencies, error rates, spend tracking, etc. via Prometheus/metrics.
  • Integrations: Various frameworks (Langfuse, CrewAI, LangGraph, LlamaIndex, OpenAI Agents SDK, etc.) can speak directly with LiteLLM.

For beginners, this means:

  • It’s enough to understand the OpenAI Chat API model once.
  • After that, the same code can address various on-premise models (Ollama, local vLLM instances, Hugging Face Inference) or cloud providers (OpenAI, Anthropic, Azure) – only model and possibly base_url need to be adjusted.
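A sketch of this with the official OpenAI SDK pointed at a LiteLLM proxy; base URL, key, and model names are placeholders, and switching backends only means changing the model string:

# The same client code works for every backend behind the proxy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-local-example")

for model in ["ollama/qwen2.5:32b", "team/gpt-default"]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Name three advantages of Docker."}],
    )
    print(model, "->", completion.choices[0].message.content[:80])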

How LiteLLM Unifies Providers

LiteLLM supports a very large list of providers, including:

  • On-Prem / Local: vLLM servers, Hugging Face Inference, Ollama, Triton Inference Server, custom HTTP servers that speak the OpenAI-compatible format.
  • Cloud (optional): OpenAI, Azure OpenAI, Google (Vertex AI / Gemini), AWS Bedrock, Anthropic, Cohere, AI21, Together, Groq, OpenRouter, Nscale (EU-Sovereign), etc.
  • Specialized Providers: Recraft (images), Deepgram (audio), RAGFlow, LangGraph Agents, etc.

In config.yaml, you define:

  • Which model_name is visible externally (e.g., team/gpt-default) and
  • which backend parameters belong to it (provider, API key, base URL, cost information, fallback models, etc.).

In the application, only model="team/gpt-default" is then used – LiteLLM handles everything else.
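A minimal config.yaml sketch for such a mapping (alias, URLs, and keys are placeholders; two entries with the same model_name enable load balancing/fallback):

model_list:
  - model_name: team/gpt-default            # alias visible to applications
    litellm_params:
      model: ollama/qwen2.5:32b             # actual backend model
      api_base: http://ollama.internal:11434
  - model_name: team/gpt-default            # same alias -> routed/balanced
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY    # read from environment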

Metrics, Observability, and Governance

LiteLLM provides – especially in proxy mode – extensive metrics functionality:

  • Standard Metrics (Prometheus):
    • litellm_total_tokens_metric: Input + output tokens per model/user
    • litellm_input_tokens_metric, litellm_output_tokens_metric
    • litellm_spend_metric: estimated costs per user, model, team
    • litellm_request_total_latency_metric: total latency
    • litellm_llm_api_latency_metric: pure provider latency
    • Success/error rates by model/fallback
  • Proxy Metrics:
    • Requests per endpoint (/chat/completions, etc.)
    • Error rate, fallback statistics for failures of individual models
  • Custom Labels: Via prometheus_metrics_config, custom labels can be set (e.g., team, service, feature_flag) to segment dashboards.

This allows LiteLLM to be cleanly integrated into Grafana dashboards and existing observability stacks.

Multimodality (Text + Image + More)

Many modern models are multimodal – they accept not only text but also images (sometimes audio/video). LiteLLM integrates this into the same completion path:

  • Example: Multimodal On-Premise Models (e.g., LLaVA, Phi-3.5 Vision). Here, messages.content becomes an array of content blocks instead of a simple string, e.g.:
response = litellm.completion(
    model="ollama/llava:latest",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://.../image.jpg"}}
            ]
        }
    ]
)
  • For other on-premise providers (e.g., vLLM with vision models, Hugging Face Inference), the OpenAI-like interface is also used, but internally translated into provider-specific calls.

The advantage for the interface user:

  • You only need to understand messages – the same mechanism as with pure text – and LiteLLM handles the provider specifics.

4. Outlook: Multimodality, Structured Outputs, Reasoning, and MCP

Modern LLM features go beyond simple text-to-text interfaces. Multimodality, controlled output formats, reasoning mechanisms, and standardized protocols enable tight integration of LLMs with business systems and help reduce hallucinations. These features are increasingly standardized in OpenAI-compatible APIs but are often not consistently available in on-prem models yet.

Multimodality

New model generations process not only text but also images, audio, or documents. In on-premise or OpenAI-compatible deployments, this is expressed through extended HTTP payloads with Base64- or URL-based images.

REST Request with Image Source (vLLM, OpenAI-compatible API):

POST http://llm-intern.your-company.com/v1/chat/completions
Content-Type: application/json

{
  "model": "microsoft/Phi-3.5-vision-instruct",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is visible in this image?"},
      {
        "type": "image_url",
        "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}
      }
    ]
  }],
  "max_tokens": 512
}

# Response (Status 200 OK)
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The image shows an invoice from TechAG for CHF 1'250.00, invoice number INV-2025-001, dated 2025-01-15."
    }
  }]
}

More details on multimodality: vLLM Multimodal Inputs.
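A small helper sketch for building such a Base64 data URL from a local file (file path, endpoint, and model are placeholders):

# Encode a local image as a data URL and send it as an image_url content block.
import base64
import requests

def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode()

payload = {
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is visible in this image?"},
            {"type": "image_url", "image_url": {"url": image_to_data_url("invoice.jpg")}},
        ],
    }],
    "max_tokens": 512,
}
resp = requests.post("http://llm-intern.your-company.com/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])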

Structured Outputs and Guardrails

Structured outputs ensure that an LLM generates valid JSON with predefined structure instead of “approximately correct” text that then fails during parsing. This can be realized with purely open components – for example, with vLLM and its structured outputs functionality – and optionally secured additionally with Guardrails AI.

Open Source Example: vLLM Structured Outputs (guided_json)

vLLM extends its OpenAI-compatible server with additional parameters like guided_json, guided_choice, guided_regex, and guided_grammar, which bind the model during decoding to a JSON schema or grammar. More details can be found in the vLLM Structured Outputs documentation.

A request to your internal vLLM server could look like this:

POST http://llm-intern.your-company.com/v1/chat/completions
Content-Type: application/json

{
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "messages": [{
    "role": "user",
    "content": "Extract invoice data and return it as JSON: Invoice #INV-2025-001 for CHF 1'250.00 from TechAG, 2025-01-15."
  }],
  "max_tokens": 256,

  "guided_json": {
    "name": "invoice_schema",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string" },
        "amount":        { "type": "number" },
        "currency":      { "type": "string" },
        "date":          { "type": "string", "format": "date" },
        "vendor":        { "type": "string" }
      },
      "required": ["invoice_number", "amount", "currency", "date", "vendor"],
      "additionalProperties": false
    }
  }
}

Typical response (simplified; in practice message.content is returned as a JSON string that matches the schema, shown pretty-printed here):

{
  "id": "chatcmpl-struct-123",
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": {
        "invoice_number": "INV-2025-001",
        "amount": 1250.0,
        "currency": "CHF",
        "date": "2025-01-15",
        "vendor": "TechAG"
      }
    },
    "finish_reason": "stop"
  }]
}

Important:

  • The model is guided by guided_json to generate syntactically valid JSON that conforms to the schema. This is enforced at the decoder level, as described in this Red Hat article on structured outputs with vLLM.
  • The entire pipeline remains open source (model, vLLM server, JSON schema definition).

Briefly: What is Guardrails AI?

Guardrails AI is an open-source framework (Python/JavaScript) that equips LLM applications with additional “guardrails”. The framework is available on GitHub and can be installed via PyPI.

  • Input and Output Guards check and limit risks (e.g., PII leakage, toxicity, competitor mentions) via reusable validators from the “Guardrails Hub”.
  • Structured Data: Guardrails can validate LLM responses against Pydantic models or JSON schemas and automatically correct or regenerate on violations. An example can be found in the Guardrails AI documentation on structured data.
  • Guardrails can be integrated directly into your application or run as a separate guard service that prefilters incoming/outgoing LLM calls.

Both approaches complement each other well:

  • vLLM / Structured Outputs enforces formal structures at the decoder level.
  • Guardrails AI adds an additional validation and risk layer on top, including business and compliance rules.
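If neither guided decoding nor a guardrail framework is available, the fallback strategy from Section 2 (precise prompt + parse + retry) can be sketched roughly as follows; the schema hint and model name are examples:

# Hand-rolled fallback: ask for JSON, parse, and retry on invalid output.
import json
import litellm

SCHEMA_HINT = (
    'Answer ONLY with JSON of the form '
    '{"invoice_number": string, "amount": number, "currency": string}.'
)

def extract_invoice(text: str, retries: int = 2) -> dict:
    messages = [
        {"role": "system", "content": SCHEMA_HINT},
        {"role": "user", "content": text},
    ]
    for _ in range(retries + 1):
        reply = litellm.completion(model="ollama/qwen2.5:32b", messages=messages)
        content = reply.choices[0].message.content
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": "That was not valid JSON. " + SCHEMA_HINT})
    raise ValueError("No valid JSON after retries")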

Reasoning Models

Reasoning models (like DeepSeek-Reasoner) first generate detailed intermediate steps (reasoning_content) before providing an answer. The API returns reasoning_content and content separately.

REST Request + Response with Reasoning Trace:

POST http://deepseek-reasoner.your-company.com/v1/chat/completions
Content-Type: application/json

{
  "model": "deepseek-reasoner",
  "messages": [{
    "role": "user",
    "content": "Analyze our Q4 sales data and identify three main drivers for the revenue change compared to Q3."
  }],
  "max_tokens": 8000
}

# Response – Note reasoning_content as separate field
{
  "id": "chatcmpl-reas456",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "reasoning_content": "To analyze the revenue change, I need to clarify the following questions:\n1. Which categories had the largest revenue increase?\n2. Which external factors have changed?\n3. Are there seasonal effects between Q3 and Q4?\n\n[Detailed analyses...]\n\nOutput syntax:",
      "content": "The three main drivers of the revenue change are:\n1) Increased demand in cloud services (+18%)\n2) Improved margins through process optimization (+6%)\n3) Seasonal effects (Christmas quarter +12%)"
    },
    "finish_reason": "stop"
  }]
}

The reasoning_content can be evaluated for audit trails or prompt debugging; see the DeepSeek Reasoning Model Guide for details.
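In client code, the two fields can be separated as shown in this sketch (endpoint and field names follow the example above):

# Keep the reasoning trace for auditing, show only the final answer to users.
import requests

resp = requests.post(
    "http://deepseek-reasoner.your-company.com/v1/chat/completions",
    json={
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": "Analyze our Q4 sales data ..."}],
        "max_tokens": 8000,
    },
    timeout=300,
)
message = resp.json()["choices"][0]["message"]

audit_trace = message.get("reasoning_content")  # store for audit trail / debugging
final_answer = message["content"]               # return to the end user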

Tool Calling and Agentics

Tool calling allows the LLM to choose between multiple available functions and return a call request with the correct arguments. The calling application then executes the function and sends the result back to the model.

REST Request with Two Available Tools (LLM Chooses One):

# 1. Initial request with tool definition
POST http://llm-platform.your-company.com/v1/chat/completions
Content-Type: application/json

{
  "model": "local-llama-3.1",
  "messages": [{"role": "user", "content": "What is the status of customer CUST-123 and their current contract value?"}],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "query_crm",
        "description": "Query customer data from on-prem CRM system (contact info, purchase history, etc.)",
        "parameters": {
          "type": "object",
          "properties": {
            "customer_id": {"type": "string", "description": "Customer ID"}
          },
          "required": ["customer_id"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "query_contracts",
        "description": "Query contract and billing data from on-prem ERP system",
        "parameters": {
          "type": "object",
          "properties": {
            "customer_id": {"type": "string", "description": "Customer ID"}
          },
          "required": ["customer_id"]
        }
      }
    }
  ]
}

# 2. Response – LLM chooses query_crm (not query_contracts)
{
  "id": "chatcmpl-tool789",
  "choices": [{
    "message": {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "query_crm",
          "arguments": "{\"customer_id\": \"CUST-123\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

# 3. Internal service executes tool and sends result back
# POST http://crm-api.your-company.com/internal/query
# {"customer_id": "CUST-123"}
# Response: {"status": "active", "contact": "John Doe", "last_order": "2025-01-10"}

# 4. Second request with tool result
POST http://llm-platform.your-company.com/v1/chat/completions
{
  "model": "local-llama-3.1",
  "messages": [
    {"role": "user", "content": "What is the status of customer CUST-123 and their current contract value?"},
    {"role": "assistant", "tool_calls": [{"id": "call_abc123", "type": "function", "function": {"name": "query_crm", "arguments": "{\"customer_id\": \"CUST-123\"}"}}]},
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"status\": \"active\", \"contact\": \"John Doe\", \"last_order\": \"2025-01-10\"}"
    }
  ],
  "tools": [...]
}

# 5. Final response with interpretation
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Customer CUST-123 (John Doe) has status 'active'. The last order was placed on 2025-01-10."
    },
    "finish_reason": "stop"
  }]
}

More details: OpenAI Function Calling Guide.

Ollama now also supports tool calling: Ollama Tool Calling Documentation.
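The whole round trip above can be sketched in Python roughly as follows; endpoint, model name, and the query_crm implementation are placeholders:

import json
import requests

API = "http://llm-platform.your-company.com/v1/chat/completions"

def query_crm(customer_id: str) -> dict:
    # Placeholder: in a real setup this would call the internal CRM service.
    return {"status": "active", "contact": "John Doe", "last_order": "2025-01-10"}

TOOL_IMPLS = {"query_crm": query_crm}

tools = [{
    "type": "function",
    "function": {
        "name": "query_crm",
        "description": "Query customer data from the on-prem CRM system",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the status of customer CUST-123?"}]

# 1st call: the model may answer directly or request a tool call
first = requests.post(API, json={"model": "local-llama-3.1", "messages": messages, "tools": tools}).json()
assistant_msg = first["choices"][0]["message"]
messages.append(assistant_msg)

# Execute every requested tool and append the results with role "tool"
for call in assistant_msg.get("tool_calls") or []:
    fn = TOOL_IMPLS[call["function"]["name"]]
    result = fn(**json.loads(call["function"]["arguments"]))
    messages.append({"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)})

# 2nd call: the model turns the tool result into a final answer
final = requests.post(API, json={"model": "local-llama-3.1", "messages": messages, "tools": tools}).json()
print(final["choices"][0]["message"]["content"])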

Model Context Protocol (MCP)

The Model Context Protocol (MCP) standardizes bidirectional tool integration via a JSON-RPC interface. Servers define tools, clients (LLMs, assistants) can use them – without specific API adapters.

REST Handshake with MCP Server:

# 1. Initialize connection
POST http://mcp-gateway.your-company.com/mcp
Content-Type: application/json

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "enterprise-llm-client", "version": "1.0.0"}
  }
}

# 2. Server responds with capabilities
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "protocolVersion": "2024-11-05",
    "capabilities": {"tools": {}},
    "serverInfo": {"name": "internal-erp-mcp-server", "version": "2.1.0"}
  }
}

# 3. Query available tools
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}

# 4. Server delivers tool list
{
  "jsonrpc": "2.0",
  "id": 2,
  "result": {
    "tools": [
      {
        "name": "query_sales_data",
        "description": "Query sales data from on-prem ERP",
        "inputSchema": {
          "type": "object",
          "properties": {
            "region": {"type": "string"},
            "quarter": {"type": "string"}
          },
          "required": ["region", "quarter"]
        }
      }
    ]
  }
}

# 5. Call tool
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "query_sales_data",
    "arguments": {"region": "EMEA", "quarter": "Q4-2025"}
  }
}

# 6. Server responds with result
{
  "jsonrpc": "2.0",
  "id": 3,
  "result": {
    "content": [{"type": "text", "text": "EMEA Q4-2025: Revenue CHF 2.5M, Growth +22%"}]
  }
}
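For completeness, a client-side sketch of this handshake over plain HTTP POST (simplified: real MCP clients usually go through an SDK, send an initialized notification after the handshake, and use stdio or streamable-HTTP transports with session handling):

# Minimal JSON-RPC helper for the simplified MCP exchange shown above.
import itertools
import requests

MCP_URL = "http://mcp-gateway.your-company.com/mcp"
_ids = itertools.count(1)

def rpc(method: str, params: dict) -> dict:
    body = {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    return requests.post(MCP_URL, json=body, timeout=60).json()["result"]

server = rpc("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "enterprise-llm-client", "version": "1.0.0"},
})
tools = rpc("tools/list", {})["tools"]
result = rpc("tools/call", {
    "name": "query_sales_data",
    "arguments": {"region": "EMEA", "quarter": "Q4-2025"},
})
print(result["content"][0]["text"])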