vLLM OpenAI-Compatible Server

Trooper.AI provides a fully automated vLLM deployment template that installs, configures, and runs an OpenAI-compatible inference server on your GPU server using systemd.

Status Dashboard for vLLM on Trooper.AI
Status Dashboard for vLLM on Trooper.AI

The goal:

The template automatically:

You only control a small set of public parameters.

Understanding GPU Numbers: vLLM works with multiple GPUs, but it requires the number of GPUs to evenly divide the model’s attention heads. For instance, a model like Gemma with 32 attention heads can use 1, 2, 4, or 8 GPUs – but not 3.


Important Security Note on Screenshots: The servers shown in the screenshots are for demonstration purposes only and are secured by the Trooper.AI Network-Level Firewall, which is included with all GPU Server orders. For detailed information, see 🛡️ Native Firewall before your GPU Server.


Settings of vLLM template

This template deploys a ready-to-use vLLM inference server on your Trooper.AI instance. It installs the required runtime, configures the API endpoint, and prepares the model for OpenAI-compatible requests.

Below is a short explanation of each configuration option.

Basic settings to get startet with vLLM
Basic settings to get startet with vLLM

commandline_args

Optional advanced arguments passed directly to the vLLM server startup command.

Use this if you need to enable additional features such as:

  • tensor parallelism
  • quantization
  • tool calling
  • custom tokenizer settings
  • speculative decoding

Example:

Code
--tensor-parallel-size 2

Leave empty unless you know exactly which flags you want to use.

hf_token

Your HuggingFace access token.

This is required if the model:

  • is gated
  • requires authentication
  • or is downloaded from a private repository

For public models this field can be left empty.

You can generate a token here:

Code
https://huggingface.co/settings/tokens

The token is only used during model download.

max_tokens

Defines the maximum context window the server should support.

This directly affects VRAM usage.

Typical values:

Context Recommended
small models 4096
medium models 8192
long context models 16384+

Higher values increase memory usage significantly. If your server runs out of VRAM, lower this value.

modelname

The HuggingFace model identifier that vLLM should load.

Example:

Code
mistralai/Ministral-3-14B-Instruct-2512

Other compatible examples:

Code
Qwen/Qwen2.5-14B-Instruct
google/gemma-3-12b-it
meta-llama/Meta-Llama-3-8B-Instruct
... and many more

Make sure the model is supported by vLLM and fits into your GPU memory.

TOKEN

This is your API authentication key.

All requests to the vLLM server must include this token in the header:

Code
Authorization: Bearer YOUR_TOKEN

This protects your server from unauthorized access.

Example request:

Code
curl https://your-server/v1/chat/completions \
  -H "Authorization: Bearer YOUR_TOKEN"

Use a strong random string.


Model Size & GPU Requirements

You can utilize a wide range of large language models from HuggingFace within vLLM. Ensure sufficient VRAM is available, as performance is contingent upon having adequate free GPU VRAM to accommodate the model and context size, multiplied by the number of concurrent users.

Trooper.AI automatically selects optimal precision per GPU architecture.

VRAM calculation: Model weights + ~25% KV-Cache buffer.
VRAM can be shared across multiple GPUs via Tensor Parallelism (--tensor-parallel-size N).

Model Parameters Precision Min. VRAM Total GPU Configuration GPUs
Qwen/Qwen3-4B 4B BF16 ~8 GB 1× V100 16GB / RTX 4070 Ti Super 1
Qwen/Qwen3-8B 8B BF16 ~20 GB 1× RTX 3090 / RTX 4090 (24 GB) 1
mistralai/Ministral-3-14B-Instruct-2512 14B FP8 ~29 GB 1× RTX 4080 Pro 32GB or 1× A100 40GB 1
Qwen/Qwen3-32B 32B FP8 ~40 GB 1× A100 40GB or 2× RTX 4090 (2×24 GB) 1–2
meta-llama/Llama-3.1-8B-Instruct 8B FP8 ~20 GB 1× RTX 3090 / RTX 4090 (24 GB) 1
meta-llama/Llama-3.1-70B-Instruct 70B FP8 ~90 GB 1× RTX Pro 6000 Blackwell (96 GB) or 2× A100 (2×40 GB) 1–2

Note: FP8 is used on Ada/Hopper architectures (RTX 40-series, A100, H100) for maximum throughput. \ Trooper.AI automatically selects the optimal precision for your GPU.
Multi-GPU setups use Tensor Parallelism — VRAM scales linearly across GPUs.


Public Parameters

These parameters can be set via environment variables before running the installer.

Variable Description
TOKEN API key for authentication
modelname HuggingFace model path
hf_token HuggingFace token (for gated models)
commandline_args Optional extra vLLM CLI arguments

Automatic Benchmarking to tune Parameters

Status Dashboard for vLLM on Trooper.AI
Status Dashboard for vLLM on Trooper.AI

Our template includes a performance benchmark to help you optimize your GPU server for multi-agent use. Use it to test and compare models, GPU types, and parameters to maximize throughput and concurrent users.

How does the benchmark work?

The benchmark starts multiple agents simultaneously, each interacting with the vLLM server endpoint on a different topic. This prevents caching and tests real-world performance. You can see the throughput of each agent, the total throughput, and compare costs for tokenized services like GPT-5 mini. Often, a vLLM server from Trooper.AI is 2-4x cheaper than large token-based inference services while keeping your LLM work private!


What the Template Does

Startup of your vLLM server
Startup of your vLLM server

  1. Detects GPU architecture (Volta, Ampere, Ada, Hopper, Blackwell)

  2. Detects VRAM size

  3. Selects optimal precision automatically:

    • FP8 > BF16 > FP16
  4. Uses FP16 KV cache for stability

  5. Tunes:

    • max concurrent sequences
    • batched token size
    • memory utilization
  6. Installs vLLM with CUDA

  7. Creates a systemd service:

    Code
    vllm-server.service
    
  8. Starts a persistent OpenAI-compatible API server on a secure HTTPS endpoint.

No manual tuning is required.


API Endpoints

Base URL:

Code
http://YOUR_SERVER:PORT/v1

Endpoints:

  • /v1/models
  • /v1/completions
  • /v1/chat/completions

Authentication header:

Code
Authorization: Bearer YOUR_TOKEN_FROM_CONFIG

Python Client Example

python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello, what is vLLM?"}
    ],
    max_tokens=200
)

print(resp.choices[0].message.content)

Node.js Client Example

javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
});

const completion = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [
    { role: "user", content: "Hello from Node.js" }
  ],
  max_tokens: 200
});

console.log(completion.choices[0].message.content);

PHP Client Example

php
<?php

$ch = curl_init("https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1/chat/completions");

$data = [
  "model" => "Qwen/Qwen3-14B",
  "messages" => [
    ["role" => "user", "content" => "Hello from PHP"]
  ],
  "max_tokens" => 200
];

curl_setopt_array($ch, [
  CURLOPT_POST => true,
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_HTTPHEADER => [
    "Authorization: Bearer YOUR_API_KEY",
    "Content-Type: application/json"
  ],
  CURLOPT_POSTFIELDS => json_encode($data)
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response;

Streaming Example

Python Streaming

python
resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role":"user","content":"Explain transformers"}],
    stream=True
)

for chunk in resp:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js Streaming

javascript
const stream = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [{ role: "user", content: "Explain transformers" }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta?.content || "");
}

Use Cases

Trooper.AI vLLM servers are designed for:

  • SaaS AI backends
  • Chatbots
  • Code assistants
  • RAG systems
  • Multi-user inference servers
  • High throughput batch inference
  • GPU rental environments

Performance Philosophy

Trooper.AI uses:

  • Automatic architecture tuning
  • Automatic precision selection
  • VRAM-aware batching
  • Stable KV cache configuration

This avoids:

  • GPU misconfiguration
  • Precision crashes
  • VRAM fragmentation
  • Context instability

Cost Comparison: Ministral-3 on Trooper.AI vs GPT-5 mini

Jeopardy-Game-Benchmark
Jeopardy-Game-Benchmark

To give a rough idea of the economics of self-hosting, the following comparison projects the cost of running Ministral-3-14B-Instruct-2512 on a Trooper.AI GPU server versus using the GPT-5 mini API for the same workload.

The estimate is based on the real benchmark run described in this article and extrapolated to one hour of continuous inference throughput.

Hourly Cost at Measured Throughput

Platform Hourly cost
GPT-5 mini API ~$3.12
Ministral-3-14B on Trooper.AI GPU server €0.51 (~$0.54)

How this was calculated

This estimate is based on the actual benchmark run in this article using Ministral-3-14B-Instruct-2512 on a Trooper.AI GPU server.

Metric Value
Total tokens processed 307,028
Runtime 153 seconds
Throughput ~2006 tokens/sec
Projected tokens/hour ~7.22M tokens

Token mix in the benchmark:

Token type Tokens
Input tokens 275,186
Output tokens 31,842

Scaling this ratio to ~7.22M tokens/hour and applying GPT-5 mini pricing:

  • $0.25 / 1M input tokens
  • $2.00 / 1M output tokens

results in an estimated ~$3.12 per hour for the same workload.

The Ministral-3 server on Trooper.AI instead runs at a flat €0.51/hour (~$0.54) regardless of token volume, which enables processing millions of tokens per hour at predictable cost.

Long-Running Workload Projection

Using the observed throughput we can estimate the cost of running the system for a full hour.

Metric Value
Tokens per hour ~7,221,543
GPT-5 mini cost $3.12
Trooper.AI server cost €0.51 (~$0.54)

Hourly savings

Running the same workload for an hour would still be:

≈ 5.8× cheaper on Trooper.AI

When Self-Hosting Becomes Much Cheaper

Self-hosting LLMs tends to win economically when:

  • workloads contain many small requests
  • parallel inference is required
  • applications generate millions of tokens per hour
  • workloads run continuously

Typical examples include:

  • AI game simulations
  • agent systems
  • automation pipelines
  • chat applications with many users

Summary

In this benchmark:

Metric Result
Model Ministral-3-14B
Server cost €0.51/hour
Tokens processed 307k
Runtime 153 seconds
Cost reduction 82.8%
Cost advantage 5.8× cheaper than GPT-5 mini

For high-throughput workloads, running models like Ministral-3 on Trooper.AI GPU servers can dramatically reduce inference costs while also removing API rate limits.


Why you need the vLLM template

The Trooper.AI vLLM template gives you:

  • OpenAI-compatible API
  • Automatic GPU optimization
  • Production-safe defaults
  • Minimal configuration
  • Maximum throughput

You only choose the model and API key.

Everything else is optimized automatically.


Troubleshooting

With the dashboard you can easily detect issues in startup and fix them. Not enougth VRAM? Upgrade to a higher Blib in minutes via the dashboard. Or fix the VRAM usage by lowering the token window size. Check the Logs easily in realtime with the dashboard:

Crashed and how to fix it
Crashed and how to fix it


Support

For advanced tuning, multi-GPU, or custom presets, contact Trooper.AI support.