Private AI / Infrastructure

A private AI platform with zero external AI APIs

AnnaTech reference build - self-funded, in production for our own operations

98 GB VRAM: GPU inference cluster (4x RTX A5000)
262K tokens of context, served via vLLM (FP8)
30+ MCP tools behind tiered authentication
0 external AI APIs: everything runs in-network

Architecture at a glance

Context

We advise companies that cannot - or should not - send their data to cloud AI vendors. The only credible way to give that advice is to run the alternative ourselves. This platform is our own production environment: it answers our research questions, drives our automations and generates our imagery, every day.

Design goals

Absolute data locality: no prompt, embedding or document leaves the network. Multi-model flexibility rather than one-vendor lock-in. Agents with real capabilities - web research, document work, image generation, controlled facility integrations - governed by real permissions. And enough operational discipline that the platform survives hardware and software failures without a human on call.

Architecture

Inference runs on two backends: a 4x RTX A5000 node (98 GB VRAM) serving an FP8-quantized 27B open-weight model at 262K context via vLLM, and an Apple-silicon node serving a mixture-of-experts model at 163K context - redundancy with very different power and cost profiles. An Open WebUI frontend exposes 12 curated model presets tuned per task.
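To make the redundancy concrete, here is a minimal sketch of how a client might choose between the two backends. It is an illustration under stated assumptions, not the platform's actual routing code: the addresses, the `Backend` type and the `pick_backend` helper are hypothetical, the context limits are rounded from the 262K and 163K figures above, and the health probe is passed in rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backend:
    name: str         # label only (hypothetical)
    base_url: str     # hypothetical in-network address
    max_context: int  # context window in tokens


# Hypothetical addresses; context sizes are rounded from the 262K / 163K figures.
BACKENDS = [
    Backend("gpu-vllm", "http://10.0.0.10:8000/v1", 262_000),
    Backend("apple-silicon", "http://10.0.0.20:8080/v1", 163_000),
]


def pick_backend(
    backends: Iterable[Backend],
    prompt_tokens: int,
    is_healthy: Callable[[Backend], bool],
) -> Optional[Backend]:
    """Return the first healthy backend whose context window fits the prompt."""
    for backend in backends:
        if prompt_tokens <= backend.max_context and is_healthy(backend):
            return backend
    return None  # nothing usable: the caller decides whether to queue or fail


if __name__ == "__main__":
    # Simulate the GPU node being down: a short prompt falls back to the second node.
    down = {"gpu-vllm"}
    chosen = pick_backend(BACKENDS, 4_000, lambda b: b.name not in down)
    print(chosen.name if chosen else "no backend available")
```

The order of the list encodes preference, so the larger-context node is tried first. A prompt that only fits the 262K window has no fallback in this sketch, which is the kind of constraint a real routing policy has to state explicitly.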

Agent capability lives in an MCP tool layer: 30+ tools spanning a local research pipeline (self-hosted metasearch, parallel scraping, semantic ranking with local embeddings), document and file operations, image generation with explicit GPU memory orchestration, and controlled facility integrations. Access is tiered - LAN clients get administrative scope, remote clients authenticate with bearer tokens and rate limits, and per-tool deny-lists gate sensitive integrations. Polish and English voice runs on local Whisper STT and Piper TTS. A messaging-channel agent executes unattended tasks end to end.
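The tiered access rules can be sketched as a single decision function. This is an assumption-laden illustration rather than the platform's implementation: the LAN range, token value, tool names and the `authorize` and `TokenBucket` helpers are all hypothetical, and the sketch assumes the deny-list applies to the remote tier only.

```python
import hmac
import ipaddress
import time
from dataclasses import dataclass, field
from typing import Optional

# All values below are hypothetical placeholders.
LAN = ipaddress.ip_network("192.168.0.0/16")
REMOTE_TOKENS = {"example-token"}           # bearer tokens issued to remote clients
REMOTE_DENIED_TOOLS = {"facility_control"}  # per-tool deny-list (assumed remote-only)


@dataclass
class TokenBucket:
    """Token-bucket limiter: refills `rate` requests per second up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def authorize(client_ip: str, bearer: Optional[str], tool: str, bucket: TokenBucket) -> str:
    """Return "admin", "remote" or "deny" for one tool call."""
    if ipaddress.ip_address(client_ip) in LAN:
        return "admin"  # LAN clients get administrative scope
    # Constant-time comparison avoids leaking token contents through timing.
    if bearer is None or not any(
        hmac.compare_digest(bearer.encode(), t.encode()) for t in REMOTE_TOKENS
    ):
        return "deny"
    if tool in REMOTE_DENIED_TOOLS:
        return "deny"  # sensitive integration gated for remote clients
    return "remote" if bucket.allow() else "deny"
```

Checking the deny-list before the rate limiter means a refused call does not consume quota. A real deployment would typically keep one bucket per token and store tokens hashed rather than in plain text.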

Underneath: a 2-node Proxmox cluster with hardware watchdogs, self-healing service policies, VLAN network segmentation and versioned backups - hardened by real incidents, not by checklist.
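The text does not say how the self-healing policies are implemented. As one possible shape, the sketch below restarts a failed service with exponential backoff and gives up after a fixed number of attempts. The `supervise` function, its parameters and the health and restart callables are hypothetical; in practice a service manager or the cluster's own tooling would usually own this loop.

```python
import time
from typing import Callable


def supervise(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart a failing service with exponential backoff.

    Returns True once the service reports healthy, or False if it is still
    failing after max_restarts attempts (the point where a human is needed).
    """
    for attempt in range(max_restarts):
        if is_healthy():
            return True
        restart()
        # Wait 1x, 2x, 4x, ... base_delay before checking again.
        sleep(base_delay * 2 ** attempt)
    return is_healthy()


if __name__ == "__main__":
    # Simulated service that comes back on the third restart.
    state = {"up": False, "restarts": 0}

    def fake_restart() -> None:
        state["restarts"] += 1
        state["up"] = state["restarts"] >= 3

    recovered = supervise(lambda: state["up"], fake_restart, sleep=lambda s: None)
    print("recovered" if recovered else "escalate to a human")
```

Capping the number of restarts matters as much as the backoff: a service that never recovers should surface as an alert rather than loop forever.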

Outcome

A production platform in daily use, and the reference architecture we adapt to client constraints: hardware sizing, model policy, integration surface and compliance posture. When we say private AI can be a first-class experience rather than a compromise, this is the evidence.

From the workbench

private AI platform — live inventory

$ platform status
inference backends   2   gpu: 4x A5000 / vLLM / FP8 27B / 262K ctx · apple-silicon: MoE / 163K ctx
model presets        12  per-task system prompts + parameters
mcp tools            30+ research · documents · images · facility integrations
auth tiers           2   lan = admin scope · remote = bearer token + rate limits
voice                PL EN whisper stt · piper tts, fully local
external AI APIs     0   no prompt, document or token leaves the network

Counts and configuration reflect the real, operating platform.

