
Private AI / Infrastructure

A private AI platform with zero external AI APIs

AnnaTech reference build - self-funded, in production for our own operations

98 GB VRAM in the GPU inference cluster (4x RTX A5000)
262K tokens of context served via vLLM (FP8)
30+ MCP tools behind tiered authentication
0 external AI APIs: everything runs in-network

Architecture at a glance

YOUR NETWORK - NOTHING LEAVES IT (external AI APIs: 0)
GPU inference node: 4x RTX A5000 · 98 GB VRAM · vLLM · FP8 27B · 262K ctx
Apple-silicon node: MoE model · 163K ctx · low-power redundancy
Local voice: Whisper STT · Piper TTS · Polish + English
Chat & agent frontend: Open WebUI · 12 curated model presets
Permission tiers: LAN = admin · remote = token · rate limits · deny-lists
MCP tool layer: 30+ tools · research · documents · images · facility integrations
Operations: Proxmox 2-node · watchdogs · VLAN segmentation · backups

Context

We advise companies that cannot - or should not - send their data to cloud AI vendors. The only credible way to give that advice is to run the alternative ourselves. This platform is our own production environment: it answers our research questions, drives our automations and generates our imagery, every day.

Design goals

Absolute data locality: no prompt, embedding or document leaves the network. Multi-model flexibility rather than one-vendor lock-in. Agents with real capabilities - web research, document work, image generation, controlled facility integrations - governed by real permissions. And enough operational discipline that the platform survives hardware and software failures without a human on call.

Architecture

Inference runs on two backends: a 4x RTX A5000 node (98 GB VRAM) serving an FP8-quantized 27B open-weight model at 262K context via vLLM, and an Apple-silicon node serving a mixture-of-experts model at 163K context - redundancy with very different power and cost profiles. An Open WebUI frontend exposes 12 curated model presets tuned per task.
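To illustrate how two backends with different context limits can act as redundancy for each other, here is a minimal routing sketch. It is not the platform's actual router: the hostnames, ports, the `Backend` type and the `pick_backend` function are hypothetical, and the context sizes are simply the rounded "262K" and "163K" figures from the text. The health check is injected as a callable so the logic can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str     # hypothetical address, not taken from the source
    max_context: int  # context window in tokens (rounded figure from the text)


# Preference order: GPU node first, Apple-silicon node as fallback (assumption).
BACKENDS = (
    Backend("gpu-vllm", "http://gpu-node.lan:8000/v1", 262_000),
    Backend("apple-silicon", "http://mac-node.lan:8080/v1", 163_000),
)


def pick_backend(
    prompt_tokens: int,
    is_healthy: Callable[[Backend], bool],
    backends: Sequence[Backend] = BACKENDS,
) -> Optional[Backend]:
    """Return the first healthy backend whose context window fits the prompt."""
    for backend in backends:
        if prompt_tokens <= backend.max_context and is_healthy(backend):
            return backend
    return None  # nothing can serve this request right now
```

In a real deployment `is_healthy` would probe each server over HTTP. Note that a prompt longer than the fallback's window cannot fail over at all, which is one reason the two context limits matter when sizing redundancy.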

Agent capability lives in an MCP tool layer: 30+ tools spanning a local research pipeline (self-hosted metasearch, parallel scraping, semantic ranking with local embeddings), document and file operations, image generation with explicit GPU memory orchestration, and controlled facility integrations. Access is tiered - LAN clients get administrative scope, remote clients authenticate with bearer tokens and rate limits, and per-tool deny-lists gate sensitive integrations. Polish and English voice runs on local Whisper STT and Piper TTS. A messaging-channel agent executes unattended tasks end to end.
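The tiered access model described above can be sketched as a single authorization check. This is an illustration under stated assumptions, not the platform's code: the LAN subnet, the token, the tool names, the limits and the `authorize` function are all hypothetical, and the sketch assumes deny-lists apply to the remote tier only.

```python
import hmac
import ipaddress
import time
from collections import defaultdict, deque

# All values below are hypothetical placeholders.
LAN_NETWORK = ipaddress.ip_network("192.168.10.0/24")
REMOTE_TOKENS = {"example-token-please-replace": "remote-user"}
DENY_REMOTE = {"facility_door_control"}  # tools never exposed to remote clients
RATE_LIMIT = 5       # allowed calls per window
RATE_WINDOW = 60.0   # window length in seconds

_recent_calls = defaultdict(deque)  # token owner -> timestamps of allowed calls


def _token_owner(token):
    """Look up a bearer token using constant-time comparison."""
    for known, owner in REMOTE_TOKENS.items():
        if hmac.compare_digest(known.encode(), token.encode()):
            return owner
    return None


def authorize(client_ip, tool, token=None, now=None):
    """Return (allowed, reason) for one tool call."""
    now = time.monotonic() if now is None else now
    # Tier 1: LAN clients get administrative scope.
    if ipaddress.ip_address(client_ip) in LAN_NETWORK:
        return True, "lan-admin"
    # Tier 2: remote clients must present a valid bearer token.
    owner = _token_owner(token or "")
    if owner is None:
        return False, "invalid-token"
    if tool in DENY_REMOTE:
        return False, "tool-denied"
    # Sliding-window rate limit per token owner.
    calls = _recent_calls[owner]
    while calls and now - calls[0] >= RATE_WINDOW:
        calls.popleft()
    if len(calls) >= RATE_LIMIT:
        return False, "rate-limited"
    calls.append(now)
    return True, "remote-token"
```

Checking the deny-list before the rate limiter means a refused call does not consume quota; a stricter design might count refusals as well.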

Underneath: a 2-node Proxmox cluster with hardware watchdogs, self-healing service policies, VLAN network segmentation and versioned backups - hardened by real incidents, not by checklist.
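One common shape for a self-healing service policy is "restart after N consecutive failed health checks, escalate after too many restarts". The sketch below shows that decision logic only; the `RestartPolicy` class, its thresholds and its action names are hypothetical and are not taken from the platform's configuration.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    failure_threshold: int = 3  # consecutive failures before acting (assumption)
    max_restarts: int = 5       # restarts allowed before escalating (assumption)
    consecutive_failures: int = 0
    restarts: int = 0

    def observe(self, healthy: bool) -> str:
        """Record one health-check result and return the action to take."""
        if healthy:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "wait"  # tolerate short blips
        self.consecutive_failures = 0
        if self.restarts >= self.max_restarts:
            return "escalate"  # e.g. hand over to a node-level watchdog
        self.restarts += 1
        return "restart"
```

A caller would run `observe` on every health probe and act on the returned string, keeping the policy itself free of side effects and easy to test.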

Outcome

A production platform in daily use, and the reference architecture we adapt to client constraints: hardware sizing, model policy, integration surface and compliance posture. When we say private AI can be a first-class experience rather than a compromise, this is the evidence.

From the workbench

private AI platform — live inventory
$ platform status
inference backends   2   gpu: 4x A5000 / vLLM / FP8 27B / 262K ctx · apple-silicon: MoE / 163K ctx
model presets        12  per-task system prompts + parameters
mcp tools            30+ research · documents · images · facility integrations
auth tiers           2   lan = admin scope · remote = bearer token + rate limits
voice                PL EN whisper stt · piper tts, fully local
external AI APIs     0   no prompt, document or token leaves the network

Counts and configuration reflect the real, operating platform.

Talk to the person who will actually build it

One architect, end to end: scoping, architecture, delivery, operations. Write a paragraph about your problem and you will get an engineering answer, not a sales call.