Technical Architecture

One server.
Every tool.
Your AI team.

A dedicated AI workspace per company. Claude connects to your CRM, email, calendar, and 40+ services through a 21-container distributed system with 3-layer self-healing and per-tenant VM isolation.

21 containers in production
43 MCP adapters
Per-tenant VM isolation
3-layer self-healing
System Overview

Any AI client.
One company brain.

Users connect via Claude Desktop (SSH), Claude.ai (Remote MCP), or Telegram. Every request hits a dedicated server with full access to the company's tools.

ACCESS LAYER
Claude Desktop
SSH tunnel via connect.getelio.co
LIVE
Claude.ai / Mobile
Remote MCP over HTTPS
Q2
Telegram Bot
Grammy.js, 8 middleware layers
Q2
SSH / MCP
Elio Server
Dedicated VM per organization
AI RUNTIME
Claude Code
Business context (CLAUDE.md)
55 composable skills
3-layer memory system
43 MCP ADAPTERS
+34 more
OAuth tokens AES-256-GCM encrypted, never leave the server
APIs
SERVICES
Google Workspace
HubSpot
Slack
Notion
Any REST API

Request lifecycle

User asks: "Draft follow-ups for closing deals"
Reads context: CLAUDE.md, memory, skill library
Calls APIs: hubspot.list_deals(), gmail.create_draft()
Done: 12 personalized drafts in Gmail
Container Architecture

21 containers.
3 Compose stacks.

The control plane runs as a multi-compose distributed system on a single VM. Core services, observability, and infrastructure are deployed independently.

10 core services
7 observability
4 infrastructure
~12GB RAM budget
CORE STACK
docker-compose.yml
elio-api
x2 instances, LB
512M each
elio-assistant
SSE streaming
256M
elio-worker
BullMQ, 4x concurrency
6GB (2GB reserved)
elio-bot
Grammy.js Telegram
2GB
caddy
TLS + reverse proxy
256M
redis
BullMQ, noeviction
600M
redis-cache
API cache, allkeys-lru
300M
autoheal
Docker-native restart
64M
OBSERVABILITY
docker-compose.observability.yml
Grafana 11.6
Dashboards + alerts
Prometheus 3.2
Metrics, 30d retention
Loki 3.4
Logs, 30d retention
Tempo 2.4
Distributed tracing
Promtail
Log shipping
Node Exporter
Host metrics
Blackbox
HTTP probing
SPLIT REDIS ARCHITECTURE
redis:6379 (noeviction)
BullMQ queues, workflow state, job data. 512MB max. Data must never be evicted: a lost job means a lost workflow execution.
redis-cache:6380 (allkeys-lru)
API response cache, rate-limit counters, token cache. 256MB max. Safe to evict: everything can be rebuilt from its source.
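
A minimal sketch of how the two connections are typically wired up from the Node side with ioredis (client names, hosts, and the example key are illustrative, not the production config):

import { Redis } from "ioredis";

// Queue Redis: BullMQ job state and workflow data. The server runs with noeviction,
// and maxRetriesPerRequest: null keeps BullMQ's blocking commands working.
const queueRedis = new Redis({ host: "redis", port: 6379, maxRetriesPerRequest: null });

// Cache Redis: API responses, rate-limit counters, token cache. The server runs with
// allkeys-lru, so anything here may disappear and must be rebuildable from source.
const cacheRedis = new Redis({ host: "redis-cache", port: 6380 });

// Example: cache an API response for 60 seconds (hypothetical key).
await cacheRedis.set("cache:hubspot:deals:org_123", JSON.stringify([]), "EX", 60);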
Deployment

Blue-green deploys.
Zero-downtime worker updates.

API DEPLOY
2 API instances behind Caddy load balancer. Rolling restart: take one down, rebuild, bring up, then repeat for the second. Caddy health-checks both and routes only to healthy instances.
# Caddy upstream config
reverse_proxy elio-api:3100 elio-api-2:3100 {
    lb_policy round_robin
    health_uri /health
    health_interval 10s
}
WORKER DEPLOY
Blue-green strategy. Start a green worker on port 9092. Wait for it to become healthy. Drain the blue worker (finish in-progress jobs). Swap. Tear down old.
# blue-green-deploy.sh sequence
1. docker compose -f green.yml up -d
2. wait_for_health worker-green:9092
3. drain_worker worker-blue (graceful)
4. swap_labels blue ↔ green
5. teardown old container
HORIZONTAL SCALING
Scale the API from 2 to 4 instances with one command:
docker compose -f docker-compose.yml -f docker-compose.scale.yml up -d
Each instance gets 512MB; the Telegram bot is disabled on instances 3 and 4 to prevent duplicate update handling.

Tenant server provisioning

1. Install runtime: Node 22, Bun, Claude Code, Redis, UFW
2. Push vault key + tenant config: AES-256-GCM key, org YAML, generated CLAUDE.md
3. Deploy sync agent: 6.2KB JS binary, systemd service, sandboxed elio-sync user
4. Register + start: device token (90-day), Promtail to central Loki, Grafana dashboard

Config sync (bidirectional)

PULL (every 5 min)
Tenant server polls GET /sync/config with an ETag; the control plane returns 304 if nothing changed. Payload: CLAUDE.md sections, integrations, users, feature flags, skills. Authenticated with the device token; SHA-256 hashes detect config drift.
PUSH (on critical changes)
The control plane SSHes into the tenant server to trigger an immediate sync. Fires on integration connect/disconnect, user add/remove, and plan changes. Credentials are pushed encrypted, with 3x retry and 2s backoff.
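
A sketch of the tenant-side pull loop under the contract described above; the control-plane URL, header names, and applyConfig are assumptions for illustration:

// Poll the control plane every 5 minutes; an ETag match (304) means nothing to apply.
let lastEtag: string | undefined;

async function applyConfig(payload: unknown): Promise<void> {
  void payload; // hypothetical: write CLAUDE.md sections, integrations, users, flags, skills to disk
}

async function pullConfig(deviceToken: string): Promise<void> {
  const res = await fetch("https://app.getelio.co/sync/config", { // assumed endpoint URL
    headers: {
      Authorization: `Bearer ${deviceToken}`,            // device token auth (assumed header)
      ...(lastEtag ? { "If-None-Match": lastEtag } : {}),
    },
  });

  if (res.status === 304) return; // config unchanged

  lastEtag = res.headers.get("etag") ?? undefined;
  await applyConfig(await res.json());
}

setInterval(() => void pullConfig(process.env.ELIO_DEVICE_TOKEN ?? ""), 5 * 60 * 1000);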
Network Architecture

Single bridge.
Localhost-only ports.

All 21 containers share one Docker bridge network. Only Caddy is exposed to the internet. APIs, Redis, and metrics are bound to 127.0.0.1.

Internet (0.0.0.0)
  |
UFW Firewall ── [80, 443, 22]
  |
elio-caddy (0.0.0.0:80, 0.0.0.0:443)
  ├─ getelio.co         → [elio-api, elio-api-2] round-robin
  ├─ app.getelio.co     → [elio-api, elio-api-2] + SPA fallback
  ├─ metrics.getelio.co → grafana:3000
  ├─ n8n.getelio.co     → n8n:5678
  └─ s.getelio.co       → intercom:3847

Docker bridge (172.18.0.0/16) ── internal only
  ├─ [elio-api*]   → redis:6379, redis-cache:6380
  ├─ [elio-worker] → redis:6379 (BullMQ consumer)
  ├─ [promtail]    → loki:3100
  ├─ [prometheus]  ← node-exporter, blackbox
  ├─ [tempo]       ← OTLP from worker, APIs (port 4318)
  └─ [grafana]     → prometheus, loki, tempo

Localhost-only bindings (127.0.0.1)
  ├─ :3100, :3101  API instances
  ├─ :6379, :6380  Redis instances
  ├─ :9090, :9091  Metrics endpoints
  └─ :3200         Assistant service
CADDY SECURITY HEADERS
Content-Security-Policy
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
Referrer-Policy: strict-origin
Permissions-Policy: camera=(), microphone=()
CACHING STRATEGY
JS/CSS (hashed): max-age=31536000, immutable
Assets: max-age=2678400, stale-while-revalidate
HTML: no-cache (revalidate every request)
API: no-store (never cached)
Security Model

Defense in depth.
Per-tenant isolation at every layer.

TENANT A / TENANT B / TENANT C
Each tenant gets:
Dedicated VM
Own SSH keys
Own OAuth vault
Own context & memory
Own Promtail logging
AUTHENTICATION
Users: Supabase OAuth (Google) + magic link fallback
Token cache: 3-tier (L1 in-memory 5min, L2 Redis 300s, L3 Supabase)
Admins: Role-based via app_metadata (super_admin > org_admin > member)
Devices: SHA-256 hashed device tokens, 90-day expiry
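
A sketch of that 3-tier lookup (L1 in-memory, L2 Redis, L3 Supabase); the key format and the Supabase call are placeholders, not the real implementation:

import { Redis } from "ioredis";

declare function fetchUserFromSupabase(token: string): Promise<unknown>; // hypothetical L3 lookup

const L1_TTL_MS = 5 * 60 * 1000; // 5 min in-process
const L2_TTL_S = 300;            // 300s in Redis

const l1 = new Map<string, { user: unknown; expiresAt: number }>();
const redis = new Redis({ host: "redis-cache", port: 6380 });

export async function resolveToken(token: string): Promise<unknown> {
  const now = Date.now();

  const hit = l1.get(token);                         // L1: in-memory
  if (hit && hit.expiresAt > now) return hit.user;

  const cached = await redis.get(`token:${token}`);  // L2: Redis cache
  if (cached) {
    const user = JSON.parse(cached);
    l1.set(token, { user, expiresAt: now + L1_TTL_MS });
    return user;
  }

  const user = await fetchUserFromSupabase(token);   // L3: source of truth
  await redis.set(`token:${token}`, JSON.stringify(user), "EX", L2_TTL_S);
  l1.set(token, { user, expiresAt: now + L1_TTL_MS });
  return user;
}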
CREDENTIAL MANAGEMENT
Encryption: AES-256-GCM with per-credential random salt
Storage: Encrypted at rest on tenant VM, never on control plane
Transport: SSH push with 3x retry, 2s exponential backoff
Refresh: OAuth token auto-refresh every 30 min
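
A minimal sketch of per-credential AES-256-GCM with node:crypto; the scrypt key derivation and the stored field layout are assumptions, since the section only specifies the cipher and the per-credential salt:

import { randomBytes, scryptSync, createCipheriv, createDecipheriv } from "node:crypto";

// Encrypt a credential with AES-256-GCM. Each credential gets its own random salt,
// so the same vault key never yields the same derived key twice.
export function encryptCredential(plaintext: string, vaultKey: string) {
  const salt = randomBytes(16);
  const iv = randomBytes(12);
  const key = scryptSync(vaultKey, salt, 32); // assumption: scrypt KDF
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    salt: salt.toString("hex"),
    iv: iv.toString("hex"),
    tag: cipher.getAuthTag().toString("hex"),
    data: data.toString("hex"),
  };
}

export function decryptCredential(enc: ReturnType<typeof encryptCredential>, vaultKey: string): string {
  const key = scryptSync(vaultKey, Buffer.from(enc.salt, "hex"), 32);
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(enc.iv, "hex"));
  decipher.setAuthTag(Buffer.from(enc.tag, "hex"));
  return Buffer.concat([decipher.update(Buffer.from(enc.data, "hex")), decipher.final()]).toString("utf8");
}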
SSH GATEWAY
Entry: connect.getelio.co - single SSH endpoint for all tenants
Routing: AuthorizedKeysCommand resolves handle to tenant VM
Validation: Timing-safe secret comparison, injection prevention
Caching: Positive 60s, negative 30s, Redis pub/sub invalidation
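
The timing-safe comparison called out above, sketched with node:crypto (the function name and hash-then-compare shape are illustrative):

import { createHash, timingSafeEqual } from "node:crypto";

// Compare a presented device token against the stored SHA-256 hash without
// leaking information through comparison timing.
export function verifyDeviceToken(presented: string, storedSha256Hex: string): boolean {
  const presentedHash = createHash("sha256").update(presented).digest();
  const storedHash = Buffer.from(storedSha256Hex, "hex");
  if (presentedHash.length !== storedHash.length) return false;
  return timingSafeEqual(presentedHash, storedHash);
}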
HARDENING
Firewall: UFW allow-list (22, 80, 443 only for production)
Containers: No Docker socket in worker (prevents escape)
Sync agent: Sandboxed (ProtectSystem=strict, NoNewPrivileges)
Database: Row Level Security on all tables, fail2ban on SSH
API MIDDLEWARE STACK
CSRF protection
Per-org rate limiting
Global rate limiting
Permission enforcement
Admin audit trail
OTEL tracing
Response envelope
Request logging
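
Roughly how that chain composes in Hono. The middleware order follows the list above, and stub() stands in for the real implementations, which live in the app's own packages:

import { Hono } from "hono";
import type { MiddlewareHandler } from "hono";

// Placeholder middleware so the sketch stays self-contained; each real one is its own module.
const stub = (name: string): MiddlewareHandler => async (_c, next) => { void name; await next(); };

const app = new Hono();

app.use("*", stub("csrf-protection"));
app.use("*", stub("per-org-rate-limit"));
app.use("*", stub("global-rate-limit"));
app.use("*", stub("permission-enforcement"));
app.use("*", stub("admin-audit-trail"));
app.use("*", stub("otel-tracing"));
app.use("*", stub("response-envelope"));
app.use("*", stub("request-logging"));

app.get("/health", (c) => c.json({ status: "ok" }));

export default app; // Bun serves the default export's fetch handler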
Monitoring & Self-Healing

Three layers.
If one fails, the next catches it.

The system recovers from failures automatically. Docker restarts containers, cron scripts watch Docker, and the application watches everything.

LAYER A: DOCKER-NATIVE
autoheal container monitors all services every 60 seconds. If a container's health check fails 3-5 times, it gets restarted automatically. Every service has a custom health check: Redis uses redis-cli ping, APIs use HTTP GET /health, worker uses /health/readiness.
interval: 60s
covers: all 21 containers
LAYER B: HOST GUARDIANS
Two cron scripts run on the host (outside Docker). worker-guardian.sh runs every minute and checks worker health over HTTP. services-guardian.sh runs every 5 minutes and watches n8n, Prometheus, Loki, Grafana, and Promtail. After 3 failed restarts it escalates to Telegram with log excerpts; a 15-minute cooldown prevents alert spam.
interval: 1-5 min
covers: critical services
LAYER C: APPLICATION SELF-HEAL
30+ TypeScript modules in packages/self-heal/. 14 preconfigured alerts (redis-memory, queue-backlog, disk-space, workflow-dlq-growth, caddy-cert-expiry, etc.). 8 automated remediation actions (dlq-retry, redis-purge, container-health, disk-cleanup, etc.). A proactive health loop runs every 5 minutes and dispatches self-heal alerts after 3 consecutive failures.
interval: 5 min
14 alerts, 8 actions
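
A sketch of the Layer C pattern (interval check, alert after 3 consecutive failures). The check names match alerts listed above, but the bodies and the dispatcher are placeholders, not the code in packages/self-heal/:

declare function dispatchSelfHealAlert(name: string): Promise<void>; // hypothetical dispatcher

type HealthCheck = { name: string; run: () => Promise<boolean> };

// Illustrative subset of the 14 alerts; real checks inspect Redis memory, queues, disk, certs, etc.
const checks: HealthCheck[] = [
  { name: "redis-memory", run: async () => true },
  { name: "queue-backlog", run: async () => true },
  { name: "disk-space", run: async () => true },
];

const consecutiveFailures = new Map<string, number>();

async function healthLoop(): Promise<void> {
  for (const check of checks) {
    const ok = await check.run().catch(() => false);
    const failures = ok ? 0 : (consecutiveFailures.get(check.name) ?? 0) + 1;
    consecutiveFailures.set(check.name, failures);

    // After 3 consecutive failures, dispatch a self-heal alert (which may trigger a remediation action).
    if (failures >= 3) {
      await dispatchSelfHealAlert(check.name);
      consecutiveFailures.set(check.name, 0);
    }
  }
}

setInterval(() => void healthLoop(), 5 * 60 * 1000); // proactive loop every 5 minutes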

Observability stack

LOGS
Promtail ships logs from all services and tenant servers to Loki. 30-day retention. LogQL queries in Grafana. Tenant logs tagged with tenant_id labels.
METRICS
Prometheus scrapes Node Exporter (CPU, RAM, disk) every 15 seconds. 30-day TSDB retention. Blackbox exporter probes HTTP endpoints. Alert rules fire to Telegram.
TRACES
OpenTelemetry spans are shipped to Tempo over OTLP from the APIs and worker, batched every 5s. A traceparent header is injected into HTTP responses for end-to-end tracing.
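
A minimal trace-export setup for the worker or API process, assuming the standard OpenTelemetry Node packages and the Tempo OTLP/HTTP port from the network diagram:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Spans are batched (5s by default) and exported to Tempo over OTLP/HTTP on port 4318.
const sdk = new NodeSDK({
  serviceName: "elio-worker", // illustrative service name
  traceExporter: new OTLPTraceExporter({ url: "http://tempo:4318/v1/traces" }),
});

sdk.start();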
Data Layer

Supabase + Redis.
BullMQ for everything async.

POSTGRESQL (SUPABASE)
38 migrations applied
Row Level Security on all tables
Drizzle ORM with strict types
Supabase Auth (OAuth + magic links)
Managed backups by Supabase
BULLMQ WORKFLOW ENGINE
Single queue: workflow-execution
4x concurrency, DLQ monitoring
Registry-first architecture
Unique job IDs (name + timestamp)
Auto-retry with exponential backoff
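
A sketch of the queue and worker wiring with the settings listed above; the connection details, payload, and registry lookup are illustrative:

import { Queue, Worker } from "bullmq";

declare function runWorkflow(name: string, data: unknown): Promise<unknown>; // hypothetical registry dispatch

const connection = { host: "redis", port: 6379 }; // the noeviction Redis instance

// Single queue for all async work.
const queue = new Queue("workflow-execution", { connection });

// Unique job ID (name + timestamp) plus auto-retry with exponential backoff.
await queue.add(
  "send-follow-ups",
  { orgId: "org_123" },
  {
    jobId: `send-follow-ups:${Date.now()}`,
    attempts: 3,
    backoff: { type: "exponential", delay: 2000 },
  },
);

// Worker runs up to 4 jobs concurrently; jobs that exhaust their retries are what the DLQ monitoring watches.
new Worker(
  "workflow-execution",
  async (job) => runWorkflow(job.name, job.data),
  { connection, concurrency: 4 },
);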
MEMORY SYSTEM
Hot: MEMORY.md (4KB, every session)
Warm: Daily logs (today + yesterday)
Cold: BM25 + vector hybrid search
Consolidation every 6h via cron
Token budget awareness
Technology Stack

TypeScript strict
across the entire stack.

React 19
+ Vite 6, Tailwind, TanStack
Hono
API framework on Bun
Supabase
Postgres + Auth + RLS
BullMQ
Workflow engine on Redis
Docker Compose
21 containers, 3 stacks
Caddy 2
Auto TLS + reverse proxy
Zod
Runtime validation
Vitest
Test framework
Scaling Path

Where we are.
Where we're going.

NOW
0-50
1 VPS per client (Hetzner)
Ansible provisioning
SSH gateway routing
Config sync + push
NEXT
50-500
Kubernetes cluster
Container per tenant
Zero-touch provisioning
Auto-scaling
FUTURE
500+
Multi-region (EU, US, ME)
On-prem option
Skills marketplace
Custom model routing