Technical Architecture

One server.
Every tool.
Your AI team.

A dedicated AI workspace per company. Claude connects to your CRM, email, calendar, and 40+ services through a 21-container distributed system with 3-layer self-healing and per-tenant VM isolation.

21 containers in production
43 MCP adapters
Per-tenant VM isolation
3-layer self-healing
System Overview

Any AI client.
One company brain.

Users connect via Claude Desktop (SSH), Claude.ai (Remote MCP), or Telegram. Every request hits a dedicated server with full access to the company's tools.

ACCESS LAYER
Claude Desktop
SSH tunnel via connect.getelio.co
LIVE
Claude.ai / Mobile
Remote MCP over HTTPS
Q2
Telegram Bot
Grammy.js, 8 middleware layers
Q2
SSH / MCP
Elio Server
Dedicated VM per organization
AI RUNTIME
Claude Code
Business context (CLAUDE.md)
55 composable skills
3-layer memory system
43 MCP ADAPTERS
+34 more
OAuth tokens AES-256-GCM encrypted, never leave the server
APIs
SERVICES
Google Workspace
HubSpot
Slack
Notion
Any REST API

Request lifecycle

User asks: "Draft follow-ups for closing deals"
Reads context: CLAUDE.md, memory, skill library
Calls APIs: hubspot.list_deals(), gmail.create_draft()
Done: 12 personalized drafts in Gmail
Container Architecture

21 containers.
3 Compose stacks.

The control plane runs as a multi-compose distributed system on a single VM. Core services, observability, and infrastructure are deployed independently.

10 core services
7 observability
4 infrastructure
~12GB RAM budget
CORE STACK
docker-compose.yml
elio-api
x2 instances, LB
512M each
elio-assistant
SSE streaming
256M
elio-worker
BullMQ, 4x concurrency
6GB (2GB reserved)
elio-bot
Grammy.js Telegram
2GB
caddy
TLS + reverse proxy
256M
redis
BullMQ, noeviction
600M
redis-cache
API cache, allkeys-lru
300M
autoheal
Docker-native restart
64M
OBSERVABILITY
docker-compose.observability.yml
Grafana 11.6
Dashboards + alerts
Prometheus 3.2
Metrics, 30d retention
Loki 3.4
Logs, 30d retention
Tempo 2.4
Distributed tracing
Promtail
Log shipping
Node Exporter
Host metrics
Blackbox
HTTP probing
SPLIT REDIS ARCHITECTURE
redis:6379 (noeviction)
BullMQ queues, workflow state, job data. 512MB max. Data must never be evicted: a lost job means a lost workflow execution.
redis-cache:6380 (allkeys-lru)
API response cache, rate-limit counters, token cache. 256MB max. Safe to evict: everything can be rebuilt from its source.
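
A minimal sketch of how the two connections are typically wired up from the Node side with ioredis (client names, hosts, and the example key are illustrative, not the production config):

import { Redis } from "ioredis";

// Queue Redis: BullMQ job state and workflow data. The server runs with noeviction,
// and maxRetriesPerRequest: null keeps BullMQ's blocking commands working.
const queueRedis = new Redis({ host: "redis", port: 6379, maxRetriesPerRequest: null });

// Cache Redis: API responses, rate-limit counters, token cache. The server runs with
// allkeys-lru, so anything here may disappear and must be rebuildable from source.
const cacheRedis = new Redis({ host: "redis-cache", port: 6380 });

// Example: cache an API response for 60 seconds (hypothetical key).
await cacheRedis.set("cache:hubspot:deals:org_123", JSON.stringify([]), "EX", 60);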
Deployment

Blue-green deploys.
Zero-downtime worker updates.

API DEPLOY
2 API instances behind Caddy load balancer. Rolling restart: take one down, rebuild, bring up, then repeat for the second. Caddy health-checks both and routes only to healthy instances.
# Caddy upstream config
reverse_proxy elio-api:3100 elio-api-2:3100 {
    lb_policy round_robin
    health_uri /health
    health_interval 10s
}
WORKER DEPLOY
Blue-green strategy. Start a green worker on port 9092. Wait for it to become healthy. Drain the blue worker (finish in-progress jobs). Swap. Tear down old.
# blue-green-deploy.sh sequence
1. docker compose -f green.yml up -d
2. wait_for_health worker-green:9092
3. drain_worker worker-blue (graceful)
4. swap_labels blue ↔ green
5. teardown old container
HORIZONTAL SCALING
Scale the API from 2 to 4 instances with one command:
docker compose -f docker-compose.yml -f docker-compose.scale.yml up -d
Each instance gets 512MB; the Telegram bot is disabled on instances 3 and 4 to prevent duplicate update handling.

Tenant server provisioning

1. Install runtime: Node 22, Bun, Claude Code, Redis, UFW
2. Push vault key + tenant config: AES-256-GCM key, org YAML, generated CLAUDE.md
3. Deploy sync agent: 6.2KB JS binary, systemd service, sandboxed elio-sync user
4. Register + start: device token (90-day), Promtail to central Loki, Grafana dashboard

Config sync (bidirectional)

PULL (every 5 min)
Tenant server polls GET /sync/config with an ETag; the control plane returns 304 if nothing changed. Payload: CLAUDE.md sections, integrations, users, feature flags, skills. Authenticated with the device token; SHA-256 hashes detect config drift.
PUSH (on critical changes)
The control plane SSHes into the tenant server to trigger an immediate sync. Fires on integration connect/disconnect, user add/remove, and plan changes. Credentials are pushed encrypted, with 3x retry and 2s backoff.
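
A sketch of the tenant-side pull loop under the contract described above; the control-plane URL, header names, and applyConfig are assumptions for illustration:

// Poll the control plane every 5 minutes; an ETag match (304) means nothing to apply.
let lastEtag: string | undefined;

async function applyConfig(payload: unknown): Promise<void> {
  void payload; // hypothetical: write CLAUDE.md sections, integrations, users, flags, skills to disk
}

async function pullConfig(deviceToken: string): Promise<void> {
  const res = await fetch("https://app.getelio.co/sync/config", { // assumed endpoint URL
    headers: {
      Authorization: `Bearer ${deviceToken}`,            // device token auth (assumed header)
      ...(lastEtag ? { "If-None-Match": lastEtag } : {}),
    },
  });

  if (res.status === 304) return; // config unchanged

  lastEtag = res.headers.get("etag") ?? undefined;
  await applyConfig(await res.json());
}

setInterval(() => void pullConfig(process.env.ELIO_DEVICE_TOKEN ?? ""), 5 * 60 * 1000);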
Network Architecture

Single bridge.
Localhost-only ports.

All 21 containers share one Docker bridge network. Only Caddy is exposed to the internet. APIs, Redis, and metrics are bound to 127.0.0.1.

Internet (0.0.0.0)
  |
UFW Firewall ── [80, 443, 22]
  |
elio-caddy (0.0.0.0:80, 0.0.0.0:443)
  ├─ getelio.co         → [elio-api, elio-api-2] round-robin
  ├─ app.getelio.co     → [elio-api, elio-api-2] + SPA fallback
  ├─ metrics.getelio.co → grafana:3000
  ├─ n8n.getelio.co     → n8n:5678
  └─ s.getelio.co       → intercom:3847

Docker bridge (172.18.0.0/16) ── internal only
  ├─ [elio-api*]   → redis:6379, redis-cache:6380
  ├─ [elio-worker] → redis:6379 (BullMQ consumer)
  ├─ [promtail]    → loki:3100
  ├─ [prometheus]  ← node-exporter, blackbox
  ├─ [tempo]       ← OTLP from worker, APIs (port 4318)
  └─ [grafana]     → prometheus, loki, tempo

Localhost-only bindings (127.0.0.1)
  ├─ :3100, :3101  API instances
  ├─ :6379, :6380  Redis instances
  ├─ :9090, :9091  Metrics endpoints
  └─ :3200         Assistant service
CADDY SECURITY HEADERS
Content-Security-Policy
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
Referrer-Policy: strict-origin
Permissions-Policy: camera=(), microphone=()
CACHING STRATEGY
JS/CSS (hashed): max-age=31536000, immutable
Assets: max-age=2678400, stale-while-revalidate
HTML: no-cache (revalidate every request)
API: no-store (never cached)
Security Model

Defense in depth.
Per-tenant isolation at every layer.

TENANT A / TENANT B / TENANT C
Each tenant gets:
Dedicated VM
Own SSH keys
Own OAuth vault
Own context & memory
Own Promtail logging
AUTHENTICATION
Users: Supabase OAuth (Google) + magic link fallback
Token cache: 3-tier (L1 in-memory 5min, L2 Redis 300s, L3 Supabase)
Admins: Role-based via app_metadata (super_admin > org_admin > member)
Devices: SHA-256 hashed device tokens, 90-day expiry
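
A sketch of that 3-tier lookup (L1 in-memory, L2 Redis, L3 Supabase); the key format and the Supabase call are placeholders, not the real implementation:

import { Redis } from "ioredis";

declare function fetchUserFromSupabase(token: string): Promise<unknown>; // hypothetical L3 lookup

const L1_TTL_MS = 5 * 60 * 1000; // 5 min in-process
const L2_TTL_S = 300;            // 300s in Redis

const l1 = new Map<string, { user: unknown; expiresAt: number }>();
const redis = new Redis({ host: "redis-cache", port: 6380 });

export async function resolveToken(token: string): Promise<unknown> {
  const now = Date.now();

  const hit = l1.get(token);                         // L1: in-memory
  if (hit && hit.expiresAt > now) return hit.user;

  const cached = await redis.get(`token:${token}`);  // L2: Redis cache
  if (cached) {
    const user = JSON.parse(cached);
    l1.set(token, { user, expiresAt: now + L1_TTL_MS });
    return user;
  }

  const user = await fetchUserFromSupabase(token);   // L3: source of truth
  await redis.set(`token:${token}`, JSON.stringify(user), "EX", L2_TTL_S);
  l1.set(token, { user, expiresAt: now + L1_TTL_MS });
  return user;
}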
CREDENTIAL MANAGEMENT
Encryption: AES-256-GCM with per-credential random salt
Storage: Encrypted at rest on tenant VM, never on control plane
Transport: SSH push with 3x retry, 2s exponential backoff
Refresh: OAuth token auto-refresh every 30 min
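
A minimal sketch of per-credential AES-256-GCM with node:crypto; the scrypt key derivation and the stored field layout are assumptions, since the section only specifies the cipher and the per-credential salt:

import { randomBytes, scryptSync, createCipheriv, createDecipheriv } from "node:crypto";

// Encrypt a credential with AES-256-GCM. Each credential gets its own random salt,
// so the same vault key never yields the same derived key twice.
export function encryptCredential(plaintext: string, vaultKey: string) {
  const salt = randomBytes(16);
  const iv = randomBytes(12);
  const key = scryptSync(vaultKey, salt, 32); // assumption: scrypt KDF
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    salt: salt.toString("hex"),
    iv: iv.toString("hex"),
    tag: cipher.getAuthTag().toString("hex"),
    data: data.toString("hex"),
  };
}

export function decryptCredential(enc: ReturnType<typeof encryptCredential>, vaultKey: string): string {
  const key = scryptSync(vaultKey, Buffer.from(enc.salt, "hex"), 32);
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(enc.iv, "hex"));
  decipher.setAuthTag(Buffer.from(enc.tag, "hex"));
  return Buffer.concat([decipher.update(Buffer.from(enc.data, "hex")), decipher.final()]).toString("utf8");
}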
SSH GATEWAY
Entry: connect.getelio.co - single SSH endpoint for all tenants
Routing: AuthorizedKeysCommand resolves handle to tenant VM
Validation: Timing-safe secret comparison, injection prevention
Caching: Positive 60s, negative 30s, Redis pub/sub invalidation
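
The timing-safe comparison called out above, sketched with node:crypto (the function name and hash-then-compare shape are illustrative):

import { createHash, timingSafeEqual } from "node:crypto";

// Compare a presented device token against the stored SHA-256 hash without
// leaking information through comparison timing.
export function verifyDeviceToken(presented: string, storedSha256Hex: string): boolean {
  const presentedHash = createHash("sha256").update(presented).digest();
  const storedHash = Buffer.from(storedSha256Hex, "hex");
  if (presentedHash.length !== storedHash.length) return false;
  return timingSafeEqual(presentedHash, storedHash);
}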
HARDENING
Firewall: UFW allow-list (22, 80, 443 only for production)
Containers: No Docker socket in worker (prevents escape)
Sync agent: Sandboxed (ProtectSystem=strict, NoNewPrivileges)
Database: Row Level Security on all tables, fail2ban on SSH
API MIDDLEWARE STACK
CSRF protection
Per-org rate limiting
Global rate limiting
Permission enforcement
Admin audit trail
OTEL tracing
Response envelope
Request logging
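
Roughly how that chain composes in Hono. The middleware order follows the list above, and stub() stands in for the real implementations, which live in the app's own packages:

import { Hono } from "hono";
import type { MiddlewareHandler } from "hono";

// Placeholder middleware so the sketch stays self-contained; each real one is its own module.
const stub = (name: string): MiddlewareHandler => async (_c, next) => { void name; await next(); };

const app = new Hono();

app.use("*", stub("csrf-protection"));
app.use("*", stub("per-org-rate-limit"));
app.use("*", stub("global-rate-limit"));
app.use("*", stub("permission-enforcement"));
app.use("*", stub("admin-audit-trail"));
app.use("*", stub("otel-tracing"));
app.use("*", stub("response-envelope"));
app.use("*", stub("request-logging"));

app.get("/health", (c) => c.json({ status: "ok" }));

export default app; // Bun serves the default export's fetch handler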
Monitoring & Self-Healing

Three layers.
If one fails, the next catches it.

The system recovers from failures automatically. Docker restarts containers, cron scripts watch Docker, and the application watches everything.

LAYER A: DOCKER-NATIVE
autoheal container monitors all services every 60 seconds. If a container's health check fails 3-5 times, it gets restarted automatically. Every service has a custom health check: Redis uses redis-cli ping, APIs use HTTP GET /health, worker uses /health/readiness.
interval: 60s
covers: all 21 containers
LAYER B: HOST GUARDIANS
Two cron scripts run on the host (outside Docker). worker-guardian.sh runs every minute and checks worker health over HTTP. services-guardian.sh runs every 5 minutes and watches n8n, Prometheus, Loki, Grafana, and Promtail. After 3 failed restarts it escalates to Telegram with log excerpts; a 15-minute cooldown prevents alert spam.
interval: 1-5 min
covers: critical services
LAYER C: APPLICATION SELF-HEAL
30+ TypeScript modules in packages/self-heal/. 14 preconfigured alerts (redis-memory, queue-backlog, disk-space, workflow-dlq-growth, caddy-cert-expiry, etc.). 8 automated remediation actions (dlq-retry, redis-purge, container-health, disk-cleanup, etc.). A proactive health loop runs every 5 minutes and dispatches self-heal alerts after 3 consecutive failures.
interval: 5 min
14 alerts, 8 actions
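
A sketch of the Layer C pattern (interval check, alert after 3 consecutive failures). The check names match alerts listed above, but the bodies and the dispatcher are placeholders, not the code in packages/self-heal/:

declare function dispatchSelfHealAlert(name: string): Promise<void>; // hypothetical dispatcher

type HealthCheck = { name: string; run: () => Promise<boolean> };

// Illustrative subset of the 14 alerts; real checks inspect Redis memory, queues, disk, certs, etc.
const checks: HealthCheck[] = [
  { name: "redis-memory", run: async () => true },
  { name: "queue-backlog", run: async () => true },
  { name: "disk-space", run: async () => true },
];

const consecutiveFailures = new Map<string, number>();

async function healthLoop(): Promise<void> {
  for (const check of checks) {
    const ok = await check.run().catch(() => false);
    const failures = ok ? 0 : (consecutiveFailures.get(check.name) ?? 0) + 1;
    consecutiveFailures.set(check.name, failures);

    // After 3 consecutive failures, dispatch a self-heal alert (which may trigger a remediation action).
    if (failures >= 3) {
      await dispatchSelfHealAlert(check.name);
      consecutiveFailures.set(check.name, 0);
    }
  }
}

setInterval(() => void healthLoop(), 5 * 60 * 1000); // proactive loop every 5 minutes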

Observability stack

LOGS
Promtail ships logs from all services and tenant servers to Loki. 30-day retention. LogQL queries in Grafana. Tenant logs tagged with tenant_id labels.
METRICS
Prometheus scrapes Node Exporter (CPU, RAM, disk) every 15 seconds. 30-day TSDB retention. Blackbox exporter probes HTTP endpoints. Alert rules fire to Telegram.
TRACES
OpenTelemetry spans are shipped to Tempo over OTLP from the APIs and worker, batched every 5s. A traceparent header is injected into HTTP responses for end-to-end tracing.
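
A minimal trace-export setup for the worker or API process, assuming the standard OpenTelemetry Node packages and the Tempo OTLP/HTTP port from the network diagram:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Spans are batched (5s by default) and exported to Tempo over OTLP/HTTP on port 4318.
const sdk = new NodeSDK({
  serviceName: "elio-worker", // illustrative service name
  traceExporter: new OTLPTraceExporter({ url: "http://tempo:4318/v1/traces" }),
});

sdk.start();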
Data Layer

Supabase + Redis.
BullMQ for everything async.

POSTGRESQL (SUPABASE)
38 migrations applied
Row Level Security on all tables
Drizzle ORM with strict types
Supabase Auth (OAuth + magic links)
Managed backups by Supabase
BULLMQ WORKFLOW ENGINE
Single queue: workflow-execution
4x concurrency, DLQ monitoring
Registry-first architecture
Unique job IDs (name + timestamp)
Auto-retry with exponential backoff
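
A sketch of the queue and worker wiring with the settings listed above; the connection details, payload, and registry lookup are illustrative:

import { Queue, Worker } from "bullmq";

declare function runWorkflow(name: string, data: unknown): Promise<unknown>; // hypothetical registry dispatch

const connection = { host: "redis", port: 6379 }; // the noeviction Redis instance

// Single queue for all async work.
const queue = new Queue("workflow-execution", { connection });

// Unique job ID (name + timestamp) plus auto-retry with exponential backoff.
await queue.add(
  "send-follow-ups",
  { orgId: "org_123" },
  {
    jobId: `send-follow-ups:${Date.now()}`,
    attempts: 3,
    backoff: { type: "exponential", delay: 2000 },
  },
);

// Worker runs up to 4 jobs concurrently; jobs that exhaust their retries are what the DLQ monitoring watches.
new Worker(
  "workflow-execution",
  async (job) => runWorkflow(job.name, job.data),
  { connection, concurrency: 4 },
);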
MEMORY SYSTEM
Hot: MEMORY.md (4KB, every session)
Warm: Daily logs (today + yesterday)
Cold: BM25 + vector hybrid search
Consolidation every 6h via cron
Token budget awareness
Technology Stack

TypeScript strict
across the entire stack.

React 19
+ Vite 6, Tailwind, TanStack
Hono
API framework on Bun
Supabase
Postgres + Auth + RLS
BullMQ
Workflow engine on Redis
Docker Compose
21 containers, 3 stacks
Caddy 2
Auto TLS + reverse proxy
Zod
Runtime validation
Vitest
Test framework
Scaling Path

Where we are.
Where we're going.

NOW
0-50
1 VPS per client (Hetzner)
Ansible provisioning
SSH gateway routing
Config sync + push
NEXT
50-500
Kubernetes cluster
Container per tenant
Zero-touch provisioning
Auto-scaling
FUTURE
500+
Multi-region (EU, US, ME)
On-prem option
Skills marketplace
Custom model routing