Runtime Security Reverse-Proxy
Wire-compatible passthrough endpoints for all supported LLM providers that let client applications drop Antidote Runtime Security in front of their LLM traffic by swapping a single configuration value, the SDK base_url. Every prompt is scanned before it reaches the upstream
provider and every response is scanned before it returns to the caller,
reusing the same scanner, thresholds, settings, and event store as the
existing /scan/input and /scan/output endpoints.
Supported providers:
- Native routes: OpenAI, Anthropic (Claude), Google Gemini, Google Vertex AI, AWS Bedrock.
- OpenAI-compatible: Groq, DeepSeek, Perplexity, Mistral AI, OpenRouter, Cerebras.
- Self-hosted via OpenAI-compatible: Ollama, llama.cpp, vLLM, on-prem TGI, or anything else exposing an OpenAI-shaped API.
1. Why a Reverse-Proxy?
The pre-existing POST /api/runtime-security/scan/{input,output}
endpoints require clients to wire scanning explicitly at two points in
their code. That’s fine for purpose-built middleware, but awkward for:
- Applications that already go through an OpenAI or Anthropic SDK and want zero code changes beyond base_url.
- Third-party tools (LangChain, LlamaIndex, raw curl, Cursor, etc.) where the wire protocol is the only contract Antidote can rely on.
- Teams that want a hard guarantee that every call went through the firewall, enforced at the network boundary instead of relying on each developer to remember to call the scan endpoints.
2. Endpoints
All routes live under the existing runtime-security feature prefix
and are mounted in backend/app/api/router.py alongside the scan API.
| Method | Path | Upstream |
|---|---|---|
| POST | /api/runtime-security/proxy/openai/v1/chat/completions | https://api.openai.com/v1/chat/completions |
| POST | /api/runtime-security/proxy/openai/v1/completions | https://api.openai.com/v1/completions |
| POST | /api/runtime-security/proxy/anthropic/v1/messages | https://api.anthropic.com/v1/messages |
| POST | /api/runtime-security/proxy/openai-compatible/v1/chat/completions | Picked by X-Antidote-Upstream-Provider or App routing config (see §6.4) |
| POST | /api/runtime-security/proxy/openai-compatible/v1/completions | Same as above |
| POST | /api/runtime-security/proxy/gemini/v1beta/models/{model}:generateContent | https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent |
| POST | /api/runtime-security/proxy/gemini/v1beta/models/{model}:streamGenerateContent | …:streamGenerateContent?alt=sse |
| POST | /api/runtime-security/proxy/vertex/v1/projects/{project}/locations/{loc}/publishers/google/models/{m}:generateContent | https://{loc}-aiplatform.googleapis.com/v1/projects/{project}/locations/{loc}/publishers/google/models/{m}:generateContent |
| POST | /api/runtime-security/proxy/vertex/v1/.../models/{m}:streamGenerateContent | Same host, :streamGenerateContent?alt=sse |
| POST | /api/runtime-security/proxy/bedrock/model/{modelId}/converse | https://bedrock-runtime.{region}.amazonaws.com/model/{modelId}/converse (SigV4 signed server-side) |
Implementation files:
- backend/app/api/runtime_security_proxy.py: FastAPI routes, auth, header filtering, error shaping.
- backend/app/runtime_security/proxy.py: pure payload-walk helpers (scan_openai_chat_request, scan_anthropic_response, …) and forward_upstream().
- backend/tests/test_runtime_security_proxy.py: unit tests for the payload-walk helpers against a FakeScanner.
3. Base URL Swap in Client SDKs
The user-facing change is a single setting.
OpenAI Python SDK
OpenAI TypeScript SDK
Anthropic Python SDK
Why two different antidote-auth headers? OpenAI puts its upstream key in Authorization, Anthropic puts its upstream key in x-api-key. To avoid colliding with either, OpenAI callers send the antidote credential as X-API-Key (free for OpenAI) and Anthropic callers send it as X-Antidote-Key (free for Anthropic). Authorization is never consumed by the proxy routes; it is forwarded verbatim.
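As a rough illustration of the header split described above, the sketch below builds the headers each kind of caller would send. The helper name build_proxy_headers and the placeholder key values are hypothetical; only the header names come from this document.

```python
def build_proxy_headers(provider: str, antidote_key: str,
                        upstream_key: str, app_id: str) -> dict:
    """Return the headers a client would attach when calling the proxy.

    Hypothetical helper for illustration only; it is not part of the
    Antidote codebase.
    """
    headers = {"X-Antidote-App-Id": app_id}
    if provider == "openai":
        # OpenAI's upstream key travels in Authorization, so X-API-Key
        # is free to carry the Antidote credential.
        headers["Authorization"] = f"Bearer {upstream_key}"
        headers["X-API-Key"] = antidote_key
    elif provider == "anthropic":
        # Anthropic's upstream key occupies x-api-key, so the Antidote
        # credential moves to X-Antidote-Key.
        headers["x-api-key"] = upstream_key
        headers["X-Antidote-Key"] = antidote_key
    else:
        raise ValueError(f"unsupported provider: {provider}")
    return headers
```

With the official SDKs, these headers would typically be supplied through the client's default-headers option alongside base_url.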
OpenAI-compatible providers (Groq, DeepSeek, Perplexity, Mistral, OpenRouter, Cerebras)
The same OpenAI SDK works against the /openai-compatible/v1 route;
pick the upstream with X-Antidote-Upstream-Provider.
Accepted provider values (KNOWN_OPENAI_COMPAT_PROVIDERS):
groq, deepseek, perplexity, mistral, openrouter, cerebras.
Deployments can subset this list with
RUNTIME_SECURITY_OPENAI_COMPAT_ALLOWLIST=groq,mistral.
Apps with routing.provider (or routing.upstream_base_url) set bind
the upstream by default, so the header becomes optional for traffic
attributed to that App.
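The allowlist and header rules above could be sketched roughly as follows. The function names are hypothetical, and letting the header override the App's routing.provider is an assumption; the document only states that the header becomes optional when the App binds an upstream.

```python
from typing import Mapping, Optional, Set

KNOWN_OPENAI_COMPAT_PROVIDERS = (
    "groq", "deepseek", "perplexity", "mistral", "openrouter", "cerebras",
)


def allowed_providers(env: Mapping[str, str]) -> Set[str]:
    """Parse the allowlist env var; an empty value means the full catalogue."""
    raw = env.get("RUNTIME_SECURITY_OPENAI_COMPAT_ALLOWLIST", "")
    requested = {item.strip().lower() for item in raw.split(",") if item.strip()}
    if not requested:
        return set(KNOWN_OPENAI_COMPAT_PROVIDERS)
    # Unknown names are ignored rather than trusted.
    return requested & set(KNOWN_OPENAI_COMPAT_PROVIDERS)


def pick_provider(header_value: Optional[str],
                  app_routing_provider: Optional[str],
                  env: Mapping[str, str]) -> str:
    """Choose the upstream: header first (assumption), then the App's routing."""
    provider = (header_value or app_routing_provider or "").strip().lower()
    if not provider:
        raise ValueError("no upstream provider supplied")
    if provider not in allowed_providers(env):
        raise PermissionError(f"provider not allowed on this deployment: {provider}")
    return provider
```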
Self-hosted Ollama, llama.cpp, vLLM, TGI
Self-hosted runtimes that speak OpenAI-shape go through the same route using X-Antidote-Upstream-Base. The deployment must allowlist the
base URL via RUNTIME_SECURITY_OPENAI_COMPAT_EXTRA_BASES:
Ollama’s OpenAI-compatible endpoint is http://<host>:11434/v1; llama.cpp’s server exposes http://<host>:8080/v1 by default.
Google Gemini (Google AI Studio)
requests:
Google Vertex AI
The caller provides an OAuth Authorization: Bearer <gcloud-access-token>
header just like the native Vertex endpoint. Path encodes the project,
location, publisher, and model:
The proxy reads the {location} segment from the path and routes to
https://{location}-aiplatform.googleapis.com.
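A minimal sketch of that routing rule, assuming a hypothetical helper named vertex_upstream_url. The character check on the location segment is an added precaution so that a crafted path cannot point the proxy at an arbitrary host; the document does not say whether the real route validates it.

```python
import re


def vertex_upstream_url(project: str, location: str, model: str,
                        action: str = "generateContent") -> str:
    """Build the Vertex AI upstream URL from the proxy path segments."""
    # Assumption: only lowercase letters, digits, and hyphens are accepted.
    if not re.fullmatch(r"[a-z0-9-]+", location):
        raise ValueError(f"invalid location: {location!r}")
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{location}/publishers/google/models/{model}:{action}"
    )
    if action == "streamGenerateContent":
        url += "?alt=sse"  # streaming responses are requested as SSE
    return url
```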
AWS Bedrock (Converse API)
Bedrock requires AWS SigV4 signing of the body, which means the proxy must sign on behalf of the caller (signing before scanning would break the signature whenever a redact rewrites the body). AWS credentials
are passed via dedicated headers and the proxy SigV4-signs each
upstream call server-side:
When the deployment has its own AWS identity configured (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION), the
X-Antidote-AWS-* headers become optional and the server-side identity
is used instead. The headers always take precedence so callers can use
their own AWS account.
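The precedence rule above might look roughly like this. The function name and return shape are hypothetical; the header names and the missing_credentials / missing_region reasons are taken from the Bedrock error table in §7. Treating the key pair as a unit (never mixing a header key with an environment secret) is an assumption.

```python
from typing import Dict, Mapping


def resolve_aws_credentials(headers: Mapping[str, str],
                            env: Mapping[str, str]) -> Dict[str, str]:
    """Pick caller-supplied AWS credentials first, then the server identity."""
    lowered = {name.lower(): value for name, value in headers.items()}
    header_pair = (
        lowered.get("x-antidote-aws-access-key-id"),
        lowered.get("x-antidote-aws-secret-access-key"),
    )
    env_pair = (env.get("AWS_ACCESS_KEY_ID"), env.get("AWS_SECRET_ACCESS_KEY"))
    if all(header_pair):
        access_key, secret_key = header_pair  # headers take precedence
    elif all(env_pair):
        access_key, secret_key = env_pair     # server-side fallback
    else:
        raise LookupError("missing_credentials")
    region = lowered.get("x-antidote-aws-region") or env.get("AWS_REGION")
    if not region:
        raise LookupError("missing_region")
    return {"access_key": access_key, "secret_key": secret_key, "region": region}
```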
Streaming. /bedrock/model/{m}/converse-stream is not yet implemented; Bedrock streams use the AWS event-stream binary framing, which requires a custom parser. Use the non-streaming /converse route for now.
4. Authentication Model
Defined by proxy_get_current_user() in runtime_security_proxy.py.
Unlike the normal get_current_user dependency, it must leave
Authorization untouched so the upstream auth header passes through.
Priority order:
1. X-API-Key: ak_..., the normal antidote API key header. OpenAI clients should use this.
2. X-Antidote-Key: ak_... | Bearer <jwt>, fallback for clients that already use x-api-key for the upstream provider (i.e. Anthropic).
3. antidote_session cookie, lets the in-app dashboard exercise the proxy without provisioning an API key.
The resolved principal is checked for the runtime_security.scan
permission the same way /scan/input and /scan/output do. API keys
used with the proxy need a scope that maps to runtime_security.scan
(typically *) and a role with runtime_security.scan capability.
Missing credentials return HTTP 401 with a message instructing the
client to set X-API-Key or X-Antidote-Key. Missing permission
returns HTTP 403.
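The priority order could be sketched as below. This is not proxy_get_current_user() itself: the function name is hypothetical, and using the ak_ prefix to tell an Antidote key apart from an Anthropic upstream key in x-api-key is an assumption. Note that Authorization is never read.

```python
from typing import Mapping, Optional, Tuple


def pick_antidote_credential(headers: Mapping[str, str],
                             cookies: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (kind, value) for the Antidote credential, or None if absent."""
    lowered = {name.lower(): value for name, value in headers.items()}

    # 1. X-API-Key, but only when it looks like an Antidote key
    #    (assumption); Anthropic callers put their upstream key here.
    api_key = lowered.get("x-api-key", "")
    if api_key.startswith("ak_"):
        return ("api_key", api_key)

    # 2. X-Antidote-Key carries either an API key or a bearer JWT.
    antidote_key = lowered.get("x-antidote-key", "")
    if antidote_key.startswith("Bearer "):
        return ("jwt", antidote_key[len("Bearer "):])
    if antidote_key:
        return ("api_key", antidote_key)

    # 3. Dashboard session cookie.
    session = cookies.get("antidote_session")
    if session:
        return ("session", session)

    return None  # the route would answer HTTP 401 here
```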
App attribution
Every proxy call must also carry X-Antidote-App-Id (the UUID of an
App in the workspace). The proxy resolves the App via
runtime_security.app_resolver.resolve_runtime_app, which:
- Returns HTTP 400 APP_ID_REQUIRED / APP_NOT_FOUND for a missing or unknown App.
- Returns HTTP 423 APP_DISABLED when app.status == "disabled", HTTP 410 APP_ARCHIVED when archived or soft-deleted.
- Verifies X-Antidote-App-Token against the App’s signed token table when require_signed_token=true (APP_TOKEN_REQUIRED / APP_TOKEN_INVALID).
- Enforces max_events_per_hour / max_events_per_day and returns HTTP 429 APP_QUOTA_EXCEEDED with Retry-After when hit.
- Loads the App’s current_config_version and stores the ResolvedAppContext on request.state.runtime_security_app. The scanner picks up thresholds, detectors, custom phrases, custom PII rules, tool policy, and routing.forbidden_providers from this context, so two Apps in the same workspace can run with very different security postures against the same shared firewall.
config_version_id is stamped onto every
persisted event so the dashboard can correlate sudden verdict-mix
changes with config edits.
Strict attribution mode. Setting
ANTIDOTE_REQUIRE_APP_ATTRIBUTION=1 makes the scan endpoints
reject calls without an App-Id too. The proxy already requires it
regardless of this flag.
5. Request & Response Flow
For every proxy call the route performs:
1. Authenticate (see §4) and load the current RuntimeSecurityConfig via _load_settings(db). If cfg.enabled == False, return HTTP 503. The proxy refuses to pass traffic through when the firewall is disabled, so operators get a loud signal instead of silent bypass.
2. Parse the JSON body and reject stream=true with HTTP 400 on routes that do not support streaming. See §9 for rationale.
3. Input scan. Walk the payload’s user-supplied text fields and call RuntimeSecurityScanner.scan(text, direction="input") on each one. On the first block verdict the route short-circuits with a provider-shaped error (§7) and persists a block event. On a redact verdict the helper rewrites the matching field in place so the forwarded request carries the sanitised copy.
4. Log the input verdict (cfg.log_events) via the same _persist_event() used by /scan/input, stamped with source_app="proxy:openai" / source_app="proxy:anthropic" so the dashboard can segment proxy traffic from direct scan traffic.
5. Forward upstream via httpx.AsyncClient.post(). Headers are filtered by _filter_inbound_headers(): hop-by-hop and antidote auth headers are stripped, and provider-specific headers are kept (Authorization for OpenAI, x-api-key / anthropic-version for Anthropic).
6. Pass through upstream errors untouched. If the upstream returns a non-2xx, the route surfaces the upstream body and status code verbatim so SDKs can parse the provider’s native error shape. We do not rewrite these.
7. Output scan. Walk the response body (choices[].message for OpenAI, content[] for Anthropic) and scan each text block with direction="output". Same block / redact semantics as the request phase.
8. Log the output verdict and return the (possibly redacted) body to the caller with the upstream status code.
Input scans run with direction="input", which activates the
prompt-injection classifier. Output scans run with
direction="output", which skips prompt-injection detection (the
model’s own reply is not a prompt) and only runs PII/secret leakage
checks.
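The header filtering in the forwarding step can be pictured with the sketch below. It is not the real _filter_inbound_headers(); the exact set of stripped headers is an assumption (a commonly cited hop-by-hop list, plus Host, Content-Length, and Cookie so the dashboard session never reaches the upstream).

```python
from typing import Dict, Mapping

HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})
# Assumption: these are recomputed by the HTTP client or must not leak upstream.
DROPPED_ALWAYS = frozenset({"host", "content-length", "cookie"})


def filter_inbound_headers(headers: Mapping[str, str], provider: str) -> Dict[str, str]:
    """Keep only the headers that should reach the upstream provider."""
    forwarded = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in HOP_BY_HOP or lowered in DROPPED_ALWAYS:
            continue
        if lowered.startswith("x-antidote-"):
            continue  # Antidote credentials and routing hints stay local
        if lowered == "x-api-key" and provider != "anthropic":
            continue  # Antidote key for OpenAI callers; upstream key for Anthropic
        forwarded[name] = value
    return forwarded
```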
6. Supported Payload Shapes
OpenAI Chat Completions (/v1/chat/completions)
scan_openai_chat_request walks body["messages"]. For each message:
- Role filter. Only user, system, and developer messages are scanned on input. Assistant turns are the model’s own previous output; they get scanned when they come back through direction="output", never here. This avoids false-positive input blocks when a suspicious string appears in a historical assistant turn.
- Content shapes. Both the legacy string form (content: "...") and the current array-of-parts form (content: [{type: "text", text: "..."}]) are supported. Non-text parts (image_url, audio, tool_use, …) pass through untouched.
- Redaction. The first matching text part is rewritten in place; other parts (images, tool calls) are preserved. For string content the entire field is replaced with the scanner’s redacted text.
scan_openai_chat_response walks body["choices"][].message with the
same content-shape support.
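A simplified sketch of the request walk described above, not the actual scan_openai_chat_request. The ScanResult type and the scan callable are stand-ins for the real scanner, and this version scans every text part separately, which may differ from the helper's "first matching text part" behaviour.

```python
from typing import Callable, NamedTuple, Optional


class ScanResult(NamedTuple):
    verdict: str  # "allow", "redact", or "block"
    text: str     # original text, or the redacted copy on "redact"


class ProxyBlocked(Exception):
    def __init__(self, location: str) -> None:
        super().__init__(f"blocked at {location}")
        self.location = location


SCANNED_ROLES = {"user", "system", "developer"}


def _check(text: str, location: str,
           scan: Callable[[str], ScanResult]) -> Optional[str]:
    """Scan one field; return replacement text, or None if unchanged."""
    result = scan(text)
    if result.verdict == "block":
        raise ProxyBlocked(location)
    return result.text if result.verdict == "redact" else None


def scan_chat_messages(body: dict, scan: Callable[[str], ScanResult]) -> str:
    """Walk body["messages"], redact in place, return the folded verdict."""
    verdict = "allow"
    for index, message in enumerate(body.get("messages", [])):
        role = message.get("role")
        if role not in SCANNED_ROLES:
            continue  # assistant/tool turns are not scanned on input
        location = f"messages[{index}].{role}"
        content = message.get("content")
        if isinstance(content, str):
            replacement = _check(content, location, scan)
            if replacement is not None:
                message["content"] = replacement
                verdict = "redact"
        elif isinstance(content, list):
            for part in content:
                if not (isinstance(part, dict) and part.get("type") == "text"):
                    continue  # images, audio, tool parts pass through
                replacement = _check(part.get("text", ""), location, scan)
                if replacement is not None:
                    part["text"] = replacement
                    verdict = "redact"
    return verdict
```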
OpenAI Legacy Completions (/v1/completions)
scan_openai_completion_request accepts either string or list prompts.
List prompts are joined with \n for scanning; on redact the body is
rewritten to a single-prompt form with the scanner’s redacted text.
scan_openai_completion_response walks body["choices"][].text.
Anthropic Messages (/v1/messages)
scan_anthropic_request covers:
- body["system"], string form and list-of-text-blocks form.
- body["messages"][] where role == "user", with both string and array content.
- Assistant turns are skipped for the same reason as OpenAI.
scan_anthropic_response walks body["content"][] and scans each
type == "text" block. Non-text blocks (tool_use, thinking,
image) pass through untouched.
Google Gemini / Vertex AI (:generateContent and :streamGenerateContent)
Gemini and Vertex share the same body shape (generateContent payload),
so a single set of helpers, scan_gemini_request and
scan_gemini_response, covers both routes.
- body["systemInstruction"]["parts"][].text is scanned and redacted in place (string and object form both supported).
- body["contents"][] is filtered to user turns. The role can be absent or "user"; "model" turns are skipped (they’re prior model output and will be scanned when they come back as a response).
- parts[] with text are scanned; non-text parts (inlineData, fileData, functionCall) pass through untouched.
- Streaming uses SSE with the alt=sse query string. The proxy parses candidates[].content.parts[].text deltas, scans the accumulating buffer with the same windowed cadence as OpenAI / Anthropic, and emits a finishReason: SAFETY terminal event when the scanner blocks.
scan_gemini_response walks candidates[].content.parts[].text.
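The windowed streaming scan mentioned above is not specified in detail here, so the following is only one plausible reading: accumulate text deltas, rescan the whole buffer each time a fixed number of new characters has arrived, and stop with a SAFETY terminal event on a block. The function name, the window size, and the event dictionaries are all assumptions.

```python
from typing import Callable, Iterable, Iterator


def scan_stream(deltas: Iterable[str],
                scan: Callable[[str], str],
                window: int = 200) -> Iterator[dict]:
    """Forward text deltas, scanning the accumulated buffer every `window` chars.

    `scan` returns "allow" or "block" for the text seen so far.
    """
    buffer = ""
    scanned_upto = 0
    for delta in deltas:
        buffer += delta
        if len(buffer) - scanned_upto >= window:
            if scan(buffer) == "block":
                # Terminal event mirroring Gemini's safety-stop shape.
                yield {"candidates": [{"finishReason": "SAFETY"}]}
                return
            scanned_upto = len(buffer)
        yield {"candidates": [{"content": {"parts": [{"text": delta}]}}]}
    # Final pass over any tail that never filled a whole window.
    if scanned_upto < len(buffer) and scan(buffer) == "block":
        yield {"candidates": [{"finishReason": "SAFETY"}]}
```

A smaller window catches violations sooner at the cost of more scanner calls; text already forwarded before a block cannot be recalled, which is inherent to scanning a live stream.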
AWS Bedrock Converse (/model/{modelId}/converse)
scan_bedrock_converse_request walks:
- body["system"][].text, list-of-text-blocks form (the only form Bedrock Converse accepts).
- body["messages"][] where role == "user", scanning content[].text blocks. Non-text content (image, document, toolUse, toolResult) passes through untouched. Assistant turns are skipped.
scan_bedrock_converse_response walks
body["output"]["message"]["content"][].text with the same
text-block-only redaction semantics. Streaming (/converse-stream) is
not yet implemented, see §3 caveat.
Verdict Folding
Scanning multiple fields produces multiple RuntimeScanResult objects;
the route folds them to the worst verdict seen (allow < redact < block)
so a single request writes one input event with the strongest verdict.
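The folding rule is small enough to show in full. This is a sketch operating on plain verdict strings rather than RuntimeScanResult objects, and the function name is hypothetical.

```python
from typing import Iterable

# Ordering taken from the text: allow < redact < block.
SEVERITY = {"allow": 0, "redact": 1, "block": 2}


def fold_verdicts(verdicts: Iterable[str]) -> str:
    """Return the worst verdict seen; no scanned fields folds to "allow"."""
    return max(verdicts, key=SEVERITY.__getitem__, default="allow")
```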
7. Error Surface
Errors are shaped to match each provider so existing SDK error handlers parse them without modification.
OpenAI errors
| Code | When | code field |
|---|---|---|
| 400 | Input blocked (prompt injection / PII policy hit) | antidote_blocked |
| 400 | stream=true or invalid JSON | n/a |
| 401 | Missing antidote credential | n/a |
| 403 | Missing runtime_security.scan permission | n/a |
| 502 | Upstream network error (timeout, DNS, 5xx from OpenAI) | upstream_error |
| 502 | Output blocked (PII/secret leak in upstream response) | upstream_blocked |
| 503 | Runtime Security firewall is disabled | n/a |
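For orientation, an OpenAI-shaped error body for the rows above might be assembled like this. The envelope keys (message, type, param, code) follow OpenAI's public error object; the helper name, the message wording, and the default type value are assumptions, since the proxy's exact strings are not shown in this document.

```python
import json
from typing import Optional, Tuple


def openai_style_error(status: int, message: str,
                       code: Optional[str] = None,
                       error_type: str = "invalid_request_error") -> Tuple[int, dict]:
    """Return (status, body) using the OpenAI error envelope."""
    body = {
        "error": {
            "message": message,
            "type": error_type,  # assumption: real value may differ per row
            "param": None,
            "code": code,        # e.g. "antidote_blocked" from the table
        }
    }
    return status, body


status, body = openai_style_error(
    400, "Request blocked by Runtime Security", code="antidote_blocked"
)
print(status, json.dumps(body))
```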
Anthropic errors
Input blocks use invalid_request_error; output blocks and upstream
failures use api_error.
Google Gemini / Vertex errors
The code field carries the HTTP status; status is the
machine-readable reason (antidote_blocked, upstream_error,
upstream_blocked, unsupported_endpoint).
Streaming blocks emit a terminal SSE event with a
candidates[].finishReason: "SAFETY" payload, mirroring Google’s own
safety-stop shape.
AWS Bedrock errors
| Code | When | type field |
|---|---|---|
| 401 | Missing X-Antidote-AWS-Access-Key-Id / -Secret-Access-Key | missing_credentials |
| 400 | Missing X-Antidote-AWS-Region (and no AWS_REGION env var) | missing_region |
| 500 | SigV4 signing failed (malformed creds, botocore error) | sign_error |
Upstream error passthrough
If the upstream provider returns a non-2xx (e.g. OpenAI 429 rate-limit, Anthropic 400 invalid model), the proxy does not rewrite the body. The upstream JSON and status code are forwarded verbatim so SDKs can surface the provider’s native error. Only blocks originated by Antidote adopt the antidote-shaped error envelope.
8. Configuration
Defaults live in runtime_security/proxy.py; all are overridable via
environment variables.
| Env var | Default | Meaning |
|---|---|---|
| RUNTIME_SECURITY_OPENAI_BASE_URL | https://api.openai.com | Upstream OpenAI (or Azure OpenAI / compatible) origin. |
| RUNTIME_SECURITY_ANTHROPIC_BASE_URL | https://api.anthropic.com | Upstream Anthropic origin. |
| RUNTIME_SECURITY_GEMINI_BASE_URL | https://generativelanguage.googleapis.com | Upstream Google AI Studio origin. (Vertex picks its origin from the request path’s {location} segment.) |
| RUNTIME_SECURITY_OPENAI_COMPAT_ALLOWLIST | (empty → full catalogue) | Comma-separated subset of groq,deepseek,perplexity,mistral,openrouter,cerebras permitted on this deployment. |
| RUNTIME_SECURITY_OPENAI_COMPAT_EXTRA_BASES | (empty) | Comma-separated http(s):// bases that callers may target via X-Antidote-Upstream-Base, used for Ollama, llama.cpp, vLLM, on-prem TGI. |
| AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN / AWS_REGION | (empty) | Fallback AWS credentials used by the Bedrock route when the caller doesn’t send X-Antidote-AWS-* headers. |
| RUNTIME_SECURITY_PROXY_TIMEOUT | 60 (seconds) | Upstream request timeout for httpx.AsyncClient. |
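A minimal sketch of reading the origin and timeout settings with the defaults listed in the table. The function name and the returned dictionary keys are hypothetical; the real loader in runtime_security/proxy.py may be structured differently.

```python
from typing import Dict, Mapping, Union

DEFAULTS = {
    "RUNTIME_SECURITY_OPENAI_BASE_URL": "https://api.openai.com",
    "RUNTIME_SECURITY_ANTHROPIC_BASE_URL": "https://api.anthropic.com",
    "RUNTIME_SECURITY_GEMINI_BASE_URL": "https://generativelanguage.googleapis.com",
    "RUNTIME_SECURITY_PROXY_TIMEOUT": "60",
}


def load_proxy_settings(env: Mapping[str, str]) -> Dict[str, Union[str, float]]:
    """Merge environment overrides over the documented defaults."""
    # An unset or empty variable falls back to the default.
    merged = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    return {
        "openai_base_url": merged["RUNTIME_SECURITY_OPENAI_BASE_URL"].rstrip("/"),
        "anthropic_base_url": merged["RUNTIME_SECURITY_ANTHROPIC_BASE_URL"].rstrip("/"),
        "gemini_base_url": merged["RUNTIME_SECURITY_GEMINI_BASE_URL"].rstrip("/"),
        "timeout_seconds": float(merged["RUNTIME_SECURITY_PROXY_TIMEOUT"]),
    }
```

In a running service this would be called as load_proxy_settings(os.environ).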
Global scanner settings (injection_model, max_text_length,
log_events, enabled, pre-prompt) are exposed by
PUT /api/runtime-security/config. The proxy pulls them via
_load_settings(db) on every request, so a settings change takes
effect immediately.
Per-App knobs (thresholds, detectors, custom phrases, custom PII
rules, tool policy, routing.forbidden_providers) are read from the
App’s current config version when the request is resolved against an
App-Id (see §4). Editing an App’s config publishes a new version that
takes effect on the next request, no proxy restart needed.
See Configuration for the full
schema split.
9. MVP Limitations
These are intentional trade-offs in the first cut, tracked for a follow-up:
- Streaming coverage is partial. OpenAI (stream=true), Anthropic (stream=true), all OpenAI-compatible providers, Google Gemini (:streamGenerateContent), and Vertex AI (:streamGenerateContent) all stream through the proxy with windowed scanning and a provider-shaped block chunk on policy hit. AWS Bedrock streaming (/converse-stream) is not yet implemented; Bedrock uses the AWS event-stream binary framing, which needs a custom parser. The legacy OpenAI /v1/completions route also rejects streaming.
- Non-text modalities pass through unscanned. Image inputs, audio, tool-call arguments, and function-call results are not scanned for PII or injection. Text-only fields of multimodal messages are still scanned. Vision/audio scanning will ship as a separate firewall feature.
- No per-route rate limiting beyond the existing scan endpoints. The proxy uses the same runtime_security.scan permission and the shared rate limits that apply to the rest of the Runtime Security surface. Per-tenant upstream quota enforcement is out of scope here; the upstream provider is the source of truth for that.
- Tool/function-call arguments are not scanned or redacted. This matches the scan endpoints’ current behaviour. A malicious prompt injection that routes through tool args would be caught at the surface prose, not the structured arg values.
10. Observability
Every proxy call writes one input event and (when not blocked on input) one output event through the existing _persist_event() helper. Events include:
- direction: "input" or "output"
- verdict: "allow" / "redact" / "block"
- provider: "openai" or "anthropic"
- source_app: "proxy:openai" or "proxy:anthropic"; use this to segment proxy traffic from direct scan-endpoint traffic in the dashboard filters.
- model: the model name from the request body (input phase) or the upstream response (output phase).
- injection_score / injection_label, pii_count, pii_categories, text_length, blocked_reason as usual.
- metadata_: {phase, location}; location points at the blocked field (e.g. "messages[2].user", "choices[0].message", "system") to speed up root-cause analysis.
Blocks write a runtime_security.block audit record via
create_audit_record() only from the /scan/* endpoints today; the
proxy emits only RuntimeSecurityEvent rows in the MVP. Extending the
block path to write audit records is a small follow-up and will
harmonise the two entry points.
Analytics, event listing, health, and config endpoints
(/api/runtime-security/analytics, /events, /health, /config)
cover proxy traffic automatically because they query
RuntimeSecurityEvent without filtering on source_app.
11. Testing
backend/tests/test_runtime_security_proxy.py exercises the
payload-walk helpers against a FakeScanner that returns canned
verdicts keyed on substring matches. This keeps the tests fast
(no HF weights, no DB, no network) and focused on the transformation
logic that makes the base_url swap correct.
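A test double of the kind described above might look like the following. This is a hypothetical reconstruction, not the FakeScanner from the test suite; only the scan(text, direction=...) call shape is taken from this document.

```python
from typing import NamedTuple, Sequence


class FakeResult(NamedTuple):
    verdict: str        # "allow", "redact", or "block"
    redacted_text: str  # text to forward (unchanged unless redacted)


class FakeScanner:
    """Returns canned verdicts keyed on substring matches."""

    def __init__(self, block_on: Sequence[str] = (),
                 redact_on: Sequence[str] = ()) -> None:
        self.block_on = tuple(block_on)
        self.redact_on = tuple(redact_on)
        self.calls = []  # (text, direction) pairs, handy for assertions

    def scan(self, text: str, direction: str = "input") -> FakeResult:
        self.calls.append((text, direction))
        if any(needle in text for needle in self.block_on):
            return FakeResult("block", text)
        redacted = text
        for needle in self.redact_on:
            redacted = redacted.replace(needle, "[REDACTED]")
        if redacted != text:
            return FakeResult("redact", redacted)
        return FakeResult("allow", text)
```

Because it needs no model weights or network, a double like this lets each payload-walk test state its expected verdict directly in the fixture.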
Covered cases:
- Clean pass-through for all three payload shapes.
- In-place redaction for string content and array-of-parts content.
- Block short-circuit raises ProxyBlocked with the expected location string.
- Assistant turns are excluded from input scans (both OpenAI and Anthropic).
- Array-of-parts content preserves non-text parts (images, tool_use).
- Verdict folding returns the worst verdict across multiple fields.
- Anthropic system string and list-of-text-blocks shapes.
- Legacy OpenAI completions string-prompt and list-prompt handling.
Route-level tests using FastAPI’s TestClient with an in-process
httpx mock for the upstream are a reasonable follow-up but are not
required for the MVP since the route body is almost entirely glue
around the tested helpers.
12. Dedicated Runtime Security Container
For production deployments where you want the firewall to scale independently of the main Antidote API (and fail independently of the worker stack), a dedicated container ships alongside the main image.
Image layout
| File | Role |
|---|---|
| backend/Dockerfile.runtime-security | Slim image built on python:3.11-slim, CPU-only torch, transformers. |
| backend/requirements.runtime-security.txt | Minimal dep set, no celery, ultralytics, torchvision, pandas, sklearn, scikit-image. |
| backend/app/runtime_security_main.py | Dedicated ASGI app that mounts only runtime_security_router and runtime_security_proxy_router. |
The App management routes (/api/runtime-security/apps/...) are also excluded; Apps are
created and configured against the main Antidote API, then their
UUIDs and tokens are referenced by the dedicated container at
runtime via X-Antidote-App-Id / X-Antidote-App-Token. Both
containers share the same database, so Apps published from the main
API are visible to the firewall instantly.
The only endpoints it serves are:
- POST /api/runtime-security/scan/input
- POST /api/runtime-security/scan/output
- GET /api/runtime-security/analytics
- GET /api/runtime-security/events
- GET /api/runtime-security/config
- PUT /api/runtime-security/config
- GET /api/runtime-security/health
- POST /api/runtime-security/proxy/openai/v1/chat/completions
- POST /api/runtime-security/proxy/openai/v1/completions
- POST /api/runtime-security/proxy/anthropic/v1/messages
- POST /api/runtime-security/proxy/openai-compatible/v1/chat/completions
- POST /api/runtime-security/proxy/openai-compatible/v1/completions
- POST /api/runtime-security/proxy/gemini/v1beta/models/{model}:{action}
- POST /api/runtime-security/proxy/vertex/v1/projects/.../models/{model}:{action}
- POST /api/runtime-security/proxy/bedrock/model/{modelId}/converse
- GET /healthz: container liveness
- GET /readyz: container readiness (returns ready once the injection model is warm)
Why a separate container?
- Blast-radius isolation. A bug in the dataset or healing stack can crash or OOM the main API without ever touching the firewall. If the proxy is what your production LLM traffic depends on, it shouldn’t share a process with batch scan workers.
- Right-sized scaling. The proxy runs on the critical path of every LLM call. The main API does not. You want to scale them on different signals (QPS for the proxy; dataset-count and worker-queue depth for the main API).
- Smaller attack surface. The firewall image has no upload endpoints, no admin surface, no healing routes, nothing to exploit beyond the two wire-protocol shapes. Pen-testing the firewall becomes a much smaller scope.
- Faster image rebuilds and pulls. The runtime-security image does not pull ultralytics/torchvision/pandas/scikit-image. Cold pulls are minutes faster in CI and autoscalers.
Shared database
Both containers talk to the same PostgreSQL instance. Scan events from the dedicated proxy container land in the same runtime_security_events table that the Antidote dashboard reads, so
events show up in the UI automatically without any cross-service
plumbing. The main API container continues to own migrations
(alembic upgrade head); the runtime-security container never runs
migrations on its own.
Docker Compose service
The runtime_security service is defined in docker-compose.yml:
Environment variables
All the knobs from §8 apply to the dedicated container. Additional container-only env vars:
| Variable | Default | Meaning |
|---|---|---|
| RUNTIME_SECURITY_WORKERS | 2 | uvicorn worker count inside the container. |
| RUNTIME_SECURITY_WARMUP_ON_START | true | Pre-load the injection classifier on startup so the first call is fast. |
| RUNTIME_SECURITY_CORS_ORIGINS | (empty) | Comma-separated list; when non-empty, CORS is enabled for those origins. |
| RUNTIME_SECURITY_DISABLE_INJECTION_MODEL | (unset) | Set to 1 to skip loading the ML classifier (phrase heuristics only). |
| HF_HOME / TRANSFORMERS_CACHE | /models | Hugging Face cache location, mapped to the runtime_security_models volume. |
Model prefetch at build time
The Dockerfile prefetches the default injection model during the build (ARG PREFETCH_MODEL=1) so the first request doesn’t wait on a
HuggingFace Hub download. For air-gapped builds set
--build-arg PREFETCH_MODEL=0 and either:
- Mount a pre-populated /models volume at runtime, or
- Ship a custom model via RUNTIME_SECURITY_INJECTION_MODEL pointing at a local path, or
- Set RUNTIME_SECURITY_DISABLE_INJECTION_MODEL=1 to run in phrase-heuristic-only mode.
Health and readiness
- GET /healthz always returns 200 {"status": "ok"}. Use this for liveness probes.
- GET /readyz calls warmup() on the injection model and returns ready if the classifier loaded or degraded if the container is running on heuristics only. Use this for readiness probes so traffic doesn’t arrive before the classifier is loaded.
- Kubernetes / ECS task definitions should set readinessProbe.httpGet.path = /readyz and livenessProbe.httpGet.path = /healthz.
Scaling guidance
- The proxy is I/O-bound on upstream requests and CPU-bound on scanner inference. Two workers per container is a reasonable starting point; scale horizontally before scaling --workers past ~4 because the HF pipeline holds its own thread-pool internally.
- Model load is the slow step. Keep the container warm; aggressive scale-to-zero forces a fresh model load on every wake.
- Pin CPU limits high enough that DeBERTa inference doesn’t starve on a shared host.
13. Deployment Notes
- No new database migration. The proxy reuses the existing runtime_security_events table.
- No new permission. Proxy routes enforce the existing runtime_security.scan permission already used by /scan/input.
- TLS termination. The proxy must sit behind TLS; any plaintext deployment would leak both the client’s upstream provider key and the tenant’s prompt data. The existing ALB/NGINX termination used by the rest of the API is sufficient.
- Egress. The backend process must be able to reach each upstream provider you want to enable: api.openai.com, api.anthropic.com, generativelanguage.googleapis.com, {region}-aiplatform.googleapis.com, bedrock-runtime.{region}.amazonaws.com, and any OpenAI-compatible hosts you allowlist. In air-gapped deployments, point the RUNTIME_SECURITY_*_BASE_URL env vars at your internal gateway and register self-hosted bases via RUNTIME_SECURITY_OPENAI_COMPAT_EXTRA_BASES.
- Timeouts. Default RUNTIME_SECURITY_PROXY_TIMEOUT=60 seconds. Raise it for long-context requests that your upstream honors (e.g. Claude 200k-context calls).

