Infrastructure Plan
Complete infrastructure specification for Settle — a guided estate administration platform. Covers all services from frontend delivery to physical mail, with real pricing, Terraform patterns, and DevOps runbooks sufficient to provision from zero in a single day.
Infrastructure Diagram
The architecture separates concerns across four zones: public edge delivery, API compute, background worker compute, and persistent data stores. All internal service-to-service traffic travels over Fly.io's private WireGuard network (6PN). No worker or data store is reachable from the public internet except through the API.
Hosting Architecture
Frontend — Netlify
SvelteKit is deployed to Netlify using the @sveltejs/adapter-netlify adapter. SSR pages render server-side via Netlify Edge Functions (Deno runtime), which requires setting edge: true in the adapter options; the adapter's default is Node-based Netlify Functions. Static assets are served from Netlify's global CDN. Build previews are automatically deployed for every pull request.
# netlify.toml (repo root)
[build]
command = "npm run build"
publish = ".svelte-kit/netlify"
[build.environment]
NODE_VERSION = "22"
[[redirects]]
from = "/api/*"
to = "https://settle-api.fly.dev/:splat"
status = 200
force = true
[context.production]
environment = { NODE_ENV = "production" }
[context.deploy-preview]
environment = { NODE_ENV = "preview" }
PUBLIC_API_URL=https://settle-api.fly.dev
PUBLIC_R2_PUBLIC_URL=https://docs.settle.app
SENTRY_DSN=https://...@sentry.io/...
PUBLIC_POSTHOG_KEY=phc_...
The environment variables above are set via the Netlify UI or netlify env:set and are never committed to the repo.
API — Fly.io (2 regions)
The Node.js API runs on Fly.io in two regions for redundancy and latency. iad (Ashburn VA) is primary, serving East Coast and international traffic. ord (Chicago) is secondary, providing failover and serving Midwest traffic with lower latency.
# Initial deploy
fly launch --name settle-api --region iad --image node:22-alpine
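# Add a machine in the secondary region (fly regions add applies to legacy V1 apps only)
fly scale count 1 --region ord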
# fly.toml for the API
app = "settle-api"
primary_region = "iad"
[build]
dockerfile = "Dockerfile"
[env]
PORT = "3000"
NODE_ENV = "production"
[http_service]
internal_port = 3000
force_https = true
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 1 # keep 1 warm in iad
[[vm]]
size = "shared-cpu-1x"
memory = "512mb"
# No [mounts] section: the API is stateless and needs no persistent disk
Scale targets
Workers — Fly.io Machines (scale-to-zero)
Each of the three workers is its own Fly.io app with a separate fly.toml. Workers run in the iad region only; they are not latency-sensitive, so multi-region is unnecessary. They scale to zero when idle and wake within ~500ms when the API enqueues a job via the Redis work queue. Fly's auto-start is triggered by traffic arriving through Fly Proxy, not by Redis activity, so after enqueueing the API must also wake the worker, for example with a request routed through Fly Proxy (such as a Flycast address) or a Machines API start call.
- settle-worker-notify: Sends Tier 1 API calls, generates and sends Tier 2 letters via Lob, produces Tier 3 call scripts. Woken per notification batch. Expected runtime: 30–120 sec/job.
- settle-worker-scanner: Queries NAUPA, NAIC, PBGC, VA.gov for unclaimed assets. Circuit breaker per external API. Results cached in Redis for 24h. Expected runtime: 2–10 min/estate.
- settle-worker-digest: Runs at 05:00 UTC daily via Fly scheduled machine. Computes each user's Daily Three tasks, writes digest rows to Postgres, queues delivery via Resend. Runtime: 5–15 min/run.
# Creating worker apps
fly apps create settle-worker-notify
fly apps create settle-worker-scanner
fly apps create settle-worker-digest
# Shared fly.toml pattern for workers
# (save as fly.worker-notify.toml etc.)
app = "settle-worker-notify"
primary_region = "iad"
[build]
dockerfile = "Dockerfile.worker"
[env]
WORKER_TYPE = "notify"
NODE_ENV = "production"
[http_service]
# No public HTTP — workers pull from Redis queue
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 0 # true scale-to-zero
[[vm]]
size = "shared-cpu-1x"
memory = "512mb"
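# Digest generator uses a scheduled machine instead.
# --schedule accepts interval keywords (hourly, daily, weekly, monthly), not cron
# expressions, so an exact 05:00 UTC start needs an external trigger.
# fly machine run <image> --app settle-worker-digest \
#   --schedule daily --region iad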
Database Setup
Neon is the right choice here: serverless PostgreSQL with built-in connection pooling, database branching for dev workflows, and point-in-time recovery. The Pro plan gives 7-day PITR and enough compute for the growth stage. Upgrade to Scale when estates exceed ~5,000.
Project and branch structure
# Install Neon CLI
npm install -g neonctl
# Authenticate
neonctl auth
# Create project
neonctl projects create --name settle --region-id aws-us-east-1
# Production and staging are long-lived branches
neonctl branches create --name staging --project-id <project-id>
# Per-engineer branches (create on onboarding)
neonctl branches create --name dev/corey --project-id <project-id>
# Enable pgcrypto extension (run once per branch)
psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;"
Connection pooling
Neon includes a built-in PgBouncer-compatible connection pooler. Use the pooled connection string for the API: its hostname carries a -pooler suffix on the endpoint ID. The ?pgbouncer=true and connection_limit parameters are Prisma-specific hints and do not select the pooler by themselves. Use the direct connection string only for migrations.
# In API environment (pooled, for request handlers)
DATABASE_URL=postgres://user:pass@ep-xxx-pooler.us-east-1.aws.neon.tech/settle?pgbouncer=true&connection_limit=10
# In migration scripts (direct — for schema changes)
DATABASE_DIRECT_URL=postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/settle
# Set pool mode to transaction (not session) for serverless
# This is the Neon default — verify in Neon console under Connection Pooling
Column-level encryption with pgcrypto
Sensitive fields are encrypted at the application layer before write and decrypted after read. pgcrypto's pgp_sym_encrypt / pgp_sym_decrypt functions handle the encryption inside Postgres for any server-side queries that need it. The primary encryption path is application-level using Node.js crypto.
| Table | Encrypted columns | Method |
|---|---|---|
| persons | ssn, date_of_birth | App-level AES-256-GCM before INSERT |
| estates | account_numbers, routing_numbers | App-level AES-256-GCM before INSERT |
| documents | file_key (R2 path) | App-level AES-256-GCM before INSERT |
| contacts | phone_number, email | pgcrypto pgp_sym_encrypt (searchable via a separate indexed keyed-hash column, since pgp_sym_encrypt output is randomized) |
| notification_records | recipient_address | App-level AES-256-GCM before INSERT |
// lib/crypto.ts: application-level field encryption
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

const KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex'); // 32 bytes
if (KEY.length !== 32) throw new Error('FIELD_ENCRYPTION_KEY must be 64 hex characters');

export function encryptField(plaintext: string): string {
  const iv = randomBytes(12); // 96-bit IV, the recommended size for GCM
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Format: iv:tag:ciphertext (all hex)
  return [iv.toString('hex'), tag.toString('hex'), encrypted.toString('hex')].join(':');
}

export function decryptField(stored: string): string {
  const [ivHex, tagHex, encHex] = stored.split(':');
  if (!ivHex || !tagHex || encHex === undefined) throw new Error('Malformed ciphertext');
  // The IV length is read from the stored value, so older 16-byte IVs still decrypt
  const decipher = createDecipheriv('aes-256-gcm', KEY, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  // Concatenate before decoding so multi-byte characters are never split
  const plain = Buffer.concat([decipher.update(Buffer.from(encHex, 'hex')), decipher.final()]);
  return plain.toString('utf8');
}
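The contacts row in the table above relies on a hash index for lookups. One common way to make an encrypted column searchable is a keyed hash (HMAC) stored next to the ciphertext. The following is a minimal sketch under that assumption, not Settle's confirmed approach: the function name blindIndex, the demo key, and the idea of a dedicated BLIND_INDEX_KEY secret are all hypothetical.

```typescript
import { Buffer } from 'node:buffer';
import { createHmac } from 'node:crypto';

// Hypothetical demo key. A real deployment would load a dedicated secret
// (for example BLIND_INDEX_KEY) that is different from FIELD_ENCRYPTION_KEY.
const DEMO_INDEX_KEY_HEX = '00'.repeat(32);

// Deterministic keyed hash: equal inputs give equal output, so the value can
// sit in an indexed column and be matched with a plain equality query.
export function blindIndex(value: string, keyHex: string = DEMO_INDEX_KEY_HEX): string {
  const normalised = value.trim().toLowerCase();
  return createHmac('sha256', Buffer.from(keyHex, 'hex'))
    .update(normalised, 'utf8')
    .digest('hex');
}
```

A lookup would then compare the stored hash column against blindIndex(searchTerm) rather than decrypting every row. Because the hash is deterministic, it reveals which rows share a value, which is the usual trade-off of this technique.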
Key rotation strategy
Backup and PITR
| Stage | Neon Plan | PITR window | Automated backup |
|---|---|---|---|
| Launch | Pro ($19/mo) | 7 days | Yes — continuous WAL shipping |
| Growth | Pro ($69/mo) | 7 days | Yes |
| Scale | Scale ($149/mo) | 30 days | Yes + dedicated compute |
# Restore to a point in time (Neon CLI)
neonctl branches create \
--name recovery-test \
--project-id <project-id> \
--timestamp "2026-04-01T03:00:00Z"
# Connect to restored branch and verify
neonctl connection-string --branch-name recovery-test
Document Vault (Cloudflare R2)
Death certificates, SSN documents, financial statements, and insurance policies are stored in Cloudflare R2. R2 provides at-rest encryption natively, but Settle adds a second layer of application-level AES-256-GCM encryption before upload. This means even a Cloudflare-level compromise cannot expose document contents without the application encryption key.
Bucket structure
# One bucket per environment, logical prefix per estate
settle-documents/
estates/{estate_id}/
{document_id}.enc # encrypted file
{document_id}.meta.json # encrypted metadata sidecar
# Create bucket via Wrangler
npx wrangler r2 bucket create settle-documents
npx wrangler r2 bucket create settle-documents-staging
npx wrangler r2 bucket create settle-documents-dev
Encryption before upload
// lib/vault.ts
import { S3Client, PutObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';
const r2 = new S3Client({
region: 'auto',
endpoint: process.env.R2_ENDPOINT, // https://{acct}.r2.cloudflarestorage.com
credentials: {
accessKeyId: process.env.R2_ACCESS_KEY_ID!,
secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
},
});
const DOC_KEY = Buffer.from(process.env.DOC_ENCRYPTION_KEY!, 'hex');
export async function uploadDocument(estateId: string, docId: string, buffer: Buffer) {
const iv = randomBytes(16);
const cipher = createCipheriv('aes-256-gcm', DOC_KEY, iv);
const encrypted = Buffer.concat([cipher.update(buffer), cipher.final()]);
const tag = cipher.getAuthTag();
// Prepend iv + tag to ciphertext for single-blob storage
const payload = Buffer.concat([iv, tag, encrypted]);
await r2.send(new PutObjectCommand({
Bucket: process.env.R2_BUCKET,
Key: `estates/${estateId}/${docId}.enc`,
Body: payload,
}));
}
export async function getSignedDownloadUrl(estateId: string, docId: string): Promise<string> {
const cmd = new GetObjectCommand({
Bucket: process.env.R2_BUCKET,
Key: `estates/${estateId}/${docId}.enc`,
});
// 15-minute expiry — short to limit exposure of sensitive docs
return getSignedUrl(r2, cmd, { expiresIn: 900 });
}
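Because uploadDocument stores application-encrypted bytes, a signed URL hands back the encrypted blob, and decryption has to happen wherever DOC_ENCRYPTION_KEY is available. The sketch below shows how the iv + tag + ciphertext layout could be unpacked. The names packDocument and unpackDocument are hypothetical, and the 16-byte IV and tag lengths are taken from the upload code above.

```typescript
import { Buffer } from 'node:buffer';
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

const IV_LEN = 16;  // matches randomBytes(16) in uploadDocument
const TAG_LEN = 16; // default GCM auth tag length in Node.js

// Mirrors the payload layout used at upload time: iv | tag | ciphertext.
export function packDocument(plain: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(IV_LEN);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const encrypted = Buffer.concat([cipher.update(plain), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), encrypted]);
}

// Splits the blob back into its parts and verifies the auth tag on final().
export function unpackDocument(payload: Buffer, key: Buffer): Buffer {
  if (payload.length < IV_LEN + TAG_LEN) throw new Error('payload too short');
  const iv = payload.subarray(0, IV_LEN);
  const tag = payload.subarray(IV_LEN, IV_LEN + TAG_LEN);
  const ciphertext = payload.subarray(IV_LEN + TAG_LEN);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

If the blob was modified in storage, decipher.final() throws, so a corrupted or tampered document is rejected instead of being returned as garbage.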
Retention policy
R2 lifecycle rules are not yet as mature as S3, so retention enforcement is handled at the application layer. When an estate is archived (closed + 7 years elapsed), a scheduled worker run lists all objects under estates/{estate_id}/, deletes them, and logs the deletion event to the immutable audit log.
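The list-then-delete step can be sketched as a paginated loop. This is an illustration only: the ObjectStore interface and MemoryStore class are hypothetical stand-ins so the loop can run without network access. Against R2's S3-compatible API, list would typically wrap ListObjectsV2Command and deleteMany would wrap DeleteObjectsCommand, both of which handle at most 1,000 keys per call.

```typescript
interface ListPage { keys: string[]; nextCursor: string | undefined }

// Hypothetical storage abstraction used only for this sketch.
interface ObjectStore {
  list(prefix: string, cursor?: string): Promise<ListPage>;
  deleteMany(keys: string[]): Promise<void>;
}

// Deletes every object under a prefix, one page at a time, and returns the count
// so the caller can write it to the audit log.
export async function deletePrefix(store: ObjectStore, prefix: string): Promise<number> {
  let deleted = 0;
  let cursor: string | undefined;
  do {
    const page = await store.list(prefix, cursor);
    if (page.keys.length > 0) {
      await store.deleteMany(page.keys);
      deleted += page.keys.length;
    }
    cursor = page.nextCursor;
  } while (cursor !== undefined);
  return deleted;
}

// In-memory fake with a key-based cursor, so deleting between pages is safe.
export class MemoryStore implements ObjectStore {
  private keys: string[];
  private readonly pageSize: number;
  constructor(keys: string[], pageSize: number = 2) {
    this.keys = [...keys];
    this.pageSize = pageSize;
  }
  async list(prefix: string, cursor?: string): Promise<ListPage> {
    const matching = this.keys
      .filter(k => k.startsWith(prefix) && (cursor === undefined || k > cursor))
      .sort();
    const page = matching.slice(0, this.pageSize);
    const nextCursor = matching.length > this.pageSize ? page[page.length - 1] : undefined;
    return { keys: page, nextCursor };
  }
  async deleteMany(keys: string[]): Promise<void> {
    this.keys = this.keys.filter(k => !keys.includes(k));
  }
  remaining(): string[] {
    return [...this.keys];
  }
}
```

Returning the number of deleted objects gives the retention job a concrete figure to record alongside the deletion event.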
Environment setup
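# Fly secrets for API. Generate DOC_ENCRYPTION_KEY once: re-running this command
# replaces the key and leaves previously uploaded documents undecryptable.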
fly secrets set -a settle-api \
R2_ENDPOINT="https://<account_id>.r2.cloudflarestorage.com" \
R2_ACCESS_KEY_ID="..." \
R2_SECRET_ACCESS_KEY="..." \
R2_BUCKET="settle-documents" \
DOC_ENCRYPTION_KEY="$(openssl rand -hex 32)"
Physical Mail Pipeline (Lob)
Tier 2 notifications — formal written notification to financial institutions, insurance companies, and government agencies — require actual physical letters. Lob handles printing, postage, and USPS delivery. This is infrastructure, not a product decision: the legal requirement for written notice is what makes physical mail necessary, not a UX preference.
Letter generation flow
Lob API integration
// workers/notify/lob.ts
// LetterJob and renderTemplate are defined elsewhere in the worker codebase.
// The SDK exposes a Configuration object and per-resource API classes rather than a default export.
import { Configuration, LettersApi, LetterEditable } from '@lob/lob-typescript-sdk';

const lettersApi = new LettersApi(new Configuration({ username: process.env.LOB_API_KEY! }));

export async function sendLetter(job: LetterJob): Promise<string> {
  const html = await renderTemplate(job.notification_type, job.data_snapshot);
  // Enum-typed fields (country, mail type, address placement) may need the SDK's exported constants
  const letter = await lettersApi.create(new LetterEditable({
    description: `Settle notification: ${job.notification_type}`,
    to: {
      name: job.data_snapshot.recipient_name,
      address_line1: job.data_snapshot.recipient_address1,
      address_line2: job.data_snapshot.recipient_address2 ?? '',
      address_city: job.data_snapshot.recipient_city,
      address_state: job.data_snapshot.recipient_state,
      address_zip: job.data_snapshot.recipient_zip,
      address_country: 'US',
    },
    from: {
      name: 'Settle Estate Administration',
      address_line1: process.env.SETTLE_RETURN_ADDRESS1!,
      address_city: process.env.SETTLE_RETURN_CITY!,
      address_state: process.env.SETTLE_RETURN_STATE!,
      address_zip: process.env.SETTLE_RETURN_ZIP!,
      address_country: 'US',
    },
    file: html,
    color: false, // B&W: reduces cost to ~$1.20/letter
    double_sided: false,
    address_placement: 'top_first_page',
    mail_type: 'usps_first_class',
  }));
  return letter.id; // ltr_xxx: store in notification_records.lob_id
}
Letter templates
| Notification type | Template file | Recipient |
|---|---|---|
| bank_account_closure | bank-account-closure.html | Bank institution |
| insurance_claim_initiation | insurance-claim.html | Insurance company |
| investment_account_closure | investment-closure.html | Brokerage |
| pension_claim | pension-claim.html | Plan administrator |
| utility_cancellation | utility-cancel.html | Utility provider |
| subscription_cancellation | subscription-cancel.html | Service provider |
Webhook endpoint
// api/routes/webhooks/lob.ts
// router, db, and createReturnMailTask are provided by the surrounding API code.
import { verifyLobSignature } from './lob-verify';

const statusMap: Record<string, string> = {
  'letter.mailed': 'mailed',
  'letter.in_transit': 'in_transit',
  'letter.processed_for_delivery': 'delivered',
  'letter.returned_to_sender': 'undeliverable',
};

router.post('/webhooks/lob', async (req, res) => {
  // verifyLobSignature must hash the raw request body, not re-serialized JSON
  if (!verifyLobSignature(req)) return res.status(401).send();
  // Lob sends event_type as an object; its id holds the event name (e.g. letter.mailed)
  const { event_type, body } = req.body;
  const newStatus = statusMap[event_type?.id];
  const lobId = body?.id;
  // Acknowledge events we do not track so Lob does not retry them
  if (!newStatus || !lobId) return res.status(200).send();
  await db.query(
    'UPDATE notification_records SET status=$1, updated_at=NOW() WHERE lob_id=$2',
    [newStatus, lobId]
  );
  if (newStatus === 'undeliverable') await createReturnMailTask(lobId);
  res.status(200).send();
});
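The handler above imports verifyLobSignature without showing it. The sketch below illustrates one way such a check could work. It assumes, and this should be confirmed against Lob's webhook documentation, that the signature is a hex HMAC-SHA256 of the timestamp, a dot, and the raw body, keyed with the webhook secret. The function names and parameters are hypothetical.

```typescript
import { Buffer } from 'node:buffer';
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumed signing scheme: HMAC-SHA256 over "<timestamp>.<rawBody>", hex encoded.
export function signPayload(rawBody: string, timestamp: string, secret: string): string {
  return createHmac('sha256', secret).update(`${timestamp}.${rawBody}`, 'utf8').digest('hex');
}

// Compares the received signature to the expected one in constant time.
export function isValidSignature(
  rawBody: string,
  timestamp: string,
  signature: string,
  secret: string,
): boolean {
  const expected = Buffer.from(signPayload(rawBody, timestamp, secret), 'hex');
  const received = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The comparison runs over the raw request body, so the route needs access to the bytes exactly as received rather than a parsed and re-stringified object.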
Cost model
CI/CD Pipeline
GitHub Actions handles all build, test, and deploy automation. The pipeline has three tracks: pull request validation, staging deployment on main merge, and production deployment on version tag.
Pipeline stages
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
  push:
    branches: [main]
    tags: ['v*']
jobs:
  lint-typecheck:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
  test:
    runs-on: ubuntu-24.04
    needs: lint-typecheck
    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: test, POSTGRES_DB: test }
        ports: ['5432:5432'] # publish to localhost for the job steps
        options: --health-cmd pg_isready --health-interval 10s --health-retries 5
      redis:
        image: redis:7
        ports: ['6379:6379'] # REDIS_URL below expects a local Redis
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci
      - run: npm run test:unit
      - run: npm run test:integration
        env:
          DATABASE_URL: postgres://postgres:test@localhost/test
          REDIS_URL: redis://localhost:6379
  security-scan:
    runs-on: ubuntu-24.04
    needs: lint-typecheck
    permissions:
      actions: read
      contents: read
      security-events: write # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@v4
      - uses: github/codeql-action/init@v3 # analyze requires a prior init step
        with: { languages: javascript }
      - run: npm audit --audit-level=high
      - uses: github/codeql-action/analyze@v3
  deploy-staging:
    runs-on: ubuntu-24.04
    needs: [test, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci # the migrate and smoke-test scripts need dependencies
      - uses: superfly/flyctl-actions/setup-flyctl@master
      # Expressions use block-style env: ${{ }} is not valid inside a YAML flow mapping
      - run: flyctl deploy --app settle-api-staging --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_STAGING }}
      - run: flyctl deploy --app settle-worker-notify-staging --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_STAGING }}
      - run: npm run db:migrate:staging
        env:
          DATABASE_DIRECT_URL: ${{ secrets.NEON_STAGING_DIRECT_URL }}
      - run: npm run test:smoke -- --env staging
  deploy-production:
    runs-on: ubuntu-24.04
    needs: [test, security-scan]
    if: startsWith(github.ref, 'refs/tags/v')
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci # the migrate script needs dependencies
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: npm run db:migrate:production
        env:
          DATABASE_DIRECT_URL: ${{ secrets.NEON_PRODUCTION_DIRECT_URL }}
      - run: flyctl deploy --app settle-api --strategy rolling --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-notify --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-scanner --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-digest --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
Required GitHub secrets
| Secret | Used by |
|---|---|
| FLY_API_TOKEN_STAGING | Staging deploy jobs |
| FLY_API_TOKEN_PROD | Production deploy jobs |
| NEON_STAGING_DIRECT_URL | db:migrate:staging |
| NEON_PRODUCTION_DIRECT_URL | db:migrate:production |
| NETLIFY_AUTH_TOKEN | Netlify builds (auto-configured by Netlify GitHub app) |
Environment Management
| Environment | Frontend | API | Database | R2 Bucket | Lob mode |
|---|---|---|---|---|---|
| local | localhost:5173 | localhost:3000 | Neon dev branch | settle-documents-dev | Test mode (no real mail) |
| preview | Netlify deploy preview | settle-api-staging | Neon staging branch | settle-documents-staging | Test mode |
| staging | staging.settle.app (Netlify) | settle-api-staging.fly.dev | Neon staging branch | settle-documents-staging | Test mode |
| production | settle.app (Netlify) | settle-api.fly.dev | Neon main branch | settle-documents | Live mode |
Local development setup
# .env.local (gitignored)
DATABASE_URL=postgres://...neon.tech/settle?pgbouncer=true
DATABASE_DIRECT_URL=postgres://...neon.tech/settle
REDIS_URL=rediss://default:xxx@us1-xxx.upstash.io:6379
R2_ENDPOINT=https://<account>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=...
R2_SECRET_ACCESS_KEY=...
R2_BUCKET=settle-documents-dev
LOB_API_KEY=test_xxx # Lob test key — no real letters sent
RESEND_API_KEY=re_xxx
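FIELD_ENCRYPTION_KEY=<64 hex chars> # paste the output of: openssl rand -hex 32 (dotenv files do not run shell substitutions)
DOC_ENCRYPTION_KEY=<64 hex chars> # generate a second, separate value the same way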
SENTRY_DSN= # leave empty in local — avoids noise
# Start local API (from /api directory)
npm run dev
# Start frontend (from /web directory)
npm run dev
Long-lived workflow considerations
Estates are active for 16–18 months. The infrastructure must handle:
- Sessions stored in Redis with TTL of 30 days (refreshed on activity). Users are never forced to log in mid-task.
- Estate state lives in Postgres, not sessions. Closing a browser and returning 3 months later yields the same estate state.
- Re-engagement: Digest Generator checks last_active_at per user. If >14 days inactive, escalates the Daily Three email with a "Welcome back" header. If >60 days, triggers a human-written check-in email via Resend.
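- Worker job enqueues are idempotent: duplicates are safe because the API claims a Redis key derived from estate_id + job_type before enqueueing. The claim should be atomic (for example SET with NX) so that two concurrent requests cannot both pass the check.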
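The idempotent enqueue in the last bullet can be sketched as a claim-then-push sequence. This is a minimal illustration, not the production queue code: QueueStore, MemoryQueueStore, enqueueOnce, the job:{estate_id}:{job_type} key format, and the one-hour TTL are all hypothetical. With Redis, setIfAbsent would map to SET key value NX EX ttl and push to a list push on the queue key.

```typescript
// Hypothetical abstraction over the two Redis operations the sketch needs.
interface QueueStore {
  setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
  push(queue: string, payload: string): Promise<void>;
}

// Returns true if the job was enqueued, false if an identical job is already claimed.
export async function enqueueOnce(
  store: QueueStore,
  queue: string,
  estateId: string,
  jobType: string,
  ttlSeconds: number = 3600,
): Promise<boolean> {
  const dedupeKey = `job:${estateId}:${jobType}`;
  // Atomic claim: only the first caller gets true, so there is no check-then-act race
  const claimed = await store.setIfAbsent(dedupeKey, '1', ttlSeconds);
  if (!claimed) return false;
  await store.push(queue, JSON.stringify({ estateId, jobType }));
  return true;
}

// In-memory fake for local experiments; it ignores the TTL.
export class MemoryQueueStore implements QueueStore {
  private claimed = new Set<string>();
  readonly queues = new Map<string, string[]>();
  async setIfAbsent(key: string, _value: string, _ttlSeconds: number): Promise<boolean> {
    if (this.claimed.has(key)) return false;
    this.claimed.add(key);
    return true;
  }
  async push(queue: string, payload: string): Promise<void> {
    const items = this.queues.get(queue) ?? [];
    items.push(payload);
    this.queues.set(queue, items);
  }
}
```

The TTL on the claim key bounds how long a crashed job can block a retry; the worker would normally delete the key when the job finishes.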
Monitoring & Observability
Stack
- Fly.io metrics: CPU, memory, request latency per machine. Available in Fly dashboard. Export to Grafana via Prometheus scrape endpoint.
- Grafana Cloud: Dashboards for custom business metrics. Free tier: 10k series, 14-day retention, which is more than sufficient at launch and growth stages.
- Sentry: Error tracking for API and all workers. Free tier until ~Scale stage (~$26/mo on Team plan for better retention). Set up one project per service.
Custom business metrics (push to Grafana)
These metrics are the ones that matter for Settle's operations. Standard infrastructure metrics (CPU, latency) are table stakes; these are the ones that tell you if the product is working.
| Metric | How measured | Alert threshold |
|---|---|---|
| notification_tier1_success_rate | % of Tier 1 API calls that return 2xx within 30s | Alert if <80% over 1h |
| notification_tier2_delivery_rate | % of Lob letters reaching processed_for_delivery status within 7 days | Alert if <85% over 7d window |
| notification_tier2_return_rate | % of Lob letters returned_to_sender | Alert if >5% over 7d window |
| benefit_scan_hit_rate | % of estate scans that find at least one potential benefit | Informational — track weekly trend |
| benefit_scan_external_api_errors | Error rate per external API (NAUPA, NAIC, PBGC, VA) | Alert if circuit breaker opens |
| estate_completion_rate | % of estates reaching "closed" status | Informational — monthly review |
| estate_median_days_to_close | Median days from estate creation to closed status | Informational; product health |
| daily_digest_delivery_rate | % of digest emails successfully accepted by Resend | Alert if <95% |
| worker_job_queue_depth | Redis queue length per worker type | Alert if >100 jobs queued for >10min |
Grafana setup
# Grafana Cloud: create a free account at grafana.com/products/cloud
# Get the Prometheus remote-write endpoint and API key
# In API: expose metrics with prom-client for scraping; a collector (e.g. Grafana Alloy) forwards them to Grafana Cloud
npm install prom-client
// lib/metrics.ts
import { Gauge, Counter, collectDefaultMetrics, register } from 'prom-client';
collectDefaultMetrics();
export const notificationSuccessRate = new Gauge({
name: 'settle_notification_tier1_success_rate',
help: '5-minute rolling success rate for Tier 1 notifications',
});
// Expose /metrics endpoint (Fly.io internal only)
router.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
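The gauge above needs a value to report. One way to produce it is a small sliding-window tracker that records each Tier 1 call outcome and computes the success ratio on demand. This is a sketch: the class name RollingSuccessRate and the choice to report 1 for an empty window are assumptions.

```typescript
// Tracks call outcomes and reports the success ratio over a sliding time window.
export class RollingSuccessRate {
  private events: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record one call outcome; `at` is injectable so the class can be tested.
  record(ok: boolean, at: number = Date.now()): void {
    this.events.push({ at, ok });
  }

  // Drop events older than the window, then return successes / total.
  rate(now: number = Date.now()): number {
    const cutoff = now - this.windowMs;
    this.events = this.events.filter(e => e.at > cutoff);
    if (this.events.length === 0) return 1; // assumption: no traffic counts as healthy
    const successes = this.events.filter(e => e.ok).length;
    return successes / this.events.length;
  }
}
```

In the API this could feed the gauge just before each scrape, for example notificationSuccessRate.set(tracker.rate()), with the window sized to match the alert rule in the table above.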
Sentry setup
# Create projects in Sentry UI, then:
npm install @sentry/node
// api/instrument.ts (import before everything else)
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
tracesSampleRate: 0.1, // 10% traces — sufficient at launch scale
beforeSend(event) {
// Strip PII from error contexts before sending to Sentry
if (event.user) delete event.user.email;
return event;
},
});
Cost Projections
| Service | Launch / mo | Growth / mo | Scale / mo | Notes |
|---|---|---|---|---|
| Netlify Pro | $19 | $19 | $19 | Includes 1M Edge Function requests/mo. Upgrade to Business ($99) at ~50k MAU if bandwidth spikes. |
| Fly.io API (2 regions) | $10 | $30 | $80 | 2× shared-cpu-1x 512MB at launch. Scale to 8× machines at Scale tier. Includes egress. |
| Fly.io Workers (3 apps) | $5 | $20 | $60 | Scale-to-zero — billed per second of execution. Digest generator is the most active at $2–5/mo. |
| Neon PostgreSQL | $19 | $69 | $149 | Pro at Launch/Growth. Scale plan at Year 2 (dedicated compute, 30-day PITR, 1TB storage). |
| Cloudflare R2 | $0 | $5 | $20 | Free tier: 10GB storage, 1M Class A ops/mo. Growth: ~50GB. Scale: ~200GB. No egress fees. |
| Upstash Redis | $0 | $10 | $25 | Free tier handles Launch. Pay-as-you-go at Growth (~$0.2/100k commands). Scale: fixed $25/mo plan. |
| Lob (physical letters) | $50 | $500 | $3,000 | ~30 letters/mo at Launch (200 estates × ~15%). ~$1.50/letter avg. See note below. |
| Resend (email) | $0 | $20 | $50 | Free tier: 3,000 emails/mo. Growth: ~20k/mo on $20 plan. Scale: ~100k/mo on Pro. |
| Sentry | $0 | $0 | $26 | Free Developer tier through Growth. Team plan ($26/mo) adds 90-day retention at Scale. |
| Grafana Cloud | $0 | $0 | $0 | Free tier (10k series, 14d retention) is sufficient through Scale. Upgrade only if adding logs ingestion. |
| Total | ~$103/mo | ~$673/mo | ~$3,429/mo | |
| Annual | ~$1,236/yr | ~$8,076/yr | ~$41,148/yr | |
Cost per estate (monthly)
| Stage | Total infra cost | Active estates | Cost/estate/mo |
|---|---|---|---|
| Launch | $103 | 200 | $0.52 |
| Growth | $673 | 2,000 | $0.34 |
| Scale | $3,429 | 20,000 | $0.17 |
Infrastructure cost per estate declines as you scale. Lob is the only cost that scales linearly with estate count — everything else is largely fixed or grows sub-linearly.
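The per-estate figures in the table can be reproduced with a few lines of arithmetic. The sketch below works in whole cents to avoid floating-point rounding surprises; the function name costPerEstateCents is hypothetical and the inputs are the totals and estate counts from the tables above.

```typescript
// Monthly infrastructure cost per active estate, in whole cents.
export function costPerEstateCents(totalDollars: number, activeEstates: number): number {
  if (activeEstates <= 0) throw new Error('activeEstates must be positive');
  return Math.round((totalDollars * 100) / activeEstates);
}

// Totals and estate counts taken from the cost tables in this section.
export const stages = [
  { stage: 'Launch', totalDollars: 103, activeEstates: 200 },
  { stage: 'Growth', totalDollars: 673, activeEstates: 2000 },
  { stage: 'Scale', totalDollars: 3429, activeEstates: 20000 },
];

for (const s of stages) {
  const cents = costPerEstateCents(s.totalDollars, s.activeEstates);
  console.log(`${s.stage}: $${(cents / 100).toFixed(2)} per estate per month`);
}
```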
Security Infrastructure
TLS everywhere
- Netlify auto-provisions Let's Encrypt certificates. HSTS enforced with 1-year max-age + includeSubDomains.
- Fly.io auto-provisions Let's Encrypt for *.fly.dev domains. Add custom domain via fly certs add api.settle.app.
- Neon connections use TLS by default. sslmode=require enforced in connection string.
- Upstash Redis uses TLS (rediss:// scheme). No unencrypted connections accepted.
- R2 accessed over HTTPS only. No public bucket access — all reads require signed URLs generated by authenticated API.
Private networking (Fly.io 6PN)
# Workers communicate with API and Neon over Fly private network
# No public ports opened on workers
# API internal address (accessible only from Fly apps in same org):
# settle-api.internal
# In worker config — connect to Neon via DATABASE_URL directly
# (Neon is external to Fly, so TLS is the transport, not 6PN)
# Block all inbound on workers (fly.toml)
[http_service]
internal_port = 3000
auto_stop_machines = "stop"
# Caution: [http_service] is itself a service definition. To keep workers private,
# do not allocate a public IP to the worker apps.
Secrets management
# All production secrets live in Fly.io secrets, never in committed files or code.
# Generate the two encryption keys once: re-running this command with fresh
# openssl output replaces them and makes existing ciphertext unreadable.
fly secrets set -a settle-api \
  DATABASE_URL="postgres://..." \
  FIELD_ENCRYPTION_KEY="$(openssl rand -hex 32)" \
  DOC_ENCRYPTION_KEY="$(openssl rand -hex 32)" \
  LOB_API_KEY="live_..." \
  RESEND_API_KEY="re_..." \
  REDIS_URL="rediss://..." \
  SENTRY_DSN="https://...@sentry.io/..."
# List secrets (shows names only, not values)
fly secrets list -a settle-api
# Netlify secrets — set via Netlify UI or CLI
netlify env:set PUBLIC_API_URL https://api.settle.app
IAM — principle of least privilege
| Component | Access granted | Access denied |
|---|---|---|
| API (settle-api) | Read/write Neon (specific tables via role), R2 get/put/delete own prefix, Upstash read/write, Resend send | R2 bucket delete, Neon schema changes, Fly API token |
| Workers | Read/write Neon (specific tables), R2 get/put, Upstash read/write, Lob API create | R2 bucket delete, Neon schema changes, Resend (email sent via API, not directly) |
| CI/CD (GitHub Actions) | Fly deploy token (per-app), Neon direct URL (migrations only) | Production Fly token from staging jobs, database drops |
| Netlify | Read API URL env var, deploy to CDN | No database access, no encryption keys |
PII access logging
// middleware/audit.ts — log every PII field access
export function auditPiiAccess(userId: string, field: string, estateId: string) {
// Written to immutable audit_log table — no UPDATE/DELETE allowed on this table
return db.query(`
INSERT INTO audit_log (user_id, action, resource_type, resource_id, field_name, occurred_at)
VALUES ($1, 'READ_PII', 'estate', $2, $3, NOW())
`, [userId, estateId, field]);
}
-- Immutable audit_log: prevent tampering (SQL migration, separate from the TypeScript above)
-- Run once as superuser:
CREATE RULE no_update_audit AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
CREATE RULE no_delete_audit AS ON DELETE TO audit_log DO INSTEAD NOTHING;
-- Grant INSERT-only to app role:
GRANT INSERT ON audit_log TO settle_app;
REVOKE UPDATE, DELETE ON audit_log FROM settle_app;
IP allowlisting for admin endpoints
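// Restrict /admin/* and /internal/* routes to office IP + Fly private network
// Express middleware. req.ip reflects the real client only when 'trust proxy' is configured.
const ALLOWED_IPS = (process.env.ADMIN_ALLOWED_IPS ?? '')
  .split(',')
  .map(entry => entry.trim())
  .filter(entry => entry.length > 0); // an empty entry would match every IP via startsWith('')

router.use(['/admin', '/internal'], (req, res, next) => {
  const ip = req.ip ?? '';
  // Entries are prefixes (e.g. "203.0.113.7" or "fdaa:"); end each at an octet or group boundary
  if (!ALLOWED_IPS.some(allowed => ip.startsWith(allowed))) {
    return res.status(403).json({ error: 'Forbidden' });
  }
  next();
});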
Disaster Recovery
RTO (recovery time objective): 30 minutes
Time from incident declaration to full service restoration. Achievable through pre-provisioned standby infrastructure and documented runbooks.
RPO (recovery point objective): 5 minutes
Maximum acceptable data loss window. Achieved via Neon continuous WAL shipping, which typically has <1min lag to durable storage.
Scenario 1: Fly.io API region failure (iad goes down)
Scenario 2: Neon database failure / corruption
neonctl branches create \
--name recovery-$(date +%Y%m%d%H%M) \
--project-id <id> \
--timestamp "2026-04-01T14:55:00Z"
Rolling restart occurs automatically. Verify in Grafana.
Scenario 3: Compromised encryption key
Runbook quick reference
# Check all app statuses
fly status -a settle-api
fly status -a settle-worker-notify
fly status -a settle-worker-scanner
fly status -a settle-worker-digest
# View live logs
fly logs -a settle-api
fly logs -a settle-worker-scanner
# Scale up API in emergency
fly scale count 4 --region iad -a settle-api
fly scale count 4 --region ord -a settle-api
# Restart all API machines (e.g., after secrets rotation)
fly machines restart --all -a settle-api
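# Check Redis queue depths (requires redis-cli in the API image)
# -C runs the command on the machine, so $REDIS_URL is expanded there, not locally
fly ssh console -a settle-api -C "sh -c 'redis-cli -u \$REDIS_URL LLEN queue:notify'"
fly ssh console -a settle-api -C "sh -c 'redis-cli -u \$REDIS_URL LLEN queue:scan'"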
Compliance Checklist
Settle handles death certificates, Social Security numbers, financial records, and health insurance data. This section documents the infrastructure controls required for SOC 2 Type II readiness and provable compliance with data deletion obligations.
SOC 2 readiness (infrastructure controls)
| Control | Implementation | Status |
|---|---|---|
| CC6.1 — Logical access controls | Role-based DB access, Fly secrets, IAM principle of least privilege documented above | Implemented |
| CC6.2 — Encryption at rest | AES-256-GCM for PII fields + documents, Neon encrypted at rest, R2 encrypted at rest | Implemented |
| CC6.3 — Encryption in transit | TLS 1.2+ on all connections, HSTS enforced, no plaintext protocols | Implemented |
| CC6.6 — Vulnerability management | npm audit in CI, CodeQL scanning, Dependabot alerts enabled | Implemented |
| CC7.2 — Monitoring for anomalies | Sentry error tracking, Grafana anomaly dashboards, PII access audit log | Implemented |
| CC8.1 — Change management | All infra changes via GitHub PRs, no direct production mutations, IaC for all services | Partial — Terraform IaC in progress |
| A1.1 — System availability | Multi-region Fly.io, Neon HA, 99.95% SLA on all managed services | Implemented |
| P4.2 — Audit log retention | Immutable audit_log table, Neon PITR, 7-year retention policy enforced | Implemented |
Audit log retention (7 years)
-- audit_log schema
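CREATE TABLE audit_log (
  id UUID DEFAULT gen_random_uuid(),
  occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_id UUID,
  action TEXT NOT NULL, -- READ_PII, WRITE_PII, DELETE_ESTATE, etc.
  resource_type TEXT NOT NULL,
  resource_id UUID,
  field_name TEXT,
  ip_address INET,
  user_agent TEXT,
  metadata JSONB,
  -- A partitioned table's primary key must include the partition key
  PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at); -- required before any PARTITION OF statement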
-- Partition by year for performance and retention management
CREATE TABLE audit_log_2026 PARTITION OF audit_log
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
-- 7-year retention: drop partitions older than 7 years in annual cron job
-- Run: DROP TABLE audit_log_2018 (in 2026); DROP TABLE audit_log_2019 (in 2027); etc.
-- Immutability rules (no UPDATE or DELETE from any role)
CREATE RULE audit_no_update AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
CREATE RULE audit_no_delete AS ON DELETE TO audit_log DO INSTEAD NOTHING;
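The annual retention job described in the comments above comes down to picking which yearly partitions fall outside the 7-year window. A small sketch of that selection follows; the function name partitionsToDrop is hypothetical, and the list of existing partition years would in practice come from the database catalog.

```typescript
// Returns the audit_log partitions that are entirely older than the retention window.
// In 2026 with 7-year retention the oldest kept year is 2019, so audit_log_2018 is dropped.
export function partitionsToDrop(
  existingYears: number[],
  currentYear: number,
  retentionYears: number = 7,
): string[] {
  const oldestKeptYear = currentYear - retentionYears;
  return existingYears
    .filter(year => year < oldestKeptYear)
    .sort((a, b) => a - b)
    .map(year => `audit_log_${year}`);
}
```

The job would then issue one DROP TABLE per returned name and record the action, keeping the destructive step separate from the selection logic so the latter can be unit tested.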
Provable data deletion
When a user invokes their right to deletion (GDPR Art. 17, CCPA), all their data must be deleted and that deletion must be provable across every store. This is the deletion checklist:
| Store | What to delete | How to verify |
|---|---|---|
| Neon PostgreSQL | All rows in: persons, estates, documents (metadata), contacts, notification_records, sessions, benefit_scans — where user_id = target. Do NOT delete audit_log rows — retain per compliance obligation. | Run SELECT COUNT(*) after deletion. Export final 0-count query result to compliance record. |
| Cloudflare R2 | All objects under estates/{estate_id}/ for every estate owned by the user. | List objects after deletion — expect empty result. Log deletion confirmation from R2 API response. |
| Upstash Redis | Session keys: session:{user_id}:*. Rate limit keys: ratelimit:{user_id}:*. Benefit scan cache: scan:{estate_id}:* | SCAN for pattern after deletion — expect 0 matches. |
| Sentry | Use Sentry's "Delete User Data" API endpoint for the user's email/ID. | Sentry API confirms deletion. Log confirmation. |
// api/admin/delete-user.ts: provable deletion script
export async function deleteUserData(userId: string): Promise<DeletionReport> {
  const report: DeletionReport = { userId, startedAt: new Date(), steps: [] };

  // 1. Get all estate IDs before deletion
  const estates = await db.query('SELECT id FROM estates WHERE user_id = $1', [userId]);

  // 2. Delete R2 documents for each estate
  for (const estate of estates.rows) {
    const deleted = await deleteR2Prefix(`estates/${estate.id}/`);
    report.steps.push({ store: 'r2', estateId: estate.id, deletedObjects: deleted });
  }

  // 3. Delete Postgres rows, children first. Table names come from this fixed list, never from input.
  const tables = ['notification_records', 'benefit_scans', 'documents',
    'contacts', 'estates', 'persons', 'users'];
  for (const table of tables) {
    // Assumption: the users table is keyed by id rather than user_id
    const column = table === 'users' ? 'id' : 'user_id';
    const result = await db.query(`DELETE FROM ${table} WHERE ${column} = $1`, [userId]);
    report.steps.push({ store: 'postgres', table: table, rowsDeleted: result.rowCount });
  }

  // 4. Flush Redis keys for every pattern in the deletion checklist
  const patterns = [`session:${userId}:*`, `ratelimit:${userId}:*`,
    ...estates.rows.map(estate => `scan:${estate.id}:*`)];
  for (const pattern of patterns) {
    const keys = await redis.keys(pattern); // prefer SCAN on large keyspaces
    if (keys.length > 0) await redis.del(...keys); // DEL with no keys is an error
    report.steps.push({ store: 'redis', pattern, deletedKeys: keys.length });
  }

  // 5. Write deletion record to audit_log (this entry is retained)
  await db.query(
    'INSERT INTO audit_log (action, resource_type, resource_id, metadata) VALUES ($1,$2,$3,$4)',
    ['USER_DATA_DELETED', 'user', userId, JSON.stringify(report)]
  );
  return report; // return to admin for storage in compliance folder
}
External API reliability (Benefit Scanner)
The NAUPA, NAIC, PBGC, and VA.gov APIs are run by government agencies and industry associations and may be slow, rate-limited, or unavailable. The infrastructure handles this with:
- Circuit breakers: Use the opossum npm package. Each API (NAUPA, NAIC, PBGC, VA) has its own circuit breaker with: failure threshold 50%, reset timeout 60s, timeout per request 15s. When open, the scanner skips that API and logs it as unavailable.
- Result caching: Scan results are cached in Upstash Redis under scan:{estate_id}:{api} with 86,400s TTL. If a subsequent scan request arrives within 24h, return the cached result. This also protects against rate limits.
// workers/scanner/circuit-breaker.ts
import CircuitBreaker from 'opossum';
const breakerOptions = {
timeout: 15000, // 15s per request
errorThresholdPercentage: 50,
resetTimeout: 60000, // try again after 60s
volumeThreshold: 5, // minimum 5 requests before opening
};
export const naupa = new CircuitBreaker(queryNaupa, breakerOptions);
export const naic = new CircuitBreaker(queryNaic, breakerOptions);
export const pbgc = new CircuitBreaker(queryPbgc, breakerOptions);
export const va = new CircuitBreaker(queryVa, breakerOptions);
// Track circuit state in metrics
[naupa, naic, pbgc, va].forEach((cb, i) => {
const name = ['naupa', 'naic', 'pbgc', 'va'][i];
cb.on('open', () => metrics.inc(`circuit_open_${name}`));
cb.on('halfOpen', () => metrics.inc(`circuit_halfopen_${name}`));
cb.on('close', () => metrics.inc(`circuit_close_${name}`));
});
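The cache and the breakers work together: check the cache first, call the API through its breaker on a miss, and treat a breaker rejection as "unavailable" rather than a failed scan. A minimal sketch of that flow follows. ScanCache, MemoryScanCache, and scanWithCache are hypothetical names, and the in-memory cache ignores the TTL. In the scanner, the query argument would typically be something like () => naupa.fire(estateId).

```typescript
// Hypothetical cache abstraction; with Redis these map to GET and SET with EX.
interface ScanCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

type ScanOutcome<T> = { result: T | null; source: 'cache' | 'live' | 'unavailable' };

// Cache-first lookup; a rejected query (open breaker, timeout) is reported, not thrown.
export async function scanWithCache<T>(
  cache: ScanCache,
  estateId: string,
  api: string,
  query: () => Promise<T>,
  ttlSeconds: number = 86400,
): Promise<ScanOutcome<T>> {
  const key = `scan:${estateId}:${api}`;
  const hit = await cache.get(key);
  if (hit !== null) return { result: JSON.parse(hit) as T, source: 'cache' };
  try {
    const result = await query();
    await cache.set(key, JSON.stringify(result), ttlSeconds);
    return { result, source: 'live' };
  } catch {
    return { result: null, source: 'unavailable' };
  }
}

// In-memory fake for local experiments; it does not expire entries.
export class MemoryScanCache implements ScanCache {
  private entries = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.entries.get(key) ?? null;
  }
  async set(key: string, value: string, _ttlSeconds: number): Promise<void> {
    this.entries.set(key, value);
  }
}
```

Failures are deliberately not cached, so the next scan retries the API once its breaker has closed again.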