1

Infrastructure Diagram

The architecture separates concerns across four zones: public edge delivery, API compute, background worker compute, and persistent data stores. Traffic between Fly.io apps (API and workers) travels over Fly.io's private WireGuard network (6PN). Managed services outside Fly (Neon, R2, Upstash) are reached over TLS using credentials held only by the API and workers. No worker accepts connections from the public internet, and end users reach stored data only through the API.

[Infrastructure diagram. Zones: Users (browser / PWA); Edge / Delivery (Netlify SvelteKit SSR + static, Cloudflare R2 document vault, Upstash Redis sessions and rate limiting, Resend transactional email); API compute on Fly.io (settle-api in iad and ord, Node.js 22, 2× shared-cpu-1x, 512 MB RAM, auto-stop; Sentry error tracking; Grafana Cloud metrics and dashboards); Background workers on Fly.io Machines, scale-to-zero (Notification Worker, Benefit Scanner with circuit breaker and 24h cache, Digest Generator at 05:00 UTC daily; Lob API for physical letters at ~$1.50/letter); Persistent data stores (Neon PostgreSQL with pgcrypto column encryption and PITR, Cloudflare R2 with signed URLs and 15-minute expiry, Upstash Redis, immutable audit log with 7-year retention); External benefit APIs queried by the Benefit Scanner (NAUPA, NAIC, PBGC, VA.gov, SSA / SSNDB). Legend: public HTTPS, internal (Fly 6PN), worker-triggered.]
Private networking: All Fly.io apps (API + workers) communicate over Fly's private WireGuard mesh network (6PN). Workers never accept inbound connections — they are woken by the API placing jobs onto a Redis queue, then call back into Postgres and R2. No inbound firewall rules needed for workers.
2

Hosting Architecture

Frontend — Netlify

SvelteKit is deployed to Netlify using the @sveltejs/adapter-netlify adapter. SSR pages render server-side via Netlify Edge Functions (Deno runtime), which requires enabling the adapter's edge option; by default the adapter targets Node-based Netlify Functions. Static assets are served from Netlify's global CDN. Build previews are automatically deployed for every pull request.

netlify.toml
# netlify.toml (repo root)
[build]
  command = "npm run build"
  publish = ".svelte-kit/netlify"

[build.environment]
  NODE_VERSION = "22"

[[redirects]]
  from   = "/api/*"
  to     = "https://settle-api.fly.dev/:splat"
  status = 200
  force  = true

[context.production]
  environment = { NODE_ENV = "production" }

[context.deploy-preview]
  environment = { NODE_ENV = "preview" }
Netlify env vars (production)
PUBLIC_API_URL=https://settle-api.fly.dev
PUBLIC_R2_PUBLIC_URL=https://docs.settle.app
SENTRY_DSN=https://...@sentry.io/...
PUBLIC_POSTHOG_KEY=phc_...

Set via Netlify UI or netlify env:set. Never committed to repo.

API — Fly.io (2 regions)

The Node.js API runs on Fly.io in two regions for redundancy and latency. iad (Ashburn VA) is primary, serving East Coast and international traffic. ord (Chicago) is secondary, providing failover and serving Midwest traffic with lower latency.

# Initial deploy
fly launch --name settle-api --region iad --image node:22-alpine
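# Add a machine in ord by cloning the iad machine
# (current Fly Machines apps place regions per machine; confirm with fly machine clone --help)
fly machine clone <machine-id> --region ord -a settle-api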

# fly.toml for the API
app = "settle-api"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile"

[env]
  PORT = "3000"
  NODE_ENV = "production"

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 1  # keep 1 warm in iad

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"

# No [mounts] section: the API is stateless and needs no persistent disk
Region routing: Fly.io's anycast routing automatically directs requests to the nearest healthy region. If iad is degraded, traffic fails over to ord transparently. No DNS changes needed.

Scale targets

Launch: 2 machines (1 per region)
Growth: 4 machines (2 per region)
Scale: 8 machines (4 per region)

Workers — Fly.io Machines (scale-to-zero)

Each of the three workers is its own Fly.io app with a separate fly.toml. Workers run in the iad region only (no multi-region needed — they are not latency-sensitive). They scale to zero when idle and wake within ~500ms when the API enqueues a job via the Redis work queue.

settle-worker-notify

Sends Tier 1 API calls, generates and sends Tier 2 letters via Lob, produces Tier 3 call scripts. Woken per notification batch. Expected runtime: 30–120 sec/job.

settle-worker-scanner

Queries NAUPA, NAIC, PBGC, VA.gov for unclaimed assets. Circuit breaker per external API. Results cached in Redis for 24h. Expected runtime: 2–10 min/estate.

settle-worker-digest

Runs at 05:00 UTC daily via Fly scheduled machine. Computes each user's Daily Three tasks, writes digest rows to Postgres, queues delivery via Resend. Runtime: 5–15 min/run.
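
The per-API circuit breaker mentioned for settle-worker-scanner could look roughly like the sketch below. It is an illustration only: the class name, the failure threshold, and the cooldown are assumptions, not values taken from the Settle codebase.

```typescript
// Minimal circuit breaker sketch. Threshold and cooldown values are illustrative assumptions.
type BreakerState = 'closed' | 'open' | 'half_open';

export class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 60_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs;             // time to stay open before trial calls
    this.now = now;                           // injectable clock for testing
  }

  /** Returns true if a call to the external API may be attempted. */
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // cooldown elapsed: allow trial requests
    }
    return this.state !== 'open';
  }

  /** A successful call resets the failure count and closes the breaker. */
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  /** A failed trial call, or too many consecutive failures, opens the breaker. */
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per external API, so an outage at one source does not block the others.
export const breakers = new Map<string, CircuitBreaker>(
  ['NAUPA', 'NAIC', 'PBGC', 'VA'].map((api): [string, CircuitBreaker] => [api, new CircuitBreaker()]),
);
```

A scan would check canRequest() before each external call, skip that source when the breaker is open, and fall back to the 24h Redis cache where a cached result exists.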

# Creating worker apps
fly apps create settle-worker-notify
fly apps create settle-worker-scanner
fly apps create settle-worker-digest

# Shared fly.toml pattern for workers
# (save as fly.worker-notify.toml etc.)
app = "settle-worker-notify"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile.worker"

[env]
  WORKER_TYPE = "notify"
  NODE_ENV    = "production"

[http_service]
  # No public HTTP — workers pull from Redis queue
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 0  # true scale-to-zero

[[vm]]
  size   = "shared-cpu-1x"
  memory = "512mb"

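# Digest generator uses a scheduled machine instead.
# Note: flyctl's --schedule takes an interval keyword (hourly, daily,
# weekly, monthly) rather than a cron expression, so confirm how the
# 05:00 UTC start time will be pinned before relying on it.
# fly machine run <image> -a settle-worker-digest \
#   --schedule daily --region iad
Worker wake latency: scale-to-zero machines have ~300–700ms cold start. For notification workers this is acceptable. For the digest generator, use a Fly scheduled machine (the --schedule flag) rather than queue-triggered scale-to-zero, so the daily run does not depend on a job being enqueued first.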
3

Database Setup

Neon is the right choice here: serverless PostgreSQL with built-in connection pooling, database branching for dev workflows, and point-in-time recovery. The Pro plan gives 7-day PITR and enough compute for the growth stage. Upgrade to Scale when estates exceed ~5,000.

Project and branch structure

# Install Neon CLI
npm install -g neonctl

# Authenticate
neonctl auth

# Create project
neonctl projects create --name settle --region-id aws-us-east-1

# Production and staging are long-lived branches
neonctl branches create --name staging --project-id <project-id>

# Per-engineer branches (create on onboarding)
neonctl branches create --name dev/corey --project-id <project-id>

# Enable pgcrypto extension (run once per branch)
psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;"

Connection pooling

Neon includes a built-in PgBouncer connection pooler. Use the pooled connection string for the API; Neon marks it with a -pooler suffix on the endpoint hostname. The pgbouncer=true and connection_limit query parameters are Prisma-specific hints, so keep them only if the API connects through Prisma. Use the direct connection string only for migrations.

# In API environment (pooled, for request handlers)
DATABASE_URL=postgres://user:pass@ep-xxx-pooler.us-east-1.aws.neon.tech/settle?pgbouncer=true&connection_limit=10

# In migration scripts (direct — for schema changes)
DATABASE_DIRECT_URL=postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/settle

# Set pool mode to transaction (not session) for serverless
# This is the Neon default — verify in Neon console under Connection Pooling

Column-level encryption with pgcrypto

Sensitive fields are encrypted at the application layer before write and decrypted after read. pgcrypto's pgp_sym_encrypt / pgp_sym_decrypt functions handle the encryption inside Postgres for any server-side queries that need it. The primary encryption path is application-level using Node.js crypto.

Table | Encrypted columns | Method
persons | ssn, date_of_birth | App-level AES-256-GCM before INSERT
estates | account_numbers, routing_numbers | App-level AES-256-GCM before INSERT
documents | file_key (R2 path) | App-level AES-256-GCM before INSERT
contacts | phone_number, email | pgcrypto pgp_sym_encrypt (searchable via hash index)
notification_records | recipient_address | App-level AES-256-GCM before INSERT
// lib/crypto.ts — application-level field encryption
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex'); // 32 bytes

export function encryptField(plaintext: string): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Format: iv:tag:ciphertext (all hex)
  return [iv.toString('hex'), tag.toString('hex'), encrypted.toString('hex')].join(':');
}

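export function decryptField(stored: string): string {
  const [ivHex, tagHex, encHex] = stored.split(':');
  if (ivHex === undefined || tagHex === undefined || encHex === undefined) {
    throw new Error('Malformed encrypted field: expected iv:tag:ciphertext');
  }
  const decipher = createDecipheriv('aes-256-gcm', KEY, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  // final() verifies the auth tag and throws if the value was tampered with.
  // Concatenate bytes before decoding so multi-byte characters are never split.
  return Buffer.concat([decipher.update(Buffer.from(encHex, 'hex')), decipher.final()])
    .toString('utf8');
}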

Key rotation strategy

1
Generate new key
openssl rand -hex 32 — store as FIELD_ENCRYPTION_KEY_NEXT in Fly secrets alongside existing FIELD_ENCRYPTION_KEY.
2
Re-encrypt in background job
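Run the key rotation migration script: it reads each encrypted row using the old key, re-encrypts with the new key, and updates the row in place, in batches of 100 rows per transaction. While the job runs, rows exist under both keys, so the API's decrypt path must accept either FIELD_ENCRYPTION_KEY or FIELD_ENCRYPTION_KEY_NEXT until the swap in the next step.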
3
Swap secrets
fly secrets set FIELD_ENCRYPTION_KEY=$NEW_KEY and remove FIELD_ENCRYPTION_KEY_NEXT. Deploy triggers rolling restart — zero downtime.
4
Rotate R2 document keys in parallel
R2 documents use a separate DOC_ENCRYPTION_KEY. Rotate independently on the same quarterly cadence. See Section 4.
Rotation cadence: Quarterly for field encryption keys, quarterly for document encryption keys. On any suspected credential compromise, rotate immediately and treat as an incident.
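
As an illustration of the rotation window, the sketch below shows one way a decrypt path could accept both keys and how a batch step could re-encrypt a single value. It reuses the iv:tag:ciphertext hex format from lib/crypto.ts; the function names (encryptWithKey, decryptWithAnyKey, reencryptField) are hypothetical and not part of the existing code.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Same iv:tag:ciphertext hex layout as lib/crypto.ts, with the key passed in explicitly.
export function encryptWithKey(plaintext: string, key: Buffer): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // only available after final()
  return [iv, tag, encrypted].map((part) => part.toString('hex')).join(':');
}

export function decryptWithKey(stored: string, key: Buffer): string {
  const [ivHex, tagHex, encHex] = stored.split(':');
  if (ivHex === undefined || tagHex === undefined || encHex === undefined) {
    throw new Error('Malformed encrypted field');
  }
  const decipher = createDecipheriv('aes-256-gcm', key, Buffer.from(ivHex, 'hex'));
  decipher.setAuthTag(Buffer.from(tagHex, 'hex'));
  return Buffer.concat([decipher.update(Buffer.from(encHex, 'hex')), decipher.final()])
    .toString('utf8');
}

// Try each candidate key in order. GCM authentication makes a wrong key throw,
// so a successful return means the value was encrypted under that key.
export function decryptWithAnyKey(stored: string, keys: Buffer[]): string {
  for (const key of keys) {
    try {
      return decryptWithKey(stored, key);
    } catch {
      // wrong key for this value: try the next candidate
    }
  }
  throw new Error('No configured key could decrypt this value');
}

// One unit of the background job: decrypt under the old key, encrypt under the new one.
export function reencryptField(stored: string, oldKey: Buffer, newKey: Buffer): string {
  return encryptWithKey(decryptWithKey(stored, oldKey), newKey);
}
```

During rotation the API would call decryptWithAnyKey with the new key first and the old key second; after the secrets are swapped, only one key remains in the list.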

Backup and PITR

Stage | Neon plan | PITR window | Automated backup
Launch | Pro ($19/mo) | 7 days | Yes, continuous WAL shipping
Growth | Pro ($69/mo) | 7 days | Yes
Scale | Scale ($149/mo) | 30 days | Yes + dedicated compute
# Restore to a point in time (Neon CLI)
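# neonctl takes the restore point through --parent (branch name, LSN, or timestamp);
# confirm the exact syntax with: neonctl branches create --help
neonctl branches create \
  --name recovery-test \
  --project-id <project-id> \
  --parent 2026-04-01T03:00:00Z

# Connect to restored branch and verify
neonctl connection-string recovery-test --project-id <project-id>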
4

Document Vault (Cloudflare R2)

Death certificates, SSN documents, financial statements, and insurance policies are stored in Cloudflare R2. R2 provides at-rest encryption natively, but Settle adds a second layer of application-level AES-256-GCM encryption before upload. This means even a Cloudflare-level compromise cannot expose document contents without the application encryption key.

Bucket structure

# One bucket, logical prefix per estate
settle-documents/
  estates/{estate_id}/
    {document_id}.enc        # encrypted file
    {document_id}.meta.json  # encrypted metadata sidecar

# Create bucket via Wrangler
npx wrangler r2 bucket create settle-documents
npx wrangler r2 bucket create settle-documents-staging
npx wrangler r2 bucket create settle-documents-dev

Encryption before upload

// lib/vault.ts
import { S3Client, PutObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const r2 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT, // https://{acct}.r2.cloudflarestorage.com
  credentials: {
    accessKeyId:     process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

const DOC_KEY = Buffer.from(process.env.DOC_ENCRYPTION_KEY!, 'hex');

export async function uploadDocument(estateId: string, docId: string, buffer: Buffer) {
  const iv         = randomBytes(16);
  const cipher     = createCipheriv('aes-256-gcm', DOC_KEY, iv);
  const encrypted  = Buffer.concat([cipher.update(buffer), cipher.final()]);
  const tag        = cipher.getAuthTag();
  // Prepend iv + tag to ciphertext for single-blob storage
  const payload    = Buffer.concat([iv, tag, encrypted]);

  await r2.send(new PutObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key:    `estates/${estateId}/${docId}.enc`,
    Body:   payload,
  }));
}

export async function getSignedDownloadUrl(estateId: string, docId: string): Promise<string> {
  const cmd = new GetObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key:    `estates/${estateId}/${docId}.enc`,
  });
  // 15-minute expiry — short to limit exposure of sensitive docs
  return getSignedUrl(r2, cmd, { expiresIn: 900 });
}
Download flow: The client holds no R2 credentials and reaches R2 only through short-lived signed URLs that the API issues to authenticated users. Because objects are stored encrypted, the blob behind a signed URL is ciphertext; the API (or a dedicated decrypt endpoint) strips the iv/tag prefix, decrypts, and streams the plaintext document to the browser. Signed URLs expire after 15 minutes; to download again, the user requests a new URL.
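
The decrypt side of the vault is not shown in lib/vault.ts. A minimal sketch, assuming the same single-blob layout that uploadDocument writes (16-byte iv, then 16-byte auth tag, then ciphertext), might look like this. The function names are hypothetical, and the encrypt helper only mirrors the upload path so the example is self-contained.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

const IV_LENGTH = 16;  // matches randomBytes(16) in uploadDocument
const TAG_LENGTH = 16; // default AES-GCM auth tag length in Node.js

// Mirrors the payload layout produced by uploadDocument: iv | tag | ciphertext.
export function encryptDocumentPayload(plain: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(IV_LENGTH);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const encrypted = Buffer.concat([cipher.update(plain), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), encrypted]);
}

// Splits the stored blob back into its parts and authenticates while decrypting.
export function decryptDocumentPayload(payload: Buffer, key: Buffer): Buffer {
  if (payload.length < IV_LENGTH + TAG_LENGTH) {
    throw new Error('Payload too short to contain iv and auth tag');
  }
  const iv = payload.subarray(0, IV_LENGTH);
  const tag = payload.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const ciphertext = payload.subarray(IV_LENGTH + TAG_LENGTH);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the blob was modified or the key is wrong
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

For large documents a streaming variant would avoid buffering the whole file in memory; this buffered version is only meant to make the byte layout concrete.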

Retention policy

R2 lifecycle rules are not yet as mature as S3, so retention enforcement is handled at the application layer. When an estate is archived (closed + 7 years elapsed), a scheduled worker run lists all objects under estates/{estate_id}/, deletes them, and logs the deletion event to the immutable audit log.
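
The "closed + 7 years elapsed" rule can be expressed as a small predicate that the scheduled worker evaluates per estate before deleting anything. This is a sketch under assumptions: the function name and the closedAt input are hypothetical, and the calculation uses UTC calendar years.

```typescript
// Returns true once the retention period after closure has fully elapsed.
// closedAt is null for estates that are still open.
export function isEligibleForPurge(
  closedAt: Date | null,
  now: Date,
  retentionYears = 7,
): boolean {
  if (closedAt === null) return false; // open estates are never purged
  const purgeAfter = new Date(closedAt.getTime());
  purgeAfter.setUTCFullYear(purgeAfter.getUTCFullYear() + retentionYears);
  return now.getTime() >= purgeAfter.getTime();
}
```

Keeping the rule in one pure function makes it easy to unit test and to log alongside the deletion event in the audit log.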

Environment setup

# Fly secrets for API
fly secrets set -a settle-api \
  R2_ENDPOINT="https://<account_id>.r2.cloudflarestorage.com" \
  R2_ACCESS_KEY_ID="..." \
  R2_SECRET_ACCESS_KEY="..." \
  R2_BUCKET="settle-documents" \
  DOC_ENCRYPTION_KEY="$(openssl rand -hex 32)"
5

Physical Mail Pipeline (Lob)

Tier 2 notifications — formal written notification to financial institutions, insurance companies, and government agencies — require actual physical letters. Lob handles printing, postage, and USPS delivery. This is infrastructure, not a product decision: the legal requirement for written notice is what makes physical mail necessary, not a UX preference.

Letter generation flow

1
Job enqueued to Redis
API places a send_letter job: { estate_id, notification_type, recipient_id, data_snapshot }. The data_snapshot includes all fields needed to render the letter, so the worker needs no database read to render it.
2
Worker renders HTML template
The Notification Worker picks up the job, selects the matching HTML template from /templates/letters/{notification_type}.html, fills in the data snapshot using Handlebars, and produces a fully rendered HTML document.
3
HTML posted directly to Lob
Lob accepts HTML directly — no separate PDF generation step needed. The worker posts the rendered HTML to POST /v1/letters with a return address (Settle's virtual office address) and recipient address.
4
Lob ID stored in notification_records
The Lob ltr_xxx ID is stored in notification_records.lob_id. Initial status: mailed.
5
Lob webhook updates delivery status
Lob POSTs delivery events to POST /webhooks/lob. The API verifies the Lob-Signature header and updates notification_records.status. Events: letter.mailed, letter.in_transit, letter.in_local_area, letter.processed_for_delivery, letter.re_routed, or letter.returned_to_sender.
6
Return mail handling
On letter.returned_to_sender webhook event, flag the notification record as undeliverable and create a user task: "Verify mailing address for {institution}." This appears in the next Daily Three.

Lob API integration

// workers/notify/lob.ts
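import { Configuration, LettersApi } from '@lob/lob-typescript-sdk';

// The SDK exposes one API class per resource, configured with the API key as username
const lob = {
  lettersApi: new LettersApi(new Configuration({ username: process.env.LOB_API_KEY! })),
};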

export async function sendLetter(job: LetterJob): Promise<string> {
  const html   = await renderTemplate(job.notification_type, job.data_snapshot);
  const letter = await lob.lettersApi.create({
    description: `Settle notification: ${job.notification_type}`,
    to: {
      name:              job.data_snapshot.recipient_name,
      address_line1:     job.data_snapshot.recipient_address1,
      address_line2:     job.data_snapshot.recipient_address2 ?? '',
      address_city:      job.data_snapshot.recipient_city,
      address_state:     job.data_snapshot.recipient_state,
      address_zip:       job.data_snapshot.recipient_zip,
      address_country:   'US',
    },
    from: {
      name:            'Settle Estate Administration',
      address_line1:   process.env.SETTLE_RETURN_ADDRESS1!,
      address_city:    process.env.SETTLE_RETURN_CITY!,
      address_state:   process.env.SETTLE_RETURN_STATE!,
      address_zip:     process.env.SETTLE_RETURN_ZIP!,
      address_country: 'US',
    },
    file:  html,
    color: false,    // B&W — reduces cost to ~$1.20/letter
    double_sided: false,
    address_placement: 'top_first_page',
    mail_type: 'usps_first_class',
  });

  return letter.id; // ltr_xxx — store in notification_records.lob_id
}

Letter templates

Notification type | Template file | Recipient
bank_account_closure | bank-account-closure.html | Bank institution
insurance_claim_initiation | insurance-claim.html | Insurance company
investment_account_closure | investment-closure.html | Brokerage
pension_claim | pension-claim.html | Plan administrator
utility_cancellation | utility-cancel.html | Utility provider
subscription_cancellation | subscription-cancel.html | Service provider
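
Because the template file names do not match the notification type strings one-to-one, the worker needs an explicit lookup. A minimal sketch of that lookup follows; the constant and function names are hypothetical, and the base directory is taken from the letter generation flow.

```typescript
// Notification type to template file, copied from the template table.
export const LETTER_TEMPLATES = {
  bank_account_closure: 'bank-account-closure.html',
  insurance_claim_initiation: 'insurance-claim.html',
  investment_account_closure: 'investment-closure.html',
  pension_claim: 'pension-claim.html',
  utility_cancellation: 'utility-cancel.html',
  subscription_cancellation: 'subscription-cancel.html',
} as const;

export type NotificationType = keyof typeof LETTER_TEMPLATES;

// Type guard so job payloads from the queue are validated before use.
export function isNotificationType(value: string): value is NotificationType {
  return Object.prototype.hasOwnProperty.call(LETTER_TEMPLATES, value);
}

// Resolves the template path, failing loudly on unknown types rather than
// sending a letter rendered from the wrong template.
export function templatePathFor(type: string, baseDir = '/templates/letters'): string {
  if (!isNotificationType(type)) {
    throw new Error(`Unknown notification type: ${type}`);
  }
  return `${baseDir}/${LETTER_TEMPLATES[type]}`;
}
```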

Webhook endpoint

// api/routes/webhooks/lob.ts
import { verifyLobSignature } from './lob-verify';

router.post('/webhooks/lob', async (req, res) => {
  if (!verifyLobSignature(req)) return res.status(401).send();

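  // Lob sends event_type as an object; the event name is in event_type.id
  const eventType: string | undefined = req.body?.event_type?.id;
  const lobId: string | undefined = req.body?.body?.id;
  const statusMap: Record<string, string> = {
    'letter.mailed':                  'mailed',
    'letter.in_transit':              'in_transit',
    'letter.processed_for_delivery':  'delivered',
    'letter.returned_to_sender':      'undeliverable',
  };

  const newStatus = eventType && lobId ? statusMap[eventType] : undefined;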
  if (!newStatus) return res.status(200).send();

  await db.query(
    'UPDATE notification_records SET status=$1, updated_at=NOW() WHERE lob_id=$2',
    [newStatus, lobId]
  );

  if (newStatus === 'undeliverable') await createReturnMailTask(lobId);

  res.status(200).send();
});
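
The verifyLobSignature helper imported above is not shown. The sketch below illustrates the general shape of an HMAC check. It assumes the signature is a hex-encoded HMAC-SHA256 over the timestamp, a literal dot, and the raw request body, keyed with the webhook secret, with the timestamp sent in a separate header; confirm the exact scheme and header names against Lob's webhook documentation before relying on it. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumed scheme: HMAC-SHA256(secret, `${timestamp}.${rawBody}`), hex-encoded.
export function computeLobSignature(secret: string, timestamp: string, rawBody: string): string {
  return createHmac('sha256', secret).update(`${timestamp}.${rawBody}`).digest('hex');
}

// Compares signatures in constant time to avoid leaking how many bytes matched.
export function isValidLobSignature(
  secret: string,
  timestamp: string,
  rawBody: string,
  receivedSignature: string,
): boolean {
  const expected = Buffer.from(computeLobSignature(secret, timestamp, rawBody), 'hex');
  const received = Buffer.from(receivedSignature, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The check must run against the raw request bytes, not re-serialized JSON, so the webhook route needs access to the unparsed body. Rejecting stale timestamps would add replay protection.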

Cost model

Lob pricing (US letters, B&W, single-sided, first class): approximately $1.20–1.50 per letter including printing and postage. At 2,000 letters/month (the Scale-stage volume projected in Section 9): ~$2,400–3,000/month. This cost is directly proportional to estate volume and should be modeled as a per-estate variable cost (~$8–15/estate lifetime in letter fees). Build this into pricing.
6

CI/CD Pipeline

GitHub Actions handles all build, test, and deploy automation. The pipeline has three tracks: pull request validation, staging deployment on main merge, and production deployment on version tag.

Pipeline stages

# .github/workflows/ci.yml
name: CI

on:
  pull_request:
  push:
    branches: [main]
    tags: ['v*']

jobs:
  lint-typecheck:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  test:
    runs-on: ubuntu-24.04
    needs: lint-typecheck
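    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: test, POSTGRES_DB: test }
        ports: ['5432:5432']   # expose to localhost for the job steps
        options: --health-cmd pg_isready --health-interval 10s --health-retries 5
      redis:
        image: redis:7         # integration tests read REDIS_URL
        ports: ['6379:6379']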
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci
      - run: npm run test:unit
      - run: npm run test:integration
        env:
          DATABASE_URL: postgres://postgres:test@localhost/test
          REDIS_URL: redis://localhost:6379

  security-scan:
    runs-on: ubuntu-24.04
    needs: lint-typecheck
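    permissions:
      security-events: write   # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
      # CodeQL needs init before analyze; the language is an input of init
      - uses: github/codeql-action/init@v3
        with: { languages: javascript }
      - uses: github/codeql-action/analyze@v3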

  deploy-staging:
    runs-on: ubuntu-24.04
    needs: [test, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: staging
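    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci   # migration and smoke-test scripts need dependencies
      - uses: superfly/flyctl-actions/setup-flyctl@master
      # Migrate before deploying, in the same order as production
      - run: npm run db:migrate:staging
        env:
          DATABASE_DIRECT_URL: ${{ secrets.NEON_STAGING_DIRECT_URL }}
      # Block-style env: ${{ }} expressions are not valid inside a YAML flow mapping
      - run: flyctl deploy --app settle-api-staging --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_STAGING }}
      - run: flyctl deploy --app settle-worker-notify-staging --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_STAGING }}
      - run: npm run test:smoke -- --env staging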

  deploy-production:
    runs-on: ubuntu-24.04
    needs: [test, security-scan]
    if: startsWith(github.ref, 'refs/tags/v')
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22', cache: 'npm' }
      - run: npm ci   # the migration script needs project dependencies
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: npm run db:migrate:production
        env:
          DATABASE_DIRECT_URL: ${{ secrets.NEON_PRODUCTION_DIRECT_URL }}
      # Block-style env: ${{ }} expressions are not valid inside a YAML flow mapping
      - run: flyctl deploy --app settle-api --strategy rolling --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-notify --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-scanner --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
      - run: flyctl deploy --app settle-worker-digest --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN_PROD }}
Deployment strategy: API uses --strategy rolling for zero-downtime deploys. Workers use the default deploy strategy since they are scale-to-zero and have no live traffic to protect. Database migrations run before the API deploy, so migrations must always be backward-compatible with the previous API version.

Required GitHub secrets

Secret | Used by
FLY_API_TOKEN_STAGING | Staging deploy jobs
FLY_API_TOKEN_PROD | Production deploy jobs
NEON_STAGING_DIRECT_URL | db:migrate:staging
NEON_PRODUCTION_DIRECT_URL | db:migrate:production
NETLIFY_AUTH_TOKEN | Netlify builds (auto-configured by Netlify GitHub app)
7

Environment Management

Environment | Frontend | API | Database | R2 Bucket | Lob mode
local | localhost:5173 | localhost:3000 | Neon dev branch | settle-documents-dev | Test mode (no real mail)
preview | Netlify deploy preview | settle-api-staging | Neon staging branch | settle-documents-staging | Test mode
staging | staging.settle.app (Netlify) | settle-api-staging.fly.dev | Neon staging branch | settle-documents-staging | Test mode
production | settle.app (Netlify) | settle-api.fly.dev | Neon main branch | settle-documents | Live mode

Local development setup

# .env.local (gitignored)
DATABASE_URL=postgres://...neon.tech/settle?pgbouncer=true
DATABASE_DIRECT_URL=postgres://...neon.tech/settle
REDIS_URL=rediss://default:xxx@us1-xxx.upstash.io:6379
R2_ENDPOINT=https://<account>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=...
R2_SECRET_ACCESS_KEY=...
R2_BUCKET=settle-documents-dev
LOB_API_KEY=test_xxx                    # Lob test key — no real letters sent
RESEND_API_KEY=re_xxx
# dotenv files do not execute $(...); paste the output of: openssl rand -hex 32
FIELD_ENCRYPTION_KEY=<64 hex characters>
DOC_ENCRYPTION_KEY=<64 hex characters, generated separately>
SENTRY_DSN=                             # leave empty in local — avoids noise

# Start local API (from /api directory)
npm run dev

# Start frontend (from /web directory)
npm run dev

Long-lived workflow considerations

Estates are active for 16–18 months. The infrastructure must handle:

  • Sessions stored in Redis with TTL of 30 days (refreshed on activity). Users are never forced to log in mid-task.
  • Estate state lives in Postgres, not sessions. Closing a browser and returning 3 months later yields the same estate state.
  • Re-engagement: Digest Generator checks last_active_at per user. If >14 days inactive, escalates the Daily Three email with a "Welcome back" header. If >60 days, triggers a human-written check-in email via Resend.
  • Worker jobs are idempotent: duplicate enqueues are safe because the API checks Redis for an existing job with the same estate_id + job_type before enqueueing.
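
The duplicate-enqueue check in the last bullet can be made atomic with a set-if-absent operation, which avoids the race between "check" and "enqueue". The sketch below is an assumption-laden illustration: the DedupeStore interface, the key format, and the one-hour TTL are hypothetical, and the in-memory store stands in for Redis (where the equivalent is SET with the NX and EX options).

```typescript
export interface Job {
  estate_id: string;
  job_type: string;
  payload: unknown;
}

export interface DedupeStore {
  // Expected to behave like Redis SET key value NX EX ttl:
  // resolves true only if the key did not already exist.
  setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
}

// One key per estate and job type, matching the dedupe rule in the text.
export function dedupeKey(job: Job): string {
  return `job:${job.job_type}:${job.estate_id}`;
}

// Enqueues the job only if no job with the same key is already pending.
// Returns false when the enqueue was skipped as a duplicate.
export async function enqueueOnce(
  store: DedupeStore,
  push: (job: Job) => Promise<void>,
  job: Job,
  ttlSeconds = 3600,
): Promise<boolean> {
  const claimed = await store.setIfAbsent(dedupeKey(job), '1', ttlSeconds);
  if (!claimed) return false;
  await push(job);
  return true;
}

// In-memory stand-in for Redis, for local runs and tests only.
export class InMemoryDedupeStore implements DedupeStore {
  private expiries = new Map<string, number>(); // key -> expiry (epoch ms)

  async setIfAbsent(key: string, _value: string, ttlSeconds: number): Promise<boolean> {
    const now = Date.now();
    const expiry = this.expiries.get(key);
    if (expiry !== undefined && expiry > now) return false;
    this.expiries.set(key, now + ttlSeconds * 1000);
    return true;
  }
}
```

The worker would delete the key (or let the TTL lapse) once the job completes, so a later legitimate job of the same type is not blocked.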
8

Monitoring & Observability

Stack

Fly.io built-in metrics

CPU, memory, request latency per machine. Available in Fly dashboard. Export to Grafana via Prometheus scrape endpoint.

Grafana Cloud (free)

Dashboards for custom business metrics. Free tier: 10k series, 14-day retention — more than sufficient at launch and growth stages.

Sentry

Error tracking for API and all workers. Free tier until ~Scale stage (~$26/mo on Team plan for better retention). Set up one project per service.

Custom business metrics (push to Grafana)

Standard infrastructure metrics (CPU, latency) are table stakes. The metrics below are the ones that show whether the product itself is working.

Metric | How measured | Alert threshold
notification_tier1_success_rate | % of Tier 1 API calls that return 2xx within 30s | Alert if <80% over 1h
notification_tier2_delivery_rate | % of Lob letters reaching processed_for_delivery status within 7 days | Alert if <85% over 7d window
notification_tier2_return_rate | % of Lob letters returned_to_sender | Alert if >5% over 7d window
benefit_scan_hit_rate | % of estate scans that find at least one potential benefit | Informational: track weekly trend
benefit_scan_external_api_errors | Error rate per external API (NAUPA, NAIC, PBGC, VA) | Alert if circuit breaker opens
estate_completion_rate | % of estates reaching "closed" status | Informational: monthly review
estate_avg_days_to_close | Median days from estate creation to closed status | Informational: product health
daily_digest_delivery_rate | % of digest emails successfully accepted by Resend | Alert if <95%
worker_job_queue_depth | Redis queue length per worker type | Alert if >100 jobs queued for >10min

Grafana setup

# Grafana Cloud — create free account at grafana.com/products/cloud
# Get Prometheus push endpoint and API key

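# In API: expose metrics with prom-client. prom-client serves a scrape endpoint
# rather than pushing, so a Prometheus-compatible agent must scrape /metrics
# and forward the samples to Grafana Cloud.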
npm install prom-client

// lib/metrics.ts
import { Gauge, Counter, collectDefaultMetrics, register } from 'prom-client';
collectDefaultMetrics();

export const notificationSuccessRate = new Gauge({
  name: 'settle_notification_tier1_success_rate',
  help: '5-minute rolling success rate for Tier 1 notifications',
});

// Expose /metrics endpoint (Fly.io internal only)
router.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
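
The gauge above only stores a number; something has to compute the 5-minute rolling success rate it reports. One way to do that is a small in-process window of recent outcomes, sketched below. The class name and the injectable clock are assumptions for illustration.

```typescript
// Tracks call outcomes over a sliding time window and reports the success ratio.
export class RollingSuccessRate {
  private events: { at: number; ok: boolean }[] = [];
  private windowMs: number;
  private now: () => number;

  constructor(windowMs = 5 * 60_000, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.now = now; // injectable clock for testing
  }

  /** Record the outcome of one Tier 1 notification call. */
  record(ok: boolean): void {
    this.events.push({ at: this.now(), ok });
    this.prune();
  }

  /** Success ratio in [0, 1], or null when the window holds no events. */
  rate(): number | null {
    this.prune();
    if (this.events.length === 0) return null;
    const successes = this.events.filter((event) => event.ok).length;
    return successes / this.events.length;
  }

  // Drop events that have aged out of the window (events are in time order).
  private prune(): void {
    const cutoff = this.now() - this.windowMs;
    while (this.events.length > 0 && this.events[0].at <= cutoff) {
      this.events.shift();
    }
  }
}
```

A timer in the API could then call rate() periodically and pass any non-null value to notificationSuccessRate.set(). With two API regions, each machine reports its own window, so dashboards should aggregate across instances.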

Sentry setup

# Create projects in Sentry UI, then:
npm install @sentry/node

// api/instrument.ts (import before everything else)
import * as Sentry from '@sentry/node';
Sentry.init({
  dsn:              process.env.SENTRY_DSN,
  environment:      process.env.NODE_ENV,
  tracesSampleRate: 0.1,  // 10% traces — sufficient at launch scale
  beforeSend(event) {
    // Strip PII from error contexts before sending to Sentry
    if (event.user) delete event.user.email;
    return event;
  },
});
9

Cost Projections

Launch: 200 estates · 400 users · <5 req/s
Growth (Year 1): 2,000 estates · 4,000 users · <20 req/s
Scale (Year 2): 20,000 estates · 40,000 users · <100 req/s

Service | Launch / mo | Growth / mo | Scale / mo | Notes
Netlify Pro | $19 | $19 | $19 | Includes 1M Edge Function requests/mo. Upgrade to Business ($99) at ~50k MAU if bandwidth spikes.
Fly.io API (2 regions) | $10 | $30 | $80 | 2× shared-cpu-1x 512MB at launch. Scale to 8× machines at Scale tier. Includes egress.
Fly.io Workers (3 apps) | $5 | $20 | $60 | Scale-to-zero, billed per second of execution. Digest generator is the most active at $2–5/mo.
Neon PostgreSQL | $19 | $69 | $149 | Pro at Launch/Growth. Scale plan at Year 2 (dedicated compute, 30-day PITR, 1TB storage).
Cloudflare R2 | $0 | $5 | $20 | Free tier: 10GB storage, 1M Class A ops/mo. Growth: ~50GB. Scale: ~200GB. No egress fees.
Upstash Redis | $0 | $10 | $25 | Free tier handles Launch. Pay-as-you-go at Growth (~$0.2/100k commands). Scale: fixed $25/mo plan.
Lob (physical letters) | $50 | $500 | $3,000 | ~30 letters/mo at Launch (200 estates × ~15%). ~$1.50/letter avg. See note below.
Resend (email) | $0 | $20 | $50 | Free tier: 3,000 emails/mo. Growth: ~20k/mo on $20 plan. Scale: ~100k/mo on Pro.
Sentry | $0 | $0 | $26 | Free Developer tier through Growth. Team plan ($26/mo) adds 90-day retention at Scale.
Grafana Cloud | $0 | $0 | $0 | Free tier (10k series, 14d retention) is sufficient through Scale. Upgrade only if adding logs ingestion.
Total | ~$103/mo | ~$673/mo | ~$3,429/mo |
Annual | ~$1,236/yr | ~$8,076/yr | ~$41,148/yr |
Lob dominates at Scale. At 20,000 active estates, even a 10% Tier 2 notification rate is 2,000 letters/month at ~$3,000. This is directly tied to estate volume — it is a variable cost of the product, not infrastructure overhead. At $X/estate/month subscription pricing, model Lob as 1–3% of revenue. Negotiate volume pricing with Lob above 5,000 letters/month.

Cost per estate (monthly)

Stage | Total infra cost | Active estates | Cost/estate/mo
Launch | $103 | 200 | $0.52
Growth | $673 | 2,000 | $0.34
Scale | $3,429 | 20,000 | $0.17

Infrastructure cost per estate declines as you scale. Lob is the only cost that scales linearly with estate count — everything else is largely fixed or grows sub-linearly.
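
The totals and per-estate figures above follow directly from the line items in the cost table. The sketch below reproduces that arithmetic so the model can be re-run when a line item changes; the variable and function names are illustrative only.

```typescript
type Stage = 'Launch' | 'Growth' | 'Scale';

// Monthly line items in table order: Netlify, Fly API, Fly workers, Neon,
// R2, Upstash, Lob, Resend, Sentry, Grafana.
const MONTHLY_COSTS: Record<Stage, number[]> = {
  Launch: [19, 10, 5, 19, 0, 0, 50, 0, 0, 0],
  Growth: [19, 30, 20, 69, 5, 10, 500, 20, 0, 0],
  Scale: [19, 80, 60, 149, 20, 25, 3000, 50, 26, 0],
};

const ACTIVE_ESTATES: Record<Stage, number> = {
  Launch: 200,
  Growth: 2000,
  Scale: 20000,
};

export interface StageSummary {
  monthlyTotal: number;
  annualTotal: number;
  costPerEstate: number; // dollars per estate per month, rounded to cents
}

export function summarizeStage(stage: Stage): StageSummary {
  const monthlyTotal = MONTHLY_COSTS[stage].reduce((sum, cost) => sum + cost, 0);
  return {
    monthlyTotal,
    annualTotal: monthlyTotal * 12,
    // Work in cents before rounding to avoid floating-point drift
    costPerEstate: Math.round((monthlyTotal * 100) / ACTIVE_ESTATES[stage]) / 100,
  };
}
```

For example, summarizeStage('Scale') yields a monthly total of 3,429, of which the 3,000 Lob line item is roughly 87%.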

10

Security Infrastructure

TLS everywhere

  • Netlify auto-provisions Let's Encrypt certificates. HSTS enforced with 1-year max-age + includeSubDomains.
  • Fly.io auto-provisions Let's Encrypt for *.fly.dev domains. Add custom domain via fly certs add api.settle.app.
  • Neon connections use TLS by default. sslmode=require enforced in connection string.
  • Upstash Redis uses TLS (rediss:// scheme). No unencrypted connections accepted.
  • R2 accessed over HTTPS only. No public bucket access — all reads require signed URLs generated by authenticated API.

Private networking (Fly.io 6PN)

# Workers communicate with the API over the Fly private network
# No public ports opened on workers
# API internal address (accessible only from Fly apps in same org):
#   settle-api.internal

# In worker config: connect to Neon via DATABASE_URL directly
# (Neon is external to Fly, so TLS is the transport, not 6PN)

# Block all inbound on workers (fly.toml)
[http_service]
  internal_port = 3000
  auto_stop_machines = "stop"
  # Caveat: [http_service] is itself a service definition routed by Fly Proxy.
  # Workers stay private only while their apps have no public IPs allocated;
  # check with: fly ips list -a settle-worker-notify

Secrets management

# All secrets stored in Fly.io secrets — never in environment files or code
fly secrets set -a settle-api \
  DATABASE_URL="postgres://..." \
  FIELD_ENCRYPTION_KEY="$(openssl rand -hex 32)" \
  DOC_ENCRYPTION_KEY="$(openssl rand -hex 32)" \
  LOB_API_KEY="live_..." \
  RESEND_API_KEY="re_..." \
  UPSTASH_REDIS_URL="rediss://..." \
  SENTRY_DSN="https://...@sentry.io/..."

# List secrets (shows names only, not values)
fly secrets list -a settle-api

# Netlify secrets — set via Netlify UI or CLI
netlify env:set PUBLIC_API_URL https://api.settle.app

IAM — principle of least privilege

Component | Access granted | Access denied
API (settle-api) | Read/write Neon (specific tables via role), R2 get/put/delete own prefix, Upstash read/write, Resend send | R2 bucket delete, Neon schema changes, Fly API token
Workers | Read/write Neon (specific tables), R2 get/put, Upstash read/write, Lob API create | R2 bucket delete, Neon schema changes, Resend (email sent via API, not directly)
CI/CD (GitHub Actions) | Fly deploy token (per-app), Neon direct URL (migrations only) | Production Fly token from staging jobs, database drops
Netlify | Read API URL env var, deploy to CDN | No database access, no encryption keys

PII access logging

// middleware/audit.ts — log every PII field access
export function auditPiiAccess(userId: string, field: string, estateId: string) {
  // Written to immutable audit_log table — no UPDATE/DELETE allowed on this table
  return db.query(`
    INSERT INTO audit_log (user_id, action, resource_type, resource_id, field_name, occurred_at)
    VALUES ($1, 'READ_PII', 'estate', $2, $3, NOW())
  `, [userId, estateId, field]);
}

// Immutable audit_log — prevent tampering
-- Run once as superuser:
CREATE RULE no_update_audit AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
CREATE RULE no_delete_audit AS ON DELETE TO audit_log DO INSTEAD NOTHING;
-- Grant INSERT-only to app role:
GRANT INSERT ON audit_log TO settle_app;
REVOKE UPDATE, DELETE ON audit_log FROM settle_app;

IP allowlisting for admin endpoints

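// Restrict /admin/* and /internal/* routes to office IP + Fly private network
// In Express middleware (req.ip is the real client address only when
// 'trust proxy' is configured for the proxy in front of the app):
const ALLOWED_IPS = (process.env.ADMIN_ALLOWED_IPS ?? '')
  .split(',')
  .map(entry => entry.trim())
  .filter(Boolean); // an empty entry would match every address via startsWith('')

router.use(['/admin', '/internal'], (req, res, next) => {
  const ip = req.ip ?? '';
  // With no entries configured, every request is rejected (fail closed).
  if (!ALLOWED_IPS.some(allowed => ip.startsWith(allowed))) {
    return res.status(403).json({ error: 'Forbidden' });
  }
  next();
});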
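
String-prefix matching on IP addresses can over-match: an entry of `10.1` also admits `10.100.0.1`. A stricter option is exact or CIDR matching. The sketch below is illustrative only and assumes `ADMIN_ALLOWED_IPS` holds IPv4 entries such as `203.0.113.7` or `10.0.0.0/8`; `ipv4ToInt` and `isAllowed` are hypothetical helper names. Fly's 6PN addresses are IPv6, so they would need a separate branch or a library.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit number, or null if invalid.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` equals an allowlist entry or falls inside an entry's CIDR range.
function isAllowed(ip: string, allowlist: string[]): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return false;
  return allowlist.some((entry) => {
    const [base = '', bitsText = '32'] = entry.trim().split('/');
    const baseInt = ipv4ToInt(base);
    if (baseInt === null || !/^\d{1,2}$/.test(bitsText)) return false;
    const bits = Number(bitsText);
    if (bits > 32) return false;
    // Two addresses share a /bits network when they agree above the host bits.
    const size = 2 ** (32 - bits);
    return Math.floor(addr / size) === Math.floor(baseInt / size);
  });
}
```

Malformed entries never match, so a typo in the environment variable locks admins out rather than opening the routes.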
11

Disaster Recovery

Recovery Time Objective (RTO)

30 minutes

Time from incident declaration to full service restoration. Achievable through pre-provisioned standby infrastructure and documented runbooks.

Recovery Point Objective (RPO)

5 minutes

Maximum acceptable data loss window. Achieved via Neon continuous WAL shipping, which typically has <1min lag to durable storage.

Scenario 1: Fly.io API region failure (iad goes down)

1. Detection (0–3 min): Grafana alert fires on API error rate >5% for 2 consecutive minutes. PagerDuty notifies the on-call engineer.
2. Automatic failover (3–5 min): Fly.io anycast routing automatically stops sending traffic to iad machines that fail health checks. Traffic shifts to ord. No manual action required.
3. Verify (5–10 min): Run fly status -a settle-api to confirm ord machines are healthy and handling traffic. Check that the Grafana dashboard shows the error rate returning to baseline.
4. Scale ord if needed (10–20 min): Run fly scale count 4 --region ord -a settle-api to double ord capacity so it can handle full load while iad recovers.
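
The detection rule in this scenario (error rate above 5% for 2 consecutive minutes) is evaluated by Grafana, but its semantics are easy to get wrong when tuning thresholds. The sketch below only illustrates the rule's logic over per-minute samples; `MinuteSample` and `shouldAlert` are hypothetical names, not part of Grafana or this codebase.

```typescript
interface MinuteSample {
  requests: number;
  errors: number;
}

// Returns true once `consecutive` back-to-back minutes exceed the error-rate threshold.
function shouldAlert(samples: MinuteSample[], threshold = 0.05, consecutive = 2): boolean {
  let streak = 0;
  for (const sample of samples) {
    // A minute with no traffic counts as healthy rather than dividing by zero.
    const rate = sample.requests === 0 ? 0 : sample.errors / sample.requests;
    streak = rate > threshold ? streak + 1 : 0;
    if (streak >= consecutive) return true;
  }
  return false;
}
```

A single bad minute resets as soon as a healthy minute follows, which is what keeps brief deploy blips from paging the on-call engineer.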

Scenario 2: Neon database failure / corruption

1. Detection (0–2 min): API returns 500s on database queries. Sentry alert fires immediately. Grafana shows DB error rate spike.
2. Identify recovery point (2–5 min): Log into the Neon console. Identify the last known-good timestamp from audit logs. For corruption: identify the migration or event that caused it.
3. Create recovery branch (5–10 min):
   neonctl branches create \
     --name recovery-$(date +%Y%m%d%H%M) \
     --project-id <id> \
     --parent "2026-04-01T14:55:00Z"
4. Verify data integrity (10–20 min): Connect to the recovery branch. Run integrity checks. Confirm that the estate count, last audit log entry, and last notification record match expectations.
5. Switch API to recovery branch (20–25 min):
   fly secrets set DATABASE_URL=<recovery-branch-url> -a settle-api
   A rolling restart occurs automatically. Verify in Grafana.
6. Promote recovery branch to main (25–30 min): Once verified, run neonctl branches set-default recovery-xxx. Update the API secret back to the primary connection string. Post-incident review within 48h.
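
The integrity check on the recovery branch can be reduced to a small comparison so that the on-call engineer is not eyeballing numbers under pressure. This is a sketch under stated assumptions: the three values are fetched beforehand with ordinary queries, and `BranchSnapshot`, `RecoveryExpectation`, and `verifySnapshot` are hypothetical names.

```typescript
interface BranchSnapshot {
  estateCount: number;
  lastAuditAt: Date;
  lastNotificationAt: Date;
}

interface RecoveryExpectation {
  minEstateCount: number; // assumed to come from the last known-good report
  recoveryPoint: Date;    // the timestamp the recovery branch was created from
  maxAuditGapMs: number;  // how far before the recovery point the newest audit row may be
}

// Returns a list of human-readable problems; an empty list means the branch looks sane.
function verifySnapshot(snap: BranchSnapshot, expected: RecoveryExpectation): string[] {
  const problems: string[] = [];
  const cutoff = expected.recoveryPoint.getTime();
  if (snap.estateCount < expected.minEstateCount) {
    problems.push(`estate count ${snap.estateCount} is below expected minimum ${expected.minEstateCount}`);
  }
  if (snap.lastAuditAt.getTime() > cutoff) {
    problems.push('audit_log contains rows newer than the recovery point');
  }
  if (snap.lastNotificationAt.getTime() > cutoff) {
    problems.push('notification_records contains rows newer than the recovery point');
  }
  if (cutoff - snap.lastAuditAt.getTime() > expected.maxAuditGapMs) {
    problems.push('newest audit_log row is older than expected');
  }
  return problems;
}
```

Only switch `DATABASE_URL` to the recovery branch when the returned list is empty.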

Scenario 3: Compromised encryption key

Immediate actions (within 15 minutes of discovery):
1. Rotate FIELD_ENCRYPTION_KEY and DOC_ENCRYPTION_KEY simultaneously; fly secrets set ... triggers a restart. Keep the old keys available to the re-encryption job, because existing ciphertext can only be decrypted with them.
2. Invalidate all active sessions in Upstash by deleting the session:* keys. Avoid a plain FLUSHDB, which would also wipe any rate-limit counters, scan cache entries, and job queues that share the database.
3. Notify affected users within 72 hours per GDPR/state breach notification requirements.
4. Run the re-encryption job against all encrypted columns before restoring normal operations.
5. Rotate all Fly.io API tokens, Neon credentials, R2 keys, and the Lob API key as a precaution.
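
The re-encryption job has to decrypt each value with the compromised key and encrypt it again with the new one. A minimal sketch of that per-field step follows, using Node's built-in AES-256-GCM. The storage layout (12-byte IV, 16-byte auth tag, then ciphertext, base64-encoded) is an assumption for illustration; the real column format may differ, and `encryptField`, `decryptField`, and `reencryptField` are hypothetical names.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Assumed layout: base64( iv[12] | authTag[16] | ciphertext ).
function encryptField(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh IV per value, as GCM requires
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const body = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]).toString('base64');
}

function decryptField(encoded: string, key: Buffer): string {
  const raw = Buffer.from(encoded, 'base64');
  const decipher = createDecipheriv('aes-256-gcm', key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  // final() throws if the key is wrong or the ciphertext was tampered with.
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString('utf8');
}

// Decrypt with the old key, then encrypt with the new key.
function reencryptField(encoded: string, oldKey: Buffer, newKey: Buffer): string {
  return encryptField(decryptField(encoded, oldKey), newKey);
}
```

Keys generated with `openssl rand -hex 32` would be loaded as `Buffer.from(hexKey, 'hex')` before being passed in.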

Runbook quick reference

# Check all app statuses
fly status -a settle-api
fly status -a settle-worker-notify
fly status -a settle-worker-scanner
fly status -a settle-worker-digest

# View live logs
fly logs -a settle-api
fly logs -a settle-worker-scanner

# Scale up API in emergency
fly scale count 4 --region iad -a settle-api
fly scale count 4 --region ord -a settle-api

# Restart all API machines (e.g., after secrets rotation)
fly apps restart settle-api

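# Check Redis queue depths (single quotes keep the variable from expanding locally;
# it must resolve on the machine, where the UPSTASH_REDIS_URL secret is set)
fly ssh console -a settle-api -C 'sh -c "redis-cli -u $UPSTASH_REDIS_URL LLEN queue:notify"'
fly ssh console -a settle-api -C 'sh -c "redis-cli -u $UPSTASH_REDIS_URL LLEN queue:scan"'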
12

Compliance Checklist

Settle handles death certificates, Social Security numbers, financial records, and health insurance data. This section documents the infrastructure controls required for SOC 2 Type II readiness and provable compliance with data deletion obligations.

SOC 2 readiness (infrastructure controls)

| Control | Implementation | Status |
| --- | --- | --- |
| CC6.1: Logical access controls | Role-based DB access, Fly secrets, IAM principle of least privilege documented above | Implemented |
| CC6.2: Encryption at rest | AES-256-GCM for PII fields + documents, Neon encrypted at rest, R2 encrypted at rest | Implemented |
| CC6.3: Encryption in transit | TLS 1.2+ on all connections, HSTS enforced, no plaintext protocols | Implemented |
| CC6.6: Vulnerability management | npm audit in CI, CodeQL scanning, Dependabot alerts enabled | Implemented |
| CC7.2: Monitoring for anomalies | Sentry error tracking, Grafana anomaly dashboards, PII access audit log | Implemented |
| CC8.1: Change management | All infra changes via GitHub PRs, no direct production mutations, IaC for all services | Partial (Terraform IaC in progress) |
| A1.1: System availability | Multi-region Fly.io, Neon HA, 99.95% SLA on all managed services | Implemented |
| P4.2: Audit log retention | Immutable audit_log table, Neon PITR, 7-year retention policy enforced | Implemented |

Audit log retention (7 years)

-- audit_log schema
CREATE TABLE audit_log (
  id            UUID        DEFAULT gen_random_uuid(),
  occurred_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_id       UUID,
  action        TEXT        NOT NULL,  -- READ_PII, WRITE_PII, DELETE_ESTATE, etc.
  resource_type TEXT        NOT NULL,
  resource_id   UUID,
  field_name    TEXT,
  ip_address    INET,
  user_agent    TEXT,
  metadata      JSONB,
  -- The primary key of a partitioned table must include the partition key
  PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

-- Partition by year for performance and retention management
CREATE TABLE audit_log_2026 PARTITION OF audit_log
  FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');

-- 7-year retention: drop partitions older than 7 years in annual cron job
-- Run: DROP TABLE audit_log_2018 (in 2026); DROP TABLE audit_log_2019 (in 2027); etc.

-- Immutability rules (no UPDATE or DELETE from any role)
CREATE RULE audit_no_update AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
CREATE RULE audit_no_delete AS ON DELETE TO audit_log DO INSTEAD NOTHING;
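
The annual retention job needs to decide which yearly partitions fall outside the 7-year window and to create the partition for the coming year. The helpers below sketch that arithmetic only; `partitionsToDrop` and `createPartitionSql` are hypothetical names, and the `audit_log_<year>` naming is taken from the example partition above.

```typescript
// Names of yearly partitions whose rows are all older than the retention window.
// In 2026 with 7-year retention, 2019 and later are kept, so audit_log_2018 is dropped.
function partitionsToDrop(partitionYears: number[], currentYear: number, retainYears = 7): string[] {
  const oldestKeptYear = currentYear - retainYears;
  return partitionYears
    .filter((year) => year < oldestKeptYear)
    .sort((a, b) => a - b)
    .map((year) => `audit_log_${year}`);
}

// DDL for one yearly partition, matching the FROM/TO bounds used for audit_log_2026.
function createPartitionSql(year: number): string {
  return [
    `CREATE TABLE IF NOT EXISTS audit_log_${year} PARTITION OF audit_log`,
    `  FOR VALUES FROM ('${year}-01-01') TO ('${year + 1}-01-01');`,
  ].join('\n');
}
```

Creating next year's partition in the same job avoids inserts failing on January 1 because no partition covers the new date range.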

Provable data deletion

When a user invokes their right to deletion (GDPR Art. 17, CCPA), all their data must be deleted and that deletion must be provable across every store. This is the deletion checklist:

| Store | What to delete | How to verify |
| --- | --- | --- |
| Neon PostgreSQL | All rows in: persons, estates, documents (metadata), contacts, notification_records, sessions, benefit_scans where user_id = target. Do NOT delete audit_log rows; retain them per compliance obligation. | Run SELECT COUNT(*) after deletion. Export final 0-count query result to compliance record. |
| Cloudflare R2 | All objects under estates/{estate_id}/ for every estate owned by the user. | List objects after deletion and expect an empty result. Log deletion confirmation from R2 API response. |
| Upstash Redis | Session keys: session:{user_id}:*. Rate limit keys: ratelimit:{user_id}:*. Benefit scan cache: scan:{estate_id}:* | SCAN for pattern after deletion and expect 0 matches. |
| Sentry | Use Sentry's "Delete User Data" API endpoint for the user's email/ID. | Sentry API confirms deletion. Log confirmation. |

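// api/admin/delete-user.ts: provable deletion script
// Assumes `db` and `redis` are the app's shared clients and that
// deleteR2Prefix resolves to the number of objects it removed.
interface DeletionStep { store: string; [detail: string]: unknown }
interface DeletionReport { userId: string; startedAt: Date; steps: DeletionStep[] }

export async function deleteUserData(userId: string): Promise<DeletionReport> {
  const report: DeletionReport = { userId, startedAt: new Date(), steps: [] };

  // 1. Get all estate IDs before deletion
  const estates = await db.query('SELECT id FROM estates WHERE user_id = $1', [userId]);

  // 2. Delete R2 documents for each estate
  for (const estate of estates.rows) {
    const deleted = await deleteR2Prefix(`estates/${estate.id}/`);
    report.steps.push({ store: 'r2', estateId: estate.id, deletedObjects: deleted });
  }

  // 3. Delete Postgres rows, children first. Table names come from this fixed
  //    list, never from user input, so interpolating them is safe.
  //    Assumption: every table has a user_id column except users, which is keyed by id.
  const tables = ['notification_records', 'benefit_scans', 'documents',
                  'contacts', 'sessions', 'estates', 'persons', 'users'];
  for (const table of tables) {
    const column = table === 'users' ? 'id' : 'user_id';
    const result = await db.query(`DELETE FROM ${table} WHERE ${column} = $1`, [userId]);
    report.steps.push({ store: 'postgres', table, rowsDeleted: result.rowCount });
  }

  // 4. Delete Redis keys: sessions, rate limits, and each estate's scan cache.
  //    DEL rejects an empty argument list, so skip patterns with no matches.
  const patterns = [`session:${userId}:*`, `ratelimit:${userId}:*`,
                    ...estates.rows.map(estate => `scan:${estate.id}:*`)];
  for (const pattern of patterns) {
    const keys = await redis.keys(pattern);
    if (keys.length > 0) await redis.del(...keys);
    report.steps.push({ store: 'redis', pattern, keysDeleted: keys.length });
  }

  // 5. Write deletion record to audit_log (this entry is retained)
  await db.query(
    'INSERT INTO audit_log (action, resource_type, resource_id, metadata) VALUES ($1,$2,$3,$4)',
    ['USER_DATA_DELETED', 'user', userId, JSON.stringify(report)]
  );

  return report; // return to admin for storage in compliance folder
}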
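
KEYS walks the entire keyspace in one blocking call, so on a shared Redis instance a cursor-based SCAN is usually the safer way to delete by pattern, and the same loop doubles as the "expect 0 matches" verification. The sketch below is written against a minimal `KeyStore` interface that is an assumption, not a real client API; a thin adapter would be needed to map it onto whichever Redis client the project uses.

```typescript
// Hypothetical minimal interface; adapt it to the actual Redis client.
interface KeyStore {
  // Returns one page of matching keys and the next cursor ('0' when iteration is done).
  scan(cursor: string, pattern: string): Promise<{ cursor: string; keys: string[] }>;
  // Deletes the given keys and resolves to how many existed.
  del(keys: string[]): Promise<number>;
}

// Deletes every key matching `pattern`, one SCAN page at a time.
async function deleteByPattern(store: KeyStore, pattern: string): Promise<number> {
  let cursor = '0';
  let deleted = 0;
  do {
    const page = await store.scan(cursor, pattern);
    // Never call del with an empty list; some clients reject it.
    if (page.keys.length > 0) deleted += await store.del(page.keys);
    cursor = page.cursor;
  } while (cursor !== '0');
  return deleted;
}
```

Running `deleteByPattern` a second time and recording that it returns 0 gives the verification evidence the checklist asks for.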

External API reliability (Benefit Scanner)

The NAUPA, NAIC, PBGC, and VA.gov APIs are run by government agencies or associations of state regulators and may be slow, rate-limited, or unavailable. The infrastructure handles this with:

Circuit breaker per external API

Use the opossum npm package. Each API (NAUPA, NAIC, PBGC, VA) has its own circuit breaker with a 50% failure threshold, a 60s reset timeout, and a 15s timeout per request. When a breaker is open, the scanner skips that API and logs it as unavailable.

24-hour result cache

Scan results are cached in Upstash Redis under scan:{estate_id}:{api} with 86,400s TTL. If a subsequent scan request arrives within 24h, return cached result. This also protects against rate limits.
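
The cache behaviour described above is a cache-aside lookup. Here is a minimal sketch, assuming a small `TtlCache` interface (hypothetical, to be backed by the Upstash client) and JSON-serialisable scan results; `cachedScan` is likewise an illustrative name.

```typescript
// Hypothetical minimal cache interface; back it with the Redis client in practice.
interface TtlCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const SCAN_TTL_SECONDS = 86_400; // 24 hours

// Returns the cached result for scan:{estate_id}:{api} if present, otherwise
// calls the external API once and stores the result for 24 hours.
async function cachedScan<T>(
  cache: TtlCache,
  estateId: string,
  api: string,
  fetchFresh: () => Promise<T>,
): Promise<{ result: T; fromCache: boolean }> {
  const key = `scan:${estateId}:${api}`;
  const hit = await cache.get(key);
  if (hit !== null) return { result: JSON.parse(hit) as T, fromCache: true };
  const result = await fetchFresh();
  await cache.set(key, JSON.stringify(result), SCAN_TTL_SECONDS);
  return { result, fromCache: false };
}
```

Because a failed `fetchFresh` rejects before anything is written, errors are never cached and the next request retries the external API.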

// workers/scanner/circuit-breaker.ts
import CircuitBreaker from 'opossum';
// queryNaupa, queryNaic, queryPbgc, queryVa and metrics are assumed to be
// imported from the scanner's own modules.

const breakerOptions = {
  timeout:                  15000,  // 15s per request
  errorThresholdPercentage: 50,     // open once 50% of requests fail
  resetTimeout:             60000,  // try again after 60s
  volumeThreshold:          5,      // minimum 5 requests before opening
};

export const naupa = new CircuitBreaker(queryNaupa, breakerOptions);
export const naic  = new CircuitBreaker(queryNaic,  breakerOptions);
export const pbgc  = new CircuitBreaker(queryPbgc,  breakerOptions);
export const va    = new CircuitBreaker(queryVa,    breakerOptions);

// Track circuit state in metrics
for (const [name, cb] of Object.entries({ naupa, naic, pbgc, va })) {
  cb.on('open',     () => metrics.inc(`circuit_open_${name}`));
  cb.on('halfOpen', () => metrics.inc(`circuit_halfopen_${name}`));
  cb.on('close',    () => metrics.inc(`circuit_close_${name}`));
}
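
The "skip and log as unavailable" behaviour can be sketched as a fan-out that turns any rejection (open breaker, timeout, or upstream error) into an explicit outcome instead of failing the whole scan. `Fireable`, `ScanOutcome`, and `scanAll` are hypothetical names; the only assumption about the breakers is that each exposes a promise-returning `fire` method, which the opossum breakers above do.

```typescript
// Anything with a promise-returning fire(), such as an opossum CircuitBreaker.
interface Fireable<T> {
  fire(estateId: string): Promise<T>;
}

type ScanOutcome<T> =
  | { api: string; status: 'ok'; result: T }
  | { api: string; status: 'unavailable'; reason: string };

// Queries every API in parallel; a failing or open breaker yields 'unavailable'
// for that API only, and the remaining results are still returned.
async function scanAll<T>(
  breakers: Record<string, Fireable<T>>,
  estateId: string,
): Promise<ScanOutcome<T>[]> {
  return Promise.all(
    Object.entries(breakers).map(async ([api, breaker]): Promise<ScanOutcome<T>> => {
      try {
        return { api, status: 'ok', result: await breaker.fire(estateId) };
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        return { api, status: 'unavailable', reason };
      }
    }),
  );
}
```

A caller could invoke `scanAll({ naupa, naic, pbgc, va }, estateId)` and log every outcome whose status is `unavailable`.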

Settle
Infrastructure Plan · Version 1.0 · April 2026
A DevOps engineer should be able to provision this stack from zero in a single working day using the commands in this document.