Technical Architecture Document

Settle — System Architecture

A guided estate administration platform built for the 2.8 million families navigating probate each year. This document covers every architectural decision, from field-level encryption to 50-state legal rules.

Version 1.0
Date April 2026
Stack SvelteKit · Fly.io · Neon Postgres · Cloudflare R2
Scope Year 1 MVP through Year 3 Scale
Section 0

Architecture Principles

These principles are not aspirational — they are constraints. Every design decision in this document can be traced back to at least one of them. When a tradeoff arises, principles at the top of this list take precedence.

P-01
Defense in Depth
No single control stands alone. Encryption at the field level exists even when disk encryption is present. Auth checks exist at the middleware layer even when a route guard is in place. Every sensitive operation has multiple independent barriers.
P-02
Least Privilege
Database roles, API service accounts, and R2 credentials carry only the permissions they need and nothing more. Executors cannot see co-executor data; workers cannot read payment data. Privilege is granted explicitly, never inherited.
P-03
Fail Fast with Context
Errors surface immediately with enough context to debug. Silent failures — especially in the notification service and benefit scanner — are worse than visible crashes. Every error includes a correlation ID, estate context, and the operation that failed.
P-04
Auditability as a First-Class Feature
Every mutation to an estate record, every access to a SSN, every notification sent, and every rule version used to generate a task plan is written to an immutable audit log. This satisfies legal discovery requirements and enables debugging of long-lived workflows.
P-05
Workflows Must Never Lose State
An estate workflow lasts 16–18 months. The state machine is authoritative and must be recoverable from the database alone. In-memory state is only a performance optimization; it is never the source of truth.
P-06
Separation of Concerns
The rules engine does not send notifications. The notification service does not query benefit databases. Clear service boundaries make each component independently testable, deployable, and replaceable. The monorepo structure enforces this through module boundaries.
P-07
UPL Guardrails in Architecture
Unauthorized Practice of Law is a structural risk. The system guides; it does not advise. Content generated by the rules engine is framed as procedural guidance, not legal counsel. The architecture must make it technically impossible for the system to render a legal opinion about a specific estate's situation.
P-08
Reversibility Over Optimization
At Year 1 scale, premature optimization creates accidental complexity. Prefer designs that can be replaced when requirements become clear over those that are theoretically optimal today. Storing rule sets in Postgres JSON columns rather than in a dedicated rules engine database is the canonical example of this tradeoff.
P-09
Idempotency for External Calls
Any operation that calls an external service — Lob, Resend, NAUPA, institution APIs — must be idempotent. Retries are safe. Duplicate physical letters or duplicate cancellation requests are not. Every outbound call carries an idempotency key derived from the estate and operation context.
Section 1

System Overview

Settle is architecturally a modular monolith deployed on Fly.io, backed by Neon Postgres and Cloudflare R2. It is not a microservices system — at Year 1 scale (500 estates), the operational overhead of distributed services outweighs the benefits. Service boundaries are enforced at the module level within the codebase, making extraction straightforward when scale demands it.

[Figure: system overview diagram]

CLIENT
  SvelteKit App: SSR + CSR · Cloudflare CDN · localhost:5173 / settle.com (HTTPS)
FLY.IO RUNTIME
  API Server: SvelteKit +hooks · REST · Session Auth · RBAC
  Rules Engine: In-process · JSON evaluation · 50-state legal rule sets
  Notification Worker: Fly.io background worker · Tier 1 · Tier 2 · Tier 3
  Benefit Scanner: Fly.io scheduled worker · NAUPA · NAIC · PBGC · VA
  Document Processor: Virus scan · OCR · Classify · Encrypt & write to R2
DATA STORES
  Neon Postgres: Primary data store · Estates · Tasks · Rules · Audit
  Cloudflare R2: Document vault · AES-256 · per-estate keys
  Upstash Redis: Session store · rate limits · Benefit scan result cache
  Task Queue: PgBoss · Postgres-backed · Notification · Scan jobs
EXTERNAL APIS
  Lob: Physical mail API
  Resend: Transactional email
  NAUPA / MM: Unclaimed property
  NAIC Policy Locator: Life insurance search
  SSA · IRS · VA: Federal benefit agencies
  VitalChek: Death cert ordering
Legend: solid line = synchronous call; dashed line = async / background

Deployment Topology

All server-side compute runs on Fly.io. The SvelteKit application serves both the frontend (SSR) and the API routes. A separate Fly.io machine type runs background workers (notification service, benefit scanner) on a schedule or queue trigger. Neon Postgres provides the primary data store with automatic branching for staging environments. Cloudflare R2 stores documents, accessed via pre-signed URLs generated by the API: the browser uploads to and downloads from R2 using those short-lived URLs and never holds R2 credentials.

Component | Platform | Machine Type | Count (Y1) | Scaling Trigger
api-server | Fly.io | shared-cpu-2x · 512MB | 2 (HA) | CPU > 70% for 2 min
notification-worker | Fly.io | shared-cpu-1x · 256MB | 1 | Queue depth > 50
benefit-scanner | Fly.io | shared-cpu-1x · 256MB | 1 | Cron-triggered
postgres | Neon | 0.25 CU · autoscale | 1 primary | Neon autoscale
document-vault | Cloudflare R2 | Object storage | - | Serverless
session-store | Upstash Redis | Serverless Redis | - | Serverless
Section 2

Frontend Architecture

The frontend is a SvelteKit application served from Fly.io, with Cloudflare's global CDN in front of it. Static assets and prerendered pages are cached at the edge, which gives them sub-50ms TTFB globally without additional CDN configuration. The application uses a hybrid rendering strategy: marketing pages are prerendered for SEO; unauthenticated flows such as signup and login are server-rendered; the authenticated estate application is server-rendered for initial load, then transitions to client-side navigation.

SSR Strategy

Route Pattern | Rendering Mode | Rationale
/ | prerender | Marketing page, fully static, cached at edge
/how-it-works, /pricing | prerender | SEO-critical, no dynamic content
/signup, /login | SSR | CSRF token injection, form handling
/estate/[id]/dashboard | SSR + CSR | Initial data load server-side; subsequent navigation client-side
/estate/[id]/tasks | SSR + CSR | Task plan rendered on server, mutations via client fetch
/estate/[id]/documents | SSR + CSR | File list server-rendered; upload/preview client-side

Grief-Aware UX Architecture

The people using Settle are in one of the most cognitively impaired states a person can experience. The architecture must account for this — not just the visual design, but the caching, session, and error recovery behavior of the application.

Grief fog is a real cognitive phenomenon. Users may start a task, navigate away, return hours later, and not remember what they were doing. The application must preserve context aggressively and never ask a user to re-enter information they have already provided.

Client-Side State Preservation

SvelteKit's +page.server.ts load functions cache estate data with a 60-second stale-while-revalidate window. Form state is auto-saved to localStorage every 5 seconds during intake flows, keyed by estate ID and form name. If the browser tab is closed mid-form, the user resumes exactly where they left off on next visit.

// src/lib/stores/formPersist.ts
export function createPersistedForm(estateId: string, formKey: string) {
  const storageKey = `settle:form:${estateId}:${formKey}`;

  return {
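    restore: () => {
      const raw = localStorage.getItem(storageKey);
      if (!raw) return null;
      try {
        return JSON.parse(raw); // { data, savedAt }
      } catch {
        // A corrupt snapshot must not block the form; discard it
        localStorage.removeItem(storageKey);
        return null;
      }
    },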
    save: (data: unknown) => {
      localStorage.setItem(storageKey, JSON.stringify({
        data,
        savedAt: Date.now()
      }));
    },
    clear: () => localStorage.removeItem(storageKey)
  };
}

Session Continuity for Long-Lived Workflows

Estate workflows span 16–18 months. Standard session expiry of 24–72 hours is inappropriate. Sessions are configured with a 30-day sliding window. The session cookie carries only the session ID; all session data (user ID, estate ID, role, last active page) lives in Redis. On each authenticated request, the session TTL is refreshed. The "last active page" is stored and presented as a resume prompt on next login.
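
The sliding window described above can be sketched as follows. This is a minimal illustration, not the production session layer: a `Map` stands in for Redis, and `SlidingSessionStore`, `touch`, and the field names are hypothetical. Time is passed in explicitly so the expiry behavior is easy to check.

```typescript
// Hypothetical sketch: a Map stands in for Redis.
interface SessionData {
  userId: string;
  estateId: string | null;
  role: string;
  lastActivePage: string;
}

interface StoredSession {
  data: SessionData;
  expiresAt: number; // epoch milliseconds
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

class SlidingSessionStore {
  private sessions = new Map<string, StoredSession>();
  private ttlMs: number;

  constructor(ttlMs: number = THIRTY_DAYS_MS) {
    this.ttlMs = ttlMs;
  }

  create(sessionId: string, data: SessionData, now: number): void {
    this.sessions.set(sessionId, { data, expiresAt: now + this.ttlMs });
  }

  // Called on every authenticated request: returns the session and
  // slides the expiry forward, or returns null if the session has lapsed.
  touch(sessionId: string, now: number): SessionData | null {
    const stored = this.sessions.get(sessionId);
    if (!stored || stored.expiresAt <= now) {
      this.sessions.delete(sessionId);
      return null;
    }
    stored.expiresAt = now + this.ttlMs;
    return stored.data;
  }

  // Records where the user was, for the resume prompt on next login.
  setLastActivePage(sessionId: string, page: string): void {
    const stored = this.sessions.get(sessionId);
    if (stored) stored.data.lastActivePage = page;
  }
}
```

With Redis, `touch` would correspond to reading the key and resetting its expiry in one step; the important property is that an active user never hits the 30-day limit, while an abandoned session lapses on its own.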

Offline Support

Families in rural areas, or using cellular connections at a funeral home, may have intermittent connectivity. The application uses a Service Worker registered at the app root to cache the application shell and the current estate's task plan for offline viewing. Mutations made offline are queued and replayed when connectivity is restored.

Asset Type | Cache Strategy | TTL
App shell (HTML, CSS, JS) | Cache-first after first load | Stale until new deploy
Estate task plan | Stale-while-revalidate | 10 minutes
Document metadata | Stale-while-revalidate | 5 minutes
Document files (PDFs) | Cache on demand (explicit user action) | 1 hour
API responses (GET) | Network-first with cache fallback | 60 seconds
Offline mutation queue: When a user marks a task complete while offline, the mutation is stored in IndexedDB. A background sync event replays the queue when connectivity is restored. The UI shows a subtle "syncing" indicator. Conflict resolution is last-write-wins at the task level — an acceptable tradeoff given the low concurrency of a single-estate workflow.
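
As an illustration of last-write-wins at the task level, the sketch below collapses a queued batch so that only the newest mutation per task is replayed. It is a simplified stand-in: a plain array replaces IndexedDB, and `TaskMutation`, `collapseQueue`, and `replayQueue` are hypothetical names.

```typescript
// Hypothetical sketch: an array stands in for the IndexedDB queue.
interface TaskMutation {
  taskId: string;
  status: 'pending' | 'in_progress' | 'complete';
  queuedAt: number; // client clock, epoch milliseconds
}

// Keep only the most recent mutation for each task (last-write-wins).
function collapseQueue(queue: TaskMutation[]): TaskMutation[] {
  const latest = new Map<string, TaskMutation>();
  queue.forEach((mutation) => {
    const seen = latest.get(mutation.taskId);
    if (!seen || mutation.queuedAt >= seen.queuedAt) {
      latest.set(mutation.taskId, mutation);
    }
  });
  return Array.from(latest.values());
}

// Replay the collapsed queue; anything that fails stays queued for the next sync.
async function replayQueue(
  queue: TaskMutation[],
  send: (mutation: TaskMutation) => Promise<boolean>
): Promise<TaskMutation[]> {
  const stillPending: TaskMutation[] = [];
  for (const mutation of collapseQueue(queue)) {
    try {
      const ok = await send(mutation);
      if (!ok) stillPending.push(mutation);
    } catch {
      stillPending.push(mutation);
    }
  }
  return stillPending;
}
```

Collapsing before replay also keeps the number of requests small after a long offline period, since a task toggled several times produces one PATCH rather than many.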
Section 3

Backend API

The API is implemented as SvelteKit server routes (+server.ts files), co-located with the frontend. In Year 1 this is a benefit rather than a compromise: it means a single deployment, shared TypeScript types between server and client, and no network hop when server-side load functions call these routes during SSR. The API layer follows REST conventions with resource-oriented URLs and standard HTTP semantics.

Authentication and RBAC

Authentication is session-based with cookies. Sessions are stored in Upstash Redis with a 30-day sliding expiry. The session token is a cryptographically random 256-bit value (generated with crypto.getRandomValues). The session payload includes the user ID, estate ID (if applicable), role, and a fingerprint of the request IP and user agent for anomaly detection.

Why session auth over JWTs? See ADR-004. Short version: estate workflows are long-lived. JWTs with short expiry require refresh token infrastructure that's more complex than the problem it solves. Sessions can be invalidated instantly on the server, which matters critically when an attorney is removed from an estate or a family reports unauthorized access.

Role Definitions

Role | Description | Key Permissions | Restrictions
executor | Named executor of the estate | Full CRUD on estate, tasks, documents, notifications | Cannot delete estate record (soft delete only)
co_executor | Named co-executor | Same as executor; all actions are co-attributed | Cannot remove executor or self
attorney | Legal counsel on the estate | Read all; write case notes; upload documents | Cannot send notifications or modify task status
family_viewer | Family member with view-only access | Read task status, document list (not contents), estate summary | No writes; no PII fields; no financial data
admin | Settle staff (internal only) | All estates; audit log access; legal rule management | All access is logged; no unilateral PII access

RBAC is enforced at two layers: the route handler level (SvelteKit hooks check the session role before the handler executes) and the database level (Postgres row-level security policies that reference the current session's estate ID). This means a misconfigured route cannot accidentally return data from another estate — the database policy will reject the query.
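
The route-level layer can be pictured as a small permission matrix consulted before a handler runs. The sketch below is an assumption-laden illustration: the `Action` names, the `authorize` function, and the status-code return convention are hypothetical, and the role lists are taken from the endpoint table in this section. The database layer (row-level security) is a separate control and is not shown.

```typescript
// Hypothetical sketch of the route-level check; names are illustrative.
type Role = 'executor' | 'co_executor' | 'attorney' | 'family_viewer' | 'admin';

type Action =
  | 'estate:read'
  | 'estate:update'
  | 'tasks:read'
  | 'tasks:update'
  | 'documents:upload'
  | 'notifications:queue'
  | 'benefits:read';

// Admin access goes through separate internal tooling and is deliberately absent here.
const ALLOWED: Record<Action, readonly Role[]> = {
  'estate:read': ['executor', 'co_executor', 'attorney', 'family_viewer'],
  'estate:update': ['executor', 'co_executor'],
  'tasks:read': ['executor', 'co_executor', 'attorney', 'family_viewer'],
  'tasks:update': ['executor', 'co_executor'],
  'documents:upload': ['executor', 'co_executor', 'attorney'],
  'notifications:queue': ['executor', 'co_executor'],
  'benefits:read': ['executor', 'co_executor'],
};

interface Session {
  userId: string;
  estateId: string | null;
  role: Role;
}

// Returns the HTTP status the route should respond with: 200 means proceed.
function authorize(session: Session | null, estateId: string, action: Action): number {
  if (session === null) return 401; // not authenticated
  if (session.estateId !== estateId) return 403; // session is bound to another estate
  return ALLOWED[action].indexOf(session.role) !== -1 ? 200 : 403;
}
```

Keeping the matrix as data rather than scattered `if` statements makes it straightforward to review against the role table and to test exhaustively.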

Key API Endpoints

Estate Management

Method | Path | Description | Required Role
POST | /api/estates | Create new estate (intake) | Authenticated user
GET | /api/estates/:id | Get estate summary | executor, co_executor, attorney, family_viewer
PATCH | /api/estates/:id | Update estate (assets discovered, status change) | executor, co_executor
GET | /api/estates/:id/tasks | Get full task plan with rule version metadata | executor, co_executor, attorney, family_viewer
POST | /api/estates/:id/tasks/regenerate | Regenerate task plan (new assets/state change) | executor, co_executor
PATCH | /api/estates/:id/tasks/:taskId | Update task status, add notes | executor, co_executor
POST | /api/estates/:id/documents | Request presigned upload URL | executor, co_executor, attorney
POST | /api/estates/:id/notifications | Queue notification (tier specified in body) | executor, co_executor
GET | /api/estates/:id/benefits | Get benefit scan results | executor, co_executor
POST | /api/estates/:id/benefits/scan | Trigger benefit scan for estate | executor, co_executor

Sample: Create Estate

// POST /api/estates
// Request Body:
{
  "deceased": {
    "firstName": "Margaret",
    "lastName": "Chen",
    "dateOfBirth": "1942-03-15",
    "dateOfDeath": "2026-03-01",
    "stateOfResidence": "CA",
    "ssn": "XXX-XX-XXXX"  // encrypted in transit and at rest
  },
  "executor": {
    "relationship": "child"
  },
  "estateProfile": {
    "hasRealProperty": true,
    "estimatedAssetValue": "250000-500000",
    "hasWill": true,
    "hasTrust": false
  }
}

// 201 Created Response:
{
  "estateId": "est_01j9k3m...",
  "status": "intake_complete",
  "taskPlan": {
    "generatedAt": "2026-04-03T14:22:00Z",
    "ruleSetVersion": "CA-2026.1",
    "taskCount": 34,
    "requiredDeadlines": [
      { "task": "file_probate_petition", "dueByDays": 30 }
    ]
  }
}

// 422 Error (invalid state):
{
  "error": "INVALID_STATE_CODE",
  "message": "stateOfResidence must be a valid 2-letter US state code",
  "field": "deceased.stateOfResidence"
}
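
One way the 422 response above could be produced is a small field validator that returns the error payload directly. This is a sketch under assumptions: `validateStateOfResidence` and `US_STATE_CODES` are hypothetical names, and the set covers the 50 states only (whether DC and territories are accepted is not specified in this document).

```typescript
// Hypothetical sketch; covers the 50 states only.
const US_STATE_CODES = new Set<string>([
  'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
  'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
  'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
  'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
  'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
]);

interface ValidationError {
  error: string;
  message: string;
  field: string;
}

// Returns null when the value is acceptable, otherwise the 422 response body.
function validateStateOfResidence(value: unknown): ValidationError | null {
  if (typeof value === 'string' && US_STATE_CODES.has(value)) {
    return null;
  }
  return {
    error: 'INVALID_STATE_CODE',
    message: 'stateOfResidence must be a valid 2-letter US state code',
    field: 'deceased.stateOfResidence',
  };
}
```

The check is case-sensitive on purpose in this sketch, so `"ca"` is rejected rather than silently normalized; a real intake form might normalize before validating.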
Section 4

State Rules Engine

The rules engine is the most architecturally unique component in Settle. It must encode legally-correct, state-specific probate rules for all 50 states, remain maintainable by non-engineers (or at least by lawyers working with engineers), support versioning and rollback, and produce an auditable trail of exactly which rule set generated which task plan for a given estate.

This is a legal knowledge base, not a workflow engine. The rules engine's job is to answer one question: given the characteristics of this estate, in this state, what tasks must be completed and in what order? It does not manage task execution — that is the estate state machine. The distinction matters for testing, maintenance, and UPL compliance.

Rule Set Schema

Each state has a versioned JSON rule set stored in Postgres. The JSON document defines the full task graph for that state, with conditions that filter and customize tasks based on estate attributes. The schema is designed to be readable by a paralegal reviewing the document — not just a developer.

// Table: state_rule_sets
// Column: rules JSONB

{
  "state": "CA",
  "version": "CA-2026.1",
  "effectiveDate": "2026-01-01",
  "legalSources": [
    "Cal. Prob. Code § 13100",
    "Cal. Prob. Code § 8000"
  ],
  "probateThreshold": {
    "grossEstateValueCents": 20885000,
    "realPropertyIncluded": true,
    "source": "Cal. Prob. Code § 13100 (adjusted every three years under § 890)"
  },
  "smallEstateAffidavit": {
    "available": true,
    "maxValueCents": 20885000,
    "waitDays": 40,
    "form": "DE-305"
  },
  "tasks": [
    {
      "id": "obtain_death_certificates",
      "category": "immediate",
      "priority": 1,
      "title": "Order certified death certificates",
      "description": "Order at least 10 certified copies from VitalChek...",
      "conditions": [],
      "deadlineDays": 7,
      "requiredFor": ["open_estate_account", "notify_ssa"],
      "legalGuidance": "Procedural step. Order via VitalChek or local registrar."
    },
    {
      "id": "file_probate_petition",
      "category": "probate",
      "priority": 2,
      "title": "File petition for probate",
      "conditions": [
        {
          "field": "estate.requiresProbate",
          "operator": "eq",
          "value": true
        }
      ],
      "deadlineDays": 30,
      "court": "Superior Court, Probate Division",
      "forms": ["DE-111", "DE-140"],
      "filingFee": {
        "baseFeeCents": 39500,
        "source": "Cal. Gov. Code § 70650"
      }
    }
  ]
}

Rule Evaluation

The rules engine evaluates a rule set against an EstateContext object — a snapshot of all estate attributes relevant to task generation. Evaluation is a pure function with no side effects: given the same EstateContext and RuleSet, it always produces the same task list. This property makes it trivially testable.

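// src/lib/rules/evaluator.ts

// Assumption: addDays comes from date-fns; any equivalent date helper works.
import { addDays } from 'date-fns';
// Assumption: shared rule types live in a sibling module; adjust the path as needed.
import type { StateRuleSet, RuleCondition, GeneratedTask } from './types';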

export interface EstateContext {
  estateId: string;
  stateCode: string;
  dateOfDeath: Date;
  estimatedGrossValueCents: number;
  hasRealProperty: boolean;
  hasWill: boolean;
  hasTrust: boolean;
  hasMinorChildren: boolean;
  hasVeteranStatus: boolean;
  hasBusinessInterests: boolean;
  requiresProbate: boolean; // derived: grossValue > threshold
}

export function evaluateRuleSet(
  context: EstateContext,
  ruleSet: StateRuleSet
): GeneratedTask[] {
  const tasks: GeneratedTask[] = [];

  for (const rule of ruleSet.tasks) {
    const conditionsMet = rule.conditions.every(
      (c) => evaluateCondition(c, context)
    );

    if (conditionsMet) {
      tasks.push({
        ruleId: rule.id,
        ruleSetVersion: ruleSet.version,
        title: rule.title,
        description: rule.description,
        category: rule.category,
        priority: rule.priority,
        // Compare against null so a 0-day deadline still produces a due date
        dueDateCalc: rule.deadlineDays != null
          ? addDays(context.dateOfDeath, rule.deadlineDays)
          : null,
        legalGuidance: rule.legalGuidance,
        forms: rule.forms ?? [],
        status: 'pending'
      });
    }
  }

  return tasks.sort((a, b) => a.priority - b.priority);
}

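// Resolve a dotted path such as "estate.requiresProbate" against an object.
function getNestedValue(root: unknown, path: string): unknown {
  return path.split('.').reduce<unknown>(
    (node, key) => (node as Record<string, unknown> | undefined)?.[key],
    root
  );
}

function evaluateCondition(
  condition: RuleCondition,
  context: EstateContext
): boolean {
  // Rule documents address fields as "estate.<attribute>", so wrap the context
  const value = getNestedValue({ estate: context }, condition.field);
  switch (condition.operator) {
    case 'eq': return value === condition.value;
    case 'gt': return (value as number) > (condition.value as number);
    case 'lt': return (value as number) < (condition.value as number);
    case 'in': return (condition.value as unknown[]).includes(value);
    default: throw new Error(`Unknown operator: ${condition.operator}`);
  }
}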
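
Because evaluation is pure, a fixture-style check needs no database or mocks. The sketch below is a reduced, self-contained re-implementation of only the condition filter, written to show how two estate fixtures exercise the `requiresProbate` branch. The names `Rule`, `matches`, and `selectRuleIds` are hypothetical and are not the evaluator's actual API.

```typescript
// Hypothetical, reduced model of rule filtering for fixture tests.
type Operator = 'eq' | 'gt' | 'lt' | 'in';

interface Condition {
  field: string; // e.g. "estate.requiresProbate"
  operator: Operator;
  value: unknown;
}

interface Rule {
  id: string;
  priority: number;
  conditions: Condition[];
}

type Context = Record<string, unknown>;

// Rule documents address fields as "estate.<attribute>".
function lookup(context: Context, field: string): unknown {
  const prefix = 'estate.';
  const key = field.indexOf(prefix) === 0 ? field.slice(prefix.length) : field;
  return context[key];
}

function matches(condition: Condition, context: Context): boolean {
  const actual = lookup(context, condition.field);
  const expected = condition.value;
  switch (condition.operator) {
    case 'eq':
      return actual === expected;
    case 'gt':
      return typeof actual === 'number' && typeof expected === 'number' && actual > expected;
    case 'lt':
      return typeof actual === 'number' && typeof expected === 'number' && actual < expected;
    case 'in':
      return Array.isArray(expected) && expected.indexOf(actual) !== -1;
    default:
      throw new Error('Unknown operator');
  }
}

// A rule applies when every condition holds; an empty list always applies.
function selectRuleIds(rules: Rule[], context: Context): string[] {
  return rules
    .filter((rule) => rule.conditions.every((c) => matches(c, context)))
    .sort((a, b) => a.priority - b.priority)
    .map((rule) => rule.id);
}

const rules: Rule[] = [
  {
    id: 'file_probate_petition',
    priority: 2,
    conditions: [{ field: 'estate.requiresProbate', operator: 'eq', value: true }],
  },
  { id: 'obtain_death_certificates', priority: 1, conditions: [] },
];

const smallEstate: Context = { stateCode: 'CA', requiresProbate: false };
const largeEstate: Context = { stateCode: 'CA', requiresProbate: true };

const smallPlan = selectRuleIds(rules, smallEstate);
const largePlan = selectRuleIds(rules, largeEstate);
```

Here `smallPlan` contains only the death-certificate task, while `largePlan` contains both tasks in priority order. Fixtures like these are cheap to write per state and pin down each branching condition.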

Versioning and the Audit Trail

Rule sets are immutable once published. When a legal change requires updating California's rules, a new version CA-2026.2 is created — the old version is never modified. Every generated task carries the ruleSetVersion that produced it. This creates a complete audit trail: any task in any estate can be traced to the exact rule text that created it, at the time it was created.

Column | Purpose
state_rule_sets.version | Unique identifier, format STATE-YYYY.N (e.g., CA-2026.2)
state_rule_sets.status | draft, review, published, superseded
state_rule_sets.superseded_by | FK to newer version; forms a linked list of rule history
tasks.rule_set_version | Captured at task generation time; never updated
estate_events.rule_regeneration | Audit event when task plan is regenerated; records old and new versions

When new assets are discovered mid-process (e.g., a retirement account found three months in), the executor can trigger task plan regeneration. The engine runs against the current published rule set for the estate's state, adds any new tasks that aren't already present, and emits an estate_event recording the delta. Existing tasks are never deleted during regeneration — they may be superseded but the record of their creation is preserved.
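
The regeneration step can be sketched as a pure diff between the stored plan and a freshly generated one. This is illustrative only: `planRegeneration`, `StoredTask`, and `RegenerationEvent` are hypothetical names, and the sketch assumes a task counts as "already present" whenever its `ruleId` exists on the estate, whatever its status.

```typescript
// Hypothetical sketch of add-only regeneration.
interface StoredTask {
  ruleId: string;
  ruleSetVersion: string;
}

interface GeneratedTask {
  ruleId: string;
  ruleSetVersion: string;
  title: string;
}

interface RegenerationEvent {
  type: 'rule_regeneration';
  oldVersions: string[];
  newVersion: string;
  addedRuleIds: string[];
}

interface RegenerationPlan {
  toInsert: GeneratedTask[];
  event: RegenerationEvent;
}

function planRegeneration(
  existing: StoredTask[],
  generated: GeneratedTask[],
  newVersion: string
): RegenerationPlan {
  const present = new Set(existing.map((task) => task.ruleId));
  // Only tasks the estate does not already have are added; nothing is deleted.
  const toInsert = generated.filter((task) => !present.has(task.ruleId));
  const oldVersions = Array.from(new Set(existing.map((task) => task.ruleSetVersion))).sort();
  return {
    toInsert,
    event: {
      type: 'rule_regeneration',
      oldVersions,
      newVersion,
      addedRuleIds: toInsert.map((task) => task.ruleId),
    },
  };
}
```

Existing rows keep their original `ruleSetVersion`, so the audit trail still shows which rule text produced each task even after several regenerations.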

Rule Update Process

1
Legal Change Detected
A paralegal or attorney identifies a statutory change (new probate threshold, new required form). They file a GitHub issue tagged legal-rule-change with the cite and effective date.
2
Draft New Rule Version
An engineer creates a new JSON rule set document by copying the current published version and making the required changes. The new version is inserted with status = 'draft' and effective_date set to the statutory effective date.
3
Automated Test Suite
Every state rule set has a corresponding test file with estate fixtures covering the key branching conditions. CI runs all rule tests on every PR. A rule change that breaks an existing fixture must be explicitly acknowledged in the PR description.
4
Legal Review
The draft rule set and the diff from the previous version are reviewed by a licensed attorney in the relevant state. The Settle admin interface provides a human-readable diff view of the rule changes. Review approval is recorded in the rule_set_reviews table with reviewer identity and timestamp.
5
Publish
On or after the effective_date, the admin publishes the new version. The old version's superseded_by is set. New estates in that state will use the new version immediately. Existing estates continue with their current task plan unless the executor triggers regeneration.
6
Notification to Affected Estates
If the rule change affects currently open estates (e.g., a new required filing), the system identifies affected estates and surfaces a notification: "A legal rule change in [State] may affect your estate. Review your task plan." No tasks are automatically added without executor acknowledgment.
UPL Guardrail: The legalGuidance field in every task rule is a procedural description, not legal advice. The schema enforces that this field cannot exceed 500 characters and must pass a content review that rejects second-person advisory language ("you should," "you must consult"). The admin UI renders a warning if submitted guidance contains these patterns.
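
A minimal sketch of that content check might look like the following. The function name `reviewLegalGuidance` and the exact pattern list are assumptions; only the 500-character limit and the two example phrases come from this document, and a real review would likely use a longer, attorney-curated list.

```typescript
// Hypothetical sketch of the legalGuidance content check.
const MAX_GUIDANCE_LENGTH = 500;

// Phrases that read as advice directed at the reader.
const ADVISORY_PATTERNS: RegExp[] = [/\byou should\b/i, /\byou must consult\b/i];

// Returns a list of problems; an empty list means the text passes.
function reviewLegalGuidance(text: string): string[] {
  const problems: string[] = [];
  if (text.length > MAX_GUIDANCE_LENGTH) {
    problems.push('legalGuidance exceeds ' + MAX_GUIDANCE_LENGTH + ' characters');
  }
  ADVISORY_PATTERNS.forEach((pattern) => {
    if (pattern.test(text)) {
      problems.push('advisory language matched: ' + pattern.source);
    }
  });
  return problems;
}
```

Returning a list rather than a boolean lets the admin UI show every issue at once instead of one rejection per save.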
Section 5

Data Architecture

Postgres via Neon is the primary data store for all structured data. R2 stores documents. There is no separate analytics database at Year 1 — Neon's read replica capability provides read scaling without operational overhead. The schema is designed for the long-lived, event-sourced nature of estate workflows.

Core Schema

estates Primary Entity
CREATE TABLE estates (
  id              TEXT PRIMARY KEY DEFAULT gen_estate_id(), -- 'est_' prefix + ulid
  status          TEXT NOT NULL DEFAULT 'intake',           -- intake|active|closing|closed|archived
  state_code      CHAR(2) NOT NULL,
  opened_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  closed_at       TIMESTAMPTZ,
  rule_set_version TEXT,                                     -- FK: state_rule_sets.version
  estimated_value_cents  BIGINT,
  requires_probate       BOOLEAN,                            -- set by the application at intake and on regeneration:
                                                             -- estimated_value_cents > published probate_threshold_cents.
                                                             -- Not a generated column: Postgres generated columns
                                                             -- cannot use subqueries or reference other tables.
  metadata        JSONB NOT NULL DEFAULT '{}'               -- flexible attributes
);

CREATE INDEX idx_estates_state ON estates(state_code);
CREATE INDEX idx_estates_status ON estates(status)
  WHERE status NOT IN ('closed','archived');
deceased Sensitive PII
CREATE TABLE deceased (
  id              TEXT PRIMARY KEY DEFAULT gen_id(),
  estate_id       TEXT NOT NULL REFERENCES estates(id),
  first_name      TEXT NOT NULL,
  last_name       TEXT NOT NULL,
  date_of_birth   DATE,
  date_of_death   DATE NOT NULL,
  state_of_residence CHAR(2) NOT NULL,
  -- Column-level encrypted fields (AES-256-GCM, application-layer)
  ssn_encrypted   BYTEA,                                    -- encrypted SSN
  ssn_last4       CHAR(4),                                  -- unencrypted for display/lookup
  -- Access to ssn_encrypted is audit-logged at application layer
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX idx_deceased_estate ON deceased(estate_id);
tasks Core Workflow
CREATE TABLE tasks (
  id              TEXT PRIMARY KEY DEFAULT gen_id(),
  estate_id       TEXT NOT NULL REFERENCES estates(id),
  rule_id         TEXT NOT NULL,                            -- e.g. 'file_probate_petition'
  rule_set_version TEXT NOT NULL,                           -- immutable at creation
  title           TEXT NOT NULL,
  category        TEXT NOT NULL,                            -- immediate|probate|financial|notifications|etc
  status          TEXT NOT NULL DEFAULT 'pending',          -- pending|in_progress|complete|skipped|n_a
  priority        INTEGER NOT NULL DEFAULT 50,
  due_date        DATE,
  completed_at    TIMESTAMPTZ,
  completed_by    TEXT REFERENCES users(id),
  notes           TEXT,
  is_generated    BOOLEAN NOT NULL DEFAULT TRUE,             -- false = manually added
  superseded_at   TIMESTAMPTZ,                              -- set if task replaced on regeneration
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_tasks_estate ON tasks(estate_id) WHERE superseded_at IS NULL;
CREATE INDEX idx_tasks_status  ON tasks(estate_id, status)
  WHERE status IN ('pending','in_progress');
state_rule_sets Legal Knowledge Base
CREATE TABLE state_rule_sets (
  version               TEXT PRIMARY KEY,                   -- 'CA-2026.1'
  state_code            CHAR(2) NOT NULL,
  status                TEXT NOT NULL DEFAULT 'draft',      -- draft|review|published|superseded
  effective_date        DATE NOT NULL,
  superseded_by         TEXT REFERENCES state_rule_sets(version),
  probate_threshold_cents BIGINT NOT NULL,
  rules                 JSONB NOT NULL,                     -- full rule set document
  reviewed_by           TEXT,                              -- attorney name on record
  reviewed_at           TIMESTAMPTZ,
  published_by          TEXT REFERENCES users(id),
  published_at          TIMESTAMPTZ,
  created_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_rules_state_published ON state_rule_sets(state_code)
  WHERE status = 'published';
notifications Audit-Critical
CREATE TABLE notifications (
  id              TEXT PRIMARY KEY DEFAULT gen_id(),
  estate_id       TEXT NOT NULL REFERENCES estates(id),
  institution_id  TEXT REFERENCES institutions(id),
  tier            SMALLINT NOT NULL CHECK (tier IN (1, 2, 3)),
  channel         TEXT NOT NULL,                            -- api|mail|phone_script
  status          TEXT NOT NULL DEFAULT 'queued',           -- queued|processing|sent|delivered|failed
  idempotency_key TEXT UNIQUE NOT NULL,                    -- prevents duplicate sends
  external_id     TEXT,                                     -- Lob letter ID, API confirmation, etc.
  queued_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
  sent_at         TIMESTAMPTZ,
  delivered_at    TIMESTAMPTZ,
  failed_at       TIMESTAMPTZ,
  failure_reason  TEXT,
  retry_count     SMALLINT NOT NULL DEFAULT 0,
  payload         JSONB NOT NULL                            -- full request payload, redacted of PII
);

CREATE INDEX idx_notif_estate   ON notifications(estate_id);
CREATE INDEX idx_notif_status   ON notifications(status) WHERE status IN ('queued','processing');
-- idempotency_key is already indexed by its UNIQUE constraint; no separate index is needed.
audit_log Immutable · Append-Only
CREATE TABLE audit_log (
  id              BIGSERIAL PRIMARY KEY,
  estate_id       TEXT,                                     -- nullable for admin/system events
  actor_id        TEXT NOT NULL,                            -- user ID or 'system'
  actor_role      TEXT NOT NULL,
  action          TEXT NOT NULL,                            -- 'read_ssn', 'task_complete', etc.
  resource_type   TEXT NOT NULL,
  resource_id     TEXT,
  ip_address      INET,
  user_agent      TEXT,
  metadata        JSONB NOT NULL DEFAULT '{}',
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

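-- Append-only enforced via Postgres row-level security.
-- FORCE applies the policies to the table owner as well. Superusers and
-- BYPASSRLS roles remain exempt, and TRUNCATE is not covered by RLS,
-- so the application role must not hold those privileges.
ALTER TABLE audit_log ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_log FORCE ROW LEVEL SECURITY;
CREATE POLICY audit_insert_only ON audit_log
  FOR INSERT WITH CHECK (TRUE);
CREATE POLICY audit_no_update ON audit_log
  FOR UPDATE USING (FALSE);
CREATE POLICY audit_no_delete ON audit_log
  FOR DELETE USING (FALSE);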

CREATE INDEX idx_audit_estate   ON audit_log(estate_id) WHERE estate_id IS NOT NULL;
CREATE INDEX idx_audit_action   ON audit_log(action, created_at DESC);
CREATE INDEX idx_audit_actor    ON audit_log(actor_id, created_at DESC);

Field-Level Encryption

The most sensitive fields — SSN, financial account numbers, medical record numbers — are encrypted at the application layer before being written to Postgres. Disk encryption (provided by Neon) is a necessary baseline but insufficient alone: it does not protect against a compromised database credential or a SQL injection vulnerability that returns raw rows. Field-level encryption ensures that even a full database dump is useless without the encryption keys.

Encryption Approach

  • Algorithm: AES-256-GCM with a random 96-bit nonce per encryption operation
  • Key Management: Per-estate derived keys using HKDF from a master key stored in Fly.io Secrets. Rotating the master key does not block on re-encrypting all records: re-encryption runs as a background job.
  • Storage format: BYTEA column stores nonce (12 bytes) || ciphertext || auth_tag (16 bytes)
  • Key derivation: HKDF(masterKey, salt=estateId, info="settle-v1-field-encryption")
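// src/lib/crypto/fieldEncryption.ts

import { createCipheriv, createDecipheriv, hkdfSync, randomBytes } from 'node:crypto';
// Assumption: the audit writer is exported from this path; it is not defined in this document.
import { auditLog } from '$lib/audit';

const MASTER_KEY = Buffer.from(process.env.FIELD_ENCRYPTION_KEY!, 'hex');
const ALGORITHM = 'aes-256-gcm';
const NONCE_LENGTH = 12; // 96-bit nonce
const TAG_LENGTH = 16;   // 128-bit auth tag

// HKDF(masterKey, salt=estateId, info="settle-v1-field-encryption") -> 32-byte AES key.
// Assumption: SHA-256 as the HKDF hash; the document does not name one.
function deriveEstateKey(estateId: string): Buffer {
  return Buffer.from(
    hkdfSync('sha256', MASTER_KEY, estateId, 'settle-v1-field-encryption', 32)
  );
}

export async function encryptField(
  plaintext: string,
  estateId: string
): Promise<Buffer> {
  const derivedKey = deriveEstateKey(estateId);
  const nonce = randomBytes(NONCE_LENGTH); // fresh nonce for every encryption
  const cipher = createCipheriv(ALGORITHM, derivedKey, nonce);
  const encrypted = Buffer.concat([
    cipher.update(plaintext, 'utf8'),
    cipher.final()
  ]);
  const authTag = cipher.getAuthTag();
  // Pack: nonce (12) + ciphertext + authTag (16)
  return Buffer.concat([nonce, encrypted, authTag]);
}

export async function decryptField(
  cipherBuffer: Buffer,
  estateId: string
): Promise<string> {
  // Audit log this access BEFORE decryption
  await auditLog.write({ action: 'field_decryption', estateId });

  const derivedKey = deriveEstateKey(estateId);
  const nonce = cipherBuffer.subarray(0, NONCE_LENGTH);
  const authTag = cipherBuffer.subarray(cipherBuffer.length - TAG_LENGTH);
  const ciphertext = cipherBuffer.subarray(NONCE_LENGTH, cipherBuffer.length - TAG_LENGTH);

  const decipher = createDecipheriv(ALGORITHM, derivedKey, nonce);
  decipher.setAuthTag(authTag);
  // Concatenate as bytes before decoding so multi-byte characters are never split;
  // final() throws if the auth tag does not verify.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}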

Fields That Require Encryption

Table | Column | Sensitivity | Display Fallback
deceased | ssn_encrypted | Critical | ssn_last4 (unencrypted)
financial_accounts | account_number_encrypted | Critical | account_last4
financial_accounts | routing_number_encrypted | High | institution name
benefits | policy_number_encrypted | High | insurer name + type
deceased | medical_record_numbers | High | provider name only

Document Vault (Cloudflare R2)

Documents — death certificates, wills, financial statements — are stored in R2 with server-side encryption (SSE-C using a per-estate key). The API never streams document content directly; instead it issues short-lived (5-minute) presigned URLs for download and upload. Documents are organized in R2 by a path scheme that does not expose PII in the object key.

R2 path format: estates/{estateId}/{documentType}/{ulid}.{ext}
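
A small helper can build keys in this format while refusing anything that looks like a user-supplied filename. The sketch below is hypothetical: `buildDocumentKey` and the allowed-character rule are assumptions, and the object ID is passed in rather than generated so the example carries no ULID dependency.

```typescript
// Hypothetical sketch: builds estates/{estateId}/{documentType}/{ulid}.{ext}
const SAFE_SEGMENT = /^[A-Za-z0-9_-]+$/;
const SAFE_EXTENSION = /^[a-z0-9]+$/;

function buildDocumentKey(
  estateId: string,
  documentType: string,
  objectId: string,
  ext: string
): string {
  // Rejecting spaces, dots, and slashes keeps names and paths out of the key.
  [estateId, documentType, objectId].forEach((segment) => {
    if (!SAFE_SEGMENT.test(segment)) {
      throw new Error('Unsafe key segment: only letters, digits, "_" and "-" are allowed');
    }
  });
  if (!SAFE_EXTENSION.test(ext)) {
    throw new Error('Unsafe file extension');
  }
  return 'estates/' + estateId + '/' + documentType + '/' + objectId + '.' + ext;
}
```

Because the original filename never enters the key, a name such as "Margaret Chen will.pdf" would be stored only in Postgres metadata, not in the object path.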

On upload, the Document Processor worker handles the R2 object:created event, performs virus scanning (ClamAV via a Fly.io sidecar), attempts OCR classification, and writes document metadata to Postgres. If the scan detects a threat, the document is moved to a quarantine prefix and the estate is notified.

Section 6

Notification Service Architecture

The notification service is the most operationally complex component in Settle. It must send actual cancellation requests to real institutions, generate and mail physical letters via Lob, and produce call scripts — each with fundamentally different reliability, latency, and tracking requirements. A failure in Tier 1 must not affect Tier 2; a physical letter sent twice is a worse failure mode than a digital request sent twice.

Tier 1
Automated API Cancellation
  • Direct API calls to subscription services
  • Netflix, Spotify, utility APIs
  • Idempotent via service-side dedup
  • Confirmation number stored in DB
  • Retry on 5xx, not on 4xx
  • Success = 200/204 from service
Tier 2
Managed Physical Mail
  • Lob API for letter generation
  • Death cert PDF merged into template
  • Lob provides USPS tracking
  • Delivery webhook updates status
  • Never send twice (idempotency key)
  • CASS-certified addressing required
Tier 3
Guided Call Scripts
  • Purely computational — no external calls
  • Template rendering + personalization
  • Returns structured script object
  • Includes estimated hold times
  • No retry needed (idempotent by nature)
  • Logged to notification_records table

Worker Architecture

The notification worker is a long-running Fly.io process that polls the PgBoss job queue. It is isolated from the API server to prevent notification failures from impacting user-facing requests. The worker processes jobs by tier in priority order: Tier 2 (physical mail, longest lead time) processes first, then Tier 1, then Tier 3.

[Diagram: the API Server (POST /notifications) enqueues jobs onto the Postgres-backed PgBoss queue (notify:tier1, notify:tier2, notify:tier3). The Notification Router dequeues, performs an idempotency check, and dispatches by tier. Tier 1 Handler: institution API, retry(3), idempotency key est+notifId. Tier 2 Handler: Lob API, letter_id stored, webhook updates delivery. Tier 3 Handler: script render, no I/O, returns to client instantly. All handlers write to the Postgres notifications table.]
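The tier priority order described above can be sketched as a pure ordering function. The `QueuedJob` shape and the tie-break on enqueue time are assumptions made for illustration; the real worker may instead express priority through the order in which it polls the three PgBoss queues.

```typescript
type Tier = 1 | 2 | 3;

// Hypothetical in-memory view of a dequeued job.
interface QueuedJob {
  id: string;
  tier: Tier;
  enqueuedAt: number; // epoch milliseconds
}

// Lower rank is processed earlier: Tier 2 (physical mail), then Tier 1, then Tier 3.
const TIER_RANK: Record<Tier, number> = { 2: 0, 1: 1, 3: 2 };

// Returns a new array ordered by tier priority, oldest first within a tier.
function orderByTierPriority(jobs: QueuedJob[]): QueuedJob[] {
  return [...jobs].sort(
    (a, b) => TIER_RANK[a.tier] - TIER_RANK[b.tier] || a.enqueuedAt - b.enqueuedAt
  );
}
```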

Tier 1 — API Cancellation

Tier 1 notifications call institution APIs directly to cancel subscriptions or notify of a death. The key engineering challenges are: (1) each institution has a different API contract, (2) idempotency must be enforced even if the request succeeds but the network drops before the response arrives, and (3) the system must distinguish between "institution confirmed cancellation" and "institution returned 200 but nothing happened."

// Institution adapter interface — each institution implements this
interface InstitutionAdapter {
  institutionId: string;
  sendNotification(
    payload: NotificationPayload,
    idempotencyKey: string
  ): Promise<NotificationResult>;
}

interface NotificationResult {
  success: boolean;
  externalId?: string;       // confirmation number from institution
  confirmedAction?: string;   // 'cancelled' | 'notified' | 'pending_review'
  errorCode?: string;
  retryable: boolean;          // false for 4xx, true for 5xx/network errors
}

// Idempotency key construction
const idempotencyKey = `tier1:${estateId}:${institutionId}:${notificationType}`;
// This key is stable across retries. Postgres UNIQUE constraint on
// notifications.idempotency_key ensures no duplicate processing.
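
One way an adapter might separate "institution confirmed" from "returned 200 but nothing happened" is to require an explicit confirmed action before reporting success. This is a sketch under that assumption: `InstitutionResponse` and the `UNCONFIRMED` and `NETWORK_ERROR` codes are hypothetical names, and treating an unconfirmed 2xx as a non-retryable failure is a design choice rather than something the source specifies.

```typescript
interface NotificationResult {
  success: boolean;
  externalId?: string;
  confirmedAction?: string;
  errorCode?: string;
  retryable: boolean;
}

// Hypothetical normalized shape that each adapter extracts from its institution's reply.
interface InstitutionResponse {
  httpStatus: number | null; // null means no response arrived (network error)
  confirmationNumber?: string;
  action?: 'cancelled' | 'notified' | 'pending_review';
}

function toNotificationResult(res: InstitutionResponse): NotificationResult {
  if (res.httpStatus === null) {
    return { success: false, errorCode: 'NETWORK_ERROR', retryable: true };
  }
  if (res.httpStatus >= 500) {
    return { success: false, errorCode: `HTTP_${res.httpStatus}`, retryable: true };
  }
  if (res.httpStatus >= 400) {
    return { success: false, errorCode: `HTTP_${res.httpStatus}`, retryable: false };
  }
  // A 2xx without an explicit action is treated as unconfirmed, not as success.
  if (!res.action) {
    return { success: false, errorCode: 'UNCONFIRMED', retryable: false };
  }
  return {
    success: true,
    externalId: res.confirmationNumber,
    confirmedAction: res.action,
    retryable: false
  };
}
```

Because the idempotency key is stable, retrying after a network error is safe even when the institution already processed the first request.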

Tier 2 — Physical Mail via Lob

Physical mail has the highest failure cost of the three tiers. A letter sent to a wrong address wastes money and time, but more importantly, it may delay asset recovery for a grieving family. The Tier 2 handler performs address validation via Lob's CASS-certified address API before creating the letter. Only after validation succeeds does it proceed to letter creation.

// Tier 2 handler flow
async function handleTier2(job: NotificationJob) {
  // 1. Check idempotency — was this letter already sent?
  const existing = await db.notifications.findByIdempotencyKey(job.idempotencyKey);
  if (existing?.external_id) {
    // Already sent to Lob — fetch status and return
    return { success: true, externalId: existing.external_id, alreadySent: true };
  }

  // 2. Validate address via Lob before creating letter
  const addressVerification = await lob.usVerifications.verify({
    primary_line: job.recipientAddress.line1,
    city: job.recipientAddress.city,
    state: job.recipientAddress.state,
    zip_code: job.recipientAddress.zip
  });
  if (addressVerification.deliverability === 'undeliverable') {
    throw new NotificationError('UNDELIVERABLE_ADDRESS', { retryable: false });
  }

  // 3. Fetch death certificate presigned URL from R2
  const deathCertUrl = await r2.getPresignedUrl(job.estateId, 'death_certificate');

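  // 4. Create letter via Lob API
  const letter = await lob.letters.create(
    {
      description: `Estate notification: ${job.estateId}`,
      // Map the verified address onto the letter's address fields. The raw
      // verification `components` object is not a valid `to` address.
      to: {
        name: job.institutionName,
        address_line1: addressVerification.primary_line,
        address_city: addressVerification.components.city,
        address_state: addressVerification.components.state,
        address_zip: addressVerification.components.zip_code
      },
      from: SETTLE_RETURN_ADDRESS,
      file: job.templateId,
      merge_variables: {
        deceasedName: job.deceasedName,
        institutionName: job.institutionName,
        estateExecutor: job.executorName,
        deathCertificateEnclosure: true
      }
    },
    // Lob reads the idempotency key from the Idempotency-Key request header,
    // not from the request body. This assumes the client accepts request
    // headers as a second argument. Prevents a duplicate if the network fails.
    { 'idempotency-key': job.idempotencyKey }
  );

  // 5. Store Lob letter ID for tracking
  await db.notifications.update(job.notificationId, {
    status: 'sent',
    external_id: letter.id,
    sent_at: new Date()
  });

  // Return the same shape as the already-sent path above
  return { success: true, externalId: letter.id, alreadySent: false };
}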

Lob sends delivery status webhooks when a letter is in-transit, delivered, or returned. The webhook handler updates notifications.status and, on delivery failure (returned mail), creates a new task for the executor to verify the institution's address.
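
The webhook handling described above can be sketched as a mapping from event type to the resulting status and follow-up action. The status values and the `WebhookOutcome` shape are assumptions; the event type strings follow Lob's `letter.*` naming and should be checked against Lob's current event list before being relied on.

```typescript
// Hypothetical status values for notifications.status.
type NotificationStatus = 'in_transit' | 'delivered' | 'returned';

interface WebhookOutcome {
  newStatus: NotificationStatus;
  // True when the executor should be asked to verify the institution's address.
  createAddressVerificationTask: boolean;
}

// Returns the outcome for a tracked event, or null for events the handler ignores.
function outcomeForLobEvent(eventType: string): WebhookOutcome | null {
  switch (eventType) {
    case 'letter.in_transit':
      return { newStatus: 'in_transit', createAddressVerificationTask: false };
    case 'letter.delivered':
      return { newStatus: 'delivered', createAddressVerificationTask: false };
    case 'letter.returned_to_sender':
      return { newStatus: 'returned', createAddressVerificationTask: true };
    default:
      return null;
  }
}
```

Keeping the mapping pure leaves the database update and task creation to the caller, which makes the handler straightforward to test against replayed webhook payloads.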

Tier 3 — Call Script Generation

Tier 3 is a pure computation: given an estate context and institution profile, generate a structured call script. No external API calls are made. The script object is returned synchronously to the client and also stored in notifications for record-keeping. The handler uses a template system with institution-specific overrides for hold queues, required account information, and department routing.
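
A minimal sketch of template rendering with institution-specific overrides. The `InstitutionProfile` and `CallScript` shapes, the default values, and the opening line are all hypothetical; they only illustrate how overrides can fall back to defaults without any I/O.

```typescript
// Hypothetical institution profile; every field except name is an optional override.
interface InstitutionProfile {
  name: string;
  department?: string;
  estimatedHoldMinutes?: number;
  requiredInfo?: string[];
}

interface CallScript {
  institution: string;
  department: string;
  estimatedHoldMinutes: number;
  haveOnHand: string[];
  opening: string;
}

// Illustrative defaults used when a profile has no override.
const SCRIPT_DEFAULTS = {
  department: 'Customer Service',
  estimatedHoldMinutes: 15,
  requiredInfo: ['Death certificate', 'Account number']
};

function renderCallScript(
  deceasedName: string,
  executorName: string,
  profile: InstitutionProfile
): CallScript {
  const department = profile.department ?? SCRIPT_DEFAULTS.department;
  return {
    institution: profile.name,
    department,
    estimatedHoldMinutes: profile.estimatedHoldMinutes ?? SCRIPT_DEFAULTS.estimatedHoldMinutes,
    haveOnHand: profile.requiredInfo ?? SCRIPT_DEFAULTS.requiredInfo,
    opening:
      `Hello, my name is ${executorName}. I am the executor of the estate of ` +
      `${deceasedName}. Could you connect me with ${department}?`
  };
}
```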

Reliability Model

The critical distinction: Tier 1 failure means an institution subscription was not cancelled and may continue charging the estate. Tier 2 failure means a physical letter was not sent and a deadline may be missed. Both are worse outcomes than a system error that surfaces to the executor with a retry option.
Tier | Retry Strategy | Backoff | Max Retries | Failure Action
Tier 1 | Retry on 5xx and network errors; no retry on 4xx | Exponential: 1m, 5m, 30m, 2h | 4 | Mark failed; create manual task for executor
Tier 2 | Retry on Lob API errors; no retry on address validation failure | Exponential: 5m, 30m, 4h | 3 | Mark failed; notify executor to verify address
Tier 3 | No retry needed (pure computation) | n/a | 0 | Surface error with context; log for debugging
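
The retry schedule in the table can be expressed as a lookup. This is a sketch: the function name and the convention that `attempt` is the 1-based retry number are assumptions, and in production the delays would more likely be passed to the queue's own retry options.

```typescript
type Tier = 1 | 2 | 3;

// Delay in minutes before each retry, taken from the reliability table.
const BACKOFF_MINUTES: Record<Tier, number[]> = {
  1: [1, 5, 30, 120],
  2: [5, 30, 240],
  3: []
};

// Returns the delay before retry number `attempt` (1-based), or null when the
// error is not retryable or the retries for that tier are exhausted.
function nextRetryDelayMinutes(tier: Tier, attempt: number, retryable: boolean): number | null {
  if (!retryable) return null;
  return BACKOFF_MINUTES[tier][attempt - 1] ?? null;
}
```

A null result is the signal to apply the tier's failure action, such as marking the notification failed and creating a manual task for the executor.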
Section 7

Benefit Discovery Architecture

Benefit discovery is the process of finding assets and entitlements the family may not know about: unclaimed property held by states, life insurance policies, pension benefits, VA entitlements. Some sources have APIs; some require web scraping; some require manual submission. The architecture must handle all three and present results with appropriate confidence signals so families act on real findings, not false positives.

Source Taxonomy

Source | Data | Integration Type | Auth Required | Rate Limit
NAUPA / MissingMoney | Unclaimed property (all 50 states) | HTTP scrape | None | Per-state limits, ~1 req/10s
NAIC Life Policy Locator | Life insurance policies | Form submission | Registration required | Manual review cycle (days)
PBGC Pension Search | Defined-benefit pension benefits | HTTP scrape | None | Unknown; respect robots.txt
VA Benefits | Veteran burial/death benefits | API (va.gov) | API key | Documented per-token limits
SSA Death Benefits | Lump-sum death payment | Manual guidance | n/a | n/a
MIB (Medical Info Bureau) | Insurance application records | Manual guidance | n/a | n/a
State unclaimed property (direct) | State treasury holdings | HTTP scrape | Varies by state | Varies by state

Scanner Architecture

The benefit scanner is a Fly.io worker that runs on a cron schedule (once per week per active estate) and on-demand when triggered by the executor. It uses a source adapter pattern — each external source has an adapter implementing a common interface — allowing new sources to be added without modifying the core scanner logic.

// Source adapter interface
interface BenefitSourceAdapter {
  sourceId: string;
  type: 'api' | 'scrape' | 'manual_guidance';
  canAutoScan: boolean;

  scan(
    context: ScanContext
  ): Promise<BenefitScanResult[]>;
}

interface ScanContext {
  deceasedName: { first: string; last: string };
  deceasedSsn?: string;         // decrypted only for sources that require it
  dateOfDeath: Date;
  stateOfResidence: string;
  hasVeteranStatus: boolean;
}

interface BenefitScanResult {
  sourceId: string;
  confidence: 'confirmed' | 'probable' | 'possible';
  benefitType: string;
  estimatedValueCents?: number;
  claimUrl?: string;
  manualStepsRequired?: string[];
  rawData: Record<string, unknown>;
  scannedAt: Date;
}

Caching Strategy

External benefit databases must not be queried on every page load. Scan results are cached in Postgres with a scanned_at timestamp. The frontend shows the cached result with a freshness indicator. Scans are throttled: no source is queried more than once per 24 hours per estate, regardless of how many times the executor views the benefits page. This prevents accidental hammering of NAUPA or PBGC from a user repeatedly refreshing the page.
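
The 24-hour throttle can be sketched as a pure check against the cached `scanned_at` value. The function and constant names are illustrative; `lastScannedAt` is assumed to be the most recent scan timestamp for one source and one estate, or null if that source has never been scanned.

```typescript
const MIN_SCAN_INTERVAL_MS = 24 * 60 * 60 * 1000;

// True when the source may be queried again for this estate.
function shouldQuerySource(lastScannedAt: Date | null, now: Date): boolean {
  if (lastScannedAt === null) return true;
  return now.getTime() - lastScannedAt.getTime() >= MIN_SCAN_INTERVAL_MS;
}
```

Page views never call the source directly. They read the cached row, and only the worker consults this check before running a scan.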

SSN handling in the scanner: The SSN is only decrypted immediately before a scan that requires it (currently only NAUPA for some states). It is passed as a transient string within the worker's memory and is never logged, stored in the scan result, or included in audit logs. The audit log records that a scan requiring SSN access was performed, with the estate ID and timestamp, but not the SSN value itself.

Confidence Model

Confirmed
Name and SSN matched a specific record in the source database. A dollar amount is known. Example: NAUPA returned an exact unclaimed property record. Action: prompt executor to file claim immediately.
Probable
Name matched without SSN confirmation, or record exists but amount is unknown. Example: PBGC found a pension record with matching employer name. Action: display with guidance to verify and claim.
Possible
Based on estate profile characteristics (e.g., deceased worked in a state with a large unclaimed property backlog), a benefit likely exists but no matching record was found. Action: display as guidance with manual lookup steps.
Manual-guidance sources (NAIC, MIB, SSA): These sources cannot be automated. For these, the scanner generates structured guidance: the exact URL to visit, the form to complete, the information to have on hand, and the expected response timeline. This guidance is surfaced as a "Possible" benefit with a checklist of manual steps. The executor can mark it as completed when done.
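
The three levels above can be sketched as a classification over match signals. The `MatchSignals` shape is hypothetical; in particular, how an adapter decides that a name or SSN "matched" is source-specific and not shown here.

```typescript
type Confidence = 'confirmed' | 'probable' | 'possible';

// Hypothetical signals an adapter reports for one candidate benefit.
interface MatchSignals {
  recordFound: boolean; // false for inferred or manual-guidance results
  nameMatched: boolean;
  ssnMatched: boolean;
  amountKnown: boolean;
}

function classifyConfidence(s: MatchSignals): Confidence {
  // Confirmed: name and SSN matched a specific record with a known amount.
  if (s.recordFound && s.nameMatched && s.ssnMatched && s.amountKnown) return 'confirmed';
  // Probable: a record matched by name, but the SSN or the amount is missing.
  if (s.recordFound && s.nameMatched) return 'probable';
  // Possible: no matching record, which also covers manual-guidance sources.
  return 'possible';
}
```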
Section 8

Security Architecture

Settle handles death certificates, Social Security numbers, financial account numbers, and medical history. Exposing this data while a family is grieving would be a categorical failure. Security is not a feature to be added in Year 2; every component in this document has been designed with Defense in Depth as a first principle.

Threat Model Summary

Threat Vector | Severity | Primary Mitigation | Secondary Mitigation
SQL injection → PII exfiltration | Critical | Parameterized queries (Drizzle ORM); no string concatenation | Field-level encryption renders exfiltrated SSNs useless
IDOR: accessing another estate's data | Critical | Postgres RLS policies keyed to session estate ID | API middleware ownership check before every handler
Credential theft (session hijack) | Critical | HttpOnly, Secure, SameSite=Strict session cookie | Request fingerprinting; IP change triggers re-auth prompt
Document vault unauthorized access | Critical | Presigned URLs valid for 5 minutes only; R2 bucket not public | SSE-C per-estate keys; document access is audit-logged
Compromised database credential | High | Neon per-role least-privilege credentials; no superuser in app | Field encryption; SSNs unreadable without app-layer keys
Supply chain attack (npm) | High | Dependabot; lockfile integrity checked in CI | Minimal dependency philosophy; audit npm packages quarterly
Unauthorized Practice of Law | High | Content policy in rules engine schema; architectural separation of guidance vs advice | Legal review required for all rule set publications
Over-retention of sensitive data | Medium | Retention policy: 7 years post-estate closure; automated deletion jobs | Right-to-erasure workflow for CCPA compliance

Defense in Depth: Layer View

CDN
Layer 1 — Edge (Cloudflare)
DDoS mitigation, WAF rules blocking common injection patterns, bot management. Rate limiting at the IP level: 100 requests/minute to API routes, 10 authentication attempts/minute per IP. All traffic forced to HTTPS with HSTS preloading.
APP
Layer 2 — Application (SvelteKit hooks)
Session validation on every authenticated request. RBAC check before handler execution. CSRF protection via double-submit cookie on all state-mutating requests. Input validation with Zod schemas — reject malformed input before any database interaction. Content Security Policy headers on all responses.
DB
Layer 3 — Database (Postgres RLS + Roles)
Row-Level Security policies ensure a database connection authenticated as settle_app can only read rows belonging to the estate in the session context. A separate read-only role settle_analytics has no access to the PII tables. The audit_log table is append-only at the Postgres policy level.
ENC
Layer 4 — Field Encryption (AES-256-GCM)
SSN, financial account numbers, and medical record identifiers are encrypted at the application layer before being written to Postgres. The encryption keys are derived per-estate from a master key stored in Fly.io Secrets — never in the database or application code. A full database dump without the application keys is useless for extracting these fields.
R2
Layer 5 — Document Storage (R2 SSE-C)
Documents are encrypted at rest in R2 using customer-supplied keys (SSE-C), with a unique key per estate derived from the same master key as field encryption. Documents are never accessible via a public URL. All access is via presigned URLs generated server-side, valid for 5 minutes, scoped to a single object, and audit-logged.
AUD
Layer 6 — Audit Logging (Immutable)
Every access to a sensitive field, every document download, every task mutation, and every rule set publication is written to the append-only audit_log table. Postgres RLS prevents any application role from updating or deleting audit records. Logs are retained for 7 years per legal hold requirements and are exportable per estate for legal discovery.
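
As one concrete instance of the layered checks, the double-submit cookie comparison in Layer 2 can be sketched as follows. The function names are hypothetical and the token transport (cookie plus request header) is assumed; this shows only the comparison, not token issuance or SvelteKit hook wiring.

```typescript
// Compares two strings without exiting early on the first differing character.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

// Double-submit check: a state-mutating request must echo the CSRF cookie
// value in a request header; safe methods are exempt.
function passesCsrfCheck(
  method: string,
  cookieToken: string | undefined,
  headerToken: string | undefined
): boolean {
  if (SAFE_METHODS.has(method.toUpperCase())) return true;
  if (!cookieToken || !headerToken) return false;
  return constantTimeEqual(cookieToken, headerToken);
}
```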

Data Retention and the Right to Erasure

Estate records are retained for 7 years following estate closure, aligned with common statute of limitations for executor liability. At the 7-year mark, a background job initiates deletion: PII fields are overwritten with null values, documents are deleted from R2, and the estate record is anonymized (names replaced with hashed identifiers). The estate's task completion record and financial summary are retained in anonymized form for aggregate analytics.

For CCPA right-to-erasure requests during an active estate, the request is held pending estate closure (deletion during active administration would be legally problematic). The request is logged and honored automatically at closure plus a 90-day cooling-off period.
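
The two deletion timelines can be sketched as a single date calculation. The function name is illustrative, and the sketch assumes the 90-day cooling-off period is counted in calendar days from the closure timestamp, in UTC.

```typescript
const RETENTION_YEARS = 7;
const ERASURE_COOLING_OFF_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns when the deletion job should run for a closed estate.
function scheduledDeletionDate(closedAt: Date, erasureRequested: boolean): Date {
  if (erasureRequested) {
    // Erasure request: honored at closure plus the cooling-off period.
    return new Date(closedAt.getTime() + ERASURE_COOLING_OFF_DAYS * MS_PER_DAY);
  }
  // Default retention: seven years after closure. A Feb 29 closure rolls
  // over to Mar 1 when the target year is not a leap year.
  const deletion = new Date(closedAt.getTime());
  deletion.setUTCFullYear(deletion.getUTCFullYear() + RETENTION_YEARS);
  return deletion;
}
```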

Section 9

Scaling Strategy

Settle's growth trajectory — 500 estates in Year 1, 5,000 in Year 2, 50,000 in Year 3 — spans two orders of magnitude. The architecture is sized for Year 1 today and designed to scale to Year 3 without rearchitecting the core data model or notification service. Each year has a clear set of scaling gates that trigger architectural evolution.

Year 1 — Foundation
500
Active estates
~50
Concurrent users (peak)
  • Modular monolith on Fly.io
  • Neon 0.25 CU, autoscale
  • Single notification worker
  • Upstash Redis for sessions
  • Benefit scanner: weekly cron
Year 2 — Vertical Scale
5,000
Active estates
~500
Concurrent users (peak)
  • Add Neon read replica for analytics
  • Scale API to 4 instances
  • Notification worker: 2 instances
  • Add connection pooling (PgBouncer)
  • R2 bucket per region (2 regions)
Year 3 — Horizontal Scale
50,000
Active estates
~5,000
Concurrent users (peak)
  • Extract notification service to standalone API
  • Extract benefit scanner to standalone service
  • Neon: scale to 4+ CU with dedicated compute
  • Consider Postgres partitioning on tasks by estate
  • Dedicated analytics database (ClickHouse)

Key Scaling Decisions and Their Triggers

When | Trigger | Action | Complexity Cost
~2,000 estates | DB p99 query time > 100ms | Add PgBouncer connection pooler; add Neon read replica for reports | Low (operational only)
~5,000 estates | Notification queue depth consistently > 100 | Scale notification worker to 3 concurrent instances | Low (Fly.io scaling config only)
~10,000 estates | API p99 latency > 500ms or error rate > 0.1% | Separate API server and notification worker into distinct Fly apps; independent scaling | Medium (deploy config and inter-service auth)
~30,000 estates | Tasks table exceeds 10M rows; scan times degrade | Partition tasks table by estate_id range; add partial indexes on active estates | Medium (zero-downtime migration required)
~50,000 estates | Benefit scanner can't complete weekly runs within window | Extract scanner to dedicated service; parallelize by state/source | High (separate service with its own queue and auth)
The modular monolith pays off here. Because service boundaries are enforced at the module level from Day 1 (the notification worker, benefit scanner, and rules engine are separate TypeScript modules with defined interfaces), extracting them to standalone services at Year 3 is primarily a deployment change, not a rewrite. The module code stays as it is; the new work is the deployment topology, plus the queue and inter-service auth noted in the table above.
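
The triggers in the table lend themselves to an automated check. The sketch below assumes a hypothetical `ScalingMetrics` snapshot and made-up action identifiers. It evaluates a single sample, whereas the table's "consistently" implies a sustained window in practice.

```typescript
// Hypothetical metrics snapshot; field names are illustrative.
interface ScalingMetrics {
  dbP99Ms: number;
  notificationQueueDepth: number;
  apiP99Ms: number;
  apiErrorRate: number; // fraction, so 0.001 means 0.1%
  taskRows: number;
  scannerFinishedInWindow: boolean;
}

// Returns the scaling actions whose trigger thresholds are currently exceeded.
function trippedScalingGates(m: ScalingMetrics): string[] {
  const actions: string[] = [];
  if (m.dbP99Ms > 100) actions.push('add-connection-pooler-and-read-replica');
  if (m.notificationQueueDepth > 100) actions.push('scale-notification-workers');
  if (m.apiP99Ms > 500 || m.apiErrorRate > 0.001) actions.push('split-api-and-worker-apps');
  if (m.taskRows > 10_000_000) actions.push('partition-tasks-table');
  if (!m.scannerFinishedInWindow) actions.push('extract-benefit-scanner');
  return actions;
}
```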
Section 10

Architecture Decision Records

Every significant architectural decision is recorded here with its context, the options considered, the decision made, and the tradeoffs accepted. These records are immutable once a decision is implemented — new decisions supersede rather than modify them.

ADR-001
Postgres JSONB for State Rules Engine (vs Dedicated Rules Engine)
Accepted
Context
The system must represent, evaluate, and version legal rules for 50 states with different probate thresholds, required forms, and filing deadlines. Options ranged from a dedicated business rules engine (Drools, OpenL Tablets) to a custom DSL to JSON data in Postgres.
Decision
Store rule sets as versioned JSONB documents in Postgres, evaluated by a pure TypeScript function in the application layer. Rule logic is simple conditional evaluation — no forward chaining, no conflict resolution, no complex inference. A full rules engine would solve problems we don't have while adding operational complexity we can't justify at Year 1 scale.
Tradeoffs
Chosen: JSONB in Postgres
  • No new infrastructure to operate
  • Rule sets are version-controlled with the database
  • TypeScript evaluator is trivially testable
  • Readable by paralegals with training
  • Easy to extend the schema as rules grow
Alternative: Drools / OpenL
  • Industry-standard for complex rule systems
  • Forward-chaining and conflict resolution built in
  • Requires Java runtime or separate service
  • High operational overhead for 50-state static rules
  • Over-engineered for conditional task filtering
Consequence
If rules become significantly more complex — mutual exclusions, forward-chained triggers, conflict resolution between state and federal rules — this decision should be revisited. The evaluator module is the natural extraction point.
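
To illustrate what "simple conditional evaluation" means here, the sketch below evaluates a rule set with a pure function. The `RuleSet`, `TaskRule`, and `RuleCondition` shapes are hypothetical and are not the schema Settle stores in JSONB; only the approach (versioned data plus a pure evaluator) comes from the decision above.

```typescript
// Hypothetical rule schema for illustration only.
interface RuleCondition {
  field: string;
  op: 'eq' | 'lt' | 'gte';
  value: string | number | boolean;
}

interface TaskRule {
  taskId: string;
  when: RuleCondition[]; // all conditions must hold
}

interface RuleSet {
  state: string;
  version: number;
  rules: TaskRule[];
}

type EstateFacts = Record<string, string | number | boolean>;

function conditionHolds(c: RuleCondition, facts: EstateFacts): boolean {
  const actual = facts[c.field];
  if (actual === undefined) return false;
  if (c.op === 'eq') return actual === c.value;
  // Ordering comparisons are only defined for numbers.
  if (typeof actual !== 'number' || typeof c.value !== 'number') return false;
  return c.op === 'lt' ? actual < c.value : actual >= c.value;
}

// Pure evaluation: no I/O, no forward chaining. The rule set version is
// returned so it can be recorded alongside the generated task plan.
function evaluateRuleSet(ruleSet: RuleSet, facts: EstateFacts) {
  const taskIds = ruleSet.rules
    .filter(rule => rule.when.every(c => conditionHolds(c, facts)))
    .map(rule => rule.taskId);
  return { taskIds, ruleSetVersion: ruleSet.version };
}
```

Because the evaluator has no dependencies, each state's rule set can be covered by table-driven unit tests that run without a database.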
ADR-002
Column-Level Encryption (vs Disk Encryption Only)
Accepted
Context
Neon provides disk-level encryption at rest (AES-256). The question is whether application-layer column-level encryption is also required for the highest-sensitivity fields (SSN, financial account numbers).
Decision
Implement column-level AES-256-GCM encryption in the application for SSN, financial account numbers, routing numbers, and medical record identifiers. Disk encryption is not sufficient because it does not protect against: a compromised database credential returning raw rows, a SQL injection vulnerability, a misconfigured query in the application, or a cloud provider employee with storage access. Column encryption adds a defense layer that is independent of all of these.
Tradeoffs
Chosen: Column-Level Encryption
  • Encrypted at-rest data useless without app keys
  • Independent of database security posture
  • Supports field-level access audit logging
  • Enables per-estate key derivation
Alternative: Disk Encryption Only
  • Zero application complexity
  • Fields are queryable/indexable
  • Does not protect against SQL injection
  • Does not protect against credential compromise
Consequence
Encrypted columns cannot be indexed or searched directly. For SSN, we store the last 4 digits in a separate unencrypted column for display and lookup purposes. Full SSN is only decrypted on explicit access, which is audit-logged. Search by SSN is not a supported use case in the product.
ADR-003
Lob for Physical Mail (vs Building Mail Infrastructure)
Accepted
Context
Tier 2 notifications require generating and mailing physical letters — often including a copy of the death certificate — to financial institutions. Options include using a mail API service (Lob), partnering with a print/mail fulfillment vendor, or building the capability in-house.
Decision
Use Lob's letter API. Lob provides CASS-certified address verification, USPS tracking, delivery webhooks, and secure document handling with SOC 2 Type II certification. Building this in-house would require print vendor relationships, postage accounts, CASS certification, and return mail handling — none of which is core product differentiation.
Tradeoffs
Chosen: Lob API
  • CASS address verification included
  • USPS tracking and delivery webhooks
  • SOC 2 Type II certified
  • No vendor relationship management
  • Higher per-letter cost vs volume contracts
Alternative: In-House / Vendor
  • Lower per-letter cost at volume
  • Full control over design and timing
  • Requires CASS certification
  • Return mail handling complexity
  • Significant operational overhead
Consequence
Lob's per-letter pricing (~$1.50–$3.50 including postage) is acceptable at Year 1–2 volume. At Year 3 (50,000 estates × multiple letters each), a volume contract negotiation with Lob or a migration to a direct print vendor should be evaluated. The Tier 2 handler interface makes this migration a swap of the Lob adapter only.
ADR-004
Session-Based Auth (vs JWT)
Accepted
Context
Settle requires authentication for a user population with a 16–18 month engagement window, role changes (adding attorneys, removing co-executors), and the potential for emergency access revocation. The choice between stateless JWT auth and stateful session auth carries meaningfully different security properties.
Decision
Session-based authentication with an HttpOnly, Secure, SameSite=Strict cookie storing a cryptographically random session token. Sessions are stored in Upstash Redis with a 30-day sliding expiry. The primary driver is instant revocability: when an attorney is removed from an estate or a user reports a compromised account, the session can be invalidated immediately by deleting it from Redis. With JWTs, the token remains valid until expiry regardless of server-side state changes.
Tradeoffs
Chosen: Session Auth
  • Instant session revocation on any device
  • Role changes take effect immediately
  • No complex token refresh infrastructure
  • Requires Redis for session storage
  • Every request hits Redis (fast, but a dependency)
Alternative: JWT (short-lived)
  • Stateless — no session store required
  • Horizontally scales without shared state
  • Cannot revoke before expiry without a denylist (= session store again)
  • Refresh token complexity
  • Wrong security model for role-change-heavy workflows
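
The revocation property that drives this decision can be sketched with an in-memory store. The `Map` here stands in for Upstash Redis and the class and method names are hypothetical; a real implementation would set the key's TTL in Redis rather than track `expiresAt` itself.

```typescript
interface SessionRecord {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

const SESSION_TTL_MS = 30 * 24 * 60 * 60 * 1000; // 30-day sliding expiry

class SessionStore {
  private sessions = new Map<string, SessionRecord>();

  create(token: string, userId: string, now: number): void {
    this.sessions.set(token, { userId, expiresAt: now + SESSION_TTL_MS });
  }

  // Validates the token and slides its expiry forward.
  // Returns the user ID, or null if the session is missing or expired.
  touch(token: string, now: number): string | null {
    const record = this.sessions.get(token);
    if (!record || record.expiresAt <= now) {
      this.sessions.delete(token);
      return null;
    }
    record.expiresAt = now + SESSION_TTL_MS;
    return record.userId;
  }

  // Revokes every session for a user, for example when an attorney is removed.
  revokeAllForUser(userId: string): void {
    this.sessions.forEach((record, token) => {
      if (record.userId === userId) this.sessions.delete(token);
    });
  }
}
```

The next request after `revokeAllForUser` fails validation immediately, which is the behavior a short-lived JWT cannot provide without a server-side denylist.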
ADR-005
SvelteKit (vs Next.js)
Accepted
Context
The frontend required SSR capabilities, a co-located API layer, TypeScript support, and a modern reactive component model. The two leading options were SvelteKit and Next.js (React).
Decision
SvelteKit. The key factors: (1) SvelteKit's server routes co-locate API logic with page logic, reducing cognitive overhead for a small team; (2) Svelte's compiled output is smaller and faster than React + React DOM, relevant for users on mobile connections; (3) SvelteKit's form actions provide a clean, progressive-enhancement model for the intake flow; (4) the load function pattern makes server/client data boundary explicit and is well-suited to the estate data model.
Tradeoffs
Chosen: SvelteKit
  • Smaller bundle size; faster on mobile
  • Co-located API routes reduce context switching
  • Form actions great for intake flows
  • Smaller ecosystem than React
  • Fewer available engineers in hiring market
Alternative: Next.js (React)
  • Largest frontend ecosystem
  • React Server Components for complex UIs
  • Larger engineer hiring pool
  • App router complexity overhead
  • Larger runtime footprint
ADR-006
Modular Monolith (vs Microservices from Day 1)
Accepted
Context
The system has several conceptually distinct services: the rules engine, notification service, and benefit scanner. The architectural question is whether to deploy these as separate services immediately or co-locate them in a single deployed application.
Decision
Deploy as a modular monolith: enforce service boundaries via TypeScript module boundaries and strict inter-module interfaces, but deploy as a single Fly.io application (with the notification worker and benefit scanner as separate Fly machines). At 500 estates per year, the operational overhead of service meshes, inter-service auth, distributed tracing, and independent deployment pipelines is not justified. The module boundaries ensure extraction is a deployment change, not a rewrite, when scale demands it.
Consequence
The system ships as a single TypeScript build artifact, so a bug in the notification module could in principle crash the API server. Mitigation: the notification and scanner workers run as separate Fly machines from the API, so a worker crash does not affect user-facing routes. The API server imports the rules engine synchronously, so a rules engine bug does affect the API. Mitigation: extensive test coverage of the evaluator, and feature flags to disable the rules engine at the route level.
ADR-007
PgBoss for Job Queue (vs Redis-backed Queue)
Accepted
Context
The notification service and benefit scanner require a reliable job queue. The main options were a Redis-backed queue (BullMQ) or a Postgres-backed queue (PgBoss).
Decision
PgBoss. Because Postgres is already the primary data store, PgBoss requires no additional infrastructure. Job state, retry history, and dead-letter queues are all in the same database as the business data. This means job records and the notifications table can be queried in the same transaction, and the entire job history is available for operational debugging without a separate Redis cluster.
Tradeoffs
Chosen: PgBoss (Postgres)
  • No additional infrastructure
  • Transactional job creation with business data
  • Full job history queryable in SQL
  • Lower throughput ceiling than Redis (~1,000 jobs/sec)
  • Adds write load to primary database
Alternative: BullMQ (Redis)
  • Higher throughput (10,000+ jobs/sec)
  • Better real-time queue monitoring
  • Additional infrastructure to operate
  • Job state not colocated with business data
  • Not necessary at Year 1 volume
ADR-008
Confidence-Tiered Benefit Results (vs Binary Found/Not-Found)
Accepted
Context
The benefit scanner may find records that match by name but not SSN, or may infer likely benefits from estate characteristics without finding a specific record. The UX question is how to present these results to a grieving family without creating false expectations or causing them to overlook legitimate findings.
Decision
Use a three-level confidence model: Confirmed (SSN + name match with a specific dollar amount), Probable (name match or record exists without full confirmation), and Possible (inferred from estate characteristics). Each level has different UI treatment, different call-to-action copy, and different task urgency. This avoids the false precision of a binary model while not overwhelming the executor with speculative results.
Consequence
The "Possible" tier relies on actuarial inference rather than database results. This is intentional: a deceased veteran in California with no VA record showing in our database probably still has benefits that should be claimed. Displaying nothing would be a disservice. The content associated with "Possible" results must be clearly labeled as guidance, not as a confirmed finding — this is an active UPL guardrail concern.
ADR-009
Neon Postgres (vs PlanetScale / Supabase)
Accepted
Context
The system requires a managed Postgres-compatible database with good developer experience, branching for staging environments, and a clear scaling path. The primary contenders were Neon, PlanetScale (MySQL-based), and Supabase.
Decision
Neon. The decisive factors: (1) true Postgres compatibility — no MySQL dialect differences, no constraints on foreign keys or multi-statement transactions; (2) database branching for staging environments is first-class and maps well to the PR-based development workflow; (3) autoscaling compute from zero means no idle cost during development and low initial COGS; (4) Row Level Security works as expected with no Neon-specific constraints.
Tradeoffs
Chosen: Neon
  • True Postgres (the actual engine, not a compatible dialect)
  • Database branching for staging
  • Scale to zero in dev
  • Connection pooling via Neon serverless proxy
  • Newer product; some enterprise features still maturing
Alternative: Supabase
  • Postgres with auth, storage, realtime built in
  • Could replace Redis for sessions
  • Tighter coupling to Supabase ecosystem
  • BYO auth/storage already planned
  • More opinionated platform lock-in
ADR-010
Immutable Audit Log via Postgres RLS (vs External SIEM)
Accepted
Context
Audit logging of all PII access and estate mutations is required for compliance and legal discovery. Options include an append-only Postgres table with RLS enforcement, a dedicated audit database, or shipping logs to an external SIEM (Datadog, Splunk).
Decision
Append-only Postgres table with RLS policies preventing UPDATE and DELETE for all application roles. At Year 1 volume, the audit log will accumulate ~5M rows per year — comfortably handled by Postgres with appropriate indexes. An external SIEM adds $3,000–$10,000/year in operational cost and complexity that is not justified until the volume of events warrants real-time alerting and cross-product correlation.
Consequence
At Year 3 scale (~50,000 estates), the audit log will have 50–100M rows. Partition the table by month at that point. Export to cold storage (R2 or S3 Glacier) for records older than 2 years. Begin evaluating Datadog or similar at 10M rows/year to enable real-time anomaly detection on PII access patterns.