Operations Runbook

Day-2 ops recipes. Copy the commands, swap the placeholders.

Assumes you have SSH to the host, the doable CLI on your laptop (see doable-cli Reference), and the host provisioned by doable install or deployment/server-setup.sh.

Read before changing production

Doable's deployment is opinionated about networking: every service binds to 127.0.0.1 and is fronted by Cloudflare Tunnel. Any change that exposes a port to 0.0.0.0 or removes the tunnel is a regression in your security posture. When in doubt, verify with sudo ss -tlnp | grep -v '127.0.0.1'. The only public listener should be sshd on :22.
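
The grep above also lets through the header row, IPv6 loopback listeners ([::1]), and 127.0.0.53 (systemd-resolved), so its output still has to be read by eye. Below is a minimal sketch of a stricter filter, assuming the default column layout of ss -tlnp (local address in the fourth column, process in the sixth). The function name non_loopback_listeners is made up for illustration.

```shell
# Print only listeners whose local address is NOT loopback.
# Assumes `ss -tlnp` column order: State Recv-Q Send-Q Local Peer Process.
non_loopback_listeners() {
  awk '$1 == "LISTEN" && $4 !~ /^(127\.|\[::1\])/ { print $4, $6 }'
}

# On the host:
#   sudo ss -tlnp | non_loopback_listeners
# Per the rule above, the only process listed should be sshd on :22.
```

Anything else in the output is a listener to investigate before the change ships.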

Fresh-host bootstrap

The recommended bootstrap is the TUI installer, which uploads and streams deployment/server-setup.sh:

doable install \
  --host <ip> \
  --user root \
  --env-name myorg \
  --ssh-key ~/.ssh/id_ed25519

See the doable-cli Reference for the full flag list and the 15 setup phases.

If you prefer to run the script directly on the host, see Bare-metal deployment. The script is idempotent. Re-running it on an already-provisioned host is safe.

First-run tunnel quirk

The first invocation of deployment/server-setup.sh on a brand-new server can fail at the Cloudflare Tunnel step with Tunnel credentials file not found. The tunnel is created correctly; the script's UUID parser trips on cloudflared's two-line output. Re-run the script: the second pass takes the existing-tunnel branch and completes cleanly. A fix is tracked upstream.

Database backups

The whole database fits in a single pg_dump. There is no separate object store to back up (uploaded files live on disk under /opt/doable/storage/ and should be rsynced separately).

Daily logical dump

# Run as root on the server (or via a systemd timer)
DATE=$(date +%Y-%m-%d)
sudo -u postgres pg_dump --format=custom --compress=9 doable \
  > /var/backups/doable-${DATE}.dump

Off-host storage

Logical dumps on the same disk as the database do not survive a disk loss. Sync them to a separate host or object store:

# rclone to any S3-compatible store (Backblaze B2, Cloudflare R2, AWS S3)
rclone copy /var/backups/doable-${DATE}.dump remote:doable-backups/

# Or rsync to another VPS
rsync -avz /var/backups/doable-${DATE}.dump backup-host:/var/backups/doable/

Schedule daily via cron or a systemd timer. See the Backups doc for the systemd-timer template.

Restore test (do this monthly)

A backup you have not restored is not a backup. Every month, take the latest dump and restore it to a scratch database on the same host:

sudo -u postgres createdb doable_restore_test
sudo -u postgres pg_restore --no-owner --no-privileges \
  --dbname=doable_restore_test \
  /var/backups/doable-$(date +%Y-%m-%d).dump

# Sanity check
sudo -u postgres psql doable_restore_test -c \
  "SELECT count(*) FROM users; SELECT count(*) FROM projects;"

# Drop when done
sudo -u postgres dropdb doable_restore_test

If counts are zero or the restore errors, fix the backup before you need it.
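
The "counts are zero" check can be made mechanical so a scheduled restore test fails loudly. This is a minimal sketch, assuming the counts arrive one per line (psql's -A and -t flags produce that shape). The helper name assert_nonzero_counts is hypothetical.

```shell
# Read one row count per line on stdin. Fail if any value is zero or
# non-numeric, or if no counts arrive at all.
assert_nonzero_counts() {
  seen=0
  while IFS= read -r n; do
    case "$n" in
      ''|*[!0-9]*) echo "unexpected value: '$n'" >&2; return 1 ;;
    esac
    if [ "$n" -eq 0 ]; then
      echo "restore test failed: a table is empty" >&2
      return 1
    fi
    seen=$((seen + 1))
  done
  [ "$seen" -gt 0 ]
}

# On the host, before dropping the scratch database:
#   sudo -u postgres psql -At doable_restore_test \
#     -c "SELECT count(*) FROM users" \
#     -c "SELECT count(*) FROM projects" | assert_nonzero_counts
```

A non-zero exit from the helper is the signal to fix the backup.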

Restore from backup

Real disaster recovery into the live database:

# Run everything below on the host (ssh user@host first).

# 1. Stop the API so nothing writes during restore
tmux send-keys -t doable:api C-c

# 2. Drop and recreate the target database
sudo -u postgres dropdb doable
sudo -u postgres createdb doable -O doable

# 3. Restore
sudo -u postgres pg_restore --no-owner --no-privileges \
  --dbname=doable /var/backups/doable-<DATE>.dump

# 4. Reapply migrations (in case the dump pre-dated a schema change)
cd /opt/doable && pnpm db:migrate

# 5. Restart the API with an explicit command rather than shell history
tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter

See Backups for the long-form restore walkthrough including pg_basebackup for full physical restores.

Schema migrations

Migrations live in services/api/src/migrations/*.sql and are applied with:

cd /opt/doable
pnpm db:migrate

The runner is forward-only; there are no automatic down migrations. If a migration goes wrong, restore from backup or write a compensating-up migration.

Deploy scripts that skip migrations

In a real incident in May 2026, five migrations sat unapplied for two days. The stored deploy snippet did not include pnpm db:migrate, and every step was piped to tail, so set -e never saw a failure: a pipeline's exit status is that of its last command, and tail exited 0. When you deploy:

  1. Run pnpm db:migrate 2>&1 without a | tail pipe, between pnpm install and the build step.
  2. If you must trim logs, use && { ... | tail -N; test ${PIPESTATUS[0]} -eq 0; } so the upstream exit is what set -e sees.
  3. After deploy, confirm with sudo -u postgres psql doable -c "SELECT * FROM doable_migrations ORDER BY applied_at DESC LIMIT 5;"
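
An alternative to the PIPESTATUS check in item 2 is bash's pipefail option, which makes a pipeline return the status of the last command in it that failed rather than the status of tail. This is a minimal sketch; run_trimmed is a hypothetical helper name, not part of the Doable deploy tooling.

```shell
#!/usr/bin/env bash
set -eo pipefail

# run_trimmed N CMD...: show only the last N lines of CMD's output,
# but return CMD's exit status (pipefail keeps tail from masking it).
run_trimmed() {
  local keep=$1
  shift
  "$@" 2>&1 | tail -n "$keep"
}

# In a deploy script (ordering per item 1 above):
#   run_trimmed 20 pnpm install
#   pnpm db:migrate 2>&1
#   run_trimmed 20 pnpm --filter web build
```

With set -e active, a failing pnpm step now aborts the script even though its output went through tail.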

When a migration fails mid-deploy

  1. Do not retry blindly. A partial migration may have applied some statements; re-running can compound the breakage.
  2. Check doable_migrations for the last successfully recorded row.
  3. Inspect the failing migration file. If it's a transactional BEGIN; ... COMMIT; and you see the row missing, the whole migration rolled back; safe to fix and retry.
  4. If statements ran outside an explicit transaction (CREATE INDEX CONCURRENTLY, ALTER TYPE), some side effects may be in place. Inspect the schema (\d <table>) before retrying.
  5. As a last resort, restore from the latest backup, then redeploy.

Scaling

Vertical scaling

Doable is designed for a single-node deployment up to a few hundred active users. To grow:

  • More CPU/RAM: bump the VPS size. The API and web workers will use the extra cores via Node's worker pool. Postgres benefits from RAM via shared_buffers. Edit /etc/postgresql/16/main/postgresql.conf, bump shared_buffers to ~25% of host RAM, restart Postgres.
  • Bigger swap: the install creates a 2 GB swap. For a host with 16 GB+ RAM, increase to 4 GB:
swapoff /swapfile
fallocate -l 4G /swapfile
mkswap /swapfile
swapon /swapfile
  • More disk: uploaded user files live in /opt/doable/storage/. Attach a block volume and bind-mount or symlink that path to it.
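
To turn the ~25% guideline for shared_buffers into a number, the arithmetic can be done from /proc/meminfo. This is a small sketch; the function name is illustrative and the result is a starting point, not a tuned value.

```shell
# Print 25% of total RAM in the MB form postgresql.conf accepts.
# Reads MemTotal (in kB) from /proc/meminfo, or from the file given as $1.
suggest_shared_buffers() {
  awk '/^MemTotal:/ { printf "shared_buffers = %dMB\n", $2 / 1024 / 4 }' \
    "${1:-/proc/meminfo}"
}

# On the host:
#   suggest_shared_buffers
# Then set that line in /etc/postgresql/16/main/postgresql.conf and
# restart Postgres.
```

On a host reporting MemTotal of 16777216 kB (16 GiB), this prints shared_buffers = 4096MB.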

Horizontal scaling

Honest truth: today, horizontal scaling is not turn-key. Doable's WebSocket layer relies on in-process Yjs document handlers, and the default KV store is in-memory. To scale horizontally you need:

  • A shared Redis instance (REDIS_URL=... in .env) for sessions and rate limits.
  • A sticky-session load balancer in front of ws so a given project's Yjs awareness all lands on the same node.
  • A shared filesystem for /opt/doable/storage/ (or migrate uploads to S3-compatible storage via the existing adapter pattern).

See Scaling for the current state. Until that work lands, the supported path is a bigger single node, plus an off-host Postgres replica for read scaling if needed.

Log retrieval

There are three places logs live, depending on which surface produced them.

Live tmux logs (api / web / ws)

ssh user@host
tmux attach -t doable
# Ctrl-b 0 selects the api window
# Ctrl-b 1 selects the web window
# Ctrl-b 2 selects the ws window
# Ctrl-b d detaches (leaves session running)

Each window is the live stdout of the corresponding service. For a non-interactive grab:

ssh user@host 'tmux capture-pane -t doable:api -p -S -1000'

That dumps the last 1000 lines from the api window to stdout.

systemd journal

The wrapping doable.service, cloudflared.service, caddy.service, postgresql@16-main.service, and fail2ban.service all log to journald:

# Last 200 lines for the wrapper unit
sudo journalctl -u doable.service -n 200 --no-pager

# Follow tunnel logs
sudo journalctl -u cloudflared.service -f

# Postgres errors only
sudo journalctl -u postgresql@16-main.service -p err --since "1 hour ago"

Cloudflare Tunnel dashboard

cloudflared ships connection metrics back to Cloudflare. The Networks > Tunnels page in your Cloudflare account shows connection health, bandwidth, and per-hostname request counts that the local logs do not. It is especially useful when traffic doesn't reach your host at all, which usually means the tunnel itself is down.

Certificate rotation

You do not rotate certificates yourself for the public hostnames. Cloudflare handles TLS at its edge using a free Universal SSL cert that covers <zone> and *.<zone>. The connection between Cloudflare and your host runs through cloudflared over an encrypted tunnel; no public cert involved.

You do rotate certificates for:

  • Custom domains that Doable users attach to their published apps. These are managed by Caddy's on-demand TLS via Let's Encrypt. Renewals are automatic. To verify, sudo journalctl -u caddy | grep -i certificate should show recent successful renewals.
  • Your own Cloudflare API token if you've stored one for the wildcard DNS admin feature. See the DNS wildcard section below.

Useful Caddy commands:

# Reload after editing the Caddyfile
sudo systemctl reload caddy

# Force re-issue of a stuck cert
sudo systemctl stop caddy
sudo rm -rf /var/lib/caddy/.local/share/caddy/certificates/<cert-folder>
sudo systemctl start caddy

See Custom Domains for how publishing adds new hostnames to Caddy's on-demand allowlist.

Cloudflare Tunnel rotation

If you suspect the tunnel credentials have leaked, rotate them:

ssh user@host

# 1. Stop the running tunnel
sudo systemctl stop cloudflared

# 2. List existing tunnels
cloudflared tunnel list

# 3. Create a fresh tunnel
cloudflared tunnel create doable-myorg-rotated

# 4. Update DNS routes for each public hostname:
cloudflared tunnel route dns doable-myorg-rotated myorg.doable.me
cloudflared tunnel route dns doable-myorg-rotated api.myorg.doable.me
cloudflared tunnel route dns doable-myorg-rotated ws.myorg.doable.me

# 5. Update /etc/cloudflared/config.yml with the new tunnel UUID

sudo $EDITOR /etc/cloudflared/config.yml
# Replace `tunnel: <old-uuid>` and `credentials-file: ...<old-uuid>.json`

# 6. Validate and restart
sudo cloudflared --config /etc/cloudflared/config.yml tunnel ingress validate
sudo systemctl start cloudflared

# 7. Delete the old tunnel
cloudflared tunnel delete <old-uuid>

In doable admin, the Server Config > Cloudflared Ingress sub-view (key 2) lets you do the config.yml edit, validate, and reload as one transaction.

Rate-limit kill switch

During multi-agent QA campaigns, the API's per-IP rate limits on /auth/login (10 / 15min) and /auth/register (5 / hour) will trip and stall the campaign. Disable them on the target server before the campaign and restore them after.

Only loosen rate limits; never anything else

Rate limits are the only security control you may relax for testing. Do not weaken CORS, auth middleware, RLS, JWT signing, CSRF, security headers, peer-auth on Postgres, MFA, or anything else. Open bugs must be root-caused and fixed, never worked around by loosening controls.

ssh user@host

# Append a DOABLE_RATE_LIMIT_DISABLE=1 to /opt/doable/.env
echo 'DOABLE_RATE_LIMIT_DISABLE=1' | sudo tee -a /opt/doable/.env

# Restart the api so the env var takes effect
tmux send-keys -t doable:api C-c
sleep 2
tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter

After the campaign:

sudo sed -i '/^DOABLE_RATE_LIMIT_DISABLE=/d' /opt/doable/.env
tmux send-keys -t doable:api C-c && sleep 2 && \
  tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter

# Verify rate limit is back: 11th login in 15min should return HTTP 429
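
The verification in the last comment can be scripted. This is a minimal sketch under stated assumptions: the hostname follows the api.myorg.doable.me example used elsewhere on this page, the JSON body is a placeholder (the real login payload is not documented here), and both function names are made up. Running it deliberately trips the limit for your IP.

```shell
# first_throttled_attempt MAX CMD...: run CMD up to MAX times. CMD must print
# an HTTP status code. Prints the attempt number of the first 429.
first_throttled_attempt() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    status=$("$@")
    if [ "$status" = "429" ]; then
      echo "$i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "no 429 within $max attempts" >&2
  return 1
}

# Hypothetical probe: the hostname and JSON body are placeholders.
login_probe() {
  curl -s -o /dev/null -w '%{http_code}' -X POST \
    -H 'Content-Type: application/json' \
    -d '{"email":"qa@example.com","password":"wrong-on-purpose"}' \
    https://api.myorg.doable.me/auth/login
}

# From your laptop:
#   first_throttled_attempt 12 login_probe
# With the 10 / 15min limit restored, the printed attempt number should be 11.
```

If the helper reports no 429 within 12 attempts, the kill switch is probably still in effect.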

DNS wildcard for /admin

Doable supports automatic wildcard DNS provisioning so users' published apps land at <slug>.your-zone.com without you manually creating CNAMEs. This is configured from the /admin web panel under DNS Configuration.

It needs a Cloudflare API token with these permissions:

  • Zone:Read on the zone you're configuring
  • DNS:Edit on the same zone

The token is stored encrypted at rest in platform_settings.cf_api_token using your host's DOABLE_KEK envelope key (AES-256-GCM). If DOABLE_KEK is lost or changed, the token decryption fails and the admin panel shows a decryption failed warning; paste the token again.

The admin panel can:

  • Detect whether your zone has Advanced Certificate Manager (needed for two-level subdomains like api.staging.doable.me)
  • Provision or remove the *.your-zone.com wildcard CNAME
  • List existing wildcards and detect drift

You will hit one specific limitation: Cloudflare's free Universal SSL covers <zone> and *.<zone> only. If you want api.staging.doable.me and staging.doable.me on the same zone without ACM, don't: use dashed hostnames (staging-api.doable.me) per the project's naming rule.

Incident response

When something breaks, check these in order. Most outages map to one of these failure modes.

Step 1: Is the public hostname reachable?

curl -sI https://myorg.doable.me
# Expect 200 or 30x. If you get 502 Bad Gateway, traffic is reaching
# Cloudflare but the tunnel can't reach origin.
# If you get a DNS error, the tunnel itself is down or the DNS record
# is missing.

Step 2: Is the tunnel up?

ssh user@host 'sudo systemctl status cloudflared --no-pager'
ssh user@host 'sudo journalctl -u cloudflared --since "5 minutes ago" --no-pager'

A healthy tunnel shows Active: active (running) and recent Connection X registered lines. If the connection count to Cloudflare's edge is zero, the tunnel can't reach Cloudflare; check outbound connectivity from the host.

Step 3: Are the services up?

ssh user@host 'tmux capture-pane -t doable:api -p -S -50; echo "---"; \
                tmux capture-pane -t doable:web -p -S -50; echo "---"; \
                tmux capture-pane -t doable:ws  -p -S -50'

This prints the last 50 lines from each window. Look for stack traces, port-bind errors, or the dreaded EADDRINUSE.

Step 4: Is Postgres up?

ssh user@host 'sudo -u postgres pg_isready'
# /var/run/postgresql:5432 - accepting connections

If not accepting, check journalctl -u postgresql@16-main.service for out-of-disk, out-of-memory, or corrupted-page errors.

Step 5: Capture a debug snapshot

When you escalate or open a ticket, grab everything in one go:

ssh user@host '
  echo "=== uname"; uname -a
  echo "=== uptime"; uptime
  echo "=== ss"; sudo ss -tlnp
  echo "=== systemctl"; systemctl --failed
  echo "=== tunnel"; sudo systemctl status cloudflared --no-pager
  echo "=== api last 100"; tmux capture-pane -t doable:api -p -S -100
  echo "=== web last 100"; tmux capture-pane -t doable:web -p -S -100
  echo "=== ws last 100";  tmux capture-pane -t doable:ws  -p -S -100
  echo "=== pg ready"; sudo -u postgres pg_isready
  echo "=== migrations"; sudo -u postgres psql doable -c \
    "SELECT name, applied_at FROM doable_migrations ORDER BY applied_at DESC LIMIT 5;"
' > /tmp/doable-debug-$(date +%s).log

Specific failure modes

Symptom                   | First thing to check
--------------------------+---------------------------------------------------------------
502 from Cloudflare       | Tunnel up? (systemctl status cloudflared). Origin up? (curl 127.0.0.1:3000)
521 from Cloudflare       | Tunnel not reaching origin; the cloudflared process isn't running
Login fails with 429      | Rate limit tripped; see Rate-limit kill switch
Login fails with 500      | Check api window: usually DB connection or JWT key error
Published app returns 404 | Caddy route missing; check /etc/caddy/Caddyfile and reload
Web shows old build       | Web is standalone-built on prod; needs rebuild (see Upgrading)
EADDRINUSE on restart     | A previous tmux pane is still holding the port; lsof -i :3000
DB connection refused     | Postgres restarted but .env password is stale; rotate via admin TUI

Upgrading Doable

The git-pull path depends on which window changed. The api and ws windows run tsx watch and auto-reload on file changes; no rebuild needed. The web window on prod is built standalone, so any apps/web/** change needs a rebuild + restart.

Full upgrade recipe

ssh user@host
cd /opt/doable

# 1. Pull latest
git pull origin main

# 2. Install dependencies (run even on minor changes; Turborepo caches)
pnpm install

# 3. Apply any new migrations BEFORE restarting services
pnpm db:migrate

# 4a. If apps/web/** changed, rebuild and restart the web window
tmux send-keys -t doable:web C-c
sleep 3
tmux send-keys -t doable:web \
  'cd /opt/doable/apps/web && pnpm --filter web build && \
   PORT=3000 HOSTNAME=127.0.0.1 node .next/standalone/apps/web/server.js' Enter

# 4b. api and ws auto-reload; if they don't (e.g. a tsx-watch crash):
tmux send-keys -t doable:api C-c && sleep 2 && \
  tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter
tmux send-keys -t doable:ws  C-c && sleep 2 && \
  tmux send-keys -t doable:ws  'cd /opt/doable && pnpm --filter ws  dev' Enter

# 5. Verify
curl -sI https://your-doable-host.example/api/health

The web rebuild takes 1-3 minutes on a 2-vCPU host. Watch the web window for ✓ Ready before declaring success. systemctl is-active returning active is not enough: the unit counts as active as soon as the supervised process starts, even if the standalone server hasn't bound port 3000 yet.
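
Waiting for the web server to be ready can also be scripted by polling the port. This is a minimal sketch, assuming the web server answers plain HTTP on 127.0.0.1:3000 as in the upgrade recipe; wait_until and web_ready are illustrative names.

```shell
# wait_until TIMEOUT_S CMD...: retry CMD once per second until it succeeds
# or roughly TIMEOUT_S seconds have passed. Returns 1 on timeout.
wait_until() {
  timeout_s=$1
  shift
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Succeeds once something answers HTTP on the web port with a non-error status.
web_ready() {
  curl -sf -o /dev/null http://127.0.0.1:3000/
}

# On the host, after restarting the web window:
#   wait_until 300 web_ready && echo "web is serving"
```

The 300-second budget is an arbitrary margin over the 1-3 minute build time quoted above.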

Rollback

git log --oneline -5
git checkout <previous-sha>
pnpm install
# Migrations are forward-only; if the previous SHA pre-dates a migration,
# you must restore from backup. Do NOT run `pnpm db:migrate` after rolling
# back.
# Rebuild + restart web window per recipe above.

Security hardening checklist

Run through this list quarterly, and after any significant change to your host or threat model. The Hardening doc has the full rationale; this is the operator-facing condensed version.

Network

  • [ ] sudo ss -tlnp shows only :22 on a public interface; everything else is on 127.0.0.1.
  • [ ] UFW status active (sudo ufw status verbose); default deny incoming, SSH allowed.
  • [ ] No port-forward rules in iptables -L FORWARD you didn't add yourself.
  • [ ] Cloudflare Tunnel is the only public path. Verify with dig your-doable-host.example; should resolve to a Cloudflare IP, not your origin.

Identity

  • [ ] SSH key auth only (PasswordAuthentication no in /etc/ssh/sshd_config).
  • [ ] No PermitRootLogin yes unless you've decided that's your model.
  • [ ] fail2ban-client status sshd shows the jail active, and the ban time (fail2ban-client get sshd bantime) is reasonable.
  • [ ] Every platform admin has MFA enabled in their Doable account (Settings > Two-Factor Auth).

Secrets

  • [ ] /opt/doable/.env is mode 0600, owned by the doable user.
  • [ ] DOABLE_KEK is set and backed up off-host. Losing it loses access to encrypted columns like cf_api_token.
  • [ ] Quarterly: rotate the DB password via doable admin > Server Config > DB Credentials > rotate.
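
The first item in this list can be checked mechanically. This is a small sketch, assuming GNU stat (standard on Linux hosts); check_env_perms is a hypothetical helper name.

```shell
# check_env_perms FILE OWNER: succeed only if FILE is mode 600 and owned by OWNER.
# Uses GNU stat's -c format option.
check_env_perms() {
  actual=$(stat -c '%a %U' "$1") || return 1
  if [ "$actual" = "600 $2" ]; then
    return 0
  fi
  echo "$1 is '$actual', expected '600 $2'" >&2
  return 1
}

# On the host:
#   check_env_perms /opt/doable/.env doable
```

A non-zero exit prints the actual mode and owner so the fix (chmod or chown) is obvious.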

Database

  • [ ] listen_addresses = 'localhost' in /etc/postgresql/16/main/postgresql.conf.
  • [ ] Peer auth where possible. DOABLE_PG_PEER_AUTH=1 is the default since the May 2026 sprint. Older installs should run setup-v3/upgrade-to-peer-auth.sh once.
  • [ ] RLS active on projects, workspaces, integrations, and the github tables. Verify with \d+ <table> and look for Row security: Enabled.
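
For the RLS item, checking each table by hand gets tedious. One alternative is to read the relrowsecurity flag from the pg_class catalog for all of them at once. This is a minimal sketch; the helper name is made up, and the table list covers only the names given above, so the github tables have to be added from your own schema.

```shell
# Read `table|t` or `table|f` rows (psql -At output) on stdin and print the
# tables where row-level security is off. Exit status is 1 if any are found.
rls_disabled_tables() {
  awk -F'|' '$2 != "t" { print $1; bad = 1 } END { exit bad + 0 }'
}

# On the host (extend the IN list with the github tables in your schema):
#   sudo -u postgres psql -At doable -c "
#     SELECT relname, relrowsecurity FROM pg_class
#     WHERE relkind = 'r'
#       AND relname IN ('projects', 'workspaces', 'integrations')" \
#     | rls_disabled_tables
```

No output and a zero exit status mean every listed table has row security enabled. A table missing from the query result is not reported, so compare the row count against the list as well.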

Application

  • [ ] No .env files committed to git. Run git ls-files | grep -i env in /opt/doable; should return nothing.
  • [ ] Sandbox is active for AI tool calls and preview iframes; check nft list table inet doable_egress in the admin TUI's Server Config > nft sub-view.
  • [ ] You have a tested restore from yesterday's backup (see Restore test).

Don't relax these for testing

If you're tempted to disable a security control for a QA campaign, re-read the Rate-limit kill switch section. Rate limits are the only thing you may switch off temporarily. Everything else stays on.

Related docs

  • Bare-metal deployment: the network diagram and deployment/server-setup.sh walkthrough.
  • Docker deployment: Docker + nginx reference for the alternative path.
  • Scaling: current state of horizontal scaling, single-node growth tips.
  • Custom Domains: how Caddy's on-demand TLS handles user-attached domains.
  • Sandboxing: the egress firewall and per-UID isolation that the admin TUI exposes.
  • Row-Level Security: what RLS protects in the schema and how to verify it.