Operations Runbook¶
Day-2 ops recipes. Copy the commands, swap the placeholders.
Assumes you have SSH access to the host, the doable CLI on your laptop
(see doable-cli Reference), and a host provisioned by doable install or
deployment/server-setup.sh.
Read before changing production
Doable's deployment is opinionated about networking: every service
binds to 127.0.0.1 and is fronted by Cloudflare Tunnel. Any change
that exposes a port to 0.0.0.0 or removes the tunnel is a regression
in your security posture. When in doubt, verify with
sudo ss -tlnp | grep -v '127.0.0.1'. The only public listener
should be sshd on :22.
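The grep above also lets the ss header row and IPv6 loopback listeners through. A slightly stricter filter, as a minimal sketch: the function name non_loopback_listeners is hypothetical, and it assumes the ss -tlnH column layout, where the local address is the fourth field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -tlnH` output on stdin and prints the local
# address of every listener that is NOT bound to a loopback address.
# Assumes the local address:port is the fourth whitespace-separated field.
non_loopback_listeners() {
  awk '$4 !~ /^127\./ && $4 !~ /^\[::1\]/ { print $4 }'
}

# On the host (root is needed to see every socket):
#   sudo ss -tlnH | non_loopback_listeners
# Per the note above, the only lines printed should be sshd on :22.
```

Reading from stdin keeps the filter easy to try against a saved ss capture before trusting it on a live host.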
Fresh-host bootstrap¶
The recommended bootstrap is the TUI installer, which uploads and streams
deployment/server-setup.sh:
See the doable-cli Reference for the full flag list and the 15 setup phases.
If you prefer to run the script directly on the host, see Bare-metal deployment. The script is idempotent. Re-running it on an already-provisioned host is safe.
First-run tunnel quirk
The first invocation of deployment/server-setup.sh on a brand-new server can
fail at the Cloudflare Tunnel step with Tunnel credentials file not
found. The tunnel is created correctly; the script's UUID parse
trips on cloudflared's two-line output. Re-run the script: the
second pass takes the existing-tunnel branch and completes cleanly.
Fix tracked upstream.
Database backups¶
The whole database fits in a single pg_dump. There is no separate object
store to back up (uploaded files live on disk under /opt/doable/storage/
and should be rsynced separately).
Daily logical dump¶
# Run as root on the server (or via a systemd timer)
DATE=$(date +%Y-%m-%d)
sudo -u postgres pg_dump --format=custom --compress=9 doable \
> /var/backups/doable-${DATE}.dump
Off-host storage¶
Logical dumps on the same disk as the database do not survive a disk loss. Sync them to a separate host or object store:
# rclone to any S3-compatible store (Backblaze B2, Cloudflare R2, AWS S3)
rclone copy /var/backups/doable-${DATE}.dump remote:doable-backups/
# Or rsync to another VPS
rsync -avz /var/backups/doable-${DATE}.dump backup-host:/var/backups/doable/
Schedule daily via cron or a systemd timer. See the Backups doc for the systemd-timer template.
Restore test (do this monthly)¶
A backup you have not restored is not a backup. Every month, take the latest dump and restore it to a scratch database on the same host:
sudo -u postgres createdb doable_restore_test
sudo -u postgres pg_restore --no-owner --no-privileges \
--dbname=doable_restore_test \
/var/backups/doable-$(date +%Y-%m-%d).dump
# Sanity check
sudo -u postgres psql doable_restore_test -c \
"SELECT count(*) FROM users; SELECT count(*) FROM projects;"
# Drop when done
sudo -u postgres dropdb doable_restore_test
If counts are zero or the restore errors, fix the backup before you need it.
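The "counts are zero" check can be automated so a scheduled restore test fails loudly. A minimal sketch: counts_ok is a hypothetical helper name, and the usage lines assume psql's -A/-t flags for bare one-value-per-line output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one row count per line on stdin and succeeds only
# if at least one line was read and every line is a positive integer.
counts_ok() {
  awk '
    !/^[0-9]+$/ || $0 + 0 == 0 { bad = 1 }
    { n++ }
    END { if (bad || n == 0) exit 1 }
  '
}

# On the host, against the scratch database from the recipe above:
#   sudo -u postgres psql -At doable_restore_test \
#     -c "SELECT count(*) FROM users" \
#     -c "SELECT count(*) FROM projects" \
#     | counts_ok || echo "restore test FAILED" >&2
```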
Restore from backup¶
Real disaster recovery into the live database:
# 1. Stop the API so nothing writes during restore
ssh user@host 'tmux send-keys -t doable:api C-c'
# 2. Drop and recreate the target database
sudo -u postgres dropdb doable
sudo -u postgres createdb doable -O doable
# 3. Restore
sudo -u postgres pg_restore --no-owner --no-privileges \
--dbname=doable /var/backups/doable-<DATE>.dump
# 4. Reapply migrations (in case the dump pre-dated a schema change)
cd /opt/doable && pnpm db:migrate
# 5. Restart services
ssh user@host 'tmux send-keys -t doable:api Up Enter'
See Backups for the long-form restore
walkthrough including pg_basebackup for full physical restores.
Schema migrations¶
Migrations live in services/api/src/migrations/*.sql and are applied with:
cd /opt/doable && pnpm db:migrate
The runner is forward-only; there are no automatic down migrations. If a migration goes wrong, restore from backup or write a compensating-up migration.
Deploy scripts that skip migrations
A real incident in May 2026 had five migrations unapplied for two days
because the stored deploy snippet did not include pnpm db:migrate,
AND every step was piped to tail so set -e did not catch the
failure (tail exit code = 0). When you deploy:
- Run pnpm db:migrate 2>&1 without a | tail pipe, between pnpm install and the build step.
- If you must trim logs, use && { ... | tail -N; test ${PIPESTATUS[0]} -eq 0; } so the upstream exit is what set -e sees.
- After deploy, confirm with sudo -u postgres psql doable -c "SELECT * FROM doable_migrations ORDER BY applied_at DESC LIMIT 5;"
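The masking effect is easy to reproduce locally. In this sketch, failing_step is a hypothetical stand-in for a deploy step that prints output and exits non-zero; the two wrappers show the piped form and the PIPESTATUS form side by side. It requires bash, since PIPESTATUS is a bash array.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a deploy step that logs and then fails.
failing_step() { echo "applying migration"; return 1; }

# Masked: a pipeline's status is the status of its last command (tail),
# so this returns 0 even though failing_step returned 1.
masked() {
  failing_step 2>&1 | tail -n 5
}

# Preserved: tail still trims the log, but the function's status comes from
# the first command in the pipeline via PIPESTATUS.
preserved() {
  failing_step 2>&1 | tail -n 5
  test "${PIPESTATUS[0]}" -eq 0
}

# masked;    echo "masked exit: $?"
# preserved; echo "preserved exit: $?"
```

bash's set -o pipefail is a broader alternative: it makes every pipeline in the script report the last non-zero exit status among its commands.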
When a migration fails mid-deploy¶
- Do not retry blindly. A partial migration may have applied some statements; re-running can compound the breakage.
- Check doable_migrations for the last successfully recorded row.
- Inspect the failing migration file. If it's a transactional BEGIN; ... COMMIT; and you see the row missing, the whole migration rolled back; safe to fix and retry.
- If statements ran outside an explicit transaction (CREATE INDEX CONCURRENTLY, ALTER TYPE), some side effects may be in place. Inspect the schema (\d <table>) before retrying.
- As a last resort, restore from the latest backup, then redeploy.
Scaling¶
Vertical scaling¶
Doable is designed for a single-node deployment up to a few hundred active users. To grow:
- More CPU/RAM: bump the VPS size. The API and web workers will use the extra cores via Node's worker pool. Postgres benefits from RAM via shared_buffers. Edit /etc/postgresql/16/main/postgresql.conf, bump shared_buffers to ~25% of host RAM, restart Postgres.
- Bigger swap: the install creates a 2 GB swap. For a host with 16 GB+ RAM, increase to 4 GB.
- More disk: uploaded user files live in /opt/doable/storage/. Attach a block volume and bind-mount or symlink that path to it.
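For the shared_buffers bump, the ~25% figure can be derived from the host's reported memory. A minimal sketch: shared_buffers_mb is a hypothetical helper, and the result is a starting point to paste into postgresql.conf, not a tuned value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts a MemTotal value in kB into ~25% of RAM,
# expressed in whole megabytes (integer division, rounds down).
shared_buffers_mb() {
  local mem_kb=$1
  echo $(( mem_kb / 1024 / 4 ))
}

# On the host, read MemTotal from /proc/meminfo and print a candidate line:
#   mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   echo "shared_buffers = $(shared_buffers_mb "$mem_kb")MB"
```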
Horizontal scaling¶
Honest truth: today, horizontal scaling is not turn-key. Doable's WebSocket layer relies on in-process Yjs document handlers, and the default KV store is in-memory. To scale horizontally you need:
- A shared Redis instance (REDIS_URL=... in .env) for sessions and rate limits.
- A sticky-session load balancer in front of ws so a given project's Yjs awareness all lands on the same node.
- A shared filesystem for /opt/doable/storage/ (or migrate uploads to S3-compatible storage via the existing adapter pattern).
See Scaling for the current state. Until that work lands, the supported path is: bigger single node, plus off-host Postgres replica for read scaling if needed.
Log retrieval¶
There are three places logs live, depending on which surface produced them.
Live tmux logs (api / web / ws)¶
ssh user@host
tmux attach -t doable
# Ctrl-b 0 selects the api window
# Ctrl-b 1 selects the web window
# Ctrl-b 2 selects the ws window
# Ctrl-b d detaches (leaves session running)
Each window is the live stdout of the corresponding service. For a non-interactive grab:
ssh user@host 'tmux capture-pane -t doable:api -p -S -1000'
That dumps the last 1000 lines from the api window to stdout.
systemd journal¶
The wrapping doable.service, cloudflared.service, caddy.service,
postgresql@16-main.service, and fail2ban.service all log to journald:
# Last 200 lines for the wrapper unit
sudo journalctl -u doable.service -n 200 --no-pager
# Follow tunnel logs
sudo journalctl -u cloudflared.service -f
# Postgres errors only
sudo journalctl -u postgresql@16-main.service -p err --since "1 hour ago"
Cloudflare Tunnel dashboard¶
cloudflared ships connection metrics back to Cloudflare. The
Networks, Tunnels dashboard in your Cloudflare account shows
connection health, bandwidth, and per-hostname request counts that the
local logs do not. Especially useful when traffic doesn't reach your
host at all; that's an indication the tunnel itself is down.
Certificate rotation¶
You do not rotate certificates yourself for the public hostnames.
Cloudflare handles TLS at its edge using a free Universal SSL cert that
covers <zone> and *.<zone>. The connection between Cloudflare and your
host runs through cloudflared over an encrypted tunnel; no public cert
involved.
You do rotate certificates for:
- Custom domains that Doable users attach to their published apps. These are managed by Caddy's on-demand TLS via Let's Encrypt. Renewals are automatic. To verify, sudo journalctl -u caddy | grep -i certificate should show recent successful renewals.
- Your own Cloudflare API token if you've stored one for the wildcard DNS admin feature. See the DNS wildcard section below.
Useful Caddy commands:
# Reload after editing the Caddyfile
sudo systemctl reload caddy
# Force re-issue of a stuck cert
sudo systemctl stop caddy
sudo rm -rf /var/lib/caddy/.local/share/caddy/certificates/<cert-folder>
sudo systemctl start caddy
See Custom Domains for how publishing adds new hostnames to Caddy's on-demand allowlist.
Cloudflare Tunnel rotation¶
If you suspect the tunnel credentials have leaked, rotate them:
ssh user@host
# 1. Stop the running tunnel
sudo systemctl stop cloudflared
# 2. List existing tunnels
cloudflared tunnel list
# 3. Create a fresh tunnel
cloudflared tunnel create doable-myorg-rotated
# 4. Update DNS routes for each public hostname:
cloudflared tunnel route dns doable-myorg-rotated myorg.doable.me
cloudflared tunnel route dns doable-myorg-rotated api.myorg.doable.me
cloudflared tunnel route dns doable-myorg-rotated ws.myorg.doable.me
# 5. Update /etc/cloudflared/config.yml with the new tunnel UUID
sudo $EDITOR /etc/cloudflared/config.yml
# Replace `tunnel: <old-uuid>` and `credentials-file: ...<old-uuid>.json`
# 6. Validate and restart
sudo cloudflared --config /etc/cloudflared/config.yml tunnel ingress validate
sudo systemctl start cloudflared
# 7. Delete the old tunnel
cloudflared tunnel delete <old-uuid>
The doable admin Server Config, Cloudflared Ingress sub-view (key 2)
lets you do the config.yml edit + validate + reload as one transaction.
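If you edit config.yml by hand instead, step 5 amounts to swapping the old UUID for the new one wherever it appears. A minimal sketch: swap_tunnel_uuid is a hypothetical helper, and it assumes the UUID appears literally in both the tunnel: and credentials-file: lines, as the comment in step 5 describes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replaces every occurrence of the old tunnel UUID in a
# cloudflared config file and keeps the original as <file>.bak.
# UUIDs contain only hex digits and dashes, so they are safe as a sed pattern.
swap_tunnel_uuid() {
  local file=$1 old=$2 new=$3
  sed -i.bak "s/${old}/${new}/g" "$file"
}

# On the host, as root, between steps 4 and 6:
#   swap_tunnel_uuid /etc/cloudflared/config.yml <old-uuid> <new-uuid>
# Then run the ingress validate command from step 6 before restarting.
```

The .bak copy gives you a quick way back if validation fails.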
Rate-limit kill switch¶
During multi-agent QA campaigns, the API's per-IP rate limits on
/auth/login (10 / 15min) and /auth/register (5 / hour) will trip and
stall the campaign. Disable them on the target server before the
campaign and restore them after.
Only loosen rate limits; never anything else
Rate limits are the only security control you may relax for testing. Do not weaken CORS, auth middleware, RLS, JWT signing, CSRF, security headers, peer-auth on Postgres, MFA, or anything else. Open bugs must be root-caused and fixed, never worked around by loosening controls.
ssh user@host
# Append DOABLE_RATE_LIMIT_DISABLE=1 to /opt/doable/.env
echo 'DOABLE_RATE_LIMIT_DISABLE=1' | sudo tee -a /opt/doable/.env
# Restart the api so the env var takes effect
tmux send-keys -t doable:api C-c
sleep 2
tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter
After the campaign:
sudo sed -i '/^DOABLE_RATE_LIMIT_DISABLE=/d' /opt/doable/.env
tmux send-keys -t doable:api C-c && sleep 2 && \
tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter
# Verify rate limit is back: 11th login in 15min should return HTTP 429
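The tee -a in the disable step appends a second copy of the line if you run it twice. A minimal sketch of an idempotent variant: set_env_flag is a hypothetical helper, and it uses GNU sed -i as the cleanup command above does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sets KEY=VALUE in an env file exactly once by deleting
# any existing KEY= line before appending the new one.
set_env_flag() {
  local file=$1 key=$2 value=$3
  sed -i "/^${key}=/d" "$file"
  printf '%s=%s\n' "$key" "$value" >> "$file"
}

# On the host, as a user who can write the file:
#   set_env_flag /opt/doable/.env DOABLE_RATE_LIMIT_DISABLE 1
# The sed cleanup shown above still removes the flag after the campaign.
```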
DNS wildcard for /admin¶
Doable supports automatic wildcard DNS provisioning so users' published
apps land at <slug>.your-zone.com without you manually creating CNAMEs.
This is configured from the /admin web panel under DNS Configuration.
It needs a Cloudflare API token with these permissions:
- Zone:Read on the zone you're configuring
- DNS:Edit on the same zone
The token is stored encrypted at rest in platform_settings.cf_api_token
using your host's DOABLE_KEK envelope key (AES-256-GCM). If DOABLE_KEK
is lost or changed, the token decryption fails and the admin panel shows a
decryption failed warning; paste the token again.
The admin panel can:
- Detect whether your zone has Advanced Certificate Manager (needed for two-level subdomains like api.staging.doable.me)
- Provision or remove the *.your-zone.com wildcard CNAME
- List existing wildcards and detect drift
You will hit one specific limitation: Cloudflare's free Universal SSL
covers <zone> and *.<zone> only. If you want api.staging.doable.me
and staging.doable.me on the same zone without ACM, don't: use
dashed hostnames (staging-api.doable.me) per the project's naming rule.
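The coverage rule is mechanical enough to check before you create a hostname. A minimal sketch: covered_by_universal_ssl is a hypothetical helper that encodes only the rule stated above (the zone itself, or exactly one label below it); it does not query Cloudflare.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if HOST is ZONE itself or exactly one label
# below it, which is the range the text says Universal SSL covers.
covered_by_universal_ssl() {
  local host=$1 zone=$2
  [ "$host" = "$zone" ] && return 0
  # Strip the literal ".ZONE" suffix; if nothing was stripped, HOST is not
  # in this zone at all.
  local prefix=${host%".$zone"}
  [ "$prefix" != "$host" ] && [ -n "$prefix" ] && [[ "$prefix" != *.* ]]
}

# covered_by_universal_ssl staging-api.doable.me doable.me   # succeeds
# covered_by_universal_ssl api.staging.doable.me doable.me   # fails: two levels
```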
Incident response¶
When something breaks, check these in order. Most outages map to one of these failure modes.
Step 1: Is the public hostname reachable?¶
curl -sI https://myorg.doable.me
# Expect 200 or 30x. If you get 502 Bad Gateway, traffic is reaching
# Cloudflare but the tunnel can't reach origin.
# If you get a DNS error, the tunnel itself is down or the DNS record
# is missing.
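For scripted checks, the same reading can be driven by the numeric status alone. A minimal sketch: classify_status is a hypothetical helper that mirrors the comments above; curl's %{http_code} write-out prints 000 when no HTTP response was received.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps an HTTP status code to the Step 1 reading.
classify_status() {
  case $1 in
    200|30?) echo "reachable" ;;
    502)     echo "cloudflare reached, tunnel cannot reach origin" ;;
    000)     echo "no HTTP response (DNS or connection failure)" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Against the live hostname:
#   classify_status "$(curl -s -o /dev/null -w '%{http_code}' https://myorg.doable.me)"
```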
Step 2: Is the tunnel up?¶
ssh user@host 'sudo systemctl status cloudflared --no-pager'
ssh user@host 'sudo journalctl -u cloudflared --since "5 minutes ago" --no-pager'
A healthy tunnel shows Active: active (running) and recent
Connection X registered lines. If the connection count to Cloudflare's
edge is zero, the tunnel can't reach Cloudflare; check outbound
connectivity from the host.
Step 3: Are the services up?¶
ssh user@host 'tmux capture-pane -t doable:api -p -S -50; echo "---"; \
tmux capture-pane -t doable:web -p -S -50; echo "---"; \
tmux capture-pane -t doable:ws -p -S -50'
The last 50 lines from each window. Look for stack traces, port-bind
errors, or the dreaded EADDRINUSE.
Step 4: Is Postgres up?¶
ssh user@host 'sudo -u postgres pg_isready'
If it does not report accepting connections, check
journalctl -u postgresql@16-main.service for out-of-disk, out-of-memory,
or corrupted-page errors.
Step 5: Capture a debug snapshot¶
When you escalate or open a ticket, grab everything in one go:
ssh user@host '
echo "=== uname"; uname -a
echo "=== uptime"; uptime
echo "=== ss"; sudo ss -tlnp
echo "=== systemctl"; systemctl --failed
echo "=== tunnel"; sudo systemctl status cloudflared --no-pager
echo "=== api last 100"; tmux capture-pane -t doable:api -p -S -100
echo "=== web last 100"; tmux capture-pane -t doable:web -p -S -100
echo "=== ws last 100"; tmux capture-pane -t doable:ws -p -S -100
echo "=== pg ready"; sudo -u postgres pg_isready
echo "=== migrations"; sudo -u postgres psql doable -c \
"SELECT name, applied_at FROM doable_migrations ORDER BY applied_at DESC LIMIT 5;"
' > /tmp/doable-debug-$(date +%s).log
Specific failure modes¶
| Symptom | First thing to check |
|---|---|
| 502 from Cloudflare | Tunnel up? (systemctl status cloudflared). Origin up? (curl 127.0.0.1:3000) |
| 521 from Cloudflare | Tunnel not reaching origin; the cloudflared process isn't running |
| Login fails with 429 | Rate limit tripped; see Rate-limit kill switch |
| Login fails with 500 | Check api window: usually DB connection or JWT key error |
| Published app returns 404 | Caddy route missing; check /etc/caddy/Caddyfile and reload |
| Web shows old build | Web is standalone-built on prod; needs rebuild (see Upgrading) |
| EADDRINUSE on restart | A previous tmux pane is still holding the port; lsof -i :3000 |
| DB connection refused | Postgres restarted but .env password is stale; rotate via admin TUI |
Upgrading Doable¶
The git-pull path depends on which window changed. The api and ws windows
run tsx watch and auto-reload on file changes; no rebuild needed. The
web window on prod is built standalone, so any apps/web/** change
needs a rebuild + restart.
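Whether a pull touches apps/web/** can be decided from the list of changed paths. A minimal sketch: web_changed is a hypothetical helper, and the usage lines assume the deploy branch is origin/main as in the recipe below.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads changed file paths on stdin and succeeds if any
# of them is under apps/web/.
web_changed() {
  grep -q '^apps/web/'
}

# In /opt/doable, before pulling:
#   git fetch origin main
#   git diff --name-only HEAD origin/main | web_changed \
#     && echo "web rebuild needed"
```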
Full upgrade recipe¶
ssh user@host
cd /opt/doable
# 1. Pull latest
git pull origin main
# 2. Install dependencies (run even on minor changes; Turborepo caches)
pnpm install
# 3. Apply any new migrations BEFORE restarting services
pnpm db:migrate
# 4a. If apps/web/** changed, rebuild and restart the web window
tmux send-keys -t doable:web C-c
sleep 3
tmux send-keys -t doable:web \
'cd /opt/doable/apps/web && pnpm --filter web build && \
PORT=3000 HOSTNAME=127.0.0.1 node .next/standalone/apps/web/server.js' Enter
# 4b. api and ws auto-reload; if they don't (e.g. a tsx-watch crash):
tmux send-keys -t doable:api C-c && sleep 2 && \
tmux send-keys -t doable:api 'cd /opt/doable && pnpm --filter api dev' Enter
tmux send-keys -t doable:ws C-c && sleep 2 && \
tmux send-keys -t doable:ws 'cd /opt/doable && pnpm --filter ws dev' Enter
# 5. Verify
curl -sI https://your-doable-host.example/api/health
The web rebuild takes 1-3 minutes on a 2-vCPU host. Watch the web window
for ✓ Ready before declaring success. systemctl is-active returning
active is not enough: the unit reports active as soon as the supervised
process starts, even if the web server hasn't bound port 3000 yet.
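If you would rather script the wait than watch the window, poll the local port until it answers. A minimal sketch: wait_until is a hypothetical helper; the address and port come from the PORT and HOSTNAME values in the recipe above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries, and succeeds as soon as the command succeeds.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    # Skip the sleep after the final failed attempt.
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}

# On the host, after starting the rebuild (gives up after about 5 minutes):
#   wait_until 60 5 curl -sf -o /dev/null http://127.0.0.1:3000/ \
#     || echo "web did not come up" >&2
```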
Rollback¶
git log --oneline -5
git checkout <previous-sha>
pnpm install
# Migrations are forward-only; if the previous SHA pre-dates a migration,
# you must restore from backup. Do NOT run `pnpm db:migrate` after rolling
# back.
# Rebuild + restart web window per recipe above.
Security hardening checklist¶
Run through this list quarterly, and after any significant change to your host or threat model. The Hardening doc has the full rationale; this is the operator-facing condensed version.
Network¶
- [ ] sudo ss -tlnp shows only :22 on a public interface; everything else is on 127.0.0.1.
- [ ] UFW status active (sudo ufw status verbose); default deny incoming, SSH allowed.
- [ ] No port-forward rules in iptables -L FORWARD you didn't add yourself.
- [ ] Cloudflare Tunnel is the only public path. Verify with dig your-doable-host.example; should resolve to a Cloudflare IP, not your origin.
Identity¶
- [ ] SSH key auth only (PasswordAuthentication no in /etc/ssh/sshd_config).
- [ ] No PermitRootLogin yes unless you've decided that's your model.
- [ ] fail2ban-client status sshd shows the jail active and Time reasonable.
- [ ] Every platform admin has MFA enabled in their Doable account (settings, Two-Factor Auth).
Secrets¶
- [ ] /opt/doable/.env is mode 0600, owned by the doable user.
- [ ] DOABLE_KEK is set and backed up off-host. Losing it loses access to encrypted columns like cf_api_token.
- [ ] Quarterly: rotate the DB password via doable admin, Server Config, DB Credentials, rotate.
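The .env permission item can be checked without eyeballing ls output. A minimal sketch: has_mode is a hypothetical helper and relies on GNU stat's -c option, which Linux hosts provide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if FILE has exactly the octal mode MODE.
has_mode() {
  local file=$1 mode=$2
  [ "$(stat -c '%a' "$file")" = "$mode" ]
}

# On the host:
#   has_mode /opt/doable/.env 600 || echo ".env is not mode 0600" >&2
#   stat -c '%U' /opt/doable/.env   # prints the owning user
```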
Database¶
- [ ] listen_addresses = 'localhost' in /etc/postgresql/16/main/postgresql.conf.
- [ ] Peer auth where possible. DOABLE_PG_PEER_AUTH=1 is the default since the May 2026 sprint. Older installs should run setup-v3/upgrade-to-peer-auth.sh once.
- [ ] RLS active on projects, workspaces, integrations, and the github tables. Verify with \d+ <table> and look for Row security: Enabled.
Application¶
- [ ] No .env files committed to git. Run git ls-files | grep -i env in /opt/doable; should return nothing.
- [ ] Sandbox is active for AI tool calls and preview iframes; check nft list table inet doable_egress in admin TUI's Server Config, nft sub-view.
- [ ] You have a tested restore from yesterday's backup (see Restore test).
Don't relax these for testing¶
If you're tempted to disable a security control for a QA campaign, re-read the Rate-limit kill switch section. Rate limits are the only thing you may switch off temporarily. Everything else stays on.
Related reading¶
- Bare-metal deployment: the network diagram and deployment/server-setup.sh walkthrough.
- Docker deployment: Docker + nginx reference for the alternative path.
- Scaling: current state of horizontal scaling, single-node growth tips.
- Custom Domains: how Caddy's on-demand TLS handles user-attached domains.
- Sandboxing: the egress firewall and per-UID isolation that the admin TUI exposes.
- Row-Level Security: what RLS protects in the schema and how to verify it.