NERSC Spin Runbook

Audience: operators deploying SimBoard on NERSC Spin.

This runbook defines the NERSC Spin workload baseline and the backend rollout flow, which uses an initContainer to run Alembic migrations automatically. The Rancher UI is the primary deployment workflow.

Rancher UI Configs

This document is the source of truth for Spin workload settings managed in Rancher UI. If a setting is not listed, leave it at Rancher defaults unless it affects security context, networking, storage, secrets, or image pull behavior. No workload manifests are versioned under deploy/spin/.

Prerequisites (Create First)

Create these resources before configuring workloads in Rancher. The exceptions are nersc-staging-ingestor-env and nersc-archive-ingestor-env: create them later, during Workload 3 setup, after generating the ingestion service-account token.

Secret	Type	Required	Example/Allowed Value	Used By
`simboard-backend-env`	`Opaque`	Yes	Backend runtime env vars	`backend` app container, `migrate` init container
`simboard-db`	`Opaque`	Yes	PostgreSQL runtime env vars	`db` container
`simboard-tls-cert`	`kubernetes.io/tls`	Yes	`tls.crt`, `tls.key` (PEM)	`lb` ingress
`registry-nersc`	Image pull secret	Yes	NERSC registry credentials	`backend`, `frontend`, CronJob workloads

Environment variable keys:

simboard-backend-env:

Key	Required	Example/Allowed Value	Used By
`ENV`	Yes	`development`, `staging`, `production`	`backend`, `migrate`
`ENVIRONMENT`	Yes	`local`, `dev`, `prod`	`backend`, `migrate`
`PORT`	Yes	`8000`	`backend`, `migrate`
`FRONTEND_ORIGIN`	Yes	frontend origin URL	`backend`, `migrate`
`FRONTEND_AUTH_REDIRECT_URL`	Yes	frontend auth callback URL	`backend`, `migrate`
`FRONTEND_ORIGINS`	Yes	comma-separated origins	`backend`, `migrate`
`DATABASE_URL`	Yes	Postgres SQLAlchemy URL	`backend`, `migrate`
`TEST_DATABASE_URL`	Yes	test Postgres URL	`backend`, `migrate`
`GITHUB_CLIENT_ID`	Yes	GitHub OAuth client id	`backend`, `migrate`
`GITHUB_CLIENT_SECRET`	Yes	GitHub OAuth client secret	`backend`, `migrate`
`GITHUB_REDIRECT_URL`	Yes	backend OAuth callback URL	`backend`, `migrate`
`GITHUB_STATE_SECRET_KEY`	Yes	random secret string	`backend`, `migrate`
`COOKIE_NAME`	Yes	cookie name	`backend`, `migrate`
`COOKIE_SECURE`	Yes	`true` or `false`	`backend`, `migrate`
`COOKIE_HTTPONLY`	Yes	`true` or `false`	`backend`, `migrate`
`COOKIE_SAMESITE`	Yes	`lax`, `strict`, `none`	`backend`, `migrate`
`COOKIE_MAX_AGE`	Yes	seconds as integer	`backend`, `migrate`
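
Before saving the secret, it can help to confirm that none of the required keys is missing or blank. The following is a minimal Python sketch and not part of SimBoard: `REQUIRED_BACKEND_KEYS` simply mirrors the table above, and `missing_keys` is a hypothetical helper name.

```python
"""Sketch: check a key/value mapping against the required simboard-backend-env keys."""

# Mirrors the "Required: Yes" keys in the table above (hypothetical constant name).
REQUIRED_BACKEND_KEYS = (
    "ENV", "ENVIRONMENT", "PORT", "FRONTEND_ORIGIN", "FRONTEND_AUTH_REDIRECT_URL",
    "FRONTEND_ORIGINS", "DATABASE_URL", "TEST_DATABASE_URL", "GITHUB_CLIENT_ID",
    "GITHUB_CLIENT_SECRET", "GITHUB_REDIRECT_URL", "GITHUB_STATE_SECRET_KEY",
    "COOKIE_NAME", "COOKIE_SECURE", "COOKIE_HTTPONLY", "COOKIE_SAMESITE",
    "COOKIE_MAX_AGE",
)


def missing_keys(env: dict[str, str], required=REQUIRED_BACKEND_KEYS) -> list[str]:
    """Return required keys that are absent or blank, in table order."""
    return [key for key in required if not env.get(key, "").strip()]
```

One way to use it is to call `missing_keys(dict(os.environ))` from a shell in the backend pod; an empty list means every required key is populated. The check covers presence only and does not validate the values.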

Optional assistant summary keys for simboard-backend-env when enabling AI summaries:

Key	Required	Example/Allowed Value	Used By
`ASSISTANT_LLM_ENABLED`	No	`true` or `false`	`backend`, `migrate`
`ASSISTANT_LLM_PROVIDER`	No	`ollama`	`backend`, `migrate`
`ASSISTANT_OLLAMA_BASE_URL`	If using Ollama	`http://localhost:11434`	`backend`, `migrate`
`ASSISTANT_OLLAMA_MODEL`	If using Ollama	`gemma4:26b`	`backend`, `migrate`
`ASSISTANT_OLLAMA_API_KEY`	No	proxy auth token or blank	`backend`, `migrate`
`ASSISTANT_LLM_TIMEOUT_SECONDS`	No	`30`	`backend`, `migrate`
`ASSISTANT_LLM_TEMPERATURE`	No	`0.2`	`backend`, `migrate`
`ASSISTANT_LLM_MAX_TOKENS`	No	`2048`	`backend`, `migrate`

simboard-db:

Key	Required	Example/Allowed Value	Used By
`POSTGRES_USER`	Yes	DB username	`db`
`POSTGRES_PASSWORD`	Yes	DB password	`db`
`POSTGRES_DB`	Yes	DB name	`db`
`POSTGRES_PORT`	Yes	`5432`	`db`
`POSTGRES_SERVER`	Yes	`db`	`db`
`PGDATA`	Yes	`/var/lib/postgresql/data/pgdata`	`db`
`PGTZ`	Yes	timezone string	`db`

Workload Configurations

Workload 1: Database Deployment (`db`)

Workloads -> Deployments -> Create (top-right)

1. Top-level configuration

Rancher field	Value
Namespace	`simboard`
Name	`db`
Replicas	`1`

2. Pod tab

Storage:

Create a PersistentVolumeClaim volume for Postgres data.

Rancher field	Value
Volume type	`PersistentVolumeClaim`
Volume name	`db-data` (or your naming standard)
Persistent Volume Claim Name	`pvc-simboard-db` (or existing claim)
Access mode	`Single-Node Read/Write`
Capacity	`1Gi` minimum (or larger per policy)
Storage class	Namespace/default class (example: `nfs-client-vast`)

3. Container tab (`db`)

General:

Rancher field	Value
Container Name	`db`
Container image	`postgres:17`
Pull policy	`Always`
Environment Variables	Type: `Secret`, Secret: `simboard-db`

General -> Networking:

Rancher field	Value
Service type	`ClusterIP`
Name	`db`
Private Container Port	`5432`
Protocol	`TCP`

Security Context:

Rancher field	Value
Run as User	Required: set numeric NERSC UID (check Iris)
Add Capabilities	`CHOWN,DAC_OVERRIDE,FOWNER,SETGID,SETUID`
Drop Capabilities	`ALL`

Storage:

Rancher field	Value
Volume	`db-data`
Mount path	`/var/lib/postgresql/data`
Read only	`false`

Keep PGDATA=/var/lib/postgresql/data/pgdata in the simboard-db secret. Mount the claim at /var/lib/postgresql/data, not at PGDATA, so Postgres initializes inside the pgdata subdirectory instead of the volume root. This avoids initdb failures on storage backends that pre-create files such as lost+found at the claim root.
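
The relationship between PGDATA and the mount path can be expressed as a simple check. This is an illustrative sketch only; `pgdata_is_below_mount` is a hypothetical helper, not something shipped with SimBoard or Postgres.

```python
from pathlib import PurePosixPath


def pgdata_is_below_mount(pgdata: str, mount_path: str) -> bool:
    """True when PGDATA is a strict subdirectory of the volume mount path."""
    data_dir = PurePosixPath(pgdata)
    mount = PurePosixPath(mount_path)
    # parents never includes the path itself, so equal paths return False.
    return mount in data_dir.parents


# Recommended layout: PGDATA sits one level below the mount point.
print(pgdata_is_below_mount("/var/lib/postgresql/data/pgdata", "/var/lib/postgresql/data"))  # True
# Layout to avoid: PGDATA equals the mount point (the volume root).
print(pgdata_is_below_mount("/var/lib/postgresql/data", "/var/lib/postgresql/data"))  # False
```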

Workload 2: Backend Deployment (`backend`)

Workloads -> Deployments -> Create (top-right)

Top-level configuration

Rancher field	Value
Workload type	`Deployment`
Name	`backend`
Labels	`app=simboard-backend`
Replicas	`1`
Image pull secret	`registry-nersc`

1. Pod tab

Security Context:

Rancher field	Value
Pod Filesystem Group	`62756`

Required for NERSC global file system (NGF/CFS) mounts to ensure correct permissions for the backend container user.

Storage:

Rancher field	Value
Volume type	`Bind-Mount`
Volume name	`staging`
Path on node	`/global/cfs/cdirs/e3sm/performance_archive`
The Path on the Node must be	`An existing directory`

Storage:

Rancher field	Value
Volume type	`Bind-Mount`
Volume name	`archive`
Path on node	`/global/cfs/projectdirs/e3sm/OLD_PERF`
The Path on the Node must be	`An existing directory`

2. Container tab (`backend`)

General:

Rancher field	Value
Container Name	`backend`, Standard Container
Container Image	`registry.nersc.gov/e3sm/simboard/backend:<tag>`
Pull policy	`Always` for `:dev`; `IfNotPresent` for versioned tags
Environment Variables	Type: Secret, Secret: `simboard-backend-env`

General -> Networking:

Rancher field	Value
Service type	`ClusterIP`
Name	`backend`
Private Container Port	`8000`
Protocol	`TCP`

Security Context:

Rancher field	Value
Run as User	Required: set numeric NERSC UID (check Iris)
Add Capabilities	leave empty
Drop Capabilities	`ALL`

Storage:

Rancher field	Value
Volume	`staging`
Mount path	`/performance_archive`
Read only	`true` (recommended)

Rancher field	Value
Volume	`archive`
Mount path	`/OLD_PERF`
Read only	`true` (recommended)

3. Container tab (`migrate`, init container)

General:

Rancher field	Value
Container type	Init container
Name	`migrate`
Container image	`registry.nersc.gov/e3sm/simboard/backend:<tag>`
Command	`/app/migrate.sh`
Args	leave empty
Environment Variables	Type: Secret, Secret: `simboard-backend-env`
Script behavior	`/app/migrate.sh` checks `DATABASE_URL`, waits for DB, runs `alembic upgrade head`

Security Context:

Rancher field	Value
Run as User	Required: set numeric NERSC UID
allowPrivilegeEscalation	`false`
privileged	`false`
capabilities	add `DAC_OVERRIDE`, drop `ALL`
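
The `migrate` flow described above (check `DATABASE_URL`, wait for the database, run `alembic upgrade head`) can be sketched in Python as follows. The contents of `/app/migrate.sh` are not shown in this runbook, so this is an illustration of the same three steps under stated assumptions, not a copy of that script. The function names, the 60-second timeout, and the use of a plain TCP connect as the readiness test are all assumptions.

```python
import os
import socket
import subprocess
import sys
import time
from urllib.parse import urlsplit


def db_endpoint(database_url: str) -> tuple[str, int]:
    """Extract (host, port) from a SQLAlchemy-style Postgres URL; port defaults to 5432."""
    parts = urlsplit(database_url)
    if not parts.hostname:
        raise ValueError("DATABASE_URL has no host component")
    return parts.hostname, parts.port or 5432


def wait_for_tcp(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


def main() -> int:
    """Return a process exit code: non-zero keeps the backend pod from starting."""
    url = os.environ.get("DATABASE_URL", "")
    if not url:
        print("DATABASE_URL is not set", file=sys.stderr)
        return 1
    host, port = db_endpoint(url)
    if not wait_for_tcp(host, port):
        print(f"database {host}:{port} not reachable", file=sys.stderr)
        return 1
    # Same Alembic command the runbook attributes to migrate.sh.
    return subprocess.run(["alembic", "upgrade", "head"]).returncode
```

To run it as a script, call `sys.exit(main())`. A successful TCP connect only shows that the port is open, which is a weaker signal than Postgres accepting queries, so a real script may use a stricter readiness check.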

Workload 3: NERSC Archive Collection CronJobs

Use two Rancher-managed CronJob workloads:

nersc-staging-ingestor for incremental staging collection every 15 minutes
nersc-archive-ingestor for archive collection once daily (cron schedule 0 12 * * *)

In this runbook, both CronJobs perform site-side collection and submit submission-qualified case directories to SimBoard, where backend ingestion occurs. The staging job scans PERF_ARCHIVE_ROOT. The archive job scans OLD_PERF_ARCHIVE_ROOT.

Prerequisites for this section:

Backend service must be up and reachable from within the cluster (http://backend:8000).
At least one admin account must exist (see setup step 1 below).
Ingestion service account token must be provisioned (see setup step 2 below).

Setup Procedure (New Ingestion Script)

1. Create admin account (if one does not already exist)
This script must run in the deployed backend environment (so it has the correct DB connection and app settings).
In Rancher UI, open target namespace -> Workloads -> Pods -> backend pod -> Execute Shell (backend container).

Run:

cd /app
python -m app.scripts.users.create_admin_account

Enter admin email/password when prompted.
Use this admin account in step 2 for service-account provisioning.
2. Provision ingestion service-account token
Service accounts are required when non-interactive systems (for example, these CronJobs) authenticate to the SimBoard API.
Run this in the same backend pod shell opened in step 1 (no kubectl required).

Execute:

cd /app
python -m app.scripts.users.provision_service_account \
  --service-name nersc-archive-ingestor \
  --base-url http://backend:8000 \
  --admin-email <admin-email-from-step-1> \
  --expires-in-days 365

Enter admin password when prompted.
Copy the generated token immediately; it is shown once.
Store and rotate the token per your org policy.
Use this token as SIMBOARD_API_TOKEN in both ingestion secrets from step 3.

Optional: quick token validation call

export SIMBOARD_API_TOKEN=<TOKEN>
curl -X POST http://backend:8000/api/v1/ingestions/from-path \
  -H "Authorization: Bearer $SIMBOARD_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "archive_path": "/global/cfs/cdirs/e3sm/simulations/archive.tar.gz",
        "machine_name": "perlmutter",
        "hpc_username": "<your_hpc_username>"
      }'
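
If `curl` is not available in the backend container, the same request can be built with the Python standard library. This is a hedged sketch: `build_from_path_request` is a hypothetical helper, and the endpoint, headers, and payload fields are copied from the curl call above rather than from an API reference.

```python
import json
import urllib.request


def build_from_path_request(base_url: str, token: str, archive_path: str,
                            machine_name: str, hpc_username: str) -> urllib.request.Request:
    """Build the same POST the curl example sends; nothing is sent until urlopen()."""
    payload = {
        "archive_path": archive_path,
        "machine_name": machine_name,
        "hpc_username": hpc_username,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/ingestions/from-path",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=30)`. `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a rejected token surfaces as an exception rather than a return value.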

3. Create/update secrets nersc-staging-ingestor-env and nersc-archive-ingestor-env
In Rancher, open target namespace -> Storage -> Secrets -> Create.
Create one secret named nersc-staging-ingestor-env.
Create a second secret named nersc-archive-ingestor-env.
Secret type: Opaque
Populate them as follows.

Key table for nersc-staging-ingestor-env:

Key	Required	Example/Allowed Value	Used By
`SIMBOARD_API_TOKEN`	Yes	service-account bearer token (from Setup Procedure step 2)	`nersc-staging-ingestor`
`SIMBOARD_API_BASE_URL`	Yes	`http://backend:8000`	`nersc-staging-ingestor`
`SCAN_MODE`	Yes	`staging`	`nersc-staging-ingestor`
`PERF_ARCHIVE_ROOT`	Yes	`/performance_archive`	`nersc-staging-ingestor`
`MACHINE_NAME`	Yes	`perlmutter`	`nersc-staging-ingestor`
`DRY_RUN`	No	`true` or `false`	`nersc-staging-ingestor`

Key table for nersc-archive-ingestor-env:

Key	Required	Example/Allowed Value	Used By
`SIMBOARD_API_TOKEN`	Yes	service-account bearer token (from Setup Procedure step 2)	`nersc-archive-ingestor`
`SIMBOARD_API_BASE_URL`	Yes	`http://backend:8000`	`nersc-archive-ingestor`
`SCAN_MODE`	Yes	`archive`	`nersc-archive-ingestor`
`OLD_PERF_ARCHIVE_ROOT`	Yes	`/OLD_PERF`	`nersc-archive-ingestor`
`MACHINE_NAME`	Yes	`perlmutter`	`nersc-archive-ingestor`
`DRY_RUN`	No	`true` or `false`	`nersc-archive-ingestor`
`ARCHIVE_YEAR_START`	No, scoped backfills only	`2025` or `2025-01`	`nersc-archive-ingestor`
`ARCHIVE_YEAR_END`	No, scoped backfills only	`2025` or `2025-03`	`nersc-archive-ingestor`

OLD_PERF_ARCHIVE_ROOT must point at the archive root, whose immediate children are YYYY-MM buckets containing immutable performance_archive_<timestamp> snapshots. The runner stores completed snapshot checkpoints in SimBoard's database, so no separate checkpoint file or persistent checkpoint volume is needed. ARCHIVE_YEAR_START sets the earliest month considered; newly arriving snapshots in any eligible month are discovered automatically.
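
The bucket layout and the optional range keys can be illustrated with a short sketch. It is not the ingestor's code. In particular, expanding a bare year to `YYYY-01` for the start and `YYYY-12` for the end, and treating both bounds as inclusive, are assumptions made for this example; confirm the real semantics against the ingestor before relying on them. `normalize_bound` and `eligible_buckets` are hypothetical names.

```python
import re
from typing import Iterable, Optional

BUCKET_RE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])")


def normalize_bound(value: Optional[str], *, is_end: bool) -> Optional[str]:
    """Expand 'YYYY' to 'YYYY-01' (start) or 'YYYY-12' (end); pass 'YYYY-MM' through.

    ASSUMPTION: this expansion is illustrative, not taken from the ingestor.
    """
    if not value:
        return None
    if re.fullmatch(r"\d{4}", value):
        return f"{value}-12" if is_end else f"{value}-01"
    if BUCKET_RE.fullmatch(value):
        return value
    raise ValueError(f"expected YYYY or YYYY-MM, got {value!r}")


def eligible_buckets(names: Iterable[str], start: Optional[str] = None,
                     end: Optional[str] = None) -> list[str]:
    """Return YYYY-MM bucket names within [start, end], sorted oldest first."""
    lo = normalize_bound(start, is_end=False)
    hi = normalize_bound(end, is_end=True)
    buckets = sorted(n for n in names if BUCKET_RE.fullmatch(n))
    # Zero-padded YYYY-MM strings sort chronologically, so string comparison suffices.
    return [b for b in buckets if (lo is None or b >= lo) and (hi is None or b <= hi)]


children = ["2024-12", "2025-01", "2025-02", "2025-04", "lost+found"]
print(eligible_buckets(children, start="2025", end="2025-03"))  # ['2025-01', '2025-02']
```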

4. Create/update CronJob nersc-staging-ingestor
Use the Staging CronJob section below.
Configure secret-backed environment variables from nersc-staging-ingestor-env.
Keep the CronJob command on python -m app.scripts.ingestion.nersc_archive_ingestor. Do not switch this workload to app/scripts/ingestion/sites/nersc.sh; that wrapper is for host-side NERSC cron usage and defaults to host filesystem paths and API values that do not match this Spin workload.
5. Create/update CronJob nersc-archive-ingestor
Use the Archive CronJob section below.
Configure secret-backed environment variables from nersc-archive-ingestor-env.
Keep the CronJob command on python -m app.scripts.ingestion.nersc_archive_ingestor. Do not switch this workload to app/scripts/ingestion/sites/nersc.sh; that wrapper is for host-side NERSC cron usage and defaults to host filesystem paths and API values that do not match this Spin workload.
6. Validate both jobs once with dry run
Set DRY_RUN=true in both ingestion secrets.
Trigger a one-off job from each CronJob.
Confirm logs include scan_completed and submission-qualified case discovery during collection.
Remove DRY_RUN (or set DRY_RUN=false) in each secret after validation.
7. Verify steady-state behavior
Confirm nersc-staging-ingestor runs every 15 minutes.
Confirm nersc-archive-ingestor runs once daily (cron schedule 0 12 * * *).
Confirm unchanged cases are not re-ingested.
Confirm failures appear as failed CronJob runs and case_ingestion_failed log events for both jobs.
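
The log checks in the validation and steady-state steps above can be partly automated once the job logs are copied out of Rancher. The sketch below assumes only that the event names `scan_completed` and `case_ingestion_failed` appear verbatim somewhere in the relevant log lines; the surrounding log format is not specified in this runbook, so it uses a plain substring match. `count_events` and `run_looks_healthy` are hypothetical helpers.

```python
from collections import Counter
from typing import Iterable

# Event names taken from the checks above.
EVENTS = ("scan_completed", "case_ingestion_failed")


def count_events(lines: Iterable[str], events=EVENTS) -> Counter:
    """Count log lines that mention each event name (plain substring match)."""
    counts = Counter({event: 0 for event in events})
    for line in lines:
        for event in events:
            if event in line:
                counts[event] += 1
    return counts


def run_looks_healthy(counts: Counter) -> bool:
    """Healthy here means: the scan finished and no case ingestion failed."""
    return counts["scan_completed"] >= 1 and counts["case_ingestion_failed"] == 0
```

For example, `run_looks_healthy(count_events(open("job.log")))` returns `False` when the log contains any `case_ingestion_failed` line or never reports `scan_completed`. The file name `job.log` is a placeholder.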

To force a full archive rescan, suspend nersc-archive-ingestor so it cannot recreate checkpoints during the reset. In Rancher, open a shell in the db container and run the following command after replacing the placeholders. Use the canonical machine name stored in SimBoard and the basename of OLD_PERF_ARCHIVE_ROOT as the archive name (for example, OLD_PERF):

psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1 <<'SQL'
DELETE FROM archive_scan_checkpoints AS checkpoint
USING machines AS machine
WHERE checkpoint.machine_id = machine.id
   AND machine.name = '<machine-name>'
   AND checkpoint.archive_name = '<archive-name>';
SQL

Confirm the deleted row count, then resume or trigger the archive CronJob. The next run scans all snapshots in the configured year range; ingestion state still prevents already processed executions from being submitted again.

Staging CronJob (`nersc-staging-ingestor`)

Top-level configuration:

Rancher field	Value
Namespace	`simboard`
Name	`nersc-staging-ingestor`
Schedule	`*/15 * * * *`

1. CronJob tab

Scaling and Upgrade Policy:

Rancher field	Value
Concurrency policy	`Skip next run if current run hasn't finished`
Successful jobs history limit	`3`
Failed jobs history limit	`3`

2. Pod tab

Security Context:

Rancher field	Value
Pod Filesystem Group	`62756`

Pod:

Rancher field	Value
Restart policy	`OnFailure`

Storage:

Rancher field	Value
Volume type	`Bind-Mount`
Volume name	`staging`
Path on node	`/global/cfs/cdirs/e3sm/performance_archive`
The Path on the Node must be	`An existing directory`

3. Container tab

General:

Rancher field	Value
Container Name	`nersc-staging-ingestor`
Container image	`registry.nersc.gov/e3sm/simboard/backend:<tag>`
Pull policy	`Always` for `:dev`; `IfNotPresent` for versioned tags
Image pull secret	`registry-nersc`
Command	`python`
Arguments	`-m app.scripts.ingestion.nersc_archive_ingestor`
Environment Variables	Type: Secret, Secret: `nersc-staging-ingestor-env`

Security Context:

Rancher field	Value
Run as User	Required: set to a numeric NERSC UID for this workload
allowPrivilegeEscalation	`false`
privileged	`false`
capabilities	drop `ALL`

Storage:

Rancher field	Value
Archive volume	`staging`
Archive mount path	`/performance_archive`
Archive read only	`true` (recommended)

Archive CronJob (`nersc-archive-ingestor`)

Top-level configuration:

Rancher field	Value
Namespace	`simboard`
Name	`nersc-archive-ingestor`
Schedule	`0 12 * * *`

1. CronJob tab

Scaling and Upgrade Policy:

Rancher field	Value
Concurrency policy	`Skip next run if current run hasn't finished`
Successful jobs history limit	`3`
Failed jobs history limit	`3`

2. Pod tab

Security Context:

Rancher field	Value
Pod Filesystem Group	`62756`

Pod:

Rancher field	Value
Restart policy	`OnFailure`

Storage:

Rancher field	Value
Volume type	`Bind-Mount`
Volume name	`archive`
Path on node	`/global/cfs/projectdirs/e3sm/OLD_PERF`
The Path on the Node must be	`An existing directory`

3. Container tab

General:

Rancher field	Value
Container Name	`nersc-archive-ingestor`
Container image	`registry.nersc.gov/e3sm/simboard/backend:<tag>`
Pull policy	`Always` for `:dev`; `IfNotPresent` for versioned tags
Image pull secret	`registry-nersc`
Command	`python`
Arguments	`-m app.scripts.ingestion.nersc_archive_ingestor`
Environment Variables	Type: Secret, Secret: `nersc-archive-ingestor-env`

Security Context:

Rancher field	Value
Run as User	Required: set to a numeric NERSC UID for this workload
allowPrivilegeEscalation	`false`
privileged	`false`
capabilities	drop `ALL`

Storage:

Rancher field	Value
Archive volume	`archive`
Archive mount path	`/OLD_PERF`
Archive read only	`true` (recommended)

Notes:

Manage ingestion configuration via two Opaque secrets (nersc-staging-ingestor-env and nersc-archive-ingestor-env) and expose them as secret-backed environment variables.
Use backend service DNS (http://backend:8000) for in-cluster API calls.
Non-zero CronJob exits indicate at least one case ingestion failure in that run.

Mounting NERSC E3SM Performance Archive

The table below lists the canonical values for all workloads that mount the E3SM performance archive. These values are already set in the instructions above; they are repeated here for clarity and to highlight the security context requirements.

Field	Staging value	Archive value
Path on node	`/global/cfs/cdirs/e3sm/performance_archive`	`/global/cfs/projectdirs/e3sm/OLD_PERF`
Volume name	`staging`	`archive`
In-container mount path	`/performance_archive`	`/OLD_PERF`
Read only	`true` (recommended)	`true` (recommended)

Security context requirements for NERSC global file system (NGF/CFS) mounts:

Set numeric Run as User at pod/container level.
If Run as Group ID is set, also set Run as User.
Set Run as Group ID to the appropriate numeric group ID (62756 for E3SM).
Keep Linux capabilities minimal (drop: ALL; only add what is required).

Source: NERSC Spin Storage - NERSC Global File Systems.
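
The requirements above can be checked against a container security context expressed as a dictionary. This sketch uses the standard Kubernetes field names (`runAsUser`, `capabilities.add`, `capabilities.drop`) and assumes the Rancher form fields map onto them; `security_context_issues` is a hypothetical helper and the UID `12345` in the usage note is a placeholder.

```python
def security_context_issues(ctx: dict, allowed_add: tuple = ()) -> list[str]:
    """Return human-readable problems found in a container securityContext dict."""
    issues = []
    uid = ctx.get("runAsUser")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(uid, bool) or not isinstance(uid, int):
        issues.append("runAsUser must be a numeric UID")
    caps = ctx.get("capabilities", {})
    if "ALL" not in caps.get("drop", []):
        issues.append("capabilities.drop must include ALL")
    extra = sorted(set(caps.get("add", [])) - set(allowed_add))
    if extra:
        issues.append("unexpected added capabilities: " + ", ".join(extra))
    return issues
```

For the `migrate` init container, which adds `DAC_OVERRIDE`, a call such as `security_context_issues({"runAsUser": 12345, "capabilities": {"drop": ["ALL"], "add": ["DAC_OVERRIDE"]}}, allowed_add=("DAC_OVERRIDE",))` returns an empty list.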

Workload 4: Frontend Deployment (`frontend`)

Workloads -> Deployments -> Create (top-right)

1. Top-level configuration

Rancher field	Value
Workload type	`Deployment`
Name	`frontend`
Labels	`app=simboard-frontend`
Replicas	`1`
Image pull secret	`registry-nersc`

2. Container tab (`frontend`)

General:

Rancher field	Value
Container image	`registry.nersc.gov/e3sm/simboard/frontend:<tag>`
Pull policy	`Always` for `:dev`; `IfNotPresent` for versioned tags
Port	`80/TCP`

General -> Networking:

Rancher field	Value
Service type	`ClusterIP`
Name	`frontend`
Private Container Port	`80`
Protocol	`TCP`

Security Context:

Rancher field	Value
Add Capabilities	`CHOWN,SETGID,SETUID,NET_BIND_SERVICE`
Drop Capabilities	`ALL`

Additional Configurations

TLS Secret (`simboard-tls-cert`)

General tab

Rancher field	Value
Resource type	`Secret`
Name	`simboard-tls-cert`
Secret type	`kubernetes.io/tls`

Data tab

Rancher field	Value
Data key	`tls.crt` (certificate PEM)
Data key	`tls.key` (private key PEM)

Ingress (`lb`)

Service Discovery -> Ingresses -> Create

General tab

Rancher field	Value
Resource type	`Ingress`
Name	`lb`
Ingress class	`nginx`

TLS tab

Rancher field	Value
TLS secret	`simboard-tls-cert`
TLS hosts	`simboard-dev.e3sm.org`, `simboard-dev-api.e3sm.org`, `lb.simboard.development.svc.spin.nersc.org`

Rules tab

Rancher field	Value
Rule	Host `simboard-dev.e3sm.org`, path `/`, service `frontend:80`
Rule	Host `simboard-dev-api.e3sm.org`, path `/`, service `backend:8000`
Optional host alias	`lb.simboard.development.svc.spin.nersc.org`

Deploy Order

1. Open the Rancher UI and select the target namespace.
2. Ensure the DB service and deployment (db) are healthy under Service Discovery → Services and Workloads → Deployments.
3. Update/redeploy the backend deployment with the target backend image tag.
4. Watch the backend pod's init container logs (migrate) in Rancher to confirm migration success.
5. Verify backend deployment health and pod status under Workloads → Pods.
6. Update/redeploy the frontend deployment with the target frontend image tag, then verify frontend pod status.
7. Create/confirm an admin account (Rancher pod shell), then provision the ingestion service-account token and create/update secrets nersc-staging-ingestor-env and nersc-archive-ingestor-env.
8. Create/update CronJobs nersc-staging-ingestor and nersc-archive-ingestor, run one-off dry runs (DRY_RUN=true) for both, then set DRY_RUN=false.
9. Verify ingress routing under Service Discovery → Ingresses for lb and confirm both frontend and backend hosts are reachable over HTTPS.

Failure Handling

If backend init container migrate fails, the backend pod will not become Ready.
Fix database connectivity or migration issues, then redeploy backend.
Backend image rollback does not revert schema automatically; handle schema rollback explicitly via Alembic when required.

Concurrency Note

Migrations run once per new backend pod via initContainer. During a rollout, more than one backend pod can exist at the same time (for example, with multiple replicas or a RollingUpdate strategy and maxSurge > 0), and multiple pods can attempt migrations concurrently. If your migration safety model depends on a single migrator, configure the backend deployment to use either a Recreate rollout strategy or a RollingUpdate strategy with maxSurge=0 (and typically maxUnavailable=1), or ensure your migration tooling enforces a DB-level migration lock.
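
One common way to get a DB-level migration lock in PostgreSQL is a session-level advisory lock taken before `alembic upgrade head` and released afterwards. The sketch below only derives a stable lock key and the corresponding SQL statements; it does not connect to a database. Whether SimBoard's migration tooling already does something like this is not stated in this runbook, and the helper names and the lock name `simboard-alembic-migrations` are hypothetical.

```python
import hashlib


def advisory_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key, the bigint type pg_advisory_lock() accepts."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def lock_statements(name: str) -> tuple[str, str]:
    """Return (acquire, release) SQL for a session-level advisory lock."""
    key = advisory_lock_key(name)
    return (
        f"SELECT pg_advisory_lock({key});",    # blocks until the lock is free
        f"SELECT pg_advisory_unlock({key});",  # release in the same session
    )


acquire_sql, release_sql = lock_statements("simboard-alembic-migrations")
print(acquire_sql)
print(release_sql)
```

Both statements must run on the same database session as the migration. A second pod that reaches the acquire statement waits until the first releases the lock, then finds the schema already at head.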
