Metadata Ingestion Architecture
HPC sites produce performance_archive metadata that SimBoard ingests into PostgreSQL. Automated ingestion uses one of two workflows depending on whether the source archive is readable from the SimBoard backend environment on NERSC Spin.
Browser/manual uploads are supported separately and are not part of automated HPC dedupe reconstruction.
Ingestion Modes
Automated HPC ingestion uses two site-side scripts. Both use database-backed dedupe state, but they submit changed cases through different routes:
- nersc_archive_ingestor.py for local path ingestion on NERSC / Perlmutter
- hpc_upload_archive_ingestor.py for remote automated uploads from LCRC and other DOE sites
| Mode | Script / entry point | Access pattern | Route | Use when | Examples |
|---|---|---|---|---|---|
| Local path ingestion | nersc_archive_ingestor.py | SimBoard reads a mounted case directory path inside the performance_archive root. | /api/v1/ingestions/from-path | Source archive is readable from NERSC Spin. | NERSC / Perlmutter |
| Remote automated upload | hpc_upload_archive_ingestor.py | Site job uploads one changed case archive over HTTPS. | /api/v1/ingestions/from-hpc-upload | Source archive is not readable from NERSC Spin. | LCRC / Chrysalis; other DOE sites |
| Browser/manual upload | N/A | User uploads an archive through the browser. | /api/v1/ingestions/from-upload | Manual, test, or ad hoc ingestion is needed. | User workstation |
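The choice between the two automated routes reduces to one question: can the backend on NERSC Spin read the source archive? The sketch below captures that rule. The function name `automated_route_url` and its parameters are hypothetical and are not taken from the site-side scripts; only the two route paths come from the table above.

```python
# Route paths as listed in the ingestion modes table.
FROM_PATH_ROUTE = "/api/v1/ingestions/from-path"
FROM_HPC_UPLOAD_ROUTE = "/api/v1/ingestions/from-hpc-upload"


def automated_route_url(api_base_url: str, readable_from_spin: bool) -> str:
    """Return the full URL of the automated ingestion route for a site.

    Hypothetical helper: sites whose archive is readable from NERSC Spin
    submit a path, all other sites upload an archive over HTTPS.
    """
    route = FROM_PATH_ROUTE if readable_from_spin else FROM_HPC_UPLOAD_ROUTE
    # Strip a trailing slash so the joined URL has exactly one separator.
    return api_base_url.rstrip("/") + route
```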
Automated Dedupe Flow
Both automated scripts follow the same dedupe sequence:
- Scan the site performance_archive.
- Read known execution IDs from /api/v1/ingestions/state.
- Compare the scan results with database-backed state.
- Submit changed cases with the full discovered processed_execution_ids set.
- SimBoard stores the submitted dedupe state on ingestion audit rows.
- Future runs reconstruct dedupe state from PostgreSQL.
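The comparison step in this sequence can be sketched as a set difference. This is a minimal illustration, not the scripts' actual logic: the function name `plan_submissions`, the shape of `scanned` (case path mapped to the execution IDs found under it), and the rule that a case is "changed" when it holds at least one unknown execution ID are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def plan_submissions(
    scanned: Dict[str, Set[str]],
    known_execution_ids: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (changed case paths, full discovered execution IDs).

    Assumption: `scanned` maps each case path to the execution IDs
    discovered under it, and `known_execution_ids` is what the state
    endpoint reported for this machine.
    """
    # A case counts as changed when the scan found at least one
    # execution ID that the database-backed state does not know yet.
    changed = sorted(
        case_path
        for case_path, execution_ids in scanned.items()
        if execution_ids - known_execution_ids
    )
    # The full discovered set accompanies every submission so that
    # later runs can reconstruct dedupe state from the audit rows.
    discovered = sorted(set().union(*scanned.values()))
    return changed, discovered
```

When every scanned execution ID is already known, `changed` is empty and the runner has nothing to submit.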
Remote automated uploads must contain exactly one case directory per request. The submitted case_path is used as the stable dedupe key for that uploaded case.
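A site job could check the one-case-per-request rule before uploading. The sketch below assumes the case archive is a tar file (optionally compressed), which the text above does not specify, and treats each distinct top-level entry as one case directory. The helper name `single_case_root` is hypothetical.

```python
import io
import tarfile
from pathlib import PurePosixPath


def single_case_root(archive_bytes: bytes) -> str:
    """Return the only top-level entry of a tar archive.

    Raises ValueError when the archive holds zero or several
    top-level entries. Assumption: the upload is a tar archive.
    """
    roots = set()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            # Member names use POSIX separators; "." has no parts.
            parts = PurePosixPath(member.name).parts
            if parts:
                roots.add(parts[0])
    if len(roots) != 1:
        raise ValueError(
            f"expected exactly one case directory, found {len(roots)}"
        )
    return roots.pop()
```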
flowchart TD
subgraph RUNNERS["Site-Side Ingestion Scripts"]
NERSC["nersc_archive_ingestor.py\nNERSC / Perlmutter"]
HPC["hpc_upload_archive_ingestor.py\nLCRC / other DOE sites"]
SCAN["Scan site filesystem\nperformance_archive"]
STATE_REQ["Read known execution IDs\nGET /api/v1/ingestions/state"]
COMPARE["Compare scan results\nwith database-backed state"]
CHANGED["Changed cases"]
NERSC_PAYLOAD["Changed case directory path\n+ processed_execution_ids"]
HPC_PAYLOAD["One case archive\n+ case_path\n+ processed_execution_ids"]
end
subgraph BACKEND["SimBoard Backend"]
STATE["State API"]
PATH["POST /api/v1/ingestions/from-path"]
UPLOAD["POST /api/v1/ingestions/from-hpc-upload"]
NORMALIZE["Normalize and validate"]
AUDIT["Store ingestion audit row\nwith dedupe state"]
DB[("PostgreSQL")]
end
NERSC --> SCAN
HPC --> SCAN
SCAN --> STATE_REQ
STATE_REQ --> STATE
STATE -->|"known execution IDs"| STATE_REQ
STATE --> DB
STATE_REQ --> COMPARE
COMPARE --> CHANGED
CHANGED --> NERSC_PAYLOAD --> PATH
CHANGED --> HPC_PAYLOAD --> UPLOAD
PATH --> NORMALIZE
UPLOAD --> NORMALIZE
NORMALIZE --> AUDIT --> DB
Runner Configuration
All automated ingestion requests require a bearer API token. Both site-side runners use:
- SIMBOARD_API_BASE_URL
- SIMBOARD_API_TOKEN
- PERF_ARCHIVE_ROOT
- MACHINE_NAME
- DRY_RUN
They also support these tuning options:
- MAX_CASES_PER_RUN
- MAX_ATTEMPTS
- REQUEST_TIMEOUT_SECONDS
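One way a runner might load these settings is sketched below. The setting names come from the lists above. The `RunnerConfig` class, the choice of which settings are required, the truthy spellings for DRY_RUN, and the default values are illustrative assumptions rather than the scripts' real behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    api_base_url: str
    api_token: str
    perf_archive_root: str
    machine_name: str
    dry_run: bool
    max_cases_per_run: Optional[int]
    max_attempts: int
    request_timeout_seconds: float

    def auth_headers(self) -> dict:
        # Automated ingestion requests carry the API token as a bearer token.
        return {"Authorization": f"Bearer {self.api_token}"}


def load_config(env: Mapping[str, str] = os.environ) -> RunnerConfig:
    def required(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            raise ValueError(f"missing required setting: {name}")
        return value

    max_cases = env.get("MAX_CASES_PER_RUN", "").strip()
    return RunnerConfig(
        api_base_url=required("SIMBOARD_API_BASE_URL").rstrip("/"),
        api_token=required("SIMBOARD_API_TOKEN"),
        perf_archive_root=required("PERF_ARCHIVE_ROOT"),
        machine_name=required("MACHINE_NAME"),
        # Assumption: common truthy spellings enable dry-run mode.
        dry_run=env.get("DRY_RUN", "").strip().lower() in {"1", "true", "yes"},
        # Assumption: an unset value means no per-run limit.
        max_cases_per_run=int(max_cases) if max_cases else None,
        # Assumption: illustrative defaults, not the scripts' real ones.
        max_attempts=int(env.get("MAX_ATTEMPTS", "3")),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "30")),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the process environment.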
Stored Results
After ingestion, SimBoard stores normalized cases, simulations, machines, artifacts, links, and audit records in PostgreSQL. Simulation rows preserve parsed CASE_HASH values so the frontend can group related executions inside a case without assigning persistent reference runs. The frontend reads the resulting catalog data through /api/v1 endpoints.
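Grouping by CASE_HASH can be pictured as a simple bucket operation over the simulation rows of one case. The sketch below is illustrative only: the record keys `case_hash` and `execution_id` are hypothetical stand-ins for whatever fields the /api/v1 responses actually expose.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_case_hash(
    simulations: Iterable[Mapping[str, str]],
) -> Dict[str, List[str]]:
    """Bucket execution IDs by their parsed CASE_HASH value.

    Assumption: each record exposes "case_hash" and "execution_id".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for sim in simulations:
        # Executions that share a hash land in the same group, in the
        # order they were received.
        groups[sim["case_hash"]].append(sim["execution_id"])
    return dict(groups)
```

Because the grouping is derived from stored hashes at read time, no execution has to be marked as a persistent reference run.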
Note
SimBoard records artifact references such as output directories, source archive locations, run scripts, and batch logs to support reproducibility.
Referenced source archive directories may be cleaned up by scheduled site-side jobs outside of SimBoard to limit storage growth.
Site Summary
| Site / Machine | Ingestion mode | Scheduler | Source archive location |
|---|---|---|---|
| NERSC / Perlmutter | Local automated path ingestion | Cron | /global/cfs/projectdirs/e3sm/performance_archive |
| LCRC / Chrysalis | Remote automated upload | Sandia Jenkins | /lcrc/group/e3sm/PERF_Chrysalis/performance_archive |
| PNNL / Compy | Remote automated upload | Sandia Jenkins | /compyfs/performance_archive |
| ALCF / Aurora | Remote automated upload | ALCF GitLab job, daily at 7 AM | /lus/flare/projects/E3SM_Dec/performance_archive |
| OLCF / Frontier | Remote automated upload | Local cron job | /lustre/orion/proj-shared/cli115 |
Reference: PACE Upload Scripts
PACE uses site-specific upload scripts and schedulers to collect or upload performance_archive metadata. These serve as references for existing DOE-site automation and are not part of the SimBoard ingestion API. They also provide context for the design of the remote automated upload workflow and the expected contents of performance_archive metadata.
Source: PACE Collection and Upload Reference
| Machine | SAVE_TIMING_DIR | Upload manager / frequency | Machine-specific PACE script |
|---|---|---|---|
| Chrysalis | /lcrc/group/e3sm/PERF_Chrysalis | Sandia Jenkins | chrysalis_pace.sh |
| Perlmutter | /global/cfs/cdirs/e3sm | scrontab | TBD |
| Muller / Alvarez | /global/cfs/cdirs/e3sm | scrontab | TBD |
| Compy | /compyfs | Sandia Jenkins | compy_pace.sh |
| Aurora | /lus/flare/projects/E3SM_Dec/performance_archive | ALCF GitLab job, daily at 7 AM | /lus/flare/projects/E3SM_Dec/performance_archive/pace_archive_aurora.sh |
| Frontier | /lustre/orion/proj-shared/cli115 | Local cron | /lustre/orion/cli115/proj-shared/performance_archive/pace_archive.sh; /lustre/orion/cli115/world-shared/frontier/performance_archive/pace_archive.sh |
| Anvil | /lcrc/group/e3sm | Sandia Jenkins | TBD |