Improve Reliability and Safeguards for RMS_Update Script #505

Cybis320 · 2025-01-19T21:06:48Z

Background & Issues in the Original Script

Risk of Partial Copy/Corruption
- The original script uses direct cp commands on .config and mask.bmp
- Power loss or error during cp leaves files in undefined state
- No validation that copied files match source
No Process Control
- Multiple instances can run simultaneously
- Concurrent updates can interfere with backup/restore operations
- Stale locks from crashed updates aren't detected or cleaned
No Copy Recovery
- Single cp attempt with no retry on failure
- A failed copy either halts the script or lets it continue without the file
- No integrity check of copied files
No Space Verification
- Script attempts operations without checking available space
- Git operations can require up to 3x repo size (~90MB for RMS) plus temp files
- Updates can fail mid-operation when disk fills
Interruption Handling
- Uses UPDATEINPROGRESSFILE but doesn't verify file states after interruption
- No mechanism to confirm if restore succeeded
- Failed operations don't clean up temp files
Parameter Handling Risk
- Script uses $# -eq 0 check for parameter detection
- Any parameter (including a typo or a stray dot) causes backup/restore to be skipped
- Custom configs can be overwritten by accidental parameters
Dependencies Not Updated in One Pass
- If new dependencies are introduced (Python or system-level), the old script doesn’t know to install them
- The script may need to be run a second time after the code is pulled, because changes to the update logic won’t load mid-run
- This leads to potential import or build errors on the newly pulled code until another update is done

How This PR Addresses These Issues

Atomic Copy Operations
- Implements retry_cp function that:
  - Copies to .tmp file
  - Validates with diff against source
  - Uses mv for atomic replacement
- Corrupted copies never reach target location
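
As a rough illustration of the copy, verify, rename sequence listed above, here is a minimal single-attempt sketch. The name `atomic_cp` and the exact error handling are assumptions made for this example; the PR's own `retry_cp` may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: atomic_cp is a hypothetical helper, not the PR's retry_cp.

# Copy SRC to DST so that DST is always either the old file or a verified new one.
atomic_cp() {
    local src=$1 dst=$2
    local tmp="${dst}.tmp"

    # Stage the copy next to the destination so the final mv stays on the
    # same filesystem, where a rename replaces the target in one step.
    cp -p -- "$src" "$tmp" || { rm -f -- "$tmp"; return 1; }

    # Compare the staged copy with the source; any difference rejects the copy.
    if ! diff -- "$src" "$tmp" >/dev/null 2>&1; then
        rm -f -- "$tmp"
        return 1
    fi

    # Only a verified copy ever replaces the target.
    mv -f -- "$tmp" "$dst"
}
```

The rename is only atomic when the `.tmp` file and the target live on the same filesystem, which is why the temp file is created beside the target rather than in /tmp.
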
Process Isolation
- Creates /tmp/update.lock with current PID
- Verifies PID is still running before removing stale locks
- Registers cleanup handlers for all exit paths including SIGINT/SIGTERM
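
The PID-based lock described above could look roughly like this. The function names are hypothetical, and the sketch leaves a small race between checking and writing the lock that a production version might close with `flock` or `set -o noclobber`.

```shell
#!/usr/bin/env bash
# Sketch only. The lock path is the one named in this PR; acquire_lock and
# release_lock are hypothetical names chosen for the example.
LOCK_FILE="${LOCK_FILE:-/tmp/update.lock}"

acquire_lock() {
    local pid
    if [ -f "$LOCK_FILE" ]; then
        pid=$(cat "$LOCK_FILE" 2>/dev/null)
        # kill -0 sends no signal; it only reports whether the PID can be signalled.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            echo "Update already running (PID $pid)" >&2
            return 1
        fi
        echo "Removing stale lock (PID ${pid:-unknown})" >&2
        rm -f -- "$LOCK_FILE"
    fi
    echo "$$" > "$LOCK_FILE"
}

release_lock() {
    # Remove the lock only if this process owns it.
    if [ -f "$LOCK_FILE" ] && [ "$(cat "$LOCK_FILE" 2>/dev/null)" = "$$" ]; then
        rm -f -- "$LOCK_FILE"
    fi
}

# Typical wiring at the top of the script (commented out in this sketch):
#   acquire_lock || exit 1
#   trap release_lock EXIT
#   trap 'exit 130' INT
#   trap 'exit 143' TERM
```

Routing SIGINT and SIGTERM through `exit` means the EXIT trap is the single place where cleanup happens. Note that `kill -0` also fails for a live process owned by another user, so this check assumes the updater always runs as the same user.
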
Copy Validation Cycle
- Configurable RETRY_LIMIT for failed operations
- Each copy verified with diff before acceptance
- Failed copies trigger retry after 1-second delay
- Source file permissions preserved through copy chain
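
The retry behaviour can be sketched as a small wrapper around any command. `RETRY_LIMIT` is the setting named above; its default of 3, the `RETRY_DELAY` variable, and the helper names `with_retries` and `verified_cp` are assumptions for this example.

```shell
#!/usr/bin/env bash
# Sketch only: defaults and helper names are illustrative assumptions.
RETRY_LIMIT="${RETRY_LIMIT:-3}"
RETRY_DELAY="${RETRY_DELAY:-1}"   # seconds between attempts

# Run a command until it succeeds or RETRY_LIMIT attempts are used up.
with_retries() {
    local attempt=1
    while [ "$attempt" -le "$RETRY_LIMIT" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt/$RETRY_LIMIT failed: $*" >&2
        attempt=$((attempt + 1))
        # No point in sleeping after the final attempt.
        if [ "$attempt" -le "$RETRY_LIMIT" ]; then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}

# One copy attempt: cp -p keeps mode and timestamps, diff confirms the content.
verified_cp() {
    cp -p -- "$1" "$2" && diff -- "$1" "$2" >/dev/null 2>&1
}

# Example (commented out): with_retries verified_cp "$src" "$dst.tmp"
```

Keeping the retry loop separate from the copy step means the same wrapper can guard other flaky operations without duplicating the counting logic.
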
Space Pre-checks
- Calculates required space:
  - Current repo size
  - Git operation buffer (3x repo)
  - Backup space for configs
  - Temp file overhead
- Verifies 200MB minimum free space
- Checks all target directories (/tmp, source, backup)
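
A simplified version of the space pre-check might look like the following. It is a sketch under stated assumptions: it takes the larger of 3x the repo size and the 200MB floor, and leaves out the separate backup and temp-file terms listed above, since the PR text does not say how they are combined. All function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the formula and function names are illustrative assumptions.
MIN_FREE_MB="${MIN_FREE_MB:-200}"

# Free space in MB on the filesystem holding the given path (empty on error).
free_mb() {
    df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print int($4 / 1024) }'
}

# Size of a directory tree in MB, rounded up (0 if it does not exist).
dir_mb() {
    if [ -d "$1" ]; then
        du -sk "$1" 2>/dev/null | awk '{ print int($1 / 1024) + 1 }'
    else
        echo 0
    fi
}

# Space to require: 3x the repo for git operations, but never below the floor.
required_mb() {
    local repo_mb need
    repo_mb=$(dir_mb "$1")
    need=$((repo_mb * 3))
    if [ "$need" -lt "$MIN_FREE_MB" ]; then
        need=$MIN_FREE_MB
    fi
    echo "$need"
}

# check_space REPO DIR... : fail if any DIR has less free space than required.
check_space() {
    local repo=$1 need avail dir
    shift
    need=$(required_mb "$repo")
    for dir in "$@"; do
        avail=$(free_mb "$dir")
        if [ -z "$avail" ] || [ "$avail" -lt "$need" ]; then
            echo "Not enough space in $dir: ${avail:-?} MB free, $need MB needed" >&2
            return 1
        fi
    done
}

# Example (commented out): check_space "$HOME/source/RMS" /tmp "$HOME/source" || exit 1
```

`df -P` forces the one-line-per-filesystem POSIX output format, which keeps the `awk` column lookup stable across long device names. The paths in the example comment are placeholders.
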
State Tracking
- Records operation state in UPDATEINPROGRESSFILE
- Verifies existence and integrity of files after restore
- Cleanup operations for all temp files
- Logs operation results with timestamps
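
One possible shape for the state flag and timestamped logging is sketched below. The contents of UPDATEINPROGRESSFILE are not specified in this description, so the "1"/"0" values, the default path, and the function names are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the flag format, default path, and helper names are assumptions.
UPDATEINPROGRESSFILE="${UPDATEINPROGRESSFILE:-/tmp/update_in_progress}"

# Prefix every message with a timestamp.
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*"
}

# Write the flag through a temp file so a crash never leaves it half-written.
set_update_state() {
    echo "$1" > "${UPDATEINPROGRESSFILE}.tmp" &&
        mv -f "${UPDATEINPROGRESSFILE}.tmp" "$UPDATEINPROGRESSFILE"
}

# True if a previous run set the flag to 1 and never cleared it.
update_was_interrupted() {
    [ -f "$UPDATEINPROGRESSFILE" ] && [ "$(cat "$UPDATEINPROGRESSFILE")" = "1" ]
}

# Example flow (commented out):
#   if update_was_interrupted; then log "Previous update interrupted, restoring"; fi
#   set_update_state 1
#   ... backup, pull, restore, verify ...
#   set_update_state 0
```
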
Single Update Path
- Removes parameter-based behavior
- Identical process for setup and updates
- Always attempts backup/restore (safe even when only default files exist)
- Command-line arguments ignored rather than changing behavior
Dependencies Installed in the Same Run
- Moves system-level dependencies to a separate file (system_packages.txt)
- When new code is pulled, the updated list of dependencies is used immediately
- No need to run the script a second time just to get newly introduced packages
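
Reading the package list after the pull could be sketched as follows. The format of system_packages.txt is assumed here to be one package per line with optional `#` comments; the function names are hypothetical, and the install step assumes a Debian-style system with `dpkg-query` and `apt-get`.

```shell
#!/usr/bin/env bash
# Sketch only: file format and helper names are illustrative assumptions.

# Print package names from a list file, skipping blank lines and # comments.
read_package_list() {
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# True only if dpkg reports the package as fully installed.
is_installed() {
    dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

# Install whatever the freshly pulled list names that is not installed yet.
install_system_packages() {
    local list=$1 pkg missing=""
    [ -f "$list" ] || { echo "Package list not found: $list" >&2; return 1; }
    while IFS= read -r pkg; do
        [ -n "$pkg" ] || continue
        is_installed "$pkg" || missing="$missing $pkg"
    done <<EOF
$(read_package_list "$list")
EOF
    [ -n "$missing" ] || return 0
    # Unquoted on purpose: $missing is a space-separated list of package names.
    sudo apt-get install -y $missing
}

# Example (commented out), run after the git pull so the new list is used:
#   install_system_packages "system_packages.txt"
```

Because the list is read from disk after the pull, a package added in the new revision is picked up in the same run even though the script itself was loaded before the pull.
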

The script now uses a single, verifiable update path. Every critical operation includes integrity checks and automatic recovery steps. The update process is atomic - either it completes fully or rolls back safely.

Remaining issue

If the update process fails, the script currently doesn’t provide an obvious signal for the operator (e.g., a prompt or notification). The system could continue running on the old RMS version without anyone realizing.

One solution is to modify First_Run so that if the update fails, it exits before launching start_capture, thereby stopping the station altogether. This makes the failure more evident (the station isn’t capturing), prompting the operator to investigate and fix the update issue. The tradeoff, of course, is that the station remains offline until the update is resolved.
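
The gating idea could be sketched like this. It assumes the update script returns a non-zero exit status on failure; `run_update_then_capture` and its two arguments are placeholders for however First_Run actually invokes the updater and start_capture.

```shell
#!/usr/bin/env bash
# Sketch only: the function and its arguments are hypothetical stand-ins for
# the real update and capture commands used by First_Run.

# Run the updater, and start capture only if the update succeeded.
run_update_then_capture() {
    local update_cmd=$1 capture_cmd=$2
    if ! "$update_cmd"; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') RMS update failed; not starting capture" >&2
        return 1
    fi
    "$capture_cmd"
}
```

With this arrangement a failed update shows up as a station that is not capturing, which is the tradeoff described above.
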

satmonkey · 2025-01-19T21:11:06Z

Great idea! Seems like we are waiting for this desperately.

Cybis320 added 4 commits January 18, 2025 20:52

Prevent corruption of config and mask file

c5fd3ab

Set minimum disk space to 200 MB to attempt update

116f902

Fix typo

6974b9a

Move system packages list to external file. fix check disk space. Exit on failed stash or pull.

eecf211
