Improve Reliability and Safeguards for RMS_Update Script #505

Draft. Wants to merge 4 commits into base: prerelease

Cybis320
Contributor

Background & Issues in the Original Script

  1. Risk of Partial Copy/Corruption

    • The original script uses direct cp commands on .config and mask.bmp
    • Power loss or error during cp leaves files in undefined state
    • No validation that copied files match source
  2. No Process Control

    • Multiple instances can run simultaneously
    • Concurrent updates can interfere with backup/restore operations
    • Stale locks from crashed updates aren't detected or cleaned
  3. No Copy Recovery

    • Single cp attempt with no retry on failure
    • Failed copy operations halt the script or continue without the file
    • No integrity check of copied files
  4. No Space Verification

    • Script attempts operations without checking available space
    • Git operations can require up to 3x repo size (~90MB for RMS) plus temp files
    • Updates can fail mid-operation when disk fills
  5. Interruption Handling

    • Uses UPDATEINPROGRESSFILE but doesn't verify file states after interruption
    • No mechanism to confirm if restore succeeded
    • Failed operations don't clean up temp files
  6. Parameter Handling Risk

    • Script uses $# -eq 0 check for parameter detection
    • Any parameter (including typos or dots) triggers skip of backup/restore
    • Custom configs can be overwritten by accidental parameters
  7. Dependencies Not Updated in One Pass

    • If new dependencies are introduced (Python or system-level), the old script doesn’t know to install them
    • The script may need to be run a second time after the code is pulled, because changes to the update logic won’t load mid-run
    • This leads to potential import or build errors on the newly pulled code until another update is done

How This PR Addresses These Issues

  1. Atomic Copy Operations

    • Implements retry_cp function that:
      • Copies to .tmp file
      • Validates with diff against source
      • Uses mv for atomic replacement
    • Corrupted copies never reach target location
  2. Process Isolation

    • Creates /tmp/update.lock with current PID
    • Verifies PID is still running before removing stale locks
    • Registers cleanup handlers for all exit paths including SIGINT/SIGTERM
  3. Copy Validation Cycle

    • Configurable RETRY_LIMIT for failed operations
    • Each copy verified with diff before acceptance
    • Failed copies trigger retry after 1-second delay
    • Source file permissions preserved through copy chain
  4. Space Pre-checks

    • Calculates required space:
      • Current repo size
      • Git operation buffer (3x repo)
      • Backup space for configs
      • Temp file overhead
    • Verifies 200MB minimum free space
    • Checks all target directories (/tmp, source, backup)
  5. State Tracking

    • Records operation state in UPDATEINPROGRESSFILE
    • Verifies existence and integrity of files after restore
    • Cleanup operations for all temp files
    • Logs operation results with timestamps
  6. Single Update Path

    • Removes parameter-based behavior
    • Identical process for setup and updates
    • Always attempts backup/restore (safe even when only default files exist)
    • Command-line arguments ignored rather than changing behavior
  7. Dependencies Installed in the Same Run

    • Moves system-level dependencies to a separate file (system_packages.txt)
    • When new code is pulled, the updated list of dependencies is used immediately
    • No need to run the script a second time just to get newly introduced packages
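
For reviewers who want a concrete picture of items 1 and 3, here is a minimal sketch of what a `retry_cp` helper along these lines could look like. It is an illustration under assumptions, not the code in this PR: the default `RETRY_LIMIT` of 3, the `.tmp` suffix placement, and the log message are all made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry_cp helper (not the PR's actual code).
RETRY_LIMIT="${RETRY_LIMIT:-3}"   # assumed default; the PR only says it is configurable

retry_cp() {
    local src="$1" dest="$2"
    local tmp="${dest}.tmp"       # temp file sits next to the target, on the same filesystem
    local attempt=1

    while [ "$attempt" -le "$RETRY_LIMIT" ]; do
        # 1. copy to the temp file (-p keeps the source's mode and timestamps)
        # 2. verify the temp file is identical to the source
        # 3. rename over the target; a rename within one filesystem is atomic
        if cp -p -- "$src" "$tmp" \
            && diff -q -- "$src" "$tmp" >/dev/null \
            && mv -f -- "$tmp" "$dest"; then
            return 0
        fi
        rm -f -- "$tmp"           # never leave a partial copy behind
        echo "retry_cp: attempt ${attempt}/${RETRY_LIMIT} failed for ${src}" >&2
        # wait one second before the next attempt, but not after the last one
        if [ "$attempt" -lt "$RETRY_LIMIT" ]; then
            sleep 1
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

Because the target is only ever touched by the final `mv`, an interrupted or corrupted copy can at worst leave a stray `.tmp` file, and the existing target stays intact.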

The script now uses a single, verifiable update path. Every critical operation includes integrity checks and automatic recovery steps. The update process is atomic: either it completes fully or it rolls back safely.
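
The two pre-flight guards (the PID lock from item 2 and the free-space check from item 4) can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the function names `acquire_lock`, `release_lock`, and `has_free_space` and the `MIN_FREE_MB` variable are hypothetical, and only the lock path and the 200MB figure come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the lock and space guards (not the PR's actual code).
LOCK_FILE="${LOCK_FILE:-/tmp/update.lock}"   # path named in the PR description
MIN_FREE_MB="${MIN_FREE_MB:-200}"            # minimum named in the PR description

acquire_lock() {
    local old_pid
    if [ -f "$LOCK_FILE" ]; then
        old_pid="$(cat "$LOCK_FILE" 2>/dev/null)"
        # kill -0 sends no signal; it only tests whether the PID can be signalled.
        if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
            echo "update already running (PID ${old_pid})" >&2
            return 1
        fi
        echo "removing stale lock ${LOCK_FILE}" >&2
        rm -f "$LOCK_FILE"
    fi
    echo "$$" > "$LOCK_FILE"
}

release_lock() {
    rm -f "$LOCK_FILE"
}

has_free_space() {
    # df -Pk prints one POSIX-format line per filesystem in 1 KiB blocks;
    # column 4 of the second line is the available space.
    local dir="$1" avail_kb
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $(( MIN_FREE_MB * 1024 )) ]
}
```

A caller would typically run `acquire_lock || exit 1`, register `release_lock` in its exit and signal traps, and then call `has_free_space` for each directory it writes to. Two caveats of this sketch: the check-then-write in `acquire_lock` is not itself atomic, and `kill -0` also fails for a live process owned by another user, so a lock held by a different account would be treated as stale.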

Remaining issue

If the update process fails, the script currently doesn’t provide an obvious signal for the operator (e.g., a prompt or notification). The system could continue running on the old RMS version without anyone realizing.

One solution is to modify First_Run so that if the update fails, it exits before launching start_capture, thereby stopping the station altogether. This makes the failure more evident (the station isn’t capturing), prompting the operator to investigate and fix the update issue. The tradeoff, of course, is that the station remains offline until the update is resolved.
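
As a rough sketch of that idea, the gating logic could look something like the following. It assumes the update script exits non-zero on failure, which would need to be confirmed. `run_station` and its two arguments are hypothetical stand-ins and do not reflect how First_Run is actually structured.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of gating capture on a successful update.
# update_cmd and capture_cmd are placeholders, not real RMS paths.
run_station() {
    local update_cmd="$1" capture_cmd="$2"

    # Assumption: the update command returns non-zero when the update fails.
    if ! "$update_cmd"; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') update failed; not starting capture" >&2
        return 1
    fi
    "$capture_cmd"
}
```

With this shape, a failed update leaves the station idle with a timestamped message on stderr, which is the visible failure the paragraph above describes.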

@satmonkey
Contributor

Great idea! Seems like we are waiting for this desperately.
