Implement a "sync and retry" mechanism to handle manager-agent resource mismatches #3514

Open
fregataa opened this issue Jan 22, 2025 — with Lablup-Issue-Syncer · 0 comments
fregataa commented Jan 22, 2025

Motivation

  • InsufficientResource errors occur in the CREATING phase of session creation, while agents are creating containers. These errors may result from:
    • Resource fragmentation in the agent (BA-588)
    • Resource state mismatches between managers and agents
      This can happen naturally during normal operation. However, these InsufficientResource errors currently lead to immediate session cancellation without any “sync and retry” attempt.

Objective

  • When InsufficientResource errors occur, managers should sync resource states with agents and retry session creation according to configured retry policies
  • Implement retry policies that specify the retry interval, the maximum number of attempts, and whether to re-enqueue the session or retry creating the kernel(s) on the same agent
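
A minimal sketch of how such a policy could drive a "sync and retry" loop, to make the objective concrete. Every name here (`RetryPolicy`, `RetryTarget`, `create_with_sync_and_retry`, and the `create`, `sync`, and `requeue` callables) is hypothetical and not part of the existing manager code. The `InsufficientResource` class is a local stand-in for the real error type.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Awaitable, Callable, Optional


class InsufficientResource(Exception):
    """Local stand-in for the error raised while an agent creates containers."""


class RetryTarget(Enum):
    SAME_AGENT = "same-agent"  # retry kernel creation on the same agent
    REQUEUE = "requeue"        # put the session back into the pending queue


@dataclass(frozen=True)
class RetryPolicy:
    interval: float = 1.0      # seconds to wait between attempts
    max_attempts: int = 3      # total creation attempts, including the first
    target: RetryTarget = RetryTarget.SAME_AGENT


async def create_with_sync_and_retry(
    create: Callable[[], Awaitable[str]],
    sync: Callable[[], Awaitable[None]],
    requeue: Callable[[], Awaitable[None]],
    policy: RetryPolicy,
) -> Optional[str]:
    """Return the created kernel ID, or None if the session was re-enqueued."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return await create()
        except InsufficientResource:
            # Always resync first so the manager's view is corrected
            # before any further scheduling decision is made.
            await sync()
            if policy.target is RetryTarget.REQUEUE:
                await requeue()
                return None
            if attempt == policy.max_attempts:
                raise  # attempts exhausted: fall back to cancellation
            await asyncio.sleep(policy.interval)
    raise ValueError("max_attempts must be at least 1")
```

In this sketch the sync step runs on every failure, including the last one, so that the manager's resource state is corrected even when the session ends up cancelled.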

Expected Sub Issue

  • Refactor the manager's error handler to detect InsufficientResource errors
  • Implement a sync API that resolves resource mismatches between managers and agents
  • Add configuration options for retry policies (intervals, max attempts, etc.)
  • Implement manager-side APIs for retry policy configuration
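
As a rough illustration of what the sync step might do, the sketch below compares the manager's recorded slot occupancy for one agent against what the agent reports, then adopts the agent's values. It assumes the agent's report is authoritative, which the issue does not specify. The function names and the slot keys (`cpu`, `mem`, `cuda.device`) are illustrative only.

```python
from decimal import Decimal
from typing import Dict, Mapping, Tuple

Slots = Mapping[str, Decimal]


def find_mismatches(
    manager_view: Slots, agent_report: Slots
) -> Dict[str, Tuple[Decimal, Decimal]]:
    """Map each differing slot to (value recorded by manager, value reported by agent)."""
    mismatches: Dict[str, Tuple[Decimal, Decimal]] = {}
    for slot in sorted(set(manager_view) | set(agent_report)):
        recorded = manager_view.get(slot, Decimal(0))
        actual = agent_report.get(slot, Decimal(0))
        if recorded != actual:
            mismatches[slot] = (recorded, actual)
    return mismatches


def sync_occupied_slots(
    manager_view: Dict[str, Decimal], agent_report: Slots
) -> Dict[str, Tuple[Decimal, Decimal]]:
    """Overwrite mismatched slots in place (assumption: the agent is authoritative)."""
    mismatches = find_mismatches(manager_view, agent_report)
    for slot, (_, actual) in mismatches.items():
        manager_view[slot] = actual
    return mismatches  # returned so the caller can log what was corrected
```

Returning the mismatches rather than discarding them would let the error handler record how far the two views had drifted before a retry.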

fregataa changed the title from “Fix session creation failures due to resource handling issues” to “Implement a "sync and retry" mechanism to handle manager-agent resource mismatches” on Jan 22, 2025