Setting ALWAYS_OFFLOAD_NODE_STATUS without setting nodeStatusOffLoad results in workflow errors #12563

Open
3 of 4 tasks
abhijeetviswa opened this issue Jan 22, 2024 · 3 comments · May be fixed by #13874
Labels
area/controller Controller issues, panics P3 Low priority type/bug

Comments

@abhijeetviswa

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I was in the process of enabling node status offloading. As part of this, I set the ALWAYS_OFFLOAD_NODE_STATUS environment variable but forgot to set the nodeStatusOffLoad property in the ConfigMap to true.

This resulted in the workflows failing with the error: offload node status is not supported (cli) and Workflow operation error (ui).

While this is definitely a configuration issue, it would be nice if the controller ignored the ALWAYS_OFFLOAD_NODE_STATUS flag when the nodeStatusOffLoad setting is false. That way, users who need to disable node status offloading would not have to disable both.
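A minimal sketch of the requested behavior, in Go. The helper name `effectiveAlwaysOffload` is hypothetical and does not exist in Argo Workflows; it only illustrates treating the environment flag as meaningful when offloading itself is enabled.

```go
package main

import "fmt"

// effectiveAlwaysOffload is a hypothetical helper, not part of Argo Workflows.
// It honors the ALWAYS_OFFLOAD_NODE_STATUS flag only when node status
// offloading is enabled in the ConfigMap, and ignores it otherwise.
func effectiveAlwaysOffload(nodeStatusOffLoad, alwaysOffloadEnv bool) bool {
	return nodeStatusOffLoad && alwaysOffloadEnv
}

func main() {
	// Walk through all four combinations of the two settings.
	cases := []struct{ offload, always bool }{
		{false, false}, {false, true}, {true, false}, {true, true},
	}
	for _, c := range cases {
		fmt.Printf("nodeStatusOffLoad=%t ALWAYS_OFFLOAD_NODE_STATUS=%t -> always offload: %t\n",
			c.offload, c.always, effectiveAlwaysOffload(c.offload, c.always))
	}
}
```

With this approach, the misconfigured combination (flag set, offloading disabled) would silently behave as if offloading were off.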

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This issue occurs with any workflow, so the default workflow is pasted here.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: fantastic-python
  labels:
    example: 'true'
spec:
  arguments:
    parameters:
      - name: message
        value: hello argo
  entrypoint: argosay
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      container:
        name: main
        image: 'argoproj/argosay:v2'
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

  Type     Reason           Age   From                 Message
  ----     ------           ----  ----                 -------
  Normal   WorkflowRunning  65s   workflow-controller  Workflow Running
  Warning  WorkflowFailed   65s   workflow-controller  offload node status is not supported

Logs from your workflow's wait container

This was not done :(
@Joibel
Member

Joibel commented Jan 22, 2024

You have asked for something to happen and not configured it. The controller is invalidly configured.

I wouldn't support being functional in this case. We could just refuse to start the controller instead, but that feels less helpful. The CLI error is as useful as it can be. Blindly carrying on in the face of bad configuration might be sane for some kinds of applications, but I disagree with doing it for a controller like Argo Workflows.

I'd support an improvement to the UI error message in this case, but otherwise think the current behavior is correct.

@agilgur5 agilgur5 added area/controller Controller issues, panics P3 Low priority labels Jan 22, 2024
@agilgur5

> but we could just refuse to start the controller instead

Yeah, I think this would be the proper route, or at the very least have the Controller log a critical error.

> Logs from the workflow controller

Notably, these logs are actually missing from the issue

> and Workflow operation error (ui).

This actually comes from the Controller; it's a message on the entry node.

> offload node status is not supported (cli)

This one is a bit more interesting: the error is mentioned in the docs, but this scenario is not listed as a possible cause.

This error message comes from the DB code. I'm not sure exactly how the CLI chose this error to show specifically. Which command in the CLI did you use to get that? argo list?

@abhijeetviswa
Author

I'm not aware of the command that was run to get the logs, since they were given to me by a team member. From what I understood, it was a kubectl describe on the controller itself.

I'll try reproducing this again with a local argo setup over the weekend and come up with a more concrete set of logs and commands to reproduce this issue.

> I'm not sure exactly how the CLI chose this error to show specifically.

From my very limited understanding of the controller code, and from memory of looking at the kubectl logs of the controller, the error comes up when a node is "hydrated/dehydrated". This if condition succeeds because alwaysOffloadNodeStatus is true, while h.offLoadNodeStatusRepo defaults to ExplosiveOffloadNodeStatusRepo because nodeStatusOffLoad is false.
