Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to send requests with "safe" checking, exiting early if the service info cannot be quickly communicated with #63

Merged
merged 9 commits into from
Aug 21, 2024

Conversation

OrdiNeu
Copy link
Contributor

@OrdiNeu OrdiNeu commented Aug 1, 2024

Jira ticket

DIG-1708

Description

This PR introduces the ability to send requests with the "safe" header before a fanout call, which checks which servers are currently communicable. This was prompted by the BC server taking a long time to serve any request while the UHN server was experiencing DNS issues.

The idea here is to shortcut away requests (before the usual 60s timeout) if a server is unable to serve a service-info request. This involves a heartbeat script that is constantly checking the uptime of connected servers, and a "safe" header that can be used in all frontend requests. If a server is detected to be down by the heartbeat script, all safe requests will subsequently ignore that server.

To test

  1. Build with this branch
  2. Run integration tests.
  3. Add candig-dev as a server to federation.
  4. candig-dev should show up in /app/federation/live_servers.txt, and requests to fanout
  5. Go offline, wait a minute, OR delete the entry in the container's /app/federation/live_servers.txt.
  6. If you waited, candig-dev should no longer show up in /app/federation/live_servers.txt
  7. Send a request like the following:
curl -v candig.docker.internal:5080/federation/v1/fanout
  -H "Content-Type: application/json"
  -H "Authorization: Bearer $TOKEN"
  -d '{ "method": "GET", "path": "service-info", "service": "query", "payload": {}, "safe": "True" }'

You should get a response like the following:

[
  {
    "location": {
      "name": "Local",
      "province": "ON",
      "province-code": "ca-on"
    },
    "message": "handle_server_request failed on internal-1, federation = false: Safe check abort: internal-1 is assumed to be down",
    "service": "query",
    "status": 500
  }
]

@OrdiNeu OrdiNeu marked this pull request as ready for review August 16, 2024 18:15
@OrdiNeu OrdiNeu requested a review from daisieh August 16, 2024 18:40
@daisieh
Copy link
Member

daisieh commented Aug 19, 2024

I cannot get any change I make to live_servers to make a difference in the responses I get from federation. I deleted the uhn-candig-dev entry from the file, but curl -X "POST" "http://candig.docker.internal:5080/federation/v1/fanout" still gives me answers for both.

[
  {
    "location": {
      "name": "Local",
      "province": "ON",
      "province-code": "ca-on"
    },
    "results": {
      "description": "A query microservice for operating with HTSGet & Katsu",
      "id": "org.candig.query",
      "name": "CanDIG query service",
      "organization": {
        "name": "CanDIG",
        "url": "https://www.distributedgenomics.ca"
      },
      "type": {
        "artifact": "query",
        "group": "org.candig",
        "version": "v0.1.0"
      },
      "version": "0.1.0"
    },
    "service": "query",
    "status": 200
  },
  {
    "location": {
      "name": "UHN dev",
      "province": "Ontario",
      "province-code": "ca-on"
    },
    "results": {
      "description": "A query microservice for operating with HTSGet & Katsu",
      "id": "org.candig.query",
      "name": "CanDIG query service",
      "organization": {
        "name": "CanDIG",
        "url": "https://www.distributedgenomics.ca"
      },
      "type": {
        "artifact": "query",
        "group": "org.candig",
        "version": "v0.1.0"
      },
      "version": "0.1.0"
    },
    "service": "query",
    "status": 200
  }
]

@OrdiNeu
Copy link
Contributor Author

OrdiNeu commented Aug 20, 2024

@daisieh Did you include the "safe" flag in your request? The code is only active if it is explicitly asked to be safe -- though thinking about it perhaps that's the wrong behaviour.

@daisieh
Copy link
Member

daisieh commented Aug 20, 2024

Ah yes, that seems to be the problem. But if live_servers.txt contains internal-1, shouldn't that server return a good result?

live_servers.txt:

internal-1
[
  {
    "location": {
      "name": "Local",
      "province": "ON",
      "province-code": "ca-on"
    },
    "message": "handle_server_request failed on internal-1, federation = false: Safe check abort: internal-1 is assumed to be down",
    "service": "query",
    "status": 500
  },
  {
    "location": {
      "name": "UHN dev",
      "province": "Ontario",
      "province-code": "ca-on"
    },
    "message": "handle_server_request failed on uhn-candig-dev, federation = false: Safe check abort: uhn-candig-dev is assumed to be down",
    "service": "query",
    "status": 500
  }
]

@OrdiNeu
Copy link
Contributor Author

OrdiNeu commented Aug 20, 2024

Indeed, it should. I'm not sure what could cause that error -- was there any whitespace in live_servers.txt?

@daisieh
Copy link
Member

daisieh commented Aug 20, 2024

Not that I can tell.

Copy link
Member

@daisieh daisieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after fixing a few things, I think it works correctly on my system. You might want to confirm my changes before merging.

@OrdiNeu
Copy link
Contributor Author

OrdiNeu commented Aug 21, 2024

Yup, those changes work for me ✔️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants