Intermittent thredds read errors behind twitcher #123
Comments
The last error in the notebook displays this link as the cause of the error:
This looks like the error I usually get when doing an incorrectly formed request towards THREDDS. Using the corresponding link, I'm guessing there must be a parameter that is incorrectly parsed or forwarded somewhere, which causes the Twitcher redirect to be incorrect. I notice that the
So, I'm not sure where the notebook retrieves the badly formed link from, without the appropriate URL encoding, and why it doesn't cause the same problem using the corresponding one. Is it possible to check if some Magpie/Twitcher/Nginx logs are produced on the Ouranos instance when calling the first link (they may need to be started with DEBUG to get more descriptive logs)?
Just to clarify, this corresponding link is not the OPeNDAP data URL, as it's pointing to the thredds
Both of the above examples work when run as single lines, but the Twitcher version will result in a read error a very small percentage of the time, making long-running workflow calculations problematic.
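For reference, a single-line read of this kind can be sketched as follows; the two URLs, the proxy path pattern, and the variable selection are placeholders rather than the actual notebook's values:

```python
# Minimal sketch, assuming hypothetical dataset URLs: read the same OPeNDAP
# dataset once through the nginx-only path and once through Twitcher.
import xarray as xr

NGINX_ONLY_URL = "https://pavics.ouranos.ca/thredds/dodsC/example/dataset.ncml"  # placeholder
TWITCHER_URL = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/example/dataset.ncml"  # placeholder

for label, url in [("nginx only", NGINX_ONLY_URL), ("twitcher", TWITCHER_URL)]:
    ds = xr.open_dataset(url)                      # opens the dataset over OPeNDAP
    var = next(iter(ds.data_vars))                 # pick any data variable
    sample = ds[var].isel({d: 0 for d in ds[var].dims}).load()  # force one real data read
    print(f"{label}: {var} -> {float(sample)}")
    ds.close()
```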
Old Twitcher/Magpie performance problem that might be related to the performance hit observed by Travis: #97 (comment). I think this "fixed" issue has not completely fixed the problem.

We have 2 problems: performance and intermittent access/read errors. The intermittent read error problem faced by Travis occurs with caching disabled on our production deployment, see bird-house/birdhouse-deploy@7dcb208...Ouranosinc:birdhouse-deploy:8bdf5271df66803730da6a0c34d46e2952399ab2

We've seen that when caching is enabled, we have more intermittent read problems, and not just when xarray is hitting Thredds. A simple DescribeService GET request on any WPS could also result in an error, which is why we have disabled caching on our production pretty much since it was introduced. Caching was introduced to solve the performance problem. Is it possible that the caching config is not honored anymore, or that caching is permanently enabled in recent Twitcher/Magpie?

Below is the summary of a discussion with @fmigneault on April 5. The intermittent problem has been seen by our nightly Jenkins since 2023-02-18, and nothing before that, since we have more than 3 months of logs. Magpie was bumped from 3.26.0 to 3.31.0 on 2023-02-10 in this PR bird-house/birdhouse-deploy#283. It went live on Ouranos production on 2023-02-14 via this merge and the overnight autodeploy Ouranosinc/birdhouse-deploy@f72d36c.

Jenkins nightly logs investigation summary:
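On the point about a plain GET request sometimes failing: one rough way to estimate how often that happens is to hit a WPS endpoint in a loop and count failures, along these lines (the endpoint URL, the GetCapabilities request, and the attempt count are placeholders, not values from this issue):

```python
# Rough sketch, assuming a hypothetical WPS endpoint behind Twitcher:
# issue the same plain GET repeatedly and count how many attempts fail.
import requests

WPS_URL = "https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps"  # placeholder endpoint
PARAMS = {"service": "WPS", "request": "GetCapabilities"}
ATTEMPTS = 200

failures = 0
for attempt in range(ATTEMPTS):
    try:
        resp = requests.get(WPS_URL, params=PARAMS, timeout=30)
        if resp.status_code != 200:
            failures += 1
            print(f"attempt {attempt}: HTTP {resp.status_code}")
    except requests.RequestException as exc:
        failures += 1
        print(f"attempt {attempt}: {exc}")

print(f"{failures}/{ATTEMPTS} requests failed")
```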
@tlvu
@fmigneault
@tlvu
In
Matching logs from thredds container,
No errors in

So @fmigneault, that's all I've got, sorry no stack-trace for you. If you need something else, please be specific about which command to run and where I should look.
Below are all the "could not get service name" type of errors in
@tlvu It seems the critical error information would be obtained from this. There's probably something being raised in Magpie that propagates all the way to this step, but it's hard to investigate further without it.
Tried setting
There should be at least some URLs logged, because of this line that would be called on any allowed request:
Found the problem.
…oblem

See issue bird-house/twitcher#123

To investigate this kind of error:

```
$ docker logs twitcher 2>&1 | grep ERROR
2023-05-04 16:53:12,723 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-04 17:29:58,124 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-05 22:20:01,798 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name malleefowl
2023-05-07 00:38:04,647 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-07 00:50:35,185 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name flyingpigeon
2023-05-08 00:55:09,584 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 00:58:43,493 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 01:17:03,611 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name finch
2023-05-08 09:19:19,819 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
2023-05-09 01:24:12,094 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
```
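As a small aid for digging through that output, a sketch like the following could group those errors by service and by date; it only assumes the log format shown above and reads the grep output from stdin:

```python
# Sketch: summarize "Could not get service with name ..." errors by service and date.
# Usage (assuming the log format shown above):
#   docker logs twitcher 2>&1 | grep "Could not get service" | python summarize_errors.py
import re
import sys
from collections import Counter

pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) .*Could not get service with name (?P<service>\S+)"
)

by_service = Counter()
by_date = Counter()
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        by_service[match.group("service")] += 1
        by_date[match.group("date")] += 1

print("errors per service:", dict(by_service))
print("errors per date:", dict(by_date))
```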
DEBUG logging enabled in prod, see commit Ouranosinc/birdhouse-deploy@9d1e37a
Describe the bug
We have noticed intermittent NetCDF read errors when accessing OPeNDAP links in the PAVICS JupyterHub.
Approximate date when the problem began: ~March 1, 2023
To Reproduce
The intermittent nature of this problem makes reproducing it somewhat difficult, but a public notebook on the PAVICS server is available here: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Random_Thredds_read_errors.ipynb
The notebook executes a relatively large workflow and uses multiple Dask worker processes to increase the likelihood of a read error.
Multiple notebook runs (note: only ~5-6, due to the time needed) have shown that bypassing Twitcher (i.e. THREDDS with the nginx proxy only) always allows successful completion of the calculations, whereas accessing OPeNDAP links behind the nginx/Twitcher combination typically results in a read error relatively quickly in the workflow.
Although not quantified, there also seems to be a general performance hit when accessing data via nginx/Twitcher (printouts of execution times in the workflow loop range between 25 and 40 seconds with nginx/Twitcher versus 18 to 30 seconds with nginx only). Note also that the notebook runs the 'nginx-only' code first, so I do not believe the performance difference comes from data caching; if anything, caching should benefit the 'twitcher/nginx' run.
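For context, a reduced sketch of that kind of workload is shown below; the dataset URL, variable name, chunking, and cluster size are placeholders, and the notebook linked above remains the actual reproduction:

```python
# Reduced sketch, assuming a hypothetical OPeNDAP URL and variable: many small
# remote reads spread across local Dask workers, similar in shape to the notebook's
# workflow. With Twitcher in the path, a few of these reads fail intermittently.
import xarray as xr
from dask.distributed import Client, LocalCluster

OPENDAP_URL = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/example/dataset.ncml"  # placeholder

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    # Chunking over time forces many separate OPeNDAP requests through the proxy.
    ds = xr.open_dataset(OPENDAP_URL, chunks={"time": 365})
    result = ds["tasmax"].mean(dim="time").compute()  # "tasmax" is a placeholder variable
    print(result)

    client.close()
    cluster.close()
```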
Expected behavior
Execution of the code without read errors