
ocrd network: How to handle calls to unlisted processors? #1033

Closed
kba opened this issue Mar 29, 2023 · 4 comments

Comments

@kba
Member

kba commented Mar 29, 2023

(screenshot of the code under review; see the diff link below)

https://github.com/OCR-D/core/pull/974/files/8143614bd16d02552af340d613fbf31a71732e7b..7f4f1d704954a38afcd74bbb23a215890c9c36fd#diff-98efeaed0fdef4fe0b48a9f89e50bee11a7175db9d5a9d1202d4cb07dcd60694

@bertsky

@joschrew isn't that overly strict? We already have a long tail of processors, with new ones being added (continuously wrapping new external tools), so it's difficult even for admins to keep track of them. (And we certainly should not enforce requiring the developers to "register" new processors centrally again.) So IIUC the hosts entries are useful for preallocating/preloading a number of important processors, but there could still be a "let it fail" approach to unknown processors – simply trying to call their CLI when requested...

@joschrew:

The way it currently works is: the processing server is started with a configuration file in which the processors to be started are listed. This is the source of the mentioned self.processor_list. Then for every processor a queue is created, and this queue is used for a processor call: a user request to a processor is accepted at the processing server and wrapped into a message that later starts the processor run. So if the processor is not in that list it cannot be called, because it has no queue associated with it. Maybe you are thinking about the standalone processor as a server; that is not part of this branch.
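For illustration, a minimal sketch of this configuration-to-queue mapping (not the actual implementation; the config layout `hosts` → `workers` → `name` and the use of pika are assumptions):

```python
# Illustrative sketch only: create one RabbitMQ queue per processor listed in
# the Processing Server configuration file (field names are assumptions, not
# the actual config schema).
import pika
import yaml

with open("processing-server-config.yml") as f:
    config = yaml.safe_load(f)

# e.g. hosts: [{workers: [{name: ocrd-tesserocr-recognize}, ...]}, ...]
processor_list = [
    worker["name"]
    for host in config.get("hosts", [])
    for worker in host.get("workers", [])
]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
for processor_name in processor_list:
    # one durable queue per configured processor; a request for any executable
    # outside processor_list has no queue to go to and is therefore rejected
    channel.queue_declare(queue=processor_name, durable=True)
connection.close()
```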

@bertsky:

Maybe you are thinking about the standalone processor as a server; that is not part of this branch.

No, I'm thinking of a fallback queue for all other processors, to mitigate the problem that you have to (at any given time) know all processors in advance and configure them in the hosts section.

@joschrew

Ah, ok, understood, thanks for clarifying.

isn't that overly strict?

I think it is necessary for the current approach: we don't have queue-creation-on-demand yet, and therefore requests for unavailable processors should be intercepted, imo. Adding what you have in mind should probably be considered, but tbh it is currently out of my scope.

@bertsky:

Agreed. Should I open an issue in spec for this, @tdoan2010?

@MehmedGIT:

@bertsky, I am just seeing this conversation. I remember we already had an internal discussion (AFAIR, it's still not in the spec yet) about such a queue for storing processing messages that fail and can't be reprocessed. According to the current requirements, re-queuing a failing message creates an infinite loop. We still need to keep such messages somewhere (even if just for logging purposes) instead of silently dropping them. The general approach is to have a separate exchange agent which collects dropped (negatively acked by the consumer) messages and queues them to a separate dead letter queue. Check this diagram.
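For illustration, a minimal pika sketch of RabbitMQ's dead-letter mechanism described above (exchange and queue names are made up): a message that the consumer rejects with `requeue=False` is re-routed by the broker to the dead-letter exchange instead of being re-queued (infinite loop) or dropped silently.

```python
# Illustrative sketch of a dead-letter setup with pika (names are examples).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# exchange + queue collecting negatively acknowledged processing messages
channel.exchange_declare(exchange="ocrd-dlx", exchange_type="direct")
channel.queue_declare(queue="ocrd-dead-letter", durable=True)
channel.queue_bind(queue="ocrd-dead-letter", exchange="ocrd-dlx", routing_key="failed")

# a regular processor queue wired to the dead-letter exchange: a basic_nack
# with requeue=False sends the message to ocrd-dead-letter instead of dropping it
channel.queue_declare(
    queue="ocrd-cis-ocropy-binarize",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "ocrd-dlx",
        "x-dead-letter-routing-key": "failed",
    },
)
connection.close()
```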

@bertsky:

@MehmedGIT that's not what this conversation was about, though. My concern was that we might not want to require setting up every possible processor runtime individually, but provide some sort of fallback (general purpose) worker that tries to feed these processing messages to the respective CLIs. So the processing server could then push all processing messages for non-configured processors into this fallback queue and hope for the best.

What you are addressing, however, is failing jobs (worker failing on the job). IMO nothing special should happen: the whole processing/workflow request should return failed in that case.

Rescheduling partially run workflows is another related topic BTW. So once the user has fixed their workflow or runtime configuration, as long as the workspace is still present, the scheduler should identify which steps have already been run successfully before.
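For illustration, a minimal sketch of such a general-purpose fallback worker that simply tries to run the requested CLI; the queue name `ocrd-fallback` and the message fields (`processor_name`, `path_to_mets`, `input_file_grps`, `output_file_grps`) are assumptions, not the actual schema:

```python
# Illustrative sketch of a "let it fail" fallback worker: consume processing
# messages from a catch-all queue and try to invoke the requested processor CLI.
# Queue name and message fields are assumptions, not the actual schema.
import json
import subprocess
import pika

def on_message(channel, method, properties, body):
    msg = json.loads(body)
    cmd = [
        msg["processor_name"],                  # e.g. "ocrd-foo-bar", assumed on PATH
        "-m", msg["path_to_mets"],
        "-I", ",".join(msg["input_file_grps"]),
        "-O", ",".join(msg["output_file_grps"]),
    ]
    try:
        subprocess.run(cmd, check=True)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except (OSError, subprocess.CalledProcessError):
        # CLI not installed or job failed: reject without re-queueing
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ocrd-fallback", durable=True)
channel.basic_consume(queue="ocrd-fallback", on_message_callback=on_message)
channel.start_consuming()
```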

@tdoan2010:

Sorry, somehow I missed this discussion because it is marked as "Outdated". But I agree with Robert. This check is not necessary. It actually restricts the ability to add more processors to the system at runtime.

Please remove this check.

@MehmedGIT:

The main issue here is that the dynamic deployment of processors was not envisioned in the architecture. Only processors specified inside the processing server configuration file are deployed. There is no way to deploy processors after starting the Processing Server. Hence, the current solution works according to these requirements. Ideally, a separate endpoint for deploying processors (not database or queue) should be implemented. After some refactoring of the deployer agent, implementing that is not complicated.

The main problem is the tracking of the run-time status of processing workers. In case a processing worker dies, the Processing Server or the Deployer has no way to check that (EDIT: no way, aka, it is not implemented yet.)

@bertsky:

Only processors specified inside the processing server configuration file are deployed. There is no way to deploy processors after starting the Processing Server.

But @tdoan2010 is right: additional Processing Workers can be started outside and independent of the Processing Server at runtime. It would not be the responsibility of the latter to restart them (on failure) or shut them down (on termination), but of their outside controller, which could be some docker compose config btw. The Processing Server would not even have to know about them; rather, the (external) Processing Workers would have to know about the PS's queue and database, that's all.

The main problem is the tracking of the run-time status of processing workers. In case a processing worker dies, the Processing Server or the Deployer has no way to check that (EDIT: no way, aka, it is not implemented yet.)

But that's exactly the point of the discussion on worker failure...

@MehmedGIT:

But @tdoan2010 is right: additional Processing Workers can be started outside and independent of the Processing Server at runtime.

They can be started separately. But the processing server will have no info about them in order to terminate them.

It would not be the responsibility of the latter to restart them (on failure) or shut them down (on termination)

Well, exactly! But that was the initial idea: the processing server starting and killing all agents was what we envisioned. This is also the current state of the spec. When the implementation started, it was easier to separate concepts, hence the Deployer agent was introduced (although it is still not mentioned in the spec at all). Recently, it was decided to separate the Deployer completely.

@bertsky:

Oh, you're right – so I was actually arguing in your favour here, and against dynamic deployment (i.e. keeping a mutable list of worker instances).

But what happened to that other, earlier idea of having some fallback queue (backed by special workers which simply try to run whatever CLI name was requested)? Would that be possible with RabbitMQ?

@MehmedGIT:

@bertsky, IMO this is not hard to implement with what we currently have. The processing worker has to be adapted. Currently, the processing server creates the message queues to which the processing workers listen. The processing server could also create an extra queue named "ocrd-default". Any processing request that does not match any other queue would be sent to that queue. Then the default worker could be started with ocrd-default as the processor-name argument and start listening on that queue.

However, the tricky part: if, during run-time, a specific worker for some processor is started that was not listed in the config file, so the processing server is not aware of it, the requests for that processor will still go to the ocrd-default queue.
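For illustration, a minimal sketch of that routing decision on the Processing Server side (queue, field, and processor names are illustrative): requests whose executable has a configured queue go there, everything else goes to ocrd-default.

```python
# Illustrative sketch of routing to "ocrd-default" for unknown processors.
import json
import pika

def publish_processing_message(channel, processor_list, processor_name, message):
    # use the processor's own queue if it was configured, else the catch-all queue
    queue = processor_name if processor_name in processor_list else "ocrd-default"
    channel.basic_publish(exchange="", routing_key=queue, body=json.dumps(message))

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ocrd-default", durable=True)
# "ocrd-foo-bar" is not configured, so its message ends up in ocrd-default
publish_processing_message(channel, ["ocrd-cis-ocropy-binarize"], "ocrd-foo-bar",
                           {"path_to_mets": "/data/mets.xml"})
connection.close()
```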

And, honestly, I don't want to go in that direction by myself without having the required changes in the spec, so that other problems that may arise are identified and discussed up front, instead of trying to find quick hacky fixes on the fly afterwards.

@bertsky:

However, the tricky part: if, during run-time, a specific worker for some processor is started that was not listed in the config file, so the processing server is not aware of it, the requests for that processor will still go to the ocrd-default queue.

Indeed! So here the two strategies would clash.

And, honestly, I don't want to go in that direction by myself without having the required changes in the spec, so that other problems that may arise are identified and discussed up front, instead of trying to find quick hacky fixes on the fly afterwards.

Ok, agreed. So let's deliberately leave the gap wide open – so we have a good reason to discuss openly afterwards.

@MehmedGIT

We have not even entered the waters where the processing-worker can be both a worker and a REST server according to the most recent spec changes. Check my comment here.

@bertsky
Collaborator

bertsky commented Mar 29, 2023

Another recent twist in that discussion was the idea that the Processing Server can respond with HTTP 422 if there is no queue/worker for the requested executable in the current (dynamic) processor_list. (The caller could then either change their workflow to sidestep this executable, abort their workflow loudly, or ensure the missing processor gets started as a queue+worker, by asking the Processing Server to do that or by doing it externally and registering it with the Processing Server.)

@MehmedGIT
Contributor

MehmedGIT commented Mar 30, 2023

Another recent twist in that discussion was the idea that the Processing Server can respond with HTTP 422 if there is no queue/worker for the requested executable in the current (dynamic) processor_list

This is already implemented in #1030. To be precise, the Processing Server only checks whether the queue exists, and returns HTTP 422 if it does not.
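For illustration, a minimal sketch of such a check, assuming a FastAPI-based server; the route, request model, and queue bookkeeping are simplified assumptions, not the actual #1030 code.

```python
# Illustrative sketch (not the actual #1030 implementation): reject processing
# requests whose executable has no message queue with HTTP 422.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
# in the real server this would reflect the queues created from the configuration
existing_queues = {"ocrd-cis-ocropy-binarize", "ocrd-tesserocr-recognize"}

class ProcessingRequest(BaseModel):
    path_to_mets: str
    input_file_grps: list[str] = []
    output_file_grps: list[str] = []

@app.post("/processor/{executable}")
async def run_processor(executable: str, request: ProcessingRequest):
    if executable not in existing_queues:
        raise HTTPException(
            status_code=422,
            detail=f"No message queue exists for executable '{executable}'",
        )
    # ... otherwise wrap the request into a processing message and publish it
    return {"state": "QUEUED", "processor_name": executable}
```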

@bertsky
Collaborator

bertsky commented May 9, 2023

So fallback queue, dead-letter exchange, these were off the table IIRC?

I guess we still need a separate issue for the new endpoint for Processing Worker registration (dynamic processor_list)?

@MehmedGIT
Contributor

So fallback queue, dead-letter exchange, these were off the table IIRC?

Yes. I guess the fallback queue won't be needed anymore with the dynamic registration.

Not sure if a separate issue is needed. Dynamic registration through an endpoint of the Processing Server was planned today as the next thing to implement. So that's already on our radar.
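For illustration, a sketch of what such a registration endpoint might look like (route, payload, and broker address are assumptions, not the planned implementation): an externally started Processing Worker announces its executable, and the Processing Server declares the matching queue and extends its processor_list.

```python
# Illustrative sketch of dynamic Processing Worker registration (not the
# planned implementation; route and payload are assumptions).
import pika
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
processor_list = set()   # queues the Processing Server currently knows about

class WorkerRegistration(BaseModel):
    processor_name: str  # e.g. "ocrd-foo-bar"

@app.post("/processing_worker")
async def register_processing_worker(registration: WorkerRegistration):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    # queue_declare is idempotent, so registering the same worker twice is harmless
    channel.queue_declare(queue=registration.processor_name, durable=True)
    connection.close()
    processor_list.add(registration.processor_name)
    return {"registered": registration.processor_name}
```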
