-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Gymnasium 1.0 compatibility #2473
base: gh/vmoens/31/base
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2473
Note: Links to docs will display an error until the docs builds have been completed. ❌ 5 New Failures, 13 Unrelated FailuresAs of commit 33c429b with merge base fac1f7a (): NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
ghstack-source-id: 305b15271bd5b0cd7d9b55a9f8b2079bbc40950f Pull Request resolved: #2473
@pseudo-rnd-thoughts @jkterry1 I would need some guidance here for the partial resets. If I understand correctly, now if an env is done, the
I'm trying to think about how I could adapt I cannot think of any way to make Other more metaphysical questions:
Thanks! |
ghstack-source-id: eed26cf693a936c60ee7607ce6b8e5fd6b994953 Pull Request resolved: #2473
@vmoens Thanks for your message, I'll try to answer the best I can
The action is discarded or "has no effect"
This adds an extra step to vector environments when resetting so this cost is dependent on the reset frequency. Yes, this is a loss of resources, however, in my opinion, very minimally.
The reward will be zero, termination and truncation is False.
Possibly, if you reset every step, this reduces your practical budget to
You are correct, but the boiler plate is assuming that users have a single environment buffer thus each sub-environment data must be extracted. For vectorised environment buffers, this control is unnecessary but it is critical that you are not using the end of one episode and the begining of another in your training data (or it must be masked out). @RedTachyon and I are planning on releasing a blog post about this more because we understand the complexity of it for projects to adapt |
What about an approach that does not need a blog post to adapt? :D |
@pseudo-rnd-thoughts you don't have partial resets right? Smth like |
@pseudo-rnd-thoughts I'll drop here my 2 cents as the creator of VMAS which is a vectorized env It seems to me that the new API for autoresets in
I think the impact of this is bigger than we can imagine. I work with batches of hunderds of thousands of envs on GPU and it is very probable that in one step at least one env will be done. I don't know why this is the route that was taken, but I don't think it will be sustainable. The solution is really simple in my opinion and it is to keep |
Since it's an API that is also used by envpool and others we could think of a way to circumvent this when calling step_and_maybe_reset but as of now I think the most immediate solution is going to be to tell people not to use gymnasium 1.0 with torchrl until a fix is found... I agree with Matteo that having a method call To unblock this, I can give a shot at building a wrapper that does some sort of partial reset maybe? Thoughts? |
@vmoens and @matteobettini Thanks for your input, I'm happy to have a zoom call to work out a solution if that helps. To clarify, the previous autoreset (reset within the same step) is still compatible with
Normally a vector environment from Gymnasium, you only call reset at the beginning with users calling @matteobettini I admittedly didn't test this across a wide range of environments, but I didn't expect that adding an additional call to a policy would cause significant problems as most environments take a few hundred steps at minimum before termination, let alone truncation. Meaning that I expected this to cause only a minimal impact on performance.
The problem as I see it is that an autoreset must happen under the hood to reset sub-environments, either within the same step as a termination / truncation or on the next step. |
Thanks for the reply, I am available if we set up a call
Yes I think the core problem discussed here is that doing it in the same step is a better solution as distributing it across 2 steps has the problems highlighted above. In your example of the 200 steps. If you have 200 enviornomnets terminating in 200 steps and they are perfectly out of sync, you will be doing a reset at every step. Imagine with more enviornments... 200 is nothing compared to 200_000 More in general regarding autoreset, I am a bit constrained by it. I understand that in gym this is how things are done currently, but for example in my use case I have a VMAS env (which has no autoreset) and I would like to wrap it with a gymnasium VectorEnv and I can't because gymansium assumes that everyone will use autoresets, while I would prefer the general API not to prescribe if you do autoresets or not (and thus provide a reset function that takes an optional mask). |
@pseudo-rnd-thoughts is there any way one could implement a |
The vector API doesn't require autoreset to be implemented but it has been designed with this in mind. |
To get back to the issue at hand, here is a recap: It seems like in torchrl we are not able to support resets happening over 2 step calls. So in order to work with the new gymnasium VectorEnv we would need either:
@pseudo-rnd-thoughts is there any way we can make either happen? (my fav would be the first) |
I think I spotted another criticality related to this. The API is assuming that it possible to step the sub envs independently (which is not true in batched simulators as it would defeat their purpose) Here is the example:
So for all environemnts that cannot step subenvs independently the new API cannot work |
I might address some other points later (discussing with Vincent "offline" in parallel), but one thing I want a sanity check on regarding the last message:
Wouldn't the same be true for the old API? (which also featured an autoreset, just shifted by one step) |
No, the old API should be fine, because it did (correct me if i am wrong): When step is called:
So it always has a global step I would hardly call the new API a "shift", I think its implications are quite ramified |
@matteobettini I don't know how your problem is implemented but can you not pass a no-op like action to the simulator? |
I am not talking about my problem here, I am talking about any batched simulation framework (isaac, vmas,brax...) or any batched policy, where the core point of batching and vectorization in RL is that the functions What the new API implies is that you will have to compute This is a major problem because this API is essentially eliminating all the benefits of batching and vectorization
2 is only visible when using batched simulators (e.g., isaac, brax, vmas; assuming they even support partial steps), but 1 will be visible in any training library based on this API, independently of the simulator |
I'd really, really like to ask people to turn down the temperature and stop overexaggerating.
This is just not true. I know this for a fact, because I'm working on a project that uses the new API, that benefits from batching and vectorization. Beyond that, Gymnasium ships with some hardware-accelerated Jax-based environments. There's probably a more targeted argument that you can make - in which, case, please make that argument instead of saying false things, because there's no point for us to engage with that. |
I did not want to step up the tone, sorry for that, I am just trying to bring this issue to light. Let me try to prove that there is a case where you would loose all benefits of batching with a worst-case scenario study. Assume you have The worst case scenario happens when all envrionments are done except 1 . You would then call the policy with N obs, but then N-1 are discarded. Then your step function would call the physical step with 1 action and N-1 no-op actions. This is why i believe my statement is true. This is because the reset happens after the policy and before the step. In the previous API, the reset happened ater the step but before the policy, making all calls to the physical step and the policy use N valid inputs. I know that this is a worst-case scenario, but computer science often studies complexity using worst-case analysis. I understand that empirically you use jax and are doing fine, but I am just trying to understand the theoretical implications here. |
Maybe I'm too much of an empiricist, but to me, this worst-case analysis should be a minimal problem in practice, so I didn't take it much into consideration. If you find this significantly increases compute time, please let me know.
@vmoens Ironically, the opposite is true (to my knowledge) with the no-op action for terminated / truncated states; then 1 million steps is an upper bound, not the lower bound. In reality, they are probably taking fewer steps than that of the actual environments unless they try to be fancy then yes, they could mess it up but that is their fault not gymnasium's imho. |
@RedTachyon I agree let's lower the tone! At the end of the day what we want it to support the community better. Let me first try to explain why it is so difficult for us to adapt to these changes: Concretely speaking, this change is impossible to incorporate in torchrl without some heavy, painful and probably bc-breaking refactoring. To put it simply: Our goal is to support the RL community, whether they use gym or something else and here the cost seems too high to pay for the general community. I understand gym wants to be the one default backend for simulation in RL and related fields, but from where we stand we can't ignore other frameworks. To understand why we don't encounter the problem of reset vs final obs in torchrl, I invite you to read about torchrl format here. I hope this will help you understand where we come from! To move forward:
|
I believe it could show up in practical examples. Imagine a self-driving simulator where you have 4 actions for driving and one for a hand-break. Probably, pulling the hand-break at the wrong time would lead to a crash. I believe that a scenario like this would come close to what I outlined as a lot of sub envs would be frequently done during exploration and you would see empirically the difference. I guess the point I am making is that the cost of the previous API did not depend on the task, and now it does, with the worst case being what i outlined and a scenario like the above empirically approaching that |
Let me rephrase and expand a bit on the accuracy point. Here's another simple example of why I think this is an issue: CartPole (arguably the most popular and simple env ever) has a reset frequency tightly linked to the perf of the algo. With gym 1.0, algo 1 will have fewer total steps so probably an even poorer perf. The gap between 1 and 2 will be bigger than it should. Note that the effect is opposite if the env resets more frequently when it's solved (imagine getting out of a maze the fastest possible => more success = more resets = less training data). Does that make sense?
Any thoughts on this? |
@vmoens I've spent some time investigating adding other autoreset APIs to Gymnasium's built-in environments: Farama-Foundation/Gymnasium#1227. This would allow users to specify the autoreset mode of vector environments and select the autoreset mode for the built-in Sync and Async vector environments. Please test this branch with TorchRL and let me know if there are any problems with it or if there are changes to the API that you would like implemented.
This solution should solve this question.
This has always been true. We are always interested in new proposals and contributions from the community. I want to note that this API change was a key part of an alpha release from February of this year (https://github.com/Farama-Foundation/Gymnasium/releases/tag/v1.0.0a1), 8 months ago. This was not a new change. |
@pseudo-rnd-thoughts amazing, looking forward to this! |
@pseudo-rnd-thoughts @RedTachyon @vmoens @matteobettini As a member of both the Gymnasium and TorchRL communities, I want to thank all of you for working together on this. It’s great to see this kind of collaboration. With some environments and libraries starting to restrict dependencies to Keep up the good work! |
Stack from ghstack (oldest at bottom):