Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch dump and batch update of old stages #10630

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bric-afisher
Copy link

@bric-afisher bric-afisher commented Nov 23, 2024

Description

Rather than running stage.dump separately for each stage, which involves loading the dvc.lock file many times, this proposed code groups stages together by their dvcfiles and does a dvcfile.batch_dump (see the stages_by_dvcfile object).

Also, rather than getting the old version of each stage within get_versioned_outs, this proposed code gets all old versions in one step, stores them in a old_stages object, and then passes each old stage to stage.save. This similarly removes a step where the lock file is reloaded once per stage.

My understanding is that these changes improve the speed of dvc commit when the dvc.lock file is very large and slow to load, without causing any changes or problems regarding the function's behavior. That said, I am not 100%, especially when it comes to the get_versioned_outs part which I have less context for. I would appreciate feedback about what this could be missing!

Related issues

Fixes #10629, although it seems like it would be even better to have separate lock files.

Checklist

  • ❗❗ (WITH CAVEAT BELOW!!!) ❗❗ I have followed the Contributing to DVC checklist.
    • I am not a python expert so I apologize if I missed something! I did not add a test for this change because I did not see an obvious unit test to update. However, I did run this version with the reproducible example in dvc commit is slow when there are many stages #10629 and saw that it reduced commit time as described there.
    • I ran pytest in the the development environment and there were 6 tests that failed that seem to be related to some kind of API. These tests also failed when I ran pytest while on the main branch without any changes. So it seems like there was something about the developer environment that I did not set up correctly, rather than these being caused by the changes I made. I'm not sure though. Might you have any insight as to why these fail?
FAILED tests/func/test_daemon.py::test_analytics - subprocess.CalledProcessError: Command '['/home/fishea10/dvc/venv/bin/python3', '/home/fishea10/dvc/dvc/__main__.py', 'config', '-l', '-vv']' returned non-zero exit status 1.

FAILED tests/func/test_daemon.py::test_updater - subprocess.CalledProcessError: Command '['/home/fishea10/dvc/venv/bin/python3', '/home/fishea10/dvc/dvc/__main__.py', 'version', '-vv']' returned non-zero exit status 1.

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[None-False-True] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[None-False-False] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[DVC_EXP_GIT_REMOTE-False-True] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'

FAILED tests/integration/test_studio_live_experiments.py::test_post_to_studio[DVC_EXP_GIT_REMOTE-False-False] - AssertionError: assert 'mytoken' == 'STUDIO_TOKEN'
  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
    • NA. There is no change from the user's perspective.

Copy link

codecov bot commented Nov 23, 2024

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.00%. Comparing base (2431ec6) to head (aedc2fa).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
dvc/dvcfile.py 85.18% 1 Missing and 3 partials ⚠️
dvc/repo/commit.py 91.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10630      +/-   ##
==========================================
+ Coverage   90.68%   91.00%   +0.32%     
==========================================
  Files         504      504              
  Lines       39795    39835      +40     
  Branches     3141     3154      +13     
==========================================
+ Hits        36087    36253     +166     
+ Misses       3042     2949      -93     
+ Partials      666      633      -33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@shcheklein
Copy link
Member

@skshetry could you PTAL?

@skshetry
Copy link
Member

Hi, thank you for the reproducer in #10629.

I profiled the dvc commit -f from the reproducer and got the following results:

  • 64.6% spent on dump().
  • 34% time spent on get_versioned_outs()

That is a lot of time spent reading and writing the DVC files. We should optimize this.

I see you have attempted to fix both issues in the PR. I'd suggest splitting it into two separate PRs, starting with batching dump() as it's where most of the time is being spent.

Regarding dump()

My main concern with batching is that commit updates won't be immediately reflected in the DVC files, delaying the experience for VS Code users and for other commands like dvc data status. However, I think the performance improvements are worth it (LMK wyt @shcheklein).

I also reviewed other commands to see if they might benefit from batch dumping but found none. While dvc repro could be a candidate, it’s designed to update DVC files as soon as possible.

That said, the PR looks a bit complex, but I am fine with it if it exclusively optimizes dvc commit. This also reveals a bigger issue: we lack proper abstractions for writing stages (and/or handling stages in general).

Regarding get_versioned_outs()

This is a known issue (see this discussion and #8873). There was previously a merge_versioned flag - reintroducing it might be a better solution than passing old_stage.
(We should also think of getting rid of cloud versioning).

However, I’d suggest handling this in a separate PR.

@bric-afisher
Copy link
Author

bric-afisher commented Nov 28, 2024

Thanks so much for your time and comments!

Regarding dump()

My main concern with batching is that commit updates won't be immediately reflected in the DVC files, delaying the experience for VS Code users and for other commands like dvc data status.

I'm not sure if I follow, but perhaps this related to the use case my group has:

We have been parsing the dvc DAG to find stages of our dvc.yaml that can be run in parallel. In each parallel job, we run the command in the stage's cmd field directly. After all commands have finished, we run dvc commit.

Alternatively, instead of using running the cmd we could do dvc repro -s stage_name so that the lock file is updated in parallel, which would appear to update the lock file "as fast as possible." However, this can result in crashes if two jobs finish at almost the same time and then try to update the lock file at the same time. When the number of stages run in parallel is large, and the lock file is large and slow to update, this problem can occur fairly regularly.

We also considered doing a try-wait cycle for updating the lock file in parallel, but this essentially brings us back to the problem in #10629 where the lock file is updated sequentially, and the time taken to update the lock file becomes longer than the time required to run the stages (even in parallel). One might expect batch updates would result in a lag time during which the outs are out of sync with the lock file, but in our scenario the lag time associated with sequential updates is actually larger.

Is this the scenario you're referring to? Or is there a different sequence of commands or actions that would result in a problem? (I'm a VS Code user myself btw.)

Regarding get_versioned_outs()

This is a known issue (see this discussion and #8873). There was previously a merge_versioned flag - reintroducing it might be a better solution than passing old_stage. (We should also think of getting rid of cloud versioning).

I did see PR #8873 but there and in general I was having trouble figuring out what the goal of merge_versioned and meta versioning is. Is there a place where this is documented? Or, is it not something that users are expected to interact with?

@skshetry
Copy link
Member

skshetry commented Dec 2, 2024

Is this the scenario you're referring to? Or is there a different sequence of commands or actions that would result in a problem? (I'm a VS Code user myself btw.)

In case of dvc repro, we allow running parallel branches of pipelines in parallel.
So, we want to write the result of a given stage as soon as it is completed.

See https://dvc.org/doc/command-reference/repro#parallel-stage-execution.

(You may get Unable to acquire lock errors, dvc repro only unlocks the repo lock when it actually starts executing the cmd, and re-locks as soon as it is done. So you may have to wait, or retry. This makes it really hard to use this feature).

For dvc commit, my point was that the commit for each stage may take a substantial amount of time. If you are using VSCode with DVC extension for example, you would not notice that something was committed on the DVC file tree unless dvc commit succeeds.

Regarding get_versioned_outs()

This is a known issue (see this discussion and #8873). There was previously a merge_versioned flag - reintroducing it might be a better solution than passing old_stage. (We should also think of getting rid of cloud versioning).

I did see PR #8873 but there and in general I was having trouble figuring out what the goal of merge_versioned and meta versioning is. Is there a place where this is documented? Or, is it not something that users are expected to interact with?

This versioning stuff is from Cloud Versioning feature. In this feature, we keep metadata of every single file inside dvc.lock file instead of using a .dir file. So, when we need to update a file in a tracked directory, we need to merge existing data from dvc.lock and update it.

@skshetry
Copy link
Member

@bric-afisher, any updates? Do you still plan to work on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc commit is slow when there are many stages
3 participants