DAOS-16362: pydaos.torch checkpointing #15691
base: master
Errors are component not formatted correctly, Ticket number suffix is not a number. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/1/execution/node/133/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/2/execution/node/134/log
91c5ab5 to 9edb95b
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/3/execution/node/134/log
9edb95b to 60a4612
Signed-off-by: Denis Barakhtanov <[email protected]>
Signed-off-by: Denis Barakhtanov <[email protected]>
Calling daos_init on Python module import creates challenges for functional tests (and for some use cases) when the DAOS agent is not ready yet. A better solution is to try to initialize the DAOS library when the Dataset is created. Signed-off-by: Denis Barakhtanov <[email protected]>
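The change described in this commit message (deferring library initialization from import time to the first Dataset construction) can be sketched roughly as below. This is only an illustration of the pattern, not the pydaos.torch code: `_LazyRuntime`, `init_calls`, and this `Dataset` class are hypothetical stand-ins, and the real module calls into a C shim rather than a Python callback.

```python
import threading


class _LazyRuntime:
    """Hypothetical stand-in for a native library that needs one-time setup."""

    def __init__(self, init_fn):
        self._init_fn = init_fn
        self._lock = threading.Lock()
        self._ready = False

    def ensure_initialized(self):
        # Double-checked so concurrent callers run init_fn at most once.
        if self._ready:
            return
        with self._lock:
            if not self._ready:
                # If init_fn raises (e.g. the agent is not up yet), _ready
                # stays False and the next caller will try again.
                self._init_fn()
                self._ready = True


# Records initialization calls so the behaviour is observable in this sketch.
init_calls = []
_runtime = _LazyRuntime(lambda: init_calls.append("init"))


class Dataset:
    """Hypothetical dataset class; not the pydaos.torch implementation."""

    def __init__(self, pool, cont):
        # Initialization happens here, not when the module is imported.
        _runtime.ensure_initialized()
        self.pool = pool
        self.cont = cont
```

With this shape, importing the module has no side effects, and a failed initialization surfaces as an exception at the point where the Dataset is created, where the caller can handle or retry it.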
Signed-off-by: Denis Barakhtanov <[email protected]>
Signed-off-by: Denis Barakhtanov <[email protected]>
Signed-off-by: Denis Barakhtanov <[email protected]>
Signed-off-by: Denis Barakhtanov <[email protected]>
Ave linters ! Signed-off-by: Denis Barakhtanov <[email protected]>
60a4612 to e72b63d
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/4/execution/node/135/log
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15691/3/display/redirect
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/5/execution/node/135/log
Mostly nits. Overall looks good!
Since directory_tree.py was modified, we should run the test that uses it. You can do that by including this string in future commit messages:
Features: DfuseFind
the C shim part looks good to me. no changes requested; just some clarifications / comments added.
    assert(hdl->dfs != NULL);

    int rc = dfs_lookup(hdl->dfs, path, O_RDONLY, &obj, NULL, &st);
We can optimize this a bit: if the dir path is cached, we can just call dfs_stat. But for now this is fine.
Co-authored-by: Dalton Bohning <[email protected]> Signed-off-by: enakta <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/6/execution/node/133/log
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/7/execution/node/134/log
This allows better flexibility. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Functional tests for the pydaos.torch module are now available. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
No more negative error values. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/8/execution/node/134/log
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/9/execution/node/134/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/9/execution/node/1229/log
Checkpoint writes can now be done in chunks in parallel. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
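As a rough illustration of what "in chunks in parallel" means, the sketch below splits a buffer into fixed-size chunks and writes each one at its own offset from a thread pool. It is an assumption-laden stand-in: it targets a local POSIX file via `os.pwrite` rather than DAOS, and `write_in_chunks`, `chunk_size`, and `workers` are hypothetical names, not the parameters of the pydaos.torch checkpoint interface.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_in_chunks(path, data, chunk_size=4, workers=4):
    """Write `data` to `path` as fixed-size chunks issued in parallel.

    Returns the total number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)  # slicing a memoryview avoids copying
        offsets = range(0, len(view), chunk_size)

        def write_chunk(offset):
            chunk = view[offset:offset + chunk_size]
            written = 0
            # pwrite may write fewer bytes than requested, so loop.
            while written < len(chunk):
                written += os.pwrite(fd, chunk[written:], offset + written)
            return written

        # Each chunk goes to its own offset, so workers never overlap.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(write_chunk, offsets))
    finally:
        os.close(fd)


if __name__ == "__main__":
    payload = bytes(range(10))
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "ckpt.bin")
        write_in_chunks(target, payload)
        with open(target, "rb") as f:
            assert f.read() == payload
```

Because every chunk carries its own offset, the order in which workers finish does not affect the resulting file; the trade-off is that the whole payload must be addressable up front.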
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/10/execution/node/133/log
Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/11/execution/node/134/log
Did not take into account that the timeout is shared across all tests in the suite. Features: DfuseFind Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/12/execution/node/134/log
@mchaarawi I've added parameters passthrough for Could you please have another look?
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/12/execution/node/1184/log
Python bandit (security check) has some complaints. Since this is test code it's not a real concern, but we should get the check clean.
Co-authored-by: Dalton Bohning <[email protected]> Signed-off-by: enakta <[email protected]>
Features: DfuseFind,PytorchCheckpointTest,PytorchDatasetsTest Signed-off-by: Denis Barakhtanov <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15691/14/testReport/
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15691/14/execution/node/1140/log
    out:
    D_FREE(cp1);
    D_FREE(cp2);
D_FREE(*dir) ?
Introducing PyTorch checkpoint interface and user documentation for pydaos.torch module.
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: