Merge branch 'branch-25.02' into misc-packaging
jameslamb authored Jan 7, 2025
2 parents 4a7d932 + 8fbcbab commit c36a646
Showing 5 changed files with 473 additions and 8 deletions.
13 changes: 13 additions & 0 deletions .github/workflows/pr.yaml
@@ -13,6 +13,7 @@ jobs:
# Please keep pr-builder as the top job here
pr-builder:
needs:
- check-nightly-ci
- changed-files
- checks
- conda-cpp-build
@@ -42,6 +43,18 @@ jobs:
- name: Telemetry setup
if: ${{ vars.TELEMETRY_ENABLED == 'true' }}
uses: rapidsai/shared-actions/telemetry-dispatch-stash-base-env-vars@main
check-nightly-ci:
# Switch to ubuntu-latest once it defaults to a version of Ubuntu that
# provides at least Python 3.11 (see
# https://docs.python.org/3/library/datetime.html#datetime.date.fromisoformat)
runs-on: ubuntu-24.04
env:
RAPIDS_GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- name: Check if nightly CI is passing
uses: rapidsai/shared-actions/check_nightly_success/dispatch@main
with:
repo: cugraph
changed-files:
secrets: inherit
needs: telemetry-setup
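The Ubuntu-version comment in the `check-nightly-ci` job hinges on a Python 3.11 behavior change: before 3.11, `datetime.date.fromisoformat()` accepted only the exact `YYYY-MM-DD` form, while 3.11 extended it to most ISO 8601 date strings. A small version-guarded illustration:

```python
import sys
from datetime import date

# Plain YYYY-MM-DD parses on every supported Python version.
assert date.fromisoformat("2025-01-07") == date(2025, 1, 7)

# Compact ISO 8601 forms such as YYYYMMDD are accepted starting with
# Python 3.11; earlier interpreters raise ValueError for them.
if sys.version_info >= (3, 11):
    assert date.fromisoformat("20250107") == date(2025, 1, 7)
else:
    try:
        date.fromisoformat("20250107")
    except ValueError:
        pass  # expected on Python < 3.11
```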
16 changes: 8 additions & 8 deletions cpp/src/c_api/neighbor_sampling.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022-2024, NVIDIA CORPORATION.
* Copyright (c) 2022-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -880,7 +880,6 @@ struct neighbor_sampling_functor : public cugraph::c_api::abstract_functor {
handle_.get_stream());

std::optional<rmm::device_uvector<label_t>> start_vertex_labels{std::nullopt};
std::optional<rmm::device_uvector<label_t>> local_label_to_comm_rank{std::nullopt};
std::optional<rmm::device_uvector<label_t>> label_to_comm_rank{
std::nullopt}; // global after allgatherv

@@ -932,12 +931,13 @@ struct neighbor_sampling_functor : public cugraph::c_api::abstract_functor {
handle_.get_stream(),
raft::device_span<label_t>{unique_labels.data(), unique_labels.size()});

(*local_label_to_comm_rank).resize(num_unique_labels, handle_.get_stream());
rmm::device_uvector<label_t> local_label_to_comm_rank(num_unique_labels,
handle_.get_stream());

cugraph::detail::scalar_fill(
handle_.get_stream(),
(*local_label_to_comm_rank).begin(), // This should be rename to rank
(*local_label_to_comm_rank).size(),
local_label_to_comm_rank.begin(),  // This should be renamed to rank
local_label_to_comm_rank.size(),
label_t{handle_.get_comms().get_rank()});

// Perform allgather to get global_label_to_comm_rank_d_vector
@@ -948,11 +948,11 @@ struct neighbor_sampling_functor : public cugraph::c_api::abstract_functor {
std::exclusive_scan(
recvcounts.begin(), recvcounts.end(), displacements.begin(), size_t{0});

(*label_to_comm_rank)
.resize(displacements.back() + recvcounts.back(), handle_.get_stream());
label_to_comm_rank = rmm::device_uvector<label_t>(
displacements.back() + recvcounts.back(), handle_.get_stream());

cugraph::device_allgatherv(handle_.get_comms(),
(*local_label_to_comm_rank).begin(),
local_label_to_comm_rank.begin(),
(*label_to_comm_rank).begin(),
recvcounts,
displacements,
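The `std::exclusive_scan` over `recvcounts` in the hunk above computes per-rank displacements into the gathered buffer, and the buffer's total size is `displacements.back() + recvcounts.back()`. The same arithmetic, sketched in Python with illustrative counts:

```python
from itertools import accumulate

# recvcounts[i] = number of labels contributed by rank i (illustrative values).
recvcounts = [3, 1, 4, 2]

# Exclusive scan: the displacement of rank i is the sum of counts of ranks < i.
displacements = [0] + list(accumulate(recvcounts))[:-1]
assert displacements == [0, 3, 4, 8]

# Total size of the allgathered buffer, matching the diff's
# displacements.back() + recvcounts.back().
total = displacements[-1] + recvcounts[-1]
assert total == 10
```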
55 changes: 55 additions & 0 deletions scripts/dask/README.md
@@ -0,0 +1,55 @@
# Dask scripts for multi-GPU environments

This directory contains tools, currently shell and Python scripts, for
configuring single-node or multi-node, multi-GPU (SNMG or MNMG) environments
for Dask-based cugraph runs.

Users should also consult the multi-GPU utilities in the
`python/cugraph/cugraph/testing/mg_utils.py` module, specifically the
`start_dask_client()` function, to see how to create `client` and `cluster`
instances in Python code to access the corresponding Dask processes created by
the tools here.
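As a hedged sketch of that usage (the exact signatures live in `python/cugraph/cugraph/testing/mg_utils.py`; the helper names `start_dask_client`/`stop_dask_client` and their return values should be confirmed there before relying on this):

```python
def run_with_dask_client():
    """Connect to the Dask processes started by the scripts in this
    directory, run work through the client, then tear everything down.

    Sketch only: imports are deferred so the function can be defined
    without cugraph installed.
    """
    from cugraph.testing.mg_utils import start_dask_client, stop_dask_client

    client, cluster = start_dask_client()
    try:
        # ... run multi-GPU cugraph algorithms through `client` ...
        pass
    finally:
        stop_dask_client(client, cluster)
```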


### run-dask-process.sh

This script is used to start the Dask scheduler and workers as needed.

To start a scheduler and workers on a node, run it like this:
```
bash$ run-dask-process.sh scheduler workers
```
Once a scheduler is running on a node in the cluster, workers can be started
on the remaining nodes by running the script on each worker node like this:
```
bash$ run-dask-process.sh workers
```
The env var SCHEDULER_FILE must be set to the location where the scheduler
will write its JSON file; the workers read the same env var to locate that
file.

The script will ensure the scheduler is started before the workers when both
are specified.
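For example, setting up the shared scheduler file before launching (the path below is illustrative; use a location visible to every node):

```shell
# On every node, point the script at the same shared scheduler file:
export SCHEDULER_FILE=/shared/dask-scheduler.json

# On the scheduler node:
#   bash run-dask-process.sh scheduler workers
# On each additional worker node:
#   bash run-dask-process.sh workers
```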

Additional options can be specified for using different communication
mechanisms:
```
--tcp              - initialize a TCP cluster (default)
--ucx - initialize a UCX cluster with NVLink
--ucxib | --ucx-ib - initialize a UCX cluster with InfiniBand+NVLink
```
Finally, the script can be run with `-h` or `--help` to see the full set of
options.

### wait_for_workers.py

This script can be used to ensure all workers that are expected to be present
in the cluster are up and running. This is useful for automation that sets up
the Dask cluster and cannot proceed until the Dask cluster is available
to accept tasks.

This example waits for 16 workers to be present:
```
bash$ python wait_for_workers.py --scheduler-file-path=$SCHEDULER_FILE --num-expected-workers=16
```
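The underlying pattern is a poll-until-ready loop. This is not the script's actual implementation (which queries the Dask scheduler via the scheduler file); it is a generic sketch with the worker-count probe passed in as a callable:

```python
import time

def wait_for_workers(get_worker_count, expected, timeout_s=60.0, poll_s=1.0):
    """Poll get_worker_count() until it reports at least `expected`
    workers or `timeout_s` elapses. Returns True on success."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_worker_count() >= expected:
            return True
        time.sleep(poll_s)
    return False
```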
