Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM error raised after subsequent calls to plc.bfs with different SGGraphs #4067

Closed
rlratzel opened this issue Dec 21, 2023 · 3 comments · Fixed by #4077
Closed

OOM error raised after subsequent calls to plc.bfs with different SGGraphs #4067

rlratzel opened this issue Dec 21, 2023 · 3 comments · Fixed by #4077
Assignees
Labels
bug Something isn't working

Comments

@rlratzel
Copy link
Contributor

rlratzel commented Dec 21, 2023

The following code produces the error shown below. The graph is too small to result in a legitimate OOM error, so a bug is likely causing something in libcugraph to allocate or attempt to allocate more memory than is available.

import pylibcugraph as plc
import cupy as cp

edges = [
    (0,2), (0,4), (0,5), (0,6), (0,7), (0,8), (0,10), (0,11), (0,12), (0,13),
    (0,17), (0,19), (0,21), (0,31), (2,7), (2,8), (2,9), (2,13), (2,27), (2,28),
    (2,32), (4,6), (4,10), (5,6), (5,10), (5,16), (6,16), (8,30), (8,32),
    (8,33), (9,33), (13,33), (14,32), (14,33), (15,32), (15,33), (18,32),
    (18,33), (19,33), (20,32), (20,33), (22,32), (22,33), (23,25), (23,27),
    (23,29), (23,32), (23,33), (24,25), (24,27), (24,31), (25,31), (26,29),
    (26,33), (27,33), (28,31), (28,33), (29,32), (29,33), (30,32), (30,33),
    (31,32), (31,33), (32,33),
]
edges += [(target, source) for (source, target) in edges]

source=0
src_indices=cp.array([e[0] for e in edges], dtype="int32")
dst_indices=cp.array([e[1] for e in edges], dtype="int32")
G=plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(
        is_multigraph=False,
        is_symmetric=False,
    ),
    src_or_offset_array=src_indices,
    dst_or_index_array=dst_indices,
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)

distances, predecessors, node_ids = plc.bfs(
    handle=plc.ResourceHandle(),
    graph=G,
    sources=cp.array([source], "int32"),
    direction_optimizing=False,
    depth_limit=-1,
    compute_predecessors=False,
    do_expensive_check=False,
)
MAXINT=2147483647
mask = distances != MAXINT
it = zip(node_ids[mask].tolist(), distances[mask].tolist())

del G

source=1
src_indices=cp.array([], dtype="int32")
dst_indices=cp.array([], dtype="int32")

G=plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(
        is_multigraph=False,
        is_symmetric=False,
    ),
    src_or_offset_array=src_indices,
    dst_or_index_array=dst_indices,
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)

distances, predecessors, node_ids = plc.bfs(
    handle=plc.ResourceHandle(),
    graph=G,
    sources=cp.array([source], "int32"),
    direction_optimizing=False, 
    depth_limit=1,
    compute_predecessors=False,
    do_expensive_check=False,
)
(base) root@98e76e671240:/Projects/cugraph# python oom_repro.py
Traceback (most recent call last):
  File "/Projects/cugraph/oom_repro.py", line 66, in <module>
    distances, predecessors, node_ids = plc.bfs(
  File "bfs.pyx", line 181, in pylibcugraph.bfs.bfs
  File "utils.pyx", line 53, in pylibcugraph.utils.assert_success
RuntimeError: non-success value returned from cugraph_bfs: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

This was initially discovered in #4029

@rlratzel rlratzel added the bug Something isn't working label Dec 21, 2023
@rlratzel rlratzel self-assigned this Dec 21, 2023
@rlratzel rlratzel changed the title OOM error raised when calling NX algos with nx-cugraph installed OOM error raised after subsequent calls to plc.bfs with different SGGraphs Jan 3, 2024
@rlratzel rlratzel changed the title OOM error raised after subsequent calls to plc.bfs with different SGGraphs OOM error raised after subsequent calls to plc.bfs with different SGGraphs Jan 3, 2024
@ChuckHastings
Copy link
Collaborator

Did some analysis here. Here's what I found:

  1. I believe the fundamental problem is this. For your second graph, you define a graph with no edges (and therefore no vertices), but you request a BFS from vertex id 1. There is no vertex id 1, but we don't check that boundary condition, so we attempt to perform a BFS from a non-existent vertex. This causes us to go out of bounds on an array reference.
  2. In my test runs as you described above, this out of bounds triggers the OOM failure, because we read the out of bounds memory and compute a completely invalid amount of memory.
  3. If I skip the first graph call and just make the second graph call, I get a different error (which is what actually led me to discover the out of bounds computation.
  4. If I set do_expensive_check to True in the BFS call it will return the fundamental error (that we're passing in bad parameters).

Can we define the desired behavior in this situation? If the user creates a graph (whether empty or not) and executes BFS with a starting vertex that is not actually a vertex in the graph, what is the desired behavior? Return an error indicating bad input parameters? Silently fail, returning the result of running BFS from no vertex?

@rlratzel
Copy link
Contributor Author

rlratzel commented Jan 3, 2024

Thanks, Chuck. Here's what I noticed now:

1 & 2 make sense, but when I run the script with the entire first call and setup code commented out (everything from del G and above), which I think is what you did for 3, the script silently passes and I see no error. However, if I set do_expensive_check as you did, I see an error as you described in 4.

Given that, it seems like there could be a few issues:

  1. If we think the input is invalid (seems like that's how it should be classified), then we should raise/return an error. For reference, NetworkX does this in the same situation:
>>> list(nx.bfs_edges(G,333))
Traceback (most recent call last):
  File "/Projects/networkx/networkx/classes/graph.py", line 1354, in neighbors
    return iter(self._adj[n])
KeyError: 333

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Projects/networkx/networkx/algorithms/traversal/breadth_first_search.py", line 218, in bfs_edges
    yield from generic_bfs_edges(G, source, successors, depth_limit)
  File "/Projects/networkx/networkx/algorithms/traversal/breadth_first_search.py", line 118, in generic_bfs_edges
    next_parents_children = [(source, neighbors(source))]
  File "/Projects/networkx/networkx/classes/graph.py", line 1356, in neighbors
    raise NetworkXError(f"The node {n} is not in the graph.") from err
networkx.exception.NetworkXError: The node 333 is not in the graph.

so raising/returning an error seems reasonable to me.

  1. The out-of-bounds read obviously shouldn't happen. If this is only possible with an invalid source vertex and we're handling that now by checking and raising, then we've covered this one too. However, if there's another situation we can think of then perhaps there's some additional guardrails we should also add?
  2. Any idea why I'm not seeing an error of any kind for your 3rd point above? Is that something else that needs to be looked at?

@ChuckHastings
Copy link
Collaborator

This particular OOB read is because of the invalid source vertex (which we never check). I can't think of other bad input cases that we're not checking... although I am admittedly bad at finding these a priori. I think adding this check should be sufficient. Correct input data should not result in an OOB error like this.

OOB errors are very non-deterministic in their behavior. It's one of the things that makes them a challenge to isolate and debug. For this particular code - which is fairly straightforward, it's probably a function of what GPU model, device driver and CUDA runtime versions that you're running. Fundamentally, in my run when I access the OOB element I'm reading a -1 - whatever was in the memory previously, and when I try to access whatever_array[-1] it's referencing an illegal location. In your example run you may be reading a different value (perhaps not even OOB), and then accessing some legal location (but wrong). Because the graph is empty it's likely only one OOB memory reference that doesn't even get used after the small block of code it's in. As long as the subtraction yields a positive value that's small enough to be allocated in GPU memory it will not fail, nor will it generate a bad result... because there is no result. If you did the same thing with a real graph you probably increase your chance of a bad outcome (a different failure or bogus result).

I will add the fixes to the C API for BFS to detect this error.

rapids-bot bot pushed a commit that referenced this issue Jan 12, 2024
…valid vertex (#4077)

* Added error check to be sure that the source vertex is a valid vertex in the graph.
* Updated `nx_cugraph.Graph` class to create PLC graphs using `vertices_array` in order to include isolated vertices. This is now needed since the error check added in this PR prevents NetworkX tests from passing if isolated vertices are treated as invalid, so this change prevents that.
* This resolves the problem that required the test workarounds done [here](#4029 (comment)) in [4029](#4029), so those workarounds have been removed in this PR.

Closes #4067

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Ray Douglass (https://github.com/raydouglass)
  - Erik Welch (https://github.com/eriknw)

URL: #4077
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants