[QST] Error on invalid device ordinal and cuda_memory_resource #4806

Open
yigithanyigit opened this issue Dec 5, 2024 · 6 comments

@yigithanyigit

yigithanyigit commented Dec 5, 2024

Hello!

I am not sure this is the correct place to ask this question.

I am getting an error like this:

Thrust exception: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
CUDA Error detected. cudaErrorInvalidValue invalid argument
louvain_example: /home/yigithan/miniconda3/envs/cugraph_dev/include/rmm/mr/device/cuda_memory_resource.hpp:80: virtual void rmm::mr::cuda_memory_resource::do_deallocate(void*, std::size_t, rmm::cuda_stream_view): Assertion `status__ == cudaSuccess' failed.

We are currently developing a project on top of cuGraph, specifically its Louvain implementation. On my colleague's PC, the examples and tests I mention below work perfectly fine. We both installed from the same repo and commit, with the same CUDA version and the same OS.

The examples that I tried are:

https://github.com/yigithanyigit/cugraph/blob/branch-24.12/cpp/tests/community/louvain_test.cpp

https://github.com/yigithanyigit/cugraph-template/blob/main/src/louvain.cu

Short description of the problem:

With a small dataset like karate there is no problem: it starts and finishes successfully. But with big datasets (ca-hollywood-2009, soc-livejournal), it initializes, runs for about 30-40 seconds, and then crashes (probably at the deallocation stage).

I also ran it under compute-sanitizer and got these results:

Program hit cudaErrorLaunchOutOfResources (error 701) due to "too many resources requested for launch" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4466f5]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaLaunchKernel_ptsz [0x547fd]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel in /home/yigithan/miniconda3/envs/cugraph_dev/targets/x86_64-linux/include/cuda_runtime_api.h:14030 [0xe5ecef1]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
=========     Host Frame:_ZL736__device_stub__ZN7cugraph6detail35per_v_transform_reduce_e_mid_degreeILb1ENS_12graph_view_tIiiLb0ELb0EvEENS0_52edge_partition_endpoint_dummy_property_device_view_tIiEES5_NS0_42edge_partition_edge_property_device_view_tIiPKffEENS6_IiPKjbEEPfZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEEN3rmm14device_uvectorIT2_EERKN4raft8handle_tERKNS2_IT0_T1_XT3_EXT4_EvEENS_20edge_property_view_tISP_PKSI_N6thrust15iterator_traitsISV_E10value_typeEEEEUnvdl0_PFNSH_IfEESN_RKS3_NST_IiS8_fEEESF_ILb1EiifLb0ELb0EE2_NS_9reduce_op4plusIfEEfEEvNS_28edge_partition_device_view_tINSO_11vertex_typeENSO_9edge_typeEXsrSO_12is_multi_gpuEvEES1C_S1C_SP_SI_T3_NSW_8optionalIT4_EET5_T6_T8_S1L_T7_RN7cugraph28edge_partition_device_view_tIiiLb0EvEEiiRNS_6detail52edge_partition_endpoint_dummy_property_device_view_tIiEES6_RNS3_42edge_partition_edge_property_device_view_tIiPKffEERN6thrust8optionalINS7_IiPKjbEEEEPfR17__nv_dl_wrapper_tI11__nv_dl_tagIPFN3rmm14device_uvectorIfEERKN4raft8handle_tERKNS_12graph_view_tIiiLb0ELb0EvEENS_20edge_property_view_tIiS9_fEEEXadL_ZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEENSN_IT2_EESS_RKNST_IT0_T1_XT3_EXT4_EvEENSX_IS16_PKS13_NSC_15iterator_traitsIS1B_E10value_typeEEEEELj2EEJEEffRNS_9reduce_op4plusIfEE in /tmp/tmpxft_00006475_00000000-6_graph_weight_utils_sg_v32_e32.cudafe1.stub.c:233 [0xe5ee3f6]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
.
.
.
.

CUDA version: 12.4
Compute capability: 8.6 (sm_86)
Device: RTX 3090
OS: Ubuntu 22.04 LTS
Driver: 550.127.08

librmm : 24.12.00a33 cuda12_241204_g3b5f6af2_33 rapidsai-nightly
rmm: 24.12.00a33 cuda12_py312_241204_g3b5f6af2_33 rapidsai-nightly

Update

The issue also occurs on 24.10.

I also reinstalled the OS and tried different drivers:

CUDA version: 12.6
Compute capability: 8.6 (sm_86)
Device: RTX 3090
OS: Ubuntu 24.04 LTS
Driver: 560.35.03

@yigithanyigit yigithanyigit added the "? - Needs Triage" (Need team to review and classify) and "question" (Further information is requested) labels Dec 5, 2024
@bdice
Contributor

bdice commented Dec 5, 2024

Thanks for filing this! I think this might be a better fit for the cuGraph repository. The RMM failure you're observing is probably due to an earlier bug occurring in cuGraph's code. I will transfer this issue there.

cc: @ChuckHastings for awareness.

@bdice bdice transferred this issue from rapidsai/rmm Dec 5, 2024
@yigithanyigit
Author

yigithanyigit commented Dec 10, 2024

Thank you for the response!

I think you are right! I compiled librmm from source and it works fine. I tried to trace the problem further, but I did not get very far.

I am open to ideas for debugging or solving this.

Thanks

@yigithanyigit
Author

OK, I have not solved the problem, but I found a workaround.

The problem looks like undefined behavior somewhere in the code. I run into it when I build with debug symbols.

@ChuckHastings
Collaborator

We have not been able to build a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try and reproduce it myself and that would make it easier to try and diagnose the problem you are seeing.

@yigithanyigit
Author

First of all thank you for your response!

The Louvain tests (Rmat32, Rmat64) are failing. I assume you can reproduce the problem with those tests.

I built with:

./build.sh libcugraph -g

From my observations, small datasets like karate work perfectly fine. Just FYI.

I hope it helps.

Thanks

@ChuckHastings
Collaborator

I'm back from holiday break and will start investigating. I'll let you know how I progress.

@ChuckHastings ChuckHastings self-assigned this Jan 6, 2025
@ChuckHastings ChuckHastings removed the "? - Needs Triage" (Need team to review and classify) label Jan 6, 2025