[BUG]: Duplicate column names are not allowed when given weight in from_dask_cudf_edgelist #4852
Comments
Hi @blackboxo, sorry to hear that you've encountered this bug. I'm happy to take a look at this issue. Is this issue reproducible on any dataset, or would you be able to provide some insights about this
I think it's reproducible on any dataset.
@blackboxo I've tried to reproduce this error with your code:

import cugraph
import cudf
import pandas as pd
import numpy as np
import gc
import datetime
import sys
import nvidia_smi
import time
import datetime
import os
import timeit
import itertools
import matplotlib.pyplot as plt
import dask.dataframe as dd
import cugraph.dask as dask_cugraph
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask.utils import parse_bytes
import cugraph.dask.comms.comms as Comms
from cugraph.testing.mg_utils import stop_dask_client
import cudf
import dask_cudf

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        # CUDA_VISIBLE_DEVICES="0,1",
        rmm_pool_size=parse_bytes("25GB"),  # This GPU has 32GB of memory
        # rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
        enable_cudf_spill=True,  # Improve device memory stability
        device_memory_limit=parse_bytes("128GB"),
    )
    client = Client(cluster)
    Comms.initialize(p2p=True)
    import dask
    dask.config.set({"dataframe.backend": "cudf"})
    asset_df = dd.read_csv('netscience.csv', 'netscience.csv', names=['src', 'dst', 'weights'], sep=" ", nrows=100)
    # breakpoint()
    G = cugraph.Graph()
    G.from_dask_cudf_edgelist(asset_df, source='src', destination='dst', weight='weights', renumber=True)

contents of my environment
What happens if you run your code snippet with this dataset: https://github.com/rapidsai/cugraph/blob/branch-25.02/datasets/netscience.csv
Seems to work now. Thanks!
Ah, glad to hear! Just for the record, what seemed to be the fix? Updating the environment?
It's a bit strange. In the following code, I changed asset_df_join to asset_df_join[['event_account_src', 'event_account_dst', 'weight']], and then it works. But I have checked the columns in asset_df_join, and there are no duplicate column names.

G.from_dask_cudf_edgelist(asset_df_join[['event_account_src', 'event_account_dst', 'weight']], source='event_account_src', destination='event_account_dst', weight='weight', renumber=True)

asset_df_join is a Dask dataframe generated by the following code:

import dask
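The workaround in this comment amounts to passing only the source, destination, and weight columns to the graph constructor. As a rough sketch of that pre-check, here is a plain-Python version that works on a list of column names. The helper names `find_duplicates` and `edge_columns` are hypothetical and are not part of cuGraph or Dask. Whether extra columns are what actually triggers the error is an assumption that this thread does not confirm.

```python
from collections import Counter


def find_duplicates(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in dict.fromkeys(columns) if counts[name] > 1]


def edge_columns(columns, source, destination, weight=None):
    """Return only the columns needed for an edge list (hypothetical helper).

    Raises KeyError if any requested column is absent from `columns`.
    """
    wanted = [source, destination]
    if weight is not None:
        wanted.append(weight)
    missing = [name for name in wanted if name not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return wanted
```

With a dataframe such as asset_df_join, the returned list could be used for the selection, for example asset_df_join[edge_columns(list(asset_df_join.columns), 'event_account_src', 'event_account_dst', 'weight')], before calling from_dask_cudf_edgelist.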
Hmm, very interesting. If I had to guess, there must be some difference in the data being passed in before vs. now with the workaround. But let me know if you'd like me to try and reproduce things again. For the time being, thanks for letting us know. Best of luck!
Version
24.10.0
Which installation method(s) does this occur on?
Pip
Describe the bug.
Specifying weight in from_dask_cudf_edgelist throws the error ValueError: Duplicate column names are not allowed.
If weight is not specified, there is no error.
Minimum reproducible example
Relevant log output
Environment details
Other/Misc.
No response