Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Delay setting MultiIndex.level/codes until needed (#17728)
Follow up to #17644 This PR changes `MultiIndex` to delay computing `self._level` and `self._codes` via factorization of `self._column` until needed by certain methods. Before, for state consistency, those attributes were always eagerly computed. As discussed offline the performance benefit of not eagerly computing those attributes is more desirable. ```python import cudf df_train = cudf.datasets.randomdata(nrows=50_000_000, dtypes={"label": int, "weekday": int, "cat_2": int, "brand": int}) target = "label" col = ['weekday', 'cat_2', 'brand'] df_gb= df_train[col + [target]].groupby(col) %%time df_gb.agg(['mean', 'count']) # PR CPU times: user 144 ms, sys: 23.9 ms, total: 168 ms Wall time: 166 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:12:46 Samples: 4 /_//_/// /_\ / //_// / //_'/ // Duration: 0.159 CPU time: 0.160 / _/ v5.0.0 Cell [3] 0.158 <module> <ipython-input-3-ee51998a643a>:1 `- 0.158 wrapper cudf/utils/performance_tracking.py:30 `- 0.158 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.129 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.029 NumericalColumn.astype cudf/core/column/column.py:1126 `- 0.029 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 `- 0.029 inner contextlib.py:78 `- 0.029 cast cudf/core/_internals/unary.py:40 # branch 25.02 head CPU times: user 369 ms, sys: 105 ms, total: 474 ms Wall time: 478 ms _ ._ __/__ _ _ _ _ _/_ Recorded: 12:11:20 Samples: 79 /_//_/// /_\ / //_// / //_'/ // Duration: 0.480 CPU time: 0.478 / _/ v5.0.0 Cell [3] 0.479 <module> <ipython-input-3-ee51998a643a>:1 `- 0.478 wrapper cudf/utils/performance_tracking.py:30 `- 0.478 DataFrameGroupBy.agg cudf/core/groupby/groupby.py:879 |- 0.267 cached_property.__get__ functools.py:979 | `- 0.267 _Grouping.keys cudf/core/groupby/groupby.py:3534 | `- 0.267 wrapper cudf/utils/performance_tracking.py:30 | `- 0.267 MultiIndex._from_data cudf/core/multiindex.py:344 | `- 0.265 _compute_levels_and_codes cudf/core/multiindex.py:67 | |- 0.230 factorize cudf/core/algorithms.py:22 | | |- 0.168 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | | |- 0.097 [self] cudf/core/column/column.py | | | |- 0.031 as_column cudf/core/column/column.py:1948 | | | | `- 0.028 NumericalColumn.astype cudf/core/column/column.py:1126 | | | | `- 0.028 NumericalColumn.as_numerical_column cudf/core/column/numerical.py:428 | | | | `- 0.028 inner contextlib.py:78 | | | | `- 0.028 cast cudf/core/_internals/unary.py:40 | | | |- 0.030 inner contextlib.py:78 | | | | `- 0.030 sort_by_key cudf/core/_internals/sorting.py:160 | | | `- 0.007 NumericalColumn.take cudf/core/column/column.py:943 | | | `- 0.007 inner contextlib.py:78 | | | `- 0.007 gather cudf/core/_internals/copying.py:18 | | |- 0.051 NumericalColumn.unique cudf/core/column/column.py:1342 | | | |- 0.036 inner contextlib.py:78 | | | | `- 0.036 drop_duplicates cudf/core/_internals/stream_compaction.py:82 | | | `- 0.015 NumericalColumn.is_unique cudf/core/column/column.py:1056 | | | `- 0.015 NumericalColumn.distinct_count cudf/core/column/column.py:1108 | | `- 0.006 NumericalColumn.dropna cudf/core/column/column.py:294 | | `- 0.006 NumericalColumn.copy cudf/core/column/column.py:481 | `- 0.028 _compile_module_with_cache cupy/cuda/compiler.py:473 | [4 frames hidden] cupy, <built-in> |- 0.133 DataFrameGroupBy._aggregate cudf/core/groupby/groupby.py:789 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 `- 0.079 MultiIndex._from_columns_like_self cudf/core/frame.py:194 `- 0.079 wrapper cudf/utils/performance_tracking.py:30 |- 0.040 MultiIndex._copy_type_metadata cudf/core/multiindex.py:2101 | `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 | `- 0.038 factorize cudf/core/algorithms.py:22 | |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | | |- 0.010 [self] cudf/core/column/column.py | | `- 0.007 inner contextlib.py:78 | | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 | `- 0.011 NumericalColumn.unique cudf/core/column/column.py:1342 | `- 0.010 inner contextlib.py:78 | `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 `- 0.039 MultiIndex._from_data cudf/core/multiindex.py:344 `- 0.038 _compute_levels_and_codes cudf/core/multiindex.py:67 `- 0.038 factorize cudf/core/algorithms.py:22 |- 0.021 NumericalColumn._label_encoding cudf/core/column/column.py:1516 | |- 0.010 [self] cudf/core/column/column.py | `- 0.007 inner contextlib.py:78 | `- 0.007 sort_by_key cudf/core/_internals/sorting.py:160 `- 0.010 NumericalColumn.unique cudf/core/column/column.py:1342 `- 0.010 inner contextlib.py:78 `- 0.010 drop_duplicates cudf/core/_internals/stream_compaction.py:82 ``` Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #17728
- Loading branch information