API: value_counts to consistently maintain order of input #59745

Merged · 7 commits · Oct 6, 2024
61 changes: 61 additions & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -203,6 +203,67 @@ In cases with mixed-resolution inputs, the highest resolution is used:
In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
Out[2]: dtype('<M8[ns]')

.. _whatsnew_300.api_breaking.value_counts_sorting:

Changed behavior in :meth:`DataFrame.value_counts` and :meth:`DataFrameGroupBy.value_counts` when ``sort=False``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In previous versions of pandas, :meth:`DataFrame.value_counts` with ``sort=False`` would sort the result by row labels (as was documented). This was nonintuitive and inconsistent with :meth:`Series.value_counts`, which maintains the order of the input. Now :meth:`DataFrame.value_counts` also maintains the order of the input.

.. ipython:: python

df = pd.DataFrame(
{
"a": [2, 2, 2, 2, 1, 1, 1, 1],
"b": [2, 1, 3, 1, 2, 3, 1, 1],
}
)
df

*Old behavior*

.. code-block:: ipython

In [3]: df.value_counts(sort=False)
Out[3]:
a b
1 1 2
2 1
3 1
2 1 2
2 1
3 1
Name: count, dtype: int64

*New behavior*

.. ipython:: python

df.value_counts(sort=False)
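The counting itself can be checked in a runnable form. The sketch below reuses the frame from the example above; the ``to_dict`` comparison is order-independent, so it holds on any recent pandas version, while the first-appearance ordering described above requires pandas 3.0:

```python
import pandas as pd

# Same frame as in the whatsnew example above.
df = pd.DataFrame(
    {
        "a": [2, 2, 2, 2, 1, 1, 1, 1],
        "b": [2, 1, 3, 1, 2, 3, 1, 1],
    }
)

# On pandas 3.0, sort=False keeps rows in order of first appearance:
# (2, 2), (2, 1), (2, 3), (1, 2), (1, 3), (1, 1).
counts = df.value_counts(sort=False)

# The counts themselves are identical in every version; only the order changed.
assert counts.to_dict() == {
    (2, 2): 1,
    (2, 1): 2,
    (2, 3): 1,
    (1, 2): 1,
    (1, 3): 1,
    (1, 1): 2,
}
print(counts)
```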

This change also applies to :meth:`.DataFrameGroupBy.value_counts`. Here there are two sorting options: the ``sort`` argument passed to :meth:`DataFrame.groupby` and the one passed directly to :meth:`.DataFrameGroupBy.value_counts`. The former determines whether to sort the groups, the latter whether to sort the counts. All non-grouping columns maintain the order of the input *within groups*.

*Old behavior*

.. code-block:: ipython

In [5]: df.groupby("a", sort=True).value_counts(sort=False)
Out[5]:
a b
1 1 2
2 1
3 1
2 1 2
2 1
3 1
dtype: int64

*New behavior*

.. ipython:: python

df.groupby("a", sort=True).value_counts(sort=False)
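The interaction of the two knobs can be sketched directly: ``groupby(sort=True)`` orders the groups, while ``value_counts(sort=False)`` leaves the within-group counts in input order on pandas 3.0. The assertion checks only the version-independent group totals:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [2, 2, 2, 2, 1, 1, 1, 1],
        "b": [2, 1, 3, 1, 2, 3, 1, 1],
    }
)

# sort=True in groupby orders the groups (a=1 before a=2), while
# sort=False in value_counts leaves the counts within each group
# in input order (on pandas 3.0) instead of sorted by label.
result = df.groupby("a", sort=True).value_counts(sort=False)

# Whatever the ordering, the per-group totals must equal the group sizes.
group_sizes = result.groupby(level="a").sum()
assert group_sizes.to_dict() == {1: 4, 2: 4}
print(result)
```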

.. _whatsnew_300.api_breaking.deps:

Increased minimum version for Python
10 changes: 8 additions & 2 deletions pandas/core/frame.py
@@ -7266,7 +7266,11 @@ def value_counts(
normalize : bool, default False
Return proportions rather than frequencies.
sort : bool, default True
-            Sort by frequencies when True. Sort by DataFrame column values when False.
+            Sort by frequencies when True. Preserve the order of the data when False.

.. versionchanged:: 3.0.0

Prior to 3.0.0, ``sort=False`` would sort by the column values.
ascending : bool, default False
Sort in ascending order.
dropna : bool, default True
@@ -7372,7 +7376,9 @@ def value_counts(
subset = self.columns.tolist()

name = "proportion" if normalize else "count"
-        counts = self.groupby(subset, dropna=dropna, observed=False)._grouper.size()
+        counts = self.groupby(
+            subset, sort=False, dropna=dropna, observed=False
+        )._grouper.size()
counts.name = name

if sort:
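The effect of this patch is that grouping now always happens in input order, and any sorting is applied to the counts afterwards. A rough public-API equivalent (a sketch, not the real implementation — it uses ``.size()`` instead of the internal ``_grouper.size()``, and omits the ``subset``, ``normalize``, and ``dropna`` handling):

```python
import pandas as pd


def value_counts_sketch(
    df: pd.DataFrame, sort: bool = True, ascending: bool = False
) -> pd.Series:
    """Rough public-API equivalent of the patched DataFrame.value_counts.

    Group in input order (sort=False), then sort the resulting counts
    only when requested.
    """
    counts = df.groupby(df.columns.tolist(), sort=False, observed=False).size()
    counts.name = "count"
    if sort:
        counts = counts.sort_values(ascending=ascending, kind="stable")
    return counts


df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
# The counts match the real method on this frame (order-independent check).
assert value_counts_sketch(df).to_dict() == df.value_counts().to_dict()
print(value_counts_sketch(df))
```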
28 changes: 17 additions & 11 deletions pandas/core/groupby/generic.py
@@ -2621,7 +2621,13 @@ def value_counts(
normalize : bool, default False
Return proportions rather than frequencies.
sort : bool, default True
-            Sort by frequencies.
+            Sort by frequencies when True. When False, non-grouping columns will appear
+            in the order they occur within groups.

.. versionchanged:: 3.0.0

In prior versions, ``sort=False`` would sort the non-grouping columns
by label.
ascending : bool, default False
Sort in ascending order.
dropna : bool, default True
@@ -2673,43 +2679,43 @@

>>> df.groupby("gender").value_counts()
gender education country
-  female  high       FR         1
-                     US         1
+  female  high       US         1
+                     FR         1
male low FR 2
US 1
medium FR 1
Name: count, dtype: int64

>>> df.groupby("gender").value_counts(ascending=True)
gender education country
-  female  high       FR         1
-                     US         1
+  female  high       US         1
+                     FR         1
male low US 1
medium FR 1
low FR 2
Name: count, dtype: int64

>>> df.groupby("gender").value_counts(normalize=True)
gender education country
-  female  high       FR         0.50
-                     US         0.50
+  female  high       US         0.50
+                     FR         0.50
male low FR 0.50
US 0.25
medium FR 0.25
Name: proportion, dtype: float64

>>> df.groupby("gender", as_index=False).value_counts()
gender education country count
-  0  female  high    FR      1
-  1  female  high    US      1
+  0  female  high    US      1
+  1  female  high    FR      1
2 male low FR 2
3 male low US 1
4 male medium FR 1

>>> df.groupby("gender", as_index=False).value_counts(normalize=True)
gender education country proportion
-  0  female  high    US      0.50
+  0  female  high    US      0.50
+  1  female  high    FR      0.50
2 male low FR 0.50
3 male low US 0.25
4 male medium FR 0.25
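One invariant these examples rely on: with ``normalize=True`` the proportions are computed within each group, so they sum to 1 per group whatever the ordering. A quick check using the same frame as the docstring examples:

```python
import pandas as pd

# The frame used in the docstring examples above.
df = pd.DataFrame(
    {
        "gender": ["male", "male", "female", "male", "female", "male"],
        "education": ["low", "medium", "high", "low", "high", "low"],
        "country": ["US", "FR", "US", "FR", "FR", "FR"],
    }
)

props = df.groupby("gender").value_counts(normalize=True)

# Proportions are taken within each group, so each gender sums to 1.0
# no matter how the rows are ordered.
per_group = props.groupby(level="gender").sum()
assert per_group.to_dict() == {"female": 1.0, "male": 1.0}
print(props)
```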
4 changes: 2 additions & 2 deletions pandas/core/groupby/groupby.py
@@ -2519,7 +2519,7 @@ def _value_counts(
grouper, _, _ = get_grouper(
df,
key=key,
-            sort=self.sort,
+            sort=False,
observed=False,
dropna=dropna,
)
@@ -2528,7 +2528,7 @@
# Take the size of the overall columns
gb = df.groupby(
groupings,
-            sort=self.sort,
+            sort=False,
observed=self.observed,
dropna=self.dropna,
)
39 changes: 35 additions & 4 deletions pandas/core/groupby/ops.py
@@ -755,6 +755,7 @@ def result_index_and_ids(self) -> tuple[Index, npt.NDArray[np.intp]]:
obs = [
ping._observed or not ping._passed_categorical for ping in self.groupings
]
sorts = [ping._sort for ping in self.groupings]
# When passed a categorical grouping, keep all categories
for k, (ping, level) in enumerate(zip(self.groupings, levels)):
if ping._passed_categorical:
@@ -765,7 +766,9 @@ def result_index_and_ids(self) -> tuple[Index, npt.NDArray[np.intp]]:
result_index.name = self.names[0]
ids = ensure_platform_int(self.codes[0])
elif all(obs):
-            result_index, ids = self._ob_index_and_ids(levels, self.codes, self.names)
+            result_index, ids = self._ob_index_and_ids(
+                levels, self.codes, self.names, sorts
+            )
elif not any(obs):
result_index, ids = self._unob_index_and_ids(levels, self.codes, self.names)
else:
@@ -778,6 +781,7 @@
levels=[levels[idx] for idx in ob_indices],
codes=[codes[idx] for idx in ob_indices],
names=[names[idx] for idx in ob_indices],
sorts=[sorts[idx] for idx in ob_indices],
)
unob_index, unob_ids = self._unob_index_and_ids(
levels=[levels[idx] for idx in unob_indices],
@@ -800,9 +804,18 @@
).reorder_levels(index)
ids = len(unob_index) * ob_ids + unob_ids

-        if self._sort:
+        if any(sorts):
             # Sort result_index and recode ids using the new order
-            sorter = result_index.argsort()
+            n_levels = len(sorts)
+            drop_levels = [
+                n_levels - idx
+                for idx, sort in enumerate(reversed(sorts), 1)
+                if not sort
+            ]
+            if len(drop_levels) > 0:
+                sorter = result_index._drop_level_numbers(drop_levels).argsort()
+            else:
+                sorter = result_index.argsort()
result_index = result_index.take(sorter)
_, index = np.unique(sorter, return_index=True)
ids = ensure_platform_int(ids)
@@ -837,10 +850,13 @@ def _ob_index_and_ids(
levels: list[Index],
codes: list[npt.NDArray[np.intp]],
names: list[Hashable],
sorts: list[bool],
) -> tuple[MultiIndex, npt.NDArray[np.intp]]:
+        consistent_sorting = all(sorts[0] == sort for sort in sorts[1:])
+        sort_in_compress = sorts[0] if consistent_sorting else False
         shape = tuple(len(level) for level in levels)
         group_index = get_group_index(codes, shape, sort=True, xnull=True)
-        ob_ids, obs_group_ids = compress_group_index(group_index, sort=self._sort)
+        ob_ids, obs_group_ids = compress_group_index(group_index, sort=sort_in_compress)
ob_ids = ensure_platform_int(ob_ids)
ob_index_codes = decons_obs_group_ids(
ob_ids, obs_group_ids, shape, codes, xnull=True
@@ -851,6 +867,21 @@
names=names,
verify_integrity=False,
)
if not consistent_sorting:
# Sort by the levels where the corresponding sort argument is True
n_levels = len(sorts)
drop_levels = [
n_levels - idx
for idx, sort in enumerate(reversed(sorts), 1)
if not sort
]
if len(drop_levels) > 0:
sorter = ob_index._drop_level_numbers(drop_levels).argsort()
else:
sorter = ob_index.argsort()
ob_index = ob_index.take(sorter)
_, index = np.unique(sorter, return_index=True)
ob_ids = np.where(ob_ids == -1, -1, index.take(ob_ids))
ob_ids = ensure_platform_int(ob_ids)
return ob_index, ob_ids
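The ``drop_levels`` comprehension above converts reversed 1-based positions back into forward 0-based level numbers, selecting exactly the levels whose sort flag is False. A standalone sketch of just that arithmetic (the helper name is illustrative, not from the PR):

```python
def unsorted_level_numbers(sorts: list[bool]) -> list[int]:
    # enumerate(reversed(sorts), 1) walks the levels last-to-first with
    # 1-based positions, so n_levels - idx recovers the forward 0-based
    # level number of each level whose sort flag is False.
    n_levels = len(sorts)
    return [
        n_levels - idx
        for idx, sort in enumerate(reversed(sorts), 1)
        if not sort
    ]


# Levels 1 and 3 are unsorted, and they come out in descending order.
assert unsorted_level_numbers([True, False, True, False]) == [3, 1]
assert unsorted_level_numbers([True, True]) == []
```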

4 changes: 2 additions & 2 deletions pandas/tests/frame/methods/test_value_counts.py
@@ -128,7 +128,7 @@ def test_data_frame_value_counts_dropna_true(nulls_fixture):
expected = pd.Series(
data=[1, 1],
index=pd.MultiIndex.from_arrays(
-            [("Beth", "John"), ("Louise", "Smith")], names=["first_name", "middle_name"]
+            [("John", "Beth"), ("Smith", "Louise")], names=["first_name", "middle_name"]
),
name="count",
)
@@ -156,7 +156,7 @@ def test_data_frame_value_counts_dropna_false(nulls_fixture):
pd.Index(["Anne", "Beth", "John"]),
pd.Index(["Louise", "Smith", np.nan]),
],
-            codes=[[0, 1, 2, 2], [2, 0, 1, 2]],
+            codes=[[2, 0, 2, 1], [1, 2, 2, 0]],
names=["first_name", "middle_name"],
),
name="count",
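The reordered expectations above encode first-appearance order. Independent of ordering, the ``dropna`` semantics these tests exercise can be checked directly — a sketch mirroring the fixture, with ``None`` standing in for the parametrized null:

```python
import pandas as pd

# Mirrors the test fixture above, with None in place of nulls_fixture.
df = pd.DataFrame(
    {
        "first_name": ["John", "Anne", "John", "Beth"],
        "middle_name": ["Smith", None, None, "Louise"],
    }
)

with_na = df.value_counts(dropna=False)
without_na = df.value_counts(dropna=True)

# dropna=False keeps every distinct row, including those with nulls;
# dropna=True discards any row containing a null before counting.
assert len(with_na) == 4 and bool((with_na == 1).all())
assert len(without_na) == 2
print(with_na)
```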