-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968
Comments
In an ideal modeling scenario one would only proceed to encoding step if they have completed the data cleaning process and no null values are present, in which case the extra column generated is harmless since it has no variance to be picked up as a feature irrespective of the model. Having said that I would not want to remove this feature, because it could be potentially used as a final data prep step to detect if there are any pending nulls present left to be treated. However, I do agree with adding it to docs. Maybe adding a brief comment to the |
This is not quite right IMO, as there are many ways to deal with NaN values, one of them being to convert them to indicators rather than to interpolate them. I.e. this function can just as much be regarded as a cleaning step. Another example of where this behaviour is extremely counter-intuitive / unproductive: if a DataFrame X is such that all n features are binary indicator values, but with NaN values being possible and only present in k of the n features, then using This is not only extremely counter-intuitive, but if, as you said, encoding is supposed to be done "after cleaning", then this behaviour basically forces the user to clean the data again after encoding, meaning it breaks your ideal modeling scenario data flow as well (necessitating clean -> encode -> clean again). Of course, it is trivial to do this step yourself using
This is unfortunately just not true at all. There is nothing really harmless about e.g. k extra constant and identical columns if the later model is, say, a deep learner (in which case these constant columns will often be converted to floating point of some kind, and then there will be extra gradient computations due to a bunch of useless constants), or if that data is being passed to e.g. a stepwise feature selection model. As I also said in a multiprocessing context, k unexpected useless new features (especially if converting to float) can be a significant amount of extra memory, especially if on a cluster with e.g. 80 cores. This can happen in automated ML pipelines and the like. Whether extra constant and identical columns are harmless or not depends on the use-case and downstream task. But anyway, I agree this is all fixed by just making the documentation clearer. It would also be nice to know the reasoning though, because I don't think "constant columns are harmless" is really generally true. If this is an assumption baked into |
In my opinion, it would be surprising to pass I'm strongly opposed here. |
@rhshadrach I can definitely understand why people might have different expectations here about the behaviour, but that is why I ultimately filed this as a documentation issue. I wouldn't want behaviour to change for people who expected the current behaviour. Are you saying you oppose changing the documentation too as well though? EDIT: And if the above reasoning is that we want pd.get_dummies outputs to not depend on the values of the inputs, this also is currently not the case, since obviously the amount of columns produced by >>> df = DataFrame({"a": [0, 1, 1], "b": [0, 0, 1]})
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
a_1.0 a_nan b_1.0 b_nan
0 False False False False
1 True False False False
2 True False True False
>>> df.iloc[2, 0] = 2
>>> pd.get_dummies(df, columns=["a", "b"], dummy_na=True, drop_first=True)
a_1.0 a_2.0 a_nan b_1.0 b_nan
0 False False False False False
1 True False False False False
2 False True False True False |
Perhaps just to support the idea that people might have very different inututions here, e.g. scikit-learn MissingIndicator has the following argument:
I.e. the default behaviour is arguably closer to what I expected. I think a simple documentation change would make this clearer to users of |
The addition of
seems unnecessary to me, however no real opposition to its inclusion. I do agree
This is in the definition of the operation, so is unavoidable. The point is to avoid it where ever possible. |
PRs for improving the documentation are welcome! |
take |
Pandas version checks
main
hereLocation of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
Documentation problem
The docs currently read for this function:
However, when no NaN values are present, a useless constant NaN indicator column is still added:
This is arguably quite unexpected behaviour, as constant columns do not contain information except in some very rare cases and for specific custom models.
I.e. for almost any kind of model this column will be ignored (but then annoyingly clutter e.g.
.feature_importances
variables and also perhaps needlessly increase compute times for algorithms that scale significantly with the number of features, but which may not have methods for ignoring constant input features). For data with a lot of binary features, and pipelines or models which also might do e.g. conversion to floating-point dtypes, all these useless extra constant features could also significantly increase memory requirements (especially in multiprocessing contexts).I imagine the intended design decision is that if you do e.g.
df1, df2 = train_test_split(df)
, and it ends up such thatdf1
doesn't have any NaN values for some feature"f"
, butdf2
does, then at least with the current implementation, the user can be ensured the following does not raise an AssertionError:But still, that is a pretty strange use-case, as dummification should generally happen right at the beginning on the full data.
In my opinion the default behaviour should to only add a NaN indicator column if NaN values are actually present... . I actually would consider this to be an implementation or design bug, for like 99% of use-cases. But at bare minimum this undesirable and unexpected behaviour should be documented with some reasoning.
Suggested fix for documentation
Just change the docs to:
The text was updated successfully, but these errors were encountered: