ENH: Add totality validation to merge method #58547

z3rone · 2024-05-03T09:56:55Z

Feature Type

Adding new functionality to pandas

Problem Description

The available validation methods lack checks for (left-/right-)totality. I am frequently encountering cases where I need to manually check that eg. a one-to-one merge also finds a match match in the right DF for every row in the left DF or vice versa.

Feature Description

Add the following to one_to_one, one_to_many and many_to_one merge validations:

left_total ... Each row in the left DataFrame is matched to (at least) one row in the right DataFrame
right_total ... Each row in the right DataFrame is matched to (at least) one row in the left DataFrame
total ... Both left_total and right_total must hold

A combination of join relation and totality constraint should be possible by combining with a +: one_to_one+left_total

Alternative Solutions

Currently, doing an outer join and checking for NaN values in the "foreign" columns works to find unmerged rows. However, this will fail if there are already NaN values in the initial DataFrames.

Additional Context

No response

The text was updated successfully, but these errors were encountered:

z3rone · 2024-05-09T13:08:24Z

To maybe add a common use case. Here the goal is to add the biological domain to the favorite animal of certain people:

import pandas as pd

# Create the first DataFrame with person names and favorite animals
df1_data = {
    'Person': ['John', 'Emma', 'Alex','Darleen'],
    'Animal': ['Dog', 'Spider', 'Snake','Cat']
}
df1 = pd.DataFrame(df1_data)

# Create the second DataFrame with mapping of animals to biological class
df2_data = {
    'Animal': ['Dog', 'Snake', 'Cat'],
    'Biological_Class': ['Mammal', 'Reptile', 'Mammal']
}
df2 = pd.DataFrame(df2_data)

# Merge the DataFrames on the 'Animal' column
merged_df = pd.merge(
    df1,
    df2,
    on='Animal',
    validate='m:1'
)

The merged_df will lack the favorite animal of Emma, as 'Spider' has no class defined in df2. With the proposed feature validate could be set to m:1+left_total. This would raise an error as not all keys from the left df1 are contained in the right df2.

z3rone · 2024-06-28T11:23:00Z

To transfer some of the arguments from the (for now suspended) MR:

The argument was made that the extension of the validate parameter could blurr the lines of what the feature is supposed to be
I would argue that currently the validation serves to avoid unwanted duplication of rows and my extionsion would be the complement: Ensuring that there is no unwanted loss of rows.
Furthermore, the extension would enable the validation of some of the most common merge cardinalities in the entity-relationship-model.

rhshadrach · 2024-10-03T20:22:02Z

Thanks for the request and PR. I agree this is a natural thing to validate, and if the implementation that pandas could provide was significantly more efficient than what a user could do separately (e.g. using various things computed in the course of doing the merge), then I would be more supportive of its inclusion in pandas.

However, the implementation in the current PR is no more efficient than what can be accomplished by the current public API. As such, it would add more maintenance burden without providing anything a user could not already do (albeit, with more keystrokes). As such, I'm currently opposed to its inclusion.

z3rone · 2024-10-03T22:01:59Z

@rhshadrach I do not see how this extension is more trivial than the current validations. The 1:m check is basically just a check for left.index.is_unique. I personally find myself typing out the manual check for lost samples quite frequent. But if the community does not see enought value in this change I'll stop pushing it forward.

rhshadrach · 2024-10-03T22:14:59Z

I agree and would not support their inclusion today. In addition, if they were prohibitive to further enhancements to core functionality, I would support their deprecation and removal. As it is, I think keeping them so as to not disrupt their current users makes sense.

z3rone · 2024-10-03T23:09:30Z

thank you for your feedback. I guess this can be closed then.

z3rone added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 3, 2024

z3rone added a commit to z3rone/pandas that referenced this issue May 4, 2024

Add totality validation pandas-dev#58547

fad17c3

z3rone mentioned this issue May 6, 2024

ENH: Totality validation for merge operation #58600

Closed

5 tasks

z3rone added a commit to z3rone/pandas that referenced this issue May 6, 2024

Add totality validation pandas-dev#58547

93ebe3d

rhshadrach added Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 3, 2024

z3rone closed this as completed Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add totality validation to merge method #58547

ENH: Add totality validation to merge method #58547

z3rone commented May 3, 2024 •

edited

Loading

z3rone commented May 9, 2024 •

edited

Loading

z3rone commented Jun 28, 2024

rhshadrach commented Oct 3, 2024

z3rone commented Oct 3, 2024

rhshadrach commented Oct 3, 2024

z3rone commented Oct 3, 2024

ENH: Add totality validation to merge method #58547

ENH: Add totality validation to merge method #58547

Comments

z3rone commented May 3, 2024 • edited Loading

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

z3rone commented May 9, 2024 • edited Loading

z3rone commented Jun 28, 2024

rhshadrach commented Oct 3, 2024

z3rone commented Oct 3, 2024

rhshadrach commented Oct 3, 2024

z3rone commented Oct 3, 2024

z3rone commented May 3, 2024 •

edited

Loading

z3rone commented May 9, 2024 •

edited

Loading