-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closest improved #185
Closest improved #185
Conversation
…osest hits and are not separate chromosomes. Problem: If the segment does not have closest (e.g., in ignore_upstream=True setting), it will simply drop out of output table. Solution: Added checking the indexes of df1 that are absent form the final returned hits Why it was not observed before: No detiled testing for options ignore_upstream/ignore_downstream/ignore_overlap and their combinations. If df1 has a segment from a chromosome X and df2 has a segment from it, but they are in an arrangements that won't produce a hit, this segment will be dropped out of df1 output.
@@ -511,8 +510,8 @@ def overlap( | |||
index_col = return_index if isinstance(return_index, str) else "index" | |||
index_col_1 = index_col + suffixes[0] | |||
index_col_2 = index_col + suffixes[1] | |||
df_index_1 = pd.DataFrame({index_col_1: df1.index[events1]}) | |||
df_index_2 = pd.DataFrame({index_col_2: df2.index[events2]}) | |||
df_index_1 = pd.DataFrame({index_col_1: df1.index[events1]}, dtype=pd.Int64Dtype()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe @nvictus introduced the _to_nullable_dtype()
in the last update... maybe there is a way we can remove boilerplate & also use same strategy for closest ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but I don't see how it will help in this case, frankly. It's storage of the index that is always integer and not absent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree that we do not seem to need a nullable type here.
…tream/downstream when there is no region to match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual change is small and nicely contained, seems good to me!
@@ -511,8 +510,8 @@ def overlap( | |||
index_col = return_index if isinstance(return_index, str) else "index" | |||
index_col_1 = index_col + suffixes[0] | |||
index_col_2 = index_col + suffixes[1] | |||
df_index_1 = pd.DataFrame({index_col_1: df1.index[events1]}) | |||
df_index_2 = pd.DataFrame({index_col_2: df2.index[events2]}) | |||
df_index_1 = pd.DataFrame({index_col_1: df1.index[events1]}, dtype=pd.Int64Dtype()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree that we do not seem to need a nullable type here.
Problem: If the segment does not have closest (e.g., in ignore_upstream=True setting), it will simply drop out of output table.
Solution: Added checking the indexes of df1 that are absent form the final returned hits
Why it was not observed before: No detiled testing for options ignore_upstream/ignore_downstream/ignore_overlap and their combinations.
If df1 has a segment from a chromosome X and df2 has a segment from it, but they are in an arrangements that won't produce a hit, this segment will be dropped out of df1 output.
Dragged modifications: overlap returned float indexes with failing the tests (numpy v1.22.4, pandas v1.5.2). Solution added for robustness of code across different software versions.