When training a new model, or even when using the pretrained one, requesting predictions yields all-NaN probabilities.
This strange behavior was first observed when predicting specific semantic data types: many labels were biased towards the first defined label. Digging deeper, I saw that predict_proba returns a full set of NaN probabilities. I believe this is a bug.
Investigating further, I found that the skewness and kurtosis of the character-level statistics take NaN values. Since both metrics have the standard deviation in the denominator, they are undefined whenever the standard deviation is zero, so this is a valid concern and issue.
This can be fixed in the code by clamping to fixed min/max values for numerical stability, but I believe it also has to be taken into account when deriving complex features from these metrics. As far as I can tell, this issue is not described in the corresponding paper (https://arxiv.org/pdf/1905.10688.pdf); it is probably an edge case the authors missed.
This may also be the root cause of issue #47 (#47).
Thanks a lot for the great model, open-source code, and contributions.
Thanks a lot for reporting your issue and findings; this is a great catch.
I hope to have time to look into this soon. In the meantime, feel welcome to file a PR with a fix, if you have implemented one!
Hi Nikolaos,
That would be much appreciated!
There are no guidelines in place, but it would be great if you could
provide your solution along with some evidence showing that 1) the overall
model performance is the same or better, and 2) the predicted
probabilities/classes make sense for the inputs for which you found issues.
Thank you!
Kind regards,
Madelon
On Mon, Oct 3, 2022 at 3:42 AM Nikolaos Anastasopoulos < ***@***.***> wrote:
Are there any specific guidelines on how to open a PR, any tests that have to
be executed, and so on? I will open a PR in the next few days!
Kind regards,
Nikolaos
Below is an example of the aforementioned behavior, obtained just by changing the inputs in the provided example notebooks. This is a minimal reproducible example:
https://gist.github.com/stranger-codebits/6074b5fe2d02ac9db9f2750dbad9a24f