0.4
Home insurance claims classifier.
To improve the client experience and shorten claim processing time, predict at First Notification of Loss whether a claim will be complex or simple to handle.
21.06.2021
John Doe and Jane Rey.
The source code is available at this address: http://www.intranet/project/dir
Claim Id | Age | Gender | Loan Amount | ... |
---|---|---|---|---|
2938 | 38 | M | 74000 | |
2939 | 42 | F | 123000 | |
2940 | 53 | F | 85000 | |
Columns: 19 Rows: 215 412
Column name | Description | Example |
---|---|---|
Age | The age of the client | 34 |
Gender | The gender of the client | M |
Loan Amount | Amount of the client loan | 83000 |
At this step, the dataset unique signature is:
abece2ef84645c61499cb4b74f552daa205380666b1ab03bbfa2fcdab91b11b6
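The report does not state how this signature is computed. Its 64 hexadecimal characters are consistent with a SHA-256 digest, so one plausible approach, shown here purely as an assumption, is to hash a canonical serialization of the table. The function name `dataset_signature` and the CSV serialization are illustrative choices, not the project's confirmed method.

```python
import csv
import hashlib
import io


def dataset_signature(header, rows):
    """Return a SHA-256 hex digest of the table serialized as CSV.

    Assumption: the real pipeline may serialize or hash the data differently.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


# Sample rows taken from the extract shown earlier in the report.
header = ["Claim Id", "Age", "Gender", "Loan Amount"]
rows = [[2938, 38, "M", 74000], [2939, 42, "F", 123000], [2940, 53, "F", 85000]]
signature = dataset_signature(header, rows)
```

Recomputing the digest after each processing step and comparing it with the recorded value is a simple way to detect that the data has changed.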
Feature | Missing value percentage |
---|---|
Age | 11% |
Gender | 20% |
Loan Amount | 4% |
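A per-feature missing-value percentage like the one in the table above could be computed as in the following sketch. The sample rows are made up, and treating both `None` and empty strings as missing is an assumption.

```python
def missing_percentages(rows, columns):
    """Percentage of missing values (None or "") for each column."""
    total = len(rows)
    result = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        result[column] = 100.0 * missing / total if total else 0.0
    return result


# Hypothetical rows, for illustration only.
rows = [
    {"Age": 38, "Gender": "M", "Loan Amount": 74000},
    {"Age": None, "Gender": "F", "Loan Amount": 123000},
    {"Age": 53, "Gender": "", "Loan Amount": 85000},
    {"Age": 41, "Gender": "F", "Loan Amount": None},
]
percentages = missing_percentages(rows, ["Age", "Gender", "Loan Amount"])
```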
89.47% of columns have been filled in
2 columns have been dropped because they contained only 1 value.
The 15 remaining columns are:
Columns |
---|
Age |
Gender |
Loan Amount |
... |
Target |
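Dropping columns that contain a single value can be sketched as follows. The `Country` column is hypothetical and only serves as an example of a constant column; the report does not name the columns that were actually dropped.

```python
def drop_constant_columns(rows):
    """Remove columns that hold at most one distinct value across all rows."""
    if not rows:
        return [], []
    columns = list(rows[0])
    constant = [c for c in columns if len({row[c] for row in rows}) <= 1]
    kept = [{c: v for c, v in row.items() if c not in constant} for row in rows]
    return kept, constant


# Hypothetical rows: "Country" never varies, so it carries no information.
rows = [
    {"Age": 38, "Country": "FR", "Gender": "M"},
    {"Age": 42, "Country": "FR", "Gender": "F"},
    {"Age": 53, "Country": "FR", "Gender": "F"},
]
cleaned, dropped = drop_constant_columns(rows)
```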
617 duplicate rows have been dropped.
214 795 rows remaining.
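A minimal sketch of duplicate removal, assuming that a duplicate is a row identical to an earlier one on every column (the report does not say which columns were compared):

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent, hashable key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the third one repeats the first.
rows = [
    {"Age": 38, "Gender": "M", "Loan Amount": 74000},
    {"Age": 42, "Gender": "F", "Loan Amount": 123000},
    {"Age": 38, "Gender": "M", "Loan Amount": 74000},
]
unique_rows = drop_duplicate_rows(rows)
dropped_count = len(rows) - len(unique_rows)
```

The counts reported here are consistent with the initial dataset size: 215 412 - 617 = 214 795.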
No row is entirely empty.
214 795 rows remaining.
Every row contains more than 2 non-empty fields.
No row filtering. 214 795 rows remaining.
Missing values are replaced by the most frequent value for categorical variables and by the median for numerical variables.
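The replacement rule above (most frequent value for categorical variables, median for numerical ones) can be sketched as follows. The function name `impute` and the sample rows are illustrative, and the sketch assumes every column has at least one observed value.

```python
from collections import Counter
from statistics import median


def impute(rows, categorical, numerical):
    """Fill None values: mode for categorical columns, median for numerical ones."""
    fill = {}
    for column in categorical:
        observed = [row[column] for row in rows if row[column] is not None]
        fill[column] = Counter(observed).most_common(1)[0][0]
    for column in numerical:
        observed = [row[column] for row in rows if row[column] is not None]
        fill[column] = median(observed)
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in row.items()}
        for row in rows
    ]


# Hypothetical rows with one missing value per column.
rows = [
    {"Gender": "F", "Loan Amount": 74000},
    {"Gender": None, "Loan Amount": 123000},
    {"Gender": "F", "Loan Amount": None},
    {"Gender": "M", "Loan Amount": 85000},
]
imputed = impute(rows, categorical=["Gender"], numerical=["Loan Amount"])
```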
At this step, the dataset unique signature is:
cb4b74f552daa205380666b1ab03bbfa2fcde2ef8464abe5c61499cab91b11b6
2 variables have been created:
- Age of the insured person at loan subscription, in months (loan subscription date - insured birth date)
- Loan seniority at claim creation date, in months (claim creation date - loan subscription date)
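Both variables are differences between two dates expressed in months. The report does not say how partial months are handled, so the sketch below assumes whole calendar months elapsed. The dates are hypothetical.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not complete yet
    return months


# Hypothetical dates for one insured person.
birth_date = date(1983, 5, 20)
subscription_date = date(2015, 3, 10)
claim_creation_date = date(2021, 1, 15)

age_at_subscription = months_between(birth_date, subscription_date)
loan_seniority = months_between(subscription_date, claim_creation_date)
```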
The target is computed as follows: if the claim processing time (settlement date - creation date) is longer than 3 weeks, the claim is considered complex (target = 1); otherwise it is simple (target = 0).
The target rate is: 6.61%
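The target rule and the target rate can be sketched as follows. Treating "3 weeks" as exactly 21 days with a strict comparison is an assumption, and the dates are hypothetical.

```python
from datetime import date, timedelta

COMPLEX_THRESHOLD = timedelta(weeks=3)


def compute_target(creation_date, settlement_date):
    """Return 1 if processing took strictly more than 3 weeks, else 0."""
    return int(settlement_date - creation_date > COMPLEX_THRESHOLD)


def target_rate(targets):
    """Percentage of claims labelled complex (target = 1)."""
    return 100.0 * sum(targets) / len(targets)


# Hypothetical claims: (creation date, settlement date).
claims = [
    (date(2021, 1, 1), date(2021, 1, 10)),
    (date(2021, 1, 1), date(2021, 1, 22)),  # exactly 21 days: simple
    (date(2021, 1, 1), date(2021, 1, 23)),  # 22 days: complex
    (date(2021, 2, 1), date(2021, 2, 3)),
]
targets = [compute_target(created, settled) for created, settled in claims]
rate = target_rate(targets)
```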
At this step, the dataset unique signature is:
bdb4e3721bd9ea1db352b8672a2facb61058380869f09bd35bb0072695d86a4d
We used the gradient boosting algorithm (XGBoost) with the following parameters:
{'nthread': 4, 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'colsample_bytree': 1, 'silent': 1, 'subsample': 0.8, 'learning_rate': 0.2, 'max_depth': 8, 'min_child_weight': 8, 'lambda': 1, 'alpha': 1}
The version of xgboost is '1.0.0'
The version of scikit-learn is '0.23.2'
The dataset was split into train and test sets (90/10).
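A 90/10 split can be sketched as below. The report does not say whether the split was random, stratified or time-based, so the seeded random shuffle is an assumption.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the dataset: 1000 row identifiers.
train_rows, test_rows = train_test_split(range(1000))
```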
Servers used were hosted in Île-de-France.
A cumulative total of 5440 seconds (about 1.5 hours) of computation was performed.
Total emissions are estimated to be 9.05e-03 kgCO2eq. It represents 0.29 tree-days.
The "tree-days" metric is the number of days a mature tree needs to absorb this quantity of CO2. On average, a tree absorbs 11 kg of CO2 per year.
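The conversion from emissions to tree-days follows directly from the absorption rate quoted above. The sketch assumes a 365-day year.

```python
TREE_ABSORPTION_KG_PER_YEAR = 11.0  # figure quoted in the report
DAYS_PER_YEAR = 365                 # assumption


def tree_days(emissions_kg):
    """Days a mature tree needs to absorb the given mass of CO2."""
    daily_absorption = TREE_ABSORPTION_KG_PER_YEAR / DAYS_PER_YEAR
    return emissions_kg / daily_absorption


estimate = tree_days(9.05e-03)
```

With these rounded inputs the result is about 0.30 tree-days, close to the 0.29 reported. The small gap may come from unrounded inputs or a slightly different absorption rate in the original computation.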
We used the Log Loss metric.
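For reference, binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal sketch, with probability clipping added as a common safeguard against log(0):

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities.
uninformed = log_loss([1, 0], [0.5, 0.5])  # equals ln(2)
confident = log_loss([1, 0], [0.9, 0.1])   # lower is better
```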
We chose hyperparameters and variables using 3-fold cross-validation.
Metric | Value |
---|---|
Log-Loss | 0.187 |
AUC | 0.902 |
Accuracy | 0.934 |
Threshold | 0.5 |
F1 score | 0.206 |
Precision | 0.535 |
Recall | 0.128 |
Confusion matrix:
| | Predicted Negative | Predicted Positive |
---|---|---|
Actual Negative | 24650 | 200 |
Actual Positive | 1570 | 230 |
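The threshold-dependent metrics can be recomputed from the four confusion-matrix counts. The sketch below assumes 230 true positives, 200 false positives, 1570 false negatives and 24650 true negatives, which is the reading under which the counts reproduce the reported accuracy, precision, recall and F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp=230, fp=200, fn=1570, tn=24650)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Rounded to three decimals this gives accuracy 0.934, precision 0.535, recall 0.128 and F1 0.206, matching the metrics table.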
Lift at 10%: 2.06
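The report does not define lift. A common definition, assumed here, is the positive rate among the 10% of instances with the highest predicted scores divided by the overall positive rate. The labels and scores below are made up.

```python
def lift_at(y_true, y_prob, fraction=0.10):
    """Positive rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate


# Hypothetical labels and scores: 2 positives out of 10 instances.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift = lift_at(y_true, y_prob)
```

A lift of 1 means the top-scored group is no richer in positives than the dataset as a whole.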
Protected features used in the dataset: Gender, Age.
The following metrics are computed with a threshold of 0.5.
Feature: Gender
Distribution of predictions per subgroup
Subgroup | 0 | 1 |
---|---|---|
global | 93.39% | 6.61% |
F | 93.31% | 6.69% |
M | 93.46% | 6.54% |
Performances of the model per subgroup
Subgroup | ratio | logloss | auc | accuracy | f1_score | precision | recall | adversarial_proportions |
---|---|---|---|---|---|---|---|---|
global | 100% | 0.19 | 0.9 | 0.93 | 0.21 | 0.53 | 0.13 | 0.0% |
F | 38.93% | 0.18 | 0.91 | 0.93 | 0.2 | 0.5 | 0.12 | 0.0% |
M | 61.07% | 0.19 | 0.89 | 0.94 | 0.21 | 0.57 | 0.13 | 0.0% |
Feature: Age
Distribution of predictions per subgroup
Subgroup | 0 | 1 |
---|---|---|
global | 93.39% | 6.61% |
0_0-20yr | 100.0% | 0.0% |
1_20-30yr | 89.43% | 10.57% |
2_30-40yr | 91.97% | 8.03% |
3_40-50yr | 100.0% | 0.0% |
4_over_50yr | 100.0% | 0.0% |
Performances of the model per subgroup
Subgroup | ratio | logloss | auc | accuracy | f1_score | precision | recall | adversarial_proportions |
---|---|---|---|---|---|---|---|---|
global | 100% | 0.19 | 0.9 | 0.93 | 0.21 | 0.53 | 0.13 | 7.05% |
0_0-20yr | 10.2% | 0.04 | 0.94 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0% |
1_20-30yr | 18.85% | 0.43 | 0.81 | 0.82 | 0.28 | 0.64 | 0.18 | 9.35% |
2_30-40yr | 28.33% | 0.35 | 0.77 | 0.86 | 0.15 | 0.39 | 0.09 | 13.45% |
3_40-50yr | 24.89% | 0.1 | 0.83 | 0.98 | 0.0 | 0.0 | 0.0 | 7.54% |
4_over_50yr | 17.74% | 0.04 | 0.92 | 1.0 | 0.0 | 0.0 | 0.0 | 3.8% |
The percentage of adversarial examples corresponds to the percentage of instances for which the prediction of the model can be modified by changing only the considered feature.
In other words, these are instances for which the model prediction would be different if the considered feature had another value, all other things being equal.
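The definition above can be sketched as follows: for each instance, try every alternative value of the protected feature and check whether the predicted class changes. The function `adversarial_proportion` and the rule-based `toy_predict` are hypothetical stand-ins, not the report's actual implementation or model.

```python
def adversarial_proportion(predict, rows, feature, candidate_values):
    """Share of rows whose prediction flips when only `feature` is changed."""
    flipped = 0
    for row in rows:
        original = predict(row)
        for value in candidate_values:
            if value == row[feature]:
                continue  # same value: not a modification
            if predict({**row, feature: value}) != original:
                flipped += 1
                break  # one flipping value is enough for this row
    return flipped / len(rows)


# Toy stand-in for the trained model, returning the class after thresholding.
def toy_predict(row):
    return int(row["Loan Amount"] > 100000 and row["Gender"] == "M")


# Hypothetical instances.
rows = [
    {"Gender": "M", "Loan Amount": 120000},
    {"Gender": "F", "Loan Amount": 150000},
    {"Gender": "F", "Loan Amount": 60000},
    {"Gender": "M", "Loan Amount": 50000},
]
share = adversarial_proportion(toy_predict, rows, "Gender", ["F", "M"])
```

In this toy case the first two instances change class when only `Gender` is swapped, so the proportion is 50%.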
Trained model unique signature is:
a5b4e3721bd9ea1db352b8672a2facb61058380869f09bd35bb0072695d88cbb