Hi,
I have been trying to reproduce the results of the paper using the OpenOOD scripts and saved checkpoints, and have run into some strange behavior: I can't reliably reproduce the post-processing OOD results (FPR@95, AUROC, etc.).
I am testing the ResNets and took the following steps:

1. Clone OpenOOD.
2. Set up the Python env.
3. Run the download script:

```bash
python scripts/download/download.py --contents 'datasets' 'checkpoints' \
    --datasets 'ood_v1.5' \
    --checkpoints 'ood_v1.5' \
    --save_dir './data' './results' \
    --dataset_mode 'benchmark'
```

4. Run a benchmark (in this case ResNet18 OOD with ASH):

```bash
python main.py --config configs/datasets/cifar10/cifar10.yml \
    configs/datasets/cifar10/cifar10_ood.yml \
    configs/networks/resnet18_32x32.yml \
    configs/pipelines/test/test_ood.yml \
    configs/preprocessors/base_preprocessor.yml \
    configs/postprocessors/ash.yml \
    --num_workers 8 \
    --network.checkpoint 'results/cifar10_resnet18_32x32_base_e100_lr0.1_default/s0/best.ckpt' \
    --mark 1
```

5. Repeat the process with all three checkpoints.
After running this, I get values that are much smaller than the reported ones. For example, for the farood split, the three runs (one row per checkpoint) report:
| FPR@95 | AUROC | AUPR_IN | AUPR_OUT | ACC |
| ------ | ----- | ------- | -------- | ----- |
| 40.41  | 91.80 | 79.26   | 94.30    | 95.22 |
| 35.82  | 91.98 | 81.84   | 94.43    | 94.63 |
| 48.16  | 89.98 | 71.78   | 94.25    | 95.32 |
Interestingly, if I follow the same commands but add `--seed n` to the arguments (n being the seed of the saved checkpoint), the values become closer to the reported ones.
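For concreteness, here is a sketch of that re-run with the seed pinned for each checkpoint; the seed-to-directory mapping (`s0` → seed 0, `s1` → 1, `s2` → 2) is my assumption based on the checkpoint folder names:

```bash
# Re-run the step-4 benchmark with --seed pinned per checkpoint.
# Assumption: checkpoint directory sN was trained with seed N.
for n in 0 1 2; do
    python main.py --config configs/datasets/cifar10/cifar10.yml \
        configs/datasets/cifar10/cifar10_ood.yml \
        configs/networks/resnet18_32x32.yml \
        configs/pipelines/test/test_ood.yml \
        configs/preprocessors/base_preprocessor.yml \
        configs/postprocessors/ash.yml \
        --num_workers 8 \
        --network.checkpoint "results/cifar10_resnet18_32x32_base_e100_lr0.1_default/s${n}/best.ckpt" \
        --seed "${n}" \
        --mark 1
done
```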
Any ideas as to what I'm doing wrong or what is happening?
In general, a few methods (those with random components in their design) can indeed be sensitive to random seeds. But if I remember correctly, this shouldn't be the case for ASH.
Also, it's interesting that the results you have shown here are actually a lot higher (rather than lower) than what we report for ASH on CIFAR-10. For example, per this full table, the farood AUROC (averaged over the three checkpoints) is only 78.49.
I cannot think of a cause for this. Would you mind trying the new evaluation interface, eval_ood.py, to see if you can reproduce the results? Nearly all of the numbers reported in OpenOOD v1.5 were obtained by running that file rather than the old interface (`python main.py --config ...`).
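For reference, a sketch of what that call might look like, following the invocation pattern in the OpenOOD README; the exact flags may differ between versions, so please check `python scripts/eval_ood.py --help` first:

```bash
# Sketch: v1.5 evaluation interface for the same ASH / CIFAR-10 setup.
# Flag names follow the OpenOOD README pattern and may vary by version.
python scripts/eval_ood.py \
    --id-data cifar10 \
    --root ./results/cifar10_resnet18_32x32_base_e100_lr0.1_default \
    --postprocessor ash \
    --save-score --save-csv
```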