🤺 Evaluation criteria
Evaluation will be performed on a held-out test set of 200 patients. One quarter of the test data will be drawn from the same source and distribution as the training data. The majority of the test data (three quarters), however, will consist of oncologic PET/CT examinations drawn from different sources, reflecting different domains and clinical settings.
A combination of three metrics reflecting the aims and specific challenges of the PET lesion segmentation task will be used:
1. Foreground Dice score of segmented lesions
2. Volume of false positive connected components that do not overlap with positives in the ground truth (= false positive volume)
3. Volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask (= false negative volume)
For test cases that do not contain positives (no FDG-avid lesions), only metrics 2 and 3 will be used.
Figure: Example of the evaluation. The Dice score is calculated to measure the overlap between the predicted lesion segmentation (blue) and the ground truth (red). In addition, special emphasis is put on false positives by measuring their volume (large false positives such as brain or bladder uptake result in a low score) and on false negatives by measuring their volume (i.e. entirely missed lesions).
A Python script computing these evaluation metrics is provided at https://github.com/lab-midas/autoPET.
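For orientation, a minimal sketch of the three metrics is given below. It assumes binary NumPy masks of equal shape and a known voxel volume in ml; function names and conventions are illustrative and not taken from the official script linked above.

```python
# Minimal sketch of the three evaluation metrics (not the official script).
# Assumes ground truth and prediction are binary NumPy arrays of equal shape
# and that the volume of a single voxel (in ml) is known.
import numpy as np
from scipy import ndimage


def dice_score(gt, pred):
    """Foreground Dice between ground-truth and predicted lesion masks."""
    intersection = np.logical_and(gt, pred).sum()
    denom = gt.sum() + pred.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0  # convention for empty masks


def false_positive_volume(gt, pred, voxel_vol_ml=1.0):
    """Volume of predicted connected components that do not overlap the ground truth."""
    labels, n_components = ndimage.label(pred)
    voxels = 0
    for i in range(1, n_components + 1):
        component = labels == i
        if not np.logical_and(component, gt).any():
            voxels += component.sum()
    return voxels * voxel_vol_ml


def false_negative_volume(gt, pred, voxel_vol_ml=1.0):
    """Volume of ground-truth lesions that are entirely missed by the prediction."""
    labels, n_components = ndimage.label(gt)
    voxels = 0
    for i in range(1, n_components + 1):
        component = labels == i
        if not np.logical_and(component, pred).any():
            voxels += component.sum()
    return voxels * voxel_vol_ml
```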
📈 Ranking
The submitted algorithms will be ranked as follows:
Step 1: Separate rankings will be computed for each metric.
Step 2: From the three ranking tables, the final rank of each participant will be computed as the weighted mean of the individual ranks (metric 1: 50% weight; metrics 2 and 3: 25% weight each); see the sketch below.
Step 3: In case of equal ranks, the achieved Dice score will be used as a tie-breaker.
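The rank aggregation in Step 2 could look like the following minimal sketch; the function name and the rank convention (1 = best) are illustrative assumptions, not the official implementation.

```python
# Illustrative sketch of the rank aggregation, assuming per-metric ranks
# (1 = best) are already available for one submission.
def aggregated_rank(rank_dice: float, rank_fp_volume: float, rank_fn_volume: float) -> float:
    # Metric 1 (Dice) carries 50% weight; metrics 2 and 3 (false positive /
    # false negative volume) carry 25% each.
    return 0.5 * rank_dice + 0.25 * rank_fp_volume + 0.25 * rank_fn_volume
```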
Award category 1
The metrics (Dice, false positive volume, false negative volume) will be used directly to generate the leaderboard (for Dice: higher scores are better; for false positive and false negative volumes: lower values are better).
Award category 2
The goal of award category 2 is to identify contributions that provide stable results across different data sets. However, low variation in performance across different environments is only meaningful if the performance itself is acceptable; there is a trade-off between accuracy and variance of results. In this category, the ranking will therefore be defined in terms of the variance of the same metrics as in category 1. To avoid trivial submissions with constant low performance, only contributions with above-median Dice performance (compared to all contributions) will be considered.
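As an illustration only, the above-median filter followed by a variance-based ordering could be sketched as follows. This is a deliberately simplified assumption that considers the Dice metric alone, whereas the official category-2 ranking uses the variance of all three category-1 metrics.

```python
# Illustrative sketch of the category-2 selection, simplified to the Dice
# metric only. Input: per-data-set Dice scores for each submission.
import statistics


def rank_by_stability(dice_by_submission: dict[str, list[float]]) -> list[str]:
    """Keep submissions with above-median mean Dice, then order them by
    increasing variance across data sets (lower variance = more stable)."""
    mean_dice = {name: statistics.mean(scores)
                 for name, scores in dice_by_submission.items()}
    median_dice = statistics.median(mean_dice.values())
    eligible = [name for name, m in mean_dice.items() if m > median_dice]
    return sorted(eligible,
                  key=lambda name: statistics.variance(dice_by_submission[name]))
```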
Award category 3
Award category 3 will be evaluated by a jury of at least two independent experts in the field.
Please note that participants can only contribute one algorithm for all three categories. We will not accept different contributions for each category.
Please note that the displayed ranking in the leaderboard does not represent the final ranking according to the above award categories. The final ranking will be announced at MICCAI.