Commit 2d66af36 authored by André Anjos
[doc] Add section about db xtests

parent 8d6f009b
Merge request !12 (Streamlining)
Pipeline #39314 passed
Showing changes with 277 additions and 111 deletions
@@ -306,11 +306,11 @@ def run(
# (avg_metrics["precision"]+avg_metrics["recall"])
avg_metrics["std_pr"] = std_metrics["precision"]
avg_metrics["pr_upper"] = avg_metrics["precision"] + avg_metrics["std_pr"]
avg_metrics["pr_lower"] = avg_metrics["precision"] - avg_metrics["std_pr"]
avg_metrics["pr_upper"] = avg_metrics["precision"] + std_metrics["precision"]
avg_metrics["pr_lower"] = avg_metrics["precision"] - std_metrics["precision"]
avg_metrics["std_re"] = std_metrics["recall"]
avg_metrics["re_upper"] = avg_metrics["recall"] + avg_metrics["std_re"]
avg_metrics["re_lower"] = avg_metrics["recall"] - avg_metrics["std_re"]
avg_metrics["re_upper"] = avg_metrics["recall"] + std_metrics["recall"]
avg_metrics["re_lower"] = avg_metrics["recall"] - std_metrics["recall"]
avg_metrics["std_f1"] = std_metrics["f1_score"]
maxf1 = avg_metrics["f1_score"].max()
@@ -406,6 +406,7 @@ def compare_annotators(baseline, other, name, output_folder,
# Merges all dataframes together
df_metrics = pandas.concat(data.values())
df_metrics.drop(0, inplace=True)
# Report and Averages
avg_metrics = df_metrics.groupby("index").mean()
@@ -420,17 +421,13 @@ def compare_annotators(baseline, other, name, output_folder,
# (avg_metrics["precision"]+avg_metrics["recall"])
avg_metrics["std_pr"] = std_metrics["precision"]
avg_metrics["pr_upper"] = avg_metrics["precision"] + avg_metrics["std_pr"]
avg_metrics["pr_lower"] = avg_metrics["precision"] - avg_metrics["std_pr"]
avg_metrics["pr_upper"] = avg_metrics["precision"] + std_metrics["precision"]
avg_metrics["pr_lower"] = avg_metrics["precision"] - std_metrics["precision"]
avg_metrics["std_re"] = std_metrics["recall"]
avg_metrics["re_upper"] = avg_metrics["recall"] + avg_metrics["std_re"]
avg_metrics["re_lower"] = avg_metrics["recall"] - avg_metrics["std_re"]
avg_metrics["re_upper"] = avg_metrics["recall"] + std_metrics["recall"]
avg_metrics["re_lower"] = avg_metrics["recall"] - std_metrics["recall"]
avg_metrics["std_f1"] = std_metrics["f1_score"]
# we actually only need to keep the second row of the pandas dataframe
# with threshold == 0.5 - the first row is redundant
avg_metrics.drop(0, inplace=True)
metrics_path = os.path.join(output_folder, "second-annotator", f"{name}.csv")
os.makedirs(os.path.dirname(metrics_path), exist_ok=True)
logger.info(f"Saving averages over all input images at {metrics_path}...")
@@ -10,8 +10,10 @@ F1 Scores (micro-level)
-----------------------
* Benchmark results for models: DRIU, HED, M2U-Net and U-Net.
* Models are trained and tested on the same dataset (numbers in parenthesis
indicate number of parameters per model)
* Models are trained and tested on the same dataset (**numbers in bold**
indicate the number of parameters per model). Models are trained for a fixed
1000 epochs, with a learning rate of 0.001 until epoch 900 and then 0.0001
until the end of training (a sketch of this schedule follows this list).
* Database and model resource configuration links (table top row and left
column) point to the originating configuration files used to obtain these
results.
@@ -21,24 +23,26 @@ F1 Scores (micro-level)
where the threshold is previously selected on the training set
* You can cross check the analysis numbers provided in this table by
downloading this software package, the raw data, and running ``bob binseg
analyze`` providing the model URL as ``--weight`` parameter. Otherwise, we
also provide `CSV files
<https://www.idiap.ch/software/bob/data/bob/bob.ip.binseg/master/baselines/>`_
with the estimated performance per threshold (100
steps) per subset.
analyze`` providing the model URL as ``--weight`` parameter.
* For comparison purposes, we provide "second-annotator" performances on the
same test set, where available.
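The fixed training schedule mentioned in the list above (1000 epochs, with the
learning rate dropping from 0.001 to 0.0001 at epoch 900) could be written
with a standard PyTorch step scheduler as sketched below. This only
illustrates the schedule; it is not necessarily how the training loop in this
package configures it.

.. code-block:: python

   import torch

   # any small model/optimizer will do to illustrate the schedule
   model = torch.nn.Linear(10, 1)
   optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

   # lr = 0.001 for epochs 0..899, then 0.0001 (gamma=0.1) from epoch 900 on
   scheduler = torch.optim.lr_scheduler.MultiStepLR(
       optimizer, milestones=[900], gamma=0.1
   )

   for epoch in range(1000):
       optimizer.zero_grad()
       loss = model(torch.zeros(1, 10)).sum()  # placeholder "training" step
       loss.backward()
       optimizer.step()
       scheduler.step()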
.. list-table::
:header-rows: 1
:header-rows: 2
* -
-
- :py:mod:`driu <bob.ip.binseg.configs.models.driu>`
- :py:mod:`hed <bob.ip.binseg.configs.models.hed>`
- :py:mod:`m2unet <bob.ip.binseg.configs.models.m2unet>`
- :py:mod:`unet <bob.ip.binseg.configs.models.unet>`
* - Dataset
- 2nd. Annot.
- :py:mod:`driu (15M) <bob.ip.binseg.configs.models.driu>`
- :py:mod:`hed (14.7M) <bob.ip.binseg.configs.models.hed>`
- :py:mod:`m2unet (0.55M) <bob.ip.binseg.configs.models.m2unet>`
- :py:mod:`unet (25.8M) <bob.ip.binseg.configs.models.unet>`
- 15M
- 14.7M
- 0.55M
- 25.8M
* - :py:mod:`drive <bob.ip.binseg.configs.datasets.drive.default>`
- 0.788 (0.021)
- `0.819 (0.016) <baselines_driu_drive_>`_
@@ -52,7 +56,7 @@ F1 Scores (micro-level)
- `0.811 (0.039) <baselines_m2unet_stare_>`_
- `0.828 (0.041) <baselines_unet_stare_>`_
* - :py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.first_annotator>`
- 0.768 0.023
- 0.768 (0.023)
- `0.811 (0.018) <baselines_driu_chase_>`_
- `0.806 (0.021) <baselines_hed_chase_>`_
- `0.801 (0.018) <baselines_m2unet_chase_>`_
@@ -80,39 +84,53 @@ set performances. Single performance figures (F1-micro scores) correspond to
their average value across all test set images, for a fixed
``0.5``.
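To make the previous sentence concrete, the sketch below binarises
hypothetical probability maps at the fixed threshold of 0.5, computes a
micro-averaged (per-pixel) F1 for each image, and reports the mean and sample
standard deviation over a toy test set, matching the ``mean (std)`` format
used in the table above. It uses plain numpy and is not the package's
evaluation code.

.. code-block:: python

   import numpy

   def f1_at_threshold(probabilities, ground_truth, threshold=0.5):
       """Micro (per-pixel) F1 for one image, at a fixed threshold."""
       prediction = probabilities >= threshold
       tp = numpy.sum(prediction & ground_truth)
       fp = numpy.sum(prediction & ~ground_truth)
       fn = numpy.sum(~prediction & ground_truth)
       precision = tp / (tp + fp) if (tp + fp) else 0.0
       recall = tp / (tp + fn) if (tp + fn) else 0.0
       if precision + recall == 0.0:
           return 0.0
       return 2 * precision * recall / (precision + recall)

   # toy test set: (vessel probability map, binary ground-truth mask) pairs
   rng = numpy.random.default_rng(0)
   test_set = [(rng.random((64, 64)), rng.random((64, 64)) > 0.7) for _ in range(5)]

   scores = [f1_at_threshold(p, g) for (p, g) in test_set]
   print(f"{numpy.mean(scores):.3f} ({numpy.std(scores, ddof=1):.3f})")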
.. figure:: drive.png
:align: center
:alt: Model comparisons for drive datasets
:py:mod:`drive <bob.ip.binseg.configs.datasets.drive.default>`: PR curve and F1 scores at T=0.5 (:download:`pdf <drive.pdf>`)
.. figure:: stare.png
:align: center
:alt: Model comparisons for stare datasets
:py:mod:`stare <bob.ip.binseg.configs.datasets.stare.ah>`: PR curve and F1 scores at T=0.5 (:download:`pdf <stare.pdf>`)
.. figure:: chasedb1.png
:align: center
:alt: Model comparisons for chasedb1 datasets
:py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.first_annotator>`: PR curve and F1 scores at T=0.5 (:download:`pdf <chasedb1.pdf>`)
.. figure:: hrf.png
:align: center
:alt: Model comparisons for hrf datasets
:py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.default>`: PR curve and F1 scores at T=0.5 (:download:`pdf <hrf.pdf>`)
.. figure:: iostar-vessel.png
:align: center
:alt: Model comparisons for iostar-vessel datasets
:py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel>`: PR curve and F1 scores at T=0.5 (:download:`pdf <iostar-vessel.pdf>`)
.. list-table::
* - .. figure:: drive.png
:align: center
:scale: 50%
:alt: Model comparisons for drive datasets
:py:mod:`drive <bob.ip.binseg.configs.datasets.drive.default>`: PR curve and F1 scores at T=0.5 (:download:`pdf <drive.pdf>`)
- .. figure:: stare.png
:align: center
:scale: 50%
:alt: Model comparisons for stare datasets
:py:mod:`stare <bob.ip.binseg.configs.datasets.stare.ah>`: PR curve and F1 scores at T=0.5 (:download:`pdf <stare.pdf>`)
* - .. figure:: chasedb1.png
:align: center
:scale: 50%
:alt: Model comparisons for chasedb1 datasets
:py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.first_annotator>`: PR curve and F1 scores at T=0.5 (:download:`pdf <chasedb1.pdf>`)
- .. figure:: hrf.png
:align: center
:scale: 50%
:alt: Model comparisons for hrf datasets
:py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.default>`: PR curve and F1 scores at T=0.5 (:download:`pdf <hrf.pdf>`)
* - .. figure:: iostar-vessel.png
:align: center
:scale: 50%
:alt: Model comparisons for iostar-vessel datasets
:py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel>`: PR curve and F1 scores at T=0.5 (:download:`pdf <iostar-vessel.pdf>`)
-
Remarks
-------
* There seems to be no clear winner: confidence intervals based on one
standard deviation overlap substantially between the different models and
across the different datasets (a quick numerical check follows this list).
* The number of parameters seems to have almost no effect on performance.
U-Net, the largest model, is not a clear winner across the baseline
benchmarks.
* Where second-annotator labels exist, model performance and variability seem
on par with such annotations. One possible exception is CHASE-DB1, where
models show consistently less variability than the second annotator.
Unfortunately, this observation is not conclusive.
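As a quick check of the first remark, take two STARE entries from the table
above, M2U-Net at 0.811 (0.039) and U-Net at 0.828 (0.041): their
one-standard-deviation intervals overlap, as the following illustrative
arithmetic on the published numbers shows.

.. code-block:: python

   # mean and standard deviation of F1 on the STARE test set (from the table)
   m2unet = (0.811, 0.039)
   unet = (0.828, 0.041)

   def interval(mean, std):
       return (mean - std, mean + std)

   a, b = interval(*m2unet), interval(*unet)
   overlap = min(a[1], b[1]) - max(a[0], b[0])
   print(f"m2unet: {a}, unet: {b}, overlap: {overlap:.3f}")  # positive -> overlap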
.. include:: ../../links.rst
File added: doc/results/xtest/driu-chasedb1.png (160 KiB)
File added: doc/results/xtest/driu-drive.png (179 KiB)
File added: doc/results/xtest/driu-hrf.png (177 KiB)
File added: doc/results/xtest/driu-iostar-vessel.png (156 KiB)
File added: doc/results/xtest/driu-stare.png (174 KiB)
@@ -2,29 +2,29 @@
.. _bob.ip.binseg.results.xtest:
======================
Cross-Database Tests
======================
==========================
Cross-Database (X-)Tests
==========================
F1 Scores (micro-level)
-----------------------
* Benchmark results for models: DRIU, HED, M2U-Net and U-Net.
* Models are trained and tested on the same dataset (numbers in parentheses
indicate the number of parameters per model), and then evaluated across the test
sets of other datasets.
sets of other databases. X-tested datasets therefore represent *unseen*
data and can be a good proxy for generalisation analysis.
* Each table row indicates a base trained model and each column the databases
the model was tested against. The native performance (intra-database) is
marked **in bold**. Thresholds are chosen *a priori* on the training set of
the database used to generate the model being cross-tested. Hence, the
threshold used for all experiments in the same row is always the same.
* You can cross check the analysis numbers provided in this table by
downloading this software package, the raw data, and running ``bob binseg
analyze`` providing the model URL as ``--weight`` parameter, and then the
``-xtest`` resource variant of the dataset the model was trained on. For
example, to run cross-evaluation tests for the DRIVE dataset, use the
configuration resource :py:mod:`drive-xtest
<bob.ip.binseg.configs.datasets.drive.xtest>`. Otherwise, we
also provide `CSV files
<https://www.idiap.ch/software/bob/data/bob/bob.ip.binseg/master/xtest/>`_
with the estimated performance per threshold (100 steps) per subset.
* For comparison purposes, we provide "second-annotator" performances on the
same test set, where available.
<bob.ip.binseg.configs.datasets.drive.xtest>`.
* We only show results for DRIU (~15.4 million parameters) and M2U-Net (~550
thousand parameters) as these models seem to represent the performance
extremes according to our :ref:`baseline analysis
@@ -43,48 +43,199 @@ DRIU
.. list-table::
:header-rows: 1
* - Model / X-Test
- :py:mod:`drive <bob.ip.binseg.configs.datasets.drive.xtest>`
- :py:mod:`stare <bob.ip.binseg.configs.datasets.stare.xtest>`
- :py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.xtest>`
- :py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.xtest>`
- :py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel_xtest>`
* - `drive <baselines_driu_drive_>`_
-
-
-
-
-
* - `stare <baselines_driu_stare_>`_
-
-
-
-
-
* - `chasedb1 <baselines_driu_chase_>`_
-
-
-
-
-
* - `hrf <baselines_driu_hrf_>`_
-
-
-
-
-
* - `iostar-vessel <baselines_driu_iostar_>`_
-
-
-
-
-
Precision-Recall (PR) Curves
----------------------------
:header-rows: 2
* -
- drive
- stare
- chasedb1
- hrf
- iostar-vessel
* - Model / W x H
- 544 x 544
- 704 x 608
- 960 x 960
- 1648 x 1168
- 1024 x 1024
* - :py:mod:`drive <bob.ip.binseg.configs.datasets.drive.default>` (`model <baselines_driu_drive_>`_)
- **0.819 (0.016)**
- 0.759 (0.151)
- 0.321 (0.068)
- 0.711 (0.067)
- 0.493 (0.049)
* - :py:mod:`stare <bob.ip.binseg.configs.datasets.stare.ah>` (`model <baselines_driu_stare_>`_)
- 0.733 (0.037)
- **0.824 (0.037)**
- 0.491 (0.094)
- 0.773 (0.051)
- 0.469 (0.055)
* - :py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.first_annotator>` (`model <baselines_driu_chase_>`_)
- 0.730 (0.023)
- 0.730 (0.101)
- **0.811 (0.018)**
- 0.779 (0.043)
- 0.774 (0.019)
* - :py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.default>` (`model <baselines_driu_hrf_>`_)
- 0.702 (0.038)
- 0.641 (0.160)
- 0.600 (0.072)
- **0.802 (0.039)**
- 0.546 (0.078)
* - :py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel>` (`model <baselines_driu_iostar_>`_)
- 0.758 (0.019)
- 0.724 (0.115)
- 0.777 (0.032)
- 0.727 (0.059)
- **0.825 (0.021)**
Next, you will find the PR plots showing confidence intervals for the various
cross-tests explored, arranged per cross-tested model. All curves correspond
to test set performances. Single performance figures (F1-micro scores)
correspond to their average value across all test set images, for a fixed
threshold set *a priori* on the training set of the dataset used to create the
model.
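The *a priori* protocol described above can be sketched as follows: pick the
threshold that maximises the mean F1 over the training set of the database
used to build the model (an assumption here; the text only states that the
threshold is chosen *a priori* on the training set), then reuse that same
threshold, unchanged, for every cross-tested database in the row. The
dataframe below holds synthetic values; the actual selection is performed by
the ``bob binseg analyze`` machinery.

.. code-block:: python

   import numpy
   import pandas

   # mean F1 per candidate threshold on the *training* set of the source
   # database (synthetic values; in practice this comes from the evaluation step)
   thresholds = numpy.linspace(0.0, 1.0, 100)
   train_avg = pandas.DataFrame(
       {"threshold": thresholds, "f1_score": 0.80 - (thresholds - 0.42) ** 2}
   )

   # a priori choice: threshold maximising mean F1 on the training set
   chosen = train_avg.loc[train_avg["f1_score"].idxmax(), "threshold"]
   print(f"threshold chosen a priori: {chosen:.2f}")

   # the same `chosen` value is then applied, unchanged, to every cross-tested
   # database in the corresponding table row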
.. list-table::
* - .. figure:: driu-drive.png
:align: center
:scale: 40%
:alt: X-tests for a DRIU model based on DRIVE
:py:mod:`drive <bob.ip.binseg.configs.datasets.drive.xtest>`: DRIU model X-tested (:download:`pdf <driu-drive.pdf>`)
- .. figure:: driu-stare.png
:align: center
:scale: 40%
:alt: X-tests for a DRIU model based on STARE
:py:mod:`stare <bob.ip.binseg.configs.datasets.stare.xtest>`: DRIU model X-tested (:download:`pdf <driu-stare.pdf>`)
* - .. figure:: driu-chasedb1.png
:align: center
:scale: 40%
:alt: X-tests for a DRIU model based on CHASE-DB1
:py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.xtest>`: DRIU model X-tested (:download:`pdf <driu-chasedb1.pdf>`)
- .. figure:: driu-hrf.png
:align: center
:scale: 40%
:alt: X-tests for a DRIU model based on HRF
:py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.xtest>`: DRIU model X-tested (:download:`pdf <driu-hrf.pdf>`)
* - .. figure:: driu-iostar-vessel.png
:align: center
:scale: 40%
:alt: X-tests for a DRIU model based on IOSTAR (vessel)
:py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel_xtest>`: DRIU model X-tested (:download:`pdf <driu-iostar-vessel.pdf>`)
-
M2U-Net
=======
.. list-table::
:header-rows: 2
* -
- drive
- stare
- chasedb1
- hrf
- iostar-vessel
* - Model / W x H
- 544 x 544
- 704 x 608
- 960 x 960
- 1648 x 1168
- 1024 x 1024
* - :py:mod:`drive <bob.ip.binseg.configs.datasets.drive.default>` (`model <baselines_m2unet_drive_>`_)
- **0.804 (0.014)**
- 0.736 (0.144)
- 0.548 (0.055)
- 0.744 (0.058)
- 0.722 (0.036)
* - :py:mod:`stare <bob.ip.binseg.configs.datasets.stare.ah>` (`model <baselines_m2unet_stare_>`_)
- 0.715 (0.031)
- **0.811 (0.039)**
- 0.632 (0.033)
- 0.765 (0.049)
- 0.673 (0.033)
* - :py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.first_annotator>` (`model <baselines_m2unet_chase_>`_)
- 0.677 (0.027)
- 0.695 (0.099)
- **0.801 (0.018)**
- 0.763 (0.040)
- 0.761 (0.018)
* - :py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.default>` (`model <baselines_m2unet_hrf_>`_)
- 0.591 (0.071)
- 0.460 (0.230)
- 0.332 (0.108)
- **0.796 (0.043)**
- 0.419 (0.088)
* - :py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel>` (`model <baselines_m2unet_iostar_>`_)
- 0.743 (0.019)
- 0.745 (0.076)
- 0.771 (0.030)
- 0.749 (0.052)
- **0.817 (0.021)**
Next, you will find the PR plots showing confidence intervals for the various
cross-tests explored, arranged per cross-tested model. All curves correspond
to test set performances. Single performance figures (F1-micro scores)
correspond to their average value across all test set images, for a fixed
threshold set *a priori* on the training set of the dataset used to create the
model.
.. list-table::
* - .. figure:: m2unet-drive.png
:align: center
:scale: 40%
:alt: X-tests for a M2U-Net model based on DRIVE
:py:mod:`drive <bob.ip.binseg.configs.datasets.drive.xtest>`: M2U-Net model X-tested (:download:`pdf <m2unet-drive.pdf>`)
- .. figure:: m2unet-stare.png
:align: center
:scale: 40%
:alt: X-tests for a M2U-Net model based on STARE
:py:mod:`stare <bob.ip.binseg.configs.datasets.stare.xtest>`: M2U-Net model X-tested (:download:`pdf <m2unet-stare.pdf>`)
* - .. figure:: m2unet-chasedb1.png
:align: center
:scale: 40%
:alt: X-tests for a M2U-Net model based on CHASE-DB1
:py:mod:`chasedb1 <bob.ip.binseg.configs.datasets.chasedb1.xtest>`: M2U-Net model X-tested (:download:`pdf <m2unet-chasedb1.pdf>`)
- .. figure:: m2unet-hrf.png
:align: center
:scale: 40%
:alt: X-tests for a M2U-Net model based on HRF
:py:mod:`hrf <bob.ip.binseg.configs.datasets.hrf.xtest>`: M2U-Net model X-tested (:download:`pdf <m2unet-hrf.pdf>`)
* - .. figure:: m2unet-iostar-vessel.png
:align: center
:scale: 40%
:alt: X-tests for a M2U-Net model based on IOSTAR (vessel)
:py:mod:`iostar-vessel <bob.ip.binseg.configs.datasets.iostar.vessel_xtest>`: M2U-Net model X-tested (:download:`pdf <m2unet-iostar-vessel.pdf>`)
-
Remarks
-------
* For each row, the peak performance is always obtained in an intra-database
test (training and testing on the same database). In contrast, we observe a
performance degradation (albeit not catastrophic in most cases) on all other
databases in the cross-test.
* X-test performance of a model created from HRF suggests a strong bias, as
performance does not generalize well to other (unseen) datasets (a worked
example follows this list).
* Models generated from CHASE-DB1 and IOSTAR (vessel) seem to generalize quite
well to unseen data, when compared to the relatively poor generalization
capabilities of models generated from HRF or DRIVE.
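As a worked illustration of the remark on HRF, the relative drop of the DRIU
model trained on HRF can be computed directly from the values in the DRIU
table above.

.. code-block:: python

   # F1 means for the DRIU model trained on HRF (row copied from the table above)
   intra = 0.802  # hrf tested on hrf
   cross = {"drive": 0.702, "stare": 0.641, "chasedb1": 0.600, "iostar-vessel": 0.546}

   for name, value in cross.items():
       drop = 100.0 * (intra - value) / intra
       print(f"{name}: {value:.3f} ({drop:.1f}% relative drop from {intra:.3f})")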
.. include:: ../../links.rst
File added: doc/results/xtest/m2unet-chasedb1.png (165 KiB)
File added: doc/results/xtest/m2unet-drive.png (170 KiB)
File added: doc/results/xtest/m2unet-hrf.png (186 KiB)
File added