Installation

We support two installation modes: through pip_ or through mamba_ (conda).

Setup

A configuration file is useful for setting global options that are reused often. Its location depends on the value of the environment variable $XDG_CONFIG_HOME and defaults to ~/.config/ptbench.toml. You may edit this file with your preferred editor.
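The lookup rule described above can be sketched in a few lines of Python. This is only an illustration, not the package's own code: the helper name `config_path` is hypothetical, and it assumes the file sits directly inside `$XDG_CONFIG_HOME` when that variable is set and non-empty.

```python
import os
from pathlib import Path


def config_path(filename: str = "ptbench.toml") -> Path:
    """Return the expected location of the configuration file.

    Assumption: the file lives directly inside $XDG_CONFIG_HOME when that
    variable is set and non-empty, and inside ~/.config otherwise.
    """
    base = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / filename


print(config_path())
```

Printing the result is a quick way to confirm which file to edit on a given machine.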

Here is an example configuration file that may be useful as a starting point:

[datadir]
indian = "/Users/myself/dbs/tbxpredict"
montgomery = "/Users/myself/dbs/montgomery-xrayset"
shenzhen = "/Users/myself/dbs/shenzhen"
nih_cxr14_re = "/Users/myself/dbs/nih-cxr14-re"
tbx11k_simplified = "/Users/myself/dbs/tbx11k-simplified"

[nih_cxr14_re]
idiap_folder_structure = false  # set to `true` if at Idiap

Tip

To get a list of valid data directories that can be configured, execute:

ptbench dataset list

You must procure and download the datasets yourself. The raw data is not included in this package, as we are not authorised to redistribute it.

To check whether the downloaded version is consistent with the structure that is expected by this package, run:

ptbench dataset check montgomery

Supported Datasets

Here is a list of currently supported datasets in this package, alongside notable properties. Each dataset name is linked to the location where raw data can be downloaded. The list of images in each split is available in the source code.

Tuberculosis datasets

The following datasets contain only the tuberculosis final diagnosis (0 or 1). In addition to the splits presented in the following table, 10 randomly generated folds (for cross-validation) are available for these datasets.

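=========== =========================== =========== ======= ======== ========== ====
Dataset     Reference                   H x W       Samples Training Validation Test
=========== =========================== =========== ======= ======== ========== ====
Montgomery_ [MONTGOMERY-SHENZHEN-2014]_ 4020 x 4892 138     88       22         28
Shenzhen_   [MONTGOMERY-SHENZHEN-2014]_ Varying     662     422      107        133
Indian_     [INDIAN-2013]_              Varying     155     83       20         52
=========== =========================== =========== ======= ======== ========== ====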
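The pre-generated cross-validation folds ship with the package, so there is no need to create them. For readers unfamiliar with the idea, here is a minimal sketch of how 10 random, disjoint folds could be drawn over 138 sample indices (the Montgomery sample count). The function `make_folds` and the seed are illustrative assumptions; this is not the procedure used to produce the package's folds, and it does not stratify by label.

```python
import random


def make_folds(n_samples: int, n_folds: int = 10, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


folds = make_folds(138)  # 138 = Montgomery sample count from the table
for k, held_out in enumerate(folds):
    # Each fold serves once as the held-out set; the rest form the training set.
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(k, len(train), len(held_out))
```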

Tuberculosis multilabel dataset

The following dataset contains the labels healthy, sick & non-TB, active TB, and latent TB. The tbx11k dataset implemented in this package is based on the simplified version, which is a more compact version of the original. In addition to the splits presented in the following table, 10 randomly generated folds (for cross-validation) are available for these datasets.

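================== ========================= ========= ======= ======== ========== =====
Dataset            Reference                 H x W     Samples Training Validation Test
================== ========================= ========= ======= ======== ========== =====
TBX11K_            [TBX11K-2020]_            512 x 512 11'200  6'600    1'800      2'800
TBX11K-SIMPLIFIED_ [TBX11K-SIMPLIFIED-2020]_ 512 x 512 11'200  6'600    1'800      2'800
================== ========================= ========= ======= ======== ========== =====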

Tuberculosis + radiological findings dataset

The following dataset contains both the tuberculosis final diagnosis (0 or 1) and radiological findings.

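========= ================ ======= ======= ======== ====
Dataset   Reference        H x W   Samples Training Test
========= ================ ======= ======= ======== ====
PadChest_ [PADCHEST-2019]_ Varying 160'861 160'861  0
========= ================ ======= ======= ======== ====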

Radiological findings datasets

The following dataset contains only the radiological findings without any information about tuberculosis.

Note

NIH CXR14 labels for the training and validation sets are the relabeled versions produced by the author of the CheXNeXt study [CHEXNEXT-2018]_.

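============= ================= =========== ======= ======== ========== =====
Dataset       Reference         H x W       Samples Training Validation Test
============= ================= =========== ======= ======== ========== =====
NIH_CXR14_re_ [NIH-CXR14-2017]_ 1024 x 1024 109'041 98'637   6'350      4'054
============= ================= =========== ======= ======== ========== =====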

HIV-Tuberculosis datasets

The following datasets contain only the tuberculosis final diagnosis (0 or 1) and come from HIV-infected patients. 10 randomly generated folds (for cross-validation) are available for these datasets.

Please contact the authors of these datasets to obtain access to the data.

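======= ============== =========== =======
Dataset Reference      H x W       Samples
======= ============== =========== =======
TB POC  [TB-POC-2018]_ 2048 x 2500 407
HIV TB  [HIV-TB-2019]_ 2048 x 2500 243
======= ============== =========== =======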