diff --git a/doc/data-model.rst b/doc/data-model.rst
index 9fdbd08a3b95af8696a386ac5eaa47c98671750d..5915ce4dd05f243bef45e9259fe96074f511f111 100644
--- a/doc/data-model.rst
+++ b/doc/data-model.rst
@@ -38,18 +38,27 @@ user A may be under ``/home/user-a/databases/database-1`` and
 Sample
 ------
 
-The in-memory representation of the raw database samples. In this package, it
-is specified as a two-tuple with a tensor (or a dictionary with multiple
-tensors), and metadata (typically label, name, etc.).
+The in-memory representation of a raw database sample. A ``Sample`` is
+specified as a dictionary containing at least the following keys:
+
+* ``image`` (:py:class:`torch.Tensor`): the image to be analysed
+* ``target`` (:py:class:`torch.Tensor`): the target for the current task
+* ``name`` (:py:class:`str`): a unique name for this sample
+
+Optionally, depending on the task, the following keys may also be present:
+
+* ``mask`` (:py:class:`torch.Tensor`): an inclusion mask for the input image
+  and targets. If set, it is used to evaluate errors only within the
+  masked area.
 
 
 RawDataLoader
 -------------
 
-A concrete "functor" that allows one to load the raw data and associated
-metadata, to create a in-memory Sample representation. RawDataLoaders are
-typically Database-specific due to raw data and metadata encoding varying quite
-a lot on different databases. RawDataLoaders may also embed various
+A callable object that allows one to load the raw data and associated metadata,
+creating an in-memory ``Sample`` representation. Concrete ``RawDataLoader``\s
+are typically database-specific, as raw data and metadata encodings vary
+widely across databases. ``RawDataLoader``\s may also embed various
 pre-processing transformations to render data readily usable such as
 pre-cropping of black pixel areas, or 16-bit to 8-bit auto-level conversion.
 
@@ -57,27 +66,35 @@ pre-cropping of black pixel areas, or 16-bit to 8-bit auto-level conversion.
 TransformSequence
 -----------------
 
-A sequence of callables that allows one to transform torch.Tensor objects into
-other torch.Tensor objects, typically to crop, resize, convert Color-spaces,
-and the such on raw-data.
+A sequence of callables that allows one to transform :py:class:`torch.Tensor`
+objects into other :py:class:`torch.Tensor` objects, typically to crop, resize,
+or convert color spaces of raw data. ``TransformSequence``\s are used in two
+main parts of this library: to power raw-data loading and the transformations
+required to fit data into a model (e.g. ensuring images are grayscale or
+resized to a certain size), and to implement data augmentations for
+training-time usage.
 
 
 DatabaseSplit
 -------------
 
-A dictionary that represents an organization of the available raw data in the
-database to perform an evaluation protocol (e.g. train, validation, test)
-through datasets (or subsets). It is represented as dictionary mapping dataset
-names to lists of "raw-data" sample representations, which vary in format
-depending on Database metadata availability. RawDataLoaders receive this raw
-representations and can convert these to in-memory Sample's.
+A dictionary-like object that represents an organization of the available raw
+data in the database to perform an evaluation protocol (e.g. train, validation,
+test) through datasets (or subsets). It is represented as a dictionary mapping
+dataset names to lists of "raw-data" ``Sample`` representations, which vary in
+format depending on database metadata availability. ``RawDataLoader``\s receive
+these raw representations and convert them to in-memory ``Sample``\s.
+:py:class:`mednet.data.split.JSONDatabaseSplit` is a concrete example of a
+``DatabaseSplit`` implementation that can read the split definition from JSON
+files, and is used throughout the library to represent the various database
+splits supported.
 
 
 ConcatDatabaseSplit
 -------------------
 
-An extension of a DatabaseSplit, in which the split can be formed by
-cannibalising various other DatabaseSplits to construct a new evaluation
+An extension of a ``DatabaseSplit``, in which the split can be formed by
+reusing various other ``DatabaseSplit``\s to construct a new evaluation
 protocol. Examples of this are cross-database tests, or the construction of
 multi-Database training and validation subsets.
 
@@ -85,20 +102,22 @@ multi-Database training and validation subsets.
 Dataset
 -------
 
-An iterable object over in-memory Samples, inherited from the pytorch Dataset
-definition. A dataset in our framework may be completely cached in memory or
-have in-memory representation of samples loaded on demand. After data loading,
-our datasets can optionally apply a TransformSequence, composed of
-pre-processing steps defined on a per-model level before optionally caching
-in-memory Sample representations. The "raw" representation of a dataset are the
-split dictionary values (ie. not the keys).
+An iterable object over in-memory ``Sample``\s, inheriting from
+:py:class:`torch.utils.data.Dataset`. A ``Dataset`` in this framework may be
+completely cached in memory, or have in-memory representations of ``Sample``\s
+loaded on demand. After data loading, ``Dataset``\s can optionally apply a
+``TransformSequence``, composed of pre-processing steps defined on a per-model
+level, before optionally caching in-memory ``Sample`` representations. The
+"raw" representation of a ``Dataset`` is the split dictionary values (i.e. not
+the keys).
 
 
 DataModule
 ----------
 
-A DataModule aggregates Splits and RawDataLoaders to provide lightning a
-known-interface to the complete evaluation protocol (train, validation,
-prediction and testing) required for a full experiment to take place. It
-automates control over data loading parallelisation and caching inside our
-framework, providing final access to readily-usable pytorch DataLoaders.
+A ``DataModule`` aggregates ``DatabaseSplit``\s and ``RawDataLoader``\s to
+provide lightning with a known interface to the complete evaluation protocol
+(train, validation, prediction and testing) required for a full experiment to
+take place. It automates control over data loading parallelisation and caching
+inside the framework, providing final access to readily-usable pytorch
+``DataLoader``\s.
diff --git a/doc/img/data-model-dark.png b/doc/img/data-model-dark.png
index 2eb870d73a2704b78c3e58b1e5d11551646bd6ee..5dafc87843943eb38c8b7cc6ef5ff6c6864b6ff7 100644
Binary files a/doc/img/data-model-dark.png and b/doc/img/data-model-dark.png differ
diff --git a/doc/img/data-model-lite.png b/doc/img/data-model-lite.png
index e0f39233babf91a7c88408b92e589e90c39cab1f..5b9fa8095ce5bc454fe81a3238791f4961b8e3f1 100644
Binary files a/doc/img/data-model-lite.png and b/doc/img/data-model-lite.png differ
diff --git a/doc/img/data-model.dot b/doc/img/data-model.dot
index ea2903ecfe80f78875d0e1871f2d2fc4a9ad0c33..4767bf79db0d4f756f6c410b93a1406511441a8d 100644
--- a/doc/img/data-model.dot
+++ b/doc/img/data-model.dot
@@ -39,7 +39,7 @@ digraph G {
   ]
 
   Sample [
-    label = "{Sample (tuple)|+ tensor: torch.Tensor\l+ metadata: dict[str, Any]\l}"
+    label = "{Sample (dict)|+ image: torch.Tensor\l+ target: torch.Tensor\l+ name: str\l}"
   ]
 
   DataLoader [