As written, our caching datamodule does not allow inputting multiple data splits, each with a different raw-data loader. This would be required for mixed datasets such as "mc_ch".
I'm looking into this one, as some of our datamodules are actually combinations of other existing datamodules.
Here are particular examples I'm trying to figure out how to make work:
Example 1: Each split requires a different raw-data-loader. This is the scenario with cross-database testing, for example (see the sketch after this list).
Split = train, validation, and test from (raw) Database A, test from (raw) Database B, renamed as "testB"
train -> raw-data-loader from Database A
validation -> raw-data-loader from Database A
test -> raw-data-loader from Database A
testB -> raw-data-loader from Database B
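A minimal sketch of what this boils down to, using a hypothetical `RawDataLoader` stand-in (not our real loader class): the datamodule would have to accept a mapping from split name to loader, instead of a single loader for everything.

```python
# Hypothetical stand-in for whatever raw-data-loader type we actually use.
class RawDataLoader:
    def __init__(self, database: str):
        self.database = database

    def sample(self, key: str):
        # stand-in: would load and return the raw sample named ``key``
        return f"{self.database}/{key}"


loader_a = RawDataLoader("database-a")
loader_b = RawDataLoader("database-b")

# What the datamodule would need to accept: one loader *per split*,
# instead of one loader for the whole input split.
split_to_loader = {
    "train": loader_a,
    "validation": loader_a,
    "test": loader_a,
    "testB": loader_b,  # cross-database test split
}
```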
Example 2: Each sample requires a different raw-data-loader. This is the scenario with multi-database training, where the user merges multiple training and validation sets to build a new "super-set". Test sets remain separate for a more detailed analysis (see the sketch after this list).
Split = train, validation, and test from (raw) Database A, the same for Database B, but you want to have these merged
train -> mix of both raw-data-loaders, depends on the sample
validation -> the same as with training, depends on the sample
test -> raw-data-loader from Database A
testB -> raw-data-loader from Database B
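A matching sketch for this case, reusing the hypothetical `RawDataLoader` stand-in from above: here the loader is attached to each individual sample, so merged splits can mix sources freely.

```python
# Each sample carries its own loader, so a merged split can mix databases.
train_a = [("a-001", loader_a), ("a-002", loader_a)]
train_b = [("b-001", loader_b), ("b-002", loader_b)]
val_a = [("a-050", loader_a)]
val_b = [("b-050", loader_b)]

splits = {
    # merged "super-set": which loader applies depends on the sample
    "train": train_a + train_b,
    "validation": val_a + val_b,
    # test sets stay separate for a per-database analysis
    "test": [("a-101", loader_a)],
    "testB": [("b-101", loader_b)],
}
```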
In the current implementation, you can't do that. You are tied to a single raw-data-loader for the whole input split.
More info: Ultimately, the raw-data-loader is passed down the line to our own torch "Dataset" implementations (cached or delayed).
Torch has the concept of "concatenated" datasets. Is this useful here? Maybe something like a "concatenated" datamodule would be a useful concept? What would be an intuitive interface to this?
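For reference, `torch.utils.data.ConcatDataset` does exactly this at the dataset level: it chains several datasets behind a single index space. A quick demonstration with a toy dataset (`ToyDataset` is made up for the example):

```python
from torch.utils.data import ConcatDataset, Dataset


class ToyDataset(Dataset):
    """Stand-in for our own cached/delayed Dataset implementations."""

    def __init__(self, tag: str, size: int):
        self.tag, self.size = tag, size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return f"{self.tag}-{index}"


merged = ConcatDataset([ToyDataset("a", 3), ToyDataset("b", 2)])
assert len(merged) == 5
assert merged[3] == "b-0"  # indices past dataset "a" spill into dataset "b"
```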
Example 1:
```python
dm1 = CachingDataModule(...)
dm2 = CachingDataModule(...)
dm3 = ConcatDataModule(
    {
        "train": dm1["train"],
        "validation": dm1["validation"],
        "test": dm1["test"],
        "testB": dm2["test"],
    }
)
# for all other aspects, dm3 behaves like a CachingDataModule...
```
Example 2:
```python
dm1 = CachingDataModule(...)
dm2 = CachingDataModule(...)
dm3 = ConcatDataModule(
    {
        "train": (dm1["train"], dm2["train"]),
        "validation": (dm1["validation"], dm2["validation"]),
        "test": dm1["test"],
        "testB": dm2["test"],
    }
)
# for all other aspects, dm3 behaves like a CachingDataModule...
```
Right, new day, new ideas - I'm thinking that our current (simple) use-case with the CachingDataModule is, as a matter of fact, just a subset of this. So we can transform our current implementation into this, rename it to ConcatDataModule, and then inherit from it to create the "simple" case with a simplified constructor. This will minimise code dispersion and maximise DRY, IMO.
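A rough sketch of how that refactor could look (all signatures hypothetical, just to fix ideas): the general case becomes the base class, and the current single-loader behaviour becomes a thin subclass that merely rewrites its arguments.

```python
class ConcatDataModule:
    """General case: each split maps to one or more (samples, loader) sources."""

    def __init__(self, splits: dict):
        self.splits = splits


class CachingDataModule(ConcatDataModule):
    """Simple case: one raw-data loader shared by every split."""

    def __init__(self, database_split: dict, raw_data_loader):
        super().__init__(
            {name: [(samples, raw_data_loader)]
             for name, samples in database_split.items()}
        )
```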