As written, our caching datamodule does not allow inputting multiple data splits, each with a different raw-data loader. This would be required for mixed datasets such as "mc_ch".
I'm looking into this one, as some of our datamodules are actually combinations of other existing datamodules.
Here are particular examples I'm trying to figure out how to make work:
Example 1: Each split requires a different raw-data-loader. This is the scenario with cross-database testing, for example (see the sketch after this list).
Split = train, validation, and test from (raw) Database A, test from (raw) Database B, renamed as "testB"
train -> raw-data-loader from Database A
validation -> raw-data-loader from Database A
test -> raw-data-loader from Database A
testB -> raw-data-loader from Database B
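A minimal sketch of what this boils down to, using a hypothetical `RawDataLoader` stand-in (not our real loader class): the datamodule would have to accept a mapping from split name to loader, instead of a single loader for everything.

```python
# Hypothetical stand-in for whatever raw-data-loader type we actually use.
class RawDataLoader:
    def __init__(self, database: str):
        self.database = database

    def sample(self, key: str):
        # stand-in: would load and return the raw sample named ``key``
        return f"{self.database}/{key}"


loader_a = RawDataLoader("database-a")
loader_b = RawDataLoader("database-b")

# What the datamodule would need to accept: one loader *per split*,
# instead of one loader for the whole input split.
split_to_loader = {
    "train": loader_a,
    "validation": loader_a,
    "test": loader_a,
    "testB": loader_b,  # cross-database test split
}
```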
Example 2: Each sample requires a different raw-data-loader. This is the scenario with multi-database training, where the user merges multiple training and validation sets to build a new "super-set". Test sets remain separate for a more detailed analysis (see the sketch after this list).
Split = train, validation, and test from (raw) Database A, the same for Database B, but you want to have these merged
train -> mix of both raw-data-loaders, depends on the sample
validation -> the same as with training, depends on the sample
test -> raw-data-loader from Database A
testB -> raw-data-loader from Database B
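A matching sketch for this case, reusing the hypothetical `RawDataLoader` stand-in from above: here the loader is attached to each individual sample, so merged splits can mix sources freely.

```python
# Each sample carries its own loader, so a merged split can mix databases.
train_a = [("a-001", loader_a), ("a-002", loader_a)]
train_b = [("b-001", loader_b), ("b-002", loader_b)]
val_a = [("a-050", loader_a)]
val_b = [("b-050", loader_b)]

splits = {
    # merged "super-set": which loader applies depends on the sample
    "train": train_a + train_b,
    "validation": val_a + val_b,
    # test sets stay separate for a per-database analysis
    "test": [("a-101", loader_a)],
    "testB": [("b-101", loader_b)],
}
```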
In the current implementation, you can't do that. You are tied to a single raw-data-loader for the whole input split.
More info: Ultimately, the raw-data-loader is passed down the line to our own torch "Dataset" implementations (cached or delayed).
Torch has the concept of "concatenated" datasets. Is this useful here? Maybe something like a "concatenated" datamodule would be a useful concept? What would be an intuitive interface to this?
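For reference, `torch.utils.data.ConcatDataset` does exactly this at the dataset level: it chains several datasets behind a single index space. A quick demonstration with a toy dataset (`ToyDataset` is made up for the example):

```python
from torch.utils.data import ConcatDataset, Dataset


class ToyDataset(Dataset):
    """Stand-in for our own cached/delayed Dataset implementations."""

    def __init__(self, tag: str, size: int):
        self.tag, self.size = tag, size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return f"{self.tag}-{index}"


merged = ConcatDataset([ToyDataset("a", 3), ToyDataset("b", 2)])
assert len(merged) == 5
assert merged[3] == "b-0"  # indices past dataset "a" spill into dataset "b"
```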
Example 1:
```python
dm1 = CachingDataModule(...)
dm2 = CachingDataModule(...)
dm3 = ConcatDataModule(
    {
        "train": dm1["train"],
        "validation": dm1["validation"],
        "test": dm1["test"],
        "testB": dm2["test"],
    }
)
# for all other aspects, dm3 behaves like a CachingDataModule...
```
Example 2:
```python
dm1 = CachingDataModule(...)
dm2 = CachingDataModule(...)
dm3 = ConcatDataModule(
    {
        "train": (dm1["train"], dm2["train"]),
        "validation": (dm1["validation"], dm2["validation"]),
        "test": dm1["test"],
        "testB": dm2["test"],
    }
)
# for all other aspects, dm3 behaves like a CachingDataModule...
```
Right, new day, new ideas - I'm thinking that our current (simple) use-case with the CachingDataModule is, as a matter of fact, just a subset of this. So we can transform our current implementation into this, rename it to ConcatDataModule, and then inherit from it to create the "simple" case with a simplified constructor. This will minimise code dispersion and maximise DRY, IMO.
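A rough sketch of how that refactor could look (all signatures hypothetical, just to fix ideas): the general case becomes the base class, and the current single-loader behaviour becomes a thin subclass that merely rewrites its arguments.

```python
class ConcatDataModule:
    """General case: each split maps to one or more (samples, loader) sources."""

    def __init__(self, splits: dict):
        self.splits = splits


class CachingDataModule(ConcatDataModule):
    """Simple case: one raw-data loader shared by every split."""

    def __init__(self, database_split: dict, raw_data_loader):
        super().__init__(
            {name: [(samples, raw_data_loader)]
             for name, samples in database_split.items()}
        )
```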