DataModule

DatModule from plkit is a subclass of LightningDataModule from pytorch-lightning. So you can follow everything that is documented by pytorch-lightning

Other than that, we have two additional methods defined: data_reader to read the data into a collection and data_splits to split the data collection for training, validation and testing.

The data_reader method is required to be defined if you want to use the features of plkit's DataModule (auto-splitting data for example). Once you have it defined, you don't need to care about data_prepare and setup methods that LightningDataModule requires.

data_reader

You can read data into a list of samples by data_reader or yield the samples to save some memory.

Note

If data_reader is yielding (it is a generateor), plkit.data.IterDataset will be used, and samples can not be suffled.

Tip

You can yield multiple features, as well as the labels at the same time as a tuple. For example:

class MyData(plkit.DataModule):
    ...
    def data_reader(
        yield (sample_name, feature_a, feature_b, label)

Then in training_step, validation_step or test_step, you can easily decouple them like this:

class MyModule(plkit.Module):
    ...

    def training_step(self, batch, _):
        # Each variable has this batch of features
        sample_name, feature_a, feature_b, label = batch

This is also the case when a list of samples returned from data_reader.

data_splits

This method is supposed to split the data collection returned (yielded) from data_reader. If you have data_tvt (see data_tvt in configuration) specified in configuration, the collection will be specified automatically based on data_tvt. Otherwise, you can return a dictionary like this from data_splits to split the data by yourself:

from plkit.data import DataModule, Dataset, IterDataset

class MyData(DataModule):
    ...

    def data_splits(self, data, stage):
        return {
            'train': Dataset(...),
            'val': Dataset(...), # or a list of datasets,
            'test': Dataset(...), # or a list of datasets
        }

Note

setup is only calling at stage fit. If you want to do it at test stage, you will need to override setup method.