Fundamental functions for time series modeling using deep learning methods in PyTorch

from pathlib import Path

import pandas as pd
import torch.nn as nn
from sklearn.preprocessing import RobustScaler
from torch.utils.data import DataLoader

DATA_PATH = Path("../testing_data")

Data Preprocessing

We will first load the data

data = pd.read_csv(DATA_PATH / "hydro_example.csv", parse_dates=True, index_col="time")
data.head(5)

Now we will split the data into train, validation and test sets


source

split_by_date

 split_by_date (data:pandas.core.frame.DataFrame, val_dates:tuple,
                test_dates:tuple)

Split time series data into train, validation and test sets based on date ranges.

            Type       Details
data        DataFrame  Input dataframe containing time series data
val_dates   tuple      Tuple of (start_date, end_date) for validation set
test_dates  tuple      Tuple of (start_date, end_date) for test set
Returns     tuple      (train, valid, test) dataframes
train, valid, test = split_by_date(data, val_dates=("2012-01-01", "2012-12-31"), test_dates=("2013-01-01", "2014-12-31"))
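
The actual implementation is in the library source (linked above); as a rough sketch, a version consistent with this signature, assuming the dataframe has a DatetimeIndex, might look like:

def split_by_date_sketch(data, val_dates, test_dates):
    # hypothetical outline, not the library implementation
    valid = data.loc[val_dates[0]:val_dates[1]]       # rows inside the validation range
    test = data.loc[test_dates[0]:test_dates[1]]      # rows inside the test range
    train = data.drop(valid.index.union(test.index))  # everything else
    return train, valid, test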

Now let's define the feature and target columns and split the data into features and targets

x_cols = ["smoothed_rain","Q_mgb"]
y_cols = ["Q_obs"]

x_train, y_train = train[x_cols], train[y_cols]
x_valid, y_valid = valid[x_cols], valid[y_cols]
x_test, y_test = test[x_cols], test[y_cols]

Now we will fit the scalers on the training data only. This ensures that:

1. No information from the validation/test sets leaks into the scaling process
2. All data is scaled consistently using the same parameters
3. The model sees new data scaled in the same way as the data it was trained on

feature_scaler, target_scaler = RobustScaler(), RobustScaler()
feature_scaler.fit(x_train)
target_scaler.fit(y_train)
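
As a sanity check, the parameters learned from the training split are stored on the scalers and reused verbatim for the other splits; for RobustScaler these are the per-column median (center_) and IQR (scale_):

print(feature_scaler.center_, feature_scaler.scale_)  # train-derived statistics

x_valid_scaled = feature_scaler.transform(x_valid)  # same train-derived parameters applied to validation data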

Finally, we’ll create a custom dataset class to handle our time series data. This class will create sequences of input features (simulation discharge and rainfall) and target values (observed discharge).


source

HydroDataset

 HydroDataset (x:pandas.core.frame.DataFrame,
               y:pandas.core.frame.DataFrame, ctx_len:int,
               pred_len:int=10, x_transform:callable=None,
               y_transform:callable=None)

*An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__, to speed up batched sample loading. This method accepts a list of sample indices for a batch and returns the list of samples.

.. note:: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.*
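
The actual implementation is available through the source link above; a minimal sketch of the windowing idea, with purely hypothetical internals and a hypothetical t+0 convention, could look like this:

import torch
from torch.utils.data import Dataset

class HydroDatasetSketch(Dataset):
    "Hypothetical sketch: sliding windows of ctx_len inputs and pred_len targets."
    def __init__(self, x, y, ctx_len, pred_len=10, x_transform=None, y_transform=None):
        # apply the (train-fitted) scalers once, up front
        self.x = x_transform(x) if x_transform is not None else x.values
        self.y = y_transform(y) if y_transform is not None else y.values
        self.index = x.index
        self.ctx_len, self.pred_len = ctx_len, pred_len

    def __len__(self):
        # one sample per position where a full context + prediction window fits
        return len(self.x) - self.ctx_len - self.pred_len + 1

    def __getitem__(self, i):
        x_win = self.x[i : i + self.ctx_len]
        y_win = self.y[i + self.ctx_len : i + self.ctx_len + self.pred_len]
        return (torch.as_tensor(x_win, dtype=torch.float32),
                torch.as_tensor(y_win, dtype=torch.float32))

    def get_t0(self, i):
        # timestamp of the first predicted step for sample i (hypothetical convention)
        return self.index[i + self.ctx_len]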

We can easily instantiate the dataset as follows

train_dataset = HydroDataset(
    x=x_train,
    y=y_train,
    ctx_len=1,
    pred_len=1,
    x_transform=feature_scaler.transform,
    y_transform=target_scaler.transform
    )

The total number of training samples is

len(train_dataset)

It is possible to easily get a training sample as follows:

train_dataset[5]

And also to get the t+0 timestamp for any item

train_dataset.get_t0(1000)

Model example

For the sake of example, we will define the simplest NN we possibly can in PyTorch: a plain linear model.

class SimpleNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # flatten the context window: (batch, ctx_len, n_features) -> (batch, ctx_len * n_features)
        batch_size = x.shape[0]
        x = x.reshape(batch_size, -1)
        out = self.linear(x)
        return out
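
A quick shape check (with hypothetical batch and window sizes) makes the flattening explicit:

import torch

xb = torch.randn(4, 3, 2)  # (batch=4, ctx_len=3, n_features=2)
out = SimpleNN(input_dim=3 * 2, output_dim=2)(xb)
out.shape  # torch.Size([4, 2])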

Model training

Now we will define a basic learner class to handle the training process. This class will be used to train the model and evaluate its performance.


source

Learner

 Learner (model:torch.nn.modules.module.Module,
          train_loader:torch.utils.data.dataloader.DataLoader,
          val_loader:torch.utils.data.dataloader.DataLoader,
          criterion:torch.nn.modules.module.Module=MSELoss(),
          optimizer:torch.optim.optimizer.Optimizer=Adam,
          log_dir:str=None, verbose:bool=True)

Basic learner class that wraps model training, validation and prediction.

              Type        Default    Details
model         Module                 model to train
train_loader  DataLoader             data loader for training data
val_loader    DataLoader             data loader for validation data
criterion     Module      MSELoss()  loss function to optimize
optimizer     Optimizer   Adam       optimizer class to use for training
log_dir       str         None       directory to save TensorBoard logs
verbose       bool        True       whether to print training progress
Returns       None
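
fit and the prediction helpers are provided by the library; to make the moving parts concrete, here is a purely illustrative sketch of what a minimal training loop over these attributes could look like, assuming the constructor stores the optimizer class (not an instance), as the default above suggests:

import torch

def fit_sketch(learner, lr, epochs):
    # hypothetical outline of Learner.fit, not the library implementation
    opt = learner.optimizer(learner.model.parameters(), lr=lr)
    for epoch in range(epochs):
        learner.model.train()
        for xb, yb in learner.train_loader:
            opt.zero_grad()
            loss = learner.criterion(learner.model(xb), yb)
            loss.backward()
            opt.step()
        # validation pass without gradient tracking
        learner.model.eval()
        with torch.no_grad():
            val_loss = sum(learner.criterion(learner.model(xb), yb).item()
                           for xb, yb in learner.val_loader) / len(learner.val_loader)
        print(f"epoch {epoch}: val_loss={val_loss:.4f}")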

Model training example

Let's see a simple example of how we can train a neural network.

First we will create our Datasets and DataLoaders based on the data we split above

batch_size = 32

context_len = 3
prediction_len = 2
x_transform = feature_scaler.transform
y_transform = target_scaler.transform

train_dataset = HydroDataset(x=x_train, y=y_train, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)
valid_dataset = HydroDataset(x=x_valid, y=y_valid, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)
test_dataset = HydroDataset(x=x_test, y=y_test, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
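
A quick look at one batch shows what the model will receive; the exact target shape depends on how HydroDataset windows the data, but with the settings above the inputs should be (batch, context_len, n_features):

xb, yb = next(iter(train_loader))
xb.shape, yb.shape  # e.g. torch.Size([32, 3, 2]) for the inputs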

We can now instantiate the model. Note that, since SimpleNN flattens the context window, the input dimension is len(x_cols) * context_len.

model = SimpleNN(input_dim=len(x_cols)*context_len, output_dim=prediction_len)

And finally we can instantiate the learner and fit the model

learner = Learner(model=model, train_loader=train_loader, val_loader=valid_loader)
learner.fit(lr=0.001, epochs=3)

Let's now look at the predictions. There are two possible ways. The first is predicting only the values:

y_pred = learner.predict_values(test_loader)

The second is getting the predictions together with their timestamps and column names, which also allows us to scale back to the original values:

y_pred = learner.predict(test_loader, inverse_transform=target_scaler.inverse_transform)
y_pred.head(4)

We will now add the observations and the MGB simulation so we can plot the result.

y_pred["obs"] = y_test.loc[y_pred.index]
y_pred["mgb"] = x_test["Q_mgb"].loc[y_pred.index]
y_pred.plot()