```python
from pathlib import Path

DATA_PATH = Path("../testing_data")
```
Fundamental functions for time series modeling using deep learning methods in PyTorch.
## Data Preprocessing
We will first load the data:
```python
import pandas as pd

data = pd.read_csv(DATA_PATH / "hydro_example.csv", parse_dates=True, index_col="time")
data.head(5)
```
Now we will split the data into coherent groups.
### split_by_date

> `split_by_date (data: pandas.core.frame.DataFrame, val_dates: tuple, test_dates: tuple)`
Split time series data into train, validation and test sets based on date ranges.
| | Type | Details |
|---|---|---|
| data | DataFrame | Input dataframe containing time series data |
| val_dates | tuple | Tuple of (start_date, end_date) for validation set |
| test_dates | tuple | Tuple of (start_date, end_date) for test set |
| **Returns** | **tuple** | |
```python
train, valid, test = split_by_date(data, val_dates=("2012-01-01", "2012-12-31"), test_dates=("2013-01-01", "2014-12-31"))
```
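As a quick, illustrative sanity check (assuming the splits keep their `DatetimeIndex`), we can print the boundaries and size of each split:

```python
# Inspect the boundaries and sizes of each split (illustrative check)
for name, df in [("train", train), ("valid", valid), ("test", test)]:
    print(f"{name}: {df.index.min()} -> {df.index.max()} ({len(df)} rows)")
```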
Now let's define the feature and target columns and split the data into features and targets.
= ["smoothed_rain","Q_mgb"]
x_cols = ["Q_obs"]
y_cols
= train[x_cols], train[y_cols]
x_train, y_train = valid[x_cols], valid[y_cols]
x_valid, y_valid = test[x_cols], test[y_cols] x_test, y_test
Now we will fit the scaler based only on the train data. This ensures that:

1. No information from the validation/test data sets leaks into the scaling process
2. All data is scaled consistently using the same parameters
3. The model sees new data scaled in the same way as the data it was trained on
```python
from sklearn.preprocessing import RobustScaler

feature_scaler, target_scaler = RobustScaler(), RobustScaler()
_, _ = feature_scaler.fit_transform(x_train), target_scaler.fit_transform(y_train)
```
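The same fitted scalers are then reused on the unseen splits with `transform` only, never `fit_transform`. In this walkthrough that happens inside the dataset via its `x_transform`/`y_transform` arguments, but the idea is equivalent to:

```python
# Reuse the train-fitted scalers on unseen data: transform(), never fit_transform()
x_valid_scaled = feature_scaler.transform(x_valid)
y_valid_scaled = target_scaler.transform(y_valid)
```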
Finally, we’ll create a custom dataset class to handle our time series data. This class will create sequences of input features (simulation discharge and rainfall) and target values (observed discharge).
### HydroDataset

> `HydroDataset (x: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame, ctx_len: int, pred_len: int = 10, x_transform: callable = None, y_transform: callable = None)`
*An abstract class representing a `Dataset`.*

*All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite `__getitem__`, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite `__len__`, which is expected to return the size of the dataset by many `Sampler` implementations and the default options of `DataLoader`. Subclasses could also optionally implement `__getitems__`, for speedup of batched samples loading. This method accepts a list of indices of samples of a batch and returns a list of samples.*

*Note: `DataLoader` by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.*
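For intuition, here is a minimal sketch of how such a windowed dataset can be implemented. This is an illustration under assumed semantics, not the actual `HydroDataset` source, and the `WindowedDataset` name is hypothetical:

```python
import torch
from torch.utils.data import Dataset

class WindowedDataset(Dataset):
    """Simplified sketch: pairs a context window of features with the following targets."""
    def __init__(self, x, y, ctx_len, pred_len, x_transform=None, y_transform=None):
        # Apply the (already fitted) transforms once, up front
        self.x = x_transform(x) if x_transform else x.to_numpy()
        self.y = y_transform(y) if y_transform else y.to_numpy()
        self.ctx_len, self.pred_len = ctx_len, pred_len

    def __len__(self):
        # Each sample needs ctx_len inputs followed by pred_len targets
        return len(self.x) - self.ctx_len - self.pred_len + 1

    def __getitem__(self, i):
        x_seq = torch.as_tensor(self.x[i : i + self.ctx_len], dtype=torch.float32)
        y_seq = torch.as_tensor(
            self.y[i + self.ctx_len : i + self.ctx_len + self.pred_len], dtype=torch.float32
        )
        return x_seq, y_seq
```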
We can easily instantiate the dataset as follows:

```python
train_dataset = HydroDataset(
    x=x_train,
    y=y_train,
    ctx_len=1,
    pred_len=1,
    x_transform=feature_scaler.transform,
    y_transform=target_scaler.transform,
)
```
The total number of training samples is:

```python
len(train_dataset)
```
It is possible to easily get a training sample as follows:

```python
train_dataset[5]
```
And also to get the t+0 for any item:

```python
train_dataset.get_t0(1000)
```
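Assuming each item is an `(x, y)` pair of tensors (a context window and its targets), we can inspect the shapes of a sample:

```python
x_sample, y_sample = train_dataset[5]
x_sample.shape, y_sample.shape  # expected: (ctx_len, n_features) and (pred_len, n_targets)
```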
## Model example
For the sake of example, we will define the simplest NN we possibly can in PyTorch: a simple linear model.
```python
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        # Flatten (batch, ctx_len, n_features) into (batch, ctx_len * n_features)
        batch_size = x.shape[0]
        x = x.reshape(batch_size, -1)
        out = self.linear(x)
        return out
```
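As a quick, illustrative sanity check, we can push a random batch through the network and confirm the output shape:

```python
import torch

net = SimpleNN(input_dim=2 * 1, output_dim=1)  # 2 features x ctx_len=1 -> pred_len=1
dummy = torch.randn(4, 1, 2)                   # (batch, ctx_len, n_features)
net(dummy).shape                               # torch.Size([4, 1])
```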
## Model training
Now we will define a basic learner class to handle the training process. This class will be used to train the model and evaluate its performance.
### Learner

> `Learner (model: torch.nn.modules.module.Module, train_loader: torch.utils.data.dataloader.DataLoader, val_loader: torch.utils.data.dataloader.DataLoader, criterion: torch.nn.modules.module.Module = MSELoss(), optimizer: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, log_dir: str = None, verbose: bool = True)`
Initialize self. See help(type(self)) for accurate signature.
| | Type | Default | Details |
|---|---|---|---|
| model | Module | | model to train |
| train_loader | DataLoader | | data loader for training data |
| val_loader | DataLoader | | data loader for validation data |
| criterion | Module | MSELoss() | loss function to optimize |
| optimizer | Optimizer | Adam | optimizer class to use for training |
| log_dir | str | None | directory to save tensorboard logs |
| verbose | bool | True | whether to print training progress |
| **Returns** | **None** | | |
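For intuition, `fit` presumably runs a standard train/validation loop. The following `fit_sketch` is a hypothetical illustration of the idea, not the actual `Learner` implementation:

```python
import torch

def fit_sketch(model, train_loader, val_loader, criterion, lr=1e-3, epochs=3):
    """Illustrative training loop; the real Learner.fit may differ (logging, devices, etc.)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_loader:
            optimizer.zero_grad()
            # Flatten targets to match the model's (batch, pred_len) output
            loss = criterion(model(xb), yb.reshape(yb.shape[0], -1))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(xb), yb.reshape(yb.shape[0], -1)) for xb, yb in val_loader
            ) / len(val_loader)
        print(f"epoch {epoch}: val_loss={val_loss:.4f}")
```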
## Model training example
Let's see a simple example of how we can train a neural network.
First we will create our Datasets and DataLoaders based on the data we split above.
```python
batch_size = 32

context_len = 3
prediction_len = 2
x_transform = feature_scaler.transform
y_transform = target_scaler.transform

train_dataset = HydroDataset(x=x_train, y=y_train, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)
valid_dataset = HydroDataset(x=x_valid, y=y_valid, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)
test_dataset = HydroDataset(x=x_test, y=y_test, ctx_len=context_len, pred_len=prediction_len, x_transform=x_transform, y_transform=y_transform)
```
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
```
We can now instantiate the model:

```python
model = SimpleNN(input_dim=len(x_cols) * context_len, output_dim=prediction_len)
```
And finally we can instantiate the learner and fit our data:

```python
learner = Learner(model=model, train_loader=train_loader, val_loader=valid_loader)
learner.fit(lr=0.001, epochs=3)
```
Let's now see the predictions. There are two possible ways. The first is predicting only the values:

```python
y_pred = learner.predict_values(test_loader)
```
The second is getting the predictions with the timestamp and column name. This also allows us to scale back to the original values:

```python
y_pred = learner.predict(test_loader, inverse_transform=target_scaler.inverse_transform)
y_pred.head(4)
```
We will now add the observations and the MGB simulation so we can plot the result.

```python
y_pred["obs"] = y_test.loc[y_pred.index]
y_pred["mgb"] = x_test["Q_mgb"].loc[y_pred.index]
y_pred.plot()
```