The data preparation step converts raw measurement data with a 15-minute resolution into weekly extremes and prepares these extremes as usable input for the forecasting model.
Initially the entire history is aggregated and stored in a Snowflake database; subsequently, updates are made every week.
The data preparation step, resulting in weekly aggregated data.
Columns described in Data Understanding are selected from the measurement data and metadata tables.
From the measurement data, only data from DALI boxes that have a nominal power registered in the metadata table and that are in operation is selected (src.utils.snowflake.read_meta()). Both data and metadata are needed to indicate future overloading and to add value for Grid Planners.
As mentioned, only the active power (P) on the medium voltage side is used initially. Apparent power (S) would be preferred (for a fairer comparison with the nominal power), but since the data consists of 15-minute averages, S is hard to reconstruct.
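To illustrate why S cannot be recovered from averaged channels: the apparent power computed from averaged P and Q is, by the triangle inequality, at most the average of the per-sample apparent power, so it systematically underestimates the true average S whenever the signal fluctuates within the interval. A minimal sketch with purely illustrative numbers (not real measurement data):

```python
import math

# Hypothetical 1-minute samples of active (P) and reactive (Q) power
# within a single 15-minute interval; values are illustrative only.
P = [100, 120, 80, 110, 95]
Q = [40, -30, 55, 10, -45]

# True average apparent power: average the per-sample S = sqrt(P^2 + Q^2).
true_avg_S = sum(math.hypot(p, q) for p, q in zip(P, Q)) / len(P)

# What could be reconstructed from 15-minute *averaged* channels.
avg_P = sum(P) / len(P)
avg_Q = sum(Q) / len(Q)
reconstructed_S = math.hypot(avg_P, avg_Q)

# By the triangle inequality the reconstruction underestimates the true
# average whenever P or Q fluctuate within the interval.
assert reconstructed_S <= true_avg_S
```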
In this step, the separate power phases are selected and processed, as well as the sum of the phases (src.utils.snowflake.make_week_extremes_query()).
Only the 15-minute average channels are selected for preprocessing.
No data cleaning is performed on the raw data before aggregation, but the data is checked and cleaned as follows after loading (src.preprocess.preprocess.load_data()) and before being used by a model for forecasting:
Extremes (minimum or maximum) with a value of zero at the beginning of the series are removed (src.preprocess.preprocess.remove_leading_idling()). This is for example the case when a DALI box is in operation but its transformer is not.
Only data with a history of more than two years is used (src.preprocess.preprocess.too_short()). This ensures at this stage that the seasonality (sub)model has enough data to tune on.
Only data of transformers that have at least one measurement in their history with an absolute value higher than half the transformer capacity is used (src.preprocess.preprocess.too_small()).
Duplicate data (which can only be created by updating the extremes table) is not an issue for the model and is not eliminated.
Missing data is not a problem for the model either and is not imputed.
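The cleaning checks above can be sketched as follows. The function names mirror the helpers mentioned (remove_leading_idling, too_short, too_small), but the bodies and thresholds are illustrative assumptions, not the production code:

```python
import pandas as pd

def remove_leading_idling(s: pd.Series) -> pd.Series:
    """Drop leading zero extremes, e.g. a DALI box already in operation
    while its transformer is not."""
    nonzero = s.ne(0)
    # idxmax() returns the index of the first True value.
    return s.loc[nonzero.idxmax():] if nonzero.any() else s.iloc[0:0]

def too_short(s: pd.Series, min_weeks: int = 104) -> bool:
    """True if the history covers two years (~104 ISO weeks) or less."""
    return len(s) <= min_weeks

def too_small(s: pd.Series, nominal_power: float) -> bool:
    """True if no extreme in the history exceeds half the transformer
    capacity in absolute value."""
    return s.abs().max() <= nominal_power / 2
```

A series passes preprocessing only if, after removing leading idling, it is neither too short nor too small relative to the transformer's nominal power.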
From the raw 15-minute data, the weekly minimum and maximum are determined per channel and boxid (src.utils.snowflake.make_week_extremes_query()). The ISO week definition is used, since an ISO week is always a full week.
A SQL query aggregates the data and writes the result asynchronously to the Snowflake database. This can be done in batch for all historic measurements (src.utils.snowflake.create_week_extremes()), but the created table can also be updated per week (src.utils.snowflake.update_week_extremes()).
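The aggregation can be sketched as a Snowflake query of the following shape. The source table and column names here are assumptions for illustration; the production query is built by src.utils.snowflake.make_week_extremes_query():

```python
# Illustrative sketch only: MEASUREMENTS, TS and VALUE are assumed names.
# YEAROFWEEKISO and WEEKISO are Snowflake's ISO-week date functions.
WEEK_EXTREMES_SQL = """
SELECT
    BOXID,
    L,
    YEAROFWEEKISO(TS) AS YEAR,
    WEEKISO(TS)       AS WEEK,
    CURRENT_TIMESTAMP AS PROCESSED_ON,
    MAX(VALUE)        AS MAX,
    MIN(VALUE)        AS MIN
FROM MEASUREMENTS
GROUP BY BOXID, L, YEAR, WEEK
"""
```

Grouping by ISO year and week per box and phase is what condenses the raw 15-minute records into one row of extremes per box, phase and week.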
Database: DB_DATASCIENCE_P
Schema:   DATASCIENCE_1_ETL
Table:    DS_SPARK_DALI_WEEK_EXTREMES
The fields of the table are listed below. The table is clustered by BOXID and L (phase). The number of rows is condensed from 89,052,020,404 to 3,457,856 records.
Field         Type         Example
BOXID         VARCHAR      ESD.000240-2
L             VARCHAR      sumli
YEAR          NUMBER       2021
WEEK          NUMBER       53
PROCESSED_ON  TIMESTAMPTZ  2021-05-12 07:45:00.000000000
MAX           DOUBLE       678.90
MIN           DOUBLE       123.45
Since no additional data sources are used, no joins or merges are required.
The model does not require the data to be ordered (e.g. by year and week). For the modelling stage, the data is queried from the table described in Data Construction.
Subsequently, a date column is constructed from the ISO year and week, with day == 1 (Monday).
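This date construction maps directly onto the standard library, for example:

```python
import datetime

# Build the date column value from ISO year and week with day == 1,
# i.e. the Monday that starts the ISO week.
week_start = datetime.date.fromisocalendar(2020, 53, 1)
# -> datetime.date(2020, 12, 28), the Monday of ISO week 53 of 2020
```

Note that `fromisocalendar` validates the week number: ISO years have 52 or 53 weeks, so week 53 only exists in years like 2020.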
The extra columns period and model_var are added and filled with the values "history" and "observed" respectively for measurement data. This prepares for long formatting and for concatenating forecast results in a later stage (src.preprocess.preprocess.format_data()).
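A minimal sketch of this formatting step, assuming a small weekly-extremes frame; the column values for later forecast rows are not shown here, and the real logic lives in src.preprocess.preprocess.format_data():

```python
import pandas as pd

# Assumed toy frame of weekly extremes (illustrative values only).
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-11"]),
    "max": [678.90, 701.20],
    "min": [123.45, 98.70],
})

# Tag measurement data so forecast rows (with their own period/model_var
# values) can simply be concatenated onto the same frame later.
df = df.assign(period="history", model_var="observed")
```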
An example of the loaded extreme data is shown below:
The format of the loaded extremes data.
The weekly extremes can be updated on a weekly (or longer) basis.
By running src.preprocess.update_extremes(), the function src.utils.snowflake.update_week_extremes() is called. This triggers the following steps, which update the weekly extremes Snowflake table:
The detailed process to create and assess load forecasts.