timetk: a time series toolkit in R

Kyle Hardage, PhD, PG

Engineering Geologist, DWR

2025-11-13

What is timetk?

  • a toolkit for time series analysis in the tidyverse style
  • key point: no novel time series objects; data stays in plain tibbles (a possible speed disadvantage versus specialized classes)
  • key strengths:
    • seamless integration with tibble / dplyr workflows
    • fast plotting helpers and seasonal diagnostics
    • feature engineering (time series signature, lags, rolling stats)
    • works well with tidymodels and forecasting pipelines

Slide roadmap

  1. visualization
  2. wrangling & analysis
  3. machine learning (briefly)
  4. links & resources

Setup & single dataset

Code
# load packages & dataset (run locally before presenting)
library(tidyverse)
library(timetk)
library(slider)
library(lubridate)

# timetk includes the `m4_monthly` dataset — we'll use a single series across slides
# This dataset was compiled for the 2018 M4 forecasting competition
# Replace with a smaller sample if you prefer to knit faster.
series_tbl <- timetk::m4_monthly %>%
  filter(id == "M750") %>%  # example series — pick one consistent series
  select(id, date, value) %>% 
  mutate(month = month(date),
         season = ifelse(month %in% c(1:3, 10:12), "wet", "dry")) %>% 
  arrange(date)

head(series_tbl)
# A tibble: 6 × 5
  id    date       value month season
  <fct> <date>     <dbl> <dbl> <chr> 
1 M750  1990-01-01  6370     1 wet   
2 M750  1990-02-01  6430     2 wet   
3 M750  1990-03-01  6520     3 wet   
4 M750  1990-04-01  6580     4 dry   
5 M750  1990-05-01  6620     5 dry   
6 M750  1990-06-01  6690     6 dry   

Notes:

  • we will use built-in datasets to demonstrate end-to-end workflow
  • values have no real meaning (synthetic data), so substitute domain knowledge

Visualizations

Quick plots

  • plot_time_series() is a timetk convenience wrapper
Code
plot_time_series(series_tbl, date, value)

can group IDs prior to plotting (Note: facet arguments go inside the function)

Code
series_tbl %>% 
  group_by(id) %>% 
  plot_time_series(date, value, .facet_scales = "free")

more groupings (in this case, trend category)

Code
series_tbl %>% 
  group_by(id) %>% 
  mutate(trend = ifelse(date < "2002-04-01", "rising", "stable")) %>% 
  plot_time_series(date, value, .color_var = trend)

switching off .interactive will return a ggplot (for reports, sharing, etc)

Code
series_tbl %>% 
  group_by(id) %>% 
  mutate(month = month(date), 
         trend = ifelse(date < "2002-04-01", "rising", "stable")) %>% 
  plot_time_series(date, value, .color_var = trend, .interactive = FALSE)

however, heavy customization clutters the plot_time_series() call, so you may prefer plain ggplot()

Code
series_tbl %>%
  plot_time_series(date, value, .color_var = season,
                   # Returns static ggplot
                   .interactive = FALSE,  
                   # Customization
                   .title = "Series M750 - value over time",
                   .x_lab = "Date",
                   .y_lab = "Value (unitless)",
                   .color_lab = "Season") +
  scale_y_continuous(labels = scales::label_comma())
Code
# time series plot (use ggplot + tk plotting helper locally)
series_tbl %>%
  ggplot(aes(x = date, y = value, col = season)) +
  geom_line() +
  labs(title = 'Series M750 — Value over time', x = 'Date', y = 'Value (unitless)') +
  theme_minimal()

time series boxplots summarize the distribution of values within each .period

Code
series_tbl %>% 
  plot_time_series_boxplot(date, value, .period = "6 months")

Seasonality and trend decomposition

note on functions:

they often come in pairs: tk_* for analysis / preprocessing and plot_* for plotting

Code
tk_seasonal_diagnostics(series_tbl, date, value) %>% head(4)
# A tibble: 4 × 5
  date       .value month.lbl quarter year 
  <date>      <dbl> <fct>     <fct>   <fct>
1 1990-01-01   6370 January   1       1990 
2 1990-02-01   6430 February  1       1990 
3 1990-03-01   6520 March     1       1990 
4 1990-04-01   6580 April     2       1990 
Code
plot_seasonal_diagnostics(series_tbl, date, value)

use .feature_set to choose the seasonal period (e.g. diurnal)

default: auto

others: second, minute, hour, wday.lbl, week, month.lbl, quarter, year

Code
# note: "hour" requires sub-daily data (e.g. m4_hourly); shown here for syntax
tk_seasonal_diagnostics(series_tbl, date, value, .feature_set = "hour")

STL decomposition (Seasonal and Trend decomposition using Loess: season + trend + remainder)

Code
plot_stl_diagnostics(series_tbl, date, value)

likewise, the tk_ variant returns the values behind the plot and reports the automatically determined frequency and trend windows

frequency = 12 observations per 1 year

trend = 60 observations per 5 years

Code
tk_stl_diagnostics(series_tbl, date, value)
# A tibble: 306 × 6
   date       observed season trend remainder seasadj
   <date>        <dbl>  <dbl> <dbl>     <dbl>   <dbl>
 1 1990-01-01     6370  161.  6516.    -308.    6209.
 2 1990-02-01     6430  166.  6529.    -265.    6264.
 3 1990-03-01     6520  300.  6541.    -321.    6220.
 4 1990-04-01     6580  282.  6553.    -255.    6298.
 5 1990-05-01     6620  270.  6565.    -215.    6350.
 6 1990-06-01     6690  100.  6577.      12.5   6590.
 7 1990-07-01     6000 -648.  6590.      58.6   6648.
 8 1990-08-01     5450 -936.  6601.    -215.    6386.
 9 1990-09-01     6480 -173.  6612.      40.1   6653.
10 1990-10-01     6820   81.9 6624.     114.    6738.
# ℹ 296 more rows
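A quick consistency check on the output above: seasadj is the observed value with the seasonal component removed, i.e. observed minus season (row 1: 6370 - 161 ≈ 6209, up to rounding):

```r
# seasadj = observed - season, checked against row 1 of the printed output
observed <- 6370
season   <- 161    # rounded in the printout
seasadj  <- 6209
stopifnot(abs((observed - season) - seasadj) < 1)  # passes within rounding
```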

Facets and multi-series

  • plot_anomaly_diagnostics() and plot_time_series() scale well for multiple series
Code
m4_hourly %>%
  group_by(id) %>%
  plot_time_series(date, value, .facet_ncol = 2, .facet_scales = "free")
  • Use faceting + tk_augment_timeseries_signature() for grouped visuals (monthly, weekly patterns)
  • granular breakdown of time series data
Code
tk_augment_timeseries_signature(m4_hourly)
# A tibble: 3,060 × 31
   id    date                value  index.num  diff  year year.iso  half quarter
   <fct> <dttm>              <dbl>      <dbl> <dbl> <int>    <int> <int>   <int>
 1 H150  2013-09-01 12:00:00    45 1378036800    NA  2013     2013     2       3
 2 H150  2013-09-01 13:00:00    57 1378040400  3600  2013     2013     2       3
 3 H150  2013-09-01 14:00:00    27 1378044000  3600  2013     2013     2       3
 4 H150  2013-09-01 15:00:00    22 1378047600  3600  2013     2013     2       3
 5 H150  2013-09-01 16:00:00    21 1378051200  3600  2013     2013     2       3
 6 H150  2013-09-01 17:00:00    24 1378054800  3600  2013     2013     2       3
 7 H150  2013-09-01 18:00:00    29 1378058400  3600  2013     2013     2       3
 8 H150  2013-09-01 19:00:00    37 1378062000  3600  2013     2013     2       3
 9 H150  2013-09-01 20:00:00    97 1378065600  3600  2013     2013     2       3
10 H150  2013-09-01 21:00:00   165 1378069200  3600  2013     2013     2       3
# ℹ 3,050 more rows
# ℹ 22 more variables: month <int>, month.xts <int>, month.lbl <ord>,
#   day <int>, hour <int>, minute <int>, second <int>, hour12 <int>,
#   am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>,
#   qday <int>, yday <int>, mweek <int>, week <int>, week.iso <int>,
#   week2 <int>, week3 <int>, week4 <int>, mday7 <int>

Anomalies

flag outliers (anomalies) in a dataset (common for my work)

Code
plot_anomaly_diagnostics(series_tbl, date, value)

how does this work?

step 1) detrend and remove seasonality (STL decomposition)

step 2) flag remainder values outside an interquartile range (IQR) band around the median

Code
plot_stl_diagnostics(series_tbl, date, value, .interactive = FALSE) +
  theme(panel.spacing = unit(0.2, "lines"),
        strip.text = element_text(margin = margin(0, 0, 0, 0)))

control the sensitivity with .alpha and seasonality with .frequency

lower .alpha = more stringent threshold to flag anomalies (default = 0.05)
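A minimal base-R sketch of that step-2 logic (illustrative only; timetk's actual bounds and .alpha scaling differ in detail):

```r
# Flag points whose STL remainder falls outside an IQR band around the median.
# `alpha` mimics the role of timetk's .alpha: smaller alpha -> wider band,
# fewer flags. The 0.15 scaling is illustrative, not timetk's exact constant.
flag_anomalies <- function(remainder, alpha = 0.05) {
  med   <- median(remainder, na.rm = TRUE)
  limit <- IQR(remainder, na.rm = TRUE) * (0.15 / alpha)
  remainder < med - limit | remainder > med + limit
}

flag_anomalies(c(-30, 5, 12, -8, 400, 3))
# FALSE FALSE FALSE FALSE  TRUE FALSE  (only the 400 spike is flagged)
```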

anomaly flags can be viewed using tk_anomaly_diagnostics()

Code
plot_anomaly_diagnostics(series_tbl, date, value, .alpha = 0.1, .frequency = "quarter")

real example: identify and remove pumping influence to determine ambient groundwater elevation.

unflagged pumping was successfully detected as anomaly

Original Groundwater Hydrograph

Anomaly Hydrograph detecting unflagged pumping

Wrangling & Analysis

Time index handling

  • timetk makes it easy to keep time indices tidy:
    • tk_augment_timeseries_signature() (appends time columns)
    • tk_ts() (coerce tbl to ts object)
    • tk_tbl() (coerce ts to tbl object)
  • workflows: create features → summarize with rolling windows → lag features
  • emphasis on tidy approach: features are columns in the tibble for ML
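For orientation, the base-R round trip that tk_ts() / tk_tbl() streamline looks roughly like this (a sketch; tk_tbl() additionally restores a proper date column):

```r
# Monthly values coerced to a `ts` object and back, base-R style
values <- c(6370, 6430, 6520, 6580, 6620, 6690)
x_ts   <- ts(values, start = c(1990, 1), frequency = 12)  # Jan 1990, monthly

frequency(x_ts)  # 12

# Back to a rectangular form; note the fractional-year index that
# tk_tbl() would convert into a real date column for you
df <- data.frame(index = as.numeric(time(x_ts)), value = as.numeric(x_ts))
head(df, 2)
```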

Time series signature

Code
# Generate a time-series signature
sig_tbl <- tk_augment_timeseries_signature(series_tbl, .date_var = date)

# show top columns
sig_tbl %>% select(date, index.num, year, month.lbl, wday.lbl, hour) %>% head()
# A tibble: 6 × 6
  date       index.num  year month.lbl wday.lbl  hour
  <date>         <dbl> <int> <ord>     <ord>    <int>
1 1990-01-01 631152000  1990 January   Monday       0
2 1990-02-01 633830400  1990 February  Thursday     0
3 1990-03-01 636249600  1990 March     Thursday     0
4 1990-04-01 638928000  1990 April     Sunday       0
5 1990-05-01 641520000  1990 May       Tuesday      0
6 1990-06-01 644198400  1990 June      Friday       0
  • the signature contains numeric and categorical encoding of date/time
  • useful as input features for models — automatic and fast
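Under the hood these are ordinary calendar computations; a base-R sketch of a few signature-style columns (the column names here are just for illustration):

```r
# Derive a few calendar features by hand, as the signature does at scale
dates <- as.Date(c("1990-01-01", "1990-02-01", "1990-03-01"))
feat  <- data.frame(
  date  = dates,
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  wday  = weekdays(dates)  # locale-dependent labels, e.g. "Monday"
)
feat
```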

Lags & Rolling summaries

Code
# Create lag and rolling features (summary code)
feat_tbl <- series_tbl %>%
  mutate(
    lag_1 = lag(value, 1),
    roll_3 = slider::slide_dbl(value, mean, .before = 2, .complete = TRUE)
  )

head(feat_tbl)
# A tibble: 6 × 7
  id    date       value month season lag_1 roll_3
  <fct> <date>     <dbl> <dbl> <chr>  <dbl>  <dbl>
1 M750  1990-01-01  6370     1 wet       NA    NA 
2 M750  1990-02-01  6430     2 wet     6370    NA 
3 M750  1990-03-01  6520     3 wet     6430  6440 
4 M750  1990-04-01  6580     4 dry     6520  6510 
5 M750  1990-05-01  6620     5 dry     6580  6573.
6 M750  1990-06-01  6690     6 dry     6620  6630 
  • tk_augment_lags() (helper) and slider package for flexible rolling windows.
  • edge handling: NA from lags, .complete flag in slider.
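The .complete semantics can be mimicked in base R (a sketch; slider handles this, plus alignment and variable windows, natively):

```r
# 3-point trailing mean that stays NA until a full window exists,
# mirroring slider::slide_dbl(..., .before = 2, .complete = TRUE)
roll_mean_complete <- function(x, n = 3) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- mean(x[(i - n + 1):i])
  }
  out
}

roll_mean_complete(c(6370, 6430, 6520, 6580))
# NA NA 6440 6510  (matches roll_3 in the table above)
```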

Resampling & aggregations

  • summarize_by_time() and pad_by_time() are convenient wrappers for common operations
  • e.g., go from hourly to daily data
Code
head(m4_hourly, 4)
# A tibble: 4 × 3
  id    date                value
  <fct> <dttm>              <dbl>
1 H10   2015-07-01 12:00:00   513
2 H10   2015-07-01 13:00:00   512
3 H10   2015-07-01 14:00:00   506
4 H10   2015-07-01 15:00:00   500
Code
m4_hourly %>% 
  summarize_by_time(.by = 'day', value = mean(value, na.rm = TRUE)) %>% 
  head(4)
# A tibble: 4 × 2
  date                value
  <dttm>              <dbl>
1 2013-09-01 00:00:00  94.1
2 2013-09-02 00:00:00 111. 
3 2013-09-03 00:00:00  99  
4 2013-09-04 00:00:00 110. 
  • use pad_by_time() to fill missing dates (a wrapper for padr::pad())
  • good for prepping series before modeling or imputation
  • useful for accurate plots (it can auto-guess the .by interval)
Code
# remove middle days of data
hourly_missing <- m4_hourly %>% 
  filter(id == "H10") %>% 
  mutate(day = day(date)) %>%
  filter(!day %in% 10:20)
Code
plot_time_series(hourly_missing, date, value, .interactive = FALSE)

Code
pad_by_time(hourly_missing, date, .by = "hour") %>% 
  plot_time_series(date, value, .interactive = FALSE)

Machine Learning (briefly)

  • timetk helps with feature engineering
  • pair with tidymodels for forecasting or regression
  • typical flow: signature → lags/rolling → split → recipe → model

What is feature engineering?

the process of transforming raw data into informative input variables that improve model performance

often the most creative and valuable part of the workflow — combining domain insight with technical skill to make patterns obvious to the algorithm

goals of feature engineering

  • enhance signal, reduce noise
  • make patterns learnable
  • improve model generalization
  • apply domain knowledge to highlight key relationships

Common feature engineering techniques

  • Numeric transformations: scale, normalize, or log-transform variables, e.g. log(price), scale(age)
  • Categorical encoding: convert categories into numeric codes, e.g. one-hot encode region
  • Datetime features: extract time-based patterns, e.g. month(), day_of_week(), is_weekend
  • Lag features: capture temporal dependence, e.g. lag(sales, 1)
  • Rolling / window statistics: describe recent trends or volatility, e.g. roll_mean(sales, 3)
  • Interaction terms: combine variables to capture nonlinear effects, e.g. price * demand
  • Aggregations: summarize by groups or time windows, e.g. mean sales per category

each technique adds structure to the data

for time series, lag and rolling stats are powerful because they preserve the temporal relationships models need to understand momentum or seasonality
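A tiny base-R illustration of that point: on a smoothly trending series, the lag-1 values are almost perfectly correlated with the current values, so a single lag feature already carries most of the signal:

```r
# Lag-1 correlation on a purely linear trend is exactly 1
x    <- 100 + 5 * (1:12)   # toy upward trend
lag1 <- x[-length(x)]      # lag(x, 1) with the leading NA dropped
cur  <- x[-1]
cor(cur, lag1)  # 1
```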

Feature engineering with timetk

  • tk_augment_timeseries_signature(): generates rich date-based features (year, half, quarter, month, week, wday, etc.)
  • tk_augment_lags(): adds lag features for autoregressive effects
  • tk_augment_slidify(): computes rolling window statistics (mean, sd, etc.)
  • tk_augment_differences(): calculates first or seasonal differences
  • tk_augment_holiday_signature(): adds holiday indicators and proximity features

better features often beat more complex models
automated tools like timetk accelerate this process via tidy pipelines

ML — forecast pipeline

timetk fits into the tidymodels framework: its feature engineering is available as recipe steps inside workflows

Code
# Load required packages
library(tidymodels)
library(timetk)
library(dplyr)
library(parsnip)
library(workflows)
library(rsample)

# 1. Split into training and testing sets ----
# 'assess' defines the size of the test set (12 months)
splits <- time_series_split(series_tbl, assess = "12 months", cumulative = TRUE)

training_data <- rsample::training(splits)
testing_data  <- rsample::testing(splits)

# 2. Build a recipe for feature engineering ----
recipe_spec <- recipe(value ~ date, data = training_data) %>%
  # Create rich time-based features (month, quarter, year, etc.)
  step_timeseries_signature(date) %>%
  # Remove the raw date column; its information now lives in the signature features
  step_rm(date) %>%
  # Normalize numeric features (optional)
  step_normalize(all_numeric_predictors())

# 3. Define the model ----
model_spec <- linear_reg() %>%
  set_engine("lm")

# 4. Build workflow ----
wf <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec)

# 5. Fit the model ----
fit <- wf %>%
  fit(data = training_data)

# 6. Make forecasts ----
forecast_tbl <- predict(fit, new_data = testing_data) %>%
  bind_cols(testing_data)

# 7. Visualize the results ----
forecast_tbl %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = value), color = "steelblue", linewidth = 1.2) +
  geom_line(aes(y = .pred), color = "red", linetype = "dashed") +
  labs(title = "Forecast vs Actuals",
       y = "Value", x = "Date",
       caption = "Blue = observed | Red dashed = predicted") +
  theme_minimal()
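To put a number on the held-out window, a base-R RMSE helper (a hypothetical one-liner; within tidymodels you would typically reach for yardstick::rmse() instead):

```r
# Root mean squared error between actuals and predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# On the pipeline above: rmse(forecast_tbl$value, forecast_tbl$.pred)
rmse(c(10, 12, 14), c(11, 12, 13))  # sqrt(2/3), about 0.816
```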

Resources & further reading

  • timetk package site & documentation: https://business-science.github.io/timetk/

Thank you — Questions?