timetk: a time series toolkit in R

Kyle Hardage, PhD, PG

Engineering Geologist, DWR

2025-11-13

What is timetk?

  • a toolkit for time series analysis in the tidyverse style
  • key point: no novel time series objects; data stays in plain tibbles (a possible speed disadvantage versus specialized classes)
  • key strengths:
    • seamless integration with tibble / dplyr workflows
    • fast plotting helpers and seasonal diagnostics
    • feature engineering (time series signature, lags, rolling stats)
    • works well with tidymodels and forecasting pipelines

Slide roadmap

  1. visualization
  2. wrangling & analysis
  3. machine learning (briefly)
  4. links & resources

Setup & single dataset

Code
# load packages & dataset (run locally before presenting)
library(tidyverse)
library(timetk)
library(slider)
library(lubridate)

# timetk includes the `m4_monthly` dataset — we'll use a single series across slides
# This dataset was compiled for the 2018 M4 forecasting competition
# Replace with a smaller sample if you prefer to knit faster.
series_tbl <- timetk::m4_monthly %>%
  filter(id == "M750") %>%  # example series — pick one consistent series
  select(id, date, value) %>% 
  mutate(month = month(date),
         season = ifelse(month %in% c(1:3, 10:12), "wet", "dry")) %>% 
  arrange(date)

head(series_tbl)
# A tibble: 6 × 5
  id    date       value month season
  <fct> <date>     <dbl> <dbl> <chr> 
1 M750  1990-01-01  6370     1 wet   
2 M750  1990-02-01  6430     2 wet   
3 M750  1990-03-01  6520     3 wet   
4 M750  1990-04-01  6580     4 dry   
5 M750  1990-05-01  6620     5 dry   
6 M750  1990-06-01  6690     6 dry   

Notes:

  • we will use built-in datasets to demonstrate end-to-end workflow
  • values have no real meaning (synthetic data), so substitute domain knowledge

Visualizations

Quick plots

  • plot_time_series() is a timetk convenience wrapper
Code
plot_time_series(series_tbl, date, value)

can group IDs prior to plotting (Note: facet arguments go inside the function)

Code
series_tbl %>% 
  group_by(id) %>% 
  plot_time_series(date, value, .facet_scales = "free")

more groupings (in this case, trend category)

Code
series_tbl %>% 
  group_by(id) %>% 
  mutate(trend = ifelse(date < "2002-04-01", "rising", "stable")) %>% 
  plot_time_series(date, value, .color_var = trend)

switching off .interactive will return a ggplot (for reports, sharing, etc)

Code
series_tbl %>% 
  group_by(id) %>% 
  mutate(month = month(date), 
         trend = ifelse(date < "2002-04-01", "rising", "stable")) %>% 
  plot_time_series(date, value, .color_var = trend, .interactive = FALSE)

however, heavy customization clutters the plot_time_series() call, so you may prefer plain ggplot()

Code
series_tbl %>%
  plot_time_series(date, value, .color_var = season,
                   # Returns static ggplot
                   .interactive = FALSE,  
                   # Customization
                   .title = "Series M750 - value over time",
                   .x_lab = "Date",
                   .y_lab = "Value (unitless)",
                   .color_lab = "Season") +
  scale_y_continuous(labels = scales::label_comma())
Code
# time series plot (use ggplot + tk plotting helper locally)
series_tbl %>%
  ggplot(aes(x = date, y = value, col = season)) +
  geom_line() +
  labs(title = 'Series M750 — Value over time', x = 'Date', y = 'Value (unitless)') +
  theme_minimal()

time series boxplots summarize the distribution of values within each .period

Code
series_tbl %>% 
  plot_time_series_boxplot(date, value, .period = "6 months")

Seasonality and trend decomposition

note on functions:

they often come in pairs: tk_* for analysis / preprocessing and plot_* for plotting

Code
tk_seasonal_diagnostics(series_tbl, date, value) %>% head(4)
# A tibble: 4 × 5
  date       .value month.lbl quarter year 
  <date>      <dbl> <fct>     <fct>   <fct>
1 1990-01-01   6370 January   1       1990 
2 1990-02-01   6430 February  1       1990 
3 1990-03-01   6520 March     1       1990 
4 1990-04-01   6580 April     2       1990 
Code
plot_seasonal_diagnostics(series_tbl, date, value)

use .feature_set to choose the seasonal period (e.g. diurnal)

default: auto

others: second, minute, hour, wday.lbl, week, month.lbl, quarter, year

Code
# note: "hour" requires sub-daily data (e.g. m4_hourly); shown here for syntax
tk_seasonal_diagnostics(series_tbl, date, value, .feature_set = "hour")

STL decomposition (Seasonal and Trend decomposition using Loess: season + trend + remainder)

Code
plot_stl_diagnostics(series_tbl, date, value)

likewise, the tk_ variant returns the values behind the plot and reports the automatically determined frequency and trend windows

frequency = 12 observations per 1 year

trend = 60 observations per 5 years

Code
tk_stl_diagnostics(series_tbl, date, value)
# A tibble: 306 × 6
   date       observed season trend remainder seasadj
   <date>        <dbl>  <dbl> <dbl>     <dbl>   <dbl>
 1 1990-01-01     6370  161.  6516.    -308.    6209.
 2 1990-02-01     6430  166.  6529.    -265.    6264.
 3 1990-03-01     6520  300.  6541.    -321.    6220.
 4 1990-04-01     6580  282.  6553.    -255.    6298.
 5 1990-05-01     6620  270.  6565.    -215.    6350.
 6 1990-06-01     6690  100.  6577.      12.5   6590.
 7 1990-07-01     6000 -648.  6590.      58.6   6648.
 8 1990-08-01     5450 -936.  6601.    -215.    6386.
 9 1990-09-01     6480 -173.  6612.      40.1   6653.
10 1990-10-01     6820   81.9 6624.     114.    6738.
# ℹ 296 more rows
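A quick consistency check on the output above: seasadj is the observed value with the seasonal component removed, i.e. observed minus season (row 1: 6370 - 161 ≈ 6209, up to rounding):

```r
# seasadj = observed - season, checked against row 1 of the printed output
observed <- 6370
season   <- 161    # rounded in the printout
seasadj  <- 6209
stopifnot(abs((observed - season) - seasadj) < 1)  # passes within rounding
```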

Facets and multi-series

  • plot_anomaly_diagnostics() and plot_time_series() scale well for multiple series
Code
m4_hourly %>%
  group_by(id) %>%
  plot_time_series(date, value, .facet_ncol = 2, .facet_scales = "free")
  • Use faceting + tk_augment_timeseries_signature() for grouped visuals (monthly, weekly patterns)
  • granular breakdown of time series data
Code
tk_augment_timeseries_signature(m4_hourly)
# A tibble: 3,060 × 31
   id    date                value  index.num  diff  year year.iso  half quarter
   <fct> <dttm>              <dbl>      <dbl> <dbl> <int>    <int> <int>   <int>
 1 H150  2013-09-01 12:00:00    45 1378036800    NA  2013     2013     2       3
 2 H150  2013-09-01 13:00:00    57 1378040400  3600  2013     2013     2       3
 3 H150  2013-09-01 14:00:00    27 1378044000  3600  2013     2013     2       3
 4 H150  2013-09-01 15:00:00    22 1378047600  3600  2013     2013     2       3
 5 H150  2013-09-01 16:00:00    21 1378051200  3600  2013     2013     2       3
 6 H150  2013-09-01 17:00:00    24 1378054800  3600  2013     2013     2       3
 7 H150  2013-09-01 18:00:00    29 1378058400  3600  2013     2013     2       3
 8 H150  2013-09-01 19:00:00    37 1378062000  3600  2013     2013     2       3
 9 H150  2013-09-01 20:00:00    97 1378065600  3600  2013     2013     2       3
10 H150  2013-09-01 21:00:00   165 1378069200  3600  2013     2013     2       3
# ℹ 3,050 more rows
# ℹ 22 more variables: month <int>, month.xts <int>, month.lbl <ord>,
#   day <int>, hour <int>, minute <int>, second <int>, hour12 <int>,
#   am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>,
#   qday <int>, yday <int>, mweek <int>, week <int>, week.iso <int>,
#   week2 <int>, week3 <int>, week4 <int>, mday7 <int>

Anomalies

flag outliers (anomalies) in a dataset (common for my work)

Code
plot_anomaly_diagnostics(series_tbl, date, value)

how does this work?

step 1) detrend and remove seasonality (STL decomposition)

step 2) flag remainder values outside an interquartile range (IQR) band around the median

Code
plot_stl_diagnostics(series_tbl, date, value, .interactive = FALSE) +
  theme(panel.spacing = unit(0.2, "lines"),
        strip.text = element_text(margin = margin(0, 0, 0, 0)))

control the sensitivity with .alpha and seasonality with .frequency

lower .alpha = more stringent threshold to flag anomalies (default = 0.05)
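A minimal base-R sketch of that step-2 logic (illustrative only; timetk's actual bounds and .alpha scaling differ in detail):

```r
# Flag points whose STL remainder falls outside an IQR band around the median.
# `alpha` mimics the role of timetk's .alpha: smaller alpha -> wider band,
# fewer flags. The 0.15 scaling is illustrative, not timetk's exact constant.
flag_anomalies <- function(remainder, alpha = 0.05) {
  med   <- median(remainder, na.rm = TRUE)
  limit <- IQR(remainder, na.rm = TRUE) * (0.15 / alpha)
  remainder < med - limit | remainder > med + limit
}

flag_anomalies(c(-30, 5, 12, -8, 400, 3))
# FALSE FALSE FALSE FALSE  TRUE FALSE  (only the 400 spike is flagged)
```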

anomaly flags can be viewed using tk_anomaly_diagnostics()

Code
plot_anomaly_diagnostics(series_tbl, date, value, .alpha = 0.1, .frequency = "quarter")

real example: identify and remove pumping influence to determine ambient groundwater elevation.

unflagged pumping was successfully detected as anomaly

Original Groundwater Hydrograph

Anomaly Hydrograph detecting unflagged pumping

Wrangling & Analysis

Time index handling

  • timetk makes it easy to keep time indices tidy:
    • tk_augment_timeseries_signature() (appends time columns)
    • tk_ts() (coerce tbl to ts object)
    • tk_tbl() (coerce ts to tbl object)
  • workflows: create features → summarize with rolling windows → lag features
  • emphasis on tidy approach: features are columns in the tibble for ML
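For orientation, the base-R round trip that tk_ts() / tk_tbl() streamline looks roughly like this (a sketch; tk_tbl() additionally restores a proper date column):

```r
# Monthly values coerced to a `ts` object and back, base-R style
values <- c(6370, 6430, 6520, 6580, 6620, 6690)
x_ts   <- ts(values, start = c(1990, 1), frequency = 12)  # Jan 1990, monthly

frequency(x_ts)  # 12

# Back to a rectangular form; note the fractional-year index that
# tk_tbl() would convert into a real date column for you
df <- data.frame(index = as.numeric(time(x_ts)), value = as.numeric(x_ts))
head(df, 2)
```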

Time series signature

Code
# Generate a time-series signature
sig_tbl <- tk_augment_timeseries_signature(series_tbl, .date_var = date)

# show top columns
sig_tbl %>% select(date, index.num, year, month.lbl, wday.lbl, hour) %>% head()
# A tibble: 6 × 6
  date       index.num  year month.lbl wday.lbl  hour
  <date>         <dbl> <int> <ord>     <ord>    <int>
1 1990-01-01 631152000  1990 January   Monday       0
2 1990-02-01 633830400  1990 February  Thursday     0
3 1990-03-01 636249600  1990 March     Thursday     0
4 1990-04-01 638928000  1990 April     Sunday       0
5 1990-05-01 641520000  1990 May       Tuesday      0
6 1990-06-01 644198400  1990 June      Friday       0
  • the signature contains numeric and categorical encoding of date/time
  • useful as input features for models — automatic and fast
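Under the hood these are ordinary calendar computations; a base-R sketch of a few signature-style columns (the column names here are just for illustration):

```r
# Derive a few calendar features by hand, as the signature does at scale
dates <- as.Date(c("1990-01-01", "1990-02-01", "1990-03-01"))
feat  <- data.frame(
  date  = dates,
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  wday  = weekdays(dates)  # locale-dependent labels, e.g. "Monday"
)
feat
```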

Lags & Rolling summaries

Code
# Create lag and rolling features (summary code)
feat_tbl <- series_tbl %>%
  mutate(
    lag_1 = lag(value, 1),
    roll_3 = slider::slide_dbl(value, mean, .before = 2, .complete = TRUE)
  )

head(feat_tbl)
# A tibble: 6 × 7
  id    date       value month season lag_1 roll_3
  <fct> <date>     <dbl> <dbl> <chr>  <dbl>  <dbl>
1 M750  1990-01-01  6370     1 wet       NA    NA 
2 M750  1990-02-01  6430     2 wet     6370    NA 
3 M750  1990-03-01  6520     3 wet     6430  6440 
4 M750  1990-04-01  6580     4 dry     6520  6510 
5 M750  1990-05-01  6620     5 dry     6580  6573.
6 M750  1990-06-01  6690     6 dry     6620  6630 
  • tk_augment_lags() (helper) and slider package for flexible rolling windows.
  • edge handling: NA from lags, .complete flag in slider.
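The .complete semantics can be mimicked in base R (a sketch; slider handles this, plus alignment and variable windows, natively):

```r
# 3-point trailing mean that stays NA until a full window exists,
# mirroring slider::slide_dbl(..., .before = 2, .complete = TRUE)
roll_mean_complete <- function(x, n = 3) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- mean(x[(i - n + 1):i])
  }
  out
}

roll_mean_complete(c(6370, 6430, 6520, 6580))
# NA NA 6440 6510  (matches roll_3 in the table above)
```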

Resampling & aggregations

  • summarize_by_time() and pad_by_time() are convenient wrappers for common operations
  • e.g., go from hourly to daily data
Code
head(m4_hourly, 4)
# A tibble: 4 × 3
  id    date                value
  <fct> <dttm>              <dbl>
1 H10   2015-07-01 12:00:00   513
2 H10   2015-07-01 13:00:00   512
3 H10   2015-07-01 14:00:00   506
4 H10   2015-07-01 15:00:00   500
Code
m4_hourly %>% 
  summarize_by_time(.by = 'day', value = mean(value, na.rm = TRUE)) %>% 
  head(4)
# A tibble: 4 × 2
  date                value
  <dttm>              <dbl>
1 2013-09-01 00:00:00  94.1
2 2013-09-02 00:00:00 111. 
3 2013-09-03 00:00:00  99  
4 2013-09-04 00:00:00 110. 
  • use pad_by_time() to fill missing dates (a wrapper for padr::pad())
  • good for prepping series before modeling or imputation
  • useful for accurate plots (it can auto-guess the .by interval)
Code
# remove middle days of data
hourly_missing <- m4_hourly %>% 
  filter(id == "H10") %>% 
  mutate(day = day(date)) %>%
  filter(!day %in% 10:20)
Code
plot_time_series(hourly_missing, date, value, .interactive = FALSE)

Code
pad_by_time(hourly_missing, date, .by = "hour") %>% 
  plot_time_series(date, value, .interactive = FALSE)

Machine Learning (briefly)

  • timetk helps with feature engineering
  • pair with tidymodels for forecasting or regression
  • typical flow: signature → lags/rolling → split → recipe → model

What is feature engineering?

the process of transforming raw data into informative input variables that improve model performance

often the most creative and valuable part of the workflow — combining domain insight with technical skill to make patterns obvious to the algorithm

goals of feature engineering

  • enhance signal, reduce noise
  • make patterns learnable
  • improve model generalization
  • apply domain knowledge to highlight key relationships

Common feature engineering techniques

  • Numeric transformations: scale, normalize, or log-transform variables, e.g. log(price), scale(age)
  • Categorical encoding: convert categories into numeric codes, e.g. one-hot encode region
  • Datetime features: extract time-based patterns, e.g. month(), day_of_week(), is_weekend
  • Lag features: capture temporal dependence, e.g. lag(sales, 1)
  • Rolling / window statistics: describe recent trends or volatility, e.g. roll_mean(sales, 3)
  • Interaction terms: combine variables to capture nonlinear effects, e.g. price * demand
  • Aggregations: summarize by groups or time windows, e.g. mean sales per category

each technique adds structure to the data

for time series, lag and rolling stats are powerful because they preserve the temporal relationships models need to understand momentum or seasonality
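A tiny base-R illustration of that point: on a smoothly trending series, the lag-1 values are almost perfectly correlated with the current values, so a single lag feature already carries most of the signal:

```r
# Lag-1 correlation on a purely linear trend is exactly 1
x    <- 100 + 5 * (1:12)   # toy upward trend
lag1 <- x[-length(x)]      # lag(x, 1) with the leading NA dropped
cur  <- x[-1]
cor(cur, lag1)  # 1
```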

Feature engineering with timetk

  • tk_augment_timeseries_signature(): generates rich date-based features (year, half, quarter, month, week, wday, etc.)
  • tk_augment_lags(): adds lag features for autoregressive effects
  • tk_augment_slidify(): computes rolling window statistics (mean, sd, etc.)
  • tk_augment_differences(): calculates first or seasonal differences
  • tk_augment_holiday_signature(): adds holiday indicators and proximity features

better features often beat more complex models
automated tools like timetk accelerate this process via tidy pipelines

ML — forecast pipeline

timetk fits into the tidymodels framework: its feature engineering is available as recipe steps inside workflows

Code
# Load required packages
library(tidymodels)
library(timetk)
library(dplyr)
library(parsnip)
library(workflows)
library(rsample)

# 1. Split into training and testing sets ----
# 'assess' defines the size of the test set (12 months)
splits <- time_series_split(series_tbl, assess = "12 months", cumulative = TRUE)

training_data <- rsample::training(splits)
testing_data  <- rsample::testing(splits)

# 2. Build a recipe for feature engineering ----
recipe_spec <- recipe(value ~ date, data = training_data) %>%
  # Create rich time-based features (month, quarter, year, etc.)
  step_timeseries_signature(date) %>%
  # Remove the raw date column; its information now lives in the signature features
  step_rm(date) %>%
  # Normalize numeric features (optional)
  step_normalize(all_numeric_predictors())

# 3. Define the model ----
model_spec <- linear_reg() %>%
  set_engine("lm")

# 4. Build workflow ----
wf <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec)

# 5. Fit the model ----
fit <- wf %>%
  fit(data = training_data)

# 6. Make forecasts ----
forecast_tbl <- predict(fit, new_data = testing_data) %>%
  bind_cols(testing_data)

# 7. Visualize the results ----
forecast_tbl %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = value), color = "steelblue", linewidth = 1.2) +
  geom_line(aes(y = .pred), color = "red", linetype = "dashed") +
  labs(title = "Forecast vs Actuals",
       y = "Value", x = "Date",
       caption = "Blue = observed | Red dashed = predicted") +
  theme_minimal()
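To put a number on the held-out window, a base-R RMSE helper (a hypothetical one-liner; within tidymodels you would typically reach for yardstick::rmse() instead):

```r
# Root mean squared error between actuals and predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# On the pipeline above: rmse(forecast_tbl$value, forecast_tbl$.pred)
rmse(c(10, 12, 14), c(11, 12, 13))  # sqrt(2/3), about 0.816
```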

Resources & further reading

  • timetk package site & documentation: https://business-science.github.io/timetk/

Thank you — Questions?