Sometimes a simple series of if statements is enough for a subject matter expert to make a fairly good prediction from input data. In such cases, it can be helpful to hand-craft a function that makes simple predictions on new data. However, it is not easy to put such functions on the same footing as actual models for comparison.
With bespoke, you can construct “fitted models” that behave the same as any other model within the tidymodels framework but use hand-crafted functions for prediction. Here we demonstrate a simple case.
Oil Data
We will work with the oils data from modeldata. This dataset consists of 96 samples of commercial oils, each of which belongs to one of seven classes of oil: corn, olive, peanut, pumpkin, rapeseed, soybean, or sunflower.
data(oils)
summary(oils)
#> palmitic stearic oleic linoleic
#> Min. : 4.50 Min. :1.700 Min. :22.80 Min. : 7.90
#> 1st Qu.: 6.20 1st Qu.:3.475 1st Qu.:26.30 1st Qu.:43.10
#> Median : 9.85 Median :4.200 Median :30.70 Median :50.80
#> Mean : 9.04 Mean :4.200 Mean :36.73 Mean :46.49
#> 3rd Qu.:11.12 3rd Qu.:5.000 3rd Qu.:38.62 3rd Qu.:58.08
#> Max. :14.90 Max. :6.700 Max. :76.70 Max. :66.10
#>
#> linolenic eicosanoic eicosenoic class
#> Min. :0.100 Min. :0.100 Min. :0.1000 corn : 2
#> 1st Qu.:0.375 1st Qu.:0.100 1st Qu.:0.1000 olive : 7
#> Median :0.800 Median :0.400 Median :0.1000 peanut : 3
#> Mean :2.272 Mean :0.399 Mean :0.3115 pumpkin :37
#> 3rd Qu.:2.650 3rd Qu.:0.400 3rd Qu.:0.3000 rapeseed :10
#> Max. :9.500 Max. :2.800 Max. :1.8000 soybean :11
#> sunflower:26
We’ll construct a function to choose a class from some simple properties in the input data.
Our Hand-Crafted Model Function
For now, the function will simply return a random choice for each row of data. As we develop this package further, we'll create a more realistic use case.
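The definition of random_baseline() isn't shown here. As a hypothetical sketch only, a function of this kind might look like the following; the arguments `new_data` and `n_classes` are assumptions for illustration, not necessarily the interface bespoke actually expects:

```r
# Hypothetical sketch of a random-baseline prediction function.
# The interface bespoke expects may differ; `new_data` and `n_classes`
# are assumed inputs for illustration.
random_baseline <- function(new_data, n_classes) {
  # Pick one of the n_classes class indices uniformly at random,
  # once per row of the new data
  sample.int(n_classes, size = nrow(new_data), replace = TRUE)
}
```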
“Training” a Model
In this case, we need to tell the “model” explicitly how many classes there are. We will likely make this a standard parameter available to all bespoke functions in the future.
oil_fit <- bespoke_classification(
class ~ .,
oils,
fn = random_baseline,
n_classes = 7
)
That object will now behave like other model objects!
Using Our Model
oils_no_classes <- oils[, 1:7]
head(oils_no_classes)
#> # A tibble: 6 × 7
#> palmitic stearic oleic linoleic linolenic eicosanoic eicosenoic
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 9.7 5.2 31 52.7 0.4 0.4 0.1
#> 2 11.1 5 32.9 49.8 0.3 0.4 0.1
#> 3 11.5 5.2 35 47.2 0.2 0.4 0.1
#> 4 10 4.8 30.4 53.5 0.3 0.4 0.1
#> 5 12.2 5 31.1 50.5 0.3 0.4 0.1
#> 6 9.8 4.2 43 39.2 2.4 0.4 0.5
predict(oil_fit, new_data = head(oils_no_classes))
#> # A tibble: 6 × 1
#> .pred_class
#> <fct>
#> 1 soybean
#> 2 soybean
#> 3 rapeseed
#> 4 sunflower
#> 5 peanut
#> 6 pumpkin
Note that in this case the predictions are completely random.
Working with Other Models
Of course, the main point of doing this is to compare against other models. For this use case, bespoke provides a parsnip-style model specification, bespoke().
bespoke_spec <- bespoke(fn = random_baseline) %>%
parsnip::set_engine("bespoke", n_classes = 7)
bespoke_fit <- bespoke_spec %>%
parsnip::fit(class ~ ., oils)
predict(bespoke_fit, new_data = head(oils_no_classes), type = "class")
#> # A tibble: 6 × 1
#> .pred_class
#> <fct>
#> 1 rapeseed
#> 2 corn
#> 3 sunflower
#> 4 sunflower
#> 5 sunflower
#> 6 rapeseed
predict(bespoke_fit, new_data = head(oils_no_classes), type = "prob")
#> # A tibble: 6 × 7
#> .pred_corn .pred_olive .pred_peanut .pred_pumpkin .pred_rapeseed .pred_soybean
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 0 0
#> 2 0 0 0 1 0 0
#> 3 0 0 0 0 0 0
#> 4 0 0 1 0 0 0
#> 5 0 0 1 0 0 0
#> 6 0 0 0 0 0 0
#> # ℹ 1 more variable: .pred_sunflower <dbl>
This specification can be compared to other models as if it were a “real” model.
tree_spec <- parsnip::decision_tree(mode = "classification") %>%
parsnip::set_engine(
engine = "rpart"
)
oil_set <- workflowsets::workflow_set(
preproc = list(class ~ .),
models = list(bespoke_spec, tree_spec)
)
bs_oil <- rsample::bootstraps(oils)
oil_res <- oil_set %>%
workflowsets::workflow_map(
"fit_resamples",
resamples = bs_oil
)
#> → A | warning: ✖ No observations were detected in `truth` for level: peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1
#> → B | warning: ✖ No observations were detected in `truth` for levels: corn and olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ C | warning: ✖ No observations were detected in `truth` for levels: corn and peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ D | warning: ✖ No observations were detected in `truth` for level: corn.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ E | warning: ✖ No observations were detected in `truth` for level: olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1There were issues with some computations A: x2 B: x1 C: x4 D: x5 E: x1
#>
#> → A | warning: ✖ No observations were detected in `truth` for level: peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> → B | warning: ✖ No observations were detected in `truth` for levels: corn and olive.
#> ℹ Computation will proceed by ignoring those levels.
#> → C | warning: ✖ No observations were detected in `truth` for levels: corn and peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2
#> → D | warning: ✖ No observations were detected in `truth` for level: corn.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2→ E | warning: ✖ No observations were detected in `truth` for level: olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2There were issues with some computations A: x2 B: x1 C: x4 D: x5 E: x1
workflowsets::rank_results(oil_res)
#> # A tibble: 6 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 formula_decision… Prepro… accura… 0.836 0.0215 25 formula deci… 1
#> 2 formula_decision… Prepro… brier_… 0.119 0.0133 25 formula deci… 1
#> 3 formula_decision… Prepro… roc_auc 0.890 0.0127 25 formula deci… 1
#> 4 formula_bespoke Prepro… accura… 0.122 0.0109 25 formula besp… 2
#> 5 formula_bespoke Prepro… brier_… 0.847 0.0128 25 formula besp… 2
#> 6 formula_bespoke Prepro… roc_auc 0.508 0.0116 25 formula besp… 2
Unsurprisingly, the decision tree performs much better than the random model.
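As a sanity check, the bespoke model's resampled accuracy of about 0.122 is close to what we'd expect from guessing uniformly at random among seven classes:

```r
# Expected accuracy of a uniform random guess over 7 classes
1 / 7
#> [1] 0.1428571
```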