Sometimes a simple series of if statements is enough for a subject matter expert to make a fairly good prediction from input data. In such cases, it can be helpful to hand-craft a function that makes simple predictions on new data. However, it is not easy to put such functions on the same footing as actual models for comparison.
With bespoke, you can construct “fitted models” that behave the same as any other model within the tidymodels framework but use hand-crafted functions for prediction. Here we demonstrate a simple case.
Oil Data
We will work with the oils data from modeldata. This dataset consists of 96 samples of commercial oils, each of which belongs to one of seven classes of oil: corn, olive, peanut, pumpkin, rapeseed, soybean, or sunflower.
data(oils)
summary(oils)
#> palmitic stearic oleic linoleic
#> Min. : 4.50 Min. :1.700 Min. :22.80 Min. : 7.90
#> 1st Qu.: 6.20 1st Qu.:3.475 1st Qu.:26.30 1st Qu.:43.10
#> Median : 9.85 Median :4.200 Median :30.70 Median :50.80
#> Mean : 9.04 Mean :4.200 Mean :36.73 Mean :46.49
#> 3rd Qu.:11.12 3rd Qu.:5.000 3rd Qu.:38.62 3rd Qu.:58.08
#> Max. :14.90 Max. :6.700 Max. :76.70 Max. :66.10
#>
#> linolenic eicosanoic eicosenoic class
#> Min. :0.100 Min. :0.100 Min. :0.1000 corn : 2
#> 1st Qu.:0.375 1st Qu.:0.100 1st Qu.:0.1000 olive : 7
#> Median :0.800 Median :0.400 Median :0.1000 peanut : 3
#> Mean :2.272 Mean :0.399 Mean :0.3115 pumpkin :37
#> 3rd Qu.:2.650 3rd Qu.:0.400 3rd Qu.:0.3000 rapeseed :10
#> Max. :9.500 Max. :2.800 Max. :1.8000 soybean :11
#> sunflower:26
We’ll construct a function to choose a class from some simple properties in the input data.
Our Hand-Crafted Model Function
For now, the function will simply return a random choice for each row of data. As we develop this package further, we'll create a more realistic use case.
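The definition of random_baseline() isn't shown here. As a hypothetical sketch only, a function of this kind might look like the following; the arguments `new_data` and `n_classes` are assumptions for illustration, not necessarily the interface bespoke actually expects:

```r
# Hypothetical sketch of a random-baseline prediction function.
# The interface bespoke expects may differ; `new_data` and `n_classes`
# are assumed inputs for illustration.
random_baseline <- function(new_data, n_classes) {
  # Pick one of the n_classes class indices uniformly at random,
  # once per row of the new data
  sample.int(n_classes, size = nrow(new_data), replace = TRUE)
}
```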
“Training” a Model
In this case, we need to tell the “model” explicitly how many classes there are. We will likely make this a standard parameter available to all bespoke functions in the future.
oil_fit <- bespoke_classification(
class ~ .,
oils,
fn = random_baseline,
n_classes = 7
)
That object will now behave like other model objects!
Using Our Model
oils_no_classes <- oils[, 1:7]
head(oils_no_classes)
#> # A tibble: 6 × 7
#> palmitic stearic oleic linoleic linolenic eicosanoic eicosenoic
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 9.7 5.2 31 52.7 0.4 0.4 0.1
#> 2 11.1 5 32.9 49.8 0.3 0.4 0.1
#> 3 11.5 5.2 35 47.2 0.2 0.4 0.1
#> 4 10 4.8 30.4 53.5 0.3 0.4 0.1
#> 5 12.2 5 31.1 50.5 0.3 0.4 0.1
#> 6 9.8 4.2 43 39.2 2.4 0.4 0.5
predict(oil_fit, new_data = head(oils_no_classes))
#> # A tibble: 6 × 1
#> .pred_class
#> <fct>
#> 1 soybean
#> 2 soybean
#> 3 rapeseed
#> 4 sunflower
#> 5 peanut
#> 6 pumpkin
Note that in this case the predictions are completely random.
Working with Other Models
Of course, the main point of doing this is to compare against other models. For this use case, bespoke provides a parsnip-style model specification, bespoke().
bespoke_spec <- bespoke(fn = random_baseline) %>%
parsnip::set_engine("bespoke", n_classes = 7)
bespoke_fit <- bespoke_spec %>%
parsnip::fit(class ~ ., oils)
predict(bespoke_fit, new_data = head(oils_no_classes), type = "class")
#> # A tibble: 6 × 1
#> .pred_class
#> <fct>
#> 1 rapeseed
#> 2 corn
#> 3 sunflower
#> 4 sunflower
#> 5 sunflower
#> 6 rapeseed
predict(bespoke_fit, new_data = head(oils_no_classes), type = "prob")
#> # A tibble: 6 × 7
#> .pred_corn .pred_olive .pred_peanut .pred_pumpkin .pred_rapeseed .pred_soybean
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 0 0
#> 2 0 0 0 1 0 0
#> 3 0 0 0 0 0 0
#> 4 0 0 1 0 0 0
#> 5 0 0 1 0 0 0
#> 6 0 0 0 0 0 0
#> # ℹ 1 more variable: .pred_sunflower <dbl>
This specification can be compared to other models as if it were a “real” model.
tree_spec <- parsnip::decision_tree(mode = "classification") %>%
parsnip::set_engine(
engine = "rpart"
)
oil_set <- workflowsets::workflow_set(
preproc = list(class ~ .),
models = list(bespoke_spec, tree_spec)
)
bs_oil <- rsample::bootstraps(oils)
oil_res <- oil_set %>%
workflowsets::workflow_map(
"fit_resamples",
resamples = bs_oil
)
#> → A | warning: ✖ No observations were detected in `truth` for level: peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1
#> → B | warning: ✖ No observations were detected in `truth` for levels: corn and olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ C | warning: ✖ No observations were detected in `truth` for levels: corn and peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ D | warning: ✖ No observations were detected in `truth` for level: corn.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1→ E | warning: ✖ No observations were detected in `truth` for level: olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1There were issues with some computations A: x2 B: x1 C: x4 D: x5 E: x1
#>
#> → A | warning: ✖ No observations were detected in `truth` for level: peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> → B | warning: ✖ No observations were detected in `truth` for levels: corn and olive.
#> ℹ Computation will proceed by ignoring those levels.
#> → C | warning: ✖ No observations were detected in `truth` for levels: corn and peanut.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2
#> → D | warning: ✖ No observations were detected in `truth` for level: corn.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2→ E | warning: ✖ No observations were detected in `truth` for level: olive.
#> ℹ Computation will proceed by ignoring those levels.
#> There were issues with some computations A: x1 B: x1 C: x2There were issues with some computations A: x2 B: x1 C: x4 D: x5 E: x1
workflowsets::rank_results(oil_res)
#> # A tibble: 6 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 formula_decision… Prepro… accura… 0.836 0.0215 25 formula deci… 1
#> 2 formula_decision… Prepro… brier_… 0.119 0.0133 25 formula deci… 1
#> 3 formula_decision… Prepro… roc_auc 0.890 0.0127 25 formula deci… 1
#> 4 formula_bespoke Prepro… accura… 0.122 0.0109 25 formula besp… 2
#> 5 formula_bespoke Prepro… brier_… 0.847 0.0128 25 formula besp… 2
#> 6 formula_bespoke Prepro… roc_auc 0.508 0.0116 25 formula besp… 2
Unsurprisingly, the decision tree performs much better than the random model.
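As a sanity check, the bespoke model's resampled accuracy of about 0.122 is close to what we'd expect from guessing uniformly at random among seven classes:

```r
# Expected accuracy of a uniform random guess over 7 classes
1 / 7
#> [1] 0.1428571
```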