02-clasificacion-tuneo.utf8

background-image: url(img/latinR-portada.png)
background-size: cover
class: animated slideInRight fadeOutLeft, middle

# 02. Clasificación. Tuneo de Modelos.

---

background-image: url(img/latinR-portada.png)
background-size: cover
class: animated slideInRight fadeOutLeft, middle

# Árboles de decisión

---

# Árboles de decisión

.footnote[Imagen tomada de https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition]

---

## Hiperparámetro

.bg-near-white.b--purple.ba.bw2.br3.shadow-5.ph4.mt5[
### Un hiperparámetro es una propiedad de un algoritmo de aprendizaje. Este valor influencia la forma en que trabaja el algoritmo. Estos valores no son aprendidos por el algoritmo desde los datos. Deben ser seteados antes de correr el algoritmo por el analista. 
]

---

## Hiperparámetros de 
## árboles de decisión

* **min_n**

Establezca n mínimo para dividir en cualquier nodo.
Es un método de parada temprana. Se utiliza para evitar el sobreajuste.

* **tree_depth**

Pone un límite a la profundidad máxima del árbol.
Un método para detener el algoritmo. Se utiliza para evitar el sobreajuste.

* **cost_complexity**

Agrega un costo o penalización a los errores de árboles más complejos. 
Es una forma de poda. Utilizado para evitar el sobreajuste (overfitting).

---

## Vamos a modelar las especies 🐧

### Ingreso los datos

```r
library(tidymodels) 
library(palmerpenguins)

penguins <- palmerpenguins::penguins %>%
  drop_na() %>% #elimino valores perdidos
  select(-year,-sex, -island) #elimino columnas q no son numéricas
glimpse(penguins) #observo variables restantes
```

```
## Rows: 333
## Columns: 5
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 1…
## $ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 380…
```

---

### Paso 1: Vamos a dividir el set de datos

```r
library(rsample)
set.seed(123) #setear la semilla
p_split <- penguins %>%
  initial_split(prop=0.75) # divido en 75%

p_train <- training(p_split)

p_split
```

```
## <Analysis/Assess/Total>
## <250/83/333>
```

```r
# para hacer validación cruzada estratificada
p_folds <- vfold_cv(p_train, strata = species) 
```

Estos son los datos de entrenamiento/prueba/total

* __Vamos a _entrenar_ con 250 muestras__
* __Vamos a _testear_ con 83 muestras__
* __Datos totales: 333__

---

### Paso 2: Preprocesamiento (receta)

```r
#creo la receta
recipe_dt <- p_train %>%
  recipe(species~.) %>%
  step_corr(all_predictors()) %>% #elimino las correlaciones
  step_center(all_predictors(), -all_outcomes()) %>% #centrado
  step_scale(all_predictors(), -all_outcomes()) %>% #escalado
  prep() 
recipe_dt #ver la receta
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Training data contained 250 data points and no missing data.
## 
## Operations:
## 
## Correlation filter removed no terms [trained]
## Centering for bill_length_mm, ... [trained]
## Scaling for bill_length_mm, ... [trained]
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Paso 3: Especificar el modelo  🌳

__Modelo de árboles de decisión__

__Vamos a utilizar el modelo por defecto__

```r
#especifico el modelo 
set.seed(123)
vanilla_tree_spec <- decision_tree() %>% #arboles de decisión
  set_engine("rpart") %>% #librería rpart
  set_mode("classification") #modo para clasificar
vanilla_tree_spec
```

```
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart
```

---
background-image: url(img/dt-fondo.png)
background-size: cover

### Paso 4: armo el workflow

```r
#armo el workflow
tree_wf <- workflow() %>%
  add_recipe(recipe_dt) %>% #agrego la receta
  add_model(vanilla_tree_spec) #agrego el modelo
tree_wf 
```

```
## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_corr()
## ● step_center()
## ● step_scale()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart
```

---

### Paso 5: Ajuste de la función

```r
#modelo vanilla sin tunning
set.seed(123) 
vanilla_tree_spec %>% 
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics() #desanidar las metricas
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.951    10 0.0123 
## 2 roc_auc  hand_till  0.970    10 0.00794
```

---

#### Paso 5.1: vamos a especificar 2 hiperparámatros

```r
set.seed(123) 
trees_spec <- decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification") %>% 
  set_args(min_n = 20, cost_complexity = 0.1) #especifico hiperparámetros

trees_spec %>%
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics()
```

---

# Ejercicio

.bg-near-white.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
#### 1. ¿Por qué es el mismo valor obtenido en los dos casos?

#### 2. Dejando fijo el valor de min_n=20, pruebe C=1, C=0.5 y C=0.
#### 3. Dejando fijo el valor de C=0, pruebe min_n 1 y 5.
]

---

### Paso 6: Predicción del modelo  🔮

```r
#utilizamos la funcion last_fit junto al workflow y al split de datos
final_fit_dt <- last_fit(tree_wf,
  split = p_split
)
final_fit_dt %>%
  collect_metrics()
```

```
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.880
## 2 roc_auc  hand_till      0.924
```

---

### Paso 6.1: matriz de confusión

```r
final_fit_dt %>%
  collect_predictions() %>%
  conf_mat(species, .pred_class) #para ver la matriz de confusión
```

```
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        35         0      0
##   Chinstrap      5         9      2
##   Gentoo         1         2     29
```

```r
final_fit_dt %>%
  collect_predictions() %>%
  sens(species, .pred_class) #sensibilidad global del modelo
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    macro          0.869
```

---

# Ejercicio

.bg-near-white.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
#### Repetir estos pasos para el modelo de C=0 y min_n=5
]

---

## Resumiendo

* initial_split()

]

]

---

## Resumiendo

* step_*()

]

]

---

## Resumiendo

* set_engine()
* mode()

]

]

---

## Resumiendo

* workflow()
* add_recipe()
* add_model()

]

]

---

## Resumiendo

* last_fit()
]

]

---

## Resumiendo

* collect_metrics()
* collect_predictions() + conf_mat()

]

]

---

background-image: url(img/latinR-portada.png)
background-size: cover
class: animated slideInRight fadeOutLeft, middle

# Random Forest

---

# Random Forest

---

# Bootstraping

---

# Bagging
 
<img src="img/random-forest3.png" width="100%" style="display: block; margin: auto;" />

---
### Veamos que sucede al clasificar el conjunto de testeo
 
<img src="img/random-forest4.png" width="100%" style="display: block; margin: auto;" />

---
### Voto mayoritario (majority vote)
 
<img src="img/random-forest5.png" width="100%" style="display: block; margin: auto;" />

---

### Hiperparámetros de 
### random forest
  
### mtry
* El número de predictores a muestrearse en cada split de árbol

### min_n
* el número de observaciones necesarias para seguir dividiendo nodos

---
background-image: url(img/rf-fondo.png)
background-size: cover
  
  
### Paso 2: Preprocesamos los datos

```r
p_recipe <- training(p_split) %>%
  recipe(species~.) %>%
  step_corr(all_predictors()) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>%
  prep()
p_recipe
```

---

### Paso 3: Especificar el modelo

```r
rf_spec <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification")
```

---

### Veamos como funciona el modelo sin tunning

```r
set.seed(123)

rf_spec %>% 
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.972    10 0.0108 
## 2 roc_auc  hand_till  0.996    10 0.00192
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Random Forest con mtry=2  🔧

```r
rf2_spec <- rf_spec %>% 
  set_args(mtry = 2)

set.seed(123)

rf2_spec %>% 
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics()
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Random Forest con mtry=3  🔧

```r
rf3_spec <- rf_spec %>% 
  set_args(mtry = 3)

set.seed(123)

rf3_spec %>% 
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.967    10 0.0104 
## 2 roc_auc  hand_till  0.997    10 0.00115
```

---

## Random Forest con mtry=4  🔧

```r
rf4_spec <- rf_spec %>% 
  set_args(mtry = 4)

set.seed(123)
rf4_spec %>% 
  fit_resamples(species ~ ., 
                resamples = p_folds) %>% 
  collect_metrics()
```

```
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.964    10 0.00972
## 2 roc_auc  hand_till  0.996    10 0.00147
```

---

background-image: url(img/fondo-conceptos.png)
background-size: cover
class: animated slideInRight fadeOutLeft, middle

### En la realidad lo que uno realiza es un tuneo de los hiperparámetros del modelo de manera automática

---
background-image: url(img/rf-fondo.png)
background-size: cover

### Tuneo de hiperparámetros automático 🔧

```r
tune_spec <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
tune_spec
```

```
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Computational engine: ranger
```

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Workflows

```r
tune_wf <- workflow() %>%
  add_recipe(p_recipe) %>%
  add_model(tune_spec)

set.seed(123)
cv_folds <- vfold_cv(p_train, strata = species)
tune_wf
```

```
## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## ● step_corr()
## ● step_center()
## ● step_scale()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Computational engine: ranger
```

---

### Paralelizamos los cálculos

```r
doParallel::registerDoParallel()

set.seed(123)
tune_res <- tune_grid(
  tune_wf,
  resamples = cv_folds,
  grid = 20
)

tune_res
```

```
## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 x 4
##    splits           id     .metrics          .notes          
##    <list>           <chr>  <list>            <list>          
##  1 <split [224/26]> Fold01 <tibble [40 × 6]> <tibble [0 × 1]>
##  2 <split [224/26]> Fold02 <tibble [40 × 6]> <tibble [0 × 1]>
##  3 <split [224/26]> Fold03 <tibble [40 × 6]> <tibble [0 × 1]>
##  4 <split [224/26]> Fold04 <tibble [40 × 6]> <tibble [0 × 1]>
##  5 <split [224/26]> Fold05 <tibble [40 × 6]> <tibble [0 × 1]>
##  6 <split [225/25]> Fold06 <tibble [40 × 6]> <tibble [0 × 1]>
##  7 <split [225/25]> Fold07 <tibble [40 × 6]> <tibble [0 × 1]>
##  8 <split [226/24]> Fold08 <tibble [40 × 6]> <tibble [0 × 1]>
##  9 <split [227/23]> Fold09 <tibble [40 × 6]> <tibble [0 × 1]>
## 10 <split [227/23]> Fold10 <tibble [40 × 6]> <tibble [0 × 1]>
```

---

## Rango de valores para min_n y mtry

![](02-clasificacion-tuneo_files/figure-html/unnamed-chunk-33-1.png)

---
background-image: url(img/rf-fondo.png)
background-size: cover

## Elijo el mejor modelo 🏆

* Con la función select_best

```r
best_auc <- select_best(tune_res, "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
```

```
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 2
##   trees = 1000
##   min_n = 4
## 
## Computational engine: ranger
```

---

## Valores finales

```r
set.seed(123)
final_wf <- workflow() %>%
  add_recipe(p_recipe) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(p_split)

final_res %>%
  collect_metrics()
```

```
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.988
## 2 roc_auc  hand_till      1
```

---

## Matriz de Confusión

```r
final_res %>%
  collect_predictions() %>%
  conf_mat(species, .pred_class)
```

```
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        41         0      0
##   Chinstrap      0        11      1
##   Gentoo         0         0     30
```

---

## Ejercicio 3

.bg-near-white.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
#### Con el set de datos de iris realice una clasificación con random forest.

]

---

## Resumiendo

]

---

## Resumiendo

]
---

## Resumiendo

]

---

## Resumiendo

]

]

---

## Resumiendo

]

---

### Importancia de las variables

* libreria vip

```r
library(vip)
set.seed(123)
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ .,
      data = juice(p_recipe)) %>%
  vip(geom = "point")+
  theme_xaringan()
```

---

### Gráfico 
![](02-clasificacion-tuneo_files/figure-html/unnamed-chunk-43-1.png)

---

### Ejemplos (un poco) más reales

#### [Tuning random forest hyperparameters with #TidyTuesday trees data](https://juliasilge.com/blog/sf-trees-random-tuning/)

<br><br><br><br>
#### [Hyperparameter tuning and #TidyTuesday food consumption](https://juliasilge.com/blog/food-hyperparameter-tune/)

---

# Bibliografía

#### Max Kuhn & Julia Silge - [Tidy Modeling (en desarrollo)](https://www.tmwr.org/)

#### Julia Silge's [Personal Blog](https://juliasilge.com/blog/)

#### Max Kuhn & Kjell Johnson - [Feature engineering and Selection: A Practical Approach for Predictive Models](http://www.feat.engineering/)

#### Max Kuhn & Kjell Johnson - Applied Predictive Modeling

#### Documentación de [`tidymodels`](https://www.tidymodels.org/)

#### Alison Hill [rstudio conf](https://conf20-intro-ml.netlify.app/materials/) y [curso virtual](https://alison.rbind.io/post/2020-06-02-tidymodels-virtually/) material

---

# Muchas gracias 🤖