A Short Intro to the caret Package

Aleix Ruiz de Villa
2014/10/29

Pre-Strata conference meetup: Data Science Spain, RugBcn, Gemleb

Introduction

What is it

A package that eases the typical machine learning modeling workflow

Advantages

  • Uniform syntax across the most popular machine learning packages (147 models)
  • Helper functions for the whole workflow
  • Parallel processing (Linux only), as sketched below
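A minimal sketch of enabling it, assuming the doMC backend (fork-based, hence the Linux restriction above); once a parallel backend is registered, train picks it up automatically:

library(doMC)
registerDoMC(cores = 4)  # resampling loops inside train() now run on 4 cores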

Introduction - iris data

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Introduction - iris data

[Figure: plot of the iris data]

Introduction - machine learning

  • You have outcomes \( y_1, \ldots, y_n \) and explanatory variables \( x_1, \ldots, x_n \).
  • You want to model \( y \) using \( x \): find a function \( y = f(x) \)
  • You use your historical data \( y_1, \ldots, y_n \) and \( x_1, \ldots, x_n \) to find/build such function \( f \).
  • Ultimate goal: given a new \( x \) you want to predict its corresponding \( y \)

  Type of \( y \)   Task type
  Discrete          Classification
  Continuous        Regression

Introduction - typical machine learning workflow

1) Pre-processing data set (historical data)

2) Training model: finding the \( f^* \) that best fits our data set

3) Evaluate performance of \( f^* \) on new data

`preProcess` function - centering and scaling

Centering and Scaling: \( \frac{x-\mu}{\sigma} \)

library(caret)
data(iris)
X <- iris[, names(iris) != 'Species']
Y <- iris[, 'Species']

prePropInfo <- preProcess(X, method = c("center", "scale"))
predict(prePropInfo, X)

`preProcess` function - centering and scaling continued

Original data:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2

Centered and scaled:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1      -0.8977     1.01560       -1.336      -1.311
2      -1.1392    -0.13154       -1.336      -1.311
3      -1.3807     0.32732       -1.392      -1.311
4      -1.5015     0.09789       -1.279      -1.311

`preProcess` function - transforming variables

Box-Cox transformations:

\[ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(y_i)} & \text{if } \lambda = 0. \end{cases} \]
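A minimal sketch of applying this through preProcess (Box-Cox assumes strictly positive inputs, which holds for the iris measurements):

library(caret)
data(iris)
X <- iris[, 1:4]

# Estimate one lambda per column, then apply the transformation
bcInfo <- preProcess(X, method = "BoxCox")
head(predict(bcInfo, X))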

Dimensionality reduction

  • Principal Component Analysis
  • Independent Component Analysis
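Both are available through preProcess; a minimal sketch using PCA, where thresh is the fraction of variance to retain:

library(caret)
data(iris)
X <- iris[, 1:4]

# Keep enough principal components to explain 95% of the variance
pcaInfo <- preProcess(X, method = "pca", thresh = 0.95)
head(predict(pcaInfo, X))  # columns PC1, PC2, ...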

`preProcess` function - missing values

Imputation: missing values

  • knn (k-nearest neighbors) on the training set
  • bagged (bootstrap aggregating) trees
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1          NA          1.4         0.2  setosa
2          4.9         3.0           NA         0.2  setosa
3          4.7          NA          1.3          NA  setosa
4          4.6         3.1          1.5         0.2  setosa
5           NA         3.6          1.4         0.2  setosa
6           NA         3.9          1.7          NA  setosa
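A minimal sketch of knn imputation; XMissing is a hypothetical copy of the iris measurements with a few NAs inserted:

library(caret)
data(iris)

# Hypothetical example data: punch a few holes in the iris measurements
XMissing <- iris[, 1:4]
XMissing[c(1, 3), "Sepal.Width"]  <- NA
XMissing[c(5, 6), "Sepal.Length"] <- NA

# knnImpute fills each NA from the k nearest complete rows;
# it also centers and scales, since knn needs comparable units
# (use method = "bagImpute" for the bagged-trees alternative)
impInfo <- preProcess(XMissing, method = "knnImpute")
head(predict(impInfo, XMissing))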

Data manipulation - dummy variables

Dummy variables: dummyVars

library(caret)
XCategories <- c("A", "B", "C")
XFactor <- data.frame(response = sample(c(0, 1), 20, replace = TRUE),
                      factVar  = factor(sample(XCategories, 20, replace = TRUE)))

# one indicator column per factor level
dummyInfo <- dummyVars(response ~ ., data = XFactor)
predict(dummyInfo, XFactor)

Data Manipulation - dummy variables continued

  factVar factVar.A factVar.B factVar.C
1       B         0         1         0
2       C         0         0         1
3       C         0         0         1
4       B         0         1         0
5       A         1         0         0
6       B         0         1         0

Data manipulation - others

Near-zero variance predictors: nearZeroVar

  • can cause problems with cross-validation

Linear dependencies: findLinearCombos

Remove predictors correlated above a threshold: findCorrelation
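A minimal sketch of these filters on made-up data (the column names are illustrative):

library(caret)
set.seed(1)

# Made-up data: x3 is nearly constant, x4 is almost a copy of x1
X <- data.frame(x1 = rnorm(100),
                x2 = rnorm(100),
                x3 = c(rep(0, 99), 1))
X$x4 <- X$x1 + rnorm(100, sd = 0.01)

nearZeroVar(X)                         # flags x3 (column 3)
findCorrelation(cor(X), cutoff = 0.9)  # flags one of the x1/x4 pair
findLinearCombos(as.matrix(X))         # exact linear dependencies (none here)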

Data Splitting/Resampling I

k Fold Cross Validation

Data Splitting/Resampling II

Stratified train/test splits: createDataPartition

dataPart <- createDataPartition(iris$Species, p = 0.80, times = 2)
sapply( dataPart, head )
     Resample1 Resample2
[1,]         1         1
[2,]         4         3
[3,]         6         4
[4,]         7         7
[5,]         8         8
[6,]        10         9
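The returned indices are the row numbers of each training split, so using the first resample:

training <- iris[ dataPart$Resample1, ]
testing  <- iris[-dataPart$Resample1, ]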

Data Splitting/Resampling III

Repeated k-fold cross-validation

Bootstrapping: sampling with replacement, keeping the same length as the historical data set
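caret has helpers for these schemes too; a minimal sketch:

library(caret)
data(iris)

repFolds <- createMultiFolds(iris$Species, k = 10, times = 3)  # repeated k-fold indices
boots    <- createResample(iris$Species, times = 25)           # bootstrap samples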

Training

  • Models included: browsable by type or as a full list
  • Automatic tuning of parameters using resampling techniques: by default, if the model has \( p \) tuning parameters, a grid of \( 3^p \) combinations is evaluated
  • You can add your own custom models

Train syntax I

The train function

  • Formula or matrix syntax
  • method: any model from the list
  • metric: “Accuracy” (default for classification), “Kappa”, “RMSE” (default for regression), “Rsquared”, “ROC” (needs extra control specifications), …
  • trControl: control options
  • tuneLength: number of values per parameter used to build the grid
  • tuneGrid: your own parameter grid
  • preProcess: any of the methods we have seen before
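Putting several options together, a sketch with an explicit tuning grid (the mtry values are illustrative):

library(caret)
data(iris)

rfGrid <- expand.grid(mtry = c(2, 3, 4))  # hypothetical grid for random forest

model <- train(Species ~ ., data = iris,
               method     = "rf",
               metric     = "Kappa",
               tuneGrid   = rfGrid,
               preProcess = c("center", "scale"),
               trControl  = trainControl(method = "cv", number = 10))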

Train syntax II

The trainControl function

  • method: resampling scheme, e.g. “boot”, “cv”, …
  • number: number of resampling iterations
  • summaryFunction: your own evaluation function
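A sketch of a custom-metric setup using caret's built-in twoClassSummary, which computes ROC and therefore needs class probabilities and a two-class outcome:

library(caret)

# ROC-based model selection for a two-class problem
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,            # required for ROC
                     summaryFunction = twoClassSummary)
# then: train(..., metric = "ROC", trControl = ctrl)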

Train/Predict example I

cvCtrl <- trainControl(method = "boot", number = 25)  # `repeats` applies only to method = "repeatedcv"
model <- train(Species ~ ., data = iris, method = "rf",
               trControl = cvCtrl)

Train/Predict example II

Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.9       0.9    0.02         0.04    
  3     0.9       0.9    0.03         0.04    
  4     0.9       0.9    0.03         0.04    

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 4. 

Train/Predict example III

model$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

Train/Predict example IV

predict(model$finalModel, iris[1, 1:4])
     1 
setosa 
Levels: setosa versicolor virginica

iris[1, 5]
[1] setosa
Levels: setosa versicolor virginica
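You can also predict through the train object itself; predict.train then applies any preProcess steps before calling the underlying model:

predict(model, iris[1, 1:4])                 # class prediction via predict.train
predict(model, iris[1, 1:4], type = "prob")  # class probabilities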
