Aleix Ruiz de Villa
2014/10/29
A package to ease the typical machine learning modeling workflow
The `iris` data set:

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```
| \( y \) type | task type      |
|--------------|----------------|
| Discrete     | Classification |
| Continuous   | Regression     |
1) Pre-processing the data set (historical data)
2) Training the model: finding the \( f^* \) that best fits our data set
3) Evaluating the performance of \( f^* \) on new data
```r
library(caret)
data(iris)

X <- iris[, names(iris) != 'Species']
Y <- iris[, 'Species']

prePropInfo <- preProcess(X, method = c("center", "scale"))
predict(prePropInfo, X)
```
Before pre-processing:

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
```

After centering and scaling:

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1      -0.8977     1.01560       -1.336      -1.311
2      -1.1392    -0.13154       -1.336      -1.311
3      -1.3807     0.32732       -1.392      -1.311
4      -1.5015     0.09789       -1.279      -1.311
```
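The fitted pre-processing object can then be reused on data it has not seen. A minimal sketch, with a hypothetical new observation `newX`, showing that `predict` applies the centering and scaling statistics estimated from `X`:

```r
# Hypothetical new observation; column names must match those of X
newX <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                   Petal.Length = 1.5, Petal.Width = 0.2)

# prePropInfo re-applies the means and standard deviations learned from X
predict(prePropInfo, newX)
```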
Another available transformation is Box-Cox:

\[ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(y_i)} & \text{if } \lambda = 0. \end{cases} \]
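A minimal sketch of requesting this transformation through preProcess (Box-Cox needs strictly positive values, which the iris measurements satisfy):

```r
library(caret)
data(iris)
X <- iris[, names(iris) != 'Species']

# Estimate a lambda per predictor and apply the transformation
bcInfo <- preProcess(X, method = "BoxCox")
head(predict(bcInfo, X))
```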
```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1          NA          1.4         0.2  setosa
2          4.9         3.0           NA         0.2  setosa
3          4.7          NA          1.3          NA  setosa
4          4.6         3.1          1.5         0.2  setosa
5           NA         3.6          1.4         0.2  setosa
6           NA         3.9          1.7          NA  setosa
```
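preProcess can also impute missing values like these. A minimal sketch, assuming we create the NA-containing copy ourselves (the `XMissing` object below is illustrative, not part of the original data):

```r
library(caret)
data(iris)

# Illustrative copy of the predictors with some values removed at random
XMissing <- iris[, 1:4]
set.seed(1)
XMissing[sample(nrow(XMissing), 10), "Sepal.Width"] <- NA

# "knnImpute" fills each NA from the nearest complete rows; note that it
# also centers and scales the predictors as part of the distance computation
impInfo <- preProcess(XMissing, method = "knnImpute")
head(predict(impInfo, XMissing))
```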
dummyVars
```r
library(caret)

XCategories <- c("A", "B", "C")
XFactor <- data.frame(response = sample(c(0, 1), 20, replace = TRUE),
                      factVar  = factor(sample(XCategories, 20, replace = TRUE)))

dummyInfo <- dummyVars(response ~ ., data = XFactor)
predict(dummyInfo, XFactor)
```
```
  factVar factVar.A factVar.B factVar.C
1       B         0         1         0
2       C         0         0         1
3       C         0         0         1
4       B         0         1         0
5       A         1         0         0
6       B         0         1         0
```
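A small follow-up sketch: with `fullRank = TRUE`, dummyVars drops one reference level per factor, which avoids perfectly collinear dummy columns when the result feeds a linear model:

```r
# Keep only C - 1 dummy columns per factor (the reference level is dropped)
dummyInfoFR <- dummyVars(response ~ ., data = XFactor, fullRank = TRUE)
head(predict(dummyInfoFR, XFactor))
```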
Other useful filtering helpers:

- nearZeroVar: flags predictors with (near) zero variance
- findLinearCombos: finds exact linear combinations among numeric columns
- findCorrelation: suggests which of a set of highly correlated predictors to drop
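A minimal sketch of these filters on the numeric iris predictors:

```r
library(caret)
data(iris)
X <- iris[, 1:4]

# Column indices of predictors with (near) zero variance
nearZeroVar(X)

# Columns involved in exact linear combinations (expects a numeric matrix)
findLinearCombos(as.matrix(X))$remove

# Columns suggested for removal because of high pairwise correlation
findCorrelation(cor(X), cutoff = 0.9)
```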
```r
dataPart <- createDataPartition(iris$Species, p = .80, times = 2)
sapply(dataPart, head)
```

```
     Resample1 Resample2
[1,]         1         1
[2,]         4         3
[3,]         6         4
[4,]         7         7
[5,]         8         8
[6,]        10         9
```
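A minimal sketch of turning one of these resamples into an actual train/test split (object names are illustrative):

```r
# Rows listed in Resample1 form the training set; the rest are held out
trainIdx  <- dataPart$Resample1
trainData <- iris[ trainIdx, ]
testData  <- iris[-trainIdx, ]

table(trainData$Species)  # stratified sampling: roughly 80% of each class
```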
The train function's main arguments:

- method: any model from the list of supported models
- metric: "Accuracy" (default for classification), "Kappa", "RMSE" (default for regression), "Rsquared", "ROC" (needs extra control specifications), …
- trControl: control options (built with trainControl, described below)
- tuneLength: number of values per parameter used to build the tuning grid
- tuneGrid: your own parameter grid (see the sketch after this list)
- preProcess: any of the pre-processing options we have seen before
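A minimal sketch of supplying an explicit grid (assuming the randomForest package is installed; rf's single tuning parameter is mtry, and recent caret versions expect the grid column to be named without a leading dot):

```r
library(caret)
data(iris)

# Hypothetical explicit grid for the rf model's only tuning parameter
rfGrid <- expand.grid(mtry = c(2, 3, 4))

rfModel <- train(Species ~ ., data = iris, method = "rf",
                 tuneGrid   = rfGrid,
                 preProcess = c("center", "scale"))
```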
The trainControl function's main arguments:

- method: resampling method, e.g. "boot", "cv", …
- number: number of resamplings
- summaryFunction: your own evaluation function

```r
cvCtrl <- trainControl(method = "boot", repeats = 3)

model <- train(Species ~ ., iris, method = "rf",
               trControl = cvCtrl)
model
```
```
Random Forest

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica'

No pre-processing
Resampling: Bootstrapped (25 reps)

Summary of sample sizes: 150, 150, 150, 150, 150, 150, ...

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.9       0.9    0.02         0.04
  3     0.9       0.9    0.03         0.04
  4     0.9       0.9    0.03         0.04

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 4.
```
```r
model$finalModel
```

```
Call:
 randomForest(x = x, y = y, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
```
```r
predict(model$finalModel, iris[1, 1:4])
```

```
     1
setosa
Levels: setosa versicolor virginica
```

```r
iris[1, 5]
```

```
[1] setosa
Levels: setosa versicolor virginica
```
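A minimal sketch tying the three workflow steps together, with illustrative object names: split the data, train on one part, and evaluate on the held-out part. Calling predict on the train object (rather than on finalModel) also re-applies any preProcess options stored in the model:

```r
library(caret)
data(iris)

set.seed(123)

# 1) Split: 80% for training, the rest held out for evaluation
inTrain  <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
training <- iris[ inTrain, ]
testing  <- iris[-inTrain, ]

# 2) Train with 5-fold cross-validation on the training part only
fit <- train(Species ~ ., data = training, method = "rf",
             trControl = trainControl(method = "cv", number = 5))

# 3) Evaluate on data the model has never seen
preds <- predict(fit, newdata = testing)
confusionMatrix(preds, testing$Species)
```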