R is an excellent language for machine learning. R primarily excels in the following three areas:
- Data manipulation
- Plotting
- Built-in libraries
R manipulates data using a native, in-built data structure called a data frame. Let’s load some data now and see how R can help us to work with it.
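Before loading anything, it helps to see what a data frame gives us. The toy example below (hypothetical data, not part of the iris walkthrough) sketches the basic operations we'll rely on:

```r
# A toy data frame: each column is a named vector, each row an observation
df <- data.frame(
  length  = c(5.1, 4.9, 6.3),
  width   = c(3.5, 3.0, 3.3),
  species = c("setosa", "setosa", "virginica")
)

nrow(df)              # number of rows (observations)
df$length             # select a column by name
df[df$width > 3.1, ]  # filter rows with a logical condition
```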
The data
Luckily for us, R comes with some built-in datasets that we can simply load and have ready to go. We’ll use one such dataset, called iris, to test some of R’s machine learning capabilities. This dataset contains 150 flowers, 50 from each of three species: iris setosa, iris versicolor, and iris virginica. Each data point records 4 measurement attributes that we can use to classify the species: sepal length, sepal width, petal length, and petal width. Let’s load the data and get started.
data("iris")
head(iris)
The R head command lets us inspect the first few rows of a dataset. You’ll see that the first rows all have the same class label, given by the Species column. When training a machine learning model, we want our data to be randomized, so let’s fix this.
Shuffling the data
The elements of our data frame are the rows, so to shuffle the data we need to permute the rows. To do this we can use the built-in R functions nrow and sample. nrow(iris) returns the number of rows in the iris dataset. Passing this to sample returns a random permutation of the row indices 1 through 150. We can then index into the dataset using the shuffled row indices as shown below.
shuffled <- iris[sample(nrow(iris)),]
head(shuffled)
Now, when we take a look at the head of the dataset, we see that we’re getting a mix of different flowers, which is great!
Train and validation splits
Now that we’ve got the data shuffled, let’s partition it into a training and validation set. We’ll hold out 20 percent of the data points for the validation set. We can get the two partitions as follows:
n <- nrow(shuffled)
split <- round(.8 * n)
train <- shuffled[1:split,]
validate <- shuffled[(split + 1):n,]
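One way to double-check an 80/20 split is to compare the partition sizes with nrow. This is a self-contained sketch that repeats the shuffle and split on iris:

```r
# Shuffle iris, split 80/20, then verify the partition sizes
data(iris)
shuffled <- iris[sample(nrow(iris)), ]

n <- nrow(shuffled)
split <- round(.8 * n)
train <- shuffled[1:split, ]
validate <- shuffled[(split + 1):n, ]

nrow(train)     # 120 rows (80 percent)
nrow(validate)  # 30 rows (20 percent)
```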
Defining a model
Once we have our data divided into train and validation splits, we can start training a model on it. We’ll first explore using boosted decision trees for classification with an R package called xgboost. Let’s first install it.
install.packages("xgboost")
require(xgboost)
Now, we can construct the model. The XGBoost interface requires that we separate the training data into the X data (the measurement columns) and the Species labels. To do this, we just select the corresponding columns from the data frame. We also need to convert the Species labels into numerical values in {0,1,2}, as XGBoost only handles numeric class labels. We’ll do the same for the validation set.
# Training set
dcols <- c(1:4)
lcols <- c(5)
train$Species <- as.character(train$Species)
train$Species[train$Species == "setosa"] <- 0
train$Species[train$Species == "virginica"] <- 1
train$Species[train$Species == "versicolor"] <- 2
X <- train[,dcols]
labels <- as.numeric(train[,lcols])
# Validation Set
validate$Species <- as.character(validate$Species)
validate$Species[validate$Species == "setosa"] <- 0
validate$Species[validate$Species == "virginica"] <- 1
validate$Species[validate$Species == "versicolor"] <- 2
Xv <- validate[,dcols]
labelsV <- as.numeric(validate[,lcols])
Training the model
Now that we’ve gotten that out of the way, we can begin training the model. The interface for xgboost is super simple; we can train the model with a single call.
booster <- xgboost(data = as.matrix(X), label = labels, max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "multi:softmax", num.class = 3)
[1] train-merror:0.000000
[2] train-merror:0.000000
The parameters we used deserve some explanation.
- max.depth = 2 uses decision trees that have a maximum depth of 2. Deeper trees can give higher accuracy but are also higher in variance. Depth 2 should be a good balance for our purposes.
- eta specifies the learning rate for our model. This is a hyperparameter that requires some experimentation and tuning to get right. We’ll use 1 for this example.
- nthread specifies the number of CPU threads to use while training the model. This is a small dataset so we don’t need many threads, but we would for larger datasets.
- nrounds specifies the number of training epochs (passes) to perform over the data. We’ll start with 2.
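Since eta usually needs tuning, one common approach is k-fold cross-validation, and the xgboost package ships an xgb.cv helper for exactly this. The sketch below (assuming xgboost is installed; it rebuilds the feature matrix directly from iris, with labels taken from the factor level order rather than the article's mapping) compares two candidate learning rates:

```r
library(xgboost)

# Rebuild a numeric-label design matrix straight from iris
data(iris)
X <- as.matrix(iris[, 1:4])
labels <- as.numeric(iris$Species) - 1  # setosa = 0, versicolor = 1, virginica = 2

# 5-fold cross-validation over two candidate learning rates
for (eta in c(0.3, 1)) {
  cv <- xgb.cv(data = X, label = labels, nfold = 5, nrounds = 2,
               eta = eta, max.depth = 2, nthread = 2,
               objective = "multi:softmax", num.class = 3, verbose = 0)
  print(cv$evaluation_log)  # per-round train/test merror for this eta
}
```

Picking the eta with the lowest held-out merror is a reasonable default before retraining on the full training split.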
Evaluating model performance
Now that we’ve trained our boosted tree model, we can evaluate its performance on the validation set. To do this, we can use the predict function that comes packaged with xgboost to generate our predictions.
preds <- predict(booster, as.matrix(Xv))
head(preds)
[1] 1 1 0 0 1 0
You can see that our boosted model is predicting nicely. All that is left is to calculate its accuracy. We will take the mean of all entries of preds that are not equal to the corresponding entries of labelsV.
err <- mean(preds != labelsV)
print(paste("validation-error= ", err))
[1] "validation-error= 0.0495867768595041"
print(paste("validation-accuracy=", 1 - err))
[1] "validation-accuracy= 0.950413223140496"
Nice! The validation accuracy is about 95 percent, which is great performance from the model.
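For a finer-grained view than a single accuracy number, a confusion table built with base R's table shows which species get mixed up. The preds and labelsV vectors below are short illustrative stand-ins for the ones computed above, not the actual model output:

```r
# Illustrative predicted vs. true labels (0 = setosa, 1 = virginica, 2 = versicolor)
preds   <- c(0, 0, 1, 2, 2, 1, 0, 2)
labelsV <- c(0, 0, 1, 2, 1, 1, 0, 2)

# Rows are predictions, columns are true labels; off-diagonal cells are errors
table(predicted = preds, actual = labelsV)

mean(preds != labelsV)  # the same error computation used above
```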
Visualizing our results
To finish off the project, we’ll visualize our predictions. First, we will plot the original data using a library called ggvis. Before that, though, we must install the package.
install.packages("ggvis")
Now, since we’re visually limited to two dimensions, we will choose to plot our classes vs. the Sepal Length and Width attributes. This can be done as follows.
library(ggvis)
data <- iris
data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
To visualize how our model predicts in comparison, we’ll quickly run it across all data points, not just the validation set. To combine the training and validation data, we use the rbind function, which joins the two data frames vertically. We also undo the work we did before, converting from numeric labels back to species names.
all_data <- rbind(X, Xv)
preds <- predict(booster, as.matrix(all_data))
all_data["Species"] <- preds
all_data$Species[all_data$Species == 0] <- "setosa"
all_data$Species[all_data$Species == 1] <- "virginica"
all_data$Species[all_data$Species == 2] <- "versicolor"
head(all_data)
Everything looks good! Let’s plot the final result, using the same code as before.
all_data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
We can visually compare the true plot with the plot of our predictions. There should only be a few points misclassified between the two!
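Rather than eyeballing the two plots, we can also list the disagreeing points directly with which. The truth and preds vectors below are short illustrative stand-ins for the full-dataset labels and predictions:

```r
# Illustrative species predictions over the full dataset vs. the ground truth
truth <- c("setosa", "setosa", "virginica", "versicolor", "virginica")
preds <- c("setosa", "setosa", "versicolor", "versicolor", "virginica")

misclassified <- which(preds != truth)
misclassified          # indices of points that differ between the two plots
length(misclassified)  # how many points to look for
```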