This report is produced in partial fulfillment of the requirements for the Practical Machine Learning Course offered by Johns Hopkins Bloomberg School of Public Health and Coursera.
This report describes the data processing and model-building steps performed on the Data Classification of Body Postures and Movements dataset. For more information, visit http://groupware.les.inf.puc-rio.br/har
The aim is to select and build an optimal prediction model to predict the 20 test cases provided in the course.
1. Training and testing data are read from the online source.
## Download and read raw data
url1 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url1, destfile="pml-training.csv")
url2 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url2, destfile="pml-testing.csv")
dataTrain <- read.csv("pml-training.csv", header=TRUE)
dataTest <- read.csv("pml-testing.csv", header=TRUE)
2. The `dataTest` set is held out. Exploration and subsequent analysis are performed only on the `dataTrain` set.
3. After running `str(dataTrain)`, it is determined that there are 19622 observations of 160 variables.
1. It is noted that many variables in the dataset contain invalid values such as NAs and blanks, for example the `dataTrain$var_total_accel_belt` variable below. It is decided that variables with a large proportion of invalid values be excluded from the model.
summary(dataTrain$var_total_accel_belt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 1 0 16 19216
2. After excluding the above-mentioned variables, the data contains no more invalid values, as shown by the `complete.cases` command. We now have 54 variables, including the variable to be predicted, `classe`.
dataTidy <- dataTrain[,-c(grep("^amplitude|^kurtosis|^skewness|^avg|^cvtd_timestamp|^max|^min|^new_window|^raw_timestamp|^stddev|^var|^user_name|X",names(dataTrain)))]
paste("Complete Cases:")
## [1] "Complete Cases:"
table(complete.cases(dataTidy))
##
## TRUE
## 19622
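The name-pattern filter above works for this dataset, but the same columns could be found programmatically. The sketch below is illustrative only (the helper `mostlyInvalid` and the 90% threshold are assumptions, not part of the original analysis): it flags columns whose values are mostly NA or blank.

```r
# Sketch (assumption): flag columns whose values are >= 90% NA or blank,
# as an alternative to the name-pattern filter used above.
mostlyInvalid <- function(df, threshold = 0.9) {
  # col %in% "" is FALSE for NA entries, so the two checks do not overlap
  sapply(df, function(col) mean(is.na(col) | col %in% "") >= threshold)
}

# Possible usage on the raw training set:
# dataTidy <- dataTrain[, !mostlyInvalid(dataTrain)]
```

This is data-driven rather than name-driven, so it would also catch any mostly-empty column whose name does not match the patterns in the `grep` call above.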
1. Given the medium-to-large sample size, it is decided that the tidy data be further split into two sets: 60% for training and 40% for testing.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(39)
inTrain <- createDataPartition(y=dataTidy$classe,
p=0.6,list=FALSE)
dataTidyTrain <- dataTidy[inTrain,]
dataTidyTest <- dataTidy[-inTrain,]
1. The Random Forest (`rf`) and Gradient Boosting (`gbm`) algorithms are selected for comparison based on the classification accuracy they can achieve (refer to lectures). In addition, these two models have built-in feature selection, as described in the caret package reference (refer to [1]).
set.seed(39)
# k-fold validation - 10-fold validation, use kappa as metric
fitControl <- trainControl(method = "cv",
number = 10)
gbmFit <- train(classe~., data=dataTidyTrain, method="gbm", metric="Kappa", trControl=fitControl,verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
rfFit <- train(classe~.,data=dataTidyTrain,method="rf", metric="Kappa", trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
2. The cross-validation results of the two models are compared using the `resamples` function from the caret package.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(lattice)
rValues <- resamples(list(rf=rfFit,gbm=gbmFit))
summary(rValues)
##
## Call:
## summary.resamples(object = rValues)
##
## Models: rf, gbm
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.995 0.996 0.997 0.997 0.998 1.000 0
## gbm 0.979 0.984 0.986 0.987 0.989 0.997 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.994 0.995 0.996 0.996 0.998 1.000 0
## gbm 0.973 0.980 0.982 0.983 0.985 0.996 0
bwplot(rValues,metric="Kappa",main="RandomForest (rf) vs Gradient Boosting (gbm)")
1. With the selected Random Forest model, we proceed to model validation.
2. The details of the selected model are shown below.
rfFit
## Random Forest
##
## 11776 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10598, 10598, 10599, 10597, 10598, 10599, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.002 0.003
## 27 1 1 0.002 0.002
## 53 1 1 0.003 0.004
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
3. We use the `confusionMatrix` function in the caret package to validate the selected model with the `dataTidyTest` test set. The corresponding statistics and error rates are shown below.
library(caret)
confusionMatrix(dataTidyTest$classe, predict(rfFit,dataTidyTest))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2231 1 0 0 0
## B 2 1515 1 0 0
## C 0 4 1364 0 0
## D 0 0 9 1269 8
## E 0 0 0 5 1437
##
## Overall Statistics
##
## Accuracy : 0.996
## 95% CI : (0.995, 0.997)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.995
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.997 0.993 0.996 0.994
## Specificity 1.000 1.000 0.999 0.997 0.999
## Pos Pred Value 1.000 0.998 0.997 0.987 0.997
## Neg Pred Value 1.000 0.999 0.998 0.999 0.999
## Prevalence 0.285 0.194 0.175 0.162 0.184
## Detection Rate 0.284 0.193 0.174 0.162 0.183
## Detection Prevalence 0.284 0.193 0.174 0.164 0.184
## Balanced Accuracy 0.999 0.998 0.996 0.997 0.997
4. From the above validation result, the selected model performs at a Kappa value of 0.995 with an accuracy of 0.996, so the expected out-of-sample error rate is about 0.4%.
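The expected out-of-sample error can be computed directly from the validation accuracy. A short sketch, reusing `rfFit` and `dataTidyTest` from above (so it assumes those objects exist in the session):

```r
library(caret)

# Estimated out-of-sample error = 1 - accuracy on the held-out 40% set
cm <- confusionMatrix(dataTidyTest$classe, predict(rfFit, dataTidyTest))
oosError <- 1 - as.numeric(cm$overall["Accuracy"])
oosError
```

`cm$overall` is a named numeric vector that also carries `"Kappa"` and the accuracy confidence bounds, so the same object backs all the summary figures quoted above.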
1. Finally, we use the selected model to predict the classification of the 20 test cases provided. In accordance with the submission instructions, the `pml_write_files` function is used to generate the submission files.
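`pml_write_files` is not defined in this report; the definition below matches the one distributed with the course submission instructions (reproduced here as a sketch so the report is self-contained; verify against the instructions page before use).

```r
# Write one text file per prediction (problem_id_1.txt, problem_id_2.txt, ...),
# as expected by the course submission page.
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
```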
library(caret)
results <- predict(rfFit,newdata=dataTest)
print(as.data.frame(results))
## results
## 1 B
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 B
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B
pml_write_files(results)