In this example, I have included a case study by Cloudera data scientists on how large datasets can be resampled and modeled with random forests using R and Hadoop. To understand this type of Big Data problem definition, I have considered the Kaggle Blue Book for Bulldozers competition. The goal of this competition is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. This solution was provided by Uri Laserson (Data Scientist at Cloudera). The provided data contains information about auction result postings, usage, and equipment configuration.
The trick to modeling such Big Data sets is to divide them into smaller datasets, and then fit a model to each of them with a traditional machine learning technique such as random forests or bagging; a sketch of this approach follows.
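The following is a minimal sketch of this resample-and-combine idea, not Uri Laserson's actual code. It assumes the training data has already been loaded into a data frame named train with a SalePrice target column; the names, sample sizes, and tree counts are illustrative only. Each iteration fits a small random forest on a random subsample, and the per-sample forests are then merged into one ensemble with the randomForest package's combine() function. On a real cluster, the per-sample fits would run as separate Hadoop map tasks (for example, via the rmr2 package) rather than a local lapply().

# Hypothetical sketch: subsample a large dataset and fit one small
# random forest per subsample, then merge the forests.
library(randomForest)

n_samples   <- 10      # number of smaller datasets to draw
sample_size <- 5000    # rows per subsample (tune to what fits in memory)

fit_on_sample <- function(i) {
  idx <- sample(nrow(train), sample_size, replace = TRUE)
  randomForest(SalePrice ~ ., data = train[idx, ], ntree = 50)
}

# Fit one forest per subsample; on Hadoop these would be map tasks.
forests <- lapply(seq_len(n_samples), fit_on_sample)

# Merge the individual forests into a single ensemble and predict.
big_forest  <- do.call(randomForest::combine, forests)
predictions <- predict(big_forest, newdata = train)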
There are possibly two reasons for choosing random forests: