Are you aware that employing the XGBoost algorithm is considered a winning strategy in many data science competitions?
So, what makes it more powerful than a traditional Random Forest or Neural Network? In broad terms, it’s the efficiency, accuracy, and feasibility of this algorithm. (I’ve discussed this part in detail below in this tutorial).
In the last few years, predictive modeling has become much faster and more accurate. I remember spending long hours on feature engineering to improve a model by a few decimal points. A lot of that difficult work can now be done by better algorithms.
Technically, “XGBoost” is short for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition called the Otto Classification Challenge. The latest implementation of “xgboost” for R was launched in August 2015. We will refer to this version (0.4-2) in this post.
In this article, I’ve explained a simple approach to using xgboost in R. So, consider this algorithm the next time you build a model. I’m sure it would be a moment of shock and then happiness!
Extreme gradient boosting (xgboost) is similar to the gradient boosting framework but is more efficient. It has both a linear model solver and tree learning algorithms. What makes it fast is its ability to do parallel computation on a single machine.
This can make xgboost around 10 times faster than older gradient boosting implementations. It supports various objective functions, including regression, classification, and ranking.
Because it is very high in predictive power, although it takes some effort to set up and tune, “xgboost” is an ideal fit for many competitions. It also has additional features for doing cross-validation and finding important variables. Many parameters need to be controlled to optimize the model. We will discuss these parameters later in this article.
XGBoost only works with numeric vectors, so you need to pay attention to data types here.
Therefore, you need to convert all other forms of data into numeric vectors. One-hot encoding is a simple method for converting categorical variables into numeric vectors. The term comes from digital circuit design, where it describes a group of bits in which exactly one is 1 (hot) and all the others are 0.
In R, one-hot encoding is quite easy. This step (shown below) essentially builds a sparse matrix with a flag for every possible value of each categorical variable. A sparse matrix is a matrix where most of the values are zeros. Conversely, a dense matrix is a matrix where most values are non-zero.
Let’s assume you have a dataset named ‘campaign’ and want to convert all categorical variables except the response variable into such flags. Here is how you do it:
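library(Matrix) # provides sparse.model.matrix()
sparse_matrix <- sparse.model.matrix(response ~ . - 1, data = campaign)
Now, let’s break down this code as follows: “sparse.model.matrix” is the command, and everything inside the parentheses is a parameter. The formula “response ~ .” tells it to use every column except “response” as a feature, “-1” removes the intercept column that would otherwise be created as the first column, and “data” names the dataset.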
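To make the encoding step concrete, here is a minimal sketch on a made-up data frame. The ‘campaign’ data used in this article is not available, so the columns below (gender, region, age) and their values are assumptions chosen purely for illustration.

```r
library(Matrix)  # ships with R as a recommended package

# Hypothetical toy data standing in for the 'campaign' dataset
campaign <- data.frame(
  response = factor(c("Responder", "Non-responder", "Responder", "Non-responder")),
  gender   = factor(c("F", "M", "M", "F")),
  region   = factor(c("north", "south", "east", "north")),
  age      = c(34, 51, 27, 45)
)

# Every column except 'response' becomes a feature; "- 1" drops the intercept
sparse_matrix <- sparse.model.matrix(response ~ . - 1, data = campaign)

print(colnames(sparse_matrix))
print(sparse_matrix)  # zeros are shown as dots
```

Note that with this formula only the first factor (gender) gets a flag for every level. For later factors such as region, R drops the first level (“east” here) as the baseline, and numeric columns such as age are passed through unchanged.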
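To convert the target variable as well, you can use the following code:
output_vector = df[, "response"] == "Responder"
Here is what the code does: it compares every value of the response column with "Responder", so output_vector is TRUE for rows where the response is "Responder" and FALSE otherwise.
Here are simple steps you can use to crack any data problem using xgboost: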
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
(Here, I use bank data where we need to find whether a customer is eligible for a loan or not).
set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
# Loading labels of train data
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train,df_test)
# clean variables: ages below 14 or above 100 are treated as invalid
df_all[which(df_all$age < 14 | df_all$age > 100), 'age'] <- -1
# flag the invalid ages before they are overwritten (afterwards the flag would always be 0)
df_all$agena <- as.factor(ifelse(df_all$age < 0, 1, 0))
# replace the invalid ages with the mean of the valid ones
df_all$age[which(df_all$age < 0)] <- mean(df_all$age[which(df_all$age > 0)])
# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')
dummies <- dummyVars(~ gender + education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))], df_all_ohe)
I am using a list of variables in “features_selected” to be used by the model. I share a quick and smart way to choose variables later in this article.
df_all_combined <- df_all_combined[,c('id',features_selected)]
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id,]
y <- recode(labels$labels, "'True'=1; 'False'=0")
X_test = df_all_combined[df_all_combined$id %in% df_test$id,]
# the target is binary (eligible or not), so a binary objective and error metric are used
xgb <- xgboost(data = data.matrix(X[,-1]),
 label = y,
 eta = 0.1,
 max_depth = 15,
 nrounds = 25,
 subsample = 0.5,
 colsample_bytree = 0.5,
 seed = 1,
 eval_metric = "error",
 objective = "binary:logistic",
 nthread = 3
)
And that’s it! You now have an object “xgb” which is an xgboost model. Here is how you score a test population :
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))
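Since the loan data used above is not public, the following is a minimal end-to-end sketch on synthetic data that you can run as is. The column names (income, age, debt), the rule that generates the labels, and the parameter values are all assumptions made for illustration. It uses xgb.DMatrix and xgb.train rather than the xgboost() wrapper, and it assumes a reasonably recent version of the xgboost package is installed.

```r
library(xgboost)

set.seed(100)

# Synthetic stand-in for the loan data (hypothetical columns and values)
n <- 600
features <- data.frame(
  income = rnorm(n, mean = 50, sd = 15),
  age    = round(runif(n, 21, 65)),
  debt   = rnorm(n, mean = 20, sd = 8)
)
# Assumed rule: eligibility depends on income minus debt, plus some noise
eligible <- as.integer(features$income - features$debt + rnorm(n, sd = 5) > 30)

# Hold out the last 150 rows as a test set
train_idx <- seq_len(450)
dtrain <- xgb.DMatrix(data = as.matrix(features[train_idx, ]), label = eligible[train_idx])
dtest  <- xgb.DMatrix(data = as.matrix(features[-train_idx, ]), label = eligible[-train_idx])

params <- list(objective = "binary:logistic", eta = 0.1, max_depth = 3, nthread = 1)
bst <- xgb.train(params = params, data = dtrain, nrounds = 25)

# binary:logistic returns one probability per row
prob <- predict(bst, dtest)
pred_class <- as.integer(prob > 0.5)

# Confusion matrix and accuracy on the held-out rows
confusion <- table(actual = eligible[-train_idx], predicted = pred_class)
accuracy <- mean(pred_class == eligible[-train_idx])
print(confusion)
print(accuracy)
```

Scoring a held-out set like this, rather than the training rows, is one simple way to check how well the model generalizes.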
By now, you are probably curious about the various parameters used in the xgboost model. There are three types of parameters: General Parameters, Booster Parameters, and Task Parameters.
Let’s understand these parameters in detail. Please pay close attention here, as this is the most critical aspect of implementing the xgboost algorithm:
The tree specific parameters –
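As a rough illustration of how the three groups fit together, the sketch below arranges a handful of commonly used parameters into commented lists. The grouping variable names (general_params, booster_params, task_params) are hypothetical, the values are illustrative only, and the task parameters assume a two-class target.

```r
# General parameters: which booster to use and how many threads
general_params <- list(
  booster = "gbtree",  # tree booster; "gblinear" selects the linear solver
  nthread = 3          # number of parallel threads
)

# Booster (tree-specific) parameters: how each tree is grown
booster_params <- list(
  eta              = 0.1,  # step size shrinkage applied to each new tree
  max_depth        = 15,   # maximum depth of a tree
  subsample        = 0.5,  # fraction of rows sampled for each tree
  colsample_bytree = 0.5,  # fraction of columns sampled for each tree
  gamma            = 0,    # minimum loss reduction required to make a split
  min_child_weight = 1     # minimum sum of instance weight needed in a child
)

# Task parameters: what is being optimized and how it is evaluated
task_params <- list(
  objective   = "binary:logistic",  # assumed two-class target
  eval_metric = "error"             # classification error rate
)

# A single list like this can be passed as 'params' to xgb.train()
params <- c(general_params, booster_params, task_params)
str(params)
```

Larger values of gamma and min_child_weight and smaller values of eta and max_depth make the model more conservative, meaning it is less willing to add splits that fit noise in the training data.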
Compared to other machine learning techniques, I find implementing xgboost really simple. If you have followed every step so far, you already have a model.
Let’s take it one step further, find the importance of each variable in the model, and use that to subset our variable list.
# Let's start by looking at what the actual trees look like
model <- xgb.dump(xgb, with.stats = TRUE)
model[1:10] # prints the first 10 lines of the model dump
# Get the real feature names
names <- dimnames(data.matrix(X[,-1]))[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
# In case the last step does not work for you because of a version issue, you can try the following:
barplot(importance_matrix$Gain, names.arg = importance_matrix$Feature)
As you can observe, many variables are just not worth using in our model. You can conveniently remove these variables and run the model again. This time you can often expect better accuracy.
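The pruning step described above can be sketched on synthetic data as follows. The feature names (signal, noise_a, noise_b), the label rule, and the 0.05 Gain cutoff are assumptions made for illustration, not values from the original analysis.

```r
library(xgboost)

set.seed(1)

# Synthetic data: only 'signal' actually drives the label
n <- 500
X <- cbind(
  signal  = rnorm(n),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
y <- as.integer(X[, "signal"] + rnorm(n, sd = 0.3) > 0)

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(
  params  = list(objective = "binary:logistic", eta = 0.3, max_depth = 2, nthread = 1),
  data    = dtrain,
  nrounds = 10
)

# One row per feature that was used in at least one split, sorted by Gain
importance_matrix <- xgb.importance(feature_names = colnames(X), model = bst)
print(importance_matrix)

# Keep only features above an arbitrary, hypothetical Gain cutoff
features_selected <- importance_matrix$Feature[importance_matrix$Gain > 0.05]
print(features_selected)
```

In the importance table, Gain is the feature's relative contribution to the reduction in loss, Cover is the relative number of observations affected by its splits, and the frequency column is the relative number of times the feature is used in a split. Gain is usually the most informative column to rank by.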
Let’s assume Age was the variable that came out as most important in the above analysis. Here is a simple chi-square test you can run to check whether the variable is actually important.
test <- chisq.test(train$Age, output_vector)
print(test)
We can repeat the same process for all the important variables. This shows whether the model has correctly identified the variables that really matter.
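To show how such a test reads in practice, here is a small sketch on synthetic data. The age values, the response rule, and the age bands are assumptions made for illustration. Binning age first is my own choice: a chi-square test works on counts per category, and a raw numeric age with many distinct values would leave most cells nearly empty.

```r
set.seed(7)

# Hypothetical data: older customers respond more often
n <- 1000
age <- round(runif(n, 18, 70))
responder <- rbinom(n, size = 1, prob = plogis((age - 45) / 8)) == 1

# Group age into bands so each cell of the table has enough observations
age_band <- cut(age, breaks = c(17, 30, 45, 60, 70))
contingency <- table(age_band, responder)
print(contingency)

# Null hypothesis: age band and response are independent
test <- chisq.test(contingency)
print(test)
```

A small p-value means the response rate differs across age bands by more than chance would explain, which supports keeping the variable. It does not measure how strong the effect is, so it complements the importance table rather than replacing it.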
In conclusion, the XGBoost algorithm offers powerful capabilities for building numeric predictive models in both R and Python. By monitoring metrics like test error and leveraging features such as watchlist, we can iteratively refine our models for improved performance. With its versatility and efficiency, XGBoost remains a cornerstone in modern data science, enabling precise and actionable insights from complex datasets.
Ans: Yes, XGBoost can be used in R. It’s implemented through the ‘xgboost’ package, which provides powerful tools for building gradient boosting models efficiently.
Ans: XGBoost works by iteratively training a sequence of decision trees, each correcting the previous trees’ errors. It employs a gradient-boosting framework that optimizes a differentiable loss function to minimize prediction errors.
Ans: XGBoost and random forest are both powerful algorithms for regression tasks, but their performance may vary depending on the dataset and specific requirements. While random forest builds multiple independent trees, XGBoost sequentially improves upon them, often leading to higher accuracy, especially in complex datasets.
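The idea in the second answer above, where each new tree corrects the errors of the trees before it, can be sketched in a few lines of base R. This is a simplified illustration of plain gradient boosting with squared-error loss and one-split trees (stumps), not xgboost's actual implementation, which adds regularization and uses second-order gradient information. The helper names fit_stump and predict_stump are hypothetical.

```r
set.seed(42)

# Synthetic regression problem
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.2)

# Fit a one-split regression tree (a "stump") to the residuals r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbours
  best <- list(sse = Inf)
  for (s in candidates) {
    left <- x <= s
    left_value  <- mean(r[left])
    right_value <- mean(r[!left])
    sse <- sum((r[left] - left_value)^2) + sum((r[!left] - right_value)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = left_value, right = right_value)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

eta <- 0.1      # learning rate (shrinkage)
nrounds <- 50   # number of boosting rounds, i.e. number of trees
pred <- rep(mean(y), n)  # start from a constant prediction
mse <- numeric(nrounds)

for (i in seq_len(nrounds)) {
  residual <- y - pred                 # negative gradient of 0.5 * (y - pred)^2
  stump <- fit_stump(x, residual)      # the new tree models the current errors
  pred <- pred + eta * predict_stump(stump, x)
  mse[i] <- mean((y - pred)^2)
}

print(round(mse[c(1, 10, 25, 50)], 4))
```

The training error falls with every round because each stump is fitted to what the current ensemble still gets wrong. This is also why the number of rounds and the learning rate need to be tuned together.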
How to find best parameter values for the model?
Aditya, Its an iterative process. You generally start with the default value and then move towards either extremes depending on the CV gain. Tavish
Below code is giving an error : labels = df_train['labels']
I think in the dataset "label" is "Loan_Status" and this code is right
labels = df_train['Loan_Status']
df_train = df_train[-grep('Loan_Status', colnames(df_train))]
Very helpful article Srivastava. I heard about XGBOOST but did not implement it. Will definitely try this in the next competition, using this article.
hi Tavish, Thanks for taking the time to put together this elaborate explanation.. I'm trying to follow along using the code, and seem to have come unstuck at Step 2. This line of code throws an 'undefined columns selected' error: labels = df_train['labels'] What am I missing?
I have used a loans data which is not publicly available and not the loan challenge data on AV. The intention of the article was to understand the underlying process of XGboost. Hope the article helped you.
Thx for material, Tavish Srivastava. In your code you use variable "Age", but there is not this variable in the dataset. How you get this feature?
I have used a loans data which is not publicly available and not the loan challenge data on AV. The intention of the article was to understand the underlying process of XGboost. Hope the article helped you.
Nice article, I am going to try this algorithm on mortgage prepayment and default data
Hi, Thanks for posting wonderful article XGboost. Below code is not merging train and test dataset excluding Loan_Status from Train dataset.
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train,df_test)
I think simple way to do it is
# Exclude column 13
df_train_sub = subset(df_train, select=c(1:12))
Merge train and Test dataset.
df_all = rbind(df_train_sub,df_test)
Let me know if i am missing something here.
I have used a loans data which is not publicly available and not the loan challenge data on AV. The intention of the article was to understand the underlying process of XGboost. Hope the article helped you.
1. You should load 'Matrix" package to run the function sparse.model.matrix() 2. There is no “label” or "Age" or "Employer" in the download data set. 3. For "categorical features" in the data set, there are "Gender", "Married", "Education", "Self_Employed", "Property_Area"
I guess Tavish idea with this was to theoretically demonstrate the use of xgboost. The code as presented here have lots of errors with respect to variable names and I do not think you can run these codes as is.
Hi folks, If anyone is looking for a working example of xgboost, here is a simple example in R. Although xgboost is an overkill for this problem, it demonstrates how to run a multi-class classification using xgboost. . https://github.com/rachar1/DataAnalysis/blob/master/xgboost_Classification.R Hope this helps.
Great article, it would be much helpful if you can get in to details of xgb.importance(), like what can we understand from the Gain, Cover and Frequence columns of the output. Thanks :)
The feature importance part was unknown to me, so thanks a ton Tavish. Looking forward to applying it into my models. Also, i guess there is an updated version to xgboost i.e.,"xgb.train" and here we can simultaneously view the scores for train and the validation dataset. that we pass into the algorithm as xgb.DMatrix. Also xgb.cv gives us a very good idea to select parameters for xgb.train as here we can specify nfolds for the number of cross validations. Would love to get your views on these too !!!
I am getting error while converting datatypes of Loan Prediction to Numeric > names(n) [1] "Gender" "Married" "Dependents" "Education" [5] "Self_Employed" "ApplicantIncome" "CoapplicantIncome" "LoanAmount" [9] "Loan_Amount_Term" "Credit_History" "Property_Area" "Loan_Status" >sparse_matrix <- sparse.model.matrix(response ~ .,data = n) Error in model.frame.default(object, data, xlev = xlev) : variable lengths differ (found for 'Gender') I am unable to figure out the issue. Kindly suggest.
Hi Tavish, Great article. Thanks. Can you let me know how to access the data set you used so that i can follow your step and get a bettee understanding? Thansk Srikar
Thank you so much for such a great intro to xgboost!
Hi Tavish, Definitely a good article. But it would be great if you give the dataset along with the article and explain the techniques based on that.. Also many of the parameter explanations are not clear. May be it would be because of my lesser experience in this area.
Hi Tavish, Thanks for the article. I did not understand your paragraph on the Chi2 square test. How does this test allows you to (in)validate a feature ?
Hi Tanvish, Is it possible to use multiple computer's CPU to process XGBOOST. Thanks,
Error in Using xgboost--- I have following data set of stock prices of selected shares on nifty. data.frame': 1772 obs. of 291 variables: $ TCS.NS.Open : num [1:1772, 1] 0.977 -1.369 -0.324 -0.524 -1.291 ... $ TCS.NS.High : num [1:1772, 1] 1.024 -1.373 -0.323 -0.523 -1.302 ... $ TCS.NS.Low : num [1:1772, 1] 0.994 -1.372 -0.3 -0.547 -1.29 ... $ TCS.NS.Close : num [1:1772, 1] 0.982 -1.371 -0.313 -0.562 -1.301 ... $ TCS.NS.Volume : num [1:1772, 1] -0.465 0.064 -0.122 0.369 1.03 -0.52 -0.559 -0.613 0.333 -0.815 ... $ TCS.NS.Adjusted : num [1:1772, 1] 0.969 -1.306 -0.154 -1.018 -0.977 ... $ INFY.NS.Open : num [1:1772, 1] 1.501 -1.498 0.128 -0.463 -0.117 ... $ INFY.NS.High : num [1:1772, 1] 1.483 -1.508 0.115 -0.495 -0.104 ... $ INFY.NS.Low : num [1:1772, 1] 1.436 -1.507 0.104 -0.552 -0.107 ... $ INFY.NS.Close : num [1:1772, 1] 1.416 -1.487 0.096 -0.574 -0.09 ... $ INFY.NS.Volume : num [1:1772, 1] 3.856 -0.174 -0.096 0.486 -0.105 ... $ INFY.NS.Adjusted : num [1:1772, 1] 0.487 -1.343 -0.471 -1.056 -0.705 ... $ TECHM.NS.Open : num [1:1772, 1] 1.313 -1.513 -0.754 0.403 -0.235 . When I run following xgboost model, I get error--- bst=xgboost(data=as.matrix(train[,predictorNames]), label=train$outcome, verbose = 0, eta=0.1, gamma=50, missing = NaN, nround=50, colsample_bytree=0.1, subsample=8.6, objective="binary:logistic") Error in xgb.get.DMatrix(data, label, missing) : xgboost: need label when data is a matrix I checked label is provided but error persists.
Hi Tanvish, I am using Decision Forest Regression for my model, but I need a method to select important features out of 100+ features and then train the Decision Forest Regression Model, What's your view on using "XGBOOST" to just do feature selection and then train model using DFR?
I am using similar parameters for xgboost and xgbtrain, but the output is slightly different. Even the RMSE is bit different. In such case, which one should I use
training.matrix = as.matrix(training)
dtraining <- xgb.DMatrix(as.matrix(training[,-5]), label = as.matrix(training[,5]))
param <- list("objective" = "reg:linear", # multiclass classification
 "subsample"= subsample,
 "colsample_bytree" = colsample_bytree,
 "max_depth" = max_depth, # maximum depth of tree
 "min_child_weight" = min_child_weight,
 "max_delta_step" = max_delta_step,
 "eta" = eta, # step size shrinkage
 "gamma" = gamma , # minimum loss reduction
 "nthread" = nthreads#, # number of threads to be used
 #"eval_metric" = evalerror
)
bst <- xgb.train(params = param, data=dtraining, nrounds=nrounds, maximize = FALSE, verbose = 0)
bst2<-xgboost(data = training.matrix[,-5], label = training.matrix[,5], verbose = 1, nrounds=nrounds, params = param, maximize = FALSE)
Thanks to dear tavish.
Hi Can you please share the data template you have used for running the code?We can run it with dummy data.
What does conservative means, like when you say with the increase of gamma, the algorithm will be more conservative.
Hi Sir, I am try to run code below X % select(-reordered)), label = subtrain$reordered) line. error is [20:15:09] amalgamation/../dmlc-core/src/io/local_filesys.cc:66: LocalFileSystem.GetPathInfo 1 Error:No such file or directory Can you please assist. Thank you
where is the data campaign? could you please send it to me?
what is the parameter of number of trees??? Is it nrounds?
hii could someone please post the link to the data set used in this example
Hi Tanvish, pls I have three variables for my response variables (1,2,3) can I still run with same method in running two variables (1,2).
Can you please tell me which library or package does the function "sparse.model.matrix() contain? for the action to convert to numerics for this code: sparse.model.matrix(response ~ .-1, data = campaign)
Very useful article. I have a question here. when we use techniques such as multi linear regression to build a prediction model, we have factors such as model's p value, R-Square Adjusted value etc. that will decide if we can accept or reject the model. For example if the R-Square Adjusted value is more 80%: accept the model, else reject. Similarly, do we have any techniques to evaluate the fitness of the model while using XGB, with the purpose of either accepting or rejecting the model? Request you to throw light on this topic. Thanks.
How do i check the accuracy of the model? Can you please help with the confusion Matrix
Hi Tanvish, This was a complete waste of time. When you write a 'Working Code Example' article you should work with a public dataset so others can also replicate the code. Please either update your program with a public dataset or kindly delete the post because there are several other 'Working' examples out there that people can follow without wasting their time
Hi Thanks for your article,,i want to understand that how i can show covariates importance in xgbt?what is script for bar plot