Lending Club Loan Data in R

1 Overview

A case study of machine learning / modeling in R with credit default data. Data is taken from the Kaggle Lending Club Loan Data set but is also available publicly at the Lending Club Statistics Page. We illustrate the complete workflow from data ingestion through data wrangling/transformation to exploratory data analysis and finally modeling approaches. Along the way there will be helpful discussions of side topics such as training strategy, computational efficiency, R intricacies, and more. Note that the focus is more on methods in R rather than statistical rigour. This is meant to be a reference for an end-to-end data science workflow rather than a serious attempt to achieve the best model performance.

1.1 Kaggle Data Description

The files contain complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address information including zip code and state, and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

2 Environment Setup

2.1 Libraries

The following will automatically install any missing libraries. Make sure proper proxy settings are in place. You do not necessarily need all of them, only those required for the respective sections (which will be indicated).

# define used libraries
libraries_used <- 
  c("lazyeval", "readr","plyr" ,"dplyr", "readxl", "ggplot2", 
    "funModeling", "scales", "tidyverse", "corrplot", "GGally", "caret",
    "rpart", "randomForest", "pROC", "gbm", "choroplethr", "choroplethrMaps",
    "microbenchmark", "doParallel", "e1071")

# check missing libraries
libraries_missing <- 
  libraries_used[!(libraries_used %in% installed.packages()[,"Package"])]
# install missing libraries
if(length(libraries_missing)) install.packages(libraries_missing)

2.2 Session Info

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.0 bookdown_0.5    magrittr_1.5   
##  [5] rprojroot_1.2   tools_3.4.1     htmltools_0.3.6 yaml_2.1.14    
##  [9] Rcpp_0.12.12    stringi_1.1.5   rmarkdown_1.6   blogdown_0.1   
## [13] knitr_1.17      stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

3 A note on libraries

There are generally two ways to access functions from R packages (we use package and library interchangeably): either via direct access without loading the library, i.e. package::function(), or by loading (attaching) the library into the workspace (environment) and thus making all its functions available at once. The problem with the first option is that direct access can sometimes lead to issues (unexpected behavior/errors) and makes coding cumbersome. The problem with the second approach is a conflict of function (and other object) names between two libraries (called masking). A third complication is that some functions require libraries to be loaded (e.g. some instances of caret::train()) and thus attach the library without being explicitly told. Again this may lead to masking of functions, and if the user is unaware this can lead to nasty problems (perceived bugs) down the road (sometimes libraries will issue a warning upon load, e.g. plyr when dplyr is already loaded). There is no golden rule (at least we are not aware of one) and so we tend to do the following

  • load major libraries that are used frequently into workspace but pay attention to load succession to avoid unwanted masking
  • access rarely used functions directly (being aware that they work and don’t attach anything themselves)
  • sometimes use direct access to a function although its library is loaded (either to make clear which library is currently used or because we explicitly need a function that has been masked by another loaded library)

We will load a few essential libraries here as they are used heavily. Other libraries may only be loaded in their respective sections. You can also try to follow the analysis without loading many libraries at the beginning and only load them when needed. Just be aware of the succession to avoid issues.
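As a small illustration of masking (not part of the workflow below, and deliberately reversing the recommended load order in a fresh session), one can inspect which package a name resolves to and fall back to explicit namespace access:

# illustration only: loading plyr after dplyr masks several dplyr verbs
library(dplyr)
library(plyr)

environment(summarise)      # shows that summarise now resolves to plyr's namespace
conflicts(detail = TRUE)    # lists masked objects per attached package

# explicit namespace access sidesteps the ambiguity
mtcars %>% dplyr::summarise(mean_mpg = mean(mpg))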

3.1 Library versions

Also note that the version of the library used often matters, and some libraries are developed more actively than others, which may lead to issues. For example, note that the library caret is using plyr while the tidyverse, of which dplyr (the successor of plyr) is part, introduces some new concepts, e.g. the default data frame is a tibble::tibble(). It seems that caret has issues with tibbles, see e.g. “Wrong model type for classification” in regression problems in R-Caret.
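If caret does stumble over a tibble, a common workaround is simply to hand it a plain data.frame via base::as.data.frame(); the sketch below is illustrative only and the formula is a placeholder, not a model we fit in this post.

# not run; placeholder formula for illustration only
# fit <- caret::train(factor(default) ~ grade + loan_amnt,
#                     data = as.data.frame(train),
#                     method = "rpart")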

library(plyr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## arrange():   dplyr, plyr
## compact():   purrr, plyr
## count():     dplyr, plyr
## failwith():  dplyr, plyr
## filter():    dplyr, stats
## id():        dplyr, plyr
## lag():       dplyr, stats
## mutate():    dplyr, plyr
## rename():    dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

4 Data Import

Refer to section Overview to see how the data may be acquired.

The loans data can be conveniently read via readr::read_csv(). Note that the function tries to determine the variable type by reading the first 1,000 rows, which does not always guarantee a correct import, especially in cases of missing values at the beginning of the file. Other packages treat imports differently, for example data.table::fread() takes a sample of 1,000 rows to determine column types (100 rows from 10 different points), which seems to yield more robust results than readr::read_csv(). Both package functions allow the explicit definition of column classes, for readr::read_csv() it is the parameter col_types and for data.table::fread() it is the parameter colClasses. Alternatively, readr::read_csv() offers the parameter guess_max that allows increasing the number of rows being guessed, similar to the SAS import procedure parameter guessingrows. Naturally, import time increases if more rows are guessed. For details on readr including comparisons against Base R and data.table::fread() see readr.

path <- "D:/data/lending_club"
loans <- readr::read_csv(paste0(path, "/loan.csv"))
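If the type guessing turns out to be unreliable, the two remedies mentioned above can be sketched as follows (the column name and type below are just examples, not a recommendation):

# not run: guess from more rows, or set selected column types explicitly
# loans <- readr::read_csv(paste0(path, "/loan.csv"), guess_max = 100000)
# loans <- readr::read_csv(paste0(path, "/loan.csv"),
#                          col_types = readr::cols(annual_inc_joint = readr::col_double()))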

The meta data comes in an Excel file and needs to be parsed via a special library, in this case we use readxl.

library(readxl)
# Load the Excel workbook
excel_file <- "/LendingClubDataDictionary.xlsx"
# see available tabs
excel_sheets(paste0(path, excel_file))
## [1] "LoanStats"   "browseNotes" "RejectStats"
# Read in the individual worksheets
meta_loan_stats <- read_excel(paste0(path, excel_file), sheet = "LoanStats")
meta_browse_notes <- read_excel(paste0(path, excel_file), sheet = "browseNotes")
meta_reject_stats <- read_excel(paste0(path, excel_file), sheet = "RejectStats")

5 Meta Data

Let’s have a look at the meta data in more detail. First, which variables are present in loan data set and what is their type. Then check meta data information and finally compare the two to see if anything is missing.

The usual approach may be to use the base::str() function to get a summary of the data structure. However, it may be useful to quantify the “information power” of the different metrics and dimensions by looking at the ratio of zeros and missing values to overall observations. This will not always reveal the truth (as there may be variables that are only populated if certain conditions apply) but it still gives some indication. The funModeling package offers the function funModeling::df_status() for that. It does not scale very well and has quite a few dependencies (so a direct call is preferred over a full library load) but it suits the purpose for this data. Unfortunately, it does not return the number of rows and columns. The data has 887379 observations (rows) and 74 variables (columns).
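The dimensions quoted above can be confirmed directly with base functions:

dim(loans)
## [1] 887379     74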

We will require the meta data at a later stage so we assign it to a variable. The function funModeling::df_status() has the parameter print_results = TRUE set by default, which means the data would be assigned and printed at the same time; we set it to FALSE and print the table via knitr::kable() instead.

meta_loans <- funModeling::df_status(loans, print_results = FALSE)
knitr::kable(meta_loans)
variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
id 0 0.00 0 0.00 0 0 integer 887379
member_id 0 0.00 0 0.00 0 0 integer 887379
loan_amnt 0 0.00 0 0.00 0 0 numeric 1372
funded_amnt 0 0.00 0 0.00 0 0 numeric 1372
funded_amnt_inv 233 0.03 0 0.00 0 0 numeric 9856
term 0 0.00 0 0.00 0 0 character 2
int_rate 0 0.00 0 0.00 0 0 numeric 542
installment 0 0.00 0 0.00 0 0 numeric 68711
grade 0 0.00 0 0.00 0 0 character 7
sub_grade 0 0.00 0 0.00 0 0 character 35
emp_title 0 0.00 51457 5.80 0 0 character 289148
emp_length 0 0.00 0 0.00 0 0 character 12
home_ownership 0 0.00 0 0.00 0 0 character 6
annual_inc 2 0.00 4 0.00 0 0 numeric 49384
verification_status 0 0.00 0 0.00 0 0 character 3
issue_d 0 0.00 0 0.00 0 0 character 103
loan_status 0 0.00 0 0.00 0 0 character 10
pymnt_plan 0 0.00 0 0.00 0 0 character 2
url 0 0.00 0 0.00 0 0 character 887379
desc 0 0.00 761597 85.83 0 0 character 124453
purpose 0 0.00 0 0.00 0 0 character 14
title 0 0.00 151 0.02 0 0 character 61452
zip_code 0 0.00 0 0.00 0 0 character 935
addr_state 0 0.00 0 0.00 0 0 character 51
dti 451 0.05 0 0.00 0 0 numeric 4086
delinq_2yrs 716961 80.80 29 0.00 0 0 numeric 29
earliest_cr_line 0 0.00 29 0.00 0 0 character 697
inq_last_6mths 497905 56.11 29 0.00 0 0 numeric 28
mths_since_last_delinq 1723 0.19 454312 51.20 0 0 numeric 155
mths_since_last_record 1283 0.14 750326 84.56 0 0 numeric 123
open_acc 7 0.00 29 0.00 0 0 numeric 77
pub_rec 751572 84.70 29 0.00 0 0 numeric 32
revol_bal 3402 0.38 0 0.00 0 0 numeric 73740
revol_util 3540 0.40 502 0.06 0 0 numeric 1356
total_acc 0 0.00 29 0.00 0 0 numeric 135
initial_list_status 0 0.00 0 0.00 0 0 character 2
out_prncp 255798 28.83 0 0.00 0 0 numeric 248332
out_prncp_inv 255798 28.83 0 0.00 0 0 numeric 266244
total_pymnt 17759 2.00 0 0.00 0 0 numeric 506726
total_pymnt_inv 18037 2.03 0 0.00 0 0 numeric 506616
total_rec_prncp 18145 2.04 0 0.00 0 0 numeric 260227
total_rec_int 18214 2.05 0 0.00 0 0 numeric 324635
total_rec_late_fee 874862 98.59 0 0.00 0 0 numeric 6181
recoveries 862702 97.22 0 0.00 0 0 numeric 23055
collection_recovery_fee 863872 97.35 0 0.00 0 0 numeric 20708
last_pymnt_d 0 0.00 17659 1.99 0 0 character 98
last_pymnt_amnt 17673 1.99 0 0.00 0 0 numeric 232451
next_pymnt_d 0 0.00 252971 28.51 0 0 character 100
last_credit_pull_d 0 0.00 53 0.01 0 0 character 103
collections_12_mths_ex_med 875553 98.67 145 0.02 0 0 numeric 12
mths_since_last_major_derog 0 0.00 665676 75.02 0 0 character 168
policy_code 0 0.00 0 0.00 0 0 numeric 1
application_type 0 0.00 0 0.00 0 0 character 2
annual_inc_joint 0 0.00 886868 99.94 0 0 character 308
dti_joint 0 0.00 886870 99.94 0 0 character 449
verification_status_joint 0 0.00 886868 99.94 0 0 character 3
acc_now_delinq 883236 99.53 29 0.00 0 0 numeric 8
tot_coll_amt 0 0.00 70276 7.92 0 0 character 10325
tot_cur_bal 0 0.00 70276 7.92 0 0 character 327342
open_acc_6m 0 0.00 866007 97.59 0 0 character 13
open_il_6m 0 0.00 866007 97.59 0 0 character 35
open_il_12m 0 0.00 866007 97.59 0 0 character 12
open_il_24m 0 0.00 866007 97.59 0 0 character 17
mths_since_rcnt_il 0 0.00 866569 97.65 0 0 character 201
total_bal_il 0 0.00 866007 97.59 0 0 character 17030
il_util 0 0.00 868762 97.90 0 0 character 1272
open_rv_12m 0 0.00 866007 97.59 0 0 character 18
open_rv_24m 0 0.00 866007 97.59 0 0 character 28
max_bal_bc 0 0.00 866007 97.59 0 0 character 10707
all_util 0 0.00 866007 97.59 0 0 character 1128
total_rev_hi_lim 0 0.00 70276 7.92 0 0 character 21251
inq_fi 0 0.00 866007 97.59 0 0 character 18
total_cu_tl 0 0.00 866007 97.59 0 0 character 33
inq_last_12m 0 0.00 866007 97.59 0 0 character 29

It seems the data has two unique identifiers, id and member_id. There is no variable with 100% missing or zero values, i.e. an information power of zero. There are a few which have a high ratio of NAs so the meta description needs to be checked to see whether this is expected. There are also variables which have only one unique value, e.g. policy_code. Again, the meta description should be checked to see the rationale, but such dimensions are not useful for any analysis or model building.

Our meta table also contains the absolute number of unique values, which is helpful for plotting (for attributes, that is). Another interesting ratio to look at is that of unique values over all values for any attribute. A high ratio would indicate that this is probably a “free” field, i.e. no particular constraints are put on its values (key variables are of course an exception; being unique, their ratio should be one, as is the case with id and member_id). When looking into correlations, these high-ratio fields will have low correlation with other fields, but they may still be useful, e.g. because they carry “direction” information (e.g. the direction of an effect) or, in the case of strings, may be useful for text analytics. We thus add a respective variable to the meta data. For improved readability, we use the scales::percent() function to convert the output to percent.

meta_loans <-
  meta_loans %>%
  mutate(uniq_rat = unique / nrow(loans))

meta_loans %>%
  select(variable, unique, uniq_rat) %>%
  mutate(unique = unique, uniq_rat = scales::percent(uniq_rat)) %>%
  knitr::kable()
variable unique uniq_rat
id 887379 100.0%
member_id 887379 100.0%
loan_amnt 1372 0.2%
funded_amnt 1372 0.2%
funded_amnt_inv 9856 1.1%
term 2 0.0%
int_rate 542 0.1%
installment 68711 7.7%
grade 7 0.0%
sub_grade 35 0.0%
emp_title 289148 32.6%
emp_length 12 0.0%
home_ownership 6 0.0%
annual_inc 49384 5.6%
verification_status 3 0.0%
issue_d 103 0.0%
loan_status 10 0.0%
pymnt_plan 2 0.0%
url 887379 100.0%
desc 124453 14.0%
purpose 14 0.0%
title 61452 6.9%
zip_code 935 0.1%
addr_state 51 0.0%
dti 4086 0.5%
delinq_2yrs 29 0.0%
earliest_cr_line 697 0.1%
inq_last_6mths 28 0.0%
mths_since_last_delinq 155 0.0%
mths_since_last_record 123 0.0%
open_acc 77 0.0%
pub_rec 32 0.0%
revol_bal 73740 8.3%
revol_util 1356 0.2%
total_acc 135 0.0%
initial_list_status 2 0.0%
out_prncp 248332 28.0%
out_prncp_inv 266244 30.0%
total_pymnt 506726 57.1%
total_pymnt_inv 506616 57.1%
total_rec_prncp 260227 29.3%
total_rec_int 324635 36.6%
total_rec_late_fee 6181 0.7%
recoveries 23055 2.6%
collection_recovery_fee 20708 2.3%
last_pymnt_d 98 0.0%
last_pymnt_amnt 232451 26.2%
next_pymnt_d 100 0.0%
last_credit_pull_d 103 0.0%
collections_12_mths_ex_med 12 0.0%
mths_since_last_major_derog 168 0.0%
policy_code 1 0.0%
application_type 2 0.0%
annual_inc_joint 308 0.0%
dti_joint 449 0.1%
verification_status_joint 3 0.0%
acc_now_delinq 8 0.0%
tot_coll_amt 10325 1.2%
tot_cur_bal 327342 36.9%
open_acc_6m 13 0.0%
open_il_6m 35 0.0%
open_il_12m 12 0.0%
open_il_24m 17 0.0%
mths_since_rcnt_il 201 0.0%
total_bal_il 17030 1.9%
il_util 1272 0.1%
open_rv_12m 18 0.0%
open_rv_24m 28 0.0%
max_bal_bc 10707 1.2%
all_util 1128 0.1%
total_rev_hi_lim 21251 2.4%
inq_fi 18 0.0%
total_cu_tl 33 0.0%
inq_last_12m 29 0.0%

An attribute like emp_title has over 30% unique values which makes it a poor candidate for modeling as it seems borrowers are free to describe their notion of employment title. A sophisticated model, however, may even take advantage of the data by checking the strings for “indicator words” that may be associated with an honest or a dishonest (credible / non-credible) candidate.

5.1 Data Transformation

We should also check the variable types. As mentioned in section Data Import the import function uses heuristics to guess the data types and these may not always work.

  • annual_inc_joint = character but should be numeric
  • dates are read as character and need to be transformed

First let’s transform character to numeric. We make use of the dplyr::mutate_at() function and provide a vector of columns to be mutated (transformed). In general, when using libraries from the tidyverse (these libraries are mainly authored by Hadley Wickham and other RStudio people), most functions offer a standard-evaluation (SE) version, which can take character values as variable names, in addition to the default NSE (non-standard evaluation) version. The SE functions usually have a trailing underscore, e.g. dplyr::select_() as compared to the non-standard evaluation function dplyr::select(). For details, see the dplyr vignette on Non-standard evaluation. However, note that the tidyverse libraries are still changing a fair amount, so it is best to cross-check whether the function used is still the recommended one for a given task or whether a new one has been introduced in the meantime.
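A minimal illustration of the two call styles, using the dplyr version current at the time of this post (newer dplyr releases deprecate the underscore functions in favor of tidy evaluation):

# NSE: bare column names
loans %>% select(annual_inc, dti) %>% head(2)
# SE: a character vector of column names (note the trailing underscore)
loans %>% select_(.dots = c("annual_inc", "dti")) %>% head(2)
# a newer tidy-evaluation equivalent: select(loans, !!! rlang::syms(c("annual_inc", "dti")))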

As opposed to packages like data.table, tidyverse packages do not alter their input in place, so we have to re-assign the result to the original object. We could also create a new object but this is memory-inefficient (in fact even the re-assignment to the original object creates a temporary copy). For more details on memory management, see the chapter Memory in Hadley Wickham’s book “Advanced R”.
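A small way to observe this copying behaviour is base::tracemem(), which reports whenever R duplicates an object (a sketch; the exact number of copies depends on the R version):

x <- data.frame(a = 1:5)
tracemem(x)              # start tracing copies of x
x$a <- as.numeric(x$a)   # the modification typically triggers at least one copy
untracemem(x)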

chr_to_num_vars <- 
  c("annual_inc_joint", "mths_since_last_major_derog", "open_acc_6m",
    "open_il_6m", "open_il_12m", "open_il_24m", "mths_since_rcnt_il",
    "total_bal_il", "il_util", "open_rv_12m", "open_rv_24m",
    "max_bal_bc", "all_util", "total_rev_hi_lim", "total_cu_tl",
    "inq_last_12m", "dti_joint", "inq_fi", "tot_cur_bal", "tot_coll_amt")

loans <-
  loans %>%
  mutate_at(.funs = funs(as.numeric), .vars = chr_to_num_vars)

Let’s have a look at the date variables to see how they need to be transformed.

chr_to_date_vars <- 
  c("issue_d", "last_pymnt_d", "last_credit_pull_d",
    "next_pymnt_d", "earliest_cr_line", "next_pymnt_d")

loans %>%
  select_(.dots = chr_to_date_vars) %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    887379 obs. of  5 variables:
##  $ issue_d           : chr  "Dec-2011" "Dec-2011" "Dec-2011" "Dec-2011" ...
##  $ last_pymnt_d      : chr  "Jan-2015" "Apr-2013" "Jun-2014" "Jan-2015" ...
##  $ last_credit_pull_d: chr  "Jan-2016" "Sep-2013" "Jan-2016" "Jan-2015" ...
##  $ next_pymnt_d      : chr  NA NA NA NA ...
##  $ earliest_cr_line  : chr  "Jan-1985" "Apr-1999" "Nov-2001" "Feb-1996" ...
head(unique(loans$next_pymnt_d))
## [1] NA         "Feb-2016" "Jan-2016" "Sep-2013" "Feb-2014" "May-2014"
for (i in chr_to_date_vars){
  print(head(unique(loans[, i])))
  }
## # A tibble: 6 x 1
##    issue_d
##      <chr>
## 1 Dec-2011
## 2 Nov-2011
## 3 Oct-2011
## 4 Sep-2011
## 5 Aug-2011
## 6 Jul-2011
## # A tibble: 6 x 1
##   last_pymnt_d
##          <chr>
## 1     Jan-2015
## 2     Apr-2013
## 3     Jun-2014
## 4     Jan-2016
## 5     Apr-2012
## 6     Nov-2012
## # A tibble: 6 x 1
##   last_credit_pull_d
##                <chr>
## 1           Jan-2016
## 2           Sep-2013
## 3           Jan-2015
## 4           Sep-2015
## 5           Dec-2014
## 6           Aug-2012
## # A tibble: 6 x 1
##   next_pymnt_d
##          <chr>
## 1         <NA>
## 2     Feb-2016
## 3     Jan-2016
## 4     Sep-2013
## 5     Feb-2014
## 6     May-2014
## # A tibble: 6 x 1
##   earliest_cr_line
##              <chr>
## 1         Jan-1985
## 2         Apr-1999
## 3         Nov-2001
## 4         Feb-1996
## 5         Jan-1996
## 6         Nov-2004
## # A tibble: 6 x 1
##   next_pymnt_d
##          <chr>
## 1         <NA>
## 2     Feb-2016
## 3     Jan-2016
## 4     Sep-2013
## 5     Feb-2014
## 6     May-2014

It seems the date format is consistent and follows a month-year convention. We can use the base::as.Date() function to convert this to a date format. The function requires a day as well, so we simply add the 1st of each month to the existing character string by pasting strings together using base::paste0(). Note that base::paste0("bla", NA) will not return NA but the concatenated string (here: blaNA). Conveniently, base::as.Date() will return NA so we can leave it at that. Alternatively, we could include an exception handler to explicitly deal with NA values, because we saw previously that some date variables include them. One way to approach that would be to wrap the date conversion function into a base::ifelse() call like so: ifelse(is.na(x), NA, some_date_function(x)). However, base::ifelse() drops the date class that we just created with our date function, for details, see e.g. How to prevent ifelse from turning Date objects into numeric objects. As we do not want to deal with these issues at this moment, we simply go with the default behavior of base::as.Date() as it returns NA in cases of bad input anyway.
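A small demonstration of the class-dropping behaviour, together with the type-stable dplyr::if_else() as a possible alternative:

d <- as.Date("2015-01-01")
ifelse(is.na(d), NA, d)                    # returns the underlying number, not a Date
dplyr::if_else(is.na(d), as.Date(NA), d)   # keeps the Date class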

Let’s also look at the NA counts of the input variables to remind ourselves of the numbers. They should coincide with the number of NAs after we have converted / transformed the date variables.

meta_loans %>% 
  select(variable, q_na) %>% 
  filter(variable %in% chr_to_date_vars)
##             variable   q_na
## 1            issue_d      0
## 2   earliest_cr_line     29
## 3       last_pymnt_d  17659
## 4       next_pymnt_d 252971
## 5 last_credit_pull_d     53

Finally, this is what our custom date conversion function will look like. We call it convert_date() and it takes the string value to be converted as input. We define the date format by following the conventions of base::as.Date(), for details see ?base::as.Date. We could have also used an anonymous function directly in the dplyr::mutate_at() call, but creating a named function seemed appropriate as we may use it several times.

convert_date <- function(x){
  as.Date(paste0("01-", x), format = "%d-%b-%Y")
  } 
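A quick sanity check on a single value before applying it to all date columns (note that the %b format relies on an English locale for month abbreviations, which matches the session info above):

convert_date("Dec-2011")
## expected: "2011-12-01"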

loans <-
  loans %>%
  mutate_at(.funs = funs(convert_date), .vars = chr_to_date_vars)

num_vars <- 
  loans %>% 
  sapply(is.numeric) %>% 
  which() %>% 
  names()

meta_loans %>%
  select(variable, p_zeros, p_na, unique) %>%
  filter_(~ variable %in% num_vars) %>%
  knitr::kable()
variable p_zeros p_na unique
id 0.00 0.00 887379
member_id 0.00 0.00 887379
loan_amnt 0.00 0.00 1372
funded_amnt 0.00 0.00 1372
funded_amnt_inv 0.03 0.00 9856
int_rate 0.00 0.00 542
installment 0.00 0.00 68711
annual_inc 0.00 0.00 49384
dti 0.05 0.00 4086
delinq_2yrs 80.80 0.00 29
inq_last_6mths 56.11 0.00 28
mths_since_last_delinq 0.19 51.20 155
mths_since_last_record 0.14 84.56 123
open_acc 0.00 0.00 77
pub_rec 84.70 0.00 32
revol_bal 0.38 0.00 73740
revol_util 0.40 0.06 1356
total_acc 0.00 0.00 135
out_prncp 28.83 0.00 248332
out_prncp_inv 28.83 0.00 266244
total_pymnt 2.00 0.00 506726
total_pymnt_inv 2.03 0.00 506616
total_rec_prncp 2.04 0.00 260227
total_rec_int 2.05 0.00 324635
total_rec_late_fee 98.59 0.00 6181
recoveries 97.22 0.00 23055
collection_recovery_fee 97.35 0.00 20708
last_pymnt_amnt 1.99 0.00 232451
collections_12_mths_ex_med 98.67 0.02 12
mths_since_last_major_derog 0.00 75.02 168
policy_code 0.00 0.00 1
annual_inc_joint 0.00 99.94 308
dti_joint 0.00 99.94 449
acc_now_delinq 99.53 0.00 8
tot_coll_amt 0.00 7.92 10325
tot_cur_bal 0.00 7.92 327342
open_acc_6m 0.00 97.59 13
open_il_6m 0.00 97.59 35
open_il_12m 0.00 97.59 12
open_il_24m 0.00 97.59 17
mths_since_rcnt_il 0.00 97.65 201
total_bal_il 0.00 97.59 17030
il_util 0.00 97.90 1272
open_rv_12m 0.00 97.59 18
open_rv_24m 0.00 97.59 28
max_bal_bc 0.00 97.59 10707
all_util 0.00 97.59 1128
total_rev_hi_lim 0.00 7.92 21251
inq_fi 0.00 97.59 18
total_cu_tl 0.00 97.59 33
inq_last_12m 0.00 97.59 29

We also see that the variables mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, dti_joint and annual_inc_joint have a large share of NA values. If we think about this in more detail, it may be reasonable to assume that NA values for the variables mths_since_last_delinq, mths_since_last_record and mths_since_last_major_derog actually indicate that there was no event/record of any missed payment, so there cannot be any time value. Analogously, a missing value for annual_inc_joint and dti_joint may simply indicate that it is a single borrower or the partner has no income. Thus, the first three variables actually carry valuable information that would be lost if we ignored it. We will therefore replace the missing values with zeros to make them available for modeling. It should be noted, though, that a replacement value of zero could also be read as an event that just happened (zero months ago), so we have to document our assumptions carefully.

na_to_zero_vars <-
  c("mths_since_last_delinq", "mths_since_last_record",
    "mths_since_last_major_derog")

loans <- 
  loans %>%
  mutate_at(.vars = na_to_zero_vars, .funs = funs(replace(., is.na(.), 0)))
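If we wanted to keep the replacement distinguishable from a genuine value of zero, one optional variation would be to store missingness indicators before replacing; the indicator column names below are made up for illustration and not used later.

# not run: add logical indicator columns prior to the zero replacement
# loans <- loans %>%
#   mutate(mths_since_last_delinq_was_na = is.na(mths_since_last_delinq),
#          mths_since_last_record_was_na = is.na(mths_since_last_record))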

These transformations should have us covered for now. We recreate the meta table after all these changes.

meta_loans <- funModeling::df_status(loans, print_results = FALSE)
meta_loans <-
  meta_loans %>%
  mutate(uniq_rat = unique / nrow(loans))

5.2 Additional Meta Data File

Next we look at the meta descriptions provided in the additional Excel file.

knitr::kable(meta_loan_stats[,1:2])
LoanStatNew Description
acc_now_delinq The number of accounts on which the borrower is now delinquent.
acc_open_past_24mths Number of trades opened in past 24 months.
addr_state The state provided by the borrower in the loan application
all_util Balance to credit limit on all trades
annual_inc The self-reported annual income provided by the borrower during registration.
annual_inc_joint The combined self-reported annual income provided by the co-borrowers during registration
application_type Indicates whether the loan is an individual application or a joint application with two co-borrowers
avg_cur_bal Average current balance of all accounts
bc_open_to_buy Total open to buy on revolving bankcards.
bc_util Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths Number of charge-offs within 12 months
collection_recovery_fee post charge off collection fee
collections_12_mths_ex_med Number of collections in 12 months excluding medical collections
delinq_2yrs The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
delinq_amnt The past-due amount owed for the accounts on which the borrower is now delinquent.
desc Loan description provided by the borrower
dti A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
dti_joint A ratio calculated using the co-borrowers’ total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers’ combined self-reported monthly income
earliest_cr_line The month the borrower’s earliest reported credit line was opened
emp_length Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
emp_title The job title supplied by the Borrower when applying for the loan.*
fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
funded_amnt The total amount committed to that loan at that point in time.
funded_amnt_inv The total amount committed by investors for that loan at that point in time.
grade LC assigned loan grade
home_ownership The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
id A unique LC assigned ID for the loan listing.
il_util Ratio of total current balance to high credit/credit limit on all install acct
initial_list_status The initial listing status of the loan. Possible values are – W, F
inq_fi Number of personal finance inquiries
inq_last_12m Number of credit inquiries in past 12 months
inq_last_6mths The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
installment The monthly payment owed by the borrower if the loan originates.
int_rate Interest Rate on the loan
issue_d The month which the loan was funded
last_credit_pull_d The most recent month LC pulled credit for this loan
last_fico_range_high The upper boundary range the borrower’s last FICO pulled belongs to.
last_fico_range_low The lower boundary range the borrower’s last FICO pulled belongs to.
last_pymnt_amnt Last total payment amount received
last_pymnt_d Last month payment was received
loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
loan_status Current status of the loan
max_bal_bc Maximum current balance owed on all revolving accounts
member_id A unique LC assigned Id for the borrower member.
mo_sin_old_il_acct Months since oldest bank installment account opened
mo_sin_old_rev_tl_op Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op Months since most recent revolving account opened
mo_sin_rcnt_tl Months since most recent account opened
mort_acc Number of mortgage accounts.
mths_since_last_delinq The number of months since the borrower’s last delinquency.
mths_since_last_major_derog Months since most recent 90-day or worse rating
mths_since_last_record The number of months since the last public record.
mths_since_rcnt_il Months since most recent installment accounts opened
mths_since_recent_bc Months since most recent bankcard account opened.
mths_since_recent_bc_dlq Months since most recent bankcard delinquency
mths_since_recent_inq Months since most recent inquiry.
mths_since_recent_revol_delinq Months since most recent revolving delinquency.
next_pymnt_d Next scheduled payment date
num_accts_ever_120_pd Number of accounts ever 120 or more days past due
num_actv_bc_tl Number of currently active bankcard accounts
num_actv_rev_tl Number of currently active revolving trades
num_bc_sats Number of satisfactory bankcard accounts
num_bc_tl Number of bankcard accounts
num_il_tl Number of installment accounts
num_op_rev_tl Number of open revolving accounts
num_rev_accts Number of revolving accounts
num_rev_tl_bal_gt_0 Number of revolving trades with balance >0
num_sats Number of satisfactory accounts
num_tl_120dpd_2m Number of accounts currently 120 days past due (updated in past 2 months)
num_tl_30dpd Number of accounts currently 30 days past due (updated in past 2 months)
num_tl_90g_dpd_24m Number of accounts 90 or more days past due in last 24 months
num_tl_op_past_12m Number of accounts opened in past 12 months
open_acc The number of open credit lines in the borrower’s credit file.
open_acc_6m Number of open trades in last 6 months
open_il_12m Number of installment accounts opened in past 12 months
open_il_24m Number of installment accounts opened in past 24 months
open_il_6m Number of currently active installment trades
open_rv_12m Number of revolving trades opened in past 12 months
open_rv_24m Number of revolving trades opened in past 24 months
out_prncp Remaining outstanding principal for total amount funded
out_prncp_inv Remaining outstanding principal for portion of total amount funded by investors
pct_tl_nvr_dlq Percent of trades never delinquent
percent_bc_gt_75 Percentage of all bankcard accounts > 75% of limit.
policy_code publicly available policy_code=1
new products not publicly availab le policy_code=2
pub_rec Number of derogatory public records
pub_rec_bankruptcies Number of public record bankruptcies
purpose A category provided by the borrower for the loan request.
pymnt_plan Indicates if a payment plan has been put in place for the loan
recoveries post charge off gross recovery
revol_bal Total credit revolving balance
revol_util Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
sub_grade LC assigned loan subgrade
tax_liens Number of tax liens
term The number of payments on the loan. Values are in months and can be either 36 or 60.
title The loan title provided by the borrower
tot_coll_amt Total collection amounts ever owed
tot_cur_bal Total current balance of all accounts
tot_hi_cred_lim Total high credit/credit limit
total_acc The total number of credit lines currently in the borrower’s credit file
total_bal_ex_mort Total credit balance excluding mortgage
total_bal_il Total current balance of all installment accounts
total_bc_limit Total bankcard high credit/credit limit
total_cu_tl Number of finance trades
total_il_high_credit_limit Total installment high credit/credit limit
total_pymnt Payments received to date for total amount funded
total_pymnt_inv Payments received to date for portion of total amount funded by investors
total_rec_int Interest received to date
total_rec_late_fee Late fees received to date
total_rec_prncp Principal received to date
total_rev_hi_lim Total revolving high credit/credit limit
url URL for the LC page with listing data.
verification_status Indicates if income was verified by LC, not verified, or if the income source was verified
verified_status_joint Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the income source was verified
zip_code The first 3 numbers of the zip code provided by the borrower in the loan application.
NA NA
NA * Employer Title replaces Employer Name for all loans listed after 9/23/2013

As expected, id is “A unique LC assigned ID for the loan listing.” and member_id is “A unique LC assigned Id for the borrower member.”. The policy_code is “publicly available policy_code=1 new products not publicly available policy_code=2”. That means there could be different values but as we have seen before in this data it only takes on one value.

Finally, let’s look at variables that are either in the data and not in the meta description or vice versa. The dplyr::setdiff() function does what its name suggests, just pay attention to the order of arguments to understand which difference you actually get.

Variables in loans data but not in meta description.

dplyr::setdiff(colnames(loans), meta_loan_stats$LoanStatNew)
## [1] "verification_status_joint" "total_rev_hi_lim"

Variables in meta description but not in loans data.

dplyr::setdiff(meta_loan_stats$LoanStatNew, colnames(loans))
##  [1] "acc_open_past_24mths"           "avg_cur_bal"                   
##  [3] "bc_open_to_buy"                 "bc_util"                       
##  [5] "chargeoff_within_12_mths"       "delinq_amnt"                   
##  [7] "fico_range_high"                "fico_range_low"                
##  [9] "last_fico_range_high"           "last_fico_range_low"           
## [11] "mo_sin_old_il_acct"             "mo_sin_old_rev_tl_op"          
## [13] "mo_sin_rcnt_rev_tl_op"          "mo_sin_rcnt_tl"                
## [15] "mort_acc"                       "mths_since_recent_bc"          
## [17] "mths_since_recent_bc_dlq"       "mths_since_recent_inq"         
## [19] "mths_since_recent_revol_delinq" "num_accts_ever_120_pd"         
## [21] "num_actv_bc_tl"                 "num_actv_rev_tl"               
## [23] "num_bc_sats"                    "num_bc_tl"                     
## [25] "num_il_tl"                      "num_op_rev_tl"                 
## [27] "num_rev_accts"                  "num_rev_tl_bal_gt_0"           
## [29] "num_sats"                       "num_tl_120dpd_2m"              
## [31] "num_tl_30dpd"                   "num_tl_90g_dpd_24m"            
## [33] "num_tl_op_past_12m"             "pct_tl_nvr_dlq"                
## [35] "percent_bc_gt_75"               "pub_rec_bankruptcies"          
## [37] "tax_liens"                      "tot_hi_cred_lim"               
## [39] "total_bal_ex_mort"              "total_bc_limit"                
## [41] "total_il_high_credit_limit"     "total_rev_hi_lim  "            
## [43] "verified_status_joint"          NA

It seems that for the loans variables verification_status_joint and total_rev_hi_lim there are actually equivalents in the meta data, but they carry slightly different names (verified_status_joint) or have trailing blanks (total_rev_hi_lim). There are quite a few variables in the meta description that are not in the data, but that should be less of a concern as we don’t need those anyway.
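A hedged sketch of how the near-matches could be reconciled, trimming whitespace in the meta-data names before comparing again:

meta_names <- trimws(meta_loan_stats$LoanStatNew)
dplyr::setdiff(colnames(loans), meta_names)
# total_rev_hi_lim should now match; verification_status_joint still differs from
# the meta data's verified_status_joint by name only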

6 Defining default

Our ultimate goal is the prediction of loan defaults from a given set of observations by selecting explanatory (independent) variables (also called features in machine learning parlance) that result in an acceptable model performance as quantified by a pre-defined measure. This goal also shapes our exploratory data analysis. We will try to build some intuition about the data given our knowledge of the final goal. If we were given a data discovery task, such as detecting interesting patterns via unsupervised learning methods, we would probably perform a very different analysis.

For loans data the usual variable of interest is a delay or a default on required payments. So far we haven’t looked at any default variable because, in fact, there is no single variable indicating it. One may speculate about the reasons, but a plausible explanation may be that the definition of default depends on the perspective. A risk-averse person may classify any delay in scheduled payments as a default from day one, while others may apply a step-wise approach considering that borrowers may pay at a later stage. Yet another classification may look at any rating deterioration and include default as the last grade in a rating scale.

Let’s look at potential variables that may indicate a default / delay in payments:

  • loan_status: Current status of the loan
  • delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
  • mths_since_last_delinq: The number of months since the borrower’s last delinquency.

We can look at the unique values of the above variables by applying base::unique() via the purrr::map() function to each column of interest. The purrr library is part of the tidyverse set of libraries and provides functional-programming tools for iterating over data structures.

default_vars <- c("loan_status", "delinq_2yrs", "mths_since_last_delinq")
purrr::map(.x = loans[, default_vars], .f = base::unique)
## $loan_status
##  [1] "Fully Paid"                                         
##  [2] "Charged Off"                                        
##  [3] "Current"                                            
##  [4] "Default"                                            
##  [5] "Late (31-120 days)"                                 
##  [6] "In Grace Period"                                    
##  [7] "Late (16-30 days)"                                  
##  [8] "Does not meet the credit policy. Status:Fully Paid" 
##  [9] "Does not meet the credit policy. Status:Charged Off"
## [10] "Issued"                                             
## 
## $delinq_2yrs
##  [1]  0  2  3  1  4  6  5  8  7  9 11 NA 13 15 10 12 17 18 29 24 14 21 22
## [24] 19 16 30 26 20 27 39
## 
## $mths_since_last_delinq
##   [1]   0  35  38  61   8  20  18  68  45  48  41  40  74  25  53  39  10
##  [18]  26  56  77  28  52  24  16  60  54  23   9  11  13  65  19  80  22
##  [35]  59  79  44  64  57  14  63  49  15  73  70  29  51   5  75  55   2
##  [52]  30  47  33  69   4  43  21  27  46  81  78  82  31  76  62  72  42
##  [69]  50   3  12  67  36  34  58  17  71  66  32   6  37   7   1  83  86
##  [86] 115  96 103 120 106  89 107  85  97  95 110  84 135  88  87 122  91
## [103] 146 134 114  99  93 127 101  94 102 129 113 139 131 156 143 109 119
## [120] 149 118 130 148 126  90 141 116 100 152  98  92 108 133 104 111 105
## [137] 170 124 136 180 188 140 151 159 121 123 157 112 154 171 142 125 117
## [154] 176 137

We can see that delinq_2yrs shows only a few unique values which is a bit surprising as it could take many more values given its definition. The variable mths_since_last_delinq has some surprisingly large values. Both variables only indicate a delinquency in the past so they cannot help with the default definition.

The variable loan_status seems to be an indicator of the current state a particular loan is in. We should also perform a count of the different values.

loans %>%
  group_by(loan_status) %>%
  summarize(count = n(), rel_count = count/nrow(loans)) %>%
  knitr::kable()
loan_status count rel_count
Charged Off 45248 0.0509906
Current 601779 0.6781533
Default 1219 0.0013737
Does not meet the credit policy. Status:Charged Off 761 0.0008576
Does not meet the credit policy. Status:Fully Paid 1988 0.0022403
Fully Paid 207723 0.2340860
In Grace Period 6253 0.0070466
Issued 8460 0.0095337
Late (16-30 days) 2357 0.0026561
Late (31-120 days) 11591 0.0130621

It is not immediately obvious what the different values stand for, so we refer to Lending Club’s documentation “What do the different Note statuses mean?”:

  • Fully Paid: Loan has been fully repaid, either at the expiration of the 3- or 5-year term or as a result of a prepayment.
  • Current: Loan is up to date on all outstanding payments.
  • Does not meet the credit policy. Status:Fully Paid: No explanation but see “fully paid”.
  • Issued: New loan that has passed all Lending Club reviews, received full funding, and has been issued.
  • Charged Off: Loan for which there is no longer a reasonable expectation of further payments. Generally, Charge Off occurs no later than 30 days after the Default status is reached. Upon Charge Off, the remaining principal balance of the Note is deducted from the account balance. Learn more about the difference between “default” and “charge off”.
  • Does not meet the credit policy. Status:Charged Off: No explanation but see “Charged Off”
  • Late (31-120 days): Loan has not been current for 31 to 120 days.
  • In Grace Period: Loan is past due but within the 15-day grace period.
  • Late (16-30 days): Loan has not been current for 16 to 30 days.
  • Default: Loan has not been current for 121 days or more.

Given above information, we will define a default as follows.

defaulted <- 
  c("Default", 
    "Does not meet the credit policy. Status:Charged Off", 
    "In Grace Period", 
    "Late (16-30 days)", 
    "Late (31-120 days)")

We now have to add an indicator variable to the data. We do this by reassigning the mutated data to the original object. An alternative would be to update the object via the compound assignment pipe operator magrittr::%<>% from the magrittr package or an in-place assignment := from the data.table package. We use a logical (TRUE/FALSE) indicator variable, which has nicer plotting properties (as it is treated like a character variable by the plotting library ggplot2) than a numerical value such as 1/0. R is usually clever enough to still allow calculations on logical values.

loans <-
  loans %>%
  mutate(default = ifelse(!(loan_status %in% defaulted), FALSE, TRUE))
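For reference, hedged sketches of the two alternatives mentioned above (note that loan_status %in% defaulted already yields the logical vector, so the ifelse() wrapper is not strictly needed):

# not run
# library(magrittr)
# loans %<>% mutate(default = loan_status %in% defaulted)
#
# data.table::setDT(loans)[, default := loan_status %in% defaulted]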

Given our new indicator variable, we can now compute the frequency of actual defaults in the data. It is around 0.0249961, i.e. about 2.5%.

loans %>%
  summarise(default_freq = sum(default / n()))
## # A tibble: 1 x 1
##   default_freq
##          <dbl>
## 1   0.02499608
# alternatively in a table
table(loans$default) / nrow(loans)
## 
##      FALSE       TRUE 
## 0.97500392 0.02499608

7 Removing variables deemed unfit for most modeling

As stated before, some variables may actually have information value but are dropped because we deem them unfit for most practical purposes. Arguably one would have to look at the actual value distribution, as e.g. a high number of unique values may come from nonsense values for only a few loans, but we don’t dig deeper here.

We can get rid of the following variables, with the reason given:

  • annual_inc_joint: high NA ratio
  • dti_joint: high NA ratio
  • verification_status_joint: high NA ratio
  • policy_code: only one unique value -> standard deviation = 0
  • id: key variable
  • member_id: key variable
  • emp_title: high amount of unique values
  • url: high amount of unique values
  • desc: high NA ratio
  • title: high amount of unique values
  • next_pymnt_d: high NA ratio
  • open_acc_6m: high NA ratio
  • open_il_6m: high NA ratio
  • open_il_12m: high NA ratio
  • open_il_24m: high NA ratio
  • mths_since_rcnt_il: high NA ratio
  • total_bal_il: high NA ratio
  • il_util: high NA ratio
  • open_rv_12m: high NA ratio
  • open_rv_24m: high NA ratio
  • max_bal_bc: high NA ratio
  • all_util: high NA ratio
  • total_rev_hi_lim: high NA ratio
  • inq_fi: high NA ratio
  • total_cu_tl: high NA ratio
  • inq_last_12m: high NA ratio

vars_to_remove <- 
  c("annual_inc_joint", "dti_joint", "policy_code", "id", "member_id",
    "emp_title", "url", "desc", "title", "open_acc_6m", "open_il_6m", 
    "open_il_12m", "open_il_24m", "mths_since_rcnt_il", "total_bal_il", 
    "il_util", "open_rv_12m", "open_rv_24m", "max_bal_bc", "all_util",
    "total_rev_hi_lim", "inq_fi", "total_cu_tl", "inq_last_12m",
    "verification_status_joint", "next_pymnt_d")

loans <- loans %>% select(-one_of(vars_to_remove))

We further remove variables for other (stated) reasons:

  • sub_grade: contains the same (but more granular) information as grade
  • loan_status: has been used to define the target variable

vars_to_remove <- 
  c("sub_grade", "loan_status")

loans <- loans %>% select(-one_of(vars_to_remove))

8 A note on hypothesis generation vs. hypothesis confirmation

Before we start our analysis, we should be clear about its aim and what the data is used for. In the book “R for Data Science” the authors put it quite nicely under the name Hypothesis generation vs. hypothesis confirmation:

  1. Each observation can either be used for exploration or confirmation, not both.
  2. You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration. This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be over optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.

In a strict sense, this requires us to split the data into different sets. The authors go on suggesting a split:

  1. 60% of your data goes into a training (or exploration) set. You’re allowed to do anything you like with this data: visualise it and fit tons of models to it.
  2. 20% goes into a query set. You can use this data to compare models or visualisations by hand, but you’re not allowed to use it as part of an automated process.
  3. 20% is held back for a test set. You can only use this data ONCE, to test your final model.

This means that even for exploratory data analysis (EDA), we would only look at part of the data. We will split the data at random into two sets with an 80% train and 20% test ratio. As we are dealing with time-series data, we could also split the data by time, but time itself may be an explanatory variable which could be modeled. All exploratory analysis will be performed on the training data only. We will use base::set.seed() to make the random split reproducible. One can argue whether it is sensible to use only split data for EDA, but EDA usually builds your intuition about the data and thus will shape data transformation and model decisions. The test data allows for testing all these assumptions in addition to the actual model performance. There are other methods such as cross-validation which do not necessarily require a test data set, but we have enough observations to afford one.

One note of caution is necessary here. Since not all data is used for model fitting, the test data may have labels that do not occur in the training set and, by the same rationale, features may have unseen values. In addition, the data is imbalanced, i.e. only a few borrowers default while many more do not. The latter fact may actually require a non-random split considering the class label (default / non-default). The same may hold true for the features (independent variables). For more details on dealing with imbalanced data, see Learning from Imbalanced Classes or Learning from Imbalanced Data. Tom Fawcett puts it nicely in the first of these blog posts:

Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

We can do the split manually (see commented out code) but in order to ensure class distributions within the split data, we use function createDataPartition() from the caret package which performs stratified sampling. For details on the function, see The caret package: 4.1 Simple Splitting Based on the Outcome.

# ## manual approach
# # 80% of the sample size
# smp_size <- floor(0.8 * nrow(loans))
# 
# # set the seed to make your partition reproducible
# set.seed(123)
# train_index <- sample(seq_len(nrow(loans)), size = smp_size)
# 
# train <- loans[train_index, ]
# test <- loans[-train_index, ]

## with caret
set.seed(6438)

train_index <- 
  caret::createDataPartition(y = loans$default, times = 1, 
                             p = .8, list = FALSE)

train <- loans[train_index, ]
test <- loans[-train_index, ]
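A quick check that the stratified split preserves the class balance in both partitions (the exact figures depend on the seed and are not reproduced here):

round(prop.table(table(train$default)), 4)
round(prop.table(table(test$default)), 4)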

9 Exploratory Data Analysis

There are many excellent resources on exploratory data analysis.

A visual look at the data should always precede any model considerations. A useful visualization library is ggplot2 (which is part of the tidyverse and further on also referred to as ggplot); it requires a few other libraries on top for extensions, such as scales. We are not after publication-ready visualizations yet, as this phase is considered data exploration or data discovery rather than results reporting. Nonetheless, the visualization libraries used already produce visually appealing graphics as they use smart heuristics and default values to guess sensible parameter settings.

library(scales)

The most important questions around visualization are which variables are numeric (and, if so, whether they are continuous or discrete) and which are strings. Furthermore, which variables are attributes (categorical) and which make up sensible metric-attribute pairs. Important information for efficient visualization with categorical variables is also the number of unique values they can take and the ratio of zero or missing values, both of which were already analyzed in section Meta Data. In a first exploratory analysis we aim for categorical variables that have high information power and not too many unique values, to keep the information density at a manageable level. Also consider group sizes and differences between median and mean driven by outliers. Especially when drawing conclusions from summarized / aggregated information, we should be aware of group size. We may thus add this info directly in the plot or look at it before plotting in a grouped table, as sketched below.
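For example, such a grouped table showing group size alongside mean and median could look like this (a sketch using columns introduced earlier; it relies on the train split from the previous section):

train %>%
  group_by(grade) %>%
  summarise(n = n(),
            mean_loan_amnt = mean(loan_amnt, na.rm = TRUE),
            median_loan_amnt = median(loan_amnt, na.rm = TRUE)) %>%
  knitr::kable()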

A generally good idea is to look at the distributions of relevant continuous variables. These are probably

  • loan_amnt
  • funded_amnt
  • funded_amnt_inv
  • annual_inc

Alongside the target variable default, some interesting categorical variables might be

  • term
  • int_rate
  • grade
  • emp_title
  • home_ownership
  • verification_status
  • issue_d (timestamp)
  • loan_status
  • purpose
  • zip_code (geo-info)
  • addr_state (geo-info)
  • application_type

Assign the continuous variables to character vectors and reshape the data for plotting the distributions.

income_vars <- c("annual_inc")
loan_amount_vars <- c("loan_amnt", "funded_amnt", "funded_amnt_inv")

We can reshape and plot the original data for specified variables in a tidyr-dplyr-ggplot pipe. For details on tidying with tidyr, see Data Science with R - Data Tidying or Data Processing with dplyr & tidyr.

train %>%
  select_(.dots = income_vars) %>%
  gather_("variable", "value", gather_cols = income_vars) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

We can see that a lot of loans have corresponding annual income of zero and in general income seems low. As already known, joint income has a large number of NA values (i.e. cannot be displayed) and those few values that are present do not seem to have significant exposure. Most loan applications must have been submitted by single income borrowers, i.e. either single persons or single-income households.

train %>%
  select_(.dots = loan_amount_vars) %>%
  gather_("variable", "value", gather_cols = loan_amount_vars) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The loan amount distributions seem similar in shape, suggesting not too much divergence between the loan amount applied for, the amount committed and the amount invested.

We had already identified a number of interesting categorical variables. Let’s combine our selection with the meta information gathered in an earlier stage to see the information power and uniqueness.

categorical_vars <- 
  c("term", "grade", "sub_grade", "emp_title", "home_ownership",
    "verification_status", "loan_status", "purpose", "zip_code",
    "addr_state", "application_type", "policy_code")

meta_loans %>%
  select(variable, p_zeros, p_na, type, unique) %>%
  filter_(~ variable %in% categorical_vars) %>%
  knitr::kable()
variable p_zeros p_na type unique
term 0 0.0 character 2
grade 0 0.0 character 7
sub_grade 0 0.0 character 35
emp_title 0 5.8 character 289148
home_ownership 0 0.0 character 6
verification_status 0 0.0 character 3
loan_status 0 0.0 character 10
purpose 0 0.0 character 14
zip_code 0 0.0 character 935
addr_state 0 0.0 character 51
policy_code 0 0.0 numeric 1
application_type 0 0.0 character 2

This gives us some ideas on what may be useful for a broad data exploration. Variables such as emp_title have too many unique values to be suitable for a classical categorical graph. Other variables may lend themselves to pairwise or correlation graphs such as int_rate while again others may be used in time series plots such as issue_d. We may even be able to plot some appealing geographical plots with geolocation variables such as zip_code or addr_state.

9.1 Grade

For a start, let’s look at grade, which seems to be a rating classification scheme that Lending Club uses to assign loans into risk buckets, similar to other popular rating schemes like S&P’s or Moody’s. For more details, see Lending Club Rate Information. For now, it suffices to know that grades take the values A, B, C, D, E, F, G where A represents the highest quality loan and G the lowest.

As a first relation, we investigate the distribution of loan amount over the different grades with a standard boxplot highlighting potential outliers in red. One may also select a pre-defined theme from the ggthemes package by adding a call to ggplot such as theme_economist() or even create a theme oneself. For details on themes, see ggplot2 themes and Introduction to ggthemes.

As mentioned before, we want to be group-size aware, so let’s write a short function to be used inside the ggplot2::geom_boxplot() call in combination with ggplot2::stat_summary(), where we can call user-defined functions using the parameter fun.data. As is often the case with plots, we need to experiment with the position of additional text elements, which we achieve by scaling the y-position with some constant multiplier around the median (as the boxplot will have the median as a horizontal bar within the rectangles). The mean can be added in a similar fashion; however, we don’t need to specify its y-position explicitly. The count (as text) and the mean (as a point) may overlap each other and the median horizontal bar. This is acceptable in an exploratory setting; for publication-ready plots one would have to perform some adjustments.

# see http://stackoverflow.com/questions/15660829/how-to-add-a-number-of-observations-per-group-and-use-group-mean-in-ggplot2-boxp
give_count <- 
  stat_summary(fun.data = function(x) return(c(y = median(x)*1.06,
                                               label = length(x))),
               geom = "text")

# see http://stackoverflow.com/questions/19876505/boxplot-show-the-value-of-mean
give_mean <- 
  stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
               shape = 18, size = 3, show.legend = FALSE)
train %>%
  ggplot(aes(grade, loan_amnt)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ default) +
  labs(title="Loan Amount by Grade", x = "Grade", y = "Loan Amount \n")

We can derive a few points from the plot:

  • there is not a lot of difference between default and non-default
  • lower quality loans tend to have a higher loan amount
  • there are virtually no outliers except for grade B
  • the loan amount spread (IQR) seems to be slightly higher for lower quality loans

According to Lending Club’s rate information, we would expect a strong increasing relationship between grade and interest rate so we can test this. We also consider the term of the loan as one might expect that longer term loans could have a higher interest rate.

train %>%
  ggplot(aes(grade, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  labs(title="Interest Rate by Grade", x = "Grade", y = "Interest Rate \n") +
  facet_wrap(~ term)

We can derive a few points from the plot:

  • interest rate increases with grade worsening
  • a few loans seem to have an equally low interest rate independent of grade
  • the spread of rates seems to increase with grade worsening
  • there tend to be more outliers on the lower end of the rate
  • the 3-year term has a much higher number of borrowers in the high-quality grades while the 5-year term has a larger share in the low-quality grades

9.2 Home Ownership

We can do the same for home ownership.

train %>%
  ggplot(aes(home_ownership, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ default) +
  labs(title="Interest Rate by Home Ownership", x = "Home Ownership", y = "Interest Rate \n")

We can derive a few points from the plot:

  • there seems to be no immediate conclusion with respect to the impact of home ownership, e.g. we can see that the median/mean interest rate is higher for people who own a house than for those who still pay a mortgage.
  • interest rates are highest for values “any” and “none” which could be loans of limited data quality but there are very few data points.
  • interest rate is higher for default loans, which is probably driven by other factors (e.g. grade)

9.3 Loan Amount and Income

Another interesting plot may be the relationship between loan amount and funded / invested amount. As all variables are continuous, we can do that with a simple scatterplot, but we will need to reshape the data to have both funded amounts plotted against the loan amount.

In fact the reshaping here may be slightly odd as we want to display the same variable on the x-axis but different values on the y-axis, facetted by their variable names. To achieve that, we build the data twice with the same x variable but a different y variable each time and stack the results.

funded_amnt <-
  train %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt, 
              variable = "funded_amnt")

funded_amnt_inv <-
  train %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt_inv, 
              variable = "funded_amnt_inv")

plot_data <- rbind(funded_amnt, funded_amnt_inv)
# remove unnecessary data using regex
ls()
##  [1] "categorical_vars"  "chr_to_date_vars"  "chr_to_num_vars"  
##  [4] "convert_date"      "default_vars"      "defaulted"        
##  [7] "excel_file"        "funded_amnt"       "funded_amnt_inv"  
## [10] "give_count"        "give_mean"         "i"                
## [13] "income_vars"       "libraries_missing" "libraries_used"   
## [16] "loan_amount_vars"  "loans"             "meta_browse_notes"
## [19] "meta_loan_stats"   "meta_loans"        "meta_reject_stats"
## [22] "na_to_zero_vars"   "num_vars"          "path"             
## [25] "plot_data"         "test"              "train"            
## [28] "train_index"       "vars_to_remove"
rm(list = ls()[grep("^funded", ls())])
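
As an aside, the same long format could be produced more compactly with tidyr::gather(), keeping loan_amnt as an identifier column (a sketch, not used further below):

# sketch: reshape with tidyr::gather() instead of transmute() + rbind()
plot_data_alt <-
  train %>%
  select(loan_amnt, funded_amnt, funded_amnt_inv) %>%
  gather(key = "variable", value = "value", funded_amnt, funded_amnt_inv)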

Now let’s plot the data.

plot_data %>%
  ggplot(aes(x = loan_amnt, y = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_point()

We can derive a few points from the plot:

  • there are instances where the funded amount is smaller than the loan amount
  • there seem to be a number of loans where the invested amount is smaller than the funded amount, i.e. the full loan is not invested in

Let’s do the same but only for annual income versus loan amount.

train %>%
  ggplot(aes(x = annual_inc, y = loan_amnt)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (geom_point).

We can derive a few points from the plot:

  • there is no immediately discernible relationship
  • there are quite a few income outliers with questionable values (e.g. why would a person with an annual income of 9500000 request a loan amount of only 24000?)

9.4 Time Series

Now let’s take a look at interest rates over time but split the time series by grade to see if there are differences in interest rate development depending on the borrower grade. Again we make use of a dplyr pipe, first selecting the variables of interest, then grouping by attributes and finally summarising the metric interest rate by taking the mean for each group. We use facet_wrap to split by the attribute grade. As we are using the mean for building an aggregate representation, we should be wary of outliers and group size, which we had already looked at earlier (ignoring the time dimension).

train %>%
  select(int_rate, grade) %>%
  group_by(grade) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE),
            int_rate_median = median(int_rate, na.rm = TRUE),
            n = n()) %>%
  knitr::kable()
grade int_rate_mean int_rate_median n
A 7.244639 7.26 118281
B 10.830884 10.99 203741
C 13.979427 13.98 196815
D 17.178566 16.99 111490
E 19.899484 19.99 56765
F 23.572306 23.76 18454
G 25.624502 25.83 4358

Group sizes are relatively large except for grade G, which may have only a few hundred loans per group when grouping by grade and date. Mean and median seem fairly close at this non-granular level. Let’s plot the data.

train %>%
  select(int_rate, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, int_rate_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade)

We can derive a few points from the plot:

  • the mean interest rate is falling or relatively constant for high-rated clients
  • the mean interest rate is increasing significantly for low-rated clients

Let’s also look at loan amounts over time in the same manner.

train %>%
  select(loan_amnt, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, loan_amnt_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade)

We can derive a few points from the plot:

  • the mean loan amount is increasing for all grades
  • while high-rated clients have some mean loan amount volatility, it is much higher for low-rated clients

9.5 Geolocation Plots

Let’s remind ourselves of the geolocation variables in the data and their information power.

  • zip_code (geo-info)
  • addr_state (geo-info)

geo_vars <- c("zip_code", "addr_state")

meta_loans %>%
  select(variable, p_zeros, p_na, type, unique) %>%
  filter_(~ variable %in% geo_vars) %>%
  knitr::kable()
variable p_zeros p_na type unique
zip_code 0 0 character 935
addr_state 0 0 character 51
loans %>%
  select_(.dots = geo_vars) %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    887379 obs. of  2 variables:
##  $ zip_code  : chr  "860xx" "309xx" "606xx" "917xx" ...
##  $ addr_state: chr  "AZ" "GA" "IL" "CA" ...

We see that zip_code seems to be the truncated US postal code with only the first three digits having a value. The addr_state seems to hold the state names as two-letter abbreviations.

We can use the choroplethr package to work with maps (alternatives may be maps among other packages). While choroplethr provides functions to work with the data, its sister package choroplethrMaps contains corresponding maps that can be used by choroplethr. One issue with choroplethr is that it attaches the package plyr, which means many functions from dplyr (the successor to plyr) are masked. Thus we do not load the package choroplethr despite using its functions frequently. However, we need to load choroplethrMaps as it has some data that is not exported from its namespace, so syntax like choroplethrMaps::state.map (where state.map is the rdata file) does not work. An alternative might be to load the file directly by going to the underlying directory, e.g. R-3.x.x\library\choroplethrMaps\data, but this is cumbersome. As discussed before, when attaching a library, we just need to make sure that important functions of other packages are not masked. In this case it’s fine.

For details on those libraries, see CRAN choroplethr and CRAN choroplethrMaps.

First we aggregate the default rate by state.

default_rate_state <- 
  train %>%
  select(default, addr_state) %>%
  group_by(addr_state) %>%
  summarise(default_rate = sum(default, na.rm = TRUE) / n())

knitr::kable(default_rate_state)
addr_state default_rate
AK 0.0289280
AL 0.0300944
AR 0.0243994
AZ 0.0260203
CA 0.0247008
CO 0.0233654
CT 0.0222655
DC 0.0131646
DE 0.0291604
FL 0.0267678
GA 0.0238065
HI 0.0303441
IA 0.2000000
ID 0.0000000
IL 0.0201743
IN 0.0242243
KS 0.0162975
KY 0.0204172
LA 0.0268172
MA 0.0246096
MD 0.0279492
ME 0.0023474
MI 0.0249429
MN 0.0239122
MO 0.0222711
MS 0.0301540
MT 0.0289296
NC 0.0263557
ND 0.0075377
NE 0.0156087
NH 0.0206490
NJ 0.0249566
NM 0.0264766
NV 0.0303213
NY 0.0294889
OH 0.0225960
OK 0.0283897
OR 0.0209814
PA 0.0256838
RI 0.0245849
SC 0.0218611
SD 0.0301164
TN 0.0267720
TX 0.0240810
UT 0.0250696
VA 0.0279697
VT 0.0172771
WA 0.0216223
WI 0.0213508
WV 0.0225434
WY 0.0179455

The data is already in a good format but when using choroplethr we need to adhere to some conventions to make it work out of the box. The first thing would be to bring the state names into a standard format. In fact, we can investigate the format by looking e.g. into the choroplethrMaps::state.map dataset which we later use to map our data. We first load the library and then the data via the function utils::data(). From the documentation, we also learn that state.map is a “data.frame which contains a map of all 50 US States plus the District of Columbia.” It is based on a shapefile which is often used in geospatial visualizations and “taken from the US Census 2010 Cartographic Boundary shapefiles page”. We are interested in the region variable which seems to hold the state names.

library(choroplethrMaps)
utils::data(state.map)
str(state.map)
## 'data.frame':    50763 obs. of  12 variables:
##  $ long      : num  -113 -113 -112 -112 -111 ...
##  $ lat       : num  37 37 37 37 37 ...
##  $ order     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece     : Factor w/ 66 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group     : Factor w/ 226 levels "0.1","1.1","10.1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id        : chr  "0" "0" "0" "0" ...
##  $ GEO_ID    : Factor w/ 51 levels "0400000US01",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ STATE     : Factor w/ 51 levels "01","02","04",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ region    : chr  "arizona" "arizona" "arizona" "arizona" ...
##  $ LSAD      : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
##  $ CENSUSAREA: num  113594 113594 113594 113594 113594 ...
unique(state.map$region)
##  [1] "arizona"              "arkansas"             "louisiana"           
##  [4] "minnesota"            "mississippi"          "montana"             
##  [7] "new mexico"           "north dakota"         "oklahoma"            
## [10] "pennsylvania"         "tennessee"            "virginia"            
## [13] "california"           "delaware"             "west virginia"       
## [16] "wisconsin"            "wyoming"              "alabama"             
## [19] "alaska"               "florida"              "idaho"               
## [22] "kansas"               "maryland"             "colorado"            
## [25] "new jersey"           "north carolina"       "south carolina"      
## [28] "washington"           "vermont"              "utah"                
## [31] "iowa"                 "kentucky"             "maine"               
## [34] "massachusetts"        "connecticut"          "michigan"            
## [37] "missouri"             "nebraska"             "nevada"              
## [40] "new hampshire"        "new york"             "ohio"                
## [43] "oregon"               "rhode island"         "south dakota"        
## [46] "district of columbia" "texas"                "georgia"             
## [49] "hawaii"               "illinois"             "indiana"

So it seems we need lower-case, long-form state names, which implies we need a mapping from the short names in our data. Some internet research may lead us to the site American National Standards Institute (ANSI) Codes for States where we find a mapping under the section FIPS Codes for the States and the District of Columbia. We could try using some parsing tool to extract the table from the web page, but for now we take a shortcut and simply copy-paste the data into a tab-delimited txt file and read the file with readr::read_tsv(). We then cross-check whether all abbreviations used in our loans data are present in the newly created mapping table.

states <- readr::read_tsv("./data//us_states.txt")
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   `FIPS State Numeric Code` = col_integer(),
##   `Official USPS Code` = col_character()
## )
str(states)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  3 variables:
##  $ Name                   : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ FIPS State Numeric Code: int  1 2 4 5 6 8 9 10 11 12 ...
##  $ Official USPS Code     : chr  "AL" "AK" "AZ" "AR" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Name                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ FIPS State Numeric Code: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Official USPS Code     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# give some better variable names (spaces never helpful)
names(states) <- c("name", "fips_code", "usps_code")
# see if all states are covered
dplyr::setdiff(default_rate_state$addr_state, states$usps_code)
## character(0)

Comparing the unique abbreviations in the loans data with the usps codes reveals that we are all covered so let’s bring the data together using dplyr::left_join(). For details on dplyr join syntax, see Two-table verbs. To go with the choroplethr conventions, we also need to

  • have state names in lower case
  • rename the mapping variable to region and the numeric metric to value.
  • only select above two variables in the function call

Note that we are re-assigning the data to itself to avoid creating a new table.

default_rate_state <-
  default_rate_state %>%
  left_join(states[, c("usps_code", "name")], 
            by = c("addr_state" = "usps_code")) %>%
  rename(region = name, value = default_rate) %>%
  mutate(region = tolower(region)) %>%
  select(region, value)

Finally, we can plot the data using the default colors from choroplethr. As we are having data on state level, we use function choroplethr::state_choropleth().

choroplethr::state_choropleth(df = default_rate_state, 
                              title = "Default rate by State")

We can derive a few points from the plot:

  • default rate varies across states
  • we may want to adjust the default binning (see the sketch right after this list)
  • we are only looking at default rate but not defaulted exposure (there may be states with many defaults but the defaulted amount is small)
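
Regarding the binning, here is a minimal sketch assuming the num_colors argument of choroplethr::state_choropleth() behaves as documented (a continuous scale when set to 1):

# sketch: continuous color scale instead of the default binning
# (num_colors is assumed to control the binning as per the choroplethr docs)
choroplethr::state_choropleth(df = default_rate_state,
                              title = "Default rate by State",
                              num_colors = 1)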

We could do the same exercise but on a more granular level as we have zip codes available. They are truncated so we would have to find a way to work with them. They also have a high number of unique values so the map legend may be cramped if we were looking at all of US.
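
A rough sketch of what the zip-code-level aggregation could look like (without mapping it, as the truncated codes would first have to be matched to a suitable zip-code map):

# sketch: default rate by truncated zip code, not mapped here
default_rate_zip <-
  train %>%
  select(default, zip_code) %>%
  group_by(zip_code) %>%
  summarise(default_rate = sum(default, na.rm = TRUE) / n(),
            n = n()) %>%
  arrange(desc(default_rate))

head(default_rate_zip)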

10 Correlation

Many models rely on the notion of correlation between independent and dependent variables, so a natural exploratory visualization would be a correlation plot or correlogram. One library offering this is corrplot with its main function corrplot::corrplot(). The function takes as input the correlation matrix that can be produced with stats::cor(). This of course is only defined for numeric, non-missing variables. In order to have a reasonable information density in the correlation matrix, we will kick out some variables with a missing value share larger than 50%.
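
A minimal sketch of such a filter, using the meta information from funModeling::df_status() (p_na is in percent); the resulting variable names depend on the current state of the data:

# sketch: identify variables with more than 50% missing values
meta_train_na <- funModeling::df_status(train, print_results = FALSE)

vars_high_na <-
  meta_train_na %>%
  filter(p_na > 50) %>%
  .$variable

vars_high_na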

For a discussion on missing value treatment, see e.g. Data Analysis Using Regression and Multilevel/Hierarchical Models - Chapter 25: Missing-data imputation

Let’s again build a numeric variable vector after all previous operations and look at correlations.

num_vars <- 
  train %>% 
  sapply(is.numeric) %>% 
  which() %>% 
  names()

meta_train <- funModeling::df_status(train, print_results = FALSE)

meta_train %>%
  select(variable, p_zeros, p_na, unique) %>%
  filter_(~ variable %in% num_vars) %>%
  knitr::kable()
variable p_zeros p_na unique
loan_amnt 0.00 0.00 1369
funded_amnt 0.00 0.00 1369
funded_amnt_inv 0.03 0.00 8203
int_rate 0.00 0.00 540
installment 0.00 0.00 64438
annual_inc 0.00 0.00 41978
dti 0.05 0.00 4072
delinq_2yrs 80.80 0.00 26
inq_last_6mths 56.10 0.00 27
mths_since_last_delinq 51.43 0.00 152
mths_since_last_record 84.70 0.00 123
open_acc 0.00 0.00 74
pub_rec 84.70 0.00 31
revol_bal 0.38 0.00 68979
revol_util 0.40 0.05 1312
total_acc 0.00 0.00 131
out_prncp 28.85 0.00 209810
out_prncp_inv 28.85 0.00 224620
total_pymnt 2.01 0.00 423038
total_pymnt_inv 2.04 0.00 424921
total_rec_prncp 2.05 0.00 222003
total_rec_int 2.06 0.00 287778
total_rec_late_fee 98.59 0.00 5169
recoveries 97.22 0.00 18681
collection_recovery_fee 97.35 0.00 16796
last_pymnt_amnt 2.00 0.00 197434
collections_12_mths_ex_med 98.65 0.02 12
mths_since_last_major_derog 75.06 0.00 167
acc_now_delinq 99.53 0.00 8
tot_coll_amt 78.94 7.92 9319
tot_cur_bal 0.01 7.92 293158

Finally, we can produce a correlation plot. Dealing with missing values by using the option use = "pairwise.complete.obs" in the function stats::cor() is considered bad practice as it uses pair matching and correlations may not be comparable, see e.g. Pairwise-complete correlation considered dangerous. Alternatively, we can use the option use = "complete.obs" which only considers complete observations but may discard a lot of data. However, as we have seen in the meta information, after our wrangling the proportion of missing values for the numeric variables has dropped a lot, so we should be fine.

Once we have the correlation matrix, we can use the function corrplot::corrplot() from the corrplot library. It offers a number of different visualization options. For details, see An Introduction to corrplot Package.

library(corrplot)
corrplot::corrplot(cor(train[, num_vars], use = "complete.obs"), 
                   method = "pie", type = "upper")

We can derive a few things from the plot

  • overall correlation between variables seems low but there are a few highly correlated ones
  • some highly correlated variables may indicate very similar information, e.g. mths_since_last_delinq and mths_since_last_major_derog
  • some variables are perfectly correlated and thus should be removed

Rather than a visual inspection, an (automatic) inspection of correlations and removal of highly correlated features can be done via function caret::findCorrelation() with a defined cutoff parameter. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation. Using exact = TRUE will cause the function to re-evaluate the average correlations at each step while exact = FALSE uses all the correlations regardless of whether they have been eliminated or not. The exact calculations will remove a smaller number of predictors but can be much slower when the problem dimensions are “big”.

caret::findCorrelation(cor(train[, num_vars], use = "complete.obs"), 
                       names = TRUE, cutoff = .5)
##  [1] "loan_amnt"                   "funded_amnt"                
##  [3] "funded_amnt_inv"             "installment"                
##  [5] "total_pymnt_inv"             "total_pymnt"                
##  [7] "total_rec_prncp"             "out_prncp"                  
##  [9] "total_acc"                   "mths_since_last_record"     
## [11] "mths_since_last_major_derog" "recoveries"

Given the above, we remove a few variables.

vars_to_remove <- 
  c("loan_amnt", "funded_amnt", "funded_amnt_inv", "installment",
    "total_pymnt_inv", "total_rec_prncp", "mths_since_last_delinq", 
    "out_prncp", "total_pymnt", "total_rec_int", "total_acc",
    "mths_since_last_record", "recoveries")

train <- train %>% select(-one_of(vars_to_remove))

11 Summary plots

An alternative to producing individual exploratory plots is using summary plots that come with sensible default settings and display the data according to its data type (numeric or categorical). The advantages are that a lot of information can be put into one graph and potential relationships may be spotted more easily. The disadvantages are potential performance issues when plotting many variables of a large dataset and readability issues in case of many dimensions for categorical variables. These disadvantages can be somewhat remediated by looking at summary plots after a lot of variables have already been discarded due to other findings (as we have done here). To show the general idea, we use the GGally library which builds on top of ggplot2 and provides some nice visualization functions. We choose a subset of variables for illustration of the GGally::ggpairs() function. Even with this subset the production of the graph may take some time depending on your hardware. For a theoretical background, see The Generalized Pairs Plot. For the many customization options, see GGally. As always when visualizing a lot of information, there is a trade-off between information density and visualization efficiency. The more variables are selected for the plot, the more difficult it becomes to efficiently visualize it. This needs some playing around or a really big screen.

Note

Simply leaving the variable default as boolean would produce an error in ggpairs (or rather in ggplot2 which it calls): Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?. Besides, we would like it to be treated as categorical, so we convert it to character for this function call.

library(GGally)
plot_ggpairs <-
  train %>%
  select(annual_inc, term, grade, default) %>%
  mutate(default = as.character(default)) %>%
  ggpairs()

plot_ggpairs

The resulting graph has a lot of information in it and we won’t go into further detail here, but here are the most important points for reading the graph properly. In the columns:

  • for a single continuous numeric variable, a density estimate is plotted
  • for a continuous numeric variable against another continuous numeric variable, a scatterplot is plotted
  • for a continuous numeric variable against a categorical variable, a histogram is plotted
  • for a single categorical variable, a histogram is plotted

In the rows:

  • for a continuous numeric variable against another continuous numeric variable, the correlation is shown
  • for a continuous numeric variable against a categorical variable, a boxplot per dimension of the categorical variable is plotted
  • for a categorical variable against another categorical variable, a histogram per dimension of the y-axis categorical variable is plotted
  • for a categorical variable against a continuous variable, a scatterplot per dimension of the categorical variable is plotted

Another interesting plot would be to visualize the density (estimates) of the numeric variables color-coded by the classes of the target variable (here: default/non-default) to inspect whether there are visual clues about differences in the distributions. We can achieve this by using ggplot but need to bring the data into a long format rather than its current wide format (see Data Science with R - Chapter Data Tidying for details on reshaping). We use tidyr::gather() to bring the numeric variables into a long format and then add the target variable again. Note that the reshaped data and the original target variable vector have different lengths. As dplyr::mutate() will not properly take care of the recycling rules, we have to recycle (repeat) the variable default manually so that it has the same length as the reshaped train data. See An Introduction to R - Chapter 2.2: Vector arithmetic for how recycling works in general.

Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated.

Note that we make use of a few plot parameters to make the visualization easier to follow. We like default to be a factor so the densities are independent of the class distribution. First, we switch the levels of the variable default to have TRUE on top (as the highest objective is to get the true positives right). We use the fill and color parameters to encode the target variable. This allows a visual distinction of overlapping densities. In addition, we set the alpha value to have the area under the curve transparently filled. For color selection we make use of brewer scales which offer a well-selected palette of colors. Finally, we split the plots by variable with facet_wrap, allowing the scales of each individual plot to be free and defining the number of columns via ncol.

num_vars <- train %>% sapply(is.numeric) %>% which() %>% names()

train %>%
  select_(.dots = num_vars) %>%
  gather(measure, value) %>%
  mutate(default = factor(rep(x = train$default, 
                              length.out = length(num_vars) * dim(train)[1]), 
                          levels = c("TRUE", "FALSE"))) %>%
  ggplot(data = ., aes(x = value, fill = default, 
                       color = default, order = -default)) +
  geom_density(alpha = 0.3, size = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free", ncol = 3)

We can derive a few things from the plot:

  • most of the densities are right-skewed with probably a few outliers extending the x-axis which makes discerning patterns in the area with highest mass difficult
  • even with above mentioned visualization impediment it seems that there is not a lot of difference between default and non-default loans for most variables
  • some densities seem wobbly for the majority class (e.g. open_acc) which is caused by the fact that the smoothing algorithm has many more data points to smooth in between but the density follows the same general shape

Given the above observations, we can try to improve the plot by removing (adjusting) outliers. There are many different methods that can be used, see e.g.

Most of these methods focus on removing outliers for modeling, which requires a lot of care and an understanding of how model characteristics change when removing / replacing outliers. In our case, we can be more pragmatic as we are only interested in improving our understanding of the mass of the distribution in a graphical way to see if there are easily discernible pattern differences between the two outcomes of our target variable. For this, we would like to replace outliers with some sensible value derived from the distribution of the particular variable. One could use the interquartile range as it is used in boxplots. We will use winsorization, i.e. replacing outliers with a specified percentile, e.g. the 95th. As this is a fairly simple task, we write our own function to do that. Alternatively, one may use e.g. the function psych::winsor() which trims outliers from both ends of the distribution. Since we have already seen that there are no negative values present, we only cut from the right side of the distribution. We call the function winsor_outlier and allow settings for the percentile and for removing NA values. The design is inspired by the SO question How to replace outliers with the 5th and 95th percentile values in R.

winsor_outlier <- function(x, cutoff = .95, na.rm = FALSE){
  quantiles <- quantile(x, cutoff, na.rm = na.rm)
  x[x > quantiles] <- quantiles
  x
}

train %>%
  select_(.dots = num_vars) %>%
  mutate_all(.funs = winsor_outlier, cutoff = .95, na.rm = TRUE) %>%
  gather(measure, value) %>%
  mutate(default = factor(rep(x = train$default, 
                              length.out = length(num_vars)*dim(train)[1]), 
                          levels = c("TRUE", "FALSE"))) %>%
  ggplot(data = ., aes(x = value, fill = default, 
                       color = default, order = -default)) +
  geom_density(alpha = 0.3, size = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free", ncol = 3)

We can derive a few things from the plot:

  • most densities don’t show a discernible difference between default and non-default
  • ignore the densities centered around zero as they stem from too few observations
  • some densities show a bump on the right stemming from the outlier replacement and thus indicating that there is actually data in the tail

Interestingly, variable out_prncp_inv shows a significant bump around zero for default == FALSE which can be confirmed via a count. However, the description tells us that this variable represents Remaining outstanding principal for portion of total amount funded by investors so it’s unclear how useful it is (especially when knowing that this variable will only be available after a loan has been granted).

train %>% 
  select(default, out_prncp_inv) %>% 
  filter(out_prncp_inv == 0) %>%
  group_by(default) %>% 
  summarize(n = n())
## # A tibble: 2 x 2
##   default      n
##     <lgl>  <int>
## 1   FALSE 204207
## 2    TRUE    613

Overall, it seems from visual inspection that the numeric variables do not contain a lot of information gain regarding the classification of the target variable. One obvious exception is the interest rate, which is a function of the expected risk of the borrower and probably mostly driven by the rating. Here we see that defaulted clients tend to have higher interest rates and also seem to have more outliers in the tail.
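
This impression can be cross-checked with a quick grouped summary (a small sketch; the exact figures depend on the train/test split):

# interest rate summary by default flag
train %>%
  group_by(default) %>%
  summarise(int_rate_mean   = mean(int_rate, na.rm = TRUE),
            int_rate_median = median(int_rate, na.rm = TRUE),
            n = n())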

A side note on plotting

If your screen is too small, you can export any plot to e.g. svg or pdf using graphics devices of the cairo API in package grDevices (usually part of base R distro) and scale the plot size to something bigger (especially helpful if it’s a long plot). Here we show an example for the previously created ggpairs plot stored as pdf.

cairo_pdf(filename = paste0(path, "/plot_ggpairs.pdf"),
    width=15,
    height=20,
    pointsize=10)
plot_ggpairs
dev.off()

12 Modeling

12.1 Model options

There are a couple of different model options available for a supervised binary classification problem such as determining whether a loan will default or not. In general, we may distinguish the following model approaches (among others):

  • logistic regression
  • linear discriminant analysis
  • k-nearest neighbors (kNN)
  • trees
  • random forests
  • boosting
  • support vector machines (SVM)

In addition, there are multiple ways to assess model performance such as model parameters, cross-validation, the bootstrap, and multiple feature selection methods such as ridge regression, the lasso, principal component analysis, etc.

We will look at some of those models and methods and see how they compare. For a more detailed treatment and further background, an accessible text is “An Introduction to Statistical Learning” or for a more advanced treatment “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”.

12.2 Imbalanced data

Before we can think about any modeling, we have to realize that the dataset is highly imbalanced, i.e. the target variable has a very low proportion of defaults (namely 0.0249963). This can lead to several issues in many algorithms, e.g.

  • biased accuracy
  • loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration
  • errors have the same cost (not pertaining to imbalanced data only)
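
For reference, the proportion quoted above can be reproduced directly from the training data (assuming default is stored as a logical column):

# share of defaults in the training data
mean(train$default, na.rm = TRUE)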

More details can be found in

There are a couple of approaches to deal with the problem which can be divided into (taken from Learning from Imbalanced Classes)

  • do nothing (not a good idea in general)
  • balance training set
    • oversample minority class
    • undersample majority class
    • synthesize new minority classes
  • throw away minority examples and switch to an anomaly detection framework
  • at the algorithm level, or after it:
    • adjust the class weight (misclassification costs)
    • adjust the decision threshold
    • modify an existing algorithm to be more sensitive to rare classes
  • construct an entirely new algorithm to perform well on imbalanced data

We will focus on balancing the training set. There are basically three options available

  • undersample majority class (randomly or informed)
  • oversample minority class (randomly or informed)
  • do a mixture of both (e.g. SMOTE or ROSE method)

Undersampling the majority class may lose information but, by decreasing the dataset size, also leads to more computational efficiency. We also tried SMOTE and ROSE, but the functions DMwR::SMOTE() and ROSE::ROSE() seem to be very picky about their input and so far we failed to run them. For details on SMOTE, see SMOTE: Synthetic Minority Over-sampling Technique and for details on ROSE, see ROSE: A Package for Binary Imbalanced Learning. As pointed out here, random sampling for either under- or oversampling is not the best idea. It is recommended to use cross-validation and to perform over- or under-sampling on each fold independently to get an honest estimate of model performance. In fact, the function caret::train() also allows re-balancing of the data during the training of a model via caret::trainControl(sampling = "..."), see The caret package 11.2: Subsampling During Resampling for details. For now, we go with undersampling, which still leaves a fair amount of training observations (at least for non-deep-learning approaches).

train_down <- 
  caret::downSample(x = train[, !(names(train) %in% c("default"))], 
                    y = as.factor(train$default), yname = "default")

base::prop.table(table(train_down$default))
## 
## FALSE  TRUE 
##   0.5   0.5
base::table(train_down$default)
## 
## FALSE  TRUE 
## 17745 17745
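
As an alternative that we do not pursue further here, caret can perform the downsampling within each resampling fold, roughly along these lines (a sketch only; the resampling parameters are illustrative):

# sketch only: let caret downsample the majority class inside each CV fold
ctrl_down <- caret::trainControl(method = "cv", number = 5,
                                 sampling = "down")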

12.3 A note on modeling libraries

There are many modeling libraries in R (in fact it is one of the major reasons for its popularity) and they can be very roughly distinguished into

  • general modeling libraries comprising many different model functions (sometimes wrappers around specialized model libraries)
  • specialized modeling libraries focusing on only one or a certain class of models

We will be using general modeling libraries and in particular the (standard) library shipping with R, stats, and a popular library, caret. Another library often used is rms. The advantage of the general libraries is their easier access via a common API, usually a fairly comprehensive documentation, and a relatively large user base, which means there is a high chance that someone has already done what you are trying to do.

12.3.1 A note on caret library in particular

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

as well as other functionality. It aims at providing a uniform interface to R modeling functions themselves, as well as a way to standardize common tasks (such as parameter tuning and variable importance). The documentation gives an overview of the models that can be used via the caret::train() function which is the workhorse for building models. A quick look at available models reveals that there is a lot of choice.

names(caret::getModelInfo())
##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.GCCL"            "GFS.LT.RS"           "GFS.THRIFT"         
##  [70] "glm.nb"              "glm"                 "glmboost"           
##  [73] "glmnet_h2o"          "glmnet"              "glmStepAIC"         
##  [76] "gpls"                "hda"                 "hdda"               
##  [79] "hdrda"               "HYFIS"               "icr"                
##  [82] "J48"                 "JRip"                "kernelpls"          
##  [85] "kknn"                "knn"                 "krlsPoly"           
##  [88] "krlsRadial"          "lars"                "lars2"              
##  [91] "lasso"               "lda"                 "lda2"               
##  [94] "leapBackward"        "leapForward"         "leapSeq"            
##  [97] "Linda"               "lm"                  "lmStepAIC"          
## [100] "LMT"                 "loclda"              "logicBag"           
## [103] "LogitBoost"          "logreg"              "lssvmLinear"        
## [106] "lssvmPoly"           "lssvmRadial"         "lvq"                
## [109] "M5"                  "M5Rules"             "manb"               
## [112] "mda"                 "Mlda"                "mlp"                
## [115] "mlpKerasDecay"       "mlpKerasDecayCost"   "mlpKerasDropout"    
## [118] "mlpKerasDropoutCost" "mlpML"               "mlpSGD"             
## [121] "mlpWeightDecay"      "mlpWeightDecayML"    "monmlp"             
## [124] "msaenet"             "multinom"            "mxnet"              
## [127] "mxnetAdam"           "naive_bayes"         "nb"                 
## [130] "nbDiscrete"          "nbSearch"            "neuralnet"          
## [133] "nnet"                "nnls"                "nodeHarvest"        
## [136] "null"                "OneR"                "ordinalNet"         
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "PRIM"               
## [160] "protoclass"          "pythonKnnReg"        "qda"                
## [163] "QdaCov"              "qrf"                 "qrnn"               
## [166] "randomGLM"           "ranger"              "rbf"                
## [169] "rbfDDA"              "Rborist"             "rda"                
## [172] "regLogistic"         "relaxo"              "rf"                 
## [175] "rFerns"              "RFlda"               "rfRules"            
## [178] "ridge"               "rlda"                "rlm"                
## [181] "rmda"                "rocc"                "rotationForest"     
## [184] "rotationForestCp"    "rpart"               "rpart1SE"           
## [187] "rpart2"              "rpartCost"           "rpartScore"         
## [190] "rqlasso"             "rqnc"                "RRF"                
## [193] "RRFglobal"           "rrlda"               "RSimca"             
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"          
## [199] "SBC"                 "sda"                 "sdwd"               
## [202] "simpls"              "SLAVE"               "slda"               
## [205] "smda"                "snn"                 "sparseLDA"          
## [208] "spikeslab"           "spls"                "stepLDA"            
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"         
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"  
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"      
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"  
## [226] "tan"                 "tanSearch"           "treebag"            
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"      
## [232] "vglmCumulative"      "widekernelpls"       "WM"                 
## [235] "wsrf"                "xgbLinear"           "xgbTree"            
## [238] "xyf"
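
To inspect the tuning parameters of a single model from this list, caret::modelLookup() can be used, e.g.:

# tuning parameters of a specific model, e.g. a CART tree
caret::modelLookup("rpart")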

12.4 Data preparation for modeling

A number of models require data preparation in advance so here we are going to perform some pre-processing so that all models can work with the data. Note that a few model functions also allow the pre-processing during the training step (e.g. scale and center) but it is computationally more efficient to do at least some pre-processing before.

12.4.1 Proper names for character variables

Some models require a particular naming scheme for the values of character variables (e.g. no spaces). The base::make.names() function does just that.

vars_to_mutate <-
  train_down %>%
  select(which(sapply(.,is.character))) %>%
  names()

vars_to_mutate
##  [1] "term"                "grade"               "emp_length"         
##  [4] "home_ownership"      "verification_status" "pymnt_plan"         
##  [7] "purpose"             "zip_code"            "addr_state"         
## [10] "initial_list_status" "application_type"
train_down <-
  train_down %>%
  mutate_at(.funs = make.names, .vars = vars_to_mutate)
  
test <-
  test %>%
  mutate_at(.funs = make.names, .vars = vars_to_mutate)

12.4.2 Dummy variables

A few models automatically convert character/factor variables into binary dummy variables (e.g. stats::glm()) while others don’t. To build a common ground, we can create dummy variables ahead of modeling. The caret::dummyVars() function creates a full set of dummy variables (i.e. a less than full rank parameterization). The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method. Let’s also remove the zip code variable as it would create a huge binary dummy matrix and potentially lead to computational problems. In many cases it is anyway not allowed to be used for modeling as it may discriminate against certain communities. We also adjust the test set as we will need it for evaluation. We throw away all variables that are not part of the training set as they are not used by the models and would only clutter memory. We keep the dummy data separate from the original training data as any model using explicit variable names (i.e. not the full set) would require us to list each dummy variable separately, which is cumbersome. Had the data been big, we might have chosen to keep it all in one data set. For more details, see The caret package 3.1: Creating Dummy Variables.

vars_to_remove <- c("zip_code")
train_down <- train_down %>% select(-one_of(vars_to_remove))

# train
dummies_train <-
  dummyVars("~.", data = train_down[, !(names(train_down) %in% c("default"))], 
            fullRank = FALSE)

train_down_dummy <-
  train_down %>%
  select(-which(sapply(.,is.character))) %>%
  cbind(predict(dummies_train, newdata = train_down))

# test
dummies_test <-
  dummyVars("~.", data = test[, dummies_train$vars], fullRank = FALSE)

test_dummy <-
  test %>%
  select(one_of(colnames(train_down))) %>%
  select(-which(sapply(.,is.character))) %>%
  cbind(predict(dummies_test, newdata = test))

12.5 Logistic regression

While linear regression is very often used for (continuous) quantitative variables, logistic regression is its counterpart for (discrete) qualitative responses, also referred to as categorical. Rather than modeling the outcome directly as default or not, logistic regression models the probability that the response belongs to a particular category. The logistic function will ensure that probabilities are within the range [0, 1]. The model coefficients are estimated via the maximum likelihood method.
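
To make the mapping onto probabilities concrete: the logistic (inverse logit) function is available in base R as stats::plogis(), so a quick sanity check of the [0, 1] range could look like this (illustrative values only):

# the logistic function maps any real-valued linear predictor into (0, 1)
linear_predictor <- c(-5, -1, 0, 1, 5)  # illustrative values
plogis(linear_predictor)                # same as 1 / (1 + exp(-linear_predictor))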

The logistic regression model is part of the larger family of generalized linear models and is implemented in many R packages. We will use the stats::glm() function that usually comes with the standard install. A strong contender for any regression modeling is the caret package that we have already used before and will use again.

We will start with a very simple model of one explanatory variable, namely grade, which intuitively should be a good predictor of default as it is a credit rating and therefore an evaluation of the borrower. Note that a call to stats::glm() will not return all model statistics by default but only some attributes of the created model object. We can inspect the object’s attributes with the base::attributes() function. The function base::summary() is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument. In the case of a glm object, it will return model statistics such as

  • deviance residuals
  • coefficients
  • standard error
  • z statistic
  • p-value
  • AIC

and some more. We may also use the summary function on selected model objects, e.g. the coefficients only via summary(model_object)$coef.

Finally, before running the model, the documentation of stats::glm() tells us that if the response is of type factor the first level denotes failure and all others success. We can check and find that this is already the case here.

levels(train_down$default)
## [1] "FALSE" "TRUE"
model_glm_1 <- glm(formula = default ~ grade, 
                   family = binomial(link = "logit"), 
                   data = train_down, na.action = na.exclude)
class(model_glm_1)
## [1] "glm" "lm"
attributes(model_glm_1)
## $names
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
## 
## $class
## [1] "glm" "lm"
summary(model_glm_1)
## 
## Call:
## glm(formula = default ~ grade, family = binomial(link = "logit"), 
##     data = train_down, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8377  -1.1985  -0.0327   1.1564   1.7406  
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -1.26665    0.03911  -32.39 <0.0000000000000002 ***
## gradeB       0.77173    0.04519   17.08 <0.0000000000000002 ***
## gradeC       1.31620    0.04389   29.99 <0.0000000000000002 ***
## gradeD       1.67669    0.04600   36.45 <0.0000000000000002 ***
## gradeE       1.99273    0.05132   38.83 <0.0000000000000002 ***
## gradeF       2.26212    0.06856   33.00 <0.0000000000000002 ***
## gradeG       2.75092    0.12634   21.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49200  on 35489  degrees of freedom
## Residual deviance: 46080  on 35483  degrees of freedom
## AIC: 46094
## 
## Number of Fisher Scoring iterations: 4
summary(model_glm_1)$coef
##               Estimate Std. Error   z value      Pr(>|z|)
## (Intercept) -1.2666450 0.03910963 -32.38704 4.178600e-230
## gradeB       0.7717315 0.04518690  17.07866  2.139941e-65
## gradeC       1.3161994 0.04388804  29.98993 1.327728e-197
## gradeD       1.6766852 0.04600412  36.44641 7.841098e-291
## gradeE       1.9927260 0.05131843  38.83061  0.000000e+00
## gradeF       2.2621234 0.06855512  32.99715 8.925306e-239
## gradeG       2.7509198 0.12633659  21.77453 4.046038e-105

There are a couple of functions that work natively with lm / glm objects created via stats::glm() (taken from Steph Locke: Logistic Regressions):

  • coefficients: Extract coefficients
  • summary: Output a basic summary
  • fitted: Return the predicted values for the training data
  • predict: Predict some values for new data
  • plot: Produce some basic diagnostic plots
  • residuals: Return the errors on predicted values for the training data

Also note that categorical variables will get transformed into dummy variables. A standardization is not necessarily needed, see e.g. Is standardization needed before fitting logistic regression. There are a number of model diagnostics one may perform, see e.g. Diagnostics for logistic regression and pay attention to Frank Harrell’s answer.
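
The dummy coding behind the gradeB to gradeG coefficients above can be inspected directly via the model matrix (a quick check; grade A is absorbed into the intercept):

# treatment contrasts used by glm for the character variable grade
head(model.matrix(~ grade, data = train_down))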

12.6 Model evaluation illustration

For the sake of illustration, let’s go once through the model evaluation steps for this very simple model. This workflow should be similar even when the model gets more complex. There are many perspectives one can take when evaluating a model and many packages that provide helper functions to do so. For this illustration, we focus on a fairly standard approach looking at some statistics and explaining them along the way. We will also be plotting an ROC curve to establish its basic functioning.

To evaluate the model performance, we have to use the fitted model to predict the target variable default on unseen (test) data. The function stats::predict.glm() does exactly that. We set type = "response" which returns the predicted probabilities and thus allows us to evaluate against a chosen threshold. This can be useful in cases where the cost of misclassification is not equal and one outcome should be penalized more heavily than another. Also note that we choose the positive outcome to be TRUE, i.e. a true positive would be correctly identifying a default. We use the function caret::confusionMatrix() to produce a confusion matrix (also called error matrix). The function also returns a number of model statistics (however, note that the test data is highly imbalanced as we have not applied any resampling as in the training data, so a number of statistics are meaningless). Here is what the function reports:

  • Accuracy: (TP + TN) / (TP + FP + TN + FN)
  • 95% CI: 95% confidence interval for accuracy
  • No Information Rate: accuracy in case of random guessing (in case of binary outcome the proportion of the majority class)
  • P-Value [Acc > NIR]: p-value that accuracy is larger than the No Information Rate
  • Kappa: (observed accuracy - expected accuracy) / (1 - expected accuracy), see Cohen’s kappa in plain English
  • Mcnemar’s Test P-Value
  • Sensitivity / recall: true positive rate = TP / (TP + FN)
  • Specificity: true negative rate = TN / (TN + FP)
  • Pos Pred Value: (sensitivity * prevalence) / ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
  • Neg Pred Value: (specificity * (1-prevalence)) / (((1-sensitivity) * prevalence) + ((specificity) * (1 - prevalence)))
  • Prevalence: (TP + FN) / (TP + FP + TN + FN)
  • Detection Rate: TP / (TP + FP + TN + FN)
  • Detection Prevalence: (TP + FP) / (TP + FP + TN + FN)
  • Balanced Accuracy: (sensitivity + specificity) / 2

model_glm_1_pred <- 
  predict.glm(object = model_glm_1, newdata = test, type = "response")
model_pred_t <- function(pred, t) ifelse(pred > t, TRUE, FALSE)
caret::confusionMatrix(data = model_pred_t(model_glm_1_pred, 0.5), 
                       reference = test$default,
                       positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE  TRUE
##      FALSE 79712  1003
##      TRUE  93327  3433
##                                              
##                Accuracy : 0.4685             
##                  95% CI : (0.4662, 0.4708)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0211             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.77390            
##             Specificity : 0.46066            
##          Pos Pred Value : 0.03548            
##          Neg Pred Value : 0.98757            
##              Prevalence : 0.02500            
##          Detection Rate : 0.01934            
##    Detection Prevalence : 0.54520            
##       Balanced Accuracy : 0.61728            
##                                              
##        'Positive' Class : TRUE               
## 

Looking at the model statistics, we find a mixed picture:

  • we get a fair amount of true defaults right (sensitivity)
  • we get a large amount of non-defaults wrong (specificity)
  • the Kappa (which should consider class distributions) is very low

We can visualize the classification (predicted probabilities) versus the actual class for the test set and the given threshold. We make use of a plotting function inspired by Illustrated Guide to ROC and AUC with its source code residing on Github. We have slightly adjusted the code to fit our needs.

plot_pred_type_distribution <- function(df, threshold) {
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pred >= threshold & df$actual == 1, "TP", v)
  v <- ifelse(df$pred >= threshold & df$actual == 0, "FP", v)
  v <- ifelse(df$pred < threshold & df$actual == 1, "FN", v)
  v <- ifelse(df$pred < threshold & df$actual == 0, "TN", v)
  
  df$pred_type <- v
  
  ggplot(data = df, aes(x = actual, y = pred)) + 
    geom_violin(fill = rgb(1, 1, 1, alpha = 0.6), color = NA) + 
    geom_jitter(aes(color = pred_type), alpha = 0.6) +
    geom_hline(yintercept = threshold, color = "red", alpha = 0.6) +
    scale_color_discrete(name = "type") +
    labs(title = sprintf("Threshold at %.2f", threshold))
}

plot_pred_type_distribution(
  df = tibble::tibble(pred = model_glm_1_pred, actual = test$default), 
  threshold = 0.5)

We can derive a few things from the plot:

  • We can clearly see the imbalance of the test dataset with many more observations being FALSE, i.e. non-defaulted
  • We can see that predictions tend to cluster more around the center (i.e. the threshold) rather than the top or bottom
  • We see that there are only seven distinct (unique) prediction probabilities, namely 0.512386, 0.2198321, 0.3787367, 0.6010975, 0.6739447, 0.8152174, 0.7301686, which corresponds to the number of unique levels of our predictor grade, namely B, C, A, E, F, D, G (a quick cross-check follows below)
  • We can see the trade-off of moving the threshold down or up leading to higher sensitivity (lower specificity) or higher specificity (lower sensitivity), respectively.
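
A quick cross-check of the third point, pairing each test observation's grade with its predicted probability (a sketch; recall that the simple model uses grade as its only predictor):

# each grade level maps to exactly one predicted default probability
tibble::tibble(grade = test$grade, pred = model_glm_1_pred) %>%
  dplyr::distinct() %>%
  dplyr::arrange(pred)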

12.6.1 ROC curve

Let’s have a look at the ROC curve which evaluates the performance over all possible thresholds rather than just one. Without going into too much detail, the ROC curve works as follows:

  • draw sensitivity (true positive rate) on y-axis
  • draw specificity (true negative rate) on x-axis (often also drawn as 1 - true negative rate)
  • go from threshold zero to threshold one in a number of pre-defined steps, e.g. with distance 0.01
  • for each threshold sensitivity is plotted against specificity
  • the result is usually a (smoothed) curve that lies above the diagonal splitting the area
  • the diagonal (also called base) represents the lower bound as it is the result of a random guess (assuming balanced class distribution) - so if the curve lies below it, the model is worse than random guessing and one could simply invert the prediction
  • in a perfect model the curve touches the upper left data point, i.e. having a sensitivity and specificity of one (perfect separation)
  • This setup nicely illustrates the trade-off between sensitivity and specificity as it often occurs in reality, i.e. increasing one, decreases the other
  • The curve also allows comparing different models against each other

Although the curve gives a good indication of model performance, it is usually preferable to have a single model statistic to use as a benchmark. This is the role of the AUC (AUROC), the area under the (ROC) curve. It is simply calculated via integration. Naturally, the base is 0.5 (the diagonal) and thus the lower bound. The upper bound is therefore 1.
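
To make the construction concrete, here is a small, purely illustrative sketch that traces the curve over a grid of thresholds and approximates the AUC with the trapezoidal rule (pROC, used below, does this more rigorously):

# sensitivity and specificity over a grid of thresholds
thresholds <- seq(0, 1, by = 0.01)
roc_points <- t(sapply(thresholds, function(t) {
  pred_pos <- model_glm_1_pred > t
  c(sens = sum(pred_pos & test$default) / sum(test$default),
    spec = sum(!pred_pos & !test$default) / sum(!test$default))
}))

# trapezoidal approximation of the area under the curve
x <- 1 - roc_points[, "spec"]
y <- roc_points[, "sens"]
ord <- order(x)
sum(diff(x[ord]) * (head(y[ord], -1) + tail(y[ord], -1)) / 2)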

More details can be found e.g. in:

We will be using the pROC library and its main function pROC::roc() to compute ROC.

roc_glm_1 <- pROC::roc(response = test$default, predictor = model_glm_1_pred)
roc_glm_1
## 
## Call:
## roc.default(response = test$default, predictor = model_glm_1_pred)
## 
## Data: model_glm_1_pred in 173039 controls (test$default FALSE) < 4436 cases (test$default TRUE).
## Area under the curve: 0.6641

We see an area of 0.6641045 which is better than random guessing but not too good. Let’s plot it to see its shape. Note that by default pROC::plot.roc() puts specificity on the x-axis rather than 1 - specificity as many authors and other packages do. To use 1 - specificity, one can call the function with parameter legacy.axes = TRUE. However, the actual trade-off between sensitivity and specificity seems easier to see when plotting specificity directly, as the mind does not easily interpret one minus some quantity. We also set asp = NA to create an x-axis range from zero to one; depending on your graphics output settings this gives the standard way of plotting an ROC curve, at the risk of a distorted shape if the graphics device is not (close to) square.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green", print.auc = FALSE, print.auc.y = .4)

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664"), 
       col = c("green"), lty = 1, cex = 1.0)

We see that the curve lies above the diagonal but does not come close to the upper left corner. A more complex model may perform better but would involve a larger number of predictors, so naturally the question of variable selection comes up.

12.7 Variable selection

Many articles and book chapters have been written on the topic of variable selection (feature selection), and some authors believe it is one of the most important topics in model building, if not the most important one. A few good resources giving an overview can be found below:

The different approaches can be categorized as follows:

  • subset selection
      • best subset selection
      • stepwise selection (forward/backward/hybrid)
  • shrinkage
      • ridge regression (l2-norm)
      • lasso (l1-norm)
  • dimension reduction
      • principal component analysis (PCA)

There are many trade-offs to be made when choosing one approach over another, but the two most important considerations are statistical soundness and computational efficiency. We will not go into further details of variable selection but rather focus on the model building. We thus simply determine a predictor space that seems sensible given the prior analysis.

Let’s remind ourselves of variables left for modeling.

full_vars <- colnames(train_down)
full_vars
##  [1] "term"                        "int_rate"                   
##  [3] "grade"                       "emp_length"                 
##  [5] "home_ownership"              "annual_inc"                 
##  [7] "verification_status"         "issue_d"                    
##  [9] "pymnt_plan"                  "purpose"                    
## [11] "addr_state"                  "dti"                        
## [13] "delinq_2yrs"                 "earliest_cr_line"           
## [15] "inq_last_6mths"              "open_acc"                   
## [17] "pub_rec"                     "revol_bal"                  
## [19] "revol_util"                  "initial_list_status"        
## [21] "out_prncp_inv"               "total_rec_late_fee"         
## [23] "collection_recovery_fee"     "last_pymnt_d"               
## [25] "last_pymnt_amnt"             "last_credit_pull_d"         
## [27] "collections_12_mths_ex_med"  "mths_since_last_major_derog"
## [29] "application_type"            "acc_now_delinq"             
## [31] "tot_coll_amt"                "tot_cur_bal"                
## [33] "default"

We determine a subspace and report which variables we ignore.

model_vars <-
  c("term", "grade", "emp_length", "home_ownership", "annual_inc",
    "purpose", "pymnt_plan", "out_prncp_inv", "delinq_2yrs", "default")

ignored_vars <- dplyr::setdiff(full_vars, model_vars)
ignored_vars
##  [1] "int_rate"                    "verification_status"        
##  [3] "issue_d"                     "addr_state"                 
##  [5] "dti"                         "earliest_cr_line"           
##  [7] "inq_last_6mths"              "open_acc"                   
##  [9] "pub_rec"                     "revol_bal"                  
## [11] "revol_util"                  "initial_list_status"        
## [13] "total_rec_late_fee"          "collection_recovery_fee"    
## [15] "last_pymnt_d"                "last_pymnt_amnt"            
## [17] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [19] "mths_since_last_major_derog" "application_type"           
## [21] "acc_now_delinq"              "tot_coll_amt"               
## [23] "tot_cur_bal"

12.8 A more complex logistic model with caret

The interested reader should check the comprehensive caret documentation on how the API works and, for modeling in particular, read model training and tuning. We only give a high-level overview here.

The caret package has several functions that attempt to streamline the model building and evaluation process. The train function can be used to

  • evaluate, using resampling, the effect of model tuning parameters on performance
  • choose the “optimal” model across these parameters
  • estimate model performance from a training set

First, a specific model must be chosen (user-defined models can also be created). Second, the corresponding model parameters need to be passed to the function call. The workhorse function is caret::train(); an important input is the parameter trControl, which allows specifying a particular resampling approach, e.g. k-fold cross-validation (once or repeated), leave-one-out cross-validation or the bootstrap (simple estimation or the 632 rule).

12.8.1 trainControl

We will define the controlling parameters in a separate function call to caret::trainControl() and then pass this on to caret::train(trControl = ). This also allows using the control parameters for different models if applicable to those. The function caret::trainControl() generates parameters that further control how models are created. For more details, see The caret package 5.5.4: The trainControl Function.

We perform a 10-fold cross-validation repeated five times and return the class probabilities. Since we are interested in ROC as our performance measure, we need to set summaryFunction = twoClassSummary. If one likes to see a live log of the execution, verboseIter = TRUE provides that. Since we have no tuning parameters for logistic regression, we leave those out for now.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               savePredictions = TRUE,
               verboseIter = FALSE)

12.8.2 train

Finally, we can call caret::train() with the specified control parameters. Note that caret is picky about factor values, which need to be valid R names. We thus convert the default variable to have values “yes” and “no” rather than “TRUE” and “FALSE” (which, converted to variable names, would not be valid as they are reserved keywords). Make sure that the test set is transformed in the same way before using the model for prediction; otherwise the factor levels may differ from the training set and the class probabilities will be wrong.

We will skip the complex discussion around variable selection for now and simply determine the “base” variable universe that seems plausible from intuition and the prior exploratory analysis. We have removed categorical variables with many unique values (such as zip codes) as they would be expanded into large dummy-variable matrices.
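
As an illustration of why such variables are excluded, consider how many columns a single high-cardinality factor such as addr_state would add to the design matrix (a quick sketch; the formula interface of caret::train() performs a similar expansion internally):

# one dummy column per level (minus the reference level) plus the intercept
dim(stats::model.matrix(~ as.factor(addr_state), data = train_down))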

model_glm_2 <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ ., 
        data = ., 
        method = "glm", 
        family = "binomial",
        metric = "ROC",
        trControl = ctrl)

As with the previous model object created via stats::glm(), the resulting model object has many methods and attributes for further processing. We first look at the results/performance on the training set (which should give a fair estimate of the test set performance since we used cross-validation). Note that all reported metrics are interpretable since this is done on the balanced data.

model_glm_2
## Generalized Linear Model 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31938, 31938, 31940, 31939, 31940, 31940, ... 
## Resampling results:
## 
##   ROC        Sens       Spec    
##   0.6964721  0.6291356  0.662019

We can see that we have improved the ROCAUC (~ 70%) but got a worse sensitivity compared to the first, simpler model. Looking at the predictors, we also see how the model created dummy variables to encode the categorical predictors.

attributes(model_glm_2)
## $names
##  [1] "method"       "modelInfo"    "modelType"    "results"     
##  [5] "pred"         "bestTune"     "call"         "dots"        
##  [9] "metric"       "control"      "finalModel"   "preProcess"  
## [13] "trainingData" "resample"     "resampledCM"  "perfNames"   
## [17] "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "contrasts"    "xlevels"     
## 
## $class
## [1] "train"         "train.formula"
summary(model_glm_2)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3163  -1.0774  -0.2782   1.0635   5.6347  
## 
## Coefficients:
##                                 Estimate     Std. Error z value
## (Intercept)                -1.7381657126   0.1420955332 -12.232
## termX60.months             -0.5453038253   0.0298754606 -18.253
## gradeB                      0.8159277950   0.0461745543  17.671
## gradeC                      1.3693629935   0.0459015145  29.833
## gradeD                      1.7227789216   0.0490499846  35.123
## gradeE                      2.0514999392   0.0562719050  36.457
## gradeF                      2.3642399559   0.0741417266  31.888
## gradeG                      2.8775473195   0.1315002573  21.882
## emp_lengthX..1.year         0.0719719375   0.0634309986   1.135
## emp_lengthX1.year          -0.0231801001   0.0663542839  -0.349
## emp_lengthX10..years       -0.1447186450   0.0541082748  -2.675
## emp_lengthX2.years         -0.0155518922   0.0628511985  -0.247
## emp_lengthX3.years          0.0019357572   0.0640494674   0.030
## emp_lengthX4.years         -0.0416200714   0.0681959839  -0.610
## emp_lengthX5.years         -0.1208526942   0.0671199278  -1.801
## emp_lengthX6.years          0.0685014427   0.0713774190   0.960
## emp_lengthX7.years          0.0169740882   0.0708191548   0.240
## emp_lengthX8.years         -0.0231357344   0.0711988793  -0.325
## emp_lengthX9.years         -0.0454158473   0.0751319778  -0.604
## home_ownershipNONE         -0.5795772695   1.3066383779  -0.444
## home_ownershipOTHER        12.3599134059  91.3562921237   0.135
## home_ownershipOWN           0.1059366445   0.0397823985   2.663
## home_ownershipRENT          0.1738380528   0.0253881630   6.847
## annual_inc                 -0.0000018620   0.0000002708  -6.875
## purposecredit_card          0.0940714394   0.1290711042   0.729
## purposedebt_consolidation   0.1957294997   0.1273405178   1.537
## purposeeducational          1.4894997798   0.4601526184   3.237
## purposehome_improvement     0.2605554222   0.1355011685   1.923
## purposehouse                0.0010870105   0.2087251602   0.005
## purposemajor_purchase       0.1111783795   0.1507309963   0.738
## purposemedical              0.1525955150   0.1664861916   0.917
## purposemoving               0.4431876570   0.1857656345   2.386
## purposeother                0.0979564862   0.1356860122   0.722
## purposerenewable_energy     0.0910160004   0.4132422786   0.220
## purposesmall_business       0.2405034075   0.1572708880   1.529
## purposevacation             0.1203054032   0.1926193435   0.625
## purposewedding             -0.4075063208   0.3202319627  -1.273
## pymnt_plany                11.9809020840 154.7613613910   0.077
## out_prncp_inv               0.0000520918   0.0000017015  30.614
## delinq_2yrs                 0.0865718494   0.0125320528   6.908
##                                       Pr(>|z|)    
## (Intercept)               < 0.0000000000000002 ***
## termX60.months            < 0.0000000000000002 ***
## gradeB                    < 0.0000000000000002 ***
## gradeC                    < 0.0000000000000002 ***
## gradeD                    < 0.0000000000000002 ***
## gradeE                    < 0.0000000000000002 ***
## gradeF                    < 0.0000000000000002 ***
## gradeG                    < 0.0000000000000002 ***
## emp_lengthX..1.year                    0.25652    
## emp_lengthX1.year                      0.72684    
## emp_lengthX10..years                   0.00748 ** 
## emp_lengthX2.years                     0.80457    
## emp_lengthX3.years                     0.97589    
## emp_lengthX4.years                     0.54166    
## emp_lengthX5.years                     0.07177 .  
## emp_lengthX6.years                     0.33720    
## emp_lengthX7.years                     0.81058    
## emp_lengthX8.years                     0.74522    
## emp_lengthX9.years                     0.54552    
## home_ownershipNONE                     0.65736    
## home_ownershipOTHER                    0.89238    
## home_ownershipOWN                      0.00775 ** 
## home_ownershipRENT            0.00000000000753 ***
## annual_inc                    0.00000000000621 ***
## purposecredit_card                     0.46610    
## purposedebt_consolidation              0.12428    
## purposeeducational                     0.00121 ** 
## purposehome_improvement                0.05449 .  
## purposehouse                           0.99584    
## purposemajor_purchase                  0.46076    
## purposemedical                         0.35937    
## purposemoving                          0.01705 *  
## purposeother                           0.47033    
## purposerenewable_energy                0.82568    
## purposesmall_business                  0.12621    
## purposevacation                        0.53225    
## purposewedding                         0.20318    
## pymnt_plany                            0.93829    
## out_prncp_inv             < 0.0000000000000002 ***
## delinq_2yrs                   0.00000000000491 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49197  on 35487  degrees of freedom
## Residual deviance: 44806  on 35448  degrees of freedom
## AIC: 44886
## 
## Number of Fisher Scoring iterations: 11
predictors(model_glm_2)
##  [1] "termX60.months"            "gradeB"                   
##  [3] "gradeC"                    "gradeD"                   
##  [5] "gradeE"                    "gradeF"                   
##  [7] "gradeG"                    "emp_lengthX..1.year"      
##  [9] "emp_lengthX1.year"         "emp_lengthX10..years"     
## [11] "emp_lengthX2.years"        "emp_lengthX3.years"       
## [13] "emp_lengthX4.years"        "emp_lengthX5.years"       
## [15] "emp_lengthX6.years"        "emp_lengthX7.years"       
## [17] "emp_lengthX8.years"        "emp_lengthX9.years"       
## [19] "home_ownershipNONE"        "home_ownershipOTHER"      
## [21] "home_ownershipOWN"         "home_ownershipRENT"       
## [23] "annual_inc"                "purposecredit_card"       
## [25] "purposedebt_consolidation" "purposeeducational"       
## [27] "purposehome_improvement"   "purposehouse"             
## [29] "purposemajor_purchase"     "purposemedical"           
## [31] "purposemoving"             "purposeother"             
## [33] "purposerenewable_energy"   "purposesmall_business"    
## [35] "purposevacation"           "purposewedding"           
## [37] "pymnt_plany"               "out_prncp_inv"            
## [39] "delinq_2yrs"

We can also look at the variable importance with caret::varImp(). For linear models, the absolute value of the t-statistic for each model parameter is used.

varImp(model_glm_2)
## glm variable importance
## 
##   only 20 most important variables shown (out of 39)
## 
##                           Overall
## gradeE                    100.000
## gradeD                     96.340
## gradeF                     87.466
## out_prncp_inv              83.972
## gradeC                     81.827
## gradeG                     60.017
## termX60.months             50.059
## gradeB                     48.462
## delinq_2yrs                18.937
## annual_inc                 18.846
## home_ownershipRENT         18.770
## purposeeducational          8.866
## emp_lengthX10..years        7.323
## home_ownershipOWN           7.291
## purposemoving               6.531
## purposehome_improvement     5.261
## emp_lengthX5.years          4.925
## purposedebt_consolidation   4.202
## purposesmall_business       4.181
## purposewedding              3.477
plot(varImp(model_glm_2))
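
To see where these scores come from, the absolute z-statistics from the coefficient table above can be rescaled to the 0-100 range that caret::varImp() reports; a quick sketch (for a binomial glm the “t-statistic” is the z value column, and we assume the default scaling to 0-100):

# rescale |z| of the coefficients (excluding the intercept) to 0-100
z <- abs(coef(summary(model_glm_2$finalModel))[, "z value"])[-1]
head(sort(round(100 * (z - min(z)) / (max(z) - min(z)), 3), decreasing = TRUE))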

Finally, we use the model to predict the test set classes. Again, we need to be aware of any pre-processing required such as mutations or treatment of missing values.

model_glm_2_pred <- 
  predict(model_glm_2, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

head(model_glm_2_pred, 3)
##          no       yes
## 1 0.7041954 0.2958046
## 2 0.8783103 0.1216897
## 3 0.6509819 0.3490181

Note that caret::predict.train() returns the probabilities for all outcomes, i.e. in our binary case for yes and no. We repeat the analysis from earlier and see how the new model performs.

caret::confusionMatrix(
  data = ifelse(model_glm_2_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  109156   1482
##        yes  63876   2953
##                                              
##                Accuracy : 0.6317             
##                  95% CI : (0.6295, 0.634)    
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0378             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.66584            
##             Specificity : 0.63084            
##          Pos Pred Value : 0.04419            
##          Neg Pred Value : 0.98660            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01664            
##    Detection Prevalence : 0.37657            
##       Balanced Accuracy : 0.64834            
##                                              
##        'Positive' Class : yes                
## 

Again we build an ROC object.

temp <- as.factor(ifelse(test[complete.cases(test[, model_vars]),
                                             "default"] == TRUE, "yes", "no"))

roc_glm_2 <- 
  pROC::roc(response = temp, 
            predictor = model_glm_2_pred[, "yes"])

roc_glm_2
## 
## Call:
## roc.default(response = temp, predictor = model_glm_2_pred[, "yes"])
## 
## Data: model_glm_2_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.6995

The ROCAUC of ~ 70% is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROC from the simple model.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA, 
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7"), 
       col = c("green", "blue"), lty = 1, cex = 1.0)

We can see that the ROC curve of the more complex model lies above that of the simple model.

12.9 A note on computational efficiency (parallelization via cluster)

Some models (and their tuning parameter calibration via resampling) will require significant computational resources both in terms of memory and CPUs. There are a few ways to increase computational efficiency which can be roughly divided into

  • decrease the training set size (without losing information)
  • compute on several cores (parallelization via cluster)
  • shift matrix computations to GPU (e.g. via CUDA)
  • use compute instances from cloud offerings such as Amazon Web Services, googleCloud or Microsoft Azure

We will only look at parallelization here as it is supported by R and caret out of the box. For details, see The caret package 9: Parallel Processing, caret ml parallel and HOME of the R parallel WIKI! by Tobias Kind. There are a couple of libraries supporting parallel processing in R, but one should be aware of their support for different operating systems. On Windows, one should use doParallel, which is a parallel backend for the foreach package. Note that the multicore functionality only runs tasks on a single computer, not a cluster of computers. However, you can use the snow functionality to execute on a cluster, using Unix-like operating systems, Windows, or even a combination. It is pointless to use doParallel and parallel on a machine with only one processor with a single core. To get a speed improvement, it must run on a machine with multiple processors, multiple cores, or both. Remember: unless doParallel::registerDoParallel() is called, foreach will not run in parallel; simply loading the doParallel package is not enough. Note that we can set the parameter caret::trainControl(allowParallel = TRUE), but it is actually the default setting, i.e. caret::train() will automatically run in parallel if it detects a registered cluster, irrespective of which library was used to build it.
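
To see the registration mechanics in isolation (independent of caret), here is a minimal sketch with a deliberately tiny cluster; not run here:

library(doParallel)   # also attaches foreach, iterators and parallel

cl <- makeCluster(2)  # small cluster just for illustration
registerDoParallel(cl)
getDoParWorkers()     # reports 2 once the backend is registered

# %dopar% only runs in parallel after registerDoParallel();
# without a registered backend it falls back to sequential execution
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

stopCluster(cl)       # free the workers
registerDoSEQ()       # return foreach to sequential mode
res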

To illustrate the efficiency gains, we will benchmark an example model training using the previous glm model. We use the microbenchmark library (and its function of the same name) and execute the operation several times (to get a stable time estimate) via the parameter times. Note that parallelizing usually means creating copies of the original data for each worker (core), which requires a corresponding amount of memory. Since we only want to illustrate the logic here, we downsize the training data to avoid failures due to insufficient memory.

benchmark_train_data <- 
  train_down[seq(1, nrow(train_down), 2), model_vars] %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.))
format(utils::object.size(benchmark_train_data), units = "Mb")
## [1] "1.3 Mb"

We see the data is ~ 1 Mb in size.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE) #default

library(microbenchmark)

glm_nopar <-
  microbenchmark(glm_nopar =
      train(default ~ .,
            data = benchmark_train_data,
            method = "glm",
            family = "binomial",
            metric = "ROC",
            trControl = ctrl),
    times = 5
    )
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

We can look at the resulting run time.

glm_nopar
## Unit: seconds
##       expr      min       lq     mean   median       uq     max neval
##  glm_nopar 27.51427 27.90744 28.27665 28.38591 28.71571 28.8599     5

Now we will set up a cluster of cores and repeat the benchmarking. We use parallel::detectCores() to establish how many cores are available and use somewhat fewer than 100% of them to leave some resources for the system. Usually, we would write a log of the distributed calculation via parallel::makeCluster(outfile = ); the log file is opened in append mode as all workers write to it. However, we ran into issues with parallel::makeCluster(outfile = ), so we simply register the cluster directly via registerDoParallel(), which has no option to define a log file.

# doParallel would also load parallel
library(parallel)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.4.2
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
cores_2_use <- floor(0.8 * detectCores())
# cl <- makeCluster(cores_2_use, outfile = "./logs/parallel_log.txt")
# registerDoParallel(cl, cores_2_use)
registerDoParallel(cores = cores_2_use)
getDoParWorkers()
## [1] 6

We can inspect the open connections via base::showConnections().

knitr::kable(base::showConnections())
   description       class            mode   text     isopen   can read   can write
4  output            textConnection   wr     text     opened   no         yes
5  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
6  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
7  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
8  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
9  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
10 <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes

We then tried running in parallel, but on Windows the setup may produce the following error.

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : unable to find variable "optimismBoot"

This seems to be a bug as noted in the SO question Caret train function - unable to find variable “optimismBoot”. The bug seems to be solved in the caret development version, which can be installed from GitHub via devtools::install_github('topepo/caret/pkg/caret') (note that building caret from source requires RBuildTools, which are part of RTools and need to be installed separately, e.g. from Building R for Windows - RStudio may manage the install for you). The issue is also tracked on GitHub under optimismBoot error #706 and may already be resolved in the caret version you are using.

We do not run the installation of the development version here, but you should in case you do not have it and wish to test the parallel execution. Before you do, make sure to close the cluster with the code below (not executed now as we continue).

parallel::stopCluster(cl)
foreach::registerDoSEQ()
glm_par <-
  microbenchmark(glm_par =
    train(default ~ .,
            data = benchmark_train_data,
            method = "glm",
            family = "binomial",
            metric = "ROC",
            trControl = ctrl),
    times = 5
    )

We make sure to stop the cluster and force R back to sequential execution by using foreach::registerDoSEQ(). In case you used the setup via parallel::makeCluster(), you need to call parallel::stopCluster() first. Otherwise, memory and CPU may remain occupied by leftover clusters.

#parallel::stopCluster(cl)
foreach::registerDoSEQ()

Note that we ran into issues when using too many system resources on a Windows 7 workstation, as documented in the SO question caret train binary glm fails on parallel cluster via doParallel. The error below was produced.

Error in serialize(data, node$con) : error writing to connection

However, on a Windows 10 machine with fewer cores and the development version of caret, the issue could not be reproduced.

We have also tried running caret::train() in parallel on Linux (Ubuntu) via the doMC library and found that the setup below runs just fine and also faster than a sequential execution. We include the code here for information only, without executing it.

library(doMC)

cores_2_use <- floor(0.8 * parallel::detectCores())
registerDoMC(cores_2_use)

microbenchmark(
  glm_par =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

Finally, let’s compare parallel versus non-parallel execution time. We can see that the parallel setup performed significantly better, although we would have expected an even larger gain given that we use a multiple of the cores of the sequential setup.

glm_nopar
## Unit: seconds
##       expr      min       lq     mean   median       uq     max neval
##  glm_nopar 27.51427 27.90744 28.27665 28.38591 28.71571 28.8599     5
glm_par
## Unit: seconds
##     expr      min       lq     mean  median       uq      max neval
##  glm_par 13.65229 13.79392 17.34424 13.8397 13.89238 31.54293     5

12.10 A note on tuning parameters in caret

When it comes to more complex models, tuning parameters can often be used to improve the model. There are different ways to provide the range of parameter values to try. In particular, one may use

  • a pre-defined set of values (e.g. if domain knowledge already provides an idea)
  • use grid search (default in caret::train())
  • use random search

In general, the more tuning parameters are available, the higher the computational cost of selecting them. We will show some applications of these options below. For further information, see The caret package 5: Model Training and Tuning.
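
As a sketch of how the three options translate into code (reusing the train_down, model_vars and ctrl objects defined above; shown for the rpart model of the next section and not run here):

# prepare the modeling data as in the earlier caret examples
dat <- train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.))

# (1) pre-defined values via tuneGrid
m_grid <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
                tuneGrid = expand.grid(cp = c(0.0005, 0.005, 0.05)),
                trControl = ctrl)

# (2) default grid search; caret picks tuneLength candidate values itself
m_len <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
               tuneLength = 5, trControl = ctrl)

# (3) random search: request it in trainControl, tuneLength sets the budget
ctrl_rnd <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                         summaryFunction = twoClassSummary, search = "random")
m_rnd <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
               tuneLength = 5, trControl = ctrl_rnd)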

12.11 Trees

Trees stratify or segment the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Trees are easy to interpret and build, but as pointed out in “An Introduction to Statistical Learning” they may not compete with the best supervised learning approaches:

Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

Following this guidance, we will also look into some alternatives to / enhancements of the basic tree model used in CART. Some further resources on trees can be found in:

As usual, there are many ways to implement tree models in R but we will stay within the caret package and use the API that was already explained earlier.

12.11.1 Simple tree using CART via rpart in caret

We start by building a simple tree using the CART method via rpart in caret. Note that caret may not offer the full set of features that are available when using the rpart library directly but it should suffice for our illustration. For more details, see rpart on CRAN and in particular An Introduction to Recursive Partitioning Using the RPART Routines for some background.

We define our control and train functions as before and look at the results.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

# library(rpart)

model_rpart <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rpart',
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: home_ownershipNONE
model_rpart
## CART 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31939, 31938, 31940, 31939, 31940, 31939, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec     
##   0.01445641  0.7018554  0.5572830  0.7674466
##   0.03111086  0.6461547  0.3900115  0.8799842
##   0.26027166  0.5707699  0.6019078  0.5396320
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01445641.

We see that the ROCAUC is ~ 70% for the best model iteration, but sensitivity is worse than in some previous models. caret stepped through several values of the complexity parameter (cp) and printed the results for them. We can also inspect the different cp values visually. In fact, under caret this is the only tuning parameter available for method = "rpart".

ggplot(model_rpart)

We can also visualize the final tree by accessing the finalModel element from the caret::train() model object. We have to add the text manually to the plot via graphics::text(). We use a larger margin to have enough space for the text.

plot(model_rpart$finalModel, uniform = TRUE, margin = 0.2)
graphics::text(model_rpart$finalModel)

Apparently, there are only two splits, for out_prncp_inv and gradeB.
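
For a non-graphical view of the same fitted tree, rpart's own printed summaries can be applied directly to the finalModel (a quick sketch):

# node-by-node split listing and the complexity (cp) table
print(model_rpart$finalModel)
rpart::printcp(model_rpart$finalModel)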

Rather than using the default grid search for choosing the tuning parameter cp, we could use random search (as mentioned earlier). For a theoretical background on why this might be useful, see e.g. Random Search for Hyper-Parameter Optimization. We implement random search via the parameter caret::trainControl(search = "random").

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE,
               search = "random")

model_rpart <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rpart',
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)

model_rpart
## CART 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31939, 31940, 31940, 31938, 31940, 31940, ... 
## Resampling results across tuning parameters:
## 
##   cp            ROC        Sens       Spec     
##   0.0002254410  0.7418953  0.5806143  0.7904981
##   0.0002536211  0.7433551  0.5725220  0.8003501
##   0.0005072423  0.7462419  0.5532501  0.8225663
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.0005072423.

We can see that ROCAUC went up to ~ 74% with the random-search cp parameterization, which ended up using a much smaller value than the grid search, while sensitivity for the selected model stayed roughly the same. We again visualize the tree to see how it changed.

plot(model_rpart$finalModel, uniform = TRUE, margin = 0.1)
graphics::text(model_rpart$finalModel, cex = 0.5)

Clearly the model has become more complex, which is expected given the much smaller cp value, and its ROCAUC went up. As before, we will predict the test outcome with the final model.

model_rpart_pred <- 
  predict(model_rpart, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_rpart_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  91226   682
##        yes 81806  3753
##                                              
##                Accuracy : 0.5352             
##                  95% CI : (0.5329, 0.5375)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0377             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.84622            
##             Specificity : 0.52722            
##          Pos Pred Value : 0.04386            
##          Neg Pred Value : 0.99258            
##              Prevalence : 0.02499            
##          Detection Rate : 0.02115            
##    Detection Prevalence : 0.48211            
##       Balanced Accuracy : 0.68672            
##                                              
##        'Positive' Class : yes                
## 

Compute ROC.

roc_rpart <- 
  pROC::roc(response = temp, 
            predictor = model_rpart_pred[, "yes"])

roc_rpart
## 
## Call:
## roc.default(response = temp, predictor = model_rpart_pred[, "yes"])
## 
## Data: model_rpart_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7463

The ROCAUC is ~ 74%, which is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROCs from the earlier models.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

pROC::plot.roc(x = roc_rpart, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "orange")


legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7",
                                   "rpart AUC = 0.74"), 
       col = c("green", "blue", "orange"), lty = 1, cex = 1.0)

12.11.2 Random forest via randomForest in caret

For random forests, the library randomForest is used. The tuning parameter is mtry, the number of variables randomly sampled as candidates at each split. For computational reasons, we limit the number of trees via caret::train(ntree = ), the number of folds via caret::trainControl(number = ) and the resampling repeats via caret::trainControl(repeats = ).

# library(randomForest)

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 5,
               repeats = 1,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

model_rf <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rf',
        ntree = 10,
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)

model_rf
## Random Forest 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 28391, 28390, 28390, 28391, 28390 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.6877044  0.5636517  0.7134646
##   20    0.7041407  0.6133559  0.6806064
##   39    0.6982071  0.6123415  0.6655027
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 20.

We can see that the ROCAUC is ~ 70% at mtry = 20. We again visualize the final model, but since it is a combination of many trees, the plot looks different: it shows the error as a function of the number of trees.

plot(model_rf$finalModel)

The error would most likely go further down if we had allowed more trees.
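
A useful by-product of the forest is its variable importance (mean decrease in Gini impurity), available from the underlying randomForest object; a quick sketch:

# top variables by mean decrease in Gini impurity
head(sort(randomForest::importance(model_rf$finalModel)[, "MeanDecreaseGini"],
          decreasing = TRUE), 10)

# or on caret's 0-100 scale
varImp(model_rf)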

model_rf_pred <- 
  predict(model_rf, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_rf_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  114693   1723
##        yes  58339   2712
##                                              
##                Accuracy : 0.6616             
##                  95% CI : (0.6594, 0.6638)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.038              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.61150            
##             Specificity : 0.66284            
##          Pos Pred Value : 0.04442            
##          Neg Pred Value : 0.98520            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01528            
##    Detection Prevalence : 0.34401            
##       Balanced Accuracy : 0.63717            
##                                              
##        'Positive' Class : yes                
## 

Compute ROC.

roc_rf <- 
  pROC::roc(response = temp, 
            predictor = model_rf_pred[, "yes"])

roc_rf
## 
## Call:
## roc.default(response = temp, predictor = model_rf_pred[, "yes"])
## 
## Data: model_rf_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7021

As the ROCAUC is not better than before, we will not plot it against earlier models.

12.11.3 Stochastic Gradient Boosting via gbm in caret

For Stochastic Gradient Boosting, the library gbm is used. Tuning parameters are

  • n.trees: total number of trees to fit
  • interaction.depth: maximum depth of variable interactions
  • shrinkage: shrinkage parameter applied to each tree in the expansion
  • n.minobsinnode: minimum number of observations in the trees’ terminal nodes
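
Note that caret's default grid below only varies n.trees and interaction.depth, holding shrinkage and n.minobsinnode constant. To search over all four parameters, an explicit grid can be supplied via tuneGrid; a sketch (not run here):

# a custom grid covering all four gbm tuning parameters
gbm_grid <- expand.grid(n.trees           = c(50, 100, 150),
                        interaction.depth = 1:3,
                        shrinkage         = c(0.05, 0.1),
                        n.minobsinnode    = c(10, 20))

# passed to the same call as below, e.g.
# train(default ~ ., data = ., method = "gbm", metric = "ROC",
#       trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)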

For some background, see e.g.

# library(gbm)

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 5,
               repeats = 1,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

model_gbm_1 <- 
   train_down %>%
   select(model_vars) %>%
   mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
   filter(complete.cases(.)) %>%
   train(default ~ ., 
         data = ., 
         method = "gbm",
         metric = "ROC",
         trControl = ctrl,
         preProc = c("center", "scale"),
         ## This last option is actually one
         ## for gbm() that passes through
         verbose = FALSE)

model_gbm_1
## Stochastic Gradient Boosting 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 28390, 28391, 28390, 28390, 28391 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.7218075  0.4989575  0.7923127
##   1                  100      0.7347236  0.5438152  0.7776594
##   1                  150      0.7376056  0.5922232  0.7565241
##   2                   50      0.7351958  0.5467456  0.7754611
##   2                  100      0.7416798  0.6105945  0.7484081
##   2                  150      0.7480446  0.6190476  0.7499302
##   3                   50      0.7387262  0.6076078  0.7507755
##   3                  100      0.7488759  0.6164553  0.7538746
##   3                  150      0.7515752  0.6082840  0.7630048
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Plotting the boosting iterations.

ggplot(model_gbm_1)
## Warning: Ignoring unknown aesthetics: shape

Predict.

model_gbm_1_pred <- 
  predict(model_gbm_1, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_gbm_1_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  104238   1033
##        yes  68794   3402
##                                              
##                Accuracy : 0.6065             
##                  95% CI : (0.6043, 0.6088)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0438             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.76708            
##             Specificity : 0.60242            
##          Pos Pred Value : 0.04712            
##          Neg Pred Value : 0.99019            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01917            
##    Detection Prevalence : 0.40681            
##       Balanced Accuracy : 0.68475            
##                                              
##        'Positive' Class : yes                
## 
roc_gbm_1 <- 
  pROC::roc(response = temp, 
            predictor = model_gbm_1_pred[, "yes"])

roc_gbm_1
## 
## Call:
## roc.default(response = temp, predictor = model_gbm_1_pred[, "yes"])
## 
## Data: model_gbm_1_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7516

The ROCAUC is ~ 75%, which is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROCs from all earlier models.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

pROC::plot.roc(x = roc_rpart, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "orange")

pROC::plot.roc(x = roc_gbm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "brown")

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7",
                                   "rpart AUC = 0.74", "gbm AUC = 0.753"), 
       col = c("green", "blue", "orange", "brown"), lty = 1, cex = 1.0)