Lending Club Loan Data in R

1 Overview

A case study of machine learning / modeling in R with credit default data. Data is taken from the Kaggle Lending Club Loan Data set but is also available publicly at the Lending Club Statistics Page. We illustrate the complete workflow from data ingestion through data wrangling/transformation to exploratory data analysis and finally modeling approaches. Along the way there will be helpful discussions of side topics such as training strategy, computational efficiency, R intricacies, and more. Note that the focus is more on methods in R rather than statistical rigour. This is meant to be a reference for an end-to-end data science workflow rather than a serious attempt to achieve the best model performance.

1.1 Kaggle Data Description

The files contain complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address information including zip code and state, and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

2 Environment Setup

2.1 Libraries

The following will automatically install any missing libraries. Make sure proper proxy settings are in place. You do not necessarily need all of them, only those required for the respective sections (which will be indicated).

# define used libraries
libraries_used <- 
  c("lazyeval", "readr","plyr" ,"dplyr", "readxl", "ggplot2", 
    "funModeling", "scales", "tidyverse", "corrplot", "GGally", "caret",
    "rpart", "randomForest", "pROC", "gbm", "choroplethr", "choroplethrMaps",
    "microbenchmark", "doParallel", "e1071")

# check missing libraries
libraries_missing <- 
  libraries_used[!(libraries_used %in% installed.packages()[,"Package"])]
# install missing libraries
if(length(libraries_missing)) install.packages(libraries_missing)

2.2 Session Info

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.0 bookdown_0.5    magrittr_1.5   
##  [5] rprojroot_1.2   tools_3.4.1     htmltools_0.3.6 yaml_2.1.14    
##  [9] Rcpp_0.12.12    stringi_1.1.5   rmarkdown_1.6   blogdown_0.1   
## [13] knitr_1.17      stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

3 A note on libraries

There are generally two ways to access functions from R packages (we use package and library interchangeably): either via direct access without loading the library, i.e. package::function(), or by loading (attaching) the library into the workspace (environment) and thus making all its functions available at once. The problem with the first option is that direct access can sometimes lead to issues (unexpected behavior/errors) and makes coding cumbersome. The problem with the second approach is a conflict of function (and other object) names between two libraries (called masking). A third complication is that some functions require libraries to be loaded (e.g. some instances of caret::train()) and thus attach the library without being explicitly told. Again this may lead to masking of functions, and if the user is unaware this can lead to nasty problems (perceived bugs) down the road (sometimes libraries will issue a warning upon load, e.g. plyr when dplyr is already loaded). There is no golden rule (at least we are not aware of one) and so we tend to do the following

  • load major libraries that are used frequently into workspace but pay attention to load succession to avoid unwanted masking
  • access rarely used functions directly (being aware that they work and don’t attach anything themselves)
  • sometimes use direct access to a function although its library is loaded (either to make clear which library is currently used or because we explicitly need a function that has been masked by another loaded library)

We will load a few essential libraries here as they are used heavily. Other libraries may only be loaded in their respective sections. You can also try to follow the analysis without loading many libraries at the beginning and only load them when needed. Just be aware of the succession to avoid issues.
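As a small illustration of masking (not part of the workflow below, and deliberately reversing the recommended load order in a fresh session), one can inspect which package a name resolves to and fall back to explicit namespace access:

# illustration only: loading plyr after dplyr masks several dplyr verbs
library(dplyr)
library(plyr)

environment(summarise)      # shows that summarise now resolves to plyr's namespace
conflicts(detail = TRUE)    # lists masked objects per attached package

# explicit namespace access sidesteps the ambiguity
mtcars %>% dplyr::summarise(mean_mpg = mean(mpg))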

3.1 Library versions

Also note that the version of the library used often matters, and some libraries are developed more actively than others, which may lead to issues. For example, note that the library caret is using plyr while the tidyverse, of which dplyr (the successor of plyr) is part, introduces some new concepts, e.g. the default data frame is a tibble::tibble(). It seems that caret has issues with tibbles, see e.g. “Wrong model type for classification” in regression problems in R-Caret.
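If caret does stumble over a tibble, a common workaround is simply to hand it a plain data.frame via base::as.data.frame(); the sketch below is illustrative only and the formula is a placeholder, not a model we fit in this post.

# not run; placeholder formula for illustration only
# fit <- caret::train(factor(default) ~ grade + loan_amnt,
#                     data = as.data.frame(train),
#                     method = "rpart")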

library(plyr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## arrange():   dplyr, plyr
## compact():   purrr, plyr
## count():     dplyr, plyr
## failwith():  dplyr, plyr
## filter():    dplyr, stats
## id():        dplyr, plyr
## lag():       dplyr, stats
## mutate():    dplyr, plyr
## rename():    dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

4 Data Import

Refer to section Overview to see how the data may be acquired.

The loans data can be conveniently read via readr::read_csv(). Note that the function tries to determine the variable type by reading the first 1,000 rows, which does not always guarantee a correct import, especially in cases of missing values at the beginning of the file. Other packages treat imports differently, for example data.table::fread() takes a sample of 1,000 rows to determine column types (100 rows from 10 different points), which seems to yield more robust results than readr::read_csv(). Both package functions allow the explicit definition of column classes, for readr::read_csv() it is the parameter col_types and for data.table::fread() it is the parameter colClasses. Alternatively, readr::read_csv() offers the parameter guess_max that allows increasing the number of rows being guessed, similar to the SAS import procedure parameter guessingrows. Naturally, import time increases if more rows are guessed. For details on readr including comparisons against Base R and data.table::fread() see readr.

path <- "D:/data/lending_club"
loans <- readr::read_csv(paste0(path, "/loan.csv"))
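If the type guessing turns out to be unreliable, the two remedies mentioned above can be sketched as follows (the column name and type below are just examples, not a recommendation):

# not run: guess from more rows, or set selected column types explicitly
# loans <- readr::read_csv(paste0(path, "/loan.csv"), guess_max = 100000)
# loans <- readr::read_csv(paste0(path, "/loan.csv"),
#                          col_types = readr::cols(annual_inc_joint = readr::col_double()))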

The meta data comes in an Excel file and needs to be parsed via a special library, in this case we use readxl.

library(readxl)
# Load the Excel workbook
excel_file <- "/LendingClubDataDictionary.xlsx"
# see available tabs
excel_sheets(paste0(path, excel_file))
## [1] "LoanStats"   "browseNotes" "RejectStats"
# Read in the individual worksheets
meta_loan_stats <- read_excel(paste0(path, excel_file), sheet = "LoanStats")
meta_browse_notes <- read_excel(paste0(path, excel_file), sheet = "browseNotes")
meta_reject_stats <- read_excel(paste0(path, excel_file), sheet = "RejectStats")

5 Meta Data

Let’s have a look at the meta data in more detail. First, which variables are present in loan data set and what is their type. Then check meta data information and finally compare the two to see if anything is missing.

The usual approach may be to use the base::str() function to get a summary of the data structure. However, it may be useful to quantify the “information power” of the different metrics and dimensions by looking at the ratio of zeros and missing values to overall observations. This will not always reveal the truth (as there may be variables that are only populated if certain conditions apply) but it still gives some indication. The funModeling package offers the function funModeling::df_status() for that. It does not scale very well and has quite a few dependencies (so a direct call is preferred over a full library load) but it suits the purpose for this data. Unfortunately, it does not return the number of rows and columns. The data has 887379 observations (rows) and 74 variables (columns).
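The dimensions quoted above can be confirmed directly with base functions:

dim(loans)
## [1] 887379     74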

We will require the meta data at a later stage so we assign it to a variable. The function funModeling::df_status() has the parameter print_results = TRUE set by default, which means the data would be assigned and printed at the same time; we set it to FALSE and print the table via knitr::kable() instead.

meta_loans <- funModeling::df_status(loans, print_results = FALSE)
knitr::kable(meta_loans)
variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
id 0 0.00 0 0.00 0 0 integer 887379
member_id 0 0.00 0 0.00 0 0 integer 887379
loan_amnt 0 0.00 0 0.00 0 0 numeric 1372
funded_amnt 0 0.00 0 0.00 0 0 numeric 1372
funded_amnt_inv 233 0.03 0 0.00 0 0 numeric 9856
term 0 0.00 0 0.00 0 0 character 2
int_rate 0 0.00 0 0.00 0 0 numeric 542
installment 0 0.00 0 0.00 0 0 numeric 68711
grade 0 0.00 0 0.00 0 0 character 7
sub_grade 0 0.00 0 0.00 0 0 character 35
emp_title 0 0.00 51457 5.80 0 0 character 289148
emp_length 0 0.00 0 0.00 0 0 character 12
home_ownership 0 0.00 0 0.00 0 0 character 6
annual_inc 2 0.00 4 0.00 0 0 numeric 49384
verification_status 0 0.00 0 0.00 0 0 character 3
issue_d 0 0.00 0 0.00 0 0 character 103
loan_status 0 0.00 0 0.00 0 0 character 10
pymnt_plan 0 0.00 0 0.00 0 0 character 2
url 0 0.00 0 0.00 0 0 character 887379
desc 0 0.00 761597 85.83 0 0 character 124453
purpose 0 0.00 0 0.00 0 0 character 14
title 0 0.00 151 0.02 0 0 character 61452
zip_code 0 0.00 0 0.00 0 0 character 935
addr_state 0 0.00 0 0.00 0 0 character 51
dti 451 0.05 0 0.00 0 0 numeric 4086
delinq_2yrs 716961 80.80 29 0.00 0 0 numeric 29
earliest_cr_line 0 0.00 29 0.00 0 0 character 697
inq_last_6mths 497905 56.11 29 0.00 0 0 numeric 28
mths_since_last_delinq 1723 0.19 454312 51.20 0 0 numeric 155
mths_since_last_record 1283 0.14 750326 84.56 0 0 numeric 123
open_acc 7 0.00 29 0.00 0 0 numeric 77
pub_rec 751572 84.70 29 0.00 0 0 numeric 32
revol_bal 3402 0.38 0 0.00 0 0 numeric 73740
revol_util 3540 0.40 502 0.06 0 0 numeric 1356
total_acc 0 0.00 29 0.00 0 0 numeric 135
initial_list_status 0 0.00 0 0.00 0 0 character 2
out_prncp 255798 28.83 0 0.00 0 0 numeric 248332
out_prncp_inv 255798 28.83 0 0.00 0 0 numeric 266244
total_pymnt 17759 2.00 0 0.00 0 0 numeric 506726
total_pymnt_inv 18037 2.03 0 0.00 0 0 numeric 506616
total_rec_prncp 18145 2.04 0 0.00 0 0 numeric 260227
total_rec_int 18214 2.05 0 0.00 0 0 numeric 324635
total_rec_late_fee 874862 98.59 0 0.00 0 0 numeric 6181
recoveries 862702 97.22 0 0.00 0 0 numeric 23055
collection_recovery_fee 863872 97.35 0 0.00 0 0 numeric 20708
last_pymnt_d 0 0.00 17659 1.99 0 0 character 98
last_pymnt_amnt 17673 1.99 0 0.00 0 0 numeric 232451
next_pymnt_d 0 0.00 252971 28.51 0 0 character 100
last_credit_pull_d 0 0.00 53 0.01 0 0 character 103
collections_12_mths_ex_med 875553 98.67 145 0.02 0 0 numeric 12
mths_since_last_major_derog 0 0.00 665676 75.02 0 0 character 168
policy_code 0 0.00 0 0.00 0 0 numeric 1
application_type 0 0.00 0 0.00 0 0 character 2
annual_inc_joint 0 0.00 886868 99.94 0 0 character 308
dti_joint 0 0.00 886870 99.94 0 0 character 449
verification_status_joint 0 0.00 886868 99.94 0 0 character 3
acc_now_delinq 883236 99.53 29 0.00 0 0 numeric 8
tot_coll_amt 0 0.00 70276 7.92 0 0 character 10325
tot_cur_bal 0 0.00 70276 7.92 0 0 character 327342
open_acc_6m 0 0.00 866007 97.59 0 0 character 13
open_il_6m 0 0.00 866007 97.59 0 0 character 35
open_il_12m 0 0.00 866007 97.59 0 0 character 12
open_il_24m 0 0.00 866007 97.59 0 0 character 17
mths_since_rcnt_il 0 0.00 866569 97.65 0 0 character 201
total_bal_il 0 0.00 866007 97.59 0 0 character 17030
il_util 0 0.00 868762 97.90 0 0 character 1272
open_rv_12m 0 0.00 866007 97.59 0 0 character 18
open_rv_24m 0 0.00 866007 97.59 0 0 character 28
max_bal_bc 0 0.00 866007 97.59 0 0 character 10707
all_util 0 0.00 866007 97.59 0 0 character 1128
total_rev_hi_lim 0 0.00 70276 7.92 0 0 character 21251
inq_fi 0 0.00 866007 97.59 0 0 character 18
total_cu_tl 0 0.00 866007 97.59 0 0 character 33
inq_last_12m 0 0.00 866007 97.59 0 0 character 29

It seems the data has two unique identifiers, id and member_id. There is no variable with 100% missing or zero values, i.e. an information power of zero. There are a few which have a high ratio of NAs so the meta description needs to be checked to see whether this is expected. There are also variables which have only one unique value, e.g. policy_code. Again, the meta description should be checked to see the rationale, but such dimensions are not useful for any analysis or model building.

Our meta table also contains the absolute number of unique values, which is helpful for plotting (for attributes, that is). Another interesting ratio to look at is that of unique values over all values for any attribute. A high ratio would indicate that this is probably a “free” field, i.e. no particular constraints are put on its values (key variables are of course an exception; being unique, their ratio should be one, as is the case with id and member_id). When looking into correlations, these high-ratio fields will have low correlation with other fields, but they may still be useful, e.g. because they carry “direction” information (e.g. the direction of an effect) or, in the case of strings, may be useful for text analytics. We thus add a respective variable to the meta data. For improved readability, we use the scales::percent() function to convert the output to percent.

meta_loans <-
  meta_loans %>%
  mutate(uniq_rat = unique / nrow(loans))

meta_loans %>%
  select(variable, unique, uniq_rat) %>%
  mutate(unique = unique, uniq_rat = scales::percent(uniq_rat)) %>%
  knitr::kable()
variable unique uniq_rat
id 887379 100.0%
member_id 887379 100.0%
loan_amnt 1372 0.2%
funded_amnt 1372 0.2%
funded_amnt_inv 9856 1.1%
term 2 0.0%
int_rate 542 0.1%
installment 68711 7.7%
grade 7 0.0%
sub_grade 35 0.0%
emp_title 289148 32.6%
emp_length 12 0.0%
home_ownership 6 0.0%
annual_inc 49384 5.6%
verification_status 3 0.0%
issue_d 103 0.0%
loan_status 10 0.0%
pymnt_plan 2 0.0%
url 887379 100.0%
desc 124453 14.0%
purpose 14 0.0%
title 61452 6.9%
zip_code 935 0.1%
addr_state 51 0.0%
dti 4086 0.5%
delinq_2yrs 29 0.0%
earliest_cr_line 697 0.1%
inq_last_6mths 28 0.0%
mths_since_last_delinq 155 0.0%
mths_since_last_record 123 0.0%
open_acc 77 0.0%
pub_rec 32 0.0%
revol_bal 73740 8.3%
revol_util 1356 0.2%
total_acc 135 0.0%
initial_list_status 2 0.0%
out_prncp 248332 28.0%
out_prncp_inv 266244 30.0%
total_pymnt 506726 57.1%
total_pymnt_inv 506616 57.1%
total_rec_prncp 260227 29.3%
total_rec_int 324635 36.6%
total_rec_late_fee 6181 0.7%
recoveries 23055 2.6%
collection_recovery_fee 20708 2.3%
last_pymnt_d 98 0.0%
last_pymnt_amnt 232451 26.2%
next_pymnt_d 100 0.0%
last_credit_pull_d 103 0.0%
collections_12_mths_ex_med 12 0.0%
mths_since_last_major_derog 168 0.0%
policy_code 1 0.0%
application_type 2 0.0%
annual_inc_joint 308 0.0%
dti_joint 449 0.1%
verification_status_joint 3 0.0%
acc_now_delinq 8 0.0%
tot_coll_amt 10325 1.2%
tot_cur_bal 327342 36.9%
open_acc_6m 13 0.0%
open_il_6m 35 0.0%
open_il_12m 12 0.0%
open_il_24m 17 0.0%
mths_since_rcnt_il 201 0.0%
total_bal_il 17030 1.9%
il_util 1272 0.1%
open_rv_12m 18 0.0%
open_rv_24m 28 0.0%
max_bal_bc 10707 1.2%
all_util 1128 0.1%
total_rev_hi_lim 21251 2.4%
inq_fi 18 0.0%
total_cu_tl 33 0.0%
inq_last_12m 29 0.0%

An attribute like emp_title has over 30% unique values which makes it a poor candidate for modeling as it seems borrowers are free to describe their notion of employment title. A sophisticated model, however, may even take advantage of the data by checking the strings for “indicator words” that may be associated with an honest or a dishonest (credible / non-credible) candidate.

5.1 Data Transformation

We should also check the variable types. As mentioned in section Data Import the import function uses heuristics to guess the data types and these may not always work.

  • annual_inc_joint = character but should be numeric
  • dates are read as character and need to be transformed

First let’s transform character to numeric. We make use of the dplyr::mutate_at() function and provide a vector of columns to be mutated (transformed). In general, when using libraries from the tidyverse (these libraries are mainly authored by Hadley Wickham and other RStudio people), most functions offer a standard-evaluation (SE) version, which can take character values as variable names, in addition to the default NSE (non-standard evaluation) version. The SE functions usually have a trailing underscore, e.g. dplyr::select_() as compared to the non-standard evaluation function dplyr::select(). For details, see the dplyr vignette on Non-standard evaluation. However, note that the tidyverse libraries are still changing a fair amount, so it is best to cross-check whether the function used is still the recommended one for a given task or whether a new one has been introduced in the meantime.
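A minimal illustration of the two call styles, using the dplyr version current at the time of this post (newer dplyr releases deprecate the underscore functions in favor of tidy evaluation):

# NSE: bare column names
loans %>% select(annual_inc, dti) %>% head(2)
# SE: a character vector of column names (note the trailing underscore)
loans %>% select_(.dots = c("annual_inc", "dti")) %>% head(2)
# a newer tidy-evaluation equivalent: select(loans, !!! rlang::syms(c("annual_inc", "dti")))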

As opposed to packages like data.table, tidyverse packages do not alter their input in place, so we have to re-assign the result to the original object. We could also create a new object but this is memory-inefficient (in fact even the re-assignment to the original object creates a temporary copy). For more details on memory management, see the chapter Memory in Hadley Wickham’s book “Advanced R”.
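A small way to observe this copying behaviour is base::tracemem(), which reports whenever R duplicates an object (a sketch; the exact number of copies depends on the R version):

x <- data.frame(a = 1:5)
tracemem(x)              # start tracing copies of x
x$a <- as.numeric(x$a)   # the modification typically triggers at least one copy
untracemem(x)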

chr_to_num_vars <- 
  c("annual_inc_joint", "mths_since_last_major_derog", "open_acc_6m",
    "open_il_6m", "open_il_12m", "open_il_24m", "mths_since_rcnt_il",
    "total_bal_il", "il_util", "open_rv_12m", "open_rv_24m",
    "max_bal_bc", "all_util", "total_rev_hi_lim", "total_cu_tl",
    "inq_last_12m", "dti_joint", "inq_fi", "tot_cur_bal", "tot_coll_amt")

loans <-
  loans %>%
  mutate_at(.funs = funs(as.numeric), .vars = chr_to_num_vars)

Let’s have a look at the date variables to see how they need to be transformed.

chr_to_date_vars <- 
  c("issue_d", "last_pymnt_d", "last_credit_pull_d",
    "next_pymnt_d", "earliest_cr_line", "next_pymnt_d")

loans %>%
  select_(.dots = chr_to_date_vars) %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    887379 obs. of  5 variables:
##  $ issue_d           : chr  "Dec-2011" "Dec-2011" "Dec-2011" "Dec-2011" ...
##  $ last_pymnt_d      : chr  "Jan-2015" "Apr-2013" "Jun-2014" "Jan-2015" ...
##  $ last_credit_pull_d: chr  "Jan-2016" "Sep-2013" "Jan-2016" "Jan-2015" ...
##  $ next_pymnt_d      : chr  NA NA NA NA ...
##  $ earliest_cr_line  : chr  "Jan-1985" "Apr-1999" "Nov-2001" "Feb-1996" ...
head(unique(loans$next_pymnt_d))
## [1] NA         "Feb-2016" "Jan-2016" "Sep-2013" "Feb-2014" "May-2014"
for (i in chr_to_date_vars){
  print(head(unique(loans[, i])))
  }
## # A tibble: 6 x 1
##    issue_d
##      <chr>
## 1 Dec-2011
## 2 Nov-2011
## 3 Oct-2011
## 4 Sep-2011
## 5 Aug-2011
## 6 Jul-2011
## # A tibble: 6 x 1
##   last_pymnt_d
##          <chr>
## 1     Jan-2015
## 2     Apr-2013
## 3     Jun-2014
## 4     Jan-2016
## 5     Apr-2012
## 6     Nov-2012
## # A tibble: 6 x 1
##   last_credit_pull_d
##                <chr>
## 1           Jan-2016
## 2           Sep-2013
## 3           Jan-2015
## 4           Sep-2015
## 5           Dec-2014
## 6           Aug-2012
## # A tibble: 6 x 1
##   next_pymnt_d
##          <chr>
## 1         <NA>
## 2     Feb-2016
## 3     Jan-2016
## 4     Sep-2013
## 5     Feb-2014
## 6     May-2014
## # A tibble: 6 x 1
##   earliest_cr_line
##              <chr>
## 1         Jan-1985
## 2         Apr-1999
## 3         Nov-2001
## 4         Feb-1996
## 5         Jan-1996
## 6         Nov-2004
## # A tibble: 6 x 1
##   next_pymnt_d
##          <chr>
## 1         <NA>
## 2     Feb-2016
## 3     Jan-2016
## 4     Sep-2013
## 5     Feb-2014
## 6     May-2014

It seems the date format is consistent and follows a month-year convention. We can use the base::as.Date() function to convert this to a date format. The function requires a day as well, so we simply add the 1st of each month to the existing character string by pasting strings together using base::paste0(). Note that base::paste0("bla", NA) will not return NA but the concatenated string (here: blaNA). Conveniently, base::as.Date() will return NA so we can leave it at that. Alternatively, we could include an exception handler to explicitly deal with NA values, because we saw previously that some date variables include them. One way to approach that would be to wrap the date conversion function into a base::ifelse() call like so: ifelse(is.na(x), NA, some_date_function(x)). However, base::ifelse() drops the date class that we just created with our date function, for details, see e.g. How to prevent ifelse from turning Date objects into numeric objects. As we do not want to deal with these issues at this moment, we simply go with the default behavior of base::as.Date() as it returns NA in cases of bad input anyway.
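A small demonstration of the class-dropping behaviour, together with the type-stable dplyr::if_else() as a possible alternative:

d <- as.Date("2015-01-01")
ifelse(is.na(d), NA, d)                    # returns the underlying number, not a Date
dplyr::if_else(is.na(d), as.Date(NA), d)   # keeps the Date class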

Let’s also look at the NA counts of the input variables to remind ourselves of the numbers. They should coincide with the number of NAs after we have converted / transformed the date variables.

meta_loans %>% 
  select(variable, q_na) %>% 
  filter(variable %in% chr_to_date_vars)
##             variable   q_na
## 1            issue_d      0
## 2   earliest_cr_line     29
## 3       last_pymnt_d  17659
## 4       next_pymnt_d 252971
## 5 last_credit_pull_d     53

Finally, this is what our custom date conversion function will look like. We call it convert_date() and it takes the string value to be converted as input. We define the date format by following the conventions of base::as.Date(), for details see ?base::as.Date. We could have also used an anonymous function directly in the dplyr::mutate_at() call, but creating a named function seemed appropriate as we may use it several times.

convert_date <- function(x){
  as.Date(paste0("01-", x), format = "%d-%b-%Y")
  } 
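A quick sanity check on a single value before applying it to all date columns (note that the %b format relies on an English locale for month abbreviations, which matches the session info above):

convert_date("Dec-2011")
## expected: "2011-12-01"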

loans <-
  loans %>%
  mutate_at(.funs = funs(convert_date), .vars = chr_to_date_vars)

num_vars <- 
  loans %>% 
  sapply(is.numeric) %>% 
  which() %>% 
  names()

meta_loans %>%
  select(variable, p_zeros, p_na, unique) %>%
  filter_(~ variable %in% num_vars) %>%
  knitr::kable()
variable p_zeros p_na unique
id 0.00 0.00 887379
member_id 0.00 0.00 887379
loan_amnt 0.00 0.00 1372
funded_amnt 0.00 0.00 1372
funded_amnt_inv 0.03 0.00 9856
int_rate 0.00 0.00 542
installment 0.00 0.00 68711
annual_inc 0.00 0.00 49384
dti 0.05 0.00 4086
delinq_2yrs 80.80 0.00 29
inq_last_6mths 56.11 0.00 28
mths_since_last_delinq 0.19 51.20 155
mths_since_last_record 0.14 84.56 123
open_acc 0.00 0.00 77
pub_rec 84.70 0.00 32
revol_bal 0.38 0.00 73740
revol_util 0.40 0.06 1356
total_acc 0.00 0.00 135
out_prncp 28.83 0.00 248332
out_prncp_inv 28.83 0.00 266244
total_pymnt 2.00 0.00 506726
total_pymnt_inv 2.03 0.00 506616
total_rec_prncp 2.04 0.00 260227
total_rec_int 2.05 0.00 324635
total_rec_late_fee 98.59 0.00 6181
recoveries 97.22 0.00 23055
collection_recovery_fee 97.35 0.00 20708
last_pymnt_amnt 1.99 0.00 232451
collections_12_mths_ex_med 98.67 0.02 12
mths_since_last_major_derog 0.00 75.02 168
policy_code 0.00 0.00 1
annual_inc_joint 0.00 99.94 308
dti_joint 0.00 99.94 449
acc_now_delinq 99.53 0.00 8
tot_coll_amt 0.00 7.92 10325
tot_cur_bal 0.00 7.92 327342
open_acc_6m 0.00 97.59 13
open_il_6m 0.00 97.59 35
open_il_12m 0.00 97.59 12
open_il_24m 0.00 97.59 17
mths_since_rcnt_il 0.00 97.65 201
total_bal_il 0.00 97.59 17030
il_util 0.00 97.90 1272
open_rv_12m 0.00 97.59 18
open_rv_24m 0.00 97.59 28
max_bal_bc 0.00 97.59 10707
all_util 0.00 97.59 1128
total_rev_hi_lim 0.00 7.92 21251
inq_fi 0.00 97.59 18
total_cu_tl 0.00 97.59 33
inq_last_12m 0.00 97.59 29

We also see that the variables mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, dti_joint and annual_inc_joint have a large share of NA values. If we think about this in more detail, it may be reasonable to assume that NA values for the variables mths_since_last_delinq, mths_since_last_record and mths_since_last_major_derog actually indicate that there was no event/record of any missed payment, so there cannot be any time value. Analogously, a missing value for annual_inc_joint and dti_joint may simply indicate that it is a single borrower or the partner has no income. Thus, the first three variables actually carry valuable information that would be lost if we ignored it. We will therefore replace the missing values with zeros to make them available for modeling. It should be noted, though, that a replacement value of zero could also be read as an event that just happened (zero months ago), so we have to document our assumptions carefully.

na_to_zero_vars <-
  c("mths_since_last_delinq", "mths_since_last_record",
    "mths_since_last_major_derog")

loans <- 
  loans %>%
  mutate_at(.vars = na_to_zero_vars, .funs = funs(replace(., is.na(.), 0)))
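If we wanted to keep the replacement distinguishable from a genuine value of zero, one optional variation would be to store missingness indicators before replacing; the indicator column names below are made up for illustration and not used later.

# not run: add logical indicator columns prior to the zero replacement
# loans <- loans %>%
#   mutate(mths_since_last_delinq_was_na = is.na(mths_since_last_delinq),
#          mths_since_last_record_was_na = is.na(mths_since_last_record))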

These transformations should have us covered for now. We recreate the meta table after all these changes.

meta_loans <- funModeling::df_status(loans, print_results = FALSE)
meta_loans <-
  meta_loans %>%
  mutate(uniq_rat = unique / nrow(loans))

5.2 Additional Meta Data File

Next we look at the meta descriptions provided in the additional Excel file.

knitr::kable(meta_loan_stats[,1:2])
LoanStatNew Description
acc_now_delinq The number of accounts on which the borrower is now delinquent.
acc_open_past_24mths Number of trades opened in past 24 months.
addr_state The state provided by the borrower in the loan application
all_util Balance to credit limit on all trades
annual_inc The self-reported annual income provided by the borrower during registration.
annual_inc_joint The combined self-reported annual income provided by the co-borrowers during registration
application_type Indicates whether the loan is an individual application or a joint application with two co-borrowers
avg_cur_bal Average current balance of all accounts
bc_open_to_buy Total open to buy on revolving bankcards.
bc_util Ratio of total current balance to high credit/credit limit for all bankcard accounts.
chargeoff_within_12_mths Number of charge-offs within 12 months
collection_recovery_fee post charge off collection fee
collections_12_mths_ex_med Number of collections in 12 months excluding medical collections
delinq_2yrs The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
delinq_amnt The past-due amount owed for the accounts on which the borrower is now delinquent.
desc Loan description provided by the borrower
dti A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
dti_joint A ratio calculated using the co-borrowers’ total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers’ combined self-reported monthly income
earliest_cr_line The month the borrower’s earliest reported credit line was opened
emp_length Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
emp_title The job title supplied by the Borrower when applying for the loan.*
fico_range_high The upper boundary range the borrower’s FICO at loan origination belongs to.
fico_range_low The lower boundary range the borrower’s FICO at loan origination belongs to.
funded_amnt The total amount committed to that loan at that point in time.
funded_amnt_inv The total amount committed by investors for that loan at that point in time.
grade LC assigned loan grade
home_ownership The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
id A unique LC assigned ID for the loan listing.
il_util Ratio of total current balance to high credit/credit limit on all install acct
initial_list_status The initial listing status of the loan. Possible values are – W, F
inq_fi Number of personal finance inquiries
inq_last_12m Number of credit inquiries in past 12 months
inq_last_6mths The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
installment The monthly payment owed by the borrower if the loan originates.
int_rate Interest Rate on the loan
issue_d The month which the loan was funded
last_credit_pull_d The most recent month LC pulled credit for this loan
last_fico_range_high The upper boundary range the borrower’s last FICO pulled belongs to.
last_fico_range_low The lower boundary range the borrower’s last FICO pulled belongs to.
last_pymnt_amnt Last total payment amount received
last_pymnt_d Last month payment was received
loan_amnt The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
loan_status Current status of the loan
max_bal_bc Maximum current balance owed on all revolving accounts
member_id A unique LC assigned Id for the borrower member.
mo_sin_old_il_acct Months since oldest bank installment account opened
mo_sin_old_rev_tl_op Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op Months since most recent revolving account opened
mo_sin_rcnt_tl Months since most recent account opened
mort_acc Number of mortgage accounts.
mths_since_last_delinq The number of months since the borrower’s last delinquency.
mths_since_last_major_derog Months since most recent 90-day or worse rating
mths_since_last_record The number of months since the last public record.
mths_since_rcnt_il Months since most recent installment accounts opened
mths_since_recent_bc Months since most recent bankcard account opened.
mths_since_recent_bc_dlq Months since most recent bankcard delinquency
mths_since_recent_inq Months since most recent inquiry.
mths_since_recent_revol_delinq Months since most recent revolving delinquency.
next_pymnt_d Next scheduled payment date
num_accts_ever_120_pd Number of accounts ever 120 or more days past due
num_actv_bc_tl Number of currently active bankcard accounts
num_actv_rev_tl Number of currently active revolving trades
num_bc_sats Number of satisfactory bankcard accounts
num_bc_tl Number of bankcard accounts
num_il_tl Number of installment accounts
num_op_rev_tl Number of open revolving accounts
num_rev_accts Number of revolving accounts
num_rev_tl_bal_gt_0 Number of revolving trades with balance >0
num_sats Number of satisfactory accounts
num_tl_120dpd_2m Number of accounts currently 120 days past due (updated in past 2 months)
num_tl_30dpd Number of accounts currently 30 days past due (updated in past 2 months)
num_tl_90g_dpd_24m Number of accounts 90 or more days past due in last 24 months
num_tl_op_past_12m Number of accounts opened in past 12 months
open_acc The number of open credit lines in the borrower’s credit file.
open_acc_6m Number of open trades in last 6 months
open_il_12m Number of installment accounts opened in past 12 months
open_il_24m Number of installment accounts opened in past 24 months
open_il_6m Number of currently active installment trades
open_rv_12m Number of revolving trades opened in past 12 months
open_rv_24m Number of revolving trades opened in past 24 months
out_prncp Remaining outstanding principal for total amount funded
out_prncp_inv Remaining outstanding principal for portion of total amount funded by investors
pct_tl_nvr_dlq Percent of trades never delinquent
percent_bc_gt_75 Percentage of all bankcard accounts > 75% of limit.
policy_code publicly available policy_code=1
new products not publicly availab le policy_code=2
pub_rec Number of derogatory public records
pub_rec_bankruptcies Number of public record bankruptcies
purpose A category provided by the borrower for the loan request.
pymnt_plan Indicates if a payment plan has been put in place for the loan
recoveries post charge off gross recovery
revol_bal Total credit revolving balance
revol_util Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
sub_grade LC assigned loan subgrade
tax_liens Number of tax liens
term The number of payments on the loan. Values are in months and can be either 36 or 60.
title The loan title provided by the borrower
tot_coll_amt Total collection amounts ever owed
tot_cur_bal Total current balance of all accounts
tot_hi_cred_lim Total high credit/credit limit
total_acc The total number of credit lines currently in the borrower’s credit file
total_bal_ex_mort Total credit balance excluding mortgage
total_bal_il Total current balance of all installment accounts
total_bc_limit Total bankcard high credit/credit limit
total_cu_tl Number of finance trades
total_il_high_credit_limit Total installment high credit/credit limit
total_pymnt Payments received to date for total amount funded
total_pymnt_inv Payments received to date for portion of total amount funded by investors
total_rec_int Interest received to date
total_rec_late_fee Late fees received to date
total_rec_prncp Principal received to date
total_rev_hi_lim Total revolving high credit/credit limit
url URL for the LC page with listing data.
verification_status Indicates if income was verified by LC, not verified, or if the income source was verified
verified_status_joint Indicates if the co-borrowers’ joint income was verified by LC, not verified, or if the income source was verified
zip_code The first 3 numbers of the zip code provided by the borrower in the loan application.
NA NA
NA * Employer Title replaces Employer Name for all loans listed after 9/23/2013

As expected, id is “A unique LC assigned ID for the loan listing.” and member_id is “A unique LC assigned Id for the borrower member.”. The policy_code is “publicly available policy_code=1 new products not publicly available policy_code=2”. That means there could be different values but as we have seen before in this data it only takes on one value.

Finally, let’s look at variables that are either in the data and not in the meta description or vice versa. The dplyr::setdiff() function does what its name suggests, just pay attention to the order of arguments to understand which difference you actually get.

Variables in loans data but not in meta description.

dplyr::setdiff(colnames(loans), meta_loan_stats$LoanStatNew)
## [1] "verification_status_joint" "total_rev_hi_lim"

Variables in meta description but not in loans data.

dplyr::setdiff(meta_loan_stats$LoanStatNew, colnames(loans))
##  [1] "acc_open_past_24mths"           "avg_cur_bal"                   
##  [3] "bc_open_to_buy"                 "bc_util"                       
##  [5] "chargeoff_within_12_mths"       "delinq_amnt"                   
##  [7] "fico_range_high"                "fico_range_low"                
##  [9] "last_fico_range_high"           "last_fico_range_low"           
## [11] "mo_sin_old_il_acct"             "mo_sin_old_rev_tl_op"          
## [13] "mo_sin_rcnt_rev_tl_op"          "mo_sin_rcnt_tl"                
## [15] "mort_acc"                       "mths_since_recent_bc"          
## [17] "mths_since_recent_bc_dlq"       "mths_since_recent_inq"         
## [19] "mths_since_recent_revol_delinq" "num_accts_ever_120_pd"         
## [21] "num_actv_bc_tl"                 "num_actv_rev_tl"               
## [23] "num_bc_sats"                    "num_bc_tl"                     
## [25] "num_il_tl"                      "num_op_rev_tl"                 
## [27] "num_rev_accts"                  "num_rev_tl_bal_gt_0"           
## [29] "num_sats"                       "num_tl_120dpd_2m"              
## [31] "num_tl_30dpd"                   "num_tl_90g_dpd_24m"            
## [33] "num_tl_op_past_12m"             "pct_tl_nvr_dlq"                
## [35] "percent_bc_gt_75"               "pub_rec_bankruptcies"          
## [37] "tax_liens"                      "tot_hi_cred_lim"               
## [39] "total_bal_ex_mort"              "total_bc_limit"                
## [41] "total_il_high_credit_limit"     "total_rev_hi_lim  "            
## [43] "verified_status_joint"          NA

It seems that for the loans variables verification_status_joint and total_rev_hi_lim there are actually equivalents in the meta data, but they carry slightly different names (verified_status_joint) or have trailing blanks (total_rev_hi_lim). There are quite a few variables in the meta description that are not in the data, but that should be less of a concern as we don’t need those anyway.
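A hedged sketch of how the near-matches could be reconciled, trimming whitespace in the meta-data names before comparing again:

meta_names <- trimws(meta_loan_stats$LoanStatNew)
dplyr::setdiff(colnames(loans), meta_names)
# total_rev_hi_lim should now match; verification_status_joint still differs from
# the meta data's verified_status_joint by name only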

6 Defining default

Our ultimate goal is the prediction of loan defaults from a given set of observations by selecting explanatory (independent) variables (also called features in machine learning parlance) that result in an acceptable model performance as quantified by a pre-defined measure. This goal also shapes our exploratory data analysis. We will try to build some intuition about the data given our knowledge of the final goal. If we were given a data discovery task, such as detecting interesting patterns via unsupervised learning methods, we would probably perform a very different analysis.

For loans data the usual variable of interest is a delay or a default on required payments. So far we haven’t looked at any default variable because, in fact, there is no single variable indicating it. One may speculate about the reasons, but a plausible explanation may be that the definition of default depends on the perspective. A risk-averse person may classify any delay in scheduled payments as a default from day one, while others may apply a step-wise approach considering that borrowers may pay at a later stage. Yet another classification may look at any rating deterioration and include default as the last grade in a rating scale.

Let’s look at potential variables that may indicate a default / delay in payments:

  • loan_status: Current status of the loan
  • delinq_2yrs: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
  • mths_since_last_delinq: The number of months since the borrower’s last delinquency.

We can look at the unique values of the above variables by applying base::unique() via the purrr::map() function to each column of interest. The purrr library is part of the tidyverse set of libraries and provides functional-programming tools for iterating over data structures.

default_vars <- c("loan_status", "delinq_2yrs", "mths_since_last_delinq")
purrr::map(.x = loans[, default_vars], .f = base::unique)
## $loan_status
##  [1] "Fully Paid"                                         
##  [2] "Charged Off"                                        
##  [3] "Current"                                            
##  [4] "Default"                                            
##  [5] "Late (31-120 days)"                                 
##  [6] "In Grace Period"                                    
##  [7] "Late (16-30 days)"                                  
##  [8] "Does not meet the credit policy. Status:Fully Paid" 
##  [9] "Does not meet the credit policy. Status:Charged Off"
## [10] "Issued"                                             
## 
## $delinq_2yrs
##  [1]  0  2  3  1  4  6  5  8  7  9 11 NA 13 15 10 12 17 18 29 24 14 21 22
## [24] 19 16 30 26 20 27 39
## 
## $mths_since_last_delinq
##   [1]   0  35  38  61   8  20  18  68  45  48  41  40  74  25  53  39  10
##  [18]  26  56  77  28  52  24  16  60  54  23   9  11  13  65  19  80  22
##  [35]  59  79  44  64  57  14  63  49  15  73  70  29  51   5  75  55   2
##  [52]  30  47  33  69   4  43  21  27  46  81  78  82  31  76  62  72  42
##  [69]  50   3  12  67  36  34  58  17  71  66  32   6  37   7   1  83  86
##  [86] 115  96 103 120 106  89 107  85  97  95 110  84 135  88  87 122  91
## [103] 146 134 114  99  93 127 101  94 102 129 113 139 131 156 143 109 119
## [120] 149 118 130 148 126  90 141 116 100 152  98  92 108 133 104 111 105
## [137] 170 124 136 180 188 140 151 159 121 123 157 112 154 171 142 125 117
## [154] 176 137

We can see that delinq_2yrs shows only a few unique values which is a bit surprising as it could take many more values given its definition. The variable mths_since_last_delinq has some surprisingly large values. Both variables only indicate a delinquency in the past so they cannot help with the default definition.

The variable loan_status seems to be an indicator of the current state a particular loan is in. We should also perform a count of the different values.

loans %>%
  group_by(loan_status) %>%
  summarize(count = n(), rel_count = count/nrow(loans)) %>%
  knitr::kable()
loan_status count rel_count
Charged Off 45248 0.0509906
Current 601779 0.6781533
Default 1219 0.0013737
Does not meet the credit policy. Status:Charged Off 761 0.0008576
Does not meet the credit policy. Status:Fully Paid 1988 0.0022403
Fully Paid 207723 0.2340860
In Grace Period 6253 0.0070466
Issued 8460 0.0095337
Late (16-30 days) 2357 0.0026561
Late (31-120 days) 11591 0.0130621

It is not immediately obvious what the different values stand for, so we refer to Lending Club’s documentation “What do the different Note statuses mean?”:

  • Fully Paid: Loan has been fully repaid, either at the expiration of the 3- or 5-year term or as a result of a prepayment.
  • Current: Loan is up to date on all outstanding payments.
  • Does not meet the credit policy. Status:Fully Paid: No explanation but see “fully paid”.
  • Issued: New loan that has passed all Lending Club reviews, received full funding, and has been issued.
  • Charged Off: Loan for which there is no longer a reasonable expectation of further payments. Generally, Charge Off occurs no later than 30 days after the Default status is reached. Upon Charge Off, the remaining principal balance of the Note is deducted from the account balance. Learn more about the difference between “default” and “charge off”.
  • Does not meet the credit policy. Status:Charged Off: No explanation but see “Charged Off”
  • Late (31-120 days): Loan has not been current for 31 to 120 days.
  • In Grace Period: Loan is past due but within the 15-day grace period.
  • Late (16-30 days): Loan has not been current for 16 to 30 days.
  • Default: Loan has not been current for 121 days or more.

Given above information, we will define a default as follows.

defaulted <- 
  c("Default", 
    "Does not meet the credit policy. Status:Charged Off", 
    "In Grace Period", 
    "Late (16-30 days)", 
    "Late (31-120 days)")

We now have to add an indicator variable to the data. We do this by reassigning the mutated data to the original object. An alternative would be to update the object via the compound assignment pipe operator magrittr::%<>% from the magrittr package or an in-place assignment := from the data.table package. We use a logical (TRUE/FALSE) indicator variable, which has nicer plotting properties (as it is treated like a character variable by the plotting library ggplot2) than a numerical value such as 1/0. R is usually clever enough to still allow calculations on logical values.

loans <-
  loans %>%
  mutate(default = ifelse(!(loan_status %in% defaulted), FALSE, TRUE))
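For reference, hedged sketches of the two alternatives mentioned above (note that loan_status %in% defaulted already yields the logical vector, so the ifelse() wrapper is not strictly needed):

# not run
# library(magrittr)
# loans %<>% mutate(default = loan_status %in% defaulted)
#
# data.table::setDT(loans)[, default := loan_status %in% defaulted]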

Given our new indicator variable, we can now compute the frequency of actual defaults in the data. It is around 0.0249961, i.e. about 2.5%.

loans %>%
  summarise(default_freq = sum(default / n()))
## # A tibble: 1 x 1
##   default_freq
##          <dbl>
## 1   0.02499608
# alternatively in a table
table(loans$default) / nrow(loans)
## 
##      FALSE       TRUE 
## 0.97500392 0.02499608

7 Removing variables deemed unfit for most modeling

As stated before, some variables may actually have information value but are dropped because we deem them unfit for most practical purposes. Arguably one would have to look at the actual value distribution, as e.g. a high number of unique values may come from nonsense values for only a few loans, but we don’t dig deeper here.

We can get rid of the following variables, with the reason given:

  • annual_inc_joint: high NA ratio
  • dti_joint: high NA ratio
  • verification_status_joint: high NA ratio
  • policy_code: only one unique value -> standard deviation = 0
  • id: key variable
  • member_id: key variable
  • emp_title: high amount of unique values
  • url: high amount of unique values
  • desc: high NA ratio
  • title: high amount of unique values
  • next_pymnt_d: high NA ratio
  • open_acc_6m: high NA ratio
  • open_il_6m: high NA ratio
  • open_il_12m: high NA ratio
  • open_il_24m: high NA ratio
  • mths_since_rcnt_il: high NA ratio
  • total_bal_il: high NA ratio
  • il_util: high NA ratio
  • open_rv_12m: high NA ratio
  • open_rv_24m: high NA ratio
  • max_bal_bc: high NA ratio
  • all_util: high NA ratio
  • total_rev_hi_lim: high NA ratio
  • inq_fi: high NA ratio
  • total_cu_tl: high NA ratio
  • inq_last_12m: high NA ratio

vars_to_remove <- 
  c("annual_inc_joint", "dti_joint", "policy_code", "id", "member_id",
    "emp_title", "url", "desc", "title", "open_acc_6m", "open_il_6m", 
    "open_il_12m", "open_il_24m", "mths_since_rcnt_il", "total_bal_il", 
    "il_util", "open_rv_12m", "open_rv_24m", "max_bal_bc", "all_util",
    "total_rev_hi_lim", "inq_fi", "total_cu_tl", "inq_last_12m",
    "verification_status_joint", "next_pymnt_d")

loans <- loans %>% select(-one_of(vars_to_remove))

We further remove variables for other (stated) reasons:

  • sub_grade: contains the same (but more granular) information as grade
  • loan_status: has been used to define the target variable

vars_to_remove <- 
  c("sub_grade", "loan_status")

loans <- loans %>% select(-one_of(vars_to_remove))

8 A note on hypothesis generation vs. hypothesis confirmation

Before we start our analysis, we should be clear about its aim and what the data is used for. In the book “R for Data Science” the authors put it quite nicely under the name Hypothesis generation vs. hypothesis confirmation:

  1. Each observation can either be used for exploration or confirmation, not both.
  2. You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration. This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be over optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.

In a strict sense, this requires us to split the data into different sets. The authors go on suggesting a split:

  1. 60% of your data goes into a training (or exploration) set. You’re allowed to do anything you like with this data: visualise it and fit tons of models to it.
  2. 20% goes into a query set. You can use this data to compare models or visualisations by hand, but you’re not allowed to use it as part of an automated process.
  3. 20% is held back for a test set. You can only use this data ONCE, to test your final model.

This means that even for exploratory data analysis (EDA), we would only look at part of the data. We will split the data at random into two sets with an 80% train and 20% test ratio. As we are dealing with time-series data, we could also split the data by time, but time itself may be an explanatory variable which could be modeled. All exploratory analysis will be performed on the training data only. We will use base::set.seed() to make the random split reproducible. One can argue whether it is sensible to use only split data for EDA, but EDA usually builds your intuition about the data and thus will shape data transformation and model decisions. The test data allows for testing all these assumptions in addition to the actual model performance. There are other methods such as cross-validation which do not necessarily require a test data set, but we have enough observations to afford one.

One note of caution is necessary here. Since not all data is used for model fitting, the test data may have labels that do not occur in the training set and, by the same rationale, features may have unseen values. In addition, the data is imbalanced, i.e. only a few borrowers default while many more do not. The latter fact may actually require a non-random split considering the class label (default / non-default). The same may hold true for the features (independent variables). For more details on dealing with imbalanced data, see Learning from Imbalanced Classes or Learning from Imbalanced Data. Tom Fawcett puts it nicely in the first of these blog posts:

Conventional algorithms are often biased towards the majority class because their loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration. In the worst case, minority examples are treated as outliers of the majority class and ignored. The learning algorithm simply generates a trivial classifier that classifies every example as the majority class.

We can do the split manually (see commented out code) but in order to ensure class distributions within the split data, we use function createDataPartition() from the caret package which performs stratified sampling. For details on the function, see The caret package: 4.1 Simple Splitting Based on the Outcome.

# ## manual approach
# # 80% of the sample size
# smp_size <- floor(0.8 * nrow(loans))
# 
# # set the seed to make your partition reproducible
# set.seed(123)
# train_index <- sample(seq_len(nrow(loans)), size = smp_size)
# 
# train <- loans[train_index, ]
# test <- loans[-train_index, ]

## with caret
set.seed(6438)

train_index <- 
  caret::createDataPartition(y = loans$default, times = 1, 
                             p = .8, list = FALSE)

train <- loans[train_index, ]
test <- loans[-train_index, ]
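A quick check that the stratified split preserves the class balance in both partitions (the exact figures depend on the seed and are not reproduced here):

round(prop.table(table(train$default)), 4)
round(prop.table(table(test$default)), 4)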

9 Exploratory Data Analysis

There are many excellent resources on exploratory data analysis.

A visual look at the data should always precede any model considerations. A useful visualization library is ggplot2 (which is part of the tidyverse and further on also referred to as ggplot); it requires a few other libraries on top for extensions, such as scales. We are not after publication-ready visualizations yet, as this phase is considered data exploration or data discovery rather than results reporting. Nonetheless, the visualization libraries used already produce visually appealing graphics as they use smart heuristics and default values to guess sensible parameter settings.

library(scales)

The most important questions around visualization are which variables are numeric (and, if so, whether they are continuous or discrete) and which are strings. Furthermore, which variables are attributes (categorical) and which make up sensible metric-attribute pairs. Important information for efficient visualization with categorical variables is also the number of unique values they can take and the ratio of zero or missing values, both of which were already analyzed in section Meta Data. In a first exploratory analysis we aim for categorical variables that have high information power and not too many unique values, to keep the information density at a manageable level. Also consider group sizes and differences between median and mean driven by outliers. Especially when drawing conclusions from summarized / aggregated information, we should be aware of group size. We may thus add this info directly in the plot or look at it before plotting in a grouped table, as sketched below.
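For example, such a grouped table showing group size alongside mean and median could look like this (a sketch using columns introduced earlier; it relies on the train split from the previous section):

train %>%
  group_by(grade) %>%
  summarise(n = n(),
            mean_loan_amnt = mean(loan_amnt, na.rm = TRUE),
            median_loan_amnt = median(loan_amnt, na.rm = TRUE)) %>%
  knitr::kable()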

A generally good idea is to look at the distributions of relevant continuous variables. These are probably

  • loan_amnt
  • funded_amnt
  • funded_amnt_inv
  • annual_inc

Alongside the target variable default, some interesting categorical variables might be

  • term
  • int_rate
  • grade
  • emp_title
  • home_ownership
  • verification_status
  • issue_d (timestamp)
  • loan_status
  • purpose
  • zip_code (geo-info)
  • addr_state (geo-info)
  • application_type

Assign the continuous variables to character vectors and reshape the data for plotting the distributions.

income_vars <- c("annual_inc")
loan_amount_vars <- c("loan_amnt", "funded_amnt", "funded_amnt_inv")

We can reshape and plot the original data for specified variables in a tidyr-dplyr-ggplot pipe. For details on tidying with tidyr, see Data Science with R - Data Tidying or Data Processing with dplyr & tidyr.

train %>%
  select_(.dots = income_vars) %>%
  gather_("variable", "value", gather_cols = income_vars) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

We can see that a lot of loans have corresponding annual income of zero and in general income seems low. As already known, joint income has a large number of NA values (i.e. cannot be displayed) and those few values that are present do not seem to have significant exposure. Most loan applications must have been submitted by single income borrowers, i.e. either single persons or single-income households.

train %>%
  select_(.dots = loan_amount_vars) %>%
  gather_("variable", "value", gather_cols = loan_amount_vars) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The loan amount distributions seem similar in shape, suggesting not too much divergence between the loan amount applied for, the amount committed and the amount invested.

We had already identified a number of interesting categorical variables. Let’s combine our selection with the meta information gathered in an earlier stage to see the information power and uniqueness.

categorical_vars <- 
  c("term", "grade", "sub_grade", "emp_title", "home_ownership",
    "verification_status", "loan_status", "purpose", "zip_code",
    "addr_state", "application_type", "policy_code")

meta_loans %>%
  select(variable, p_zeros, p_na, type, unique) %>%
  filter_(~ variable %in% categorical_vars) %>%
  knitr::kable()
variable p_zeros p_na type unique
term 0 0.0 character 2
grade 0 0.0 character 7
sub_grade 0 0.0 character 35
emp_title 0 5.8 character 289148
home_ownership 0 0.0 character 6
verification_status 0 0.0 character 3
loan_status 0 0.0 character 10
purpose 0 0.0 character 14
zip_code 0 0.0 character 935
addr_state 0 0.0 character 51
policy_code 0 0.0 numeric 1
application_type 0 0.0 character 2

This gives us some ideas on what may be useful for a broad data exploration. Variables such as emp_title have too many unique values to be suitable for a classical categorical graph. Other variables may lend themselves to pairwise or correlation graphs such as int_rate while again others may be used in time series plots such as issue_d. We may even be able to plot some appealing geographical plots with geolocation variables such as zip_code or addr_state.

9.1 Grade

For a start, let’s look at grade, which seems to be a rating classification scheme that Lending Club uses to assign loans into risk buckets, similar to other popular rating schemes like S&P’s or Moody’s. For more details, see Lending Club Rate Information. For now, it suffices to know that grades take the values A, B, C, D, E, F, G where A represents the highest quality loan and G the lowest.

As a first relation, we investigate the distribution of loan amount over the different grades with a standard boxplot highlighting potential outliers in red. One may also select a pre-defined theme from the ggthemes package by adding a call to ggplot such as theme_economist() or even create a theme oneself. For details on themes, see ggplot2 themes and Introduction to ggthemes.

As mentioned before, we want to be group-size aware, so let’s write a short function to be used inside the ggplot2::geom_boxplot() call in combination with ggplot2::stat_summary(), where we can call user-defined functions using the parameter fun.data. As is often the case with plots, we need to experiment with the position of additional text elements, which we achieve by scaling the y-position with some constant multiplier around the median (as the boxplot will have the median as a horizontal bar within the rectangles). The mean can be added in a similar fashion; however, we don’t need to specify its y-position explicitly. The count (as text) and the mean (as a point) may overlap each other and the median horizontal bar. This is acceptable in an exploratory setting; for publication-ready plots one would have to perform some adjustments.

# see http://stackoverflow.com/questions/15660829/how-to-add-a-number-of-observations-per-group-and-use-group-mean-in-ggplot2-boxp
give_count <- 
  stat_summary(fun.data = function(x) return(c(y = median(x)*1.06,
                                               label = length(x))),
               geom = "text")

# see http://stackoverflow.com/questions/19876505/boxplot-show-the-value-of-mean
give_mean <- 
  stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
               shape = 18, size = 3, show.legend = FALSE)
train %>%
  ggplot(aes(grade, loan_amnt)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ default) +
  labs(title="Loan Amount by Grade", x = "Grade", y = "Loan Amount \n")

We can derive a few points from the plot:

  • there is not a lot of difference between default and non-default
  • lower quality loans tend to have a higher loan amount
  • there are virtually no outliers except for grade B
  • the loan amount spread (IQR) seems to be slightly higher for lower quality loans

According to Lending Club’s rate information, we would expect a strong increasing relationship between grade and interest rate so we can test this. We also consider the term of the loan as one might expect that longer term loans could have a higher interest rate.

train %>%
  ggplot(aes(grade, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  labs(title="Interest Rate by Grade", x = "Grade", y = "Interest Rate \n") +
  facet_wrap(~ term)

We can derive a few points from the plot:

  • interest rate increases with grade worsening
  • a few loans seem to have an equally low interest rate independent of grade
  • the spread of rates seems to increase with grade worsening
  • there tend to be more outliers on the lower end of the rate
  • the 3-year term has a much higher number of borrowers in the high-quality grades while the 5-year term has a larger share in the low-quality grades

9.2 Home Ownership

We can do the same for home ownership.

train %>%
  ggplot(aes(home_ownership, int_rate)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ default) +
  labs(title="Interest Rate by Home Ownership", x = "Home Ownership", y = "Interest Rate \n")

We can derive a few points from the plot:

  • there seems to be no immediate conclusion with respect to the impact of home ownership, e.g. we can see that the median/mean interest rate is higher for people who own a house than for those who still pay a mortgage.
  • interest rates are highest for values “any” and “none” which could be loans of limited data quality but there are very few data points.
  • interest rate is higher for default loans, which is probably driven by other factors (e.g. grade)

9.3 Loan Amount and Income

Another interesting plot may be the relationship between loan amount and funded / invested amount. As all variables are continuous, we can do that with a simple scatterplot, but we will need to reshape the data to have both funded amounts plotted against the loan amount.

In fact the reshaping here may be slightly odd as we want to display the same variable on the x-axis but different values on the y-axis, facetted by their variable names. To achieve that, we build the data twice with the same x variable but a different y variable each time and stack the results.

funded_amnt <-
  train %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt, 
              variable = "funded_amnt")

funded_amnt_inv <-
  train %>%
  transmute(loan_amnt = loan_amnt, value = funded_amnt_inv, 
              variable = "funded_amnt_inv")

plot_data <- rbind(funded_amnt, funded_amnt_inv)
# remove unnecessary data using regex
ls()
##  [1] "categorical_vars"  "chr_to_date_vars"  "chr_to_num_vars"  
##  [4] "convert_date"      "default_vars"      "defaulted"        
##  [7] "excel_file"        "funded_amnt"       "funded_amnt_inv"  
## [10] "give_count"        "give_mean"         "i"                
## [13] "income_vars"       "libraries_missing" "libraries_used"   
## [16] "loan_amount_vars"  "loans"             "meta_browse_notes"
## [19] "meta_loan_stats"   "meta_loans"        "meta_reject_stats"
## [22] "na_to_zero_vars"   "num_vars"          "path"             
## [25] "plot_data"         "test"              "train"            
## [28] "train_index"       "vars_to_remove"
rm(list = ls()[grep("^funded", ls())])
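
As an aside, the same long format could be produced more compactly with tidyr::gather(), keeping loan_amnt as an identifier column (a sketch, not used further below):

# sketch: reshape with tidyr::gather() instead of transmute() + rbind()
plot_data_alt <-
  train %>%
  select(loan_amnt, funded_amnt, funded_amnt_inv) %>%
  gather(key = "variable", value = "value", funded_amnt, funded_amnt_inv)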

Now let’s plot the data.

plot_data %>%
  ggplot(aes(x = loan_amnt, y = value)) +
  facet_wrap(~ variable, scales = "free_x", ncol = 3) +
  geom_point()

We can derive a few points from the plot:

  • there are instances where the funded amount is smaller than the loan amount
  • there seem to be a number of loans where the invested amount is smaller than the funded amount, i.e. the full loan is not invested in

Let’s do the same but only for annual income versus loan amount.

train %>%
  ggplot(aes(x = annual_inc, y = loan_amnt)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (geom_point).

We can derive a few points from the plot:

  • there is no immediately discernible relationship
  • there are quite a few income outliers with questionable values (e.g. why would a person with an annual income of 9500000 request a loan amount of only 24000?)

9.4 Time Series

Now let’s take a look at interest rates over time but split the time series by grade to see if there are differences in interest rate development depending on the borrower grade. Again we make use of a dplyr pipe, first selecting the variables of interest, then grouping by attributes and finally summarising the metric interest rate by taking the mean for each group. We use facet_wrap to split by the attribute grade. As we are using the mean for building an aggregate representation, we should be wary of outliers and group size, which we had already looked at earlier (ignoring the time dimension).

train %>%
  select(int_rate, grade) %>%
  group_by(grade) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE),
            int_rate_median = median(int_rate, na.rm = TRUE),
            n = n()) %>%
  knitr::kable()
grade int_rate_mean int_rate_median n
A 7.244639 7.26 118281
B 10.830884 10.99 203741
C 13.979427 13.98 196815
D 17.178566 16.99 111490
E 19.899484 19.99 56765
F 23.572306 23.76 18454
G 25.624502 25.83 4358

Group sizes are relatively large except for grade G, which may have only a few hundred loans per group when grouping by grade and date. Mean and median seem fairly close at this non-granular level. Let’s plot the data.

train %>%
  select(int_rate, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(int_rate_mean = mean(int_rate, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, int_rate_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade)

We can derive a few points from the plot:

  • the mean interest rate is falling or relatively constant for high-rated clients
  • the mean interest rate is increasing significantly for low-rated clients

Let’s also look at loan amounts over time in the same manner.

train %>%
  select(loan_amnt, grade, issue_d) %>%
  group_by(grade, issue_d) %>%
  summarise(loan_amnt_mean = mean(loan_amnt, na.rm = TRUE)) %>%
  ggplot(aes(issue_d, loan_amnt_mean)) +
  geom_line(color= "darkblue", size = 1) +
  facet_wrap(~ grade)

We can derive a few points from the plot:

  • the mean loan amount is increasing for all grades
  • while high-rated clients have some mean loan amount volatility, it is much higher for low-rated clients

9.5 Geolocation Plots

Let’s remind ourselves of the geolocation variables in the data and their information power.

  • zip_code (geo-info)
  • addr_state (geo-info)

geo_vars <- c("zip_code", "addr_state")

meta_loans %>%
  select(variable, p_zeros, p_na, type, unique) %>%
  filter_(~ variable %in% geo_vars) %>%
  knitr::kable()
variable p_zeros p_na type unique
zip_code 0 0 character 935
addr_state 0 0 character 51
loans %>%
  select_(.dots = geo_vars) %>%
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    887379 obs. of  2 variables:
##  $ zip_code  : chr  "860xx" "309xx" "606xx" "917xx" ...
##  $ addr_state: chr  "AZ" "GA" "IL" "CA" ...

We see that zip_code seems to be the truncated US postal code with only the first three digits having a value. The addr_state seems to hold the state names as two-letter abbreviations.

We can use the choroplethr package to work with maps (alternatives may be maps among other packages). While choroplethr provides functions to work with the data, its sister package choroplethrMaps contains corresponding maps that can be used by choroplethr. One issue with choroplethr is that it attaches the package plyr, which means many functions from dplyr (the successor to plyr) are masked. Thus we do not load the package choroplethr despite using its functions frequently. However, we need to load choroplethrMaps as it has some data that is not exported from its namespace, so syntax like choroplethrMaps::state.map (where state.map is the rdata file) does not work. An alternative might be to load the file directly by going to the underlying directory, e.g. R-3.x.x\library\choroplethrMaps\data, but this is cumbersome. As discussed before, when attaching a library, we just need to make sure that important functions of other packages are not masked. In this case it’s fine.

For details on those libraries, see CRAN choroplethr and CRAN choroplethrMaps.

First we aggregate the default rate by state.

default_rate_state <- 
  train %>%
  select(default, addr_state) %>%
  group_by(addr_state) %>%
  summarise(default_rate = sum(default, na.rm = TRUE) / n())

knitr::kable(default_rate_state)
addr_state default_rate
AK 0.0289280
AL 0.0300944
AR 0.0243994
AZ 0.0260203
CA 0.0247008
CO 0.0233654
CT 0.0222655
DC 0.0131646
DE 0.0291604
FL 0.0267678
GA 0.0238065
HI 0.0303441
IA 0.2000000
ID 0.0000000
IL 0.0201743
IN 0.0242243
KS 0.0162975
KY 0.0204172
LA 0.0268172
MA 0.0246096
MD 0.0279492
ME 0.0023474
MI 0.0249429
MN 0.0239122
MO 0.0222711
MS 0.0301540
MT 0.0289296
NC 0.0263557
ND 0.0075377
NE 0.0156087
NH 0.0206490
NJ 0.0249566
NM 0.0264766
NV 0.0303213
NY 0.0294889
OH 0.0225960
OK 0.0283897
OR 0.0209814
PA 0.0256838
RI 0.0245849
SC 0.0218611
SD 0.0301164
TN 0.0267720
TX 0.0240810
UT 0.0250696
VA 0.0279697
VT 0.0172771
WA 0.0216223
WI 0.0213508
WV 0.0225434
WY 0.0179455

The data is already in a good format but when using choroplethr we need to adhere to some conventions to make it work out of the box. The first thing would be to bring the state names into a standard format. In fact, we can investigate the format by looking e.g. into the choroplethrMaps::state.map dataset which we later use to map our data. We first load the library and then the data via the function utils::data(). From the documentation, we also learn that state.map is a “data.frame which contains a map of all 50 US States plus the District of Columbia.” It is based on a shapefile which is often used in geospatial visualizations and “taken from the US Census 2010 Cartographic Boundary shapefiles page”. We are interested in the region variable which seems to hold the state names.

library(choroplethrMaps)
utils::data(state.map)
str(state.map)
## 'data.frame':    50763 obs. of  12 variables:
##  $ long      : num  -113 -113 -112 -112 -111 ...
##  $ lat       : num  37 37 37 37 37 ...
##  $ order     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece     : Factor w/ 66 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group     : Factor w/ 226 levels "0.1","1.1","10.1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id        : chr  "0" "0" "0" "0" ...
##  $ GEO_ID    : Factor w/ 51 levels "0400000US01",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ STATE     : Factor w/ 51 levels "01","02","04",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ region    : chr  "arizona" "arizona" "arizona" "arizona" ...
##  $ LSAD      : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
##  $ CENSUSAREA: num  113594 113594 113594 113594 113594 ...
unique(state.map$region)
##  [1] "arizona"              "arkansas"             "louisiana"           
##  [4] "minnesota"            "mississippi"          "montana"             
##  [7] "new mexico"           "north dakota"         "oklahoma"            
## [10] "pennsylvania"         "tennessee"            "virginia"            
## [13] "california"           "delaware"             "west virginia"       
## [16] "wisconsin"            "wyoming"              "alabama"             
## [19] "alaska"               "florida"              "idaho"               
## [22] "kansas"               "maryland"             "colorado"            
## [25] "new jersey"           "north carolina"       "south carolina"      
## [28] "washington"           "vermont"              "utah"                
## [31] "iowa"                 "kentucky"             "maine"               
## [34] "massachusetts"        "connecticut"          "michigan"            
## [37] "missouri"             "nebraska"             "nevada"              
## [40] "new hampshire"        "new york"             "ohio"                
## [43] "oregon"               "rhode island"         "south dakota"        
## [46] "district of columbia" "texas"                "georgia"             
## [49] "hawaii"               "illinois"             "indiana"

So it seems we need lower-case, long-form state names, which implies we need a mapping from the short names in our data. Some internet research may lead us to the site American National Standards Institute (ANSI) Codes for States where we find a mapping under the section FIPS Codes for the States and the District of Columbia. We could try using some parsing tool to extract the table from the web page, but for now we take a shortcut and simply copy-paste the data into a tab-delimited txt file and read the file with readr::read_tsv(). We then cross-check whether all abbreviations used in our loans data are present in the newly created mapping table.

states <- readr::read_tsv("./data//us_states.txt")
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   `FIPS State Numeric Code` = col_integer(),
##   `Official USPS Code` = col_character()
## )
str(states)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  3 variables:
##  $ Name                   : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ FIPS State Numeric Code: int  1 2 4 5 6 8 9 10 11 12 ...
##  $ Official USPS Code     : chr  "AL" "AK" "AZ" "AR" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ Name                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ FIPS State Numeric Code: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Official USPS Code     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# give some better variable names (spaces never helpful)
names(states) <- c("name", "fips_code", "usps_code")
# see if all states are covered
dplyr::setdiff(default_rate_state$addr_state, states$usps_code)
## character(0)

Comparing the unique abbreviations in the loans data with the usps codes reveals that we are all covered so let’s bring the data together using dplyr::left_join(). For details on dplyr join syntax, see Two-table verbs. To go with the choroplethr conventions, we also need to

  • have state names in lower case
  • rename the mapping variable to region and the numeric metric to value.
  • only select above two variables in the function call

Note that we are re-assigning the data to itself to avoid creating a new table.

default_rate_state <-
  default_rate_state %>%
  left_join(states[, c("usps_code", "name")], 
            by = c("addr_state" = "usps_code")) %>%
  rename(region = name, value = default_rate) %>%
  mutate(region = tolower(region)) %>%
  select(region, value)

Finally, we can plot the data using the default colors from choroplethr. As we are having data on state level, we use function choroplethr::state_choropleth().

choroplethr::state_choropleth(df = default_rate_state, 
                              title = "Default rate by State")

We can derive a few points from the plot:

  • default rate varies across states
  • we may want to adjust the default binning (see the sketch right after this list)
  • we are only looking at default rate but not defaulted exposure (there may be states with many defaults but the defaulted amount is small)
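
Regarding the binning, here is a minimal sketch assuming the num_colors argument of choroplethr::state_choropleth() behaves as documented (a continuous scale when set to 1):

# sketch: continuous color scale instead of the default binning
# (num_colors is assumed to control the binning as per the choroplethr docs)
choroplethr::state_choropleth(df = default_rate_state,
                              title = "Default rate by State",
                              num_colors = 1)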

We could do the same exercise but on a more granular level as we have zip codes available. They are truncated so we would have to find a way to work with them. They also have a high number of unique values so the map legend may be cramped if we were looking at all of US.
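
A rough sketch of what the zip-code-level aggregation could look like (without mapping it, as the truncated codes would first have to be matched to a suitable zip-code map):

# sketch: default rate by truncated zip code, not mapped here
default_rate_zip <-
  train %>%
  select(default, zip_code) %>%
  group_by(zip_code) %>%
  summarise(default_rate = sum(default, na.rm = TRUE) / n(),
            n = n()) %>%
  arrange(desc(default_rate))

head(default_rate_zip)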

10 Correlation

Many models rely on the notion of correlation between independent and dependent variables, so a natural exploratory visualization would be a correlation plot or correlogram. One library offering this is corrplot with its main function corrplot::corrplot(). The function takes as input the correlation matrix that can be produced with stats::cor(). This of course is only defined for numeric, non-missing variables. In order to have a reasonable information density in the correlation matrix, we will kick out some variables with a missing value share larger than 50%.
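
A minimal sketch of such a filter, using the meta information from funModeling::df_status() (p_na is in percent); the resulting variable names depend on the current state of the data:

# sketch: identify variables with more than 50% missing values
meta_train_na <- funModeling::df_status(train, print_results = FALSE)

vars_high_na <-
  meta_train_na %>%
  filter(p_na > 50) %>%
  .$variable

vars_high_na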

For a discussion on missing value treatment, see e.g. Data Analysis Using Regression and Multilevel/Hierarchical Models - Chapter 25: Missing-data imputation

Let’s again build a numeric variable vector after all previous operations and look at correlations.

num_vars <- 
  train %>% 
  sapply(is.numeric) %>% 
  which() %>% 
  names()

meta_train <- funModeling::df_status(train, print_results = FALSE)

meta_train %>%
  select(variable, p_zeros, p_na, unique) %>%
  filter_(~ variable %in% num_vars) %>%
  knitr::kable()
variable p_zeros p_na unique
loan_amnt 0.00 0.00 1369
funded_amnt 0.00 0.00 1369
funded_amnt_inv 0.03 0.00 8203
int_rate 0.00 0.00 540
installment 0.00 0.00 64438
annual_inc 0.00 0.00 41978
dti 0.05 0.00 4072
delinq_2yrs 80.80 0.00 26
inq_last_6mths 56.10 0.00 27
mths_since_last_delinq 51.43 0.00 152
mths_since_last_record 84.70 0.00 123
open_acc 0.00 0.00 74
pub_rec 84.70 0.00 31
revol_bal 0.38 0.00 68979
revol_util 0.40 0.05 1312
total_acc 0.00 0.00 131
out_prncp 28.85 0.00 209810
out_prncp_inv 28.85 0.00 224620
total_pymnt 2.01 0.00 423038
total_pymnt_inv 2.04 0.00 424921
total_rec_prncp 2.05 0.00 222003
total_rec_int 2.06 0.00 287778
total_rec_late_fee 98.59 0.00 5169
recoveries 97.22 0.00 18681
collection_recovery_fee 97.35 0.00 16796
last_pymnt_amnt 2.00 0.00 197434
collections_12_mths_ex_med 98.65 0.02 12
mths_since_last_major_derog 75.06 0.00 167
acc_now_delinq 99.53 0.00 8
tot_coll_amt 78.94 7.92 9319
tot_cur_bal 0.01 7.92 293158

Finally, we can produce a correlation plot. Dealing with missing values by using the option use = "pairwise.complete.obs" in the function stats::cor() is considered bad practice as it uses pair matching and correlations may not be comparable, see e.g. Pairwise-complete correlation considered dangerous. Alternatively, we can use the option use = "complete.obs" which only considers complete observations but may discard a lot of data. However, as we have seen in the meta information, after our wrangling the proportion of missing values for the numeric variables has dropped a lot, so we should be fine.

Once we have the correlation matrix, we can use the function corrplot::corrplot() from the corrplot library. It offers a number of different visualization options. For details, see An Introduction to corrplot Package.

library(corrplot)
corrplot::corrplot(cor(train[, num_vars], use = "complete.obs"), 
                   method = "pie", type = "upper")

We can derive a few things from the plot

  • overall correlation between variables seems low but there are a few highly correlated ones
  • some highly correlated variables may indicate very similar information, e.g. mths_since_last_delinq and mths_since_last_major_derog
  • some variables are perfectly correlated and thus should be removed

Rather than a visual inspection, an (automatic) inspection of correlations and removal of highly correlated features can be done via function caret::findCorrelation() with a defined cutoff parameter. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation. Using exact = TRUE will cause the function to re-evaluate the average correlations at each step while exact = FALSE uses all the correlations regardless of whether they have been eliminated or not. The exact calculations will remove a smaller number of predictors but can be much slower when the problem dimensions are “big”.

caret::findCorrelation(cor(train[, num_vars], use = "complete.obs"), 
                       names = TRUE, cutoff = .5)
##  [1] "loan_amnt"                   "funded_amnt"                
##  [3] "funded_amnt_inv"             "installment"                
##  [5] "total_pymnt_inv"             "total_pymnt"                
##  [7] "total_rec_prncp"             "out_prncp"                  
##  [9] "total_acc"                   "mths_since_last_record"     
## [11] "mths_since_last_major_derog" "recoveries"

Given the above, we remove a few variables.

vars_to_remove <- 
  c("loan_amnt", "funded_amnt", "funded_amnt_inv", "installment",
    "total_pymnt_inv", "total_rec_prncp", "mths_since_last_delinq", 
    "out_prncp", "total_pymnt", "total_rec_int", "total_acc",
    "mths_since_last_record", "recoveries")

train <- train %>% select(-one_of(vars_to_remove))

11 Summary plots

An alternative to producing individual exploratory plots is using summary plots that come with sensible default settings and display the data according to its data type (numeric or categorical). The advantages are that a lot of information can be put into one graph and potential relationships may be spotted more easily. The disadvantages are potential performance issues when plotting many variables of a large dataset and readability issues in case of many dimensions for categorical variables. These disadvantages can be somewhat remediated by looking at summary plots after a lot of variables have already been discarded due to other findings (as we have done here). To show the general idea, we use the GGally library which builds on top of ggplot2 and provides some nice visualization functions. We choose a subset of variables for illustration of the GGally::ggpairs() function. Even with this subset the production of the graph may take some time depending on your hardware. For a theoretical background, see The Generalized Pairs Plot. For the many customization options, see GGally. As always when visualizing a lot of information, there is a trade-off between information density and visualization efficiency. The more variables are selected for the plot, the more difficult it becomes to efficiently visualize it. This needs some playing around or a really big screen.

Note

Simply leaving the variable default as boolean would produce an error in ggpairs (or rather in ggplot2 which it calls): Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?. Besides, we would like it to be treated as categorical, so we convert it to character for this function call.

library(GGally)
plot_ggpairs <-
  train %>%
  select(annual_inc, term, grade, default) %>%
  mutate(default = as.character(default)) %>%
  ggpairs()

plot_ggpairs

The resulting graph has a lot of information in it and we won’t go into further detail here, but here are the most important points for reading the graph properly. In the columns:

  • for a single continuous numeric variable, a density estimate is plotted
  • for a continuous numeric variable against another continuous numeric variable, a scatterplot is plotted
  • for a continuous numeric variable against a categorical variable, a histogram is plotted
  • for a single categorical variable, a histogram is plotted

In the rows:

  • for a continuous numeric variable against another continuous numeric variable, the correlation is shown
  • for a continuous numeric variable against a categorical variable, a boxplot per dimension of the categorical variable is plotted
  • for a categorical variable against another categorical variable, a histogram per dimension of the y-axis categorical variable is plotted
  • for a categorical variable against a continuous variable, a scatterplot per dimension of the categorical variable is plotted

Another interesting plot would be to visualize the density (estimates) of the numeric variables color-coded by the classes of the target variable (here: default/non-default) to inspect whether there are visual clues about differences in the distributions. We can achieve this by using ggplot but need to bring the data into a long format rather than its current wide format (see Data Science with R - Chapter Data Tidying for details on reshaping). We use tidyr::gather() to bring the numeric variables into a long format and then add the target variable again. Note that the reshaped data and the original target variable vector have different lengths. As dplyr::mutate() will not properly take care of the recycling rules, we have to recycle (repeat) the variable default manually so that it has the same length as the reshaped train data. See An Introduction to R - Chapter 2.2: Vector arithmetic for how recycling works in general.

Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated.

Note that we make use of a few plot parameters to make the visualization easier to follow. We like default to be a factor so the densities are independent of the class distribution. First, we switch the levels of the variable default to have TRUE on top (as the highest objective is to get the true positives right). We use the fill and color parameters to encode the target variable. This allows a visual distinction of overlapping densities. In addition, we set the alpha value to have the area under the curve transparently filled. For color selection we make use of brewer scales which offer a well-selected palette of colors. Finally, we split the plots by variable with facet_wrap, allowing the scales of each individual plot to be free and defining the number of columns via ncol.

num_vars <- train %>% sapply(is.numeric) %>% which() %>% names()

train %>%
  select_(.dots = num_vars) %>%
  gather(measure, value) %>%
  mutate(default = factor(rep(x = train$default, 
                              length.out = length(num_vars) * dim(train)[1]), 
                          levels = c("TRUE", "FALSE"))) %>%
  ggplot(data = ., aes(x = value, fill = default, 
                       color = default, order = -default)) +
  geom_density(alpha = 0.3, size = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free", ncol = 3)

We can derive a few things from the plot:

  • most of the densities are right-skewed with probably a few outliers extending the x-axis which makes discerning patterns in the area with highest mass difficult
  • even with above mentioned visualization impediment it seems that there is not a lot of difference between default and non-default loans for most variables
  • some densities seem wobbly for the majority class (e.g. open_acc) which is caused by the fact that the smoothing algorithm has many more data points to smooth in between but the density follows the same general shape

Given the above observations, we can try to improve the plot by removing (adjusting) outliers. There are many different methods that can be used, see e.g.

Most of these methods focus on removing outliers for modeling, which requires a lot of care and an understanding of how model characteristics change when removing / replacing outliers. In our case, we can be more pragmatic as we are only interested in improving our understanding of the mass of the distribution in a graphical way to see if there are easily discernible pattern differences between the two outcomes of our target variable. For this, we would like to replace outliers with some sensible value derived from the distribution of the particular variable. One could use the interquartile range as it is used in boxplots. We will use winsorization, i.e. replacing outliers with a specified percentile, e.g. the 95th. As this is a fairly simple task, we write our own function to do that. Alternatively, one may use e.g. the function psych::winsor() which trims outliers from both ends of the distribution. Since we have already seen that there are no negative values present, we only cut from the right side of the distribution. We call the function winsor_outlier and allow settings for the percentile and for removing NA values. The design is inspired by the SO question How to replace outliers with the 5th and 95th percentile values in R.

winsor_outlier <- function(x, cutoff = .95, na.rm = FALSE){
  quantiles <- quantile(x, cutoff, na.rm = na.rm)
  x[x > quantiles] <- quantiles
  x
}

train %>%
  select_(.dots = num_vars) %>%
  mutate_all(.funs = winsor_outlier, cutoff = .95, na.rm = TRUE) %>%
  gather(measure, value) %>%
  mutate(default = factor(rep(x = train$default, 
                              length.out = length(num_vars)*dim(train)[1]), 
                          levels = c("TRUE", "FALSE"))) %>%
  ggplot(data = ., aes(x = value, fill = default, 
                       color = default, order = -default)) +
  geom_density(alpha = 0.3, size = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap( ~ measure, scales = "free", ncol = 3)

We can derive a few things from the plot:

  • most densities don’t show a discernible difference between default and non-default
  • ignore the densities centered around zero as they stem from too few observations
  • some densities show a bump on the right stemming from the outlier replacement and thus indicating that there is actually data in the tail

Interestingly, variable out_prncp_inv shows a significant bump around zero for default == FALSE which can be confirmed via a count. However, the description tells us that this variable represents Remaining outstanding principal for portion of total amount funded by investors so it’s unclear how useful it is (especially when knowing that this variable will only be available after a loan has been granted).

train %>% 
  select(default, out_prncp_inv) %>% 
  filter(out_prncp_inv == 0) %>%
  group_by(default) %>% 
  summarize(n = n())
## # A tibble: 2 x 2
##   default      n
##     <lgl>  <int>
## 1   FALSE 204207
## 2    TRUE    613

Overall, it seems from visual inspection that the numeric variables do not contain a lot of information gain regarding the classification of the target variable. One obvious exception is the interest rate, which is a function of the expected risk of the borrower and probably mostly driven by the rating. Here we see that defaulted clients tend to have higher interest rates and also seem to have more outliers in the tail.
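
This impression can be cross-checked with a quick grouped summary (a small sketch; the exact figures depend on the train/test split):

# interest rate summary by default flag
train %>%
  group_by(default) %>%
  summarise(int_rate_mean   = mean(int_rate, na.rm = TRUE),
            int_rate_median = median(int_rate, na.rm = TRUE),
            n = n())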

A side note on plotting

If your screen is too small, you can export any plot to e.g. svg or pdf using graphics devices of the cairo API in package grDevices (usually part of base R distro) and scale the plot size to something bigger (especially helpful if it’s a long plot). Here we show an example for the previously created ggpairs plot stored as pdf.

cairo_pdf(filename = paste0(path, "/plot_ggpairs.pdf"),
    width=15,
    height=20,
    pointsize=10)
plot_ggpairs
dev.off()

12 Modeling

12.1 Model options

There are a couple of different model options available for a supervised binary classification problem such as determining whether a loan will default or not. In general, we may distinguish the following model approaches (among others):

  • logistic regression
  • linear discriminant analysis
  • k-nearest neighbors (kNN)
  • trees
  • random forests
  • boosting
  • support vector machines (SVM)

In addition, there are multiple ways to assess model performance such as model parameters, cross-validation, the bootstrap, and multiple feature selection methods such as ridge regression, the lasso, principal component analysis, etc.

We will look at some of those models and methods and see how they compare. For a more detailed treatment and further background, an accessible text is “An Introduction to Statistical Learning” or for a more advanced treatment “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”.

12.2 Imbalanced data

Before we can think about any modeling, we have to realize that the dataset is highly imbalanced, i.e. the target variable has a very low proportion of defaults (namely 0.0249963). This can lead to several issues in many algorithms, e.g.

  • biased accuracy
  • loss functions attempt to optimize quantities such as error rate, not taking the data distribution into consideration
  • errors have the same cost (not pertaining to imbalanced data only)
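
For reference, the proportion quoted above can be reproduced directly from the training data (assuming default is stored as a logical column):

# share of defaults in the training data
mean(train$default, na.rm = TRUE)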

More details can be found in

There are a couple of approaches to deal with the problem which can be divided into (taken from Learning from Imbalanced Classes)

  • do nothing (not a good idea in general)
  • balance training set
    • oversample minority class
    • undersample majority class
    • synthesize new minority classes
  • throw away minority examples and switch to an anomaly detection framework
  • at the algorithm level, or after it:
    • adjust the class weight (misclassification costs)
    • adjust the decision threshold
    • modify an existing algorithm to be more sensitive to rare classes
  • construct an entirely new algorithm to perform well on imbalanced data

We will focus on balancing the training set. There are basically three options available

  • undersample majority class (randomly or informed)
  • oversample minority class (randomly or informed)
  • do a mixture of both (e.g. SMOTE or ROSE method)

Undersampling the majority class may lose information but, by decreasing the dataset size, also leads to more computational efficiency. We also tried SMOTE and ROSE, but the functions DMwR::SMOTE() and ROSE::ROSE() seem to be very picky about their input and so far we failed to run them. For details on SMOTE, see SMOTE: Synthetic Minority Over-sampling Technique and for details on ROSE, see ROSE: A Package for Binary Imbalanced Learning. As pointed out here, random sampling for either under- or oversampling is not the best idea. It is recommended to use cross-validation and to perform over- or under-sampling on each fold independently to get an honest estimate of model performance. In fact, the function caret::train() also allows re-balancing of the data during the training of a model via caret::trainControl(sampling = "..."), see The caret package 11.2: Subsampling During Resampling for details. For now, we go with undersampling, which still leaves a fair amount of training observations (at least for non-deep-learning approaches).

train_down <- 
  caret::downSample(x = train[, !(names(train) %in% c("default"))], 
                    y = as.factor(train$default), yname = "default")

base::prop.table(table(train_down$default))
## 
## FALSE  TRUE 
##   0.5   0.5
base::table(train_down$default)
## 
## FALSE  TRUE 
## 17745 17745
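
As an alternative that we do not pursue further here, caret can perform the downsampling within each resampling fold, roughly along these lines (a sketch only; the resampling parameters are illustrative):

# sketch only: let caret downsample the majority class inside each CV fold
ctrl_down <- caret::trainControl(method = "cv", number = 5,
                                 sampling = "down")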

12.3 A note on modeling libraries

There are many modeling libraries in R (in fact it is one of the major reasons for its popularity) and they can be very roughly distinguished into

  • general modeling libraries comprising many different model functions (sometimes wrappers around specialized model libraries)
  • specialized modeling libraries focusing on only one or a certain class of models

We will be using general modeling libraries and in particular the (standard) library shipping with R, stats, and a popular library, caret. Another library often used is rms. The advantage of the general libraries is their easier access via a common API, usually a fairly comprehensive documentation, and a relatively large user base, which means there is a high chance that someone has already done what you are trying to do.

12.3.1 A note on caret library in particular

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

as well as other functionality. It aims at providing a uniform interface to R modeling functions themselves, as well as a way to standardize common tasks (such as parameter tuning and variable importance). The documentation gives an overview of the models that can be used via the caret::train() function which is the workhorse for building models. A quick look at available models reveals that there is a lot of choice.

names(caret::getModelInfo())
##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.GCCL"            "GFS.LT.RS"           "GFS.THRIFT"         
##  [70] "glm.nb"              "glm"                 "glmboost"           
##  [73] "glmnet_h2o"          "glmnet"              "glmStepAIC"         
##  [76] "gpls"                "hda"                 "hdda"               
##  [79] "hdrda"               "HYFIS"               "icr"                
##  [82] "J48"                 "JRip"                "kernelpls"          
##  [85] "kknn"                "knn"                 "krlsPoly"           
##  [88] "krlsRadial"          "lars"                "lars2"              
##  [91] "lasso"               "lda"                 "lda2"               
##  [94] "leapBackward"        "leapForward"         "leapSeq"            
##  [97] "Linda"               "lm"                  "lmStepAIC"          
## [100] "LMT"                 "loclda"              "logicBag"           
## [103] "LogitBoost"          "logreg"              "lssvmLinear"        
## [106] "lssvmPoly"           "lssvmRadial"         "lvq"                
## [109] "M5"                  "M5Rules"             "manb"               
## [112] "mda"                 "Mlda"                "mlp"                
## [115] "mlpKerasDecay"       "mlpKerasDecayCost"   "mlpKerasDropout"    
## [118] "mlpKerasDropoutCost" "mlpML"               "mlpSGD"             
## [121] "mlpWeightDecay"      "mlpWeightDecayML"    "monmlp"             
## [124] "msaenet"             "multinom"            "mxnet"              
## [127] "mxnetAdam"           "naive_bayes"         "nb"                 
## [130] "nbDiscrete"          "nbSearch"            "neuralnet"          
## [133] "nnet"                "nnls"                "nodeHarvest"        
## [136] "null"                "OneR"                "ordinalNet"         
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "PRIM"               
## [160] "protoclass"          "pythonKnnReg"        "qda"                
## [163] "QdaCov"              "qrf"                 "qrnn"               
## [166] "randomGLM"           "ranger"              "rbf"                
## [169] "rbfDDA"              "Rborist"             "rda"                
## [172] "regLogistic"         "relaxo"              "rf"                 
## [175] "rFerns"              "RFlda"               "rfRules"            
## [178] "ridge"               "rlda"                "rlm"                
## [181] "rmda"                "rocc"                "rotationForest"     
## [184] "rotationForestCp"    "rpart"               "rpart1SE"           
## [187] "rpart2"              "rpartCost"           "rpartScore"         
## [190] "rqlasso"             "rqnc"                "RRF"                
## [193] "RRFglobal"           "rrlda"               "RSimca"             
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"          
## [199] "SBC"                 "sda"                 "sdwd"               
## [202] "simpls"              "SLAVE"               "slda"               
## [205] "smda"                "snn"                 "sparseLDA"          
## [208] "spikeslab"           "spls"                "stepLDA"            
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"         
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"  
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"      
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"  
## [226] "tan"                 "tanSearch"           "treebag"            
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"      
## [232] "vglmCumulative"      "widekernelpls"       "WM"                 
## [235] "wsrf"                "xgbLinear"           "xgbTree"            
## [238] "xyf"
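
To inspect the tuning parameters of a single model from this list, caret::modelLookup() can be used, e.g.:

# tuning parameters of a specific model, e.g. a CART tree
caret::modelLookup("rpart")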

12.4 Data preparation for modeling

A number of models require data preparation in advance so here we are going to perform some pre-processing so that all models can work with the data. Note that a few model functions also allow the pre-processing during the training step (e.g. scale and center) but it is computationally more efficient to do at least some pre-processing before.

12.4.1 Proper names for character variables

Some models require a particular naming scheme for the values of character variables (e.g. no spaces). The base::make.names() function does just that.

vars_to_mutate <-
  train_down %>%
  select(which(sapply(.,is.character))) %>%
  names()

vars_to_mutate
##  [1] "term"                "grade"               "emp_length"         
##  [4] "home_ownership"      "verification_status" "pymnt_plan"         
##  [7] "purpose"             "zip_code"            "addr_state"         
## [10] "initial_list_status" "application_type"
train_down <-
  train_down %>%
  mutate_at(.funs = make.names, .vars = vars_to_mutate)
  
test <-
  test %>%
  mutate_at(.funs = make.names, .vars = vars_to_mutate)

12.4.2 Dummy variables

A few models automatically convert character/factor variables into binary dummy variables (e.g. stats::glm()) while others don’t. To build a common ground, we can create dummy variables ahead of modeling. The caret::dummyVars() function creates a full set of dummy variables (i.e. a less than full rank parameterization). The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method. Let’s also remove the zip code variable as it would create a huge binary dummy matrix and potentially lead to computational problems. In many cases it is anyway not allowed to be used for modeling as it may discriminate against certain communities. We also adjust the test set as we will need it for evaluation. We throw away all variables that are not part of the training set as they are not used by the models and would only clutter memory. We keep the dummy data separate from the original training data as any model using explicit variable names (i.e. not the full set) would require us to list each dummy variable separately, which is cumbersome. Had the data been big, we might have chosen to keep it all in one data set. For more details, see The caret package 3.1: Creating Dummy Variables.

vars_to_remove <- c("zip_code")
train_down <- train_down %>% select(-one_of(vars_to_remove))

# train
dummies_train <-
  dummyVars("~.", data = train_down[, !(names(train_down) %in% c("default"))], 
            fullRank = FALSE)

train_down_dummy <-
  train_down %>%
  select(-which(sapply(.,is.character))) %>%
  cbind(predict(dummies_train, newdata = train_down))

# test
dummies_test <-
  dummyVars("~.", data = test[, dummies_train$vars], fullRank = FALSE)

test_dummy <-
  test %>%
  select(one_of(colnames(train_down))) %>%
  select(-which(sapply(.,is.character))) %>%
  cbind(predict(dummies_test, newdata = test))

12.5 Logistic regression

While linear regression is very often used for (continuous) quantitative variables, logistic regression is its counterpart for (discrete) qualitative responses, also referred to as categorical. Rather than modeling the outcome directly as default or not, logistic regression models the probability that the response belongs to a particular category. The logistic function will ensure that probabilities are within the range [0, 1]. The model coefficients are estimated via the maximum likelihood method.
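
To make the mapping onto probabilities concrete: the logistic (inverse logit) function is available in base R as stats::plogis(), so a quick sanity check of the [0, 1] range could look like this (illustrative values only):

# the logistic function maps any real-valued linear predictor into (0, 1)
linear_predictor <- c(-5, -1, 0, 1, 5)  # illustrative values
plogis(linear_predictor)                # same as 1 / (1 + exp(-linear_predictor))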

The logistic regression model is part of the larger family of generalized linear models and is implemented in many R packages. We will use the stats::glm() function that usually comes with the standard install. A strong contender for any regression modeling is the caret package that we have already used before and will use again.

We will start with a very simple model of one explanatory variable, namely grade, which intuitively should be a good predictor of default as it is a credit rating and therefore an evaluation of the borrower. Note that a call to stats::glm() will not return all model statistics by default but only some attributes of the created model object. We can inspect the object’s attributes with the base::attributes() function. The function base::summary() is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument. In the case of a glm object, it will return model statistics such as

  • deviance residuals
  • coefficients
  • standard error
  • z statistic
  • p-value
  • AIC

and some more. We may also use the summary function on selected model objects, e.g. the coefficients only via summary(model_object)$coef.

Finally, before running the model, the documentation of stats::glm() tells us that if the response is of type factor the first level denotes failure and all others success. We can check and find that this is already the case here.

levels(train_down$default)
## [1] "FALSE" "TRUE"
model_glm_1 <- glm(formula = default ~ grade, 
                   family = binomial(link = "logit"), 
                   data = train_down, na.action = na.exclude)
class(model_glm_1)
## [1] "glm" "lm"
attributes(model_glm_1)
## $names
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
## 
## $class
## [1] "glm" "lm"
summary(model_glm_1)
## 
## Call:
## glm(formula = default ~ grade, family = binomial(link = "logit"), 
##     data = train_down, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8377  -1.1985  -0.0327   1.1564   1.7406  
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -1.26665    0.03911  -32.39 <0.0000000000000002 ***
## gradeB       0.77173    0.04519   17.08 <0.0000000000000002 ***
## gradeC       1.31620    0.04389   29.99 <0.0000000000000002 ***
## gradeD       1.67669    0.04600   36.45 <0.0000000000000002 ***
## gradeE       1.99273    0.05132   38.83 <0.0000000000000002 ***
## gradeF       2.26212    0.06856   33.00 <0.0000000000000002 ***
## gradeG       2.75092    0.12634   21.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49200  on 35489  degrees of freedom
## Residual deviance: 46080  on 35483  degrees of freedom
## AIC: 46094
## 
## Number of Fisher Scoring iterations: 4
summary(model_glm_1)$coef
##               Estimate Std. Error   z value      Pr(>|z|)
## (Intercept) -1.2666450 0.03910963 -32.38704 4.178600e-230
## gradeB       0.7717315 0.04518690  17.07866  2.139941e-65
## gradeC       1.3161994 0.04388804  29.98993 1.327728e-197
## gradeD       1.6766852 0.04600412  36.44641 7.841098e-291
## gradeE       1.9927260 0.05131843  38.83061  0.000000e+00
## gradeF       2.2621234 0.06855512  32.99715 8.925306e-239
## gradeG       2.7509198 0.12633659  21.77453 4.046038e-105

There are a couple of functions that work natively with lm / glm objects created via stats::glm() (taken from Steph Locke: Logistic Regressions):

  • coefficients: Extract coefficients
  • summary: Output a basic summary
  • fitted: Return the predicted values for the training data
  • predict: Predict some values for new data
  • plot: Produce some basic diagnostic plots
  • residuals: Return the errors on predicted values for the training data

Also note that categorical variables will get transformed into dummy variables. A standardization is not necessarily needed, see e.g. Is standardization needed before fitting logistic regression. There are a number of model diagnostics one may perform, see e.g. Diagnostics for logistic regression and pay attention to Frank Harrell’s answer.
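
The dummy coding behind the gradeB to gradeG coefficients above can be inspected directly via the model matrix (a quick check; grade A is absorbed into the intercept):

# treatment contrasts used by glm for the character variable grade
head(model.matrix(~ grade, data = train_down))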

12.6 Model evaluation illustration

For the sake of illustration, let’s go once through the model evaluation steps for this very simple model. This workflow should be similar even when the model gets more complex. There are many perspectives one can take when evaluating a model and many packages that provide helper functions to do so. For this illustration, we focus on a fairly standard approach looking at some statistics and explaining them along the way. We will also be plotting an ROC curve to establish its basic functioning.

To evaluate the model performance, we have to use the fitted model to predict the target variable default on unseen (test) data. The function stats::predict.glm() does exactly that. We set type = "response" which returns the predicted probabilities and thus allows us to evaluate against a chosen threshold. This can be useful in cases where the cost of misclassification is not equal and one outcome should be penalized more heavily than another. Also note that we choose the positive outcome to be TRUE, i.e. a true positive would be correctly identifying a default. We use the function caret::confusionMatrix() to produce a confusion matrix (also called error matrix). The function also returns a number of model statistics (however, note that the test data is highly imbalanced as we have not applied any resampling as in the training data, so a number of statistics are meaningless). Here is what the function reports:

  • Accuracy: (TP + TN) / (TP + FP + TN + FN)
  • 95% CI: 95% confidence interval for accuracy
  • No Information Rate: accuracy in case of random guessing (in case of binary outcome the proportion of the majority class)
  • P-Value [Acc > NIR]: p-value that accuracy is larger than the No Information Rate
  • Kappa: (observed accuracy - expected accuracy) / (1 - expected accuracy), see Cohen’s kappa in plain English
  • Mcnemar’s Test P-Value
  • Sensitivity / recall: true positive rate = TP / (TP + FN)
  • Specificity: true negative rate = TN / (TN + FP)
  • Pos Pred Value: (sensitivity * prevalence) / ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
  • Neg Pred Value: (specificity * (1-prevalence)) / (((1-sensitivity) * prevalence) + ((specificity) * (1 - prevalence)))
  • Prevalence: (TP + FN) / (TP + FP + TN + FN)
  • Detection Rate: TP / (TP + FP + TN + FN)
  • Detection Prevalence: (TP + FP) / (TP + FP + TN + FN)
  • Balanced Accuracy: (sensitivity + specificity) / 2

model_glm_1_pred <- 
  predict.glm(object = model_glm_1, newdata = test, type = "response")
model_pred_t <- function(pred, t) ifelse(pred > t, TRUE, FALSE)
caret::confusionMatrix(data = model_pred_t(model_glm_1_pred, 0.5), 
                       reference = test$default,
                       positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE  TRUE
##      FALSE 79712  1003
##      TRUE  93327  3433
##                                              
##                Accuracy : 0.4685             
##                  95% CI : (0.4662, 0.4708)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0211             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.77390            
##             Specificity : 0.46066            
##          Pos Pred Value : 0.03548            
##          Neg Pred Value : 0.98757            
##              Prevalence : 0.02500            
##          Detection Rate : 0.01934            
##    Detection Prevalence : 0.54520            
##       Balanced Accuracy : 0.61728            
##                                              
##        'Positive' Class : TRUE               
## 

Looking at the model statistics, we find a mixed picture:

  • we get a fair amount of true defaults right (sensitivity)
  • we get a large amount of non-defaults wrong (specificity)
  • the Kappa (which should consider class distributions) is very low

We can visualize the classification (predicted probabilities) versus the actual class for the test set and the given threshold. We make use of a plotting function inspired by Illustrated Guide to ROC and AUC with its source code residing on Github. We have slightly adjusted the code to fit our needs.

plot_pred_type_distribution <- function(df, threshold) {
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pred >= threshold & df$actual == 1, "TP", v)
  v <- ifelse(df$pred >= threshold & df$actual == 0, "FP", v)
  v <- ifelse(df$pred < threshold & df$actual == 1, "FN", v)
  v <- ifelse(df$pred < threshold & df$actual == 0, "TN", v)
  
  df$pred_type <- v
  
  ggplot(data = df, aes(x = actual, y = pred)) + 
    geom_violin(fill = rgb(1, 1, 1, alpha = 0.6), color = NA) + 
    geom_jitter(aes(color = pred_type), alpha = 0.6) +
    geom_hline(yintercept = threshold, color = "red", alpha = 0.6) +
    scale_color_discrete(name = "type") +
    labs(title = sprintf("Threshold at %.2f", threshold))
}

plot_pred_type_distribution(
  df = tibble::tibble(pred = model_glm_1_pred, actual = test$default), 
  threshold = 0.5)

We can derive a few things from the plot:

  • We can clearly see the imbalance of the test dataset with many more observations being FALSE, i.e. non-defaulted
  • We can see that predictions tend to cluster more around the center (i.e. the threshold) rather than the top or bottom
  • We see that there are only seven distinct (unique) prediction probabilities, namely 0.512386, 0.2198321, 0.3787367, 0.6010975, 0.6739447, 0.8152174, 0.7301686, which corresponds to the number of unique levels of our predictor grade, namely B, C, A, E, F, D, G (a quick cross-check follows below)
  • We can see the trade-off of moving the threshold down or up leading to higher sensitivity (lower specificity) or higher specificity (lower sensitivity), respectively.
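
A quick cross-check of the third point, pairing each test observation's grade with its predicted probability (a sketch; recall that the simple model uses grade as its only predictor):

# each grade level maps to exactly one predicted default probability
tibble::tibble(grade = test$grade, pred = model_glm_1_pred) %>%
  dplyr::distinct() %>%
  dplyr::arrange(pred)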

12.6.1 ROC curve

Let’s have a look at the ROC curve which evaluates the performance over all possible thresholds rather than just one. Without going into too much detail, the ROC curve works as follows:

  • draw sensitivity (true positive rate) on y-axis
  • draw specificity (true negative rate) on x-axis (often also drawn as 1 - true negative rate)
  • go from threshold zero to threshold one in a number of pre-defined steps, e.g. with distance 0.01
  • for each threshold sensitivity is plotted against specificity
  • the result is usually a (smoothed) curve that lies above the diagonal splitting the area
  • the diagonal (also called base) represents the lower bound as it is the result of a random guess (assuming balanced class distribution) - so if the curve lies below it, the model is worse than random guessing and one could simply invert the prediction
  • in a perfect model the curve touches the upper left data point, i.e. having a sensitivity and specificity of one (perfect separation)
  • This setup nicely illustrates the trade-off between sensitivity and specificity as it often occurs in reality, i.e. increasing one, decreases the other
  • The curve also allows comparing different models against each other

Although the curve gives a good indication of model performance, it is usually preferable to have a single model statistic to use as a benchmark. This is the role of the AUC (AUROC), the area under the (ROC) curve. It is simply calculated via integration. Naturally, the base is 0.5 (the diagonal) and thus the lower bound. The upper bound is therefore 1.
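
To make the construction concrete, here is a small, purely illustrative sketch that traces the curve over a grid of thresholds and approximates the AUC with the trapezoidal rule (pROC, used below, does this more rigorously):

# sensitivity and specificity over a grid of thresholds
thresholds <- seq(0, 1, by = 0.01)
roc_points <- t(sapply(thresholds, function(t) {
  pred_pos <- model_glm_1_pred > t
  c(sens = sum(pred_pos & test$default) / sum(test$default),
    spec = sum(!pred_pos & !test$default) / sum(!test$default))
}))

# trapezoidal approximation of the area under the curve
x <- 1 - roc_points[, "spec"]
y <- roc_points[, "sens"]
ord <- order(x)
sum(diff(x[ord]) * (head(y[ord], -1) + tail(y[ord], -1)) / 2)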

More details can be found e.g. in:

We will be using the pROC library and its main function pROC::roc() to compute ROC.

roc_glm_1 <- pROC::roc(response = test$default, predictor = model_glm_1_pred)
roc_glm_1
## 
## Call:
## roc.default(response = test$default, predictor = model_glm_1_pred)
## 
## Data: model_glm_1_pred in 173039 controls (test$default FALSE) < 4436 cases (test$default TRUE).
## Area under the curve: 0.6641

We see an area of 0.6641045 which is better than random guessing but not too good. Let’s plot it to see its shape. Note that by default pROC::plot.roc() puts specificity on the x-axis rather than 1 - specificity as many authors and other packages do. To use 1 - specificity, one can call the function with parameter legacy.axes = TRUE. However, the actual trade-off between sensitivity and specificity seems easier to see when plotting specificity directly, as the mind does not easily interpret one minus some quantity. We also set asp = NA to create an x-axis range from zero to one; depending on your graphics output settings this gives the standard way of plotting an ROC curve, at the risk of a distorted shape if the graphics device is not (close to) square.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green", print.auc = FALSE, print.auc.y = .4)

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664"), 
       col = c("green"), lty = 1, cex = 1.0)

We see that the curve lies above the diagonal but does not come close to the upper left corner. A more complex model may perform better but would involve a larger number of predictors, so naturally the question of variable selection comes up.

12.7 Variable selection

Many articles and book chapters have been written on the topic of variable selection (feature selection), and some authors believe it is one of the most important topics in model building, if not the most important one. A few good resources giving an overview can be found below:

The different approaches can be categorized as follows:

  • subset selection
      • best subset selection
      • stepwise selection (forward/backward/hybrid)
  • shrinkage
      • ridge regression (l2-norm)
      • lasso (l1-norm)
  • dimension reduction
      • principal component analysis (PCA)

There are many trade-offs to be made when choosing one approach over another, but the two most important considerations are statistical soundness and computational efficiency. We will not go into further details of variable selection but rather focus on the model building. We thus simply determine a predictor space that seems sensible given the prior analysis.

Let’s remind ourselves of variables left for modeling.

full_vars <- colnames(train_down)
full_vars
##  [1] "term"                        "int_rate"                   
##  [3] "grade"                       "emp_length"                 
##  [5] "home_ownership"              "annual_inc"                 
##  [7] "verification_status"         "issue_d"                    
##  [9] "pymnt_plan"                  "purpose"                    
## [11] "addr_state"                  "dti"                        
## [13] "delinq_2yrs"                 "earliest_cr_line"           
## [15] "inq_last_6mths"              "open_acc"                   
## [17] "pub_rec"                     "revol_bal"                  
## [19] "revol_util"                  "initial_list_status"        
## [21] "out_prncp_inv"               "total_rec_late_fee"         
## [23] "collection_recovery_fee"     "last_pymnt_d"               
## [25] "last_pymnt_amnt"             "last_credit_pull_d"         
## [27] "collections_12_mths_ex_med"  "mths_since_last_major_derog"
## [29] "application_type"            "acc_now_delinq"             
## [31] "tot_coll_amt"                "tot_cur_bal"                
## [33] "default"

We determine a subspace and report which variables we ignore.

model_vars <-
  c("term", "grade", "emp_length", "home_ownership", "annual_inc",
    "purpose", "pymnt_plan", "out_prncp_inv", "delinq_2yrs", "default")

ignored_vars <- dplyr::setdiff(full_vars, model_vars)
ignored_vars
##  [1] "int_rate"                    "verification_status"        
##  [3] "issue_d"                     "addr_state"                 
##  [5] "dti"                         "earliest_cr_line"           
##  [7] "inq_last_6mths"              "open_acc"                   
##  [9] "pub_rec"                     "revol_bal"                  
## [11] "revol_util"                  "initial_list_status"        
## [13] "total_rec_late_fee"          "collection_recovery_fee"    
## [15] "last_pymnt_d"                "last_pymnt_amnt"            
## [17] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [19] "mths_since_last_major_derog" "application_type"           
## [21] "acc_now_delinq"              "tot_coll_amt"               
## [23] "tot_cur_bal"

12.8 A more complex logistic model with caret

The interested reader should check the comprehensive caret documentation on how the API works and, for modeling in particular, read model training and tuning. We only give a high-level overview here.

The caret package has several functions that attempt to streamline the model building and evaluation process. The train function can be used to

  • evaluate, using resampling, the effect of model tuning parameters on performance
  • choose the “optimal” model across these parameters
  • estimate model performance from a training set

First, a specific model must be chosen (user-defined models can also be created). Second, the corresponding model parameters need to be passed to the function call. The workhorse function is caret::train(); an important input is the parameter trControl, which allows specifying a particular resampling approach, e.g. k-fold cross-validation (once or repeated), leave-one-out cross-validation or the bootstrap (simple estimation or the 632 rule).

12.8.1 trainControl

We will define the controlling parameters in a separate function call to caret::trainControl() and then pass this on to caret::train(trControl = ). This also allows using the control parameters for different models if applicable to those. The function caret::trainControl() generates parameters that further control how models are created. For more details, see The caret package 5.5.4: The trainControl Function.

We perform a 10-fold cross-validation repeated five times and return the class probabilities. Since we are interested in ROC as our performance measure, we need to set summaryFunction = twoClassSummary. If one likes to see a live log of the execution, verboseIter = TRUE provides that. Since we have no tuning parameters for logistic regression, we leave those out for now.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               savePredictions = TRUE,
               verboseIter = FALSE)

12.8.2 train

Finally, we can call caret::train() with the specified control parameters. Note that caret is picky about factor values, which need to be valid R names. We thus convert the default variable to have values “yes” and “no” rather than “TRUE” and “FALSE” (which, converted to variable names, would not be valid as they are reserved keywords). Make sure that the test set is transformed in the same way before using the model for prediction; otherwise the factor levels may differ from the training set and the class probabilities will be wrong.

We will skip the complex discussion around variable selection for now and simply determine the “base” variable universe that seems plausible from intuition and the prior exploratory analysis. We have removed categorical variables with many unique values (such as zip codes) as they would be expanded into large dummy-variable matrices.
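
As an illustration of why such variables are excluded, consider how many columns a single high-cardinality factor such as addr_state would add to the design matrix (a quick sketch; the formula interface of caret::train() performs a similar expansion internally):

# one dummy column per level (minus the reference level) plus the intercept
dim(stats::model.matrix(~ as.factor(addr_state), data = train_down))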

model_glm_2 <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ ., 
        data = ., 
        method = "glm", 
        family = "binomial",
        metric = "ROC",
        trControl = ctrl)

As with the previous model object created via stats::glm(), the resulting model object has many methods and attributes for further processing. We first look at the results/performance on the training set (which should give a fair estimate of the test set performance since we used cross-validation). Note that all reported metrics are interpretable since this is done on the balanced data.

model_glm_2
## Generalized Linear Model 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31938, 31938, 31940, 31939, 31940, 31940, ... 
## Resampling results:
## 
##   ROC        Sens       Spec    
##   0.6964721  0.6291356  0.662019

We can see that we have improved the ROCAUC (~ 70%) but got a worse sensitivity compared to the first, simpler model. Looking at the predictors, we also see how the model created dummy variables to encode the categorical predictors.

attributes(model_glm_2)
## $names
##  [1] "method"       "modelInfo"    "modelType"    "results"     
##  [5] "pred"         "bestTune"     "call"         "dots"        
##  [9] "metric"       "control"      "finalModel"   "preProcess"  
## [13] "trainingData" "resample"     "resampledCM"  "perfNames"   
## [17] "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "contrasts"    "xlevels"     
## 
## $class
## [1] "train"         "train.formula"
summary(model_glm_2)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3163  -1.0774  -0.2782   1.0635   5.6347  
## 
## Coefficients:
##                                 Estimate     Std. Error z value
## (Intercept)                -1.7381657126   0.1420955332 -12.232
## termX60.months             -0.5453038253   0.0298754606 -18.253
## gradeB                      0.8159277950   0.0461745543  17.671
## gradeC                      1.3693629935   0.0459015145  29.833
## gradeD                      1.7227789216   0.0490499846  35.123
## gradeE                      2.0514999392   0.0562719050  36.457
## gradeF                      2.3642399559   0.0741417266  31.888
## gradeG                      2.8775473195   0.1315002573  21.882
## emp_lengthX..1.year         0.0719719375   0.0634309986   1.135
## emp_lengthX1.year          -0.0231801001   0.0663542839  -0.349
## emp_lengthX10..years       -0.1447186450   0.0541082748  -2.675
## emp_lengthX2.years         -0.0155518922   0.0628511985  -0.247
## emp_lengthX3.years          0.0019357572   0.0640494674   0.030
## emp_lengthX4.years         -0.0416200714   0.0681959839  -0.610
## emp_lengthX5.years         -0.1208526942   0.0671199278  -1.801
## emp_lengthX6.years          0.0685014427   0.0713774190   0.960
## emp_lengthX7.years          0.0169740882   0.0708191548   0.240
## emp_lengthX8.years         -0.0231357344   0.0711988793  -0.325
## emp_lengthX9.years         -0.0454158473   0.0751319778  -0.604
## home_ownershipNONE         -0.5795772695   1.3066383779  -0.444
## home_ownershipOTHER        12.3599134059  91.3562921237   0.135
## home_ownershipOWN           0.1059366445   0.0397823985   2.663
## home_ownershipRENT          0.1738380528   0.0253881630   6.847
## annual_inc                 -0.0000018620   0.0000002708  -6.875
## purposecredit_card          0.0940714394   0.1290711042   0.729
## purposedebt_consolidation   0.1957294997   0.1273405178   1.537
## purposeeducational          1.4894997798   0.4601526184   3.237
## purposehome_improvement     0.2605554222   0.1355011685   1.923
## purposehouse                0.0010870105   0.2087251602   0.005
## purposemajor_purchase       0.1111783795   0.1507309963   0.738
## purposemedical              0.1525955150   0.1664861916   0.917
## purposemoving               0.4431876570   0.1857656345   2.386
## purposeother                0.0979564862   0.1356860122   0.722
## purposerenewable_energy     0.0910160004   0.4132422786   0.220
## purposesmall_business       0.2405034075   0.1572708880   1.529
## purposevacation             0.1203054032   0.1926193435   0.625
## purposewedding             -0.4075063208   0.3202319627  -1.273
## pymnt_plany                11.9809020840 154.7613613910   0.077
## out_prncp_inv               0.0000520918   0.0000017015  30.614
## delinq_2yrs                 0.0865718494   0.0125320528   6.908
##                                       Pr(>|z|)    
## (Intercept)               < 0.0000000000000002 ***
## termX60.months            < 0.0000000000000002 ***
## gradeB                    < 0.0000000000000002 ***
## gradeC                    < 0.0000000000000002 ***
## gradeD                    < 0.0000000000000002 ***
## gradeE                    < 0.0000000000000002 ***
## gradeF                    < 0.0000000000000002 ***
## gradeG                    < 0.0000000000000002 ***
## emp_lengthX..1.year                    0.25652    
## emp_lengthX1.year                      0.72684    
## emp_lengthX10..years                   0.00748 ** 
## emp_lengthX2.years                     0.80457    
## emp_lengthX3.years                     0.97589    
## emp_lengthX4.years                     0.54166    
## emp_lengthX5.years                     0.07177 .  
## emp_lengthX6.years                     0.33720    
## emp_lengthX7.years                     0.81058    
## emp_lengthX8.years                     0.74522    
## emp_lengthX9.years                     0.54552    
## home_ownershipNONE                     0.65736    
## home_ownershipOTHER                    0.89238    
## home_ownershipOWN                      0.00775 ** 
## home_ownershipRENT            0.00000000000753 ***
## annual_inc                    0.00000000000621 ***
## purposecredit_card                     0.46610    
## purposedebt_consolidation              0.12428    
## purposeeducational                     0.00121 ** 
## purposehome_improvement                0.05449 .  
## purposehouse                           0.99584    
## purposemajor_purchase                  0.46076    
## purposemedical                         0.35937    
## purposemoving                          0.01705 *  
## purposeother                           0.47033    
## purposerenewable_energy                0.82568    
## purposesmall_business                  0.12621    
## purposevacation                        0.53225    
## purposewedding                         0.20318    
## pymnt_plany                            0.93829    
## out_prncp_inv             < 0.0000000000000002 ***
## delinq_2yrs                   0.00000000000491 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49197  on 35487  degrees of freedom
## Residual deviance: 44806  on 35448  degrees of freedom
## AIC: 44886
## 
## Number of Fisher Scoring iterations: 11
predictors(model_glm_2)
##  [1] "termX60.months"            "gradeB"                   
##  [3] "gradeC"                    "gradeD"                   
##  [5] "gradeE"                    "gradeF"                   
##  [7] "gradeG"                    "emp_lengthX..1.year"      
##  [9] "emp_lengthX1.year"         "emp_lengthX10..years"     
## [11] "emp_lengthX2.years"        "emp_lengthX3.years"       
## [13] "emp_lengthX4.years"        "emp_lengthX5.years"       
## [15] "emp_lengthX6.years"        "emp_lengthX7.years"       
## [17] "emp_lengthX8.years"        "emp_lengthX9.years"       
## [19] "home_ownershipNONE"        "home_ownershipOTHER"      
## [21] "home_ownershipOWN"         "home_ownershipRENT"       
## [23] "annual_inc"                "purposecredit_card"       
## [25] "purposedebt_consolidation" "purposeeducational"       
## [27] "purposehome_improvement"   "purposehouse"             
## [29] "purposemajor_purchase"     "purposemedical"           
## [31] "purposemoving"             "purposeother"             
## [33] "purposerenewable_energy"   "purposesmall_business"    
## [35] "purposevacation"           "purposewedding"           
## [37] "pymnt_plany"               "out_prncp_inv"            
## [39] "delinq_2yrs"

We can also look at the variable importance with caret::varImp(). For linear models, the absolute value of the t-statistic for each model parameter is used.

varImp(model_glm_2)
## glm variable importance
## 
##   only 20 most important variables shown (out of 39)
## 
##                           Overall
## gradeE                    100.000
## gradeD                     96.340
## gradeF                     87.466
## out_prncp_inv              83.972
## gradeC                     81.827
## gradeG                     60.017
## termX60.months             50.059
## gradeB                     48.462
## delinq_2yrs                18.937
## annual_inc                 18.846
## home_ownershipRENT         18.770
## purposeeducational          8.866
## emp_lengthX10..years        7.323
## home_ownershipOWN           7.291
## purposemoving               6.531
## purposehome_improvement     5.261
## emp_lengthX5.years          4.925
## purposedebt_consolidation   4.202
## purposesmall_business       4.181
## purposewedding              3.477
plot(varImp(model_glm_2))
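
To see where these scores come from, the absolute z-statistics from the coefficient table above can be rescaled to the 0-100 range that caret::varImp() reports; a quick sketch (for a binomial glm the “t-statistic” is the z value column, and we assume the default scaling to 0-100):

# rescale |z| of the coefficients (excluding the intercept) to 0-100
z <- abs(coef(summary(model_glm_2$finalModel))[, "z value"])[-1]
head(sort(round(100 * (z - min(z)) / (max(z) - min(z)), 3), decreasing = TRUE))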

Finally, we use the model to predict the test set classes. Again, we need to be aware of any pre-processing required such as mutations or treatment of missing values.

model_glm_2_pred <- 
  predict(model_glm_2, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

head(model_glm_2_pred, 3)
##          no       yes
## 1 0.7041954 0.2958046
## 2 0.8783103 0.1216897
## 3 0.6509819 0.3490181

Note that caret::predict.train() returns the probabilities for all outcomes, i.e. in our binary case for yes and no. We repeat the analysis from earlier and see how the new model performs.

caret::confusionMatrix(
  data = ifelse(model_glm_2_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  109156   1482
##        yes  63876   2953
##                                              
##                Accuracy : 0.6317             
##                  95% CI : (0.6295, 0.634)    
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0378             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.66584            
##             Specificity : 0.63084            
##          Pos Pred Value : 0.04419            
##          Neg Pred Value : 0.98660            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01664            
##    Detection Prevalence : 0.37657            
##       Balanced Accuracy : 0.64834            
##                                              
##        'Positive' Class : yes                
## 

Again we build an ROC object.

temp <- as.factor(ifelse(test[complete.cases(test[, model_vars]),
                                             "default"] == TRUE, "yes", "no"))

roc_glm_2 <- 
  pROC::roc(response = temp, 
            predictor = model_glm_2_pred[, "yes"])

roc_glm_2
## 
## Call:
## roc.default(response = temp, predictor = model_glm_2_pred[, "yes"])
## 
## Data: model_glm_2_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.6995

The ROCAUC of ~ 70% is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROC from the simple model.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA, 
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7"), 
       col = c("green", "blue"), lty = 1, cex = 1.0)

We can see that the ROC curve of the more complex model lies above that of the simple model.

12.9 A note on computational efficiency (parallelization via cluster)

Some models (and their tuning parameter calibration via resampling) will require significant computational resources both in terms of memory and CPUs. There are a few ways to increase computational efficiency which can be roughly divided into

  • decrease the training set size (without losing information)
  • compute on several cores (parallelization via cluster)
  • shift matrix computations to GPU (e.g. via CUDA)
  • use compute instances from cloud offerings such as Amazon Web Services, googleCloud or Microsoft Azure

We will only look at parallelization here as it is supported by R and caret out of the box. For details, see The caret package 9: Parallel Processing, caret ml parallel and HOME of the R parallel WIKI! by Tobias Kind. There are a couple of libraries supporting parallel processing in R, but one should be aware of their support for different operating systems. On Windows, one should use doParallel, which is a parallel backend for the foreach package. Note that the multicore functionality only runs tasks on a single computer, not a cluster of computers. However, you can use the snow functionality to execute on a cluster, using Unix-like operating systems, Windows, or even a combination. It is pointless to use doParallel and parallel on a machine with only one processor with a single core. To get a speed improvement, it must run on a machine with multiple processors, multiple cores, or both. Remember: unless doParallel::registerDoParallel() is called, foreach will not run in parallel; simply loading the doParallel package is not enough. Note that we can set the parameter caret::trainControl(allowParallel = TRUE), but it is actually the default setting, i.e. caret::train() will automatically run in parallel if it detects a registered cluster, irrespective of which library was used to build it.
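
To see the registration mechanics in isolation (independent of caret), here is a minimal sketch with a deliberately tiny cluster; not run here:

library(doParallel)   # also attaches foreach, iterators and parallel

cl <- makeCluster(2)  # small cluster just for illustration
registerDoParallel(cl)
getDoParWorkers()     # reports 2 once the backend is registered

# %dopar% only runs in parallel after registerDoParallel();
# without a registered backend it falls back to sequential execution
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

stopCluster(cl)       # free the workers
registerDoSEQ()       # return foreach to sequential mode
res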

To illustrate the efficiency gains, we will benchmark an example model training using the previous glm model. We use the microbenchmark library (and its function of the same name) and execute the operation several times (to get a stable time estimate) via the parameter times. Note that parallelizing usually means creating copies of the original data for each worker (core), which requires a corresponding amount of memory. Since we only want to illustrate the logic here, we downsize the training data to avoid failures due to insufficient memory.

benchmark_train_data <- 
  train_down[seq(1, nrow(train_down), 2), model_vars] %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.))
format(utils::object.size(benchmark_train_data), units = "Mb")
## [1] "1.3 Mb"

We see the data is ~ 1 Mb in size.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE) #default

library(microbenchmark)

glm_nopar <-
  microbenchmark(glm_nopar =
      train(default ~ .,
            data = benchmark_train_data,
            method = "glm",
            family = "binomial",
            metric = "ROC",
            trControl = ctrl),
    times = 5
    )
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

We can look at the resulting run time.

glm_nopar
## Unit: seconds
##       expr      min       lq     mean   median       uq     max neval
##  glm_nopar 27.51427 27.90744 28.27665 28.38591 28.71571 28.8599     5

Now we will set up a cluster of cores and repeat the benchmarking. We use parallel::detectCores() to establish how many cores are available and use somewhat fewer than 100% of them to leave some resources for the system. Usually, we would write a log of the distributed calculation via parallel::makeCluster(outfile = ); the log file is opened in append mode as all workers write to it. However, we ran into issues with parallel::makeCluster(outfile = ), so we simply register the cluster directly via registerDoParallel(), which has no option to define a log file.

# doParallel would also load parallel
library(parallel)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.4.2
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
cores_2_use <- floor(0.8 * detectCores())
# cl <- makeCluster(cores_2_use, outfile = "./logs/parallel_log.txt")
# registerDoParallel(cl, cores_2_use)
registerDoParallel(cores = cores_2_use)
getDoParWorkers()
## [1] 6

We can inspect the open connections via base::showConnections().

knitr::kable(base::showConnections())
   description       class            mode   text     isopen   can read   can write
4  output            textConnection   wr     text     opened   no         yes
5  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
6  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
7  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
8  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
9  <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes
10 <-eclipse:11076   sockconn         a+b    binary   opened   yes        yes

We then tried running in parallel, but on Windows the setup may produce the following error.

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : unable to find variable "optimismBoot"

This seems to be a bug as noted in the SO question Caret train function - unable to find variable “optimismBoot”. The bug seems to be solved in the caret development version, which can be installed from GitHub via devtools::install_github('topepo/caret/pkg/caret') (note that building caret from source requires RBuildTools, which are part of RTools and need to be installed separately, e.g. from Building R for Windows - RStudio may manage the install for you). The issue is also tracked on GitHub under optimismBoot error #706 and may already be resolved in the caret version you are using.

We do not run the installation of the development version here, but you should in case you do not have it and wish to test the parallel execution. Before you do, make sure to close the cluster with the code below (not executed now as we continue).

parallel::stopCluster(cl)
foreach::registerDoSEQ()
glm_par <-
  microbenchmark(glm_par =
    train(default ~ .,
            data = benchmark_train_data,
            method = "glm",
            family = "binomial",
            metric = "ROC",
            trControl = ctrl),
    times = 5
    )

We make sure to stop the cluster and force R back to sequential execution by using foreach::registerDoSEQ(). In case you used the setup via parallel::makeCluster(), you need to call parallel::stopCluster() first. Otherwise, memory and CPU may remain occupied by leftover clusters.

#parallel::stopCluster(cl)
foreach::registerDoSEQ()

Note that we ran into issues when using too many system resources on a Windows 7 workstation, as documented in the SO question caret train binary glm fails on parallel cluster via doParallel. The error below was produced.

Error in serialize(data, node$con) : error writing to connection

However, on a Windows 10 machine with fewer cores and the development version of caret, the issue could not be reproduced.

We have also tried running caret::train() in parallel on Linux (Ubuntu) via the doMC library and found that the setup below runs just fine and also faster than a sequential execution. We include the code here for information only, without executing it.

library(doMC)

cores_2_use <- floor(0.8 * parallel::detectCores())
registerDoMC(cores_2_use)

microbenchmark(
  glm_par =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

Finally, let’s compare parallel versus non-parallel execution time. We can see that the parallel setup performed significantly better, although we would have expected an even larger gain given that we use a multiple of the cores of the sequential setup.

glm_nopar
## Unit: seconds
##       expr      min       lq     mean   median       uq     max neval
##  glm_nopar 27.51427 27.90744 28.27665 28.38591 28.71571 28.8599     5
glm_par
## Unit: seconds
##     expr      min       lq     mean  median       uq      max neval
##  glm_par 13.65229 13.79392 17.34424 13.8397 13.89238 31.54293     5

12.10 A note on tuning parameters in caret

When it comes to more complex models, tuning parameters can often be used to improve the model. There are different ways to provide the range of parameter values to try. In particular, one may use

  • a pre-defined set of values (e.g. if domain knowledge already provides an idea)
  • use grid search (default in caret::train())
  • use random search

In general, the more tuning parameters are available, the higher the computational cost of selecting them. We will show some applications of these options below. For further information, see The caret package 5: Model Training and Tuning.
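
As a sketch of how the three options translate into code (reusing the train_down, model_vars and ctrl objects defined above; shown for the rpart model of the next section and not run here):

# prepare the modeling data as in the earlier caret examples
dat <- train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.))

# (1) pre-defined values via tuneGrid
m_grid <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
                tuneGrid = expand.grid(cp = c(0.0005, 0.005, 0.05)),
                trControl = ctrl)

# (2) default grid search; caret picks tuneLength candidate values itself
m_len <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
               tuneLength = 5, trControl = ctrl)

# (3) random search: request it in trainControl, tuneLength sets the budget
ctrl_rnd <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                         summaryFunction = twoClassSummary, search = "random")
m_rnd <- train(default ~ ., data = dat, method = "rpart", metric = "ROC",
               tuneLength = 5, trControl = ctrl_rnd)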

12.11 Trees

Trees stratify or segment the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Trees are easy to interpret and build, but as pointed out in “An Introduction to Statistical Learning” they may not compete with the best supervised learning approaches:

Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

Following this guidance, we will also look into some alternatives to / enhancements of the basic tree model used in CART. Some further resources on trees can be found in:

As usual, there are many ways to implement tree models in R but we will stay within the caret package and use the API that was already explained earlier.

12.11.1 Simple tree using CART via rpart in caret

We start by building a simple tree using the CART method via rpart in caret. Note that caret may not offer the full set of features that are available when using the rpart library directly but it should suffice for our illustration. For more details, see rpart on CRAN and in particular An Introduction to Recursive Partitioning Using the RPART Routines for some background.

We define our control and train functions as before and look at the results.

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

# library(rpart)

model_rpart <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rpart',
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut
## = 10, : These variables have zero variances: home_ownershipNONE
model_rpart
## CART 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31939, 31938, 31940, 31939, 31940, 31939, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec     
##   0.01445641  0.7018554  0.5572830  0.7674466
##   0.03111086  0.6461547  0.3900115  0.8799842
##   0.26027166  0.5707699  0.6019078  0.5396320
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01445641.

We see that the ROCAUC is ~ 70% for the best model iteration, but sensitivity is worse than in some previous models. caret stepped through several values of the complexity parameter (cp) and printed the results for them. We can also inspect the different cp values visually. In fact, under caret this is the only tuning parameter available for method = "rpart".

ggplot(model_rpart)

We can also visualize the final tree by accessing the finalModel element from the caret::train() model object. We have to add the text manually to the plot via graphics::text(). We use a larger margin to have enough space for the text.

plot(model_rpart$finalModel, uniform = TRUE, margin = 0.2)
graphics::text(model_rpart$finalModel)

Apparently, there are only two splits, for out_prncp_inv and gradeB.
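
For a non-graphical view of the same fitted tree, rpart's own printed summaries can be applied directly to the finalModel (a quick sketch):

# node-by-node split listing and the complexity (cp) table
print(model_rpart$finalModel)
rpart::printcp(model_rpart$finalModel)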

Rather than using the default grid search for choosing the tuning parameter cp, we could use random search (as mentioned earlier). For a theoretical background on why this might be useful, see e.g. Random Search for Hyper-Parameter Optimization. We implement random search via the parameter caret::trainControl(search = "random").

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 10,
               repeats = 5,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE,
               search = "random")

model_rpart <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rpart',
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)

model_rpart
## CART 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 31939, 31940, 31940, 31938, 31940, 31940, ... 
## Resampling results across tuning parameters:
## 
##   cp            ROC        Sens       Spec     
##   0.0002254410  0.7418953  0.5806143  0.7904981
##   0.0002536211  0.7433551  0.5725220  0.8003501
##   0.0005072423  0.7462419  0.5532501  0.8225663
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.0005072423.

We can see that ROCAUC went up to ~ 74% with the random-search cp parameterization, which ended up using a much smaller value than the grid search, while sensitivity for the selected model stayed roughly the same. We again visualize the tree to see how it changed.

plot(model_rpart$finalModel, uniform = TRUE, margin = 0.1)
graphics::text(model_rpart$finalModel, cex = 0.5)

Clearly the model has become more complex, which is expected given the much smaller cp value, and its ROCAUC went up. As before, we will predict the test outcome with the final model.

model_rpart_pred <- 
  predict(model_rpart, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_rpart_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  91226   682
##        yes 81806  3753
##                                              
##                Accuracy : 0.5352             
##                  95% CI : (0.5329, 0.5375)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0377             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.84622            
##             Specificity : 0.52722            
##          Pos Pred Value : 0.04386            
##          Neg Pred Value : 0.99258            
##              Prevalence : 0.02499            
##          Detection Rate : 0.02115            
##    Detection Prevalence : 0.48211            
##       Balanced Accuracy : 0.68672            
##                                              
##        'Positive' Class : yes                
## 

Compute ROC.

roc_rpart <- 
  pROC::roc(response = temp, 
            predictor = model_rpart_pred[, "yes"])

roc_rpart
## 
## Call:
## roc.default(response = temp, predictor = model_rpart_pred[, "yes"])
## 
## Data: model_rpart_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7463

The ROCAUC is ~ 74%, which is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROCs from the earlier models.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

pROC::plot.roc(x = roc_rpart, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "orange")


legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7",
                                   "rpart AUC = 0.74"), 
       col = c("green", "blue", "orange"), lty = 1, cex = 1.0)

12.11.2 Random forest via randomForest in caret

For random forests, the library randomForest is used. The tuning parameter is mtry, the number of variables randomly sampled as candidates at each split. For computational reasons, we limit the number of trees via caret::train(ntree = ), the number of folds via caret::trainControl(number = ) and the resampling repeats via caret::trainControl(repeats = ).

# library(randomForest)

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 5,
               repeats = 1,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

model_rf <-
  train_down %>%
  select(model_vars) %>%
  mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
  filter(complete.cases(.)) %>%
  train(default ~ .,
        data = .,
        method = 'rf',
        ntree = 10,
        metric = "ROC",
        preProc = c("center", "scale"),
        trControl = ctrl)

model_rf
## Random Forest 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 28391, 28390, 28390, 28391, 28390 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.6877044  0.5636517  0.7134646
##   20    0.7041407  0.6133559  0.6806064
##   39    0.6982071  0.6123415  0.6655027
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 20.

We can see that the ROCAUC is ~ 70% at mtry = 20. We again visualize the final model, but since it is a combination of many trees, the plot looks different: it shows the error as a function of the number of trees.

plot(model_rf$finalModel)

The error would most likely go further down if we had allowed more trees.
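
A useful by-product of the forest is its variable importance (mean decrease in Gini impurity), available from the underlying randomForest object; a quick sketch:

# top variables by mean decrease in Gini impurity
head(sort(randomForest::importance(model_rf$finalModel)[, "MeanDecreaseGini"],
          decreasing = TRUE), 10)

# or on caret's 0-100 scale
varImp(model_rf)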

model_rf_pred <- 
  predict(model_rf, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_rf_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  114693   1723
##        yes  58339   2712
##                                              
##                Accuracy : 0.6616             
##                  95% CI : (0.6594, 0.6638)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.038              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.61150            
##             Specificity : 0.66284            
##          Pos Pred Value : 0.04442            
##          Neg Pred Value : 0.98520            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01528            
##    Detection Prevalence : 0.34401            
##       Balanced Accuracy : 0.63717            
##                                              
##        'Positive' Class : yes                
## 

Compute ROC.

roc_rf <- 
  pROC::roc(response = temp, 
            predictor = model_rf_pred[, "yes"])

roc_rf
## 
## Call:
## roc.default(response = temp, predictor = model_rf_pred[, "yes"])
## 
## Data: model_rf_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7021

As the ROCAUC is not better than before, we will not plot it against earlier models.

12.11.3 Stochastic Gradient Boosting via gbm in caret

For Stochastic Gradient Boosting, the library gbm is used. Tuning parameters are

  • n.trees: total number of trees to fit
  • interaction.depth: maximum depth of variable interactions
  • shrinkage: shrinkage parameter applied to each tree in the expansion
  • n.minobsinnode: minimum number of observations in the trees’ terminal nodes
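
Note that caret's default grid below only varies n.trees and interaction.depth, holding shrinkage and n.minobsinnode constant. To search over all four parameters, an explicit grid can be supplied via tuneGrid; a sketch (not run here):

# a custom grid covering all four gbm tuning parameters
gbm_grid <- expand.grid(n.trees           = c(50, 100, 150),
                        interaction.depth = 1:3,
                        shrinkage         = c(0.05, 0.1),
                        n.minobsinnode    = c(10, 20))

# passed to the same call as below, e.g.
# train(default ~ ., data = ., method = "gbm", metric = "ROC",
#       trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)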

For some background, see e.g.

# library(gbm)

ctrl <- 
  trainControl(method = "repeatedcv", 
               number = 5,
               repeats = 1,
               classProbs = TRUE,
               summaryFunction = twoClassSummary,
               verboseIter = FALSE,
               allowParallel = TRUE)

model_gbm_1 <- 
   train_down %>%
   select(model_vars) %>%
   mutate(default = as.factor(ifelse(default == TRUE, "yes", "no"))) %>%
   filter(complete.cases(.)) %>%
   train(default ~ ., 
         data = ., 
         method = "gbm",
         metric = "ROC",
         trControl = ctrl,
         preProc = c("center", "scale"),
         ## This last option is actually one
         ## for gbm() that passes through
         verbose = FALSE)

model_gbm_1
## Stochastic Gradient Boosting 
## 
## 35488 samples
##     9 predictor
##     2 classes: 'no', 'yes' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 28390, 28391, 28390, 28390, 28391 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.7218075  0.4989575  0.7923127
##   1                  100      0.7347236  0.5438152  0.7776594
##   1                  150      0.7376056  0.5922232  0.7565241
##   2                   50      0.7351958  0.5467456  0.7754611
##   2                  100      0.7416798  0.6105945  0.7484081
##   2                  150      0.7480446  0.6190476  0.7499302
##   3                   50      0.7387262  0.6076078  0.7507755
##   3                  100      0.7488759  0.6164553  0.7538746
##   3                  150      0.7515752  0.6082840  0.7630048
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Plotting the boosting iterations.

ggplot(model_gbm_1)
## Warning: Ignoring unknown aesthetics: shape

Predict.

model_gbm_1_pred <- 
  predict(model_gbm_1, 
          newdata = test %>% 
            select(model_vars) %>%
            mutate(default = as.factor(ifelse(default == TRUE, 
                                              "yes", "no"))) %>%
            filter(complete.cases(.)), 
          type = "prob")

caret::confusionMatrix(
  data = ifelse(model_gbm_1_pred[, "yes"] > 0.5, "yes", "no"), 
  reference = as.factor(ifelse(test[complete.cases(test[, model_vars]), 
                                    "default"] == TRUE, "yes", "no")),
  positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  104238   1033
##        yes  68794   3402
##                                              
##                Accuracy : 0.6065             
##                  95% CI : (0.6043, 0.6088)   
##     No Information Rate : 0.975              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0438             
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.76708            
##             Specificity : 0.60242            
##          Pos Pred Value : 0.04712            
##          Neg Pred Value : 0.99019            
##              Prevalence : 0.02499            
##          Detection Rate : 0.01917            
##    Detection Prevalence : 0.40681            
##       Balanced Accuracy : 0.68475            
##                                              
##        'Positive' Class : yes                
## 
roc_gbm_1 <- 
  pROC::roc(response = temp, 
            predictor = model_gbm_1_pred[, "yes"])

roc_gbm_1
## 
## Call:
## roc.default(response = temp, predictor = model_gbm_1_pred[, "yes"])
## 
## Data: model_gbm_1_pred[, "yes"] in 173032 controls (temp no) < 4435 cases (temp yes).
## Area under the curve: 0.7516

The ROCAUC is ~ 75%, which is nearly the same as for the cross-validated training set. We plot the curve comparing it against the ROCs from all earlier models.

pROC::plot.roc(x = roc_glm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               col = "green")

pROC::plot.roc(x = roc_glm_2, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "blue")

pROC::plot.roc(x = roc_rpart, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "orange")

pROC::plot.roc(x = roc_gbm_1, legacy.axes = FALSE, xlim = c(1, 0), asp = NA,
               add = TRUE, col = "brown")

legend(x = "bottomright", legend=c("glm_1 AUC = 0.664", "glm_2 AUC = 0.7",
                                   "rpart AUC = 0.74", "gbm AUC = 0.753"), 
       col = c("green", "blue", "orange", "brown"), lty = 1, cex = 1.0)