Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
SEPTEMBER 2008
Pavel Brusilovskiy, Business Intelligence Solutions
David Johnson, Strategic Link Consulting
Introduction
This white paper is the result of joint work between Business Intelligence Solutions (BIS) and Strategic Link Consulting (SLC).
Business Intelligence Solutions (www.bisolutions.us) is a well established statistical/data mining/GIS company that conducts business for banking, finance, insurance and other industries. Our specialization is complex unstructured business problems for data rich firms. Our multidisciplinary team includes professionals in applied statistics, data mining, optimization and simulation, GIS, and software application development. The team members are authors of more than 100 published papers on diverse applications of data mining and other quantitative fields.
Business Intelligence Solutions has access to the best statistical, visualization, data mining and GIS software on the world market. The essence of our approach is to understand and analyze our client's business problem and corresponding data through the prism of dissimilar statistical/data mining models. As a result, we are always able to produce the best possible model and help our clients in the most effective and scientifically sound way.
Strategic Link Consulting (www.strategiclinkconsulting.com) represents multiple online personal loan clients within the subprime lending industry. Loan amounts vary based on the customer's income as a primary determinant of their ability to pay. Returning customers are eligible for larger loans with more stringent income requirements. The interest rate is a nonnegotiable flat rate based on the duration of the loan. Returning customers are offered larger loans with lower fees. Payment schedules are derived from customer pay frequency (weekly, biweekly, semimonthly or monthly). Customers may pay in full on their due date or refinance by paying either a portion of the principal or only the fee as allowed by applicable laws.
Customers qualify for a loan after completing a waterfall of underwriting phases which consists of internal fraud and duplication checks, identity verification and external credit checks (not Trans Union, Equifax or Experian). These steps produce a score, similar to a FICO score, which determines if a customer is approved or denied based on adverse data components derived from their external data sources. The funding/origination of a loan is based on a verbal verification process that includes several manual steps including contacting the customer directly.
Objectives, Goals, and Problem Statement
As a rule, a lender must decide whether to grant credit to a new applicant. The methodology and techniques that provide the answer to this question are called credit scoring. This white paper is dedicated to the development of credit scoring models for online personal loans.
Taking into account the nonlinearity of the relationship between overall customer risk and predictors, the primary objective is to develop a nonparametric and nonlinear credit scoring model within data mining paradigm that will predict overall customer risk with maximum possible accuracy. This objective implies several goals:
 Create a regression type credit scoring model that predicts overall customer risk on a 100 point scale, using the binary assessment of customer risk (good customer/bad customer).
 Identify the importance of the predictors, and the drivers of being a good customer in order to separate good behavior from bad.
 Develop the basis for a customer segmentation model that uses overall customer risk assessment to predict high (H), medium (M) and low (L) risk customers.
 Show the fruitfulness of the synergy of credit scoring modeling and Geographical Information Systems (GIS).
The outcome of the regression scoring model can be treated as the probability of being a good customer. The segmentation rule depends on two positive thresholds h1 and h2, h2<h1<1. If for a given customer the probability of being a good customer is greater than h1, where h1 is a large enough threshold (e.g., 0.75), then the customer belongs to the low risk segment. If, however, the probability of being a good customer is less than h1 but greater than h2 (e.g., h2=0.5), then the customer belongs to the medium risk segment. Finally, if the probability that the customer is a good customer is less than h2, he belongs to the high risk segment. The thresholds h1 and h2 should be provided by SLC, or their optimal values can be determined by BIS as a result of minimization of the corresponding cost matrix.
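The segmentation rule described above can be sketched as a small function. The threshold values used as defaults here are the illustrative figures from the text (0.75 and 0.5), not values actually chosen by SLC or BIS:

```python
def risk_segment(p_good, h1=0.75, h2=0.5):
    """Assign a risk segment from the model's probability of being a good customer.

    h1 and h2 (h2 < h1 < 1) are illustrative placeholders; per the paper they
    should be provided by SLC or found by minimizing the cost matrix.
    """
    if p_good > h1:
        return "Low"       # very likely a good customer
    if p_good > h2:
        return "Medium"    # between the two thresholds
    return "High"          # probability of being good is below h2
```

For example, a customer scored at 0.9 falls into the low-risk segment, one scored at 0.6 into the medium-risk segment, and one scored at 0.3 into the high-risk segment.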
Risk scoring is a tool that is widely used to evaluate the level of credit risk associated with a customer. While it does not identify "good" (no negative behavior) or "bad" (negative behavior expected) applicants on an individual basis, it provides the statistical odds, or probability, that an applicant with any given score will be "good" or "bad" (6, p.5).
Scorecards are viewed as a tool for better decision making. There are two major types of scorecards: traditional and nontraditional. The first one, in its simplest form, consists of a group of "attributes" that are statistically significant in separating good and bad customers. Each attribute is associated with some score, and the total score for an applicant is the sum of the scores for each attribute present in the scorecard for that applicant.
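The sum-of-attribute-scores mechanism of a traditional scorecard can be sketched in a few lines. The attribute names and point values below are hypothetical, purely for illustration:

```python
# Hypothetical traditional scorecard: each attribute an applicant has
# contributes a fixed number of points; the total is the applicant's score.
SCORECARD = {                          # attribute -> points (illustrative)
    "age_30_plus": 25,
    "bank_verification_completed": 40,
    "income_over_2000": 30,
}

def total_score(applicant_attributes):
    """Sum the points of every scorecard attribute the applicant possesses."""
    return sum(points for attr, points in SCORECARD.items()
               if attr in applicant_attributes)
```

The transparency of this additive form is exactly what makes traditional scorecards easy to interpret and explain.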
Traditional scorecards have several advantages (6, p.26-27):
 easy to interpret (there is no requirement for Risk Managers to know in depth statistics or data mining);
 easy to explain to a customer why an application was rejected;
 scorecard development process is transparent (not a black box) and is widely understood;
 scorecard performance is easy to evaluate and monitor.
The disadvantage of traditional scorecards is their accuracy. As a rule, nontraditional scorecards (which can be represented as a data mining nonlinear and nonparametric logistic regression) outperform traditional scorecards. Since each percent gained in credit assessment accuracy can lead to huge savings, this disadvantage is crucial for credit scoring applications. Modern technology allows us to easily apply a very complex data mining scoring model to new applicants, and to dramatically reduce the Good/Bad misclassification rate.
This white paper is dedicated to nontraditional scorecard development within a data mining paradigm.
Data Structure
This study is based on the SLC sample of 5,000 customers, including 2,500 Good customers and 2,500 Bad customers, one record per customer. According to the rule of thumb (6, p. 28), one should have at least 2,000 bad and 2,000 good accounts within a defined time frame in order to get a chance to develop a good scorecard. Therefore, in principle, the given sample of accounts is suitable for scorecard development.
Each record can be treated as a data point in a high-dimensional space (approximately 50 dimensions). In other words, each customer is characterized by 50 attributes (variables) that are differently scaled. The following variable types are present in the data:
 numeric (interval scaled) variables such as age, average salary, credit score (industry specific credit bureau), etc;
 categorical (nominal), with a small number of categories such as periodicity (reflects payroll frequency) with just 4 categories;
 categorical, with a large number of categories (e.g., employer name, customerâ€™s bank routing number, email domain, etc)
 date variables (application date, employment date, due date, etc)
The data also include a geographic variable (customer ZIP), and several customer identification variables such as customer ID, user ID, application number, etc. Unfortunately, the data does not include psychographic profiling variables.
There are several specific variables that we would like to mention:
BV Completed is a variable that answers whether the customer had a bank verification completed by the loan processor. A value of 1 means the bank verification was completed. A missing value or 0 means it was not. Bank verification involves a 3 way call with the customer and their bank to confirm deposits, account status, etc.
Score is an industry specific credit bureau score.
Email Domain is a variable that reflects an ending part of the email address after the @ symbol.
The variable Monthly means monthly income in dollars.
Required Loan Amount is the principal amount of a loan at the time of origination.
Credit Model is a predictor that can take the following values:
 New customer scorecards - there are three credit bureau scorecards, each with progressively more stringent approval criteria. The baseline scorecard has only identity verification and an OFAC check, while the tightest scorecard has a variety of criteria including inquiry restrictions, information about prior loan payment history, and fraud prevention rules. New customers are limited to standard loan amounts with standard fees, subject to meeting income requirements.
 Returning customers have minimal underwriting and are eligible for progressively larger loan amounts with a fee below the standard fee for new customers.
Isoriginated is either 1 for originated loans or 0 for unoriginated loans. Withdrawn applications and denied applications will have values of 0. Loans that were funded and had a payment attempt will have a value of 1.
Loan Status is the status of the loan. Loan statuses are grouped as follows:
 D designates the class of Good Customers (a loan is paid off successfully with no return).
 P, R, B, and C designate the class of Bad Customers.
Other variable names are selfexplanatory.
The available variables can be classified according to their role in the model development process. In statistical terms, variables can be dependent or independent, but in data mining, a dependent variable is called a target, and an independent variable is called an input (or predictor).
It makes sense to consider a two-segment analysis of risk. In two-segment analysis, the target is a binary variable Risk (Good, Bad), or Risk Indicator (1, 0), where 1 corresponds to a Good customer (Risk = Good) and 0 corresponds to a Bad customer (Risk = Bad).
As we mentioned before, each target is associated with a unique optimal regression type model. The outcome of each model can be treated as the corresponding probability of target = 1 which, in turn, can be interpreted as a credit score on a 100 point scale. In other words, the model under consideration serves to estimate probability/credit score of being a Good customer.
Exploratory Data Analysis and Data Preprocessing
Exploratory Data Analysis (EDA) and data preprocessing are time consuming but necessary steps of any data analysis and modeling project, and data mining is no exception (see, for example, 9). All major data mining algorithms are computationally intensive, and data preprocessing can significantly improve the quality of the model.
The objectives of EDA include understanding the data better, evaluating the feasibility and accuracy of overall customer risk assessment, estimating the predictability of Good/Bad customers, and identifying the best methodology for credit scoring modeling, in particular for segmenting customers into High, Medium, and Low risk.
SLC data preprocessing might include reduction of the number of categories, creation of new variables, treatment of missing values, etc. For example, in the categorical variable Application Source, the first four characters indicate the market source. When the Market Source variable was constructed, the frequency of each category was calculated (second column in boxes of Graph 1). It turned out that this variable has 45 distinct values, but only 18 categories are large. The rest of the categories were grouped into a new category, OTHR, and the frequency distribution of the modified variable is presented in the left box of Graph 1.
This example demonstrates the necessity of these preliminary steps: it turns out that the constructed variable Market Source Grouped is selected as an important predictor, whereas the original Market Source variable is not.
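The category-grouping step described above can be sketched as follows. The frequency cutoff of 50 used here is an assumed illustrative value, not the threshold actually applied in the study:

```python
from collections import Counter

def group_rare_categories(values, min_count=50, other_label="OTHR"):
    """Replace categories whose frequency falls below min_count with OTHR.

    This mirrors collapsing 45 distinct Market Source values down to the
    18 large categories plus an OTHR bucket; min_count is a placeholder.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]
```

Grouping before modeling keeps rare categories from fragmenting the data into cells too small to support reliable split decisions.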
Graph 1a. Variable Transformation / Grouping: Market Source variable
Another problem with the data is the misspelling and/or double name of one and the same category for some categorical variables. In particular, the variable Email Domain has a lot of errors in the correct spelling of a domain. For instance, there are 5 different spelling versions of yahoo.com:
Email Domain     Number of Customers
yaho.com            2
yahoo.com        2023
yhaoo.com           3
Yahoo.com           6
YAHOO.COM         402
YAOO.COM            1
and 7 different versions of the domain sbcglobal.net:
Email Domain     Number of Customers
sbcglobal.ne        1
sbcglobal.net     194
sbcgloblal.net      1
sbcgolbal.net       1
sbclobal.net        1
SBCGLOBA.NET        1
SBCGLOBAL.NET      33
In order to produce meaningful results, all misspellings should be corrected.
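One possible cleanup pass is to case-fold the domain and then map known misspellings to their canonical form. The lookup below is built only from the variants listed in the tables above; a production version would need a fuller list or fuzzy matching:

```python
# Known misspellings (from the tables above) mapped to canonical domains.
CANONICAL = {
    "yaho.com": "yahoo.com", "yhaoo.com": "yahoo.com", "yaoo.com": "yahoo.com",
    "sbcglobal.ne": "sbcglobal.net", "sbcgloblal.net": "sbcglobal.net",
    "sbcgolbal.net": "sbcglobal.net", "sbclobal.net": "sbcglobal.net",
    "sbcgloba.net": "sbcglobal.net",
}

def clean_domain(domain):
    """Lowercase the domain, then correct it if it is a known misspelling."""
    d = domain.strip().lower()
    return CANONICAL.get(d, d)
```

After this pass, all six yahoo.com variants collapse into one category, so the variable's frequency distribution reflects real behavior rather than typing errors.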
According to our intuition, the variable Score is the most important to correctly predict the probability of being a good customer. The first thing that can be done is discriminating between customers, using just the Score predictor. Graph 1b shows that it is not easy to do manually.
Graph 1b.Distribution of the variable Score for Bad and Good customers
Construction of additional variables can dramatically improve the accuracy of risk prediction. New time duration variables
orig_duration = Origination Date - Application Date
emp_duration = Origination Date - Employment Date
due_duration = Loan Due Date - Origination Date
serve as examples of new variable creation. For the sake of illustrating the importance of data preprocessing, we can mention here that the latter two of these three variables were important predictors selected by the TreeNet algorithm.
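Creating these duration variables is a simple date subtraction; a minimal sketch:

```python
from datetime import date

def durations(application, origination, employment, due):
    """Derive the three duration variables (in days) from the raw date fields."""
    return {
        "orig_duration": (origination - application).days,  # app -> funding
        "emp_duration":  (origination - employment).days,   # job tenure at funding
        "due_duration":  (due - origination).days,          # loan term
    }
```

For instance, a loan applied for on 2008-01-01, originated on 2008-01-03, with employment since 2005-01-03 and a due date of 2008-01-17 yields durations of 2, 1095, and 14 days.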
In order to better understand the relationship between several interval scaled (continuous) variables, quite often a special visualization tool (a matrix plot) is used. The matrix plot (Graph 2) was developed for the following four variables: Requested Loan Amount, Finance Charge, Score, and Applicant Age for both segments: Good and Bad customers.
There is no obvious difference in the relationship of any pair of variables between two segments of customers (Good /Bad).
Graph2. Matrix Plots and Histograms
The correlation structure of interval scaled variables can be different for different segments. In order to check this hypothesis, let us select all interval scaled variables and estimate the nonparametric correlation coefficient (Spearman correlation) for each pair of the following 8 variables: Required Loan Amount, Finance Charge, Average Salary, Score, Applicant Age, and the three duration variables defined above. The Spearman correlation is used when a pair of variables does not follow a bivariate normal distribution.
Graph 3. Nonparametric correlation analysis, based on Spearman correlation
There are no obvious differences in correlation structure among 8 numeric variables. In other words, the correlation structure is similar among predictors for Good and Bad segments.
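Spearman correlation is simply the Pearson correlation computed on ranks, which is why it needs no normality assumption. A self-contained sketch (ties receive averaged ranks):

```python
def ranks(xs):
    """Average ranks, 1-based; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                       # extend over a run of tied values
        avg = (i + j) / 2 + 1            # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Computing this coefficient separately for the Good and Bad subsamples, pair by pair, produces the two correlation matrices compared in Graph 3.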
The complexity of SLC data can be characterized by:
 High dimensionality (about 50 predictors)
 Uncharacterizable nonlinearities
 Presence of differently scaled predictors (numeric and categorical)
 Missing values for some predictors
 Large percentage of categorical predictors with extremely large numbers of categories and extremely nonuniform frequency distributions
 Nonnormality of numeric predictors.
Therefore, complex sophisticated methods should be employed to separate good and bad accounts in the SLC data.
Methodology
Data and problem specificity limit the number of algorithms that can be used for SLC data analysis. Any traditional parametric regression modeling approach (such as statistical logistic regression) and any traditional nonparametric regression (such as Lowess, Generalized Additive Models, etc.) are inadequate for such problems. The main reason for this is the presence of a large number of categorical variables with huge numbers of categories. The inclusion of such categorical information in a multidimensional dataset imposes a serious challenge to the way researchers analyze data (10).
Any approach based on linear, integer or nonlinear programming (see, for example, 7, Chapter 5), is also not the best approach for the same reasons.
Within the data mining universe, only some algorithms can be applicable to SLC data. For example, data mining cluster analysis algorithms available in some of the best data mining software (SAS Enterprise Miner and SPSS Clementine) are based on Euclidean distance and cannot be used for the same reasons as above.
On the other hand, preliminary analyses and modeling that we conducted have shown that the accuracy of the nonlinear, nonparametric regression type models generated by the TreeNet and Random Forest algorithms is acceptable.
We should note that the use of each of the applicable methods implies that the original data are randomly separated into two parts: the first is for training (model development) and the second is for validation of the model. Validation is the process of testing the developed model on unseen data.
SLC describes possible findings in the analysis by the following example: "Customers that are 29 years old, live in Pennsylvania, and make less than $2,000 per month have an 88% chance of default." This is a typical representation of the findings of a CART / CHAID type regression tree algorithm. The set of CART / CHAID type rules can easily be applied to unseen data, and can be embedded into the SLC loan credit risk evaluation online system.
Unfortunately, it is quite possible that the best credit scoring model will not be a CART / CHAID type of regression tree model. Nonparametric and nonlinear TreeNet or Random Forest type regression models as a rule outperform CART / CHAID type models on data similar to SLC data. If this is the case, a simple representation of the best model as a set of simple rules as mentioned above is impossible. We gain accuracy of risk prediction, but lose simplicity of model/finding representation.
If this tradeoff between prediction accuracy and model representation simplicity is to be resolved in favor of accuracy, then the project should include the development of a .Net component implementing the best TreeNet or Random Forest scoring model that could run independently of Salford Systems software and be integrated into the SLC loan credit risk evaluation online system.
In addition, standard data mining tools such as SPSS Clementine, SAS Enterprise Miner, and Salford Systems have between 20 and 100 model parameter options that need to be specified by the researcher. The settings that produce the best model can only be found through extensive, systematic experimentation by a data mining expert. For the SLC business problem, the optimal model would mean a significantly reduced number of misclassified customers (i.e., customers with a wrongly estimated credit risk level). However, the search for the optimal model is a combination of art and science, and again requires experience and expertise in the data mining field.
Stochastic Gradient Boosting (TreeNet) Overview
Stochastic gradient boosting was invented in 1999 by Stanford University Professor Jerome Friedman (1, 2). Salford Systems, a California-based data mining software development company (http://www.salfordsystems.com), implemented and commercialized this invention as the TreeNet product in 2002. TreeNet was the first stochastic gradient boosting tool in the data mining industry. Intensive research has shown that TreeNet models are among the most accurate of any known modeling techniques. TreeNet is also known as Multiple Additive Regression Trees (MART).
The TreeNet model is a nonparametric regression and can be described as a linear combination of small trees (3, 4, 5):

Predicted Target = A0 + B1 x T1(X) + B2 x T2(X) + ... + BN x TN(X)

Here the first term A0 is the model starting point; as a rule, it is the median of the target. The idea of the algorithm is the following. The residuals are calculated as the difference between A0 and the observed values. The residuals are then transformed in order to reduce the impact of outliers (Huber's adjustment for outliers); the transformed residuals are called pseudo-residuals. The first tree T1(X) is fitted to the pseudo-residuals, and the coefficient B1 is determined. After that, new pseudo-residuals - the difference between the predicted values of the target (using the model A0 + B1 x T1(X)) and the observed values - are calculated, and the second tree T2(X) is fitted to these new pseudo-residuals. This process is repeated, and the final predicted value of the target is formed by adding the weighted contribution of each tree with the corresponding weights B1, B2, ..., BN. The TreeNet algorithm typically generates hundreds or even thousands of small trees. This sequential error-correcting process converges to an accurate model that is highly resistant to outliers and misclassified data points.
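The sequential error-correcting idea can be illustrated with a deliberately simplified sketch: depth-1 trees (stumps) on a single predictor, plain squared-error residuals, and a constant learning rate standing in for the weights B1, ..., BN. This omits Huber's outlier adjustment, stochastic subsampling, and everything else that distinguishes the actual TreeNet implementation:

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree: find the split on x that minimizes
    squared error, predicting the mean residual on each side."""
    best = None
    for split in sorted(set(x)):
        left  = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda xi: lm if xi <= split else rm

def boost(x, y, n_trees=50, learn_rate=0.1):
    """Sequential error correction: each small tree is fitted to the current
    residuals and added with a constant weight (standing in for the B_i)."""
    a0 = sorted(y)[len(y) // 2]          # start from the median of the target
    trees, pred = [], [a0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learn_rate * tree(xi) for pi, xi in zip(pred, x)]
    return lambda xi: a0 + learn_rate * sum(t(xi) for t in trees)
```

Even this toy version shows the key behavior: each added tree shrinks the remaining residuals, so the ensemble's predictions converge toward the target values step by step.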
The TreeNet algorithm
 is relatively impervious to errors in the dependent variable (target), such as mislabeling
 is strongly resistant to overfitting (predicting noise instead of predicting signal)
 generalizes well to unseen data.
TreeNet is a highly flexible and powerful data mining tool, capable of generating extremely accurate models for both regression and classification, and can work with data sets of varying sizes (from small to huge) while readily managing a large number of columns (http://www.salfordsystems.com/treenet.php). The algorithm can handle both continuous and categorical targets and predictors, and readily handles any number of irrelevant predictors.
Major data mining software developers such as Megaputer Intelligence (http://www.megaputer.com/) and SAS (Enterprise Miner Version 5.3, http://www.sas.com/) now include TreeNet-type algorithms in their suites of available tools.
TreeNet models are usually complex, consisting of hundreds (or even thousands) of trees, and require special efforts to understand and interpret the results. The software generates a number of special reports with visualization to extract the meaning of the model, such as a ranking of predictors according to their importance on a 100 point scale, and graphs of the relationship between inputs and target.
In order to understand graphs of reports for binary targets that TreeNet generates, we need to remind ourselves of the concepts of Odds and Log Odds.
Log Odds, Odds and the Probability of an Event
The odds of an event (for example, first payment default) is defined as the ratio of the probability that an event occurs to the probability that it fails to occur. Thus,
Odds(Event) = Pr(Event) / [1  Pr(Event)]
The log odds are just the natural logarithm of the odds: ln(Odds).
People quite often use the concept of odds to express the likelihood of an event. When you hear someone say that the odds are 3-to-1, it means that the probability of the event occurring is three times greater than the probability of the event not occurring. A shorter way of saying the same thing: the odds equal 3 (which implies that the odds are 3-to-1). In other words, odds of 3 mean that the probability of the event is .75 and the probability of the nonevent is .25, i.e., 3-to-1.
If the odds are 1-to-3, we could also say that the odds are .3333. The probability of the event is .25 and the probability of the nonevent is .75.
Another example: saying the odds are 3-to-2 is equivalent to saying that the odds are 1.5-to-1, or just 1.5 for short. The probability of the event is .6. For the inverse situation, odds of 2-to-3, we could say the odds are .6667 and the probability is .4. When the odds are 1-to-1, or just 1 for short, the probability of the event is .5.
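These definitions translate directly into code:

```python
import math

def odds(p):
    """Odds of an event from its probability: Pr(Event) / (1 - Pr(Event))."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the scale used on the TreeNet graphs)."""
    return math.log(odds(p))

def probability(o):
    """Invert the relationship: recover the probability from the odds."""
    return o / (1 + o)
```

For example, a probability of .75 gives odds of 3 (3-to-1), a probability of .5 gives odds of 1 and log odds of 0, and odds of 3 convert back to a probability of .75.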
According to the definition, both odds and log odds are monotonically increasing functions of the event probability (see Graph 4 and Graph 5).
 If the probability of an event is 0.5, then odds are equal to 1, and log odds are equal to 0.
 If the probability of an event is 0, then odds are equal to 0 too, and log odds are equal to minus infinity.
 If the probability of an event is 1, then odds are plus infinity, and log odds are plus infinity as well.
Graph 4. Relationship between Odds and Probability of an event
Graph 5. Relationship between Log Odds and Probability of an event
TreeNet Risk Assessment Models
TreeNet Analysis
For the purpose of our analysis we randomly split the data sample into two subsamples. 60% of the sample forms the first subsample, the LEARN data, which is used only for model estimation. The second subsample, the TEST data, is used to estimate model quality.
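A reproducible 60/40 split can be sketched as follows (the seed value is an arbitrary placeholder; fixing it just makes the split repeatable):

```python
import random

def learn_test_split(records, learn_share=0.6, seed=2008):
    """Randomly split records into LEARN (60%) and TEST (40%) subsamples."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = records[:]          # copy; leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * learn_share)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 5,000-customer sample, this yields 3,000 LEARN records and 2,000 TEST records, with every record landing in exactly one subsample.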
The TreeNet algorithm has about 20 different options that can be controlled by a researcher. As a rule, usage of default options does not produce the best model. Determination of the best options/optimal model is time consuming and requires experience and expertise.
Models that are quite different (see First and Second models below) can have similar accuracy, and the interpretability criterion should be used to select the best model.
First TreeNet Risk Assessment model
The target is a binary variable Risk with two possible values: Good and Bad. The Good value of the target was selected as the focus event. All predictors are listed in the first column of Table 1. The second column reflects an importance score on a 100 point scale, with the highest score of 100 corresponding to the most important predictor. If the score equals 0, the predictor has no importance for the target. The third column, Variable Importance, simply visualizes the second column, Score.
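The 100-point scale is a rescaling of raw importance scores so that the strongest predictor reads 100. A minimal sketch; the raw scores below are hypothetical, since TreeNet's internal values are not shown in this paper:

```python
def importance_100(raw_scores):
    """Rescale raw importance scores so the top predictor reads 100.00."""
    top = max(raw_scores.values())
    return {name: round(100.0 * s / top, 2) for name, s in raw_scores.items()}

# Hypothetical raw scores, purely for illustration.
scaled = importance_100({"Bank_Names$": 40.0, "Merch_Store_ID": 34.7, "Score": 0.0})
```

A predictor with a raw score equal to the maximum maps to 100.00, and one the model never used maps to 0.00, matching the layout of Tables 1 and 4.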
This particular model is based on just 8 predictors, and has a risk prediction error of about 14% on learning data and about 19% on validation data. The TreeNet algorithm did not select the Score variable, which means that within this model the variable Score is not important. It does not mean that the credit score is superfluous or irrelevant in customer credit risk assessment; it just means that the useful information provided by the variable Score is already covered by the 8 important predictors selected by TreeNet (see Table 1).
Table 1. TreeNet model variable importance

Variable               Score
Bank_Names$            100.00
Merch_Store_ID          86.74
Email_Domains           46.16
Market_Source_GRPD$     26.06
BV_Completed            23.14
Fin_Charge               9.22
Due_Duration             6.35
Emp_Duration             5.37
Type_of_Payroll$         0.00
Merchant_Nmbr$           0.00
Credit_Model$            0.00
Periodicity$             0.00
Appramt$                 0.00
Req_Loan_Amt             0.00
Appl_Status$             0.00
Avg_Salary               0.00
Courtesy_Days            0.00
Aba_No                   0.00
Score                    0.00
Monthly                  0.00
Age                      0.00
Orig_Duration            0.00
Cust_Acct_Type$          0.00
Isoriginated             0.00
Table 2. TreeNet model misclassification rate
Graph 6. Impact of Market Source Grouped predictor on the probability of being a good customer:
Risk = Good, controlling for all other predictors.
The Y-axis shows the log odds of the event Risk = Good. Therefore, 0 corresponds to the situation when the odds are 1-to-1, i.e., the probability of the event equals the probability of the nonevent. In other words, the X-axis corresponds to the baseline that reflects an equal chance of being a good or bad customer.
The impact of Market Source Grouped is significant and varies across different values. All values of the Market Source Grouped variable with bars above the X-axis increase the probability of being a good customer, and all bars below the X-axis decrease it. The value LDPT has the highest positive impact on the probability of being a good customer, and the value CRSC has the highest negative impact.
Table 3. Frequency of Loan Status by Market Source Grouped
Frequency   CRUE   CRUF   LDPT   MISS   Total
C              8     15     31     76     130
D             60     40     92    411     603
P              2      0      5     55      62
Total         70     55    128    542     795
The left column of Table 3 depicts the values of the Loan Status predictor, and the upper row of the table depicts the values of the Market Source Grouped predictor. Table 3 pictures the frequency of customers that have the following values of the Market Source Grouped variable: CRUE, CRUF, LDPT, and MISS. These values correspond to the tallest positive bars on Graph 6 (the values with the highest positive impact on the probability of being a good customer). Since the value D of the Loan Status predictor designates a Good customer, and the values C and P correspond to a Bad customer (see the Data Structure section), we can infer that there is good agreement between the model (Graph 6) and the data (Table 3).
There is a significant difference between the information presented in Table 3 and in Graph 6. If we set aside the existence of all other predictors and consider just two of them (Loan Status and Market Source Grouped) using the available data, then we arrive at the conclusion that the majority of customers with Market Source Grouped values of CRUE, CRUF, LDPT, and MISS are Good customers. Here we considered the joint frequency distribution of only these two variables and disregarded the impact of all other predictors. In other words, there is no control for other predictors at all: this is data-induced information.
The information represented by Graph 6, on the contrary, was produced by the developed TreeNet model; it is model-induced information. The relationship between the target (probability of being a good customer) and Market Source Grouped was mapped controlling for all other predictors.
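The data-induced conclusion can be checked directly from the Table 3 counts: for every listed market source, the share of Good customers (Loan Status D) exceeds one half.

```python
# Counts of Loan Status (rows) by Market Source Grouped (columns), from Table 3.
TABLE_3 = {
    "CRUE": {"C": 8,  "D": 60,  "P": 2},
    "CRUF": {"C": 15, "D": 40,  "P": 0},
    "LDPT": {"C": 31, "D": 92,  "P": 5},
    "MISS": {"C": 76, "D": 411, "P": 55},
}

def share_good(column):
    """Share of Good customers (Loan Status D) within one market source."""
    counts = TABLE_3[column]
    return counts["D"] / sum(counts.values())
```

This is exactly the marginal, two-variable view: informative about the raw data, but silent about what happens once all other predictors are controlled for.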
Graph 7. TreeNet Modeling: Impact of BV Completed and Market Source Grouped predictors on Probability of Being a Good Customer (controlling for all other predictors).
Graph 7 represents an example of the nonlinear interaction between the BV Completed and Market Source Grouped predictors: for different values of one predictor, the impact of the other on the probability of being a good customer has different directions. For the value Market Source Grouped = BSDE, both values of the BV Completed predictor have a positive impact on the probability of being a good customer. On the other hand, for the value Market Source Grouped = CRSC, the value 0 of BV Completed has a negative impact, while the value 1 has a positive impact on the probability of being a good customer.
Second TreeNet Risk Assessment model
As in the First TreeNet model construction, 60% of the data are randomly selected to be used for model development (learning), and the remaining 40% are used for model validation (holdout observations, or test data).
This particular model is based on 17 predictors, and has a risk prediction error of about 9% on learning data, and a risk prediction error of about 20% on validation data.
Table 4. TreeNet model variable importance
Variable               Score
Bank_Names$            100.00
Email_Domains           47.27
Credit_Model$           28.41
Market_Source_GRPD$     24.94
BV_Completed            10.06
Merchant_Nmbr$           8.57
Age                      8.34
Score                    8.21
Orig_Duration            7.33
Emp_Duration             7.33
Appramt$                 6.90
Monthly                  6.69
Due_Duration             6.47
Courtesy_Days            6.24
Cust_Zip                 5.36
Avg_Salary               4.84
Fin_Charge               4.09
Merch_Store_ID           3.46
Req_Loan_Amt             2.89
Periodicity$             1.77
State_Code$              1.36
Type_of_Payroll$         1.25
Aba_No                   0.00
Isoriginated             0.00
Cust_Acct_Type$          0.00
Appl_Status$             0.00
Table 5. TreeNet model misclassification rate
Graph 8. TreeNet Modeling: Impact of Email Domain predictor on Probability of Being a Good Customer, controlling for all other predictors (Second Model)
The impact of Email Domain is highly significant, but its direction varies by value. Several distinct segments of Email Domain values can be identified by their impact on the probability of being a good customer:
• Extremely positive impact
• Modest positive impact
• Practically no impact
• Modest negative impact
• Extremely negative impact
Graph 9. TreeNet Modeling: Impact of Credit Model predictor on Probability of Being a Good Customer, controlling for all other predictors (Second Model).
Only the value 0001 of Credit Model is associated with a strong negative impact on the probability of being a Good customer; the value 0003 is associated with the strongest positive impact.
Table 6. Frequency of Loan Status by Credit Model
Frequency   0001   0002   0003   7777   8080   Total
C            295    116     13    232    193     849
D            331    500    325    712    632    2500
P           1552     99      0      0      0    1651
Total       2178    715    338    944    825    5000
The left column of Table 6 lists the values of the Loan Status variable (D designates the class of Good customers; C and P designate the class of Bad customers), and the top row lists the values of the Credit Model predictor (see the Data Structure section for the meaning of the Credit Model values). The data support the direction and strength of the impacts on the probability of being a Good customer induced by the model (Graph 9).
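The agreement between Table 6 and Graph 9 can be checked directly: computing the share of Good customers (status D) within each Credit Model value from the Table 6 counts shows 0003 with by far the highest good rate and 0001 with one of the lowest.

```python
import pandas as pd

# Table 6 counts: rows = Loan Status (D = Good; C, P = Bad),
# columns = Credit Model values.
counts = pd.DataFrame(
    {"0001": [295, 331, 1552],
     "0002": [116, 500, 99],
     "0003": [13, 325, 0],
     "7777": [232, 712, 0],
     "8080": [193, 632, 0]},
    index=["C", "D", "P"])

# Share of Good customers (status D) within each Credit Model value.
good_rate = counts.loc["D"] / counts.sum()
print(good_rate.round(3))
```

For 0003 the good rate is 325/338 ≈ 0.96, while for 0001 it is 331/2178 ≈ 0.15, matching the strongest positive and strongest negative impacts read off Graph 9.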
Graph 10. TreeNet Modeling: Impact of Merchant Number predictor on Probability of Being a Good Customer, controlling for all other predictors (Second Model)
Table 7. Frequency of Loan Status by Merchant Number
Frequency   57201   57206   Total
C              43      99     142
D               0     116     116
P               9     162     171
Total          52     377     429
The left column of Table 7 lists the values of the Loan Status variable (D corresponds to a Good customer; C and P correspond to a Bad customer), and the top row lists the values of the Merchant Number predictor. Table 7 shows the frequency of customers with Merchant Number values 57201 and 57206. The majority of customers with these values belong to the segment of Bad customers. Again, the model-induced knowledge (Graph 10) and the data are in good agreement.
Graph 11. TreeNet Modeling: Impact of Score predictor on Probability of Being a Good Customer, controlling for all other predictors (Second Model)
The credit bureau score (Score predictor) has a step-like impact on the probability of being a good customer: at a score of 600 and higher the probability jumps up, while below 600 it jumps down, although this negative jump is not very large. In other words, for scores below 600 the impact on the probability of being a good customer is very modest.
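Curves like Graph 11 are partial dependence plots: sweep the Score predictor across a grid, hold all other predictors at their observed values, and average the model's predicted probability at each grid point. A sketch of that computation on synthetic data, with the 600-point step built in to mirror the pattern described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n = 3000

# Column 0: credit bureau score; column 1: an unrelated predictor.
score = rng.uniform(450, 750, size=n)
other = rng.normal(size=n)
X = np.column_stack([score, other])
# Good-customer probability jumps at a score of 600, as in Graph 11.
y = (rng.random(n) < np.where(score >= 600, 0.8, 0.35)).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

def partial_dependence_curve(model, X, col, grid):
    """Average predicted P(Good) with column `col` forced to each grid value."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, col] = v                       # hold all other columns fixed
        out.append(model.predict_proba(Xv)[:, 1].mean())
    return np.array(out)

grid = np.array([500.0, 550.0, 590.0, 610.0, 650.0, 700.0])
pd_curve = partial_dependence_curve(model, X, col=0, grid=grid)
```

Averaging over the observed values of the remaining predictors is what the phrase "controlling for all other predictors" in the graph captions refers to.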
Graph 12. TreeNet Modeling: Impact of Employment Duration and Score on Probability of Being a Good Customer, controlling for all other predictors (Second Model)
If Employment Duration is less than 1,500 days, there is no strong impact on the probability of being a Good customer (the part of the Graph 12 curve for Employment Duration below 1,500 days can be treated as noise). The real impact of Employment Duration starts at 1,500 days and is roughly linear up to 5,000 days; beyond 5,000 days, the probability of being a Good customer shows a diminishing-returns effect.
Graph 13. TreeNet Modeling: Impact of Age on Probability of Being a Good Customer, controlling for all other predictors (Second Model).
It turns out that the highest probability of being a Good customer occurs for applicants between 38 and 42 years of age. Below age 32 the impact is negative, and the younger the applicant, the lower the probability of being a good customer. At the upper end, the strength of the positive impact declines as age increases.
Graph 14. TreeNet Modeling: Impact of the interaction of BV Completed and Credit Model predictors on Probability of Being a Good Customer, controlling for all other predictors (Second Model)
The vertical axis maps the log odds of the event Risk = Good. The two other axes correspond to the values of the BV Completed and Credit Model predictors. The combination BV Completed = 1 and Credit Model = 7777 has the strongest positive impact on the probability of being a Good customer. On the other hand, the combination BV Completed = 0 and Credit Model = 0001 has the strongest negative impact on the probability at hand.
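Since the vertical axis of Graph 14 is in log odds, translating a reading into a probability uses the logistic function. A quick check with illustrative log-odds values (not taken from the actual graph):

```python
import math

def logodds_to_prob(z):
    """Invert log(p / (1 - p)): the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative log-odds values for the two extreme predictor combinations.
print(logodds_to_prob(1.5))   # strong positive impact -> p ~ 0.82
print(logodds_to_prob(-1.5))  # strong negative impact -> p ~ 0.18
```

Symmetric log-odds values map to probabilities that sum to one, which is why equal-height peaks and troughs on the log-odds surface represent equal-strength positive and negative impacts.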
GIS Application
In order to improve the quality of the data mining predictive models, it is useful to enrich SLC's data with additional region-level demographic, socioeconomic, and housing variables that can be obtained from the US Bureau of the Census. These variables include:
• median household income
• education
• median gross rent
• median house value, etc.
The variables can be obtained at different geographic levels, namely at the ZIP Code and the Census block levels.
Because the SLC data include ZIP code as one of the variables, the ZIP-level Census data can be merged to the SLC data directly.
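That direct merge is a standard left join on the ZIP code. A sketch with pandas; Cust_Zip is an actual SLC variable, but the Census column names and values here are purely illustrative:

```python
import pandas as pd

# Hypothetical SLC applicant records (Cust_Zip is an actual SLC variable).
slc = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "Cust_Zip": ["30301", "30301", "60601"],
})

# Hypothetical ZIP-level Census extract (column names are illustrative).
census_zip = pd.DataFrame({
    "zip": ["30301", "60601"],
    "median_hh_income": [46000, 58000],
    "median_gross_rent": [980, 1150],
})

# Left join keeps every applicant even when a ZIP has no Census match.
enriched = slc.merge(census_zip, left_on="Cust_Zip", right_on="zip", how="left")
```

The left join is deliberate: applicants in ZIP codes missing from the Census extract are retained with missing Census values rather than silently dropped.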
However, if customer address data are available, it will be advisable to obtain the Census data for smaller geographic regions (namely, Census blocks). Because Census blocks are in general much smaller than ZIP codes, the Census estimates for these areas will be much more precise and much more applicable than their ZIP code counterparts.
Using Geographic Information Systems (GIS) software, customer addresses can be geocoded (i.e., the latitude and longitude of the addresses can be determined, and the addresses can be mapped). Then, it will be possible to spatially match the addresses to their respective Census blocks (and Census block data). Demographic, socioeconomic, and housing data can then be obtained at the Census block level. Although geocoding is a time-intensive procedure, enriching the SLC data with the Census block level data will make the accuracy of the credit score even higher.
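At its core, spatially matching a geocoded address to a Census block is a point-in-polygon test. A minimal pure-Python ray-casting sketch on toy rectangular "blocks" (production work would use GIS software with real TIGER/Line block boundaries):

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: is point (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Two toy "Census blocks" as polygons in (longitude, latitude) space.
blocks = {
    "block_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "block_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

def assign_block(lon, lat):
    """Return the Census block containing a geocoded address, if any."""
    for name, poly in blocks.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return None

print(assign_block(0.5, 0.5))  # block_A
print(assign_block(1.5, 0.2))  # block_B
```

Once each address carries a block identifier, the block-level Census variables can be joined on exactly as in the ZIP-level merge above.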
Employing dissimilar data mining tools, it is easy to determine which Census variables are crucial for customer risk assessment. When the corresponding data become available, maps produced by the GIS will enable us to visually identify ZIP codes with many bad (high-risk) customers and ZIP codes with many good (low-risk) customers (Graph 15 and Graph 16).
Graph 15. Percent of Bad Customers in Each ZIP Code
Graph 16. Number of Bad Customers in Each ZIP Code
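The per-ZIP summaries behind maps like Graphs 15 and 16 (count and percent of bad customers in each ZIP code) come from a simple aggregation. A sketch with hypothetical per-customer records (column values are illustrative):

```python
import pandas as pd

# Hypothetical per-customer records: ZIP code and risk class.
df = pd.DataFrame({
    "Cust_Zip": ["30301", "30301", "30301", "60601", "60601"],
    "risk":     ["Bad",   "Good",  "Bad",   "Good",  "Good"],
})

# Count and percent of Bad customers in each ZIP code,
# i.e. the quantities mapped in Graphs 16 and 15 respectively.
by_zip = df.groupby("Cust_Zip")["risk"].agg(
    n_bad=lambda s: (s == "Bad").sum(),
    pct_bad=lambda s: 100 * (s == "Bad").mean(),
)
print(by_zip)
```

The resulting table, keyed by ZIP code, is what the GIS joins to ZIP boundary polygons to produce the choropleth maps.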
Conclusion
Knowledge generated by the TreeNet models is in good agreement with the data and is readily interpretable, and the accuracy of the TreeNet models is superior. TreeNet is therefore an appropriate tool for nontraditional scorecard development using SLC data, and TreeNet model-induced knowledge is a great asset for traditional scorecard development.
References
1. J. Friedman (1999), Greedy Function Approximation: A Gradient Boosting Machine http://www.salfordsystems.com/doc/GreedyFuncApproxSS.pdf
2. J. Friedman (1999), Stochastic Gradient Boosting http://www.salfordsystems.com/doc/StochasticBoostingSS.pdf
3. TreeNet Frequently Asked Questions http://www.salfordsystems.com/doc/TreeNetFAQ.pdf
4. Dan Steinberg (2006), Overview of TreeNet Technology. Stochastic Gradient Boosting http://perseo.dcaa.unam.mx/sistemas/doctos/TN_overview.pdf
5. Boosting Trees for Regression and Classification, StatSoft Electronic Text Book http://www.statsoft.com/textbook/stbootres.html
6. Naeem Siddiqi (2005), Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring (Wiley and SAS Business Series), Wiley, 208 p.
7. Thomas, L.C., Edelman, D.B., Crook, J.N. (2002), Credit Scoring and its Applications, SIAM, 250 p.
8. Matignon, R. (2007). Data Mining Using SAS Enterprise Miner. Wiley Publishing.
9. Myatt, G. J. (2006). Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining. Wiley Publishing.
10. Seo, J. and Gordish-Dressman, H. (2007). Exploratory Data Analysis With Categorical Variables: An Improved Rank-by-Feature Framework and a Case Study. International Journal of Human-Computer Interaction. Available online at http://www.informaworld.com/smpp/title~content=t775653655