This project started as an exercise for the Udemy bootcamp "Python for Data Science and Machine Learning", in order to build Decision Tree and Random Forest models. However, the models did not perform well due to the imbalanced data, so a random over-sampling technique was applied to improve the models' metrics.
I use publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile suggests a high probability of paying you back. We will try to create a model that helps predict this.
I use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full.
Here is what the columns represent:

* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: the purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* int.rate: the interest rate of the loan, as a proportion (a rate of 11% is stored as 0.11). Borrowers judged to be riskier are assigned higher interest rates.
* installment: the monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: the natural log of the self-reported annual income of the borrower.
* dti: the debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: the FICO credit score of the borrower.
* days.with.cr.line: the number of days the borrower has had a credit line.
* revol.bal: the borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: the borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: the borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: the number of times the borrower has been 30+ days past due on a payment in the past 2 years.
* pub.rec: the borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: the target label; 1 if the loan was not fully paid back, and 0 otherwise.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Use pandas to read loan_data.csv as a dataframe called loans.
loans = pd.read_csv('loan_data.csv')
Check out the info(), head(), and describe() methods on loans.
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   credit.policy      9578 non-null   int64
 1   purpose            9578 non-null   object
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64
 11  delinq.2yrs        9578 non-null   int64
 12  pub.rec            9578 non-null   int64
 13  not.fully.paid     9578 non-null   int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
loans.describe()
| | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
count | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9.578000e+03 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 |
mean | 0.804970 | 0.122640 | 319.089413 | 10.932117 | 12.606679 | 710.846314 | 4560.767197 | 1.691396e+04 | 46.799236 | 1.577469 | 0.163708 | 0.062122 | 0.160054 |
std | 0.396245 | 0.026847 | 207.071301 | 0.614813 | 6.883970 | 37.970537 | 2496.930377 | 3.375619e+04 | 29.014417 | 2.200245 | 0.546215 | 0.262126 | 0.366676 |
min | 0.000000 | 0.060000 | 15.670000 | 7.547502 | 0.000000 | 612.000000 | 178.958333 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1.000000 | 0.103900 | 163.770000 | 10.558414 | 7.212500 | 682.000000 | 2820.000000 | 3.187000e+03 | 22.600000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 1.000000 | 0.122100 | 268.950000 | 10.928884 | 12.665000 | 707.000000 | 4139.958333 | 8.596000e+03 | 46.300000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 1.000000 | 0.140700 | 432.762500 | 11.291293 | 17.950000 | 737.000000 | 5730.000000 | 1.824950e+04 | 70.900000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.000000 | 0.216400 | 940.140000 | 14.528354 | 29.960000 | 827.000000 | 17639.958330 | 1.207359e+06 | 119.000000 | 33.000000 | 13.000000 | 5.000000 | 1.000000 |
loans.head()
| | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
Let's do some data visualization! We'll use seaborn and pandas built-in plotting capabilities.
Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.
plt.figure(figsize=(10, 6))
sns.histplot(data=loans, x = "fico", hue = "credit.policy", bins = 30, palette = "seismic");
Create a similar figure, except this time select by the not.fully.paid column.
plt.figure(figsize=(10, 6))
sns.histplot(data=loans, x = "fico", hue = "not.fully.paid", bins = 30, palette = "seismic");
Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.
plt.figure(figsize=(10, 6))
sns.countplot(data=loans, x="purpose", hue="not.fully.paid", palette="seismic");
Note: the not.fully.paid label is clearly imbalanced: there are far more observations with 0 (fully paid) than with 1 (not fully paid).
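To quantify the imbalance, a quick check (a minimal sketch using the loans dataframe loaded above; the counts are examined again before re-sampling):

# Proportion of each class in the target variable
print(loans['not.fully.paid'].value_counts(normalize=True))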
Let's see the trend between FICO score and interest rate
sns.jointplot(data=loans, x="fico", y="int.rate", kind="hex");
Create the following lmplots to see if the trend differs between not.fully.paid and credit.policy.
sns.lmplot(data=loans, x="fico", y="int.rate", hue="credit.policy", col="not.fully.paid");
The purpose column is categorical, so I need to transform it into dummy variables using pd.get_dummies so that sklearn will be able to understand it.
final_data = pd.get_dummies(loans, columns=["purpose"], drop_first=True)
Now it's time to split our data into a training set and a testing set!
# import the sklearn submodules explicitly (plain `import sklearn` does not expose them)
import sklearn.model_selection
import sklearn.metrics
final_data.columns
Index(['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util', 'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid', 'purpose_credit_card', 'purpose_debt_consolidation', 'purpose_educational', 'purpose_home_improvement', 'purpose_major_purchase', 'purpose_small_business'], dtype='object')
X = final_data[['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
'inq.last.6mths', 'delinq.2yrs', 'pub.rec',
'purpose_credit_card', 'purpose_debt_consolidation',
'purpose_educational', 'purpose_home_improvement',
'purpose_major_purchase', 'purpose_small_business']]
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.33, random_state=42)
Let's start by training a single decision tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
DecisionTreeClassifier()
Create predictions from the test set and create a classification report and a confusion matrix.
predictions1 = dtree.predict(X_test)
print(sklearn.metrics.classification_report(y_test,predictions1))
              precision    recall  f1-score   support

           0       0.85      0.84      0.84      2650
           1       0.21      0.23      0.22       511

    accuracy                           0.74      3161
   macro avg       0.53      0.53      0.53      3161
weighted avg       0.75      0.74      0.74      3161
print(sklearn.metrics.confusion_matrix(y_test,predictions1))
[[2223  427]
 [ 396  115]]
sns.heatmap(sklearn.metrics.confusion_matrix(y_test,predictions1));
The model has poor metrics: the number of true positives (not-fully-paid loans correctly identified) is low, and there are many false negatives and false positives.
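As a side note, the heatmap above is easier to read with cell annotations; a minimal sketch using the same confusion matrix:

# Annotated confusion matrix: rows are true labels, columns are predicted labels
cm = sklearn.metrics.confusion_matrix(y_test, predictions1)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues');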
from sklearn.ensemble import RandomForestClassifier
rdf1 = RandomForestClassifier(n_estimators=600)
rdf1.fit(X_train, y_train)
RandomForestClassifier(n_estimators=600)
Let's predict on the X_test values and evaluate our model.
predictions2 = rdf1.predict(X_test)
Now create a classification report from the results
print(sklearn.metrics.classification_report(y_test,predictions2))
              precision    recall  f1-score   support

           0       0.84      0.99      0.91      2650
           1       0.36      0.02      0.03       511

    accuracy                           0.84      3161
   macro avg       0.60      0.51      0.47      3161
weighted avg       0.76      0.84      0.77      3161
Show the Confusion Matrix for the predictions.
print(sklearn.metrics.confusion_matrix(y_test,predictions2))
[[2636   14]
 [ 503    8]]
sns.heatmap(sklearn.metrics.confusion_matrix(y_test, predictions2));
Depending on which metric you look at, the random forest model is better or worse than the decision tree. However, this model also has poor metrics: the number of true positives is very low (even lower than the decision tree's), and almost all not-fully-paid loans end up as false negatives.
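To make the comparison concrete, it helps to look at a single metric focused on the minority class, for example recall on class 1 (a quick sketch reusing the predictions computed above):

# Recall on class 1 (not fully paid): the fraction of actual defaults each model catches
from sklearn.metrics import recall_score
print('Decision tree recall (class 1):', recall_score(y_test, predictions1))
print('Random forest recall (class 1):', recall_score(y_test, predictions2))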
First, I check how many cases are in class 0 and in class 1 by looking at the variable "not.fully.paid".
final_data['not.fully.paid'].value_counts().plot(kind='bar', title='Count not.fully.paid');
The difference, around 5:1, could be responsible for the poor performance of both models. Searching online, I found a very good article on Kaggle about imbalanced data, so I decided to apply random over-sampling to balance it.
# Count the classes and divide the dataframe by class
count_class_0, count_class_1 = final_data['not.fully.paid'].value_counts()
df_class_0 = final_data[final_data['not.fully.paid'] == 0]
df_class_1 = final_data[final_data['not.fully.paid'] == 1]
I do the random over-sampling using df.sample, drawing from class 1 with replacement until it matches the size of class 0:
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)
print('Random over-sampling:')
print(df_test_over['not.fully.paid'].value_counts())
df_test_over['not.fully.paid'].value_counts().plot(kind='bar', title='Count not.fully.paid');
Random over-sampling:
0    8045
1    8045
Name: not.fully.paid, dtype: int64
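As an aside, random over-sampling is not the only way to handle the imbalance: scikit-learn models can also reweight the classes internally via the class_weight parameter. That alternative is not used in this notebook, but a minimal sketch would be:

# Alternative (not applied here): penalize errors on the minority class more heavily
# instead of duplicating its rows
rdf_weighted = RandomForestClassifier(n_estimators=600, class_weight='balanced')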
Now the data is balanced, and I can try to generate a random forest model again.
X2 = df_test_over[['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
'inq.last.6mths', 'delinq.2yrs', 'pub.rec',
'purpose_credit_card', 'purpose_debt_consolidation',
'purpose_educational', 'purpose_home_improvement',
'purpose_major_purchase', 'purpose_small_business']]
y2 = df_test_over['not.fully.paid']
X2_train, X2_test, y2_train, y2_test = sklearn.model_selection.train_test_split(X2, y2, test_size=0.33, random_state=42)
rdf2 = RandomForestClassifier(n_estimators=600)
rdf2.fit(X2_train, y2_train)
RandomForestClassifier(n_estimators=600)
predictions3 = rdf2.predict(X2_test)
print(sklearn.metrics.classification_report(y2_test, predictions3))
              precision    recall  f1-score   support

           0       0.97      0.95      0.96      2664
           1       0.95      0.97      0.96      2646

    accuracy                           0.96      5310
   macro avg       0.96      0.96      0.96      5310
weighted avg       0.96      0.96      0.96      5310
print(sklearn.metrics.confusion_matrix(y2_test,predictions3))
[[2536  128]
 [  51 2595]]
sns.heatmap(sklearn.metrics.confusion_matrix(y2_test,predictions3));
Now the confusion matrix shows much better results (many more true positives and true negatives than false ones), and the classification report also shows very good metrics.
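One caveat: because the over-sampling was applied before the train/test split, duplicated minority-class rows can end up in both X2_train and X2_test, which tends to inflate these metrics. A rough sanity check is to score the re-trained model on the original, imbalanced test set (a sketch reusing X_test and y_test from the first split; note that some of these rows may still overlap with the re-sampled training data):

# Evaluate the over-sampled model on the original hold-out set
print(sklearn.metrics.classification_report(y_test, rdf2.predict(X_test)))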