Random Forest Project with Random Over-sampling

This project started as an exercise from the Udemy bootcamp "Python for Data Science and Machine Learning", with the goal of building a Decision Tree and a Random Forest model. However, the models did not perform well due to the imbalanced data, so a random over-sampling technique was applied to improve the models' metrics.

I use publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people with a high probability of paying you back. We will try to create a model that helps predict this.

I use lending data from 2007-2010 and try to classify and predict whether or not a borrower paid back their loan in full.

Here is what the main columns used in this notebook represent:

- credit.policy: 1 if the customer meets the LendingClub.com credit underwriting criteria, and 0 otherwise.
- purpose: the purpose of the loan (e.g. "credit_card" or "debt_consolidation").
- int.rate: the interest rate of the loan.
- fico: the FICO credit score of the borrower.
- not.fully.paid: the target label; 1 if the loan was not paid back in full, 0 if it was.

Import Libraries
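The rest of this README assumes pandas for the data handling and matplotlib/seaborn for the plots; a minimal import cell could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```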

Get the Data

Use pandas to read loan_data.csv as a dataframe called loans.

Check out the info(), head(), and describe() methods on loans.
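A sketch of this step, assuming loan_data.csv sits in the working directory:

```python
loans = pd.read_csv('loan_data.csv')

loans.info()               # column types and non-null counts
print(loans.head())        # first few rows
print(loans.describe())    # summary statistics for the numeric columns
```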

Exploratory Data Analysis

Let's do some data visualization! We'll use seaborn and pandas built-in plotting capabilities.

Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.
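A minimal sketch of that figure (the bin count and figure size are my choices):

```python
plt.figure(figsize=(10, 6))
loans[loans['credit.policy'] == 1]['fico'].hist(bins=35, alpha=0.6, label='credit.policy = 1')
loans[loans['credit.policy'] == 0]['fico'].hist(bins=35, alpha=0.6, label='credit.policy = 0')
plt.legend()
plt.xlabel('FICO')
```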

Create a similar figure, except this time select by the not.fully.paid column.
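The same idea, this time splitting the FICO distribution by the target label:

```python
plt.figure(figsize=(10, 6))
loans[loans['not.fully.paid'] == 1]['fico'].hist(bins=35, alpha=0.6, label='not.fully.paid = 1')
loans[loans['not.fully.paid'] == 0]['fico'].hist(bins=35, alpha=0.6, label='not.fully.paid = 0')
plt.legend()
plt.xlabel('FICO')
```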

Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.
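For example:

```python
plt.figure(figsize=(11, 6))
sns.countplot(x='purpose', hue='not.fully.paid', data=loans)
plt.xticks(rotation=45)   # the purpose labels are long, so tilt them
```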

Note: the not.fully.paid label is imbalanced: there are many more rows with 0 (fully paid) than with 1 (not fully paid).

Let's see the trend between FICO score and interest rate.

Create the following lmplots to see if the trend differs between not.fully.paid and credit.policy.
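A sketch of both plots, assuming a jointplot for the overall trend and an lmplot faceted by the two flags:

```python
# Overall relationship between FICO score and interest rate
sns.jointplot(x='fico', y='int.rate', data=loans)

# Same trend, split by credit.policy (hue) and not.fully.paid (columns)
sns.lmplot(x='fico', y='int.rate', data=loans,
           hue='credit.policy', col='not.fully.paid')
```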

Setting up the Data for the Random Forest Classification Model!

Categorical Features

The purpose column is categorical, so I need to transform it into dummy variables with pd.get_dummies so that sklearn can understand it.
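A sketch of the encoding, keeping the result in a new dataframe (final_data is my name for it):

```python
cat_feats = ['purpose']
final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
```

drop_first=True drops one redundant dummy column per category, which avoids perfect collinearity between the dummies.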

Train Test Split

Now it's time to split our data into a training set and a testing set!
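A sketch of the split; the 70/30 ratio and the random_state are my assumptions:

```python
from sklearn.model_selection import train_test_split

X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
```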

Training a Decision Tree Model

Let's start by training a single decision tree.
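With default hyperparameters, that is simply:

```python
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
```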

Predictions and Evaluation of Decision Tree

Create predictions from the test set and create a classification report and a confusion matrix.
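For example:

```python
from sklearn.metrics import classification_report, confusion_matrix

predictions = dtree.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```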

Some Analysis

The model has poor metrics: the number of True Negatives is low, and there are many False Negatives and False Positives.

Training the Random Forest Model
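A sketch of the fit; the number of trees (600) is an assumption on my part:

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)
```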

Predictions and Evaluation

Let's predict on the test set and evaluate our model against the y_test values.

Now create a classification report from the results.

Show the Confusion Matrix for the predictions.
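Continuing from the rfc fitted above:

```python
rfc_pred = rfc.predict(X_test)
print(classification_report(y_test, rfc_pred))
print(confusion_matrix(y_test, rfc_pred))
```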

Some Analysis

Depending on which metric you look at, the random forest model is better or worse than the decision tree. However, this model also has poor metrics: the number of True Negatives is low (even lower than the decision tree's), and there are many False Negatives and False Positives.

Random Over-sampling of the Data

First, I check how many cases fall in class 0 and in class 1 by looking at the not.fully.paid variable.
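For example:

```python
print(loans['not.fully.paid'].value_counts())
```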

The difference, around 5:1, could be responsible for the bad performance of both models. Searching the internet, I found a very good article on Kaggle about imbalanced data, so I decided to apply random over-sampling to balance the dataset.

I did the random over-sampling using df.sample.
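A sketch of random over-sampling with df.sample, applied to the dummy-encoded frame; the variable names (majority, minority, balanced) and the random_state are mine:

```python
# Separate the majority (fully paid) and minority (not fully paid) classes
majority = final_data[final_data['not.fully.paid'] == 0]
minority = final_data[final_data['not.fully.paid'] == 1]

# Duplicate minority rows at random (sampling with replacement) until the classes match
minority_oversampled = minority.sample(n=len(majority), replace=True, random_state=101)

# Recombine and shuffle
balanced = pd.concat([majority, minority_oversampled]).sample(frac=1, random_state=101)

print(balanced['not.fully.paid'].value_counts())
```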

Now the data is balanced and I can try to build a random forest model again.
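A sketch of the retraining step, continuing from the balanced frame above:

```python
Xb = balanced.drop('not.fully.paid', axis=1)
yb = balanced['not.fully.paid']

Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=101)

rfc_bal = RandomForestClassifier(n_estimators=600)
rfc_bal.fit(Xb_train, yb_train)

bal_pred = rfc_bal.predict(Xb_test)
print(classification_report(yb_test, bal_pred))
print(confusion_matrix(yb_test, bal_pred))
```

One design note: because the over-sampling here happens before the train/test split, duplicated minority rows can land in both sets, which tends to flatter the test metrics; over-sampling only the training portion is the more conservative choice.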

Now the confusion matrix shows better results (many more true positives and true negatives than false ones), and the classification report also shows very good metrics.