EDA on Prosper Loan

  |   Source

Fintech company is a place where you can borrow and lend a money. The power to send is limitless. As one of Forbes Articles described, “Fintech companies, as they’ve come to be called, are easing payment processes, reducing fraud, saving users money, promoting financial planning, and ultimately moving a giant industry forward.” When talking about fintech companies, one that comes to mind is Prosper. In this blog, I will use their data to perform the analysis.

There are 113937 loans and 81 features in the dataset. I only use 13 among them since all of the features will make this blog too long. I rename the columns to make it short, and drop duplicate data so 1 observations consist of 1 person.

Note that this blog is an exploratory data analysis. This will perform grid search analysis which might be too heavy for some public reader. Nonetheless it’s contain some useful information and finding which is useful. Specifically, those that contains in the “analysis” may provides some useful insights.

Due to many variables in this dataset, I only use 12 variables. Those are:

  • ‘BorrowerRate’
  • ‘ProsperRating..numeric.’
  • ‘ListingCategory..numeric.’
  • ‘IsBorrowerHomeowner’
  • ‘StatedMonthlyIncome’
  • ‘Recommendations’
  • ‘ProsperPaymentsOneMonthPlusLate’
  • ‘LoanOriginalAmount’
  • ‘Term’,
  • ‘Occupation’,
  • ‘LoanStatus’,
  • ‘CreditGrade’,
  • ‘PercentFunded’

Univariate Plot Section

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04228  0.00000 39.00000
## 
##     0     1     2     3     4     5     6     7     8     9    14    16 
## 87862  2438   403    78    23     8     3     3     2     5     1     2 
##    21    24    39 
##     1     1     1

Recommendations: Number of recommendations.

This is the number of recommendations that Borrowers have when listing the loans. We can see that the number of the recommendations is right skewed. With the mean 0.04, average borrowers don’t have many recommendations. And it goes as high as 39 recommendations for just one borrower.

Histogram chose to depict the distribution of the numerical variable. Since this is right skewed, I log scale the number of recommendations.

Rate: The Borrower’s interest rate for this loan.

The borrower’s rate follow an almost unimodal distribution, with the peak around 0.16. There’s small spike occurs around 0.3.

## 
##     1     2     3     4     5     6     7 
##  5487  7521 10839 15224 12633 11081  4151

Rating : The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.

The Rating can also be null if the Prosper system can’t rate the loan. About 29084 loans aren’t rated by Prosper, which means that the loans originated before July 2009. The number of rating almost follow the order of the rating, except A-grade is the highest number of rating, AA comes second, and the rest following the order.

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 14398 47562  5242  5406  1851   561  1738  7903   150    64    76   179 
##    12    13    14    15    16    17    18    19    20 
##    44  1602   746  1117   253    41   669   597   632

Category: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

I choose Bar Chart for this Listing Category since this is categorical variable. Out of the listing category, three categories comes out as the highest past 10.000 loans. There’s N/A and Other categories, so we can’t know for sure the specific category. But one comes out highest which is category 1 (a.k.a. Debt Consolidation), where’s one take out a loan to pay many others. This comes really high with 58308 loans, overshadowing the rest of the categories. It could be that many Prosper visitors comes with already have loans, and want to search some loans to pay for it.

## False  True 
## 45292 45539

A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. When looking at the loan that’s been listed, we see that borrower that is homeowner is around equal proportion to those who do not. So it’s not affecting much.

Income is the borrower stated at the time the listing was created. Still we see that isn’t much going on with the monthly income. Monthly Income will definitely be right skewed since fewer people will have a higher salary. So I cut the outliers and log 10 scales.

## 
##    0    1    2    3    4    5    6    7    9   11   12   16   21 
## 8445   66   27   19    4    6    3    5    1    1    1    1    1

PaymentsLate: Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.

Here’s we see that number of late payments is also very right-skewed. Since the number isn’t a floating point, the number isn’t represented as a smooth histogram.

Looking at the frequency count, we see that most of the borrowers have no recommendations when they listing loans. It may seem really hard to get recommendations in the listing loans, since we see that number only few people that can get at least higher than ten recommendations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7000    8644   12500   35000

Amount: The origination amount of the loan.

Now this is interesting. Here we see that the distribution is still right skewed, but the peak seems like it’s around 4000 dollars. In the summary table, we see that it’s the first quarter. While the median is 6500 dollars. Notice that max loan is 35000 dollars, and with people who have the high salary as high as ~2 million dollars monthly income, it doesn’t make sense if he borrow money with such low amount. Since StatedMonthlyIncome is human input, it can’t be trusted hence I will exclude the outlier from now on.

## 
##    12    36    60 
##  1062 69530 20239
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   41.07   36.00   60.00

Term: The length of the loan expressed in months.

Here we see that term of loans only has three categorical, 1 year, 3 years, and 5 years. Most of the loans made around 3 years, with 5 years come with 24545 loans and 1614 loans at 1 year. I can expect the distribution of the term loan to be nearly normal.

## 
##                                                        Accountant/CPA 
##                               3190                               2439 
##           Administrative Assistant                            Analyst 
##                               2912                               2748 
##                          Architect                           Attorney 
##                                162                                798 
##                          Biologist                         Bus Driver 
##                                 94                                275 
##                         Car Dealer                            Chemist 
##                                140                                116 
##                      Civil Service                             Clergy 
##                               1109                                159 
##                           Clerical                Computer Programmer 
##                               2558                               3301 
##                       Construction                            Dentist 
##                               1548                                 61 
##                             Doctor                Engineer - Chemical 
##                                402                                178 
##              Engineer - Electrical              Engineer - Mechanical 
##                                866                               1108 
##                          Executive                            Fireman 
##                               3400                                327 
##                   Flight Attendant                       Food Service 
##                                 95                                943 
##            Food Service Management                          Homemaker 
##                               1008                                104 
##                           Investor                              Judge 
##                                152                                 18 
##                            Laborer                        Landscaping 
##                               1343                                190 
##                 Medical Technician                  Military Enlisted 
##                                929                               1003 
##                   Military Officer                        Nurse (LPN) 
##                                250                                422 
##                         Nurse (RN)                       Nurse's Aide 
##                               2101                                437 
##                              Other                         Pharmacist 
##                              22931                                209 
##         Pilot - Private/Commercial  Police Officer/Correction Officer 
##                                146                               1272 
##                     Postal Service                          Principal 
##                                488                                257 
##                       Professional                          Professor 
##                              10368                                452 
##                       Psychologist                            Realtor 
##                                124                                439 
##                          Religious                  Retail Management 
##                                 90                               2115 
##                 Sales - Commission                     Sales - Retail 
##                               2788                               2248 
##                          Scientist                      Skilled Labor 
##                                287                               2269 
##                      Social Worker         Student - College Freshman 
##                                595                                 34 
## Student - College Graduate Student           Student - College Junior 
##                                185                                 90 
##           Student - College Senior        Student - College Sophomore 
##                                153                                 58 
##        Student - Community College         Student - Technical School 
##                                 19                                 15 
##                            Teacher                     Teacher's Aide 
##                               3002                                217 
##              Tradesman - Carpenter            Tradesman - Electrician 
##                                 98                                366 
##               Tradesman - Mechanic                Tradesman - Plumber 
##                                780                                 82 
##                       Truck Driver                    Waiter/Waitress 
##                               1423                                345
## 
##               Other        Professional           Executive 
##               22931               10368                3400 
## Computer Programmer                     
##                3301                3190

Occupation: The Occupation selected by the Borrower at the time they created the listing.

This time, we have 68 levels of the categories. Among top 5 are computer programmer and teachers. We can see that Other is highest by a wide margin. This is understandable since this is a category that is not listed in the occupation category.

##              Cancelled             Chargedoff              Completed 
##                      4                  10457                  25925 
##                Current              Defaulted FinalPaymentInProgress 
##                  48042                   4644                    155 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     13                    619                    204 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    297                    237                    234

LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

Many of the loan status in this dataset are still running, loan completed comes second with 38074 unit. There is also loan status that past due about 2000 loans. We see a very small number of users which their loans have past-due, which suggests that all of the borrowers have fulfilled their debt.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7000  1.0000  1.0000  0.9985  1.0000  1.0040

PercentFunded: Percent the listing was funded.

This is percent funded of the loans. It seems not good in any way since most of the data are already 1% funded. Minimum at 0.7% and maximum at 1.02%. We could guess that it may floating number representing the percentage, but with 1.02% it shouldn’t be possible. In the density plot, we see that it centered around 1.0, which means that most of the loans listed are actually get funded completely.

## 
##           A    AA     B     C     D     E    HR    NC 
## 67003  2711  2893  3500  4474  4188  2782  3148   132

CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.

This is essentially the same as Prosper Rating except that credit grade is rating given to loans pre-2009 and ProsperRating given after 2009. While looking at the frequency table we see that credit grade filled with an empty value, that is the loans listed after July 2009. Since in this plot we only want to see credit grade, I exclude the loans listed after that date. What I can see now is Credit Grade that shown like a normal distribution(I can infer that since this is ordinal categorical), centered at C. NC is an outlier, which in this case “No Score”.

Univariate Analysis

Looking at the univariate plot above, I see some interesting insights. It’s unlikely people with 2 million salary have low loan amount. Income is input manually by the borrowers. So there could be some dishonesty from the input.

Many people make in ProsperLoan listing loan money to pay them another loan, as shown by a higher number of Debt Consolidations. People who visit Prosper tends to have another debt from their to consolidate their debt.

Computer programmer and teachers are among top 5 occupations that listing the loans the most. These jobs require them to find some latest info (programming questions or teaching method) from the web, thus, they have a higher chance to hear about how Prosper works.

The main feature of the dataset is the Prosper rating. I like to know which features will best play as the predictors for the Prosper rating.

I want to test the features that only exist when borrower listing the loans. Features like income, loan, occupation, listing category, recommendations, borrower rate, whether borrowers is a homeowner, the number of late payments, and the term. Because Prosper rating is the feature of interest, I exclude the missing value of Prosper rating for bivariate analysis. For this dataset, I don’t create a new feature.

There is unusual features in this dataset, in particular, monthly income. This is stated(human input) by the borrowers. As such, we see many unusual salaries. there’s borrower who’s stated their monthly income close to two million but makes 35000 dollars loan. There’s also a person who makes 0 dollars monthly salary. So I tidy the features by excluding 5% below and 95% above. We know that income will be long tail right skewed, and the distance will be very far among salaries. So I use log 10 transformations. The result after excluding outliers and log 10 transformation produces a normal distribution.

Bivariate Plots Section