eda
Fintech company is a place where you can borrow and lend a money. The power to send is limitless. As one of Forbes Articles described, “Fintech companies, as they’ve come to be called, are easing payment processes, reducing fraud, saving users money, promoting financial planning, and ultimately moving a giant industry forward.” When talking about fintech companies, one that comes to mind is Prosper. In this blog, I will use their data to perform the analysis.
There are 113937 loans and 81 features in the dataset. I only use 13 among them since all of the features will make this blog too long. I rename the columns to make it short, and drop duplicate data so 1 observations consist of 1 person.
Note that this blog is an exploratory data analysis. This will perform grid search analysis which might be too heavy for some public reader. Nonetheless it’s contain some useful information and finding which is useful. Specifically, those that contains in the “analysis” may provides some useful insights.
Due to many variables in this dataset, I only use 12 variables. Those are:
- ‘BorrowerRate’
- ‘ProsperRating..numeric.’
- ‘ListingCategory..numeric.’
- ‘IsBorrowerHomeowner’
- ‘StatedMonthlyIncome’
- ‘Recommendations’
- ‘ProsperPaymentsOneMonthPlusLate’
- ‘LoanOriginalAmount’
- ‘Term’,
- ‘Occupation’,
- ‘LoanStatus’,
- ‘CreditGrade’,
- ‘PercentFunded’
Univariate Plot Section
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04228 0.00000 39.00000
##
## 0 1 2 3 4 5 6 7 8 9 14 16
## 87862 2438 403 78 23 8 3 3 2 5 1 2
## 21 24 39
## 1 1 1
Recommendations: Number of recommendations.
This is the number of recommendations that Borrowers have when listing the loans. We can see that the number of the recommendations is right skewed. With the mean 0.04, average borrowers don’t have many recommendations. And it goes as high as 39 recommendations for just one borrower.
Histogram chose to depict the distribution of the numerical variable. Since this is right skewed, I log scale the number of recommendations.
Rate: The Borrower’s interest rate for this loan.
The borrower’s rate follow an almost unimodal distribution, with the peak around 0.16. There’s small spike occurs around 0.3.
##
## 1 2 3 4 5 6 7
## 5487 7521 10839 15224 12633 11081 4151
Rating : The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
The Rating can also be null if the Prosper system can’t rate the loan. About 29084 loans aren’t rated by Prosper, which means that the loans originated before July 2009. The number of rating almost follow the order of the rating, except A-grade is the highest number of rating, AA comes second, and the rest following the order.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 14398 47562 5242 5406 1851 561 1738 7903 150 64 76 179
## 12 13 14 15 16 17 18 19 20
## 44 1602 746 1117 253 41 669 597 632
Category: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
I choose Bar Chart for this Listing Category since this is categorical variable. Out of the listing category, three categories comes out as the highest past 10.000 loans. There’s N/A
and Other
categories, so we can’t know for sure the specific category. But one comes out highest which is category 1 (a.k.a. Debt Consolidation), where’s one take out a loan to pay many others. This comes really high with 58308 loans, overshadowing the rest of the categories. It could be that many Prosper visitors comes with already have loans, and want to search some loans to pay for it.
## False True
## 45292 45539
A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner. When looking at the loan that’s been listed, we see that borrower that is homeowner is around equal proportion to those who do not. So it’s not affecting much.
Income is the borrower stated at the time the listing was created. Still we see that isn’t much going on with the monthly income. Monthly Income will definitely be right skewed since fewer people will have a higher salary. So I cut the outliers and log 10 scales.
##
## 0 1 2 3 4 5 6 7 9 11 12 16 21
## 8445 66 27 19 4 6 3 5 1 1 1 1 1
PaymentsLate: Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
Here’s we see that number of late payments is also very right-skewed. Since the number isn’t a floating point, the number isn’t represented as a smooth histogram.
Looking at the frequency count, we see that most of the borrowers have no recommendations when they listing loans. It may seem really hard to get recommendations in the listing loans, since we see that number only few people that can get at least higher than ten recommendations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7000 8644 12500 35000
Amount: The origination amount of the loan.
Now this is interesting. Here we see that the distribution is still right skewed, but the peak seems like it’s around 4000 dollars. In the summary table, we see that it’s the first quarter. While the median is 6500 dollars. Notice that max loan is 35000 dollars, and with people who have the high salary as high as ~2 million dollars monthly income, it doesn’t make sense if he borrow money with such low amount. Since StatedMonthlyIncome is human input, it can’t be trusted hence I will exclude the outlier from now on.
##
## 12 36 60
## 1062 69530 20239
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 41.07 36.00 60.00
Term: The length of the loan expressed in months.
Here we see that term of loans only has three categorical, 1 year, 3 years, and 5 years. Most of the loans made around 3 years, with 5 years come with 24545 loans and 1614 loans at 1 year. I can expect the distribution of the term loan to be nearly normal.
##
## Accountant/CPA
## 3190 2439
## Administrative Assistant Analyst
## 2912 2748
## Architect Attorney
## 162 798
## Biologist Bus Driver
## 94 275
## Car Dealer Chemist
## 140 116
## Civil Service Clergy
## 1109 159
## Clerical Computer Programmer
## 2558 3301
## Construction Dentist
## 1548 61
## Doctor Engineer - Chemical
## 402 178
## Engineer - Electrical Engineer - Mechanical
## 866 1108
## Executive Fireman
## 3400 327
## Flight Attendant Food Service
## 95 943
## Food Service Management Homemaker
## 1008 104
## Investor Judge
## 152 18
## Laborer Landscaping
## 1343 190
## Medical Technician Military Enlisted
## 929 1003
## Military Officer Nurse (LPN)
## 250 422
## Nurse (RN) Nurse's Aide
## 2101 437
## Other Pharmacist
## 22931 209
## Pilot - Private/Commercial Police Officer/Correction Officer
## 146 1272
## Postal Service Principal
## 488 257
## Professional Professor
## 10368 452
## Psychologist Realtor
## 124 439
## Religious Retail Management
## 90 2115
## Sales - Commission Sales - Retail
## 2788 2248
## Scientist Skilled Labor
## 287 2269
## Social Worker Student - College Freshman
## 595 34
## Student - College Graduate Student Student - College Junior
## 185 90
## Student - College Senior Student - College Sophomore
## 153 58
## Student - Community College Student - Technical School
## 19 15
## Teacher Teacher's Aide
## 3002 217
## Tradesman - Carpenter Tradesman - Electrician
## 98 366
## Tradesman - Mechanic Tradesman - Plumber
## 780 82
## Truck Driver Waiter/Waitress
## 1423 345
##
## Other Professional Executive
## 22931 10368 3400
## Computer Programmer
## 3301 3190
Occupation: The Occupation selected by the Borrower at the time they created the listing.
This time, we have 68 levels of the categories. Among top 5 are computer programmer and teachers. We can see that Other
is highest by a wide margin. This is understandable since this is a category that is not listed in the occupation category.
## Cancelled Chargedoff Completed
## 4 10457 25925
## Current Defaulted FinalPaymentInProgress
## 48042 4644 155
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 13 619 204
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 297 237 234
LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
Many of the loan status in this dataset are still running, loan completed comes second with 38074 unit. There is also loan status that past due about 2000 loans. We see a very small number of users which their loans have past-due, which suggests that all of the borrowers have fulfilled their debt.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7000 1.0000 1.0000 0.9985 1.0000 1.0040
PercentFunded: Percent the listing was funded.
This is percent funded of the loans. It seems not good in any way since most of the data are already 1% funded. Minimum at 0.7% and maximum at 1.02%. We could guess that it may floating number representing the percentage, but with 1.02% it shouldn’t be possible. In the density plot, we see that it centered around 1.0, which means that most of the loans listed are actually get funded completely.
##
## A AA B C D E HR NC
## 67003 2711 2893 3500 4474 4188 2782 3148 132
CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
This is essentially the same as Prosper Rating except that credit grade is rating given to loans pre-2009 and ProsperRating given after 2009. While looking at the frequency table we see that credit grade filled with an empty value, that is the loans listed after July 2009. Since in this plot we only want to see credit grade, I exclude the loans listed after that date. What I can see now is Credit Grade that shown like a normal distribution(I can infer that since this is ordinal categorical), centered at C. NC is an outlier, which in this case “No Score”.
Univariate Analysis
Looking at the univariate plot above, I see some interesting insights. It’s unlikely people with 2 million salary have low loan amount. Income is input manually by the borrowers. So there could be some dishonesty from the input.
Many people make in ProsperLoan listing loan money to pay them another loan, as shown by a higher number of Debt Consolidations. People who visit Prosper tends to have another debt from their to consolidate their debt.
Computer programmer and teachers are among top 5 occupations that listing the loans the most. These jobs require them to find some latest info (programming questions or teaching method) from the web, thus, they have a higher chance to hear about how Prosper works.
The main feature of the dataset is the Prosper rating. I like to know which features will best play as the predictors for the Prosper rating.
I want to test the features that only exist when borrower listing the loans. Features like income, loan, occupation, listing category, recommendations, borrower rate, whether borrowers is a homeowner, the number of late payments, and the term. Because Prosper rating is the feature of interest, I exclude the missing value of Prosper rating for bivariate analysis. For this dataset, I don’t create a new feature.
There is unusual features in this dataset, in particular, monthly income. This is stated(human input) by the borrowers. As such, we see many unusual salaries. there’s borrower who’s stated their monthly income close to two million but makes 35000 dollars loan. There’s also a person who makes 0 dollars monthly salary. So I tidy the features by excluding 5% below and 95% above. We know that income will be long tail right skewed, and the distance will be very far among salaries. So I use log 10 transformations. The result after excluding outliers and log 10 transformation produces a normal distribution.
Bivariate Plots Section
Let’s take a look at them in details
7 HIGHEST –> 0 LOWEST
We see that even though the rating isn’t properly scattered,(only 6 different rating), as the rate lower, the rating is increasing. And we see that it has a strong negative linear relationship. This is can be explained as the interest rate is the one where the loaner has benefit from the money that he/she loans.
While the distribution of the rating between homeowner and not homeowner is the same (you can see the median is the same for both category), rating of homeowner has higher Q3 then those who do not.
This is interesting. Rating for 1 and 2 has a really small IQR compared to the rest of the rating. Rating 3,4 has no outlier. Rating 5,6 has a similar outlier, and rating 7 has one outlier. With a correlation of 0.47, it shows a positive moderately strong linear relationship, as the rating increase, the loan amount has longer and longer tail.
When looking at the heatmap above, I see that lot of the borrower have a lower rating, as it confirms with the univariate chart that we saw earlier. An increasing number of recommendations doesn’t affect much of the Prosper Rating. This also correlates with 0.05 correlation between both variables.
PaymentsLate should also be an indication where higher payments late by borrower should be seen as a bad rating. However, Prosper rating showed a weak relationship.
## df$Homeowner: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4167 4876 6000 394400
## --------------------------------------------------------
## df$Homeowner: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4167 5833 6885 8167 1750000
Then I try faceting the homeowner categorical by monthly salary. What I see is that salary is higher to homeowner people to those who don’t. In this summary, we see that both median and mean both higher for homeowner people in term of income.
And in the box plot, we see that having a home will definitely have higher salary income as well, as expected. The outlier in this data is so high it’s overshadow the interquartile range. I see that as expected the outlier to stay within 50,000. StatedMonthlyIncome is the borrower that self-stated his income, I begin to wonder people who have close to 2 million dollars for his monthly income. I categorize this as a bad outlier in the extreme case and replot the boxplot to stay within US$50,000.
There you go, now it looks like the people who homeowner still have a higher median salary. still there are many of outliers in both boxplots since we expect that the income is long tail.
##
## Pearson's product-moment correlation
##
## data: df$Rate and df$Income
## t = -22.813, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09534881 -0.08031441
## sample estimates:
## cor
## -0.08783661
I see no relationship betwen both variables. We can see that in the figure as interest rate increase, shows no sign of an increase in the borrower’s income. Additionally, the correlation also approximate to zero which shows almost no correlation. Interestingly, when I add a regression line, I can see that income slightly decreases as the interest rate increase. So we see in regression line based on the data, people with lower income tend to have a higher interest rate.
Bivariate Analysis
Almost all of them has no strong relationship with Prosper rating, except interest rate. Based on data, borrowers’ interest rate explains 91.15% of the variance in Prosper rating.
When observing the relationship between whether person is homeowner and their monthly stated income, it’s becoming clear that those who have home has higher income both median and mean. Observing the data, with median, salary of homeowner is 1700 dolars higher than those who doesn’t have a home.
The strongest relationship is shown by interest rate and rating, with a negative strong linear relationship. The second one is the relationship between the original loan amount and the rating, which shows positive moderately strong linear relationship. Both loan amount and interest rate can be used to infer rating since it moves in the opposite direction.
Multivariate Plots Section
I see that income now affects the rating as well, but not with a wide margin. The lowest rating peak around the center of the Income. but as the rating goes up, the center of the income turns out also change, shifted towards the right making a left-skewed distribution.
It also apparent that the lower the rating, the rarer it is based on the density plot. Note that we have a huge peak for rating lowest rating when I plot it against interest rate and loan amount. 1 is HR in the Prosper rating, which means high risk. Interestingly, we see that borrower with a higher interest rate will receive a lower rating from Prosper.
But that is not the case with the loan amount. Even if the loan amount is really high, doesn’t make it high risk.
Income/rating shown no unusual difference when plotting against a homeowner.
As we can see that those without homeowner tend to have a higher interest rate, and thus lower rating. While homeowner tends to have lower interest rate and higher rating. This also makes us believe that homeowner is the safer bet for people who loans their money.
Multivariate Analysis
Monthly income indeed strengthen the Prosper rating. We see that the center of the distribution between each rating is shifted to the right every time the rating is going up.
The number of payments late also affecting the rating. The rating will go up as the borrowers have a higher number of late payments.
The high amount of loans should generate higher risk since it’s risky if someone doesn’t pay the loans. But here the rating doesn’t event consider the loan amount. Even the opposite, low loan amount could generate higher risk.
Final Plots and Summary
Plot One
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3430 5000 5935 7083 1750000
## [1] 4000
Description One
I choose histogram since income is numerical variable. The histogram is used when we want to see the distribution. Salary Income is expected to be highly right0skewed since fewer people have a higher salary. But it’s human input, so it’s expected to have human bias like an outlier as well. After excluding the outliers and log10 scale the income, we have tidier normal distribution.
Since this is right skewed, I use quantiles to describe it more statistically. We see that minimum monthly salary is 0, and this is not to be expected. Income is manually stated by the borrower. So there are some users that prefer not to fill it in, thus default to zero. The other thing is where most of the users in the interquartile range are in thousands, there is income that flies as high as 1.75 million but makes US$4,000 loan. Clearly person that have this income shouldn’t make such a low loan. And since this is human input, there’s high chance this person is not being honest. Because of this distribution and outliers stated, I choose to log10 scale to make it normal distribution.
Plot Two
##
## Pearson's product-moment correlation
##
## data: df$Rate and df$Rating
## t = -830.36, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9553976 -0.9540569
## sample estimates:
## cor
## -0.9547321
Description Two
Looking at the statistics, Prosper Rating is strongly correlated with borrower’s interest rate, with negative linear relationship. This is the highest correlation among all other features. This suggests as interest rate decrease, the Prosper Rating receive higher grade, and with R^2 of 0.91, means that 91% of Prosper rating variance can be explained by borrower’s interest rate in the data.
I use box plot and differentiate the charts by Prosper Rating. The reason to do this is I want to see skewed distributions of borrowers’ interest rate across Prosper Rating. Looking at the chart, the lowest Prosper rating have the highest median of interest rate. It has no outliers above Q3. Meanwhile, the highest rating has the smallest IQR compared to the rest of the rating. This suggests that the interest rate in smallest rating only in the small range. If we observe summary statistics of interest rate at highest rating,
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0704 0.0789 0.0797 0.0849 0.2100
We see that the interest rate is minimum at 0.04, and only as high as 0.21
Plot Three
Description Three
I choose density plot to see where distribution is centered as Prosper Rating increase. While it’s not a very distinct trend, we see that the center of the distribution is shifted towards the right when the rating goes higher. We can’t see the normal distribution like this unless we exclude 5% and 95% quantiles, and also log10 scale the income.
##
## Pearson's product-moment correlation
##
## data: df$Income and df$Rating
## t = 22.654, df = 66934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07970669 0.09474270
## sample estimates:
## cor
## 0.08722966
Eventhough we see some different trend in the density plot, The correlation test shows weak positive linear relationship.
Reflection
The dataset is from the Prosper loans, where people could loans money by listing it in the website, specifying all of the requirement fields. The columns specified by the borrowers, among the 13 features I’m exploring are whether the borrowers are a homeowner, their income, and occupation. Of course, since this is human input we should expect that there will be human error.
We see that income that is stated by the borrowers, could be as high as close to 2 million dollars, and when I investigate this, the borrower just borrows for US$35,000. This is very unlikely, given his high monthly salary. Income is expected to be right skewed as well, so I log10 scale the income and also exclude the outlier.
The feature of interest is Prosper Rating, that is what features that contributed to the Prosper rating system. The rating is given immediately after the loans is listed, so I only concern about those features that actually exist at the time of the listing. So among all of the features, I select borrower’s interest rate, homeowner, listing category, borrower’s income, the number of late payments, number of recommendation, loans’ term, and the occupation of the borrower. Among all of this features, the interest rate is strongly correlated, with lower interest rate contribute to higher prosper rating. I’m actually surprise that loans amount is not contributed significantly to the rating. I expect that the higher the loans, the higher the risk the loaner should have.
Occupation and Listing Category can act as the predictors for predicting Prosper rating. But this is hard since both have many categorical variables. People with a more interesting job could have a higher rating. But too many occupations have Other and Professional which is hard to define. While listing category could also play important role, this also has too many categorical levels. The significant of both variables with Prosper Rating can be tested using ANOVA, but it’s outside the scope of the analysis. This analysis also succeeds in selecting the features and the correlation with the feature of interest, which in this case is Prosper Rating. There are whole 81 features in the dataset, and I only select 13 variables in this dataset. It’s interesting additional features to predict the rating of the Prosper system.