loan optimization

26 Jan 2018 | 203 Minute Read on Finance

Loan Portfolio Optimization¶

머신러닝 기법을 사용하여 대출 자산 포트폴리오에 대한 최적화 방법에 대해 생각해 본다.

Part1: 개요(문제정의)¶

각 대출섹터별 수익성과 위험을 기준으로 전체 포트폴리오의 수익을 최적화하는 것이 목적임
제약조건의 변경(금리 수준, 투자 비중의 제약)에 따라 포트폴리오의 리밸런싱이 가능해야 함
대출섹터별로(신차 대출, 중고차 대출, 신용대출(direct, indirect)) 투자비중을 계산하는 것이 목표임

Part2: 접근방식(포트폴리오 최적화)¶

Loan Portfolio Optimization를 하기 위해서 전체 포트폴리오의 Sharp Ratio를 극대화하는 투자비중(w)를 계산함
1. 개별 대출의 묶음을 하나의 투자자산으로 가정함(단위 포트폴리오로 정의함)
2. 단위 포트폴리오에 투자하는 비중(w)을 계산하기 위해서 예상 수익률과 공분산 매트릭스가 필요함
3. 계산에 필요한 예상 수익률은 발생가능한 현금 흐름을 사용함
예상 수익률을 계산하는 방법은 다음과 두가지 방법을 생각해 볼 수 있음
1. 각각 포트폴리오의 과거 월별 현금흐름을 사용하여 계산함
2. 각각 포트폴리오의 미래 부도율을 예측하여 계산함
3. 현금흐름을 할인하는 이자율은 중요 요인이나 분석에서는 고려하지 않았음
포트폴리오의 과거 월별 수익률은 다음과 같이 계산함
1. 섹터별, 대출규모별 단위 포트폴리오를 구성함
2. 월별 수익률을 계산하는 방법은 부도 미발생 일 경우에는 이자율 * 원금의 현금흐름이 발생함
3. 해당월에 부도가 발생하면 원금을 차감함
4. 월별발생 현금흐름에 원금을 나누어서 월별 수익률을 계산
5. 월별 수익률이 계산 되었으므로 분산 공분산 매트릭스를 계산 할 수 있음
포트폴리오의 미래 수익률은 다음과 같이 계산함
1. t시점의 3년전 데이터로 train set과 t시점부터 1년 후 데이터로 test set을 구성함
2. train set의 데이터를 이용하여 부도율을 학습시킴
3. test set를 이용해서 부도 대상을 예측할 수 있음
4. 부도 시기를 예측 할 수 없으므로 36개월 만기 중에서 평균(13.5)의 부도 시점을 사용함
5. 부도율과 미래현금흐름을 이용하여 단위 포트폴리오의 1년 예상 수익률을 시뮬레이션 함
미래수익률과 과거수익률로 분산공분산을 추정하였으므로 Mean Variance 최적화를 시도할 수 있음

Part3: 분석 데이터 예제와 분석 툴¶

입수데이터는 Lending Club의 2008 년부터 2015년 년간의 80만여건의 자료
입수항목은 grade, zip, 소득, 대출기간, 대출금액, 대출상태(부도, 완료, 진행), 집소유 등 78개의 컬럼
분석툴은 Python, Numpy, Pandas, SKlearn, xgboost(Gradient Boost), Jupyter Notebook 를 사용
파이썬 노트북 환경의 장점
- SQL, 코딩, 통계분석, 머신러닝, 시각화를 한 화면에서 지원이 가능
- 웹에서 어디서나 접근이 가능함
- 금융 프로그래밍에 있어서 타 언어(JAVA, C++)보다 더 높은 생산성

Part4: 분석 예제¶

구현을 통해서 Mean-Variance 방식을 사용하여 Loan Portfolio의 최적화 가능여부와 문제점 대해서 살펴 봄
구현 데이터는 lendingclub의 데이터를 사용함
데이터는 https://www.kaggle.com/wendykan/lending-club-loan-data

Part 4.1 데이터 탐색¶

In [1]:

# - 데이터 입수 필요 패키지 import

#https://www.kaggle.com/wendykan/lending-club-loan-data
# 
%pylab
%matplotlib inline
import pandas as pd
import datetime
import time
from collections import defaultdict
import gc
import warnings
import numpy as np
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
#pd.set_option("display.max_columns", 100)
print('Preparing arrays...')
ddir = '/home/bono/ONEDRIVE_DATA/lending/'
#ddir = 'C:/ONEDRIVE_DATA/lending/'
df_data = pd.read_csv(ddir + "loan.csv")
pd.options.display.float_format = '{:.2f}'.format

import matplotlib.font_manager as fm
font_location = "/usr/share/fonts/truetype/nanum/NanumGothic.ttf"
font_name = fm.FontProperties(fname=font_location).get_name()
matplotlib.rc('font', family=font_name)

#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib
Preparing arrays...

4.1.1 사용 컬럼¶

대출 만기는 3년과 5년으로 구성되어 있는데 분석의 편의 위해서 3년 데이터만 사용함
중요 데이터만 추출하여 사용함
- issue_d:대출 발행월
- purpose:대출 상품의 종류(card, mortage, car)
- term:대출 만기
- zip_code: 우편번호
- int_rate: 이자율
- annual_inc: 연소득
- grade: 신용등급
- sub_grade: 신용 보조등급
- emp_length: 근무 년수
- dti: 연소득 대비 대출 금액 비율
- home_ownership:집 자가
- loan_status: 대출 상태(진행, 종료, 부도)
- last_pymnt_d: 마지막 이자 지급월
- total_pymnt:총지급 금액
- installment:월 지급 이자(원금 포함)
- last_pymnt_amnt:마지막 이자 금액
- funded_amnt:대출 원금

In [2]:

#사용한 컬럼과. 대출만기 정의
use_col = ['issue_d','purpose','term','zip_code','int_rate','annual_inc'
           , 'grade','sub_grade','emp_length','dti','home_ownership','loan_status'
           ,'last_pymnt_d','total_pymnt','installment','last_pymnt_amnt','funded_amnt']

df_loan = df_data[use_col]
df_loan = df_loan[df_loan['term'].str.strip() =='36 months']

4.1.2 년도별, 상태별 대출 상태¶

2012년 부터 대출의 증가하였음
Charged Off를 부도로 정의하였음. 전체적으로 5% 수준의 부도율 보임
2015년에 부도가 급속히 줄어 듦

In [3]:

pd.crosstab(df_loan.loan_status, pd.to_datetime(df_loan['issue_d']).dt.year, margins =True).style.background_gradient(cmap='summer_r')
#.merge(pd.crosstab(df_loan.loan_status, df_loan.term).apply(lambda r: r/r.sum()), how  = 'left', right_index =True, left_index =True).style.background_gradient(cmap='summer_r')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()#.apply(lambda r: r/r.sum())

Out[3]:

issue_d	2007	2008	2009	2010	2011	2012	2013	2014	2015	All
loan_status
Charged Off	45	247	594	842	1499	5844	9823	8561	1628	29083
Current	0	0	0	0	0	33	42531	106941	253343	402848
Default	0	0	0	0	0	12	121	357	225	715
Does not meet the credit policy. Status:Charged Off	113	249	129	158	0	0	0	0	0	649
Does not meet the credit policy. Status:Fully Paid	239	582	436	532	0	0	0	0	0	1789
Fully Paid	206	1315	4122	7624	12602	37488	45858	41886	16474	167575
In Grace Period	0	0	0	0	0	6	675	1298	1996	3975
Issued	0	0	0	0	0	0	0	0	5982	5982
Late (16-30 days)	0	0	0	0	0	6	223	540	714	1483
Late (31-120 days)	0	0	0	0	0	81	1149	2986	2810	7026
All	603	2393	5281	9156	14101	43470	100380	162569	283172	621125

In [4]:

f,ax=plt.subplots(1,2,figsize=(18,8))
#df_loan.groupby('loan_status')['loan_status'].count().plot.bar(ax=ax[0])

df_loan['loan_status'].value_counts(normalize = True).plot.bar(ax=ax[0])
df_loan['loan_status'].value_counts(normalize = True).plot.pie(ax=ax[1])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
ax[0].set_title('Number Of Loan Status')
ax[1].set_title('Ratio Of Loan Status')
plt.show()
#normalize 
#df_loan.groupby('loan_status')['loan_status'].count().plot.pie(ax=ax[1])
#data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])

4.1.3 등급별 대출 상태¶

부도와 가장 밀접한 관계를 가지고 있는 신용등급을 조회했음
등급이 낮을 수록 부도율이 높은 것을 확인할 수 있음

In [5]:

pd.crosstab(df_loan.loan_status, df_loan.grade, margins =True).style.background_gradient(cmap='YlGn')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()

Out[5]:

grade	A	B	C	D	E	F	G	All
loan_status
Charged Off	2537	8352	9013	6481	2090	546	64	29083
Current	99093	136927	105765	45396	12988	2392	287	402848
Default	44	172	249	161	71	17	1	715
Does not meet the credit policy. Status:Charged Off	8	79	140	170	119	75	58	649
Does not meet the credit policy. Status:Fully Paid	87	254	453	448	314	127	106	1789
Fully Paid	38903	60909	40984	20609	5040	1010	120	167575
In Grace Period	349	1093	1349	812	307	56	9	3975
Issued	1386	2105	1659	623	164	38	7	5982
Late (16-30 days)	130	348	485	353	127	35	5	1483
Late (31-120 days)	478	1770	2323	1634	622	170	29	7026
All	143015	212009	162420	76687	21842	4466	686	621125

In [6]:

pd.crosstab(df_loan.loan_status, df_loan.grade).apply(lambda r: r/r.sum(), axis = 0).style.background_gradient(cmap='YlGn')

Out[6]:

grade	A	B	C	D	E	F	G
loan_status
Charged Off	0.0177394	0.0393946	0.0554919	0.0845124	0.0956872	0.122257	0.0932945
Current	0.692885	0.645855	0.651182	0.591965	0.594634	0.535602	0.418367
Default	0.00030766	0.000811286	0.00153306	0.00209944	0.00325062	0.00380654	0.00145773
Does not meet the credit policy. Status:Charged Off	5.59382e-05	0.000372626	0.000861963	0.0022168	0.00544822	0.0167936	0.0845481
Does not meet the credit policy. Status:Fully Paid	0.000608328	0.00119806	0.00278907	0.00584193	0.014376	0.0284371	0.154519
Fully Paid	0.27202	0.287294	0.252333	0.268742	0.230748	0.226153	0.174927
In Grace Period	0.0024403	0.00515544	0.00830563	0.0105885	0.0140555	0.0125392	0.0131195
Issued	0.00969129	0.00992882	0.0102143	0.00812393	0.00750847	0.00850873	0.0102041
Late (16-30 days)	0.000908996	0.00164144	0.00298609	0.00460313	0.00581449	0.00783699	0.00728863
Late (31-120 days)	0.00334231	0.0083487	0.0143024	0.0213074	0.0284772	0.0380654	0.0422741

4.1.4 섹터별 대출 상태¶

섹터별로 포트폴리오 최적화를 하는 것이 목적이므로 섹터별 대출 건수와 대출 비율이 중요함

In [7]:

df_loan.groupby(['loan_status','purpose'])['int_rate'].agg('count').unstack().fillna(0).style.background_gradient(cmap='summer_r')

Out[7]:

purpose	car	credit_card	debt_consolidation	educational	home_improvement	house	major_purchase	medical	moving	other	renewable_energy	small_business	vacation	wedding
loan_status
Charged Off	306	5414	16932	49	1396	175	588	449	353	2063	39	884	232	203
Current	3832	105123	227579	1	23385	1254	7369	4246	2655	20816	231	3492	2663	202
Default	5	140	442	0	30	4	7	11	7	48	0	13	8	0
Does not meet the credit policy. Status:Charged Off	11	60	246	32	60	8	16	19	12	106	1	65	4	9
Does not meet the credit policy. Status:Fully Paid	41	243	732	64	121	30	87	31	28	278	2	82	12	38
Fully Paid	2603	35597	95024	260	9785	1039	4569	1971	1440	9639	181	2724	1252	1491
In Grace Period	27	757	2467	0	211	27	87	45	34	232	6	47	33	2
Issued	67	1505	3213	0	369	26	142	75	42	401	5	83	54	0
Late (16-30 days)	9	239	907	0	88	11	35	12	20	108	0	36	18	0
Late (31-120 days)	53	1332	4306	0	387	41	134	101	71	408	8	127	53	5

In [8]:

df_loan.groupby(['loan_status','purpose'])['int_rate'].agg('count').unstack().fillna(0).apply(lambda r: r/r.sum()).style.background_gradient(cmap='YlGn')
#.apply(lambda r: r/r.sum())

#.style.background_gradient(cmap='summer_r')
#.merge(pd.crosstab(df_loan.loan_status, df_loan.term).apply(lambda r: r/r.sum()), how  = 'left', right_index =True, left_index =True).style.background_gradient(cmap='summer_r')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()#.apply(lambda r: r/r.sum())

Out[8]:

purpose	car	credit_card	debt_consolidation	educational	home_improvement	house	major_purchase	medical	moving	other	renewable_energy	small_business	vacation	wedding
loan_status
Charged Off	0.0440035	0.0359949	0.0481231	0.12069	0.0389596	0.0669216	0.0451128	0.0645115	0.0757186	0.0605003	0.0824524	0.11704	0.0535921	0.104103
Current	0.55105	0.69891	0.646811	0.00246305	0.652629	0.479541	0.565368	0.610057	0.569498	0.610458	0.488372	0.462333	0.615154	0.10359
Default	0.000719011	0.000930789	0.00125622	0	0.00083724	0.00152964	0.000537057	0.00158046	0.0015015	0.00140767	0	0.00172117	0.001848	0
Does not meet the credit policy. Status:Charged Off	0.00158182	0.00039891	0.000699166	0.0788177	0.00167448	0.00305927	0.00122756	0.00272989	0.002574	0.0031086	0.00211416	0.00860585	0.000924001	0.00461538
Does not meet the credit policy. Status:Fully Paid	0.00589589	0.00161558	0.00208044	0.157635	0.00337687	0.0114723	0.00667485	0.00445402	0.00600601	0.00815273	0.00422833	0.0108566	0.002772	0.0194872
Fully Paid	0.374317	0.236666	0.270071	0.640394	0.27308	0.397323	0.350545	0.28319	0.30888	0.282677	0.382664	0.360651	0.289212	0.764615
In Grace Period	0.00388266	0.00503291	0.00701155	0	0.00588859	0.010325	0.00667485	0.00646552	0.00729301	0.00680372	0.012685	0.00622269	0.00762301	0.00102564
Issued	0.00963474	0.010006	0.00913178	0	0.0102981	0.00994264	0.0108946	0.0107759	0.00900901	0.0117599	0.0105708	0.010989	0.012474	0
Late (16-30 days)	0.00129422	0.00158899	0.00257782	0	0.00245591	0.0042065	0.00268528	0.00172414	0.00429	0.00316725	0	0.00476632	0.004158	0
Late (31-120 days)	0.00762151	0.00885579	0.0122382	0	0.0108004	0.0156788	0.0102808	0.0145115	0.0152295	0.0119652	0.0169133	0.0168145	0.012243	0.0025641

In [9]:

f,ax=plt.subplots(1,2,figsize=(18,8))

da1 = df_loan['purpose'].value_counts().sort_values()
da1.plot.pie(ax=ax[0])
da2 = df_loan.groupby('purpose')['int_rate'].mean().sort_values()
da2.plot.bar(ax=ax[1])
ax[0].set_title("Sector Ratio")
ax[1].set_title('Interest Average')
plt.show()

4.1.5 부도 발생시점¶

수익률을 계산하기 위해서 부도 발생 시점을 확인하는 것은 중요한 단계중에 하나임
평균 13.5개월임(대출기간 36개월)

In [10]:

df_loan1 = df_loan
df_loan1['de_mon'] = pd.to_datetime(df_loan1['last_pymnt_d']).dt.to_period('M') - pd.to_datetime(df_loan1['issue_d']).dt.to_period('M')
print(df_loan1['de_mon'].mean())

13.5805664824

In [11]:

f,ax=plt.subplots(1,2,figsize=(18,8))

da1 = df_loan1[df_loan1["loan_status"].str.strip()=='Charged Off']['de_mon'].value_counts().sort_index()
da1.plot.line(ax=ax[0])
da1.plot.pie(ax=ax[1])
ax[0].set_title("Trend Of Default Month")
ax[1].set_title('Ratio of Default Month')
plt.show()

4.1.6 대출규모와 연간 소득¶

전체 평균의 대출규모은 12,474 달러임
연간소득의 평균은 (72,555)임
연간소득에는 아주 높은 이상치가 존재함

In [12]:

print(df_loan.funded_amnt.mean())
f,ax=plt.subplots(2,2,figsize=(18,8))
df_loan.funded_amnt.hist(bins= 10, ax = ax[0][0]) # [0][i]
ax[0][0].set_title('Number Of Hist')
ax[0][0].set_ylabel('Count')
df_loan[df_loan.funded_amnt.rank(pct=True) < 0.1]['funded_amnt'].hist(bins = 20, ax = ax[0][1], color= 'r')
df_loan[df_loan.funded_amnt.rank(pct=True) > 0.9]['funded_amnt'].hist(bins = 20, ax = ax[1][0], color= 'g')
ax[0][1].set_title('10% less 이상')
ax[1][0].set_title('90% more 이하.')
plt.show()
#.hist(bins= 100, ax = ax[1])#

12474.785993157577

In [13]:

df_loan.describe().style.background_gradient(cmap='terrain_r')

Out[13]:

	int_rate	annual_inc	dti	total_pymnt	installment	last_pymnt_amnt	funded_amnt
count	621125	621121	621125	621125	621125	621125	621125
mean	12.0187	72555.5	17.5822	7130.69	413.538	1939.92	12474.8
std	3.85307	66293.5	19.7838	7297.5	259.335	4028.19	7779.36
min	5.32	0	0	0	15.67	0	500
25%	8.9	43000	11.34	1802.84	219.97	246.62	6600
50%	11.99	60000	17.02	4737.8	342.31	430.45	10000
75%	14.47	87000	23.31	10198.6	543.6	924.81	16000
max	28.99	9e+06	9999	49978.7	1445.46	36188.8	35000

In [14]:

#print(df_loan.funded_amnt.mean())
f,ax=plt.subplots(2,2,figsize=(18,8))
df_loan.annual_inc.hist(bins= 100, ax = ax[0][0]) # [0][i]
ax[0][0].set_title('Number Of Annual Income')
ax[0][1].set_ylabel('Count')
df_loan[df_loan.annual_inc.rank(pct=True) < 0.99]['annual_inc'].hist(bins = 30, ax = ax[0][1], color= 'r')
df_loan[df_loan.annual_inc.rank(pct=True) > 0.99]['annual_inc'].hist(bins = 30, ax = ax[1][0], color= 'y')
ax[0][1].set_title('10% less')
ax[1][0].set_title('90% more')
#df_loan[df_loan.annual_inc.rank(pct=True) > 0.999]['annual_inc'].hist(ax = ax[1][1])
#.hist(bins= 100, ax = ax[1])#

Out[14]:

<matplotlib.text.Text at 0x7f3b58a59400>

Part 4.2 단위 포트폴리오 구성과 분석¶

단위 포트폴리오는 섹터별로 4개 ('credit_card', 'other','home_improvement', 'debt_consolidation') 별로, 그리고 대출규모 금액가중으로 8분위로 분류하여 구성하였음

Part 4.2.1 단위 포트폴리오 구성¶

대출 금액의 합계가 동일하게 단일 포트폴리오 구성
대출금액이 낮을 때와 높을 때 이자율이 높음
credit_card는 이자율이 가장 높고 other의 이자율이 높음
연도별로 2012년과 2015년에는 이자율이 낮아짐

In [15]:

df_loan['m_rank'] = np.ceil(df_loan.funded_amnt.rank(pct=True) * 10)
df_loan.groupby('m_rank')['funded_amnt'].agg(['count','mean','sum'])
df_loan['funded_amnt'].agg(['count','mean','sum'])
a= np.linspace(0, 7748401550, num=10) 
df_loan['funded_amnt_cum']= df_loan['funded_amnt'].sort_values().cumsum()
df_loan['cum_rank'] = df_loan['funded_amnt_cum'].apply(lambda x:  np.digitize(x, a))
df_loan.groupby('cum_rank').agg({'funded_amnt':['sum','count','mean'],'int_rate':['mean']}).style.background_gradient(cmap='RdPu')

Out[15]:

	int_rate <th class="col_heading level0 col1" colspan=3>funded_amnt</th>
	mean	sum	count	mean
cum_rank
1	12.7019	8.60928e+08	178952	4810.94
2	12.2866	8.60938e+08	99873	8620.33
3	11.7826	8.60927e+08	80755	10661
4	11.6757	8.60933e+08	65060	13232.9
5	11.4368	8.60942e+08	54855	15694.9
6	11.1931	8.6092e+08	44974	19142.6
7	11.0124	8.60947e+08	38994	22079
8	11.1401	8.60927e+08	32073	26842.7
9	13.1449	8.6094e+08	25589	33644.9

In [16]:

df_loan.head()
df_loan['start_d'] = pd.to_datetime(df_loan.issue_d)

In [17]:

f,ax=plt.subplots(1,2,figsize=(18,8))
df_loan.groupby(["grade",(pd.Grouper(key='start_d', freq="Q"))])['int_rate'].mean().unstack('grade').plot(ax = ax[0])
p_l1 = ['credit_card', 'other','home_improvement', 'debt_consolidation']
df_loan[df_loan.purpose.isin(p_l1)].groupby(["purpose",(pd.Grouper(key='start_d', freq="M"))])['int_rate'].mean().unstack('purpose').plot(ax = ax[1])
#df_loan[df_loan.purpose.isin(p_l1)]
ax[0].set_title('grade별 이자율 추이')
ax[1].set_title('섹터별 이자율 추이')
#df_loan.groupby(["purpose","issue_d"])['int_rate'].mean().plot()

Out[17]:

<matplotlib.text.Text at 0x7f3b58c76fd0>

Part 4.2.2 단위 포트폴리오 수익률 계산¶

포트폴리오 수익률 계산
미 부도시 이자율 * 원급 지급의 현금 흐름 발생
부도시 원금만 현금흐름이 유출되고 그 다음 부터는 0
부도 시점은 마지막 지급액

In [18]:

df_loan['last_pymnt_d']= df_loan.apply(lambda cols: cols['last_pymnt_d'] if pd.isnull(cols['last_pymnt_d']) else cols['issue_d'], axis = 1) 
df_train10 = df_loan
df_train10 = df_train10[~df_train10.last_pymnt_d.isnull()]

In [19]:

df_train10['start_d'] = pd.to_datetime(df_train10.issue_d)
#df_train10['str_start_d'] = df_train10['start_d'].strftime("%Y-%m-%d")
df_train10['str_start_d'] = df_train10['start_d'].apply(lambda x: x.strftime('%Y-%m'))
df_train10['str_start_d'] = df_train10['start_d'].apply(lambda x: x.strftime('%Y-%m'))
df_train10['last_d'] = pd.to_datetime(df_train10.last_pymnt_d)
df_train10['str_last_d'] = df_train10['last_d'].apply(lambda x: x.strftime('%Y-%m'))
#type(df_train10['start_d'])
df_train10['maturity_d'] = df_train10['start_d'].apply(lambda x: x + pd.DateOffset(years=3))
df_train10['str_maturity_d'] = df_train10['maturity_d'].apply(lambda x: x.strftime('%Y-%m'))

In [20]:

df_train10['issue_d'].count()
#608845
df_train10['cum_rank'].unique()
df_train10['purpose'].unique()
p_l1 = ['credit_card', 'other','home_improvement', 'debt_consolidation']
p_l2 = np.arange(1,9)

In [21]:

def calRetOfPeiord(df_port, pr_cal):
    ret_mo = {}
    for date_i in pr_cal:
        #print(calRet(df_port,date_i))
        ret_mo[date_i] = calRet(df_port,date_i)
    
    return ret_mo

#calRetOfPeiord(df_train10, pr_cal)     
#def calRet()
def calRet(df_train10,  current_month):
    #current_month = '2013-01'
    df_cur = df_train10[(df_train10['str_start_d'] <= current_month ) & (df_train10['str_maturity_d'] >=  current_month)]
    #print(df_cur.count())
    de_m1 = df_cur[(df_cur['loan_status'].str.strip() == 'Charged Off') & (df_cur['last_d'] == current_month)]
    de_m0 = df_cur[(df_cur['loan_status'].str.strip() == 'Charged Off') & (df_cur['last_d'] != current_month)]
    de_p1 = df_cur[(df_cur['loan_status'].str.strip() != 'Charged Off') ]
    d1 = sum(de_m1['funded_amnt'])
    p1 = sum(de_p1['int_rate'] / 12 * de_p1['funded_amnt'])
    try:
        retOfMonth = (p1 - d1) / sum(df_cur['funded_amnt'])
    except ZeroDivisionError as e:
        #print(e)
        retOfMonth = 0
    return retOfMonth

In [22]:

#df_cal = df_train10[(df_train10['purpose'] == 'credit_card') &  (df_train10['cum_rank'] == 8)]

In [23]:

#nd1 = df_train10.matuarity.unique() 

pr_cal = df_train10['str_start_d'].unique() 
pr_cal.sort()
ret_sect_size = {}

for i in p_l1:
    for j in p_l2:
        #print(i + str(j))
        #print(df_train10[(df_train10['purpose'] == i) &  (df_train10['cum_rank'] == j)]['issue_d'].count())
        df_cal = df_train10[(df_train10['purpose'] == i) &  (df_train10['cum_rank'] == j)]
        ret_sect_size[(i + '|' + str(j))] = calRetOfPeiord(df_cal, pr_cal)

In [ ]:

In [24]:

df_ret = pd.DataFrame.from_dict(ret_sect_size)
df_ret = df_ret.iloc[67 :,:]
mean_daily_returns  = df_ret.mean()
corr_matrix  = df_ret.corr()

In [25]:

#df_ret.columns

Part 4.2.3 단위 포트폴리오 수익률 분석¶

Others 섹터의 단위 포트포리오의 수익률이 높음.
소액투자의 단위 포트폴리오이 수익률이 높음(투자수가 많아져서 분산 효과???)

In [26]:

f,ax=plt.subplots(1,1,figsize=(18,8))
pd.options.display.float_format = '{:.4f}'.format
df_ret.plot(ax = ax)
ax.set_title('32개 단위 포트폴리오의 수익률 추이')

Out[26]:

<matplotlib.text.Text at 0x7f3b4973f240>

In [27]:

def process_index(k):
    return tuple(k.split("|"))

#df.index = 
df_ret.columns  = pd.MultiIndex.from_tuples([process_index(k) for k,v in df_ret.iteritems()])

In [28]:

f,ax=plt.subplots(1,2,figsize=(18,8))
df_ret.groupby(level=[0], axis =1).mean().plot(ax =ax[0])
df_ret.groupby(level=[1], axis =1).mean().plot(ax =ax[1])

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3b5881b240>

In [29]:

# cov_matrix

import seaborn as sns
fig, ax = plt.subplots(figsize=(14,14))   
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax)
# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)

plt.show()

4.3 Predictive Part¶

4.3.1 문자 catetory로 숫자로 변환¶

In [30]:

df_loan_st = df_loan.copy()
loan_obj_mod =   df_loan.select_dtypes(include=['object']).columns
loan_obj_mod =[ 'purpose', 'term', 'zip_code', 'grade', 'sub_grade',
       'emp_length', 'home_ownership', 'loan_status']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#var_mod = ['term','grade','sub_grade']
for i in loan_obj_mod:
    df_loan[i] = le.fit_transform(df_loan[i])

In [31]:

#df_loan.issue_d.to_period('M')
# df_train =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2010-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2012-12-31') ]
# df_test =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2013-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2013-12-31') ]

#df_loan.issue_d.to_period('M')
df_train =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2012-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2014-12-31') ]
df_test =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2015-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2015-12-31') ]

m_use =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt', 'target']
features  =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt']
df_train['target'] = df_train.loan_status.apply(lambda x:1 if x == 0 else 0)
df_test['target'] = df_test.loan_status.apply(lambda x:1 if x == 0 else 0)

In [32]:

#분석에 사용될 변수
features  =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt']

y_values = df_train['target'].values

Part 4.3.2 Gradinent Boost 알고리즘 적용¶

2012년 - 2014년 데이터를 train으로 하여 2015년의 부도율을 예측함.

In [33]:

def get_importance(gbm, features):
    #importance = pd.Series(gbm.booster().get_fscore()).sort_values(ascending=False)
    #AttributeError: 'Booster' object has no attribute 'booster'
    #https://github.com/dmlc/xgboost/issues/2119
    outfile = open('xgb.fmap', 'w')
    for i, feat in enumerate(features):
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
    outfile.close()
    importance = gbm.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=itemgetter(1), reverse=False)
    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()
    plt.figure()
    #df.plot()
    df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(16, 10))
    plt.title('XGBoost Feature Importance')
    return importance

### xgb start

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional scklearn functions
from sklearn.grid_search import GridSearchCV   #Perforing grid search
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from operator import itemgetter

def runXGboost(df_train, df_test, feature_names, targetY):
    
    xgb_pars = {
    "objective": "binary:logistic",    #  1 reg:linear  2 "binary:logistic", 3 multi:softmax,  4 multi:softprob 
    "booster" : "gbtree",             # 1 gbtree: tree-based, 2 gblinear: linear models
    "eval_metric": "auc", 
    #rmse – root mean square error,   mae – mean absolute error , auc: Area under the curve
    # logloss, mlogloss  – negative log-likelihood   error, merror – Binary classification error rate (0.5 threshold)  merror        
    'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,  "tree_method": 'exact',
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'silent': 1}
    num_boost_round = 115
    early_stopping_rounds = 10
    test_size = 0.2
    
    print(df_train[feature_names].columns)
    print(targetY.shape)
    
    #X_train, X_valid=                               
    X_train ,X_valid, y_train, y_valid =  train_test_split(df_train[feature_names].values, targetY, test_size = test_size , random_state= 1)
    dtrain = xgb.DMatrix(X_train, label = y_train)
    dvalid = xgb.DMatrix(X_valid, label=  y_valid)
    dtest = xgb.DMatrix(df_test[feature_names].values)
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    
    gbm_model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)
    
    print("Validating...")
    check = gbm_model.predict(dvalid)
    importance = get_importance(gbm_model, features)
    print(importance)
    #importance = gbm.get_fscore(fmap='xgb.fmap')
    score = roc_auc_score(y_valid, check)
    print('Check error value: {:.6f}'.format(score))

    return gbm_model.predict(dtest)

df_test['result'] = runXGboost(df_train, df_test, features,  y_values)

/home/bono/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/bono/anaconda3/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Index(['purpose', 'zip_code', 'int_rate', 'annual_inc', 'grade', 'sub_grade',
       'emp_length', 'dti', 'home_ownership', 'installment', 'funded_amnt'],
      dtype='object')
(306419,)
[0] train-auc:0.651848  valid-auc:0.654326
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 50 rounds.
[10]  train-auc:0.696095  valid-auc:0.670383
[20]  train-auc:0.733308  valid-auc:0.693867
[30]  train-auc:0.745974  valid-auc:0.698791
[40]  train-auc:0.752872  valid-auc:0.701923
[50]  train-auc:0.756673  valid-auc:0.70102
Validating...
[('grade', 105), ('home_ownership', 142), ('purpose', 238), ('emp_length', 256), ('sub_grade', 298), ('int_rate', 372), ('funded_amnt', 462), ('zip_code', 658), ('dti', 674), ('annual_inc', 810), ('installment', 1126)]
Check error value: 0.700114

<matplotlib.figure.Figure at 0x7f3b5230e518>

Part 4.3.2 Gradinent Boost 알고리즘 적용 결과¶

사후적으로 부도로 인식된 것들이 알고리즘 적용 결과 더 높은 부도율을 보여줌

In [34]:

f,ax=plt.subplots(1,3,figsize=(18,8))
df_test['result'].hist(bins= 20,  color='y', ax = ax[0])
df_test[df_test['target'] == 0]['result'].hist(bins= 20,  color='g', ax = ax[1])
df_test[df_test['target'] == 1]['result'].hist(bins= 20,  color='r', ax = ax[2])
ax[0].set_title('test 전체')
ax[1].set_title('test 중 미 부도')
ax[2].set_title('test 중 부도')
plt.show()

In [35]:

df_pair = df_test[['target','purpose','int_rate','grade','dti','loan_status']]
%matplotlib inline
sns.set()
sns.pairplot(df_pair[df_pair['loan_status'].isin([0,1])], hue='loan_status');

Part 4.3.3 미래부도확률을 사용하여 수익률 계산¶

미 부도시 이자율 * 원급 지급의 현금 흐름 발생
부도시 원금만 현금흐름이 유출되고 그 다음 부터는 0
부도 시점은 계산할 수 없음

In [37]:

df_test.head()

Out[37]:

	issue_d	purpose	zip_code	int_rate	annual_inc	grade	sub_grade	emp_length	dti	...	installment	funded_amnt	de_mon	m_rank	funded_amnt_cum	cum_rank	start_d	result
466286	Dec-2015	1	439	5.3200	100000.0000	0	0	10	7.2800	...	260.5000	8650.0000	NaN	4.0000	1302554975.0000	2	2015-12-01	0.0114
466287	Dec-2015	2	625	14.8500	35000.0000	2	14	5	15.2200	...	146.1600	4225.0000	NaN	1.0000	179683275.0000	1	2015-12-01	0.0742
466290	Dec-2015	2	894	9.8000	92000.0000	1	7	3	13.8900	...	482.6100	15000.0000	NaN	7.0000	3377575900.0000	4	2015-12-01	0.0388
466291	Dec-2015	2	211	7.4900	109000.0000	0	3	1	26.0200	...	777.5500	25000.0000	NaN	10.0000	6264808700.0000	8	2015-12-01	0.0088
466292	Dec-2015	2	179	13.9900	55000.0000	2	13	1	5.9100	...	123.0300	3600.0000	NaN	1.0000	108387050.0000	1	2015-12-01	0.0921

5 rows × 24 columns

In [38]:

df_result = df_loan_st.merge(df_test[['target','result']], how= 'right', left_index= True, right_index=True)

In [39]:

df_result['exRet'] = (1) * df_result['int_rate'] - 100 *df_result['result']

In [40]:

ret_sect_size2= {}

for i in p_l1:
    for j in p_l2:
        df_cal = df_result[(df_result['purpose'] == i) & (df_result['cum_rank'] == j)]
        ret_sect_size2[(i + '|' + str(j))] =sum (df_cal['exRet'] * df_cal['funded_amnt']) / sum( df_cal['funded_amnt'])

In [41]:

df_train['int_rate'].mean()
df_result['int_rate'].mean()

Out[41]:

11.278818350702988

Part 4.4.2 Mean Variance¶

최적화 포트폴리오 계산

In [42]:

# Calculates portfolio mean return
def port_mean(W, R):
    return sum(R * W)

# Calculates portfolio variance of returns
def port_var(W, C):
    return np.dot(np.dot(W, C), W)

# Combination of the two functions above - mean and variance of returns calculation
def port_mean_var(W, R, C):
    return port_mean(W, R), port_var(W, C)

def solve_frontier(R, C, rf):
    def fitness(W, R, C, r):
        # For given level of return r, find weights which minimizes portfolio variance.
        mean, var = port_mean_var(W, R, C)
        penalty = 100 * abs(
            mean - r)  # Big penalty for not meeting stated portfolio return effectively serves as optimization constraint
        return var + penalty

    frontier_mean, frontier_var = [], []
    n = len(R)  # Number of assets in the portfolio
    for r in np.linspace(min(R), max(R), num=20):  # Iterate through the range of returns on Y axis
        W =  np.ones([n]) / n  # start optimization with equal weights
        b_ = [(0, 1) for i in range(n)]
        c_ = ({'type': 'eq', 'fun': lambda W: sum(W) - 1.})
        optimized = scipy.optimize.minimize(fitness, W, (R, C, r), method='SLSQP', constraints=c_, bounds=b_)
        if not optimized.success:
            raise BaseException(optimized.message)
        # add point to the efficient frontier [x,y] = [optimized.x, r]
        frontier_mean.append(r)
        frontier_var.append(port_var(optimized.x, C))
    return np.array(frontier_mean), np.array(frontier_var)

from scipy.optimize import minimize
import scipy
def solve_weights(R, C, rf):
    def fitness(W, R, C, rf):
        # calculate mean/variance of the portfolio
        mean, var = port_mean_var(W, R, C)  
        util = (mean - rf) / np.sqrt(var)      # utility = Sharpe ratio
        return 1/util                       # maximize the utility
    n = len(R)
    W = np.ones([n])/n                     # start with equal weights
    b_ = [(0.,1.) for i in range(n)]    # weights between 0%..100%. 
                                        # No leverage, no shorting
    c_ = ({'type':'eq', 'fun': lambda W: sum(W)-1. })   # Sum of weights = 100%
    optimized = scipy.optimize.minimize(fitness, W, (R, C, rf), 
                method='SLSQP', constraints=c_, bounds=b_)  
    if not optimized.success: 
        raise BaseException(optimized.message)
    return optimized.x  # Return optimized weights

In [43]:

class Result:
    def __init__(self, W, tan_mean, tan_var, front_mean, front_var):
        self.W=W
        self.tan_mean=tan_mean
        self.tan_var=tan_var
        self.front_mean=front_mean
        self.front_var=front_var
        
def optimize_frontier(R, C, rf):
    W = solve_weights(R, C, rf)
    tan_mean, tan_var = port_mean_var(W, R, C)  # calculate tangency portfolio
    front_mean, front_var = solve_frontier(R, C, rf)  # calculate efficient frontier
    # Weights, Tangency portfolio asset means and variances, Efficient frontier means and variances
    return Result(W, tan_mean, tan_var, front_mean, front_var)

def display_assets(names, R, C, color='red'):
    f,ax=plt.subplots(1,1,figsize=(18,8))

    scatter([C[i, i] ** .5 for i in range(len(names))], R, marker='x', color='G'), grid(True)  # draw assets
    for i in range(len(names)): 
        text(C[i, i] ** .5, R[i], '  %s' % names[i], verticalalignment='center', color=color) # draw labels

def display_frontier(result, label=None, color='black'):
    text(result.tan_var ** .5, result.tan_mean, '   tangent', verticalalignment='center', color=color)
    scatter(result.tan_var ** .5, result.tan_mean, marker='o', color=color), grid(True)
    plot(result.front_var ** .5, result.front_mean, label=label, color=color), grid(True)  # draw efficient frontier

In [44]:

df_ex_ret = pd.DataFrame(ret_sect_size2,  index =["rets"])
df_ex_ret  = df_ex_ret / 100

df_ret = pd.DataFrame.from_dict(ret_sect_size)
df_ret = df_ret.iloc[67 :,:]

#mean_daily_returns  = df_ret.mean()
cov_matrix  = df_ret.cov()
mean_month_returns  = df_ret.mean()

mean_month_returns
df_ex1 = df_ex_ret.T.join(df_ret.var().to_frame("vari"))

rf = 0.001
R = 12 * df_ex_ret.T['rets'].values
C = cov_matrix.values

W = solve_weights(R, C, rf)
print(W)

mean, var = port_mean_var(W, R, C)     
print(mean)
print(var)

mean, var = port_mean_var(W, R, C)     
names = list(df_ret)
#display_a
res1 = optimize_frontier(R, C, rf)
display_assets(names, R, C, color='blue')

display_frontier(res1, color='red')
xlabel('variance $\sigma$'), ylabel('mean $\mu$'), show()

display(pd.DataFrame({'Weight': res1.W}, index=names)[pd.DataFrame({'Weight': res1.W}, index=names)['Weight'] >= 0.000001].T)
plt.show()

[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.30751986e-17   0.00000000e+00   3.25749869e-01   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   2.38320548e-01
   7.99859749e-02   3.55943609e-01   1.74123304e-17   0.00000000e+00
   0.00000000e+00   8.33392227e-18   6.89257690e-18   6.59073393e-18
   6.47707150e-20   0.00000000e+00   0.00000000e+00   2.63013043e-17
   0.00000000e+00   2.31346960e-18   8.99918864e-18   5.45602067e-18
   0.00000000e+00   0.00000000e+00   1.18741740e-18   1.40242754e-18]
0.653256337566
1.6049738616e-05

	credit_card\|7	debt_consolidation\|4	debt_consolidation\|5	debt_consolidation\|6
Weight	0.3257	0.2383	0.0800	0.3559

In [ ]:

5 결론¶

MV은 수익 하나보다는 수익과 위험을 동시에 고려하는 최적화 방식인데 위험 보다는 수익을 좀 더 고려하는 방안이 필요함
제약조건의 변경에 따른 시뮬레이션을 구하려면 실제 이자율과 할인율에 대한 적용방법이 필요함

In [47]:

#-! jupyter nbconvert –to html  *.ipynb

In [46]:

help(nbconvert)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-46-aba79f90e137> in <module>()
----> 1 help(nbconvert)

NameError: name 'nbconvert' is not defined

In [ ]: