loan optimization

Loan Portfolio Optimization

머신러닝 기법을 사용하여 대출 자산 포트폴리오에 대한 최적화 방법에 대해 생각해 본다.

Part1: 개요(문제정의)

  1. 각 대출섹터별 수익성과 위험을 기준으로 전체 포트폴리오의 수익을 최적화하는 것이 목적임
  2. 제약조건의 변경(금리 수준, 투자 비중의 제약)에 따라 포트폴리오의 리밸런싱이 가능해야 함
  3. 대출섹터별로(신차 대출, 중고차 대출, 신용대출(direct, indirect)) 투자비중을 계산하는 것이 목표임

Part2: 접근방식(포트폴리오 최적화)

  1. Loan Portfolio Optimization를 하기 위해서 전체 포트폴리오의 Sharp Ratio를 극대화하는 투자비중(w)를 계산함
    1. 개별 대출의 묶음을 하나의 투자자산으로 가정함(단위 포트폴리오로 정의함)
    2. 단위 포트폴리오에 투자하는 비중(w)을 계산하기 위해서 예상 수익률과 공분산 매트릭스가 필요함
    3. 계산에 필요한 예상 수익률은 발생가능한 현금 흐름을 사용함
  2. 예상 수익률을 계산하는 방법은 다음과 두가지 방법을 생각해 볼 수 있음
    1. 각각 포트폴리오의 과거 월별 현금흐름을 사용하여 계산함
    2. 각각 포트폴리오의 미래 부도율을 예측하여 계산함
    3. 현금흐름을 할인하는 이자율은 중요 요인이나 분석에서는 고려하지 않았음
  3. 포트폴리오의 과거 월별 수익률은 다음과 같이 계산함

    1. 섹터별, 대출규모별 단위 포트폴리오를 구성함
    2. 월별 수익률을 계산하는 방법은 부도 미발생 일 경우에는 이자율 * 원금의 현금흐름이 발생함
    3. 해당월에 부도가 발생하면 원금을 차감함
    4. 월별발생 현금흐름에 원금을 나누어서 월별 수익률을 계산
    5. 월별 수익률이 계산 되었으므로 분산 공분산 매트릭스를 계산 할 수 있음
  4. 포트폴리오의 미래 수익률은 다음과 같이 계산함

    1. t시점의 3년전 데이터로 train set과 t시점부터 1년 후 데이터로 test set을 구성함
    2. train set의 데이터를 이용하여 부도율을 학습시킴
    3. test set를 이용해서 부도 대상을 예측할 수 있음
    4. 부도 시기를 예측 할 수 없으므로 36개월 만기 중에서 평균(13.5)의 부도 시점을 사용함
    5. 부도율과 미래현금흐름을 이용하여 단위 포트폴리오의 1년 예상 수익률을 시뮬레이션 함
  5. 미래수익률과 과거수익률로 분산공분산을 추정하였으므로 Mean Variance 최적화를 시도할 수 있음

Part3: 분석 데이터 예제와 분석 툴

  1. 입수데이터는 Lending Club의 2008 년부터 2015년 년간의 80만여건의 자료
  2. 입수항목은 grade, zip, 소득, 대출기간, 대출금액, 대출상태(부도, 완료, 진행), 집소유 등 78개의 컬럼
  3. 분석툴은 Python, Numpy, Pandas, SKlearn, xgboost(Gradient Boost), Jupyter Notebook 를 사용
  4. 파이썬 노트북 환경의 장점
    • SQL, 코딩, 통계분석, 머신러닝, 시각화를 한 화면에서 지원이 가능
    • 웹에서 어디서나 접근이 가능함
    • 금융 프로그래밍에 있어서 타 언어(JAVA, C++)보다 더 높은 생산성

Part4: 분석 예제

Part 4.1 데이터 탐색

In [1]:
# - 데이터 입수 필요 패키지 import

#https://www.kaggle.com/wendykan/lending-club-loan-data
# 
%pylab
%matplotlib inline
import pandas as pd
import datetime
import time
from collections import defaultdict
import gc
import warnings
import numpy as np
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
#pd.set_option("display.max_columns", 100)
print('Preparing arrays...')
ddir = '/home/bono/ONEDRIVE_DATA/lending/'
#ddir = 'C:/ONEDRIVE_DATA/lending/'
df_data = pd.read_csv(ddir + "loan.csv")
pd.options.display.float_format = '{:.2f}'.format

import matplotlib.font_manager as fm
font_location = "/usr/share/fonts/truetype/nanum/NanumGothic.ttf"
font_name = fm.FontProperties(fname=font_location).get_name()
matplotlib.rc('font', family=font_name)

#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib
Preparing arrays...

4.1.1 사용 컬럼

  • 대출 만기는 3년과 5년으로 구성되어 있는데 분석의 편의 위해서 3년 데이터만 사용함
  • 중요 데이터만 추출하여 사용함
    • issue_d:대출 발행월
    • purpose:대출 상품의 종류(card, mortage, car)
    • term:대출 만기
    • zip_code: 우편번호
    • int_rate: 이자율
    • annual_inc: 연소득
    • grade: 신용등급
    • sub_grade: 신용 보조등급
    • emp_length: 근무 년수
    • dti: 연소득 대비 대출 금액 비율
    • home_ownership:집 자가
    • loan_status: 대출 상태(진행, 종료, 부도)
    • last_pymnt_d: 마지막 이자 지급월
    • total_pymnt:총지급 금액
    • installment:월 지급 이자(원금 포함)
    • last_pymnt_amnt:마지막 이자 금액
    • funded_amnt:대출 원금
In [2]:
#사용한 컬럼과. 대출만기 정의
use_col = ['issue_d','purpose','term','zip_code','int_rate','annual_inc'
           , 'grade','sub_grade','emp_length','dti','home_ownership','loan_status'
           ,'last_pymnt_d','total_pymnt','installment','last_pymnt_amnt','funded_amnt']

df_loan = df_data[use_col]
df_loan = df_loan[df_loan['term'].str.strip() =='36 months']

4.1.2 년도별, 상태별 대출 상태

  • 2012년 부터 대출의 증가하였음
  • Charged Off를 부도로 정의하였음. 전체적으로 5% 수준의 부도율 보임
  • 2015년에 부도가 급속히 줄어 듦
In [3]:
pd.crosstab(df_loan.loan_status, pd.to_datetime(df_loan['issue_d']).dt.year, margins =True).style.background_gradient(cmap='summer_r')
#.merge(pd.crosstab(df_loan.loan_status, df_loan.term).apply(lambda r: r/r.sum()), how  = 'left', right_index =True, left_index =True).style.background_gradient(cmap='summer_r')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()#.apply(lambda r: r/r.sum())
Out[3]:
issue_d200720082009201020112012201320142015All
loan_status
Charged Off452475948421499584498238561162829083
Current000003342531106941253343402848
Default0000012121357225715
Does not meet the credit policy. Status:Charged Off11324912915800000649
Does not meet the credit policy. Status:Fully Paid239582436532000001789
Fully Paid2061315412276241260237488458584188616474167575
In Grace Period000006675129819963975
Issued0000000059825982
Late (16-30 days)0000062235407141483
Late (31-120 days)00000811149298628107026
All6032393528191561410143470100380162569283172621125
In [4]:
f,ax=plt.subplots(1,2,figsize=(18,8))
#df_loan.groupby('loan_status')['loan_status'].count().plot.bar(ax=ax[0])

df_loan['loan_status'].value_counts(normalize = True).plot.bar(ax=ax[0])
df_loan['loan_status'].value_counts(normalize = True).plot.pie(ax=ax[1])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
ax[0].set_title('Number Of Loan Status')
ax[1].set_title('Ratio Of Loan Status')
plt.show()
#normalize 
#df_loan.groupby('loan_status')['loan_status'].count().plot.pie(ax=ax[1])
#data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])

4.1.3 등급별 대출 상태

  • 부도와 가장 밀접한 관계를 가지고 있는 신용등급을 조회했음
  • 등급이 낮을 수록 부도율이 높은 것을 확인할 수 있음
In [5]:
pd.crosstab(df_loan.loan_status, df_loan.grade, margins =True).style.background_gradient(cmap='YlGn')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()
Out[5]:
gradeABCDEFGAll
loan_status
Charged Off253783529013648120905466429083
Current9909313692710576545396129882392287402848
Default4417224916171171715
Does not meet the credit policy. Status:Charged Off8791401701197558649
Does not meet the credit policy. Status:Fully Paid872544534483141271061789
Fully Paid3890360909409842060950401010120167575
In Grace Period349109313498123075693975
Issued1386210516596231643875982
Late (16-30 days)1303484853531273551483
Late (31-120 days)478177023231634622170297026
All14301521200916242076687218424466686621125
In [6]:
pd.crosstab(df_loan.loan_status, df_loan.grade).apply(lambda r: r/r.sum(), axis = 0).style.background_gradient(cmap='YlGn')
Out[6]:
gradeABCDEFG
loan_status
Charged Off0.01773940.03939460.05549190.08451240.09568720.1222570.0932945
Current0.6928850.6458550.6511820.5919650.5946340.5356020.418367
Default0.000307660.0008112860.001533060.002099440.003250620.003806540.00145773
Does not meet the credit policy. Status:Charged Off5.59382e-050.0003726260.0008619630.00221680.005448220.01679360.0845481
Does not meet the credit policy. Status:Fully Paid0.0006083280.001198060.002789070.005841930.0143760.02843710.154519
Fully Paid0.272020.2872940.2523330.2687420.2307480.2261530.174927
In Grace Period0.00244030.005155440.008305630.01058850.01405550.01253920.0131195
Issued0.009691290.009928820.01021430.008123930.007508470.008508730.0102041
Late (16-30 days)0.0009089960.001641440.002986090.004603130.005814490.007836990.00728863
Late (31-120 days)0.003342310.00834870.01430240.02130740.02847720.03806540.0422741

4.1.4 섹터별 대출 상태

  • 섹터별로 포트폴리오 최적화를 하는 것이 목적이므로 섹터별 대출 건수와 대출 비율이 중요함
In [7]:
df_loan.groupby(['loan_status','purpose'])['int_rate'].agg('count').unstack().fillna(0).style.background_gradient(cmap='summer_r')
Out[7]:
purposecarcredit_carddebt_consolidationeducationalhome_improvementhousemajor_purchasemedicalmovingotherrenewable_energysmall_businessvacationwedding
loan_status
Charged Off306541416932491396175588449353206339884232203
Current383210512322757912338512547369424626552081623134922663202
Default5140442030471174801380
Does not meet the credit policy. Status:Charged Off11602463260816191210616549
Does not meet the credit policy. Status:Fully Paid4124373264121308731282782821238
Fully Paid26033559795024260978510394569197114409639181272412521491
In Grace Period277572467021127874534232647332
Issued67150532130369261427542401583540
Late (16-30 days)923990708811351220108036180
Late (31-120 days)5313324306038741134101714088127535
In [8]:
df_loan.groupby(['loan_status','purpose'])['int_rate'].agg('count').unstack().fillna(0).apply(lambda r: r/r.sum()).style.background_gradient(cmap='YlGn')
#.apply(lambda r: r/r.sum())

#.style.background_gradient(cmap='summer_r')
#.merge(pd.crosstab(df_loan.loan_status, df_loan.term).apply(lambda r: r/r.sum()), how  = 'left', right_index =True, left_index =True).style.background_gradient(cmap='summer_r')
#df_loan[use_col].groupby(['loan_status','term'])['issue_d'].agg('count').unstack()#.apply(lambda r: r/r.sum())
Out[8]:
purposecarcredit_carddebt_consolidationeducationalhome_improvementhousemajor_purchasemedicalmovingotherrenewable_energysmall_businessvacationwedding
loan_status
Charged Off0.04400350.03599490.04812310.120690.03895960.06692160.04511280.06451150.07571860.06050030.08245240.117040.05359210.104103
Current0.551050.698910.6468110.002463050.6526290.4795410.5653680.6100570.5694980.6104580.4883720.4623330.6151540.10359
Default0.0007190110.0009307890.0012562200.000837240.001529640.0005370570.001580460.00150150.0014076700.001721170.0018480
Does not meet the credit policy. Status:Charged Off0.001581820.000398910.0006991660.07881770.001674480.003059270.001227560.002729890.0025740.00310860.002114160.008605850.0009240010.00461538
Does not meet the credit policy. Status:Fully Paid0.005895890.001615580.002080440.1576350.003376870.01147230.006674850.004454020.006006010.008152730.004228330.01085660.0027720.0194872
Fully Paid0.3743170.2366660.2700710.6403940.273080.3973230.3505450.283190.308880.2826770.3826640.3606510.2892120.764615
In Grace Period0.003882660.005032910.0070115500.005888590.0103250.006674850.006465520.007293010.006803720.0126850.006222690.007623010.00102564
Issued0.009634740.0100060.0091317800.01029810.009942640.01089460.01077590.009009010.01175990.01057080.0109890.0124740
Late (16-30 days)0.001294220.001588990.0025778200.002455910.00420650.002685280.001724140.004290.0031672500.004766320.0041580
Late (31-120 days)0.007621510.008855790.012238200.01080040.01567880.01028080.01451150.01522950.01196520.01691330.01681450.0122430.0025641
In [9]:
f,ax=plt.subplots(1,2,figsize=(18,8))

da1 = df_loan['purpose'].value_counts().sort_values()
da1.plot.pie(ax=ax[0])
da2 = df_loan.groupby('purpose')['int_rate'].mean().sort_values()
da2.plot.bar(ax=ax[1])
ax[0].set_title("Sector Ratio")
ax[1].set_title('Interest Average')
plt.show()

4.1.5 부도 발생시점

  • 수익률을 계산하기 위해서 부도 발생 시점을 확인하는 것은 중요한 단계중에 하나임
  • 평균 13.5개월임(대출기간 36개월)
In [10]:
df_loan1 = df_loan
df_loan1['de_mon'] = pd.to_datetime(df_loan1['last_pymnt_d']).dt.to_period('M') - pd.to_datetime(df_loan1['issue_d']).dt.to_period('M')
print(df_loan1['de_mon'].mean())
13.5805664824
In [11]:
f,ax=plt.subplots(1,2,figsize=(18,8))

da1 = df_loan1[df_loan1["loan_status"].str.strip()=='Charged Off']['de_mon'].value_counts().sort_index()
da1.plot.line(ax=ax[0])
da1.plot.pie(ax=ax[1])
ax[0].set_title("Trend Of Default Month")
ax[1].set_title('Ratio of Default Month')
plt.show()

4.1.6 대출규모와 연간 소득

  • 전체 평균의 대출규모은 12,474 달러임
  • 연간소득의 평균은 (72,555)임
  • 연간소득에는 아주 높은 이상치가 존재함
In [12]:
print(df_loan.funded_amnt.mean())
f,ax=plt.subplots(2,2,figsize=(18,8))
df_loan.funded_amnt.hist(bins= 10, ax = ax[0][0]) # [0][i]
ax[0][0].set_title('Number Of Hist')
ax[0][0].set_ylabel('Count')
df_loan[df_loan.funded_amnt.rank(pct=True) < 0.1]['funded_amnt'].hist(bins = 20, ax = ax[0][1], color= 'r')
df_loan[df_loan.funded_amnt.rank(pct=True) > 0.9]['funded_amnt'].hist(bins = 20, ax = ax[1][0], color= 'g')
ax[0][1].set_title('10% less 이상')
ax[1][0].set_title('90% more 이하.')
plt.show()
#.hist(bins= 100, ax = ax[1])#
12474.785993157577
In [13]:
df_loan.describe().style.background_gradient(cmap='terrain_r')
Out[13]:
int_rateannual_incdtitotal_pymntinstallmentlast_pymnt_amntfunded_amnt
count621125621121621125621125621125621125621125
mean12.018772555.517.58227130.69413.5381939.9212474.8
std3.8530766293.519.78387297.5259.3354028.197779.36
min5.3200015.670500
25%8.94300011.341802.84219.97246.626600
50%11.996000017.024737.8342.31430.4510000
75%14.478700023.3110198.6543.6924.8116000
max28.999e+06999949978.71445.4636188.835000
In [14]:
#print(df_loan.funded_amnt.mean())
f,ax=plt.subplots(2,2,figsize=(18,8))
df_loan.annual_inc.hist(bins= 100, ax = ax[0][0]) # [0][i]
ax[0][0].set_title('Number Of Annual Income')
ax[0][1].set_ylabel('Count')
df_loan[df_loan.annual_inc.rank(pct=True) < 0.99]['annual_inc'].hist(bins = 30, ax = ax[0][1], color= 'r')
df_loan[df_loan.annual_inc.rank(pct=True) > 0.99]['annual_inc'].hist(bins = 30, ax = ax[1][0], color= 'y')
ax[0][1].set_title('10% less')
ax[1][0].set_title('90% more')
#df_loan[df_loan.annual_inc.rank(pct=True) > 0.999]['annual_inc'].hist(ax = ax[1][1])
#.hist(bins= 100, ax = ax[1])#
Out[14]:
<matplotlib.text.Text at 0x7f3b58a59400>

Part 4.2 단위 포트폴리오 구성과 분석

  • 단위 포트폴리오는 섹터별로 4개 ('credit_card', 'other','home_improvement', 'debt_consolidation') 별로, 그리고 대출규모 금액가중으로 8분위로 분류하여 구성하였음

Part 4.2.1 단위 포트폴리오 구성

  • 대출 금액의 합계가 동일하게 단일 포트폴리오 구성
  • 대출금액이 낮을 때와 높을 때 이자율이 높음
  • credit_card는 이자율이 가장 높고 other의 이자율이 높음
  • 연도별로 2012년과 2015년에는 이자율이 낮아짐
In [15]:
df_loan['m_rank'] = np.ceil(df_loan.funded_amnt.rank(pct=True) * 10)
df_loan.groupby('m_rank')['funded_amnt'].agg(['count','mean','sum'])
df_loan['funded_amnt'].agg(['count','mean','sum'])
a= np.linspace(0, 7748401550, num=10) 
df_loan['funded_amnt_cum']= df_loan['funded_amnt'].sort_values().cumsum()
df_loan['cum_rank'] = df_loan['funded_amnt_cum'].apply(lambda x:  np.digitize(x, a))
df_loan.groupby('cum_rank').agg({'funded_amnt':['sum','count','mean'],'int_rate':['mean']}).style.background_gradient(cmap='RdPu')
Out[15]:
int_rate <th class="col_heading level0 col1" colspan=3>funded_amnt</th>
meansumcountmean
cum_rank
112.70198.60928e+081789524810.94
212.28668.60938e+08998738620.33
311.78268.60927e+088075510661
411.67578.60933e+086506013232.9
511.43688.60942e+085485515694.9
611.19318.6092e+084497419142.6
711.01248.60947e+083899422079
811.14018.60927e+083207326842.7
913.14498.6094e+082558933644.9
In [16]:
df_loan.head()
df_loan['start_d'] = pd.to_datetime(df_loan.issue_d)
In [17]:
f,ax=plt.subplots(1,2,figsize=(18,8))
df_loan.groupby(["grade",(pd.Grouper(key='start_d', freq="Q"))])['int_rate'].mean().unstack('grade').plot(ax = ax[0])
p_l1 = ['credit_card', 'other','home_improvement', 'debt_consolidation']
df_loan[df_loan.purpose.isin(p_l1)].groupby(["purpose",(pd.Grouper(key='start_d', freq="M"))])['int_rate'].mean().unstack('purpose').plot(ax = ax[1])
#df_loan[df_loan.purpose.isin(p_l1)]
ax[0].set_title('grade별 이자율 추이')
ax[1].set_title('섹터별 이자율 추이')
#df_loan.groupby(["purpose","issue_d"])['int_rate'].mean().plot()
Out[17]:
<matplotlib.text.Text at 0x7f3b58c76fd0>

Part 4.2.2 단위 포트폴리오 수익률 계산

  • 포트폴리오 수익률 계산
  • 미 부도시 이자율 * 원급 지급의 현금 흐름 발생
  • 부도시 원금만 현금흐름이 유출되고 그 다음 부터는 0
  • 부도 시점은 마지막 지급액
In [18]:
df_loan['last_pymnt_d']= df_loan.apply(lambda cols: cols['last_pymnt_d'] if pd.isnull(cols['last_pymnt_d']) else cols['issue_d'], axis = 1) 
df_train10 = df_loan
df_train10 = df_train10[~df_train10.last_pymnt_d.isnull()]
In [19]:
df_train10['start_d'] = pd.to_datetime(df_train10.issue_d)
#df_train10['str_start_d'] = df_train10['start_d'].strftime("%Y-%m-%d")
df_train10['str_start_d'] = df_train10['start_d'].apply(lambda x: x.strftime('%Y-%m'))
df_train10['str_start_d'] = df_train10['start_d'].apply(lambda x: x.strftime('%Y-%m'))
df_train10['last_d'] = pd.to_datetime(df_train10.last_pymnt_d)
df_train10['str_last_d'] = df_train10['last_d'].apply(lambda x: x.strftime('%Y-%m'))
#type(df_train10['start_d'])
df_train10['maturity_d'] = df_train10['start_d'].apply(lambda x: x + pd.DateOffset(years=3))
df_train10['str_maturity_d'] = df_train10['maturity_d'].apply(lambda x: x.strftime('%Y-%m'))
In [20]:
df_train10['issue_d'].count()
#608845
df_train10['cum_rank'].unique()
df_train10['purpose'].unique()
p_l1 = ['credit_card', 'other','home_improvement', 'debt_consolidation']
p_l2 = np.arange(1,9)
In [21]:
def calRetOfPeiord(df_port, pr_cal):
    ret_mo = {}
    for date_i in pr_cal:
        #print(calRet(df_port,date_i))
        ret_mo[date_i] = calRet(df_port,date_i)
    
    return ret_mo

#calRetOfPeiord(df_train10, pr_cal)     
#def calRet()
def calRet(df_train10,  current_month):
    #current_month = '2013-01'
    df_cur = df_train10[(df_train10['str_start_d'] <= current_month ) & (df_train10['str_maturity_d'] >=  current_month)]
    #print(df_cur.count())
    de_m1 = df_cur[(df_cur['loan_status'].str.strip() == 'Charged Off') & (df_cur['last_d'] == current_month)]
    de_m0 = df_cur[(df_cur['loan_status'].str.strip() == 'Charged Off') & (df_cur['last_d'] != current_month)]
    de_p1 = df_cur[(df_cur['loan_status'].str.strip() != 'Charged Off') ]
    d1 = sum(de_m1['funded_amnt'])
    p1 = sum(de_p1['int_rate'] / 12 * de_p1['funded_amnt'])
    try:
        retOfMonth = (p1 - d1) / sum(df_cur['funded_amnt'])
    except ZeroDivisionError as e:
        #print(e)
        retOfMonth = 0
    return retOfMonth
In [22]:
#df_cal = df_train10[(df_train10['purpose'] == 'credit_card') &  (df_train10['cum_rank'] == 8)]
In [23]:
#nd1 = df_train10.matuarity.unique() 

pr_cal = df_train10['str_start_d'].unique() 
pr_cal.sort()
ret_sect_size = {}

for i in p_l1:
    for j in p_l2:
        #print(i + str(j))
        #print(df_train10[(df_train10['purpose'] == i) &  (df_train10['cum_rank'] == j)]['issue_d'].count())
        df_cal = df_train10[(df_train10['purpose'] == i) &  (df_train10['cum_rank'] == j)]
        ret_sect_size[(i + '|' + str(j))] = calRetOfPeiord(df_cal, pr_cal)
In [ ]:
 
In [24]:
df_ret = pd.DataFrame.from_dict(ret_sect_size)
df_ret = df_ret.iloc[67 :,:]
mean_daily_returns  = df_ret.mean()
corr_matrix  = df_ret.corr()
In [25]:
#df_ret.columns

Part 4.2.3 단위 포트폴리오 수익률 분석

  • Others 섹터의 단위 포트포리오의 수익률이 높음.
  • 소액투자의 단위 포트폴리오이 수익률이 높음(투자수가 많아져서 분산 효과???)
In [26]:
f,ax=plt.subplots(1,1,figsize=(18,8))
pd.options.display.float_format = '{:.4f}'.format
df_ret.plot(ax = ax)
ax.set_title('32개 단위 포트폴리오의 수익률 추이')
Out[26]:
<matplotlib.text.Text at 0x7f3b4973f240>
In [27]:
def process_index(k):
    return tuple(k.split("|"))

#df.index = 
df_ret.columns  = pd.MultiIndex.from_tuples([process_index(k) for k,v in df_ret.iteritems()])
In [28]:
f,ax=plt.subplots(1,2,figsize=(18,8))
df_ret.groupby(level=[0], axis =1).mean().plot(ax =ax[0])
df_ret.groupby(level=[1], axis =1).mean().plot(ax =ax[1])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3b5881b240>
In [29]:
# cov_matrix

import seaborn as sns
fig, ax = plt.subplots(figsize=(14,14))   
sns.heatmap(corr_matrix, annot=True, linewidths=.5, ax=ax)
# turn the axis label
for item in ax.get_yticklabels():
    item.set_rotation(0)

for item in ax.get_xticklabels():
    item.set_rotation(90)

plt.show()

4.3 Predictive Part

4.3.1 문자 catetory로 숫자로 변환
In [30]:
df_loan_st = df_loan.copy()
loan_obj_mod =   df_loan.select_dtypes(include=['object']).columns
loan_obj_mod =[ 'purpose', 'term', 'zip_code', 'grade', 'sub_grade',
       'emp_length', 'home_ownership', 'loan_status']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#var_mod = ['term','grade','sub_grade']
for i in loan_obj_mod:
    df_loan[i] = le.fit_transform(df_loan[i])
In [31]:
#df_loan.issue_d.to_period('M')
# df_train =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2010-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2012-12-31') ]
# df_test =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2013-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2013-12-31') ]

#df_loan.issue_d.to_period('M')
df_train =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2012-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2014-12-31') ]
df_test =df_loan[((pd.to_datetime(df_loan['issue_d']))>= '2015-01-01') & ((pd.to_datetime(df_loan['issue_d'])) <= '2015-12-31') ]

m_use =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt', 'target']
features  =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt']
df_train['target'] = df_train.loan_status.apply(lambda x:1 if x == 0 else 0)
df_test['target'] = df_test.loan_status.apply(lambda x:1 if x == 0 else 0)
In [32]:
#분석에 사용될 변수
features  =['purpose', 'zip_code', 'int_rate', 'annual_inc',   'grade', 'sub_grade', 'emp_length', 'dti', 'home_ownership',
         'installment', 'funded_amnt']

y_values = df_train['target'].values

Part 4.3.2 Gradinent Boost 알고리즘 적용

  • 2012년 - 2014년 데이터를 train으로 하여 2015년의 부도율을 예측함.
In [33]:
def get_importance(gbm, features):
    #importance = pd.Series(gbm.booster().get_fscore()).sort_values(ascending=False)
    #AttributeError: 'Booster' object has no attribute 'booster'
    #https://github.com/dmlc/xgboost/issues/2119
    outfile = open('xgb.fmap', 'w')
    for i, feat in enumerate(features):
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
    outfile.close()
    importance = gbm.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=itemgetter(1), reverse=False)
    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()
    plt.figure()
    #df.plot()
    df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(16, 10))
    plt.title('XGBoost Feature Importance')
    return importance

### xgb start

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional scklearn functions
from sklearn.grid_search import GridSearchCV   #Perforing grid search
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from operator import itemgetter

def runXGboost(df_train, df_test, feature_names, targetY):
    
    xgb_pars = {
    "objective": "binary:logistic",    #  1 reg:linear  2 "binary:logistic", 3 multi:softmax,  4 multi:softprob 
    "booster" : "gbtree",             # 1 gbtree: tree-based, 2 gblinear: linear models
    "eval_metric": "auc", 
    #rmse – root mean square error,   mae – mean absolute error , auc: Area under the curve
    # logloss, mlogloss  – negative log-likelihood   error, merror – Binary classification error rate (0.5 threshold)  merror        
    'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,  "tree_method": 'exact',
            'subsample': 0.8, 'lambda': 1., 'nthread': 4, 'silent': 1}
    num_boost_round = 115
    early_stopping_rounds = 10
    test_size = 0.2
    
    print(df_train[feature_names].columns)
    print(targetY.shape)
    
    #X_train, X_valid=                               
    X_train ,X_valid, y_train, y_valid =  train_test_split(df_train[feature_names].values, targetY, test_size = test_size , random_state= 1)
    dtrain = xgb.DMatrix(X_train, label = y_train)
    dvalid = xgb.DMatrix(X_valid, label=  y_valid)
    dtest = xgb.DMatrix(df_test[feature_names].values)
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    
    gbm_model = xgb.train(xgb_pars, dtrain, 60, watchlist, early_stopping_rounds=50,
                  maximize=False, verbose_eval=10)
    
    print("Validating...")
    check = gbm_model.predict(dvalid)
    importance = get_importance(gbm_model, features)
    print(importance)
    #importance = gbm.get_fscore(fmap='xgb.fmap')
    score = roc_auc_score(y_valid, check)
    print('Check error value: {:.6f}'.format(score))

    return gbm_model.predict(dtest)

df_test['result'] = runXGboost(df_train, df_test, features,  y_values)
/home/bono/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/bono/anaconda3/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Index(['purpose', 'zip_code', 'int_rate', 'annual_inc', 'grade', 'sub_grade',
       'emp_length', 'dti', 'home_ownership', 'installment', 'funded_amnt'],
      dtype='object')
(306419,)
[0] train-auc:0.651848  valid-auc:0.654326
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 50 rounds.
[10]  train-auc:0.696095  valid-auc:0.670383
[20]  train-auc:0.733308  valid-auc:0.693867
[30]  train-auc:0.745974  valid-auc:0.698791
[40]  train-auc:0.752872  valid-auc:0.701923
[50]  train-auc:0.756673  valid-auc:0.70102
Validating...
[('grade', 105), ('home_ownership', 142), ('purpose', 238), ('emp_length', 256), ('sub_grade', 298), ('int_rate', 372), ('funded_amnt', 462), ('zip_code', 658), ('dti', 674), ('annual_inc', 810), ('installment', 1126)]
Check error value: 0.700114
<matplotlib.figure.Figure at 0x7f3b5230e518>

Part 4.3.2 Gradinent Boost 알고리즘 적용 결과

  • 사후적으로 부도로 인식된 것들이 알고리즘 적용 결과 더 높은 부도율을 보여줌
In [34]:
f,ax=plt.subplots(1,3,figsize=(18,8))
df_test['result'].hist(bins= 20,  color='y', ax = ax[0])
df_test[df_test['target'] == 0]['result'].hist(bins= 20,  color='g', ax = ax[1])
df_test[df_test['target'] == 1]['result'].hist(bins= 20,  color='r', ax = ax[2])
ax[0].set_title('test 전체')
ax[1].set_title('test 중 미 부도')
ax[2].set_title('test 중 부도')
plt.show()
In [35]:
df_pair = df_test[['target','purpose','int_rate','grade','dti','loan_status']]
%matplotlib inline
sns.set()
sns.pairplot(df_pair[df_pair['loan_status'].isin([0,1])], hue='loan_status');

Part 4.3.3 미래부도확률을 사용하여 수익률 계산

  • 미 부도시 이자율 * 원급 지급의 현금 흐름 발생
  • 부도시 원금만 현금흐름이 유출되고 그 다음 부터는 0
  • 부도 시점은 계산할 수 없음
In [37]:
df_test.head()
Out[37]:
issue_dpurposetermzip_codeint_rateannual_incgradesub_gradeemp_lengthdti...installmentlast_pymnt_amntfunded_amntde_monm_rankfunded_amnt_cumcum_rankstart_dtargetresult
466286Dec-2015104395.3200100000.000000107.2800...260.50000.00008650.0000NaN4.00001302554975.000022015-12-0100.0114
466287Dec-20152062514.850035000.0000214515.2200...146.16000.00004225.0000NaN1.0000179683275.000012015-12-0100.0742
466290Dec-2015208949.800092000.000017313.8900...482.61000.000015000.0000NaN7.00003377575900.000042015-12-0100.0388
466291Dec-2015202117.4900109000.000003126.0200...777.55000.000025000.0000NaN10.00006264808700.000082015-12-0100.0088
466292Dec-20152017913.990055000.000021315.9100...123.03000.00003600.0000NaN1.0000108387050.000012015-12-0100.0921

5 rows × 24 columns

In [38]:
df_result = df_loan_st.merge(df_test[['target','result']], how= 'right', left_index= True, right_index=True)
In [39]:
df_result['exRet'] = (1) * df_result['int_rate'] - 100 *df_result['result'] 
In [40]:
ret_sect_size2= {}

for i in p_l1:
    for j in p_l2:
        df_cal = df_result[(df_result['purpose'] == i) & (df_result['cum_rank'] == j)]
        ret_sect_size2[(i + '|' + str(j))] =sum (df_cal['exRet'] * df_cal['funded_amnt']) / sum( df_cal['funded_amnt'])
In [41]:
df_train['int_rate'].mean()
df_result['int_rate'].mean()
Out[41]:
11.278818350702988

Part 4.4.2 Mean Variance

최적화 포트폴리오 계산

In [42]:
# Calculates portfolio mean return
def port_mean(W, R):
    return sum(R * W)

# Calculates portfolio variance of returns
def port_var(W, C):
    return np.dot(np.dot(W, C), W)

# Combination of the two functions above - mean and variance of returns calculation
def port_mean_var(W, R, C):
    return port_mean(W, R), port_var(W, C)

def solve_frontier(R, C, rf):
    def fitness(W, R, C, r):
        # For given level of return r, find weights which minimizes portfolio variance.
        mean, var = port_mean_var(W, R, C)
        penalty = 100 * abs(
            mean - r)  # Big penalty for not meeting stated portfolio return effectively serves as optimization constraint
        return var + penalty

    frontier_mean, frontier_var = [], []
    n = len(R)  # Number of assets in the portfolio
    for r in np.linspace(min(R), max(R), num=20):  # Iterate through the range of returns on Y axis
        W =  np.ones([n]) / n  # start optimization with equal weights
        b_ = [(0, 1) for i in range(n)]
        c_ = ({'type': 'eq', 'fun': lambda W: sum(W) - 1.})
        optimized = scipy.optimize.minimize(fitness, W, (R, C, r), method='SLSQP', constraints=c_, bounds=b_)
        if not optimized.success:
            raise BaseException(optimized.message)
        # add point to the efficient frontier [x,y] = [optimized.x, r]
        frontier_mean.append(r)
        frontier_var.append(port_var(optimized.x, C))
    return np.array(frontier_mean), np.array(frontier_var)

from scipy.optimize import minimize
import scipy
def solve_weights(R, C, rf):
    def fitness(W, R, C, rf):
        # calculate mean/variance of the portfolio
        mean, var = port_mean_var(W, R, C)  
        util = (mean - rf) / np.sqrt(var)      # utility = Sharpe ratio
        return 1/util                       # maximize the utility
    n = len(R)
    W = np.ones([n])/n                     # start with equal weights
    b_ = [(0.,1.) for i in range(n)]    # weights between 0%..100%. 
                                        # No leverage, no shorting
    c_ = ({'type':'eq', 'fun': lambda W: sum(W)-1. })   # Sum of weights = 100%
    optimized = scipy.optimize.minimize(fitness, W, (R, C, rf), 
                method='SLSQP', constraints=c_, bounds=b_)  
    if not optimized.success: 
        raise BaseException(optimized.message)
    return optimized.x  # Return optimized weights
In [43]:
class Result:
    def __init__(self, W, tan_mean, tan_var, front_mean, front_var):
        self.W=W
        self.tan_mean=tan_mean
        self.tan_var=tan_var
        self.front_mean=front_mean
        self.front_var=front_var
        
def optimize_frontier(R, C, rf):
    W = solve_weights(R, C, rf)
    tan_mean, tan_var = port_mean_var(W, R, C)  # calculate tangency portfolio
    front_mean, front_var = solve_frontier(R, C, rf)  # calculate efficient frontier
    # Weights, Tangency portfolio asset means and variances, Efficient frontier means and variances
    return Result(W, tan_mean, tan_var, front_mean, front_var)

def display_assets(names, R, C, color='red'):
    f,ax=plt.subplots(1,1,figsize=(18,8))

    scatter([C[i, i] ** .5 for i in range(len(names))], R, marker='x', color='G'), grid(True)  # draw assets
    for i in range(len(names)): 
        text(C[i, i] ** .5, R[i], '  %s' % names[i], verticalalignment='center', color=color) # draw labels

def display_frontier(result, label=None, color='black'):
    text(result.tan_var ** .5, result.tan_mean, '   tangent', verticalalignment='center', color=color)
    scatter(result.tan_var ** .5, result.tan_mean, marker='o', color=color), grid(True)
    plot(result.front_var ** .5, result.front_mean, label=label, color=color), grid(True)  # draw efficient frontier
In [44]:
df_ex_ret = pd.DataFrame(ret_sect_size2,  index =["rets"])
df_ex_ret  = df_ex_ret / 100

df_ret = pd.DataFrame.from_dict(ret_sect_size)
df_ret = df_ret.iloc[67 :,:]

#mean_daily_returns  = df_ret.mean()
cov_matrix  = df_ret.cov()
mean_month_returns  = df_ret.mean()

mean_month_returns
df_ex1 = df_ex_ret.T.join(df_ret.var().to_frame("vari"))

rf = 0.001
R = 12 * df_ex_ret.T['rets'].values
C = cov_matrix.values

W = solve_weights(R, C, rf)
print(W)

mean, var = port_mean_var(W, R, C)     
print(mean)
print(var)

mean, var = port_mean_var(W, R, C)     
names = list(df_ret)
#display_a
res1 = optimize_frontier(R, C, rf)
display_assets(names, R, C, color='blue')

display_frontier(res1, color='red')
xlabel('variance $\sigma$'), ylabel('mean $\mu$'), show()

display(pd.DataFrame({'Weight': res1.W}, index=names)[pd.DataFrame({'Weight': res1.W}, index=names)['Weight'] >= 0.000001].T)
plt.show()
[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.30751986e-17   0.00000000e+00   3.25749869e-01   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   2.38320548e-01
   7.99859749e-02   3.55943609e-01   1.74123304e-17   0.00000000e+00
   0.00000000e+00   8.33392227e-18   6.89257690e-18   6.59073393e-18
   6.47707150e-20   0.00000000e+00   0.00000000e+00   2.63013043e-17
   0.00000000e+00   2.31346960e-18   8.99918864e-18   5.45602067e-18
   0.00000000e+00   0.00000000e+00   1.18741740e-18   1.40242754e-18]
0.653256337566
1.6049738616e-05
credit_card|7debt_consolidation|4debt_consolidation|5debt_consolidation|6
Weight0.32570.23830.08000.3559
In [ ]:
 

5 결론

  1. MV은 수익 하나보다는 수익과 위험을 동시에 고려하는 최적화 방식인데 위험 보다는 수익을 좀 더 고려하는 방안이 필요함

  2. 제약조건의 변경에 따른 시뮬레이션을 구하려면 실제 이자율과 할인율에 대한 적용방법이 필요함

In [47]:
#-! jupyter nbconvert –to html  *.ipynb
In [46]:
help(nbconvert)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-46-aba79f90e137> in <module>()
----> 1 help(nbconvert)

NameError: name 'nbconvert' is not defined
In [ ]:
 


© 2017. All rights reserved.

Powered by ZooFighter v0.12