
AI Study/Machine Learning

Classification (4) - LightGBM

 

LightGBM

 
  • Along with XGBoost, one of the most popular boosting algorithms.
  • XGBoost's main drawback is its long training time.
  • LightGBM's biggest advantages: it trains much faster than XGBoost and uses relatively little memory.
  • Predictive performance is comparable, and LightGBM offers more features.
  • Unlike the level-wise tree splitting used by most GBM implementations, LightGBM grows trees leaf-wise.
    • Level Wise: splits while keeping the tree as balanced as possible, which minimizes tree depth.
      • Balanced trees are more robust to overfitting.
      • Keeping the tree balanced takes extra time.
    • Leaf Wise: instead of balancing the tree, it repeatedly splits the leaf node with the maximum delta loss, producing deeper, asymmetric trees.
      • Because the leaf with the largest loss is always split next, the resulting trees can reduce prediction loss more than level-wise growth as training iterates.
 
In [1]:
import lightgbm

print(lightgbm.__version__)
 
2.2.1
 

 

Applying LightGBM – Wisconsin Breast Cancer Prediction

 
 
In [4]:
# Import LGBMClassifier from lightgbm, the Python package for LightGBM
from lightgbm import LGBMClassifier

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
ftr = dataset.data
target = dataset.target

# Use 80% of the data for training and hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(ftr, target, test_size=0.2, random_state=156)

# As with XGBoost earlier, set n_estimators to 400.
lgbm_wrapper = LGBMClassifier(n_estimators=400)

# Like XGBoost, LightGBM supports early stopping.
evals = [(X_test, y_test)]
lgbm_wrapper.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="logloss",
                 eval_set=evals, verbose=True)
preds = lgbm_wrapper.predict(X_test)
pred_proba = lgbm_wrapper.predict_proba(X_test)[:, 1]
 
[1]	valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's binary_logloss: 0.507451
[3]	valid_0's binary_logloss: 0.458489
[4]	valid_0's binary_logloss: 0.417481
[5]	valid_0's binary_logloss: 0.385507
[6]	valid_0's binary_logloss: 0.355846
[7]	valid_0's binary_logloss: 0.330897
[8]	valid_0's binary_logloss: 0.306923
[9]	valid_0's binary_logloss: 0.28776
[10]	valid_0's binary_logloss: 0.26917
[11]	valid_0's binary_logloss: 0.250954
[12]	valid_0's binary_logloss: 0.23847
[13]	valid_0's binary_logloss: 0.225865
[14]	valid_0's binary_logloss: 0.215076
[15]	valid_0's binary_logloss: 0.205996
[16]	valid_0's binary_logloss: 0.196091
[17]	valid_0's binary_logloss: 0.186395
[18]	valid_0's binary_logloss: 0.17942
[19]	valid_0's binary_logloss: 0.174727
[20]	valid_0's binary_logloss: 0.168563
[21]	valid_0's binary_logloss: 0.165432
[22]	valid_0's binary_logloss: 0.160356
[23]	valid_0's binary_logloss: 0.155508
[24]	valid_0's binary_logloss: 0.151598
[25]	valid_0's binary_logloss: 0.149861
[26]	valid_0's binary_logloss: 0.149873
[27]	valid_0's binary_logloss: 0.147032
[28]	valid_0's binary_logloss: 0.145077
[29]	valid_0's binary_logloss: 0.139891
[30]	valid_0's binary_logloss: 0.137416
[31]	valid_0's binary_logloss: 0.138393
[32]	valid_0's binary_logloss: 0.135948
[33]	valid_0's binary_logloss: 0.132968
[34]	valid_0's binary_logloss: 0.129697
[35]	valid_0's binary_logloss: 0.131102
[36]	valid_0's binary_logloss: 0.129382
[37]	valid_0's binary_logloss: 0.128044
[38]	valid_0's binary_logloss: 0.127325
[39]	valid_0's binary_logloss: 0.128091
[40]	valid_0's binary_logloss: 0.129045
[41]	valid_0's binary_logloss: 0.127023
[42]	valid_0's binary_logloss: 0.129314
[43]	valid_0's binary_logloss: 0.129175
[44]	valid_0's binary_logloss: 0.128212
[45]	valid_0's binary_logloss: 0.126664
[46]	valid_0's binary_logloss: 0.127662
[47]	valid_0's binary_logloss: 0.126108
[48]	valid_0's binary_logloss: 0.129371
[49]	valid_0's binary_logloss: 0.129573
[50]	valid_0's binary_logloss: 0.130876
[51]	valid_0's binary_logloss: 0.131366
[52]	valid_0's binary_logloss: 0.131336
[53]	valid_0's binary_logloss: 0.13208
[54]	valid_0's binary_logloss: 0.13306
[55]	valid_0's binary_logloss: 0.132342
[56]	valid_0's binary_logloss: 0.134836
[57]	valid_0's binary_logloss: 0.135208
[58]	valid_0's binary_logloss: 0.13328
[59]	valid_0's binary_logloss: 0.134147
[60]	valid_0's binary_logloss: 0.134549
[61]	valid_0's binary_logloss: 0.133202
[62]	valid_0's binary_logloss: 0.135726
[63]	valid_0's binary_logloss: 0.134011
[64]	valid_0's binary_logloss: 0.131493
[65]	valid_0's binary_logloss: 0.134114
[66]	valid_0's binary_logloss: 0.134525
[67]	valid_0's binary_logloss: 0.131412
[68]	valid_0's binary_logloss: 0.12878
[69]	valid_0's binary_logloss: 0.129571
[70]	valid_0's binary_logloss: 0.129671
[71]	valid_0's binary_logloss: 0.129935
[72]	valid_0's binary_logloss: 0.128951
[73]	valid_0's binary_logloss: 0.128977
[74]	valid_0's binary_logloss: 0.127121
[75]	valid_0's binary_logloss: 0.128107
[76]	valid_0's binary_logloss: 0.129796
[77]	valid_0's binary_logloss: 0.131663
[78]	valid_0's binary_logloss: 0.132483
[79]	valid_0's binary_logloss: 0.131578
[80]	valid_0's binary_logloss: 0.130352
[81]	valid_0's binary_logloss: 0.129895
[82]	valid_0's binary_logloss: 0.131587
[83]	valid_0's binary_logloss: 0.132763
[84]	valid_0's binary_logloss: 0.133677
[85]	valid_0's binary_logloss: 0.137552
[86]	valid_0's binary_logloss: 0.136055
[87]	valid_0's binary_logloss: 0.137904
[88]	valid_0's binary_logloss: 0.139524
[89]	valid_0's binary_logloss: 0.138434
[90]	valid_0's binary_logloss: 0.138402
[91]	valid_0's binary_logloss: 0.139384
[92]	valid_0's binary_logloss: 0.139642
[93]	valid_0's binary_logloss: 0.138006
[94]	valid_0's binary_logloss: 0.141612
[95]	valid_0's binary_logloss: 0.142319
[96]	valid_0's binary_logloss: 0.145095
[97]	valid_0's binary_logloss: 0.141542
[98]	valid_0's binary_logloss: 0.144993
[99]	valid_0's binary_logloss: 0.147936
[100]	valid_0's binary_logloss: 0.147432
[101]	valid_0's binary_logloss: 0.149689
[102]	valid_0's binary_logloss: 0.153542
[103]	valid_0's binary_logloss: 0.154556
[104]	valid_0's binary_logloss: 0.155458
[105]	valid_0's binary_logloss: 0.159357
[106]	valid_0's binary_logloss: 0.160176
[107]	valid_0's binary_logloss: 0.163369
[108]	valid_0's binary_logloss: 0.163494
[109]	valid_0's binary_logloss: 0.161111
[110]	valid_0's binary_logloss: 0.16332
[111]	valid_0's binary_logloss: 0.1663
[112]	valid_0's binary_logloss: 0.166363
[113]	valid_0's binary_logloss: 0.169834
[114]	valid_0's binary_logloss: 0.166509
[115]	valid_0's binary_logloss: 0.165823
[116]	valid_0's binary_logloss: 0.167059
[117]	valid_0's binary_logloss: 0.169086
[118]	valid_0's binary_logloss: 0.170012
[119]	valid_0's binary_logloss: 0.168639
[120]	valid_0's binary_logloss: 0.16907
[121]	valid_0's binary_logloss: 0.16918
[122]	valid_0's binary_logloss: 0.170233
[123]	valid_0's binary_logloss: 0.165655
[124]	valid_0's binary_logloss: 0.16695
[125]	valid_0's binary_logloss: 0.170955
[126]	valid_0's binary_logloss: 0.168916
[127]	valid_0's binary_logloss: 0.172316
[128]	valid_0's binary_logloss: 0.173734
[129]	valid_0's binary_logloss: 0.174309
[130]	valid_0's binary_logloss: 0.176719
[131]	valid_0's binary_logloss: 0.176591
[132]	valid_0's binary_logloss: 0.180168
[133]	valid_0's binary_logloss: 0.179856
[134]	valid_0's binary_logloss: 0.179251
[135]	valid_0's binary_logloss: 0.18315
[136]	valid_0's binary_logloss: 0.184656
[137]	valid_0's binary_logloss: 0.187475
[138]	valid_0's binary_logloss: 0.188721
[139]	valid_0's binary_logloss: 0.188542
[140]	valid_0's binary_logloss: 0.18817
[141]	valid_0's binary_logloss: 0.185899
[142]	valid_0's binary_logloss: 0.185452
[143]	valid_0's binary_logloss: 0.186084
[144]	valid_0's binary_logloss: 0.185302
[145]	valid_0's binary_logloss: 0.187856
[146]	valid_0's binary_logloss: 0.190334
[147]	valid_0's binary_logloss: 0.192769
Early stopping, best iteration is:
[47]	valid_0's binary_logloss: 0.126108
 
 
In [5]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # ROC-AUC added
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    # ROC-AUC print added
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f},\
    F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
 
 
In [6]:
get_clf_eval(y_test, preds, pred_proba)
 
Confusion matrix
[[33  4]
 [ 2 75]]
Accuracy: 0.9474, Precision: 0.9494, Recall: 0.9740,    F1: 0.9615, AUC: 0.9926
 
 
In [6]:
# Visualize feature importance with plot_importance()
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(lgbm_wrapper, ax=ax)
 
 
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b36b67dc88>