Used Car Price Prediction - 03 - Modeling and Parameter Tuning
Summary of the official livestream
Reducing memory usage by changing numeric precision
```python
import numpy as np


def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and
    modify the data type to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # downcast integers to the smallest type that still holds the range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # downcast floats in the same way
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
```
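A quick usage sketch; the file name below is only a placeholder for the competition's training file:

```python
import pandas as pd

# 'used_car_train.csv' is a placeholder; point this at the actual training data
train = pd.read_csv('used_car_train.csv')
train = reduce_mem_usage(train)
```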
Learning curves
There is a trade-off to make between variance and bias.
The gap between the training score curve and the cross-validation score curve reflects the size of the variance.
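A minimal sketch of plotting such a curve with scikit-learn's learning_curve; the synthetic data and the LinearRegression estimator are just stand-ins for the competition data and whatever model is being diagnosed:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# synthetic stand-in data; replace with the competition features and price labels
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    scoring='neg_mean_absolute_error',
    train_sizes=np.linspace(0.1, 1.0, 5))

# scores are negative MAE, so flip the sign before plotting
plt.plot(train_sizes, -train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(train_sizes, -cv_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training set size')
plt.ylabel('MAE')
plt.legend()
plt.show()
```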
Parts missed in the earlier feature engineering
Normalizing the target values (i.e., the labels of the training data, in this case price).
The main options are the log transform, the power (exponential) transform, and the Johnson transform.
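A minimal sketch of the log transform, assuming the labels sit in a price column; np.expm1 maps predictions back to the original scale:

```python
import numpy as np
import pandas as pd

# toy stand-in for the training labels; replace with the real price column
train_data = pd.DataFrame({'price': [500, 1200, 3500, 9000, 25000, 99000]})

# log1p compresses the long right tail so the distribution looks closer to normal
train_data['price_log'] = np.log1p(train_data['price'])

# after predicting in log space, expm1 maps values back to the price scale
recovered = np.expm1(train_data['price_log'])
print(train_data)
print(recovered)

# for Box-Cox / Yeo-Johnson style transforms, sklearn.preprocessing.PowerTransformer can be used
```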
Learning XGBoost and LightGBM
The foundation of both models: GBDT
GBDT (Gradient Boosting Decision Tree)
Gradient boosting
$$
f_t(x) = f_{t-1}(x) + h_t(x)
$$
In each round, a weak learner $h_t(x)$ is added on top of the strong learner $f_{t-1}(x)$ produced in the previous round, so as to minimize this round's loss.
The negative gradient of the loss function is used to approximate this round's loss, and that is what $h_t(x)$ is fit to.
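A from-scratch sketch of this update for squared loss, where the negative gradient is just the residual $y - f_{t-1}(x)$; it only illustrates the formula above, not how XGBoost or LightGBM are actually implemented:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression data
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
f = np.full_like(y, y.mean())       # f_0(x): constant initial prediction
trees = []
for t in range(100):
    residual = y - f                # negative gradient of the squared loss
    h = DecisionTreeRegressor(max_depth=3).fit(X, residual)   # weak learner h_t
    f = f + learning_rate * h.predict(X)                      # f_t = f_{t-1} + lr * h_t
    trees.append(h)

print('training MSE:', np.mean((y - f) ** 2))
```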
Decision trees
Pros: flexible, easy to interpret
Cons: prone to overfitting
Introduction to XGBoost and how to use it
Basic usage
First, import the required libraries
```python
import pandas
import numpy
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
%matplotlib inline
```
Handling an ordinary classification problem
```python
# iris.data: the classic Iris dataset in CSV form, 4 numeric features + class label
data = pandas.read_csv('iris.data', header=None)
dataset = data.values
X = dataset[:, 0:4].astype(float)   # ensure a numeric dtype for XGBoost
Y = dataset[:, 4]

# encode the string class labels as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, label_encoded_y, test_size=test_size, random_state=seed)

model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)
```
```python
y_pred = model.predict(X_test)
print(y_pred[::5])
predictions = [round(value) for value in y_pred]
print(predictions[::5])

accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
```python
from xgboost import plot_importance

plot_importance(model)
plt.show()
```
When there are categorical features
Because XGBoost cannot handle categorical features directly, apply one-hot encoding to them.
```python
from sklearn.preprocessing import OneHotEncoder

# one-hot encode every column and stack the results side by side
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
```
When there are many missing values
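One common option in this situation is to lean on XGBoost's built-in handling of missing values: NaN entries are treated as missing and each split learns a default direction for them. A minimal sketch on toy data (not the competition set):

```python
import numpy as np
import xgboost

# toy data with NaNs scattered through the features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
X[rng.uniform(size=X.shape) < 0.3] = np.nan   # roughly 30% missing
y = (rng.uniform(size=100) > 0.5).astype(int)

# XGBoost treats NaN as missing and learns a default branch direction at each split
model = xgboost.XGBClassifier()
model.fit(X, y)
print(model.predict(X[:5]))
```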
Feature selection
```python
from numpy import sort
from sklearn.feature_selection import SelectFromModel

print("model.feature_importances_", model.feature_importances_)
# baseline: accuracy of always predicting the majority class
print("majority classifier", 1 - numpy.mean(y_test))

# order the features from most to least important
sort_idx = numpy.argsort(-model.feature_importances_)
best_feature = sort_idx[0]
X_train_build = X_train[:, best_feature].reshape(X_train.shape[0], 1)
X_test_build = X_test[:, best_feature].reshape(X_test.shape[0], 1)

# add one feature at a time (in importance order) and evaluate a model on each subset
for idx in list(sort_idx):
    if idx != best_feature:
        X_train_build = numpy.concatenate(
            (X_train_build, X_train[:, idx].reshape(X_train.shape[0], 1)), axis=1)
        X_test_build = numpy.concatenate(
            (X_test_build, X_test[:, idx].reshape(X_test.shape[0], 1)), axis=1)
    selection_model = xgboost.XGBClassifier()
    selection_model.fit(X_train_build, y_train)
    y_pred = selection_model.predict(X_test_build)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("num_features=%d, Accuracy: %.2f%%" % (
        X_train_build.shape[1], accuracy * 100.0))
```
Model parameters
I am still not sure whether feature selection or parameter tuning should come first.
See https://xgboost.readthedocs.io/en/latest/parameter.html
GridSearch is the main tool.
```python
# sklearn.grid_search was removed from newer scikit-learn; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV
```
```python
import xgboost as xgb

cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9],
             'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
other_params = {'learning_rate': 0.1, 'n_estimators': 4, 'max_depth': 4,
                'min_child_weight': 1, 'seed': 0, 'subsample': 0.8,
                'colsample_bytree': 0.8, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}

model = xgb.XGBClassifier(**other_params)
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params,
                             scoring='accuracy', cv=5, verbose=1, n_jobs=4)
# train_x / train_y: the prepared training features and labels
optimized_GBM.fit(train_x, train_y)
# grid_scores_ was removed from scikit-learn; cv_results_ holds the per-candidate scores
evalute_result = optimized_GBM.cv_results_

print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
print('Best model score: {0}'.format(optimized_GBM.best_score_))
```
Differences between XGBoost and LightGBM
Compiled from the livestream material.
Structural difference: how continuous variables are split
The most naive, "brute-force" way to find a split point is to try every point and compare it with the others, which has O(n^2) complexity.
- The split-point algorithms for continuous variables differ
  XGBoost uses pre-sorted-based algorithms:
  sorting first and then computing reduces the amount of comparison.
  A further step is to use a histogram and compare using the midpoint of each bin (sacrificing some accuracy).
  LightGBM uses histogram-based algorithms:
  it first buckets the data, discretizing the continuous values.
  LightGBM also uses a technique called GOSS (Gradient-based One-Side Sampling):
  for data with small gradients, i.e. the less important samples (importance here being determined by the gradient), only a fraction is sampled (see the sketch after this list).
- The tree growth strategies differ (XGBoost grows trees level-wise, LightGBM grows them leaf-wise)
- Handling categorical variables
  - XGBoost cannot handle categorical variables directly; label encoding has to be done first
  - LightGBM has its own, smarter handling that differs from the traditional approach
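A minimal sketch of how these LightGBM-side ideas show up as parameters; the values are illustrative defaults rather than tuned settings, and depending on the LightGBM version GOSS is enabled via boosting_type='goss' or data_sample_strategy='goss':

```python
import numpy as np
import lightgbm as lgb

# toy regression data standing in for the used-car features and price
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 3 + rng.normal(size=500)

model = lgb.LGBMRegressor(
    boosting_type='goss',   # GOSS: Gradient-based One-Side Sampling
    top_rate=0.2,           # keep the 20% of samples with the largest gradients
    other_rate=0.1,         # randomly sample 10% of the remaining small-gradient samples
    max_bin=255,            # histogram algorithm: continuous features bucketed into at most 255 bins
    num_leaves=31,          # leaf-wise (best-first) growth is capped by number of leaves, not depth
)
model.fit(X, y)
print(model.predict(X[:5]))
```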
Parameter differences
Image from https://www.youtube.com/watch?v=dOwKbwQ97tI
Please credit the source when reposting. You are welcome to check the sources cited in this article and to point out anything that is wrong or unclear, either in the comment section below or by email to carlos@sjtu.edu.cn.