Used Car Price Prediction - 03 - Modeling and Hyperparameter Tuning

Summary of the official livestream

Reducing memory usage by lowering numeric precision

```python
import numpy as np

def reduce_mem_usage(df):
    """ Iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Downcast integers to the smallest type that fits the value range
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Downcast floats likewise (float16 trades precision for memory)
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object columns become pandas categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
```
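A minimal usage sketch; the file name and separator below are assumptions based on the Tianchi competition data, not from the original post, and may need adjusting:

```python
import pandas as pd

# Hypothetical file name/separator for the competition's training set
train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
train = reduce_mem_usage(train)
```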

Learning curves

We need to trade off variance against bias.
The gap between the training-score curve and the cross-validation-score curve reflects the size of the variance.
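A minimal sketch of drawing such curves with scikit-learn's learning_curve; the estimator and the iris stand-in data are illustrative choices, not from the original post:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# train_sizes: fractions of the training data to evaluate the curves at
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# A persistent gap between the two curves signals high variance (overfitting)
plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(sizes, cv_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend()
plt.show()
```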

A piece missed in the earlier feature-engineering step

Normalizing the prediction target (the training-set label, which is price this time).
The main options are the log transform, the exponential transform, and the Johnson transform.
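A minimal sketch of the log transform, the most common of the three; the toy frame is illustrative. (For the Johnson family, scikit-learn's PowerTransformer with method='yeo-johnson' is one readily available option.)

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the training data; 'price' is the competition target
train = pd.DataFrame({'price': [1000, 3500, 8000, 25000, 99999]})

# log1p compresses the long right tail of the price distribution
y = np.log1p(train['price'])

# After predicting on the log scale, invert with expm1 to get prices back:
# price_pred = np.expm1(model.predict(X_test))
```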

Learning XGBoost and LightGBM

The foundation of both models: GBDT

GBDT (Gradient Boosting Decision Tree)

Gradient boosting

$$
f_t(x) = f_{t-1}(x) + h_t(x)
$$
In each round, a weak learner $h_t(x)$ is added on top of the strong learner $f_{t-1}(x)$ produced in the previous round, so as to minimize this round's loss.
The negative gradient of the loss function is used to approximate this round's loss.
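Concretely, in standard GBDT the weak learner $h_t$ is fit to the pseudo-residuals, i.e. the negative gradient of the loss evaluated at the previous round's model:

$$
r_{t,i} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{t-1}}
$$

For squared-error loss this reduces to the plain residual $y_i - f_{t-1}(x_i)$.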

Decision trees

Pros: flexible, easy to interpret
Cons: prone to overfitting

Introduction to XGBoost

Using it out of the box

First import the libraries we need

```python
import pandas
import numpy
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt

%matplotlib inline
```

A typical classification problem

```python
data = pandas.read_csv('iris.data', header=None)
dataset = data.values
X = dataset[:, 0:4]
Y = dataset[:, 4]
# Encode the string class labels as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# Train/validation split
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, label_encoded_y, test_size=test_size, random_state=seed)
# Train the model
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)
```
```python
# Make predictions
y_pred = model.predict(X_test)
print(y_pred[::5])
predictions = [round(value) for value in y_pred]
print(predictions[::5])
# Evaluate the predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
```python
# Use xgboost's built-in helper to rank features by importance
from xgboost import plot_importance
plot_importance(model)
plt.show()
```

With categorical features

Because XGBoost cannot handle categorical features directly, one-hot encode them:

```python
from sklearn.preprocessing import OneHotEncoder

encoded_x = None
for i in range(0, X.shape[1]):
    # Label-encode the column, then one-hot encode it
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
```

When there are many missing values

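XGBoost handles missing values natively: at each split it learns a default branch direction for samples whose feature value is absent (the sparsity-aware split finding from the XGBoost paper), so NaN can usually be left in place. A minimal sketch with made-up data:

```python
import numpy
import xgboost

# NaNs can stay in the feature matrix; XGBoost learns a default
# direction for them at every split
X_missing = numpy.array([[1.0, numpy.nan],
                         [2.0, 3.0],
                         [numpy.nan, 4.0],
                         [5.0, 6.0]])
y_missing = numpy.array([0, 1, 0, 1])

model_missing = xgboost.XGBClassifier()
model_missing.fit(X_missing, y_missing)
```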


Feature selection

```python
# Evaluate models trained on nested feature subsets, ordered by importance
print("model.feature_importances_", model.feature_importances_)
print("majority classifier", 1 - numpy.mean(y_test))
sort_idx = numpy.argsort(-model.feature_importances_)  # sort in decreasing order
best_feature = sort_idx[0]
X_train_build = X_train[:, best_feature].reshape(X_train.shape[0], 1)
X_test_build = X_test[:, best_feature].reshape(X_test.shape[0], 1)
for idx in list(sort_idx):
    if idx != best_feature:
        # Add the next most important feature to the subset
        X_train_build = numpy.concatenate(
            (X_train_build, X_train[:, idx].reshape(X_train.shape[0], 1)), axis=1)
        X_test_build = numpy.concatenate(
            (X_test_build, X_test[:, idx].reshape(X_test.shape[0], 1)), axis=1)
    # train a model on the current feature subset
    selection_model = xgboost.XGBClassifier()
    selection_model.fit(X_train_build, y_train)
    # eval model
    y_pred = selection_model.predict(X_test_build)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("num_features=%d, Accuracy: %.2f%%" % (X_train_build.shape[1], accuracy * 100.0))
```

Model parameters

I am still not sure whether feature selection or parameter tuning should come first.
See https://xgboost.readthedocs.io/en/latest/parameter.html
Grid search (GridSearchCV) is the main tool

```python
# sklearn.grid_search was removed; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV
```
```python
cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
other_params = {'learning_rate': 0.1, 'n_estimators': 4, 'max_depth': 4, 'min_child_weight': 1,
                'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0.1,
                'reg_alpha': 0, 'reg_lambda': 1}

model = xgboost.XGBClassifier(**other_params)
optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='accuracy',
                             cv=5, verbose=1, n_jobs=4)
optimized_GBM.fit(X_train, y_train)
evaluate_result = optimized_GBM.cv_results_  # grid_scores_ was removed; cv_results_ replaces it
# print('Per-candidate results: {0}'.format(evaluate_result))
print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
print('Best model score: {0}'.format(optimized_GBM.best_score_))
```

Differences between XGBoost and LightGBM

Compiled from the livestream material

Structural difference: how continuous variables are split

The most naive way to pick a split point is to try every point and compare it against all the others, which costs O(n^2).
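Sorting the feature first removes most of that work. A rough, purely illustrative sketch (not from the original post) of how one sort yields every useful candidate threshold in a single linear scan:

```python
import numpy as np

def candidate_splits(feature_values):
    """After one O(n log n) sort, midpoints between consecutive distinct
    values enumerate every useful threshold in a single O(n) pass."""
    v = np.unique(feature_values)  # unique values, returned sorted
    return (v[:-1] + v[1:]) / 2

print(candidate_splits(np.array([3.0, 1.0, 2.0, 2.0, 5.0])))  # [1.5 2.5 4. ]
```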

  1. Different split-finding algorithms for continuous variables
  • XGBoost uses pre-sorted based algorithms

    Sorting before computing cuts down the amount of comparison work

    This can be taken further with histograms, comparing each bin's midpoint instead (sacrificing some accuracy)

  • LightGBM uses histogram based algorithms

    It buckets the data first, discretizing the continuous values

  • LightGBM also uses a technique called GOSS (Gradient-based One-Side Sampling)

    Samples with small gradients, i.e. the less important ones (importance here is judged by the gradient), are only partially sampled

  2. Different tree growth strategies
  • XGBoost uses level-wise tree growth

    The tree grows downward one full level at a time

  • LightGBM uses leaf-wise tree growth

    Leaf nodes can end up at different depths

  3. Handling categorical variables
  • XGBoost cannot handle categorical variables directly; label encoding has to be done first
  • LightGBM has its own handling, different from the traditional approaches and smarter (see the sketch after this list)
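A minimal sketch of letting LightGBM handle a categorical column itself; the column names and toy values are illustrative. It is enough for the column to carry pandas category dtype (or be listed via the categorical_feature parameter):

```python
import lightgbm as lgb
import pandas as pd

# Hypothetical frame: 'brand' is categorical, 'power' numeric, 'price' the target
df = pd.DataFrame({'brand': ['a', 'b', 'a', 'c'],
                   'power': [75, 120, 90, 60],
                   'price': [5000, 9000, 6500, 3000]})
df['brand'] = df['brand'].astype('category')  # no one-hot/label encoding needed

train_set = lgb.Dataset(df[['brand', 'power']], label=df['price'])
booster = lgb.train({'objective': 'regression', 'verbose': -1}, train_set)
```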

Parameter differences

(Figure: a table comparing XGBoost and LightGBM parameters; image from https://www.youtube.com/watch?v=dOwKbwQ97tI)

