Bookmark This! How to Use Feature Extraction Techniques to Reduce Dataset Dimensionality

taotao_2016 2019-12-07

Full text: 5,320 characters. Estimated reading time: 20 minutes.

Image source: https://blog.datasciencedojo.c

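Introduction

It is now common to work with datasets that have hundreds (or even thousands) of features. If the number of features is close to (or larger than) the number of observations stored in the dataset, a Machine Learning model is likely to overfit. To avoid this problem, apply either regularization or dimensionality reduction (feature extraction). In Machine Learning, the dimensionality of a dataset equals the number of variables used to represent it.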
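To make the overfitting risk concrete, here is a minimal sketch on synthetic data (not the mushroom dataset). It assumes a plain least-squares model and pure-noise labels, both chosen only for illustration: with more features than training rows, the model reproduces the training labels almost exactly even though there is nothing real to learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: more features (50) than training observations (20).
n_train, n_test, n_features = 20, 200, 50
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))

# Labels are pure noise, so any apparent fit is memorization.
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

# Unregularized least squares (minimum-norm solution).
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ weights - y_train) ** 2)
test_mse = np.mean((X_test @ weights - y_test) ** 2)

print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2e}")
```

The training error is numerically zero while the test error is not, which is the gap that regularization or dimensionality reduction is meant to close.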

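Regularization certainly helps reduce the risk of overfitting, but feature extraction techniques bring further advantages, such as:

· Improved accuracy

· Reduced overfitting risk

· Faster training

· Better data visualization

· Increased model explainability

Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the originals). The new, reduced feature set should summarize most of the information contained in the original feature set. In this way, a summarized version of the original features is built from combinations of them.

Feature selection is another commonly used technique for reducing the number of features in a dataset. It differs from feature extraction in that it ranks the existing features by importance and discards the less important ones (no new features are created).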
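The difference between the two approaches can be sketched on synthetic data. The example below assumes scikit-learn's SelectKBest (with the f_classif score) as the feature selection method and PCA as the feature extraction method; both are illustrative choices, and the dataset is generated rather than taken from this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Feature selection: rank the existing columns and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)
kept = selector.get_support(indices=True)

# Feature extraction: build two new columns as combinations of all six.
pca_demo = PCA(n_components=2)
X_extracted = pca_demo.fit_transform(X_demo)

print("columns kept by selection:", kept)
print("selected columns are original columns:",
      np.allclose(X_selected, X_demo[:, kept]))
print("PCA loadings (each new feature mixes all originals):")
print(pca_demo.components_.round(2))
```

Both outputs have two columns, but only the selected ones can be found unchanged in the original data.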

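This article uses the Kaggle Mushroom Classification Dataset as an example of how to apply feature extraction techniques. The goal is to predict whether a mushroom is poisonous by looking at the given features.

First, import all the necessary libraries.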

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

The figure below shows the dataset used in this example.

Figure 1: Mushroom Classification dataset

Before feeding the data into Machine Learning models, split it into features (X) and labels (Y) and one-hot encode all the categorical variables.

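# df is assumed to be a pandas DataFrame that already holds the mushroom dataset
# (the loading step is not shown in this article)
X = df.drop(['class'], axis=1)
Y = df['class']
# One-hot encode the categorical features
X = pd.get_dummies(X, prefix_sep='_')
# Encode the class labels as integers
Y = LabelEncoder().fit_transform(Y)
# Standardize every feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)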
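To see what this preprocessing does, here is a small sketch on a toy DataFrame. The four rows and their values are made up for illustration and are not taken from the real dataset; only the calls (pd.get_dummies and LabelEncoder) match the step above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical four-row sample with one label column and two categorical features.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "b", "x", "f"],
    "odor": ["p", "a", "l", "p"],
})

# One column per (feature, category) pair, named "<feature>_<category>".
X_toy = pd.get_dummies(toy.drop(["class"], axis=1), prefix_sep="_")

# Labels become integers in sorted order of the class names: e -> 0, p -> 1.
y_toy = LabelEncoder().fit_transform(toy["class"])

print(list(X_toy.columns))
print(y_toy)
```

Two categorical columns with three categories each become six indicator columns, which is how a dataset with a few dozen categorical features can grow to more than a hundred numeric ones.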

Next, create a function (forest_test) that splits the input data into training and test sets, then trains and tests a Random Forest classifier.

def forest_test(X, Y):
    # Hold out 30% of the data for testing
    X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y,
                                                        test_size=0.30,
                                                        random_state=101)
    start = time.process_time()
    trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train, Y_Train)
    # CPU time spent on training, in seconds
    print(time.process_time() - start)
    predictionforest = trainedforest.predict(X_Test)
    print(confusion_matrix(Y_Test, predictionforest))
    print(classification_report(Y_Test, predictionforest))

This function can first be applied to the whole dataset and then to each of the reduced datasets in turn, so that the results can be compared.

forest_test(X, Y)

As shown below, training a Random Forest classifier on the full feature set reaches 100% accuracy in about 2.2 seconds of training time. In each of the following examples, the first line of output reports the training time for reference.

2.2676709799999992
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

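Feature Extraction

Principal Component Analysis (PCA)

PCA is one of the most widely used linear dimensionality reduction techniques. It takes the original data as input and looks for the combinations of input features that best summarize the original data distribution, so that the number of dimensions can be reduced. It does this by maximizing variance and minimizing reconstruction error, looking at pairwise distances. In PCA, the original data is projected onto a set of orthogonal axes, and the axes are ranked in order of importance.

PCA is an unsupervised learning algorithm, so it ignores the data labels and looks only at the variance in the features. In some cases this can lead to misclassification.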
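The projection onto ranked orthogonal axes can be sketched directly with NumPy. This is a minimal illustration on synthetic data, assuming the standard formulation of PCA as a singular value decomposition of the centered data; it is not the code used for the results in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5 observed features driven by 2 latent factors plus noise.
latent = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

# PCA works on centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal axes, already sorted by singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S ** 2 / (X_centered.shape[0] - 1)

# Keep the two most important axes and project the data onto them.
components = Vt[:2]
X_projected = X_centered @ components.T

print("variance per axis:", explained_variance.round(3))
print("axes are orthonormal:",
      np.allclose(components @ components.T, np.eye(2)))
```

The variance of each projected column equals the corresponding entry of explained_variance, which is what "ranked in order of importance" means here.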

In this example, PCA is first applied to the whole dataset to reduce it to two dimensions, and a DataFrame is then built from the new features and their labels.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
PCA_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
PCA_df = pd.concat([PCA_df, df['class']], axis=1)
PCA_df['class'] = LabelEncoder().fit_transform(PCA_df['class'])
PCA_df.head()

Figure 2: PCA dataset

With the newly created DataFrame, the data distribution can now be plotted in a 2D scatter plot.

figure(num=None, figsize=(8, 8), dpi=80, facecolor='w', edgecolor='k')
classes = [1, 0]
colors = ['r', 'b']
# Draw one scatter series per class so that each gets its own legend entry
for clas, color in zip(classes, colors):
    plt.scatter(PCA_df.loc[PCA_df['class'] == clas, 'PC1'],
                PCA_df.loc[PCA_df['class'] == clas, 'PC2'],
                c=color)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.title('2D PCA', fontsize=15)
plt.legend(['Poisonous', 'Edible'])
plt.grid()

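Figure 3: 2D PCA visualization

The same process can now be repeated, this time keeping three dimensions and creating animations with Plotly.

With PCA, the explained_variance_ratio_ attribute in Scikit-learn also shows how much of the original data variance is preserved. Once the variance ratios are computed, they can be used to build visualizations.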
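As a minimal sketch of how the variance ratios can be used, the example below fits PCA on synthetic data and finds the smallest number of components whose cumulative ratio reaches a threshold. The 95% threshold and the generated dataset are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a standardized feature matrix.
X_demo, _ = make_classification(n_samples=300, n_features=10,
                                n_informative=4, n_redundant=4,
                                random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the full variance spectrum.
pca_demo = PCA().fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components that retains at least 95% of the variance.
n_needed = int(np.argmax(cumulative >= 0.95) + 1)

print("ratio per component:", ratios.round(3))
print("cumulative:", cumulative.round(3))
print("components needed for 95%:", n_needed)
```

Plotting cumulative against the component index gives the usual "elbow" chart for choosing n_components.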

Running the Random Forest classifier again on the three features constructed by PCA (instead of the whole dataset) gives a classification accuracy of 98%, while using only two features gives 95%.

pca = PCA(n_components=3, svd_solver='full')
X_pca = pca.fit_transform(X)
# Variance captured by each of the three components
print(pca.explained_variance_)
forest_test(X_pca, Y)

[10.31484926 9.42671062 8.35720548]
2.769664902999999
[[1261   13]
 [  41 1123]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1274
           1       0.99      0.96      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438

In addition, with the two-dimensional dataset, the decision boundary that the Random Forest uses to classify each data point can now be visualized.

# Refit PCA with two components: the boundary plot needs exactly two features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_Reduced, X_Test_Reduced, Y_Reduced, Y_Test_Reduced = train_test_split(X_pca, Y,
                                                                        test_size=0.30,
                                                                        random_state=101)
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Reduced, Y_Reduced)
# Build a grid that covers the training points with a margin of 1 on each side
x_min, x_max = X_Reduced[:, 0].min() - 1, X_Reduced[:, 0].max() + 1
y_min, y_max = X_Reduced[:, 1].min() - 1, X_Reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
# Predict the class of every grid point and colour the plane accordingly
Z = trainedforest.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.4)
plt.scatter(X_Reduced[:, 0], X_Reduced[:, 1], c=Y_Reduced, s=20, edgecolor='k')
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.title('Random Forest', fontsize=15)
plt.show()

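Figure 4: PCA Random Forest decision boundary

Independent Component Analysis (ICA)

ICA is a linear dimensionality reduction method that takes a mixture of independent components as input and aims to identify each of them correctly (removing all unnecessary noise). Two input features can be considered independent if both their linear and non-linear dependence are equal to zero [1].

ICA is widely used in medical applications such as EEG and fMRI analysis, where it helps separate useful signals from unhelpful ones.

A simple example of an ICA application: an audio recording captures two people talking. ICA can distinguish the two independent components in the audio (the two different voices) and so identify the different speakers in the conversation.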
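The two-speaker scenario can be imitated with synthetic signals. The sketch below assumes a sine wave and a square wave as stand-ins for the two voices and a made-up mixing matrix for the two microphones; FastICA is the Scikit-learn implementation of ICA.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent "voices" (hypothetical signals).
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Each microphone records a different mixture of both voices, plus slight noise.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X_mixed = S @ A.T + 0.02 * rng.normal(size=S.shape)

# Recover two independent components from the mixtures.
ica_demo = FastICA(n_components=2, random_state=0)
S_est = ica_demo.fit_transform(X_mixed)

# ICA recovers sources only up to order, sign, and scale,
# so compare by absolute correlation.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Each row of corr should contain one value close to 1, meaning that every original signal has a matching recovered component.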

ICA can now be used to reduce the dataset to three features again, test the accuracy with a Random Forest classifier, and plot the results in three dimensions.

from sklearn.decomposition import FastICA

ica = FastICA(n_components=3)
X_ica = ica.fit_transform(X)
forest_test(X_ica, Y)

2.8933812039999793
[[1263   11]
 [  44 1120]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1274
           1       0.99      0.96      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438

The animation shows that, although PCA and ICA reached the same accuracy, they produced different distributions in three-dimensional space.

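Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction technique and also a Machine Learning classifier.

LDA aims to maximize the distance between classes and minimize the spread within each class, so it uses within-class and between-class distance as its measures. This makes it a good choice: maximizing the distance between classes when projecting data into a lower-dimensional space can lead to better classification results (less overlap between the classes).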
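The trade-off between between-class distance and within-class spread can be illustrated with the classic two-class Fisher criterion. This is a sketch on synthetic Gaussian data, assuming the textbook closed-form direction (inverse within-class scatter times the difference of class means); Scikit-learn's LinearDiscriminantAnalysis may use a different solver internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes with the same correlated covariance, shifted apart.
cov = np.array([[3.0, 2.0],
                [2.0, 3.0]])
class0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
class1 = rng.multivariate_normal([2.0, 0.0], cov, size=300)

def fisher_ratio(direction):
    """Squared distance between projected class means over projected spread."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    p0, p1 = class0 @ direction, class1 @ direction
    return (p0.mean() - p1.mean()) ** 2 / (p0.var() + p1.var())

# Fisher direction: within-class scatter inverse times the mean difference.
mean_diff = class1.mean(axis=0) - class0.mean(axis=0)
S_w = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
w_lda = np.linalg.solve(S_w, mean_diff)

print("Fisher direction:        ", round(fisher_ratio(w_lda), 3))
print("mean-difference direction:", round(fisher_ratio(mean_diff), 3))
print("first coordinate axis:    ", round(fisher_ratio([1.0, 0.0]), 3))
```

The Fisher direction scores at least as high as the other two directions, because it accounts for the correlation inside each class rather than only the distance between the class means.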

LDA assumes that the input data follows a Gaussian distribution (as in this example), so applying LDA to non-Gaussian data may lead to poor classification results.

In this example, LDA is run to reduce the dataset to a single feature, and the accuracy is then tested and the results plotted.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
# run an LDA and use it to transform the features
X_lda = lda.fit(X, Y).transform(X)
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_lda.shape[1])

Original number of features: 117
Reduced number of features: 1

Because the data in this example follows a Gaussian distribution, LDA achieves very good results: accuracy reaches 100% when tested with the Random Forest classifier.

forest_test(X_lda, Y)

1.2756952610000099
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

# Use LDA itself as the classifier, trained on the one-dimensional LDA features
X_Reduced, X_Test_Reduced, Y_Reduced, Y_Test_Reduced = train_test_split(X_lda, Y,
                                                                        test_size=0.30,
                                                                        random_state=101)
start = time.process_time()
lda = LinearDiscriminantAnalysis().fit(X_Reduced, Y_Reduced)
print(time.process_time() - start)
predictionlda = lda.predict(X_Test_Reduced)
print(confusion_matrix(Y_Test_Reduced, predictionlda))
print(classification_report(Y_Test_Reduced, predictionlda))

0.008464782999993758
[[1274    0]
 [   2 1162]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

Finally, plotting the one-dimensional data shows how the distributions of the two classes look.

Figure 5: LDA class separation

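Locally Linear Embedding (LLE)

The methods discussed so far, such as PCA and LDA, work well when the relationships between features are linear. This section looks at how to handle non-linear cases.

LLE is a dimensionality reduction technique based on manifold learning. A manifold is a D-dimensional object embedded in a higher-dimensional space. Manifold learning aims to represent this object in its original D dimensions instead of in an unnecessarily larger space.

A typical example used to explain manifold learning in Machine Learning is the Swiss Roll manifold (Figure 6). The input data is distributed like a roll (in three-dimensional space), and unrolling it compresses the data into a two-dimensional space.

Manifold learning algorithms include Isomap, Locally Linear Embedding, Modified Locally Linear Embedding, and Hessian Eigenmapping.

Figure 6: Manifold learning [2]

The following shows how to use LLE in this example. According to the Scikit-learn documentation [3]:

Locally Linear Embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.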
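The Swiss Roll case described above can be reproduced with Scikit-learn's built-in generator. This is a minimal sketch; the sample size, noise level, and n_neighbors=12 are illustrative assumptions rather than tuned values.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D points lying on a rolled-up 2-D sheet; `position` is the coordinate along the roll.
X_roll, position = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Unroll the sheet into two dimensions using local neighborhoods of 12 points.
lle_demo = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle_demo.fit_transform(X_roll)

print(X_roll.shape, "->", X_unrolled.shape)
print("reconstruction error:", lle_demo.reconstruction_error_)
```

Colouring a scatter plot of X_unrolled by position is a quick visual check of whether the roll was flattened without tearing.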

LLE can now be run on the dataset to reduce it to three dimensions, test the accuracy, and plot the results.

from sklearn.manifold import LocallyLinearEmbedding

embedding = LocallyLinearEmbedding(n_components=3)
X_lle = embedding.fit_transform(X)
forest_test(X_lle, Y)

2.578125
[[1273    0]
 [1143   22]]
              precision    recall  f1-score   support

           0       0.53      1.00      0.69      1273
           1       1.00      0.02      0.04      1165

   micro avg       0.53      0.53      0.53      2438
   macro avg       0.76      0.51      0.36      2438
weighted avg       0.75      0.53      0.38      2438

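t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction technique typically used to visualize high-dimensional datasets. Its main application areas include Natural Language Processing (NLP) and speech processing.

t-SNE works by minimizing the divergence between two distributions: one built from the pairwise probability similarities of the input features in the original high-dimensional space, and its equivalent in the reduced low-dimensional space. It uses the Kullback-Leibler (KL) divergence to measure the dissimilarity between the two distributions, and the KL divergence is then minimized with gradient descent.

In t-SNE, the high-dimensional space is modeled with a Gaussian distribution, while the low-dimensional space is modeled with a Student's t-distribution. This avoids the imbalance in the distribution of neighboring-point distances caused by the transformation into a lower-dimensional space.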
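Two ingredients of this description can be checked numerically. The sketch below computes the KL divergence between small hand-made discrete distributions and compares the two similarity kernels in unnormalized form, exp(-d²) for the Gaussian and 1 / (1 + d²) for the Student's t with one degree of freedom. The numbers are illustrative assumptions, and the full algorithm (per-point bandwidths, normalization, gradient descent) is omitted.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical neighbor-similarity distributions.
p = np.array([0.5, 0.3, 0.2])
q_close = np.array([0.45, 0.35, 0.2])   # low-dimensional layout that matches p well
q_far = np.array([0.1, 0.1, 0.8])       # layout that distorts the neighborhoods

print("KL(p || p):      ", kl_divergence(p, p))
print("KL(p || q_close):", round(kl_divergence(p, q_close), 4))
print("KL(p || q_far):  ", round(kl_divergence(p, q_far), 4))

# Unnormalized similarity as a function of the distance d between two points.
d = np.array([1.0, 3.0, 5.0])
gaussian_kernel = np.exp(-d ** 2)         # used in the high-dimensional space
student_t_kernel = 1.0 / (1.0 + d ** 2)   # used in the low-dimensional space

print("Gaussian: ", gaussian_kernel)
print("Student t:", student_t_kernel)
```

The heavier tail of the Student's t kernel lets moderately distant points sit farther apart in the low-dimensional map without a large penalty, which is the imbalance correction mentioned above.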

t-SNE can now be applied to reduce the dataset to three dimensions.

from sklearn.manifold import TSNE

start = time.process_time()
# Note: newer scikit-learn releases rename the n_iter parameter to max_iter
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(X)
print(time.process_time() - start)

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 8124 samples in 0.139s...
[t-SNE] Computed neighbors for 8124 samples in 11.891s...
[t-SNE] Computed conditional probabilities for sample 1000 / 8124
[t-SNE] Computed conditional probabilities for sample 2000 / 8124
[t-SNE] Computed conditional probabilities for sample 3000 / 8124
[t-SNE] Computed conditional probabilities for sample 4000 / 8124
[t-SNE] Computed conditional probabilities for sample 5000 / 8124
[t-SNE] Computed conditional probabilities for sample 6000 / 8124
[t-SNE] Computed conditional probabilities for sample 7000 / 8124
[t-SNE] Computed conditional probabilities for sample 8000 / 8124
[t-SNE] Computed conditional probabilities for sample 8124 / 8124
[t-SNE] Mean sigma: 2.658530
[t-SNE] KL divergence after 250 iterations with early exaggeration: 65.601128
[t-SNE] KL divergence after 300 iterations: 1.909915
143.984375

Visualizing the distribution of the resulting features clearly shows that the data remains well separated even after being transformed into a reduced space.

Testing the Random Forest accuracy on the t-SNE reduced dataset confirms that the classes can be separated easily.

forest_test(X_tsne, Y)

2.6462027340000134
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

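Autoencoders

Autoencoders are a family of Machine Learning algorithms that can be used for dimensionality reduction. Unlike linear techniques such as PCA, autoencoders use non-linear transformations to project data from a high-dimensional space to a lower-dimensional one.

There are several types of autoencoders, including:

· Denoising autoencoders

· Variational autoencoders

· Convolutional autoencoders

· Sparse autoencoders

This example starts by building a basic autoencoder (Figure 7). The basic architecture of an autoencoder can be divided into two main parts:

1. Encoder: compresses the input data in order to remove as much noise and unhelpful information as possible. The output of the encoder is usually called the bottleneck or latent space.

2. Decoder: takes the encoded latent space as input and tries to reproduce the original autoencoder input using only its compressed form (the encoded latent space).

If all the input features are independent of each other, the autoencoder will find it very difficult to encode and decode the input data in a lower-dimensional space.

Figure 7: Autoencoder architecture [4]

Autoencoders can be implemented in Python using the Keras API. In this example, the encoding layer specifies the number of features the input data should be reduced to (3 in this case). As the code snippet below shows, the autoencoder takes X (the input features) as both its features and its labels (X, Y).

In this example, ReLU is used as the activation function for the encoding stage and Softmax for the decoding stage. Without non-linear activation functions, the autoencoder would try to reduce the input data with a linear transformation (and would therefore give a result similar to PCA).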
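The remark about linear transformations can be checked with a small experiment. The sketch below trains a purely linear autoencoder (no activation functions, no biases) with plain gradient descent in NumPy and compares its reconstruction error with that of PCA using the same number of components. The synthetic data, learning rate, and step count are assumptions chosen for illustration; this is not the Keras model used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 6, 2   # samples, input features, bottleneck size

# Synthetic data that is close to 2-dimensional, then centered.
X_demo = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
X_demo -= X_demo.mean(axis=0)

# Reference: reconstruction from the top-k principal axes.
_, _, Vt = np.linalg.svd(X_demo, full_matrices=False)
pca_mse = np.mean((X_demo - X_demo @ Vt[:k].T @ Vt[:k]) ** 2)

# Linear autoencoder: code = X @ W_enc, reconstruction = code @ W_dec.
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))
initial_mse = np.mean((X_demo - X_demo @ W_enc @ W_dec) ** 2)

learning_rate = 0.01
for _ in range(5000):
    code = X_demo @ W_enc
    error = code @ W_dec - X_demo
    # Gradients of the loss sum(error ** 2) / (2 * n) with respect to each weight matrix.
    grad_dec = code.T @ error / n
    grad_enc = X_demo.T @ (error @ W_dec.T) / n
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

ae_mse = np.mean((X_demo - X_demo @ W_enc @ W_dec) ** 2)

print(f"initial autoencoder MSE: {initial_mse:.4f}")
print(f"trained autoencoder MSE: {ae_mse:.4f}")
print(f"PCA MSE (lower bound):   {pca_mse:.4f}")
```

A linear autoencoder with a k-unit bottleneck can never reconstruct better than PCA with k components, so the trained error approaches the PCA error from above; the non-linear activations are what allow an autoencoder to go beyond this.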

from keras.layers import Input, Dense
from keras.models import Model

input_layer = Input(shape=(X.shape[1],))
# Encoder: compress the input to a 3-unit bottleneck
encoded = Dense(3, activation='relu')(input_layer)
# Decoder: reconstruct the original number of features
decoded = Dense(X.shape[1], activation='softmax')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# The input features are also the targets, so X is passed twice
X1, X2, Y1, Y2 = train_test_split(X, X, test_size=0.3, random_state=101)
autoencoder.fit(X1, Y1,
                epochs=100,
                batch_size=300,
                shuffle=True,
                verbose=30,
                validation_data=(X2, Y2))
# Keep only the encoder half to produce the reduced features
encoder = Model(input_layer, encoded)
X_ae = encoder.predict(X)

The same steps as in the previous examples can now be repeated, this time using a simple autoencoder as the feature extraction technique.

forest_test(X_ae, Y)

1.734375
[[1238   36]
 [  67 1097]]
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1274
           1       0.97      0.94      0.96      1164

   micro avg       0.96      0.96      0.96      2438
   macro avg       0.96      0.96      0.96      2438
weighted avg       0.96      0.96      0.96      2438

I hope you found this useful. Thank you for reading!