In this article, we will introduce different methods for selecting features from a dataset, and discuss the types of feature selection algorithms and their implementation in Python using the Scikit-learn (sklearn) library:
- Univariate feature selection
- Recursive feature elimination (RFE)
- Principal component analysis (PCA)
- Feature selection using feature importance
Univariate Feature Selection
Statistical tests can be used to select the features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class, which can be used with a range of different statistical tests to select a specific number of features.
The following example uses the chi-squared (chi²) statistical test for non-negative features to select the four best features from the Pima Indians diabetes dataset:
```python
#Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
#Import chi2 for performing chi square test
from sklearn.feature_selection import chi2
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#We will select the features using chi square
test = SelectKBest(score_func=chi2, k=4)
#Fit the function for ranking the features by score
fit = test.fit(X, Y)
#Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
#Apply the transformation on to dataset
features = fit.transform(X)
#Summarize selected features
print(features[0:5,:])
```
You can see the score for each attribute and the four selected attributes (those with the highest scores): plas, test, mass and age.

Scores for each feature:

```
[ 111.52  1411.887  17.605  53.108  2175.565  127.669  5.393  181.304]
```
Selected features:

```
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
```
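To see which columns SelectKBest kept, the boolean mask from `get_support()` can be zipped with the attribute names. Here is a minimal, self-contained sketch on made-up non-negative data (the array and names below are illustrative, not the Pima dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative non-negative data: 6 samples, 4 features;
# f2 and f4 track the class label, f1 and f3 do not
X = np.array([[1, 9, 0, 2],
              [2, 8, 1, 2],
              [1, 1, 0, 9],
              [2, 2, 1, 8],
              [1, 9, 0, 1],
              [2, 1, 1, 9]])
y = np.array([0, 0, 1, 1, 0, 1])
names = ['f1', 'f2', 'f3', 'f4']

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
mask = selector.get_support()            # boolean mask over columns
selected = [n for n, keep in zip(names, mask) if keep]
print(selected)                          # → ['f2', 'f4']
```

The same zip over `names` works on the Pima example above to confirm that plas, test, mass and age are the surviving columns.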
Recursive Feature Elimination (RFE)
RFE works by recursively removing attributes and building a model on the attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter much, as long as it is skillful and consistent:
```python
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression as the base estimator
from sklearn.linear_model import LogisticRegression
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
```
After execution, we get:

```
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
```
You can see that RFE selected the top three features: preg, mass and pedi. These are marked True in the support_ array and assigned rank 1 in the ranking_ array.
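The same support_/ranking_ bookkeeping can be reproduced on synthetic data; here is a minimal sketch (the dataset and feature names below are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 8 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=5, random_state=0)
names = ['feat_%d' % i for i in range(8)]

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
kept = [n for n, s in zip(names, rfe.support_) if s]
print(kept)              # the three surviving features
print(rfe.ranking_)      # rank 1 = selected; larger = eliminated earlier
```

With 8 features and 3 kept, the ranks run from 1 (selected) up to 6, one per elimination round.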
Principal Component Analysis (PCA)
PCA uses linear algebra to transform the dataset into a compressed form. It is generally considered a data-reduction technique. One property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the following example, we use PCA and select three principal components:
```python
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```
You can see that the transformed dataset (three principal components) bears little resemblance to the source data:
```
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
```
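If you would rather choose the number of components from the explained variance than fix it in advance, scikit-learn's PCA also accepts a float between 0 and 1 as n_components and keeps just enough components to reach that fraction of variance. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 100 samples, 6 features, but almost all variance lives in 2 directions
base = rng.randn(100, 2)
X = np.hstack([base, base @ rng.randn(2, 4) + 0.01 * rng.randn(100, 4)])

pca = PCA(n_components=0.95).fit(X)    # keep >= 95% of the variance
print(pca.n_components_)               # number of components retained
print(pca.explained_variance_ratio_.sum())
```

Note that PCA is scale-sensitive, so in practice the features are usually standardized first; the Pima example above skips that step for simplicity.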
Feature Selection Using Feature Importance
Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier, such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's look at it in detail.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of many decision trees. Each node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure used to select the (locally) best condition is called impurity. For classification it is typically Gini impurity or information gain/entropy, and for regression trees it is variance. Thus, when a tree is trained, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged across trees, and the features ranked according to this measure.
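The mean-decrease-impurity idea is easy to see on a toy example: a forest trained on one informative column and three noise columns assigns nearly all of its importance to the informative one. A minimal sketch (all data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 500
informative = rng.randn(n)                  # this column drives the label
noise = rng.randn(n, 3)                     # unrelated columns
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; the first column should dominate
print(forest.feature_importances_)
```

The importances sum to 1 by construction, so they can be read directly as each feature's share of the total impurity reduction.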
Let's see how to perform feature selection with a random forest classifier and evaluate the classifier's accuracy before and after feature selection. We will use the Otto dataset.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and to evaluate the model using multiclass logarithmic loss (also called cross-entropy).
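Multiclass log loss penalizes confident wrong probabilities heavily; scikit-learn exposes it as `sklearn.metrics.log_loss`. A quick hand-checkable sketch on made-up probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted class probabilities for 3 samples over 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
y_true = [0, 1, 2]
# Mean of -log(probability assigned to the true class)
print(log_loss(y_true, probs))   # ≈ 0.3635
```

Here the loss is -(ln 0.7 + ln 0.8 + ln 0.6) / 3 ≈ 0.3635; a perfect classifier would score 0.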
We will start by importing all the libraries:
```python
#Import the supporting libraries
#Import pandas to load the dataset from csv file
from pandas import read_csv
#Import numpy for array based operations and calculations
import numpy as np
#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
#Import feature selector class select model of sklearn
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)
```
Let's define a method to split the dataset into training and testing data; we will train our model on the training portion and use the testing portion to evaluate the trained model:
```python
#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
```
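For reference, scikit-learn ships an equivalent utility, train_test_split, which handles the shuffling and the split fraction in one call; a minimal sketch on a toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

dataset = np.arange(100).reshape(50, 2)    # toy 50-row dataset
# 70/30 split with reproducible shuffling, like the function above
train, test = train_test_split(dataset, train_size=0.7, random_state=0)
print(train.shape, test.shape)             # (35, 2) (15, 2)
```

The hand-rolled version is kept in this walkthrough to stay close to the original code, but the library utility is the idiomatic choice.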
We also need to add a function to evaluate the accuracy of the model; it takes the predicted and actual outputs as input and computes the percentage accuracy:
```python
#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc
```
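This count-and-divide logic is the same thing sklearn.metrics.accuracy_score computes; a self-contained sketch comparing the two on toy predictions (the function is restated here so the block runs on its own):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def getAccuracy(pre, ytest):
    # Fraction of positions where prediction matches the label
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    return float(count)/len(ytest)

pre   = np.array([1, 0, 1, 1, 2, 2])
ytest = np.array([1, 0, 0, 1, 2, 1])
print(getAccuracy(pre, ytest))       # 4 of 6 correct → 0.666...
print(accuracy_score(ytest, pre))    # identical result
```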
Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50,000 instances in our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:
```python
#Load dataset as pandas data frame
data = read_csv('train.csv')
#Extract attribute names from the data frame
feat = data.keys()
feat_labels = np.array(feat)
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
#Split data into input and output variable with selected features
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))
```
Note the data size here: since our dataset contains about 35,000 training instances with 94 attributes, it is quite large. Let's take a look:
```
Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB
```
As you can see, our dataset has 35,000 rows and 94 columns, which is more than 26 MB of data.
In the next code block, we will set up the random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7. The other hyperparameters will be sklearn's defaults:
```python
#Let's select the test data for model evaluation purpose
Xtest = test[:,0:94]
ytest = test[:,94]
#Create a random forest classifier with the following Parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)
#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
#Let's note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
pre = clf.predict(Xtest)
```

Let's see how much time is required to train the model on the training dataset:

```
Execution time for building the Tree is: 2.913641
```

```python
#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f"%(100*acc))
```
The accuracy of our model is:

```
Accuracy of model before feature selection is 98.82
```
As you can see, we are getting very good accuracy, classifying nearly 99% of the test data into the correct categories. This means we are classifying 14,823 of 15,000 instances correctly.
So now the question is: should we try to improve further? Well, why not? We should certainly seek further improvement if we can; here, we will use feature importance to select features. As you know, in the tree-building process we use impurity measures to select the nodes: the attribute value with the lowest impurity is chosen as a node in the tree. We can use similar criteria for feature selection, paying more attention to features with less impurity, which can be done using the feature_importances_ attribute of the sklearn classifier. Let's find out the importance of each feature:
Once we have trained the model, we will rank all the features by zipping the attribute labels with the classifier's importance scores:

```python
#Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
```

```
('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)
```
As you can see here, each feature has a different importance based on its contribution to the final prediction.
We will use these importance scores to rank our features; in the following part, we will select for model training those features that have an importance greater than 0.01:
```python
#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
```
Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset and then check the size and shape of the new dataset:
```python
#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
#Let's see the size and shape of new dataset
print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)
```

```
Size of Data set before feature selection: 5.60 MB
Shape of the dataset (35000, 20)
```
Can you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26 MB to 5.60 MB, a reduction of about 80% from the original dataset.
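The SelectFromModel workflow above can be reproduced end to end on synthetic data. In this sketch one informative column and four noise columns are generated, and the threshold='mean' variant (another option besides a fixed cutoff like 0.01) keeps only the columns whose importance exceeds the average:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
signal = rng.randn(300)
# One informative column followed by four pure-noise columns
X = np.column_stack([signal, rng.randn(300, 4)])
y = (signal > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sfm = SelectFromModel(clf, threshold='mean')
sfm.fit(X, y)
X_new = sfm.transform(X)
print(sfm.get_support())   # the informative column survives
print(X_new.shape)         # fewer columns than the original 5
```

`get_support()` is also handy on the Otto example to recover which of the 94 attributes made the 0.01 cut.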
In the next code block, we will train a new random forest classifier with the same hyperparameters as before and test it on the testing dataset. Let's see what accuracy we get after modifying the training set:
```python
#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
#Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))
```

```
Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97
```
Can you see it? We got 99.97% accuracy with the modified dataset, which means we are classifying 14,996 instances correctly, while previously we classified only 14,823 instances correctly.
This is a huge improvement gained from the feature selection process; we can summarize all the results in the following table:
| Evaluation criterion | Before feature selection | After feature selection |
|---|---|---|
| Number of features | 94 | 20 |
| Size of dataset | 26.32 MB | 5.60 MB |
| Training time | 2.91 s | 1.71 s |
| Accuracy | 98.82% | 99.97% |
The table above shows the practical benefits of feature selection. You can see that we reduced the number of features significantly, which lowered the model complexity and the dimensionality of the dataset. With the reduced dimensionality we got a shorter training time and, finally, we overcame overfitting and achieved higher accuracy than before.
That covers the four methods for implementing feature selection for machine learning in Python.