In this article, we will introduce different methods for selecting features from a dataset, and discuss the types of feature selection algorithms and their implementation in Python using the Scikit-learn (sklearn) library:
- Univariate feature selection
- Recursive feature elimination (RFE)
- Principal component analysis (PCA)
- Feature selection using feature importance
Univariate Feature Selection
Statistical tests can be used to select the features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class, which can be used with a range of different statistical tests to select a specific number of features.
The following example uses the chi-squared (chi²) statistical test for non-negative features to select the four best features from the Pima Indians diabetes dataset:
```python
#Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
#Import chi2 for performing chi square test
from sklearn.feature_selection import chi2
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#We will select the features using chi square
test = SelectKBest(score_func=chi2, k=4)
#Fit the function for ranking the features by score
fit = test.fit(X, Y)
#Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
#Apply the transformation on to dataset
features = fit.transform(X)
#Summarize selected features
print(features[0:5,:])
```
You can see the score for each attribute and the four selected attributes (those with the highest scores): plas, test, mass and age.

Scores for each feature:

```
[ 111.52  1411.887  17.605  53.108  2175.565  127.669  5.393  181.304]
```
Selected features:

```
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
```
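To see which columns SelectKBest kept, the boolean mask from `get_support()` can be zipped with the attribute names. Here is a minimal, self-contained sketch on made-up non-negative data (the array and names below are illustrative, not the Pima dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative non-negative data: 6 samples, 4 features;
# f2 and f4 track the class label, f1 and f3 do not
X = np.array([[1, 9, 0, 2],
              [2, 8, 1, 2],
              [1, 1, 0, 9],
              [2, 2, 1, 8],
              [1, 9, 0, 1],
              [2, 1, 1, 9]])
y = np.array([0, 0, 1, 1, 0, 1])
names = ['f1', 'f2', 'f3', 'f4']

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
mask = selector.get_support()            # boolean mask over columns
selected = [n for n, keep in zip(names, mask) if keep]
print(selected)                          # → ['f2', 'f4']
```

The same zip over `names` works on the Pima example above to confirm that plas, test, mass and age are the surviving columns.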
Recursive Feature Elimination (RFE)
RFE works by recursively removing attributes and building a model on the attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter much, as long as it is skillful and consistent:
```python
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression as the base estimator
from sklearn.linear_model import LogisticRegression
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
```
After execution, we get:

```
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
```
You can see that RFE selected the top three features: preg, mass and pedi. These are marked True in the support_ array and assigned rank 1 in the ranking_ array.
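The same support_/ranking_ bookkeeping can be reproduced on synthetic data; here is a minimal sketch (the dataset and feature names below are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 8 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=5, random_state=0)
names = ['feat_%d' % i for i in range(8)]

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
kept = [n for n, s in zip(names, rfe.support_) if s]
print(kept)              # the three surviving features
print(rfe.ranking_)      # rank 1 = selected; larger = eliminated earlier
```

With 8 features and 3 kept, the ranks run from 1 (selected) up to 6, one per elimination round.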
Principal Component Analysis (PCA)
PCA uses linear algebra to transform the dataset into a compressed form. It is generally considered a data-reduction technique. One property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the following example, we use PCA and select three principal components:
```python
#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA
#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```
You can see that the transformed dataset (three principal components) bears little resemblance to the source data:
```
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
```
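If you would rather choose the number of components from the explained variance than fix it in advance, scikit-learn's PCA also accepts a float between 0 and 1 as n_components and keeps just enough components to reach that fraction of variance. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 100 samples, 6 features, but almost all variance lives in 2 directions
base = rng.randn(100, 2)
X = np.hstack([base, base @ rng.randn(2, 4) + 0.01 * rng.randn(100, 4)])

pca = PCA(n_components=0.95).fit(X)    # keep >= 95% of the variance
print(pca.n_components_)               # number of components retained
print(pca.explained_variance_ratio_.sum())
```

Note that PCA is scale-sensitive, so in practice the features are usually standardized first; the Pima example above skips that step for simplicity.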
Feature Selection Using Feature Importance
Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier, such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's look at it in detail.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of many decision trees. Each node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure used to select the (locally) best condition is called impurity. For classification it is typically Gini impurity or information gain/entropy, and for regression trees it is variance. Thus, when a tree is trained, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged across trees, and the features ranked according to this measure.
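The mean-decrease-impurity idea is easy to see on a toy example: a forest trained on one informative column and three noise columns assigns nearly all of its importance to the informative one. A minimal sketch (all data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 500
informative = rng.randn(n)                  # this column drives the label
noise = rng.randn(n, 3)                     # unrelated columns
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; the first column should dominate
print(forest.feature_importances_)
```

The importances sum to 1 by construction, so they can be read directly as each feature's share of the total impurity reduction.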
Let's see how to perform feature selection with a random forest classifier and evaluate the classifier's accuracy before and after feature selection. We will use the Otto dataset.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and to evaluate the model using multiclass logarithmic loss (also called cross-entropy).
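Multiclass log loss penalizes confident wrong probabilities heavily; scikit-learn exposes it as `sklearn.metrics.log_loss`. A quick hand-checkable sketch on made-up probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted class probabilities for 3 samples over 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
y_true = [0, 1, 2]
# Mean of -log(probability assigned to the true class)
print(log_loss(y_true, probs))   # ≈ 0.3635
```

Here the loss is -(ln 0.7 + ln 0.8 + ln 0.6) / 3 ≈ 0.3635; a perfect classifier would score 0.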
We will start by importing all the libraries:
```python
#Import the supporting libraries
#Import pandas to load the dataset from csv file
from pandas import read_csv
#Import numpy for array based operations and calculations
import numpy as np
#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
#Import feature selector class select model of sklearn
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)
```
Let's define a method to split the dataset into training and testing data; we will train our model on the training portion and use the testing portion to evaluate the trained model:
```python
#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
```
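For reference, scikit-learn ships an equivalent utility, train_test_split, which handles the shuffling and the split fraction in one call; a minimal sketch on a toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

dataset = np.arange(100).reshape(50, 2)    # toy 50-row dataset
# 70/30 split with reproducible shuffling, like the function above
train, test = train_test_split(dataset, train_size=0.7, random_state=0)
print(train.shape, test.shape)             # (35, 2) (15, 2)
```

The hand-rolled version is kept in this walkthrough to stay close to the original code, but the library utility is the idiomatic choice.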
We also need to add a function to evaluate the accuracy of the model; it takes the predicted and actual outputs as input and computes the percentage accuracy:
```python
#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc
```
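This count-and-divide logic is the same thing sklearn.metrics.accuracy_score computes; a self-contained sketch comparing the two on toy predictions (the function is restated here so the block runs on its own):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def getAccuracy(pre, ytest):
    # Fraction of positions where prediction matches the label
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    return float(count)/len(ytest)

pre   = np.array([1, 0, 1, 1, 2, 2])
ytest = np.array([1, 0, 0, 1, 2, 1])
print(getAccuracy(pre, ytest))       # 4 of 6 correct → 0.666...
print(accuracy_score(ytest, pre))    # identical result
```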
Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50,000 instances in our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:
```python
#Load dataset as pandas data frame
data = read_csv('train.csv')
#Extract attribute names from the data frame
feat = data.keys()
feat_labels = np.array(feat)
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
#Split data into input and output variable with selected features
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))
```
Note the data size here: since our dataset contains about 35,000 training instances with 94 attributes, it is quite large. Let's take a look:
```
Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB
```
As you can see, our dataset has 35,000 rows and 94 columns, which is more than 26 MB of data.
In the next code block, we will set up the random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7. The other hyperparameters will be sklearn's defaults:
```python
#Let's select the test data for model evaluation purpose
Xtest = test[:,0:94]
ytest = test[:,94]
#Create a random forest classifier with the following Parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)
#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
#Let's note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
pre = clf.predict(Xtest)
```

Let's see how much time is required to train the model on the training dataset:

```
Execution time for building the Tree is: 2.913641
```

```python
#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f"%(100*acc))
```
The accuracy of our model is:

```
Accuracy of model before feature selection is 98.82
```
As you can see, we are getting very good accuracy, classifying nearly 99% of the test data into the correct categories. This means we are classifying 14,823 of 15,000 instances correctly.
So now the question is: should we try to improve further? Well, why not? We should certainly seek further improvement if we can; here, we will use feature importance to select features. As you know, in the tree-building process we use impurity measures to select the nodes: the attribute value with the lowest impurity is chosen as a node in the tree. We can use similar criteria for feature selection, paying more attention to features with less impurity, which can be done using the feature_importances_ attribute of the sklearn classifier. Let's find out the importance of each feature:
Once we have trained the model, we will rank all the features by zipping the attribute labels with the classifier's importance scores:

```python
#Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
```

```
('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)
```
As you can see here, each feature has a different importance based on its contribution to the final prediction.
We will use these importance scores to rank our features; in the following part, we will select for model training those features that have an importance greater than 0.01:
```python
#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
```
Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset and then check the size and shape of the new dataset:
```python
#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
#Let's see the size and shape of new dataset
print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)
```

```
Size of Data set before feature selection: 5.60 MB
Shape of the dataset (35000, 20)
```
Can you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26 MB to 5.60 MB, a reduction of about 80% from the original dataset.
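The SelectFromModel workflow above can be reproduced end to end on synthetic data. In this sketch one informative column and four noise columns are generated, and the threshold='mean' variant (another option besides a fixed cutoff like 0.01) keeps only the columns whose importance exceeds the average:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
signal = rng.randn(300)
# One informative column followed by four pure-noise columns
X = np.column_stack([signal, rng.randn(300, 4)])
y = (signal > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sfm = SelectFromModel(clf, threshold='mean')
sfm.fit(X, y)
X_new = sfm.transform(X)
print(sfm.get_support())   # the informative column survives
print(X_new.shape)         # fewer columns than the original 5
```

`get_support()` is also handy on the Otto example to recover which of the 94 attributes made the 0.01 cut.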
In the next code block, we will train a new random forest classifier with the same hyperparameters as before and test it on the testing dataset. Let's see what accuracy we get after modifying the training set:
```python
#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
#Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))
```

```
Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97
```
Can you see it? We got 99.97% accuracy with the modified dataset, which means we are classifying 14,996 instances correctly, while previously we classified only 14,823 instances correctly.
This is a huge improvement gained from the feature selection process; we can summarize all the results in the following table:
| Evaluation criterion | Before feature selection | After feature selection |
|---|---|---|
| Number of features | 94 | 20 |
| Size of dataset | 26.32 MB | 5.60 MB |
| Training time | 2.91 s | 1.71 s |
| Accuracy | 98.82% | 99.97% |
The table above shows the practical benefits of feature selection. You can see that we reduced the number of features significantly, which lowered the model complexity and the dimensionality of the dataset. With the reduced dimensionality we got a shorter training time and, finally, we overcame overfitting and achieved higher accuracy than before.
That covers the four methods for implementing feature selection for machine learning in Python.