[Study Exchange] [Shanghai Campus] [Python Examples, Lecture 20] Handwritten digit recognition: K-Mean...

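In this example we compare how different initialization strategies for the K-means clustering algorithm affect run time and result quality on the handwritten digits dataset. We also use several clustering quality measures to judge how well the cluster labels fit the reference labels. The measures used here are: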

homo (homogeneity score)

compl (completeness score)

v-meas (V measure)

ARI (adjusted Rand index)

AMI (adjusted mutual information)

silhouette (silhouette coefficient)
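
To get a feel for the label-based measures above, here is a minimal sketch on hypothetical toy labels (the `truth`, `renamed`, and `lumped` lists are made up for illustration). It suggests that renaming cluster ids does not change the scores, while lumping everything into one cluster keeps completeness high but destroys homogeneity. The silhouette coefficient is left out because it is computed from the feature data rather than from reference labels.

```python
from sklearn import metrics

# Hypothetical reference labels: three classes, two samples each.
truth = [0, 0, 1, 1, 2, 2]

# Same grouping as `truth`, but with the cluster ids renamed.
renamed = [2, 2, 0, 0, 1, 1]

# Every sample lumped into a single cluster.
lumped = [0, 0, 0, 0, 0, 0]

for title, pred in [("renamed", renamed), ("lumped", lumped)]:
    print("%-8s homo=%.3f compl=%.3f v-meas=%.3f ARI=%.3f AMI=%.3f"
          % (title,
             metrics.homogeneity_score(truth, pred),
             metrics.completeness_score(truth, pred),
             metrics.v_measure_score(truth, pred),
             metrics.adjusted_rand_score(truth, pred),
             metrics.adjusted_mutual_info_score(truth, pred)))
```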

Example walkthrough

First, load the required libraries and import the handwritten digits dataset, digits.

from time import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
   % (n_digits, n_samples, n_features))

print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')
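
The setup code passes the pixel data through `scale` before clustering. As a rough illustration of what that step does, the sketch below standardizes a small hypothetical matrix `X` (not the digits data) column by column. The constant third column is there to show what happens to a feature with zero variance.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical 3x3 matrix; the last column is constant.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])

X_scaled = scale(X)

# Each column is centered; non-constant columns end up with unit
# standard deviation, and the constant column becomes all zeros.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```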

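The bench_k_means function

The function bench_k_means(estimator, name, data) runs k-means clustering under a given initialization strategy and reports the evaluation measures listed above together with the run time. It takes three arguments: estimator is the clustering estimator (here always a KMeans instance, differing only in its initialization), name is a string label for the initialization strategy, and data is the dataset to cluster.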

def bench_k_means(estimator, name, data):
    # Time the fit, then print one table row: init name, run time, inertia,
    # and the six clustering measures. Relies on the module-level `labels`
    # (reference labels) and `sample_size` (silhouette subsample size).
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))
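
The inertia column printed by the function is the sum of squared distances from each sample to its assigned centroid. The sketch below recomputes that quantity by hand on hypothetical toy data (two small, well-separated groups of points) and compares it with the `inertia_` attribute of the fitted estimator. The fixed `random_state=0` is an assumption made only to keep the sketch repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the centroid each sample was assigned to, then sum the
# squared Euclidean distances to those centroids.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Lower inertia means tighter clusters, but it is only comparable between runs on the same data with the same number of clusters.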

Run k-means with each of the three initializations in turn: k-means++, random, and PCA-based.

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
            name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
            name="random", data=data)

# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
            name="PCA-based",
            data=data)
print(82 * '_')
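
The comment in the code above states that seeding from PCA components is deterministic, which is why a single run (n_init=1) is enough. One way to check this is sketched below: fit KMeans twice from the same explicit array of initial centers and compare the resulting inertia. The choice of `svd_solver='full'` is an assumption added here so that the PCA step itself involves no randomness; the original code uses the default solver.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)

# Ten principal axes, used as the ten initial cluster centers.
init_centers = PCA(n_components=10, svd_solver='full').fit(data).components_

# Two independent fits starting from the same explicit centers.
runs = [KMeans(init=init_centers, n_clusters=10, n_init=1).fit(data)
        for _ in range(2)]

print([run.inertia_ for run in runs])
```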

Visualizing the results

Visualize the results on data reduced to two dimensions with PCA.

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
         extent=(xx.min(), xx.max(), yy.min(), yy.max()),
         cmap=plt.cm.Paired,
         aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
         marker='x', s=169, linewidths=3,
         color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
      'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
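
The colored background in the plot comes from labelling every point of a dense grid with its nearest centroid. The sketch below walks through the same meshgrid, ravel, and reshape steps on a deliberately tiny grid, with two hypothetical centroids standing in for kmeans.cluster_centers_ and a hand-written nearest-centroid rule standing in for kmeans.predict.

```python
import numpy as np

# Hypothetical coarse grid: x in {0, 1, 2, 3}, y in {0, 1, 2}.
xx, yy = np.meshgrid(np.arange(0, 4, 1), np.arange(0, 3, 1))

# One row per grid point; the columns are (x, y).
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Two hypothetical centroids.
centroids = np.array([[0.0, 0.0], [3.0, 2.0]])

# Squared distance from every grid point to every centroid, then the
# index of the nearest centroid, reshaped back onto the grid.
sq_dist = ((grid_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
Z = sq_dist.argmin(axis=1).reshape(xx.shape)

print(xx.shape, grid_points.shape)  # (3, 4) (12, 2)
print(Z)
```

Each row of Z corresponds to one y value and each column to one x value, which is why the plotting code passes origin='lower' to imshow.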

---------------------
[Reposted]
Author: Goodsta
Original: https://blog.csdn.net/wong2016/article/details/84587581
