[Study Exchange] [Shanghai Campus] [Python Examples, Lecture 17] Mean Shift Clustering Algorithm

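Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function. Its applications include cluster analysis and image processing.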

The mean shift algorithm

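Mean shift is an iterative method for finding the local maxima (modes) of a density function. Start from an initial estimate $x$ and choose a kernel function $K(x_i - x)$, typically a Gaussian kernel. The kernel determines the weight of each point near $x$, and these neighboring points are used to recompute the mean. The weighted mean of the density around $x$ is then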

$$m(x)=\dfrac{\sum_{x_i\in N(x)}K(x_i-x)\,x_i}{\sum_{x_i\in N(x)}K(x_i-x)}$$

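where $N(x)$ is the neighborhood of $x$, that is, the set of sample points $x_i$ that receive a nonzero kernel weight. The difference

$$m(x)-x$$

is called the mean shift. Now update $x$ to $m(x)$ and repeat the estimate until $m(x)$ converges.
(Figure: schematic of the iterative process.)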
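The update rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used by any library: the names `gaussian_kernel`, `mean_shift_step`, and `find_mode`, the `bandwidth` scale of the Gaussian kernel, the tolerance, and the toy points are all chosen here for the example. Because a Gaussian kernel is never exactly zero, $N(x)$ is taken to be the whole sample.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Gaussian weight for a point at the given distance from x (assumed form)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def mean_shift_step(x, points, bandwidth):
    """Return m(x), the kernel-weighted mean of the sample points around x."""
    weights = [gaussian_kernel(math.dist(p, x), bandwidth) for p in points]
    total = sum(weights)
    return tuple(
        sum(w * p[d] for w, p in zip(weights, points)) / total
        for d in range(len(x))
    )

def find_mode(x, points, bandwidth, tol=1e-6, max_iter=500):
    """Repeat x <- m(x) until the mean shift m(x) - x is shorter than tol."""
    for _ in range(max_iter):
        m = mean_shift_step(x, points, bandwidth)
        if math.dist(m, x) < tol:
            return m
        x = m
    return x

# Toy data: two small groups of points, around (1, 1) and (-1, -1).
points = [(0.9, 1.1), (1.0, 1.0), (1.1, 0.9),
          (-1.0, -1.0), (-0.9, -1.1), (-1.1, -0.9)]
mode = find_mode((0.5, 0.5), points, bandwidth=0.5)
print(mode)
```

Starting from (0.5, 0.5), the estimate is pulled toward the nearer group, so the returned point should lie close to (1, 1).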

Application to clustering

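Mean shift clustering aims to discover "blobs" in a smooth density of samples. It is a centroid-based algorithm: each candidate centroid is shifted repeatedly, and the search stops once the centroids change very little. The number of clusters is the number of distinct points the candidates converge to, so it does not have to be specified in advance, which is a notable difference from k-means. Once all centroids have been determined, each centroid represents one cluster, and every sample point is assigned to the cluster of its nearest centroid.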
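The last two steps, collapsing converged candidates into centroids and labeling each sample by its nearest centroid, might look like the sketch below. The helper names `merge_modes` and `assign_labels`, the `min_separation` threshold, and the sample values are hypothetical, and the greedy merge is a simplification chosen for this example rather than what a particular library does.

```python
import math

def merge_modes(modes, min_separation):
    """Greedily keep one centroid per group of modes closer than min_separation."""
    centroids = []
    for m in modes:
        if all(math.dist(m, c) >= min_separation for c in centroids):
            centroids.append(m)
    return centroids

def assign_labels(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

# Hypothetical converged modes: two near (1, 1) and two near (-1, -1).
modes = [(1.0, 1.0), (1.001, 0.999), (-1.0, -1.0), (-0.999, -1.001)]
centroids = merge_modes(modes, min_separation=0.1)

points = [(0.8, 1.2), (-1.3, -0.7), (1.1, 0.9)]
labels = assign_labels(points, centroids)
print(len(centroids), labels)  # prints: 2 [0, 1, 0]
```

Note that the number of clusters (two here) falls out of the merge step; it is never passed in as a parameter.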

A demo example

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
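# make_blobs is imported from sklearn.datasets; the samples_generator
# module has been removed from recent scikit-learn releases.
from sklearn.datasets import make_blobs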

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

# #############################################################################
# Compute clustering with MeanShift

# The bandwidth can be estimated automatically with estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
number of estimated clusters : 3
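As a quick check of the nearest-centroid rule, the sketch below fits `MeanShift` on similar data and compares `predict` on new points with a hand-computed nearest centroid. The fixed `bandwidth=0.9`, `random_state=0`, the smaller sample size, and the two query points are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Same three blob centers as the demo; fewer samples and a fixed seed (assumed).
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=1000, centers=centers, cluster_std=0.6,
                  random_state=0)

# A fixed bandwidth is used here instead of estimate_bandwidth (assumption).
ms = MeanShift(bandwidth=0.9, bin_seeding=True).fit(X)

# Two hypothetical new points to classify.
new_points = np.array([[1.0, 1.0], [-1.0, -1.0]])
predicted = ms.predict(new_points)

# Nearest centroid computed by hand from the fitted cluster centers.
distances = np.linalg.norm(
    new_points[:, None, :] - ms.cluster_centers_[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

print(predicted, nearest)
```

Both arrays should agree, since `predict` assigns each point to its closest fitted cluster center.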

---------------------
[Reposted] Author: Goodsta
Original: https://blog.csdn.net/wong2016/article/details/84255245
