[学习交流] 【上海校区】中餐馆过程伪代码及python实现

中餐馆两种采样方式：

已知条件概率

算法1：直接从联合分布中采样

N:餐厅的总人数
T：样本总数（采样的次数）
<span class="MathJax" id="MathJax-Element-1-Frame" tabindex="0" data-mathml="α" role="presentation" style="box-sizing: border-box; outline: 0px; display: inline; line-height: normal; text-align: left; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; word-break: break-all; position: relative;">αα：Dirichlet参数

代码1：#算法1：直接从联合分布里采样,根据中餐馆条件概率采样，先得到z1，再根据z1得到z2.。。。。最后得到联合概率样本#N表示中餐馆有10个人，alpha = 3表示dirichlet参数，T=50表示样本数，即迭代多少次#verbose表示详细信息，verbose=FALSE，意思就是设置运行的时候不显示详细信息#给定初始Z[0]=1,其他初始化为0import numpy as npdef Draw_CRP_Direct_Sample(N = 10, alpha = 3, T = 50, VERBOSE = False): Z_table = np.zeros((T,N))#(T,N)用0填充的数组，数值表示每个人坐的桌子的编号 for t in range(T): Z = np.zeros(N,dtype=int)#包含N个人的样本 for i in range(N): if i == 0: Z = 1#初始化Z[0] = 1,即第一个人坐的的类别 else: if VERBOSE: print('初始Z=',Z)#一个样本 unique, counts = np.unique(Z, return_counts=True) #unique返回样本Z中去重之后的主题编号（从小到大排序），counts计数(主题权重）返回每个主题的个数 #对于一维数组或者列表，unique函数去除其中重复的元素，并按元素由小到大返回一个新的无元素重复的元组或者列表 #return_index=True表示返回新列表元素在旧列表中的位置，并以列表形式储存在s中 #return_inverse=True 表示返回旧列表元素在新列表中的位置，并以列表形式储存在p中 #return_counts=True 表示返回新列表元素在旧表中出现的次数，并以列表形式存储在counts中 # remove the zeros unsigned tables if unique[0] == 0: unique = np.delete(unique,0)#删除主题编号为0的元素 if VERBOSE: print("unique,counts,alpha", unique,counts,alpha) # added alpha to the end of the counts (weights) array counts = np.append(counts,alpha) #counts=[counts,alpha] # also the new table index will be the max of table seen so far unique = np.append(unique,max(unique)+1)#添加一个新的主题 print("np.append(counts,alpha)",counts)#将counts和alpha拼接后的新counts print("np.append(unique,max(unique)+1)",unique) if VERBOSE: print("sum(counts)=",sum(counts)) #轮盘赌法 #u是随机产生的数，使得不同数据对应不同的数据概率，并且在整体上保留了“区域概率越大，对应数据越多” u = np.random.uniform()*sum(counts) #np.cumsum累加求和例子：counts=[1,2,3,4,5] np.cumsum(counts) array([ 1, 3, 6, 10, 15], dtype=int32) a_counts = np.cumsum(counts)#累加求和列表，这里是对每个主题包含的个数累加求和，也可以算出每个类别的概率进行累加求和，效果是一样的 if VERBOSE: print("acounts,counts, u, (a_counts > u)",a_counts,counts, u, a_counts > u) # first index where accumuated sum is greater than random variable index = np.argmax(a_counts > u) #返回最大值的索引 print("index：", index) Z = unique[index] if VERBOSE: print("最终Z=",Z) print("\n\n") Z_table[t,:] = Z return Z_table

算法2:用吉布斯采样从联合样本中采样
N:餐厅的总人数
T：样本总数（采样的次数）
<span class="MathJax" id="MathJax-Element-2-Frame" tabindex="0" data-mathml="α" role="presentation" style="box-sizing: border-box; outline: 0px; display: inline; line-height: normal; text-align: left; word-spacing: normal; word-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; word-break: break-all; position: relative;">αα：Dirichlet参数
b:burn-in，采样完丢弃的样本数

代码2：Gibbs samplingimport numpy as np#算法2：用吉布斯采样从联合分布中采样#burn_in = 10代表有前10个样本会去掉def Draw_CRP_Gibbs_Sample(N = 10, alpha = 3, T = 50, burn_in = 10, VERBOSE = False): Z = np.ones(N,dtype=int)#一个样本初始化为1 Z_table = np.zeros((T,N))#所有样本初始化为0 for t in range(T+burn_in): for i in range(N): if VERBOSE: print("初始Z0=",Z) # remove current table assignment删除当前表赋值，处理到第几项，第几项就赋为0，Z2更新的就是Z1中为0的那一项 Z = 0 if VERBOSE: print("Z1=",Z) unique, counts = np.unique(Z, return_counts=True) # remove the zeros in unassigned tables 删除未分配表中的0 if unique[0] == 0: unique = np.delete(unique,0) counts = np.delete(counts,0) if VERBOSE: print("unique,counts,alpha", unique,counts,alpha) # added alpha to the end of the counts (weights) array counts = np.append(counts,alpha) # also the new table index will be the max of table seen so far unique = np.append(unique,max(unique)+1) print("np.append(counts,alpha)",counts) print("np.append(unique,max(unique)+1)",unique) if VERBOSE: print("sum(counts)=",sum(counts)) u = np.random.uniform()*sum(counts) a_counts = np.cumsum(counts) if VERBOSE: print("a_counts,counts, u, a_counts > u",a_counts,counts, u, a_counts > u) # first index where accumuated sum is greater than random variable index = np.argmax(a_counts > u) print("index:", index) Z = unique[index] if VERBOSE: print("Z2=",Z) print("\n") old_table = np.unique(Z)#old_table统计出现的类别，可能不是从1开始 print("old_table=",old_table) new_table = np.array(range(1,len(old_table)+1))#new_table与old_table形状大小一样，目的是将old_table中的类别变为从1开始 print("new_table=",new_table) for k in range(len(old_table)): Z[Z == old_table[k]]=new_table[k]#将old_table中旧类别转化为新类别 if t >= burn_in: #舍弃前burn-in个样本 Z_table[t-burn_in,:] = Z if VERBOSE: print("Z3=",Z) print("\n\n\n") if VERBOSE: print("Z_table=",Z_table) return Z_table

如何验证采样算法的正确性？
通过两方面进行比较：
被占桌子个数K的期望（样本均值与理论均值）
被占桌子个数P(K=k)的概率（样本概率与理论概率）

参考
徐亦达：中国餐馆过程演示

【转载】原文地址：https://blog.csdn.net/sinat_37386947/article/details/81010978

wuqiong · wuqiong

帐号		自动登录	找回密码
密码			加入黑马

[学习交流] 【上海校区】中餐馆过程伪代码及python实现

1 个回复