I have long been fascinated by LDA-style topic models. While studying the gensim library recently I came across a set of fairly complete and meaningful tutorials, which led to this series. The series covers three models: Latent Dirichlet Allocation, the Author-Topic Model, and Dynamic Topic Models.
The pyLDA series covers three models:

- ATM (Author-Topic Model): adds a supervised "author" layer, modelling each author's preference for different topics. Drawbacks: chained topics, intruded words, random topics, and unbalanced topics (see Mimno and co-authors 2011). Functions: author topic preferences, word topic preferences, similar-author recommendation, visualization.
- LDA (Latent Dirichlet Allocation): the basic topic model. Functions: document topic preferences, word topic preferences, topic content display, topic-content matrix.
- DTM (Dynamic Topic Models): adds a time factor so that topics change over time. Functions: time-topic term matrix, topic-time term matrix, document topic preferences, prediction for new documents, document similarity across time and topic attributes.
The examples and data come mainly from gensim's official GitHub, where the Jupyter notebook can be found.
Detailed parameter reference: Author-topic models
1. Theoretical introduction
For the theory, see the companion article: Author Topic Model解析.
2. Data preparation for the Author-Topic Model
2.1 Metadata
Author | Articles
张三 | Article 1
李四 | Article 2, Article 3, Article 4
王五 | Article 5
… | …
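For illustration, the metadata above would translate into an author2doc dictionary roughly like the following; the document indices here are made up for this toy example:
# Hypothetical mapping derived from the metadata table above:
# each author maps to the integer IDs of the documents they wrote.
author2doc = {
    '张三': [0],        # Article 1
    '李四': [1, 2, 3],  # Articles 2, 3, 4
    '王五': [4],        # Article 5
}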
2.2 Materials required by the model
- corpus: the standard gensim bag-of-words corpus, e.g. [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2)]]
- dictionary: the standard gensim dictionary, dictionary = Dictionary(docs); each document in docs is a token list such as ['probabilistic', 'characterization', 'of', 'neural', 'model', 'computation', 'richard', 'david_rumelhart', '-PRON-_grateful', 'helpful_discussion', 'early_version', 'like_thank', 'brown_university'], and all documents go into one list
- id2word: mapping from each word ID to the word, built from the dictionary: id2word = dictionary.id2token, e.g. {0: ' 0', 1: ' American nstitute of Physics 1988 ', 2: ' Fig', 3: ' The', 4: '1 1', 5: '2 2', 6: '2 3', 7: 'CA 91125 ', 8: 'CONCLUSIONS '}
- author2doc: mapping from each author to the IDs of his/her documents, a dict of author name -> list of document IDs, e.g. {'RobertA.Jacobs': [252, 429], 'G.J.Goodhill': [1007], 'MichaelHumphreys': [323]}
- doc2author: mapping from each document to its authors; the inverse of author2doc
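Below is a minimal sketch of how these materials are typically assembled with gensim; the toy documents and author names are made up, and the calls are the same ones used later in the official demo:
from gensim.corpora import Dictionary

# Toy tokenized documents (in practice these come from the spacy preprocessing below).
docs = [['neural', 'network', 'model'],
        ['probabilistic', 'model', 'computation']]

dictionary = Dictionary(docs)                        # word <-> ID mapping
_ = dictionary[0]                                    # "initializes" dictionary.id2token
id2word = dictionary.id2token                        # ID -> token mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words corpus

# author2doc: made-up author names -> integer document IDs.
author2doc = {'AuthorA': [0], 'AuthorB': [0, 1]}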
2.3 Using spacy in this example
The example below comes from the official tutorial, so spacy is used to tokenize and clean the English text. Using it requires downloading an extra model (30MB+), which in my case kept timing out because of network issues.
See the official spacy page: https://spacy.io/usage/models#download-pip
Option 1:
python -m spacy download en  # can be very slow without a proxy
To use it:
import spacy
nlp = spacy.load('en')
Option 2:
1. First download the tar package from the spacy site (via the download link on the models page)
2. Install the downloaded archive with pip:
!pip install /Users/you/en_core_web_md-2.0.0.tar.gz
3. Use it:
import en_core_web_sm  # import the model package you installed (en_core_web_sm / en_core_web_md)
nlp = en_core_web_sm.load()
nlp(u'This is a sentence.')
3. Official demo
3.1 Main functions
3.1.1 The AuthorTopicModel constructor:
class gensim.models.atmodel.AuthorTopicModel(corpus=None, num_topics=100, id2word=None, author2doc=None, doc2author=None, chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, gamma_threshold=0.001, serialized=False, serialization_path=None, minimum_probability=0.01, random_state=None)
Parameters shared with LDA:
corpus, passes, iterations, chunksize, eval_every, alpha/eta, random_state; for details see the earlier post in this series, pyLDA系列︱gensim中的主题模型(Latent Dirichlet Allocation).
Parameters specific to ATM:
id2word: mapping from word IDs to words, generated from the dictionary, id2word = dictionary.id2token
author2doc: mapping from each author to the IDs of his/her documents
doc2author: the inverse of author2doc, mapping each document to its authors (see the sketch below)
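If you only have author2doc, here is a minimal sketch (toy data) of building doc2author with the helper gensim provides, the same one the evaluation code later in this post uses:
from gensim.models import atmodel

# Toy inputs: two bag-of-words documents and an author2doc mapping (made up).
corpus = [[(0, 1), (1, 1)], [(1, 2)]]
author2doc = {'AuthorA': [0], 'AuthorB': [0, 1]}

# doc2author is simply the inverse mapping: document index -> list of authors.
doc2author = atmodel.construct_doc2author(corpus, author2doc)
# -> {0: ['AuthorA', 'AuthorB'], 1: ['AuthorB']}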
Simplest way to train:
model = AuthorTopicModel(corpus, num_topics=10, author2doc=author2doc, id2word=id2word)
Simplest way to train incrementally:
# the update method
update(corpus=None, author2doc=None, doc2author=None, chunksize=None, decay=None, offset=None, passes=None, update_every=None, eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False)
# simplest form
model.update(other_corpus, other_author2doc)
Extension 1: varying the random seed with random_state
%%time
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                             author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                             eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))
By setting random_state we train with several different random seeds and then keep the model with the highest topic coherence.
The metric returned by top_topics is topic coherence; a small selection sketch follows.
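Picking the best of the five models above is then a single line, the same selection step the notebook repeats later in section 3.2.5:
# Keep the (model, coherence) pair with the highest summed topic coherence.
model, tc = max(model_list, key=lambda x: x[1])
print('Topic coherence: %.3e' % tc)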
Extension 2: the serialized and serialization_path parameters
When serialized is enabled, the model's corpus is saved in MmCorpus format to the path given by serialization_path.
%time model_ser = AuthorTopicModel(corpus=corpus, id2word=dictionary.id2token, author2doc=author2doc, serialized=True, serialization_path='/tmp/model_serialization.mm')
# where the serialized corpus is stored
With serialized=True, the corpus is saved automatically after the run. The usual way to serialize a corpus manually is: corpora.MmCorpus.serialize('/tmp/model_serialization.mm', corpus)
To use it again later, deserialize it like this:
new_corpus = corpora.MmCorpus('/tmp/model_serialization.mm')
3.1.2 Helper functions
Six functions for inspecting topics:
model.show_topic(topicid, topn=10): given a topic ID, return the topic's top words with their probabilities
model.show_topics(num_topics=10, num_words=10, log=False, formatted=True): for each topic, a formatted string of its top words
model.get_topic_terms(topicid, topn=10): given a topic ID, return the top word IDs with their probabilities
model.get_topics(): return the topic-word matrix (here 10 topics x 7674 words)
(Not counted) get_document_topics(bow, minimum_probability=None): this exists for LDA but not for ATM; gensim raises 'Method "get_document_topics" is not valid for the author-topic model.'
Fifth: model.print_topic(1, topn=10): print one topic as a formatted string
Sixth: model.print_topics(num_topics=20, num_words=10): print several topics as formatted strings
# first
model.show_topic(0, topn=10)
>>> [('action', 0.013790729946622874),
('control', 0.013754026606322274),
('policy', 0.010037394726575378),
('q', 0.0087439205722043382),
('reinforcement', 0.0087102831394097746),
('optimal', 0.0074764680531377312),
('robot', 0.0057665635437760083),
('controller', 0.0053787501576589725)]
# second
model.show_topics(num_topics=10)
>>> [(0,
'0.014*"action" + 0.014*"control" + 0.010*"policy" + 0.009*"q" + 0.009*"reinforcement" + 0.007*"optimal" + 0.006*"robot" + 0.005*"controller" + 0.005*"dynamic" + 0.005*"environment"'),
(1,
'0.020*"image" + 0.008*"face" + 0.007*"cluster" + 0.006*"signal" + 0.005*"source" + 0.005*"matrix" + 0.005*"filter" + 0.005*"search" + 0.004*"distance" + 0.004*"o_o"')]
# third
model.get_topic_terms(1, topn=10)
>>> [(774, 0.019700538013351386),
(3215, 0.0075965808303036916),
(3094, 0.0067132528809042526),
(514, 0.0063925849599646822),
(2739, 0.0054527647598129206),
(341, 0.004987335769043616),
(752, 0.0046566448210636699),
(1218, 0.0046234352422933724),
(186, 0.0042132891022475458),
(829, 0.0041800479706789939)]
# fourth
model.get_topics()
>>> array([[ 9.57974777e-05, 6.17130780e-07, 6.34938224e-07, ...,
6.17080048e-07, 6.19691132e-07, 6.17090716e-07],
[ 9.81065671e-05, 3.12945042e-05, 2.80837858e-04, ...,
7.86879291e-07, 7.86479617e-07, 7.86592758e-07],
[ 4.57734625e-05, 1.33555568e-05, 2.55108081e-05, ...,
5.31796854e-07, 5.32000122e-07, 5.31934336e-07],
# fifth
model.print_topic(1, topn=10)
>>> '0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'
# sixth
model.print_topics(num_topics=20, num_words=10)
[(0,
'0.008*"gaussian" + 0.007*"mixture" + 0.006*"density" + 0.006*"matrix" + 0.006*"likelihood" + 0.005*"noise" + 0.005*"component" + 0.005*"prior" + 0.005*"estimate" + 0.004*"log"'),
(1,
'0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'),
(2,
'0.011*"visual" + 0.010*"cell" + 0.009*"response" + 0.008*"field" + 0.008*"motion" + 0.007*"stimulus" + 0.007*"direction" + 0.005*"orientation" + 0.005*"eye" + 0.005*"frequency"')]
Two functions for an author's topic preferences:
model['name']
model.get_author_topics('name')
model.id2author.values(): the list of authors in the model
model.id2author.values()  # list of authors
>>> dict_values(['A.A.Handzel', 'A.Afghan', 'A.B.Bonds', 'A.B.Kirillov', 'A.Blake', 'A.C.C.Coolen', 'A.C.Tsoi', 'A.Cichocki', 'A.D.Back', 'A.D.Rexlish', 'A.Dembo', 'A.Drees', 'A.During', 'A.E.Friedman', 'A.F.Murray', 'A.Ferguson', 'A.G.Barto', 'A.G.U.Perera']
model['GeoffreyE.Hinton']
model.get_author_topics('GeoffreyE.Hinton')
>>> [(6, 0.76808063951144978), (7, 0.23181972762044473)]
Both calls return the same thing: the author (GeoffreyE.Hinton) prefers topics 6 and 7, with the probabilities shown.
A word's topic preferences: get_term_topics()
Returns the most likely topics for a given word in the dictionary.
[(model.get_term_topics(i), dictionary.id2token[i]) for i in range(100)]
>>> [([], 'acknowledgements'),
([], 'acknowledgements_-PRON-'),
([], 'acquire'),
([(0, 0.013787687427660612)], 'action'),
([], 'action_potential'),
([], 'active')]
Here the word 'action' is most associated with topic 0 (labelled 'Circuits' later on), with probability 0.013.
3.1.3 Evaluation metrics
Two metrics for evaluating an ATM model: the per-word bound (perwordbound) and topic coherence (top_topics):
# Metric 1: per-word bound
from gensim.models import atmodel
doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)

# Compute the per-word bound.
# Number of words in corpus.
corpus_words = sum(cnt for document in model.corpus for _, cnt in document)

# Compute bound and divide by number of words.
perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \
                           doc2author=model.doc2author) / corpus_words
print(perwordbound)

# Metric 2: topic coherence
%time top_topics = model.top_topics(model.corpus)
tc = sum([t[1] for t in top_topics])
Topic coherence is used for model selection and for assessing the content of the different topics.
top_topics is organized per topic: 10 topics, each a pair of (top words with probabilities, coherence score):
[([(0.0081142522, 'gaussian'), (0.0029860872, 'hidden')],
-0.83264680887371556),
([(0.010487712, 'layer'), (0.0023913214, 'solution')],
-0.96372771081309494)]
...
tc is the sum of the coherence scores over all topics; you can also compute the average:
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)
3.1.4 Recommending similar authors
The official example uses two similarity measures: cosine similarity and Hellinger distance.
Option 1: the built-in, standard cosine similarity
from gensim.similarities import MatrixSimilarity
# Generate a similarity object for the transformed corpus.
index = MatrixSimilarity(model[list(model.id2author.values())])
# Get similarities to some author.
author_name = 'YannLeCun'
sims = index[model[author_name]]
sims
>>> array([ 0.20777275, 0.8723157 , 0. , ..., 0.16174853,
0.07938839, 0. ], dtype=float32)
# sims lists, for 'YannLeCun', the cosine similarity between his topic-preference vector and every other author's
In model[list(model.id2author.values())], model.id2author gives the list of author names, and indexing the model with that list returns each author's topic-preference vector. These vectors are not of fixed length: topics an author has no association with are simply absent.
Topic-preference vectors look like:
[(4, 0.88636616132828339), (8, 0.083545096138703312)],
[(4, 0.27129746825443646), (8, 0.71594003971848896)],
[(0, 0.07188868711639794), (1, 0.069390116586544176),
(3, 0.035190795872695843), (4, 0.011718365474455844),
(5, 0.058831820905365088), (6, 0.68542799691757561),
(9, 0.041390087371285786)]
For example, [(4, 0.88636616132828339), (8, 0.083545096138703312)] means the author mainly prefers topics 4 and 8. A sketch for turning these sparse vectors into fixed-length ones follows.
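If a fixed-length vector is needed (for example to build your own similarity measure), here is a minimal sketch using matutils.sparse2full, the same helper the Hellinger-based code below relies on; it assumes the trained model and author names used above:
from gensim import matutils

# Sparse author-topic vector: [(topic_id, probability), ...], variable length.
vec = model.get_author_topics('YannLeCun')

# Dense numpy vector of length num_topics; missing topics become (near) zero.
dense_vec = matutils.sparse2full(vec, model.num_topics)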
Option 2: Hellinger distance
The official helper functions:
from gensim import matutils
import pandas as pd

# Make a list of all the author-topic distributions (author-topic preference vectors).
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    '''
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = model.id2author[elem[0]]
        sim = elem[1]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))

    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
You can then call get_table('YannLeCun', top_n=10, smallest_author=3) to get the authors most similar to 'YannLeCun' under the Hellinger-based similarity, keeping only authors with at least 3 documents.
3.2 Official example: exploring author-topic distributions and preferences in NIPS papers with the ATM model
3.2.1 Downloading the data
!wget -O - 'http://www.cs.nyu.edu/~roweis/data/nips12raw_str602.tgz' > /tmp/nips12raw_str602.tgz
>>> 100%[===================>] 12.26M 25.1KB/s in 11m 34s
3.2.2 Preparing the data and building the author-document mapping
import tarfile
import os, re

# Extract the data.
filename = '/tmp/nips12raw_str602.tgz'
tar = tarfile.open(filename, 'r:gz')
for item in tar:
    tar.extract(item, path='/tmp')

# Folder containing all NIPS papers.
data_dir = '/tmp/nipstxt/'  # Set this path to the data on your machine.

# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
# ['nips00', 'nips01', ..., 'nips12']

# Get all document texts and their corresponding IDs.
# doc_ids: list of strings '<NIPS volume>_<paper number>', e.g. '07_713';
# used later to build author2doc.
docs = []
doc_ids = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)  # List of filenames.
    for filen in files:
        # Get document ID.
        (idx1, idx2) = re.search('[0-9]+', filen).span()  # Matches the indexes of the start and end of the ID.
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))

        # Read document text.
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:
            txt = fid.read()

        # Replace any whitespace (newline, tabs, etc.) by a single space.
        txt = re.sub('\s', ' ', txt)
        docs.append(txt)

# Build author2doc, the author-document mapping.
filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs]  # Using the years defined in previous cell.

# Get all author names and their corresponding document IDs.
author2doc = dict()
i = 0
for yr in yrs:
    # The files "a00.txt" and so on contain the author-document mappings.
    filename = data_dir + 'idx/a' + yr + '.txt'
    for line in open(filename, errors='ignore', encoding='utf-8'):
        # Each line corresponds to one author.
        contents = re.split(',', line)
        author_name = (contents[1] + contents[0]).strip()
        # Remove any whitespace to reduce redundant author names.
        author_name = re.sub('\s', '', author_name)

        # Get document IDs for author.
        ids = [c.strip() for c in contents[2:]]
        if not author2doc.get(author_name):
            # This is a new author.
            author2doc[author_name] = []
            i += 1

        # Add document IDs to author.
        author2doc[author_name].extend([yr + '_' + id for id in ids])

# Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.
# Mapping from ID of document in the NIPS dataset, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]
3.2.3 Cleaning and tokenizing the data
# Data processing
# Tokenize, then append 2-grams; each document ends up like:
# word1, word2, word3, ..., word1_word2, word2_word3, ...
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp(u'This is a sentence.')

%%time
processed_docs = []
for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
    # Process document using Spacy NLP pipeline.
    ents = doc.ents  # Named entities.

    # Keep only words (no numbers, no punctuation).
    # Lemmatize tokens, remove punctuation and remove stopwords.
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    # Remove common words from a stopword list.
    #doc = [token for token in doc if token not in STOPWORDS]

    # Add named entities, but only if they are a compound of more than one word.
    doc.extend([str(entity) for entity in ents if len(entity) > 1])
    processed_docs.append(doc)

# From here on, work with the tokenized documents.
docs = processed_docs
del processed_docs

# Compute bigrams, e.g. 'neural_networks'.
from gensim.models import Phrases
# Add bigrams and trigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
3.2.4 Building the model corpus
# Create a dictionary representation of the documents, and filter out frequent and rare words.
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
_ = dictionary[0] # This sort of "initializes" dictionary.id2token.
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
3.2.5 Training and selecting the model
# Train the model
from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                               author2doc=author2doc, chunksize=2000, passes=1, eval_every=0, \
                               iterations=1, random_state=1)

# Model selection
# If you think the model has not converged, you can train incrementally: `model.update(corpus, author2doc)`.
# By varying random_state we train with different seeds and keep the model with the highest topic coherence.
%%time
model_list = []
for i in range(5):
    model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                             author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                             eval_every=0, iterations=1, random_state=i)
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    model_list.append((model, tc))

# Model evaluation: topic coherence
model, tc = max(model_list, key=lambda x: x[1])
print('Topic coherence: %.3e' % tc)
It is also worth knowing how to save and load the model:
# Save and reload the model
model.save('/mnt/gensim/lda/model.atmodel')
model = AuthorTopicModel.load('/mnt/gensim/lda/model.atmodel')
3.2.6 Evaluation metrics: per-word bound and topic coherence
# Per-word bound
from gensim.models import atmodel
doc2author = atmodel.construct_doc2author(model.corpus, model.author2doc)

# Compute the per-word bound.
# Number of words in corpus.
corpus_words = sum(cnt for document in model.corpus for _, cnt in document)

# Compute bound and divide by number of words.
perwordbound = model.bound(model.corpus, author2doc=model.author2doc, \
                           doc2author=model.doc2author) / corpus_words
print(perwordbound)

# Topic coherence
%time top_topics = model.top_topics(model.corpus)
3.3 Using and visualizing the model
3.3.1 Basic usage
# Show topic content
model.show_topic(0)

# Manually label each topic
topic_labels = ['Circuits', 'Neuroscience', 'Numerical optimization', 'Object recognition', \
                'Math/general', 'Robotics', 'Character recognition', \
                'Reinforcement learning', 'Speech recognition', 'Bayesian modelling']

# Show the words under each topic
for topic in model.show_topics(num_topics=10):
    print('Label: ' + topic_labels[topic[0]])
    words = ''
    for word, prob in model.show_topic(topic[0]):
        words += word + ' '
    print('Words: ' + words)
    print()
'''
Label: Circuits
Words: action control policy q reinforcement optimal robot controller dynamic environment

Label: Neuroscience
Words: image face cluster signal source matrix filter search distance o_o
'''

# Show which topics an author prefers
model['YannLeCun']

from pprint import pprint
def show_author(name):
    print('\n%s' % name)
    print('Docs:', model.author2doc[name])
    print('Topics:')
    pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])

# Which documents an author wrote and which topics they prefer
show_author('YannLeCun')
The output is:
YannLeCun
Docs: [183, 355, 288, 515, 529, 544, 679, 662, 711, 720, 1512]
Topics:
[('Reinforcement learning', 0.99977836093843353)]
So the well-known Yann LeCun leans mostly toward the 'Reinforcement learning' topic here.
3.3.2 Author-topic map with t-SNE
The parameter smallest_author can be used to drop long-tail authors (those with very few documents).
%%time
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0 # Ignore authors with documents less than this.
authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(model.state.gamma[authors, :]) # Result stored in tsne.embedding_
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
#!pip3 install bokeh
output_notebook()
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource
x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [model.id2author[a] for a in authors]
# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]
source = ColumnDataSource(
    data=dict(
        x=x,
        y=y,
        author_names=author_names,
        author_sizes=author_sizes,
        radii=radii,
    )
)

# Add author names and sizes to mouse-over info.
hover = HoverTool(
    tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
    ]
)
p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)
Each circle in the plot is a single author, and its size reflects the number of documents attributed to that author. Hovering over a circle shows the author's name and size. Overlapping circles indicate authors whose topic content overlaps.
3.3.3 Exploring author similarity
Different authors have different topic preferences, and, just as in the t-SNE plot above, some authors share interests, which makes similarity-based recommendation possible.
The distance used here is not the cosine distance but the Hellinger distance:
$$H(p,q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$
where $p$ and $q$ are both topic distributions for two different authors. We define the similarity as
$$S(p,q) = \frac{1}{1 + H(p,q)}.$$
The main functions are (the same helpers introduced in section 3.1.4):
# Make a function that returns similarities based on the Hellinger distance.
from gensim import matutils
import pandas as pd

# Make a list of all the author-topic distributions.
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    '''
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = model.id2author[elem[0]]
        sim = elem[1]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))

    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
Let's see the result:
get_table('YannLeCun', smallest_author=3)
---------------------
[Reposted]
Author: 悟乙己
Original: https://blog.csdn.net/sinat_26917383/article/details/79339727