黑马程序员 (itheima) Technical Exchange Community

Title: [Shanghai Campus] Crawling all articles of a WeChat official account in 50 lines of code

Author: 梦缠绕的时候 Time: 2019-8-8 09:56
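# Today's goal

**Crawl all articles of a WeChat official account in 50 lines of code**

Today's target is WeChat official accounts. There are two common ways to crawl them. The first goes through Sogou search, and its drawback is that only the ten most recently pushed articles can be retrieved. This post covers the second: capturing the traffic of the WeChat PC client to get the account's articles, which is more convenient than the other approaches.

Analysis:

Every time the article list is pulled down to load more, the client requests the mp.weixin.qq.com/mp/xxx endpoint (the official account platform does not allow homepage links to be posted, so xxx stands for profile_ext). After repeated testing, the following parameters turned out to be needed:

- `__biz`: the unique id that ties the user to the official account
- `uin`: the user's private id
- `key`: the request key, which expires after a while
- `offset`: the offset
- `count`: the number of items per request

*Code implementation*

```
import json
import time

import requests
from pymongo import MongoClient

# Article-list endpoint (the "xxx" in the analysis above is profile_ext)
url = 'http://mp.weixin.qq.com/mp/profile_ext'

# Mongo configuration
conn = MongoClient('127.0.0.1', 27017)
db = conn.wx  # use the wx database; created automatically if missing
mongo_wx = db.article  # use the article collection; created automatically if missing


def get_wx_article(biz, uin, key, index=0, count=10):
    """Fetch one page of articles, store it, and return True if more pages remain."""
    offset = index * count  # page 0 starts at the most recent push
    params = {
        '__biz': biz,
        'uin': uin,
        'key': key,
        'offset': offset,
        'count': count,
        'action': 'getmsg',
        'f': 'json'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    resp_json = response.json()
    if resp_json.get('errmsg') != 'ok':
        print('Failed to fetch articles...')
        return False

    # Whether more pages remain; this decides the return value
    can_msg_continue = resp_json['can_msg_continue']
    # Number of messages on the current page
    msg_count = resp_json['msg_count']
    # general_msg_list is itself a JSON string, so decode it again
    general_msg_list = json.loads(resp_json['general_msg_list'])
    msg_list = general_msg_list.get('list', [])
    print(msg_count, msg_list, "**************")
    for msg in msg_list:
        app_msg_ext_info = msg.get('app_msg_ext_info')
        if not app_msg_ext_info:
            # Skip messages that carry no article payload
            continue
        # Title
        title = app_msg_ext_info['title']
        # Article URL
        content_url = app_msg_ext_info['content_url']
        # Cover image
        cover = app_msg_ext_info['cover']
        # Publish time (Unix timestamp converted to local time)
        timestamp = msg['comm_msg_info']['datetime']
        publish_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
        mongo_wx.insert_one({
            'title': title,
            'content_url': content_url,
            'cover': cover,
            'datetime': publish_time
        })
    return can_msg_continue == 1


if __name__ == '__main__':
    biz = 'Mzg4MTA2Nzg0NA=='
    uin = 'NDIyMTI5NDM1'
    key = '20a680e825f03f1e7f38f326772e54e7dc0fd02ffba17e92730ba3f0a0329c5ed310b0bd55b3c0b1f122e5896c6261df2eaea4036ab5a5d32dbdbcb0a638f5f3605cf1821decf486bb6eb4d92d36c620'
    index = 0
    while True:
        print(f'Fetching page {index + 1} of the official account.')
        flag = get_wx_article(biz, uin, key, index=index)
        if not flag:
            print('All articles of the official account have been fetched, exiting.')
            break
        # Pause 8 seconds to avoid being blocked
        time.sleep(8)
        index += 1
        print(f'..........Preparing to fetch page {index + 1} of the official account.')
```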
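The response handling and paging in the script above can be tried out without network access, a valid key, or MongoDB. The following is only a minimal sketch under stated assumptions: each page is assumed to advance `offset` by `count`, and `parse_page`, `crawl_all`, `make_page`, and the `fetch_page` callable are hypothetical names introduced here for illustration. The sample payload is made up and contains only the fields the script reads.

```python
import json
import time


def parse_page(resp_json):
    """Return (records, has_more) for one decoded response dict."""
    # 'general_msg_list' is a JSON string nested inside the JSON response,
    # so it needs a second decode.
    messages = json.loads(resp_json['general_msg_list']).get('list', [])
    records = []
    for msg in messages:
        info = msg.get('app_msg_ext_info')
        if not info:
            # Assumption: some messages carry no article payload; skip them.
            continue
        published = time.strftime(
            '%Y-%m-%d %H:%M:%S',
            time.localtime(msg['comm_msg_info']['datetime']))
        records.append({
            'title': info['title'],
            'content_url': info['content_url'],
            'cover': info['cover'],
            'datetime': published,
        })
    return records, resp_json.get('can_msg_continue') == 1


def crawl_all(fetch_page, count=10, delay=0):
    """Yield records page by page until the response reports no more data.

    fetch_page(offset, count) is a hypothetical callable that returns the
    decoded JSON dict for one page.
    """
    offset = 0
    while True:
        records, has_more = parse_page(fetch_page(offset, count))
        yield from records
        if not has_more:
            break
        offset += count
        time.sleep(delay)


def make_page(titles, has_more):
    """Build a made-up response dict with the fields used above."""
    items = [{
        'comm_msg_info': {'datetime': 1565229360},
        'app_msg_ext_info': {
            'title': t,
            'content_url': 'http://example.com/' + t,
            'cover': '',
        },
    } for t in titles]
    return {
        'errmsg': 'ok',
        'can_msg_continue': int(has_more),
        'general_msg_list': json.dumps({'list': items}),
    }


# Two fake pages keyed by offset stand in for the real endpoint.
pages = {0: make_page(['a', 'b'], True), 2: make_page(['c'], False)}
titles = [r['title'] for r in crawl_all(lambda offset, count: pages[offset], count=2)]
print(titles)  # ['a', 'b', 'c']
```

In real use, `fetch_page` would wrap the `requests.get` call from the script, and `delay` would be set to a few seconds between pages.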

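Author: 梦缠绕的时候 Time: 2019-8-8 09:57
If you have any questions, feel free to leave a comment.

Author: 梦缠绕的时候 Time: 2019-8-8 09:57
Or add the senior student's WeChat:
DKA-2018

Welcome to the 黑马程序员 (itheima) Technical Exchange Community (http://bbs.itheima.com/)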
