[Learning Exchange] Learn Scrapy ("西瓜皮") in Three Minutes

A simple Scrapy walkthrough

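Building a Scrapy spider takes 4 steps:

1. Create a project (scrapy startproject xxx): create a new spider project.

2. Define the target (write items.py): specify the data you want to scrape.

3. Write the spider (spiders/xxspider.py): write the spider that crawls the pages.

4. Store the content (pipelines.py): design a pipeline that stores the scraped content.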

1. Start PowerShell to create your first Scrapy project.

2. On the desktop, create a project named mySpider (scrapy startproject mySpider).

3. Following the prompts, decide which site to crawl and give the spider a name and a domain. For example, to crawl the page below and export the teacher directory:

http://www.itcast.cn/channel/teacher.shtml

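Generating the spider automatically saves you from writing the boilerplate by hand, but you can also write itcast.py yourself.

Once generated, itcast.py appears under \Desktop\mySpider\mySpider\spiders.

4. In the \Desktop\mySpider\mySpider folder, open items.py, add the code below, then save and close the file.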

import scrapy

class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

5. Open itcast.py. Because the file was generated automatically, the class attributes and methods are already in place:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ["http://www.itcast.cn"]

Change start_urls to set the URL where crawling starts:

start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

Then change the parse() method:

def parse(self, response):
    filename = response.url.split('/')[-1]
    # filename = "teacher.html"
    with open(filename, 'wb') as f:
        f.write(response.body)
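To see which file name the parse() method above ends up with, here is a small standalone sketch of the same expression applied to the start URL from this post. It needs only the standard library. The fallback name "index.html" is my own assumption, not part of the original code.

```python
# The same expression parse() uses, applied to the start URL from this post.
url = "http://www.itcast.cn/channel/teacher.shtml"

# Splitting on "/" and taking the last piece yields the final path segment.
filename = url.split("/")[-1]
print(filename)

# Caveat: a URL that ends with "/" has an empty last segment, so a fallback
# name (an assumption for this sketch) avoids trying to open a file named "".
root_url = "http://www.itcast.cn/"
fallback = root_url.split("/")[-1] or "index.html"
print(fallback)
```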

6. Run the modified code from PowerShell (scrapy crawl itcast). Mind the path: the command must be run from the root directory of the project you created.

After it runs, a file named teacher.shtml is generated in the mySpider root directory.

This file holds the complete HTML of that page. You can open it in Notepad++ to view the content.

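7. Modify itcast.py again, changing only the parse() method, as a test. This time the extracted information is printed in PowerShell.

sel is the variable that holds the Selector created from the response.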

def parse(self, response):
    sel = scrapy.selector.Selector(response)
    sites = sel.xpath("//div[@class='li_txt']")
    for each in sites:
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()
        print(name, title, info)

[Screenshot: the displayed result]

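Steps 6 and 7 show that the page can be crawled successfully; the next task is to organize the content we need from it. (You can also open the shell that Scrapy provides and try out Scrapy commands there.)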

8. To scrape the required content from the page, write the following code in itcast.py. sel is again the Selector variable.

from mySpider.items import ItcastItem

def parse(self, response):
    sel = scrapy.selector.Selector(response)
    sites = sel.xpath("//div[@class='li_txt']")
    items = []
    for each in sites:
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").extract()
        item['title'] = each.xpath("h4/text()").extract()
        item['info'] = each.xpath("p/text()").extract()
        items.append(item)
    return items

Alternatively, call response.xpath() directly:

def parse(self, response):
    # sel = scrapy.selector.Selector(response)
    sites = response.xpath("//div[@class='li_txt']")
    items = []
    for each in sites:
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").extract()
        item['title'] = each.xpath("h4/text()").extract()
        item['info'] = each.xpath("p/text()").extract()
        items.append(item)
    return items

Export the results to a file from PowerShell (for example, scrapy crawl itcast -o items.json).

A new file, items.json, appears in the project root directory; you can open it with Notepad++.

Or save the results as an XML file.

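Appendix 1: Ways to save the output

The simplest way to save scraped data in Scrapy is the -o option, which writes the output to a file in the given format. The four main formats and their commands are listed below.

JSON format: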

scrapy crawl itcast -o teachers.json

JSON Lines format, Unicode-encoded by default:

scrapy crawl itcast -o teachers.jsonl

CSV (comma-separated values), which can be opened in Excel:

scrapy crawl itcast -o teachers.csv

XML format:

scrapy crawl itcast -o teachers.xml
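To get a feel for how these formats differ, here is a small sketch that writes the same made-up items as JSON, JSON Lines and CSV using only Python's standard library. This is not Scrapy's exporter code, just an illustration of the file layouts. The sample items are invented.

```python
import csv
import io
import json

# Hypothetical items shaped like the ones the spider in this post returns.
items = [
    {"name": "张老师", "title": "讲师", "info": "简介"},
    {"name": "Teacher B", "title": "Lecturer", "info": "Bio"},
]

# .json: one JSON array holding every item.
as_json = json.dumps(items, ensure_ascii=False)

# .jsonl: one independent JSON object per line, convenient for large outputs.
as_jsonl = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

# .csv: a header row followed by one row per item.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "title", "info"])
writer.writeheader()
writer.writerows(items)
as_csv = buffer.getvalue()

# With ensure_ascii=True (json's default) non-ASCII text becomes \uXXXX escapes.
escaped = json.dumps(items[0]["name"])

print(as_json, as_jsonl, as_csv, escaped, sep="\n")
```

The last line is relevant if your JSON file looks like a wall of \u sequences. As far as I know, Scrapy's own JSON output also escapes non-ASCII characters unless the FEED_EXPORT_ENCODING setting is set to utf-8, so check settings.py if the exported file is hard to read.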

Appendix 2: Using the shell that Scrapy provides

In the project root directory, run scrapy shell followed by the URL.

1. response.body shows the page source.

2. response.headers returns the response headers.

3. response.xpath('//title') selects all title elements and returns a list.

4. response.xpath('//title').extract() converts the returned list to strings.

5. To drop the <title> tags and keep only the text, add text() to the XPath.

6. Use the browser's Inspect Element to find the relevant ul and li tags. The shell initializes a sel variable (in recent Scrapy versions, as far as I know, you call response.xpath() directly instead).

The two calls look equivalent to me; both return every ul and li node.

7. Get the description text inside the ul and li elements (the site descriptions).

8. Convert that description text to strings.

9. Get the text inside the a tags.

10. Get all the link URLs from the a tags.

11. Get all the titles.

12. Exit the shell.

Appendix 3: XPath syntax

http://www.runoob.com/xpath/xpath-syntax.html

Here are some example XPath expressions and what they mean:

· /html/head/title: selects the <title> element inside the <head> of the HTML document

· /html/head/title/text(): selects the text of that <title> element

· //td: selects all <td> elements

· //div[@class="mine"]: selects all div elements that have the attribute class="mine"

This board kept misbehaving: many of the images I pasted from Word would not display, so I ended up putting the Word file in the attachments.
