from scrapy.item import Item
from scrapy.item import Field

class GoodsItem(Item):
    name = Field()
    price = Field()
    date = Field()
    types = Field()
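Field() is essentially an extension of the built-in dict type. As the code above shows, one Item holds the information for one product, and a single page may contain one or more products. Every Item is populated in the Spider and then handed by the engine to the Item Pipeline. Concrete implementations will appear in the examples of later posts; this post only aims to briefly record Scrapy's basic concepts and usage.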
Install
with pip
pip install scrapy
or conda
conda install -c conda-forge scrapy
The basic commands are as follows:
D:\WorkSpace>scrapy --help
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
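To work in a virtual environment, install virtualenv first.
When run from inside a project directory (here the project is named huaban), scrapy -h lists additional commands: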
$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
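As the help output of scrapy crawl shows, the command accepts many optional arguments but only one required argument: spider, the name of the spider to run, which corresponds to the name attribute defined on each spider class.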