Spiders: the spiders define the crawling logic and the parsing rules for page content; they are mainly responsible for parsing responses and producing scraped items and new requests;
Engine: the engine handles the data flow of the whole system and triggers events; it is the core of the framework;
Scheduler: the scheduler accepts requests sent by the engine, puts them into a queue, and hands them back when the engine asks for the next request;
Downloader: the downloader fetches page content and returns the downloaded responses to the spiders;
ItemPipeline: the item pipelines process the data the spiders extract from pages, mainly cleaning, validating, and storing it in a database;
Downloader Middlewares: downloader middlewares are processing hooks that sit between the engine and the downloader, acting on requests before they are sent and on responses before they reach the spiders (a minimal sketch follows this list);
Spider Middlewares: spider middlewares sit between the engine and the spiders; they process the responses fed into the spiders and the items and new requests they produce, and are implemented in middlewares.py;
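As a concrete illustration of the middleware hooks, here is a minimal downloader middleware sketch; the class name and the User-Agent string are made up for this example, and the class would still need to be registered under DOWNLOADER_MIDDLEWARES in settings.py to take effect.

# middlewares.py -- hypothetical downloader middleware sketch
class FixedUserAgentMiddleware(object):
    # called for every request before it is handed to the downloader
    def process_request(self, request, spider):
        # set a User-Agent header if none is present; a real middleware might rotate several
        request.headers.setdefault('User-Agent', 'Mozilla/5.0 (example)')
        return None  # returning None lets the request continue through the chain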
Python version
python --version
Python 3.7.3
Install Scrapy
pip3 install Scrapy
scrapy
Scrapy 1.8.0 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Create a project
scrapy startproject project_name  # project name
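In this article the project is named MooseSpider (the name the imports below assume). Running startproject generates the standard Scrapy layout, roughly as follows (the exact file list may vary slightly with the Scrapy version):

scrapy startproject MooseSpider

MooseSpider/
    scrapy.cfg            # deploy configuration
    MooseSpider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py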
Create a spider
scrapy genspider xxx (spider name) xxx.com (domain to crawl)
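For example, the spider written below could first be generated as a skeleton like this (optional; the file is just as easily created by hand as in the next section):

cd MooseSpider
scrapy genspider ShiCiMingJuSpider www.shicimingju.com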
Crawling shicimingju.com (诗词名句)
Create shicimingju.py in the spiders folder
# coding:utf-8
import logging
import re

import scrapy
from scrapy import Selector

from MooseSpider.items import ShiCiItem


class ShiCiMingJuSpider(scrapy.Spider):
    name = "ShiCiMingJuSpider"                 # spider name
    allowed_domains = ['www.shicimingju.com']  # domains allowed to be crawled
    start_urls = ['https://www.shicimingju.com/chaxun/zuozhe/1.html']  # first page to crawl

    def parse(self, response):
        shici_item = ShiCiItem()        # item object to fill
        shici_item["type"] = 'poetry'   # item type
        # parse the author with XPath
        poetry_author = response.xpath(
            "//div[@id='main_right']/div[@class='card about_zuozhe']/div[2]/div/h4/a/text()").extract_first()
        shici_item['poetry_author'] = poetry_author
        # parse the poem blocks
        shici_content_list = response.xpath(
            "//div[@id='main_left']/div[@class='card shici_card']/div").extract()
        for content_item in shici_content_list:
            # poem number
            list_num_info = Selector(text=content_item).xpath(
                "//div/div[@class='list_num_info']/text()").extract_first()
            if list_num_info is not None:
                shici_item["poetry_num"] = list_num_info.strip()
            # poem title
            poetry_name = Selector(text=content_item).xpath(
                "//div/div[@class='shici_list_main']/h3/a/text()").extract_first()
            if poetry_name is not None:
                shici_item["poetry_name"] = poetry_name
            # poem body
            shici_content_more = Selector(text=content_item).xpath(
                "//div/div[@class='shici_list_main']/div").extract()
            for item in shici_content_more:
                html = Selector(text=item).xpath("//div[@class='shici_content']").extract_first()
                if html is not None:
                    # strip HTML tags and page UI text, then drop newlines and spaces
                    pattern = re.compile(r'<[^>]+>', re.S)
                    poetry_content = (pattern.sub("", html)
                                      .replace("展开全文", "")
                                      .replace("收起", "")
                                      .replace("\n", "")
                                      .replace(" ", ""))
                    shici_item['poetry_content'] = poetry_content
            # log the item
            logging.info(shici_item)
            # hand the item over to the pipeline
            yield shici_item
        # link to the next page
        next_url = response.xpath(
            "//div[@id='main_left']/div[@id='list_nav']/div[@id='list_nav_part']/a[text()='下一页']/@href").extract_first()
        # crawl the next page
        if next_url is not None:
            yield scrapy.Request("https://www.shicimingju.com{}".format(next_url),
                                 callback=self.parse, dont_filter=True)
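Before wiring everything up, the XPath expressions can be verified interactively with the scrapy shell listed in the command overview above; this step is optional and the URL is simply the spider's start page:

scrapy shell "https://www.shicimingju.com/chaxun/zuozhe/1.html"
>>> response.xpath("//div[@id='main_right']/div[@class='card about_zuozhe']/div[2]/div/h4/a/text()").extract_first()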
Add ShiCiItem to items.py
import scrapy


class ShiCiItem(scrapy.Item):
    type = scrapy.Field()
    # author
    poetry_author = scrapy.Field()
    # poem title
    poetry_name = scrapy.Field()
    # poem content
    poetry_content = scrapy.Field()
    # poem number
    poetry_num = scrapy.Field()
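Items behave like dictionaries but only accept the declared fields, which catches field-name typos early; a quick sketch, not part of the project code:

item = ShiCiItem()
item['poetry_name'] = '静夜思'
dict(item)              # {'poetry_name': '静夜思'}
item['author'] = 'x'    # raises KeyError: undeclared field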
Save the scraped data with MongoDB
pip3 install pymongo
settings.py
# mongodb config
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "shicimingju"
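To rule out connection problems before running the spider, a quick standalone check with pymongo can help; this is a throwaway script, assuming MongoDB is running locally on the default port:

from pymongo import MongoClient

client = MongoClient("127.0.0.1:27017")
# ping the server; raises an exception if MongoDB is unreachable
client.admin.command("ping")
print(client.list_database_names())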
pipelines.py
# -*- coding: utf-8 -*-
from pymongo import MongoClient


class MooseSpiderPipeline(object):
    # MongoDB collection name
    collection_name = 'shici_item'

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB settings from settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DATABASE', 'shicimingju'))

    def __init__(self, mongo_uri, mongo_db):
        self.client = MongoClient(mongo_uri)
        self.db = self.client[mongo_db]

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.process_video_item(item, spider)
        return item

    def process_video_item(self, item, spider):
        # store the item as a document in MongoDB
        self.db[self.collection_name].insert_one(dict(item))
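After a crawl, the stored documents can be checked directly with pymongo; a minimal sketch, using the database and collection names configured above:

from pymongo import MongoClient

client = MongoClient("127.0.0.1:27017")
collection = client["shicimingju"]["shici_item"]
print(collection.count_documents({}))   # number of stored poems
print(collection.find_one())            # one sample document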
Enable MooseSpiderPipeline in settings.py
ITEM_PIPELINES = {
    'MooseSpider.pipelines.MooseSpiderPipeline': 300,
}
Run via scrapy cmdline
Create main.py in the same directory as scrapy.cfg
# coding:utf-8
from scrapy import cmdline
cmdline.execute('scrapy crawl ShiCiMingJuSpider'.split())
Run python main.py
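Alternatively, the spider can be started from the project directory without main.py, and Scrapy's feed export can dump the items to a file in addition to MongoDB (the output file name here is arbitrary):

scrapy crawl ShiCiMingJuSpider
scrapy crawl ShiCiMingJuSpider -o shici.json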