What the parameters in a Scrapy project's settings.py file do

Project name

BOT_NAME = 'qidianwang'

Spider module paths

SPIDER_MODULES = ['qidianwang.spiders']
NEWSPIDER_MODULE = 'qidianwang.spiders'

Crawl responsibly by identifying yourself (and your website) on the user-agent

Sets the User-Agent string sent with requests (often replaced with a browser string so requests look like they come from a real browser)

USER_AGENT = 'qidianwang (+http://www.yourdomain.com)'

Obey robots.txt rules

Whether to obey the robots.txt protocol (default: True, i.e. obey it)

ROBOTSTXT_OBEY = False

Configure maximum concurrent requests performed by Scrapy (default: 16)

Maximum number of concurrent requests Scrapy will perform (default: 16)

CONCURRENT_REQUESTS = 32

Configure a delay for requests for the same website (default: 0)

See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

See also autothrottle settings and docs

Download delay in seconds (default: 0)

DOWNLOAD_DELAY = 0

The download delay setting will honor only one of:

Maximum concurrent requests allowed per domain (default: 8)

CONCURRENT_REQUESTS_PER_DOMAIN = 16

Maximum concurrent requests allowed per IP (default: 0, i.e. disabled)

1. When non-zero, CONCURRENT_REQUESTS_PER_IP takes priority over CONCURRENT_REQUESTS_PER_DOMAIN.

2. When non-zero, DOWNLOAD_DELAY is enforced per IP rather than per website.

CONCURRENT_REQUESTS_PER_IP = 16
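Putting the concurrency settings together, a per-IP throttling setup might look like the following sketch (the specific values are illustrative, not recommendations):

```python
# Illustrative settings combination. With CONCURRENT_REQUESTS_PER_IP
# non-zero, the per-domain limit is ignored and DOWNLOAD_DELAY is
# enforced per IP instead of per website.
CONCURRENT_REQUESTS = 32          # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_IP = 8    # per-IP cap (overrides the per-domain cap)
DOWNLOAD_DELAY = 0.5              # seconds between requests to the same IP
```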

Disable cookies (enabled by default)

Whether to send cookies with requests (default: True, i.e. cookies are enabled)

COOKIES_ENABLED = False

COOKIES_DEBUG (default: False) controls whether the cookies sent and received are logged

COOKIES_DEBUG = True

Disable Telnet Console (enabled by default)

An extension that lets you inspect the running crawler's state over Telnet; enabled by default (True)

TELNETCONSOLE_ENABLED = False

Override the default request headers:

Setting the default request headers

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}

Enable or disable spider middlewares

See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

Spider middlewares

SPIDER_MIDDLEWARES = {
    'qidianwang.middlewares.QidianwangSpiderMiddleware': 543,
}
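For reference, a spider middleware like the QidianwangSpiderMiddleware enabled above is just a class implementing hooks such as process_spider_output(), which sits between the engine and the spider and can filter or transform what the spider yields. The sketch below is a hypothetical example (the 'title' field name is an assumption), not the project's actual middleware:

```python
# Hypothetical sketch of a spider middleware. process_spider_output()
# receives everything the spider callback yields and decides what to
# pass on to the engine. The 'title' item field is assumed here.
class DropEmptyTitleSpiderMiddleware:
    """Drops dict-like items whose 'title' field is missing or empty."""

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Dict entries are treated as items; anything else
            # (e.g. a new Request) is passed through untouched.
            if isinstance(entry, dict) and not entry.get('title'):
                continue
            yield entry
```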

Enable or disable downloader middlewares

See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

Downloader middlewares; custom downloader middlewares must be activated here. The lower the number, the higher the priority.

DOWNLOADER_MIDDLEWARES = {
    'qidianwang.middlewares.QidianUserAgentDownloadmiddlerware': 543,
    # 'qidianwang.middlewares.QidianProxyDownloadMiddlerware': 544,
    # 'qidianwang.middlewares.SeleniumDownlaodMiddlerware': 543,
}
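A custom downloader middleware such as the QidianUserAgentDownloadmiddlerware activated above typically rotates the User-Agent header on each request. The following is a minimal hypothetical sketch of that idea, not the project's actual code; the User-Agent pool is illustrative:

```python
import random

# Hypothetical sketch of a User-Agent rotating downloader middleware.
USER_AGENT_POOL = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
]

class RandomUserAgentDownloaderMiddleware:
    """Assigns a random User-Agent header to every outgoing request."""

    def process_request(self, request, spider):
        # Scrapy calls this for every request passing through the
        # downloader middleware chain; returning None lets processing
        # continue to the next middleware.
        request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)
        return None
```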

Enable or disable extensions

See https://doc.scrapy.org/en/latest/topics/extensions.html

EXTENSIONS: add or disable extensions (setting an extension to None disables it)

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

Configure item pipelines

See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

Activate item pipelines; the lower the number, the higher the priority (it runs earlier)

ITEM_PIPELINES = {
    'qidianwang.pipelines.QidianwangPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
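A pipeline class such as QidianwangPipeline (priority 300 above) implements process_item() and optionally open_spider()/close_spider(). Below is a minimal hypothetical sketch, assuming items are dict-like and exporting them as JSON lines; the filename 'items.jl' is an assumption for illustration:

```python
import json

# Hypothetical pipeline sketch: writes each scraped item to a file
# as one JSON object per line. Not the project's actual pipeline.
class JsonLinesExportPipeline:
    """Writes each scraped item as one JSON line."""

    def open_spider(self, spider):
        # Called once when the spider opens.
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider closes.
        self.file.close()

    def process_item(self, item, spider):
        # Must return the item so that lower-priority pipelines, e.g.
        # scrapy_redis.pipelines.RedisPipeline (400), still receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```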

Dynamic download delays (the AutoThrottle auto-throttling extension; disabled by default)

Step 1 of using it: turn it on with AUTOTHROTTLE_ENABLED = True

Enable and configure the AutoThrottle extension (disabled by default)

See https://doc.scrapy.org/en/latest/topics/autothrottle.html

AUTOTHROTTLE_ENABLED = True

The initial download delay

Initial download delay

AUTOTHROTTLE_START_DELAY = 5

The maximum download delay to be set in case of high latencies

Maximum download delay

AUTOTHROTTLE_MAX_DELAY = 60

The average number of requests Scrapy should be sending in parallel to

each remote server

Average number of requests sent in parallel to each remote server

AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Enable showing throttling stats for every response received:

Whether to enable AutoThrottle's DEBUG mode (show throttling stats for every response)

AUTOTHROTTLE_DEBUG = False


Enable and configure HTTP caching (disabled by default)

See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

An extension for caching HTTP responses (disabled by default: HTTPCACHE_ENABLED = False)

HTTPCACHE_ENABLED = True

Cache expiration time in seconds (0 means cached responses never expire)

HTTPCACHE_EXPIRATION_SECS = 0

Directory where the cache is stored

HTTPCACHE_DIR = 'httpcache'

HTTP status codes not to cache; e.g. HTTPCACHE_IGNORE_HTTP_CODES = ['400'] skips caching responses with status 400

HTTPCACHE_IGNORE_HTTP_CODES = []

Cache storage backend

HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Save log output to a local file

LOG_FILE = 'qdlogfile.log'

LOG_LEVEL = 'DEBUG'
