Using Scrapy to Meet All Kinds of Crawling Needs

I've been playing with crawlers for quite a while now but never kept any notes, so I keep falling into the same pits over and over. Time to write things down.

1. ImportError: Error loading object 'scrapy.core.downloader.handlers.s3.S3DownloadHandler': No module named win32api

Install pywin32: http://sourceforge.net/projects/pywin32/

2. In newer versions (1.0+), if the spider still inherits from scrapy.Spider, the rules are never executed. This tripped me up for a long time; once I finally read the official docs it turned out the changes were quite substantial, and several packages had moved:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
The spider also has to inherit from CrawlSpider; if you don't override parse, it crawls automatically according to the rules.
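
A minimal sketch of what that looks like (the domain, start URL, and allow pattern are placeholders of my own):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Do not override parse(): CrawlSpider uses it internally to apply the rules.
    rules = (
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each page matched by the rule ends up here
        yield {'url': response.url}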

3. When initializing the crawl with start_urls is not enough and you need to configure the Request yourself:
Override this method:

def start_requests(self):
    yield scrapy.Request(url, callback=self.parse)
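
A slightly fuller sketch, assuming you need custom headers and cookies on the initial request (the URL, header, and cookie values here are made up):

import scrapy

class LoginFirstSpider(scrapy.Spider):
    name = 'login_first'

    def start_requests(self):
        # Build the initial request by hand instead of relying on start_urls
        yield scrapy.Request(
            'http://example.com/login',
            headers={'Referer': 'http://example.com/'},
            cookies={'lang': 'en'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('landed on %s', response.url)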

4. When you need to reuse a request in a middleware or in parse but change its URL (note that replace() returns a new Request rather than modifying the original):

request.replace(url=new_url)
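
For instance, in a downloader middleware you can return the replaced copy and Scrapy will schedule it in place of the original (the domain rewrite below is only an illustration):

class RewriteUrlMiddleware(object):

    def process_request(self, request, spider):
        if 'old-domain.com' in request.url:
            new_url = request.url.replace('old-domain.com', 'new-domain.com')
            # replace() copies the request with the new URL; returning a Request
            # from process_request makes Scrapy reschedule it instead of the original.
            return request.replace(url=new_url)
        return None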

5. When you need to read the settings or change the crawler's configuration:
Override this class method; cls is the current class. If it is an extension you wrote yourself, it also has to be declared in settings.py.

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    return cls(*args, **kwargs)
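As a sketch, an extension of my own that reads a custom setting through the crawler (the MYEXT_ENABLED / MYEXT_INTERVAL setting names are hypothetical; the class still has to be listed under EXTENSIONS in settings.py):

from scrapy.exceptions import NotConfigured

class MyExtension(object):

    def __init__(self, interval):
        self.interval = interval

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the merged project settings
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getint('MYEXT_INTERVAL', 60))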

6. Get the current spider's project settings:

from scrapy.utils import project
self.settings = project.get_project_settings()
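
A quick usage sketch, reading a couple of values from the returned Settings object (the keys shown are standard Scrapy settings):

from scrapy.utils import project

settings = project.get_project_settings()
ua = settings.get('USER_AGENT')
timeout = settings.getint('DOWNLOAD_TIMEOUT', 180)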

7. Get the crawler's runtime stats:

crawler.stats.get_stats()
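
For example, a spider can dump its stats when it finishes via the closed() shortcut (the spider itself is just a stand-in):

import scrapy

class StatsDemoSpider(scrapy.Spider):
    name = 'stats_demo'
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # self.crawler is attached when the spider is created via from_crawler;
        # get_stats() returns a dict with keys like 'downloader/request_count'.
        self.logger.info('crawl stats: %s', self.crawler.stats.get_stats())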

8. Rotating proxies and user agents at random
Define your own downloader middleware and set the reserved 'proxy' key in request.meta;
the user agent is set through request.headers.
Return None so the request continues to be processed.

def process_request(self, request, spider):
    request.meta['proxy'] = "http://127.0.0.1:8888"
    request.headers.setdefault('User-Agent', ua)
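
A fuller sketch of such a middleware, picking a random proxy and User-Agent for every request (the PROXIES and USER_AGENTS lists are placeholders; remember to enable the class in DOWNLOADER_MIDDLEWARES):

import random

PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

class RandomProxyUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # 'proxy' is the reserved meta key honoured by the downloader
        request.meta['proxy'] = random.choice(PROXIES)
        # Overwrite rather than setdefault so every request gets a fresh UA
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None lets the request continue through the other middlewares
        return None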

9. To deploy scrapyd, you need to create the config file /etc/scrapyd/scrapyd.conf:

[scrapyd]
eggs_dir    = {your dir}/eggs
logs_dir    = {your dir}/logs
items_dir   = {your dir}/items
jobs_to_keep = 100
dbs_dir     = {your dir}/dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 10
http_port   = 6800
debug       = on
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
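
Once scrapyd is up, jobs go through the HTTP endpoints above; a quick sketch of scheduling a run with the requests library (the project and spider names are placeholders):

import requests

resp = requests.post('http://localhost:6800/schedule.json',
                     data={'project': 'myproject', 'spider': 'example'})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}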

10. Running a spider through scrapyd fails with:
scrapyd CRITICAL: Unhandled error in Deferred:

In my case, disabling these two extensions in settings.py got rid of it:

EXTENSIONS = {
    'scrapy.telnet.TelnetConsole': None,
    'scrapy.extensions.feedexport.FeedExporter': None,
}
