
Running Multiple Scrapy Spiders at the Same Time


1 Creating spiders


1. Create multiple spiders with scrapy genspider spidername domain

scrapy genspider CnblogsHomeSpider cnblogs.com

The command above creates a spider named CnblogsHomeSpider whose start_urls is derived from the cnblogs.com domain.
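The generated file under spiders/ looks roughly like this (a sketch; the exact class name and start_urls format depend on the genspider template of your Scrapy version):

import scrapy

class CnblogsHomeSpider(scrapy.Spider):
  name = "CnblogsHomeSpider"
  allowed_domains = ["cnblogs.com"]
  start_urls = ["http://www.cnblogs.com/"]

  def parse(self, response):
    # handle each downloaded response here
    pass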

2. List the spiders in the project with scrapy list

[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider

From this we can see that the project contains two spiders, one named CnblogsHomeSpider and the other CnblogsSpider.

For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html

2 Running several spiders at the same time

Our project now has two spiders, so how do we get both of them running at the same time? You might suggest a shell script that calls them one after another, or a Python script that runs them in turn. Indeed, on stackoverflow.com I have seen quite a few people do exactly that. The official documentation, however, describes the following approaches.

1. Run Scrapy from a script

import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
  # Your spider definition
  ...
process = CrawlerProcess({
  'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
The key here is scrapy.crawler.CrawlerProcess, which runs a spider from within a script. More examples can be found at: https://github.com/scrapinghub/testspiders

2. Running multiple spiders in the same process

  • Via CrawlerProcess
    import scrapy
    from scrapy.crawler import CrawlerProcess
    class MySpider1(scrapy.Spider):
      # Your first spider definition
      ...
    class MySpider2(scrapy.Spider):
      # Your second spider definition
      ...
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished
  • Via CrawlerRunner
    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    class MySpider1(scrapy.Spider):
      # Your first spider definition
      ...
    class MySpider2(scrapy.Spider):
      # Your second spider definition
      ...
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until all crawling jobs are finished
  • Via CrawlerRunner, chaining deferreds so the spiders run one after another
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    class MySpider1(scrapy.Spider):
      # Your first spider definition
      ...
    class MySpider2(scrapy.Spider):
      # Your second spider definition
      ...
    configure_logging()
    runner = CrawlerRunner()
    @defer.inlineCallbacks
    def crawl():
      yield runner.crawl(MySpider1)
      yield runner.crawl(MySpider2)
      reactor.stop()
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
These are the approaches the official documentation offers for running spiders from a script. A sketch adapted to the two spiders in our project follows below.
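As a minimal sketch of how this applies to our project (my own adaptation, not taken from the docs), assuming the script sits next to scrapy.cfg so that get_project_settings() can locate settings.py, both spiders can be started by name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py so the spider loader knows about the project's spiders
process = CrawlerProcess(get_project_settings())

# spiders can be scheduled by name; both run in the same process
process.crawl('CnblogsHomeSpider')
process.crawl('CnblogsSpider')
process.start()  # blocks until both spiders have finished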

3 Running the spiders via a custom Scrapy command

For custom project commands, see: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands

1. Create a commands directory

mkdir commands

Note: the commands directory sits at the same level as the spiders directory.
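Assuming the project is named cnblogs (as the COMMANDS_MODULE setting and setup.py below suggest), the layout ends up roughly like this:

cnblogs/
├── scrapy.cfg
└── cnblogs/
    ├── __init__.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py
    │   └── crawlall.py
    └── spiders/
        ├── __init__.py
        └── ...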

2. Add a file named crawlall.py under commands

The idea is to adapt Scrapy's built-in crawl command so that it launches all spiders at once. The source of crawl can be viewed here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):
  requires_project = True

  def syntax(self):
    return '[options]'

  def short_desc(self):
    return 'Runs all of the spiders'

  def add_options(self, parser):
    ScrapyCommand.add_options(self, parser)
    parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
              help="set spider argument (may be repeated)")
    parser.add_option("-o", "--output", metavar="FILE",
              help="dump scraped items into FILE (use - for stdout)")
    parser.add_option("-t", "--output-format", metavar="FORMAT",
              help="format to use for dumping items with -o")

  def process_options(self, args, opts):
    ScrapyCommand.process_options(self, args, opts)
    try:
      opts.spargs = arglist_to_dict(opts.spargs)
    except ValueError:
      raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

  def run(self, args, opts):
    # schedule every spider in the project (or only those passed as arguments)
    spider_loader = self.crawler_process.spider_loader
    for spidername in args or spider_loader.list():
      print("*********crawlall spidername************ " + spidername)
      self.crawler_process.crawl(spidername, **opts.spargs)
    # start all scheduled crawls in the same process
    self.crawler_process.start()
The key calls are self.crawler_process.spider_loader.list(), which returns the names of all spiders in the project, and self.crawler_process.crawl(), which schedules each of them before self.crawler_process.start() runs them all.

3. Add an __init__.py file under the commands directory

touch __init__.py

Note: this step must not be skipped.

If you skip it, you will get an exception like this:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
At first I could not find the cause no matter how hard I looked, and it cost me a whole day. I eventually got help from other users on http://stackoverflow.com/. Thanks once again to the almighty Internet (how much nicer it would be without that wall!). But I digress; back to the topic.

4. Create setup.py in the same directory as settings.py (this step can be dropped without any effect; I am not sure what the official documentation intends by including it.)

from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
  entry_points={
    'scrapy.commands': [
      'crawlall=cnblogs.commands.crawlall:Command',
    ],
  },
)
This file registers crawlall as a command entry point: cnblogs.commands.crawlall is the module that contains the command and Command is the command class.

5. Add the following setting to settings.py:

COMMANDS_MODULE = 'cnblogs.commands'

6. Run the command: scrapy crawlall
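With the -a option defined in crawlall.py, a spider argument can be passed to every spider at once; the category argument below is only a hypothetical example:

scrapy crawlall
scrapy crawlall -a category=python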
