Python - Very simple Scrapy crawler not following links

I have this simple Scrapy spider that crawls yelp.com and fetches data.

I've set

    Rule(LinkExtractor(allow=('.*')), follow=True, callback="parsebusiness")

to follow links, with parsebusiness as the callback.

However, Scrapy does not follow any links here.

This is the relevant output (full output here: http://pastebin.com/bkuervmq):

    2015-07-14 01:06:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-07-14 01:06:25 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1> (referer: None)
    2015-07-14 01:06:26 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco> (referer: None)
    2015-07-14 01:06:26 [scrapy] INFO: Closing spider (finished)
    2015-07-14 01:06:26 [scrapy] INFO: Dumping Scrapy stats:

The code is below:

    import sys
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class Business(scrapy.Item):
        name = scrapy.Field()
        contactnumber = scrapy.Field()
        address = scrapy.Field()


    class YelpSpider(CrawlSpider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = [
            "http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1",
            "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
        ]

        Rule(LinkExtractor(allow=()), follow=True, callback="parsebusiness")

        def parsebusiness(self, response):
            business = Business()
            business['name'] = stripchars(response.xpath('//h1[@itemprop="name"]//text()').extract())
            business['contactnumber'] = stripchars(response.xpath('//span[@itemprop="telephone"]//text()').extract())
            business['address'] = stripchars(response.xpath('//li[@class="address"]//text()').extract())
            yield business

What am I missing here? Why doesn't Scrapy follow the links?

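You are not setting the rules attribute of the spider. CrawlSpider only reads its rules from a class attribute named rules; a bare Rule(...) expression in the class body is created and then discarded: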

    class YelpSpider(CrawlSpider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = [
            "http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1",
            "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
        ]

        rules = [
            Rule(LinkExtractor(allow=('.*')), follow=True, callback="parsebusiness")
        ]
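To see why the unassigned Rule(...) has no effect, here is a minimal plain-Python sketch. It does not use Scrapy at all: Rule and CrawlSpiderSketch are hypothetical stand-ins written for this illustration, and active_rules is an invented helper, not a Scrapy method. The only point it makes is that an expression in a class body that is not bound to a name is evaluated and thrown away, so a base class that looks up an attribute called rules never sees it.

```python
# Stand-in classes for illustration only; these are NOT Scrapy's real classes.
class Rule:
    def __init__(self, pattern, follow=True, callback=None):
        self.pattern = pattern
        self.follow = follow
        self.callback = callback


class CrawlSpiderSketch:
    # The base class looks its rules up by the attribute name "rules".
    rules = ()

    def active_rules(self):
        # Hypothetical helper: returns whatever is bound to "rules".
        return list(self.rules)


class BrokenSpider(CrawlSpiderSketch):
    # Evaluated while the class body runs, then discarded: nothing is bound
    # to the name "rules", so the empty default from the base class is used.
    Rule(".*", follow=True, callback="parsebusiness")


class FixedSpider(CrawlSpiderSketch):
    # Bound to the attribute the base class actually reads.
    rules = [Rule(".*", follow=True, callback="parsebusiness")]


print(len(BrokenSpider().active_rules()))
print(len(FixedSpider().active_rules()))
```

The broken variant ends up with no rules and the fixed one with a single rule. That is consistent with the log in the question, where only the two start URLs are crawled, both with referer: None.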
