Python - Very simple Scrapy crawler not following links
This simple Scrapy spider crawls yelp.com and fetches some data. I've set

    Rule(LinkExtractor(allow=('.*')), follow=True, callback="parsebusiness")

to follow links and call parsebusiness on each page. However, Scrapy does not follow any links here.

This is the relevant output (full output here: http://pastebin.com/bkuervmq):
    2015-07-14 01:06:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-07-14 01:06:25 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1> (referer: None)
    2015-07-14 01:06:26 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco> (referer: None)
    2015-07-14 01:06:26 [scrapy] INFO: Closing spider (finished)
    2015-07-14 01:06:26 [scrapy] INFO: Dumping Scrapy stats:
The code is below:
    import sys
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class Business(scrapy.Item):
        name = scrapy.Field()
        contactnumber = scrapy.Field()
        address = scrapy.Field()


    class YelpSpider(CrawlSpider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = [
            "http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1",
            "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
        ]

        Rule(LinkExtractor(allow=()), follow=True, callback="parsebusiness")

        def parsebusiness(self, response):
            business = Business()
            business['name'] = stripchars(response.xpath('//h1[@itemprop="name"]//text()').extract())
            business['contactnumber'] = stripchars(response.xpath('//span[@itemprop="telephone"]//text()').extract())
            business['address'] = stripchars(response.xpath('//li[@class="address"]//text()').extract())
            yield business
What am I missing here? Why doesn't Scrapy follow the links?
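You are not setting the rules attribute of the spider. The Rule in your class body is created and then thrown away, so the spider has no rules to follow. Assign it to rules: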
    class YelpSpider(CrawlSpider):
        name = "yelp"
        allowed_domains = ["www.yelp.com"]
        start_urls = [
            "http://www.yelp.com/search?find_desc=hotels&find_loc=san+francisco%2c+ca&ns=1",
            "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
        ]

        # The list must be bound to the name `rules` for CrawlSpider to see it.
        rules = [
            Rule(LinkExtractor(allow=('.*')), follow=True, callback="parsebusiness")
        ]
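To see why the original version silently does nothing, here is a small sketch in plain Python that runs without Scrapy installed. The Rule and BaseSpider classes below are hypothetical stand-ins written only for this illustration, not Scrapy's own classes. It assumes the base class provides an empty default for rules, which is what makes the mistake pass without an error.

```python
# Stand-in for scrapy.spiders.Rule, only to keep this sketch dependency-free.
class Rule:
    def __init__(self, pattern, callback=None, follow=False):
        self.pattern = pattern
        self.callback = callback
        self.follow = follow


# Stand-in for CrawlSpider: assumed to declare an empty `rules` default.
class BaseSpider:
    rules = ()


class BrokenSpider(BaseSpider):
    # The Rule object is created and immediately discarded: nothing binds
    # it to a name, so the class keeps the inherited empty default.
    Rule(".*", callback="parsebusiness", follow=True)


class FixedSpider(BaseSpider):
    # Binding the list to `rules` overrides the inherited default.
    rules = [Rule(".*", callback="parsebusiness", follow=True)]


print(len(BrokenSpider.rules))  # 0
print(len(FixedSpider.rules))   # 1
```

A bare expression in a class body is evaluated but never stored, so Python raises no error and the spider just crawls the start URLs and stops, which matches the log in the question.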