python - Scrapy crawler not getting data after crawling
I'm new to Scrapy. When I run the code, the debug output shows no errors, yet no data is scraped when some amount of data should have been. Below is the code. I'm trying to get reviews from TripAdvisor.
    import HTMLParser
    import unicodedata
    import re
    import time

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class ScrapingTestSpider(CrawlSpider):
        name = "scrapingtest"
        allowed_domains = ["tripadvisor.com"]
        base_uri = "http://www.tripadvisor.com"
        start_urls = [
            base_uri + "/RestaurantSearch?geo=60763&q=New+York+City%2C+New+York&cat=&pid="
        ]

    htmlparser = HTMLParser.HTMLParser()

    def is_ascii(s):
        return all(ord(c) < 128 for c in s)

    def clean_parsed_string(string):
        if len(string) > 0:
            ascii_string = string
            if is_ascii(ascii_string) == False:
                ascii_string = unicodedata.normalize('NFKD', ascii_string).encode('ascii', 'ignore')
            return str(ascii_string)
        else:
            return None

    def get_parsed_string(selector, xpath):
        return_string = ''
        extracted_list = selector.xpath(xpath).extract()
        if len(extracted_list) > 0:
            raw_string = extracted_list[0].strip()
            if raw_string is not None:
                return_string = htmlparser.unescape(raw_string)
        return return_string

    def get_parsed_string_multiple(selector, xpath):
        return_string = ''
        return selector.xpath(xpath).extract()

    def parse(self, response):
        tripadvisor_items = []
        sel = Selector(response)
        snode_restaurants = sel.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div[starts-with(@class, "listing")]')
        # Build item index.
        for snode_restaurant in snode_restaurants:
            # Clean the string and take only the first part before the whitespace.
            snode_restaurant_item_avg_stars = clean_parsed_string(get_parsed_string(snode_restaurant, 'div[@class="wrap"]/div[@class="entry wrap"]/div[@class="description"]/div[@class="wrap"]/div[@class="rs rating"]/span[starts-with(@class, "rate")]/img[@class="sprite-ratings"]/@alt'))
            tripadvisor_item['avg_stars'] = re.match(r'(\S+)', snode_restaurant_item_avg_stars).group()

            # Populate reviews and address for the current item.
            yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_search_page)

    def parse_fetch_review(self, response):
        tripadvisor_item = response.meta['tripadvisor_item']
        sel = Selector(response)
        counter_page_review = response.meta['counter_page_review']

        # TripAdvisor reviews for this item.
        snode_reviews = sel.xpath('//div[@id="REVIEWS"]/div/div[contains(@class, "review")]/div[@class="col2of2"]/div[@class="innerBubble"]')

        # Reviews of the item.
        for snode_review in snode_reviews:
            tripadvisor_review_item = ScrapingTestReviewItem()

            tripadvisor_review_item['title'] = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="quote"]/text()'))
            # The review item description is a list of strings.
            # Strings in the list are generated by parsing the user's intentional newlines. DOM: <br>
            tripadvisor_review_item['description'] = get_parsed_string_multiple(snode_review, 'div[@class="entry"]/p/text()')

            # Clean the string and take only the first part before the whitespace.
            snode_review_item_stars = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="rating reviewItemInline"]/span[starts-with(@class, "rate")]/img/@alt'))
            tripadvisor_review_item['stars'] = re.match(r'(\S+)', snode_review_item_stars).group()

            snode_review_item_date = clean_parsed_string(get_parsed_string(snode_review, 'div[@class="rating reviewItemInline"]/span[@class="ratingDate"]/text()'))
            snode_review_item_date = re.sub(r'Reviewed ', '', snode_review_item_date, flags=re.IGNORECASE)
            snode_review_item_date = time.strptime(snode_review_item_date, '%B %d, %Y') if snode_review_item_date else None
            tripadvisor_review_item['date'] = time.strftime('%Y-%m-%d', snode_review_item_date) if snode_review_item_date else None

            tripadvisor_item['reviews'].append(tripadvisor_review_item)
Here's the debug log:
    C:\Users\smash_000\Desktop\scrapingtest\scrapingtest>scrapy crawl scrapingtest -o items.json
    C:\Users\smash_000\Desktop\scrapingtest\scrapingtest\spiders\scrapingtest_spider.py:6: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
      from scrapy.spider import BaseSpider
    C:\Users\smash_000\Desktop\scrapingtest\scrapingtest\spiders\scrapingtest_spider.py:9: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
      from scrapy.contrib.spiders import CrawlSpider, Rule
    2015-07-14 11:07:04 [scrapy] INFO: Scrapy 1.0.1 started (bot: scrapingtest)
    2015-07-14 11:07:04 [scrapy] INFO: Optional features available: ssl, http11
    2015-07-14 11:07:04 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapingtest.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['scrapingtest.spiders'], 'FEED_URI': 'items.json', 'BOT_NAME': 'scrapingtest'}
    2015-07-14 11:07:04 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
    2015-07-14 11:07:05 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-07-14 11:07:05 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-07-14 11:07:05 [scrapy] INFO: Enabled item pipelines:
    2015-07-14 11:07:05 [scrapy] INFO: Spider opened
    2015-07-14 11:07:05 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-07-14 11:07:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-07-14 11:07:06 [scrapy] DEBUG: Crawled (200) <GET http://www.tripadvisor.com/RestaurantSearch?geo=60763&q=New+York+City%2C+New+York&cat=&pid=> (referer: None)
    2015-07-14 11:07:06 [scrapy] INFO: Closing spider (finished)
    2015-07-14 11:07:06 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 281,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 46932,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 7, 14, 5, 37, 6, 929000),
     'log_count/DEBUG': 2,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 7, 14, 5, 37, 5, 474000)}
    2015-07-14 11:07:06 [scrapy] INFO: Spider closed (finished)
Did you try to debug the code with print statements?
I tried to execute your parser. If I copy the provided code as it is, I get the same result, because the spider class ScrapingTestSpider has no parse method, so it is never called.

If I fix the formatting of the code (I indent everything below start_urls so it sits inside the class), I get errors because the helper methods are no longer defined as global names.

If I go further and leave only the parse methods in the crawler, I get other errors mentioning that tripadvisor_item is not defined... the code is not working as posted. A minimal sketch of the structural fix follows.
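Here is what I mean, as a minimal sketch (assuming the helpers are intended to be instance methods): indented into the class, they take self and are called through self instead of as bare global names. The URL is just your first start URL; the parse body is a placeholder.

    from scrapy.spiders import CrawlSpider  # `scrapy.contrib.spiders` is deprecated in Scrapy 1.0

    class ScrapingTestSpider(CrawlSpider):
        name = "scrapingtest"
        allowed_domains = ["tripadvisor.com"]
        start_urls = ["http://www.tripadvisor.com/RestaurantSearch?geo=60763&q=New+York+City%2C+New+York&cat=&pid="]

        # Indented into the class, the helper becomes an instance method:
        # it takes `self` and is called as self.is_ascii(...).
        def is_ascii(self, s):
            return all(ord(c) < 128 for c in s)

        def parse(self, response):
            # Calls go through `self`, not through bare global names.
            print("parse() entered, ascii url: %s" % self.is_ascii(response.url))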
Try to format the code better in an IDE and add print messages to the parse methods to see whether they are called or not, e.g. like the snippet below. The main parse method should be entered when Scrapy crawls the first URL, but as posted I think it won't work.
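For instance (a sketch; the method bodies are reduced to the tracing calls, the rest stays as in your spider):

    def parse(self, response):
        # If this never prints, Scrapy is not treating parse() as a
        # method of your spider class (indentation/naming problem).
        print("parse() called: %s" % response.url)
        # ... build the items and yield the Requests as before ...

    def parse_fetch_review(self, response):
        print("parse_fetch_review() called: %s" % response.url)
        # ... extract the reviews as before ...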
And by the way, the callback you add to the Request is badly named too:

    yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_search_page)

should be changed to

    yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_fetch_review)

once you fix the indentation problems.
And at the end of the parse_fetch_review method, return or yield the tripadvisor_item you created in the parse method.
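A minimal sketch of that last point, assuming the item is complete once the review loop finishes (the loop body is elided here):

    def parse_fetch_review(self, response):
        tripadvisor_item = response.meta['tripadvisor_item']

        # ... extract the reviews and append them to
        # tripadvisor_item['reviews'] exactly as before ...

        # Without this final yield (or return), Scrapy discards the populated
        # item, and items.json stays empty even when the callback runs.
        yield tripadvisor_item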