python - How to use a decorator to enable a spider to distinguish between Scrapy pipelines


I've made a Scrapy project that contains multiple spiders in one file, and I need the interpreter to be able to distinguish which pipeline is associated with which spider, similar to the person who asked this SO question. Using the solution provided in the top-voted answer, I put the decorator within the pipeline classes and defined the pipeline lists within the spiders themselves. When I run this, I get a NameError because the pipeline classes are referenced in the spider file, where their names are not defined.

Since the pipelines.py file isn't a module, I can't import it into the spiders.py file. I'm not sure whether that answer is still relevant, since it is not recent, but it seems it did work, so it is worth a try at least. By the way, I'm running the two spiders sequentially, based on the code provided in the docs, and though both spiders run when I use the command scrapy runspider, I think the pipeline classes are not being called. However, when I run each spider individually, the tables are filled properly. I have included both pipeline classes in the settings.py dictionary. Given this, I have a few questions:

[1.] Do I have the correct setup of both files, based on the answer provided in that question?
[2.] If so, how do I connect the namespaces of the two files?
[3.] Is there a better way to do this, besides creating separate projects?

The code for both files is below. Any help would be appreciated, thanks.

pipelines.py

    from sqlalchemy.orm import sessionmaker
    from models import tickets, tickets3, db_connect, create_vs_tickets_table, create_tc_tickets_table


    class comparatorpipeline(object):
        """price comparison pipeline for storing scraped items in the database"""
        def __init__(self):
            """
            initializes database connection and sessionmaker.
            creates deals table.
            """
            engine = db_connect()
            create_vs_tickets_table(engine)
            self.session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            """save tickets in the database.

            this method is called for every item pipeline component.

            """

            def check_spider_pipeline(process_item_method):

                @functools.wraps(process_item_method)
                def wrapper(self, item, spider):
                    # message template for debugging
                    msg = '%%s %s pipeline step' % (self.__class__.__name__,)

                    # if the class is in the spider's pipeline, use the
                    # process_item method normally.
                    if self.__class__ in spider.pipeline:
                        spider.log(msg % 'executing', level=log.DEBUG)
                        return process_item_method(self, item, spider)

                    # otherwise, return the untouched item (skip this step in the pipeline)
                    else:
                        spider.log(msg % 'skipping', level=log.DEBUG)
                        return item

                return wrapper

            if spider.name == "comparator":
                session = self.session()
                ticket = tickets(**item)

                try:
                    session.add(ticket)
                    session.commit()
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()

                return item


    class comparatorpipeline2(object):
        """price comparison pipeline for storing scraped items in the database"""
        def __init__(self):
            """
            initializes database connection and sessionmaker.
            creates deals table.
            """
            engine = db_connect()
            create_tc_tickets_table(engine)
            self.session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            """save tickets in the database.

            this method is called for every item pipeline component.

            """

            def check_spider_pipeline(process_item_method):

                @functools.wraps(process_item_method)
                def wrapper(self, item, spider):
                    # message template for debugging
                    msg = '%%s %s pipeline step' % (self.__class__.__name__,)

                    # if the class is in the spider's pipeline, use the
                    # process_item method normally.
                    if self.__class__ in spider.pipeline:
                        spider.log(msg % 'executing', level=log.DEBUG)
                        return process_item_method(self, item, spider)

                    # otherwise, return the untouched item (skip this step in the pipeline)
                    else:
                        spider.log(msg % 'skipping', level=log.DEBUG)
                        return item

                return wrapper

            if spider.name == "comparator2":
                session = self.session()
                ticket2 = tickets2(**item)

                try:
                    session.add(ticket2)
                    session.commit()
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()

                return item

spider class definitions

    import scrapy
    import re
    import json
    from scrapy.crawler import CrawlerProcess
    from scrapy import Request
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.selector import Selector
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import Join, MapCompose
    from concert_comparator.items import comparatoritem, comparatoritem3
    from urlparse import urljoin

    from scrapy.crawler import CrawlerRunner
    from twisted.internet import reactor, defer
    from scrapy.utils.log import configure_logging

    bandname = raw_input("enter bandname \n")

    vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"
    tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"


    class myspider(CrawlSpider):
        pipeline = set([
            comparatorpipeline
            ])
        pipeline = ['first']
        handle_httpstatus_list = [416]
        name = 'comparator'
        allowed_domains = ["www.vividseats.com"]
        start_urls = [vs_url]
        tickets_list_xpath = './/*[@itemtype="http://schema.org/event"]'

        def parse_json(self, response):
            loader = response.meta['loader']
            jsonresponse = json.loads(response.body_as_unicode())
            ticket_info = jsonresponse.get('tickets')
            price_list = [i.get('p') for i in ticket_info]
            if len(price_list) > 0:
                str_price = str(price_list[0])
                ticketprice = unicode(str_price, "utf-8")
                loader.add_value('ticketprice', ticketprice)
            else:
                ticketprice = unicode("sold out", "utf-8")
                loader.add_value('ticketprice', ticketprice)
            return loader.load_item()

        def parse_price(self, response):
            loader = response.meta['loader']
            ticketslink = loader.get_output_value("ticketslink")
            json_id_list = re.findall(r"(\d+)[^-]*$", ticketslink)
            json_id = "".join(json_id_list)
            json_url = "http://www.vividseats.com/javascript/tickets.shtml?productionid=" + json_id
            yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

        def parse(self, response):
            """
            # """
            selector = HtmlXPathSelector(response)
            # iterate over tickets
            for ticket in selector.select(self.tickets_list_xpath):
                loader = XPathItemLoader(comparatoritem(), selector=ticket)
                # define loader
                loader.default_input_processor = MapCompose(unicode.strip)
                loader.default_output_processor = Join()
                # iterate over fields and add xpaths to the loader
                loader.add_xpath('eventname', './/*[@class="productionsevent"]/text()')
                loader.add_xpath('eventlocation', './/*[@class = "productionsvenue"]/span[@itemprop  = "name"]/text()')
                loader.add_xpath('ticketslink', './/*/a[@class = "btn btn-primary"]/@href')
                loader.add_xpath('eventdate', './/*[@class = "productionsdate"]/text()')
                loader.add_xpath('eventcity', './/*[@class = "productionsvenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addresslocality"]/text()')
                loader.add_xpath('eventstate', './/*[@class = "productionsvenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressregion"]/text()')
                loader.add_xpath('eventtime', './/*[@class = "productionstime"]/text()')

                print "here ticket link \n" + loader.get_output_value("ticketslink")
                #sel.xpath("//span[@id='practitionerdetails1_label4']/text()").extract()
                ticketsurl = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader.get_output_value("ticketslink")
                ticketsurl = urljoin(response.url, ticketsurl)
                yield scrapy.Request(ticketsurl, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)


    class myspider3(CrawlSpider):
        pipeline = set([
            comparatorpipeline2
            ])
        handle_httpstatus_list = [416]
        name = 'comparator3'
        allowed_domains = ["www.ticketcity.com"]
        start_urls = [tc_url]
        tickets_list_xpath = './/div[@class = "vevent"]'

        def parse_json(self, response):
            loader = response.meta['loader']
            jsonresponse = json.loads(response.body_as_unicode())
            ticket_info = jsonresponse.get('b')
            price_list = [i.get('p') for i in ticket_info]
            if len(price_list) > 0:
                str_price = str(price_list[0])
                ticketprice = unicode(str_price, "utf-8")
                loader.add_value('ticketprice', ticketprice)
            else:
                ticketprice = unicode("sold out", "utf-8")
                loader.add_value('ticketprice', ticketprice)
            return loader.load_item()

        def parse_price(self, response):
            print "parse price function entered \n"
            loader = response.meta['loader']
            event_city = response.xpath('.//span[@itemprop="addresslocality"]/text()').extract()
            eventcity = ''.join(event_city)
            loader.add_value('eventcity', eventcity)
            event_state = response.xpath('.//span[@itemprop="addressregion"]/text()').extract()
            eventstate = ''.join(event_state)
            loader.add_value('eventstate', eventstate)
            event_date = response.xpath('.//span[@class="event_datetime"]/text()').extract()
            eventdate = ''.join(event_date)
            loader.add_value('eventdate', eventdate)
            ticketslink = loader.get_output_value("ticketslink")
            json_id_list = re.findall(r"(\d+)[^-]*$", ticketslink)
            json_id = "".join(json_id_list)
            json_url = "https://www.ticketcity.com/catalog/public/v1/events/" + json_id + "/ticketblocks?p=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
            yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

        def parse(self, response):
            """
            # """
            selector = HtmlXPathSelector(response)
            # iterate over tickets
            for ticket in selector.select(self.tickets_list_xpath):
                loader = XPathItemLoader(comparatoritem(), selector=ticket)
                # define loader
                loader.default_input_processor = MapCompose(unicode.strip)
                loader.default_output_processor = Join()
                # iterate over fields and add xpaths to the loader
                loader.add_xpath('eventname', './/span[@class="summary listingeventname"]/text()')
                loader.add_xpath('eventlocation', './/div[@class="divvenue location"]/text()')
                loader.add_xpath('ticketslink', './/a[@class="diveventdetails url"]/@href')
                #loader.add_xpath('eventdatetime' , '//div[@id="diveventdate"]/@title') #datetime type
                #loader.add_xpath('eventtime' , './/*[@class = "productionstime"]/text()')

                print "here ticket link \n" + loader.get_output_value("ticketslink")
                #sel.xpath("//span[@id='practitionerdetails1_label4']/text()").extract()
                ticketsurl = "https://www.ticketcity.com/" + loader.get_output_value("ticketslink")
                ticketsurl = urljoin(response.url, ticketsurl)
                yield scrapy.Request(ticketsurl, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)


    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(myspider)
        yield runner.crawl(myspider3)
        reactor.stop()

    crawl()
    reactor.run()

pipelines directory

You should at least read up on decorators and how they are used before posting this type of question.

You don't have them set up properly. You should create one project with at least two modules: one module named spiders and another named pipelines. Note that for a directory to be considered a module (strictly speaking, a package), it needs to have a file named __init__.py in it. https://stackoverflow.com/a/448279/2368836
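To illustrate the point about __init__.py, here is a minimal sketch that is not part of the original answer. It builds a throwaway directory tree and imports from one package into the other. The names demo_pipelines, demo_spiders, and marker are hypothetical stand-ins chosen so the demo cannot clash with a real project.

```python
import importlib
import os
import sys
import tempfile

# Throwaway project root; everything below is created inside it.
root = tempfile.mkdtemp()


def write(relpath, text):
    """Create a file (and its parent directories) under the temporary root."""
    path = os.path.join(root, relpath)
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    with open(path, "w") as handle:
        handle.write(text)


# Two sibling packages, each marked as a package by an __init__.py file.
write("demo_pipelines/__init__.py", "")
write("demo_pipelines/util.py",
      "def marker():\n    return 'from demo_pipelines.util'\n")
write("demo_spiders/__init__.py", "")
write("demo_spiders/spiders.py",
      "from demo_pipelines.util import marker\n\nRESULT = marker()\n")

# The project root must be on the import path for the dotted names to resolve.
sys.path.insert(0, root)
spiders = importlib.import_module("demo_spiders.spiders")
print(spiders.RESULT)
```

On Python 3 a directory without __init__.py can still be imported as a namespace package, but the code in this question targets Python 2, where the file is required.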

In the pipelines module, add a file called util.py with the following code:

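    import functools

    # legacy scrapy logging module, matching the scrapy.contrib-era imports used in the spiders
    from scrapy import log


    def check_spider_pipeline(process_item_method):
        """
        This wrapper makes it so pipelines can be turned on and off at a spider level.
        """
        @functools.wraps(process_item_method)
        def wrapper(self, item, spider):
            # message template for debugging
            msg = '%%s %s pipeline step' % (self.__class__.__name__,)
            if self.__class__ in spider.pipeline:
                # this pipeline class is listed by the spider: run it normally
                spider.log(msg % 'executing', level=log.DEBUG)
                return process_item_method(self, item, spider)
            else:
                # not listed: pass the item through untouched
                spider.log(msg % 'skipping', level=log.DEBUG)
                return item

        return wrapper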
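To see what the decorator does without a running crawl, here is a small sketch that exercises the same logic against stand-in objects. It is not taken from the answer: fakespider, storepipeline, and otherpipeline are hypothetical classes, and the stand-in log method just records messages instead of using Scrapy's logging.

```python
import functools


def check_spider_pipeline(process_item_method):
    """Run the wrapped process_item only if the spider lists this pipeline class."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in spider.pipeline:
            spider.log('executing %s' % self.__class__.__name__)
            return process_item_method(self, item, spider)
        spider.log('skipping %s' % self.__class__.__name__)
        return item
    return wrapper


class storepipeline(object):
    """Hypothetical pipeline that records every item it processes."""
    def __init__(self):
        self.stored = []

    @check_spider_pipeline
    def process_item(self, item, spider):
        self.stored.append(item)
        return item


class otherpipeline(storepipeline):
    """Second hypothetical pipeline; the spider below does not list it."""

    @check_spider_pipeline
    def process_item(self, item, spider):
        self.stored.append(item)
        return item


class fakespider(object):
    """Stand-in for a spider: only the attributes the decorator touches."""
    pipeline = set([storepipeline])

    def __init__(self):
        self.messages = []

    def log(self, message):
        self.messages.append(message)


spider = fakespider()
store, other = storepipeline(), otherpipeline()
item = {'eventname': 'demo'}
# Mimic the engine handing the item to every enabled pipeline in turn.
for step in (store, other):
    item = step.process_item(item, spider)

print(store.stored)
print(other.stored)
print(spider.messages)
```

The membership test compares exact classes, so otherpipeline is skipped even though it subclasses storepipeline. Either way the item is returned, which keeps it moving through any later pipeline steps.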

Then create a file in pipelines called pipelines.py:

    from sqlalchemy.orm import sessionmaker
    from models import tickets, tickets3, db_connect, create_vs_tickets_table, create_tc_tickets_table
    from pipelines.util import check_spider_pipeline


    class comparatorpipeline(object):
        """price comparison pipeline for storing scraped items in the database"""
        def __init__(self):
            """
            initializes database connection and sessionmaker.
            creates deals table.
            """
            engine = db_connect()
            create_vs_tickets_table(engine)
            self.session = sessionmaker(bind=engine)

        @check_spider_pipeline
        def process_item(self, item, spider):
            """save tickets in the database.

            this method is called for every item pipeline component.

            """
            if spider.name == "comparator":
                session = self.session()
                ticket = tickets(**item)

                try:
                    session.add(ticket)
                    session.commit()
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()

                return item


    class comparatorpipeline2(object):
        """price comparison pipeline for storing scraped items in the database"""
        def __init__(self):
            """
            initializes database connection and sessionmaker.
            creates deals table.
            """
            engine = db_connect()
            create_tc_tickets_table(engine)
            self.session = sessionmaker(bind=engine)

        @check_spider_pipeline
        def process_item(self, item, spider):
            """save tickets in the database.

            this method is called for every item pipeline component.

            """
            # the ticketcity spider in the spiders module is named 'comparator3'
            if spider.name == "comparator3":
                session = self.session()
                # tickets3 is the second model imported from models above
                ticket3 = tickets3(**item)

                try:
                    session.add(ticket3)
                    session.commit()
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()

                return item

in spiders module:

    import scrapy
    import re
    import json
    from scrapy.crawler import CrawlerProcess
    from scrapy import Request
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.selector import Selector
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import Join, MapCompose
    from concert_comparator.items import comparatoritem, comparatoritem3
    from urlparse import urljoin

    from scrapy.crawler import CrawlerRunner
    from twisted.internet import reactor, defer
    from scrapy.utils.log import configure_logging
    from pipelines.pipelines import comparatorpipeline, comparatorpipeline2

    bandname = raw_input("enter bandname \n")

    vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"
    tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"


    class myspider(CrawlSpider):
        pipeline = set([
            comparatorpipeline
            ])
        handle_httpstatus_list = [416]
        name = 'comparator'
        allowed_domains = ["www.vividseats.com"]
        start_urls = [vs_url]
        tickets_list_xpath = './/*[@itemtype="http://schema.org/event"]'

        def parse_json(self, response):
            loader = response.meta['loader']
            jsonresponse = json.loads(response.body_as_unicode())
            ticket_info = jsonresponse.get('tickets')
            price_list = [i.get('p') for i in ticket_info]
            if len(price_list) > 0:
                str_price = str(price_list[0])
                ticketprice = unicode(str_price, "utf-8")
                loader.add_value('ticketprice', ticketprice)
            else:
                ticketprice = unicode("sold out", "utf-8")
                loader.add_value('ticketprice', ticketprice)
            return loader.load_item()

        def parse_price(self, response):
            loader = response.meta['loader']
            ticketslink = loader.get_output_value("ticketslink")
            json_id_list = re.findall(r"(\d+)[^-]*$", ticketslink)
            json_id = "".join(json_id_list)
            json_url = "http://www.vividseats.com/javascript/tickets.shtml?productionid=" + json_id
            yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

        def parse(self, response):
            """
            # """
            selector = HtmlXPathSelector(response)
            # iterate over tickets
            for ticket in selector.select(self.tickets_list_xpath):
                loader = XPathItemLoader(comparatoritem(), selector=ticket)
                # define loader
                loader.default_input_processor = MapCompose(unicode.strip)
                loader.default_output_processor = Join()
                # iterate over fields and add xpaths to the loader
                loader.add_xpath('eventname', './/*[@class="productionsevent"]/text()')
                loader.add_xpath('eventlocation', './/*[@class = "productionsvenue"]/span[@itemprop  = "name"]/text()')
                loader.add_xpath('ticketslink', './/*/a[@class = "btn btn-primary"]/@href')
                loader.add_xpath('eventdate', './/*[@class = "productionsdate"]/text()')
                loader.add_xpath('eventcity', './/*[@class = "productionsvenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addresslocality"]/text()')
                loader.add_xpath('eventstate', './/*[@class = "productionsvenue"]/span[@itemprop  = "address"]/span[@itemprop  = "addressregion"]/text()')
                loader.add_xpath('eventtime', './/*[@class = "productionstime"]/text()')

                print "here ticket link \n" + loader.get_output_value("ticketslink")
                #sel.xpath("//span[@id='practitionerdetails1_label4']/text()").extract()
                ticketsurl = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader.get_output_value("ticketslink")
                ticketsurl = urljoin(response.url, ticketsurl)
                yield scrapy.Request(ticketsurl, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)


    class myspider3(CrawlSpider):
        pipeline = set([
            comparatorpipeline2
            ])
        handle_httpstatus_list = [416]
        name = 'comparator3'
        allowed_domains = ["www.ticketcity.com"]
        start_urls = [tc_url]
        tickets_list_xpath = './/div[@class = "vevent"]'

        def parse_json(self, response):
            loader = response.meta['loader']
            jsonresponse = json.loads(response.body_as_unicode())
            ticket_info = jsonresponse.get('b')
            price_list = [i.get('p') for i in ticket_info]
            if len(price_list) > 0:
                str_price = str(price_list[0])
                ticketprice = unicode(str_price, "utf-8")
                loader.add_value('ticketprice', ticketprice)
            else:
                ticketprice = unicode("sold out", "utf-8")
                loader.add_value('ticketprice', ticketprice)
            return loader.load_item()

        def parse_price(self, response):
            print "parse price function entered \n"
            loader = response.meta['loader']
            event_city = response.xpath('.//span[@itemprop="addresslocality"]/text()').extract()
            eventcity = ''.join(event_city)
            loader.add_value('eventcity', eventcity)
            event_state = response.xpath('.//span[@itemprop="addressregion"]/text()').extract()
            eventstate = ''.join(event_state)
            loader.add_value('eventstate', eventstate)
            event_date = response.xpath('.//span[@class="event_datetime"]/text()').extract()
            eventdate = ''.join(event_date)
            loader.add_value('eventdate', eventdate)
            ticketslink = loader.get_output_value("ticketslink")
            json_id_list = re.findall(r"(\d+)[^-]*$", ticketslink)
            json_id = "".join(json_id_list)
            json_url = "https://www.ticketcity.com/catalog/public/v1/events/" + json_id + "/ticketblocks?p=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
            yield scrapy.Request(json_url, meta={'loader': loader}, callback=self.parse_json, dont_filter=True)

        def parse(self, response):
            """
            # """
            selector = HtmlXPathSelector(response)
            # iterate over tickets
            for ticket in selector.select(self.tickets_list_xpath):
                loader = XPathItemLoader(comparatoritem(), selector=ticket)
                # define loader
                loader.default_input_processor = MapCompose(unicode.strip)
                loader.default_output_processor = Join()
                # iterate over fields and add xpaths to the loader
                loader.add_xpath('eventname', './/span[@class="summary listingeventname"]/text()')
                loader.add_xpath('eventlocation', './/div[@class="divvenue location"]/text()')
                loader.add_xpath('ticketslink', './/a[@class="diveventdetails url"]/@href')
                #loader.add_xpath('eventdatetime' , '//div[@id="diveventdate"]/@title') #datetime type
                #loader.add_xpath('eventtime' , './/*[@class = "productionstime"]/text()')

                print "here ticket link \n" + loader.get_output_value("ticketslink")
                #sel.xpath("//span[@id='practitionerdetails1_label4']/text()").extract()
                ticketsurl = "https://www.ticketcity.com/" + loader.get_output_value("ticketslink")
                ticketsurl = urljoin(response.url, ticketsurl)
                yield scrapy.Request(ticketsurl, meta={'loader': loader}, callback=self.parse_price, dont_filter=True)


    if __name__ == "__main__":
        configure_logging()
        runner = CrawlerRunner()

        @defer.inlineCallbacks
        def crawl():
            yield runner.crawl(myspider)
            yield runner.crawl(myspider3)
            reactor.stop()

        crawl()
        reactor.run()

Also make sure you have these pipelines defined in your settings. I also recommend using scrapy crawl spider_name rather than the code at the bottom of the spiders file until you have this stuff figured out.
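As a rough sketch of what that settings entry might look like (not copied from the asker's project): the dotted paths assume the pipeline classes live in pipelines/pipelines.py as laid out in this answer, and the numbers 300 and 400 are arbitrary example priorities. Scrapy runs pipelines with lower numbers first.

```python
# settings.py (fragment)
# Every pipeline that any spider may use has to be enabled here; the
# check_spider_pipeline decorator then decides per spider whether it runs.
ITEM_PIPELINES = {
    'pipelines.pipelines.comparatorpipeline': 300,   # assumed module path
    'pipelines.pipelines.comparatorpipeline2': 400,  # assumed module path
}
```

With both entries enabled, each spider can be run on its own from the project directory with scrapy crawl comparator or scrapy crawl comparator3, using the spider names defined in the spiders file.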

project structure example:

(image: screenshot of the example project directory structure)

Note: I did not make sure this worked with the stuff for getting the band name from the user. If you want to do that, you are better off doing something similar to this: https://stackoverflow.com/a/15618520/2368836
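Assuming the linked answer refers to Scrapy's spider arguments (values given with -a on the command line, which Scrapy passes to the spider's constructor as keyword arguments), the band name could be taken there instead of via raw_input at import time. The sketch below is only an illustration: basespiderstandin is a hypothetical placeholder for the real Scrapy base class so the snippet runs without Scrapy installed, and bandspider is a made-up name.

```python
class basespiderstandin(object):
    """Hypothetical stand-in for the Scrapy spider base class."""
    def __init__(self, *args, **kwargs):
        pass


class bandspider(basespiderstandin):
    name = 'comparator'

    def __init__(self, bandname=None, *args, **kwargs):
        super(bandspider, self).__init__(*args, **kwargs)
        if not bandname:
            raise ValueError("pass the band name, e.g. -a bandname=some-band")
        self.bandname = bandname
        # Same URL pattern as vs_url in the spiders file, built per instance.
        self.start_urls = [
            "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"
        ]


# Roughly what happens when the spider is started with -a bandname=some-band.
spider = bandspider(bandname="some-band")
print(spider.start_urls[0])
```

In a real project the spider would be started with something like scrapy crawl comparator -a bandname=some-band, which avoids prompting for input every time the spiders module is imported.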

