Scraping an external website within the main project with the Python Scrapy framework


I have been looking for a better way to scrape an external website from the main source website. To explain it better, let me use yelp.com as an example of what I am trying to do (though my target is not Yelp).

  1. I scrape the title and address.
  2. I visit the link that the title leads to, which goes to the company's website.
  3. I want to extract emails from the source code of the company's website. (I know this is difficult, but I am not crawling all pages; I am assuming the site has "contact" in a URL, e.g. site.com/contact.php.)
  4. The point is that while scraping Yelp and storing the data in a field, I also want to get external data from the companies' main websites.

Below is my code; I can't figure out how to do this using Scrapy.

    # -*- coding: utf-8 -*-
    import scrapy
    # Older Scrapy releases exposed these as scrapy.contrib.spiders.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from comb.items import CombItem, SiteItem


    class ComberSpider(CrawlSpider):
        name = "comber"
        allowed_domains = ["example.com"]
        query = 'shoe'
        page = 'http://www.example.com/corp/' + query + '/1.html'
        start_urls = (
            page,
        )
        rules = (Rule(LinkExtractor(allow=(r'corp/.+/\d+\.html'), restrict_xpaths=("//a[@class='next']")),
                      callback="parse_items", follow=True),
                 )

        def parse_items(self, response):
            for sel in response.xpath("//div[@class='item-main']"):
                item = CombItem()
                item['company_name'] = sel.xpath("h2[@class='title']/a/text()").extract()
                item['contact_url'] = sel.xpath("div[@class='company']/a/@href").extract()[0]
                item['gold_supplier'] = sel.xpath("div[@class='item-title']/a/@title").extract()[0]
                company_details = sel.xpath("div[@class='attrs']/div[@class='attr']/span['name']/text()").extract()

                item = self.parse_meta(sel, item, company_details)
                # Follow the contact link and carry the partly filled item along.
                request = scrapy.Request(item['contact_url'], callback=self.parse_site)
                request.meta['item'] = item

                yield request

        def parse_meta(self, sel, item, company_details):
            if (company_details):
                if "products:" in company_details:
                    item['products'] = sel.xpath("./div[@class='value']//text()").extract()
                if "country/region:" in company_details:
                    item['country'] = sel.xpath("./div[@class='right']"
                                                + "/span[@data-coun]/text()").extract()
                if "revenue:" in company_details:
                    item['revenue'] = sel.xpath("./div[@class='right']/"
                                                + "span[@data-reve]/text()").extract()
                if "markets:" in company_details:
                    item['markets'] = sel.xpath("./div[@class='value']/span[@data-mark]/text()").extract()
            return item

        def parse_site(self, response):
            item = response.meta['item']
            # The value of item['websites'] is e.g. http://target-company.com, http://any-other-website.com
            # The aim is to jump to http://company.com, scrape data from its contact page,
            # and store it in the item as item['emails'] = [info@company.com, sales@company.com]

            # Please, how can this be done in the same project?
            # The only thing I can think of is to store item['websites'] and the other values of the item
            # and make another project, but that still would not work because of allowed_domains and start_urls.

            item['websites'] = response.xpath("//div[@class='company-contact-information']/table/tr/td/a/@href").extract()
            print(item)
            print('*' * 50)
            yield item

    """
    from scrapy.item import Item, Field


    class CombItem(Item):
        company_name = Field()
        main_products = Field()
        contact_url = Field()
        revenue = Field()
        gold_supplier = Field()
        country = Field()
        markets = Field()
        product_home = Field()
        websites = Field()
    """
    # emails = Field() is not implemented because the emails need to be extracted from websites with a different start_url

When you issue the request, passing dont_filter=True turns off the OffsiteMiddleware for it, so the URL is not filtered by allowed_domains:

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.

