Scraping an external website within the main project with the Python Scrapy framework
I have been looking for a better way to scrape an external website from within the main source website. To explain better, let me use yelp.com as an example of what I am trying to do (though my actual target is not Yelp):
- I scrape the title and address.
- I visit the link in each title, which leads to the company's own website.
- I extract emails from the source code of the company's main website. (I know this is difficult; I am not crawling all its pages, just assuming the site has its contact info at a URL like site.com/contact.php.)
- The point is that while scraping Yelp and storing the data in item fields, I also want the external data from each company's main website.
Below is my code; I can't figure out how to do this using Scrapy.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from comb.items import CombItem, SiteItem


class ComberSpider(CrawlSpider):
    name = "comber"
    allowed_domains = ["example.com"]
    query = 'shoe'
    page = 'http://www.example.com/corp/' + query + '/1.html'
    start_urls = (page,)
    rules = (Rule(LinkExtractor(allow=(r'corp/.+/\d+\.html'),
                                restrict_xpaths=("//a[@class='next']")),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        for sel in response.xpath("//div[@class='item-main']"):
            item = CombItem()
            item['company_name'] = sel.xpath("h2[@class='title']/a/text()").extract()
            item['contact_url'] = sel.xpath("div[@class='company']/a/@href").extract()[0]
            item['gold_supplier'] = sel.xpath("div[@class='item-title']/a/@title").extract()[0]
            company_details = sel.xpath("div[@class='attrs']/div[@class='attr']"
                                        "/span['name']/text()").extract()
            item = self.parse_meta(sel, item, company_details)
            request = scrapy.Request(item['contact_url'], callback=self.parse_site)
            request.meta['item'] = item
            yield request

    def parse_meta(self, sel, item, company_details):
        if company_details:
            if "products:" in company_details:
                item['products'] = sel.xpath("./div[@class='value']//text()").extract()
            if "country/region:" in company_details:
                item['country'] = sel.xpath("./div[@class='right']"
                                            "/span[@data-coun]/text()").extract()
            if "revenue:" in company_details:
                item['revenue'] = sel.xpath("./div[@class='right']/"
                                            "span[@data-reve]/text()").extract()
            if "markets:" in company_details:
                item['markets'] = sel.xpath("./div[@class='value']/span[@data-mark]/text()").extract()
        return item

    def parse_site(self, response):
        item = response.meta['item']
        # item['websites'] holds values like http://target-company.com, http://any-other-website.com
        # My aim is to jump to http://company.com, scrape data from its contact page, and
        # store it on the item, e.g. item['emails'] = ['info@company.com', 'sales@company.com']
        # Please, how can this be done in the same project? The only thing I can think of is to
        # store item['websites'] and the other values of the item and make another project,
        # but that still would not work because of allowed_domains and start_urls.
        item['websites'] = response.xpath("//div[@class='company-contact-information']"
                                          "/table/tr/td/a/@href").extract()
        print(item)
        print('*' * 50)
        yield item


"""
from scrapy.item import Item, Field


class CombItem(Item):
    company_name = Field()
    main_products = Field()
    contact_url = Field()
    revenue = Field()
    gold_supplier = Field()
    country = Field()
    markets = Field()
    product_home = Field()
    websites = Field()
    # emails = Field() is not implemented because the emails need to be
    # extracted from websites outside the spider's start_url domain
"""
When you issue the request, pass dont_filter=True to turn off the OffsiteMiddleware, so that the URL is not filtered against allowed_domains. From the Scrapy documentation:

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in the allowed domains.
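A minimal sketch of how parse_site could chain one more request per external site without being filtered. The helper names contact_page_url and extract_emails are hypothetical, the contact path contact.php is the assumption stated in the question, and the email pattern is a simplification, not a full RFC 5322 validator. Only the standard library is used here so the helpers can run on their own; the commented spider methods show where dont_filter=True would go.

```python
import re
from urllib.parse import urljoin

# Simple email pattern; good enough for plain contact pages.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contact_page_url(site_url, path="contact.php"):
    """Guess the contact page URL, assuming the site.com/contact.php layout."""
    return urljoin(site_url.rstrip("/") + "/", path)


def extract_emails(html):
    """Return unique email addresses found in a page's source, in document order."""
    emails = []
    for match in EMAIL_RE.findall(html):
        if match not in emails:
            emails.append(match)
    return emails


# Inside the spider, parse_site would yield one more request per external site,
# and a parse_contact callback (hypothetical name) would finish the item:
#
#     def parse_site(self, response):
#         item = response.meta['item']
#         item['websites'] = response.xpath(...).extract()
#         for site in item['websites']:
#             yield scrapy.Request(contact_page_url(site),
#                                  callback=self.parse_contact,
#                                  meta={'item': item},
#                                  dont_filter=True)  # skip the offsite filter
#
#     def parse_contact(self, response):
#         item = response.meta['item']
#         item['emails'] = extract_emails(response.text)
#         yield item
```

Note that dont_filter also disables duplicate-request filtering for that request, so the same contact URL may be fetched more than once if it is yielded twice.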