I’m trying to scrape product information from a webpage using Scrapy. The page I want to scrape works like this:
- starts with a product_list page with 10 products
- clicking the “next” button loads the next 10 products (the URL doesn’t change between the two pages)
- I use LinkExtractor to follow each product link to its product page and get all the information I need
I tried to replicate the AJAX call made by the “next” button but couldn’t get it working, so I’m giving Selenium a try. I can run Selenium’s webdriver in a separate script, but I don’t know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
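
To make the pagination part concrete, here is a minimal sketch of what the Selenium side might look like as a helper that a spider could call. It is only an illustration under assumptions: the function name `collect_listing_pages`, the `next_xpath` argument, and the `after_click` hook are all hypothetical, and the real XPath of the “next” button depends on the site.

```python
def collect_listing_pages(driver, next_xpath, max_pages=50, after_click=None):
    """Click "next" repeatedly and return the HTML of every listing page seen.

    driver      -- a Selenium WebDriver that has already loaded the first page
    next_xpath  -- XPath of the "next" button (hypothetical; adjust to the site)
    max_pages   -- safety cap so a button that never disappears cannot loop forever
    after_click -- optional callable(driver) used to wait for the AJAX reload,
                   for example one wrapping Selenium's WebDriverWait
    """
    pages = [driver.page_source]
    while len(pages) < max_pages:
        # "xpath" is the string value of selenium's By.XPATH locator strategy
        buttons = driver.find_elements("xpath", next_xpath)
        if not buttons:
            break  # no "next" button found: assume this is the last page
        buttons[0].click()
        if after_click is not None:
            after_click(driver)  # give the page time to load the next 10 products
        pages.append(driver.page_source)
    return pages
```

One place this might be called from is a callback for the start URL (for instance `parse_start_url`, which `CrawlSpider` invokes for start URL responses): create the driver, call `driver.get(response.url)`, run the helper, wrap each returned HTML string in Scrapy’s `Selector(text=html)` to pull out the product links, and yield a `Request` per link with `parse_product` as the callback. If the site hides or disables the “next” button on the last page instead of removing it, the stop condition would need to check for that instead.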