CSS selectors - Python Scrapy Spider: Inconsistent results


I would love to know what you guys think, please. I have been researching this for a few days and can't seem to find what is going wrong. Any help is highly appreciated.

I want to systematically crawl a question site, starting from the URL http://www.studyacer.com/latest and using the pagination to crawl the rest of the pages.

My current code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from acer.items import AcerItem


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()

        for question in questions:
            item = AcerItem()
            item['title'] = question.xpath('//h1/text()').extract()
            item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

When I run the spider it doesn't throw any errors, but it outputs inconsistent results, sometimes scraping the same article page twice. I'm thinking it might be the selectors I have used, but I can't narrow it down any further. Any help, please?

Kevin, I had a similar but different problem earlier today, with a CrawlSpider visiting unwanted pages. The response to my question was a suggestion to check the LinkExtractor, as documented here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

I ended up reviewing the allow / deny components to focus the crawler on specific subsets of pages. You can specify, using regex, the relevant substrings of the links to allow (include) or deny (exclude). I tested my expressions using http://www.regexpal.com/
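
For what it's worth, here is a minimal sketch of how that could look applied to your spider. The allow / deny regexes below are hypothetical placeholders, not StudyAcer's real URL structure, so you would need to substitute the patterns that actually match the pagination and question pages you want:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        # Follow pagination pages without parsing them as items.
        # These regexes are illustrative placeholders - adjust to the site.
        Rule(LinkExtractor(allow=(r'/latest\?page=\d+',)), follow=True),
        # Parse question detail pages, skipping obviously irrelevant links.
        Rule(LinkExtractor(allow=(r'/question/',),
                           deny=(r'/user/', r'/login')),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Extract each item from the question page itself,
        # rather than looping over hrefs found on the listing page.
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'body': response.xpath('//div[@class="row-fluid"][2]//p/text()').extract(),
        }

The design idea is simply that each Rule narrows what the crawler follows, so parse_item is only ever called on pages you actually want to turn into items.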

I found this approach sufficient to prevent duplicates, but if you're still seeing them, here is an article I was looking at earlier in the day on how to prevent duplicate URL crawling, although I didn't have to implement that fix myself:

Avoid duplicate URL crawling

https://stackoverflow.com/a/21344753/6582364
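
In case it's useful, here is a generic sketch of spider-level de-duplication using a seen-set. This is a common pattern rather than necessarily the exact fix from the linked answer, and the class name here is just illustrative:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DedupAcercrawlerSpider(CrawlSpider):
    name = 'acercrawler_dedup'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # URLs for which an item has already been yielded.
        self.seen_item_urls = set()

    def parse_item(self, response):
        # Scrapy's built-in request dupefilter already drops repeated requests;
        # this extra guard only matters if the same article is reachable
        # through several distinct URLs (e.g. query-string variants).
        if response.url in self.seen_item_urls:
            return
        self.seen_item_urls.add(response.url)

        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'body': response.xpath('//div[@class="row-fluid"][2]//p/text()').extract(),
        }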

