I'd love to know what you guys think, please. I have been researching this for a few days and can't seem to find where I'm going wrong. Any help is highly appreciated.
I want to systematically crawl a question site, starting from the URL in the code below and using its pagination to crawl the rest of the pages.
My current code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from acer.items import AcerItem

class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()
        for question in questions:
            item = AcerItem()
            item['title'] = question.xpath('//h1/text()').extract()
            item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item
When I run the spider it doesn't throw any errors, but it outputs inconsistent results: it ends up scraping the same article page twice. I'm thinking it might be the selectors I have used, but I can't narrow it down any further. Any help please?
Kevin, I had a similar but different problem earlier today, with my CrawlSpider visiting unwanted pages. Someone responded to my question with the suggestion of checking the LinkExtractor, as documented here: http://doc.scrapy.org/en/latest/topics/link-extractors.html
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
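For example, here is a minimal sketch (not from the question) of how some of those parameters can be used to narrow where links are extracted from. The pagination XPath is an assumption about studyacer.com's markup, not something I have verified:

from scrapy.linkextractors import LinkExtractor

# Pull links only from the question table cells used in your XPath above.
question_links = LinkExtractor(restrict_xpaths='//td[@class="word-break"]')

# Pull links only from the pagination block (this XPath is an assumption).
pagination_links = LinkExtractor(restrict_xpaths='//div[@class="pagination"]')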
I ended up reviewing the allow / deny components to focus the crawler on specific subsets of pages. You can specify regexes that express the relevant substrings of the links to allow (include) or deny (exclude). I tested my expressions using http://www.regexpal.com/
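Applied to your spider, that could look something like the sketch below. The URL regexes are only guesses at how studyacer.com structures its listing and question pages, so check them against the real links (regexpal is handy for that); I have also simplified parse_item to take the title and body straight from the question page, which is what I assume you want:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from acer.items import AcerItem

class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        # Follow listing/pagination pages without treating them as items
        # (the /latest pattern is an assumption about the site's URLs).
        Rule(LinkExtractor(allow=(r'/latest',)), follow=True),
        # Only question detail pages reach parse_item
        # (the /question/ pattern is likewise an assumption).
        Rule(LinkExtractor(allow=(r'/question/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = AcerItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['body'] = response.xpath('//div[@class="row-fluid"][2]//p/text()').extract()
        yield item

Note that when several rules match the same link, the first rule defined wins, so keep the broader "follow" rule and the "parse" rule ordered deliberately.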
I found this approach sufficient to prevent duplicates, but in case you're still seeing them, here is an article I found earlier in the day on how to prevent duplicates, although I didn't have to implement the fix:
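If duplicate items still slip through after tightening the extractor, one generic belt-and-braces pattern (my own sketch, not the fix from that article, and it assumes the duplicates come from the same page being parsed more than once) is to remember which URLs have already produced an item:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from acer.items import AcerItem

class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, *args, **kwargs):
        super(AcercrawlerSpider, self).__init__(*args, **kwargs)
        self.seen_urls = set()  # pages already yielded as items

    def parse_item(self, response):
        # Yield each page at most once, even if the crawler reaches it
        # again through a different chain of links.
        if response.url in self.seen_urls:
            return
        self.seen_urls.add(response.url)
        item = AcerItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['body'] = response.xpath('//div[@class="row-fluid"][2]//p/text()').extract()
        yield item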