I'm trying to scrape some reviews for university research. The code I have prints out all of the information I need, but I also need to find the rating and the user ID.
This is my code so far:
    import requests
    from bs4 import BeautifulSoup

    s = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
               'Referer': "http://www.imdb.com/"}
    url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'
    r = s.get(url).content
    page = s.get(url)
    soup = BeautifulSoup(page.content, "lxml")
    soup.prettify()
    cj = s.cookies
    requests.utils.dict_from_cookiejar(cj)
    s.post(url, headers=headers)

    # Strip the tags that only add noise to the printed text.
    for i in soup('style'):
        i.decompose()
    for s in soup('script'):
        s.decompose()
    for t in soup('table'):
        t.decompose()
    for ip in soup('input'):
        ip.decompose()

    # The reviews live inside the div with id "tn15content".
    important = soup.find("div", id='tn15content')
    print(important.text)
This returns all of the information I need in a printout like this.
Output (just showing one review; it prints out all of them on the page):
120 out of 141 people found following review useful: 1 of oscar best pictures deserved honor. author: gachronicled usa 18 february 2001 happened flipping channels today , saw on. since had been several years since last saw clicked on, didn't mean stay. happened, found film gripping before. own kids started watching it, too, , enjoyed - more satisfying me considering kind of current junk they're used to. no, not action-packed thriller, nor there juicy love scenes between abrahams , actress girlfriend. there no "colorful" language speak of; no politically correct agenda underlying tale of cambridge jew , scottish christian.this story drives people internally - pushes them excel or @ least make attempt so. story personal , societal values, loyalty, faith, desire accepted in society , healthy competition without utter selfishness characterizes of athletic endeavors of our day. characters not alike in motivation, end result same far accomplishments.my adolescent son (whose favorite movies of star wars movies , matrix) couldn't stop asking questions throughout movie hooked. great educational opportunity entertainment. if you've never seen film or it's been long time, recommend unabashedly, regardless of labels many have tried give being slow-paced or causing boredom. in addition great story - based on real people , events - photography , music fabulous , moving. it's no mistake movie has been spoofed , otherwise stolen in last twenty years - it's unforgettable movie , in opinion bashers hate oscar winners on principle or don't philosophies espoused protagonists.
However, I also need the user ID and the rating given for each movie.
The user ID is contained in the href attribute of each anchor element ...
<a href="/user/ur0511587/">
The rating is contained in each img element, with the rating (for example "10/10") in the alt attribute.
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif">
Any tips on how I would be able to scrape both of these items in addition to the output I get from printing "important.text", without printing "important" itself? I'm hesitant to print "important" because it would be quite messy, with all of the tags and other unnecessary stuff. Thanks for any input.
You can use CSS selectors. a[href^="/user/ur"] will find all the anchors that have an href starting with /user/ur, and img[alt*="/10"] will find all the img tags that have an alt attribute with the value "some_number/10":
    # Attribute values are quoted so the selectors are valid CSS.
    user_ids = [a["href"].split("ur")[1].rstrip("/")
                for a in important.select('a[href^="/user/ur"]')]
    ratings = [img["alt"] for img in important.select('img[alt*="/10"]')]
    print(user_ids, ratings)
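The split("ur")[1].rstrip("/") step works for an href like /user/ur0511587/, but it would keep any extra path segment that follows the ID. As a slightly stricter alternative, here is a minimal sketch using a regular expression. The helper name extract_user_id and the pattern are my own assumptions for illustration, not something taken from the page or from Beautiful Soup:

```python
import re

# Assumed shape of a profile link: "/user/ur" followed by digits.
USER_HREF = re.compile(r"^/user/ur(\d+)")


def extract_user_id(href):
    """Return the digits after 'ur' in a profile href, or None if it does not match."""
    match = USER_HREF.match(href)
    return match.group(1) if match else None


print(extract_user_id("/user/ur0511587/"))
print(extract_user_id("/title/tt0082158/"))
```

Returning None for non-matching links makes it easy to skip anchors that are not user profiles.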
The problem is that not every review has a rating, and finding every a[href^="/user/ur"] would give you more than you want. To deal with that, we can find the specific div that contains the anchor and the rating (if present) by finding the small tag that contains the text "review useful:" and then calling .parent to select the div.
    import re

    important = soup.find("div", id='tn15content')
    for small in important.find_all("small", text=re.compile("review useful:")):
        div = small.parent  # the div holding this review's author link and rating
        user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
        rating = div.select_one('img[alt*="/10"]')  # None if the review has no rating
        print(user_id, rating["alt"] if rating else "n/a")

Now we get:

    ('0511587', '10/10')
    ('0209436', '9/10')
    ('1318093', 'n/a')
    ('0556711', '10/10')
    ('0075285', '9/10')
    ('0059151', '10/10')
    ('4445210', '9/10')
    ('0813687', 'n/a')
    ('0033913', '10/10')
    ('0819028', 'n/a')
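To see the per-review grouping without hitting the site, here is a small self-contained sketch that runs the same idea against a static snippet. The HTML below is made up and heavily simplified (modeled on the tags quoted in the question, not copied from the real page), and it uses the built-in "html.parser" and the string= argument, which are my choices here rather than part of the code above:

```python
import re

from bs4 import BeautifulSoup

# Made-up, simplified markup modeled on the snippets quoted in the question.
# The second review deliberately has no rating image.
SAMPLE_HTML = """
<div id="tn15content">
  <div>
    <small>120 out of 141 people found the following review useful:</small>
    <a href="/user/ur0511587/">gachronicled</a>
    <img width="102" height="12" alt="10/10" src="100.gif">
  </div>
  <p>First review text.</p>
  <div>
    <small>3 out of 4 people found the following review useful:</small>
    <a href="/user/ur1318093/">another_user</a>
  </div>
  <p>Second review text.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
important = soup.find("div", id="tn15content")

rows = []
for small in important.find_all("small", string=re.compile("review useful:")):
    div = small.parent  # the header div of a single review
    href = div.select_one('a[href^="/user/ur"]')["href"]
    rating = div.select_one('img[alt*="/10"]')  # None when the review is unrated
    rows.append((href.split("ur")[1].rstrip("/"), rating["alt"] if rating else "n/a"))

print(rows)
```

Because each lookup is scoped to one div, a missing rating only affects that review instead of shifting the two lists out of step.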
You are doing a lot more work in your code than you need to; you only need a single request. The full code needed would be:
    import re

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'

    # One GET request is enough; no session, cookies, or POST needed.
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
    important = soup.find("div", id='tn15content')
    for small in important.find_all("small", text=re.compile("review useful:")):
        div = small.parent
        user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
        rating = div.select_one('img[alt*="/10"]')
        print(user_id, rating["alt"] if rating else "n/a")
To get the review text, just find the next p after the div:
    for small in important.find_all("small", text=re.compile("review useful:")):
        div = small.parent
        user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
        rating = div.select_one('img[alt*="/10"]')
        print(user_id, rating["alt"] if rating else "n/a")
        # The review body is the first p element that follows the header div.
        print(div.find_next("p").text.strip())

That will give you output like:
    ('0511587', '10/10')
    happened flipping channels today , saw on. since had been several years since last saw clicked on, didn't mean stay. happened, found film gripping before. own kids started watching it, too, , enjoyed - more satisfying me considering kind of current junk they're used to. no, not action-packed thriller, nor there juicy love scenes between abrahams , actress girlfriend. there no "colorful" language speak of; no politically correct agenda underlying tale of cambridge jew , scottish christian.this story drives people internally - pushes them excel or @ least make attempt so. story personal , societal values, loyalty, faith, desire accepted in society , healthy competition without utter selfishness characterizes of athletic endeavors of our day. characters not alike in motivation, end result same far accomplishments.my adolescent son (whose favorite movies of star wars movies , matrix) couldn't stop asking questions throughout movie hooked. great educational opportunity entertainment. if you've never seen film or it's been long time, recommend unabashedly, regardless of labels many have tried give being slow-paced or causing boredom. in addition great story - based on real people , events - photography , music fabulous , moving. it's no mistake movie has been spoofed , otherwise stolen in last twenty years - it's unforgettable movie , in opinion bashers hate oscar winners on principle or don't philosophies espoused protagonists.
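Since this is for research, you will probably want the results in a file rather than printed. Here is a minimal sketch using only the standard library that turns the alt text into a number and writes CSV rows. The helper names parse_rating and write_reviews_csv and the column names are hypothetical, and the sample pairs are a few of the values from the earlier output:

```python
import csv
import io


def parse_rating(alt):
    """Convert an alt value such as '9/10' to the integer 9; return None otherwise."""
    score, sep, scale = alt.partition("/")
    if sep and score.isdigit() and scale == "10":
        return int(score)
    return None


def write_reviews_csv(rows, handle):
    """Write (user_id, alt_text) pairs to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["user_id", "rating"])
    for user_id, alt in rows:
        rating = parse_rating(alt)
        # Leave the cell empty when the review has no rating.
        writer.writerow([user_id, "" if rating is None else rating])


# Sample pairs taken from the earlier output; "n/a" marks an unrated review.
rows = [("0511587", "10/10"), ("0209436", "9/10"), ("1318093", "n/a")]
buffer = io.StringIO()
write_reviews_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("reviews.csv", "w", newline="") in place of the StringIO buffer. Keeping the user ID as a string preserves its leading zeros.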