python - PDFMiner and text encoding


I am working on PDF text extraction and have an issue. I use Python 3.5 and pdfminer.six (the Python 3 port of PDFMiner) to extract the content. Below is my code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path, codec='utf-8'):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Feed every page of the PDF through the interpreter.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

I tried this function on a PDF. When I print the result, I get the following:

text
> 5\xa0killed,\xa09\xa0injured\xa0in\xa0explosion\xa0at\xa0rio\xa0de\xa0janeiro\xa0building\xa0\xad\xa0the\xa0new\xa0york\xa0times\n\namericas\n\n5\xa0killed (...)

print(text)
> 5 killed, 9 injured in explosion at rio de janeiro building ­ the new york times
>
> americas
>
> 5 killed(...)

So, how can I remove these Unicode characters (\xa0, \xad) from the result, without using the replace function or a regex?

Thanks!

Just use the third-party unidecode package:

from unidecode import unidecode 

and then pass the extracted text through it:

text = unidecode(retstr.getvalue()) 

See if that works.
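
If installing an extra package is not an option, a rough standard-library alternative might look like the sketch below. It is a different technique from unidecode, not a drop-in equivalent: it assumes that NFKC normalization is acceptable for your text (it folds the no-break space \xa0 into an ordinary space) and then drops invisible format characters such as the soft hyphen \xad. Unlike unidecode, it keeps accented letters rather than transliterating them to ASCII. The function name normalize_extracted_text and the sample string are made up for illustration.

```python
import unicodedata


def normalize_extracted_text(text):
    """Fold compatibility characters and drop invisible format characters."""
    # NFKC maps compatibility characters such as NO-BREAK SPACE (U+00A0)
    # to their plain equivalents (here, an ordinary space).
    folded = unicodedata.normalize("NFKC", text)
    # SOFT HYPHEN (U+00AD) is left untouched by NFKC. It belongs to the
    # Unicode category "Cf" (format), so filter that category explicitly.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


# Hypothetical sample, shortened from the output shown in the question.
sample = "5\xa0killed,\xa09\xa0injured\xa0\xad\xa0the\xa0new\xa0york\xa0times\n\namericas"
cleaned = normalize_extracted_text(sample)
print(repr(cleaned))
```

Removing the soft hyphen leaves two adjacent spaces where it used to sit, so you may still want to collapse whitespace afterwards if that matters for your use case.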

