I am working on PDF text extraction and have run into an issue. I use Python 3.5 with pdfminer.six (pdfminer with Python 3 support) to extract the content. My code is below:
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO

    def convert_pdf_to_txt(path, codec='utf-8'):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        text = retstr.getvalue()
        fp.close()
        device.close()
        retstr.close()
        return text
I tried this function on a PDF. When I inspect and print the result, I get the following:
    text
    > 5\xa0killed,\xa09\xa0injured\xa0in\xa0explosion\xa0at\xa0rio\xa0de\xa0janeiro\xa0building\xa0\xad\xa0the\xa0new\xa0york\xa0times\n\namericas\n\n5\xa0killed (...)

    print(text)
    > 5 killed, 9 injured in explosion at rio de janeiro building new york times
    >
    > americas
    >
    > 5 killed(...)
So, how can I remove these Unicode characters (such as \xa0 and \xad) from the result, without using the replace function or a regex?
Thanks!
Just add this import:
    from unidecode import unidecode
and then change the line that reads the result to:
    text = unidecode(retstr.getvalue())
See if that works.
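If installing a third-party package is not an option, something similar can probably be done with the standard library alone. The sketch below is only one possible approach, not part of the answer above: `clean_text` is a hypothetical helper name, and it assumes the unwanted characters are compatibility characters such as the no-break space (\xa0) or invisible format characters such as the soft hyphen (\xad). It uses neither `replace` nor a regex.

```python
import unicodedata


def clean_text(text):
    """Hypothetical helper: normalise compatibility characters, drop format characters."""
    # NFKC maps compatibility characters such as the no-break space (U+00A0)
    # to their plain equivalents (here, an ordinary space).
    normalized = unicodedata.normalize("NFKC", text)
    # The soft hyphen (U+00AD) is in category "Cf" (format) and is left
    # untouched by NFKC, so characters in that category are filtered out.
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")


# Shortened sample based on the output shown in the question.
sample = "5\xa0killed,\xa09\xa0injured\xa0\xad\xa0the\xa0new\xa0york\xa0times"
print(repr(clean_text(sample)))
```

In the function from the question this would mean something like `text = clean_text(retstr.getvalue())`. Unlike unidecode, this keeps accented letters and other non-ASCII text as they are, which may or may not be what you want.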