I've read all the similar questions here, and most of the answers suggest using calibre for this task; however, I'm trying to improve the output.

I've extracted text from a PDF using pdftohtml (part of Poppler) with the -c and -s options. I got an HTML file that contains the text and, surprisingly, the text is correctly un-wrapped, so the main problem of extracting text from PDFs is gone.

If I feed this HTML to calibre (specifically to ebook-convert), I get a dirty EPUB with the following problems that need to be solved:

- The title of the document repeats in the header of every page -> it should be removed.
- Page numbers were in the PDF and appear in the EPUB, which makes no sense -> they should be removed.
- Footnotes appear as normal text -> they should be recognized as footnotes and properly linked.
- The text is "dirty": there are many different classes for paragraphs, with "absolute position" attributes, leading to messy text. Sentences are in the correct order, but the different settings for every class make the text render slightly under, above, before or after (a few pixels) the place where it should be -> all the classes should be removed and replaced only with and when needed.
- Titles appear in bold and are somehow recognized by calibre's heuristic conversion; however, this process is not always performed correctly.
- Additionally, no table of contents is created, and one should be.
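Three of the cleanup steps described above (dropping the per-paragraph classes and absolute-position styles, removing paragraphs that contain only a page number, and removing the repeated document title) could be sketched in Python before the HTML is handed to ebook-convert. This is only a rough sketch under stated assumptions: the function name `clean_html`, its `running_title` parameter, and the sample markup are hypothetical, and the regular expressions assume each header, page number, and paragraph sits in its own `<p>` element with quoted attributes. Real pdftohtml output may differ, and a proper HTML parser would be more robust than regular expressions.

```python
import re
from typing import Optional

# Matches class="..." and style="..." attributes (single or double quoted).
ATTR_RE = re.compile(
    r'\s+(?:class|style)\s*=\s*(?:"[^"]*"|\'[^\']*\')', re.IGNORECASE
)

# Matches a <p> element whose only content is digits (assumed to be a page number).
PAGE_NUM_RE = re.compile(r'<p\b[^>]*>\s*\d+\s*</p>\s*', re.IGNORECASE)


def clean_html(html: str, running_title: Optional[str] = None) -> str:
    """Return html without page-number paragraphs, without paragraphs that
    contain only running_title, and without class/style attributes.

    Hypothetical helper: names and heuristics are illustrative assumptions.
    """
    # 1. Remove paragraphs that contain nothing but a number.
    html = PAGE_NUM_RE.sub('', html)

    # 2. Remove every paragraph that consists only of the running title.
    if running_title:
        title_re = re.compile(
            r'<p\b[^>]*>\s*' + re.escape(running_title) + r'\s*</p>\s*',
            re.IGNORECASE,
        )
        html = title_re.sub('', html)

    # 3. Strip the positioning classes and inline styles from what is left.
    return ATTR_RE.sub('', html)


if __name__ == "__main__":
    # Made-up sample resembling positioned paragraphs; not real tool output.
    sample = (
        '<p style="position:absolute;top:10px;left:20px" class="ft00">My Book Title</p>\n'
        '<p style="position:absolute;top:50px;left:20px" class="ft01">First sentence.</p>\n'
        '<p style="position:absolute;top:900px;left:300px" class="ft02">12</p>\n'
    )
    print(clean_html(sample, running_title="My Book Title"))
```

Note that this removes every paragraph matching the title, including the first one, and would also delete a genuine paragraph consisting only of a number. Footnote linking and table-of-contents generation are not covered by this sketch.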