
#1 2012-08-31 17:27:53

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

[0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

[0916 Update]
Added 2 more demo pages:
http://coolwanglu.github.com/pdf2htmlEX/demo/cheat.html
http://coolwanglu.github.com/pdf2htmlEX … eneve.html

* Completely removed Boost
* Relaxed the C++11 dependency; GCC 4.4.6 or later is now supported
* Links are now supported (In-document jumping is accurate to pages)
* Fixed an encoding problem for some fonts.

Demo comes first:
http://coolwanglu.github.com/pdf2htmlEX/demo/demo.html

Another (with CJK):
http://coolwanglu.github.com/pdf2htmlEX/demo/chn.html

Home page:
https://github.com/coolwanglu/pdf2htmlEX

Special thanks to Arthur Titeica for the AUR package.
https://aur.archlinux.org/packages.php?ID=62426


There are basically 2 types of pdf-to-html converters:
One is roughly a pdf-to-text converter with a few pre-defined formats in HTML.
The other is a render-everything-as-images converter, which loses all text and generates huge files.


But pdf2htmlEX takes the advantages of both, retaining both text and styling.
Features:
1. Extracts and embeds fonts from the PDF
2. Optimizes for the web while keeping the rendering precise
3. Non-text objects are rendered as images
4. Single-file output mode -- I know you hate separated font/image files


To compile & install, you need:
* a recent poppler (>= 0.20.3), configured with '--enable-xpdf-headers'
* the latest git version of fontforge (https://github.com/fontforge/fontforge), because I submitted a few features/bugs to it for pdf2htmlEX
* the Boost C++ library (no longer needed as of the 0916 update above; see the project home page for the detailed list of dependencies)
* cmake
* GCC with C++11 support (as of the 0916 update, GCC 4.4.6 or later)
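
If you want to check your toolchain against the minimum versions mentioned in this post before building, here is a minimal sketch in Python. The helper names (parse_version, meets_minimum, check) are made up for illustration and are not part of pdf2htmlEX; the only values taken from this post are poppler >= 0.20.3 and GCC >= 4.4.6. You would have to supply the installed versions yourself.

```python
import re

# Minimum versions taken from this post (poppler from the dependency list,
# GCC from the 0916 update). Everything else here is a hypothetical helper.
MINIMUMS = {"poppler": "0.20.3", "gcc": "4.4.6"}


def parse_version(text):
    """Turn a dotted version string such as '0.20.3' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check(installed_versions):
    """Return the names of tools that are missing or older than required."""
    return [
        name
        for name, minimum in MINIMUMS.items()
        if name not in installed_versions
        or not meets_minimum(installed_versions[name], minimum)
    ]
```

For example, check({"poppler": "0.20.4", "gcc": "4.4.5"}) reports only gcc as too old. Versions are compared number by number, so this is only good for plain dotted version strings.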

Any suggestions, forks/stars on GitHub, or bug reports are appreciated.

Last edited by coolwanglu (2012-09-16 14:41:29)


#2 2012-08-31 19:50:17

avx
Member
Registered: 2011-07-05
Posts: 71

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

I must admit, this is pretty impressive to me, could be a good starting point to get saner pdf->epub.

I know PDF is a pita to handle and parse, but here are some feature wishes:
a) automatically create working links for any valid URL and mail address
b) try to find the table of contents and link it
c) link objects/images to open in a new window/tab, so I can look at them and read the surrounding text more easily.

Will try it over the weekend and give feedback, thanks so far, much appreciated.


#3 2012-08-31 20:12:25

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,739

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

coolwanglu,
Welcome to the Arch Forums.  Very nice application you have put together there.

Generally, we reserve these forums for Arch support.  That you are using Ubuntu does not violate our rules as you are not asking for support and are not about to create confusion amongst Arch users.

We do, however, strongly encourage users to use our build system so that Pacman (our package manager) can keep abreast of the system files that are installed.  I see you asked if anyone would like to help package this for Arch.  We have a subforum for just that purpose; I am wondering if this thread should perhaps be moved to that subforum.

a) What are your thoughts on the move?
b) Why not make the move to Arch? ;)


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way


#4 2012-09-01 11:48:46

roentgen
Member
Registered: 2011-03-15
Posts: 91

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

Hi.
I've made an initial AUR package.

While it's functional as is with the stable fontforge from [extra], the HTML rendering is a bit odd in my tests. I plan to check with fontforge-git later.

Thanks for the cool program. Just the other day I was looking for something like that and couldn't find anything functional.

Edit: with fontforge-git everything looks beautiful ;)

Last edited by roentgen (2012-09-01 12:09:09)


#5 2012-09-01 12:01:06

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

avx wrote:

I must admit, this is pretty impressive to me, could be a good starting point to get saner pdf->epub.

I know PDF is a pita to handle and parse, but here are some feature wishes:
a) automatically create working links for any valid URL and mail address
b) try to find the table of contents and link it
c) link objects/images to open in a new window/tab, so I can look at them and read the surrounding text more easily.

Will try it over the weekend and give feedback, thanks so far, much appreciated.

Thanks for your attention.

So far I've been working on handling all kinds of text/font issues. In future versions I plan to support other objects (images, links, drawings, etc.) "natively" in HTML, so (a) and (b) are in the plan.
But I'm not sure about (c): what do you expect to see in a new tab after clicking an image?


#6 2012-09-01 12:05:24

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

ewaller wrote:

coolwanglu,
Welcome to the Arch Forums.  Very nice application you have put together there.

Generally, we reserve these forums for Arch support.  That you are using Ubuntu does not violate our rules as you are not asking for support and are not about to create confusion amongst Arch users.

We do, however, strongly encourage users to use our build system so that Pacman (our package manager) can keep abreast of the system files that are installed.  I see you asked if anyone would like to help package this for Arch.  We have a subforum for just that purpose; I am wondering if this thread should perhaps be moved to that subforum.

a) What are your thoughts on the move?
b) Why not make the move to Arch? ;)

Hello,
sorry if I have posted in the wrong subforum,
I just saw the description "A place for true innovation. Share your own created utilities with the Arch community." and came in.

I didn't intend to advertise the Ubuntu PPA; I just posted a general description and wanted to spread the word about this tool.
As one user has already kindly made a package, I'll remove the PPA line and add a link to the package instead.
Would this be OK?


#7 2012-09-01 12:08:10

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Hi.
I've made an initial AUR package.

While it's functional as is with the stable fontforge from [extra], the HTML rendering is a bit odd in my tests. I plan to check with fontforge-git later.

Thanks for the cool program. Just the other day I was looking for something like that and couldn't find anything functional.

Thank you very much!
I'll put it into the git repo.

I've submitted a few features/bugs to fontforge recently for pdf2htmlEX, so the scripts may not be valid for earlier versions of fontforge.
I think the oddness you saw was due to incorrect fonts, as no fonts were actually generated.
Please do check with the latest version and see if it works.


#8 2012-09-01 12:14:31

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Hi.
I've made an initial AUR package.

While it's functional as is with the stable fontforge from [extra], the HTML rendering is a bit odd in my tests. I plan to check with fontforge-git later.

Thanks for the cool program. Just the other day I was looking for something like that and couldn't find anything functional.

Edit: with fontforge-git everything looks beautiful ;)

Glad to hear that :)


#9 2012-09-01 12:21:38

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Hi.
I've made an initial AUR package.

While it's functional as is with the stable fontforge from [extra], the HTML rendering is a bit odd in my tests. I plan to check with fontforge-git later.

Thanks for the cool program. Just the other day I was looking for something like that and couldn't find anything functional.

Edit: with fontforge-git everything looks beautiful ;)

I'm not sure how best to link to your AUR package.

Shall I put the file into the git repo, or link to the file in aur.archlinux.org? Which one is better?


#10 2012-09-01 12:24:07

roentgen
Member
Registered: 2011-03-15
Posts: 91

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

coolwanglu wrote:

I'm not sure how best to link to your AUR package.

Shall I put the file into the git repo, or link to the file in aur.archlinux.org? Which one is better?

Just linking to the page is fine for archlinux users. Thanks.


#11 2012-09-01 12:31:58

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:
coolwanglu wrote:

I'm not sure how best to link to your AUR package.

Shall I put the file into the git repo, or link to the file in aur.archlinux.org? Which one is better?

Just linking to the page is fine for archlinux users. Thanks.

Done.


#12 2012-09-05 13:05:57

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Hi.
I've made an initial AUR package.

While it's functional as is with the stable fontforge from [extra], the HTML rendering is a bit odd in my tests. I plan to check with fontforge-git later.

Thanks for the cool program. Just the other day I was looking for something like that and couldn't find anything functional.

Edit: with fontforge-git everything looks beautiful ;)

I've updated several things in the devv branch.

The good news is that fontforge-git is no longer required; a recent release should be enough. Also, fontforge is now linked directly instead of being called via scripts, so it should be faster.

The bad news is that fontforge.so is not officially supported, so I'm doing something heuristic in CMakeLists.txt.

Can you please check if it works in Arch Linux?


#13 2012-09-05 14:14:35

roentgen
Member
Registered: 2011-03-15
Posts: 91

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

coolwanglu wrote:

The good news is that fontforge-git is no longer required; a recent release should be enough. Also, fontforge is now linked directly instead of being called via scripts, so it should be faster.

The bad news is that fontforge.so is not officially supported, so I'm doing something heuristic in CMakeLists.txt.

Can you please check if it works in Arch Linux?

As far as I've tested (with fontforge 20120731_b-1), things go pretty badly. The page is misaligned and certain characters (diacritics) are missing.

This is the output when converting a pdf.

Working: Warning: encoding confliction detected in font: f1
Warning: fontforge failed.
Warning: cannot read font info for f1
Warning: encoding confliction detected in font: f2
Warning: fontforge failed.
Warning: cannot read font info for f2
Warning: encoding confliction detected in font: f3
Warning: fontforge failed.
Warning: cannot read font info for f3
Warning: fontforge failed.
Warning: cannot read font info for f4
Warning: encoding confliction detected in font: f5
Warning: fontforge failed.
Warning: cannot read font info for f5
Warning: fontforge failed.
Warning: cannot read font info for f6
Warning: encoding confliction detected in font: f7
Warning: fontforge failed.
Warning: cannot read font info for f7
Warning: fontforge failed.
Warning: cannot read font info for f8
Warning: fontforge failed.
Warning: cannot read font info for f9
..Warning: encoding confliction detected in font: fa
Warning: fontforge failed.
Warning: cannot read font info for fa
....Warning: encoding confliction detected in font: fb
Warning: fontforge failed.
Warning: cannot read font info for fb
................................Warning: fontforge failed.
Warning: cannot read font info for fc
.....Warning: fontforge failed.
Warning: cannot read font info for fd
......
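
In case it helps with triage, the warnings above can be tallied per font with a small script. This is only a rough sketch that assumes the two message formats seen in this output ("encoding confliction detected in font: <id>" and "cannot read font info for <id>"); the function name summarize and the labels it returns are made up for illustration, and nothing here is part of pdf2htmlEX itself.

```python
import re
from collections import defaultdict

# Matches the two per-font warning formats seen in the output above.
# Progress dots or "Working:" may precede a warning on the same line,
# so the pattern is searched anywhere in the text.
FONT_WARNING = re.compile(
    r"Warning: (encoding confliction detected in font:"
    r"|cannot read font info for) (\w+)"
)

# Short labels for each message; these names are illustrative only.
KINDS = {
    "encoding confliction detected in font:": "encoding conflict",
    "cannot read font info for": "unreadable font info",
}


def summarize(log_text):
    """Map each font id to the set of warning kinds reported for it."""
    summary = defaultdict(set)
    for message, font in FONT_WARNING.findall(log_text):
        summary[font].add(KINDS[message])
    return dict(summary)
```

Feeding it the captured output gives one entry per font id (f1, f2, ...), which makes it easy to see which fonts hit the encoding warning and which only failed to load. The "fontforge failed." lines carry no font id, so they are ignored.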

Any ideas?


#14 2012-09-05 14:28:48

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Any ideas?

Are you using the latest devv branch?
There should never be a 'fontforge failed' message; it should have been removed.


#15 2012-09-05 17:37:02

roentgen
Member
Registered: 2011-03-15
Posts: 91

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

Indeed. I updated the AUR package to use the devv branch and now everything is fine with fontforge from [extra].

Thanks.


#16 2012-09-07 18:01:29

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

Two more beautiful demo pages have been included.
Please check out the github page.


#17 2012-09-11 06:54:56

coolwanglu
Member
Registered: 2012-08-31
Posts: 11

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

roentgen wrote:

Indeed. I updated the AUR package to use the devv branch and now everything is fine with fontforge from [extra].

Thanks.

I've removed the dependency on Boost and pushed everything into the master branch.
Could you please update the AUR package accordingly, and check if it still works on Arch?
Thanks!!


#18 2012-09-12 18:15:26

roentgen
Member
Registered: 2011-03-15
Posts: 91

Re: [0916] Introducing pdf2htmlEX: converts PDF to HTML w/o losing fmt

coolwanglu wrote:

I've removed the dependency on Boost and pushed everything into the master branch.
Could you please update the AUR package accordingly, and check if it still works on Arch?

Thanks for the updates. Everything looks fine. I updated the AUR package.

