HTML Editor - converting "special characters" into entities

ctarwater · 2011-02-24 02:19:36

Hi everyone,

I'm currently in the middle of formatting my wife's novel into 'clean' HTML so I can use Calibre to create a well-formatted epub/mobi/etc.

I've read that TextMate on Mac includes a function to "Convert Selection to Entities excluding Tags". In other words, it will replace all special characters (ellipses, copyright sign, etc.) with their HTML entities.

I'm currently using Kate and I can't seem to find anything similar. Is anyone aware of any Linux editors with the same function?

Thanks for the help!

upsidaisium · 2011-02-24 03:31:22

Find and replace? Just look up a reference of HTML entities so you know that, for example, you should replace the ellipsis with:

&hellip;

Here is one reference: http://www.w3schools.com/tags/ref_entities.asp
(Note that it's broken up into two or three pages, so if you don't find all the entities you're looking for on the first page then keep looking)

ctarwater · 2011-02-24 03:37:12

Yeah, I'm currently doing it all manually, using find and replace for one character at a time, but apparently that Mac program lets you click a button and it replaces ALL special characters (curved quotes, ellipses, etc.). I'm neurotic, what can I say?

upsidaisium · 2011-02-24 15:01:24

Well it would, of course, be nice if it was easier. I wonder if Kate supports macros or scripting of any sort? If you invested the time to create a macro that replaces all special characters with HTML entities then you could re-use it again in the future without having to waste any more time.
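As a rough idea of what such a reusable script could look like, here is a minimal Ruby sketch that uses only the standard library. It is not how TextMate does it: the function name `encode_outside_tags` and the small `NAMED_ENTITIES` table are made up for illustration, and the tag matching is a simple regex that assumes no stray `<` or `>` inside the text. Characters missing from the table fall back to a decimal character reference.

```ruby
# Hypothetical, deliberately small lookup table; extend it as needed.
NAMED_ENTITIES = {
  "\u2026" => "&hellip;", # horizontal ellipsis
  "\u00A9" => "&copy;",   # copyright sign
  "\u2018" => "&lsquo;",  # left single quote
  "\u2019" => "&rsquo;",  # right single quote
  "\u201C" => "&ldquo;",  # left double quote
  "\u201D" => "&rdquo;",  # right double quote
  "\u2014" => "&mdash;"   # em dash
}.freeze

# Replace non-ASCII characters with entities, but only outside of tags.
def encode_outside_tags(html)
  # Splitting on a captured group keeps the tags in the resulting array.
  html.split(/(<[^>]*>)/).map do |chunk|
    next chunk if chunk.start_with?("<") # leave tags untouched

    chunk.gsub(/[^[:ascii:]]/) do |char|
      # Use the named entity if we know one, else a decimal reference.
      NAMED_ENTITIES.fetch(char) { format("&#%d;", char.ord) }
    end
  end.join
end

puts encode_outside_tags("<p title=\"café\">“Wait…” she said.</p>")
# prints: <p title="café">&ldquo;Wait&hellip;&rdquo; she said.</p>
```

Note that the attribute value inside the tag is left alone, which is the "excluding tags" part.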

jdarnold · 2011-02-24 15:43:27

I did find this on the web:

http://www.opinionatedgeek.com/dotnet/t … ncode.aspx

I'll bet there are plenty of other sites that will translate text with HTML encoding.

ctarwater · 2011-02-24 16:22:09

I'll give that a try and check the quality against my manual entries. So far, doing it manually isn't taking as long as I figured it would.

Thanks!

awkwood · 2011-02-25 17:14:11

If you have Ruby installed (and haven't converted it all manually yet) you could use the HTMLEntities gem for this task.
To install:

gem install htmlentities

Then:

require 'htmlentities'

coder = HTMLEntities.new
string = "Tacòs!"
# :named asks for named entities (such as &ograve;) where one exists
puts coder.encode(string, :named)

# => Tac&ograve;s!

More info on the HTMLEntities gem.

Last edited by awkwood (2011-02-25 17:14:58)

rwd · 2011-02-25 21:47:30

I know that at least PSPad, a freeware Windows text editor that runs perfectly under Wine, can do this. It has the option under 'tools -> user convertors -> chars to named html entity'.

Last edited by rwd (2011-02-25 21:47:45)

ctarwater · 2011-02-25 23:02:33

Thanks for the suggestions guys, I'll check them out tonight!

Anthony Bentley · 2011-02-27 02:13:56

What is the encoding of the files? If it’s already UTF‐8, you shouldn’t have to convert anything, as ePUB and others are XML and require Unicode.

If they’re something else (e.g., CP1252), you can use iconv to convert:

iconv -f WINDOWS-1252 -t UTF-8 infile > outfile
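If Ruby is at hand, the same check-and-convert step can be sketched with the built-in String encoding methods. This is only a heuristic, and `to_utf8` is a hypothetical helper name: it treats the bytes as UTF-8 when they form valid UTF-8, and otherwise assumes Windows-1252, which is an assumption you would need to confirm for your own files.

```ruby
# Hypothetical helper: return the text as UTF-8, converting from
# Windows-1252 only when the bytes are not already valid UTF-8.
def to_utf8(bytes)
  utf8 = bytes.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  # Assumption: anything that is not valid UTF-8 is Windows-1252.
  # Bytes that are undefined in Windows-1252 raise
  # Encoding::UndefinedConversionError here.
  bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Byte 0x85 is the ellipsis in Windows-1252.
puts to_utf8("Wait\x85".b)
```

For a real file, something like `to_utf8(File.binread("infile"))` would read the raw bytes first, where `infile` is the same placeholder name as in the iconv command.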

ctarwater · 2011-02-27 02:17:09

Encoding is already UTF-8 but I'm trying to make these docs as "universal" as possible and I've read that little things like replacing quotation marks and other symbols with their HTML entities are a good step in that direction.

Anthony Bentley · 2011-02-27 06:13:57

ctarwater wrote:

Encoding is already UTF-8 but I'm trying to make these docs as "universal" as possible and I've read that little things like replacing quotation marks and other symbols with their HTML entities are a good step in that direction.

In situations with multiple encodings this might be the case, but the XML spec requires UTF‐8 support at minimum and defaults to it when the encoding is not declared. All XML parsers can deal with UTF‐8.

In fact, replacing UTF‐8 characters with entities is less portable in XML. This is because the available entities are defined by the doctype, and XML only supports five by default (&lt;, &gt;, &quot;, &amp;, and &apos;). Things like &hellip; are defined in the HTML and XHTML doctypes, but not in other dialects of XML. This used to bite RSS pretty hard (probably still does), because people assumed the entities were available everywhere when they’re really not. EPUB and Mobi are based on XHTML, but other formats might use other XML dialects.
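For text that already contains HTML-only named entities, one way around this is to rewrite them as numeric character references (for example &#8230;), which are part of XML itself and do not depend on the doctype. Below is a minimal Ruby sketch of that idea. The name `named_to_numeric` and the `HTML_NAMED` table are made up for illustration, and the table is far from complete.

```ruby
# Hypothetical, partial map from HTML entity names to code points.
HTML_NAMED = {
  "hellip" => 0x2026,
  "copy"   => 0x00A9,
  "ldquo"  => 0x201C,
  "rdquo"  => 0x201D,
  "mdash"  => 0x2014,
  "nbsp"   => 0x00A0
}.freeze

# The five entities every XML parser understands without a doctype.
XML_PREDEFINED = %w[lt gt quot amp apos].freeze

# Turn known HTML named entities into decimal character references.
def named_to_numeric(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    next match if XML_PREDEFINED.include?(name) # already safe in XML

    code = HTML_NAMED[name]
    code ? format("&#%d;", code) : match # unknown names are left alone
  end
end

puts named_to_numeric("Tom &amp; Jerry&hellip; &copy; 2011")
# prints: Tom &amp; Jerry&#8230; &#169; 2011
```

Unknown names are passed through unchanged on purpose, so a gap in the table does not silently corrupt the text.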

ctarwater · 2011-02-27 11:53:40

Huh, I wasn't aware of that - thanks. I'm learning more about this every day. I'ma have to rethink this a bit now.

Arch Linux
