freelanceprogrammers.org Forum Index » XML / XSL

Can't Get UTF Characters to Work


red_vorlon Posted: Fri Mar 24, 2006 4:00 am


I am trying to put a simple character in an XML document
I am writing. The character is none other than "é" --- that
is an "e" with an upward accent (French accent) on top of
it.

According to all the documentation, that should be what
I get when I enter the sequence: &#x00E9;
or the sequence: &#xE9;

although neither sequence gets me anything *near* what
I want?

Simply putting the character as-is in the document caused
the parser to simply fail.

What is going on? And what can I do?

It is a bad thing if one can't put a certain character in an
XML document. And the one I'm trying to put in is *far*
from being an *unheard* of character.

So what's the deal? And what do I do?

Thanks,
Adam
david_sewell Posted: Fri Mar 24, 2006 4:32 am


On Thu, 23 Mar 2006, Adam Shapira wrote:

> I am trying to put a simple character in an XML document
> I am writing. The character is none other than "é" --- that
> is an "e" with an upward accent (French accent) on top of
> it.
>
> According to all the documentation, that should be what
> I get when I enter the sequence: &#x00E9;
> or the sequence: &#xE9;
>
> although neither sequence gets me anything *near* what
> I want?
>
> Simply putting the character as-is in the document caused
> the parser to simply fail.
>
> What is going on? And what can I do?

To suggest an answer, we'd really need to know three things:
(1) what XML parsing software are you using?
(2) what editor or editing software are you using to create the file?
(3) what are you using to display output?

There may be two separate problems. If your XML document contains either
of &#x00E9; or &#xE9; then that character SHOULD appear as accented e
(é) in an XML editor or if you output the XML to a Web browser (directly
or transformed to HTML first). What do you get instead?

Re the parser failing: to enter the character as-is, the file must
actually be saved as UTF-8, which is what an XML parser assumes when no
encoding is declared. If it is saved in some other encoding, you must
declare that encoding in the XML declaration. See for example

http://www.w3schools.com/xml/xml_encoding.asp

for info.
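
A minimal sketch of the parser failure described above. It uses Python's standard-library XML parser rather than xsltproc, purely for illustration, and the element name and sample text are made up. It suggests that a raw ISO-8859-1 é byte is rejected when no encoding is declared, accepted once the XML declaration names the real encoding, and that a numeric character reference works however the file is saved.

```python
import xml.etree.ElementTree as ET

# "café" saved as ISO-8859-1: the é is the single byte 0xE9.
latin1_body = "<p>caf\u00e9</p>".encode("iso-8859-1")

def parsed_text(xml_bytes):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None

# No declaration: the parser assumes UTF-8, and a lone 0xE9 is not valid UTF-8.
undeclared = parsed_text(latin1_body)

# Same bytes, but the XML declaration now states the encoding actually used.
declared = parsed_text(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body)

# A numeric character reference is plain ASCII, so the file encoding is irrelevant.
reference = parsed_text(b"<p>caf&#xE9;</p>")

print(undeclared is None, declared == reference)
```

Other parsers word the error differently, but the underlying rule (declared encoding must match the bytes on disk) comes from XML itself.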

--
David Sewell, Editorial and Technical Manager
Electronic Imprint, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell@... Tel: +1 434 924 9973
Web: http://www.ei.virginia.edu/

red_vorlon Posted: Fri Mar 24, 2006 11:14 am


David Sewell wrote:
>
> To suggest an answer, we'd really need to know three things:
> (1) what XML parsing software are you using?

I am using xsltproc.


> (2) what editor or editing software are you using to create the file?

I use both -vi- and -BBEdit-.


> (3) what are you using to display output?

After exporting from XML to HTML, I looked at it both with
Mozilla and with Safari.


>
> There may be two separate problems. If your XML document contains either
> of &#x00E9; or &#xE9; then that character SHOULD appear as accented e
> (é) in an XML editor or if you output the XML to a Web browser (directly
> or transformed to HTML first). What do you get instead?

Instead of getting "é" I got "Ã©".

It appeared that way both on Firefox and on Safari.
efolia Posted: Fri Mar 24, 2006 11:54 am


FYI...
Ã© is the UTF-8 encoded form of é as it is displayed on a platform that
does not properly recognize the encoding (such as Microsoft Wordpad). If
you use the correct encoding parameter, it should display correctly.
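
The effect described here can be reproduced in a few lines. This is only an illustrative sketch in Python (not something used in the thread): it encodes é as UTF-8 and then decodes the same bytes under the wrong and the right assumption.

```python
# U+00E9 (é) written out as UTF-8 is the two bytes C3 A9.
utf8_bytes = "\u00e9".encode("utf-8")

# A viewer that wrongly assumes ISO-8859-1 maps each byte to its own character:
# C3 becomes U+00C3 (A with tilde), A9 becomes U+00A9 (copyright sign).
misread = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding that was actually used recovers the original.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(), ascii(misread), ascii(correct))
```

Nothing is wrong with the bytes themselves in this case; only the reader's assumption about the encoding differs.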

Yannick

On Fri, 24 Mar 2006 00:14:10 -0500, Adam Ophir Shapira
<red_angel@...> wrote:

> David Sewell wrote:
>>
>> To suggest an answer, we'd really need to know three things:
>> (1) what XML parsing software are you using?
>
> I am using xsltproc.
>
>
>> (2) what editor or editing software are you using to create the file?
>
> I use both -vi- and -BBEdit-.
>
>
>> (3) what are you using to display output?
>
> After exporting from XML to HTML, I looked at it both with
> Mozilla and with Safari.
>
>
>>
>> There may be two separate problems. If your XML document contains either
>> of &#x00E9; or &#xE9; then that character SHOULD appear as accented e
>> (é) in an XML editor or if you output the XML to a Web browser (directly
>> or transformed to HTML first). What do you get instead?
>
> Instead of getting "é" I got "Ã©".
>
> It appeared that way both on Firefox and on Safari.
red_vorlon Posted: Fri Mar 24, 2006 7:55 pm


So basically --- all I need to do is read up on how
to specify the proper encoding parameter --- and
the problem will be solved?

If so -- then thanks --- now I know what area I
need to read up on.

Thanks.


Yannick Forest wrote:
> FYI...
> Ã© is the UTF-8 encoded form of é as it is displayed on a platform that
> does not properly recognize the encoding (such as Microsoft Wordpad). If
> you use the correct encoding parameter, it should display correctly.
>
> Yannick
>
david_sewell Posted: Fri Mar 24, 2006 8:21 pm


On Fri, 24 Mar 2006, Adam Ophir Shapira wrote:

> David Sewell wrote:
> >
> > To suggest an answer, we'd really need to know three things:
> > (1) what XML parsing software are you using?
>
> I am using xsltproc.
>
>
> > (2) what editor or editing software are you using to create the file?
>
> I use both -vi- and -BBEdit-.
>
> > (3) what are you using to display output?
>
> After exporting from XML to HTML, I looked at it both with
> Mozilla and with Safari.
>
> Instead of getting "é" I got "Ã©".
>
> It appeared that way both on Firefox and on Safari.

That's odd. In the HTML file that was exported, is there a line like
this in the <head>:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

? If I create a simple XML file with accented characters and write a
simple XSLT transformation, the resulting HTML has that <meta> element,
which the browser should use to display the UTF-8 characters properly.
Unless you tell the browser to use a different encoding. For example, if
I am seeing a correctly rendered UTF-8 HTML file in Firefox, and then go
to "View - Character Encoding" and choose "Western (ISO-8859-1)", I will
get the incorrect output "Ã©".
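
One way to check the exported file for that <meta> line without opening a browser is sketched below in Python. The function name, the regular expression, and the sample page are all made up for illustration; a real browser's detection logic is more involved than this.

```python
import re

# Matches charset=... inside a <meta> tag, covering both the http-equiv form
# and the shorter <meta charset="..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

def declared_charset(html_bytes):
    """Return the charset named by the first <meta> declaration, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None

sample = (b'<html><head>'
          b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
          b'</head><body>caf\xc3\xa9</body></html>')

charset = declared_charset(sample)
print(charset)
# Decoding with the declared charset yields the intended text.
body_ok = "caf\u00e9" in sample.decode(charset)
```

If the function returns None for your exported file, the browser has to guess, and a wrong guess produces exactly the two-character garbage shown above.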

david_sewell Posted: Fri Mar 24, 2006 8:25 pm


On Fri, 24 Mar 2006, David Sewell wrote:

> ? If I create a simple XML file with accented characters and write a
> simple XSLT transformation, the resulting HTML has that <meta> element,
> which the browser should use to display the UTF-8 characters properly.

I should have mentioned I used xsltproc also to test this.

There is one other possibility. If you are viewing the output HTML via a
Web server rather than just by opening the file directly, there might be
a server setting that is messing up the display. I have seen this happen
with Apache when "AddDefaultCharset" is set to "On": the server then adds
"charset=ISO-8859-1" to the HTTP Content-Type header, and browsers give
that header priority over the <meta> element in the file. See
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
for details.
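
A rough sketch of why a server-side default matters, written in Python for illustration. The helper names are hypothetical and the precedence rule is deliberately simplified to "a charset in the HTTP header wins over the <meta> element"; the header values are examples, not captured from a real server.

```python
from email.message import Message

def header_charset(content_type):
    """Extract the charset parameter from an HTTP Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def effective_charset(content_type, meta_charset):
    """Simplified precedence: a charset in the HTTP header beats the <meta> element."""
    return header_charset(content_type) or meta_charset

page = "caf\u00e9".encode("utf-8")   # the file on disk really is UTF-8

# The server adds its own default charset, contradicting the <meta> element.
with_default = effective_charset("text/html; charset=ISO-8859-1", "utf-8")
# The server sends no charset, so the <meta> declaration is used.
without_default = effective_charset("text/html", "utf-8")

print(with_default, ascii(page.decode(with_default)))
print(without_default, ascii(page.decode(without_default)))
```

To see what a server actually sends, inspect the response headers (for example with a browser's page-info dialog) and look at the Content-Type line.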

dirtroad30534 Posted: Mon Mar 27, 2006 9:11 pm


Sorry about the late response, I've been out of town.

In later follow-ups, it became clear you're using a Mac. You need to
explicitly enable UTF-8 support in the Terminal: in the Terminal menu,
select Window Settings and choose Display from the dialog's popup menu.
Toward the bottom of the window, select "Unicode (UTF-8)" in the Character
Set Encoding menu.

Don't forget to include the XML declaration at the beginning of
your files:
<?xml version="1.0" encoding="utf-8" ?>
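
The declaration only helps if the bytes in the file really are in the encoding it names. As a hedged illustration (Python's standard library, not a tool mentioned in the thread; the element name and text are invented), the sketch below writes the same document in two encodings and lets the serializer generate a matching declaration each time.

```python
import io
import xml.etree.ElementTree as ET

def serialize(root, encoding):
    """Serialize an element with an XML declaration that names the encoding used."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding=encoding, xml_declaration=True)
    return buffer.getvalue()

root = ET.Element("p")
root.text = "caf\u00e9"

as_utf8 = serialize(root, "utf-8")          # é stored as the bytes C3 A9
as_latin1 = serialize(root, "iso-8859-1")   # é stored as the single byte E9

for data in (as_utf8, as_latin1):
    print(data)
    # Both parse back to the same text because declaration and bytes agree.
    assert ET.fromstring(data).text == "caf\u00e9"
```

When editing by hand, the equivalent check is to confirm the editor's "save as" encoding matches what the first line of the file claims.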

Hope that helps,
--
Larry Kollar, Senior Technical Writer, ARRIS CPE Products
"Content creators are the engine that drives
value in the information life cycle."
-- Barry Schaeffer, on XML-Doc
xmldoc Posted: Tue Mar 28, 2006 5:31 am


Adam Ophir Shapira <red_angel@...> writes:

> David Sewell wrote:
> >
> > (2) what editor or editing software are you using to create the file?
>
> I use both -vi- and -BBEdit-.
>
> > (3) what are you using to display output?
>
> After exporting from XML to HTML, I looked at it both with
> Mozilla and with Safari.
> >
[...]
> Instead of getting "[eacute]" I got "[Atilde+copy]".
>
> It appeared that way both on Firefox and on Safari.

I know you already got an answer to your question, but note that
rather than trying to figure out what the character is by checking
how the file contents are displayed in a browser or whatever, you
can use a hex-dump utility or hex editor to determine exactly what
the character is -- the hexedit or xxd commands if you're working
in a command-line environment, or whatever equivalent is built
into your editing app.

I would guess BBEdit has some kind of hex mode. In Emacs, you can
do "M-x hexl-mode". In Vim, you can do ":%! xxd".

Regardless of what you use, what you'll see is something like this:

00002a0: 6164 3f0a 0a49 6e73 7465 6164 206f 6620  ad?..Instead of
00002b0: 6765 7474 696e 6720 22e9 2220 4920 676f  getting "." I go
00002c0: 7420 22c3 a922 2e0a 0a49 7420 6170 7065  t ".."...It appe

That's a fragment of your file as seen by xxd. It shows the file
using one line for every sixteen bytes. It shows the hexadecimal
value for every byte in the file, along with an ASCII
representation of the contents (at the far right). Bytes that
can't be displayed with an ASCII character are shown with a dot.

To figure out what a particular dot corresponds to, you count
over. So for the dot in the "getting" line -- which is where the
acute e character shows up in your original message -- you can see
that corresponds to the single hex value "e9". And in the next
line down, you'll see that the borked stuff showing up when you
display it in a browser is two bytes, "c3a9".
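
If no hex-dump tool is at hand, the same view can be approximated in a few lines of Python. This is only a sketch that imitates the xxd layout shown above (offset, hex in two-byte groups, then ASCII with dots); the function name and the sample bytes are made up for the example.

```python
def hexdump(data, width=16):
    """Return xxd-style lines: offset, hex in two-byte groups, printable ASCII or dots."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        digits = chunk.hex()
        # Group the hex digits two bytes (four digits) at a time, as xxd does.
        groups = " ".join(digits[i:i + 4] for i in range(0, len(digits), 4))
        # Bytes outside the printable ASCII range are shown as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:07x}: {groups:<39}  {text}")
    return lines

# One é as ISO-8859-1 (E9) and one as UTF-8 (C3 A9), as in the message above.
sample = b'getting "\xe9" I got "\xc3\xa9".\n'
for line in hexdump(sample):
    print(line)
```

To dump a real file, read it in binary mode (open(path, "rb")) so that no decoding happens before the bytes are shown.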

Selection of the glyphs that are used to display those bytes when
you view them in some app depends on what encoding the application
thinks your file is in. In the case of your mail message, your mail
client sent it with the following header:

Content-Type: text/plain; charset=ISO-8859-1

So when I view it in my mail client, that e9 is displayed with an
"e with acute accent" glyph -- as expected, because in ISO-8859-1
encoding, a single e9 = eacute -- and the c3a9 pair shows up
borked. Because in ISO-8859-1, c3+a9 = Atilde+copy (capital A with
a tilde, followed by the copyright symbol).

But if the charset part of your message's Content-Type header had
"charset=UTF-8" instead, the c3a9 would actually be displayed with
an "e with acute accent" glyph, and the e9 would show up with some
(undefined) strange character -- a black or white box, or a
black diamond with a question mark, or maybe even just a question
mark. The reason being that in UTF-8, a single hex e9 is not a
complete character: it is the lead byte of a three-byte sequence,
so on its own it is invalid and cannot be displayed.
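
The behaviour of a lone e9 under UTF-8 can be checked directly. The following Python sketch is an illustration only; how a given mail client or browser renders the invalid byte is up to that application.

```python
lone_e9 = b"\xe9"
pair_c3a9 = b"\xc3\xa9"

# Strict UTF-8 decoding rejects a lone 0xE9: it announces a three-byte
# sequence, but no continuation bytes follow.
try:
    lone_e9.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# Lenient decoders substitute U+FFFD, the replacement character, which many
# fonts draw as a black diamond with a question mark.
substituted = lone_e9.decode("utf-8", errors="replace")

as_utf8 = pair_c3a9.decode("utf-8")         # one character, U+00E9
as_latin1 = lone_e9.decode("iso-8859-1")    # the same character, from one byte

print(lone_is_valid, ascii(substituted), ascii(as_utf8), ascii(as_latin1))
```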

If you look up the character "e9" in a Unicode character database
of some kind, like the one at the Zvon site, you might be led to
conclude that e9 in "Unicode" should be displayed as an eacute,
just as it is in ISO-8859-1.

http://zvon.org/other/charSearch/PHP/search.php

If you look at that page, it'll tell you that the e9 corresponds
to the Unicode character "LATIN SMALL LETTER E WITH ACUTE". But
what it doesn't tell you is what byte values that character
corresponds to in a particular Unicode encoding. Most of
the time, what you'd probably want to know is what it is in UTF-8,
which isn't the same as its Unicode code point. The reason is
that in UTF-8, unlike ISO-8859-1, every character above U+007F is
represented by two or more bytes (two for accented Latin letters
such as eacute).

There's a very good online reference that will tell you what the
hex values are for UTF-8-encoded versions of Unicode code points --
the "letter database" at the Institute of the Estonian Language:

http://www.eki.ee/letter/chardata.cgi?ucode=e9
http://www.eki.ee/letter/

If you look at the bottom of the left-hand column, you can see
that it says Unicode 00e9 corresponds to c3a9 in UTF-8. Another
part of the page tells you what it corresponds to in other
charsets (for example, e9 in ISO-8859-1).
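
The step from code point 00e9 to the bytes c3a9 is plain bit arithmetic. As a worked illustration in Python (the function name is made up, and only the two-byte range U+0080 to U+07FF is handled), the code point's bits are split across a lead byte of the form 110xxxxx and a continuation byte of the form 10xxxxxx:

```python
def utf8_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: the top five bits
    continuation = 0b10000000 | (code_point & 0x3F)  # 10xxxxxx: the low six bits
    return bytes([lead, continuation])

# 0xE9 is 11101001 in binary: top bits 00011, low bits 101001.
encoded = utf8_two_byte(0xE9)
print(encoded.hex())  # c3a9
```

The result matches what Python's built-in encoder produces for the same character, which is a quick way to sanity-check the arithmetic.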

So in the case of your content, as others on the list have pointed
out, you just need to tell your application that the contents are
UTF-8 encoded instead of ISO-8859-1 encoded. If that application
happens to be a web browser and your content is being served up to
the browser from a Web server, one common problem is that many
Apache web servers are configured to serve up pages with a
particular charset setting in the HTTP headers. And the default
value for that setting is "ISO-8859-1".

--Mike