Charset

From WikiLICC
Revision as of 12:59, 23 May 2009 by Dago

Character Sets / Character Encoding Issues

Introduction

Let’s first define some terms to make it easier to understand the following sections (taken from the book XML Internationalization and Localization). See also the introductory WIKI page on i18n.

A character is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, digits etc.

A character set is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.

Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII, the uppercase letter “A” has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme, or encoding for short. The encoding maps each code point to a given sequence of bytes.

In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character “A” (code point 65) is encoded as the byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character “á” (code point 225) is encoded as two bytes: 0xC3 and 0xA1.
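The two examples above can be checked directly. The following is a minimal sketch in Python, chosen only for illustration; the variable names are made up for this example.

```python
# Code point versus encoded bytes, using the examples from the text.
letter_a = "A"
a_acute = "\u00e1"  # the character "á"

# The code point is a property of the coded character set.
code_point_a = ord(letter_a)          # 65
code_point_acute = ord(a_acute)       # 225

# The byte sequence depends on the encoding that is applied.
latin1_a = letter_a.encode("iso-8859-1")      # one byte: 0x41
latin1_acute = a_acute.encode("iso-8859-1")   # one byte: 0xE1 (direct projection of 225)
utf8_acute = a_acute.encode("utf-8")          # two bytes: 0xC3 0xA1

print(code_point_a, latin1_a.hex())
print(code_point_acute, latin1_acute.hex(), utf8_acute.hex())
```

In ISO 8859-1 the byte value equals the code point, while in UTF-8 the same code point 225 becomes a two-byte sequence.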

Unicode and its encodings

For Unicode (also called the Universal Character Set or UCS), a coded character set developed by the Unicode Consortium, there are several possible encodings: UTF-8, UTF-16, and UTF-32. Of these, UTF-8 is the most relevant for a web application.

UTF-8

UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes. One of the main advantages of UTF-8 is its compatibility with ASCII. If no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
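As a rough illustration of both properties, the sketch below encodes one sample character for each sequence length and compares the ASCII and UTF-8 encodings of a plain ASCII string. The sample characters and names are chosen for this example only.

```python
# One sample character for each UTF-8 sequence length.
samples = {
    "A": "A",                        # U+0041
    "a-acute": "\u00e1",             # U+00E1
    "euro": "\u20ac",                # U+20AC
    "grinning face": "\U0001F600",   # U+1F600
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# ASCII compatibility: pure ASCII text yields identical bytes either way.
text = "plain ASCII text"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)
```

The four samples take one, two, three, and four bytes respectively, and the ASCII comparison is true.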

One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected (more on this below).
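The kind of mismatch meant here can be sketched as follows. The example is in Python rather than PHP, purely to show the difference between counting characters and counting bytes; a byte-oriented string function sees the second number, not the first.

```python
word = "caf\u00e9"                 # "café": four characters
utf8_bytes = word.encode("utf-8")  # five bytes, because "é" needs two

char_count = len(word)
byte_count = len(utf8_bytes)
print(char_count, byte_count)

# Truncating by byte position can cut a multibyte character in half,
# leaving a sequence that is no longer valid UTF-8.
truncated = utf8_bytes[:4]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(cut_is_valid)
```

Any code that measures, truncates, or indexes a UTF-8 string by bytes can run into this.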

See also Handling UTF-8 with PHP

PHP and character sets

This page is going to assume you’ve done a little reading and absorbed some paranoia about the issue of character sets and character encoding in web applications. If you haven’t, try here:

   “When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.”

“Darn near impossible” is perhaps too extreme but, certainly in PHP, if you simply “accept the defaults” you will probably end up with all kinds of strange characters and question marks the moment anyone outside the US or Western Europe submits some content to your site.

This page won’t rehash existing discussions. Suffice it to say you should be thinking in terms of Unicode, the grand unified solution to all character issues, and in particular UTF-8, a specific encoding of Unicode and the best solution for PHP applications.

Everybody gets it wrong

Just so you don’t get the idea that only “serious programmers” can understand the problem, and as a taster for the type of problems you can have, right now (i.e. they may fix it later) on IBM’s new PHP Blog @ developerworks, here’s what you see if you right click > Page Info in Firefox:

Firefox says it regards the character encoding as being ISO-8859-1. That’s actually coming from an HTTP header. If you click on the “Headers” tab you see:

Content-Type: text/html;charset=ISO-8859-1

Meanwhile, amongst the HTML meta tags (scroll down past the whitespace), you find:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Now that’s not a train smash (yet), but it should raise the flag that something isn’t quite right. Browsers give the charset in the HTTP header precedence over the meta tag, so the meta tag is ignored and the content is treated as ISO-8859-1.
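The precedence described here can be sketched roughly as follows. This is a simplified illustration in Python, not a description of how any particular browser is implemented: real browsers also consider byte order marks and may sniff the content. The function names `charset_from_content_type` and `effective_charset` are hypothetical.

```python
import re

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def effective_charset(http_content_type, html):
    """Simplified precedence: HTTP header first, then the meta tag."""
    from_header = charset_from_content_type(http_content_type or "")
    if from_header:
        return from_header
    # Only the http-equiv form of the meta tag is handled in this sketch.
    meta = re.search(
        r'<meta[^>]*http-equiv=["\']content-type["\'][^>]*content=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    )
    return charset_from_content_type(meta.group(1)) if meta else None

page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'

# The header from the example above wins over the meta tag.
print(effective_charset("text/html;charset=ISO-8859-1", page))
# Without a charset in the header, the meta tag is used.
print(effective_charset("text/html", page))
```

With the header and meta tag from the example, the first call reports `iso-8859-1` and the second reports `utf-8`.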

This raises the question: how is the content on the blog actually encoded? If that Content-Type: text/html;charset=ISO-8859-1 header is also turning up in the form that writers on the blog use to submit content, it will probably mean the content being stored has been encoded as ISO-8859-1. If that’s the case, the real problem will rear its head in the blog’s RSS feed, which currently does not specify the charset with an HTTP header, just that it’s XML:

Content-Type: text/xml

...but does declare UTF-8 as the encoding in the XML content itself:

<?xml version="1.0" encoding="UTF-8"?>

Anyone subscribed to this feed is going to see some weird characters appearing, should the blog contain anything but pure ASCII characters, because there’s a very good chance the content is actually stored as ISO-8859-1. The guess here is that the “back end” content admin page (containing a form for adding content) is also telling the browser it’s ISO-8859-1.
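The effect on a subscriber can be reproduced in a few lines. The sketch below, in Python and with a made-up sample word, assumes the scenario guessed at above: bytes stored as ISO-8859-1 but read by a feed reader that trusts the UTF-8 declaration. It also shows the opposite mismatch for comparison.

```python
original = "caf\u00e9"  # "café"

# Content stored as ISO-8859-1: "é" becomes the single byte 0xE9.
stored = original.encode("iso-8859-1")

# A reader that trusts the UTF-8 declaration cannot decode 0xE9 on its own
# and shows a replacement character (U+FFFD) instead.
as_seen_by_reader = stored.decode("utf-8", errors="replace")
print(repr(as_seen_by_reader))

# The opposite mismatch: UTF-8 bytes interpreted as ISO-8859-1 turn
# the two bytes of "é" into two separate characters.
utf8_stored = original.encode("utf-8")
as_seen_in_latin1 = utf8_stored.decode("iso-8859-1")
print(repr(as_seen_in_latin1))
```

Pure ASCII content survives both mismatches unchanged, which is why this kind of problem often goes unnoticed until someone submits an accented character.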

Hopefully, by the time you’ve read this document, you’ll understand what exactly is going wrong here and why.