Chapter 9. Internationalization and Localization

Table of Contents

9.1. Character Set Support
9.1.1. Character Sets and Collations in General
9.1.2. Character Sets and Collations in MySQL
9.1.3. Specifying Character Sets and Collations
9.1.4. Connection Character Sets and Collations
9.1.5. Collation Issues
9.1.6. Operations Affected by Character Set Support
9.1.7. Unicode Support
9.1.8. UTF-8 for Metadata
9.1.9. Upgrading Character Sets from MySQL 4.0
9.1.10. Character Sets and Collations That MySQL Supports
9.2. The Character Set Used for Data and Sorting
9.2.1. Using the German Character Set
9.3. Setting the Error Message Language
9.4. Adding a New Character Set
9.4.1. The Character Definition Arrays
9.4.2. String Collating Support
9.4.3. Multi-Byte Character Support
9.5. How to Add a New Collation to a Character Set
9.5.1. Collation Implementation Types
9.5.2. Choosing a Collation ID
9.5.3. Adding a Simple Collation to an 8-Bit Character Set
9.6. Problems With Character Sets
9.7. MySQL Server Time Zone Support
9.8. MySQL Server Locale Support

This chapter covers issues of internationalization (MySQL's capabilities for adapting to local use) and localization (selecting particular local conventions):

9.1. Character Set Support

Improved support for character set handling was added to MySQL in version 4.1. This support enables you to store data using a variety of character sets and perform comparisons according to a variety of collations. You can specify character sets at the server, database, table, and column level. MySQL supports the use of character sets for the MyISAM, MEMORY, and (as of MySQL 4.1.2) InnoDB storage engines. The ISAM storage engine does not include character set support; there are no plans to change this, because ISAM is deprecated.

Note

The NDBCLUSTER storage engine in MySQL 4.1 (available beginning with MySQL 4.1.3-Max) provides limited character set and collation support; see Section 15.12, “Known Limitations of MySQL Cluster”.

This chapter discusses the following topics:

  • What are character sets and collations?

  • The multiple-level default system for character set assignment

  • Syntax for specifying character sets and collations

  • Affected functions and operations

  • Unicode support

  • The character sets and collations that are available, with notes

Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate which one. For example, to use the utf8 Unicode character set, issue this statement after connecting to the server:

SET NAMES 'utf8';

For more information about character set-related issues in client/server communication, see Section 9.1.4, “Connection Character Sets and Collations”.

9.1.1. Character Sets and Collations in General

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: “A”, “B”, “a”, “b”. We give each letter a number: “A” = 0, “B” = 1, “a” = 2, “b” = 3. The letter “A” is a symbol, the number 0 is the encoding for “A”, and the combination of all four letters and their encodings is a character set.

Suppose that we want to compare two string values, “A” and “B”. The simplest way to do this is to look at the encodings: 0 for “A” and 1 for “B”. Because 0 is less than 1, we say “A” is less than “B”. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters “a” and “b” as equivalent to “A” and “B”; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.

In real life, most character sets have many characters: not just “A” and “B” but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German “Ö”), and for multiple-character mappings (such as the rule that “Ö” = “OE” in one of the two German collations).

MySQL 4.1 can do these things for you:

  • Store strings using a variety of character sets

  • Compare strings using a variety of collations

  • Mix strings with different character sets or collations in the same server, the same database, or even the same table

  • Allow specification of character set and collation at any level

In these respects, not only is MySQL 4.1 far more flexible than MySQL 4.0, it also is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.

9.1.2. Character Sets and Collations in MySQL

The MySQL server can support multiple character sets. To list the available character sets, use the SHOW CHARACTER SET statement. A partial listing follows. For more complete information, see Section 9.1.10, “Character Sets and Collations That MySQL Supports”.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...

Any given character set always has at least one collation. It may have several collations. To list the collations for a character set, use the SHOW COLLATION statement. For example, to see the collations for the latin1 (cp1252 West European) character set, use this statement to find those collation names that begin with latin1:

mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+

The latin1 collations have the following meanings:

CollationMeaning
latin1_german1_ciGerman DIN-1
latin1_swedish_ciSwedish/Finnish
latin1_danish_ciDanish/Norwegian
latin1_german2_ciGerman DIN-2
latin1_binBinary according to latin1 encoding
latin1_general_ciMultilingual (Western European)
latin1_general_csMultilingual (ISO Western European), case sensitive
latin1_spanish_ciModern Spanish

Collations have these general characteristics:

  • Two different character sets cannot have the same collation.

  • Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci. The output for SHOW CHARACTER SET indicates which collation is the default for each displayed character set.

  • There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).

In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

Collation-Charts.Org is a useful site for information that shows how one collation compares to another.

9.1.3. Specifying Character Sets and Collations

There are default settings for character sets and collations at four levels: server, database, table, and column. The description in the following sections may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.

CHARACTER SET is used in clauses that specify a character set. CHARSET may be used as a synonym for CHARACTER SET.

Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate which one. For example, to use the utf8 Unicode character set, issue this statement after connecting to the server:

SET NAMES 'utf8';

For more information about character set-related issues in client/server communication, see Section 9.1.4, “Connection Character Sets and Collations”.

9.1.3.1. Server Character Set and Collation

MySQL Server has a server character set and a server collation. These can be set at server startup on the command line or in an option file and changed at runtime.

Initially, the server character set and collation depend on the options that you use when you start mysqld. You can use --character-set-server for the character set. Along with it, you can add --collation-server for the collation. If you don't specify a character set, that is the same as saying --character-set-server=latin1. If you specify only a character set (for example, latin1) but not a collation, that is the same as saying --character-set-server=latin1 --collation-server=latin1_swedish_ci because latin1_swedish_ci is the default collation for latin1. Therefore, the following three commands all have the same effect:

shell> mysqld
shell> mysqld --character-set-server=latin1
shell> mysqld --character-set-server=latin1 \
           --collation-server=latin1_swedish_ci

One way to change the settings is by recompiling. If you want to change the default server character set and collation when building from sources, use: --with-charset and --with-collation as arguments for configure. For example:

shell> ./configure --with-charset=latin1

Or:

shell> ./configure --with-charset=latin1 \
           --with-collation=latin1_german1_ci

Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.

The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose.

The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.

9.1.3.2. Database Character Set and Collation

Every database has a database character set and a database collation. The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifying the database character set and collation:

CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

All database options are stored in a text file named db.opt that can be found in the database directory.

The CHARACTER SET and COLLATE clauses make it possible to create databases with different character sets and collations on the same MySQL server.

Example:

CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation in the following manner:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.

  • If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y.

  • Otherwise, the server character set and server collation.

The database character set and collation are used as default values if the table character set and collation are not specified in CREATE TABLE statements. They have no other purpose.

The character set and collation for the default database can be determined from the values of the character_set_database and collation_database system variables. The server sets these variables whenever the default database changes. If there is no default database, the variables have the same value as the corresponding server-level system variables, character_set_server and collation_server.

9.1.3.3. Table Character Set and Collation

Every table has a table character set and a table collation. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:

CREATE TABLE tbl_name (column_list)
    [[DEFAULT] CHARACTER SET charset_name] [COLLATE collation_name]]

ALTER TABLE tbl_name
    [[DEFAULT] CHARACTER SET charset_name] [COLLATE collation_name]

Example:

CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation in the following manner:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.

  • If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y.

  • Otherwise, the database character set and collation.

The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.

9.1.3.4. Column Character Set and Collation

Every “character” column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation:

col_name {CHAR | VARCHAR | TEXT} (col_length)
    [CHARACTER SET charset_name] [COLLATE collation_name]

Examples:

CREATE TABLE Table1
(
    column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);

ALTER TABLE Table1 MODIFY
    column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_swedish_ci;

If you convert a column from one character set to another, ALTER TABLE attempts to map the data values, but if the character sets are incompatible, there may be data loss.

MySQL chooses the column character set and collation in the following manner:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y are used.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation are used.

  • If COLLATE Y was specified without CHARACTER SET, then the character set associated with Y and collation Y.

  • Otherwise, the table character set and collation are used.

The CHARACTER SET and COLLATE clauses are standard SQL.

9.1.3.5. Character String Literal Character Set and Collation

Every character string literal has a character set and a collation.

A character string literal may have an optional character set introducer and COLLATE clause:

[_charset_name]'string' [COLLATE collation_name]

Examples:

SELECT 'string';
SELECT _latin1'string';
SELECT _latin1'string' COLLATE latin1_danish_ci;

For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.

The _charset_name expression is formally called an introducer. It tells the parser, “the string that is about to follow uses character set X.” Because this has confused people in the past, we emphasize that an introducer does not change the string to the introducer character set like CONVERT() would do. It does not change the string's value, although padding may occur. The introducer is just a signal. An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn).

Examples:

SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;

MySQL determines a literal's character set and collation in the following manner:

  • If both _X and COLLATE Y were specified, then character set X and collation Y are used.

  • If _X is specified but COLLATE is not specified, then character set X and its default collation are used.

  • Otherwise, the character set and collation given by the character_set_connection and collation_connection system variables are used.

Examples:

  • A string with latin1 character set and latin1_german1_ci collation:

    SELECT _latin1'Müller' COLLATE latin1_german1_ci;
    
  • A string with latin1 character set and its default collation (that is, latin1_swedish_ci):

    SELECT _latin1'Müller';
    
  • A string with the connection default character set and collation:

    SELECT 'Müller';
    

Character set introducers and the COLLATE clause are implemented according to standard SQL specifications.

An introducer indicates the character set for the following string, but does not change now how the parser performs escape processing within the string. Escapes are always interpreted by the parser according to the character set given by character_set_connection.

The following examples show that escape processing occurs using character_set_connection even in the presence of an introducer. The examples use SET NAMES (which changes character_set_connection, as discussed in Section 9.1.4, “Connection Character Sets and Collations”), and display the resulting strings using the HEX() function so that the exact string contents can be seen.

Example 1:

mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.01 sec)

mysql> SELECT HEX('à\n'), HEX(_sjis'à\n');
+------------+-----------------+
| HEX('à\n') | HEX(_sjis'à\n') |
+------------+-----------------+
| E00A       | E00A            | 
+------------+-----------------+
1 row in set (0.00 sec)

Here, “à” (hex value E0) is followed by “\n”, the escape sequence for newline. The escape sequence is interpreted using the character_set_connection value of latin1 to produce a literal newline (hex value 0A). This happens even for the second string. That is, the introducer of _sjis does not affect the parser's escape processing.

Example 2:

mysql> SET NAMES sjis;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT HEX('à\n'), HEX(_latin1'à\n');
+------------+-------------------+
| HEX('à\n') | HEX(_latin1'à\n') |
+------------+-------------------+
| E05C6E     | E05C6E            | 
+------------+-------------------+
1 row in set (0.04 sec)

Here, character_set_connection is sjis, a character set in which the sequence of “à” followed by “\” (hex values 05 and 5C) is a valid multi-byte character. Hence, the first two bytes of the string are interpreted as a single sjis character, and the “\” is not intrepreted as an escape character. The following “n” (hex value 6E) is not interpreted as part of an escape sequence. This is true even for the second string; the introducer of _latin1 does not affect escape processing.

9.1.3.6. National Character Set

Before MySQL 4.1, NCHAR and CHAR were synonymous. Standard SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL 4.1 and up uses utf8 as that predefined character set. For example, these data type declarations are equivalent:

CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)

As are these:

VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)

You can use N'literal' (or n'literal') to create a string in the national character set. These statements are equivalent:

SELECT N'some text';
SELECT n'some text';
SELECT _utf8'some text';

9.1.3.7. Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

Example 1: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

Here we have a column with a latin1 character set and a latin1_german1_ci collation. The definition is explicit, so that's straightforward. Notice that there is no problem with storing a latin1 column in a latin2 table.

Example 2: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a latin1 character set and a default collation. Although it might seem natural, the default collation is not taken from the table level. Instead, because the default collation for latin1 is always latin1_swedish_ci, column c1 has a collation of latin1_swedish_ci (not latin1_danish_ci).

Example 3: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance, MySQL checks the table level to determine the column character set and collation. Consequently, the character set for column c1 is latin1 and its collation is latin1_danish_ci.

Example 4: Database, Table, and Column Definition

CREATE DATABASE d1
    DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
    c1 CHAR(10)
);

We create a column without specifying its character set and collation. We're also not specifying a character set and a collation at the table level. In this circumstance, MySQL checks the database level to determine the table settings, which thereafter become the column settings.) Consequently, the character set for column c1 is latin2 and its collation is latin2_czech_ci.

9.1.3.8. Compatibility with Other DBMSs

For MaxDB compatibility these two statements are the same:

CREATE TABLE t1 (f1 CHAR(N) UNICODE);
CREATE TABLE t1 (f1 CHAR(N) CHARACTER SET ucs2);

9.1.4. Connection Character Sets and Collations

Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:

  • The server character set and collation can be determined from the values of the character_set_server and collation_server system variables.

  • The character set and collation of the default database can be determined from the values of the character_set_database and collation_database system variables.

Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation system variables.

Consider what a “connection” is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:

  • What character set is the statement in when it leaves the client?

    The server takes the character_set_client system variable to be the character set in which statements are sent by the client.

  • What character set should the server translate a statement to after receiving it?

    For this, the server uses the character_set_connection and collation_connection system variables. It converts statements sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8). collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.

  • What character set should the server translate to before shipping result sets or error messages back to the client?

    The character_set_results system variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, and result metadata such as column names.

You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip the rest of this section).

There are two statements that affect the connection character sets:

SET NAMES 'charset_name'
SET CHARACTER SET charset_name

SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)

A SET NAMES 'x' statement is equivalent to these three statements:

SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the default collation for x. It is not necessary to set that collation explicitly. To specify a particular collation for the character sets, use the optional COLLATE clause:

SET NAMES 'charset_name' COLLATE 'collation_name'

SET CHARACTER SET is similar to SET NAMES but sets character_set_connection and collation_connection to character_set_database and collation_database. A SET CHARACTER SET x statement is equivalent to these three statements:

SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;

Setting collation_connection also sets character_set_connection to the character set associated with the collation (equivalent to executing SET character_set_connection = @@character_set_database). It is not necessary to set character_set_connection explicitly.

When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the character_set_client, character_set_results, and character_set_connection system variables. In effect, the server performs a SET NAMES operation using the character set name.

With the mysql client, it is not necessary to execute SET NAMES every time you start up if you want to use a character set different from the default. You can add the --default-character-set option setting to your mysql statement line, or in your option file. For example, the following option file setting changes the three character set variables set to koi8r each time you invoke mysql:

[mysql]
default-character-set=koi8r

Example: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do not say SET NAMES or SET CHARACTER SET, then for SELECT column1 FROM t, the server sends back all the values for column1 using the character set that the client specified when it connected. On the other hand, if you say SET NAMES 'latin1' or SET CHARACTER SET latin1 before issuing the SELECT statement, the server converts the latin2 values to latin1 just before sending results back. Conversion may be lossy if there are characters that are not in both character sets.

If you do not want the server to perform any conversion of result sets, set character_set_results to NULL:

SET character_set_results = NULL;

Note

ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET.

To see the values of the character set and collation system variables that apply to your connection, use these statements:

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

You must also consider the environment within which your MySQL application executes. For example, if you will send statements using UTF-8 text taken from a file that you create in an editor, you should edit the file with the locale of your environment set to UTF-8 so that the file's encoding is correct and so that the operating system handles it correctly. For a script that executes in a Web environment, the script must handle the character encoding properly for its interaction with the MySQL server, and it must generate pages that correctly indicate the encoding so that browsers know now to display the content of the pages.

9.1.5. Collation Issues

The following sections discuss various aspects of character set collations.

9.1.5.1. Using COLLATE in SQL Statements

With the COLLATE clause, you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL statements. Here are some examples:

  • With ORDER BY:

    SELECT k
    FROM t1
    ORDER BY k COLLATE latin1_german2_ci;
    
  • With AS:

    SELECT k COLLATE latin1_german2_ci AS k1
    FROM t1
    ORDER BY k1;
    
  • With GROUP BY:

    SELECT k
    FROM t1
    GROUP BY k COLLATE latin1_german2_ci;
    
  • With aggregate functions:

    SELECT MAX(k COLLATE latin1_german2_ci)
    FROM t1;
    
  • With DISTINCT:

    SELECT DISTINCT k COLLATE latin1_german2_ci
    FROM t1;
    
  • With WHERE:

         SELECT *
         FROM t1
         WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
    
         SELECT *
         FROM t1
         WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
    
  • With HAVING:

    SELECT k
    FROM t1
    GROUP BY k
    HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
    

9.1.5.2. COLLATE Clause Precedence

The COLLATE clause has high precedence (higher than ||), so the following two expressions are equivalent:

x || y COLLATE z
x || (y COLLATE z)

9.1.5.3. BINARY Operator

The BINARY operator casts the string following it to a binary string. This is an easy way to force a comparison to be done byte by byte rather than character by character. BINARY also causes trailing spaces to be significant.

mysql> SELECT 'a' = 'A';
        -> 1
mysql> SELECT BINARY 'a' = 'A';
        -> 0
mysql> SELECT 'a' = 'a ';
        -> 1
mysql> SELECT BINARY 'a' = 'a ';
        -> 0

BINARY str is shorthand for CAST(str AS BINARY).

The BINARY attribute in character column definitions has a different effect. A character column defined with the BINARY attribute is assigned the binary collation of the column's character set. Every character set has a binary collation. For example, the binary collation for the latin1 character set is latin1_bin, so if the table default character set is latin1, these two column definitions are equivalent:

CHAR(10) BINARY
CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin

The effect of BINARY as a column attribute differs from its effect prior to MySQL 4.1. Formerly, BINARY resulted in a column that was treated as a binary string. A binary string is a string of bytes that has no character set or collation, which differs from a non-binary character string that has a binary collation. For both types of strings, comparisons are based on the numeric values of the string unit, but for non-binary strings the unit is the character and some character sets allow multi-byte characters. Section 10.4.2, “The BINARY and VARBINARY Types”.

The use of CHARACTER SET binary in the definition of a CHAR, VARCHAR, or TEXT column causes the column to be treated as a binary data type. For example, the following pairs of definitions are equivalent:

CHAR(10) CHARACTER SET binary
BINARY(10)

VARCHAR(10) CHARACTER SET binary
VARBINARY(10)

TEXT CHARACTER SET binary
BLOB

9.1.5.4. Some Special Cases Where the Collation Determination Is Tricky

In the great majority of statements, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is the collation of column x:

SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

SELECT x FROM T WHERE x = 'Y';

Should this query use the collation of the column x, or of the string literal 'Y'?

Standard SQL resolves such questions using what used to be called “coercibility” rules. Basically, this means: Both x and 'Y' have collations, so which collation takes precedence? This can be difficult to resolve, but the following rules cover most situations:

  • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)

  • The concatenation of two strings with different collations has a coercibility of 1.

  • A column's collation has a coercibility of 2.

  • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.

  • A literal's collation has a coercibility of 4.

  • NULL or an expression that is derived from NULL has a coercibility of 5.

The preceding coercibility values are current as of MySQL 4.1.11. See the note later in this section for additional version-related information.

Those rules resolve ambiguities in the following manner:

  • Use the collation with the lowest coercibility value.

  • If both sides have the same coercibility, then:

    • If both sides are Unicode, or both sides are not Unicode, it is an error.

    • If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement will not return an error:

      SELECT CONCAT(utf8_column, latin1_column) FROM t1;
      

      It will return a result, and the character set of the result will be utf8. The collation of the result will be the collation of utf8_column. Values of latin1_column will be automatically converted to utf8 before concatenating.

Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.

Examples:

column1 = 'A'Use collation of column1
column1 = 'A' COLLATE xUse collation of 'A'
column1 COLLATE x = 'A' COLLATE yError

The COERCIBILITY() function can be used to determine the coercibility of a string expression:

mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 0
mysql> SELECT COERCIBILITY(VERSION());
        -> 3
mysql> SELECT COERCIBILITY('A');
        -> 4

See Section 11.10.3, “Information Functions”.

Before MySQL 4.1.11, there is no system constant or ignorable coercibility. Functions such as USER() have a coercibility of 2 rather than 3, and literals have a coercibility of 3 rather than 4.

9.1.5.5. Collations Must Be for the Right Character Set

Each character set has one or more collations, but each collation is associated with one and only one character set. Therefore, the following statement causes an error message because the latin2_bin collation is not legal with the latin1 character set:

mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'

In some cases, expressions that worked before MySQL 4.1 fail in early versions of MySQL 4.1 if you do not take character set and collation into account. For example, before 4.1, this statement works as is:

mysql> SELECT SUBSTRING_INDEX(USER(),'@',1);
+-------------------------------+
| SUBSTRING_INDEX(USER(),'@',1) |
+-------------------------------+
| root                          |
+-------------------------------+

The statement also works as is in MySQL 4.1 as of 4.1.8: In MySQL 4.1, usernames are stored using the utf8 character set (see Section 9.1.8, “UTF-8 for Metadata”). The literal string '@' has the server character set (latin1 by default). Although the character sets are different, MySQL can coerce the latin1 string to the character set (and collation) of USER() without data loss. It does so, performs the substring operation, and returns a result that has a character set of utf8.

However, in versions of MySQL 4.1 before 4.1.8, the statement fails:

mysql> SELECT SUBSTRING_INDEX(USER(),'@',1);
ERROR 1267 (HY000): Illegal mix of collations
(utf8_general_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE)
for operation 'substr_index'

This happens because the automatic character set conversion of '@' does not occur and the string operands have different character sets (and thus different collations):

mysql> SELECT COLLATION(USER()), COLLATION('@');
+-------------------+-------------------+
| COLLATION(USER()) | COLLATION('@')    |
+-------------------+-------------------+
| utf8_general_ci   | latin1_swedish_ci |
+-------------------+-------------------+

One way to deal with this is to upgrade to MySQL 4.1.8 or later. If that is not possible, you can tell MySQL to interpret the literal string as utf8:

mysql> SELECT SUBSTRING_INDEX(USER(),_utf8'@',1);
+------------------------------------+
| SUBSTRING_INDEX(USER(),_utf8'@',1) |
+------------------------------------+
| root                               |
+------------------------------------+

Another way is to change the connection character set and collation to utf8. You can do that with SET NAMES 'utf8' or by setting the character_set_connection and collation_connection system variables directly.

9.1.5.6. Examples of the Effect of Collation

Example 1: Sorting German Umlauts

Suppose that column X in table T has these latin1 column values:

Muffler
Müller
MX Systems
MySQL

Suppose also that the column values are retrieved using the following statement:

SELECT X FROM T ORDER BY X COLLATE collation_name;

The following table shows the resulting order of the values if we use ORDER BY with different collations:

latin1_swedish_cilatin1_german1_cilatin1_german2_ci
MufflerMufflerMüller
MX SystemsMüllerMuffler
MüllerMX SystemsMX Systems
MySQLMySQLMySQL

The character that causes the different sort orders in this example is the U with two dots over it (ü), which the Germans call “U-umlaut.

  • The first column shows the result of the SELECT using the Swedish/Finnish collating rule, which says that U-umlaut sorts with Y.

  • The second column shows the result of the SELECT using the German DIN-1 rule, which says that U-umlaut sorts with U.

  • The third column shows the result of the SELECT using the German DIN-2 rule, which says that U-umlaut sorts with UE.

Example 2: Searching for German Umlauts

Suppose that you have three tables that differ only by the character set and collation used:

mysql> CREATE TABLE german1 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german1_ci;
mysql> CREATE TABLE german2 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german2_ci;
mysql> CREATE TABLE germanutf8 (
    ->   c CHAR(10)
    -> ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Each table contains two records:

mysql> INSERT INTO german1 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO german2 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO germanutf8 VALUES ('Bar'), ('Bär');

Two of the above collations have an A = Ä equality, and one has no such equality (latin1_german2_ci). For that reason, you'll get these results in comparisons:

mysql> SELECT * FROM german1 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+
mysql> SELECT * FROM german2 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bär  |
+------+
mysql> SELECT * FROM germanutf8 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+

This is not a bug but rather a consequence of the sorting that latin1_german1_ci or utf8_unicode_ci do (the sorting shown is done according to the German DIN 5007 standard).

9.1.6. Operations Affected by Character Set Support

This section describes operations that take character set information into account as of MySQL 4.1.

9.1.6.1. Result Strings

MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?

For simple functions that take string input and return a string result as output, the output's character set and collation are the same as those of the principal input value. For example, UPPER(X) returns a string whose character string and collation are the same as that of X. The same applies for INSTR(), LCASE(), LOWER(), LTRIM(), MID(), REPEAT(), REPLACE(), REVERSE(), RIGHT(), RPAD(), RTRIM(), SOUNDEX(), SUBSTRING(), TRIM(), UCASE(), and UPPER().

Note: The REPLACE() function, unlike all other functions, always ignores the collation of the string input and performs a case-sensitive comparison.

If a string input or function result is a binary string, the string has no character set or collation. This can be checked by using the CHARSET() and COLLATION() functions, both of which return binary to indicate that their argument is a binary string:

mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary              | binary                |
+---------------------+-----------------------+

For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply for determining the collation of the result:

  • If an explicit COLLATE X occurs, use X.

  • If explicit COLLATE X and COLLATE Y occur, raise an error.

  • Otherwise, if all collations are X, use X.

  • Otherwise, the result has no collation.

For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X END, the resulting collation is X. The same applies for UNION, ||, CONCAT(), ELT(), GREATEST(), IF(), and LEAST().

For operations that convert to character data, the character set and collation of the strings that result from the operations are defined by the character_set_connection and collation_connection system variables. This applies only to CAST(), CHAR(), CONV(), FORMAT(), HEX(), and SPACE().

If you are uncertain about the character set or collation of the result returned by a string function, you can use the CHARSET() or COLLATION() function to find out:

mysql> SELECT USER(), CHARSET(USER()), COLLATION(USER());
+----------------+-----------------+-------------------+
| USER()         | CHARSET(USER()) | COLLATION(USER()) |
+----------------+-----------------+-------------------+
| test@localhost | utf8            | utf8_general_ci   | 
+----------------+-----------------+-------------------+

9.1.6.2. CONVERT() and CAST()

CONVERT() provides a way to convert data between different character sets. The syntax is:

CONVERT(expr USING transcoding_name)

In MySQL, transcoding names are the same as the corresponding character set names.

Examples:

SELECT CONVERT(_latin1'Müller' USING utf8);
INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8) FROM latin1table;

CONVERT(... USING ...) is implemented according to the standard SQL specification.

You may also use CAST() to convert a string to a different character set. The syntax is:

CAST(character_string AS character_data_type CHARACTER SET charset_name)

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);

If you use CAST() without specifying CHARACTER SET, the resulting character set and collation are defined by the character_set_connection and collation_connection system variables. If you use CAST() with CHARACTER SET X, the resulting character set and collation are X and the default collation of X.

You may not use a COLLATE clause inside a CAST(), but you may use it outside. That is, CAST(... COLLATE ...) is illegal, but CAST(...) COLLATE ... is legal.

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;

9.1.6.3. SHOW Statements and INFORMATION_SCHEMA

Several SHOW statements are added or modified in MySQL 4.1 to provide additional character set information. SHOW CHARACTER SET, SHOW COLLATION, and SHOW CREATE DATABASE are new. SHOW CREATE TABLE and SHOW COLUMNS are modified. These statements are described here briefly. For more information, see Section 12.5.4, “SHOW Syntax”.

The SHOW CHARACTER SET command shows all available character sets. It takes an optional LIKE clause that indicates which character set names to match. For example:

mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description                 | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
| latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
| latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
+---------+-----------------------------+-------------------+--------+

The output from SHOW COLLATION includes all available character sets. It takes an optional LIKE clause that indicates which collation names to match. For example:

mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       0 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci | latin1  | 94 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+

SHOW CREATE DATABASE displays the CREATE DATABASE statement that creates a given database:

mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| test     | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+

If no COLLATE clause is shown, the default collation for the character set applies.

SHOW CREATE TABLE is similar, but displays the CREATE TABLE statement to create a given table. The column definitions indicate any character set specifications, and the table options include character set information.

The SHOW COLUMNS statement displays the collations of a table's columns when invoked as SHOW FULL COLUMNS. Columns with CHAR, VARCHAR, or TEXT data types have collations. Numeric and other non-character types have no collation (indicated by NULL as the Collation value). For example:

mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
     Field: id
      Type: smallint(5) unsigned
 Collation: NULL
      Null: NO
       Key: PRI
   Default: NULL
     Extra: auto_increment
Privileges: select,insert,update,references
   Comment:
*************************** 2. row ***************************
     Field: name
      Type: char(60)
 Collation: latin1_swedish_ci
      Null: NO
       Key:
   Default:
     Extra:
Privileges: select,insert,update,references
   Comment:

The character set is not part of the display but is implied by the collation name.

9.1.7. Unicode Support

As of MySQL version 4.1, there are two new character sets for storing Unicode data:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character

  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:

  • Their code values are between 0 and 65535 (or U+0000 .. U+FFFF)

  • They can be encoded with a fixed 16-bit word, as in ucs2

  • They can be encoded with 8, 16, or 24 bits, as in utf8

  • They are sufficient for almost all characters in major languages

The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP.

A similar set of collations is available for each Unicode character set. For example, each has a Danish collation, the names of which are ucs2_danish_ci and utf8_danish_ci. All Unicode collations are listed at Section 9.1.10.1, “Unicode Character Sets”.

In UCS-2, every character is represented by a two-byte Unicode code with the most significant byte first. For example: LATIN CAPITAL LETTER A has the code 0x0041 and it is stored as a two-byte sequence: 0x00 0x41. CYRILLIC SMALL LETTER YERU (Unicode 0x044B) is stored as a two-byte sequence: 0x04 0x4B. For Unicode characters and their codes, please refer to the Unicode Home Page.

The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-endian byte order or a BOM, in which case, conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.

UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629. RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include four-byte sequences. (An older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)

The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

  • Basic Latin letters, digits, and punctuation signs use one byte.

  • Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

  • Korean, Chinese, and Japanese ideographs use three-byte sequences.

MySQL uses no BOM for UTF-8 values.

Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.

UCS-2 cannot be used as a client character set, which means that SET NAMES 'ucs2' does not work. (See Section 9.1.4, “Connection Character Sets and Collations”.)

Client applications that need to communicate with the server using Unicode should set the client character set accordingly; for example, by issuing a SET NAMES 'utf8' statement. ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET. (See Section 9.1.4, “Connection Character Sets and Collations”.)

9.1.8. UTF-8 for Metadata

Metadata is “the data about the data.” Anything that describes the database — as opposed to being the contents of the database — is metadata. Thus column names, database names, usernames, version names, and most of the string results from SHOW are metadata.

Representation of metadata must satisfy these requirements:

  • All metadata must be in the same character set. Otherwise, SHOW wouldn't work properly because different rows in the same column would be in different character sets.

  • Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.

To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.

The metadata requirements mean that the return values of the USER(), CURRENT_USER(), SESSION_USER(), SYSTEM_USER(), DATABASE(), and VERSION() functions have the UTF-8 character set by default.

The server sets the character_set_system system variable to the name of the metadata character set:

mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_system | utf8  |
+----------------------+-------+

Storage of metadata using Unicode does not mean that the server returns headers of columns and the results of DESCRIBE functions in the character_set_system character set by default. When you use SELECT column1 FROM t, the name column1 itself is returned from the server to the client in the character set determined by the value of the character_set_results system variable, which has a default value of latin1. If you want the server to pass metadata results back in a different character set, use the SET NAMES statement to force the server to perform character set conversion. SET NAMES sets the character_set_results and other related system variables. (See Section 9.1.4, “Connection Character Sets and Collations”.) Alternatively, a client program can perform the conversion after receiving the result from the server. It is more efficient for the client perform the conversion, but this option is not available for many clients until late in the MySQL 4.x product cycle.

If character_set_results is set to NULL, no conversion is performed and the server returns metadata using its original character set (the set indicated by character_set_system).

Beginning with MySQL 4.1.1, error messages returned from the server to the client are converted to the client character set automatically, as with metadata.

If you are using (for example) the USER() function for comparison or assignment within a single statement, don't worry. MySQL performs some automatic conversion for you.

SELECT * FROM Table1 WHERE USER() = latin1_column;

This works because the contents of latin1_column are automatically converted to UTF-8 before the comparison.

INSERT INTO Table1 (latin1_column) SELECT USER();

This works because the contents of USER() are automatically converted to latin1 before the assignment.

Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings. For more information about coercion of strings, see Section 9.1.5.4, “Some Special Cases Where the Collation Determination Is Tricky”.

9.1.9. Upgrading Character Sets from MySQL 4.0

What about upgrading from older versions of MySQL? MySQL 4.1 is almost upward compatible with MySQL 4.0 and earlier for the simple reason that almost all the features are new, so there's nothing in earlier versions to conflict with. However, there are some differences and a few things to be aware of.

It is important to note that the “MySQL 4.0 character set” contains both character set and collation information in one single entity. Beginning in MySQL 4.1, character sets and collations are separate entities. Though each collation corresponds to a particular character set, the two are not bundled together.

There is a special treatment of national character sets in MySQL 4.1. NCHAR is not the same as CHAR, and N'...' literals are not the same as '...' literals.

Finally, there is a different file format for storing information about character sets and collations. Make sure that you have reinstalled the /share/mysql/charsets/ directory containing the new configuration files.

If you want to start mysqld from a 4.1.x distribution with data created by MySQL 4.0, you should start the server with the same character set and collation. In this case, you won't need to reindex your data.

There are two ways to do so:

shell> ./configure --with-charset=... --with-collation=...
shell> ./mysqld --default-character-set=... --default-collation=...

If you used mysqld with, for example, the MySQL 4.0 danish character set, you should use the latin1 character set and the latin1_danish_ci collation:

shell> ./configure --with-charset=latin1 \
           --with-collation=latin1_danish_ci
shell> ./mysqld --default-character-set=latin1 \
           --default-collation=latin1_danish_ci

Use the table shown in Section 9.1.9.1, “4.0 Character Sets and Corresponding 4.1 Character Set/Collation Pairs”, to find old 4.0 character set names and their 4.1 character set/collation pair equivalents.

If you have non-latin1 data stored in a 4.0 latin1 table and want to convert the table column definitions to reflect the actual character set of the data, use the instructions in Section 9.1.9.2, “Converting 4.0 Character Columns to 4.1 Format”.

9.1.9.1. 4.0 Character Sets and Corresponding 4.1 Character Set/Collation Pairs

ID4.0 Character Set4.1 Character Set4.1 Collation
1big5big5big5_chinese_ci
2czechlatin2latin2_czech_ci
3dec8dec8dec8_swedish_ci
4doscp850cp850_general_ci
5german1latin1latin1_german1_ci
6hp8hp8hp8_english_ci
7koi8_rukoi8rkoi8r_general_ci
8latin1latin1latin1_swedish_ci
9latin2latin2latin2_general_ci
10swe7swe7swe7_swedish_ci
11usa7asciiascii_general_ci
12ujisujisujis_japanese_ci
13sjissjissjis_japanese_ci
14cp1251cp1251cp1251_bulgarian_ci
15danishlatin1latin1_danish_ci
16hebrewhebrewhebrew_general_ci
17win1251(removed)(removed)
18tis620tis620tis620_thai_ci
19euc_kreuckreuckr_korean_ci
20estonialatin7latin7_estonian_ci
21hungarianlatin2latin2_hungarian_ci
22koi8_ukrkoi8ukoi8u_ukrainian_ci
23win1251ukrcp1251cp1251_ukrainian_ci
24gb2312gb2312gb2312_chinese_ci
25greekgreekgreek_general_ci
26win1250cp1250cp1250_general_ci
27croatlatin2latin2_croatian_ci
28gbkgbkgbk_chinese_ci
29cp1257cp1257cp1257_lithuanian_ci
30latin5latin5latin5_turkish_ci
31latin1_delatin1latin1_german2_ci

9.1.9.2. Converting 4.0 Character Columns to 4.1 Format

Normally, the server runs using the latin1 character set by default. If you have been storing column data that actually is in some other character set that the 4.1 server supports directly, you can convert the column. However, you should avoid trying to convert directly from latin1 to the "real" character set. This may result in data loss. Instead, convert the column to a binary data type, and then from the binary type to a non-binary type with the desired character set. Conversion to and from binary involves no attempt at character value conversion and preserves your data intact. For example, suppose that you have a 4.0 table with three columns that are used to store values represented in latin1, latin2, and utf8:

CREATE TABLE t
(
    latin1_col CHAR(50),
    latin2_col CHAR(100),
    utf8_col CHAR(150)
);

For MySQL 4.1, you want to convert this table to leave latin1_col alone but change the latin2_col and utf8_col columns to have character sets of latin2 and utf8. Before upgrading to 4.1, back up your table, then convert the columns as follows:

ALTER TABLE t MODIFY latin2_col BINARY(100);
ALTER TABLE t MODIFY utf8_col BINARY(150);

Then, after upgrading to 4.1, complete the conversion by issuing these statements:

ALTER TABLE t MODIFY latin2_col CHAR(100) CHARACTER SET latin2;
ALTER TABLE t MODIFY utf8_col CHAR(150) CHARACTER SET utf8;

The first two statements “remove” the character set information from the latin2_col and utf8_col columns. The second two statements assign the proper character sets to the two columns.

If you like, you can combine the to-binary conversions and from-binary conversions into single statements. In MySQL 4.0, do this:

ALTER TABLE t
    MODIFY latin2_col BINARY(100),
    MODIFY utf8_col BINARY(150);

After upgrading to 4.1, do this:

ALTER TABLE t
    MODIFY latin2_col CHAR(100) CHARACTER SET latin2,
    MODIFY utf8_col CHAR(150) CHARACTER SET utf8;

If you can ensure that the tables will not otherwise be modified before you perform the character set conversion, you can issue all of the ALTER TABLE statements after upgrading to MySQL 4.1.

If you specified attributes when creating a column initially, you should also specify them when altering the table with ALTER TABLE. For example, if you specified NOT NULL and an explicit DEFAULT value, you should also provide them in the ALTER TABLE statement. Otherwise, the resulting column definition will not include those attributes.

9.1.10. Character Sets and Collations That MySQL Supports

MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.

You can always list the available character sets and their default collations with the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+

In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.

Collation-Charts.Org is a useful site for information that shows how one collation compares to another.

9.1.10.1. Unicode Character Sets

MySQL 4.1 has two Unicode character sets:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character

  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.7, “Unicode Support”.

A similar set of collations is available for each Unicode character set. These are shown in the following list, where xxx represents the character set name. For example, xxx_danish_ci represents the Danish collations, the specific names of which are ucs2_danish_ci and utf8_danish_ci.

  • xxx_bin

  • xxx_czech_ci

  • xxx_danish_ci

  • xxx_estonian_ci

  • xxx_general_ci (default)

  • xxx_icelandic_ci

  • xxx_latvian_ci

  • xxx_lithuanian_ci

  • xxx_persian_ci

  • xxx_polish_ci

  • xxx_roman_ci

  • xxx_romanian_ci

  • xxx_slovak_ci

  • xxx_slovenian_ci

  • xxx_spanish2_ci

  • xxx_spanish_ci

  • xxx_swedish_ci

  • xxx_turkish_ci

  • xxx_unicode_ci

There is a limitation in MySQL 4.1 that results in two characters not being correctly handled when a user tries to change their case using LOWER() or UPPER():

  • LATIN SMALL LETTER DOTLESS i

  • LATIN CAPITAL LETTER I WITH DOT ABOVE

Here are two workarounds for MySQL 4.1:

  1. Use ucs2 if you have Turkish data.

  2. Use these function calls:

    CONVERT(LOWER(CONVERT(col USING ucs2)) USING utf8)
    

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. Currently, the xxx_unicode_ci collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. The following discussion uses utf8_unicode_ci for concreteness.

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.5.6, “Examples of the Effect of Collation”):

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that “ß” is equal to “s”, and not to “ss”. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The xxx_spanish_ci and xxx_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, “ñ” (n-tilde) is a separate letter between “n” and “o”. In addition, for traditional Spanish, “ch” is a separate letter between “c” and “d”, and “ll” is a separate letter between “l” and “m

In the xxx_roman_ci collations, I and J compare as equal, and U and V compare as equal.

For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

9.1.10.2. West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

  • ascii (US ASCII) collations:

    • ascii_bin

    • ascii_general_ci (default)

  • cp850 (DOS West European) collations:

    • cp850_bin

    • cp850_general_ci (default)

  • dec8 (DEC Western European) collations:

    • dec8_bin

    • dec8_swedish_ci (default)

  • hp8 (HP Western European) collations:

    • hp8_bin

    • hp8_english_ci (default)

  • latin1 (cp1252 West European) collations:

    • latin1_bin

    • latin1_danish_ci

    • latin1_general_ci

    • latin1_general_cs

    • latin1_german1_ci

    • latin1_german2_ci

    • latin1_spanish_ci

    • latin1_swedish_ci (default)

    latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

    The latin1_swedish_ci collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.

    The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.” For an example of the effect this has in comparisons or when doing searches, see Section 9.1.5.6, “Examples of the Effect of Collation”.

    • latin1_german1_ci (dictionary) rules:

      Ä = A
      Ö = O
      Ü = U
      ß = s
      
    • latin1_german2_ci (phone-book) rules:

      Ä = AE
      Ö = OE
      Ü = UE
      ß = ss
      

    For an example of the effect this has in comparisons or when doing searches, see Section 9.1.5.6, “Examples of the Effect of Collation”.

    In the latin1_spanish_ci collation, “ñ” (n-tilde) is a separate letter between “n” and “o”.

  • macroman (Mac West European) collations:

    • macroman_bin

    • macroman_general_ci (default)

  • swe7 (7bit Swedish) collations:

    • swe7_bin

    • swe7_swedish_ci (default)

For additional information about Western European collations in MySQL, see Collation-Charts.Org (ascii, cp850, dec8, hp8, latin1, macroman, swe7).

9.1.10.3. Central European Character Sets

MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).

  • cp1250 (Windows Central European) collations:

    • cp1250_bin

    • cp1250_croatian_ci

    • cp1250_czech_cs

    • cp1250_general_ci (default)

  • cp852 (DOS Central European) collations:

    • cp852_bin

    • cp852_general_ci (default)

  • keybcs2 (DOS Kamenicky Czech-Slovak) collations:

    • keybcs2_bin

    • keybcs2_general_ci (default)

  • latin2 (ISO 8859-2 Central European) collations:

    • latin2_bin

    • latin2_croatian_ci

    • latin2_czech_cs

    • latin2_general_ci (default)

    • latin2_hungarian_ci

  • macce (Mac Central European) collations:

    • macce_bin

    • macce_general_ci (default)

For additional information about Central European collations in MySQL, see Collation-Charts.Org (cp1250, cp852, keybcs2, latin2, macce).

9.1.10.4. South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.

  • armscii8 (ARMSCII-8 Armenian) collations:

    • armscii8_bin

    • armscii8_general_ci (default)

  • cp1256 (Windows Arabic) collations:

    • cp1256_bin

    • cp1256_general_ci (default)

  • geostd8 (GEOSTD8 Georgian) collations:

    • geostd8_bin

    • geostd8_general_ci (default)

  • greek (ISO 8859-7 Greek) collations:

    • greek_bin

    • greek_general_ci (default)

  • hebrew (ISO 8859-8 Hebrew) collations:

    • hebrew_bin

    • hebrew_general_ci (default)

  • latin5 (ISO 8859-9 Turkish) collations:

    • latin5_bin

    • latin5_turkish_ci (default)

For additional information about South European and Middle Eastern collations in MySQL, see Collation-Charts.Org (armscii8, cp1256, geostd8, greek, hebrew, latin5).

9.1.10.5. Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.

  • cp1257 (Windows Baltic) collations:

    • cp1257_bin

    • cp1257_general_ci (default)

    • cp1257_lithuanian_ci

  • latin7 (ISO 8859-13 Baltic) collations:

    • latin7_bin

    • latin7_estonian_cs

    • latin7_general_ci (default)

    • latin7_general_cs

For additional information about Baltic collations in MySQL, see Collation-Charts.Org (cp1257, latin7).

9.1.10.6. Cyrillic Character Sets

The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.

  • cp1251 (Windows Cyrillic) collations:

    • cp1251_bin

    • cp1251_bulgarian_ci

    • cp1251_general_ci (default)

    • cp1251_general_cs

    • cp1251_ukrainian_ci

  • cp866 (DOS Russian) collations:

    • cp866_bin

    • cp866_general_ci (default)

  • koi8r (KOI8-R Relcom Russian) collations:

    • koi8r_bin

    • koi8r_general_ci (default)

  • koi8u (KOI8-U Ukrainian) collations:

    • koi8u_bin

    • koi8u_general_ci (default)

For additional information about Cyrillic collations in MySQL, see Collation-Charts.Org (cp1251, cp866, koi8r, koi8u). ).

9.1.10.7. Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See Section 9.1.10.7.1, “The cp932 Character Set”, for additional information about the cp932 and sjis character sets.

  • big5 (Big5 Traditional Chinese) collations:

    • big5_bin

    • big5_chinese_ci (default)

  • cp932 (SJIS for Windows Japanese) collations:

    • cp932_bin

    • cp932_japanese_ci (default)

  • eucjpms (UJIS for Windows Japanese) collations:

    • eucjpms_bin

    • eucjpms_japanese_ci (default)

  • euckr (EUC-KR Korean) collations:

    • euckr_bin

    • euckr_korean_ci (default)

  • gb2312 (GB2312 Simplified Chinese) collations:

    • gb2312_bin

    • gb2312_chinese_ci (default)

  • gbk (GBK Simplified Chinese) collations:

    • gbk_bin

    • gbk_chinese_ci (default)

  • sjis (Shift-JIS Japanese) collations:

    • sjis_bin

    • sjis_japanese_ci (default)

  • tis620 (TIS620 Thai) collations:

    • tis620_bin

    • tis620_thai_ci (default)

  • ujis (EUC-JP Japanese) collations:

    • ujis_bin

    • ujis_japanese_ci (default)

For additional information about Asian collations in MySQL, see Collation-Charts.Org (big5, cp932, eucjpms, euckr, gb2312, gbk, sjis, tis620, ujis).

9.1.10.7.1. The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected — IBM extended characters, and IBM extended characters.

Since MySQL 4.1, many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted via Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems. It is available as of MySQL 4.1.12.

Before MySQL 4.1, it was safe to use any version of “SHIFT JIS” in conjunction with the sjis character set. However, because MySQL supports character set conversion beginning with 4.1, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

Conversion to ucs2:

sjis/cp932 Valuesjis -> ucs2 Conversioncp932 -> ucs2 Conversion
5C005C005C
7E007E007E
815C20152015
815F005CFF3C
8160301CFF5E
816120162225
817C2212FF0D
819100A2FFE0
819200A3FFE1
81CA00ACFFE2

Conversion from ucs2:

ucs2 valueucs2 -> sjis Conversionucs2 -> cp932 Conversion
005C815F5C
007E7E7E
00A281913F
00A381923F
00AC81CA3F
2015815C815C
201681613F
2212817C3F
22253F8161
301C81603F
FF0D3F817C
FF3C3F815F
FF5E3F8160
FFE03F8191
FFE13F8192
FFE23F81CA

Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.1.2, “Command Options”.

9.2. The Character Set Used for Data and Sorting

By default, MySQL uses the latin1 (cp1252 West European) character set and the latin1_swedish_ci collation that sorts according to Swedish/Finnish rules. These defaults are suitable for the United States and most of Western Europe.

All MySQL binary distributions are compiled with --with-extra-charsets=complex. This adds code to all standard programs that enables them to handle latin1 and all multi-byte character sets within the binary. Other character sets are loaded from a character-set definition file when needed.

The character set determines what characters are allowed in identifiers. The collation determines how strings are sorted by the ORDER BY and GROUP BY clauses of the SELECT statement.

You can change the default server character set and collation with the --character-set-server and --collation-server options when you start the server. The collation must be a legal collation for the default character set. (Use the SHOW COLLATION statement to determine which collations are available for each character set.) See Section 5.1.2, “Command Options”. Before MySQL 4.1.3, use --default-character-set to change the character set; there is no option for changing the collation.

The character sets available depend on the --with-charset=charset_name and --with-extra-charsets=list-of-charsets | complex | all | none options to configure, and the character set configuration files listed in SHAREDIR/charsets/Index. See Section 2.9.2, “Typical configure Options”.

If you change the character set when running MySQL, that may also change the sort order. Consequently, you must run myisamchk -r -q --set-collation=collation_name on all MyISAM tables, or your indexes may not be ordered correctly. (Use --set-character-set before MySQL 4.1.1.)

When a client connects to a MySQL server, the server indicates to the client what the server's default character set is. The client switches to this character set for this connection.

You should use mysql_real_escape_string() when escaping strings for an SQL query. mysql_real_escape_string() is identical to the old mysql_escape_string() function, except that it takes the MYSQL connection handle as the first parameter so that the appropriate character set can be taken into account when escaping characters.

If the client is compiled with paths that differ from where the server is installed and the user who configured MySQL didn't include all character sets in the MySQL binary, you must tell the client where it can find the additional character sets it needs if the server runs with a different character set from the client. You can do this by specifying a --character-sets-dir option to indicate the path to the directory in which the dynamic MySQL character sets are stored. For example, you can put the following in an option file:

[client]
character-sets-dir=/usr/local/mysql/share/mysql/charsets

You can force the client to use specific character set as follows:

[client]
default-character-set=charset_name

This is normally unnecessary, however.

9.2.1. Using the German Character Set

In MySQL 4.0, to get German sorting order, you should start mysqld with a --default-character-set=latin1_de option. This affects server behavior in several ways:

  • When sorting and comparing strings, the following mapping is performed on the strings before doing the comparison:

    ä  ->  ae
    ö  ->  oe
    ü  ->  ue
    ß  ->  ss
    
  • All accented characters are converted to their unaccented uppercase counterpart. All letters are converted to uppercase.

  • When comparing strings with LIKE, the one-character to two-character mapping is not done. All letters are converted to uppercase. Accents are removed from all letters except Ü, ü, Ö, ö, Ä, and ä.

In MySQL 4.1 and up, character set and collation are specified separately. You should select the latin1 character set and either the latin1_german1_ci or latin1_german2_ci collation. For example, to start the server with the latin1_german1_ci collation, use the --character-set-server=latin1 and --collation-server=latin1_german1_ci options.

For information on the differences between these two collations, see Section 9.1.10.2, “West European Character Sets”.

9.3. Setting the Error Message Language

By default, mysqld produces error messages in English, but they can also be displayed in any of these other languages: Czech, Danish, Dutch, Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or Swedish.

To start mysqld with a particular language for error messages, use the --language or -L option. The option value can be a language name or the full path to the error message file. For example:

shell> mysqld --language=swedish

Or:

shell> mysqld --language=/usr/local/share/swedish

The language name should be specified in lowercase.

By default, the language files are located in the share/LANGUAGE directory under the MySQL base directory.

You can also change the content of the error messages produced by the server. Details can be found in the MySQL Internals manual, available at http://forge.mysql.com/wiki/MySQL_Internals_Error_Messages. If you upgrade to a newer version of MySQL after changing the error messages, remember to repeat your changes after the upgrade.

9.4. Adding a New Character Set

This section discusses the procedure for adding a new character set to MySQL. You must have a MySQL source distribution to use these instructions. There is one procedure for MySQL 4.1 and a different one for MySQL 4.0 or older. For either procedure, the instructions depend on whether the character set is simple or complex:

  • If the character set does not need to use special string collating routines for sorting and does not need multi-byte character support, it is simple.

  • If the character set needs either of those features, it is complex.

For example, greek and swe7 are simple character sets, whereas big5 and czech are complex character sets.

In the following instructions, MYSET represents the name of the character set that you want to add.

If you have MySQL 4.1, use this procedure to add a new character set:

  1. Add a <charset> element for MYSET to the sql/share/charsets/Index.xml file. Use the existing contents in the file as a guide to adding new contents.

    The <charset> element must list all the collations for the character set. These must include at least a binary collation and a default collation. The default collation is usually named using a suffix of general_ci (general, case insensitive). It is possible for the binary collation to be the default collation, but usually they are different. The default collation should have a primary flag. The binary collation should have a binary flag.

    You must assign a unique ID number to each collation. To see the currently used collation IDs, use this query:

    SHOW COLLATION;
    
  2. This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires C source file that defines collation functions, multi-byte functions, or both.

    For a simple character set, create a configuration file, MYSET.xml, that describes the character set properties. Create this file in the sql/share/charsets directory. (You can use a copy of latin1.xml as the basis for this file.) The syntax for the file is very simple:

    • Comments are written as ordinary XML comments (<!-- text -->).

    • Words within <map> array elements are separated by arbitrary amounts of whitespace.

    • Each word within <map> array elements must be a number in hexadecimal format.

    • The <map> array element for the <ctype> element has 257 words. The other <map> array elements after that have 256 words. See Section 9.4.1, “The Character Definition Arrays”.

    • For each collation listed in the <charset> element for the character set in Index.xml, MYSET.xml must contain a <collation> element that defines the character ordering.

    For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:

    1. Create the file ctype-MYSET.c in the strings directory. Look at one of the existing ctype-*.c files (such as ctype-big5.c) to see what needs to be defined. The arrays in your file must have names like ctype_MYSET, to_lower_MYSET, and so on. These correspond to the arrays for a simple character set. See Section 9.4.1, “The Character Definition Arrays”.

    2. For each collation listed in the <charset> element for the character set in Index.xml, the ctype-MYSET.c file must provide an implementation of the collation.

    3. If you need string collating functions, see Section 9.4.2, “String Collating Support”.

    4. If you need multi-byte character support, see Section 9.4.3, “Multi-Byte Character Support”.

  3. Follow these steps to modify the configuration information. Use the existing configuration information as a guide to adding information for MYSYS. The example here assumes that the character set has default and binary collations, but more lines will be needed if MYSET has additional collations.

    1. Edit mysys/charset-def.c, and “register” the collations for the new character set.

      Add these lines to the “declaration” section:

      #ifdef HAVE_CHARSET_MYSET
      extern CHARSET_INFO my_charset_MYSET_general_ci;
      extern CHARSET_INFO my_charset_MYSET_bin;
      #endif
      

      Add these lines to the “registration” section:

      #ifdef HAVE_CHARSET_MYSET
        add_compiled_collation(&my_charset_MYSET_general_ci);
        add_compiled_collation(&my_charset_MYSET_bin);
      #endif
      
    2. If the character set uses ctype-MYSET.c, edit strings/Makefile.am and add ctype-MYSET.c to each definition of the CSRCS variable, and to the EXTRA_DIST variable.

    3. If the character set uses ctype-MYSET.c, edit libmysql/Makefile.shared and add ctype-MYSET.lo to the mystringsobjects definition.

    4. Edit configure.in:

      1. Add MYSET to one of the define(CHARSETS_AVAILABLE...) lines in alphabetic order.

      2. Add MYSET to CHARSETS_COMPLEX. This is needed even for simple character sets, or configure will not recognize --default-character-set=MYSET.

      3. Add MYSET to the first case control structure. Omit the USE_MB and USE_MB_IDENT lines for 8-bit character sets.

        MYSET)
          AC_DEFINE(HAVE_CHARSET_MYSET, 1, [Define to enable charset MYSET])
          AC_DEFINE([USE_MB], 1, [Use multi-byte character routines])
          AC_DEFINE(USE_MB_IDENT, 1)
          ;;
        
      4. Add MYSET to the second case control structure:

        MYSET)
          default_charset_default_collation="MYSET_general_ci"
          default_charset_collations="MYSET_general_ci MYSET_bin"
          ;;
        
  4. Reconfigure, recompile, and test.

If you have MySQL 4.0 or older, use this procedure to add a new character set:

  1. Add MYSET to the end of the sql/share/charsets/Index file. Assign a unique number to it.

  2. This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires C source file that defines collation functions, multi-byte functions, or both.

    For a simple character set, create a configuration file that describes the character set properties. Create the file MYSET.conf file in the sql/share/charsets directory. (You can use a copy of latin1.conf as the basis for this file.) The syntax for the file is very simple:

    • Comments start with a “#” character and continue to the end of the line.

    • Words are separated by arbitrary amounts of whitespace.

    • When defining the character set, every word must be a number in hexadecimal format.

    • The ctype array takes up the first 257 words. The to_lower[], to_upper[], and sort_order[] arrays take up 256 words each after that. See Section 9.4.1, “The Character Definition Arrays”.

    For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:

    1. Create the file ctype-MYSET.c in the strings directory. Look at one of the existing ctype-*.c files (such as ctype-big5.c) to see what needs to be defined. The arrays in your file must have names like ctype_MYSET, to_lower_MYSET, and so on. These correspond to the arrays for a simple character set. See Section 9.4.1, “The Character Definition Arrays”.

    2. Near the top of the file, place a special comment like this:

      /*
       * This comment is parsed by configure to create ctype.c,
       * so don't change it unless you know what you are doing.
       *
       * .configure. strxfrm_multiply_MYSET=N
       * .configure. mbmaxlen_MYSET=N
       */
      

      The configure program uses this comment to include the character set into the MySQL library automatically.

      If you need string collating functions, you must specify the strxfrm_multiply_MYSET=N value in the special comment at the top of the source file. N must be a positive integer that indicates the maximum ratio to which strings may grow during execution of the my_strxfrm_MYSET() function.

      If you need multi-byte character set functions, you must specify the mbmaxlen_MYSET=N value in the special comment at the top of the ctype-MYSET.c source file for your character set. N should be set to the size in bytes of the largest character in the set.

    3. If you need string collating functions, see Section 9.4.2, “String Collating Support”.

    4. If you need multi-byte character support, see Section 9.4.3, “Multi-Byte Character Support”.

  3. Follow these steps to modify the configuration information. Use the existing configuration information as a guide to adding information for MYSYS.

    1. Add the character set name to the CHARSETS_AVAILABLE list in configure.in.

    2. If the character set uses ctype-MYSET.c, edit strings/Makefile.am and add ctype-MYSET.c to the EXTRA_DIST variable.

  4. Reconfigure, recompile, and test.

The sql/share/charsets/README file includes additional instructions.

9.4.1. The Character Definition Arrays

Each simple character set has a configuration file located in the sql/share/charsets directory. The file is named MYSET.xml for MySQL 4.1 and MYSET.conf for MySQL 4.0 or older.

For MySQL 4.1, MYSET.xml files use <map> array elements to list character set properties. <map> elements appear within these elements:

  • <ctype> defines attributes for each character

  • <lower> and <upper> list the lowercase and uppercase characters

  • <unicode> maps 8-bit character values to Unicode values

  • <collation> elements indicate character ordering for comparisons and sorts, one element per collation (binary collations need no <map> element because the character codes themselves provide the ordering)

For MySQL 4.0 or older, MYSET.conf files list character set properties using these arrays:

  • ctype[] defines attributes for each character

  • to_lower[] and to_upper[] list the lowercase and uppercase characters

  • sort_order[] indicates character ordering for comparisons and sorts

For a complex character set as implemented in a ctype-MYSET.c file in the strings directory, there are corresponding arrays: ctype_MYSET[], to_lower_MYSET[], and so forth. Not every complex character set has all of the arrays. See the existing ctype-*.c files for examples. For MySQL 4.1, see the CHARSET_INFO.txt file in the strings directory for additional information.

The ctype array is indexed by character value + 1 and has 257 elements. This is an old legacy convention for handling EOF. The other arrays are indexed by character value and have 256 elements.

ctype array elements are bit values. Each element describes the attributes of a single character in the character set. Each attribute is associated with a bitmask, as defined in include/m_ctype.h:

#define _MY_U   01      /* Upper case */
#define _MY_L   02      /* Lower case */
#define _MY_NMR 04      /* Numeral (digit) */
#define _MY_SPC 010     /* Spacing character */
#define _MY_PNT 020     /* Punctuation */
#define _MY_CTR 040     /* Control character */
#define _MY_B   0100    /* Blank */
#define _MY_X   0200    /* heXadecimal digit */

The ctype value for a given character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character (_MY_U) as well as a hexadecimal digit (_MY_X), so its ctype value should be defined like this:

ctype['A'+1] = _MY_U | _MY_X = 01 | 0200 = 0201

The bitmask values in m_ctype.h are octal values, but the elements of the ctype array in MYSET.xml (or MYSET.conf) should be written as hexadecimal values.

The lower and upper (or to_lower and to_upper) arrays hold the lowercase and uppercase characters corresponding to each member of the character set. For example:

lower['A'] should contain 'a'
upper['a'] should contain 'A'

Each collation (or sort_order) array is a map indicating how characters should be ordered for comparison and sorting purposes. MySQL sorts characters based on the values of this information. In some cases, this is the same as the upper array, which means that sorting is case-insensitive. For more complicated sorting rules (for complex character sets), see the discussion of string collating in Section 9.4.2, “String Collating Support”.

9.4.2. String Collating Support

For simple character sets in MySQL 4.1, sorting rules are specified in the MYSET.xml configuration file using <map> array elements within <collation> elements. (For MySQL 4.0 or older, the rules are given by the sort_order array in MYSET.conf.) If the sorting rules for your language are too complex to be handled with simple arrays, you need to define string collating functions in the ctype-MYSET.c source file in the strings directory.

The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the ctype-*.c files in the strings directory, such as the files for the big5, czech, gbk, sjis, and tis160 character sets. For MySQL 4.1, take a look at the MY_COLLATION_HANDLER structures to see how they are used, and see the CHARSET_INFO.txt file in the strings directory for additional information. For MySQL 4.0 or older, look at the CHARSET_INFO structures.

9.4.3. Multi-Byte Character Support

If you want to add support for a new character set that includes multi-byte characters, you need to use multi-byte character functions in the ctype-MYSET.c source file in the strings directory.

The existing character sets provide the best documentation and examples to show how these functions are implemented. Look at the ctype-*.c files in the strings directory, such as the files for the euc_kr, gb2312, gbk, sjis, and ujis character sets. For MySQL 4.1, take a look at the MY_CHARSET_HANDLER structures to see how they are used, and see the CHARSET_INFO.txt file in the strings directory for additional information. For MySQL 4.0 or older, look at the CHARSET_INFO structures.

9.5. How to Add a New Collation to a Character Set

A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations.

A collation orders characters based on weights. Each character in a character set maps to a weight. Characters with equal weights compare as equal, and characters with unequal weights compare according to the relative magnitude of their weights.

MySQL supports several collation implementations, as discussed in Section 9.5.1, “Collation Implementation Types”. Some of these can be added to MySQL without recompiling:

  • Simple collations for 8-bit character sets

  • UCA-based collations for Unicode character sets

  • Binary (xxx_bin) collations

The following discussion describes how to add simple collations for 8-bit character sets. In MySQL 4.1, this is the only type of collation that can be added without recompiling. To add a UCA-based Unicode collation, MySQL 5.0 or higher is required.

All existing character sets already have a binary collation, so there is no need here to describe how to add one.

Summary of the procedure for adding a new collation:

  1. Choose a collation ID

  2. Add configuration information that names the collation and describes the character-ordering rules

  3. Restart the server

  4. Verify that the collation is present

The instructions here cover only collations that can be added without recompiling MySQL. To add a collation that does require recompiling (as implemented by means of functions in a C source file), use the instructions in Section 9.4, “Adding a New Character Set”. However, instead of adding all the information required for a complete character set, just modify the appropriate files for an existing character set. That is, based on what is already present for the character set's current collations, add new data structures, functions, and configuration information for the new collation. For an example, see the MySQL Blog article in the following list of additional resources.

Additional resources

9.5.1. Collation Implementation Types

MySQL implements several types of collations:

Simple collations for 8-bit character sets

This kind of collation is implemented using an array of 256 weights that defines a one-to-one mapping from character codes to weights. latin1_swedish_ci is an example. It is a case-insensitive collation, so the uppercase and lowercase versions of a character have the same weights and they compare as equal.

mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
|         1 | 
+-----------+
1 row in set (0.00 sec)

Complex collations for 8-bit character sets

This kind of collation is implemented using functions in a C source file that define how to order characters, as described in Section 9.4, “Adding a New Character Set”.

Collations for non-Unicode multi-byte character sets

For this type of collation, 8-bit (single-byte) and multi-byte characters are handled differently. For 8-bit characters, character codes map to weights in case-insensitive fashion. (For example, the single-byte characters 'a' and 'A' both have a weight of 0x41.) For multi-byte characters, there are two types of relationship between character codes and weights:

  • Weights equal character codes. sjis_japanese_ci is an example of this kind of collation. The multi-byte character 'ぢ' has a character code of 0x82C0, and the weight is also 0x82C0.

  • Character codes map one-to-one to weights, but a code is not necessarily equal to the weight. gbk_chinese_ci is an example of this kind of collation. The multi-byte character '膰' has a character code of 0x81B0 but a weight of 0xC286.

Collations for Unicode multi-byte character sets

Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.

mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
+-----------+-----------+-----------+
|         1 |         1 |         1 | 
+-----------+-----------+-----------+
1 row in set (0.06 sec)

UCA-based collations in MySQL have these properties:

  • If a character has weights, each weight uses 2 bytes (16 bits)

  • A character may have zero weights (or an empty weight). In this case, the character is ignorable. Example: "U+0000 NULL" does not have a weight and is ignorable.

  • A character may have one weight. Example: 'a' has a weight of 0x0E33.

  • A character may have many weights. This is an expansion. Example: The German letter 'ß' (SZ LEAGUE, or SHARP S) has a weight of 0x0FEA0FEA.

  • Many characters may have one weight. This is a contraction. Example: 'ch' is a single letter in Czech and has a weight of 0x0EE2.

A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.

Miscellaneous collations

There are also a few collations that do not fall into any of the previous categories.

9.5.2. Choosing a Collation ID

Each collation must have a unique ID. To add a new collation, you must choose an ID value that is not currently used. The ID that you choose is the value that will show up in these contexts:

  • The Id column of SHOW COLLATION output

  • The charsetnr member of the MYSQL_FIELD C API data structure

To display a list of the currently used collation IDs, use this statement:

mysql> SHOW COLLATION;
+----------------------+----------+-----+---------+----------+---------+
| Collation            | Charset  | Id  | Default | Compiled | Sortlen |
+----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci      | big5     |   1 | Yes     | Yes      |       1 | 
| big5_bin             | big5     |  84 |         | Yes      |       1 | 
...
| latin1_german1_ci    | latin1   |   5 |         | Yes      |       1 | 
| latin1_swedish_ci    | latin1   |   8 | Yes     | Yes      |       1 | 
| latin1_danish_ci     | latin1   |  15 |         | Yes      |       1 | 
| latin1_german2_ci    | latin1   |  31 |         | Yes      |       2 | 
| latin1_bin           | latin1   |  47 |         | Yes      |       1 | 
| latin1_general_ci    | latin1   |  48 |         | Yes      |       1 | 
| latin1_general_cs    | latin1   |  49 |         | Yes      |       1 | 
| latin1_spanish_ci    | latin1   |  94 |         | Yes      |       1 | 
| latin2_czech_cs      | latin2   |   2 |         | Yes      |       4 | 
| latin2_general_ci    | latin2   |   9 | Yes     | Yes      |       1 | 
| latin2_hungarian_ci  | latin2   |  21 |         | Yes      |       1 | 
| latin2_croatian_ci   | latin2   |  27 |         | Yes      |       1 | 
| latin2_bin           | latin2   |  77 |         | Yes      |       1 | 
...
+----------------------+----------+-----+---------+----------+---------+

Look through the values in the Id column and pick a value that is not used.

Warning

If you upgrade MySQL, you may find that the collation ID you choose has been assigned to a collation included in the new MySQL distribution. In this case, you will need to choose a new value for your own collation.

In addition, before upgrading, you should save the configuration files that you change. If you upgrade in place, the process will replace the your modified files.

9.5.3. Adding a Simple Collation to an 8-Bit Character Set

To add a simple collation for an 8-bit character set without recompiling MySQL, use the following procedure. The example adds a collation named latin1_test_ci to the latin1 character set.

  1. Choose a collation ID, as shown in Section 9.5.2, “Choosing a Collation ID”. The following steps use an ID of 56.

  2. You will need to modify the Index.xml and latin1.xml configuration files. These files will be located in the directory named by the character_sets_dir system variable. You can check the variable value as follows, although the pathname might be different on your system:

    mysql> SHOW VARIABLES LIKE 'character_sets_dir';
    +--------------------+-----------------------------------------+
    | Variable_name      | Value                                   |
    +--------------------+-----------------------------------------+
    | character_sets_dir | /user/local/mysql/share/mysql/charsets/ | 
    +--------------------+-----------------------------------------+
    
  3. Choose a name for the collation and list it in the Index.xml file. Find the <charset> element for the character set to which the collation is being added, and add a <collation> element that indicates the collation name and ID. For example:

    <charset name="latin1">
      ...
      <!-- associate collation name with its ID -->
      <collation name="latin1_test_ci" id="56"/>
      ...
    </charset>
    
  4. In the latin1.xml configuration file, add a <collation> element that names the collation and that contains a <map> element that defines a character code-to-weight mapping table. Each word within the <map> element must be a number in hexadecimal format.

    <collation name="latin1_test_ci">
    <map>
     00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
     10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
     20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
     30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
     40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
     50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
     60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
     50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
     80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
     90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
     A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
     B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
     41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
     44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
     41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
     44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
    </map>
    </collation>
    
  5. Restart the server and use this statement to verify that the collation is present:

    mysql> SHOW COLLATION LIKE 'latin1_test_ci';
    +----------------+---------+----+---------+----------+---------+
    | Collation      | Charset | Id | Default | Compiled | Sortlen |
    +----------------+---------+----+---------+----------+---------+
    | latin1_test_ci | latin1  | 56 |         |          |       1 | 
    +----------------+---------+----+---------+----------+---------+
    

9.6. Problems With Character Sets

If you try to use a character set that is not compiled into your binary, you might run into the following problems:

  • Your program uses an incorrect path to determine where the character sets are stored (which is typically the share/mysql/charsets or share/charsets directory under the MySQL installation directory). This can be fixed by using the --character-sets-dir option when you run the program in question. For example, to specify a directory to be used by MySQL client programs, list it in the [client] group of your option file. The examples given here show what the setting might look like for Unix or Windows, respectively:

    [client]
    character-sets-dir=/usr/local/mysql/share/mysql/charsets
    
    [client]
    character-sets-dir="C:/Program Files/MySQL/MySQL Server 4.1/share/charsets"
    
  • The character set is a complex character set that cannot be loaded dynamically. In this case, you must recompile the program with support for the character set.

  • The character set is a dynamic character set, but you do not have a configuration file for it. In this case, you should install the configuration file for the character set from a new MySQL distribution.

  • If your character set index file does not contain the name for the character set, your program displays an error message. In MySQL 4.1, the file is named Index.xml and the message is:

    Character set 'charset_name' is not a compiled character set and is not
    specified in the '/usr/share/mysql/charsets/Index.xml' file
    

    Before MySQL 4.1, the file is named Index and the message is:

    ERROR 1105: File '/usr/local/share/mysql/charsets/charset_name.conf'
    not found (Errcode: 2)
    

    To solve this problem, you should either get a new index file or manually add the name of any missing character sets to the current file.

For MyISAM tables, you can check the character set name and number for a table with myisamchk -dvv tbl_name.

9.7. MySQL Server Time Zone Support

You can set the system time zone for MySQL Server at startup with the --timezone=timezone_name option to mysqld_safe. You can also set it by setting the TZ environment variable before you start mysqld. The allowable values for --timezone or TZ are system-dependent. Consult your operating system documentation to see what values are acceptable.

Before MySQL 4.1.3, the server operates only in the system time zone set at startup. Beginning with MySQL 4.1.3, the server maintains several time zone settings, some of which can be modified at runtime:

  • The system time zone. When the server starts, it attempts to determine the time zone of the host machine and uses it to set the system_time_zone system variable. The value does not change thereafter.

  • The server's current time zone. The global time_zone system variable indicates the time zone the server currently is operating in. The initial value for time_zone is 'SYSTEM', which indicates that the server time zone is the same as the system time zone.

    The initial global server time zone value can be specified explicitly at startup with the --default-time-zone=timezone option on the command line, or you can use the following line in an option file:

    default-time-zone='timezone'
    

    If you have the SUPER privilege, you can set the global server time zone value at runtime with this statement:

    mysql> SET GLOBAL time_zone = timezone;
    
  • Per-connection time zones. Each client that connects has its own time zone setting, given by the session time_zone variable. Initially, the session variable takes its value from the global time_zone variable, but the client can change its own time zone with this statement:

    mysql> SET time_zone = timezone;
    

The current session time zone setting affects display and storage of time values that are zone-sensitive. This includes the values displayed by functions such as NOW() or CURTIME(), and values stored in and retrieved from TIMESTAMP columns. Values for TIMESTAMP columns are converted from the current time zone to UTC for storage, and from UTC to the current time zone for retrieval. The current time zone setting does not affect values displayed by functions such as UTC_TIMESTAMP() or values in DATE, TIME, or DATETIME columns.

The current values of the global and client-specific time zones can be retrieved like this:

mysql> SELECT @@global.time_zone, @@session.time_zone;

timezone values can be given in several formats, none of which are case sensitive:

  • The value 'SYSTEM' indicates that the time zone should be the same as the system time zone.

  • The value can be given as a string indicating an offset from UTC, such as '+10:00' or '-6:00'.

  • The value can be given as a named time zone, such as 'Europe/Helsinki', 'US/Eastern', or 'MET'. Named time zones can be used only if the time zone information tables in the mysql database have been created and populated.

The MySQL installation procedure creates the time zone tables in the mysql database, but does not load them. You must do so manually using the following instructions. (If you are upgrading to MySQL 4.1.3 or later from an earlier version, you can create the tables by upgrading your mysql database. Use the instructions in Section 4.4.5, “mysql_fix_privilege_tables — Upgrade MySQL System Tables”. After creating the tables, you can load them.)

Note

Loading the time zone information is not necessarily a one-time operation because the information changes occasionally. For example, the rules for Daylight Saving Time in the United States, Mexico, and parts of Canada changed in 2007. When such changes occur, applications that use the old rules become out of date and you may find it necessary to reload the time zone tables to keep the information used by your MySQL server current. See the notes at the end of this section.

If your system has its own zoneinfo database (the set of files describing time zones), you should use the mysql_tzinfo_to_sql program for filling the time zone tables. Examples of such systems are Linux, FreeBSD, Sun Solaris, and Mac OS X. One likely location for these files is the /usr/share/zoneinfo directory. If your system does not have a zoneinfo database, you can use the downloadable package described later in this section.

The mysql_tzinfo_to_sql program is used to load the time zone tables. On the command line, pass the zoneinfo directory pathname to mysql_tzinfo_to_sql and send the output into the mysql program. For example:

shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql

mysql_tzinfo_to_sql reads your system's time zone files and generates SQL statements from them. mysql processes those statements to load the time zone tables.

mysql_tzinfo_to_sql also can be used to load a single time zone file or to generate leap second information:

  • To load a single time zone file tz_file that corresponds to a time zone name tz_name, invoke mysql_tzinfo_to_sql like this:

    shell> mysql_tzinfo_to_sql tz_file tz_name | mysql -u root mysql
    

    With this approach, you must execute a separate command to load the time zone file for each named zone that the server needs to know about.

  • If your time zone needs to account for leap seconds, initialize the leap second information like this, where tz_file is the name of your time zone file:

    shell> mysql_tzinfo_to_sql --leap tz_file | mysql -u root mysql
    
  • After running mysql_tzinfo_to_sql, it is best to restart the server so that it does not continue to use any previously cached time zone data.

If your system is one that has no zoneinfo database (for example, Windows or HP-UX), you can use the package of pre-built time zone tables that is available for download at the MySQL Developer Zone:

http://dev.mysql.com/downloads/timezones.html

This time zone package contains .frm, .MYD, and .MYI files for the MyISAM time zone tables. These tables should be part of the mysql database, so you should place the files in the mysql subdirectory of your MySQL server's data directory. The server should be stopped while you do this and restarted afterward.

Warning

Do not use the downloadable package if your system has a zoneinfo database. Use the mysql_tzinfo_to_sql utility instead. Otherwise, you may cause a difference in datetime handling between MySQL and other applications on your system.

For information about time zone settings in replication setup, please see Section 14.7, “Replication Features and Known Problems”.

Staying Current with Time Zone Changes

As mentioned earlier, when the time zone rules change, applications that use the old rules become out of date. To stay current, it is necessary to make sure that your system uses current time zone information is used. For MySQL, there are two factors to consider in staying current:

  • The operating system time affects the value that the MySQL server uses for times if its time zone is set to SYSTEM. Make sure that your operating system is using the latest time zone information. For most operating systems, the latest update or service pack prepares your system for the time changes. Check the Web site for your operating system vendor for an update that addresses the time changes.

  • If you replace the system's /etc/localtime timezone file with a version that uses rules differing from those in effect at mysqld startup, you should restart mysqld so that it uses the updated rules. Otherwise, mysqld might not notice when the system changes its time.

  • If you use named time zones with MySQL, make sure that the time zone tables in the mysql database are up to date. If your system has its own zoneinfo database, you should reload the MySQL time zone tables whenever the zoneinfo database is updated, using the instructions given earlier in this section. For systems that do not have their own zoneinfo database, check the MySQL Developer Zone for updates. When a new update is available, download it and use it to replace your current time zone tables. mysqld caches time zone information that it looks up, so after replacing the time zone tables, you should restart mysqld to make sure that it does not continue to serve outdated time zone data.

  • For versions of MySQL older than 4.1.3 that do not have time zone support, the server always tracks the operating system time (much like a time zone setting of SYSTEM in 4.1.3 and up). Assuming that the server host itself has its operating system updated to handle any changes to Daylight Saving Time rules, the MySQL server should know the correct time.

If you are uncertain whether named time zones are available, for use either as the server's time zone setting or by clients that set their own time zone, check whether your time zone tables are empty. The following query determines whether the table that contains time zone names has any rows:

mysql> SELECT COUNT(*) FROM mysql.time_zone_name;
+----------+
| COUNT(*) |
+----------+
|        0 | 
+----------+

A count of zero indicates that the table is empty. In this case, no one can be using named time zones, and you don't need to update the tables. A count greater than zero indicates that the table is not empty and that its contents are available to be used for named time zone support. In this case, you should be sure to reload your time zone tables so that anyone who uses named time zones will get correct query results.

To check whether your MySQL installation is updated properly for a change in Daylight Saving Time rules, use a test like the one following. The example uses values that are appropriate for the 2007 DST 1-hour change that occurs in the United States on March 11 at 2 a.m.

The test uses these two queries:

SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');

The two time values indicate the times at which the DST change occurs, and the use of named time zones requires that the time zone tables be used. The desired result is that both queries return the same result (the input time, converted to the equivalent value in the 'US/Central' time zone).

Before updating the time zone tables, you would see an incorrect result like this:

mysql> SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00                                        | 
+------------------------------------------------------------+

mysql> SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 02:00:00                                        |
+------------------------------------------------------------+

After updating the tables, you should see the correct result:

mysql> SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00                                        | 
+------------------------------------------------------------+

mysql> SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+
| CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') |
+------------------------------------------------------------+
| 2007-03-11 01:00:00                                        | 
+------------------------------------------------------------+

9.8. MySQL Server Locale Support

Beginning with MySQL 4.1.21, the locale indicated by the lc_time_names system variable controls the language used to display day and month names and abbreviations. This variable affects the output from the DATE_FORMAT(), DAYNAME() and MONTHNAME() functions.

Locale names are POSIX-style values such as 'ja_JP' or 'pt_BR'. The default value is 'en_US' regardless of your system's locale setting, but any client can examine or set its lc_time_names value as shown in the following example:

mysql> SET NAMES 'utf8';
Query OK, 0 rows affected (0.09 sec)

mysql> SELECT @@lc_time_names;
+-----------------+
| @@lc_time_names |
+-----------------+
| en_US           | 
+-----------------+
1 row in set (0.00 sec)

mysql> SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+
| DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') |
+-----------------------+-------------------------+
| Friday                | January                 | 
+-----------------------+-------------------------+
1 row in set (0.00 sec)

mysql> SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+
| DATE_FORMAT('2010-01-01','%W %a %M %b') |
+-----------------------------------------+
| Friday Fri January Jan                  | 
+-----------------------------------------+
1 row in set (0.00 sec)

mysql> SET lc_time_names = 'es_MX';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @@lc_time_names;
+-----------------+
| @@lc_time_names |
+-----------------+
| es_MX           | 
+-----------------+
1 row in set (0.00 sec)

mysql> SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+
| DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') |
+-----------------------+-------------------------+
| viernes               | enero                   | 
+-----------------------+-------------------------+
1 row in set (0.00 sec)

mysql> SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+
| DATE_FORMAT('2010-01-01','%W %a %M %b') |
+-----------------------------------------+
| viernes vie enero ene                   | 
+-----------------------------------------+
1 row in set (0.00 sec)

The day or month name for each of the affected functions is converted from utf8 to the character set indicated by the character_set_connection system variable.

lc_time_names may be set to any of the following locale values:

ar_AE: Arabic - United Arab Emiratesar_BH: Arabic - Bahrain
ar_DZ: Arabic - Algeriaar_EG: Arabic - Egypt
ar_IN: Arabic - Iranar_IQ: Arabic - Iraq
ar_JO: Arabic - Jordanar_KW: Arabic - Kuwait
ar_LB: Arabic - Lebanonar_LY: Arabic - Libya
ar_MA: Arabic - Moroccoar_OM: Arabic - Oman
ar_QA: Arabic - Qatarar_SA: Arabic - Saudi Arabia
ar_SD: Arabic - Sudanar_SY: Arabic - Syria
ar_TN: Arabic - Tunisiaar_YE: Arabic - Yemen
be_BY: Belarusian - Belarusbg_BG: Bulgarian - Bulgaria
ca_ES: Catalan - Catalancs_CZ: Czech - Czech Republic
da_DK: Danish - Denmarkde_AT: German - Austria
de_BE: German - Belgiumde_CH: German - Switzerland
de_DE: German - Germanyde_LU: German - Luxembourg
EE: Estonian - Estoniaen_AU: English - Australia
en_CA: English - Canadaen_GB: English - United Kingdom
en_IN: English - Indiaen_NZ: English - New Zealand
en_PH: English - Philippinesen_US: English - United States
en_ZA: English - South Africaen_ZW: English - Zimbabwe
es_AR: Spanish - Argentinaes_BO: Spanish - Bolivia
es_CL: Spanish - Chilees_CO: Spanish - Columbia
es_CR: Spanish - Costa Ricaes_DO: Spanish - Dominican Republic
es_EC: Spanish - Ecuadores_ES: Spanish - Spain
es_GT: Spanish - Guatemalaes_HN: Spanish - Honduras
es_MX: Spanish - Mexicoes_NI: Spanish - Nicaragua
es_PA: Spanish - Panamaes_PE: Spanish - Peru
es_PR: Spanish - Puerto Ricoes_PY: Spanish - Paraguay
es_SV: Spanish - El Salvadores_US: Spanish - United States
es_UY: Spanish - Uruguayes_VE: Spanish - Venezuela
eu_ES: Basque - Basquefi_FI: Finnish - Finland
fo_FO: Faroese - Faroe Islandsfr_BE: French - Belgium
fr_CA: French - Canadafr_CH: French - Switzerland
fr_FR: French - Francefr_LU: French - Luxembourg
gl_ES: Galician - Galiciangu_IN: Gujarati - India
he_IL: Hebrew - Israelhi_IN: Hindi - India
hr_HR: Croatian - Croatiahu_HU: Hungarian - Hungary
id_ID: Indonesian - Indonesiais_IS: Icelandic - Iceland
it_CH: Italian - Switzerlandit_IT: Italian - Italy
ja_JP: Japanese - Japanko_KR: Korean - Korea
lt_LT: Lithuanian - Lithuanialv_LV: Latvian - Latvia
mk_MK: Macedonian - FYROMmn_MN: Mongolia - Mongolian
ms_MY: Malay - Malaysianb_NO: Norwegian(Bokml) - Norway
nl_BE: Dutch - Belgiumnl_NL: Dutch - The Netherlands
no_NO: Norwegian - Norwaypl_PL: Polish - Poland
pt_BR: Portugese - Brazilpt_PT: Portugese - Portugal
ro_RO: Romanian - Romaniaru_RU: Russian - Russia
ru_UA: Russian - Ukrainesk_SK: Slovak - Slovakia
sl_SI: Slovenian - Sloveniasq_AL: Albanian - Albania
sr_YU: Serbian - Yugoslaviasv_FI: Swedish - Finland
sv_SE: Swedish - Swedenta_IN: Tamil - India
te_IN: Telugu - Indiath_TH: Thai - Thailand
tr_TR: Turkish - Turkeyuk_UA: Ukrainian - Ukraine
ur_PK: Urdu - Pakistanvi_VN: Vietnamese - Vietnam
zh_CN: Chinese - Peoples Republic of Chinazh_HK: Chinese - Hong Kong SAR
zh_TW: Chinese - Taiwan 

lc_time_names currently does not affect the STR_TO_DATE() or GET_FORMAT() function.