Collation: Difference between revisions

(4 intermediate revisions by the same user not shown)

Line 1:

Note: When dealing with MariaDB, they have this [https://mariadb.com/kb/en/character-set-and-collation-overview/ KB Overview]. For MySQL, there is a dedicated [https://forums.mysql.com/list.php?103 forum for Character Sets, Collation, Unicode]

Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers.

===OS-dependent defaults===

MariaDB is different in Ubuntu vs. Debian<ref>https://mariadb.com/kb/en/library/differences-in-mariadb-in-debian-and-ubuntu/</ref> (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of <code>latin1</code> for character set (and <code>latin1_swedish_ci</code> collation)!!

Debian also enables the <code>unix_socket</code> authentication plugin by default for MariaDB <= 10.4.This allows passwordless login. See the reference for more details.

===Best practice===

{{Ambox|text=Get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci.

"900" is probably not the last Unicode standard.

By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair.}}

===Changing Character Sets and Collations===

Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').

show GLOBAL VARIABLES LIKE 'character_set_%';

show VARIABLES LIKE 'collation%';

</syntaxhighlight>This is the mess that I wound up with somehow:

SET @DATABASE_NAME = 'wiki_freephile';

SELECT

TABLE_SCHEMA,

COUNT(COLLATION_NAME) AS 'count',

COLLATION_NAME

FROM

information_schema.columns AS cols

WHERE

table_schema = @DATABASE_NAME

GROUP BY COLLATION_NAME;

</syntaxhighlight>

<pre>

wiki_freephile, 0,

wiki_freephile, 11, binary

wiki_freephile, 73, latin1_bin

wiki_freephile, 69, latin1_swedish_ci

wiki_freephile, 1, utf8_general_ci

</pre>

Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)

So, start off by setting the defaults for each database:

ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';

ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';

ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';

</syntaxhighlight>

You can check the status of each '''table''' too:

SET @DATABASE_NAME = 'wiki_freephile';

SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION

FROM information_schema.tables

WHERE TABLE_SCHEMA = @DATABASE_NAME

AND TABLE_COLLATION not in ('binary');

</syntaxhighlight>

Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:

ALTER TABLE table_name

CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];

</syntaxhighlight>

In my case, there are 90-100 columns that have old latin1 character sets.

SET @DATABASE_NAME = 'wiki_greg';

SELECT

TABLE_SCHEMA,

TABLE_NAME,

COLUMN_NAME,

CHARACTER_SET_NAME,

COLLATION_NAME

FROM

information_schema.columns AS cols

WHERE

table_schema = @DATABASE_NAME

AND COLLATION_NAME NOT IN ('binary');

</syntaxhighlight>

We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>

To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:

<pre>

# character set and collation

character-set-server = binary

collation-server = BINARY

# ignore client side character set value sent during handshake

skip-character-set-client-handshake = 1

</pre>

Then restart mysql/mariadb:

*<code>service mariadb restart</code> for Debian/Ubuntu;

*<code>/bin/systemctl restart mariadb.service</code> for Fedora/Centos/RHEL

<br />

===Changing defaults===

Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation '''for a given character set'''. <ref>https://dba.stackexchange.com/questions/239975/change-default-collation-for-character-set-utf8mb4-to-utf8mb4-unicode-ci</ref> The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements.

Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The <code>character_set_server</code> changed from <code>utf8</code> to <code>utf8mb3</code> <ref>https://mariadb.com/kb/en/server-system-variables/#character_set_server</ref>

Show the default collation '''for a given character set''':

<code>SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;</code>

On my CentOS server, the value is <code>utf8mb4_general_ci</code> and the same for Rocky Linux 8.9

For <code>CHARACTER_SET_NAME = 'binary'</code>, the collation is <code>binary</code> as no other collation is available, nor would make sense.

==Understand Encoding and Collation==

http://kunststube.net/encoding/

==2016 problems with MediaWiki==

Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.

Line 17:

Line 140:

{| class="wikitable"

| Schema: || mediawiki

|Schema:||mediawiki

|-

| Default collation: || latin1_swedish_ci

|Default collation:||latin1_swedish_ci

|-

| Default characterset: || latin1

|Default characterset:||latin1

|}

I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref>

== Default MediaWiki SQL schema setup ==

==Default MediaWiki SQL schema setup==

When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run].

Line 35:

Line 158:

You can see what are the defaults for your databases with

<~~source~~ lang="bash">

mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'

</~~source~~>

</syntaxhighlight>

== DB engine Conversion ==

==DB engine Conversion==

The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.

This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6

<~~source~~ lang="mysql">

/* Find and convert your MediaWiki tables to use INNODB */

Line 80:

Line 203:

ALTER TABLE `blobs` ENGINE=InnoDB;

ALTER TABLE `archive` ENGINE=InnoDB;

</~~source~~>

</syntaxhighlight>

Line 86:

Line 209:

So then to change the charset/collation:

* UTF8 is a very bad choice for that.

* If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.

The *recommended* charset/collation is 'binary BINARY'.

*UTF8 is a very bad choice for that.

<~~source~~ lang="mysql">

*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.

The '''recommended''' charset/collation is 'binary BINARY'.

ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';

Line 109:

Line 233:

AND `tb`.table_name NOT REGEXP '.*searchindex'

ORDER BY table_name DESC;

</~~source~~>

</syntaxhighlight>

== Table or column collation ==

==Table or column collation==

You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns. Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing. MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion]

You can look at the distribution of column collations with

<~~source~~ lang="mysql">

SET @DATABASE_NAME = 'mediawiki';

SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME

Line 121:

Line 245:

WHERE table_schema = @DATABASE_NAME

GROUP BY COLLATION_NAME;

</~~source~~>

</syntaxhighlight>

For a default demo database, you might see something like this

{| class="wikitable"

|-

| TABLE_SCHEMA || count || COLLATION_NAME

|TABLE_SCHEMA||count||COLLATION_NAME

|-

| wiki_demo || 0 || NULL

|wiki_demo||0||NULL

|-

| wiki_demo || 36 || latin1_bin

|wiki_demo||36||latin1_bin

|-

| wiki_demo || 37 || latin1_swedish_ci

|wiki_demo||37||latin1_swedish_ci

|-

| wiki_demo || 3 || utf8_general_ci

|wiki_demo||3||utf8_general_ci

|}

or the full detail at

<~~source~~ lang="mysql">

SELECT

TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME

FROM information_schema.columns AS cols

WHERE table_schema = "wiki_demo";

</~~source~~>

</syntaxhighlight>

[[Category:MediaWiki]]

@@ Line 1: / Line 1: @@
+Note: When dealing with MariaDB, they have this [https://mariadb.com/kb/en/character-set-and-collation-overview/ KB Overview]. For MySQL, there is a dedicated [https://forums.mysql.com/list.php?103 forum for Character Sets, Collation, Unicode]
+Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers.
+===OS-dependent defaults===
+MariaDB is different in Ubuntu vs. Debian<ref>https://mariadb.com/kb/en/library/differences-in-mariadb-in-debian-and-ubuntu/</ref> (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of <code>latin1</code> for character set (and <code>latin1_swedish_ci</code> collation)!!
+Debian also enables the <code>unix_socket</code> authentication plugin by default for MariaDB <= 10.4.This allows passwordless login.  See the reference for more details.
+===Best practice===
+{{Ambox|text=Get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci.
+"900" is probably not the last Unicode standard.
+By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair.}}
+===Changing Character Sets and Collations===
+Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').
+<syntaxhighlight lang="mysql">
+show GLOBAL VARIABLES LIKE 'character_set_%';
+show VARIABLES LIKE 'collation%';
+</syntaxhighlight>This is the mess that I wound up with somehow:
+<syntaxhighlight lang="mysql">
+SET @DATABASE_NAME = 'wiki_freephile';
+SELECT
+    TABLE_SCHEMA,
+    COUNT(COLLATION_NAME) AS 'count',
+    COLLATION_NAME
+FROM
+    information_schema.columns AS cols
+WHERE
+    table_schema = @DATABASE_NAME
+GROUP BY COLLATION_NAME;
+</syntaxhighlight>
+<pre>
+wiki_freephile, 0,
+wiki_freephile, 11, binary
+wiki_freephile, 73, latin1_bin
+wiki_freephile, 69, latin1_swedish_ci
+wiki_freephile, 1, utf8_general_ci
+</pre>
+Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)
+So, start off by setting the defaults for each database:
+<syntaxhighlight lang="mysql">
+ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
+ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
+ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
+</syntaxhighlight>
+You can check the status of each '''table''' too:
+<syntaxhighlight lang="mysql">
+SET @DATABASE_NAME = 'wiki_freephile';
+SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
+FROM    information_schema.tables
+WHERE   TABLE_SCHEMA = @DATABASE_NAME
+AND TABLE_COLLATION not in ('binary');
+</syntaxhighlight>
+Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:
+<syntaxhighlight lang="mysql">
+ALTER TABLE table_name
+ CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
+</syntaxhighlight>
+In my case, there are 90-100 columns that have old latin1 character sets.
+<syntaxhighlight lang="mysql">
+SET @DATABASE_NAME = 'wiki_greg';
+SELECT
+	TABLE_SCHEMA,
+	TABLE_NAME,
+	COLUMN_NAME,
+	CHARACTER_SET_NAME,
+	COLLATION_NAME
+FROM
+	information_schema.columns AS cols
+WHERE
+	table_schema = @DATABASE_NAME
+	AND COLLATION_NAME NOT IN ('binary');
+</syntaxhighlight>
+We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>
+To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:
+<pre>
+# character set and collation
+character-set-server = binary
+collation-server = BINARY
+#  ignore client side character set value sent during handshake
+skip-character-set-client-handshake = 1
+</pre>
+Then restart mysql/mariadb:
+*<code>service mariadb restart</code> for Debian/Ubuntu;
+*<code>/bin/systemctl restart mariadb.service</code> for Fedora/Centos/RHEL
+<br />
+===Changing defaults===
+Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation '''for a given character set'''. <ref>https://dba.stackexchange.com/questions/239975/change-default-collation-for-character-set-utf8mb4-to-utf8mb4-unicode-ci</ref> The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements.
+Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The <code>character_set_server</code> changed from <code>utf8</code> to <code>utf8mb3</code> <ref>https://mariadb.com/kb/en/server-system-variables/#character_set_server</ref>
+Show the default collation '''for a given character set''':
+<code>SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;</code>
+On my CentOS server, the value is <code>utf8mb4_general_ci</code> and the same for Rocky Linux 8.9
+For <code>CHARACTER_SET_NAME = 'binary'</code>, the collation is <code>binary</code> as no other collation is available, nor would make sense.
+==Understand Encoding and Collation==
+http://kunststube.net/encoding/
+==2016 problems with MediaWiki==
 Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.
@@ Line 17: / Line 140: @@
 {| class="wikitable"
-| Schema: || mediawiki
+|Schema:||mediawiki
 |-
-| Default collation: || latin1_swedish_ci
+|Default collation:||latin1_swedish_ci
 |-
-| Default characterset: || latin1
+|Default characterset:||latin1
 |}
 I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref>
-== Default MediaWiki SQL schema setup ==
+==Default MediaWiki SQL schema setup==
 When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run].
@@ Line 35: / Line 158: @@
 You can see what are the defaults for your databases with
-<source lang="bash">
+<syntaxhighlight lang="bash">
 mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
-</source>
+</syntaxhighlight>
-== DB engine Conversion ==
+==DB engine Conversion==
 The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.
 This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 /* Find and convert your MediaWiki tables to use INNODB */
@@ Line 80: / Line 203: @@
 ALTER TABLE `blobs` ENGINE=InnoDB;
 ALTER TABLE `archive` ENGINE=InnoDB;
-</source>
+</syntaxhighlight>
@@ Line 86: / Line 209: @@
 So then to change the charset/collation:
-* UTF8 is a very bad choice for that.
-* If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
-The *recommended* charset/collation is 'binary BINARY'.
+*UTF8 is a very bad choice for that.
-<source lang="mysql">
+*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
+The '''recommended''' charset/collation is 'binary BINARY'.
+<syntaxhighlight lang="mysql">
 ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';
@@ Line 109: / Line 233: @@
          AND `tb`.table_name NOT REGEXP '.*searchindex'
 ORDER BY table_name DESC;
-</source>
+</syntaxhighlight>
-== Table or column collation ==
+==Table or column collation==
 You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns.  Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing.  MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion]
 You can look at the distribution of column collations with
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SET @DATABASE_NAME = 'mediawiki';
 SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
@@ Line 121: / Line 245: @@
 WHERE   table_schema = @DATABASE_NAME
 GROUP BY COLLATION_NAME;
-</source>
+</syntaxhighlight>
 For a default demo database, you might see something like this
 {| class="wikitable"
 |-
-| TABLE_SCHEMA || count || COLLATION_NAME
+|TABLE_SCHEMA||count||COLLATION_NAME
 |-
-| wiki_demo    ||     0 || NULL
+|wiki_demo||0||NULL
 |-
-| wiki_demo    ||    36 || latin1_bin
+|wiki_demo||36||latin1_bin
 |-
-| wiki_demo    ||    37 || latin1_swedish_ci
+|wiki_demo||37||latin1_swedish_ci
 |-
-| wiki_demo    ||     3 || utf8_general_ci
+|wiki_demo||3||utf8_general_ci
 |}
 or the full detail at
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SELECT
    TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME
 FROM information_schema.columns AS cols
 WHERE table_schema = "wiki_demo";
-</source>
+</syntaxhighlight>
+<references />
 [[Category:MediaWiki]]