Collation: Difference between revisions
No edit summary |
m Text replacement - "<(\/?)source" to "<$1syntaxhighlight" |
||
| (4 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
Note: When dealing with MariaDB, they have this [https://mariadb.com/kb/en/character-set-and-collation-overview/ KB Overview]. For MySQL, there is a dedicated [https://forums.mysql.com/list.php?103 forum for Character Sets, Collation, Unicode] | |||
Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers. | |||
===OS-dependent defaults=== | |||
MariaDB is different in Ubuntu vs. Debian<ref>https://mariadb.com/kb/en/library/differences-in-mariadb-in-debian-and-ubuntu/</ref> (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of <code>latin1</code> for character set (and <code>latin1_swedish_ci</code> collation)!! | |||
Debian also enables the <code>unix_socket</code> authentication plugin by default for MariaDB <= 10.4.This allows passwordless login. See the reference for more details. | |||
===Best practice=== | |||
{{Ambox|text=Get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci. | |||
"900" is probably not the last Unicode standard. | |||
By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair.}} | |||
===Changing Character Sets and Collations=== | |||
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary'). | |||
<syntaxhighlight lang="mysql"> | |||
show GLOBAL VARIABLES LIKE 'character_set_%'; | |||
show VARIABLES LIKE 'collation%'; | |||
</syntaxhighlight>This is the mess that I wound up with somehow: | |||
<syntaxhighlight lang="mysql"> | |||
SET @DATABASE_NAME = 'wiki_freephile'; | |||
SELECT | |||
TABLE_SCHEMA, | |||
COUNT(COLLATION_NAME) AS 'count', | |||
COLLATION_NAME | |||
FROM | |||
information_schema.columns AS cols | |||
WHERE | |||
table_schema = @DATABASE_NAME | |||
GROUP BY COLLATION_NAME; | |||
</syntaxhighlight> | |||
<pre> | |||
wiki_freephile, 0, | |||
wiki_freephile, 11, binary | |||
wiki_freephile, 73, latin1_bin | |||
wiki_freephile, 69, latin1_swedish_ci | |||
wiki_freephile, 1, utf8_general_ci | |||
</pre> | |||
Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.) | |||
So, start off by setting the defaults for each database: | |||
<syntaxhighlight lang="mysql"> | |||
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY'; | |||
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY'; | |||
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY'; | |||
</syntaxhighlight> | |||
You can check the status of each '''table''' too: | |||
<syntaxhighlight lang="mysql"> | |||
SET @DATABASE_NAME = 'wiki_freephile'; | |||
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION | |||
FROM information_schema.tables | |||
WHERE TABLE_SCHEMA = @DATABASE_NAME | |||
AND TABLE_COLLATION not in ('binary'); | |||
</syntaxhighlight> | |||
Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form: | |||
<syntaxhighlight lang="mysql"> | |||
ALTER TABLE table_name | |||
CONVERT TO CHARACTER SET charset_name [COLLATE collation_name]; | |||
</syntaxhighlight> | |||
In my case, there are 90-100 columns that have old latin1 character sets. | |||
<syntaxhighlight lang="mysql"> | |||
SET @DATABASE_NAME = 'wiki_greg'; | |||
SELECT | |||
TABLE_SCHEMA, | |||
TABLE_NAME, | |||
COLUMN_NAME, | |||
CHARACTER_SET_NAME, | |||
COLLATION_NAME | |||
FROM | |||
information_schema.columns AS cols | |||
WHERE | |||
table_schema = @DATABASE_NAME | |||
AND COLLATION_NAME NOT IN ('binary'); | |||
</syntaxhighlight> | |||
We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code> | |||
To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section: | |||
<pre> | |||
# character set and collation | |||
character-set-server = binary | |||
collation-server = BINARY | |||
# ignore client side character set value sent during handshake | |||
skip-character-set-client-handshake = 1 | |||
</pre> | |||
Then restart mysql/mariadb: | |||
*<code>service mariadb restart</code> for Debian/Ubuntu; | |||
*<code>/bin/systemctl restart mariadb.service</code> for Fedora/Centos/RHEL | |||
<br /> | |||
===Changing defaults=== | |||
Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation '''for a given character set'''. <ref>https://dba.stackexchange.com/questions/239975/change-default-collation-for-character-set-utf8mb4-to-utf8mb4-unicode-ci</ref> The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements. | |||
Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The <code>character_set_server</code> changed from <code>utf8</code> to <code>utf8mb3</code> <ref>https://mariadb.com/kb/en/server-system-variables/#character_set_server</ref> | |||
Show the default collation '''for a given character set''': | |||
<code>SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;</code> | |||
On my CentOS server, the value is <code>utf8mb4_general_ci</code> and the same for Rocky Linux 8.9 | |||
For <code>CHARACTER_SET_NAME = 'binary'</code>, the collation is <code>binary</code> as no other collation is available, nor would make sense. | |||
==Understand Encoding and Collation== | |||
http://kunststube.net/encoding/ | |||
==2016 problems with MediaWiki== | |||
Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error. | Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error. | ||
| Line 17: | Line 140: | ||
{| class="wikitable" | {| class="wikitable" | ||
| Schema: || mediawiki | |Schema:||mediawiki | ||
|- | |- | ||
| Default collation: || latin1_swedish_ci | |Default collation:||latin1_swedish_ci | ||
|- | |- | ||
| Default characterset: || latin1 | |Default characterset:||latin1 | ||
|} | |} | ||
I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref> | I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref> | ||
== Default MediaWiki SQL schema setup == | ==Default MediaWiki SQL schema setup== | ||
When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run]. | When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run]. | ||
| Line 35: | Line 158: | ||
You can see what are the defaults for your databases with | You can see what are the defaults for your databases with | ||
< | <syntaxhighlight lang="bash"> | ||
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;' | mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;' | ||
</ | </syntaxhighlight> | ||
== DB engine Conversion == | ==DB engine Conversion== | ||
The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default. | The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default. | ||
This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6 | This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6 | ||
< | <syntaxhighlight lang="mysql"> | ||
/* Find and convert your MediaWiki tables to use INNODB */ | /* Find and convert your MediaWiki tables to use INNODB */ | ||
| Line 80: | Line 203: | ||
ALTER TABLE `blobs` ENGINE=InnoDB; | ALTER TABLE `blobs` ENGINE=InnoDB; | ||
ALTER TABLE `archive` ENGINE=InnoDB; | ALTER TABLE `archive` ENGINE=InnoDB; | ||
</ | </syntaxhighlight> | ||
| Line 86: | Line 209: | ||
So then to change the charset/collation: | So then to change the charset/collation: | ||
The | *UTF8 is a very bad choice for that. | ||
< | *If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety. | ||
The '''recommended''' charset/collation is 'binary BINARY'. | |||
<syntaxhighlight lang="mysql"> | |||
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
| Line 109: | Line 233: | ||
AND `tb`.table_name NOT REGEXP '.*searchindex' | AND `tb`.table_name NOT REGEXP '.*searchindex' | ||
ORDER BY table_name DESC; | ORDER BY table_name DESC; | ||
</ | </syntaxhighlight> | ||
== Table or column collation == | ==Table or column collation== | ||
You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns. Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing. MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion] | You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns. Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing. MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion] | ||
You can look at the distribution of column collations with | You can look at the distribution of column collations with | ||
< | <syntaxhighlight lang="mysql"> | ||
SET @DATABASE_NAME = 'mediawiki'; | SET @DATABASE_NAME = 'mediawiki'; | ||
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME | SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME | ||
| Line 121: | Line 245: | ||
WHERE table_schema = @DATABASE_NAME | WHERE table_schema = @DATABASE_NAME | ||
GROUP BY COLLATION_NAME; | GROUP BY COLLATION_NAME; | ||
</ | </syntaxhighlight> | ||
For a default demo database, you might see something like this | For a default demo database, you might see something like this | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
| TABLE_SCHEMA || count || COLLATION_NAME | |TABLE_SCHEMA||count||COLLATION_NAME | ||
|- | |- | ||
| wiki_demo | |wiki_demo||0||NULL | ||
|- | |- | ||
| wiki_demo | |wiki_demo||36||latin1_bin | ||
|- | |- | ||
| wiki_demo | |wiki_demo||37||latin1_swedish_ci | ||
|- | |- | ||
| wiki_demo | |wiki_demo||3||utf8_general_ci | ||
|} | |} | ||
or the full detail at | or the full detail at | ||
< | <syntaxhighlight lang="mysql"> | ||
SELECT | SELECT | ||
TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME | TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME | ||
FROM information_schema.columns AS cols | FROM information_schema.columns AS cols | ||
WHERE table_schema = "wiki_demo"; | WHERE table_schema = "wiki_demo"; | ||
</ | </syntaxhighlight> | ||
<references /> | |||
[[Category:MediaWiki]] | [[Category:MediaWiki]] | ||