Difference between revisions of "Collation"
(→Changing Character Sets and Collations: update config example; remove duplicate) |
|||
Line 41: | Line 41: | ||
wiki_freephile, 1, utf8_general_ci | wiki_freephile, 1, utf8_general_ci | ||
</pre> | </pre> | ||
− | So, | + | |
+ | Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.) | ||
+ | |||
+ | |||
+ | So, start off by setting the defaults for each database: | ||
<source lang="mysql"> | <source lang="mysql"> | ||
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
Line 48: | Line 52: | ||
</source> | </source> | ||
− | You can check the status of each table too: | + | You can check the status of each '''table''' too: |
<source lang="mysql"> | <source lang="mysql"> | ||
SET @DATABASE_NAME = 'wiki_freephile'; | SET @DATABASE_NAME = 'wiki_freephile'; | ||
Line 56: | Line 60: | ||
AND TABLE_COLLATION not in ('binary'); | AND TABLE_COLLATION not in ('binary'); | ||
</source> | </source> | ||
+ | |||
+ | Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form: | ||
+ | <source lang="mysql"> | ||
+ | ALTER TABLE table_name | ||
+ | CONVERT TO CHARACTER SET charset_name [COLLATE collation_name]; | ||
+ | </source> | ||
+ | |||
+ | In my case, there are 90-100 columns that have old latin1 character sets. | ||
+ | <source lang="mysql"> | ||
+ | SET @DATABASE_NAME = 'wiki_greg'; | ||
+ | SELECT | ||
+ | TABLE_SCHEMA, | ||
+ | TABLE_NAME, | ||
+ | COLUMN_NAME, | ||
+ | CHARACTER_SET_NAME, | ||
+ | COLLATION_NAME | ||
+ | FROM | ||
+ | information_schema.columns AS cols | ||
+ | WHERE | ||
+ | table_schema = @DATABASE_NAME | ||
+ | AND COLLATION_NAME NOT IN ('binary'); | ||
+ | </source> | ||
+ | We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code> | ||
+ | |||
+ | |||
To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section: | To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section: | ||
Line 184: | Line 213: | ||
*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety. | *If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety. | ||
− | The | + | The '''recommended''' charset/collation is 'binary BINARY'. |
<source lang="mysql"> | <source lang="mysql"> | ||
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
Line 239: | Line 268: | ||
WHERE table_schema = "wiki_demo"; | WHERE table_schema = "wiki_demo"; | ||
</source> | </source> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | <references /> | ||
[[Category:MediaWiki]] | [[Category:MediaWiki]] | ||
Line 246: | Line 283: | ||
[[Category:QualityBox]] | [[Category:QualityBox]] | ||
[[Category:Wiki]] | [[Category:Wiki]] | ||
− |
Latest revision as of 16:35, 24 April 2024
Note: When dealing with MariaDB, they have this KB Overview. For MySQL, there is a dedicated forum for Character Sets, Collation, Unicode
Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers.
Contents
OS-dependent defaults[edit | edit source]
MariaDB is different in Ubuntu vs. Debian[1] (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of latin1
for character set (and latin1_swedish_ci
collation)!!
Debian also enables the unix_socket
authentication plugin by default for MariaDB <= 10.4.This allows passwordless login. See the reference for more details.
Best practice[edit | edit source]
Get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci.
"900" is probably not the last Unicode standard. By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair. |
Changing Character Sets and Collations[edit | edit source]
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the collation_connection
server system variable. There is another KB page for Setting Character Sets and Collations. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. latin1 charset and latin1_bin collation (the bin suffix stands for 'binary').
show GLOBAL VARIABLES LIKE 'character_set_%';
show VARIABLES LIKE 'collation%';
This is the mess that I wound up with somehow:
SET @DATABASE_NAME = 'wiki_freephile';
SELECT
TABLE_SCHEMA,
COUNT(COLLATION_NAME) AS 'count',
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
wiki_freephile, 0, wiki_freephile, 11, binary wiki_freephile, 73, latin1_bin wiki_freephile, 69, latin1_swedish_ci wiki_freephile, 1, utf8_general_ci
Note that this query is counting the columns which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)
So, start off by setting the defaults for each database:
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
You can check the status of each table too:
SET @DATABASE_NAME = 'wiki_freephile';
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM information_schema.tables
WHERE TABLE_SCHEMA = @DATABASE_NAME
AND TABLE_COLLATION not in ('binary');
Then, down in the section below on changing engines, there's an example SQL query for generating ALTER TABLE
statements to fix the table defaults[2]. These statements would be in the form:
ALTER TABLE table_name
CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
In my case, there are 90-100 columns that have old latin1 character sets.
SET @DATABASE_NAME = 'wiki_greg';
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
AND COLLATION_NAME NOT IN ('binary');
We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. SHOW CREATE TABLE brokenlinks \G
To set the mysql server default character set to binary [3], add the following to /etc/mysql/my.cnf
(Debian/Ubuntu) or /etc/my.cnf
(Fedora/Centos/RHEL) under the [mysqld] section:
# character set and collation character-set-server = binary collation-server = BINARY # ignore client side character set value sent during handshake skip-character-set-client-handshake = 1
Then restart mysql/mariadb:
service mariadb restart
for Debian/Ubuntu;/bin/systemctl restart mariadb.service
for Fedora/Centos/RHEL
Changing defaults[edit | edit source]
Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation for a given character set. [4] The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements.
Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The character_set_server
changed from utf8
to utf8mb3
[5]
Show the default collation for a given character set:
SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;
On my CentOS server, the value is utf8mb4_general_ci
and the same for Rocky Linux 8.9
For CHARACTER_SET_NAME = 'binary'
, the collation is binary
as no other collation is available, nor would make sense.
Understand Encoding and Collation[edit | edit source]
http://kunststube.net/encoding/
2016 problems with MediaWiki[edit | edit source]
Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.
Import failed: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script Query: SELECT page_id,page_len,page_is_redirect,page_latest,page_content_model FROM `page` WHERE page_namespace = '6' AND page_title = 'Fantomo_ru?a.svg' LIMIT 1 Function: LinkCache::addLinkObj Error: 1267 Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' (localhost)
This is a mediawiki that is "current" but was first established a long time ago (2004).
MediaWiki 1.27.1 (64f7660) 2016-11-25T23:52:22
PHP 5.5.9-1ubuntu4.22 (apache2handler)
MySQL 5.5.57-0ubuntu0.14.04.1
ICU 52.1
Elasticsearch 2.4.4
Lua 5.1.5
Schema: | mediawiki |
Default collation: | latin1_swedish_ci |
Default characterset: | latin1 |
I also have $wgDBmysql5 = true;
in LocalSettings.php [6]
Default MediaWiki SQL schema setup[edit | edit source]
When MediaWiki is first installed, this is the SQL that is run.
The import failed, and reveals that my database has a problem handling UTF8 characters.
MediaWiki (the application) uses UTF8 throughout. What your database and data use is another story. Since my database was setup a long time ago, I've got old defaults, and a mixture of engines.
You can see what are the defaults for your databases with
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
DB engine Conversion[edit | edit source]
The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.
This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6
/* Find and convert your MediaWiki tables to use INNODB */
SET @DATABASE_NAME = 'mediawiki';
SELECT CONCAT('ALTER TABLE `', table_name, '` ENGINE=InnoDB;') AS sql_statements
FROM information_schema.tables AS tb
WHERE table_schema = @DATABASE_NAME
AND `ENGINE` = 'MyISAM'
AND `TABLE_TYPE` = 'BASE TABLE'
AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;
/* Then copy the output and execute that. E.g. : */
ALTER TABLE `watchlist` ENGINE=InnoDB;
ALTER TABLE `user_newtalk` ENGINE=InnoDB;
ALTER TABLE `user` ENGINE=InnoDB;
ALTER TABLE `transcache` ENGINE=InnoDB;
ALTER TABLE `trackbacks` ENGINE=InnoDB;
ALTER TABLE `text` ENGINE=InnoDB;
ALTER TABLE `site_stats` ENGINE=InnoDB;
ALTER TABLE `recentchanges` ENGINE=InnoDB;
ALTER TABLE `querycache` ENGINE=InnoDB;
ALTER TABLE `oldimage` ENGINE=InnoDB;
ALTER TABLE `objectcache` ENGINE=InnoDB;
ALTER TABLE `math` ENGINE=InnoDB;
ALTER TABLE `linkscc` ENGINE=InnoDB;
ALTER TABLE `links` ENGINE=InnoDB;
ALTER TABLE `ipblocks_old` ENGINE=InnoDB;
ALTER TABLE `interwiki` ENGINE=InnoDB;
ALTER TABLE `imagelinks` ENGINE=InnoDB;
ALTER TABLE `image` ENGINE=InnoDB;
ALTER TABLE `cur` ENGINE=InnoDB;
ALTER TABLE `categorylinks` ENGINE=InnoDB;
ALTER TABLE `brokenlinks` ENGINE=InnoDB;
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;
The fact that I use Elasticsearch makes native full-text search irrelevant, but still I want my MediaWiki to be setup correctly.
So then to change the charset/collation:
- UTF8 is a very bad choice for that.
- If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
The recommended charset/collation is 'binary BINARY'.
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';
-- You can generate the above SQL commands with the following:
SET @DATABASE_NAME = 'wiki_freephile';
SELECT
CONCAT("ALTER TABLE `",
table_name,
"` CHARACTER SET = 'binary' COLLATE = 'BINARY';") AS sql_statements
FROM
information_schema.tables AS tb
WHERE
table_schema = @DATABASE_NAME
AND `TABLE_COLLATION` = 'latin1_swedish_ci'
AND `TABLE_TYPE` = 'BASE TABLE'
AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;
Table or column collation[edit | edit source]
You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns. Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing. MySQL has this reference on charset conversion
You can look at the distribution of column collations with
SET @DATABASE_NAME = 'mediawiki';
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
FROM information_schema.columns AS cols
WHERE table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
For a default demo database, you might see something like this
TABLE_SCHEMA | count | COLLATION_NAME |
wiki_demo | 0 | NULL |
wiki_demo | 36 | latin1_bin |
wiki_demo | 37 | latin1_swedish_ci |
wiki_demo | 3 | utf8_general_ci |
or the full detail at
SELECT
TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM information_schema.columns AS cols
WHERE table_schema = "wiki_demo";
- ↑ https://mariadb.com/kb/en/library/differences-in-mariadb-in-debian-and-ubuntu/
- ↑ https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level
- ↑ https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/
- ↑ https://dba.stackexchange.com/questions/239975/change-default-collation-for-character-set-utf8mb4-to-utf8mb4-unicode-ci
- ↑ https://mariadb.com/kb/en/server-system-variables/#character_set_server
- ↑ https://www.mediawiki.org/wiki/Manual:$wgDBmysql5