Collation

From Freephile Wiki
Jump to navigation Jump to search

Note: When dealing with MariaDB, they have this KB Overview. For MySQL, there is a dedicated forum for Character Sets, Collation, Unicode

Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers.

OS-dependent defaults[edit | edit source]

MariaDB is different in Ubuntu vs. Debian[1] (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of latin1 for character set (and latin1_swedish_ci collation)!!

Debian also enables the unix_socket authentication plugin by default for MariaDB <= 10.4.This allows passwordless login. See the reference for more details.

Best practice[edit | edit source]

Changing Character Sets and Collations[edit | edit source]

Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the collation_connection server system variable. There is another KB page for Setting Character Sets and Collations. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. latin1 charset and latin1_bin collation (the bin suffix stands for 'binary').

show GLOBAL VARIABLES LIKE 'character_set_%';
show VARIABLES LIKE 'collation%';

This is the mess that I wound up with somehow:

SET @DATABASE_NAME = 'wiki_freephile';
SELECT 
    TABLE_SCHEMA,
    COUNT(COLLATION_NAME) AS 'count',
    COLLATION_NAME
FROM
    information_schema.columns AS cols
WHERE
    table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
wiki_freephile, 0, 
wiki_freephile, 11, binary
wiki_freephile, 73, latin1_bin
wiki_freephile, 69, latin1_swedish_ci
wiki_freephile, 1, utf8_general_ci

Note that this query is counting the columns which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)


So, start off by setting the defaults for each database:

ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';

You can check the status of each table too:

SET @DATABASE_NAME = 'wiki_freephile';
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM    information_schema.tables
WHERE   TABLE_SCHEMA = @DATABASE_NAME
AND TABLE_COLLATION not in ('binary');

Then, down in the section below on changing engines, there's an example SQL query for generating ALTER TABLE statements to fix the table defaults[2]. These statements would be in the form:

ALTER TABLE table_name
 CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];

In my case, there are 90-100 columns that have old latin1 character sets.

SET @DATABASE_NAME = 'wiki_greg';
SELECT
	TABLE_SCHEMA,
	TABLE_NAME,
	COLUMN_NAME,
	CHARACTER_SET_NAME,
	COLLATION_NAME
FROM
	information_schema.columns AS cols
WHERE
	table_schema = @DATABASE_NAME
	AND COLLATION_NAME NOT IN ('binary');

We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. SHOW CREATE TABLE brokenlinks \G


To set the mysql server default character set to binary [3], add the following to /etc/mysql/my.cnf (Debian/Ubuntu) or /etc/my.cnf (Fedora/Centos/RHEL) under the [mysqld] section:

# character set and collation
character-set-server = binary
collation-server = BINARY
#  ignore client side character set value sent during handshake
skip-character-set-client-handshake = 1

Then restart mysql/mariadb:

  • service mariadb restart for Debian/Ubuntu;
  • /bin/systemctl restart mariadb.service for Fedora/Centos/RHEL


Changing defaults[edit | edit source]

Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation for a given character set. [4] The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements.

Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The character_set_server changed from utf8 to utf8mb3 [5]


Show the default collation for a given character set:

SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;

On my CentOS server, the value is utf8mb4_general_ci and the same for Rocky Linux 8.9

For CHARACTER_SET_NAME = 'binary', the collation is binary as no other collation is available, nor would make sense.


Understand Encoding and Collation[edit | edit source]

http://kunststube.net/encoding/


2016 problems with MediaWiki[edit | edit source]

Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.

Import failed: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script Query: SELECT page_id,page_len,page_is_redirect,page_latest,page_content_model FROM `page` WHERE page_namespace = '6' AND page_title = 'Fantomo_ru?a.svg' LIMIT 1 Function: LinkCache::addLinkObj Error: 1267 Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' (localhost)


This is a mediawiki that is "current" but was first established a long time ago (2004).

MediaWiki	1.27.1 (64f7660) 2016-11-25T23:52:22
PHP	5.5.9-1ubuntu4.22 (apache2handler)
MySQL	5.5.57-0ubuntu0.14.04.1
ICU	52.1
Elasticsearch	2.4.4
Lua	5.1.5
Schema: mediawiki
Default collation: latin1_swedish_ci
Default characterset: latin1

I also have $wgDBmysql5 = true; in LocalSettings.php [6]

Default MediaWiki SQL schema setup[edit | edit source]

When MediaWiki is first installed, this is the SQL that is run.

The import failed, and reveals that my database has a problem handling UTF8 characters.

MediaWiki (the application) uses UTF8 throughout. What your database and data use is another story. Since my database was setup a long time ago, I've got old defaults, and a mixture of engines.

You can see what are the defaults for your databases with

mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'

DB engine Conversion[edit | edit source]

The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.

This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6

/* Find and convert your MediaWiki tables to use INNODB */

SET @DATABASE_NAME = 'mediawiki';

SELECT  CONCAT('ALTER TABLE `', table_name, '` ENGINE=InnoDB;') AS sql_statements
FROM    information_schema.tables AS tb
WHERE   table_schema = @DATABASE_NAME
AND     `ENGINE` = 'MyISAM'
AND     `TABLE_TYPE` = 'BASE TABLE'
AND     `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;

/* Then copy the output and execute that. E.g. : */
ALTER TABLE `watchlist` ENGINE=InnoDB;
ALTER TABLE `user_newtalk` ENGINE=InnoDB;
ALTER TABLE `user` ENGINE=InnoDB;
ALTER TABLE `transcache` ENGINE=InnoDB;
ALTER TABLE `trackbacks` ENGINE=InnoDB;
ALTER TABLE `text` ENGINE=InnoDB;
ALTER TABLE `site_stats` ENGINE=InnoDB;
ALTER TABLE `recentchanges` ENGINE=InnoDB;
ALTER TABLE `querycache` ENGINE=InnoDB;
ALTER TABLE `oldimage` ENGINE=InnoDB;
ALTER TABLE `objectcache` ENGINE=InnoDB;
ALTER TABLE `math` ENGINE=InnoDB;
ALTER TABLE `linkscc` ENGINE=InnoDB;
ALTER TABLE `links` ENGINE=InnoDB;
ALTER TABLE `ipblocks_old` ENGINE=InnoDB;
ALTER TABLE `interwiki` ENGINE=InnoDB;
ALTER TABLE `imagelinks` ENGINE=InnoDB;
ALTER TABLE `image` ENGINE=InnoDB;
ALTER TABLE `cur` ENGINE=InnoDB;
ALTER TABLE `categorylinks` ENGINE=InnoDB;
ALTER TABLE `brokenlinks` ENGINE=InnoDB;
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;


The fact that I use Elasticsearch makes native full-text search irrelevant, but still I want my MediaWiki to be setup correctly.

So then to change the charset/collation:

  • UTF8 is a very bad choice for that.
  • If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.

The recommended charset/collation is 'binary BINARY'.

ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';


-- You can generate the above SQL commands with the following:

SET @DATABASE_NAME = 'wiki_freephile';
SELECT 
    CONCAT("ALTER TABLE `",
            table_name,
            "` CHARACTER SET = 'binary' COLLATE = 'BINARY';") AS sql_statements
FROM
    information_schema.tables AS tb
WHERE
    table_schema = @DATABASE_NAME
        AND `TABLE_COLLATION` = 'latin1_swedish_ci'
        AND `TABLE_TYPE` = 'BASE TABLE'
        AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;

Table or column collation[edit | edit source]

You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns. Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing. MySQL has this reference on charset conversion

You can look at the distribution of column collations with

SET @DATABASE_NAME = 'mediawiki';
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
FROM    information_schema.columns AS cols
WHERE   table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;

For a default demo database, you might see something like this

TABLE_SCHEMA count COLLATION_NAME
wiki_demo 0 NULL
wiki_demo 36 latin1_bin
wiki_demo 37 latin1_swedish_ci
wiki_demo 3 utf8_general_ci

or the full detail at

SELECT 
  TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME 
FROM information_schema.columns AS cols 
WHERE table_schema = "wiki_demo";