Collation: Difference between revisions

No edit summary
m Text replacement - "<(\/?)source" to "<$1syntaxhighlight"
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
Note: When dealing with MariaDB, they have this [https://mariadb.com/kb/en/character-set-and-collation-overview/ KB Overview]. For MySQL, there is a dedicated [https://forums.mysql.com/list.php?103 forum for Character Sets, Collation, Unicode]
Collation is a big word that we don't use everyday. It helps to remember that it basically means 'sorting'. Throwback: when printing documents in the office was a big deal (before the Internet), you'd have to pay extra to get huge sorting trays attached to the printer so that you could print '20 copies of this 5-page document' and each set of 5 pages would be printed into it's own tray. Now collating copiers are the norm and they do it without the massive tray finishers.
===OS-dependent defaults===
MariaDB is different in Ubuntu vs. Debian<ref>https://mariadb.com/kb/en/library/differences-in-mariadb-in-debian-and-ubuntu/</ref> (and CentOS / Rocky Linux seem more like Ubuntu than Debian). Debian does the 'right thing' and sets defaults to be UTF-8 compatible, whereas others are still lagging with defaults of <code>latin1</code> for character set (and <code>latin1_swedish_ci</code> collation)!!
Debian also enables the <code>unix_socket</code> authentication plugin by default for MariaDB <= 10.4.This allows passwordless login.  See the reference for more details.
===Best practice===
{{Ambox|text=Get into the habit of specifying CHARACTER SET and COLLATION on all connections and CREATE TABLEs. MySQL and MariaDB are gradually changing from latin1_swedish_ci to utf8mb4_0900_ai_ci.
"900" is probably not the last Unicode standard.
By explicitly specifying the charset and collation, you maintain control and consistency, even if it is an out-dated pair.}}
===Changing Character Sets and Collations===
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').
<syntaxhighlight lang="mysql">
show GLOBAL VARIABLES LIKE 'character_set_%';
show VARIABLES LIKE 'collation%';
</syntaxhighlight>This is the mess that I wound up with somehow:
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_freephile';
SELECT
    TABLE_SCHEMA,
    COUNT(COLLATION_NAME) AS 'count',
    COLLATION_NAME
FROM
    information_schema.columns AS cols
WHERE
    table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
</syntaxhighlight>
<pre>
wiki_freephile, 0,
wiki_freephile, 11, binary
wiki_freephile, 73, latin1_bin
wiki_freephile, 69, latin1_swedish_ci
wiki_freephile, 1, utf8_general_ci
</pre>
Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)
So, start off by setting the defaults for each database:
<syntaxhighlight lang="mysql">
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
</syntaxhighlight>
You can check the status of each '''table''' too:
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_freephile';
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM    information_schema.tables
WHERE  TABLE_SCHEMA = @DATABASE_NAME
AND TABLE_COLLATION not in ('binary');
</syntaxhighlight>
Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:
<syntaxhighlight lang="mysql">
ALTER TABLE table_name
CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
</syntaxhighlight>
In my case, there are 90-100 columns that have old latin1 character sets.
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_greg';
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
AND COLLATION_NAME NOT IN ('binary');
</syntaxhighlight>
We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>
To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:
<pre>
# character set and collation
character-set-server = binary
collation-server = BINARY
#  ignore client side character set value sent during handshake
skip-character-set-client-handshake = 1
</pre>
Then restart mysql/mariadb:
*<code>service mariadb restart</code> for Debian/Ubuntu;
*<code>/bin/systemctl restart mariadb.service</code> for Fedora/Centos/RHEL
<br />
===Changing defaults===
Depending on what version of MySQL or MariaDB you are using, it can be difficult at best to change the default collation '''for a given character set'''. <ref>https://dba.stackexchange.com/questions/239975/change-default-collation-for-character-set-utf8mb4-to-utf8mb4-unicode-ci</ref> The default relationship between character set and the collation for it will affect you when you use 'CREATE' or 'ALTER' with a 'CHARACTER SET' clause and the collation clause is left out. The admonition is to always specify both the character set and the collation in Connections as well as CREATE statements.
Not to confuse things, but the default server-wide character set changed in MariaDB >= 10.6. The <code>character_set_server</code> changed from <code>utf8</code> to <code>utf8mb3</code> <ref>https://mariadb.com/kb/en/server-system-variables/#character_set_server</ref>
Show the default collation '''for a given character set''':
<code>SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ;</code>
On my CentOS server, the value is <code>utf8mb4_general_ci</code> and the same for Rocky Linux 8.9
For <code>CHARACTER_SET_NAME = 'binary'</code>, the collation is <code>binary</code> as no other collation is available, nor would make sense.
==Understand Encoding and Collation==
http://kunststube.net/encoding/
==2016 problems with MediaWiki==
Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.
Trying to Special:Import an XML dump from Special:Export of Commons, I get an "illegal mix of collations" error.


Line 17: Line 140:


{| class="wikitable"  
{| class="wikitable"  
| Schema: || mediawiki
|Schema:||mediawiki
|-
|-
| Default collation: || latin1_swedish_ci
|Default collation:||latin1_swedish_ci
|-
|-
| Default characterset: || latin1
|Default characterset:||latin1
|}
|}


I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref>
I also have <code>$wgDBmysql5 = true;</code> in LocalSettings.php <ref>https://www.mediawiki.org/wiki/Manual:$wgDBmysql5</ref>


== Default MediaWiki SQL schema setup ==
==Default MediaWiki SQL schema setup==
When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run].
When MediaWiki is first installed, [https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/tables.sql this is the SQL that is run].


Line 35: Line 158:
You can see what are the defaults for your databases with
You can see what are the defaults for your databases with


<source lang="bash">
<syntaxhighlight lang="bash">
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
</source>
</syntaxhighlight>


== DB engine Conversion ==
==DB engine Conversion==
The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.
The table type has nothing to do with Collation, but while updating my database, I wanted to switch the engine from MyISAM to InnoDB which is now the default.


This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6   
This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6   
<source lang="mysql">
<syntaxhighlight lang="mysql">
/* Find and convert your MediaWiki tables to use INNODB */
/* Find and convert your MediaWiki tables to use INNODB */


Line 80: Line 203:
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;
</source>
</syntaxhighlight>




Line 86: Line 209:


So then to change the charset/collation:
So then to change the charset/collation:
* UTF8 is a very bad choice for that. 
* If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.


The *recommended* charset/collation is 'binary BINARY'.
*UTF8 is a very bad choice for that.
<source lang="mysql">
*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
 
The '''recommended''' charset/collation is 'binary BINARY'.
<syntaxhighlight lang="mysql">
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';


Line 109: Line 233:
         AND `tb`.table_name NOT REGEXP '.*searchindex'
         AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;
ORDER BY table_name DESC;
</source>
</syntaxhighlight>


== Table or column collation ==
==Table or column collation==
You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns.  Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing.  MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion]  
You can go further than setting the database, or table defaults; you can look at collation and character sets for individual columns.  Individual columns probably don't need to be (and shouldn't be) converted to a new character set or collation unless you are certain you know what you're doing.  MySQL has this [https://dev.mysql.com/doc/refman/5.5/en/charset-conversion.html reference on charset conversion]  


You can look at the distribution of column collations with  
You can look at the distribution of column collations with  
<source lang="mysql">
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'mediawiki';
SET @DATABASE_NAME = 'mediawiki';
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
Line 121: Line 245:
WHERE  table_schema = @DATABASE_NAME
WHERE  table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
GROUP BY COLLATION_NAME;
</source>
</syntaxhighlight>


For a default demo database, you might see something like this
For a default demo database, you might see something like this
{| class="wikitable"
{| class="wikitable"
|-
|-
| TABLE_SCHEMA || count || COLLATION_NAME  
|TABLE_SCHEMA||count||COLLATION_NAME
|-
|-
| wiki_demo   ||     0 || NULL            
|wiki_demo||0||NULL
|-
|-
| wiki_demo   ||   36 || latin1_bin        
|wiki_demo||36||latin1_bin
|-
|-
| wiki_demo   ||   37 || latin1_swedish_ci  
|wiki_demo||37||latin1_swedish_ci
|-
|-
| wiki_demo   ||     3 || utf8_general_ci  
|wiki_demo||3||utf8_general_ci
|}
|}


or the full detail at  
or the full detail at  
<source lang="mysql">
<syntaxhighlight lang="mysql">
SELECT  
SELECT  
   TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME  
   TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME  
FROM information_schema.columns AS cols  
FROM information_schema.columns AS cols  
WHERE table_schema = "wiki_demo";
WHERE table_schema = "wiki_demo";
</source>
</syntaxhighlight>
 
 
 
 
 
 
 
<references />


[[Category:MediaWiki]]
[[Category:MediaWiki]]