Collation: Difference between revisions

Changing Character Sets and Collations: update config example; remove duplicate
m Text replacement - "<(\/?)source" to "<$1syntaxhighlight"
 
(One intermediate revision by the same user not shown)
Line 18: Line 18:
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').


<source lang="mysql">
<syntaxhighlight lang="mysql">
show GLOBAL VARIABLES LIKE 'character_set_%';
show GLOBAL VARIABLES LIKE 'character_set_%';
show VARIABLES LIKE 'collation%';
show VARIABLES LIKE 'collation%';
</source>This is the mess that I wound up with somehow:
</syntaxhighlight>This is the mess that I wound up with somehow:
<source lang="mysql">
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_freephile';
SET @DATABASE_NAME = 'wiki_freephile';
SELECT  
SELECT  
Line 33: Line 33:
     table_schema = @DATABASE_NAME
     table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
GROUP BY COLLATION_NAME;
</source>
</syntaxhighlight>
<pre>
<pre>
wiki_freephile, 0,  
wiki_freephile, 0,  
Line 41: Line 41:
wiki_freephile, 1, utf8_general_ci
wiki_freephile, 1, utf8_general_ci
</pre>
</pre>
So, I fixed the defaults for each database:
 
<source lang="mysql">
Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)
 
 
So, start off by setting the defaults for each database:
<syntaxhighlight lang="mysql">
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
</source>
</syntaxhighlight>


You can check the status of each table too:
You can check the status of each '''table''' too:
<source lang="mysql">
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_freephile';
SET @DATABASE_NAME = 'wiki_freephile';
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
Line 55: Line 59:
WHERE  TABLE_SCHEMA = @DATABASE_NAME
WHERE  TABLE_SCHEMA = @DATABASE_NAME
AND TABLE_COLLATION not in ('binary');
AND TABLE_COLLATION not in ('binary');
</source>
</syntaxhighlight>
 
Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:
<syntaxhighlight lang="mysql">
ALTER TABLE table_name
CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
</syntaxhighlight>
 
In my case, there are 90-100 columns that have old latin1 character sets.
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'wiki_greg';
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
AND COLLATION_NAME NOT IN ('binary');
</syntaxhighlight>
We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>
 
 


To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:
To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:
Line 129: Line 158:
You can see what are the defaults for your databases with
You can see what are the defaults for your databases with


<source lang="bash">
<syntaxhighlight lang="bash">
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
</source>
</syntaxhighlight>


==DB engine Conversion==
==DB engine Conversion==
Line 137: Line 166:


This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6   
This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6   
<source lang="mysql">
<syntaxhighlight lang="mysql">
/* Find and convert your MediaWiki tables to use INNODB */
/* Find and convert your MediaWiki tables to use INNODB */


Line 174: Line 203:
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `blobs` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;
ALTER TABLE `archive` ENGINE=InnoDB;
</source>
</syntaxhighlight>




Line 184: Line 213:
*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.


The *recommended* charset/collation is 'binary BINARY'.
The '''recommended''' charset/collation is 'binary BINARY'.
<source lang="mysql">
<syntaxhighlight lang="mysql">
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';


Line 204: Line 233:
         AND `tb`.table_name NOT REGEXP '.*searchindex'
         AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;
ORDER BY table_name DESC;
</source>
</syntaxhighlight>


==Table or column collation==
==Table or column collation==
Line 210: Line 239:


You can look at the distribution of column collations with  
You can look at the distribution of column collations with  
<source lang="mysql">
<syntaxhighlight lang="mysql">
SET @DATABASE_NAME = 'mediawiki';
SET @DATABASE_NAME = 'mediawiki';
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
Line 216: Line 245:
WHERE  table_schema = @DATABASE_NAME
WHERE  table_schema = @DATABASE_NAME
GROUP BY COLLATION_NAME;
GROUP BY COLLATION_NAME;
</source>
</syntaxhighlight>


For a default demo database, you might see something like this
For a default demo database, you might see something like this
Line 233: Line 262:


or the full detail at  
or the full detail at  
<source lang="mysql">
<syntaxhighlight lang="mysql">
SELECT  
SELECT  
   TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME  
   TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME  
FROM information_schema.columns AS cols  
FROM information_schema.columns AS cols  
WHERE table_schema = "wiki_demo";
WHERE table_schema = "wiki_demo";
</source>
</syntaxhighlight>
 
 
 
 
 
 
 
<references />


[[Category:MediaWiki]]
[[Category:MediaWiki]]
Line 246: Line 283:
[[Category:QualityBox]]
[[Category:QualityBox]]
[[Category:Wiki]]
[[Category:Wiki]]
<references />