Collation: Difference between revisions
→Changing Character Sets and Collations: update config example; remove duplicate |
m Text replacement - "<(\/?)source" to "<$1syntaxhighlight" |
||
| (One intermediate revision by the same user not shown) | |||
| Line 18: | Line 18: | ||
Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary'). | Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary'). | ||
< | <syntaxhighlight lang="mysql"> | ||
show GLOBAL VARIABLES LIKE 'character_set_%'; | show GLOBAL VARIABLES LIKE 'character_set_%'; | ||
show VARIABLES LIKE 'collation%'; | show VARIABLES LIKE 'collation%'; | ||
</ | </syntaxhighlight>This is the mess that I wound up with somehow: | ||
< | <syntaxhighlight lang="mysql"> | ||
SET @DATABASE_NAME = 'wiki_freephile'; | SET @DATABASE_NAME = 'wiki_freephile'; | ||
SELECT | SELECT | ||
| Line 33: | Line 33: | ||
table_schema = @DATABASE_NAME | table_schema = @DATABASE_NAME | ||
GROUP BY COLLATION_NAME; | GROUP BY COLLATION_NAME; | ||
</ | </syntaxhighlight> | ||
<pre> | <pre> | ||
wiki_freephile, 0, | wiki_freephile, 0, | ||
| Line 41: | Line 41: | ||
wiki_freephile, 1, utf8_general_ci | wiki_freephile, 1, utf8_general_ci | ||
</pre> | </pre> | ||
So, | |||
< | Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.) | ||
So, start off by setting the defaults for each database: | |||
<syntaxhighlight lang="mysql"> | |||
ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
</ | </syntaxhighlight> | ||
You can check the status of each table too: | You can check the status of each '''table''' too: | ||
< | <syntaxhighlight lang="mysql"> | ||
SET @DATABASE_NAME = 'wiki_freephile'; | SET @DATABASE_NAME = 'wiki_freephile'; | ||
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION | SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION | ||
| Line 55: | Line 59: | ||
WHERE TABLE_SCHEMA = @DATABASE_NAME | WHERE TABLE_SCHEMA = @DATABASE_NAME | ||
AND TABLE_COLLATION not in ('binary'); | AND TABLE_COLLATION not in ('binary'); | ||
</ | </syntaxhighlight> | ||
Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form: | |||
<syntaxhighlight lang="mysql"> | |||
ALTER TABLE table_name | |||
CONVERT TO CHARACTER SET charset_name [COLLATE collation_name]; | |||
</syntaxhighlight> | |||
In my case, there are 90-100 columns that have old latin1 character sets. | |||
<syntaxhighlight lang="mysql"> | |||
SET @DATABASE_NAME = 'wiki_greg'; | |||
SELECT | |||
TABLE_SCHEMA, | |||
TABLE_NAME, | |||
COLUMN_NAME, | |||
CHARACTER_SET_NAME, | |||
COLLATION_NAME | |||
FROM | |||
information_schema.columns AS cols | |||
WHERE | |||
table_schema = @DATABASE_NAME | |||
AND COLLATION_NAME NOT IN ('binary'); | |||
</syntaxhighlight> | |||
We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code> | |||
To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section: | To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section: | ||
| Line 129: | Line 158: | ||
You can see what are the defaults for your databases with | You can see what are the defaults for your databases with | ||
< | <syntaxhighlight lang="bash"> | ||
mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;' | mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;' | ||
</ | </syntaxhighlight> | ||
==DB engine Conversion== | ==DB engine Conversion== | ||
| Line 137: | Line 166: | ||
This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6 | This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6 | ||
< | <syntaxhighlight lang="mysql"> | ||
/* Find and convert your MediaWiki tables to use INNODB */ | /* Find and convert your MediaWiki tables to use INNODB */ | ||
| Line 174: | Line 203: | ||
ALTER TABLE `blobs` ENGINE=InnoDB; | ALTER TABLE `blobs` ENGINE=InnoDB; | ||
ALTER TABLE `archive` ENGINE=InnoDB; | ALTER TABLE `archive` ENGINE=InnoDB; | ||
</ | </syntaxhighlight> | ||
| Line 184: | Line 213: | ||
*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety. | *If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety. | ||
The | The '''recommended''' charset/collation is 'binary BINARY'. | ||
< | <syntaxhighlight lang="mysql"> | ||
ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
| Line 204: | Line 233: | ||
AND `tb`.table_name NOT REGEXP '.*searchindex' | AND `tb`.table_name NOT REGEXP '.*searchindex' | ||
ORDER BY table_name DESC; | ORDER BY table_name DESC; | ||
</ | </syntaxhighlight> | ||
==Table or column collation== | ==Table or column collation== | ||
| Line 210: | Line 239: | ||
You can look at the distribution of column collations with | You can look at the distribution of column collations with | ||
< | <syntaxhighlight lang="mysql"> | ||
SET @DATABASE_NAME = 'mediawiki'; | SET @DATABASE_NAME = 'mediawiki'; | ||
SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME | SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME | ||
| Line 216: | Line 245: | ||
WHERE table_schema = @DATABASE_NAME | WHERE table_schema = @DATABASE_NAME | ||
GROUP BY COLLATION_NAME; | GROUP BY COLLATION_NAME; | ||
</ | </syntaxhighlight> | ||
For a default demo database, you might see something like this | For a default demo database, you might see something like this | ||
| Line 233: | Line 262: | ||
or the full detail at | or the full detail at | ||
< | <syntaxhighlight lang="mysql"> | ||
SELECT | SELECT | ||
TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME | TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME | ||
FROM information_schema.columns AS cols | FROM information_schema.columns AS cols | ||
WHERE table_schema = "wiki_demo"; | WHERE table_schema = "wiki_demo"; | ||
</ | </syntaxhighlight> | ||
<references /> | |||
[[Category:MediaWiki]] | [[Category:MediaWiki]] | ||
| Line 246: | Line 283: | ||
[[Category:QualityBox]] | [[Category:QualityBox]] | ||
[[Category:Wiki]] | [[Category:Wiki]] | ||