Collation: Difference between revisions

(One intermediate revision by the same user not shown)

Line 18:

Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').

<~~source~~ lang="mysql">

show GLOBAL VARIABLES LIKE 'character_set_%';

show VARIABLES LIKE 'collation%';

</~~source~~>This is the mess that I wound up with somehow:

</syntaxhighlight>This is the mess that I wound up with somehow:

<~~source~~ lang="mysql">

SET @DATABASE_NAME = 'wiki_freephile';

SELECT

Line 33:

table_schema = @DATABASE_NAME

GROUP BY COLLATION_NAME;

</~~source~~>

</syntaxhighlight>

<pre>

wiki_freephile, 0,

Line 41:

wiki_freephile, 1, utf8_general_ci

</pre>

So, ~~I fixed~~ the defaults for each database:

<~~source~~ lang="mysql">

Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)

So, start off by setting the defaults for each database:

ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';

ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';

ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';

</~~source~~>

</syntaxhighlight>

You can check the status of each table too:

You can check the status of each '''table''' too:

<~~source~~ lang="mysql">

SET @DATABASE_NAME = 'wiki_freephile';

SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION

Line 55:

Line 59:

WHERE TABLE_SCHEMA = @DATABASE_NAME

AND TABLE_COLLATION not in ('binary');

</~~source~~>

</syntaxhighlight>

Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:

ALTER TABLE table_name

CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];

</syntaxhighlight>

In my case, there are 90-100 columns that have old latin1 character sets.

SET @DATABASE_NAME = 'wiki_greg';

SELECT

TABLE_SCHEMA,

TABLE_NAME,

COLUMN_NAME,

CHARACTER_SET_NAME,

COLLATION_NAME

FROM

information_schema.columns AS cols

WHERE

table_schema = @DATABASE_NAME

AND COLLATION_NAME NOT IN ('binary');

</syntaxhighlight>

We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>

To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:

Line 129:

Line 158:

You can see what are the defaults for your databases with

<~~source~~ lang="bash">

mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'

</~~source~~>

</syntaxhighlight>

==DB engine Conversion==

Line 137:

Line 166:

This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6

<~~source~~ lang="mysql">

/* Find and convert your MediaWiki tables to use INNODB */

Line 174:

Line 203:

ALTER TABLE `blobs` ENGINE=InnoDB;

ALTER TABLE `archive` ENGINE=InnoDB;

</~~source~~>

</syntaxhighlight>

Line 184:

Line 213:

*If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.

The *recommended* charset/collation is 'binary BINARY'.

The '''recommended''' charset/collation is 'binary BINARY'.

<~~source~~ lang="mysql">

ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';

Line 204:

Line 233:

AND `tb`.table_name NOT REGEXP '.*searchindex'

ORDER BY table_name DESC;

</~~source~~>

</syntaxhighlight>

==Table or column collation==

Line 210:

Line 239:

You can look at the distribution of column collations with

<~~source~~ lang="mysql">

SET @DATABASE_NAME = 'mediawiki';

SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME

Line 216:

Line 245:

WHERE table_schema = @DATABASE_NAME

GROUP BY COLLATION_NAME;

</~~source~~>

</syntaxhighlight>

For a default demo database, you might see something like this

Line 233:

Line 262:

or the full detail at

<~~source~~ lang="mysql">

SELECT

TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME

FROM information_schema.columns AS cols

WHERE table_schema = "wiki_demo";

</~~source~~>

</syntaxhighlight>

[[Category:MediaWiki]]

Line 246:

Line 283:

[[Category:QualityBox]]

[[Category:Wiki]]

~~<references />~~

@@ Line 18: / Line 18: @@
 Character sets and collations can be set from the server level right down to the column level, as well as for client-server communication. For example, the <code>[https://mariadb.com/kb/en/server-system-variables/#collation_connection collation_connection]</code> '''server system variable'''. There is another KB page for [https://mariadb.com/kb/en/setting-character-sets-and-collations/ Setting Character Sets and Collations]. Keep in mind that there a many possible collations per character set, but only one default per character set. So, if you specify a character set, but are silent on the collation, you will receive the default. Character sets and collations always cascade down, so a column without a specified collation will look for the table default, the table for the database, and the database for the server. It's therefore possible to have extremely fine-grained control over all the character sets and collations used in your data. Collation names always start with the character set they correspond to, e.g. '''latin1''' charset and '''latin1'''_bin collation (the bin suffix stands for 'binary').
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 show GLOBAL VARIABLES LIKE 'character_set_%';
 show VARIABLES LIKE 'collation%';
-</source>This is the mess that I wound up with somehow:
+</syntaxhighlight>This is the mess that I wound up with somehow:
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SET @DATABASE_NAME = 'wiki_freephile';
 SELECT
@@ Line 33: / Line 33: @@
      table_schema = @DATABASE_NAME
 GROUP BY COLLATION_NAME;
-</source>
+</syntaxhighlight>
 <pre>
 wiki_freephile, 0,
@@ Line 41: / Line 41: @@
 wiki_freephile, 1, utf8_general_ci
 </pre>
-So, I fixed the defaults for each database:
-<source lang="mysql">
+Note that this query is counting the '''columns''' which have an undesirable collation. (Recall that encoding/collation can be set at the server, database, table, and column levels.)
+So, start off by setting the defaults for each database:
+<syntaxhighlight lang="mysql">
 ALTER DATABASE wiki_freephile CHARACTER SET = 'binary' COLLATE = 'BINARY';
 ALTER DATABASE wiki_meta CHARACTER SET = 'binary' COLLATE = 'BINARY';
 ALTER DATABASE wiki_wiki CHARACTER SET = 'binary' COLLATE = 'BINARY';
-</source>
+</syntaxhighlight>
-You can check the status of each table too:
+You can check the status of each '''table''' too:
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SET @DATABASE_NAME = 'wiki_freephile';
 SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
@@ Line 55: / Line 59: @@
 WHERE   TABLE_SCHEMA = @DATABASE_NAME
 AND TABLE_COLLATION not in ('binary');
-</source>
+</syntaxhighlight>
+Then, down in the section below on changing engines, there's an example SQL query for generating '''<code>ALTER TABLE</code>''' statements to fix the table defaults<ref>https://mariadb.com/kb/en/setting-character-sets-and-collations/#table-level</ref>. These statements would be in the form:
+<syntaxhighlight lang="mysql">
+ALTER TABLE table_name
+ CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
+</syntaxhighlight>
+In my case, there are 90-100 columns that have old latin1 character sets.
+<syntaxhighlight lang="mysql">
+SET @DATABASE_NAME = 'wiki_greg';
+SELECT
+	TABLE_SCHEMA,
+	TABLE_NAME,
+	COLUMN_NAME,
+	CHARACTER_SET_NAME,
+	COLLATION_NAME
+FROM
+	information_schema.columns AS cols
+WHERE
+	table_schema = @DATABASE_NAME
+	AND COLLATION_NAME NOT IN ('binary');
+</syntaxhighlight>
+We can use the information schema to list them all out. And you can also inspect each TABLE, with e.g. <code>SHOW CREATE TABLE brokenlinks \G</code>
 To set the mysql server default character set to binary <ref>https://mariadb.com/kb/en/library/setting-character-sets-and-collations/ https://scottlinux.com/2017/03/04/mysql-mariadb-set-character-set-and-collation-to-utf8/</ref>, add the following to <code>/etc/mysql/my.cnf</code> (Debian/Ubuntu) or <code>/etc/my.cnf</code> (Fedora/Centos/RHEL) under the '''<nowiki>[mysqld]</nowiki>''' section:
@@ Line 129: / Line 158: @@
 You can see what are the defaults for your databases with
-<source lang="bash">
+<syntaxhighlight lang="bash">
 mysql -e 'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;'
-</source>
+</syntaxhighlight>
 ==DB engine Conversion==
@@ Line 137: / Line 166: @@
 This SQL worked flawlessly to switch any MyISAM tables to INNODB; without touching the *searchindex table(s) because that uses MySQL fulltext search which doesn't come into INNODB until MySQL v. 5.6
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 /* Find and convert your MediaWiki tables to use INNODB */
@@ Line 174: / Line 203: @@
 ALTER TABLE `blobs` ENGINE=InnoDB;
 ALTER TABLE `archive` ENGINE=InnoDB;
-</source>
+</syntaxhighlight>
@@ Line 184: / Line 213: @@
 *If you ever need to put the DB data into something Unicode, then be sure to use the utf8mb4 (Multi-byte 4) variety.
-The *recommended* charset/collation is 'binary BINARY'.
+The '''recommended''' charset/collation is 'binary BINARY'.
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 ALTER table <tablename> CHARACTER SET = 'binary' COLLATE = 'BINARY';
@@ Line 204: / Line 233: @@
          AND `tb`.table_name NOT REGEXP '.*searchindex'
 ORDER BY table_name DESC;
-</source>
+</syntaxhighlight>
 ==Table or column collation==
@@ Line 210: / Line 239: @@
 You can look at the distribution of column collations with
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SET @DATABASE_NAME = 'mediawiki';
 SELECT TABLE_SCHEMA, count(COLLATION_NAME) AS 'count', COLLATION_NAME
@@ Line 216: / Line 245: @@
 WHERE   table_schema = @DATABASE_NAME
 GROUP BY COLLATION_NAME;
-</source>
+</syntaxhighlight>
 For a default demo database, you might see something like this
@@ Line 233: / Line 262: @@
 or the full detail at
-<source lang="mysql">
+<syntaxhighlight lang="mysql">
 SELECT
    TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME
 FROM information_schema.columns AS cols
 WHERE table_schema = "wiki_demo";
-</source>
+</syntaxhighlight>
+<references />
 [[Category:MediaWiki]]
@@ Line 246: / Line 283: @@
 [[Category:QualityBox]]
 [[Category:Wiki]]
-<references />