Difference between revisions of "Talk:Collation"
(4 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | This page is all about Collation in MediaWiki - and should probably become a new article to supplement the general topic page. | ||
+ | |||
+ | == See Also== | ||
+ | https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation and https://github.com/wikimedia/mediawiki/blob/master/maintenance/updateCollation.php | ||
+ | Other maintenance scripts such as https://github.com/wikimedia/mediawiki/blob/master/maintenance/uppercaseTitlesForUnicodeTransition.php or other extensions etc. may be helpful if you need to "fix" things. | ||
+ | |||
+ | |||
+ | |||
In MWStake General, I [https://matrix.to/#/!NGZmJSwAAwbGRxhWwH:matrix.org/$HObrGFUPO_XBbhS3STTjQ5R-i-uV3cRjVIpPuAhP0i4?via=matrix.org&via=marijn.it&via=converser.eu said] | In MWStake General, I [https://matrix.to/#/!NGZmJSwAAwbGRxhWwH:matrix.org/$HObrGFUPO_XBbhS3STTjQ5R-i-uV3cRjVIpPuAhP0i4?via=matrix.org&via=marijn.it&via=converser.eu said] | ||
Short answer: | Short answer: | ||
Line 4: | Line 12: | ||
1. yes, you should change the <code>$wgDBTableOptions</code> (see below) to the new default. | 1. yes, you should change the <code>$wgDBTableOptions</code> (see below) to the new default. | ||
However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them. | However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them. | ||
− | 2. Do NOT <code>CONVERT TO</code> all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka Mojibake). | + | 2. Do NOT <code>CONVERT TO</code> all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka [[wp:Mojibake|Mojibake]]). |
Using | Using | ||
Line 58: | Line 66: | ||
* https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server '''Seems like a 'bug' to have a minimum DB requirement for a DB that is out of support''' | * https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server '''Seems like a 'bug' to have a minimum DB requirement for a DB that is out of support''' | ||
* https://www.mediawiki.org/wiki/Manual:MysqlUpdater.php - a class called by the Installer that performs database updates. | * https://www.mediawiki.org/wiki/Manual:MysqlUpdater.php - a class called by the Installer that performs database updates. | ||
+ | |||
+ | == Conversion == | ||
+ | If you need to do conversion, check out the [https://dev.mysql.com/doc/refman/8.3/en/charset-conversion.html MySQL docs v8.3] and [https://dev.mysql.com/doc/refman/5.7/en/charset-conversion.html MySQL docs v5.7] | ||
== checkStorage.php == | == checkStorage.php == | ||
− | What does the [https://doc.wikimedia.org/mediawiki-core/master/php/classCheckStorage.html CheckStorage] class ([https://www.mediawiki.org/wiki/Manual:CheckStorage.php Manual] [https://doc.wikimedia.org/mediawiki-core/master/php/checkStorage_8php_source.html source]) actually do? It is used in conjunction with the [[mw:Manual:External_storage|Manual:External_storage]] capability of MediaWiki and therefore is irrelevant here because we're not using External Storage. But, here's the output of running the script for information's sake: | + | What does the [https://doc.wikimedia.org/mediawiki-core/master/php/classCheckStorage.html CheckStorage] class ([https://www.mediawiki.org/wiki/Manual:CheckStorage.php Manual] [https://doc.wikimedia.org/mediawiki-core/master/php/checkStorage_8php_source.html source]) actually do? It is used in conjunction with the [[mw:Manual:External_storage|Manual:External_storage]] ([https://wikitech.wikimedia.org/wiki/External_storage Wikitech]) capability of MediaWiki and therefore is irrelevant here because we're not using External Storage. But, here's the output of running the script for information's sake: |
<pre> | <pre> |
Latest revision as of 09:36, 26 April 2024
This page is all about Collation in MediaWiki - and should probably become a new article to supplement the general topic page.
See Also[edit source]
https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation and https://github.com/wikimedia/mediawiki/blob/master/maintenance/updateCollation.php Other maintenance scripts such as https://github.com/wikimedia/mediawiki/blob/master/maintenance/uppercaseTitlesForUnicodeTransition.php or other extensions etc. may be helpful if you need to "fix" things.
In MWStake General, I said Short answer:
1. yes, you should change the $wgDBTableOptions
(see below) to the new default.
However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them.
2. Do NOT CONVERT TO
all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka Mojibake).
Using
SHOW VARIABLES WHERE Variable_name LIKE 'character_set_%' OR Variable_name LIKE 'collation%';
will show you some info about how your database server, client, filesystem etc. is currently setup.
Here's a Stack Overflow question + answer that explains the different encodings (byte mappings) and collations (sorting rules) and why it _might_ matter to you. The newest collations only really matter if you're trying to chase the long tail of performance optimizations.
BTW, the answer is by Rick James (not the singer - the database expert)
From WMF, I see this puppet template for the galera configuration, updated on 2020-06-11 specifying
# character sets and collations character-set-server = utf8mb4 collation-server = utf8mb4_general_ci
As you point out, the setting for $wgDBTableOptions changed back in REL1_32 to use a "binary" character set as the default
$wgDBTableOptions = 'ENGINE=InnoDB, DEFAULT CHARSET=binary';
Apparently the prior "option" to choose the charset (between utf8 and binary) was hidden from the UI of the installer in MediaWiki 1.31.0-rc.2, so the change in REL1_32 was to make the default config setting match what the installer was already doing (using the binary charset for NEW installations).
Keep in mind that MediaWiki outputs UTF8 to the browser:
$response->header( 'Content-Type: text/html; charset=UTF-8' );
But that is a different aspect to the whole UTF-8 question.
In the database backend, MediaWiki switched over some years ago (I think it was REL1_35) to use the Doctrine Database Abstraction Layer (Doctrine DBAL); using JSON to represent the schemas for each table as well as to define "abstract" schema changes that work across multiple DB engines. There is a PHP script that turns that into generated SQL
You can see how the default Character Set for MySQL is 'binary' in the installer, but this does not mean that MediaWiki changes the character set of an existing installation.
My recent installation of MediaWiki 1.39 is actually latin1 -- I'm off to figure out why that is the case... possibly due to the default of the MariaDB instance I'm using.
Although it is old information (2012), User Daniel Renfro put together some nice notes on Character Encoding and how it plays out (or plays havoc) for you.
Tests[edit source]
select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
select CONVERT(CONVERT('Mike’s' USING latin1) USING binary);
produces
Mike’s Mikeâ s
MediaWiki[edit source]
- Is this a bug? https://github.com/wikimedia/mediawiki/blame/master/includes/installer/MysqlInstaller.php#L329 the installer specifies utf8 when it supposedly is 'binary'.
This is one of the few places in the code where 'create database' is ever called
- https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server Seems like a 'bug' to have a minimum DB requirement for a DB that is out of support
- https://www.mediawiki.org/wiki/Manual:MysqlUpdater.php - a class called by the Installer that performs database updates.
Conversion[edit source]
If you need to do conversion, check out the MySQL docs v8.3 and MySQL docs v5.7
checkStorage.php[edit source]
What does the CheckStorage class (Manual source) actually do? It is used in conjunction with the Manual:External_storage (Wikitech) capability of MediaWiki and therefore is irrelevant here because we're not using External Storage. But, here's the output of running the script for information's sake:
Errors: Flag statistics: 1575 15.53% utf-8 8564 84.47% Local object statistics:
Tables[edit source]
SET @DATABASE_NAME = 'wiki_greg';
SELECT
CONCAT("ALTER TABLE `",
table_name,
"` CHARACTER SET = 'binary' COLLATE = 'BINARY';") AS sql_statements
FROM
information_schema.tables AS tb
WHERE
table_schema = @DATABASE_NAME
AND `TABLE_COLLATION` = 'latin1_swedish_ci'
AND `TABLE_TYPE` = 'BASE TABLE'
AND `tb`.table_name NOT REGEXP '.*searchindex'
ORDER BY table_name DESC;
First thing you need to do is use wiki_greg;
Then you can alter all the tables by copy/pasting all the generated lines. Altering table defaults will NOT change definitions on columns which specify CHARACTER SET or collation.
ALTER TABLE `archive` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `bad_behavior` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `blobs` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `brokenlinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `category` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `categorylinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `change_tag` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `cur` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `dune_hitcounter` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `external_user` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `externallinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `filearchive` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `image` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `imagelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `interwiki` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `ipblocks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `ipblocks_old` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `job` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `l10n_cache` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `langlinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `links` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `linkscc` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `linktarget` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `log_search` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `logging` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `math` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `objectcache` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `oldimage` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `online` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `page` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `page_props` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `page_restrictions` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `pagelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `protected_titles` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `querycache` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `querycache_info` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `querycachetwo` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `recentchanges` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `redirect` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `revision` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `site_identifiers` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `site_stats` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `sites` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `templatelinks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `text` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `trackbacks` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `updatelog` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_autocreate_serial` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_groups` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_newtalk` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_openid` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_properties` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `user_rights` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `validate` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `watchlist` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `watchlist_expiry` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `wiretap` CHARACTER SET = 'binary' COLLATE = 'BINARY'; ALTER TABLE `wiretap_counter_alltime` CHARACTER SET = 'binary' COLLATE = 'BINARY';
Columns[edit source]
Show the columns that are not 'binary' and have a specific character set
SET @DATABASE_NAME = 'wiki_greg';
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
AND COLLATION_NAME NOT IN ('binary');
TABLE_SCHEMA|TABLE_NAME |COLUMN_NAME |CHARACTER_SET_NAME|COLLATION_NAME | ------------+-----------------+------------------------+------------------+-----------------+ wiki_greg |archive |ar_timestamp |latin1 |latin1_bin | wiki_greg |bad_behavior |ip |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |request_method |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |request_uri |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |server_protocol |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |http_headers |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |user_agent |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |request_entity |latin1 |latin1_swedish_ci| wiki_greg |bad_behavior |key |latin1 |latin1_swedish_ci| wiki_greg |blobs |blob_index |latin1 |latin1_bin | wiki_greg |brokenlinks |bl_to |latin1 |latin1_bin | wiki_greg |categorylinks |cl_type |latin1 |latin1_swedish_ci| wiki_greg |cur |cur_title |latin1 |latin1_bin | wiki_greg |cur |cur_text |latin1 |latin1_swedish_ci| wiki_greg |cur |cur_user_text |latin1 |latin1_bin | wiki_greg |cur |cur_timestamp |latin1 |latin1_bin | wiki_greg |cur |cur_touched |latin1 |latin1_bin | wiki_greg |cur |inverse_timestamp |latin1 |latin1_bin | wiki_greg |dune_searchindex |si_title |latin1 |latin1_swedish_ci| wiki_greg |dune_searchindex |si_text |latin1 |latin1_swedish_ci| wiki_greg |external_user |eu_external_id |latin1 |latin1_bin | wiki_greg |filearchive |fa_storage_group |latin1 |latin1_swedish_ci| wiki_greg |filearchive |fa_storage_key |latin1 |latin1_bin | wiki_greg |filearchive |fa_deleted_timestamp |latin1 |latin1_bin | wiki_greg |filearchive |fa_media_type |latin1 |latin1_swedish_ci| wiki_greg |filearchive |fa_major_mime |latin1 |latin1_swedish_ci| wiki_greg |filearchive |fa_timestamp |latin1 |latin1_bin | wiki_greg |image |img_media_type |latin1 |latin1_swedish_ci| wiki_greg |image |img_major_mime |latin1 |latin1_swedish_ci| wiki_greg |interwiki |iw_prefix |latin1 |latin1_swedish_ci| wiki_greg |interwiki |iw_url |latin1 |latin1_swedish_ci| wiki_greg |interwiki |iw_wikiid |latin1 |latin1_swedish_ci| wiki_greg |ipblocks |ipb_timestamp |latin1 |latin1_bin | wiki_greg |ipblocks |ipb_expiry |latin1 |latin1_bin | wiki_greg |ipblocks_old |ipb_address |latin1 |latin1_bin | wiki_greg |ipblocks_old |ipb_timestamp |latin1 |latin1_bin | wiki_greg |ipblocks_old |ipb_expiry |latin1 |latin1_bin | wiki_greg |ipblocks_old |ipb_range_start |latin1 |latin1_swedish_ci| wiki_greg |ipblocks_old |ipb_range_end |latin1 |latin1_swedish_ci| wiki_greg |job |job_cmd |latin1 |latin1_swedish_ci| wiki_greg |l10n_cache |lc_key |latin1 |latin1_swedish_ci| wiki_greg |log_search |ls_value |latin1 |latin1_swedish_ci| wiki_greg |logging |log_timestamp |latin1 |latin1_swedish_ci| wiki_greg |math |math_inputhash |latin1 |latin1_swedish_ci| wiki_greg |math |math_outputhash |latin1 |latin1_swedish_ci| wiki_greg |math |math_html |latin1 |latin1_swedish_ci| wiki_greg |math |math_mathml |latin1 |latin1_swedish_ci| wiki_greg |objectcache |keyname |latin1 |latin1_bin | wiki_greg |objectcache |modtoken |latin1 |latin1_swedish_ci| wiki_greg |oldimage |oi_timestamp |latin1 |latin1_bin | wiki_greg |oldimage |oi_media_type |latin1 |latin1_swedish_ci| wiki_greg |oldimage |oi_major_mime |latin1 |latin1_swedish_ci| wiki_greg |online |username |latin1 |latin1_swedish_ci| wiki_greg |online |timestamp |latin1 |latin1_swedish_ci| wiki_greg |online |wikiid |latin1 |latin1_swedish_ci| wiki_greg |page |page_touched |latin1 |latin1_bin | wiki_greg |page_restrictions|pr_type |latin1 |latin1_swedish_ci| wiki_greg |page_restrictions|pr_level |latin1 |latin1_swedish_ci| wiki_greg |page_restrictions|pr_expiry |latin1 |latin1_bin | wiki_greg |querycache |qc_type |latin1 |latin1_swedish_ci| wiki_greg |querycache_info |qci_type |latin1 |latin1_swedish_ci| wiki_greg |querycache_info |qci_timestamp |latin1 |latin1_swedish_ci| wiki_greg |querycachetwo |qcc_type |latin1 |latin1_swedish_ci| wiki_greg |recentchanges |rc_log_type |latin1 |latin1_bin | wiki_greg |recentchanges |rc_log_action |latin1 |latin1_bin | wiki_greg |redirect |rd_interwiki |latin1 |latin1_swedish_ci| wiki_greg |revision |rev_timestamp |latin1 |latin1_bin | wiki_greg |searchindex |si_title |latin1 |latin1_swedish_ci| wiki_greg |searchindex |si_text |latin1 |latin1_swedish_ci| wiki_greg |sites |site_domain |latin1 |latin1_swedish_ci| wiki_greg |smw_ft_search |o_text |utf8 |utf8_general_ci | wiki_greg |text |old_title |latin1 |latin1_bin | wiki_greg |text |old_text |latin1 |latin1_swedish_ci| wiki_greg |text |old_user_text |latin1 |latin1_bin | wiki_greg |text |old_timestamp |latin1 |latin1_bin | wiki_greg |text |inverse_timestamp |latin1 |latin1_bin | wiki_greg |trackbacks |tb_title |latin1 |latin1_swedish_ci| wiki_greg |trackbacks |tb_url |latin1 |latin1_swedish_ci| wiki_greg |trackbacks |tb_ex |latin1 |latin1_swedish_ci| wiki_greg |trackbacks |tb_name |latin1 |latin1_swedish_ci| wiki_greg |updatelog |ul_key |latin1 |latin1_swedish_ci| wiki_greg |user |user_email |latin1 |latin1_swedish_ci| wiki_greg |user |user_token |latin1 |latin1_bin | wiki_greg |user |user_email_authenticated|latin1 |latin1_bin | wiki_greg |user |user_email_token |latin1 |latin1_bin | wiki_greg |user |user_email_token_expires|latin1 |latin1_bin | wiki_greg |user |user_registration |latin1 |latin1_bin | wiki_greg |user |user_newpass_time |latin1 |latin1_bin | wiki_greg |user_newtalk |user_ip |latin1 |latin1_swedish_ci| wiki_greg |user_openid |uoi_openid |latin1 |latin1_swedish_ci| wiki_greg |validate |val_comment |latin1 |latin1_swedish_ci| wiki_greg |validate |val_ip |latin1 |latin1_swedish_ci| wiki_greg |wiretap |page_name |latin1 |latin1_swedish_ci| wiki_greg |wiretap |user_name |latin1 |latin1_swedish_ci| wiki_greg |wiretap |hit_timestamp |latin1 |latin1_swedish_ci| wiki_greg |wiretap |hit_year |latin1 |latin1_swedish_ci| wiki_greg |wiretap |hit_month |latin1 |latin1_swedish_ci| wiki_greg |wiretap |hit_day |latin1 |latin1_swedish_ci| wiki_greg |wiretap |hit_hour |latin1 |latin1_swedish_ci| wiki_greg |wiretap |page_action |latin1 |latin1_swedish_ci| wiki_greg |wiretap |diff |latin1 |latin1_swedish_ci| wiki_greg |wiretap |referer_url |latin1 |latin1_swedish_ci| wiki_greg |wiretap |referer_title |latin1 |latin1_swedish_ci|