Open main menu

Changes

no edit summary
In MWStake General, I [https://matrix.to/#/!NGZmJSwAAwbGRxhWwH:matrix.org/$HObrGFUPO_XBbhS3STTjQ5R-i-uV3cRjVIpPuAhP0i4?via=matrix.org&via=marijn.it&via=converser.eu said]
Short answer:
 
1. yes, you should change the <code>$wgDBTableOptions</code> (see below) to the new default.
However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them.
2. Do NOT <code>CONVERT TO</code> all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka Mojibake).
 
Using
 
<source lang="mysql">
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
</source>
 
will show you some info about how your database server, client, filesystem etc. is currently setup.
 
Here's a [https://stackoverflow.com/questions/54885178/whats-the-difference-between-utf8-unicode-ci-and-utf8mb4-0900-ai-ci Stack Overflow question + answer] that explains the different encodings (byte mappings) and collations (sorting rules) and why it _might_ matter to you. The newest collations only really matter if you're trying to chase the long tail of performance optimizations.
 
BTW, the answer is by Rick James (not the singer - the [[MySQL|database expert]])
 
From WMF, I see [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+blame/0adab4be323405345c3c63428747c12e3ad4aea2/modules/galera/templates/server.cnf.erb this puppet template for the galera configuration], updated on 2020-06-11 specifying
<pre>
# character sets and collations
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
</pre>
 
As you point out, the setting for [https://www.mediawiki.org/wiki/Manual:$wgDBTableOptions $wgDBTableOptions] changed back in REL1_32 to use a "binary" character set as the default
<code>$wgDBTableOptions = 'ENGINE=InnoDB, DEFAULT CHARSET=binary';</code>
 
Apparently the prior "option" to choose the charset (between utf8 and binary) was hidden from the UI of the installer in MediaWiki 1.31.0-rc.2, so the change in REL1_32 was to make the default config setting match what the installer was already doing (using the binary charset for NEW installations).
 
Keep in mind that MediaWiki outputs UTF8 to the browser:
<code>$response->header( 'Content-Type: text/html; charset=UTF-8' );</code>
But that is a different aspect to the whole UTF-8 question.
 
In the database backend, MediaWiki switched over some years ago (I think it was REL1_35) to use the Doctrine Database Abstraction Layer (Doctrine DBAL); using [https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.json JSON to represent the schemas for each table] as well as to define "abstract" schema changes that work across multiple DB engines. There is a PHP script that turns that into [https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables-generated.sql generated SQL]
 
You can see how the [https://gerrit.wikimedia.org/g/mediawiki/core/+/ce9d259317b876fdd2334d4065230b2f9e1f1510/includes/installer/MysqlInstaller.php#54 default Character Set for MySQL is 'binary'] in the installer, but this does not mean that MediaWiki changes the character set of an existing installation.
 
My recent installation of MediaWiki 1.39 is actually latin1 -- I'm off to figure out why that is the case... possibly due to the default of the MariaDB instance I'm using.
 
Although it is old information (2012), User Daniel Renfro put together some nice notes on [https://www.mediawiki.org/wiki/User:DanielRenfro/Character_Encoding Character Encoding] and how it plays out (or plays havoc) for you.
 
== Tests ==
<source lang="sql">
select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8);
select CONVERT(CONVERT('Mike’s' USING latin1) USING binary);
</source>
produces
<pre>
Mike’s
Mikeâ s
</pre>
 
== MediaWiki ==
* Is this a bug? https://github.com/wikimedia/mediawiki/blame/master/includes/installer/MysqlInstaller.php#L329 the installer specifies utf8 when it supposedly is 'binary'.
This is one of the few places in the code where [https://codesearch.wmcloud.org/core/?q=create+database&files=&excludeFiles=i18n&repos=&i=fosho 'create database' is ever called]
* https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server '''Seems like a 'bug' to have a minimum DB requirement for a DB that is out of support'''
* https://www.mediawiki.org/wiki/Manual:MysqlUpdater.php - a class called by the Installer that performs database updates.
 
== checkStorage.php ==
 
<pre>
Errors:
 
Flag statistics:
1575 15.53%
utf-8 8564 84.47%
 
Local object statistics:
</pre>
 
 
== Tables ==
<source lang="sql">
SET @DATABASE_NAME = 'wiki_greg';
ALTER TABLE `wiretap` CHARACTER SET = 'binary' COLLATE = 'BINARY';
ALTER TABLE `wiretap_counter_alltime` CHARACTER SET = 'binary' COLLATE = 'BINARY';
</pre>
 
== Columns ==
Show the columns that are not 'binary' and have a specific character set
<source lang="sql">
SET @DATABASE_NAME = 'wiki_greg';
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
CHARACTER_SET_NAME,
COLLATION_NAME
FROM
information_schema.columns AS cols
WHERE
table_schema = @DATABASE_NAME
AND COLLATION_NAME NOT IN ('binary');
</source>
 
<pre>
TABLE_SCHEMA|TABLE_NAME |COLUMN_NAME |CHARACTER_SET_NAME|COLLATION_NAME |
------------+-----------------+------------------------+------------------+-----------------+
wiki_greg |archive |ar_timestamp |latin1 |latin1_bin |
wiki_greg |bad_behavior |ip |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |request_method |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |request_uri |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |server_protocol |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |http_headers |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |user_agent |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |request_entity |latin1 |latin1_swedish_ci|
wiki_greg |bad_behavior |key |latin1 |latin1_swedish_ci|
wiki_greg |blobs |blob_index |latin1 |latin1_bin |
wiki_greg |brokenlinks |bl_to |latin1 |latin1_bin |
wiki_greg |categorylinks |cl_type |latin1 |latin1_swedish_ci|
wiki_greg |cur |cur_title |latin1 |latin1_bin |
wiki_greg |cur |cur_text |latin1 |latin1_swedish_ci|
wiki_greg |cur |cur_user_text |latin1 |latin1_bin |
wiki_greg |cur |cur_timestamp |latin1 |latin1_bin |
wiki_greg |cur |cur_touched |latin1 |latin1_bin |
wiki_greg |cur |inverse_timestamp |latin1 |latin1_bin |
wiki_greg |dune_searchindex |si_title |latin1 |latin1_swedish_ci|
wiki_greg |dune_searchindex |si_text |latin1 |latin1_swedish_ci|
wiki_greg |external_user |eu_external_id |latin1 |latin1_bin |
wiki_greg |filearchive |fa_storage_group |latin1 |latin1_swedish_ci|
wiki_greg |filearchive |fa_storage_key |latin1 |latin1_bin |
wiki_greg |filearchive |fa_deleted_timestamp |latin1 |latin1_bin |
wiki_greg |filearchive |fa_media_type |latin1 |latin1_swedish_ci|
wiki_greg |filearchive |fa_major_mime |latin1 |latin1_swedish_ci|
wiki_greg |filearchive |fa_timestamp |latin1 |latin1_bin |
wiki_greg |image |img_media_type |latin1 |latin1_swedish_ci|
wiki_greg |image |img_major_mime |latin1 |latin1_swedish_ci|
wiki_greg |interwiki |iw_prefix |latin1 |latin1_swedish_ci|
wiki_greg |interwiki |iw_url |latin1 |latin1_swedish_ci|
wiki_greg |interwiki |iw_wikiid |latin1 |latin1_swedish_ci|
wiki_greg |ipblocks |ipb_timestamp |latin1 |latin1_bin |
wiki_greg |ipblocks |ipb_expiry |latin1 |latin1_bin |
wiki_greg |ipblocks_old |ipb_address |latin1 |latin1_bin |
wiki_greg |ipblocks_old |ipb_timestamp |latin1 |latin1_bin |
wiki_greg |ipblocks_old |ipb_expiry |latin1 |latin1_bin |
wiki_greg |ipblocks_old |ipb_range_start |latin1 |latin1_swedish_ci|
wiki_greg |ipblocks_old |ipb_range_end |latin1 |latin1_swedish_ci|
wiki_greg |job |job_cmd |latin1 |latin1_swedish_ci|
wiki_greg |l10n_cache |lc_key |latin1 |latin1_swedish_ci|
wiki_greg |log_search |ls_value |latin1 |latin1_swedish_ci|
wiki_greg |logging |log_timestamp |latin1 |latin1_swedish_ci|
wiki_greg |math |math_inputhash |latin1 |latin1_swedish_ci|
wiki_greg |math |math_outputhash |latin1 |latin1_swedish_ci|
wiki_greg |math |math_html |latin1 |latin1_swedish_ci|
wiki_greg |math |math_mathml |latin1 |latin1_swedish_ci|
wiki_greg |objectcache |keyname |latin1 |latin1_bin |
wiki_greg |objectcache |modtoken |latin1 |latin1_swedish_ci|
wiki_greg |oldimage |oi_timestamp |latin1 |latin1_bin |
wiki_greg |oldimage |oi_media_type |latin1 |latin1_swedish_ci|
wiki_greg |oldimage |oi_major_mime |latin1 |latin1_swedish_ci|
wiki_greg |online |username |latin1 |latin1_swedish_ci|
wiki_greg |online |timestamp |latin1 |latin1_swedish_ci|
wiki_greg |online |wikiid |latin1 |latin1_swedish_ci|
wiki_greg |page |page_touched |latin1 |latin1_bin |
wiki_greg |page_restrictions|pr_type |latin1 |latin1_swedish_ci|
wiki_greg |page_restrictions|pr_level |latin1 |latin1_swedish_ci|
wiki_greg |page_restrictions|pr_expiry |latin1 |latin1_bin |
wiki_greg |querycache |qc_type |latin1 |latin1_swedish_ci|
wiki_greg |querycache_info |qci_type |latin1 |latin1_swedish_ci|
wiki_greg |querycache_info |qci_timestamp |latin1 |latin1_swedish_ci|
wiki_greg |querycachetwo |qcc_type |latin1 |latin1_swedish_ci|
wiki_greg |recentchanges |rc_log_type |latin1 |latin1_bin |
wiki_greg |recentchanges |rc_log_action |latin1 |latin1_bin |
wiki_greg |redirect |rd_interwiki |latin1 |latin1_swedish_ci|
wiki_greg |revision |rev_timestamp |latin1 |latin1_bin |
wiki_greg |searchindex |si_title |latin1 |latin1_swedish_ci|
wiki_greg |searchindex |si_text |latin1 |latin1_swedish_ci|
wiki_greg |sites |site_domain |latin1 |latin1_swedish_ci|
wiki_greg |smw_ft_search |o_text |utf8 |utf8_general_ci |
wiki_greg |text |old_title |latin1 |latin1_bin |
wiki_greg |text |old_text |latin1 |latin1_swedish_ci|
wiki_greg |text |old_user_text |latin1 |latin1_bin |
wiki_greg |text |old_timestamp |latin1 |latin1_bin |
wiki_greg |text |inverse_timestamp |latin1 |latin1_bin |
wiki_greg |trackbacks |tb_title |latin1 |latin1_swedish_ci|
wiki_greg |trackbacks |tb_url |latin1 |latin1_swedish_ci|
wiki_greg |trackbacks |tb_ex |latin1 |latin1_swedish_ci|
wiki_greg |trackbacks |tb_name |latin1 |latin1_swedish_ci|
wiki_greg |updatelog |ul_key |latin1 |latin1_swedish_ci|
wiki_greg |user |user_email |latin1 |latin1_swedish_ci|
wiki_greg |user |user_token |latin1 |latin1_bin |
wiki_greg |user |user_email_authenticated|latin1 |latin1_bin |
wiki_greg |user |user_email_token |latin1 |latin1_bin |
wiki_greg |user |user_email_token_expires|latin1 |latin1_bin |
wiki_greg |user |user_registration |latin1 |latin1_bin |
wiki_greg |user |user_newpass_time |latin1 |latin1_bin |
wiki_greg |user_newtalk |user_ip |latin1 |latin1_swedish_ci|
wiki_greg |user_openid |uoi_openid |latin1 |latin1_swedish_ci|
wiki_greg |validate |val_comment |latin1 |latin1_swedish_ci|
wiki_greg |validate |val_ip |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |page_name |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |user_name |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |hit_timestamp |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |hit_year |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |hit_month |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |hit_day |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |hit_hour |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |page_action |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |diff |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |referer_url |latin1 |latin1_swedish_ci|
wiki_greg |wiretap |referer_title |latin1 |latin1_swedish_ci|
</pre>