Talk:Collation: Difference between revisions
document what I did on my test host |
No edit summary |
||
| Line 1: | Line 1: | ||
In MWStake General, I [https://matrix.to/#/!NGZmJSwAAwbGRxhWwH:matrix.org/$HObrGFUPO_XBbhS3STTjQ5R-i-uV3cRjVIpPuAhP0i4?via=matrix.org&via=marijn.it&via=converser.eu said] | |||
Short answer: | |||
1. yes, you should change the <code>$wgDBTableOptions</code> (see below) to the new default. | |||
However, this setting is only used during new table creation, so it doesn't really affect your existing database(s) and the tables in them. | |||
2. Do NOT <code>CONVERT TO</code> all your existing tables. You have to be more precise with what columns you change - if you change anything at all. You really only convert if you have some encoding problem (aka Mojibake). | |||
Using | |||
<source lang="mysql"> | |||
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%'; | |||
</source> | |||
will show you some info about how your database server, client, filesystem etc. is currently setup. | |||
Here's a [https://stackoverflow.com/questions/54885178/whats-the-difference-between-utf8-unicode-ci-and-utf8mb4-0900-ai-ci Stack Overflow question + answer] that explains the different encodings (byte mappings) and collations (sorting rules) and why it _might_ matter to you. The newest collations only really matter if you're trying to chase the long tail of performance optimizations. | |||
BTW, the answer is by Rick James (not the singer - the [[MySQL|database expert]]) | |||
From WMF, I see [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+blame/0adab4be323405345c3c63428747c12e3ad4aea2/modules/galera/templates/server.cnf.erb this puppet template for the galera configuration], updated on 2020-06-11 specifying | |||
<pre> | |||
# character sets and collations | |||
character-set-server = utf8mb4 | |||
collation-server = utf8mb4_general_ci | |||
</pre> | |||
As you point out, the setting for [https://www.mediawiki.org/wiki/Manual:$wgDBTableOptions $wgDBTableOptions] changed back in REL1_32 to use a "binary" character set as the default | |||
<code>$wgDBTableOptions = 'ENGINE=InnoDB, DEFAULT CHARSET=binary';</code> | |||
Apparently the prior "option" to choose the charset (between utf8 and binary) was hidden from the UI of the installer in MediaWiki 1.31.0-rc.2, so the change in REL1_32 was to make the default config setting match what the installer was already doing (using the binary charset for NEW installations). | |||
Keep in mind that MediaWiki outputs UTF8 to the browser: | |||
<code>$response->header( 'Content-Type: text/html; charset=UTF-8' );</code> | |||
But that is a different aspect to the whole UTF-8 question. | |||
In the database backend, MediaWiki switched over some years ago (I think it was REL1_35) to use the Doctrine Database Abstraction Layer (Doctrine DBAL); using [https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.json JSON to represent the schemas for each table] as well as to define "abstract" schema changes that work across multiple DB engines. There is a PHP script that turns that into [https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables-generated.sql generated SQL] | |||
You can see how the [https://gerrit.wikimedia.org/g/mediawiki/core/+/ce9d259317b876fdd2334d4065230b2f9e1f1510/includes/installer/MysqlInstaller.php#54 default Character Set for MySQL is 'binary'] in the installer, but this does not mean that MediaWiki changes the character set of an existing installation. | |||
My recent installation of MediaWiki 1.39 is actually latin1 -- I'm off to figure out why that is the case... possibly due to the default of the MariaDB instance I'm using. | |||
Although it is old information (2012), User Daniel Renfro put together some nice notes on [https://www.mediawiki.org/wiki/User:DanielRenfro/Character_Encoding Character Encoding] and how it plays out (or plays havoc) for you. | |||
== Tests == | |||
<source lang="sql"> | |||
select CONVERT(CONVERT(CONVERT('Mike’s' USING latin1) USING binary) USING utf8); | |||
select CONVERT(CONVERT('Mike’s' USING latin1) USING binary); | |||
</source> | |||
produces | |||
<pre> | |||
Mike’s | |||
Mikeâ s | |||
</pre> | |||
== MediaWiki == | |||
* Is this a bug? https://github.com/wikimedia/mediawiki/blame/master/includes/installer/MysqlInstaller.php#L329 the installer specifies utf8 when it supposedly is 'binary'. | |||
This is one of the few places in the code where [https://codesearch.wmcloud.org/core/?q=create+database&files=&excludeFiles=i18n&repos=&i=fosho 'create database' is ever called] | |||
* https://www.mediawiki.org/wiki/Manual:Installation_requirements#Database_server '''Seems like a 'bug' to have a minimum DB requirement for a DB that is out of support''' | |||
* https://www.mediawiki.org/wiki/Manual:MysqlUpdater.php - a class called by the Installer that performs database updates. | |||
== checkStorage.php == | |||
<pre> | |||
Errors: | |||
Flag statistics: | |||
1575 15.53% | |||
utf-8 8564 84.47% | |||
Local object statistics: | |||
</pre> | |||
== Tables == | |||
<source lang="sql"> | <source lang="sql"> | ||
SET @DATABASE_NAME = 'wiki_greg'; | SET @DATABASE_NAME = 'wiki_greg'; | ||
| Line 77: | Line 151: | ||
ALTER TABLE `wiretap` CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER TABLE `wiretap` CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
ALTER TABLE `wiretap_counter_alltime` CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ALTER TABLE `wiretap_counter_alltime` CHARACTER SET = 'binary' COLLATE = 'BINARY'; | ||
</pre> | |||
== Columns == | |||
Show the columns that are not 'binary' and have a specific character set | |||
<source lang="sql"> | |||
SET @DATABASE_NAME = 'wiki_greg'; | |||
SELECT | |||
TABLE_SCHEMA, | |||
TABLE_NAME, | |||
COLUMN_NAME, | |||
CHARACTER_SET_NAME, | |||
COLLATION_NAME | |||
FROM | |||
information_schema.columns AS cols | |||
WHERE | |||
table_schema = @DATABASE_NAME | |||
AND COLLATION_NAME NOT IN ('binary'); | |||
</source> | |||
<pre> | |||
TABLE_SCHEMA|TABLE_NAME |COLUMN_NAME |CHARACTER_SET_NAME|COLLATION_NAME | | |||
------------+-----------------+------------------------+------------------+-----------------+ | |||
wiki_greg |archive |ar_timestamp |latin1 |latin1_bin | | |||
wiki_greg |bad_behavior |ip |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |request_method |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |request_uri |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |server_protocol |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |http_headers |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |user_agent |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |request_entity |latin1 |latin1_swedish_ci| | |||
wiki_greg |bad_behavior |key |latin1 |latin1_swedish_ci| | |||
wiki_greg |blobs |blob_index |latin1 |latin1_bin | | |||
wiki_greg |brokenlinks |bl_to |latin1 |latin1_bin | | |||
wiki_greg |categorylinks |cl_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |cur |cur_title |latin1 |latin1_bin | | |||
wiki_greg |cur |cur_text |latin1 |latin1_swedish_ci| | |||
wiki_greg |cur |cur_user_text |latin1 |latin1_bin | | |||
wiki_greg |cur |cur_timestamp |latin1 |latin1_bin | | |||
wiki_greg |cur |cur_touched |latin1 |latin1_bin | | |||
wiki_greg |cur |inverse_timestamp |latin1 |latin1_bin | | |||
wiki_greg |dune_searchindex |si_title |latin1 |latin1_swedish_ci| | |||
wiki_greg |dune_searchindex |si_text |latin1 |latin1_swedish_ci| | |||
wiki_greg |external_user |eu_external_id |latin1 |latin1_bin | | |||
wiki_greg |filearchive |fa_storage_group |latin1 |latin1_swedish_ci| | |||
wiki_greg |filearchive |fa_storage_key |latin1 |latin1_bin | | |||
wiki_greg |filearchive |fa_deleted_timestamp |latin1 |latin1_bin | | |||
wiki_greg |filearchive |fa_media_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |filearchive |fa_major_mime |latin1 |latin1_swedish_ci| | |||
wiki_greg |filearchive |fa_timestamp |latin1 |latin1_bin | | |||
wiki_greg |image |img_media_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |image |img_major_mime |latin1 |latin1_swedish_ci| | |||
wiki_greg |interwiki |iw_prefix |latin1 |latin1_swedish_ci| | |||
wiki_greg |interwiki |iw_url |latin1 |latin1_swedish_ci| | |||
wiki_greg |interwiki |iw_wikiid |latin1 |latin1_swedish_ci| | |||
wiki_greg |ipblocks |ipb_timestamp |latin1 |latin1_bin | | |||
wiki_greg |ipblocks |ipb_expiry |latin1 |latin1_bin | | |||
wiki_greg |ipblocks_old |ipb_address |latin1 |latin1_bin | | |||
wiki_greg |ipblocks_old |ipb_timestamp |latin1 |latin1_bin | | |||
wiki_greg |ipblocks_old |ipb_expiry |latin1 |latin1_bin | | |||
wiki_greg |ipblocks_old |ipb_range_start |latin1 |latin1_swedish_ci| | |||
wiki_greg |ipblocks_old |ipb_range_end |latin1 |latin1_swedish_ci| | |||
wiki_greg |job |job_cmd |latin1 |latin1_swedish_ci| | |||
wiki_greg |l10n_cache |lc_key |latin1 |latin1_swedish_ci| | |||
wiki_greg |log_search |ls_value |latin1 |latin1_swedish_ci| | |||
wiki_greg |logging |log_timestamp |latin1 |latin1_swedish_ci| | |||
wiki_greg |math |math_inputhash |latin1 |latin1_swedish_ci| | |||
wiki_greg |math |math_outputhash |latin1 |latin1_swedish_ci| | |||
wiki_greg |math |math_html |latin1 |latin1_swedish_ci| | |||
wiki_greg |math |math_mathml |latin1 |latin1_swedish_ci| | |||
wiki_greg |objectcache |keyname |latin1 |latin1_bin | | |||
wiki_greg |objectcache |modtoken |latin1 |latin1_swedish_ci| | |||
wiki_greg |oldimage |oi_timestamp |latin1 |latin1_bin | | |||
wiki_greg |oldimage |oi_media_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |oldimage |oi_major_mime |latin1 |latin1_swedish_ci| | |||
wiki_greg |online |username |latin1 |latin1_swedish_ci| | |||
wiki_greg |online |timestamp |latin1 |latin1_swedish_ci| | |||
wiki_greg |online |wikiid |latin1 |latin1_swedish_ci| | |||
wiki_greg |page |page_touched |latin1 |latin1_bin | | |||
wiki_greg |page_restrictions|pr_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |page_restrictions|pr_level |latin1 |latin1_swedish_ci| | |||
wiki_greg |page_restrictions|pr_expiry |latin1 |latin1_bin | | |||
wiki_greg |querycache |qc_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |querycache_info |qci_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |querycache_info |qci_timestamp |latin1 |latin1_swedish_ci| | |||
wiki_greg |querycachetwo |qcc_type |latin1 |latin1_swedish_ci| | |||
wiki_greg |recentchanges |rc_log_type |latin1 |latin1_bin | | |||
wiki_greg |recentchanges |rc_log_action |latin1 |latin1_bin | | |||
wiki_greg |redirect |rd_interwiki |latin1 |latin1_swedish_ci| | |||
wiki_greg |revision |rev_timestamp |latin1 |latin1_bin | | |||
wiki_greg |searchindex |si_title |latin1 |latin1_swedish_ci| | |||
wiki_greg |searchindex |si_text |latin1 |latin1_swedish_ci| | |||
wiki_greg |sites |site_domain |latin1 |latin1_swedish_ci| | |||
wiki_greg |smw_ft_search |o_text |utf8 |utf8_general_ci | | |||
wiki_greg |text |old_title |latin1 |latin1_bin | | |||
wiki_greg |text |old_text |latin1 |latin1_swedish_ci| | |||
wiki_greg |text |old_user_text |latin1 |latin1_bin | | |||
wiki_greg |text |old_timestamp |latin1 |latin1_bin | | |||
wiki_greg |text |inverse_timestamp |latin1 |latin1_bin | | |||
wiki_greg |trackbacks |tb_title |latin1 |latin1_swedish_ci| | |||
wiki_greg |trackbacks |tb_url |latin1 |latin1_swedish_ci| | |||
wiki_greg |trackbacks |tb_ex |latin1 |latin1_swedish_ci| | |||
wiki_greg |trackbacks |tb_name |latin1 |latin1_swedish_ci| | |||
wiki_greg |updatelog |ul_key |latin1 |latin1_swedish_ci| | |||
wiki_greg |user |user_email |latin1 |latin1_swedish_ci| | |||
wiki_greg |user |user_token |latin1 |latin1_bin | | |||
wiki_greg |user |user_email_authenticated|latin1 |latin1_bin | | |||
wiki_greg |user |user_email_token |latin1 |latin1_bin | | |||
wiki_greg |user |user_email_token_expires|latin1 |latin1_bin | | |||
wiki_greg |user |user_registration |latin1 |latin1_bin | | |||
wiki_greg |user |user_newpass_time |latin1 |latin1_bin | | |||
wiki_greg |user_newtalk |user_ip |latin1 |latin1_swedish_ci| | |||
wiki_greg |user_openid |uoi_openid |latin1 |latin1_swedish_ci| | |||
wiki_greg |validate |val_comment |latin1 |latin1_swedish_ci| | |||
wiki_greg |validate |val_ip |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |page_name |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |user_name |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |hit_timestamp |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |hit_year |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |hit_month |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |hit_day |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |hit_hour |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |page_action |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |diff |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |referer_url |latin1 |latin1_swedish_ci| | |||
wiki_greg |wiretap |referer_title |latin1 |latin1_swedish_ci| | |||
</pre> | </pre> | ||