Recently I have been involved in an effort to convert MySQL databases from a utf8 character set to utf8mb4. As a part of this effort, my team evaluated which collations would be best for facilitating a broad range of multi-lingual support.
There have been many recent posts in the MySQL community about better unicode collation support in MySQL 8 such as from the MySQL Server Team’s blog at Oracle, who have also done a good job of showing us how newer collations based on UTF8 9.0.0 will properly group and sort characters according to their case and inheritance. As the title of the latter post suggests, the “devil is” indeed “in the details”.
There is also the matter of the “sushi-beer” problem, which shows that utf8mb4_unicode_520_ci will treat beer and sushi emoji as equal.
Rather than focusing on the particular deficiencies and enhancements in collations, this post focuses on practical solutions to converting existing data sets. Every DBA, at some point, faces the daunting task of character set conversion or differing collation usage. However, before we go on, a note.
Green field projects
When you are considering a new MySQL database to support a new project, common wisdom derived from articles linked above and from the many articles one may find on Stack Overflow and other resources suggests that if you have multi-lingual and emoji storage requirements: Just use utf8mb4 character set and utf8mb4_unicode_520_ci collation in MySQL. That is, until MySQL 8.0 goes GA. If you really know what you are doing, or if you already know a lot about this subject, your choices may vary. The salient point here is that using an appropriate unicode collation (rather than defaults) will save the DBA from several future headaches, particularly regarding unique keys.
Existing data sets
Most DBAs work in shops with existing data sets with growing requirements. Many DBAs will have already worked on converting latin1 databases to utf8. Most MySQL installations that use utf8 will have utf8_general_ci as the collation. For utf8mb4, the current default is utf8mb4_general_ci.
As illustrated in many documents and talks you’ll find that the general_ci collations in MySQL are sub-optimal when it comes to sorting and enforcing uniqueness. They are not so bad with latin characters. “APPLE = ApPlE = apple”, but when you’re talking about multi-lingual support, the general collations are generally bad at case insensitivity and allow variations such as the “apple” example in other languages to be distinguished as different.
To help illustrate a practical approach, I will provide an illustration of potential issues that one may encounter, and an example of how to potentially find and fix all unique key issues in a data set.
Creating an Example Data Set
To facilitate this example, I created a sample table in a MySQL 5.7 database:
CREATE TABLE `utf8_test` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`singlechar` varchar(4) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unk1` (`singlechar`)
) ENGINE=InnoDB AUTO_INCREMENT=189331 DEFAULT CHARSET=utf8mb4
I chose varchar(4) because a lot of people use varchars, and I chose length 4, because some “emoji” and other 4 byte strings generated on the console give a MySQL error 1300 “Invalid utf8 character string: ‘\the\Code'”. If you have a column with 4 bytes, then MySQL lets you insert it anyway, with a warning, rather than an error.
Then, I populated this with fake data, spanning all sorts of characters using a fairly “ugly” but very “quick” solution.
jscott@jscott-57-1:~$ cat /dev/urandom > myfile
jscott@jscott-57-1:~$ for x in `cat myfile | tr -d "[:cntrl:]" | grep -a -o .`; do mysql -v js -e "set collation_connection=utf8mb4_general_ci; insert into utf8_test (singlechar) values ('$x');"; done
Concatenating /dev/urandom to a file for a few seconds yielded about 35MB of “garbage”, containing many conceivable utf8 characters. We loop through the file output with bash. On Linux, tr does a bad job of actually breaking up the characters because it does not yet support multi-byte characters, but it does an excellent job of removing “control characters“, hence “-d [:cntrl:]”. grep is doing our heavy lifting for breaking up the strings (-a option to treat binary as text and the -o with a period as the argument to break up the string character by character). Finally, we feed those characters into single insert statements to our table. As you might imagine, this is quite nasty in terms of error messages, but it gets the job done. In a few minutes time, we have a table with an integer id, and a couple thousand “supposedly unique” rows.
Unique key violations
For most DBAs, finding problems with collations starts with your first unique key error. Let’s try converting our newly created table to use the utf8mb4_unicode_520_ci collation (default is utf8mb4_general_ci):
mysql> alter table utf8_test convert to character set utf8mb4 collate utf8mb4_unicode_520_ci;
ERROR 1062 (23000): Duplicate entry '9' for key 'unk1'
This tells us we have a violation. But what then? Inevitably, you solve one, then you find another, then another, then another.
Using concat() to find problems
You may knock MySQL for having inadequate collation support, but it has outstanding support for querying data in various ways. For one thing, “collate” can be supplied to various parts of a query. I like to use concat for these cases for reasons which will become clear later, but here’s a query that works well to find all the collisions in our sample table:
select count(1) as cnt, group_concat(id), concat(singlechar collate utf8mb4_unicode_ci) as unk from utf8_test group by unk having cnt > 1;
Notice that inside the concat() we are adding collate utf8mb4_unicode_520_ci. This seeks to emulate what MySQL is doing when trying to alter table (supplying a new table collation), but giving us all the information ahead of time.
| cnt | group_concat(id) | unk |
| 2 | 642,8804 | ΄ |
| 2 | 1242,20448 | ΅ |
| 2 | 194,11764 | ; |
| 2 | 16145,29152 | · |
| 2 | 114105,33 | ︵ |
| 2 | 63,186608 | } |
| 2 | 4963,44554 | ʹ |
| 2 | 84,87616 | > |
| 4 | 120845,292,2759,38412 | ୦ |
| 4 | 2,21162,25295,47504 | 1 |
| 5 | 46179,81143,231,7766,36158 | ² |
| 4 | 66,2339,19777,26796 | 3 |
| 5 | 102802,158554,150,16224,21282 | ፬ |
| 3 | 35,14643,19433 | 5 |
| 3 | 107,377,9234 | 6 |
| 4 | 12,585,12643,28853 | ٧ |
| 3 | 60,12070,25619 | 8 |
| 3 | 17,70,27677 | ٩ |
| 3 | 32,4370,12498 | A |
Looking at one of these by itself:
mysql> select * from utf8_test where id in (102802,158554,150,16224,21282);
| id | singlechar |
| 150 | 4 |
| 16224 | ٤ |
| 21282 | ۴ |
| 102802 | ፬ |
| 158554 | ៤ |
5 rows in set (0.00 sec)
All of the above characters are resolving to the one with ID 102802.
In my current working environment, we decided to continue to use utf8mb4_general_ci collation, because we were unable to determine (in all cases) whether the duplicates in our data set were simply “character-case” related or whether we would actually experience data loss.
In the future, we expect to be able to “trust” the new collations in mysql 8 to be correct.
We went the extra mile to create a utility based on the following query. The utility finds possible unique key violations using queries like the one I used above.
information_schema is your friend:
case when group_concat(c.DATA_TYPE) like '%char%' then 1 else 0 end as contains_char
on tc.CONSTRAINT_SCHEMA = kc.CONSTRAINT_SCHEMA
and tc.CONSTRAINT_NAME = kc.CONSTRAINT_NAME
and tc.TABLE_NAME = kc.TABLE_NAME
on kc.CONSTRAINT_SCHEMA = c.TABLE_SCHEMA
and kc.TABLE_NAME = c.TABLE_NAME
and kc.COLUMN_NAME = c.COLUMN_NAME
tc.CONSTRAINT_SCHEMA = 'your_schema'
and tc.CONSTRAINT_TYPE = 'UNIQUE'
Using the query above and iterating through its results you can build similar SQL statements to the one we used in our test scenario above to discover duplicates, supplying the “collate” in “concat” functions to the char columns. concat() is a great fit, because it allows you to do the same query (
select count(1) as cnt, group_concat(id), concat(col1,'-',col2 collate <test collation>,'-'......) as unk from utf8_test group by unk having cnt > 1;
) for a unique constraint having varying numbers of columns. An additional query to information schema in the loop is required to find which column has the char type.
You can then use the group concatenated IDs in the results to choose “winners and losers”, find dependent rows and update them, delete the losing rows, etc.