A practical explanation: problems during unicode collation conversion

Introduction

Recently I have been involved in an effort to convert MySQL databases from the utf8 character set to utf8mb4. As part of this effort, my team evaluated which collations would best facilitate broad multi-lingual support.

There have been many recent posts in the MySQL community about better Unicode collation support in MySQL 8, such as those from the MySQL Server Team’s blog at Oracle, which do a good job of showing how the newer collations, based on Unicode 9.0.0, properly group and sort characters according to their case and inheritance. As the title of one of those posts suggests, the “devil is” indeed “in the details”.

There is also the matter of the “sushi-beer” problem, which shows that utf8mb4_unicode_520_ci treats the beer and sushi emoji as equal.
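If you want to see this for yourself, a check along these lines should reproduce it (a sketch, assuming a 5.7 server and a utf8mb4 client connection so the emoji survive the round trip):

set names utf8mb4;
-- returns 1 (equal); with collate utf8mb4_bin, the same comparison returns 0
select '🍣' = '🍺' collate utf8mb4_unicode_520_ci as sushi_equals_beer;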

Rather than dwelling on the particular deficiencies and enhancements of individual collations, this post focuses on practical solutions for converting existing data sets.  Every DBA, at some point, faces the daunting task of converting a character set or changing collations.  However, before we go on, a note.

Green field projects

When you are planning a new MySQL database for a new project, the common wisdom derived from the articles linked above, and from the many articles you will find on Stack Overflow and elsewhere, is that if you have multi-lingual and emoji storage requirements, you should just use the utf8mb4 character set and the utf8mb4_unicode_520_ci collation, at least until MySQL 8.0 goes GA.  If you really know what you are doing, or if you already know a lot about this subject, your choices may vary.  The salient point here is that choosing an appropriate Unicode collation up front (rather than accepting the defaults) will save the DBA several future headaches, particularly regarding unique keys.
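In concrete terms, that advice amounts to something like the following sketch (the myapp schema and users table are hypothetical placeholders):

-- set the character set and collation at the database level so new tables inherit them
create database myapp
  default character set utf8mb4
  default collate utf8mb4_unicode_520_ci;

-- columns inherit the table collation, which in turn inherits from the database
create table myapp.users (
  id bigint not null auto_increment,
  name varchar(64) not null,
  primary key (id),
  unique key unk_name (name)
) engine=InnoDB;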

Existing data sets

Most DBAs work in shops with existing data sets and growing requirements.  Many DBAs will have already worked on converting latin1 databases to utf8.  Most MySQL installations that use utf8 will have utf8_general_ci as the collation.  For utf8mb4, the current default is utf8mb4_general_ci.

As illustrated in many documents and talks, the general_ci collations in MySQL are sub-optimal when it comes to sorting and enforcing uniqueness.  They are not so bad with Latin characters (“APPLE = ApPlE = apple”), but for broader multi-lingual support the general collations handle case insensitivity poorly, allowing variations of the “apple” example in other languages to be treated as distinct values.

To demonstrate a practical approach, I will walk through the potential issues one may encounter, along with an example of how to find and fix all unique key issues in a data set.

Creating an Example Data Set

To facilitate this example, I created a sample table in a MySQL 5.7 database:

CREATE TABLE `utf8_test` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `singlechar` varchar(4) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unk1` (`singlechar`)
) ENGINE=InnoDB AUTO_INCREMENT=189331 DEFAULT CHARSET=utf8mb4

I chose varchar(4) because a lot of people use varchars, and I chose length 4 because some emoji and other 4-byte strings generated on the console give MySQL error 1300, “Invalid utf8 character string: ‘\the\Code'”.  With a column of length 4, MySQL lets you insert the value anyway, with a warning rather than an error.

Then, I populated this with fake data, spanning all sorts of characters using a fairly “ugly” but very “quick” solution.

jscott@jscott-57-1:~$ cat /dev/urandom > myfile
jscott@jscott-57-1:~$ for x in `cat myfile | tr -d "[:cntrl:]" | grep -a -o .`; do mysql -v js -e "set collation_connection=utf8mb4_general_ci; insert into utf8_test (singlechar) values ('$x');"; done

Concatenating /dev/urandom to a file for a few seconds yielded about 35MB of “garbage” containing many conceivable utf8 characters.  We then loop through the file’s output with bash.  On Linux, tr does a bad job of actually breaking up the characters, because it does not yet support multi-byte characters, but it does an excellent job of removing control characters, hence “-d [:cntrl:]”.  grep does the heavy lifting of breaking up the strings (the -a option treats binary as text, and -o with a period as its argument breaks the string up character by character).  Finally, we feed those characters into individual insert statements against our table.  As you might imagine, this is quite nasty in terms of error messages, but it gets the job done.  In a few minutes’ time, we have a table with an integer id and a couple thousand “supposedly unique” rows.
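Before moving on, it is worth confirming that the loaded data really does span multi-byte characters. In MySQL, length() counts bytes while char_length() counts characters, so a quick sanity check might look like this:

-- distribution of byte lengths across our single-character values;
-- any row with bytes > 1 holds a multi-byte utf8 character
select length(singlechar) as bytes, count(*) as cnt
from utf8_test
group by bytes
order by bytes;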

Unique key violations

For most DBAs, finding problems with collations starts with your first unique key error.  Let’s try converting our newly created table to use the utf8mb4_unicode_520_ci collation (default is utf8mb4_general_ci):

mysql> alter table utf8_test convert to character set utf8mb4 collate utf8mb4_unicode_520_ci;
ERROR 1062 (23000): Duplicate entry '9' for key 'unk1'

This tells us we have a violation.  But what then?  Inevitably, you solve one, then you find another, then another, then another.

Using concat() to find problems

You may knock MySQL for having inadequate collation support, but it has outstanding support for querying data in various ways.  For one thing, “collate” can be supplied in various parts of a query.  I like to use concat() in these cases, for reasons that will become clear later; here’s a query that works well to find all the collisions in our sample table:

select count(1) as cnt, group_concat(id), concat(singlechar collate utf8mb4_unicode_520_ci) as unk from utf8_test group by unk having cnt > 1;

Notice that inside the concat() we are adding collate utf8mb4_unicode_520_ci. This emulates what MySQL does during the alter table (applying the new table collation), but gives us all of the collision information ahead of time.

Sample output:

+-----+-------------------------------+------+
| cnt | group_concat(id)              | unk  |
+-----+-------------------------------+------+
|   2 | 642,8804                      | ΄    |
|   2 | 1242,20448                    | ΅    |
|   2 | 194,11764                     | ;    |
|   2 | 16145,29152                   | ·    |
|   2 | 114105,33                     | ︵   |
|   2 | 63,186608                     | }    |
|   2 | 4963,44554                    | ʹ    |
|   2 | 84,87616                      | >    |
|   4 | 120845,292,2759,38412         | ୦    |
|   4 | 2,21162,25295,47504           | 1    |
|   5 | 46179,81143,231,7766,36158    | ²    |
|   4 | 66,2339,19777,26796           | 3    |
|   5 | 102802,158554,150,16224,21282 | ፬    |
|   3 | 35,14643,19433                | 5    |
|   3 | 107,377,9234                  | 6    |
|   4 | 12,585,12643,28853            | ٧    |
|   3 | 60,12070,25619                | 8    |
|   3 | 17,70,27677                   | ٩    |
|   3 | 32,4370,12498                 | A    |
+-----+-------------------------------+------+

Looking at one of these by itself:

mysql> select * from utf8_test where id in (102802,158554,150,16224,21282);
+--------+------------+
| id     | singlechar |
+--------+------------+
|    150 | 4          |
|  16224 | ٤          |
|  21282 | ۴          |
| 102802 | ፬          |
| 158554 | ៤          |
+--------+------------+
5 rows in set (0.00 sec)

All of the above characters (the digit four as written in several different scripts) collate as equal under utf8mb4_unicode_520_ci, so only one of them can remain in a unique column.
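You can confirm the collision directly with a comparison (again assuming set names utf8mb4 so the characters survive the client round trip):

-- the Arabic-Indic digit four vs. the ASCII digit four
select '٤' = '4' collate utf8mb4_general_ci as general_ci_equal,
       '٤' = '4' collate utf8mb4_unicode_520_ci as unicode_520_equal;
-- general_ci_equal = 0 (distinct, which is why both rows could coexist),
-- unicode_520_equal = 1 (equal, which is why the conversion fails)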

Possible actions

In my current working environment, we decided to continue using the utf8mb4_general_ci collation, because we were unable to determine in all cases whether the duplicates in our data set were simply character-case related or whether we would actually experience data loss.

In the future, we expect to be able to “trust” the new collations in MySQL 8 to be correct.

We went the extra mile and created a utility that finds possible unique key violations using queries like the one above.  It is driven by the following query.

information_schema is your friend:
select
	tc.TABLE_NAME,
	tc.CONSTRAINT_NAME,
	group_concat(kc.COLUMN_NAME),
	case when group_concat(c.DATA_TYPE) like '%char%' then 1 else 0 end as contains_char
FROM
	information_schema.TABLE_CONSTRAINTS tc
INNER JOIN
	information_schema.KEY_COLUMN_USAGE kc
		on tc.CONSTRAINT_SCHEMA = kc.CONSTRAINT_SCHEMA
		and tc.CONSTRAINT_NAME = kc.CONSTRAINT_NAME
		and tc.TABLE_NAME = kc.TABLE_NAME
INNER JOIN
	information_schema.COLUMNS c
		on kc.CONSTRAINT_SCHEMA = c.TABLE_SCHEMA
		and kc.TABLE_NAME = c.TABLE_NAME
		and kc.COLUMN_NAME = c.COLUMN_NAME
WHERE
	tc.CONSTRAINT_SCHEMA = 'your_schema'
	and tc.CONSTRAINT_TYPE = 'UNIQUE'
GROUP BY
	tc.TABLE_NAME,tc.CONSTRAINT_NAME
HAVING
	contains_char=1;

Using the query above and iterating through its results, you can build SQL statements similar to the one we used in our test scenario above to discover duplicates, supplying the “collate” inside the “concat” functions for the char columns.  concat() is a great fit, because it allows you to use the same query shape (

select count(1) as cnt, group_concat(id), concat(col1,'-',col2 collate <test collation>,'-'......) as unk from utf8_test group by unk having cnt > 1;

) for a unique constraint having any number of columns.  An additional query to information_schema in the loop is required to find which of the columns have a char type.
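That per-constraint lookup can be as simple as the following, with your_schema and your_table as placeholders, as above:

select COLUMN_NAME
from information_schema.COLUMNS
where TABLE_SCHEMA = 'your_schema'
  and TABLE_NAME = 'your_table'
  and DATA_TYPE like '%char%';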

You can then use the group-concatenated IDs in the results to choose “winners and losers”, find dependent rows and update them, delete the losing rows, and so on.
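As a sketch of that last step, using the digit-four collision group from earlier (child_table and its utf8_test_id column are hypothetical stand-ins for whatever references this table):

-- keep id 150 as the "winner" and repoint any dependent rows at it
update child_table
set utf8_test_id = 150
where utf8_test_id in (16224, 21282, 102802, 158554);

-- then remove the "losing" rows so the collation conversion can proceed
delete from utf8_test
where id in (16224, 21282, 102802, 158554);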

5 thoughts on “A practical explanation: problems during unicode collation conversion”

  1. Hi John, thanks for mentioning MySQL 8.0 🙂

    A couple of things to mention about 5.7 though:

    – You could also consider using a binary collation like utf8mb4_bin. This is not quite the same as a case sensitive collation, but may work for the simple case.

    – Here is a similar test case to yours – but showing utf8mb4_bin in use. As you note, the COLLATE syntax really can be used anywhere:

    CREATE TABLE t1 (id INT NOT NULL PRIMARY KEY auto_increment, b char(255));
    INSERT INTO t1 values (NULL, 'e'), (NULL, 'é');

    # Using the default ci collation there are duplicate values
    mysql> SELECT b, count(*) as c FROM t1 GROUP BY b HAVING c > 1;
    +------+---+
    | b    | c |
    +------+---+
    | e    | 2 |
    +------+---+
    1 row in set (0.00 sec)

    # collating as bin there are no duplicates, but control characters and order will differ from a true “case sensitive collation”
    mysql> SELECT ANY_VALUE(b), count(*) as c FROM t1 GROUP BY b COLLATE utf8mb4_bin HAVING c > 1;
    Empty set (0.00 sec)

    – In VARCHAR(4) the 4 refers to characters. So with a 4 byte character set the storage size could be up to 16 bytes. Things would be much simpler if this were bytes, but I believe this just comes from the SQL standard.

    – Morgan


    1. Morgan, we actually do use utf8_bin in a couple of areas. It is much more permissive. The Unicode collations do what international customers want most: true case insensitivity and good sorting.

      The blog post is primarily interested in changing to Unicode collations for the benefits that Asian clients may want most out of a UI deeply rooted in database sorting.

      On your note about varchar(4) being four characters, not four bytes, I completely understand that. When creating this example, I queried my table and noticed that when I had char(1) or varchar(1), I was not getting any values with length(singlechar) > 3. I wanted to be sure and get some 4 byte characters.

      My primary use case for that was trying to reproduce the sushi-beer issue. Copy and paste the sushi emoji (from say, a web page) into a MySQL insert statement with a varchar(1) or char(1) and you get:

      mysql> insert into utf8_test1 (singlechar) values ('🍣');
      ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x8D\xA3' for column 'singlechar' at row 1

      Increase the size to varchar(2): same error.
      Increase the size to varchar(3): same error.

      Once you have varchar(4) (and presumably char(4)), you are allowed to insert the value, with a warning bearing the same text as the error above.

      I glossed over this in the post and didn’t provide sufficient detail to support my use of the 4 character length.

      I’m sure there’s a way to provide a proper string value for sushi that eliminates this issue, but I was providing a “quick example”.

      I appreciate the feedback, sir, and take care.


  2. Hi, I think you need to take a look at your client side. The 5.7 client will never set utf8mb4 as the character set, only utf8 or latin1 depending on the environment (locale). To get Unicode values outside the BMP (above 0xFFFF) over to the server, you will need to use utf8mb4:

    mysql> create table t1 (s varchar(1)) default charset utf8mb4;
    Query OK, 0 rows affected (0.01 sec)

    mysql> insert into t1 values ('🍣');
    ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x8D\xA3' for column 's' at row 1
    mysql> show variables like 'character_set_client';
    +----------------------+-------+
    | Variable_name        | Value |
    +----------------------+-------+
    | character_set_client | utf8  |
    +----------------------+-------+
    1 row in set (0.01 sec)

    mysql> set names utf8mb4;
    Query OK, 0 rows affected (0.00 sec)

    mysql> insert into t1 values ('🍣');
    Query OK, 1 row affected (0.00 sec)

    mysql> show variables like 'character_set_client';
    +----------------------+---------+
    | Variable_name        | Value   |
    +----------------------+---------+
    | character_set_client | utf8mb4 |
    +----------------------+---------+
    1 row in set (0.00 sec)

    mysql> select * from t1;
    +------+
    | s    |
    +------+
    | 🍣   |
    +------+
    1 row in set (0.00 sec)


    1. Thank you (and thank you for taking the time)! Yes, that was my issue: I used collation_connection but did not set character_set_client. Doing so would have made for a better example, with a char(1) column and plenty of 4-byte characters in my data set.


