A practical explanation: problems during unicode collation conversion

Introduction

Recently I have been involved in an effort to convert MySQL databases from a utf8 character set to utf8mb4. As a part of this effort, my team evaluated which collations would be best for facilitating a broad range of multi-lingual support.

There have been many recent posts in the MySQL community about better Unicode collation support in MySQL 8, such as those from the MySQL Server Team’s blog at Oracle, which do a good job of showing how the newer collations based on Unicode 9.0.0 will properly group and sort characters according to their case and inheritance. As the title of the latter post suggests, the “devil is” indeed “in the details”.

There is also the matter of the “sushi-beer” problem, which shows that utf8mb4_unicode_520_ci will treat beer and sushi emoji as equal.
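
If you would like to see this for yourself, you can compare the two emoji directly with an explicit COLLATE clause. Here is a quick sketch using hex literals for the sushi and beer code points (so it does not depend on your terminal’s encoding); per the issue described above, this comparison is reported to return 1 (equal) under utf8mb4_unicode_520_ci:

-- U+1F363 SUSHI vs. U+1F37A BEER MUG compared under utf8mb4_unicode_520_ci
SELECT _utf8mb4 X'F09F8DA3' = _utf8mb4 X'F09F8DBA' COLLATE utf8mb4_unicode_520_ci AS sushi_equals_beer;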

Rather than focusing on the particular deficiencies and enhancements of individual collations, this post offers practical solutions for converting existing data sets.  Every DBA, at some point, faces the daunting task of converting character sets or reconciling differing collations.  However, before we go on, a note.

Green field projects

When you are planning a new MySQL database for a new project, the common wisdom derived from the articles linked above, and from the many articles you will find on Stack Overflow and elsewhere, is that if you have multi-lingual and emoji storage requirements you should just use the utf8mb4 character set and the utf8mb4_unicode_520_ci collation in MySQL.  That is, until MySQL 8.0 goes GA.  If you really know what you are doing, or if you already know a lot about this subject, your choices may vary.  The salient point here is that using an appropriate Unicode collation (rather than the default) will save the DBA several future headaches, particularly regarding unique keys.
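
For the record, here is a minimal sketch of that “green field” setup (the database, table, and column names are placeholders):

CREATE DATABASE myapp
  DEFAULT CHARACTER SET utf8mb4
  DEFAULT COLLATE utf8mb4_unicode_520_ci;

CREATE TABLE myapp.users (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(64) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unk_name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;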

Existing data sets

Most DBAs work in shops with existing data sets and growing requirements.  Many DBAs will already have worked on converting latin1 databases to utf8.  Most MySQL installations that use utf8 use utf8_general_ci as the collation; for utf8mb4, the current default is utf8mb4_general_ci.

As illustrated in many documents and talks, the general_ci collations in MySQL are sub-optimal when it comes to sorting and enforcing uniqueness.  They are not so bad with Latin characters (“APPLE = ApPlE = apple”), but for multi-lingual data the general collations handle case insensitivity poorly and allow variations equivalent to the “apple” example in other languages to be treated as distinct.
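
A quick way to see the difference is to compare strings under each collation with an explicit COLLATE clause. A classic example is the German sharp s; the sketch below assumes the connection character set is utf8mb4 (e.g. SET NAMES utf8mb4):

-- general_ci treats ß like a single 's' (so ß = 'ss' is 0, not equal),
-- while the UCA-based unicode_520_ci expands ß to 'ss' (1, equal)
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_ci_equal, 'ß' = 'ss' COLLATE utf8mb4_unicode_520_ci AS unicode_520_ci_equal;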

To demonstrate a practical approach, I will walk through the potential issues one may encounter, along with an example of how to find and fix all unique key issues in a data set.

Creating an Example Data Set

To facilitate this example, I created a sample table in a MySQL 5.7 database:

CREATE TABLE `utf8_test` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `singlechar` varchar(4) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unk1` (`singlechar`)
) ENGINE=InnoDB AUTO_INCREMENT=189331 DEFAULT CHARSET=utf8mb4

I chose varchar because a lot of people use varchars, and I chose a length of 4 because some “emoji” and other 4-byte strings generated on the console give MySQL error 1300, “Invalid utf8 character string: ‘\the\Code’”.  If the column can hold 4 bytes, MySQL lets you insert such a value anyway, with a warning rather than an error.

Then I populated it with fake data spanning all sorts of characters, using a fairly “ugly” but very “quick” solution:

jscott@jscott-57-1:~$ cat /dev/urandom > myfile
jscott@jscott-57-1:~$ for x in `cat myfile | tr -d "[:cntrl:]" | grep -a -o .`; do mysql -v js -e "set collation_connection=utf8mb4_general_ci; insert into utf8_test (singlechar) values ('$x');"; done

Catting /dev/urandom into a file for a few seconds yielded about 35MB of “garbage” containing many conceivable utf8 characters.  We then loop through the file with bash.  On Linux, tr does a bad job of actually splitting up the characters because it does not yet support multi-byte characters, but it does an excellent job of removing “control characters”, hence “-d [:cntrl:]”.  grep does our heavy lifting for breaking up the strings (the -a option treats binary as text, and -o with a period as the pattern splits the output character by character).  Finally, we feed those characters into single-row insert statements against our table.  As you might imagine, this is quite nasty in terms of error messages, but it gets the job done.  In a few minutes’ time, we have a table with an integer id and a couple of thousand “supposedly unique” rows.

Unique key violations

For most DBAs, finding problems with collations starts with your first unique key error.  Let’s try converting our newly created table to use the utf8mb4_unicode_520_ci collation (default is utf8mb4_general_ci):

mysql> alter table utf8_test convert to character set utf8mb4 collate utf8mb4_unicode_520_ci;
ERROR 1062 (23000): Duplicate entry '9' for key 'unk1'

This tells us we have a violation.  But what then?  Inevitably, you solve one, then you find another, then another, then another.

Using concat() to find problems

You may knock MySQL for having inadequate collation support, but it has outstanding support for querying data in various ways.  For one thing, “collate” can be supplied to various parts of a query.  I like to use concat for these cases for reasons which will become clear later, but here’s a query that works well to find all the collisions in our sample table:

select count(1) as cnt, group_concat(id), concat(singlechar collate utf8mb4_unicode_520_ci) as unk from utf8_test group by unk having cnt > 1;

Notice that inside the concat() we add collate utf8mb4_unicode_520_ci. This emulates what MySQL does when altering the table (applying the new table collation), but gives us all of the information ahead of time.

Sample output:

+-----+-------------------------------+------+
| cnt | group_concat(id)              | unk  |
+-----+-------------------------------+------+
|   2 | 642,8804                      | ΄    |
|   2 | 1242,20448                    | ΅    |
|   2 | 194,11764                     | ;    |
|   2 | 16145,29152                   | ·    |
|   2 | 114105,33                     | ︵   |
|   2 | 63,186608                     | }    |
|   2 | 4963,44554                    | ʹ    |
|   2 | 84,87616                      | >    |
|   4 | 120845,292,2759,38412         | ୦    |
|   4 | 2,21162,25295,47504           | 1    |
|   5 | 46179,81143,231,7766,36158    | ²    |
|   4 | 66,2339,19777,26796           | 3    |
|   5 | 102802,158554,150,16224,21282 | ፬    |
|   3 | 35,14643,19433                | 5    |
|   3 | 107,377,9234                  | 6    |
|   4 | 12,585,12643,28853            | ٧    |
|   3 | 60,12070,25619                | 8    |
|   3 | 17,70,27677                   | ٩    |
|   3 | 32,4370,12498                 | A    |

Looking at one of these by itself:

mysql> select * from utf8_test where id in (102802,158554,150,16224,21282);
+--------+------------+
| id     | singlechar |
+--------+------------+
|    150 | 4          |
|  16224 | ٤          |
|  21282 | ۴          |
| 102802 | ፬          |
| 158554 | ៤          |
+--------+------------+
5 rows in set (0.00 sec)

All of the above characters are representations of the digit four in different scripts; under utf8mb4_unicode_520_ci they all compare as equal, resolving to the row with ID 102802.

Possible actions:

In my current working environment, we decided to continue using the utf8mb4_general_ci collation, because we were unable to determine (in all cases) whether the duplicates in our data set were simply character-case related or whether we would actually experience data loss.

In the future, we expect to be able to “trust” the new collations in MySQL 8 to be correct.

We also went the extra mile and created a utility based on the following query, which finds possible unique key violations using queries like the one I used above.

information_schema is your friend:
select
	tc.TABLE_NAME,
	tc.CONSTRAINT_NAME,
	group_concat(kc.COLUMN_NAME),
	case when group_concat(c.DATA_TYPE) like '%char%' then 1 else 0 end as contains_char
FROM
	information_schema.TABLE_CONSTRAINTS tc
INNER JOIN
	information_schema.KEY_COLUMN_USAGE kc
		on tc.CONSTRAINT_SCHEMA = kc.CONSTRAINT_SCHEMA
		and tc.CONSTRAINT_NAME = kc.CONSTRAINT_NAME
		and tc.TABLE_NAME = kc.TABLE_NAME
INNER JOIN
	information_schema.COLUMNS c
		on kc.CONSTRAINT_SCHEMA = c.TABLE_SCHEMA
		and kc.TABLE_NAME = c.TABLE_NAME
		and kc.COLUMN_NAME = c.COLUMN_NAME
WHERE
	tc.CONSTRAINT_SCHEMA = 'your_schema'
	and tc.CONSTRAINT_TYPE = 'UNIQUE'
GROUP BY
	tc.TABLE_NAME,tc.CONSTRAINT_NAME
HAVING
	contains_char=1;

Using the query above and iterating through its results, you can build SQL statements similar to the one we used in our test scenario to discover duplicates, supplying the “collate” inside the concat() for the char columns.  concat() is a great fit because it allows you to use the same query shape:

select count(1) as cnt, group_concat(id), concat(col1,'-',col2 collate <test collation>,'-'......) as unk from utf8_test group by unk having cnt > 1;

This works for a unique constraint with any number of columns.  An additional query against information_schema inside the loop is required to find which of the columns have a char type.
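
For example, for a hypothetical unique key spanning an integer column and a varchar column (the table and column names below are made up), the generated statement might look like this:

select count(1) as cnt, group_concat(id), concat(country_id,'-',name collate utf8mb4_unicode_520_ci) as unk from customer group by unk having cnt > 1;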

You can then use the group-concatenated IDs in the results to choose “winners and losers”, find dependent rows and update them, delete the losing rows, and so on.
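
As a purely hypothetical sketch based on the digit-four collision above (the child table and its foreign key column are made up; the right cleanup depends entirely on your schema and data), resolving one collision might look like this:

-- keep id 150 ('4') as the winner and re-point dependent rows at it (hypothetical child table)
UPDATE orders SET utf8_test_id = 150 WHERE utf8_test_id IN (16224,21282,102802,158554);
-- remove the losing rows so the collation conversion can proceed
DELETE FROM utf8_test WHERE id IN (16224,21282,102802,158554);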

Command line QPS (Queries Per Second) Quick and Dirty

Ever wanted to just look at QPS in real time while logged into your server?

Well here’s a little command line hackery to do it quick and dirty.

[user@yourserver ~] $ LASTVAL=0; while true; do CURVAL=`mysql --batch -N -e "show status like 'Quer%';" | awk '{print $2}'`; QPS=`expr $CURVAL - $LASTVAL`; if [ $LASTVAL -ne 0 ]; then echo "$CURVAL $QPS"; fi; LASTVAL=$CURVAL; sleep 1; done

The output looks like this:

65549603430 2439
65549605421 1991
65549606912 1491
65549611219 4307
65549614186 2967
65549618048 3862
65549620853 2805

The first column is the raw Queries counter value. The second column is the QPS.

This script requires a .my.cnf to exist in your home directory (or that you do something nastily insecure and supply -u user -ppassword to the mysql command in the example above).

Fetch CSV of MySQL table size vs .ibd container size

This only works if you’re using innodb_file_per_table.

Purpose: import this CSV quickly into Google Sheets (or another spreadsheet) and compare MySQL’s internal data size to the container size on disk, to determine which tables need to be optimized (or “null altered”) to reclaim disk space and possibly improve performance through defragmentation. As a rule of thumb, a difference of >=10% probably warrants action.

I wrote this loop as an ad-hoc one-liner on the command line a couple of weeks ago, but have since turned it into a configurable, yet still quick-and-dirty, shell script, below.

Add -u and -p arguments to MySQL CLI command if you need to, or just place a .my.cnf in your home directory and use the script as-is.

#!/bin/bash

DATADIR="/path/to/datadir" # ex: /var/lib/mysql
SCHEMANAME="yourschema"

# one line per table: "table.ibd,<internal size>" where internal size = data_length + index_length
for x in `mysql --batch -N -e "select concat(concat(table_name,'.ibd'),',',(data_length + index_length)) as total_length from information_schema.tables where table_schema = '${SCHEMANAME}';"`
do 
	FILE=`echo $x | cut -d , -f1`;
	FSIZE=`ls -la ${DATADIR}/${SCHEMANAME}/${FILE} | awk '{print $5}'`; 
	echo ${x},${FSIZE};
done

New site

Johnscott.net is reborn as a technology blog!

  • MySQL tips and tricks
  • MySQL deep dives
  • Linux/Bash command line hackery
  • Other items that come to mind
  • Cookin’ (maybe with tech, maybe with food)

Please come back soon!