Two version upgrade fun with MySQL 5.5 to 5.7

In a perfect world, one would upgrade databases one version at a time and not let them get too old.  But our databases are where the “crown jewels” are.  They must stay up 24×7.  When performance is acceptable, it’s acceptable, and sometimes old versions stay around too long.  We don’t live in a perfect world.  This idea applies to so many things.  There’s almost never a perfect data model.  There is always some type of resource constraint, be it storage, memory, CPU, IOPS, or just plain dollars.

I will bring this concept of not living in a perfect world into a discussion about upgrades.

Ideally there would be…

  • …time to do two upgrades.  One upgrade to 5.6, the other to 5.7.   This is the way sane, normal people upgrade.
  • …a lot of extra hardware.  It sure would be nice to combine a maintenance like this with a hardware refresh: build an entirely new MySQL 5.7 database shard, slap a MySQL 5.6 box in the middle as a relay (replication is only supported between adjacent major versions, so 5.5 cannot feed 5.7 directly), and once everything is caught up, do a VIP cutover to the new machines.

I have actually used the extra-hardware approach in the past.  A client of a company I worked for had a standalone 5.5 machine with no replicas.  They were able to allocate two more servers: one to be a 5.6 relay and one new 5.7 machine to serve as the final destination.  Added bonus: at the end, the 5.6 server was upgraded and left in place as a redundant machine.

In the absence of extra time and extra money or hardware, sometimes a little creativity is required, which brings me to the most interesting 5.5-to-5.7 upgrade process I have been a part of.  It’s not “that creative”, but it does involve being willing to accept single points of failure for short periods of time.

The beginning setup is a MySQL 5.5 master-master pair in the primary site (machines 1 and 2).  However, when I say “master-master” pair, machine 1 is getting all the traffic.  Machine 3 is the secondary-site master, and machine 4 is the secondary-site slave.  And a side note: I am intentionally keeping this high level, so there are more steps involved than described here.

We upgrade machine 4 to 5.6 and allow it to catch up as a slave.  We then take a binary backup of machine 4 and restore it onto machines 2 and 3, which are rebuilt as MySQL 5.7 servers.  After this is done, the replication chain runs from the 5.5 production master (machine 1) through the 5.6 relay (machine 4) to the two new 5.7 servers (machines 2 and 3).
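
For the curious, here is a rough sketch of one of those binary backup/restore hops, assuming Percona XtraBackup and classic binlog-coordinate replication.  Hostnames, paths, and the replication user are placeholders, and the real procedure has more steps than this:

# On machine 4 (the freshly upgraded 5.6 slave, running with log-bin and
# log_slave_updates so it can act as a relay): take and prepare a backup.
xtrabackup --backup --target-dir=/backups/m4
xtrabackup --prepare --target-dir=/backups/m4

# Copy the prepared backup into the empty datadir on machine 2 (and 3),
# start MySQL 5.7 on top of it, and run mysql_upgrade.
rsync -a /backups/m4/ machine2:/var/lib/mysql/
ssh machine2 'chown -R mysql:mysql /var/lib/mysql && service mysql start && mysql_upgrade'

# Point the restored 5.7 server at the 5.6 relay using the coordinates
# recorded in the backup's xtrabackup_binlog_info file.
mysql -h machine2 -e "CHANGE MASTER TO
  MASTER_HOST='machine4', MASTER_USER='repl', MASTER_PASSWORD='...',
  MASTER_LOG_FILE='<file from xtrabackup_binlog_info>',
  MASTER_LOG_POS=<position from xtrabackup_binlog_info>;
  START SLAVE;"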

Then we cut over production traffic to machine 2 and use the same binary backup/restore method to restore a copy of machine 3 onto machines 1 and 4, upgrading them to 5.7.  In the end the topology looks just like it did at the start, except every machine is now running MySQL 5.7.

What are some other interesting upgrade paths?  This is a fun topic that I enjoy hearing about in the community.  Cheers.

southern fried: quick script to fix table auto increments

OK, Y’all.  If you have been a DBA long enough, you have run into situations where you’ve imported DDL from another database instance and forgotten to remove the AUTO_INCREMENT value from a table definition.  Sometimes this means your auto_increment counter is way up in the millions while the table holds 10 rows.  Anyone worried about ID depletion on an integer column will do a “facepalm”.

Making matters worse, having perhaps done this on more than one table, it can be a real pain to detect and fix auto_increment values that are much higher than you want them to be across a large number of tables.  Sometimes the pain is so great that you might be tempted to simply repeat the data import.

This quick and dirty “southern fried” script will help you report and rectify that.  It detects your auto_increment columns and checks them by querying “ORDER BY <auto_increment_column> desc limit 1”.  I coded it in bash to be the “least objectionable” option, so maybe some readers will actually use it.

So far the script is “lightly tested” so use at your own risk.
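
For readers who want to roll their own, here is a minimal sketch of the same idea (not the actual script; the schema argument and output format are placeholders): find the auto_increment columns via information_schema, grab the current top value with the “ORDER BY … DESC LIMIT 1” trick, and print a corrective ALTER TABLE.

#!/bin/bash
# Sketch only: report tables whose AUTO_INCREMENT counter is far ahead of the
# data and print ALTER TABLE statements to pull it back down.
# Assumes credentials come from ~/.my.cnf or the environment.
SCHEMA="${1:?usage: $0 <schema>}"

mysql -N -B -e "SELECT TABLE_NAME, COLUMN_NAME
                FROM information_schema.COLUMNS
                WHERE TABLE_SCHEMA='${SCHEMA}'
                  AND EXTRA LIKE '%auto_increment%'" |
while read -r TBL COL; do
    # Highest value actually stored in the auto_increment column
    MAX_ID=$(mysql -N -B -e "SELECT \`${COL}\` FROM \`${SCHEMA}\`.\`${TBL}\`
                             ORDER BY \`${COL}\` DESC LIMIT 1")
    MAX_ID=${MAX_ID:-0}
    # Value the table will hand out next
    NEXT_ID=$(mysql -N -B -e "SELECT AUTO_INCREMENT FROM information_schema.TABLES
                              WHERE TABLE_SCHEMA='${SCHEMA}' AND TABLE_NAME='${TBL}'")
    if [ "${NEXT_ID:-0}" -gt $((MAX_ID + 1)) ]; then
        echo "-- ${TBL}.${COL}: next auto_increment=${NEXT_ID}, current max=${MAX_ID}"
        echo "ALTER TABLE \`${SCHEMA}\`.\`${TBL}\` AUTO_INCREMENT=$((MAX_ID + 1));"
    fi
done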

Enjoy!

Open Source DBA Ethos

I’m in Santa Clara at the Percona Live Conference and Expo.  I have been here less than 24 hours, but I have already had conversations with many fellow DBAs, some of whom I have met and worked with before, and others whom I am meeting for the first time.  It has been three years since my last visit, and being here reminds you that as an Open Source DBA you’re part of a sort of “big family” of like-minded, top-notch professionals.

I am proud to be an open source DBA because of our apparent values, values you see in almost everyone when you come to an event like this.

Who are we as Open Source DBAs?

  • We are system experts.  Open source DBAs know their operating system.  They are not only focused on the database.
  • We are developers.  We contribute to source code within our enterprises, and in many cases, in the community.  We create meaningful tools for our companies, and help solve problems inherent in “bleeding edge” technologies, making such technologies approachable in the enterprise.  We strive to bring continuous integration and seamless migration into places that are very difficult.
  • We move fast.  Open source DBAs, even in large enterprises, are always “moving the needle” towards higher performance, making changes as fast as our tools, our applications, and our organizations allow.
  • We strive for answers.  Open source DBAs are never satisfied by answers to (even old) questions in our technologies.  “One source of truth” is almost never the case.
  • We are collaborators.  We work together across team lines within our companies, and across companies in the community, to broaden our understanding.  We work across technologies to find the “best solution for the task”.
  • We are facilitators of change, not barriers to change.

The point of this article is that there is no such thing as “walls” around an open source DBA. I’m proud to be counted among the growing army of creative talent in the field, and proud to be counted among those who will always enjoy becoming a better, more educated, and more helpful resource.

A practical explanation: problems during unicode collation conversion

Introduction

Recently I have been involved in an effort to convert MySQL databases from a utf8 character set to utf8mb4. As a part of this effort, my team evaluated which collations would be best for facilitating a broad range of multi-lingual support.

There have been many recent posts in the MySQL community about better unicode collation support in MySQL 8, such as those from the MySQL Server Team’s blog at Oracle, which has done a good job of showing how the newer collations based on Unicode 9.0.0 properly group and sort characters according to their case and inheritance.  As the title of the latter post suggests, the “devil is” indeed “in the details”.

There is also the matter of the “sushi-beer” problem, which shows that utf8mb4_unicode_520_ci will treat beer and sushi emoji as equal.

Rather than focusing on the particular deficiencies and enhancements of individual collations, this post focuses on practical approaches to converting existing data sets.  Every DBA, at some point, faces the daunting task of converting character sets or changing collations.  However, before we go on, a note.

Green field projects

When you are considering a new MySQL database to support a new project, common wisdom derived from the articles linked above, and from the many articles one may find on Stack Overflow and other resources, suggests that if you have multi-lingual and emoji storage requirements you should just use the utf8mb4 character set and the utf8mb4_unicode_520_ci collation, at least until MySQL 8.0 goes GA.  If you really know what you are doing, or if you already know a lot about this subject, your choices may vary.  The salient point here is that using an appropriate unicode collation (rather than the defaults) will save the DBA several future headaches, particularly regarding unique keys.
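
For a brand-new schema, that advice boils down to a couple of lines at creation time (the schema and table names here are just examples):

CREATE DATABASE app
  DEFAULT CHARACTER SET utf8mb4
  DEFAULT COLLATE utf8mb4_unicode_520_ci;

-- Tables and columns created without an explicit character set or collation
-- inherit the database defaults:
CREATE TABLE app.users (
  id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;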

Existing data sets

Most DBAs work in shops with existing data sets and growing requirements.  Many DBAs will have already worked on converting latin1 databases to utf8.  Most MySQL installations that use utf8 use the utf8_general_ci collation; for utf8mb4, the current default is utf8mb4_general_ci.
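
Before converting anything, it is worth checking where an existing schema actually stands; the information_schema views make that easy (substitute your schema name):

SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_schema';

SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_schema'
  AND CHARACTER_SET_NAME IS NOT NULL;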

As illustrated in many documents and talks, the general_ci collations in MySQL are sub-optimal when it comes to sorting and enforcing uniqueness.  They are not so bad with Latin characters (“APPLE = ApPlE = apple”), but when you’re talking about multi-lingual support, the general collations handle case insensitivity poorly and allow variations of the “apple” example in other languages to be treated as distinct.
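
A quick demonstration of that difference, based on the behavior the MySQL manual documents for the general_ci versus unicode_ci collations (run from a utf8mb4 connection):

SET NAMES utf8mb4;

-- Both collation families are case-insensitive for plain Latin letters:
SELECT 'APPLE' = 'ApPlE' COLLATE utf8mb4_general_ci;   -- 1
SELECT 'APPLE' = 'ApPlE' COLLATE utf8mb4_unicode_ci;   -- 1

-- But they disagree outside basic Latin, e.g. the German sharp s:
SELECT 'ß' = 'ss' COLLATE utf8mb4_general_ci;          -- 0 (general_ci treats ß like a single 's')
SELECT 'ß' = 'ss' COLLATE utf8mb4_unicode_ci;          -- 1 (unicode_ci expands ß to 'ss')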

To keep this practical, I will walk through the kinds of issues one may encounter and show an example of how to find and fix all of the unique key problems in a data set.

Creating an Example Data Set

To facilitate this example, I created a sample table in a MySQL 5.7 database:

CREATE TABLE `utf8_test` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `singlechar` varchar(4) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unk1` (`singlechar`)
) ENGINE=InnoDB AUTO_INCREMENT=189331 DEFAULT CHARSET=utf8mb4

I chose varchar(4) because a lot of people use varchars, and I chose length 4 because some “emoji” and other 4-byte strings generated on the console give MySQL error 1300, “Invalid utf8 character string: ‘\the\Code'”.  If you have a column that can hold 4 bytes, MySQL lets you insert the value anyway, with a warning rather than an error.

Then, I populated this with fake data, spanning all sorts of characters using a fairly “ugly” but very “quick” solution.

jscott@jscott-57-1:~$ cat /dev/urandom > myfile
jscott@jscott-57-1:~$ for x in `cat myfile | tr -d "[:cntrl:]" | grep -a -o .`; do mysql -v js -e "set collation_connection=utf8mb4_general_ci; insert into utf8_test (singlechar) values ('$x');"; done

Concatenating /dev/urandom into a file for a few seconds yielded about 35MB of “garbage” containing many conceivable utf8 characters.  We loop through the file output with bash.  On Linux, tr does a bad job of actually breaking up the characters because it does not yet support multi-byte characters, but it does an excellent job of removing “control characters”, hence -d "[:cntrl:]".  grep does the heavy lifting for breaking up the strings (the -a option treats binary as text, and -o with a period as the pattern emits the matching input one character at a time).  Finally, we feed those characters into single insert statements against our table.  As you might imagine, this is quite nasty in terms of error messages, but it gets the job done.  In a few minutes’ time, we have a table with an integer id and a couple thousand “supposedly unique” rows.

Unique key violations

For most DBAs, finding problems with collations starts with your first unique key error.  Let’s try converting our newly created table to use the utf8mb4_unicode_520_ci collation (default is utf8mb4_general_ci):

mysql> alter table utf8_test convert to character set utf8mb4 collate utf8mb4_unicode_520_ci;
ERROR 1062 (23000): Duplicate entry '9' for key 'unk1'

This tells us we have a violation.  But what then?  Inevitably, you solve one, then you find another, then another, then another.

Using concat() to find problems

You may knock MySQL for having inadequate collation support, but it has outstanding support for querying data in various ways.  For one thing, “collate” can be supplied to various parts of a query.  I like to use concat for these cases for reasons which will become clear later, but here’s a query that works well to find all the collisions in our sample table:

select count(1) as cnt, group_concat(id), concat(singlechar collate utf8mb4_unicode_520_ci) as unk from utf8_test group by unk having cnt > 1;

Notice that inside the concat() we are adding collate utf8mb4_unicode_520_ci.  This emulates what MySQL does when altering the table (supplying a new table collation), but gives us all the information ahead of time.

Sample output:

+-----+-------------------------------+------+
| cnt | group_concat(id)              | unk  |
+-----+-------------------------------+------+
|   2 | 642,8804                      | ΄    |
|   2 | 1242,20448                    | ΅    |
|   2 | 194,11764                     | ;    |
|   2 | 16145,29152                   | ·    |
|   2 | 114105,33                     | ︵   |
|   2 | 63,186608                     | }    |
|   2 | 4963,44554                    | ʹ    |
|   2 | 84,87616                      | >    |
|   4 | 120845,292,2759,38412         | ୦    |
|   4 | 2,21162,25295,47504           | 1    |
|   5 | 46179,81143,231,7766,36158    | ²    |
|   4 | 66,2339,19777,26796           | 3    |
|   5 | 102802,158554,150,16224,21282 | ፬    |
|   3 | 35,14643,19433                | 5    |
|   3 | 107,377,9234                  | 6    |
|   4 | 12,585,12643,28853            | ٧    |
|   3 | 60,12070,25619                | 8    |
|   3 | 17,70,27677                   | ٩    |
|   3 | 32,4370,12498                 | A    |

Looking at one of these by itself:

mysql> select * from utf8_test where id in (102802,158554,150,16224,21282);
+--------+------------+
| id     | singlechar |
+--------+------------+
|    150 | 4          |
|  16224 | ٤          |
|  21282 | ۴          |
| 102802 | ፬          |
| 158554 | ៤          |
+--------+------------+
5 rows in set (0.00 sec)

Under utf8mb4_unicode_520_ci, all of the above characters compare as equal (the group is displayed here by the value from ID 102802), which is why they collide on the unique key.

Possible actions:

In my current working environment, we decided to continue to use utf8mb4_general_ci collation, because we were unable to determine (in all cases) whether the duplicates in our data set were simply “character-case” related or whether we would actually experience data loss.

In the future, we expect to be able to “trust” the new collations in MySQL 8 to be correct.

We went the extra mile to create a utility based on the following query.  The utility finds possible unique key violations using queries like the one I used above.

information_schema is your friend:
select
	tc.TABLE_NAME,
	tc.CONSTRAINT_NAME,
	group_concat(kc.COLUMN_NAME),
	case when group_concat(c.DATA_TYPE) like '%char%' then 1 else 0 end as contains_char
FROM
	information_schema.TABLE_CONSTRAINTS tc
INNER JOIN
	information_schema.KEY_COLUMN_USAGE kc
		on tc.CONSTRAINT_SCHEMA = kc.CONSTRAINT_SCHEMA
		and tc.CONSTRAINT_NAME = kc.CONSTRAINT_NAME
		and tc.TABLE_NAME = kc.TABLE_NAME
INNER JOIN
	information_schema.COLUMNS c
		on kc.CONSTRAINT_SCHEMA = c.TABLE_SCHEMA
		and kc.TABLE_NAME = c.TABLE_NAME
		and kc.COLUMN_NAME = c.COLUMN_NAME
WHERE
	tc.CONSTRAINT_SCHEMA = 'your_schema'
	and tc.CONSTRAINT_TYPE = 'UNIQUE'
GROUP BY
	tc.TABLE_NAME,tc.CONSTRAINT_NAME
HAVING
	contains_char=1;

Using the query above and iterating through its results, you can build SQL statements similar to the one we used in our test scenario above to discover duplicates, supplying the “collate” inside concat() for the char columns.  concat() is a great fit because it allows you to run the same query (

select count(1) as cnt, group_concat(id), concat(col1,'-',col2 collate <test collation>,'-'......) as unk from utf8_test group by unk having cnt > 1;

) for a unique constraint with any number of columns.  An additional query against information_schema inside the loop is required to find which columns have a char type.
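
That per-constraint lookup is simple (shown here for the example table and its unk1 constraint; in practice the schema, table, and constraint names come from the outer loop):

SELECT c.COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE kc
JOIN information_schema.COLUMNS c
  ON  kc.TABLE_SCHEMA = c.TABLE_SCHEMA
  AND kc.TABLE_NAME   = c.TABLE_NAME
  AND kc.COLUMN_NAME  = c.COLUMN_NAME
WHERE kc.CONSTRAINT_SCHEMA = 'your_schema'
  AND kc.TABLE_NAME        = 'utf8_test'
  AND kc.CONSTRAINT_NAME   = 'unk1'
  AND c.DATA_TYPE IN ('char', 'varchar');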

You can then use the group-concatenated IDs in the results to choose “winners and losers”, find dependent rows and update them, delete the losing rows, and so on.
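
As a purely hypothetical illustration of that cleanup (the dependent table and column names here are invented), resolving the “4” collision group from earlier while keeping id 102802 might look like this:

-- Re-point any dependent rows at the surviving id (child_table/parent_id are hypothetical):
UPDATE child_table
   SET parent_id = 102802
 WHERE parent_id IN (158554, 150, 16224, 21282);

-- Remove the losing rows:
DELETE FROM utf8_test
 WHERE id IN (158554, 150, 16224, 21282);

-- Once every collision group is resolved, the conversion from earlier can proceed:
ALTER TABLE utf8_test CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;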