When “log sequence number X is in the future” is a non-issue

While seeding a bunch of new slaves for an infrastructure (and after I had successfully seeded and started replication on several, and I can verify that the database is consistent), one of my fellow DBAs checks the mysql error log and says “INNODB CORRUPTION!!”.

All of my servers were spewing these errors.

2018-05-09T10:29:10.420943-05:00 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html for information about forcing recovery.

2018-05-09T10:29:10.421717-05:00 0 [ERROR] InnoDB: Page [page id: space=XXXX, page number=XXXXXXXX] log sequence number XXXXXXXXXXX is in the future! Current system log sequence number XXXXXXXXX.

But — No.  I do not have InnoDB corruption in this case.  I discovered a brand new gripe with MySQL & InnoDB.  This is not a ‘bug’ but more of an ‘unfortunate feature’ and I invite others to tell me that I am wrong about this.  How to reproduce:

  • Configure MySQL so that redo logs are in a separate partition from the datadir
  • Stop mysql, normally, cleanly
  • Back up and transfer and restore the datadir onto a new slave database system by whatever binary method you prefer. (in my case it was to create a zfs snapshot on the datadir partition and start MySQL back up and send that snapshot to a new slave).
  • The new slave is configured identically.  It:
    • has a fresh MySQL install.
    • has been started up, previously.
    • has been shut down, cleanly.
    • has redo logs in a separate partition (corresponding to the fresh installation).

If you are a MySQL DBA you now know what I did.  I have a clean/consistent MySQL data directory from one server paired with redo logs from a brand new server.  The brand new server’s MySQL was shut down cleanly, too, though, so nothing was replayed.  Because the innodb_log_file_size is the same on both servers, MySQL started up.  But, it also decided to adopt the system LSN from the innodb log file.  Thus we have a nasty issue of finding LSNs in the future in the tablespace for every transaction.  We’re now spewing fountains of MySQL error logs.

In this case – this is not a “real issue”.  There is no corruption.  MySQL/ InnoDB is just spewing errors.

To keep this from occurring, all I have to do (before starting the seeded slave the first time) is delete those ib_logfile*s on the destination redo log partition and let MySQL create them on start.  You only get one chance to do it right.  When MySQL creates the redo logs fresh, they contain the proper system LSN, obviously copied back out from ibdata1.  If you try nuking the log files and letting them recreate again AFTER the first start, it’s too late.  The system LSN is now in ibdata1 and a “too low” value will still be used.

There is a viable fix without doing “crazy stuff”.  I can shut down the seed source machine and grab the ib_logfiles, copy them to the new slave (after clean shutdown) and then start the new slave back up again.  Voila.  LSNs not in the future anymore.

My opinion is:

  • MySQL should not start up if the log LSN differs more than <threshold> from the one in ibdata1 – so that you can get an opportunity to delete redo logs and let them be recreated.  Why not? If you know your MySQL snapshot was created after a clean shutdown, it’s fine.
  • OR if the redo logs don’t contain any checkpoints that need replaying against the server, the system LSN from ibdata1 should be used (like it does when it starts without any redo logs).

Anyway I saw many posts on the web about this.  Your database is NOT always corrupt when you see this error message in your logs.

You don’t need to use some GDB hack with a perfectly quiet database.  You can use some of the hacks out there for writing enough data to advance the LSN if you want though.

BUT one solution that I couldn’t find anywhere is this:

If you are seeding machines and your seed source is still viable – just shut down MySQL on that seed source and grab its redo logs.  Shut down the destination cleanly as well and put the ib_logfiles from the source into place on the destination and restart.  Voila, the errors disappear.

You do have the choice of just “living with” the errors… and knowing that they are complete BS.  But that leaves the unfortunate discovery by someone who doesn’t know what was done and the panic starts all over again.

End of gripe.

Three client error messages that MySQL DBAs should know by heart to save time

Working in an operations environment means that you get a lot of questions.  There’s the inevitable troubleshooting tasks that go along with being a DBA.  This is designed to be a quick reference post, much like my more in depth post in 2016 about timeouts. These typical error messages can create confusion and unneeded activities to diagnose.  To aid my own process of elimination and those of others, here are three error messages every DBA should know by heart.

SQLSTATE[HY000] [2002] Connection timed out

Plain and simple: this error means the client cannot connect to the server.

  • The calling program is trying to connect to the wrong database server (one that it cannot reach).
  • The database server is completely down (you’d get another error if the server and client actually made a connection).
  • Firewall rules on either local servers or on a network resource is preventing the connection.
  • Routes are misconfigured.
  • The calling server has so many outbound connections that the MySQL client cannot create an outbound socket (think LSOF / netstat (lots of TIME_WAITed connections) ulimits / tcp_tw_reuse & recycle, etc).
  • Anything else that might be preventing an actual TCP connection outbound to the server port (typically 3306).

SQLSTATE[HY000] [2006] MySQL server has gone away

This error is typically related to server timeouts.

  • an idle connection is hitting wait_timeout.  wait_timeout concerns only idle sessions.  sessions with running queries are not affected by this timeout.
  • a running SELECT query with streaming cursor has not moved the cursor in [net_write_timeout] seconds
  • a running bulk INSERT or UPDATE transaction has not written anything to the server within [net_read_timeout] seconds
  • in rare cases interactive_timeout is being hit by an ad-hoc CLI client or other interactive program that respects the CLIENT_INTERACTIVE option to mysql_real_connect().  This is rare.  Most of the time interactive_timeout = wait_timeout.

Keep in mind that if a DBA has mindfully set timeouts on the server that adjustments should be made to SESSION timeout variables within the calling program.  In other words, don’t change server timeout settings just for one outlier or for some exceptional situation.

Also: This error can be “normal” in situations where the application opens a connection pool to the database and many connections in the pool are idle, hitting the wait timeout.  In other words, in environments like this, SQLSTATE[HY000] [2006] MySQL server has gone away can be a “non issue”.  Work with developers to understand where in the code the error appears and collaborate on the correct solution, which may be to ensure the application’s MySQL connector or library is configured to reconnect.

SQLSTATE[HY000] [2013] Lost connection to MySQL server during query

This error is exactly what it says.

  • A mysql query was interrupted in the middle of the query, meaning it was
    • killed by a backup utility on the server
    • killed by pt-kill on the server,
    • killed manually by a DBA on the server
    • or some other situation by which the client thread lost the connection to the server (network issue between client and server).
  • This error normally does not concern timeouts or settings on the server side.


Two version upgrade fun with MySQL 5.5 to 5.7

In a perfect world, one would upgrade databases one version at a time and not let them get too old.  But our databases are where the “crown jewels” are.  They must stay up 24×7.  When performance is acceptable, it’s acceptable, and sometimes old versions stay around too long.  We don’t live in a perfect world.  This idea applies to so many things.  There’s almost never a perfect data model.  There is always some type of resource constraint be it storage, memory, CPU, IOPS, or just plain dollars.

I will bring this concept of not living in a perfect world into a discussion about upgrades.

Ideally there would be…

  • …time to do two upgrades.  One upgrade to 5.6, the other to 5.7.   This is the way sane, normal people upgrade.
  • …a lot of extra hardware.  It sure would be nice to maybe combine a maintenance like this with a hardware refresh so that we could just have an entirely new MySQL 5.7 database shard, slap a MySQL 5.6 box in the middle as a relay, and once things are all caught up just do a VIP cutover to new machines.

The extra hardware idea I have actually worked with in the past.  A client of a company I worked for just had a standalone 5.5 machine with no replicants.  They were able to allocate two more servers, one to be a 5.6 relay and one new 5.7 machine that would serve as a final destination.  Added bonus… At the end the 5.6 server would be upgraded and left in place for a redundant machine.

In the absence of extra time and extra money / hardware resources, sometimes a little creativity is required, which brings me to an example of the most interesting 5.5 to 5.7 upgrade process I have been a part of.  However, it’s not “that creative”, but it does involve being willing to accept single points of failure for short periods of time.

The beginning setup is a MySQL 5.5 master-master pair in the primary site (machines 1&2 in 1st diagram below).  However when I say “master-master” pair, machine 1 is getting all the traffic.  Machine 3 is a secondary site master and Machine 4 is the secondary site slave.  And a side note: I am intentionally keeping this high level, so there are more steps involved than described here.

We upgrade machine 4 to be a 5.6 server and allow its slave to catch up.  We use a method of binary backup and restore to backup and restore machine 4 to machines 2 and 3.  Machines 2 and 3 become MySQL 5.7 servers.  After this is done the replication chain looks like this.

Then, we cutover production traffic to machine two, then use a binary backup/restore method to restore a copy of machine 3 to machines 1 and 4.  In the end we look just like we started, except now we’re all upgraded to MySQL 5.7.

What are some other interesting upgrade paths?  This is a fun topic that I enjoy hearing about in the community.  Cheers.

southern fried: quick script to fix table auto increments

OK, Y’all.  If you have been a DBA long enough you have run into situations where you’ve had to import DDL from another database instance and forgotten to remove AUTO_INCREMENT on a table definition.  Sometimes this means your auto_increment is way up in the millions and you have 10 rows in the table.  People worried about ID depletion on an integer column will do a “facepalm”.

Making matters worse, having perhaps done this on more than one table, sometimes it’s a real pain to detect and fix the issue of auto_increments being much higher than you want them to be across a big number of tables.  Sometimes the pain is so great that you might want to simply repeat your data importation.

This quick and dirty “southern fried” script will help you report and rectify that.  It detects your auto_increment columns and checks them by querying “ORDER BY <auto_increment_column> desc limit 1”  I coded it in bash to be the “least objectionable”, so maybe some readers will actually use it.

So far the script is “lightly tested” so use at your own risk.