Author: jscott

Advice on pt-stalk for connection-saturated servers

Given an environment where a high-volume web application is prone to opening many connections to backend resources, a utility like pt-stalk is important.  When performance- or availability-affecting events like InnoDB lock waits or connection saturation occur, pt-stalk captures the diagnostic information you need to troubleshoot what was happening.

The tendency may be to create multiple pt-stalks for various conditions.  This can be a poor decision when your server is dealing with both lock contention and high connection counts.  When pt-stalk triggers, it opens multiple simultaneous connections to MySQL to collect the full processlist, lock waits, InnoDB transactions, slave status, and other attributes.  pt-stalk can sleep for a number of seconds after triggering, but once that time expires the trigger may fire again, compounding the issue.  Put simply, pt-stalk can absorb the last few remaining connections on your database, particularly if you use extra_port (and run pt-stalk against the extra_port) with a relatively low extra_max_connections.

Advice:  Stick to using only one of the built-in trigger functions (like “processlist”) if triggering on a large processlist is enough for you.  If you need more than one check, write your own trg_plugin() function encompassing the tests that are relevant to your environment.

Unfortunately I cannot share the one I just wrote at this time (I will need to write a more generic one to share later).  It checks processlist length, replication lag, innodb_trx wait time, and innodb_lock_waits, folding four of our more relevant checks into one pt-stalk and avoiding the “connection stack-up” that occurred when MySQL was having an issue and multiple stalks were firing.
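In the meantime, here is a minimal generic sketch of the idea, not my actual plugin.  pt-stalk’s --function option accepts a file that defines a bash function named trg_plugin; pt-stalk sources the file, calls the function, and compares the number it prints against --threshold.  $EXT_ARGV carries the MySQL connection options pt-stalk was started with.  The two checks and their limits below are hypothetical examples; tune them for your environment.

```shell
#!/bin/bash
# Hypothetical multi-check pt-stalk plugin (a sketch, not the author's plugin).
# Save to a file and run something like:
#   pt-stalk --function /path/to/this-file --threshold 0 -- --user=... --password=...
# pt-stalk sources this file and calls trg_plugin; the number printed is
# compared against --threshold to decide whether to collect diagnostics.

trg_plugin() {
    # Example limits -- tune for your environment.
    local max_threads=100    # processlist length limit
    local max_trx_wait=10    # seconds an InnoDB trx may wait on a lock

    local threads trx_wait
    # Processlist length.
    threads=$(mysql $EXT_ARGV -N -e \
        "SELECT COUNT(*) FROM information_schema.processlist" 2>/dev/null)
    # Longest current InnoDB lock wait, in seconds (0 if none).
    trx_wait=$(mysql $EXT_ARGV -N -e \
        "SELECT COALESCE(MAX(TIMESTAMPDIFF(SECOND, trx_wait_started, NOW())), 0)
           FROM information_schema.innodb_trx" 2>/dev/null)

    # Print 1 (fires, since 1 > --threshold 0) if any check tripped, else 0.
    if [ "${threads:-0}" -gt "$max_threads" ] || [ "${trx_wait:-0}" -gt "$max_trx_wait" ]; then
        echo 1
    else
        echo 0
    fi
}
```

The point is that one stalk runs a couple of lightweight queries per cycle instead of several independent stalks each opening their own set of connections against an already-saturated server.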

Easy fix: Error 1048 (23000) and explicit_defaults_for_timestamp

This may be a “duh” post for some, but I had to post this because I didn’t find the answer in typical places like Stack Overflow when I had the issue.  I recently worked on a project to expand database capacity by deploying new MySQL installations with memory, config, and disk-space tweaks, via backup/restore, replication topology change, and failover.  I did not notice that the old servers had “explicit_defaults_for_timestamp=OFF”.  After restoring a binary backup and starting the replication thread on the new systems, I got this error in the replication thread (the column name in the error corresponds to the examples further down).

ERROR 1048 (23000): Column 'ts' cannot be null

Below, I will provide a synopsis to show statements that caused the error and why a simple global variable change fixed the issue.  First, a sample table definition.

CREATE TABLE `time_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `comment` varchar(32) NOT NULL,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1

Notice that the timestamp column has a default of CURRENT_TIMESTAMP.  Next, a statement that works fine.

mysql> insert into time_test (comment) values ('this is statement 1');

The “id” and “ts” columns are omitted so that MySQL generates both the ID and the timestamp value.

Next – a statement that does not work, similar to the one I encountered in my replication thread.

mysql> insert into time_test (comment,ts) values ('this is statement 2',NULL);
ERROR 1048 (23000): Column 'ts' cannot be null

With MySQL and many other databases, even though we have “DEFAULT CURRENT_TIMESTAMP” on our timestamp column, explicitly supplying NULL in the INSERT statement is non-standard syntax.  Standard syntax would omit the column from the column list and VALUES (as in the first statement).  We get the error because of the default value of explicit_defaults_for_timestamp.  From the MySQL manual: “This system variable determines whether the server enables certain nonstandard behaviors for default values and NULL-value handling in TIMESTAMP columns. By default, explicit_defaults_for_timestamp is enabled, which disables the nonstandard behaviors.”

I was working with an application using deprecated SQL syntax.  In fact, this syntax may be legal on some modern databases, just not by default on MySQL.  If a reader knows of databases where it is legal, please feel free to comment.
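As an aside, if changing the application were an option, the statement could be made standard by supplying the DEFAULT keyword instead of NULL.  A sketch against the sample table above:

```sql
-- Accepted regardless of the explicit_defaults_for_timestamp setting:
INSERT INTO time_test (comment, ts) VALUES ('this is statement 3', DEFAULT);
```

In a replication stream you cannot rewrite the statements, which is why the server-side fix below was the right one for my case.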

The fix for this is as simple as it gets, which is probably why there are not more posts about this.

mysql> set global explicit_defaults_for_timestamp='OFF';
Query OK, 0 rows affected (0.00 sec)

Because this is a global variable, existing sessions keep the old value; one must log out of the CLI and reconnect to pick up the change.  After reconnecting:

mysql> insert into time_test (comment,ts) values ('this is statement 2',NULL);
Query OK, 1 row affected (0.01 sec)

mysql> select * from time_test;
+----+---------------------+---------------------+
| id | comment             | ts                  |
+----+---------------------+---------------------+
|  1 | this is statement 1 | 2018-08-14 02:51:21 |
|  2 | this is statement 2 | 2018-08-14 02:52:20 |
+----+---------------------+---------------------+
2 rows in set (0.00 sec)

In my case, as soon as the global variable was set, I was able to start my replication thread back up.  Thanks for reading my reference post.
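One caveat: SET GLOBAL does not survive a mysqld restart.  To make the change permanent, also set it in the server configuration file (the path varies by distribution):

```ini
# /etc/my.cnf (or a file under /etc/mysql/conf.d/) -- persists across restarts
[mysqld]
explicit_defaults_for_timestamp = OFF
```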

Query Macroeconomics

I studied some macroeconomics in school.  I’m still interested in it 20 years later.  I was recently in a discussion about query optimization and how to prioritize what to fix first.  My pen and paper started graphing things, and here we are with an abstract thought.  Bear with me.  This is mostly for entertainment purposes, but it may actually add a small amount of value to your thought process around optimizing queries.  It is a riff on the supply and demand graphs of macroeconomics.

In the graph below:

  • Axes:
    • Vertical: number of distinct problem queries
    • Horizontal: Database “query load capacity” gains (from optimization)
  • Lines:
    • LIRQ (long and/or infrequently run queries)
    • SFRQ (short, frequently run queries)
    • AC: Absolute capacity (the point at which you’re going as fast as the I/O platform you run on will let you, and your capacity bottlenecks have less to do with queries and more to do with not enough IOPS)
  • Point:
    • E_O: Equilibrium of optimization

On LIRQ: Simply put, on a typical OLTP workload, you may have several long and infrequently running queries on the database that are “problems” for overall system performance.  If you optimize those queries, your performance gain in load capacity is sometimes fairly small.

On SFRQ: Conversely, optimizing short-running but very frequently run “problem queries” can sometimes create very large gains in query load capacity.  Example: a covering index that takes a query run many thousands of times a minute from 10 milliseconds down to under 1 millisecond, by ensuring the data is in the buffer pool, can give you some serious horsepower back.

On AC: Optimizing often-run queries returns little benefit once you are already near your platform’s I/O limits.  You can only go as fast as your platform will let you, so if your database is so well optimized that you really can’t read or write to disk any faster, you have hit the wall, and further optimization effort will produce little ROI unless you make the platform faster (which moves the red line to the right).

On E_O:  Often-run, long (or somewhat long) queries are the low-hanging fruit.  They should stand out naturally as an apparent pain in the processlist or in application response times, without even running pt-query-digest.

Speaking of pt-query-digest: digests of the slow query log (with long_query_time set to 0) are a good way to figure out which types of queries are taking up the lion’s share of your database load.  The ranking, total time, and percentiles shown in the digest tell you which queries are consuming your database’s valuable time.  I wish for you that you have SFRQ, so that your optimization effort may produce high rewards in capacity gain.
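A sketch of that capture-and-digest workflow (the log path and the five-minute window are examples; logging every query is write-heavy, so time-box it):

```sh
# Temporarily log all queries, then digest the result.
mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 0;"
sleep 300                                      # capture five minutes of traffic
mysql -e "SET GLOBAL long_query_time = 10;"    # back to a sane threshold
pt-query-digest /var/lib/mysql/slow.log > /tmp/digest.txt
```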

Thanks for bearing with me on my database capacity economics.

 

The witching hour: How illegal datetimes get stored into MySQL in US timezones

Most companies make a mistake at the very beginning.  They set their system time zone to Eastern, Central, Mountain, or Pacific time.  Then, they install their database on it.  They use all manner of timestamp and date types and insert terabytes of data.  After that, it becomes almost impossible (in terms of being worth the effort) to change months or years later, when the company has a popular product or service.  One of the unfortunate issues that may arise is illegal datetimes.

On Sunday, March 11, 2018, when US time zones switched to daylight saving time, there was no 2AM hour.  When we switch back to standard time on Sunday, November 4, the 1AM hour will occur twice.

Inevitably, across many database servers and many different types of application servers, some data manages to get inserted into MySQL with an illegal datetime during the 2AM hour on the “spring forward” date.  Each year, developers ask about the application errors arising from it.

Here is an illustration of how it happens.

Given the following table definition:

CREATE TABLE `ts_test` (
  `js_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `js_dt` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Let’s now try to insert some data.

mysql> insert into ts_test (js_ts,js_dt) values ('2018-03-11 02:12:00','2018-03-11 02:12:00');
ERROR 1292 (22007): Incorrect datetime value: '2018-03-11 02:12:00' for column 'js_ts' at row 1

We got an error (note that it names the js_ts column; both values fall in the skipped hour).  This is good, right?  MySQL is protecting us from inserting a bad value.  So, if I’m running… let’s see here, what am I running.  MySQL 5.7.21, so there should be no issue, right?  Maybe this was yesterday’s issue on a different version of MySQL?

Not so fast.

What about your ORM?  Is your ORM giving any specific instructions to the database for inserts and updates to datetimes?  If so, maybe… something like THIS is happening.

mysql> set session time_zone='-5:00';
Query OK, 0 rows affected (0.00 sec)
mysql> insert into ts_test (js_ts,js_dt) values ('2018-03-11 02:12:00','2018-03-11 02:12:00');
Query OK, 1 row affected (0.00 sec)

mysql> select * from ts_test;
+---------------------+---------------------+
| js_ts               | js_dt               |
+---------------------+---------------------+
| 2018-03-11 02:12:00 | 2018-03-11 02:12:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)

Hm.  So, if we set the session time zone to UTC -5, the insert succeeds.  That must mean MySQL has made this “right”.  If I query again after disconnecting from this session and starting a fresh one in daylight saving time, what will happen?  Both timestamps will advance and show 3AM… right?

mysql> select * from ts_test;
+---------------------+---------------------+
| js_ts               | js_dt               |
+---------------------+---------------------+
| 2018-03-11 03:12:00 | 2018-03-11 02:12:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)

Wrong.

The timestamp column adjusted.  The datetime one did not.  I now have a situation my application is not going to like: the applications above this database will error on the js_dt value because it is an illegal datetime.

What if I open up a virtual machine, create the same table, set the system clock back to just before the time change, establish a MySQL session to the database server, leave that connection open, and continue inserting rows as the time changes?

I ran:

~# date -s "11 Mar 2018 01:59:30";
> insert into ts_test values (now(),now()); # repeatedly

… and as we went over the time change, MySQL did the right thing with the now() function and the session as far as I could tell.

| 2018-03-11 01:59:55 | 2018-03-11 01:59:55 |
| 2018-03-11 01:59:57 | 2018-03-11 01:59:57 |
| 2018-03-11 01:59:58 | 2018-03-11 01:59:58 |
| 2018-03-11 01:59:59 | 2018-03-11 01:59:59 |
| 2018-03-11 03:00:00 | 2018-03-11 03:00:00 |
| 2018-03-11 03:00:01 | 2018-03-11 03:00:01 |
| 2018-03-11 03:00:02 | 2018-03-11 03:00:02 |
| 2018-03-11 03:00:03 | 2018-03-11 03:00:03 |
| 2018-03-11 03:00:04 | 2018-03-11 03:00:04 |

I did not have applications at hand to try a PHP or Java session inserting data second by second as we passed over the time change.

Most of the time we see this in the real world, the datetime datapoints are only a few seconds past 2AM.  It is my belief that ORMs and software in general can set a specific offset from UTC in the database session, then fail to change it because an operation takes a few seconds and completes after the time change.

Let’s recap.

Q: How do bad datetime values get into the database?

  • Because application ORMs and other session originators can explicitly set the session offset from UTC.
  • Because a session connected prior to the time change may still have the previous time zone setting.

Q: Why is this a problem for datetimes and *not* timestamps?

  • The underlying storage for timestamps is UTC, and times must be greater than the start of the epoch (1970-01-01 00:00:00, unixtime 0).  The database converts timestamps to UTC on write and back on read, relative to the connected session’s time zone.
  • Datetimes are not converted back and forth.  What you put in is what you get back out.  MySQL does try to validate datetimes against the current time zone (see the earlier error), but overriding the session time_zone with an explicit UTC offset bypasses the comparison to the system time zone, allowing “illegal” values to be inserted.

Q: What do I do if I have illegal datetimes in my database?

  • Illegal datetimes can only occur during one hour per year, so it is easy to query for and report on them.  They should be updated forward by one hour.
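A sketch of that cleanup against the sample ts_test table above (adjust the table, column, and dates to your schema; the skipped hour for US time zones in 2018 was 02:00:00 through 02:59:59 on March 11):

```sql
-- Find rows stored in the hour that never existed:
SELECT *
  FROM ts_test
 WHERE js_dt >= '2018-03-11 02:00:00'
   AND js_dt <  '2018-03-11 03:00:00';

-- Shift them forward into the hour that did exist:
UPDATE ts_test
   SET js_dt = js_dt + INTERVAL 1 HOUR
 WHERE js_dt >= '2018-03-11 02:00:00'
   AND js_dt <  '2018-03-11 03:00:00';
```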