February 2014 Archives

PostgreSQL + ActiveMQ = (auto)Vacuum Hell

This is one of those "I had this weird problem that others are likely to encounter as well, so here's a detailed explanation that will hopefully come up in a search engine for those people when they are trying to figure it out" posts.

As most PostgreSQL administrators know, vacuuming is a necessity for the long-term survival of your database.  This is because the PostgreSQL Multi-Version Concurrency Control strategy writes a new version of each changed row (and merely marks deleted rows as dead), but keeps the old tuples around for as long as some other existing transaction might need to see them.  Eventually these old rows are no longer visible to any transaction, and they can be cleaned up for re-use or discarded completely once an entire page is empty.

This works out really well, provided you have fairly straightforward database access going on: transactions start, stuff happens, transactions complete; repeat.  Things can go radically wrong if the "transactions complete" part doesn't happen, however.

Because PostgreSQL has no idea about what any given transaction might do next, it is not able to vacuum out any dead rows newer than the oldest transaction still in progress.  If you think about it, that makes sense.  Any given transaction could look at an older version of any row that was around at the time that transaction began, in any table in the database.
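
If it helps to see that rule spelled out, here is a toy model of it in Python.  This is only a sketch of the idea, not how PostgreSQL actually implements it: the DeadTuple class, the removable function, and all of the transaction IDs are made up for illustration, and real snapshots and transaction ID wraparound are ignored.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DeadTuple:
    table: str
    deleted_by_xid: int  # ID of the transaction that updated or deleted this row version


def removable(dead: Iterable[DeadTuple], open_xids: Iterable[int], next_xid: int) -> List[DeadTuple]:
    """Return the dead tuples a vacuum could reclaim in this toy model."""
    # The horizon is the oldest transaction still in progress.  With nothing
    # open, anything deleted before the next transaction ID can go.
    horizon = min(open_xids, default=next_xid)
    return [t for t in dead if t.deleted_by_xid < horizon]


# Hypothetical dead row versions, deleted by transactions 105 and 230.
dead = [DeadTuple("activemq_msgs", 105), DeadTuple("activemq_msgs", 230)]

# A transaction that began at xid 100 is still open, so nothing can be reclaimed,
# even if that transaction never touches activemq_msgs.
print(removable(dead, open_xids=[100, 240], next_xid=250))  # prints []

# Once it ends, the horizon moves up to xid 240 and both tuples become removable.
print(removable(dead, open_xids=[240], next_xid=250))
```

Note that the table name plays no part in the decision; only the age of the oldest open transaction matters.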

The symptom of this is that vacuum tells you about a large number of dead tuples, but says that it removed 0 / none.  Doing a "vacuum verbose" tells you there was a large number of "dead row versions" which "cannot be removed yet".  If you turn on autovacuum logging, you'll see messages like this:

Feb 27 02:57:41 slurm postgres[12345]: [3-1] LOG:  automatic vacuum of table "activemq.public.activemq_msgs": index scans: 0
Feb 27 02:57:41 slurm postgres[12345]: [3-2] #011pages: 0 removed, 14002 remain
Feb 27 02:57:41 slurm postgres[12345]: [3-3] #011tuples: 0 removed, 58897 remain

The number that remain will continue increasing forever, and the number removed is always 0.  Like me, you might naively check for transactions running against activemq_msgs, and find none, or find only ones which are short-lived.  And that would be your mistake.  While autovacuum runs per-table, running transactions are per-database.  You may well have a long-running transaction running statements against some other table preventing rows from being removed from the table you're watching.  Again, this is because PostgreSQL cannot predict the future; that long-running transaction might run a query against the table you're watching two seconds from now, and as long as that could happen, those old tuples cannot be removed.
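
So the thing to look for is the oldest open transaction anywhere in the database, not just the ones touching your table.  Here is a rough sketch of that check in Python.  The query only uses pg_stat_activity columns (pid, datname, xact_start, query), but the long_running helper, the five-minute threshold, and the sample rows and pids are all assumptions of mine for illustration; you would feed it real rows from psql or whatever driver you use.

```python
from datetime import datetime, timedelta

# Lists every open transaction, oldest first, regardless of which table it touches.
OLDEST_XACT_SQL = """
SELECT pid, datname, xact_start, query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY xact_start
"""


def long_running(rows, now, threshold=timedelta(minutes=5)):
    """Keep the (pid, datname, xact_start, query) rows older than the threshold."""
    return [row for row in rows if now - row[2] > threshold]


# Hypothetical result rows, shaped like the output of the query above.
now = datetime(2014, 2, 27, 3, 0, 0)
rows = [
    (4242, "activemq", datetime(2014, 2, 27, 1, 24, 13), "UPDATE ACTIVEMQ_LOCK SET TIME = $1 WHERE ID = 1"),
    (4310, "activemq", datetime(2014, 2, 27, 2, 59, 58), "SELECT 1"),
]

for pid, datname, started, query in long_running(rows, now):
    print(f"pid {pid} in {datname} has been in a transaction for {now - started}: {query}")
```

Anything that shows up here for hours on end is a suspect, whichever table its last statement happened to touch.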

How does this relate to ActiveMQ, you ask?  If you're running ActiveMQ and using PostgreSQL as your backing/persistent store (and you may well have reasons to do this), and you don't do anything to change it, the default failover locking strategy is for the master to acquire a JDBC lock at startup, and hold onto it forever.  This translates into a transaction that starts when ActiveMQ starts, and never completes until ActiveMQ exits.  You can see this in progress from the command line:

activemq=# select xact_start, query from pg_stat_activity where xact_start is not null and datname='activemq';
          xact_start           |                                                query                                                
-------------------------------+-----------------------------------------------------------------------------------------------------
 2014-02-27 01:24:13.677693+00 | UPDATE ACTIVEMQ_LOCK SET TIME = $1 WHERE ID = 1

If you look at the xact_start timestamp, you'll see that this transaction has been open since ActiveMQ started.  You can also see the dead tuples piling up in activemq_msgs, and the locks the transaction holds on activemq_lock:


activemq=# select n_live_tup, n_dead_tup, relname, relid from pg_stat_user_tables order by n_dead_tup desc;
 n_live_tup | n_dead_tup |    relname    | relid 
------------+------------+---------------+-------
        628 |      58903 | activemq_msgs | 16387
          4 |          0 | activemq_acks | 16398
          1 |          0 | activemq_lock | 16419
activemq=# select locktype, mode from pg_locks where relation = 16419;
 locktype |       mode       
----------+------------------
 relation | RowShareLock
 relation | RowExclusiveLock


Again, as long as this transaction is running and holding the ActiveMQ lock, (auto)vacuum cannot reclaim any dead tuples for this entire database.

Fortunately, ActiveMQ has a workable solution for this problem in version 5.7 and later in the form of the Lease Database Locker.  Instead of starting a transaction and blocking forever, the master will create a short-lived transaction and lock long enough to try to get a leased lock, which it will periodically renew (with timing that you specify in the configuration; see the ActiveMQ documentation for an example).  So long as the lock keeps renewing, the slave won't try to take over.  Your failover time, then, depends on the duration of the lease; it won't be nearly instantaneous, as it would be in the case of a held lock released when ActiveMQ exits (though it could be faster than a transaction ending after a socket times out due to an unclean exit).

Because the locking transactions come and go, rather than persisting forever, the autovacuum process is able to reap your dead tuples.
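
To make the trade-off concrete, here is a toy simulation of a leased lock in Python.  It is not ActiveMQ's implementation, just a sketch of the general idea: the LeaseLock class, the broker names, and the 10-second lease are all hypothetical, and each call to try_acquire stands in for one short transaction against the lock row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Toy stand-in for a lock row that records an owner and an expiry time."""
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, broker: str, now: float, lease: float) -> bool:
        # One short transaction: read the row, then take or renew the lease
        # if it is free, already ours, or expired.  Otherwise back off.
        if self.holder in (None, broker) or now >= self.expires_at:
            self.holder = broker
            self.expires_at = now + lease
            return True
        return False


LEASE = 10.0  # seconds; hypothetical value
lock = LeaseLock()

assert lock.try_acquire("master", now=0.0, lease=LEASE)      # master takes the lease
assert not lock.try_acquire("slave", now=4.0, lease=LEASE)   # still held by the master
assert lock.try_acquire("master", now=5.0, lease=LEASE)      # renewal; now expires at t=15

# The master dies at t=6 and stops renewing.
assert not lock.try_acquire("slave", now=12.0, lease=LEASE)  # lease has not run out yet
assert lock.try_acquire("slave", now=15.0, lease=LEASE)      # lease expired: failover
print(lock.holder)
```

In this model the slave waits from t=6 to t=15 before taking over, which is the failover delay the lease duration buys you in exchange for never holding a transaction open between renewals.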


So the moral of the story is this: if you're using PostgreSQL as the persistent store for ActiveMQ, make sure you configure the Lease Database Locker in your persistenceAdapter configuration.  Otherwise, PostgreSQL will never be able to vacuum out old tuples and you may suffer performance degradation and a database that bloats in size forever (or until you stop ActiveMQ, run a vacuum, and restart it).

Five More Bloody Signs You Aren't Bloody Agile

This is getting old, isn't it?  I still think I might try to turn this into a book of some kind, though, so we press on.

5. You can't touch some code because someone "owns" it.  Or there's some piece of code that only one person can work on.  Sometimes this can happen because what a unit of code does is so complicated and so specialized that there's only one person on your team capable of understanding it.  But most of the time what is happening is some combination of a person's ego becoming involved, a lack of sufficient unit testing, and insufficient cross-pollination of code modules among the members of the team.  Consider this situation very carefully, because it means your code base now has a single point of potential failure.  If only one person understands it, what happens if that person gets sick or wants to take a vacation?  If someone's ego gets involved, it's almost always to the detriment of the rest of the team. Or are people just afraid to touch it because something might break?  That's the easiest case to deal with: keep adding unit tests until people feel comfortable with that safety net.  Pair Programming can help with the cross-pollination, as can inflating an estimate, biting the bullet, and giving the work to anyone BUT the person who "owns" that code.  Collective code ownership may be a little painful and a little slow in the short term, but in the long run, you'll be glad you did it.

4. Your team lives from one crisis to the next crisis.  Management is freaking out regularly.  Flailing, weeping, wailing, and gnashing of teeth.  What could this possibly have to do with Agile, you ask? Quite a lot: it's an indication that your Agile process has completely broken down or your stakeholders are not playing the game.  If Agile were working properly, why would Management need to freak out?  They'd have a predictable set of deliverables for each iteration and each release.  They'd have a good idea of the team's velocity.  They'd have a decent idea of when things would be "done".  They'd be seeing progress in the form of demos regularly.  If they're completely disengaged and not participating, you're going to need a heart-to-heart.  If they are flailing because your team has given them timelines, and they just don't like what you had to say, then you have a trust issue between your stakeholders and your team, and again, you need to have a heart-to-heart.  If Agile is new to your organization, you need to set expectations accordingly, and make sure everyone understands how the game is played.  Remember, though, out of crisis can arise opportunity: if you can hold your team together and execute cleanly, you can demonstrate how Agile can provide the predictability and response to change that has your reporting chain in an uproar.

3. Your team has more than 6-8 developers.  Agile performs best with small teams who can meet regularly and communicate easily.  If your product is really a product of products, consider breaking teams apart along product boundaries.  You may be tempted to break teams apart across some other boundaries, but try not to do that.  Remember, you're trying to satisfy a product owner; the key word is "product."  How can you do end-to-end iterative development if you've aligned "vertically" instead of "horizontally"?  Give it some thought, because there's a decent amount of overhead in bootstrapping a new team.  Reflect on the values of Agile, and form your teams in such a way that you can satisfy the various roles most effectively and play the game by the rules.

2. Chickens are being pigs.  Pigs are being chickens.  Or to put it another way, someone is regularly stepping outside the boundaries of their role in your meetings.  This could be a product owner trying to change the rules of your Scrum, a Scrum Master trying to influence the implementation of your code, a developer trying to inject a feature into the product that wasn't requested, etc.  It can also take the form of "Some Guy" trying to do anything at all in your meetings.  When this happens it is the responsibility of the Scrum Master to call it out, explain why it isn't allowed, and remind everyone what their roles are. If you're playing soccer out on a pitch, you can't have a handful of players decide they want to play rugby, nor can you have your goalie decide he wants to play forward.  For the game to work, the players must first agree to the rules, and then play by those rules.

1. The rules of your game keep changing, and nobody asked your team.  Someone from outside the team is issuing edicts to the team and expecting them to be obeyed.  Agile teams should be self-organizing, with the aid of the Scrum Master as a servant-leader.  It's a negotiation within the team to set things like meeting times and iteration length, and a negotiation with the Product Owner to set things like the number of iterations to a release, release dates, demo agendas, etc.  If someone from outside the team is issuing directives (like "Some Guy"), they are cheating.  This, sadly, is an organizational problem and often an indicator of a lack of trust in, or a lack of respect for, the team.  It's a tough nut to crack to figure out why this lack of trust or lack of respect persists, and even more difficult to remedy the problem.

I seriously hope I'm done this time, and that anything else I think of will be a straightforward repetition of one of my previous articles:



About this Archive

This page is an archive of entries from February 2014 listed from newest to oldest.

