Rapid table archive?

This is one of those things I always wind up writing a script to do, but I always wonder how a DBA would do it.

So you have a table with three fields: event_id, recorded, and occurrences.

These entries are recorded at 5-second intervals, but after 30 days you don’t need that level of granularity, so you want to aggregate older rows down to a one-hour granularity. An example – albeit one I know is terrible – might explain:

LOCK TABLES event_log WRITE ;

CREATE TEMPORARY TABLE t_ec_archive
SELECT event_id
     , DATE_FORMAT(recorded, '%Y-%m-%d %H:00:00') AS recorded
     , SUM(occurrences) AS occurrences
  FROM event_log
 WHERE recorded < DATE_SUB(NOW(), INTERVAL 30 DAY)
 GROUP BY 1, 2
;

DELETE FROM event_log
 WHERE recorded <= (SELECT MAX(recorded) FROM t_ec_archive) ;

INSERT INTO event_log
       SELECT * FROM t_ec_archive ;

UNLOCK TABLES ;

DROP TABLE t_ec_archive ;

There are several obvious things wrong with this approach, although it does the job.

I usually write a script rather than doing it all in SQL, with the intent of burning CPU somewhere to avoid long table locks.
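For the record, one pass of the sort of batched loop such a script runs might look like this – only a sketch, where the hour-at-a-time bucketing, the @bucket variable, and the t_ec_batch name are inventions of mine, and the surrounding script decides which hours still need collapsing and when to stop:

-- One iteration: find the oldest stale hour, aggregate just that hour,
-- swap the rows, and move on. Each pass touches few rows, so no
-- long-lived table lock is needed.
SET @bucket = (SELECT DATE_FORMAT(MIN(recorded), '%Y-%m-%d %H:00:00')
                 FROM event_log
                WHERE recorded < DATE_SUB(NOW(), INTERVAL 30 DAY));

CREATE TEMPORARY TABLE t_ec_batch
SELECT event_id
     , @bucket AS recorded
     , SUM(occurrences) AS occurrences
  FROM event_log
 WHERE recorded >= @bucket
   AND recorded < DATE_ADD(@bucket, INTERVAL 1 HOUR)
 GROUP BY event_id ;

DELETE FROM event_log
 WHERE recorded >= @bucket
   AND recorded < DATE_ADD(@bucket, INTERVAL 1 HOUR) ;

INSERT INTO event_log
       SELECT * FROM t_ec_batch ;

DROP TEMPORARY TABLE t_ec_batch ;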

7 Comments

Heck, I wouldn’t even do it in SQL. RRDtool for the win!

However, when trying to keep it all in SQL, you’ve got some potential transactional issues doing it that way. If you’ve got good transactional support (i.e. not MySQL) you’ve probably got the ability to create an aggregate view as well, so I’d just have multiple tables with different granularities and move items from one to the next via some sort of overflow processor.

I.e., a view of “event_log” is actually a composite of “event_log_gran0005”, “event_log_gran0060”, “event_log_gran0600”, and “event_log_gran3600”, giving you 5-second, 1-minute, 10-minute, and 1-hour granularities. The process that adds a record to 0005 checks the table size when it’s done and, if needed, collates the oldest 12 into a 1-minute record and moves that data into 0060, etc.
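A rough sketch of what that composite might look like, in PostgreSQL-flavoured SQL (the table and column definitions here simply follow the example above and are illustrative, not a prescription):

-- Hypothetical per-granularity tables; they all share the same shape.
CREATE TABLE event_log_gran0005 (event_id int, recorded timestamp, occurrences bigint);
CREATE TABLE event_log_gran0060 (LIKE event_log_gran0005);
CREATE TABLE event_log_gran0600 (LIKE event_log_gran0005);
CREATE TABLE event_log_gran3600 (LIKE event_log_gran0005);

-- Readers query the view and never notice which granularity a row lives at.
CREATE VIEW event_log AS
          SELECT event_id, recorded, occurrences FROM event_log_gran0005
UNION ALL SELECT event_id, recorded, occurrences FROM event_log_gran0060
UNION ALL SELECT event_id, recorded, occurrences FROM event_log_gran0600
UNION ALL SELECT event_id, recorded, occurrences FROM event_log_gran3600;

The overflow processor then only ever touches two adjacent granularity tables at a time, inside one transaction.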

Oh, I should also add that the mass select/delete/insert will absolutely kill you if you’re using transaction logging. That’s the kind of thing that causes some of our midsized (6-12 GB) databases to spew out 60 GB of archivelogs per day.

Thank God I have no clue. That’s not for me :)

*Sigh*

Not having views, I’d almost forgotten about creating a union-derived view to present apparently seamless query tables. That’d definitely be a much healthier way of doing it. How would you deal with making sure the last record you copy is also the last record you then delete from the parent table?

Imagine you did INSERT INTO … FROM event_log WHERE recorded < DATE_SUB(NOW(), INTERVAL 30 DAY) and it takes 25 seconds doing the inserts. If you now do just a plain-old DELETE FROM event_log WHERE recorded < DATE_SUB(NOW(), INTERVAL 30 DAY), you’re going to have 25 seconds’ worth of records that get deleted without ever having been copied…

Or would you use some other mechanism for controlling it? (I’d use a script to specify an absolute cut-off date in both cases.)
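Roughly, I mean the script would capture a single cut-off and use it for both statements – a minimal sketch, where the @cutoff variable and the separate event_log_hourly destination table are invented for illustration:

SET @cutoff = DATE_SUB(NOW(), INTERVAL 30 DAY);

-- Both statements compare against the same frozen boundary, so however
-- long the INSERT takes, the DELETE only removes rows that were already
-- aggregated (assuming new rows always arrive with current timestamps).
INSERT INTO event_log_hourly (event_id, recorded, occurrences)
SELECT event_id
     , DATE_FORMAT(recorded, '%Y-%m-%d %H:00:00')
     , SUM(occurrences)
  FROM event_log
 WHERE recorded < @cutoff
 GROUP BY 1, 2 ;

DELETE FROM event_log
 WHERE recorded < @cutoff ;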

Hmm. Under Oracle, I’d use ROWID… under MySQL, I’ll have to think about it. Perhaps make the assumption that the timestamps can be a safe boundary condition, and delete where the timestamp is between them. I’d be surprised to see that lengthy of an insert, though, if the data reduction is performed with every 12th insert in the 5 second table.

With pgsql you could define a procedure:

You define a SELECT for your cursor and lock the selected records; once you’ve completed your aggregate and inserted into the other table, you delete the selected records and commit.

No worries about losing records being written to the table while this is running, since you are locking only the records being worked on.

Transactional processing is a PostgreSQL strength over MySQL. The same process would work with Oracle as well, same with the views mentioned above for the seamlessness.

You would want to perform regular maintenance on these tables, since they go through lots of inserts and deletes.
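The bare-transaction shape of that idea looks roughly like this, leaving the cursor loop aside (event_log_hourly is an invented destination table; note that NOW() is fixed for the whole transaction in PostgreSQL, so every statement sees the same cut-off):

BEGIN;

-- Lock only the rows about to be rolled up; inserts of brand-new events
-- are not blocked because they fall outside this predicate.
SELECT 1 FROM event_log
 WHERE recorded < NOW() - INTERVAL '30 days'
   FOR UPDATE;

INSERT INTO event_log_hourly (event_id, recorded, occurrences)
SELECT event_id, date_trunc('hour', recorded), SUM(occurrences)
  FROM event_log
 WHERE recorded < NOW() - INTERVAL '30 days'
 GROUP BY 1, 2;

DELETE FROM event_log
 WHERE recorded < NOW() - INTERVAL '30 days';

COMMIT;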

You know, I think I actually *did* something like that for the DAoC Player Wishlist (which used a pgsql database).

I’m unclear what the resource cost is of a PL/SQL operation that might potentially take a long time – I’m thinking it ought to be fairly light because, as you point out, you’re using the cursor so that you’re only operating on a few rows at a time.
