SQL Challenge

Given a medium sized table (2-3 million rows), each containing (amongst other things) a grouping relationship (or “membership” relationship, numerous rows will have same membership): extract (efficiently) the 9 most common groupings and “other”

Think pie chart. Feel free to use this to plug your own database, but include samples with timings and explains :)

Edit: This is a coffee-table discussion/challenge – not a task-at-hand, I’m not trying to solve a problem, I’m just interested in seeing some of you discuss how you might solve a problem like this.

23 Comments

I suspect this isn’t the answer you’re looking for :)

Assuming a table structure that maps as follows:

int id, int groupA, int groupB;

in perl/C/C++/whatever:

select groupA, groupB from table;

then just iterate once over the select data accumulating into a 2 dimensional array of counters. Make sure to use as index0 the min of (groupA, groupB) to make sure A:B and B:A end up accumulated together.

pick top 9 counters (easily done as you accumulate)

done!

order (n) operation, using a few MB of memory, no problem.

Well here is mine. Keep it in the DB and break it down.

The assumption is that your database can keep track of the connected session, and what you are using to program with can keep a persistent session open.

Create a simple query that gathers your aggregate data.

select month
, day
, hour
, count(value) into temp tmp_tab1
from [some_tables]
group by blah, blah blah
order by blah, blah blah desc

Then parse through the temp table taking the first 8 values and then adding the rest to a 9th.

You can do this in postgreSQL use your flavor of PL/SQL. Or register it as a function or procedure, then make a call to the function/procedure and collect the returned data.

I don’t have explain plans, but this method beats any file writes to the system, can be accessed multiple concurrent times. It also beats a monster Query that basically has a Select with 9 other selects built into it.

Also it is more secure with the stored procedure since there are no file writes.

I have a hybrid perl script that takes some information and groups it out by month,day,hour. It took a monster query that took almost an hour to run (1 query running 24 sub queries) to less than a minute. This is based on a table of nearly 4 million records.

I’m not looking for an answer, I’m interested in peoples’ approaches as much as their solutions :)

Hmm.

I wrote up a bunch of stuff, then I realized that I don’t believe I quite get what your data is, because my answer was really simple.

After looking at it again, you’re basically looking at a data set where each table row contains an instance of one entity meeting another entity, right? A given entity may be either in column A or column B, so we can’t assume that

A=19, B=17
would be grouped the same as
B=17, A=19

even though that’s the “same” encounter – a type 19 and a type 17.

In essence, it seems like what we have here is a grid of encounters – you need the tallies in each square, and then “fold” the chart across the diagonal and add the overlaps together. I don’t know if that makes any sense… I could draw it easier than explaining it.

If that’s the case, I’d need to ponder for a bit. Oracle has some built-in statistical functions that would help, but if it’s a more bare-bones database like MySQL, you’re starting to look at things a little brute-force. It may be possible to efficiently gather the data for the initial “A meets B” table with group by, then use a nested select to pull out the top added combos.

Er, that should have been:

A given entity may be either in column A or column B, so we can’t assume that

A=19, B=17
would be grouped the same as
A=17, B=19

Heh, I’m not looking for a solution – I’m just interested in solutios and approaches generally :)

Perhaps “membership” would clarify better than “relationship”. Inspired by my consciously dealing with a problem differently with MySQL than I would have in Oracle producing data for a pie-chart.

I’m not really following the data layout of what you are asking for, but going by Thunder’s assumption of the data, I’d create a collection indexed by a composite of the two groupings (to give a unique key) and simply tally as I roll through the records. Sort descending on the tally column, and for x=0 to 8 to show the results.

Hire a *competent* programmer.

…@/

Given a medium sized relationship table (2-3 million rows), each containing a grouping relationship (or “membership” relationship, numerous rows will have same membership); extract (efficiently) the 9 most common groupings and “other”? (Think pie chart) Feel free to use this to plug your own database, but include samples with timings and explains :)

Just to make sure. I think I came at this post late, because I dont see a reference to this A:B, B:A stuff. I quoted the question that I’m answering.

Table Structure:
Data, Group, ID

SELECT count(group) total
FROM table
GROUP BY group
ORDER BY count(group)

for (i = 0 to 8)
pie_array[i] = result(i);

for (i = 9 to num_rows(result))
pie_array[9] += result(i)

SELECT count(group) total
FROM table
GROUP BY group
ORDER BY count(group) DESC

— Changed my SQL to add DESC

Snail wrote:
Hire a *competent* programmer.

…@/

Hehe, yeah, that sounds like the kind of mistake Playnet would have made back in 2000: Hire a programmer to deal with database issues :)

gnpatton wrote:
Table Structure:
Data, Group, ID

Yep, that’s the kind of thing. I had a particular use and a short amount of time, and the tool I was using the data in didn’t let me write a script, so I pulled the data with the same query (SELECT DISTINCT(group), COUNT(*)) into a temporary table and gave the tool the following MySQL query:

SELECT IF(temp.row_id < 10,group.name,"Others") as group_name
, temp.num_members
FROM temp_member_counts as temp
JOIN groups as group USING (group_id)
ORDER BY 2 DESC

But I’m curious how other people would do it in other situations/with other tools :)

My approach would be from one direction and depending on the DB used, the code would be different.

1. Create a select w/ join into a temporary table. At the same time make a extra entry into the table as an auto incrementing value (basically a column in the table starting at 1 (top, and going down));

2. Select from the temp table where the custom column is less than 11;

I have not had to do any SQL coding in a bit but I know Terradata had a unique SQL command that would do both of those steps in one, transparent to the user when he/she asked for the top 10 responses.

I would use the DB as temp tables and searching would be seriously faster.

clarification on item 1:

What I am talking about is a select/join/sort into a temp table with an additional column that auto-increments values (top down) based on the sort. Make it easy to select the top 10.

I’d get the DB engine to do this as opposed to for loops in other code as the DB engine is designed to to this on data faster than some series of for loops on what is described as a couple millions lines. Transfering that onto the network when all you are interested in is a very small fraction of the data is a waste.

PS: the coders here will no doubt cringe at my mis-paired brackets. Sorry – it makes my eyes hurt when I see it to :-)

I believe Oracle has a “summary” feature too which will do this. Fixed your brackets ;)

Ok , Real Example: check it out at http://www.4rca.com/cgi-bin/top_ten_views.cgi

I’ve also posted this to my blog not sure if all of the HTML reference stuff will go through.

First Create a function on your postgreSQL database. Note I could pass a parameter to define a top n if I wanted to. This is using the Topics of our PHPBB for my squad. Record count is low but this does scale. You could dump it into a temp table to be used by the function if memory usage becomes an issue. This could be an aggregate function select query you get the same results. Basically the time it takes to run your query is the time it takes to get the results. The added bonus of storing it as a procedure on the database is that the query is already compiled. Also notice how nice the SQL is. Easy to read and maintain.

CREATE OR REPLACE FUNCTION get_top_posts()
RETURNS SETOF record AS
$$
DECLARE
results record;
record_counter int;
all_user varchar;
all_topic varchar;
all_views int;

BEGIN

record_counter := 0;
all_user := ‘Other Users’;
all_topic := ‘Other Topics’;
all_views := 0;

for results in EXECUTE ‘select usr.username,top.topic_title, top.topic_views as views
from
phpbb_topics top
, phpbb_users usr
where
top.topic_poster = usr.user_id
order by top.topic_views desc;’ LOOP

if record_counter connect(“$dsn”,”$user”,”$password”) or
die print “can’t connect $! \n”;

$sql = “select * from get_top_posts() AS (users varchar , topic varchar, views int )”;

$sth=$dbh->prepare($sql) or
die “can’t prepare: $sql \n”;

$sth->execute or
die “can’t execute: $sql \n”;

print “Content-type: text/html\n\n”;
print “”;
print “”;
print “Users Topics Views “;

while (@row=$sth->fetchrow_array)
{
$users = $row[0];
$topic = $row[1];
$views = $row[2];

print “$users $topic $views “;
}
print “”;
print “”;

It’s a rough Hack, but it works. It’s not supposed to be pretty. You can click on the image and go to the main site and navigate to the forums. Make the numbers change if you like.

Nice discussion Kfsone – more of the same please :)

A different approach (with Oracle) – not sure if it would be any more efficient, but it’s slightly different to the above.

How about putting the table into memory to start with to avoid reads. Create a procedure. Run the first grouping select (which finds the IDs of the top N groupings) as a cursor. Bind the value of the ID in a second select into a (temporary) table (or whatever you want to do with it). Pretty straightforward and all kept in the DB with nothing fancy in terms of SQL.

Deft

I guess I’m still not quite getting the data structure, then, because if each simply describes the membership of a single entity, this is just a group by with a limit. There’s nothing really fancy about it.

Thunder’s response makes me think it’s similar to the problem I described, though.

The key seems to be how much data people want to suck across the line. Programmers tend to want to pull all the data over to the application and munge it there, while SQL-heads will try to keep all the data in the database and have the RDBMS do the data reduction for them.

My solution would be similar to Thunder’s, except that I wouldn’t pull the rows across. I’d have the database perform a group-by for me, and then feed the group-by pairings into the table and total that up.

I just dislike the overhead of copying all that data over and over and over again, especially when the query will be hit a lot. If you’re tallying up totals in the application, then there’s no chance for the RDBMS to cache it for you. Also, with a good index (probably a bitmap index, if the number of distinct groups is small enough) the group by would be one heck of a lot faster for the database to calculate compared to sending all that data over the wire.

I think you’re trying to figure out what the “problem” is, Krenn :)

Breed gave an excellent practical example, a top 5 with an “other” category. That’s what makes it anything more than just “SELECT group_id, count(*) GROUP BY 1 ORDER BY 2 DESC LIMIT 10” – because you want a row which includes everything that the “limit” would truncate.

I also gave the example I actually used in my use-case here.

Nah, just trying to properly understand the data set, because so much of what you can do in SQL is made possible (or prevented by) the data structure itself.

If it’s just top ten where #10 is other, I’d use a nested subselect (assuming Oracle). If not Oracle, a mass group-by of everything and add together #10-end inside the scripting language.

Sub selects add a lot of cost to a query, it can easily double the execution time or worse. It’s one of the reasons why PL/SQL and the other procedural languages for databases have been created. Too bad the current version of MySQL that they are running on is very limited in what it can do.

It would be a subquery of the external query, though, so it’s impact should be minimal. The result should already be cached in the SGA and it would just be tallied up.

Breed: Subqueries can be an optimization in several SQL environments, MySQL is just not one of them. The MySQL community is founded on a web-scripting base.

I don’t think I’ll be hazzarding anything on MySQL’s stored procedures etc or even looking at them seriously until version 7.1 :)

If we were starting another project with a larger target audience, I’d want to look at Postgres or Oracle for a significantly larger customer base. I’d still want an abstraction layer so that we could easily switch.

The funny part of that, of course, is that SQL itself is supposed to be the abstraction layer. :)

Leave a Reply

Name and email address are required. Your email address will not be published.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

You may use these HTML tags and attributes:

<a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <pre> <q cite=""> <s> <strike> <strong> 

%d bloggers like this: