Sorting and parallelization… Yeuch.
You can get significant performance gains by parallelizing various sort algorithms cleverly, but ultimately there’s some data on the left that needs to be on the right, and vice versa.
Consider: [ 42, 15, 11, 17]
You can divide this into two workloads for parallel processing: [42, 15] and [11, 17]. But they are going to take different amounts of time.
And you can’t sort the whole group until both sub-groups are ordered, which means synchronization.
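In Python terms, a toy sketch of the shape of the problem (names are mine, and the GIL means these threads aren't real CPU parallelism; the point is only where the wait lands):

import threading
from heapq import merge

data = [42, 15, 11, 17]
left, right = data[:2], data[2:]

workers = [threading.Thread(target=half.sort) for half in (left, right)]
for w in workers:
    w.start()
for w in workers:
    w.join()    # the synchronization point: neither half can be merged until BOTH are done

print(list(merge(left, right)))   # [11, 15, 17, 42]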
So I wound up having fun experimenting with various sorting algorithms in Python using matplotlib. Until Python started annoying me. Not that Python is bad, I can see why people like it. But from a C/C++ background there are just lots of things that make me balk.
E.g. The weird Python 2.6/3.0 string formatting:
print("Some {0} over {1} is a {2}".format("where", "moon", "cow"))
This gets annoying if you remove an argument… Since you then have to renumber all the remaining arguments.
And the old format,
print("Some %s over %s is a %s." % ... how do I do this? ...)
I couldn’t figure out.
Some of it was self-induced; sorting involves a lot of element swapping, and Python doesn’t have a native “swap”, but is rather proud of the fact you can do:
list[l], list[r] = list[r], list[l]
While that syntax offers some neato possibilities, when it comes to doing the above it’s just bloody irritating and it doesn’t clearly state what you are doing.
(solution: make your own swap function, which I did, but still)
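A minimal sketch of such a helper (the name is mine, not necessarily what the test code used):

def swap(seq, i, j):
    """Exchange the elements at positions i and j in place."""
    seq[i], seq[j] = seq[j], seq[i]

values = [42, 15, 11, 17]
swap(values, 0, 1)
print(values)   # [15, 42, 11, 17]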
Lastly (well, last I can be bothered to ramble on about), the whole whitespace thing.
When working with short snippets of code, I like the Python style where indentation controls the extent of a sub-section of code:
if theSkyIsBlue :                 # ':' in this case means 'then'
    print("The sky is blue")
    theSkyIsGreen = False         # same indent level, still dependent on the if
print("Not indented, we are back to normal")
But I find it kinda hideous if you get many levels of indentation.
def foo(work, depth):
    width = len(work)
    if width > 1:
        if depth > 1:
            if width < 8:
                return work.sort()
            else:
                if depth > 2:
                    r = letsDoSomethingElse(work)
                    if r < 10:
                        r = somethingElse(work)
                        if r > 3:
                            foo(work[:width/2], depth+1)
                            foo(work[width/2:], depth+1)
def somethingElse(work):
    # ...
Noooooooo. MUST PUT FI or } or SOMETHING to indicate I hadn’t just not finished this code!
That cliff-like fall off from the final level of compound statements to the “def” on the next line … URGLE.
I find that what I tend to wind up doing is using blank lines to denote “end of compound statement”, and as a result not using them anywhere else in the code (did I mean for the “if r > 3:” to be indented? I dunno).
6 Comments
print("Some %s over %s is a %s." % ("where", "moon", "cow")). The % formatting operator takes a tuple on the RHS, and as a special case you can omit the tuple if it would have only a single element. I.e. the full form is "%s" % ("x",) and the special-case form is "%s" % "x". And if you want to format a tuple with a single-slot format string, you have to stick it inside a tuple: "%s" % (("x", "y"),) :)
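For reference, the three cases spelled out (my own examples, not the commenter's):

# General form: a tuple on the right-hand side of %
print("Some %s over %s is a %s." % ("where", "moon", "cow"))

# Single argument: full form with a one-element tuple, and the special-case shorthand
print("%s" % ("x",))
print("%s" % "x")

# Formatting a tuple itself needs the extra wrapping tuple;
# "%s" % ("x", "y") would raise a TypeError instead
print("%s" % (("x", "y"),))   # prints ('x', 'y')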
Perhaps you would benefit from an even broader approach to the sorting problem. I’m assuming you are using it to calculate vis-lists and the like?
Then perhaps a map/reduce approach will suit you better.
I.e. consider a list of 10,000 entries, and you want to find the 5 best wrt your current player:
– Send each of 8 servers their eighth of the list, plus the current player position, and tell them to sort it. Each server does its own calculation of distance, then returns its top 5.
These 40 entries can then be whittled further down by simple logic.
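A rough local stand-in for that scheme (only a sketch; multiprocessing.Pool plays the part of the 8 servers, and the names and numbers are mine):

import heapq
import random
from multiprocessing import Pool

PLAYER = (0.0, 0.0)   # current player position

def dist_sq(entity):
    dx, dy = entity[0] - PLAYER[0], entity[1] - PLAYER[1]
    return dx * dx + dy * dy

def top_five(chunk):
    # each "server" selects within its own eighth and returns its best 5
    return heapq.nsmallest(5, chunk, key=dist_sq)

if __name__ == "__main__":
    entities = [(random.uniform(-100, 100), random.uniform(-100, 100)) for _ in range(10000)]
    chunks = [entities[i::8] for i in range(8)]   # one eighth per worker

    with Pool(8) as pool:
        candidates = [e for part in pool.map(top_five, chunks) for e in part]   # 40 entries

    best = heapq.nsmallest(5, candidates, key=dist_sq)   # whittle the 40 down to the final 5
    print(best)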
I guess the reason for your frustration didn’t come through to me.
If the number of cores is N, you can split the list into N chunks and sort each in parallel, which gives you close to a factor-N speedup for a normal n log n sorting algorithm. Then you merge the subsequences, which is just like the end of a merge sort, with decreasing parallelism as the sequences are merged. Won’t this give you enough speedup? Or are you sorting so many things you need a distributed sort?
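For what it's worth, a minimal sketch of that split-sort-merge (my own code, not the commenter's; heapq.merge does the final k-way merge serially):

import heapq
import os
import random
from multiprocessing import Pool

def sort_chunk(chunk):
    return sorted(chunk)

if __name__ == "__main__":
    data = [random.randint(0, 1 << 20) for _ in range(100000)]
    cores = os.cpu_count() or 4
    size = -(-len(data) // cores)   # ceiling division: one chunk per core
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    with Pool(cores) as pool:
        sorted_chunks = pool.map(sort_chunk, chunks)   # the parallel part

    result = list(heapq.merge(*sorted_chunks))         # the serial merge at the end
    assert result == sorted(data)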
Soraz: Actually, I don’t have any incredibly large lists to sort :) The host’s vis vehicle lists are already spatially organized.
Lut: It’s the synchronization part that sucks, I guess. At least with current hardware. It all comes back to my frustration that on something like a Core2 Duo or an i7, parallelization is still done in software. Like I said, [ 42, 15, 11, 17]. After you have sorted the two sub-units (if you were using so small a grain size), you have to synchronize on their completion before you can perform the merge.
Trivial if you are just dealing with my sample list.
But a list of 4096 entries. That’s going to be 4094 synchronization points.
Because they may take arbitrary time (e.g. sorting [ 11, 17 ] takes a lot less time than sorting [ 42, 15 ], since no write is required) and these times will vary wildly depending on subunit size, many of the cleaner/simpler solutions go out the window because of the cost of in-software parallelization (that is, things like thread management, mutexes, etc).
If you know you’re going to have a reasonable number of cores, you can afford to be a little less aggressive about your actual sorting: for instance, affording yourself the luxury of checking whether subunits below a certain size are already sorted, and whether two given subunits are already in order.
In my test code, I found it also became best to set upper and lower grain sizes for subunits. That is: once I got down to a subunit of 16 or less, I just ran a standard sort algorithm on that subunit. After that, it becomes possible to trivially skip a merge:
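Something along these lines, though this is only a reconstruction of the idea rather than the actual test code (names are mine):

import heapq

GRAIN = 16   # lower grain size: at or below this, just run a standard sort

def grainy_sort(work):
    if len(work) <= GRAIN:
        return sorted(work)
    mid = len(work) // 2
    left = grainy_sort(work[:mid])    # in the real thing these two halves go to separate workers
    right = grainy_sort(work[mid:])
    if left[-1] <= right[0]:
        return left + right           # the two subunits are already in order: skip the merge
    return list(heapq.merge(left, right))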
But this proved not to be optimal in all cases.
I also spent a chunk of time trying to understand the bitonic merge sort. It seemed to me that it should be capable of dealing with workloads of arbitrary sizes, but for some reason my experimental versions would only work with power-of-2 sized lists. Grr.
I suspect this was probably down to me doing a direct access on data rather than taking a value before an operation.
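For reference, a textbook power-of-2 bitonic sort looks roughly like this (my own sketch, not the experimental code; it has the same restriction that the list length must be a power of two):

def bitonic_sort(data, ascending=True):
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    first = bitonic_sort(data[:mid], True)     # build a bitonic sequence:
    second = bitonic_sort(data[mid:], False)   # an ascending half followed by a descending half
    return bitonic_merge(first + second, ascending)

def bitonic_merge(data, ascending):
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    out = list(data)
    for i in range(mid):
        if (out[i] > out[i + mid]) == ascending:
            out[i], out[i + mid] = out[i + mid], out[i]   # compare-and-swap across the halves
    return bitonic_merge(out[:mid], ascending) + bitonic_merge(out[mid:], ascending)

print(bitonic_sort([42, 15, 11, 17, 3, 99, 7, 1]))   # [1, 3, 7, 11, 15, 17, 42, 99]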
Obviously the divide-and-conquer aspect makes parallelization a great boon to sort performance; but conversely the parallelization only provides solutions for subsets. You’re [almost] always going to wind up with a serial merge of the largest possible dataset, hehe :) (Since my knowledge of sorting algorithms is extremely limited, there may be algorithms that actually don’t)
I really enjoyed my experiment with Python, particularly the ease with which I got matplotlib up and running given that it was only the second or third time I’d ever used Python!
In particular, I have a wee bit of a phobia when it comes to visualization tools. It comes from the fact that almost every visualization tool I’ve ever tried documents itself thusly:
Step 1. How to include the library.
{easy stuff}
Step 2. How to create an image.
{easy still!}
Step 3. How to do kick-ass awesome graphical stuff
{really fricken complex stuff full of graphics terminology like rasterization, compression styles, etc etc}
Step 4. How to draw a complex pie chart with blah blah blah
…
Step 99. How to draw a couple of lines, why are you wasting our time pissant?
[description in terms of steps 4 thru 98]
But matplotlib cut straight to the point. (Well, ok, I then had to figure out how to make an interactive chart so that I could update the same figure each frame of iteration, but that’s reasonably well covered in the FAQ; plt.ion().)
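Roughly the shape of it, for anyone curious (a toy sketch, not the actual test harness; the bar-chart idea and the names are mine):

import random
import matplotlib.pyplot as plt

plt.ion()                          # interactive mode: the figure stays live between updates
fig, ax = plt.subplots()

data = [random.randint(0, 100) for _ in range(32)]
bars = ax.bar(range(len(data)), data)

for _ in range(200):               # one "frame" per step of the experiment
    i = random.randrange(len(data) - 1)
    if data[i] > data[i + 1]:
        data[i], data[i + 1] = data[i + 1], data[i]   # a toy adjacent swap, standing in for a sort step
    for rect, value in zip(bars, data):
        rect.set_height(value)     # update the same figure rather than creating a new one
    plt.pause(0.01)                # give the GUI a chance to redraw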
So I’m still confused…
On a normal processor with N cores, I seriously doubt the mutex overhead would be prohibitive for the *one* lock you have to take to wait on the first sort of the subsequences.
I mean, do you *really* need to worry about load balancing for the sorting of the different chunks? That seems about as likely as you are to worry about the worst-case performance of quicksort… Just do them serially. If you know you don’t have more workers, there’s just additional overhead in further subdivision. And because you’ve now cut the length of the sequence that needs to be sorted serially by N, you’ve cut the worst-case scenario by a factor N^2 so it’s much “less worst” now compared to the serial case before. And even if you do the subsequent merging *totally serially*, that’s only a linear time operation.
Worrying about the edge cases seems a bit like premature optimization to me. Just do the straightforward parallelization and see if it’s good enough… ;-)
I meant to add: If you were doing this for a GPU-like massively data-parallel machine with very lightweight synchronization and switching, then you probably want another strategy.