Thanks to Mr. Heidenreich I lost most of my free time this afternoon.
I had posted about the programming problems on Project Euler and Mr. Gerry Heidenreich got addicted right away and we had been talking back and forth about some of our solutions. He pings me to tell me that he is having dreams (nightmares?) about Problem 10 in the set. It was taking too long to run, way too long since the problems should complete in less than one minute. We both have almost the same computer oddly enough, a 2.13ghz Core 2 Duo. If you’re too lazy to follow the link, the problem is to add the prime numbers that are less than 1 million.
I started with his algorithm and got the run time down to about 80 seconds with some plain code optimizations and expanding upon the Sieve he was using for eliminating factors of primes. To be fair this took a while. To be even more fair, there are probably Math-only optimizations that would eliminate the need for what you are about to see, but there is a larger lesson here.
I’ve been saying to anyone who’ll listen that chip-makers have been letting us down for about 3 years. We can no longer expect significantly faster and faster sequential processors to speed up the more complex apps we are using and writing. The Consolation Prize is ubiquitous multi-core chips in laptops, desktops, and servers. I have engaged in various debates about the magnitude of the benefit of this approach, especially for Game Programming or other tasks where the solutions (at least the currently practiced and taught ones) are fundamentally sequential in nature. I had been meaning to look into the cost of context switching, memory copying and synchronization primitives for problems that don’t immediately appear to lend themselves to SMP. My mediocre math skills gave me a test problem.
My first attempt was lightning fast but gave the wrong results due to an obvious synchronization issue. I decided to give this problem at least a fraction of the thought it deserves and came up with some initial goals. I wanted to make a class that can encapsulate a chunk of the work. Some sort of controller class should organize the chunks of work, and be able to handle work broken up into arbitrary #s of slices. The controller should be able to make sure all work is done before calculating the final solution. To solve my obviously sloppy threading issues, tasks running concurrently should be able to safely share some sort of read/write context that may be necessary to manage dependencies between chunks of work or at least the final results.
To skip to the punchline, the multithreaded method using the pattern I came up with is almost twice as fast as the single threaded method, gets the correct result, and has given me food for thought on future problems requiring significant computing. Almost twice as fast on 2 procs as on 1 is about as good as you can get so at least I got that part right.
Now, to rewind to the solution. The first thing that jogged my memory was watching the single-threaded method run.
CPU usage would not peg the CPU even when setting the process to Highest priority and turning off everything else that was running. I know very little about the details of the Core 2 Duo but unlike hyperthreading its two real processors, so somehow either Windows, the BIOS, or Intel are spreading the load across two processors while half of my overall horsepower goes wasted.
Here is a model of the first iteration of my work-splitting solution:
WorkUnit made the most sense for the Command class name, not to be confused with other Unit-Of-Work patterns.
Obviously one goal is to make something I can keep refining as I come across new problems and have a .NET pattern/framework for speeding up execution of large tasks in the future. The problem I’m solving here is probably one of the simplest examples of concurrent tasks with data interdependencies both constantly during execution was well as result assembly when all WorkUnits are done. In this case the shared context is that all threads of execution will be reading and writing to an array of flags. These flags mark numbers later on in processing as “Not possibly a prime candidate” when they are factors of already processed numbers. The process in action looks something like this:
Now I ran with two WorkUnits, each sharing part of the range of possible prime #s and calculating their own sub-sums to be contributed to the final tally when all analysis was done. I was happy to see a different CPU load…
… as well as obtaining the correct answer at the end and in far less time than single threaded solution. I think there is still room for improvement, quite a bit in fact. As soon as the one work unit is done the CPU load goes back down to 50% while the WorkUnit with the larger #s completes.
I have some other initial thoughts on this framework that I’ll get to later tonight, time permitting.
Remember Me
a@href@title, strike
Powered by: newtelligence dasBlog 2.0.7226.0
Disclaimer The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
© Copyright 2008, Damon Payne
E-mail