[lug] File modification

Bear Giles bgiles at coyotesong.com
Wed Mar 13 17:49:41 MST 2002


> Unfortunately, no. I ctrl-C'd it after about 15 hours. I used a
> particularly bad choice of algorithms because I was in a hurry, and
> I ended up making some stupidly, dramatically, *painfully* inefficient
> algorithms to reduce the complexity of the data structures.

I may be in the minority here, but here goes....

Did you know the size of the problem going in?  Were you misled 
about the size of the problem?

This is one of my sore spots - the "savings" in using an inefficient
algorithm are rarely worth it unless you are absolutely certain that
the problem size is limited.  The more efficient algorithms are not
that much more difficult to learn or use, and learning to use them
now will serve you well in the future.

Some horrid examples to prove that efficient algorithms matter:

1) at one site, a long-departed developer had stored a list of up to
~4,000 items in a linked list.  THEN SORTED IT WITH A BUBBLE SORT.

I knew that no element was more than 10 bytes or so (it was a
#define constant), so I instrumented the code to determine the
maximum number of records actually seen in practice.  I doubled
that count, #define'd it, dynamically allocated a buffer, and
instrumented the ingest routine to immediately warn any future
user to update that constant and recompile in the unlikely event
that we ever saw more records than that.  I then called qsort()
on the buffer.  The improvement in runtime was measured in HOURS,
since this routine was called tens of thousands of times during
a run.

(Different data though, so there was no degenerate case for the
bubble sort.)

As a side benefit, I eliminated a huge memory leak since the
existing code never bothered to release the linked list elements.
I had a single malloc(), and it was easy to ensure that it was
always free()'d.
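
For the curious, the replacement boiled down to something like the
sketch below.  (Reconstructed from memory and heavily simplified:
the record type, the sizes, and the read-from-stdin loader are all
stand-ins for the real ingest code.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RECORD_LEN   10     /* no element was more than ~10 bytes  */
    #define MAX_RECORDS  8192   /* ~2x the largest count ever observed */

    typedef struct { char key[RECORD_LEN + 1]; } record;

    static int cmp_record(const void *a, const void *b)
    {
        return strcmp(((const record *)a)->key, ((const record *)b)->key);
    }

    int main(void)
    {
        /* One malloc(), one free() - no linked list, no leak. */
        record *buf = malloc(MAX_RECORDS * sizeof *buf);
        size_t n = 0, i;
        char line[64];

        if (buf == NULL)
            return 1;

        while (n < MAX_RECORDS && fgets(line, sizeof line, stdin) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';
            strncpy(buf[n].key, line, RECORD_LEN);
            buf[n].key[RECORD_LEN] = '\0';
            n++;
        }
        if (n == MAX_RECORDS)
            fprintf(stderr, "warning: raise MAX_RECORDS and recompile\n");

        qsort(buf, n, sizeof *buf, cmp_record);

        for (i = 0; i < n; i++)
            puts(buf[i].key);

        free(buf);
        return 0;
    }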

P.S., anyone else interested in a citizen initiative to make
teaching bubble sorts a capital offense?  Quick sorts are easier
to teach, easier to code, and much more efficient.  ("Find a pivot
value.  Walk through the array with two pointers, one from each
end, swapping the first value above the pivot with the last value
below the pivot.  Stop when the pointers meet; repeat on the left
and right subarrays.")
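
In C that whole recipe fits in a dozen lines or so.  A sketch on
plain ints, using the middle element as the pivot (any reasonable
pivot choice works):

    /* Quick sort with the two-pointer partition described above. */
    static void quick_sort(int a[], int lo, int hi)
    {
        int pivot, i, j, tmp;

        if (lo >= hi)
            return;

        pivot = a[(lo + hi) / 2];          /* find a pivot value       */
        i = lo;
        j = hi;

        while (i <= j) {
            while (a[i] < pivot) i++;      /* first value >= the pivot */
            while (a[j] > pivot) j--;      /* last value <= the pivot  */
            if (i <= j) {
                tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
                j--;
            }
        }

        quick_sort(a, lo, j);              /* repeat on left subarray  */
        quick_sort(a, i, hi);              /* ... and right subarray   */
    }

Call it as quick_sort(values, 0, count - 1).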

2) another site did a lot of geographic projections.  Generating
even a small set of maps was usually an overnight affair because
the existing code scanned the high resolution GIS data for the
entire globe!  This was despite the data being pre-sorted into
tidy little 1-degree-square boxes.
I modified the code to use a rough bounding box - it took a day or
two to code and test, but it reduced the run times from overnight
to seconds.
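
The fix was conceptually trivial - something along these lines
(the names are hypothetical, and the printf() is a stand-in for
actually loading and projecting the tile):

    #include <math.h>
    #include <stdio.h>

    /* Rough lat/lon bounding box of the requested map. */
    typedef struct {
        double min_lat, max_lat;
        double min_lon, max_lon;
    } bbox;

    /* Visit only the 1-degree tiles that can contribute to the map,
       instead of scanning the entire globe. */
    static void render_map(const bbox *b)
    {
        int lat, lon;

        for (lat = (int)floor(b->min_lat); lat <= (int)floor(b->max_lat); lat++)
            for (lon = (int)floor(b->min_lon); lon <= (int)floor(b->max_lon); lon++)
                printf("load tile lat=%d lon=%d\n", lat, lon);
    }

    int main(void)
    {
        bbox colorado = { 37.0, 41.0, -109.0, -102.0 };  /* roughly CO */
        render_map(&colorado);
        return 0;
    }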

3) another site read a configuration file in a way I haven't
seen in 20 years.  It actually opened and closed the file for each
line in the configuration file... possibly even each field in each
line in the configuration file.  Changing this to read the file in
a single pass knocked 20 seconds off the initialization time.
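
The replacement was the boring, obvious loop.  A sketch, assuming a
simple "key=value" format with '#' comments (the real format doesn't
matter; the point is the single fopen()):

    #include <stdio.h>
    #include <string.h>

    /* Open the file once and read it in a single pass, instead of
       re-opening it for every line (or every field). */
    static int read_config(const char *path)
    {
        FILE *fp = fopen(path, "r");
        char line[256];

        if (fp == NULL)
            return -1;

        while (fgets(line, sizeof line, fp) != NULL) {
            char *eq;

            line[strcspn(line, "\r\n")] = '\0';
            if (line[0] == '#' || line[0] == '\0')
                continue;                 /* skip comments and blanks */

            eq = strchr(line, '=');
            if (eq != NULL) {
                *eq = '\0';
                /* stand-in for actually storing the setting */
                printf("key=%s value=%s\n", line, eq + 1);
            }
        }

        fclose(fp);
        return 0;
    }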

Of course this is always a balancing act.  But when you're doing
initial development the full cost of a decent algorithm is usually
no more than the cost of an inefficient one, and it's often much
lower because the efficient algorithm is a more natural fit to the
problem.

> It *is* done, though. 

No it's not.  Can you use this program to solve problems?

> And I know it works on smaller data sets.

How large were these smaller data sets?  What test data did they 
contain?

If we were in industry and you reported to me, my first question
would be how you know that the program is making any progress?
Do you have *ANY* demonstrable reason to believe that your program
isn't cycling between two states?  (e.g., perhaps two values in this
problem set are equal but your code assumes that a strict ordering
is defined.)
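
Here's a toy version of what I mean: a "sort until nothing moves"
loop driven by a comparator that assumes no two keys are ever
equal.  Feed it a duplicate key and it never terminates.

    #include <stdio.h>

    /* Never returns 0, because "the keys are always distinct"... */
    static int broken_cmp(int a, int b)
    {
        return (a < b) ? -1 : 1;
    }

    int main(void)
    {
        int v[] = { 3, 7, 7, 1 };             /* note the duplicate key */
        int n = 4, i, tmp, swapped = 1;

        while (swapped) {                     /* repeat until no swaps  */
            swapped = 0;
            for (i = 0; i + 1 < n; i++) {
                if (broken_cmp(v[i], v[i + 1]) > 0) {
                    tmp = v[i]; v[i] = v[i + 1]; v[i + 1] = tmp;
                    swapped = 1;              /* the two 7s swap forever */
                }
            }
        }
        puts("never reached with this input");
        return 0;
    }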

The literature is littered with algorithms that worked with one or
two data points, but always broke on three data points.  Or four.
Or five.  Or data containing duplicate keys.  There's countless
production code with pointer errors or bad calls to free() that
still manages to work with small data sets but breaks when you have
more data.

If your program is still running after 15 hours, my gut reaction is
that it's probably due to a coding error.  It *might* be due to
an inefficient algorithm, but that's not the first place I look unless
the code has been rigorously tested.  (Think white-box testing that
you can prove exercises every code branch, not a couple of small
data sets.)

> (No,
> you can't look at the code. It's too embarrassing.)

How will you learn?  Your professor or TA can make a few comments
on your code, but how much industry experience do they have?  Post
your code here and you could learn what we care about in the real
world.

> Eh, better that I learn that lesson *now* than in the industry...

And what lesson(s) did you learn?  Seriously.  The lesson you learned
may not be the lesson we think you should have learned! >:-)



