[lug] File modification

Bear Giles bgiles at coyotesong.com
Thu Mar 14 15:34:58 MST 2002


> The course work assigned at universities *is* a throw-away task; and
> rightfully so.  

To be honest, I never really took any ugrad programming classes.  At
the graduate level, about half of the classes that involved programming
had the code build on prior work, and it could be done in the fashion
you'd see in industry.

> Typically those assignments do not deal with a 'real-world'
> task, except in the most abstract sense.  The goal at the university is to
> teach budding programmers a new approach to a particular task, where the new
> approach is usually only *part* of the task, not necessarily an entire design
> process (how many courses actually teach a debugging/testing methodology
> course at the undergraduate level?).

Sure, but it should also be reinforcing good habits.

Map the problem into other fields.  Would an English department tolerate
papers with poor grammar and 'creative' spelling because the emphasis was
on Shakespeare's sonnets?  Or a history department tolerate a poorly
referenced document because the emphasis was on WW-I, not writing papers?

The bigger point is that you can see college as a series of disconnected
classes, or you can see it as a way to practice your new skills.  This
is especially pronounced in the usual math sequences - you feel like you
never really understand any material until after you've completed the next
class in the sequence, but that's because it took a couple extra months of 
exposure to really "get" it.

You can see algorithms and data structures as a toss-away class, or as
a quick introduction that you think about, however briefly, when doing
each of your subsequent assignments.  Ditto software engineering, and
databases, and whatever ugrad class you have.  It only takes a few minutes,
but the payoff is that you really begin to understand the material.

And as an aside, remember that it could be much worse.  I had very few
ugrad classes, and I taught myself the equivalent of a BSCS by reading
magazine articles (Dr. Dobbs and the like) and constantly asking myself
how I could apply what I learned at work.  It took a lot longer, but all
it took to convince me that this was the right path was overhearing some
far more senior developers mention the problems they were having with a
run taking all night to finish, and suggesting that they try an AVL tree
(an early balanced binary tree - now you would use a red-black tree).
This dropped the runtime from overnight to about 20 minutes (since it
turned out that they were initializing a search tree from a sorted list!),
and I saw that this theoretical stuff had utility in the real world.
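
For anyone who hasn't been bitten by this, here's a minimal sketch of
why sorted input is poison for a plain, unbalanced search tree (my
illustration, not the original program):

  #include <stdio.h>
  #include <stdlib.h>

  struct node {
      int key;
      struct node *left, *right;
  };

  /* Naive, unbalanced BST insert.  Feed it keys that are already
   * sorted and every new key walks the entire right spine, so N
   * inserts cost O(N^2) - the "tree" is really a linked list.
   * An AVL or red-black tree rebalances as it goes and keeps
   * every insert at O(lg N). */
  static struct node *insert(struct node *root, int key)
  {
      struct node *n = malloc(sizeof *n);
      struct node **p = &root;

      n->key = key;
      n->left = n->right = NULL;
      while (*p != NULL)
          p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
      *p = n;
      return root;
  }

  int main(void)
  {
      struct node *root = NULL;

      /* Sorted input: the worst case for an unbalanced tree. */
      for (int i = 0; i < 20000; i++)
          root = insert(root, i);
      printf("done\n");
      return 0;
  }

Time that loop at a few sizes and the quadratic growth is obvious;
swap in any balanced tree and it vanishes.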

> The amount of time it takes to develop  serious, real-world solutions is
> not appropriate to coursework as a rule, because you are being handed
> brand-new material, and given two weeks to get it done, for a grade.  This
> means 1) learning how the algorithm ought to work, 2) discovering that how
> you have been thinking about it is wrong, once you start writing code 3)
> fixing all of the code you have already written, once you discover the
> difference.

I agree completely, and would have written #2 entirely in capital letters.
But this requires that you actually see the program work, and my experience
screams that these sample runs were just too small to have any confidence
that the program worked.

> Multiply that by 3 courses, all on
> *separate* topics, and I can understand why someone might take the brute
> force approach if it looks like it will take less time to code it; typically
> *professors* don't care about efficiency, so long as the assignment appears
> to work correctly.

I think this is the crux of the matter.  I don't mind brute force 
approaches when it's the only thing that's available, but the key thing 
is to think about your options before you start coding blindly.  Even a
few minutes can make a huge difference.

Again, map this elsewhere.  If you were told you needed to be in LA next
week, would you immediately jump in your car and start driving?  Of course
not - you would at least take a moment or two to decide what your options
are - do you go I-70 to I-15 and then into LA?  I-80 to Sacramento and down
I-5?  I-25 to I-10?  (Actually, I think it's I-40, but I digress.)  Or
would you just grab the first interstate you reached and follow it to the
end - ending up in Montana or Kansas City?

Or would you take a step back and ask if it even made sense to drive?
Maybe there's an airline promotion and you could fly for far less money
and hassle.

> Industry certainly cares about efficiency, at least to a certain extent,
> but many of Bear's examples seem to indicate that even industry is less
> worried about the trade-off of efficiency versus getting it done (at least
> for initial solutions) than the posts here have indicated (why else would
> you put up with processes which ought to be quick and are instead slow, for
> any length of time?).

You misread those examples.  I can assure you that every one of those
companies knew that these were critical problems, but they were the
proverbial frog in the pot and felt they were powerless to change the
situation.  It took someone coming in and kicking butt (but politely) to
shake them out of their complacency.

In most of these cases the ultimate problem was that they started out
with a small problem, and expected faster hardware to save them as the
problems got larger.  But with the inefficient algorithms the run times
got worse, and everyone was focusing on the symptoms instead of asking
the core questions.  Finding one of these problems took nothing more
than turning on the profiler and asking why strcmp() was called over
10 million times to process a 1000-line document - but nobody had ever
taken the time.

The only way out of this trap is to do it half-way right the first time.
Don't plan for failure.

> And how often have we heard on this list about
> supervisor X who isn't interested in investing even the small amount of time
> necessary to make things efficient:  the product release deadline is coming
> up quickly, and the code needs to work; you can always patch it later to
> make it fast, once customers have paid for it (whether that's true or not is
> another question :) ).

In practice, code is never revisited.  NEVER.  I got away with it only
because I've learned that the best way for me to learn new C code is to
turn on the compiler warnings and fix the problems.  Really - that's a
braindead exercise that exposes me to most of the code, and it inevitably
makes the code a lot less fragile.  I wrap it up by turning on profiling
and looking at the dump - by then I've seen the procedure names enough to
understand what I'm seeing once I get past the braindead comparison of
malloc/free, open/close, etc.

But if I see three separate (and inconsistent) implementations of linked
lists in about 10 kloc, I'll investigate why this was considered necessary.
That led to the discovery of the LL bubble sort.

Back to your point, this is why you need to understand the tools you
have available.  If you use qsort(), you don't have to worry about anything
other than setting up the initial conditions.  If you use the Berkeley DB,
you don't have to worry about anything other than mapping between objects
and persistent objects.  Knowing lex/yacc means you can focus on what to do
with the pieces, not how to write the parser.  This frees up a *lot* of
time.
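
To make the qsort() point concrete, the entire "setup" is one comparison
function - everything else belongs to the library:

  #include <stdio.h>
  #include <stdlib.h>

  /* Comparison callback: qsort() hands us pointers to two array
   * elements and expects a result that is <0, 0, or >0, just
   * like strcmp(). */
  static int cmp_int(const void *a, const void *b)
  {
      int x = *(const int *)a;
      int y = *(const int *)b;
      return (x > y) - (x < y);
  }

  int main(void)
  {
      int data[] = { 42, 7, 19, 3, 88, 1 };
      size_t n = sizeof data / sizeof data[0];
      size_t i;

      /* We supply the array, its shape, and the ordering; the
       * library worries about pivots, small cases, etc. */
      qsort(data, n, sizeof data[0], cmp_int);

      for (i = 0; i < n; i++)
          printf("%d ", data[i]);
      printf("\n");
      return 0;
  }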

> For real-world work, it's usually the case that you understand something
> about the domain you are working in; typically in class you don't have a
> clue as to what the dataset is really about, or what it looks like, or why
> algorithm X is really a better approach than Y to the dataset, because you
> only have a foggy notion of what Y does, much less X. 

In these cases, the professor should make a library available so the
students can focus on the context, not the implementation.

If the implementation *is* the point of the exercise, then proper choice
of algorithms is back on the table.

> Perhaps it would be better to skip optimizing for speed, and just
> 'get it done'.

I have never argued that you should optimize before you have a working
solution, just that you can make intelligent choices on the components
of that initial solution.

Remember, Chris hasn't said what the assignment involved.  It's possible
that it's an NP-complete problem and the professor is a sick bastard. :-)
But we don't know what this class was either - maybe the whole point of
the exercise was to give an "impossible" assignment (e.g., travelling
salesman) and see what they do with it.

> Again, I think this is an experience issue.  More efficient algorithms are
> easier to learn *if* you have experience with the efficient algorithms.

Then learn them.  I'm not suggesting students need to learn the algorithms
we covered in my 6000-level algorithm class, just to learn to recognize
the common themes and how to handle them.  If it's a problem that involves
divide-and-conquer, you know you'll probably need to use recursion and/or
linked structures.  If it's a problem that involves a greedy algorithm,
you'll want a heap or a sorted structure.
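
As one hypothetical illustration of the greedy theme (classic activity
selection, not anything from the assignment), note that the greedy choice
only becomes available after you sort:

  #include <stdio.h>
  #include <stdlib.h>

  struct job { int start, finish; };

  static int by_finish(const void *a, const void *b)
  {
      const struct job *x = a;
      const struct job *y = b;
      return (x->finish > y->finish) - (x->finish < y->finish);
  }

  int main(void)
  {
      struct job jobs[] = {
          {1,4}, {3,5}, {0,6}, {5,7}, {3,9}, {5,9}, {6,10}, {8,11}
      };
      size_t n = sizeof jobs / sizeof jobs[0];
      size_t i;
      int last_finish = 0, count = 0;

      /* Greedy step: always take the job that finishes first.
       * It's provably optimal here, but only on sorted data. */
      qsort(jobs, n, sizeof jobs[0], by_finish);

      for (i = 0; i < n; i++) {
          if (jobs[i].start >= last_finish) {
              last_finish = jobs[i].finish;
              count++;
          }
      }
      printf("%d compatible jobs\n", count);
      return 0;
  }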

> This is easier than:
> 	1.  Look at an item in a list.
> 	2.  If it's bigger/less than the item above it, move it up.
> 	3.  Keep doing that to the end of the list
> 
> ?!?!  It's only easier if recursion makes sense to you, and you are used to
> dealing with pointers; quicksort gets taught as the second programming
> course, and I am not convinced that someone relatively new to programming is
> clear on either of those two issues.

I think I skipped a step or three.  My beef isn't that bubble sort
is hard to implement, but that it's FALSELY stated to be faster
than a simple insertion sort.  It's not; both are O(N^2).  I even had
some java animations on my home page for many years that allowed you
to launch a half dozen sorts at once, and long after the quick, heap
and shell sorts finished, the bubble and selection sorts finished
together.

Yet bubble sort is taught as a "more efficient" sorting algorithm than
selection sort.  I've seen countless comments to this effect in the wild,
and actually got some mail from people convinced I deliberately did 
something to slow down the bubble sort in my java applets.

If they want to teach a "more efficient" sorting algorithm, teach
quick sort.  Or shell sort, which is a bubble sort on steroids.
But never bubble sort - it offers no improvement over selection
sort, and it's harder to implement correctly.
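
For comparison, here's a sketch of both (my code, not anything from the
original applets).  Both are O(N^2), but the insertion sort is shorter,
exits early on nearly-sorted input, and has fewer places to get the
bookkeeping wrong:

  #include <stdio.h>
  #include <stddef.h>

  /* Insertion sort: one pass per item, shifting larger elements
   * right.  On nearly-sorted input the inner loop exits almost
   * immediately. */
  static void insertion_sort(int a[], size_t n)
  {
      size_t i, j;
      for (i = 1; i < n; i++) {
          int key = a[i];
          for (j = i; j > 0 && a[j - 1] > key; j--)
              a[j] = a[j - 1];
          a[j] = key;
      }
  }

  /* Bubble sort: also O(N^2), but every pass re-walks the array
   * swapping neighbors, and the early-exit flag and shrinking
   * bound are classic sources of off-by-one bugs. */
  static void bubble_sort(int a[], size_t n)
  {
      int swapped = 1;
      while (swapped && n > 1) {
          size_t i;
          swapped = 0;
          for (i = 1; i < n; i++) {
              if (a[i - 1] > a[i]) {
                  int t = a[i - 1];
                  a[i - 1] = a[i];
                  a[i] = t;
                  swapped = 1;
              }
          }
          n--;  /* the largest element has bubbled to the end */
      }
  }

  int main(void)
  {
      int a[] = { 5, 1, 4, 2, 8 };
      int b[] = { 5, 1, 4, 2, 8 };
      size_t i;

      insertion_sort(a, 5);
      bubble_sort(b, 5);
      for (i = 0; i < 5; i++)
          printf("%d/%d ", a[i], b[i]);
      printf("\n");
      return 0;
  }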

> Certainly quicksort is more efficient, but as I recall, qsort gets terrible
> actual performance when the data set is small (and when the data is near
> sorted already, but I am not sure about that).  

So what?  Once you get past the "intro to algorithms" course you should
just call the system qsort() function.  It has already been optimized:
it falls back to insertion sort (not bubble sort) for the small cases,
uses median-value pivoting, etc.

> Still, bubblesort on 4000 items seems a bit, well, silly.

Not just bubble sort on 4000 items, bubble sort on a linked list.
A linked list containing small items.  (IIRC, it was the section ID
for components in a document, so we're talking about things like
 "17.C.3.b".)

> > Of course this is always a balancing act.  But when you're doing
> > initial development the full cost of a decent algorithm is usually
> 
> Hmmm, only if the code is being run a lot, right?  Your examples above are
> good because they have to be run frequently.  Course assignments are run
> once, maybe twice.  

That's one way of looking at it.

I prefer looking at it the other way - how much can I do in a given
amount of time?  More efficient algorithms allow me to deal with larger
problems.  To give some hard numbers, let's say one 'tick' is one second;
how much can I do in two hours (7200 ticks)?

  O(N lgN)  ~  a bit under 800
  O(N^2)    ~  85
  O(N^3)    ~  19

There's some slop here, of course, but the point is clear.  Even if you
assume the constant on the O(N lgN) algorithm is 10x worse than on the
O(N^2) algorithm, you're still better off with it once you have more
than 60 points or so.

This is also why I am so down on the 1-2% test case.  Going the other
way, if a 2% solution takes 1 minute then a 100% solution (50x the
points) takes roughly:

  O(N lgN)  ~    282 m (<5 hours)
  O(N^2)    ~   2500 m (over 40 hours)
  O(N^3)    ~ 125000 m (about 87 days)

> > > It *is* done, though.
> > No it's not.  Can you use this program to solve problems?
> Sure, he can.  He just needs 24 hours or so

Do we know this?

> But, efficiency is secondary for course work, usually.

This reminds me of rhetorical question in one of my SE class...

You always care about efficiency, at least to the extend of
ensuring that it will finish by the deadline.  If you write
a program that won't terminate until after your deadline, why
bother?

And I'm still not convinced that this program would terminate.
I've seen too many programs that "haven't finished yet" or crashed
after running for hours (or days) when in fact they were dead meat
from the first few seconds - but nobody realized they had blown
through array bounds, corrupted the stack, etc.

> > If we were in industry and you reported to me, my first question
> > would be how you know that the program is making any progress?
> > Do you have *ANY* demonstrable reason to believe that your program
> > isn't cycling between two states?  (e.g., perhaps two values in this
> > problem set are equal but your code assumes that a strict ordering
> > is defined.)
> 
> And his response would be, well, it works on small sets, so it does
> something correctly.  

And I would reply "so what?"  We're talking about a whole new 
situation when we increase the size of the problem by 5000%.

It's not hard to put in debugging statements that periodically
show the length of a pending queue, or number of unprocessed nodes,
or whatever makes sense in the context of the problem.  I've lost
count of the number of times I started something that I knew would
take hours to run normally, but only caught the fact that it would
never terminate (due to a coding error) because these progress
indicators never moved.
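
Something as simple as this is usually enough (a hypothetical work
loop - the queue is a stand-in for whatever the program processes):

  #include <stdio.h>

  int main(void)
  {
      long pending = 5000000L;
      long done = 0;
      long iter = 0;

      while (pending > 0) {
          /* ... real work would go here ... */
          pending--;
          done++;

          /* The heartbeat: if "pending" never shrinks from one
           * report to the next, the program is cycling between
           * states, not making progress. */
          if (++iter % 1000000L == 0)
              fprintf(stderr, "iter %ld: %ld pending, %ld done\n",
                      iter, pending, done);
      }
      return 0;
  }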

> Remember:  these *are* one off tasks, and if finding the more elegant
> solution means stepping into a library and doing a whole bunch of research,
> well, I have suspicion that maybe it isn't usually worth it.

Unless that was the point of the exercise. :-)

Seriously, I'm not suggesting that anyone learn esoteric stuff,
just master the basics.

> > If your program is still running after 15 hours, my gut reaction is
> > that it's probably due to a coding error.
> 
> Hmmm... Who has time in a course to test *every* possible code branch?

You mean you don't write the test cases as you write your code?

I'm not expecting full coverage, just pointing out that you use the
small test cases to check for obvious errors.  To test the algorithm
itself you need to use larger test cases - if I knew that the final
problem was 500 points I would try to check the algorithm itself with
problems of 5, 10, 20, 50 and maybe 100 points.
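
A sketch of what I mean (the sort here is just a stand-in for whatever
the assignment actually computes, and the invariant check would change
to match):

  #include <stdio.h>
  #include <stdlib.h>

  static int cmp_int(const void *a, const void *b)
  {
      int x = *(const int *)a;
      int y = *(const int *)b;
      return (x > y) - (x < y);
  }

  /* Stand-in for the assignment's algorithm. */
  static void algorithm_under_test(int a[], size_t n)
  {
      qsort(a, n, sizeof a[0], cmp_int);
  }

  /* The invariant the output must satisfy at *any* size. */
  static int is_sorted(const int a[], size_t n)
  {
      size_t i;
      for (i = 1; i < n; i++)
          if (a[i - 1] > a[i])
              return 0;
      return 1;
  }

  int main(void)
  {
      size_t sizes[] = { 5, 10, 20, 50, 100 };
      size_t s, i;

      /* Climb the ladder of sizes instead of betting everything
       * on a single 500-point run. */
      for (s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
          size_t n = sizes[s];
          int *a = malloc(n * sizeof *a);
          for (i = 0; i < n; i++)
              a[i] = rand();
          algorithm_under_test(a, n);
          printf("n = %3lu: %s\n", (unsigned long) n,
                 is_sorted(a, n) ? "ok" : "FAILED");
          free(a);
      }
      return 0;
  }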

> Obscure bugs are probably not going to be tested by the professor, either.

Unless they keep the program from terminating.  Or they have a
nasty sense of humor.  (500 data points is large enough to hide
a lot of nasty surprises.  I wouldn't expect to see this in a
lower-division ugrad course, but I wouldn't be surprised by it 
in a graduate or senior-level course.)

> Coursework is about getting the general idea, not necessarily getting the
> *best* solution.  You tend to error check what you *need* to, not what you
> *ought* to, and how many times does actual performance nearly require that
> very little actual testing goes on?

But when run times take hours (or days), you can't afford to test the
software by seeing what it finally produces.  Take the time to check
what you can before making this major investment.

This reminds me of one of my few ugrad classes.  We did our punch cards,
gave them to a clerk, and came back a few hours later to get our printouts.
My university (with only 10,000+ students at the time!) actually sent
these jobs to another university for processing!

Today I let my compiler check for syntax errors and misspellings.  Even
a few years later, when I had a second class that used line printers, I
let the compiler check for these errors.  But with an hours-long turnaround
I can assure you that I carefully checked these punch cards before I handed
them off.

I see this situation as similar.  If a 1-2% test case takes minutes to
run, it will take hours to run on the full data set.  The time spent
testing for obvious errors would be time well spent.

> > How will you learn?  Your professor or TA can make a few comments
> > on your code, but how much industry experience do they have?  Post
> > your code here and you could learn what we care about in the real
> > world.
> 
> Chris, play fair here.  You can't call for help, even on only a related
> question, and not expect a bunch of programmers to want to see.  Besides, if
> you ever *do* need to use the algorithm again, you will have learned how to
> fix it.  And I wouldn't mind seeing it because frankly, I suck, you are a
> generally better programmer than I am, and I would like to see other
> mistakes, so that I don't make them.

That, and the general fact that everyone needs to learn to distance
themselves from their code.  Criticism of your code is not criticism
of you.

> My answer to this last is obvious; I think that
> Chris, like me, is pursuing dual degrees in two separate, unrelated fields,
> and struggling to integrate them in a meaningful way, more or less without
> help.

For about five years I was working full time and taking graduate classes
and trying to remain sane.  Nobody is denying that he (or you) is under
pressure - what we're denying is that it goes away when you graduate.  It
doesn't, it gets worse, and the consequences can be a lot worse than
possibly getting no credit for an assignment.

> I guess my point here is that some of the responses seem to berate Chris for
> poor design decisions, without paying attention to the fact that he is not
> in an industry job,

I think many of us are expressing concern that he has an idealized, and
highly unrealistic, view of what it's like in industry.  A few years ago
I was deliberately taking months off between contracts, not because "I was 
rich and could afford it" but because I had been working so hard for so
long that I was skirting burnout.  This intensity level is very rare in
other disciplines... and where it does occur, at least it's understood and
management makes allowances.  But not in programming.  (I suspect that's
also why it sometimes seems rare to find an experienced programmer who
doesn't also have a therapist or two.)

> the academic world really does pull you in
> multiple directions, in a way that an industry job generally doesn't (except
> for some of the consultants, I would imagine).

It depends on what you're doing.  You, at least, have someone giving you
a clear indication of where you're being pulled.  Imagine the same pressure,
the same need to pick up new material quickly, but with no idea if it's
another world wide web (I remember reading the CU symposium flyer in '92 or
so... and deciding it would be another ultimately unused academic exercise)
or if it's just more vaporware that will go nowhere.

At least courses cover material that's (usually) widely considered important.

> BTW, does anyone want to look at my code, and tell my
> where it sucks? :)

I am always happy to look at code.  

Seriously.

It helps me keep my own skills current, but I tend to speak my mind.



