[lug] software engineering

Nate Duehr nate at natetech.com
Mon Nov 13 13:21:55 MST 2006


Zan Lynx wrote:
> On Sun, 2006-11-12 at 23:44 -0700, Nate Duehr wrote:
>> I was typing up a long reply to all the points, because I find many of 
>> these things a lot of fun to talk about -- not being much of a software 
>> developer but having worked on the receiving end (technical support) of 
>> both good and bad software for most of my adult life, I (think) I have a 
>> unique perspective, as do you.
>>
>> I have contended for years that so-called Software Engineers don't play 
>> by the same rules that Civil, Chemical, Structural, Electrical, and 
>> other Engineers live by -- the industry just barely makes a half-hearted 
>> effort at it.  What I mean is, the creativity and drive are there of 
>> other Engineers, but the discipline isn't.
>>
>> It shows in the fact that open-source software blows away the 
>> functionality and features of most "Engineered" code from most businesses.
> 
> Civil engineers design and add a *hefty* safety margin.  And they miss
> things, like resonance frequencies on bridges, and some of the mistakes
> made in "quake-proof" buildings in California.  And the New Orleans
> flood control systems.

Nit-pick: New Orleans wasn't a mistake.  The Army Corps of Engineers 
built the levee system for a smaller storm size and surge, and clearly 
indicated that to those paying for the job, the U.S. Congress.  Congress 
had an opportunity to pay for upgrades two years or so before the city 
was destroyed and chose not to.

> Electrical engineers make *plenty* of mistakes.  Please read the errata
> sheets for various components.  Especially interesting are the computer
> related ones like SATA and Ethernet controllers, memory controllers, PCI
> and PCI Express, etc.  CPUs have *many* bugs.  There was the old Pentium
> divide bug, there was an Athlon 64 prefetch into protected memory
> segfault bug, plus hundreds of other little things I can't remember.

They just don't seem as critical as software's problems.

Every Joe on the street knows what a "computer software crash" is, and 
how millions of dollars are lost to them every year.  (Billions?)

But you rarely hear about a "building crash"... ya know?  Buildings 
aren't falling down often enough for people to notice, anyway.

There appears to be a completely different level of mistakes going on 
here, by many orders of magnitude.

> It's complexity.  Engineers cannot hold everything related to their
> project in their heads, and they cannot predict all possible
> interactions between components plus the surrounding environment.

Building a building isn't complex?

The thing is... concepts are heavily re-used in building buildings, and 
formalized job roles are in place -- Architects, Engineers, Construction 
Supervisors, Workers of all sorts -- and they all know their piece of 
the job very well.

Software companies, even big ones, typically still don't quite have that 
level of consciousness about code.

The knowledge re-use in building structures is formalized to the point 
where it's put into Building Codes.

Software has almost no such equivalent, and is far more immature (both 
in terms of years and in terms of the attitude of many developers) in 
this regard.

> Software gets a bad rep because it isn't as critical to get right the
> first time as building a bridge, and it is easy to update and fix later,
> so perhaps less effort is put into verification.  But the customer does
> get faster and cheaper in exchange for a few bugs.

I would contend that customer companies are starting to disagree, but 
software companies are stuck in the rut of quick releases, etc.

When 40% of CxOs say their biggest problem is "software that didn't 
deliver what was promised", that's enlightening, isn't it?

> I would say really critical software, like F-22 flight control software,
> *is* heavily analyzed and tested, and is probably just as reliable as
> the mechanical engineering that goes into the wings and engines.

Probably true.  But even more so, it's designed with a fail-safe 
mentality and design philosophy.

It doesn't fail and crash; it fails and switches to another, perhaps 
"dumber" mode where the human has to do a little or a lot more work to 
fly the aircraft.  And for critical functions the human can't take over, 
redundancy is multiple layers deep.

Again back to testing too -- the F-22 (using your example) was also 
heavily TESTED for YEARS by pilots who weren't the operational 
day-to-day folks who fly it.

(That makes F-22 Raptor pilots sound like they're not 
steely-eyed-missile-men, but they are... the test pilots are just that 
razor's edge better even.)

Software simply doesn't go through anything close to this level of 
testing -- it isn't even within orders of magnitude of it.

Even at our system's most core level, the kernel -- depending on the 
timeline of a particular distro's release, they could pull in a section 
of code that was still being discussed on the kernel-traffic mailing 
list only a few weeks earlier, if they hit the release dates just right.

> The discipline is there, when it is needed and cost effective.

It's ALWAYS needed, so cost effectiveness is really the driver.  :-) 
How many software shops do you know that ask their engineers to look at 
ways to create code more inexpensively?  I've never heard that question 
asked of a developer in the business world.

What you usually hear is that the underlying system changes so 
dramatically that even asking the question at the application 
developer/engineer level is silly.  The OS will change so much 
underneath you that you can't and won't ever attempt to stay on a stable 
code-base for your application software.

OS-level upgrades or changes are a good opportunity for most software 
houses to package up a new product, discontinue/end-of-life the old one, 
and, in an awful lot of cases, blame it on the OS upgrade -- hiding the 
reality that their old code was getting really crufty and needed some 
heavy design-level thinking to rebuild major components of it.

And patches in the software world don't bring new dollars with them; 
only the next new version does.  The whole sales cycle is economically 
built to force instability.  Some would say to force "innovation", but 
I'd disagree.  Most coders out there are probably spending 50% or more 
of their time re-writing the same string manipulation code they wrote 
the day they joined the company, on the shiny new platform, be that a 
new OS, or a new OS and new hardware, etc.

> Really, consider how your boss and users would react if you claimed that
> you needed a week to design, analyze and test the Perl script that
> reports their maximum disk usage, to make sure it was 100% reliable.

Actually we do this, but hey -- I work in a weird environment.  :-)

Scripts are never uploaded to production systems without peer and 
Engineering review, and are never added during the Production day.  Root 
access is never allowed unless something is seriously broken during the 
production day, and every single command that's going to be typed into 
the machine at the shell prompt is written down in a formal 
Method-Of-Procedure (MOP) document and reviewed and tested on a lab 
system before it's done in production.  Even a simple "We'd like to 
remove the excess junk in /tmp" requires a written document of exactly 
what you'd type to do it.
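
To make that concrete, a MOP entry for that /tmp example might read 
something like this (a made-up illustration, not one of our real MOPs -- 
the 7-day cutoff and the review gates are invented for the example):

  # Step 1: inventory what's there before touching anything
  ls -la /tmp
  # Step 2: list candidate files older than 7 days; this output is
  #         reviewed and approved before the next step is run
  find /tmp -type f -mtime +7 -print
  # Step 3: only after the listing above is approved, remove those files
  find /tmp -type f -mtime +7 -exec rm -f {} \;

The point being that even the "look first, then delete" steps are 
spelled out and approved before anyone touches a shell on the production 
box.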

One customer, for example, has done their own analysis and only allows 
root access and maintenance work on Friday nights.  That's their 
"safest" time that matches their internal schedules.  (Oh, lucky us!  
GRIN...)

Others have procedures to allow vendors (us) to formally request root 
access any time it's needed, which includes submitting those MOPs 
described above for their review prior to us logging in.

Once in a while, after a great deal of trust is built, you can get a 
verbal approval (and a change of the root password to something you 
know, so you can get in) to log into the system to LOOK at a log file 
you can't otherwise read because of its permissions.  But you still 
don't change or modify ANYTHING without permission.

Down-time is NOT in my customers' vocabulary.  :-)  Down-time caused by 
a vendor typing the wrong command is very close to being an offense that 
can lead to termination of the person who allowed us to do it.

> Right, its ridiculous.  You just fix it whenever you notice it doesn't
> work for filesystems with really long device names.

This would be listed in the caveats and/or design doc for the script if 
it came from our Engineering group -- that document is required, and the 
script can't get past the Change Control Board without it.
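
For instance -- and this is just a hypothetical sketch of that kind of 
script, not anything we actually ship -- the long-device-name caveat 
usually comes down to how you parse df output.  Using df -P (POSIX 
format) keeps long device names from wrapping onto a second line and 
breaking the column parsing:

  #!/usr/bin/perl
  # Hypothetical sketch: report the fullest filesystem.  Relies on
  # `df -P` so long device names stay on one line per filesystem.
  use strict;
  use warnings;

  my ( $worst_mount, $worst_pct ) = ( '', -1 );

  open my $df, '-|', 'df', '-P' or die "cannot run df: $!";
  <$df>;    # skip the header line
  while ( my $line = <$df> ) {
      chomp $line;
      # Columns: filesystem, blocks, used, available, capacity, mount
      my ( $fs, $blocks, $used, $avail, $pct, $mount ) =
          split ' ', $line, 6;
      next unless defined $pct;
      $pct =~ s/%//;
      ( $worst_mount, $worst_pct ) = ( $mount, $pct )
          if $pct > $worst_pct;
  }
  close $df;

  print "Maximum disk usage: $worst_pct% on $worst_mount\n";

That's exactly the sort of assumption the caveats section exists to 
capture: which df behavior the script relies on, and what happens on a 
system where it differs.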

Nate


