To add logic to the software to handle things that are not possible just adds extra complexity and therefore extra opportunities for even more bugs.
Not HANDLING things that are not possible makes sense.
But DETECTING things that are "impossible" is very common. Much, if not most, important code is filled with "assertions", which verify that the assumptions the programmers made are in fact true. In non-critical code, if an assertion fails then the program halts. In critical code it at least gives an error message, and maybe then forces a switch back to a simpler "safe mode".
For example (purely hypothetical, I know nothing of their code), say they have two ways to determine mission elapsed time - one from the Atlas, and one from the break wire on the umbilical. Then the assertion would read something like
IF abs(met_from_breakwire - met_from_atlas) > 10 seconds THEN print("Clocks do not agree!")
perhaps followed by setting both clocks to the value deemed more reliable.
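A minimal C sketch of what that hypothetical cross-check might look like (the function name, variable names, and the 10-second threshold are all invented for illustration):

#include <math.h>
#include <stdio.h>

/* Hypothetical cross-check of two independent mission-elapsed-time sources,
   both expressed in seconds. */
void check_met(double met_from_breakwire, double met_from_atlas)
{
    if (fabs(met_from_breakwire - met_from_atlas) > 10.0) {
        printf("Clocks do not agree!\n");
        /* ...perhaps followed by setting both clocks to the value deemed
           more reliable, or dropping into a safe mode. */
    }
}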
I disagree with your characterization of what an assertion is. I think you're conflating assertions with error handling.
In my experience, assertions are used for things that are not meant to be checked in production code. Assertions are generally checked only in testing and then in production builds there is no assertion checking.
What you describe is indeed one use of assertions, but it's not the only one. Often they are left in production - at least then if the system fails you know *why*, instead of having a sometimes difficult debugging problem.
For example, see "Assertion Checkers in Verification, Silicon Debug and In-Field Diagnosis". In this case, the assertions are built into the hardware, and left in for production versions.
They specifically state this is helpful in high reliability situations. If an assertion fails, you shut down that unit and switch to a redundant processor. (This is for hardware errors, not software bugs.) More obviously, it finds error situations that occur in the field that are not uncovered by your tests - exactly what happened here.
Here's an example of leaving software asserts in production code:
What they found was that the number of crashes did not change much, but the number of distinct crash signatures went down significantly. The learning was that code executing past a disabled assertion may be in one of n different bad states, each of which might lead to a different type of crash. They now had better high-level information about what was causing crashes (knowing which asserts were wrong), and it helped them reduce their crash rate much more effectively than raw crash reports without asserts would have (including cases where the crash was in iOS code, not app code).
So leaving assertions in production code, and even silicon, is a known practice. I suspect you've personally seen evidence of this, when some program you are using emits an error message "This can't happen". I've certainly seen enough of these.
Unfortunately, some developers are not very good at writing code with proper error checking. Some don't understand the intended difference between assertions and run-time error checking. Such developers often want to keep assertion checking turned on in production code.
I've found that usually this is just a lack of training. Nobody ever taught them to think in terms of proper error checking. When I've explained the proper use of assertions and run-time error checking and handling (which is more complex because you need to choose how to handle it and recover gracefully), I've found that developers are usually able to master this.
If having assertion checking turned on in production code improves the code, it wasn't properly written to begin with.
Your first link is about assertions in hardware description languages. These days tools from the big EDA vendors are used as standard practice to prove all those assertions can't happen, or to find waveforms that cause the assertions to fail, before a chip is ever fabricated. It is near-universal practice that these assertions are not checked in production. The paper seems to be about an experimental idea for using some of this assertion information to help find hardware problems in individual chips. That's not a standard practice.
In any event, all of this is very, very far from what you were labeling as an example of "assertion checking": comparing two clock signals, detecting that something was wrong, and taking some action to try to improve the behavior based on that. Even when some people stretch the term "assertion" to cover checks left enabled in production, I've never before heard anyone stretch the definition as far as you have - not just checking for an unexpected condition, but trying to keep the system running and doing some error handling based on it, beyond just emitting an error message and stopping execution.
What you described is run-time error handling. It's not assertion checking by any stretch.
In my experience, assertions are used for things that are not meant to be checked in production code. Assertions are generally checked only in testing and then in production builds there is no assertion checking. When an assertion fails, the program just entirely bails out. It does not try to handle the error, it just ends the test. Hence the C/C++ assert() macro, which is compiled out when NDEBUG is defined (as it typically is for release builds) and otherwise prints an error message and calls abort(); in production code the condition isn't even evaluated and the assertion has no effect whatsoever.
This has another problem - it violates "test as you fly", in ways that can introduce fatal problems. Suppose, for example, you say, in C++
assert(n=m);
where you meant:
assert(n==m);
Now when you test, you will *set* n equal to m, and you are essentially asserting that m is not 0 (the assertion only aborts if m happens to be 0).
So in a test build n is guaranteed to equal m after the assertion, whether or not it did before (not what you intended).
Now you compile for production, and the assertion is not tested. Now n has whatever value it had before the statement. Of course you are not *supposed* to write assertions with side effects, but it's easy to do by mistake. And now you can have code that works perfectly in DEBUG and not in production. See
Effective use of assertions in C++: "There are only two ways in which assertions can be misused, by introducing side effects within an assertion's expression, or by leaving the assertions in the production code. The introduction of side effects into the assert expression is a common mistake …"
Bold mine.
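To make the side-effect hazard concrete, here is a tiny self-contained C program (the variable names and values are just for illustration) that behaves differently depending on whether assertions are compiled out:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    int n = 42, m = 7;
    /* Typo: '=' instead of '=='. With assertions enabled this ASSIGNS m to n
       (and only aborts if m happens to be 0). With -DNDEBUG the expression is
       never evaluated, so n keeps its old value. */
    assert(n = m);
    printf("n = %d\n", n);   /* prints 7 in a debug build, 42 with -DNDEBUG */
    return 0;
}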
... Of course you are not *supposed* to write assertions with side effects, but it's easy to do by mistake. And now you can have code that works perfectly in DEBUG and not in production. ...
Yes, quite true. That's why release builds must be fully tested.
... If having assertion checking turned on in production code improves the code, it wasn't properly written to begin with. ...
The difference in terminology is nuanced, but the end result is very much not nuanced. Assertions are a form of error checking, and dispensing with them at run time is very much at the developer's and the solution's peril. Is bounds checking at, e.g., the JVM layer an assertion or a run-time check? (edit: darn those emoticons) Answer: Yes to all. As was once put many years ago, turning off those checks - whatever you want to call them - at run time is (to paraphrase) like practicing with the lifeboats, and then going to sea without them.
While "...If having assertion checking turned on in production code improves the code, it wasn't properly written to begin with..." might be true, but the critical questions are: Does it make the system safer. No. Does it make the system easier to make safer? No. Does including those checks make the system safer and easier to make safer? Demonstrably yes.
In short, no credible developer responsible for life- or safety-critical systems should take your suggestion seriously. (That from someone who has had to take the witness stand in defense of their code-system-practices where lives were lost.)
edit: p.s. think we are a bit off topic here, but felt compelled to interject as this is not a topic for amateur advice; back to your normally scheduled program.
... If having assertion checking turned on in production code improves the code, it wasn't properly written to begin with. ...
The difference in terminology is nuanced, but the end result is very much not nuanced. Assertions are a form of error checking, and dispensing with them at run time is very much at the developer's and the solution's peril. ...
I think you may have misunderstood Chris's point. Assertions are a form of "error checking" in the general sense, but they are not meant to catch errors that can happen at runtime. They certainly aren't meant to handle those errors. The intent of assertions is to safeguard against programmers misusing each other's code and to help them debug the code once they have steps for reproducing a bug from QA. If an error check should be in the shipping build, then it should not be in an assertion. Doing otherwise is a misuse and a misunderstanding of what an assertion is for. I would hope that such misuses are caught in peer reviews (which even video game developers are doing nowadays).
If you are sufficiently worried about programmers misusing each other's code that you have assertions enabled in a shipping build, then I suggest that you do not have sufficient confidence in your codebase to ship it. The nuance does matter.
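Purely to illustrate the distinction being drawn (all names here are hypothetical): an assertion documents a contract between programmers and compiles away in release builds, while a run-time check ships in every build and gives the caller a defined recovery path.

#include <assert.h>
#include <stddef.h>

/* Assertion: enforces a contract ("never call this with NULL or zero count").
   Compiled out when NDEBUG is defined; not meant to handle anything. */
double mean(const double *v, size_t n)
{
    assert(v != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / n;
}

/* Run-time error handling: validates data that arrives from outside the
   program, in every build, and reports failure instead of aborting. */
int mean_checked(const double *v, size_t n, double *out)
{
    if (v == NULL || n == 0)
        return -1;            /* caller decides how to recover */
    *out = mean(v, n);
    return 0;
}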
I'm not sure why we're talking about assertion syntax in C++, but I am pretty sure it's time to end this digression.
My problem is not that they picked the wrong time - S### happens.
My problem (and for a lot of others) is just how the vehicle responded to that error.
Or more glaringly, was incapable of recognizing that an error even existed.
Exactly. What's the point of having 3 redundant flight computers cross-checking each other if you're going to feed all 3 the same garbage value from an external source without bothering to confirm that the value is correct? That's just like using data from a single sensor - if the wrong values get passed or the sensor spits out garbage data, the entire supposedly "redundant" system is worthless.
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
Just a very simplistic view of things.
Imagine something designed to fall and land in a net, if it misses the net it breaks into little pieces.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time, do you still think the design is too fragile?
Edit: I'm not saying their software is robust, great, excellent, perfect, or any of that sort, I obviously have no idea.
I'm just saying that the failure due to the clock being off is NOT evidence of a bad design.
It's a single point of failure in a mission critical function. How is that not a bad design? If you want a robust design then the flight computers should not blindly trust a mission-critical value provided by a single outside source. Not if there is any way to validate that value or run a sanity check.
You can't just look at 'single point of failure', you also have to look at the odds of failure and the consequences of failure (and also the cost of protecting against it).
I 'assert' that the odds of the clock being wrong are next to 0 (once programmed correctly). Adding cross checks adds unnecessary complexity to a simple task.
You can't just look at 'single point of failure', you also have to look at the odds of failure and the consequences of failure (and also the cost of protecting against it).
I 'assert' that the odds of the clock being wrong are next to 0 (once programmed correctly). Adding cross checks adds unnecessary complexity to a simple task.
Sure, you can add complexity by checking a million things and then cross-checking another million things... but something as simple as:
"Time From Atlas" = 11Hr
"Time From Break Wire" = 0hr
if ((TFA - TFBW) > 15min ) {
SUB ("NOT POSSIBLE, SOMETHING WRONG, DON'T GO CRAZY, ASK FOR HELP")}
else {
SUB ("CONTINUE MISSION")}
And apparently, the wrong time kills the mission, makes other systems go crazy, and damages hardware. So how is that not a single point of failure? Apparently, MET is the "jesus nut" for Starliner.
You can't just look at 'single point of failure', you also have to look at the odds of failure and the consequences of failure (and also the cost of protecting against it).
I 'assert' that the odds of the clock being wrong are next to 0 (once programmed correctly). Adding cross checks adds unnecessary complexity to a simple task.
If the system clock is so critical, then its management should have been ironclad (by design and tests) and it wasn't.
Also, the propulsion system should have been protected from overuse and, again, it wasn't.
How can NASA, the astronauts and their families trust the current state of the software of Starliner? Or the simulation of the in-flight abort? IMO the problem is systemic, not a "simple bug".
Many of the issues being discussed here are inter-related:
Should the code check for inconsistencies even if it can't determine which value is correct?
Would the astronauts, had they been onboard, have noticed in time?
Would they know what to do in such an event, in a timely manner?
Is it safer to go sub-orbital, and enter in the Indian ocean, or proceed to orbit?
Un-crewed software has, over many years, converged on a solution that addresses all these concerns:
1) The code is full of self-checks. But if one fails, it does not try to correct the problem, it drops into "safe mode". Coders can add as many assertions as they wish without having to figure out what the spacecraft should do in each case.
2) Safe mode is as simple as possible and is super-intensively tested. Its only goals are stable attitude, power positive, minimal use of consumables, and enable communication.
3) It's the job of humans to figure out how to get the spacecraft out of safe mode.
Such an architecture would have solved all the problems on the mission, plus many more conjectured here.
There were lots of ways safe mode could have been triggered: timers disagreeing, excessive fuel consumption, a radio link not behaving as expected, etc.
Two of the main features of safe mode would have helped save the mission - don't use excessive fuel, and get in a good attitude for comms.
If astronauts had been on board, there is no question they would notice. The screen would say "Safe mode entered, manual input required".
There is also no question that the astronauts would know what to do. Since safe mode is the result of almost all problems, it's one they would surely train for - safe mode during orbital insertion, safe mode during ISS approach, etc.
This also helps with the question of whether it's better to re-enter in the middle of nowhere, or press to orbit. If the craft goes into safe mode while still sub-orbital, the astronauts could assess the situation and decide what is safer. If all thrusters are working fine under manual control, press on. But if they are not certain they can re-enter later, maybe better to passively re-enter in the Indian ocean.
Robotic missions have spent decades making all these tradeoffs work, and as far as I know almost every mission has gone into safe mode at least once (often soon after launch, when tasks are being performed for real for the first time). But I have no idea how similar the software is for crewed and un-crewed crafts. Maybe it should be more similar than it appears to be.
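A minimal C sketch of that pattern (every name here is hypothetical, and the real flight software is obviously far more involved): any failed self-check drops into a simple, separately tested safe mode rather than attempting an in-place recovery.

#include <math.h>
#include <stdio.h>

/* Hypothetical safe mode: in a real vehicle this would stabilize attitude,
   stay power positive, minimize use of consumables, and enable comms, then
   wait for the ground (or the crew) to decide what to do next. */
void enter_safe_mode(const char *failed_check)
{
    printf("SAFE MODE: self-check failed: %s\n", failed_check);
}

/* Used wherever the flight code would otherwise just assert(). */
#define SELF_CHECK(cond) do { if (!(cond)) enter_safe_mode(#cond); } while (0)

void mission_step(double met_from_atlas, double met_from_breakwire,
                  double fuel_used_kg, double fuel_budget_kg)
{
    /* e.g., the clock cross-check and excessive-fuel-consumption triggers
       mentioned above */
    SELF_CHECK(fabs(met_from_atlas - met_from_breakwire) < 10.0);
    SELF_CHECK(fuel_used_kg <= fuel_budget_kg);
    /* ...continue nominal sequencing only if nothing has tripped... */
}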
I 'assert' that the odds of the clock being wrong are next to 0 (once programmed correctly). Adding cross checks adds unnecessary complexity to a simple task.
I'd agree for values that are internal to Starliner systems, but the MET is coming from Centaur. Starliner has no control over whether Centaur knows the right MET. If you're going to trust that as a design-controlled internally generated value, then you need some way to validate the entire system design including Centaur, which Boeing clearly didn't and arguably couldn't.
Un-crewed software has, over many years, converged on a solution that addresses all these concerns: the code is full of self-checks, but if one fails it does not try to correct the problem, it drops into "safe mode". ...
I doubt this is true for this situation. As Ed Kyle pointed out, Starliner is not a typical spacecraft at this stage; it is the third stage of a launch system. There is plenty of design experience with launch systems and Boeing has a lot of it. Programming your launch system to go into safe mode mid-staging is unlikely to prove a path to success. Although this suborbital tour de force of machine intelligence would no doubt be a technological marvel, it would not make the system more robust.
Starliner has no control over whether Centaur knows the right MET. If you're going to trust that as a design-controlled internally generated value, then you need some way to validate the entire system design including Centaur, which Boeing clearly didn't and arguably couldn't.
The launch system can trust the master clock, because if that clock is wrong the launch will fail. Starliner getting to the point in launch sequence when it polls the clock proves that the clock is right.
Why not use break wires at S/C sep to start the clock on Starliner as a back-up signal for MET...
Starliner has no control over whether Centaur knows the right MET. If you're going to trust that as a design-controlled internally generated value, then you need some way to validate the entire system design including Centaur, which Boeing clearly didn't and arguably couldn't.
The launch system can trust the master clock, because if that clock is wrong the launch will fail. Starliner getting to the point in launch sequence when it polls the clock proves that the clock is right.
Not necessarily. ULA could make a change to Centaur that breaks passing the MET. That probably won't happen, but the only ways to ensure that are to either not trust it, or to validate the systematic control of the entire design including Centaur.
Gemini was actually a somewhat similar design, with two thruster systems - the malfunctioning on-orbit primary RCS, and the entry attitude control system. When the primary RCS malfunctioned the astronauts inhibited the primary system and activated the entry system. This unfortunately is what forced them to end the mission early, as the entry system could not be deactivated due to the use of pyro valves.
So I would hesitate to say that Gemini's thruster system was in any way safer/better than CST-100's.
Is that "damning by faint praise"?
The contrapositive of the bolded statement is that you "would hesitate to say" that CST-100's thruster system is not safer than Gemini's.
I.e. your opinion is that CST-100 may be better than America's second capsule design done almost six decades ago, but you are not confident?
Faint praise indeed.
And this is really "apples to oranges", comparing CST-100 to the Gemini.
... So I would hesitate to say that Gemini's thruster system was in any way safer/better than CST-100's. ...
Is that "damning by faint praise"? ... I.e. your opinion is that CST-100 may be better than America's second capsule design done almost six decades ago, but you are not confident? Faint praise indeed.
The user to whom I was responding seemed to be suggesting that somehow the Gemini system was better than CST-100 in a "stuck thruster" scenario. I was disputing this claim.
And this is really "apples to oranges", comparing CST-100 to the Gemini.
Agreed.
... 1) The code is full of self-checks. But if one fails, it does not try to correct the problem, it drops into "safe mode". ... If the craft goes into safe mode while still sub-orbital, the astronauts could assess the situation and decide what is safer. ...
Would Safe Mode cause Starliner to burn up in the atmosphere - since the insertion orbit brings it back - if the vehicle could not be brought out of safe mode before re-entry?
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
You are correct that would be a bad design. In this case multiple (3) things happened.
They didn't get the correct delta V for the "back away maneuver" because the thrusters were over-stressed.
That is an assumption on your part. All that was reported is that the delta-v was subnominal. That could have been caused by any one of dozens of potential causes.
From Eric Berger’s article
"In testing the system the spacecraft executed all the commands, but we did observe a lower than expected delta V during the backing away phase," Boeing said in a statement. "Current evidence indicates the lower delta V was due to the earlier cautionary thruster measures, but we are carefully reviewing data to determine whether this demonstration should be repeated in the subsequent mission."
That sure reads to me that they believe the delta V issue was related to the thruster issues.
I read this as an admission that no set of thrusters was available to expend more propellant to get to the "back away" delta-v. That implies that virtually all of them had reached the point where they couldn't provide any more impulse. With 60 of them on the vehicle, that's pretty sobering.
What is not mentioned in the article is that the abort demo was, by design, not performed at full redundancy. The demo could have been repeated and probably would have been successful.