*snip*
Look, I'm just going to say it. The only reason Boeing didn't do an in-flight abort was to save money by not having to purchase another Atlas V and cover all the other costs associated with an IFA mission.
*snip*
You do not have to use the launch vehicle to do an in-flight abort test. All other in-flight abort tests prior to SpaceX's used a much cheaper stand-in rocket booster.
SpaceX also initially planned to use a modified first stage with only three engines and no second stage to perform the test, so Boeing could have used anything offered commercially, like the Minotaur first stage that was used for Orion.
...
We've already established that this software was poorly written.
...
In defense of the programmers (since I write bugs, a.k.a. software, for a living; small-time software, nothing mission-critical or anywhere near that):
I'm not convinced that the software was poorly written.
You trimmed Lar's comment one phrase too soon: What he said (using words I wouldn't have) was:
We've already established that this software was poorly written. We just don't know how poorly just yet. (and probably never will)
As one who spent a career doing safety-critical software, I can say that darn near every software "bug" that made it into flight testing was, at its root, a system requirement that several software requirements were derived from. Was the software modified as a result of the system requirement error? You bet; it's a lot harder to change hardware than it is software. Is that a software error? Not in my book; the software faithfully implemented the requirements given to it.
What is the root cause in Starliner? We don't know, and as Lar points out, we (the public) may not ever learn the details. Will the change be made in software? I sure hope so, because if there are hardware problems that have to be corrected, that's gonna be a hurt to Boeing, NASA, and space fans everywhere (but a WHOLE LOT LESS OF A HURT than losing a crew).
Of course the software will be changed; it will take the correct time value from the hardware. What other changes are needed, I have no idea.
p.s. I'm not sure what the additional phrase from Lar changes.
p.p.s. On the subject of software being written to spec even if the spec is wrong, I can't help but recall the bug in the Windows forfiles command:
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/forfiles
In the command you can specify a date filter by adding /d and a number preceded by a + or -. Now read this paragraph carefully:
Selects files with a last modified date later than or equal to (+) the current date plus the number of days specified, or earlier than or equal to (-) the current date minus the number of days specified.
So if you specify '/d +10' it will search for files that were modified 10 days in the future or later, when what we really wanted was files modified within the past 10 days. And of course the software was written exactly to spec and was therefore useless. (I think this bug has been fixed by now, but it was definitely a bug at one time.)
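Just to make the spec-versus-intent mismatch concrete, here is a tiny sketch in C (the function names are mine and purely illustrative, treating dates as plain seconds; this is obviously not how forfiles is actually implemented):

#include <stdbool.h>
#include <time.h>

/* Ten days expressed in seconds, for a simplified date comparison. */
static const time_t TEN_DAYS = (time_t)10 * 24 * 60 * 60;

/* "/d +10" as documented: last-modified date later than or equal to
   (today + 10 days), i.e. files modified in the future. */
bool matches_as_documented(time_t mtime, time_t now)
{
    return mtime >= now + TEN_DAYS;   /* almost never true for real files */
}

/* What most people actually want: files modified within the last 10 days. */
bool matches_as_intended(time_t mtime, time_t now)
{
    return mtime >= now - TEN_DAYS;
}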
I must admit to not having followed Starliner development as closely as Dragon's but why exactly do they have so many thrusters? Why does Starliner need (60?) thrusters?
There are 28 RCS thrusters on the service module. There are also 20 more powerful orbital maneuvering thrusters on the service module (they are also used to control pitch and yaw in the event of an abort); Starliner would be using those 20 to change orbits. Plus the 12 RCS thrusters on the capsule itself, which provide orientation control during EDL.
I am still failing to understand why a capsule and service module as light as Starliner needs 30,000 pounds of thrust via twenty 1,500 pound thrusters, when the Shuttle was able to get to orbit with only 12,000 pounds of thrust from its two 6,000 pound thrust OMS pods. Is it simply to control the abort phase, which seems to imply the abort motors can't use differential thrust to guide the trajectory, or is it to impart enough delta-v to get to orbit, even though that seems unlikely given the small amount of delta-v needed to make it to orbit? It just seems wildly excessive. Counting every single thruster on Starliner including abort motors you get 64 of them. Dragon has 28 including abort motors. The Shuttle had 44 RCS thrusters, 38 with 870 pounds of thrust each and 6 with 24 pounds of thrust each, plus the two OMS engines at 6,000 pounds each, for a total of 46 thrusters producing about 45,000 pounds combined, while its max weight was 10 times that of the Starliner stack. Not including the abort motors, Starliner has almost 34,000 pounds of thrust across all its thrusters. Dragon's total is 1,800 pounds of thrust, not including the Super Dracos.
That just seems crazy in the difference between the two. What am I missing?
Emphasis mine.
Only 12 of those 1,500 pound thrusters are aimed in the same (downward) direction. They provide additional thrust during low-altitude aborts and are the main abort thrusters for high-altitude aborts. Those 12 (18,000 pounds of thrust combined) are also the primary thrusters for getting Starliner into orbit after Atlas V has released Starliner into its suborbital trajectory.
Another four 1,500 pound thrusters are aimed forward in two pairs and are - if I recall correctly - the primary thrusters for emergency abort (rapid retreat) during the ISS approach phase.
The remaining four are paired in roll-control configuration (firing sideways).
All 20 of the 1,500 pound thrusters provide attitude control during low-altitude aborts because the normal RCS thrusters are just not powerful enough for that purpose.
Bold emphasis in the quote is mine.
So would it be correct to say that the 12 thrusters (18,000 lb of thrust combined) are required to make up for a lack of delta-v in the booster?
Twelve are there, but that is more than is needed (redundancy). Just two sets (six thrusters) is probably enough for the final nudge to put Starliner in orbit.
To add logic to the software to handle things that are not possible just adds extra complexity and therefore extra opportunities for even more bugs.
Not HANDLING things that are not possible makes sense.
But DETECTING things that are "impossible" is very common. Much, if not most, important code is filled with "assertions", which verify that the assumptions the programmers made are in fact true. In non-critical code, if an assertion fails the program halts. In critical code it at least gives an error message, and maybe then forces a switch back to a simpler "safe mode".
For example (purely hypothetical, I know nothing of their code), say they have two ways to determine mission elapsed time - one from the Atlas, and one from the break wire on the umbilical. Then the assertion would read something like
IF abs(met_from_breakwire - met_from_atlas) > 10 seconds THEN print("Clocks do not agree!")
perhaps followed by setting both clocks to the value deemed more reliable.
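For what it's worth, that hypothetical cross-check might look roughly like this in C (the names, the 10-second tolerance, and the "trust the break wire" fallback are all invented for illustration, not anything from the actual flight software):

#include <math.h>
#include <stdio.h>

/* Tolerance for how far the two mission-elapsed-time sources may disagree. */
#define MET_TOLERANCE_S 10.0

/* Returns the MET value to use, flagging the case where the sources diverge. */
double reconcile_met(double met_from_breakwire, double met_from_atlas)
{
    if (fabs(met_from_breakwire - met_from_atlas) > MET_TOLERANCE_S) {
        fprintf(stderr, "Clocks do not agree: breakwire=%.1f s, atlas=%.1f s\n",
                met_from_breakwire, met_from_atlas);
        /* Fall back to whichever source is deemed more reliable; here the
           break wire, purely as an example. */
        return met_from_breakwire;
    }
    return met_from_atlas;
}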
Another four 1,500 pound thrusters are aimed forward in two pairs and are - if I recall correctly - the primary thrusters for emergency abort (rapid retreat) during ISS approach phase.
If that is true, then was one of those thrusters the one that never worked? It sounds like they have a specific job in life and would not have been firing during the timer screw-up.
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
Just a very simplistic view of things.
Imagine something designed to fall and land in a net; if it misses the net, it breaks into little pieces.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time. Do you still think the design is too fragile?
Edit: I'm not saying their software is robust, great, excellent, perfect, or any of that sort; I obviously have no idea.
I'm just saying that the failure due to the clock being off is NOT evidence of a bad design.
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
Just a very simplistic view of things.
Imagine something designed to fall and land in a net; if it misses the net, it breaks into little pieces.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time. Do you still think the design is too fragile?
Edit: I'm not saying their software is robust, great, excellent, perfect, or any of that sort; I obviously have no idea.
I'm just saying that the failure due to the clock being off is NOT evidence of a bad design.
My problem is not that they picked the wrong time - S### happens.
My problem (and for a lot of others) is just how the vehicle responded to that error.
So would it be correct to say that the 12 thrusters (18,000 lb of thrust combined) are required to make up for a lack of delta-v in the booster?
Boeing gets Starliner dropped onto a suborbital trajectory by choice. This is very much like what NASA used to do with the Shuttle: the initial trajectory is a suborbital, free-return one, with the spacecraft making the final insertion burn itself. The Atlas V could put the capsule into orbit on its own, no problem; it's not making up for a lack of delta-v on the booster side, just a slightly different approach.
The Shuttle trajectory is wholly irrelevant.
Shuttle launched to a sub-orbital trajectory to control the reentry of its passive 78,000+ lb (35.4+ ton) External Tank.
Atlas Centaur can control its own reentry, and has done so whenever it was planned.
Were there any Centaur failures to deorbit?
Weren't several left in GTO, which decayed / will decay in almost random locations, but which were not considered a significant hazard?
Shuttle assumed that at least one of its two OMS engines would function to get it to orbit. And they always did.
How many rearwards facing 1500 lbf engines does Starliner have?
A "free return trajectory" to an area of the world without much land and with no prepared infrastructure (like a two mile landing strip) was not much of a safety feature for Shuttle.
It's not much for Starliner, either.
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
Just a very simplistic view of things.
Imagine something designed to fall and land in a net; if it misses the net, it breaks into little pieces.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time. Do you still think the design is too fragile?
I think the difference here is that no one knows whether fixing the simple mistake should make us confident. Boeing and NASA cannot afford iterative attempts at catching things in the net, or at landing a rocket repeatedly until they get it right, because Boeing and NASA have a different business model than Blue Origin and SpaceX do on their own.
SpaceX works with NASA on crewed missions, and here we see the same problem, though SpaceX's mistakes are not nearly as expensive as Boeing's because of SpaceX's vertical integration.
SpaceX and Blue Origin treat hardware development much like commercial software development, where iterative development allows evolution that the Boeing/NASA business model cannot match.
Therefore, Blue Origin and SpaceX have an enormous advantage over companies like Boeing.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time. Do you still think the design is too fragile?
I'm sorry, I know this isn't what you were meaning at all, but this popped into my brain and I can't get rid of it:
https://dilbert.com/strip/2004-09-05
To add logic to the software to handle things that are not possible just adds extra complexity and therefore extra opportunities for even more bugs.
Not HANDLING things that are not possible makes sense.
But DETECTING things that are "impossible" is very common. Much, if not most, important code is filled with "assertions", which verify that the assumptions the programmers made are in fact true. In non-critical code, if an assertion fails the program halts. In critical code it at least gives an error message, and maybe then forces a switch back to a simpler "safe mode".
For example (purely hypothetical, I know nothing of their code), say they have two ways to determine mission elapsed time - one from the Atlas, and one from the break wire on the umbilical. Then the assertion would read something like
IF abs(met_from_breakwire - met_from_atlas) > 10 seconds THEN print("Clocks do not agree!")
perhaps followed by setting both clocks to the value deemed more reliable.
I disagree with your characterization of what an assertion is. I think you're conflating assertions with error handling.
In my experience, assertions are used for things that are not meant to be checked in production code. Assertions are generally checked only in testing and then in production builds there is no assertion checking. When an assertion fails, the program just entirely bails out. It does not try to handle the error, it just ends the test. Hence the C/C++ assert() macro, which is compiled out when NDEBUG is defined (the usual setting for production builds): in a debug build a failed assertion prints an error message and calls abort(), while in a production build the condition isn't even checked and the assertion has no effect whatsoever.
That's in contrast to error handling code which is meant to be included in production code and which is meant to detect the error and take some sort of reasonable action, which is what you're describing in your clock comparison check.
The problem with adding error handling code is that you can make things worse. That's part of the reason that assertions only have an effect when doing testing. It makes it easy to throw in lots of assertions without having to worry about any negative effects they might have on production builds, because they are guaranteed to have zero effects in production builds. When you put in error handling code, you can make things worse. The loss of the first Ariane 5 is an example of this -- an overflow error was detected but the way it was handled crashed both the computer and the rocket. If you detect that the clocks don't agree and do something based on that, then you're risking making things worse in the case where the other clock you're comparing with is wrong, or where you just made a mistake in the error condition.
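To make the distinction concrete, here is a minimal C sketch (the function names are invented and have nothing to do with the actual Starliner code):

#include <assert.h>
#include <stdio.h>

/* Assertion: compiled out entirely when built with -DNDEBUG (the usual
   setting for release builds); in debug builds a failure prints a message
   and calls abort(). */
double thrust_fraction(double commanded, double max_thrust)
{
    assert(max_thrust > 0.0);   /* "can't happen" in a correct system */
    return commanded / max_thrust;
}

/* Error handling: ships in production and must decide what to do when the
   "impossible" actually happens, which is where new bugs can creep in. */
int thrust_fraction_checked(double commanded, double max_thrust, double *out)
{
    if (max_thrust <= 0.0) {
        fprintf(stderr, "invalid max_thrust %.3f\n", max_thrust);
        return -1;              /* caller must choose a fallback */
    }
    *out = commanded / max_thrust;
    return 0;
}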
Okay - let me get this straight.
They didn't get the correct delta V for the "back away maneuver" because the thrusters were over stressed.
Over-stressing the thrusters was due to the timing anomaly and over-stressing would not have occurred during a nominal flight.
Just how many uses can the thrusters handle on the "reusable" craft?
ISTM that Starliner has some significant issues that need to be addressed.
-or- do we just tell the astronauts to go easy on the thrusters so we have enough life for future flights [/snark off]
The thrusters in question are located on the service module, which burns up in the atmosphere after every mission - I think a lot of people are forgetting this.
Starliner also has thrusters built into the capsule itself which are only used for entry attitude control after separation from the service module, IIRC.
Okay - let me get this straight.
They didn't get the correct delta V for the "back away maneuver" because the thrusters were over stressed.
Over-stressing the thrusters was due to the timing anomaly and over-stressing would not have occurred during a nominal flight.
Just how many uses can the thrusters handle on the "reusable" craft?
ISTM that Starliner has some significant issues that need to be addressed.
-or- do we just tell the astronauts to go easy on the thrusters so we have enough life for future flights [/snark off]
The thrusters in question are located on the service module, which burns up in the atmosphere after every mission - I think a lot of people are forgetting this.
Starliner also has thrusters built into the capsule itself which are only used for entry attitude control after separation from the service module, IIRC.
I was unaware that only the SM thrusters were affected. I stand corrected.
But - I'm sure Armstrong and Scott were glad they didn't have to fear "over using" the thrusters on Gemini 8
Okay - let me get this straight.
They didn't get the correct delta V for the "back away maneuver" because the thrusters were over stressed.
Over-stressing the thrusters was due to the timing anomaly and over-stressing would not have occurred during a nominal flight.
Just how many uses can the thrusters handle on the "reusable" craft?
ISTM that Starliner has some significant issues that need to be addressed.
-or- do we just tell the astronauts to go easy on the thrusters so we have enough life for future flights [/snark off]
The thrusters in question are located on the service module, which burns up in the atmosphere after every mission - I think a lot of people are forgetting this.
Starliner also has thrusters built into the capsule itself which are only used for entry attitude control after separation from the service module, IIRC.
I was unaware that only the SM thrusters were affected. I stand corrected.
But - I'm sure Armstrong and Scott were glad they didn't have to fear "over using" the thrusters on Gemini 8
Gemini was actually a somewhat similar design, with two thruster systems - the malfunctioning on-orbit primary RCS, and the entry attitude control system. When the primary RCS malfunctioned the astronauts inhibited the primary system and activated the entry system. This unfortunately is what forced them to end the mission early, as the entry system could not be deactivated due to the use of pyro valves.
So I would hesitate to say that Gemini's thruster system was in any way safer/better than CST-100's.
It seems that the design is too fragile. A simple error like this should not cascade in multiple propulsion failures and LOM.
Just a very simplistic view of things.
Imagine something designed to fall and land in a net; if it misses the net, it breaks into little pieces.
Now suppose it missed the net because of a simple mistake that can be easily fixed, and also imagine that once this is fixed you can be confident >99.9999% that you will hit the net every time. Do you still think the design is too fragile?
Edit: I'm not saying their software is robust, great, excellent, perfect, or any of that sort; I obviously have no idea.
I'm just saying that the failure due to the clock being off is NOT evidence of a bad design.
It's a single point of failure in a mission critical function. How is that not a bad design? If you want a robust design then the flight computers should not blindly trust a mission-critical value provided by a single outside source. Not if there is any way to validate that value or run a sanity check.
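Purely as a sketch of the kind of sanity check I mean (the bounds, names, and fallback below are invented, not anything from Starliner's real software):

#include <stdbool.h>

/* Plausibility bounds for a mission-elapsed-time value received from a
   single external source (the booster). Illustrative numbers only. */
#define MET_MIN_PLAUSIBLE_S 0.0
#define MET_MAX_PLAUSIBLE_S (2.0 * 60.0 * 60.0)   /* no ascent phase lasts two hours */

bool met_is_plausible(double met_s)
{
    return met_s >= MET_MIN_PLAUSIBLE_S && met_s <= MET_MAX_PLAUSIBLE_S;
}

/* If the value fails the check, don't blindly use it; fall back to an
   internally maintained estimate (hypothetical). */
double choose_met(double met_from_booster, double met_internal_estimate)
{
    return met_is_plausible(met_from_booster) ? met_from_booster
                                              : met_internal_estimate;
}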
As for a thruster not operating after the anomaly:
The report explicitly stated that the one thruster didn't operate at all, as in never.
To add logic to the software to handle things that are not possible just adds extra complexity and therefore extra opportunities for even more bugs.
Not HANDLING things that are not possible makes sense.
But DETECTING things that are "impossible" is very common. Much, if not most, important code is filled with "assertions", which verify that the assumptions the programmers made are in fact true. In non-critical code, if an assertion fails the program halts. In critical code it at least gives an error message, and maybe then forces a switch back to a simpler "safe mode".
For example (purely hypothetical, I know nothing of their code), say they have two ways to determine mission elapsed time - one from the Atlas, and one from the break wire on the umbilical. Then the assertion would read something like
IF abs(met_from_breakwire - met_from_atlas) > 10 seconds THEN print("Clocks do not agree!")
perhaps followed by setting both clocks to the value deemed more reliable.
I disagree with your characterization of what an assertion is. I think you're conflating assertions with error handling.
In my experience, assertions are used for things that are not meant to be checked in production code. Assertions are generally checked only in testing and then in production builds there is no assertion checking.
What you describe is indeed one use of assertions, but it's not the only one. Often they are left in production - at least then if the system fails you know *why*, instead of having a sometimes difficult debugging problem.
For example, see "Assertion Checkers in Verification, Silicon Debug and In-Field Diagnosis". In this case, the assertions are built into the hardware, and left in for production versions.
They specifically state this is helpful in high reliability situations. If an assertion fails, you shut down that unit and switch to a redundant processor. (This is for hardware errors, not software bugs.) More obviously, it finds error situations that occur in the field that are not uncovered by your tests - exactly what happened here.
Here's an example of leaving software asserts in production code:
What they found was that the number of crashes did not change much, but the cardinality (the number of distinct crash types) went down significantly. The learning was that code executing past a disabled assertion may be in one of n different bad states, each of which might lead to a different type of crash. They now had better high-level information about what was causing crashes (knowing which asserts were violated), and it helped them reduce their crash rate much more than raw crash reports without asserts had (including cases where the crash was in iOS code, not app code).
So leaving assertions in production code, and even silicon, is a known practice. I suspect you've personally seen evidence of this, when some program you are using emits an error message "This can't happen". I've certainly seen enough of these.
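A rough sketch of what "leaving the assertion in" can look like in C (the macro name and the safe-mode hook are my own invention, just to show the pattern):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fail-safe hook: a real system might switch to a redundant
   unit or a simpler control mode instead of exiting. */
static void enter_safe_mode(const char *why)
{
    fprintf(stderr, "entering safe mode: %s\n", why);
    exit(EXIT_FAILURE);
}

/* Unlike the standard assert(), this check is NOT compiled out in release
   builds, so a field failure reports which assumption broke instead of
   letting execution wander into an arbitrary bad state. */
#define ALWAYS_ASSERT(cond)                                        \
    do {                                                           \
        if (!(cond))                                               \
            enter_safe_mode("assertion failed: " #cond);           \
    } while (0)

double divide(double a, double b)
{
    ALWAYS_ASSERT(b != 0.0);
    return a / b;
}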
*snip*
Look, I'm just going to say it. The only reason Boeing didn't do an in-flight abort was to save money by not having to purchase another Atlas V and cover all the other costs associated with an IFA mission.
*snip*
You do not have to use the launch vehicle to do an in-flight abort test. All other in-flight abort tests prior to SpaceX's used a much cheaper stand-in rocket booster.
Correct. Apollo and, more recently, Orion (2019-07-02) used a stand-in LV for their tests. Dragon is the only spacecraft to ever use its designated standard LV to perform an IFA test. Boeing could easily have done the same as Apollo and Orion, but chose not to.
My problem is not that they picked the wrong time - S### happens.
My problem (and for a lot of others) is just how the vehicle responded to that error.
Or, more glaringly, that the vehicle was incapable of recognizing that an error even existed.