-
#580
by
SoftwareDude
on 10 Jan, 2020 23:23
-
A normal hiccup in software? I've never heard of such a thing.
"Normal" hiccup in the sense that there are almost always some pieces of code that misbehave in its FIRST real-world testing. Testing in the lab will only test those scenarios that the software engineers can think of. It does not test those scenarios that they didn't think of. Those only get tested when it actually flies for the very 1st time. And there is almost always some test scenario that the engineers didn't think of. At which time the optimum phrase becomes "oh crap, we didn't think of that".
I guess the way I think about this point is as the answer to the question: was the software appropriately tested before someone paid $100 million for what many would call a dress rehearsal, or in software parlance a "staging" test? There is here, as is common with software, a tension between the schedule and the software being ready and appropriately tested.
Vice President Pence said that any further delays in the schedule would not be tolerated. Was Boeing ready and whose fault was that? I don't know the answer but two tests in a row point to "not ready".
-
#581
by
Vettedrmr
on 11 Jan, 2020 00:16
-
"Normal" hiccup in the sense that there are almost always some pieces of code that misbehave in it's FIRST real-world testing. Testing in the lab will only test those scenarios that the software engineers can think of. It does not test those scenarios that they didn't think of. Those only get tested when it actually flies for the very 1st time. And there is almost always some test scenario that the engineers didn't think of. At which time the optimum phrase becomes "oh crap, we didn't think of that".
Our safety-critical software/system testing was always focused on "make sure we test every function that can cause a Class A mishap (>$1M damage or fatality), and 100% of all interfaces that provide data to those functions." Things might cause a loss of mission, or force a re-fly, but we didn't lose a bird. So, in Starliner's case, it might be determined that safety-of-flight requirements were validated and CFT can proceed with additional requirements that didn't get validated on OFT.
BUT, the interface error that caused Starliner to get the wrong data from Atlas is a really serious one, IMO. It obviously is used to make major changes to the ship's configuration, and this kind of error is *exactly* what system integration testing is supposed to ferret out.
Here's hoping Starliner has a successful 2nd flight.
Have a good one,
Mike
-
#582
by
SoftwareDude
on 11 Jan, 2020 00:26
-
I worked on software where an outage cost $5,000 per minute, 24x7, and in other cases $2 million plus 1,000 minutes of downtime. No lives were ever at risk from our software. We spent millions of dollars building a complete automated test stand that included simulations of all mechanical devices and deployment. The complete end-to-end tests ran every night with a build. It involved tens of millions of lines of code.
It worked the first time, every time. Was it more complicated than a rocket? Yes. Were there timing issues? Yes. Time-critical meant 100 ms. Did our control room look a lot like NASA's or SpaceX's? Yes.
-
#583
by
clongton
on 11 Jan, 2020 00:31
-
Here's hoping Starliner has a successful 2nd flight.
Have a good one,
Mike
The one thought in the back of my head about the OFT mission fail was that the software itself had no idea that it was not functioning correctly. It doesn't appear that there was any checksum functioning. It took some pretty smart people on the ground to figure out a way to override the avionics and save what they could of the mission. Obviously, no one ever thought about "what if the MET is wrong?"
It reminds me of the brilliant way that people on the ground saved the Apollo 13 crew when they had to figure out a way to use the CM's CO2 scrubber canisters in the LM by creating adapters out of, among other things, the flight plan covers. That was also a case of "oh crap, we never thought of that". Nobody ever thought about the LM being used as a lifeboat and what would need to be common to make that work.
And that's the unavoidable danger in "safety-critical software/system testing". It requires that the engineers think of everything, and of course they rarely do, no matter how many years they've been doing this. This same danger lurks in the avionics of Dragon as well. In both Starliner and Dragon we will not become comfortable with the reliability of the software until many actual flights prove out, in the real world, that it's functioning correctly. And there's no way to avoid that. You do the best you can, mitigate everything you can think of, and then roll the dice and hope for the best.
OFT was lucky that the people on the ground were smart and quick enough to be able to salvage the mission. And that's what bothers me: the software didn't even know it was out of sync. IMO there should have been some algorithms running that would have indicated that, and immediately called home for help. Something similar to our Mars landers switching into safe mode when they detect that something is wrong. OFT seemed incapable of detecting that something was wrong, and that's a worry. One can only hope that the same lack doesn't also infect Dragon. Only time, and flights, will tell.
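Just to make concrete the kind of plausibility check I mean, here's a toy sketch in C. Every name and the tolerance are invented; it's only the shape of the idea, not anything from the actual avionics:

#include <stdio.h>
#include <math.h>

/* Hypothetical sketch, not Boeing's code: cross-check the mission elapsed
   timer against an independently derived elapsed time (e.g., time since a
   sensed liftoff event) and call home if they diverge. */

#define MET_TOLERANCE_S 2.0   /* assumed acceptable disagreement, seconds */

typedef enum { TIMER_OK, TIMER_FAULT } timer_status;

timer_status check_met(double met_s, double independent_elapsed_s)
{
    if (fabs(met_s - independent_elapsed_s) > MET_TOLERANCE_S)
        return TIMER_FAULT;   /* out of sync: flag it, request help */
    return TIMER_OK;
}

int main(void)
{
    /* Roughly the OFT situation: MET says ~11 hours, sensed elapsed ~31 min */
    double met_from_booster = 11.0 * 3600.0;
    double elapsed_since_liftoff = 31.0 * 60.0;

    if (check_met(met_from_booster, elapsed_since_liftoff) == TIMER_FAULT)
        printf("MET implausible: call home / consider safing\n");
    return 0;
}

Something that cheap, running in the background, would at least have known the clock made no sense.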
-
#584
by
LouScheffer
on 11 Jan, 2020 00:58
-
The one thought in the back of my head about the OFT mission fail was that the software itself had no idea that it was not functioning correctly.
This is an interesting point. On an un-crewed mission, a thruster overheating would probably trigger "safe mode", where the spacecraft stops whatever it is doing, tries to stabilize into a known attitude that is power positive and thermally OK, then establishes communications and waits for commands. On the other hand, safe mode is often disabled during critical maneuvers, where the risk of not acting seems higher than the risk of malfunction.
So does CST-100 have "safe mode" software? If so, could it have detected the malfunction but not enabled safe mode, since the spacecraft believed it was in the middle of a critical event? Lots of possibilities here which are hard to sort out without knowing how the software was designed....
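For illustration only, the gating logic I'm describing is something like this little C sketch. The names are invented and this has nothing to do with CST-100's actual design:

#include <stdbool.h>
#include <stdio.h>

/* Sketch: faults normally trigger safe mode, but safing is inhibited
   during critical maneuvers such as an orbit-insertion burn. */

typedef struct {
    bool fault_detected;
    bool in_critical_event;   /* e.g., insertion burn in progress */
} vehicle_state;

bool should_enter_safe_mode(const vehicle_state *v)
{
    return v->fault_detected && !v->in_critical_event;
}

int main(void)
{
    vehicle_state v = { .fault_detected = true, .in_critical_event = true };
    if (should_enter_safe_mode(&v))
        printf("entering safe mode\n");
    else
        printf("fault logged, safing inhibited during critical event\n");
    return 0;
}

If OFT's software thought it was mid-burn (because of the bad MET), logic like this would have sat on its hands even if a fault had been flagged.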
-
#586
by
laszlo
on 11 Jan, 2020 01:53
-
... Nobody ever thought about the LM being used as a lifeboat and what would need to be common to make that work...
Actually, according to Michael Collins, that scenario was considered and planned for. NASA had contingency binders with procedures and checklists for that and many other scenarios. You're right, there were some points that were specific to the way 13 failed that required further work, like the CO2 problem, the exact consumables budget, and the CM power up checklist, but in general much of the planning was done beforehand.
-
#587
by
clongton
on 11 Jan, 2020 03:14
-
Don't know if this was already posted, but our friend Wayne Hale had an interesting anecdote from his experience on STS-1:
Reminded of April 10, 1981 when a software timing error delayed the first space shuttle launch - an error that went undetected in ten thousand hours of software testing.
There's another example of the kind of thing I labeled upthread as a "normal hiccup". As good and experienced as they are, the software engineers rarely catch everything. It takes executing in the real world to ferret the hidden bugs out.
-
#588
by
Comga
on 11 Jan, 2020 03:57
-
(snip)
Vice President Pence said that any further delays in the schedule would not be tolerated.
Was Boeing ready and whose fault was that? I don't know the answer but two tests in a row point to "not ready".
Pence said that?
When?
I doubt that is very helpful.
People work best under some pressure, but no one does careful, intellectual, technical work well under a political whip.
-
#589
by
woods170
on 11 Jan, 2020 14:01
-
(snip)
Vice President Pence said that any further delays in the schedule would not be tolerated.
Was Boeing ready and whose fault was that? I don't know the answer but two tests in a row point to "not ready".
Pence said that?
When?
I doubt that is very helpful.
People work best under some pressure, but no one does careful, intellectual, technical work well under a political whip.
Pence stated something along those general lines. He was NOT specifically referring to CCP.
-
#590
by
Chasm
on 11 Jan, 2020 20:06
-
I've been thinking a bit more about the mission elapsed timer.
With the 11-hour offset mentioned, my going theory is that Starliner polled the uptime value (counted from when the computer booted) instead of the MET.
Somewhere upthread it was said that this would be roughly 30 minutes off, but the press conference value might be rounded down, or perhaps a reset command gets sent to the computers early in the launch sequence.
I think the problem goes beyond software.
We saw a lot of people sitting at consoles looking at Starliner data. They are not launch commentators, so presumably it's live telemetry data.
Why did nobody notice that Starliner, sitting on the pad, was already several hours into the mission?
Human error/oversight?
Or is the MET not displayed on any of the consoles?
Or do different Starliner systems pull different data?
-
#591
by
ChrisWilson68
on 11 Jan, 2020 20:10
-
I've been thinking a bit more about the mission elapsed timer.
With the 11-hour offset mentioned, my going theory is that Starliner polled the uptime value (counted from when the computer booted) instead of the MET.
Somewhere upthread it was said that this would be roughly 30 minutes off, but the press conference value might be rounded down, or perhaps a reset command gets sent to the computers early in the launch sequence.
I think the problem goes beyond software.
We saw a lot of people sitting at consoles looking at Starliner data. They are not launch commentators, so presumably it's live telemetry data.
Why did nobody notice that Starliner, sitting on the pad, was already several hours into the mission?
Human error/oversight?
Or is the MET not displayed on any of the consoles?
Or do different Starliner systems pull different data?
I would bet that Starliner doesn't set its MET until it is about to separate from Atlas, and only at that point does it read the value from Atlas. So, there's no bad data to see on any console until separation time.
-
#592
by
ChrisWilson68
on 11 Jan, 2020 20:19
-
For example, all that up-thread ripping of "the state machine" and we don't even know if there actually is one. They may have implemented a stateful system, but not as a state machine. Or not. The point is, we don't know.
Pardon my design-pattern prejudice, but implementing a stateful software system as anything other than a formal FSM is a recipe for an untestable wad of spaghetti code that's eventually going to bite you.
That's wrong. So very, very wrong.
When the number of states is small, implementing it as an explicit single finite state machine can be the best way to avoid bugs. When the number of states is large enough, it becomes impractical for humans to go through all the states and transitions explicitly, and even when it is of a size where that's possible, it can be more error-prone than structuring the code in another way, such as with multiple state machines and/or other data-storage and algorithm paradigms.
Where exactly the boundary is depends on the person, because it's inherently about what makes a particular person more or less likely to let bugs get through.
As a trivial example of a stateful system that can't practically be implemented as a single explicit finite state machine, take a modern cell phone. It is a finite state machine. But it has something on the order of 2^(2^35) states, and nobody would try to program it as an explicit finite state machine.
I have seen the results of misguided people who took explicit finite state machines too far and made huge state tables by hand that were harder to read and more bug-prone than other abstractions would have been.
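To be clear about the small-N case where an explicit table does shine, here's a toy C sketch with invented flight phases, purely to show the shape. With three states and two events you can audit every cell at a glance; hand-build this with thousands of states and you get exactly the bug farm I'm describing:

#include <stdio.h>

/* Toy explicit transition table: next_state[current][event]. A state
   stays put on an event that doesn't apply to it. */

typedef enum { PRELAUNCH, ASCENT, ORBIT, NUM_STATES } state_t;
typedef enum { EV_LIFTOFF, EV_SEPARATION, NUM_EVENTS } event_t;

static const state_t next_state[NUM_STATES][NUM_EVENTS] = {
    /* EV_LIFTOFF  EV_SEPARATION */
    {  ASCENT,     PRELAUNCH },   /* PRELAUNCH */
    {  ASCENT,     ORBIT     },   /* ASCENT    */
    {  ORBIT,      ORBIT     },   /* ORBIT     */
};

int main(void)
{
    state_t s = PRELAUNCH;
    s = next_state[s][EV_LIFTOFF];      /* -> ASCENT */
    s = next_state[s][EV_SEPARATION];   /* -> ORBIT  */
    printf("final state: %d\n", s);     /* prints 2 (ORBIT) */
    return 0;
}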
-
#593
by
TheRadicalModerate
on 12 Jan, 2020 18:25
-
For example, all that up-thread ripping of "the state machine" and we don't even know if there actually is one. They may have implemented a stateful system, but not as a state machine. Or not. The point is, we don't know.
Pardon my design-pattern prejudice, but implementing a stateful software system as anything other than a formal FSM is a recipe for an untestable wad of spaghetti code that's eventually going to bite you.
That's wrong. So very, very wrong.
When the number of states is small, implementing it as an explicit single finite state machine can be the best way to avoid bugs. When the number of states is large enough, it becomes impractical for humans to go through all the states and transitions explicitly, and even when it is of a size where that's possible, it can be more error-prone than structuring the code in another way, such as with multiple state machines and/or other data-storage and algorithm paradigms.
Where exactly the boundary is depends on the person, because it's inherently about what makes a particular person more or less likely to let bugs get through.
As a trivial example of a stateful system that can't practically be implemented as a single explicit finite state machine, take a modern cell phone. It is a finite state machine. But it has something on the order of 2^(2^35) states, and nobody would try to program it as an explicit finite state machine.
I have seen the results of misguided people who took explicit finite state machines too far and made huge state tables by hand that were harder to read and more bug-prone than other abstractions would have been.
Sounds like you've seen a lot of bad FSM implementations. As with all software engineering, if you choose the wrong level of abstraction, you get unmanageable code.
BTW, if a cell phone has an FSM with more than about five states and five input events for its call management behavior, it's been implemented by an idiot.
-
#594
by
SoftwareDude
on 13 Jan, 2020 02:10
-
I've been thinking a bit more about the mission elapsed timer.
With the 11-hour offset mentioned, my going theory is that Starliner polled the uptime value (counted from when the computer booted) instead of the MET.
Somewhere upthread it was said that this would be roughly 30 minutes off, but the press conference value might be rounded down, or perhaps a reset command gets sent to the computers early in the launch sequence.
I think the problem goes beyond software.
We saw a lot of people sitting at consoles looking at Starliner data. They are not launch commentators, so presumably it's live telemetry data.
Why did nobody notice that Starliner, sitting on the pad, was already several hours into the mission?
Human error/oversight?
Or is the MET not displayed on any of the consoles?
Or do different Starliner systems pull different data?
I would bet that Starliner doesn't set its MET until it is about to separate from Atlas, and only at that point does it read the value from Atlas. So, there's no bad data to see on any console until separation time.
I personally don't do it this way, but my understanding of most real-time clock implementations is that the first call initializes the clock: it reads the system clock ticks, stores that as time zero, and returns a time structure with time == 0. It makes no sense to take the ticks from a different system's clock and start with that, because that would require two hardware clocks to be exactly in sync. The accuracy of the clock depends on the amount of real time available to it. I don't know what Boeing did, but what they are saying doesn't make a whole lot of sense.
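As a sketch of what I mean (POSIX-flavored C, illustrative only, obviously not Boeing's code): the first call latches the local tick count as the epoch, so MET is zero right there, on that computer's own clock. Seeding it from some other box's tick counter instead is exactly how you'd end up with a large constant offset:

#include <stdio.h>
#include <time.h>

/* Sketch: first call stores the local monotonic time as the MET epoch;
   every later call returns seconds elapsed since then. */
static double met_seconds(void)
{
    static int initialized = 0;
    static struct timespec epoch;
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    if (!initialized) {
        epoch = now;          /* first call: MET == 0, here and now */
        initialized = 1;
    }
    return (now.tv_sec - epoch.tv_sec) +
           (now.tv_nsec - epoch.tv_nsec) / 1e9;
}

int main(void)
{
    printf("MET at init: %.3f s\n", met_seconds());  /* ~0.000 */
    return 0;
}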
-
#595
by
AJW
on 13 Jan, 2020 04:37
-
I've been thinking a bit more about the mission elapsed timer.
With the 11-hour offset mentioned, my going theory is that Starliner polled the uptime value (counted from when the computer booted) instead of the MET.
...
If this theory is correct, it raises the question of why this or any payload would need access to the Atlas startup time at all. If it was indeed in the interface, and was then mistakenly applied, its very presence led to this flight missing key objectives.
I always find that this quote by an early aviator about aircraft design also applies to good software and UI design.
'In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away...' - Antoine de Saint-Exupéry, Wind, Sand and Stars
-
#596
by
SWGlassPit
on 13 Jan, 2020 14:04
-
I have seen the results of misguided people who took explicit finite state machines too far and made huge state tables by hand that were harder to read and more bug-prone than other abstractions would have been.
I did that once, implementing an SLR(1) parser by hand from the grammar before I understood how to write a parser generator.
Turns out writing a parser generator is a *hell* of a lot easier.
-
#597
by
thirtyone
on 13 Jan, 2020 20:45
-
Don't know if this was already posted, but our friend Wayne Hale had an interesting anecdote from his experience on STS-1:
https://twitter.com/waynehale/status/1208040666460241920
So I should note that given all the information, both public and semi-public, that has come out, I find it unlikely that this is one of those difficult "timing" bugs. Such "timing" bugs are usually those where multiple systems need to basically be operating at exactly the same phase as each other (we're talking sub-second timings) to work correctly.
1) This was hours off.
2) Genuine timing bugs are difficult to catch and difficult to fix because they tend to be very complicated. You don't find them within hours with a cursory look at the code.
3) Within a day a Boeing SVP said:
This looks like we reached in there and grabbed the wrong coefficient. More to learn there, but it’s not more complicated than that. We started the clock at the wrong time
So I'd really not put this in the same category as a difficult timing bug that got missed even with tens of thousands of hours of testing. Perhaps someone else in software could tell me if there's any other possible guess as to what actually happened? To me, the speed with which they reached a conclusion about where the problem was, and its simplicity, removes a lot of possibilities. It looks like they pulled the wrong parameter from Atlas, and all of the system tests they developed simulated the parameter incorrectly as well. IMO the closest analogous space failure is probably the Mars Climate Orbiter, where one team misinterpreted the specifications at an interface between systems, and continued to run all of their tests with those invalid assumptions (wrong units) without any higher system-level test to verify the results.
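A contrived C sketch of that failure pattern, with made-up numbers and names, just to show how a shared wrong assumption sails through every test:

#include <stdio.h>

/* Real producer (e.g., thruster team): reports impulse in pound-force-sec */
double producer_impulse(void) { return 100.0; /* lbf-s */ }

/* Test stub, written by the consumer's team with THEIR assumption baked in */
double stubbed_impulse(void)  { return 444.8; /* already N-s: wrong stub */ }

/* Consumer: assumes the value is in newton-seconds */
void consume(double impulse_Ns) { printf("using %.1f N-s\n", impulse_Ns); }

int main(void)
{
    consume(stubbed_impulse());   /* system test: passes, hides the bug    */
    consume(producer_impulse());  /* first real flight: off by factor 4.45 */
    return 0;
}

No amount of testing against your own stub will ever catch it; only a test against the real interface (or the real booster) does.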
-
#598
by
Vettedrmr
on 13 Jan, 2020 21:32
-
Perhaps someone else in software could tell me if there's any other possible guess as to what actually happened? To me, the speed with which they reached a conclusion about where the problem was, and its simplicity, removes a lot of possibilities. It looks like they pulled the wrong parameter from Atlas, and all of the system tests they developed simulated the parameter incorrectly as well. IMO the closest analogous space failure is probably the Mars Climate Orbiter, where one team misinterpreted the specifications at an interface between systems, and continued to run all of their tests with those invalid assumptions (wrong units) without any higher system-level test to verify the results.
In my old line of work, an Interface Control Document (ICD) is used to define all of the interfaces, both analog and digital, between systems.
ASSUMING that ULA and Starliner's team have something similar, one possible explanation could be that there were different versions of the document that changed the interface addresses (a big no-no unless VERY WELL coordinated). Or, perhaps the scaling of the parameter was changed and it wasn't communicated to the software engineers.
In either case, system integration testing between Atlas and Starliner should have caught this. It's not feasible to test all paths through a piece of software, BUT we always tested ALL of the interfaces between safety-critical systems.
Not saying this is what happened, but it does fit with what has been stated, to my knowledge.
HTH, and have a good one,
Mike
-
#599
by
TheRadicalModerate
on 14 Jan, 2020 06:52
-
So I should note that given all the information, both public and semi-public, that has come out, I find it unlikely that this is one of those difficult "timing" bugs. Such "timing" bugs are usually those where multiple systems need to basically be operating at exactly the same phase as each other (we're talking sub-second timings) to work correctly.
1) This was hours off.
2) Genuine timing bugs are difficult to catch and difficult to fix because they tend to be very complicated. You don't find them within hours with a cursory look at the code.
3) Within a day a Boeing SVP said:
This looks like we reached in there and grabbed the wrong coefficient. More to learn there, but it’s not more complicated than that. We started the clock at the wrong time
So I'd really not put this in the same category as a difficult timing bug that got missed even with tens of thousands of hours of testing. Perhaps someone else in software could tell me if there's any other possible guess as to what actually happened? To me, the speed with which they reached a conclusion about where the problem was, and its simplicity, removes a lot of possibilities. It looks like they pulled the wrong parameter from Atlas, and all of the system tests they developed simulated the parameter incorrectly as well. IMO the closest analogous space failure is probably the Mars Climate Orbiter, where one team misinterpreted the specifications at an interface between systems, and continued to run all of their tests with those invalid assumptions (wrong units) without any higher system-level test to verify the results.
Mostly agree with this, but in fairness the STS-1 bug (which I saw from a fairly lowly position on the Rockwell side, as IBM tried to blame us) was ultimately a case of real vehicle wires not being quite the same as testbed or simulated wires. It stemmed from a bit more interrupt skew on the Orbiter than IBM saw in the lab, which caused the primary avionics software system to put an entire minor cycle's worth of I/O from one of the flight strings into the wrong minor cycle during startup. That's why all 4 primary GPCs picked it up, and why the backup flight system refused to sync. (Why you were interrupt-driving an I/O system that was so rigidly scheduled that it was essentially polled is a different rant...)
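A toy way to see how little skew it takes (invented frame length, nothing like the real PASS internals):

#include <stdio.h>

#define MINOR_CYCLE_MS 40   /* assumed frame length, purely illustrative */

/* Which minor cycle does an I/O completion get binned into? */
static int binned_cycle(int issue_cycle, int latency_ms)
{
    int completion_ms = issue_cycle * MINOR_CYCLE_MS + latency_ms;
    return completion_ms / MINOR_CYCLE_MS;
}

int main(void)
{
    /* In the lab: latency comfortably inside the frame. */
    printf("lab:    issued cycle 5, lands in cycle %d\n", binned_cycle(5, 38));
    /* On the vehicle: a few ms more skew pushes it into the next frame. */
    printf("flight: issued cycle 5, lands in cycle %d\n", binned_cycle(5, 42));
    return 0;
}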
I have no clue whether the MET bug was a bad address, a bad wire, or simply bad noise. But it is true that it's really hard to simulate an entire Atlas V in your lab, to say nothing of all the weird EM events it's subject to on the pad. Real hardware is a lot noisier than your sims.