How is there a test regime where you didn't put "a lot of duty cycles" on the RCS system?
You don't test too far beyond your design criteria.
Let's say, for sake of argument, that you design the thrusters for a duty cycle of 25% for no longer than 2 minutes. Maybe you test them at 40% for 4 minutes and, if they're fine, you say they meet the design criteria. Then you operate them at 80% for 10 minutes and they fail. Well, that was outside the design - and testing - criteria.
In case it's not clear, I made up all those numbers for illustration purposes.
I suspect that "duty cycle" is high-level engineering managerspeak for "it got used a lot".
No, duty cycle is very low level engineering speak for ratio between time on and total time. 25% would mean it's on for 25% of the time.
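For anyone who wants the definition spelled out, here is a minimal sketch of that ratio. The function name and the 30 s / 120 s figures are my own illustration, in the same made-up spirit as the numbers upthread, not anything from the actual thruster spec.

```python
def duty_cycle(on_time_s: float, total_time_s: float) -> float:
    """Return the fraction of the total window during which the thruster is on."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    if not 0 <= on_time_s <= total_time_s:
        raise ValueError("on_time_s must lie between 0 and total_time_s")
    return on_time_s / total_time_s


# Illustrative numbers only: 30 s of firing inside a 120 s window.
print(duty_cycle(30.0, 120.0))  # 0.25, i.e. a 25% duty cycle
```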
I'm aware. But I'd be surprised if that's what Chilton meant: he talked about putting a lot of "duty cycles" (plural) on the thrusters. Pretty sure that what he meant is that they'd fired a lot of times, which isn't the level of semantic rigor that we'd expect from our lofty perch as nerd kibitzers.
I've had this exact thing happen to me. I design an actuator system to handle the expected loads with substantial margin, then an unstable controller decides to push the actuators to their absolute limits - continuously - and after a brief time, they trip. Then I fix the unstable controller and they never come anywhere close to tripping ever again.
Ah, the joys of engineering a robust system.
I'm just a software geek, but that basic bug pattern happens all the time. You discover some stupid, really unlikely bug has revealed a major problem that you hadn't thought about at all. Then you fix 'em both, and the system is a lot more bulletproof at the end of the process.
The problem is when the testing doesn't turn up either the stupid bug or the serious one. That's a test design problem.
FWIW my professional background is software development in safety-related domains (such as aerospace, rail, air traffic management).
I have no trouble believing that the fix to the specific issue on OFT (reading the wrong parameter) is very simple.
My concern is root cause and fault tolerance. From what we know right now it sounds as if the wrong parameter was just accepted and acted on. Does Starliner do any checks to see if the value is credible / makes sense? Why didn’t the software identify that the value implied post-insertion burn and yet no insertion burn had taken place?
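To make the kind of check I mean concrete, here is a rough sketch. Everything in it is hypothetical: the names, the thresholds and the single "insertion burn done" flag are mine, and I have no idea how Starliner's flight software is actually structured. The point is only that a received mission elapsed time can be cross-checked against what the vehicle knows has physically happened.

```python
from dataclasses import dataclass
from typing import List

# All numbers are invented for illustration; none are real Starliner values.
MAX_MISSION_MET_S = 48 * 3600.0    # hypothetical upper bound on mission length
INSERTION_BURN_END_MET_S = 2000.0  # hypothetical time by which the burn is over


@dataclass(frozen=True)
class VehicleState:
    """Events the spacecraft has itself observed (hypothetical, minimal)."""
    insertion_burn_done: bool


def check_met(met_s: float, state: VehicleState) -> List[str]:
    """Return a list of reasons the received MET looks implausible (empty if none)."""
    problems = []
    # Range check: is the value physically possible at all?
    if not 0.0 <= met_s <= MAX_MISSION_MET_S:
        problems.append("MET is outside the possible range")
    # Cross-check: does the value agree with events we know have happened?
    if met_s > INSERTION_BURN_END_MET_S and not state.insertion_burn_done:
        problems.append("MET implies post-insertion, but no insertion burn recorded")
    return problems


# A clock value hours into the mission, on a vehicle that has not yet burned.
print(check_met(40000.0, VehicleState(insertion_burn_done=False)))
```

What the software should do once a check like this fails (hold, fall back to another time source, ask the ground) is the harder design question, and the sketch deliberately leaves it open.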
After a (likely) simple fix I’d expect this particular issue to never arise again. But what does Starliner’s response imply about how it may react to a whole load of other conceivable anomalies? Is this failure evidence of a wider design problem? Even answering that question may take some time, let alone addressing any further issues that are identified.
I wouldn’t fly Starliner again (crewed or uncrewed) without being confident that this isn’t evidence of a much broader failing. It’ll be interesting to see what ASAP etc have to say about this.
To me the strangest thing about this incident is the 8-min communication gap with the TDRSS network. There are so many TDRSS satellites that there should be complete coverage. I suspect this was because the relevant satellite was occupied with NRO spy satellite traffic. In joint civil/military systems, military demands always have priority.
No. They were clear in the presser today that a significant portion of the blackout was because the spacecraft's antennae were not pointed at the satellites because the spacecraft was out of position. Comms were not possible even when the spacecraft came out of the dead spot until they got the antennae pointed in the right direction.
We expect this system to have comms at all times with TDRSS. Contributing factors [to the problem] are: we got off the Atlas V not where we expected to be, and the spacecraft... this is a point in the mission where we tell the spacecraft where it is, not where it opens its eyes and looks, which is most of the rest of the mission. It was not where it expected to be, it was further from TDRSS than it thought. It was also starting to move between satellites, which on its own can take a bit more time to get a link, but it doesn't mean you shouldn't have a link. From an attitude perspective, there's a series of antennas around the vehicle, and because the vehicle was not where we expected it to be, and it was not where it thought it was, it wasn't pointing the antennas quite right at TDRSS. So you add those factors together and it took a little more time to connect [the 8 minutes] than we expected.
After a (likely) simple fix I’d expect this particular issue to never arise again. But what does Starliner’s response imply about how it may react to a whole load of other conceivable anomalies? Is this failure evidence of a wider design problem? Even answering that question may take some time, let alone addressing any further issues that are identified.
I guess what you and other software engineers in this thread are concerned about is that this bug is like a cockroach - that when you see one, there could be another 50 lurking unseen.
Just a wild guess, but they may have run full simulations where inputs were faked to make the system think it was actually in flight. That's how they test new airliners. That could mean they screwed up in returning everything to flight configuration.
That is the most straightforward and most likely explanation I've heard so far. I guess we will see how deep the rabbit hole goes. (Matrix reference).
Spot on. As an ex-software head who used to do control systems I am stunned at how dumb this software system is. As someone else said, it's basically an egg timer and as dumb as one. That has me astonished. Maybe this is how it's always been done on spacecraft but it means that one hiccup along the way can cascade into much bigger problems. Not good enough for a manned spacecraft.
Sometimes dumb/simple is good. See KISS principle. At least one Shuttle payload with a giant solid rocket motor (Transfer Orbit Stage on ACTS mission) used a simple MET clock triggered by separation from Shuttle for the entire mission sequencing. However, there were actually three MET clocks for dual fault tolerance, with a majority-voting scheme for critical events like rocket motor ignition.
That was a man-rated software/avionics design, because the system had to be two-fault-tolerant against the MET clock sequencing igniting the rocket motor too close to Shuttle. So it was good enough to be man-rated after extreme scrutiny by the NASA/JSC safety panel.
Again, that was just three MET clocks, initiated by the separation event, with majority voting to null the vote of one errant clock.
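For the software folks, the voting idea looks roughly like this. It is a sketch under my own assumptions, not the actual TOS avionics logic: the function name, the 0.5 s agreement tolerance and the sample readings are all invented for illustration.

```python
from typing import Optional, Sequence


def voted_met(readings_s: Sequence[float], tolerance_s: float = 0.5) -> Optional[float]:
    """Return a MET value that a strict majority of clocks agree on, else None.

    Two clocks "agree" when their readings differ by at most tolerance_s.
    """
    for candidate in readings_s:
        agreeing = sorted(r for r in readings_s if abs(r - candidate) <= tolerance_s)
        if 2 * len(agreeing) > len(readings_s):
            # Use the middle of the agreeing group so no single clock dominates.
            return agreeing[len(agreeing) // 2]
    return None  # no majority: do not act on any clock


# Two healthy clocks outvote one errant clock.
print(voted_met([120.0, 120.0, 980.0]))  # 120.0
# All three disagree, so nothing is trusted.
print(voted_met([1.0, 50.0, 900.0]))     # None
```

Note what this does and does not buy you: it nulls one clock that drifts or fails, but if all three clocks are started from the same bad value, they will agree with each other perfectly.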
It was good enough for Shuttle man-rated payloads, so it wouldn't surprise me if CST-100 used a similar MET scheme.
The problem here apparently was the initialization of the MET clock(s), which apparently wasn't done by a separation event, but by actively pulling a wrong parameter from Atlas.
So the problem isn't necessarily the simplicity of the software system, but the bug that allowed it to initialize on an incorrect parameter.
The fact that the error apparently happened at an interface (between two different pieces of hardware, and between two different contractors) suggests just as much a systems engineering problem, where one guy/gal on one side of the wall misunderstands how the other guy/gal's avionics/software works, or someone misses an interface control document change. When the two pieces of hardware/software come together, interface errors are found, but not always on the ground.
Is that really a good comparison, though? You're talking about a MET for a single-event system - ignition of a solid rocket motor, as well as presumably guidance during firing, all during a very defined/rigid trajectory. A manned spacecraft that has to run complex sequences of motor/RCS firings to rendezvous with the space station, dock, handle abort modes, etc. is far more complicated, no?
Yes, MET is MET. It just needs a good reliable starting point. What the spacecraft does with it is totally different.