-
#200
by
Lee Jay
on 22 Dec, 2019 01:40
-
Another issue is that apparently the data the Starliner was looking for was just an arbitrary location within a larger set of data. The Starliner chose location 15 while the number it should have had was in location 17, or something of that sort.
Do we actually know that?
Timing can be quite tricky. What calendar are you using? If using GPS time, are you in the right epoch? Are you using the version that accounts for leap seconds, or not? Is the time measured in seconds, milliseconds, jiffies, cycles of some clock chip, or what?
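To make the epoch and leap-second questions concrete, here is a minimal Python sketch of converting a GPS week number and seconds-of-week to UTC. The function name and its parameters are made up for illustration, and it assumes a single fixed GPS-UTC offset of 18 seconds (the value in effect since 1 January 2017) and a legacy 10-bit week counter that wraps every 1024 weeks. Nothing here is meant to describe what Starliner or Atlas actually do.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)  # start of GPS week 0
WEEKS_PER_ROLLOVER = 1024    # legacy 10-bit week counter wraps every 1024 weeks
GPS_UTC_LEAP_SECONDS = 18    # GPS is ahead of UTC by 18 s since 1 Jan 2017


def gps_to_utc(week, seconds_of_week, rollovers=2,
               leap_seconds=GPS_UTC_LEAP_SECONDS):
    """Convert a (possibly wrapped) GPS week and seconds-of-week to UTC.

    `rollovers` says how many times the 10-bit week counter has wrapped;
    guessing it wrong shifts the answer by a multiple of 1024 weeks.
    """
    full_weeks = week + rollovers * WEEKS_PER_ROLLOVER
    gps_time = GPS_EPOCH + timedelta(weeks=full_weeks, seconds=seconds_of_week)
    # GPS time does not insert leap seconds, so subtract the offset for UTC.
    return gps_time - timedelta(seconds=leap_seconds)
```

With the defaults, `gps_to_utc(0, 0)` comes out as 2019-04-06 23:59:42 UTC. Picking the wrong epoch (`rollovers=1`) lands about 19.6 years earlier, and forgetting the leap seconds is an 18-second error, which is the kind of ambiguity the questions above are getting at.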
-
#201
by
butters
on 22 Dec, 2019 02:03
-
Another issue is that apparently the data the Starliner was looking for was just an arbitrary location within a larger set of data. The Starliner chose location 15 while the number it should have had was in location 17, or something of that sort.
A more modern design would send all data in JSON format, where each data field is explicitly labeled in the data stream. Sure, it takes more bandwidth to send the same data, and the sender and receiver need to be a bit more complicated. But modern computers have no problem with this. Modern software has no problem with it. The whole web is based on this concept. It's well worth it to avoid accidentally misinterpreting data.
As a developer who builds software on and off the web, I see where you're coming from, but string-based encodings like JSON only make sense sufficiently high up in the application stack where the serialization overhead is tolerable. Particularly with a data structure that stores a timer, you can probably see the problem with deserializing the JSON string, incrementing the value of the timer field, and then re-serializing the data on each clock tick inside an interrupt handler that needs to be blazing fast because it's blocking other tasks in the real-time system from running.
It wouldn't be performant or reasonable for the Linux kernel to store process control blocks in JSON, for example. But given the choice, in C, between an array of pointers and a struct with name-addressable fields, Linux wisely chose the latter. If Starliner used a numerically-indexed array to represent the state vector coming from Atlas V when it could have used a name-addressable data structure instead, with a clearly-defined field for the MET, then that would be a fair criticism.
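As a rough illustration of the "name-addressable fields over a fixed binary layout" idea, here is a small Python sketch using the standard `struct` module and a `namedtuple`. The layout and every field name (`frame_count`, `met_ms`, `status`, `vx`, `vy`, `vz`) are hypothetical; I have no idea what the real Atlas V state vector looks like. The point is only that the wire format stays compact and fixed-size while the code reads fields by name instead of by slot number.

```python
import struct
from collections import namedtuple

# Hypothetical layout: little-endian, three unsigned 32-bit integers
# followed by three 32-bit floats. No text parsing is involved.
STATE_FORMAT = "<III3f"
STATE_SIZE = struct.calcsize(STATE_FORMAT)  # fixed size in bytes

StateVector = namedtuple(
    "StateVector", "frame_count met_ms status vx vy vz"
)


def encode_state(state):
    """Pack a StateVector into its fixed-size binary form."""
    return struct.pack(STATE_FORMAT, *state)


def decode_state(packet):
    """Unpack a binary packet into a StateVector with named fields."""
    if len(packet) != STATE_SIZE:
        raise ValueError("unexpected packet length")
    return StateVector._make(struct.unpack(STATE_FORMAT, packet))


packet = encode_state(StateVector(7, 123456, 0, 1.5, -2.0, 0.25))
state = decode_state(packet)
elapsed_ms = state.met_ms  # by name, not "whatever is in slot 1"
```

The layout is still defined in exactly one place (`STATE_FORMAT` plus the field list), so an off-by-two slot error would have to be made there rather than at every call site.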
-
#202
by
Jim
on 22 Dec, 2019 02:09
-
I'm confused by this whole situation.
Why would Starliner need to grab MET from Atlas? It has an IMU on board (probably several - right?) so it should know when it launched, where it is and its orientation. It should have a way (break wire?) to detect spacecraft separation. So why does it need MET from Atlas and why would getting a wrong one affect, well, anything other than a possible error code generated to say "hey - this doesn't match our launch and separation detections"?
Any satellite experts want to educate me?
A break wire can tell when it launched
-
#203
by
groundbound
on 22 Dec, 2019 02:20
-
I have a question, asked out of complete ignorance. I'm looking to put some of the discussion in this thread into context.
How much visibility does ASAP have into the designs and processes being discussed in this thread?
-
#204
by
ChrisWilson68
on 22 Dec, 2019 02:33
-
Another issue is that apparently the data the Starliner was looking for was just an arbitrary location within a larger set of data. The Starliner chose location 15 while the number it should have had was in location 17, or something of that sort.
Do we actually know that?
From the mission discussion thread:
During the NASA teleconference today, someone said (approximately) "the spacecraft reached into the Atlas and grabbed the wrong parameter".
-
#205
by
ChrisWilson68
on 22 Dec, 2019 02:36
-
Another issue is that apparently the data the Starliner was looking for was just an arbitrary location within a larger set of data. The Starliner chose location 15 while the number it should have had was in location 17, or something of that sort.
A more modern design would send all data in JSON format, where each data field is explicitly labeled in the data stream. Sure, it takes more bandwidth to send the same data, and the sender and receiver need to be a bit more complicated. But modern computers have no problem with this. Modern software has no problem with it. The whole web is based on this concept. It's well worth it to avoid accidentally misinterpreting data.
As a developer who builds software on and off the web, I see where you're coming from, but string-based encodings like JSON only make sense sufficiently high up in the application stack where the serialization overhead is tolerable. Particularly with a data structure that stores a timer, you can probably see the problem with deserializing the JSON string, incrementing the value of the timer field, and then re-serializing the data on each clock tick inside an interrupt handler that needs to be blazing fast because it's blocking other tasks in the real-time system from running.
It wouldn't be performant or reasonable for the Linux kernel to store process control blocks in JSON, for example. But given the choice, in C, between an array of pointers and a struct with name-addressable fields, Linux wisely chose the latter. If Starliner used a numerically-indexed array to represent the state vector coming from Atlas V when it could have used a name-addressable data structure instead, with a clearly-defined field for the MET, then that would be a fair criticism.
Sure, you don't want to use JSON internally. But we're talking about an interface between the Atlas V launch vehicle and the Starliner vehicle. To me, that definitely qualifies as a high-enough-level interface to be worth using JSON rather than a raw vector of bytes.
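Here is a minimal sketch of what I mean by labeled fields at the interface, in Python with the standard `json` module. The field names (`met_ms`, `altitude_m`) and both helper functions are invented for the example, and this is in no way a description of the real Atlas/Starliner link. The receiver asks for a field by name, so a missing or renamed field fails loudly instead of silently returning a neighbouring value.

```python
import json


def encode_labeled(met_ms, altitude_m):
    """Sender side: every value travels with its label."""
    message = {"met_ms": met_ms, "altitude_m": altitude_m}
    return json.dumps(message).encode("utf-8")


def read_met_ms(message):
    """Receiver side: look the field up by name and sanity-check its type."""
    fields = json.loads(message)
    value = fields["met_ms"]  # raises KeyError if the label is absent
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("met_ms must be an integer")
    return value


message = encode_labeled(123456, 1000)
met_ms = read_met_ms(message)
```

The cost is visible too: the labeled message is several times larger than two packed integers, which is the bandwidth trade-off mentioned above.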
-
#206
by
RobW
on 22 Dec, 2019 03:08
-
Sounds like a bug that may not have cropped up until the vehicle was integrated with the LV, since it appeared when Starliner grabbed the time reference from the Atlas flight computers.
One advantage SpaceX has is they manufacture both the capsule and the LV.
Though it should be an easy fix.
The bug is only the proximate cause. The root cause needs to explain how Starliner ended up with an architecture that (from what we know, which admittedly isn't much) was sufficiently fragile to allow a single, very simple error to cascade into problems with guidance, navigation and communications. How was the decision to take such an apparently fragile approach made? What culture allowed the vehicle to fly without first performing the kind of hardware-in-the-loop, end-to-end test that would have picked the very simple bug up? Or to write a hardware-in-the-loop, end-to-end test that *didn't* pick up the very simple bug?
That's where the fix is (may be) needed - and fixing that is generally not trivial.
Most likely (almost assuredly), the bug isn't nearly as simple as it seems to us at this early stage.
I agree. Bugs found this late in the process are far more likely to be subtle than simple.
It is very easy to take some oversimplified statement from a press conference (where the intended ultimate audience is a non-technical public) and project your own worst fears onto it. I've done that above. When I started out saying 'very simple bug' I was thinking more about the 'simple to fix' end than the 'simple to find' end. Then I went all 'what's wrong with the testing culture' on it, and that's certainly a leap beyond what's currently known.
But I'm still uncomfortable with the information we have about how the system responded to the timing error. Bugs are slippery and it pays to assume some will get through. I only have the press conference oversimplifications to go on, but those give me pause. They make me wonder whether the software for this part of the flight has been designed and coded with the necessary resilience.
I understand that it most likely is fine, and that any worry stems from my limited understanding gleaned from oversimplifications in a press conference. I'm not in aerospace and my own software engineering background has never involved coding life-critical systems.
But I like to learn. To me there's value in asking 'could this be a problem?' and (hopefully) learning why it's not.
-
#207
by
Lee Jay
on 22 Dec, 2019 03:22
-
Another issue is that apparently the data the Starliner was looking for was just an arbitrary location within a larger set of data. The Starliner chose location 15 while the number it should have had was in location 17, or something of that sort.
Do we actually know that?
From the mission discussion thread:
During the NASA teleconference today, someone said (approximately) "the spacecraft reached into the Atlas and grabbed the wrong parameter".
But we don't know if it's the wrong address or the wrong value.
-
#208
by
Lee Jay
on 22 Dec, 2019 03:23
-
I'm confused by this whole situation.
Why would Starliner need to grab MET from Atlas? It has an IMU on board (probably several - right?) so it should know when it launched, where it is and its orientation. It should have a way (break wire?) to detect spacecraft separation. So why does it need MET from Atlas and why would getting a wrong one affect, well, anything other than a possible error code generated to say "hey - this doesn't match our launch and separation detections"?
Any satellite experts want to educate me?
A break wire can tell when it launched
Okay, even worse. Then why would it need MET from Atlas?
-
#209
by
Arch Admiral
on 22 Dec, 2019 03:33
-
To me the strangest thing about this incident is the 8-min communication gap with the TDRSS network. There are so many TDRSS satellites that there should be complete coverage. I suspect this was because the relevant satellite was occupied with NRO spy satellite traffic. In joint civil/military systems, military demands always have priority.
-
#210
by
clongton
on 22 Dec, 2019 03:36
-
I'm confused by this whole situation.
Why would Starliner need to grab MET from Atlas? It has an IMU on board (probably several - right?) so it should know when it launched, where it is and its orientation. It should have a way (break wire?) to detect spacecraft separation. So why does it need MET from Atlas and why would getting a wrong one affect, well, anything other than a possible error code generated to say "hey - this doesn't match our launch and separation detections"?
Any satellite experts want to educate me?
A break wire can tell when it launched
Okay, even worse. Then why would it need MET from Atlas?
Yeah, especially because Starliner has its own break wire that goes out through the umbilicals down to the spacecraft EGSE at the base of the pad. This wire breaks when the umbilicals are retracted, informing the spacecraft avionics that liftoff has occurred, which starts the MET clock.
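The cross-check being asked about upthread ("this doesn't match our launch and separation detections") could look something like the Python sketch below. It is purely illustrative: the function name, the 2-second tolerance, and the policy of falling back to the locally counted MET are all my assumptions, not anything known about the Starliner flight software.

```python
from typing import Optional, Tuple


def select_met(local_met_s: float,
               lv_met_s: Optional[float],
               tolerance_s: float = 2.0) -> Tuple[float, bool]:
    """Pick a mission elapsed time and report whether a fault was flagged.

    local_met_s: MET counted on board since the break wire opened.
    lv_met_s:    MET reported by the launch vehicle, or None if unavailable.
    Returns (met_to_use, fault_flag).
    """
    if lv_met_s is None:
        # No data from the launch vehicle: keep the local clock, flag it.
        return local_met_s, True
    if abs(local_met_s - lv_met_s) > tolerance_s:
        # The two sources disagree badly: distrust the external value.
        return local_met_s, True
    # The sources agree within tolerance: accept the launch vehicle's value.
    return lv_met_s, False
```

For example, `select_met(10.0, 39600.0)` keeps the local 10 seconds and raises the flag, rather than letting an 11-hour discrepancy drive everything downstream.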
-
#211
by
clongton
on 22 Dec, 2019 03:41
-
To me the strangest thing about this incident is the 8-min communication gap with the TDRSS network. There are so many TDRSS satellites that there should be complete coverage. I suspect this was because the relevant satellite was occupied with NRO spy satellite traffic. In joint civil/military systems, military demands always have priority.
No. They were clear in the presser today that a significant portion of the blackout was because the spacecraft's antennae were not pointed at the satellites because the spacecraft was out of position. Comms were not possible even when the spacecraft came out of the dead spot until they got the antennae pointed in the right direction.
-
#212
by
wolfpack
on 22 Dec, 2019 03:45
-
Okay, even worse. Then why would it need MET from Atlas?
Wasn’t it supposed to be launch vehicle agnostic? Hardly seems that way if it needs Atlas’ data.
Too much software. It falls into the same trap time and time again. The assumption that I/O is BOTH omnipresent AND correct. If programmers would just think a little outside the box. Wires can break. Transponders go down. Bit errors happen.
Shuttle really got all of this right. That code was CLEAN.
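For what it's worth, treating input as neither always present nor always correct doesn't take much code. The sketch below, in Python, appends a CRC-32 to each payload and has the reader return `None` for anything missing, truncated, or corrupted. The framing format and both function names are invented for illustration; a real avionics bus would have its own framing and error detection.

```python
import struct
import zlib
from typing import Optional


def make_frame(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def read_frame(data: Optional[bytes]) -> Optional[bytes]:
    """Return the payload, or None if the data is absent or fails its CRC."""
    if data is None or len(data) < 4:
        return None  # the wire broke, or the frame was truncated
    payload, trailer = data[:-4], data[-4:]
    (expected_crc,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) != expected_crc:
        return None  # bit errors happened somewhere along the way
    return payload


good = make_frame(b"met_ms=123456")
flipped = bytes([good[0] ^ 0x01]) + good[1:]  # simulate a single bit error
```

The caller then has to decide what to do with `None`, which is exactly the question that gets skipped when I/O is assumed to be perfect.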
-
#213
by
jcopella
on 22 Dec, 2019 04:49
-
Seems odd to me that there'd be a requirement for the spacecraft to interrogate the LV for MET. Like, it's odd enough I'm not sure I completely believe it.
But assuming it's a reasonably accurate characterization of the problem, and while we're just opining and idly speculating, it has the "smell" of a configuration management snafu, as if the spacecraft and LV software loads were built against different versions of a telemetry table or something.
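If it really were a mismatch between two versions of a telemetry table, one cheap guard is for both sides to compare a fingerprint of the table they were built against before trusting any index into it. The Python sketch below is speculative in the same spirit as the post: the table contents and function names are made up, and SHA-256 is just a convenient choice.

```python
import hashlib


def table_fingerprint(field_names):
    """Hash the ordered list of field names that defines the table layout."""
    joined = "\n".join(field_names).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def tables_match(sender_table, receiver_table):
    """True only if both sides agree on every field and its position."""
    return table_fingerprint(sender_table) == table_fingerprint(receiver_table)


# Hypothetical example: the receiver was built against an older ordering.
lv_table = ["frame_count", "met_ms", "status"]
spacecraft_table = ["frame_count", "status", "met_ms"]
compatible = tables_match(lv_table, spacecraft_table)  # False here
```

Because the order is part of the hash, two tables with the same fields in different slots are reported as incompatible, which is the failure mode that a plain index lookup would never notice.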
-
#214
by
TheRadicalModerate
on 22 Dec, 2019 05:58
-
How is there a test regime where you didn't put "a lot of duty cycles" on the RCS system?
You don't test too far beyond your design criteria.
Let's say, for sake of argument, that you design the thrusters for a duty cycle of 25% for no longer than 2 minutes. Maybe you test them at 40% for 4 minutes and, if they're fine, you say they meet the design criteria. Then you operate them at 80% for 10 minutes and they fail. Well, that was outside the design - and testing - criteria.
In case it's not clear, I made up all those numbers for illustration purposes.
I suspect that "duty cycle" is high-level engineering managerspeak for "it got used a lot".
Given that a situation where there's an attitude control problem, real or instrumentation-generated, is always a possibility, I'd think that the thrusters having a heavy station-keeping load would be something pretty deeply rooted in a lot of failure trees.
This may explain something, though: It's pretty clear that there isn't a contingency to use RCS (or at least whichever of the two RCS systems, CM or SM, that threw the instrumentation failures) as a backup for deorbit delta-v, which is consistent with wanting to be able to do a no-thruster reentry in case of a failure of the OMACs.
-
#215
by
TheRadicalModerate
on 22 Dec, 2019 06:04
-
I really don't see how either of your hypothetical situations affects whether to put crew on board for the first automatic docking. The second one doesn't even have anything to do with whether it was automatic or manual docking.
My thinking wasn't whether the crew might add some value to the situation. Rather, it was whether it might be nice to test the system prior to exposing the crew to it for real. You obviously have to expose the ISS to a docking cycle to prove that Starliner works properly, but the ISS crew is safer in the event of something bad happening than a crew on the Starliner would be. And there's no danger at all to the ISS crew in a hung docking, while a Starliner crew would be in a bad situation.
-
#216
by
TheRadicalModerate
on 22 Dec, 2019 06:06
-
Wouldn't it be much safer doing the first docking with someone on board, able to monitor and possibly take control if there is a problem? You would use the automatic system, but now you have a backup.
John
That's great for intervening in a test that was going wrong--right up to the point where the intervention doesn't work and the crew is endangered.
-
#217
by
TheRadicalModerate
on 22 Dec, 2019 06:15
-
I've had this exact thing happen to me. I design an actuator system to handle the expected loads with substantial margin, then an unstable controller decides to push the actuators to their absolute limits - continuously - and after a brief time, they trip. Then I fix the unstable controller and they never come anywhere close to tripping ever again.
Ah, the joys of engineering a robust system.
I'm just a software geek, but that basic bug pattern happens all the time. You discover some stupid, really unlikely bug has revealed a major problem that you hadn't thought about at all. Then you fix 'em both, and the system is a lot more bulletproof at the end of the process.
The problem is when the testing doesn't turn up either the stupid bug or the serious one. That's a test design problem.
-
#218
by
Lee Jay
on 22 Dec, 2019 06:41
-
How is there a test regime where you didn't put "a lot of duty cycles" on the RCS system?
You don't test too far beyond your design criteria.
Let's say, for sake of argument, that you design the thrusters for a duty cycle of 25% for no longer than 2 minutes. Maybe you test them at 40% for 4 minutes and, if they're fine, you say they meet the design criteria. Then you operate them at 80% for 10 minutes and they fail. Well, that was outside the design - and testing - criteria.
In case it's not clear, I made up all those numbers for illustration purposes.
I suspect that "duty cycle" is high-level engineering managerspeak for "it got used a lot".
No, duty cycle is very low-level engineering speak for the ratio of on-time to total time. 25% would mean it's on for 25% of the time.
-
#219
by
TheRadicalModerate
on 22 Dec, 2019 06:58
-
How is there a test regime where you didn't put "a lot of duty cycles" on the RCS system?
You don't test too far beyond your design criteria.
Let's say, for sake of argument, that you design the thrusters for a duty cycle of 25% for no longer than 2 minutes. Maybe you test them at 40% for 4 minutes and, if they're fine, you say they meet the design criteria. Then you operate them at 80% for 10 minutes and they fail. Well, that was outside the design - and testing - criteria.
In case it's not clear, I made up all those numbers for illustration purposes.
I suspect that "duty cycle" is high-level engineering managerspeak for "it got used a lot".
No, duty cycle is very low-level engineering speak for the ratio of on-time to total time. 25% would mean it's on for 25% of the time.
I'm aware. But I'd be surprised if that's what Chilton meant: he talked about putting a lot of "duty cycles" (plural) on the thrusters. Pretty sure that what he meant is that they'd fired a lot of times, which isn't the level of semantic rigor that we'd expect from our lofty perch as nerd kibitzers.
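The difference between the two readings is easy to show with numbers. In the Python sketch below, a firing history is a list of `(start_s, stop_s)` on-intervals; the function names and both histories are made up for illustration (like the percentages upthread) and have nothing to do with real thruster data.

```python
def duty_cycle(pulses, window_s):
    """Fraction of the window during which the thruster was on.

    pulses: non-overlapping (start_s, stop_s) on-intervals inside the window.
    """
    on_time = sum(stop - start for start, stop in pulses)
    return on_time / window_s


def firing_count(pulses):
    """Number of separate firings, regardless of how long each lasted."""
    return len(pulses)


# Two hypothetical 120-second histories with the same total on-time (30 s).
few_long = [(0, 15), (60, 75)]                         # 2 firings of 15 s
many_short = [(4 * i, 4 * i + 1) for i in range(30)]   # 30 firings of 1 s
```

Both histories have a duty cycle of 0.25, but one has 2 firings and the other 30. So "a lot of duty cycles" in the sense of "fired a lot of times" is a statement about `firing_count`, not about the ratio.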