To solve this problem, the engineers followed a strategy that had been employed in previous missions such as Mars Global Surveyor: they would use a form of ‘dead reckoning’. This depended on knowing exactly how much thrust would be exerted on the spacecraft – and in which direction – when each thruster was fired for a known period of time. With this information, it would be possible to calculate how much the speed and direction of the spacecraft would change during each AMD event, even without measuring those changes directly.
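The arithmetic behind this kind of dead reckoning can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the thrust is constant, the spacecraft's mass does not change during the short firing, and the numbers are invented. The function name and values are hypothetical and are not taken from the mission software.

```python
def delta_v(thrust_newtons, burn_seconds, mass_kg, direction):
    """Velocity change (m/s) as a 3-vector from one thruster firing.

    Assumes constant thrust and constant mass over the short burn.
    `direction` is a unit vector giving the direction of the push.
    """
    impulse = thrust_newtons * burn_seconds  # newton-seconds
    speed_change = impulse / mass_kg         # metres per second
    return tuple(speed_change * component for component in direction)


# Illustrative values only (not mission data).
dv = delta_v(thrust_newtons=0.9, burn_seconds=3.0, mass_kg=600.0,
             direction=(0.0, 1.0, 0.0))
print(dv)
```

Each firing changes the velocity only minutely, which is why the calculation depends so heavily on the thrust figure being right: any error in it is repeated at every event.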
To make this possible, the subcontractor who manufactured the thrusters sent paperwork to Lockheed Martin that documented how much thrust was generated when each thruster was fired. This manufacturer was accustomed to working in English units (pounds, feet and so on). NASA generally requires the use of metric units throughout its operations and those of its contractors, but it makes exceptions in cases where ordering a change of units may be unduly burdensome, or where it would increase the risk that the contractor will make some kind of mistake. NASA made an exception of this kind for this subcontractor, so the paperwork received by Lockheed Martin listed the thrusters’ performance in units of pounds of force.
The root cause of the Mars Climate Orbiter mishap was the failure to convert these English units to metric units of force – newtons – in the preparation of a navigational software file called ‘Small Forces’. The purpose of this file was to determine how strongly each AMD event would push the spacecraft out of its intended path. Because the remaining navigational software assumed that the output of the Small Forces file was in newtons, it underestimated the deflection of the spacecraft’s trajectory caused by each AMD event by a factor equal to the number of newtons in one pound of force – which is to say, by a factor of about 4.45.
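The size of the discrepancy can be checked with a short Python sketch. The impulse value below is invented for illustration, and the function and variable names are hypothetical; only the conversion factor itself (one pound of force is about 4.448 newtons) is a standard figure.

```python
LBF_TO_NEWTONS = 4.4482216  # newtons in one pound of force


def impulse_lbf_s_to_n_s(impulse_lbf_s):
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTONS


reported = 2.0                               # hypothetical figure, in lbf*s
true_impulse = impulse_lbf_s_to_n_s(reported)  # what the push really was, in N*s
misread_impulse = reported                   # the same number read as if it were N*s

# How many times too small the misread value is.
ratio = true_impulse / misread_impulse
print(round(ratio, 2))
```

Giving the factor a name, rather than burying a bare 4.45 in an equation, also makes it harder to remove the conversion by accident.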
To understand how this error occurred, I spoke with John Casani, the onetime chief engineer at JPL who led the lab’s internal investigation into the mishap. I also spoke with Steve Jolly of Lockheed Martin, who was the lead systems engineer for the Mars Climate Orbiter. Casani’s and Jolly’s accounts agreed in one respect: they both told me that the failure to convert the units was primarily the fault of a young engineer who had only completed college a couple of months earlier and who was a new hire at Lockheed Martin. (This person has never been identified by name.)
Casani and Jolly gave me somewhat differing accounts of how the young engineer actually came to make the mistake. According to Casani, the engineer was given the documentation from the thruster manufacturer that contained performance data in pounds, as well as a set of instructions from JPL that specified that the output of the Small Forces file should be in newtons. The engineer simply failed to read the JPL document with sufficient care and thus overlooked the requirement for the conversion. According to this account, then, the engineer thought he was doing the right thing by providing the output in English units.
According to Jolly (who presumably would have been more knowledgeable about the matter), the engineer did know that pounds needed to be converted to newtons. The reason he failed to make the conversion, Jolly told me, had to do with the ‘heritage’ issue. The Mars Global Surveyor had used similar navigational software, and the plan was to save money by having the Climate Orbiter ‘inherit’ or reuse it. The new spacecraft used different thrusters, however, so the portion of the software relating to thruster performance was excised, and the engineer’s task was to replace that portion with code incorporating performance data for the new thrusters. Unfortunately, he assumed that the code that made the conversion of units was left in the unexcised Global Surveyor software, whereas in fact it was in the excised portion. The conversion was represented simply by the number 4.45 in an equation, without any comment as to its purpose, so it was easy to miss. Thus, in writing the new code, the engineer left the units in pounds, thinking that the required conversion would be made by the pre-existing software.
Although the engineer’s mistake was the root cause of the mishap, such mistakes are inevitable as long as science is done by humans. The more serious error was the failure of anyone to spot the mistake. Part of the problem was that a factor of 4.45 is not a terribly large error in engineering terms: the faulty code produced output that looked quite reasonable. In fact, if the Mars Global Surveyor (with its symmetrical solar panels) had incorporated the same error, that mission would probably not have been affected. It was only the asymmetrical design of the Climate Orbiter, with the resulting need for numerous AMD events, that allowed the small individual errors to accumulate to a mission-endangering level.
Following standard procedures, the faulty software was reviewed, but the error wasn’t spotted. Then it went through formal testing: using fictional AMD events, the output of the software was compared with the output of manual calculations. Unfortunately, the manual calculations somehow incorporated the same error as was present in the faulty software, so the two outputs were in agreement and the software was judged to be good.
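Why such a test can pass is easy to show in miniature. The Python sketch below is a hypothetical reconstruction, not the actual test procedure: it assumes, purely for illustration, that both the software and the manual check omitted the conversion in the same way, and all the function names are invented.

```python
import math

LBF_TO_NEWTONS = 4.4482216  # newtons in one pound of force


def small_forces_buggy(thrust_lbf, seconds):
    """Impulse from a firing, with the lbf -> N conversion left out."""
    return thrust_lbf * seconds


def manual_check_same_mistake(thrust_lbf, seconds):
    """A hand calculation that makes the identical omission."""
    return thrust_lbf * seconds


def independent_check(thrust_lbf, seconds):
    """A calculation done separately, in newton-seconds as required."""
    return thrust_lbf * LBF_TO_NEWTONS * seconds


# A fictional AMD event: 0.2 lbf for 5 seconds (invented numbers).
software_output = small_forces_buggy(0.2, 5.0)

shared_error_test_passes = math.isclose(
    software_output, manual_check_same_mistake(0.2, 5.0))
independent_test_passes = math.isclose(
    software_output, independent_check(0.2, 5.0))

print(shared_error_test_passes, independent_test_passes)
```

A comparison is only as good as the independence of the two things being compared: agreement between two calculations that share an assumption says nothing about whether the assumption is right.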
The Small Forces software was not actually loaded into the spacecraft’s computer; rather, it was placed in computers that remained on the ground. The idea was that every time an AMD event occurred, the spacecraft would radio back data about the length of firing of the thrusters, the attitude of the spacecraft and so on, and the navigational team would then feed the data into the ground computer to extract measures of the magnitude, duration, and direction of thrust. These measures would then be used to adjust the model of the spacecraft’s trajectory.
By a bitter irony, the spacecraft’s own computer did in fact possess software to make this calculation independently, and this software correctly specified the resulting thrust in newtons. ‘You can imagine how many times I wake up at night thinking about that,’ said Jolly. In fact, the spacecraft was even programmed to radio the output of these calculations to the ground, but the navigators did not know this, so no one looked at the incoming data packets or compared them to the output of the erroneous calculations being performed on the ground. If they had done so, the error would have been quickly detected. Even when I spoke with him in 2006, after NASA’s official inquiry had established and published the fact, Jolly said that he didn’t know that the spacecraft had been transmitting the correct data to Earth.
On December 11, 1998, the Mars Climate Orbiter was launched from Cape Canaveral Air Station in Florida, atop a Delta II rocket. The Delta II is a relatively inexpensive, medium-powered launch vehicle. So as not to exceed the Delta II’s lifting capacity, the mission planners had to economise on the weight of fuel carried by the spacecraft – fuel which was required for slowing the spacecraft when it reached Mars. The planners took two steps to save on fuel. First, they sent the spacecraft by a long route that took it more than halfway around the sun: this ensured that it was travelling relatively slowly as it approached Mars, but it lengthened the trip to nine months rather than the six months needed for a more direct route. Second, they planned to accomplish some of the slowing by aerobraking – repeatedly dipping the spacecraft into Mars’s outer atmosphere on successive orbits after the first encounter – rather than relying entirely on the spacecraft’s engines. Even with these measures, the orbital insertion burn would have to slow the spacecraft by nearly 5,000 kilometres per hour, a task that would consume nearly 300 kilograms of fuel – almost half the total weight of the spacecraft.
The launch went flawlessly. The Delta II’s first two stages lifted the spacecraft into low Earth orbit, then the third stage booster rocket fired for 88 seconds, kicking the spacecraft out of Earth’s gravitational clutches. After the booster separated, the spacecraft deployed its solar panels and began its long, unpowered cruise toward Mars.
Teams at JPL and Lockheed Martin, led by JPL flight operations manager Sam Thurman, monitored and controlled the spacecraft during its journey. Part of the team was a group of four JPL navigators, led by Pat Esposito, whose task was to determine the spacecraft’s trajectory and calculate the required corrections during the flight. The team was also responsible for two other spacecraft, however – Mars Global Surveyor (which was orbiting Mars) and Mars Polar Lander (which was launched on January 3, 1999). Only one team member, Eric Graat, could give his undivided attention to the Mars Climate Orbiter. This was a low level of staffing compared with previous and subsequent missions: the successful Mars Odyssey mission of 2001, for example, boasted a 15-member navigation team. Although neither Esposito nor Graat agreed to speak with me, Sam Thurman told me that they were very much overworked.
The first trajectory correction manoeuvre (TCM-1) took place ten days after launch. It corrected a deliberate mis-aim in the launch trajectory – a mis-aim whose purpose was to ensure that the third-stage booster did not strike Mars and contaminate the planet with terrestrial germs. The manoeuvre involved an elaborate sequence of operations. First, the solar array was folded and locked against the spacecraft body to protect it from damage, then the entire spacecraft was rotated so that the firing of its aft-pointing thrusters would deflect the craft’s trajectory in the right direction, and then the thrusters were fired for a few minutes to achieve the correct trajectory. Finally, the spacecraft was rotated back into its flight orientation and the solar panels were deployed once more. A second, much smaller trajectory manoeuvre (TCM-2) was performed on January 26, 1999, and it, too, went according to plan.
About every 17 hours during the flight, the spacecraft automatically performed angular momentum desaturation (AMD) procedures, firing its thrusters for a few seconds to allow the reaction wheels to be decelerated. The navigators had not been expecting the AMD events to occur so frequently, because when they came onto the job they were not familiar with the Orbiter and they did not realise that its asymmetrical design would cause an increased tendency to spin under the influence of solar radiation.
During the first four months of the flight, the navigators did not use the Small Forces software to calculate the effects of the AMDs on the spacecraft’s trajectory. This was because the software not only contained the units error (which no one was aware of), but also some other bugs that had come to light. Because the tiny effects of the AMD events would only really be important for the final approach to Mars and orbital insertion, the navigators simply did without the output of the Small Forces software, planning to incorporate the data at a later time.
Finally, in mid-April, the ground software was delivered and put into operation. Now the effects of the AMD events (including those that had already taken place) were incorporated into the navigational calculations. But for each AMD event the software told the navigators that the spacecraft had been deflected by an amount that was roughly four and a half times smaller than what had actually occurred, thanks to the poison pill that was the units error. Still, each individual navigational solution looked good, because the error was in an unobservable dimension perpendicular to the line of sight.
Only over time, as more and more solutions were calculated along the spacecraft’s curving path, did Graat become aware that the individual solutions didn’t quite mesh together to form a coherent trajectory. And calculations of the spacecraft’s current position that were derived from different data sets (for example, those based on range or Doppler measurements, or those that were based on different parts of the spacecraft’s trajectory) gave a fuzzy cluster of solutions instead of a single, unanimous answer.
Graat discussed this navigational problem with the leader of the navigational team, Pat Esposito. According to John Casani, the problem should have been entered as a formal written record known as an Incident, Surprise, Anomaly form – or ISA – which guarantees that a problem is followed up to a satisfactory resolution, but that’s not what happened. ‘The navigator here at JPL sent an email message to someone at Lockheed Martin, saying, “Take a look at this, there’s something funny going on that we don’t understand.” That never got entered into the formal record, which is our normal practice. So someone received this at the Lockheed end and said, “I’m going to work on this,” and then he got some other task that came along that either he thought was a higher priority, or his boss thought was a higher priority, and he got deflected, and this problem that was communicated to him by email just fell off the table, so to speak. If the form had been filled out, that could not have happened.’
Sam Thurman clued me in to the ‘other task’ that got higher priority. A serious incident occurred during the third trajectory correction manoeuvre, which was performed on July 23. Although the procedure for TCM-3 was the same as for TCM-1 and TCM-2, the process of retracting and locking the solar panels in preparation for the burn went awry. This procedure involved rotating the panels around a ball joint, using a gimbal drive. ‘There are devices on this gimbal that read out its angular position,’ he said, ‘and there were some calibration errors on those things, so the solar array scraped up against the side of the spacecraft and nearly got jammed in the stowed position. That put the spacecraft into “safe mode”: when it tried to un-stow after the manoeuvre, the array wouldn’t move, so the software stopped and called up the ground and said, “Hey, I’m trying to move this and it’s not moving; there’s something wrong.” So we spent most of the month of the approach phase scrambling to try to resolve this problem with the gimbal drive – because we knew that when we got to orbit insertion we had to have the solar array in the stowed position when we fired the main engine. The support that held the array wasn’t strong enough to take that force without the array being stowed. So we knew following TCM-3 that we had a problem we must fix or orbit insertion would fail. And that was very scary – that took a hell of a lot of effort from the team, the spacecraft team [at Lockheed Martin] in particular. So I think the navigation team’s problem was they were calling up the spacecraft team, saying, “Gee, we’re seeing this funny stuff, can you help us work the Small Forces modelling and try to understand it?” And they said, “Oh my God, we’ve got this huge other problem that could end the mission if we don’t fix it in the next two weeks.”’