The title isn’t a euphemism. This is about when product management decisions can do damage.
In Part 1 of this writeup, I’m going to define some of the challenges for Mission and Safety Critical applications as compared to typical development. If you want to skip right to more practical considerations for what to do about such things, go right to Part 2, with a checklist followed by explanations of the line items.
Over the years I’ve moved from consumer oriented products to B2B and B2B2C, most recently several years working on healthcare solutions. On this path, I’ve learned a lot about Safety Critical application issues. It’s struck me that people who ‘grew up’ in such environments may have a lot of this knowledge through intrinsic experience. But for those transitioning to such areas or in startup mode who haven’t been there, (as was my path), it’s possible there are knowledge gaps. What I often do when learning new things is take notes in my own Wiki, and over time develop some degree of body of knowledge in a subject area. So my goal now is to share back what I’ve learned for those who may find it useful.
What’s the difference between “Mission Critical” or more extremely, “Safety Critical” vs. “typical” product management? Rather than try to formally define it, let’s keep it simple. Safety Critical products are those that – for certain failure modes – can hurt or kill people. (Or cause significant property or environmental damage.) Often, they’re things that move in our environment; that is, actually do things. A basic compare and contrast might be an advertising delivery system for content vs. an insulin pump; failure of the first maybe means some lost money, whereas failure of the second… clear enough, right? Mission Critical – as distinct from Safety Critical – might mean less severe human consequences, but a ‘hard’ failure or challenge nonetheless. E.g., a failure in a basic regulatory compliance issue could result in anything from minor work stoppage or fines through potential criminal negligence. Or a failure of the software on a Mars Rover results in loss of the unit, associated waste and data loss. No one gets hurt or dies, but the mission is basically lost.
In short, these are not really applications where we want to “fail fast and learn.” These are apps that have somewhat more crisp “Minimum Viable Product” standards when it comes to the “Viable” part.
The Short Version of Issues for Mission / Safety Critical Products
- Mission and Safety Critical systems demand something beyond basic Agile Product / Project management approaches. (And this may include products in highly regulated areas as well.)
- There is often a moral component that exists beyond other types of products / services.
- More planning is required than for non-critical systems.
- Agility may be more often impacted by external dependencies.
- So called “Non-functional” requirements may become critical.
- Risk assessments must be included as a core concern.
- There’s a suggested checklist for Mission / Safety Critical Products in Part 2 of this writeup that you may find useful if you work on these types of products.
That’s it. You can stop reading now. But for fuller discussion of these points, please continue. Or skip to Part 2 for the just mentioned checklist items.
Project Methodologies for Mission and Safety Critical
Here’s the bottom line assertion… the whole idea of Agile Development – or at least as often practiced – is in many ways antithetical to Safety Critical products and services. Yes, there may be hybridized ways to achieve the best of both worlds. But consider this as you think about what product and project methodologies you’re going to use. One idea that needs to be dispensed with is the “fail fast” attitude. While this is not part of the Agile Manifesto, it’s become all but de rigueur in Agile environments. Now, before any purists start itching to point out this can mean fail fast during development and Sprints, but value increments don’t necessarily need to be released until a full package is ready… well… let’s be honest about what most often ends up in releases when there’s business pressure. Things that are ‘done’ tend to have momentum. Among the premises of Agile is that we fix and re-factor things later. But as a practical matter, this doesn’t always happen. And let’s also note there’s nothing about the Agile Manifesto that demands the specific methodologies of the ever popular Scrum method.
As software continues to manage more of our world, the concepts of Lean product development, Minimum Viable Product (MVP), plus Agile have become ascendent, and there may be a disconnect between teams new to developing what can turn out to be safety critical products vs. those who have worked on such things for years. As we approach the mid 2020s, there has been some pushback on the idea of MVP and even Agile, for a variety of reasons. Some of those reasons might be about the best ways to efficiently get to value or how to adjust these methods given our learnings from their use. Others may be about using the methods vs. the more abstract underpinnings of the methods. It’s not that this or that model is right or wrong. The issue is more – my opinion – about how or when to properly select a method for the task at hand. Or, if necessary, modify them to right fit the goal. I’ve had plenty of success using Agile methods at multiple companies on multiple projects; usually manifesting in the form of Scrum or Kanban. And I’m a huge believer. But not necessarily strictly so for all project types. And Safety Critical applications do demand a higher standard of care than others. If Agile methods can satisfy this for a given application, great. But if not, it’s time to work out methods that will.
Engineers at car companies, aircraft / aerospace manufacturers, medical instruments and more have been trying to follow, (or create), best industry practices for decades. They have experimented with various methodologies from Six Sigma through Lean and so on. As to the rest of us, over time, it seems more and more development is floating into more critical spaces. (Consumer IoT, wearables, and so on.) For example, a long established seemingly old-fashioned brand of medical device manufacturer may have been following stringent processes for years. Whereas the hot new medical device startup barely knows what these processes might be. The startup may be faster… maybe more ‘tragically hip’ in terms of its whole vibe; but is it really producing safe products? Maybe. Maybe not. What about your shop?
And of course, even the best of the big organizations fail sometimes. This next example isn’t a digital product, but consider for a moment NASA’s space shuttles. Some of the world’s best project managers were involved. Still, 2 out of 5 operational shuttles were destroyed. This is an unsettling 40% failure rate on a per-unit basis. And of course, besides the major investments in the program, 14 highly educated, stunningly talented brave adventurers died.
OK, maybe that’s a serious outlier example as it’s one of the more complicated things ever built on which very few well trained crew would travel. How about the Boeing 737 Max? OK, that’s complicated too. And yet, we’ve lost a couple of those. What about Toyota’s unintended acceleration issues circa 2009 to 2010? There are plenty more examples, (too many), but I think you get the point.
Let’s consider something we maybe don’t know for sure. We know driving while distracted can be a killer. So is something as simple as a hard to use UI / UX on a fancy auto entertainment system possibly responsible for some accidents? I don’t know. I can’t find data on this. Intuitively, it seems possible enough. Regardless, it seems clear enough that safety is a consideration here beyond just the features.
Mission & Safety Criticality
Let’s start with Mission Critical. Just what is “Mission Critical” versus not?
In one high level sense, Mission Critical is pretty much anything that’s so essential to core operations – of anything – that should it fail, the whole system fails. Taking things to the next level, we have Safety Critical.
With Safety Critical applications we get into an area where a failure can cause harm; from minor injury through death. Or possibly significant damage to property or the environment. Failures may range from design of interface or experience through failure of code to perform an operation and so on. And the harm can be anything from discomfort through injury or death; either immediate or over time due to failure to perform an action or avoid an action.
Ideally we can all agree there’s a difference in the nature of codebases for… let’s say a community forum and news site for an interest category vs. something like flight management software, or remote patient monitoring software. Or similar. Both categories of software may be important. After all, hobbies are important. News is important. Advertisement supported content websites serve their customers and provide livelihoods for their owners, who are passionate about the topic. But do the Minimum Viable Product specs have the same kind of requirements as Safety Critical? Of course not. Is the Definition of Done in the User Stories at the same level of criticality versus a safety critical system? Of course not. There seems to be a difference between an ad slot being defective for a few hours, (losing revenue), vs. failure of a patient alert monitoring system. The challenge is when product, design or other personnel begin moving from innocuous products to those with more critical concerns.
I’ve worked on both types of products. And with all the back and forth lately surrounding debates about MVPs and Agile or Scaled Agile, or whatever… I wanted to take a few moments to get back to basics and write about some fundamental priorities. And perhaps make what might be an unpopular argument – among some who are used to running more fast and loose – for more planning when it comes to mission critical applications.
Quick Story: One day I was in Fire / Rescue training at the volunteer firehouse where I work part time; primarily as an Emergency Medical Technician. We were doing some search and rescue training in a maze. This is where we put on all our gear, including breathing packs / masks, etc. Our masks get obscured with wax paper inside them so we can’t see outside at all, and then we have to run a victim search through the maze. (Well, not really run… it’s more of a crawl.) The maze gets changed around over time, but typically has various obstacles, holes in the floors, walls where we have to ‘swim’ through the studs, hanging wires and so on. I’m halfway through this evolution. Even though I can’t see outside my mask, I can see from the display in my mask – which is still somewhat visible – that my air supply is getting a bit low. Around the next corner, I get hung up in some wires. As I try to clear them one of my bunker pants’ suspenders falls down around my left upper arm inside my heavy jacket. I’d been meaning to replace them because one of the rear straps had broken. Now I’m having trouble moving. I’m fighting a bit, but I’m somewhat trapped and if this had been just a little worse, I’m not sure I could have even reached my radio to call mayday. Of course, we’re never alone in these things and my partner was able to extricate me. As I exited the obstacle the “vibe alert” in my mask started going off. Low on air. I removed my gear and enjoyed some cool air on my sweating face. And here’s the thought that really hit me… if this had been real; it’s conceivable that the cause of my death and leaving my kid without a dad might have been my failure to replace a $40 piece of nylon. And it’s possible no one would ever know the reason why. Pilot error is a catchall cause for various aircraft accidents when another clear cause can’t be found. But I wonder how many of these things were maybe caused by some seemingly inconsequential issue at just the wrong time.
Now, my personal story happened to be about a physical product, not software. But the point remains… it can be the smallest of poorly considered things that under just the wrong circumstances can offer up a critical failure mode. One of my fire instructors once told me, “You can never train too hard for a job that can kill you.” But this minor never-gave-it-a-second thought piece of equipment? The lesson is clear enough; the littlest things can matter in the largest of ways under just the right… or that is… wrong… conditions.
Do we treat our mission critical software jobs thinking about such things? Perhaps we should. Maybe my instructor’s thoughts can be re-cast as “You can never think or plan too hard about products that can damage, hurt or kill.”
Moral Components of Safety Critical Software Development
I swear by Apollo Healer, by Asclepius, by Hygieia, by Panacea, and by all the gods and goddesses, making them my witnesses, that I will carry out, according to my ability and judgment, this oath and this indenture. To hold my teacher in this art equal to my own parents, etc. etc.
That’s the beginning of something called the Hippocratic Oath. This was once upon a time typically taken by physicians. While today, graduating medical students may not swear such an oath, (they may use others), the general idea is to make a commitment of sorts to various types of behavior. And while the specific words Primum non nocere, might not be in the oath itself, the general idea certainly is. That is, “First, do no harm.” (I was somewhat pleased with myself for coming up with the idea to apply this old oath to design / development. Then of course, as with so many things… I find I was hardly original at all. While researching for this article, I found this book: Tragic Design: The Impact of Bad Product Design and How to Fix It. The authors had long ago expressed the same idea. It’s a great read for a lot more examples than I’m offering here.)
Software product managers, UI/UX/IA folks, developers, etc. swear no such oaths. But perhaps we should. As a practical matter, would it make a difference in anything at all? Perhaps not. But maybe it would help inculcate an initial idea state of standard of care somewhat beyond the popular idea of “fail fast and iterate.” (At least for those of us that began our careers in fast cadence development, perhaps in startups, vs. more traditional shops that had been doing more safety critical work all along.)
The key takeaway here? Those of us that work in Safety Critical areas need a much higher standard of care than some of our buddies optimizing AdTech for TikTok or a content management system for a news site… or… etc. etc. These products may be useful and important in the world. But even a hard failure isn’t likely to hurt anyone. Maybe a short term revenue hit, but then you move on. No one gets physically hurt.
But for other roles, Bad design can kill. Bad code can kill. An API failure can kill.
While this is being written mostly for product folks, development is certainly a potential reader as well. So here’s an example from that perspective with a somewhat insidious error chain, with of course involvement from Product because together we build the Backlog. Let’s say you happen to be a software developer. There’s some tech debt in your startup’s product. You decide it’s time to re-factor some code to an MVC type architecture for future extensibility. Great. Product agrees and you all make some Stories for the next couple of Sprints to get this done. One of the design patterns you might use is publisher / subscriber to properly separate some functions from Controller vs. View. Wonderful. But what happens when you were catching an API push of info and nothing’s come in for a while? Did anyone account for that? Are you polling for data as well? Or just taking it from an expected occasional push from an API provided by an external provider? A simple try / catch statement isn’t likely going to be enough here. How much time passes with no data before… before what? Before a clinician or someone gets involved? Is this patient no longer in the program? Were they cured? Did they die? What happened? I’m your product person. Actually, I’m somewhat senior so maybe I’m your product manager’s manager, or their manager even. Did any of us product folks write that user story for you? Was this check done as its own story? Or part of the acceptance criteria for a story on the data check itself? Did we in product just miss this because we don’t fully understand code? Did you think it through or just pound out the code? Was any such thing discussed during the Sprint planning meeting? Maybe we should have drawn some kind of logical application architecture or just messaging diagram before just adding stories to backlogs? Either way, if these backlog items make it to a release, maybe everything is fine. Until it’s not. Until some device in the field fails and we don’t notice.
Whose mom is that? Or is this patient mostly on their own? Whose fault is this? (The answer, by the way, is Product. It’s the product people’s fault. Yes, the whole team missed it, but if it’s product that leads the way here, then they / we own this.)
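To make the "nothing's come in for a while" question concrete, here is a minimal sketch of a staleness check that could sit next to a push subscriber. Everything in it is hypothetical: the `DeviceFeed` structure, the function names, and especially the silence window. In a real product the window is a clinical and product decision, not something a developer should guess at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class DeviceFeed:
    """Tracks when a device last pushed data (hypothetical structure)."""
    device_id: str
    last_seen: Optional[datetime] = None  # None means no data has ever arrived


def record_push(feeds: Dict[str, DeviceFeed], device_id: str, when: datetime) -> None:
    """Call this from the subscriber callback each time a push arrives."""
    feeds.setdefault(device_id, DeviceFeed(device_id)).last_seen = when


def find_silent_devices(
    feeds: Dict[str, DeviceFeed],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return the IDs of devices with no push inside the allowed window.

    A device that has never reported counts as silent, because absence
    of data is exactly the case a try / catch will never see.
    """
    silent = []
    for feed in feeds.values():
        if feed.last_seen is None or now - feed.last_seen > max_silence:
            silent.append(feed.device_id)
    return sorted(silent)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    feeds: Dict[str, DeviceFeed] = {}
    record_push(feeds, "device-a", now - timedelta(minutes=5))
    record_push(feeds, "device-b", now - timedelta(hours=3))
    # The one-hour window is an illustrative assumption only.
    print(find_silent_devices(feeds, now, timedelta(hours=1)))
```

The code is the easy part. The harder part is that someone has to own the story that says what happens next once a device shows up in that list: who gets notified, how fast, and through what channel.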
A few years ago I had occasion to work on some flight planning software. It was somewhat specialized in that it was purpose built for special types of scanning missions. The scanning gear had special requirements regarding Field of View, altitude, etc. The flight profiles were along points from which we would calculate routes based on various input criteria. We needed not only code, but some additional help with mathematical algorithms, etc. The end result was to be an efficient flight pattern that also takes into account fuel burn, fuel stops, etc. Great. There was just one other issue… and that’s to ask, “What are some first principles here?” To me, it was, “Let’s not kill the pilots.” Some of the patterns may potentially have suggested turns that could be risky in terms of bank angle to hold the patterns. Our earliest outputs showed some clear concerns here and we made sure to take that into account. It’s likely had we not realized this ourselves, our client would have quickly enough. It really turned out to be a non-issue in this case. But it easily could have been a bit edgy.
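For a sense of what a bank angle check might look like, here is a rough sketch. It is not the code from that project. It uses the textbook relationship for a level, coordinated turn, tan(bank) = v² / (g × r), and ignores wind, altitude effects and aircraft specific limits. The 30 degree limit and the function names are assumptions for illustration only; real limits would come from the pilots, the aircraft and the client.

```python
import math

G = 9.80665  # standard gravity in m/s^2


def required_bank_angle_deg(speed_ms: float, turn_radius_m: float) -> float:
    """Bank angle needed to hold a level, coordinated turn of the given radius."""
    if turn_radius_m <= 0:
        raise ValueError("turn radius must be positive")
    return math.degrees(math.atan(speed_ms ** 2 / (G * turn_radius_m)))


def flag_risky_turns(turns, max_bank_deg: float = 30.0):
    """Return (label, angle) for each turn whose required bank exceeds the limit.

    `turns` is an iterable of (label, speed in m/s, radius in m).
    The default limit is an illustrative assumption, not a standard.
    """
    flagged = []
    for label, speed_ms, radius_m in turns:
        angle = required_bank_angle_deg(speed_ms, radius_m)
        if angle > max_bank_deg:
            flagged.append((label, round(angle, 1)))
    return flagged


if __name__ == "__main__":
    planned = [("wide turn", 50.0, 2000.0), ("tight turn", 60.0, 500.0)]
    for label, angle in flag_risky_turns(planned):
        print(f"{label}: needs about {angle} degrees of bank")
```

The point is less the math than the habit: a route optimizer that only scores fuel and coverage will happily propose a turn nobody should fly, unless a safety constraint is part of its output checks.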
So another item has gone on my checklist for product features and that’s simply asking the question, “Are there risk / safety concerns for this particular feature / function?” Yes, sometimes that’s blazingly obvious when clearly working on a safety critical project. But what I’m suggesting is that should just be a standard checklist question when working on anything that’s anywhere near a Safety Critical type application. Even if you don’t personally have the subject matter expertise to assess the issue, at least you’ll be prompted to ask people who do. This is what Discovery is for and it’s critical. Oftentimes, for all the lip service paid to Customer Journey and User Personas, product teams just take their best shot at ideating features. That’s potentially dangerous in some of these realms.
So again… Let’s Think… Primum non nocere, “First, do no harm.”
Planning for Mission & Safety Critical Applications
As usual, the degree of planning and the steps involved will depend on your application. While everything needs its level of care and due diligence, here are some questions to consider…
- Is there an already existing safety standard, (or set of standards), with which you must comply?
- Hardware involved? If software is to be embedded, is there a means to update it? How? Or do you need to decide that?
- Where will the product be deployed? If distributed, might there be special needs for updating?
- What, if anything, will the software be controlling?
- What are the potential failure modes?
- Consider doing a fault tree analysis.
Fault tree analysis
Safety-Critical Operating Systems
A Methodology for Safety Critical Software Systems Planning
Making AI ready for safety-critical applications
Planning for safety (from Project Management Institute (PMI), Bonyuet, D. (2001). Planning for safety. PM Network, 15(10), 49–51)
Fault Tree Analysis FTA Explained With Example Calculation (4:08 Video)
Fault Tree Analysis Explained with Examples – Simplest Explanation Ever (12:34 Video)
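If fault tree analysis is new to you, a toy version may help show the shape of the idea. The sketch below assumes basic events are independent and combines them through AND / OR gates to estimate the probability of a top level failure. The event names and probabilities are invented for illustration; a real analysis would use measured or sourced failure rates and would need to deal with common cause failures, which this ignores.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # chance the event occurs; assumed independent of others


@dataclass
class Gate:
    name: str
    kind: str  # "AND" (all inputs must occur) or "OR" (any input suffices)
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability that the event at this node occurs."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.kind == "OR":
        none_occur = 1.0
        for p in child_probs:
            none_occur *= 1.0 - p
        return 1.0 - none_occur
    raise ValueError(f"unknown gate kind: {node.kind}")


if __name__ == "__main__":
    # Invented numbers, purely to show how the tree is evaluated.
    missed_alert = Gate("missed patient alert", "OR", [
        BasicEvent("sensor fails", 0.01),
        Gate("silent data loss", "AND", [
            BasicEvent("network drops", 0.05),
            BasicEvent("no staleness check", 0.5),
        ]),
    ])
    print(f"{probability(missed_alert):.5f}")
```

Even with made up numbers, drawing the tree tends to surface the useful question: which branch is only protected by a single thing going right?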
Non-Functional Requirements for Safety Critical Products
Non-functional requirements (NFRs; just so we have another acronym) are those that supposedly are more for judging how something is working as opposed to a specific behavior. Display of something like a pulse rate, for example, is a functional requirement. Speed of a system though, or refresh rate, may often be thought of as a non-functional requirement. And yet, it’s not hard to see how response time could be life critical in a variety of applications. The challenge in thinking here may be simply in how we think of definitions. It’s perhaps the case that the label “non-functional” as a concept was an unfortunate historical choice. It simply feels like less of an imperative than something deemed to be functional. Assuming we’re stuck with established nomenclature, it then becomes important as to how we think about that which is non-functional and how we define it as a requirement. When we write a story that says, “As a clinician, I need the patient’s pulse displayed clearly on both the bedside and remote monitor,” our definition of done had best also require details on things like refresh rate of the display.
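One way to keep a requirement like refresh rate from staying vague is to write it as a check that can actually fail. Here is a minimal sketch, assuming we can capture the timestamps at which the display was updated. The function names and the 1.5 second threshold in the example are made up; the real number belongs in the acceptance criteria and should come from clinicians and any applicable standards.

```python
from typing import List, Sequence


def refresh_gaps(timestamps_s: Sequence[float]) -> List[float]:
    """Seconds elapsed between each consecutive pair of display updates."""
    return [later - earlier for earlier, later in zip(timestamps_s, timestamps_s[1:])]


def meets_refresh_requirement(timestamps_s: Sequence[float], max_gap_s: float) -> bool:
    """True only if every gap between updates is within the allowed maximum.

    Fewer than two updates is treated as a failure: with no gap to measure,
    there is no evidence the display is refreshing at all.
    """
    gaps = refresh_gaps(timestamps_s)
    if not gaps:
        return False
    return max(gaps) <= max_gap_s


if __name__ == "__main__":
    observed = [0.0, 1.0, 2.0, 5.0]  # update times in seconds, illustrative
    print(meets_refresh_requirement(observed, max_gap_s=1.5))
```

Note that the check looks at the worst gap, not the average. For a pulse display, one long freeze matters more than a good mean.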
The idea of Security is generally classified as non-functional as well. And yet, this is just as clearly a requirement for a great many products as are the actual functions. There are lists of possibly dozens of non-functional requirements that might end up being critical to the success of a User Story.
There are at least two challenges with the ideas of NFRs when it comes to Minimum Viable Products and how this idea, (and Agile in general), may contend with safety issues.
- Ignoring NFRs to get to MVP. You can – to some degree and perhaps even completely – ignore NFRs to produce an early concept product. After all, you could certainly produce patient monitoring software that works just fine and have zero security applied, no data logging or audit capability, and so on. Of course, anyone reading this could – and should – quickly point out that completion of Sprints doesn’t necessitate a release. And we can always go back to these other pesky issues later. After all, that’s part of the point of Agile… drive to value and test. And yet, we all know full well that some functionality that doesn’t drive towards revenue can be challenging to prioritize. Or that it’s maybe given scant resources. We know this from simple history. How many products suffer from security failings mostly because security is a negative externality? It’s both known and accepted that Agile is going to require some degree of re-factoring along the way. Still, any product is going to feel some pressure when burning cycles to add, (or go back and fix), security, durability, recovery functionality and more when the entire rest of the org just wants the product sold.
- NFRs Might be Inconsistently Represented in Workflow. In some cases, a non-functional requirement may be embedded in a User Story, but in others, may be its own User Story. As a result, verifying performance of all NFRs may be challenging. For example, if we take the pulse monitor example, the performance of the display in terms of refresh rate may be embedded as part of the display story. However, would the same be true of a requirement / Story for saving time series data across an entire monitoring product for multiple devices? Perhaps. But if such data needs to be saved for all data points, is that better handled by its own Story? After all, a system likely needs an entire Big Data strategy for dealing with the 4Vs of such things, (Volume, Variety, Velocity, Veracity, and as a bonus, some add Value). Would it make sense to have historical data as its own Epic? Or does each functional piece need its own reference here? Agile – as typically practiced – is often challenged with dependency issues.
Risk Assessment Needs for Safety Critical Products
Your first stop here is likely regulatory. In some ways, this is the easy part. Assuming your products live in a regulatory space there is probably at least one, (and perhaps several), checklists that are pre-built milestones. Chances are you already know about these. But, possibly not. Startups that possibly come from a crossover industry may lack familiarity with a new domain.
- Please see Part 2 for a checklist with additional explanations of checklist line items.
Some Additional readings…