Forecasting in the Real World: What Monte Carlo Is (and Isn’t)

Guest Author, Nicolas Brown

Recently there has been more active discussion of Monte Carlo Simulation (MCS) in software development, particularly following the panel on probabilistic forecasting at CraftCon25. Afterwards, one of the panellists, Nigel Thurlow, shared his thoughts on the panel along with some further elaboration on what was discussed, which led to lots of discussion on LinkedIn. In that thread, Nigel (and others) offered some valid counter-views and general questions about the validity of MCS which many other folks out there will also be asking.

As a practitioner of many years, and someone who builds free tools that leverage MCS, I thought it would be helpful for the community to summarise, and try to answer, the main questions and challenges posed in the thread.

Isn’t this just a fancy form of averaging?

Averages give you one number. MCS helps you see the whole picture and how delivery might play out across a range of possibilities, so you can plan for risk rather than hope the average works out. I’ve been the person committing the latter sin for years, and I was almost always found out (the date or capacity turned out to be wrong).

Actual delivery is noisy, with variation in our Throughput, so we need to model that variation in what we forecast. Now, in defence of the criticism, there are plenty of times I’ve seen MCS poorly communicated: a single date or number of items to complete in a timebox, with no associated probability. In those instances I can absolutely see why it might be perceived as just fancy averaging.

MCS models variation and risk accumulation over time, something averaging simply cannot do, even in a stable system.
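To make that concrete, here is a minimal sketch (in Python, with made-up weekly Throughput numbers) of what an MCS “how many items” forecast gives you that a straight average cannot. The function names and figures are purely illustrative.

```python
import random

# Hypothetical weekly Throughput samples (items completed per week) - not real data.
weekly_throughput = [4, 7, 13, 5, 9, 6, 11, 4, 8, 10, 7, 5]

def mcs_how_many(samples, weeks, trials=10_000):
    """Simulate how many items might get done in `weeks` by resampling
    historical weekly Throughput, once per simulated week."""
    totals = [sum(random.choice(samples) for _ in range(weeks)) for _ in range(trials)]
    return sorted(totals)

outcomes = mcs_how_many(weekly_throughput, weeks=8)

# A straight average collapses everything into a single number...
average_projection = sum(weekly_throughput) / len(weekly_throughput) * 8

# ...whereas the simulation gives a distribution we can read at different confidence levels:
# "in 85% of trials the team completed at least this many items".
at_least_85 = outcomes[int(0.15 * len(outcomes))]
at_least_50 = outcomes[int(0.50 * len(outcomes))]

print(f"Average projection: {average_projection:.0f} items")
print(f"50% confidence: at least {at_least_50} items")
print(f"85% confidence: at least {at_least_85} items")
```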

Doesn’t this only apply when the system is stable? And how do we know it’s stable?

“Stability” is one of the more pertinent questions in the discussion, and probably the most divisive. The key question is: what do we mean by “stable”?

Let’s consider just a few factors that we might deem as affecting the stability of our system:

  • Team members or team size changing
  • Major issues in production
  • Tech debt increasing
  • Leadership frequently “pushing” work

These are undoubtedly things that MAY be impacting the stability of our system. As MCS relies on Throughput as the input, we have to look at our Throughput data and determine whether it is unstable (i.e. whether there is too much variation in our samples).

Let’s take this team’s Throughput data:

The argument here is that, as input data goes, this is unstable: there is too much variation, with 4-13 items completed a week. If it were 4-8 items (i.e. less variation), then it would be stable.

This is a subjective opinion on stability, not an objective one. If we are to be objective, this is where a Process Behaviour Chart (PBC) of our input data becomes imperative.

A PBC is a type of graph that visualises the variation in a process over time. It consists of a running record of data points, a central line that represents the average value, and upper and lower limits (referred to as the Upper Natural Process Limit - UNPL and Lower Natural Process Limit - LNPL) that define the boundaries of routine variation. A PBC can help to distinguish between common causes and exceptional causes of variation, and to assess the predictability and stability of a process. Basically, if there are values outside of our LNPL or UNPL lines, the system is objectively unstable and therefore the data shouldn’t be used for forecasting (although there is nothing stopping you!).

If we take our Throughput data and put it into a PBC, we can now get an objective sense of whether this team is predictable or not:

It is worth noting that Throughput is zero-bound data, as it is impossible for us to have a negative Throughput, so by default our LNPL is considered to be 0. With no points outside the limits, we can say that, objectively, this team is predictable.
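For the curious, here is a minimal sketch of how those limits are typically calculated for an XmR-style chart, with 2.66 being the standard scaling constant for charts of individual values; the Throughput numbers are illustrative, not the team’s actual data.

```python
def pbc_limits(values):
    """Compute XmR-style Process Behaviour Chart limits for a series of
    weekly Throughput values (one data point per week)."""
    average = sum(values) / len(values)
    # Moving ranges: absolute differences between consecutive data points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    average_mr = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard XmR scaling constant for individual values.
    unpl = average + 2.66 * average_mr
    lnpl = max(average - 2.66 * average_mr, 0)  # Throughput is zero-bound, so LNPL cannot go below 0
    return average, lnpl, unpl

weekly_throughput = [4, 7, 13, 5, 9, 6, 11, 4, 8, 10, 7, 5]  # illustrative samples
average, lnpl, unpl = pbc_limits(weekly_throughput)
outside = [v for v in weekly_throughput if v < lnpl or v > unpl]
print(f"Average: {average:.1f}, LNPL: {lnpl:.1f}, UNPL: {unpl:.1f}")
print("Signals of exceptional variation!" if outside else "Routine variation only - objectively predictable.")
```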

Now, if the Throughput data were producing data points outside our UNPL, then it would be very risky to use it to forecast (as the inputs are unstable). Of course, nothing can “stop” you from still using it to forecast, but proceed with extreme caution!

This is where I believe there is likely never to be agreement amongst practitioners and critics. MCS operates purely on objective inputs; it can’t account for subjective or hidden factors unless the data (Throughput) reflects them. If your position is that the system is unstable, but that position is based on subjective information, then you’re never going to agree with someone taking the objective view. This is fine, however it’s probably not an effective use of time to keep debating it amongst ourselves!

When the system is stable, why can’t we just use averages?

This is absolutely a fair challenge to make, and one which really is best answered by data, rather than conference panels or LinkedIn debates. 

The hypothesis here is that, using a PBC, our system is stable, therefore averaging shouldn’t produce any different results than using MCS. 

A few years ago I wrote a blog looking at the accuracy of probabilistic forecasts, and thankfully I still have the Throughput data for 11 of those 25 teams. I can take the average weekly Throughput from 6, 8, 10 and 12 weeks’ worth of historical data and use it to forecast what will get “done” (in terms of number of items) in the next 2, 4, 8 and 12 week time periods.
It’s worth noting that for all these teams, the input data was ‘stable’ (as defined above). Similarly, teams were not set any targets as part of this, as the data was reviewed long after the actual “dates” had passed.

Any time a cell is marked green, the forecast was accurate: the team completed that many items OR MORE.

Where a cell is marked red, the forecast was inaccurate: the team completed fewer than that number of items.
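For transparency, the check for each team and time period is roughly the one sketched below; a simplified, hypothetical version, not the exact analysis behind the tables that follow.

```python
import random

def mcs_how_many(samples, weeks, trials=10_000):
    """Resample weekly Throughput to simulate items completed over `weeks`."""
    return sorted(sum(random.choice(samples) for _ in range(weeks)) for _ in range(trials))

def backtest(history, actual_completed, horizon_weeks):
    """Return, per method, whether the forecast was 'accurate' - i.e. the team
    completed at least the forecast number of items within the horizon."""
    average_forecast = sum(history) / len(history) * horizon_weeks
    totals = mcs_how_many(history, horizon_weeks)
    results = {"average": actual_completed >= average_forecast}
    for pct in (50, 70, 85):
        # "85th percentile" here means the item count reached in at least 85% of trials.
        forecast = totals[int((1 - pct / 100) * len(totals))]
        results[f"mcs_{pct}"] = actual_completed >= forecast
    return results

# Hypothetical example: 8 weeks of history, a 4-week forecast horizon,
# and a team that actually completed 27 items in those 4 weeks.
print(backtest([4, 7, 13, 5, 9, 6, 11, 4], actual_completed=27, horizon_weeks=4))
```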

This is what the data looks like using averages to forecast:

Here we can compare the same results using MCS at our 50th, 70th and 85th percentiles:

A summary for each being:

  • Average - correct 59% (104/176) of the time
  • 50th percentile MCS - correct 62% (109/176) of the time
  • 70th percentile MCS - correct 73% (128/176) of the time
  • 85th percentile MCS - correct 81% (143/176) of the time

Looking at this data, we can clearly see that MCS at higher confidence levels significantly outperforms averaging in terms of forecast reliability, even when using stable input data. In other words: the stability of the system doesn’t negate the value MCS adds.

You can’t use MCS for a single project

As a practitioner, I can absolutely say that you can, as I have many times; however, it’s important to be clear about how to think about this. MCS will give you a general expectancy for a single project based on past performance.

Think of MCS like a weather forecast.

If you’re planning your day and the forecast says there’s a 70% chance of rain, that doesn’t mean it will rain; it means that, based on similar weather patterns in the past, rain occurred 70% of the time under those conditions.

You don’t need 100% certainty to make a decision; you use the forecast to improve the odds of making a good one. You might take an umbrella, adjust your route, or change your plans, even though it might stay dry all day.

The same applies to a software project. When we use MCS to simulate delivery outcomes based on past performance, we’re not saying this project will definitely finish in X weeks.

We’re saying given what we've seen before, here’s how likely different outcomes are.

And just like a weather forecast updates hourly with new data, so should our delivery forecast as the project progresses.

It’s not about certainty, it’s about confidence over time.

What is imperative is that, as the project progresses and we gain more information, we continually re-forecast to give us the expected outcomes (in terms of completion date and associated likelihood) for that project.

Yeah, but we need lots of data for MCS, don’t we?

No, this is one of the biggest misunderstandings with MCS in the context of software development. 

With 5 samples we are confident that the median will fall inside the range of those 5 samples, so that already gives us an idea about our timing and we can make some simple projections (Source: Actionable Agile Metrics For Predictability).

With 11 samples we are confident that we know the whole range, as there is a 90% probability that every other sample will fall in that range (see the German Tank Problem). Knowing the range of possible values drastically reduces uncertainty.

Similarly, when validating my own model, I found that when choosing between 6 and 12 weeks of historical data, the amount of history does not play a significant role in forecast accuracy.
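To illustrate how little data is needed, here is a minimal “when will it be done” sketch that uses only 11 weekly samples; the backlog size and Throughput values are made up for the example.

```python
import random

# Just 11 weekly Throughput samples - illustrative values, but enough to start forecasting.
recent_throughput = [6, 4, 9, 7, 5, 8, 11, 6, 7, 5, 9]

def mcs_when(samples, backlog_items, trials=10_000):
    """Simulate how many weeks a backlog of `backlog_items` might take by drawing
    a weekly Throughput sample until the backlog is cleared."""
    weeks_needed = []
    for _ in range(trials):
        remaining, weeks = backlog_items, 0
        while remaining > 0:
            remaining -= random.choice(samples)  # assumes no zero-Throughput weeks in the samples
            weeks += 1
        weeks_needed.append(weeks)
    weeks_needed.sort()
    # Weeks needed at 50/70/85% confidence: "X% of trials finished within this many weeks".
    return {pct: weeks_needed[int(pct / 100 * trials) - 1] for pct in (50, 70, 85)}

print(mcs_when(recent_throughput, backlog_items=60))
```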

But senior leaders don’t know the difference between 85% or 95% likelihood…

This is a bit of an unfair generalisation about leaders; certainly in industries where risk management is paramount (insurance and finance, to name a few), they absolutely do understand and communicate in terms of risk.

I believe that part of the criticism here stems from forecasts being communicated with a probability as the only information presented. As delivery professionals we need to present those probabilities in terms that drive meaningful decisions: what the risk means for scope, timelines, cost, and confidence. Probabilities only become powerful when they’re tied to options, trade-offs, and consequences. For example, communicating it as “At 85% confidence, we’re likely to deliver X by Y date; at 95%, we should plan for Z” brings it to life far more easily for those consuming the forecast.
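As a trivial sketch of that framing, the percentile outputs of a forecast can be turned directly into date-based statements; the confidence levels, week counts and start date below are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical output of an MCS "when" forecast: weeks needed at each confidence level.
forecast_weeks = {85: 10, 95: 12}
forecast_start = date(2025, 6, 2)  # hypothetical date the forecast was run

for confidence, weeks in forecast_weeks.items():
    finish = forecast_start + timedelta(weeks=weeks)
    print(f"At {confidence}% confidence, we expect to deliver by {finish:%d %B %Y}.")
```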

So what happens when the system becomes unstable?

This was a great question posed by Nigel in the panel, and sadly one we didn’t really get the chance to hear a fully fleshed-out answer to. In a follow-on blog post, Colleen will tell us more about what to do when this happens. The TL;DR version: this is why you should always be practising continuous forecasting.

With cycle time, aren’t 80-90% usually within a tight band, with the rest being the ones with special causes?

Any time someone is talking about cycle time in the context of a discussion about MCS, they are throwing more terminology into the mix when it’s simply not necessary. Cycle Time is not an input into MCS, and anyone who thinks it is needs to re-read about MCS in the context of software development.

Coming back to the point raised, this is again a statement that we should be qualifying with real data. Take this team’s cycle time data:

And put it into a PBC:

Here only 1 of the 30 (3%) data points is above our UNPL - nowhere near 10-20%. This is just one team, but we need to be extremely mindful about making generalisations without this being backed by data. 

Is MCS adding anything to software predictability?

This again is one where almost certainly there will never be agreement. There will never be consensus on what we mean by ‘predictability’, so making the black-and-white statement that MCS adds nothing to software predictability feels unfair.

There are still plenty of organisations who view predictability (and measure it) as say-do ratio, which we know is an incredibly naive take on software development. If you wanted something more objective, you could view predictability through the lens of a PBC, and argue that if we are within those UNPL/LNPL lines then we are in fact ‘predictable’ - but we know PBCs are quite daunting to get your head around and, as useful as they are, they are still not used by many organisations and practitioners, in spite of what they can show.

When it comes to MCS, I believe it’s better to validate its impact/effectiveness through a series of questions:

Is it making us aware of variation and its impact in our process? Yes.

Is it helping everyone understand that there are a range of outcomes that can occur and the levers we can pull to change those? Yes.

Is it helping us better understand and communicate risk? Yes. 

Those might not fit your definition of ‘predictability’, but they certainly make the world of software delivery easier.

Ultimately, Monte Carlo Simulation is not about adding complexity for its own sake. It’s about helping us see uncertainty, model risk, and make better delivery commitments. If we reduce it to averaging, we miss its true value, and lose a powerful tool for managing the realities of software delivery.