Posts tagged: Analysis

Why most sales forecasts suck…and how Monte Carlo simulations can make them better

Sales forecasts don’t suck because they’re wrong. They suck because they try to be too right. They create an impossible illusion of precision that ultimately does a disservice to those of us who need accurate forecasts to assist with our planning. Even meteorologists, who are scientists with tons of historical data, incredibly high-powered computers and highly sophisticated statistical models, can’t forecast with the precision we retailers attempt. And we don’t have nearly the data, the tools or the models meteorologists have.

Luckily, there’s a better way. Monte Carlo simulations run in Excel can transform our limited data sets into statistically valid probability models that give us a much more accurate view into the future. And I’ve created a model you can download and use for yourself.

There are literally millions of variables involved in our weekly sales, and we clearly can’t manage them all. We focus on the few significant variables we can affect as if they are 100% responsible for sales, but they’re not and they are also not 100% reliable.

Monte Carlo simulations can help us emulate real world combinations of variables, and they can give us reliable probabilities of the results of combinations.

But first, I think it’s helpful to provide some background on our current processes…

We love our numbers, but we often forget some of the intricacies about numbers and statistics that we learned along the way. Most of us grew up not believing a poll of 3,000 people could predict a presidential election. After all, the pollsters didn’t call us. How could the opinions of 3,000 people predict the opinions of 300 million people?

But then we took our first statistics classes. We learned all the intricacies of statistics. We learned about the importance of properly generated and significantly sized random samples. We learned about standard deviations and margins of error and confidence intervals. And we believed.

As time passed, we moved on from our statistics classes and got into business. Eventually, we started to forget a lot about properly selected samples, standard deviations and such and we just remembered that you can believe the numbers.

But we can’t just believe any old number.

All those intricacies matter. Sample size matters a lot, for example. Basing forecasts, as we often do, on limited sets of data can lead to inaccurate forecasts.

Here’s a simplified explanation of how most retailers that I know develop sales forecasts:

  1. Start with base sales from last year for the same time period you’re forecasting (separating out promotion-driven sales).
  2. Apply the current sales trend (which is maybe determined by an average of the previous 10 weeks’ comps). This method may vary from retailer to retailer, but this is the general principle.
  3. Look at previous iterations of the promotions being planned for this time period. Determine the incremental revenue produced by those promotions (potentially through comparisons to control groups). Average the incremental results of previous iterations of the promotion, and add that average to the amount determined in steps 1 and 2.
  4. Voilà! This is the sales forecast.
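For anyone who prefers to see the arithmetic, here is a minimal sketch of steps 1 through 3 in Python. The function name and every dollar figure are hypothetical, made up purely for illustration; real retailers will calculate the trend and the promotion lift in their own ways.

```python
from statistics import mean

def point_forecast(base_sales_last_year, recent_comps, past_promo_lifts):
    """Single-number forecast following steps 1-3 of the list above."""
    # Step 2: the trend is the average of recent weekly comps (0.02 means +2%)
    trend = mean(recent_comps)
    # Step 3: average incremental revenue from earlier runs of the promotion
    promo_lift = mean(past_promo_lifts)
    # Step 1 supplies the base; steps 2 and 3 adjust it
    return base_sales_last_year * (1 + trend) + promo_lift

# Hypothetical inputs, for illustration only
forecast = point_forecast(
    base_sales_last_year=1_000_000,
    recent_comps=[-0.03, 0.01, -0.02, 0.00, -0.01, 0.02, -0.04, -0.01, 0.01, -0.03],
    past_promo_lifts=[120_000, 80_000, 150_000],
)
print(f"Forecast: ${forecast:,.0f}")
```

Notice that the averages throw away all the week-to-week variation in the inputs, which is exactly how one impossibly precise number comes out the other end.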

Of course, this number is impossibly precise and the analysts who generate it usually know that. However, those on the receiving end tend to assume it is absolutely accurate and the probability of hitting the forecast is close to 100% — a phenomenon I discussed previously when comparing sales forecasts to baby due dates.

As most of us know from experience, actually hitting the specific forecast almost never happens.

We need accuracy in our forecasts so that we can make good decisions, but unjustified precision is not accuracy. It would be far more accurate to forecast a range of sales with accompanying probabilities. And that’s where the Monte Carlo simulation comes in.

Monte Carlo simulations

Several excellent books I read in the past year (The Drunkard’s Walk, Fooled by Randomness, Flaw of Averages, and Why Can’t You Just Give Me a Number?) all promoted the wonders of Monte Carlo simulations (and Sam Savage of Flaw of Averages even has a cool Excel add-in). As I read about them, I couldn’t help but think they could solve some of the problems we retailers face with sales forecasts (and ROI calculations, too, but that’s a future post). So I finally decided to try to build one myself. I found an excellent free tutorial online and got started. The result is a file you can download and try for yourself.

A Monte Carlo simulation might be most easily explained as a “what if” model and sensitivity analysis on steroids. Basically, the model allows us to feed in a limited set of variables about which we have some general probability estimates and then, based on those inputs, generate a statistically valid set of data we can use to run probability calculations for a variety of possible scenarios.

It turns out to be a lot easier than it sounds, and this is all illustrated in the example file.
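The downloadable file is built in Excel, but the same idea can be sketched in a few lines of Python. This is not a reproduction of that file: the function names, the input numbers and the choice of bell-curve (normal) distributions for the trend and the promotion lift are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def simulate_sales(base_sales, comps, promo_lifts, trials=10_000, seed=42):
    """Return one simulated sales total per trial."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    comp_mu, comp_sd = mean(comps), stdev(comps)
    lift_mu, lift_sd = mean(promo_lifts), stdev(promo_lifts)
    totals = []
    for _ in range(trials):
        trend = rng.gauss(comp_mu, comp_sd)           # one possible sales trend
        lift = max(0.0, rng.gauss(lift_mu, lift_sd))  # assume a promo never subtracts sales
        totals.append(base_sales * (1 + trend) + lift)
    return totals

def chance_of_at_least(totals, target):
    """Share of simulated weeks that reach the target."""
    return sum(t >= target for t in totals) / len(totals)

# Hypothetical inputs, for illustration only
comps = [-0.03, 0.01, -0.02, 0.00, -0.01, 0.02, -0.04, -0.01, 0.01, -0.03]
lifts = [120_000, 80_000, 150_000]

totals = sorted(simulate_sales(1_000_000, comps, lifts))
low = totals[len(totals) // 10]        # 10th percentile
high = totals[len(totals) * 9 // 10]   # 90th percentile
print(f"80% of simulated weeks fall between ${low:,.0f} and ${high:,.0f}")
print(f"Chance of reaching $1,150,000: {chance_of_at_least(totals, 1_150_000):.0%}")
```

Each trial draws one plausible trend and one plausible promotion lift, so the ten thousand totals together describe a range of outcomes with probabilities attached instead of a single point.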

The results are really what matters. Rather than producing a single number, we get probabilities for different potential sales that we can use to more accurately plan our promotions and our operations. For example, we might see that our base business has about a 75% chance of being negative, so we might want to amp up our promotions for the week in order to have a better chance of meeting our growth targets. Similarly, rather than reflexively “anniversaring” promotions, we can easily model the incremental probabilities of different promotions to maximize both sales and profits over time.

The model allows for easily comparing and contrasting the probabilities of multiple possible options. We can use what are called probability weighted “expected values” to find our best options. Basically, rather than straight averages that can be misleading, expected values are averages that are weighted based on the probability of each potential result.
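As a quick sketch of the difference between a straight average and an expected value, consider two promotions with three possible outcomes each. The probabilities and profit figures are invented for illustration, and `expected_value` is just a helper name I'm assuming here.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs whose probabilities sum to 1."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)

def straight_average(outcomes):
    """Average of the possible values, ignoring how likely each one is."""
    return sum(v for _, v in outcomes) / len(outcomes)

# Hypothetical incremental profit scenarios for two promotions
promo_a = [(0.10, -20_000), (0.60, 30_000), (0.30, 60_000)]
promo_b = [(0.60, -10_000), (0.30, 40_000), (0.10, 150_000)]

for name, promo in (("A", promo_a), ("B", promo_b)):
    print(f"Promo {name}: straight average ${straight_average(promo):,.0f}, "
          f"expected value ${expected_value(promo):,.0f}")
```

With these made-up numbers the straight average favors promotion B because of its one big but unlikely payoff, while the expected value favors promotion A.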

Of course, probabilities and ranges aren’t as comfortable to us as specific numbers, and using them really requires a shift in mindset. But accepting that the future is uncertain and planning based on the probabilities of potential results puts us in the best possible position to maximize those results. Understanding the range of possible results allows for better and smarter planning. Sometimes, the results will go against the probabilities, but consistently making decisions based on probabilities will ultimately earn the best results over time.

One of management’s biggest roles is to guide our businesses through uncertain futures. As managers and executives, we make the decisions that determine the directions of our companies. Let’s ensure we’re making our decisions based on the best and most accurate information — even if it’s not the simplest information.

What do you think? What issues have you seen with sales forecasts? Have you tried my example? How did it work for you?

Wanna be better with metrics? Watch more poker and less baseball.

Both baseball and poker have been televising their World Series championships, and announcers for both frequently describe strategies and tactics based on the statistics of the games. Poker announcers base their commentary and discussion on the probabilities associated with a small number of key metrics, while baseball announcers barrage us with numbers that sound meaningful but that are often pure nonsense.

Similarly, today’s web analytics give us the capability to track and report data on just about anything, but just because we can generate a number doesn’t mean that number is meaningful to our business. In fact, reading meaning into meaningless numbers can cause us to make very bad decisions.

Don’t get me wrong, I am a huge believer in making data-based decisions, in baseball, poker, and on our websites. But making good decisions is heavily dependent on using the right data and seeing the data in the right light. I sometimes worry that constant exposure to sports announcers’ misreading and misappropriation of numbers is actually contributing to a misreading and misunderstanding of numbers in our business settings.

Let’s consider a couple of examples of misreading and misappropriating numbers that have occurred in baseball over the last couple of weeks:

  1. Selection bias
    This one is incredibly common in the world of sports and nearly as common in business. Recently, headlines here in Detroit focused on the Tigers “choking” and blowing a seven-game lead with only 16 games to go. In a recent email exchange on this topic, my friend Chris Eagle pointed out the problems with the sports announcers’ hyperbole:

    “They’re picking the high-water mark for the Tigers in order to make their statement look good.  If you pick any other random time frame (say end-of-August, which I selected simply because it’s a logical break point), the Tigers were up 3.5 games.  But it doesn’t look like much of a choke if you say the Tigers lost a 3.5 game lead with a month and change to go.”

    Unfortunately, this type of analysis error occurs far too often in business. We might find that our weekend promotions are driving huge sales over the last six months, which sounds really impressive until we notice that non-sale days have dropped significantly as we’ve just shifted our business to days when we are running promotions (which may ultimately mean we’ve reduced our margins overall by selling more discounted product and less full-price merchandise).

    In a different way, Dennis Mortensen addressed the topic in his excellent blog post “The Recency Bias in Web Analytics,” where he points out the tendency to give undue weight to more recent numbers. He included a strong example about the problems of dashboards that lack context. Dashboards with gauges look really cool but are potentially dangerous as they are only showing metrics from a very short period of time. Which leads me to…

  2. Inconsistency of averages over short terms
    Baseball announcers and reporters can’t get enough of this one. Consider this article on the Phillies’ Ryan Howard after Game 3 of the World Series that includes, “Ryan Howard’s home run trot has been replaced by a trudge back to the dugout. The Phillies’ big bopper has gone down swinging more than he’s gone deep…He’s still 13 for 44 overall in the postseason (.295) but only 2 for 13 (.154) in the World Series.” Actually, during the length of the season, he had three times as many strikeouts as home runs, so his trudges back to the dugout seem pretty normal. And the problem with the World Series batting average stat is the low sample size. A sample of thirteen at bats is simply too small to match against his season-long average of .279. Do different pitchers or the pressures of the situation have an effect? Maybe, but there’s nothing in the data to support such a conclusion. Segmenting by pitcher or “postseason” suffers from the same small sample size problems, where the margin of error expands significantly. Furthermore, and this is really key, knowing an average without knowing the variability of the original data set is incomplete and often misleading.

    These problems with variability and sample sizes arise frequently in retail analysis when we either run a test with too small a sample size and assume we can project it to the rest of the business, or we run a properly sized test but assume we’ll automatically see those same results in the first day of a full application of the promotion. Essentially, the latter point is what is happening with Ryan Howard in the postseason. We often hear the former as well when a player is all of a sudden crowned a star when he outperforms his season averages over a few games in the postseason.

    In retail, we frequently see this type of issue when we’re comparing something like average order value of two different promotions or two variations in an A/B test. Say we’ve run an A/B test of two promotions. Over 3,100 iterations of test A, we have an average order size of $31.68. And over 3,000 iterations of test B, we have an average order size of $32.15. So, test B is the clear winner, right? Wrong. It turns out there is a lot more variability in test B, which has a standard deviation of 11.37 compared with test A’s standard deviation of 7.29. As a result, the margin of error on the comparison expands to +/- 48 cents. The 47-cent difference between the two averages falls inside that margin, so at the 95% confidence level we can’t say there is any real difference between the tests. Therefore, it would be a mistake to project an increase in transaction size if we went with test B.

    Check out that example using this simple calculator created by my fine colleagues at ForeSee Results and play around with your own scenarios.  Download Test difference between two averages.
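For those who would rather check the math in code, here is a rough Python sketch of both examples. It uses the standard formula for the margin of error on a difference between two averages, which may or may not be exactly what the calculator does, though it does reproduce the 48-cent figure. The Ryan Howard line treats every at bat as an independent trial at his season average, which is a simplifying assumption.

```python
from math import comb, sqrt

Z_95 = 1.96  # normal critical value for a two-sided 95% confidence level

def difference_margin_of_error(sd_a, n_a, sd_b, n_b, z=Z_95):
    """Margin of error on the difference between two independent sample averages."""
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return z * standard_error

def prob_at_most(hits, at_bats, average):
    """Binomial chance of `hits` or fewer hits in `at_bats` for a given true average."""
    return sum(
        comb(at_bats, k) * average**k * (1 - average) ** (at_bats - k)
        for k in range(hits + 1)
    )

# A/B test figures from the example above
margin = difference_margin_of_error(sd_a=7.29, n_a=3100, sd_b=11.37, n_b=3000)
difference = 32.15 - 31.68
print(f"Difference: ${difference:.2f}, margin of error: +/- ${margin:.2f}")
print("Real difference?", "yes" if abs(difference) > margin else "can't tell")

# How unusual is 2 for 13 (or worse) for a .279 hitter?
print(f"Chance of 2 or fewer hits in 13 at bats: {prob_at_most(2, 13, 0.279):.0%}")
```

Under those assumptions, a .279 hitter goes 2 for 13 or worse roughly a quarter of the time, so the World Series slump is hardly evidence of anything.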

Poker announcers don’t seem to fall into all these statistical traps. Instead, they focus on a few key metrics like the number of outs and the size of the pot to discuss strategies for each player based largely on the probability of success in light of the risks and rewards of a particular tactic. Sure, there are intangibles like “poker tells” that occur, but even those are considered in light of the statistical probabilities of a particular situation.

Retail is certainly more complicated than poker, and the number of potential variables to deal with is immense. However, we can be much more prepared to deal with the complexities of our situations if we take a little more time to view our metrics in the right light. Our data-driven decisions can be far more accurate if we ensure we’re looking at the full data set, not a carefully selected subset, and we take the extra few minutes to understand the effects of variability on averages we report. A little extra critical thinking can go a long way.

What do you think? Are there better ways to analyze key metrics at your company? Do you consider variability in your analyses? Do you find the file to test two averages useful?




Are web analytics like 24-hour news networks?

We have immediate access to loads of data with our web sites, but just because we can access lots of data in real time doesn’t mean we should access our data in real time. In fact, accessing and reporting on the numbers too quickly can often lead to distractions, false conclusions, premature reactions and bad decisions.

I was attending the web-analytics-focused Semphonic X Change conference last week in San Francisco (which, by the way, was fantastic) where lots of discussion centered around both the glories and the issues associated with the massive amount of data we have available to us in the world of the web.

Before heading down for the conference breakfast Friday morning (September 11), I switched on CNN and saw — played out in all their glory on national TV — the types of issues that can occur with reporting too early on available data.

It seems CNN reporters “monitoring video” from a local TV station saw Coast Guard vessels in the Potomac River apparently trying to keep another vessel from passing. They then monitored the Coast Guard radio and heard someone say, “You’re approaching a Coast Guard security zone. … If you don’t stop your vessel, you will be fired upon. Stop your vessel immediately.” And, for my favorite part of the story, they made the decision to go on air when they heard someone say “bang, bang, bang, bang” and “we have expended 10 rounds.” They didn’t hear actual gun shots, mind you, they heard someone say “bang.” Could this be a case of someone wanting the data to say something it isn’t really saying?

In the end, it turned out the Coast Guard was simply executing a training exercise it runs four times a week! Yet, the results of CNN’s premature, erroneous and nationally broadcast report caused distractions to the Coast Guard leadership and White House leadership, caused the misappropriation of FBI agents who were sent to the waterfront unnecessarily, led to the grounding of planes at Washington National airport for 22 minutes, and resulted in reactionary demands from law enforcement agencies that they be alerted of such exercises in the future, even though the exercises run four times per week and those alerts will likely be quickly ignored because they will become so routine.

In the days when we only got news nightly, reporters would have chased down the information, discovered it was a non-issue and the report would have never aired. The 24-hour networks have such a need for speed of reporting that they’ve sacrificed accuracy and credibility.

Let’s not let such a rush negatively affect our businesses.

Later on that same day, I was attending a conference discussion on the role of web analytics in site redesigns. Several analysts in the room mentioned their frustrations when they were asked by executives for a report on how the new design was doing only a couple of hours after the launch of a new site design. They wanted to be able to provide solid insight, but they knew they couldn’t provide anything reliable so soon.

Even though a lot of data is already available a couple of hours in, that data lacks the context necessary to start drawing conclusions.

For one, most site redesigns experience a dip in key metrics initially as regular customers adjust to a new look and feel. In the physical retail world, we used to call this the “Where’s my stuff?” phenomenon. But even if we set the initial dip aside, there are way too many variables involved in the short term of web activity to make any reliable assessments of the new design’s effectiveness. As with any short-term measurement, the possibility that random outliers will unnaturally sway the measurement in one direction or another is high. It takes some time and an accumulation of data to be sure we have a reliable story to tell.
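To get a feel for how noisy an early read can be, here is a small simulation sketch in Python. Everything in it is hypothetical: it assumes a redesign that changes nothing at all, a true conversion rate that stays at 3%, and visitor counts picked just to contrast a couple of hours of traffic with a few weeks of it.

```python
import random

def observed_rate(true_rate, visitors, rng):
    """Conversion rate we would report after this many visitors."""
    conversions = sum(rng.random() < true_rate for _ in range(visitors))
    return conversions / visitors

def range_of_reads(true_rate, visitors, repeats=100, seed=7):
    """Lowest and highest rate seen across repeated launches with the same true rate."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    rates = [observed_rate(true_rate, visitors, rng) for _ in range(repeats)]
    return min(rates), max(rates)

# Hypothetical: the redesign changes nothing and the true rate stays at 3%
early = range_of_reads(0.03, visitors=300)      # a couple of hours of traffic
later = range_of_reads(0.03, visitors=20_000)   # a few weeks of traffic

print(f"Early reads range from {early[0]:.1%} to {early[1]:.1%}")
print(f"Later reads range from {later[0]:.1%} to {later[1]:.1%}")
```

Even though nothing about the site changed in this made-up scenario, the early reads swing far more widely than the later ones, and any one of them could be mistaken for a win or a disaster.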

And even with time, web data collection is not perfect. Deleted cookies, missed connections, etc. can all cause some problems in the overall completeness of the data. For that matter, I’ve rarely seen the perfect set of data in any retail environment. Given the imperfect nature of the data we’re using to make key strategic decisions, we need to give our analysts time to review it, debate it and come to reasoned conclusions before we react.

I realize the temptation is strong to get an “early read” on the progress of a new site design (or any strategic issue, really). I’ve certainly felt it myself on many occasions. However, since just about every manager and executive I know (including myself) has a strong bias for action, we have to be aware of the risks associated with these “early reads” and our own abilities or inabilities to make conclusions and immediately react. Early reads can lead to the bad decisions associated with the full accelerator/full brake syndrome I’ve referenced previously.

We can spend months or even years preparing for a massive new strategic effort and strangle it within days by overreacting to early data. Instead, I wonder if it’s better to determine well in advance of the launch (when we’re thinking more rationally and the temptation to know something is low) when we’ll first analyze the success of our new venture. Why not make such reporting part of the project plan and publicly set expectations about when we’ll review the data and what type of adjustments we should plan to make based on what we learn?

In the end, let’s let our analysts strive for the credibility of the old nightly news rather than emulate the speed and rush to judgment that too often occurs in this era of 24-hour news. Our businesses and our strategies are too important and have taken too long to build to sacrifice them to a short-term need for speed.

What do you think? Have you seen this issue in action? How do you deal with the balance between quick information and thoughtful analysis?





How the US Open was like a retail promotion analysis

Last week’s US Open golf tournament had a surprise leader going into the final round in Ricky Barnes, who came out of relative obscurity to record the best 36-hole score in US Open history, beating out golf greats like Tiger Woods and Phil Mickelson.

The media played it up, talking about Barnes as finally coming into his own and really blossoming. Was he the next big golf star? From CBS Sports:

“Until this week at the 109th U.S. Open, keeping up with the big boys has always been difficult prospect (for Barnes), full of disappointment, figurative bloody noses and scabby knees.

‘I know he hates losing,’ brother Andy said. ‘Maybe because he did a lot of it when he was younger.’

And more than a bit as a young adult, too, which is what made his record-setting start at Bethpage Black all the more surprising. In a field full of the household names with whom Barnes has been so desperately trying to compete, he’s finally atop the leaderboard.”

As Barnes faltered during the final rounds and Woods and Mickelson improved, it was all about who was handling the pressure well and who wasn’t. Barnes’ score dropped off significantly over the final two rounds while the bigger names improved.

Was it the pressure? I would argue that what we really saw was what statisticians call a regression toward the mean (or average). Basically, Woods and Mickelson began the tournament playing well below their usual standard, but with each round they began to score closer to what they would normally be expected to score. Barnes basically did the opposite. When continuously measured over time, Tiger Woods is still clearly the world’s top golfer. This can be clearly seen in the world golf rankings where Woods is #1 and Barnes is #153.
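Regression toward the mean is easy to see in a simulation. The sketch below is a toy model, not real golf data: it assumes every player has a fixed true scoring average and that each round varies randomly around it, with a field size and spreads that are invented for illustration.

```python
import random
from statistics import mean

def simulate_tournament(rng, field_size=150, round_sd=3.0):
    """Return the 36-hole leader's per-round average before and after the halfway point."""
    # Hypothetical skill model: each player has a true average score per round
    skills = [rng.gauss(72.0, 1.5) for _ in range(field_size)]
    # First two rounds: true average plus random variation each day
    first_two = [rng.gauss(s, round_sd) + rng.gauss(s, round_sd) for s in skills]
    leader = min(range(field_size), key=first_two.__getitem__)  # lowest score leads
    # Final two rounds for that same leader, with no pressure effect in the model
    last_two = rng.gauss(skills[leader], round_sd) + rng.gauss(skills[leader], round_sd)
    return first_two[leader] / 2, last_two / 2

rng = random.Random(2009)  # fixed seed so reruns give the same numbers
results = [simulate_tournament(rng) for _ in range(500)]
leader_first = mean(first for first, _ in results)
leader_last = mean(last for _, last in results)

print(f"Leader's average round, first two days: {leader_first:.1f}")
print(f"Leader's average round, last two days:  {leader_last:.1f}")
```

There is no pressure anywhere in this model, yet the 36-hole leader still scores worse on average over the final two rounds, simply because leading at the halfway point usually takes both skill and a run of good luck.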

So, why is this like a retail promotion analysis?
Because we retailers have a tendency to look at each short-term promotion result in isolation and then make concrete conclusions and kick off immediate modifications. Come Monday morning, we’re looking to see how the weekend sale did, and we’re ready to change next weekend’s sale if this past one didn’t perform to expectations. We don’t take into account the possibility that we might have witnessed an outlier result that is not really indicative of the actual effectiveness of the promotion but is actually just the result of random luck, good or bad. After a single test, we could be ready to declare the promotion equivalent of Ricky Barnes the world’s greatest and the promo Tiger Woods an also-ran. And the next time we run the Barnes promotion and it’s a dog, we’ll revert.

An old colleague of mine used to call this the “full accelerator, full brake” syndrome. The net effect of all of this short term measurement and immediate reaction is a steady reduction in the average effectiveness of our promotions.

Instead, we should measure the effectiveness of promotions over a much longer period of time and over many instances. Because of the massive number of variables that can affect a promotion (including the obvious and more visible variables like weather and road construction and the less obvious and invisible variables like an unusual number of people happening to plan family picnics at the same time and therefore not shopping like they normally would have), we simply cannot count on a short-term measurement to provide the accuracy we need to make a wise decision. Short term, sometimes the promotions will show improvements and sometimes they won’t, just as Tiger Woods does not win every golf tournament he enters. Over time, though, we will come closer and closer to determining their true value.
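A back-of-the-envelope way to see why more instances help: the margin of error on an average shrinks with the square root of the number of runs. The sketch below assumes each run of a promotion is independent and that results vary from run to run with a standard deviation of $35,000, a figure invented for illustration.

```python
from math import sqrt

def margin_of_error(run_sd, runs, z=1.96):
    """Approximate 95% margin of error on the average lift after `runs` independent runs."""
    return z * run_sd / sqrt(runs)

# Hypothetical run-to-run standard deviation of $35,000
for runs in (1, 4, 16, 64):
    print(f"{runs:>2} runs: +/- ${margin_of_error(35_000, runs):,.0f}")
```

Quadrupling the number of runs only cuts the uncertainty in half, which is why a single weekend tells us so little and why patience pays off.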

This requires patience and courage that will be difficult in the fast paced retail environment, especially for public companies. However, it will produce a lot less churn and increase efficiency and effectiveness overall. And in an economic time when we’re trying to maximize the effectiveness of the staff we have left, less churn can go a long way.

What do you think? How are promotion analyses handled in your company? Do you measure over the long haul?



Retail: Shaken Not Stirred

