Bridging the gap between lean startup theory and practice

While the Lean Startup methodology lays out a great foundation to guide startup product development, applying it in practice remains challenging. When a startup first adopts Lean Startup principles, there are many initial questions. What is our MVP? What hypotheses should we test? How do we test those hypotheses? How do we convince the team to embrace this method? In this post, I would like to share my experience answering those questions.

Here’s a quick refresher on lean startup (courtesy of Wikipedia):

"Lean startup" is a method for developing businesses and products first proposed in 2011 by Eric Ries. Based on his previous experience working in several U.S. startups, Ries claims that startups can shorten their product development cycles by adopting a combination of business-hypothesis-driven experimentation, iterative product releases, and what he calls "validated learning". Ries' overall claim is that if startups invest their time into iteratively building products or services to meet the needs of early customers, they can reduce the market risks and sidestep the need for large amounts of initial project funding and expensive product launches and failures

At the core of “lean startup” is the idea that startups should release their “Minimum Viable Product” as quickly as possible instead of building a full featured product before launching to the market. Startups should identify and validate their key hypotheses by running cheap experiments instead of iteratively improving the product based on gut feeling. One of the key benefits to this approach is that startup founders are forced to think about key hypotheses upfront, which enables startups to have a cohesive strategy and validate hypotheses in cheap, quick and creative ways.

As an example, let’s compare Google Drive and Dropbox. Back in 2008, Google Drive was already a working Dropbox-like product used internally. After years of development it became tightly integrated with Google Docs, with development driven primarily by internal feedback. The product finally launched in early 2012, only then receiving market feedback. For most startups, this type of product strategy would carry overwhelming market risk: building a product that may have no audience. In contrast, instead of building the full-featured product first, Drew Houston, CEO of Dropbox, came up with a clever way to validate the key market hypotheses. He made a 3-minute video walking through how the product works, posted it on Hacker News, and watched the waiting list explode from 5,000 to 75,000 people overnight.

With the aforementioned context, let’s now dive into answering the initial lean startup questions we surfaced earlier.

1. what hypothesis to test first

Given that every startup’s runway is limited, what to validate first is a do-or-die question. Hypotheses vary in importance and in how easy they are to test. Without holistic product thinking, testing low-impact hypotheses quickly leads to local optimization. For instance, testing email subject lines to optimize open rate is probably low impact, and shouldn’t be a focus before product/market fit. Hypotheses can also be interdependent, and premature optimization can lead to suboptimal product impact. For example, optimizing growth before having reasonable retention just accumulates users who never come back; optimizing paid customer acquisition before having reasonable customer lifetime value might jeopardize cash flow.

Return on investment is probably the most widely adopted principle for prioritization. In the spirit of being lean, startups should always try to validate the highest-impact hypothesis with the cheapest experiment. Estimating the potential return ahead of time can itself be a challenging problem. Two simple alternatives are to design a separate experiment to estimate the potential impact, or to focus energy on the riskiest hypotheses. In hindsight, focusing on the riskiest hypothesis is exactly why most consumer-facing companies want to prove sustainable growth before anything else.
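
As an illustration, here is a minimal sketch of this kind of ROI-style prioritization; the hypotheses, impact scores, risk levels, and cost estimates below are entirely hypothetical and only meant to show the bookkeeping, not a prescribed formula:

```python
# Rank hypotheses by a rough return-on-investment score: estimated impact if the
# hypothesis turns out to be true, weighted by how risky (uncertain) it is, divided
# by the estimated cost of the experiment. All numbers are made up.
hypotheses = [
    {"name": "users want a mobile app",            "impact": 8, "risk": 0.7, "cost_days": 5},
    {"name": "red signup button beats green",      "impact": 1, "risk": 0.2, "cost_days": 2},
    {"name": "real-time help improves completion", "impact": 6, "risk": 0.5, "cost_days": 8},
]

def roi_score(h):
    # Riskier hypotheses are weighted up: resolving them teaches the most either way.
    return h["impact"] * h["risk"] / h["cost_days"]

for h in sorted(hypotheses, key=roi_score, reverse=True):
    print(f'{h["name"]}: score = {roi_score(h):.2f}')
```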

2. how to test big hypotheses

Testing big hypotheses is crucial for early-stage startups, but they are also the most difficult to validate. Breaking a big hypothesis down into testable pieces typically results in cheaper individual experiments. For instance, suppose Codecademy wants to test the hypothesis “there are X mobile users who want to learn coding on their mobile devices”; that would be difficult to test in a single experiment. Breaking it into two smaller hypotheses, (a) “there are Y users who intend to learn coding” and (b) “Z% of them would learn on their mobile devices”, makes testing much easier, especially since the company may already know Y, the size of the addressable market, and can approximate Z separately from the percentage of Codecademy users on mobile devices.
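
To make the decomposition concrete, here is a minimal sketch with made-up numbers; Y and Z below are hypothetical estimates, not Codecademy figures:

```python
# Hypothetical estimates: neither number comes from Codecademy.
learners_who_want_to_code = 10_000_000  # Y: addressable market of would-be learners
mobile_share = 0.30                     # Z: fraction who would learn on a mobile device

# The big hypothesis "X mobile learners exist" becomes a simple product of the two
# smaller, independently testable estimates: X ≈ Y * Z.
estimated_mobile_learners = learners_who_want_to_code * mobile_share
print(f"Estimated mobile learners (X): {estimated_mobile_learners:,.0f}")
```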

It's not difficult to see that breaking down a big hypothesis is more art than science. In my opinion, it's a tradeoff among how accurate the decomposition is, how easy each piece is to test, and how fast you can get feedback.

3. how to come up with a cheap experiment

The most expensive way to validate a hypothesis is to fully build the feature and then measure its performance. It is often worth exploring cheaper alternatives. A well-known example is what ex-CEO of Zynga Mark Pincus described as ghetto testing, which tests the size of a potential market by (a) making a banner ad for a new game idea before the game is implemented, (b) putting the ad on Zynga’s site for a while, (c) measuring the number of clicks, and (d) prioritizing game ideas based on the number of clicks. Though not perfect, it is a reasonable approximation. In practice, not every hypothesis can be validated as quickly through market research or ghetto testing. In those cases, another popular approach is to build a so-called concierge MVP, which takes a deliberately inefficient approach to solving the problem for the sake of getting rapid feedback. For instance, a concierge MVP for a daily coupon site may not crawl coupons from the web automatically; as long as the website looks nice, coupons can be manually curated at the beginning so that the level of demand is validated. The idea is to design cheap experiments, and only build the feature if none of the alternatives work.
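
As a minimal sketch of the bookkeeping behind ghetto testing, the snippet below ranks hypothetical game ideas by the click-through rate of their banner ads; the idea names and numbers are invented for illustration:

```python
# Hypothetical banner-ad results for game ideas that have not been built yet.
ad_results = {
    "pirate farm":  {"impressions": 50_000, "clicks": 1_200},
    "zombie chess": {"impressions": 48_000, "clicks":   300},
    "space bakery": {"impressions": 52_000, "clicks": 2_100},
}

def ctr(stats):
    # Click-through rate approximates demand for a game that was never built.
    return stats["clicks"] / stats["impressions"]

ranked = sorted(ad_results.items(), key=lambda kv: ctr(kv[1]), reverse=True)
for name, stats in ranked:
    print(f"{name}: CTR = {ctr(stats):.2%}")
```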

4. what metric should be used

In order to validate or falsify a hypothesis at the end of an experiment, we need to choose, upfront, a metric that reflects the impact of the hypothesis. At first glance, it seems reasonable to use Key Performance Indicators (KPIs) such as retention rate or subscription rate; at the end of the day, these KPIs make or break a startup. But it doesn’t take long to notice that most experiments simply won’t move the needle when measured by KPIs. The problem is that KPIs are too coarse to capture the fine-grained effects an individual experiment yields.

Suppose Codecademy is testing whether offering real-time help on exercise pages would help users complete exercises and thus increase user retention. One experiment is to (a) place learners into either a control or a treatment group when they first hit an exercise page, (b) show the real-time help button only to users in the treatment group, and (c) test the difference in retention rates between the two groups. While this may seem fair, there are two major problems with the setup. First, many factors affect the retention rate simultaneously, so even if learners love real-time help, their retention rate might not reflect it. Second, not all users in the treatment group intend to engage with the feature: if only 1% of users get stuck and need real-time help, then even if those users find it extremely useful, the effect on the retention rate of the entire group could be negligible. An alternative design would be to offer a "get help" button on each exercise page and, for users who click the button, slot them into either a control group, which shows "Sorry, but the help information for the current exercise is not available", or a treatment group, which shows the real-time help feature. Then measure three metrics:

  1. The number of users who clicked the button, which reflects the size of the target audience.
  2. The difference in exercise completion rates between the two groups, which reflects the effect of the feature.
  3. The difference in retention rates between the two groups, which reflects the causal relationship between exercise completion and retention.

Of the three metrics, only the second assesses whether real-time help actually benefits learners. But it’s important to note that the experiment only makes sense if the first and third metrics are high: if there is no strong evidence that many people need help, or that exercise completion affects retention, it might not make sense to run the experiment at all. Therefore, a good metric for measuring effectiveness is the conversion rate from users who "intend" to engage with the feature to users who actually "engage and benefit" from it. To make a compelling hypothesis, the target audience for the feature should be large and the desired effect should be a leading indicator of a KPI.
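
To make the setup concrete, here is a minimal sketch that computes the three metrics from hypothetical event logs; the field names and the records themselves are invented for illustration:

```python
# Hypothetical per-user records from the "get help" experiment described above.
# A group is assigned only to users who clicked the help button.
events = [
    {"user": "u1", "clicked_help": True,  "group": "treatment", "completed": True,  "retained": True},
    {"user": "u2", "clicked_help": True,  "group": "control",   "completed": False, "retained": False},
    {"user": "u3", "clicked_help": False, "group": None,        "completed": True,  "retained": True},
    # ... more users ...
]

def rate(users, field):
    return sum(u[field] for u in users) / len(users) if users else 0.0

clicked = [u for u in events if u["clicked_help"]]
treatment = [u for u in clicked if u["group"] == "treatment"]
control = [u for u in clicked if u["group"] == "control"]

# Metric 1: size of the target audience.
print("users who clicked help:", len(clicked))
# Metric 2: effect of the feature on exercise completion.
print("completion lift:", rate(treatment, "completed") - rate(control, "completed"))
# Metric 3: link between exercise completion and retention.
print("retention lift:", rate(treatment, "retained") - rate(control, "retained"))
```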

5. how to interpret the result properly

When it comes to hypothesis validation, most people jump straight to thinking about A/B testing and whether the metric difference is statistically significant. In practice, it's much harder than that. I plan to cover more details in a future post, but let me share some high level ideas here:

  1. As the number of samples increases, any small difference becomes statistically significant. So the question shouldn’t be whether the difference is significant, but whether it is meaningful.
  2. Type I error (a false positive) matters less than people might expect. In traditional drug testing, a Type I error matters because it means claiming a drug is effective when it is not. However, when testing a signup page, if we observe and falsely claim that the red signup button outperforms the green one, it is unlikely that the green button outperforms the red one by a meaningful margin; chances are the color of the button makes no meaningful difference.
  3. Type II error (a false negative) matters more than people might expect. In traditional drug testing, a Type II error matters less than a Type I error because it means claiming a drug is ineffective when it is actually effective. However, if a website falsely dismisses features that would help its KPIs, those are missed opportunities. Type II error needs to be bounded, which can be done through a power analysis (see the sketch after this list).
  4. Traditional statistical testing does not adapt as the experiment collects more and more information. Compared to adaptive approaches like multi-armed bandits, it might not converge as efficiently, and it requires a pre-determined base rate and expected effect size before the experiment begins.
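
As a minimal sketch of such a power analysis, assuming statsmodels is available, the snippet below estimates how many users per group are needed to detect a hypothetical lift in a conversion rate; the baseline rate, lift, and thresholds are all illustrative assumptions:

```python
# Power analysis for comparing two conversion rates, e.g. exercise completion
# with and without real-time help. All rates below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20            # assumed completion rate in the control group
minimum_meaningful_lift = 0.03  # smallest lift we would actually care about

# Cohen's h effect size for the two proportions.
effect_size = proportion_effectsize(baseline_rate + minimum_meaningful_lift, baseline_rate)

# Sample size per group needed to detect that lift with 5% Type I error
# and 80% power (i.e. Type II error bounded at 20%).
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"users needed per group: {n_per_group:,.0f}")
```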

To sum up, the lean startup philosophy lays out a great foundation for product development as iterative hypothesis validation. As the product collects more data from customers, startups learn more about what customers really want. This post shares some common challenges I have seen when startups start to apply this philosophy, and proposes some ideas for dealing with them. As always, there are definitely blind spots and clever solutions that I'm not aware of. Feedback is highly welcome.

If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat), Leo Polovets, and Bob Ren (@bobrenjc93) for reading a draft of this.