How (not) to overengineer the 3 Episode Rule

Posted on Oct 12, 2025

As you get older, the number of shows you watch slowly decreases over time. Then, before you realize it, you’ve lost the ability to binge-watch multiple seasons in a single day.

This often happens because, even if a show is good, the same tropes are overused. You end up being able to predict what will happen next with a high degree of confidence, which takes away a lot of the fun.

For example, most shounen tend to follow familiar tropes. To be fair, that’s part of what defines the genre, but on average, the execution can become so cliché that it diminishes the enjoyment. These shows often appeal more to a less-experienced audience.

The 3 Episode Rule

When you start a new anime, you want to give the first 3 episodes a shot. The idea is that the first episode might just set the stage or introduce the characters, and it often takes a bit longer for the story to really kick in. If you’re still engaged after three episodes, it’s a good sign that the show is worth your time!

This seems fair for an unbiased mind, right?

But here’s the thing: humans are social creatures, and most people are influenced by those around them. When a show becomes really popular, you might find yourself trying to enjoy it, even if it doesn’t necessarily match your taste.

Any show has potential, right? You might never have watched a romcom, for example, but as you push through a good one, you might find yourself slowly getting hooked by the fourth episode.

So, here’s the dilemma we’ll try to solve:

Given a show and the general public’s opinion on it, what should its rating be? Is it worth investing your time in?

In a sense, we are biased by our own tastes, which might cause us to overlook hidden gems. Knowing the opinions of others can help, though we should always prioritize our own enjoyment, of course.

The math

Let’s model a simple rating $R \in [0, 1]$

$$ R = f(x_1, x_2, \dots, x_N), \quad x_k \in [0, 1] $$

That should account for the following points:

  1. Raw enjoyment (E)

    The willingness to continue is very important!

    It is independent of how good the show is: a show can be dumb but fun.

  2. Story/Plot (S):

    Important, but not as much as the raw enjoyment. (If you are into the slice-of-life genre for example)

  3. Public opinion (P):

    Depends on how pertinent it is

  4. Fairness

I would define fairness with the following scheme:

For any uniformly distributed $X = (x_1, x_2, \dots, x_N)$, where each $x_k$ can take any value from $0$ to $1$, we want $E[R] = E[f(X)] = 1/2$.

Which is just a fancy way of saying that if you don’t have specific information about how each factor will behave, and you assume they are all randomly distributed between 0 and 1, then on average, the rating would be 50%. This score represents a neutral or balanced result, which might imply it is neither exceptionally good nor bad.

With all of that in mind, here is a proposition:

$ R = f(E, S, P, \rho) = 0.5 \times (E + S) \times (1 - \rho) + P \times \rho $

  • $ 0.5 \times (E + S) \times (1 - \rho)$ describes your personal view of the show.
  • $P \times \rho$ describes how you personally perceive others’ view of the show.

Here $\rho \in [0, 1]$ represents the public pertinence factor, i.e. how much you care about public opinion. For example, $\rho = 1$ means that you care 100% about what others think, so what you think about the show no longer matters.
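As a minimal sketch, the proposition above can be written as a small Python function. The name `rating` and its argument names are illustrative choices, and the range check is an added assumption based on every factor living in $[0, 1]$.

```python
def rating(enjoyment: float, story: float, public: float, rho: float) -> float:
    """Return R = 0.5 * (E + S) * (1 - rho) + P * rho."""
    factors = {"enjoyment": enjoyment, "story": story, "public": public, "rho": rho}
    for name, value in factors.items():
        # Assumption: every factor is expected to lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    personal = 0.5 * (enjoyment + story)  # your own view of the show
    return personal * (1.0 - rho) + public * rho


# Hypothetical inputs: fun show, decent plot, well liked, mild interest in hype.
print(round(rating(0.8, 0.6, 0.9, 0.25), 3))
```

With these made-up inputs the personal part is $0.7 \times 0.75 = 0.525$ and the public part is $0.9 \times 0.25 = 0.225$, so the rating comes out at about $0.75$.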

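Now let’s verify the 4th point, assuming $E$, $S$, $P$ and $\rho$ are independent and uniformly distributed on $[0, 1]$ (so each has mean $0.5$, and the expectation of a product is the product of the expectations):

$ E[R] = E[0.5 \times (E + S) \times (1 -\rho)] + E[P \times \rho] \newline E[R] = 0.5 \times (E[E] + E[S]) \times (1 -E[\rho]) + E[P] \times E[\rho] \newline E[R] = 0.5 \times (0.5 + 0.5) \times (1 - 0.5) + 0.5 \times 0.5 \newline E[R] = 0.5 \times 1 \times 0.5 + 0.25 = 0.25 + 0.25 = 1/2 $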

Good, that means that if you were to randomly assign values to enjoyment, story quality, pertinence, and public opinion many times, the average of all the ratings you compute would converge to around 50%.

This doesn’t mean every rating will be 50%, some ratings will be much higher or lower depending on how those factors vary. However, across all possible scenarios, 50% is the central tendency.
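The fairness property can also be checked numerically. The snippet below is a rough Monte Carlo sketch, assuming the four factors are drawn independently and uniformly from $[0, 1]$; the function names, sample count and seed are arbitrary choices made for illustration.

```python
import random


def rating(E: float, S: float, P: float, rho: float) -> float:
    """R = 0.5 * (E + S) * (1 - rho) + P * rho."""
    return 0.5 * (E + S) * (1.0 - rho) + P * rho


def mean_random_rating(samples: int = 200_000, seed: int = 0) -> float:
    """Average the rating over independent uniform draws of E, S, P and rho."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(samples):
        E, S, P, rho = (rng.random() for _ in range(4))
        total += rating(E, S, P, rho)
    return total / samples


estimate = mean_random_rating()
print(f"average rating over random inputs: {estimate:.3f}")
```

The printed average should land close to $0.5$, while individual ratings still spread over the whole $[0, 1]$ range.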

Equipped with this, you can rate any show you watch. If your rating is higher than 0.5, it means you’ve enjoyed it overall and should continue.

Personally, I set $P = 1$ and instead use the pertinence $\rho$ as a gradual switch, which can be interpreted as how favorable people are toward the show.

You can try checking someone’s rating by asking the following questions:

  1. How much are you enjoying it now? ($E$)
  2. Is the plot good? How predictable do you feel it is? ($S$)
  3. What do you think others would rate this show? ($P$)
  4. Do you care about, or are you aware of, what people think? (e.g. through social media, memes, ..) ($\rho$)

Cumulative willingness

Basically, the formula above can be used to rate all 3 episodes at once.
But then you may wonder if it is even possible to minimize the time you wasted on a show…

Can I watch fewer than 3 episodes without feeling bad about dropping the show?

Or similarly..

This show seems to have potential, but 3 episodes is too small a sample.

We can address this dilemma by rating each episode $R_e$, and using a weighted average.

  • $e$ is the current episode we are on.
  • $C_e$ is the cumulative rating.
  • $w(e)$ is the weight for the e-th episode
    • $w(1)$ must be $1$, it’s always about first impressions.
    • $\forall e \in \N^*: 0 \lt w_{min} \le w(e) \leq 1 \land w(e+\epsilon) \lt w(e) \ (\epsilon \gt 0) $, i.e. the weight is bounded and strictly decreasing.
    • $\lim_{e \to \infty} w(e) = w_{min}$, the weight must decay to the lower bound as we watch more episodes, i.e. early episodes are the most important.

Properties and constraints:

  1. $C_1 = R_1 $.
  2. $\forall k: 0 \leq C_k \leq 1$ .
  3. Not strictly required, but nice to have: consistency under constant ratings.

How would our rating behave if each rating up to a certain $k$ is the same? What if $ R_1 = R_2 = … = R_k = V$?

Then it would be nice to have $C_k = V$ where $V$ is an arbitrary constant. The main motivation is that if all ratings are the same, then the cumulative rating must hold the same value.

Picking the right weighted average

  • The very first episodes are important, so we can rule out the harmonic mean: it gives more weight to low ratings, so a single weak episode would dominate the score regardless of the weight we assign to it.
  • Our ratings are bounded between 0 and 1, so the geometric mean is overkill. It is better suited for datasets where values can explode or grow exponentially, which doesn’t apply here.
  • The arithmetic mean, on the other hand, provides a balanced average and is well-suited for our bounded rating model. It is also the simplest, so let’s go with that…

$$ \mu(R) = {{\sum_{i=1}^{e} R_i} \over {e}} = {{\sum_{i=1}^{e} R_i \times 1} \over {\sum_{i=1}^{e} 1}} \xrightarrow{\text{In weighted form..}} C_e := {{\sum_{i=1}^{e} R_i w(i)} \over {\sum_{i=1}^{e} w(i)}} $$

Checking the properties

  1. $C_1 = R_1 $. $\checkmark$
  2. $\forall k \in \N^*: 0 \leq C_k \leq 1$ . $\checkmark$ (The proof is left as an exercise for the reader)
  3. $\forall k \in \N^*: R_1 = R_2 = .. = R_k = V \rightarrow C_k = V$ $\checkmark$

$$ C_k = {{\sum_{i=1}^{k} V w(i)} \over {\sum_{i=1}^{k} w(i)}} = V \times {{\sum_{i=1}^{k} w(i)} \over {\sum_{i=1}^{k} w(i)}} = V $$
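The weighted average and its properties can be sketched in a few lines of Python. The helper `sample_weight` is a hypothetical placeholder (any decreasing function with $w(1) = 1$ and a positive floor would do), and `cumulative_rating` is just an illustrative name.

```python
from typing import Callable, Sequence


def sample_weight(e: int) -> float:
    """Placeholder weight: w(1) = 1, strictly decreasing, tends to 0.5."""
    return 0.5 + 0.5 / e


def cumulative_rating(ratings: Sequence[float],
                      weight: Callable[[int], float] = sample_weight) -> float:
    """C_e = sum(R_i * w(i)) / sum(w(i)) over the episodes watched so far."""
    if not ratings:
        raise ValueError("at least one episode rating is required")
    weights = [weight(i) for i in range(1, len(ratings) + 1)]
    weighted_sum = sum(r * w for r, w in zip(ratings, weights))
    return weighted_sum / sum(weights)


# Property 1: after one episode, the cumulative rating is that episode's rating.
print(cumulative_rating([0.8]))
# Property 3: constant ratings give the same constant (up to rounding).
print(round(cumulative_rating([0.6] * 5), 6))
```

Because the result is a weighted average with positive weights, it always lies between the smallest and largest episode rating, which is what property 2 asks for.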

Picking a decent weight function

  • Episode: $1 \leq e \leq e_{max} $
  • Weight: $0 < w_{min} \leq w(e) \leq 1$ and $w_{max} = w(1) = 1$
  • $w(e+\epsilon) \lt w(e) (\epsilon \gt 0) $
  • $w(e_{max}) = w_{min}$

Narrowing down function candidates

A good candidate would be a function that can decrease in a controllable way relative to the total number of episodes.

For example, on the last episode of a given show, we might want the weight to be $0.5$ to ensure that it still contributes.

Another nice thing to have is the ability to measure how important the first episodes are. We can express that by defining concretely how much of the total available weight is allocated to the first K episodes, either by a sum or an integral.

Let the total allocated weight be defined as follows, with $e_{min} = 1$ being the first episode: $$ A_{Total} = \int_{e_{min}}^{e_{max}} {w(t)} dt $$

Which can be split into $$ A_{Total} = A_{\leq K} + A_{\geq K} = \int_{e_{min}}^{K} {w(t)} dt + \int_K^{e_{max}} {w(t)} dt $$

Let $ \mathcal {M_K \{w(t)\}} $ be the measure of how much weight is allocated to the first K episodes,

$$ \mathcal {M_K \{w(t)\}} = {A_{\leq K} \over A_{Total}} \times 100 \% $$
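For weight functions without a convenient antiderivative, the measure can be approximated numerically. The sketch below uses a plain midpoint rule; `integrate` and `allocated_share` are illustrative names, and the lower bound $e_{min} = 1$ is an assumption (the first episode).

```python
from typing import Callable


def integrate(w: Callable[[float], float], a: float, b: float,
              steps: int = 10_000) -> float:
    """Approximate the integral of w over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(w(a + (i + 0.5) * h) for i in range(steps))


def allocated_share(w: Callable[[float], float], K: float, e_max: float,
                    e_min: float = 1.0) -> float:
    """Fraction of the total weight that falls on episodes e_min..K."""
    return integrate(w, e_min, K) / integrate(w, e_min, e_max)


# Baseline: a constant weight favors nothing, so the share is (K-1)/(e_max-1).
uniform_share = allocated_share(lambda t: 1.0, K=3, e_max=12)
print(f"{100 * uniform_share:.1f} %")
```

The constant-weight baseline gives $2/11 \approx 18.2\%$ for $K = 3$ and $e_{max} = 12$; any decreasing weight function should allocate more than that to the first three episodes.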

Candidate 1: linear weight function

$$ w(e) := 1 - e{{w_{max} - w_{min}} \over {e_{max}}} $$

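This is a good start: easy, very intuitive, and mostly fine if we do not want to control how much we favor the first episodes. Strictly speaking, this form gives $w(1) = 1 - (w_{max} - w_{min})/e_{max}$, which is slightly below $1$; shifting it to $w(e) = 1 - (e-1)(w_{max} - w_{min})/(e_{max}-1)$ would satisfy $w(1) = 1$ exactly, at the cost of a slightly messier integral.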

Now let’s measure the allocated weight

$$ \int_{a}^{b} w(t) dt = [ t - {{t^2} \over {2}} {{w_{max} - w_{min}} \over {e_{max}}}]_{a}^{b} $$

$$ A_{\leq K} = (K-1) - {(K^2-1) \over 2} {(w_{max} - w_{min}) \over {e_{max}}} $$

$$ A_{Total} = (e_{max}-1) - {(e_{max}^2-1) \over 2} {(w_{max} - w_{min}) \over {e_{max}}}$$

Then, for example with

$$ K=3, e_{max} = 12, w_{min} = 0.5, w_{max} = 1 $$

We have

$$ \mathcal {M_3 \{w(t)\}} = {{2 - (8/2) \times (0.5 / 12)} \over {11 - (143/2) \times (0.5/12)}} \times 100 \% \approx 22.9 \% $$

This means that about 22.9% of the total weight goes to the first 3 episodes of a 12-episode show, given that the 12th episode must be weighted $0.5$.

Candidate 2: exponential weight decay

This is slightly tricky as the challenge is to avoid decaying too fast by delaying the decrease of the weight as we reach the end episodes.

$$w(e) := \exp f(e) $$

After many trials, I have settled on:

$$w(e) := \exp (-\alpha \sqrt {e - 1}), (\alpha > 0) $$

This may seem completely arbitrary, but the rationale is that it is only defined for $e \ge 1$ and $w(1) = w_{max} = 1$. It also stays simple, and we can fine-tune the factor $\alpha$ to slow the decay by solving the $w_{min}$ constraint.

$$w(e_{max}) = \exp (-\alpha \sqrt {e_{max} - 1}) = w_{min} $$ $$ \alpha = -{{\ln w_{min}} \over {\sqrt {e_{max} - 1}}} $$

Let’s fix $\alpha$: just like before, we can force the last episode’s weight $w(e_{max})$ to be $0.5$.

$$ \alpha = {{\ln 2} \over {\sqrt {e_{max} - 1}}} $$

With the same conditions as above, we can try $e_{max} = 12$, giving $\alpha \approx 0.21$
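Solving the $w_{min}$ constraint for $\alpha$ is easy to script. This is a small sketch with illustrative function names (`decay_rate`, `exp_weight`); the input checks are added assumptions.

```python
import math


def decay_rate(w_min: float, e_max: int) -> float:
    """alpha = -ln(w_min) / sqrt(e_max - 1), so that w(e_max) = w_min."""
    if not 0.0 < w_min <= 1.0:
        raise ValueError("w_min must be in (0, 1]")
    if e_max < 2:
        raise ValueError("e_max must be at least 2")
    return -math.log(w_min) / math.sqrt(e_max - 1)


def exp_weight(e: float, alpha: float) -> float:
    """w(e) = exp(-alpha * sqrt(e - 1)), defined for e >= 1."""
    return math.exp(-alpha * math.sqrt(e - 1))


alpha = decay_rate(w_min=0.5, e_max=12)
print(f"alpha = {alpha:.2f}")
print(f"w(1) = {exp_weight(1, alpha):.2f}, w(12) = {exp_weight(12, alpha):.2f}")
```

For a 12-episode show with $w_{min} = 0.5$ this gives $\alpha \approx 0.21$, with the weight starting at $1$ and reaching $0.5$ on the final episode.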

Now, let’s measure the allocated weight

$$ \int_{a}^{b} w(t) dt = -{2 \over \alpha^2} [\exp (-\alpha \sqrt {t-1}) (1 + \alpha \sqrt {t-1}) ]_{a}^{b} $$

$$ A_{\leq K} = -{2 \over \alpha^2} [\exp (-\alpha \sqrt {K-1}) (1 + \alpha \sqrt {K-1}) - 1] $$ $$ A_{Total} = -{2 \over \alpha^2} [\exp (-\alpha \sqrt {e_{max}-1}) (1 + \alpha \sqrt {e_{max}-1}) - 1] $$

$$ \mathcal {M_3 \{ w(t) \} } = {{\exp (-0.21 \sqrt {2}) (1 + 0.21 \sqrt {2}) - 1} \over {\exp (-0.21 \sqrt {11}) (1 + 0.21 \sqrt {11}) - 1}} \times 100 \% \approx 23 \%$$

This is very close to the linear candidate’s result given the same parameters.
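The closed-form measure for the exponential candidate can be evaluated directly, which avoids rounding $\alpha$ by hand. The function names below are illustrative, and the integration starts at episode $1$ as in the formulas above.

```python
import math


def exp_allocated(upper: float, alpha: float) -> float:
    """Integral of exp(-alpha * sqrt(t - 1)) for t from 1 to `upper`."""
    u = alpha * math.sqrt(upper - 1)
    return (2.0 / alpha ** 2) * (1.0 - math.exp(-u) * (1.0 + u))


def exp_share(K: float, e_max: float, w_min: float) -> float:
    """Share of the total weight allocated to the first K episodes."""
    alpha = -math.log(w_min) / math.sqrt(e_max - 1)
    return exp_allocated(K, alpha) / exp_allocated(e_max, alpha)


share = exp_share(K=3, e_max=12, w_min=0.5)
print(f"{100 * share:.1f} %")
```

With the unrounded $\alpha$ the share lands a little above 23%, in line with the hand computation.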

Other candidates

There are infinitely many possible candidates, but let’s end the rabbit hole here. We could also have used something along the lines of $w(e) = 1 / \log_2(e + 1)$; the only issue is that it decays much more slowly at infinity.

You are free to explore but most functions are not as “nice” as exponents or polynomials.

Conclusion

$$ R_e = {(E + S) \over 2} (1 - \rho) + P \rho $$

$$ w(e) := 1 - e{{w_{max} - w_{min}} \over {e_{max}}} \text{ or } w(e) := \exp \left( {{\ln w_{min} \sqrt {e - 1}} \over {\sqrt {e_{max} - 1}}} \right) $$

$$ C_e := {{\sum_{i=1}^{e} R_i w(i)} \over {\sum_{i=1}^{e} w(i)}} $$

If you’ve gone through 100 episodes and have a cumulative rating greater than or equal to $0.5$, even if the later episodes were bad, then maybe, overall, you actually liked the show?

For practical purposes, the recursive form of $C_e$ might be easier to calculate if you do this by hand, as you already know the cumulative rating and the cumulative weight from last week’s episode.

$$ C_e := {{C_{e-1} {W_{e-1}} + R_e w(e)} \over {W_{e-1} + w(e)}} $$

Where $W_{e-1}$ represents the cumulative sum from previous week’s episode, i.e. $W_{e-1} = {\sum_{i=1}^{e-1} w(i)}$
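The recursive form maps naturally onto a tiny running tracker that is updated once per episode. This is only a sketch: the class name `RunningRating` and the placeholder `sample_weight` are hypothetical, and any weight function satisfying the constraints above can be passed in instead.

```python
from typing import Callable


def sample_weight(e: int) -> float:
    """Placeholder weight: w(1) = 1, strictly decreasing, tends to 0.5."""
    return 0.5 + 0.5 / e


class RunningRating:
    """Keeps C_e and W_e up to date, one episode at a time."""

    def __init__(self, weight: Callable[[int], float] = sample_weight) -> None:
        self.weight = weight
        self.episode = 0       # number of episodes rated so far
        self.weight_sum = 0.0  # W_{e-1}, the cumulative weight
        self.value = 0.0       # C_{e-1}, the cumulative rating

    def update(self, rating: float) -> float:
        """Apply C_e = (C_{e-1} * W_{e-1} + R_e * w(e)) / (W_{e-1} + w(e))."""
        self.episode += 1
        w = self.weight(self.episode)
        self.value = (self.value * self.weight_sum + rating * w) / (self.weight_sum + w)
        self.weight_sum += w
        return self.value


tracker = RunningRating()
for episode_rating in (0.9, 0.4, 0.7):  # made-up ratings for three episodes
    print(round(tracker.update(episode_rating), 3))
```

Only two numbers need to be carried from week to week, the cumulative rating and the cumulative weight, which is what makes the by-hand version practical.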

Additional notes

If you are registered on MAL or similar websites, chances are that you have lots of rating data, against which you can calibrate the above model.

The simplest, but less accurate, approach is to assume $R_{true} = k R$ and then fit your data by finding the appropriate factor $k$.
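For the single-factor case, the least-squares value of $k$ has a closed form, $k = \sum R_{true} R / \sum R^2$. The sketch below assumes your site ratings have already been rescaled to $[0, 1]$; the function names and the sample numbers are made up purely for illustration.

```python
from typing import Sequence


def fit_scale(model: Sequence[float], observed: Sequence[float]) -> float:
    """Least-squares k for observed ~ k * model (no intercept)."""
    if len(model) != len(observed) or not model:
        raise ValueError("need two non-empty sequences of equal length")
    denominator = sum(m * m for m in model)
    if denominator == 0.0:
        raise ValueError("model ratings must not all be zero")
    return sum(m * o for m, o in zip(model, observed)) / denominator


def rescale(rating: float, k: float) -> float:
    """Apply the fitted factor and clamp the result back into [0, 1]."""
    return min(1.0, max(0.0, k * rating))


# Hypothetical data: model ratings R vs. your existing scores, both in [0, 1].
model_ratings = [0.4, 0.6, 0.8]
site_ratings = [0.5, 0.7, 0.9]
k = fit_scale(model_ratings, site_ratings)
print(f"k = {k:.3f}")
```

The clamp in `rescale` is there because a factor above $1$ can push high ratings out of range.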

The suggested approach is to add extra parameters that will indicate how much you deviate from the model.

$$ R_{true} = R + R_{dev} $$

$$ R_{true} = {(E + k_1) + (S + k_2) \over 2} (1 - (\rho + k_3)) + (P + k_4) (\rho + k_3)$$

You can determine $E$, $S$ and $P$ by rating a few shows, just try to be yourself (can be from the same dataset, just remove them afterwards).

Then you can find $k_1$, $k_2$, $k_3$, and $k_4$ by fitting over your known ratings, minimizing the difference between the observed ${\widehat {R}} _ {true}$ and your personal model $R_{true}$. It is a small least-squares regression problem after all (not strictly linear, since $k_3$ multiplies the other offsets).

Just don’t forget the constraint $0 \le R \le 1$.

Once that’s done, you can recompute all of your ratings to be on the same scale as our framework.