Imagine you wish to know how many fish are in a rather large pond… The water is murky so forget about counting them easily… In ecology this is a common problem! The population size of a species often needs to be known, but it is either impractical or simply impossible to count all individual members of that population. Fortunately, there is a relatively easy solution to get an estimation!

To estimate the size of the population you would have to visit the site twice. On the first visit a number of animals are caught, marked and released. It is important to mark them in a way it doesn’t do them any disadvantage or harm, equally important is to use marks that don’t easily rub off. You keep count of how many members of the population were caught and marked. Next, a day or so later, the site is visited again and a number of animals is caught. This time the number of marked animals that were recaptured is important. As the fraction of the marked individuals captured the second day, is essentially the same as the fraction of the population that was marked the first day. This method goes by a few names like Capture-Mark-Recapture, Mark and Recapture or other slight variations.

As an example imagine you catch and tag 50 individuals on day one. The next day, 100 animals are caught and 10 of those are marked. So we know that roughly 10% of all individuals were tagged, we also know that there are in total 50 tagged individuals, so the total population size should be about 500.

\[\begin{aligned} \frac{n\_marked}{population\_size} = \frac{n\_recaptured}{captured\_round\_2} \end{aligned}\]

This formula can be re-written as:

\[\begin{aligned} population\_size = \frac{n\_marked * captured\_round\_2}{n\_recaptured} \end{aligned}\]

While the latter is really simple and similar exercises are already seen in primary school (known as the rule of three), calculating the confidence interval (the range in which the actual population size is with a certain amount, usually 95%, of confidence) on this estimate is far from trivial (check the link, it is a very complex formula). However, when using a Bayesian approach, we get the uncertainty on the population size automatically, without any extra effort. So let’s implement this model in PyMC3 and see what we get.

Creating a PyMC3 model

The data from the observation is stored in a few variables: n_marked (the number of individuals marked on the first visit), captured_round_2 (the total number of individuals caught the second visit) and n_recaptured (the number of marked individuals caught the second visit). The big unknown that we need to infer is the population_size, we don’t really have any clue about this except that it is equal or higher than the total number of individual that were caught during both visits. So we can set the number of unique animals seen as the lower limit.

The probability to capture an already marked animal the second visit p_marked is the number of marked animal within the full population. So this is a pm.Deterministic variable as it is defined by the number of marked animals (which is known) and the population size (which we try to infer). Finally, a likelihood recapture_obs is required, this will be a pm.Binomial with the total number of animals caught the second day as the number of draws, the number of marked individual recaptured as the observation and a probability of p_marked.

%load_ext nb_black
import pymc3 as pm

n_marked = 50
captured_round_2 = 100
n_recaptured = 10

with pm.Model() as model:
    population_size = pm.Bound(
        pm.Flat, lower=n_marked + captured_round_2 - n_recaptured
    )("population_size")
    p_marked = pm.Deterministic("p_marked", n_marked / population_size)

    recapture_obs = pm.Binomial(
        "recapture_obs", captured_round_2, p_marked, observed=n_recaptured
    )

    trace = pm.sample(4000, tune=1000, return_inferencedata=False)

After sampling the model, we get our estimate for the population size. In this case the population size is estimated to be between 294 and 1034 individuals (hdi_3% and hdi_97%, PyMC3 by default gives you the smallest range where 94% of the inferred values were found) with a mean value of 620. While the mean isn’t crazy far off from the population size in our example, the uncertainty is rather large. It does show that not nearly enough animals were marked or re-captured in our thought experiment.

  mean sd hdi_3% hdi_97%
population_size 620.067 220.926 293.750 1034.423
p_marked 0.090 0.028 0.039 0.142

Update 30/08/2022 - HyperGeometric likelihood

As in this case draws from the population aren’t independent, a HyperGeometric distribution is better for this model. The last two parts can be switched for the lines below to implement this. While this wouldn’t be a big difference in case the number of marked individuals is large enough, here it does affect the mean estimate and 94% HDI.

    recapture_obs = pm.HyperGeometric(
        "recapture_obs", N=population_size, k=n_marked, n=captured_round_2, observed=n_recaptured
    )

    trace = pm.sample(4000, tune=1000, return_inferencedata=False, target_accept=0.9)
  mean sd hdi_3% hdi_97%
population_size 748.737 372.363 273.785 1396.472
p_marked 0.040 0.015 0.012 0.067

Can we improve without catching more animals ?

So without catching more animals overall, is it better to mark more animals, and catch fewer the next visit? Or capture more animals on the second visit, while marking fewer the first time around? With this model we can easily test this out!

Let’s mark 100 animals on day one and catch 50 on day two. Given a population size of 500 we’d expect about 10 to be re-capture. If we feed those number in the model we get the results below.

  mean sd hdi_3% hdi_97%
population_size 610.585 212.195 310.344 1008.055
p_marked 0.180 0.054 0.082 0.279

While there is still a lot of uncertainty, it does drop when catching more animals the first day and fewer the second day. (I’ll leave confirming it gets worse if you do things the other way around to the reader) However, once you drop below a point where you are barely recapturing any animals on the second day, this is also negatively going to affect the estimate.

Catching twice as many animals

This model really works well if enough individuals can be tagged and recaptured. So if we increase the number of marks to 200 and capture 100 the second day, we would expect about 40 to be tagged. Punching in those numbers in the model, we see that now the mean is very close to what it should be in our imaginary population and the uncertainty is far smaller.

  mean sd hdi_3% hdi_97%
population_size 519.579 66.165 402.473 643.741
p_marked 0.391 0.048 0.302 0.483

Conclusion

While the formula for this model is simple, calculating the confidence interval isn’t. Though here it could be instrumental to know how much uncertainty you have. In our first example, depending on how many fish there are in the pond you might allow fishers to capture a certain number. So if there are 293, 500, 610 or, 1034 this could influence how many people are allowed to catch fish in that pond.

By using a Bayesian approach, the problem is solved in a few simple lines of code, and we get the confidence interval in one go as well, easy! This is exactly the strength of Bayesian statistics. The advantages here are further highlighted as it is very easy to play with the number to see how the uncertainty can be decreased, so the efforts of catching and marking individuals can be done as efficient as possible.

To see what I did with this formula, check out the next post where this is used to estimate how many KeyForge decks were printed.

Acknowledgements

Header image by Sebastian Pena Lambarri on Unsplash