By Dominik Peters (with Tom Sittler)
We're centralising all discussion on the Effective Altruism forum. To discuss this post, please comment there.
Summary. We describe a simple simulation model for the recommendations of a charity evaluator like Animal Charity Evaluators (ACE). In this model, the charity evaluator is unsure about the true impacts of the charities in a fixed pool, and can reduce its uncertainty by performing costly research, thereby improving the quality of its recommendation (in expectation). Better recommendations lead to better utilisation of the money moved by ACE. We also describe how we converted the model’s output, which is measured in chicken years averted / $ into “Human-equivalent well-being-adjusted life-years” (HEWALYs) / $.
(This post is an updated version of our previous post on this model. We would like to thank the commenters on this post for their helpful suggestions.)
Charity evaluators, and in particular GiveWell, have been enormously influential and impactful for effective altruists: they seeded the idea of aiming for effectiveness in one’s giving, they incentivised charities to be more transparent and impact-focussed, and (most directly) they have moved dollars donated by effective altruists to higher-impact organisations (e.g., Peter Singer seems to have reallocated some of his donations from Oxfam towards AMF).
While GiveWell’s recommendations in the field of global health seem to be relatively robust (not having changed substantially over several years), charity evaluators in fields with more uncertainty about the best giving opportunities could have substantial impact by arriving at better recommendations. Among existing organisations of this kind, Animal Charity Evaluators (ACE) appears to be a natural candidate: evidence for the effectiveness of interventions (such as leafleting) in the animal sector is rather weak, and some of ACE’s standout charities (such as GFI) engage in complicated mixes of activities that are difficult to model and evaluate rigorously.
To see whether ACE may be a good target for our grant, we set up a simple quantitative model of the activities of charity evaluators. We model ACE, but the model is cause-neutral and can be applied to any charity evaluator. Of course, the model is much simpler than what the real ACE does. In the model, ACE gathers more evidence about the impact of various charities over time and, based on the available evidence, recommends a single charity to its followers. In each time period, these followers give a fixed amount to the top recommendation. Thus, the amount donated does not change with changing recommendations, and donors do not have an “outside option”.
Definition of “evidence”. The evidence gathered by ACE could come in various guises: it could be survey results or RCTs, intuitive impressions after conversations with charity staff, new arguments heard for or against a certain intervention, or even just intuitive hunches. The model is agnostic about what type of evidence is used; we only require that it comes in the form of a point estimate of the impact (per dollar) of a given charity. For now, we do not model the strength of this evidence, and there is no Bayesian updating. Rather, if ACE has gathered several pieces of evidence in the form of point estimates, it takes their average as its overall estimate.
The model. Here, then, is our proposed model in pseudocode form. All model parameters are themselves chosen at random from a lognormal distribution specified by a [5%,95%] confidence interval, as in our other models built with Guesstimate.
There is a fixed pool of charities that ACE will evaluate, which consists of 10–15 charities. This is approximately the number of top and standout charities that ACE currently recommends.
Ground truth. Each charity in the pool has a true impact (per dollar). For each charity, we decide this true impact by randomly sampling from a lognormal distribution. The parameters are chosen to go through a [5%,95%] confidence interval, where the lower bound is given by ACE’s quantitative estimates of its lower-performing standout charities (such as VEBU and Vegan Outreach, for about 0.5 years of suffering averted / $) and the upper bound is given by ACE’s estimate for its top charities (Mercy for Animals and The Humane League, at about 10 years of suffering averted / $). The true impact of the charity will stay fixed over time.
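As a rough illustration of the ground-truth step, the sketch below shows one way to turn a [5%,95%] interval into lognormal parameters and sample a pool of true impacts. The function names and the pool size of 12 are our own illustrative choices, not taken from the original script.

```python
import math
import random
from statistics import NormalDist

# z-score of the 95th percentile of a standard normal (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)

def lognormal_params(lo, hi):
    """Return (mu, sigma) of the underlying normal so that the lognormal
    has its 5th percentile at `lo` and its 95th percentile at `hi`."""
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * Z95)
    return mu, sigma

def sample_from_ci(rng, lo, hi):
    """Draw one value from the lognormal fitted to the interval [lo, hi]."""
    mu, sigma = lognormal_params(lo, hi)
    return rng.lognormvariate(mu, sigma)

rng = random.Random(0)
pool_size = 12  # illustrative value inside the 10-15 range
# True impact per dollar, in years of suffering averted / $.
true_impacts = [sample_from_ci(rng, 0.5, 10.0) for _ in range(pool_size)]
```

Under this parametrisation the median true impact is the geometric mean of the two bounds, i.e. the square root of 5, or roughly 2.2 years of suffering averted / $.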
For each time period t = 1, …, T:
Evidence gathering strategy. For each charity in the pool, ACE collects a single item of evidence in the form of a point estimate of its impact. We arrive at this piece of evidence by randomly sampling a number from a normal distribution centred at the true impact of the charity. Based on comments on the previous version of this piece, a distribution with heavier tails than a normal distribution might be preferable, but we did not identify a good alternative symmetric distribution. The standard deviation of the normal distribution we used was 10–20 years of suffering averted / $, which is approximately the standard deviation seen in ACE’s Guesstimate models of total impact.
Recommendation. For each charity in the pool, we calculate the average of the point estimates that we have collected in this and previous time periods. We select and recommend the charity for which this average is highest.
Payoff. ACE’s followers donate approximately $3.5m to the recommended charity. ACE’s impact in this time period is the difference between the true impact of the charity recommended at time t and that of the charity recommended at time t-1, multiplied by the money moved and divided by ACE’s operating costs of approximately $0.3m. The figures for money moved and operating costs approximately follow ACE’s annual report for 2016.
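The three steps inside the loop can be sketched as a single function. This is a minimal sketch rather than the original script: the parameter defaults are fixed midpoints instead of sampled values, and we assume that no payoff is recorded at t = 1, since there is no earlier recommendation to compare against (the description above does not say how the first round is handled).

```python
import random

def simulate_run(rng, true_impacts, rounds=5, noise_sd=15.0,
                 money_moved=3.5e6, operating_costs=0.3e6):
    """Return ACE's impact per dollar for rounds 2..rounds of one run."""
    n = len(true_impacts)
    evidence_sums = [0.0] * n  # running sum of point estimates per charity
    previous = None            # index of last round's recommendation
    impacts = []
    for t in range(1, rounds + 1):
        # Evidence gathering: one noisy point estimate per charity.
        for i, truth in enumerate(true_impacts):
            evidence_sums[i] += rng.gauss(truth, noise_sd)
        # Recommendation: the charity with the highest average estimate so far.
        recommended = max(range(n), key=lambda i: evidence_sums[i] / t)
        # Payoff: change in true impact, scaled by money moved per dollar of costs.
        # Assumption: round 1 has no baseline, so it yields no payoff entry.
        if previous is not None:
            gain = true_impacts[recommended] - true_impacts[previous]
            impacts.append(gain * money_moved / operating_costs)
        previous = recommended
    return impacts

example = simulate_run(random.Random(1), [0.5, 2.0, 10.0])
```

A payoff is zero when the recommendation does not change, negative when ACE switches to a charity with lower true impact, and positive otherwise.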
The model is then run for 4–6 rounds, and the impact is calculated for the last round. Most of the time, the top charity does not change in the very last round, so that 0 impact / $ is achieved. Occasionally, the quality of the recommendation decreases, because ACE has sampled wrong data, in which case value is destroyed. More often, the recommendation improves, creating value amplified by ACE’s money-moved factor of approximately 10x operating costs.
To smooth out our impact estimates, we actually take the average impact over the last 3 rounds in the model, so that the fraction of times in which 0 impact is achieved is smaller. This is to aid in the model aggregation process, where the final impact distribution will be fitted to a (double) lognormal distribution.
The model is then simulated 50,000 times, and the average impact over all model runs is calculated, which comes out at about 6 years of suffering averted / $. The list of 50,000 impact estimates is then passed to the central aggregation process.
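The smoothing and repeated simulation can be sketched as follows. This is not the original script: for brevity it fixes the pool size, number of rounds and noise level at illustrative midpoints rather than sampling them, it assumes no payoff is recorded in the first round, and it runs fewer simulations by default, so its output should not be expected to match the figure quoted above.

```python
import math
import random
from statistics import mean

# Lognormal with 5th/95th percentiles at 0.5 and 10 years of suffering averted / $.
Z95 = 1.6449  # 95th-percentile z-score of a standard normal
MU = (math.log(0.5) + math.log(10.0)) / 2
SIGMA = (math.log(10.0) - math.log(0.5)) / (2 * Z95)

def smoothed_impact(rng, n_charities=12, rounds=5, noise_sd=15.0,
                    factor=3.5 / 0.3, smooth=3):
    """One model run: mean impact per $ over the last `smooth` rounds."""
    truths = [rng.lognormvariate(MU, SIGMA) for _ in range(n_charities)]
    sums = [0.0] * n_charities
    previous = None
    impacts = []
    for _ in range(rounds):
        for i, truth in enumerate(truths):
            sums[i] += rng.gauss(truth, noise_sd)
        # Every charity has the same number of estimates, so the highest
        # sum is also the highest average.
        best = max(range(n_charities), key=sums.__getitem__)
        if previous is not None:
            impacts.append((truths[best] - truths[previous]) * factor)
        previous = best
    return mean(impacts[-smooth:])

def monte_carlo(rng, runs=50_000, **kwargs):
    """Collect one smoothed impact estimate per simulated run."""
    return [smoothed_impact(rng, **kwargs) for _ in range(runs)]

results = monte_carlo(random.Random(0), runs=2_000)  # reduced for a quick check
print("mean impact per $:", mean(results))
print("share of runs with zero impact:", sum(r == 0 for r in results) / len(results))
```

Comparing the zero-impact share for `smooth=1` and `smooth=3` is a quick way to see the effect of averaging over the last rounds.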
We have implemented this model as a simple Python script that simulates the process; the code is available and can be run on repl.it.
It can be better to just donate to the top charity. For our model parameters, in expectation it is better value to donate to the charity that is currently recommended, rather than help ACE run its next evaluation round. This result is relatively robust to changes in the underlying distribution. The “problem” is that ACE will likely identify pretty-good charities very early on, and additional rounds do not lead to much change. With our parameters, donating directly to the recommended organisation is ~30% more cost-effective. One can reverse this conclusion by assuming a money-moved factor (money moved divided by operating costs) that is higher than 10x. This suggests that charity evaluators should focus on increasing their money moved. Of course, one way this can be done is by producing higher-quality research that will then attract more donors. Nevertheless, this conclusion surprised us a lot.
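The comparison can be made concrete with a little arithmetic. Under the payoff definition above, a dollar given to ACE is worth the expected change in true impact times the money-moved factor, while a dollar given directly is worth the true impact of the currently recommended charity. The sketch below uses the round figures quoted in this post; the function name is our own and purely illustrative.

```python
money_moved = 3.5e6      # approximate annual money moved, in $
operating_costs = 0.3e6  # approximate annual operating costs, in $
factor = money_moved / operating_costs  # about 11.7, i.e. roughly 10x

def ace_beats_direct(expected_gain, current_true_impact, factor=factor):
    """True if funding one more evaluation round beats donating directly.

    expected_gain: expected change in true impact per $ of the recommendation
    current_true_impact: true impact per $ of the current recommendation
    """
    return expected_gain * factor > current_true_impact

# Relative improvement in the recommendation needed to break even.
break_even_relative_gain = 1 / factor  # about 0.086
```

In other words, with these figures a further round only pays off if it is expected to improve the true impact of the recommendation by more than about 9%, and raising the money-moved factor lowers that threshold.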
Initial overconfidence. In most simulation runs, the impact estimate of the recommended charity at time t = 1 is much higher than the true impact of the charity. This is not surprising: because ACE recommends whichever charity got the highest point estimate, it will almost certainly recommend an organisation for which it has seen badly inflated evidence. Arguably, the behaviour of this model mirrors certain real-life phenomena. For example, GiveWell’s estimate for the cost of saving a life has been revised up several times over the years; and in the animal space a few years ago, both online ads and leafleting seemed like they would have much more impact than what is commonly estimated today.
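This selection effect is easy to reproduce in isolation. The sketch below draws a pool of charities, gives each a single noisy estimate, and records how far the estimate of the top pick exceeds its true impact. The pool size and noise level are illustrative midpoints of the ranges given earlier, not the sampled parameters of the original script.

```python
import math
import random
from statistics import mean

# Lognormal with 5th/95th percentiles at 0.5 and 10 years of suffering averted / $.
Z95 = 1.6449  # 95th-percentile z-score of a standard normal
MU = (math.log(0.5) + math.log(10.0)) / 2
SIGMA = (math.log(10.0) - math.log(0.5)) / (2 * Z95)

def first_round_gap(rng, n_charities=12, noise_sd=15.0):
    """Estimate minus true impact for the charity recommended at t = 1."""
    truths = [rng.lognormvariate(MU, SIGMA) for _ in range(n_charities)]
    estimates = [rng.gauss(truth, noise_sd) for truth in truths]
    best = max(range(n_charities), key=estimates.__getitem__)
    return estimates[best] - truths[best]

rng = random.Random(0)
gaps = [first_round_gap(rng) for _ in range(2_000)]
print("mean overestimate at t = 1:", mean(gaps))
print("share of runs that overestimate:", sum(g > 0 for g in gaps) / len(gaps))
```

Each individual estimate is unbiased; the overestimate arises only because the recommendation is chosen as the maximum of several noisy estimates.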
Decreasing returns over time. The change in true impact of the recommended charity is higher in earlier rounds than in later ones. For example, with our current parameter choices, funding ACE at time t = 1 has approximately 10 times as much impact as at time t = 10, averaging over 10,000 simulation runs. This is because the estimates get closer to the truth over time, so the top recommendation changes less frequently, and when it does, the difference in impact is often small. This observation matches our thought above that funding GiveWell (being relatively mature) seems less impactful than funding the relatively young ACE.
A final step for this model is a unit conversion. Our models for our three other shortlisted organisations estimate human-equivalent well-being-adjusted life-years created per $, whereas this model estimates chicken years on a factory farm averted per $. How to convert between these units is neither obvious nor uncontroversial. We decided to obtain an “exchange rate” by querying team members’ intuitions and taking medians.
To aid in obtaining this exchange rate, we proceeded in two steps. First, we asked ourselves how bad a year of life on a factory farm is compared to how good an average healthy year of life is, keeping species fixed. That is, we considered a thought experiment where there are “human factory farms”, and asked how many extra years of healthy life we would demand in order to accept being kept on a factory farm as a human. Reports ranged from 10 to 100 years, with a median of 50 years. Next, we asked how bad a year of life for a chicken on a factory farm is compared to the same situation for a human. That is, we asked for what value of n we would be indifferent between saving n chickens from a factory farm versus saving 1 human from a factory farm. Reports ranged from 10 to 1,000, with a median of 400. These values together can be used to obtain the required exchange rate.
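One plausible way to combine the two answers is sketched below. It rests on two assumptions of ours that the text above does not spell out: that the first answer is stated per year spent on a factory farm, and that the two medians are combined by simple division. The function name is hypothetical.

```python
def hewalys_per_chicken_year(healthy_years_per_farm_year, chickens_per_human):
    """Convert one chicken year on a factory farm averted into HEWALYs.

    healthy_years_per_farm_year: healthy human years judged to offset one
        human year on a factory farm (assumed to be a per-year figure)
    chickens_per_human: number of chickens n judged equivalent to one human
    """
    return healthy_years_per_farm_year / chickens_per_human

# Using the two medians reported above.
rate = hewalys_per_chicken_year(50, 400)  # 0.125
```

Under these assumptions, an impact figure in chicken years averted / $ would be multiplied by `rate` to express it in HEWALYs / $.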
This post was submitted for comment to Animal Charity Evaluators before publication.