Charity evaluators: a first model and open questions

2017-04-25

By Dominik Peters (with Tom Sittler)

We're centralising all discussion on the Effective Altruism forum. To discuss this post, please comment there.

Abstract. We describe a simple simulation model for the recommendations of a charity evaluator like GiveWell or ACE. The model captures some real-world phenomena, such as initial overconfidence in impact estimates. We are unsure how to choose the parameters of the underlying distributions, and are happy to receive feedback on this.

Charity evaluators, and in particular GiveWell, have been enormously influential and impactful for effective altruists: they seeded the idea of aiming for effectiveness in one’s giving, they incentivised charities to be more transparent and impact-focussed, and (most directly) they have moved dollars donated by effective altruists to higher-impact organisations (e.g., Peter Singer seems to have reallocated some of his donations from Oxfam towards AMF).

While GiveWell’s recommendations in the field of global health seem to be relatively robust (not having changed substantially over several years), charity evaluators in fields with more uncertainty about the best giving opportunities could have substantial impact through arriving at better recommendations. Among existing organisations of this kind, Animal Charity Evaluators (ACE) appears to be a natural candidate: evidence for the effectiveness of interventions (such as leafleting) in the animal sector is rather weak, and some of ACE’s standout charities (such as GFI) engage in complicated mixes of activities that are difficult to model and evaluate rigorously.

To see whether ACE may be a good target for our OxPrio donation, we will set up a simple quantitative model of the activities of charity evaluators. The model is cause-neutral (so far), but we will call the generic charity evaluator “ACE” for short. Of course, the model is a drastic simplification of what the real ACE does in the real world. In the model, ACE gathers more evidence about the impact of various charities over time and, based on the available evidence, recommends a single charity to its followers. In each time period, these followers give a fixed amount (say, normalised to $1) to the top recommendation. Thus, the amount donated does not change with changing recommendations, and donors do not have an “outside option”.

Definition of “evidence”. The evidence gathered by ACE could come in various guises: survey results or RCTs, intuitive impressions after conversations with charity staff, new arguments heard for or against a certain intervention, or even just intuitive hunches. The model is agnostic about the type of evidence used; we only require that it comes in the form of a point estimate of the impact (per dollar) of a given charity. For now, we do not model the strength of this evidence, and ACE does not use anything like Bayesian updating. Rather, if ACE has gathered several pieces of evidence in the form of point estimates, it takes their average as its overall estimate.

The model. Here, then, is our proposed model in pseudocode form:

●     There is a fixed pool of charities that ACE will evaluate.

●     Ground truth. Each charity in the pool has a true impact (per dollar). For each charity, we decide this true impact by randomly sampling from a lognormal distribution. The true impact of the charity will stay fixed over time.

●     For each time period t = 1, …, T:

○     Evidence gathering strategy. For each charity in the pool, ACE collects a single item of evidence in the form of a point estimate of its impact. We arrive at this piece of evidence by randomly sampling a number from a normal distribution centred at the true impact of the charity.

○     Recommendation. For each charity in the pool, we calculate the average of the point estimates collected in this and all previous time periods. We select and recommend the charity for which this average estimate is highest.

○     Payoff. ACE’s followers donate $1 to the recommended charity. ACE’s impact in this time period is the difference between the true impact of the charity recommended now and that of the charity recommended in the previous round.

We have implemented this model in a simple Python script that simulates the process:

# CHARITY EVALUATION MODEL
# Peters & Sittler, April 2017

import numpy

# which organisations are there to evaluate in the area?
orgs = ["AMF", "SCI", "GFI", "MIRI", "80k", "FHI", "MFA", "ACE", "FLI"]

#### distribution of true impact ####
# we take lognormal with these parameters:
impact_distribution_mu = 2.3
impact_distribution_sigma = 1

#### distribution of evidence around true impact ####
# let's take a normal distribution
evidence_distribution_sigma = 50

def sample_impact():
    "sample a random true mu for an organisation"
    return numpy.random.lognormal(impact_distribution_mu, impact_distribution_sigma)

def sample_evidence(mu):
    "for an org with true impact mu, sample a piece of evidence"
    return numpy.random.normal(mu, evidence_distribution_sigma)

def impact_estimate(evidence_list):
    "given a list of sampled evidence impacts, what is our overall impact estimate? we use average"
    return sum(evidence_list) / len(evidence_list)


def simulate():
    # decide true impacts
    true_mu = {}
    for org in orgs:
        true_mu[org] = sample_impact()

    max_impact = max(true_mu[org] for org in orgs)
    arg_max_impact = max(orgs, key=lambda org: true_mu[org])
    print("Best achievable impact:", round(max_impact), "(" + str(arg_max_impact) + ")")

    # store all evidence that we have sampled so far
    gathered_evidence = {}
    for org in orgs:
        # we don't have any evidence yet
        gathered_evidence[org] = []

    rounds = 10

    round_str      = "Round:      "
    org_str        = "Recom. Org: "
    estimate_str   = "Estimate:   "
    true_str       = "True Impact:"
    difference_str = "Difference: "
    change_str     = "Change:     "

    last_round_true_impact = 0.0
    width = 5  # table column width
    for i in range(1, rounds + 1):
        round_str += str(i).rjust(width)
        ### do research ###
        # for each org, obtain a sample
        for org in orgs:
            gathered_evidence[org].append(sample_evidence(true_mu[org]))
        # which org do we recommend after this round?
        max_impact_estimate = max(impact_estimate(gathered_evidence[org]) for org in orgs)
        ### pretty-print results ###
        for org in orgs:
            # find the recommended org (the one whose estimate equals the maximum)
            if impact_estimate(gathered_evidence[org]) == max_impact_estimate:
                org_str += str(org).rjust(width)
                estimate_str += str(int(round(max_impact_estimate))).rjust(width)
                # change in true impact relative to last round's recommendation
                impact_change = int(round(true_mu[org] - last_round_true_impact))
                true_str += str(int(round(true_mu[org]))).rjust(width)
                # gap between our estimate and the truth
                difference_str += '{0:+d}'.format(int(round(true_mu[org] - max_impact_estimate))).rjust(width)
                if impact_change != 0:
                    change_str += '{0:+d}'.format(impact_change).rjust(width)
                else:
                    change_str += "".rjust(width)
                last_round_true_impact = true_mu[org]
                break

    print(round_str)
    print(org_str)
    print(estimate_str)
    print(true_str)
    print(difference_str)
    print(change_str)
    print("")

simulate()

Two observations:

●     Initial overconfidence. In most simulation runs, the impact estimate of the recommended charity at time t = 1 is much higher than that charity’s true impact. This is not surprising: because ACE recommends whichever charity received the highest point estimate, it will almost certainly recommend an organisation for which it has seen badly inflated evidence (the sketch below this list quantifies the effect). Arguably, the behaviour of this model mirrors certain real-life phenomena. For example, GiveWell’s estimate for the cost of saving a life has been revised upwards several times over the years; and in the animal space a few years ago, both online ads and leafleting seemed like they would have much more impact than what is commonly estimated today.

●     Decreasing returns over time. The change in the true impact of the recommended charity is larger in earlier rounds than in later ones. For example, with our current parameter choices, funding ACE at time t = 1 has approximately 10 times as much impact as at time t = 10, averaging over 10,000 simulation runs. This is because the estimates get closer to the truth over time, so the top recommendation changes less frequently, and when it does, the difference in impact is often small. This observation matches our thought above that funding GiveWell (being relatively mature) seems less impactful than funding the relatively young ACE.
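To quantify these two observations, the following sketch re-runs a vectorised version of the model many times and reports (i) the average gap between the estimated and true impact of the charity recommended at t = 1 and (ii) the average change in the true impact of the recommendation in each round. It reuses the parameter values from the script above; the variable names are ours, and the exact figures will differ from run to run.

# sketch: quantify the two observations by averaging over many simulation runs
# (reuses the distributional assumptions above; figures vary between runs)

import numpy

n_orgs = 9          # size of the charity pool
rounds = 10         # time periods per run
runs = 10000        # simulation runs to average over

impact_mu, impact_sigma = 2.3, 1.0   # lognormal parameters of true impact
evidence_sigma = 50.0                # noise of a single piece of evidence

overestimate_t1 = 0.0                 # estimate minus truth at t = 1
impact_change = numpy.zeros(rounds)   # change in true impact per round

for _ in range(runs):
    true_impact = numpy.random.lognormal(impact_mu, impact_sigma, n_orgs)
    evidence_sum = numpy.zeros(n_orgs)
    previous_true = 0.0
    for t in range(rounds):
        # one new piece of evidence per org, averaged with earlier evidence
        evidence_sum += numpy.random.normal(true_impact, evidence_sigma)
        estimates = evidence_sum / (t + 1)
        best = estimates.argmax()     # recommended charity this round
        if t == 0:
            overestimate_t1 += estimates[best] - true_impact[best]
        impact_change[t] += true_impact[best] - previous_true
        previous_true = true_impact[best]

print("Average overestimate at t = 1:", overestimate_t1 / runs)
print("Average change in true impact per round:", impact_change / runs)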

Currently, our model is somewhat underspecified, and some of the modelling choices lack good justification. Some open questions related to this, on which we would appreciate ideas and feedback:

●     What should be the distribution of true impact?

○     A lognormal distribution (according to which the order of magnitude of impact is normally distributed) seems like a sensible choice for the reasons Michael Dickens has outlined.

○     On the other hand, Brian Tomasik has argued that on an all-things-considered view (taking flow-through and other effects into account), overall true impact may be more normally distributed.

○     Location (mu):

■     For global health, this role could be filled by an estimate of the impact of GiveDirectly.

■     Analogously, an unbiased estimate of the impact of online ads may be a reasonable choice. But are current estimates biased?

○     Scale (sigma):

■     Is it sensible to use variance information from other contexts? For example, various estimated distributions of the impact of global health interventions are sometimes circulated.

●     Evidence distribution

○     Which type of distribution captures this best? Normal or lognormal?

○     How to get the right sigma parameter for this? Maybe look at distributions found in large meta-analyses in other contexts (such as health)?

●     Is averaging the right way to aggregate different evidence samples?

○     Probably yes if they are normally distributed. What about lognormal?

○     Is there a simple way to make this process more Bayesian? (One simple possibility is sketched below this list.)

●     For which charities are samples drawn?

○     Diminishing returns to sampling give a good reason to sample the charity that currently has the fewest samples (which is roughly equivalent to sampling all charities in each round).

○     It may be a better use of resources to focus on borderline orgs, since for these, new evidence is most likely to change the recommendation. On the other hand, this strategy may never promote the best charity to the top, if the first few evidence samples were very bad.
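To make the last point concrete, here is a minimal sketch of what a “focus on borderline orgs” evidence-gathering strategy could look like. It assumes the gathered_evidence dictionary and the impact_estimate function from the script above; the budget parameter and the tie-breaking choices are ours, not part of the model.

# sketch: an alternative evidence-gathering strategy that samples only the
# orgs whose current estimates are highest (the "borderline" ones), since
# new evidence about them is most likely to change the recommendation.
# illustrative only -- not part of the model above.

def orgs_to_sample_next(gathered_evidence, budget=2):
    "return the orgs to sample in the next round under the borderline strategy"
    # orgs without any evidence yet are sampled first
    unsampled = [org for org in gathered_evidence if not gathered_evidence[org]]
    if unsampled:
        return unsampled[:budget]
    # otherwise, pick the `budget` orgs with the highest current estimates
    by_estimate = sorted(gathered_evidence,
                         key=lambda org: impact_estimate(gathered_evidence[org]),
                         reverse=True)
    return by_estimate[:budget]

Plugging this into the main loop (appending new evidence only for the returned orgs) and comparing the resulting impact trajectory with the sample-everything strategy would let us answer the question empirically, including whether the best charity can get permanently stuck below the top after a few bad early samples.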
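On the “more Bayesian” question, one simple possibility, under the (admittedly crude, given the lognormal ground truth) assumption of a normal prior on true impact and normal evidence noise, is the conjugate normal-normal update: replace the plain average with a precision-weighted average that shrinks towards the prior mean. The prior parameters below are purely illustrative.

# sketch: a more Bayesian aggregation rule (normal prior, normal evidence noise).
# a possible drop-in replacement for impact_estimate above; prior values are
# illustrative only.

def posterior_impact_estimate(evidence_list,
                              prior_mean=10.0,       # illustrative prior mean
                              prior_sigma=10.0,      # illustrative prior sd
                              evidence_sigma=50.0):  # noise of one evidence sample
    "posterior mean of true impact after a conjugate normal-normal update"
    if not evidence_list:
        return prior_mean
    n = len(evidence_list)
    prior_precision = 1.0 / prior_sigma ** 2
    evidence_precision = n / evidence_sigma ** 2
    evidence_mean = sum(evidence_list) / n
    return (prior_precision * prior_mean + evidence_precision * evidence_mean) \
           / (prior_precision + evidence_precision)

Compared with the plain average, this dampens the initial overconfidence by shrinking noisy early estimates towards the prior mean; as more evidence accumulates, the two rules converge. A fuller treatment would use a prior that matches the lognormal true-impact distribution.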