Challenges of Evaluating the Causal Effects of Early Child Development Programs
[Thesis]
Weber, Ann
Tager, Ira B; Fernald, Lia H
UC Berkeley
2012
Although the lack of consistent results across evaluations is generally attributed to possible problems of implementation and governance of the program, the failure to find a statistically significant effect (or, alternatively, the success of finding one) may in fact be due to the types of problems described in my dissertation. There may be bias in the outcome (or other) measure, failure of the causal assumptions to hold, or bias from the method of effect estimation. Misleading estimates of a program's benefit (in either direction) have significant policy and funding implications for the program. More importantly, the decisions made based on an evaluation have consequences for the children the programs are trying to help. I present tactics for addressing several methodological challenges to evaluation and urge investigators to update or reconsider their analytic approaches to evaluations. Since ECD intervention research is often interdisciplinary, I recommend learning new methods from other disciplines and using the best methods at our disposal.
Over 200 million children under five years old in low- and middle-income countries (39% of preschool children in developing countries) are estimated not to be achieving their potential across multiple domains of development (including sensori-motor, cognitive, language, and social-emotional development). Although there is evidence of benefit to child development for a wide range of interventions, results from assessments of scaled-up programs are less conclusive. Therefore, the assessment of large-scale early child development (ECD) programs in developing countries is a priority. This dissertation focuses on several methodological issues in evaluating large-scale ECD interventions that threaten the validity of finding a program benefit. Specifically, I address two important areas of evaluation: 1) the challenge of obtaining an unbiased measure of language development in a setting for which the test was not developed; and 2) the analytic process of determining whether the ECD intervention had a benefit that actually is the result of the intervention, given an unbiased developmental outcome. To demonstrate these challenges, I make use of data collected over a 14-year period for a national nutrition program in Madagascar. First implemented in 1999 by the National Office of Nutrition (ONN) in Madagascar, the program has expanded to include 5550 sites with coverage of approximately 1.1 million children. The program takes a comprehensive approach to improving early child nutritional status, targeting children less than 5 years of age and including multiple activities that have been found to be associated with better child outcomes. A wide spectrum of developmental outcomes was assessed in four national surveys in Madagascar, including physical growth (height and weight), and motor, cognitive, language, and behavioral skills.
In my dissertation, I focus on only two of these measures: weight-for-age (a measure of short-term nutritional status) and receptive language (understanding of words, gestures or phrases), as assessed by an adaptation of the U.S. version of the Peabody Picture Vocabulary Test, 3rd edition (PPVT-III).
Tests of early child cognition and language that were developed and carefully validated in one country are not guaranteed to maintain their properties when adapted and translated for use in another. The risk of censoring is high, and bias from differential item functioning (DIF) can be introduced when administering the test to different subgroups (e.g., ethnicities) within the same country. Using longitudinal data from two rounds of testing (when children were 3-6 years and 7-10 years of age), I apply item response theory (IRT) models to assess the performance of the PPVT in Madagascar. My analysis uncovers problem items (e.g., bias from DIF by dialect spoken in 55% of items), censoring in a large proportion of the children, and patterns of responses related to test fatigue. This information can be used to identify items that need to be dropped before estimating the program effect (e.g., items with strong, significant DIF), and items that should be replaced, modified, or re-ordered in future work (to avoid censoring and test fatigue). Although my analysis focuses on a test of vocabulary, many of the issues apply to any multi-item instrument intended to capture a latent construct. Such multi-item measures are commonly used in ECD intervention research and include other tests of language and memory, as well as non-verbal tests of cognition and socio-emotional behavior scales. I present lessons learned from working with the PPVT in Madagascar and make recommendations for how these lessons can be applied in other developing country settings.
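As a rough illustration of what screening an item for DIF involves, the sketch below simulates test responses and computes a Mantel-Haenszel log odds ratio for one item, stratified on the score on the remaining items. This is a simpler screen than the IRT models used in the dissertation and is swapped in here only for brevity. The data, the group labels, the item difficulties, and the size of the DIF shift are all hypothetical assumptions, not values from the Madagascar surveys.

```python
import numpy as np

def simulate_responses(n_children, difficulties, dif_item, dif_shift, rng):
    """Simulate Rasch-model responses where one item is harder for the focal group."""
    # 0 = reference group, 1 = focal group (hypothetical, e.g. two dialect groups)
    group = rng.integers(0, 2, size=n_children)
    ability = rng.normal(0.0, 1.0, size=n_children)
    # One row of item difficulties per child; shift the DIF item for the focal group.
    b = np.tile(difficulties, (n_children, 1))
    b[:, dif_item] += dif_shift * group
    p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - b)))
    responses = (rng.random(p_correct.shape) < p_correct).astype(int)
    return responses, group

def mantel_haenszel_log_or(responses, group, item):
    """Mantel-Haenszel log odds ratio for one item, stratified on the rest score.

    Values near 0 suggest no uniform DIF; positive values mean the item is
    easier for the reference group than for matched focal-group children.
    """
    rest_score = responses.sum(axis=1) - responses[:, item]
    num, den = 0.0, 0.0
    for s in np.unique(rest_score):
        in_stratum = rest_score == s
        y = responses[in_stratum, item]
        g = group[in_stratum]
        a = np.sum((g == 0) & (y == 1))  # reference, correct
        b = np.sum((g == 0) & (y == 0))  # reference, incorrect
        c = np.sum((g == 1) & (y == 1))  # focal, correct
        d = np.sum((g == 1) & (y == 0))  # focal, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return np.log(num / den)

rng = np.random.default_rng(0)
difficulties = np.linspace(-2.0, 2.0, 20)   # 20 hypothetical items
responses, group = simulate_responses(
    n_children=4000, difficulties=difficulties, dif_item=10, dif_shift=1.0, rng=rng
)

log_or_dif = mantel_haenszel_log_or(responses, group, item=10)    # item simulated with DIF
log_or_clean = mantel_haenszel_log_or(responses, group, item=9)   # item simulated without DIF
print(f"log OR, DIF item:   {log_or_dif:.2f}")
print(f"log OR, clean item: {log_or_clean:.2f}")
```

In this simulated setting the flagged item should show a log odds ratio well away from zero while the clean item stays close to zero. In practice a formal test and an effect-size threshold would be needed before dropping an item.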
Presuming that the developmental outcome is assessed without bias, there remains the analytic challenge of determining whether an ECD intervention has a benefit that actually is the result of the intervention. I make use of a detailed, step-by-step roadmap for estimating the average treatment effect (ATE) of Madagascar's program on children's mean weight-for-age in a community between 1997 and 2004. The evaluation of the Madagascar program is complicated by the fact that the selection into the program was non-random and strongly associated with the pre-treatment (lagged) outcome. The availability of pre-program data allows me to define the outcome as either the post-treatment value or the change from pre-treatment to post-treatment. Using these two outcome definitions, I contrast identification results for three common statistical parameters that, under different assumptions, are equivalent to my target parameter, the ATE. These statistical parameters are a post-treatment estimand commonly used in epidemiology that adjusts for measured confounders, and two difference-in-differences estimands (one of which is popular in econometrics) that can address certain types of unmeasured confounders. For identification, I make the assumptions underlying each of these estimands explicit and demonstrate the consequences of alternate choices using directed acyclic graphs and data simulations. Finally, I describe and compare three methods of estimation for each of the three estimands: traditional parametric regression, inverse probability of treatment weights (IPTW or propensity score weighting), and targeted maximum likelihood estimation (TMLE).
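For orientation, the target parameter and two of the estimands can be written out as below. The notation is an assumption made for illustration and is not taken from the dissertation. The abstract also does not state the exact form of the two difference-in-differences estimands, so only one generic covariate-adjusted version is shown.

```latex
% Notation (assumed for illustration; not taken from the dissertation):
%   A      = indicator that a community received the program
%   W      = measured pre-treatment confounders
%   Y_0    = pre-treatment (lagged) outcome
%   Y_1    = post-treatment outcome
%   Y_1(a) = counterfactual post-treatment outcome under A = a
\begin{align*}
\psi^{\mathrm{ATE}}  &= E\bigl[\,Y_1(1) - Y_1(0)\,\bigr] \\
\psi^{\mathrm{post}} &= E_{W,Y_0}\bigl[\,E(Y_1 \mid A=1, W, Y_0) - E(Y_1 \mid A=0, W, Y_0)\,\bigr] \\
\psi^{\mathrm{DiD}}  &= E_{W}\bigl[\,E(Y_1 - Y_0 \mid A=1, W) - E(Y_1 - Y_0 \mid A=0, W)\,\bigr]
\end{align*}
```

Under this notation, the post-treatment estimand equals the ATE if there is no unmeasured confounding given the measured confounders and the lagged outcome (together with positivity). A difference-in-differences estimand of this form needs further assumptions, such as a conditional parallel-trends condition, to equal the ATE.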
Throughout, I avoid imposing parametric model assumptions unless they are firmly supported by knowledge, and deliberately keep the process of identification separate from the process of estimation in order to avoid the common confusion of the two.
My findings show that I am faced with a serious bias trade-off when choosing an estimand for the ATE of the Madagascar nutrition program. A post-treatment estimand controls for confounding due to the lagged outcome but not from possible unmeasured confounders. The difference-in-differences estimands do not control for confounding by the lagged outcome, but have the potential to adjust for a certain type of unmeasured confounding. However, the difference-in-differences estimands have the potential for introducing bias if the additional assumptions they require (beyond those needed for the post-treatment estimand) are not met. The three estimands result in very different estimates of effect in the Madagascar study, regardless of method of estimation. The estimates for the ATE from the post-treatment estimand are less than one-tenth of a standard deviation (SD) improvement in community mean weight-for-age z-score (as compared to the WHO reference population). The two difference-in-differences estimands are comparable to each other, with estimates of the ATE ranging from 0.24 to 0.28 SD increase in mean weight-for-age z-score (statistically significant with all estimation methods). However, since I am unable to estimate either the magnitude or direction of possible confounding from unmeasured factors, or the magnitude or direction of bias from the failure of the assumptions to hold, I conclude that my best choice is the post-treatment estimand. This simple estimand adjusts optimally for known measured confounders and is equal to the ATE under the fewest assumptions.
Given this choice of estimand, the choice of estimator can still make a difference.
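The bias trade-off between a post-treatment estimand and a difference-in-differences estimand can be illustrated with a small simulation. The sketch below is not the dissertation's simulation: the data-generating mechanisms, the effect size, and all variable names are hypothetical, and both estimands are estimated with simple parametric tools (ordinary least squares and a difference in mean changes). Scenario A has selection into treatment driven by the lagged outcome. Scenario B has selection driven by an unmeasured, time-invariant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000        # number of simulated communities (hypothetical)
TRUE_ATE = 0.1    # true effect on the post-treatment outcome, in SD units (hypothetical)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def post_treatment_estimate(y1, a, y0):
    """Treatment coefficient from OLS of the post-treatment outcome on (1, A, Y0)."""
    X = np.column_stack([np.ones_like(y0), a, y0])
    coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
    return coef[1]

def did_estimate(y1, a, y0):
    """Unadjusted difference-in-differences: difference in mean change by treatment."""
    change = y1 - y0
    return change[a == 1].mean() - change[a == 0].mean()

# Scenario A: communities with a low lagged outcome are more likely to be treated,
# and outcomes regress toward the mean over time.
y0 = rng.normal(0.0, 1.0, N)
a = rng.binomial(1, expit(-1.5 * y0))
y1 = 0.5 * y0 + TRUE_ATE * a + rng.normal(0.0, 0.5, N)
post_a = post_treatment_estimate(y1, a, y0)
did_a = did_estimate(y1, a, y0)

# Scenario B: an unmeasured, time-invariant factor u drives both selection and
# the outcome level; the lagged outcome is only a noisy proxy for u.
u = rng.normal(0.0, 1.0, N)
a = rng.binomial(1, expit(-1.5 * u))
y0 = u + rng.normal(0.0, 0.5, N)
y1 = u + TRUE_ATE * a + rng.normal(0.0, 0.5, N)
post_b = post_treatment_estimate(y1, a, y0)
did_b = did_estimate(y1, a, y0)

print(f"true ATE = {TRUE_ATE}")
print(f"Scenario A: post-treatment = {post_a:.3f}, DiD = {did_a:.3f}")
print(f"Scenario B: post-treatment = {post_b:.3f}, DiD = {did_b:.3f}")
```

In these two constructed scenarios each estimand recovers the true effect in one case and is clearly biased in the other, and the analyst cannot tell from the data alone which scenario applies. The simulation says nothing about which mechanism operated in Madagascar.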
In fact, the only significant effect for the post-treatment estimand was obtained with TMLE (estimate of the ATE = 0.066 SD, CI: 0.001, 0.146 SD). TMLE has specific advantages over either parametric regression or IPTW, and improves on both by implementing a bias reduction step to estimate the target parameter of interest. In addition, TMLE is considered doubly robust to model misspecifications. Therefore, I conclude that TMLE is a better choice for estimation than the other two methods, and that my best estimate of the ATE is small, but statistically significant. Alternate target parameters and alternate estimation approaches are unlikely to resolve the uncertainty of the choice of estimand for the Madagascar evaluation. However, future analytic work on other nutritional outcomes (e.g., height-for-age) and longer term effects of the program (e.g., from a third wave of data in 2011) may provide an accumulation of evidence of a causal benefit of Madagascar's nutrition program.
There is mixed evidence of the effectiveness of large-scale nutrition programs on early child development outcomes.
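To show how the three estimation methods named in this abstract can disagree for the same post-treatment estimand, the sketch below applies them to simulated data in which the outcome regression is deliberately misspecified and the propensity model is correct. Everything here is an assumption made for illustration: the data-generating process, the variable names, and in particular the targeting step, which uses a simplified linear fluctuation rather than whatever TMLE implementation the dissertation used.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000        # simulated units (hypothetical)
TRUE_ATE = 0.1    # true average treatment effect (hypothetical)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, a, n_iter=25):
    """Logistic regression coefficients via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (a - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

# Simulated data: w confounds treatment and outcome; the outcome is nonlinear in w.
w = rng.normal(0.0, 1.0, N)
a = rng.binomial(1, expit(-1.0 - w))
y = TRUE_ATE * a + w + 0.5 * w**2 + rng.normal(0.0, 0.5, N)

# Propensity score model (correctly specified here: logistic in w).
Xg = np.column_stack([np.ones(N), w])
g = expit(Xg @ fit_logistic(Xg, a))

# Outcome regression (misspecified on purpose: it omits w**2).
Xq = np.column_stack([np.ones(N), a, w])
coef, *_ = np.linalg.lstsq(Xq, y, rcond=None)
q_obs = Xq @ coef
q1 = np.column_stack([np.ones(N), np.ones(N), w]) @ coef    # predictions with A set to 1
q0 = np.column_stack([np.ones(N), np.zeros(N), w]) @ coef   # predictions with A set to 0

# 1) Parametric regression: average difference in predicted outcomes.
ate_regression = np.mean(q1 - q0)

# 2) IPTW with normalized weights.
ate_iptw = (np.sum(a * y / g) / np.sum(a / g)
            - np.sum((1 - a) * y / (1 - g)) / np.sum((1 - a) / (1 - g)))

# 3) TMLE-style targeting step with a linear fluctuation (simplified variant):
#    regress the residual on the "clever covariate" h, then update the predictions.
h_obs = a / g - (1 - a) / (1 - g)
eps = np.sum(h_obs * (y - q_obs)) / np.sum(h_obs**2)
ate_tmle = np.mean((q1 + eps / g) - (q0 - eps / (1 - g)))

print(f"true ATE   = {TRUE_ATE}")
print(f"regression = {ate_regression:.3f}")
print(f"IPTW       = {ate_iptw:.3f}")
print(f"TMLE-style = {ate_tmle:.3f}")
```

Because the propensity model is correct in this constructed example, the weighted and targeted estimates should land near the true value while the misspecified regression does not. With both models misspecified, none of the three would be protected.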