The methods of artificial intelligence and statistical machine learning are achieving tremendous success across fields, with applications ranging from cancer screening to machine translation. However, continued improvement in predictive accuracy is not sufficient to guarantee that these systems can be used reliably across a variety of societal contexts. There are three main challenges in the development of robust AI systems: (a) {\em Elicitation}, to obtain high-quality feedback on training instances; (b) {\em Aggregation}, to understand and summarize the trade-offs arising from decisions across society; and (c) {\em Causal Inference}, to estimate the impact of any system before deployment. My thesis makes progress on specific problems within each of these three challenges. A unifying theme is the need to carefully handle the heterogeneity of agents, which is prevalent in many domains, be it the reviewers on a platform like Yelp or the consumers of a recommendation service like Amazon.

First, I propose new peer prediction mechanisms to elicit feedback on instances whose responses cannot be verified. I consider the setting with heterogeneous agents and heterogeneous tasks, and show that the proposed mechanisms provide stronger incentive guarantees, both in theory and in empirical evaluation. With regard to aggregation, I study theoretical aspects of voting rules, motivated by the anticipated use of AI systems in societal decision making. I provide a unified view of voting by treating elicitation and aggregation together, and give a sharp characterization of the performance of such rules. Finally, I develop a tensor decomposition approach to estimating the impact of a policy that applies treatments over a sequence of rounds.
I show that this estimator is consistent, propose an algorithm to solve the estimation problem efficiently, and show through simulations that it outperforms existing methods for causal inference under time-varying treatments.