ACM FAccT '24 Paper #1406 Reviews and Comments
===========================================================================
Paper #1406
From the Fair Distribution of Predictions to the Fair Distribution of Social Goods: Evaluating the Impact of Fair Machine Learning on Long-Term Unemployment

Review #1406A
===========================================================================
* Updated: Mar 24, 2024

Post-Rebuttal Comments
----------------------
Thanks for the helpful and detailed responses. The proposed changes address my concerns and points of confusion.

Summary of Contribution
-----------------------
The paper argues that imposing fairness constraints on risk prediction algorithms may not translate to fairness at the outcome level. The paper focuses on a setting in which the goal is to reduce the rate of long-term unemployment. Resources (such as educational services) are allocated according to risk predictions (how likely is an individual to face long-term unemployment?). The paper notes that it is not clear that fairness in risk prediction will lead to fairness in outcomes, such as the difference in the rate of long-term unemployment between men and women. The paper provides conditions under which it is possible to estimate the distribution of outcomes given different risk prediction algorithms and policies for allocating resources. The basic idea is to estimate the causal effect of different interventions by group and risk score (CATE) and the allocation of resources induced by a given risk algorithm. The paper then applies this method using real data from an employment setting, and shows that fairness interventions on the risk algorithm do not improve fairness in outcomes (and in fact can increase gaps in employment rates).

Quality
-------
Strengths:
- The paper makes an interesting argument---that fairness interventions at the level of the risk algorithm may not translate to the level of outcomes---and conceptualizes this claim formally.
- The paper backs this claim with an empirical study that leverages a new empirical framework backed by theoretical results. A strength of the paper is its three levels of contribution---conceptual, theoretical, and empirical. A thorough analysis of the conceptual point (for example, with case studies) could by itself make a strong paper.
- The empirical study yields several useful insights. For example, it considers the claim that risk scores can be used to estimate treatment effects: that treatment effects are highest for individuals with moderate risk, since low-risk individuals will do well regardless, and high-risk individuals are a "lost cause." The empirical results challenge this claim by noting that a policy of allocating resources to the highest-risk individuals performs similarly to allocating resources to those at moderate risk.

Weaknesses:
- The paper does a good job grounding the central claim in a concrete context (employment programs), but would benefit from giving a variety of examples in which the argument would and would not apply.
- The primary empirical results (gender gap in long-term unemployment rate) lack uncertainty quantification. It appears that reported results are based on individual models, making it harder to evaluate the robustness of these results.

Clarity
-------
The formalization and theoretical results were easy to follow, and the paper as a whole was well organized.
The experimental methodology was clearly explained, though some details appear missing (such as the size of the train-test split) and code does not appear to be available, hurting reproducibility.

A couple of points/suggestions:
- In Section 2, it would be helpful to include a couple of additional examples of relevant settings.
- Lines 485-487: I would elaborate on the issue described here.

Originality
-----------
The paper builds on the existing literature on performative prediction (the observation that use of a prediction algorithm can influence the distribution the algorithm is trying to predict). It offers a new perspective by focusing explicitly on outcomes, whereas performative prediction is primarily concerned with the validity of the prediction algorithm over time. Notably, the paper introduces methodology to analyze how the adoption of a prediction algorithm in policy contexts will shape outcomes.

Scholarship
-----------
The work appears to be appropriately situated in the context of related work, and cites results and definitions borrowed from other work.

Significance and Impact
-----------------------
The conceptual points made by the paper may have significant impact. Risk prediction is increasingly used to dictate policy decisions (such as resource allocation), and the paper clearly illustrates potential shortcomings of such an approach. In particular, while great attention has been placed on the fairness of risk prediction, the paper shows that fairness at this level may not transfer at all to the ultimate outcome. The empirical framework appears to be broadly useful in assessing the efficacy of potential risk-based policies pre-deployment.

Relevance
---------
5. This submission has relevance to FAccT: Strongly Agree

Overall merit
-------------
5. Would be an accepted paper at a highly respected venue in my discipline. A good submission; an accept. I vote for accepting this submission, but would not be upset if it were rejected.

Review #1406B
===========================================================================
* Updated: Mar 23, 2024

Post-Rebuttal Comments
----------------------
I appreciate the authors' clear and concise rebuttal. Their proposed edits would address my major concerns, and I am in favor of accepting this paper.

Summary of Contribution
-----------------------
This paper discusses how certain fairness constraints on predictions affect the distribution of social goods after deployment. In particular, the authors present a strong empirical case study using Swiss unemployment data to show how predictions of unemployment might be used to allocate interventions, and what the ultimate effect on real-world inequalities would be.

Quality
-------
The work is fairly comprehensive. The theoretical section, while mostly reconceptualizing existing work, is relevant, thorough, and helps motivate the stronger empirical contributions. The empirical section adequately presents a novel case study. However, the discussion could expand on unlikely but possible negative effects. An unstated implication of the work is that certain fairness constraints should not be used to modify predictions, at least in this setting. The paper should at least acknowledge that there may be settings where these fairness constraints are beneficial.

Clarity
-------
The submission is very clearly written and well-organized. The main arguments are straightforward and easy to follow throughout.
One critique of the experimental methodology: the paper should consider how calibration, and in particular multi-calibration, of the risk scores would affect the gender gap. Such a "fairness constraint" at prediction time is more likely to be implemented in practice, unlike one of the chosen metrics, "statistical parity", which does not seem as relevant to this setting (equalizing the "acceptance rate" in this case does not make sense because long-term unemployment is inherently imbalanced).

Originality
-----------
The empirical section is entirely original, and presents a novel case study on real-world data and policies. The theoretical section mostly reconceptualizes existing work, but it is a relevant reframing and helps motivate the empirical section.

Scholarship
-----------
The work adequately references many related works; however, some of the claims about the field of algorithmic fairness are too strong. Many recent works have analyzed long-term and policy fairness, and this work adds to that growing literature through a novel framework and case study. In addition, the literature on calibration and multi-calibration is ignored, yet should be considered given this work's use of risk scores.

Significance and Impact
-----------------------
The work makes strong empirical contributions, with a new case study and experimental approach. It also adds to the many recent works on long-term and policy fairness, making it highly relevant for the FAccT community.

Relevance
---------
5. This submission has relevance to FAccT: Strongly Agree

Overall merit
-------------
5. Would be an accepted paper at a highly respected venue in my discipline. A good submission; an accept. I vote for accepting this submission, but would not be upset if it were rejected.

Review #1406C
===========================================================================
* Updated: Mar 25, 2024

Post-Rebuttal Comments
----------------------
Thanks to the authors for their clarifications.

AC Metareview
-------------
Reviewers agreed that this paper provides important theoretical and empirical contributions. I encourage the authors to carefully read the reviews for suggestions on improvements and extensions of this work.

Overall merit
-------------
5. Would be an accepted paper at a highly respected venue in my discipline. A good submission; an accept. I vote for accepting this submission, but would not be upset if it were rejected.

Rebuttal Response by Author [Sebastian Zezulka] (665 words)
---------------------------------------------------------------------------
We are grateful to the reviewers for their thoughtful comments. We respond inline.

**Reviewer A**

“The paper [...] would benefit from giving a variety of examples in which the argument would and would not apply.”

This is a good point. Our analysis applies whenever predictions of an outcome are used to inform decisions that themselves affect the outcome. To straightforwardly replicate our analysis, it is necessary that outcomes are observed under every treatment. For example, medical triage (Tal, 2023; Caruana et al., 2015) fits closely with our analysis: predictions of mortality affect treatment decisions, which themselves affect the probability of mortality. Educational tracking can also be analyzed in this way: predictions about educational performance inform tracking decisions, which, in turn, affect educational outcomes. The analysis becomes more difficult when outcomes are never observed under some treatments.
For example, in the context of pre-trial detention, defendants who are detained cannot be rearrested. In the context of child protection, out-of-home placement cannot be observed for cases that are not investigated (Coston et al., 2020). We will add this discussion to the revised draft.

“The primary empirical results (gender gap in long-term unemployment rate) lack uncertainty quantification.”

True. Uncertainty quantification in the estimation of individual potential outcomes (and treatment effects) is an open problem in the econometric literature (Curth et al., 2024). Conformal prediction methods may apply (Alaa et al., 2023; Lei and Candès, 2021). We will mention this as an avenue for future work.

“... some details appear missing (such as the size of the train-test split) and code does not appear to be available, hurting reproducibility.”

We discuss the train-test split on line 487, but we will make this clearer in the updated version. We will also publish the code on GitHub.

“Lines 485-487: I would elaborate on the issue described here.”

Yes, this should be made clearer in the revision. The problem is that some people are assigned to “no program” while others exit unemployment before they can receive an assignment. In the data, these are coded the same way. To avoid overestimating the effectiveness of “no program”, we drop those who are reemployed before they have a chance to be assigned. (Compare: if someone spontaneously recovers before being assigned to an arm of a drug trial, this should not count in favor of the placebo.) Our strategy for estimating when an individual would have been assigned (had they not found a job) is adopted from Knaus (2022) and Lechner (1999). Out of 69,371 observations, we consequently drop 5,076.

**Reviewer B**

“The paper should at least acknowledge that there may be settings where these fairness constraints are beneficial.”

True. It was not obvious to us what effect fairness constraints would have on employment outcomes until we ran our analysis. Our point is not that they would always make things worse, but rather that part of due diligence is forecasting their effects on outcomes. Our methodology is meant to support such efforts. We will make this clearer in the revision.

“One critique of the experimental methodology: the paper should consider how calibration, and in particular multi-calibration, of the risk scores would affect the gender gap.”

This is an interesting point that we will mention in the revision. The fairness-unconstrained risk scores are approximately calibrated for men and women (see Table 3 in the Appendix). This is not surprising given the results by Liu et al. (2019). Our results show that this is good for the resulting gender LTU gap. We can check whether we still have approximate calibration when considering more intersectional identities (gender, citizenship, marital status, etc.); a sketch of such a subgroup check appears at the end of this response. A true test of multicalibration (over all computable subsets) may have to be left for future work.

“some of the claims about the field of algorithmic fairness are too strong.”

We acknowledge that considerable parts of the literature investigate long-term and outcome-oriented fairness. We explicitly rely on this literature and cite it extensively in Section 2.1. In an updated version, we will make sure not to overstate our critique.
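To make the kind of subgroup calibration check mentioned above concrete, here is a minimal sketch. It runs on synthetic stand-in data; the column names (`risk_score`, `ltu`, `gender`, `citizenship`), the binning, and the data-generating process are illustrative assumptions, not our actual data or pipeline.

```python
# Illustrative only: a binned check of (approximate) calibration of risk scores
# within subgroups. Synthetic data and column names are placeholders, not the
# paper's pipeline or the Swiss unemployment data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: risk score, observed LTU outcome, group attributes.
n = 10_000
df = pd.DataFrame({
    "risk_score": rng.uniform(0, 1, n),
    "gender": rng.choice(["f", "m"], n),
    "citizenship": rng.choice(["ch", "non-ch"], n),
})
# Outcome generated to be roughly calibrated with the score, for illustration.
df["ltu"] = rng.binomial(1, df["risk_score"])

def subgroup_calibration(data, score_col, outcome_col, group_cols, n_bins=10):
    """Mean predicted risk vs. observed LTU rate per score bin, per subgroup."""
    data = data.copy()
    data["bin"] = pd.cut(data[score_col], bins=np.linspace(0, 1, n_bins + 1),
                         include_lowest=True)
    grouped = data.groupby(group_cols + ["bin"], observed=True)
    out = grouped.agg(mean_score=(score_col, "mean"),
                      observed_rate=(outcome_col, "mean"),
                      n=(outcome_col, "size"))
    # Per-bin calibration gap; large gaps flag miscalibration within a subgroup.
    out["gap"] = (out["mean_score"] - out["observed_rate"]).abs()
    return out

# Calibration by gender, then by an intersectional subgroup (gender x citizenship).
print(subgroup_calibration(df, "risk_score", "ltu", ["gender"]))
print(subgroup_calibration(df, "risk_score", "ltu", ["gender", "citizenship"]))
```

This sketch only audits calibration on pre-specified intersectional groups; a genuine multicalibration test would additionally require approximate calibration on a much richer collection of computable subgroups.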