by Michael Biggs, Department of Sociology, University of Oxford.
The Tavistock’s Gender Identity Development Service (GIDS) and University College London have finally released the results of their experiment on puberty blockers, albeit not in a scientific journal. The timing is curious. The paper’s first author, Dr Polly Carmichael (Director of GIDS), refused on a flimsy pretext to provide it to the judicial review brought by Keira Bell and Mrs A. On the day after the judgment was handed down, the paper appeared on a preprint server, medRxiv (Carmichael et al., 2020). It was not discovered for some days because the authors were too modest to seek publicity. The event has not been mentioned on the website of the Tavistock and Portman NHS Foundation Trust, which had originally announced the experiment in 2011 with some fanfare: ‘It is hoped that the results of this study will contribute to improving the standards of care offered to this group of young people and their families.’
The fact that Carmichael et al. have only now published results that were available in 2016 (for outcomes after one year) and in 2017 (after two years) shows their lack of concern for the standards of care offered to this group of young people. Indeed, it is almost certain that the experiment would have been conveniently forgotten without Transgender Trend’s sustained scrutiny. This website first called on the Tavistock to publish the results of its ‘Early Intervention Study’ in March 2019. I made a formal complaint to the Health Research Authority, which oversees the Research Ethics Committee that had approved this experiment. The report of its investigation was sent to me (embargoed before publication) on 11 October 2019. Carmichael et al.’s statistical analysis plan was ‘lodged with the Research Ethics Committee of the Health Research Authority on 9 October 2019’ (Appendix S2, p. 1).
The long-delayed paper provides results for 44 subjects, aged 12 to 15, who were prescribed gonadotropin-releasing hormone agonist (GnRHa). They were followed up at three time points: after one year, two years, and three years. Because the subjects could progress to cross-sex hormones soon after their sixteenth birthday, only 24 remained on GnRHa after two years, and only 14 at three years.
The authors’ statistical analysis plan, written in 2019 after they had come under scrutiny from Transgender Trend, is remarkable for its low expectations. It is far more pessimistic than the original research protocol from 2010.
- 2010: ‘Going through puberty in what is perceived to be the wrong body can be very distressing and in some cases contribute to self-harm and suicide attempts …. It is important to evaluate whether intervention early in puberty reduces self harm and suicide attempts’ (Viner et al., 2010, p. 15).
- 2019: ‘We hypothesise no change in self-harm across the study’ (Carmichael et al., 2020, S2, p. 9).
- 2010: ‘Early intervention is also associated with a reduction in the gender dysphoria experienced by these adolescents …’ (Viner et al., 2010, p. 15).
- 2019: ‘It is therefore unlikely that GnRHa treatment will result in significant reduction in body dissatisfaction’ (Carmichael et al., 2020, S2, pp. 12-13).
The authors have provided a perfect illustration of what psychologists call ‘HARKing’: hypothesizing after the results are known (Kerr, 1998). Aside from this being a questionable research practice, one wonders how it could be ethical to give an experimental treatment to children if the experimenters themselves expect the treatment not to lead to any improvement.
Psychological functioning does not change
The paper’s headline finding is that ‘GnRHa treatment brought no measurable benefit nor harm to psychological function in these young people with GD [gender dysphoria]’ (p. 45). This seems reassuring given that the first 30 subjects enrolled in the GIDS experiment reported more negative than positive effects after one year (GIDS, 2015; Biggs, 2020).
The paper’s findings might partly reflect the authors’ choice to present results only for girls and boys combined, and to test sex differences (Table 6) for only 2 measures out of 26. ‘Our statistical analysis plan restricted testing all outcomes for differences by sex due to the type 1 error risk’, they explain (p. 46). This risk is a legitimate concern, which will be discussed below. There is no justification, however, for not tabulating the results disaggregated by sex, as done by the landmark Dutch study on which the Tavistock’s experiment was modelled (de Vries et al., 2011), and by Carmichael’s presentation of the preliminary results (GIDS, 2015). My article (Biggs, 2020) shows that the measures for boys and for girls are uncorrelated, in the preliminary GIDS results and likewise in the Dutch study. In both data sets, to take the clearest example, girls’ body image worsened following GnRHa, while boys’ body image improved. By combining both sexes, the authors make it impossible to discern such patterns.
The authors also provide frustratingly little information on self-harm. There are two indexes, one created from the child’s answers and one from the parent’s. Each index sums two questions, each scored as 0, 1, or 2. The authors report only the median and the interquartile range (Table 4). The median is always 0 because most children do not harm themselves. The lower quartile is 0, of course; the upper quartile is 1 in every measure except the index for Youth Self Report after twelve months, when it is 2. (The difference between this measure at baseline and at one year is apparently not statistically significant; p = .4.) Why not report the mean, as they had previously (GIDS, 2015)? Or tabulate the frequency? Disaggregating by sex would also be informative, because their own preliminary results for the first 30 subjects showed that the increase in self-harm—on the question ‘I deliberately try to hurt or kill myself’—was greater for girls than for boys (the sex difference was statistically significant, p = .014).
The lack of discernible improvement is quite surprising because children and their parents must have been enthusiastic about puberty blockers and would have considered themselves fortunate to be in the first group of British adolescents to receive them. After all, this treatment had been demanded for years by Mermaids and GIRES, as a lifesaving elixir for children who identify as transgender (Biggs, 2019). This context should have created a powerful placebo response, even if the specific physical effects of GnRHa were minimal. We know that almost all, if not all, of the benefit of antidepressants comes from the placebo response (Kirsch, 2019).
The sample is too small
The authors are right to be wary of conducting too many statistical tests on a small sample, comprising only 44 individuals. I will try to explain this simply. Let us say we find a sample statistic, like the average change in one measure after these particular patients have been treated for a year, to be statistically significant at the .05 level. This means that if the population parameter were truly zero (if there were really no effect), we would have less than a 5% probability of getting a statistic of at least that magnitude in a sample of that size, simply due to random variability. In other words, the probability of a ‘Type I error’ on any single test is 5%. The more measures we test, however, the greater the probability of finding at least one to be statistically significant by chance alone. If we were to carry out 20 statistical tests on completely random variables, on average 1 of the 20 would be statistically significant at the .05 level.
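The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: it assumes the tests are independent of one another, which will not hold exactly for correlated psychological measures, and the function names are invented for this example.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of tests that come out 'significant' by chance alone
    when there is no real effect on any measure."""
    return num_tests * alpha


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across all tests.

    Assumption: the tests are independent, so the chance that none of
    them is a false positive is (1 - alpha) ** num_tests.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 1 test, the 20 tests of the example above, and the paper's 26 measures.
    for num_tests in (1, 20, 26):
        print(
            num_tests,
            round(expected_false_positives(num_tests), 2),
            round(familywise_error_rate(num_tests), 3),
        )
```

Under the independence assumption, 20 tests on pure noise yield one spurious finding on average, and the chance of at least one is well above one half.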
The authors point out that my analysis (Biggs, 2020) of their preliminary results does not adjust for the number of statistical tests. I replicated the procedure used in the Dutch article (de Vries et al., 2011) which provides the only significant evidence for the benefits of puberty blockers. The authors’ critique therefore applies equally to that article, whose sample was almost as small, ranging from 41 to 57 (depending on the measure). Statistical tests were conducted on 14 measures. Applying the Bonferroni correction, as the authors advocate (Carmichael et al., 2020, p. 18), would also eliminate 3 out of 8 of the positive Dutch findings. Most importantly, the improvement in overall psychological functioning (captured by the Children’s Global Assessment Scale) and the reduction in depression would no longer be statistically significant (p > .05 / 14).
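The Bonferroni correction simply divides the significance level by the number of tests. A minimal sketch follows; the p-values in it are hypothetical placeholders chosen for illustration and are not taken from the Dutch or English studies, and the function names are invented for this example.

```python
def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / num_tests


def survives_bonferroni(p_value: float, num_tests: int, alpha: float = 0.05) -> bool:
    """True if a single p-value remains significant after correction."""
    return p_value <= bonferroni_threshold(num_tests, alpha)


if __name__ == "__main__":
    num_tests = 14  # number of measures tested, as stated in the text above
    print("corrected threshold:", round(bonferroni_threshold(num_tests), 5))

    # Hypothetical p-values, NOT results from any study discussed here.
    for p_value in (0.001, 0.01, 0.04):
        print(p_value, survives_bonferroni(p_value, num_tests))
```

With 14 tests the corrected threshold is about .0036, so a result with p = .01 counts as significant at the conventional .05 level but not after correction.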
The authors make a convincing argument that their sample was too small to really detect changes in so many measures. Why did they not realize this earlier? When the experiment was designed, the GIDS had a caseload of only 29 teenagers aged between 12 and 15 (Viner et al., 2010, pp. 8–9), and so they planned to enrol 30–45 patients over three years. Referrals subsequently grew exponentially, perhaps helped by Dr Carmichael’s promotion of puberty blockers in newspaper interviews and on BBC Children’s Television. In 2014/15, the final year of enrolment on to the experiment, the GIDS received referrals for 282 teenagers in the 12-15 age bracket. In other words, the annual increase was by then ten times greater than the total number of patients just four years earlier. After enrolment in the experiment finished, the GIDS recruited over 50 children aged 10-14 each year to its GnRHa programme. The GIDS therefore should now possess data on the effect of puberty suppression—after one year—on at least 250 more children (counting those referred to the endocrine clinic from January 2015 to December 2018). A sample size of around 300 would provide sufficient statistical power to really test whether adolescents undergoing puberty suppression improve or deteriorate. Unfortunately, the GIDS chose either not to collect or not to report these data, despite winning £1.3 million in research funding. Why?
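To see how sample size affects the ability to detect a change, here is a rough power calculation. It is a sketch under strong assumptions: a two-sided test of a mean change using the Normal approximation, and a hypothetical standardized effect size of 0.2 (a conventionally 'small' effect) that is not a figure from the paper.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def approximate_power(effect_size: float, sample_size: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test that a mean change is zero.

    effect_size is the true mean change divided by its standard deviation.
    Assumption: the test statistic is Normally distributed, which slightly
    overstates power compared with a t-test in small samples.
    """
    critical_value = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(sample_size)
    # Probability of landing beyond either critical value.
    upper_tail = STANDARD_NORMAL.cdf(shift - critical_value)
    lower_tail = STANDARD_NORMAL.cdf(-shift - critical_value)
    return upper_tail + lower_tail


if __name__ == "__main__":
    hypothetical_effect = 0.2  # illustrative only
    for sample_size in (44, 300):
        print(sample_size, round(approximate_power(hypothetical_effect, sample_size), 2))
```

Under these assumptions, a sample of 44 would detect such an effect only about a quarter of the time, whereas a sample of 300 would detect it more than 90% of the time.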
No information on autism
In the case brought by Keira Bell and Mrs A, the judges asked for the number of children on the autism spectrum who were administered puberty blockers. They were told that these data could not be obtained. The judgment ‘found this lack of data analysis—and the apparent lack of investigation of this issue—surprising’ (para 35). The authors mention that they used the Social Responsiveness Scale to assess autism but simply promise that ‘these data will be analysed in the future’ (Carmichael et al., 2020, p. 17). We know only that out of the first 30 experimental subjects, 16 were in the normal range, 10 had ‘mid to moderate’ Autism Spectrum Disorder traits, and 5 had ‘severe’ traits as measured by SRS-2 (GIDS, 2015, p. 50).
An American endocrinologist, Dr Michael Laidlaw, raised the alarm about the effects of GnRHa on bone density, which must accrue rapidly during puberty to avoid osteoporosis later in life. This paper confirms his fears. At baseline the subjects were already half a standard deviation below the norm for their age and sex (Table 3). After one year, they were one standard deviation below the norm; at two years, more than one standard deviation below. (The authors chose not to statistically test these changes in Z-scores, for reasons which are unclear.) The paper omits the range of bone density, which is crucial: given that after one year the average was a standard deviation below the norm, many of the subjects would fall more than two standard deviations below the norm, which is a warning sign ‘that your bone density is lower than it should be for someone of your age’ (NHS, 2020). In the overall population, only 2% of individuals have bone density low enough to meet this warning threshold (Z-score < -2). After two years on GnRHa, perhaps 30% of those with puberty suppression could meet this threshold for spine bone density, even adjusting for height. (My calculation assumes the Normal distribution and necessarily estimates the standard deviation of the Z-score from the authors’ confidence intervals.)
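The kind of calculation described in the parenthesis above can be sketched as follows. This is an illustration of the general method, not a reproduction of the original working: the function names are invented, the confidence interval is assumed to be a 95% interval built on the Normal approximation, and the confidence limits in the example are hypothetical placeholders rather than values from the paper's Table 3.

```python
from math import sqrt
from statistics import NormalDist


def sd_from_confidence_interval(lower: float, upper: float, sample_size: int,
                                level: float = 0.95) -> float:
    """Recover a sample standard deviation from a confidence interval for a mean.

    Assumption: the interval is mean +/- z * SD / sqrt(n), with z taken
    from the Normal distribution.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    standard_error = (upper - lower) / (2.0 * z)
    return standard_error * sqrt(sample_size)


def share_below(threshold: float, mean: float, sd: float) -> float:
    """Proportion of a Normally distributed group falling below a threshold."""
    return NormalDist(mean, sd).cdf(threshold)


if __name__ == "__main__":
    # General population: Z-scores have mean 0 and SD 1 by construction.
    print(round(share_below(-2.0, 0.0, 1.0), 3))

    # A group averaging one SD below the norm, with an ASSUMED spread of 1.
    print(round(share_below(-2.0, -1.0, 1.0), 3))

    # Hypothetical confidence limits, NOT values from the paper.
    print(round(sd_from_confidence_interval(-1.3, -0.7, 44), 2))
```

For the general population this gives about 2.3% below a Z-score of -2. For a group averaging one standard deviation below the norm with a spread of 1, it gives about 16%, and the share rises as the group mean falls further or the spread widens.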
Whether the failure to accrue bone density increases the risk of fractures is unclear. The authors collected data on various ‘adverse events’, but these did not include broken bones.
Puberty blockers lead inexorably to cross-sex hormones
The most important outcome—but the least surprising—is that 43 out of 44 subjects continued to cross-sex hormones. Although puberty blockers are promoted as a diagnostic aid, since 2006 (if not before) we have known that in almost every case they lead to cross-sex hormones and eventually surgery. It is therefore astonishing that the authors continue to claim that ‘pubertal suppression may be both a treatment in its own right and also an intermediate step’ (p. 48).
Considered as a treatment in its own right, the suppression of puberty with GnRHa might be the only treatment provided by the NHS for which the costs clearly exceed the benefits. The sole justification for GnRHa is to prepare a child for lifelong medicalization with cross-sex hormones and surgeries, with irreversible consequences for sexuality and fertility. After all, the paper that introduced puberty suppression was entitled ‘The Feasibility of Endocrine Interventions in Juvenile Transsexuals’ (Gooren & Delemarre-van de Waal, 1996). The question is whether the GIDS has the moral authority and scientific expertise to designate children as young as 10 as juvenile transsexuals. As the judges ruled in the case of Keira Bell and Mrs A, ‘Apart perhaps from life-saving treatment, there will be no more profound medical decisions for children than whether to start on this treatment pathway’ (para 149).
Biggs, M. (2019). The Tavistock’s experiment with puberty blockers.
Biggs, M. (2020). Gender dysphoria and psychological functioning in adolescents treated with GnRHa: Comparing Dutch and English prospective studies. Archives of Sexual Behavior, 49, 2231–2236. https://doi.org/10.1007/s10508-020-01764-1
Carmichael, P., Butler, G., Masic, U., Cole, T. J., De Stavola, B. L., Davidson, S., Skagerberg, E. M., Khadr, S., & Viner, R. (2020). Short-term outcomes of pubertal suppression in a selected cohort of 12 to 15 year old young people with persistent gender dysphoria in the UK. medRxiv. https://doi.org/10.1101/2020.12.01.20241653
de Vries, A. L. C., Steensma, T. D., Doreleijers, T. A. H., & Cohen-Kettenis, P. T. (2011). Puberty suppression in adolescents with gender identity disorder: A prospective follow-up study. Journal of Sexual Medicine, 8, 2276–2283. https://doi.org/10.1111/j.1743-6109.2010.01943.x
Gender Identity Development Service. (2015). Preliminary results from the early intervention research. Tavistock and Portman NHS Foundation Trust, Board of Directors, Part One: Agenda and Papers … 23rd June 2015, 50–55. https://tavistockandportman.nhs.uk/documents/142/board-papers-2015-06.pdf
Gooren, L., & Delemarre-van de Waal, H. (1996). The feasibility of endocrine interventions in juvenile transsexuals. Journal of Psychology & Human Sexuality, 8, 69–74. https://doi.org/10.1300/J056v08n04_05
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. https://doi.org/10.1207/s15327957pspr0203_4
Kirsch, I. (2019). Placebo effect in the treatment of depression and anxiety. Frontiers in Psychiatry, 10. https://doi.org/10.3389/fpsyt.2019.00407
NHS (2019). Bone density scan (DEXA scan): How it is performed. https://www.nhs.uk/conditions/dexa-scan/what-happens/
Viner, R., Carmichael, P., Di Ceglie, D., Butler, G., Brain, C., Holt, V., Khadr, S., & Skagerberg, E. (2010). An evaluation of early pubertal suppression in a carefully selected group of adolescents with gender identity disorder (v1.0, 6 October 2010). Obtained from the Health Research Authority under Freedom of Information.