Year: 2022 | Volume: 6 | Issue: 2 | Page: 33-38
Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors
Mark R Elkins1, Rafael Zambelli Pinto2, Arianne Verhagen1, Monika Grygorowicz3, Anne Söderlund4, Matthieu Guemann5, Antonia Gómez-Conesa6, Sarah Blanton7, Jean-Michel Brismée8, Shabnam Agrawal9, Alan Jette10, Sven Karstens11, Michele Harms12, Geert Verheyden13, Umer Sheikh14
1 International Society of Physiotherapy Journal Editors; Journal of Physiotherapy
2 International Society of Physiotherapy Journal Editors; Brazilian Journal of Physical Therapy/Revista Brasileira de Fisioterapia
3 BMC Sports Science, Medicine and Rehabilitation
4 European Journal of Physiotherapy
5 European Rehabilitation Journal
7 Journal of Humanities in Rehabilitation
8 Journal of Manual & Manipulative Therapy
9 Journal of Society of Indian Physiotherapists
10 Physical Therapy
13 Physiotherapy Research International
14 The Journal of Physiotherapy & Sports Medicine
Date of Submission: 21-Dec-2021
Date of Acceptance: 11-Jan-2022
Date of Web Publication: 02-Jan-2023
Mark R Elkins
Centre for Education and Workforce Development, Sydney Local Health District, Sydney, Australia
Source of Support: None, Conflict of Interest: None
How to cite this article:
Elkins MR, Pinto RZ, Verhagen A, Grygorowicz M, Söderlund A, Guemann M, Gómez-Conesa A, Blanton S, Brismée JM, Agrawal S, Jette A, Karstens S, Harms M, Verheyden G, Sheikh U. Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. J Soc Indian Physiother 2022;6:33-8
How to cite this URL:
Elkins MR, Pinto RZ, Verhagen A, Grygorowicz M, Söderlund A, Guemann M, Gómez-Conesa A, Blanton S, Brismée JM, Agrawal S, Jette A, Karstens S, Harms M, Verheyden G, Sheikh U. Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. J Soc Indian Physiother [serial online] 2022 [cited 2023 Jun 10];6:33-8. Available from: jsip-physio.org/text.asp?2022/6/2/33/366644
Null hypothesis statistical tests are often conducted in health care research, including in the physiotherapy field. Despite their widespread use, null hypothesis statistical tests have important limitations. This co-published editorial explains statistical inference using null hypothesis statistical tests and the problems inherent to this approach; examines an alternative approach for statistical inference (known as estimation); and encourages readers of physiotherapy research to become familiar with estimation methods and how the results are interpreted. It also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.
What Is Statistical Inference?
Statistical inference is the process of making inferences about populations using data from samples. Imagine, for example, that some researchers want to investigate something (perhaps the effect of an intervention, the prevalence of a comorbidity, or the usefulness of a prognostic model) in people after stroke. It is not feasible for the researchers to test all stroke survivors in the world; instead, the researchers can only recruit a sample of stroke survivors and conduct their study with that sample. Typically, such a sample makes up a minuscule fraction of the population, so the result from the sample is likely to differ from the result in the population. Researchers must therefore use their statistical analysis of the data from the sample to infer what the result is likely to be in the population.
What Are Null Hypothesis Statistical Tests?
Traditionally, statistical inference has relied on null hypothesis statistical tests. Such tests involve positing a null hypothesis (e.g., that there is no effect of an intervention on an outcome, that there is no effect of exposure on risk, or that there is no relationship between two variables). Such tests also involve calculating a p-value, which quantifies the probability (if the study were to be repeated many times) of observing an effect or relationship at least as large as the one that was observed in the study sample, if the null hypothesis is true. Note that the null hypothesis refers to the population, not the study sample.
Because the reasoning behind these tests is linked to imagined repetition of the study, they are said to be conducted within a “frequentist” framework. In this framework, the focus is on how much a statistical result (e.g., a mean difference, a proportion, or a correlation) would vary among the repeats of the study. If the data obtained from the study sample indicate that the result is likely to be similar among the imagined repeats of the study, this is interpreted as an indication that the result is in some way more credible.
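The idea of "imagined repeats" can be made concrete with a small simulation. The following is only an illustrative sketch, not a recommended analysis: it assumes normally distributed outcomes with a known standard deviation, and the function name `simulated_p_value` and all numbers are hypothetical. It repeats an imaginary two-group study many times under the null hypothesis and counts how often the between-group difference is at least as large as the one observed.

```python
import random
from statistics import mean

def simulated_p_value(observed_diff, sd, n_per_group, repeats=5000, seed=1):
    """Estimate a two-sided p-value by simulating repeats of a study
    in which the null hypothesis (no difference) is true."""
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(repeats):
        # Under the null hypothesis both groups come from the same population.
        treated = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        diff = mean(treated) - mean(control)
        if abs(diff) >= abs(observed_diff):
            at_least_as_large += 1
    return at_least_as_large / repeats

# Hypothetical study: observed difference of 2 points, SD 5, 25 people per group.
p = simulated_p_value(observed_diff=2.0, sd=5.0, n_per_group=25)
print(f"Simulated p-value: {p:.3f}")
```

The returned proportion approximates the p-value as defined above: the probability, across imagined repeats in which the null hypothesis is true, of a difference at least as large as the one observed.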
One type of null hypothesis statistical test is significance testing, developed by Fisher. In significance testing, if a result at least as large as the result observed in the study would be unlikely to occur in the imagined repeats of the study if the null hypothesis is true (as reflected by p < 0.05), then this is interpreted as evidence that the null hypothesis is false. Another type of null hypothesis statistical test is hypothesis testing, developed by Neyman and Pearson. Here, two hypotheses are posited: the null hypothesis (i.e., that there is no difference in the population) and the alternative hypothesis (i.e., that there is a difference in the population). The p-value tells the researchers which hypothesis to accept: if p ≥ 0.05, retain the null hypothesis; if p < 0.05, reject the null hypothesis and accept the alternative. Although these two approaches are mathematically similar, they differ substantially in how they should be interpreted and reported. Despite this, many researchers do not recognize the distinction and analyze their data using an unreasoned hybrid of the two methods.
Problems with Null Hypothesis Statistical Tests
Regardless of whether significance testing or hypothesis testing (or a hybrid) is considered, null hypothesis statistical tests have numerous problems. Five crucial problems are explained in [Table 1]. Each of these problems is fundamental enough to make null hypothesis statistical tests unfit for use in research. This may surprise many readers, given how widely such tests are used in published research.
Table 1: Problems with null hypothesis statistical tests (modified from Herbert)
It is also surprising that the widespread use of null hypothesis statistical tests has persisted for so long, given that the problems in [Table 1] have been repeatedly raised in health care journals for decades, including physiotherapy journals. There has been some movement away from null hypothesis statistical tests, but the use of alternative methods of statistical inference has increased slowly over decades, as seen in analyses of health care research, including physiotherapy trials. This is despite the availability of alternative methods of statistical inference and promotion of those methods in statistical, medical, and physiotherapy journals.
Estimation as an Alternative Approach for Statistical Inference
Although there are multiple alternative approaches to statistical inference, the simplest is estimation. Estimation is based on a frequentist framework but, unlike null hypothesis statistical tests, its aim is to estimate parameters of populations using data collected from the study sample. The uncertainty or imprecision of those estimates is communicated with confidence intervals.
A confidence interval can be calculated from the observed study data, the size of the sample, the variability in the sample, and the confidence level. The confidence level is chosen by the researcher, conventionally at 95%. This means that if hypothetically the study were to be repeated many times, 95% of the confidence intervals would contain the true population parameter. Roughly speaking, a 95% confidence interval is the range of values within which we can be 95% certain that the true parameter in the population actually lies.
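The repeated-sampling meaning of the confidence level can be checked with a short simulation. This is a minimal sketch under stated assumptions: a normally distributed population with hypothetical parameters, and a large-sample (normal-approximation) interval for the mean rather than the t-based interval that would normally be preferred for small samples. The names `mean_ci` and `coverage` are illustrative only.

```python
import random
from statistics import NormalDist, mean, stdev

def mean_ci(sample, level=0.95):
    """Large-sample (normal-approximation) confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% level
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5     # standard error of the mean
    return m - z * se, m + z * se

def coverage(true_mean=50.0, true_sd=10.0, n=100, repeats=2000, seed=1):
    """Proportion of repeated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = mean_ci(sample)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / repeats

print(f"Proportion of intervals containing the true mean: {coverage():.3f}")
```

With these settings the proportion should be close to 0.95, which is what the confidence level refers to: the long-run behaviour of the interval procedure over many imagined repeats of the study.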
Confidence intervals are often discussed in relation to treatment effects in clinical trials, but it is possible to put a confidence interval around any statistic, regardless of its use, including mean difference, risk, odds, relative risk, odds ratio, hazard ratio, correlation, proportion, absolute risk reduction, relative risk reduction, number needed to treat, sensitivity, specificity, likelihood ratios, diagnostic odds ratios, and difference in medians.
Interpretation of the Results of the Estimation Approach
To use the estimation approach well, it is not sufficient simply to report confidence intervals. Researchers must also interpret the relevance of the information portrayed by the confidence intervals and consider the implications arising from that information. The path of migration of researchers from statistical significance and p-values to estimation methods is littered with examples of researchers calculating confidence intervals at the behest of editors, but then ignoring the confidence intervals and instead interpreting their study’s result dichotomously as statistically significant or nonsignificant depending on the p-value. Interpretation is crucial.
Some authors have proposed a ban on terms related to interpretation of null hypothesis statistical testing. One prominent example is an editorial published in The American Statistician, which introduced a special issue on statistical inference. It states:
The American Statistical Association Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
This may seem radical and unworkable to researchers with a long history of null hypothesis statistical testing, but many concerns can be allayed. First, such a ban would not discard decades of existing research reported with null hypothesis statistical tests; the data generated in such studies maintain their validity and will often be reported in sufficient detail for confidence intervals to be calculated. Second, reframing the study’s aim involves a simple shift in focus from whether the result is statistically significant to gauging how large and how precise the study’s estimate of the population parameter is. (For example, instead of aiming to determine whether a treatment has an effect in stroke survivors, the aim is to estimate the size of the average effect. Instead of aiming to determine whether a prognostic model is predictive, the aim is to estimate how well the model predicts.) Third, the statistical imprecision of those estimates can be calculated readily. Existing statistical software packages already calculate confidence intervals, including free software such as R. Lastly, learning to interpret confidence intervals is relatively straightforward.
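As a rough illustration of the first and third points, a confidence interval for a between-group difference can often be recovered from the group means, standard deviations, and sample sizes that older reports typically provide. The sketch below is an assumption-laden example rather than a prescribed method: it uses a large-sample normal approximation with an unpooled standard error, the function name `mean_difference_ci` is hypothetical, and the input numbers are invented.

```python
from math import sqrt
from statistics import NormalDist

def mean_difference_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Estimate and large-sample confidence interval for mean_a - mean_b,
    computed from reported summary statistics only."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean_a - mean_b
    # Standard error of the difference between two independent means.
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return diff, diff - z * se, diff + z * se

# Invented summary data: treatment 10 (SD 2, n 100) vs control 8 (SD 2, n 100).
estimate, lower, upper = mean_difference_ci(10, 2, 100, 8, 2, 100)
print(f"Mean difference {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

For small samples a t-based interval would be more appropriate; statistical packages such as R provide this directly.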
Many researchers and readers initially come to understand how to interpret confidence intervals around estimates of the effect of a treatment. In a study comparing a treatment versus control with a continuous outcome measure, the study’s best estimate of the effect of the treatment is usually the average between-group difference in outcome. To account for the fact that estimates based on a sample may differ by chance from the true value in the population, the confidence interval provides an indication of the range of values above and below that estimate where the true average effect in the relevant clinical population may lie. The estimate and its confidence interval should be compared against the “smallest worthwhile effect” of the intervention on that outcome in that population. The smallest worthwhile effect is the smallest benefit from an intervention that patients feel outweighs its costs, risk, and other inconveniences. If the estimate and the ends of its confidence interval are all more favorable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population. If the effect and its confidence interval are less favorable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered trivial by patients in that clinical population. Results with confidence intervals that span the smallest worthwhile effect indicate a benefit with uncertainty about whether it is worthwhile. Results with a narrow confidence interval that spans no effect indicate that the treatment’s effects are negligible, whereas results with a wide confidence interval that spans no effect indicate that the treatment’s effects are uncertain. For readers unfamiliar with this sort of interpretation, some clear and nontechnical papers with clinical physiotherapy examples are available.
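The interpretation rules in the paragraph above can be written down as a small decision function. This is a simplified sketch, not a validated tool. It assumes that positive values favour the treatment and that the smallest worthwhile effect (`swe`) is positive. The text does not define "narrow" and "wide", so the sketch treats an interval spanning no effect as narrow only when it lies entirely between `-swe` and `+swe`; that cut-off, the function name, and the label wording are all assumptions made for illustration.

```python
def interpret_effect(lower, upper, swe):
    """Classify a treatment-effect confidence interval against the
    smallest worthwhile effect (positive values favour treatment)."""
    if lower > upper or swe <= 0:
        raise ValueError("need lower <= upper and swe > 0")
    if lower > swe:
        # Estimate and both ends are more favourable than the threshold.
        return "worthwhile"
    if lower <= 0 <= upper:
        # Interval spans no effect: narrow vs wide (assumed definition).
        if -swe < lower and upper < swe:
            return "negligible"
        return "uncertain"
    if lower > 0 and upper < swe:
        return "trivial"
    if lower > 0:
        # Interval spans the smallest worthwhile effect.
        return "benefit, uncertain whether worthwhile"
    # Whole interval below zero: a case the text above does not discuss.
    return "favours control"

for ci in [(6, 12), (1, 4), (2, 9), (-1, 1), (-8, 12)]:
    print(ci, "->", interpret_effect(*ci, swe=5))
```

The point estimate is not passed in because it always lies inside its own interval; the classification depends only on where the interval sits relative to zero and to the smallest worthwhile effect.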
Interpretation of estimates of treatment effects and their confidence intervals relies on knowing the smallest worthwhile effect (sometimes called the minimum clinically important difference). For some research questions, such a threshold has not been established or has been established with inadequate methods. In such cases, researchers should consider conducting a study to establish the threshold or at least to nominate the threshold prospectively.
Readers who understand the interpretation of confidence intervals around treatment effect estimates will find interpretation of confidence intervals around many other types of estimates quite familiar. Roughly speaking, the confidence interval indicates the range of values around the study’s main estimate where the true population result probably lies. To interpret a confidence interval, we simply describe the practical implications of all values inside the confidence interval. For example, in a diagnostic test accuracy study, the positive likelihood ratio tells us how much more likely a positive test finding is in people who have the condition than it is in people who do not have the condition. A diagnostic test with a positive likelihood ratio greater than about 3 is typically useful and greater than about 10 is very useful. Therefore, if a diagnostic test had a positive likelihood ratio of 4.8 with a 95% confidence interval of 4.1–5.6, we could anticipate that the true positive likelihood ratio in the population is both useful and similar to the study’s main estimate. Conversely, if a study estimated the prevalence of depression in people after anterior cruciate ligament rupture at 40% with a confidence interval from 5% to 75%, we may conclude that the main estimate is suggestive of a high prevalence but too imprecise to conclude that confidently.
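For the diagnostic example, a positive likelihood ratio and an approximate confidence interval can be computed from a 2 × 2 table of test results. The sketch below uses the commonly taught log method (a normal approximation on the log scale); the function name `positive_lr_ci` and the counts are hypothetical and are not taken from any study cited here.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def positive_lr_ci(tp, fn, fp, tn, level=0.95):
    """Positive likelihood ratio with a log-method confidence interval.

    tp, fn: people with the condition who test positive / negative
    fp, tn: people without the condition who test positive / negative
    """
    if tp <= 0 or fp <= 0:
        raise ValueError("log method needs tp > 0 and fp > 0")
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)       # 1 - specificity
    lr = sensitivity / false_positive_rate
    # Approximate standard error of ln(LR+).
    se_log = sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return lr, exp(log(lr) - z * se_log), exp(log(lr) + z * se_log)

# Invented counts: 80 true positives, 20 false negatives,
# 10 false positives, 90 true negatives.
lr, lower, upper = positive_lr_ci(tp=80, fn=20, fp=10, tn=90)
print(f"LR+ {lr:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Reading the result follows the same logic as the text: every value inside the interval is considered against the rough thresholds of about 3 (useful) and about 10 (very useful).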
ISPJE Member Journals’ Policy Regarding the Estimation Approach
The executive of the ISPJE strongly recommends that member journals seek to foster use of the estimation approach in the papers they publish. In line with that recommendation, the editors who have coauthored this editorial advise researchers that their journals will expect manuscripts to use estimation methods instead of null hypothesis statistical tests. We acknowledge that it will take time to make this transition, so editors will give authors the opportunity to revise manuscripts to incorporate estimation methods if the manuscript seems otherwise potentially viable for publication. Editors may assist authors with those revisions where required.
Readers who require more detailed information to address questions about the topics raised in this editorial are referred to the resources in [Table 2], such as the Research Note on the problems of significance and hypothesis testing and an excellent textbook that addresses confidence intervals and the application of estimation methods in various research study designs with clinical physiotherapy examples. Both are readily accessible to researchers and clinicians without any prior understanding of the issues.
Table 2: Resources that provide additional information to respond to questions about the transition from null hypothesis statistical tests to estimation methods
Quantitative research studies in physiotherapy that are analyzed and interpreted using confidence intervals will provide more valid and relevant information than those analyzed and interpreted using null hypothesis statistical tests. The estimation approach is therefore of great potential value to the researchers, clinicians, and consumers who rely upon physiotherapy research, and that is why ISPJE is recommending that member journals foster the use of estimation in the articles they publish.
We thank Professor Rob Herbert from Neuroscience Research Australia (NeuRA) for his presentation to the ISPJE on this topic and for comments on a draft of this editorial.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Nickerson RS. Null hypothesis significance testing: A review of an old and continuing controversy. Psychol Methods 2000;5:241-301.
Freire APCF, Elkins MR, Ramos EMC, Moseley AM. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: Analysis of a representative sample of 200 physical therapy trials. Braz J Phys Ther 2019;23:302-10.
Altman DG, Bland JM. Uncertainty and sampling error. BMJ 2014;349:g7064.
Barnett V. Comparative Statistical Inference. London, New York: Wiley; 1973.
Royall RM. Statistical Evidence: A Likelihood Paradigm. 1st ed. London, New York: Chapman & Hall; 1997.
Gigerenzer G. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge, England: Cambridge University Press; 1989.
Goodman SN, Royall R. Evidence and scientific research. Am J Public Health 1988;78:1568-74.
Ziliak S, McCloskey D. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, USA: University of Michigan Press; 2008.
Hubbard R. Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science. Thousand Oaks, USA: Sage; 2016.
Herbert RD. How to estimate treatment effects from reports of clinical trials. I: Continuous outcomes. Aust J Physiother 2000;46:229-35.
Maher CG, Sherrington C, Elkins M, Herbert RD, Moseley AM. Challenges for evidence-based physical therapy: Accessing and interpreting high-quality evidence on therapy. Phys Ther 2004;84:644-54.
Yi D, Ma D, Li G, Zhou L, Xiao Q, Zhang Y, et al. Statistical use in clinical studies: Is there evidence of a methodological shift? PLoS One 2015;10:e0140159.
Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05”. Am Stat 2019;73(Suppl. 1):1-19. https://doi.org/10.1080/00031305.2019.1583913
Herbert RD. How to estimate treatment effects from reports of clinical trials. II: Dichotomous outcomes. Aust J Physiother 2000;46:309-13.
Sim J, Reid N. Statistical inference by confidence intervals: Issues of interpretation and utilization. Phys Ther 1999;79:186-95.
Rothman KJ. Disengaging from statistical significance. Eur J Epidemiol 2016;31:443-4.
Cumming G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Multivariate Applications Series. New York: Routledge; 2012.
Kamper SJ. Showing confidence (intervals). Braz J Phys Ther 2019;23:277-8.
Kamper SJ. Confidence intervals: Linking evidence to practice. J Orthop Sports Phys Ther 2019;49:763-4.
Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychol Sci 2004;15:119-26.
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
RStudio Team (2019). RStudio: Integrated Development for R. RStudio, Inc., Boston, USA. http://www.rstudio.com/
Ferreira M. Research note: The smallest worthwhile effect of a health intervention. J Physiother 2018;64:272-4.
Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305-7.
Herbert R. Research Note: Significance testing and hypothesis testing: Meaningless, misleading and mostly unnecessary. J Physiother 2019;65:178-81. https://doi.org/10.1016/j.jphys.2019.05.001
Herbert RD, Jamtvedt G, Mead J, Hagen KB. Practical Evidence-Based Physiotherapy. 2nd ed. Oxford: Elsevier; 2011.
Boos DD, Stefanski LA. P-value precision and reproducibility. Am Stat 2011;65:213-21.
Wasserstein R, Lazar N. The ASA’s statement on p-values: Context, process, and purpose. Am Stat 2016;70:129-33. https://doi.org/10.1080/00031305.2016.1154108
International Committee of Medical Journal Editors. ICMJE recommendations for the conduct, reporting, editing and publication of scholarly work in medical journals. 2013. http://www.icmje.org/icmje-recommendations.pdf
McGough JJ, Faraone SV. Estimating the size of treatment effects: Moving beyond p values. Psychiatry (Edgmont) 2009;6:21-9.
Hayat MJ, Chandrasekhar R, Dietrich MS, Gifford RH, Golub JS, Holder JT, et al. Moving otology beyond p < 0.05. Otol Neurotol 2020;41:578-9.
Hayat MJ, Staggs VS, Schwartz TA, Higgins M, Azuero A, Budhathoki C, et al. Moving nursing beyond p < 0.05. Res Nurs Health 2019;42:244-5.
Cumming G, Fidler F, Kalinowski P, Lai J. The statistical recommendations of the American Psychological Association Publication Manual: Effect sizes, confidence intervals, and meta-analysis. Aust J Psychol 2012;64:138-46.
Calin-Jageman RJ, Cumming G. Estimation for better inference in neuroscience. eNeuro 2019;6:eNeuro.0205-19.2019.
Schreiber JB. New paradigms for considering statistical significance: A way forward for health services research journals, their authors, and their readership. Res Social Adm Pharm 2020;16:591-4.
Erickson RA, Rattner BA. Moving beyond p < 0.05 in ecotoxicology: A guide for practitioners. Environ Toxicol Chem 2020;39:1657-69.
Smith RJ. P > .05: The incorrect interpretation of “not significant” results is a significant problem. Am J Phys Anthropol 2020;172:521-7.
Percie du Sert N, Ahluwalia A, Alam S, Avey MT, Baker M, Browne WJ, et al. Reporting animal research: Explanation and elaboration for the ARRIVE guidelines 2.0. PLoS Biol 2020;18:e3000411.