

SPECIAL ARTICLE 

Year : 2020 | Volume : 3 | Issue : 2 | Page : 76-84

Study design, errors and sample size calculation in medical research
Sabyasachi Das^{1}, Pradeep A Dongare^{2}, Umesh Goneppanavar^{3}, Rakesh Garg^{4}, S Bala Bhaskar^{5}
^{1} Department of Anaesthesiology, Medical College, Kolkata, West Bengal, India ^{2} Department of Anaesthesiology, ESI PGIMSR, Bengaluru, Karnataka, India ^{3} Department of Anaesthesia, Dharwad Institute of Mental Health and Neurosciences, Dharwad, Karnataka, India ^{4} Department of OncoAnaesthesia and Palliative Medicine, Dr BRAIRCH, AIIMS, New Delhi, India ^{5} Department of Anaesthesia, Vijayanagara Institute of Medical Sciences, Bellary, Karnataka, India
Date of Submission: 27-Jul-2020
Date of Acceptance: 04-Aug-2020
Date of Web Publication: 30-Aug-2020
Correspondence Address: Dr. Umesh Goneppanavar Dharwad Institute of Mental Health and Neurosciences, Dharwad, Karnataka India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/ARWY.ARWY_29_20
The choice of an appropriate study design is one of the crucial steps in the research process after framing a research question. A single research question may fit into different study designs. Each design has its own merits and drawbacks; diligence in implementing the methodology and data collection reflects good study design. Sample size justification and power analysis are foundations of a study design. They should ideally be settled when framing a research question and creating the study design. An adequate sample size minimises random error or chance occurrence. A 'just large enough' sample helps the researcher to estimate expected cost, time and feasibility. The sample size is a tug-of-war between reality and scientific effectiveness and is highly influenced by the study design. The null hypothesis (H_{0}) is the assumption that there is no difference between the treatment groups, whereas the assumption that there is a difference is called the alternative hypothesis (H_{a}). Type I error (α) is the probability of finding a difference when none exists (false-positive conclusion), whereas type II error (β) is the probability of a false-negative result. If the calculated P value is smaller than α, the researcher rejects the null hypothesis (H_{0}) and accepts the alternative hypothesis (H_{a}). Several validated software packages are available for sample size calculation. Sample size tends to be smaller for means than percentages. As the sample size increases, the P value tends to become small. Finally, a statistically significant result might not always be clinically relevant.
Keywords: Medical research, sample size, study design
How to cite this article: Das S, Dongare PA, Goneppanavar U, Garg R, Bhaskar S B. Study design, errors and sample size calculation in medical research. Airway 2020;3:76-84
How to cite this URL: Das S, Dongare PA, Goneppanavar U, Garg R, Bhaskar S B. Study design, errors and sample size calculation in medical research. Airway [serial online] 2020 [cited 2022 Nov 29];3:76-84. Available from: https://www.arwy.org/text.asp?2020/3/2/76/293966
Introduction   
The selection of an appropriate study design plays a pivotal role in validating the relationship between the exposure and the disease. It improves the validity of the study and helps in understanding the results. Research methods are primarily classified as qualitative, quantitative and mixed. Using both qualitative and quantitative approaches makes the results more trustworthy. The choice of method relies on factors such as the research question and primary outcome, experience of the researcher, contribution from the funding agency, time frame, current level of understanding of the topic, whether the disease is rare or common and many other similar factors.^{[1]} Efficacy, 'can it work'; effectiveness, 'does it work' and efficiency, 'is it worth it' are the three perspectives put forward by Archie Cochrane when framing a research question. These help us choose the right study design for a research question. No one study design is universally applicable or superior, but its choice in the right context is essential.^{[2]}
Study Designs and Their Classification   
The selection of study design primarily depends on the research question or hypothesis. Information regarding the study setting, objectives, resources, time frame and quality of the data also helps in selecting the appropriate study design. An exhaustive list of study designs exists in the literature and many of them are not simple to comprehend [Figure 1]. All study designs are broadly grouped into observational and experimental studies, depending on whether or not the investigator allocates an intervention.^{[3],[4],[5]}
Research Methods   
The methods employed to obtain data for any particular study design may be qualitative, quantitative or mixed.
For example, if 300 students at a university are given a survey questionnaire and asked questions such as 'On a scale from 1 to 5, how satisfied are you with your seminar-based teaching module?', one can perform statistical analysis on the data and draw conclusions such as 'On average, students rated their seminar-based teaching module at 4.4'. This is an example of a quantitative approach. On the other hand, one can conduct in-depth interviews with 15 students and ask them open-ended questions such as 'How satisfied are you with your studies?', 'What is the most positive aspect of your study programme?' and 'What can be done to improve the study programme?' Based on the answers, follow-up questions can be asked to clarify things. All interviews can be transcribed using transcription software to find commonalities and patterns. This is how a qualitative study works.^{[3]} In the mixed method, both the quantitative and the qualitative approaches are employed to get a wider perspective and more reliable evidence to support the inference.^{[4],[5],[6],[7]}
Sample Size   
The outcome of any study becomes reliable if it is conducted in an appropriate number of subjects. These subjects also need to be chosen from the population scientifically to represent the population as such. The important concepts related to sample size are elucidated in the subsequent paragraphs. Important items relevant for sample size calculation are given in [Table 1].
Statistical significance
Most researchers concentrate on the P value, which denotes statistical significance. However, the P value does not yield any estimate of the actual treatment effect. A clinically useless small difference may reach statistical significance, and a clinically meaningful effect may end up as statistically not significant. An isolated P value does not provide information about the 'magnitude' of the effect of interest or the 'precision' of its estimate.^{[8]}
Effect size
Effect size is a standardised estimate of the observed difference or association and guides us to interpret 'how big is big?' A confidence interval (CI) for the observed difference should be reported with the effect size. In other words, the effect size expresses the quantitative relationship between variables and is commonly computed from differences in proportions or means. Its first objective is to determine the strength of the association between variables; the second is to determine the differences or ratios between groups. One of the widely accepted estimates of effect size is Cohen's d for a t-test. For a large effect size, the value of Cohen's d is >0.8.^{[9]}
For example, Yoon et al. compared the McGrath videolaryngoscope with the Optiscope video stylet for endotracheal intubation in patients with cervical spine immobilisation. The results of their study revealed higher success rate of endotracheal intubation in McGrath group (92.3% vs. 81.0%, P = 0.002) with risk difference of 0.11 (95% CI: 0.05–0.18). Similarly, intubation time was also significantly shorter with the McGrath group than the Optiscope group with a mean difference value of 13.5 s (35.7 ± 27.8 s vs. 49.2 ± 43.8 s) along with 95% CI as 5.9–21.1.
The study reports mean difference and risk difference as two expressions of the effect size.^{[10]} When we compare 'intubation time' between groups on a normally distributed and continuous outcome, the difference in mean can be relied upon as a measure of the effect size. In case of a binary outcome like successful intubation, the effect size can be computed by measuring the absolute difference in proportion and usually reported as risk difference.^{[8]}
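The two effect sizes quoted from Yoon et al. can be recomputed from the reported summary figures, as in the short Python sketch below. The Cohen's d shown is not reported in the study; it is derived here only for illustration and assumes equal group sizes (so that the pooled SD is the root mean square of the two SDs). The function names are hypothetical.

```python
from math import sqrt


def risk_difference(p_group_1: float, p_group_2: float) -> float:
    """Absolute difference in proportions (binary outcome)."""
    return p_group_1 - p_group_2


def cohens_d(mean_1: float, sd_1: float, mean_2: float, sd_2: float) -> float:
    """Standardised mean difference.

    Assumption: both groups have the same size, so the pooled SD is
    sqrt((sd_1^2 + sd_2^2) / 2).
    """
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Success rate: McGrath 92.3% vs. Optiscope 81.0%
rd = risk_difference(0.923, 0.810)

# Intubation time: Optiscope 49.2 +/- 43.8 s vs. McGrath 35.7 +/- 27.8 s
mean_diff = 49.2 - 35.7
d = cohens_d(49.2, 43.8, 35.7, 27.8)

print(f"risk difference = {rd:.3f}")
print(f"mean difference = {mean_diff:.1f} s")
print(f"Cohen's d (equal-n assumption) = {d:.2f}")
```

Under this assumption, the 13.5 s difference corresponds to a Cohen's d well below 0.8, which illustrates that a statistically significant difference is not automatically a large standardised effect.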
Errors
Errors can be broadly divided into type I error (α) and type II error (β) based on the hypothesis. In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis (also known as a 'false-positive' finding or conclusion), while a type II error is the non-rejection of a false null hypothesis (also known as a 'false-negative' finding or conclusion).
Type I error (α) and level of significance
This error occurs if the null hypothesis is rejected when in reality it is true.^{[11]} It carries the risk that the researcher observes a difference between two or more groups where no real difference exists. For two-tailed tests, the significance level is classically fixed at 5%, which permits a 5% chance of generating false-positive inferences (type I error). It is pertinent to recall that the P value is not the same as the α error or the level of significance. The α value is the prespecified probability of committing a type I error in a study, whereas the P value is obtained by applying a statistical test to the study data. If the observed or calculated P value is smaller than α, the researcher rejects the null hypothesis (H_{0}) and accepts the alternative hypothesis (H_{a}).^{[9]}
Type II error (β)
This occurs when the study detects no difference despite a real difference, association or correlation among groups in a particular population. Simply stated, the type II error represents false-negative results. Conventionally, the β level is set at 0.10 or 0.20 (false negatives of 10% or 20%).
Power of the Study and the Need for Sample Size Estimation   
Power and sample size calculations help researchers to determine the number of subjects needed to answer the research question (or test the null hypothesis). The statistical power of a study is its ability to detect a difference, if a difference really exists. It depends on two things: the sample size (number of subjects) and the effect size (e.g., the difference in outcomes between two groups). The converse of a type II error is correctly rejecting the null hypothesis (H_{0}) when H_{a} is true, thereby revealing a true difference, effect or association. With 80% power, the probability of missing a true difference or association is 20%. As the probabilities of these two events sum to 1, the power of the study is conventionally expressed as 1 − β. Classically, investigators wish to have at least 80% power (β of 0.2 or 20%) or 90% power (β of 0.1 or 10%). A power of 90% is preferable, as it halves the risk of a false-negative result from 20% to 10%. Smaller values of α and β make the study results more valid but at the cost of a greatly increased sample size.^{[11]}
Factors affecting a power calculation
 Precision and variance of measurements within any sample
 Magnitude of a clinically significant difference
 How certain we want to be to avoid type I error
 The type of statistical test we are performing.
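How the factors listed above drive power can be sketched with a normal approximation for comparing two means. This is an illustrative sketch, not a substitute for validated software: the function name `power_two_means` is hypothetical, equal group sizes and a known common SD are assumed, and the negligible contribution of the opposite tail in a two-sided test is ignored.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(delta: float, sd: float, n_per_arm: int,
                    alpha: float = 0.05, two_sided: bool = True) -> float:
    """Approximate power to detect a true mean difference `delta`.

    Assumptions: two equal-sized groups, common SD `sd`, normal
    approximation (z-test) rather than the exact t-distribution.
    """
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)           # critical value for the chosen alpha
    se = sd * sqrt(2 / n_per_arm)           # standard error of the difference
    return z.cdf(abs(delta) / se - z_alpha)


# Illustrative inputs: 5 s difference, SD of 8 s
for n in (20, 41, 60):
    print(n, round(power_two_means(delta=5, sd=8, n_per_arm=n), 3))

# A stricter alpha or a noisier measurement lowers power at the same n
print(round(power_two_means(5, 8, 41, alpha=0.01), 3))
print(round(power_two_means(5, 12, 41), 3))
```

Each argument maps onto one item of the list: `sd` is the variance of the measurements, `delta` the clinically significant difference, `alpha` the tolerated type I error, and the formula itself reflects the test being performed.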
The primary elements needed for sample size calculation include the treatment effect, variability of the primary outcome (e.g., standard deviation [SD]), level of significance (α) or type I error, multiple comparisons or testing, and power of the study. The treatment effect could be a difference in means or proportions, a relative risk, an odds ratio or a correlation that an investigator wishes to detect. A study is less meaningful if it just claims H_{0} is rejected and H_{a} is accepted. An ideal study must quantify the difference between the groups and state the expected effect size (minimal effective difference) used when calculating the sample size. The researcher should also state the predicted variability (SD) of the primary outcome. Previous relevant literature can be relied upon to estimate the value of SD in case of a continuous primary variable. It is not essential to find a similar study which administered the same intervention with the same primary outcome measure. Studies having comparable patient enrolment, with a control group and similar outcome, are good enough to guess a rational sample size. For binary/dichotomous outcomes, researchers need to specify the expected proportion of participants with the outcome in each of the two or more groups analysed. First, one should fix the proportion in the control group and then hypothesise the relative reduction expected in the intervention group. In general, it is accepted to keep the type I error for a single primary outcome at 5% for sample size calculation. The situation would be different if there is more than one primary outcome.^{[11]}
The calculation of sample size depends on the effect size or minimal clinically relevant effect of any treatment or exposure. Expected variation of this estimate can influence sample size significantly. Pilot studies (when feasible) and extensive review of literature may lead us to an 'educated guess'.^{[11]} Treatment effects are differences (in means or proportions, relative risk, odds ratio or correlation) that one wishes to detect in the population of interest. An investigator's job is not over after just proving that H_{0} is false and H_{a} is true. The meaningful performance of a study is reflected in the quantification of the difference between the groups.
Superiority, Equivalence and Non-Inferiority Trials
The calculation of sample size also varies depending on whether we are trying to establish superiority, non-inferiority or equivalence of an intervention in comparison with the control, placebo or established standard. A term related to the superiority trial is the margin of clinical significance (δ), the threshold beyond which superiority of the intervention is asserted. If δ is set higher, it becomes harder to reject the null hypothesis and accept the alternative hypothesis. In non-inferiority trials, efforts are made to demonstrate that the new treatment is not unacceptably inferior to the standard treatment. This is accomplished by stating a margin of non-inferiority for the effects of the treatment. The non-inferiority margin specifies the amount of reduction in efficacy that may be accepted. As long as the new treatment does not perform worse than the standard treatment by more than this specified limit, the new treatment is treated as non-inferior [Figure 2].^{[12],[13]}
Figure 2: Differences between superiority, non-inferiority and equivalence trials
How to Calculate the Sample Size?   
All studies except descriptive studies need a sample size calculation. Sample size calculation is a trade-off between failing to find an actual difference and identifying a difference which may have statistical significance but lacks clinical implication. For simple studies, we can rely upon standard formulae. Specialised statistical software programmes are useful for complex studies. All validated software show the formula when calculating [Table 2]. There are two common ways to calculate the sample size. One is estimating a parameter with an assumed CI. The other is testing (accepting or refuting) the null hypothesis in comparative studies.^{[14],[15]} It is necessary to check whether the result is the total sample size or the sample size in each group. Another good practice is to try the software or formula against an answer already known to us. Inclusion of 10% extra participants to accommodate loss to follow-up or non-responders is commonly done. A sample size matrix can be made by varying the inputs to our calculations.
Table 2: Examples of some commonly used and validated software for sample size calculation
Examples of Sample Size Calculation (Three Clinical Scenarios)   
Example 1: Sample size estimation for a cross-sectional study
A landmark study of this decade, the Fourth National Audit Project, provided significant evidence. The study revealed that more than 65% of attempts at cricothyroidotomy by anaesthesiologists failed to secure the airway of patients in an emergency.^{[16]} As there are limited data on this issue in our country, one may wish to conduct a questionnaire-based cross-sectional study among senior residents working in postgraduate departments of anaesthesiology in different medical colleges of this country over a period of 3 months. Review of literature and a pilot study reveal that the present level of information and awareness among the participants lies somewhere between 60% and 70%.
The formula for calculating sample size for a cross-sectional study is:
N = (Z_{α}^{2} × p × q)/d^{2}
where N is the required sample size, 'p' is the prevalence (in %), 'q' is 100 − p and 'd' is the precision of the estimate (in the same percentage units as 'p').
The value of Z_{α} is a statistical constant corresponding to the level of significance. The value is 1.96 for 95% CI (normally distributed data).
Here, two values need to be specified: the prevalence 'p' and the precision 'd'. Relative precision is used if an absolute precision is not available. The relative precision has a ceiling and should not exceed 20% of the prevalence ('p'). In the example mentioned above, if 'p' is taken as 60% (the literature search indicated that 'p' ranges between 60% and 70%), then with a relative precision of 20% the expected 'd' will be (60/100) × 20 = 0.6 × 20 = 12. The sample size will be (1.96^{2} × 60 × 40)/(12 × 12) = 64.0, that is, about 65 participants after rounding up. This implies that the estimate of 'p' will lie within 60% ± 12%, i.e. between 48% and 72%. If by chance the actual prevalence is below 48%, the estimated awareness and information level may be less authentic. Under relative precision, a larger value of 'p' is associated with a larger value of 'd' and the reverse is also true. A smaller 'd' always demands a larger sample size and vice versa. Now, if we consider 'd' as 10% of the prevalence (60%), d = 6 and the sample size is (1.96^{2} × 60 × 40)/(6 × 6) = 256.1. Hence, a total of 257 participants will be enough to reveal the truth (what we intend to measure), provided the prevalence is 60%. On the other hand, if 'p' is considered as 65% with a relative precision of 20% (d = 13), the expected sample size will be (1.96^{2} × 65 × 35)/(13 × 13) = 51.7. If 'p' is taken as 70% and 'd' is 10% of 70 (d = 0.7 × 10 = 7), N equals 164.6, or about 165.
Now, the question is which values to take for the sample size calculation: 'p' of 60% or 70%, and precision of 10% or 20%?
This depends on factors such as time, workforce, location and resources, which a researcher is supposed to consider well ahead. Based on these, the researcher should decide on the feasibility of including a sample size of 257 or 65. There are occasions where we do not get authentic data on the prevalence from extensive review of literature. Doing a pilot study in such situations could be a wise and appropriate decision to get a convincing value of 'p', and the value obtained from the pilot study would be more trustworthy than the external values.
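The prevalence-based calculation can be reproduced with a few lines of Python, which also makes it easy to tabulate the options being weighed here. This is a minimal sketch: the function name `cross_sectional_n` is hypothetical, 'p' is entered in percent, and the multiplier is the square of Z_{α} (1.96^{2} ≈ 3.84), so results may differ slightly from hand calculations that round the multiplier.

```python
from math import ceil
from statistics import NormalDist


def cross_sectional_n(p_percent: float, relative_precision: float,
                      confidence: float = 0.95) -> float:
    """Sample size for estimating a prevalence: N = Z^2 * p * q / d^2.

    p_percent          -- expected prevalence in percent (e.g. 60)
    relative_precision -- d expressed as a fraction of p (e.g. 0.20)
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95% CI
    q = 100 - p_percent
    d = p_percent * relative_precision                   # absolute precision in %
    return z ** 2 * p_percent * q / d ** 2


# Scenarios discussed in the text: (prevalence %, relative precision)
for p, rel in [(60, 0.20), (60, 0.10), (65, 0.20), (70, 0.10)]:
    n = cross_sectional_n(p, rel)
    print(f"p = {p}%, d = {p * rel:.0f}: N = {n:.1f}, rounded up to {ceil(n)}")
```

Rounding up rather than to the nearest integer is a conservative convention chosen for this sketch.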
Example 2: Sample size estimation for clinical trials by comparing two means
The formulae for the same are shown in [Table 3].^{[17]} One recently published randomised controlled trial compared the time to intubate (in seconds) between the channelled and non-channelled blades of the King Vision^{™} videolaryngoscope for orotracheal intubation. The authors conducted a pilot study for sample size calculation. They found that the time taken for orotracheal intubation was 24 ± 7.5 s with the non-channelled blade and 13.8 ± 8.0 s with the channelled blade, a mean difference of 10.2 s. The pilot study gives the SD (S), which can be taken as the mean of 7.5 s and 8.0 s; for ease of calculation, let us take 8.0 s as the population SD. Similarly, we can consider a minimal difference of 5 s between the groups as clinically relevant. Although the authors anticipated 20% dropouts in their study, addition of 10% extra participants is commonly seen. The researchers considered the power of the study as 95%, but for ease of calculation, we will take the power as 80%, the value familiar to most researchers.
Intubation time is a continuous variable and the data are expressed as mean ± SD. The sample size in this case is calculated with the formula for comparing two means shown in [Table 3], assuming a superiority design.
N = sample size in each arm; a = conventional multiplier Z_{1−α}, which is 1.645 for α = 0.05 (the 95th percentile of the standard normal distribution, i.e. a one-sided test); b = statistical constant Z_{1−β} for the power of the study, which is 0.842 for 80% power. The difference between the groups considered clinically relevant (d − δ_{0}) is 5 s and S, the population SD, is 8 s. N = 2 × ([1.645 + 0.842]^{2} × [8.0]^{2}/[5 × 5]) = 31.7. The sample size for each arm is therefore 32.
Hence, a sample size of 64 patients, 32 in each arm, is adequate to detect a minimal clinically relevant difference of 5 s between groups in time to achieve successful intubation, considering an SD of 8.0 s, using a one-sided test of the difference between means with 80% power and 5% level of significance. Allowing for 10% possible dropouts, the total sample size is 72 (36 in each arm).
Let us now assume that we are trying to demonstrate non-inferiority for the same example, with a non-inferiority margin of 3 s.
N = 2 × ([1.645 + 0.842]^{2} × [8.0]^{2}/[3 × 3]) ≈ 88. The sample size for each arm is 88.
Let us now assume that we are trying to demonstrate equivalence for the same example, with an equivalence margin of 2 s.
N = 2 × ([1.96 + 0.842]^{2} × [8.0]^{2}/[2 × 2]) = 251.2. The sample size for each arm is 252.
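The three calculations for comparing two means share one expression and differ only in the multiplier and the margin, which the following Python sketch makes explicit. It applies the expression exactly as written in the worked lines, including the leading factor 2, with the same rounded constants; the function and constant names are hypothetical, and the 10% dropout allowance is applied per arm and rounded up.

```python
from math import ceil

Z_ONE_SIDED_05 = 1.645   # Z for alpha = 0.05, one-sided
Z_TWO_SIDED_05 = 1.96    # Z for alpha = 0.05, two-sided
Z_POWER_80 = 0.842       # Z for 80% power


def n_per_arm_means(z_alpha: float, z_beta: float, sd: float, margin: float) -> float:
    """N per arm = 2 * (z_alpha + z_beta)^2 * sd^2 / margin^2."""
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / margin ** 2


def with_dropouts(n_per_arm: int, dropout: float = 0.10) -> int:
    """Inflate a per-arm sample size by the expected dropout fraction."""
    return ceil(n_per_arm * (1 + dropout))


SD = 8.0  # seconds, from the pilot data

designs = {
    "superiority (5 s difference)": n_per_arm_means(Z_ONE_SIDED_05, Z_POWER_80, SD, 5),
    "non-inferiority (3 s margin)": n_per_arm_means(Z_ONE_SIDED_05, Z_POWER_80, SD, 3),
    "equivalence (2 s margin)": n_per_arm_means(Z_TWO_SIDED_05, Z_POWER_80, SD, 2),
}

for name, n in designs.items():
    per_arm = ceil(n)
    print(f"{name}: {n:.1f} -> {per_arm} per arm, "
          f"{with_dropouts(per_arm)} per arm with 10% dropouts")
```

Because the margin enters as a square in the denominator, tightening it from 5 s to 2 s dominates the increase in sample size, far more than the change of multiplier does.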
Example 3: Sample size estimation for clinical trials by comparing two proportions
Tracheal intubation during ongoing cardiopulmonary resuscitation (CPR) is a challenging and crucial issue influencing the outcome of the patient. It is common practice to stop chest compressions for a few seconds to facilitate fast tracheal intubation. Skilled airway managers can perform the procedure without any pause in chest compressions. Prolonged cessation of chest compressions is a serious concern in prolonged CPR. Many new devices are in use for airway management. In one manikin study, the authors compared the conventional Macintosh laryngoscope with the King Vision videolaryngoscope during uninterrupted chest compressions in terms of the percentage of successful tracheal intubations in the two groups.^{[18]} If we wish to conduct a similar study in an intensive care unit, the sample size calculation will be as follows.
The formula for calculating sample size in superiority design is shown in [Table 3].
where N = estimated sample size in each arm; Z_{1−α} depends on the level of significance and is 1.645 for a one-sided 5% level (1.96 is used in the equivalence calculation below); Z_{1−β} depends on the power and is 0.842 for 80% power. Suppose p = 73% or 0.73 is the proportion of successful intubations with Macintosh direct laryngoscopy, so that 1 − p = 0.27. Let us assume that the success rate with King Vision is 93%, so that δ = 0.2, and that the clinically acceptable difference under the null hypothesis is δ_{0} = 0.1, giving δ − δ_{0} = 0.1.
N = 2 × ([1.645 + 0.842]/0.1)^{2} × 0.73 × 0.27 = 243.8, i.e. 244 in each arm.
If non-inferiority is assumed
N = 2 × ([1.645 + 0.842]/0.1)^{2} × 0.73 × 0.27 = 243.8, i.e. 244 in each arm.
If equivalence is assumed
N = 2 × ([1.96 + 0.842]/0.1)^{2} × 0.73 × 0.27 = 309.5, i.e. 310 in each arm.
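The calculations for two proportions can be checked in the same way. The Python sketch below applies the expression as written in the worked lines, with the control proportion p = 0.73 and a denominator of 0.1; the function name `n_per_arm_proportions` is hypothetical. With unrounded multipliers (1.645, 0.842) the result may differ by a participant or so from hand calculations that round them to two decimals.

```python
from math import ceil


def n_per_arm_proportions(z_alpha: float, z_beta: float,
                          p_control: float, difference: float) -> float:
    """N per arm = 2 * ((z_alpha + z_beta) / difference)^2 * p * (1 - p)."""
    return 2 * ((z_alpha + z_beta) / difference) ** 2 * p_control * (1 - p_control)


P_MACINTOSH = 0.73   # success rate with Macintosh laryngoscopy
DIFFERENCE = 0.1     # delta - delta_0 = 0.2 - 0.1

superiority = n_per_arm_proportions(1.645, 0.842, P_MACINTOSH, DIFFERENCE)
equivalence = n_per_arm_proportions(1.96, 0.842, P_MACINTOSH, DIFFERENCE)

print(f"superiority / non-inferiority: {superiority:.1f} -> {ceil(superiority)} per arm")
print(f"equivalence: {equivalence:.1f} -> {ceil(equivalence)} per arm")
```

Comparing these figures with those for the continuous outcome in Example 2 shows why sample sizes tend to be larger for percentages than for means.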
Nuances of sample size calculation
Who? What? Why? When? Where? How? How much? and the last query, 'So what?': these are the elemental questions a child learns to ask at primary school. Similarly, a researcher who introspects will find that these questions closely resemble the milestones of any clinical research.^{[19]} They are particularly helpful to recognise the direction of research and also the context. The choice of an appropriate study design is crucial as it cannot be changed or modified at a later stage of the research.^{[20]} Only the correct design can ensure that the author achieves the objectives and is able to interpret the study results, ultimately answering the research question. There is some overlap between quantitative study designs. If we repeat a cross-sectional study and the same sample is assessed a second time, the original cross-sectional study is transformed into a cohort study. Similarly, a case series in a predefined population may prompt the conduct of a case–control study or an experimental study. Although we tried to divide study designs into various compartments, these are not watertight and some flexibility prevails when we work in real life. One important consideration is that the sample size needed does not depend on the population size. If we reduce the effect size by 50% (half), the required sample size increases fourfold. Sample size tends to be smaller for means than percentages. Sample size is the number of participants and NOT the number of replicates. Our aim should be to increase the number of patients and to minimise the number of replicates. Measurement of mean arterial pressure (MAP) 10 times in 100 patients is much superior (for the sake of study results) to measurement of MAP 100 times in 10 patients. The practice pearls related to the topic are given in [Table 4].
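The fourfold rule follows directly from the form of the two-means expression used in Example 2, in which the detectable difference δ appears squared in the denominator. A short derivation, with all other inputs held fixed:

```latex
N(\delta) = \frac{2\,(Z_{1-\alpha} + Z_{1-\beta})^{2}\,S^{2}}{\delta^{2}}
\quad\Longrightarrow\quad
N\!\left(\frac{\delta}{2}\right)
  = \frac{2\,(Z_{1-\alpha} + Z_{1-\beta})^{2}\,S^{2}}{(\delta/2)^{2}}
  = 4\,N(\delta)
```

For instance, with S = 8 s, detecting 2.5 s instead of 5 s would raise the per-arm requirement in Example 2 about fourfold.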
Conclusion   
Sample size estimation is the key to performing effective comparative studies. An understanding of the concepts of power, sample size and type I and II errors will help the researcher and the critical reader of medical literature. Ken Rothman stated in 1986 that 'the question of the most appropriate study size is not a technical decision, but a judgement to be determined by experience, intuition and insight'.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Saxena P, Prakash A, Acharya AS, Nigam A. Selecting a study design for research. Indian J Med Specialities 2013;4:334-9.
2.  Vetter TR. Magic mirror, on the wall - which is the right study design of them all? - Part I. Anesth Analg 2017;124:2068-73.
3.  Stewart J, Flice de Barrows N. Qualitative research methods. FAIMER-Keele Master's in Health Professions Education: Accreditation and Assessment. Module 6, Unit 6, 4^{th} ed. London: FAIMER Center for Distance Learning, CenMEDIC; 2018.
4.  Kapoor MC. Types of studies and research design. Indian J Anaesth 2016;60:626-30.
5.  Garg AX, Hackam D, Tonelli M. Systematic review and meta-analysis: When one study is just not enough. Clin J Am Soc Nephrol 2008;3:253-60.
6.  Rezigalla AA. Observational study designs: Synopsis for selecting an appropriate study design. Cureus 2020;12:e6692.
7.  Vetter TR. Magic mirror, on the wall - which is the right study design of them all? - Part II. Anesth Analg 2017;125:328-32.
8.  Schober P, Vetter TR. Effect size measures in clinical research. Anesth Analg 2020;130:869.
9.  Mascha EJ, Vetter TR. Significance, errors, power, and sample size: The blocking and tackling of statistics. Anesth Analg 2018;126:691-8.
10.  Yoon HK, Lee HC, Park JB, Oh H, Park HP. McGrath MAC videolaryngoscope versus optiscope video stylet for tracheal intubation in patients with manual in-line cervical stabilization: A randomized trial. Anesth Analg 2020;130:870-8.
11.  Das S, Mitra K, Mandal M. Sample size calculation: Basic principles. Indian J Anaesth 2016;60:652-6.
12.  Wang B, Wang H, Tu XM, Feng C. Comparisons of superiority, non-inferiority, and equivalence trials. Shanghai Arch Psychiatry 2017;29:385-8.
13.  Brasher PM, Dobson G. Understanding non-inferiority trials: An introduction. Can J Anaesth 2014;61:389-92.
14.  Darling HS. Basics of statistics-3: Sample size calculation-(i). Cancer Res Stat Treat 2020;3:317-22.
15.  Dattalo P. A review of software for sample size determination. Eval Health Prof 2009;32:229-48.
16.  Cook TM, Woodall N, Harper J, Benger J; Fourth National Audit Project. Major complications of airway management in the UK: Results of the Fourth National Audit Project of the Royal College of Anaesthetists and the Difficult Airway Society. Part 2: Intensive care and emergency departments. Br J Anaesth 2011;106:632-42.
17.  Zhong B. How to calculate sample size in randomized controlled trial? J Thorac Dis 2009;1:51-4.
18.  Gaszynska E, Gaszynski T. Endotracheal intubation using the Macintosh laryngoscope or King Vision video laryngoscope during uninterrupted chest compression. BioMed Res Int 2014;2014:250820.
19.  Vetter TR. Descriptive statistics: Reporting the answers to the 5 basic questions of who, what, why, when, where, and a sixth, so what? Anesth Analg 2017;125:1797-802.
20.  Omair A. Selecting the appropriate study design for your research: Descriptive study design. J Health Specialities 2015;3:153-6.
