A Unified Statistical Framework for Demographic Rates Using Demographic and Health Survey Data
Thomas W. Pullum tom.pullum@icf.com The Demographic and Health Surveys Program Draft of October 25, 2017 Prepared for the IUSSP International Population Conference, Cape Town, South Africa, October 29 to November 4, 2017. DHS is a project of the United States Agency for International Development.
Abstract

The Demographic and Health Surveys Program is a major source of estimates of fertility, under‐five mortality, and adult and maternal mortality in developing countries. The textbook definitions of these rates are geared to settings with good vital statistics and census data. This paper is intended to clarify how DHS rates are calculated from survey data, particularly from retrospective birth histories and sibling histories. The rates in DHS reports and on STATcompiler are calculated with CSPro, a non‐statistical data
processing package that essentially cumulates numerators and denominators and divides. Confidence intervals in the reports are calculated with a repetitive jackknife procedure. This paper presents a statistical modeling approach, in which generalized linear models are applied to the individual‐level data files, and the coefficients from those models are converted to rates. Confidence intervals for single rates and compound rates can also be estimated analytically. The methods are described in terms of Stata, within which it is also easy to incorporate the sampling weights and sample design effects. The estimated rates from the models agree exactly with the DHS estimates and the confidence intervals are very similar. The framework is intended to help bridge the gap between demographic and statistical methods, as well as to bring out the conceptual similarities among the different kinds of rates.
The Demographic and Health Surveys Program has been a major source of estimates of fertility and child mortality in developing countries for more than 30 years. For approximately the last half of that interval, it has also been a major source of estimates of adult and maternal mortality. The procedures for estimating these rates from survey data evolved from methods that were originally developed for vital statistics and census data. The procedures are implemented by DHS with the Census and Survey Processing System (CSPro), a package developed largely by the U.S. Census Bureau and DHS (with USAID support) for the entry, editing, and tabulation of census and survey data. CSPro is a Windows‐based package that has been in widespread use since about 2000 but evolved from previous DOS‐based packages that did much the same thing. CSPro is able to produce publication‐ready tables of very complex indicators, but it is not a statistical
package. It does not include estimation commands and it does not analytically produce standard errors.
The DHS estimates of standard errors and confidence intervals that appear in the main survey reports are calculated with a jackknife procedure that is computationally intensive but essentially re‐calculates the indicator repeatedly with the omission of one cluster, or PSU, at a time. The DHS procedure for calculating demographic rates is to accumulate a numerator, accumulate a denominator, and divide, for each category of a covariate or each cell of a cross‐classification. Programs written in CSPro to accomplish this are computationally efficient, and are based on sound demographic procedures, but their logic is quite different from how a demographer or statistician might approach the task with a package such as SPSS, SAS, Stata, or R. It is very difficult for someone working outside of DHS to match the DHS rates. The main exception to this statement is the well‐known Stata program “TFR2” that has been prepared and made widely available by Bruno Schoumaker (Schoumaker 2013). Even within DHS, some analysts find TFR2 to be the easiest and fastest way to produce fertility rates that are consistent with those produced by CSPro. So far as I know, there is no generally available alternative to TFR2, and there are no generally available programs at all that are consistent with DHS estimates of under‐five mortality or adult and maternal
mortality. Such programs surely exist, but presumably they are restricted to personal use or to internal
use within various agencies. Prior to joining DHS in 2011, I had developed a Stata program to calculate the fertility rates, using logic similar to TFR2. This program was consistent with a description of the procedures for calculating fertility rates from survey data (Pullum 2004). Since joining DHS I have developed Stata programs that calculate all the rates, with additional features and with the inclusion of confidence intervals. The goal has been to reproduce the DHS rates exactly, including very subtle and somewhat arbitrary features such as the treatment of age and time boundaries, the exclusion of any events that occur in the month of the survey, and so on, but within a statistical framework that would yield confidence intervals, adjusted for weights and other aspects of the survey design. A secondary goal has been to develop the flexibility to have different time intervals, specified either as a range of calendar time or time before the survey, to include categorical covariates, to include multiple surveys from the same country or from different countries in a single run, to save the results in files that can be re‐analyzed, and so on. It is expected that these programs will be distributed through the DHS website in 2018, in two formats: as Stata ado files, with limited options, and with complete Stata code that users can freely adapt for their own use. This paper is in part an introduction to those programs, in advance of their release. DHS produces the following three sets of demographic rates that will be discussed in this paper:

Fertility rates
  General Fertility Rate (GFR)
  Age‐specific Fertility Rates for ages 15‐19 through 45‐49 (ASFRs)
  Total Fertility Rate (TFR)
Under‐five Mortality Rates
  Neonatal Mortality Rate (NNMR)
  Post‐Neonatal Mortality Rate (PNNMR)
  Infant Mortality Rate (IMR, 1q0)
  Child Mortality Rate (CMR, 4q1)
  Under‐Five Mortality Rate (U5MR, 5q0)
Adult Mortality Rates
  Adult Male Mortality Rate (AMMR)
  Adult Female Mortality Rate (AFMR)
  Maternal Mortality Rate (MMRate)
  Maternal Mortality Ratio (MMRatio)
We will make minimal use of formal notation. It will be assumed that readers have a good familiarity with the standard definitions of demographic rates in terms of numbers that could be obtained from, say, a complete civil registration and vital statistics (CRVS) system; with statistical concepts such as confidence intervals and generalized linear models, or at least with poisson and logit regression; and with basic computational procedures that can be implemented with a package such as Stata. It will not be assumed that readers are familiar with how demographic rates can be constructed from DHS data. The report will describe the calculation of fertility rates, under‐five mortality rates, and adult mortality rates, in succession. Within each topic, there will be two sections. The first section describes how DHS data, particularly birth histories and sibling histories, can be converted to estimates of rates. Many demographers who can ably interpret these rates are probably not clear on the details of how they are
calculated. This discussion uses the kind of terminology that is compatible with either CSPro calculations or a statistical approach. Section 4.1 is extracted from the current version of the Guide to DHS Statistics
(Rutstein and Rojas 2006). Sections 2.1 and 3.1 are drafts (by Pullum) of portions of the next version of this core document, which will appear in 2018. The second section for each set of rates shifts to an emphasis on the statistical approach. The final section of the paper will attempt to bring out the commonalities among the three sets of rates and procedures.
2. Fertility Rates

We will discuss the following fertility rates:

  General Fertility Rate (GFR)
  Age‐specific Fertility Rates for ages 15‐19 through 45‐49 (ASFRs)
  Total Fertility Rate (TFR)

DHS reports include estimates of the Crude Birth Rate (CBR), but it will not be included here. The most recent surveys obtain the day of children’s births, for all children in the birth histories, in addition to calendar month and year. Given the day, month, and year of the interview, which have always been obtained, it will be possible to estimate children’s ages more precisely (although it is expected that day of birth will often have to be imputed). Some DHS programs will be modified to take this information into account. The description given here is consistent with current procedures.

2.1. How the fertility rates are calculated with DHS data

The main reports on DHS surveys include several measures of fertility for recent reference periods of time, usually the three years before the survey. We will provide a brief description of the calculation of the following rates: the General Fertility Rate (GFR), Age‐Specific Fertility Rates (ASFRs), and the
Total Fertility Rate (TFR). These are all calculated as exposure‐occurrence rates, in which the numerators consist of numbers of births and the denominators are woman‐years of exposure to the risk of childbearing.
SLIDE 5 5 In the original demographic definition of an ASFR, say for women age 20‐24 in calendar year 2000, the numerator is the number of births in 2000 to women who were age 20‐24 at the time of birth, and the denominator is the number of women age 20‐24 at the midpoint of the year, that is, on July 1, 2000. The numerator comes from a vital statistics system and the denominator is estimated from a completely different source, such as censuses or a civil registration system. In developed countries, official statistics
on fertility are calculated in this way.
By contrast, when age‐specific fertility rates are calculated from retrospective birth histories collected in a survey, the denominator is not the number of women in the age group, but rather the number of woman‐years of exposure to the specified age interval (and within a specified time interval). All women in the survey whose lives included any time (“exposure”) in the combination of age and time, not just those who had a birth, contribute to the denominators. Each woman’s month and year of birth are used to calculate her age in the time interval. The standard five‐year age intervals are 15‐19 through 45‐49. The time interval for DHS fertility rates is usually expressed as an interval before the survey, rather than as a calendar year or group of calendar years. Time is counted with month 0 for the month of interview, which is specific for each woman in the survey, with month 1 for the preceding month,
etc. Events and exposure in the month of interview are not included. The three years before the
survey consist of months 1‐36, a sliding interval that is the same for all women interviewed in the same month but is shifted slightly for women interviewed in different months. The variables used for the calculation of the rates are the century month code (cmc) for the date of interview (v008), the cmc for the woman’s birthdate (v011), and the cmc for the children’s birthdates (b3, for each live birth that the woman has had). The weight variable (wt=v005/1000000) determines the magnitude of the contributions to the numerators and denominators of the rates. A woman’s birth is assumed to be on the first day of a calendar month, in order to avoid ambiguity about her age when a child is born in the same calendar month as the woman. For each woman, each age interval within the window or reference period of time is expressed as a range of calendar time, with a starting month and ending month. If a birth occurred within that range of time, then wt is added to the numerator of the appropriate age‐specific rate. The woman’s contribution to the denominator of the rate is the total number of months in that range (the ending month minus the starting month, plus one), multiplied by wt. Subsequently the number of months is divided by 12 to convert to years of exposure. The age‐specific rates are calculated by dividing the numerator by the denominator. The age‐specific rates in the reports are multiplied by 1000 and can be interpreted as the number of births per 1000 woman‐years of exposure or, more simply, as the number of births per 1000 women.
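As an illustration of the cmc arithmetic just described, the following Python sketch (not DHS code; the published rates are computed with CSPro) computes one hypothetical woman's contributions to the ASFR numerators and denominators. The names v008, v011, and b3 follow the DHS recode files; the values are made up, and sampling weights and all-women factors are ignored.

```python
# Illustrative sketch: one woman's contributions to the ASFR numerators
# and denominators for the 36 months before the interview. All values
# are hypothetical; weights and all-women factors are ignored.

def asfr_contributions(v008, v011, b3_list):
    """v008 = cmc of interview, v011 = cmc of woman's birth,
    b3_list = cmc's of her live births. Returns, for age groups
    1..7 (15-19 .. 45-49), a (births, exposure_months) pair."""
    win_lo, win_hi = v008 - 36, v008 - 1   # month 0 (interview month) excluded
    out = {}
    for i in range(1, 8):
        lo_age, hi_age = 10 + 5 * i, 15 + 5 * i      # e.g. i=1 -> ages 15 to 20
        grp_lo = v011 + 12 * lo_age                  # first cmc in the age group
        grp_hi = v011 + 12 * hi_age - 1              # last cmc in the age group
        start, end = max(win_lo, grp_lo), min(win_hi, grp_hi)
        months = max(0, end - start + 1)             # exposure, in months
        births = sum(1 for b3 in b3_list if start <= b3 <= end)
        out[i] = (births, months)
    return out

# Woman interviewed in cmc 1400, born in cmc 1050 (so age 29 at interview),
# with one birth in cmc 1380: all 36 months of exposure fall in ages 25-29.
contrib = asfr_contributions(1400, 1050, [1380])
print(contrib[3])   # → (1, 36)
```

In the real calculation each of these contributions would be multiplied by wt, and the months of exposure divided by 12, before cumulating across women.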
The General Fertility Rate is analogous to a broadened age‐specific rate. The denominator consists
of all exposure to ages 15‐44 within the window of time. Exposure to age interval 45‐49 is omitted.
The numerator consists of all births at ages 15‐44, plus births at ages 45‐49 and births at ages under 15 that occur within the window of time. This is a somewhat non‐standard definition of the GFR, but the difference from the standard alternative is very small. The GFR is affected by the age distribution of women within the reproductive ages, but the TFR is not. The Total Fertility Rate is the sum of the age‐specific rates (before they are multiplied by 1000), multiplied by five. The TFR is an example of a synthetic rate, because it pieces together recent information to produce a rate with a cohort interpretation: the TFR is the average number of births that a woman would have during ages 15‐49 if she survived the full age interval and had children
with the age‐specific rates observed in the reference period. The factor of five is required because the hypothetical woman would experience each of the five‐year age‐specific rates for five years of time. Because the age interval for eligibility in DHS surveys is 15‐49, there is a progressive omission of the upper ages as estimates go back in time. For example, it is impossible to calculate an ASFR for ages 45‐49 for a period more than five years before the survey, because women at that age, at that time, would have been too old at the time of the survey to be included in the survey. When DHS reports include ASFR or TFR estimates for an earlier time period, they generally will not include any correction for this progressive omission. In some DHS surveys, eligibility is further restricted to women who are ever‐married. These are described as EMW (for ever‐married women) surveys. These surveys include woman‐specific “all‐women factors” that must be used in the calculation of the ASFRs and GFR if the rates are to be interpreted as describing all women, not just ever‐married women. [The calculation of all‐women factors is described elsewhere in this Guide]. They include a factor of 100 that must be removed during the adjustment procedure. The numerators of the rates are not affected, because it is assumed that never‐married women have not had any births. Only the denominators must be changed, to adjust for the exposure to risk that is missing when the survey omits never‐married
women. Each woman’s contribution to the denominators of the rates, described above, is inflated
by multiplying by awfactt/100. For rates that are specific to place of residence, region, education, or wealth quintile, other pre‐calculated versions of the factors are used. Age‐specific rates for sub‐populations are not given in the reports because they would tend to be based on too few cases and to be statistically unstable. For subgroups, only the TFR is provided in the reports and on STATcompiler. Age‐specific rates are calculated in order to produce a TFR, but they are not published. 95% confidence intervals for the TFR and GFR are provided in Appendix B on sampling errors in the main reports. The confidence intervals are not generally included in the body of the report and are not included on STATcompiler. The confidence intervals are calculated as follows. First, the standard error of the rate is estimated by a jackknife procedure that is adapted to the survey design. The rate is re‐calculated, repeatedly, with the omission of sample clusters, and the variation of these estimates is converted to an estimate of the standard error of the original estimate. In a second step, a confidence interval is calculated as plus/minus two standard errors. “2” is used in place of the usual multiplier, 1.96. There are three weaknesses in this approach to the calculation of confidence intervals. First, the jackknife takes account of sample weights and clusters but not stratification. Second, the interval is symmetric around the TFR or GFR. It would be preferable to have an interval that was symmetric around ln(TFR) or ln(GFR) and then exponentiated, because the fertility rates have a natural zero. Under the DHS strategy, it would be possible for the lower end of the confidence interval to be less than 0. Third, using a multiplier of 2 rather than 1.96 for a 95% confidence interval, although the difference is trivial, is not justifiable.
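The contrast between the two interval styles can be illustrated with made-up numbers. This Python sketch assumes a TFR of 4.2 with a standard error of 0.15, and uses the delta-method approximation se(ln TFR) ≈ se(TFR)/TFR for the log-scale interval:

```python
import math

# Two confidence-interval styles for a fertility rate, with made-up numbers:
# a TFR of 4.2 whose standard error on the TFR scale is 0.15.
tfr, se = 4.2, 0.15

# DHS-report style: symmetric around the estimate, multiplier 2.
lo_sym, hi_sym = tfr - 2 * se, tfr + 2 * se

# Log-scale style: symmetric around ln(TFR), then exponentiated.
# By the delta method, se(ln TFR) is approximately se(TFR) / TFR.
s_log = se / tfr
lo_log = tfr * math.exp(-1.96 * s_log)
hi_log = tfr * math.exp(+1.96 * s_log)

print(round(lo_sym, 3), round(hi_sym, 3))   # → 3.9 4.5
print(round(lo_log, 3), round(hi_log, 3))   # → 3.916 4.505
```

The log-scale interval is slightly asymmetric around 4.2 and can never extend below zero, which is the point made in the text.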
2.2. The statistical model for fertility rates

The preceding section described how the fertility rates are calculated with DHS data using CSPro. We now describe the procedure using an individual‐level statistical model, which produces parameter estimates that are then manipulated to produce rates. An occurrence/exposure rate has the form r=b/e, where b is the collective number of occurrences, e.g. the number of births in a population, and e is the collective exposure, e.g. the number of woman‐years
of exposure associated with the births. In a statistical framework, b is a random variable, a count, and e
is a nuisance parameter analogous to population size and is treated as a constant. Taking the natural logarithm of both sides, ln(r)=ln(b)‐ln(e); rearranging so that the random variable is on the left hand side, ln(b)=ln(r)+ln(e). With individual‐level data, we fit a generalized linear model in which b is the
outcome, the link function is the natural logarithm, there are no covariates, and ln(e) is an offset. Cases
with e=0 are omitted. The model will produce a constant term b0 (say), which can be identified with ln(r), and therefore the point estimate of r will be exp(b0). This model is described, for example, in a familiar text by Alan Agresti (Agresti 1996, pp. 86‐87). A generalized linear model must also specify the distribution of the random variable b around its expected value. For this purpose we assume a poisson distribution. It is well known that the distribution
of the number of births in any interval of time is less dispersed than a poisson distribution would be,
largely because of the minimal length of time (e.g. 10.5 months of pregnancy and postpartum amenorrhea) associated with each live birth. Alternative distributions such as the negative binomial can be used in place of the poisson. In Stata, the glm with log link and poisson error would be specified with “glm b, family(poisson) link(log) offset(lne)”, where lne is ln(e). Equivalently, the model could be “poisson b, offset(lne)”. In order to adjust for the survey design, including sampling weights, it is also necessary to use svyset and svy. The units of analysis are the cases in the file of individual women respondents (the IR file). The reference interval of time is specified in terms of a start month and an end month, expressed in cmc’s. The interval can be fixed for all women, e.g. as the three calendar years 2000‐2002 (the cmc of the start month is 1201, and the cmc of the end month is 1236), or it can be an interval of months before the interview, e.g. the 36 months before the interview (the cmc of the end month is the month of interview minus 1, and the cmc of the start month is the month of interview minus 36). The simplest fertility rate is the GFR, which is analogous to a single ASFR but for a full range of ages 15‐49. To calculate the GFR,[1] for each woman we construct e, which is her months of exposure to ages 15‐49 within the interval of time, and b, which is the number of births she had while ages 15‐49 within the interval of time. The statistical model to calculate the GFR consists basically of two steps. The first step is to calculate the natural logarithm of e; this step is combined with the conversion from months to years with “gen lne=ln(e/12)”. The second step is to apply “svy: glm b, family(poisson) link(log) offset(lne)”. This command requires a previous specification of the survey design, including weights, within the “svyset” command. The estimation command will produce a scalar b0, from which the GFR can be calculated with “scalar GFR=exp(b0)”. The command will also produce a scalar s0, which is the estimated standard error of b0. This can be used to construct a confidence interval for the population value of b0,
[1] The DHS version of the GFR includes births before age 15 (for women 15+ at the time of the survey). It includes births while age 45‐49 but omits exposure to age 45‐49. A modification of the model incorporates these nuances, but those modifications are not in the model as described.
the ends of which, after exponentiation, provide a confidence interval for the GFR.[2] This confidence interval will be symmetric on the scale of ln(GFR) but asymmetric on the scale of the GFR itself, such that the midpoint of the interval will be slightly greater than the point estimate. Alternatively, a user can access a saved matrix r(table), which contains the lower and upper bounds of a 95% confidence interval for b0, and exponentiate those numbers to obtain the confidence interval for the GFR. This approach will be called the Single Rate Model. Within the estimation command, a negative binomial error distribution would not change the point estimates but could change the standard errors. The ASFRs could be calculated one at a time using the Single Rate Model, but it is more efficient to calculate all of them in a single model. To do this we calculate, for each woman, the months of exposure to each five‐year interval of age, within the interval of time. The typical woman is exposed to one age interval, or at most two consecutive age intervals, within the time interval. The months of exposure to each age interval are added to the record as variables e_1 through e_7. The calculation of exposure does not involve the birth history. The dates of births in the birth history are then reviewed and the births are classified into the correct interval of the woman’s age at the time of birth, within the time interval. The numbers of births in the respective age intervals are b_1 through b_7. As stated above, for the standard setup (five‐year age intervals and a three‐year time interval) a woman will have non‐zero exposure to at most two (consecutive) age intervals. The number of births can only be non‐zero if the exposure is non‐zero. The ASFRs and the TFR are calculated by manipulating e_1 through e_7 and b_1 through b_7.
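A Python sketch of this data construction and of the stacking step described next (carried out with Stata's reshape long in the paper's workflow). The two women and their exposures are invented:

```python
# Stacking the wide e_*/b_* variables into one record per woman-by-age
# combination; intervals with no exposure are dropped. Hypothetical data.

wide = [
    {"womanid": 1, "b": {2: 1, 3: 0}, "e": {2: 14, 3: 22}},   # ages 20-24, 25-29
    {"womanid": 2, "b": {5: 0},       "e": {5: 36}},          # ages 35-39 only
]

long_rows = []
for woman in wide:
    for age in range(1, 8):
        e = woman["e"].get(age, 0)
        if e > 0:                         # keep only intervals with exposure
            long_rows.append({"womanid": woman["womanid"], "age": age,
                              "b": woman["b"].get(age, 0), "e": e})

for row in long_rows:
    print(row)
# Woman 1 contributes two stacked records (ages 2 and 3); woman 2 just one.
```

As in the text, the stacked file has at most about twice as many records as the original file, because each woman spans at most two consecutive age intervals in a three-year window.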
We modify the Single Rate Model with an approach that facilitates the calculation of the TFR, and in particular the calculation of the confidence interval for the TFR. This involves a step in which the e_* and b_* values are stacked. That is, we shift to a data structure in which there is one record for each combination of woman and age interval, although including only the age intervals for which e_* is greater than 0. In Stata this is accomplished with the reshape command, “reshape long b_ lne_, i(womanid) j(age)”. Here “womanid” is a unique identifier for each woman and “lne” is ln(e/12). Here “age” is an index constructed within the reshape command, to match the observed age intervals. For example, if a woman had non‐zero exposure to age intervals 2 and 3, then in the reshaped or stacked file her first record will include b=b_2 and e=e_2 and age=2, and the second record will include b=b_3 and e=e_3 and age=3. (This description skips steps to eliminate the unnecessary underscores and intervals with no exposure.) Thus the stacked file only has about twice as many records or lines as the
original file, even though there are 7 age groups. It is important to stack the age intervals because it
cannot be assumed that fertility in an age interval is statistically independent of fertility in a preceding or subsequent age interval. We next develop the appropriate statistical model to estimate the rates. The constructed variable “age” takes the values 1 through 7. If it is included on the right hand side of a model, it will be assumed by Stata to be an interval‐level variable. Clearly, in the calculation of age‐specific fertility rates, age is a categorical variable. In order for age to be interpreted as a categorical variable, the typical specification would be to include it on the right hand side as “i.age”. This specification would treat the lowest value, age=1, as a reference category. The coefficient for that category would be a “constant” and the coefficients for other values of age would be interpreted as deviations from the reference category.
[2] The scalars referred to here are included in matrices in the “saved results” produced by the estimation command. It is necessary to extract them from the matrices in steps not spelled out here.
It is preferable to have coefficients for each age group that can be exponentiated directly to obtain the age‐specific rates. To achieve this, we construct dummy variables age_1 through age_7 for the seven age groups, include all seven dummies on the right hand side, and then use the “nocons” option, which suppresses the “constant” term and therefore gives a model that is fully identified. This strategy provides a parameterization that is friendlier than “i.age” although, in terms of overall fit, it is identical. (The most efficient statements to construct the seven dummies are “xi, noomit i.age” followed by “rename _I* *”. The same end could be accomplished by specifying an appropriate design matrix.) Using this file and an appropriate svyset specification, the Stata command to produce the age‐specific rates is “svy: glm b age_*, family(poisson) link(log) offset(lne) nocons”. This simple command will produce a saved vector of 7 coefficients, called e(b), which we rename B, and which can be confirmed to be the logs of the ASFRs. We will refer to these coefficients with the symbols b1 through b7: bi=ln(ASFRi). Stata also saves a 7x7 matrix e(V), which we rename S, that is the variance‐covariance matrix of b1 through b7. The ASFRs can be obtained by exponentiating the coefficients: ri=exp(bi). Lower and upper bounds of 95% confidence intervals for the ASFRs can be obtained by exponentiating the respective bounds of the confidence intervals for the coefficients. The lower and upper bounds are given on a log scale in the output and also as “saved results” in a matrix r(table).
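Because there are no covariates other than the seven age dummies, each fitted coefficient has a closed form: it is the log of births over exposure within the age group, so exp(b_i) recovers the ASFR directly. This Python sketch verifies that identity on toy stacked records (point estimates only; the standard errors are what require the full svy: glm machinery, and sampling weights would enter as weighted sums):

```python
import math

# Toy stacked records: (age group, births b, exposure e in years).
stacked = [(2, 1, 14 / 12), (2, 0, 30 / 12),
           (3, 2, 36 / 12), (3, 0, 24 / 12)]

coef = {}
for age in sorted({a for a, _, _ in stacked}):
    births = sum(b for a, b, _ in stacked if a == age)
    expo = sum(e for a, _, e in stacked if a == age)
    coef[age] = math.log(births / expo)     # b_i = ln(ASFR_i)

asfr = {age: math.exp(c) for age, c in coef.items()}
print(round(asfr[2], 4), round(asfr[3], 4))   # → 0.2727 0.4
```

Group 2 has 1 birth over 44/12 woman-years (rate 12/44 ≈ 0.273) and group 3 has 2 births over 5 woman-years (rate 0.4), which is exactly what exponentiating the fitted coefficients returns.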
of the coefficients. Because the TFR is a rate (a compound rate) with a natural zero, we prefer to
calculate a confidence interval for F=ln(TFR) and then exponentiate the ends of that interval. This can be done analytically using the delta method to estimate the standard error of F, called sF. A 95% confidence interval for the TFR will have lower bound exp(F‐1.96sF) and upper bound exp(F+1.96sF). Here the standard error sF is a scalar that can be calculated as the square root of DSD’, where S is the 7x7 variance‐covariance matrix of the coefficients and D is a 1x7 row vector of the partial derivatives of F with respect to the seven coefficients. The partial derivative for cell i is simply 5*ri/TFR. The same steps—constructing individual‐level contributions to the numerators and denominators of age‐specific rates, stacking those contributions, applying a poisson (or similar) version of a generalized linear model, and using the “saved results” to calculate point estimates and confidence intervals for the age‐specific rates and compound rates—can be applied to occurrence/exposure rates other than the fertility rates. This procedure will be given the generic label of the Stacked Rate Model. The GFR can be described as an age‐specific fertility rate in which the age interval extends from 15 through 49, as above. It can also be described as a linear combination of the 7 ASFRs. The TFR is a linear combination in which the ASFRs are multiplied by 5 and then added. The GFR is a linear combination in which the ASFRs are multiplied by the proportion of the total exposure that is within the age interval, and then added. The second perspective can be applied to the vector e(b) and the matrix e(V) that were produced to obtain the point estimate and confidence interval for the TFR. The point estimate of the GFR will be a weighted sum of the rates, with weight wi for rate ri (and the sum of the w’s is 1). That is, the GFR is the sum of wi*ri. To calculate a confidence interval for the GFR we define F=ln(GFR). The only
other modification to the procedure described as the Stacked Rate Model is that entry i in the vector D
is wi*ri/GFR. With this modification, the point estimate and confidence interval for the GFR produced by the Stacked Rate Model will be essentially identical with that produced by the Single Rate Model. This modification of the Stacked Rate Model will be called a Compound Rate Model. This label is used for an application of the Stacked Rate Model to produce age‐specific rates, followed by weighting with weights wi to produce a summary rate. (Note that the weights wi are not the sampling weights, which
are always included within the svyset command.) The weights wi are obtained separately and are only employed for the calculation of the summary rate and the confidence interval for the summary rate. In practice, the GFR would be calculated with the Single Rate Model. The description of the Compound Rate Model is included here in anticipation of its relevance for the calculation of adult and maternal mortality rates in section 4.2.
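The delta-method arithmetic for the Stacked and Compound Rate Models can be sketched numerically. Here the seven age groups are collapsed to two so the matrices stay small, and the coefficients, covariance matrix, and exposure shares are all made up:

```python
import math

# Made-up inputs: b = log-rate coefficients, S = their variance-covariance.
b = [math.log(0.10), math.log(0.20)]      # b_i = ln(ASFR_i)
S = [[0.004, 0.001],
     [0.001, 0.006]]

r = [math.exp(bi) for bi in b]            # ASFRs
tfr = 5 * sum(r)                          # TFR = 5 * sum of the ASFRs

# Stacked Rate Model for the TFR: entry i of D is 5 * r_i / TFR.
D = [5 * ri / tfr for ri in r]
var_F = sum(D[i] * S[i][j] * D[j] for i in range(2) for j in range(2))
sF = math.sqrt(var_F)                     # standard error of F = ln(TFR), via D S D'
lo, hi = tfr * math.exp(-1.96 * sF), tfr * math.exp(1.96 * sF)

# Compound Rate Model for the GFR: same machinery, but D_i = w_i * r_i / GFR,
# where w_i is the share of exposure in age group i (made-up shares, summing to 1).
w = [0.6, 0.4]
gfr = sum(wi * ri for wi, ri in zip(w, r))
D_gfr = [wi * ri / gfr for wi, ri in zip(w, r)]

print(round(tfr, 2), round(lo, 3), round(hi, 3))   # → 1.5 1.335 1.686
print(round(gfr, 2), round(sum(D_gfr), 1))         # → 0.14 1.0
```

Note that the entries of D_gfr sum to 1, as expected for the gradient of the log of a weighted mean, and that the TFR interval is asymmetric around the point estimate, as described in the text.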
3. Under‐Five Mortality Rates
We will discuss the following under‐five mortality rates:

  Neonatal Mortality Rate (NNMR)
  Post‐Neonatal Mortality Rate (PNMR)
  Infant Mortality Rate (IMR, 1q0)
  Child Mortality Rate (CMR, 4q1)
  Under‐Five Mortality Rate (U5MR, 5q0)

3.1. Calculating the under‐five mortality rates with DHS data

The main reports on DHS surveys include five rates to describe mortality under age five during reference periods of time, mainly the five years before the survey. These are the Neonatal Mortality Rate (NNMR), Post‐Neonatal Mortality Rate (PNMR), Infant Mortality Rate (IMR), Child Mortality Rate (CMR), and the Under‐Five Mortality Rate (U5MR). Although described as rates, they are all probabilities multiplied by 1000. No matter what data source is used, these standard rates are algebraically related to one another. The probability that a child will die within the first month after birth can be called q1. The NNMR is 1000*q1. The probability that a child will die within the first year after birth, in standard demographic notation, is 1q0. The IMR is 1000*1q0. The probability that a child will die within five years after birth is 5q0. The U5MR is 1000*5q0. The other two standard rates are related to these three as follows. The PNMR is defined as PNMR=IMR‐NNMR; it is 1000 times the probability that a child will die in the first year but not in the first month. The CMR is 1000*4q1, where 4q1 is the probability of dying between the first and fifth birthdays, given that the child survived to the first birthday. The algebraic relationship among 1q0, 4q1, and 5q0 is (1‐5q0)=(1‐1q0)*(1‐4q1); that is, the probability of surviving to the fifth birthday is the probability of surviving to the first birthday times the probability of surviving from the first birthday to the fifth birthday (where the latter probability is conditional on having survived to the first birthday). This algebraic relationship among the q’s can be re‐written as an algebraic relationship among the IMR, CMR, and U5MR.
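These algebraic relationships are easy to check numerically. In this Python sketch the probabilities are invented (q1 = 0.030, 1q0 = 0.050, 5q0 = 0.080), and 4q1 is recovered from the survival identity:

```python
# Made-up probabilities: q1 (die in first month), 1q0 (first year),
# 5q0 (first five years); the identity (1-5q0) = (1-1q0)(1-4q1) gives 4q1.
q1, q1_0, q5_0 = 0.030, 0.050, 0.080

nnmr = 1000 * q1                       # NNMR = 30 per 1000
imr = 1000 * q1_0                      # IMR = 50
u5mr = 1000 * q5_0                     # U5MR = 80
pnmr = imr - nnmr                      # PNMR = IMR - NNMR: deaths in months 1-11

q4_1 = 1 - (1 - q5_0) / (1 - q1_0)     # probability of dying between ages 1 and 5
cmr = 1000 * q4_1

print(round(pnmr, 3), round(cmr, 3))   # → 20.0 31.579
```

Note that the CMR is slightly larger than (U5MR − IMR) = 30, because 4q1 is conditional on having survived the first year.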
If a real cohort of 1000 births were followed for a full five years after birth, these rates could be calculated easily. The number of deaths to this cohort in the first month would be the NNMR; the number later in the first year would be the PNMR; the sum of those two numbers would be the
IMR. The number who died in the first five years would be the U5MR. The only conditional rate
would be the CMR, calculated as the number of deaths between ages 1 and 4, divided by the
number of children who survived to the first birthday, and then multiplied by 1000.

The birth histories in DHS surveys include the information that is needed to calculate synthetic estimates of the under‐five mortality rates. Synthetic estimates are not based on a single cohort of children who had full exposure to the risk of dying at any age under five in the past five years, which would be limited to children born more than five years ago. Rather, synthetic procedures artificially combine information for short intervals of age observed for different birth cohorts partially or entirely within the past five years, combined as if the information referred to a true birth cohort. The synthetic estimates use all information about deaths and risk observed in the birth histories for the desired combinations of age and time.

The full five‐year age interval of 0‐59 months is divided into eight segments, for ages 0; 1‐2; 3‐5; 6‐11; 12‐23; 24‐35; 36‐47; and 48‐59 months. The first interval is identified with the neonatal period; the first four intervals comprise infancy; the last four intervals describe ages 1‐4 years. The earlier intervals are shorter because under‐five mortality tends to be disproportionately concentrated in the first year and the earliest months. The symbols q1 through q8 refer to the probabilities of dying in intervals 1 through 8, conditional on having reached the beginning of the interval of age. Here, q1 is the same as the q1 defined earlier. We can calculate 1q0 with 1q0=1‐(1‐q1)*(1‐q2)*(1‐q3)*(1‐q4). We can calculate 4q1 with 4q1=1‐(1‐q5)*(1‐q6)*(1‐q7)*(1‐q8). Thus the procedure is reduced to how to calculate q1 through q8 within an interval of time. For each age interval, we require two numbers: the numerator, which is the number of deaths in that age interval; and the denominator, which is not person‐years of exposure, as with an age‐specific death rate or fertility rate, but is the aggregated risk of dying in the age interval.
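The composition of the eight segment probabilities into 1q0, 4q1, and 5q0 can be sketched as follows (an illustrative Python rendering; the segment values and function name are ours, not DHS code):

```python
# Combine the eight segment probabilities q1..q8 (ages 0, 1-2, 3-5, 6-11,
# 12-23, 24-35, 36-47, 48-59 months) into 1q0, 4q1, and 5q0.

def compose_qs(qs):
    p_infant = 1.0          # probability of surviving segments 1-4 (infancy)
    for q in qs[:4]:
        p_infant *= 1.0 - q
    p_child = 1.0           # probability of surviving segments 5-8 (ages 1-4)
    for q in qs[4:]:
        p_child *= 1.0 - q
    q1_0 = 1.0 - p_infant             # 1q0
    q4_1 = 1.0 - p_child              # 4q1, conditional on reaching age 1
    q5_0 = 1.0 - p_infant * p_child   # 5q0
    return q1_0, q4_1, q5_0

# Hypothetical segment probabilities, largest in the earliest segments:
q1_0, q4_1, q5_0 = compose_qs([0.020, 0.010, 0.008, 0.012,
                               0.010, 0.006, 0.004, 0.003])
```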
The relevant information for a specific child in the birth histories is the century month code (cmc) of birth, b3, and the age at death, b6, if the child died. Note that although the timing of the birth is expressed in calendar year and month, converted to a cmc, the timing of the death is expressed
only as age at death, not a year and month of death. The question about age at death is coded as b6
with three digits. The first digit is 1 if age is given in days; 2 if given in months; 3 if given in years. The second and third digits of b6 give the corresponding number of days, or months, or years. These are interpreted as completed days, months, or years. The number of days should not exceed 30—for a larger number, the response should be in months; and the number of months should not exceed 23—for a larger number, the response should be in years. The constructed variable b7 gives age in completed months, and it is b7, rather than b6, that is then used for subsequent calculations. In this construction, b7 is 0 if b6 is coded in days (<31); b7 is the stated number of months if b6 is coded in months (<24); and b7 is 12 times the stated number of years if b6 is coded in years. Typically a few values of b6 are out of range, but when they
occur, their conversion to b7 is similar.
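The b6-to-b7 conversion can be sketched in Python (a function of our own; the real DHS recoding also handles special codes such as "don't know", which are omitted here):

```python
# b6 is a three-digit code: first digit 1=days, 2=months, 3=years;
# the last two digits give the count. b7 is age at death in completed months.

def b6_to_b7(b6):
    unit, value = b6 // 100, b6 % 100
    if unit == 1:        # days: any death under one month is 0 completed months
        return 0
    if unit == 2:        # completed months, expected to be < 24
        return value
    if unit == 3:        # completed years, converted to months
        return 12 * value
    return None          # special or out-of-range codes need separate handling

assert b6_to_b7(115) == 0     # died at 15 days
assert b6_to_b7(203) == 3     # died at 3 completed months
assert b6_to_b7(302) == 24    # "died at age 2" -> the range 24-35 months
```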
Deaths coded with b7 must be mapped into the numerators of q1‐q8. A child who dies at age 0 months will go into the numerator of q1. Similarly, b7 values of 1‐2, 3‐5, and 6‐11 will go into the numerators of q2, q3, and q4, respectively. All children who die in the range 12‐23 months will go into the numerator of q5, with no distinctions about where they died in the range 12‐23 months. Subsequent values of b7 are typically almost entirely located at the values 24, 36, and 48 months (values of b7 greater than 59 months are not relevant for the under‐five rates). These cases will go into the numerators of q6, q7, and q8, respectively, even though they would appear to be concentrated implausibly in just the first month of the intervals 24‐35, 36‐47, and 48‐59 months,
respectively. This apparent heaping is just an artefact of how b7 is constructed from b6. That is, for
example, “died at age 2” should be interpreted as “died at age 2 completed years”, and the coding b7=24 should be understood to describe a range of 24‐35 months, and not as a single month, 24.

A complication arises when the age interval for a child straddles, or is split between, two successive time intervals. This is best illustrated with an example. Suppose that the time interval is months 1‐60 before the month of interview (to repeat, the fertility and mortality rates never include children born in the month of interview, or month 0). This interval can be described as the window of
observation. Suppose that a specific child had its second birthday 62 months before the interview.
That is, it was exposed to age interval 6 (24‐35 months of age) partly before the window of observation and partly within it. If this child died within age interval 6 (age 24‐35 months), we do not know whether the death occurred in the window of observation or earlier. Because of this ambiguity about the time interval in which the death occurred, it is allocated equally to the two parts. Half a death will be added to the numerator of q6 in the window of observation. The other half will go to the numerator of q6 for the earlier 60‐month time period.

Turning to the denominators of the q’s, each child will contribute one unit of risk to an age interval if they died in the age interval or if they survived to the end of the age interval, within the window
of time. It is typical for the child to contribute to the denominator of several successive q’s, but note
that the contribution is not months of exposure, but rather units of risk, usually either 0 or 1. The exception is for an age interval that straddles two time intervals, as described above for the child whose second birthday was 62 months before the interview. This child will contribute half a unit of risk to age interval 6 before the beginning of the window and half a unit to age interval 6 within the window, regardless of whether or not the child died in the age interval. That is, the child had a total
of one unit of risk in the age interval, so to speak, and the rule followed by DHS is to allocate this
unit of risk equally to the part before the beginning of the window and the part within the window. The half‐and‐half allocation of risk and deaths when there is this kind of ambiguity—that is, when an age interval straddles a time interval—could be improved upon for individual cases, but in the aggregate it should even out. There are many other equivalent situations in demography, in which cases are assigned to the midpoint of an interval, for example.

The cumulation of the numerators and denominators of the q’s is most easily done with the weight variable. For each case, the normal weight is wt=v005/1000000. The numerator of a q will be the sum of contributions wt or wt/2, and the denominator of a q will also be the sum of contributions wt or wt/2.
In analyses of mortality, there may be two time intervals, for example, 1‐60 months before the interview and 61‐120 months before the interview, so trends can be studied. The allocation of age intervals that cross the two windows will be as described above. However, the treatment of censoring by the interview is slightly different. If the child had its second birthday two months before the interview, for example, then the observation of age interval 6 is incomplete and only half a unit of risk will be assigned, as if it had crossed a boundary. However, if the child died within this age interval, then the death will be counted in full and the risk will be increased to one unit, under the reasoning that observation for the remainder of the age interval would have been superfluous, because the child had already died.
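The allocation rules just described—half-and-half when an age segment straddles the start of the window, and full counting of an observed death when the segment is censored by the interview—can be sketched as follows. This is an illustrative Python rendering of the rules, not DHS's CSPro code, and it assumes a segment crosses at most one boundary:

```python
# Death and risk contributions of one child to one age segment.
# Each flag says whether the start/end of the segment falls inside the
# observation window (e.g. months 1-60 before the interview).

def segment_contribution(start_in_window, end_in_window,
                         died_in_segment, censored_by_interview):
    if not start_in_window and not end_in_window:
        return (0.0, 0.0)                              # no overlap with window
    if start_in_window and end_in_window:
        return (1.0 if died_in_segment else 0.0, 1.0)  # fully observed
    if censored_by_interview and died_in_segment:
        return (1.0, 1.0)    # death observed in full; risk raised to one unit
    # segment straddles a boundary: half the death (if any) and half the risk
    return (0.5 if died_in_segment else 0.0, 0.5)

# Cumulation across children: with weight wt = v005/1000000 for each child,
# each q is (sum of wt * death contributions) / (sum of wt * risk contributions).
```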
Following these rules, it is possible to calculate the total number of deaths and the total risk for each age interval, 1 through 8, then calculate each of the 8 q’s as the ratio of deaths to risk, and then to calculate the five standard rates.

3.2. The statistical model for under‐five mortality rates

We now re‐frame the calculation of the under‐five rates as a statistical model, within Stata. We will emphasize similarities to, and differences from, the calculation of the fertility rates. Perhaps the most important difference is that the under‐five mortality rates are not rates, but are probabilities (more precisely, estimated probabilities). We will perpetuate this unfortunate misnomer by sometimes using the “rate” terminology.

For fertility, the individual‐level outcome or numerator is a count of births. For mortality, the outcome is a binary variable, usually 0. If the child dies within an age interval, the outcome is 1 (with an exception to be noted below). For fertility, the individual‐level denominator is exposure, the months or years lived within the combination of age and time. For mortality, the denominator is 0 or 1 (with an exception to be noted below). As described above, we use the term “risk” to refer to the individual‐level denominator. If the numerator is 1, then the denominator must be 1. If the numerator is 0, then the denominator may be 0 or 1.
The simplest of the under‐five mortality rates is the NNMR, which is the probability of dying within a month after birth (ignoring the factor of 1000). In the data files, these are the births with b7=0 (b7 is the completed months of age at death), for children who were born within the specified time interval and
died. The standard way to estimate the probability for a subpopulation would be with a logit regression
with no covariates, in which a binary variable d is 1 if the birth in the interval of time resulted in a neonatal death, and 0 if it did not. Following specification of the weights and design effects with svyset, the Stata glm command to calculate q_nnmr is
svy: glm d, family(binomial) link(logit)
This model will produce a single coefficient, which we may call b0, that is the logit of the estimated probability of a neonatal death. The probability will be the antilogit of b0, i.e. exp(b0)/[1+exp(b0)]. We can obtain a 95% confidence interval for q_nnmr by calculating the antilogits of the lower and upper bounds of the confidence interval for b0, which are given in the output and are also in a saved matrix r(table). This procedure will be described generically as the Single Probability Model.

In the description of DHS calculations given above, it was noted that an age interval may straddle two time periods. This is not an issue for neonatal mortality, because it only involves a single month of age, but it is a potential issue for all of the other under‐five age intervals. For example, the time interval when a child was age 12‐23 months may be partly within the five years before the survey and may partly be earlier. DHS has traditionally allocated half of the numerator and half of the denominator to each of the two time intervals. That is, for the child just described, the risk for age 12‐23 months would be 0.5 in each time interval. If the child died, the outcome would be 0.5 in each interval.

This allocation poses a problem for the statistical model, however, because a binary outcome can only take two values, which we identify with 0 (survived) and 1 (died). In generalized linear models, particularly as implemented in Stata, the outcome and the risk are not restricted to the values 0 and 1. The only relevant restrictions are that the risk must be positive and the
outcome must not be greater than the risk. In Stata, the specification for a binomial model is generally
“family(binomial)” but this option can be expanded to be “family(binomial risk)”; the default value of risk is risk=1. We therefore drop any age intervals with risk=0 and allow risk to take the value 0.5 as well as 1, and allow the outcome to take the values 0, 0.5, and 1.

For the fertility rates, the age intervals within which births and exposure are calculated for the individual woman are ages 15‐19 through 45‐49, a total of 7 intervals spanning 35 years of age. For the mortality rates, the age intervals within which death and risk are calculated for the individual child are in months, rather than years, specifically months 0, 1‐2, 3‐5, 6‐11, 12‐23, 24‐35, 36‐47, and 48‐59, a total of 8 intervals spanning 5 years of age.

We estimate the 8 underlying age‐specific probabilities as follows. Using the file of children from the birth histories (the “BR” file of all children, not the reduced “KR” file of children born in the past five years, because children born earlier may have exposure and deaths in the past five years), we construct eight numerators d_1 through d_8 and eight denominators r_1 through r_8 that are consistent with the time interval, usually the past five years but potentially another interval of years before the survey or calendar years. The d’s and r’s may take the values 0, 0.5, or 1, subject to r>0 and d ≤ r in each interval. The age intervals are then stacked, using the Stata command “reshape long d_ r_, i(childid) j(age)”. “age” is a variable that is constructed within the command and that identifies the 8 age intervals, and “childid” is a child‐specific id code, constructed before executing this command. Lines with r=0 are
dropped. There may be up to eight lines for a single child in the stacked file, in contrast with a maximum
of two lines for a woman in the stacked file of births, because the age intervals for children are so
narrow. The age variable is then converted to eight indicator variables, one for each age interval, called age_1 through age_8, with exactly the same lines as for fertility rates (“xi, noomit i.age” and “rename _I* *”). An appropriate version of “svyset” is specified to account for the weights, clustering, and stratification in the survey design. We replace d_ with d and r_ with r. The specification of the model is
svy: glm d age_*, nocons family(binomial r) link(logit)
This is completely analogous to the fertility model. Instead of poisson error we have binomial error, with a risk that is allowed to be 0.5, not just 1. The link function is the logit rather than the log. (The canonical link function for poisson error is the log, and for binomial error it is the logit. Those link functions are the defaults for glm models in Stata and can be omitted from the respective commands.)

With this specification we obtain 8 coefficients that are the logits of the 8 underlying probabilities or q’s. The point estimate of a q is obtained as the antilogit of the coefficient. That is, for specific coefficient b, q=exp(b)/[1+exp(b)]. The r(table) matrix saved by Stata will include the lower and upper bounds of a 95% confidence interval for each parameter. By applying the antilogit transformation to those numbers, we
obtain the lower and upper bounds of a 95% confidence interval for each q. This model can be
generically described as the Stacked Probability Model. Stata also saves a vector e(b), renamed B, that contains the eight coefficients, and a matrix e(V), renamed S, that is the variance‐covariance matrix for these coefficients. With the help of the delta method, this matrix can be used to obtain confidence intervals of various functions of the q’s. However, the 5 q’s that appear in DHS reports and on STATcompiler have a more complex relationship with the 8 underlying q’s than, say, the TFR has to the 7 ASFRs. The 5 rates that appear in the DHS reports are linked to the 8 underlying rates as follows:
q_nnmr = q1
q_pnnmr = q_imr ‐ q_nnmr
q_imr = 1‐(1‐q1)*(1‐q2)*(1‐q3)*(1‐q4)
q_cmr = 1‐(1‐q5)*(1‐q6)*(1‐q7)*(1‐q8)
q_u5mr = 1‐(1‐q1)*(1‐q2)*(1‐q3)*(1‐q4)*(1‐q5)*(1‐q6)*(1‐q7)*(1‐q8)

We can illustrate the steps for calculating a confidence interval for the Infant Mortality Rate. Ignoring the factor of 1000, the probability of dying before the first birthday can be called q_imr. The function for which we would construct a symmetric interval is F=logit(q_imr). In terms of the underlying q’s and b’s,
only the first four age intervals are relevant, because q_imr=1‐(1‐q1)*(1‐q2)*(1‐q3)*(1‐q4). As with the
fertility rates, the standard error of F is calculated as the square root of DSD’, where S is the 8x8 variance‐covariance matrix of the coefficients and D is a 1x8 vector of partial derivatives of F with respect to the coefficients. In the case of the IMR, elements 5, 6, 7, and 8 of D will be 0. The first four elements of D can be shown to be simply qi/q_imr, for i=1, 2, 3, and 4. The antilogits of the ends of a confidence interval for F will be the ends of the confidence interval for q_imr. The same steps are applied to the other 4 rates that appear in the DHS reports.

Note that the NNMR (ignoring the factor of 1000, labelled q_nnmr) is just the first of the 8 underlying rates, q1, rather than a compound rate. The general approach for the confidence interval described in the previous paragraph (with the delta method) applies to that probability but is equivalent to simply constructing the confidence interval as the antilogit of the lower and upper bounds of the confidence interval for the first coefficient that is provided by T=r(table).

The confidence interval for q_pnnmr, which refers to months 1‐11, is handled slightly differently. The definition of this probability is q_pnnmr=q_imr‐q_nnmr, the difference between the probability for the first year and the first month. Rather than develop a different procedure for calculating this confidence interval, we note that q_pnnmr is approximately the same as q_approx=1‐(1‐q2)*(1‐q3)*(1‐q4). We apply the above approach to get the lower and upper bounds of a confidence interval for q_approx and then displace it by an amount q_pnnmr‐q_approx to approximate the confidence interval for q_pnnmr.

The model described for the under‐five probabilities will be referred to as the Compound Probability Model.
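The delta-method steps for the IMR can be sketched numerically. The Python below is our own illustration, with hypothetical stand-ins for Stata's e(b) and e(V); real values would come from the fitted model:

```python
# Delta-method confidence interval for q_imr from the logit coefficients.
import math

def antilogit(x):
    return math.exp(x) / (1.0 + math.exp(x))

def imr_ci(B, S, z=1.96):
    """B: 8 logit coefficients; S: their 8x8 covariance matrix."""
    q = [antilogit(b) for b in B]
    q_imr = 1.0 - (1 - q[0]) * (1 - q[1]) * (1 - q[2]) * (1 - q[3])
    # Partial derivatives of F = logit(q_imr): D_i = q_i/q_imr for i<=4, else 0
    D = [q[i] / q_imr if i < 4 else 0.0 for i in range(8)]
    se_F = math.sqrt(sum(D[i] * S[i][j] * D[j]
                         for i in range(8) for j in range(8)))
    F = math.log(q_imr / (1.0 - q_imr))
    return q_imr, antilogit(F - z * se_F), antilogit(F + z * se_F)

# Hypothetical coefficients and a diagonal covariance matrix:
B = [-4.0] * 8
S = [[0.01 if i == j else 0.0 for j in range(8)] for i in range(8)]
q_imr, lo, hi = imr_ci(B, S)
```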
4. Adult and Maternal Mortality Rates
We will discuss the following adult and maternal mortality rates:
Adult Male Mortality Rate (AMMR)
Adult Female Mortality Rate (AFMR)
Maternal Mortality Rate (MMRate)
Maternal Mortality Ratio (MMRatio)
DHS reports also include estimates of the probability of dying between ages 15 and 50, 35q15, and the lifetime risk of a maternal death, LTR, which can be calculated alongside these rates. The age‐specific probabilities of dying are estimated from the age‐specific death rates, for five‐year intervals of age, with q=5*m/(1+2.5*m) and 35q15 = 1 ‐ (1‐5q15)*…*(1‐5q45). The LTR is calculated as LTR = 1 ‐ (1‐MMRatio)^TFR. (Other formulas exist for the LTR but this is the one DHS currently uses.)

In the most recent surveys, information is obtained about whether a sibling death was due to violence or accident. Deaths during or soon after pregnancy, including deaths due to violence or accident, are described as “pregnancy‐related deaths” in the sense of timing; the term “maternal deaths” refers to deaths during pregnancy that are not due to violence or accident. The description of maternal deaths in this section can easily be modified to take this refinement into account.

4.1. Calculating adult and maternal mortality rates with DHS data

Section 4.1 will be extracted almost verbatim from the current version of the Guide to DHS Statistics (Rutstein and Rojas 2006). A revision of this document will appear in 2018.

The DHS maternal mortality module questionnaire collects information from respondents … about all of their siblings born to the same mother, starting with the oldest. For living siblings, the date of birth in the recode data file is calculated by subtracting the age from the date of interview. For dead siblings, the date of birth is calculated by subtracting the sum of the responses on age at death and the number of years ago the death occurred from the date of interview. To calculate the date of death, the number of years ago the death occurred is subtracted from the date of interview. After these calculations, month of date of birth and date of death are assigned by subtracting a random number between 0 and 11 and making sure that the birth order and minimum birth intervals are maintained.

However, the age distribution of siblings is very different from the age distribution of the population, since only eligible women and men can report on their siblings. For example, if a girl or boy is sixteen years old and has only siblings younger than 15 years, she or he will not be reported. The same holds true at the upper end of the eligible age range. Thus the age distribution of siblings is a curve with minimums at the ends of the eligible age range of respondents and a maximum at about the midpoint of the eligible age range (30 to 35 years). In order to properly calculate general or total rates, age‐specific rates must be adjusted to a more representative age distribution. The distribution of respondents is used for this adjustment.

One might think that the calculation of mortality rates is biased because the (living) respondent is not
included. Similarly, people with no siblings are not included since there is no one to report on them.
However, it has been shown by German Rodriguez that these two potential biases cancel each other
out, under the assumption that mortality rates are unrelated to the size of the sibship.
Another important issue is location. DHS does not collect information on the residence of siblings who died, nor on the residence of siblings, living or dead, during the exposure period. The residence at the time of interview of respondents is not necessarily the same as that of their siblings. Therefore, DHS usually does not publish adult mortality rates by area.

For each dead sibling, age at death is determined directly from the response by classification into 5‐year age groups. Period of death is determined by using the sibling’s date of death. Deaths at ages less than 15 years or more than 49 years are not tabulated, nor are deaths occurring earlier or later than the period.
Once the numerators and denominators are properly established, age‐specific mortality rates are obtained by dividing the numerators by the corresponding denominators and multiplying by 1000.
The general mortality rate for age 15‐49 is obtained by multiplying the age‐specific mortality rates by the proportion of respondents in the five‐year age group and then summing the age‐distribution‐adjusted mortality rates.

Once the numerators and denominators are properly established, age‐specific maternal mortality rates are obtained by dividing the numerators by the corresponding denominators and multiplying by 100,000. The total maternal mortality rate (for age 15‐49) is obtained by multiplying the age‐specific mortality rates by the proportion of respondents in the five‐year age group and then summing the age‐distribution‐adjusted maternal mortality rates. The total proportion of deaths that are maternal is calculated by dividing the total maternal mortality rate by the general 15‐49 adult mortality rate of women. The maternal mortality ratio is calculated by dividing the total maternal mortality rate by the general fertility rate … for the period and is expressed per 100,000 births by multiplying the quotient by 100,000.

4.2. The statistical model for adult and maternal mortality rates

We now describe the statistical models that are used to calculate the adult death rates for women age 15‐49 and men age 15‐49, both of which will be referred to as 35m15; the maternal death rate for women age 15‐49, referred to as MMRate; and the maternal mortality ratio, MMRatio. The standard reference interval is the 7 years, or 84 months, prior to the woman’s month of interview. The main reasons for selecting a seven‐year interval are (a) to increase the statistical stability of the estimate by accumulating a relatively large number of deaths, particularly maternal deaths, and (b) to minimize the impact of the heaping of deaths, in most surveys, at “5 years ago” or “10 years ago”. Each woman in the survey is asked about her sisters and brothers, and for each of these siblings the data file includes a cmc for the sibling’s birth and, if the sibling died, the cmc of death.
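Given the cmc of birth and, where applicable, the cmc of death, a sibling's months of exposure to each five-year age interval within the 84-month window can be sketched as follows. This is illustrative Python; the function and the inclusive-month conventions are our own reading of the rules in this section:

```python
# e[0]..e[6] are months of exposure to ages 15-19, ..., 45-49 within the
# window of (by default) 84 months before the month of interview; the month
# of interview itself is excluded, and the month of death is counted.
# cmc values count months since January 1900: cmc = 12*(year-1900) + month.

def exposure_by_age_group(birth_cmc, interview_cmc, death_cmc=None, window=84):
    win_lo = interview_cmc - window      # earliest month in the window
    win_hi = interview_cmc - 1           # month 1 before the interview
    end = min(win_hi, death_cmc) if death_cmc is not None else win_hi
    e = [0] * 7
    for i in range(7):
        age_lo = birth_cmc + 12 * (15 + 5 * i)       # enters the age group
        age_hi = birth_cmc + 12 * (20 + 5 * i) - 1   # last month in the group
        lo, hi = max(age_lo, win_lo), min(age_hi, end)
        e[i] = max(0, hi - lo + 1)
    return e

# A sibling born in cmc 0 and reported in an interview at cmc 480 (age 40):
e_alive = exposure_by_age_group(0, 480)
```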
In many cases the month of birth or death is not reported directly; the responses include combinations of year of birth, current age if living, age at death (in years), or years since death, and the month is imputed at random within a range that is consistent with the information that has been provided. The inclusion of a cmc of death for siblings removes the potential ambiguity about time that is inherent in the information available for children.

DHS procedures for adult mortality are based on the estimation of occurrence/exposure rates, rather than probabilities of dying. The adult mortality rates are actually entirely consistent with the calculation
of fertility rates for women, rather than the calculation of mortality rates (actually probabilities) for
children. For each respondent we construct a set of records, one for each sibling, whether alive or dead, who had any exposure to ages 15‐49. For each of these siblings, we then construct sets of 7 numerators and 7 denominators, corresponding with the seven five‐year age intervals 15‐19, 20‐24, …, 45‐49. The numerators may be referred to as d_1 through d_7. The vast majority of these d’s have the value 0, but if a sibling died in the age interval, the value is 1. The denominators are the number of months of exposure to the combination of age and time. If a sibling died within a specific age interval, the exposure is 0 after the cmc of death. The sibling does, however, have a month of exposure for the month of death. The
denominators may be referred to as e_1 through e_7. For women siblings, i.e. sisters, we construct an additional set of numerators md_1 through md_7. If the sister died within age interval i and that death is classified as maternal, then md_i=1. Otherwise, all of the md’s are 0.

In the file just described, with one record per sibling, each record includes an id code for the respondent, called womanid, an id code for the sibling nested within the id code for the woman, and the sex of the sibling (with the conventional coding of 1 for male, 2 for female). It would be possible to construct the rates from this file. However, because we have no covariates other than sex that are specific to the siblings, very little is lost by moving to a file in which the respondents are the units of
analysis. This is done with a collapse command: “collapse (sum) d_* md_* e_*, by(womanid sex)”. This
command will produce (after eliminating respondents with no sisters and/or no brothers) a file in which all of the occurrence/exposure information associated with a woman’s sisters will appear in one record and all such information about a woman’s brothers will appear in one record. The sisters’ information is thus summarized on one record per respondent in terms of three sets of seven numbers. We retain the symbols d, md, and e that were used for an individual sibling, but they now refer to subtotals for sets of siblings. The deaths to sisters in the seven age groups are given with d_1 through d_7. The maternal deaths are given with md_1 through md_7. The collective months of exposure are given with e_1 through e_7. The age intervals are then stacked with a “reshape long” command and indicator variables age_1 through age_7 are constructed. The brothers’ information is summarized in a similar way but without md_1 through md_7. Each record includes d, md, e, lne=ln(e/12) and the age indicator variables.

The DHS rates are then obtained with four applications of the Compound Rate Model, as follows.

The Adult Male Mortality Rate: Using the brothers’ file, the age‐specific adult male mortality rates for ages 15‐19 through 45‐49 are obtained with
svy: glm d age_*, family(poisson) link(log) offset(lne) nocons
The age‐specific rates are given in DHS reports but primary interest is in the overall rate 35m15. 35m15 is obtained by weighting the age groups according to the age distribution of men age 15‐49 in the file of men if there is a survey of men; if not, by the age distribution of men age 15‐49 in the household survey. (Usually the two age distributions are very close, but it is generally expected that the age reports in the MR file are slightly better than in the PR file. The distributions include weights mv005 in the MR file or hv005 in the PR file.)
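The age-distribution weighting of the age-specific rates, and the conversion of five-year m's into 35q15 using the formula quoted at the start of section 4, can be illustrated in Python (all numbers hypothetical; the function names are ours):

```python
# Age-standardized overall rate and 35q15 from seven age-specific rates.

def overall_rate(m, w):
    """m: rates for ages 15-19..45-49; w: population proportions summing to 1."""
    return sum(mi * wi for mi, wi in zip(m, w))

def q_from_m(m5):
    """Five-year probability of dying from a five-year rate:
    5qx = 5*m/(1 + 2.5*m)."""
    return 5.0 * m5 / (1.0 + 2.5 * m5)

def q35_15(m):
    """35q15 = 1 - (1 - 5q15) * ... * (1 - 5q45)."""
    surv = 1.0
    for m5 in m:
        surv *= 1.0 - q_from_m(m5)
    return 1.0 - surv

m = [0.002] * 7          # hypothetical constant age-specific death rates
w = [1.0 / 7.0] * 7      # hypothetical flat age distribution
```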
The Adult Female Mortality Rate: Using the sisters’ file, the age‐specific adult female mortality rates for ages 15‐19 through 45‐49 are obtained with
svy: glm d age_*, family(poisson) link(log) offset(lne) nocons
Again, the age‐specific rates are given in DHS reports but primary interest is in the overall rate 35m15. 35m15 is obtained by weighting the age groups according to the age distribution of women age 15‐49 in the file of women (the IR file). That age distribution is weighted by v005 in the IR file.

The Maternal Mortality Rate: Using the sisters’ file, the age‐specific maternal mortality rates for ages 15‐19 through 45‐49 are obtained with
svy: glm md age_*, family(poisson) link(log) offset(lne) nocons
Again, the age‐specific rates are given in DHS reports but primary interest is in the overall rate MMRate. MMRate is obtained by weighting the age groups according to the age distribution of women age 15‐49 in the file of women, exactly the same as the weights for the adult female mortality rate.

The General Fertility Rate for the same reference interval of time: A fourth application of the Compound Rate Model is required to produce a GFR for the same time interval as the MMRate. The age‐specific rates are calculated from the respondents’ birth histories, as was the GFR discussed in the last part of section 3.2, where the Compound Rate Model was introduced. There, weight wi was the proportion of the total woman‐years of exposure to ages 15‐49 that was in age interval i, where i refers to the five‐year sub‐intervals of age 15‐19 through age 45‐49. Here, the weight wi is exactly the same as for the adult female mortality rate and the maternal mortality rate, i.e. is the proportion of women age 15‐49 in the IR file who are in age group i, weighted by v005.

Confidence intervals for the four compound rates are obtained in exactly the same way that was described for the discussion of the TFR and GFR as compound rates in section 3.2. These four rates, as usually defined, and the lower and upper bounds of their respective confidence intervals, require that the numbers calculated above be multiplied by 1000.

The Maternal Mortality Ratio, or MMRatio, is calculated as the ratio of the MMRate to the GFR for the same reference interval of time. That is, MMRatio=MMRate/GFR. Taking the log of both sides, ln(MMRatio) = ln(MMRate) – ln(GFR). MMRate and GFR are calculated separately from the sibling histories and the birth histories, respectively.
It is possible that an individual respondent’s contributions to these rates are not statistically independent, but if we make the simplifying assumption that they are independent, then the standard error of ln(MMRatio) will be the square root of the sum of the squares
of the standard errors of ln(MMRate) and ln(GFR), which have already been calculated as part of the
Compound Rate Model for MMRate and GFR. Thus a 95% confidence interval for ln(MMRatio) (that is, for the underlying parameter) can be calculated by adding and subtracting 1.96 standard errors. Exponentiating the lower and upper ends of this interval will produce a 95% confidence interval for MMRatio (that is, for the underlying parameter). MMRatio is generally very small and has a multiplier of 100,000.
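Under the independence assumption, the confidence interval for the MMRatio can be sketched as follows (illustrative Python; the inputs are hypothetical point estimates and log-scale standard errors, not DHS results):

```python
# Confidence interval for MMRatio = MMRate/GFR, combining log-scale SEs.
import math

def mmratio_ci(mmrate, se_ln_mmrate, gfr, se_ln_gfr, z=1.96):
    mmratio = mmrate / gfr
    se_ln_ratio = math.sqrt(se_ln_mmrate ** 2 + se_ln_gfr ** 2)
    ln_r = math.log(mmratio)
    return (mmratio,
            math.exp(ln_r - z * se_ln_ratio),
            math.exp(ln_r + z * se_ln_ratio))

# Hypothetical inputs; multiply by 100,000 for the conventional scale.
ratio, lo, hi = mmratio_ci(0.005, 0.10, 1.0, 0.05)
```

Note that the interval is symmetric on the log scale, so it is asymmetric around the point estimate itself, as is appropriate for a small positive ratio.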
5. Summary and conclusions
This paper has shown that the DHS fertility rates, under-five mortality rates, and adult and maternal mortality rates can all be subsumed within a framework of closely related statistical models that fall under the rubric of generalized linear models. The coefficients and variance-covariance matrices produced by these models can be manipulated to estimate confidence intervals that are adjusted for the survey design. Six generalized linear models were described, three "rate" models and three "probability" models, as follows:

Single Rate Model: glm b, family(poisson) offset(lne). The rate and c.i. are obtained by exponentiating the coefficient and its c.i.

Stacked Rate Model: glm b age_*, family(poisson) offset(lne) nocons. The age-specific rates and c.i.'s are obtained by exponentiating the coefficients and their c.i.'s.

SLIDE 20 20

Compound Rate Model: the saved results from the stacked rates are combined as a compound rate that is a linear combination of the age-specific rates; a confidence interval is estimated analytically.

Single Probability Model: glm d, family(binomial). The probability and c.i. are obtained as the antilogits of the coefficient and its c.i.

Stacked Probability Model: glm d age_*, family(binomial r) nocons. The probabilities and c.i.'s are obtained as the antilogits of the coefficients and their c.i.'s.

Compound Probability Model: the saved results from the stacked probabilities are combined as a compound probability; a confidence interval is estimated analytically.

In the rate models, we generically use "b" to represent an outcome that is a count and "lne" to represent the natural logarithm of years of exposure; "age_*" is a set of binary variables, one for each interval of age. In the probability models, we generically use "d" to represent a binary outcome and "r" to represent units of risk, in our case limited to the values 1 and 0.5. As written here, the glm commands do not specify the link function because we use the defaults, i.e. the log for poisson error and the logit for binomial error.

Unexpectedly, perhaps, the methods for the adult and maternal mortality rates are completely consistent with the methods for the fertility rates, rather than with the methods for the child mortality rates. Indeed, the only "rates" for which the probability models are used are the under-five mortality rates. Historically, DHS could have approached adult and maternal mortality in terms of estimating probabilities (q's), as was done for under-five mortality, rather than rates (m's). As it is, an important summary probability of adult mortality, specifically 35q15 for women and men, requires a step to estimate five-year q's from five-year m's.
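The m-to-q step can be sketched under a constant-hazard assumption within each five-year interval; the paper does not spell out the exact DHS conversion here, so this Python sketch, with hypothetical rates, is only an illustrative approximation.

```python
import math

# Hypothetical five-year age-specific mortality rates (m's) for women
# aged 15-19 through 45-49 -- illustrative values, not DHS estimates.
ms = [0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.011]

# Constant-hazard conversion of each five-year m to a five-year q: the
# probability of dying in the interval given survival to its start.
qs = [1 - math.exp(-5 * m) for m in ms]

# The inverse step recovers each m from its q: m = -ln(1 - q) / 5.
ms_back = [-math.log(1 - q) / 5 for q in qs]

# 35q15: the probability of dying between exact ages 15 and 50,
# built up from the seven five-year probabilities.
q35_15 = 1 - math.prod(1 - q for q in qs)

print(round(q35_15, 4))
```

A convenient check on the arithmetic: with constant hazards the product form collapses to 1 - exp(-5 * sum of the m's).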
One can imagine an alternative in which the five-year q's would have been estimated first, and then the m's would have been estimated from the q's. For children, we note that one website that presents under-five mortality using DHS data (in part) does so in terms of 5m0 rather than 5q0 (http://www.fsedata.stanford.edu/paper‐1).

The "Single" models are simply special cases of the respective "Stacked" models, with a single age interval. The two "Compound" models are simply extensions of the respective "Stacked" models. A
compound rate or probability is constructed from the specific rates or probabilities that are generated from the Stacked model, and the standard error of the log of a compound rate R is estimated as the square root of DSD', as described, where S is the variance-covariance matrix of coefficients produced by the Stacked model. All of the DHS rates (other than the under-five mortality rates) are linear combinations of the specific rates, and the vector D of partial derivatives is very easy to specify.

SLIDE 21 21

The element of D for age i is wi*ri/R, where ri is the specific rate for age i and wi is the multiplier of ri such that R is the sum of wi*ri. The under-five compound rates, which are actually probabilities, are not linear combinations of the age-specific rates, but have the form Q = 1-(1-q1)*(1-q2)*... . The standard error of the logit of Q is estimated as the square root of DSD'. Here the element of D for age i is simply qi/Q.

The core of what is proposed here is the stacking of age intervals, followed by application of poisson and binomial versions of the generalized linear model. The estimates produced by these models agree exactly with the DHS estimates. For the purpose of estimating the rates, the statistical models provide a more systematic and unified methodology. The standard errors produced by the models appear to be almost exactly the same as the DHS estimates using a jackknife, except that our confidence intervals are on a log or logit scale and are slightly displaced, with the midpoint of our interval being slightly above the point estimate.

All of the models can be applied to subgroups in the sample. An easy way to achieve this is to define a numeric variable, for example "dummy", that is 1 for all cases in the subgroup and otherwise 0. Then, within the estimation command, "svy:" would be replaced by "svy, subpop(dummy):". If using the entire sample, the subpop option can be omitted or can be nullified by defining dummy to be 1 for all cases in the sample.

A limitation of Stata's svyset command when using DHS data is that the DHS data files include only a single weight for each individual respondent. These weights are the same for all cases in a cluster. In the typical DHS sample design, PSUs (clusters or census enumeration areas) are sampled with probability proportional to size (the number of households in the most recent census). Then approximately a fixed number of households, such as 30, are randomly drawn from each cluster.
This is essentially a self-weighting design, but it is generally found during the listing of households in the cluster that the number in the sampling frame is out of date or inaccurate. Nonresponse must also be taken into account, and the sampling fractions are different for the different strata (usually the combinations of region and urban-rural residence). For these reasons, weights are calculated and must be used in order to obtain unbiased estimates of population parameters.

Most DHS outcomes could be viewed in a multi-level framework. Ideally we would be able to separate the sampling fractions, and their inverses, the weights, according to the selection of clusters and the selection of households within clusters. This is not possible. The information required to distinguish the two sampling fractions is destroyed after the combined weight has been calculated. The reason for this practice is that if the cluster weight were provided, along with other information about the clusters, it would theoretically be possible for a user to identify at least some of the sample clusters, leading to a violation of the guarantee of confidentiality given to DHS respondents.

The exact specification of the svyset command could plausibly vary. For example, the specification of the PSU could be the cluster or it could be a lower-level unit such as the household. When information about multiple age intervals is obtained for the same woman or sibling or child, then perhaps the PSU could be the woman or sibling or child. Fortunately, comparisons of alternatives indicate negligible differences across these options. (There is also negligible variation across the "singleunit" options.)

It is hoped that this paper can contribute to a general process by which demographic methods are increasingly being approached from a statistical perspective, a perspective that was completely absent when the rates were originally developed.
This framework may also help to bring out the underlying similarities in different sets of rates.
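As a closing illustration, the DSD' computation at the center of this framework can be sketched numerically. The weights, age-specific rates, and covariance matrix below are hypothetical (three age groups rather than seven, for brevity), but the arithmetic follows the definitions above: D has elements wi*ri/R, and the standard error of ln(R) is the square root of DSD'.

```python
import math

# Hypothetical values -- illustrative only, not DHS estimates.
w = [0.40, 0.35, 0.25]           # fixed age-distribution weights
r = [0.10, 0.15, 0.08]           # age-specific rates from a Stacked model

# Compound rate: a linear combination of the age-specific rates.
R = sum(wi * ri for wi, ri in zip(w, r))

# Hypothetical variance-covariance matrix S of the log-scale
# coefficients produced by the Stacked Rate Model.
S = [[0.0040, 0.0005, 0.0002],
     [0.0005, 0.0030, 0.0004],
     [0.0002, 0.0004, 0.0060]]

# D: partial derivatives of ln(R) with respect to the coefficients;
# the element for age i is wi*ri/R.
D = [wi * ri / R for wi, ri in zip(w, r)]

# Variance of ln(R) is the quadratic form D S D'; take the square root.
n = len(D)
var_lnR = sum(D[i] * S[i][j] * D[j] for i in range(n) for j in range(n))
se_lnR = math.sqrt(var_lnR)

# 95% confidence interval for R, computed on the log scale.
lo = math.exp(math.log(R) - 1.96 * se_lnR)
hi = math.exp(math.log(R) + 1.96 * se_lnR)
print(round(R, 4), round(lo, 4), round(hi, 4))
```

Because R is a sum of weighted rates, the elements of D sum to 1; for the under-five probabilities the same quadratic form applies on the logit scale, with qi/Q in place of wi*ri/R.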