Constructing epidemic curves is a common and very important practice in epidemiology. Epidemic curves are used to monitor disease occurrence, to detect outbreaks, to generate hypotheses about the cause of an outbreak, to monitor the impact of intervention efforts, and to predict the course of an epidemic. Simple epidemic curves can be constructed using a spreadsheet. For spreadsheet examples, see the excellent tutorial on constructing epidemic curves by the North Carolina Center for Public Health Preparedness.
However, if you need to construct numerous epidemic curves because you are monitoring several geographic sites; to update frequently rapidly evolving data; to manipulate efficiently large data sets; to plot high-quality, publication-ready graphics; or to automate the production of plots, then R is available to do the job. In the example that follows we take you through a step-by-step, iterative process of constructing epidemic curves for West Nile Virus infections in California as of December 14, 2004. Once the R code is created it becomes trivial to re-run the code with updated data.
The R code that follows appears as it would in your text editor. If you have 'epitools' installed and loaded in R on your computer, you can paste this code into your R program.
Step 1. Read in data set and try
epicurve.datesfunction.library(epitools) wdat <- read.table("http://www.medepi.net/data/wnv/wnv2004-12-14.txt", sep = ",", header = TRUE, as.is = TRUE) ##fig1 epicurve.dates(wdat$date.onset)![]()
Step 2. Change default axis labeling.
##suppress default axis labels and then re-label ##fig2 xv <- epicurve.dates(wdat$date.onset, axisnames = FALSE) axis(1, at = xv$xvals, labels = xv$cmday, tick = FALSE, line = 0) axis(1, at = xv$xvals, labels = xv$cmonth, tick = FALSE, line = 1)![]()
Step 3. Using
colors.plotfunction, add some color to this plot. (See next example to see howcolors.plotfunction works.) Also, suppress default x-axis labels and then specify the x-axis labels precisely.##fig3 xv <- epicurve.dates(wdat$date.onset, axisnames = FALSE, col = "royalblue1") dd <- xv$cmday==1 | xv$cmday==15 dd2 <- xv$cmday==15 axis(1, at = xv$xvals[dd], labels = xv$cmday[dd], tick = FALSE, line = 0) axis(1, at = xv$xvals[dd], labels = xv$cmonth[dd], tick = FALSE, line = 1)![]()
Step 4. Try
epicurve.weeksfunction, stratify by clinical syndrome, add default legend, and add titles, etc.##fig4 xv <- epicurve.weeks(wdat$date.onset, strata = wdat$syndrome, min.date = "2004-01-01", max.date = "2004-12-31", axisnames = FALSE, legend.text = TRUE) axis(1, at = xv$xvals, labels = xv$cweek, tick = FALSE, line = -0.5, cex.axis = 0.8) axis(1, at = xv$xvals, labels = xv$cmonth, tick = FALSE, line = 0.5, cex.axis = 0.8) axis(1, at = median(xv$xvals), labels = "Disease Week & Calendar Month, 2004", tick = FALSE, line = 1.5) title(main = "West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004", ylab = "Cases")![]()
Step 5. Use
colorbrewer.displayfunction to select sequential colors for 3 categories (similar to the sequential colors above).colorbrewer.display(3, "seq")![]()
Step 6. Use
colorbrewer.palettefunction to create a palette and plot epidemic curve. Try red-like colors first.seq3.r <- colorbrewer.palette(3, "seq", "r") ##red-like seq3.a <- colorbrewer.palette(3, "seq", "a") ##blue-like ##Re-level syndrome categories using 'factor' function ##fig5 syndrome <- factor(wdat$syndrome, levels = c("WNND", "WNF", "Unknown")) xv <- epicurve.weeks(wdat$date.onset, strata = syndrome, min.date = "2004-01-01", max.date = "2004-12-31", axisnames = FALSE, col = seq3.r[3:1], legend.text = TRUE) title(main = "West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004", ylab = "Cases") axis(1, at = xv$xvals, labels = xv$cweek, tick = FALSE, line = -0.5, cex.axis = 0.8) axis(1, at = xv$xvals, labels = xv$cmonth, tick = FALSE, line = 0.5, cex.axis = 0.8) axis(1, at = median(xv$xvals), labels = "Disease Week & Calendar Month, 2004", tick = FALSE, line = 1.5)![]()
Step 7. Try blue-like colors and add annotations to the plot.
##fig6 syndrome <- factor(wdat$syndrome, levels = c("WNND", "WNF", "Unknown")) xv <- epicurve.weeks(wdat$date.onset, strata = syndrome, min.date = "2004-01-01", max.date = "2004-12-31", axisnames = FALSE, col = seq3.a[3:1], legend.text = TRUE) title(main = "West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004", ylab = "Cases") axis(1, at = xv$xvals, labels = xv$cweek, tick = FALSE, line = -0.5, cex.axis = 0.8) axis(1, at = xv$xvals, labels = xv$cmonth, tick = FALSE, line = 0.5, cex.axis = 0.8) axis(1, at = median(xv$xvals), labels = "Disease Week & Calendar Month, 2004", tick = FALSE, line = 1.5) ##add arrows and annotations zdates <- as.week(c("2004-02-24", "2004-04-14", "2004-05-17", "2004-06-20")) znames <- c("+ Bird\n2/24","+ Mosquito\n4/14", "+ Chicken\n5/17","+ Horse\n6/20") x0 <- c(xv$xvals[xv$cstratum==zdates$stratum[1]], xv$xvals[xv$cstratum==zdates$stratum[2]], xv$xvals[xv$cstratum==zdates$stratum[3]], xv$xvals[xv$cstratum==zdates$stratum[4]]) y0 <- c(10,20,30,40) x1 <- c(xv$xvals[xv$cstratum==zdates$stratum[1]], xv$xvals[xv$cstratum==zdates$stratum[2]], xv$xvals[xv$cstratum==zdates$stratum[3]], xv$xvals[xv$cstratum==zdates$stratum[4]]) y1 <- c(0, 0, 6, 18) arrows(x0, y0, x1, y1, length = 0.15) text(x1, y0 + 5, znames, cex=.8)![]()
Step 8.Try
epicurve.monthsfunction. Which epidemic curve do you like best?##fig7 syndrome <- factor(wdat$syndrome, levels = c("WNND", "WNF", "Unknown")) xv <- epicurve.months(wdat$date.onset, strata = syndrome, min.date = "2004-01-01", max.date = "2004-12-31", axisnames = FALSE, col = seq3.a[3:1], legend.text = TRUE) title(main = "West Nile Virus Human Cases Reported in California by Month as of December 14, 2004", ylab = "Cases") axis(1, at = xv$xvals, labels = xv$cmonth, tick = FALSE, line = -0.5, cex.axis = 0.8) axis(1, at = median(xv$xvals), labels = "Calendar 2004", tick = FALSE, line = 0.5)![]()
Constructing epidemic curves can be challenging when the illness onset times are measured in 1-hour or 1/2-hour units and the events span several days. The Center for Disease Control and Prevention (CDC) Oswego outbreak module has been used for many years to train epidemiologists in conducting an outbreak investigation of a foodborne illness. Although this is considered an introductory-level training module, constructing this epidemic curve using a computer requires some work. To convince yourself, load this data into your favorite data management software/statistical package and generate the plots belows. This is the raw data as it appears in CDC publications. You'll have to do some data cleaning.
To make the creation of these types of plots easier, we wrote the 'epicurve.hours' function.
The R code that follows appears as it would in your text editor. If you have 'epitools' installed and loaded in R on your computer, you can paste this code into your R program.
Step 1. Read in and prepare data.
library(epitools) odat <- read.table("http://www.medepi.net/data/oswego/oswego.txt", sep = "", header = TRUE, as.is = TRUE) odat[1:10,] ## create vector with meal date and time mdt <- paste("4/18/1940", odat$meal.time) ## convert into standard date and time meal.dt <- strptime(mdt, "%m/%d/%Y %I:%M %p") ## create vector with onset date and time odt <- paste(paste(odat$onset.date,"/1940",sep=""), odat$onset.time) ## convert into standard date and time onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")Step 2. Plot epidemic curve using 1-hour categories.
##using epicurve.hours function col3seq.d <- rev(colorbrewer.palette(3,"seq","d")) xv <- epicurve.hours(onset.dt, "1940-04-18 12:00:00", "1940-04-19 12:00:00", axisnames = FALSE, axes = FALSE, ylim = c(0,11), col = col3seq.d[2], segments = TRUE, strata = odat$sex) hh <- xv$chour12==3 | xv$chour12== 6 | xv$chour12== 9 hh2 <- xv$chour12==12 hh3 <- xv$chour12==1 hlab <- paste(xv$chour12,xv$campm2,sep="") hlab2 <- paste(xv$cmonth,xv$cmday) axis(1, at = xv$xval[hh], labels = xv$chour12[hh], tick = FALSE, line = -.2) axis(1, at = xv$xval[hh2], labels = hlab[hh2], tick = FALSE, line = -.2) axis(1, at = xv$xval[hh3], labels = hlab2[hh3], tick = FALSE, line = 1.0) axis(2, las = 1) title(main = "Figure 1. Cases of Gastrointestinal Illness by Symptom Onset Times (Hour Category) Oswego County, New York, April 18-19, 2004", xlab = "Time of Onset", ylab = "Cases")![]()
Step 2. Plot epidemic curve using 1/2-hour categories.
##half-hour xv <- epicurve.hours(onset.dt, "1940-04-18 12:00:00", "1940-04-19 12:00:00", axisnames = FALSE, axes = FALSE, ylim = c(0,11), col = col3seq.d[2], segments = TRUE, half.hour = TRUE, strata = odat$sex) hh <- xv$chour12==3 | xv$chour12== 6 | xv$chour12== 9 hh2 <- xv$chour12==12 hh3 <- xv$chour12==1 hlab <- paste(xv$chour12,xv$campm2,sep="") hlab2 <- paste(xv$cmonth,xv$cmday) axis(1, at = xv$xval[hh], labels = xv$chour12[hh], tick = FALSE, line = -.2) axis(1, at = xv$xval[hh2], labels = hlab[hh2], tick = FALSE, line = -.2) axis(1, at = xv$xval[hh3], labels = hlab2[hh3], tick = FALSE, line = 1.0) axis(2, las = 1) title(main = "Figure 2. Cases of Gastrointestinal Illness by Symptom Onset Times (1/2 Hour Category) Oswego County, New York, April 18-19, 2004", xlab = "Time of Onset", ylab = "Cases")![]()
Selecting good colors for your graphics is crucial to optimizing the display of epidemiologic data. R has tremendous capabilities for selecting colors (try '
help.search("colors")'). However, most of us just need a broad selection from which to choose without having to specify RGB numbers. R has 657 named colors ready for action. The newcolors.plotfunction plots the 657 colors and allows you to interactively point to the colors that interest you, and then it reports the corresponding color names.To really learn how to appropriately select colors for any type of display visit Cindy Brewer's ColorBrewer site.
library(epitools) > ##plot R's 657 named colors > colors.plot()
> ##activate 'locator' to identify color names > colors.plot(locator = TRUE) ##select 5 colors from plot x y color.names 1 3 2 brown1 2 12 21 springgreen2 3 22 14 lightgoldenrod2 4 16 2 cadetblue4 5 30 1 blue4
To make valid comparisons between rates from different groups (e.g., geographic area, ethnicity), epidemiologists often adjust for differences in age distribution to remove the confounding affect of age. When the number of events or rates are very small (as is often the case for local area studies), the normal approximation method of calculating confidence intervals may give a negative number for the lower confidence limit. To avoid this common pitfall, one can approximate exact confidence intervals. The
ageadjust.directand theageadjust.indirectfunctions implement these methods (Anderson 1998).Below is a comparison of United States white male cancer mortality data for the years 1940 and 1960. We replicate the analysis provided in Selvin (2004)
References
- Anderson RN, Rosenberg HM. Age Standardization of Death Rates: Implementation of the Year 200 Standard. National Vital Statistics Reports; Vol 47 No. 3. Hyattsville, Maryland: National Center for Health Statistics. 1998, pp. 13-19. Available at http://www.cdc.gov/nchs/data/nvsr/nvsr49/nvsr49_09.pdf.
- Steve Selvin (2004). Statistical Analysis of Epidemiologic Data (Monographs in Epidemiology and Biostatistics, V. 35), Oxford University Press; 3rd edition
library(epitools)
> ##From Selvin (2004)
> ##enter data
> dth60 <- scan()
1: 141 926 1253 1080 1869 4891 14956 30888 41725 26501 5928
12:
Read 11 items
> pop60 <- scan()
1: 1784033 7065148 15658730 10482916 9939972 10563872 9114202
8: 6850263 4702482 1874619 330915
12:
Read 11 items
> dth40 <- scan()
1: 45 201 320 670 1126 3160 9723 17935 22179 13461 2238
12:
Read 11 items
> pop40 <- scan()
1: 906897 3794573 10003544 10629526 9465330 8249558 7294330
8: 5022499 2920220 1019504 142532
12:
Read 11 items
> ##calculate age-specific rates
> rate60 <- dth60/pop60
> rate40 <- dth40/pop40
>
> #create array for display
> tab <- array(c(dth60, pop60, round(rate60*100000,1), dth40, pop40,
+ round(rate40*100000,1)),c(11,3,2))
> agelabs <- c("<1", "1-4", "5-14", "15-24", "25-34", "35-44", "45-54",
+ "55-64", "65-74", "75-84", "85+")
> dimnames(tab) <- list(agelabs,c("Deaths", "Population", "Rate"),
+ c("1960", "1940"))
> tab
, , 1960
Deaths Population Rate
<1 141 1784033 7.9
1-4 926 7065148 13.1
5-14 1253 15658730 8.0
15-24 1080 10482916 10.3
25-34 1869 9939972 18.8
35-44 4891 10563872 46.3
45-54 14956 9114202 164.1
55-64 30888 6850263 450.9
65-74 41725 4702482 887.3
75-84 26501 1874619 1413.7
85+ 5928 330915 1791.4
, , 1940
Deaths Population Rate
<1 45 906897 5.0
1-4 201 3794573 5.3
5-14 320 10003544 3.2
15-24 670 10629526 6.3
25-34 1126 9465330 11.9
35-44 3160 8249558 38.3
45-54 9723 7294330 133.3
55-64 17935 5022499 357.1
65-74 22179 2920220 759.5
75-84 13461 1019504 1320.3
85+ 2238 142532 1570.2
>
> ##implement direct age standardization using 'ageadjust.direct'
> dsr <- ageadjust.direct(count = dth40, pop = pop40, stdpop = pop60)
>
> round(100000*dsr, 2) ##rate per 100,000 per year
crude.rate adj.rate lci uci
119.53 139.25 138.21 140.29
>
> ##implement indirect age standardization using 'ageadjust.indirect'
> isr <- ageadjust.indirect(count = dth40, pop = pop40,
+ stdcount = dth60, stdpop = pop60)
>
> round(isr$sir, 2) ##standarized incidence ratio
observed exp sir lci uci
71058.00 85556.89 0.83 0.82 0.84
>
> round(100000*isr$rate, 1) ##rate per 100,000 per year
crude.rate adj.rate lci uci
166.1 137.9 136.9 139.0
>