| Title: | Generalized Linear Model Data Sets |
|---|---|
| Description: | Data sets from the book Generalized Linear Models with Examples in R by Dunn and Smyth. |
| Authors: | Peter K. Dunn [cre,aut], Gordon K. Smyth [aut] |
| Maintainer: | Peter K. Dunn <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.4 |
| Built: | 2026-05-29 11:41:27 UTC |
| Source: | https://github.com/cran/GLMsData |
Physical measurements and blood measurements from high performance athletes at the AIS
data(AIS)
A data frame containing 202 observations with the following 13 variables.
Sex: the sex of the athlete: F means female, and M means male
Sport: the sport of the athlete; one of BBall (basketball), Field, Gym (gymnastics), Netball, Rowing, Swim (swimming), T400m (track, further than 400m), Tennis, TPSprnt (track sprint events), WPolo (waterpolo)
LBM: lean body mass, in kg
Ht: height, in cm
Wt: weight, in kg
BMI: body mass index, in kg per metre-squared
SSF: sum of skin folds
PBF: percentage body fat
RBC: red blood cell count, in per litre
WBC: white blood cell count, in per litre
HCT: hematocrit, in percent
HGB: hemoglobin concentration, in grams per decilitre
Ferr: plasma ferritins, in ng per decilitre
The data give measurements from high-performance athletes from the Australian Institute of Sport (AIS), for 202 athletes (102 males; 100 females) on 13 variables. Telford and Cunningham (1991) provide more information on how the data were collected.
From the paper: “The main aim of the statistical analysis was to determine whether there were any hematological differences, on average, between athletes from the various sports, between the sexes, and whether there was an effect of mass or height” (p. 789).
OzDASL, available on-line at http://www.statsci.org/data/.
Telford, R. D. and Cunningham, R. B. (1991) Sex, sport, and body-size dependency of hematology in highly trained athletes. Medicine and Science in Sports and Exercise, 23(7):788–794.
data(AIS)
summary(AIS)
The number of ant species in New England (USA)
data(ants)
A data frame containing 44 observations with the following 5 variables.
Site: an abbreviation for the site name
Srich: species richness (number of ant species); a numeric vector
Habitat: the habitat type; a factor with levels Bog and Forest
Latitude: the latitude (in decimal degrees) for the site; a numeric vector
Elevation: the elevation, in metres above sea level; a numeric vector
The data give the ant species richness (number of ant species) found in 64 square metre sampling grids, in 22 bogs and 22 forests surrounding the bogs, in Connecticut, Massachusetts and Vermont (USA). The sites span 3 degrees of latitude in New England.
N. J. Gotelli and A. M. Ellison (2002). Biogeography at a regional scale: determinants of ant species density in bogs and forests of New England. Ecology, 83, 1604–1609.
Aaron M. Ellison (2004) Bayesian inference in ecology. Ecology Letters, 7, 509–520.
data(ants)
summary(ants)
The number of apprentices migrating to Edinburgh
data(apprentice)
A data frame with 33 observations on the following 5 variables.
Dist: the distance from Edinburgh (unit unknown, presumably miles); a numeric vector
Apps: the number of apprentices moving to Edinburgh from the given county (given in row labels); a numeric vector
Pop: the population (in thousands) of the given county; a numeric vector
Urban: the degree of urbanization as measured by the percentage of the population living in urban settlements; a numeric vector
Locn: the location of the county relative to Edinburgh; a factor with levels North, South and West
The data record the number of apprentices moving to Edinburgh between 1775 and 1799 from other Scottish counties.
Andrew Lovett and Robin Flowerdew (1989) Analysis of count data using Poisson regression. Professional Geographer, 41(2), 190–198.
data(apprentice)
summary(apprentice)
The daily individual feeding rates of chestnut-crowned babblers
data(babblers)
A data frame containing 97 observations with the following 8 variables.
ObsTime: the length of observation (in decimal hours); a numeric vector
Sex: the sex of the bird; one of f (female) or m (male)
Age: the age of non-breeding group members; one of adult or yearling
Relatedness: the pedigree-based relatedness to the brood; one of 0.5 (first-order relatives), 0.25 (second-order relatives) or 0 (more distant relatives)
ChickAge: the age of the brood, in days; a numeric vector
BroodSize: the size of the brood; a numeric vector
UnitSize: the number of individuals in the unit; a numeric vector
FeedingRate: the daily individual feeding rates, in feeds per hour; a numeric vector
The data relate to a colour-ringed population of chestnut-crowned babblers in an area of the University of New South Wales Arid Zone Research Station (Fowlers Gap, western New South Wales, Australia). The study determined whether, where and how often non-breeding group members contributed to providing for nestlings by monitoring the visit rate of tagged birds during 2007 and 2008. These data are extracted from a larger data set so that there is one (randomly chosen) observation for each individual bird.
L. E. Browning, S. C. Patrick, L. A. Rollins, S. C. Griffith, and A. F. Russell (2012) Kin selection, not group augmentation, predicts helping in an obligate cooperatively breeding bird. Proceedings of the Royal Society B, 279(1743): 3861–3869. doi:10.1098/rspb.2012.1080
L. E. Browning, S. C. Patrick, L. A. Rollins, S. C. Griffith, and A. F. Russell (2012) Data from: Kin selection, not group augmentation, predicts helping in an obligate cooperatively breeding bird. Dryad Digital Repository. doi:10.5061/dryad.ff868
data(babblers)
summary(babblers)
The number of candidates in the British general election in 1992
data(belection)
A data frame with 55 observations on the following 4 variables.
Region: the region; a factor with levels EastAnglia, EastMidlands, GreaterLondon, NorthWest, Scotland, SouthEast, SouthWest, Wales, WestMidlands and YorksHumbers
Party: the political party; a factor with levels Cons, Green, Labour, LibDem and Other
Females: the number of female candidates; a numeric vector
Males: the number of male candidates; a numeric vector
The data give the number of male and female candidates in the British general election held April 9, 1992.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 374.
The data originally came from: The Independent, Friday 27th March, 1992.
data(belection)
plot(Females/(Females+Males) ~ Party, data=belection)
The number of blocks stacked by children, and the time taken
data(blocks)
A data frame with 100 observations on the following 6 variables.
Child: the child; an identifier from A to Y
Number: the number of blocks the child could successfully stack; a numeric vector
Time: the time in seconds taken for the children to make their stack of blocks; a numeric vector
Trial: the trial number on which the data were gathered (see Details); a factor with levels 1 and 2
Shape: the shape of the blocks being stacked; a factor with levels Cube and Cylinder
Age: the age of the child in completed years; a numeric vector
Children were seated at a small table, and “told” to build a tower from the blocks as high as they could. This was demonstrated for the child. The time taken and the number of blocks used were recorded. The cubes were always presented first, then cylinders. The second trial was conducted one month later.
The blocks were “half inch cubes and cylinders included in Mrs. Hailmann's Beads No. 470 of Bradley's Kindergarten Material”. Throughout the article, the children are referred to using male pronouns, but (in keeping with the custom at the time) it is unclear whether all children were males or not. However, since gender is not recorded the children may all have been boys.
The source (Johnson and Courtney 1931) gives the age in years and months. Here they have been converted to decimal years.
The means given in Table 1 in Johnson and Courtney (1931) do not agree in every case with the data given in that same table.
Buford Johnson and Dorothy Moore Courtney (1931) Tower building, Child Development, 2(2), 161–162
Judith D. Singer and John B. Willett (1990) Improving the teaching of applied statistics: Putting the data back into data analysis. The American Statistician, 44(3), 223–230.
data(blocks)
plot( Time ~ Age, data=blocks)
The number of mice embryos dead after exposure to four different doses of boric acid
data(boric)
A data frame with 107 observations on the following 3 variables.
Dose: the dose of boric acid (in percent of boric acid in feed); a numeric vector
Dead: the number of embryos dead in utero; a numeric vector
Implants: the total number of embryos; a numeric vector
Mice were fed doses of boric acid in their feed during the first 17 days of gestation; the mice were then sacrificed and the embryos examined. Boric acid is widely used in pesticides and household products.
Terra L. Slaton, Walter W. Piegorsch and Stephen D. Durham (2000) Estimation and testing with overdispersed proportions using the beta-logistic regression model of Heckman and Willis. Biometrics, 56(1), 125–133, Table 4.
J. H. Heindel, C. J. Price, E. A. Field, M. C. Marr, C. B. Myers, R. E. Morrissey, and B. A. Schwetz (1992) Developmental toxicity of boric acid in mice and rats. Fundamental and Applied Toxicology, 18, 266–277.
data(boric)
plot( Dead/Implants ~ Dose, data=boric)
The dielectric breakdown strength of electrical insulation
data(breakdown)
A data frame containing 128 observations with the following 3 variables.
Strength: the dielectric breakdown strength, in kiloVolts
Time: the time exposure in weeks; one of 1, 2, 4, 8, 16, 32, 48, or 64
Temperature: the temperature, in degrees Celsius; one of 180, 225, 250 or 275
The data come from a study of performance degradation of electrical insulation from accelerated tests. The study can be considered as an 8-by-4 factorial experiment, with four measurements for each time–temperature combination.
OzDASL, available on-line at http://www.statsci.org/data/general/dialectr.html.
Nelson, W. (1981) Analysis of performance-degradation data. IEEE Transactions on Reliability, 2, R-30, 149–155.
The Statistical Reference Datasets page: http://www.itl.nist.gov/div898/strd/nls/data/nelson.shtml.
data(breakdown)
summary(breakdown)
The data record details about the Birth to Ten study (BTT) in South Africa during 1990
data(bttstudy)
A data frame with 8 observations on the following 4 variables.
Counts: the number of subjects in the given classification; a numeric vector
Group: the group the mother belongs to; a numeric vector with values 1 (mothers not followed up) or 2 (mothers followed up five years later)
MedicalAid: whether or not the mother had medical aid; a factor with levels No and Yes
Race: the mother's race; a factor with levels Black and White
The data record details about the Birth to Ten study (BTT), performed in the greater Johannesburg/Soweto metropolitan area of South Africa during 1990. In the study, the mothers of all singleton births (to women with permanent addresses in a defined area) during a seven-week period between April and June were interviewed (a total of 4019 births). Five years later, 964 of these mothers were re-interviewed. If the mothers interviewed later are representative of the original population, the two groups should show similar characteristics. One of those characteristics is documented here: the proportion with and without medical aid.
Christopher H. Morrell (1999) Simpson's Paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3).
data(bttstudy)
summary(bttstudy)
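The comparison described above (the proportion with and without medical aid in each group, overall and within each race) can be sketched as follows. This is only an illustration, assuming the GLMsData package is installed; the object names overall and by_race are hypothetical and not part of the package.

```r
# Sketch only: runs when the GLMsData package is available.
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(bttstudy, package = "GLMsData")

  # Cross-tabulate the counts by group and medical aid,
  # first ignoring race and then separately for each race.
  overall <- xtabs(Counts ~ Group + MedicalAid, data = bttstudy)
  by_race <- xtabs(Counts ~ Group + MedicalAid + Race, data = bttstudy)

  # Proportions with and without medical aid within each group.
  print(prop.table(overall, margin = 1))
  # The same proportions within each group-by-race combination.
  print(prop.table(by_race, margin = c(1, 3)))
}
```

Comparing the two printed tables shows whether the aggregated proportions tell the same story as the race-specific ones, which is the point of the Simpson's paradox discussion in Morrell (1999).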
The number of tobacco budworms dying at various doses of pyrethroid
data(budworm)
A data frame with 12 observations on the following 4 variables.
Killed: the number of budworms killed at each dose; a numeric vector
Number: the number of budworms exposed at each dose; a numeric vector
Dose: the dose of pyrethroid trans-cypermethrin in micrograms; a numeric vector
Gender: the gender of the budworms; a factor with levels F (female) and M (male)
The data concern the tobacco budworm Heliothis virescens and the doses of pyrethroid trans-cypermethrin (to which the moths were beginning to show resistance). Twenty male and twenty female moths were exposed at each of six doses of the pyrethroid, and the number that were killed recorded.
W. N. Venables and B. D. Ripley (1997). Modern Applied Statistics with S-Plus, second edition. Springer-Verlag: New York (p 230)
D. Collett (1991). Modelling Binary Data. Chapman and Hall: London.
data(budworm)
summary(budworm)
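Because each row records the number killed out of the number exposed, the data fit naturally into a binomial GLM. The following is a minimal sketch, assuming the GLMsData package is installed; the choice of log2(Dose) and an additive Gender effect is an assumption made for illustration, not necessarily the model used in the book.

```r
# Sketch only: runs when the GLMsData package is available.
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(budworm, package = "GLMsData")

  # Two-column response: successes (killed) and failures (survived).
  # log2(Dose) and the additive Gender term are illustrative choices.
  fit <- glm(cbind(Killed, Number - Killed) ~ log2(Dose) + Gender,
             family = binomial, data = budworm)

  print(coef(summary(fit)))
}
```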
The average butterfat content for dairy cattle
data(butterfat)
A data frame with 100 observations on the following 3 variables.
Butterfat: the average butterfat percentage; a numeric vector
Breed: the cattle breed; a factor with levels Ayrshire, Canadian, Guernsey, Holstein-Fresian and Jersey
Age: the age of the cow; a factor with levels 2year and Mature
The data give the average butterfat content (percentages) for random samples of twenty cows (ten two-year old and ten mature (greater than four years old)) from each of five breeds. The data are from Canadian records of pure-bred dairy cattle.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 23.
R. R. Sokal and F. J. Rohlf (1981) Biometry, 2nd edition, San Francisco: WH Freeman.
data(butterfat)
summary(butterfat)
The estimated number of deaths from cancer in three regions of Canada by cancer site and gender
data(ccancer)
A data frame with 30 observations on the following 5 variables.
Count: the estimated number of deaths by the given cancer; a numeric vector
Gender: gender; a factor with levels F (female) or M (male)
Region: the region; a factor with levels Ontario, Newfoundland or Quebec
Site: the cancer site; a factor with levels Lung, Colorectal, Breast, Prostate or Pancreas
Population: the estimated population of the region in 2000/2001; a numeric vector
The cancer data are the estimated number of deaths in 2000 from the five leading cancer sites.
Cancer estimates from: Canadian Cancer Society. Canadian cancer statistics 2000. Published on the internet: http://www.cancer.ca/stats2000/tables/tab5e.htm. Accessed 19 September 2001.
Population estimates from: The Daily, Tuesday September 25, 2001. (Accessed on the internet: http://www.statcan.gc.ca/daily-quotidien/010925/dq010925a-eng.htm; now https://www150.statcan.gc.ca/n1/daily-quotidien/010925/dq010925a-eng.htm)
data(ccancer)
summary(ccancer)
The age and salary of CEOs of small companies
data(ceo)
A data frame with 60 observations on the following 2 variables.
Age: the age of the CEO in completed years; a numeric vector
Salary: the salary of the CEO (including bonuses) in thousands of dollars; a numeric vector
The age and salary of CEOs of small companies (annual sales greater than 5 and less than 350 million dollars); companies were ranked according to 5-year average return on investment. The first 60 firms are listed.
The Data and Story Library (DASL) (formerly http://lib.stat.cmu.edu/DASL/, now https://dasl.datadescription.com)
Originally from Forbes, November 8, 1993 “America's Best Small Companies”.
data(ceo)
plot(ceo)
The number of deaths from cervical cancer in four countries
data(cervical)
A data frame with 16 observations on the following 4 variables.
Country: the country; a factor with levels EngWales (England and Wales), Belgium, France and Italy
Age: the age group; a factor with levels 25to34, 35to44, 45to54, 55to64
Deaths: the number of deaths; a numeric vector
Wyears: the woman-years of risk; a numeric vector
The data give the number of deaths from cervical cancer, and the woman-years of risk, for various age groups and four countries.
A. S. Whittemore and G. Gong (1991) Poisson regression with misclassified counts: Applications to cervical cancer mortality rates. Applied Statistics, 40(1), 81–93.
data(cervical)
with( cervical, plot(Deaths/Wyears ~ Age) )
The taste of cheddar cheese
data(cheese)
A data frame with 30 observations on the following 4 variables.
Taste: the combined taste scores from several judges (presumably higher scores correspond to better taste); a numeric vector
Acetic: the concentration of acetic acid in the cheese (units unknown); a numeric vector
H2S: the concentration of hydrogen sulphide (units unknown); a numeric vector
Lactic: the concentration of lactic acid (units unknown); a numeric vector
The data give information on taste and concentration of various chemical components of 30 matured cheddar cheeses from the LaTrobe Valley in Victoria, Australia. The final Taste score is a combination of the taste scores from several tasters.
David S. Moore and George P. McCabe (1993) Introduction to the Practice of Statistics, W. H. Freeman and company, second edition.
The Statlib data base: formerly http://lib.stat.cmu.edu/DASL/Datafiles/Cheese.html, now https://dasl.datadescription.com.
G. P. McCabe, L. McCabe, A. Miller. Analysis of taste and chemical composition of cheddar cheese 1982–83 experiments, CSIRO Division of Mathematics and Statistics Consulting Report VT85/6.
I. Barlow, et al. (1989) Correlations and changes in flavour and chemical parameters of cheddar cheeses during maturation. Australian Journal of Dairy Technology, 44, 7–18.
According to Moore and McCabe (1993), the data are based on the experiments of G. T. Lloyd and E. H. Ramshaw.
data(cheese)
plot(cheese)
Details of the Canadian car insurance industry
data(cins)
A data frame with 20 observations on the following 6 variables.
Merit: the merit rating; a factor with levels Merit3 (licensed and accident free 3 or more years), Merit2 (licensed and accident free 2 or more years), Merit1 (licensed and accident free 1 or more years), Merit0 (all others)
Class: the vehicle class; a factor with levels Class1 (pleasure, no male operator under 25), Class2 (pleasure, non-principal male operator under 25), Class3 (business use), Class4 (unmarried owner or principal operator under 25), Class5 (married owner or principal operator under 25)
Insured: the earned car-years; a numeric vector
Premium: earned premiums in 1000s of dollars (adjusted to equivalent 2001 rates); a numeric vector
Claims: the number of claims; a numeric vector
Cost: total cost of the claim in 1000s of dollars; a numeric vector
The data are for all of Canada except Saskatchewan, and refer to private passenger automobile liability for non-farmers. The data are for policy years 1956 and 1957, as of 30 June 1959.
The data were downloaded from OzDASL (http://www.statsci.org/data/general/carinsca.html), where they were prepared by Gordon Smyth from Bailey and Simon (1960).
Robert A. Bailey and LeRoy J. Simon (1960) Two studies in automobile insurance ratemaking. ASTIN Bulletin, I(IV):192-217.
data(cins)
summary(cins)
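The variables above combine into the two quantities usually of interest in ratemaking: claims per earned car-year and average cost per claim. The following is a rough sketch, assuming the GLMsData package is installed; the column names Frequency and Severity are hypothetical additions for illustration only.

```r
# Sketch only: runs when the GLMsData package is available.
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(cins, package = "GLMsData")

  # Claim frequency: number of claims per earned car-year.
  cins$Frequency <- cins$Claims / cins$Insured
  # Average claim cost, in 1000s of dollars per claim.
  cins$Severity <- cins$Cost / cins$Claims

  print(cins[, c("Merit", "Class", "Frequency", "Severity")])
}
```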
The age at which babies start to crawl, the birth month and average monthly temperature six months after the birth month
data(crawl)
A data frame with 12 observations on the following 5 variables.
BirthMonth: the baby's birth month; levels such as January and July
Age: the mean age (in completed weeks) at which the babies born this month started to crawl; a numeric vector
SD: the standard deviation (in completed weeks) of the crawling ages for babies born this month; a numeric vector
SampleSize: the number of babies in the study born in the given month; a numeric vector
Temp: the monthly average temperature (in degrees F) six months after the birth month; a numeric vector
The data come from a study which hypothesized that babies would take longer to learn to crawl in colder months because the extra clothing restricts their movement. From 1988 to 1991, the babies' first crawling age and the average monthly temperature 6 months after birth (when “infants presumably enter the window of locomotor readiness”) were recorded. The parents reported the birth month, and the age when their baby first crept or crawled a distance of four feet in one minute. Data were collected at the University of Denver Infant Study Center on 208 boys and 206 girls, and summarized by the birth month.
Janette Benson (1993) Season of birth and onset of locomotion: Theoretical and methodological implications. Infant Behavior and Development, 16(1), 69–81.
Thanks to Janette Benson for granting permission to use this data set.
data(crawl)
plot(Age ~ Temp, data=crawl, cex=0.05*SampleSize, pch=19)
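Since each row is a mean crawling age over SampleSize babies, months with more babies give more precise means. One way to reflect this is a weighted regression. The following is a minimal sketch, assuming the GLMsData package is installed; weighting by SampleSize is an illustrative assumption rather than a documented recommendation.

```r
# Sketch only: runs when the GLMsData package is available.
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(crawl, package = "GLMsData")

  # The variance of a mean is inversely proportional to the sample size,
  # so each monthly mean is weighted by the number of babies behind it.
  fit <- lm(Age ~ Temp, data = crawl, weights = SampleSize)

  print(coef(fit))
}
```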
The data give the number of severe and non-severe tropical cyclones from 1969 to 2005 in the Australian region
data(cyclones)
A data frame with 37 observations on the following 8 variables.
Year: the year
Severe: the number of severe cyclones recorded; a numeric vector
NonSevere: the number of non-severe cyclones; a numeric vector
Total: the total number of cyclones (the sum of Severe and NonSevere); a numeric vector
JFM: the Ocean Nino Index, or ONI, averaged over the months January to March; a numeric vector
AMJ: the Ocean Nino Index, or ONI, averaged over the months April to June; a numeric vector
JAS: the Ocean Nino Index, or ONI, averaged over the months July to September; a numeric vector
OND: the Ocean Nino Index, or ONI, averaged over the months October to December; a numeric vector
The data give the number of severe and non-severe tropical cyclones from 1969 to 2005 in the Australian region (south of equator; 105 to 160 degrees E). Severe cyclones are defined as those with a minimum central pressure less than 970 hPa.
The ONI is based on a three-month running mean of ERSST.v3b Sea Surface Temperature (SST) anomalies in the Nino 3.4 region (5 degrees N to 5 degrees S, 120 degrees to 170 degrees W), based on the 1971 to 2000 base period.
Cyclone information: http://www.bom.gov.au/cyclone/climatology/trends.shtml (accessed 04 April 2011).
Ocean Nino Index: http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ensoyears.shtml (accessed 04 April 2011).
data(cyclones)
plot(Severe~JFM, data=cyclones )
The number of cases of lung cancer in four Danish cities
data(danishlc)
A data frame with 24 observations on the following 4 variables.
Cases: the number of lung cancer cases; a numeric vector
Pop: the population of each age group in each city; a numeric vector
Age: the age group; a factor with levels 40-54, 55-59, 60-64, 65-69, 70-74 and >74
City: the city; a factor with levels Fredericia, Horsens, Kolding and Vejle
The data give the number of cases of lung cancer in four Danish cities between 1968 and 1971 inclusive.
James K. Lindsey (1995) Modelling frequency and count data. Clarendon Press, page 157.
The original source is: E. B. Andersen (1977) Multiplicative Poisson models with unequal cell rates. Scandinavian Journal of Statistics, 4, 153–158.
data(danishlc)
plot(Cases/Pop ~ City, data=danishlc)
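Counts of cases with a known population at risk are commonly modelled as rates using a Poisson GLM with a log-population offset. The following is a minimal sketch, assuming the GLMsData package is installed; the additive Age + City structure is an illustrative assumption, not necessarily the model from the source.

```r
# Sketch only: runs when the GLMsData package is available.
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(danishlc, package = "GLMsData")

  # offset(log(Pop)) makes the linear predictor model the log of the
  # rate Cases/Pop rather than the log of the raw count.
  fit <- glm(Cases ~ Age + City + offset(log(Pop)),
             family = poisson, data = danishlc)

  print(coef(summary(fit)))
}
```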
The data give the estimates of the mean number of decayed, missing and filled teeth (DMFT) at age 12 years, and the mean annual sugar consumption in the previous five years for 90 countries
data(dental)
A data frame with 90 observations on the following 4 variables.
Country: the country; a factor
Indus: whether the country is considered an industrialized country; a factor with levels Ind (industrialized) or NonInd (not industrialized)
Sugar: the mean annual sugar consumption in kilograms per person per year, computed over the five years (or as much as available) prior to the survey; a numeric vector
DMFT: estimates of the mean number of decayed, missing and filled teeth at age 12; a numeric vector
The data give the estimates of the mean number of decayed, missing and filled teeth (DMFT) at age 12 years, and the mean annual sugar consumption in the previous five years for 90 countries. For some countries, data on sugar consumption was unavailable for the previous five years, and the average was computed for the available data; see Woodward and Walker (1994) for details.
M. Woodward and A. R. P. Walker (1994) Sugar consumption and dental caries: evidence from 90 countries. British Dental Journal, 176, 297–302.
M. Woodward (2004) Epidemiology: Study Design and Data Analysis, second edition. Chapman and Hall.
data(dental)
plot(DMFT ~ Sugar, data=dental )
The number of insects killed at various doses of insecticide
data(deposit)
A data frame with 18 observations on the following 4 variables.
Killed: the number of insects killed at each poison level; a numeric vector
Number: the number of insects exposed at each poison level; a numeric vector
Insecticide: the insecticide used; a factor with levels A, B and C
Deposit: the amount of deposit (insecticide) used in milligrams; a numeric vector
Fifty insects were exposed to various deposits of insecticides. The proportions of the insects killed after six days exposure were recorded.
P. S. Hewlett and T. J. Plackett (1950) Statistical aspects of the independent joint action of poisons, particularly insecticides. II. Examination of data for agreement with hypothesis. Annals of Applied Biology, 37, 527–552.
Wojtek J. Krzanowski (1998) An Introduction to Statistical Modelling, Arnold: London.
data(deposit)
summary(deposit)
The number of Downs Syndrome cases in British Columbia, Canada
data(downs)
A data frame with 30 observations on the following 3 variables.
Age: the average age of the mother in each group, in completed years; a numeric vector
Births: the number of live births; a numeric vector
DS: the number of Downs Syndrome births; a numeric vector
The data give the number of Downs Syndrome cases from 1961–1970 in British Columbia, Canada, in 30 age categories for the mother.
The ages are the means of the ages in defined groups, rounded to one decimal place.
Charles J. Geyer (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association, 86(415), 717–724.
The data are originally from the British Columbia Health Surveillance Registry.
The data also appear in A. C. Davison and D. V. Hinkley (1997) Bootstrap Methods and their Applications, Cambridge University Press, Table 7.12, though there are very slight differences in their data to ours, in the decimal places for age. (The differences are very minor, and will not affect conclusions.)
data(downs)
plot( DS/Births ~ Age, data=downs)
The data give the number of women from a sample in Camberwell, South London, who developed depression in a one-year period
data(dwomen)
A data frame with 8 observations on the following 4 variables.
Counts: the counts in each category; a numeric vector
Depression: whether depression was observed; a factor with levels Yes and No
SLE: whether a Severe Life Event was observed; a factor with levels Yes and No
Children: whether the woman had three children under 14; a factor with levels Yes and No
The data give the number of women from a sample in Camberwell, South London, who developed depression in a one-year period.
B. S. Everitt and A. M. R. Smith (1979) Interactions in contingency tables: a brief discussion of alternative definitions. Psychological Medicine, 9, 581–583.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 391 (second table).
data(dwomen)
summary(dwomen)
The number of seriously emotionally disturbed and learning disabled adolescents and their reported depression levels.
data(dyouth)
A data frame with 24 observations on the following 5 variables.
Obs: the number of observed adolescents in the given category; a numeric vector
Age: the age group; a factor with levels 12-14, 15-16 and 17-18
Group: the group; a factor with levels LD (learning disabled) and SED (seriously emotionally disturbed)
Gender: the gender; a factor with levels F (female) and M (male)
Depression: the depression level; a factor with levels H (high) and L (low)
The data come from a study of seriously emotionally disturbed and learning disabled adolescents and their reported depression levels. The adolescents were classified by age and gender and their depression levels.
J. W. Maag and J. T. Behrens (1989) Epidemiologic data on seriously emotionally disturbed and learning disabled adolescents: reporting extreme depressive symptomatology. Behavioral Disorders, 15(1).
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall.
data(dyouth)
summary(dyouth)
The number of ear infections in swimmers
data(earinf)
A data frame with 287 observations on the following 6 variables.
Swim: how often the swimmer swims in the ocean; a factor with levels Freq (frequently) and Occas (occasionally)
Loc: the reported usual swimming location; a factor with levels Beach and NonBeach
Age: the age group; a factor with levels 15-19, 20-24 and 25-29
Sex: the sex; a factor with levels Female and Male
NumInfec: the number of self-diagnosed ear infections; a numeric vector
Infec: whether there are self-diagnosed ear infections; a numeric vector where 0 means no self-reported infection, and 1 means at least one self-reported ear infection
The data give the number of self-reported ear infections in the 1990 Pilot Surf/Health Study of NSW Water Board.
This data file was downloaded from OzDASL (http://www.statsci.org/data/oz/earinf.html) where it was prepared by Dr Gordon Smyth from Hand et al (1994) Dataset 328.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 328.
data(earinf) summary(earinf)data(earinf) summary(earinf)
The total monthly rainfall in Emerald, Australia, and the average monthly soi
data(emeraldaug)data(emeraldaug)
A data frame with 114 observations on the following 3 variables.
Yearthe year; a numeric vector
Rainthe total monthly rainfall in August of the given year; a numeric vector
SOIthe monthly average southern oscillation index (soi); a numeric vector
Phasethe soi phase (see Stone and Auliciems, 1992);
a factor with these values:
1 (consistently negative),
2 (consistently positive),
3 (rapidly falling),
4 (rapidly rising), or
5 (consistently near zero)
The data give the total monthly rainfall and monthly in Emerald, Queensland, Australia, from 1889 to 2002, and the average soi for the corresponding month.
Data obtained from the Australian Bureau of Meteorology (http://www.bom.gov.au) and iri/ldeo Climate Data Library (http://www.longpaddock.qld.gov.au/seasonalclimateoutlook/southernoscillationindex/soidatafiles/index.php) on 21 December 2010, then compiled.
R. C. Stone and A. Auliciems (1992) soi phase relationships with rainfall in eastern Australia, International Journal of Climatology, 12, 625–636.
data(emeraldaug)
plot(emeraldaug)
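As a possible next step, the August rainfall can be summarised within each SOI phase. The sketch below assumes the GLMsData package is installed and that the columns are named as documented above; the object names mean_rain and prop_dry are illustrative only.

```r
# Assumes the GLMsData package is installed
data(emeraldaug, package = "GLMsData")

# Treat the SOI phase as a factor (harmless if it already is one)
emeraldaug$Phase <- factor(emeraldaug$Phase)

# Mean August rainfall, and proportion of Augusts with no rain, by SOI phase
mean_rain <- tapply(emeraldaug$Rain, emeraldaug$Phase, mean)
prop_dry  <- tapply(emeraldaug$Rain == 0, emeraldaug$Phase, mean)
print(round(cbind(MeanRain = mean_rain, PropDry = prop_dry), 2))

# Side-by-side boxplots of August rainfall for each phase
boxplot(Rain ~ Phase, data = emeraldaug,
        xlab = "SOI phase", ylab = "August rainfall")
```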
The energy expenditure for 104 females at rest for a 24 hour period
data(energy)
A data frame with 104 observations on the following 3 variables.
Energythe energy expenditure (units not given); a numeric vector
Fatthe mass of fat tissue (units not given); a numeric vector
NonFatthe mass of fat-free tissue (units not given); a numeric vector
The data give the energy expenditure for 104 females at rest over a 24 hour period; the mass of fat and fat-free tissue was also recorded.
Note that the total mass of each subject is the sum of the fat and fat-free tissue masses.
B. Joergensen (1992) Exponential dispersion models and extensions: A review. International Statistical Review, 60(1), 5–20.
L. Garby, J. S. Garrow, B. Joergensen, O. Lammert, K. Madsen, P. Soerensen and J. Webster (1988) Relation between energy expenditure and body composition in man: Specific energy expenditure in vivo of fat and fat-free tissue. European Journal of Clinical Nutrition, 42, 301–305.
data(energy)
summary(energy)
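Because the total mass of each subject is the sum of the fat and fat-free masses, derived variables are easy to construct. This is a minimal sketch, assuming the GLMsData package is installed; the column names Total and PropFat are introduced here for illustration and are not part of the data set.

```r
# Assumes the GLMsData package is installed
data(energy, package = "GLMsData")

# Total mass is the sum of the fat and fat-free tissue masses
energy$Total <- energy$Fat + energy$NonFat

# Proportion of each subject's total mass that is fat tissue
energy$PropFat <- energy$Fat / energy$Total

summary(energy[, c("Total", "PropFat")])
plot(Energy ~ Total, data = energy,
     xlab = "Total mass", ylab = "Energy expenditure")
```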
The number of failures of electronic equipment operating in two modes
data(failures)
A data frame with 18 observations on the following 4 variables.
Periodthe time period; a numeric vector
Time1the time spent in Mode 1 in the given period (units not given); a numeric vector
Time2the time spent in Mode 2 in the given period (units not given); a numeric vector
Failuresthe number of failures in the given period; a numeric vector
The data give the number of failures of a piece of electronic equipment after operating in two modes.
Dale W. Jorgensen (1961) Multiple regression analysis of a Poisson process. Journal of the American Statistical Association, 56(294), 235–245.
data(failures)
summary(failures)
The daily individual feeding rates of chestnut-crowned babblers
data(feedrates)
A data frame containing 1293 observations with the following 11 variables.
SocGroupthe social group for the bird; 27 levels
NestIDthe nest identifier; 61 levels
ObsTimethe length of observation (in decimal hours); a numeric vector
Ringan identifier for individual birds; 97 levels
Sexthe sex of the bird; one of f (female) or m (male)
Agethe age of non-breeding group members; one of adult or yearling
Relatednessthe pedigree-based relatedness to the brood;
one of 0.5 (first-order relatives); 0.25 (second-order relatives) or
0 (more distant relatives)
ChickAgethe age of the brood, in days; a numeric vector
BroodSizethe size of the brood: a numeric vector
UnitSizethe number of individuals in the unit; a numeric vector
FeedingRatethe daily individual feeding rates, in feeds per hour; a numeric vector
The data relate to a colour-ringed population of chestnut-crowned babblers in an area of the University of New South Wales Arid Zone Research Station (Fowlers Gap, western New South Wales, Australia). The study determined whether, where and how often non-breeding group members contributed to providing for nestlings by monitoring the visit rate of tagged birds during 2007 and 2008.
L. E. Browning, S. C. Patrick, L. A. Rollins, S. C. Griffith, and A. F. Russell (2012) Kin selection, not group augmentation, predicts helping in an obligate cooperatively breeding bird. Proceedings of the Royal Society B, 279(1743): 3861–3869. doi:10.1098/rspb.2012.1080
L. E. Browning, S. C. Patrick, L. A. Rollins, S. C. Griffith, and A. F. Russell (2012) Data from: Kin selection, not group augmentation, predicts helping in an obligate cooperatively breeding bird. Dryad Digital Repository. doi:10.5061/dryad.ff868
L. E. Browning, S. C. Patrick, L. A. Rollins, S. C. Griffith, and A. F. Russell (2012) Kin selection, not group augmentation, predicts helping in an obligate cooperatively breeding bird. Proceedings of the Royal Society B, 279(1743): 3861–3869. doi:10.1098/rspb.2012.1080
data(feedrates)
summary(feedrates)
The root length density of apple trees
data(fineroot)
A data frame with 511 observations on the following 5 variables.
Plantthe plant number; a numeric vector
Rstockthe root stock;
a factor with levels Mark,
MM106 or M26
Spacingthe plant spacing;
a factor with levels 5x3 or
4x2 (measured in metres)
Zonethe zone relative to the plant
from which the soil core is taken;
a factor with levels Inner or Outer
RLDthe root length density in centimetres per cubic centimetre; a numeric vector
The data concern the underground root system of eight apple trees. Three different root stocks and two plant spacings are used; the root length density (the density of the fine roots) is measured in one of the two zones.
The design is not full factorial:
plants 1 and 2 are for Mark rootstock at 5x3 spacing;
plants 3 and 4 are for Mark rootstock at 4x2 spacing;
plants 5 and 6 are for MM106 rootstock at 5x3 spacing;
plants 7 and 8 are for M26 rootstock at 4x2 spacing.
Personal communication from Nihal de Silva.
H. N. de Silva, A. J. Hall, D. S. Tustin and P. W. Gandar (1999) Analysis of distribution of root length density of apple trees on different dwarfing rootstocks. Annals of Botany, 83, 335–345.
P. K. Dunn and G. K. Smyth (2005) Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing, 15(4), 267–280.
data(fineroot)
summary(fineroot)
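The incomplete factorial layout described above can be checked by cross-tabulating rootstock against spacing. This sketch assumes the GLMsData package is installed and that Rstock and Spacing are stored as factors, as documented; design and plants are illustrative object names.

```r
# Assumes the GLMsData package is installed
data(fineroot, package = "GLMsData")

# Number of soil cores for each rootstock/spacing combination;
# cells equal to zero are combinations that were not used
design <- xtabs(~ Rstock + Spacing, data = fineroot)
print(design)

# The rootstock and spacing used for each plant
plants <- unique(fineroot[, c("Plant", "Rstock", "Spacing")])
print(plants[order(plants$Plant), ])
```

Any model that includes a rootstock-by-spacing interaction would need to allow for the empty cells.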
The food consumption for various fish species
data(fishfood)
A data frame with 33 observations on the following 6 variables.
Speciesthe fish species; an identifier
MaxWtthe mean asymptotic (or maximum) weight of the fish in grams; a numeric vector
Tempthe mean habitat temperature in degrees Celsius; a numeric vector
ARthe aspect ratio of the fish; a numeric vector
Foodthe food type for the fish;
a factor with levels C for carnivores,
and H for herbivores
FoodConthe daily food consumption of a fish population as a percentage of its biomass; a numeric vector
The computation of the aspect ratio is detailed in the source.
M. L. Palomares and D. Pauly (1989) A multiple regression model for predicting the food consumption of marine fish population. Australian Journal of Marine and Freshwater Research, 40(3), 259–284.
data(fishfood)
summary(fishfood)
Information about tiger flathead trawls
data(flathead)
A data frame containing 169 observations with the following 7 variables.
Lonthe longitude of the trawl
Latthe latitude of the trawl
Depththe depth (bathymetry) of the trawl, in metres
Distancethe distance along a 100 metre depth contour for the trawl (northwards of all trawls from an arbitrary origin), in metres
Areathe area swept, in hectares
Numberthe number of tiger flathead caught
Biomassthe total biomass of tiger flathead caught, in kg
The data give details of trawls in the South East Fisheries ecosystem off Australia. The data were originally collected by Bax and Williams (2000).
R package fishMod: Foster (2016).
Nicholas J. Bax and Alan Williams (2000) Habitat and fisheries production in the South East fishery ecosystem. Final Report 1994/040, Fisheries Research and Development Corporation.
Scott D. Foster and Mark V. Bravington (2013) A Poisson–Gamma model for analysis of ecological data. Environmental and Ecological Statistics, 20(4):533–552.
Scott D. Foster (2016) fishMod: Fits Poisson-Sum-of-Gammas GLMs, Tweedie GLMs, and Delta Log-Normal Models. R package version 0.29. https://CRAN.R-project.org/package=fishMod
data(flathead)
summary(flathead)
The average number of meadowfoam flowers in given light conditions
data(flowers)
A data frame with 24 observations on the following 3 variables.
Flowersthe mean number of flowers per meadowfoam plant, averaged over ten seedlings; a numeric vector
Lightthe light intensity in mol per square metre per second;
a numeric vector
Timingwhen the light treatment was applied;
a factor with levels PFI
(photoperiodic floral induction) or
Before (24 days before PFI)
The data are collected from an experiment to study how to maximize Mermaid meadowfoam production. (Meadowfoam is a small plant from which a vegetable oil can be extracted.)
These data are consistent with those in Seddigh and Joliff (1994). The data were estimated from their Figure 3, and then adjusted to produce, as closely as possible, the statistics given on those graphs.
M. Seddigh and G.D. Joliff (1994) Light intensity effects on meadowfoam growth and flowering. Crop Science, 34: 497–503.
data(flowers)
summary(flowers)
The data give the total procedure time during CT fluoroscopic scanning, and the radiation dose received.
data(fluoro)
A data frame with 19 observations on the following 2 variables.
Timethe total procedure time (in minutes); a numeric vector
Dosethe total radiation dose received (in rads); a numeric vector
The data are given in the Table as the natural log of Time and the natural log of Dose. Here the data have been transformed back to the original scale. The source claims the purpose of the data collection was “to assess whether radiation dose could be estimated by simply measuring the total CT fluoroscopic procedure time”. The procedure was performed in the abdomen.
Kelly H. Zou, Kemal Tuncali, and Stuart G. Silverman (2003) Correlation and simple linear regression. Radiology, 227, 617–628.
The data were originally used, but not given, in S. G. Silverman, K. Tuncali, D. F. Adams, R. D. Nawfel, K. H. Zou, and P. F. Judy (1999) CT fluoroscopy-guided abdominal interventions: techniques, results, and radiation exposure. Radiology, 212, 673–681.
data(fluoro)
plot(fluoro)
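Since the source reports the values on the natural-log scale, that scale can be recovered by taking logs of both variables. The sketch below, which assumes the GLMsData package is installed, also fits a simple linear regression on the log scale as one possible way of relating dose to procedure time; it is not presented as the analysis used in the source. The names logTime, logDose and fit are illustrative.

```r
# Assumes the GLMsData package is installed
data(fluoro, package = "GLMsData")

# Recover the log-scale values reported in the source
fluoro$logTime <- log(fluoro$Time)
fluoro$logDose <- log(fluoro$Dose)

# Simple linear regression of log dose on log procedure time
fit <- lm(logDose ~ logTime, data = fluoro)
print(coef(fit))

plot(logDose ~ logTime, data = fluoro,
     xlab = "log(Time in minutes)", ylab = "log(Dose in rads)")
abline(fit)
```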
The number of species on the Galápagos Islands
data(galapagos)
A data frame containing 29 observations with the following 11 variables.
Islandthe name of the island
Plantsthe number of plant species; a numeric vector
PlantEndthe number of endemic plant species; a numeric vector
Finchesthe number of finch species; a numeric vector
FinchEndthe number of endemic finch species; a numeric vector
FinchGenerathe number of finch genera; a numeric vector
Areathe area of each island in square kilometres; a numeric vector
Elevationthe maximum elevation of each island in metres; a numeric vector
Nearestthe distance to the nearest island; a numeric vector
StCruzthe distance to Santa Cruz Island in kilometres; a numeric vector
Adjacentthe area of the adjacent island in square kilometres; a numeric vector
The data give the number of plant species and related variables for 29 different islands. Counts are given for both the total number of species and the number of species that occur only in the Galápagos (the endemics).
Elevations for Baltra and Seymour obtained from web searches. Elevations for four other small islands obtained from large-scale maps.
Michael P. Johnson and Peter H. Raven (1973) Species number and endemism: The Galápagos Archipelago revisited. Science, 179(4076), 893–895.
data(galapagos)
summary(galapagos)
In an experiment, the number of seeds germinating was recorded for two types of seeds and two types of root extracts
data(germ)
A data frame with 21 observations on the following 4 variables.
Germthe number of seeds germinating; a numeric vector
Totalthe number of seeds planted; a numeric vector
Extractthe extract type;
a factor with levels Bean and Cucumber
Seedsthe type of seed;
a factor with levels OA75
(O. aegyptiaca 75) and
OA73 (O. aegyptiaca 73)
The data give the total number of seeds and the number germinating, for two types of seeds and two types of root extracts; the dilution is 1 in 25 in all cases.
An alternative representation of these data is given in germBin.
Martin J. Crowder (1978) Beta-binomial ANOVA for proportions. Applied Statistics, 27(1), 34–37.
The following sources also quote the data, but have reversed the two seed types from the original source:
P. J. Smith and D. F. Heitjan (1993). Testing and adjusting for departures from nominal dispersion in generalized linear models. Applied Statistics, 42, 31–41 (Table 1).
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994). A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 420.
data(germ)
summary(germ)
In an experiment, the number of seeds germinating was recorded for two types of seeds and two types of root extracts
data(germBin)
A data frame with 831 observations on the following 3 variables.
Extractthe extract type;
a factor with levels Bean and Cucumber
Seedsthe type of seed;
a factor with levels OA75
(O. aegyptiaca 75) and
OA73 (O. aegyptiaca 73)
Resultthe result of the experiment: either Germ (the seed germinated)
or NotGerm (the seed did not germinate)
The data give the total number of seeds and the number germinating, for two types of seeds and two types of root extracts; the dilution is 1 in 25 in all cases.
These data are the same as germ but with one row for each seed.
Martin J. Crowder (1978) Beta-binomial ANOVA for proportions. Applied Statistics, 27(1), 34–37.
The following sources also quote the data, but have reversed the two seed types from the original source:
P. J. Smith and D. F. Heitjan (1993). Testing and adjusting for departures from nominal dispersion in generalized linear models. Applied Statistics, 42, 31–41 (Table 1).
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994). A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 420.
data(germBin)
summary(germBin)
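The two representations (germ with one row per batch, germBin with one row per seed) can both be used with a binomial GLM. The following sketch assumes the GLMsData package is installed; the main-effects model is chosen only for illustration, and the object names are hypothetical. The coefficient estimates should agree between the two fits, provided the factor levels are ordered the same way in both data frames.

```r
# Assumes the GLMsData package is installed
data(germ, package = "GLMsData")
data(germBin, package = "GLMsData")

# Grouped form: response is cbind(successes, failures) for each batch
fit_grouped <- glm(cbind(Germ, Total - Germ) ~ Extract + Seeds,
                   family = binomial, data = germ)

# Ungrouped form: one binary response per seed (1 = germinated)
germBin$Germinated <- as.numeric(germBin$Result == "Germ")
fit_binary <- glm(Germinated ~ Extract + Seeds,
                  family = binomial, data = germBin)

# Compare the coefficient estimates from the two representations
print(cbind(grouped = coef(fit_grouped), binary = coef(fit_binary)))
```

The residual deviances of the two fits differ even when the estimates agree, because the saturated model is different for grouped and ungrouped data.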
The gestation time for 1513 infants
data(gestation)
A data frame with 21 observations on the following 4 variables.
Agethe gestational age in weeks; a numeric vector
Birthsthe number of births; a numeric vector
Weightthe mean birthweight in kilograms; a numeric vector
SDthe standard deviation of the birthweight in each group in kilograms; a numeric vector
The gestation time for 1513 infants born in St George's Hospital, London, to Caucasian mothers willing to participate between August 1982 and March 1984.
J. M. Bland, J. L. Peacock, H. R. Anderson, and O. G. Brooke (1990) The adjustment of birthweight for very early gestational ages: two related problems in statistical analysis. Applied Statistics, 39(2), 229–239.
data(gestation)
summary(gestation)
Loss of consciousness induced by G-forces
data(gforces)
A data frame containing 8 observations with the following 3 variables.
Subjectthe initials of the subject; a text identifier
Agethe age of the subject, in years; a numeric vector
Signswhether the subject showed syncopal blackout-related signs:
a factor with levels 0 (No) and 1 (Yes)
Military pilots sometimes black out when their brains are deprived of oxygen due to G-forces during violent manoeuvres. Glaister and Miller (1990) produced similar symptoms by exposing volunteers' lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data list the subjects' ages and whether they showed syncopal blackout-related signs (pallor, sweating, slow heartbeat, unconsciousness) during an 18 minute period.
The data were obtained electronically from OzDASL (http://www.statsci.org/data/). The Details above were obtained from this webpage.
D. H. Glaister and N. L. Miller (1990) Cerebral tissue oxygen status and psychomotor performance during lower body negative pressure (LBNP). Aviation, Space and Environmental Medicine. 61(2), 99–105.
L. C. Hamilton (1992) Regression with Graphics: a second course in applied statistics. Duxbury, page 243.
data(gforces)
summary(gforces)
The clutch sizes from various studies of Gopher tortoises
data(gopher)
A data frame with 19 observations on the following 6 variables.
Sitethe site number (an identifier); a numeric vector
Latitudethe latitude at which the study was conducted; a numeric vector
Evapthe mean total annual actual evapotranspiration (in mm); a numeric vector
Tempthe mean annual temperature in degrees Celsius; a numeric vector
ClutchSizethe mean clutch size; a numeric vector
SampleSizethe size of the sample upon which the
ClutchSize was computed;
a numeric vector
Nineteen populations of Gopher tortoises were examined across 17 different studies; from each study, the mean clutch size and various other variables were compiled.
K. G. Ashton, R. L. Burke, and J. N. Layne (2007) Geographic variation in body and clutch size of Gopher tortoises. Copeia, May 16, Number 2, 355–363.
data(gopher)
summary(gopher)
Amount of sleep in guinea pigs after receiving ketamine
data(gpsleep)
A data frame with 30 observations on the following 2 variables.
Sleepthe minutes of sleep (zero means the guinea pig did not sleep); a numeric vector
Dosethe dose of ketamine in mg/kg body weight; a numeric vector
R. C. Bailey, J. P. Summe, L. D. Homer, and L. E. McCracken (1978) A model for analysis of the anesthetic response. Biometrics, 34(2), 223–232.
The original source is: L. E. McCracken, R. E. Toby, and R. Bailey (1977) Ketamine and thiopental sleep responses in hyperbaric helium oxygen in guinea pigs. Undersea Biomedical Research, 6(4), 329–338.
data(gpsleep)
plot(Sleep~Dose, data=gpsleep)
The density of understorey birds at a series of sites in two areas either side of a stockproof fence
data(grazing)
A data frame with 62 observations on the following 3 variables.
Birdsthe number of understorey birds; a numeric vector
Whenwhen the bird count was conducted;
a factor with levels Before
(before herbivores were removed)
and After (after herbivores were removed)
Grazedwhich side of the stockproof fence;
a factor with levels Reference
(grazed by native herbivores)
and Feral (grazed by feral herbivores,
mainly horses)
In this experiment, the density of understorey birds at a series of sites in two areas either side of a stockproof fence was compared. One side had limited grazing (mainly from native herbivores), and the other was heavily grazed by feral herbivores, mostly horses. Bird counts were done at the sites either side of the fence (the Before measurements). Then the herbivores were removed, and bird counts done again (the After measurements). The measurements are the total number of understorey-foraging birds observed in three 20-minute surveys of two hectare quadrats.
Personal communication from Martine Maron.
Alison L. Howes, Martine Maron and Clive A. McAlpine (2010) Bayesian networks and adaptive management of wildlife habitat. Conservation Biology. 24(4), 974–983.
data(grazing)
plot(Birds ~ When, data=grazing)
The number of male crabs attached to female horseshoe crabs
data(hcrabs)
A data frame with 173 observations on the following 5 variables.
Colthe color of the female;
a factor with levels LM (light medium),
M (medium), DM (dark medium) or D (dark)
Spinethe spine condition;
a factor with levels BothOK,
OneOK or NoneOK
Widththe carapace width of the female crab in cm; a numeric vector
Wtthe weight of the female crab in grams; a numeric vector
Satthe number of male crabs attached to the female (‘satellites’); a numeric vector
The data come from an observational study of nesting horseshoe crabs: “The study was conducted at two beaches on the Delaware shore, Breakwater Harbor at Cape Henlopen Park in Lewes and Fowler's Beach, 32 km north on the same shoreline (Sussex County, Delaware, USA). In 1991 observations were made from 7 to 17 June, in 1992 from 28 May to 3 June and from 11 to 14 June, and in 1993 from 18 May to 11 June. At these sites the crabs were most active on the higher of the two daily high tides (which at this time of year are at night between 1700 and 0200 h EST)” (Brockmann, 1996; p. 4).
H. J. Brockmann (1996) Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102(1), 1–21.
data(hcrabs)
plot(Sat ~ Wt, data=hcrabs)
The heat capacity of hydrobromic acid measured at various temperatures
data(heatcap)
A data frame with 18 observations on the following 2 variables.
Cpthe heat capacity (in calories per mole per degree Kelvin); a numeric vector
Tempthe temperature (in Kelvin); a numeric vector
The data give the heat capacity for hydrobromic acid at various temperatures.
M. Shacham and N. Brauner (1997) Minimizing the effects of collinearity in polynomial regression. Industrial and Engineering Chemical Research, 36, 4405–4412.
The original source is:
W. F. Giauque and R. Wiebe (1929) The heat capacity of hydrogen bromide from K to its boiling point and its heat of vaporization. The entropy from spectroscopic data. Journal of the American Chemical Society, 51(5), 1441–1449.
data(heatcap)
plot(heatcap)
The age and percent body fat for 18 adults
data(humanfat)
A data frame with 18 observations on the following 4 variables.
Agethe age of the subject in completed years; a numeric vector
Percent.Fatthe body fat percentage; a numeric vector
Genderthe gender;
a factor with levels F (females) or
M (males)
BMIthe body mass index in kilograms per metre-squared; a numeric vector
The data come from a study investigating a new method of measuring body composition. The body fat percentage, age and gender are given for 18 adults aged between 23 and 61. “Eighteen normal adult subjects were measured including four young males and 14 females (age 25 to 60 years). None of these subjects had chronic diseases, were taking medications, or had skeletal fractures indicative of osteoporosis” (Mazess et al. (1984), p. 835). The BMI is computed from the weights and heights given in the original source.
R. B. Mazess, W. W. Peppler, and M. Gibbons (1984) Total body composition by dualphoton (Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834–839.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 17.
data(humanfat)
summary(humanfat)
The Janka hardness of Australian hardwoods
data(janka)
A data frame containing 36 observations with the following 2 variables.
Densitythe hardwood density (units unknown); a numeric vector
Hardnessthe Janka hardness (units unknown); a numeric vector
The data give the Janka hardness (which is hard to measure) and the density of Australian hardwoods (which is easier to measure).
W. N. Venables (1998) Exegeses on linear models. In S-Plus User's Conference, Washington DC.
Williams, E. J. (1959) Regression Analysis, Wiley, New York.
data(janka)
plot(janka)
Treatment of kidney stones
data(kstones)
A data frame with 8 observations on the following 4 variables.
Countsthe number of subjects in the given classification; a numeric vector
Sizewhether the subject has kidney stones
with mean diameter less than 2cm (coded as Small)
or greater than or equal to 2cm (coded as Large);
a factor with levels Large and Small
Methodthe treatment method;
a factor with levels A (open surgery) or
B (percutaneous nephrolithotomy)
Outcomethe outcome of the stated treatment;
a factor with levels Failure and Success
The data give the success rates of two methods of treating kidney stones: open surgery methods, and percutaneous nephrolithotomy.
The given data are a subset of that reported by Charig et al. (1986), who also include two other methods of treatment, and also break up the open surgery methods into three sub-groups. The two methods here were chosen because they demonstrate Simpson's paradox.
C. R. Charig, D. R. Webb, S. R. Payne, and J. A. E. Wickham (1986) Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal, 292, 29 March, 879–882.
Steven A. Julious and Mark A. Mullee (1994) Confounding and Simpson's paradox. British Medical Journal, 309(1480):1480–1481.
data(kstones)
summary(kstones)
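Simpson's paradox in these data can be seen by comparing success rates within each stone size against the success rates when stone size is ignored. This is a minimal sketch, assuming the GLMsData package is installed and the factor levels are as documented above; the object names are illustrative.

```r
# Assumes the GLMsData package is installed
data(kstones, package = "GLMsData")

# Three-way table of counts: Method x Outcome x Size
tab <- xtabs(Counts ~ Method + Outcome + Size, data = kstones)

# Success rate of each method within each stone size
within_size <- prop.table(tab, margin = c(1, 3))[, "Success", ]

# Success rate of each method when stone size is ignored
overall_tab <- xtabs(Counts ~ Method + Outcome, data = kstones)
overall <- prop.table(overall_tab, margin = 1)[, "Success"]

print(round(within_size, 3))
print(round(overall, 3))
```

The paradox appears when the method with the higher success rate within both stone sizes has the lower success rate overall.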
Lactation of dairy cows over time
data(lactation)
A data frame containing 35 observations on the following 2 variables.
Yieldthe average daily fat yield from a dairy cow, in kg/day
Weekthe week in which the data were collected
The data come from a lactating dairy cow, recording the average daily fat yield over 35 weeks.
Harold V. Henderson and Charles E. McCulloch (1990) Transform or link? Technical Report BU-049-MA, Cornell University.
data(lactation)
plot(lactation$Yield ~ lactation$Week)
The percentage leaf area of barley infected with leafblotch
data(leafblotch)
A data frame with 90 observations on the following 3 variables.
Areathe percentage area infected with leaf blotch; a numeric vector
Sitethe site;
a factor with levels A, B up to I
Varietythe variety of barley;
a factor with levels 1, 2, up to 10
The data give the percentage leaf area of barley infected with Rhynchosporium secalis, or leaf blotch, for ten different barley varieties grown at nine different sites.
R. W. M. Wedderburn (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3), 439–447.
The data also appear in McCullagh and Nelder, p 329, and in Faraway (2006), Exercise 7.5.
data(leafblotch)
plot(Area ~ Site, data=leafblotch)
The times to death and white blood cell counts for two groups of leukaemia patients
data(leukwbc)
A data frame with 33 observations on the following 3 variables.
WBCthe white blood cell count; a numeric vector
Timethe time to death in weeks; a numeric vector
AGthe morphological variable, the AG factor;
a numeric vector where 1 means AG-positive and
2 means AG-negative
The data give the times to death (in weeks) and white blood cell counts for two groups of leukaemia patients, AG-positive and AG-negative. The two groups have not been created by random allocation.
P. Feigl and M. Zelen (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics, 21, 826–838.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 424.
data(leukwbc)
summary(leukwbc)
Data from small-leaved lime trees grown in Russia
data(lime)
A data frame containing 385 observations with the following 4 variables.
Foliagethe foliage biomass, in kg (oven dried matter)
DBHthe tree diameter, at breast height, in cm
Agethe age of the tree, in years
Originthe origin of the tree;
one of
Coppice,
Natural,
Planted
The data give measurements from small-leaved lime trees (Tilia cordata) growing in Russia.
Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir A; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael (2017): Biomass tree data base. doi:10.1594/PANGAEA.871491, In supplement to: Schepaschenko, D et al. (2017): A dataset of forest biomass structure for Eurasia. Scientific Data, 4, 170070, doi:10.1038/sdata.2017.70. Extracted from https://doi.pangaea.de/10.1594/PANGAEA.871491
The source (Schepaschenko et al.) obtains the data from various sources:
Dylis N.V., Nosova L.M. (1977) Biomass of forest biogeocenoses under Moscow region. Moscow: Nauka Publishing.
Gabdelkhakov A.K. (2015) Tilia cordata Mill. tree biomass in plantations and coppice forests. Eco-potential. No. 3 (11). p. 7–16.
Gabdelkhakov A.K. (2005) Tilia cordata Mill. tree biomass in plantations. Ural forests and their management. Issue 26. Yekaterinburg: USFEU. p. 43–51.
Polikarpov N.P. (1962) Scots pine young forest dynamics on clear cut. Moscow: Academy of Sci. USSR.
Prokopovich E.V. (1995) Ecological conditions of soil forming and biological cycle of matters in spruce forests of the Middle Ural. Ph.D. Thesis. Ekaterinburg: Plant and Animals Ecology Institute.
Remezov N.P., Bykova L.N., Smirnova K.M. (1959) Uptake and cycling of nitrogen and ash elements in forests of European part of USSR. Moscow: State University.
Smirnov V.V. (1971) Organic mass of certain forest phytocoenoses at European part of USSR. Moscow: Nauka.
Uvarova S.S. (2005) Biomass dynamics of Tilia cordata trees on the example of Achit forest enterprise of Sverdlovsk region. Ural forests and their management. Issue 26. Ekaterinburg: State Forest Engineering University, p. 38–40.
Uvarova S.S. (2006) Growth and biomass of Tilia cordata forests of Sverdlovsk region Dissertation. Ekaterinburg: State Forest Engineering University. (USFEU library)
data(lime)
summary(lime)
The health and smoking habits of 654 youth
data(lungcap)
data(lungcapsub)
A data frame with 654 observations on the following 5 variables.
(The data frame lungcapsub contains the data only for smokers,
and hence does not contain the variable Smoke.)
Agethe age of the subject in completed years; a numeric vector
FEVthe forced expiratory volume in litres, a measure of lung capacity; a numeric vector
Htthe height in inches; a numeric vector
Genderthe gender of the subjects: a numeric vector with females coded as 0 and males as 1
Smokethe smoking status of the subject: a numeric vector with non-smokers coded as 0 and smokers as 1
The data give information on the health and smoking habits of a sample of 654 youths, aged 3 to 19, in the area of East Boston during the middle to late 1970s.
Kahn, Michael (2005) An exhalent problem for teaching statistics. The Journal of Statistical Education, 13(2). Available on-line.
Kahn, M. (2003) Data Sleuth, STATS, 37, 24.
Ira B. Tager, Scott T. Weiss, Alvaro Munoz, Bernard Rosner, and Frank E. Speizer (1983) Longitudinal study of the effects of maternal smoking on pulmonary function in children. New England Journal of Medicine, 309(12):699–703.
data(lungcap)
summary(lungcap)
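Because Smoke is stored as a 0/1 numeric code and Ht is in inches, it can help to create labelled and metric versions before plotting or modelling. The sketch below assumes the GLMsData package is installed; the new column names SmokeF and HtCm, and the labels, are introduced here for illustration only.

```r
# Assumes the GLMsData package is installed
data(lungcap, package = "GLMsData")

# Labelled factor version of the 0/1 smoking code
lungcap$SmokeF <- factor(lungcap$Smoke, levels = c(0, 1),
                         labels = c("Non-smoker", "Smoker"))

# Height converted from inches to centimetres (1 inch = 2.54 cm)
lungcap$HtCm <- lungcap$Ht * 2.54

print(table(lungcap$SmokeF))
boxplot(FEV ~ SmokeF, data = lungcap,
        xlab = "Smoking status", ylab = "FEV (litres)")
```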
Assay results from a study of adult mammary stem cells
data(mammary)
A data frame containing results from 81 assays, compiled into five rows of data, with the following 3 variables.
N.Cellsthe average number of cells in each assay
N.Assaysthe number of assays at that cell number
N.Outgrowthsthe number of assays giving a positive outcome (i.e. seeing a milk gland outgrowth)
The data give measurements from an assay analysis of adult mammary stem cells.
Mark Shackleton, Francois Vaillant, Kaylene J. Simpson, John Stingl, Gordon K. Smyth, Marie-Liesse Asselin-Labat, Li Wu, Geoffrey J. Lindeman, and Jane E. Visvader (2006) Generation of a functional mammary gland from a single stem cell. Nature, 439:84–88.
Mark Shackleton, Francois Vaillant, Kaylene J. Simpson, John Stingl, Gordon K. Smyth, Marie-Liesse Asselin-Labat, Li Wu, Geoffrey J. Lindeman, and Jane E. Visvader (2006) Generation of a functional mammary gland from a single stem cell. Nature, 439:84–88.
data(mammary)
summary(mammary)
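Since each row summarises several assays, the observed proportion of positive assays at each cell number is a natural first summary. This is a sketch only, assuming the GLMsData package is installed; PropPositive is an illustrative column name that is not part of the data set.

```r
# Assumes the GLMsData package is installed
data(mammary, package = "GLMsData")

# Observed proportion of assays showing an outgrowth at each cell number
mammary$PropPositive <- mammary$N.Outgrowths / mammary$N.Assays
print(mammary)

# Proportion of positive assays against cell number (log scale on the x-axis)
plot(PropPositive ~ N.Cells, data = mammary, log = "x",
     xlab = "Number of cells", ylab = "Proportion of positive assays")
```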
The data give the mandible length and gestational age for 167 foetuses from the 12th week of gestation onwards
data(mandible)
A data frame with 167 observations on the following 2 variables.
Agethe gestational age (in weeks); a numeric vector
Lengththe mandible length (in mm); a numeric vector
The data give the mandible length and gestational age for 167 foetuses from the 12th week of gestation onwards, measured using ultrasound.
Patrick Royston and Douglas G. Altman (1994) Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics, 43(3), 429–467.
data(mandible)
plot(mandible)
The pH and wound size of wounds before and after treatment with Manuka honey
data(manuka)
A data frame containing 20 observations (from 17 patients) with the following 6 variables.
Aetiologythe aetiology of the wound; one of V (venous), A (arterial), M (mixed) or P (pressure ulcer)
Durationthe duration of the wound; units not given
Size0the initial wound size, in square centimetres
pH0the initial wound pH
Size2the wound size after 2 weeks, in square centimetres
pH2the wound pH after 2 weeks
The data give the pH and wound size for 20 lower-leg wounds on 17 patients, giving 20 observations on 6 variables. The variable Duration is never explained or used.
The article Gethin et al. (2008) is subject to a retraction notice.
Gethin, Cowman and Conroy (2008), Table 1.
Gethin, G. T., Cowman, S., and Conroy, R. M. (2008) The impact of Manuka honey dressings on the surface pH of chronic wounds. International Wound Journal, 5(2):185–194.
International Wound Journal (2014), Retraction. 11: 342. doi:10.1111/iwj.12275
data(manuka)
summary(manuka)
The data give details of third-party motor insurance claims in Sweden for the year 1977.
data(motorins)
data(motorins1)
A data frame with 2182 observations on the following 7 variables.
Kilometres: the number of kilometres travelled per year; a numeric vector with levels 1 (less than 10000), 2 (from 10000 to 15000), 3 (15000 to 20000), 4 (20000 to 25000) or 5 (more than 25000)
Zone: geographical zone (only in motorins); a numeric vector with levels 1 to 7 (see Details below)
Bonus: no claim bonus; a numeric vector equal to the number of years plus one since the last claim
Make: the make of vehicle; a numeric vector with levels from 1 to 8 representing eight common car models, and 9 representing all other models
Insured: the number of insured in policy-years; a numeric vector
Claims: the number of claims; a numeric vector
Payment: the total value of payments in Swedish kronor (Skr); a numeric vector
For variable Zone, the geographical zones are:
| Zone | Area |
|---|---|
| 1 | Stockholm, Goteborg, Malmo with surroundings |
| 2 | Other large cities and surroundings |
| 3 | Small cities in northern Sweden |
| 4 | Small cities in southern Sweden |
| 5 | Rural areas in northern Sweden |
| 6 | Rural areas in southern Sweden |
| 7 | Gotland |
The data set motorins1 contains only the data from Zone 1 (and hence Zone is not one of the variables in that data set).
“In Sweden all motor insurance companies apply identical risk arguments to classify customers, and thus their portfolios and their claims statistics can be combined. The data were compiled by a Swedish Committee on the Analysis of Risk Premium in Motor Insurance. The Committee was asked to look into the problem of analyzing the real influence on claims of the risk arguments and to compare this structure with the actual tariff” (Andrews and Herzberg (1985), p. 413).
Make 4 is the Volkswagen 1200, which was discontinued shortly after 1977. The other makes could not be identified because of the potential for the data to impact on sales of those cars.
For these data, the number of claims is well described by a Poisson distribution and the amount of each claim by a gamma distribution. The total claim amount then has a Tweedie (compound Poisson-gamma) distribution.
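The Poisson remark above suggests modelling the claim counts with a Poisson GLM that uses the policy-years as exposure. The following is a minimal sketch under that assumption, not an analysis taken from the source: the helper name `fit_claim_freq` is hypothetical, and treating Kilometres and Bonus as factors is an illustrative choice.

```r
# Sketch: Poisson GLM for claim counts with policy-years as the exposure.
# fit_claim_freq() is a hypothetical helper, not part of GLMsData.
fit_claim_freq <- function(d) {
  # log(0) is undefined, so rows with no exposure are dropped first
  d <- d[d$Insured > 0, ]
  glm(Claims ~ factor(Kilometres) + factor(Bonus) + offset(log(Insured)),
      family = poisson(link = "log"), data = d)
}

# Apply to the Zone 1 data when the package is installed
if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(motorins1, package = "GLMsData")
  fit <- fit_claim_freq(motorins1)
  print(summary(fit)$coefficients)
}
```

The offset fixes the coefficient of log(Insured) at one, so the remaining coefficients describe the claim rate per policy-year rather than the raw count.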
The OzDASL datasets. The data were obtained electronically from the Statlib database by Dr Gordon Smyth for OzDASL (http://www.statsci.org/data/).
M. Hallin and J.-F. Ingenbleek (1983) The Swedish automobile portfolio in 1977. A statistical study. Scandinavian Actuarial Journal, 49–64. The data are not listed in this reference.
D. F. Andrews and A. M. Herzberg (1985) Data. A collection of problems from many fields for the student and research worker. Springer, New York, pages 413–421. Only the data from Zone 1 are listed (that is, the listing corresponds to motorins1).
data(motorins)
summary(motorins)
The number of revertant colonies for various doses of quinoline for TA98 Salmonella
data(mutagen)
A data frame with 18 observations on the following 2 variables.
Dose: the dose of quinoline; a numeric vector
Colonies: the number of revertant colonies; a numeric vector
The number of revertant colonies (colonies that revert to their former genotype) for various doses of quinoline for TA98 Salmonella.
The given data represent only one replicate of the three given in Margolin, Kim and Risko (1984), but are as given in Breslow (1984).
Three plates were used for each dose, hence the three observations per dose. The data are given in order of increasing numbers of colonies.
Theory suggests a model for the data in terms of the dose of quinoline, with parameters constrained to be greater than or equal to zero. A good approximation to this model is a log-linear model.
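A log-linear model of the kind mentioned above can be fitted as a Poisson GLM with a log link. The sketch below is illustrative only: the linear predictor, with terms Dose and log(Dose + 10), is an assumption (a form often used with these data) and is not recovered from the text above, and `fit_loglinear` is a hypothetical helper name.

```r
# Sketch: log-linear (Poisson, log link) model for revertant colony counts.
# The terms Dose and log(Dose + 10) are an assumed form, not from the text.
# fit_loglinear() is a hypothetical helper, not part of GLMsData.
fit_loglinear <- function(d) {
  glm(Colonies ~ Dose + log(Dose + 10),
      family = poisson(link = "log"), data = d)
}

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(mutagen, package = "GLMsData")
  fit <- fit_loglinear(mutagen)
  print(coef(fit))
  # A residual deviance well above its degrees of freedom would hint at
  # extra-Poisson variation
  print(c(deviance = deviance(fit), df = df.residual(fit)))
}
```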
N. E. Breslow (1984) Extra-Poisson variation in log-linear models. Applied Statistics, 33(1), 38–44.
B. H. Margolin, N. Kaplan, and E. Zeiger (1981) Statistical analysis of the Ames Salmonella/microsome test. Proceedings of the National Academy of Sciences USA, 76, 3779–3783.
data(mutagen)
summary(mutagen)
The “somatic cell mutant frequencies at the hprt locus of the X-chromosome” in healthy children
data(mutantfreq)
A data frame with 49 observations on the following 5 variables.
Donor: the donor identifier; a factor
Sex: the sex of the child; a factor with levels F (females) or M (males)
Age: the age of the child in completed years; a numeric vector
Ceff: the mean unselected cloning efficiency; a numeric vector
Mfreq: the mutant frequencies; a numeric vector
In the original paper, the children are sometimes referred to as belonging to Group II (Ages 0 to 5), Group III (Ages 6 to 11) or Group IV (Ages 12 to 17). (Group I refers to cord data referenced to another article.) Age may be treated as categorical with these categories.
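The grouping described above can be sketched with `cut()`. This is a minimal illustration, assuming ages are whole completed years between 0 and 17; the helper name `age_to_group` and the column name `AgeGroup` are hypothetical.

```r
# Sketch: convert age in completed years to the paper's Groups II, III and IV.
# age_to_group() is a hypothetical helper, not part of GLMsData.
age_to_group <- function(age) {
  # Intervals are (-Inf, 5], (5, 11] and (11, 17]
  cut(age, breaks = c(-Inf, 5, 11, 17), labels = c("II", "III", "IV"))
}

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(mutantfreq, package = "GLMsData")
  mutantfreq$AgeGroup <- age_to_group(mutantfreq$Age)
  print(table(mutantfreq$AgeGroup))
}
```

Ages above 17 would become NA under this sketch, which makes any value outside the stated groups easy to spot.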
B. A. Finette, L. M. Sullivan, J. P. O'Neill, J. A. Nicklas, P. M. Vacek and R. J. Albertini (1994) Determination of hprt mutant frequencies in T-lymphocytes from a healthy pediatric population: statistical comparison between newborn, children and adult mutant frequencies, cloning efficiency and age. Mutation Research, 308, 223–231.
data(mutantfreq)
summary(mutantfreq)
Information about the production of various tableware products
data(nambeware)
A data frame with 59 observations on the following 4 variables.
Type: the type of product; a factor with levels Bowl, CassDish, Dish, Plate and Tray
Diam: the diameter of the product in inches; a numeric vector
Time: the total grinding and polishing time in minutes; a numeric vector
Price: the price in US dollars; a numeric vector
The data come from Nambe Mills (https://www.nambe.com/), manufacturers of tableware made from sand casting a special alloy of several metals. The polishing times for the products are thought to be related to the size of the item, as indicated by the diameter. After casting, the pieces go through a series of shaping, grinding, buffing, and polishing steps. In 1989 the company began a program to rationalize its production schedule of some 100 items in its tableware line. The total grinding and polishing times listed here were a major output of this program.
The data are originally from the Nambe Mills company, as quoted on the DASL website (https://dasl.datadescription.com/datafile/nambe/).
data(nambeware)
summary(nambeware)
The monthly maintenance hours associated with maintaining the anaesthesiology service for twelve naval hospitals
data(nhospital)
A data frame with 12 observations on the following 4 variables.
MainHours: the monthly maintenance hours associated with maintaining the anaesthesiology service for twelve naval hospitals in the USA; a numeric vector
Cases: the number of surgical cases; a numeric vector
Eligible: the eligible population per thousand; a numeric vector
OpRooms: the number of operating rooms; a numeric vector
The monthly maintenance hours associated with maintaining the anaesthesiology service for twelve naval hospitals in the USA were measured, together with some explanatory variables.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 269.
Raymond H. Myers (1990) Classical and Modern Regression with Applications, second edition, Duxbury: Belmont, CA.
data(nhospital)
summary(nhospital)
The soil nitrogen after applying different fertilizer doses
data(nitrogen)
A data frame containing 24 observations with the following 3 variables.
Fert: the fertilizer dose, in kilograms of nitrogen per hectare; a numeric vector
SoilN: the soil nitrogen, in kilograms of nitrogen per hectare; a numeric vector
Source: the fertilizer source; a factor with levels 0 (inorganic) and 1 (organic; farmyard manure)
The data give the soil inorganic nitrogen content for various fertilizer doses, including a control. One application is from an organic source. Each level of fertilizer has data from three replications.
P. W. Lane (2002) Generalized linear models in soil science. European Journal of Soil Science, 53:241–251.
Glendining, M.J., Poulton, P.R. & Powlson, D.S. (1992) The relationship between inorganic N in soil and the rate of fertilizer N applied on the Broadbalk Wheat Experiment. Aspects of Applied Biology, 30, 95–102.
data(nitrogen)
summary(nitrogen)
The number of noisy miners detected in various 2 hectare transects in buloke woodland patches within the Wimmera Plains of western Victoria, Australia
data(nminer)
A data frame with 31 observations on the following 9 variables.
Miners: the presence or absence of noisy miners; a numeric vector with levels 1 (present) or 0 (absent)
Eucs: the number of eucalypts in each 2 hectare transect; a numeric vector
Area: the area in hectares of contiguous remnant patch of vegetation in which the transect was located; a numeric vector
Grazed: whether the area was grazed or not; a numeric vector with levels 0 (grazed) or 1 (not grazed)
Shrubs: whether shrubs were present in the transect or not; a numeric vector with levels 1 (shrubs present) or 0 (shrubs not present)
Bulokes: the number of buloke trees in each 2 hectare transect; a numeric vector
Timber: the number of pieces of fallen timber in the transect; a numeric vector
Minerab: the number of noisy miners (abundance) observed in three 20 minute surveys; a numeric vector
The data give the number of noisy miners detected in various two hectare transects in buloke woodland patches within the Wimmera Plains of western Victoria, Australia. The noisy miner is a small but aggressive native Australian bird.
Personal communication from Martine Maron.
Martine Maron (2007) Threshold effect of eucalypt density on an aggressive avian competitor. Biological Conservation, 136, 100–107.
data(nminer)
summary(nminer)
The tensile strength of Kraft paper with varying hardwood concentrations
data(paper)
A data frame with 19 observations on the following 2 variables.
Strength: the paper strength, in pounds per square inch (psi); a numeric vector
Hardwood: the hardwood concentration in the paper in percent; a numeric vector
The data give the strength of 25 samples of Kraft paper (a strong, coarse, usually brownish type of paper) for varying amounts of hardwood.
G. Joglekar, J. H. Schuenemeyer and V. LaRicca (1989) Lack-of-fit testing when replicates are not available. American Statistician, 43, 135–143.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 271. (The response and explanatory variables are reversed from those in the original article.)
D. C. Montgomery and E. A. Peck (1982) Introduction to linear regression analysis, New York: John Wiley.
data(paper)
plot(paper)
The permeability of building materials
data(perm)
A data frame with 81 observations on the following 3 variables.
Day: the day; a factor with levels 1 up to 9
Mach: the machine used for measurement; a factor with levels A, B or C
Perm: the permeability in seconds; a numeric vector
The data give the average permeability (in seconds) of eight sheets of building materials, using random samples of 81 sheets in three machines over nine days, with three measurements for each machine–day combination.
Bent Joergensen (1992) Exponential dispersion models and extensions: A review. International Statistical Review, 60(1), 5–20.
A. Hald (1952) Statistical theory with engineering applications. New York: Wiley.
data(perm)
summary(perm)
The amount of phosphorus in soil samples
data(phosphorus)
A data frame with 18 observations on the following 4 variables.
Sample: an identifier, the sample ID; a numeric vector
Inorg: the amount of inorganic phosphorus chemically determined in ppm (parts per million); a numeric vector
Org: the amount of organic phosphorus chemically determined in ppm; a numeric vector
PA: the amount of plant-available phosphorus of corn grown in the soil in ppm; a numeric vector
The phosphorus in the soil at 18 locations in Iowa was chemically determined, including the amount of available phosphorus for growing corn at 20 degrees C.
S. M. Snapinn and R. D. Small (1986) Tests of significance using regression models for ordered categorical data. Biometrics, 42, 583–592.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 237.
data(phosphorus)
summary(phosphorus)
In an experiment, “viral activity was assessed from pock counts at a series of dilutions of the viral medium”
data(pock)
A data frame with 48 observations on the following 2 variables.
Count: the number of membrane pock counts; a numeric vector
Dilution: the dilution factor; a numeric vector
The data come from a titration bioassay, in which viral activity was assessed from pock counts at different dilutions of the viral medium.
P. J. Smith and D. F. Heitjan (1993) Testing and adjusting for departures from nominal dispersion in generalized linear models. Applied Statistics, 42, 31–41 (Table 1).
data(pock)
with( pock, tapply( Count, list(Dilution), mean) )
with( pock, tapply( Count, list(Dilution), var) )
The survival times of animals under various treatments and poisons
data(poison)
A data frame with 48 observations on the following 3 variables.
Psn: the type of poison; a vector with levels I, II or III
Trmt: the type of treatment; a vector with levels A, B, C or D
Time: the time to death in ten-hour units; a numeric vector
The data give the time to death of animals using one of three different poisons and one of four treatments. For each of the twelve combinations, four times are recorded.
G. E. P. Box and D. R. Cox (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26(2), 211–252.
The data also appear in D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 403.
data(poison)
summary(poison)
The number of polyps in people with familial adenomatous polyposis, after being given a placebo or a new drug
data(polyps)
A data frame with 20 observations on the following 3 variables.
Number: the number of polyps; a numeric vector
Treatment: the treatment group; a factor with levels Drug (sulindac), Placebo
Age: the age of the person; a numeric vector
The data give the number of polyps in people with familial adenomatous polyposis, after being given a placebo or a new drug (sulindac).
B. S. Everitt and T. Hothorn (2006) A Handbook of Statistical Analyses Using R, Chapman & Hall/CRC, Table 6.1.
F. N. Giardiello, S. R. Hamilton, A. J. Krush, S. Piantadosi, L. M. Hylind, P. Celano, S. V. Booker, C. R. Robinson, and G. J. A. Offerhaus (1993) Treatment of colonic and rectal adenomas with sulindac in familial adenomatous polyposis, New England Journal of Medicine, 328(18), 1313–1316.
S. Piantadosi (1997) Clinical trials: A methodologic perspective, New York: John Wiley and Sons.
data(polyps)
coplot( Number ~ Age | Treatment, data=polyps )
The usage, in tonnes, of polythene as a packaging material for 23 UK cosmetic companies (year unknown)
data(polythene)
A data frame with 23 observations on the following 3 variables.
Company: the UK cosmetic company identifier; a numeric vector with levels from 1 to 23
Polythene: the amount of polythene used in tonnes for packaging; a numeric vector
Turnover: the annual company turnover in hundreds of thousands of pounds; a numeric vector
Robert Gilchrist (2000) Regression models for data with a non-zero probability of a zero response. Communications in Statistics—Theory and Methods, 29, 1987–2003.
data(polythene)
summary(polythene)
The right- and left-leg strengths of 13 American footballers (measured using a weight lifting test), plus the distance they punt a football (with their right leg).
data(punting)
A data frame with 13 observations on the following 3 variables.
Left: left-leg strength in pounds; a numeric vector
Right: right-leg strength in pounds; a numeric vector
Punt: punting distance in feet; a numeric vector
Raymond H. Myers (1990) Classical and modern regression with applications, second edition. Duxbury; page 75.
These appear to come from a larger data set, available from (for example) OzDASL at http://www.statsci.org/data/general/punting.html.
data(punting)
plot(Punt ~ Right, data=punting)
The total July rainfall at Quilpie, Queensland, Australia from 1921 to 1988
data(quilpie)
A data frame with 68 observations on the following 6 variables.
Year: the year; a numeric vector
Rain: the total monthly July rainfall in millimetres; a numeric vector
SOI: the July average southern oscillation index, or SOI; a numeric vector
Phase: the SOI phase (see Stone and Auliciems, 1992); a factor with these values: 1 (consistently negative), 2 (consistently positive), 3 (rapidly falling), 4 (rapidly rising), or 5 (consistently near zero)
Exceed: an indicator for whether or not the total monthly July rainfall exceeds 10 millimetres; a factor where Yes means the rainfall exceeds 10mm, and No means the rainfall is 10mm or less
y: an indicator for whether or not the total monthly July rainfall exceeds 10 millimetres; a factor where 1 means the rainfall exceeds 10mm, and 0 means the rainfall is 10mm or less
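The two indicator variables described above are both derived from Rain with a 10mm threshold. The sketch below shows that derivation and one possible use of the indicator as a binary response; the helper name `exceeds_threshold` is hypothetical, and the logistic regression on SOI is an illustrative choice rather than an analysis from the source.

```r
# Sketch: derive the 0/1 rainfall indicator from Rain using a 10mm threshold.
# exceeds_threshold() is a hypothetical helper, not part of GLMsData.
exceeds_threshold <- function(rain, threshold = 10) {
  as.integer(rain > threshold)   # 1 if strictly above the threshold, else 0
}

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(quilpie, package = "GLMsData")
  y_check <- exceeds_threshold(quilpie$Rain)
  # Cross-tabulate the derived indicator against the stored one
  print(table(derived = y_check, stored = quilpie$y))
  # Illustrative binomial GLM: probability of exceeding 10mm given SOI
  fit <- glm(y_check ~ SOI, family = binomial, data = quilpie)
  print(coef(fit))
}
```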
Data obtained from the IRI/LDEO Climate Data Library (formerly http://ingrid.ldgo.columbia.edu, now http://iridl.ldeo.columbia.edu) on 26 May 2009.
R. C. Stone and A. Auliciems (1992) SOI phase relationships with rainfall in eastern Australia, International Journal of Climatology, 12, 625–636.
data(quilpie)
plot( Rain ~ SOI, data=quilpie)
plot( Rain ~ factor(Phase), data=quilpie)
The data describe an experiment conducted to investigate the amount of drug present in the liver of a rat.
data(ratliver)
A data frame with 19 observations on the following 4 variables.
BodyWt: the body weight of each rat in grams; a numeric vector
LiverWt: the weight of each liver in grams; a numeric vector
Dose: the relative dose of the drug given to each rat as a fraction of the largest dose; a numeric vector
DoseInLiver: the proportion of the dose in the liver; a numeric vector
The data describe an experiment conducted to investigate the amount of drug present in the liver of a rat. Nineteen rats were randomly selected, weighed, and placed under a light anesthetic and given an oral dose of the drug. Because it was thought that large livers would absorb more of a given dose than a small liver, the actual dose given was approximately determined as 40mg of the drug per kilogram of body weight. After a fixed length of time, each rat was sacrificed and the liver weighed, and the percent dose in the liver was determined.
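The dosing rule above (about 40mg of drug per kilogram of body weight, with Dose recorded as a fraction of the largest dose) can be sketched as follows. This is only the nominal rule: the text says the actual doses were approximate, so the recorded Dose need not match it exactly. The helper name `relative_dose` is hypothetical.

```r
# Sketch: nominal relative dose implied by a 40 mg/kg dosing rule.
# relative_dose() is a hypothetical helper, not part of GLMsData.
relative_dose <- function(body_wt_g, mg_per_kg = 40) {
  dose_mg <- mg_per_kg * body_wt_g / 1000   # body weight in grams to kilograms
  dose_mg / max(dose_mg)                    # fraction of the largest dose
}

# Hypothetical body weights in grams, for illustration only
print(relative_dose(c(150, 180, 200)))

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(ratliver, package = "GLMsData")
  # Difference between the recorded and the nominal relative dose
  print(summary(ratliver$Dose - relative_dose(ratliver$BodyWt)))
}
```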
Sanford Weisberg (1985) Applied Linear Regression, second edition, New York: John Wiley and Sons, page 122.
data(ratliver)
summary(ratliver)
The data come from an experiment to investigate the propagation of plum root-stocks
data(rootstock)
A data frame with 8 observations on the following 4 variables.
Count: the number in each category; a numeric vector
Time: the time of planting; a numeric vector with levels Now (straight away) or Spring (spring)
Length: the length of the cutting; a numeric vector with levels Long or Short
Condition: the condition of the cutting at the end of the experiment; a numeric vector with levels Alive or Dead
M. S. Bartlett (1935) Contingency table interactions. Journal of the Royal Statistical Society Supplement, 2, 248–252.
data(rootstock)
summary(rootstock)
The initial rate of benzene oxidation over a vanadium oxide catalyst using three different reaction temperatures and varying oxygen and benzene concentrations
data(rrates)
A data frame with 48 observations on the following 4 variables.
Run: an identifier; a numeric vector
Conc.O: the oxygen concentration (by 10000 gmole per litre); a numeric vector
Temp: the temperature in degrees Kelvin; a numeric vector
Rate: the reaction rate (by gmole per gram of catalyst per second); a numeric vector
D. J. Pritchard, J. Downie, and D. W. Bacon (1977) Further consideration of heteroscedasticity in fitting kinetic models. Technometrics, 19(3), 227–236.
Originally from Jaswal, Mann, Juusola and Downie (1969) The vapour-phase oxidation of benzene over a vanadium pentoxide catalyst. Canadian Journal of Chemical Engineering, 47(3), 284–287.
data(rrates)
summary(rrates)
Weights of rainbow trout at various doses of DCA
data(rtrout)
A data frame containing 96 observations with the following 2 variables.
Weight: the weight of the rainbow trout, in grams; a numeric vector
Dose: the dose of 3,4-dichloroaniline (DCA), in micrograms per litre; one of 0 (control), 19, 39, 71, 120, or 210
The data give the weight of 95 rainbow trout after exposure to DCA for 28 days (note that one observation is missing at a dose of 39). The aim of the study was to “determine the concentration level which causes 25% inhibition [i.e. weight loss] from the control” (Maul, p. 161).
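The 25% inhibition target quoted above corresponds to a mean weight equal to 75% of the control mean. A minimal sketch of that arithmetic follows; the helper name `inhibition_target` is hypothetical, and using the sample mean of the control group as the reference is an assumption.

```r
# Sketch: weight corresponding to a given fractional inhibition from control.
# inhibition_target() is a hypothetical helper, not part of GLMsData.
inhibition_target <- function(weight, dose, inhibition = 0.25) {
  control_mean <- mean(weight[dose == 0], na.rm = TRUE)  # ignore missing weights
  (1 - inhibition) * control_mean
}

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(rtrout, package = "GLMsData")
  # Mean weight at each dose, for comparison with the target weight
  print(tapply(rtrout$Weight, rtrout$Dose, mean, na.rm = TRUE))
  print(inhibition_target(rtrout$Weight, rtrout$Dose))
}
```

Finding the dose at which the fitted mean weight crosses this target would then require a dose-response model, which this sketch does not attempt.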
Crossland, N.O. (1985) A method to evaluate effects of toxic chemicals on fish growth. Chemosphere, 14(11-12), 1855–1870.
Maul A. (1992) Application of generalized linear models to the analysis of toxicity test data. Environmental Monitoring and Assessment, 23(1), 153–163.
data(rtrout)
summary(rtrout)
Energy measurements on various ruminant diets
data(ruminant)
A data frame containing 36 observations on the following 3 variables.
DryMatterDigest: the dry matter digestibility in feed (in percent)
EnergyDigest: the energy digestibility in feed (in percent)
Energy: the digestible energy content (in calories per gram)
The data give measurements of energy of dry feed fed to Merino wethers aged 2 to 2.5 years.
R. J. Moir (1961) A note on the relationship between the digestible dry matter and the digestible energy content of ruminant diets. Australian Journal of Experimental Agriculture and Animal Husbandry, 1, 24–26.
data(ruminant)
plot(ruminant)
The number of children and youth aged 12–17 who are satisfied with their weight
data(satiswt)
A data frame with 24 observations on the following 4 variables.
Counts: the number of youth in the indicated category; a numeric vector
Gender: gender; a factor with levels F (female) or M (male)
WishWt: the youths' wish for their weight relative to now; a factor with levels Thinner, Same or Heavier
Matur: when sexual maturity reached; a factor with levels Late, Mid, and Early
The data come from a study of children and youth aged 12–17, sampled from the population of the United States in 1963.
Paula Duke Duncan, Philip L. Ritter, Sanford M. Dornbusch, Ruth T. Gross, and J. Merrill Carlsmith (1985) The effects of pubertal timing on body image, school behavior, and deviance. Journal of Youth and Adolescence, 14(3), 227–235. The data are inferred from Table II.
data(satiswt)
summary(satiswt)
The time taken to deliver soft drinks to vending machines
data(sdrinks)
A data frame containing 25 observations with the following 3 variables.
Time: the time taken to service the soft drink vending machine (in minutes); a numeric vector
Cases: the number of cases of product stocked; a numeric vector
Distance: the distance walked by the driver to service the vending machines (in feet); a numeric vector
A soft drink bottler is analyzing vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. The service activity includes the time taken to stock the machine with beverage products, and for minor maintenance and housekeeping.
The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time are the number of cases of product stocked and the distance walked by the route driver.
The data were obtained electronically from OzDASL (http://www.statsci.org/data/). The Details above were obtained from this webpage.
D. C. Montgomery and E. A. Peck (1992) Introduction to Regression Analysis. Wiley, New York. Example 4.1
data(sdrinks)
summary(sdrinks)
The number of four species of seabirds
data(seabirds)
A data frame with 40 observations on the following 3 variables.
Quadrat: the quadrat; a numeric factor with levels 0 through 10
Species: the species; a factor with levels M (murre), CA (crested auklet), LA (least auklet) and P (puffin)
Count: the number of seabirds of the given species in the given quadrat; a numeric vector
The data are counts of four seabird species in ten 0.25 square-km quadrats in the Anadyr Strait (off the Alaskan coast) during summer, 1988.
Andrew R. Solow and Woollcott Smith (1991) Cluster in a heterogeneous community sampled by quadrats. Biometrics, 47(1), 311–317.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 215.
data(seabirds)
summary(seabirds)
The number of mice surviving a test dose of culture with five different doses of antipneumococcus serum
data(serum)
A data frame with 5 observations on the following 3 variables.
Dose: the dose of antipneumococcus serum in cc; a numeric vector
Number: the number of surviving mice; a numeric vector
Survivors: the number of mice in each group; a numeric vector
The number of mice surviving a test dose of culture with five different doses of antipneumococcus serum prior to being infected with pneumococci.
J. O. Irwin and E. A. Cheeseman (1939) On the maximum-likelihood method of determining dosage–response curves and approximations to the median-effective dose, in cases of a quantal response. Supplement to the Journal of the Royal Statistical Society, 6(2), 174–185.
data(serum)
summary(serum)
The heat evolved from different formulations of Portland cement
data(setting)
A data frame with 13 observations on the following 5 variables.
A: the percentage by weight of tricalcium aluminate; a numeric vector
B: the percentage by weight of tricalcium silicate; a numeric vector
C: the percentage by weight of tetracalcium alumino ferrite; a numeric vector
D: the percentage by weight of dicalcium silicate; a numeric vector
Heat: the heat evolved in calories per gram of cement; a numeric vector
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 454.
The data are originally from H. Woods, H. H. Steinour, and H. P. Starke (1932) Effects of composition of Portland Cement on heat evolved during hardening. Industrial and Engineering Chemistry, 24, 1207–1214.
data(setting)
summary(setting)
The sharpener data
data(sharpener)
A data frame with 15 observations on the following 11 variables.
Y: the measured response; a numeric vector
X1: a measured predictor; a numeric vector
X2: a measured predictor; a numeric vector
X3: a measured predictor; a numeric vector
X4: a measured predictor; a numeric vector
X5: a measured predictor; a numeric vector
X6: a measured predictor; a numeric vector
X7: a measured predictor; a numeric vector
X8: a measured predictor; a numeric vector
X9: a measured predictor; a numeric vector
X10: a measured predictor; a numeric vector
The data come from a study about making a point.
### The data are actually random numbers, generated in R as follows:
nxvars <- 10    # The number of explanatory variables
nobs <- 15      # The number of observations
set.seed(5000)  # To ensure reproducibility
# Ensure the response is normally distributed
y <- round( rnorm( nobs,0,1), 2) + 10
# The explanatory variables
rd <- runif( nxvars*nobs, 0, 1)
rd <- round( matrix( rd, ncol=nxvars), 2)
# Convert to a dataframe
rdf <- data.frame( Y=y )
for (i in (1:nxvars)){
  code <- paste( "rdf$X",i," <- rd[,",i,"]", sep="")
  eval( parse(text=code))
}
head( rdf )
data(sharpener)
head( sharpener )
The daily energy requirements for wethers at various weights
data(sheep)
A data frame with 64 observations on the following 2 variables.
Weight: the weight of each sheep in kg; a numeric vector
Energy: the daily energy requirements in Mcal per day; a numeric vector
The data measure the daily energy requirement of grazing castrated male Merino sheep (wethers) at various weights (measured by radioassay of urinary carbon dioxide). The energy requirements are useful for predicting meat production.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 241.
D. Wallach and B. Goffinet (1987) Mean square error of prediction in models for studying ecological systems and agronomic systems. Biometrics, 43(3), 561–573.
B. A. Young and J. L. Corbett (1972) Maintenance energy requirement of grazing sheep in relation to herbage availability. Australian Journal of Agricultural Research, 23(1), 57–76.
data(sheep)
plot(Energy ~ Weight, data=sheep, pch=19)
The number of O-rings damaged for 23 space shuttle launches
data(shuttles)
A data frame containing 23 observations with the following 2 variables.
Temp: the ambient air temperature in degrees Fahrenheit; a numeric vector
Damaged: the number of primary O-rings damaged for 23 space shuttle launches
The data give the ambient temperature and the number of primary O-rings damaged for 23 of the 24 space shuttle launches before the launch of the space shuttle Challenger on January 28, 1986. (Challenger was the 25th shuttle. One engine was lost at sea and could not be examined.) Each space shuttle contains 6 primary O-rings.
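Because each shuttle has 6 primary O-rings, the number damaged at a launch can be treated as a count out of 6. The sketch below fits a binomial GLM on that basis. It is an illustration under that assumption, not an analysis taken from the source; the helper name `fit_oring` and the temperature used for prediction are hypothetical.

```r
# Sketch: binomial GLM for the number of damaged O-rings out of 6 per launch.
# fit_oring() is a hypothetical helper, not part of GLMsData.
fit_oring <- function(d, n_rings = 6) {
  # Response is (damaged, undamaged) counts for each launch
  glm(cbind(Damaged, n_rings - Damaged) ~ Temp,
      family = binomial(link = "logit"), data = d)
}

if (requireNamespace("GLMsData", quietly = TRUE)) {
  data(shuttles, package = "GLMsData")
  fit <- fit_oring(shuttles)
  print(coef(fit))
  # Estimated probability that a single O-ring is damaged at a
  # hypothetical temperature of 40 degrees Fahrenheit
  print(predict(fit, newdata = data.frame(Temp = 40), type = "response"))
}
```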
Samprit Chatterjee, Mark S. Handcock and Jeffrey S. Simonoff (1995) A Casebook for a First Course in Statistics and Data Analysis, Wiley.
Siddhartha R. Dalal, Edward B. Fowlkes and Bruce Hoadley (1989) Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84(408), 945–957; Table 1.
data(shuttles)
plot(Damaged/6 ~ Temp, data=shuttles)
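Because each shuttle has 6 primary O-rings, Damaged can be treated as a count out of 6. The following is a minimal sketch, not part of the original documentation, of one way to model the proportion damaged. It assumes the GLMsData package is installed; the logit link and the use of Temp as the only predictor are assumptions made for illustration.

```r
library(GLMsData)   # assumes the package is installed
data(shuttles)

# Each launch has 6 primary O-rings, so model Damaged out of 6.
# The logit link and the single predictor are illustrative assumptions.
fit <- glm(cbind(Damaged, 6 - Damaged) ~ Temp,
           family = binomial, data = shuttles)
coef(summary(fit))

# Estimated proportion of O-rings damaged at a few temperatures (degrees F)
newd <- data.frame(Temp = c(50, 60, 70, 80))
newd$Prop <- predict(fit, newdata = newd, type = "response")
newd
```

The two-column response `cbind(successes, failures)` is the standard way to pass grouped binomial counts to `glm()`.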
Health concerns of teenagers
data(teenconcerns)
A data frame with 16 rows on the following 4 variables.
Counts: the number of teenagers in each category
Sex: the sex of the teenagers; one of M or F
Age: the age groups of the teenagers; one of 12-15 or 16-17
Concern: the type of health concerns; one of Sex, Menstrual, Healthy or Nothing
The data give the numbers of teenagers in two age groups with health concerns in specific areas: Sex, Menstrual, Healthy (that is, how healthy they are) or Nothing (no concerns at all). More specifically, these are the numbers of teens who would like to discuss these topics with their doctor. For males (M), menstrual concerns can be treated as structural zeros.
Brunswick, Ann F. (1971) Adolescent health, sex, and fertility. American Journal of Public Health, 61(4): 711–729. The numbers are inferred from the percentages in Table 3.
Christensen, R. (2013) Log-Linear Models, Springer Texts in Statistics, Springer: New York.
Fienberg, S. E. (2007) The Analysis of Cross-Classified Categorical Data, Springer: New York.
data(teenconcerns)
summary(teenconcerns)
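Treating the male menstrual cells as structural zeros means leaving them out of the fit, rather than modelling them as observed zero counts. The sketch below shows one way this might be done; it assumes the GLMsData package is installed and that the levels are coded as described above (M and Menstrual). The main-effects-only formula is an illustrative assumption, not a recommended model.

```r
library(GLMsData)   # assumes the package is installed
data(teenconcerns)

# Cells that cannot occur: menstrual concerns for males
structural <- with(teenconcerns, Sex == "M" & Concern == "Menstrual")

# Log-linear (Poisson) model fitted only to the cells that can occur.
# The main-effects-only formula is an illustrative assumption.
fit <- glm(Counts ~ Sex + Age + Concern, family = poisson,
           data = teenconcerns, subset = !structural)
nobs(fit)   # number of cells actually used in the fit
```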
The effectiveness of two types of toothbrushes for males and females
data(toothbrush)
A data frame with 52 observations on the following 5 variables.
Subject: an identifier
Sex: the sex of the subject; a factor with levels F (female) or M (male)
Toothbrush: the type of toothbrush; a factor with levels Hugger or Conventional
Before: the dental plaque index before brushing; a numeric vector
After: the dental plaque index after brushing; a numeric vector
The data give the plaque index before and after brushing for two types of toothbrushes for males and females. Each subject uses both toothbrushes. A dental plaque index of zero is the best possible score. Brushing cannot make the score worse, so the difference Before - After is positive and continuous, apart from one exact zero.
Reiko Aoki, Jorge A. Achcar, Heleno Bolfarine, and Julio M. Singer (2003) Bayesian analysis of null-intercept errors-in-variables regression for pretest/post-test data. Journal of Applied Statistics, 30(1), 3–12.
J. M. Singer and D. F. Andrade (1997) Regression models for the analysis of pretest-posttest data. Biometrics, 53, 729–735.
data(toothbrush)
with(toothbrush, plot(Before-After ~ Sex) )
with(toothbrush, plot(Before-After ~ Toothbrush) )
The proportion of people sampled in 34 cities in El Salvador who tested positive for toxoplasmosis.
data(toxo)
A data frame with 34 observations on the following 5 variables.
City: the city from which the data come; a numeric vector
Rainfall: the recorded rainfall in millimetres at each city, presumably annual; a numeric vector
Proportion: the proportion of those sampled who tested positive for toxoplasmosis; a numeric vector
Sampled: the number of people sampled in each city; a numeric vector of integers
Positive: the number of people who tested positive for toxoplasmosis; a numeric vector of integers
The subjects are not randomly sampled within city.
Bradley Efron (1986) Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 81(395), 709–721.
data(toxo)
summary(toxo)
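Since Proportion is the number testing positive divided by the number sampled, the three variables can be checked against each other, and the counts can be modelled directly. This is a minimal sketch, assuming the GLMsData package is installed; the linear effect of Rainfall on the logit scale is an assumption chosen for brevity, not the model used in the cited paper.

```r
library(GLMsData)   # assumes the package is installed
data(toxo)

# Proportion should agree with Positive / Sampled, up to rounding
with(toxo, max(abs(Proportion - Positive / Sampled)))

# Binomial GLM for the number testing positive out of the number sampled.
# A linear effect of Rainfall on the logit scale is an assumption.
fit <- glm(cbind(Positive, Sampled - Positive) ~ Rainfall,
           family = binomial, data = toxo)
coef(summary(fit))
```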
The data are the lengths of three sides of (hypothetical) right-angled triangles
data(triangle)
A data frame with 20 observations on the following 3 variables.
y: the length of the hypotenuse; a numeric vector
x1: the length of one side of the triangle; a numeric vector
x2: the length of the third side of the triangle; a numeric vector
The data give the three sides of hypothetical right-angled triangles.
The data are randomly generated so that $y$ is the square root of $x_1^2 + x_2^2$, plus a small amount of error.
The idea is from Gelman and Nolan (2002).
The data are artificial; generated using R.
The idea is from Andrew Gelman and Deborah Nolan (2002) Teaching Statistics: A bag of tricks. Oxford University Press.
data(triangle)
plot(triangle)
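The code used to generate the triangle data is not shown in this documentation. The sketch below shows how data of this kind could be simulated and how the Pythagorean relationship can be recovered by regressing the squared hypotenuse on the squared sides. The seed, the range of side lengths and the error standard deviation are all assumptions made for illustration.

```r
set.seed(1)                     # arbitrary seed (an assumption)
n  <- 20                        # same number of rows as the triangle data
x1 <- runif(n, 1, 10)           # side lengths; the range is an assumption
x2 <- runif(n, 1, 10)

# Hypotenuse from Pythagoras' theorem, plus a small amount of error
y  <- sqrt(x1^2 + x2^2) + rnorm(n, sd = 0.1)
tri <- data.frame(y, x1, x2)

# On the squared scale the relationship is linear with no intercept,
# so both coefficients should be close to 1
fit <- lm(I(y^2) ~ 0 + I(x1^2) + I(x2^2), data = tri)
coef(fit)
```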
The survival of trout eggs exposed to potassium cyanate
data(trout)
A data frame with 48 observations on the following 4 variables.
Conc: the concentration of potassium cyanate in mg/litre; a numeric variable
When: when the toxicant is applied; a factor with levels Now or Later (after the eggs have water-hardened)
Number: the number of eggs used; a numeric variable
Dead: the number of eggs dead; a numeric variable
The data show the number of trout eggs that are dead at Day 19 after exposure to potassium cyanate (KSCN). Half the eggs in each vial were first allowed to water-harden before the toxicant was applied; the other half were exposed immediately.
R. J. O'Hara Hines and E. M. Carter (1993) Improved added variable and partial residual plots for detection of influential observations in generalized linear models. Applied Statistics, 42(1), 3–20.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 418.
data(trout)
summary(trout)
In an experiment, turbine wheels were run for a number of hours, and the number of fissures developed was counted
data(turbines)
A data frame with 11 observations on the following 3 variables.
Hours: the number of hours the turbine was run; a numeric vector
Turbines: the number of turbines run for the given number of hours; a numeric vector
Fissures: the number of turbine wheels with fissures; a numeric vector
The data give the midpoints of running times; that is, the first row (where Hours=400) actually corresponds to a running time of 0 to 800 hours. The final class is 4400+ hours, taken as 4600 for convenience.
Raymond H. Myers, Douglas C. Montgomery, and G. Geoffrey Vining (2002) Generalized linear models with applications in engineering and the sciences, Wiley.
The original source is Wayne Nelson (1982) Applied Life Data, Wiley, 407–409.
data(turbines)
summary(turbines)
The urination times of animals
data(urinationD)
A data frame containing 35 observations with the following 5 variables.
Animal: the type of animal; some are repeated
Sex: the sex of the animal; one of F or M
Mass: the mass of the animal (or mean mass of the animals, when multiple animals are represented), in kg
Duration: the urination time of the animal (or the mean, when multiple animals are represented), in seconds
SampleSize: the size of the sample represented by the data, usually 1
The data give the duration of urination for animals of different sex and mass. The data were collected using numerous methods (including YouTube videos); see details in Yang et al. (2014). From the paper: “we discover that all mammals above 3 kg in weight empty their bladders over nearly constant duration of 21 ± 13 s.” (p. 11932)
Yang et al. (2014) supplementary information Table S1.
Patricia J. Yang, Jonathan Pham, Jerome Choo, and David L. Hu (2014) Duration of urination does not change with body size. Proceedings of the National Academy of Sciences, 111(33), 11932–11937.
data(urinationD)
summary(urinationD)
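The quoted claim about animals above 3 kg can be explored directly from the data. The following is a rough sketch, assuming the GLMsData package is installed. It computes an unweighted mean and standard deviation, ignoring SampleSize, which is a simplifying assumption, so the results need not match the figures quoted from the paper.

```r
library(GLMsData)   # assumes the package is installed
data(urinationD)

# Animals heavier than 3 kg, the threshold quoted from Yang et al. (2014)
large <- subset(urinationD, Mass > 3)

# Unweighted mean and standard deviation of urination duration (seconds).
# Ignoring SampleSize is a simplifying assumption.
c(n    = nrow(large),
  mean = mean(large$Duration, na.rm = TRUE),
  sd   = sd(large$Duration, na.rm = TRUE))

# Duration against mass on a log scale, to look for any trend with body size
plot(Duration ~ Mass, data = urinationD, log = "x", pch = 19)
```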
The urethral length of animals
data(urinationL)
A data frame containing 35 observations with the following 5 variables.
Animal: the type of animal; some are repeated
Sex: the sex of the animal; one of F or M
Mass: the mass of the animal (or mean mass of the animals, when multiple animals are represented), in kg
Length: the urethral length of the animal (or the mean, when multiple animals are represented), in mm
SampleSize: the size of the sample represented by the data, usually 1
The data give the urethral length for animals of different sex and mass. The data were collected using numerous methods; see details in Yang et al. (2014).
Yang et al. (2014) supplementary information Table S2.
Patricia J. Yang, Jonathan Pham, Jerome Choo, and David L. Hu (2014) Duration of urination does not change with body size. Proceedings of the National Academy of Sciences, 111(33), 11932–11937.
data(urinationL)
summary(urinationL)
Diagnoses of cancer in Western Australia for males and females in 1996
data(wacancer)
A data frame with 14 observations on the following 3 variables.
Cancer: the type of cancer; a factor with levels Prostate, Breast, Colorectal, Lung, Melanoma, Cervix, and Other
Gender: the gender; a factor with levels M (males) and F (females)
Counts: the number of people in the designated category; a numeric vector
The data give the number of diagnoses of the designated cancers in Western Australia in 1996.
Health Department of Western Australia, Annual Report 1997/1998: Health of Western Australians, mortality and survival. Published on the internet at http://www.health.wa.gov.au/Publications/annualreport_9798/, accessed 19 September 2001.
data(wacancer)
summary(wacancer)
The annual rainfall for stations in the wheat-belt in the north and centre of New South Wales (Australia)
data(wheatrain)
A data frame with 24 observations on the following 6 variables.
Station: the station name; a text vector
Alt: the station altitude (in metres); a numeric vector
Lat: the station latitude (in degrees south); a numeric vector
Lon: the station longitude (in degrees east); a numeric vector
AR: the station's mean annual rainfall (in mm) between 1916 and 1990; a numeric vector
Region: the station's region, as computed by Boer et al. (1993) using a principal component analysis based on monthly rainfall; a numeric vector with levels 1, 2 and 3
The data give the mean annual rainfall for 24 stations in the wheat-belt of NSW. The mean rainfall is based on the years 1916 to 1990, apart from Station 1 (1907 to 1983), Station 10 (1916 to 1965) and Station 11 (1935 to 1976).
Rizaldi Boer, David J. Fletcher, and Lindsay C. Campbell (1993) Rainfall patterns in a major wheat-growing region of Australia. Australian Journal of Agricultural Research, 44(2), 606–624.
data(wheatrain)
plot(AR ~ Region, data=wheatrain)
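Region is stored as a number but acts as a label for three groups of stations, so it may be more natural to treat it as a factor when plotting or modelling. The sketch below assumes the GLMsData package is installed; converting Region with `factor()` is a suggestion for illustration, not part of the original example.

```r
library(GLMsData)   # assumes the package is installed
data(wheatrain)

# Region is a label (1, 2 or 3), so convert it to a factor
wheatrain$Region <- factor(wheatrain$Region)

# With a factor on the right-hand side, plot() draws one boxplot per region
plot(AR ~ Region, data = wheatrain)

# Mean annual rainfall (mm) by region
tapply(wheatrain$AR, wheatrain$Region, mean)
```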
The amount of direct current (DC) output from windmills for varying wind velocities
data(windmill)
A data frame with 25 observations on the following 2 variables.
Wind: the wind velocity in miles per hour; a numeric vector
DC: the DC output; a numeric vector
The wind velocity and corresponding direct current (DC) output from windmills were recorded.
G. Joglekar, J. H. Schuenemeyer and V. LaRicca (1989) Lack-of-fit testing when replicates are not available. American Statistician, 43, 135–143.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski (1994) A Handbook of Small Data Sets, London: Chapman and Hall. Dataset 271.
D. C. Montgomery and E. A. Peck (1982) Introduction to Linear Regression Analysis. New York: John Wiley.
data(windmill)
summary(windmill)
The smoking habits and survival of women in Whickham
data(wwomen)
A data frame with 14 observations on the following 4 variables.
Age: the age of the women in completed years in the original survey; a factor with levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74 and 75+
Smoking: the smoking status of the women in the original survey; a factor with levels NonSmoker and Smoker
Status: the status of the women twenty years after the original survey; a factor with levels Dead or Alive
Count: the number of women in each category; a numeric vector
The data give the smoking and survival status of 1314 women in Whickham (north England). A survey was originally conducted in 1972–1974; a subsequent survey twenty years later followed up the women to determine how many women from the original survey had died. (Of the original women in the survey, 180 have been excluded here: 18 whose smoking habits were not recorded, and 162 who were smokers before the first survey but were non-smokers at the time of the second survey.)
D. R. Appleton, J. M. French, and M. P. J. Vanderpump (1996) Ignoring a covariate: An example of Simpson's paradox. The American Statistician, 50, 340–341.
The data also appear in Anthony C. Davison (2003) Statistical Models. Number 11 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, UK.
data(wwomen)
summary(wwomen)
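The source paper uses these data to illustrate Simpson's paradox, where a comparison made after ignoring a covariate (here Age) can differ from the comparison made within each level of that covariate. The sketch below shows one way to tabulate both views; it assumes the GLMsData package is installed and that the variables are coded as described above. No particular result is claimed here.

```r
library(GLMsData)   # assumes the package is installed
data(wwomen)

# Collapse over Age: counts by smoking status and survival status
tab <- xtabs(Count ~ Smoking + Status, data = wwomen)
prop.table(tab, margin = 1)       # row proportions, ignoring age

# The same proportions computed separately within each age group
tab3 <- xtabs(Count ~ Smoking + Status + Age, data = wwomen)
prop.table(tab3, margin = c(1, 3))
```

Comparing the two sets of proportions shows whether the apparent effect of smoking changes once age is taken into account.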
The mean yields per plant for three onion varieties
data(yieldden)
A data frame with 30 observations on the following 3 variables.
Yield: the yield per plant in grams; a numeric vector
Dens: the planting density in plants per square foot; a numeric vector
Var: the variety; a numeric vector with levels 1, 2 or 3
R. Mead (1970) Plant density and crop yield. Applied Statistics, 19(1), 64–81.
data(yieldden)
summary(yieldden)