Materials from the R workshop at #AAPA2016

For last week’s AAPA conference, my friend and colleague David Pappano organized a workshop teaching about the many uses of the R programming language for biological anthropology (I’m listed as co-organizer, but really David did everything). After introducing the basics, we broke into small groups focusing on specific aspects of using R. I devised some lessons for basic statistics, writing functions, and resampling. Since each of the lessons could easily have taken up an hour, and most people didn’t get to go through the activities fully, I’m posting the R code here for people to mess around with.

The basic stats lesson utilized Francis Galton’s height data for hundreds of families, courtesy of Dr. Ryan Raaum. To load these data you just need to type into R: galton = read.csv(url("")). The code simply shows how to do basic statistics that are built into R, such as t-tests and linear regression.
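To give a flavor of the lesson, here’s a minimal sketch of those built-in tools. Since I can’t reproduce the workshop csv here, the data below are simulated stand-ins, and the column names (father, sex, height) are my assumptions, not necessarily those of the actual Galton file:

```r
# Simulated stand-in for the Galton family data (columns are hypothetical)
set.seed(1)
galton <- data.frame(
  father = rnorm(100, mean = 69, sd = 2.5),   # father's height, inches
  sex    = rep(c("M", "F"), 50)               # child's sex
)
galton$height <- 40 + 0.4 * galton$father +
  ifelse(galton$sex == "M", 4, 0) + rnorm(100, sd = 2)

summary(galton$height)                     # basic summary statistics
t.test(height ~ sex, data = galton)        # Welch two-sample t-test
fit <- lm(height ~ father, data = galton)  # linear regression
summary(fit)                               # coefficients, R-squared, etc.
```

The formula interface (`height ~ sex`) is the same across `t.test()`, `lm()`, and many other built-in functions, which is part of what makes R so handy for this kind of work.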

Some summary stats for the Galton data. The code is in blue and the output in black.

Here is the Basic Stats code; download it, paste it into an R file, and buckle up!

The lesson on functions and resampling was based on limb length data for apes, fossil hominins and modern humans (from Dr. Herman Pontzer). The csv file with the data can be downloaded from David’s website. R has lots of great built-in functions (see basic stats, above), and even if you’re looking to do something more than the basics, chances are you can find what you’re looking for in one of the myriad packages that researchers have developed and published over the years. But sometimes it’s necessary to write a function on your own, and with fossil samples you may find yourself needing to do resampling with a specific function or test statistic.
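As a minimal example of rolling your own function, here’s one that computes the difference between two group means, the kind of test statistic you might feed into a resampling procedure (the function name and toy data are mine, not from the workshop file):

```r
# A hand-written test statistic: difference between two group means
mean_diff <- function(x, groups) {
  m <- tapply(x, groups, mean)  # mean of x within each group
  unname(m[1] - m[2])           # difference, with the name attribute dropped
}

# Quick check on toy data: group "a" averages 2, group "b" averages 11
mean_diff(c(1, 2, 3, 10, 11, 12), rep(c("a", "b"), each = 3))  # -9
```

Once a statistic is wrapped in a function like this, swapping it out (say, for a difference in medians or variances) means changing one line of your resampling code rather than rewriting the whole loop.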

For example, you can ask whether a small sample of “anatomically modern” fossil humans (n=12) truly differs in femur length from a small sample of Neandertals (n=9). Traditional statistics require certain assumptions about the size and distribution of the data, which fossils fail to meet. Another way to ask the question is, “If the two groups come from the same distribution (e.g. population), would random samples of sizes n=12 and n=9 have so great an average difference as we see between the fossil samples?” A permutation test, shuffling the group membership of the fossils and then calculating the difference between the new “group” means, allows you to quickly and easily ask this question:
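The permutation test described above can be sketched in a few lines. The femur lengths below are simulated stand-ins (means and spreads are invented for illustration; the workshop used real fossil measurements from the csv):

```r
# Simulated stand-in femur lengths (mm); real values came from the workshop csv
set.seed(2016)
amh <- rnorm(12, mean = 465, sd = 20)  # "anatomically modern" humans, n = 12
nea <- rnorm(9,  mean = 440, sd = 20)  # Neandertals, n = 9

femur  <- c(amh, nea)
groups <- rep(c("AMH", "NEA"), c(12, 9))

# Observed difference between group means
obs <- mean(femur[groups == "AMH"]) - mean(femur[groups == "NEA"])

# Shuffle group membership many times, recomputing the difference each time
n_reps <- 10000
perm_diffs <- replicate(n_reps, {
  shuffled <- sample(groups)  # sample() with no size argument = a permutation
  mean(femur[shuffled == "AMH"]) - mean(femur[shuffled == "NEA"])
})

# Two-tailed p-value: how often is a shuffled difference as extreme as observed?
p_val <- mean(abs(perm_diffs) >= abs(obs))
p_val
```

Note that `sample(groups)` with no other arguments simply returns a random permutation of the labels, which is exactly the "same distribution" null hypothesis in code form.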

R code for a simple permutation test. The built-in function “sample()” is your best friend.

Although simply viewing the data suggests the two groups are different (boxplot on the left, below), the permutation test confirms that there is a very low probability of sampling so great a difference as is seen between the two fossil samples.

Left: Femur lengths of anatomically modern humans (AMH) and Neandertals. Right: distribution of resampled group differences. Dashed lines bracket 95% of the resampled distribution, and the red line is the observed difference between AMH and Neandertal femur lengths. Only about 1% of the resampled differences are as great as the observed fossil difference.

Here’s the code for the functions &amp; resampling lesson. There are a bunch of examples of different resampling tests, way more than we possibly could’ve done in the brief time for the workshop. It’s posted here so you can wade through it yourself; it should keep you busy for a while if you’re new to R. Good luck!

The beardless White House: Part I

Something’s been bothering me about this election. No, it’s not the silence from both major parties on climate change. It’s the fact that neither Obama nor Romney (I accidentally just typed “RMoney”… accidentally?) sports facial hair. A friend and I were talking about this the other day, and a quick google search showed us there hasn’t been an appreciable furface sleeping at 1600 Pennsylvania Ave. since the mustachioed WH Taft (of butter and bathtub fame), 100 years ago. That is, unless any of these recent presidents was a closet homosexual (different meaning of “beard”).

This hairy dearth is deplorable. Just look at this pic of portraits of past presidents:

You’re probably thinking, “Where’s all the virile scruff?” Well, no, you’re probably thinking, “There’s a lot of dudes / white ppl there.” But your next thought is probably, “Where’s all the virile scruff?” However, from Abe Lincoln through Bill Taft there’s a fairly flagrant concentration of beards, mustaches and whatever you call the thing hiding Chester A. Arthur’s charming smile (squared off in red); only W McKinley and A Johnson dared rain on this badass parade. Yes, there are some audacious sideburns on John Q. Adams and Martin Van Buren, but otherwise all Executive facial hair is concentrated between 1860 and 1913. What gives?

It looks like there’s a fairly clear pattern: voters loathed and distrusted facial hair for the first nearly 100 years of American history, followed by a brief period in which facial hair was loved and trusted, which may then have been ruined by Taft and after which there’s been nary a stache nor goat sitting in the Oval Office to the present day. Is this a real pattern, or could some other random process produce this same distribution of scruff? (for simplicity’s sake, we’ll pretend no president served more than 1 term…) Could random sampling of 43 (mostly white) men give us a clump of 9/13 with facial hair? (sideburns don’t count) If there’s a 50/50 chance of a man growing facial hair, is 9/43 Prezes unusually high or low? I’ll let you know after I write and run some tests!
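In the spirit of the resampling posts above, here’s a rough sketch of how those two questions could be coded up. The numbers (9 facial-haired presidents out of 43, all falling within a run of 13 consecutive terms) come from my eyeballing of the portraits; treat everything else as an assumption:

```r
# Question 1: if growing facial hair were a 50/50 coin flip,
# is 9 bearded presidents out of 43 unusually low?
binom.test(9, 43, p = 0.5)  # exact binomial test against p = 0.5

# Question 2: could a random ordering put all 9 beards within some
# stretch of 13 consecutive presidents, as history did? Resample:
set.seed(1600)  # Pennsylvania Ave.
n_reps <- 10000
clumped <- replicate(n_reps, {
  beards <- sample(1:43, 9)            # randomly place 9 beards among 43 slots
  (max(beards) - min(beards)) <= 12    # all 9 inside a 13-president window?
})
p_clump <- mean(clumped)  # proportion of random orderings at least this clumped
p_clump
```

The second test is the same shuffle-and-count logic as the femur permutation test; only the test statistic (the span of the bearded run) has changed.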

Let’s hear it for the Null!

Via Carl Zimmer: Dr. Jon Brock, in his blog “Cracking the enigma,” has some thoughts on why null hypotheses don’t suck as badly as many people think. Null hypotheses are generally along the lines of “there is no difference between these groups,” or “this variable has no effect on something,” or “there is no relationship between variables.” The more general statistical statement behind the null hypothesis is usually along the lines of “this phenomenon can be explained just as well by a completely random process.” I’d agree with Brock that a good many researchers (not me!) view the null hypothesis as a bore, or as meaningless. But I like his final thought:

This brings me neatly to my final point. In research on disorders such as autism or Williams syndrome, a significant group difference is considered to be the holy grail. In terms of getting the study published, it certainly makes life easier. But there is another way of looking at it. If you find a group difference, you’ve failed to control for whatever it is that has caused the group difference in the first place. A significant effect should really only be the beginning of the story.

Statistics: Friend or Foe?

In this week’s Science, Greg Miller describes recent uproar about a study that claims to have scientific support for the existence of extrasensory perception (ESP). Of course, ESP being in the realm of the paranormal, it ought to be somewhat outside the purview of Big Science.

But who cares about ESP?! What comes under scrutiny is statistics, the mathematical theory underlying hypothesis testing and inference. The brief story is worth a read, as it cites statisticians on what these statistical tests actually tell us, as well as the ups and downs of Bayesian stats.
An important thing to keep in mind is that no matter how mathematical, statistics is nevertheless like everything else in science – a human endeavor. No matter how creative and insightful humans can be, there’s always a limit to our ability to decipher the world around us. I’m certainly not decrying statistics, but it’s important to keep in mind that these aren’t just handed down to us from on high. We human beings play a critical (and often subjective) hand in how we apply statistics to address our research questions.
Along these lines, just last night I was reading about body mass variation in the Gombe chimpanzees (Pusey et al. 2005), and the authors provide a very insightful quote from statistician George Box:

All models are wrong; some models are useful.

As I added to this on Facebook, “… some models can be hott.”
Miller G (2011). ESP paper rekindles discussion about statistics. Science, 331(6015), 272-273. PMID: 21252321
Pusey A, Oehlert G, Williams J, &amp; Goodall J (2005). Influence of ecological and social factors on body mass of wild chimpanzees. International Journal of Primatology, 26(1), 3-31. DOI: 10.1007/s10764-005-0721-2