Programming Update: Resampling procedure

In the last post, I was talking about learning to program in R. I was able to re-program a resampling project that I’d written in Visual Basic, but I could not figure out how to store my resampled test-statistics, so I could plot a histogram of their distribution. Last night, after searching the world wide webs, I stumbled upon an even shorter code for resampling. I was able to tweak that code to the specs of what I wanted it to do, and–voila–I have a program that resamples my comparative sample, takes two specimens and computes a test statistic, compares that to my fossil specimens, and repeats the process as many times as I want. I’ll print the code at the end for readers to mess with if they want.

The basic idea of the project is that the “habiline” cranial fossils–early Homo from 1.9-1.6 million years ago are quite variable. Because of this, researchers have tried to fit these square pegs of fossils into round holes of species–H. habilis and rudolfensis. A simple, univariate trait that has been claimed to be evidence of multiple species is cranial capacity variation. For a long time this idea was propagated by comparing ER 1470 with ER 1813, because the two have very different cranial capacities and were thought to date to 1.9 Ma. Turns out, though, that while ER 1470 is 1.9 Ma, and ER 1813 might be closer to 1.65 Ma (Gathogo and Brown 2006). Gathogo and Brown state that a more geologically apt comparison, then, would be between ER 1813 and ER 3733. For the 1.9 Ma interval, the most disparate comparison is between ER 1470 and OH 24. Here’s the summary of specimens, ages, and their cranial capacities:

1.9 Ma: ER 1470 (752 cc) and OH 24 (590 cc)

1.65 Ma: ER 3733 (848 cc) and ER 1813 (510 cc)

Ratio of ER 1470/OH24 = 1.274

Ratio of ER 3733/ER 1813 = 1.663

So I wrote a program that tests the null hypothesis that early habiline cranial capacity variation is no different from that of extant gorillas–gorillas being one of the most size-dimorphic living relatives of hominins. If I cannot reject the null hypothesis, this would suggest that I cannot reject the idea that the habiline fossils sample a single species, based on cranial capacities. To test this, I took the ratio of the larger cranial capacity to the smallest, rather than male to female–this makes it easier to compare to the max-min ratio of habilines, without having to assume the sex of the specimens. The resampling program randomly selects two gorilla specimens, takes the ratio of the larger to the smaller, and repeats 5000 times. This way, I can assess the probability of randomly sampling two gorilla cranial capacities that are as different as the two sets of habilines.

The results, displayed in the histogram, show that there is a 25% chance of sampling two gorilla cranial capacities as different as the early habilines (1.9 Ma). However, there is only a 0.006% chance of sampling to gorillas as different as ER 1813 and ER 3733. Thus, we reject the null hypothesis for the comparison of ER 3733 and ER 1813. This means it is very unlikely that these two specimens come from a single species with a level of dimorphism/variation similar to modern day gorillas. This could mean that the two represent two different, contemporaneous species, or that they represent a single species with a level of variation greater than in our extant analog. The test cannot distinguish between these alternatives.Histogram of the resampled gorilla cranial capacity ratios. Notice that it is one-tailed. The red-dashed line indicates the 95th percentile. The early habiline ratio (E) is well within the 95 limit, while the later ratio (L) is outside the 95th percentile.

Here’s the code (the first two lines tell the program where to find the data, since I haven’t posted these, in order to run this program you’ll need a) to reset the directory to where you store your data [use the setwd() command] and b) a data file with cranial capacities listed in a single column, the first four of which are the habilines, and the remainder your comparative sample)



setwd(“C:/Users/zacharoo/Documents/Data”)

get <- read.csv("CranCaps.csv")

habs <- get[1:4,2]

OH24 <- habs[1]; ER1470 <- habs[2]; ER1813 <- habs[3]; ER3733 <- habs[4]

early <- ER1470/OH24; late <- ER3733/ER1813

gors <- get[5:105,2]

p = 0 # incremented if G1/G2 >= “early”

q = 0 # incremented if G1/G2 >= “late”

n = 5000 # <– NUMBER OF ITERATIONS !!!

gor.boot <- numeric(n) # x-number vector containing test statistics

for (i in 1:n) {

sub.samp <- gors[sample(101,2, replace = FALSE)] # sub.samp = 2 randomly sampled gorillas

G1 <- sub.samp[1]; G2 <- sub.samp[2]

if (G1 >= G2) {ratio <-G1/G2} else {ratio <-G2/G1}

gor.boot[i] <- ratio

if (gor.boot[i] >= early) {p = p +1} else {p = p} #frequencies

if (gor.boot[i] >= late) {q = q +1} else {q = q}

}

pval <- p/i

qval <- q/i

qntl <- quantile(gor.boot, .95)

hist(gor.boot, col = 3, xlab = “Resampled Gorilla ratio”, ylab = “Frequency”, main = “Frequency Distribution”)

points(early,25, pch = “E”, col = 2, cex = 1.25); points(late,25, pch = “L”, col = 2, cex = 1.25)

abline(v = qntl, col = 2, lty = 6) # line marking the 95% limit

print(pval); print(qval)

summary(gor.boot)



Reference

Gathogo PN, and Brown FH. 2006. Revised stratigraphy of Area 123, Koobi Fora, Kenya, and new age estimates of its fossil mammals, including hominins. Journal of Human Evolution 51(5):471-479.

Advertisements

Programming

I’m no wiz when it comes to computers, but lots of really smart people have told me I’d do myself an academic favor by learning to write programs. In Paleobiology last year, Phil Gingerich introduced me to programming in Visual Basic. While I did not become so proficient to find it as useful he did, it still got me interested. Then last semester, Milford urged me and other lawnchair anthropologists to take a course in the department of Ecology and Evolutionary Biology taught by George Estabrook, where we learned to write resampling programs for Excel in Visual Basic. It was a bit stressful at times–reading and writing computer codes at first was like learning to read and write Japanese with little training. But overall I thought it was a very useful class, and I’ll certainly be writing more resampling programs in the future (i.e. for my Australopithecus robustus projects…).

But the computer language that everybody’s been abuzz about in the past year or so is R. To quote the website (see the link), “R is a language and environment for statistical computing and graphics.” The software and codes for it are available FOR FREE on the website. I’ve been meaning to sit down and figure out how to use it for a year now, and now that I’m all alone at the museum in the evenings, I’ve begun learning to use R.

At first I had quite a difficult time simply loading data into the program. But once I figured it out (I had the idea, I just wasn’t typing it correctly–“syntax” errors), it was quite easy to run. With some datasets supplied from the Paleobiology class, I was able quite easily to compute and plot statistics like principle components analysis and linear models. It’s really easy! There are even built-in boostrap (resampling) programs, although I could not figure out how to use them as I wanted.

But that isn’t programming–I wasn’t making a custom program that would do what I wanted it to do. So I decided to give it a shot. For the programming class last semester (and Milford’s “Evolution of the Genus Homo” course), I wrote a resampling program that gave the probability of randomly sampling two gorilla cranial capacities as different as some early habilines. So I decided to try to rewrite that program in R. It took about two evenings of reading manuals and fiddling with commands (and drinking red wine and watching Roadhouse). The R code is much simpler than the original one I wrote in Visual Basic–this is largely because R has a built-in command that permutes rows/columns. Plus, it runs much, much faster–I was able to resample 100,000 times in a few seconds, whereas in the original (VB) program it took a few minutes to do 5,000. I think after a few more evenings of tinkering, I can figure out how to also plot the distribution of resampled test-statistics (although if anyone can tell me how I’d love to hear).

If I can figure out how to do it, I’ll post the code so other people can tinker with it if they want. Lesson: use R!