I’m finally about to push my study of brain growth in H. erectus out of the gate, and one of the finishing touches was to make pretty pretty pictures. Recall from the last post on the subject that I was resampling pairs of specimens to compute how much proportional brain size change (PSC) occurred from birth a given age in humans and chimpanzees (and now gorillas). This resulted in lots of data points, which can be a bit difficult to read and interpret when plotted. Ah, cross-sectional data. “HOW?!” I asked, “HOW CAN I MAKE THIS MORE DIGESTIBLE?” Having nice and clean plots is useful regardless of what you study, so here I’ll outline some solutions to this problem. (If you want to figure this out for yourself, here are the raw resampled data. Save it as a .csv file and load it into R)
The left-most plot above shows the raw resampled ratios: you can see a lot of overlap between humans (black), chimpanzees (green) and gorillas (red). But all those points are a bit confusing: just how extensive is the overlap? What is the central tendency of each species?
The second plot shows a less noisy way of displaying the results. We can highlight the central tendencies by plotting PSC medians for each age (I used medians and not means since the data are not normally distributed), and rather than showing the full range of variation in PSC at each age, we can simply highlight the majority (95%) of the values.
To make such a plot in R, for each species you need four pieces of information, in vector form: 1) the unique (non-repeated) ages sorted from smallest to largest, and the 2) median, 3) upper 97.5% quantile, and 4) lower 0.025% quantile for each unique age. You can quickly and easily create these vectors using R‘s built-in commands:
With these simple vectors summarizing humans and chimpanzees variation across ages, you’re ready to plot. The medians (hpm and ppm in the code above) can simply be plotted against age using the plot() and lines() functions, simple enough. But the shaded-in 95% quantiles have to be made using the polygon() function, which creates a shape (a polygon) by connecting sets of points that have to be entered confusingly: two sets of x-coordinates with the first in normal order and the second reversed, and two sets of y-coordinates with the first in normal order and the second reversed.
In our case, the first set of x coordinates is the vector of sorted, unique ages (h and p in the code), and the second set is the same vector but in reverse. The first set of y coordinates is the vector of 97.5% quantiles (hpu and ppu), and the second set is the vector of 0.025% quantiles in reverse. You can play around with ranges of colors and transparency with “col=….”
What I like about the second plot is that it clearly summarizes the ranges of variation for humans and chimps, and highlights which parts of the ranges overlap: the human and ape medians are comparable at the youngest ages, but by 6 months the human median is pretty much always above the chimpanzee upper range. The gorilla points are generally close to the chimpanzee median until around 2 years after which gorilla size increase basically stops but chimpanzees continue. Importantly, we can also see at what ages the simulated H. erectus values are most similar to the empirical species values, and when they fall out of species’ ranges. As I pointed out a bajillion years ago, the H. erectus values (based on the Mojokerto juvenile fossil) encompass most living species’ values around six months to two years.
I also like that second plot does all the above, and still honestly shows the jagged messiness that comes with cross-sectional, resampled data. Of course no individual’s proportional brain size increases and decreases so haphazardly during growth as depicted in the plot. It’s ugly but it’s honest. But if you like lying to yourself about the nature of your data, if you prefer curvy, smoothed inference to harsh, gritty reality, you can resort to the third plot above: the loess regression lines calculated from the resampled data.
Loess and lowess (not to be confused with loess) refer to locally weighted regression scatterplot smoothing, a way to model gross data like we have, but with a nice and smooth (but not straight) line. Because R is awesome, it has a loess() function built right in. The function easily does the math, and you can quickly obtain confidence intervals for the modelled line, but plotting these is another story. After scouring the internet, coding and failing (repeatedly) I finally came up with this:
If you simply try to plot a loess() line based on 1000s of unordered points, you’ll get a harrowing spider’s web of lines between all the points. Instead, you need to create ordered vectors of the non-repeated modelled points (hlm, plm, glm, above) and their upper and lower confidence limits. Once modelled, you can simply plot the lines and create polygons based on the confidence intervals as above.
The best way to learn to do stuff in R is to just play around with data and code until you figure out how to do whatever it is you have in mind. If you want to recreate, or alter, what I’ve described here, you can download the resampled data (link at the beginning of the post) and R code. Good luck!
4 thoughts on “A picture is worth a thousand datapoints in #rstats”
Super useful! I need to teach myself R at some point, and this might be just the sort of thing to play with.
I hope it’s helpful! Let me know if you have any questions.
Pingback: Driving nails into the 2014 Lawn Chair | Lawn Chair Anthropology
Pingback: A neonatal perspective on Homo erectus brain growth | Lawn Chair Anthropology