Lately I had to figure out how to do a repeated measures (or mixed effects) analysis on multivariate (species) data. Here I share code for a computation in R with the adonis function of the vegan package. Credit goes to Gavin Simpson providing most of the important pieces of the below code in R-Help.

The design:

We have multivariate species data, sampled at different sites (n = 6) at 3 points in time (N = 24).

The hypothesis:

H0: Species composition is the same across time. The permutation will produce the H0-population by a restricted randomization of time points.

H1: Species composition differs between time points

The analysis:

With repeated measurements one has to allow for the temporal structure and for the dependency within repeated measures. In a pedantic manner one would account for temporal structure by permuting all time-points in the same way also keeping the ordering of time-points. With 3 time-points this would give you only 3 permutations!

But we may relax this very restrictive assumption and say it doesn't matter that all samples at t1 were taken at one time and the ones of t2, etc. at another. Still there would be the repeated measures and the fact that the samples were taken in a temporal sequence. So how to deal with this?

(1) Having in mind that repeated measures on one sample are not independent, we will not permute across these samples.

(2) Now to the temporal correlation within repeated samples: The ordering of time-points is another type of dependency: the ordering is 1, 2, 3 - we will change this ordering only to 3, 1, 2 or 2, 3, 1, This permutation is referred to as toroidal shift which is keeping the "neighborhood" of values intact.

Eventually, we have 3 permutations per site, with 6 sites - this yields 3^6 = 729 permutations overall. Thus, a minimal p-value of 1/729 = 0.0014 can be obtained. I'd say that's good enough!

### species matrix with 20 species abundances (mean = 50, sd = 10) ### one time variable, with 3 timepoints, which should be tested ### and a factor denoting sites that were repeatedly sampled (site) ## Load packages require(vegan) ### Data: sp <- matrix(rnorm(3 * 6 * 20, 50, 10), nrow = 3 * 6, ncol = 20, dimnames = list(1:18, paste("Sp", 1:20, sep = ""))) time <- as.ordered(rep(1:3, 6)) site <- gl(6, 3) cbind(site, time, sp) ### add time effect at timepoint 3, ### this will effect will be tested by adonis(): sp_1 <- sp sp_1[time==3,] <- sp[time==3,] + rnorm(20, 10, 1) cbind(site, time, sp_1) ### choose which species set to test: test_sp <- sp_1 ### computing the true R2-value ### (btw, using dist() defaults to euclidean distance): print(fit <- adonis(dist(test_sp) ~ time, permutations=1)) ### number of perms B <- 1999 ### setting up frame which will be populated by ### random r2 values: pop <- rep(NA, B + 1) ### the first entry will be the true r2: pop[1] <- fit$aov.tab[1, 5] ### set up a "permControl" object: ### we turn off mirroring as time should only flow in one direction ctrl <- permControl(strata = site, within = Within(type = "series", mirror = FALSE)) ### Number of observations: nobs <- nrow(test_sp) ### check permutation (...rows represent the sample id): ### ..they are ok! ### within in each repeated sample (= sites) timepoints are shuffled, ### with keeping the sequence intact (e.g., for site 1: 1,2,3 - 2,3,1 - 3,2,1) shuffle(nobs, control = ctrl) ### loop: ### in adonis(...) you need to put permutations = 1, otherwise ### adonis will not run set.seed(123) for(i in 2:(B+1)){ idx <- shuffle(nobs, control = ctrl) fit.rand <- adonis(dist(test_sp) ~ time[idx], permutations = 1) pop[i] <- fit.rand$aov.tab[1, 5] } ### get the p-value: print(pval <- sum(pop >= pop[1]) / (B + 1)) ### [1] 0.0035 ### the sign. p-value supports the H1 (->there is a time effect). ### ..and the fact that samples are not iid is allowed by ### the customized perms - so this p-value is trustworthy as opposed ### to tests not acknowledging dependency of data points.. ### test sp set without an effect: ### replace test_sp with sp set without effect: test_sp <- sp ### now re-run the script and see the result: ### it is insign. - as expected: ### setting up frame which will be populated by ### random r2 values: pop <- rep(NA, B + 1) ### computing the true R2-value: print(fit <- adonis(dist(test_sp) ~ time, permutations = 1)) ### the first entry will be the true r2: pop[1] <- fit$aov.tab[1, 5] ### run the loop: set.seed(123) for(i in 2:(B+1)){ idx <- shuffle(nobs, control = ctrl) fit.rand <- adonis(dist(test_sp) ~ time[idx], permutations = 1) pop[i] <- fit.rand$aov.tab[1, 5] } print(pval <- sum(pop >= pop[1]) / (B + 1)) ### [1] 0.701 ## make a histogram to see random R2-values and the true one: hist(pop, xlab = "Population R2") abline(v = pop[1], col = 2, lty = 3) text(0.08, 300, paste("true R2,\np = ", pval, sep = ""))

To cite package ‘vegan’ in publications use:

Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre,

Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M.

Henry H. Stevens and Helene Wagner (2011). vegan: Community Ecology

Package. R package version 2.0-2.

http://CRAN.R-project.org/package=vegan

Hi there. I'm interested in this approach, but am having a hard time finding information on it. It seems that there are a number of papers out there that just use the regular PERMANOVA for data sets with repeated sampling. Are these just poor analyses?

ReplyDeleteAlso, it would be great if you could provide the output that you see in the text so that readers can confirm that what they are doing is coming out right.

hi anonymous,

Deletenot allowing for spatial/temporal correlation will yield anti-conservative p-values and there is higher chance of false significant effects than suggest by this p-values. i'd say that *is* poor analysis.

i did add some output as proposed by you.

hth, kay

If you have 729 unique permutations, why did you use 1999 permutations? Why not all possible 729?

ReplyDelete..you're right: using the whole set of enumartions, what would be referred to as "exact test", would be best. But, with as much as 1999 permutation you yield a good approximation of the null-distribution. And, for the case being considered, I suppose that is all ok.

DeleteHi, I need to perform a repeated measures permanova based on a bi-factorial sampling design.

ReplyDeleteCan you help me?

Cheers.

Ronei.

depends.. you can mail to the address in the footer.

DeleteHi Kay

DeleteI need to perform a rmPERMANOVA based on a bi-factorial (as bellow). Can you help me? In the script I lose my self in the "for"...

Cheers, João

birds fishes aphyo.1 aphyo.2 aphyo.3 mfore.1 mfore.2 mfore.3

without native 75 59 59 75 61 54

with native 75 58 60 75 69 69

without non-native 75 68 63 75 69 61

without native 75 58 59 75 63 55

..send me the whole dataset and a short description and I'll see.

DeleteHi- I am analyzing genetic distance matrices from samples taken from a three-factor sampling design, with repeated samplings over time. I have horrible problems with imbalance, as I had no way to control whether or not I actually got my (relatively rare) target organism at each sampling. Because I have had a hard time finding information about how to deal with violations of the assumptions of balance I am slashing back my data set to only the combinations that I can get balance for.

ReplyDeleteDo I need balance in the number of instances of samplings per site as well, or have I some wiggle room in the repeated measures portion of the analysis?

Thanks for your input!

I'd say as long as your cell sizes (also for random factors, like your sites) exceed a minimum of about 5-10 (?) samples the test applies, given that you're testing the appropiate hypotheses..

DeleteI'd advise you to address the R-mailing list (r-sig-eco) for a more profound answer on that question. See i.e. this: https://stat.ethz.ch/pipermail/r-help/2008-April/159252.html

Hi Kay,

ReplyDeleteThis is a very cool. I´m have the p value, but how can we retrieve the F and degrees of freedom.

Sorry if it is a simple question, but my knowledge on statistic is very basic

See str(fit$aov.tab)

DeleteThank´s Kay

ReplyDeleteFC

Hi Kay,

ReplyDeleteSorry if this is a messy question for you. I have a species matrix with 11 species abundances, one time variable (weeks) with 6 timepoints, sites that were repeatedly sampled (site), replicate 3 per site. My null hypothesis is that the site didn't influence the species abundances

I have some basic background in statistics, but I am quite new to R. I have tried unsuccessfully

to modify your code to work with my data. Could you suggest me on how to proceed?

Best regards

Fernando

Example of my matrix:

Site Replicate week Sp1 Sp2 Sp3 Sp4 Sp5 Sp6 Sp7 Sp8 Sp9 Sp10 Sp11

SS a 1 0 9 0 0 0 0 0 0 0 0 9

SS b 1 0 6 0 0 1 0 0 0 1 0 6

SS c 1 0 11 2 0 0 0 0 0 0 0 13

SS d 1 0 9 1 0 0 0 0 0 0 0 10

Mh a 1 0 3 0 0 2 1 0 0 0 0 3

Mh b 1 0 3 1 0 0 0 0 0 0 0 4

........

Hi there,

DeleteIf you`re solely interested in differences between sites I'd suppose you should average abundances over the timepoints, reducing the dataset from 6*3*N(Sites) to 3*N(Sites).. BTW, what species are we talking about, and, why did you repeat over weeks in the first place?

HTH,

Kay

Hi Kay,

ReplyDeletethanks a lot for the script as it does help to reduce the amount of analyses to run and is just what I was looking for. However, I have one question: In lines 18 to 20 you have the following code:

sp_1 <- sp

sp_1[time==3,] <- sp[time==3,] + rnorm(20, 10, 1)

cbind(site, time, sp_1)

In the middle line you add further random numbers to the original data set. These random numbers are needed just in your case or are the supposed to be added in my adapted script as well? What is the reason behind adding these random numbers?

I'm sorry if it sounds like s silly question, am just trying to understand what I am actually doing here.

Kind regards,

Suse

Suse,

Deleteas commented in the script, this is just to add an artificial effect to showcase in the test..

Regards,

Kay

Dear KAY,

ReplyDeleteIf I run a MANOVA can I get the significance of each variable in the left part of the model by applying a summary.aov function.

Did you know whether it is possible to obtain this output in a PERMANOVA using the adonis function.

Thanks

Ok! But the output of the adonis function doesn't show the significance of each variable, solely the total effect in the multivariate space.

DeleteTks.

Yes, for the significance of single variables on the left side ("species") there are no tests in adonis. That is simply because it is dealing with dissimilarity matrices rather than the single variables/species.

DeleteHi Kay,

ReplyDeleteI am having trouble understanding how you add the time effect at timepoint 3. I understand the assignment,

sp_1[time==3,] <- sp[time==3,] + rnorm(20, 10, 1)

but you never use that line again (time==3) but go back to sp_1 which includes everything, and then rename it test_sp which to me seems exactly the same as your original sp.

What am I missing?

Hi e,

Delete'sp' has no time effect, whereas for 'sp_1' some effect is added at time-level '3'. I run the test twice, in both cases using 'test_sp' - for the first test 'sp' is assigned to 'test_sp' and in the second test it is 'sp_1' that is assigned to 'test_sp'..

HTH, Kay

I'm not sure I totally get this. Stupid me.

ReplyDeleteFirst, easy question:

why would you use euclidian distances on species abundance data? This is just for demonstration, right?

Second: If I'm not only looking at the effect of time, but also at the effect of a treatment (in split-plot design, e.g.), how would I need to control for it?

This would make no sense, I guess:

fit <- adonis(distance ~ treatment+time, permutations=1, strata=treatment)

Euclidean was used just for convenience. However, there's nothing wrong with it in general - there are of course some situations where it is not appropiate to use this distance measure, but that's the case for any chosen metric.

DeleteIf you want to test for a fixed factor, controlling for a second random factor (time) things get really tricky and I'd suppose you post to vegan-help or r-sig-eco. However, I'm afraid there were several questions on this topic before and to my knowledge none of them were answered clearly.

Thanks for the clarification. Months have gone by, and I tried hard.

DeleteFor instance, I aimed to hunt down every single thread on r-sig-ecology eben remotely connected, and also had a lengthy conversation with an editor of a Journal and author of a book on applied MV stats. Apparently, there is no clear solution to the problem.

After some consideration, I now perform a db-RDA on my treatment factor while partialling out the sampling variation. Jari (i.e., vegantutor.pdf) literally indicated that this kind of treatment is somewhat related to a mixed effects model: "These conditioning variables typi-

cally are “random” or background variables, and their eﬀect is removed from the analysis based on “ﬁxed” or interesting variables."

I can use MRPP to attach a p-value to the data, or use adonis() and it's pseudo-F ratios and p-values. Both approaches are unable to deal with my design, I gather.

My current opinion is it's better than nothing, but far from satisfying.

I'd be interested in your views, if you find the time.

Hi @rne!

Delete..Your story reminds me of the issues I ran into when writing my paper in which I also had a mixed model with mv-data. There are no clear answers to ones questions, and the more you gather the more complicated it gets..

Please let my clarify that I'm nor a statistican, neither a specialist for mv data analysis - I'm 'just' an interested user who went a little further than most other people who blindly follow recipes for testing their data.

The RDA approach with patialling out 'random factors' or 'unwanted factors' seems perfectly fine for me - however, I was told that many reviewers dislike multivariate tests - I guess that's because most people don't realls grasp those tests. So, be advised to really get to know what your doing - otherwise you will have your trouble defending your approach.. Anyways, this would also apply for a PERMANOVA approach. But I guess the latter method is somewhat easier to explain.

Good luck,

Kay

Kay, I still struggle. Using a real-life dataset - 24 plots (actually, 12 times two treatments, but that's another complication), sampled at four times, would I have to use 4^24 restricted permutations to check for a sampling time effect?

ReplyDeleteThis does not seem right. Nor is it feasible.

Hi there,

DeleteTo my understanding the very same procedure applies (neglecting your treatments) - however, I tried to reproduce your example and noticed that my code ist not uptodate. There were several changes in the vegan package (permControl) in the meantime. I'm afraid I don't find the time to adjust the example for the moment. Please consider to head to the suited mailing lists..

Best,

Kay