23 May 2011

Summarize Data by Several Variables

Here's an example of how to conveniently summarize data with the cast function (package reshape). Along the way you see how the same summary can be produced "inconveniently" by hand, how a for-loop works, and how a matrix is constructed and filled. The loop below also illustrates how flexible indexing in R is. (download data) (This example is adapted from https://stat.ethz.ch/pipermail/r-sig-ecology/2011-May/002174.html)
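Before the real data, here is a minimal base-R sketch of the two kinds of indexing the hand-made version relies on: picking elements with a logical vector, and addressing a data frame cell by row name. The vectors `value` and `region` and the data frame `counts` are made up for illustration; they are not part of the downloaded data.

```r
# Hypothetical toy data, not the data3 object used in this post
value  <- c("Diptera", "Diptera", "Coleoptera", "Diptera")
region <- c("Alps", "Alps", "Alps", "Jura")

# A comparison returns one TRUE/FALSE per element ...
is_alps <- region == "Alps"

# ... and indexing with that logical vector keeps only the TRUE positions
alps_values <- value[is_alps]

# Number of distinct values in the subset
n_alps <- length(unique(alps_values))

# A data frame cell can be addressed by row name and column name
counts <- data.frame(count = rep(NA, 2), row.names = c("Alps", "Jura"))
counts["Alps", "count"] <- n_alps
counts["Jura", "count"] <- length(unique(value[region == "Jura"]))
print(counts)
```

The loop in the main example does the same thing, only with the combined "variable - ECO_NAME" label as the row name.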

```
# Set the path to where you downloaded the data:

# see what's in there:
ls()

# investigate data3:
str(data3)

# I want to know how many taxa there are within each ecoregion -
# more precisely, how many orders, families and genera there are within each region:

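library(reshape)

# melt to long format: one row per ECO_NAME / taxonomy level / value
dfm <- melt(data3, id = "ECO_NAME")
# cast back to wide format, counting the unique values in each cell
dfc <- cast(dfm, ECO_NAME ~ variable, function(x) length(unique(x)))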

# the same by hand -
# make new variable of ECO_NAME*variable combinations:
dfm$variable2 <- as.factor(paste(dfm$variable, dfm$ECO_NAME, sep = " - "))

# make a one-column data frame to collect results,
# with one row per unique ECO_NAME*variable combination:
Ns <- data.frame(count = rep(NA, length(unique(dfm$variable2))),
                 row.names = unique(dfm$variable2))

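# loop through all unique ECO_NAME*variable combinations
# and record the number of unique values:
for (i in levels(dfm$variable2)) {
  # logical indexing: keep only the values belonging to combination i
  # (named "vals" to avoid masking the base function subset())
  vals <- dfm$value[dfm$variable2 == i]
  # index the result row by its name
  Ns[i, "count"] <- length(unique(vals))
}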

# put counts in a matrix/table. Ns$count is ordered by taxonomy
# level (= "variable"), so I have to fill by columns (byrow = FALSE),
# as the matrix/table columns are chosen to be the taxonomy levels.
# Note: this relies on ECO_NAME being a factor and on the regions first
# appearing in data3 in the order of levels(ECO_NAME), e.g. sorted data.
result <- matrix(Ns$count,
                 nrow = length(levels(dfm$ECO_NAME)),
                 ncol = length(levels(dfm$variable)),
                 dimnames = list(levels(dfm$ECO_NAME), levels(dfm$variable)),
                 byrow = FALSE)
print(result)
```
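If you want to cross-check either result without reshape, the same region-by-taxonomy-level table can be sketched in base R with `tapply` inside `sapply`. This is a swapped-in alternative, not the cast approach shown above. The data frame `toy`, its column names `ORDER` and `FAMILY`, and the helper `n_unique` are hypothetical stand-ins; only `ECO_NAME` is taken from the original data.

```r
# Hypothetical toy data standing in for data3
toy <- data.frame(
  ECO_NAME = c("Alps", "Alps", "Alps", "Jura", "Jura"),
  ORDER    = c("Diptera", "Diptera", "Coleoptera", "Diptera", "Diptera"),
  FAMILY   = c("Culicidae", "Chironomidae", "Carabidae", "Culicidae", "Simuliidae"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# All columns except the grouping column are treated as taxonomy levels
rank_cols <- setdiff(names(toy), "ECO_NAME")

# For each taxonomy column, count the distinct values per ecoregion;
# sapply() binds the per-column results into a region x level matrix
check <- sapply(toy[rank_cols],
                function(col) tapply(col, toy$ECO_NAME, n_unique))
print(check)
```

Because `tapply` names its output by group, the rows are labelled by region regardless of how the input is sorted, so this version does not depend on the row order of the data.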
To cite reshape in publications, please use:
H. Wickham. Reshaping data with the reshape package. Journal of
Statistical Software, 21(12), 2007.