Some time ago I made a word cloud to represent a Google Scholar search result. Tal Galili pointed me to a post by Drew Conway that expanded on the problem that word clouds lack spatial meaning: the spatial ordering of words in a word cloud is arbitrary and meaningless.
As I am an ecologist, I soon came to the idea that text could be treated as a multivariate data set, with words treated as species and sentences as samples. So, presuming that it makes sense to put sentences and words in a cross-table, as I would with a species/samples matrix, it may also be sensible to analyze such a matrix with ordination methods for multivariate data, which are widely used by ecologists these days. I chose NMDS (non-metric multidimensional scaling) ordination, as it is robust and quite easy to compute with the R package {vegan}.
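To make the cross-table idea concrete, here is a minimal sketch in plain Python (not the R code behind the post). The example sentences, the `tokenize` helper and the `cross_table` function are all hypothetical and only illustrate the layout: one row per sentence (sample), one column per word (species), with counts in the cells.

```python
import re
from collections import Counter

# Hypothetical example sentences; these are not the texts used in the post.
sentences = [
    "you like word clouds",
    "you like ordination",
    "I like ordination plots",
]

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def cross_table(sentences):
    """Return (vocabulary, matrix): one row per sentence, one column per word."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a sentence.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = cross_table(sentences)
print(vocabulary)
for row in matrix:
    print(row)
```

A table like this has the same shape as a samples-by-species community matrix, which is the kind of input ordination functions in {vegan} expect.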
In an NMDS ordination plot, Species/Words that often co-occur within sentences and/or within groups of sentences (say, sentences said by you vs. sentences said by me) are placed close together. That is, words associated with each other, or with words within levels of a grouping factor, are plotted closer to each other than words with low association.
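NMDS does not work on the raw counts but on a dissimilarity matrix between samples, and it then arranges the points so that the rank order of plotted distances matches the rank order of those dissimilarities as well as possible. The sketch below only shows the first step, using the Bray-Curtis dissimilarity, which is the default in {vegan}'s `metaMDS`. Whether the post's analysis kept that default is an assumption, and the counts and the labels `I1`, `I2`, `Y1` are made up for illustration.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical sentence-by-word counts (rows: sentences, columns: words).
rows = {
    "I1": [1, 0, 1, 0],
    "I2": [1, 1, 0, 0],
    "Y1": [0, 0, 1, 1],
}

names = sorted(rows)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(bray_curtis(rows[first], rows[second]), 3))
```

Sentences that share no words get a dissimilarity of 1, so they end up far apart in the ordination, while sentences with overlapping vocabulary are pulled together.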
In my simple example two texts are compared, each with five sentences: one with sentences I said about you (denoted by red "Is") and one with sentences said by you about yourself (the red "Ys"). Words used by both of us are in the intersection, whereas words said exclusively by me, for example, are far away from the centroid of the sentences said by you, and vice versa. I will not annoy you with the nitty-gritty of ordination methods or NMDS; you will have to check this yourself.
Word frequencies are represented by the size of the plotted text, as in the usual word clouds.
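One simple way to turn frequencies into text sizes is a linear rescaling between a smallest and a largest size, sketched below in plain Python. The function name `text_sizes`, the size range and the linear mapping are assumptions made for illustration; the post does not say which scaling was used.

```python
from collections import Counter

def text_sizes(words, min_size=0.8, max_size=3.0):
    """Map each word to a text size that grows linearly with its frequency."""
    counts = Counter(words)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for word, n in counts.items():
        # If all words are equally frequent, every word gets the smallest size.
        fraction = (n - low) / span if span else 0.0
        sizes[word] = min_size + fraction * (max_size - min_size)
    return sizes

# Hypothetical word list.
sizes = text_sizes(["you", "you", "you", "like", "like", "plots"])
for word, size in sorted(sizes.items()):
    print(word, round(size, 2))
```

The resulting values could then be passed as character expansion factors when the word labels are drawn on the ordination plot.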
So, to all linguists out there, what do you think?
The stand-alone code to produce this word cloud can be found HERE.
I think this is SUPER cool!
:)
Tal
thanks Tal!