I was also fiddling around with agrep and adist until I realised that, for this particular problem, matching substrings is not as important as matching multiple words. With this different approach I quite easily matched all but 3 countries.
See what I did:
## look up matches of one dataframe in another dataframe.
## the strings to be matched consist of 1 or more words
## separated by white space.
## method: match strings that have the highest fraction of words that match up
d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv",
               header = T, sep = ",", encoding = "UTF-8")
d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt",
               header = T, sep = ";", encoding = "UTF-8")
## strings to be compared d1$ECONOMY and d2$Country.Name
mystr.1 <- as.character(d1$ECONOMY)
mystr.2 <- as.character(d2$Country.Name)
mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code)
## lower-case, remove punctuation and collapse multiple spaces
mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1))
mystr.1 <- gsub("\\s+", " ", mystr.1)
mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2))
mystr.2 <- gsub("\\s+", " ", mystr.2)
## function that counts matching words in two strings (words separated by single space!)
n.wordshared <- function(x, y) {
  sum(!is.na(match(unlist(strsplit(x, " ")),
                   unlist(strsplit(y, " ")))))
}
## example
n.wordshared(x = "hello world", y = "good bye world")
## [1] 1
## function that calculates fraction of shared words
fr.wordshared <- function(x, y) {
  n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " ")))) +
                        length(unique(unlist(strsplit(y, " ")))))
}
## example
fr.wordshared(x = "hello world", y = "good bye world")
## [1] 0.2
mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "",
                   stringsAsFactors = F)
## now look up every element of string 1 in string 2
## and if there are matching words assign match to dataframe
for (i in 1:nrow(mydf)) {
  xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i])
  if (sum(xx) == 0) {
    mydf$mymatch[i] <- NA
    mydf$match.iso[i] <- NA
  } else {
    mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ")
    mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ")
  }
}
## see result
print(mydf)
## these are the multiple matches
(aa <- mydf[grep(";", mydf$mymatch), ])
##
## str1 mymatch match.iso
## 28 slovak republic czech republic; dominican republic CZ; DO
## these were not matched
(bb <- mydf[is.na(mydf$mymatch), ])
## str1 mymatch match.iso
##
## 61 russia NA NA
## 108 syria NA NA
Now, expanding on this concept by introducing partial matching would most probably result in a 100% match...
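Just to sketch what such a partial-matching step could look like (the function name n.wordshared.partial and the max.distance value below are purely illustrative, not part of the approach above), one could count a word as shared whenever agrep() finds an approximate occurrence of it in the other string:

## count words of x that agrep() finds (approximately) somewhere in y
n.wordshared.partial <- function(x, y, max.dist = 0.2) {
  words.x <- unlist(strsplit(x, " "))
  sum(sapply(words.x, function(w) length(agrep(w, y, max.distance = max.dist)) > 0))
}
## example: "russia" would now also hit "russian federation"
n.wordshared.partial("russia", "russian federation")
## [1] 1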
Hi Kay - thanks for spinning another twist on this problem. In turn it got me wondering: if you're doing a count of exact word matches, then we could treat each phrase as a set and look at the size of the set intersection versus the set union?
Hi Tony,
Like doing actually the same thing with the built-in functions (which I obviously missed...)?
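A possible sketch of that set-based idea with the built-in intersect() and union() (the function name jaccard.words is just illustrative; note it divides by the size of the word-set union, whereas fr.wordshared above divides by the sum of the two word counts):

## word-level Jaccard similarity using the built-in set functions
jaccard.words <- function(x, y) {
  wx <- unique(unlist(strsplit(x, " ")))
  wy <- unique(unlist(strsplit(y, " ")))
  length(intersect(wx, wy)) / length(union(wx, wy))
}
## example
jaccard.words("hello world", "good bye world")
## [1] 0.25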
Hm, interesting post. A colleague of mine is working on GBIF data, matching it with data from other sources. It's a mess: a whole lot of locations, as well as species names, don't match the master table he's using. I forwarded him Tony's post a couple of days ago; yours is going to follow.
Suggestion: being a biologist, and to get an idea of how difficult it is, you could try to match GBIF and, let's say, APD or IUCN data.
Would be nice to see some thoughts on that. :)
Probably I'm not quite the right person to address this question to. I guess there are some people out there (like the ones at GBIF who are merging data from thousands of sources) who have spent much of their time on such issues and have developed sophisticated methods. It would be interesting to learn more about it - I'm afraid that so far we're only scratching the surface...