I was also fiddling around with agrep and adist until I realised that, for this particular problem, matching substrings is not as important as matching multiple words. With this different approach I quite easily matched all but 3 countries.
See what I did:
## look up matches of one dataframe in another dataframe.
## the strings to be matched consist of 1 or more words
## separated by white space.
## method: match strings that have the highest fraction of words that match up
d1 <- read.csv("http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv",
               header = T, sep = ",", encoding = "UTF-8")
d2 <- read.csv("http://www.iso.org/iso/country_names_and_code_elements_txt",
               header = T, sep = ";", encoding = "UTF-8")
## strings to be compared d1$ECONOMY and d2$Country.Name
mystr.1 <- as.character(d1$ECONOMY)
mystr.2 <- as.character(d2$Country.Name)
mystr.3 <- as.character(d2$ISO.3166.1.alpha.2.code)
## lower-case, remove punctuation and collapse multiple spaces
mystr.1 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.1))
mystr.1 <- gsub("\\s+", " ", mystr.1)
mystr.2 <- tolower(gsub("[^[:alnum:][:space:]]", "", mystr.2))
mystr.2 <- gsub("\\s+", " ", mystr.2)
## function that counts matching words in two strings (words separated by single space!)
n.wordshared <- function(x, y) {
  sum(!is.na(match(unlist(strsplit(x, " ")),
                   unlist(strsplit(y, " ")))))
}
## example
n.wordshared(x = "hello world", y = "good bye world")
## [1] 1
## function that calculates fraction of shared words
fr.wordshared <- function(x, y) {
  n.wordshared(x, y) / (length(unique(unlist(strsplit(x, " ")))) +
                        length(unique(unlist(strsplit(y, " ")))))
}
## example
fr.wordshared(x = "hello world", y = "good bye world")
## [1] 0.2
mydf <- data.frame(str1 = mystr.1, mymatch = "", match.iso = "",
                   stringsAsFactors = F)
## now look up every element of string 1 in string 2
## and if there are matching words assign match to dataframe
for (i in 1:nrow(mydf)) {
  xx <- sapply(mystr.2, fr.wordshared, y = mystr.1[i])
  if (sum(xx) == 0) {
    mydf$mymatch[i] <- NA
    mydf$match.iso[i] <- NA
  } else {
    mydf$mymatch[i] <- paste(names(which(xx == max(xx))), collapse = "; ")
    mydf$match.iso[i] <- paste(mystr.3[as.integer(which(xx == max(xx)))], collapse = "; ")
  }
}
## see result
print(mydf)
## these are the multiple matches
(aa <- mydf[grep(";", mydf$mymatch), ])
##
## str1 mymatch match.iso
## 28 slovak republic czech republic; dominican republic CZ; DO
## these were not matched
(bb <- mydf[is.na(mydf$mymatch), ])
## str1 mymatch match.iso
##
## 61 russia NA NA
## 108 syria NA NA
Now, expanding on this concept by introducing partial matching would most probably result in a 100% match...
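Just to sketch what such a partial-matching step could look like (the function name n.wordshared.partial and the max.distance value below are purely illustrative, not part of the approach above), one could count a word as shared whenever agrep() finds an approximate occurrence of it in the other string:

## count words of x that agrep() finds (approximately) somewhere in y
n.wordshared.partial <- function(x, y, max.dist = 0.2) {
  words.x <- unlist(strsplit(x, " "))
  sum(sapply(words.x, function(w) length(agrep(w, y, max.distance = max.dist)) > 0))
}
## example: "russia" would now also hit "russian federation"
n.wordshared.partial("russia", "russian federation")
## [1] 1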
Hi Kay - thanks for spinning another twist on this problem. In turn it got me wondering: if you're doing a count of exact word matches, then we could treat each phrase as a set and look at the size of the set intersection versus the set union?
Hi Tony,
Like doing actually the same thing with the built-in functions (which I obviously missed...)?
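A possible sketch of that set-based idea with the built-in intersect() and union() (the function name jaccard.words is just illustrative; note it divides by the size of the word-set union, whereas fr.wordshared above divides by the sum of the two word counts):

## word-level Jaccard similarity using the built-in set functions
jaccard.words <- function(x, y) {
  wx <- unique(unlist(strsplit(x, " ")))
  wy <- unique(unlist(strsplit(y, " ")))
  length(intersect(wx, wy)) / length(union(wx, wy))
}
## example
jaccard.words("hello world", "good bye world")
## [1] 0.25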
Hm, interesting post. A colleague of mine is working on GBIF data, matching it with data from other sources. It's a mess: a whole lot of locations, as well as species names, don't match the master table he's using. I forwarded him Tony's post a couple of days ago; yours is going to follow.
Suggestion: being a biologist, and to get an idea of how difficult it is, you could try to match GBIF and, let's say, APD or IUCN data.
Would be nice to see some thoughts on that. :)
Probably I'm not quite the right person to address this question to. I guess there are some people out there (like the ones at GBIF who are merging data from thousands of sources) who have spent much of their time on such issues and have developed sophisticated methods. It would be interesting to learn more about it - I'm afraid that so far we're only scratching the surface...