3 Nov 2011

Some Simple but Probably Useful Regex Examples with R-Package stringr...

I found that examples of regex use in R are rather rare. So here are some from my own learning materials, mostly stolen from the help pages, with small but hopefully illustrative adaptations. PS: I will extend this list of examples here occasionally.

library(stringr)

shopping_list <- c("bread & Apples §$%&/()=?4", "flouR", "sugar", "milk x2")
str_extract(shopping_list, "[A-Z].*[1-9]")
# this extracts the partial string that starts with an upper-case letter
# and ends with a digit from 1 to 9, for each element of the input vector.
# "." matches any single character; "*" means the preceding item is
# matched zero or more times, so ".*" matches any run of characters.
# the match is greedy: it runs up to the last digit 1-9 in the string.

# output:
[1] "Apples §$%&/()=?4" NA                  NA                  NA
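
A minimal sketch of the greedy behaviour of ".*", using a made-up string (x is an assumption for illustration, not part of the shopping list). Adding "?" after the quantifier makes it lazy, so the match stops at the first digit instead of the last:

```r
library(stringr)

# hypothetical input with two digits, to make greedy vs. lazy visible
x <- "Apples 4 and Pears 7"

# greedy: ".*" grabs as much as possible, so the match ends at the last digit
greedy <- str_extract(x, "[A-Z].*[1-9]")
print(greedy)   # "Apples 4 and Pears 7"

# lazy: ".*?" grabs as little as possible, so the match ends at the first digit
lazy <- str_extract(x, "[A-Z].*?[1-9]")
print(lazy)     # "Apples 4"
```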

str_extract(shopping_list, "[a-z]{1,4}")
# this extracts the first run of 1 to 4 lowercase letters
# from each element of the input vector.

# output:
[1] "brea" "flou" "suga" "milk"
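
To see what the range in "{1,4}" does, here is a small sketch comparing it with the exact count "{4}". The vector words is made up for illustration:

```r
library(stringr)

# hypothetical inputs: one shorter and one longer than 4 letters
words <- c("ab", "abcdef")

# {1,4}: between 1 and 4 repetitions, as many as are available
ranged <- str_extract(words, "[a-z]{1,4}")
print(ranged)   # "ab" "abcd"

# {4}: exactly 4 repetitions, so "ab" yields no match
exact <- str_extract(words, "[a-z]{4}")
print(exact)    # NA "abcd"
```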

str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
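# this extracts the first whole word made up of 1 to 4 lowercase letters
# from each element of the input vector. "\\b" marks a word boundary,
# so "bread", "sugar" (5 letters) and "flou" in "flouR" do not qualify.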

# output:
[1] NA     NA     NA     "milk"
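
The effect of the word boundaries may be easier to see on a whole sentence. A small sketch with a made-up sentence (not from the original examples), using str_extract_all to collect every short word:

```r
library(stringr)

# hypothetical sentence; "bread" has 5 letters and should be skipped
sentence <- "a loaf of bread and milk"

# all whole words of 1 to 4 lowercase letters
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
print(short_words)   # "a" "loaf" "of" "and" "milk"
```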

str <- c("&George W. Bush", "Lyndon B. Johnson?")
gsub("[^[:alnum:][:space:].]", "", str)
# keep alphanumeric characters, whitespace AND the full stop, remove
# anything else, that is, all other punctuation. the caret at the start
# of the bracket expression negates it: it matches whatever is NOT listed.

# output:
[1] "George W. Bush"    "Lyndon B. Johnson"
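
As a quick check of what the "." inside the brackets is doing, this sketch runs the same call with and without it (base R only; the variable name presidents is chosen here to avoid masking the base function str):

```r
presidents <- c("&George W. Bush", "Lyndon B. Johnson?")

# with "." in the negated class: full stops are kept
with_dot <- gsub("[^[:alnum:][:space:].]", "", presidents)
print(with_dot)      # "George W. Bush" "Lyndon B. Johnson"

# without it: full stops are removed along with the other punctuation
without_dot <- gsub("[^[:alnum:][:space:]]", "", presidents)
print(without_dot)   # "George W Bush" "Lyndon B Johnson"
```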

4 comments:

  1. You might want to show how the function str_extract_all compares on the above examples.

  2. ..."str_extract" extracts the first piece of each string (each element of the input vector) that matches a pattern and returns a character vector of those matches, with NA where nothing matches.

    "str_extract_all" extracts all pieces of each string that match a pattern and returns a list of character vectors, one per element of the input vector.

    Hope I got this right,
    Kay
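
A rough side-by-side of the two functions, as a sketch. The vector items is a simplified ASCII-only version of the shopping list, an assumption made here to keep the output short:

```r
library(stringr)

# simplified version of the shopping list (special characters dropped)
items <- c("bread & Apples 4", "flouR", "sugar", "milk x2")

# first match per element: a character vector of the same length as items
first_match <- str_extract(items, "[a-z]{1,4}")
print(first_match)   # "brea" "flou" "suga" "milk"

# all matches per element: a list with one character vector per element
all_matches <- str_extract_all(items, "[a-z]{1,4}")
print(all_matches[[1]])   # "brea" "d" "pple" "s"
print(all_matches[[4]])   # "milk" "x"
```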

  3. What is the advantage of using the stringr package versus the grep() function in the base library?

  4. Try it yourself and you'll see...

    str_extract(shopping_list, "[A-Z](.*)[1-9]")
    grep("[A-Z](.*)[1-9]", shopping_list)
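
For readers who cannot run the two lines above, a sketch of what they return. It again uses a simplified ASCII-only list (items is an assumption, not the original vector):

```r
library(stringr)

items <- c("bread & Apples 4", "flouR", "sugar", "milk x2")
pattern <- "[A-Z](.*)[1-9]"

# grep() reports WHICH elements match, as indices
idx <- grep(pattern, items)
print(idx)        # 1

# with value = TRUE it returns the whole matching elements
whole <- grep(pattern, items, value = TRUE)
print(whole)      # "bread & Apples 4"

# str_extract() returns only the matched part, with NA for non-matches
part <- str_extract(items, pattern)
print(part)       # "Apples 4" NA NA NA
```

So grep() tells you where a pattern occurs, while str_extract() pulls out the matching substring itself.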
