The control() verb replaces values in a vector with values looked up in a
thesaurus. It is similar to switch() or dplyr::recode(), but the
replacement values are specified as a data frame instead of as individual
arguments.
By default control() replaces only values of x that exactly match terms
in thesaurus. Additional arguments allow for case insensitive and fuzzy
matching strategies (see details). control_ci() and control_fuzzy() are
convenience aliases for case insensitive exact matching and full fuzzy
matching respectively.
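The idea of exact thesaurus lookup can be sketched in base R with match(). This is an illustration only, not the package's implementation: the column names preferred and variant are hypothetical, and leaving unmatched values unchanged is an assumption of the sketch, not documented behaviour of control().

```r
# Hypothetical two-column thesaurus; the column names are assumptions.
thesaurus <- data.frame(
  preferred = c("red", "green", "blue"),
  variant   = c("lipstick", "mint", "azure"),
  stringsAsFactors = FALSE
)

x <- c("red", "lipstick", "green", "mint", "blue", "azure", "teal")

# Position of each element of x among the variants (NA where there is no match)
idx <- match(x, thesaurus$variant)

# Replace matched values with the preferred term; keep the rest as they are
recoded <- ifelse(is.na(idx), x, thesaurus$preferred[idx])
recoded
```

Keeping the mapping in a data frame means it can be stored, edited, and reused like any other table, which is what sets this approach apart from passing each replacement as a separate argument.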
Usage
control(
  x,
  thesaurus,
  case_insensitive = FALSE,
  fuzzy_boundary = FALSE,
  fuzzy_encoding = FALSE,
  quiet = FALSE,
  warn_unmatched = TRUE,
  coalesce = TRUE
)
control_ci(x, thesaurus, ...)
control_fuzzy(x, thesaurus, ...)
Arguments
- x
Vector to recode.
- thesaurus
Data frame with two columns: a vector of preferred terms, and a vector of variants.
- case_insensitive
Set to TRUE to perform case insensitive matching.
- fuzzy_boundary
Set to TRUE to perform fuzzy matching that ignores differences in the word boundaries used (e.g. "foo bar" matches "foo-bar").
- fuzzy_encoding
Set to TRUE to perform fuzzy matching that ignores non-ASCII characters that may have been encoded differently (e.g. "foo" matches "foö").
- quiet
Set to TRUE to suppress messages about replaced values.
- warn_unmatched
If TRUE (the default), issues a warning for values that couldn't be matched in thesaurus.
- coalesce
If TRUE (the default), return only the closest matches in x. If FALSE, return all matches.
- ...
For control_ci() and control_fuzzy(), other arguments passed to control().
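One way to picture the case insensitive and fuzzy_boundary strategies is as normalising both sides before comparing them. The helpers below (normalise_case, normalise_boundary, norm) are hypothetical names for this sketch and are not taken from the package, whose actual matching rules may differ.

```r
# Illustrative normalisers; these are not the package's internals.
normalise_case <- function(x) tolower(x)
normalise_boundary <- function(x) {
  # Collapse any run of non-alphanumeric characters into a single space
  trimws(gsub("[^[:alnum:]]+", " ", x))
}

variants <- c("foo bar", "Baz")
x <- c("foo-bar", "FOO_BAR", "baz")

# Boundary-insensitive comparison: only "foo-bar" finds a variant
boundary_idx <- match(normalise_boundary(x), normalise_boundary(variants))

# Boundary- and case-insensitive comparison: every value finds a variant
norm <- function(x) normalise_case(normalise_boundary(x))
both_idx <- match(norm(x), norm(variants))
```

Encoding-tolerant matching is left out of the sketch because transliteration of non-ASCII characters in base R depends on the platform.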
Value
If coalesce = TRUE (the default), a vector the same length as x with
values matching variants in thesaurus replaced with the preferred term.
If coalesce = FALSE, a data frame with the same number of rows as x, and
columns for each type of match (e.g. exact, case_insensitive,
fuzzy_boundary, fuzzy_encoding).
By default, control() gives a message listing replaced values and a warning listing any
values not matched in the thesaurus. These can be suppressed with
quiet = TRUE and warn_unmatched = FALSE respectively.
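The relationship between the two return shapes can be sketched as a row-wise coalesce over the match columns. This assumes that "closest" means the first non-missing value when the columns are read from left to right, which the text above does not state explicitly. The data frame is made up for illustration.

```r
# Hypothetical result of coalesce = FALSE; the values are illustrative only.
matches <- data.frame(
  exact            = c("red", NA, NA),
  case_insensitive = c("red", "green", NA),
  stringsAsFactors = FALSE
)

# Take the first non-missing value in each row, reading columns left to right
first_non_na <- function(a, b) ifelse(is.na(a), b, a)
closest <- Reduce(first_non_na, matches)
closest
```

Rows with no match in any column stay NA in this sketch. How control() treats such values when coalesce = TRUE is not shown here.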
Examples
data(colour_thesaurus)
# Exact matching
x <- c("red", "lipstick", "green", "mint", "blue", "azure")
control(x, colour_thesaurus)
#> Replaced values:
#> ℹ lipstick → red
#> ℹ mint → green
#> ℹ azure → blue
#> [1] "red" "red" "green" "green" "blue" "blue"
# Case insensitive matching
x <- toupper(x)
control_ci(x, colour_thesaurus)
#> Replaced values:
#> ℹ RED → red
#> ℹ LIPSTICK → red
#> ℹ GREEN → green
#> ℹ MINT → green
#> ℹ BLUE → blue
#> ℹ AZURE → blue
#> [1] "red" "red" "green" "green" "blue" "blue"
# coalesce = FALSE returns all matches as a data frame, which can be useful
# for debugging:
control(x, colour_thesaurus, case_insensitive = TRUE, coalesce = FALSE)
#> Replaced values:
#> ℹ RED → red
#> ℹ LIPSTICK → red
#> ℹ GREEN → green
#> ℹ MINT → green
#> ℹ BLUE → blue
#> ℹ AZURE → blue
#>   exact case_insensitive
#> 1  <NA>              red
#> 2  <NA>              red
#> 3  <NA>            green
#> 4  <NA>            green
#> 5  <NA>             blue
#> 6  <NA>             blue