Skip to contents

The control() verb replaces values in a vector with values looked up in a thesaurus. It is similar to switch() or dplyr::recode(), but the replacement values are specified as a data frame instead of as individual arguments.

By default control() replaces only values of x that exactly match terms in thesaurus. Additional arguments allow for case insensitive and fuzzy matching strategies (see details). control_ci() and control_fuzzy() are convenience aliases for case insensitive exact matching and full fuzzy matching respectively.

Usage

control(
  x,
  thesaurus,
  case_insensitive = FALSE,
  fuzzy_boundary = FALSE,
  fuzzy_encoding = FALSE,
  quiet = FALSE,
  warn_unmatched = TRUE,
  coalesce = TRUE
)

control_ci(x, thesaurus, ...)

control_fuzzy(x, thesaurus, ...)

Arguments

x

Vector to recode.

thesaurus

Data frame with two columns: a vector of preferred terms, and a vector of variants.

case_insensitive

Set to TRUE to perform case insensitive matching.

fuzzy_boundary

Set to TRUE to perform fuzzy matching that ignores differences in the word boundaries used (e.g. "foo bar" matches "foo-bar").

fuzzy_encoding

Set to TRUE to perform fuzzy matching that ignores non-ASCII characters that may have been encoded differently (e.g. "foo" matches "foö").

quiet

Set to TRUE suppress messages about replaced values.

warn_unmatched

If TRUE (the default), issues a warning for values that couldn't be matched in thesaurus.

coalesce

If TRUE (the default), return only the closest matches in x. If FALSE, return all matches.

...

For control_ci() and control_fuzzy, other arguments passed to control().

Value

If coalesce = TRUE (the default), a vector the same length as x with values matching variants in thesaurus replaced with the preferred term.

If coalesce = FALSE, a data frame with the same number of rows as x, and columns for each type of match (e.g. exact, case_insensitive, fuzzy_boundary, fuzzy_encoding).

By default gives a message listing replaced values and a warning listing any values not matched in the thesaurus. These can be suppressed with quiet = TRUE and warn_unmatched = FALSE respectively.

Examples

data(colour_thesaurus)

# Exact matching
x <- c("red", "lipstick", "green", "mint", "blue", "azure")
control(x, colour_thesaurus)
#> Replaced values:
#>  lipstick → red
#>  mint → green
#>  azure → blue
#> [1] "red"   "red"   "green" "green" "blue"  "blue" 

# Case insensitive matching
x <- toupper(x)
control_ci(x, colour_thesaurus)
#> Replaced values:
#>  RED → red
#>  LIPSTICK → red
#>  GREEN → green
#>  MINT → green
#>  BLUE → blue
#>  AZURE → blue
#> [1] "red"   "red"   "green" "green" "blue"  "blue" 

# coalesce = FALSE returns all matches as a data frame, which can be useful
# for debugging:
control(x, colour_thesaurus, case_insensitive = TRUE, coalesce = FALSE)
#> Replaced values:
#>  RED → red
#>  LIPSTICK → red
#>  GREEN → green
#>  MINT → green
#>  BLUE → blue
#>  AZURE → blue
#>   exact case_insensitive
#> 1  <NA>              red
#> 2  <NA>              red
#> 3  <NA>            green
#> 4  <NA>            green
#> 5  <NA>             blue
#> 6  <NA>             blue