Create Descriptive Text Narratives from Data — narrate

narrate_descriptive() creates text narratives from a data frame that contains one numeric column and one or more character or factor columns, using glue syntax. The function works with either raw or aggregated data. By default it uses the first numeric column as the measure and all character or factor columns as dimensions.
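
The default column selection described above can be pictured with a few lines of base R. This is only a sketch of the idea, not narrator's actual implementation; the toy data frame and its column names are made up for illustration.

```r
# Toy data frame; the column names and values are illustrative only
df <- data.frame(
  Region = c("North", "South", "North"),
  Product = factor(c("A", "B", "B")),
  Sales = c(10, 20, 30),
  Quantity = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# First numeric column becomes the measure
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
measure <- numeric_cols[1]

# All character or factor columns become dimensions
is_dimension <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))
dimensions <- names(df)[is_dimension]

print(measure)
print(dimensions)
```

Here Sales is picked over Quantity only because it comes first, so passing measure explicitly is the safer choice when a data frame has several numeric columns.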

Usage

narrate_descriptive(
  df,
  measure = NULL,
  dimensions = NULL,
  summarization = "sum",
  coverage = 0.5,
  coverage_limit = 5,
  narration_depth = 2,
  use_chatgpt = FALSE,
  openai_api_key = Sys.getenv("OPENAI_API_KEY"),
  max_tokens = 1024,
  temperature = 0,
  top_p = 1,
  frequency_penalty = 0,
  presence_penalty = 0,
  template_total = "Total {measure} across all {pluralize(dimension_one)} is {total}.",
  template_average =
    "Average {measure} across all {pluralize(dimension_one)} is {total}.",
  template_outlier = "Outlying {dimension} by {measure} is {outlier_insight}.",
  template_outlier_multiple =
    "Outlying {pluralize(dimension)} by {measure} are {outlier_insight}.",
  template_outlier_l2 =
    "In {level_l1}, significant {level_l2} by {measure} is {outlier_insight}.",
  template_outlier_l2_multiple =
    "In {level_l1}, significant {pluralize(level_l2)} by {measure} are {outlier_insight}.",
  use_renviron = FALSE,
  return_data = FALSE,
  simplify = FALSE,
  format_numbers = FALSE,
  collapse_sep = ", ",
  collapse_last = " and ",
  ...
)
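
The template defaults in the signature are ordinary glue strings. As a rough sketch of how one of them might be filled in, assuming the glue package is installed, the snippet below renders the template_outlier_multiple default by hand. The pluralize() stand-in and the hand-set variables are assumptions for illustration (the outlier strings are copied from the Examples output); narrator's internals may build these values differently.

```r
library(glue)

# Hypothetical stand-in for the pluralize() helper referenced in the templates
pluralize <- function(x) paste0(x, "s")

measure <- "Sales"
dimension <- "Region"

# Pre-formatted outlier descriptions, copied from the Examples output
outliers <- c("NA (18079736.4, 46.6 %)", "EMEA (13555412.7, 34.9 %)")

# collapse_sep and collapse_last are documented as glue_collapse() separators
outlier_insight <- glue_collapse(outliers, sep = ", ", last = " and ")

# Same text as the template_outlier_multiple default
narrative <- glue("Outlying {pluralize(dimension)} by {measure} are {outlier_insight}.")
print(narrative)
```

Any expression that glue can evaluate may appear inside the braces, which is why a custom template can reorder or drop variables freely.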

Arguments

df: data.frame() or tibble(). Data frame or tibble, can be aggregated or raw
measure: Numeric measure the function uses for its calculations; if NULL, the first numeric column available is used
dimensions: Vector of dimensions for analysis; by default all character or factor variables are used
summarization: Approach for data summarization/aggregation - 'sum', 'count' or 'average'
coverage: Numeric portion of variability to be covered by narrative, 0 to 1
coverage_limit: Integer maximum number of elements to be narrated, overrides coverage to avoid extremely verbose narrative creation
narration_depth: Controls the depth of the analysis: 1 for summary, 2 for detailed
use_chatgpt: If TRUE - use ChatGPT to enhance the narrative
openai_api_key: Your OpenAI API key. You can set it in the .Renviron file as "OPENAI_API_KEY"; the function looks it up with Sys.getenv("OPENAI_API_KEY")
max_tokens: The maximum number of tokens to generate in the chat completion.
temperature: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
frequency_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
presence_penalty: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
template_total: glue template for total volumes narrative
template_average: glue template for average volumes narrative
template_outlier: glue template for single outlier narrative
template_outlier_multiple: glue template for multiple outliers narrative
template_outlier_l2: glue template for deeper hierarchical single outlier narrative
template_outlier_l2_multiple: glue template for deeper hierarchical multiple outliers narrative
use_renviron: If TRUE, use .Renviron variables in the template. You can also set options(narrator.use_renviron = TRUE) to make it global for the session, or create an environment variable "use_renviron" by editing your .Renviron file with usethis::edit_r_environ()
return_data: If TRUE - return a list of variables used in the function's templates
simplify: If TRUE - return a character vector, if FALSE - named list
format_numbers: If TRUE - format big numbers to K/M/B using format_num() function
collapse_sep: Separator for glue_collapse in cases with multiple values in single variable
collapse_last: Separator for glue_collapse for the last item, in cases with multiple values in single variable
...: other arguments passed to glue
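
To make the interplay of coverage and coverage_limit concrete, here is a minimal base R sketch. It assumes that levels are ranked by their share of the total and kept until the cumulative share first reaches coverage, with coverage_limit as a hard cap; that reading fits the argument descriptions and the Examples output, but the helper select_outliers() is hypothetical and is not part of narrator.

```r
# Hypothetical helper, not narrator's API: returns the percentage share of
# the top levels needed to reach `coverage`, capped at `coverage_limit`
select_outliers <- function(values, coverage = 0.5, coverage_limit = 5) {
  sorted <- sort(values, decreasing = TRUE)
  share <- sorted / sum(sorted)
  cum_share <- cumsum(share)

  # Position at which the cumulative share first reaches the coverage target
  n_needed <- which(cum_share >= coverage)[1]
  if (is.na(n_needed)) n_needed <- length(sorted)

  # coverage_limit overrides coverage when too many levels would be listed
  n_keep <- min(n_needed, coverage_limit)
  round(share[seq_len(n_keep)] * 100, 1)
}

# Toy totals per level; names and numbers are made up
totals <- c(A = 50, B = 30, C = 15, D = 5)

by_coverage <- select_outliers(totals, coverage = 0.6)
by_limit <- select_outliers(totals, coverage = 0.9, coverage_limit = 2)

print(by_coverage)
print(by_limit)
```

In the first call A alone covers 50 %, which is short of 0.6, so B is added. In the second call three levels would be needed to reach 0.9, but coverage_limit = 2 cuts the list after B.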

Value

A list() of narratives by default, or a character() vector if simplify = TRUE

Examples

library(narrator)
library(dplyr)  # provides %>%

sales %>%
  narrate_descriptive(measure = "Sales",
                      dimensions = c("Region", "Product"))
#> $`Total Sales`
#> Total Sales across all Regions is 38790478.4.
#> 
#> $`Region by Sales`
#> Outlying Regions by Sales are NA (18079736.4, 46.6 %) and EMEA (13555412.7, 34.9 %).
#> 
#> $`NA by Product`
#> In NA, significant Products by Sales are Food & Beverage (7392821, 40.9 %) and Electronics (3789132.7, 21 %).
#> 
#> $`EMEA by Product`
#> In EMEA, significant Products by Sales are Food & Beverage (5265113.2, 38.8 %) and Electronics (3182803.4, 23.5 %).
#> 
#> $`Product by Sales`
#> Outlying Products by Sales are Food & Beverage (15543469.7, 40.1 %) and Electronics (8608962.8, 22.2 %).
#> 

sales %>%
  dplyr::filter(Product %in% c("Tools", "Clothing", "Home")) %>%
  dplyr::group_by(Product, Region) %>%
  dplyr::summarise(Quantity = sum(Quantity)) %>%
  narrate_descriptive()
#> $`Total Quantity`
#> Total Quantity across all Products is 65653.
#> 
#> $`Product by Quantity`
#> Outlying Products by Quantity are Home (26697, 40.7 %) and Tools (25457, 38.8 %).
#> 
#> $`Home by Region`
#> In Home, significant Regions by Quantity are NA (12204, 45.7 %) and EMEA (8693, 32.6 %).
#> 
#> $`Tools by Region`
#> In Tools, significant Regions by Quantity are NA (11253, 44.2 %) and EMEA (8216, 32.3 %).
#> 
#> $`Region by Quantity`
#> Outlying Regions by Quantity are NA (29819, 45.4 %) and EMEA (21249, 32.4 %).
#> 

sales %>%
  narrate_descriptive(measure = "Order ID", dimensions = "Region", summarization = "count")
#> $`Total Order ID`
#> Total Order ID across all Regions is 10000.
#> 
#> $`Region by Order ID`
#> Outlying Regions by Order ID are NA (3975, 39.8 %) and EMEA (2986, 29.9 %).
#>