Introducing {nplyr}

A grammar of (nested) data manipulation

A grammar of (nested) data manipulation
rstats
nplyr
Author

Mark Rieke

Published

July 24, 2022

Data manipulation and transformation is a fundamental part of any analysis. There are excellent tools in the R ecosystem for manipulating data frames (dplyr, data.table, and arrow, to name a few). Sometimes, however, it is desirable to work with nested data frames, for which few tools are readily available.

This is where nplyr comes into play! nplyr is a grammar of nested data manipulation that allows users to perform dplyr-like manipulations on data frames nested within a list-col of another data frame. Most dplyr verbs have nested equivalents in nplyr. For example:

Installation

nplyr 0.1.0 is available on CRAN. Alternatively, you can install the development version from github with the devtools or remotes package:

Code
# install from CRAN
install.packages("nplyr")

# install from github
devtools::install_github("markjrieke/nplyr")

Usage

To get started, we’ll create a nested column for the country data within each continent from the gapminder dataset.

Code
library(nplyr)

gm_nest <-
  gapminder::gapminder_unfiltered %>%
  tidyr::nest(country_data = -continent)

gm_nest
#> # A tibble: 6 × 2
#>   continent country_data        
#>   <fct>     <list>              
#> 1 Asia      <tibble [578 × 5]>  
#> 2 Europe    <tibble [1,302 × 5]>
#> 3 Africa    <tibble [637 × 5]>  
#> 4 Americas  <tibble [470 × 5]>  
#> 5 FSU       <tibble [139 × 5]>  
#> 6 Oceania   <tibble [187 × 5]>

dplyr can perform operations on the top-level data frame, but with nplyr, we can perform operations on the nested data frames:

Code
gm_nest_example <-
  gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000)

# each nested tibble is now filtered to the most recent year
gm_nest_example
#> # A tibble: 6 × 2
#>   continent country_data     
#>   <fct>     <list>           
#> 1 Asia      <tibble [43 × 6]>
#> 2 Europe    <tibble [34 × 6]>
#> 3 Africa    <tibble [53 × 6]>
#> 4 Americas  <tibble [33 × 6]>
#> 5 FSU       <tibble [9 × 6]> 
#> 6 Oceania   <tibble [11 × 6]>
Code
# if we unnest, we can see that a new column for pop_millions has been created
gm_nest_example %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

nplyr also supports grouped operations with nest_group_by():

Code
gm_nest_example <-
  gm_nest %>%
  nest_group_by(country_data, year) %>%
  nest_summarise(
    country_data,
    n = n(),
    lifeExp = median(lifeExp),
    pop = median(pop),
    gdpPercap = median(gdpPercap)
  )

gm_nest_example
#> # A tibble: 6 × 2
#>   continent country_data     
#>   <fct>     <list>           
#> 1 Asia      <tibble [58 × 5]>
#> 2 Europe    <tibble [58 × 5]>
#> 3 Africa    <tibble [13 × 5]>
#> 4 Americas  <tibble [57 × 5]>
#> 5 FSU       <tibble [44 × 5]>
#> 6 Oceania   <tibble [56 × 5]>
Code
# unnesting shows summarised tibbles for each continent
gm_nest_example %>%
  slice(2) %>%
  tidyr::unnest(country_data)
#> # A tibble: 58 × 6
#>    continent  year     n lifeExp      pop gdpPercap
#>    <fct>     <int> <int>   <dbl>    <dbl>     <dbl>
#>  1 Europe     1950    22    65.8 7408264      6343.
#>  2 Europe     1951    18    65.7 7165515      6509.
#>  3 Europe     1952    31    65.9 7124673      5210.
#>  4 Europe     1953    17    67.3 7346100      6774.
#>  5 Europe     1954    17    68.0 7423300      7046.
#>  6 Europe     1955    17    68.5 7499400      7817.
#>  7 Europe     1956    17    68.5 7575800      8224.
#>  8 Europe     1957    31    67.5 7363802      6093.
#>  9 Europe     1958    18    69.6 8308052.     8833.
#> 10 Europe     1959    18    69.6 8379664.     9088.
#> # … with 48 more rows

Other use cases

In the previous set of examples, the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.

Code
# we can use nplyr to perform operations on the nested data
gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000) %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows
Code
# in this case, we could have obtained the same result with tidyr and dplyr
gm_nest %>%
  tidyr::unnest(country_data) %>%
  group_by(continent) %>%
  filter(year == max(year)) %>%
  mutate(pop_millions = pop/1000000) %>%
  ungroup() %>%
  filter(continent == "Asia")
#> # A tibble: 43 × 7
#>    continent country           year lifeExp        pop gdpPercap pop_millions
#>    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
#>  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
#>  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
#>  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
#>  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
#>  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
#>  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
#>  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
#>  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
#>  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
#> 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
#> # … with 33 more rows

Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!

Consider a set of surveys that an organization might use to gather market data. It is common for organization to have separate surveys for separate purposes but to gather the same baseline set of data across all surveys (for example , a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:

Survey 1: Job
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your job? (multiple choice)
  5. What is your annual salary? (numeric entry)
Survey 2: Personal Life
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (mulitple choice)
  4. Overall, how satisfied are you with your personal life (multiple choice)
  5. Please provide any additional detail (text entry)

In this scenario, both surveys are collecting demographic information — age, location, and industry — but differ in the remaining questions. A convenient way to get the response files into the environment would be to use purrr::map() to read each file to a nested data frame.

Code
path <- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"

surveys <- 
  tibble::tibble(survey_file = c("job_survey", "personal_survey")) %>%
  mutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))

surveys
#> # A tibble: 2 × 2
#>   survey_file     survey_data         
#>   <chr>           <list>              
#> 1 job_survey      <spc_tbl_ [500 × 6]>
#> 2 personal_survey <spc_tbl_ [750 × 6]>

tidyr::unnest() can usually handle idiosyncracies in layout when unnesting, but in this case unnesting throws an error!

Code
surveys %>%
  tidyr::unnest(survey_data)
#> Error in `list_unchop()`:
#> ! Can't combine `x[[1]]$Q5` <double> and `x[[2]]$Q5` <character>.

This is because the surveys share column names but not necessarily column types! In this case, both data frames contain a column named Q5, but in job_survey it’s a double and in personal_survey it’s a character.

Code
surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1          <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2          <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3          <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4          <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5          <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
Code
surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1          <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2          <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3          <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4          <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5          <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…

We could potentially get around this issue with unnesting by reading in all columns as characters via readr::read_csv(x, col_types = cols(.default = "c")), but this presents its own challenges. Q5 would still be better represented as a double in job_survey and, from the survey question text, Q4 has similar, but distinctly different, meanings across the survey files.

This is where nplyr can assist! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.

Code
surveys <- 
  surveys %>%
  nest_mutate(survey_data,
              age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
  nest_group_by(survey_data, Q3) %>%
  nest_add_count(survey_data, 
                 name = "n_respondents_in_industry") %>%
  nest_mutate(survey_data, 
              median_industry_age = median(Q1)) %>%
  nest_ungroup(survey_data)

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 500 × 10
#>    survey_file survey_name    Q1 Q2    Q3    Q4       Q5 age_g…¹ n_res…² media…³
#>    <chr>       <chr>       <dbl> <chr> <chr> <chr> <dbl> <chr>     <int>   <dbl>
#>  1 job_survey  job           100 Aust… Cons… Some…   163 Retire…     107    57  
#>  2 job_survey  job            81 San … Cons… Neit…    48 Retire…     107    57  
#>  3 job_survey  job            51 Aust… Cons… Extr…   190 Adult       107    57  
#>  4 job_survey  job            81 Aust… Tech… Extr…    25 Retire…     108    61.5
#>  5 job_survey  job            80 Dall… Cons… Extr…   143 Retire…     107    57  
#>  6 job_survey  job            32 Fort… Ener… Some…   233 Adult        99    59  
#>  7 job_survey  job            65 Dall… Cons… Some…    43 Retire…     107    57  
#>  8 job_survey  job            57 Hous… Heal… Some…   243 Adult        75    65  
#>  9 job_survey  job            43 Dall… Gove… Neit…   158 Adult       111    57  
#> 10 job_survey  job            94 Fort… Heal… Extr…   235 Retire…      75    65  
#> # … with 490 more rows, and abbreviated variable names ¹​age_group,
#> #   ²​n_respondents_in_industry, ³​median_industry_age
Code
surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data)
#> # A tibble: 750 × 10
#>    survey_file     surve…¹    Q1 Q2    Q3    Q4    Q5    age_g…² n_res…³ media…⁴
#>    <chr>           <chr>   <dbl> <chr> <chr> <chr> <chr> <chr>     <int>   <dbl>
#>  1 personal_survey person…    91 Aust… Ener… Neit… Blan… Retire…     145      61
#>  2 personal_survey person…    32 San … Heal… Extr… Elem… Adult       131      62
#>  3 personal_survey person…    40 San … Cons… Some… Eget… Adult       149      61
#>  4 personal_survey person…    23 Aust… Cons… Extr… Scel… Adult       149      61
#>  5 personal_survey person…    88 Dall… Tech… Neit… Aene… Retire…     150      61
#>  6 personal_survey person…    69 Fort… Tech… Neit… Inte… Retire…     150      61
#>  7 personal_survey person…    96 Hous… Heal… Extr… Blan… Retire…     131      62
#>  8 personal_survey person…    40 Hous… Cons… Extr… Scel… Adult       149      61
#>  9 personal_survey person…    57 Fort… Ener… Extr… Pede… Adult       145      61
#> 10 personal_survey person…    40 Fort… Heal… Extr… Phar… Adult       131      62
#> # … with 740 more rows, and abbreviated variable names ¹​survey_name,
#> #   ²​age_group, ³​n_respondents_in_industry, ⁴​median_industry_age

Citation

BibTeX citation:
@online{rieke2022,
  author = {Rieke, Mark},
  title = {Introducing \{Nplyr\}},
  date = {2022-07-24},
  url = {https://www.thedatadiary.net/posts/2022-07-24-introducing-nplyr},
  langid = {en}
}
For attribution, please cite this work as:
Rieke, Mark. 2022. “Introducing {Nplyr}.” July 24, 2022. https://www.thedatadiary.net/posts/2022-07-24-introducing-nplyr.

Stay in touch

Support my work with a coffee

Share