Introducing {nplyr}

A grammar of (nested) data manipulation

Data manipulation and transformation is a fundamental part of any analysis. There are excellent tools in the R ecosystem for manipulating data frames (dplyr, data.table, and arrow, to name a few). Sometimes, however, it is desirable to work with nested data frames, for which few tools are readily available.

This is where nplyr comes into play! nplyr is a grammar of nested data manipulation that allows users to perform dplyr-like manipulations on data frames nested within a list-col of another data frame. Most dplyr verbs have nested equivalents in nplyr. For example:

  • nest_mutate() is the nested equivalent of mutate()
  • nest_select() is the nested equivalent of select()
  • nest_filter() is the nested equivalent of filter()
  • nest_summarise() is the nested equivalent of summarise()
  • nest_group_by() is the nested equivalent of group_by()

Installation

nplyr 0.1.0 is available on CRAN. Alternatively, you can install the development version from github with the devtools or remotes package:

# install from CRAN
install.packages("nplyr")

# install from github
devtools::install_github("markjrieke/nplyr")

Usage

To get started, we’ll create a nested column for the country data within each continent from the gapminder dataset.

library(nplyr)

gm_nest <-
  gapminder::gapminder_unfiltered %>%
  tidyr::nest(country_data = -continent)

gm_nest
## # A tibble: 6 × 2
##   continent country_data        
##   <fct>     <list>              
## 1 Asia      <tibble [578 × 5]>  
## 2 Europe    <tibble [1,302 × 5]>
## 3 Africa    <tibble [637 × 5]>  
## 4 Americas  <tibble [470 × 5]>  
## 5 FSU       <tibble [139 × 5]>  
## 6 Oceania   <tibble [187 × 5]>

dplyr can perform operations on the top-level data frame, but with nplyr, we can perform operations on the nested data frames:

gm_nest_example <-
  gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000)

# each nested tibble is now filtered to the most recent year
gm_nest_example
## # A tibble: 6 × 2
##   continent country_data     
##   <fct>     <list>           
## 1 Asia      <tibble [43 × 6]>
## 2 Europe    <tibble [34 × 6]>
## 3 Africa    <tibble [53 × 6]>
## 4 Americas  <tibble [33 × 6]>
## 5 FSU       <tibble [9 × 6]> 
## 6 Oceania   <tibble [11 × 6]>
# if we unnest, we can see that a new column for pop_millions has been created
gm_nest_example %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
## # A tibble: 43 × 7
##    continent country           year lifeExp        pop gdpPercap pop_millions
##    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
##  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
##  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
##  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
##  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
##  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
##  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
##  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
##  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
##  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
## 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
## # … with 33 more rows

nplyr also supports grouped operations with nest_group_by():

gm_nest_example <-
  gm_nest %>%
  nest_group_by(country_data, year) %>%
  nest_summarise(
    country_data,
    n = n(),
    lifeExp = median(lifeExp),
    pop = median(pop),
    gdpPercap = median(gdpPercap)
  )

gm_nest_example
## # A tibble: 6 × 2
##   continent country_data     
##   <fct>     <list>           
## 1 Asia      <tibble [58 × 5]>
## 2 Europe    <tibble [58 × 5]>
## 3 Africa    <tibble [13 × 5]>
## 4 Americas  <tibble [57 × 5]>
## 5 FSU       <tibble [44 × 5]>
## 6 Oceania   <tibble [56 × 5]>
# unnesting shows summarised tibbles for each continent
gm_nest_example %>%
  slice(2) %>%
  tidyr::unnest(country_data)
## # A tibble: 58 × 6
##    continent  year     n lifeExp      pop gdpPercap
##    <fct>     <int> <int>   <dbl>    <dbl>     <dbl>
##  1 Europe     1950    22    65.8 7408264      6343.
##  2 Europe     1951    18    65.7 7165515      6509.
##  3 Europe     1952    31    65.9 7124673      5210.
##  4 Europe     1953    17    67.3 7346100      6774.
##  5 Europe     1954    17    68.0 7423300      7046.
##  6 Europe     1955    17    68.5 7499400      7817.
##  7 Europe     1956    17    68.5 7575800      8224.
##  8 Europe     1957    31    67.5 7363802      6093.
##  9 Europe     1958    18    69.6 8308052.     8833.
## 10 Europe     1959    18    69.6 8379664.     9088.
## # … with 48 more rows

Other use cases

In the previous set of examples, the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.

# we can use nplyr to perform operations on the nested data
gm_nest %>%
  nest_filter(country_data, year == max(year)) %>%
  nest_mutate(country_data, pop_millions = pop/1000000) %>%
  slice_head(n = 1) %>%
  tidyr::unnest(country_data)
## # A tibble: 43 × 7
##    continent country           year lifeExp        pop gdpPercap pop_millions
##    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
##  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
##  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
##  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
##  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
##  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
##  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
##  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
##  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
##  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
## 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
## # … with 33 more rows
# in this case, we could have obtained the same result with tidyr and dplyr
gm_nest %>%
  tidyr::unnest(country_data) %>%
  group_by(continent) %>%
  filter(year == max(year)) %>%
  mutate(pop_millions = pop/1000000) %>%
  ungroup() %>%
  filter(continent == "Asia")
## # A tibble: 43 × 7
##    continent country           year lifeExp        pop gdpPercap pop_millions
##    <fct>     <fct>            <int>   <dbl>      <int>     <dbl>        <dbl>
##  1 Asia      Afghanistan       2007    43.8   31889923      975.       31.9  
##  2 Asia      Azerbaijan        2007    67.5    8017309     7709.        8.02 
##  3 Asia      Bahrain           2007    75.6     708573    29796.        0.709
##  4 Asia      Bangladesh        2007    64.1  150448339     1391.      150.   
##  5 Asia      Bhutan            2007    65.6    2327849     4745.        2.33 
##  6 Asia      Brunei            2007    77.1     386511    48015.        0.387
##  7 Asia      Cambodia          2007    59.7   14131858     1714.       14.1  
##  8 Asia      China             2007    73.0 1318683096     4959.     1319.   
##  9 Asia      Hong Kong, China  2007    82.2    6980412    39725.        6.98 
## 10 Asia      India             2007    64.7 1110396331     2452.     1110.   
## # … with 33 more rows

Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!

Consider a set of surveys that an organization might use to gather market data. It is common for organization to have separate surveys for separate purposes but to gather the same baseline set of data across all surveys (for example , a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:

Survey 1: Job
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (multiple choice)
  4. Overall, how satisfied are you with your job? (multiple choice)
  5. What is your annual salary? (numeric entry)
Survey 2: Personal Life
  1. How old are you? (multiple choice)
  2. What city do you live in? (multiple choice)
  3. What field do you work in? (mulitple choice)
  4. Overall, how satisfied are you with your personal life (multiple choice)
  5. Please provide any additional detail (text entry)

In this scenario, both surveys are collecting demographic information — age, location, and industry — but differ in the remaining questions. A convenient way to get the response files into the environment would be to use purrr::map() to read each file to a nested data frame.

path <- "https://raw.githubusercontent.com/markjrieke/nplyr/main/data-raw/"

surveys <- 
  tibble::tibble(survey_file = c("job_survey", "personal_survey")) %>%
  mutate(survey_data = purrr::map(survey_file, ~readr::read_csv(paste0(path, .x, ".csv"))))

surveys
## # A tibble: 2 × 2
##   survey_file     survey_data            
##   <chr>           <list>                 
## 1 job_survey      <spec_tbl_df [500 × 6]>
## 2 personal_survey <spec_tbl_df [750 × 6]>

tidyr::unnest() can usually handle idiosyncracies in layout when unnesting, but in this case unnesting throws an error!

surveys %>%
  tidyr::unnest(survey_data)
## Error:
## ! Can't combine `Q5` <double> and `Q5` <character>.

This is because the surveys share column names but not necessarily column types! In this case, both data frames contain a column named Q5, but in job_survey it’s a double and in personal_survey it’s a character.

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
## Rows: 500
## Columns: 7
## $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
## $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
## $ Q1          <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
## $ Q2          <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
## $ Q3          <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
## $ Q4          <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
## $ Q5          <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data) %>%
  glimpse()
## Rows: 750
## Columns: 7
## $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
## $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
## $ Q1          <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
## $ Q2          <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
## $ Q3          <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
## $ Q4          <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
## $ Q5          <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…

We could potentially get around this issue with unnesting by reading in all columns as characters via readr::read_csv(x, col_types = cols(.default = "c")), but this presents its own challenges. Q5 would still be better represented as a double in job_survey and, from the survey question text, Q4 has similar, but distinctly different, meanings across the survey files.

This is where nplyr can assist! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.

surveys <- 
  surveys %>%
  nest_mutate(survey_data,
              age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
  nest_group_by(survey_data, Q3) %>%
  nest_add_count(survey_data, 
                 name = "n_respondents_in_industry") %>%
  nest_mutate(survey_data, 
              median_industry_age = median(Q1)) %>%
  nest_ungroup(survey_data)

surveys %>%
  slice(1) %>%
  tidyr::unnest(survey_data)
## # A tibble: 500 × 10
##    survey_file survey_name    Q1 Q2          Q3            Q4       Q5 age_group
##    <chr>       <chr>       <dbl> <chr>       <chr>         <chr> <dbl> <chr>    
##  1 job_survey  job           100 Austin      Consulting    Some…   163 Retireme…
##  2 job_survey  job            81 San Antonio Consulting    Neit…    48 Retireme…
##  3 job_survey  job            51 Austin      Consulting    Extr…   190 Adult    
##  4 job_survey  job            81 Austin      Technology    Extr…    25 Retireme…
##  5 job_survey  job            80 Dallas      Consulting    Extr…   143 Retireme…
##  6 job_survey  job            32 Fort Worth  Energy        Some…   233 Adult    
##  7 job_survey  job            65 Dallas      Consulting    Some…    43 Retireme…
##  8 job_survey  job            57 Houston     Healthcare    Some…   243 Adult    
##  9 job_survey  job            43 Dallas      Government S… Neit…   158 Adult    
## 10 job_survey  job            94 Fort Worth  Healthcare    Extr…   235 Retireme…
## # … with 490 more rows, and 2 more variables: n_respondents_in_industry <int>,
## #   median_industry_age <dbl>
surveys %>%
  slice(2) %>%
  tidyr::unnest(survey_data)
## # A tibble: 750 × 10
##    survey_file     survey_name    Q1 Q2          Q3        Q4    Q5    age_group
##    <chr>           <chr>       <dbl> <chr>       <chr>     <chr> <chr> <chr>    
##  1 personal_survey personal       91 Austin      Energy    Neit… Blan… Retireme…
##  2 personal_survey personal       32 San Antonio Healthca… Extr… Elem… Adult    
##  3 personal_survey personal       40 San Antonio Consulti… Some… Eget… Adult    
##  4 personal_survey personal       23 Austin      Consulti… Extr… Scel… Adult    
##  5 personal_survey personal       88 Dallas      Technolo… Neit… Aene… Retireme…
##  6 personal_survey personal       69 Fort Worth  Technolo… Neit… Inte… Retireme…
##  7 personal_survey personal       96 Houston     Healthca… Extr… Blan… Retireme…
##  8 personal_survey personal       40 Houston     Consulti… Extr… Scel… Adult    
##  9 personal_survey personal       57 Fort Worth  Energy    Extr… Pede… Adult    
## 10 personal_survey personal       40 Fort Worth  Healthca… Extr… Phar… Adult    
## # … with 740 more rows, and 2 more variables: n_respondents_in_industry <int>,
## #   median_industry_age <dbl>
Avatar
Mark Rieke
Senior CX Analyst

I’m a mechanical engineer by education, data analyst by practice. I love machine learning and communicating complex topics clearly with simple and beautiful charts.