DataFramesMeta.jl Tutorial

This is a port of the HarvardX series PH525x Genomics class tutorial on dplyr. View the original here and the source here.

What is DataFramesMeta.jl?

DataFramesMeta.jl is a Julia package to transform and summarize tabular data. It uses Julia macros to create a domain-specific language for convenient syntax to work with data frames from DataFrames.jl. DataFramesMeta.jl mirrors concepts in DataFrames.jl as closely as possible, without implementing new features on it's own. For a deeper explanation of DataFramesMeta.jl, see the documentation.

DataFramesMeta.jl is heavily inspired by R's dplyr. If you are familiar with dplyr this guide should get you up to speed with DataFramesMeta.jl.

However this tutorial assumes a cursory knowledge of DataFrames.jl. For instance, you should be familiar with the concept of a symbol in Julia (:x), and how it is used to index a data frame in DataFrames.jl, such as with df[:, :x].

Why Is It Useful?

Like dplyr, the DataFramesMeta.jl package contains a set of macros (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, and adding new columns.

In addition, DataFramesMeta.jl makes it easier to summarize data with the @combine verb, which implements the split-apply-combine pattern commonly seen in dplyr and other data manipulation packages.

How Does It Compare To Using Base Functions in Julia and in DataFrames.jl?

If you are familiar with Julia, you are probably familiar with base Julia functions such map, and broadcast (akin to *apply in R). These functions are convenient to use, but are designed to work with arrays, not tabular data.

DataFrames.jl provides the functions select, transform, and more to work with data frames. Unlike map and broadcast, these functions are designed to work with tabular data, but have a complicated syntax.

DataFramesMeta.jl provides a convenient syntax for working with the vectors in a data frame so that you get all the convenience of Base Julia and DataFrames combined.

How Do I Get DataFramesMeta.jl?

To install DataFramesMeta.jl, which also installs DataFrames.jl:

import Pkg
Pkg.activate(; temp=true) # activate a temprary environment for this tutorial
Pkg.add("DataFramesMeta");

To load DataFramesMeta.jl, which also loads DataFrames.jl:

julia> using DataFramesMeta

For this tutorial, we will install some additional packages as well.

Pkg.add(["CSV", "HTTP"])

Now we load them. We also load the Statistics standard library, which is shipped with Julia, so does not need to be installed.

julia> using CSV, HTTP, Statistics

We will use CSV.jl and HTTP.jl for downloading our dataset from the internet.

Data: Mammals Sleep

The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and 11 variables.

We can load the data directly into a DataFrame from the url.

julia> url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv""https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
julia> msleep = CSV.read(HTTP.get(url).body, DataFrame; missingstring="NA")83×11 DataFrame Row │ name genus vore order co ⋯ │ String31… String15… String7…? String15… St ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates mi 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla do ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa mi 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia mi ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida mi 79 │ Tree shrew Tupaia omni Scandentia mi 80 │ Bottle-nosed dolphin Tursiops carni Cetacea mi 81 │ Genet Genetta carni Carnivora mi ⋯ 82 │ Arctic fox Vulpes carni Carnivora mi 83 │ Red fox Vulpes carni Carnivora mi 7 columns and 68 rows omitted

The columns (in order) correspond to the following:

column nameDescription
:namecommon name
:genustaxonomic rank
:vorecarnivore, omnivore or herbivore?
:ordertaxonomic rank
:conservationthe conservation status of the mammal
:sleep_totaltotal amount of sleep, in hours
:sleep_remrem sleep, in hours
:sleep_cyclelength of sleep cycle, in hours
:awakeamount of time spent awake, in hours
:brainwtbrain weight in kilograms
:bodywtbody weight in kilograms

Important DataFramesMeta.jl Verbs To Remember

Many DataFrames.jl macros come in two forms, a version which operates on columns as a whole and a version which operations row-wise, prefixed by r.

DataFramesMeta.jl macroBy-row versionDescriptiondplyr equivalent
@select@rselectselect columnsselect
@transform@rtransformcreate new columnsmutate
@subset@rsubsetfilter rowsfilter
@orderby@rorderbyre-order or arrange rowsarrange
@combinesummarise valuessummarize (but @combine is more flexible)
groupbyallows for group operations in the "split-apply-combine" conceptgroup_by

DataFramesMeta.jl Verbs In Action

Two of the most basic functions are @select and @subset, which selects columns and filters rows respectively. To reference columns, use the Symbol of the column name, i.e. :name refers to the column msleep.name.

Selecting Columns Using @select

Select a set of columns: the :name and the :sleep_total columns.

julia> @select msleep :name :sleep_total83×2 DataFrame
 Row │ name                        sleep_total
     │ String31…                   Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

If you have a column name stored as a variable, you can select it as a column with the syntax $.

julia> varname = :sleep_total:sleep_total
julia> @select msleep :name $varname83×2 DataFrame Row │ name sleep_total │ String31… Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

The $ sign has special meaning in DataFramesMeta.jl. We use it for any column reference which is not a symbol. Without it, DataFramesMeta.jl can't tell whether varname refers to the column :sleep_total.

You can also use $ to refer to columns with strings

julia> varname = "sleep_total""sleep_total"
julia> @select msleep :name $varname83×2 DataFrame Row │ name sleep_total │ String31… Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

as well as vectors of variable names

julia> varnames = ["name", "sleep_total"]2-element Vector{String}:
 "name"
 "sleep_total"
julia> @select msleep $varnames83×2 DataFrame Row │ name sleep_total │ String31… Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

Similarly, to select the first column, use the syntax $1.

julia> @select msleep $183×1 DataFrame
 Row │ name
     │ String31…
─────┼────────────────────────────
   1 │ Cheetah
   2 │ Owl monkey
   3 │ Mountain beaver
   4 │ Greater short-tailed shrew
   5 │ Cow
   6 │ Three-toed sloth
   7 │ Northern fur seal
   8 │ Vesper mouse
  ⋮  │             ⋮
  77 │ Brazilian tapir
  78 │ Tenrec
  79 │ Tree shrew
  80 │ Bottle-nosed dolphin
  81 │ Genet
  82 │ Arctic fox
  83 │ Red fox
                   68 rows omitted

To select all the columns except a specific column, use the Not function for inverse selection. We also need to wrap Not in the $ sign, because it is not a symbol.

julia> @select msleep $(Not(:name))83×10 DataFrame
 Row │ genus         vore       order           conservation  sleep_total  sle ⋯
     │ String15…     String7…?  String15…       String15…?    Float64      Flo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Acinonyx      carni      Carnivora       lc                   12.1  mis ⋯
   2 │ Aotus         omni       Primates        missing              17.0
   3 │ Aplodontia    herbi      Rodentia        nt                   14.4
   4 │ Blarina       omni       Soricomorpha    lc                   14.9
   5 │ Bos           herbi      Artiodactyla    domesticated          4.0      ⋯
   6 │ Bradypus      herbi      Pilosa          missing              14.4
   7 │ Callorhinus   carni      Carnivora       vu                    8.7
   8 │ Calomys       missing    Rodentia        missing               7.0  mis
  ⋮  │      ⋮            ⋮            ⋮              ⋮             ⋮           ⋱
  77 │ Tapirus       herbi      Perissodactyla  vu                    4.4      ⋯
  78 │ Tenrec        omni       Afrosoricida    missing              15.6
  79 │ Tupaia        omni       Scandentia      missing               8.9
  80 │ Tursiops      carni      Cetacea         missing               5.2  mis
  81 │ Genetta       carni      Carnivora       missing               6.3      ⋯
  82 │ Vulpes        carni      Carnivora       missing              12.5  mis
  83 │ Vulpes        carni      Carnivora       missing               9.8
                                                   5 columns and 68 rows omitted

To select a range of columns by name, use the Between operator:

julia> @select msleep $(Between(:name, :order))83×4 DataFrame
 Row │ name                        genus         vore       order
     │ String31…                   String15…     String7…?  String15…
─────┼─────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni      Carnivora
   2 │ Owl monkey                  Aotus         omni       Primates
   3 │ Mountain beaver             Aplodontia    herbi      Rodentia
   4 │ Greater short-tailed shrew  Blarina       omni       Soricomorpha
   5 │ Cow                         Bos           herbi      Artiodactyla
   6 │ Three-toed sloth            Bradypus      herbi      Pilosa
   7 │ Northern fur seal           Callorhinus   carni      Carnivora
   8 │ Vesper mouse                Calomys       missing    Rodentia
  ⋮  │             ⋮                    ⋮            ⋮            ⋮
  77 │ Brazilian tapir             Tapirus       herbi      Perissodactyla
  78 │ Tenrec                      Tenrec        omni       Afrosoricida
  79 │ Tree shrew                  Tupaia        omni       Scandentia
  80 │ Bottle-nosed dolphin        Tursiops      carni      Cetacea
  81 │ Genet                       Genetta       carni      Carnivora
  82 │ Arctic fox                  Vulpes        carni      Carnivora
  83 │ Red fox                     Vulpes        carni      Carnivora
                                                            68 rows omitted

To select all columns that start with the character string "sl" use regular expressions:

julia> @select msleep $(r"^sl")83×3 DataFrame
 Row │ sleep_total  sleep_rem  sleep_cycle
     │ Float64      Float64?   Float64?
─────┼────────────────────────────────────────
   1 │        12.1  missing    missing
   2 │        17.0        1.8  missing
   3 │        14.4        2.4  missing
   4 │        14.9        2.3        0.133333
   5 │         4.0        0.7        0.666667
   6 │        14.4        2.2        0.766667
   7 │         8.7        1.4        0.383333
   8 │         7.0  missing    missing
  ⋮  │      ⋮           ⋮            ⋮
  77 │         4.4        1.0        0.9
  78 │        15.6        2.3  missing
  79 │         8.9        2.6        0.233333
  80 │         5.2  missing    missing
  81 │         6.3        1.3  missing
  82 │        12.5  missing    missing
  83 │         9.8        2.4        0.35
                               68 rows omitted

Regular expressions are powerful, but can be difficult for new users to understand. Here are some quick tips.

  1. r"^abc" = Starts with "abc"
  2. r"abc$" = Ends with "abc"
  3. r"abc" = Contains "abc" anywhere.

Selecting Rows Using @subset and @rsubset

Filter the rows for mammals that sleep a total of more than 16 hours.

julia> @subset msleep :sleep_total .>= 168×11 DataFrame
 Row │ name                    genus         vore       order            conse ⋯
     │ String31…               String15…     String7…?  String15…        Strin ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni       Primates         missi ⋯
   2 │ Long-nosed armadillo    Dasypus       carni      Cingulata        lc
   3 │ North American Opossum  Didelphis     omni       Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti    Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni      Didelphimorphia  lc    ⋯
   6 │ Little brown bat        Myotis        insecti    Chiroptera       missi
   7 │ Giant armadillo         Priodontes    insecti    Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi      Rodentia         lc
                                                               7 columns omitted

In the above expression, the .>= means we "broadcast" the >= comparison across the whole column. We can use a simpler syntax, @rsubset which automatically broadcasts all operations.

julia> @rsubset msleep :sleep_total > 168×11 DataFrame
 Row │ name                    genus         vore       order            conse ⋯
     │ String31…               String15…     String7…?  String15…        Strin ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni       Primates         missi ⋯
   2 │ Long-nosed armadillo    Dasypus       carni      Cingulata        lc
   3 │ North American Opossum  Didelphis     omni       Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti    Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni      Didelphimorphia  lc    ⋯
   6 │ Little brown bat        Myotis        insecti    Chiroptera       missi
   7 │ Giant armadillo         Priodontes    insecti    Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi      Rodentia         lc
                                                               7 columns omitted

Subset the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram. For this we put multiple operations on separate lines in a single block.

julia> @rsubset msleep begin
           :sleep_total >= 16
           :bodywt >= 1
       end3×11 DataFrame
 Row │ name                    genus       vore       order            conserv ⋯
     │ String31…               String15…   String7…?  String15…        String1 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni      Cingulata        lc      ⋯
   2 │ North American Opossum  Didelphis   omni       Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti    Cingulata        en
                                                               7 columns omitted

If you are coming from dplyr, you can also write the above command in a way that looks more familiar.

julia> @rsubset(msleep, :sleep_total >= 16, :bodywt >= 1)3×11 DataFrame
 Row │ name                    genus       vore       order            conserv ⋯
     │ String31…               String15…   String7…?  String15…        String1 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni      Cingulata        lc      ⋯
   2 │ North American Opossum  Didelphis   omni       Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti    Cingulata        en
                                                               7 columns omitted

Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. We wrap code in a let block to ensure things are fast.

julia> let
           relevant_orders = Set(["Perissodactyla", "Primates"])
           @rsubset msleep :order in relevant_orders
       end15×11 DataFrame
 Row │ name             genus          vore       order           conservation ⋯
     │ String31…        String15…      String7…?  String15…       String15…?   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey       Aotus          omni       Primates        missing      ⋯
   2 │ Grivet           Cercopithecus  omni       Primates        lc
   3 │ Horse            Equus          herbi      Perissodactyla  domesticated
   4 │ Donkey           Equus          herbi      Perissodactyla  domesticated
   5 │ Patas monkey     Erythrocebus   omni       Primates        lc           ⋯
   6 │ Galago           Galago         omni       Primates        missing
   7 │ Human            Homo           omni       Primates        missing
   8 │ Mongoose lemur   Lemur          herbi      Primates        vu
   9 │ Macaque          Macaca         omni       Primates        missing      ⋯
  10 │ Slow loris       Nyctibeus      carni      Primates        missing
  11 │ Chimpanzee       Pan            omni       Primates        missing
  12 │ Baboon           Papio          omni       Primates        missing
  13 │ Potto            Perodicticus   omni       Primates        lc           ⋯
  14 │ Squirrel monkey  Saimiri        omni       Primates        missing
  15 │ Brazilian tapir  Tapirus        herbi      Perissodactyla  vu
                                                               6 columns omitted

You can use the boolean operators (e.g. >, <, >=, <=, !=, in) to create the logical tests.

Chain.jl

Before we go any further, let's introduce the @chain macro from Chain.jl, which is re-exported by DataFramesMeta.jl. @chain allows for you to pipe the output from one operation into the input of another operation. The idea of piping is to read the functions from left to right. The syntax and design of @chain is very similar to %>% which users of dplyr are familiar with.

To show it's usefulness, let's use @select and @rsubset one after the other.

msleep_1 = @select msleep :name :sleep_total
msleep_2 = @rsubset msleep_1 :sleep_total > 16

Now in this case, we will pipe the msleep data frame to the function that will select two columns (:name and :sleep_total) and then pipe the new data frame to the @rsubset opertaion. This method involves awkwardly creating and naming temporary data frames. We can avoid this with @chain.

julia> @chain msleep begin
           @select :name :sleep_total
           @rsubset :sleep_total > 16
       end8×2 DataFrame
 Row │ name                    sleep_total
     │ String31…               Float64
─────┼─────────────────────────────────────
   1 │ Owl monkey                     17.0
   2 │ Long-nosed armadillo           17.4
   3 │ North American Opossum         18.0
   4 │ Big brown bat                  19.7
   5 │ Thick-tailed opposum           19.4
   6 │ Little brown bat               19.9
   7 │ Giant armadillo                18.1
   8 │ Arctic ground squirrel         16.6

You will soon see how useful the @chain macro is when we start to combine many functions.

Back To dplyr Verbs In Action

Now that you know about the @chain macro, we will use it throughout the rest of this tutorial.

Arrange Or Re-order Rows Using @orderby

To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:

julia> @orderby msleep :order83×11 DataFrame
 Row │ name                            genus         vore       order          ⋯
     │ String31…                       String15…     String7…?  String15…      ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Tenrec                          Tenrec        omni       Afrosoricida   ⋯
   2 │ Cow                             Bos           herbi      Artiodactyla
   3 │ Roe deer                        Capreolus     herbi      Artiodactyla
   4 │ Goat                            Capri         herbi      Artiodactyla
   5 │ Giraffe                         Giraffa       herbi      Artiodactyla   ⋯
   6 │ Sheep                           Ovis          herbi      Artiodactyla
   7 │ Pig                             Sus           omni       Artiodactyla
   8 │ Cheetah                         Acinonyx      carni      Carnivora
  ⋮  │               ⋮                      ⋮            ⋮           ⋮         ⋱
  77 │ Eastern american chipmunk       Tamias        herbi      Rodentia       ⋯
  78 │ Tree shrew                      Tupaia        omni       Scandentia
  79 │ Greater short-tailed shrew      Blarina       omni       Soricomorpha
  80 │ Star-nosed mole                 Condylura     omni       Soricomorpha
  81 │ Lesser short-tailed shrew       Cryptotis     omni       Soricomorpha   ⋯
  82 │ Eastern american mole           Scalopus      insecti    Soricomorpha
  83 │ Musk shrew                      Suncus        missing    Soricomorpha
                                                   7 columns and 68 rows omitted

Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by :sleep_total. Finally, keep the first 10 rows of the data frame.

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           first(10)
       end10×3 DataFrame
 Row │ name          order         sleep_total
     │ String31…     String15…     Float64
─────┼─────────────────────────────────────────
   1 │ Tenrec        Afrosoricida         15.6
   2 │ Giraffe       Artiodactyla          1.9
   3 │ Roe deer      Artiodactyla          3.0
   4 │ Sheep         Artiodactyla          3.8
   5 │ Cow           Artiodactyla          4.0
   6 │ Goat          Artiodactyla          5.3
   7 │ Pig           Artiodactyla          9.1
   8 │ Caspian seal  Carnivora             3.5
   9 │ Gray seal     Carnivora             6.2
  10 │ Genet         Carnivora             6.3

The last line of the above block, first(10), does not have @. This is because first is a Julia function, not a macro, whose names always begin with @.

Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           @rsubset :sleep_total > 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31…               String15…        Float64
─────┼──────────────────────────────────────────────────────
   1 │ Big brown bat           Chiroptera              19.7
   2 │ Little brown bat        Chiroptera              19.9
   3 │ Long-nosed armadillo    Cingulata               17.4
   4 │ Giant armadillo         Cingulata               18.1
   5 │ North American Opossum  Didelphimorphia         18.0
   6 │ Thick-tailed opposum    Didelphimorphia         19.4
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Something slightly more complicated: same as above, except arrange the rows in the :sleep_total column in a descending order. For this, use the function sortperm with the keyword argument rev=true.

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby begin
               :order
               sortperm(:sleep_total, rev=true)
           end
           @rsubset :sleep_total >= 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31…               String15…        Float64
─────┼──────────────────────────────────────────────────────
   1 │ Little brown bat        Chiroptera              19.9
   2 │ Big brown bat           Chiroptera              19.7
   3 │ Long-nosed armadillo    Cingulata               17.4
   4 │ Giant armadillo         Cingulata               18.1
   5 │ Thick-tailed opposum    Didelphimorphia         19.4
   6 │ North American Opossum  Didelphimorphia         18.0
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Create New Columns Using @transform and @rtransform

The @transform macro will add new columns to the data frame. Like with other macros, use @rtransform to operate row-wise. Create a new column called :rem_proportion, which is the ratio of rem sleep to total amount of sleep.

julia> @rtransform msleep :rem_proportion = :sleep_rem / :sleep_total83×12 DataFrame
 Row │ name                        genus         vore       order           co ⋯
     │ String31…                   String15…     String7…?  String15…       St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni      Carnivora       lc ⋯
   2 │ Owl monkey                  Aotus         omni       Primates        mi
   3 │ Mountain beaver             Aplodontia    herbi      Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni       Soricomorpha    lc
   5 │ Cow                         Bos           herbi      Artiodactyla    do ⋯
   6 │ Three-toed sloth            Bradypus      herbi      Pilosa          mi
   7 │ Northern fur seal           Callorhinus   carni      Carnivora       vu
   8 │ Vesper mouse                Calomys       missing    Rodentia        mi
  ⋮  │             ⋮                    ⋮            ⋮            ⋮            ⋱
  77 │ Brazilian tapir             Tapirus       herbi      Perissodactyla  vu ⋯
  78 │ Tenrec                      Tenrec        omni       Afrosoricida    mi
  79 │ Tree shrew                  Tupaia        omni       Scandentia      mi
  80 │ Bottle-nosed dolphin        Tursiops      carni      Cetacea         mi
  81 │ Genet                       Genetta       carni      Carnivora       mi ⋯
  82 │ Arctic fox                  Vulpes        carni      Carnivora       mi
  83 │ Red fox                     Vulpes        carni      Carnivora       mi
                                                   8 columns and 68 rows omitted

You can many new columns using @transform by placing multiple operations in a block.

julia> @rtransform msleep begin
           :rem_proportion = :sleep_rem / :sleep_total
           :bodywt_grams = :bodywt * 1000
       end83×13 DataFrame
 Row │ name                        genus         vore       order           co ⋯
     │ String31…                   String15…     String7…?  String15…       St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni      Carnivora       lc ⋯
   2 │ Owl monkey                  Aotus         omni       Primates        mi
   3 │ Mountain beaver             Aplodontia    herbi      Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni       Soricomorpha    lc
   5 │ Cow                         Bos           herbi      Artiodactyla    do ⋯
   6 │ Three-toed sloth            Bradypus      herbi      Pilosa          mi
   7 │ Northern fur seal           Callorhinus   carni      Carnivora       vu
   8 │ Vesper mouse                Calomys       missing    Rodentia        mi
  ⋮  │             ⋮                    ⋮            ⋮            ⋮            ⋱
  77 │ Brazilian tapir             Tapirus       herbi      Perissodactyla  vu ⋯
  78 │ Tenrec                      Tenrec        omni       Afrosoricida    mi
  79 │ Tree shrew                  Tupaia        omni       Scandentia      mi
  80 │ Bottle-nosed dolphin        Tursiops      carni      Cetacea         mi
  81 │ Genet                       Genetta       carni      Carnivora       mi ⋯
  82 │ Arctic fox                  Vulpes        carni      Carnivora       mi
  83 │ Red fox                     Vulpes        carni      Carnivora       mi
                                                   9 columns and 68 rows omitted

Using @transform instead of @rtransform will let us work with the column as a whole, and not a single row at a time. Let's create a new variable showing how far an animal's sleep time is from the average of all animals.

julia> @transform msleep :demeand_sleep = :sleep_total .- mean(:sleep_total)83×12 DataFrame
 Row │ name                        genus         vore       order           co ⋯
     │ String31…                   String15…     String7…?  String15…       St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni      Carnivora       lc ⋯
   2 │ Owl monkey                  Aotus         omni       Primates        mi
   3 │ Mountain beaver             Aplodontia    herbi      Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni       Soricomorpha    lc
   5 │ Cow                         Bos           herbi      Artiodactyla    do ⋯
   6 │ Three-toed sloth            Bradypus      herbi      Pilosa          mi
   7 │ Northern fur seal           Callorhinus   carni      Carnivora       vu
   8 │ Vesper mouse                Calomys       missing    Rodentia        mi
  ⋮  │             ⋮                    ⋮            ⋮            ⋮            ⋱
  77 │ Brazilian tapir             Tapirus       herbi      Perissodactyla  vu ⋯
  78 │ Tenrec                      Tenrec        omni       Afrosoricida    mi
  79 │ Tree shrew                  Tupaia        omni       Scandentia      mi
  80 │ Bottle-nosed dolphin        Tursiops      carni      Cetacea         mi
  81 │ Genet                       Genetta       carni      Carnivora       mi ⋯
  82 │ Arctic fox                  Vulpes        carni      Carnivora       mi
  83 │ Red fox                     Vulpes        carni      Carnivora       mi
                                                   8 columns and 68 rows omitted

Finally, note that you can create a new column with the name taken from an existing variable, or a new column name with spaces in it, with $

julia> newname = :rem_proportion:rem_proportion
julia> @rtransform msleep begin $newname = :sleep_rem / :sleep_total $"Body weight in grams" = :bodywt * 1000 end83×13 DataFrame Row │ name genus vore order co ⋯ │ String31… String15… String7…? String15… St ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates mi 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla do ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa mi 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia mi ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida mi 79 │ Tree shrew Tupaia omni Scandentia mi 80 │ Bottle-nosed dolphin Tursiops carni Cetacea mi 81 │ Genet Genetta carni Carnivora mi ⋯ 82 │ Arctic fox Vulpes carni Carnivora mi 83 │ Red fox Vulpes carni Carnivora mi 9 columns and 68 rows omitted

Create Summaries of the Data Frame using @combine

The @combine macro will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean function to the column :sleep_total and call the summary value :avg_sleep.

julia> @chain msleep @combine :avg_sleep = mean(:sleep_total)1×1 DataFrame
 Row │ avg_sleep
     │ Float64
─────┼───────────
   1 │   10.4337

There are many other summary statistics you could consider such std, minimum, maximum, median, sum, length (returns the length of vector), first (returns first value in vector), and last (returns last value in vector).

julia> @combine msleep begin
           :avg_sleep = mean(:sleep_total)
           :min_sleep = minimum(:sleep_total)
           :max_sleep = maximum(:sleep_total)
           :total = length(:sleep_total)
       end1×4 DataFrame
 Row │ avg_sleep  min_sleep  max_sleep  total
     │ Float64    Float64    Float64    Int64
─────┼────────────────────────────────────────
   1 │   10.4337        1.9       19.9     83

DataFrames.jl also provides the function describe which performs many of these summaries automatically.

julia> describe(msleep)11×7 DataFrame
 Row │ variable      mean      min               median    max                 ⋯
     │ Symbol        Union…    Any               Union…    Any                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ name                    African elephant            Western american ch ⋯
   2 │ genus                   Acinonyx                    Vulpes
   3 │ vore                    carni                       omni
   4 │ order                   Afrosoricida                Soricomorpha
   5 │ conservation            cd                          vu                  ⋯
   6 │ sleep_total   10.4337   1.9               10.1      19.9
   7 │ sleep_rem     1.87541   0.1               1.5       6.6
   8 │ sleep_cycle   0.439583  0.116667          0.333333  1.5
   9 │ awake         13.5675   4.1               13.9      22.1                ⋯
  10 │ brainwt       0.281581  0.00014           0.0124    5.712
  11 │ bodywt        166.136   0.005             1.67      6654.0
                                                               3 columns omitted

Group Operations using groupby and @combine

The groupby verb is an important function in DataFrames.jl (it does not live in DataFramesMeta.jl). As we mentioned before it's related to concept of "split-apply-combine". We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.

Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.

julia> @chain msleep begin
           groupby(:order)
           @combine begin
               :avg_sleep = mean(:sleep_total)
               :min_sleep = minimum(:sleep_total)
               :max_sleep = maximum(:sleep_total)
               :total = length(:sleep_total)
           end
       end19×5 DataFrame
 Row │ order            avg_sleep  min_sleep  max_sleep  total
     │ String15…        Float64    Float64    Float64    Int64
─────┼─────────────────────────────────────────────────────────
   1 │ Carnivora         10.1167         3.5       15.8     12
   2 │ Primates          10.5            8.0       17.0     12
   3 │ Rodentia          12.4682         7.0       16.6     22
   4 │ Soricomorpha      11.1            8.4       14.9      5
   5 │ Artiodactyla       4.51667        1.9        9.1      6
   6 │ Pilosa            14.4           14.4       14.4      1
   7 │ Cingulata         17.75          17.4       18.1      2
   8 │ Hyracoidea         5.66667        5.3        6.3      3
  ⋮  │        ⋮             ⋮          ⋮          ⋮        ⋮
  13 │ Erinaceomorpha    10.2           10.1       10.3      2
  14 │ Cetacea            4.5            2.7        5.6      3
  15 │ Lagomorpha         8.4            8.4        8.4      1
  16 │ Diprotodontia     12.4           11.1       13.7      2
  17 │ Monotremata        8.6            8.6        8.6      1
  18 │ Afrosoricida      15.6           15.6       15.6      1
  19 │ Scandentia         8.9            8.9        8.9      1
                                                 4 rows omitted

Split-apply-combine can also be used with @transform to add new variables to a data frame by performing operations by group. For instance, we can de-mean the total hours of sleep of an animal relative to other animals in the same genus.

julia> @chain msleep begin
           groupby(:order)
           @transform :sleep_genus = :sleep_total .- mean(:sleep_total)
       end83×12 DataFrame
 Row │ name                        genus         vore       order           co ⋯
     │ String31…                   String15…     String7…?  String15…       St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni      Carnivora       lc ⋯
   2 │ Owl monkey                  Aotus         omni       Primates        mi
   3 │ Mountain beaver             Aplodontia    herbi      Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni       Soricomorpha    lc
   5 │ Cow                         Bos           herbi      Artiodactyla    do ⋯
   6 │ Three-toed sloth            Bradypus      herbi      Pilosa          mi
   7 │ Northern fur seal           Callorhinus   carni      Carnivora       vu
   8 │ Vesper mouse                Calomys       missing    Rodentia        mi
  ⋮  │             ⋮                    ⋮            ⋮            ⋮            ⋱
  77 │ Brazilian tapir             Tapirus       herbi      Perissodactyla  vu ⋯
  78 │ Tenrec                      Tenrec        omni       Afrosoricida    mi
  79 │ Tree shrew                  Tupaia        omni       Scandentia      mi
  80 │ Bottle-nosed dolphin        Tursiops      carni      Cetacea         mi
  81 │ Genet                       Genetta       carni      Carnivora       mi ⋯
  82 │ Arctic fox                  Vulpes        carni      Carnivora       mi
  83 │ Red fox                     Vulpes        carni      Carnivora       mi
                                                   8 columns and 68 rows omitted

This short tutorial only touches on the wide array of features in Julia, DataFrames.jl, and DataFramesMeta.jl. Read the full documentation for more information.