DataFramesMeta.jl Tutorial

This is a port of the HarvardX series PH525x Genomics class tutorial on dplyr. View the original here and the source here.

What is DataFramesMeta.jl?

DataFramesMeta.jl is a Julia package to transform and summarize tabular data. It uses Julia macros to create a domain-specific language for convenient syntax to work with data frames from DataFrames.jl. DataFramesMeta.jl mirrors concepts in DataFrames.jl as closely as possible, without implementing new features on it's own. For a deeper explanation of DataFramesMeta.jl, see the documentation.

DataFramesMeta.jl is heavily inspired by R's dplyr. If you are familiar with dplyr this guide should get you up to speed with DataFramesMeta.jl.

However this tutorial assumes a cursory knowledge of DataFrames.jl. For instance, you should be familiar with the concept of a symbol in Julia (:x), and how it is used to index a data frame in DataFrames.jl, such as with df[:, :x].

Why Is It Useful?

Like dplyr, the DataFramesMeta.jl package contains a set of macros (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, and adding new columns.

In addition, DataFramesMeta.jl makes it easier to summarize data with the @combine verb, which implements the split-apply-combine pattern commonly seen in dplyr and other data manipulation packages.

How Does It Compare To Using Base Functions in Julia and in DataFrames.jl?

If you are familiar with Julia, you are probably familiar with base Julia functions such map, and broadcast (akin to *apply in R). These functions are convenient to use, but are designed to work with arrays, not tabular data.

DataFrames.jl provides the functions select, transform, and more to work with data frames. Unlike map and broadcast, these functions are designed to work with tabular data, but have a complicated syntax.

DataFramesMeta.jl provides a convenient syntax for working with the vectors in a data frame so that you get all the convenience of Base Julia and DataFrames combined.

How Do I Get DataFramesMeta.jl?

To install DataFramesMeta.jl, which also installs DataFrames.jl:

import Pkg
Pkg.activate(; temp=true) # activate a temporary environment for this tutorial
Pkg.add("DataFramesMeta");

To load DataFramesMeta.jl, which also loads DataFrames.jl:

julia> using DataFramesMeta

For this tutorial, we will install some additional packages as well.

Pkg.add(["CSV", "HTTP"])

Now we load them. We also load the Statistics standard library, which is shipped with Julia, so does not need to be installed.

julia> using CSV, HTTP, Statistics

We will use CSV.jl and HTTP.jl for downloading our dataset from the internet.

Data: Mammals Sleep

The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and 11 variables.

We can load the data directly into a DataFrame from the url.

julia> url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv""https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
julia> msleep = CSV.read(HTTP.get(url).body, DataFrame; missingstring="NA")83×11 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 7 columns and 68 rows omitted

The columns (in order) correspond to the following:

column nameDescription
:namecommon name
:genustaxonomic rank
:vorecarnivore, omnivore or herbivore?
:ordertaxonomic rank
:conservationthe conservation status of the mammal
:sleep_totaltotal amount of sleep, in hours
:sleep_remrem sleep, in hours
:sleep_cyclelength of sleep cycle, in hours
:awakeamount of time spent awake, in hours
:brainwtbrain weight in kilograms
:bodywtbody weight in kilograms

Important DataFramesMeta.jl Verbs To Remember

Many DataFrames.jl macros come in two forms, a version which operates on columns as a whole and a version which operations row-wise, prefixed by r.

DataFramesMeta.jl macroBy-row versionDescriptiondplyr equivalent
@select@rselectselect columnsselect
@transform@rtransformcreate new columnsmutate
@subset@rsubsetfilter rowsfilter
@orderby@rorderbyre-order or arrange rowsarrange
@combinesummarise valuessummarize (but @combine is more flexible)
@groupbyallows for group operations in the "split-apply-combine" conceptgroup_by

DataFramesMeta.jl Verbs In Action

Two of the most basic functions are @select and @subset, which selects columns and filters rows respectively. To reference columns, use the Symbol of the column name, i.e. :name refers to the column msleep.name.

Selecting Columns Using @select

Select a set of columns: the :name and the :sleep_total columns.

julia> @select msleep :name :sleep_total83×2 DataFrame
 Row │ name                        sleep_total
     │ String31                    Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

If you have a column name stored as a variable, you can select it as a column with the syntax $.

julia> varname = :sleep_total:sleep_total
julia> @select msleep :name $varname83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

The $ sign has special meaning in DataFramesMeta.jl. We use it for any column reference which is not a symbol. Without it, DataFramesMeta.jl can't tell whether varname refers to the column :sleep_total.

You can also use $ to refer to columns with strings

julia> varname = "sleep_total""sleep_total"
julia> @select msleep :name $varname83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

as well as vectors of variable names

julia> varnames = ["name", "sleep_total"]2-element Vector{String}:
 "name"
 "sleep_total"
julia> @select msleep $varnames83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

Similarly, to select the first column, use the syntax $1.

julia> @select msleep $183×1 DataFrame
 Row │ name
     │ String31
─────┼────────────────────────────
   1 │ Cheetah
   2 │ Owl monkey
   3 │ Mountain beaver
   4 │ Greater short-tailed shrew
   5 │ Cow
   6 │ Three-toed sloth
   7 │ Northern fur seal
   8 │ Vesper mouse
  ⋮  │             ⋮
  77 │ Brazilian tapir
  78 │ Tenrec
  79 │ Tree shrew
  80 │ Bottle-nosed dolphin
  81 │ Genet
  82 │ Arctic fox
  83 │ Red fox
                   68 rows omitted

To select all the columns except a specific column, use the Not function for inverse selection.

julia> @select msleep Not(:name)83×10 DataFrame
 Row │ genus         vore     order           conservation  sleep_total  sleep ⋯
     │ String15      String7  String15        String15      Float64      Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Acinonyx      carni    Carnivora       lc                   12.1  missi ⋯
   2 │ Aotus         omni     Primates        missing              17.0
   3 │ Aplodontia    herbi    Rodentia        nt                   14.4
   4 │ Blarina       omni     Soricomorpha    lc                   14.9
   5 │ Bos           herbi    Artiodactyla    domesticated          4.0        ⋯
   6 │ Bradypus      herbi    Pilosa          missing              14.4
   7 │ Callorhinus   carni    Carnivora       vu                    8.7
   8 │ Calomys       missing  Rodentia        missing               7.0  missi
  ⋮  │      ⋮           ⋮           ⋮              ⋮             ⋮           ⋮ ⋱
  77 │ Tapirus       herbi    Perissodactyla  vu                    4.4        ⋯
  78 │ Tenrec        omni     Afrosoricida    missing              15.6
  79 │ Tupaia        omni     Scandentia      missing               8.9
  80 │ Tursiops      carni    Cetacea         missing               5.2  missi
  81 │ Genetta       carni    Carnivora       missing               6.3        ⋯
  82 │ Vulpes        carni    Carnivora       missing              12.5  missi
  83 │ Vulpes        carni    Carnivora       missing               9.8
                                                   5 columns and 68 rows omitted

To select a range of columns by name, use the Between operator:

julia> @select msleep Between(:name, :order)83×4 DataFrame
 Row │ name                        genus         vore     order
     │ String31                    String15      String7  String15
─────┼───────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni    Carnivora
   2 │ Owl monkey                  Aotus         omni     Primates
   3 │ Mountain beaver             Aplodontia    herbi    Rodentia
   4 │ Greater short-tailed shrew  Blarina       omni     Soricomorpha
   5 │ Cow                         Bos           herbi    Artiodactyla
   6 │ Three-toed sloth            Bradypus      herbi    Pilosa
   7 │ Northern fur seal           Callorhinus   carni    Carnivora
   8 │ Vesper mouse                Calomys       missing  Rodentia
  ⋮  │             ⋮                    ⋮           ⋮           ⋮
  77 │ Brazilian tapir             Tapirus       herbi    Perissodactyla
  78 │ Tenrec                      Tenrec        omni     Afrosoricida
  79 │ Tree shrew                  Tupaia        omni     Scandentia
  80 │ Bottle-nosed dolphin        Tursiops      carni    Cetacea
  81 │ Genet                       Genetta       carni    Carnivora
  82 │ Arctic fox                  Vulpes        carni    Carnivora
  83 │ Red fox                     Vulpes        carni    Carnivora
                                                          68 rows omitted

To select all columns that start with the character string "sl" use regular expressions in conjunction with Cols.

julia> @select msleep Cols(r"^sl")83×3 DataFrame
 Row │ sleep_total  sleep_rem  sleep_cycle
     │ Float64      Float64?   Float64?
─────┼────────────────────────────────────────
   1 │        12.1  missing    missing
   2 │        17.0        1.8  missing
   3 │        14.4        2.4  missing
   4 │        14.9        2.3        0.133333
   5 │         4.0        0.7        0.666667
   6 │        14.4        2.2        0.766667
   7 │         8.7        1.4        0.383333
   8 │         7.0  missing    missing
  ⋮  │      ⋮           ⋮            ⋮
  77 │         4.4        1.0        0.9
  78 │        15.6        2.3  missing
  79 │         8.9        2.6        0.233333
  80 │         5.2  missing    missing
  81 │         6.3        1.3  missing
  82 │        12.5  missing    missing
  83 │         9.8        2.4        0.35
                               68 rows omitted

Regular expressions are powerful, but can be difficult for new users to understand. Here are some quick tips.

  1. r"^abc" = Starts with "abc"
  2. r"abc$" = Ends with "abc"
  3. r"abc" = Contains "abc" anywhere.

Selecting Rows Using @subset and @rsubset

Filter the rows for mammals that sleep a total of more than 16 hours.

julia> @subset msleep :sleep_total .>= 168×11 DataFrame
 Row │ name                    genus         vore     order            conserv ⋯
     │ String31                String15      String7  String15         String1 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni     Primates         missing ⋯
   2 │ Long-nosed armadillo    Dasypus       carni    Cingulata        lc
   3 │ North American Opossum  Didelphis     omni     Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti  Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni    Didelphimorphia  lc      ⋯
   6 │ Little brown bat        Myotis        insecti  Chiroptera       missing
   7 │ Giant armadillo         Priodontes    insecti  Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi    Rodentia         lc
                                                               7 columns omitted

In the above expression, the .>= means we "broadcast" the >= comparison across the whole column. We can use a simpler syntax, @rsubset which automatically broadcasts all operations.

julia> @rsubset msleep :sleep_total > 168×11 DataFrame
 Row │ name                    genus         vore     order            conserv ⋯
     │ String31                String15      String7  String15         String1 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni     Primates         missing ⋯
   2 │ Long-nosed armadillo    Dasypus       carni    Cingulata        lc
   3 │ North American Opossum  Didelphis     omni     Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti  Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni    Didelphimorphia  lc      ⋯
   6 │ Little brown bat        Myotis        insecti  Chiroptera       missing
   7 │ Giant armadillo         Priodontes    insecti  Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi    Rodentia         lc
                                                               7 columns omitted

Subset the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram. For this we put multiple operations on separate lines in a single block.

julia> @rsubset msleep begin
           :sleep_total >= 16
           :bodywt >= 1
       end3×11 DataFrame
 Row │ name                    genus       vore     order            conservat ⋯
     │ String31                String15    String7  String15         String15  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni    Cingulata        lc        ⋯
   2 │ North American Opossum  Didelphis   omni     Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti  Cingulata        en
                                                               7 columns omitted

If you are coming from dplyr, you can also write the above command in a way that looks more familiar.

julia> @rsubset(msleep, :sleep_total >= 16, :bodywt >= 1)3×11 DataFrame
 Row │ name                    genus       vore     order            conservat ⋯
     │ String31                String15    String7  String15         String15  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni    Cingulata        lc        ⋯
   2 │ North American Opossum  Didelphis   omni     Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti  Cingulata        en
                                                               7 columns omitted

Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. We wrap code in a let block to ensure things are fast.

julia> let
           relevant_orders = Set(["Perissodactyla", "Primates"])
           @rsubset msleep :order in relevant_orders
       end15×11 DataFrame
 Row │ name             genus          vore     order           conservation   ⋯
     │ String31         String15       String7  String15        String15       ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey       Aotus          omni     Primates        missing        ⋯
   2 │ Grivet           Cercopithecus  omni     Primates        lc
   3 │ Horse            Equus          herbi    Perissodactyla  domesticated
   4 │ Donkey           Equus          herbi    Perissodactyla  domesticated
   5 │ Patas monkey     Erythrocebus   omni     Primates        lc             ⋯
   6 │ Galago           Galago         omni     Primates        missing
   7 │ Human            Homo           omni     Primates        missing
   8 │ Mongoose lemur   Lemur          herbi    Primates        vu
   9 │ Macaque          Macaca         omni     Primates        missing        ⋯
  10 │ Slow loris       Nyctibeus      carni    Primates        missing
  11 │ Chimpanzee       Pan            omni     Primates        missing
  12 │ Baboon           Papio          omni     Primates        missing
  13 │ Potto            Perodicticus   omni     Primates        lc             ⋯
  14 │ Squirrel monkey  Saimiri        omni     Primates        missing
  15 │ Brazilian tapir  Tapirus        herbi    Perissodactyla  vu
                                                               6 columns omitted

You can use the boolean operators (e.g. >, <, >=, <=, !=, in) to create the logical tests.

Chain.jl

Before we go any further, let's introduce the @chain macro from Chain.jl, which is re-exported by DataFramesMeta.jl. @chain allows for you to pipe the output from one operation into the input of another operation. The idea of piping is to read the functions from left to right. The syntax and design of @chain is very similar to %>% which users of dplyr are familiar with.

To show it's usefulness, let's use @select and @rsubset one after the other.

msleep_1 = @select msleep :name :sleep_total
msleep_2 = @rsubset msleep_1 :sleep_total > 16

Now in this case, we will pipe the msleep data frame to the function that will select two columns (:name and :sleep_total) and then pipe the new data frame to the @rsubset operation. This method involves awkwardly creating and naming temporary data frames. We can avoid this with @chain.

julia> @chain msleep begin
           @select :name :sleep_total
           @rsubset :sleep_total > 16
       end8×2 DataFrame
 Row │ name                    sleep_total
     │ String31                Float64
─────┼─────────────────────────────────────
   1 │ Owl monkey                     17.0
   2 │ Long-nosed armadillo           17.4
   3 │ North American Opossum         18.0
   4 │ Big brown bat                  19.7
   5 │ Thick-tailed opposum           19.4
   6 │ Little brown bat               19.9
   7 │ Giant armadillo                18.1
   8 │ Arctic ground squirrel         16.6

You will soon see how useful the @chain macro is when we start to combine many functions.

Back To dplyr Verbs In Action

Now that you know about the @chain macro, we will use it throughout the rest of this tutorial.

Arrange Or Re-order Rows Using @orderby

To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:

julia> @orderby msleep :order83×11 DataFrame
 Row │ name                            genus         vore     order         co ⋯
     │ String31                        String15      String7  String15      St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Tenrec                          Tenrec        omni     Afrosoricida  mi ⋯
   2 │ Cow                             Bos           herbi    Artiodactyla  do
   3 │ Roe deer                        Capreolus     herbi    Artiodactyla  lc
   4 │ Goat                            Capri         herbi    Artiodactyla  lc
   5 │ Giraffe                         Giraffa       herbi    Artiodactyla  cd ⋯
   6 │ Sheep                           Ovis          herbi    Artiodactyla  do
   7 │ Pig                             Sus           omni     Artiodactyla  do
   8 │ Cheetah                         Acinonyx      carni    Carnivora     lc
  ⋮  │               ⋮                      ⋮           ⋮          ⋮           ⋱
  77 │ Eastern american chipmunk       Tamias        herbi    Rodentia      mi ⋯
  78 │ Tree shrew                      Tupaia        omni     Scandentia    mi
  79 │ Greater short-tailed shrew      Blarina       omni     Soricomorpha  lc
  80 │ Star-nosed mole                 Condylura     omni     Soricomorpha  lc
  81 │ Lesser short-tailed shrew       Cryptotis     omni     Soricomorpha  lc ⋯
  82 │ Eastern american mole           Scalopus      insecti  Soricomorpha  lc
  83 │ Musk shrew                      Suncus        missing  Soricomorpha  mi
                                                   7 columns and 68 rows omitted

Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by :sleep_total. Finally, keep the first 10 rows of the data frame.

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           first(10)
       end10×3 DataFrame
 Row │ name          order         sleep_total
     │ String31      String15      Float64
─────┼─────────────────────────────────────────
   1 │ Tenrec        Afrosoricida         15.6
   2 │ Giraffe       Artiodactyla          1.9
   3 │ Roe deer      Artiodactyla          3.0
   4 │ Sheep         Artiodactyla          3.8
   5 │ Cow           Artiodactyla          4.0
   6 │ Goat          Artiodactyla          5.3
   7 │ Pig           Artiodactyla          9.1
   8 │ Caspian seal  Carnivora             3.5
   9 │ Gray seal     Carnivora             6.2
  10 │ Genet         Carnivora             6.3

The last line of the above block, first(10), does not have @. This is because first is a Julia function, not a macro, whose names always begin with @.

Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           @rsubset :sleep_total > 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31                String15         Float64
─────┼──────────────────────────────────────────────────────
   1 │ Big brown bat           Chiroptera              19.7
   2 │ Little brown bat        Chiroptera              19.9
   3 │ Long-nosed armadillo    Cingulata               17.4
   4 │ Giant armadillo         Cingulata               18.1
   5 │ North American Opossum  Didelphimorphia         18.0
   6 │ Thick-tailed opposum    Didelphimorphia         19.4
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Something slightly more complicated: same as above, except arrange the rows in the :sleep_total column in a descending order. Since this column is numeric we can do it by negating it:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby begin
               :order
               -:sleep_total
           end
           @rsubset :sleep_total >= 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31                String15         Float64
─────┼──────────────────────────────────────────────────────
   1 │ Little brown bat        Chiroptera              19.9
   2 │ Big brown bat           Chiroptera              19.7
   3 │ Giant armadillo         Cingulata               18.1
   4 │ Long-nosed armadillo    Cingulata               17.4
   5 │ Thick-tailed opposum    Didelphimorphia         19.4
   6 │ North American Opossum  Didelphimorphia         18.0
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Alternatively, using the StatsBase.jl package you could have written ordinalrank(:sleep_total, rev=true). Finally, if you wanted to avoid using an additional package you could use the following expression invperm(sortperm(:sleep_total, rev=true)).

Here is a minimal example of sorting string column in reverse:

julia> using StatsBase
julia> df = DataFrame(group=[1, 2, 1, 2, 1], name = ["Bob", "Dexter", "Alice", "Eve", "Cedric"])5×2 DataFrame Row │ group name │ Int64 String ─────┼─────────────── 1 │ 1 Bob 2 │ 2 Dexter 3 │ 1 Alice 4 │ 2 Eve 5 │ 1 Cedric
julia> @orderby df begin :group ordinalrank(:name, rev=true) end5×2 DataFrame Row │ group name │ Int64 String ─────┼─────────────── 1 │ 1 Cedric 2 │ 1 Bob 3 │ 1 Alice 4 │ 2 Eve 5 │ 2 Dexter

Create New Columns Using @transform and @rtransform

The @transform macro will add new columns to the data frame. Like with other macros, use @rtransform to operate row-wise. Create a new column called :rem_proportion, which is the ratio of rem sleep to total amount of sleep.

julia> @rtransform msleep :rem_proportion = :sleep_rem / :sleep_total83×12 DataFrame
 Row │ name                        genus         vore     order           cons ⋯
     │ String31                    String15      String7  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni    Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus         omni     Primates        miss
   3 │ Mountain beaver             Aplodontia    herbi    Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni     Soricomorpha    lc
   5 │ Cow                         Bos           herbi    Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus      herbi    Pilosa          miss
   7 │ Northern fur seal           Callorhinus   carni    Carnivora       vu
   8 │ Vesper mouse                Calomys       missing  Rodentia        miss
  ⋮  │             ⋮                    ⋮           ⋮           ⋮              ⋱
  77 │ Brazilian tapir             Tapirus       herbi    Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec        omni     Afrosoricida    miss
  79 │ Tree shrew                  Tupaia        omni     Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops      carni    Cetacea         miss
  81 │ Genet                       Genetta       carni    Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes        carni    Carnivora       miss
  83 │ Red fox                     Vulpes        carni    Carnivora       miss
                                                   8 columns and 68 rows omitted

You can add many columns at a time by placing multiple operations in a block.

julia> @rtransform msleep begin
           :rem_proportion = :sleep_rem / :sleep_total
           :bodywt_grams = :bodywt * 1000
       end83×13 DataFrame
 Row │ name                        genus         vore     order           cons ⋯
     │ String31                    String15      String7  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni    Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus         omni     Primates        miss
   3 │ Mountain beaver             Aplodontia    herbi    Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni     Soricomorpha    lc
   5 │ Cow                         Bos           herbi    Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus      herbi    Pilosa          miss
   7 │ Northern fur seal           Callorhinus   carni    Carnivora       vu
   8 │ Vesper mouse                Calomys       missing  Rodentia        miss
  ⋮  │             ⋮                    ⋮           ⋮           ⋮              ⋱
  77 │ Brazilian tapir             Tapirus       herbi    Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec        omni     Afrosoricida    miss
  79 │ Tree shrew                  Tupaia        omni     Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops      carni    Cetacea         miss
  81 │ Genet                       Genetta       carni    Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes        carni    Carnivora       miss
  83 │ Red fox                     Vulpes        carni    Carnivora       miss
                                                   9 columns and 68 rows omitted

Using @transform instead of @rtransform will let us work with the column as a whole, and not a single row at a time. Let's create a new variable showing how far an animal's sleep time is from the average of all animals.

julia> @transform msleep :demeand_sleep = :sleep_total .- mean(:sleep_total)83×12 DataFrame
 Row │ name                        genus         vore     order           cons ⋯
     │ String31                    String15      String7  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni    Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus         omni     Primates        miss
   3 │ Mountain beaver             Aplodontia    herbi    Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni     Soricomorpha    lc
   5 │ Cow                         Bos           herbi    Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus      herbi    Pilosa          miss
   7 │ Northern fur seal           Callorhinus   carni    Carnivora       vu
   8 │ Vesper mouse                Calomys       missing  Rodentia        miss
  ⋮  │             ⋮                    ⋮           ⋮           ⋮              ⋱
  77 │ Brazilian tapir             Tapirus       herbi    Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec        omni     Afrosoricida    miss
  79 │ Tree shrew                  Tupaia        omni     Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops      carni    Cetacea         miss
  81 │ Genet                       Genetta       carni    Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes        carni    Carnivora       miss
  83 │ Red fox                     Vulpes        carni    Carnivora       miss
                                                   8 columns and 68 rows omitted

Finally, note that you can create a new column with the name taken from an existing variable, or a new column name with spaces in it, with $

julia> newname = :rem_proportion:rem_proportion
julia> @rtransform msleep begin $newname = :sleep_rem / :sleep_total $"Body weight in grams" = :bodywt * 1000 end83×13 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 9 columns and 68 rows omitted

Create Summaries of the Data Frame using @combine

The @combine macro will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean function to the column :sleep_total and call the summary value :avg_sleep.

julia> @chain msleep @combine :avg_sleep = mean(:sleep_total)1×1 DataFrame
 Row │ avg_sleep
     │ Float64
─────┼───────────
   1 │   10.4337

There are many other summary statistics you could consider such std, minimum, maximum, median, sum, length (returns the length of vector), first (returns first value in vector), and last (returns last value in vector).

julia> @combine msleep begin
           :avg_sleep = mean(:sleep_total)
           :min_sleep = minimum(:sleep_total)
           :max_sleep = maximum(:sleep_total)
           :total = length(:sleep_total)
       end1×4 DataFrame
 Row │ avg_sleep  min_sleep  max_sleep  total
     │ Float64    Float64    Float64    Int64
─────┼────────────────────────────────────────
   1 │   10.4337        1.9       19.9     83

DataFrames.jl also provides the function describe which performs many of these summaries automatically.

julia> describe(msleep)11×7 DataFrame
 Row │ variable      mean      min               median    max                 ⋯
     │ Symbol        Union…    Any               Union…    Any                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ name                    African elephant            Western american ch ⋯
   2 │ genus                   Acinonyx                    Vulpes
   3 │ vore                    carni                       omni
   4 │ order                   Afrosoricida                Soricomorpha
   5 │ conservation            cd                          vu                  ⋯
   6 │ sleep_total   10.4337   1.9               10.1      19.9
   7 │ sleep_rem     1.87541   0.1               1.5       6.6
   8 │ sleep_cycle   0.439583  0.116667          0.333333  1.5
   9 │ awake         13.5675   4.1               13.9      22.1                ⋯
  10 │ brainwt       0.281581  0.00014           0.0124    5.712
  11 │ bodywt        166.136   0.005             1.67      6654.0
                                                               3 columns omitted

Group Operations using @groupby and @combine

The @groupby verb is the first step in the "split-apply-combine" workflow. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.

Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.

julia> @chain msleep begin
           @groupby :order
           @combine begin
               :avg_sleep = mean(:sleep_total)
               :min_sleep = minimum(:sleep_total)
               :max_sleep = maximum(:sleep_total)
               :total = length(:sleep_total)
           end
       end19×5 DataFrame
 Row │ order            avg_sleep  min_sleep  max_sleep  total
     │ String15         Float64    Float64    Float64    Int64
─────┼─────────────────────────────────────────────────────────
   1 │ Carnivora         10.1167         3.5       15.8     12
   2 │ Primates          10.5            8.0       17.0     12
   3 │ Rodentia          12.4682         7.0       16.6     22
   4 │ Soricomorpha      11.1            8.4       14.9      5
   5 │ Artiodactyla       4.51667        1.9        9.1      6
   6 │ Pilosa            14.4           14.4       14.4      1
   7 │ Cingulata         17.75          17.4       18.1      2
   8 │ Hyracoidea         5.66667        5.3        6.3      3
  ⋮  │        ⋮             ⋮          ⋮          ⋮        ⋮
  13 │ Erinaceomorpha    10.2           10.1       10.3      2
  14 │ Cetacea            4.5            2.7        5.6      3
  15 │ Lagomorpha         8.4            8.4        8.4      1
  16 │ Diprotodontia     12.4           11.1       13.7      2
  17 │ Monotremata        8.6            8.6        8.6      1
  18 │ Afrosoricida      15.6           15.6       15.6      1
  19 │ Scandentia         8.9            8.9        8.9      1
                                                 4 rows omitted

Split-apply-combine can also be used with @transform to add new variables to a data frame by performing operations by group. For instance, we can de-mean the total hours of sleep of an animal relative to other animals in the same genus.

julia> @chain msleep begin
           @groupby :order
           @transform :sleep_genus = :sleep_total .- mean(:sleep_total)
       end83×12 DataFrame
 Row │ name                        genus         vore     order           cons ⋯
     │ String31                    String15      String7  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni    Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus         omni     Primates        miss
   3 │ Mountain beaver             Aplodontia    herbi    Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni     Soricomorpha    lc
   5 │ Cow                         Bos           herbi    Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus      herbi    Pilosa          miss
   7 │ Northern fur seal           Callorhinus   carni    Carnivora       vu
   8 │ Vesper mouse                Calomys       missing  Rodentia        miss
  ⋮  │             ⋮                    ⋮           ⋮           ⋮              ⋱
  77 │ Brazilian tapir             Tapirus       herbi    Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec        omni     Afrosoricida    miss
  79 │ Tree shrew                  Tupaia        omni     Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops      carni    Cetacea         miss
  81 │ Genet                       Genetta       carni    Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes        carni    Carnivora       miss
  83 │ Red fox                     Vulpes        carni    Carnivora       miss
                                                   8 columns and 68 rows omitted

This short tutorial only touches on the wide array of features in Julia, DataFrames.jl, and DataFramesMeta.jl. Read the full documentation for more information.