Database-Style Joins

We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:

names = DataTable(ID = [1, 2], Name = ["John Doe", "Jane Doe"])
jobs = DataTable(ID = [1, 2], Job = ["Lawyer", "Doctor"])

We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the join function:

full = join(names, jobs, on = :ID)

Output:

RowIDNameJob
11"John Doe""Lawyer"
21"Jane Doe""Doctor"

In relational database theory, this operation is generally referred to as a join. The columns used to determine which rows should be combined during a join are called keys.

There are seven kinds of joins supported by the DataTables package:

You can control the kind of join that join performs using the kind keyword argument:

a = DataTable(ID = [1, 2], Name = ["A", "B"])
b = DataTable(ID = [1, 3], Job = ["Doctor", "Lawyer"])
join(a, b, on = :ID, kind = :inner)
join(a, b, on = :ID, kind = :left)
join(a, b, on = :ID, kind = :right)
join(a, b, on = :ID, kind = :outer)
join(a, b, on = :ID, kind = :semi)
join(a, b, on = :ID, kind = :anti)

Cross joins are the only kind of join that does not use a key:

join(a, b, kind = :cross)

In order to join data frames on keys which have different names, you must first rename them so that they match. This can be done using rename!:

a = DataTable(ID = [1, 2], Name = ["A", "B"])
b = DataTable(IDNew = [1, 2], Job = ["Doctor", "Lawyer"])
rename!(b, :IDNew, :ID)
join(a, b, on = :ID, kind = :inner)

Or renaming multiple columns at a time:

a = DataTable(City = ["Amsterdam", "London", "London", "New York", "New York"], 
              Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"], 
              Category = [1, 2, 3, 4, 5])
b = DataTable(Location = ["Amsterdam", "London", "London", "New York", "New York"], 
              Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"], 
              Name = ["a", "b", "c", "d", "e"])
rename!(b, [:Location => :City, :Work => :Job])
join(a, b, on = [:City, :Job])