Main Types

AbstractDataTable: an abstract type for which all concrete types expose a database-like interface.

Common methods

An AbstractDataTable is a two-dimensional table with Symbols for column names. An AbstractDataTable is also similar to an Associative type in that it allows indexing by a key (the columns).

The following are normally implemented for AbstractDataTables:

  • describe : summarize columns

  • dump : show structure

  • hcat : horizontal concatenation

  • vcat : vertical concatenation

  • names : column names

  • names! : set column names

  • rename! : rename columns based on keyword arguments

  • eltypes : eltype of each column

  • length : number of columns

  • size : (nrows, ncols)

  • head : first n rows

  • tail : last n rows

  • convert : convert to an array

  • NullableArray : convert to a NullableArray

  • completecases : boolean vector of complete cases (rows with no nulls)

  • dropnull : remove rows with null values

  • dropnull! : remove rows with null values in-place

  • nonunique : boolean vector flagging duplicate rows

  • unique! : remove duplicate rows

  • similar : a DataTable with columns similar to those of d

Indexing

Table columns are accessed (getindex) by a single index that can be a symbol identifier, an integer, or a vector of either. If a single column is selected, just the column object is returned. If multiple columns are selected, an AbstractDataTable is returned.

d[:colA]
d[3]
d[[:colA, :colB]]
d[[1:3; 5]]

Rows and columns can be indexed like a Matrix with the added feature of indexing columns by name.

d[1:3, :colA]
d[3,3]
d[3,:]
d[3,[:colA, :colB]]
d[:, [:colA, :colB]]
d[[1:3; 5], :]

Assignment (setindex!) works similarly.
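
As a minimal sketch of assignment, assuming the index forms shown above for reading are also accepted on the left-hand side (the table dt and its contents are illustrative, not taken from the package documentation):

```julia
using DataTables

# Base constructor, so the columns stay plain Vectors.
dt = DataTable(Any[collect(1:5), fill(0.5, 5)], [:A, :B])

dt[:B] = [1.0, 2.0, 3.0, 4.0, 5.0]  # replace a whole column by name
dt[1:2, :A] = [10, 20]              # assign to rows 1 and 2 of column A
dt[3, :A] = 30                      # assign a single cell
dt[4:5, :B] = 2.0 * dt[4:5, :B]     # read a row range, scale it, write it back
```

The last line follows the same read-then-assign pattern as the row-range assignment in the DataTable examples.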

DataTable: an AbstractDataTable that stores a set of named columns.

The columns are normally AbstractVectors stored in memory, particularly a Vector, NullableVector, or CategoricalVector.

Constructors

DataTable(columns::Vector{Any}, names::Vector{Symbol})
DataTable(kwargs...)
DataTable() # an empty DataTable
DataTable(t::Type, nrows::Integer, ncols::Integer) # an empty DataTable of arbitrary size
DataTable(column_eltypes::Vector, names::Vector, nrows::Integer)
DataTable(ds::Vector{Associative})

Arguments

  • columns : a Vector{Any} with each column as contents

  • names : the column names

  • kwargs : the key gives the column names, and the value is the column contents

  • t : elemental type of all columns

  • nrows, ncols : number of rows and columns

  • column_eltypes : elemental type of each column

  • ds : a vector of Associatives

Each column in columns should be the same length.

Notes

Most of the default constructors convert columns to NullableArray. The base constructor, DataTable(columns::Vector{Any}, names::Vector{Symbol}), does not convert to NullableArray.

A DataTable is a lightweight object. As long as columns are not manipulated, creation of a DataTable from existing AbstractVectors is inexpensive. For example, indexing on columns is inexpensive, but indexing by rows is expensive because copies are made of each column.

Because column types can vary, a DataTable is not type stable. In performance-critical code, avoid indexing into a DataTable inside loops; extract the column once, outside the loop, and operate on it directly.
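
One common way to work around the type instability is a function barrier: look the column up once, then pass it to a function that loops over a concretely typed vector. The sketch below assumes the base constructor is used so that columns remain plain Vectors; colsum is a hypothetical helper, not part of DataTables.

```julia
using DataTables

# Base constructor: no conversion to NullableArray.
dt = DataTable(Any[collect(1:10), rand(10)], [:A, :C])

# Hypothetical helper. Inside this function the element type of `col`
# is known, so the loop can be compiled for that specific type.
function colsum(col::AbstractVector)
    total = zero(eltype(col))
    for x in col
        total += x
    end
    return total
end

a = dt[:A]   # a single column lookup, done outside the loop
colsum(a)
```

The column lookup dt[:A] is still dynamically typed, but that cost is paid once rather than on every iteration.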

Examples

dt = DataTable()
v = ["x","y","z"][rand(1:3, 10)]
dt1 = DataTable(Any[collect(1:10), v, rand(10)], [:A, :B, :C])  # columns are Arrays
dt2 = DataTable(A = 1:10, B = v, C = rand(10))           # columns are NullableArrays
dump(dt1)
dump(dt2)
describe(dt2)
DataTables.head(dt1)
dt1[:A] + dt2[:C]
dt1[1:4, 1:2]
dt1[[:A,:C]]
dt1[1:2, [:A,:C]]
dt1[:, [:A,:C]]
dt1[:, [1,3]]
dt1[1:4, :]
dt1[1:4, :C]
dt1[1:4, :C] = 40. * dt1[1:4, :C]
[dt1; dt2]  # vcat
[dt1  dt2]  # hcat
size(dt1)

SubDataTable: a view of row subsets of an AbstractDataTable.

A SubDataTable is meant to be constructed with view. SubDataTables are used frequently in split/apply operations.

view(d::AbstractDataTable, rows)

Arguments

  • d : an AbstractDataTable

  • rows : any indexing type for rows, typically an Int, AbstractVector{Int}, AbstractVector{Bool}, or a Range

Notes

A SubDataTable is an AbstractDataTable, so most DataTable functions should work on it. Such methods include describe, dump, nrow, size, by, stack, and join. Indexing works just as for a DataTable; copies are returned.

To subset along columns, use standard column indexing, which creates a view of the columns by default. To subset along both rows and columns, select the columns by indexing first, then apply view to the result.

Examples

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
sdt1 = view(dt, 1:6)
sdt2 = view(dt, dt[:a] .> 1)
sdt3 = view(dt[[1,3]], dt[:a] .> 1)  # row and column subsetting
sdt4 = groupby(dt, :a)[1]  # indexing a GroupedDataTable returns a SubDataTable
sdt5 = view(sdt1, 1:3)
sdt1[:,[:a,:b]]