Functions

Grouping, Joining, and Split-Apply-Combine

Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame

``````aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)``````

Arguments

• `d` : an AbstractDataFrame
• `gd` : a GroupedDataFrame
• `cols` : a column indicator (Symbol, Int, Vector{Symbol}, etc.)
• `fs` : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector

Each function in `fs` should return a value or a vector, and all returned values must have the same length.

Returns

• `::DataFrame`

Examples

``````using Statistics
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
aggregate(df, :a, sum)
aggregate(df, :a, [sum, x->mean(skipmissing(x))])
aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])``````
``````by(d::AbstractDataFrame, keys, cols => f...; sort::Bool = false)
by(d::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false)
by(d::AbstractDataFrame, keys, f; sort::Bool = false)
by(f, d::AbstractDataFrame, keys; sort::Bool = false)``````

Split-apply-combine in one step: apply `f` to each grouping in `d` based on grouping columns `keys`, and return a `DataFrame`.

`keys` can be either a single column index, or a vector thereof.

If the last argument(s) consist(s) of one or more `cols => f` pair(s), or if `colname = cols => f` keyword arguments are provided, `cols` must be a column name or index, or a vector or tuple thereof, and `f` must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If `cols` is a single column index, `f` is called with a `SubArray` view into that column for each group; otherwise, `f` is called with a named tuple holding `SubArray` views into these columns.

If the last argument is a callable `f`, it is passed a `SubDataFrame` view for each group, and the returned `DataFrame` then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with `f` as the first argument, so do-block notation can be used.

`f` can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

• A single value gives a data frame with a single column and one row per group.
• A named tuple of single values or a `DataFrameRow` gives a data frame with one column for each field and one row per group.
• A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
• A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, each of which produces a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called `x1`, `x2` and so on. The resulting data frame will be sorted on `keys` if `sort=true`. Otherwise, ordering of rows is undefined.

Note that `f` must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (`sum`, `prod`, `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`) are specified using the pair syntax (e.g. `col => sum`). When computing the `sum` or `mean` over floating point columns, results will be less accurate than the standard `sum` function (which uses pairwise summation). Use `col => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.
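The accuracy difference mentioned above can be illustrated with plain Julia. This is only a sketch: it assumes, for illustration, that a fast method accumulates values in a simple left-to-right loop, which the text above does not confirm. `foldl(+, x)` stands in for such a loop, while Base `sum` uses pairwise summation.

```julia
# Compare left-to-right accumulation with Base's pairwise `sum` on Float32 data.
# ASSUMPTION: `foldl(+, x)` is only a stand-in for a simple running total;
# it is not the optimized method used by DataFrames.
x = fill(0.1f0, 10^6)

naive    = foldl(+, x)   # accumulates rounding error as the total grows
pairwise = sum(x)        # Base `sum` uses pairwise summation

# Reference value computed in Float64 for comparison.
reference = sum(Float64.(x))

println("naive error:    ", abs(naive - reference))
println("pairwise error: ", abs(pairwise - reference))
```

The rounding error of the running total grows with the number of elements, which is why the pairwise result stays closer to the reference.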

`by(d, cols, f)` is equivalent to `combine(f, groupby(d, cols))` and to the less efficient `combine(map(f, groupby(d, cols)))`.
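The three stages of split-apply-combine can be sketched in plain Julia without DataFrames. This is a conceptual illustration only, not the library implementation. The variable names are hypothetical and the data mirror columns `a` and `c` of the examples below.

```julia
# Plain-Julia sketch of split-apply-combine (not the DataFrames implementation).
a = repeat([1, 2, 3, 4], outer=2)   # grouping column
c = collect(1:8)                    # value column

# Split: collect the row indices that belong to each key.
groups = Dict{Int, Vector{Int}}()
for (row, key) in enumerate(a)
    push!(get!(groups, key, Int[]), row)
end

# Apply and combine: call `sum` on a view of each group, one result row per key.
# Keys are sorted here only to make the output order deterministic.
result = [(a = key, c_sum = sum(view(c, groups[key])))
          for key in sort(collect(keys(groups)))]
```

Each element of `result` plays the role of one output row: the grouping value plus the value returned by the applied function.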

Examples

``````julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);

julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a) do d # do syntax for the slower variant
sum(d.c)
end
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a     │ c_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 2          │
│ 2   │ 1     │ 10         │
│ 3   │ 2     │ 4          │
│ 4   │ 2     │ 12         │
│ 5   │ 3     │ 6          │
│ 6   │ 3     │ 14         │
│ 7   │ 4     │ 8          │
│ 8   │ 4     │ 16         │

julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a     │ c_sum │ c_sum2 │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 6     │ 26     │
│ 2   │ 2     │ 8     │ 40     │
│ 3   │ 3     │ 10    │ 58     │
│ 4   │ 4     │ 12    │ 80     │

julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a     │ minb  │ sumc  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 6     │
│ 2   │ 2     │ 1     │ 8     │
│ 3   │ 3     │ 2     │ 10    │
│ 4   │ 4     │ 1     │ 12    │``````
Apply a function to each column in an AbstractDataFrame or GroupedDataFrame

``colwise(f, d)``

Arguments

• `f` : a function or vector of functions
• `d` : an AbstractDataFrame or GroupedDataFrame

Returns

• various, depending on the call

Examples

``````df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
colwise(sum, df)
colwise([sum, length], df)
colwise((minimum, maximum), df)
colwise(sum, groupby(df, :a))``````
``````combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)``````

Transform a `GroupedDataFrame` into a `DataFrame`.

If the last argument(s) consist(s) of one or more `cols => f` pair(s), or if `colname = cols => f` keyword arguments are provided, `cols` must be a column name or index, or a vector or tuple thereof, and `f` must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If `cols` is a single column index, `f` is called with a `SubArray` view into that column for each group; otherwise, `f` is called with a named tuple holding `SubArray` views into these columns.

If the last argument is a callable `f`, it is passed a `SubDataFrame` view for each group, and the returned `DataFrame` then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with `f` as the first argument, so do-block notation can be used.

`f` can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

• A single value gives a data frame with a single column and one row per group.
• A named tuple of single values or a `DataFrameRow` gives a data frame with one column for each field and one row per group.
• A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
• A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called `x1`, `x2` and so on. The resulting data frame will be sorted if `sort=true` was passed to the `groupby` call from which `gd` was constructed. Otherwise, ordering of rows is undefined.

Note that `f` must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (`sum`, `prod`, `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`) are specified using the pair syntax (e.g. `col => sum`). When computing the `sum` or `mean` over floating point columns, results will be less accurate than the standard `sum` function (which uses pairwise summation). Use `col => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

Examples

``````julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);

julia> gd = groupby(df, :a);

julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │``````

See `by` for more examples.

`by(f, df, cols)` is a shorthand for `combine(f, groupby(df, cols))`.

`map`: `combine(f, groupby(df, cols))` is a more efficient equivalent of `combine(map(f, groupby(df, cols)))`.

A view of an AbstractDataFrame split into row groups

``````groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)``````

Arguments

• `d` : an AbstractDataFrame to split (optional, see Returns)
• `cols` : data table columns to group by
• `sort`: whether to sort rows according to the values of the grouping columns `cols`
• `skipmissing`: whether to skip rows with `missing` values in one of the grouping columns `cols`

Returns

A `GroupedDataFrame` : a grouped view into `d`

Details

An iterator over a `GroupedDataFrame` returns a `SubDataFrame` view for each grouping into `d`. A `GroupedDataFrame` also supports indexing by groups, `map` (which applies a function to each group) and `combine` (which applies a function to each group and combines the result into a data frame).

See the following for additional split-apply-combine operations:

• `by` : split-apply-combine using functions
• `aggregate` : split-apply-combine; applies functions in the form of a cross product
• `colwise` : apply a function to each column in an `AbstractDataFrame` or `GroupedDataFrame`
• `map` : apply a function to each group of a `GroupedDataFrame` (without combining)
• `combine` : combine a `GroupedDataFrame`, optionally applying a function to each group

Examples

``````df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
gd = groupby(df, :a)
gd[1]
last(gd)
vcat([g.b for g in gd]...)
for g in gd
println(g)
end``````
``groupindices(gd::GroupedDataFrame)``

Return a vector of group indices for each row of `parent(gd)`.

Rows appearing in group `gd[i]` are attributed index `i`. Rows not present in any group are attributed `missing` (this can happen if `skipmissing=true` was passed when creating `gd`, or if `gd` is a subset from a larger `GroupedDataFrame`).
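The rule above can be sketched in plain Julia. The `groups` vector is hypothetical: it lists the parent row numbers of each group, with row 2 deliberately left out of every group (as could happen with `skipmissing=true`).

```julia
# Sketch of the group-index rule (not the DataFrames implementation).
# ASSUMPTION: `groups[i]` holds the parent row numbers of group `i`.
nrows  = 5
groups = [[1, 4], [3, 5]]   # row 2 belongs to no group

# Start with `missing` everywhere, then stamp each group's index on its rows.
idx = Vector{Union{Int, Missing}}(missing, nrows)
for (i, rows) in enumerate(groups)
    idx[rows] .= i
end
```

Rows covered by a group carry that group's number, and the uncovered row keeps `missing`.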

``groupvars(gd::GroupedDataFrame)``

Return a vector of column names in `parent(gd)` used for grouping.

``````join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
indicator = nothing, validate = (false, false))``````

Join two `DataFrame` objects

Arguments

• `df1`, `df2` : the two AbstractDataFrames to be joined

Keyword Arguments

• `on` : A column, or vector of columns, to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be `(left, right)` tuples or `left => right` pairs, or a vector of such tuples or pairs. `on` is a required argument for all joins except for `kind = :cross`.

• `kind` : the type of join, options include:

• `:inner` : only include rows with keys that match in both `df1` and `df2`, the default
• `:outer` : include all rows from `df1` and `df2`
• `:left` : include all rows from `df1`
• `:right` : include all rows from `df2`
• `:semi` : return rows of `df1` that match with the keys in `df2`
• `:anti` : return rows of `df1` that do not match with the keys in `df2`
• `:cross` : a full Cartesian product of the key combinations; every row of `df1` is matched with every row of `df2`

• `makeunique` : if `false` (the default), an error will be raised if duplicate names are found in columns not joined on; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).

• `indicator` : Default: `nothing`. If a `Symbol`, adds categorical indicator column named `Symbol` for whether a row appeared in only `df1` (`"left_only"`), only `df2` (`"right_only"`) or in both (`"both"`). If `Symbol` is already in use, the column name will be modified if `makeunique=true`.

• `validate` : whether to check that columns passed as the `on` argument define unique keys in each input data frame (according to `isequal`). Can be a tuple or a pair, with the first element indicating whether to run check for `df1` and the second element for `df2`. By default no check is performed.

For the three join operations that may introduce missing values (`:outer`, `:left`, and `:right`), all columns of the returned data table will support missing values.

When merging `on` categorical columns that differ in the ordering of their levels, the ordering of the left `DataFrame` takes precedence over the ordering of the right `DataFrame`.
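Which keys each `kind` keeps can be sketched on a single key column with plain Julia collections. This is an illustration of the semantics listed above, not the DataFrames implementation. The helper `row` and the dictionary names are hypothetical, and the data mirror the `name` and `job` tables used in the examples below.

```julia
# Plain-Julia sketch of join kinds on one key column.
name_of = Dict(1 => "John Doe", 2 => "Jane Doe", 3 => "Joe Blogs")
job_of  = Dict(1 => "Lawyer", 2 => "Doctor", 4 => "Farmer")
left_ids  = [1, 2, 3]   # keys of the left table
right_ids = [1, 2, 4]   # keys of the right table

# Hypothetical helper: build one output row, filling absent values with `missing`.
row(id) = (ID = id, Name = get(name_of, id, missing), Job = get(job_of, id, missing))

inner = [row(id) for id in intersect(left_ids, right_ids)]  # keys present in both
outer = [row(id) for id in union(left_ids, right_ids)]      # keys present in either
left  = [row(id) for id in left_ids]                        # every left key
right = [row(id) for id in right_ids]                       # every right key
semi  = [(ID = id, Name = name_of[id]) for id in left_ids if id in right_ids]
anti  = [(ID = id, Name = name_of[id]) for id in left_ids if !(id in right_ids)]
cross = [(l, r) for l in left_ids for r in right_ids]       # every pairing
```

The `:outer`, `:left` and `:right` sketches are the ones in which `missing` appears, matching the note above about columns that must support missing values.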

Result

• `::DataFrame` : the joined DataFrame

Examples

``````name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)``````
``````map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)``````

Apply a function to each group of rows and return a `GroupedDataFrame`.

If the first argument is a `cols => f` pair, `cols` must be a column name or index, or a vector or tuple thereof, and `f` must be a callable. If `cols` is a single column index, `f` is called with a `SubArray` view into that column for each group; else, `f` is called with a named tuple holding `SubArray` views into these columns.

If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.

If the first argument is a callable, it is passed a `SubDataFrame` view for each group, and the returned `DataFrame` then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability.

`f` can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

• A single value gives a data frame with a single column and one row per group.
• A named tuple of single values or a `DataFrameRow` gives a data frame with one column for each field and one row per group.
• A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
• A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, each of which produces a separate column.

In all cases, the resulting `GroupedDataFrame` contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called `x1`, `x2` and so on.

Note that `f` must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (`sum`, `prod`, `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`) are specified using the pair syntax (e.g. `col => sum`). When computing the `sum` or `mean` over floating point columns, results will be less accurate than the standard `sum` function (which uses pairwise summation). Use `col => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

Examples

``````julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);

julia> gd = groupby(df, :a);

julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │

julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │``````

See `by` for more examples.

`combine(f, gd)` returns a `DataFrame` rather than a `GroupedDataFrame`.

Stacks a DataFrame; convert from a wide to long format; see `stack`.

Stacks a DataFrame; convert from a wide to long format

``````stack(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)``````

Arguments

• `df` : the AbstractDataFrame to be stacked

• `measure_vars` : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for `melt`, defaults to all variables that are not `id_vars`. If neither `measure_vars` nor `id_vars` is given, `measure_vars` defaults to all floating point columns.

• `id_vars` : the identifier columns that are repeated during stacking, a normal column indexing type; for `stack` defaults to all variables that are not `measure_vars`

• `variable_name` : the name of the new stacked column that shall hold the names of each of `measure_vars`

• `value_name` : the name of the new stacked column containing the values from each of `measure_vars`

Result

• `::DataFrame` : the long-format DataFrame with column `:value` holding the values of the stacked columns (`measure_vars`), with column `:variable` a Vector of Symbols with the `measure_vars` name, and with columns for each of the `id_vars`.

See also `stackdf` and `meltdf` for stacking methods that return a view into the original DataFrame. See `unstack` for converting from long to wide format.
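The wide-to-long reshaping described above can be sketched with plain vectors. This illustrates the idea only. The row ordering (all rows of the first measurement column, then all rows of the next) is an assumption of this sketch, and the variable names are hypothetical.

```julia
# Plain-Julia sketch of stacking (wide to long); not the DataFrames implementation.
id       = [1, 2]                               # identifier column
measures = (c = [0.1, 0.2], d = [0.3, 0.4])     # measurement columns

# One long row per (measurement column, original row) combination.
variable = repeat(collect(keys(measures)), inner=length(id))  # name of the source column
value    = vcat(values(measures)...)                          # stacked values
id_long  = repeat(id, outer=length(measures))                 # identifiers repeated per column
```

The three resulting vectors have equal length and correspond to the `:variable`, `:value` and identifier columns of the long format.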

Examples

``````d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)``````
Unstacks a DataFrame; convert from a long to wide format

``````unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)``````

Arguments

• `df` : the AbstractDataFrame to be unstacked

• `rowkeys` : the column(s) with a unique key for each row; if not given, a key is found by grouping on every column that is not `colkey` or `value`

• `colkey` : the column holding the column names in wide format, defaults to `:variable`

• `value` : the value column, defaults to `:value`

Result

• `::DataFrame` : the wide-format DataFrame

If `colkey` contains `missing` values then they will be skipped and a warning will be printed.

If the combination of `rowkeys` and `colkey` contains duplicate entries, then the last `value` will be retained and a warning will be printed.
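The long-to-wide direction, including the "last value wins" rule for duplicates, can be sketched with nested dictionaries. This is a conceptual illustration with hypothetical names, not the DataFrames implementation, and it prints no warning.

```julia
# Plain-Julia sketch of unstacking (long to wide) with a duplicated key pair.
rowkey   = [1, 2, 1, 2, 1]
colkey   = [:c, :c, :d, :d, :c]          # (1, :c) occurs twice
value    = [0.1, 0.2, 0.3, 0.4, 0.9]

# wide[row][column] holds the cell value; a later duplicate overwrites an earlier one.
wide = Dict{Int, Dict{Symbol, Float64}}()
for (r, k, v) in zip(rowkey, colkey, value)
    cells = get!(() -> Dict{Symbol, Float64}(), wide, r)
    cells[k] = v
end
```

After the loop, the cell for row key `1` and column `:c` holds the value from the last matching long row.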

Examples

``````wide = DataFrame(id = 1:12,
a  = repeat([1:3;], inner = [4]),
b  = repeat([1:4;], inner = [3]),
c  = randn(12),
d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)``````

Note that there are some differences between the widened results above.

A stacked view of a DataFrame (long format)

Like `stack` and `melt`, but a view is returned rather than data copies.

``````stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
variable_name::Symbol=:variable, value_name::Symbol=:value)``````

Arguments

• `df` : the wide AbstractDataFrame

• `measure_vars` : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for `melt`, defaults to all variables that are not `id_vars`

• `id_vars` : the identifier columns that are repeated during stacking, a normal column indexing type; for `stack` defaults to all variables that are not `measure_vars`

Result

• `::DataFrame` : the long-format DataFrame with column `:value` holding the values of the stacked columns (`measure_vars`), with column `:variable` a Vector of Symbols with the `measure_vars` name, and with columns for each of the `id_vars`.

The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.
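How a vector type can present several columns as one stacked column without copying can be sketched as follows. `StackedSketch` is a hypothetical type written for this illustration. It is not the type DataFrames uses, and it only supports reading.

```julia
# Hypothetical read-only vector that indexes into several parent vectors in sequence.
struct StackedSketch{T} <: AbstractVector{T}
    parts::Vector{Vector{T}}
end

Base.size(v::StackedSketch) = (sum(length, v.parts),)

function Base.getindex(v::StackedSketch, i::Int)
    # Walk through the parts until the index falls inside one of them.
    for p in v.parts
        i <= length(p) && return p[i]
        i -= length(p)
    end
    throw(BoundsError(v, i))
end

c = [1.0, 2.0]
d = [3.0, 4.0]
stacked = StackedSketch([c, d])   # no data is copied

c[1] = 10.0                       # a change in the parent column ...
first_value = stacked[1]          # ... is visible through the stacked vector
```

Because the stacked vector only stores references to `c` and `d`, it reflects later changes to them, which is the behaviour that distinguishes a view from a copy.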

Examples

``````d1 = DataFrame(a = repeat([1:3;], inner = [4]),
b = repeat([1:4;], inner = [3]),
c = randn(12),
d = randn(12),
e = map(string, 'a':'l'))

d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])``````
A stacked view of a DataFrame (long format); see `stackdf`

Basics

``allowmissing!(df::DataFrame)``

Convert all columns of `df` from element type `T` to `Union{T, Missing}` to support missing values.

``allowmissing!(df::DataFrame, col::Union{Integer, Symbol})``

Convert a single column of `df` from element type `T` to `Union{T, Missing}` to support missing values.

``allowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})``

Convert multiple columns of `df` from element type `T` to `Union{T, Missing}` to support missing values.
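The element-type change can be seen on a single plain vector. This sketch uses Base `convert` on a standalone vector rather than on a data frame column.

```julia
# Widening the element type of a vector so that it can hold `missing`.
x  = [1, 2, 3]                                   # element type Int
xm = convert(Vector{Union{Int, Missing}}, x)     # element type Union{Int, Missing}

xm[2] = missing   # allowed after widening; `x[2] = missing` would throw an error
```

The original vector `x` is left untouched; only the widened copy accepts `missing`.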

``````completecases(df::AbstractDataFrame)
completecases(df::AbstractDataFrame, cols::AbstractVector)
completecases(df::AbstractDataFrame, cols::Union{Integer, Symbol})``````

Return a Boolean vector with `true` entries indicating rows without missing values (complete cases) in data frame `df`. If `cols` is provided, only missing values in the corresponding columns are considered.

See also: `dropmissing` and `dropmissing!`. Use `findall(completecases(df))` to get the indices of the rows.
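The row-wise test can be sketched on two plain vectors. This is an illustration, not the DataFrames implementation, and the data mirror columns `x` and `y` of the example below.

```julia
# A row is a complete case when none of the considered columns is missing in it.
x = [missing, 4, missing, 2, 1]
y = [missing, missing, "c", "d", "e"]

complete = [!ismissing(xi) && !ismissing(yi) for (xi, yi) in zip(x, y)]
rows     = findall(complete)   # indices of the complete rows
```

Restricting the check to a subset of columns corresponds to leaving the other columns out of the `zip`.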

Examples

``````julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> completecases(df)
5-element BitArray{1}:
false
false
false
true
true

julia> completecases(df, :x)
5-element BitArray{1}:
false
true
false
true
true

julia> completecases(df, [:x, :y])
5-element BitArray{1}:
false
false
false
true
true``````
``deletecols!(df::DataFrame, ind)``

Delete columns specified by `ind` from a `DataFrame` `df` in place and return it.

Argument `ind` can be any index that is allowed for column indexing of a `DataFrame` provided that the columns requested to be removed are unique.

Examples

``````julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> deletecols!(d, 1)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │``````
``deleterows!(df::DataFrame, ind)``

Delete rows specified by `ind` from a `DataFrame` `df` in place and return it.

Internally, `deleteat!` is called on every column, so `ind` must be a vector of sorted and unique integers, a Boolean vector, or an integer.
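The effect of calling `deleteat!` column by column can be sketched with plain vectors standing in for the columns. The variable names are hypothetical.

```julia
# Two plain vectors stand in for the columns of a data frame.
a = [1, 2, 3]
b = [4, 5, 6]
columns = [a, b]

# Deleting the same sorted, unique indices from every column keeps rows aligned.
ind = [2]
for col in columns
    deleteat!(col, ind)
end

# Base `deleteat!` rejects indices that are not sorted.
unsorted_rejected = try
    deleteat!([10, 20, 30], [3, 1])
    false
catch
    true
end
```

Since every column receives the same `ind`, the restriction that Base places on `deleteat!` indices carries over to `deleterows!`.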

Examples

``````julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> deleterows!(d, 2)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 3     │ 6     │``````
Report descriptive statistics for a data frame

``describe(df::AbstractDataFrame; stats = [:mean, :min, :median, :max, :nmissing, :nunique, :eltype])``

Arguments

• `df` : the AbstractDataFrame
• `stats::Union{Symbol,AbstractVector{Symbol}}` : the summary statistics to report. If a vector, allowed fields are `:mean`, `:std`, `:min`, `:q25`, `:median`, `:q75`, `:max`, `:eltype`, `:nunique`, `:first`, `:last`, and `:nmissing`. If set to `:all`, all summary statistics are reported.

Result

• A `DataFrame` where each row represents a variable and each column a summary statistic.

Details

For `Real` columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from `Real`, `describe` will attempt to calculate all statistics, using `nothing` as a fall-back in the case of an error.

When `stats` contains `:nunique`, `describe` will report the number of unique values in a column. If a column's base type derives from `Real`, `:nunique` will return `nothing`s.

Missing values are filtered in the calculation of all statistics; however, the column `:nmissing` will report the number of missing values of that variable. If the column does not allow missing values, `nothing` is returned. Consequently, `nmissing = 0` indicates that the column allows missing values, but does not currently contain any.
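The distinction between `nothing` and `0` for `:nmissing` can be sketched with two small helper functions. Both names are hypothetical and written for this illustration. They are not part of DataFrames.

```julia
using Statistics

# Hypothetical helper: `nothing` when the element type cannot hold `missing`,
# otherwise the number of missing entries.
nmissing_sketch(col) = Missing <: eltype(col) ? count(ismissing, col) : nothing

# Hypothetical helper: a statistic computed after filtering out missing values.
mean_sketch(col) = mean(skipmissing(col))

strict   = [1, 2, 3]                        # cannot hold missing
tolerant = Union{Int, Missing}[1, 2, 3]     # can hold missing, currently has none
gappy    = Union{Int, Missing}[1, missing, 3]
```

Here `nmissing_sketch(strict)` is `nothing`, `nmissing_sketch(tolerant)` is `0`, and `mean_sketch(gappy)` ignores the missing entry.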

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
describe(df)
describe(df, stats = :all)
describe(df, stats = [:min, :max])``````
``disallowmissing!(df::DataFrame)``

Convert all columns of `df` from element type `Union{T, Missing}` to `T` to drop support for missing values.

``disallowmissing!(df::DataFrame, col::Union{Integer, Symbol})``

Convert a single column of `df` from element type `Union{T, Missing}` to `T` to drop support for missing values.

``disallowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})``

Convert multiple columns of `df` from element type `Union{T, Missing}` to `T` to drop support for missing values.
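The narrowing direction can be sketched on a plain vector with Base `convert`. This is an illustration on a standalone vector. How `disallowmissing!` itself reports a column that still contains `missing` is not stated here.

```julia
# Narrowing the element type works when no `missing` is actually present.
y  = Union{Int, Missing}[4, 2, 1]
yd = convert(Vector{Int}, y)        # element type Int

# With a `missing` entry, Base `convert` cannot narrow the vector and throws.
narrowing_failed = try
    convert(Vector{Int}, Union{Int, Missing}[1, missing])
    false
catch
    true
end
```

Narrowing is therefore only possible once the missing values have been removed or replaced.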

``````dropmissing(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)``````

Return a copy of data frame `df` excluding rows with missing values. If `cols` is provided, only missing values in the corresponding columns are considered.

In the future `disallowmissing` will be `true` by default.

See also: `completecases` and `dropmissing!`.

Examples

``````julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing(df)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing(df, disallowmissing=true)
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 2     │ 4      │ missing │
│ 2   │ 4     │ 2      │ d       │
│ 3   │ 5     │ 1      │ e       │

julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │``````
``````dropmissing!(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)``````

Remove rows with missing values from data frame `df` and return it. If `cols` is provided, only missing values in the corresponding columns are considered.

In the future `disallowmissing` will be `true` by default.

See also: `dropmissing` and `completecases`.

Examples

``````julia> df = DataFrame(i = 1:5,
x = [missing, 4, missing, 2, 1],
y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> df1 = copy(df);

julia> dropmissing!(df1);

julia> df1
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing!(df1, disallowmissing=true);

julia> df1
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> df2 = copy(df);

julia> dropmissing!(df2, :x);

julia> df2
3×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 2     │ 4      │ missing │
│ 2   │ 4     │ 2      │ d       │
│ 3   │ 5     │ 1      │ e       │

julia> df3 = copy(df);

julia> dropmissing!(df3, [:x, :y]);

julia> df3
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │``````
``eachrow(df::AbstractDataFrame)``

Return a `DataFrameRows` that iterates a data frame row by row, with each row represented as a `DataFrameRow`.

Examples

``````julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> eachrow(df)
4-element DataFrameRows:
DataFrameRow (row 1)
x  1
y  11
DataFrameRow (row 2)
x  2
y  12
DataFrameRow (row 3)
x  3
y  13
DataFrameRow (row 4)
x  4
y  14

julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
(x = 1, y = 11)
(x = 2, y = 12)
(x = 3, y = 13)
(x = 4, y = 14)

julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
DataFrameRow (row 4)
y  14
x  4
DataFrameRow (row 3)
y  13
x  3``````
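The examples above only display the iterator. A minimal sketch of consuming it in a loop might look like this; it assumes nothing beyond `eachrow` and name-based indexing into a `DataFrameRow`, and `totals` is an illustrative name.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# Each `row` is a DataFrameRow; columns are looked up by name
totals = [row[:x] + row[:y] for row in eachrow(df)]
```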
``eachcol(df::AbstractDataFrame, names::Bool=true)``

Return a `DataFrameColumns` that iterates an `AbstractDataFrame` column by column. If `names` is `true` (currently the default; in the future the default will be `false`), iteration yields pairs consisting of a column name and a column vector. If `names` is `false`, only the column vectors are yielded.

Examples

``````julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> collect(eachcol(df, true))
2-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
:x => [1, 2, 3, 4]
:y => [11, 12, 13, 14]

julia> collect(eachcol(df, false))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]

julia> sum.(eachcol(df, false))
2-element Array{Int64,1}:
10
50

julia> map(eachcol(df, false)) do col
maximum(col) - minimum(col)
end
2-element Array{Int64,1}:
3
3``````
Return element types of columns

``eltypes(df::AbstractDataFrame)``

Arguments

• `df` : the AbstractDataFrame

Result

• `::Vector{Type}` : the element type of each column

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
eltypes(df)``````
``filter(function, df::AbstractDataFrame)``

Return a copy of data frame `df` containing only rows for which `function` returns `true`. The function is passed a `DataFrameRow` as its only argument.

Examples

``````julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter(row -> row[:x] > 1, df)
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │``````
``filter!(function, df::AbstractDataFrame)``

Remove rows from data frame `df` for which `function` returns `false`. The function is passed a `DataFrameRow` as its only argument.

Examples

``````julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter!(row -> row[:x] > 1, df);

julia> df
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │``````
Insert a column into a data frame in place.

``````insertcols!(df::DataFrame, ind::Int; name=col,
makeunique::Bool=false)
insertcols!(df::DataFrame, ind::Int, (:name => col)::Pair{Symbol,<:AbstractVector};
makeunique::Bool=false)``````

Arguments

• `df` : the DataFrame to which we want to add a column

• `ind` : a position at which we want to insert a column

• `name` : the name of the new column

• `col` : an `AbstractVector` giving the contents of the new column

• `makeunique` : Defines what to do if `name` already exists in `df`; if it is `false` an error will be thrown; if it is `true` a new unique name will be generated by adding a suffix

Result

• `::DataFrame` : a `DataFrame` with added column.

Examples

``````julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> insertcols!(d, 1, b=['a', 'b', 'c'])
3×2 DataFrame
│ Row │ b    │ a     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 1     │
│ 2   │ 'b'  │ 2     │
│ 3   │ 'c'  │ 3     │

julia> insertcols!(d, 1, :c => [2, 3, 4])
3×3 DataFrame
│ Row │ c     │ b    │ a     │
│     │ Int64 │ Char │ Int64 │
├─────┼───────┼──────┼───────┤
│ 1   │ 2     │ 'a'  │ 1     │
│ 2   │ 3     │ 'b'  │ 2     │
│ 3   │ 4     │ 'c'  │ 3     │``````
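The effect of `makeunique` is not shown above. The following is a small sketch under the assumption that the pair form of `insertcols!` accepts the keyword as documented; the exact suffix that gets generated is deliberately not asserted here.

```julia
using DataFrames

d = DataFrame(a = 1:3)

# :a already exists, so without makeunique=true this call would throw
insertcols!(d, 1, :a => [7, 8, 9], makeunique = true)

# string.() keeps the check independent of how names are represented
colnames = string.(names(d))
```

After the call the original column is still named `a`, and the inserted column carries a generated name that differs from it.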
``mapcols(f::Union{Function,Type}, df::AbstractDataFrame)``

Return a `DataFrame` where each column of `df` is transformed using function `f`. `f` must return either `AbstractVector` objects that all have the same length, or scalars.

Examples

``````julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 121   │
│ 2   │ 4     │ 144   │
│ 3   │ 9     │ 169   │
│ 4   │ 16    │ 196   │``````
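The example above covers the vector case. For the scalar case, a minimal sketch (using `sum` as an arbitrary reducing function; `totals` is an illustrative name) could be:

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 11:14)

# sum returns one scalar per column, so the result has a single row
totals = mapcols(sum, df)
```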
Set column names

``names!(df::AbstractDataFrame, vals; makeunique::Bool=false)``

Arguments

• `df` : the AbstractDataFrame
• `vals` : column names, normally a Vector{Symbol} the same length as the number of columns in `df`
• `makeunique` : if `false` (the default), an error will be raised if duplicate names are found; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).

Result

• `::AbstractDataFrame` : the updated result

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
names!(df, [:a, :b, :c])
names!(df, [:a, :b, :a])  # throws ArgumentError
names!(df, [:a, :b, :a], makeunique=true)  # renames second :a to :a_1``````
Flag duplicate rows (a row is flagged if it is a duplicate of a prior row)

``````nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)``````

Arguments

• `df` : the AbstractDataFrame
• `cols` : a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare

Result

• `::Vector{Bool}` : indicates whether the row is a duplicate of some prior row

See also `unique` and `unique!`.

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)``````
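Because the example above uses random data, its result varies between runs. A deterministic sketch with made-up values shows what the returned `Bool` vector looks like; `dups_all` and `dups_a` are illustrative names.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 1], b = ["x", "x", "y", "z"])

# Row 2 repeats row 1 in every column
dups_all = nonunique(df)

# Considering only column :a, rows 2 and 4 repeat the value first seen in row 1
dups_a = nonunique(df, :a)
```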
Rename columns

``````rename!(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename!(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename!(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename!(f::Function, df::AbstractDataFrame)
rename(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename(f::Function, df::AbstractDataFrame)``````

Arguments

• `df` : the AbstractDataFrame
• `d` : an `AbstractDict` or an `AbstractArray` of pairs that maps the original names to new names
• `f` : a function which for each column takes the old name (a Symbol) and returns the new name (a Symbol)

Result

• `::AbstractDataFrame` : the updated result

New names are processed sequentially. A new name must not already exist in the `DataFrame` at the moment an attempt to rename a column is performed.

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
rename(df, :i => :A, :x => :X)
rename(df, [:i => :A, :x => :X])
rename(df, Dict(:i => :A, :x => :X))
rename(x -> Symbol(uppercase(string(x))), df)
rename(df) do x
Symbol(uppercase(string(x)))
end
rename!(df, Dict(:i => :A, :x => :X))``````
``repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)``

Construct a data frame by repeating rows in `df`. `inner` specifies how many times each row is repeated, and `outer` specifies how many times the full set of rows is repeated.

Example

``````julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 4     │
│ 4   │ 2     │ 4     │
│ 5   │ 1     │ 3     │
│ 6   │ 1     │ 3     │
│ 7   │ 2     │ 4     │
│ 8   │ 2     │ 4     │
│ 9   │ 1     │ 3     │
│ 10  │ 1     │ 3     │
│ 11  │ 2     │ 4     │
│ 12  │ 2     │ 4     │``````
``repeat(df::AbstractDataFrame, count::Integer)``

Construct a data frame by repeating the rows of `df` the number of times specified by `count`. As the example shows, the full set of rows is repeated as a block, as with `outer = count`.

Example

``````julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
│ 3   │ 1     │ 3     │
│ 4   │ 2     │ 4     │``````
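The two `repeat` methods can be compared side by side. This is a minimal sketch on the same two-row data frame; the names `by_count`, `by_outer` and `by_inner` are introduced here for illustration.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

# Positional count: the whole block of rows is repeated, as with `outer`
by_count = repeat(df, 2)
by_outer = repeat(df, outer = 2)

# `inner` instead repeats each row in place
by_inner = repeat(df, inner = 2)
```

Here `by_count` and `by_outer` hold the same rows in the same order, while `by_inner` places the copies of each row next to each other.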
``````show([io::IO,] df::AbstractDataFrame;
allrows::Bool = !get(io, :limit, false),
allcols::Bool = !get(io, :limit, false),
allgroups::Bool = !get(io, :limit, false),
splitcols::Bool = get(io, :limit, false),
rowlabel::Symbol = :Row,
summary::Bool = true)``````

Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.

If `io` is omitted, the result is printed to `stdout`, and `allrows`, `allcols` and `allgroups` default to `false` while `splitcols` defaults to `true`.

Arguments

• `io::IO`: The I/O stream to which `df` will be printed.
• `df::AbstractDataFrame`: The data frame to print.
• `allrows::Bool`: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if `io` does not have the `IOContext` property `limit` set.
• `allcols::Bool`: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if `io` does not have the `IOContext` property `limit` set.
• `allgroups::Bool`: Whether to print all groups rather than the first and last, when `df` is a `GroupedDataFrame`. By default this is the case only if `io` does not have the `IOContext` property `limit` set.
• `splitcols::Bool`: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if `allcols` is `true`. By default this is the case only if `io` has the `IOContext` property `limit` set.
• `rowlabel::Symbol = :Row`: The label to use for the column containing row numbers.
• `summary::Bool = true`: Whether to print a brief string summary of the data frame.

Examples

``````julia> using DataFrames

julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);

julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │
│ 3   │ 3     │ z      │``````
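The optional `io` argument and the `rowlabel` and `summary` keywords can be combined to capture the rendering as a string. The sketch below assumes only the signature documented above; `:Idx` is an arbitrary label chosen for illustration.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

# Render into an in-memory buffer instead of stdout.
# A plain IOBuffer has no :limit property, so by the defaults described
# above all rows and columns are printed.
io = IOBuffer()
show(io, df, summary = false, rowlabel = :Idx)
text = String(take!(io))
```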
``````sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)``````

Return a copy of data frame `df` sorted by column(s) `cols`. `cols` can be either a `Symbol` or `Integer` column index, or a tuple or vector of such indices.

If `alg` is `nothing` (the default), the most appropriate algorithm is chosen automatically among `TimSort`, `MergeSort` and `RadixSort` depending on the type of the sorting columns and on the number of rows in `df`. If `rev` is `true`, reverse sorting is performed. To enable reverse sorting only for some columns, pass `order(c, rev=true)` in `cols`, with `c` the corresponding column index (see example below). See `sort!` for a description of other keyword arguments.

Examples

``````julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │``````
``````sort!(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward)``````

Sort data frame `df` by column(s) `cols`. `cols` can be either a `Symbol` or `Integer` column index, or a tuple or vector of such indices.

If `alg` is `nothing` (the default), the most appropriate algorithm is chosen automatically among `TimSort`, `MergeSort` and `RadixSort` depending on the type of the sorting columns and on the number of rows in `df`. If `rev` is `true`, reverse sorting is performed. To enable reverse sorting only for some columns, pass `order(c, rev=true)` in `cols`, with `c` the corresponding column index (see example below). See the other `sort!` methods for a description of the remaining keyword arguments.

Examples

``````julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │``````
Delete duplicate rows

``````unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)``````

Arguments

• `df` : the AbstractDataFrame
• `cols` : a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare

Result

• `::AbstractDataFrame` : the updated version of `df` with unique rows.

When `cols` is specified, the returned `DataFrame` contains complete rows: for each distinct combination of values in `df[cols]`, the first row with that combination is retained.

See also `nonunique`.

Examples

``````df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df)   # doesn't modify df
unique(df, 1)
unique!(df)  # modifies df``````
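A deterministic sketch with made-up values illustrates which rows survive when `cols` is given, and how the result relates to `nonunique`; `firsts` is an illustrative name.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 1], b = ["x", "y", "z", "w"])

# For each distinct value of :a, the first complete row is kept
firsts = unique(df, :a)

# The same rows can be selected by negating the mask from nonunique
via_mask = df[.!nonunique(df, :a), :]
```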
``permutecols!(df::DataFrame, p::AbstractVector)``

Permute the columns of `df` in-place, according to permutation `p`. Elements of `p` may be either column indices (`Int`) or names (`Symbol`), but cannot be a combination of both. All columns must be listed.

Examples

``````julia> df = DataFrame(a=1:5, b=2:6, c=3:7)
5×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
│ 2   │ 2     │ 3     │ 4     │
│ 3   │ 3     │ 4     │ 5     │
│ 4   │ 4     │ 5     │ 6     │
│ 5   │ 5     │ 6     │ 7     │

julia> permutecols!(df, [2, 1, 3]);

julia> df
5×3 DataFrame
│ Row │ b     │ a     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 3     │
│ 2   │ 3     │ 2     │ 4     │
│ 3   │ 4     │ 3     │ 5     │
│ 4   │ 5     │ 4     │ 6     │
│ 5   │ 6     │ 5     │ 7     │

julia> permutecols!(df, [:c, :a, :b]);

julia> df
5×3 DataFrame
│ Row │ c     │ a     │ b     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 1     │ 2     │
│ 2   │ 4     │ 2     │ 3     │
│ 3   │ 5     │ 3     │ 4     │
│ 4   │ 6     │ 4     │ 5     │
│ 5   │ 7     │ 5     │ 6     │``````
``vcat(dfs::AbstractDataFrame...)``

Vertically concatenate `AbstractDataFrames`.

Column names in all passed data frames must be the same, but they can have different order. In such cases the order of names in the first passed `DataFrame` is used.

Example

``````julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │``````
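The column-order rule described above can be sketched with two small data frames whose columns share names but appear in a different order. The data is made up for illustration.

```julia
using DataFrames

df1 = DataFrame(A = [1, 2], B = ["a", "b"])
df2 = DataFrame(B = ["c"], A = [3])   # same names, different order

# Columns are matched by name; the order of df1 is used for the result
stacked = vcat(df1, df2)
colnames = string.(names(stacked))
```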
``append!(df1::DataFrame, df2::AbstractDataFrame)``

Add the rows of `df2` to the end of `df1`.

Column names must be equal (including order). Values corresponding to new rows are appended in-place to the column vectors of `df1`. Column types are therefore preserved, and new values are converted if necessary. An error is thrown if conversion fails: this is the case in particular if a column in `df2` contains `missing` values but the corresponding column in `df1` does not accept them.

Note

Use `vcat` instead of `append!` when more flexibility is needed. Since `vcat` does not operate in place, it is able to use promotion to find an appropriate element type to hold values from both data frames. It also accepts columns in different orders between `df1` and `df2`.

Use `push!` to add individual rows to a data frame.
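The difference in type handling between `vcat` and `append!` can be sketched as follows. The data frames `ints` and `floats` are hypothetical, and the sketch only records whether `append!` throws, without assuming a particular exception type.

```julia
using DataFrames

ints   = DataFrame(A = [1, 2, 3])
floats = DataFrame(A = [4.5])

# vcat builds a new data frame and promotes Int64 and Float64 to Float64
promoted = vcat(ints, floats)

# append! must keep the Int64 column type, and 4.5 cannot be converted to Int64
failed = try
    append!(copy(ints), floats)
    false
catch
    true
end
```

Appending to a copy keeps `ints` itself untouched regardless of how the failed call is handled.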

Examples

``````julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4.0:6.0, B=4:6);

julia> append!(df1, df2);

julia> df1
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │``````
