Functions

Functions

Grouping, Joining, and Split-Apply-Combine

DataFrames.aggregateFunction.

Split-apply-combine that applies a set of functions over columns of an AbstractDataFrame or GroupedDataFrame

aggregate(d::AbstractDataFrame, cols, fs)
aggregate(gd::GroupedDataFrame, fs)

Arguments

  • d : an AbstractDataFrame
  • gd : a GroupedDataFrame
  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.)
  • fs : a function or vector of functions to be applied to vectors within groups; expects each argument to be a column vector

Each fs should return a value or vector. All returns must be the same length.

Returns

  • ::DataFrame

Examples

using Statistics
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
aggregate(df, :a, sum)
aggregate(df, :a, [sum, x->mean(skipmissing(x))])
aggregate(groupby(df, :a), [sum, x->mean(skipmissing(x))])
source
DataFrames.byFunction.
by(d::AbstractDataFrame, keys, cols => f...; sort::Bool = false)
by(d::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false)
by(d::AbstractDataFrame, keys, f; sort::Bool = false)
by(f, d::AbstractDataFrame, keys; sort::Bool = false)

Split-apply-combine in one step: apply f to each grouping in d based on grouping columns keys, and return a DataFrame.

keys can be either a single column index, or a vector thereof.

If the last argument(s) consist(s) in one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if multiple pairs are passed as last arguments, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input colummn name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted on keys if sort=true. Otherwise, ordering of rows is undefined.

Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g.col => sum). When computing thesumormeanover floating point columns, results will be less accurate than the standard [sum](@ref) function (which uses pairwise summation). Usecol => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

by(d, cols, f) is equivalent to combine(f, groupby(d, cols)) and to the less efficient combine(map(f, groupby(d, cols))).

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> by(df, :a, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a) do d # do syntax for the slower variant
           sum(d.c)
       end
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> by(df, :a, :c => x -> 2 .* x)
8×2 DataFrame
│ Row │ a     │ c_function │
│     │ Int64 │ Int64      │
├─────┼───────┼────────────┤
│ 1   │ 1     │ 2          │
│ 2   │ 1     │ 10         │
│ 3   │ 2     │ 4          │
│ 4   │ 2     │ 12         │
│ 5   │ 3     │ 6          │
│ 6   │ 3     │ 14         │
│ 7   │ 4     │ 8          │
│ 8   │ 4     │ 16         │

julia> by(df, :a, c_sum = :c => sum, c_sum2 = :c => x -> sum(x.^2))
4×3 DataFrame
│ Row │ a     │ c_sum │ c_sum2 │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 6     │ 26     │
│ 2   │ 2     │ 8     │ 40     │
│ 3   │ 3     │ 10    │ 58     │
│ 4   │ 4     │ 12    │ 80     │

julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
4×3 DataFrame
│ Row │ a     │ minb  │ sumc  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 6     │
│ 2   │ 2     │ 1     │ 8     │
│ 3   │ 3     │ 2     │ 10    │
│ 4   │ 4     │ 1     │ 12    │
source
DataFrames.colwiseFunction.

Apply a function to each column in an AbstractDataFrame or GroupedDataFrame

colwise(f, d)

Arguments

  • f : a function or vector of functions
  • d : an AbstractDataFrame of GroupedDataFrame

Returns

  • various, depending on the call

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
colwise(sum, df)
colwise([sum, length], df)
colwise((minimum, maximum), df)
colwise(sum, groupby(df, :a))
source
DataFrames.combineFunction.
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)

Transform a GroupedDataFrame into a DataFrame.

If the last argument(s) consist(s) in one or more cols => f pair(s), or if colname = cols => f keyword arguments are provided, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. A pair or a (named) tuple of pairs can also be provided as the first or last argument. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the last argument is a callable f, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability. A method is defined with f as the first argument, so do-block notation can be used.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting data frame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on. The resulting data frame will be sorted if sort=true was passed to the groupby call from which gd was constructed. Otherwise, ordering of rows is undefined.

Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g.col => sum). When computing thesumormeanover floating point columns, results will be less accurate than the standard [sum](@ref) function (which uses pairwise summation). Usecol => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> combine(gd, :c => sum)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(:c => sum, gd)
4×2 DataFrame
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

See by for more examples.

See also

by(f, df, cols) is a shorthand for combine(f, groupby(df, cols)).

map: combine(f, groupby(df, cols)) is a more efficient equivalent of combine(map(f, groupby(df, cols))).

source
DataFrames.groupbyFunction.

A view of an AbstractDataFrame split into row groups

groupby(d::AbstractDataFrame, cols; sort = false, skipmissing = false)
groupby(cols; sort = false, skipmissing = false)

Arguments

  • d : an AbstractDataFrame to split (optional, see Returns)
  • cols : data table columns to group by
  • sort: whether to sort rows according to the values of the grouping columns cols
  • skipmissing: whether to skip rows with missing values in one of the grouping columns cols

Returns

A GroupedDataFrame : a grouped view into d

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into d. A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).

See the following for additional split-apply-combine operations:

  • by : split-apply-combine using functions
  • aggregate : split-apply-combine; applies functions in the form of a cross product
  • colwise : apply a function to each column in an AbstractDataFrame or GroupedDataFrame
  • map : apply a function to each group of a GroupedDataFrame (without combining)
  • combine : combine a GroupedDataFrame, optionally applying a function to each group

Examples

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))
gd = groupby(df, :a)
gd[1]
last(gd)
vcat([g[:b] for g in gd]...)
for g in gd
    println(g)
end
source
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of parent(gd).

Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).

source
DataFrames.groupvarsFunction.
groupvars(gd::GroupedDataFrame)

Return a vector of column names in parent(gd) used for grouping.

source
Base.joinFunction.
join(df1, df2; on = Symbol[], kind = :inner, makeunique = false,
     indicator = nothing, validate = (false, false))

Join two DataFrame objects

Arguments

  • df1, df2 : the two AbstractDataFrames to be joined

Keyword Arguments

  • on : A column, or vector of columns to join df1 and df2 on. If the column(s) that df1 and df2 will be joined on have different names, then the columns should be (left, right) tuples or left => right pairs, or a vector of such tuples or pairs. on is a required argument for all joins except for kind = :cross

  • kind : the type of join, options include:

    • :inner : only include rows with keys that match in both df1 and df2, the default
    • :outer : include all rows from df1 and df2
    • :left : include all rows from df1
    • :right : include all rows from df2
    • :semi : return rows of df1 that match with the keys in df2
    • :anti : return rows of df1 that do not match with the keys in df2
    • :cross : a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.

  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run check for df1 and the second element for df2. By default no check is performed.

For the three join operations that may introduce missing values (:outer, :left, and :right), all columns of the returned data table will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left DataFrame takes precedence over the ordering of the right DataFrame

Result

  • ::DataFrame : the joined DataFrame

Examples

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
join(name, job2, on = (:ID, :identifier))
join(name, job2, on = :ID => :identifier)
source
Base.mapFunction.
map(cols => f, gd::GroupedDataFrame)
map(f, gd::GroupedDataFrame)

Apply a function to each group of rows and return a GroupedDataFrame.

If the first argument is a cols => f pair, cols must be a column name or index, or a vector or tuple thereof, and f must be a callable. If cols is a single column index, f is called with a SubArray view into that column for each group; else, f is called with a named tuple holding SubArray views into these columns.

If the first argument is a vector, tuple or named tuple of such pairs, each pair is handled as described above. If a named tuple, field names are used to name each generated column.

If the first argument is a callable, it is passed a SubDataFrame view for each group, and the returned DataFrame then consists of the returned rows plus the grouping columns. Note that this second form is much slower than the first one due to type instability.

f can return a single value, a row or multiple rows. The type of the returned value determines the shape of the resulting data frame:

  • A single value gives a data frame with a single column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a data frame with one column for each field and one row per group.
  • A vector gives a data frame with a single column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a data frame with the same columns and as many rows for each group as the rows returned for that group.

As a special case, if a tuple or vector of pairs is passed as the first argument, each function is required to return a single value or vector, which will produce each a separate column.

In all cases, the resulting GroupedDataFrame contains all the grouping columns in addition to those listed above. Column names are automatically generated when necessary: for functions operating on a single column and returning a single value or vector, the function name is appended to the input column name; for other functions, columns are called x1, x2 and so on.

Note that f must always return the same type of object for all groups, and (if a named tuple or data frame) with the same fields or columns. Due to type instability, returning a single value or a named tuple is dramatically faster than returning a data frame.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the pair syntax (e.g.col => sum). When computing thesumormeanover floating point columns, results will be less accurate than the standard [sum](@ref) function (which uses pairwise summation). Usecol => x -> sum(x)` to avoid the optimized method and use the slower, more accurate one.

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> map(:c => sum, gd)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ c_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │

julia> map(df -> sum(df.c), gd) # Slower variant
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
⋮
Last Group: 1 row
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 12    │

See by for more examples.

See also

combine(f, gd) returns a DataFrame rather than a GroupedDataFrame

source
DataFrames.meltFunction.

Stacks a DataFrame; convert from a wide to long format; see stack.

source
DataFrames.stackFunction.

Stacks a DataFrame; convert from a wide to long format

stack(df::AbstractDataFrame, [measure_vars], [id_vars];
      variable_name::Symbol=:variable, value_name::Symbol=:value)
melt(df::AbstractDataFrame, [id_vars], [measure_vars];
     variable_name::Symbol=:variable, value_name::Symbol=:value)

Arguments

  • df : the AbstractDataFrame to be stacked

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars. If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

  • variable_name : the name of the new stacked column that shall hold the names of each of measure_vars

  • value_name : the name of the new stacked column containing the values from each of measure_vars

Result

  • ::DataFrame : the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])
d1s_name = melt(d1, [:a, :b, :e], variable_name=:somemeasure)
source
DataFrames.unstackFunction.

Unstacks a DataFrame; convert from a long to wide format

unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
        colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{<:Union{Symbol, Integer}},
        colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, colkey::Union{Symbol, Integer},
        value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame to be unstacked

  • rowkeys : the column(s) with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value

  • colkey : the column holding the column names in wide format, defaults to :variable

  • value : the value column, defaults to :value

Result

  • ::DataFrame : the wide-format DataFrame

If colkey contains missing values then they will be skipped and a warning will be printed.

If combination of rowkeys and colkey contains duplicate entries then last value will be retained and a warning will be printed.

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)

Note that there are some differences between the widened results above.

source
DataFrames.stackdfFunction.

A stacked view of a DataFrame (long format)

Like stack and melt, but a view is returned rather than data copies.

stackdf(df::AbstractDataFrame, [measure_vars], [id_vars];
        variable_name::Symbol=:variable, value_name::Symbol=:value)
meltdf(df::AbstractDataFrame, [id_vars], [measure_vars];
       variable_name::Symbol=:variable, value_name::Symbol=:value)

Arguments

  • df : the wide AbstractDataFrame

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

Result

  • ::DataFrame : the long-format DataFrame with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])
source
DataFrames.meltdfFunction.

A stacked view of a DataFrame (long format); see stackdf

source

Basics

allowmissing!(df::DataFrame)

Convert all columns of a df from element type T to Union{T, Missing} to support missing values.

allowmissing!(df::DataFrame, col::Union{Integer, Symbol})

Convert a single column of a df from element type T to Union{T, Missing} to support missing values.

allowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})

Convert multiple columns of a df from element type T to Union{T, Missing} to support missing values.

source
completecases(df::AbstractDataFrame)
completecases(df::AbstractDataFrame, cols::AbstractVector)
completecases(df::AbstractDataFrame, cols::Union{Integer, Symbol})

Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df. If cols is provided, only missing values in the corresponding columns are considered.

See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> completecases(df)
5-element BitArray{1}:
 false
 false
 false
  true
  true

julia> completecases(df, :x)
5-element BitArray{1}:
 false
  true
 false
  true
  true

julia> completecases(df, [:x, :y])
5-element BitArray{1}:
 false
 false
 false
  true
  true
source
deletecols!(df::DataFrame, ind)

Delete columns specified by ind from a DataFrame df in place and return it.

Argument ind can be any index that is allowed for column indexing of a DataFrame provided that the columns requested to be removed are unique.

Examples

julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> deletecols!(d, 1)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │
source
deleterows!(df::DataFrame, ind)

Delete rows specified by ind from a DataFrame df in place and return it.

Internally deleteat! is called for all columns so ind must be: a vector of sorted and unique integers, a boolean vector or an integer.

Examples

julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> deleterows!(d, 2)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 3     │ 6     │
source
StatsBase.describeFunction.

Report descriptive statistics for a data frame

describe(df::AbstractDataFrame; stats = [:mean, :min, :median, :max, :nmissing, :nunique, :eltype])

Arguments

  • df : the AbstractDataFrame
  • stats::Union{Symbol,AbstractVector{Symbol}} : the summary statistics to report. If a vector, allowed fields are :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. If set to :all, all summary statistics are reported.

Result

  • A DataFrame where each row represents a variable and each column a summary statistic.

Details

For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.

When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings.

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
describe(df)
describe(df, stats = :all)
describe(df, stats = [:min, :max])
source
disallowmissing!(df::DataFrame)

Convert all columns of a df from element type Union{T, Missing} to T to drop support for missing values.

disallowmissing!(df::DataFrame, col::Union{Integer, Symbol})

Convert a single column of a df from element type Union{T, Missing} to T to drop support for missing values.

disallowmissing!(df::DataFrame, cols::AbstractVector{<:Union{Integer, Symbol}})

Convert multiple columns of a df from element type Union{T, Missing} to T to drop support for missing values.

source
dropmissing(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)

Return a copy of data frame df excluding rows with missing values. If cols is provided, only missing values in the corresponding columns are considered.

In the future disallowmissing will be true by default.

See also: completecases and dropmissing!.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing(df)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing(df, disallowmissing=true)
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 2     │ 4      │ missing │
│ 2   │ 4     │ 2      │ d       │
│ 3   │ 5     │ 1      │ e       │

julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │
source
dropmissing!(df::AbstractDataFrame; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::AbstractVector; disallowmissing::Bool=false)
dropmissing!(df::AbstractDataFrame, cols::Union{Integer, Symbol}; disallowmissing::Bool=false)

Remove rows with missing values from data frame df and return it. If cols is provided, only missing values in the corresponding columns are considered.

In the future disallowmissing will be true by default.

See also: dropmissing and completecases.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64⍰  │ String⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> df1 = copy(df);

julia> dropmissing!(df1);

julia> df1
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing!(df1, disallowmissing=true);
 julia> df1
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> df2 = copy(df);

julia> dropmissing!(df2, :x);

julia> df2
3×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 2     │ 4      │ missing │
│ 2   │ 4     │ 2      │ d       │
│ 3   │ 5     │ 1      │ e       │

julia> df3 = copy(df);

julia> dropmissing!(df3, [:x, :y]);


julia> df3
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64⍰ │ String⍰ │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │
source
DataFrames.eachrowFunction.
eachrow(df::AbstractDataFrame)

Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> eachrow(df)
4-element DataFrameRows:
 DataFrameRow (row 1)
x  1
y  11
 DataFrameRow (row 2)
x  2
y  12
 DataFrameRow (row 3)
x  3
y  13
 DataFrameRow (row 4)
x  4
y  14

julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
 (x = 1, y = 11)
 (x = 2, y = 12)
 (x = 3, y = 13)
 (x = 4, y = 14)

julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
 DataFrameRow (row 4)
y  14
x  4
 DataFrameRow (row 3)
y  13
x  3
source
DataFrames.eachcolFunction.
eachcol(df::AbstractDataFrame, names::Bool=true)

Return a DataFrameColumns that iterates an AbstractDataFrame column by column. If names is equal to true (currently the default, in the future the default will be set to false) iteration returns a pair consisting of column name and column vector. If names is equal to false then column vectors are yielded.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> collect(eachcol(df, true))
2-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
 :x => [1, 2, 3, 4]
 :y => [11, 12, 13, 14]

julia> collect(eachcol(df, false))
2-element Array{AbstractArray{T,1} where T,1}:
 [1, 2, 3, 4]
 [11, 12, 13, 14]

julia> sum.(eachcol(df, false))
2-element Array{Int64,1}:
 10
 50

julia> map(eachcol(df, false)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
 3
 3
source
DataFrames.eltypesFunction.

Return element types of columns

eltypes(df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame

Result

  • ::Vector{Type} : the element type of each column

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
eltypes(df)
source
Base.filterFunction.
filter(function, df::AbstractDataFrame)

Return a copy of data frame df containing only rows for which function returns true. The function is passed a DataFrameRow as its only argument.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter(row -> row[:x] > 1, df)
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
source
Base.filter!Function.
filter!(function, df::AbstractDataFrame)

Remove rows from data frame df for which function returns false. The function is passed a DataFrameRow as its only argument.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter!(row -> row[:x] > 1, df);

julia> df
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
source

Insert a column into a data frame in place.

insertcols!(df::DataFrame, ind::Int; name=col,
            makeunique::Bool=false)
insertcols!(df::DataFrame, ind::Int, (:name => col)::Pair{Symbol,<:AbstractVector};
            makeunique::Bool=false)

Arguments

  • df : the DataFrame to which we want to add a column

  • ind : a position at which we want to insert a column

  • name : the name of the new column

  • col : an AbstractVector giving the contents of the new column

  • makeunique : Defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix

Result

  • ::DataFrame : a DataFrame with added column.

Examples

julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> insertcols!(d, 1, b=['a', 'b', 'c'])
3×2 DataFrame
│ Row │ b    │ a     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 1     │
│ 2   │ 'b'  │ 2     │
│ 3   │ 'c'  │ 3     │

julia> insertcols!(d, 1, :c => [2, 3, 4])
3×3 DataFrame
│ Row │ c     │ b    │ a     │
│     │ Int64 │ Char │ Int64 │
├─────┼───────┼──────┼───────┤
│ 1   │ 2     │ 'a'  │ 1     │
│ 2   │ 3     │ 'b'  │ 2     │
│ 3   │ 4     │ 'c'  │ 3     │
source
DataFrames.mapcolsFunction.
mapcols(f::Union{Function,Type}, df::AbstractDataFrame)

Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 121   │
│ 2   │ 4     │ 144   │
│ 3   │ 9     │ 169   │
│ 4   │ 16    │ 196   │
source
DataFrames.names!Function.

Set column names

names!(df::AbstractDataFrame, vals)

Arguments

  • df : the AbstractDataFrame
  • vals : column names, normally a Vector{Symbol} the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

Result

  • ::AbstractDataFrame : the updated result

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
names!(df, [:a, :b, :c])
names!(df, [:a, :b, :a])  # throws ArgumentError
names!(df, [:a, :b, :a], makeunique=true)  # renames second :a to :a_1
source
DataFrames.nonuniqueFunction.

Indexes of duplicate rows (a row that is a duplicate of a prior row)

nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)

Arguments

  • df : the AbstractDataFrame
  • cols : a column indicator (Symbol, Int, Vector{Symbol}, etc.) specifying the column(s) to compare

Result

  • ::Vector{Bool} : indicates whether the row is a duplicate of some prior row

See also unique and unique!.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
source
DataFrames.rename!Function.

Rename columns

rename!(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename!(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename!(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename!(f::Function, df::AbstractDataFrame)
rename(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename(f::Function, df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame
  • d : an Associative type or an AbstractArray of pairs that maps the original names to new names
  • f : a function which for each column takes the old name (a Symbol) and returns the new name (a Symbol)

Result

  • ::AbstractDataFrame : the updated result

New names are processed sequentially. A new name must not already exist in the DataFrame at the moment an attempt to rename a column is performed.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
rename(df, :i => :A, :x => :X)
rename(df, [:i => :A, :x => :X])
rename(df, Dict(:i => :A, :x => :X))
rename(x -> Symbol(uppercase(string(x))), df)
rename(df) do x
    Symbol(uppercase(string(x)))
end
rename!(df, Dict(:i =>: A, :x => :X))
source
DataFrames.renameFunction.

Rename columns

rename!(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename!(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename!(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename!(f::Function, df::AbstractDataFrame)
rename(df::AbstractDataFrame, (from => to)::Pair{Symbol, Symbol}...)
rename(df::AbstractDataFrame, d::AbstractDict{Symbol,Symbol})
rename(df::AbstractDataFrame, d::AbstractArray{Pair{Symbol,Symbol}})
rename(f::Function, df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame
  • d : an Associative type or an AbstractArray of pairs that maps the original names to new names
  • f : a function which for each column takes the old name (a Symbol) and returns the new name (a Symbol)

Result

  • ::AbstractDataFrame : the updated result

New names are processed sequentially. A new name must not already exist in the DataFrame at the moment an attempt to rename a column is performed.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
rename(df, :i => :A, :x => :X)
rename(df, [:i => :A, :x => :X])
rename(df, Dict(:i => :A, :x => :X))
rename(x -> Symbol(uppercase(string(x))), df)
rename(df) do x
    Symbol(uppercase(string(x)))
end
rename!(df, Dict(:i =>: A, :x => :X))
source
Base.repeatFunction.
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)

Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 4     │
│ 4   │ 2     │ 4     │
│ 5   │ 1     │ 3     │
│ 6   │ 1     │ 3     │
│ 7   │ 2     │ 4     │
│ 8   │ 2     │ 4     │
│ 9   │ 1     │ 3     │
│ 10  │ 1     │ 3     │
│ 11  │ 2     │ 4     │
│ 12  │ 2     │ 4     │
source
repeat(df::AbstractDataFrame, count::Integer)

Construct a data frame by repeating each row in df the number of times specified by count.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
│ 3   │ 1     │ 3     │
│ 4   │ 2     │ 4     │
source
Base.showFunction.
show([io::IO,] df::AbstractDataFrame;
     allrows::Bool = !get(io, :limit, false),
     allcols::Bool = !get(io, :limit, false),
     allgroups::Bool = !get(io, :limit, false),
     splitcols::Bool = get(io, :limit, false),
     rowlabel::Symbol = :Row,
     summary::Bool = true)

Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.

If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false while splitcols defaults to true.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • df::AbstractDataFrame: The data frame to print.
  • allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
  • allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
  • allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
  • splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
  • rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
  • summary::Bool = true: Whether to print a brief string summary of the data frame.

Examples

julia> using DataFrames

julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);

julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │
│ 3   │ 3     │ z      │
source
Base.sortFunction.
sort(df::AbstractDataFrame, cols;
     alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
     rev::Bool=false, order::Ordering=Forward)

Return a copy of data frame df sorted by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
Base.sort!Function.
sort!(df::AbstractDataFrame, cols;
      alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
      rev::Bool=false, order::Ordering=Forward)

Sort data frame df by column(s) cols. cols can be either a Symbol or Integer column index, or a tuple or vector of such indices.

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, (:x, :y), rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort!(df, (:x, order(:y, rev=true)))
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
Base.unique!Function.

Delete duplicate rows

unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, etc.)

specifying the column(s) to compare.

Result

  • ::AbstractDataFrame : the updated version of df with unique rows.

When cols is specified, the return DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique.

See also nonunique.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df)   # doesn't modify df
unique(df, 1)
unique!(df)  # modifies df
source
permutecols!(df::DataFrame, p::AbstractVector)

Permute the columns of df in-place, according to permutation p. Elements of p may be either column indices (Int) or names (Symbol), but cannot be a combination of both. All columns must be listed.

Examples

julia> df = DataFrame(a=1:5, b=2:6, c=3:7)
5×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
│ 2   │ 2     │ 3     │ 4     │
│ 3   │ 3     │ 4     │ 5     │
│ 4   │ 4     │ 5     │ 6     │
│ 5   │ 5     │ 6     │ 7     │

julia> permutecols!(df, [2, 1, 3]);

julia> df
5×3 DataFrame
│ Row │ b     │ a     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 3     │
│ 2   │ 3     │ 2     │ 4     │
│ 3   │ 4     │ 3     │ 5     │
│ 4   │ 5     │ 4     │ 6     │
│ 5   │ 6     │ 5     │ 7     │

julia> permutecols!(df, [:c, :a, :b]);

julia> df
5×3 DataFrame
│ Row │ c     │ a     │ b     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 1     │ 2     │
│ 2   │ 4     │ 2     │ 3     │
│ 3   │ 5     │ 3     │ 4     │
│ 4   │ 6     │ 4     │ 5     │
│ 5   │ 7     │ 5     │ 6     │
source
Base.vcatFunction.
vcat(dfs::AbstractDataFrame...)

Vertically concatenate AbstractDataFrames.

Column names in all passed data frames must be the same, but they can have different order. In such cases the order of names in the first passed DataFrame is used.

Example

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
source
Base.append!Function.
append!(df1::DataFrame, df2::AbstractDataFrame)

Add the rows of df2 to the end of df1.

Column names must be equal (including order). Values corresponding to new rows are appended in-place to the column vectors of df1. Column types are therefore preserved, and new values are converted if necessary. An error is thrown if conversion fails: this is the case in particular if a column in df2 contains missing values but the corresponding column in df1 does not accept them.

Note

Use vcat instead of append! when more flexibility is needed. Since vcat does not operate in place, it is able to use promotion to find an appropriate element type to hold values from both data frames. It also accepts columns in different orders between df1 and df2.

Use push! to add individual rows to a data frame.

Examples

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4.0:6.0, B=4:6);

julia> append!(df1, df2);

julia> df1
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
source