Functions

Joining, Grouping, and Split-Apply-Combine

DataFrames.innerjoin - Function.
innerjoin(df1, df2; on, makeunique = false,
          validate = (false, false))
innerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))

Perform an inner join of two or more data frame objects and return a DataFrame containing the result. An inner join includes rows with keys that match in all passed data frames.

Arguments

  • df1, df2, dfs...: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names is allowed. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the validate keyword argument is applied recursively with left associativity.

See also: leftjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> innerjoin(name, job, on = :ID)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> innerjoin(name, job2, on = :ID => :identifier)
2×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │

julia> innerjoin(name, job2, on = [:ID => :identifier])
2×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String   │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │
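
The multi-frame form and the validate keyword are described above but not exercised in the examples. The following is a minimal sketch, assuming a hypothetical third table `city` and a hypothetical table `dup_job` with a repeated key, neither of which is part of the original examples. The exact error raised by a failed check is not asserted here.

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
# `city` is a hypothetical table introduced only for this illustration.
city = DataFrame(ID = [2, 3, 1], City = ["Paris", "Oslo", "Rome"])

# Three data frames: evaluated as innerjoin(innerjoin(name, job), city),
# so only IDs present in all three tables (1 and 2) are kept.
res = innerjoin(name, job, city, on = :ID)

# Request a uniqueness check of the key columns in both inputs.
checked = innerjoin(name, job, on = :ID, validate = (true, true))

# `dup_job` (hypothetical) repeats ID 1, so checking the right table should fail.
dup_job = DataFrame(ID = [1, 1], Job = ["Lawyer", "Judge"])
failed = try
    innerjoin(name, dup_job, on = :ID, validate = (false, true))
    false
catch
    true
end
```

Enabling validate is a cheap way to catch accidental one-to-many joins before they silently multiply rows.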

DataFrames.leftjoin - Function.
leftjoin(df1, df2; on, makeunique = false,
         indicator = nothing, validate = (false, false))

Perform a left join of two data frame objects and return a DataFrame containing the result. A left join includes all rows from df1.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

All columns of the returned data frame will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, rightjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> leftjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String    │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> leftjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String    │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │

julia> leftjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String    │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │
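
The makeunique keyword is not covered by the examples above. A minimal sketch, assuming two hypothetical tables `left` and `right` that share a non-key column named `Value`:

```julia
using DataFrames

# Hypothetical tables that share a non-key column name, `Value`.
left  = DataFrame(ID = [1, 2, 3], Value = [10, 20, 30])
right = DataFrame(ID = [1, 2, 4], Value = [100, 200, 400])

# With the default makeunique = false this call is expected to raise an error,
# because `Value` appears in both tables and is not a join column.
res = leftjoin(left, right, on = :ID, makeunique = true)

# The duplicate column coming from `right` gets a numeric suffix.
names(res)

# ID 3 has no match in `right`, so its entry in the suffixed column is missing.
n_missing = count(ismissing, res.Value_1)
```

Renaming the clashing columns before the join is often clearer than relying on the generated suffixes.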

DataFrames.rightjoin - Function.
rightjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))

Perform a right join on two data frame objects and return a DataFrame containing the result. A right join includes all rows from df2.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

All columns of the returned data frame will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, outerjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> rightjoin(name, job, on = :ID)
3×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String?  │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │
│ 3   │ 4     │ missing  │ Farmer │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> rightjoin(name, job2, on = :ID => :identifier)
3×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String?  │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │
│ 3   │ 4     │ missing  │ Farmer │

julia> rightjoin(name, job2, on = [:ID => :identifier])
3×3 DataFrame
│ Row │ ID    │ Name     │ Job    │
│     │ Int64 │ String?  │ String │
├─────┼───────┼──────────┼────────┤
│ 1   │ 1     │ John Doe │ Lawyer │
│ 2   │ 2     │ Jane Doe │ Doctor │
│ 3   │ 4     │ missing  │ Farmer │
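
The on keyword also accepts a vector that mixes plain column names with left=>right pairs. A minimal sketch, assuming hypothetical tables `sales` and `costs` in which the year column is named `Year` on the left and `yr` on the right:

```julia
using DataFrames

# Hypothetical tables: the year column is called `Year` in one and `yr` in the other.
sales = DataFrame(ID = [1, 1, 2], Year = [2019, 2020, 2019], Sales = [5, 7, 3])
costs = DataFrame(ID = [1, 2, 2], yr = [2020, 2019, 2020], Cost = [4, 2, 6])

# Join on ID (same name in both tables) and on Year/yr (a left => right pair).
res = rightjoin(sales, costs, on = [:ID, :Year => :yr])

# Every row of `costs` is kept; the (2, 2020) row has no match in `sales`,
# so its Sales entry is missing.
n_unmatched = count(ismissing, res.Sales)
```

As in the single-pair examples above, the key columns in the result carry the names used in the left table.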

DataFrames.outerjoin - Function.
outerjoin(df1, df2; on, makeunique = false,
          indicator = nothing, validate = (false, false))
outerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false))

Perform an outer join of two or more data frame objects and return a DataFrame containing the result. An outer join includes rows with keys that appear in any of the passed data frames.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names is allowed. on is a required argument.
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • indicator : Default: nothing. If a Symbol, adds categorical indicator column named Symbol for whether a row appeared in only df1 ("left_only"), only df2 ("right_only") or in both ("both"). If Symbol is already in use, the column name will be modified if makeunique=true. This argument is only supported when joining exactly two data frames.
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

All columns of the returned data frame will support missing values.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the indicator keyword argument is not supported and the validate keyword argument is applied recursively with left associativity.

See also: innerjoin, leftjoin, rightjoin, semijoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String?   │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │
│ 4   │ 4     │ missing   │ Farmer  │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> outerjoin(name, job2, on = :ID => :identifier)
4×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String?   │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │
│ 4   │ 4     │ missing   │ Farmer  │

julia> outerjoin(name, job2, on = [:ID => :identifier])
4×3 DataFrame
│ Row │ ID    │ Name      │ Job     │
│     │ Int64 │ String?   │ String? │
├─────┼───────┼───────────┼─────────┤
│ 1   │ 1     │ John Doe  │ Lawyer  │
│ 2   │ 2     │ Jane Doe  │ Doctor  │
│ 3   │ 3     │ Joe Blogs │ missing │
│ 4   │ 4     │ missing   │ Farmer  │
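
The examples above join two tables only. A minimal sketch of the multi-frame form, assuming a hypothetical third table `city` that is not part of the original examples:

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
# `city` is a hypothetical table introduced only for this illustration.
city = DataFrame(ID = [2, 5], City = ["Paris", "Oslo"])

# Evaluated as outerjoin(outerjoin(name, job), city):
# every ID that appears in any of the three tables is kept.
res = outerjoin(name, job, city, on = :ID)

# Keys without a match in a given table leave missing values in its columns.
missing_names  = count(ismissing, res.Name)
missing_jobs   = count(ismissing, res.Job)
missing_cities = count(ismissing, res.City)
```

Because the indicator keyword is documented as unsupported with more than two data frames, it is deliberately left out of this call.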

DataFrames.antijoin - Function.
antijoin(df1, df2; on, makeunique = false, validate = (false, false))

Perform an anti join of two data frame objects and return a DataFrame containing the result. An anti join returns the subset of rows of df1 that do not match with the keys in df2.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> antijoin(name, job, on = :ID)
1×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 3     │ Joe Blogs │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> antijoin(name, job2, on = :ID => :identifier)
1×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 3     │ Joe Blogs │

julia> antijoin(name, job2, on = [:ID => :identifier])
1×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 3     │ Joe Blogs │

DataFrames.semijoin - Function.
semijoin(df1, df2; on, makeunique = false, validate = (false, false))

Perform a semi join of two data frame objects and return a DataFrame containing the result. A semi join returns the subset of rows of df1 that match with the keys in df2.

Arguments

  • df1, df2: the AbstractDataFrames to be joined

Keyword Arguments

  • on : A column name to join df1 and df2 on. If the columns on which df1 and df2 will be joined have different names, then a left=>right pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed).
  • makeunique : if false (the default), an error will be raised if duplicate names are found in columns not joined on; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).
  • validate : whether to check that columns passed as the on argument define unique keys in each input data frame (according to isequal). Can be a tuple or a pair, with the first element indicating whether to run the check for df1 and the second element whether to run it for df2. By default no check is performed.

When merging on categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

See also: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, crossjoin.

Examples

julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
│ Row │ ID    │ Name      │
│     │ Int64 │ String    │
├─────┼───────┼───────────┤
│ 1   │ 1     │ John Doe  │
│ 2   │ 2     │ Jane Doe  │
│ 3   │ 3     │ Joe Blogs │

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ ID    │ Job    │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ Lawyer │
│ 2   │ 2     │ Doctor │
│ 3   │ 4     │ Farmer │

julia> semijoin(name, job, on = :ID)
2×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 1     │ John Doe │
│ 2   │ 2     │ Jane Doe │

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
│ Row │ identifier │ Job    │
│     │ Int64      │ String │
├─────┼────────────┼────────┤
│ 1   │ 1          │ Lawyer │
│ 2   │ 2          │ Doctor │
│ 3   │ 4          │ Farmer │

julia> semijoin(name, job2, on = :ID => :identifier)
2×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 1     │ John Doe │
│ 2   │ 2     │ Jane Doe │

julia> semijoin(name, job2, on = [:ID => :identifier])
2×2 DataFrame
│ Row │ ID    │ Name     │
│     │ Int64 │ String   │
├─────┼───────┼──────────┤
│ 1   │ 1     │ John Doe │
│ 2   │ 2     │ Jane Doe │
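
semijoin and antijoin are complementary: for the same inputs, one keeps the rows of df1 that the other drops. A minimal sketch using the tables from the examples above:

```julia
using DataFrames

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job  = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

kept    = semijoin(name, job, on = :ID)  # rows of `name` with a match in `job`
dropped = antijoin(name, job, on = :ID)  # rows of `name` without a match

# Together the two results account for every row of `name` exactly once.
partitioned = nrow(kept) + nrow(dropped) == nrow(name)
```

Neither function adds columns from df2, so both results have the same columns as `name`.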

DataFrames.crossjoin - Function.
crossjoin(df1, df2, dfs...; makeunique = false)

Perform a cross join of two or more data frame objects and return a DataFrame containing the result. A cross join returns the cartesian product of rows from all passed data frames.

Arguments

  • df1, df2, dfs... : the AbstractDataFrames to be joined

Keyword Arguments

  • makeunique : if false (the default), an error will be raised if duplicate column names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If more than two data frames are passed, the join is performed recursively with left associativity.

See also: innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin.

Examples

julia> df1 = DataFrame(X=1:3)
3×1 DataFrame
│ Row │ X     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame(Y=["a", "b"])
2×1 DataFrame
│ Row │ Y      │
│     │ String │
├─────┼────────┤
│ 1   │ a      │
│ 2   │ b      │

julia> crossjoin(df1, df2)
6×2 DataFrame
│ Row │ X     │ Y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 2     │ b      │
│ 5   │ 3     │ a      │
│ 6   │ 3     │ b      │
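
A minimal sketch of the multi-frame form and of makeunique, assuming a hypothetical third table `df3` and a hypothetical table with a clashing column name `X`:

```julia
using DataFrames

df1 = DataFrame(X = 1:3)
df2 = DataFrame(Y = ["a", "b"])
# `df3` is a hypothetical table introduced only for this illustration.
df3 = DataFrame(Z = [true, false])

# The number of rows is the product of the input row counts: 3 * 2 * 2.
res = crossjoin(df1, df2, df3)
n = nrow(res)

# Both inputs have a column named X, so makeunique = true is needed;
# the second one gets a numeric suffix.
dup = crossjoin(df1, DataFrame(X = [10, 20]), makeunique = true)
names(dup)
```

Since the result size grows multiplicatively, it is worth checking the input row counts before cross joining large tables.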

DataFrames.combine - Function.
combine(df::AbstractDataFrame, args...)

Create a new data frame that contains columns from df specified by args and return it. The result can have any number of rows that is determined by the values returned by passed transformations.

See select for detailed rules regarding accepted values for args.

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> combine(df, :a => sum, nrow)
1×2 DataFrame
│ Row │ a_sum │ nrow  │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 3     │
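
The statement that the result can have any number of rows can be illustrated with a transformation that returns a vector shorter than df. This is a sketch that assumes single values are repeated to match returned vectors, as described for select; the target column names `a_total` and `b_head` are chosen only for this illustration.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# :a => sum returns a single value, while the second transformation returns
# a 2-element vector, so the single value is repeated to fill two rows.
res = combine(df,
              :a => sum => :a_total,
              :b => (x -> x[1:2]) => :b_head)
```

Here the result has two rows even though df has three, because the longest returned vector has length two.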

combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, gd::GroupedDataFrame;
        keepkeys::Bool=true, ungroup::Bool=true)
combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true)
combine(fun::Union{Function, Type}, df::AbstractDataFrame, ungroup::Bool=true)
combine(pair::Pair, df::AbstractDataFrame, ungroup::Bool=true)

Apply operations to each group in a GroupedDataFrame and return the combined result as a DataFrame if ungroup=true or GroupedDataFrame if ungroup=false.

If an AbstractDataFrame is passed, the operations are applied to the data frame as a whole and a DataFrame is always returned.

Arguments passed as args... can be:

  • Any index that is allowed for column indexing (Symbol, string or integer, :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • Column transformation operations using the Pair notation that is described below and vectors of such pairs.

Transformations allowed using Pairs follow the rules specified for select and have the form source_cols => fun, source_cols => fun => target_col, or source_col => target_col. Function fun is passed SubArray views as positional arguments for each column specified to be selected, or a NamedTuple containing these SubArrays if source_cols is an AsTable selector. It can return a vector or a single value (defined precisely below).

As a special case nrow or nrow => target_col can be passed without specifying input columns to efficiently calculate number of rows in each group. If nrow is passed the resulting column name is :nrow.

If multiple args are passed then return values of different funs are allowed to mix single values and vectors. In this case single values will be broadcasted to match the length of columns specified by returned vectors. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.

If the first or last argument is a pair, it must be a Pair following the rules for pairs described above, except that in this case the function defined by fun can return any return value defined below.

If the first or last argument is a function fun, it is passed a SubDataFrame view for each group and can return any return value defined below. Note that this form is slower than pair or args due to type instability.

fun can return a single value, a row, a vector, or multiple rows. The type of the returned value determines the shape of the resulting DataFrame. There are four kinds of return values allowed:

  • A single value gives a DataFrame with a single additional column and one row per group.
  • A named tuple of single values or a DataFrameRow gives a DataFrame with one additional column for each field and one row per group (returning a named tuple will be faster). It is not allowed to mix single values and vectors if a named tuple is returned.
  • A vector gives a DataFrame with a single additional column and as many rows for each group as the length of the returned vector for that group.
  • A data frame, a named tuple of vectors or a matrix gives a DataFrame with the same additional columns and as many rows for each group as the rows returned for that group (returning a named tuple is the fastest option). Returning a table with zero columns is allowed, whatever the number of columns returned for other groups.
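
Two of the four kinds listed above, a named tuple of single values and a named tuple of vectors, can be sketched as follows. The field names `c_min`, `c_max` and `c_doubled` are chosen only for this illustration.

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer = 2),
               b = repeat([2, 1], outer = 4),
               c = 1:8)
gd = groupby(df, :a)

# Named tuple of single values: one row per group, one column per field.
one_row = combine(gd) do sdf
    (c_min = minimum(sdf.c), c_max = maximum(sdf.c))
end

# Named tuple of vectors: as many rows per group as the vectors are long.
many_rows = combine(gd) do sdf
    (c = copy(sdf.c), c_doubled = 2 .* sdf.c)
end
```

With the default keepkeys=true both results also contain the grouping column a, so `one_row` has four rows and `many_rows` has eight.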

fun must always return the same kind of object (out of four kinds defined above) for all groups, and with the same column names.

Optimized methods are used when standard summary functions (sum, prod, minimum, maximum, mean, var, std, first, last and length) are specified using the Pair syntax (e.g. :col => sum). When computing the sum or mean over floating point columns, results will be less accurate than the standard sum function (which uses pairwise summation). Use col => x -> sum(x) to avoid the optimized method and use the slower, more accurate one.
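
The two ways of requesting a sum can be placed side by side. A minimal sketch on hypothetical floating point data; the column names `x_fast` and `x_pairwise` are chosen only for this illustration, and on such small inputs any difference is expected to be at the level of rounding error.

```julia
using DataFrames

# Hypothetical floating point data in two groups.
df = DataFrame(g = repeat([1, 2], inner = 3),
               x = [0.1, 0.2, 0.3, 1.5, 2.5, 3.5])
gd = groupby(df, :g)

res = combine(gd,
              :x => sum => :x_fast,                # eligible for the optimized method
              :x => (v -> sum(v)) => :x_pairwise)  # anonymous function, calls sum on each group

# The two columns should agree up to floating point rounding.
agree = all(isapprox.(res.x_fast, res.x_pairwise))
```

Wrapping the reduction in an anonymous function trades speed for the accuracy of the standard sum.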

Column names are automatically generated when necessary using the rules defined in select if the Pair syntax is used and fun returns a single value or a vector (e.g. for :col => sum the column name is col_sum); otherwise (if fun is a function or a return value is an AbstractMatrix) columns are named x1, x2 and so on.

If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal.

If ungroup=true (the default) a DataFrame is returned. If ungroup=false a GroupedDataFrame grouped using groupcols(gd) is returned.

Ordering of rows follows the order of groups in gd.

See also

groupby, select, select!, transform, transform!

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a);

julia> combine(gd, :c => sum, nrow)
4×3 DataFrame
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 6     │ 2     │
│ 2   │ 2     │ 8     │ 2     │
│ 3   │ 3     │ 10    │ 2     │
│ 4   │ 4     │ 12    │ 2     │

julia> combine(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 4 groups based on key: a
First Group (1 row): a = 1
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 6     │ 2     │
⋮
Last Group (1 row): a = 4
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 12    │ 2     │

julia> combine(sdf -> sum(sdf.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(gd) do d # do syntax for the slower variant
           sum(d.c)
       end
4×2 DataFrame
│ Row │ a     │ x1    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 8     │
│ 3   │ 3     │ 10    │
│ 4   │ 4     │ 12    │

julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
4×2 DataFrame
│ Row │ a     │ sum_log_c │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 1.60944   │
│ 2   │ 2     │ 2.48491   │
│ 3   │ 3     │ 3.04452   │
│ 4   │ 4     │ 3.46574   │


julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs
4×3 DataFrame
│ Row │ a     │ b_sum │ c_sum │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 6     │
│ 2   │ 2     │ 2     │ 8     │
│ 3   │ 3     │ 4     │ 10    │
│ 4   │ 4     │ 2     │ 12    │

julia> combine(gd) do sdf # dropping group when DataFrame() is returned
          sdf.c[1] != 1 ? sdf : DataFrame()
       end
6×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 2     │
│ 2   │ 2     │ 1     │ 6     │
│ 3   │ 3     │ 2     │ 3     │
│ 4   │ 3     │ 2     │ 7     │
│ 5   │ 4     │ 1     │ 4     │
│ 6   │ 4     │ 1     │ 8     │

julia> combine(gd, :b => :b1, :c => :c1,
               [:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
8×3 DataFrame
│ Row │ b1    │ c1    │ b_c_+ │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 3     │
│ 2   │ 2     │ 5     │ 7     │
│ 3   │ 1     │ 2     │ 3     │
│ 4   │ 1     │ 6     │ 7     │
│ 5   │ 2     │ 3     │ 5     │
│ 6   │ 2     │ 7     │ 9     │
│ 7   │ 1     │ 4     │ 5     │
│ 8   │ 1     │ 8     │ 9     │

julia> combine(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a     │ b     │ c_sum │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 6     │
│ 2   │ 1     │ 2     │ 6     │
│ 3   │ 2     │ 1     │ 8     │
│ 4   │ 2     │ 1     │ 8     │
│ 5   │ 3     │ 2     │ 10    │
│ 6   │ 3     │ 2     │ 10    │
│ 7   │ 4     │ 1     │ 12    │
│ 8   │ 4     │ 1     │ 12    │

julia> combine(gd, [:b, :c] .=> Ref)
4×3 DataFrame
│ Row │ a     │ b_Ref    │ c_Ref    │
│     │ Int64 │ SubArra… │ SubArra… │
├─────┼───────┼──────────┼──────────┤
│ 1   │ 1     │ [2, 2]   │ [1, 5]   │
│ 2   │ 2     │ [1, 1]   │ [2, 6]   │
│ 3   │ 3     │ [2, 2]   │ [3, 7]   │
│ 4   │ 4     │ [1, 1]   │ [4, 8]   │

julia> combine(gd, AsTable(:) => Ref)
4×2 DataFrame
│ Row │ a     │ a_b_c_Ref                            │
│     │ Int64 │ NamedTuple…                          │
├─────┼───────┼──────────────────────────────────────┤
│ 1   │ 1     │ (a = [1, 1], b = [2, 2], c = [1, 5]) │
│ 2   │ 2     │ (a = [2, 2], b = [1, 1], c = [2, 6]) │
│ 3   │ 3     │ (a = [3, 3], b = [2, 2], c = [3, 7]) │
│ 4   │ 4     │ (a = [4, 4], b = [1, 1], c = [4, 8]) │

julia> combine(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ b_c_sum │
│     │ Int64 │ Int64 │ Int64 │ Int64   │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3       │
│ 2   │ 1     │ 2     │ 5     │ 7       │
│ 3   │ 2     │ 1     │ 2     │ 3       │
│ 4   │ 2     │ 1     │ 6     │ 7       │
│ 5   │ 3     │ 2     │ 3     │ 5       │
│ 6   │ 3     │ 2     │ 7     │ 9       │
│ 7   │ 4     │ 1     │ 4     │ 5       │
│ 8   │ 4     │ 1     │ 8     │ 9       │

DataFrames.groupby - Function.
groupby(df::AbstractDataFrame, cols; sort=false, skipmissing=false)

Return a GroupedDataFrame representing a view of an AbstractDataFrame split into row groups.

Arguments

  • df : an AbstractDataFrame to split
  • cols : data frame columns to group by. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • sort : whether to sort groups according to the values of the grouping columns cols; if all cols are CategoricalVectors then groups are always sorted irrespective of the value of sort
  • skipmissing : whether to skip groups with missing values in one of the grouping columns cols

Details

An iterator over a GroupedDataFrame returns a SubDataFrame view for each grouping into df. Within each group, the order of rows in df is preserved.

cols can be any valid data frame indexing expression. In particular if it is an empty vector then a single-group GroupedDataFrame is created.

A GroupedDataFrame also supports indexing by groups, map (which applies a function to each group) and combine (which applies a function to each group and combines the result into a data frame).

GroupedDataFrame also supports the dictionary interface. The keys are GroupKey objects returned by keys(::GroupedDataFrame), which can also be used to get the values of the grouping columns for each group. Tuples and NamedTuples containing the values of the grouping columns (in the same order as the cols argument) are also accepted as indices, but this will be slower than using the equivalent GroupKey.
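
The effect of the sort and skipmissing keywords, and lookup by a Tuple of key values, can be sketched as follows. The data and variable names are made up for illustration, and the snippet assumes a DataFrames.jl version that accepts the keywords documented above.

```julia
using DataFrames

# Hypothetical data: the grouping column contains one missing value.
df = DataFrame(a = [2, missing, 1, 2], b = 1:4)

gd_all  = groupby(df, :a)                      # the missing key forms its own group
gd_skip = groupby(df, :a, skipmissing = true)  # rows with a missing key are dropped
gd_sort = groupby(df, :a, sort = true, skipmissing = true)

n_all  = length(gd_all)         # number of groups, including the missing one
n_skip = length(gd_skip)        # number of groups once missing keys are skipped
first_key = gd_sort[1].a[1]     # with sort=true the smallest key comes first

# Dictionary-style lookup using a Tuple of grouping values.
rows_for_2 = nrow(gd_skip[(2,)])
```

Indexing with a Tuple is convenient for ad hoc lookups; as noted above, a GroupKey obtained from keys(gd) is the faster choice inside loops.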

See also

combine, select, select!, transform, transform!

Examples

julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

julia> gd = groupby(df, :a)
GroupedDataFrame with 4 groups based on key: a
First Group (2 rows): a = 1
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │
⋮
Last Group (2 rows): a = 4
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │

julia> gd[1]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │

julia> last(gd)
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │

julia> gd[(a=3,)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │

julia> gd[(3,)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │

julia> k = first(keys(gd))
GroupKey: (a = 1,)

julia> gd[k]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │

julia> for g in gd
           println(g)
       end
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 2     │ 5     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 2     │
│ 2   │ 2     │ 1     │ 6     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 3     │
│ 2   │ 3     │ 2     │ 7     │
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 4     │ 1     │ 4     │
│ 2   │ 4     │ 1     │ 8     │
source
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of parent(gd).

Rows appearing in group gd[i] are attributed index i. Rows not present in any group are attributed missing (this can happen if skipmissing=true was passed when creating gd, or if gd is a subset from a larger GroupedDataFrame).
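
A minimal sketch of the missing case described above, using made-up data: one row has a missing grouping value and is excluded by skipmissing=true, so it receives no group index.

```julia
using DataFrames

# Hypothetical data: row 2 has a missing grouping value.
df = DataFrame(a = [1, missing, 2, 1], b = 1:4)
gd = groupby(df, :a, skipmissing = true)

# One entry per row of parent(gd); the skipped row is attributed missing.
idx = groupindices(gd)
```

Here rows 1 and 4 share a group, so they receive the same index, while row 2 is attributed missing.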

source
DataFrames.groupcolsFunction.
groupcols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) used for grouping.

source
DataFrames.valuecolsFunction.
valuecols(gd::GroupedDataFrame)

Return a vector of Symbol column names in parent(gd) not used for grouping.
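
As a small illustration (the data frame is made up), groupcols and valuecols together partition the column names of the parent data frame:

```julia
using DataFrames

# Hypothetical data frame grouped on a single column.
df = DataFrame(a = [1, 1, 2], b = [1, 2, 3], c = [4, 5, 6])
gd = groupby(df, :a)

gcols = groupcols(gd)   # columns used for grouping
vcols = valuecols(gd)   # all remaining columns
```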

source
Base.keysFunction.
keys(gd::GroupedDataFrame)

Get the set of keys for each group of the GroupedDataFrame gd as a GroupKeys object. Each key is a GroupKey, which behaves like a NamedTuple holding the values of the grouping columns for a given group. Unlike the equivalent Tuple and NamedTuple, these keys can be used to index into gd efficiently. The ordering of the keys is identical to the ordering of the groups of gd under iteration and integer indexing.

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[4]),
                      b = repeat([2, 1], outer=[6]),
                      c = 1:12);

julia> gd = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = :foo, b = 2
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 2     │ 7     │
⋮
Last Group (2 rows): a = :baz, b = 1
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 1     │ 6     │
│ 2   │ baz    │ 1     │ 12    │

julia> keys(gd)
6-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (a = :foo, b = 2)
 GroupKey: (a = :bar, b = 1)
 GroupKey: (a = :baz, b = 2)
 GroupKey: (a = :foo, b = 1)
 GroupKey: (a = :bar, b = 2)
 GroupKey: (a = :baz, b = 1)

GroupKey objects behave similarly to NamedTuples:

julia> k = keys(gd)[1]
GroupKey: (a = :foo, b = 2)

julia> keys(k)
(:a, :b)

julia> values(k)  # Same as Tuple(k)
(:foo, 2)

julia> NamedTuple(k)
(a = :foo, b = 2)

julia> k.a
:foo

julia> k[:a]
:foo

julia> k[1]
:foo

Keys can be used as indices to retrieve the corresponding group from their GroupedDataFrame:

julia> gd[k]
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 2     │ 7     │

julia> gd[keys(gd)[1]] == gd[1]
true
source
keys(dfc::DataFrameColumns)

Get a vector of column names of dfc as Symbols.

source
Base.getFunction.
get(gd::GroupedDataFrame, key, default)

Get a group based on the values of the grouping columns.

key may be a NamedTuple or Tuple of grouping column values (in the same order as the cols argument to groupby).

Examples

julia> df = DataFrame(a = repeat([:foo, :bar, :baz], outer=[2]),
                      b = repeat([2, 1], outer=[3]),
                      c = 1:6);

julia> gd = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = :foo
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ foo    │ 2     │ 1     │
│ 2   │ foo    │ 1     │ 4     │
⋮
Last Group (2 rows): a = :baz
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 2     │ 3     │
│ 2   │ baz    │ 1     │ 6     │

julia> get(gd, (a=:bar,), nothing)
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ bar    │ 1     │ 2     │
│ 2   │ bar    │ 2     │ 5     │

julia> get(gd, (:baz,), nothing)
2×3 SubDataFrame
│ Row │ a      │ b     │ c     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ baz    │ 2     │ 3     │
│ 2   │ baz    │ 1     │ 6     │

julia> get(gd, (:qux,), nothing)
source
DataFrames.stackFunction.
stack(df::AbstractDataFrame, [measure_vars], [id_vars];
      variable_name=:variable, value_name=:value,
      view::Bool=false, variable_eltype::Type=CategoricalValue{String})

Stack a data frame df, i.e. convert it from wide to long format.

Return the long-format DataFrame with: columns for each of the id_vars, column value_name (:value by default) holding the values of the stacked columns (measure_vars), and column variable_name (:variable by default) holding the name of the corresponding measure_vars variable.

If view=true then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors that return views into the original data frame.

Arguments

  • df : the AbstractDataFrame to be stacked
  • measure_vars : the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars is given, measure_vars defaults to all floating point columns.
  • id_vars : the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars.
  • variable_name : the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
  • value_name : the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
  • view : whether the stacked data frame should be a view rather than contain freshly allocated vectors.
  • variable_eltype : determines the element type of column variable_name. By default a categorical vector of strings is created. If variable_eltype=Symbol it is a vector of Symbol, and if variable_eltype=String a vector of String is produced.
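
The roles of variable_name and value_name can be sketched with a small deterministic data frame. The column names id, x, y, measure and reading are made up for illustration.

```julia
using DataFrames

# Hypothetical wide-format data: one identifier and two measurement columns.
wide = DataFrame(id = 1:2, x = [1.0, 2.0], y = [3.0, 4.0])

# Stack x and y, repeating id, and choose custom names for the two new columns.
long = stack(wide, [:x, :y], [:id];
             variable_name = :measure, value_name = :reading)

col_names = Symbol.(names(long))   # id column, then the variable and value columns
labels    = string.(long.measure)  # source column name for each stacked value
```

The measurement columns are stacked one after another, so all x values appear before any y value.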

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = stack(d1, Not([:a, :b, :e]))
d1s_name = stack(d1, Not([:a, :b, :e]), variable_name=:somemeasure)
source
DataFrames.unstackFunction.
unstack(df::AbstractDataFrame, rowkeys, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame, colkey, value; renamecols::Function=identity)
unstack(df::AbstractDataFrame; renamecols::Function=identity)

Unstack data frame df, i.e. convert it from long to wide format.

If colkey contains missing values then they will be skipped and a warning will be printed.

If a combination of rowkeys and colkey contains duplicate entries then the last value will be retained and a warning will be printed.

Arguments

  • df : the AbstractDataFrame to be unstacked
  • rowkeys : the columns with a unique key for each row; if not given, a key is found by grouping on all columns other than colkey and value. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • colkey : the column (Symbol, string or integer) holding the column names in wide format, defaults to :variable
  • value : the value column (Symbol, string or integer), defaults to :value
  • renamecols : a function called on each unique value in colkey which must return the name of the column to be created (typically as a string or a Symbol). Duplicate names are not allowed.
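
A minimal sketch of the four-argument form and of renamecols, using a small hand-written long table. The data and the col_ prefix are made up, and the snippet assumes the renamecols keyword is available as documented above.

```julia
using DataFrames

# Hypothetical long-format data: one row per (id, variable) pair.
long = DataFrame(id       = [1, 1, 2, 2],
                 variable = ["x", "y", "x", "y"],
                 value    = [1.0, 3.0, 2.0, 4.0])

# One output row per id, one output column per distinct value of :variable.
wide = unstack(long, :id, :variable, :value)

# renamecols receives each distinct colkey value and returns the new column name.
wide_renamed = unstack(long, :id, :variable, :value;
                       renamecols = v -> "col_" * string(v))

wide_names    = Symbol.(names(wide))
renamed_names = Symbol.(names(wide_renamed))
```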

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)
wide3 = unstack(long, [:id, :a], :variable, :value)
wide4 = unstack(long, :id, :variable, :value, renamecols=x->Symbol(:_, x))

Note that there are some differences between the widened results above.

source

Basics

Missings.allowmissingFunction.
allowmissing(df::AbstractDataFrame, cols=:)

Return a copy of data frame df with columns cols converted to element type Union{T, Missing} from T to allow support for missing values.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

Examples

julia> df = DataFrame(a=[1,2])
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> allowmissing(df)
2×1 DataFrame
│ Row │ a      │
│     │ Int64? │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 2      │
source
allowmissing!(df::DataFrame, cols=:)

Convert columns cols of data frame df from element type T to Union{T, Missing} to support missing values.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.
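
A short sketch of the in-place variant on made-up data: only the selected column is converted, after which it can hold missing.

```julia
using DataFrames

# Hypothetical data frame with no missing support in either column.
df = DataFrame(a = [1, 2], b = ["x", "y"])

allowmissing!(df, :a)   # convert only column a, in place
df[1, :a] = missing     # now permitted for column a

a_type = eltype(df.a)   # Union{Missing, Int}
b_type = eltype(df.b)   # unchanged
```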

source
Base.append!Function.
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
        promote::Bool=(cols in [:union, :subset]))
append!(df::DataFrame, table; cols::Symbol=:setequal,
        promote::Bool=(cols in [:union, :subset]))

Add the rows of df2 to the end of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.

The exact behavior of append! depends on the cols argument:

  • If cols == :setequal (this is the default) then df2 must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then df2 must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if df2 is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then df2 may contain more columns than df, but all column names that are present in df must be present in df2 and only these are used.
  • If cols == :subset then append! behaves as for :intersect, but if some column is missing in df2 then a missing value is pushed to df.
  • If cols == :union then append! adds columns missing in df that are present in df2; for columns present in df but missing in df2 a missing value is pushed.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

The above rule has the following exceptions:

  • If df has no columns then copies of columns from df2 are added to it.
  • If df2 has no columns then calling append! leaves df unchanged.

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).
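
The interaction of cols=:union with the default promote=true can be sketched as follows. The data is made up, and the snippet assumes the cols keyword is supported as documented above.

```julia
using DataFrames

# Hypothetical inputs: df2 lacks column B, adds column C, and holds a Float64 in A.
df1 = DataFrame(A = [1, 2], B = ["a", "b"])
df2 = DataFrame(A = [3.5], C = [true])

# With cols=:union, promote defaults to true, so column types may be widened.
append!(df1, df2, cols = :union)

a_type = eltype(df1.A)   # widened so that 3.5 can be stored
```

Column B receives a missing value for the appended row, and the new column C holds missing for the rows that were already present.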

See also

Use push! to add individual rows to a data frame and vcat to vertically concatenate data frames.

Examples

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4.0:6.0, B=4:6);

julia> append!(df1, df2);

julia> df1
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │
source
categorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
            compress::Bool=false)

Return a copy of data frame df with columns cols converted to CategoricalVector.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) or a Type.

If categorical is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.

If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.

All created CategoricalVectors are unordered.

Examples

julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │

julia> categorical(df)
2×2 DataFrame
│ Row │ a     │ b    │
│     │ Int64 │ Cat… │
├─────┼───────┼──────┤
│ 1   │ 1     │ a    │
│ 2   │ 2     │ b    │

julia> categorical(df, :)
2×2 DataFrame
│ Row │ a    │ b    │
│     │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1   │ 1    │ a    │
│ 2   │ 2    │ b    │
source
categorical!(df::DataFrame, cols=Union{AbstractString, Missing};
             compress::Bool=false)

Change columns selected by cols in data frame df to CategoricalVector.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers) or a Type.

If categorical! is called with the cols argument being a Type, then all columns whose element type is a subtype of this type (by default Union{AbstractString, Missing}) will be converted to categorical.

If the compress keyword argument is set to true then the created CategoricalVectors will be compressed.

All created CategoricalVectors are unordered.

Examples

julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X      │ Y     │ Z      │
│     │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1   │ a      │ 1     │ p      │
│ 2   │ b      │ 2     │ q      │

julia> categorical!(df)
2×3 DataFrame
│ Row │ X    │ Y     │ Z    │
│     │ Cat… │ Int64 │ Cat… │
├─────┼──────┼───────┼──────┤
│ 1   │ a    │ 1     │ p    │
│ 2   │ b    │ 2     │ q    │

julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
 CategoricalValue{String,UInt32}
 Int64
 CategoricalValue{String,UInt32}

julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X      │ Y     │ Z      │
│     │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1   │ a      │ 1     │ p      │
│ 2   │ b      │ 2     │ q      │

julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X      │ Y    │ Z      │
│     │ String │ Cat… │ String │
├─────┼────────┼──────┼────────┤
│ 1   │ a      │ 1    │ p      │
│ 2   │ b      │ 2    │ q      │

julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
 String
 CategoricalValue{Int64,UInt8}
 String
source
completecases(df::AbstractDataFrame, cols=:)

Return a Boolean vector with true entries indicating rows without missing values (complete cases) in data frame df.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

See also: dropmissing and dropmissing!. Use findall(completecases(df)) to get the indices of the rows.
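
As a minimal sketch on made-up data, the Boolean vector can be used directly as a row mask, or turned into row indices with findall as suggested above:

```julia
using DataFrames

# Hypothetical data: the first row contains a missing value.
df = DataFrame(i = 1:3, x = [missing, 4, 2])

mask     = completecases(df)   # true for rows without missing values
complete = df[mask, :]         # copy of the complete rows only
row_ids  = findall(mask)       # indices of the complete rows
```

Unlike dropmissing, indexing with the mask leaves the element types of the columns unchanged.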

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64?  │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> completecases(df)
5-element BitArray{1}:
 false
 false
 false
  true
  true

julia> completecases(df, :x)
5-element BitArray{1}:
 false
  true
 false
  true
  true

julia> completecases(df, [:x, :y])
5-element BitArray{1}:
 false
 false
 false
  true
  true
source
Base.copyFunction.
copy(df::DataFrame; copycols::Bool=true)

Copy data frame df. If copycols=true (the default), return a new DataFrame holding copies of column vectors in df. If copycols=false, return a new DataFrame sharing column vectors with df.
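
The difference between the two settings of copycols can be checked with ===, as in this small sketch on made-up data:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

deep   = copy(df)                    # new column vectors
shared = copy(df, copycols = false)  # same column vectors as df

deep_is_alias   = deep.a === df.a
shared_is_alias = shared.a === df.a

# Mutating df is visible through the shared copy but not the deep one.
df.a[1] = 100
```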

source
copy(dfr::DataFrameRow)

Construct a NamedTuple with the same contents as the DataFrameRow. This method returns a NamedTuple so that the returned object is not affected by changes to the parent data frame of which dfr is a view.
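
A short sketch of this behaviour, on made-up data: the DataFrameRow follows changes to its parent, while the copied NamedTuple does not.

```julia
using DataFrames

df  = DataFrame(a = [1, 2], b = ["x", "y"])
row = df[1, :]          # DataFrameRow: a view into df
snapshot = copy(row)    # NamedTuple: detached from df

df[1, :a] = 10          # change the parent data frame

row_value      = row.a        # the view sees the change
snapshot_value = snapshot.a   # the NamedTuple keeps the old value
```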

source
DataFrames.DataFrame!Function.
DataFrame!(args...; kwargs...)

Equivalent to DataFrame(args...; copycols=false, kwargs...).

If kwargs contains the copycols keyword argument an error is thrown.

Examples

julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame!(df1);

julia> df1.a === df2.a
true
source
Base.delete!Function.
delete!(df::DataFrame, inds)

Delete rows specified by inds from a DataFrame df in place and return it.

Internally deleteat! is called for all columns so inds must be: a vector of sorted and unique integers, a boolean vector, an integer, or Not.

Examples

julia> d = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> delete!(d, 2)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 3     │ 6     │
source
DataAPI.describeFunction.
describe(df::AbstractDataFrame; cols=:)
describe(df::AbstractDataFrame, stats::Union{Symbol, Pair}...; cols=:)

Return descriptive statistics for a data frame as a new DataFrame where each row represents a variable and each column a summary statistic.

Arguments

  • df : the AbstractDataFrame
  • stats::Union{Symbol, Pair}... : the summary statistics to report. Arguments can be:
    • A symbol from the list :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing. The default statistics used are :mean, :min, :median, :max, :nunique, :nmissing, and :eltype.
    • :all as the only Symbol argument to return all statistics.
    • A name => function pair where name is a Symbol or string. This will create a column of summary statistics with the provided name.
  • cols : a keyword argument allowing to select only a subset of columns from df to describe. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Details

For Real columns, compute the mean, standard deviation, minimum, first quantile, median, third quantile, and maximum. If a column does not derive from Real, describe will attempt to calculate all statistics, using nothing as a fall-back in the case of an error.

When stats contains :nunique, describe will report the number of unique values in a column. If a column's base type derives from Real, :nunique will return nothings.

Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable. If the column does not allow missing values, nothing is returned. Consequently, nmissing = 0 indicates that the column allows missing values, but does not currently contain any.

If custom functions are provided, they are called repeatedly with the vector corresponding to each column as the only argument. For columns allowing for missing values, the vector is wrapped in a call to skipmissing: custom functions must therefore support such objects (and not only vectors), and cannot access missing values.

Examples

julia> df = DataFrame(i=1:10, x=0.1:0.1:1.0, y='a':'j')
10×3 DataFrame
│ Row │ i     │ x       │ y    │
│     │ Int64 │ Float64 │ Char │
├─────┼───────┼─────────┼──────┤
│ 1   │ 1     │ 0.1     │ 'a'  │
│ 2   │ 2     │ 0.2     │ 'b'  │
│ 3   │ 3     │ 0.3     │ 'c'  │
│ 4   │ 4     │ 0.4     │ 'd'  │
│ 5   │ 5     │ 0.5     │ 'e'  │
│ 6   │ 6     │ 0.6     │ 'f'  │
│ 7   │ 7     │ 0.7     │ 'g'  │
│ 8   │ 8     │ 0.8     │ 'h'  │
│ 9   │ 9     │ 0.9     │ 'i'  │
│ 10  │ 10    │ 1.0     │ 'j'  │

julia> describe(df)
3×8 DataFrame
│ Row │ variable │ mean   │ min │ median │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Union… │ Any │ Union… │ Any │ Union…  │ Nothing  │ DataType │
├─────┼──────────┼────────┼─────┼────────┼─────┼─────────┼──────────┼──────────┤
│ 1   │ i        │ 5.5    │ 1   │ 5.5    │ 10  │         │          │ Int64    │
│ 2   │ x        │ 0.55   │ 0.1 │ 0.55   │ 1.0 │         │          │ Float64  │
│ 3   │ y        │        │ 'a' │        │ 'j' │ 10      │          │ Char     │

julia> describe(df, :min, :max)
3×3 DataFrame
│ Row │ variable │ min │ max │
│     │ Symbol   │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1   │ i        │ 1   │ 10  │
│ 2   │ x        │ 0.1 │ 1.0 │
│ 3   │ y        │ 'a' │ 'j' │

julia> describe(df, :min, :sum => sum)
3×3 DataFrame
│ Row │ variable │ min │ sum │
│     │ Symbol   │ Any │ Any │
├─────┼──────────┼─────┼─────┤
│ 1   │ i        │ 1   │ 55  │
│ 2   │ x        │ 0.1 │ 5.5 │
│ 3   │ y        │ 'a' │     │

julia> describe(df, :min, :sum => sum, cols=:x)
1×3 DataFrame
│ Row │ variable │ min     │ sum     │
│     │ Symbol   │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┤
│ 1   │ x        │ 0.1     │ 5.5     │
source
disallowmissing(df::AbstractDataFrame, cols=:; error::Bool=true)

Return a copy of data frame df with columns cols converted from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.

Examples

julia> df = DataFrame(a=Union{Int,Missing}[1,2])
2×1 DataFrame
│ Row │ a      │
│     │ Int64? │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 2      │

julia> disallowmissing(df)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df = DataFrame(a=[1,missing], b=Union{Int,Missing}[1,2])
2×2 DataFrame
│ Row │ a       │ b      │
│     │ Int64?  │ Int64? │
├─────┼─────────┼────────┤
│ 1   │ 1       │ 1      │
│ 2   │ missing │ 2      │

julia> disallowmissing(df, error=false)
2×2 DataFrame
│ Row │ a       │ b     │
│     │ Int64?  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 1     │
│ 2   │ missing │ 2     │
source
disallowmissing!(df::DataFrame, cols=:; error::Bool=true)

Convert columns cols of data frame df from element type Union{T, Missing} to T to drop support for missing values.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If cols is omitted all columns in the data frame are converted.

If error=false then columns containing a missing value will be skipped instead of throwing an error.
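
A minimal sketch of error=false with the in-place variant, on made-up data: the column that actually contains a missing value is left as it is, and the other column is narrowed.

```julia
using DataFrames

# Hypothetical data: both columns allow missing, but only b contains one.
df = DataFrame(a = Union{Int, Missing}[1, 2], b = [1, missing])

disallowmissing!(df, error = false)

a_type = eltype(df.a)   # narrowed to Int
b_type = eltype(df.b)   # still allows missing
```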

source
dropmissing(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)

Return a copy of data frame df excluding rows with missing values.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If disallowmissing is true (the default) then columns specified in cols will be converted so as not to allow for missing values using disallowmissing!.

See also: completecases and dropmissing!.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64?  │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing(df)
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing(df, disallowmissing=false)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing(df, :x)
3×3 DataFrame
│ Row │ i     │ x     │ y       │
│     │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 4     │ missing │
│ 2   │ 4     │ 2     │ d       │
│ 3   │ 5     │ 1     │ e       │

julia> dropmissing(df, [:x, :y])
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │
source
dropmissing!(df::AbstractDataFrame, cols=:; disallowmissing::Bool=true)

Remove rows with missing values from data frame df and return it.

If cols is provided, only missing values in the corresponding columns are considered. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If disallowmissing is true (the default) then the cols columns will get converted using disallowmissing!.

See also: dropmissing and completecases.

Examples

julia> df = DataFrame(i = 1:5,
                      x = [missing, 4, missing, 2, 1],
                      y = [missing, missing, "c", "d", "e"])
5×3 DataFrame
│ Row │ i     │ x       │ y       │
│     │ Int64 │ Int64?  │ String? │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ 4       │ missing │
│ 3   │ 3     │ missing │ c       │
│ 4   │ 4     │ 2       │ d       │
│ 5   │ 5     │ 1       │ e       │

julia> dropmissing!(copy(df))
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │

julia> dropmissing!(copy(df), disallowmissing=false)
2×3 DataFrame
│ Row │ i     │ x      │ y       │
│     │ Int64 │ Int64? │ String? │
├─────┼───────┼────────┼─────────┤
│ 1   │ 4     │ 2      │ d       │
│ 2   │ 5     │ 1      │ e       │

julia> dropmissing!(copy(df), :x)
3×3 DataFrame
│ Row │ i     │ x     │ y       │
│     │ Int64 │ Int64 │ String? │
├─────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 4     │ missing │
│ 2   │ 4     │ 2     │ d       │
│ 3   │ 5     │ 1     │ e       │

julia> dropmissing!(copy(df), [:x, :y])
2×3 DataFrame
│ Row │ i     │ x     │ y      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 4     │ 2     │ d      │
│ 2   │ 5     │ 1     │ e      │
source
Compat.eachcolFunction.
eachcol(df::AbstractDataFrame)

Return a DataFrameColumns object, an AbstractVector that allows iterating over an AbstractDataFrame column by column. A DataFrameColumns can additionally be indexed by column name.
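
Indexing by column name and iterating over name and column pairs can be sketched as follows. The data is made up, and the use of pairs on the result of eachcol is an assumption about the installed DataFrames.jl version.

```julia
using DataFrames

df   = DataFrame(x = 1:3, y = 4:6)
cols = eachcol(df)

# Indexing by column name returns the stored column vector itself.
y_is_same = cols[:y] === df.y

# Build a per-column summary keyed by column name.
totals = Dict(name => sum(col) for (name, col) in pairs(cols))
```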

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
 [1, 2, 3, 4]
 [11, 12, 13, 14]

julia> map(eachcol(df)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
 3
 3

julia> sum.(eachcol(df))
2-element Array{Int64,1}:
 10
 50
source
Compat.eachrowFunction.
eachrow(df::AbstractDataFrame)

Return a DataFrameRows that iterates a data frame row by row, with each row represented as a DataFrameRow.

Because DataFrameRows have an eltype of Any, use copy(dfr::DataFrameRow) to obtain a named tuple, which supports iteration and property access like a DataFrameRow, but also passes information on the eltypes of the columns of df.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> eachrow(df)
4-element DataFrameRows:
 DataFrameRow (row 1)
x  1
y  11
 DataFrameRow (row 2)
x  2
y  12
 DataFrameRow (row 3)
x  3
y  13
 DataFrameRow (row 4)
x  4
y  14

julia> copy.(eachrow(df))
4-element Array{NamedTuple{(:x, :y),Tuple{Int64,Int64}},1}:
 (x = 1, y = 11)
 (x = 2, y = 12)
 (x = 3, y = 13)
 (x = 4, y = 14)

julia> eachrow(view(df, [4,3], [2,1]))
2-element DataFrameRows:
 DataFrameRow (row 4)
y  14
x  4
 DataFrameRow (row 3)
y  13
x  3
source
Base.filterFunction.
filter(function, df::AbstractDataFrame)
filter(cols => function, df::AbstractDataFrame)

Return a copy of data frame df containing only rows for which function returns true.

If cols is not specified then the function is passed DataFrameRows.

If cols is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Passing cols leads to a more efficient execution of the operation for large data frames.

See also: filter!

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter(row -> row.x > 1, df)
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │

julia> filter(:x => x -> x > 1, df)
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │

julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
3×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 1     │ b      │

julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 1     │ b      │
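
As a quick check of the two calling conventions described above, the following sketch (reusing the example df) compares the row-based form with the cols => function form and confirms that the input is left untouched.

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

# The function receives a DataFrameRow
by_row = filter(row -> row.x > 1, df)

# The function receives the value stored in column :x
by_cols = filter(:x => x -> x > 1, df)

# Both forms keep the same rows, and df itself still has all four rows
same_result = by_row == by_cols
rows_in_df = nrow(df)
```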
source
Base.filter!Function.
filter!(function, df::AbstractDataFrame)
filter!(cols => function, df::AbstractDataFrame)

Remove rows from data frame df for which function returns false.

If cols is not specified then the function is passed DataFrameRows. If cols is specified then the function is passed elements of the corresponding columns as separate positional arguments, unless cols is an AsTable selector, in which case a NamedTuple of these arguments is passed. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers), and column duplicates are allowed if a vector of Symbols, strings, or integers is passed.

Passing cols leads to a more efficient execution of the operation for large data frames.

See also: filter

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> filter!(row -> row.x > 1, df);

julia> df
2×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │

julia> filter!(:x => x -> x == 3, df);

julia> df
1×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!([:x, :y] => (x, y) -> x == 1 || y == "b", df);

julia> df
3×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 1     │ b      │

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 1     │ b      │
source
DataFrames.flattenFunction.
flatten(df::AbstractDataFrame, cols)

When columns cols of data frame df have iterable elements that define length (for example a Vector of Vectors), return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector where the original entries are concatenated. Elements of row i of df in columns other than cols will be repeated according to the length of df[i, col]. These lengths must therefore be the same for each col in cols, or else an error is raised. Note that these elements are not copied, and thus if they are mutable changing them in the returned DataFrame will affect df.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Examples

julia> df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])
2×3 DataFrame
│ Row │ a     │ b      │ c      │
│     │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1   │ 1     │ [1, 2] │ [5, 6] │
│ 2   │ 2     │ [3, 4] │ [7, 8] │

julia> flatten(df1, :b)
4×3 DataFrame
│ Row │ a     │ b     │ c      │
│     │ Int64 │ Int64 │ Array… │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 1     │ [5, 6] │
│ 2   │ 1     │ 2     │ [5, 6] │
│ 3   │ 2     │ 3     │ [7, 8] │
│ 4   │ 2     │ 4     │ [7, 8] │

julia> flatten(df1, [:b, :c])
4×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 5     │
│ 2   │ 1     │ 2     │ 6     │
│ 3   │ 2     │ 3     │ 7     │
│ 4   │ 2     │ 4     │ 8     │

julia> df2 = DataFrame(a = [1, 2], b = [("p", "q"), ("r", "s")])
2×2 DataFrame
│ Row │ a     │ b          │
│     │ Int64 │ Tuple…     │
├─────┼───────┼────────────┤
│ 1   │ 1     │ ("p", "q") │
│ 2   │ 2     │ ("r", "s") │

julia> flatten(df2, :b)
4×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ p      │
│ 2   │ 1     │ q      │
│ 3   │ 2     │ r      │
│ 4   │ 2     │ s      │

julia> df3 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7]])
2×3 DataFrame
│ Row │ a     │ b      │ c      │
│     │ Int64 │ Array… │ Array… │
├─────┼───────┼────────┼────────┤
│ 1   │ 1     │ [1, 2] │ [5, 6] │
│ 2   │ 2     │ [3, 4] │ [7]    │

julia> flatten(df3, [:b, :c])
ERROR: ArgumentError: Lengths of iterables stored in columns :b and :c
are not the same in row 2
source
Base.hcatFunction.
hcat(df::AbstractDataFrame...;
     makeunique::Bool=false, copycols::Bool=true)
hcat(df::AbstractDataFrame..., vs::AbstractVector;
     makeunique::Bool=false, copycols::Bool=true)
hcat(vs::AbstractVector, df::AbstractDataFrame;
     makeunique::Bool=false, copycols::Bool=true)

Horizontally concatenate AbstractDataFrames and optionally AbstractVectors.

If an AbstractVector is passed then a column name for it is automatically generated as :x1 by default.

If makeunique=false (the default) column names of passed objects must be unique. If makeunique=true then duplicate column names will be suffixed with _i (i starting at 1 for the first duplicate).

If copycols=true (the default) then the DataFrame returned by hcat will contain copied columns from the source data frames. If copycols=false then it will contain columns as they are stored in the source (without copying). This option should be used with caution as mutating either the columns in sources or in the returned DataFrame might lead to the corruption of the other object.

Example

julia> [DataFrame(A=1:3) DataFrame(B=1:3)]
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> df3 = hcat(df1, df2, makeunique=true)
3×4 DataFrame
│ Row │ A     │ B     │ A_1   │ B_1   │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 4     │
│ 2   │ 2     │ 2     │ 5     │ 5     │
│ 3   │ 3     │ 3     │ 6     │ 6     │

julia> df3.A === df1.A
false

julia> df3 = hcat(df1, df2, makeunique=true, copycols=false);

julia> df3.A === df1.A
true
source
DataFrames.insertcols!Function.
insertcols!(df::DataFrame, [ind::Int], (name=>col)::Pair...;
            makeunique::Bool=false, copycols::Bool=true)

Insert a column into a data frame in place. Return the updated DataFrame. If ind is omitted it is set to ncol(df)+1 (the column is inserted as the last column).

Arguments

  • df : the DataFrame to which we want to add columns
  • ind : a position at which we want to insert a column
  • name : the name of the new column
  • col : an AbstractVector giving the contents of the new column, or a value of any type other than AbstractArray, which will be repeated to fill a new vector. As a special case, values stored in a Ref or a 0-dimensional AbstractArray are unwrapped and treated in the same way.
  • makeunique : Defines what to do if name already exists in df; if it is false an error will be thrown; if it is true a new unique name will be generated by adding a suffix
  • copycols : whether vectors passed as columns should be copied

If col is an AbstractRange then the result of collect(col) is inserted.

Examples

julia> d = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> insertcols!(d, 1, :b => 'a':'c')
3×2 DataFrame
│ Row │ b    │ a     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 1     │
│ 2   │ 'b'  │ 2     │
│ 3   │ 'c'  │ 3     │

julia> insertcols!(d, 2, :c => 2:4, :c => 3:5, makeunique=true)
3×4 DataFrame
│ Row │ b    │ c     │ c_1   │ a     │
│     │ Char │ Int64 │ Int64 │ Int64 │
├─────┼──────┼───────┼───────┼───────┤
│ 1   │ 'a'  │ 2     │ 3     │ 1     │
│ 2   │ 'b'  │ 3     │ 4     │ 2     │
│ 3   │ 'c'  │ 4     │ 5     │ 3     │
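
The rules for col given in the argument list can be sketched as follows. The column names :flag and :pair are made up for illustration; wrapping a vector in a Ref is one way to store that vector as the value of every row instead of treating it as the column itself.

```julia
using DataFrames

d = DataFrame(a=1:3)

# A scalar is repeated to match the number of rows
insertcols!(d, :flag => true)

# A Ref is unwrapped and its content is repeated in every row
insertcols!(d, :pair => Ref([1, 2]))
```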
source
Base.lengthFunction.
length(dfr::DataFrameRow)

Return the number of elements of dfr.

See also: size

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];

julia> length(dfr)
2
source
DataFrames.mapcolsFunction.
mapcols(f::Union{Function,Type}, df::AbstractDataFrame)

Return a DataFrame where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

Note that mapcols guarantees not to reuse the columns from df in the returned DataFrame. If f returns its argument then it gets copied before being stored.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> mapcols(x -> x.^2, df)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 121   │
│ 2   │ 4     │ 144   │
│ 3   │ 9     │ 169   │
│ 4   │ 16    │ 196   │
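
Two further points from the description, scalar results and the no-reuse guarantee, can be checked with a short sketch on the same df:

```julia
using DataFrames

df = DataFrame(x=1:4, y=11:14)

# f returns a scalar for every column, so the result has a single row
totals = mapcols(sum, df)

# identity returns its argument, which mapcols copies before storing
same = mapcols(identity, df)
equal_values = same.x == df.x
is_alias = same.x === df.x
```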
source
DataFrames.mapcols!Function.
mapcols!(f::Union{Function,Type}, df::DataFrame)

Update a DataFrame in-place where each column of df is transformed using function f. f must return AbstractVector objects all with the same length or scalars (all values other than AbstractVector are considered to be a scalar).

Note that mapcols! reuses the columns from df if they are returned by f.

Examples

julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> mapcols!(x -> x.^2, df);

julia> df
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 121   │
│ 2   │ 4     │ 144   │
│ 3   │ 9     │ 169   │
│ 4   │ 16    │ 196   │
source
Base.namesFunction.
names(df::AbstractDataFrame)
names(df::AbstractDataFrame, cols)

Return a freshly allocated Vector{String} of names of columns contained in df.

If cols is passed then restrict returned column names to those matching the selector (this is useful in particular with regular expressions, Not, and Between). cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

See also propertynames which returns a Vector{Symbol}.
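
Examples

A minimal sketch of the selector forms mentioned above; the data frame and its column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(id = 1:2, x1 = [0.1, 0.2], x2 = [0.3, 0.4], label = ["a", "b"])

all_names = names(df)                        # every column name, as strings
x_names = names(df, r"^x")                   # names matching a regular expression
no_id = names(df, Not(:id))                  # everything except :id
x_range = names(df, Between(:x1, :x2))       # a contiguous range of columns
```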

source
DataFrames.ncolFunction.
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFrame df.

See also size.

Examples

julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));

julia> size(df)
(10, 3)

julia> nrow(df)
10

julia> ncol(df)
3
source
Base.ndimsFunction.
ndims(::AbstractDataFrame)
ndims(::Type{<:AbstractDataFrame})

Return the number of dimensions of a data frame, which is always 2.

source
ndims(::DataFrameRow)
ndims(::Type{<:DataFrameRow})

Return the number of dimensions of a data frame row, which is always 1.

source
DataFrames.nonuniqueFunction.
nonunique(df::AbstractDataFrame)
nonunique(df::AbstractDataFrame, cols)

Return a Vector{Bool} in which true entries indicate duplicate rows. A row is a duplicate if there exists a prior row with all columns containing equal values (according to isequal).

See also unique and unique!.

Arguments

  • df : AbstractDataFrame
  • cols : a selector specifying the column(s) to compare. Can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
nonunique(df)
nonunique(df, 1)
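
Because the example above uses random data, its output varies from run to run. A deterministic sketch with made-up values shows which rows get flagged:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 1, 2], b = ["x", "y", "x", "z"])

# Row 3 repeats row 1 in all columns
dup_all = nonunique(df)

# Restricting the comparison to :a also flags row 4, which repeats the 2 from row 2
dup_a = nonunique(df, :a)

# Negating the mask keeps the first occurrence of every row
deduplicated = df[.!dup_all, :]
```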
source
DataFrames.nrowFunction.
nrow(df::AbstractDataFrame)
ncol(df::AbstractDataFrame)

Return the number of rows or columns in an AbstractDataFrame df.

See also size.

Examples

julia> df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10));

julia> size(df)
(10, 3)

julia> nrow(df)
10

julia> ncol(df)
3
source
DataFrames.orderFunction.
order(col::ColumnIndex; kwargs...)

Specify sorting order for a column col in a data frame. kwargs can be lt, by, rev, and order with values following the rules defined in sort!.

See also: sort!, sort

Examples

julia> df = DataFrame(x = [-3, -1, 0, 2, 4], y = 1:5)
5×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ -3    │ 1     │
│ 2   │ -1    │ 2     │
│ 3   │ 0     │ 3     │
│ 4   │ 2     │ 4     │
│ 5   │ 4     │ 5     │

julia> sort(df, order(:x, rev=true))
5×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 4     │ 5     │
│ 2   │ 2     │ 4     │
│ 3   │ 0     │ 3     │
│ 4   │ -1    │ 2     │
│ 5   │ -3    │ 1     │

julia> sort(df, order(:x, by=abs))
5×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 0     │ 3     │
│ 2   │ -1    │ 2     │
│ 3   │ 2     │ 4     │
│ 4   │ -3    │ 1     │
│ 5   │ 4     │ 5     │
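
order is most useful when several columns need different rules. The sketch below, on made-up data, passes a vector of order specifications to sort: :g in reverse order, with ties broken by the absolute value of :x.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a", "b"], x = [-3, 1, 2, -4])

# Sort by :g descending, then by abs(x) ascending within each group
sorted = sort(df, [order(:g, rev=true), order(:x, by=abs)])
```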
source
Base.push!Function.
push!(df::DataFrame, row::Union{Tuple, AbstractArray}; promote::Bool=false)
push!(df::DataFrame, row::Union{DataFrameRow, NamedTuple, AbstractDict};
      cols::Symbol=:setequal, promote::Bool=(cols in [:union, :subset]))

Add in-place one row at the end of df taking the values from row.

Column types of df are preserved, and new values are converted if necessary. An error is thrown if conversion fails.

If row is neither a DataFrameRow, NamedTuple nor AbstractDict then it must be a Tuple or an AbstractArray and columns are matched by order of appearance. In this case row must contain the same number of elements as the number of columns in df.

If row is a DataFrameRow, NamedTuple or AbstractDict then values in row are matched to columns in df based on names. The exact behavior depends on the cols argument value in the following way:

  • If cols == :setequal (this is the default) then row must contain exactly the same columns as df (but possibly in a different order).
  • If cols == :orderequal then row must contain the same columns in the same order (for AbstractDict this option requires that keys(row) matches propertynames(df) to allow for support of ordered dicts; however, if row is a Dict an error is thrown as it is an unordered collection).
  • If cols == :intersect then row may contain more columns than df, but all column names that are present in df must be present in row and only they are used to populate a new row in df.
  • If cols == :subset then push! behaves as for :intersect, but if some column is missing in row then a missing value is pushed to df.
  • If cols == :union then columns missing in df that are present in row are added to df (using missing for existing rows) and a missing value is pushed to columns missing in row that are present in df.

If promote=true and element type of a column present in df does not allow the type of a pushed argument then a new column with a promoted element type allowing it is freshly allocated and stored in df. If promote=false an error is thrown.

As a special case, if df has no columns and row is a NamedTuple or DataFrameRow, columns are created for all values in row, using their names and order.

Please note that push! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===).

Examples

julia> df = DataFrame(A=1:3, B=1:3);

julia> push!(df, (true, false))
4×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │

julia> push!(df, df[1, :])
5×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │
│ 5   │ 1     │ 1     │

julia> push!(df, (C="something", A=true, B=false), cols=:intersect)
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 1     │ 0     │
│ 5   │ 1     │ 1     │
│ 6   │ 1     │ 0     │

julia> push!(df, Dict(:A=>1.0, :C=>1.0), cols=:union)
7×3 DataFrame
│ Row │ A       │ B       │ C        │
│     │ Float64 │ Int64?  │ Float64? │
├─────┼─────────┼─────────┼──────────┤
│ 1   │ 1.0     │ 1       │ missing  │
│ 2   │ 2.0     │ 2       │ missing  │
│ 3   │ 3.0     │ 3       │ missing  │
│ 4   │ 1.0     │ 0       │ missing  │
│ 5   │ 1.0     │ 1       │ missing  │
│ 6   │ 1.0     │ 0       │ missing  │
│ 7   │ 1.0     │ missing │ 1.0      │

julia> push!(df, NamedTuple(), cols=:subset)
8×3 DataFrame
│ Row │ A        │ B       │ C        │
│     │ Float64? │ Int64?  │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1   │ 1.0      │ 1       │ missing  │
│ 2   │ 2.0      │ 2       │ missing  │
│ 3   │ 3.0      │ 3       │ missing  │
│ 4   │ 1.0      │ 0       │ missing  │
│ 5   │ 1.0      │ 1       │ missing  │
│ 6   │ 1.0      │ 0       │ missing  │
│ 7   │ 1.0      │ missing │ 1.0      │
│ 8   │ missing  │ missing │ missing  │
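
The effect of promote on a positional row can be sketched as follows, using a made-up one-column data frame. Pushing a Float64 into an Int64 column with promote=true replaces the column with a freshly allocated one of a wider element type.

```julia
using DataFrames

df = DataFrame(a = [1, 2])
old_column = df.a

# 2.5 cannot be stored in an Int64 column, so :a is promoted
push!(df, (2.5,), promote=true)

new_eltype = eltype(df.a)
reallocated = df.a !== old_column
```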
source
DataFrames.renameFunction.
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
       makeunique::Bool=false)
rename(df::AbstractDataFrame, (from => to)::Pair...)
rename(df::AbstractDataFrame, d::AbstractDict)
rename(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename(f::Function, df::AbstractDataFrame)

Create a new data frame that is a copy of df with changed column names. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings within the from values, or within the to values, is not allowed.

See also: rename!

Examples

julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i     │ x     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename(df, :i => :A, :x => :X)
1×3 DataFrame
│ Row │ A     │ X     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename(df, :x => :y, :y => :x)
1×3 DataFrame
│ Row │ i     │ y     │ x     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename(df, [1 => :A, 2 => :X])
1×3 DataFrame
│ Row │ A     │ X     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename(df, Dict("i" => "A", "x" => "X"))
1×3 DataFrame
│ Row │ A     │ X     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename(uppercase, df)
1×3 DataFrame
│ Row │ I     │ X     │ Y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
source
DataFrames.rename!Function.
rename!(df::AbstractDataFrame, vals::AbstractVector{Symbol};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, vals::AbstractVector{<:AbstractString};
        makeunique::Bool=false)
rename!(df::AbstractDataFrame, (from => to)::Pair...)
rename!(df::AbstractDataFrame, d::AbstractDict)
rename!(df::AbstractDataFrame, d::AbstractVector{<:Pair})
rename!(f::Function, df::AbstractDataFrame)

Rename columns of df in-place. Each name is changed at most once. Permutation of names is allowed.

Arguments

  • df : the AbstractDataFrame
  • d : an AbstractDict or an AbstractVector of Pairs that maps the original names or column numbers to new names
  • f : a function which for each column takes the old name as a String and returns the new name that gets converted to a Symbol
  • vals : new column names as a vector of Symbols or AbstractStrings of the same length as the number of columns in df
  • makeunique : if false (the default), an error will be raised if duplicate names are found; if true, duplicate names will be suffixed with _i (i starting at 1 for the first duplicate).

If pairs are passed to rename! (as positional arguments or in a dictionary or a vector) then:

  • from value can be a Symbol, an AbstractString or an Integer;
  • to value can be a Symbol or an AbstractString.

Mixing symbols and strings within the from values, or within the to values, is not allowed.

See also: rename

Examples

julia> df = DataFrame(i = 1, x = 2, y = 3)
1×3 DataFrame
│ Row │ i     │ x     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, Dict(:i => "A", :x => "X"))
1×3 DataFrame
│ Row │ A     │ X     │ y     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, [:a, :b, :c])
1×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(df, [:a, :b, :a])
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make
them unique using a suffix automatically.

julia> rename!(df, [:a, :b, :a], makeunique=true)
1×3 DataFrame
│ Row │ a     │ b     │ a_1   │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │

julia> rename!(uppercase, df)
1×3 DataFrame
│ Row │ A     │ B     │ A_1   │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │
source
Base.repeatFunction.
repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1)

Construct a data frame by repeating rows in df. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, inner = 2, outer = 3)
12×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 4     │
│ 4   │ 2     │ 4     │
│ 5   │ 1     │ 3     │
│ 6   │ 1     │ 3     │
│ 7   │ 2     │ 4     │
│ 8   │ 2     │ 4     │
│ 9   │ 1     │ 3     │
│ 10  │ 1     │ 3     │
│ 11  │ 2     │ 4     │
│ 12  │ 2     │ 4     │
source
repeat(df::AbstractDataFrame, count::Integer)

Construct a data frame by repeating each row in df the number of times specified by count.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat(df, 2)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
│ 3   │ 1     │ 3     │
│ 4   │ 2     │ 4     │
source
DataFrames.repeat!Function.
repeat!(df::DataFrame; inner::Integer = 1, outer::Integer = 1)

Update a data frame df in-place by repeating its rows. inner specifies how many times each row is repeated, and outer specifies how many times the full set of rows is repeated. Columns of df are freshly allocated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat!(df, inner = 2, outer = 3);

julia> df
12×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 4     │
│ 4   │ 2     │ 4     │
│ 5   │ 1     │ 3     │
│ 6   │ 1     │ 3     │
│ 7   │ 2     │ 4     │
│ 8   │ 2     │ 4     │
│ 9   │ 1     │ 3     │
│ 10  │ 1     │ 3     │
│ 11  │ 2     │ 4     │
│ 12  │ 2     │ 4     │
source
repeat!(df::DataFrame, count::Integer)

Update a data frame df in-place by repeating its rows the number of times specified by count. Columns of df are freshly allocated.

Example

julia> df = DataFrame(a = 1:2, b = 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> repeat!(df, 2)
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │
│ 3   │ 1     │ 3     │
│ 4   │ 2     │ 4     │
source
DataFrames.selectFunction.
select(df::AbstractDataFrame, args...; copycols::Bool=true)

Create a new data frame that contains columns from df specified by args and return it. The result is guaranteed to have the same number of rows as df.

If df is a DataFrame or copycols=true then column renaming and transformations are supported.

Arguments passed as args... can be:

  • Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • Column transformation operations using the Pair notation that is described below and vectors of such pairs.

Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.

If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.

To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.

Column transformation can also be specified using the short old_column => fun form. In this case, new_column_name is automatically generated as $(old_column)_$(fun). Up to three column names are used for multiple input columns and they are joined using _; if more than three columns are passed, the name consists of the first two names followed by an etc suffix, e.g. [:a,:b,:c,:d] => fun produces the new column name :a_b_etc_fun.
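
The naming rule can be checked with a short sketch. The data frame and the helper total are hypothetical and exist only to give the transformation a readable name.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# Hypothetical helper: element-wise sum of four columns
total(a, b, c, d) = a .+ b .+ c .+ d

# One input column: the name is built from the column and the function
one_col = select(df, :a => sum)

# More than three input columns: first two names, then the etc suffix
four_cols = select(df, [:a, :b, :c, :d] => total)

one_col_names = propertynames(one_col)
four_cols_names = propertynames(four_cols)
```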

Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).

As a special rule, passing nrow without specifying old_column creates a column named :nrow containing the number of rows in the source data frame, and passing nrow => new_column_name stores the number of rows in the source data frame in the new_column_name column.

If a collection of column names is passed to select! or select, then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular, a syntax to move column :col to the first position in the data frame is select!(df, :col, :). In contrast, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.
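
A minimal sketch of the column-moving idiom, shown here with select on a made-up data frame:

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# :c is requested explicitly and again through :, and only the first occurrence is kept
moved = select(df, :c, :)
moved_names = propertynames(moved)
```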

If df is a DataFrame, a new DataFrame is returned. If copycols=false, then the returned DataFrame shares column vectors with df where possible. If copycols=true (the default), then the returned DataFrame will not share columns with df. The only exception to this rule is the old_column => fun => new_column transformation when fun returns a vector that is not allocated by fun but is neither a SubArray nor one of the input vectors. In such a case a new DataFrame might contain aliases. Such a situation can only happen with transformations which return vectors other than their inputs, e.g. with select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2) when c is a vector object or with select(df, :a => (x -> df.c) => :c2).

If df is a SubDataFrame and copycols=true then a DataFrame is returned and the same copying rules apply as for a DataFrame input: this means in particular that selected columns will be copied. If copycols=false, a SubDataFrame is returned without copying columns.

Note that including the same column several times in the data frame via renaming or transformations that return the same object when copycols=false will create column aliases. An example of such a situation is select(df, :a, :a => :b, :a => identity => :c, copycols=false).
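
The copying and aliasing rules can be observed with identity checks (===). This is a sketch on a made-up data frame whose columns are plain vectors.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Default copycols=true: the selected column is a fresh copy.
copied = select(df, :a)

# copycols=false: the result reuses the column vector stored in df.
shared = select(df, :a, copycols=false)

# Selecting and renaming the same column without copying yields two
# names that refer to one and the same vector.
aliased = select(df, :a, :a => :b, copycols=false)

is_copy = copied.a !== df.a
is_shared = shared.a === df.a
is_aliased = aliased.a === aliased.b
```

When aliases are present, mutating one column in place also changes the other, so copycols=false is best reserved for cases where the result is treated as read-only.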

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> select(df, :b)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> select(df, Not(:b)) # drop column :b from df
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> select(df, :a => :c, :b)
3×2 DataFrame
│ Row │ c     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> select(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c        │ b     │
│     │ Float64  │ Int64 │
├─────┼──────────┼───────┤
│ 1   │ 0.841471 │ 4     │
│ 2   │ 0.909297 │ 5     │
│ 3   │ 0.14112  │ 6     │

julia> select(df, :, [:a, :b] => (a,b) -> a .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ a     │ b     │ a_b_function │
│     │ Int64 │ Int64 │ Float64      │
├─────┼───────┼───────┼──────────────┤
│ 1   │ 1     │ 4     │ 0.0          │
│ 2   │ 2     │ 5     │ 2.0          │
│ 3   │ 3     │ 6     │ 4.0          │

julia> select(df, names(df) .=> sum)
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 15    │
│ 2   │ 6     │ 15    │
│ 3   │ 6     │ 15    │

julia> select(df, names(df) .=> sum .=> [:A, :B])
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 15    │
│ 2   │ 6     │ 15    │
│ 3   │ 6     │ 15    │

julia> using Statistics

julia> select(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│     │ Float64  │
├─────┼──────────┤
│ 1   │ 2.5      │
│ 2   │ 3.5      │
│ 3   │ 4.5      │
source
select(gd::GroupedDataFrame, args...;
       copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)

Apply args to gd following the rules described in combine and return the result as a DataFrame if ungroup=true or GroupedDataFrame if ungroup=false.

The parent of the returned value has as many rows as parent(gd). If an operation in args returns a single value it is always broadcasted to have this number of rows.

If copycols=false then do not perform copying of columns that are not transformed.

If keepkeys=true, the resulting DataFrame contains all the grouping columns in addition to those generated. In this case if the returned value contains columns with the same names as the grouping columns, they are required to be equal.

If ungroup=true (the default) a DataFrame is returned. If ungroup=false a GroupedDataFrame grouped using groupcols(gd) is returned.

See also

groupby, combine, select!, transform, transform!

Examples

julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2],
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8)
8×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 1     │ 1     │ 2     │
│ 3   │ 1     │ 2     │ 3     │
│ 4   │ 2     │ 1     │ 4     │
│ 5   │ 2     │ 2     │ 5     │
│ 6   │ 1     │ 1     │ 6     │
│ 7   │ 1     │ 2     │ 7     │
│ 8   │ 2     │ 1     │ 8     │

julia> gd = groupby(df, :a);

julia> select(gd, :c => sum, nrow)
8×3 DataFrame
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 19    │ 5     │
│ 2   │ 1     │ 19    │ 5     │
│ 3   │ 1     │ 19    │ 5     │
│ 4   │ 2     │ 17    │ 3     │
│ 5   │ 2     │ 17    │ 3     │
│ 6   │ 1     │ 19    │ 5     │
│ 7   │ 1     │ 19    │ 5     │
│ 8   │ 2     │ 17    │ 3     │

julia> select(gd, :c => sum, nrow, ungroup=false)
GroupedDataFrame with 2 groups based on key: a
First Group (5 rows): a = 1
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 19    │ 5     │
│ 2   │ 1     │ 19    │ 5     │
│ 3   │ 1     │ 19    │ 5     │
│ 4   │ 1     │ 19    │ 5     │
│ 5   │ 1     │ 19    │ 5     │
⋮
Last Group (3 rows): a = 2
│ Row │ a     │ c_sum │ nrow  │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 17    │ 3     │
│ 2   │ 2     │ 17    │ 3     │
│ 3   │ 2     │ 17    │ 3     │

julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
8×2 DataFrame
│ Row │ a     │ sum_log_c │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 5.52943   │
│ 2   │ 1     │ 5.52943   │
│ 3   │ 1     │ 5.52943   │
│ 4   │ 2     │ 5.07517   │
│ 5   │ 2     │ 5.07517   │
│ 6   │ 1     │ 5.52943   │
│ 7   │ 1     │ 5.52943   │
│ 8   │ 2     │ 5.07517   │

julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
8×3 DataFrame
│ Row │ a     │ b_sum │ c_sum │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 8     │ 19    │
│ 2   │ 1     │ 8     │ 19    │
│ 3   │ 1     │ 8     │ 19    │
│ 4   │ 2     │ 4     │ 17    │
│ 5   │ 2     │ 4     │ 17    │
│ 6   │ 1     │ 8     │ 19    │
│ 7   │ 1     │ 8     │ 19    │
│ 8   │ 2     │ 4     │ 17    │

julia> select(gd, :b => :b1, :c => :c1,
              [:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
8×3 DataFrame
│ Row │ b1    │ c1    │ b_c_+ │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 3     │
│ 2   │ 1     │ 2     │ 3     │
│ 3   │ 2     │ 3     │ 5     │
│ 4   │ 1     │ 4     │ 5     │
│ 5   │ 2     │ 5     │ 7     │
│ 6   │ 1     │ 6     │ 7     │
│ 7   │ 2     │ 7     │ 9     │
│ 8   │ 1     │ 8     │ 9     │

julia> select(gd, :b, :c => sum) # passing columns and broadcasting
8×3 DataFrame
│ Row │ a     │ b     │ c_sum │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 19    │
│ 2   │ 1     │ 1     │ 19    │
│ 3   │ 1     │ 2     │ 19    │
│ 4   │ 2     │ 1     │ 17    │
│ 5   │ 2     │ 2     │ 17    │
│ 6   │ 1     │ 1     │ 19    │
│ 7   │ 1     │ 2     │ 19    │
│ 8   │ 2     │ 1     │ 17    │

julia> select(gd, :, AsTable(Not(:a)) => sum)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ b_c_sum │
│     │ Int64 │ Int64 │ Int64 │ Int64   │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3       │
│ 2   │ 1     │ 1     │ 2     │ 3       │
│ 3   │ 1     │ 2     │ 3     │ 5       │
│ 4   │ 2     │ 1     │ 4     │ 5       │
│ 5   │ 2     │ 2     │ 5     │ 7       │
│ 6   │ 1     │ 1     │ 6     │ 7       │
│ 7   │ 1     │ 2     │ 7     │ 9       │
│ 8   │ 2     │ 1     │ 8     │ 9       │
source
DataFrames.select!Function.
select!(df::DataFrame, args...)

Mutate df in place to retain only columns specified by args... and return it. The result is guaranteed to have the same number of rows as df.

Arguments passed as args... can be:

  • Any index that is allowed for column indexing (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).
  • Column transformation operations using the Pair notation that is described below and vectors of such pairs.

Columns can be renamed using the old_column => new_column_name syntax, and transformed using the old_column => fun => new_column_name syntax. new_column_name must be a Symbol or a string, and fun a function or a type. If old_column is a Symbol, a string, or an integer then fun is applied to the corresponding column vector. Otherwise old_column can be any column indexing syntax, in which case fun will be passed the column vectors specified by old_column as separate arguments. The only exception is when old_column is an AsTable type wrapping a selector, in which case fun is passed a NamedTuple containing the selected columns.

If fun returns a value of type other than AbstractVector then it will be broadcasted into a vector matching the target number of rows in the data frame, unless its type is one of AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix, in which case an error is thrown as currently these return types are not allowed. As a particular rule, values wrapped in a Ref or a 0-dimensional AbstractArray are unwrapped and then broadcasted.
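
A short sketch of the broadcasting rules, with made-up column names: a scalar result is repeated for every row, and wrapping a vector in Ref makes it count as a single value rather than as a column.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

select!(df, :a,
        # `sum` returns a scalar, which is broadcast to all three rows.
        :a => sum => :total,
        # The Ref is unwrapped and its content (a 2-element vector)
        # is stored in every row instead of being used as the column.
        :a => (x -> Ref([minimum(x), maximum(x)])) => :range)
```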

To apply fun to each row instead of whole columns, it can be wrapped in a ByRow struct. In this case if old_column is a Symbol, a string, or an integer then fun is applied to each element (row) of old_column using broadcasting. Otherwise old_column can be any column indexing syntax, in which case fun will be passed one argument for each of the columns specified by old_column. If ByRow is used it is not allowed for old_column to select an empty set of columns nor for fun to return a NamedTuple or a DataFrameRow.

Column transformation can also be specified using the short old_column => fun form. In this case, new_column_name is automatically generated as $(old_column)_$(fun). Up to three column names are used for multiple input columns and they are joined using _; if more than three columns are passed then the name consists of the first two names followed by an etc suffix, e.g. [:a,:b,:c,:d] => fun produces the new column name :a_b_etc_fun.

Column renaming and transformation operations can be passed wrapped in vectors (this is useful when combined with broadcasting).

As a special rule, passing nrow without specifying old_column creates a column named :nrow containing the number of rows in the source data frame, and passing nrow => new_column_name stores the number of rows in the source data frame in the new_column_name column.

If a collection of column names is passed to select! or select then requesting duplicate column names in the target data frame is accepted (e.g. select!(df, [:a], :, r"a") is allowed) and only the first occurrence is used. In particular a syntax to move column :col to the first position in the data frame is select!(df, :col, :). On the contrary, output column names of renaming, transformation and single column selection operations must be unique, so e.g. select!(df, :a, :a => :a) or select!(df, :a, :a => ByRow(sin) => :a) are not allowed.

Note that including the same column several times in the data frame via renaming or transformations that return the same object without copying will create column aliases. An example of such a situation is select!(df, :a, :a => :b, :a => identity => :c).

Examples

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> select!(df, 2)
3×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df = DataFrame(a=1:3, b=4:6);

julia> select!(df, :a => ByRow(sin) => :c, :b)
3×2 DataFrame
│ Row │ c        │ b     │
│     │ Float64  │ Int64 │
├─────┼──────────┼───────┤
│ 1   │ 0.841471 │ 4     │
│ 2   │ 0.909297 │ 5     │
│ 3   │ 0.14112  │ 6     │

julia> select!(df, :, [:c, :b] => (c,b) -> c .+ b .- sum(b)/length(b))
3×3 DataFrame
│ Row │ c        │ b     │ c_b_function │
│     │ Float64  │ Int64 │ Float64      │
├─────┼──────────┼───────┼──────────────┤
│ 1   │ 0.841471 │ 4     │ -0.158529    │
│ 2   │ 0.909297 │ 5     │ 0.909297     │
│ 3   │ 0.14112  │ 6     │ 1.14112      │

julia> df = DataFrame(a=1:3, b=4:6);

julia> select!(df, names(df) .=> sum);

julia> df
3×2 DataFrame
│ Row │ a_sum │ b_sum │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 15    │
│ 2   │ 6     │ 15    │
│ 3   │ 6     │ 15    │

julia> df = DataFrame(a=1:3, b=4:6);

julia> using Statistics

julia> select!(df, AsTable(:) => ByRow(mean))
3×1 DataFrame
│ Row │ a_b_mean │
│     │ Float64  │
├─────┼──────────┤
│ 1   │ 2.5      │
│ 2   │ 3.5      │
│ 3   │ 4.5      │
source
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)

An equivalent of select(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup) but updates parent(gd) in place.

See also

groupby, combine, select, transform, transform!
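
As a sketch of the in-place grouped form, the following uses a made-up data frame with a grouping column g and a value column v. After the call the parent data frame holds the grouping column plus the per-group result.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], v = [10, 20, 30])
gd = groupby(df, :g)

# Replace the columns of parent(gd), i.e. df, with the key column
# and the per-group sum broadcast to every row of the group.
select!(gd, :v => sum => :total)
```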

source
Base.showFunction.
show([io::IO,] df::AbstractDataFrame;
     allrows::Bool = !get(io, :limit, false),
     allcols::Bool = !get(io, :limit, false),
     allgroups::Bool = !get(io, :limit, false),
     splitcols::Bool = get(io, :limit, false),
     rowlabel::Symbol = :Row,
     summary::Bool = true,
     eltypes::Bool = true)

Render a data frame to an I/O stream. The specific visual representation chosen depends on the width of the display.

If io is omitted, the result is printed to stdout, and allrows, allcols and allgroups default to false while splitcols defaults to true.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • df::AbstractDataFrame: The data frame to print.
  • allrows::Bool: Whether to print all rows, rather than a subset that fits the device height. By default this is the case only if io does not have the IOContext property limit set.
  • allcols::Bool: Whether to print all columns, rather than a subset that fits the device width. By default this is the case only if io does not have the IOContext property limit set.
  • allgroups::Bool: Whether to print all groups rather than the first and last, when df is a GroupedDataFrame. By default this is the case only if io does not have the IOContext property limit set.
  • splitcols::Bool: Whether to split printing in chunks of columns fitting the screen width rather than printing all columns in the same block. Only applies if allcols is true. By default this is the case only if io has the IOContext property limit set.
  • rowlabel::Symbol = :Row: The label to use for the column containing row numbers.
  • summary::Bool = true: Whether to print a brief string summary of the data frame.
  • eltypes::Bool = true: Whether to print the column types under column names.
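
Because io can be any I/O stream, the rendering can also be captured as a string. The sketch below writes to an IOBuffer and switches off the summary line and the element-type row; the variable names are illustrative only.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = ["x", "y", "z"])

io = IOBuffer()
# No "3×2 DataFrame" header and no row of column types.
show(io, df, summary=false, eltypes=false)
text = String(take!(io))
```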

Examples

julia> using DataFrames

julia> df = DataFrame(A = 1:3, B = ["x", "y", "z"]);

julia> show(df, allcols=true)
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ x      │
│ 2   │ 2     │ y      │
│ 3   │ 3     │ z      │
source
show(io::IO, mime::MIME, df::AbstractDataFrame)

Render a data frame to an I/O stream in MIME type mime.

Arguments

  • io::IO: The I/O stream to which df will be printed.
  • mime::MIME: supported MIME types are: "text/plain", "text/html", "text/latex", "text/csv", "text/tab-separated-values" (the last two MIME types do not support showing #undef values)
  • df::AbstractDataFrame: The data frame to print.

Additionally selected MIME types support passing the following keyword arguments:

  • MIME type "text/plain" accepts all listed keyword arguments and their behavior is identical to that for show(::IO, ::AbstractDataFrame)
  • MIME type "text/html" accepts the summary keyword argument, which controls whether a brief string summary of the data frame is printed.

Examples

julia> show(stdout, MIME("text/latex"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
\begin{tabular}{r|cc}
        & A & B\\
        \hline
        & Int64 & String\\
        \hline
        1 & 1 & x \\
        2 & 2 & y \\
        3 & 3 & z \\
\end{tabular}
14

julia> show(stdout, MIME("text/csv"), DataFrame(A = 1:3, B = ["x", "y", "z"]))
"A","B"
1,"x"
2,"y"
3,"z"
source
Base.sizeFunction.
size(df::AbstractDataFrame, [dim])

Return a tuple containing the number of rows and columns of df. Optionally a dimension dim can be specified, where 1 corresponds to rows and 2 corresponds to columns.

See also: nrow, ncol

Examples

julia> df = DataFrame(a=1:3, b='a':'c');

julia> size(df)
(3, 2)

julia> size(df, 1)
3
source
size(dfr::DataFrameRow, [dim])

Return a 1-tuple containing the number of elements of dfr. If an optional dimension dim is specified, it must be 1, and the number of elements is returned directly as a number.

See also: length

Examples

julia> dfr = DataFrame(a=1:3, b='a':'c')[1, :];

julia> size(dfr)
(2,)

julia> size(dfr, 1)
2
source
Base.sortFunction.
sort(df::AbstractDataFrame, cols;
     alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
     rev::Bool=false, order::Ordering=Forward)

Return a copy of data frame df sorted by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See sort! for a description of other keyword arguments.
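
The by keyword is not covered by the examples below, so here is a hedged sketch of it on made-up data. by transforms each value before comparison, either for all sorting columns or, through order, for a single one.

```julia
using DataFrames

df = DataFrame(x = [-3, 1, -2], y = ["b", "A", "C"])

# Compare x by absolute value.
by_abs = sort(df, :x, by=abs)

# Case-insensitive ordering of y; the default would put "C" before "b".
case_insensitive = sort(df, :y, by=lowercase)

# Per-column settings: x by absolute value in descending order, then y.
abs_desc = sort(df, [order(:x, by=abs, rev=true), :y])
```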

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, [:x, :y])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort(df, [:x, order(:y, rev=true)])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
Base.sort!Function.
sort!(df::AbstractDataFrame, cols;
      alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
      rev::Bool=false, order::Ordering=Forward)

Sort data frame df by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sort!(df, :x)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, [:x, :y])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │

julia> sort!(df, [:x, :y], rev=true)
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 1     │ c      │
│ 4   │ 1     │ b      │

julia> sort!(df, [:x, order(:y, rev=true)])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ c      │
│ 2   │ 1     │ b      │
│ 3   │ 2     │ a      │
│ 4   │ 3     │ b      │
source
DataFrames.transformFunction.
transform(df::AbstractDataFrame, args...; copycols::Bool=true)

Create a new data frame that contains the columns of df together with the columns specified by args, and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select(df, :, args..., copycols=copycols).

See select for detailed rules regarding accepted values for args.
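
A minimal sketch of transform on a made-up data frame: the existing columns are kept and the new ones are appended, while df itself is left unchanged.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

out = transform(df,
                # Whole-column function of two columns.
                [:a, :b] => ((a, b) -> a .+ b) => :total,
                # Row-wise function of one column.
                :a => ByRow(x -> 2x) => :double)
```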

source
transform(gd::GroupedDataFrame, args...;
          copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true)

Equivalent to select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup).

See also

groupby, combine, select, select!, transform!
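
For the grouped method, a typical use is attaching a per-group statistic to every row. The sketch below assumes a made-up data frame with grouping column g; rows keep the order they have in the parent data frame.

```julia
using DataFrames

df = DataFrame(g = [1, 2, 1, 2], v = [10, 20, 30, 40])
gd = groupby(df, :g)

# Each row receives the sum of v over its own group.
out = transform(gd, :v => sum => :group_total)
```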

source
DataFrames.transform!Function.
transform!(df::DataFrame, args...)

Mutate df in place to add columns specified by args... and return it. The result is guaranteed to have the same number of rows as df. Equivalent to select!(df, :, args...).

See select! for detailed rules regarding accepted values for args.
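
A short sketch of the in-place variant, with illustrative column names: df itself gains the new column.

```julia
using DataFrames

df = DataFrame(a = [1.0, 4.0, 9.0])

# Adds :root to df; no new data frame is created.
transform!(df, :a => ByRow(sqrt) => :root)
```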

source
transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true)

An equivalent of transform(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup) but updates parent(gd) in place.

See also

groupby, combine, select, select!, transform

source
Base.unique!Function.
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

When unique is called a new data frame is returned; unique! updates df in-place.

See also nonunique.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df)   # doesn't modify df
unique(df, 1)
unique!(df)  # modifies df
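
The calls above use random data, so their results differ between runs. A deterministic sketch with made-up values shows which rows are retained:

```julia
using DataFrames

df = DataFrame(i = [1, 1, 2, 1], x = ["a", "b", "a", "a"])

# Row 4 repeats row 1 in all columns, so three rows remain.
all_cols = unique(df)

# Only :i is compared: the first row for i = 1 and the first for i = 2.
by_i = unique(df, :i)

# In-place variant: df itself now holds the three distinct rows.
unique!(df)
```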
source
Base.vcatFunction.
vcat(dfs::AbstractDataFrame...;
     cols::Union{Symbol, AbstractVector{Symbol},
                 AbstractVector{<:AbstractString}}=:setequal)

Vertically concatenate AbstractDataFrames.

The cols keyword argument determines the columns of the returned data frame:

  • :setequal: require all data frames to have the same column names disregarding order. If they appear in different orders, the order of the first provided data frame is used.
  • :orderequal: require all data frames to have the same column names and in the same order.
  • :intersect: only the columns present in all provided data frames are kept. If the intersection is empty, an empty data frame is returned.
  • :union: columns present in at least one of the provided data frames are kept. Columns not present in some data frames are filled with missing where necessary.
  • A vector of Symbols or strings: only listed columns are kept. Columns not present in some data frames are filled with missing where necessary.

The order of columns is determined by the order they appear in the included data frames, searching through the header of the first data frame, then the second, etc.

The element types of columns are determined using promote_type, as with vcat for AbstractVectors.

vcat ignores empty data frames, making it possible to initialize an empty data frame at the beginning of a loop and vcat onto it.
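
The loop pattern mentioned above might look as follows. This is a sketch: the helper name collect_squares and the column names are hypothetical.

```julia
using DataFrames

# Hypothetical helper that grows a data frame one row at a time.
function collect_squares(n)
    acc = DataFrame()                 # empty accumulator, no columns yet
    for i in 1:n
        # The empty data frame is ignored on the first iteration.
        acc = vcat(acc, DataFrame(id = [i], square = [i^2]))
    end
    return acc
end

squares = collect_squares(3)
```

Each vcat call allocates a new data frame, so for long loops it is usually cheaper to collect the pieces in a vector and concatenate them once at the end.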

Examples

julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> df3 = DataFrame(A=7:9, C=7:9);

julia> d4 = DataFrame();

julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
│ 4   │ 4     │ 4     │
│ 5   │ 5     │ 5     │
│ 6   │ 6     │ 6     │

julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A     │ B       │ C       │
│     │ Int64 │ Int64?  │ Int64?  │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ 1       │ missing │
│ 2   │ 2     │ 2       │ missing │
│ 3   │ 3     │ 3       │ missing │
│ 4   │ 7     │ missing │ 7       │
│ 5   │ 8     │ missing │ 8       │
│ 6   │ 9     │ missing │ 9       │

julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
│ 4   │ 7     │
│ 5   │ 8     │
│ 6   │ 9     │

julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │
source

Unsorted

Base.firstFunction.
first(df::AbstractDataFrame)

Get the first row of df as a DataFrameRow.

source
first(df::AbstractDataFrame, n::Integer)

Get a data frame with the first n rows of df.

source
Base.lastFunction.
last(df::AbstractDataFrame)

Get the last row of df as a DataFrameRow.

source
last(df::AbstractDataFrame, n::Integer)

Get a data frame with the last n rows of df.
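
The difference between the one-argument and two-argument methods of first and last can be sketched on a made-up data frame:

```julia
using DataFrames

df = DataFrame(a = 1:5, b = 'a':'e')

row = first(df)       # a single row, returned as a DataFrameRow
head = first(df, 2)   # a data frame with rows 1 and 2
tail = last(df, 2)    # a data frame with rows 4 and 5
```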

source
Base.uniqueFunction.
unique(df::AbstractDataFrame)
unique(df::AbstractDataFrame, cols)
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Delete duplicate rows of data frame df, keeping only the first occurrence of unique rows. When cols is specified, the returned DataFrame contains complete rows, retaining in each case the first instance for which df[cols] is unique. cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

When unique is called a new data frame is returned; unique! updates df in-place.

See also nonunique.

Arguments

  • df : the AbstractDataFrame
  • cols : column indicator (Symbol, Int, Vector{Symbol}, Regex, etc.) specifying the column(s) to compare.

Examples

df = DataFrame(i = 1:10, x = rand(10), y = rand(["a", "b", "c"], 10))
df = vcat(df, df)
unique(df)   # doesn't modify df
unique(df, 1)
unique!(df)  # modifies df
source
Base.propertynamesFunction.
propertynames(df::AbstractDataFrame)

Return a freshly allocated Vector{Symbol} of names of columns contained in df.
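
"Freshly allocated" means the returned vector can be modified without affecting df, as this small sketch with made-up column names illustrates:

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

cols = propertynames(df)
push!(cols, :c)              # changes only the returned vector

unchanged = propertynames(df)
```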

source
Base.similarFunction.
similar(df::AbstractDataFrame, rows::Integer=nrow(df))

Create a new DataFrame with the same column names and column element types as df. An optional second argument can be provided to request a number of rows that is different than the number of rows present in df.
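
A sketch of similar on a made-up data frame. Only the shape and the element types are inspected, since the contents of the new columns are uninitialized.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

same_size = similar(df)    # 3 rows, like df
short = similar(df, 2)     # 2 rows, same columns and element types
```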

source
Base.sortpermFunction.
sortperm(df::AbstractDataFrame, cols;
         alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
         rev::Bool=false, order::Ordering=Forward)

Return a permutation vector of row indices of data frame df that puts them in sorted order according to column(s) cols.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If alg is nothing (the default), the most appropriate algorithm is chosen automatically among TimSort, MergeSort and RadixSort depending on the type of the sorting columns and on the number of rows in df. If rev is true, reverse sorting is performed. To enable reverse sorting only for some columns, pass order(c, rev=true) in cols, with c the corresponding column index (see example below). See other methods for a description of other keyword arguments.

Examples

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ b      │
│ 2   │ 1     │ c      │
│ 3   │ 2     │ a      │
│ 4   │ 1     │ b      │

julia> sortperm(df, :x)
4-element Array{Int64,1}:
 2
 4
 3
 1

julia> sortperm(df, [:x, :y])
4-element Array{Int64,1}:
 4
 2
 3
 1

julia> sortperm(df, [:x, :y], rev=true)
4-element Array{Int64,1}:
 1
 3
 2
 4

julia> sortperm(df, [:x, order(:y, rev=true)])
4-element Array{Int64,1}:
 2
 4
 3
 1
source
Base.pairsFunction.
pairs(dfc::DataFrameColumns)

Return an iterator of pairs associating the name of each column of dfc with the corresponding column vector, i.e. name => col where name is the column name of the column col.
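
A DataFrameColumns object is what eachcol(df) returns, so the iterator is typically used as below. This is a sketch with made-up columns that records the length of each column under its name.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

lengths = Dict{Symbol, Int}()
for (name, col) in pairs(eachcol(df))
    # `name` is the column name, `col` the column vector itself.
    lengths[name] = length(col)
end
```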

source
Base.parentFunction.
parent(gd::GroupedDataFrame)

Return the parent data frame of gd.
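
A minimal sketch, assuming a made-up data frame grouped by column g: the parent is the very object that was passed to groupby, not a copy.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], v = [10, 20, 30])
gd = groupby(df, :g)

same_object = parent(gd) === df
```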

source
Base.issortedFunction.
issorted(df::AbstractDataFrame, cols;
         lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)

Test whether data frame df is sorted by column(s) cols.

cols can be any column selector (Symbol, string or integer; :, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers).

If rev is true, df is tested for reverse order. To test for reverse order only on some columns, pass order(c, rev=true) in cols, with c the corresponding column index. See other methods for a description of other keyword arguments.
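
A sketch of the checks described above, on a made-up data frame that is ascending in x and, within equal x, descending in y:

```julia
using DataFrames

df = DataFrame(x = [1, 1, 2], y = ["b", "a", "c"])

by_x = issorted(df, :x)                               # x alone is ascending
by_x_y = issorted(df, [:x, :y])                       # y breaks the order
mixed = issorted(df, [:x, order(:y, rev=true)])       # y reversed per column
reversed = issorted(df, :x, rev=true)                 # x is not descending
```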

source