# Using CategoricalArrays

## Basic usage

Suppose that you have data about four individuals, with three different age groups. Since this variable is clearly ordinal, we mark the array as such via the `ordered` argument.

``````julia> using CategoricalArrays

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{String,1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
``````

By default, the levels are lexically sorted, which is clearly not correct in our case and would give incorrect results when testing for order. This is easily fixed using the `levels!` function to reorder levels:

``````julia> levels(x)
3-element Array{String,1}:
"Middle"
"Old"
"Young"

julia> levels!(x, ["Young", "Middle", "Old"])
4-element CategoricalArray{String,1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
``````

Thanks to this order, we can not only test for equality between two values, but also compare the ages of e.g. individuals 1 and 2:

``````julia> x[1]
CategoricalString{UInt32} "Old" (3/3)

julia> x[2]
CategoricalString{UInt32} "Young" (1/3)

julia> x[2] == x[4]
true

julia> x[1] > x[2]
true
``````

Now let us imagine the first individual is actually in the "Young" group. Let's fix this (notice how the string `"Young"` is automatically converted to a `CategoricalString`):

``````julia> x[1] = "Young"
"Young"

julia> x[1]
CategoricalString{UInt32} "Young" (1/3)
``````

The `CategoricalArray` still considers `"Old"` as a possible level even if it is unused now. This is necessary to allow efficiently accessing the levels and setting values of elements in the array: indeed, dropping unused levels requires iterating over every element in the array, which is expensive. This property can also be useful to keep track of possible levels, even if they do not occur in practice.

To get rid of the `"Old"` group, just call the `droplevels!` function:

``````julia> levels(x)
3-element Array{String,1}:
"Young"
"Middle"
"Old"

julia> droplevels!(x)
4-element CategoricalArray{String,1,UInt32}:
"Young"
"Young"
"Middle"
"Young"

julia> levels(x)
2-element Array{String,1}:
"Young"
"Middle"
``````

Another solution would have been to call `levels!(x, ["Young", "Middle"])` manually. This command is safe too, since it raises an error when asked to remove a level that is still in use, as happens here when `"Middle"` is misspelled:

``````julia> levels!(x, ["Young", "Midle"])
ERROR: ArgumentError: cannot remove level "Middle" as it is used at position 3. Change the array element type to Union{String, Missing} using convert if you want to transform some levels to missing values.
[...]
``````

Note that entries in the `x` array can be treated as strings (that's because `CategoricalString <: AbstractString`):

``````julia> lowercase(x[3])
"middle"

julia> replace(x[3], 'M'=>'R')
"Riddle"``````

## Handling missing values

The examples above assumed that the data contained no missing values, which is generally not the case for real data. `CategoricalArray{Union{T, Missing}}` handles this situation: it is essentially the categorical-data equivalent of `Array{Union{T, Missing}}`. It behaves exactly like `CategoricalArray{T}`, except that indexing returns either a categorical value object (`CategoricalString` or `CategoricalValue{T}`) or `missing` if the value is missing. See the Julia manual for more information on the `Missing` type.

Let's adapt the example developed above to support missing values. Since there are no missing values in the input vector, we need to specify that the array should be able to hold either a `String` or `missing`:

``````julia> y = CategoricalArray{Union{Missing, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
``````

Levels still need to be reordered manually:

``````julia> levels(y)
3-element Array{String,1}:
"Middle"
"Old"
"Young"

julia> levels!(y, ["Young", "Middle", "Old"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"
``````

At this point, indexing into the array gives exactly the same result as before:

``````julia> y[1]
CategoricalString{UInt32} "Old" (3/3)``````

Missing values can be introduced either manually, or by restricting the set of possible levels. Let us imagine this time that we actually do not know the age of the first individual. We can set it to a missing value this way:

``````julia> y[1] = missing
missing

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
missing
"Young"
"Middle"
"Young"

julia> y[1]
missing
``````
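As a small sketch of how such an array can be inspected once it holds a missing value, the snippet below rebuilds `y` and counts its missing and observed entries. It relies only on `ismissing` and `skipmissing` from Julia's standard library; the variable names `n_missing` and `observed` are illustrative and not part of the package.

```julia
using CategoricalArrays

# Same data as above, with an element type that also accepts `missing`.
y = CategoricalArray{Union{Missing, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
levels!(y, ["Young", "Middle", "Old"])

y[1] = missing

n_missing = count(ismissing, y)       # number of missing entries
observed = collect(skipmissing(y))    # categorical values with `missing` removed

println(n_missing)
println(length(observed))
println(levels(y))   # "Old" is still a level even though no element uses it
```

Setting an element to `missing` does not change the levels, so `droplevels!` is still needed to remove a level that is no longer used.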

It is also possible to transform all values belonging to some levels into missing values, which gives the same result as above in the present case since we have only one individual in the `"Old"` group. Let's first restore the original value for the first element, and then set it to missing again using the `allow_missing` argument to `levels!`:

``````julia> y[1] = "Old"
"Old"

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
"Old"
"Young"
"Middle"
"Young"

julia> levels!(y, ["Young", "Middle"]; allow_missing=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
missing
"Young"
"Middle"
"Young"
``````

## Working with categorical arrays

`categorical(A)` - Construct a categorical array with values from `A`

`compress(A)` - Return a copy of categorical array `A` using the smallest possible reference type

`cut(x)` - Cut a numeric array into intervals and return an ordered `CategoricalArray`

`decompress(A)` - Return a copy of categorical array `A` using the default reference type

`isordered(A)` - Test whether entries in `A` can be compared using `<`, `>` and similar operators

`ordered!(A)` - Set whether entries in `A` can be compared using `<`, `>` and similar operators

`recode(a[, default], pairs...)` - Return a copy of `a` after replacing one or more values

`recode!(a[, default], pairs...)` - Replace one or more values in `a` in-place

``categorical{T}(A::AbstractArray{T}[, compress::Bool]; ordered::Bool=false)``

Construct a categorical array with the values from `A`.

If the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in `A`. The `ordered` keyword argument determines whether the array values can be compared according to the ordering of levels or not (see `isordered`).

If `compress` is provided and set to `true`, the smallest reference type able to hold the number of unique values in `A` will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).

``categorical{T}(A::CategoricalArray{T}[, compress::Bool]; ordered::Bool=isordered(A))``

If `A` is already a `CategoricalArray`, its levels are preserved; the same applies to the ordered property, and to the reference type unless `compress` is passed.
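The following sketch illustrates the behavior described above on a small, made-up vector `v`: levels of strings are sorted by default, the `ordered` keyword controls comparability, and calling `categorical` on an existing `CategoricalArray` keeps its levels and ordered property.

```julia
using CategoricalArrays

v = ["b", "a", "c", "a"]

c = categorical(v)                  # unordered by default
o = categorical(v, ordered=true)    # values can be compared with < and >

println(levels(c))      # String levels are sorted in ascending order
println(isordered(c))
println(isordered(o))

# Calling categorical on an existing CategoricalArray keeps its levels
# and its ordered property.
levels!(o, ["c", "b", "a"])
o2 = categorical(o)
println(levels(o2))
println(isordered(o2))
```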

``compress(A::CategoricalArray)``

Return a copy of categorical array `A` using the smallest reference type able to hold the number of `levels` of `A`.

While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.

``````cut(x::AbstractArray, breaks::AbstractVector;
labels::Union{AbstractVector{<:AbstractString},Function},
extend::Bool=false, allow_missing::Bool=false)``````

Cut a numeric array into intervals and return an ordered `CategoricalArray` indicating the interval into which each entry falls. Intervals are of the form `[lower, upper)`, i.e. the lower bound is included and the upper bound is excluded.

If `x` accepts missing values (i.e. `eltype(x) >: Missing`) the returned array will also accept them.

Arguments

• `extend::Bool=false`: when `false`, an error is raised if some values in `x` fall outside of the breaks; when `true`, breaks are automatically added to include all values in `x`, and the upper bound is included in the last interval.
• `labels::Union{AbstractVector,Function}`: a vector of strings giving the names to use for the intervals; or a function `f(from, to, i; closed)` that generates the labels from the left and right interval boundaries and the group index. Defaults to `"[from, to)"` (or `"[from, to]"` for the rightmost interval if `extend == true`).
• `allow_missing::Bool=false`: when `true`, values outside of breaks result in missing values. Only supported when `x` accepts missing values.

Examples

``````julia> cut(-1:0.5:1, [0, 1], extend=true)
5-element CategoricalArray{String,1,UInt32}:
"[-1.0, 0.0)"
"[-1.0, 0.0)"
"[0.0, 1.0]"
"[0.0, 1.0]"
"[0.0, 1.0]"

julia> cut(-1:0.5:1, 2)
5-element CategoricalArray{String,1,UInt32}:
"[-1.0, 0.0)"
"[-1.0, 0.0)"
"[0.0, 1.0]"
"[0.0, 1.0]"
"[0.0, 1.0]"

julia> cut(-1:0.5:1, 2, labels=["A", "B"])
5-element CategoricalArray{String,1,UInt32}:
"A"
"A"
"B"
"B"
"B"

julia> fmt(from, to, i; closed) = "grp $i ($from//$to)"
fmt (generic function with 1 method)

julia> cut(-1:0.5:1, 3, labels=fmt)
5-element CategoricalArray{String,1,UInt32}:
"grp 1 (-1.0//-0.333333)"
"grp 1 (-1.0//-0.333333)"
"grp 2 (-0.333333//0.333333)"
"grp 3 (0.333333//1.0)"
"grp 3 (0.333333//1.0)"``````

``````cut(x::AbstractArray, ngroups::Integer;
labels::Union{AbstractVector{<:AbstractString},Function})``````

Cut a numeric array into `ngroups` quantiles, determined using `quantile`.
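As a complement to the examples above, here is a minimal sketch of the explicit-breaks method applied to a made-up `ages` vector. It checks properties of the result rather than the generated label text, since the exact label format is best read from the output of your installed version.

```julia
using CategoricalArrays

ages = [3, 17, 25, 64, 40]   # hypothetical data

# Three half-open intervals: [0, 18), [18, 40) and [40, 100).
groups = cut(ages, [0, 18, 40, 100])

println(isordered(groups))         # the result is an ordered array
println(length(levels(groups)))    # one level per interval
println(groups[1] == groups[2])    # 3 and 17 both fall in [0, 18)
println(groups[5] > groups[3])     # 40 is in [40, 100), above 25 in [18, 40)

# Custom labels replace the generated interval names.
named = cut(ages, [0, 18, 40, 100], labels=["child", "young adult", "adult"])
println(named)
```

Because the lower bound of each interval is included, the value `40` lands in the last interval rather than in `[18, 40)`.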

``decompress(A::CategoricalArray)``

Return a copy of categorical array `A` using the default reference type (`UInt32`). If `A` is using a small reference type (such as `UInt8` or `UInt16`) the decompressed array will have room for more levels.

To avoid the need to call `decompress`, ensure `compress` is not called when creating the categorical array.
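A short sketch of a `compress` and `decompress` round trip on a small, made-up array may help here. It assumes the reference type is visible among the type parameters printed by `typeof`, as in the array displays earlier on this page.

```julia
using CategoricalArrays

a = categorical(["x", "y", "x", "z"])   # default reference type
small = compress(a)                     # smallest reference type that fits 3 levels
back = decompress(small)                # default reference type again

# The reference type appears among the printed type parameters.
println(typeof(a))
println(typeof(small))
println(typeof(back))

# The contents and levels are unchanged by either call.
println(small == ["x", "y", "x", "z"])
println(levels(small) == levels(a))
```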

``isordered(A::CategoricalArray)``

Test whether entries in `A` can be compared using `<`, `>` and similar operators, using the ordering of levels.

``ordered!(A::CategoricalArray, ordered::Bool)``

Set whether entries in `A` can be compared using `<`, `>` and similar operators, using the ordering of levels. Return the modified `A`.
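The sketch below shows `isordered` and `ordered!` together on a made-up `sizes` array: the array starts unordered, its levels are put in a meaningful order with `levels!`, and comparisons are then enabled.

```julia
using CategoricalArrays

sizes = categorical(["small", "large", "medium"])
println(isordered(sizes))        # unordered by default

# Put the levels in a meaningful order, then allow comparisons.
levels!(sizes, ["small", "medium", "large"])
ordered!(sizes, true)

println(isordered(sizes))
println(sizes[2] > sizes[3])     # "large" > "medium" under the level order
println(sizes[1] < sizes[3])     # "small" < "medium"
```

Reordering the levels before calling `ordered!` matters: comparisons follow the order of the levels, not the alphabetical order of the labels.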

``recode(a::AbstractArray[, default::Any], pairs::Pair...)``

Return a copy of `a`, replacing elements matching a key of `pairs` with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from `a`).

For each `Pair` in `pairs`, if the element is equal to the key (first item of the pair) according to `isequal`, or is `in` the key when the key is a collection, then the corresponding value (second item) is used. If the element matches no key and `default` is not provided or `nothing`, it is copied as-is; if `default` is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.

``recode(a::CategoricalArray[, default::Any], pairs::Pair...)``

If `a` is a `CategoricalArray` then the ordering of resulting levels is determined by the order of passed `pairs` and `default` will be the last level if provided.

Examples

``````julia> using CategoricalArrays

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1)
10-element Array{Int64,1}:
100
0
0
0
-1
6
7
8
-1
-1
``````
`` recode(a::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)``

If `a` contains missing values, they are never replaced with `default`: use `missing` in a pair to recode them. If missing values are not recoded this way, the returned array will accept missing values.

Examples

``````julia> using CategoricalArrays

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1, 6=>missing)
10-element Array{Union{Missing, Int64},1}:
100
0
0
0
-1
missing
7
8
-1
-1
``````
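To make the role of the optional `default` argument concrete, here is a minimal sketch on a made-up `grades` vector, first without a default and then with one.

```julia
using CategoricalArrays

grades = ["A", "B", "C", "D", "F"]   # hypothetical data

# A collection used as a key matches any of its entries;
# unmatched elements are copied as they are.
passed = recode(grades, ["A", "B", "C"] => "pass")
println(passed)

# With a default (second positional argument), unmatched elements
# are replaced by it instead.
binary = recode(grades, "fail", ["A", "B", "C"] => "pass")
println(binary)
```

In both calls `grades` itself is left untouched, since `recode` returns a copy.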
``recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)``

Fill `dest` with elements from `src`, replacing those matching a key of `pairs` with the corresponding value.

For each `Pair` in `pairs`, if the element is equal to (according to `isequal`) the key (first item of the pair) or to one of its entries if it is a collection, then the corresponding value (second item) is copied to `dest`. If the element matches no key and `default` is not provided or `nothing`, it is copied as-is; if `default` is specified, it is used in place of the original element. `dest` and `src` must be of the same length, but not necessarily of the same type. Elements of `src` as well as values from `pairs` will be `convert`ed when possible on assignment. If an element matches more than one key, the first match is used.

``recode!(dest::CategoricalArray, src::AbstractArray[, default::Any], pairs::Pair...)``

If `dest` is a `CategoricalArray` then the ordering of resulting levels is determined by the order of passed `pairs` and `default` will be the last level if provided.

``recode!(dest::AbstractArray, src::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)``

If `src` contains missing values, they are never replaced with `default`: use `missing` in a pair to recode them.
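A brief sketch of the `dest`/`src` form follows, using made-up integer data and a destination created with `similar`. It shows the same call with and without a `default`.

```julia
using CategoricalArrays

src = [1, 2, 3, 4, 5]
dest = similar(src)      # same length and element type as src

# Values 1 and 2 become 0; 3, 4 and 5 match no key and are copied as they are.
recode!(dest, src, 1:2 => 0)
println(dest)

# With a default, every element that matches no key is replaced by it.
recode!(dest, src, -1, 1:2 => 0)
println(dest)
```

Only `dest` is modified; `src` keeps its original values.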
