Using CategoricalArrays

Basic usage

Suppose that you have data about four individuals, with three different age groups. Since this variable is clearly ordinal, we mark the array as such via the ordered argument.

julia> using CategoricalArrays

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{String,1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

By default, the levels are lexically sorted, which is clearly not correct in our case and would give incorrect results when testing for order. This is easily fixed using the levels! function to reorder levels:

julia> levels(x)
3-element Array{String,1}:
 "Middle"
 "Old"   
 "Young" 

julia> levels!(x, ["Young", "Middle", "Old"])
4-element CategoricalArray{String,1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

Thanks to this order, we can not only test for equality between two values, but also compare the ages of e.g. individuals 1 and 2:

julia> x[1]
CategoricalString{UInt32} "Old" (3/3)

julia> x[2]
CategoricalString{UInt32} "Young" (1/3)

julia> x[2] == x[4]
true

julia> x[1] > x[2]
true

Now let us imagine the first individual is actually in the "Young" group. Let's fix this (notice how the string "Young" is automatically converted to a CategoricalString):

julia> x[1] = "Young"
"Young"

julia> x[1]
CategoricalString{UInt32} "Young" (1/3)

The CategoricalArray still considers "Old" as a possible level even if it is unused now. This is necessary to allow efficiently accessing the levels and setting values of elements in the array: indeed, dropping unused levels requires iterating over every element in the array, which is expensive. This property can also be useful to keep track of possible levels, even if they do not occur in practice.

To get rid of the "Old" group, just call the droplevels! function:

julia> levels(x)
3-element Array{String,1}:
 "Young" 
 "Middle"
 "Old"   

julia> droplevels!(x)
4-element CategoricalArray{String,1,UInt32}:
 "Young" 
 "Young" 
 "Middle"
 "Young" 

julia> levels(x)
2-element Array{String,1}:
 "Young" 
 "Middle"

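Another solution would have been to call levels!(x, ["Young", "Middle"]) manually. This command is safe too, since it raises an error when asked to remove a level that is currently in use. In the call below, the misspelled "Midle" means that "Middle" would be removed while it is still used at position 3: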

julia> levels!(x, ["Young", "Midle"])
ERROR: ArgumentError: cannot remove level "Middle" as it is used at position 3. Change the array element type to Union{String, Missing} using convert if you want to transform some levels to missing values.
[...]

Note that entries in the x array can be treated as strings (that's because CategoricalString <: AbstractString):

julia> lowercase(x[3])
"middle"

julia> replace(x[3], 'M'=>'R')
"Riddle"

Handling Missing Values

The examples above assumed that the data contained no missing values. This is generally not the case for real data. This is where CategoricalArray{Union{T, Missing}} comes into play. It is essentially the categorical-data equivalent of Array{Union{T, Missing}}. It behaves exactly as CategoricalArray{T}, except that when indexed it returns either a categorical value object (CategoricalString or CategoricalValue{T}) or missing if the value is missing. See the Julia manual for more information on the Missing type.
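As a minimal sketch of this behavior (the variable name z and its data are illustrative, not part of the original example), an input vector that already contains missing gives an array that accepts missing values without any further annotation:

```julia
using CategoricalArrays

# Illustrative data: the age group of the second individual is unknown.
z = CategoricalArray(["Old", missing, "Young"])

# Indexing returns `missing` for missing entries and a categorical value otherwise.
println(ismissing(z[2]))   # true
println(z[1] == "Old")     # true

# `missing` is not itself a level.
println(levels(z))         # ["Old", "Young"]
```

The exact element type printed for z depends on the installed version of the package, so it is not shown here.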

Let's adapt the example developed above to support missing values. Since there are no missing values in the input vector, we need to specify that the array should be able to hold either a String or missing:

julia> y = CategoricalArray{Union{Missing, String}}(["Old", "Young", "Middle", "Young"], ordered=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

Levels still need to be reordered manually:

julia> levels(y)
3-element Array{String,1}:
 "Middle"
 "Old"   
 "Young" 

julia> levels!(y, ["Young", "Middle", "Old"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

At this point, indexing into the array gives exactly the same result as before:

julia> y[1]
CategoricalString{UInt32} "Old" (3/3)

Missing values can be introduced either manually, or by restricting the set of possible levels. Let us imagine this time that we actually do not know the age of the first individual. We can set it to a missing value this way:

julia> y[1] = missing
missing

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 missing
 "Young" 
 "Middle"
 "Young" 

julia> y[1]
missing

It is also possible to transform all values belonging to some levels into missing values, which gives the same result as above in the present case since we have only one individual in the "Old" group. Let's first restore the original value for the first element, and then set it to missing again using the allow_missing argument to levels!:

julia> y[1] = "Old"
"Old"

julia> y
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Old"   
 "Young" 
 "Middle"
 "Young" 

julia> levels!(y, ["Young", "Middle"]; allow_missing=true)
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 missing
 "Young" 
 "Middle"
 "Young" 

Working with categorical arrays

categorical(A) - Construct a categorical array with values from A

compress(A) - Return a copy of categorical array A using the smallest possible reference type

cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray

decompress(A) - Return a copy of categorical array A using the default reference type

isordered(A) - Test whether entries in A can be compared using <, > and similar operators

ordered!(A) - Set whether entries in A can be compared using <, > and similar operators

recode(a[, default], pairs...) - Return a copy of a after replacing one or more values

recode!(a[, default], pairs...) - Replace one or more values in a in-place

categorical{T}(A::AbstractArray{T}[, compress::Bool]; ordered::Bool=false)

Construct a categorical array with the values from A.

If the element type supports it, levels are sorted in ascending order; else, they are kept in their order of appearance in A. The ordered keyword argument determines whether the array values can be compared according to the ordering of levels or not (see isordered).

If compress is provided and set to true, the smallest reference type able to hold the number of unique values in A will be used. While this will reduce memory use, passing this parameter will also introduce a type instability which can affect performance inside the function where the call is made. Therefore, use this option with caution (the one-argument version does not suffer from this problem).

categorical{T}(A::CategoricalArray{T}[, compress::Bool]; ordered::Bool=isordered(A))

If A is already a CategoricalArray, its levels are preserved; the same applies to the ordered property, and to the reference type unless compress is passed.
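The following sketch illustrates the two behaviors described above: default level sorting for a sortable element type, and preservation of levels and of the ordered property when the input is already a CategoricalArray. The variable names and data are illustrative, and the compress argument is left out on purpose because how it is passed may differ between package versions.

```julia
using CategoricalArrays

# Strings can be sorted, so levels come out in ascending order.
a = categorical(["b", "c", "a", "b"])
println(levels(a))       # ["a", "b", "c"]
println(isordered(a))    # false: `ordered` defaults to false

# Build an ordered array with a custom level order...
b = categorical(["low", "high", "low"], ordered=true)
levels!(b, ["low", "high"])

# ...and pass it through `categorical` again: levels and ordering are kept.
c = categorical(b)
println(levels(c))       # ["low", "high"]
println(isordered(c))    # true
```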

compress(A::CategoricalArray)

Return a copy of categorical array A using the smallest reference type able to hold the number of levels of A.

While this will reduce memory use, this function is type-unstable, which can affect performance inside the function where the call is made. Therefore, use it with caution.

cut(x::AbstractArray, breaks::AbstractVector;
    extend::Bool=false, labels::AbstractVector=[], allow_missing::Bool=false)

Cut a numeric array into intervals and return an ordered CategoricalArray indicating the interval into which each entry falls. Intervals are of the form [lower, upper), i.e. the lower bound is included and the upper bound is excluded.

If x accepts missing values (i.e. eltype(x) >: Missing) the returned array will also accept them.

Arguments

  • extend::Bool=false: when false, an error is raised if some values in x fall outside of the breaks; when true, breaks are automatically added to include all values in x, and the upper bound is included in the last interval.
  • labels::AbstractVector=[]: a vector of strings giving the names to use for the intervals; if empty, default labels are used.
  • allow_missing::Bool=false: when true, values outside of breaks result in missing values. Only supported when x accepts missing values.

cut(x::AbstractArray, ngroups::Integer;
    labels::AbstractVector=String[])

Cut a numeric array into ngroups quantiles, determined using quantile.
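A short sketch of the breaks-based method, using illustrative data. Interval labels are not printed because their exact formatting may vary between package versions; only properties that follow from the description above are checked.

```julia
using CategoricalArrays

ages = [1, 5, 12, 18]

# Two intervals, [0, 10) and [10, 20); every value falls inside the breaks.
groups = cut(ages, [0, 10, 20])
println(isordered(groups))          # true
println(groups[1] == groups[2])     # true: 1 and 5 share the first interval
println(groups[2] < groups[3])      # true: 5 falls in a lower interval than 12

# 25 lies beyond the last break, so `extend=true` is needed to avoid an error;
# a break is added at 25 and a third interval is created.
extended = cut([1, 5, 12, 25], [0, 10, 20]; extend=true)
println(length(levels(extended)))   # 3
```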

decompress(A::CategoricalArray)

Return a copy of categorical array A using the default reference type (UInt32). If A is using a small reference type (such as UInt8 or UInt16) the decompressed array will have room for more levels.

To avoid the need to call decompress, ensure compress is not called when creating the categorical array.
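The round trip between compress and decompress can be sketched as follows. The data is illustrative, and Base.summarysize is used only as a rough, assumed proxy for memory use; the concrete reference types are not inspected because that would rely on package internals.

```julia
using CategoricalArrays

# 900 entries drawn from three levels, stored with the default reference type.
a = categorical(repeat(["x", "y", "z"], 300))

# Three levels fit in a much smaller reference type.
small = compress(a)
println(small == a)                    # values are unchanged
println(levels(small) == levels(a))    # so are the levels
println(Base.summarysize(small) < Base.summarysize(a))

# decompress returns a copy using the default reference type again.
big = decompress(small)
println(big == a)
```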

isordered(A::CategoricalArray)

Test whether entries in A can be compared using <, > and similar operators, using the ordering of levels.

ordered!(A::CategoricalArray, ordered::Bool)

Set whether entries in A can be compared using <, > and similar operators, using the ordering of levels. Return the modified A.
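As a small illustration of isordered and ordered! together (data and variable name are illustrative), an array created without the ordered argument can be switched to ordered after its levels have been arranged:

```julia
using CategoricalArrays

a = categorical(["low", "high", "low"])
println(isordered(a))      # false by default

# Put the levels in their meaningful order, then mark the array as ordered.
levels!(a, ["low", "high"])
ordered!(a, true)
println(isordered(a))      # true
println(a[1] < a[2])       # true: "low" precedes "high" in the level order
```

Reordering the levels first matters here: with the default sorted levels, "high" would come before "low".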

recode(a::AbstractArray[, default::Any], pairs::Pair...)

Return a copy of a, replacing elements matching a key of pairs with the corresponding value. The type of the array is chosen so that it can hold all recoded elements (but not necessarily original elements from a).

For each Pair in pairs, if the element is equal to (according to isequal) the key (first item of the pair), or to one of its entries if the key is a collection, then the corresponding value (second item) is used. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used.

recode(a::CategoricalArray[, default::Any], pairs::Pair...)

If a is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.
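The level ordering rule for categorical input can be sketched with illustrative data as follows; the level names "first", "other" and "rest" are made up for the example.

```julia
using CategoricalArrays

a = categorical(["a", "b", "c", "a"])

# Keys may be single values or collections of values.
r = recode(a, "a" => "first", ["b", "c"] => "other")
println(levels(r))    # ["first", "other"]: levels follow the order of the pairs

# With a default, unmatched elements take the default, which becomes the last level.
d = recode(a, "rest", "a" => "first")
println(levels(d))    # ["first", "rest"]
```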

Examples

julia> using CategoricalArrays

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1)
10-element Array{Int64,1}:
 100
   0
   0
   0
  -1
   6
   7
   8
  -1
  -1

recode(a::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)

If a contains missing values, they are never replaced with default: use missing in a pair to recode them. If missing values are not recoded this way, the returned array will accept missing values.

Examples

julia> using CategoricalArrays

julia> recode(1:10, 1=>100, 2:4=>0, [5; 9:10]=>-1, 6=>missing)
10-element Array{Union{Missing, Int64},1}:
 100    
   0    
   0    
   0    
  -1    
    missing
   7    
   8    
  -1    
  -1    

recode!(dest::AbstractArray, src::AbstractArray[, default::Any], pairs::Pair...)

Fill dest with elements from src, replacing those matching a key of pairs with the corresponding value.

For each Pair in pairs, if the element is equal to (according to isequal) the key (first item of the pair), or to one of its entries if the key is a collection, then the corresponding value (second item) is copied to dest. If the element matches no key and default is not provided or nothing, it is copied as-is; if default is specified, it is used in place of the original element. If an element matches more than one key, the first match is used. dest and src must be of the same length, but not necessarily of the same type. Elements of src as well as values from pairs will be converted when possible on assignment.

recode!(dest::CategoricalArray, src::AbstractArray[, default::Any], pairs::Pair...)

If dest is a CategoricalArray then the ordering of resulting levels is determined by the order of passed pairs and default will be the last level if provided.

recode!(dest::AbstractArray, src::AbstractArray{>:Missing}[, default::Any], pairs::Pair...)

If src contains missing values, they are never replaced with default: use missing in a pair to recode them.
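A minimal sketch of recode! with plain arrays, using illustrative data. The second call shows that missing entries bypass the default, which is why the destination must be able to hold missing.

```julia
using CategoricalArrays

src = [1, 2, 3, 4]
dest = zeros(Int, 4)

# 1 is replaced by 10, 2 and 3 by 0; 4 matches no key and is copied as-is.
recode!(dest, src, 1 => 10, 2:3 => 0)
println(dest)    # [10, 0, 0, 4]

# Here 0 is the default: 3 matches no key and becomes 0,
# while the missing entry stays missing.
msrc = [1, missing, 3]
mdest = Vector{Union{Int, Missing}}(undef, 3)
recode!(mdest, msrc, 0, 1 => 10)
println(mdest)
```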
