Out-of-core support
JuliaDBMeta supports out-of-core operations in several different ways. The following examples assume the REPL was started with julia -p 4.
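For a session that was not launched with worker processes, something like the following sketch should give a comparable setup. It assumes Julia 0.7 or later, where addprocs lives in the Distributed standard library; the check on nprocs() is only there to avoid adding workers twice.

```julia
using Distributed  # standard library (assumes Julia 0.7 or later)

# Add four local workers only if the session has none yet,
# which mirrors starting the REPL with `julia -p 4`.
if nprocs() == 1
    addprocs(4)
end

println("workers: ", nworkers())
```

Packages used in the examples would then need to be available on the workers as well, for instance via @everywhere using JuliaDBMeta.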
Row-wise macros parallelize out of the box
Row-wise macros can be trivially implemented in parallel and will work out of the box with out-of-core tables.
julia> iris = loadtable(Pkg.dir("JuliaDBMeta", "test", "tables", "iris.csv"));
julia> iris5 = table(iris, chunks = 5);
julia> @where iris5 :SepalLength == 4.9 && :Species == "setosa"
Distributed Table with 4 rows in 2 chunks:
SepalLength  SepalWidth  PetalLength  PetalWidth  Species
──────────────────────────────────────────────────────────
4.9          3.0         1.4          0.2         "setosa"
4.9          3.1         1.5          0.1         "setosa"
4.9          3.1         1.5          0.2         "setosa"
4.9          3.6         1.4          0.1         "setosa"
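The reason row-wise operations parallelize so easily can be sketched in plain Julia, without JuliaDB. In this toy model (an assumption for illustration, not how JuliaDBMeta is implemented) a chunked table is a vector of chunks, each chunk a vector of NamedTuple rows with made-up values. Because the predicate inspects one row at a time, filtering chunk by chunk selects the same rows as filtering the whole table.

```julia
# Toy stand-in for a chunked table: each chunk is a vector of rows.
# The values are made up for illustration, not taken from iris.csv.
chunks = [
    [(SepalLength = 4.9, Species = "setosa"), (SepalLength = 5.1, Species = "setosa")],
    [(SepalLength = 4.9, Species = "versicolor"), (SepalLength = 4.9, Species = "setosa")],
]

# A row-wise predicate looks at one row only.
keep(row) = row.SepalLength == 4.9 && row.Species == "setosa"

# Filtering each chunk on its own ...
per_chunk = [filter(keep, chunk) for chunk in chunks]

# ... selects the same rows as filtering the whole table at once.
whole = filter(keep, vcat(chunks...))
@assert vcat(per_chunk...) == whole
```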
Grouping operations parallelize with some data shuffling
Grouping operations work on out-of-core tables, but may involve some data shuffling, because grouping requires all data belonging to the same group to be on the same processor.
julia> @groupby iris5 :Species {mean(:SepalLength)}
Distributed Table with 3 rows in 3 chunks:
Species       mean(SepalLength)
───────────────────────────────
"setosa"      5.006
"versicolor"  5.936
"virginica"   6.588
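The shuffling step can be pictured with a small plain-Julia sketch. It is a conceptual model under simplifying assumptions (everything runs in one process, and the rows are invented), not the JuliaDB implementation: rows from every chunk are first routed to a bucket chosen by their group key, and only then is each group summarised.

```julia
using Statistics  # for mean

# Toy chunks of (Species, SepalLength) rows; values are illustrative only.
chunks = [
    [("setosa", 5.0), ("virginica", 6.0)],
    [("setosa", 4.0), ("versicolor", 6.0)],
    [("virginica", 7.0), ("versicolor", 5.0)],
]

# Shuffle step: route every row to a bucket chosen by its group key,
# so that all rows of one group end up together.
buckets = Dict{String, Vector{Float64}}()
for chunk in chunks, (species, len) in chunk
    push!(get!(buckets, species, Float64[]), len)
end

# Reduce step: each group can now be summarised independently.
means = Dict(species => mean(vals) for (species, vals) in buckets)
```

In a real distributed setting each bucket would live on some worker, and moving rows between workers is the cost referred to as shuffling.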
Apply a pipeline to your data in chunks
@applychunked will apply the analysis pipeline separately to each chunk of data in parallel and collect the result as a distributed table.
julia> @applychunked iris5 begin
           @where :Species == "setosa" && :SepalLength == 4.9
           @transform {Ratio = :SepalLength / :SepalWidth}
       end
Distributed Table with 4 rows in 2 chunks:
SepalLength  SepalWidth  PetalLength  PetalWidth  Species   Ratio
───────────────────────────────────────────────────────────────────
4.9          3.0         1.4          0.2         "setosa"  1.63333
4.9          3.1         1.5          0.1         "setosa"  1.58065
4.9          3.1         1.5          0.2         "setosa"  1.58065
4.9          3.6         1.4          0.1         "setosa"  1.36111
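As a rough mental model of chunked application, the pipeline can be thought of as an ordinary function of a single chunk that is mapped over all chunks. The sketch below uses plain Julia with invented rows; the names pipeline and chunks are hypothetical, and the macro's actual implementation may differ.

```julia
# Toy chunks of rows; values are illustrative, not read from iris.csv.
chunks = [
    [(SepalLength = 4.9, SepalWidth = 3.0, Species = "setosa")],
    [(SepalLength = 6.3, SepalWidth = 3.3, Species = "virginica"),
     (SepalLength = 4.9, SepalWidth = 3.1, Species = "setosa")],
]

# The whole pipeline is an ordinary function of one chunk.
function pipeline(chunk)
    kept = filter(r -> r.Species == "setosa" && r.SepalLength == 4.9, chunk)
    # merge adds a Ratio field to each remaining row
    return map(r -> merge(r, (Ratio = r.SepalLength / r.SepalWidth,)), kept)
end

# Apply it to every chunk independently; the result stays chunked.
result = map(pipeline, chunks)
```

In a multi-process session, map could in principle be swapped for Distributed.pmap, provided pipeline is defined on every worker.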
Column-wise macros do not parallelize yet
Column-wise macros do not have a parallel implementation yet, except when grouping: they need to work on the whole column at once, which makes them difficult to parallelize.
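A small plain-Julia sketch illustrates the difficulty. It assumes a made-up column split into two chunks and a hypothetical helper center that subtracts the column mean. Running the operation chunk by chunk uses each chunk's own mean and therefore gives a different answer from running it on the whole column.

```julia
using Statistics  # for mean

# One column split into two toy chunks; values are illustrative only.
chunks = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
column = vcat(chunks...)

# Column-wise operation: subtract the column mean from every entry.
center(v) = v .- mean(v)

whole = center(column)                    # uses the global mean
per_chunk = vcat(map(center, chunks)...)  # uses each chunk's own mean

@assert whole != per_chunk
```

When grouping, the rows of a group are gathered on one processor first, so the group's full column is available there and the column-wise operation can run on it directly.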