Out-of-core processing

JuliaDB can load data that is too big to fit in memory (RAM) and run a subset of operations on such big tables. In particular, the OnlineStats integration works with reduce and groupreduce, enabling statistical analyses that traditionally would not be possible!

Processing Scheme

The limitation of this processing scheme is that only certain operations work out-of-core:

- reduce
- groupreduce
- join (limited support; see Join to Big Table below)

Loading Data

The loadtable and loadndsparse functions accept the keyword arguments output and chunks, which specify (respectively) the directory in which to save the binary data and the number of chunks to generate from the input files.

Here's an example:

using JuliaDB, Glob   # glob is provided by the Glob package

loadtable(glob("*.csv"), output="bin", chunks=100; kwargs...)

Suppose there are 800 .csv files in the current directory. They will be read into 100 chunks (8 files per chunk). Each worker process will load 8 files into memory, save the chunk into a single binary file in the bin directory, and move on to the next 8 files.

Note

The data from Distributed.nworkers() * (number_of_csvs / chunks) files needs to fit in memory simultaneously.
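
To keep that number manageable, add worker processes before loading, since each worker handles one chunk at a time. A minimal sketch (the worker count here is illustrative):

using Distributed
addprocs(4)                 # 4 workers => 4 chunks in memory simultaneously
@everywhere using JuliaDB   # make JuliaDB available on every worker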

Once data has been loaded in this way, you can reload the dataset (extremely quickly) via

tbl = load("bin")

reduce and groupreduce Operations

reduce is the simplest out-of-core operation since it works pair-wise across chunks. You can also perform group-by operations with a reducer via groupreduce, as shown below.

using JuliaDB, OnlineStats

x = rand(Bool, 100)
y = x + randn(100)

t = table((x=x, y=y))

groupreduce(+, t, :x; select = :y)
Table with 2 rows, 2 columns:
x      +
──────────────
false  5.66734
true   35.2068
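
A whole-table reduction without grouping works the same way, chunk by chunk. For example, using the table t defined above:

reduce(+, t; select = :y)   # sum of the y column across all chunks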

You can replace the reducer with any OnlineStat object (see OnlineStats Integration for more details):

groupreduce(Sum(), t, :x; select = :y)
Table with 2 rows, 2 columns:
x      Sum
────────────────────────────────
false  Sum: n=54 | value=5.66734
true   Sum: n=46 | value=35.2068
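
OnlineStat reducers also work without grouping. A small sketch, again using t from above, that computes a running mean of y merged across chunks:

m = reduce(Mean(), t; select = :y)
value(m)   # extract the estimate from the OnlineStat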

Join to Big Table

join operations have limited out-of-core support. Specifically,

join(bigtable, smalltable; broadcast=:right, how=:inner|:left|:anti)

Here bigtable can be larger than memory, while Distributed.nworkers() copies of smalltable must fit in memory. Note that only :inner, :left, and :anti joins are supported (no :outer joins). In this operation, smalltable is first broadcast to all processors and bigtable is joined Distributed.nworkers() chunks at a time.
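
As a sketch (the table names, the key column, and the lookup data below are all illustrative), broadcasting a small in-memory lookup table against a big out-of-core table might look like:

using JuliaDB

big   = load("bin")                                                # out-of-core table saved earlier
small = table((key = collect(1:10), info = rand(10)); pkey = :key) # fits in memory

# small is copied to every worker; big is joined nworkers() chunks at a time
joined = join(big, small; lkey = :key, rkey = :key, broadcast = :right, how = :inner)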