The DataStreams.jl package aims to define a generic and performant framework for the transfer of "table-like" data, i.e. data that can, at least in some sense, be described by rows and columns.

The framework achieves this by defining interfaces (i.e. groups of methods) for Data.Source types and methods to describe how they "provide" data, as well as Data.Sink types and methods around how they "receive" data. This allows Data.Sources and Data.Sinks to implement their respective interfaces separately, without needing to be aware of each other. The end result is an ecosystem of packages that "automatically" talk with each other, where adding a new package requires no changes to existing packages.

Packages can have a single Julia type implement both the Data.Source and Data.Sink interfaces, or two separate types can implement them. For examples of interface implementations, see some of the packages below:

Data.Source implementations:

Data.Sink implementations:

Data.Source Interface

The Data.Source interface requires the following definitions, where MyPkg would represent a package wishing to implement the interface:
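As a concrete illustration, a minimal Data.Source implementation might look like the following sketch. The `Data.schema` and `Data.isdone` method names come from the DataStreams API, but the `Source` type, its fields, and the exact signatures here are hypothetical and may differ across versions:

```julia
using DataStreams

# Hypothetical Source wrapping in-memory column vectors (illustrative only)
type Source
    schema::Data.Schema   # column names, types, and # of rows
    columns::Vector{Any}  # the actual data, one vector per column
end

# Return the Data.Schema describing the data this Source provides
Data.schema(s::Source) = s.schema

# Signal when the Source has no more data at position (row, col)
Data.isdone(s::Source, row, col) = row > Data.size(s.schema)[1]
```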

Optional definition:

A Data.Source also needs to "register" the type (or types) of streaming it supports. Currently defined streaming types in the DataStreams framework are `Data.Field` (field-based, one value at a time) and `Data.Column` (column-based, one whole column at a time).

A Data.Source formally supports field-based streaming by defining the following:
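For instance, a sketch assuming the `Data.streamtype` registration method and `Data.streamfrom` accessor from the DataStreams API (the `MyPkg.Source` type and its `columns` field are hypothetical):

```julia
# Register support for field-based streaming
Data.streamtype(::Type{MyPkg.Source}, ::Type{Data.Field}) = true

# Produce a single value of type T at position (row, col)
function Data.streamfrom{T}(source::MyPkg.Source, ::Type{Data.Field}, ::Type{T}, row, col)
    return source.columns[col][row]::T
end
```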

And for column-based streaming:
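A corresponding sketch for column-based streaming, under the same assumptions (hypothetical `MyPkg.Source` with a `columns` field; exact signatures may vary by version):

```julia
# Register support for column-based streaming
Data.streamtype(::Type{MyPkg.Source}, ::Type{Data.Column}) = true

# Produce an entire column with element type T at column index col
function Data.streamfrom{T}(source::MyPkg.Source, ::Type{Data.Column}, ::Type{T}, col)
    return source.columns[col]::Vector{T}
end
```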

Data.Sink Interface

Similar to a Data.Source, a Data.Sink needs to "register" the types of streaming it supports; it does so through the following definition:
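In the DataStreams API this registration is done via `Data.streamtypes`, which returns the supported streaming types in order of preference (a sketch; `MyPkg.Sink` is hypothetical):

```julia
# Declare, in order of preference, the streaming types this Sink accepts
Data.streamtypes(::Type{MyPkg.Sink}) = [Data.Column, Data.Field]
```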

A Data.Sink also needs to implement specific forms of constructors that higher-level streaming functions rely on to ensure proper Sink state:
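The exact constructor signatures have varied across DataStreams versions; one common form takes the source's schema plus streaming details. A hypothetical sketch (the `Sink` type, its fields, and the assumed arguments are illustrative, not the definitive API):

```julia
using DataStreams

# Hypothetical Sink collecting data into in-memory vectors
type Sink
    schema::Data.Schema
    columns::Vector{Any}
end

# Assumed constructor form: build a Sink ready to receive data
# described by `schema`; the streaming type and append flag are
# assumptions based on typical DataStreams usage
function Sink(schema::Data.Schema, ::Type{Data.Field}, append::Bool)
    ncols = Data.size(schema)[2]
    return Sink(schema, Any[Any[] for i = 1:ncols])
end
```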

Similar to Data.Source, a Data.Sink also needs to implement its own streamto! method that indicates how it receives data.

A Data.Sink supports field-based streaming by defining:

The optional `schema` argument (the same `schema` passed to the `MyPkg.Sink(schema, ...)` constructors) is provided for efficiency: it can be calculated once at the beginning of a `!` call and used quickly for many calls to `Data.streamto!`. The argument is optional because a Sink can overload `Data.streamto!` with or without it. Note that it is appropriate, though not strictly required, for a `Data.Sink` to implement specialized `Data.streamto!` methods that dispatch according to the type `T` of `val::T`.
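A sketch of a field-based `Data.streamto!` method, assuming the hypothetical `MyPkg.Sink` with in-memory column vectors (signatures may differ across versions):

```julia
# Store a single value `val` at position (row, col); parameterizing on T
# lets a Sink specialize per value type. The trailing `schema` is optional.
function Data.streamto!{T}(sink::MyPkg.Sink, ::Type{Data.Field}, val::T, row, col, schema)
    push!(sink.columns[col], val)
    return nothing
end
```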

A Data.Sink supports column-based streaming by defining:

* `Data.streamto!{T}(sink::MyPkg.Sink, ::Type{Data.Column}, column::T, row, col[, schema])`: Given a column number `col` and a column of data `column`, a `Data.Sink` should store it appropriately. The type of the column is given by `T`, which may be a `NullableVector{T}`. Optionally provided is the `schema` (the same `schema` that is passed to the `MyPkg.Sink(schema, ...)` constructors). This argument is passed for efficiency since it can be calculated once at the beginning of a `!` and used quickly for many calls to `Data.streamto!`; it is optional because a Sink can overload `Data.streamto!` with or without it.
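A corresponding column-based sketch, under the same assumptions (hypothetical `MyPkg.Sink`):

```julia
# Append an entire streamed column to the Sink's col-th column store
function Data.streamto!{T}(sink::MyPkg.Sink, ::Type{Data.Column}, column::T, row, col, schema)
    append!(sink.columns[col], column)
    return nothing
end
```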

A Data.Sink can optionally define the following if needed:
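One such optional definition in the DataStreams API is `Data.close!`, which lets a Sink finalize its state once streaming completes (a sketch; `MyPkg.Sink` is hypothetical):

```julia
# Optional: finalize the Sink after all data has been streamed
function Data.close!(sink::MyPkg.Sink)
    # e.g. flush buffers, close file handles, or commit a transaction
    return sink
end
```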


Data.Schema

A Data.Schema describes a tabular dataset (i.e. a set of optionally named, typed columns with records as rows).

A Data.Schema allows a Data.Source and a Data.Sink to talk to each other and prepare to provide/receive data through streaming. A Data.Schema provides:

  • A boolean type parameter indicating whether the # of rows is known in the Data.Source; this allows Data.Sink and Data.streamto! methods to dispatch on it. Note that the sentinel value -1 is used as the # of rows when the # of rows is unknown.

  • Data.header(schema) to return the header/column names in a Data.Schema

  • Data.types(schema) to return the column types in a Data.Schema; Nullable{T} indicates columns that may contain missing data (null values)

  • Data.size(schema) to return the (# of rows, # of columns) in a Data.Schema
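A quick sketch of these accessors in use, assuming a `Data.Schema` constructor that takes column names, column types, and a row count (the exact constructor arguments may differ across DataStreams versions):

```julia
using DataStreams

# Hypothetical schema: two named, typed columns and 3 rows
sch = Data.Schema(["id", "name"], [Int, Nullable{String}], 3)

Data.header(sch)  # column names
Data.types(sch)   # column types; Nullable{T} marks columns that may hold nulls
Data.size(sch)    # (# of rows, # of columns)
```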

The Data.Source and Data.Sink interfaces both require that Data.schema(source_or_sink) be defined, so that other Data.Sources/Data.Sinks can interoperate with it appropriately.
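Putting it together, the high-level entry point in DataStreams is `Data.stream!`, which matches a Source and Sink via their schemas and registered streaming types. A hedged sketch, reusing the hypothetical constructor form from above (exact call signatures may differ by version):

```julia
# `source` is any value whose type implements the Data.Source interface
sch  = Data.schema(source)                 # the schema drives the transfer
sink = MyPkg.Sink(sch, Data.Field, false)  # hypothetical constructor form
Data.stream!(source, sink)                 # stream all data from source to sink
Data.close!(sink)                          # optional finalization, if defined
```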