OnlineStats Integration

OnlineStats is a package for calculating statistics and models with online (one observation at a time), parallelizable algorithms. It integrates tightly with JuliaDB's distributed data structures to calculate statistics on large datasets. See the OnlineStats documentation for full details.

Basics

OnlineStat objects can be updated with more data and can also be merged together. The image below demonstrates what goes on under the hood in JuliaDB to compute a statistic s in parallel.
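
The update-and-merge behavior can be tried directly with OnlineStats, without JuliaDB. The following is a minimal sketch, assuming a small made-up vector that is split into two chunks by hand; it only illustrates the idea and is not how JuliaDB partitions data internally.

```julia
using OnlineStats

# Made-up data, split by hand into two chunks (an assumption for illustration).
data = collect(1.0:10.0)
chunk1 = data[1:4]
chunk2 = data[5:10]

# Each chunk gets its own statistic, updated one observation at a time.
a = fit!(Mean(), chunk1)
b = fit!(Mean(), chunk2)

# Merging folds b into a, so a now summarizes all ten observations.
merge!(a, b)

n = nobs(a)    # number of observations seen by the merged statistic
m = value(a)   # current estimate of the mean
```

Because the merged statistic agrees with a single pass over all the data (up to floating-point rounding), each chunk can be processed independently and the partial results combined afterwards.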

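OnlineStats integration is available via the reduce and groupreduce functions. An OnlineStat acts differently from a normal reducer: rather than being applied as a binary function to pairs of values, it is updated with each observation, and partial results are merged. For example: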

julia> using JuliaDB, OnlineStats

julia> t = table(1:100, rand(Bool, 100), randn(100));

julia> reduce(Mean(), t; select = 3)
Mean: n=100 | value=0.0424815

julia> grp = groupreduce(Mean(), t, 2; select=3)
Table with 2 rows, 2 columns:
1      2
───────────────────────────────────
false  Mean: n=52 | value=0.0368576
true   Mean: n=48 | value=0.048574

julia> select(grp, (1, 2 => value))
Table with 2 rows, 2 columns:
1      2
────────────────
false  0.0368576
true   0.048574

Note

The OnlineStats.value function extracts the value of a statistic, e.g. value(Mean()).
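
Conceptually, a grouped reduction with an OnlineStat keeps one statistic per group key and then extracts plain values with value. The sketch below mimics that with a Dict and OnlineStats only. The rows are made-up data standing in for a key column and a value column, and the loop is an illustration under those assumptions, not JuliaDB's actual implementation of groupreduce.

```julia
using OnlineStats

# Hypothetical (key, value) rows standing in for two columns of a table.
rows = [(false, 1.0), (true, 10.0), (false, 3.0), (true, 20.0), (true, 30.0)]

# One Mean per distinct key, created the first time that key appears.
stats = Dict{Bool, Mean}()
for (key, x) in rows
    fit!(get!(() -> Mean(), stats, key), x)
end

# Replace each statistic with its plain numeric value.
means = Dict(k => value(s) for (k, s) in stats)
```

Here stats plays the role of the table of Mean objects, and means plays the role of the table obtained after applying value to the second column.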

Calculating Statistics on Multiple Columns

The OnlineStats.Group type is used for calculating statistics on multiple data streams. A Group that computes the same OnlineStat can be created through integer multiplication:

julia> reduce(3Mean(), t)
Group
  ├── Mean: n=100 | value=50.5
  ├── Mean: n=100 | value=0.48
  └── Mean: n=100 | value=0.0424815

Alternatively, a Group can be created by providing a collection of OnlineStats.

julia> reduce(Group(Extrema(Int), CountMap(Bool), Mean()), t)
Group
  ├── Extrema: n=100 | value=(1, 100)
  ├── CountMap: n=100 | value=OrderedCollections.OrderedDict(true=>48,false=>52)
  └── Mean: n=100 | value=0.0424815
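
To see how a Group routes the elements of each row to its statistics, here is a small sketch that uses OnlineStats on its own. The rows are made-up tuples (an assumption for illustration), and each one is fitted individually so that it is clearly treated as a single observation.

```julia
using OnlineStats

# Made-up rows; element i of each tuple is sent to the i-th statistic.
numeric_rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# 2Mean() builds a Group holding two independent Mean statistics.
g = 2Mean()
for row in numeric_rows
    fit!(g, row)
end
column_means = map(value, g.stats)

# A Group can also mix different statistics, one per column.
mixed_rows = [(1.0, true), (2.0, false), (3.0, true)]
h = Group(Mean(), CountMap(Bool))
for row in mixed_rows
    fit!(h, row)
end
mean_stat, count_stat = h.stats
```

After the loops, column_means holds one mean per column, while mean_stat and count_stat are the individual statistics of the mixed Group and can be inspected with value.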