OnlineStats Integration

OnlineStats is a package for calculating statistics and models with online (one observation at a time), parallelizable algorithms. It integrates tightly with JuliaDB's distributed data structures, so statistics can be calculated on large datasets.

For the full OnlineStats documentation, see http://joshday.github.io/OnlineStats.jl/stable/.


Basics

Each OnlineStat can be updated with more data and merged together with another of the same type. JuliaDB integrates with OnlineStats via the reduce and groupreduce functions by accepting an OnlineStat or tuple of OnlineStats.
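To make "updated with more data and merged" concrete, here is a minimal sketch in plain Julia (written for Julia 1.x). The names RunningMean, update!, and combine are hypothetical and chosen for illustration; this is not the OnlineStats implementation. It only shows why a statistic with these two operations can be computed chunk by chunk and then combined.

```julia
# Hypothetical hand-rolled running mean; not the OnlineStats implementation.
mutable struct RunningMean
    n::Int         # number of observations seen so far
    mean::Float64  # current mean of those observations
end
RunningMean() = RunningMean(0, 0.0)

# Update the statistic with a single new observation.
function update!(s::RunningMean, x::Real)
    s.n += 1
    s.mean += (x - s.mean) / s.n
    return s
end

# Merge two partial results into a new one (count-weighted average).
function combine(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return RunningMean()
    return RunningMean(n, (a.n * a.mean + b.n * b.mean) / n)
end

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# One pass over all of the data.
whole = foldl(update!, data; init = RunningMean())

# Two chunks processed independently, then merged.
left   = foldl(update!, data[1:2]; init = RunningMean())
right  = foldl(update!, data[3:6]; init = RunningMean())
merged = combine(left, right)
```

Here whole and merged hold the same count and mean, which is the property that lets a reduction run separately on each chunk of a distributed table before the partial results are merged.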

Example Table

julia> using JuliaDB, OnlineStats

julia> t = table(@NT(x = randn(100), y = randn(100), z = rand(1:5, 100)))
Table with 100 rows, 3 columns:
x          y           z
────────────────────────
1.74621    -1.47081    5
-0.928172  -0.0268673  2
0.38211    0.424011    1
-0.892278  0.553838    4
-1.50414   -0.188053   3
-0.337786  0.426907    5
1.00177    -0.255973   2
-3.06038   -0.76582    2
0.155157   -1.39087    1
⋮
1.86348    -0.683583   5
-0.904156  0.299639    3
-0.266115  1.11892     3
2.53874    -0.665886   4
0.517936   -0.217844   4
0.565392   0.847034    3
-0.560953  1.25836     3
0.506623   -1.29462    2

Usage on a single column

reduce via OnlineStat

julia> reduce(Mean(), t; select = :x)
Mean: n=100 | value=-0.0061806

Several OnlineStats can be calculated on the same column by joining them via Series.

julia> reduce(Series(Mean(), Variance()), t; select = :x)
Series
  ├── Mean: n=100 | value=-0.0061806
  └── Variance: n=100 | value=1.26289

reduce via Tuple of OnlineStats

julia> reduce((Mean(), Variance()), t; select = :x)
(Mean = Mean: n=100 | value=-0.0061806, Variance = Variance: n=100 | value=1.26289)
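Computing a mean and a variance together only needs one pass over the column. The following sketch in plain Julia (Julia 1.x) uses Welford's recurrence as one possible approach; the names MeanVar, observe!, and sample_variance are hypothetical, and whether OnlineStats uses this exact recurrence internally is not stated here.

```julia
# Hypothetical accumulator tracking mean and variance together (Welford's method).
mutable struct MeanVar
    n::Int         # number of observations seen so far
    mean::Float64  # current mean
    m2::Float64    # sum of squared deviations from the current mean
end
MeanVar() = MeanVar(0, 0.0, 0.0)

# Update both statistics with a single observation.
function observe!(s::MeanVar, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Unbiased sample variance; NaN when fewer than two observations were seen.
sample_variance(s::MeanVar) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
acc = foldl(observe!, xs; init = MeanVar())
```

After the fold, acc.mean and sample_variance(acc) describe the same eight values, obtained without storing or revisiting them.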

Usage on multiple columns

To calculate statistics on several columns at once, whether the same statistic for every column or a different one for each, OnlineStats offers the Group type. A Group holds one OnlineStat per selected column. There are several methods for creating a Group:

2Mean() == Group(Mean(), Mean())
[Mean() CountMap(Int)] == Group(Mean(), CountMap(Int))

julia> reduce(2Mean(), t; select = (:x, :y))
Group
  ├── Mean: n=100 | value=-0.0061806
  └── Mean: n=100 | value=0.0011239
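The idea of keeping one statistic per column can be sketched in plain Julia (Julia 1.x). The function column_means is hypothetical and is not part of OnlineStats or JuliaDB; it assumes rows arrive as tuples of numbers and that there is at least one row.

```julia
# Hypothetical per-column running means: one mean slot per column,
# all updated from the same row in a single pass.
function column_means(rows)
    ncols = length(first(rows))  # assumes at least one row
    n = 0
    means = zeros(ncols)
    for row in rows
        n += 1
        for j in 1:ncols
            means[j] += (row[j] - means[j]) / n
        end
    end
    return means
end

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 60.0)]
result = column_means(rows)
```

Each entry of result corresponds to one column, in the same order as the columns, which mirrors how the statistics in a Group line up with the columns named in select.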

Different OnlineStats on columns

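To calculate different statistics on different columns, pass a Group with one OnlineStat per selected column. The hcat form is used here. In the environment that generated this page, hcat produced a plain Array of OnlineStats rather than a Group, so the call failed with the error shown. Constructing the Group explicitly, as Group(Mean(), CountMap(Int)), does not depend on the hcat method.

julia> g = reduce([Mean() CountMap(Int)], t; select = (:x, :z))
ERROR: MethodError: objects of type Array{OnlineStatsBase.OnlineStat,2} are not callable
Use square brackets [] for indexing an Array.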