OnlineStats Integration

OnlineStats is a package for calculating statistics and models with online (one observation at a time), parallelizable algorithms. It integrates tightly with JuliaDB's distributed data structures to calculate statistics on large datasets.

For the full OnlineStats documentation, see http://joshday.github.io/OnlineStats.jl/stable/.
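
To make "online" and "parallelizable" concrete, here is a minimal plain-Julia sketch of a running mean. It does not use OnlineStats and is not how the package is implemented; the names RunningMean, update!, and combine are hypothetical and exist only for this illustration. The estimate is updated one observation at a time, and two partial results computed on separate chunks can be combined afterwards.

```julia
# Hypothetical illustration only; not OnlineStats' actual implementation.
mutable struct RunningMean
    n::Int         # number of observations seen so far
    mean::Float64  # current estimate of the mean
end

RunningMean() = RunningMean(0, 0.0)

# Update the estimate with a single observation, without storing past data.
function update!(m::RunningMean, x::Real)
    m.n += 1
    m.mean += (x - m.mean) / m.n
    return m
end

# Combine two partial results, e.g. computed on separate chunks of the data.
function combine(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return RunningMean()
    return RunningMean(n, (a.n * a.mean + b.n * b.mean) / n)
end

data = collect(1.0:10.0)

# One pass over all of the data.
whole = RunningMean()
for x in data
    update!(whole, x)
end

# Two independent passes over halves of the data, combined afterwards.
left, right = RunningMean(), RunningMean()
for x in data[1:5]
    update!(left, x)
end
for x in data[6:end]
    update!(right, x)
end
merged = combine(left, right)

println(whole.mean)
println(merged.mean)
```

Both paths should agree up to floating-point rounding, which is the property that lets such a statistic be computed chunk by chunk on distributed data.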


Basics

Each statistic or model is a subtype of OnlineStat, and one or more OnlineStats can be grouped together in a Series. In JuliaDB, the functions reduce and groupreduce accept any of the following as the reducer:

  1. An OnlineStat

  2. A tuple of OnlineStats

  3. A Series
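
Where reduce folds a whole column into one result, groupreduce produces one result per group. The following plain-Julia sketch illustrates the idea only: it does not call JuliaDB, and grouped_mean, z, and x are hypothetical names chosen for this example. It keeps a separate running mean for each distinct key and updates it one observation at a time.

```julia
# Hypothetical helper: mean of `vs` within each distinct key of `ks`,
# updated one observation at a time. Not JuliaDB's implementation.
function grouped_mean(ks, vs)
    counts = Dict{eltype(ks), Int}()      # observations seen per key
    means = Dict{eltype(ks), Float64}()   # running mean per key
    for (k, v) in zip(ks, vs)
        n = get(counts, k, 0) + 1
        m = get(means, k, 0.0)
        counts[k] = n
        means[k] = m + (v - m) / n
    end
    return means
end

z = [1, 2, 1, 2, 3]                  # grouping key
x = [1.0, 10.0, 3.0, 20.0, 5.0]      # values to summarize
result = grouped_mean(z, x)
println(result)
```

In JuliaDB the per-group state would be an OnlineStat rather than a hand-written mean, but the shape of the computation is the same: one independent reducer per key.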

Example Table

julia> using JuliaDB, OnlineStats

julia> t = table(@NT(x = randn(100), y = randn(100), z = rand(1:5, 100)))
Table with 100 rows, 3 columns:
x          y          z
───────────────────────
-0.934631  1.6782     4
0.719415   0.873509   4
-0.15204   -1.58039   4
1.35698    -1.34347   5
2.4496     0.593092   4
-1.95775   2.55436    5
1.05645    1.22796    2
-0.895908  1.72162    3
-0.401243  0.326645   5
⋮
-0.351493  0.942583   1
0.513704   1.99944    2
-2.35242   0.114023   1
-0.193922  -0.437365  3
-1.06496   1.82176    4
0.599822   0.305946   4
1.19809    0.446625   4
1.41909    0.444338   2

Usage on a single column

reduce via OnlineStat

julia> reduce(Mean(), t; select = :x)
▦ Series{0}
│ EqualWeight | nobs=100
└── Mean(-0.0139999)

reduce via Tuple of OnlineStats

julia> reduce((Mean(), Variance()), t; select = :x)
(Mean = ▦ Series{0}
│ EqualWeight | nobs=100
└── Mean(-0.0139999), Variance = ▦ Series{0}
│ EqualWeight | nobs=100
└── Variance(1.17187))

reduce via Series

julia> s = Series(Mean(), Variance(), Sum());

julia> reduce(s, t; select = :x)
▦ Series{0}
│ EqualWeight | nobs=100
├── Mean(-0.0139999)
├── Variance(1.17187)
└── Sum{Float64}(-1.39999)

Usage on multiple columns

Same OnlineStat on each column

If we want the same statistic calculated for each column in the selection, we need to specify the number of columns by multiplying the statistic by that number (here 2Mean() for the two selected columns).

julia> reduce(2Mean(), t; select = (:x, :y))
▦ Series{1}
│ EqualWeight | nobs=100
└── MV{Mean}(-0.013999941453549563, 0.20541118062479766)

Different OnlineStats on columns

To calculate different statistics on different columns, we need to make a Group, which can be created via hcat (the space-separated bracket syntax [a b]).

julia> s = reduce([Mean() CountMap(Int)], t; select = (:x, :z))
▦ Series{1}
│ EqualWeight | nobs=100
└── Group : ("Mean", "CountMap")

julia> value(stats(s)[1])
(-0.013999941453549563, Dict(4=>21,2=>18,3=>13,5=>25,1=>23))
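
The result above pairs a mean for one column with a count of each distinct value in another. As a rough plain-Julia sketch of what that single pass amounts to (not the Group implementation; mean_and_counts and the sample rows are hypothetical), each row updates a different statistic per column:

```julia
# Hypothetical rows of (x, z) pairs standing in for the selected columns.
rows = [(0.5, 4), (-1.0, 2), (2.0, 4), (0.5, 1)]

# One pass over the rows: a running mean of the first field and a
# count of each distinct value of the second field.
function mean_and_counts(rows)
    n = 0
    mean_x = 0.0
    counts = Dict{Int, Int}()
    for (x, z) in rows
        n += 1
        mean_x += (x - mean_x) / n
        counts[z] = get(counts, z, 0) + 1
    end
    return mean_x, counts
end

m, c = mean_and_counts(rows)
println(m)
println(c)
```

The returned tuple has the same shape as the value shown above: a number for the first column and a dictionary of counts for the second.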