Data Structures

Table

A Table is a collection of tuples or named tuples. These tuples are "rows" of the table. The values of the same field in all rows form a "column". A Table can be constructed by passing the columns to the table function. The names argument sets the names of the columns:

julia> t = table([1,2,3], [4,5,6], names=[:x, :y])
Table with 3 rows, 2 columns:
x  y
────
1  4
2  5
3  6

Since a table iterates over rows, indexing with an integer returns the row at that position:

julia> row = t[2]
(x = 2, y = 5)

julia> row.x
2

julia> row.y
5

The returned value is a named tuple in this case.

Further, indexing a table with a range of indices, or more generally any array of integer indices, returns a new table containing that subset of rows:

julia> t[2:3]
Table with 2 rows, 2 columns:
x  y
────
2  5
3  6

julia> t[[1,1,3]]
Table with 3 rows, 2 columns:
x  y
────
1  4
1  4
3  6

Optionally, a subset of fields can be chosen as the "primary key". The rows are kept sorted in lexicographic order of the primary key fields. This has two benefits:

  1. It makes lookup, grouping, join and sort operations fast when the primary key fields are involved.

  2. It provides a natural default for operations such as groupby and join.

Passing the pkey option to the table constructor selects the primary key fields:

julia> b = table([2,1,2,1],[2,3,1,3],[4,5,6,7], names=[:x,:y,:z], pkey=(:x,:y))
Table with 4 rows, 3 columns:
x  y  z
───────
1  3  5
1  3  7
2  1  6
2  2  4

Note that the output table is sorted by the primary key fields.
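To make "lexicographic order of the primary key fields" concrete, the sketch below reproduces the row order of the example above using only Base Julia. It illustrates the ordering rule and is not the library's implementation; the variable names pkeys, perm and sorted_rows are introduced here for illustration.

```julia
# Columns as passed to `table` in the example above.
x = [2, 1, 2, 1]
y = [2, 3, 1, 3]
z = [4, 5, 6, 7]

# One key tuple per row, built from the primary key columns (:x, :y).
pkeys = collect(zip(x, y))

# Tuples compare lexicographically: first by x, then by y.
# sortperm returns a stable permutation, so rows with equal keys
# keep their original relative order.
perm = sortperm(pkeys)

# Apply the same permutation to every column to reorder whole rows.
sorted_rows = collect(zip(x[perm], y[perm], z[perm]))
# sorted_rows == [(1, 3, 5), (1, 3, 7), (2, 1, 6), (2, 2, 4)]
```

The two rows with key (1, 3) stay adjacent, which is what makes grouping and lookup by key cheap on sorted data.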

Below is the full documentation of the table constructor:

IndexedTables.table (Function)

table(cols::AbstractVector...; names, <options>)

Create a table with columns given by cols.

julia> a = table([1,2,3], [4,5,6])
Table with 3 rows, 2 columns:
1  2
────
1  4
2  5
3  6

names specifies the names of the columns. If specified, the table will be an iterator of named tuples.

julia> b = table([1,2,3], [4,5,6], names=[:x, :y])
Table with 3 rows, 2 columns:
x  y
────
1  4
2  5
3  6

table(cols::Union{Tuple, NamedTuple}; <options>)

Convert a struct of columns to a table of structs.

julia> table(([1,2,3], [4,5,6])) == a
true

julia> table(@NT(x=[1,2,3], y=[4,5,6])) == b
true
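The phrase "struct of columns to a table of structs" can be sketched in plain Julia. The snippet below assumes Julia 1.x, where named tuples are written as (x = ..., y = ...) rather than with the @NT macro used in this document. It does not call the library; cols and row_list are illustrative names.

```julia
# A "struct of columns": one vector per field.
cols = (x = [1, 2, 3], y = [4, 5, 6])

# The same data as a vector of rows, one named tuple per row.
# keys(cols) gives the field names, values(cols) the column vectors.
row_list = [NamedTuple{keys(cols)}(vals) for vals in zip(values(cols)...)]

# Row access now mirrors indexing a table with an integer.
second = row_list[2]        # (x = 2, y = 5)

# Going back: collect each field across all rows into a column.
x_column = [r.x for r in row_list]
```

A table presents the row-oriented view to the user while the data is supplied column by column, as in the constructor calls above.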

table(cols::Columns; <options>)

Construct a table from a vector of tuples. See rows.

julia> table(Columns([1,2,3], [4,5,6])) == a
true

julia> table(Columns(x=[1,2,3], y=[4,5,6])) == b
true

table(t::Union{Table, NDSparse}; <options>)

Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.

julia> b == table(b)
true

table(iter; <options>)

Construct a table from an iterable table.

Options:

  • pkey: select columns to act as the primary key. By default, no columns are used as primary key.

  • presorted: is the data pre-sorted by primary key columns? If so, skip sorting. false by default. Irrelevant if chunks is specified.

  • copy: creates a copy of the input vectors if true. true by default. Irrelevant if chunks is specified.

  • chunks: the number of chunks (an Integer) to distribute the table into (a safe bet is nworkers()). The table is not distributed by default. See Distributed docs.

Examples:

Specifying pkey will cause the table to be sorted by the columns named in pkey:

julia> b = table([2,3,1], [4,5,6], names=[:x, :y], pkey=:x)
Table with 3 rows, 2 columns:
x  y
────
1  6
2  4
3  5

julia> b = table([2,1,2,1],[2,3,1,3],[4,5,6,7],
                 names=[:x, :y, :z], pkey=(:x,:y))
Table with 4 rows, 3 columns:
x  y  z
───────
1  3  5
1  3  7
2  1  6
2  2  4

Note that the keys do not have to be unique.

The chunks option creates a distributed table.

chunks can be:

  1. An integer – number of chunks to create

  2. A vector of k integers – number of elements in each of the k chunks.

  3. The distribution of another array. i.e. vec.subdomains where vec is a distributed array.
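As a rough illustration of the second form, the sketch below turns a vector of chunk lengths into row ranges. It assumes chunks cover contiguous runs of rows in order, which this document does not state explicitly; chunk_ranges is a hypothetical helper, not part of the library.

```julia
# Hypothetical helper: map chunk lengths to contiguous row ranges.
# Assumes chunk i holds the next lengths[i] rows, in order.
function chunk_ranges(lengths::Vector{Int})
    stops = cumsum(lengths)            # last row of each chunk
    starts = stops .- lengths .+ 1     # first row of each chunk
    return [a:b for (a, b) in zip(starts, stops)]
end

# A 4-row table split into chunks of 1 and 3 rows.
ranges = chunk_ranges([1, 3])
# ranges == [1:1, 2:4]
```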

julia> t = table([2,3,1,4], [4,5,6,7],
                  names=[:x, :y], pkey=:x, chunks=2)
Distributed Table with 4 rows in 2 chunks:
x  y
────
1  6
2  4
3  5
4  7

A distributed table will be constructed if one of the arrays passed into the table constructor is a distributed array. A distributed array can be constructed using distribute:

julia> x = distribute([1,2,3,4], 2);

julia> t = table(x, [5,6,7,8], names=[:x,:y])
Distributed Table with 4 rows in 2 chunks:
x  y
────
1  5
2  6
3  7
4  8

julia> table(columns(t)..., [9,10,11,12],
             names=[:x,:y,:z])
Distributed Table with 4 rows in 2 chunks:
x  y  z
────────
1  5  9
2  6  10
3  7  11
4  8  12

Distribution is done to match the first distributed column from left to right. Specify chunks to override this.


NDSparse

An NDSparse object is a collection of values sparsely distributed over domains which may be discrete or continuous. For example, stock prices are sparsely distributed over the domains of stock ticker symbols and timestamps.

julia> prices = ndsparse(@NT(ticker=["GOOG", "GOOG", "KO", "KO"],
                         date=Date.(["2017-11-10", "2017-11-11",
                                     "2017-11-10", "2017-11-11"])),
                         [1029.74, 1028.23, 46.23, 46.53])
2-d NDSparse with 4 values (Float64):
ticker  date       │
───────────────────┼────────
"GOOG"  2017-11-10 │ 1029.74
"GOOG"  2017-11-11 │ 1028.23
"KO"    2017-11-10 │ 46.23
"KO"    2017-11-11 │ 46.53

NDSparse maps tuples of indices of arbitrary types to values, just like an Array maps tuples of integer indices to values. Here, the indices are shown to the left of the vertical line, while the values they map to are to the right.
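One way to picture this mapping is a dictionary keyed by index tuples. The sketch below uses a Base Dict as a stand-in and assumes Julia 1.x, where Dates is a standard library. It is an analogy only: a Dict is unordered, whereas NDSparse keeps its indices sorted, as noted in the constructor documentation.

```julia
using Dates  # standard library in Julia 1.x

# Each key is a (ticker, date) tuple; each value is a price.
prices = Dict(
    ("GOOG", Date("2017-11-10")) => 1029.74,
    ("GOOG", Date("2017-11-11")) => 1028.23,
    ("KO",   Date("2017-11-10")) => 46.23,
    ("KO",   Date("2017-11-11")) => 46.53,
)

# Point lookup with a full index tuple.
ko_price = prices[("KO", Date("2017-11-10"))]   # 46.23

# Selecting all entries for one ticker means scanning every key here.
ko = filter(p -> p.first[1] == "KO", prices)
```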

The indexing syntax can be used for lookup:

julia> prices["KO", Date("2017-11-10")]
46.23

julia> prices["KO", :]
2-d NDSparse with 2 values (Float64):
ticker  date       │
───────────────────┼──────
"KO"    2017-11-10 │ 46.23
"KO"    2017-11-11 │ 46.53

julia> prices[:, Date("2017-11-10")]
2-d NDSparse with 2 values (Float64):
ticker  date       │
───────────────────┼────────
"GOOG"  2017-11-10 │ 1029.74
"KO"    2017-11-10 │ 46.23

Similarly, other array operations like broadcast, reducedim, and mapslices are defined for NDSparse just as they are for Arrays.

An NDSparse is constructed using the ndsparse function.

ndsparse(indices, data; agg, presorted, copy, chunks)

Construct an NDSparse array with the given indices and data. Each vector in indices represents the index values for one dimension. On construction, the indices and data are sorted in lexicographic order of the indices.

Arguments:

  • agg::Function: If indices contains duplicate entries, the corresponding data items are reduced using this 2-argument function.

  • presorted::Bool: If true, the indices are assumed to already be sorted and no sorting is done.

  • copy::Bool: If true, the storage for the new array will not be shared with the passed indices and data. If false (the default), the passed arrays will be copied only if necessary for sorting. The only way to guarantee sharing of data is to pass presorted=true.

  • chunks::Integer: the number of chunks to distribute the data into (a safe bet is nworkers()). Not distributed by default. See Distributed docs.
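The interaction between sorting and agg can be sketched in plain Julia for the 1-dimensional case. This is an illustration of the behaviour described above, not the library's code; combine_duplicates is a hypothetical helper. With the library itself, the signature above suggests passing the reducer as the agg keyword.

```julia
# Hypothetical helper: sort by index, then merge runs of equal
# indices by folding their data with the 2-argument function `agg`.
function combine_duplicates(indices, data, agg)
    perm = sortperm(indices)           # stable sort order of the indices
    out_idx = similar(indices, 0)      # empty vector, same element type
    out_val = similar(data, 0)
    for i in perm
        if !isempty(out_idx) && out_idx[end] == indices[i]
            # Same index as the previous entry: reduce into it.
            out_val[end] = agg(out_val[end], data[i])
        else
            push!(out_idx, indices[i])
            push!(out_val, data[i])
        end
    end
    return out_idx, out_val
end

idx, vals = combine_duplicates(["b", "a", "a"], [3, 1, 2], +)
# idx == ["a", "b"], vals == [3, 3]
```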

Examples:

A 1-dimensional NDSparse can be constructed with a single array as the index.

julia> x = ndsparse(["a","b"],[3,4])
1-d NDSparse with 2 values (Int64):
1   │
────┼──
"a" │ 3
"b" │ 4

julia> keytype(x), eltype(x)
(Tuple{String}, Int64)

A dimension will be named if constructed with a named tuple of columns as index.

julia> x = ndsparse(@NT(date=Date.(2014:2017)), [4:7;])
1-d NDSparse with 4 values (Int64):
date       │
───────────┼──
2014-01-01 │ 4
2015-01-01 │ 5
2016-01-01 │ 6
2017-01-01 │ 7

julia> x[Date("2015-01-01")]
5

julia> keytype(x), eltype(x)
(Tuple{Date}, Int64)

A multi-dimensional NDSparse can be constructed by passing a tuple of index columns:

julia> x = ndsparse((["a","b"],[3,4]), [5,6])
2-d NDSparse with 2 values (Int64):
1    2 │
───────┼──
"a"  3 │ 5
"b"  4 │ 6

julia> keytype(x), eltype(x)
(Tuple{String,Int64}, Int64)

julia> x["a", 3]
5

The data itself can also contain tuples (these are stored in columnar format, just like in a table).

julia> x = ndsparse((["a","b"],[3,4]), ([5,6], [7.,8.]))
2-d NDSparse with 2 values (2-tuples):
1    2 │ 3  4
───────┼───────
"a"  3 │ 5  7.0
"b"  4 │ 6  8.0

julia> x = ndsparse(@NT(x=["a","a","b"],y=[3,4,4]),
                    @NT(p=[5,6,7], q=[8.,9.,10.]))
2-d NDSparse with 3 values (2 field named tuples):
x    y │ p  q
───────┼────────
"a"  3 │ 5  8.0
"a"  4 │ 6  9.0
"b"  4 │ 7  10.0

julia> keytype(x), eltype(x)
(Tuple{String,Int64}, NamedTuples._NT_p_q{Int64,Float64})

julia> x["a", :]
2-d NDSparse with 2 values (2 field named tuples):
x    y │ p  q
───────┼───────
"a"  3 │ 5  8.0
"a"  4 │ 6  9.0

Passing a chunks option to ndsparse, or constructing it with a distributed array, will cause the result to be distributed. Use the distribute function to distribute an array.

julia> x = ndsparse(@NT(date=Date.(2014:2017)), [4:7.;], chunks=2)
1-d Distributed NDSparse with 4 values (Float64) in 2 chunks:
date       │
───────────┼────
2014-01-01 │ 4.0
2015-01-01 │ 5.0
2016-01-01 │ 6.0
2017-01-01 │ 7.0

julia> x = ndsparse(@NT(date=Date.(2014:2017)), distribute([4:7.0;], 2))
1-d Distributed NDSparse with 4 values (Float64) in 2 chunks:
date       │
───────────┼────
2014-01-01 │ 4.0
2015-01-01 │ 5.0
2016-01-01 │ 6.0
2017-01-01 │ 7.0

Distribution is done to match the first distributed column from left to right. Specify chunks to override this.


Indexing

This section describes the reindex and rechunk functions which let you change the indexed columns in a table or NDSparse, and sort the contents of a distributed table or NDSparse respectively.

IndexedTables.reindex (Function)

reindex(t::Table, by[, select])

Reindex t by columns selected in by. Keeps columns selected by select as non-indexed columns. By default all columns not mentioned in by are kept.

Use selectkeys to reindex an NDSparse object.

julia> t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));

julia> reindex(t, (:y, :z))
Table with 2 rows, 3 columns:
y  z  x
───────
1  4  2
3  5  1

julia> pkeynames(ans)
(:y, :z)

julia> reindex(t, (:w=>[4,5], :z))
Table with 2 rows, 4 columns:
w  z  x  y
──────────
4  5  1  3
5  4  2  1

julia> pkeynames(ans)
(:w, :z)

rechunk