Joins

Joins

Base.joinFunction.

join([f, ] left, right; how, <options>)

Join two tables (left and right). how specifies which join method is used (one of :inner, :left, :right, :outer and :anti). By default, join keys are implied to be the primary keys, but this can be changed using the lkey and rkey options. See Options section below.

The function f must take 2 arguments: tuples of non-key fields from both tables as input. The fields chosen for f can be configured using lselect and rselect options. See Options section below. If f is not specified, then these tuples are concatenated to form the non-indexed fields of the output.

Inner join

Inner join is the default join (when how is unspecified). It looks up keys from left in right and only joins them when there is a match. This generates the "intersection" of keys from left and right.

julia> l = table([1,1,2,2], [1,2,1,2], [1,2,3,4],
                 names=[:a,:b,:c], pkey=(:a, :b))
Table with 4 rows, 3 columns:
a  b  c
───────
1  1  1
1  2  2
2  1  3
2  2  4

julia> r = table([0,1,1,3], [1,1,2,2], [1,2,3,4],
                 names=[:a,:b,:d], pkey=(:a, :b))
Table with 4 rows, 3 columns:
a  b  d
───────
0  1  1
1  1  2
1  2  3
3  2  4

julia> join(l,r) # inner join
Table with 2 rows, 4 columns:
a  b  c  d
──────────
1  1  1  2
1  2  2  3

Left join

Left join looks up rows from right where keys match that in left. If there are no such rows in right, an NA value is used for every selected field from right.

julia> join(l,r, how=:left)
Table with 4 rows, 4 columns:
a  b  c  d
────────────
1  1  1  2
1  2  2  3
2  1  3  #NA
2  2  4  #NA

Outer join

Outer (aka Union) join looks up rows from right where keys match that in left, and also rows from left where keys match those in left, if there are no matches on either side, a tuple of NA values is used. The output is guarranteed to contain the union of all keys from both tables.

julia> join(l,r, how=:outer)
Table with 6 rows, 4 columns:
a  b  c    d
──────────────
0  1  #NA  1
1  1  1    2
1  2  2    3
2  1  3    #NA
2  2  4    #NA
3  2  #NA  4

Anti join

Anti join keeps rows in left whose keys are NOT present in right.

julia> join(l, r, how=:anti)
Table with 2 rows, 3 columns:
a  b  c
───────
2  1  3
2  2  4

One-to-many and many-to-many matches

If the same key appears multiple times in either table (say, m and n times respectively), each row with a key from left is matched with each row from right with that key (resulting in m×n output rows with the same key.)

julia> l1 = table([1,2,2,3], [1,2,3,4], names=[:x,:y])
Table with 4 rows, 2 columns:
x  y
────
1  1
2  2
2  3
3  4

julia> r1 = table([2,2,3,3], [5,6,7,8], names=[:x,:z])
Table with 4 rows, 2 columns:
x  z
────
2  5
2  6
3  7
3  8

julia> join(l1,r1, lkey=:x, rkey=:x)
Table with 6 rows, 3 columns:
x  y  z
───────
2  2  5
2  2  6
2  3  5
2  3  6
3  4  7
3  4  8

This applies to all joins described above except anti join where rows are not matched.

Options

  • how::Symbol – join method to use. Described above.

  • lkey::Selection – fields from left to match on

  • rkey::Selection – fields from right to match on

  • lselect::Selection – fields from left to use as output columns, or input to f if it is specified. By default, this is all fields not selected in lkey.

  • rselect::Selection – fields from right to use as output columns, or input to f if it is specified. By default, this is all fields not selected in rkey.

See select for a description of Selection type.

julia> join(l, r, lkey=:a, rkey=:a,
            lselect=:b, rselect=:d, how=:outer)
Table with 8 rows, 3 columns:
a  b    d
───────────
0  #NA  1
1  1    2
1  1    3
1  2    2
1  2    3
2  1    #NA
2  2    #NA
3  #NA  4
source

groupjoin([f, ] left, right; how, <options>)

Join left and right creating groups of values with matching keys.

Inner join

Inner join is the default join (when how is unspecified). It looks up keys from left in right and only joins them when there is a match. This generates the "intersection" of keys from left and right.

One-to-many and many-to-many matches

If the same key appears multiple times in either table (say, m and n times respectively), each row with a key from left is matched with each row from right with that key. The resulting group has m×n output elements.

julia> l = table([1,1,1,2], [1,2,2,1], [1,2,3,4],
                 names=[:a,:b,:c], pkey=(:a, :b))
Table with 4 rows, 3 columns:
a  b  c
───────
1  1  1
1  2  2
1  2  3
2  1  4

julia> r = table([0,1,1,2], [1,2,2,1], [1,2,3,4],
                 names=[:a,:b,:d], pkey=(:a, :b))
Table with 4 rows, 3 columns:
a  b  d
───────
0  1  1
1  2  2
1  2  3
2  1  4

julia> groupjoin(l,r)
Table with 2 rows, 3 columns:
a  b  groups
──────────────────────────────────────────────────────────────────────────────────────────────────────
1  2  NamedTuples._NT_c_d{Int64,Int64}[(c = 2, d = 2), (c = 2, d = 3), (c = 3, d = 2), (c = 3, d = 3)]
2  1  NamedTuples._NT_c_d{Int64,Int64}[(c = 4, d = 4)]

Left join

Left join looks up rows from right where keys match that in left. If there are no such rows in right, an NA value is used for every selected field from right.

julia> groupjoin(l,r, how=:left)
Table with 3 rows, 3 columns:
a  b  groups
──────────────────────────────────────────────────────────────────────────────────────────────────────
1  1  NamedTuples._NT_c_d{Int64,Int64}[]
1  2  NamedTuples._NT_c_d{Int64,Int64}[(c = 2, d = 2), (c = 2, d = 3), (c = 3, d = 2), (c = 3, d = 3)]
2  1  NamedTuples._NT_c_d{Int64,Int64}[(c = 4, d = 4)]

Outer join

Outer (aka Union) join looks up rows from right where keys match that in left, and also rows from left where keys match those in left, if there are no matches on either side, a tuple of NA values is used. The output is guarranteed to contain


julia> groupjoin(l,r, how=:outer)
Table with 4 rows, 3 columns:
a  b  groups
──────────────────────────────────────────────────────────────────────────────────────────────────────
0  1  NamedTuples._NT_c_d{Int64,Int64}[]
1  1  NamedTuples._NT_c_d{Int64,Int64}[]
1  2  NamedTuples._NT_c_d{Int64,Int64}[(c = 2, d = 2), (c = 2, d = 3), (c = 3, d = 2), (c = 3, d = 3)]
2  1  NamedTuples._NT_c_d{Int64,Int64}[(c = 4, d = 4)]

Options

  • how::Symbol – join method to use. Described above.

  • lkey::Selection – fields from left to match on

  • rkey::Selection – fields from right to match on

  • lselect::Selection – fields from left to use as input to use as output columns, or input to f if it is specified. By default, this is all fields not selected in lkey.

  • rselect::Selection – fields from left to use as input to use as output columns, or input to f if it is specified. By default, this is all fields not selected in rkey.

julia> groupjoin(l,r, lkey=:a, rkey=:a, lselect=:c, rselect=:d, how=:outer)
Table with 3 rows, 2 columns:
a  groups
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
0  NamedTuples._NT_c_d{Int64,Int64}[]
1  NamedTuples._NT_c_d{Int64,Int64}[(c = 1, d = 2), (c = 1, d = 3), (c = 2, d = 2), (c = 2, d = 3), (c = 3, d = 2), (c = 3, d = 3)]
2  NamedTuples._NT_c_d{Int64,Int64}[(c = 4, d = 4)]
source

asofjoin(left::NDSparse, right::NDSparse)

asofjoin is most useful on two time-series. It joins rows from left with the "most recent" value from right.

julia> x = ndsparse((["ko","ko", "xrx","xrx"],
                     Date.(["2017-11-11", "2017-11-12",
                            "2017-11-11", "2017-11-12"])), [1,2,3,4]);

julia> y = ndsparse((["ko","ko", "xrx","xrx"],
                     Date.(["2017-11-12", "2017-11-13",
                            "2017-11-10", "2017-11-13"])), [5,6,7,8])

julia> asofjoin(x,y)
2-d NDSparse with 4 values (Int64):
1      2          │
──────────────────┼──
"ko"   2017-11-11 │ 1
"ko"   2017-11-12 │ 5
"xrx"  2017-11-11 │ 7
"xrx"  2017-11-12 │ 7
source

Merge

Base.mergeFunction.

merge(a::Union{Table, NDSparse}, a::Union{Table, NDSparse})

Merge rows from two datasets while keeping them ordered by primary keys.

Examples:

julia> a = table([1,3,5], [1,2,3], names=[:x,:y], pkey=:x)
Table with 3 rows, 2 columns:
x  y
────
1  1
3  2
5  3

julia> b = table([2,3,4], [1,2,3], names=[:x,:y], pkey=:x)
Table with 3 rows, 2 columns:
x  y
────
2  1
3  2
4  3

julia> merge(a,b)
Table with 6 rows, 2 columns:
x  y
────
1  1
2  1
3  2
3  2
4  3
5  3

When merging two NDSparse objects, if the same key is present in both inputs, the value from the second input is chosen.

julia> a = ndsparse([1,3,5], [1,2,3]);

julia> b = ndsparse([2,3,4], [1,2,3]);

julia> merge(a,b)
1-d NDSparse with 5 values (Int64):
1 │
──┼──
1 │ 1
2 │ 1
3 │ 2
4 │ 3
5 │ 3

However, you can pass the agg keyword argument to combine the values with a custom function.

julia> merge(a,b,agg=+)
1-d NDSparse with 5 values (Int64):
1 │
──┼──
1 │ 1
2 │ 1
3 │ 4
4 │ 3
5 │ 3
source

Merge two NamedTuples favoring the lhs Order is preserved lhs names come first. This copies the underlying data.

source

Broadcast

Base.broadcastMethod.

broadcast(f::Function, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})

Compute an inner join of A and B using function f, where the dimensions of B are a subset of the dimensions of A. Values from B are repeated over the extra dimensions.

dimmap optionally specifies how dimensions of A correspond to dimensions of B. It is a tuple where dimmap[i]==j means the ith dimension of A matches the jth dimension of B. Extra dimensions that do not match any dimensions of j should have dimmap[i]==0.

If dimmap is not specified, it is determined automatically using index column names and types.

julia> a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
2-d NDSparse with 4 values (Int64):
1  2 │
─────┼──
1  1 │ 1
1  2 │ 2
2  1 │ 3
2  2 │ 4

julia> b = ndsparse([1,2], [1/1, 1/2])
1-d NDSparse with 2 values (Float64):
1 │
──┼────
1 │ 1.0
2 │ 0.5

julia> broadcast(*, a, b)
2-d NDSparse with 4 values (Float64):
1  2 │
─────┼────
1  1 │ 1.0
1  2 │ 2.0
2  1 │ 1.5
2  2 │ 2.0

The .-broadcast syntax works with NDSparse:

julia> a.*b
2-d NDSparse with 4 values (Float64):
1  2 │
─────┼────
1  1 │ 1.0
1  2 │ 2.0
2  1 │ 1.5
2  2 │ 2.0

dimmap maps dimensions that should be broadcasted:


julia> broadcast(*, a, b, dimmap=(0,1))
2-d NDSparse with 4 values (Float64):
1  2 │
─────┼────
1  1 │ 1.0
1  2 │ 1.0
2  1 │ 3.0
2  2 │ 2.0
source