Basics

JuliaDB offers two main data structures, IndexedTable and NDSparse, each with a distributed counterpart. This makes it easy to scale up an analysis: operations that work on non-distributed tables either work unchanged on distributed tables or need only minor changes.
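As a minimal sketch of that transition, the snippet below runs the same reduction on an in-memory table and on a distributed copy. It assumes JuliaDB's `distribute` function and the `select` keyword of `reduce`; the column names and the chunk count of 2 are illustrative choices, not part of the examples that follow.

```julia
using JuliaDB

# A small in-memory table, sorted by the primary key :x.
t = table((x = 1:10, z = randn(10)); pkey = [:x])

# Assumption: `distribute` splits the table into 2 chunks and returns
# its distributed counterpart.
dt = distribute(t, 2)

# The same reduction call is written identically for both kinds of table.
local_sum = reduce(+, t; select = :z)
dist_sum  = reduce(+, dt; select = :z)

# Compare with ≈ because the summation order may differ between the two.
local_sum ≈ dist_sum
```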

Here is a high-level overview of tables in JuliaDB:

Data for examples:

x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);

IndexedTable

An IndexedTable is a wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in the example below, columns :x and :y).

An IndexedTable is created with data in Julia via the table function or with data on disk via the loadtable function.

julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns:
x   y    z
───────────────────
1   'a'  -0.342209
2   'a'  -0.971645
3   'a'  0.0199646
4   'a'  2.21521
5   'b'  -0.0578009
6   'b'  -0.15116
7   'b'  0.869767
8   'b'  -0.549508
9   'b'  0.522934
10  'b'  0.00264428

julia> t[1]
(x = 1, y = 'a', z = -0.3422087705224758)

julia> t[end]
(x = 10, y = 'b', z = 0.002644283632470748)
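
The two views described above, a tuple of column vectors versus a vector of row tuples, can be inspected directly. The following is a minimal sketch that assumes the `columns` and `rows` accessors; the data mirrors the example data, and the variable names `cols`, `first_row`, and `total` are illustrative.

```julia
using JuliaDB

x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10)
t = table((x=x, y=y, z=z); pkey = [:x, :y])

# Column-oriented view: a named tuple holding one vector per column.
cols = columns(t)
cols.z                 # the whole :z column

# Row-oriented view: indexing yields one named tuple per row.
first_row = t[1]
first_row.y            # 'a', the :y field of the first row

# `rows` exposes the table as an iterable of named tuples.
total = sum(row.z for row in rows(t))
```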

NDSparse

An NDSparse has a similar underlying structure to IndexedTable, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse are sorted, much like the primary keys of an IndexedTable.

An NDSparse is created with data in Julia via the ndsparse function or with data on disk via the loadndsparse function.

julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples):
x   y   │ z
────────┼───────────
1   'a' │ -0.342209
2   'a' │ -0.971645
3   'a' │ 0.0199646
4   'a' │ 2.21521
5   'b' │ -0.0578009
6   'b' │ -0.15116
7   'b' │ 0.869767
8   'b' │ -0.549508
9   'b' │ 0.522934
10  'b' │ 0.00264428

julia> nd[1, 'a']
(z = -0.3422087705224758,)

julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found

julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples):
y   │ z
────┼──────────
'a' │ -0.342209
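
Because indexing with a key that is not stored throws a KeyError, code that probes for possibly absent keys may want to catch it. The sketch below is one way to do that; `lookup_z` is a hypothetical helper written for this illustration, not a JuliaDB function.

```julia
using JuliaDB

x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10)
nd = ndsparse((x=x, y=y), (z=z,))

# Hypothetical helper: return `default` instead of throwing when the
# key (i, c) is not present in the NDSparse.
function lookup_z(nd, i, c; default = missing)
    try
        return nd[i, c].z
    catch err
        err isa KeyError || rethrow()   # only swallow missing-key errors
        return default
    end
end

lookup_z(nd, 1, 'a')    # the stored value for key (1, 'a')
lookup_z(nd, 10, 'j')   # `missing`, because (10, 'j') is not a key
```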

Selectors

JuliaDB has a variety of ways to select columns. These selection methods are used across many of JuliaDB's functions: select, reduce, groupreduce, groupby, join, pushcol, reindex, and more.

To demonstrate selection, we'll use the select function. A selection can be any of the following types:

  1. Integer – returns the column at this position.
  2. Symbol – returns the column with this name.
  3. Pair{Selection => Function} – selects and maps a function over the selection, returns the result.
  4. AbstractArray – returns the array itself. This must be the same length as the table.
  5. Tuple of Selection – returns a table containing a column for every selector in the tuple.
  6. Regex – returns the columns with names that match the regular expression.
  7. Type – returns columns with elements of the given type.
  8. Not(Selection) – returns columns that are not included in the selection.
  9. Between(first, last) – returns columns between first and last.
  10. Keys() – returns the primary key columns.

julia> t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x   y          z
────────────────────
1   0.801782   true
2   -2.79697   true
3   -0.290173  true
4   -1.06434   true
5   -1.15326   true
6   0.454554   true
7   -0.350886  false
8   0.0321288  false
9   0.580476   false
10  1.43514    true

select the :x vector

julia> select(t, 1)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> select(t, :x)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

map a function to the :y vector

julia> select(t, 2 => abs)
10-element Array{Float64,1}:
 0.8017818814362683
 2.7969726791329395
 0.2901727808810734
 1.0643444468521792
 1.1532592799371608
 0.4545538488820198
 0.3508861162061325
 0.03212882024028744
 0.5804756566634022
 1.4351375461261668

julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Array{Float64,1}:
 0.8017818814362683
 2.7969726791329395
 0.2901727808810734
 1.0643444468521792
 1.1532592799371608
 0.4545538488820198
 0.3508861162061325
 0.03212882024028744
 0.5804756566634022
 1.4351375461261668

select the table of :x and :z

julia> select(t, (:x, :z))
Table with 10 rows, 2 columns:
x   z
─────────
1   true
2   true
3   true
4   true
5   true
6   true
7   false
8   false
9   false
10  true

julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns:
x   z
─────────
1   true
2   true
3   true
4   true
5   true
6   true
7   false
8   false
9   false
10  true

map a function to the table of :x and :y

julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Array{Float64,1}:
  1.8017818814362683
 -0.7969726791329395
  2.7098272191189268
  2.9356555531478206
  3.846740720062839
  6.4545538488820196
  6.649113883793867
  8.032128820240288
  9.580475656663403
 11.435137546126168

julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Array{Float64,1}:
  1.8017818814362683
 -0.7969726791329395
  2.7098272191189268
  2.9356555531478206
  3.846740720062839
  6.4545538488820196
  6.649113883793867
  8.032128820240288
  9.580475656663403
 11.435137546126168

select columns that are subtypes of Integer

julia> select(t, Integer)
Table with 10 rows, 2 columns:
x   z
─────────
1   true
2   true
3   true
4   true
5   true
6   true
7   false
8   false
9   false
10  true

select columns that are not subtypes of Integer

julia> select(t, Not(Integer))
Table with 10 rows, 1 columns:
y
─────────
0.801782
-2.79697
-0.290173
-1.06434
-1.15326
0.454554
-0.350886
0.0321288
0.580476
1.43514
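
The Between and Keys() selectors from the list above are not exercised by the examples so far. The following sketch shows how they might be used, assuming both are exported alongside Not. The table is rebuilt with a primary key on :x (an illustrative choice) so that Keys() has something to select.

```julia
using JuliaDB

# Same shape as the table above, but with :x declared as the primary key.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z], pkey = [:x])

# Columns from :x through :y.
xy = select(t, Between(:x, :y))

# The primary key columns (here only :x).
keys_only = select(t, Keys())
```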

Loading and Saving

Loading Data From CSV

Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable and loadndsparse functions.

using JuliaDB, DelimitedFiles

x = rand(10, 2)
writedlm("temp.csv", x, ',')

t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.3250804107455836  0.2651681922904663
──────────────────────────────────────
0.93089             0.147918
0.913942            0.175732
0.983839            0.688877
0.231887            0.291179
0.108992            0.892554
0.949574            0.707303
0.117289            0.557286
0.846516            0.265606
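0.270774            0.0741044

Because temp.csv has no header row, loadtable treats its first line as the column names, so the resulting table holds 9 rows rather than 10.

Note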

loadtable and loadndsparse use Missing to represent missing values. To load a CSV that instead uses DataValue, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
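
For a file without a header row, such as the one written by writedlm above, the first line can be kept as data. This sketch assumes that loadtable accepts the `header_exists` and `colnames` keyword arguments and passes them to its CSV parser. The column names "a" and "b" and the temporary path are illustrative.

```julia
using JuliaDB, DelimitedFiles

x = rand(10, 2)
path = joinpath(mktempdir(), "temp.csv")   # write to a scratch directory
writedlm(path, x, ',')

# Assumption: `header_exists = false` stops the first line from being used
# as column names, and `colnames` supplies the names instead.
t = loadtable(path; header_exists = false, colnames = ["a", "b"])

length(t)   # number of data rows read from the file
```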

Converting From Other Data Structures

using JuliaDB, RDatasets

df = dataset("datasets", "iris")  # load data as DataFrame

table(df)  # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
Columns:
#  colname      type
────────────────────────────────────────
1  SepalLength  Float64
2  SepalWidth   Float64
3  PetalLength  Float64
4  PetalWidth   Float64
5  Species      CategoricalString{UInt8}

Save Table into Binary Format

A table can be saved to disk (for fast, efficient reloading) via the save function.

Load Table from Binary Format

Tables written with save can be loaded efficiently via load.
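
A round trip through save and load might look like the sketch below. The table contents, the temporary directory, and the file name "t.jdb" are illustrative assumptions; only the save and load calls come from the text above.

```julia
using JuliaDB

t = table((x = 1:5, y = rand(5)); pkey = [:x])

# Hypothetical destination inside a scratch directory.
path = joinpath(mktempdir(), "t.jdb")

save(t, path)      # write the table in JuliaDB's binary format
t2 = load(path)    # read it back

# The reloaded table should carry the same column data.
select(t2, :y) == select(t, :y)
```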