Loading and Saving datasets

Load from CSV

JuliaDB.loadtable – Function.

loadtable(files::Union{AbstractVector,String}; <options>)

Load a table from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • output::AbstractString – directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.

  • indexcols::Vector – columns to use as primary key columns. (defaults to [])

  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

  • distributed::Bool – should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)

  • chunks::Int – number of chunks to create when loading distributed. (defaults to number of workers)

  • delim::Char – the delimiter character. (defaults to ,). Use spacedelim=true to split by spaces.

  • spacedelim::Bool – parse space-delimited files. delim has no effect if true.

  • quotechar::Char – quote character. (defaults to ")

  • escapechar::Char – escape character. (defaults to ")

  • filenamecol::Union{Symbol, Pair} – create a column containing the name of the file each row came from. This argument gives the column its name. By default, the column holds basename(name) of each file name, with the ".csv" suffix stripped. To apply a custom function to the file names instead, pass a name => Function pair. If this option is not given, no file name column is created.

  • header_exists::Bool – whether the files contain a header row. (defaults to true)

  • colnames::Vector{String} – specify column names for the files. Use this with header_exists=false; otherwise the first row of each file is discarded. By default column names are assumed to be present in the file.

  • samecols – a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendor_id" and others have the name "VendorID", pass samecols=[("VendorID", "vendor_id")].

  • colparsers – either a vector or a dictionary of data types or AbstractToken objects from the TextParse package. By default, these are inferred automatically. See the type_detect_rows option below.

  • type_detect_rows – number of rows used to infer the initial colparsers. (defaults to 20)

  • nastrings::Vector{String} – strings that are to be considered NA. (defaults to TextParse.NA_STRINGS)

  • skiplines_begin::Int – number of lines to skip at the beginning of each file. (doesn't skip by default)

  • usecache::Bool – (vestigial)
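
As a minimal sketch of how the options above combine, the following writes two small CSV files and loads them as one table. The file names, column names, and values are made up for illustration, and the snippet assumes the JuliaDB package is installed and exposes loadtable with the keyword arguments listed above.

```julia
using JuliaDB

# Write two small CSV files into a temporary directory (made-up sample data).
csvdir = mktempdir()
write(joinpath(csvdir, "trips_a.csv"),
      "vendor_id,fare,payment\n1,10.5,cash\n2,7.25,card\n")
write(joinpath(csvdir, "trips_b.csv"),
      "vendor_id,fare,payment\n1,3.0,card\n3,12.0,cash\n")

# Load every file in the directory into a single in-memory table,
# keyed on vendor_id and keeping only the fare column as data.
t = loadtable(csvdir;
              indexcols = [:vendor_id],
              datacols = [:fare],
              distributed = false)

println(length(t))    # number of rows loaded across both files
println(colnames(t))  # names of the columns that were kept
```

Passing distributed = false keeps the result in the current process; leaving it out lets loadtable decide based on whether worker processes are available.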

JuliaDB.loadndsparse – Function.

loadndsparse(files::Union{AbstractVector,String}; <options>)

Load an NDSparse from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • indexcols::Vector – columns to use as indexed columns. (by default a 1:n implicit index is used.)

  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

All other options are identical to those in loadtable.
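
A comparable sketch for loadndsparse, again with made-up file and column names and assuming JuliaDB is installed with the keyword arguments documented above. Each (year, ticker) pair appears only once in the sample data, so every row has a unique index.

```julia
using JuliaDB

# One small CSV file with two candidate index columns and one data column
# (made-up sample data).
csvdir = mktempdir()
write(joinpath(csvdir, "prices.csv"),
      "year,ticker,close\n2019,AAA,10.0\n2019,BBB,20.5\n2020,AAA,10.4\n")

# Index the NDSparse by (year, ticker); the remaining column becomes the data.
nd = loadndsparse(csvdir;
                  indexcols = [:year, :ticker],
                  distributed = false)

println(length(nd))  # number of stored entries
```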

Save and Load blobs

Dagger.save – Function.

save(t::DNDSparse, outputdir::AbstractString)

Saves a distributed dataset to disk. Saved data can be loaded with load.

Dagger.load – Function.

load(dir::AbstractString; tomemory)

Load a saved DNDSparse from the directory dir. Data can be saved using the save function.
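
The round trip through save and load can be sketched as follows. This is an illustration under assumptions rather than a definitive recipe: the sample data is made up, JuliaDB is assumed to be installed and to export save and load as described here, and save is assumed to accept an existing empty directory as outputdir.

```julia
using JuliaDB

# Made-up sample data in a temporary directory.
csvdir = mktempdir()
write(joinpath(csvdir, "prices.csv"),
      "year,ticker,close\n2019,AAA,10.0\n2019,BBB,20.5\n2020,AAA,10.4\n")

# Load as a distributed dataset, since save is documented for distributed data.
dnd = loadndsparse(csvdir;
                   indexcols = [:year, :ticker],
                   distributed = true)

# Save to an empty directory, then load the saved dataset back.
outdir = mktempdir()
save(dnd, outdir)
restored = load(outdir)

# collect gathers the distributed chunks into a single in-memory NDSparse.
println(length(collect(restored)))
```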
