Feature Extraction

Machine learning models are composed of mathematical operations on matrices of numbers. However, data in the real world is often in tabular form containing more than just numbers. Hence, the first step in applying machine learning is to turn such tabular non-numeric data into a matrix of numbers. Such matrices are called "feature matrices". JuliaDB contains an ML module which has helper functions to extract feature matrices.
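To make the idea concrete, here is a minimal sketch (plain Julia, no JuliaDB required) of the two core transformations behind feature extraction: one-hot encoding a categorical column and standardizing a continuous column. The helper names `onehot_column` and `standardize_column` are illustrative, not part of any library API.

```julia
# One-hot encode a categorical column given its set of levels.
# Result is a levels × observations matrix of 0s and 1s.
function onehot_column(values, levels)
    [v == l ? 1.0 : 0.0 for l in levels, v in values]
end

# Standardize a continuous column to zero mean and unit variance,
# returned as a 1 × observations row.
function standardize_column(values)
    μ = sum(values) / length(values)
    σ = sqrt(sum((v - μ)^2 for v in values) / length(values))
    reshape([(v - μ) / σ for v in values], 1, :)
end

sex  = ["male", "female", "female", "male"]
fare = [7.25, 71.28, 7.93, 53.1]

# Stack the encoded columns into one 3×4 feature matrix.
features = vcat(onehot_column(sex, ["male", "female"]),
                standardize_column(fare))
```

Each categorical level becomes one row of indicator values, and each continuous column becomes one standardized row; stacking them yields the feature matrix.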

In this document, we will turn the Titanic dataset from Kaggle into numeric form and apply a machine learning model to it.

using JuliaDB

download("https://raw.githubusercontent.com/agconti/"*
          "kaggle-titanic/master/data/train.csv", "train.csv")

train_table = loadtable("train.csv", escapechar='"')
Table with 891 rows, 9 columns:
PassengerId  Survived  Pclass  Sex       Age   SibSp  Parch  Fare     Embarked
──────────────────────────────────────────────────────────────────────────────
1            0         3       "male"    22.0  1      0      7.25     "S"
2            1         1       "female"  38.0  1      0      71.2833  "C"
3            1         3       "female"  26.0  0      0      7.925    "S"
4            1         1       "female"  35.0  1      0      53.1     "S"
5            0         3       "male"    35.0  0      0      8.05     "S"
6            0         3       "male"    #NA   0      0      8.4583   "Q"
7            0         1       "male"    54.0  0      0      51.8625  "S"
8            0         3       "male"    2.0   3      1      21.075   "S"
9            1         3       "female"  27.0  0      2      11.1333  "S"
⋮
884          0         2       "male"    28.0  0      0      10.5     "S"
885          0         3       "male"    25.0  0      0      7.05     "S"
886          0         3       "female"  39.0  0      5      29.125   "Q"
887          0         2       "male"    27.0  0      0      13.0     "S"
888          1         1       "female"  19.0  0      0      30.0     "S"
889          0         3       "female"  #NA   1      2      23.45    "S"
890          1         1       "male"    26.0  0      0      30.0     "C"
891          0         3       "male"    32.0  0      0      7.75     "Q"

ML.schema

A schema is a programmatic description of the data in each column. It is a dictionary which maps each column (by name) to its schema type (mainly Continuous and Categorical).

ML.schema(train_table) will go through the data and infer the type and distribution of each column. Let's try it without any extra arguments on the Titanic dataset:

ML.schema(train_table)
Dict{Symbol,Any} with 12 entries:
  :SibSp       => Continous(μ=0.5230078563411893, σ=1.1027434322934322)
  :Embarked    => Categorical(String["S", "C", "Q", ""])
  :PassengerId => Continous(μ=446.0, σ=257.3538420152301)
  :Cabin       => nothing
  :Age         => JuliaDB.ML.Maybe{JuliaDB.ML.Continuous}(Continous(μ=29.699117…
  :Survived    => Continous(μ=0.3838383838383839, σ=0.4865924542648576)
  :Parch       => Continous(μ=0.3815937149270483, σ=0.8060572211299485)
  :Pclass      => Continous(μ=2.3086419753086447, σ=0.8360712409770491)
  :Ticket      => nothing
  :Sex         => Categorical(String["male", "female"])
  :Name        => nothing
  :Fare        => Continous(μ=32.20420796857465, σ=49.693428597180855)

Here is how the schema was inferred: numeric columns (such as Fare and SibSp) were assumed to be Continuous; string columns with a small set of distinct values (Sex, Embarked) were marked Categorical; free-form string columns (Name, Ticket, Cabin) were given no schema (nothing); and Age, which contains missing values, was wrapped in Maybe{Continuous}.

You may note that the Survived column contains only 1s and 0s, denoting whether a passenger survived the disaster or not. However, our schema inferred the column to be Continuous. To avoid being overly presumptive, ML.schema assumes all numeric columns are continuous by default. We can give the hint that the Survived column is categorical by passing the hints argument as a dictionary mapping column names to schema types. Further, we will also treat Pclass (passenger class) as categorical and suppress the Parch, SibSp and Fare fields.

sch = ML.schema(train_table, hints=Dict(
        :Pclass => ML.Categorical,
        :Survived => ML.Categorical,
        :Parch => nothing,
        :SibSp => nothing,
        :Fare => nothing,
        )
)
Dict{Symbol,Any} with 12 entries:
  :SibSp       => nothing
  :Embarked    => Categorical(String["S", "C", "Q", ""])
  :PassengerId => Continous(μ=446.0, σ=257.3538420152301)
  :Cabin       => nothing
  :Age         => JuliaDB.ML.Maybe{JuliaDB.ML.Continuous}(Continous(μ=29.699117…
  :Survived    => Categorical([0, 1])
  :Parch       => nothing
  :Pclass      => Categorical([3, 1, 2])
  :Ticket      => nothing
  :Sex         => Categorical(String["male", "female"])
  :Name        => nothing
  :Fare        => nothing

Split schema into input and output

In a machine learning model, a subset of fields act as the input to the model, and one or more fields act as the output (the predicted variables). For example, in the Titanic dataset, you may want to predict whether a person will survive or not, so the Survived field will be the output column. Using the ML.splitschema function, you can split the schema into input and output schemas.
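Conceptually, splitting a schema is just partitioning a dictionary by key. The sketch below (plain Julia; `splitschema_sketch` is a hypothetical stand-in, not the JuliaDB API) illustrates the behavior:

```julia
# Split a schema Dict into (input, output) by naming the output columns.
function splitschema_sketch(sch::Dict, output_cols::Symbol...)
    output = Dict(k => sch[k] for k in output_cols)
    input  = Dict(k => v for (k, v) in sch if !(k in output_cols))
    input, output
end

sch = Dict(:Sex => "Categorical", :Age => "Continuous", :Survived => "Categorical")
in_sch, out_sch = splitschema_sketch(sch, :Survived)
# in_sch keeps :Sex and :Age; out_sch holds only :Survived
```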

input_sch, output_sch = ML.splitschema(sch, :Survived)
(Dict{Symbol,Any}(Pair{Symbol,Any}(:SibSp, nothing),Pair{Symbol,Any}(:Embarked, Categorical(String["S", "C", "Q", ""])),Pair{Symbol,Any}(:PassengerId, Continous(μ=446.0, σ=257.3538420152301)),Pair{Symbol,Any}(:Cabin, nothing),Pair{Symbol,Any}(:Age, JuliaDB.ML.Maybe{JuliaDB.ML.Continuous}(Continous(μ=29.69911764705884, σ=14.526497332334051))),Pair{Symbol,Any}(:Parch, nothing),Pair{Symbol,Any}(:Pclass, Categorical([3, 1, 2])),Pair{Symbol,Any}(:Ticket, nothing),Pair{Symbol,Any}(:Sex, Categorical(String["male", "female"])),Pair{Symbol,Any}(:Name, nothing)…), Dict{Symbol,Any}(Pair{Symbol,Any}(:Survived, Categorical([0, 1]))))

Extracting feature matrix

Once the schema has been created, you can extract the feature matrix according to the given schema using ML.featuremat:

train_input = ML.featuremat(input_sch, train_table)
12×891 Array{Float32,2}:
  1.0        0.0       1.0        1.0       …  1.0       0.0       0.0
  0.0        1.0       0.0        0.0          0.0       1.0       0.0
  0.0        0.0       0.0        0.0          0.0       0.0       1.0
  0.0        0.0       0.0        0.0          0.0       0.0       0.0
 -1.72914   -1.72525  -1.72137   -1.71748      1.72137   1.72525   1.72914
  0.0        0.0       0.0        0.0       …  1.0       0.0       0.0
 -0.530005   0.57143  -0.254646   0.364911     0.0      -0.254646  0.158392
  1.0        0.0       1.0        0.0          1.0       0.0       1.0
  0.0        1.0       0.0        1.0          0.0       1.0       0.0
  0.0        0.0       0.0        0.0          0.0       0.0       0.0
  1.0        0.0       0.0        0.0       …  0.0       1.0       1.0
  0.0        1.0       1.0        1.0          1.0       0.0       0.0
train_output = ML.featuremat(output_sch, train_table)
2×891 Array{Float32,2}:
 1.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  0.0  1.0  0.0  1.0
 0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
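Why does the input matrix have 12 rows? Each schema entry contributes a fixed number of feature rows. The widths below are our reading of the schema output above (not an API call): a Categorical column contributes one row per level, a Continuous column one standardized row, and a Maybe{Continuous} column a value row plus a missing-data indicator row.

```julia
# Per-column feature widths implied by the input schema above.
widths = Dict(
    :Embarked    => 4,  # Categorical with 4 levels ("S", "C", "Q", "")
    :PassengerId => 1,  # Continuous → one standardized row
    :Age         => 2,  # Maybe{Continuous} → value + missing indicator
    :Pclass      => 3,  # Categorical with 3 levels
    :Sex         => 2,  # Categorical with 2 levels
)
total = sum(values(widths))  # matches size(train_input, 1) == 12
```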

Learning

Let us create a simple neural network to learn whether a passenger will survive or not using the Flux framework.

ML.width(schema) gives the number of features in a schema; we will use this to specify the model size:

using Flux

model = Chain(
  Dense(ML.width(input_sch), 32, relu),
  Dense(32, ML.width(output_sch)),
  softmax)

loss(x, y) = Flux.mse(model(x), y)
opt = Flux.ADAM(Flux.params(model))
evalcb = Flux.throttle(() -> @show(loss(first(data)...)), 2);

Train the model for 10 iterations:

data = [(train_input, train_output)]
for i = 1:10
  Flux.train!(loss, data, opt, cb = evalcb)
end

The data given to Flux.train! is a vector of batches of input-output matrices. In this case we are training with just 1 batch.
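If you prefer several smaller batches, the observation columns can be partitioned with only Base Julia. In this sketch, `X` and `Y` are random stand-ins for train_input and train_output, with the same shapes:

```julia
# Stand-ins with the same shapes as the Titanic feature matrices.
X = rand(Float32, 12, 891)
Y = rand(Float32, 2, 891)

# Split the 891 observation columns into batches of up to 64 each.
batches = [(X[:, idx], Y[:, idx])
           for idx in Iterators.partition(1:size(X, 2), 64)]
# 14 batches: 13 of 64 observations and a final one of 59
```

Passing `batches` instead of `data` to Flux.train! makes each call to train! run one pass over all the batches.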

Prediction

Now let's load some test data and use the trained model to predict survival.

download("https://raw.githubusercontent.com/agconti/"*
          "kaggle-titanic/master/data/test.csv", "test.csv")

test_table = loadtable("test.csv", escapechar='"')

test_input = ML.featuremat(input_sch, test_table) ;

Run the model on one observation:

model(test_input[:, 1])

The output has two numbers which add up to 1: the probability of not surviving versus that of surviving. According to our model, this passenger is unlikely to have survived the Titanic disaster.
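Turning the two softmax probabilities into a hard class label is just an argmax over the class order used by the output schema, Categorical([0, 1]). In this sketch, `probs` is an illustrative stand-in for model(test_input[:, 1]):

```julia
# Illustrative softmax output: [P(not survived), P(survived)].
probs   = [0.9, 0.1]
classes = [0, 1]                     # order from Categorical([0, 1])
prediction = classes[argmax(probs)]  # 0 → predicted not to survive
```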

You can also run the model on all observations by simply passing the whole feature matrix to model.

model(test_input)