MachineLearning::DecisionTrees – create, test, print out, store, and employ decision trees
    use MachineLearning::DecisionTrees;

    my $grove = MachineLearning::DecisionTrees->new({
        "InputFieldNames"  => ["Input1", "Input2", "Input3"],
        "OutputFieldNames" => ["Output1", "Output2"],
        "InputDataTypes"   => {
            "Input1" => ["male", "female"],
            "Input2" => "number",
            "Input3" => "special_number"},
        "TrainingData"     => $training_data_file_path});

    my $all_is_well = $grove->get_success();
    my $message = $grove->get_message();

    my $saved_ok = $grove->save($file_path);

    my $grove = MachineLearning::DecisionTrees->open($file_path);

    my $tree_printout = $grove->print_out($tree_name);

    my $tree_test_results = $grove->test({
        "TreeName"       => $tree_name,
        "ValidationData" => $validation_data_file_path,
        "NodeList"       => \@selected_node_numbers});

    my $completed_ok = $grove->employ({
        "TreeName"   => $tree_name,
        "TargetData" => $writable_data_file_path,
        "NodeList"   => \@selected_node_numbers});
This module defines a package for creating a MachineLearning::DecisionTrees object.
Decision trees as implemented by the MachineLearning::DecisionTrees module favor quality over quantity. That is, they are optimized to find the highest-quality predictions for a single output value without undue regard to how many data records (or instances) get screened out in the process. This approach is highly useful for many applications. For example, out of many potential financial investments, you might want to identify a small group that has unusually good prospects for success.
The output values supported by the MachineLearning::DecisionTrees module are 1 and 0, and the accuracy of the results is important only for the 1 values. When choosing the attribute on which to split a node, the criterion is maximization of the highest ones percentage among the resulting child nodes. (In the case of a tie, the highest quantity of records breaks the tie.)
NOTE: The selection of the first three attributes (resulting in the children, grandchildren, and great-grandchildren of the root node) is performed organically. That is, the best out of all possible permutations of attributes for the first three slots "wins". Subsequent attributes are chosen individually.
Decision trees as implemented by the MachineLearning::DecisionTrees module come in groves. A grove is one or more decision trees, all built from the same data file. The data file used to build (or train) a grove contains one or more output fields, and each output field corresponds to one tree. When printed out, a decision tree identifies itself using the name of the output field for which it was built.
A decision tree is made up of nodes, starting with a root node (which typically appears at the top in graphical representations, highlighting the fact that trees in nature are upside down :-) and branching from there via child nodes. During the building, testing, or employment of a decision tree, each parent node divides data records associated with it among its children according to each record's value for a particular input field (or attribute). A node with no children is a leaf node and terminates a particular path through the tree.
A path through a decision tree always starts at the root node and ends at a leaf node. Tree creation comprises a training (building) phase followed by a pruning phase.
After a tree has been built, it gets pruned from the bottom up. The pruning algorithm removes any leaf node if the records associated with one of the node's ancestors (other than the root node) have an equal or higher percentage of ones for the output field on which the tree is based than do the records associated with the leaf node. Any nodes that become leaf nodes during the pruning process are themselves subject to pruning.
Once a tree has been pruned, the remaining leaf nodes are sorted by quality (defined as the magnitude of the ones percentage for the records associated with the leaf node). In a printout of the tree, the leaf nodes appear in order by quality (highest to lowest) together with the exact ones percentage for each leaf node based on the training data. For each leaf node, the printout also gives the number of associated training records and the path through the tree.
The data used to build, test, or employ decision trees must include at least two input fields, which are used as attributes by nodes during spawning.
The data must also include one or more output fields in which each value is either a 1 or a 0. (The output values can be blank when the data is passed to the employ() method.) For example, there might be a field for whether you enjoy certain TV shows that have been on in the past and another field for whether your spouse enjoys them. You could create a grove of two decision trees from this data: one tree to select shows that you would most likely enjoy out of the new television offerings starting up in the fall, and another tree to select shows that your spouse would most likely enjoy. (With luck, there will be some that you both enjoy!)
Each input field uses one of three data types: enumeration, number, or special_number.

An enumeration type comprises a list of two or more possible unique text values (for example, male and female for a Gender field).
A number type comprises decimal numbers. The tree-building algorithm divides the values in a number field into the following categories after statistically analyzing all the values for that field in the data set. (In the Criteria column below, m is the mean and d is the population standard deviation.)
    Category          Criteria
    --------          --------
    abnormally_low    Less than m - 2d
    low               Greater than or equal to m - 2d, and less than m - d
    medium            Greater than or equal to m - d, and less than or
                      equal to m + d (that is, within one standard
                      deviation of the mean)
    high              Greater than m + d, and less than or equal to m + 2d
    abnormally_high   Greater than m + 2d
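To make the binning concrete, here is a minimal sketch of the categorization in Perl; the subroutine name and the inline statistics are illustrative, not part of the module's API:

    use strict;
    use warnings;

    # Illustrative only: bin a value for a "number" field using the
    # bands from the table above. $values is a reference to all of
    # the field's values; $x is the value to categorize.
    sub categorize_number {
        my ($x, $values) = @_;
        my $n = scalar @{$values};
        my $mean = 0;
        $mean += $_ for @{$values};
        $mean /= $n;
        my $variance = 0;
        $variance += ($_ - $mean) ** 2 for @{$values};
        $variance /= $n;          # population variance (divide by n)
        my $d = sqrt($variance);  # population standard deviation
        return "abnormally_low"  if $x < $mean - 2 * $d;
        return "low"             if $x < $mean - $d;
        return "medium"          if $x <= $mean + $d;
        return "high"            if $x <= $mean + 2 * $d;
        return "abnormally_high";
    }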
A special_number type is a decimal number for which the sign and the absolute value have separate significance. For example, the sign might indicate direction (up or down) while the absolute value indicates momentum, or the sign might indicate approval versus disapproval while the absolute value indicates the intensity of the response. The categories for this type are abnormally_strong_negative, strong_negative, medium_negative, weak_negative, abnormally_weak_negative, abnormally_weak_positive_or_zero, weak_positive, medium_positive, strong_positive, and abnormally_strong_positive.
The categories into which the algorithm places numeric inputs act as enumerated values.
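The module does not document the exact formula behind the special_number categories, so the following sketch rests on an assumption: the same m/d bands are applied to the absolute values, with the sign attached afterward. It reuses the hypothetical categorize_number() sketch above; all names are illustrative:

    # Assumed reading, not the module's documented algorithm: apply
    # the m/d bands to the absolute values, then attach the sign.
    # The documented category list attaches "_or_zero" only to the
    # weakest positive band, so zero is grouped there.
    sub categorize_special_number {
        my ($x, $values) = @_;
        my @abs_values = map { abs } @{$values};
        my %strength = (
            "abnormally_low"  => "abnormally_weak",
            "low"             => "weak",
            "medium"          => "medium",
            "high"            => "strong",
            "abnormally_high" => "abnormally_strong",
        );
        my $band = $strength{ categorize_number(abs($x), \@abs_values) };
        my $category = $band . ($x < 0 ? "_negative" : "_positive");
        $category .= "_or_zero" if $category eq "abnormally_weak_positive";
        return $category;
    }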
To use this module, you must have the MachineLearning module installed.
new() is the constructor. In addition to the class-name argument, which is passed automatically when you use the MachineLearning::DecisionTrees->new() syntax, the constructor takes a reference to a hash containing the following keys:

    InputFieldNames
    OutputFieldNames
    InputDataTypes
    TrainingData
The values associated with the InputFieldNames and OutputFieldNames keys must each be a reference to an array of field names. All field names (for input and output fields combined) must be unique. Field names must not contain commas or line-break characters. There must be at least two input field names and at least one output field name.
The value associated with the InputDataTypes key must be a reference to a hash in which the keys are input field names (which must match those specified by the InputFieldNames argument) and each value indicates the data type for the corresponding input field. If the value is an array reference, the field has a data type of enumeration, and the array must contain two or more strings, all unique, representing the possible values for the input field. Otherwise, the value indicating the data type must be the string number or the string special_number.
The value associated with the TrainingData key must be the path to a file containing CSV-format training data. The first line in the file must contain field names, which must match the field names specified by the InputFieldNames and OutputFieldNames arguments.
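For illustration only, a training-data file matching the SYNOPSIS field names might begin like this (the records are invented):

    Input1,Input2,Input3,Output1,Output2
    male,41.3,-2.7,1,0
    female,38.9,0.4,0,1
    female,55.0,12.8,1,1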
The constructor returns a reference to a MachineLearning::DecisionTrees object (a grove), which is implemented internally as a hash. All functionality is exposed through methods.
If the constructor receives a valid hash reference providing all required information, the get_success() instance method returns true (1) and the get_message() instance method returns an empty string; otherwise, get_success() returns false (0) and get_message() returns a string containing an error message.
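A typical construct-and-check sequence, reusing the hypothetical names from the SYNOPSIS (the file name training.csv is invented), looks like this:

    use MachineLearning::DecisionTrees;

    my $grove = MachineLearning::DecisionTrees->new({
        "InputFieldNames"  => ["Input1", "Input2", "Input3"],
        "OutputFieldNames" => ["Output1", "Output2"],
        "InputDataTypes"   => {
            "Input1" => ["male", "female"],
            "Input2" => "number",
            "Input3" => "special_number"},
        "TrainingData"     => "training.csv"});

    # Always check for success before using the grove:
    die "Grove creation failed: " . $grove->get_message() . "\n"
        unless $grove->get_success();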
get_success() returns true (1) if the grove was initialized successfully; otherwise, it returns false (0).
When get_success() returns true (1), get_message() returns an empty string. When get_success() returns false (0), get_message() returns an error message.
save() saves (serializes) the grove to the specified file. If the serialization and file-writing operations succeed, this method returns true (1); otherwise, the method sets the _success field to false (0), places an error message in the _message field, and returns false (0).
open() returns a new MachineLearning::DecisionTrees object (a grove) created by restoring such an object from the specified file. If the file-reading and deserialization operations succeed, the resulting object's get_success() method returns true (1) and the get_message() method returns an empty string. Otherwise, the open() method creates and returns a new object that has a false value in the _success field and an error message in the _message field.
print_out() prints out a particular tree within the grove. The name that you provide to the method is the name of the output field corresponding to the tree that you want to study.
The printout lists each leaf node in order by quality (highest to lowest). For each leaf node, the printout gives a number with which you can specify the node when you call the test() method or the employ() method, the ones percentage for the node after training, the number of records associated with the node after training, and the path through the tree leading to the node.
This method returns a string.
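For instance, to study the tree built for the Output1 field from the SYNOPSIS:

    # Get a printout of the "Output1" tree; the leaf-node numbers it
    # lists are the ones to pass to test() and employ():
    print $grove->print_out("Output1");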
test() takes a reference to a hash with the following fields:
TreeName – The name of the tree, which is the same as the name of the output field for which the tree was built.

ValidationData – The path to a file containing CSV-format validation data. This must have the same fields and data types as the training data used to create the grove.

NodeList – A reference to an array containing the numbers of the nodes that you want to use in the test to indicate "1" results.
This method returns a string giving the test results, which comprise the number of records in the test data, the number of records assigned a "1" result during the test, and the accuracy (as a percentage) of those "1" results based on the validation data.
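A call might look like this (the file name and node numbers are invented; in practice the node numbers come from a print_out() listing):

    # Validate leaf nodes 3 and 7 of the "Output1" tree against
    # held-out data:
    my $tree_test_results = $grove->test({
        "TreeName"       => "Output1",
        "ValidationData" => "validation.csv",
        "NodeList"       => [3, 7]});
    print $tree_test_results;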
employ() takes a reference to a hash with the following fields:
TreeName – The name of the tree, which is the same as the name of the output field for which the tree was built.

TargetData – The path to a file containing CSV-format data. This must have the same fields and data types as the training data used to create the grove.

NodeList – A reference to an array containing the numbers of the nodes that you want to use to generate "1" results.
This method divides the records from the specified data file among the leaf nodes of the specified tree, assigning a "1" result to any record associated with one of the nodes given by the NodeList argument. For those records, the method places a 1 in the output field associated with the tree. For the remaining records, the method places a 0 in that field.
The data file that you provide to this method must contain an output field with the same name as the tree; however, the values in that field can be blank. If an output value is not blank, it must be a 1 or a 0 to conform to the data type defined for output fields. This method populates or overwrites the values in the output field with new results.
If everything goes OK, this method returns true (1); otherwise, the method sets the _success field to false (0), places an error message in the _message field, and returns false (0).
Benjamin Fitch, <blernflerkl@yahoo.com>
Copyright 2009 by Benjamin Fitch
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.