Cancer Imaging Phenomics Toolkit (CaPTk)  1.9.0
Miscellaneous: Training Module

This application allows training and cross-validation of machine learning models, as well as inference functionality to generate predictions. Currently, it supports a variety of classifiers for binary classification tasks. It also provides several approaches for feature selection.

REQUIREMENTS: Input features file (CSV) and a target label file (CSV) (if training).
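As a rough illustration, the features file is typically one row per subject with numeric feature columns, and the target file holds one label per subject. The exact header and column conventions shown here are assumptions for illustration, not a specification of what CaPTk requires; a minimal sketch:

```python
import csv

# Hypothetical layout: one row per subject, numeric feature columns.
# The precise header names CaPTk expects may differ.
features = [
    ["Subject", "Feature1", "Feature2"],
    ["sub-01", "0.42", "1.7"],
    ["sub-02", "0.55", "2.3"],
]
labels = [
    ["Subject", "Label"],
    ["sub-01", "1"],
    ["sub-02", "-1"],
]

def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

write_csv("TestFeatures.csv", features)
write_csv("TestLabels.csv", labels)
```

The key constraint is that the two files describe the same subjects in the same order, so each feature row lines up with its label.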

USAGE:

1. Launch the application from "Applications" -> "Training Module".
2. Specify the features (.csv) file and the corresponding target (.csv) file.
3. Specify the mode you wish to run the application in. You can train a model, test an existing model, or perform k-fold cross-validated training and testing.
4. If training a model, specify the classifier you wish to use and the feature selection approach, as well as other options.
   - Classifier options include SVMs (with Linear, RBF, Polynomial, Sigmoid, Chi-squared, and Histogram Intersection kernels), a stochastic-gradient-descent-based SVM, and decision trees (Random Forests and Boosted Trees).
   - You can also enable hyperparameter optimization for the C and Gamma parameters, and optionally enable internal cross-validation during individual training steps.
   - Feature selection options include forward sequential feature selection, recursive feature elimination, effect-size-based feature selection, RELIEF-F-based feature selection, and Random-Forest-based feature selection.
5. In the case of k-fold cross-validation, specify the number of folds in the corresponding edit box.
6. If testing an existing model, you may optionally provide a labels file to be used as ground truth. Performance metrics will be calculated and placed in the output directory.
7. Select an output directory.
8. Press the "Confirm" button.
9. The model or the predicted output, depending on the configuration, will be calculated and saved in the output directory.
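To make the forward sequential feature selection option in step 4 concrete, here is a generic sketch of the technique: starting from an empty set, greedily add the feature whose inclusion most improves a scoring criterion, and stop when no candidate helps. This is an illustration only, not CaPTk's implementation; the nearest-class-mean scoring function is a toy criterion chosen for the example.

```python
import statistics

def centroid_accuracy(X, y, cols):
    """Toy criterion: training accuracy of a nearest-class-mean
    classifier restricted to the candidate feature columns."""
    means = {}
    for cls in set(y):
        rows = [[x[c] for c in cols] for x, t in zip(X, y) if t == cls]
        means[cls] = [statistics.mean(col) for col in zip(*rows)]
    correct = 0
    for x, t in zip(X, y):
        v = [x[c] for c in cols]
        pred = min(means, key=lambda cls: sum((a - b) ** 2
                                              for a, b in zip(v, means[cls])))
        correct += pred == t
    return correct / len(y)

def forward_select(X, y, score, max_features):
    """Greedily add, one at a time, the feature whose inclusion most
    improves the score; stop early when no candidate improves it."""
    selected = []
    remaining = list(range(len(X[0])))
    best = float("-inf")
    while remaining and len(selected) < max_features:
        scores = {j: score(X, y, selected + [j]) for j in remaining}
        j = max(scores, key=scores.get)
        if scores[j] <= best:
            break  # no improvement: stop
        best = scores[j]
        selected.append(j)
        remaining.remove(j)
    return selected

# Feature 0 separates the classes; feature 1 is uninformative noise.
X = [[0.0, 5.0], [0.1, 6.0], [1.0, 5.0], [0.9, 6.0]]
y = [0, 0, 1, 1]
print(forward_select(X, y, centroid_accuracy, max_features=2))  # → [0]
```

The noisy second feature is never selected because adding it does not improve the score beyond what the first feature already achieves.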
• This application is also available as a stand-alone CLI for data analysts to build pipelines around, and can be run in the following ways:
• K-fold cross-validation option:
${CaPTk_InstallDir}/bin/TrainingModule -f C:/TestFeatures.csv -l C:/TestLabels.csv -o C:/OutputDirectory -t crossvalidate -c 1 -k 10
• Training option:
${CaPTk_InstallDir}/bin/TrainingModule -f C:/TestFeatures.csv -l C:/TestLabels.csv -o C:/OutputDirectory -t train -c 2 -s 5 -n 2
• Testing option:
${CaPTk_InstallDir}/bin/TrainingModule -f C:/TestFeatures.csv -m C:/ModelDirectory/ -o C:/OutputDirectory -t test

-c is the classifier type (1 = Linear SVM, 2 = RBF SVM, 3 = Polynomial SVM, 4 = Sigmoid SVM, 5 = Chi-squared SVM, 6 = Histogram Intersection SVM, 7 = Random Forest, 8 = SGD SVM, 9 = Boosted Trees).
-s is the feature selection type (1 = Effect-size FS, 2 = Forward FS, 3 = Recursive Feature Elimination, 4 = Random-Forest-based FS, 5 = RELIEF-F FS).
-t is the execution mode ('cv' or 'crossvalidate' for cross-validation, 'train' for model training only, 'test' for testing only).
-k is the number of folds for the cross-validation configuration.
-x is the maximum number of features to select during feature selection; up to that many features can be included. A value of 0 behaves differently depending on the feature selection method: for Forward FS, Effect-size FS, and Recursive Feature Elimination it produces the best feature set overall, while for Random-Forest-based FS and RELIEF-F FS it selects all features, ordered by importance.

See the usage option of the TrainingModule application (-u, --usage) for more information.
