Title: | Adaptive Sampling for Positive Unlabeled and Label Noise Learning |
---|---|
Description: | Implements the adaptive sampling procedure, a framework for both positive unlabeled learning and learning with class label noise. Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J. (2018) <doi:10.1109/TCYB.2018.2816984>. |
Authors: | Pengyi Yang |
Maintainer: | Pengyi Yang <[email protected]> |
License: | GPL-3 |
Version: | 1.3 |
Built: | 2024-11-08 04:28:27 UTC |
Source: | https://github.com/pyanglab/adasampling |
adaSample()
applies the AdaSampling procedure to reduce noise
in the training set, and subsequently trains a classifier from
the new training set. For each row (observation) in the test set, it
returns the probabilities of it being a positive ("P) or negative
("N") instance, as a two column data frame.
adaSample(Ps, Ns, train.mat, test.mat, classifier = "svm", s = 1, C = 1, sampleFactor = 1, weights = NULL)
adaSample(Ps, Ns, train.mat, test.mat, classifier = "svm", s = 1, C = 1, sampleFactor = 1, weights = NULL)
Ps |
names (each instance in the data has to be named) of positive examples |
Ns |
names (each instance in the data has to be named) of negative examples |
train.mat |
training data matrix, without class labels. |
test.mat |
test data matrix, without class labels. |
classifier |
classification algorithm to be used for learning. Current options are
support vector machine, |
s |
sets the seed. |
C |
sets how many times to run the classifier, C>1 induces an ensemble learning model. |
sampleFactor |
provides a control on the sample size for resampling. |
weights |
feature weights, required when using weighted knn. |
adaSample()
is an adaptive sampling-based noise reduction method
to deal with noisy class labelled data, which acts as a wrapper for
traditional classifiers, such as support vector machines,
k-nearest neighbours, logistic regression, and linear discriminant
analysis.
This process is used to build up a noise-minimized training set
that is derived by iteratively resampling the training set,
(train
) based on probabilities derived after its classification.
This sampled training set is then used to train a classifier, which
is then executed on the test set. adaSample()
returns a series of
predictions for each row of the test set.
Note that this function does not evaluate the quality of the model
and thus does not compare its output to true values of the test set.
To assess please see adaSvmBenchmark()
.
a two column matrix providing classification probabilities of each sample with respect to positive and negative classes
Yang, P., Liu, W., Yang. J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279
Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J.(2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984
# Load the example dataset data(brca) head(brca) # First, clean up the dataset to transform into the required format. brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric) brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)}) rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_") # Introduce 40% noise to positive class and 30% noise to the negative class set.seed(1) pos <- which(brca.cls == 1) neg <- which(brca.cls == 0) brca.cls.noisy <- brca.cls brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0 brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1 # Identify positive and negative examples from the noisy dataset Ps <- rownames(brca.mat)[which(brca.cls.noisy == 1)] Ns <- rownames(brca.mat)[which(brca.cls.noisy == 0)] # Apply AdaSampling method on the noisy data brca.preds <- adaSample(Ps, Ns, train.mat=brca.mat, test.mat=brca.mat, classifier = "knn") head(brca.preds) # Orignal accuracy from the labels accuracy <- sum(brca.cls.noisy == brca.cls) / length(brca.cls) accuracy # Accuracy after applying AdaSampling method accuracyWithAdaSample <- sum(ifelse(brca.preds[,"P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls) accuracyWithAdaSample
# Load the example dataset data(brca) head(brca) # First, clean up the dataset to transform into the required format. brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric) brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)}) rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_") # Introduce 40% noise to positive class and 30% noise to the negative class set.seed(1) pos <- which(brca.cls == 1) neg <- which(brca.cls == 0) brca.cls.noisy <- brca.cls brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0 brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1 # Identify positive and negative examples from the noisy dataset Ps <- rownames(brca.mat)[which(brca.cls.noisy == 1)] Ns <- rownames(brca.mat)[which(brca.cls.noisy == 0)] # Apply AdaSampling method on the noisy data brca.preds <- adaSample(Ps, Ns, train.mat=brca.mat, test.mat=brca.mat, classifier = "knn") head(brca.preds) # Orignal accuracy from the labels accuracy <- sum(brca.cls.noisy == brca.cls) / length(brca.cls) accuracy # Accuracy after applying AdaSampling method accuracyWithAdaSample <- sum(ifelse(brca.preds[,"P"] > 0.5, 1, 0) == brca.cls) / length(brca.cls) accuracyWithAdaSample
adaSvmBenchmark()
allows a comparison between the performance
of an AdaSampling-enhanced SVM (support vector machine)-
classifier against the SVM-classifier on its
own. It requires a matrix of features (extracted from a labelled dataset),
and two vectors of true labels and labels with noise added as desired.
It runs an SVM classifier and returns a matrix which displays the specificity
(Sp), sensitivity (Se) and F1 score for each of four conditions:
"Original" (classifying with true labels), "Baseline" (classifying with
noisy labels), "AdaSingle" (classifying using AdaSampling) and
"AdaEnsemble" (classifying using AdaSampling in conjunction with
an ensemble of models).
adaSvmBenchmark(data.mat, data.cls, data.cls.truth, cvSeed, C = 50, sampleFactor = 1)
adaSvmBenchmark(data.mat, data.cls, data.cls.truth, cvSeed, C = 50, sampleFactor = 1)
data.mat |
a rectangular matrix or data frame that can be coerced to a matrix, containing the features of the dataset, without class labels. Rownames (possibly containing unique identifiers) will be ignored. |
data.cls |
a numeric vector containing class labels for the dataset
with added noise.
Must be in the same order and of the same length as |
data.cls.truth |
a numeric vector of true class labels for
the dataset. Must be the same order and of the same length as |
cvSeed |
sets the seed for cross-validation. |
C |
sets how many times to run the classifier, for the AdaEnsemble condition. See Description above. |
sampleFactor |
provides a control on the sample size for resampling. |
AdaSampling is an adaptive sampling-based noise reduction method
to deal with noisy class labelled data, which acts as a wrapper for
traditional classifiers, such as support vector machines,
k-nearest neighbours, logistic regression, and linear discriminant
analysis. For more details see ?adaSample()
.
This function runs evaluates the AdaSampling procedure by adding noise
to a labelled dataset, and then running support vector machines on
the original and the noisy dataset. Note that this function is for
benchmarking AdaSampling performance using what is assumed to be
a well-labelled dataset. In order to run AdaSampling on a noisy dataset,
please see adaSample()
.
performance matrix
Yang, P., Liu, W., Yang. J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279
Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J.(2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984
# Load the example dataset data(brca) head(brca) # First, clean up the dataset to transform into the required format. brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric) brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)}) rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_") # Introduce 40% noise to positive class and 30% noise to the negative class set.seed(1) pos <- which(brca.cls == 1) neg <- which(brca.cls == 0) brca.cls.noisy <- brca.cls brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0 brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1 # benchmark classification performance with different approaches adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)
# Load the example dataset data(brca) head(brca) # First, clean up the dataset to transform into the required format. brca.mat <- apply(X = brca[,-10], MARGIN = 2, FUN = as.numeric) brca.cls <- sapply(X = brca$cla, FUN = function(x) {ifelse(x == "malignant", 1, 0)}) rownames(brca.mat) <- paste("p", 1:nrow(brca.mat), sep="_") # Introduce 40% noise to positive class and 30% noise to the negative class set.seed(1) pos <- which(brca.cls == 1) neg <- which(brca.cls == 0) brca.cls.noisy <- brca.cls brca.cls.noisy[sample(pos, floor(length(pos) * 0.4))] <- 0 brca.cls.noisy[sample(neg, floor(length(neg) * 0.3))] <- 1 # benchmark classification performance with different approaches adaSvmBenchmark(data.mat = brca.mat, data.cls = brca.cls.noisy, data.cls.truth = brca.cls, cvSeed=1)
A cleaned version of the original Wisconsin Breast Cancer dataset containing histological information about 683 breast cancer samples collected from patients at the University of Wisconsin Hospitals, Madison by Dr. William H. Wolberg between January 1989 and November 1991.
brca
brca
A data frame with 683 rows and 10 variables:
Clump thickness, 1 - 10
Uniformity of cell size, 1 - 10
Uniformity of cell shape, 1 - 10
Marginal adhesion, 1 - 10
Single epithelial cell size, 1 - 10
Bare nuclei, 1 - 10
Bland chromatin, 1 - 10
Normal nucleoli, 1 - 10
Mitoses, 1 - 10
Class, benign or malignant
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
singleIter()
applies a single iteraction of AdaSampling procedure. It
returns the probabilities of all samples as being a positive (P) or negative
(N) instance, as a two column data frame.Classification algorithms included are support vector machines (svm), k-nearest neighbours (knn), logistic regression (logit), linear discriminant analysis (lda), feature weighted knn (wKNN).
singleIter(Ps, Ns, dat, test = NULL, pos.probs = NULL, una.probs = NULL, classifier = "svm", sampleFactor, seed, weights)
singleIter(Ps, Ns, dat, test = NULL, pos.probs = NULL, una.probs = NULL, classifier = "svm", sampleFactor, seed, weights)
Ps |
names (name as index) of positive examples |
Ns |
names (name as index) of negative examples |
dat |
training data matrix, without class labels. |
test |
test data matrix, without class labels. Training data matrix will be used for testing if this is NULL (default). |
pos.probs |
a numeric vector of containing probability of positive examples been positive |
una.probs |
a numeric vector of containing probability of negative or unannotated examples been negative |
classifier |
classification algorithm to be used for learning. Current options are
support vector machine, |
sampleFactor |
provides a control on the sample size for resampling. |
seed |
sets the seed. |
weights |
feature weights, required when using weighted knn. |
Yang, P., Liu, W., Yang. J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. International Joint Conferences on Artificial Intelligence (IJCAI), 3272-3279
Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J.(2018) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, doi:10.1109/TCYB.2018.2816984
Implementation of a feature weighted k-nearest neighbour classifier.
weightedKNN(train.mat, test.mat, cl, k = 3, weights)
weightedKNN(train.mat, test.mat, cl, k = 3, weights)
train.mat |
training data matrix, without class labels. |
test.mat |
test data matrix, without class labels. |
cl |
class labels for training data. |
k |
number of nearest neighbour to be used. |
weights |
weights to be assigned to each feautre. |