Title: Partition Exchangeability Toolkit
Description: Implements Bayesian supervised predictive classifiers, hypothesis testing, and parametric estimation under Partition Exchangeability. The two classifiers implemented are a marginal classifier, which assumes the test data are i.i.d., and a more computationally costly but more accurate simultaneous classifier, which labels the entire test dataset at once, using all of the test data to predict each label. We also provide the Maximum Likelihood Estimate (MLE) of the single underlying parameter of the partition exchangeability generative model, as well as hypothesis tests for the equality of this parameter with a given value or between multiple samples. Functions are also included to simulate sequences from the Ewens Sampling Formula as realisations of the Poisson-Dirichlet distribution, and to compute their probabilities.
Authors: Ville Kinnula [aut], Jing Tang [ctb], Ali Amiryousefi [aut, cre]
Maintainer: Ali Amiryousefi <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0.9000
Built: 2024-11-19 05:25:51 UTC
Source: https://github.com/amiryousefilab/pekit
A function to calculate the abundance vector, or frequencies of frequencies, of a discrete or partly discrete data vector x. The abundance vector is used as input in the functions dPD(), MLEp(), and LMTp().
abundance(x)
x: Data vector.
This function is equivalent to table(table(x)).
This function returns a named vector with the frequencies of the frequencies in the data vector x. The function base::table(x) returns a contingency table with the frequencies in the input data vector x as values. The names(table(x)) are the unique values in the data vector x. In abundance(x), the unique values of table(x) become the names of the values, while the values themselves are the frequencies of the frequencies of the data vector x.
set.seed(111)
x <- rpois(10, 10)
## The frequency table of x:
print(table(x))
## The frequency table of the frequency table of x:
abundance(x)
Fits the model according to training data x, where x is assumed to follow the Poisson-Dirichlet distribution, and discrete labels y.
classifier.fit(x, y)
x: Data vector, or matrix with rows as data points and columns as features.
y: Training data label vector of length equal to the number of rows in x.
This function is used to learn the model parameters from the training data, and to gather them into an object used by the classification algorithms tMarLab() and tSimLab(). The parameters it learns are the Maximum Likelihood Estimates of psi for each feature within each class in the training data. It also records the frequencies of the data for each feature within each class. These are used in calculating the predictive probability of each test data point belonging to each of the classes.
Returns an object used as the training data object for the classification algorithms tMarLab() and tSimLab(). If x is multidimensional, each list described below is returned for each dimension. Returns a list of classwise lists, each with components:
frequencies: the frequencies of values in the class.
psi: the Maximum Likelihood Estimate of psi for the class.
## Create training data x and its class labels y from Poisson-Dirichlet
## distributions with different psis:
set.seed(111)
x1 <- rPD(5000, 10)
x2 <- rPD(5000, 100)
x <- c(x1, x2)
y1 <- rep("1", 5000)
y2 <- rep("2", 5000)
y <- c(y1, y2)
fit <- classifier.fit(x, y)

## With multidimensional x:
set.seed(111)
x1 <- cbind(rPD(5000, 10), rPD(5000, 50))
x2 <- cbind(rPD(5000, 100), rPD(5000, 500))
x <- rbind(x1, x2)
y1 <- rep("1", 5000)
y2 <- rep("2", 5000)
y <- c(y1, y2)
fit <- classifier.fit(x, y)
Distribution function for the Poisson-Dirichlet distribution.
dPD(abund, psi)
abund: An abundance vector.
psi: Dispersal parameter psi.
Given an abundance vector abund, calculates the probability of a data vector x given by the Poisson-Dirichlet distribution. The higher the dispersal parameter psi, the higher the number of distinct observed species. In terms of the paintbox process, a high psi increases the size of the continuous part of the process, while a low psi increases the size of the discrete parts.
The probability given by the Poisson-Dirichlet distribution for the input abundance vector (i.e. an exchangeable random partition) and dispersal parameter psi.
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.
## Get a random sample from the Poisson-Dirichlet distribution, and
## find the probability of such a sample with psi=5:
set.seed(111)
s <- rPD(n = 100, psi = 5)
a <- abundance(s)
dPD(a, psi = 5)
This function performs a statistical test on the null hypothesis that a given sample's underlying distribution is the Poisson-Dirichlet distribution. It calculates a test statistic that is then used to gain a p-value from an empirical distribution of the statistic from simulated samples from a PD distribution.
is.PD(x, rounds)
x: A discrete data vector.
rounds: How many samples are simulated to obtain the empirical distribution.
The calculated test statistic is the homozygosity
W = sum_i (f_i / n)^2,
which is calculated from the sample. Here f_i are the frequencies of each unique value in the sample, and n is the sample size. The MLE of psi is then estimated from the sample with the function MLEp(), and a number of samples equal to the input parameter rounds are generated with that estimate of psi and sample size n. The test statistic W is then calculated for each of the simulated samples. The original W is then given a p-value based on what percentage of the simulated statistics it exceeds.
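The statistic above can be sketched directly from a data vector (a minimal illustration; the function name is hypothetical, not part of the package):

```r
# Sketch of the homozygosity-style statistic described above (illustrative).
# f are the frequencies of each unique value; n is the sample size.
watterson_stat <- function(x) {
  f <- as.numeric(table(x))   # frequencies of unique values
  sum((f / sum(f))^2)         # sum of squared relative frequencies
}

watterson_stat(c(1, 1, 2, 3))  # (2/4)^2 + (1/4)^2 + (1/4)^2
```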
A p-value.
Watterson, G.A., (1978), The homozygosity test of neutrality. Genetics. 88(2):405-417.
##Test whether a typical sample follows PD:
x <- rPD(1000, 10)
is.PD(x, 1000)

##Test a very atypical sample, where the frequencies of different values
##are similar:
x <- c(rep(1, 200), rep(2, 200), rep(3, 200), rep(4, 200), rep(5, 200))
is.PD(x, 1000)
Numerically searches for the MLE of psi given an abundance vector, using a binary search algorithm.
MLEp(abund)
abund: An abundance vector.
Numerically searches for the MLE of psi as the root of the equation
K = sum_{i=0}^{n-1} psi / (psi + i),
where K is the observed number of different species in the sample and n is the sample size. The right side of the equation is monotonically increasing in psi, so a binary search is used to find the root. An accepted psi sets the value of the right side of the equation within R's smallest possible numerical difference of the observed K.
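The binary search can be sketched as follows (a minimal illustration under the equation above; the function name and bounds are hypothetical, not the package's implementation):

```r
# Sketch of the binary search for the MLE of psi (illustrative only).
# K: observed number of distinct species; n: sample size.
mle_psi_sketch <- function(K, n, lower = 1e-8, upper = 1e8, tol = 1e-10) {
  # Expected number of distinct species as a function of psi:
  # sum_{i=0}^{n-1} psi / (psi + i), monotonically increasing in psi.
  expected_K <- function(psi) sum(psi / (psi + 0:(n - 1)))
  while (upper - lower > tol * lower) {
    mid <- (lower + upper) / 2
    if (expected_K(mid) < K) lower <- mid else upper <- mid
  }
  (lower + upper) / 2
}

mle_psi_sketch(K = 5, n = 100)
```

Because the right-hand side is monotone, each halving of the interval is guaranteed to keep the root bracketed.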
The MLE of psi.
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.
##Find the MLE of psi of the vector (1,2,2).
##The frequencies of the frequencies of the data vector are given as input:
MLEp(abundance(c(1, 2, 2)))

##Find the MLE of psi of a sample from the Poisson-Dirichlet distribution:
set.seed(1000)
x <- rPD(n = 10000, psi = 100)
MLEp(abundance(x))
A bootstrapped confidence interval for the Maximum Likelihood Estimate of psi.
MLEp.bsci(x, level = 0.95, rounds = 1000, frac = 0.8)
x: A data vector.
level: Level of the confidence interval, as a number between 0 and 1. Default is 0.95.
rounds: Number of bootstrap rounds. Default is 1000.
frac: Percentage of the data x used in each bootstrap round, as a number between 0 and 1. Default is 0.8.
The MLE of psi, as well as the lower and upper bounds of the bootstrap confidence interval.
## Find a 95% confidence interval for the MLE of psi given a sample from the
## Poisson-Dirichlet distribution:
x <- rPD(n = 10000, psi = 100)
MLEp.bsci(x, 0.95)
Likelihood ratio test for the hypotheses H0: psi_1 = psi_2 = ... = psi_d and H1: the psi_i are not all equal, where psi_1, ..., psi_d are the dispersal parameters of the d samples in the columns of the input data array x.
mult.sample.test(x)
x: The data array to be tested. Each column of x is a separate sample; shorter samples are padded with NA (see the example).
Calculates the Likelihood Ratio Test statistic
Lambda = -2 log( L(psi_hat) / L(psi_hat_1, ..., psi_hat_d) ),
where L is the likelihood function of observing the input samples given a single psi in the numerator, and d different parameters psi_1, ..., psi_d for each sample respectively in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a chi-squared distribution when the null hypothesis is true, with degrees of freedom equal to the difference in the number of parameters between the considered models (d - 1 here). To calculate the statistic, the Maximum Likelihood Estimates of psi_1, ..., psi_d under H1 and the shared psi under H0 are calculated.
Gives a vector with the Likelihood Ratio Test statistic Lambda, as well as the p-value of the test p.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi:10.1098/rsta.1933.0009>.
##Create samples with different n and psi:
set.seed(111)
x <- rPD(1200, 15)
y <- c(rPD(1000, 20), rep(NA, 200))
z <- c(rPD(800, 30), rep(NA, 400))
samples <- cbind(cbind(x, y), z)
##Run the test:
mult.sample.test(samples)
rPD samples randomly from the PD distribution with a given psi by simulating the Hoppe urn model.
rPD(n, psi)
n: Number of observations.
psi: Dispersal parameter.
Samples random values with a given psi from the Poisson-Dirichlet distribution by simulating the Hoppe urn model.
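The urn scheme can be sketched as follows (a minimal illustration of the Hoppe urn, not the package's actual rPD() implementation; the function name is hypothetical):

```r
# Sketch of the Hoppe urn (illustrative). The urn starts with a single
# "mutator" ball of weight psi; each draw either founds a new species,
# with probability psi / (psi + i - 1), or copies an earlier observation
# chosen uniformly at random.
hoppe_urn_sketch <- function(n, psi) {
  x <- integer(n)
  n_species <- 0L
  for (i in seq_len(n)) {
    if (runif(1) < psi / (psi + i - 1)) {
      n_species <- n_species + 1L        # new species gets a fresh label
      x[i] <- n_species
    } else {
      x[i] <- x[sample.int(i - 1L, 1L)]  # copy an earlier observation
    }
  }
  x
}

set.seed(1)
s <- hoppe_urn_sketch(1000, 10)
```

The first draw always founds a new species, and a larger psi yields more distinct species in the sample.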
Returns a vector with a sample of size n from the Hoppe urn model with parameter psi.
Hoppe, F.M. The sampling theory of neutral alleles and an urn model in population genetics. J. Math. Biology 25, 123–159 (1987). <doi:10.1007/BF00276386>.
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.
## Get random samples from the PD distribution with different psi,
## and estimate the psi of each sample:
s1 <- rPD(1000, 10)
s2 <- rPD(1000, 50)
print(c(MLEp(abundance(s1)), MLEp(abundance(s2))))
Performs the Lagrange Multiplier test for the equality of the dispersal parameter psi of a sample. The null hypothesis of the test is H0: psi = psi_0, where psi_0 is given as input.
sample.test(abund, psi = "a")
abund: An abundance vector of a sample.
psi: Target positive number psi_0 to be tested. The default value "a" tests psi = 1, and "r" tests psi = n (see the examples).
Calculates the Lagrange Multiplier test statistic
S = U(psi_0)^2 / I(psi_0),
where U is the derivative of the log-likelihood function of psi (the score) and I is its Fisher information. The statistic S follows a chi-squared distribution with 1 degree of freedom when the null hypothesis H0: psi = psi_0 is true.
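Under the Ewens sampling formula the log-likelihood in psi is K log(psi) - sum_{i=0}^{n-1} log(psi + i), up to a psi-free constant, so the score and Fisher information have closed forms. A minimal sketch (illustrative derivation; the function name is hypothetical):

```r
# Sketch of the score (Lagrange multiplier) statistic for H0: psi = psi0
# (illustrative; assumes the Ewens sampling formula likelihood).
# K: observed number of distinct values; n: sample size.
lm_stat_sketch <- function(K, n, psi0) {
  i <- 0:(n - 1)
  U <- K / psi0 - sum(1 / (psi0 + i))  # score: d/dpsi of the log-likelihood
  I <- sum(i / (psi0 * (psi0 + i)^2))  # expected Fisher information
  U^2 / I                              # approx. chi-squared(1) under H0
}

lm_stat_sketch(K = 20, n = 1000, psi0 = 5)
```

The resulting value can be compared against the chi-squared(1) distribution for a p-value, e.g. with pchisq(S, 1, lower.tail = FALSE).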
The statistic and a p-value of the two-sided test of the hypothesis.
Radhakrishna Rao, C, (1948), Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50-57. <doi:10.1017/S0305004100023987>
## Test the psi of a sample from the Poisson-Dirichlet distribution:
set.seed(10000)
x <- rPD(1000, 10)
## Find the abundance of the data vector:
abund <- abundance(x)
## Test for the psi that was used, as well as a higher and a lower one:
sample.test(abund, 10)
sample.test(abund, 15)
sample.test(abund, 5)
sample.test(abund)      # test for psi=1
sample.test(abund, "r") # test for psi=n
Classifies the test data x based on the training data object. The test data are considered i.i.d., so each data point is classified one by one.
tMarLab(training, x)
training: A training data object from the function classifier.fit().
x: Test data vector or matrix with rows as data points and columns as features.
Independently assigns a class label to each test data point according to a maximum predictive probability rule. Assuming the training data of size n_c in class c arises from a Poisson-Dirichlet(psi_c) distribution, the predictive probability of a test data point x_i arising from class c is
psi_c / (n_c + psi_c)
if no value equal to x_i exists in the training data of class c, and
n_{x_i} / (n_c + psi_c)
if there does, where n_{x_i} is the frequency of the value of x_i in the training data of class c.
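The two-case predictive weight above can be sketched for a single class (a minimal illustration; the function name and argument layout are hypothetical, not the package's internals):

```r
# Sketch of the marginal predictive weight of one test value for one class
# (illustrative). freqs: named frequency table of the class's training
# data; psi: the class's estimated dispersal parameter.
predictive_weight <- function(value, freqs, psi) {
  n_c <- sum(freqs)                  # training sample size of the class
  f <- freqs[as.character(value)]
  if (is.na(f)) {
    psi / (n_c + psi)                # previously unseen value
  } else {
    unname(f) / (n_c + psi)          # seen value: proportional to its count
  }
}

freqs <- c(`1` = 3, `2` = 2)
predictive_weight(1, freqs, 10)  # seen:   3 / 15
predictive_weight(7, freqs, 10)  # unseen: 10 / 15
```

The marginal classifier would compute this weight for every class and assign the label with the largest value.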
A vector of predicted labels for test data x.
Amiryousefi A. (2021). Asymptotic supervised predictive classifiers under partition exchangeability. arXiv preprint, https://arxiv.org/abs/2101.10950.
Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi:10.1007/s11222-011-9291-7>).
## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1 <- rPD(10500, 10)
x2 <- rPD(10500, 1000)
test.ind1 <- sample.int(10500, 500) # Sample test datasets from the
test.ind2 <- sample.int(10500, 500) # original samples
x <- c(x1[-test.ind1], x2[-test.ind2])
## Create training data labels:
y1 <- rep("1", 10000)
y2 <- rep("2", 10000)
y <- c(y1, y2)
## Test data t, with the first half belonging to class "1", the second to "2":
t1 <- x1[test.ind1]
t2 <- x2[test.ind2]
t <- c(t1, t2)
fit <- classifier.fit(x, y)
## Run the classifier, which returns the predicted labels:
tM <- tMarLab(fit, t)

## With multidimensional x:
set.seed(111)
x1 <- cbind(rPD(5500, 10), rPD(5500, 50))
x2 <- cbind(rPD(5500, 100), rPD(5500, 500))
test.ind1 <- sample.int(5500, 500)
test.ind2 <- sample.int(5500, 500)
x <- rbind(x1[-test.ind1, ], x2[-test.ind2, ])
y1 <- rep("1", 5000)
y2 <- rep("2", 5000)
y <- c(y1, y2)
fit <- classifier.fit(x, y)
t1 <- x1[test.ind1, ]
t2 <- x2[test.ind2, ]
t <- rbind(t1, t2)
tM <- tMarLab(fit, t)
Classifies the test data x based on the training data object. All of the test data is used simultaneously to make the classification.
tSimLab(training, x)
training: A training data object from the function classifier.fit().
x: Test data vector or matrix with rows as data points and columns as features.
The test data are first labeled with the marginal classifier. The simultaneous classifier then iterates over all test data, assigning each a label by finding the maximum predictive probability given the current classification structure of the test data as a whole. This is repeated until the classification structure converges after iterating over all data.
A vector of predicted labels for test data x.
Amiryousefi A. (2021). Asymptotic supervised predictive classifiers under partition exchangeability. arXiv preprint, https://arxiv.org/abs/2101.10950.
Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi:10.1007/s11222-011-9291-7>).
## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1 <- rPD(10500, 10)
x2 <- rPD(10500, 1000)
test.ind1 <- sample.int(10500, 500) # Sample test datasets from the
test.ind2 <- sample.int(10500, 500) # original samples
x <- c(x1[-test.ind1], x2[-test.ind2])
## Create training data labels:
y1 <- rep("1", 10000)
y2 <- rep("2", 10000)
y <- c(y1, y2)
## Test data t, with the first half belonging to class "1", the second to "2":
t1 <- x1[test.ind1]
t2 <- x2[test.ind2]
t <- c(t1, t2)
fit <- classifier.fit(x, y)
## Run the classifier, which returns the predicted labels:
tS <- tSimLab(fit, t)

## With multidimensional x:
set.seed(111)
x1 <- cbind(rPD(5500, 10), rPD(5500, 50))
x2 <- cbind(rPD(5500, 100), rPD(5500, 500))
test.ind1 <- sample.int(5500, 500)
test.ind2 <- sample.int(5500, 500)
x <- rbind(x1[-test.ind1, ], x2[-test.ind2, ])
y1 <- rep("1", 5000)
y2 <- rep("2", 5000)
y <- c(y1, y2)
fit <- classifier.fit(x, y)
t1 <- x1[test.ind1, ]
t2 <- x2[test.ind2, ]
t <- rbind(t1, t2)
tS <- tSimLab(fit, t)
Likelihood ratio test for the hypotheses H0: psi_1 = psi_2 and H1: psi_1 != psi_2, where psi_1 and psi_2 are the dispersal parameters of the two input samples s1 and s2.
two.sample.test(s1, s2)
s1, s2: The two data vectors to be tested.
Calculates the Likelihood Ratio Test statistic
Lambda = -2 log( L(psi_hat) / L(psi_hat_1, psi_hat_2) ),
where L is the likelihood function of observing the two input samples given a single psi in the numerator and two different parameters psi_1 and psi_2 for each sample respectively in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a chi-squared distribution under the null hypothesis, with degrees of freedom equal to the difference in the number of parameters between the considered models, which is 1 here. To calculate the statistic, the Maximum Likelihood Estimates of psi_1 and psi_2 and the shared psi of the pooled samples are calculated.
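The construction of Lambda can be sketched from the Ewens log-likelihood, K log(psi) - sum_{i=0}^{n-1} log(psi + i) up to a psi-free constant (an illustrative sketch in terms of the distinct-value counts K and sample sizes n; the function names are hypothetical, not the package's implementation):

```r
# Sketch of the two-sample likelihood-ratio test for H0: psi1 = psi2
# (illustrative; uses the Ewens log-likelihood up to a psi-free constant).
loglik <- function(K, n, psi) K * log(psi) - sum(log(psi + 0:(n - 1)))

# Per-sample MLE: root of sum(psi / (psi + i)) = K.
mle <- function(K, n) {
  uniroot(function(p) sum(p / (p + 0:(n - 1))) - K, c(1e-8, 1e8))$root
}

lrt_sketch <- function(K1, n1, K2, n2) {
  # Shared psi maximizes the pooled log-likelihood under H0.
  shared <- optimize(function(p) loglik(K1, n1, p) + loglik(K2, n2, p),
                     c(1e-6, 1e6), maximum = TRUE)$maximum
  Lambda <- -2 * (loglik(K1, n1, shared) + loglik(K2, n2, shared)
                  - loglik(K1, n1, mle(K1, n1)) - loglik(K2, n2, mle(K2, n2)))
  c(Lambda = Lambda, p.value = pchisq(Lambda, df = 1, lower.tail = FALSE))
}

lrt_sketch(K1 = 20, n1 = 1000, K2 = 60, n2 = 1000)
```

Samples with very different numbers of distinct values yield a large Lambda and a small p-value, since the shared-psi constraint fits both samples poorly.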
Gives a vector with the Likelihood Ratio Test statistic Lambda, as well as the p-value of the test p.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi:10.1098/rsta.1933.0009>.
##Create samples with different n and psi:
set.seed(111)
x <- rPD(500, 15)
y <- rPD(1000, 20)
z <- rPD(800, 30)
##Run the tests:
two.sample.test(x, y)
two.sample.test(x, z)
two.sample.test(y, z)