Package 'PEkit'

Title: Partition Exchangeability Toolkit
Description: Implements the Bayesian supervised predictive classifiers, hypothesis testing, and parametric estimation under Partition Exchangeability. The two classifiers implemented are a marginal classifier that assumes test data is i.i.d., next to a more computationally costly but accurate simultaneous classifier that finds a labelling for the entire test dataset at once, using all the test data to predict each label. We also provide the Maximum Likelihood Estimation (MLE) of the only underlying parameter of the partition exchangeability generative model as well as hypothesis testing statistics for equality of this parameter with a single value, alternative, or multiple samples. We present functions to simulate the sequences from Ewens Sampling Formula as the realisation of the Poisson-Dirichlet distribution and their respective probabilities.
Authors: Ville Kinnula [aut], Jing Tang [ctb] , Ali Amiryousefi [aut, cre]
Maintainer: Ali Amiryousefi <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0.9000
Built: 2024-11-19 05:25:51 UTC
Source: https://github.com/amiryousefilab/pekit

Help Index


Vector of frequencies of frequencies

Description

A function to calculate the abundance vector, or frequencies of frequencies of discrete or partly discrete data vector x. The abundance vector is used as input in the functions dPD(), MLEp(), and LMTp().

Usage

abundance(x)

Arguments

x

Data vector x.

Details

This function is equivalent to table(table(x)).

Value

This function returns a named vector with the frequencies of the frequencies in the data vector x. The function base::table(x) returns a contingency table with the frequencies in the input data vector x as values. The names(table(x)) are the unique values in data vector x. In abundance(x), the unique values in table(x) become the names of the values, while the values themselves are the frequencies of the frequencies of data vector x.

Examples

set.seed(111)
x<-rpois(10,10)
## The frequency table of x:
print(table(x))
## The frequency table of the frequency table of x:
abundance(x)

Fit the supervised classifier under partition exchangeability

Description

Fits the model according to training data x, where x is assumed to follow the Poisson-Dirichlet distribution, and discrete labels y.

Usage

classifier.fit(x, y)

Arguments

x

data vector, or matrix with rows as data points and columns as features.

y

training data label vector of length equal to the amount of rows in x.

Details

This function is used to learn the model parameters from the training data, and gather them into an object that is used by the classification algorithms tMarLab() and tSimLab(). The parameters it learns are the Maximum Likelihood Estimate of the ψ\psi of each feature within each class in the training data. It also records the frequencies of the data for each feature within each class as well. These are used in calculating the predictive probability of each test data being in each of the classes.

Value

Returns an object used as training data objects for the classification algorithms tMarLab() and tSimLab().

If x is multidimensional, each list described below is returned for each dimension.

Returns a list of classwise lists, each with components:

frequencies: the frequencies of values in the class.

psi: the Maximum Likelihood estimate of ψ\psi for the class.

Examples

## Create training data x and its class labels y from Poisson-Dirichlet distributions
## with different psis:
set.seed(111)
x1<-rPD(5000,10)
x2<-rPD(5000,100)
x<-c(x1,x2)
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)

## With multidimensional x:
set.seed(111)
x1<-cbind(rPD(5000,10),rPD(5000,50))
x2<-cbind(rPD(5000,100),rPD(5000,500))
x<-rbind(x1,x2)
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)

The Poisson-Dirichlet distribution

Description

Distribution function for the Poisson-Dirichlet distribution.

Usage

dPD(abund, psi)

Arguments

abund

An abundance vector.

psi

Dispersal parameter ψ\psi. Accepted input values are positive real numbers, "a" for absolute value ψ\psi=1 by default, or "r" for relative value ψ=n\psi=n, where nn is the size of the input sample.

Details

Given an abundance vector abunds, calculates the probability of a data vector x given by the Poisson-Dirichlet distribution. The higher the dispersal parameter ψ\psi, the higher the amount of distinct observed species. In terms of the paintbox process, a high ψ\psi increases the size of the continuous part p0p_0 of the process, while a low ψ\psi will increase the size of the discrete parts p0p_{\neq 0}.

Value

The probability of the Poisson-Dirichlet distribution for the input abundance vector, e.g. an exchangeable random partition, and a dispersal parameter ψ\psi.

References

W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.

Examples

## Get a random sample from the Poisson Dirichlet distribution, and
## find the probability of such a sample with psi=5:
set.seed(111)
s <- rPD(n=100,psi=5)
a=abundance(s)
dPD(a, psi=5)

Test for the shape of the distribution

Description

This function performs a statistical test on the null hypothesis that a given sample's underlying distribution is the Poisson-Dirichlet distribution. It calculates a test statistic that is then used to gain a p-value from an empirical distribution of the statistic from simulated samples from a PD distribution.

Usage

is.PD(x, rounds)

Arguments

x

A discrete data vector.

rounds

How many samples are simulated to obtain the empirical distribution.

Details

The calculated test statistic is

W=i=1nni2/n,W=\sum_{i=1}^n n_i^2 / n ,

which is calculated from the sample. Here nin_i are the frequencies of each unique value in the sample. The MLE of ψ\psi is then estimated from the sample with the function MLEp(), and an amount of samples equal to the input parameter rounds are generated with that estimate of ψ\psi and sample size nn. The test statistic WW is then calculated for each of the simulated samples. The original WW is then given a p-value based on what percentage of the simulated WW it exceeds.

Value

A p-value.

References

Watterson, G.A., (1978), The homozygosity test of neutrality. Genetics. 88(2):405-417.

Examples

##Test whether a typical sample follows PD:
x<-rPD(1000,10)
is.PD(x, 1000)

##Test whether a very atypical sample where frequencies of different values
## are similar:

x<-c(rep(1, 200), rep(2, 200), rep(3, 200), rep(4, 200), rep(5,200))
is.PD(x,1000)

Maximum Likelihood Estimate of ψ\psi

Description

Numerically searches for the MLE of ψ\psi given an abundance vector with a binary search algorithm.

Usage

MLEp(abund)

Arguments

abund

An abundance vector.

Details

Numerically searches for the MLE of ψ\psi as the root of equation

K=i=1nψ/(ψ+i1),K=\sum_{i=1}^n\psi/(\psi+i-1),

where KK is the observed number of different species in the sample. The right side of the equation is monotonically increasing when ψ>0\psi>0, so a binary search is used to find the root. An accepted ψ\psi sets value of the right side of the equation within R's smallest possible value of the actual value of KK.

Value

The MLE of ψ\psi.

References

W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.

Examples

##Find the MLE of psi of the vector (1,2,2).
##The frequencies of the frequencies of the data vector are given as input:
MLEp(abundance(c(1,2,2)))

##Find the MLE of psi of a sample from the Poisson-Dirichlet distribution:
set.seed(1000)
x<-rPD(n=10000, psi=100)
MLEp(abundance(x))

Bootstrap confidence interval for the MLE of ψ\psi

Description

A bootstrapped confidence interval for the Maximum Likelihood Estimate for ψ\psi.

Usage

MLEp.bsci(x, level = 0.95, rounds = 1000, frac = 0.8)

Arguments

x

A data vector.

level

Level of confidence interval as number between 0 and 1.

rounds

Number of bootstrap rounds. Default is 1000.

frac

Percentage of data x used for each bootstrap round. 0.8 by default with accepted values between 0 and 1.

Value

The MLE of ψ\psi as well as lower and upper bounds of the bootstrap confidence interval.

Examples

## Find a 95% -confidence interval for the MLE of psi given a sample from the
## Poisson-Dirichlet distribution:
x<-rPD(n=10000, psi=100)
MLEp.bsci(x, 0.95)

Test for ψ\psi of multiple samples

Description

Likelihood ratio test for the hypotheses H0:ψ1=ψ2=...=ψdH_0: \: \psi_1=\psi_2=...=\psi_d and H1:ψ1ψ2...ψdH_1: \: \psi_1 \neq \psi_2 \neq ... \neq \psi_d, where ψ1,ψ2,\psi_1,\psi_2,...,ψd,\psi_d are the dispersal parameters of the dd samples in the columns of the input data array x.

Usage

mult.sample.test(x)

Arguments

x

The data array to be tested. Each column of x is an independent sample.

Details

Calculates the Likelihood Ratio Test statistic

2log(L(ψ^)/L(ψ^1,ψ^2,...,ψ^d)),-2log(L(\hat{\psi})/L(\hat{\psi}_1, \hat{\psi}_2, ..., \hat{\psi}_d)),

where L is the likelihood function of observing the dd input samples given a single ψ\psi in the numerator and dd different parameters ψ1,ψ2,\psi_1,\psi_2,...,ψd,\psi_d for each sample respectively in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a χd12\chi_{d-1}^2-distribution when the null-hypothesis is true, where d1d-1 is the difference in the amount of parameters between the considered models. To calculate the statistic, the Maximum Likelihood Estimate for ψ1,ψ2,...,ψd\psi_1,\: \psi_2,\: ..., \: \psi_d of H1H_1 and the shared ψ\psi of H0H_0 are calculated.

Value

Gives a vector with the Likelihood Ratio Test -statistic Lambda, as well as the p-value of the test p.

References

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi:10.1098/rsta.1933.0009>.

Examples

##Create samples with different n and psi:
set.seed(111)
x<-rPD(1200, 15)
y<-c( rPD(1000, 20), rep(NA, 200) )
z<-c( rPD(800, 30), rep(NA, 400) )
samples<-cbind(cbind(x, y), z)
##Run test
mult.sample.test(samples)

Random sampling from the Poisson-Dirichlet Distribution

Description

rPD samples randomly from the PD distribution with a given ψ\psi by simulating the Hoppe urn model.

Usage

rPD(n, psi)

Arguments

n

number of observations.

psi

dispersal parameter.

Details

Samples random values with a given ψ\psi from the Poisson-Dirichlet distribution by simulating the Hoppe urn model.

Value

Returns a vector with a sample of size nn from the Hoppe urn model with parameter ψ\psi.

References

Hoppe, F.M. The sampling theory of neutral alleles and an urn model in population genetics. J. Math. Biology 25, 123–159 (1987). <doi:10.1007/BF00276386>.

W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi:10.1016/0040-5809(72)90035-4>.

Examples

## Get random sample from the PD distribution with different psi,
## and estimate the psi of the samples:
s1<-rPD(1000, 10)
s2<- rPD(1000, 50)
print(c(MLEp(abundance(s1)), MLEp(abundance(s2))))

Lagrange Multiplier Test for ψ\psi

Description

Performs the Lagrange Multiplier test for the equality of the dispersion parameter ψ\psi of a sample. The null hypothesis of the test is H0:ψ=ψ0H_0: \psi = \psi_0, where ψ0\psi_0 is given as input here.

Usage

sample.test(abund, psi = "a")

Arguments

abund

An abundance vector of a sample.

psi

Target positive number ψ0\psi_0 to be tested. Accepted values are "a" for absolute value 1, "r" for relative value nn (sample size), or any positive number.

Details

Calculates the Lagrange Multiplier test statistic

S=U(ψ0)2/I(ψ0),S\, = \,U(\psi_0)^2 / I(\psi_0),

where UU is the log-likelihood function of ψ\psi and II is its Fisher information. The statistic SS follows χ2\chi^2-distribution with 1 degree of freedom when the null hypothesis H0:ψ=ψ0H_0:\psi=\psi_0 is true.

Value

The statistic SS and a p-value of the two-sided test of the hypothesis.

References

Radhakrishna Rao, C, (1948), Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50-57. <doi:10.1017/S0305004100023987>

Examples

## Test the psi of a sample from the Poisson-Dirichlet distribution:
set.seed(10000)
x<-rPD(1000, 10)
## Find the abundance of the data vector:
abund=abundance(x)
## Test for the psi that was used, as well as a higher and a lower one:
sample.test(abund, 10)
sample.test(abund, 15)
sample.test(abund, 5)
sample.test(abund)       #test for psi=1
sample.test(abund, "r")  #test for psi=n

Marginally predicted labels of the test data given training data classification.

Description

Classifies the test data x based on the training data object. The test data is considered i.i.d., so each data point is classified one by one.

Usage

tMarLab(training, x)

Arguments

training

A training data object from the function classifier.fit().

x

Test data vector or matrix with rows as data points and columns as features.

Details

Independently assigns a class label for each test data point according to a maximumaposteriorimaximum \, a \, posteriori rule. The predictive probability of data point xix_i arising from class cc assuming the training data of size mcm_c in the class arises from a Poisson-Dirichlet(ψ^c\hat{\psi}_c) distribution is:

ψ^c/(mc+ψ^c),\hat{\psi}_c / (m_c + \hat{\psi}_c),

if no value equal to xix_i exists in the training data of class cc, and

mci/(mc+ψ^c),m_{ci} / (m_c + \hat{\psi}_c),

if there does, where mcim_{ci} is the frequency of the value of xix_i in the training data.

Value

A vector of predicted labels for test data x.

References

Amiryousefi A. Asymptotic supervised predictive classifiers under partition exchangeability. . 2021. https://arxiv.org/abs/2101.10950.

Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi:10.1007/s11222-011-9291-7>).

Examples

## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1<-rPD(10500,10)
x2<-rPD(10500,1000)
test.ind1<-sample.int(10500,500) # Sample test datasets from the
test.ind2<-sample.int(10500,500) # original samples
x<-c(x1[-test.ind1],x2[-test.ind2])
## create training data labels:
y1<-rep("1", 10000)
y2<-rep("2", 10000)
y<-c(y1,y2)

## Test data t, with first half belonging to class "1", second have in "2":
t1<-x1[test.ind1]
t2<-x2[test.ind2]
t<-c(t1,t2)

fit<-classifier.fit(x,y)

## Run the classifier, which returns
tM<-tMarLab(fit, t)

##With multidimensional x:
set.seed(111)
x1<-cbind(rPD(5500,10),rPD(5500,50))
x2<-cbind(rPD(5500,100),rPD(5500,500))
test.ind1<-sample.int(5500,500)
test.ind2<-sample.int(5500,500)
x<-rbind(x1[-test.ind1,],x2[-test.ind2,])
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
t1<-x1[test.ind1,]
t2<-x2[test.ind2,]
t<-rbind(t1,t2)

tM<-tMarLab(fit, t)

Simultaneously predicted labels of the test data given the training data classification.

Description

Classifies the test data x based on the training data object. All of the test data is used simultaneously to make the classification.

Usage

tSimLab(training, x)

Arguments

training

A training data object from the function classifier.fit().

x

Test data vector or matrix with rows as data points and columns as features.

Details

The test data are first labeled with the marginal classifier. The simultaneous classifier then iterates over all test data, assigning each a label by finding the maximum predictive probability given the current classification structure of the test data as a whole. This is repeated until the classification structure converges after iterating over all data.

Value

A vector of predicted labels for test data x.

References

Amiryousefi A. Asymptotic supervised predictive classifiers under partition exchangeability. . 2021. https://arxiv.org/abs/2101.10950.

Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi:10.1007/s11222-011-9291-7>).

Examples

## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1<-rPD(10500,10)
x2<-rPD(10500,1000)
test.ind1<-sample.int(10500,500) # Sample test datasets from the
test.ind2<-sample.int(10500,500) # original samples
x<-c(x1[-test.ind1],x2[-test.ind2])
## create training data labels:
y1<-rep("1", 10000)
y2<-rep("2", 10000)
y<-c(y1,y2)

## Test data t, with first half belonging to class "1", second have in "2":
t1<-x1[test.ind1]
t2<-x2[test.ind2]
t<-c(t1,t2)

fit<-classifier.fit(x,y)

## Run the classifier, which returns
tS<-tSimLab(fit, t)

##With multidimensional x:
set.seed(111)
x1<-cbind(rPD(5500,10),rPD(5500,50))
x2<-cbind(rPD(5500,100),rPD(5500,500))
test.ind1<-sample.int(5500,500)
test.ind2<-sample.int(5500,500)
x<-rbind(x1[-test.ind1,],x2[-test.ind2,])
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
t1<-x1[test.ind1,]
t2<-x2[test.ind2,]
t<-rbind(t1,t2)

tS<-tSimLab(fit, t)

Two sample test for ψ\psi

Description

Likelihood ratio test for the hypotheses H0:ψ1=ψ2H_0: \: \psi_1=\psi_2 and H1:ψ1ψ2H_1: \: \psi_1 \neq \psi_2, where ψ1\psi_1 and ψ2\psi_2 are the dispersal parameters of two input samples s1 and s2.

Usage

two.sample.test(s1, s2)

Arguments

s1, s2

The two data vectors to be tested.

Details

Calculates the Likelihood Ratio Test statistic

2log(L(ψ^)/L(ψ^1,ψ^2)),-2log(L(\hat{\psi})/L(\hat{\psi}_1, \hat{\psi}_2)),

where L is the likelihood function of observing the two input samples given a single ψ\psi in the numerator and two different parameters ψ1\psi_1 and ψ2\psi_2 for each sample respectively in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a χd2\chi_d^2-distribution under the null-hypothesis, where dd is the difference in the amount of parameters between the considered models, which is 1 here. To calculate the statistic, the Maximum Likelihood Estimate for ψ1,ψ2\psi_1,\: \psi_2 of H1H_1 and the shared ψ\psi of H0H_0 are calculated.

Value

Gives a vector with the Likelihood Ratio Test -statistic Lambda, as well as the p-value of the test p.

References

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi:10.1098/rsta.1933.0009>.

Examples

##Create samples with different n and psi:
set.seed(111)
x<-rPD(500, 15)
y<-rPD(1000, 20)
z<-rPD(800, 30)
##Run tests
two.sample.test(x,y)
two.sample.test(x,z)
two.sample.test(y,z)