n <- 2500
b.use <- c(-3,0.1)
# simulate the data
simdat <- scdeco.sim.pg(N=n, b0=b.use[1], b1=b.use[2],
phi1=4, phi2=4, phi3=1/7,
mu1=15, mu2=15, mu3=7,
tau0=-2, tau1=0.4)
Parameters:
N
: Sample size for the simulated data.b0
: The intercept coefficient of the zero-inflation
parameter.b1
: The slope coefficient of the zero-inflation
parameter.phi1
: The over-dispersion parameter of the 1st ZINB
marginal.phi2
: The over-dispersion parameter of the 2nd ZINB
marginal.phi3
: The over-dispersion parameter of the ZINB
covariate vector.mu1
: The mean parameter of the 1st ZINB marginal.mu2
: The mean parameter of the 2nd ZINB marginal.mu3
: The mean parameter of the ZINB covariate
vector.tau0
: The intercept coefficient of the correlation
parameter.tau1
: The slope coefficient of the correlation
parameter.This will simulate a 3-column matrix of N rows, where the first two columns are observations and the third column is the ZINB covariate which will be used in regressing the correlation parameter of the scdeco.pg model.
# fit the model
mcmc.out <- scdeco.pg(dat=simdat,
b0=b.use[1], b1=b.use[2],
adapt_iter=1,# 500,
update_iter=1, # 500,
coda_iter=10, # 5000,
coda_thin=1, # 10,
coda_burnin=0)# 1000)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 7500
#> Unobserved stochastic nodes: 12508
#> Total graph size: 85189
#>
#> Initializing model
#> Warning in jags.model(IndZ.spec, data = jags_data, n.adapt = adapt_iter, :
#> Adaptation incomplete
#> NOTE: Stopping adaptation
Parameters:
dat
: The 3-column matrix where the first two columns
are observations and the third column is the ZINB covariate. An
additional covariate can be added as a 4th column if desired.adapt_iter
: The number of adaptive iterations to
run.update_iter
: The number of update iterations to
run.coda_iter
: The number of MCMC iterations to run after
the adapt and update.coda_thin
: The number of MCMC iterations to burn from
the coda_iter iterations.coda_burnin
: The number of MCMC iterations to thin from
the coda_burnin iterations.This will return a matrix where the columns correspond to the different parameters of the model and the rows correspond to MCMC samples where the adapt, update, burn, and thin has already been incorporated.
One can obtain estimates and confidence intervals for each parameter by looking at quantiles of these MCMC samples.
boundsmat <- cbind(mcmc.out$quantiles[,1],
c(1/4, 1/4, 7, 15, 15, 7, -2, 0.4),
mcmc.out$quantiles[,c(3,5)])
colnames(boundsmat) <- c("lower", "true", "est", "upper")
boundsmat
#> lower true est upper
#> inverphi[1] 0.8306324 0.25 1.33066455 2.7492641
#> inverphi[2] 0.8694299 0.25 1.48960821 7.4868490
#> inverphi3 5.4611469 7.00 6.89310290 7.8570331
#> mu[1] 20.5071926 15.00 25.98040458 28.3030652
#> mu[2] 17.8750445 15.00 23.58263240 25.3063355
#> mu3 6.7092923 7.00 6.86330611 7.0320776
#> tau0 -1.8955368 -2.00 0.23766658 0.8787380
#> tau1 -0.0355327 0.40 -0.00439232 0.3345095
Let i = 1, …, n represent the number of cells in the dataset, and let X1, X2, X3 be the count-based expression levels for the three genes, with X3 being the controller gene. Let Xc be a vector containing some cellular-level factor such as resistance status or methylation level.
Since technical and/or biological factors often cause expression readings to incorrectly show up as 0, known as a dropout event, we choose to incorporate a zero-inflation parameter into the distribution of X3 and also into the joint distribution of X1, X2.
To incorporate zero-inflation into the distribution of X3, let p3 represent the probability of a dropout event striking an observation of X3.
Then we model X3 as:
f(xi3; μ3, 1/ϕ3) = (1 − p3)fNB(xi3; μ3, 1/ϕ3) + p31(xi3 = 0)
Where NB is under the following mean, over-dispersion parameterization:
$$ f_{\text{NB}}(x;\mu, \alpha) = \frac{\Gamma(x + \frac{1}{\alpha})}{\Gamma(x+1)\Gamma(\frac{1}{\alpha})}\left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha}+\mu}\right)^{\frac{1}{\alpha}}\left(\frac{\mu}{\frac{1}{\alpha}+\mu}\right)^{x} $$
which has mean μ and variance μ(1 + αμ).
We introduce the latent variable Z, which is responsible for imparting correlation between the two marginals X1, X2.
$$ \boldsymbol{Z}_i \sim N_2\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, \begin{bmatrix}1 & \rho_i \\ \rho_i & 1\end{bmatrix}\right) $$
ρ is made to be a function of X3 and Xc like so:
ρi = (1 − p3)tanh (τ0 + τ1Xi3 + τ2Xic) + p3tanh (τ0 + τ1μ3 + τ2Xic)1(Xi3 = 0)
This shows that if Xi3 = 0 (and thus is possibly dropout), then we replace it with μ3 in the second term of the above sum.
Now we allow the means of X1, X2 to depend on this latent variable Z in the following way. For j = 1, 2,
Xij ∼ Pois(mean = Fϕj−1{Zij}μj)
where Fϕj is the Gamma(shape = 1/ϕj, rate = 1/ϕj) CDF.
Thus, Xij is a poisson random variable with a Gamma(shape = 1/ϕj, rate = 1/μjϕj) mean parameter, which is equivalent to a NB(μj, 1/ϕj) random variable,
To incorporate zero-inflation into the joint distribution of X1, X2, let p1, p2 represent the probability that an observation from X1, X2, respectively, is hit by a dropout event. Then for j = 1, 2,
f(xij; μj, ϕj) = (1 − pj)fPois(xij; Fϕj−1{Zij}μj) + pj1(xij = 0)
Parameter estimation is achieved using a Gibbs sampler MCMC scheme through JAGS.
The priors are as follows:
$$ \begin{aligned} \mu_1 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\ \mu_2 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\ \mu_3 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\ 1/\phi_1 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\ 1/\phi_2 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\ 1/\phi_3 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\ \tau_0 & \sim N(\mu=0, \sigma^2=4/n)\\ \tau_1 & \sim N(\mu=0, \sigma^2=4/n)\\ \tau_2 & \sim N(\mu=0, \sigma^2=4/n)\\ \tau_3 & \sim N(\mu=0, \sigma^2=4/n)\\ \end{aligned} $$
p1, p2, p3 do not appear among these priors because they are all modeled as functions of their respective gene’s mean like so:
$$ p_j = \frac{\exp\left\{b_0 +b_1\mu_j\right\}}{1+\exp\left\{b_0+b_1\mu_j\right\}} $$
where the values for b0, b1 are decided beforehand by fitting above model using the genes in the dataset, but replacing pj with the empirical probability that gene j is equal 0 and replacing μj with the empirical mean expression of gene j, then estimating β0, β1 using nls().
Zhen Yang, Yen-Yi Ho, Modeling Dynamic Correlation in Zero-Inflated Bivariate Count Data with Applications to Single-Cell RNA Sequencing Data, Biometrics, Volume 78, Issue 2, June 2022, Pages 766–776, https://doi.org/10.1111/biom.13457