scDECO-Poisson-Gamma

library(scDECO)

Quick Start

n <- 2500
b.use <- c(-3,0.1)

# simulate the data
simdat <- scdeco.sim.pg(N=n, b0=b.use[1], b1=b.use[2],
                        phi1=4, phi2=4, phi3=1/7,
                        mu1=15, mu2=15, mu3=7,
                        tau0=-2, tau1=0.4)

Parameters:

N: Sample size for the simulated data.
b0: The intercept coefficient of the zero-inflation parameter.
b1: The slope coefficient of the zero-inflation parameter.
phi1: The over-dispersion parameter of the 1st ZINB marginal.
phi2: The over-dispersion parameter of the 2nd ZINB marginal.
phi3: The over-dispersion parameter of the ZINB covariate vector.
mu1: The mean parameter of the 1st ZINB marginal.
mu2: The mean parameter of the 2nd ZINB marginal.
mu3: The mean parameter of the ZINB covariate vector.
tau0: The intercept coefficient of the correlation parameter.
tau1: The slope coefficient of the correlation parameter.

This simulates a matrix with N rows and 3 columns. The first two columns are the observations, and the third column is the ZINB covariate on which the correlation parameter of the scdeco.pg model is regressed.

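# fit the model
# (iteration counts are kept tiny here; the trailing comments show larger values one might use for a real run)
mcmc.out <- scdeco.pg(dat=simdat,
                      b0=b.use[1], b1=b.use[2],
                      adapt_iter=1,   # 500
                      update_iter=1,  # 500
                      coda_iter=10,   # 5000
                      coda_thin=1,    # 10
                      coda_burnin=0)  # 1000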
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 7500
#>    Unobserved stochastic nodes: 12508
#>    Total graph size: 85183
#> 
#> Initializing model
#> Warning in jags.model(IndZ.spec, data = jags_data, n.adapt = adapt_iter, :
#> Adaptation incomplete
#> NOTE: Stopping adaptation

Parameters:

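dat: The 3-column matrix where the first two columns are observations and the third column is the ZINB covariate. An additional covariate can be added as a 4th column if desired.
b0: The intercept coefficient of the zero-inflation parameter.
b1: The slope coefficient of the zero-inflation parameter.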
adapt_iter: The number of adaptive iterations to run.
update_iter: The number of update iterations to run.
coda_iter: The number of MCMC iterations to run after the adapt and update.
coda_thin: The thinning interval applied to the coda_iter iterations (every coda_thin-th sample is kept).
coda_burnin: The number of MCMC iterations to discard as burn-in from the coda_iter iterations.

This will return a matrix where the columns correspond to the different parameters of the model and the rows correspond to MCMC samples, with the adaptation, update, burn-in, and thinning steps already applied.
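As a rough illustration of what burn-in and thinning do to a chain, the base R sketch below applies them by hand to a made-up matrix of draws. The object `draws` and the order in which the two steps are applied are assumptions for illustration only; they are not taken from the internals of scdeco.pg.

```r
# Hypothetical chain: 5000 draws of 2 parameters (stand-in for raw MCMC output)
set.seed(1)
coda_iter   <- 5000
coda_burnin <- 1000
coda_thin   <- 10
draws <- matrix(rnorm(coda_iter * 2), ncol = 2,
                dimnames = list(NULL, c("tau0", "tau1")))

# Assumed convention: drop the first coda_burnin rows,
# then keep every coda_thin-th row of what remains
keep <- seq(from = coda_burnin + 1, to = coda_iter, by = coda_thin)
kept <- draws[keep, , drop = FALSE]

nrow(kept)  # number of retained samples
```

Under this convention the number of retained samples is (coda_iter - coda_burnin) / coda_thin.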

One can obtain point estimates and credible intervals for each parameter from quantiles of these MCMC samples.

boundsmat <- cbind(mcmc.out$quantiles[,1],
                  c(1/4, 1/4, 7, 15, 15, 7, -2, 0.4), 
                  mcmc.out$quantiles[,c(3,5)])

colnames(boundsmat) <- c("lower", "true", "est", "upper")

boundsmat
#>                   lower  true         est       upper
#> inverphi[1]  1.07370803  0.25  1.22104800  1.61449259
#> inverphi[2]  0.96510735  0.25  1.15417250  1.82313073
#> inverphi3    5.89952246  7.00  6.67470608  7.57993496
#> mu[1]       20.45382278 15.00 25.92128440 30.29680714
#> mu[2]       17.73589064 15.00 26.04911972 27.37149941
#> mu3          6.82520282  7.00  6.99315226  7.09318560
#> tau0        -0.47808224 -2.00  0.11273727  0.36480493
#> tau1        -0.01848128  0.40  0.03035249  0.08946435
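If the raw draws are available as a matrix, the same kind of summary can be computed directly with quantile(). The sketch below uses a made-up matrix `draws` in place of real sampler output, and takes the 2.5%, 50%, and 97.5% quantiles as lower bound, estimate, and upper bound.

```r
# Hypothetical posterior draws for two parameters (not real scdeco.pg output)
set.seed(2)
draws <- cbind(mu3  = rnorm(400, mean = 7,  sd = 0.1),
               tau0 = rnorm(400, mean = -2, sd = 0.2))

# One row per parameter, one column per quantile
bounds <- t(apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(bounds) <- c("lower", "est", "upper")
bounds
```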

Model Details

Let i = 1, …, n index the cells in the dataset, where n is the number of cells, and let X₁, X₂, X₃ be the count-based expression levels of the three genes, with X₃ being the controller gene. Let X_c be a vector containing some cell-level factor such as resistance status or methylation level.

Technical and/or biological factors often cause expression readings to be incorrectly recorded as 0, which is known as a dropout event. We therefore incorporate a zero-inflation parameter into the distribution of X₃ and also into the joint distribution of X₁, X₂.

To incorporate zero-inflation into the distribution of X₃, let p₃ represent the probability of a dropout event striking an observation of X₃.

Then we model X₃ as:

f(x_i3; μ₃, 1/ϕ₃) = (1 − p₃)f_NB(x_i3; μ₃, 1/ϕ₃) + p₃1(x_i3 = 0)

where f_NB is the negative binomial probability mass function under the following mean and over-dispersion parameterization:

$$ f_{\text{NB}}(x;\mu, \alpha) = \frac{\Gamma(x + \frac{1}{\alpha})}{\Gamma(x+1)\Gamma(\frac{1}{\alpha})}\left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha}+\mu}\right)^{\frac{1}{\alpha}}\left(\frac{\mu}{\frac{1}{\alpha}+\mu}\right)^{x} $$

which has mean μ and variance μ(1 + αμ).
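The parameterization above can be checked numerically. The sketch below writes the pmf out by hand, compares it with R's built-in dnbinom() using size = 1/α, and adds the zero-inflated mixture. The function names `f_nb` and `f_zinb` and the parameter values are illustrative choices, not part of the package.

```r
# NB pmf in the (mean mu, over-dispersion alpha) parameterization given above
f_nb <- function(x, mu, alpha) {
  r <- 1 / alpha
  exp(lgamma(x + r) - lgamma(x + 1) - lgamma(r) +
        r * log(r / (r + mu)) + x * log(mu / (r + mu)))
}

# Zero-inflated version: dropout probability p puts extra mass at 0
f_zinb <- function(x, mu, alpha, p) {
  (1 - p) * f_nb(x, mu, alpha) + p * (x == 0)
}

x <- 0:2000
mu <- 7; alpha <- 0.5; p <- 0.1   # illustrative values only

# Largest discrepancy from R's built-in NB density with size = 1 / alpha
max_diff <- max(abs(f_nb(x, mu, alpha) - dnbinom(x, size = 1 / alpha, mu = mu)))

# Mean and variance of the NB part, to compare with mu and mu * (1 + alpha * mu)
nb_mean <- sum(x * f_nb(x, mu, alpha))
nb_var  <- sum((x - nb_mean)^2 * f_nb(x, mu, alpha))
c(nb_mean, nb_var, mu * (1 + alpha * mu))

# The zero-inflated pmf still sums to 1
zinb_total <- sum(f_zinb(x, mu, alpha, p))
```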

We introduce the latent variable Z, which is responsible for imparting correlation between the two marginals X₁, X₂.

$$ \boldsymbol{Z}_i \sim N_2\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, \begin{bmatrix}1 & \rho_i \\ \rho_i & 1\end{bmatrix}\right) $$

ρ_i is modeled as a function of X₃ and X_c:

ρ_i = (1 − p₃)tanh (τ₀ + τ₁X_i3 + τ₂X_ic) + p₃tanh (τ₀ + τ₁μ₃ + τ₂X_ic)1(X_i3 = 0)

That is, when X_i3 = 0 (and is thus possibly a dropout), the second term of the sum substitutes μ₃ for X_i3.
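The expression for ρ_i translates directly into a small R function. In the sketch below, the helper name `rho_fun` is hypothetical, τ₂ defaults to 0 (no extra covariate), and p₃ is computed from the logistic form of the dropout probability with the b₀, b₁, μ₃, τ₀, τ₁ values used in the Quick Start simulation.

```r
# Correlation for each cell as a function of the controller gene x3
# and an optional cell-level covariate xc
rho_fun <- function(x3, xc = 0, tau0, tau1, tau2 = 0, mu3, p3) {
  observed_term <- (1 - p3) * tanh(tau0 + tau1 * x3 + tau2 * xc)
  # only active when x3 == 0; mu3 stands in for the possibly dropped-out value
  dropout_term  <- p3 * tanh(tau0 + tau1 * mu3 + tau2 * xc) * (x3 == 0)
  observed_term + dropout_term
}

# Values borrowed from the Quick Start simulation
b0 <- -3; b1 <- 0.1; mu3 <- 7
p3 <- plogis(b0 + b1 * mu3)   # logistic link for the dropout probability

x3  <- c(0, 2, 5, 10)
rho <- rho_fun(x3, tau0 = -2, tau1 = 0.4, mu3 = mu3, p3 = p3)
round(rho, 3)
```

Because tanh is bounded by 1 in absolute value and the two weights sum to 1, every ρ_i stays strictly inside (−1, 1).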

Now we allow the means of X₁, X₂ to depend on this latent variable Z in the following way. For j = 1, 2,

X_ij ∼ Pois(mean = F_{ϕ_j}⁻¹{Φ(Z_ij)}μ_j)

where Φ is the standard normal CDF and F_{ϕ_j} is the Gamma(shape = 1/ϕ_j, rate = 1/ϕ_j) CDF. Because Φ(Z_ij) is uniform on (0, 1), F_{ϕ_j}⁻¹{Φ(Z_ij)} follows this Gamma distribution.

Thus, X_ij is a Poisson random variable with a Gamma(shape = 1/ϕ_j, rate = 1/(μ_jϕ_j)) mean parameter, which is equivalent to a NB(μ_j, 1/ϕ_j) random variable.

To incorporate zero-inflation into the joint distribution of X₁, X₂, let p₁, p₂ represent the probability that an observation from X₁, X₂, respectively, is hit by a dropout event. Then for j = 1, 2,

f(x_ij; μ_j, ϕ_j) = (1 − p_j)f_Pois(x_ij; F_{ϕ_j}⁻¹{Φ(Z_ij)}μ_j) + p_j1(x_ij = 0)
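The generative steps for X₁, X₂ can be sketched in base R. This is a minimal illustration, not the package's simulator: it assumes the latent normal is mapped to (0, 1) with the standard normal CDF (pnorm) before the inverse Gamma CDF (qgamma) is applied, it uses a single fixed ρ rather than a cell-specific ρ_i, and the values of `phi` and `p` are hypothetical.

```r
set.seed(3)
n   <- 20000
rho <- 0.8            # fixed correlation for illustration (rho_i varies by cell in the model)
mu  <- c(15, 15)      # marginal means
phi <- c(0.5, 0.5)    # hypothetical dispersion values
p   <- c(0.1, 0.1)    # hypothetical dropout probabilities

# Bivariate standard normal with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
z  <- cbind(z1, z2)

x_nb <- x <- matrix(0L, n, 2)
for (j in 1:2) {
  u <- pnorm(z[, j])                                      # uniform on (0, 1)
  g <- qgamma(u, shape = 1 / phi[j], rate = 1 / phi[j])   # Gamma multiplier with mean 1
  x_nb[, j] <- rpois(n, lambda = g * mu[j])               # Poisson-Gamma count
  dropout   <- rbinom(n, size = 1, prob = p[j])           # 1 = dropout event
  x[, j]    <- ifelse(dropout == 1, 0L, x_nb[, j])        # zero-inflated count
}

colMeans(x_nb)              # compare with mu
apply(x_nb, 2, var)         # compare with mu * (1 + phi * mu)
cor(x_nb[, 1], x_nb[, 2])   # dependence induced through z
colMeans(x == 0)            # zero fraction after dropout
```

Before dropout, each column behaves like rnbinom(n, size = 1/phi[j], mu = mu[j]) draws; the dependence between the two columns comes entirely from the shared latent normal.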

Parameter Estimation

Parameter estimation is achieved using a Gibbs sampler MCMC scheme through JAGS.

The priors are as follows:

$$
\begin{aligned}
\mu_1 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\
\mu_2 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\
\mu_3 &\sim \text{lognormal}(\mu=0, \ \sigma^2=1)\\
1/\phi_1 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\
1/\phi_2 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\
1/\phi_3 &\sim \text{Gamma}(\text{shape}=1, \ \text{rate}=0.01)\\
\tau_0 &\sim N(\mu=0, \ \sigma^2=4/n)\\
\tau_1 &\sim N(\mu=0, \ \sigma^2=4/n)\\
\tau_2 &\sim N(\mu=0, \ \sigma^2=4/n)\\
\tau_3 &\sim N(\mu=0, \ \sigma^2=4/n)
\end{aligned}
$$
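To get a feel for the scale of these priors, one can draw from them in base R. The sketch below assumes n = 2500 cells as in the Quick Start; the variable names are illustrative. Note that R's rnorm() takes a standard deviation, whereas JAGS's dnorm takes a precision, so a variance of 4/n corresponds to sd = sqrt(4/n) in R and precision n/4 in JAGS.

```r
set.seed(4)
n       <- 2500     # number of cells, as in the Quick Start
n_draws <- 10000

mu_draw       <- rlnorm(n_draws, meanlog = 0, sdlog = 1)   # lognormal(0, sigma^2 = 1)
inverphi_draw <- rgamma(n_draws, shape = 1, rate = 0.01)   # prior on 1/phi
tau_sd        <- sqrt(4 / n)                               # prior sd of each tau
tau_draw      <- rnorm(n_draws, mean = 0, sd = tau_sd)

tau_sd                                     # prior sd of tau shrinks as n grows
quantile(mu_draw, c(0.025, 0.5, 0.975))    # spread of the prior on each mean
mean(inverphi_draw)                        # compare with shape / rate
```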

p₁, p₂, p₃ do not appear among these priors because they are all modeled as functions of their respective gene’s mean like so:

$$ p_j = \frac{\exp\left\{b_0 +b_1\mu_j\right\}}{1+\exp\left\{b_0+b_1\mu_j\right\}} $$

where the values for b₀, b₁ are decided beforehand by fitting the above model using the genes in the dataset, replacing p_j with the empirical probability that gene j equals 0 and μ_j with the empirical mean expression of gene j, and then estimating b₀, b₁ using nls().
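A minimal sketch of this pre-fitting step is shown below on toy data. The data frame `gene_stats`, its column names, and the starting values are assumptions made for illustration; with real data, `gene_mean` and `zero_frac` would be the per-gene empirical mean and fraction of zeros.

```r
set.seed(5)
n_genes <- 200
n_cells <- 500
b_true  <- c(b0 = -3, b1 = 0.1)   # used only to generate the toy data

# Hypothetical per-gene summaries: mean expression and observed fraction of zeros
gene_stats <- data.frame(gene_mean = runif(n_genes, min = 1, max = 30))
p_true <- plogis(b_true["b0"] + b_true["b1"] * gene_stats$gene_mean)
gene_stats$zero_frac <- rbinom(n_genes, size = n_cells, prob = p_true) / n_cells

# Nonlinear least squares fit of the logistic curve p = expit(b0 + b1 * mean)
fit <- nls(zero_frac ~ plogis(b0 + b1 * gene_mean),
           data  = gene_stats,
           start = list(b0 = 0, b1 = 0))
b.use <- coef(fit)
b.use

# Dropout probability implied for a gene with mean expression 15
plogis(b.use["b0"] + b.use["b1"] * 15)
```

The fitted `b.use` can then be passed as b0 = b.use[1], b1 = b.use[2], as in the Quick Start call.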

Citations

Zhen Yang, Yen-Yi Ho, Modeling Dynamic Correlation in Zero-Inflated Bivariate Count Data with Applications to Single-Cell RNA Sequencing Data, Biometrics, Volume 78, Issue 2, June 2022, Pages 766–776, https://doi.org/10.1111/biom.13457