Overview of Spectra
Spectra (https://github.com/dpeerlab/spectra) grounds datadriven factors with prior biological knowledge (Supplementary Fig. 1). First, Spectra takes in prior biological information in the form of celltype labels and explicitly models separate celltypespecific factors that can account for local correlation patterns. This explicit separation of celltypespecific and global factors enables the estimation of factors at multiple scales of resolution. Second, Spectra resolves indeterminacy of the reconstruction loss function via a penalty derived from a gene–gene knowledge graph that encourages solutions that assign similar latent representations to genes with edges between them. To account for prior information of variable relevance and quality, Spectra adaptively tunes its reliance on prior information based on concordance of the prior and observed expression data. Third, novel factors are adaptively allocated when prior information is insufficient to explain the observed expression data.
In the first step of Spectra, a set of gene–gene similarity graphs is built by aggregating information across gene sets and/or other sources. This graph representation is flexible and can accommodate various types of prior knowledge; gene sets can be incorporated into graphs by including edges between genes that are annotated to the same pathway, whereas existing datasets can be used to generate annotations by thresholding partial correlations or factor similarity scores. This representation lends computational convenience, as the graph dimensions are fixed regardless of the size of the input annotations. The annotations are either labeled as celltypespecific or have global scope. A separate graph is thus built for each cell type alongside a global graph.
In the second step, Spectra learns a multidimensional parameter for each cell and each gene, representing each cell and each gene’s distribution over gene expression programs. Similarity of the parameters between genes indicates that these genes are likely to have an edge joining them, whereas similarity of the parameters between a cell and a gene indicates that the cell is likely to express that gene. Hence, the graph encodes the prior that genes with edges between them are likely to be expressed by the same set of cells. In practice, we take several additional steps to fulfill the desiderata: (1) factors not represented in the annotations can be discovered, (2) lowquality annotations can be removed, and (3) discrete cell types are assumed to be fixed and known and therefore not captured as factors by the model.
To avoid penalizing novel factors that have no relation to the annotations, we introduce a weighting matrix that scales the computation of gene–gene similarity scores by factorspecific weights that are learned from the data. Factors that have low weight are not used in computing edge probabilities, whereas factors with high weights influence the edge probabilities directly. Hence, Spectra can estimate similar parameters for two genes without forcing a high edge probability between them as long as the factors corresponding to these genes also have low weight. These weights allow the addition of new, unbiased factors that are not influenced by the input annotations. Importantly, weights are estimated from the data, allowing for an adaptive determination of the relative number of unbiased and biased factors. An estimated background rate of edges in the graph allows for the removal of annotations with little supporting evidence from gene expression data. Finally, Spectra explicitly separates global and celltypespecific factors by enforcing a celltypedetermined block sparsity pattern in the cell loading matrix. Celltypespecific factors capture withincelltype variation, whereas global factors capture any variation that is shared across multiple cell types. To reduce the burden of modeling constitutively expressed celltype marker genes, each factor’s contribution to gene expression is multiplied by a celltypespecific gene weight. These celltypespecific gene weights explain away the influence of celltype marker genes and hence mitigate the tendency of these marker genes to influence the factors themselves.
Components of the Spectra objective function
Broadly speaking, Spectra fits a set of factors and cell scores by minimizing an objective function with two components. The first component of the objective function, \({{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}\), measures how well the estimated model parameters can reconstruct (or predict) the observed expression data using the set of all model parameters Θ. We write \({{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}(\Theta )\) to emphasize that \({{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}\) is a function that maps a set of model parameters to a corresponding objective value. The second component of the objective function measures how well the set of model parameters Θ corresponds to our biological prior information. This second component is denoted \({{{{\mathcal{L}}}}}_{{{{\rm{Graph}}}}}(\Theta )\). We weight this term by a userdefined hyperparameter λ, which allows a user to control the level of confidence placed in the given biological prior information. The general form of the Spectra objective function is
$${{{\mathcal{L}}}}(\Theta )=\lambda {{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}(\Theta )+{{{{\mathcal{L}}}}}_{{{{\rm{Graph}}}}}(\Theta )$$
Below, we describe the precise functional forms of each of the objective function components.
\({\boldsymbol{\mathcal{L}}}_{{{{\mathbf{Reconstruction}}}}}\) (Θ): modeling gene expression as a lowrank product
We assume that the expression variation observed in the count matrix is driven by variation in the activity of different biologically meaningful gene programs as well as technical variation that often involves highly expressed genes. Therefore, our model of gene expression needs to account for both components. In more detail, interpretation of factors estimated from scRNAseq data is often hindered by highly expressed genes, which factor analysis methods based on reconstruction loss functions must account for. Housekeeping genes required for basal cellular function, such as GAPDH, ACTB and ribosomal genes, are expressed at high levels and hence unduly influence the reconstruction loss function despite the fact that their expression variance is explained in large part by overall levels of transcription. As a result, existing matrix decomposition methods tend to put high weight on such nonspecifically expressed genes, although post hoc corrections can be applied for the interpretation of individual factors. However, certain important cytokine genes (for example, IL4, IL6, IL2 and IL10), chemokine receptor genes (CXCR1 and CXCR2) and transcription factor genes (RORC and BATF3) are expressed in low mRNA copy numbers. Normalization strategies that rescale features empirically tend to amplify measurement uncertainty associated with lowly expressed genes, leading matrix factorization methods to overfit and return lowquality gene expression programs. To address this, we introduce gene scale factors g_{j} that are estimated from the data and allow the model to explain high expression and variability of certain genes without increasing the magnitude of the gene factor weights. Because lowly expressed genes are correspondingly noisier, we bound the minimum gene scale factors below by a tuning parameter δ.
By way of notation, X refers to the processed gene expression matrix, with entry X_{ij} containing the gene expression value for cell i and gene j. The matrix X has n rows (the number of cells) and p columns (the number of genes). K refers to the number of gene expression programs unless otherwise specified. Additionally, for a given cell indexed by i, the cell loading (a set of weights across the set of factors) is denoted by α_{i}. The distribution across factors for gene j is denoted as θ_{j}, which sums to 1 over K gene expression programs, \(\mathop{\sum }\nolimits_{k = 1}^{K}{\theta }_{jk}=1\). Unsubscripted variables refer to the collection containing all possible subscripts; for example, θ refers to the collection of all θ_{j}. The base expression model describing the gene expression measurement for cell i and gene j is
$${\mathbb{E}}[{{{{\bf{X}}}}}_{ij}]=({g}_{j}+\delta ){\alpha }_{i}^{\top }{\theta }_{j}$$
with g_{j} ∈ [0, 1] a gene scaling parameter, \({\alpha }_{i}\in {{\mathbb{R}}}_{+}^{K}\) and θ_{j} ∈ Δ^{K−1} (where Δ^{K−1} is the set of positive K − vectors that sum to 1). The lowrank decomposition of this expression model can be visualized in Supplementary Fig. 2.
Incorporating cell types into modeling expression variation
Because expression variation is dominated by cell types, existing methods generally fit factors that are polluted with celltype markers or alternatively must be run on a subset of the data. For example, TCR activation programs (consisting of marker genes such as NFATC1 and NFATC2) are confounded with T cell identity, and existing factor analysis methods tend to return identity marker genes, such as CD3, CD4 and CD8. Similarly, programs representing metabolic pathways are often confounded with plasmacytoid dendritic cell (IL3R and BDCA2) or B cell (CD19 and CD79A) identity marker genes. Although it is challenging to fit a biologically meaningful factor model, successful cell typing of scRNAseq data using clustering approaches is a solved problem for discrete cell types but not for intermediate states. Therefore, to mitigate this issue, Spectra assumes that discrete cell types are known and therefore not captured as factors by the model; instead, Spectra explicitly fits celltypespecific and global factors, allowing Spectra to effectively deal with expression variance at multiple scales. To perform this celltypeintegrative factor analysis, for cell type c and cell i, the model is extended to
$${\mathbb{E}}[{{{{\bf{X}}}}}_{cij}]=({g}_{j}+\delta ){\alpha }_{c,i,:K}^{\top }{\theta }_{j}+({g}_{cj}+\delta ){\alpha }_{c,i,K+1:}^{\top }{\theta }_{cj}$$
where c is the celltype label for cell i, g_{cj} is celltypespecific gene scaling, and \({\theta }_{cj}\in {\Delta }^{{K}_{c}1}\) is a celltypespecific gene representation with \({\alpha }_{c,i}\in {{\mathbb{R}}}^{K+{K}_{c}}\). Singlesubscript variables, such as g_{j} and θ_{j}, denote global parameters, whereas the notation α_{:K} indicates the first K elements of a vector (typically denoting global elements), and α_{K+1:} indicates the tail of the vector from the K + 1st element (typically denoting celltypespecific elements). The threshold δ restricts the maximum ratio of gene scaling factors to \(\frac{1+\delta }{\delta }\).
Spectra models the presence of gene programs with highly limited scope in that they can only be activated by a specific cell type, which can be represented by a hardcoded sparsity pattern in the cell loading matrix (Supplementary Fig. 3). The celltypespecific gene scalings (g_{cj}) associated with these programs are encouraged to capture celltype identity markers and constitutively active genes, enabling factors themselves to capture variation across cell types and within cell types (Supplementary Fig. 4). Spectra tends to assign constitutive genes, such as EEF1A1 and ACTB, and identity marker genes, such as CD4 and CD3, high values of g_{j}. Lowly expressed genes important for CD4^{+} T cellspecific gene programs, such as IL21, IL13 and IL6, are often assigned small values of g_{j}, which allows Spectra to attend to gene expression differences that occur on a smaller scale (Supplementary Fig. 4). By default, Spectra runs with at least one celltypespecific factor per cell type so that global factors do not capture celltype identities.
Determining celltype granularity
Spectra can accommodate celltype labels at any level of granularity, subject to a linear increase in computational burden with the number of cell types in the dataset. Additionally, as the granularity increases, the effective sample size for estimating celltypespecific factors decreases, leading to potentially lowerquality celltypespecific factors. The correct celltype granularity depends on the dataset and the specific scientific questions at hand. First, the analyst should incorporate cell types that are known to be discrete and easily identifiable in the dataset via standard clustering analysis (for example, T cells, B cells, myeloid cells and epithelial cells). If cell subtypes exist that are not included as input to the model, Spectra devotes factors to describing variation across these subtypes. Moreover, if intermediate differentiation states between subtypes exist in the data, these subtypes should generally not be included as input to the model because (1) coarser celltypespecific factors can describe these intermediate states, and (2) delineating between subtypes via clustering may be inaccurate.
\({\boldsymbol{\mathcal{L}}}_{\mathbf{Graph}}\) (Θ): modeling gene–gene relationships in relation to expression data
In addition to faithful approximation of the input count matrix, we would also like interpretable factors that correspond known gene programs and biological processes (prior). Therefore, the second component of our likelihood function is a penalty term that guides the solution toward this prior. One aspect that makes Spectra unique is that it models this prior knowledge as a gene–gene community graph, which provides both computational efficiency and flexibility to adapt the graph structure to the data.
In this graph, nodes represent individual genes, and edges between genes occur when each gene has a similar distribution over factors. Communities within the graph, or densely connected subsets, then represent gene programs, whereas edges between communities contain information about genes that participate in multiple gene programs. Providing an imperfect, partially known graph structure as input, we can constrain our matrix factorization solution to respect the structure to yield interpretable gene programs. A main advantage of this approach is its flexibility. Gene sets are naturally incorporated into a graph by forming fully connected cliques among members of each set.
Further, more complex prior knowledge graph structures can be used as input, for example, arising from gene programs estimated from a separate dataset or cell atlas. Most importantly, the structure of this input gene–gene graph can be improved by fitting it to the data and learning gene programs that are more faithful to the data.
A second advantage of the graph prior is its scalability. Although gene sets may be highly overlapping, especially when curated from several separate databases, this redundancy is eliminated when storing information at the level of gene–gene relationships. Redundant gene sets will be merged into highly overlapping communities, and so two redundant gene sets can be approximately described by a single factor. A further computational advantage over gene set priors is that the dimensions of the graph are fixed as the size of gene set database increases, with only the number of edges increasing, and eliminates the need for iterating over the gene set dimension. Finally, operations involving the graph are implemented via efficient and parallelizable matrix multiplications with the graph adjacency matrix, thus allowing Spectra to efficiently scale to a large number of gene sets and cells (Fig. 2g).
To encourage factors to capture our prior knowledge of gene programs, we assume that binary gene–gene relationships are evidence of a pair of genes having similar latent profiles. This assumption could be incorporated by assuming a model for edge probabilities depending on the similarity scores 〈θ_{i}, θ_{j}〉 for genes i and j. However, the naive inner product does not explicitly account for the fact that prior information is invariably imperfect in systematic ways. First, at the level of entire gene programs, not all gene programs are active in all datasets, and, therefore, entire graph communities may be unnecessary for describing the observed expression data, while there are likely novel gene programs observed in the expression data that are not represented by communities in the graph. Also gene programs are imperfect due to inaccuracy of annotation, and, more frequently, gene programs differ across biological contexts, and our prior information is typically derived from a different biological context. Therefore, genes may be misclassified into gene sets to which they do not belong (corresponding to noisy edge observations), or gene sets may be incomplete (corresponding to missing edges). Spectra addresses these issues in the following two ways: (1) adaptively modeling background noise in the graph, allowing for the addition and removal of edges (Background edge rates), and (2) tuning the weight of the prior gene–gene matrix through the incorporation of a weight matrix, termed the factor interaction matrix, into the inner product between gene representations θ_{i} and θ_{j} (see below).
The factor interaction matrix tunes the weight of the gene–gene prior
To understand the purpose of the factor interaction matrix, let us first consider the ordinary inner product measuring gene–gene similarity in terms of gene program representations:
$$\langle {\theta }_{i},{\theta }_{j}\rangle ={\theta }_{i1}{\theta }_{j1}+\cdots +{\theta }_{iK}{\theta }_{jK}$$
The maximum value of this product is 1 and is achieved only when gene i and gene j put all their weight into a single gene program. Consider what happens if genes i and j are important components of a gene program that exists only in the expression data and not in our prior information. Then, i and j are not connected in the graph, and so the inner product model encourages 〈θ_{i}, θ_{j}〉 ≈ 0. When 〈θ_{i}, θ_{j}〉 ≈ 0, genes i and j must be components of entirely separate programs. In this way, we see that the naive inner product discourages new factors from being estimated from the expression data. Such an inner product model estimates novel factors that are heavily biased by the graph.
Now, instead of the naive inner product, consider a weighted product weighted by scalar values (b_{1}, b_{2},…, b_{K}) that are between 0 and 1:
$${\langle {\theta }_{i},{\theta }_{j}\rangle }_{b}={b}_{1}{\theta }_{i1}{\theta }_{j1}+\cdots +{b}_{K}{\theta }_{iK}{\theta }_{jK}$$
To model the data, we can adjust the values of (b_{1},…, b_{K}) to achieve the best fit. Consider the same situation as above, where i and j are not connected in the graph, but they are components of a gene program supported by expression data alone. The product model again encourages \({\langle {\theta }_{i},{\theta }_{j}\rangle }_{b}\approx 0\); however, now this constraint does not necessarily encourage θ_{i} and θ_{j} to be dissimilar. To see this, suppose that θ_{i} = [1, 0, 0] and θ_{j} = [1, 0, 0]. If b_{1} = 0, then
$$\begin{array}{rcl}{\langle {\theta }_{i},{\theta }_{j}\rangle }_{b}&=&{b}_{1}1* 1+{b}_{2}0* 0+{b}_{3}0* 0\\ &=&0\end{array}$$
Hence, novel gene programs can be estimated as long as the value of b_{k} corresponding to that program is pushed toward 0. We can interpret gene programs corresponding to low values of b_{k} as novel and gene programs corresponding to high values of b_{k} as supported by prior information. We could equivalently write each weight b_{k} as one of the nonzero elements of a diagonal matrix
$$B=\left[\begin{array}{ccc}{b}_{1}&&\\ &\ddots &\\ &&{b}_{K}\end{array}\right]$$
so that
$$\begin{array}{rcl}\langle {\theta }_{i},B{\theta }_{j}\rangle &=&{\langle {\theta }_{i},{\theta }_{j}\rangle }_{b}\\ &=&{b}_{1}{\theta }_{i1}{\theta }_{j1}+\cdots +{b}_{K}{\theta }_{iK}{\theta }_{jK}\end{array}$$
In practice, we allow the off diagonals of this matrix B to be estimated as nonzero (Supplementary Fig. 5). The resulting matrix is termed the factor interaction matrix.
Allowing off diagonals of the factor interaction matrix to be nonzero serves two purposes. First, it allows the model to explain overlapping gene sets without forcing shared genes to have partial membership. For example, if two gene sets overlap but in reality represent two distinct biological processes that can be separated in the gene expression data, the model is not forced to assign partial membership to overlapping genes but can fully assign genes to one of two programs. To account for this, the offdiagonal element corresponding to this pair of gene programs (B_{k,l} for programs k and l) can be estimated as greater than 0. On real data, we see this occur for βalanine metabolism and fatty acid metabolism (Supplementary Fig. 6). Second, nonzero offdiagonal elements of the factor interaction matrix serve to mitigate the effect of lowquality edges in the prior graph by allowing edges between genes that are in separate gene expression programs to arise with nonzero probability.
Full Spectra model
As a notation, we refer to the adjacency matrix of an input graph as \(A\in {{\mathbb{R}}}^{p\times p}\) with element A_{ij} = 1 if an edge exists between i and j and A_{ij} = 0 otherwise. Following the discussion above, the Spectra generative model states (Supplementary Fig. 5)
$${\mathbb{P}}\left[{A}_{ij}=1\right]=\langle {\theta }_{i},B{\theta }_{j}\rangle$$
In the full Spectra model, each gene has a separate representation per cell type (in addition to its global representation), θ_{ci}, where c indexes into the possible cell types. To supervise these representations in a celltypespecific manner, the user (optionally) provides one graph for each cell type and a graph representing global gene–gene relationships (Supplementary Figs. 6 and 7). These graphs are modeled separately, where each graph’s edges can only be predicted using factor representations specific to that cell type. The celltypespecific graphs are denoted A_{c} for cell type c, with A_{c,ij} = 1 if there is a celltypespecific annotation between genes i and j for cell type c. The celltypespecific graphs can only influence celltypespecific factors and vice versa:
$${\mathbb{P}}\left[{A}_{c,ij}=1\right]=\langle {\theta }_{ci},{B}_{c}{\theta }_{cj}\rangle$$
diagrammed in Supplementary Fig. 7. Importantly, a separate factor interaction matrix, B_{c}, is learned for each cell type with a prior graph provided.
The computational cost of including granular celltypespecific prior information can be large, as each cell type requires its own graph.
Background edge rates
Realistic annotation graphs have several edges that are not supported by expression data, and the model should be allowed the flexibility to attribute edges (or the lack thereof) in annotations to a background rate of noise. To allow flexibility in modifying the original graph, we incorporate background edge and nonedge rates κ and ρ that reflect noise rates in the observed graph. These parameters serve two separate purposes. First, these parameters deal with numerical stability issues by moving probabilities away from 0 and 1. Second, the parameters control the rate that edges are added and removed from the original graph. Intuitively, our inference procedure examines whether a relationship (or lack of a relationship) in the prior knowledge graph is consistent with expression data and if not can ascribe this relationship to random noise.
The generative process of our model is that with some probability ρ, edges between gene i and j are blocked out and cannot occur irrespective of the corresponding factor values θ_{i} and θ_{j}. If this does not occur, an edge will be generated by random chance with probability κ. Finally, if neither of these events occur, an edge is generated according to the factor similarity score 〈θ_{i}, Bθ_{j}〉. This yields the following distribution for the adjacency matrix:
$$\begin{array}{l}{\mathbb{P}}\left[{A}_{ij}=1\right]=(1\kappa )(1\rho ){\theta }_{i}^{\top }B{\theta }_{j}+\kappa (1\rho )\\ {\mathbb{P}}\left[{A}_{ij}=0\right]=(1\kappa )(1\rho )\left(1{\theta }_{i}^{\top }B{\theta }_{j}\right)+\rho \end{array}$$
where κ and ρ are (celltypespecific) background rates of 1 and 0 in the adjacency matrix, respectively. κ and ρ can be estimated from the data or fixed to constants and treated as tunable hyperparameters.
Constructing the gene–gene prior graph
In most applications, Spectra receives a set of gene sets rather than a gene–gene graph as input, and the gene–gene graph is constructed from these gene sets. Large gene sets generally provide lower evidence that any given gene is crucial to the process that the gene set represents. For example, hallmark gene sets often contain hundreds of genes^{76}, some of which are upregulated as distant downstream targets. Additionally, larger cliques represent a larger component of the likelihood function, potentially biasing Spectra solutions toward attending to the largest gene communities. Therefore, by default, when Spectra takes in gene sets as input, the edge weights used to downweight the contribution of any individual graph edge are proportional to the size of the gene set that it is derived from. The default weighting scheme is to weight edges by the total number of edges in the clique. For a given gene set G_{k}, this involves downweighting by \(\frac{1}{{{ {G}_{k}} \choose {2}}}\):
$${w}_{ij}\propto \frac{1}{\left(\begin{array}{c} {G}_{k} \\ 2\end{array}\right)}$$
where ∣G_{k}∣ is the size of a gene set G_{k} containing genes i and j. The weights are rescaled so that the median weight across gene sets is 1. When a pair of genes exists in multiple gene sets, the weights accumulate additively. Another reasonable choice is \({w}_{ij}=\frac{1}{\max [d(i),d(\,j)]}\), where d(i) is the degree of node i.
As an alternative weighting scheme, Spectra accommodates weighted graphs by scaling edges by edgespecific weights. This feature allows users to annotate the prior information graphs with additional quantitative information representing relative confidence in each individual annotation.
Pseudolikelihood function
The heretofore described model components describe the expected values of the expression data matrix X and the prior knowledge graph A under the Spectra generative process. Together with specific observation distributions, this would specify a likelihood function that serves as the maximization objective of Spectra, fit via either firstorder methods or expectation maximization (EM). The loss function described below is the negative value of a proper likelihood function in the case where weights w_{ij} are equal to 1, λ = 1, and expression data X follow a Poisson distribution. In practice, these conditions are not satisfied, so we adopt the terminology pseudolikelihood function to describe the negative loss function. For ease of exposition, we first describe the pseudolikelihood function assuming a single cell type. Recall the general form of the Spectra objective, consisting of a term that measures the ability of Spectra factors to recapitulate expression data and a term that measures the concordance of Spectra factors with the prior knowledge database:
$${{{\mathcal{L}}}}(\Theta )=\lambda {{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}(\Theta )+{{{{\mathcal{L}}}}}_{{{{\rm{Graph}}}}}(\Theta )$$
As edges are binary, combined with the assumption of independence, the log likelihood of A_{ij} given a probability of 1, \({p}_{ij}:= {\mathbb{P}}[{A}_{ij}=1]\), is
$$\log {\mathbb{P}}({A}_{ij})={A}_{ij}\log {p}_{ij}+(1{A}_{ij})\log (1{p}_{ij})$$
With p_{ij} as described in Modeling gene–gene relationships in relation to expression data and The factor interaction matrix tunes the weight of the gene–gene prior,
$$\begin{array}{l}\log {\mathbb{P}}({A}_{ij})={A}_{ij}\log \left[(1\kappa )(1\rho ){\theta }_{i}^{\top }B{\theta }_{j}+\kappa (1\rho )\right]\\ \qquad\qquad\quad\;+(1{A}_{ij})\log \left[(1\kappa )(1\rho )(1{\theta }_{i}^{\top }B{\theta }_{j})+\rho \right]\end{array}$$
To incorporate weights (following Constructing the gene–gene prior graph), we weight likelihood terms corresponding to each edge in the graph by an edgespecific weight w_{ij}:
$$\begin{array}{rcl}\log {\mathbb{P}}({A}_{ij})&=&{w}_{ij}{A}_{ij}\log \left[(1\kappa )(1\rho ){\theta }_{i}^{\top }B{\theta }_{j}+\kappa (1\rho )\right]\\ &+&(1{A}_{ij})\log \left[(1\kappa )(1\rho )(1{\theta }_{i}^{\top }B{\theta }_{j})+\rho \right]\end{array}$$
Combining across all observations (i, j), this leads to the expression for \({{{{\mathcal{L}}}}}_{{{{\rm{Graph}}}}}(\Theta )\):
$$\begin{array}{rcl}{{{{\mathcal{L}}}}}_{{{{\rm{Graph}}}}}(\Theta )&=&\mathop{\sum }\limits_{i=1}^{p}\,\mathop{\sum }\limits_{j=1,j\ne i}^{p}\left[{w}_{ij}{A}_{ij}\log \left((1\kappa )(1\rho ){\theta }_{i}^{\top }B{\theta }_{j}+\kappa (1\rho )\right)\right.\\ &+&\left.(1{A}_{ij})\log \left((1\kappa )(1\rho )(1{\theta }_{i}^{\top }B{\theta }_{j})+\rho \right)\right]\end{array}$$
The loss function derived from the Poisson distribution has been widely used for modeling scRNAseq counts^{4,56,77}. Although processed data may not necessarily be well described by the Poisson observation model (that is, scranprocessed data are on a log scale), the resulting log likelihood strikes a practical balance in scaling with gene expression magnitude. The resulting loss function has been used in contexts other than modeling count data as the KL divergence loss^{78}. Additionally, under idealized settings, estimates obtained by minimizing this loss function inherit properties of M estimators analogous to those of maximum likelihood estimators, with few assumptions on the data distribution^{79}. Here, we are primarily concerned with how the loss function behaves under changes in scale. For example, suppose we have an estimated expression value \({\hat{{{{\bf{X}}}}}}_{ij}\). We can write \({\hat{{{{\bf{X}}}}}}_{ij}(\Theta )\) as our predicted gene expression as a function of the model parameters. The least squares loss \({{{{\mathcal{L}}}}}_{2}(\Theta ):= {[{{{{\bf{X}}}}}_{ij}{\hat{{{{\bf{X}}}}}}_{ij}(\Theta )]}^{2}\) is quadratically dependent on the scale of X_{ij}, because replacing both ground truth and estimate by scaled versions φX_{ij} and \(\varphi {\hat{{{{\bf{X}}}}}}_{ij}\) leads to a loss of \({\varphi }^{2}{{{{\mathcal{L}}}}}_{2}(\Theta )\). Similar to the issues addressed in ‘L_{Reconstruction} (Θ): modeling gene expression as a lowrank product’, the squared loss function encourages factors to attend to highly expressed genes because scale differences amplify the loss quadratically. At the other extreme, consider the Itakura–Saito loss (IS loss) given by (we briefly assume that both ground truth and estimate are not 0)
$${{{{\mathcal{L}}}}}_{IS}(\Theta ):= \frac{{{{{\bf{X}}}}}_{ij}}{{\hat{{{{\bf{X}}}}}}_{ij}(\Theta )}\log \frac{{{{{\bf{X}}}}}_{ij}}{{\hat{{{{\bf{X}}}}}}_{ij}(\Theta )}1$$
If we scale observed counts and prediction φX_{ij} and \(\varphi {\hat{{{{\bf{X}}}}}}_{ij}\), then the IS loss does not change. So, matrix factorization with the IS loss does not suffer from a bias toward highly expressed genes. However, forcing the model to predict all lowly expressed genes is not desirable, often leading to lowquality factors. The Poisson log likelihood exhibits a practically convenient balance between these two extremes:
$${{{{\mathcal{L}}}}}_{{{{\rm{Pois}}}}}(\Theta ):= {{{{\bf{X}}}}}_{ij}\log {\hat{{{{\bf{X}}}}}}_{ij}(\Theta )+{\hat{{{{\bf{X}}}}}}_{ij}(\Theta )$$
When X_{ij} and \({\hat{{{{\bf{X}}}}}}_{ij}\) are scaled by φ, \({{{{\mathcal{L}}}}}_{{{{\rm{Pois}}}}}(\Theta )\) is scaled by φ. This linear dependence on expression scale achieves a good balance in the relative weighting between highly expressed and lowly expressed genes.
An additional advantage of this loss function is that the second term behaves as a lasso penalty^{80}, inducing sparsity in the resulting estimates of \({\hat{{{{\bf{X}}}}}}_{ij}\) for sparse data X, noted by Gopalan et al.^{81}. This sparsity allows for a parsimonious explanation of a cell’s gene expression using as few factors as possible. In Spectra, we have \({\hat{{{{\bf{X}}}}}}_{ij}:= {\alpha }_{i}^{\top }{\theta }_{j}({g}_{j}+\delta )\), yielding the expression
$${{{{\mathcal{L}}}}}_{{{{\rm{Reconstruction}}}}}(\Theta )=\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{p}{{{{\bf{X}}}}}_{ij}\log \left[{\alpha }_{i}^{\top }{\theta }_{j}({g}_{j}+\delta )\right]{\alpha }_{i}^{\top }{\theta }_{j}({g}_{j}+\delta )$$
Combining the components, the pseudolog likelihood function is
$$\begin{array}{rcl} {\mathcal{L}}(\alpha, \theta, g, B) &=& \lambda \underbrace{\mathop{\sum}\limits_{i=1}^n\mathop{\sum}\limits_{j=1}^p {\mathbf{X}}_{ij}\log\left(\alpha_{i}^\top \theta_j (g_j + \delta)\right) – \alpha_{i}^\top\theta_j(g_j + \delta)}_{{\mathcal{L}}_{{\rm{Reconstruction}}}} \\ & +& \mathop{\sum}\limits_{i=1}^p\,\mathop{\sum}\limits_{j=1,\, j\neq i}^p\left[w_{ij}A_{ij}\log\left((1\kappa)(1\rho)\theta_{i}^\top B \theta_j + \kappa(1\rho) \right)\right. \\ & +& \underbrace{\left.(1A_{ij})\log\left((1\kappa)(1\rho)(1\theta_{i}^\top B \theta_j) + \rho\right)\right]}_{{{\mathcal{L}}_{{\rm{Graph}}}}} \end{array}$$
Again, X is the data matrix after processing, whereas α, θ, g and B are the four model parameters that need to be estimated. The first term in the pseudolog likelihood function comes from the log likelihood of the Poisson distribution (also referred to as the KL divergence loss function when multiplied by −1), while the second term is the log likelihood of a Bernoulli distribution with positive observations scaled by w_{ij}. The pseudolikelihood function optimized by Spectra includes an optimization over celltypespecific and global parameters, and so an additional sum over cell types is included in the pseudolog likelihood (Supplementary Fig. 7).
$$\begin{array}{rcl} {\mathcal{L}}(\alpha, \theta, g, B) &=&\mathop{\sum}\limits_{c=1}^C\mathop{\sum}\limits_{i=1}^{n_c} \lambda_c\mathop{\sum}\limits_{j=1}^p {\mathbf{X}}_{cij}\log\left((g_j + \delta)\alpha_{c,i,:K}^\top \theta_j + (g_{cj} + \delta) \alpha_{c,i, K+1:}^\top \theta_{cj}\right)\\ && \underbrace{ \mathop{\sum}\limits_{c=1}^C\mathop{\sum}\limits_{i=1}^{n_c}\mathop{\sum}\limits_{j=1}^p\left((g_j + \delta)\alpha_{c,i,:K}^\top \theta_j + (g_{cj} + \delta) \alpha_{c,i, K+1:}^\top \theta_{cj}\right)}_{{\mathcal{L}}_{{\rm{Reconstruction}}}}\\ && + \mathop{\sum}\limits_{c=1}^{C+1} \mathop{\sum}\limits_{i=1}^p\,\mathop{\sum}\limits_{j=1,\, j\neq i}^p\left[w_{c,ij}A_{c,ij}\log\left((1\kappa_c)(1\rho_c)\theta_{ci}^\top B_c \theta_{cj} + \kappa_c(1\rho_c)\right)\right.\\ && \underbrace{+\left.(1A_{c,ij})\log\left((1\kappa_c)(1\rho_c)(1\theta_{ci}^\top B_c \theta_{cj}) + \rho_c\right)\right]}_{{{\mathcal{L}}_{{\rm{Graph}}}}} \end{array}$$
As all discrete parameters have been integrated out, this pseudolog likelihood can be directly maximized via firstorder methods, such as gradient descent. Approximate secondorder methods are not ideal due to the high dimension of the parameter space for practical problem sizes. However, for smallersized problems (in terms of number of genes and factors), we develop an EM approach that yields intuitive coordinate ascent updates of model parameters.
Spectra’s output
To describe the activity level of factor k in cell i, we compute cell scores as cell_score_{ik} = q_{k}α_{ik}, where \({q}_{k}=\frac{1}{p}\mathop{\sum }\nolimits_{j = 1}^{p}{\theta }_{jk}\). In other words, the cell scores are the loadings weighted by the total factor usage across all genes. This allows us to circumvent the nonidentifiability of scale associated with factor analysis approaches. Regarding terminology, we will always refer to the unnormalized loadings α_{ik} as ‘loadings’ and the normalized loadings as cell scores. Additionally the cellspecific parameters of other matrix factorization methods are described as ‘loadings’. The ground truth parameters in our simulations are also described as ‘loadings’.
To describe the relevance of gene j for factor k, we compute gene scores for gene j and factor k as \(\left(\frac{{g}_{j}+\delta }{{g}_{j}+\delta +{\mathtt{offset}}}\right){\theta }_{jk}\). The first term is near 0 when g_{j} is very small and near 1 when g_{j} is large. This allows us to remove very lowly expressed genes from the factors while maintaining coherence. By default, the offset term is set to 1, can be tuned and in some cases set to 0, which yields the factors θ_{jk} themselves. Each θ_{jk} is more directly influenced by the prior than g_{j}θ_{jk}, and so setting offset to 0 tends to yield marker lists closely resembling input gene sets.
Users can access additional parameters that facilitate interpretation of the gene scores and cell scores. The factor interaction matrix per cell type (B; Supplementary Fig. 6) contains entries in the range [0, 1], where diagonal entries can be interpreted as the relevance of a given factor to the prior graph. Offdiagonal entries can be interpreted as a background rate of edges between genes that are expressed in separate factors. For each cell type, users can access a posterior graph that is denoised using information from the expression data. The posterior graph is computed by the inner product 〈θ_{i}, Bθ_{j}〉 for each pair of genes θ_{i} and θ_{j} after estimating θ_{i}, θ_{j} and B from the data.
Of importance are the diagonal elements of the interaction matrices B, which contain information about the dependence of the factor on the input graph. We term these diagonal elements η, specifically η_{c} ≔ diag(B_{c}).
Factor importance and information scores
We adopt the following two metrics to prioritize factors in the output of Spectra: factor importance and factor information scores, each measuring a different property of the factor. Both metrics are computed per cell type for all of the factors that are potentially relevant to that cell type. In other words, to prioritize the relevant factors for a cell type, the metrics are computed for each celltypespecific factor and each global factor, resulting in 2(K + K_{c}) scores for cell type c. The factor importance score measures the overall contribution of a factor to explain the observed expression data (as measured by the reconstruction component of the loss function), regardless of whether this factor explains withincelltype variation. The factor information score, complementary to the factor importance score, measures whether the gene set associated with a factor captures meaningful withincelltype variation. Factors with high scores in either of these categories are potentially of interest for post hoc analysis. The factor importance score is a relative change in reconstruction error for a specific cell type when a certain factor is masked out. Let \({\bar{L}}_{c}(\theta ,{\theta }_{j})\) be the reconstruction error for cell type c:
$${\bar{L}}_{c}(\theta ,{\theta }_{j}):= \mathop{\sum }\limits_{i=1}^{{n}_{c}}{\lambda }_{c}\mathop{\sum }\limits_{j=1}^{p}{{{{\bf{X}}}}}_{cij}\log \left[({g}_{j}+\delta ){\alpha }_{c,i,:K}^{\top }{\theta }_{j}+({g}_{cj}+\delta ){\alpha }_{c,i,K+1:}^{\top }{\theta }_{cj}\right]$$
(1)
$$\mathop{\sum }\limits_{i=1}^{{n}_{c}}\mathop{\sum }\limits_{j=1}^{p}\left[({g}_{j}+\delta ){\alpha }_{c,i,:K}^{\top }{\theta }_{j}+({g}_{cj}+\delta ){\alpha }_{c,i,K+1:}^{\top }{\theta }_{cj}\right]$$
(2)
Here, it is understood that all parameters except θ and θ_{j} are fixed to their fitted values. Further, let ϵ_{k} denote a vector of all 1 values of a dimension equal to the number of factors, except at k where it is 0: ϵ_{k} = 1 − e_{k}. The importance score for celltypespecific factor k is then \({{{{\mathcal{F}}}}}_{c,k}=\frac{{\bar{L}}_{c}(\theta ,{\epsilon }_{k}\circ {\theta }_{j}){\bar{L}}_{c}(\theta ,{\theta }_{j})}{{\bar{L}}_{c}(\theta ,{\theta }_{j})}\), where ° represents elementwise product. Similarly the importance score for a global factor k is given by \({{{{\mathcal{F}}}}}_{c,k}^{(g)}=\frac{{\bar{L}}_{c}({\epsilon }_{k}\circ \theta ,{\theta }_{j}){\bar{L}}_{c}(\theta ,{\theta }_{j})}{{\bar{L}}_{c}(\theta ,{\theta }_{j})}\)
Information scores are given by Definition (1) in Mimno et al.^{82} but computed per cell type to represent celltypespecific information content. Specifically, given a marker list associated with a factor and with M set to 30 we have
$${C}_{c,k}=\mathop{\sum }\limits_{m=2}^{M}\mathop{\sum }\limits_{l=1}^{m1}\log \frac{{D}_{c}({g}_{m}^{(k)},{g}_{l}^{(k)})+1}{{D}_{c}({g}_{l}^{(k)})}$$
(3)
where \({g}_{m}^{(k)}\) is now the mth top gene for factor k, and D_{c}(⋅, ⋅) and D_{c}(⋅) are the cooccurrence frequency and frequency, respectively, within cell type c. We plot \(\exp ({C}_{c,k})\) as the information scores.
Optimization
We develop the following two optimization schemes: an auxiliary latent variable EM approach and gradient descentbased optimization via Adam^{83}. EM converges quickly in many situations; however, the memory requirements are substantially larger than the gradient descentbased optimization. Specifically the memory requirement of EM parameter storage is O(npK + p^{2}K^{2}) due to auxiliary parameter storage, while the memory requirement of gradient descent is substantially lower, O(nK + pK + K^{2}).
Although memory intensive, the EM solution is valuable for two reasons: (1) for problems with a small number of factors (<20) and genes (<2,500), EM is fast and less sensitive to initialization than gradient descent^{84}, and (2) the EM updates are intuitive and give us an understanding of how our algorithm balances evidence from the graph and expression data.
However, optimization with Adam can handle a large number of factors (>200) and genes (>10,000) and can exhibit stability with the appropriate initialization. By default, Spectra uses Adam for optimization.
EM
For ease of exposition, we describe the EM routine for the nonintegrative model; the updates are easily extendable to incorporate celltype labels. Additionally, we write the pseudolikelihood function equivalently (up to a scale factor) in terms of \(\tilde{\lambda }:= \frac{1}{\lambda }\). To make EM possible, we exploit two facts about the distribution of (X, A)^{85}.
The first is that if z_{ijk} ~ Pois[(g_{j} + δ)α_{ik}θ_{jk}] and we define \({{{{\bf{X}}}}}_{ij}=\mathop{\sum }\nolimits_{k = 1}^{K}{z}_{ijk}\), then X_{ij} still has the correct marginal distribution due to standard properties of the Poisson distribution^{81}. Second, if we define \({\tilde{z}}_{ij} \sim {{{\rm{Categorical}}}}({\theta }_{i})\) and \({\tilde{z}}_{ji} \sim {{{\rm{Categorical}}}}({\theta }_{j})\) and define a conditional distribution for A_{ij} as
$${\mathbb{P}}\left({{{{\bf{A}}}}}_{ij}=1 {\tilde{z}}_{ijk}=1,{\tilde{z}}_{jil}=1\right)={B}_{kl}$$
then A_{ij} still has the correct marginal distribution^{86}. As a result, we can optimize the marginal log likelihood via optimization of the expected complete data log likelihood \({{\mathbb{E}}}_{z,\tilde{z}}[\log p({{{\bf{X}}}},{{{\bf{A}}}},z,\tilde{z})]\), where the expectation is taken over the posterior \(p(z,\tilde{z} {{{\bf{A}}}},{{{\bf{X}}}})\). The expected complete data log likelihood is given by
$$\begin{array}{rcl}\tilde{{{{\mathcal{L}}}}}(\alpha ,B,\theta ,g)&=&\mathop{\sum }\limits_{i=1}^{n}\,\mathop{\sum }\limits_{j=1}^{p}\mathop{\sum }\limits_{k=1}^{K}{\phi }_{ijk}\log (({g}_{j}+\delta ){\alpha }_{ik}{\theta }_{jk})({g}_{j}+\delta ){\alpha }_{ik}{\theta }_{jk}\\ &&+\;\tilde{\lambda }\mathop{\sum }\limits_{i=1}^{p}\,\mathop{\sum }\limits_{j=1,\,j\ne i}^{p}\mathop{\sum }\limits_{k=1}^{K}\mathop{\sum }\limits_{l=1}^{K}{\tilde{\phi }}_{ijkl}\left({w}_{ij}{A}_{ij}\log \left((1\kappa ){B}_{kl}+\kappa \right)\right.\\ &&+\left.(1{A}_{ij})\log \left((1\kappa )(1\rho )(1{B}_{kl})+\rho \right)+\log {\theta }_{ik}+\log {\theta }_{jl}\right)\end{array}$$
where \({\phi }_{ijk}:= {\mathbb{E}}({z}_{ijk} {{{\bf{X}}}})={{{{\bf{X}}}}}_{ij}\frac{{\alpha }_{ik}{\theta }_{jk}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}}\) and
$$\begin{array}{rcl}{\tilde{\phi }}_{ijkl}&=&{\mathbb{P}}\left({\tilde{z}}_{ijk}=1,{\tilde{z}}_{jil}=1 {{{\bf{A}}}}\right)\\ &\propto &{\theta }_{ik}{\theta }_{jl}{\left((1\kappa ){B}_{kl}+\kappa \right)}^{{w}_{ij}{A}_{ij}}{\left((1\kappa )(1\rho )(1{B}_{kl})+\rho \right)}^{1{A}_{ij}}\end{array}$$
Importantly, this manipulation moves summations outside of the logs, which permits analytic EM updates for B, α and g given by
$$\begin{array}{l}{\alpha }_{ik}\leftarrow \frac{\mathop{\sum }\nolimits_{j = 1}^{p}{\phi }_{ijk}}{\mathop{\sum }\nolimits_{j = 1}^{p}{\theta }_{jk}({g}_{j}+\delta )}\\ {g}_{j}\leftarrow {{{{\rm{proj}}}}}_{[0,1]}\left(\frac{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{k = 1}^{K}{\phi }_{ijk}}{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}}\delta \right)\\ {B}_{kl}\leftarrow {{{{\rm{proj}}}}}_{[0,1]}\left(\frac{\left(\frac{\rho }{1\rho }+(1\kappa )\right){\Xi }_{kl}\kappa }{(1\kappa )(1+{\Xi }_{kl})}\right)\end{array}$$
where \({\Xi }_{kl}:= \frac{\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{j = 1}^{p}{\tilde{\phi }}_{ijkl}{w}_{ij}{A}_{ij}}{\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{j = 1}^{p}{\tilde{\phi }}_{ijkl}(1{A}_{ij})}\), representing an odds ratio between Bernoulli outcomes. Further, the complete data log likelihood has diagonal Hessian when viewed as a function of θ only, \(\tilde{L}(\theta )\), permitting linear time Newton Raphson updates
$$\begin{array}{l}{\gamma }_{jl}\leftarrow \frac{1}{{\theta }_{jl}}\left[\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j=1}^{p}{\phi }_{ijl}+\tilde{\lambda }\mathop{\sum }\limits_{i=1}^{p}\mathop{\sum }\limits_{k=1}^{K}{\tilde{\phi }}_{ijkl}+\tilde{\lambda }\mathop{\sum }\limits_{i=1}^{p}\mathop{\sum }\limits_{k=1}^{K}{\tilde{\phi }}_{jilk}\right]({g}_{j}+\delta )\mathop{\sum }\limits_{i=1}^{n}{\alpha }_{il}\\ {H}_{jl}^{1}\leftarrow \frac{{\theta }_{jl}^{2}}{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{j = 1}^{p}{\phi }_{ijl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{ijkl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{jilk}}\\ {\Delta }_{j}\leftarrow \frac{\mathop{\sum }\nolimits_{k = 1}^{K}{\gamma }_{jk}{H}_{jk}^{1}}{\mathop{\sum }\nolimits_{k = 1}^{K}{H}_{jk}^{1}}\\ {\theta }_{jk}\leftarrow {\theta }_{jk}{H}_{jk}^{1}({\Delta }_{j}+{\gamma }_{jk})\end{array}$$
Algorithm 1. EM Spectra routine
Require:
\({{{\bf{X}}}}\ge 0,{{{\bf{A}}}}\in {\{0,1\}}^{p\times p},T\in {{\mathbb{Z}}}_{+},\kappa \in {{\mathbb{R}}}_{+},\rho \in {{\mathbb{R}}}_{+},\tilde{\lambda }\in {{\mathbb{R}}}_{+}\)
initialize B, α, θ, g
while \({{{{\mathcal{L}}}}}_{n}{{{{\mathcal{L}}}}}_{n1} > \epsilon\), do
\({\phi }_{ijk}\leftarrow {{{{\bf{X}}}}}_{ij}\frac{{\alpha }_{ik}{\theta }_{jk}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}}\)
\({\tilde{\phi }}_{ijkl}\leftarrow {\theta }_{ik}{\theta }_{jl}{((1\kappa ){B}_{kl}+\kappa )}^{{w}_{ij}{A}_{ij}}{((1\kappa )(1\rho )(1{B}_{kl})+\rho )}^{1{A}_{ij}}\)
\({\tilde{\phi }}_{ijkl}\leftarrow {\tilde{\phi }}_{ijkl}/{\sum }_{kl}{\tilde{\phi }}_{ijkl}\)
while t < T, do
\({\gamma }_{jl}\leftarrow \frac{1}{{\theta }_{jl}}\left[\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{j = 1}^{p}{\phi }_{ijl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{ijkl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{jilk}\right]\)\(({g}_{j}+\delta )\mathop{\sum }\nolimits_{i = 1}^{n}{\alpha }_{il}\)
\({H}_{jl}^{1}\leftarrow \frac{{\theta }_{jl}^{2}}{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{j = 1}^{p}{\phi }_{ijl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{ijkl}+\tilde{\lambda }\mathop{\sum }\nolimits_{i = 1}^{p}\mathop{\sum }\nolimits_{k = 1}^{K}{\tilde{\phi }}_{jilk}}\)
\({\Delta }_{j}\leftarrow \frac{\mathop{\sum }\nolimits_{k = 1}^{K}{\gamma }_{jk}{H}_{jk}^{1}}{\mathop{\sum }\nolimits_{k = 1}^{K}{H}_{jk}^{1}}\)
\({\theta }_{jk}\leftarrow {\theta }_{jk}{H}_{jk}^{1}({\Delta }_{j}+{\gamma }_{jk})\)
\({\alpha }_{ik}\leftarrow \frac{\mathop{\sum }\nolimits_{j = 1}^{p}{\phi }_{ijk}}{\mathop{\sum }\nolimits_{j = 1}^{p}{\theta }_{jk}({g}_{j}+\delta )}\)
\({g}_{j}\leftarrow {{{{\rm{proj}}}}}_{[0,1]}\left(\frac{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{k = 1}^{K}{\phi }_{ijk}}{\mathop{\sum }\nolimits_{i = 1}^{n}\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}}\delta \right)\)
\({B}_{kl}\leftarrow {{{{\rm{proj}}}}}_{[0,1]}\left(\frac{(\frac{\rho }{1\rho }+(1\kappa )){\Xi }_{kl}\kappa }{(1\kappa )(1+{\Xi }_{kl})}\right)\)
\({{{{\mathcal{L}}}}}_{n}\leftarrow {{{\mathcal{L}}}}(\alpha ,\theta ,B,g)\)
The integrative version of Spectra uses analogous updates, with the bounds of summations appropriately modified; specifically, the E step updates are \({\phi }_{ijk}={{{{\bf{X}}}}}_{ij}\frac{{g}_{c,\,j}{\alpha }_{ik}{\theta }_{jk}}{{g}_{j}\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}+{g}_{c,\,j}\mathop{\sum }\nolimits_{k = K+1}^{K+{K}_{c}}{\alpha }_{ik}{\theta }_{jk}}\) for celltypespecific factors and \({\phi }_{ijk}={{{{\bf{X}}}}}_{ij}\frac{{g}_{j}{\alpha }_{ik}{\theta }_{jk}}{{g}_{j}\mathop{\sum }\nolimits_{k = 1}^{K}{\alpha }_{ik}{\theta }_{jk}+{g}_{c,\,j}\mathop{\sum }\nolimits_{k = K+1}^{K+{K}_{c}}{\alpha }_{ik}{\theta }_{jk}}\) for global factors.
Adam
For large scRNAseq datasets, the memory requirement of EM for fitting a large number of factors is prohibitive. We optimize the marginal log likelihood with Adam^{83}, a momentumbased gradient descent optimizer implemented in pytorch, directly. In detail, the Adam hyperparameters β_{1} and β_{2} are set to default values 0.9 and 0.999, respectively. We use a learning rate schedule of [1.0, 0.5, 0.1, 0.01, 0.001, 0.0001], where training at subsequent learning rates occurs after convergence at higher learning rates. A maximum number of iterations is fixed to 10,000. This default training scheme can be modified by the user. In particular, for faster convergence, either the maximum number of iterations can be made smaller or the smallest learning rates can be removed, allowing for solutions that are not as fine tuned.
Initialization
Because the Spectra objective function is nonconvex and susceptible to suboptimal local maxima, initialization plays an important role in the quality of the eventual solutions. When Spectra is provided with gene sets as input, our strategy is to initialize factors as close to the gene sets as possible. Whenever the number of factors is greater than the number of gene sets, we resort to a gene setbased initialization procedure.
First, a hyperparameter t controls the strength of the initialization. By default, t is set to 25. For a given cell type, whenever the number of factors is at least as large as the number of gene sets, we initialize \(\log {\theta }_{ij}\leftarrow t\) when gene i belongs to gene set j for each gene set j = 1,…, N_{gs}, and N_{gs} is the number of gene sets. Further, the factor interaction matrix is initialized with logitB_{jj} ← t to encode the knowledge that this factor corresponds to a gene set. To encourage the last factor to capture genes that have no edges in the prior graph, we initialize the last row and column to small values, logit B_{K,j} ← −t and logit B_{j,K} ← −t for all j = 1,…, K. Corollary 1 (Supplementary Note) explains why this leads to extremely fast convergence when λ is small.
For a given cell type, when the number of factors is not greater than the number of gene sets, we resort to initialization with NMF^{87}.
GPU acceleration
For all results involving GPU acceleration, a wrapper around the original model implementation is provided, loading models onto the GPU via the pytorch syntax device = torch.device (‘cuda:0’) and model.to (device) when CUDA is available. Data (including adjacency matrices and expression data) are similarly loaded onto GPU. All GPU methods were run on an NVIDIA A100 Tensor Core GPU.
Determining the number of factors
We adopt two approaches to determine the number of factors. The first approach is to set the number of factors for each cell type equal to the number of gene sets available for that cell type + 1 (similar to the approach taken by Slalom), and the second approach is to estimate the number of factors from the data via bulk eigenvalue matching analysis^{88}. Fitting a large number of factors is possible; in our experiments, we fit a set of 197 factors.
The second approach involves three steps. In the first step, we estimate a null distribution of eigenvalues based on sampling variances \({\sigma }_{j}^{2} \sim\) Gamma(θ, 1/θ) and then subsequently an n × p Gaussian matrix \({Z}_{ij} \sim N(0,{\sigma }_{j}^{2})\). We sample B of these matrices and take the average of the sorted eigenvalues of \(\frac{1}{n}Z{Z}^{\,t}\) over B samples. Typically B is set to 100. Given the average sorted eigenvalues, we compute a regression coefficient without intercept between the ‘bulk’ of these eigenvalues and the bulk of the observed eigenvalues of the data covariance matrix. The bulk of the distribution are the values between some lower and upper quantiles, which are hyperparameters of the method. We perform a line search on θ to find a value of θ that minimizes the sum of squared residuals of this regression. Denoting this regression coefficient as β, in the second step, we simulate a background distribution based on sampling variance terms \({\sigma }_{j}^{2} \sim {{{\rm{Gamma}}}}(\theta ,\beta /\theta )\), data from \({Z}_{ij} \sim N(0,{\sigma }_{j}^{2})\) and eigenvalues from \(\frac{1}{n}Z{Z}^{\,t}\). Finally, K is estimated as the (1 − α) quantile of the simulated distribution of leading eigenvalues. We apply this process for every cell type separately.
Determining Spectra’s input parameters
We summarize all userdefined inputs to the Spectra algorithm.
X: Expression matrix with n cells and p genes (required).
λ: Regularization strength of prior graph (required).
Gene set dictionary: dictionary with cell types as keys and gene sets as values (optional).
Celltype labels: list of cell types corresponding to expression matrix (optional).
δ: parameter that bounds minimum gene scale factor (optional).
w: graph edge weights (optional).
κ: background rate of edges (optional).
ρ: background rate of edge deletion.
The data matrix and regularization strength λ must be provided by the user, while prior information can be provided in the form of a dictionary of celltypespecific and global gene sets (note that Spectra can also be run by providing graph adjacency matrices directly). Optionally, celltype labels that align with keys of the gene set dictionary can be provided. The lower bound for gene scale factors, δ, controls the extent that gene expression is normalized and is set to a default value of 0.001. This translates to a maximum ratio of gene scale factors of ≈1,000. By default, the graph edge weights are set to be inversely proportional to the total number of edges induced by the gene set leading to a given edge and accumulate additively for genes in multiple sets. The background rate of noise edges, κ, and the rate at which edges are randomly removed from the graph, ρ, can be provided as fixed parameters that provide users with an extra degree of control over the extent that the graph is modified. If they are set to ‘None’ (default), they are estimated during the training process in the same manner that other model parameters are estimated.
For typically sized scRNAseq datasets, as a rule of thumb, we recommend λ = 0.01 for studies in which factors should closely resemble the input gene sets and λ = 0.1 for studies where the factors should be allowed to deviate substantially from the gene sets. Values of δ ranging from 0.0001 to 0.01 yield similar results, with δ > 0.01 providing solutions with typical highly expressed genes observed from NMF.
Validation metrics
Marker list coherence metrics
To evaluate the quality of factors computed from data, we follow previous work^{89,90} and use coherence, cooccurrence of factor genes in heldout data, to evaluate the quality of the inferred factors. For a given factor, we consider the 50 top marker genes with the highest gene scores for that factor. Between every pair of genes in the top 50 markers, we compute the pointwise mutual information as
$${{{\rm{PMI}}}}({g}_{i},{g}_{j})=\log \frac{p({g}_{i},{g}_{j})}{p({g}_{i})p({g}_{j})}$$
where probabilities denote the empirical occurrences in the heldout data. Coherence is defined as the average of this quantity across the marker gene list. This metric is used in Fig. 2f. To assess the coherence of Spectra and other methods, we allocated 9,787 cells as a holdout set to compute the coherence scores at evaluation time. The remaining 88,076 cells were used to fit the model. For each experiment, we subsampled the 88,076 cells in the training set to a size of 10,000 without replacement (repeating this process five times to recapitulate the underlying data distribution). This number was chosen to be sufficiently large subject to the constraint that each of the methods under evaluation could run in a reasonable amount of time (<2 d). For each subsampled dataset, we computed the coherence score described above with the top 50 markers, where marker lists are determined via the method suggested by the individual papers. For scHPF, we used the gene_scores function from the scHPF package to get the top 50 markers^{4}. For Slalom, we multiplied the estimated parameter matrices, that is, the continuous posterior mean \({\mathbb{E}}W\) and Bernoulli posterior mean \({\mathbb{E}}Z\), as in Buettner et al.^{6}. To evaluate NMF, we derive marker lists based on the absolute values of the estimated factor matrix as is standard practice^{91}.
Reconstruction of heldout genes
To quantify the ability of methods to impute missing genes from gene sets, we ran Spectra and Slalom (scRNAseq data preprocessing and analysis) on the full Bassez dataset but with randomly truncated gene sets. Due to Slalom’s computational demands and size of the dataset, we choose a small set of 24 gene sets to evaluate for both methods, which are chosen a priori and held fixed throughout the experiment. We hold out 40% of genes (selected randomly) from the original set and measure the fraction of these genes recovered in the top 200 genes according to Slalom and Spectra’s gene scores. To match factors to gene sets, for both methods, we find the gene set (in our full database) with the highest Szymkiewicz–Simpson overlap coefficient (overlap coefficient) to the given factor and label the factor as corresponding to that gene set. The overlap coefficient for the sets X and Y is defined as the size of the intersection divided by the size of the smaller set:
$${{{\rm{overlap}}}}(X,Y)=\frac{ X\cap Y }{\min ( X , Y )}$$
If two factors both have the highest overlap coefficient to the same gene set, we take the one with the higher overlap coefficient. The accuracy reported is the fraction of heldout markers recovered in the top 200 highest gene scores (Fig. 2e and Extended Data Fig. 5a–c).
Simulation experiments
Robustness to correlated factors
Matrix factorization methods rely on reconstructionbased objective functions that implicitly encourage the estimation of a diverse set of gene programs. As a result, when gene programs are expressed in similar contexts (for example, CD8^{+} T cell activation, exhaustion and tumor reactivity or TNF and type 2 IFN responses), matrix factorization methods often return a single program representing the combined set of correlated programs. Further, as the correlation between gene programs increases, the effective sample size of the estimation problem decreases, as most cells do not provide information to separate the gene programs. To illustrate that Spectra can incorporate prior information to maintain robust estimation in the presence of highly correlated gene programs, we simulated gene expression data from a generic factor analysis model where the cell loadings corresponding to factors 1 and 2 are simulated from a joint log normal distribution with nonzero correlation terms ranging from 0.25 to 0.99 (Extended Data Fig. 5d). Factors themselves were simulated from a halfCauchy distribution to achieve realistic levels of sparsity and extreme values. Conditional on simulated factors and loadings, gene expression was simulated from a Poisson distribution with the mean given by the matrix product of loadings and factors. A noisy prior knowledge graph was simulated by sampling the adjacency matrix from a Bernoulli distribution with parameters given by inner products between factors (as in the Spectra model) and used as input to Spectra. For each value of the correlation, we simulated ten datasets and ran Spectra (λ = 0.1), NMF, scHPF and Slalom (20 top genes per factor as input). We quantified estimation accuracy by the mean Pearson correlation of ground truth factors with estimated factors across genes, both for the two correlated factors and for a third factor uncorrelated with the first two. While the unbiased methods, NMF and scHPF, correctly recover the factors when factors are weakly correlated, estimation accuracy deteriorates as the correlation increases (although the inaccurate estimation of the correlated factors does not hurt performance on the uncorrelated factors). Spectra’s use of prior knowledge allows it to separate highly correlated factors.
In more detail, in our comparative simulation, study factors are correlated in the sense that they tend to be expressed by the same cells (Extended Data Fig. 5d). We simulate ground truth factor matrices with p features and K factors with each entry independently distributed according to a halfCauchy distribution (chosen to obtain realistically sparse factor matrices). To obtain correlated factors, the factor loadings, α, are independently drawn from a correlated log normal distribution:
$$\left[\begin{array}{c}{\alpha }_{1}\\ {\alpha }_{2}\\ \vdots \\ {\alpha }_{K}\end{array}\right] \sim {{{\rm{LogNormal}}}}\left(\left[\begin{array}{c}0\\ 0\\ \vdots \\ 0\end{array}\right],\left[\begin{array}{ccccc}1&\rho &0&\ldots &0\\ \rho &1&0&\ldots &0\\ 0&0&1&\ldots &0\\ \vdots &\vdots &\vdots &\ddots &\vdots \\ 0&0&0&\ldots &1\end{array}\right]\right)$$
If we denote the N × K loading matrix by α and the K × p factor matrix by θ, the count data are simulated by X ~ Pois(αθ + ϵ), where ϵ is a random noise term with variance σ^{2}. An adjacency matrix is sampled coordinatewise \(A \sim {{{\rm{Bern}}}}({\tilde{\theta }}^{\top }\tilde{\theta })\), where \({\tilde{\theta }}_{j}=\frac{{\theta }_{j}}{\mathop{\sum }\nolimits_{k = 1}^{K}{\theta }_{jk}}\). We run 10 independent trials for 7 different levels of correlation ρ = {0.25, 0.5, 0.7, 0.85, 0.9, 0.95, 0.99}, totaling 70 simulated datasets. Because NMF, scHPF, Slalom and Spectra each estimate a factor matrix, we compared the estimated (normalized) factor matrices to \(\tilde{\theta }\) via Pearson correlation (y axis of Extended Data Fig. 5d) after resolving the permutation of estimated factors that is closest to ground truth. Resolving the correct permutation for each estimate is done via finding the permutation that maximizes the average Spearman correlation between ground truth factors and estimates. Because we are interested in performance on correlated factors, we report the average correlation between estimation and ground truth for the first two correlated factors across the ten independent trials.
In our experiment, N = 20, p = 500, K = 3 and σ^{2} = 4 (a setting with low signaltonoise ratio). Spectra is provided with the simulated matrix A, whereas Slalom is provided with feature sets containing the correct top 20 features of each factor. Spectra uses a λ value of 0.1 and δ value of 0. All methods are run with the correctly specified number of factors and with default parameters.
Biasedness of gene set averaging for overlapping gene sets
When gene sets corresponding to gene programs overlap, simple gene set averaging approaches produce falsepositive program activity calls. To illustrate this phenomenon, we simulated gene expression data driven by sets of overlapping gene programs with varying degrees of gene set overlap and showed that gene set averages are increasingly biased proxies for program activity as the degree of gene set overlap increases (Extended Data Fig. 5e). Specifically, we simulated factor matrices with known sparsity patterns determined by a set of gene sets (each nonzero entry independently Exponential(16)). Each gene set is designed to have an overlap coefficient ρ with at least one other gene set, with ρ ranging from 0 to 0.75. Loadings are generated by first sampling each coordinate LogNormal(0, 1) independently and then zeroing out components that are <1 to induce sparsity. Simulated expression data are from a Poisson distribution, with the mean given by the matrix product of simulated loadings and factors.
For each possible value of the overlap coefficient ρ in {0.0, 0.3, 0.5, 0.75}, we create three simulated datasets and ran both Spectra and score_genes with the ground truth gene sets. Accuracy is measured by the Pearson correlation of estimated cell scores (or score_genes estimates) and the ground truth factor loadings from the data generation process (y axis of Extended Data Fig. 5e). In this experiment, the gene set size is fixed to 20, the number of gene sets is 10, the number of features (genes) is 500, and the number of observations is 1,000.
Recovery of active gene sets
We compared Spectra to Slalom, another factor analysis method that uses prior information in the form of gene sets, in a simulation experiment where we measured the ability of each method to recover the gene sets involved in the true datagenerating process. Here, we followed the simulation settings of the Buettner et al. manuscript^{6} closely. First, background factors are generated from an exponential distribution, as before. To simulate sparsity, entries smaller than 2 are zeroed out. Next, loadings are generated LogNormal independently, and entries <1 are zeroed out. We then generate both active and control gene setbased factors as in Buettner et al.^{6} and described in Biasedness of gene set averaging for overlapping gene sets, where gene sets overlap with an overlap coefficient of 0.3. Loadings corresponding to active gene setbased factors are also drawn from a standard LogNormal and zeroed out if less than 1. Next genes are randomly added to gene sets and removed from gene sets to achieve a falsepositive rate of 0.2 and falsenegative rate of 0.2. As a measure of success, we use the area under the receiver operating characteristic curve (AUC) based on Slalom’s relevance score and Spectra’s average cell score for a given factor. Spectra was robust to increasing the number of gene sets, whereas Slalom suffered a drop in AUC as the number of active gene sets increased (Extended Data Fig. 5f).
In our experiments, the numbers of active pathways vary on the x axis of Extended Data Fig. 5f, the number of control pathways is 5, the gene set size is 20, the number of genes is 300, the number of cells is 300, the number of unbiased factors is 5, and the gene set overlap coefficient is 0.3.
Run time and memory benchmarks
All memory and run time benchmarks were performed on simulated data to allow for precise control over the number of cells, genes, factors and cell types. Data were simulated as described in Simulation experiments, closely following the settings described in Buettner et al.^{6}. To benchmark run time and memory with respect to the number of cells, we scaled the number of cells in our simulated data from {300, 1,000, 5,000, 10,000, 25,000, 75,000, 100,000, 200,000} cells. The number of genes was held constant at 2,000 genes. To benchmark the methods on the number of gene sets, we scaled the n_active_pathways parameter in our simulation from {10, 20, 50, 100, 200} gene sets. Next, we note that to keep our gene set size of 20 constant with an overlap coefficient ρ = 0.3 between gene sets, we increase the number of genes to 3,000 genes. We used 25,000 cells for all experiments. To benchmark Spectra GPU and central processing unit (CPU) on the number of cell types, we scaled the number of cell types from {2, 4, 8, 16, 32, 64} cell types. All experiments were run using 25,000 cells and 3,000 genes. We note that due to variation in the number of epochs until convergence, we forced both Spectra CPU and GPU to run to the default 10,000 epochs to study a pessimistic but lowvariance run time quantity, although convergence was generally achieved between 2,000 and 7,000 epochs. All CPU methods were run on five CPU cores (Intel Xeon Gold 6230 at 2.10 GHz), while all GPU methods were run on an NVIDIA A100 Tensor Core GPU.
A human immunology knowledge base
Databases, such as the Gene Ontology Resource^{92}, the Molecular Signatures Database^{93}, the Kyoto Encyclopedia of Genes and Genomes^{94} and the Reactome database^{95}, contain thousands of gene sets and their relationships, but they are noisy and often do not distinguish whether or not genes are transcriptionally regulated. For example, many genes with signaling pathway annotations are regulated at the posttranslational level by phosphorylation or subcellular localization. Expression signatures in these databases are often derived from bulk sequencing data, which may not represent responses in individual cells. Moreover, the databases do not have a framework for distinguishing which gene sets are celltype specific. To address these issues, we created an immunology knowledge base with the following criteria:

1.
Genes within a gene set define a cellular process at the transcript level.

2.
Gene sets represent cellular processes at the singlecell level.

3.
Gene sets can be specific to a defined cell type.
Our knowledge base includes 231 gene sets representing ‘cellular processes’ (n = 181) to be queried by Spectra and ‘cellular identities’ (n = 50) to obtain replicable highquality celltype annotations. To generate the resource, we developed 97 gene sets, including 14 from perturbation experiments, and added these to 134 gene sets from publications and external databases, some of which we modified. Of the cellular processes, 150 apply to most cell types in the data (for example, leukocytes) and are designated as global, and 31 apply to individual cell types.
Like Spectra, the knowledge base models gene sets as a graph wherein every gene set is a node connected to all individual gene nodes within the set as well as to a celltype node. Celltype nodes (currently 50) are connected to ‘cellular identity’ gene sets, one for each cell type, which contain marker genes for their connected cell type. Metadata, such as the scientific publication the gene set was derived from, gene set version and original gene set authors, are stored as node properties. Celltype nodes are organized in a hierarchy, reflecting that cell types are frequently subsets of other cell types. This hierarchy starts with a celltype node labeled ‘allcells’ to which gene sets for ‘cellular processes’ occurring in all cell types are connected. Thus, the knowledge base can be queried for cellular processes that can be found in all cell types (for example, glycolysis) or a cellular process specific to a cell type, such as TCR signaling, which is only present in T cells. It also allows retrieving ‘cellular identity’ marker gene sets, which define the queried cell type.
Within this resource, 150 cellular processes apply to all leukocytes, and 31 apply to individual cell types. Of all 231 gene sets, 97 were gene sets newly curated from the literature, 14 used data from perturbation experiments, 11 were adopted from the literature with modifications, and 123 were taken from the literature and external databases without changes. Gene sets correspond to diverse cellular identities (n = 50) and cellular processes, such as homeostasis (n = 9), stress response (n = 3), cell death and autophagy (n = 18), proliferation (n = 6), signaling (n = 12), metabolism (n = 90), immune function (n = 22), immune cell responses to external stimuli (n = 18) and hemostasis/coagulation (n = 3; Fig. 1b). We designed the gene sets for cellular processes to have comparable size (median n = 20 genes per gene sets) and relatively little overlap (median pairwise overlap coefficient of 40%) to enable dissection of a large number of cellular processes and to avoid gene set sizedriven effects.
To specify Spectra input, the user first defines cell types at a granularity of interest in their singlecell expression data and retrieves the celltypespecific cellular process gene sets and gene sets applying to all cell types from the knowledge base. Next, the user can select cellular process gene sets pertaining to all cell types in the dataset, which should be set as ‘global’ in the Spectra model.
The user indicates which cellular processes can be considered global based on which cell types are present in the dataset under study. For example, if a dataset only contains T cells, all cellular processes pertaining to leukocytes and T cells should be considered global. If cellular processes apply to more than one but not all cell types in the data, there are two options:

1.
The gene set can be multiplied, and one copy can be assigned to each cell type. This will ensure that the cell scores of the resulting factors will be specific to those cell types but may result in separate factors for the same cellular process in each cell type.

2.
The gene set can be set as global, which will generally result in one factor. However, cell scores for this factor may be detected in other cell types also.
Users can take advantage of the hierarchical organization of cell types in the knowledge base by adding the children or parent classes of selected cell types. For example, cellular processes for both ‘CD4 T cells’ and its parent ‘T cells’ can be retrieved and assigned to ‘CD4 T cells’, making it possible to find broader processes (for example, TCR signaling) that are specific to CD4^{+} T cells. Alternatively, cellular processes for CD4^{+} T cells and for CD4^{+} subtypes ‘T_{H}1’, ‘T_{H}2’ and ‘T_{H}17’ can be retrieved and assigned to ‘CD4 T cells’, thereby pooling rare cell types that may not contain enough training data for the Spectra model to converge to a generalizable solution. Moreover, hierarchical classification is advantageous when cellular processes are ambiguously or incorrectly assigned. For example, CD4^{+} T cell subtypes are often presented as distinct lineages with distinct cellular processes. However, mixed CD4^{+} T cell subtypes have been reported, such as cells possessing both T_{H}1 and T_{H}2 polarization cellular processes^{96}, suggesting that CD4^{+} T cells can be described using combinations of purportedly subtypespecific processes.
The newest version of our human immunology knowledge base is available on GitHub as the Cytopus Python package^{97}. This includes 20 additional cellular processes mostly for nonimmune cells, such as fibroblasts (n = 7), which we curated from a list of perturbation experiments on bulksorted immune cells^{98}. Users can load our default or a custom knowledge base using the KnowledgeBase class build on a NetworkX object. Cytopus includes methods to retrieve gene sets and corresponding cell types and can visualize them as a graph and convert them into a Spectracompatible dictionary. The ‘celltypes’ method retrieves a list of available cell types, ‘processes’ generates a dictionary of all ‘cellular processes’ gene sets, and ‘identities’ generates a dictionary of all ‘cellular identity’ gene sets. Gene set metadata (for example, author, topic, date of generation and version) can be accessed as node properties of the gene sets. The get_celltype_processes method retrieves celltypespecific ‘cellular processes’ based on a userprovided list of cell types at the desired granularity (generally all cell types contained in the data).
Full Cytopus documentation can be found at https://github.com/walletmaker/cytopus. The tumorinfiltrating leukocyte gene sets used in the paper are included in the Spectra package at https://github.com/dpeerlab/spectra and are in Supplementary Table 1.
scRNAseq data preprocessing and analysis
An overview of all datasets used in this study and relevant patient metadata can be found in Supplementary Table 5.
PBMC dataset
The original dataset^{8} consists of scRNAseq data from PBMCs from four healthy donors after incubation (1–6 h) in IFNγ, LPS or the protein kinase C and TCR stimulation mimetic PMA. For three donors and at the 6h time point, they added Golgi inhibitors (GIs), which prevent exocytosis of cytokines from PBMCs after perturbation and secondary paracrine signaling events, and compared them to control cells treated with GIs alone. Thereby, the gene expression changes in the GItreated perturbations compared to the GItreated control cells can be attributed to direct signaling of the applied perturbations alone. We used the GItreated conditions as a ground truth to benchmark factorization methods. Information on ethics oversight was not available in the original manuscript^{8}.
We obtained preprocessed count matrices (23,754 cells, four individuals) from the Gene Expression Omnibus (GEO; accession number GSE178431) from the Kartha et al.^{8} PBMC perturbation scRNAseq dataset. We normalized gene expression counts to median library size and log 1p transformed the data. We selected the 10,000 most highly variable genes using scanpy’s pp.highly_variable_genes function with the seurat_v3 method on raw counts. To avoid discarding genes relevant to cell typing, we added a manually curated list of cell typing markers to highly variable genes (Supplementary Table 6). We then calculated the neighborhood graph on the first 50 principal components using these highly variable genes, which explained 27.87% of the total variance, and calculated a UMAP embedding on this neighborhood graph.
To get coarse immune cell types, we clustered the data in principal component space using the scanpy implementation of phenograph. We chose the k = 40 parameter for PhenoGraph because of its ability to delineate immune cell from nonimmune cell populations while showing stable clustering in a window of adjacent k parameters (pairwise rand indices > 0.7). We then annotated the clusters into coarse immune cell types (monocytes, T cells/innate lymphoid cells and B cells/plasma blasts) by assessing the mean marker gene expression per cluster (Supplementary Table 2).
Running Spectra
To run Spectra, we retrieved 188 input gene sets pertaining to PBMC data from the newest version of our Cytopus knowledge base (https://doi.org/10.5281/zenodo.7306238). This included gene sets for signaling/response programs to the ground truth perturbations (IFNγ response, LPS signaling in monocytes/macrophages and TCR activation in T cells). We fitted Spectra on the union of the 10,000 most highly variable genes and the input gene sets for a total of 11,840 genes using the following parameters: λ of 0.01, δ of 0.001 and ρ of 0.001. We obtained a total of 196 factors and found 1 factor for each of the input gene sets related to each perturbation according to our criteria below (overlap coefficient of the top 50 marker genes with input gene set > 0.2). We calculated the average cell score per cell type and sample and compared the perturbed and unperturbed conditions. Spectra was run on a compute cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 512 GB RAM.
Running Slalom
For Slalom, expression data were preprocessed the same as Spectra. Because Slalom’s run time scales linearly with the number of gene sets (Fig. 2g), we had to subset the number of gene sets used to run Slalom on our dataset of interest. We provided Slalom only with the three gene sets corresponding to the investigated perturbations (LPS, IFNγ and PMA) plus ten additional factors. These gene sets were used to determine the I parameter of the Slalom initFA() function. The following additional input parameters were used: nHidden = 0, nHiddenSparse = 0, do_preTrain = False, minGenes = 1 and pruneGenes = False, with all other options set to default values.
Running expiMap
For expiMap, expression data were preprocessed the same as Spectra. We used expiMap’s default parameters as shown in the tutorials of the scArches GitHub repository (https://github.com/theislab/scarches; version as of 20 March 2023). We provided expiMap with the same gene sets as Spectra. When using the default parameters, expiMap cannot learn new genes involved in gene programs related to the input gene sets nor can it learn new factors in the reference data, but only in the mapped query.
Immunooncology datasets
To study Spectra in an immunooncology context, we used two published scRNAseq datasets of tumorinfiltrating leukocytes from female individuals with breast cancer treated with immunotherapy. We chose this immunooncology context for the following reasons:

1.
The abundant prior knowledge of cellular processes and wellcharacterized cell types in tumorinfiltrating leukocytes enabled us to leverage the full power of gene set and celltype priors.

2.
The availability of before and ontreatment samples to test the sensitivity of factor cell scores to environmental perturbation with antiPD1/PDL1 therapy.

3.
The clinical need for detecting cellular processes affected by antiPD1 in humans to improve current immunotherapy strategies.

4.
The availability of two studies in similar biological settings to enable validating findings in an independent dataset.
Bassez dataset
Bassez et al.^{10} was a prospective windowofopportunity study reporting scRNAseq as an exploratory endpoint. The authors analyzed scRNAseq data from wholetumor singlecell suspensions from 42 individuals with operable breast cancer before and after antiPD1 immunotherapy (pembrolizumab, NCT03197389). Individuals received neoadjuvant chemotherapy as per standard of care (chemotherapy n = 11, no chemotherapy n = 31), followed by a single dose of antiPD1. Breast resections were performed 7–14 d after antiPD1 treatment. Tissue from preantiPD1 biopsies (7–15 d before surgery) and from surgical resections was processed for scRNAseq. The study was approved by the institutional review board of University Hospital Leuven (S60100). As a surrogate for response to therapy, the authors of the original study quantified the clonal expansion of T cells under therapy on the patient level using paired singlecell TCR sequencing; we used these annotations to find responseassociated cellular processes in the data. The authors categorized participants as either exhibiting (we refer to as responders) or lacking (we refer to as nonresponders) T cell clonal expansion under therapy. To classify participants, the authors quantified the number of expanding T cell clones (T cells with identical TCR sequences) per participant, labeling participants with >30 expanding clones as responders and those with ≤30 expanding clones as nonresponders. A T cell clone had to fulfill the following two criteria to be labeled as expanded:

1.
Detected at least twice in the participant’s ontreatment sample.

2.
More frequent in the participant’s ontreatment samples than in the pretreatment sample either by the absolute cell number in that clone or by the cell number in that clone relative to the number of cells with a TCR detected.
Zhang dataset
Zhang et al.^{23} was a retrospective clinical study analyzing tumorinfiltrating leukocyte scRNAseq data of pre, on and posttherapy samples from 15 female individuals with advanced breast cancer receiving either antiPDL1 (atezolizumab) combined with chemotherapy (paclitaxel, n = 8) or chemotherapy alone (paclitaxel, n = 7). The study was approved by the Chinese Academy of Medical Sciences ethics review board (18216/1794). Notably, participants received corticosteroid premedication for paclitaxel. The authors assessed participant response to immunotherapy using radiological response according to RECIST v1.1 criteria^{99}. RECIST v1.1 criteria are the standard criteria used for drug approvalrelevant clinical trials and standard clinical management of individuals with metastatic solid tumors. The RECIST criteria classify individuals into responders (combining partial and complete response labels) and nonresponders (combining progressive and stable disease labels) based on the change in the sum of tumor lesion diameters under therapy. We used this classification to identify responseassociated cellular processes in the Zhang dataset. Relevant clinical variables distinguishing the Bassez and Zhang data are indicated below.
Bassez:
Operable disease: yes
n = 31 treated with antiPD1
n = 11 treated with chemotherapy and antiPD1
Received corticosteroids: no
Zhang:
Operable disease: no
n = 8 treated with chemotherapy and antiPDL1
n = 7 treated with chemotherapy
Received corticosteroids: yes
Processing strategy
To minimize systematic differences in celltype annotations and normalization of gene expression counts, we performed the same preprocessing for the Bassez^{10} and Zhang^{23} datasets. After basic filtering, removing residual lowquality cells and doublets and subsetting to leukocytes with scanpy^{2}, we normalized the data using scran^{77}. We hierarchically annotated cell types in the data by first labeling major immune subsets (T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells) by clustering on the most dominant principal components only. We then partitioned the data into these major immune subsets, renormalized the data within every subset using scran and clustered on more principal components to annotate granular cell types. We then combined the annotated data from major immune subsets for joint analysis using Spectra. We have outlined the details of the analysis strategy below.
Retrieving singlecell gene expression data
Count matrices of the Bassez data^{10} were kindly provided by the authors and are also available here (226,635 cells). Raw read counts are available in the European Genome–Phenome Archive (EGA; accession numbers EGAS00001004809 and EGAD00001006608). The count matrices for the Zhang data^{23} were downloaded from GEO using the accession number GSE169246 (489,490 cells).
Removing lowquality cells
To prepare the data for clustering, we removed cells with less than 200 genes per cell and genes observed in less than 20 cells as well as mitochondrial and ribosomal genes. This filtering procedure removed 2,971 and 203 genes, resulting in a total of 22,639 and 20,898 genes in the Bassez^{10} and Zhang^{23} datasets, respectively. We defined doublets in the data by running DoubletDetection^{100} for each sample individually using standard parameters (clustering algorithm: PhenoGraph; P value threshold: 1 × 10^{–16}; voter threshold: 0.5). DoubletDetection detected 3,270 (1.4%) and 12,760 (2.6%) doublets as well as 27 (0.01%) and 8 (0.001%) ambiguous doublets in the Bassez^{10} and Zhang^{23} datasets, respectively, which we removed from the data.
Retrieving tumorinfiltrating leukocytes for downstream annotation
While the Zhang^{23} data contained sorted tumorinfiltrating leukocytes, the Bassez^{10} data contained unsorted wholetumor singlecell suspensions. To retrieve immune cells from the Bassez^{10} data for downstream annotation, we first performed standard median library size normalization and log 1p transformed the data so that the normalized expression of every gene j in cell i is \({x}_{ij}^{{\prime} }\), and the median of the sum of gene expression counts across all cells is \(med\left(\mathop{\sum }\nolimits_{j = 1}^{n}{x}_{j}\right)\):
$${x}_{ij}^{{\prime} }=\ln \left[med\left(\mathop{\sum }\limits_{\,j=1}^{n}{x}_{j}\right)* \frac{{x}_{ij}}{\mathop{\sum }\nolimits_{j = 1}^{n}{x}_{ij}}+1\right]$$
We then clustered the data using PhenoGraph^{101} on the most dominant principal components, which we selected using the knee point of the principal component versus explained variance curve (calculated using the kneed package v.0.7.0 (ref. ^{102})) or the lowest number of principal components explaining ≥20% of the total variance, whichever was higher. Using this procedure, we clustered the data with PhenoGraph on the first 26 principal components, explaining 20.1% of the total variance. We chose the k = 80 parameter for PhenoGraph because of its ability to delineate immune cell from nonimmune cell populations while showing stable clustering in a window of adjacent k parameters (pairwise rand indices > 0.7). We then subsetted leukocytes for further analysis by their marker gene expression (myeloid cells, T cells, innate lymphoid cells, B cells and plasma cells) per PhenoGraph cluster (Supplementary Table 2).
Annotating tumorinfiltrating leukocytes
We renormalized leukocytes in the Bassez^{10} and Zhang^{23} data using scran because median library size normalization can generate artificial differential gene expression between cells of different library size, such as leukocytes^{77}. After testing all genes and a range between 5,000 and 15,000 highly variable genes, we selected the top 15,000 highly variable genes for the Bassez^{10} data and all genes for the Zhang^{23} data, which led to the best separation of major immune cell subtypes using scanpy’s pp.highly_variable_genes function with the seurate_v3 method on raw counts. To avoid discarding genes relevant to cell typing, we added a manually curated list of 458 celltyping markers to highly variable genes (Supplementary Table 6). We then repeated the clustering procedure outlined above (Bassez^{10}: 24 principal components explaining 20.1% of total variance; Zhang^{23}: 52 principal components explaining 20% of total variance; k = 50) and annotated major immune cell subsets (T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells) by assessing their mean marker gene expression per cluster (Supplementary Table 2). To obtain more granular annotations, we partitioned the data into major immune subtypes (T cells, innate lymphoid cells, B cells/plasma cells and myeloid cells), renormalized each subtype using scran, recalculated highly variable genes and principal components and clustered as described above. The following processing parameters for each subtype were used: Bassez data, numbers of highly variable genes and marker genes: 7,500, 7,500 and 15,000 for T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; numbers of principal components: 24, 10 and 16 for T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; variance explained by principal components: 20.3%, 20.1% and 20.1% for T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; k parameter for clustering: 30, 40 and 20 for T cells/innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; Zhang data, numbers of highly variable genes and marker genes: 19,379, 19,379, 10,000 and 18,888 for T cells/innate lymphoid cells, innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; numbers of principal components: 100, 100, 17 and 23 for T cells/innate lymphoid cells, innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; variance explained by principal components: 17.9%, 17.7%, 20.4% and 20.1% for T cells/innate lymphoid cells, innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively; k parameter for clustering: 40, 40, 60 and 20 for T cells/innate lymphoid cells, innate lymphoid cells, B cells/plasma cells and myeloid cells, respectively.
We then annotated granular immune cell types by assessing the mean marker gene expression per cluster (Supplementary Table 2). In the Bassez^{10} data, we detected clusters with low library size and lower complexity of gene–gene correlation patterns at this step (5,509 cells), which we removed from the data. Finally, we combined the annotated major immune subtypes for downstream joint analysis.
Running Spectra
After the filtering and preprocessing steps above, the Bassez data had 97,863 cells^{10}, and the Zhang data had 150,985 cells^{23}. To run Spectra, we restrict the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method, selecting the 3,000 most highly variable genes. We removed several genes that are highly abundant and may originate from ambient RNA in many cell types, thus adding noise to the analysis. This included mitochondrial, ribosomal, immunoglobulin (genes starting with IGHM, IGLC, IGHG, IGHA, IGHV, IGLV and IGKV), TCR variable domains (genes starting with TRBV, TRAV, TRGV and TRDV) and hemoglobin genes (genes starting with HB). The total numbers of genes used (the union of genes included in a gene set and highly variable genes) were 6,397 for the Bassez dataset^{10} and 6,398 for the Zhang dataset^{23}.
One hundred and eightyone gene sets from our knowledge base were then converted into weighted adjacency matrices. One of Spectra’s strongest features is its ability to meaningfully modify the input gene–gene knowledge graph (gene sets) in a datadriven matter. With the influence parameter λ set to 0.01, the median overlap coefficient across all factors in the Bassez^{10} data was 88%, with 25% of factors relevantly deviating from the gene sets (overlap < 70%) and 7% of factors bearing little resemblance to the input gene sets (overlap < 20%). With the influence parameter set to 0.1, the median overlap coefficient across all factors was 82%, with 42% of factors relevantly deviating from the gene sets (overlap < 70%) and 12% of factors with an overlap of less than 20%. In terms of graph edit distance to the input graph (defined as the mean absolute difference between input and output graphs), at λ set to 0.1, we had 0.011, and at λ set to 0.01, we had 0.0095 with diminishing returns in graph edit distance for lower λ (0.0095 again for λ = 1 × 10^{–4} and 0.0094 for λ = 1 × 10^{–5}). For the analyses described below, we used an influence parameter λ between 0.1 and 0.01 depending on whether we wanted more (0.01) or less (0.1) adherence to the input gene sets. Because we obtained very similar results with these parameters in two independent datasets, it is likely that this also constitutes a good default for other datasets. Spectra was run on a compute cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 512 GB RAM.
Running Slalom
Because Slalom’s run time scales linearly with the number of gene sets (Fig. 2g), we had to subset the number of gene sets used to run Slalom on our datasets of interest (n = 20 gene sets, run time 63.49 CPU hours, 40 GB of memory on the Bassez^{10} dataset). Expression data were preprocessed identically to Spectra. To compare results of Spectra and Slalom, we chose a subset of 20 gene sets of scientific relevance to the immune microenvironment under immune checkpoint blockade for the Bassez^{10} and Zhang^{23} datasets: CD8^{+} T cell tumor reactivity, type II IFN response, myeloid angiogenic effectors, posttranslational modification, MHC class I presentation, G2/M transition, oxidative phosphorylation, type I IFN response, macrophage IL4/IL13 response, glycolysis, DNA synthesis, G1/S transition, lysine metabolism, MHC class II presentation, hypoxia response, pentose phosphate pathway, CD8 terminal exhaustion, PD1 signaling, TCR activation and cytotoxicity effectors. These gene sets were used to determine the I parameter of the Slalom initFA() function. The following additional input parameters were used: nHidden = 0, nHiddenSparse = 0, do_preTrain = False, minGenes = 1, pruneGenes = False, with all other options set to default values.
Running scHPF
scHPF was run with the following commands using the defaults in the class constructor of the scHPF package. from scHPF import *: model = scHPF(nfactors = K); model.fit(X).
Running expiMap
When using the default parameters, expiMap^{9} cannot learn new genes involved in gene programs related to the input gene sets nor can it learn new factors in the reference data but only in the mapped query. We note that most demonstrations in the expiMap manuscript^{9} are based on these default parameters and do not involve adaptation to the data.
For the immunology datasets, where the specific task evaluated involved learning new genes and new factors, we modified the default parameters according to Lotfollahi et al.^{9}. We refer to this mode as soft mode. These parameters include setting soft_mask = True in the expiMap model scarches.models.EXPIMAP and setting an L1 penalty using the alpha_l1 parameter of the model.train method, which enables the latent nodes to use genes absent from the input gene sets with an L1 regularization. The alpha_l1 parameter was increased in steps of 0.1 starting from 0.5 until the share of inactive genes exceeded 0.95 (this information can be visualized by the print_stats parameter in the .train method). Using this strategy, we selected an alpha_l1 of 0.8. For the Extended Data Fig. 5b,c reconstruction experiment where the task involved recovering heldout genes, we used an alpha_l1 parameter of 0.4 because expiMap showed the strongest performance with this parameter in a similar experiment performed by the expiMap authors (Extended Data Fig. 7 in Lotfollahi et al.^{9}). However, for most of our gene sets and in contrast to Spectra, expiMap did not recover a meaningful proportion of our input gene sets (Extended Data Fig. 5b,c). This is despite the fact that the Spectra model was complicated by including 16 new factors, while we did not add any new factors for expiMap^{9}.
To learn new factors, one has to provide additional parameters, which is what we did for analyzing the Bassez^{10} dataset except for the reconstruction experiment in Extended Data Fig. 5b,c. We added 16 new factors, the same number as for Spectra (n_ext, set to 16). We set the L1 regularization coefficient for these nodes gamma_ext to 0.6, enabled the Hilbert–Schmidt independence criterion regularization (use_hsic = True, hsic_one_vs_all = True) and provided the Hilbert–Schmidt independence criterion regularization coefficient beta = 3. Because expiMap removed relevant input gene sets, we had to perform two additional modifications of the steps provided in the tutorial for the soft mode. We increased the number of highly variable genes (4,000 instead of 2,000) to retain the input lysine metabolism gene set and decreased the minimum gene set size from 13 to 8 to retain our tumor reactivity gene set. We provided expiMap with the identical 181 gene sets used for Spectra. Because expiMap removes smaller gene sets in preprocessing steps, the final number of gene sets used for the model fit was 142, which resulted in 158 factors, including 16 new factors and 3 inactive factors.
Running NMF
NMF was run using the scikit learn package (sklearn.decomposition.NMF) with default parameters, specifically nmf = NMF(n_components = k) and nmf.fit(X.astype(float)), where k is the number of factors and X is the processed expression matrix.
Running netNMFsc
netNMFsc^{7} was run with default parameters; however, max_iters was set to 100,000 as convergence was never achieved at the default tolerance level (1 × 10^{–2}) at the default 20,000 iterations. Specifically the following operations were used: operator = netNMFsc.netNMFGD(d = k, max_iter = max_iters); operator.N = adj_matrix; operator.X = X.T and W = operator.fit_transform(), where adj_matrix is the global graph provided to Spectra, X is the processed expression matrix, and k is the number of factors.
Assigning factor labels
Factor labels were assigned using the overlap coefficient of the top 50 marker genes (genes with the highest gene scores) with each gene set. We observed a bimodal distribution of overlap coefficients, with one group of factors centered close to 0 and one group of factors centered close to 1 (Extended Data Fig. 2). We therefore chose a threshold of 0.2 to separate highoverlap from lowoverlap factors. For every factor, if the maximum overlap coefficient was >0.2, we assigned the gene set label with the maximal overlap coefficient to that factor, and if the maximum overlap coefficient was ≤0.2, we did not assign a label to that factor.
Aggregating cell scores at the sample level
To aggregate cell scores at the sample level, we calculated either the mean or the mean of the positive cells. The latter was chosen for Spectra and scHPF, which show bimodal cell score distributions with one mode centered around 0. The mean will be skewed toward the more frequent zero mode and may therefore be inappropriate to estimate the central tendency of the distribution. Positive cells were defined as cells with a cell score of >0.001. This threshold was defined empirically to separate the positive and zero mode by inspecting the distributions of multiple factors. If all cells showed a cell score of ≤0.001 for a given gene program, the mean of the positive fraction was set to 0 for that gene program. Because expiMap and Slalom can take negative values, we used the mean value for these methods.
Lung cancer datasets
Caushi dataset
The Caushi et al.^{30} study performed paired scRNAseq and singlecell TCR sequencing of 16 individuals with primary nonsmall cell lung cancer (560,916 cells). Moreover, PBMCs were pulsed with different peptides (specific for viral or tumor neoantigens), and reactive, expanding T cell clones and their TCR sequences were identified using the MANA functional expansion of specific T cells assay^{103}. The authors thereby identified TCR sequences of MANA, Epstein–Barr virus and influenzareactive T cells. They used these TCR sequences to identify tumor (MANA) and virusreactive T cells in the lung cancer tissue. The study was approved by the institutional review boards at Johns Hopkins University and Memorial Sloan Kettering Cancer Center.
Preprocessed data were obtained from the original study’s authors^{30}. The processed data can also be obtained from GEO under accession number GSE173351. Because celltype annotations were not available, we obtained the original study authors’ cluster labels (details on preprocessing and clustering are in Caushi et al.^{30}). We reannotated the original authors’ 15 clusters using marker genes (Supplementary Table 2).
To preprocess the data for Spectra, we normalized raw counts to median library size and applied log 1p transformation. We restricted the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method to the 3,000 most highly variable genes. We retrieved a total of 168 input global gene sets (n = 152) and T cell subtypespecific gene sets (CD4^{+} T cells, CD8^{+} T cells and regulatory T cells, n = 12) from the newest version of our Cytopus knowledge base (https://zenodo.org/record/7306238). We took the union of the highly variable genes and the genes included in these input gene sets for a total of 6,838 genes^{30} used for fitting the Spectra model using the following parameters: λ = 0.1, δ = 0.001 and ρ = 0.001. We ran Spectra on these data and obtained 173 factors, 1 of which matched the CD8^{+} T cell tumor reactivity gene set according to the criteria described above in Assigning factor labels. Spectra was run on a computer cluster with 64 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 512 GB RAM. We plotted and compared the tumor reactivity factor cell scores in 1,151 CD8^{+} T cells with available TCR specificity information grouped by target antigen (Epstein–Barr virus, influenza virus and MANA). We found that this tumor reactivity factor was almost exclusively expressed in MANAspecific T cells.
Salcher atlas
The Salcher nonsmall cell lung cancer atlas combined scRNAseq data of wholetumor singlecell suspensions or tumorinfiltrating leukocytes from 19 independent studies (1,283,972 cells from 318 individuals). They also homogenized celltype annotations and metadata between datasets. The study was approved by the institutional review board at Medical University Innsbruck (AN2140293 342/4.5).
Preprocessed data, including unnormalized gene expression counts, were obtained from Zenodo (https://zenodo.org/record/7227571) for the Salcher et al. lung cancer scRNAseq atlas^{71}. The authors’ celltype annotations were summarized after vetting them for relevant marker expression profiles (Supplementary Table 2).
To run Spectra, we restricted the number of genes using scanpy’s highly_variable_genes function with the cell_ranger method, selecting the 3,000 most highly variable genes with the batch_key option, which calculates highly variable genes in each dataset in the atlas separately and then merges them based on in how many datasets they are captured. We removed several genes that are highly abundant and may originate from ambient RNA in many cell types, thus adding noise to the analysis. This included mitochondrial, ribosomal, immunoglobulin (genes starting with IGHM, IGLC, IGHG, IGHA, IGHV, IGLV and IGKV), TCR variable domains (genes starting with TRBV, TRAV, TRGV and TRDV) and hemoglobin genes (genes starting with HB). We retrieved a total of 198 input gene sets from the newest version of our Cytopus knowledge base (https://zenodo.org/record/7306238). The total number of genes used (the union of genes included in a gene set and highly variable genes) was 7,322. We normalized and log 1p transformed the gene expression counts and ran Spectra with the parameters λ = 0.01, δ = 0.001 and ρ = 0.001 and obtained one factor each for CD8^{+} T cell tumor reactivity and lysine metabolism. We also obtained one factor that shared 20 of the top 50 marker genes with the macrophage invasion factor from the Bassez^{10} dataset (overlap coefficient = 0.4). We then calculated the overlap of these factors with the factors obtained from the Bassez^{10} and Zhang^{23} datasets (Fig. 6c). Spectra was run on a computer cluster with 128 CPU cores (Intel Xeon Gold 6230 CPU at 2.10 GHz) with 1,024 GB of RAM.
To calculate embeddings for plotting UMAPs, we calculated the neighborhood graph (k = 10) on the first 50 principal components using the top 3,000 highly variable genes, which explained 45.92% of the total variance, and calculated a UMAP embedding on this neighborhood graph using the scanpy implementation (Fig. 6a).
We obtained studyspecific factors by their crossstudy entropy. We first removed spuriously expressed factors with ≤100 positive (cell score > 0.001) cells. For each remaining factor, we then calculated the entropy of study label proportions in factorpositive cells. We selected the factors that showed a crossstudy entropy higher than 2.0794, which is the entropy for a hypothetical factor where positive cells are absent in 11 of the 19 studies analyzed and where they are equally distributed among the remaining 8 studies. This resulted in 11 global (Fig. 6b) and 3 celltypespecific factors.
To assess the stability of our lysine factor across studies, we plot its zscored (across cell types) mean expression per cell type (Fig. 6d). We also calculated its mean expression per sample and cell type and compared the expression in plasma cells to the expression in other cell types using Wilcoxon matchedpairs signedrank tests (Fig. 6d). To compare cell scores of the CD8^{+} T cell tumor reactivity and macrophage invasion factors in clinically relevant patient subgroups, we calculated their mean cell scores in positive (cell score > 0.001) CD8^{+} T cells and macrophages, respectively. We then compared these aggregated cell scores in ever smoker versus never smoker and wildtype EGFR versus mutated EGFR tumors. We excluded individuals with other driver mutations from wildtype EGFR tumors because these tumors have different clinical and biological behaviors. Correction for covariates was not performed because many covariates (sex and age) were only available for a small fraction of individuals with available smoking and EGFR status (only for 13 individuals with EGFRmutated tumors were both age and sex information were available).
Classifying new and modified factors
We classified all factors as new, modified or unspecified based on their input gene–gene knowledge graph dependency parameter η. The dependence parameter is a scalar value between 0 and 1 that quantifies its reliance on the input gene set graph. We observed a bimodal distribution of η, with one group of factors centered close to 0 and another group of factors centered close to 1 (Extended Data Fig. 2a). We therefore chose a threshold of 0.25 to separate highdependence from lowdependence parameters. We defined new factors as factors with a graph dependency parameter of η < 0.25 and modified factors as factors with a graph dependency parameter of η ≥ 0.25.
Analyzing breast cancerinfiltrating leukocytes
We compared the ability of Spectra, Slalom and scHPF to retrieve features of tumorinfiltrating immune cells. We ran the three algorithms on all leukocytes in the Bassez dataset, as described above, using a λ parameter of 0.01 for Spectra. We also ran Spectra on the Zhang dataset using a λ parameter of 0.01. For cells with high library size, such as macrophages (Bassez dataset median library size = 8,038), we calculated gene scores for Spectra factors using an offset of 1, which retrieved more stably expressed genes (for example, mean scrannormalized expression of the top 50 marker genes of the macrophage factor 182: 1.15 with offset versus 0.41 with no offset). For remaining analyses with lower library size, such as T cells (Bassez dataset median library size = 3,127) or B cells (Bassez dataset median library size = 3,954), we calculated gene scores for Spectra using an offset of 0, which allowed for more sensitive retrieval of lowly expressed genes, such as transcription factors involved in tumor reactivity and exhaustion (for example, EOMES and TOX) as well as metabolic processes (for example, PIPOX and BBOX1).
Visualizing scRNAseq data
To visualize individual genes in embeddings and account for sparsity in scRNAseq data, we imputed gene expression using scanpy’s implementation of MAGIC^{104} with a t parameter of 3 and the exact solver (Fig. 2d and Extended Data Figs. 3c, 6b, 7b and 8b). For visualizing all leukocytes, we calculated tSNE embeddings on 57 principal components explaining 25.0% (Bassez dataset) or 55 principal components explaining 20% of variance (Zhang dataset) with standard parameters including a learning rate of 1,000 using the scanpy implementation (Figs. 2b and 4d and Extended Data Figs. 3a, 4a, 7e and 8a).
Gene set enrichment analysis
To find the most representative factors for a gene set in Spectra, Slalom and scHPF, we performed gene set enrichment analysis for the exhaustion and tumor reactivity input gene sets in the top 50 marker genes (genes with highest gene scores) of every factor using gseapy’s enrichr function^{105} (Extended Data Figs. 6e and 7c). The enrichr function calculates enrichment using a hypergeometric test to calculate the probability of drawing the observed number of genes belonging to a gene set of interest when sampling from a pool of all genes without replacement (here, the union of the 3,000 most highly variable genes plus the genes contained in the gene sets; see Running Spectra). We calculated the enrichment of gene sets in the top 50 markers genes (genes with highest gene scores) for each factor
$$p(k;N,K,n)=\frac{\left(\begin{array}{c}K\\ k\end{array}\right)\left(\begin{array}{c}NK\\ nk\end{array}\right)}{\left(\begin{array}{c}N\\ n\end{array}\right)}$$
where n is the number of factor marker genes (here, 50), k is the number of genes from the gene set in the top 50 marker genes in the factor, N is the total number of genes contained in the data, and K is the number of genes from the gene set contained in the data. From this, we calculated an FDR using the Benjamini–Hochberg correction. We assumed that the factors with the lowest FDR for enrichment were representative of the respective gene sets if the FDR was <0.05.
CD8+ T cell analysis
We took the subset of CD8^{+} T cells (11 clusters) from the Bassez dataset to explore CD8^{+} T cell tumor reactivity (tumor reactivity) and CD8^{+} T cell exhaustion (exhaustion). The most representative factors for the tumor reactivity and exhaustion gene sets were retrieved using the gene set enrichment procedure described above (Gene set enrichment analysis) for each factor analysis method (Extended Data Fig. 6e). Spectra factors were also compared to expression scores for the tumor reactivity and exhaustion gene sets using scanpy’s score_genes function (Fig. 3b and Extended Data Fig. 6b). To find genes driving score_genes expression, we calculated the covariance of all genes within the tumor reactivity or exhaustion gene sets with the tumor reactivity and exhaustion gene scores (Extended Data Fig. 6b). To visualize forcedirected layouts, we used scanpy’s tl.draw_graph function and the ForceAtlas2 method on a nearest neighbors graph computed on CD8^{+} T cells using scanpy’s tl.neighbors function with n = 10 nearest neighbors (Fig. 3a–c and Extended Data Figs. 3b,c and 6b). Contour plots were created using seaborn’s jointplot kernel density estimation with standard parameters (Fig. 3b,c and Extended Data Fig. 6f). We compared Spectra’s ability to deconvolve processes of tumor reactivity and exhaustion (Fig. 3) with the factorization methods scHPF, Slalom and expiMap. In contrast to Spectra, Slalom only found a factor highly enriched for exhaustion genes in the Bassez dataset, which overlaps with the highest scoring factor for tumor reactivity by 35 genes (Extended Data Fig. 6e). scHPF factors are not enriched for either reactivity or exhaustion gene sets, whereas expiMap identified and successfully deconvolved the two factors (Extended Data Fig. 6e). However, only Spectra was able to distinguish a clonally expanding tumorreactive T cell population that is specific to responders (Extended Data Fig. 6f). Moreover, Slalom, scHPF and expiMap tumor reactivity and Spectra exhaustion factors failed to associate with patientlevel response, defined as a significant difference between expression in responders and nonresponders before or during ICT (Extended Data Fig. 6g).
Metabolism analysis
We assessed the expression pattern of metabolic factors across cell types and found highly specific expression of the lysine metabolism program in plasma cells. In the Bassez dataset, we noticed a small (n = 114 cells, 0.1% of all cells, 3% of all plasma/B cells) group of heterotypic doublets expressing plasma cell and T cell markers (CD3E, CD3D, CD3G, IGHG4 and IGHG1), which was not apparent in previous analyses and not detected by DoubletDetection. We removed these cells from further analyses involving plasma cells (Fig. 4). We also inspected the mean expression per cell type of the MAGIC^{104}imputed (scanpy implementation, t = 3, exact solver) top 50 individual marker genes of the lysine factor genes with highest gene scores (Extended Data Fig. 7b). Slalom^{6} identified a factor in plasma cells with worse resemblance to lysine metabolism (Extended Data Fig. 7c), which is contaminated with celltype markers (Extended Data Fig. 7d), possibly because Slalom does not use celltypespecific gene scalings. expiMap^{9} and scHPF^{4} factors were homogeneously expressed across cells or not enriched in lysine metabolism genes, respectively (Extended Data Fig. 7c–e).
Macrophage analysis
To analyze differentiation gradients in macrophages and to capture all possible maturation stages, we retained the subset of 18 clusters (12,132 cells) annotated as mature macrophages (12 clusters) or more immature myeloidderived (suppressor) cells/monocytic cells (6 clusters) in the Bassez dataset for further analysis (Supplementary Table 2). We embedded the data using DCs, which preserve differentiation trajectories better than many common linear and nonlinear dimensionality reduction techniques^{53}. Using a classification strategy, we selected the DCs that best captured differentiation from more monocytic states to macrophages while separating individuals with (responders) and without (nonresponders) clonal T cell expansion (Fig. 5a,b and Extended Data Fig. 8a,b,d). For every pair of the first 20 DCs, we performed (1) a linear regression with the DCs as the independent variable and the scrannormalized expression of the monocyte marker S100A8 as the dependent variable and (2) a logistic regression with the DCs as the independent variable and response status as the dependent variable. We chose the DC pair with highest sum of the coefficient of determination R^{2} (linear regression) and highest mean accuracy (logistic regression). We calculated Spectra cell score trends over the DCs by fitting a generalized additive model as implemented in Palantir’s _gam_fit_predict and calculate_gene_trends functions^{106} using cell scores instead of gene expression and DCs instead of pseudotime (Fig. 5a):
$${y}_{i,\,j,k}={\beta }_{0}+f(D{C}_{i,k})$$
where y_{i,j,k} is the cell score of cell i, factor j and the kth DC, and DC_{i,k} is the kth DC for cell i. We then visualized cell score trends using the plot_gene_trend_heatmap function from Palantir.
To calculate compositional changes during ICT, we used the Milo package^{56}. Milo is analogous to differential gene expression analysis, but instead of identifying genes that differ between two groups of cells, it tests for differential cell density in (possibly overlapping) neighborhoods of a knearest neighbors (KNN) cell–cell similarity graph across different conditions. We chose the default fraction of 0.1 to be sampled as index cells from the KNN graph, such that representative cellular neighborhoods were only constructed for those index cells. Milo counts the number of cells per sample in each neighborhood and uses a generalized linear model with a negative binomial distribution to test for differences in abundance. Milo also accounts for multiple comparison testing by computing a spatial FDR.
For the Bassez dataset, we constructed a KNN graph on macrophages and monocytic cells. The Milo paper gives the following heuristic to estimate an optimal k parameter^{56}:
where S is the number of samples (here, 79), and a is an arbitrary scaling parameter. Following the authors’ suggestion of 3 ≤ a ≥ 5 resulted in an overly large k parameter 237 ≤ k ≥ 395. We therefore chose k to be smaller than the smallest population of cells identified by clustering (58 cells) but close to the k parameter obtained by the heuristic above, resulting in k = 50 to construct the KNN graph and to identify the nearest 50 neighbors of the index cells.
We then assessed the fold change of cell states under PD1 blockade using the following regression formula:
$${y}_{ns} \sim response+timepoint+timepoint* response$$
where y is the number of cells from sample s in neighborhood n, and response status is defined as 0 for nonresponders and 1 for responders. We defined the timepoint as 0 for before pretherapy and 1 for ontherapy. The notation timepoint * response indicates the interaction between the timepoint and response variables. We then identified the neighborhoods specifically enriched for nonresponders under therapy by taking the subset of neighborhoods based on the estimated regression coefficients. First, we identified the neighborhoods specifically enriched under therapy by retaining a subset of all neighborhoods with an FDR of <0.05 and coefficient (log (fold change)) of >0 for the timepoint parameter for further analysis. From these, we took a subset of the neighborhoods enriched for nonresponders compared to responders under therapy by selecting neighborhoods with an interaction FDR of <0.05 and an interaction coefficient (log (fold change)) of <0 for further analysis. We then compared the mean factor cell scores for these neighborhoods to all remaining neighborhoods.
The Zhang dataset contained fewer individuals treated with immunotherapy than the Bassez dataset (n = 8 versus n = 42) and therefore did not allow for testing as many covariates. We thus chose a slightly different strategy to find macrophage neighborhoods enriched for nonresponders under therapy. For the Bassez dataset, we took a subset of 16 clusters (11,466 cells) annotated as mature macrophages (12 clusters and 9,385 cells) or more immature myeloidderived (suppressor) cells/monocytic cells (4 clusters and 2,081 cells; Supplementary Table 2) and selected samples from individuals classified as nonresponders treated with antiPDL1 (see Zhang dataset) for a total of 4,318 cells and five samples. Analogous to the k parameter selection strategy above, we constructed a KNN graph using a k parameter of 20, which was smaller than the smallest cell population detected by clustering (22 cells). We then defined Milo neighborhoods as the 30 nearest neighbors of the index cells. We fitted the Milo model using the following regression formula:
$${y}_{ns} \sim timepoint$$
where y is the number of cells from sample s in neighborhood n, and timepoint is defined as either 0 for pretherapy or 1 for ontherapy. We took a subset of the neighborhoods with an FDR of <0.2 and a coefficient (log (fold change)) of >0 for the timepoint parameter for further analysis. As for the Bassez dataset, we then compared the factor cell scores for this group with all remaining neighborhoods.
Statistical analysis and visualization
P values were calculated as indicated above using the Milo, scipy and statsmodels Python packages. No normality assumption was made. We used a Mann–Whitney Utest for independent samples and a Wilcoxon matchedpairs signedrank test for paired samples. If not indicated differently, all P values are twosided and corrected for multiple comparisons (Bejamini–Hochberg method). A P value of 0.05 was considered statistically significant. Cohen’s d was calculated according to the following formula:
$$d=\left\vert \frac{{\mathrm{mean}}(a){\mathrm{mean}}(b)}{s}\right\vert$$
where s is the pooled standard deviation, and mean(a) and mean(b) are the means of groups a and b, respectively.
Data were visualized using the matplotlib and seaborn Python packages and GraphPad Prism v9 for Microsoft Windows and were edited in Adobe Illustrator Creative Cloud (v27.0). Panels from Figs. 3c,f and 4d and Extended Data Fig. 3c are duplicated in Extended Data figures to enable sidebyside visual comparisons. The following software packages were used: abslpy (1.0.0), anndata (0.8.0), arpack (3.7.0), cython (0.29.28), fsclvm (1.0.0.dev10), h5py (3.6.0), igraph (0.10.1), intervaltree (2.1.0), jsonpickle (2.1.0), jsonschema (3.2.0), jupyter (1.0.0), leidenalg (0.8.8), matplotlib (3.5.0), networkx (2.6.3), numba (0.54.1), numpy (1.20.3), opt_einsum (3.3.0), pandas (1.3.5), pip (22.1.2), Python (3.7.6), pythonigraph (0.10.1), pytorch (1.10.1), pyvis (0.1.9), scanpy (1.8.2), schpf (0.5.0), scikitlearn (1.0.2), scipy (1.7.3), seaborn (0.11.2), Slalom (1.0.0.dev11), Spectra (0.1.0), statsmodels (0.12.2), tqdm (4.62.3), umaplearn (0.5.2), zifa (0.1), anndata (0.8.0), cellrank (1.5.1), DoubletDetection (2.5.2), graphviz (2.50.0), joypy (0.2.6), jupyter (1.0.0), kneed (0.7.0), leidenalg (0.8.10), notebook (6.4.12), numpy (1.21.6), numpy_groupies (0.9.17), palantir (1.0.1), pandas (1.4.2), phenograph (1.5.7), pickleshare (0.7.5), plotly (5.10.0), pygam (0.8.0), pygments (2.12.0), pygraphviz (1.9), pythonutils (3.3.3), pyvis (0.3.0), rbase (4.1.3), rpy2 (3.5.1), scanpy (1.9.1), scikitlearn (1.1.1), scipy (1.8.1), seaborn (0.11.2), miloR (3.16), pytorch (1.7.0), matplotlib (3.5.1), cudatoolkit (10.2.89), h5py (3.3.0), hdf5 (3.3.0), igraph (0.9.4), networkx (2.5.1) and numba (0.51.1).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
5 blooket hacks for 2023 updated add tokens unlock all blooks
evony the king s return v4 25 0 mod apk unlimited gems money