scanpex.ft package
Module contents
- class scanpex.ft.GeneList(adata, key, category, database=None, gene_names=None, preset=False, source_key='gene_name', caption=None, **kwargs)[source]
Bases:
object
Manages a specific set of genes for scoring, aggregation, and feature selection.
This class retrieves a gene list (from the cache or from direct input), calculates module scores with scanpex.tl.prob_genes, and creates a subsetted AnnData object. It provides methods to select representative genes either by correlation with the module score or by independent predictive value (Lasso-based).
- genes
The list of gene symbols (names).
- Type:
list of str
- ids
The list of gene IDs (indices in adata.var).
- Type:
list of str
- data
A subset of the original AnnData containing only the selected genes and calculated scores.
- Type:
ad.AnnData
- score_name
The key for the raw module score in data.obs.
- Type:
str
- score_prob_name
The key for the probabilistic (transformed) score in data.obs.
- Type:
str
- category
The category name used to retrieve genes from the database.
- Type:
str
- caption
The display caption for the gene list.
- Type:
str
Initialize the GeneList object and calculate scores.
- Parameters:
adata (ad.AnnData) – The annotated data matrix.
key (str) – A unique identifier for caching the gene list.
category (str) – The key to look up in database if gene_names is not provided.
database (dict, optional) – A dictionary mapping categories to lists of genes. Required if gene_names is None.
gene_names (list of str, optional) – An explicit list of gene names. If provided, database is ignored.
preset (bool, optional) – If True, assumes the provided names are final and skips the query step. By default False.
source_key (str, optional) – The column in adata.var containing gene symbols. By default “gene_name”.
caption (str, optional) – A display name for the score. If None, derived from key. By default None.
**kwargs – Additional keyword arguments passed to scanpex.tl.prob_genes for score calculation.
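The precedence between gene_names and database described above can be sketched in plain Python. This is an illustration only, not the scanpex implementation: resolve_gene_names is a hypothetical helper, the marker dictionary is made-up example data, and raising ValueError when neither source is given is an assumption (the documentation only says database is required in that case).

```python
def resolve_gene_names(category, database=None, gene_names=None):
    """Hypothetical helper mirroring the documented precedence rules."""
    if gene_names is not None:
        # An explicit list wins; the database is ignored.
        return list(gene_names)
    if database is None:
        # Assumption: the error type is not stated in the documentation.
        raise ValueError("database is required when gene_names is None")
    # Otherwise look the category up in the database.
    return list(database[category])


# Made-up example data: category -> list of gene symbols.
markers = {"t_cell": ["CD3D", "CD3E"], "b_cell": ["CD79A", "MS4A1"]}

from_db = resolve_gene_names("t_cell", database=markers)
explicit = resolve_gene_names("t_cell", database=markers, gene_names=["GZMB"])
```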
- get_matrix(group_key='SEACells', with_score=False, use_raw_score=False, use_gene_name=True)[source]
Retrieve a DataFrame of the selected genes.
Requires running select_correlated_genes or select_independent_genes first.
- Parameters:
group_key (str, optional) – The key in obs used for aggregation. By default “SEACells”.
with_score (bool, optional) – If True, includes the score column in the output. By default False.
use_raw_score (bool, optional) – If True, uses the raw score instead of the probabilistic score (relevant only if with_score is True). By default False.
use_gene_name (bool, optional) – If True, sets columns to gene symbols. If False, uses gene IDs. By default True.
- Returns:
The data matrix of selected genes.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If no genes have been selected yet.
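How the three boolean flags of get_matrix interact can be shown with a small plain-Python sketch that only computes the resulting column labels. matrix_columns is a hypothetical function, the IDs and score keys are placeholders, and placing the score column after the gene columns is an assumption; the real method returns a pd.DataFrame.

```python
def matrix_columns(selected_names, selected_ids, score_name, score_prob_name,
                   with_score=False, use_raw_score=False, use_gene_name=True):
    """Hypothetical sketch: column labels implied by the documented flags."""
    if not selected_ids:
        # Documented behavior: selection must have been run first.
        raise RuntimeError("No genes selected; run a select_* method first.")
    # Label columns by gene symbol or by gene ID.
    columns = list(selected_names if use_gene_name else selected_ids)
    if with_score:
        # Assumption: the score column is appended after the gene columns.
        columns.append(score_name if use_raw_score else score_prob_name)
    return columns


names = ["CD3D", "CD3E"]
ids = ["ID_A", "ID_B"]  # placeholder gene IDs

cols_default = matrix_columns(names, ids, "score_raw", "score_prob")
cols_ids_raw = matrix_columns(names, ids, "score_raw", "score_prob",
                              with_score=True, use_raw_score=True,
                              use_gene_name=False)
```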
- select_correlated_genes(n_top, group_key='SEACells', use_raw_score=False, **kwargs)
Select genes most highly correlated with the module score.
This method aggregates data by group_key (e.g., metacells) and computes the correlation between each gene’s expression and the module score.
- Parameters:
n_top (int) – The number of top correlated genes to select.
group_key (str, optional) – The key in obs used for aggregation before correlation. By default “SEACells”.
use_raw_score (bool, optional) – If True, uses the raw score (score_name). If False, uses the probabilistic score (score_prob_name). By default False.
**kwargs – Additional arguments passed to pandas.DataFrame.corr.
- Returns:
A tuple containing (selected_gene_names, selected_gene_ids).
- Return type:
tuple of (list of str, list of str)
- select_independent_genes(n_top, group_key='SEACells', use_raw_score=False, n_cv=5, step=10, random_state=0, **kwargs)[source]
Select a subset of genes that independently predict the module score.
This method uses a two-step process:
1. LassoCV to determine the optimal regularization parameter (alpha).
2. Recursive Feature Elimination (RFE) with Lasso to select the top n_top features.
- Parameters:
n_top (int) – The number of features to select.
group_key (str, optional) – The key in obs used for aggregation. By default “SEACells”.
use_raw_score (bool, optional) – If True, uses the raw score as the target variable. By default False.
n_cv (int, optional) – Number of folds for LassoCV. By default 5.
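step (float or int, optional) – If an int, the number of features to remove at each iteration of RFE; if a float between 0 and 1, the fraction of features to remove at each iteration (the scikit-learn RFE convention). By default 10.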
random_state (int, optional) – Seed for reproducibility. By default 0.
**kwargs – Currently unused; accepted for compatibility.
- Returns:
A tuple containing (selected_gene_names, selected_gene_ids).
- Return type:
tuple of (list of str, list of str)
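The two-step LassoCV and RFE procedure can be sketched directly with scikit-learn. This is an illustration under assumptions, not the scanpex source: select_independent, the synthetic matrix, and the gene_0 … gene_5 labels are hypothetical, and any preprocessing the real method applies (aggregation by group_key, scaling) is omitted.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LassoCV


def select_independent(X, y, names, n_top, n_cv=5, step=10, random_state=0):
    """Hypothetical sketch of LassoCV followed by RFE with Lasso."""
    # Step 1: choose the regularization strength by cross-validation.
    alpha = LassoCV(cv=n_cv, random_state=random_state).fit(X, y).alpha_
    # Step 2: recursively drop the weakest features at that alpha.
    rfe = RFE(Lasso(alpha=alpha), n_features_to_select=n_top, step=step)
    rfe.fit(X, y)
    return [name for name, keep in zip(names, rfe.support_) if keep]


# Synthetic data: the target depends only on features 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=80)
names = [f"gene_{i}" for i in range(6)]

selected = select_independent(X, y, names, n_top=2)
```

Fixing alpha before running RFE keeps the regularization constant while features are eliminated, so the ranking at each iteration reflects coefficient size under the same penalty.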