scanpex.ft package

Module contents

class scanpex.ft.GeneList(adata, key, category, database=None, gene_names=None, preset=False, source_key='gene_name', caption=None, **kwargs)

Bases: object

Manages a specific set of genes for scoring, aggregation, and feature selection.

This class retrieves the gene list (from the cache or from direct input), calculates module scores with scanpex.tl.prob_genes, and creates a subsetted AnnData object. It provides methods to select representative genes either by their correlation with the module score or by their independent contribution to predicting it (Lasso-based).

genes

The list of gene symbols (names).

Type:

list of str

ids

The list of gene IDs (index values of adata.var).

Type:

list of str

data

A subset of the original AnnData containing only the selected genes and calculated scores.

Type:

ad.AnnData

score_name

The key for the raw module score in data.obs.

Type:

str

score_prob_name

The key for the probabilistic (transformed) score in data.obs.

Type:

str

category

The category name used to retrieve genes from the database.

Type:

str

caption

The display caption for the gene list.

Type:

str

Initialize the GeneList object and calculate scores.

Parameters:
  • adata (ad.AnnData) – The annotated data matrix.

  • key (str) – A unique identifier for caching the gene list.

  • category (str) – The key to look up in database if gene_names is not provided.

  • database (dict, optional) – A dictionary mapping categories to lists of genes. Required if gene_names is None.

  • gene_names (list of str, optional) – An explicit list of gene names. If provided, database is ignored.

  • preset (bool, optional) – If True, assumes the provided names are final and skips the query step. By default False.

  • source_key (str, optional) – The column in adata.var containing gene symbols. By default “gene_name”.

  • caption (str, optional) – A display name for the score. If None, derived from key. By default None.

  • **kwargs – Additional keyword arguments passed to scanpex.tl.prob_genes for score calculation.
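The precedence between gene_names and database described above can be sketched in plain Python. This is a minimal illustration, not the class's actual code: resolve_gene_names is a hypothetical helper, the toy database contents are made up, and raising ValueError for a missing database is an assumption (the documentation only says the database is required in that case).

```python
from typing import Mapping, Optional, Sequence


def resolve_gene_names(
    category: str,
    database: Optional[Mapping[str, Sequence[str]]] = None,
    gene_names: Optional[Sequence[str]] = None,
) -> list:
    """Hypothetical helper mirroring the documented precedence rules."""
    if gene_names is not None:
        # An explicit list takes priority; the database is ignored.
        return list(gene_names)
    if database is None:
        # Assumption: the error type used by the real class is not documented.
        raise ValueError("database is required when gene_names is None")
    # Otherwise look the category up in the database.
    return list(database[category])


# Toy database for illustration only.
database = {"hypoxia": ["VEGFA", "CA9", "SLC2A1"]}

from_db = resolve_gene_names("hypoxia", database=database)
explicit = resolve_gene_names("hypoxia", database=database, gene_names=["HIF1A"])
```

Here from_db holds the three genes stored under "hypoxia", while explicit holds only the gene passed in directly.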

get_matrix(group_key='SEACells', with_score=False, use_raw_score=False, use_gene_name=True)

Retrieve a DataFrame of the selected genes.

Requires running select_correlated_genes or select_independent_genes first.

Parameters:
  • group_key (str, optional) – The key in obs used for aggregation. By default “SEACells”.

  • with_score (bool, optional) – If True, includes the score column in the output. By default False.

  • use_raw_score (bool, optional) – If True, uses the raw score instead of the probabilistic score (relevant only if with_score is True). By default False.

  • use_gene_name (bool, optional) – If True, sets columns to gene symbols. If False, uses gene IDs. By default True.

Returns:

The data matrix of selected genes.

Return type:

pd.DataFrame

Raises:

RuntimeError – If no genes have been selected yet.
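To make the options above concrete, the following sketch mimics the documented behaviour on a toy pandas table. It is an illustration under assumptions, not the method's implementation: build_matrix, the gene IDs, the id_to_name mapping, and the "score_prob" column name are all hypothetical, and the table is assumed to be already aggregated by group_key.

```python
import pandas as pd

# Toy matrix, assumed to be already aggregated per group (e.g. per metacell).
matrix = pd.DataFrame(
    {"ENSG01": [1.0, 2.0], "ENSG02": [0.5, 0.1], "score_prob": [0.9, 0.2]},
    index=pd.Index(["SEACell-0", "SEACell-1"], name="SEACells"),
)
# Hypothetical mapping from gene IDs to gene symbols.
id_to_name = {"ENSG01": "VEGFA", "ENSG02": "CA9"}


def build_matrix(df, selected_ids, with_score=False, use_gene_name=True,
                 score_col="score_prob"):
    """Hypothetical stand-in for the documented get_matrix options."""
    if not selected_ids:
        # Mirrors the documented RuntimeError when nothing has been selected.
        raise RuntimeError("No genes selected; run a select_* method first.")
    out = df[list(selected_ids)]
    if use_gene_name:
        # Replace gene IDs with gene symbols in the column labels.
        out = out.rename(columns=id_to_name)
    if with_score:
        # Append the score as an extra column.
        out = out.join(df[score_col])
    return out


result = build_matrix(matrix, ["ENSG01", "ENSG02"], with_score=True)
```

In this sketch result has one row per group and the columns VEGFA, CA9, and score_prob; calling build_matrix with an empty selection raises RuntimeError.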

select_correlated_genes(n_top, group_key='SEACells', use_raw_score=False, **kwargs)

Select genes most highly correlated with the module score.

This method aggregates data by group_key (e.g., metacells) and computes the correlation between each gene’s expression and the module score.

Parameters:
  • n_top (int) – The number of top correlated genes to select.

  • group_key (str, optional) – The key in obs used for aggregation before correlation. By default “SEACells”.

  • use_raw_score (bool, optional) – If True, uses the raw score (score_name). If False, uses the probabilistic score (score_prob_name). By default False.

  • **kwargs – Additional arguments passed to pandas.DataFrame.corr.

Returns:

A tuple containing (selected_gene_names, selected_gene_ids).

Return type:

tuple of (list of str, list of str)
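The aggregate-then-correlate idea can be sketched with pandas alone. This is a minimal sketch under assumptions, not the method's source: top_correlated is a hypothetical function, the data are synthetic, aggregation by the group mean is assumed, and ranking by signed (rather than absolute) correlation is also an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells = 200

# Synthetic per-cell table: a group label, a score, and three toy genes.
score = rng.normal(size=n_cells)
cells = pd.DataFrame({
    "SEACells": rng.integers(0, 20, size=n_cells).astype(str),
    "score": score,
    "GENE_A": score + rng.normal(scale=0.1, size=n_cells),   # tracks the score
    "GENE_B": -score + rng.normal(scale=0.1, size=n_cells),  # anti-correlated
    "GENE_C": rng.normal(size=n_cells),                      # unrelated
})


def top_correlated(df, n_top, group_key="SEACells", score_col="score",
                   **corr_kwargs):
    """Hypothetical sketch: aggregate by group, then rank genes by correlation."""
    # Assumption: groups are summarised by their mean.
    grouped = df.groupby(group_key).mean()
    # Extra keyword arguments go to pandas.DataFrame.corr (e.g. method="spearman").
    corr = grouped.corr(**corr_kwargs)[score_col].drop(score_col)
    # Assumption: rank by signed correlation, highest first.
    return corr.sort_values(ascending=False).head(n_top).index.tolist()


selected = top_correlated(cells, n_top=1)
```

With this synthetic data the single selected gene is GENE_A, the one constructed to follow the score.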

select_independent_genes(n_top, group_key='SEACells', use_raw_score=False, n_cv=5, step=10, random_state=0, **kwargs)

Select a subset of genes that independently predict the module score.

This method uses a two-step process:

  1. LassoCV to determine the optimal regularization parameter (alpha).

  2. Recursive Feature Elimination (RFE) with Lasso to select the top n_top features.

Parameters:
  • n_top (int) – The number of features to select.

  • group_key (str, optional) – The key in obs used for aggregation. By default “SEACells”.

  • use_raw_score (bool, optional) – If True, uses the raw score as the target variable. By default False.

  • n_cv (int, optional) – Number of folds for LassoCV. By default 5.

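  • step (float or int, optional) – If an int, the number of features to remove at each iteration of RFE. Under scikit-learn's RFE convention, a float in (0, 1) would be the fraction of features to remove at each iteration. By default 10.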

  • random_state (int, optional) – Seed for reproducibility. By default 0.

  • **kwargs – Currently unused; accepted for compatibility.

Returns:

A tuple containing (selected_gene_names, selected_gene_ids).

Return type:

tuple of (list of str, list of str)
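The two-step procedure described above (LassoCV for alpha, then RFE with Lasso) can be sketched with scikit-learn on synthetic data. This is an illustration under assumptions rather than the method's actual code: the feature matrix and target are made up, and the real method operates on expression aggregated by group_key, which is not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
n_samples, n_features = 120, 30

# Synthetic data: only features 0 and 1 drive the target.
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Step 1: cross-validated Lasso picks the regularization strength (alpha).
cv_model = LassoCV(cv=5, random_state=0).fit(X, y)

# Step 2: recursive feature elimination with a Lasso at that alpha,
# dropping up to 10 features per iteration until 2 remain.
rfe = RFE(Lasso(alpha=cv_model.alpha_), n_features_to_select=2, step=10)
rfe.fit(X, y)

# Column indices of the features that survived elimination.
selected = np.flatnonzero(rfe.support_).tolist()
```

On this synthetic problem the surviving features are the two that generate the target. Fixing alpha before RFE keeps the elimination rounds comparable, since every refit uses the same regularization strength.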