scanpex.sq package
Module contents
- class scanpex.sq.GeneCacheManager(cache_dir='gene_cache', registry_file='registry.json', base_path=None)[source]
Bases:
object
Manages the caching and retrieval of gene lists.
This class handles the execution of gene generation recipes and stores the results as text files to avoid redundant computations. It maintains a JSON registry to track the mapping between keys and filenames.
- cache_dir
The absolute path to the directory where cache files are stored.
- Type:
str
- registry_path
The full path to the registry JSON file.
- Type:
str
- recipes
A dictionary storing the registered functions and their arguments.
- Type:
dict
- registry
A dictionary mapping cache keys to their corresponding filenames.
- Type:
dict
Initialize the GeneCacheManager.
- Parameters:
cache_dir (str, optional) – Name or path of the cache directory. By default “gene_cache”.
registry_file (str, optional) – Name of the registry file. By default “registry.json”.
base_path (str, optional) – Base path for resolving relative cache directories. If None, uses the current file’s location or the working directory.
- clear_cache(key)[source]
Remove the cache file and registry entry for a specific key.
- Parameters:
key (str) – The identifier of the cache item to remove.
- get(key, update=False)[source]
Retrieve the gene list for the given key.
If the cache exists and update is False, data is read from the file. Otherwise, the registered recipe is executed to generate the data, which is then saved to a file.
- Parameters:
key (str) – The identifier of the gene list to retrieve.
update (bool, optional) – If True, ignores existing cache and regenerates the file. By default False.
- Returns:
The list of genes. Returns an empty list if generation fails or no recipe is found.
- Return type:
list of str
- load(key, func, update=False, **kwargs)[source]
Register a recipe and retrieve the gene list in one step.
This is a convenience wrapper that calls register_recipe followed by get.
- Parameters:
key (str) – Unique identifier for the cache item.
func (callable) – The function used to generate the gene list. Must return a list of strings.
update (bool, optional) – If True, forces regeneration of the cache even if it exists. By default False.
**kwargs – Keyword arguments passed to func.
- Returns:
The list of genes loaded from cache or generated by the function.
- Return type:
list of str
- register_recipe(key, func, **kwargs)[source]
Register a function and its arguments for lazy generation.
- Parameters:
key (str) – Unique identifier for the cache item.
func (callable) – The function to execute when generation is triggered.
**kwargs – Keyword arguments to be passed to func upon execution.
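The caching pattern described for GeneCacheManager (a registered recipe, one text file per key, and a JSON registry mapping keys to filenames) can be sketched as below. This is a minimal illustration, not the scanpex implementation: the class name TinyGeneCache, the "<key>.txt" naming scheme, and the helper make_genes are all assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class TinyGeneCache:
    """Hypothetical stand-in illustrating the recipe/registry pattern."""

    def __init__(self, cache_dir: str, registry_file: str = "registry.json"):
        self.cache_dir = os.path.abspath(cache_dir)
        os.makedirs(self.cache_dir, exist_ok=True)
        self.registry_path = os.path.join(self.cache_dir, registry_file)
        self.recipes: Dict[str, tuple] = {}
        self.registry: Dict[str, str] = {}
        if os.path.exists(self.registry_path):
            with open(self.registry_path) as fh:
                self.registry = json.load(fh)

    def register_recipe(self, key: str, func: Callable[..., List[str]], **kwargs) -> None:
        # Store the function and its arguments; nothing is executed yet.
        self.recipes[key] = (func, kwargs)

    def get(self, key: str, update: bool = False) -> List[str]:
        filename = self.registry.get(key)
        path = os.path.join(self.cache_dir, filename) if filename else None
        if path and os.path.exists(path) and not update:
            # Cache hit: read one gene per line.
            with open(path) as fh:
                return [line.strip() for line in fh if line.strip()]
        if key not in self.recipes:
            return []  # no recipe registered for this key
        func, kwargs = self.recipes[key]
        try:
            genes = func(**kwargs)
        except Exception:
            return []  # generation failed
        filename = f"{key}.txt"  # assumed naming scheme
        with open(os.path.join(self.cache_dir, filename), "w") as fh:
            fh.write("\n".join(genes))
        self.registry[key] = filename
        with open(self.registry_path, "w") as fh:
            json.dump(self.registry, fh)
        return genes

    def load(self, key: str, func: Callable[..., List[str]], update: bool = False, **kwargs) -> List[str]:
        # Convenience wrapper: register the recipe, then retrieve.
        self.register_recipe(key, func, **kwargs)
        return self.get(key, update=update)


calls = []  # records each time the recipe actually runs


def make_genes(prefix: str) -> List[str]:
    # Hypothetical recipe; a real one would compute or download a gene list.
    calls.append(prefix)
    return [f"{prefix}1", f"{prefix}2"]


with tempfile.TemporaryDirectory() as tmp:
    cache = TinyGeneCache(tmp)
    first = cache.load("demo", make_genes, prefix="GENE")   # runs the recipe
    second = cache.load("demo", make_genes, prefix="GENE")  # read from the cache file
    third = cache.load("demo", make_genes, update=True, prefix="GENE")  # regenerated
```

In this sketch the recipe runs only on the first call and on the call with update=True; the second call is served from the text file.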
- scanpex.sq.ensembl_mapper(ensembl_list, species='human')[source]
Map a list of Ensembl gene IDs to their corresponding gene symbols.
This function queries the MyGene.info API to translate Ensembl IDs into gene symbols. If a gene symbol cannot be found for a given ID, the original Ensembl ID is retained in the ‘symbol’ column. Duplicate queries are dropped, keeping only the first matched record.
- Parameters:
ensembl_list (List[str]) – A list of Ensembl gene IDs to be mapped (e.g., ["ENSG00000139618"]).
species (str, optional) – The species name or taxonomy ID to restrict the query. Common values include ‘human’ or ‘mouse’. Default is ‘human’.
- Returns:
A pandas DataFrame containing two columns:
- ‘query’: The original Ensembl ID.
- ‘symbol’: The mapped gene symbol (or the original ID if unmapped).
- Return type:
pd.DataFrame
- Raises:
ImportError – If the mygene package is not installed in the current environment.
Examples
>>> ensembl_ids = ["ENSG00000139618", "ENSG00000157764", "INVALID_ID"]
>>> df = ensembl_mapper(ensembl_ids, species="human")
>>> print(df)
             query      symbol
0  ENSG00000139618       BRCA2
1  ENSG00000157764        BRAF
2       INVALID_ID  INVALID_ID
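The two post-processing rules described above (fall back to the original ID when no symbol is found, and keep only the first record per query) can be illustrated without network access. The sketch below is an assumption-laden stand-in: it uses plain dictionaries instead of a DataFrame, the helper name tidy is hypothetical, and the records are hand-written placeholders rather than real API output.

```python
from typing import Dict, List

# Hand-written placeholder records, one dict per hit. A record without a
# "symbol" key represents an ID for which nothing was found.
records = [
    {"query": "ENSG00000139618", "symbol": "BRCA2"},
    {"query": "ENSG00000139618", "symbol": "SECOND_HIT"},  # made-up duplicate hit
    {"query": "INVALID_ID", "notfound": True},
]


def tidy(records: List[Dict]) -> List[Dict[str, str]]:
    """Keep the first record per query; use the query itself when no symbol exists."""
    rows: List[Dict[str, str]] = []
    seen = set()
    for rec in records:
        query = rec["query"]
        if query in seen:
            continue  # drop later duplicates of the same query
        seen.add(query)
        rows.append({"query": query, "symbol": rec.get("symbol", query)})
    return rows


rows = tidy(records)
for row in rows:
    print(row["query"], row["symbol"])
```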
- scanpex.sq.gene_query(gene_names, source, species='human', logging=True, unique=True, sort=False, keep_unmapped=False)[source]
Map gene names (symbols or aliases) to a target source list (e.g., adata.var_names).
This function uses MyGene.info to resolve gene aliases. It checks if the queried gene or its aliases exist in the provided source list. If a match is found, the gene name as it appears in source is returned.
- Parameters:
gene_names (list) – List of gene names or aliases to query.
source (list) – The target list of valid gene names (e.g., adata.var_names). The function checks if the queried genes exist in this list.
species (str, optional (default: "human")) – Species to query in MyGene.info (e.g., “human”, “mouse”).
logging (bool, optional (default: True)) – If True, prints the number of mapped genes and missing queries.
unique (bool, optional (default: True)) – If True, returns a sorted list of unique gene names. If False, allows duplicates and maintains the original query order.
sort (bool, optional (default: False)) – If True, sorts the returned list of genes alphanumerically.
keep_unmapped (bool, optional (default: False)) – If True, includes unmapped gene names in the returned list. If False, omits unmapped genes.
- Returns:
A list of gene names, as they appear in source, that were successfully mapped. Unmapped query names are included only if keep_unmapped is True.
- Return type:
List[str]
- Raises:
ImportError – If the mygene library is not installed.
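The matching rule (accept a queried name if it, or one of its aliases, appears in source, and return the name as spelled in source) can be sketched offline. This is only an illustration under stated assumptions: the ALIASES table is a hard-coded stand-in for the MyGene.info lookup, match_to_source is a hypothetical helper, and the unique, sort, and logging options are left out.

```python
from typing import Dict, List

# Hard-coded stand-in for an alias lookup; the real function queries MyGene.info.
ALIASES: Dict[str, List[str]] = {
    "p53": ["TP53"],
    "HER2": ["ERBB2"],
}


def match_to_source(gene_names: List[str], source: List[str],
                    keep_unmapped: bool = False) -> List[str]:
    """Return, in query order, the source spelling of each gene that can be matched."""
    source_set = set(source)
    mapped: List[str] = []
    for name in gene_names:
        # Try the name itself first, then its known aliases.
        candidates = [name] + ALIASES.get(name, [])
        hit = next((c for c in candidates if c in source_set), None)
        if hit is not None:
            mapped.append(hit)
        elif keep_unmapped:
            mapped.append(name)  # pass the unmapped query through unchanged
    return mapped


source = ["TP53", "ERBB2", "BRAF"]  # e.g. the contents of adata.var_names
strict = match_to_source(["p53", "BRAF", "FOO"], source)
lenient = match_to_source(["p53", "BRAF", "FOO"], source, keep_unmapped=True)
print(strict)
print(lenient)
```

Here "FOO" is a placeholder for a name with no match; it is dropped by default and kept only when keep_unmapped is True.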
- scanpex.sq.xor(list_a, list_b)[source]
Identify elements exclusive to each of the two input lists.
This function calculates the set difference in both directions: (A - B) and (B - A).
Note
Since this function converts inputs to sets internally:
1. Duplicate elements in the inputs will be removed in the output.
2. The order of elements in the output is not guaranteed.
- Parameters:
list_a (List[Any]) – The first list to compare.
list_b (List[Any]) – The second list to compare.
- Returns:
A tuple containing two lists:
1. Elements present in list_a but not in list_b.
2. Elements present in list_b but not in list_a.
- Return type:
Tuple[List[Any], List[Any]]
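The documented behaviour can be reproduced with two set differences. The sketch below is a re-implementation for illustration, assuming nothing beyond the description above; xor_sketch is a hypothetical name, not the library function.

```python
from typing import Any, List, Tuple


def xor_sketch(list_a: List[Any], list_b: List[Any]) -> Tuple[List[Any], List[Any]]:
    """Return (A - B, B - A) computed with set differences."""
    set_a, set_b = set(list_a), set(list_b)
    return list(set_a - set_b), list(set_b - set_a)


only_a, only_b = xor_sketch(["TP53", "BRCA2", "BRCA2"], ["BRCA2", "BRAF"])
# Sort before displaying, because set ordering is not guaranteed.
print(sorted(only_a), sorted(only_b))
```

Because the inputs pass through sets, the duplicate "BRCA2" in the first list has no effect on the result, and callers who need a stable order should sort the output themselves.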