geno2phenotb package

Submodules

geno2phenotb.annotate_vcf module

Annotates VCF.

geno2phenotb.annotate_vcf.annotate_vcf(in_path)[source]

Completes missing annotations in the ‘Subst’ column of an MTBseq variant call file (VCF).

Parameters:

in_path (str) – Path to the VCF file to annotate/complete.

Returns:

Nothing is returned; the output is written to a new file in the same directory.

Return type:

None

geno2phenotb.geno2phenotb module

This is the entry point of the geno2phenoTB console script.

geno2phenotb.geno2phenotb.main(args)[source]

Wrapper that allows process() to be called with string arguments in a CLI fashion.

Instead of returning the value from process(), it prints the result to stdout as a formatted message.

Parameters:

args (List[str]) – Command line parameters as list of strings.

geno2phenotb.geno2phenotb.run()[source]

Calls main() passing the CLI arguments extracted from sys.argv.

This function is used as entry point to create a console script with setuptools.
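The relationship between run() and main() can be sketched as follows. This is a minimal illustration of the wrapper pattern described above, not the package's actual code: the body of main() is a placeholder, and slicing off the program name with sys.argv[1:] is an assumption.

```python
import sys
from typing import List


def main(args: List[str]) -> None:
    """Placeholder for the real main(): takes CLI-style string arguments."""
    # The real function parses the arguments and prints a formatted result;
    # here we only echo them to show the calling convention.
    print("received arguments:", args)


def run() -> None:
    """Entry point without parameters, as required for a console script."""
    # Assumption: the program name (sys.argv[0]) is dropped before the call.
    main(sys.argv[1:])


if __name__ == "__main__":
    run()
```

Keeping main() parameterized makes it callable from tests with an explicit argument list, while run() provides the zero-argument callable that a setuptools console script needs.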

geno2phenotb.geno2phenotb.setup_logging(loglevel)[source]

Set up basic logging.

Parameters:

loglevel (int) – Minimum loglevel for emitting messages.
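A function of this kind is typically a thin wrapper around logging.basicConfig. The sketch below shows one way to write it; the function name, message format, and output stream are assumptions and may differ from what the package uses.

```python
import logging
import sys


def setup_logging_sketch(loglevel: int) -> None:
    """Configure the root logger to emit messages at or above ``loglevel``."""
    logging.basicConfig(
        level=loglevel,
        stream=sys.stdout,  # assumed destination
        format="[%(asctime)s] %(levelname)s:%(name)s:%(message)s",  # assumed format
        datefmt="%Y-%m-%d %H:%M:%S",
        force=True,  # replace handlers from earlier calls (Python 3.8+)
    )


setup_logging_sketch(logging.INFO)
logging.getLogger("demo").info("logging is configured")
logging.getLogger("demo").debug("suppressed, because DEBUG < INFO")
```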

geno2phenotb.installation_test module

Self test of installation and dependencies.

geno2phenotb.installation_test.check_sha256(file_path, expected_hash)[source]

Checks the sha256 hash of a file and throws an exception if it does not match.

Parameters:
  • file_path (str) – The path to the file.

  • expected_hash (str) – Expected sha256 hash of the file.

Returns:

matching – True, if file matches the hash.

Return type:

bool
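The check described above can be sketched with the standard library's hashlib. The function name and the use of ValueError are assumptions made for illustration; the documentation does not state which exception type the package raises.

```python
import hashlib
import os
import tempfile


def sha256_matches(file_path: str, expected_hash: str) -> bool:
    """Return True if the file's sha256 digest equals expected_hash, else raise."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # Read in chunks so that large files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hash.lower():
        # Assumed exception type; the real function may raise something else.
        raise ValueError(f"sha256 mismatch for {file_path}")
    return True


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
print(sha256_matches(tmp.name, hashlib.sha256(b"hello").hexdigest()))
os.remove(tmp.name)
```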

geno2phenotb.installation_test.download_file(url, file_name, expected_hash)[source]

Downloads a file and saves it if it does not already exist.

Displays a progress bar and throws an exception if the sha256 hash does not match.

Parameters:
  • url – URL of the file to download.

  • file_name (str) – The name of the model file.

  • expected_hash (str) – The expected sha256 hash of the model file.

Returns:

Throws an exception if the sha256 hash of the model file is not equal to expected_hash.

Return type:

None

geno2phenotb.installation_test.download_test_files()[source]

Download the forward / reverse reads with accession id ERR551304 from the ENA.

Return type:

None

geno2phenotb.installation_test.self_test(sample_id, complete)[source]

Performs a self test by running everything and comparing it to the precomputed ground truth.

Parameters:
  • sample_id (str) – The sample ID. One of ERR551304, ERR551304, ERR553187.

  • complete (bool) – Run the complete test. Only available for ERR551304.

Returns:

Throws an exception if the output differs from the ground truth.

Return type:

None

geno2phenotb.parse_args module

Creates the argument parser.

geno2phenotb.parse_args.parse_args(args)[source]

Parse command line parameters.

Parameters:

args (List[str]) – Command line parameters as list of strings (for example ["--help"]).

Returns:

Command line parameters namespace.

Return type:

argparse.Namespace

geno2phenotb.predict module

Functions to predict the resistance of an isolate.

geno2phenotb.predict.adjusted_classes(proba, t)[source]

This function adjusts class predictions based on the prediction threshold (t).

Will only work for binary classification problems.

Parameters:
  • proba (float) – Probability.

  • t (float) – Decision threshold.

Returns:

class – Predicted class based on the decision threshold t: will be 1.0, if proba >= t, or 0.0 otherwise.

Return type:

float
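The thresholding rule stated above amounts to a single comparison. The sketch below restates it for one probability value; the function name is chosen for illustration.

```python
def adjusted_class(proba: float, t: float) -> float:
    """Return 1.0 if proba reaches the decision threshold t, else 0.0."""
    return 1.0 if proba >= t else 0.0


print(adjusted_class(0.42, 0.5))  # 0.0, below the threshold
print(adjusted_class(0.50, 0.5))  # 1.0, the threshold itself counts as positive
print(adjusted_class(0.42, 0.3))  # 1.0, a lower threshold flips the call
```

Lowering t trades specificity for sensitivity: more isolates are called resistant, including borderline ones.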

geno2phenotb.predict.predict(fastq_dir, output_dir, sample_id, skip_mtbseq=False, drugs=None)[source]

Predicts the drug resistance. This will start all preprocessing steps.

Parameters:
  • fastq_dir (str) – Path to directory containing the fastq files.

  • output_dir (str) – Path to output directory. A file named ‘<sample_id>_feature_importance_evaluation.tsv’ is written to this directory. This file contains a table with feature importance values and catalog info per drug. Further, for each drug a resistance report file ‘<drug>_resistance_report.txt’ is output.

  • sample_id (str) – Sample ID.

  • skip_mtbseq (bool, default=False) – Do not run MTBseq but use preprocessed data.

  • drugs (Union[str, list], default=None) – If None, drug resistance predictions are determined for all drugs known to geno2phenoTB. If a list of drugs is supplied, predictions are determined only for these. Each drug must be one of ‘AMK’, ‘CAP’, ‘DCS’, ‘EMB’, ‘ETH’, ‘FQ’, ‘INH’, ‘KAN’, ‘PAS’, ‘PZA’, ‘RIF’, ‘STR’.

Return type:

Tuple[DataFrame, DataFrame, Dict[str, Optional[List[str]]]]

Returns:

  • result (pd.DataFrame) – A DataFrame with the probabilities (for resistance) and predictions (1.0 for resistance, 0.0 for susceptibility) for the requested drugs.

  • feature_evaluation (pd.DataFrame) – A DataFrame listing the features (called variants, lineage classification, genotypes) plus an assessment of the relevance of each feature for the Machine-Learning-based and catalog-based resistance prediction per drug. For each drug, two columns are given: ‘<drug> feature importance’ and ‘<drug> catalog resistance variant’. The first contains the feature importance value derived from the Machine Learning model, the second informs if the variant is a known catalog resistance variant for the considered drug.

  • rules (Dict[str, Optional[list[str]]]) – Dict of lists with features constituting a rule. If the used Machine Learning Model is a Rule-Based Classifier, rules[drug] is a list of features constituting a rule (the rule can be constructed by connecting the given features with boolean ‘or’ operators (disjunctions)). Otherwise, rules[drug]=None.

geno2phenotb.predict.single_prediction(drug, output_dir, sample_id, features)[source]

Predicts drug resistance for a single drug based on preprocessed data.

Parameters:
  • drug (str) – Drug to predict resistance. The drug must be one of ‘AMK’, ‘CAP’, ‘DCS’, ‘EMB’, ‘ETH’, ‘FQ’, ‘INH’, ‘KAN’, ‘PAS’, ‘PZA’, ‘RIF’, ‘STR’.

  • output_dir (str) – Path to output directory. A resistance report file ‘<drug>_resistance_report.txt’ is written to this directory.

  • sample_id (str) – Sample ID.

  • features (pd.Series) – The features (incl. called variants, lineage classification, and genotypes) extracted from the supplied FASTQ file(s).

Return type:

Tuple[Optional[float], float, float, Optional[List[str]], Optional[Series], Optional[List[str]]]

Returns:

  • probability (float, default=None) – Probability (for resistance) against the requested drug. If the underlying Machine Learning Model is a Rule-Based Classifier (RBC), probability=None, since RBCs do not provide probability estimates.

  • prediction (float) – Machine-Learning-based Prediction (1.0 for resistance, 0.0 for susceptibility) for the requested drug.

  • catalog_prediction (float) – Resistance catalog based prediction (1.0 for resistance, 0.0 for susceptibility) for the requested drug.

  • found_catalog_variants (List[str], default=None) – Resistance-causing variants found among the features.

  • importances (pd.Series, default=None) – A Series with values quantifying the importance of each feature, if the underlying Machine Learning Model provides feature importances. None, otherwise.

  • rule (List[str], default=None) – A list with features constituting a rule if the underlying Machine Learning Model is a Rule-Based Classifier. The rule can be constructed by connecting the given features with boolean ‘or’ operators (disjunctions). None, otherwise.
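Since a rule is described as a disjunction of features, it can be evaluated with a single any() over the listed feature names. The following is a sketch under stated assumptions: features are held in a plain dict rather than a pd.Series, a feature counts as present when its value is 1.0, and the feature names are hypothetical placeholders.

```python
from typing import Dict, List, Optional


def apply_rule(rule: Optional[List[str]], features: Dict[str, float]) -> Optional[float]:
    """Evaluate a rule as a boolean 'or' over binary features."""
    if rule is None:
        # The model is not a Rule-Based Classifier, so there is no rule to apply.
        return None
    # Resistant (1.0) as soon as any feature named in the rule is present.
    return 1.0 if any(features.get(name, 0.0) == 1.0 for name in rule) else 0.0


# Hypothetical feature names, used only for illustration.
features = {"variant_a": 0.0, "variant_b": 1.0, "variant_c": 0.0}
print(apply_rule(["variant_a", "variant_b"], features))  # 1.0
print(apply_rule(["variant_a", "variant_c"], features))  # 0.0
```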

geno2phenotb.preprocess module

Wrapper for all preprocessing steps.

These functions include the assembly and variant calling using MTBseq as well as the genotype and lineage collection.

geno2phenotb.preprocess.collect_lineages(classification_file)[source]

Collect lineage classification.

Parameters:

classification_file (str) – Full path to the Strain_Classification.tab file.

Returns:

lineage_classification – Series of floats from {0.0, 1.0} denoting whether the isolate is classified as belonging to a lineage (classification[“lineage X”]=1.0) or not (classification[“lineage X”]=0.0).

Return type:

pd.Series

geno2phenotb.preprocess.determine_genotype(catalog_variants, resistance_variants)[source]

Determine the genotype of an isolate.

Use resistance variants from the FZB catalog plus all InDels and early stop codons (e.g. A123_) in the following genes (not mentioned in the catalog, i.e., in addition to the known resistance variants from the Masterlist) to assign a resistant genotype: ethA=Rv3854c (ETH), pncA=Rv2043c (PZA), gidB/gid=Rv3919c (STR), rpoB=Rv0667 (RIF), Rv0678 (BDQ/CFZ), ald=Rv2780 (DCS), katG=Rv1908c (INH), ddn=Rv3547 (DLM), tlyA=Rv1694 (CAP).

Parameters:
  • catalog_variants (pd.DataFrame) – DataFrame of resistance-related variants from the FZB catalog. There is one column per drug. The column contains the variants from the FZB catalog that are related with resistance against the drug.

  • resistance_variants (Set[str]) – Resistance-related variants that were found for an isolate.

Return type:

Tuple[Series, Series]

Returns:

  • genotypes (pd.Series[float]) – Series of per-drug genotypes.

  • geno_variants (pd.Series[str]) – Series of resistance-related variants per drug.
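The catalog lookup at the core of this function can be sketched as a per-drug set intersection. This simplified illustration uses plain dicts and sets instead of pandas objects, assumes that 1.0 encodes a resistant genotype, and leaves out the additional InDel and stop-codon handling for the key genes. The drug codes and variant names are hypothetical placeholders.

```python
from typing import Dict, Set, Tuple


def determine_genotype_sketch(
    catalog: Dict[str, Set[str]], found: Set[str]
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Assign a per-drug genotype from the overlap of catalog and found variants."""
    genotypes: Dict[str, float] = {}
    geno_variants: Dict[str, str] = {}
    for drug, variants in catalog.items():
        hits = sorted(variants & found)  # catalog variants present in the isolate
        genotypes[drug] = 1.0 if hits else 0.0
        geno_variants[drug] = ",".join(hits)
    return genotypes, geno_variants


# Hypothetical catalog and isolate, used only for illustration.
catalog = {"DRUG_X": {"geneA_S10L", "geneA_D20G"}, "DRUG_Y": {"geneB_K5R"}}
found = {"geneA_S10L", "geneC_A1T"}
genotypes, geno_variants = determine_genotype_sketch(catalog, found)
print(genotypes)      # {'DRUG_X': 1.0, 'DRUG_Y': 0.0}
print(geno_variants)  # {'DRUG_X': 'geneA_S10L', 'DRUG_Y': ''}
```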

geno2phenotb.preprocess.preprocess(fastq_dir, output_dir, sample_id, skip_mtbseq=False)[source]

Runs all preprocessing steps.

Parameters:
  • fastq_dir (str) – Path to directory containing the FASTQ files.

  • output_dir (str) – Path to output directory. Two files are written to this directory. A file named ‘<sample_id>_resistant_genotype_variants.tsv’ with resistance-related variants per drug and a file named ‘<sample_id>_extracted_features.tsv’ with per-drug genotypes.

  • sample_id (str) – Sample ID.

  • skip_mtbseq (bool, default=False) – Do not run MTBseq but use preprocessed data.

Returns:

features – The features (incl. called variants, lineage classification, and genotypes) extracted from the supplied FASTQ file(s).

Return type:

pd.Series, default=None

geno2phenotb.preprocess.run_mtbseq(fastq_dir, sample_id)[source]

Execute MTBseq for a single isolate.

Parameters:
  • fastq_dir (str) – Path to the directory where the FASTQ file(s) belonging to a single isolate are located.

  • sample_id (str) – SampleID, i.e. run accession (ERR/SRR).

Return type:

None

geno2phenotb.utils module

Small utility functions.

geno2phenotb.utils.check_fastq_filenames(fastq_dir, sample_id)[source]

Checks whether the FASTQ files in a folder follow the MTBseq naming scheme:

[SampleID]_[LibID]_[*]_[Direction].f(ast)q.gz
                    ^- Optional values.

Direction must be one of R1, R2.

Parameters:

fastq_dir (str) – Path to the directory containing the fastq files.

Returns:

Throws an error if the names do not follow the assumed scheme.

Return type:

None
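The naming scheme above can be expressed as a regular expression. The pattern below is a sketch, not the package's implementation. It assumes that SampleID and LibID contain no underscores, that LibID is mandatory, and that the leading field must equal the given sample_id.

```python
import re

# [SampleID]_[LibID]_[*]_[Direction].f(ast)q.gz, where [*] is optional.
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_(?P<lib>[^_]+)(?:_(?P<extra>.+))?"
    r"_(?P<direction>R1|R2)\.f(?:ast)?q\.gz$"
)


def follows_mtbseq_scheme(file_name: str, sample_id: str) -> bool:
    """Return True if file_name matches the scheme and belongs to sample_id."""
    match = FASTQ_NAME.match(file_name)
    return match is not None and match.group("sample") == sample_id


print(follows_mtbseq_scheme("ERR551304_lib1_R1.fastq.gz", "ERR551304"))     # True
print(follows_mtbseq_scheme("ERR551304_lib1_150bp_R2.fq.gz", "ERR551304"))  # True
print(follows_mtbseq_scheme("ERR551304_R1.fastq.gz", "ERR551304"))          # False
```

The last name fails because it has no LibID field between the sample ID and the direction.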

geno2phenotb.utils.check_output(output_dir, ground_truth_dir, sample_id, only_preprocess)[source]

Checks the resistance prediction, extracted features and feature importance evaluation against the ground truth.

Throws an assertion error if the files do not match up.

Parameters:
  • output_dir (str) – Output directory of prediction.

  • ground_truth_dir (str) – Directory of ground truth files.

  • sample_id (str) – ID of sample.

  • only_preprocess (bool) – If True, check only the preprocess output.

Returns:

Throws an exception if the files do not match.

Return type:

bool, default=True

geno2phenotb.utils.get_amino_ann()[source]

Returns regex of amino-acid annotations.

Return type:

str

geno2phenotb.utils.get_aminos()[source]

Returns regex of amino-acids.

Return type:

str

geno2phenotb.utils.get_drugs()[source]

Returns a list of two- or three-letter drug codes.

Return type:

List[str]

geno2phenotb.utils.get_key_genes()[source]

Returns a dict of genes used to determine the genotype of an isolate.

Return type:

Dict[str, List[str]]

geno2phenotb.utils.get_lineages()[source]

Returns a list of lineages.

Return type:

List[str]

geno2phenotb.utils.get_rules(drug)[source]

Returns the rules learned by the Rule-Based Classifier.

Parameters:

drug (str) – Drug for which the rule, obtained from a Rule-Based Classifier, shall be returned.

Return type:

Tuple[List[int], bool]

Returns:

  • rule (list[int]) – List of integer indices to index the features that are resistance-causing.

  • geno_only (bool) – If True, the returned rule contains only the index of the FZB genotype feature.

geno2phenotb.utils.get_static_dir()[source]

Returns the absolute path of the static folder.

Return type:

str

geno2phenotb.utils.stripper(x)[source]

Strips string / float input. Throws error if type does not match.

Return type:

Union[str, float]

geno2phenotb.vcf_columns_extractor module

Collects all variants appearing in a VCF into a set ‘columns’ and returns it.

geno2phenotb.vcf_columns_extractor.vcf_columns_extractor(in_path)[source]

Collect all variants from an MTBseq VCF file into a set and return it.

This function serves to collect all variants from a variant call file into a set columns and return it.

Parameters:

in_path (str) – Path to the VCF file to extract variants from.

Return type:

Tuple[Optional[Set[str]], Optional[str]]

Returns:

  • columns (set) – Set of all variants extracted from the given VCF.

  • identifier (str) – Identifier (single ENA run accession number or combination thereof) extracted from the VCF file name.

geno2phenotb.vcf_columns_extractor_geno module

Collects variants relevant to the resistance genotype appearing in a VCF.

geno2phenotb.vcf_columns_extractor_geno.vcf_columns_extractor_geno(in_path, resistance_variants_set)[source]

Collect known resistance variants from a VCF file into a set and return it.

This function collects known resistance variants, which are relevant for determining the resistance genotype, from a variant call file into a set columns and returns it. The variants are collected from the VCF based on a set resistance_variants_set of known resistance variants. Additionally, all InDels and early stop codons (e.g. A123_) in the following genes (not mentioned in the catalog) are extracted as well, since they result in a resistant genotype: ethA=Rv3854c (ETH), pncA=Rv2043c (PZA), gidB/gid=Rv3919c (STR), rpoB=Rv0667 (RIF), Rv0678 (BDQ/CFZ), ald=Rv2780 (DCS), katG=Rv1908c (INH), ddn=Rv3547 (DLM), tlyA=Rv1694 (CAP).

Parameters:
  • in_path (str) – Path to the VCF file to extract variants from.

  • resistance_variants_set (set) – Set of all known resistance variants.

Return type:

Tuple[Optional[Set[str]], str]

Returns:

  • columns (set or None) – Set of all variants extracted from the given VCF. If no resistance variants were found columns will be None.

  • identifier (str) – Identifier (single ENA run accession number or combination thereof) extracted from the VCF file name.

Module contents

Init of geno2phenotb package.