Welcome to diffxpy’s documentation!¶
Installation¶
We assume that you have a python environment set up.
First, you need to install batchglm, which depends on the PyPI packages tensorflow and tensorflow-probability. You can install these dependencies from source to optimize them for your hardware, which can improve performance. Note that both packages also have GPU versions, which allow you to run the run-time limiting steps of diffxpy on GPUs. The simplest way to install these dependencies is via pip:
pip install tf-nightly
pip install tfp-nightly
The nightly versions of tensorflow and tensorflow-probability are the most up-to-date versions of these packages. Alternatively, you can also install the major releases:
pip install tensorflow
pip install tensorflow-probability
You can then install batchglm from source using the repository on GitHub:

1. Choose a directory where you want batchglm to be located and cd into it.
2. Clone the batchglm repository into this directory.
3. cd into the root directory of batchglm.
4. Install batchglm from source:
pip install -e .
Finally, you can install diffxpy from source using the repository on GitHub:

1. Choose a directory where you want diffxpy to be located and cd into it.
2. Clone the diffxpy repository into this directory.
3. cd into the root directory of diffxpy.
4. Install diffxpy from source:
pip install -e .
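Taken together, the source installation steps above can be sketched as follows (this assumes the repositories live under the theislab GitHub organization and that git and pip are on your PATH):

```shell
# Clone and install batchglm in editable mode.
git clone https://github.com/theislab/batchglm.git
cd batchglm
pip install -e .
cd ..

# Clone and install diffxpy in editable mode.
git clone https://github.com/theislab/diffxpy.git
cd diffxpy
pip install -e .
cd ..
```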
You can now use diffxpy in a python session via the following import:
import diffxpy.api as de
API¶
Import diffxpy’s high-level API as:
import diffxpy.api as de
Differential expression tests: test¶
Run differential expression tests. diffxpy distinguishes between single tests and multi tests: single tests perform a single hypothesis test for each gene, whereas multi tests perform multiple tests per gene.
Single tests per gene¶
Single tests per gene are the standard differential expression scenario in which one p-value is computed per gene. diffxpy provides infrastructure for likelihood ratio tests, Wald tests, t-tests and Wilcoxon tests.
- Perform differential expression test between two groups on an adata object for each gene.
- Perform Wald test for differential expression for each gene.
- Perform log-likelihood ratio test for differential expression for each gene.
- Perform Welch's t-test for differential expression between two groups on an adata object for each gene.
- Perform Mann-Whitney rank test (Wilcoxon rank-sum test) for differential expression between two groups on an adata object for each gene.
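Conceptually, each of these single tests produces one p-value per gene. The following sketch shows what a per-gene two-sample test computes, using scipy directly on a simulated count matrix (this is an illustration of the statistics, not diffxpy's implementation; diffxpy additionally applies multiple-testing correction across genes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 5
x = rng.poisson(lam=5.0, size=(n_cells, n_genes)).astype(float)
groups = np.repeat([0, 1], n_cells // 2)  # two groups of cells

pvals_t = np.empty(n_genes)
pvals_rank = np.empty(n_genes)
for g in range(n_genes):
    a, b = x[groups == 0, g], x[groups == 1, g]
    # Welch's t-test (unequal variances), one test per gene
    pvals_t[g] = stats.ttest_ind(a, b, equal_var=False).pvalue
    # Mann-Whitney / Wilcoxon rank-sum test, one test per gene
    pvals_rank[g] = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
```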
Multiple tests per gene¶
diffxpy provides infrastructure to perform multiple tests per gene as:

- pairwise: pairwise comparisons across more than two groups (de.test.pairwise, e.g. clusters of cells against each other)
- versus_rest: tests of each group against the rest (de.test.versus_test, e.g. clusters of cells against the rest)
- partition: mapping a given differential test across each partition of a data set (de.test.partition, e.g. performing differential tests for treatment effects by a second experimental covariate or by cluster of cells).
- Perform pairwise differential expression tests between two groups on an adata object for each gene, for all combinations of pairs of groups.
- Perform differential expression tests on an adata object for each gene, for each group versus the rest of the data set.
- Perform differential expression test for each group.
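The pairwise scheme can be sketched as follows: one two-sample test per gene for every pair of groups. This is a conceptual illustration with a Welch t-test via scipy, not diffxpy's implementation; de.test.pairwise organizes this kind of result grid for you:

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_genes = 30, 4
labels = np.repeat(["A", "B", "C"], n_per_group)
x = rng.normal(size=(labels.size, n_genes))

# One vector of per-gene p-values for each pair of groups.
pvals = {}
for g1, g2 in itertools.combinations(["A", "B", "C"], 2):
    pvals[(g1, g2)] = np.array([
        stats.ttest_ind(x[labels == g1, j], x[labels == g2, j],
                        equal_var=False).pvalue
        for j in range(n_genes)
    ])
# pvals[("A", "B")] holds one p-value per gene for the A-vs-B comparison.
```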
Gene set enrichment: enrich¶
diffxpy provides infrastructure for gene set enrichment analysis downstream of differential expression analysis. Specifically, reference gene set annotation data sets can be loaded or created and can be compared to diffxpy objects or results from other differential expression tests.
Reference gene sets¶
- Class for a list of gene sets.
Enrichment tests¶
- Perform gene set enrichment.
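As a rough sketch of what such an enrichment test computes, consider the standard hypergeometric overlap test between a hit list of differentially expressed genes and a reference gene set (this illustrates the statistic, not necessarily diffxpy's exact implementation; all numbers below are made up):

```python
from scipy.stats import hypergeom

n_background = 10000   # all tested genes
n_in_set = 200         # genes annotated to the reference gene set
n_hits = 300           # differentially expressed genes
n_overlap = 20         # DE genes that are also in the gene set

# P(overlap >= 20) when drawing 300 genes at random from the background;
# a small p-value indicates enrichment of the gene set among the hits.
pval = hypergeom.sf(n_overlap - 1, n_background, n_in_set, n_hits)
```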
Fit model to gene expression: fit¶
Diffxpy allows the user to fit models to gene expression without conducting Wald or likelihood ratio tests. Note that one can also extract similar model fits from differential expression test output objects if a Wald or likelihood ratio test was used. Alternatively, residuals can also be computed directly. As for differential expression tests, the fitting can be distributed across multiple partitions of the data set (such as conditions or cell types).
- Fit model via maximum likelihood for each gene.
- Fit model for each gene and return residuals.
- Perform differential expression test for each group.
Tutorials¶
Differential testing¶
We grouped tutorials by differential expression concepts:

- Introduction to differential expression testing.
- Differential expression analysis with continuous covariates such as time, concentration, pseudotime or space.
- How to run multiple tests per gene.
Additionally, we also provide links to tutorials that discuss specific concepts as a subset of the tutorial:
Single tests per gene¶
- How to perform likelihood-ratio tests (lrt).
- How to perform Wald tests for a single parameter.
- How to perform Wald tests for multiple parameters.
- How to perform t-tests.
- How to perform Wilcoxon (rank-sum) tests.
Map single tests across partitions of data set¶
Diffxpy allows you to define a data set partition and to conduct tests on each gene in each partition. This is shown here: https://nbviewer.jupyter.org/github/theislab/diffxpy_tutorials/tree/master/diffxpy_tutorials/test/multiple_tests_per_gene.ipynb
Multiple tests per gene¶
How to perform pairwise tests, group-versus-rest tests and tests within each partition.
Gene set enrichment: enrich¶
How to conduct a gene set enrichment workflow (enrich).
Example work-flows on real data sets¶
Will be added soon.
Parallelization¶
Most of the heavy computation within diffxpy functions is carried out by batchglm. batchglm uses numpy and tensorflow for the run-time limiting linear algebra operations. Both tensorflow and numpy may show different parallelization behaviour depending on the operating system. Here, we describe how one can limit the number of cores used by diffxpy by controlling its dependencies, numpy and tensorflow. Note that these limits may not be necessary on all platforms; also note that such limits lead to suboptimal performance given the total resources of your machine.
tensorflow¶
Tensorflow multi-threading has to be set before batchglm (and therefore diffxpy) is imported into a python session. Accordingly, you have to restart your python session if you want to change the current parallelization settings. Parallelization of tensorflow can be controlled via the following two environment variables:
# Before importing diffxpy.api or batchglm.api in your python session, execute:
import os
os.environ.setdefault("TF_NUM_THREADS", "1")
os.environ.setdefault("TF_LOOP_PARALLEL_ITERATIONS", "1")
import diffxpy.api as de
TF_NUM_THREADS controls the number of threads that are used for linear algebra operations in tensorflow, which controls parallelization during training. TF_LOOP_PARALLEL_ITERATIONS controls the number of parallel iterations in tensorflow while_loops, which are used during Hessian computation. Here, we set both to one so that only one thread is used by tensorflow within diffxpy.
The environment variables are checked upon loading of batchglm and are converted into package constants which control the parallelization behaviour of tensorflow. These package constants can also be set after package loading, but they no longer take effect once a tensorflow session has been started. If you want to set parallelization behaviour after loading the package but before first using it, you can therefore run:
import diffxpy.api as de
import batchglm.pkg_constants as pkg_constants

pkg_constants.TF_CONFIG_PROTO.inter_op_parallelism_threads = 1
pkg_constants.TF_CONFIG_PROTO.intra_op_parallelism_threads = x
# Assign via the module so that the package constant itself is updated;
# rebinding a name imported with "from ... import" would not affect batchglm.
pkg_constants.TF_LOOP_PARALLEL_ITERATIONS = x
where x is the number of threads (integer) to be used within diffxpy.
numpy/scipy¶
Numpy/scipy multi-threading in the linalg sub-modules can be controlled from the shell in which the python session that uses diffxpy is started (e.g. the shell from which jupyter notebook is called):
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
Here, we restricted the number of threads to be used by numpy to 1. Numpy is not used for the run-time determining parameter estimation steps, so a larger number of threads has little effect on the overall run time. So far, we have only observed these limits to be necessary on some Linux operating systems.
Training¶
Parameter estimation in diffxpy¶
diffxpy performs parameter estimation for generalized linear models (GLMs) with batchglm. GLMs are necessary for Wald tests and likelihood ratio tests, but not for t-tests and Wilcoxon rank-sum tests. batchglm exploits closed-form maximum likelihood estimators for GLMs where possible, but often numerical parameter estimation is necessary. Parameters of GLMs can be estimated with iteratively weighted least squares (IWLS) (exponential-family GLMs) or via standard methods for maximum likelihood estimation which are based on local approximations of the objective function (e.g. gradient descent). The latter cover a larger range of variance models and are applicable to all noise models, and were therefore chosen for batchglm. However, these methods often come with hyper-parameters (such as learning rates). While differential expression frameworks often hide training from the user, diffxpy exposes training details so that training can be monitored and hyper-parameters optimized. To reduce the coding effort and technical knowledge necessary for this, we expose core hyper-parameters within "training strategies".
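The idea of numerical maximum likelihood estimation via local approximations can be sketched in a few lines: below, a Poisson GLM with log link is fit by plain gradient ascent on the log-likelihood (a minimal conceptual sketch; batchglm uses the same principle with more sophisticated optimizers and noise models):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
# Design matrix: intercept plus a binary condition covariate.
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)]).astype(float)
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

# Gradient ascent on the Poisson log-likelihood with log link:
# grad = X^T (y - mu) with mu = exp(X beta).
beta = np.zeros(p)
lr = 5e-3  # learning rate: the kind of hyper-parameter training strategies expose
for _ in range(2000):
    mu = np.exp(X @ beta)
    beta += lr * (X.T @ (y - mu)) / n
# beta now approximates the maximum likelihood estimate of beta_true.
```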
Training strategies¶
Training strategies give the user the opportunity to change optimizer defaults such as the optimization algorithm, learning rates, optimizer schedules (multiple optimizers) and convergence criteria. Please post issues on GitHub if you notice that your model does not converge with the default optimizer.
Models¶
Occurrence of estimator objects in diffxpy¶
GLMs and similar models are a main model class for differential expression analysis with Wald and likelihood ratio tests (LRT).
Diffxpy allows the user to choose between different GLMs based on the noise model argument.
The user can select the covariates that are to be modelled based on formulas or by supplying design matrices directly.
Both the Wald test (de.test.wald) and the LRT (de.test.lrt) require fitting GLMs to the given data. These fits can be extracted from the differential expression test objects that are returned by the de.test.* functions: these objects are called model_estim in the case of the Wald test, or full_estim and reduced_estim for the LRT (for the full and reduced model, respectively). Similarly, one can use de.fit.model to directly produce such an estimator object.
Structure of estimator objects¶
These estimator objects are the interface between diffxpy and batchglm and can also be produced directly with batchglm. An estimator object contains various attributes that relate to the estimation procedure, and a .model attribute that contains an executable (numpy) version of the estimated model. The model instance contains the raw parameter estimates and functions that compute downstream model characteristics, such as location and scale parameter estimates in a generalized linear model, the equivalent of \(\hat{y}\) in a simple feed-forward neural network. The names of these model attributes depend on the noise model and are listed below.
Generalized linear models (GLMs)¶
The estimated parameters of the location and scale model are in estim.model.a_var (location) and estim.model.b_var (scale). The corresponding parameter names are in estim.model.loc_names and estim.model.scale_names. The observation- and feature-wise location and scale predictions after application of the design matrix and inverse link function are in estim.model.location and estim.model.scale.
For a negative binomial distribution model, the location model corresponds to the mean model and the scale model corresponds to the dispersion model. For a normal distribution model, the location model corresponds to the mean model and the scale model corresponds to the standard deviation model.
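The relationship between the raw parameter estimates and the per-observation location values can be sketched with numpy: apply the design matrix and the inverse link function (exp for the log link used with the negative binomial model). This is a conceptual illustration with made-up numbers, mirroring what estim.model.location provides:

```python
import numpy as np

# Design matrix: 4 observations, intercept + binary condition.
design_loc = np.array([[1, 0],
                       [1, 0],
                       [1, 1],
                       [1, 1]], dtype=float)
# Hypothetical per-gene location coefficients (rows: coefficients, cols: genes).
a_var = np.array([[0.5, 1.0],    # intercept coefficient per gene
                  [1.0, -0.5]])  # condition coefficient per gene

# Linear predictor, then inverse link: location has shape (n_obs, n_genes).
location = np.exp(design_loc @ a_var)
```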