Mathematical method

Documentation on PWM Transformation to Log-Odds PSSM

This documentation explains how to transform a Position Weight Matrix (PWM) into a log-odds Position Specific Scoring Matrix (PSSM), covering the calculation of scores, background nucleotide frequencies, pseudocounts, normalization, and p-value computation.

1.1 PWM to Log-Odds PSSM Transformation

A PWM (Position Weight Matrix) represents the frequency of nucleotides (A, T, G, C) at each position of a sequence motif, with each value indicating the likelihood of the nucleotide appearing at that position.

To convert the PWM into a log-odds PSSM, compute a log-odds score for each nucleotide at each position by comparing its observed frequency at that position with its background frequency.

Log-Odds Formula:

$$ \text{PSSM}[i][a] = \log_2 \left( \frac{P_\text{PWM}[i][a]}{P_\text{bg}(a)} \right) $$
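- $i$: position in the matrix
- $a$: the nucleotide (A, C, G or T)
- $P_\text{PWM}[i][a]$: frequency of nucleotide $a$ at position $i$ in the PWM
- $P_\text{bg}(a)$: background frequency of nucleotide $a$
- Example: with $P_\text{PWM}[i][a] = 0.5$ and $P_\text{bg}(a) = 0.25$, the score is $\log_2(0.5 / 0.25) = 1$. A nucleotide observed at exactly its background frequency scores 0.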
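
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not TFinder's actual code: the function name `pwm_to_log_odds`, the list-of-dicts layout and the toy matrix are all assumptions made for the example, and every probability is assumed to be strictly positive (see the pseudocount section for why).

```python
import math

def pwm_to_log_odds(pwm, background):
    """Convert per-position probabilities into log2-odds scores.

    pwm: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> background probability.
    Hypothetical helper for illustration only.
    """
    pssm = []
    for column in pwm:
        # PSSM[i][a] = log2(P_PWM[i][a] / P_bg(a))
        pssm.append({base: math.log2(p / background[base])
                     for base, p in column.items()})
    return pssm

# Toy two-position motif (made-up values) with a uniform background.
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_pwm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]
pssm = pwm_to_log_odds(toy_pwm, uniform_bg)
print(pssm[0])
```

Positive scores mark nucleotides that are enriched relative to the background, negative scores mark depleted ones, and a probability of 0 would make `math.log2` raise an error.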

1.2 Pseudocount

To avoid $\log(0)$, TFinder calculates a pseudocount following the Perl TFBS module, using the formula √N × bg[nucleotide], where N is the total number of sequences used to construct the matrix. If necessary, users can also set the value themselves (Advanced Settings > Pseudocount).

A pseudocount is a small, positive value added to each frequency in the matrix to avoid zeros in the PWM (which can lead to issues during computation, especially with logarithms). Pseudocounts help smooth the frequencies and ensure that all positions have a non-zero value.

When converting the PWM to a log-odds PSSM, the pseudocount is computed automatically with the formula used by the Perl TFBS module.

Pseudocount Formula:

$$ \text{Pseudocount} = \sqrt{N} \times \text{bg[nucleotide]} $$
- N: Total number of sequences used to construct the matrix.
- bg[nucleotide]: Background frequency of the nucleotide.

Pseudocounts prevent rare or unobserved nucleotides from producing undefined scores and act as a form of regularization, which keeps scores more stable when the matrix is built from a small number of sequences.
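
As a rough illustration of the pseudocount formula, the sketch below applies it to a single matrix column. It assumes that the pseudocount is added to the raw counts and that the column is then renormalized so it sums to 1. That application step, the helper name `smoothed_column` and the example counts are assumptions for the example, not details taken from TFinder or the TFBS module source.

```python
import math

def smoothed_column(counts, background):
    """Turn one column of raw counts into non-zero probabilities.

    counts: dict mapping base -> number of sequences with that base.
    background: dict mapping base -> background probability.
    Hypothetical helper; the renormalization step is an assumption.
    """
    n = sum(counts.values())  # N: number of sequences in the column
    # Pseudocount = sqrt(N) * bg[nucleotide]
    pseudo = {base: math.sqrt(n) * background[base] for base in counts}
    total = n + sum(pseudo.values())
    return {base: (counts[base] + pseudo[base]) / total for base in counts}

# Made-up column: 16 sequences, all with A at this position.
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column = {"A": 16, "C": 0, "G": 0, "T": 0}
probs = smoothed_column(column, uniform_bg)
print(probs)
```

With N = 16 and a uniform background, each nucleotide receives a pseudocount of √16 × 0.25 = 1, so the unobserved bases end up with a small non-zero probability and their log-odds scores stay finite.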

The pseudocount is calculated automatically but can be set manually in the advanced options of TFinder. It is then applied to both the Score and the Adjusted Score.

1.3 Background Nucleotide Frequencies

The background frequencies refer to the expected nucleotide frequencies in the environment (e.g., the genome, a set of sequences) where the motif is found. These frequencies are important to normalize the observed frequencies in the PWM.

In TFinder, the Score is meant to reflect only the similarity between the TFBS and the analyzed sequence, independently of the nucleotide composition of that sequence. For this reason, the background of the PSSM is set to 0.25 for each nucleotide ONLY when calculating the Score (and, by extension, the Relative Score). For the Adjusted Score (and the Relative Adjusted Score), the background is calculated directly from the sequence of interest or taken from user input in the advanced options.
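
The two background choices can be sketched as follows. This is only an illustration under stated assumptions: the names `UNIFORM_BACKGROUND` and `sequence_background` are hypothetical, and ignoring characters other than A, C, G and T is a choice made for the example, since how TFinder treats ambiguous bases is not described here.

```python
from collections import Counter

# Background used for the Score: 0.25 for each nucleotide.
UNIFORM_BACKGROUND = {base: 0.25 for base in "ACGT"}

def sequence_background(sequence):
    """Estimate background frequencies from a sequence of interest.

    Hypothetical helper for the Adjusted Score case. Characters other
    than A, C, G and T are skipped (an assumption of this sketch).
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}

# Made-up sequence: 4 A, 2 C, 2 G, 2 T.
bg = sequence_background("AACGTTAAGC")
print(bg)
```

A nucleotide that never occurs in the sequence would get a background of 0 in this sketch, which would make the log-odds ratio undefined, so a real implementation would need to handle that case.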