
PCTSEA guide for developers

Salvador Martínez de Bartolomé edited this page Jul 29, 2021 · 15 revisions

This is a detailed description of how the PCTSEA analysis is implemented in Java.

This guide is intended to help future developers continue improving this software.

Flow chart:

See a flow chart here

Both the command-line and web versions depend on the pctsea-core module, where the PCTSEA.java class is defined and the logic of the analysis is implemented, specifically in the run() method, which is commented throughout so that it can be followed step by step.

This class is well documented, with many comments throughout the code; the sections below add further information that may be useful:

To change the scoring methods, the developer should focus on the method

private int calculateScoresToRankSingleCells(
   List<SingleCell> singleCellList,
   GeneExpressionsRetriever interactorExpressions, 
   ScoringSchema scoringSchema, 
   boolean writeScoresFile,
   boolean outputToLog, 
   boolean getExpressionsUsedForScore, 
   boolean takeZerosForCorrelation,
   double minCorrelation) throws IOException

where, depending on the ScoringMethod of the ScoringSchema, a different score is calculated for each SingleCell. The single cells are then sorted by that score into a ranked list, which is used in the Kolmogorov-Smirnov test that calculates the enrichment score.

Similarity score calculation per single cell:

Inside this method, a switch statement calls the appropriate method depending on the ScoringMethod:

switch (scoringMethod) {
   case PEARSONS_CORRELATION:
      singleCell.calculateCorrelation(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
      break;
   case SIMPLE_SCORE:
      singleCell.calculateSimpleScore(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
      break;
   case DOT_PRODUCT:
      singleCell.calculateDotProductScore(interactorExpressions, takeZerosForCorrelation, getExpressionsUsedForScore); 
      break;
   case REGRESSION:
      singleCell.calculateRegressionCoefficient(interactorExpressions, getExpressionsUsedForScore);
      break;
   default:
      throw new IllegalArgumentException("Method " + scoringMethod.getScoreName() + " still not supported.");
}

As you can see, each score is actually computed inside the corresponding singleCell object.
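To make the idea of a per-cell similarity score concrete, here is a minimal sketch of two of the scores named in the switch (Pearson's correlation and dot product) computed on plain arrays. The class ScoringSketch and its methods are hypothetical and are not part of pctsea-core; the real SingleCell methods take additional parameters (such as minCorrelation and takeZerosForCorrelation) that this sketch does not model.

```java
// Hypothetical sketch, not the PCTSEA implementation: it only illustrates
// how a similarity score between a cell's expression vector and the input
// protein list's expression vector could be computed.
final class ScoringSketch {

    /** Pearson correlation of two equal-length vectors; NaN if either has zero variance. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("Vectors must be non-empty and of equal length");
        }
        final int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            final double dx = x[i] - meanX;
            final double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            return Double.NaN; // correlation is undefined for a constant vector
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Plain dot product of two equal-length vectors. */
    static double dotProduct(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must be of equal length");
        }
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }
}
```

Whichever score is used, the single cells are ranked by its value, so a new scoring method mainly needs to return one comparable number per cell.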

Enrichment score calculation per cell type:

Once every single cell has a similarity score against the input protein list, the ranked list of single cells is used in a Kolmogorov-Smirnov test, following an approach similar to Gene Set Enrichment Analysis (GSEA). This is implemented in the method calculateEnrichmentScore, and the enrichment scores are stored in the CellTypeClassification objects.
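The following sketch shows the general shape of a Kolmogorov-Smirnov-like running sum over a ranked list, in the unweighted form often used to introduce GSEA. It is an assumption-laden illustration: the class EnrichmentScoreSketch, its method, and the use of plain String labels are hypothetical, and the actual calculateEnrichmentScore method may weight or normalize the steps differently.

```java
import java.util.List;

// Hypothetical sketch, not the PCTSEA implementation.
final class EnrichmentScoreSketch {

    /**
     * Unweighted KS-like enrichment score for one cell type.
     *
     * @param rankedCellTypes cell type label of each single cell, already
     *                        sorted from best to worst similarity score
     * @param cellType        the cell type being tested
     * @return the running-sum value with the largest absolute deviation from zero
     */
    static double enrichmentScore(List<String> rankedCellTypes, String cellType) {
        int hits = 0;
        for (String t : rankedCellTypes) {
            if (t.equals(cellType)) {
                hits++;
            }
        }
        final int misses = rankedCellTypes.size() - hits;
        if (hits == 0 || misses == 0) {
            return 0.0; // no contrast between the cell type and the rest
        }

        double running = 0.0;
        double best = 0.0;
        for (String t : rankedCellTypes) {
            // Step up on a cell of the tested type, step down otherwise.
            running += t.equals(cellType) ? 1.0 / hits : -1.0 / misses;
            if (Math.abs(running) > Math.abs(best)) {
                best = running;
            }
        }
        return best;
    }
}
```

In this sketch, a cell type whose cells are concentrated at the top of the ranking gets a score close to 1, and one concentrated at the bottom gets a score close to -1.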

Enrichment score significance calculation per cell type:

Then, following the same principles described in the GSEA article, we calculate the significance of the enrichment scores by randomly permuting the cell types of the single cells and recalculating the enrichment scores until we have a distribution from which to calculate a p-value. This is implemented in the method calculateSignificanceByCellTypesPermutations, which, after permuting the cell types, calls the method calculateEnrichmentScore with the parameter flag permutatedData=true. The p-value associated with the real enrichment score x of each cell type is then the number of random enrichment scores x' greater than or equal to x divided by the total number of random enrichment scores obtained for that cell type.
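As a rough illustration of the two steps described above (shuffling the cell type labels, then turning the random scores into a p-value), consider the sketch below. The class PermutationPValueSketch and both of its methods are hypothetical names introduced for this example; they are not the code in calculateSignificanceByCellTypesPermutations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not the PCTSEA implementation.
final class PermutationPValueSketch {

    /** Returns a shuffled copy of the cell type labels; the input list is left untouched. */
    static List<String> permuteLabels(List<String> cellTypes, Random rng) {
        final List<String> shuffled = new ArrayList<>(cellTypes);
        Collections.shuffle(shuffled, rng);
        return shuffled;
    }

    /**
     * Empirical p-value: the number of random enrichment scores that are
     * greater than or equal to the real one, divided by the total number
     * of random scores for that cell type.
     */
    static double pValue(double realScore, double[] randomScores) {
        if (randomScores.length == 0) {
            throw new IllegalArgumentException("At least one random score is required");
        }
        int count = 0;
        for (double random : randomScores) {
            if (random >= realScore) {
                count++;
            }
        }
        return (double) count / randomScores.length;
    }
}
```

For example, with a real score of 0.8 and random scores 0.1, 0.5, 0.8 and 0.9, two of the four random scores are at least 0.8, so the p-value in this sketch is 0.5.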

Enrichment score False Discovery Rate per cell type:

Once we have a p-value per cell type, we calculate an FDR for each cell type using the real enrichment scores xt of all cell types t and the random enrichment scores x't of all cell types t. The FDR for a given cell type t is the number of random enrichment scores that are greater than or equal to xt (snull) divided by the number of real enrichment scores that are greater than or equal to xt (sobs). However, a normalization factor by the number of cells in the cell type t is applied to that number. See the following lines of code:

// nobs is the total number of real scores
// nnull is the total number of random scores
final int nobs = totalRealNormalizedScores.size();
final int nnull = totalRandomNormalizedScores.size();
fdr = (1.0 * snull / sobs) * (1.0 * nobs / nnull);

This is implemented at the end of the method calculateSignificanceByCellTypesPermutations.
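The formula above can be tried on small arrays with the following sketch. The class FdrSketch and its method are hypothetical; only the final expression mirrors the line of code quoted above, and the handling of the case sobs = 0 is an assumption made for this example.

```java
// Hypothetical sketch, not the PCTSEA implementation: it applies
// fdr = (snull / sobs) * (nobs / nnull) to plain arrays of scores.
final class FdrSketch {

    /**
     * @param xt           real enrichment score of the cell type of interest
     * @param realScores   real enrichment scores of all cell types
     * @param randomScores random enrichment scores of all cell types
     */
    static double fdr(double xt, double[] realScores, double[] randomScores) {
        int sobs = 0; // real scores >= xt
        for (double s : realScores) {
            if (s >= xt) {
                sobs++;
            }
        }
        int snull = 0; // random scores >= xt
        for (double s : randomScores) {
            if (s >= xt) {
                snull++;
            }
        }
        final int nobs = realScores.length;    // total number of real scores
        final int nnull = randomScores.length; // total number of random scores
        if (sobs == 0 || nnull == 0) {
            return Double.NaN; // assumption: undefined when a denominator is zero
        }
        return (1.0 * snull / sobs) * (1.0 * nobs / nnull);
    }
}
```

With real scores 0.9, 0.5 and 0.2, random scores 0.95, 0.6, 0.4, 0.3, 0.1 and 0.05, and xt = 0.5, this gives sobs = 2, snull = 2, nobs = 3 and nnull = 6, so the FDR is (2 / 2) * (3 / 6) = 0.5.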
