PCTSEA guide for developers
This guide is intended to help future developers continue improving this software.
See a flow chart here
Both the command-line and web versions depend on the pctsea-core module, where the PCTSEA.java class is defined and the analysis logic is implemented. Most of that logic lives in the run() method, which is heavily commented so that it can be followed step by step.
Although the class is well documented, with many comments throughout the code, the notes below give additional information that may be useful.
To change the scoring methods, the developer should focus on the following method:
private int calculateScoresToRankSingleCells(
        List<SingleCell> singleCellList,
        GeneExpressionsRetriever interactorExpressions,
        ScoringSchema scoringSchema,
        boolean writeScoresFile,
        boolean outputToLog,
        boolean getExpressionsUsedForScore,
        boolean takeZerosForCorrelation,
        double minCorrelation) throws IOException
Depending on the ScoringMethod of the ScoringSchema, this method calculates a different score for each SingleCell. The scores are used to order the single cells in a ranked list, which is the input of the Kolmogorov-Smirnov test used to calculate the enrichment score.
Inside this method, a switch statement calls the appropriate method depending on the ScoringMethod:
switch (scoringMethod) {
case PEARSONS_CORRELATION:
    singleCell.calculateCorrelation(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
    break;
case SIMPLE_SCORE:
    singleCell.calculateSimpleScore(interactorExpressions, getExpressionsUsedForScore, minCorrelation);
    break;
case DOT_PRODUCT:
    singleCell.calculateDotProductScore(interactorExpressions, takeZerosForCorrelation, getExpressionsUsedForScore);
    break;
case REGRESSION:
    singleCell.calculateRegressionCoefficient(interactorExpressions, getExpressionsUsedForScore);
    break;
default:
    throw new IllegalArgumentException("Method " + scoringMethod.getScoreName() + " still not supported.");
}
Note that each score is actually computed inside the corresponding singleCell object.
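As a rough illustration of the score-then-rank idea, here is a minimal, self-contained sketch of a Pearson correlation score followed by a descending sort. The class ScoreRankingSketch, the ScoredCell type and the method names are hypothetical and are not part of pctsea-core; the real SingleCell.calculateCorrelation may treat zeros, missing genes and minCorrelation differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ScoreRankingSketch {

    /** Hypothetical stand-in for a scored single cell (not the PCTSEA SingleCell class). */
    static final class ScoredCell {
        final String name;
        final double score;

        ScoredCell(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    /** Pearson correlation of two equally sized vectors; NaN if either vector has zero variance. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("Vectors must be non-empty and of equal length");
        }
        final int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0;
        double varX = 0;
        double varY = 0;
        for (int i = 0; i < n; i++) {
            final double dx = x[i] - meanX;
            final double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Scores each cell against the input profile and returns the cells sorted from highest to lowest score. */
    static List<ScoredCell> rank(List<String> names, List<double[]> cellExpressions, double[] inputProfile) {
        final List<ScoredCell> ranked = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            final double score = pearson(cellExpressions.get(i), inputProfile);
            if (!Double.isNaN(score)) { // assumption of this sketch: cells without a defined score are skipped
                ranked.add(new ScoredCell(names.get(i), score));
            }
        }
        ranked.sort(Comparator.comparingDouble((ScoredCell c) -> c.score).reversed());
        return ranked;
    }
}
```

A new scoring method would follow the same pattern: compute one number per single cell, then let the ranking step order the cells by that number.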
Once all single cells have a similarity score against the input protein list, the ranked list of single cells is used in a Kolmogorov-Smirnov test, following an approach similar to Gene Set Enrichment Analysis (GSEA). This is implemented in the method calculateEnrichmentScore, and the resulting enrichment scores are stored in the CellTypeClassification objects.
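To make the running-sum idea concrete, the following is a minimal sketch of an unweighted Kolmogorov-Smirnov-style enrichment score over a ranked list. It is an illustration under stated assumptions, not the code of calculateEnrichmentScore: the class and method names are hypothetical, every cell contributes equally (no weighting by score), and the signed maximum deviation is returned.

```java
public class EnrichmentScoreSketch {

    /**
     * @param inCellType flags in rank order (best score first); true if the cell
     *                   at that rank belongs to the cell type being tested
     * @return the running-sum value with the largest absolute deviation from zero,
     *         or 0 if the list contains only hits or only misses
     */
    static double enrichmentScore(boolean[] inCellType) {
        int hits = 0;
        for (final boolean hit : inCellType) {
            if (hit) {
                hits++;
            }
        }
        final int misses = inCellType.length - hits;
        if (hits == 0 || misses == 0) {
            return 0.0; // assumption of this sketch: the score is not defined here, so return 0
        }
        final double up = 1.0 / hits; // step up for each cell of the cell type
        final double down = 1.0 / misses; // step down for each other cell
        double running = 0.0;
        double best = 0.0;
        for (final boolean hit : inCellType) {
            running += hit ? up : -down;
            if (Math.abs(running) > Math.abs(best)) {
                best = running;
            }
        }
        return best;
    }
}
```

With this definition, a cell type whose cells all sit at the top of the ranking scores 1, and one whose cells all sit at the bottom scores -1.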
Then, following the same principles described in the GSEA article, we calculate the significance of the enrichment scores by randomly permuting the cell types of the single cells and recalculating the enrichment scores until we have a distribution from which a p-value can be calculated. This is implemented in the method calculateSignificanceByCellTypesPermutations, which, after permuting the cell types, calls the method calculateEnrichmentScore with the flag permutatedData=true. The p-value associated with the real enrichment score x of each cell type is then the number of random enrichment scores x' greater than or equal to x, divided by the total number of random enrichment scores obtained for that cell type.
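The permutation step and the p-value definition above can be sketched as follows. This is only an illustration that follows the description literally; the class PermutationPValueSketch and its methods are hypothetical names, and the real method may handle negative scores or other details differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PermutationPValueSketch {

    /** Returns a shuffled copy of the cell type labels, leaving the input list untouched. */
    static List<String> permuteLabels(List<String> cellTypes, Random random) {
        final List<String> shuffled = new ArrayList<>(cellTypes);
        Collections.shuffle(shuffled, random);
        return shuffled;
    }

    /** Fraction of random enrichment scores that are greater than or equal to the real one. */
    static double pValue(double realScore, double[] randomScores) {
        if (randomScores.length == 0) {
            throw new IllegalArgumentException("At least one random score is needed");
        }
        int atLeastAsExtreme = 0;
        for (final double randomScore : randomScores) {
            if (randomScore >= realScore) {
                atLeastAsExtreme++;
            }
        }
        return (double) atLeastAsExtreme / randomScores.length;
    }
}
```

In each permutation round, the shuffled labels would be assigned back to the ranked single cells and the enrichment score recalculated, producing one random score per cell type and round.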
Once we have a p-value per cell type, we calculate an FDR for each cell type using the real enrichment scores x_t of all cell types t and all the random enrichment scores x'_t of all cell types t. The FDR for a given cell type t is the number of random enrichment scores that are greater than or equal to x_t (snull) divided by the number of real enrichment scores that are greater than or equal to x_t (sobs). However, a normalization factor based on the number of cells in the cell type t is applied to that value. See the following lines of code:
// nobs is the total number of real scores
// nnull is the total number of random scores
final int nobs = totalRealNormalizedScores.size();
final int nnull = totalRandomNormalizedScores.size();
fdr = (1.0 * snull / sobs) * (1.0 * nobs / nnull);
This is implemented at the end of the method calculateSignificanceByCellTypesPermutations.
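The FDR formula from the snippet above can be written as a small standalone function, which may help when checking values by hand. The class FdrSketch and its method names are hypothetical, and the two arrays stand in for totalRealNormalizedScores and totalRandomNormalizedScores; this sketch does not cap the result at 1.

```java
public class FdrSketch {

    /** Number of scores that are greater than or equal to the threshold. */
    static int countAtLeast(double[] scores, double threshold) {
        int count = 0;
        for (final double score : scores) {
            if (score >= threshold) {
                count++;
            }
        }
        return count;
    }

    /**
     * @param xt           real enrichment score of the cell type of interest
     * @param realScores   real enrichment scores of all cell types
     * @param randomScores random enrichment scores of all cell types
     */
    static double fdr(double xt, double[] realScores, double[] randomScores) {
        final int sobs = countAtLeast(realScores, xt);
        final int snull = countAtLeast(randomScores, xt);
        final int nobs = realScores.length;
        final int nnull = randomScores.length;
        if (sobs == 0 || nnull == 0) {
            throw new IllegalArgumentException("xt must be one of the real scores and random scores must not be empty");
        }
        // same expression as in the snippet above
        return (1.0 * snull / sobs) * (1.0 * nobs / nnull);
    }
}
```

For example, with 3 real scores of which 2 reach x_t, and 6 random scores of which 2 reach x_t, the result is (2/2) * (3/6) = 0.5.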
Adding a new dataset to the database can be done in a separate script that reads the information from the new dataset, creates the appropriate objects and saves them into the MongoDB database.
Set up your code to have access to the MongoDB database:
To access the database, use the utility class MongoBaseService.java from the pctsea-core module.
In addition, you will need Spring dependency injection, through Spring annotations, to get an instance of MongoBaseService. See the code snippet below as an example:
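import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.data.mongo.AutoConfigureDataMongo;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.context.SpringBootTest.WebEnvironment;
import org.springframework.test.context.junit4.SpringRunner;
// plus the imports of the PCTSEA classes used below (MongoBaseService and the *MongoRepository classes)
@RunWith(SpringRunner.class)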
@AutoConfigureDataMongo
@SpringBootTest(// we don't want a web environment to test
webEnvironment = WebEnvironment.NONE, //
properties = { "headless=false" //
// if necessary:
,"spring.config.location=classpath:/application-remoteTunnel.properties"//
, "spring.jpa.hibernate.ddl-auto=create"
}
)
public class NewDatasetCreation {
    @Autowired
    MongoBaseService mongoBaseService;
    // If you want to access the database directly, without the methods in MongoBaseService,
    // you can get the repositories like this, using @Autowired with any of the
    // *MongoRepository classes that are in 'edu.scripps.yates.pctsea.db'
    @Autowired
    DatasetMongoRepository projectMongoRepo;
    @Autowired
    SingleCellMongoRepository singleCellMongoRepository;

    @Test
    public void DatasetCreation() {
        // here will be the code explained below
    }
}
Insert the new dataset object:
final Dataset dataset = new Dataset();
dataset.setTag("HCL");
dataset.setName("Construction of a human cell landscape at single-cell level");
dataset.setReference("https://doi.org/10.1038/s41586-020-2157-4");
// only save the dataset if there is none with the same name yet
if (projectMongoRepo.findByName(dataset.getName()).isEmpty()) {
    projectMongoRepo.save(dataset);
}
Read the single cell expressions from the new dataset and create the singleCell objects:
// where we keep the SingleCell objects
List<SingleCell> singleCellList = new ArrayList<SingleCell>();

List<Expression> readSingleCellExpressions() {
    List<Expression> sces = new ArrayList<Expression>();
    // in practice, this would be inside a loop that reads the files of the new dataset:
    String singleCellName = "single_cell_identifier_1234";
    String cellType = "neuron";
    String biomaterial = "brain";
    String datasetTag = dataset.getTag(); // we need to associate the single cell with the dataset
    // create the SingleCell object (be careful: we don't want duplicated single cells in the DB,
    // so first check whether it has already been created by querying by its unique name)
    SingleCell singleCelldb = new SingleCell(singleCellName, cellType, biomaterial, datasetTag);
    // store it in a list
    singleCellList.add(singleCelldb);
    // create the single cell expression of a gene
    String gene = "ALDOA";
    double expressionValue = 4.3;
    // create the Expression object
    final Expression sce = new Expression();
    sce.setCell(singleCelldb); // associate Expression with SingleCell
    sce.setGene(gene);
    sce.setExpression(expressionValue);
    sce.setProjectTag(datasetTag);
    // add the Expression to the list of Expression objects
    sces.add(sce);
    return sces;
}
Save the Expression objects returned by the previous method:
final List<Expression> sces = readSingleCellExpressions();
// number of entities per insert query
int BATCH_SIZE = 1000;
final List<Expression> batch = new ArrayList<Expression>();
for (final Expression sce : sces) {
    batch.add(sce);
    if (batch.size() == BATCH_SIZE) {
        mongoBaseService.saveExpressions(batch, statusListener);
        System.out.println(batch.size() + " entities saved in database");
        batch.clear();
    }
}
// save the remaining ones
if (!batch.isEmpty()) {
    mongoBaseService.saveExpressions(batch, statusListener);
}
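The batching pattern above is repeated for each entity type, so it could be factored into a small generic helper. The following is a minimal sketch under that assumption; BatchSaveSketch and saveInBatches are hypothetical names and do not exist in pctsea-core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchSaveSketch {

    /**
     * Splits the items into consecutive batches of at most batchSize elements
     * and hands each batch to the saver.
     *
     * @return the number of batches passed to the saver
     */
    static <T> int saveInBatches(List<T> items, int batchSize, Consumer<List<T>> saver) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            final int to = Math.min(from + batchSize, items.size());
            // copy the sub-list so the saver gets an independent list
            saver.accept(new ArrayList<>(items.subList(from, to)));
            batches++;
        }
        return batches;
    }
}
```

With such a helper, the loop above would reduce to something like saveInBatches(sces, 1000, b -> mongoBaseService.saveExpressions(b, statusListener)), where statusListener is the same object used in the snippet above.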
Save SingleCells that are in the list singleCellList:
final List<SingleCell> batch = new ArrayList<SingleCell>();
// names of the single cells already queued or stored, to avoid duplicates:
Set<String> singleCellsInDB = new HashSet<String>();
for (final SingleCell sc : singleCellList) {
    // Set.add returns false if the name was already seen, even within the current batch
    if (singleCellsInDB.add(sc.getName())) {
        batch.add(sc);
    }
    if (batch.size() == BATCH_SIZE) {
        mongoBaseService.saveSingleCells(batch, statusListener);
        batch.clear();
    }
}
// store the remaining ones:
if (!batch.isEmpty()) {
    mongoBaseService.saveSingleCells(batch, statusListener);
}
Proteomics Yates Laboratory
Salvador Martínez-Bartolomé (salvador at scripps.edu)
Research Associate
The Scripps Research Institute
10550 North Torrey Pines Road
La Jolla, CA 92037
Git-Hub profile