This document describes the steps necessary to translate a PheKB phenotype into a FHIR-native representation, using CQL as the logical expression language.
The overall process to translate a phenotype is:
- Select an untranslated phenotype from the data directory. The main readme lists the phenotypes and their status.
- Create any necessary value sets as FHIR ValueSet resources, and add them to the cache.
- Translate phenotype logic into CQL.
- Create tests.
The testing framework uses the JavaScript CQL engine, so you need NodeJS
installed to create tests. The installer can be found here. The tests are
currently being run on Travis using NodeJS version 11.14.0.
The testing framework operates on ELM, not CQL, so we use the reference implementation of the CQL to ELM translator to convert our CQL to ELM before running the tests. Travis runs the CQL to ELM conversion using OpenJDK 11, which can be found here.
Clone the repo:
git clone https://github.com/PheMA/phema-workbench-repository.git && cd phema-workbench-repository
Install yarn globally:
npm install -g yarn
Install the JavaScript dependencies:
yarn install
Two good editors to use for writing CQL are Atom and VSCode.
If you use Atom, install the CQL language plugin. This will give you syntax highlighting and some feedback on syntax errors. This plugin also includes a way to execute CQL, but the mechanism is different from the testing framework we use, so execution will not succeed without some extra work not described here.
If you use VSCode, install the Clinical Quality Language plugin to get syntax highlighting.
The process has currently only been tested on macOS. The overall process should
work on Windows too, but we will likely need to add Windows-specific scripts to
the package.json file.
All phenotypes are in the phenotypes directory. The directories are named using
the PheKB ID and slug. There are three directories inside each phenotype
directory: cql, test, and valuesets.
There is a yarn script to scaffold the directory structure as well as the CQL
and test files. The script is run as follows:
yarn scaffold 123.phenotype-slug
Note that the parameter must use the format /[0-9]*\.[a-z\-]*.
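The naming rule can be checked before running the scaffold script. The following is a minimal sketch and not part of the repository: isValidPhenotypeName is a hypothetical helper, and its pattern is an assumed, anchored and slightly stricter variant of the format shown above (at least one digit, then a dot, then at least one lowercase letter or hyphen).

```javascript
// Hypothetical helper (not part of the repository): checks that a phenotype
// directory name looks like "<PheKB ID>.<slug>", e.g. "123.phenotype-slug".
// Assumption: an anchored, stricter form of the documented pattern is used,
// requiring at least one digit and at least one lowercase letter or hyphen.
const PHENOTYPE_NAME = /^[0-9]+\.[a-z-]+$/;

function isValidPhenotypeName(name) {
  return PHENOTYPE_NAME.test(name);
}

// Example usage with one well-formed and one malformed name.
console.log(isValidPhenotypeName("123.phenotype-slug"));
console.log(isValidPhenotypeName("Phenotype Slug"));
```

Anchoring the pattern is a deliberate choice here: the unanchored format would also accept names that merely contain a dot somewhere.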
Value sets are an important part of any phenotype definition. To support FHIR-native execution, we create value sets as FHIR ValueSet resources.
The Terminology Manager application in the PhEMA Workbench can be used to create these resources. This tool currently supports searching VSAC, uploading FHIR ValueSet resources directly, as well as importing concept sets exported from OHDSI Atlas.
To search VSAC, configure a new FHIR connection in Workbench as follows:
You must use exactly the following URL:
This is because VSAC is not properly configured for CORS, which means that we have to proxy requests via our backend. Use the username apikey and your API KEY (found here) as the password. These credentials are only stored in your browser, so there is very limited security risk.
You can upload FHIR ValueSet resources as well as Bundle resources directly into the Terminology Manager application.
Atlas allows exporting of concept sets. This is done by clicking the Export tab
on the Concept Sets page. Simply click to select one or more concept sets, and
then click the Export Concept Sets button. This will create a file called
exportedConceptSets.zip. Upload this file in the PhEMA Terminology Manager
application and these concept sets will be converted to FHIR ValueSet
resources.
💡 It is recommended that you have the VSAC Proxy connection selected as the
Source when uploading the exported Atlas concept set. This will ensure that any
standard code systems are also added to the terminology bundle.
There is a script that can create a Bundle of FHIR ValueSet resources by taking
a single CSV file as input, and directly searching the OMOP database. An
example of such a CSV file is:
valueSetId | valueSetName | searchTerm | vocabularyId |
---|---|---|---|
17.R1.anemias-and-other-hematologic-disorders | Anemias and Other Hematologic Disorders | 281.3 | ICD9CM |
17.R1.anemias-and-other-hematologic-disorders | Anemias and Other Hematologic Disorders | 281.4 | ICD9CM |
17.R1.anemias-and-other-hematologic-disorders | Anemias and Other Hematologic Disorders | 281.9 | ICD9CM |
17.IV.hereditary-anemias | Hereditary Anemias | 282.64 | ICD9CM |
17.IV.hereditary-anemias | Hereditary Anemias | ^200.* | ICD9CM |
17.red-blood-cell-meds | Medications Affecting Growth | aminoglutethimide | RxNorm |
Note the following things:
- The number and names of the columns must match exactly
- Search terms containing ^ or * will be matched as regular expressions
- If the vocabularyId field is RxNorm (case sensitive), the search term will be matched (case insensitive) against the RxNorm concept name, not the concept code (as with the other terminologies)
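The matching rules in the list above can be illustrated in isolation. This is only a sketch of the rules as described: the real script queries the OMOP database and may implement them differently. The function names, the { code, name } concept shape, and the whole-string comparison for non-regex terms are all assumptions made for illustration.

```javascript
// Hypothetical sketch of the matching rules listed above; the real script
// queries the OMOP database and may implement them differently.

// A search term is treated as a regular expression if it contains ^ or *.
function isRegexTerm(searchTerm) {
  return /[\^*]/.test(searchTerm);
}

// `row` is one CSV row; `concept` has an assumed shape: { code, name }.
function matchesConcept(row, concept) {
  const isRxNorm = row.vocabularyId === "RxNorm"; // case-sensitive check
  // RxNorm terms are compared with the concept name, others with the code.
  const target = isRxNorm ? concept.name : concept.code;
  if (isRegexTerm(row.searchTerm)) {
    // Only RxNorm matching is case insensitive.
    return new RegExp(row.searchTerm, isRxNorm ? "i" : "").test(target);
  }
  // Assumption: non-regex terms must match the whole string.
  return isRxNorm
    ? target.toLowerCase() === row.searchTerm.toLowerCase()
    : target === row.searchTerm;
}

const row = { searchTerm: "^200.*", vocabularyId: "ICD9CM" };
console.log(matchesConcept(row, { code: "200.01", name: "example concept" }));
```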
The script extracts the database connection details from environment variables (as described here). You also need to specify the input CSV file path and the Bundle output file path using the -i and -o command line arguments. For example:
PGHOST=65.52.10.157 PGUSER=ohdsi_admin_user PGPASSWORD=<password> PGDATABASE=ohdsi \
scripts/valueset-builder.js \
-i ./phenotypes/17.red-blood-cell-indices/valuesets/raw.csv \
-o ./phenotypes/17.red-blood-cell-indices/valuesets/17.red-blood-cell-indices-valuesets.bundle.json
For a list of valid vocabularyId values, see this file and the bottom of the
script. Use the values found in the omopVocabularyId field. The case must match
the values in the script exactly, and only the entries with non-empty
fhirCanonicalUrl values are useful.
There is also a script that takes a list of medications (just update the
SEARCH_STRINGS array) and creates a FHIR ValueSet resource by searching RxNav.
This script has been deprecated and replaced by the OMOP Database Search script described above.
Once you have your value sets assembled in the Terminology Manager application, you can download them by clicking on the Download button. This will download a FHIR Bundle containing the assembled value sets and code systems.
Right now you need to manually copy the ValueSet resources out of this bundle
and place them in the valuesets directory under your phenotype. There is an
open issue to support loading bundles directly (see #1).
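Until bundles can be loaded directly, the manual copy step could be scripted. The sketch below is not part of the repository: extractValueSets and valueSetFileName are hypothetical helpers, and naming each file after the ValueSet id is an assumption. It only shows how the ValueSet entries can be picked out of a downloaded Bundle; writing the files into the valuesets directory is left out.

```javascript
// Hypothetical helper: returns the ValueSet resources found in a FHIR Bundle,
// skipping other resource types such as CodeSystem.
function extractValueSets(bundle) {
  return (bundle.entry || [])
    .map((entry) => entry.resource)
    .filter((resource) => resource && resource.resourceType === "ValueSet");
}

// Assumed file naming: "<ValueSet id>.json"; the repository may use another.
function valueSetFileName(valueSet) {
  return `${valueSet.id}.json`;
}

// Tiny in-memory bundle used only for illustration.
const exampleBundle = {
  resourceType: "Bundle",
  type: "collection",
  entry: [
    { resource: { resourceType: "CodeSystem", id: "example-cs" } },
    { resource: { resourceType: "ValueSet", id: "17.IV.hereditary-anemias" } },
  ],
};

for (const valueSet of extractValueSets(exampleBundle)) {
  console.log(valueSetFileName(valueSet));
}
```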
Finally, the testing framework technically only supports loading value sets
from VSAC. However, it caches these value sets on the file system, so we can
manually populate this cache with any value sets we want. The
prepare-valueset-cache script does exactly this. It looks for all ValueSet
resources in the valuesets directory of every phenotype, parses them, and adds
them to the value set cache.
The value set cache is updated using the following command:
yarn prep-vscache
Value sets used by many phenotypes should be added to __common/valuesets.
ValueSet resources in this location are also added to the cache when the above
command is executed.
The current convention is to create a cql directory under your phenotype to
store any related CQL libraries. Libraries that are common to many phenotypes
should be placed in __common/cql.
This is an evolving list of conventions that should expand greatly as we translate more phenotypes.
All libraries should be written in the patient context, and it should be up to the executing engine to apply the library to all members of a population.
Currently we are using the PheKB ID and slug as the filename, but we should discuss what makes sense long term.
The filename must match the name given in the library statement. For example, if the CQL file contains:
library "97.abdominal-aortic-aneuryms-aaa" version '1.0.0'
Then the filename MUST be 97.abdominal-aortic-aneuryms-aaa.cql. I believe that
97.abdominal-aortic-aneuryms-aaa-1.0.0.cql would also be acceptable.
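The filename rule can be checked mechanically. The sketch below is a hypothetical helper, not something the repository provides: it reads the library statement from CQL source text with a simplified regular expression and returns the candidate file names. The versioned form is included only because the text above suggests it may be acceptable; that has not been confirmed.

```javascript
// Hypothetical check: reads the library statement from CQL source text and
// returns the file names that would match it.
function expectedFileNames(cqlSource) {
  // Simplified pattern: library <name or "quoted name"> [version '<version>']
  const match = cqlSource.match(
    /^\s*library\s+(?:"([^"]+)"|([A-Za-z_][\w.]*))(?:\s+version\s+'([^']+)')?/m
  );
  if (!match) return [];
  const name = match[1] || match[2];
  const version = match[3];
  const names = [`${name}.cql`];
  // Unconfirmed alternative: "<name>-<version>.cql".
  if (version) names.push(`${name}-${version}.cql`);
  return names;
}

const source = `library "97.abdominal-aortic-aneuryms-aaa" version '1.0.0'`;
console.log(expectedFileNames(source));
```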
Intermediate statements should have meaningful names, but their names matter
less than that of the statement which determines whether a patient is a case.
This statement should have the identifier "Case". For example:
// Determines if the patient is a case
define "Case":
"Has Condition" and "Has Medication" and "Other Criteria"
💡 For now we are only characterizing cases, not controls or non-cases or suspected cases.
We are focused on translating criteria from the original phenotype definition that can be represented using structured data as a FHIR resource. This allows us to ignore whether it is actually feasible that a given type of structured data would be available within an EHR. For example, sites may not capture a patient's status in a biobank, but we are able to represent it using FHIR, so we will use that within our CQL.
Another consideration is for natural language processing (NLP). We do not need to perform any NLP-specific logic within our CQL translation. For example, if a phenotype says to "look for the string 'hypertension' in the Family History section of all notes", we do not need to try and create CQL logic that tries to search through document text. However, we are able to take the intent of what was written in the original phenotype and we could represent this as "Family history of hypertension" in CQL.
This translation of NLP criteria does not always need to be done. This section will be updated with additional scenarios and guidance as we discover them. When in doubt, we will discuss as a group and come to a consensus.
See the Patterns documentation for some details on how phenotypes should be modeled in CQL and FHIR.
Tests are extremely important, and are currently the only way to know if the CQL logic we write is correct, since we cannot currently execute our phenotypes in production, and even if we could, we do not have access to gold standard cohorts in every instance for comparison.
The testing framework we use is AHRQ's CQL Testing Framework (CTF). This framework integrates with Mocha and uses the JavaScript CQL engine to execute CQL logic. The CTF additionally allows us to specify FHIR data using a special YAML shorthand that makes creating data for tests very convenient.
While the CTF is very useful, there are two important caveats that we need to work around.
The CTF uses cql-exec-vsac as its value set service, which means that only VSAC
value sets are supported. The idea is that the CQL author will configure their
VSAC credentials, and upload their value sets to VSAC, and the CTF will
download from there.
While we could do this, it adds additional steps to phenotype development, so
we use a simple workaround instead. For optimization purposes, cql-exec-vsac
only ever downloads value sets from VSAC once, and stores them in a cache. The
cache is just a JSON file containing the value set data in a custom format (see
an example here).
To use local value sets, we have a script that traverses the phenotype
directory, parses the value sets in the valuesets directory of each phenotype
(and in __common/valuesets), and adds them to the cache. This is done by
running:
yarn prep-vscache
Once this is done, the CTF will find our value set in the cache and will not try to find it in VSAC.
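To make the workaround more concrete, the sketch below converts a FHIR ValueSet into a cache-style entry. It is an illustration under stated assumptions and not the repository's prepare script: the cache layout (key, then version, then an object with oid, version and codes), the use of the ValueSet id as the key, and the fallback version "1" are all assumptions that should be checked against the linked cache example. Only codes listed explicitly under compose.include are handled.

```javascript
// Sketch only: the cache layout below (key -> version -> { oid, version,
// codes }) is an assumption about the cql-exec-vsac cache file; check the
// linked example before relying on it.
function toCacheEntry(valueSet) {
  // Assumption: fall back to "1" when the ValueSet has no version.
  const version = valueSet.version || "1";
  const codes = [];
  const includes = (valueSet.compose && valueSet.compose.include) || [];
  for (const include of includes) {
    // Only explicitly enumerated concepts are handled in this sketch.
    for (const concept of include.concept || []) {
      codes.push({
        code: concept.code,
        system: include.system,
        version: include.version,
      });
    }
  }
  // Assumption: the ValueSet id is used as the cache key.
  return { [valueSet.id]: { [version]: { oid: valueSet.id, version, codes } } };
}

const exampleValueSet = {
  resourceType: "ValueSet",
  id: "17.IV.hereditary-anemias",
  version: "1.0.0",
  compose: {
    include: [
      { system: "http://hl7.org/fhir/sid/icd-9-cm", concept: [{ code: "282.64" }] },
    ],
  },
};

console.log(JSON.stringify(toCacheEntry(exampleValueSet), null, 2));
```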
The CTF can only operate on ELM, and so we must convert our CQL to ELM before
running the tests. For this reason we have included a compiled version of the
cql-to-elm translator in the tools directory.
To convert your CQL to ELM run the following command:
yarn update-elm
It isn't actually necessary to do this manually, because this step is executed
automatically when you run yarn test.
Inside each phenotype directory there should be a test directory that contains
a file called cqlt.yaml. This is the main test configuration file for the
phenotype, and should look almost identical to this:
---
library:
  name: "97.abdominal-aortic-aneuryms-aaa" # Only this needs to be updated to match your library/file name
  paths:
    - ../cql
    - ../../__common/cql
tests:
  path: cases
options:
  vsac:
    cache: ../../__common/.vscache # It is important to keep this fixed
  date: "2020-09-17T00:00:00.0Z"
  dumpFiles:
    enabled: false # Change this to true to generate debugging output
    path: ../../../debugging_output
Only the library name needs to be updated to match the library (and file) name
of your phenotype. The complete documentation for the cqlt.yaml file can be
found here.
Any number of individual test cases can be written for a single phenotype. The
current convention is to put these in a cases directory inside the test
directory (this matches the tests.path key in the above yaml).
Test cases have three parts: a name, a data section, and a results section. The name is just for display purposes, but the other sections are described briefly here.
There is an npm script that can generate a new case for a phenotype and
automatically increase the number. The new case can either be copied from the
previous case (the default), or be created from a blank template. By default
the phenotype slug will be inferred from the name of the current git branch,
but it can also be manually specified with -s.
To create a new test case by copying the previous one, run:
yarn new-test
To manually specify the phenotype slug, run:
yarn new-test -s 112.herpes-zoster
To generate a case from a blank template, use -c false:
yarn new-test -c false
Both flags can be used simultaneously.
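The "automatically increase the number" step can be pictured as follows. This is a hypothetical sketch and not the code behind yarn new-test: the function name is made up, and it assumes that case files are YAML files whose names end in a number, which may not match the repository's actual naming.

```javascript
// Hypothetical sketch: the actual yarn new-test script may number cases
// differently. Assumes case files end in "<number>.yaml" or "<number>.yml".
function nextCaseNumber(fileNames) {
  const numbers = fileNames
    .map((name) => name.match(/(\d+)\.ya?ml$/))
    .filter(Boolean)
    .map((match) => parseInt(match[1], 10));
  // Start at 1 when no numbered case exists yet.
  return numbers.length === 0 ? 1 : Math.max(...numbers) + 1;
}

// Illustrative file names only.
console.log(nextCaseNumber(["case-1.yaml", "case-2.yaml"]));
```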
The data section is used to specify FHIR resources that should be used when
running the test cases. This usually consists of a patient and associated
resources like Condition and/or Procedure. An example data section could look
like:
data:
- resourceType: Patient
  name: Joe Smith
  gender: male
  birthDate: 1958-07-16
- resourceType: Procedure
  code: CPT#34802 Endovascular repair of infrarenal abdominal aortic aneurysm or dissection; using modular bifurcated prosthesis (1 docking limb)
  performedDateTime: "2008-11-12T16:00:00.000Z"
Note the special shorthand for creating data, described in detail here.
It is also possible to reuse individual resources or sets of resources during testing. See the reusing resources docs for instructions on how to do that.
The results section is simply a set of key value pairs that specify the name of a CQL statement and the expected result given the data described above. An example is:
results:
  Has Exclusion Diagnosis: false
  AAA Repair Procedures: $should have length 2
  Case Type 1: true
Each line is a statement name, followed by a colon and the expected result. The
CTF provides a few special keywords like $should, which are described in the
results section docs.
The complete documentation for creating test cases is here.
To actually run the tests, execute the following:
yarn test
The search path for tests is defined in test.js. By default every test in the
phenotypes directory will be run, but there is a comment in that file that
explains how to adjust the search path during development. This can be done to
only run the test you are currently working on to speed up test development.
There are a lot of moving parts in this repository, so issues are likely to arise. This section should be updated to include any troubleshooting tips or gotchas.
The main cqlt.yaml file has an optional dumpFiles key described here. This can
be turned on to generate debugging output, which can be used to figure out why
your test cases might not be working.
There are a few issues around included libraries. First, they are handled
differently by the cql-to-elm translator and the cql-execution engine used by
the CTF.
The translator needs all included libraries to be in the same directory as the including library.
- TODO: Add a script that copies CQL/ELM from __common/cql to each phenotype that includes common libraries.
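As a starting point for such a script, the included libraries of a CQL file can be listed from its include statements. The helper below is a hypothetical sketch, not existing repository code: it uses a simplified regular expression and does not copy any files.

```javascript
// Hypothetical helper: lists the library names referenced by include
// statements in CQL source text, so they can be checked or copied.
function includedLibraries(cqlSource) {
  // Simplified pattern: include <name or "quoted name"> ...
  const pattern = /^\s*include\s+(?:"([^"]+)"|([A-Za-z_][\w.]*))/gm;
  const names = [];
  let match;
  while ((match = pattern.exec(cqlSource)) !== null) {
    names.push(match[1] || match[2]);
  }
  return names;
}

// Illustrative CQL; the library names are examples only.
const exampleCql = [
  "library Diabetes version '1.0.0'",
  "include FHIRHelpers version '4.0.0' called FHIRHelpers",
  "include ExampleHelpers called EH",
].join("\n");

console.log(includedLibraries(exampleCql));
```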
The cql-execution engine does not seem to like quoted library names. So, it is
best to use library names like Diabetes and NOT "Diabetes".
null, library.valuesets causes the code to break, which makes this particular
gotcha hard to diagnose.