Natural Language Processing (NLP)

POSEIDON NLP Extraction Process

The NLP project provides an enterprise solution for extracting structured, discrete data from raw clinical documents. NLP developers work closely with the EDW team to obtain the unstructured documents, then apply natural language processing techniques to transform them into high-quality data for research.

City of Hope Research Informatics has established the infrastructure required to facilitate a high-throughput, scalable approach to NLP. Common data elements have been added and pipelines developed to extract data for different disease groups such as BREAST, GYN, LEUKEMIA, and LYMPHOMA. Each of these pipelines runs as a weekly batch job to keep the database up to date.

Weekly automated services include the following steps (a simplified sketch of one batch cycle follows the list):

  1. Access and import pathology reports and clinical notes.
  2. Annotate them at scale with the required ontologies.
  3. Run NLP algorithms on these annotated documents.
  4. Ingest these discrete results from multiple NLP pipelines.
  5. Perform post-processing to normalize and validate the data.
  6. Load the discrete data into our central research database to make it available in POSEIDON.
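
As an illustration of how such a weekly cycle can be orchestrated, the Python sketch below strings the six steps together. All function names, the Document structure, the disease-group pipeline list, and the SNOMED CT example ontology are hypothetical placeholders for demonstration, not the actual POSEIDON implementation.

"""Illustrative sketch of one weekly NLP batch cycle (all stubs are hypothetical)."""
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str
    annotations: List[str] = field(default_factory=list)


# Hypothetical disease-group pipelines (mirrors the groups named above).
DISEASE_PIPELINES = ["BREAST", "GYN", "LEUKEMIA", "LYMPHOMA"]


def fetch_documents(source: str) -> List[Document]:
    """Step 1: access and import pathology reports and clinical notes (stubbed)."""
    return [Document(doc_id="example-001", text="Example pathology report text.")]


def annotate(doc: Document, ontology: str) -> Document:
    """Step 2: tag the document with concepts from a given ontology (stubbed)."""
    doc.annotations.append(f"{ontology}:example-concept")
    return doc


def run_nlp(doc: Document, pipeline: str) -> dict:
    """Step 3: run a disease-group NLP algorithm and return discrete results (stubbed)."""
    return {"doc_id": doc.doc_id, "pipeline": pipeline, "value": "example"}


def normalize_and_validate(results: List[dict]) -> List[dict]:
    """Steps 4-5: ingest, normalize, and validate the discrete results (stubbed)."""
    return [r for r in results if r.get("value")]


def load_to_database(results: List[dict]) -> None:
    """Step 6: load discrete data so it becomes available in POSEIDON (stubbed)."""
    print(f"Loaded {len(results)} rows to the research database.")


def weekly_batch() -> None:
    documents = [annotate(d, "SNOMED CT") for d in fetch_documents("EDW")]
    results = [run_nlp(d, p) for d in documents for p in DISEASE_PIPELINES]
    load_to_database(normalize_and_validate(results))


if __name__ == "__main__":
    weekly_batch()
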

NLP Data Validation

Validation of NLP results is a critical step in achieving high-quality data. There are two quality-check processes in place: Automated Data Quality Validation (AdQV) and HART. To derive insights from the unstructured documents, NLP developers worked with GenomOncology (GO) and Linguamatics to develop a quality control framework. This framework includes tools that end users, such as our Disease Registry Specialists, can use to review and modify NLP results. It is currently used to validate NLP results as needed and to generate quality metrics. Quality checks are performed with each data load cycle.
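
The sketch below illustrates the kind of automated, rule-based check a data-quality step can apply with each load cycle. The field names, the allowed-value rule, and the pass-rate metric are assumptions for demonstration only; the actual AdQV and HART checks are internal to the City of Hope framework.

"""Illustrative sketch of an automated data-quality check over NLP output rows."""
from typing import Dict, List

# Hypothetical set of values an extracted "ER status" field is allowed to take.
ALLOWED_ER_STATUS = {"positive", "negative", "unknown"}


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of quality issues found in one extracted record."""
    issues = []
    if not row.get("doc_id"):
        issues.append("missing doc_id")
    if row.get("er_status") not in ALLOWED_ER_STATUS:
        issues.append(f"unexpected er_status: {row.get('er_status')!r}")
    return issues


def quality_metrics(rows: List[Dict[str, str]]) -> dict:
    """Summarize the share of rows passing all checks for one load cycle."""
    failures = sum(1 for row in rows if validate_row(row))
    total = len(rows) or 1
    return {"total_rows": len(rows), "pass_rate": 1 - failures / total}


if __name__ == "__main__":
    sample = [
        {"doc_id": "example-001", "er_status": "positive"},
        {"doc_id": "", "er_status": "equivocal"},  # fails both checks
    ]
    print(quality_metrics(sample))
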

 

For collaboration requests, questions, or technical support,
e-mail InformaticsHelp@coh.org