PretoxTM Project

Background

The main objective of PretoxTM (Preclinical Toxicology Text Mining) is to retrieve treatment-related findings from toxicological reports using Natural Language Processing (NLP) techniques and to present this information in a Web App for curation by toxicology experts.

Toxicology reports describing the results of preclinical toxicology studies carried out by pharmaceutical companies have been identified as a valuable source of information on safety findings for investigational drugs in the context of the eTRANSAFE project. However, the exploitation of the preclinical knowledge contained in these reports is extremely difficult since most of them are unstructured texts, usually digitized as PDF documents and often including scanned images. PretoxTM is able to identify, capture and standardize findings related to drug treatment (i.e., safety findings) by mining legacy preclinical toxicology reports.

A treatment-related finding expression encloses several named entities. The most relevant one is the abnormal effect detected. Depending on the study domain of the finding, this effect is expressed either as a measurement, test, or examination (the Study Test) together with the abnormal result obtained for that test (the Manifestation), or as an abnormal Finding in study domains where there is no associated test or measurement (e.g., clinical, macroscopic and microscopic). Other named entities that may be present to complete the treatment-related finding are the Specimen in which the abnormality was observed, the Sex of the subject, the Group of subjects in which the observation was detected, and the Dose level at which the compound was administered. Examples of sentences with treatment-related findings are: "The decrease in food consumption and body weight of the animals from the mid dose onwards is regarded as evidence of general toxicity." and "At dose level 3, absolute and relative liver weights were increased in male rats."

Figure 1. Examples of treatment-related observations and their relevant entities. A) A treatment-related observation described by a study test (body weight gain) and the abnormal manifestation (decreased). The dose administered (25 mg/kg), the sex (males) and group (group A) of the subjects are also part of the observation. B) A treatment-related observation described by a finding (necrosis) in a specimen (liver). The dose administered (50 mg/kg) and the sex (females) of the subjects are also part of the observation.
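The entity model described above can be pictured as a small record type. The following is a minimal sketch, not the PretoxTM data model itself: the class name, field names and the `effect` helper are hypothetical, chosen only to mirror the entities named in the text. The two instances reproduce panels A and B of Figure 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TreatmentRelatedFinding:
    """Hypothetical record for one treatment-related finding."""

    study_test: Optional[str] = None     # measurement, test, or examination
    manifestation: Optional[str] = None  # abnormal result for the study test
    finding: Optional[str] = None        # abnormal finding with no associated test
    specimen: Optional[str] = None
    sex: Optional[str] = None
    group: Optional[str] = None
    dose: Optional[str] = None

    def effect(self) -> str:
        """Return the abnormal effect, whichever way it is expressed."""
        if self.study_test and self.manifestation:
            return f"{self.study_test}: {self.manifestation}"
        if self.finding:
            return self.finding
        raise ValueError("expected a study test with a manifestation, or a finding")


# Figure 1, panel A: study test plus abnormal manifestation.
panel_a = TreatmentRelatedFinding(
    study_test="body weight gain",
    manifestation="decreased",
    dose="25 mg/kg",
    sex="males",
    group="group A",
)

# Figure 1, panel B: finding observed in a specimen.
panel_b = TreatmentRelatedFinding(
    finding="necrosis",
    specimen="liver",
    dose="50 mg/kg",
    sex="females",
)

print(panel_a.effect())
print(panel_b.effect())
```

Keeping every entity optional reflects the fact that a sentence may mention only some of them; only the abnormal effect is treated as required in this sketch.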

The PretoxTM pipeline was developed using Transformers, fine-tuning the BioBERT and BiomedNLP pretrained models for sentence classification and named entity recognition (NER). These models are domain-specific language representations pre-trained on large-scale biomedical corpora. This component was developed in Python leveraging the HuggingFace ecosystem for the development of transformer models.
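As an illustration of how the two fine-tuned models might be used outside the full pipeline, the sketch below loads them with the generic Hugging Face `pipeline` helper. It assumes that the published checkpoints (`javicorvi/pretoxtm-sentence-classifier` and `javicorvi/pretoxtm-ner`) load with the standard `text-classification` and `token-classification` tasks; the helper names are hypothetical, and the label names returned by the models are not documented here, so they should be inspected rather than assumed.

```python
from collections import defaultdict

SENTENCE_CLASSIFIER_ID = "javicorvi/pretoxtm-sentence-classifier"
NER_MODEL_ID = "javicorvi/pretoxtm-ner"


def load_pretoxtm_pipelines():
    """Build both pipelines; the models are downloaded on the first call."""
    # Imported lazily so the rest of this module works without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=SENTENCE_CLASSIFIER_ID)
    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    ner = pipeline(
        "token-classification",
        model=NER_MODEL_ID,
        aggregation_strategy="simple",
    )
    return classifier, ner


def group_entities_by_label(entities):
    """Group aggregated token-classification output by its entity label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)
```

A typical call would be `classifier, ner = load_pretoxtm_pipelines()` followed by `group_entities_by_label(ner(sentence))` for each sentence that the classifier marks as relevant. This mirrors the classify-then-recognize order described above, but it is not the PretoxTM pipeline code.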

Table 1. The PretoxTM entity model. Named entities related to treatment-related findings with PretoxTM NER metrics.

Figure 2. PretoxTM Sentence Classifier metrics.

Table 2. Named Entity Normalization (NEN) coverage in the PretoxTM Corpus. "Total" indicates the number of findings recognized by the NER component, including repetitions. "Unique" is the number of distinct terms detected. "NEN" shows how many of the unique terms were normalized. The last column presents the NEN coverage.
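The columns of Table 2 can be reproduced with a few lines of code. The sketch below assumes that coverage is the number of normalized unique terms divided by the number of unique terms, and that uniqueness is decided case-insensitively; both are assumptions, as are the function name, the toy terminology and its codes.

```python
def nen_coverage(recognized_terms, normalize):
    """Compute Total, Unique, NEN and coverage for a list of recognized terms.

    `normalize` maps a term to a terminology code, or returns None when the
    term cannot be normalized.
    """
    total = len(recognized_terms)
    # Assumption: terms that differ only by case or padding count once.
    unique_terms = {term.strip().lower() for term in recognized_terms}
    normalized = {term for term in unique_terms if normalize(term) is not None}
    coverage = len(normalized) / len(unique_terms) if unique_terms else 0.0
    return {
        "total": total,
        "unique": len(unique_terms),
        "nen": len(normalized),
        "coverage": coverage,
    }


# Hypothetical terminology with made-up codes, for illustration only.
toy_terminology = {"necrosis": "TERM:0001", "hypertrophy": "TERM:0002"}
terms = ["Necrosis", "necrosis", "hypertrophy", "pale discoloration"]

stats = nen_coverage(terms, toy_terminology.get)
print(stats)
```

With these four recognized terms, three are unique and two of those are found in the toy terminology, so the coverage is two thirds.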

Training Materials

For a quick overview of the basic use of PretoxTM, a video is available. The video includes a step-by-step guide on how to upload reports, run the PretoxTM pipeline, visualise, validate and edit findings, and submit information to the SR-Domain Editor.

A user manual is also available, covering all PretoxTM functionalities and recommendations.

System Overview

The PretoxTM system is divided into two main modules:

The PretoxTM pipeline, which is responsible for detecting treatment-related findings in toxicology reports.
The PretoxTM Web App, in which the information is presented to toxicology experts for manual validation.

Figure 3. PretoxTM pipeline components for the extraction of treatment-related findings.

PretoxTM Resources

PretoxTM GitLab page is available at https://gitlab.com/pretoxtm/pretoxtm
PretoxTM Sentence Classifier model is available at https://huggingface.co/javicorvi/pretoxtm-sentence-classifier
PretoxTM NER model is available at https://huggingface.co/javicorvi/pretoxtm-ner
PretoxTM Corpus is available on Zenodo and HuggingFace
More information about the PretoxTM Corpus is available at https://gitlab.com/pretoxtm/pretoxtm-corpus

Updated News (Sep 10th 2024).

A new version of PretoxTM (2.8) was released.

The key feature of this new version of PretoxTM is that it can be installed anywhere, outside the eTRANSAFE environment. By using Docker and Docker Compose, the installation can be completed in just a few simple steps, which are detailed at: https://gitlab.com/pretoxtm/pretoxtm

Updated News (February 20th 2023).

A new version of PretoxTM (2.7) was released. This is the last version of PretoxTM in the context of the eTRANSAFE project.

In this version, the PretoxTM pipeline uses Transformers, fine-tuning the BioBERT and BiomedBERT pretrained models for sentence classification and named entity recognition (NER). The Sentence Classifier improved from a 0.91 to a 0.95 F1 score, and recognition of the different entities also improved (see Table 1).

Reports with the status "No section detected" can be run again with the option "Run without section extraction". Use this option with caution, and only if the normal execution found no sections to process in a given report. When it is selected, the entire report is searched for findings, which increases execution time for large reports.

Updated News (December 13th 2022).

A new version of PretoxTM (2.6) was released. This version contains several improvements in both the PretoxTM pipeline and the PretoxTM Web App. The most relevant changes are listed below:

A new workflow tab appears to view the status of workflows, including which reports are being processed in each workflow, and to control workflow executions. For example, you can cancel or delete a workflow.

We improved the way reports are loaded, and you can now select which reports you want to process with the PretoxTM pipeline.

In the reports tab, a new "User" column was added to show who uploaded the report. The same column is present in the workflow tab to show which user launched the PretoxTM workflow.

We improved the section extraction component in general. We optimized the processing of large and fully scanned reports, avoiding workflow bottlenecks and failures. Note: uploading reports with more than 200 scanned pages is a demanding operation. Each page is internally converted with Tesseract to a readable PDF page, which takes several seconds per page. If a page takes more than 30 seconds to convert, it is skipped and the process continues with the next one. Tip: if you are going to upload a large scanned report, run the workflow with that report only; this gives better control of the execution.
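The skip-on-timeout behaviour described above can be sketched with the Tesseract command-line tool and a per-call timeout. This is an illustration under stated assumptions, not the PretoxTM implementation: the function names and the output naming scheme are hypothetical, and pages are assumed to be available as individual image files.

```python
import subprocess

PAGE_TIMEOUT_SECONDS = 30  # limit per page, as stated in the note above


def tesseract_page_to_pdf(image_path, output_base, timeout_s=PAGE_TIMEOUT_SECONDS):
    """Run the tesseract CLI on one page image, writing <output_base>.pdf."""
    subprocess.run(
        ["tesseract", image_path, output_base, "pdf"],
        check=True,
        capture_output=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
    )
    return output_base + ".pdf"


def convert_pages(image_paths, convert=tesseract_page_to_pdf):
    """Convert pages in order, skipping any page that exceeds the time limit."""
    converted, skipped = [], []
    for index, image_path in enumerate(image_paths, start=1):
        try:
            converted.append(convert(image_path, f"page_{index:04d}"))
        except subprocess.TimeoutExpired:
            # A slow page is skipped; processing continues with the next one.
            skipped.append(image_path)
    return converted, skipped
```

Passing the converter as a parameter keeps the skip logic separate from the OCR call, so it can be exercised without Tesseract installed. Returning the skipped pages makes it possible to report which parts of a scanned report were not processed.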

We improved the table of findings in several respects: design, navigation, editing, etc. We added an asterisk (*) to mark terms that do not belong to the Controlled Terminology.

We validate the domain and the specimen of each finding before sending the information to the SR-Domain Web Editor. The user receives feedback indicating which findings need to be completed or modified.
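A pre-submission check of this kind might look like the sketch below. It only verifies that both fields are filled in; the actual validation rules are not described here and may be stricter. The dictionary keys and the function name are hypothetical.

```python
REQUIRED_FIELDS = ("domain", "specimen")  # the two fields checked before sending


def incomplete_findings(findings):
    """Map each finding id to the required fields that are missing or blank."""
    problems = {}
    for finding in findings:
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not (finding.get(field) or "").strip()
        ]
        if missing:
            problems[finding["id"]] = missing
    return problems


# Illustrative findings; the ids and values are made up.
findings = [
    {"id": 1, "domain": "microscopic", "specimen": "liver"},
    {"id": 2, "domain": "macroscopic", "specimen": ""},
    {"id": 3, "domain": None},
]

print(incomplete_findings(findings))
```

Returning the missing fields per finding, rather than a single pass or fail flag, supports the kind of feedback described above, where the user is told which findings still need to be completed.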

When a validated report is sent from PretoxTM to SR-Domain Web Editor, we check if a report with the same name already exists in SR-Domain Web Editor; if so, we inform the user that they can send the same report as a copy.

Several improvements have been made in terms of styling, feedback and navigation in all components of the PretoxTM Web App. For example, progress feedback during report upload, a new refresh button for tables, and green and red messages to indicate the success or failure of operations.

PretoxTM is now included in the ToxHub Redmine, where issues, questions or doubts can be reported.

Updated News (October 11th 2022)

Please read the following issues:

A new version of PretoxTM (2.5.1) was released. The integration of the PretoxTM pipeline into the ToxHub environment is complete.

The user is able to upload their reports, execute the PretoxTM pipeline, validate the results and send the information to the SR-Domain database.

When reviewing the information, the user can filter by and focus on a specific section for their analysis. The section information appears in the findings table and also at the top right, so that the textual evidence can be visualized as desired.

Findings can also be edited. Editing only modifies the values in the table; re-annotation of the text is not allowed.

After a report is validated, indicating which findings are accepted, the user can push the findings to the SR-Domain Web Editor through the SR-Domain API for subsequent analysis.

Updated News (June 6th 2022)

PretoxTM is still under development. Please read the following issues:

A new version of the PretoxTM pipeline (2.4) was released with improvements in Named Entity Recognition and Relation Extraction.

In the PretoxTM pipeline 2.4, we calculate the Named Entity Recognition (NER) performance (Table 1) using the preclinical corpus we developed. The sentence classifier model, which detects relevant toxicological sentences, obtains a 0.91 F1 score. Results are preliminary; we plan to improve the performance of the pipeline in the next version.

The integration of the PretoxTM pipeline into the ToxHub environment is ongoing.

This version allows importing the results of PretoxTM pipeline runs that were executed locally (see section How to use it). At this point, the data imported into the PretoxTM Web App should be public, since all users that have access to the ToxHub can see the same information.

Updated News (April 4th 2022)

PretoxTM is still under development. Please read the following issues:

At the moment we are improving the Named Entity Recognition and Relation Extraction components of the PretoxTM pipeline. The main objective is to report the recognition performance for the concept types included in the treatment-related findings. The sentence classifier model, which detects relevant toxicological sentences, obtains a 0.91 F1 score.

The PretoxTM Web App is deployed in the test environment of the eTRANSAFE ToxHub. This version allows importing the results of PretoxTM pipeline runs that were executed locally (see section How to use it). At this point, the data imported into the PretoxTM Web App should be public, since all users that have access to the ToxHub can see the same information. No security restrictions are applied; this should also be discussed.

The next step is to explore including the PretoxTM pipeline in the ToxHub environment.

PretoxTM in the context of eTRANSAFE project

In the context of the eTRANSAFE project, the PretoxTM system collaborates closely with the SR-Domain Web Editor. At the end of the process, the main goal is to be able to import treatment-related findings into the Preclinical Database, following the SR-Domain format (Sharepoint access). The treatment-related findings detected by the PretoxTM pipeline are visualized and validated by the PretoxTM Web App. Once the validation is over the data can be exported into the SR-Domain format to be later edited by the SR-Domain Editor and incorporated into the Preclinical Database.

Additional Information

PretoxTM participants:

Javier Corvi
Nicolas Díaz Roussel
Pablo Accuosto
José María Fernández
Emilio Centeno
Francesco Ronzano
Thomas Steger-Hartmann
Alfonso Valencia
Laura I. Furlong
Salvador Capella-Gutierrez

PretoxTM Corpus Data Curation:

Celine Ibrahim
Shoji Asakural
Frank Bringezu
Mirjam Fröhlicher
Annika Kreuchwig
Yoko Nogami
Jeon Rih
Raul Rodriguez-Esteban
Nicolas Sajot
Joerg Wichard
Heng-Yi Michael Wu
Philip Drew

License: This project is licensed under the GNU General Public License Version 3; see here for details.