ChEMU - Cheminformatics Elsevier Melbourne University lab

The ChEMU lab series provides a unique opportunity to develop information extraction tools for chemical patents. As the second edition of the ChEMU lab, ChEMU 2021 focuses on reference resolution in chemical patents and provides two key tasks toward this goal: chemical reaction reference resolution and anaphora resolution.

Overview of ChEMU 2021

Brought to you by the University of Melbourne natural language processing group in the School of Computing and Information Systems, the Elsevier Data Science, Life Sciences team, and RMIT University, the ChEMU lab series provides an opportunity to develop information extraction models for chemical patents.

News
  • Training & Development sets for both tasks are released!
  • Sample data is released.
Lab Registration

To register, please fill out the Registration Form before 30 April 2021.

Tasks
  • Task 1 - chemical reaction reference resolution: Given a chemical reaction snippet, the task aims to find similar chemical reactions and general conditions that it refers to.
  • Task 2 - anaphora resolution: The task requires identification of references between expressions in chemical patents.
Key Dates
  • 3 May 2021: Test set release
  • 7 May 2021: End of evaluation cycle and feedback for participants
  • 28 May 2021: Submission of participant papers [CEUR-WS]
  • 28 May – 11 June 2021: Review process of participant papers
  • 11 June 2021: Notification of acceptance of participant papers [CEUR-WS]
  • 2 July 2021: Camera ready copy of participant papers [CEUR-WS]

All dates above are in the Anywhere on Earth (AoE) time zone.

Task 1: Chemical reaction reference resolution

Given a reaction description, this task requires identifying references to other reactions that the reaction relates to, and to the general conditions that it depends on.

Assume a set of reaction statements (RSs), each of which corresponds to a description of an individual chemical reaction or a general condition for the reaction. By identifying all the reference relationships amongst these reaction statements, the details of reactions can be fully specified by connecting related reaction statements. Two types of reference relationships are defined in this task, namely Analogous Reactions and General Conditions.

Data format

There are two types of data files, which are both in plain text using UTF-8 encoding:

  • *.txt files: the text files converted from the original patent PDFs.
    • Note that there are some special tags like <img>, <header>, <table>.
  • *.ann files: the annotation files containing span and relation annotations in BRAT standoff format.
    • REACTION_SPAN -> a reaction statement
    • REF -> a reference relation between two REACTION_SPANs
    • To help participants build better models, we provide CUE annotations, where a CUE indicates the analogy in a parent-child reaction pair. Related annotations are CUE, CUE_PARENT, CHILD_CUE.
    • IMG_CUE, IMG_CUE_PARENT, IMG_CHILD_CUE rely on images in the original patents, which are not made available in this task. Participants may ignore these annotations if they are not useful.
For more details, please refer to our annotation guideline.
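To make the file layout concrete, here is a minimal Python sketch of reading span and relation lines from an *.ann file in BRAT standoff format. It is not part of the official tooling. The names `Span`, `Relation`, and `parse_ann` are hypothetical, the sample content (IDs, offsets, text, and the direction of the REF link) is made up for illustration, and relation arguments are assumed to be written as `Arg1:` and `Arg2:`.

```python
from dataclasses import dataclass


@dataclass
class Span:
    id: str
    label: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    label: str
    arg1: str
    arg2: str


def parse_ann(content):
    """Parse BRAT standoff text into (spans, relations) dicts keyed by annotation ID."""
    spans, relations = {}, {}
    for line in content.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Text-bound annotation: "T1<TAB>LABEL START END<TAB>TEXT"
            label, offsets = fields[1].split(" ", 1)
            # Discontinuous spans separate fragments with ';'; keep the outer bounds only
            numbers = offsets.replace(";", " ").split()
            text = fields[2] if len(fields) > 2 else ""
            spans[ann_id] = Span(ann_id, label, int(numbers[0]), int(numbers[-1]), text)
        elif ann_id.startswith("R"):
            # Relation: "R1<TAB>LABEL Arg1:T1 Arg2:T2" (argument names are an assumption)
            label, first, second = fields[1].split()
            relations[ann_id] = Relation(
                ann_id, label, first.split(":", 1)[1], second.split(":", 1)[1]
            )
    return spans, relations


# Hypothetical sample content; offsets and text are placeholders.
sample = (
    "T1\tREACTION_SPAN 0 42\tExample 1 ...\n"
    "T2\tREACTION_SPAN 50 120\tExample 2 ...\n"
    "R1\tREF Arg1:T2 Arg2:T1\n"
)
spans, relations = parse_ann(sample)
print(spans["T1"].label, relations["R1"].arg1, relations["R1"].arg2)
```

Other annotation kinds that BRAT supports (events, attributes, notes) are silently ignored by this sketch.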

Dataset and visualization

The dataset is annotated using our modified version of BRAT. The challenge we faced is that reaction spans are often very long, which makes it hard for annotators to link two spans that are far apart. We therefore use a side-by-side view: the original text and annotated spans are displayed on the left, and dummy nodes corresponding to the spans are shown on the right. Annotators can then simply link the dummy nodes on the right.

Visualization for the sample data can be found here (it may take a few seconds to load).

In our release, we provide two versions of each dataset, in folders such as sample and sample-vis. The first has spans and relations in a single ann file per patent, while the second splits them into two ann files to support the side-by-side view. The second version can be visualized with our modified BRAT; the first is easier to process programmatically.

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.
  • Participants are not required to predict CUEs, and the evaluation is based on REACTION_SPAN and REF only, so please exclude CUE-related predictions.
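One way to prepare such an archive is sketched below in Python. This is an illustration only. The function names `filter_ann` and `build_submission` are hypothetical, and placing the *.ann files at the top level of the zip is an assumption, since the page does not specify the internal folder layout. The sketch keeps only REACTION_SPAN and REF lines and skips any file that is not an *.ann file.

```python
import io
import zipfile

# Labels used in evaluation according to the task description.
KEEP_LABELS = {"REACTION_SPAN", "REF"}


def filter_ann(content):
    """Drop every annotation line whose label is not REACTION_SPAN or REF."""
    kept = []
    for line in content.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        label = fields[1].split(" ", 1)[0]
        if label in KEEP_LABELS:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


def build_submission(predictions):
    """Zip filtered *.ann predictions; `predictions` maps file name -> file content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in predictions.items():
            if name.endswith(".ann"):  # other files are not needed for evaluation
                archive.writestr(name, filter_ann(content))
    return buffer.getvalue()


# Hypothetical usage: write the returned bytes to submission.zip.
# with open("submission.zip", "wb") as handle:
#     handle.write(build_submission({"0001.ann": "..."}))
```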

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of BRATEval repository.
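For reference, the three metrics can be sketched in Python as below. This is not the BRATEval implementation: the matching criteria used on the server are defined by BRATEval, whereas this sketch assumes exact matching of hashable items. The function name `prf` and the example tuples (label plus span offsets) are hypothetical.

```python
def prf(gold, predicted):
    """Return (precision, recall, F1) for two collections of hashable annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical relations written as (label, source span offsets, target span offsets).
gold = {("REF", (50, 120), (0, 42)), ("REF", (130, 200), (0, 42))}
pred = {("REF", (50, 120), (0, 42)), ("REF", (210, 260), (0, 42))}
print(prf(gold, pred))
```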

Task 2: Anaphora resolution

This task requires the resolution of general anaphoric dependencies between expressions in chemical patents. In this task, we define five types of anaphoric relationships, common in chemical patents:

  • Co-reference: two expressions/mentions that refer to the same entity.
  • Transformed: two chemical compound entities that are initially based on the same chemical components and have undergone possible changes through various conditions (e.g., pH and temperature).
  • Reaction-associated: the relationship between a chemical compound and its immediate sources via a mixing process. The immediate sources do not need to be reagents, but they need to end up in the corresponding product. The source compounds retain their original chemical structure.
  • Work-up: the relationship between chemical compounds that were used for isolation or purification purposes, and their corresponding output products.
  • Contained: the association holding between chemical compounds and the related equipment in which they are placed. The direction of the relation is from the related equipment to the previous chemical compound.

Data format

There are two types of data files, which are both in plain text using UTF-8 encoding:

  • *.txt files: the text snippets extracted from the original patent PDFs.
  • *.ann files: the annotation files containing span and relation annotations in BRAT standoff format.
    • {COREFERENCE, TRANSFORMED, REACTION_ASSOCIATED, WORK_UP, CONTAINED} are the five anaphoric relations we consider in this task.
    • Every identified relation should be labeled as one of the five types.
    • The referring direction is from anaphor to its corresponding antecedent.
    • The anaphor in a relation has the same label as the relation, i.e. one of the five types, while the antecedent is always labeled as ENTITY.
    • Note that an anaphor/antecedent may have multiple ranges (discontinuous text-bound).
For more details, please refer to our annotation guideline, which is available here.
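The labeling rules above can be checked mechanically before submitting. The Python sketch below is a hedged illustration, not official tooling. The names `parse_offsets` and `check_labels` are hypothetical, and it assumes that the first relation argument (`Arg1`) is the anaphor and the second (`Arg2`) is the antecedent. Discontinuous spans are assumed to use the BRAT convention of fragments separated by `;`.

```python
RELATION_TYPES = {"COREFERENCE", "TRANSFORMED", "REACTION_ASSOCIATED", "WORK_UP", "CONTAINED"}


def parse_offsets(offset_field):
    """Turn '20 23;30 38' into [(20, 23), (30, 38)]; one pair per fragment."""
    return [tuple(int(x) for x in fragment.split()) for fragment in offset_field.split(";")]


def check_labels(content):
    """Return a list of messages for relations that break the labeling rules."""
    span_labels, relations, problems = {}, [], []
    for line in content.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0].startswith("T"):
            label, offsets = fields[1].split(" ", 1)
            parse_offsets(offsets)  # raises ValueError if the offsets are malformed
            span_labels[fields[0]] = label
        elif fields[0].startswith("R"):
            label, first, second = fields[1].split()
            relations.append(
                (fields[0], label, first.split(":", 1)[1], second.split(":", 1)[1])
            )
    # Assumption: Arg1 is the anaphor and Arg2 is the antecedent.
    for rel_id, label, anaphor, antecedent in relations:
        if label not in RELATION_TYPES:
            problems.append(f"{rel_id}: unknown relation type {label}")
        if span_labels.get(anaphor) != label:
            problems.append(f"{rel_id}: anaphor {anaphor} is not labeled {label}")
        if span_labels.get(antecedent) != "ENTITY":
            problems.append(f"{rel_id}: antecedent {antecedent} is not labeled ENTITY")
    return problems


# Hypothetical sample with one discontinuous anaphor.
sample = (
    "T1\tENTITY 0 8\tcompound\n"
    "T2\tCOREFERENCE 20 23;30 38\tthe compound\n"
    "R1\tCOREFERENCE Arg1:T2 Arg2:T1\n"
)
print(check_labels(sample))
```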

Transitive co-reference relationships

If there are co-reference links T1->T2 and T2->T3, then T1->T3 is also a valid link. During evaluation, we look for all valid links, and missing links count as false negatives. To help you post-process your submission, we provide our code to generate all valid links given existing ones [python][HTML]. The code will append new links to existing files.
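The idea can be illustrated with a short, independent Python sketch. It is not the organizers' released code. The function name `transitive_closure` is hypothetical, links are treated as directed (anaphor, antecedent) pairs as in the example above, and self-links that would arise from cycles are dropped as a design choice of this sketch.

```python
from collections import defaultdict


def transitive_closure(links):
    """Return the given directed links plus every link implied by transitivity."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
    closed = set()
    for start in list(graph):
        # Depth-first walk collecting everything reachable from `start`.
        stack, seen = list(graph[start]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        closed.update((start, node) for node in seen if node != start)
    return closed


links = {("T1", "T2"), ("T2", "T3")}
print(sorted(transitive_closure(links)))
```

Each implied pair would then be written back to the *.ann file as a new relation line with a fresh relation ID.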

Submission format

A valid submission is a compressed folder (e.g. submission.zip) consisting of prediction files (*.ann files).

  • Please just submit *.ann files, as other files are not necessary for evaluation.

Evaluation

We use standard precision, recall, and F-score as our primary evaluation metrics. The evaluation system we use on the server is available on the download page of BRATEval repository.

Annotation Guidelines

To learn how the datasets are annotated and to gain further insight into the tasks, please see the annotation guidelines.

Pre-trained ChemPatent Word Embeddings

In related work, we have released a set of new word embeddings, named ChemPatent Word Embeddings, trained on a collection of 84,076 full patent documents (1B tokens) across 7 patent offices. We have also released an ELMo model pre-trained on the same corpus, which provides contextualized word representations. We have demonstrated that ChemPatent Word Embeddings produce better performance than word embeddings pre-trained on biomedical literature corpora.

To access and use the released ChemPatent Word Embeddings and the pre-trained ELMo model, please see the GitHub repository for ChemPatent Embeddings.

For detailed information about the embeddings, please see the original paper at https://www.aclweb.org/anthology/W19-5035.pdf.

Relevant Background:
  1. Nguyen DQ, Zhai Z, Yoshikawa H, Fang B, Druckenbrodt C, Thorne C, Hoessel R, Akhondi SA, Cohn T, Baldwin T and Verspoor K. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In ECIR 2020. PDF.
  2. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019. https://www.aclweb.org/anthology/W19-5035.pdf
  3. Yoshikawa H, Verspoor K, Baldwin T, Nguyen DQ, Zhai Z, Akhondi S, Thorne C, Druckenbrodt C. (2019) Detecting Chemical Reaction Schemes in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019. https://www.aclweb.org/anthology/U19-1014.pdf
To appear
  1. He J, Fang B, Yoshikawa H, Li Y, Akhondi SA, Druckenbrodt C, Thorne C, Afzal Z, Zhai Z, Cavedon L, Cohn T, Baldwin T, Verspoor K. (2021 to appear) ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents. In: Advances in Information Retrieval. ECIR 2021.
  2. Fang B, Druckenbrodt C, Akhondi SA, He J, Baldwin T, Verspoor K. (2021 to appear) ChEMU-Ref: A corpus for modeling anaphora resolution in the chemical domain. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
2020
  1. He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 12260: 237-254.
  2. He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Wang J, Ren Y, Zhang Z, Zhang Y, Dao MH, Ruas P, Lamurias A, Couto F, Copara J, Naderi N, Knafou J, Ruch P, Teodoro D, Lowe D, Mayfield J, Koksal A, Donmez H, Ozkirimli E, Ozgur A, Mahendran D, Gurdin G, Lewinski N, Tang C, McInnes BT, Malarkodi CS, Rao TP, Devi SL, Cavedon L, Cohn T, Baldwin T, Verspoor K. (2020) An Extended Overview of the CLEF 2020 ChEMU Lab: Information Extraction of Chemical Reactions from Patents. Proceedings of the CLEF 2020 conference. Thessaloniki, Greece. 2020-09. http://hesso.tind.io/record/6175
  3. Nguyen DQ, Zhai Z, Yoshikawa H, Fang B, Druckenbrodt C, Thorne C, Hoessel R, Akhondi SA, Cohn T, Baldwin T, Verspoor K. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Jose J. et al. (eds) Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. doi: 10.1007/978-3-030-45442-5_74 PDF
  4. Zhai, Z, Druckenbrodt, C, Thorne, C, Akhondi, SA, Nguyen, DQ, Cohn, T, & Verspoor, K. (2020) ChemTables: A Dataset for Semantic Classification of Tables in Chemical Patents. [paper], [dataset].
  5. Verspoor K, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, He J, Zhai Z. ChEMU dataset for information extraction from chemical patents. [dataset].
2019
  1. Yoshikawa H, Nguyen DQ, Zhai Z, Druckenbrodt C, Thorne C, Akhondi SA, Baldwin T, Verspoor K. (2019) Detecting Chemical Reactions in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019. https://www.aclweb.org/anthology/U19-1014.pdf [Best Paper Award]
  2. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019. https://www.aclweb.org/anthology/W19-5035.pdf
2018
  1. Zhai Z, Nguyen DQ, Verspoor K*. (2018) Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition. Proceedings of the 9th International Workshop on Health Text Mining and Information Analysis (LOUHI 2018), pages 38–43, Brussels, Belgium, October 31, 2018. arXiv:1808.08450. http://aclweb.org/anthology/W18-5605
  2. Nguyen DQ, Verspoor K*. (2018) Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL2018. arXiv:1805.10586. http://aclweb.org/anthology/W18-2314
FAQ
  1. Can I log in using my credentials used in CLEF registration?

    To provide a more secure environment on our submission website, we use a registration system that is independent of CLEF. To log into our submission website for the first time, you will need to sign up by providing some basic information: your username, email, password, and institution. We apologize for any inconvenience.

  2. How can I make a submission?

    You can choose to make a submission against the development or test dataset by toggling the "data split" in the submission panel. You will receive evaluation results right after your submission is uploaded successfully. A ranking of all your submissions is shown in your private leaderboard. You may also click "publish" to make the performance of a submission visible to all teams; a published submission's performance appears in the public leaderboard.