Awesome Biomedical Information Extraction

🧫 A curated list of resources relevant to doing Biomedical Information Extraction (including BioNLP)

Last Update: Nov. 29, 2021, 7:06 p.m.

Thank you caufieldjh & contributors
View Topic on GitHub:
caufieldjh/awesome-bioie

Research Overviews

Groups Active in the Field

Organizations

Journals

Conferences and Other Events

full-day special session on Text Mining for Biology and Healthcare). The meeting is combined with that of the European Conference on Computational Biology (ECCB) in odd-numbered years.

Challenges

TASS, an annual workshop for semantic analysis in Spanish.

more bioinformatics-focused challenges, this challenge opened in October 2019 and focuses on using electronic health record data to predict patient mortality. Uses a synthetic data set rather than real EHR contents.

Guides

Video Lectures and Online Courses

Code Libraries

A smorgasbord architecture for coreference resolution in biomedical text

9
1
1y 7m
n/a

Medical Text Mining and Information Extraction with spaCy

335
81
107d
GPL-3.0

A full spaCy pipeline and models for scientific/biomedical documents.

1.04K
144
83d
Apache-2.0

Talk with NCBI Entrez using R

160
39
96d
n/a

paper - code - Python tools primarily intended for bioinformatics and computational molecular biology purposes, but also a convenient way to obtain data, including documents/abstracts from PubMed (see Chapter 9 of the documentation).

paper - code - A Python package and model (for use with spaCy) for doing NER with medication-related concepts.

Repos for Specific Datasets

MIMIC Code Repository: Code shared by the research community for the MIMIC-III database

1.42K
1.14K
29d
MIT

Tools, Platforms, and Services

Public release of the DeepPhe analytic software

23
6
55d
n/a

A framework for keeping biomedical text mining results up-to-date

34
7
1y 4m
MIT

Surfacing Semantic Data from Clinical Notes in Electronic Health Records for Tailored Care, Trial Recruitment and Clinical Research

67
14
4m
Apache-2.0

Framework for information extraction from tables

35
11
2y 7m
n/a

paper - A natural language processing toolkit intended for use with the text in clinical reports. Check out their live demo first to see what it does. Usable at no cost for academic research.

paper - A method for disease normalization, i.e., linking mentions of disease names and acronyms to unique concept identifiers. Downloadable version includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data below).

paper - A web platform that identifies five different types of biomedical concepts in PubMed articles and PubMed Central full texts. The full annotation sets are downloadable (see Annotated Text Data below).

paper - Performs concept normalization (see also DNorm above). Can be trained for specific concept types and can perform NER independent of other normalization functions.
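
To make the idea of concept normalization described in the entries above concrete, here is a minimal sketch of the simplest possible approach: an exact-match dictionary lookup after light text cleanup. This is not how DNorm or any other tool listed here works internally; the lexicon, the `EX:` identifiers, and the function name `normalize_mention` are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical lexicon mapping surface forms to made-up concept identifiers.
# A real system would draw these from a vocabulary such as MeSH or OMIM.
LEXICON = {
    "breast cancer": "EX:0001",
    "breast carcinoma": "EX:0001",
    "type 2 diabetes": "EX:0002",
    "t2dm": "EX:0002",
}


def normalize_mention(mention, lexicon=LEXICON):
    """Return the concept identifier for a mention, or None if it is unknown."""
    # Lowercase and collapse whitespace so trivially different spellings match.
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return lexicon.get(key)


for text in ["  Breast   Cancer ", "T2DM", "influenza"]:
    print(repr(text), "->", normalize_mention(text))
```

Exact matching misses spelling variants, word-order changes, and unseen synonyms, which is part of why dedicated normalization tools exist.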

Annotation Tools

Anafora is a web-based raw text annotation tool

227
55
1y 5m
n/a

paper - code - The brat rapid annotation tool. Supports producing text annotations visually, through the browser. Not subject specific; appropriate for many annotation projects. Visualization is based on that of the stav tool.

Word Embeddings

paper - Word embeddings derived from biomedical text (>10 million PubMed abstracts) using the popular word2vec tool.

paper - code - Word embeddings derived from biomedical text (>27 million PubMed titles and abstracts), including subword embedding model based on MeSH.

Language Models

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

506
65
1y 6m
n/a

Repository for Publicly Available Clinical BERT Embeddings

399
93
1y 97d
MIT

ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission (CHIL 2020 Workshop)

208
56
6m
n/a

A very simple framework for state-of-the-art Natural Language Processing (NLP)

10.99K
1.77K
12d
n/a

A BERT model for scientific text.

1K
168
1y 53d
Apache-2.0

BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).

346
53
97d
n/a

paper - A BERT model trained from scratch on PubMed, with versions trained on abstracts+full texts and on abstracts alone.

Biomedical Text Sources

paper - 348,566 MEDLINE entries (title and sometimes abstract) from between 1987 and 1991. Includes MeSH labels. Primarily of historical significance.

Annotated Text Data

53
13
1y 9m
n/a

paper - A pilot dataset containing standardised information, and annotations of occurrence in text, about ~5,000 known adverse reactions for 200 FDA-approved drugs.

paper - 15,000 sentences (10,000 training and 5,000 test) annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms.

paper - 1,500 articles (title and abstract) published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease interactions. Requires registration.

paper - >2,400 articles annotated with chemical-protein interactions of a variety of relation types. Requires registration.

paper - A corpus of 793 biomedical abstracts annotated with names of diseases and related concepts from MeSH and OMIM.

paper - A web platform that identifies five different types of biomedical concepts in PubMed articles and PubMed Central full texts. The full annotation sets are downloadable.

paper - 203 ambiguous words and 37,888 automatically extracted instances of their use in biomedical research publications. Requires UTS account.

paper - A corpus of sentences from medical and biological documents, annotated for negation, speculation, and linguistic scope.

Protein-protein Interaction Annotated Corpora

paper - 225 MEDLINE abstracts annotated for PPI.

paper - 120 full text articles annotated for PPI and genetic interactions. Used in the BioCreative V BioC task.

paper - 1,100 sentences from biomedical research abstracts annotated for relationships (including PPI), named entities, and syntactic dependencies. Additional information and download links are here.

paper - 50 scientific abstracts referenced by the Human Protein Reference Database, annotated for PPI.

paper - 486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins (hence, PPI annotations).

LLL

paper - 77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions (so, fairly close to PPI annotations). Additional information is here.

Other Datasets

paper - A database of prevalence and co-occurrence frequencies of conditions, drugs, procedures, and patient demographics extracted from electronic health records. Does not include original record text.

paper - A database of manually curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures. Useful for assembling ontologies of the related concepts, such as types of chemicals.

paper - Deidentified health data from ~60,000 intensive care unit admissions. Requires completion of an online training course (CITI training) and acceptance of a data use agreement prior to use.

reference manual - A large and comprehensive collection of biomedical terminology and identifiers, as well as accompanying tools and scripts. Depending on your purposes, the single file MRCONSO.RRF may be sufficient, as this file contains unique identifiers and names for all concepts in the UMLS Metathesaurus. See also the Ontologies and Controlled Vocabularies section below.

paper - A database of observations from more than 200 thousand intensive care unit admissions, with consistent structure. Requires registration, training course completion, and data use agreement.
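
The UMLS entry above notes that the single file MRCONSO.RRF may be enough for many purposes. As a rough sketch of what using it could look like, the snippet below collects the names recorded for each concept identifier (CUI) in one language. It assumes the file is pipe-delimited with the CUI in the first field, the language code in the second, and the name string in the fifteenth; check these positions against the UMLS reference manual for your release. The sample rows and the function name `load_concept_names` are made up for illustration and are not real UMLS content.

```python
from collections import defaultdict

# Assumed 0-based field positions in MRCONSO.RRF; verify against the
# UMLS reference manual before relying on them.
CUI, LAT, STR = 0, 1, 14


def load_concept_names(lines, language="ENG"):
    """Map each CUI to the set of its names in the requested language."""
    names = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # skip blank or malformed lines
        if fields[LAT] == language:
            names[fields[CUI]].add(fields[STR])
    return dict(names)


# Fabricated rows in the assumed layout (not real UMLS data).
sample = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|example disease|0|N||\n",
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||SRC|SY|X1|example disorder|0|N||\n",
    "C0000002|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC|PT|X2|ejemplo|0|N||\n",
]

concept_names = load_concept_names(sample)
for cui, strings in concept_names.items():
    print(cui, sorted(strings))
```

For the real file, pass an open file handle instead of the sample list, for example `load_concept_names(open("MRCONSO.RRF", encoding="utf-8"))`. Because the function reads line by line, it does not load the whole file into memory before filtering, though the resulting dictionary can still be large.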

Ontologies and Controlled Vocabularies

paper - An ontology of human diseases. Has cross-links to MeSH, ICD, NCI Thesaurus, SNOMED, and OMIM. Public domain. Available on GitHub and on the OBO Foundry.

paper - Normalized names for clinical drugs and drug packs, with combined ingredients, strengths, and form, and assigned types from the Semantic Network (see below). Released monthly.

paper - A general English lexicon that includes many biomedical terms. Updated yearly since 1994 and still updated as of 2019. Part of UMLS but does not require UTS account to download.

paper - Mappings between >3.8 million concepts, 14 million concept names, and >200 sources of biomedical vocabulary and identifiers. It's big. It may help to prepare a subset of the Metathesaurus with the MetamorphoSys installation tool, but we're still talking about ~30 GB of disk space required for the 2019 release. See the manual here. Requires UTS account.

paper - Lists of 133 semantic types and 54 semantic relationships covering biomedical concepts and vocabulary. Is the Metathesaurus too complex for your needs? Try this. Does not require UTS account to download.

Data Models

Definition and DDLs for the OMOP Common Data Model (CDM)

592
363
63d
Apache-2.0

code - A data model of biological entities. Provided as a YAML file.

paper - An architecture for biomedical data analysis, integration, and visualization. Conceptually based on the visual modeling language UML.