Awesome Biomedical Information Extraction
🧫 A curated list of resources relevant to doing Biomedical Information Extraction (including BioNLP)
Thank you caufieldjh & contributors
Groups Active in the Field
Conferences and Other Events
full-day special session on Text Mining for Biology and Healthcare). The meeting is combined with that of the European Conference on Computational Biology (ECCB) on odd-numbered years.
TASS, an annual workshop for semantic analysis in Spanish.
Video Lectures and Online Courses
A smorgasbord architecture for coreference resolution in biomedical text
Medical Text Mining and Information Extraction with spaCy
A full spaCy pipeline and models for scientific/biomedical documents.
Talk with NCBI Entrez using R
Repos for Specific Datasets
MIMIC Code Repository: Code shared by the research community for the MIMIC-III database
Tools, Platforms, and Services
Public release of the DeepPhe analytic software
A framework for keeping biomedical text mining results up-to-date
Surfacing Semantic Data from Clinical Notes in Electronic Health Records for Tailored Care, Trial Recruitment and Clinical Research
Framework for information extraction from tables
paper - A method for disease normalization, i.e., linking mentions of disease names and acronyms to unique concept identifiers. Downloadable version includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data below).
Anafora is a web-based raw text annotation tool
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Repository for Publicly Available Clinical BERT Embeddings
ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission (CHIL 2020 Workshop)
A very simple framework for state-of-the-art Natural Language Processing (NLP)
A BERT model for scientific text.
BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
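One entry above describes disease normalization as linking mentions of disease names and acronyms to unique concept identifiers. As a rough illustration of that task only, here is a minimal dictionary-lookup sketch. It is not the method of any tool listed here; the lexicon, the `DEMO:` identifiers, and the function names are all hypothetical.

```python
import re

# Hypothetical toy lexicon: made-up concept identifiers mapped to known names.
# A real system would load names from a resource such as a disease vocabulary.
LEXICON = {
    "DEMO:001": ["type 2 diabetes mellitus", "type 2 diabetes", "T2DM"],
    "DEMO:002": ["breast cancer", "breast carcinoma"],
}


def make_key(text):
    """Lowercase the text and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(lexicon):
    """Map every normalized name or acronym to its concept identifier."""
    index = {}
    for concept_id, names in lexicon.items():
        for name in names:
            index[make_key(name)] = concept_id
    return index


def normalize(mention, index):
    """Return the concept identifier for a mention, or None if it is unknown."""
    return index.get(make_key(mention))


INDEX = build_index(LEXICON)
result_hyphenated = normalize("Type-2 Diabetes", INDEX)
result_acronym = normalize("t2dm", INDEX)
result_unknown = normalize("common cold", INDEX)
```

Exact lookup like this misses spelling variants and ambiguous acronyms, which is why dedicated normalization tools typically use learned or fuzzy matching instead.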
Biomedical Text Sources
paper - 348,566 MEDLINE entries (title and sometimes abstract) from between 1987 and 1991. Includes MeSH labels. Primarily of historical significance.
Annotated Text Data
paper - A pilot dataset containing standardised information, and annotations of occurrence in text, about ~5,000 known adverse reactions for 200 FDA-approved drugs.
paper - 15,000 sentences (10,000 training and 5,000 test) annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms.
paper - 1,500 articles (title and abstract) published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease interactions. Requires registration.
paper - >2,400 articles annotated with chemical-protein interactions of a variety of relation types. Requires registration.
list of data challenges for individual descriptions.
paper - 203 ambiguous words and 37,888 automatically extracted instances of their use in biomedical research publications. Requires UTS account.
Protein-protein Interaction Annotated Corpora
paper - 225 MEDLINE abstracts annotated for PPI.
paper - 120 full text articles annotated for PPI and genetic interactions. Used in the BioCreative V BioC task.
paper - 1,100 sentences from biomedical research abstracts annotated for relationships (including PPI), named entities, and syntactic dependencies. Additional information and download links are here.
paper - 50 scientific abstracts referenced by the Human Protein Reference Database, annotated for PPI.
paper - 486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins (hence, PPI annotations).
paper - A database of prevalence and co-occurrence frequencies of conditions, drugs, procedures, and patient demographics extracted from electronic health records. Does not include original record text.
paper - A database of manually curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures. Useful for assembling ontologies of the related concepts, such as types of chemicals.
paper - Deidentified health data from ~60,000 intensive care unit admissions. Requires completion of an online training course (CITI training) and acceptance of a data use agreement prior to use.
reference manual - A large and comprehensive collection of biomedical terminology and identifiers, as well as accompanying tools and scripts. Depending on your purposes, the single file MRCONSO.RRF may be sufficient, as this file contains unique identifiers and names for all concepts in the UMLS Metathesaurus. See also the Ontologies and Controlled Vocabularies section below.
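For readers who only need concept identifiers and names from MRCONSO.RRF, the following is a minimal parsing sketch. It assumes the pipe-delimited layout and column order given in the UMLS reference manual (each row ends with a trailing pipe); check both against your release. The sample rows and their identifiers are invented for illustration, and `read_concept_names` is a hypothetical helper name.

```python
import io
from collections import defaultdict

# Column order of MRCONSO.RRF as described in the UMLS reference manual
# (an assumption to verify against the release you download).
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def read_concept_names(lines, language="ENG"):
    """Collect the set of names (STR) per concept identifier (CUI)."""
    names = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The trailing pipe yields one extra empty value; zip() drops it.
        row = dict(zip(MRCONSO_FIELDS, line.split("|")))
        if row["LAT"] == language:
            names[row["CUI"]].add(row["STR"])
    return dict(names)


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|example disease|0|N||\n"
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||||SRC|SY|X1|example disorder|0|N||\n"
    "C0000001|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC|PT|X1|enfermedad de ejemplo|0|N||\n"
    "C0000002|ENG|P|L0000004|PF|S0000004|Y|A0000004||||SRC|PT|X2|example drug|0|N||\n"
)
concept_names = read_concept_names(sample)
```

With the real file, pass an open file handle instead, for example `open("MRCONSO.RRF", encoding="utf-8")`. Because the function streams line by line, only the collected names are held in memory.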
Ontologies and Controlled Vocabularies
paper - Normalized names for clinical drugs and drug packs, with combined ingredients, strengths, and form, and assigned types from the Semantic Network (see below). Released monthly.
paper - A general English lexicon that includes many biomedical terms. Updated yearly since 1994 and still updated as of 2019. Part of UMLS but does not require UTS account to download.
paper - Mappings between >3.8 million concepts, 14 million concept names, and >200 sources of biomedical vocabulary and identifiers. It's big. It may help to prepare a subset of the Metathesaurus with the MetamorphoSys installation tool, but the 2019 release still requires ~30 GB of disk space. See the manual here. Requires UTS account.
Definition and DDLs for the OMOP Common Data Model (CDM)