Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Warning

This publication does not belong to the Institute of Computer Science but to the Central European Institute of Technology. The official publication page is on the muni.cz website.
Authors

VOLLMAR Melanie TIRUNAGARI Santosh HARRUS Deborah ARMSTRONG David GÁBOROVÁ Romana GUPTA Deepti AFONSO Marcelo Querino Lima EVANS Genevieve VELANKAR Sameer

Year of publication 2024
Type Article in Periodical
Journal / Source Scientific Data
MU Faculty or unit

Central European Institute of Technology

Citation
www https://www.nature.com/articles/s41597-024-03841-9
Doi http://dx.doi.org/10.1038/s41597-024-03841-9
Keywords MECHANISM; COMPLEX; ONTOLOGY; DOMAIN
Description We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
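
The reported metrics can be checked against each other, since the F1-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition (the record itself does not state how F1 was computed or averaged); the function name `f1_score` is illustrative and not taken from the publication.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the description: precision 0.90, recall 0.92.
reported_f1 = f1_score(0.90, 0.92)
print(round(reported_f1, 2))  # prints 0.91
```

Under this assumption, the quoted precision and recall are consistent with the quoted F1-measure of 0.91.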