Unsupervised machine learning approach for hierarchical graph-based representation of natural language text collections.

Managing big data efficiently is important in many fields, especially when the data consists of human-written documents. Recent advances in Natural Language Processing (NLP), particularly large language models (LLMs), have made it possible to solve many tasks in this domain, albeit at the cost of a high demand for labelled data, compute resources, and specialized skills. To address these limitations, the current study proposed an NLP pipeline for identifying topic hierarchies in collections of scientific publications. The work focused on evaluating the available unsupervised machine learning methods and quality metrics in NLP, and on developing visualization techniques, in order to build a prototype of the pipeline. The proposed solution is based on the hARTM approach, optimized for interpretability. It demonstrated the capacity to infer human-interpretable topic hierarchies from collections of scientific texts and to construct a meaningful hierarchy of topic-based document representations. The visualization approaches rely on multidimensional scaling (MDS) to present inter-document similarity and on Sankey plots to show how document clusters relate to one another within the topic hierarchy. The utility of the pipeline was demonstrated on two datasets, with a focus on the interpretability and meaningfulness of the topic hierarchy and the associated topic definitions. Potential application areas include personal education and scientific writing.

Author: Jevgenijs Bodrenko

Supervisor: Irina Jackiva

Degree: Master

Year: 2024

Work Language: English

Study programme: Computer Sciences
