3.4 Processing of natural language texts

As a rule, the normative documents and resources of the educational process (work programs, methodical materials, tasks, tests, evaluation results) are presented as unstructured or poorly structured natural-language texts. Transforming such text materials into a knowledge graph can be a difficult and time-consuming task that requires highly qualified experts. Modern NLP methods and tools can significantly simplify and speed up this process and make it more stable and consistent. To automate the transformation of text documents with a defined subject-area context into a knowledge graph, the approaches and tools proposed in [19] can be used.

The transformation process consists of the following main steps: tokenization - dividing the text into meaningful segments called tokens; part-of-speech tagging - determining the part of speech of each token and labeling the token with it; morphological analysis - analyzing the internal structure of words and identifying the root; lemmatization - reducing the inflected forms of a word to a common base form (lemma) so that they can be analyzed as a single element; named entity recognition - identifying and labeling named and numerical entities; dependency parsing - determining the syntactic relations between tokens, which also makes it possible to detect sentence boundaries and iterate over base noun phrases; entity linking - linking the recognized entities to nodes of the KB graph.
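To make the data flow between these steps concrete, a minimal sketch in Python follows. It is a toy illustration, not the tooling proposed in [19]: it uses only the standard library, replaces trained models with lookup tables, and omits part-of-speech tagging, morphological analysis and dependency parsing, which in practice require a trained NLP model. The lemma table, the KB node identifiers and all function names are hypothetical.

```python
import re

# Hypothetical illustrative data: a tiny lemma table and a KB node index.
LEMMAS = {"programs": "program", "tests": "test", "graphs": "graph"}
KB_NODES = {
    "work program": "kb:WorkProgram",
    "test": "kb:Test",
    "knowledge graph": "kb:KnowledgeGraph",
}


def tokenize(text):
    """Tokenization: split the text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lemmatize(tokens):
    """Lemmatization: map each token to its base form via a lookup table."""
    return [LEMMAS.get(token, token) for token in tokens]


def recognize_entities(lemmas, max_len=3):
    """Entity recognition: greedy longest-match lookup of lemma n-grams
    against the KB vocabulary (a gazetteer stand-in for a trained model)."""
    entities = []
    i = 0
    while i < len(lemmas):
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            phrase = " ".join(lemmas[i:i + n])
            if phrase in KB_NODES:
                entities.append(phrase)
                i += n
                break
        else:
            # No phrase starting at position i matched; move on.
            i += 1
    return entities


def link_entities(entities):
    """Entity linking: attach the KB graph node identifier to each entity."""
    return [(entity, KB_NODES[entity]) for entity in entities]


def text_to_links(text):
    """Run the simplified pipeline end to end."""
    return link_entities(recognize_entities(lemmatize(tokenize(text))))


links = text_to_links("The work programs include tests.")
# links == [("work program", "kb:WorkProgram"), ("test", "kb:Test")]
```

Each function consumes the output of the previous one, so any step can be replaced by a model-based implementation without changing the rest of the chain.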

As a result of the transformation, a prototype subgraph is created, which must be reviewed by experts and, if accepted, integrated into the KB graph. Transformation and review are repeated iteratively until the result is accepted. If a prototype is rejected, the next iteration is performed with new parameter values, which should bring its result closer to the goal. On acceptance, the KB acquires new knowledge about the subject area obtained from the processed source.
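The iterative transformation and review cycle can be sketched as a simple loop. This is an illustration under stated assumptions rather than the procedure of [19]: the function name, the callback signatures, the iteration limit and the single "threshold" parameter are all hypothetical, and the expert review is stood in for by a callable.

```python
def refine_until_accepted(source, transform, review, adjust, params,
                          max_iterations=10):
    """Repeat transformation and review until the prototype is accepted.

    transform(source, params) -> prototype subgraph
    review(prototype)         -> True if the experts accept it
    adjust(params, prototype) -> new parameter values for the next iteration
    Returns (prototype, params, iteration), or (None, params, max_iterations)
    if no prototype was accepted within the iteration limit (an assumption;
    the text itself sets no limit).
    """
    for iteration in range(1, max_iterations + 1):
        prototype = transform(source, params)
        if review(prototype):
            return prototype, params, iteration
        params = adjust(params, prototype)
    return None, params, max_iterations


# Hypothetical stand-ins: candidate edges with integer confidence scores.
CANDIDATES = [("a", "b", 95), ("b", "c", 60), ("c", "d", 40)]


def transform(source, params):
    # Keep only the edges whose score reaches the current threshold.
    return [edge for edge in source if edge[2] >= params["threshold"]]


def review(prototype):
    # Stand-in for expert review: accept once at least two edges survive.
    return len(prototype) >= 2


def adjust(params, prototype):
    # Loosen the threshold for the next iteration.
    return {"threshold": params["threshold"] - 20}


prototype, final_params, iterations = refine_until_accepted(
    CANDIDATES, transform, review, adjust, {"threshold": 90}
)
# Accepted on the third iteration with threshold 50 and two edges.
```

Only an accepted prototype would then be integrated into the KB graph; a rejected one affects nothing but the parameters of the next iteration.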