Development of the academic corpus of the Kazakh language: goal, task, significance (in the direction of humanities)
DOI:
https://doi.org/10.31489/2025ph3/89-98
Keywords:
development of academic corpus, corpus linguistics, academic text, humanities, Kazakh language, scientific style, metadata
Abstract
The academic corpus is one of the key components of the educational and scientific sphere. Its development is recognized as a comprehensive process aimed at improving the quality of higher education and enhancing scientific potential. The article addresses the creation of an academic corpus of the Kazakh language in the field of the humanities and analyzes the goals, objectives, and significance of developing such a corpus. The relevance of the project lies in the creation of a comprehensive empirical database of scholarly texts in the state language, which contributes to the study of Kazakh as a language of science. The aim of the study is to analyze the functions and objectives of the corpus of humanities texts in the Kazakh language, its role in the development of the Kazakh scientific language, and the characteristics of the corpus. The corpus draws on texts from the following fields: philosophy, history, religious studies, archaeology and ethnology, oriental studies, theology, Turkology, museology, foreign philology, translation studies, and Kazakh philology. The article describes the process of corpus development in accordance with the relevant requirements and criteria. Particular attention is given to the significance of the corpus as a tool for the advancement of natural language processing technologies. The corpus makes it possible to identify the features of academic vocabulary, morphology, and syntax of the Kazakh language. The inclusion of various types of scientific sources in the corpus (abstracts, theses, conference papers, and journal articles, as well as monographs, dissertations, textbooks, and teaching materials) is justified by the need for a comprehensive analysis of scholarly data. Particular emphasis is placed on metadata (meta-annotation), which plays a crucial role at all stages, from the collection to the digitization of materials. Metadata is examined in the context of international practices, and its importance for the continued study of the Kazakh language as a medium of scientific communication is highlighted. The study employs methods of description, comparison, philological examination, and analysis. Descriptive and comparative methods were used to study international practices, while descriptive and analytical methods were used to analyze linguistic data from both content-related and structural perspectives. The research results open up new prospects for scholarly inquiry in fields such as terminology, lexicography, cognitive and gender linguistics, translation studies, and other related disciplines. It is emphasized that scientific and methodological aspects must be taken into account at all stages, from defining objectives to annotation. The article also raises issues that require further refinement in scientific and practical work, considering the internal classification of the humanities based on their objects of study. The study identifies the interdisciplinary specifics of maintaining a balanced selection of texts in the corpus, which depends on the corpus's intended purpose. It also analyzes how the predominance of publications in English and Russian in a number of disciplines affects the implementation of the principle of balance with respect to the Kazakh language within the corpus.