Programming for Corpus Linguistics with Python and Dataframes
Title: Programming for Corpus Linguistics with Python and Dataframes
Author: Daniel Keller
Publisher: Cambridge University Press
Year: 2024
Pages: 114
Language: English
Format: pdf (true), epub
Size: 10.1 MB

This Element offers intermediate and experienced programmers algorithms for corpus linguistic (CL) programming in Python using dataframes, which provide a fast, efficient, and intuitive set of methods for working with large, complex datasets such as corpora. It demonstrates principles of dataframe programming applied to CL analyses and gives complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. An additional algorithm for creating dataframe corpora is also presented, including methods for tokenizing, part-of-speech tagging, and lemmatizing with spaCy. The Element provides a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.

Programming often involves manipulating data. In CL, our data are samples of language, and our operations are things like counting word types, calculating association strength, measuring dispersion, and so on. To accomplish these things, we need to be able to hold and reference data in a computer’s memory, often in discrete chunks. We do this with variables. To perform operations on these variables, we write instructions (code) that the Python interpreter understands how to carry out. We can group sets of instructions and save them to be reused later. These are called functions. Often, we will use functions written by other people to save time and guarantee replicability.
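As a minimal sketch of these ideas (not taken from the book), the snippet below stores a tiny made-up language sample in variables and wraps the counting step in a reusable function. The function name `count_types` and the sample sentence are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def count_types(tokens):
    """Return a Counter mapping each lowercased word type to its frequency."""
    return Counter(token.lower() for token in tokens)


# A variable holding a small, invented language sample.
sample = "The cat saw the dog and the dog saw the cat"

# Naive whitespace tokenization, used here only for illustration.
tokens = sample.split()

# Reuse the function on any list of tokens.
type_counts = count_types(tokens)

print(len(tokens))                 # number of tokens in the sample
print(len(type_counts))            # number of distinct word types
print(type_counts.most_common(1))  # the most frequent type and its count
```

`Counter` comes from Python's standard library, which illustrates the last point of the paragraph: relying on well-tested functions written by others saves time and makes an analysis easier to reproduce.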

This section introduces the pandas DataFrame and Series classes, methods for loading them from and saving them to disk, and methods and functions for counting values, grouping rows, and combining values. These form a core set of tools that can be used to accomplish a range of CL tasks. The focus in this section is on explaining these elements generally, while Section 4 describes algorithms that use these procedures to complete CL analyses specifically. We will use two data types extensively in this Element: DataFrames and Series. These are not core data types in Python and must be imported from the pandas package. Once imported, however, their built-in methods let us carry out corpus linguistic tasks quickly, reliably, and with minimal hardware resources.
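The operations named above can be sketched with a toy example. This is an illustration under stated assumptions, not the book's own code: the one-row-per-token layout and the column names `text_id`, `token`, and `pos` are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy corpus: one row per token, with illustrative column names.
corpus = pd.DataFrame({
    "text_id": ["t1", "t1", "t1", "t2", "t2", "t2"],
    "token": ["the", "cat", "sleeps", "the", "dog", "barks"],
    "pos": ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"],
})

# Counting values: a single column is a Series, and value_counts()
# tallies how often each distinct value occurs.
token_freq = corpus["token"].value_counts()

# Grouping rows: groupby() collects rows that share a key,
# and size() counts the rows in each group.
tokens_per_text = corpus.groupby("text_id").size()

# Combining values: join the tokens of each text back into one string.
texts = corpus.groupby("text_id")["token"].agg(" ".join)

# Saving and loading: a CSV round trip through an in-memory buffer.
# With a real corpus, a file path would replace the buffer.
buffer = io.StringIO()
corpus.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(token_freq)
print(tokens_per_text)
print(texts)
```

Each result is itself a Series or DataFrame, so these steps can be chained into longer analyses without leaving pandas.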
