Title: Minimalist Data Wrangling with Python
Author: Marek Gagolewski
Publisher: Independently published
Year: 2024-01-25 (v1.0.3.9107)
Pages: 436
Language: English
Format: pdf (true)
Size: 10.1 MB
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.
Data science aims at making sense of and generating predictions from data that have been collected in copious quantities from various sources, such as physical sensors, surveys, online forms, access logs, and (pseudo)random number generators, to name a few. Such data can take diverse forms, e.g., be given as vectors, matrices, or other tensors, graphs/networks, audio/video streams, or text.
Data usually do not come in a tidy and tamed form. Data wrangling is the very broad process of appropriately curating raw information chunks and then exploring the underlying data structure so that they become analysable.
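As a rough illustration of what curating raw information chunks can look like, here is a minimal sketch that uses only the Python standard library. The sample data, the column names (`sensor`, `reading`), and the helper `clean_readings` are hypothetical and invented for this example; they are not taken from the book.

```python
import csv
import io
import statistics

# Hypothetical raw export: stray whitespace, a missing value, a non-numeric entry.
RAW = """sensor,reading
a, 21.5
b,
c,n/a
d,19.0
a,22.5
"""


def clean_readings(text):
    """Return (sensor, reading) pairs, skipping rows whose reading is not a number."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        value = (record["reading"] or "").strip()
        try:
            rows.append((record["sensor"].strip(), float(value)))
        except ValueError:
            continue  # missing or malformed reading: drop the row
    return rows


readings = clean_readings(RAW)
mean_reading = statistics.mean(value for _, value in readings)
print(readings)
print(mean_reading)
```

Dropping unusable rows is only one possible policy. Depending on the analysis, flagging or imputing them may be more appropriate.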
This course is envisaged as a student’s first exposure to data science, providing a high-level overview as well as discussing key concepts at a healthy level of detail.
By no means do we have the ambition to be comprehensive with regard to any topic we cover. Time for that will come later in separate lectures on calculus, matrix algebra, probability, mathematical statistics, continuous and combinatorial optimisation, information theory, stochastic processes, statistical/machine learning, algorithms and data structures (take a deep breath), databases and Big Data analytics, operational research, graphs and networks, differential equations and dynamical systems, time series analysis, signal processing, etc.
We primarily focus on methods and algorithms that have stood the test of time and that continue to inspire researchers and practitioners. They all pass a reality check consisting of the following three properties, which we believe are essential in practice:
- simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),
- mathematical analysability (at least to some extent; so that we can understand their strengths and limitations),
- implementability (not too abstract on the one hand, but also not requiring any advanced computer-y hocus-pocus on the other).
This course uses the Python language, which we shall introduce from scratch. Consequently, we do not require any prior programming experience.
Over the last few years, Python has proven to be a very robust choice for learning and applying data wrangling techniques. This is possible thanks to the devoted community of open-source programmers who wrote well-known, high-quality packages such as NumPy, SciPy, Matplotlib, Pandas, Seaborn, and Scikit-learn.
Nevertheless, Python and its third-party packages are only some of the many software tools that can help extract knowledge from data. Other robust open-source choices include R and Julia.
We will focus on developing transferable skills: most of what we learn here can be applied (using different syntax but the same kind of reasoning) in other environments. Thus, this is a course on data wrangling (with Python), not a course on Python (with examples in data wrangling).