In the world of data science, efficiently managing, storing, and cleaning data is critical for deriving actionable insights. This course introduces essential data management concepts, including data warehousing, ETL pipelines, and data cleaning techniques. Covered topics include how to handle large datasets, design scalable storage solutions, and prepare data for analysis using real-world examples and practical Python demonstrations. By the end of the course, learners will be equipped with the skills needed to clean, structure, and store data for both analytical and machine learning purposes.
Data science is a broad field of study encompassing the entire data lifecycle. Before diving into the material a data scientist should know, we outline the key concepts, vocabulary, and areas of focus that are often mentioned in the field. We begin with definitions of data science as a whole and of the related fields of study, and show how they intertwine. We then give an overview of the terminology used to describe data in data science, and finally look at the ethical and privacy concerns that arise when working with data.
Data science relies heavily on statistical analysis to draw conclusions and make predictions. It is therefore important that individuals in this field have a solid grasp of the basics and understand the equations involved. This course takes us through both descriptive and inferential statistics so that we can describe numerical data and make predictions based on the data we have. Additionally, non-numerical data can often be transformed into numerical representations that can then be plugged into our equations. This course walks through some of the basic methods one should know to achieve both of these goals.
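As a rough illustration of the two goals above, the sketch below computes a few descriptive statistics with Python's standard `statistics` module and then turns a non-numerical column into numbers. The sample values and the `size_order` mapping are made up for this example and are not taken from the course material.

```python
import statistics

# Hypothetical numerical sample (illustrative values only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = statistics.mean(scores)      # arithmetic mean
median_score = statistics.median(scores)  # middle value of the sorted data
spread = statistics.pstdev(scores)        # population standard deviation

# Hypothetical non-numerical data mapped to an ordered numeric code.
sizes = ["small", "large", "medium", "small"]
size_order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = [size_order[s] for s in sizes]

print(mean_score, median_score, spread)
print(encoded_sizes)
```

An ordered integer code like this only makes sense when the categories have a natural order; unordered categories usually call for a different encoding.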
Pandas is a Python package containing data structures that make working with labeled data intuitive and easy. The two primary data structures are the Series and the DataFrame. The Pandas package also contains several useful functions for plotting data and for manipulating data and data structures. We will introduce some of the main methods available for each of these topics.
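To make the two structures named above concrete, here is a minimal sketch that assumes pandas is installed. The labels, column names, and values are invented for illustration.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temp_c")
tuesday = temps["tue"]  # look up a value by its label

# A DataFrame is a table whose columns are labeled Series.
orders = pd.DataFrame({"product": ["a", "b", "c"], "units": [3, 10, 7]})
total_units = orders["units"].sum()       # aggregate one column
big_orders = orders[orders["units"] > 5]  # keep rows matching a condition

print(tuesday, total_units)
print(big_orders)
```

Label-based lookup and boolean filtering, as used here, are two of the most common ways of selecting data in either structure.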
NumPy is the fundamental package for scientific computing in Python. This library provides a multidimensional array object that is ideal for mathematical computations. Many scientific computing libraries build their own data structures on NumPy, so a basic understanding of how to navigate this library is beneficial. This course assumes you have a basic understanding of Python fundamentals and focuses on defining arrays and introducing the most popular methods available.
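The multidimensional array described above can be sketched in a few lines, assuming NumPy is installed. The array contents are arbitrary example values.

```python
import numpy as np

# A two-dimensional array: 2 rows by 3 columns.
grid = np.array([[1, 2, 3], [4, 5, 6]])

shape = grid.shape               # dimensions of the array as a tuple
doubled = grid * 2               # arithmetic is applied element by element
column_sums = grid.sum(axis=0)   # sum down each column
flat = grid.reshape(6)           # same data viewed as one dimension

print(shape)
print(doubled)
print(column_sums)
print(flat)
```

Element-wise arithmetic without explicit loops is the main reason arrays suit mathematical computation better than nested Python lists.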
When working with data in Python, more likely than not you will be storing that data in a database rather than in a flat file. To pull data from your database into your code, you will need a driver that allows your program to communicate with the database. In this course we go into more detail on a Python database driver for MySQL, Python's built-in database library, and the SQLAlchemy library, which makes the process more Pythonic.
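The sketch below shows the general pattern of talking to a database from Python: connect, run parameterized SQL, and fetch rows. It assumes the built-in library referred to above is the standard `sqlite3` module, and it uses an in-memory database with a hypothetical `readings` table so that it runs without any setup.

```python
import sqlite3

# Connect to a temporary in-memory database (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration.
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],
)
conn.commit()

# The ? placeholder lets the driver pass the value safely to the query.
rows = conn.execute(
    "SELECT sensor, value FROM readings WHERE sensor = ? ORDER BY value",
    ("a",),
).fetchall()
conn.close()

print(rows)
```

Drivers for other databases generally follow a similar connect, execute, and fetch flow, although connection arguments and placeholder syntax differ between them.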
This course offers an introduction to the Python programming language. Here you will find guidance on how to download Python and the related libraries needed to practice the language, along with several lessons covering the foundational concepts you will need. This course assumes you have a foundational knowledge of some other high-level language; the goal is to demonstrate how you can apply that knowledge to Python specifically.