Improving data quality in data analytics & machine learning

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 45 lectures (5h 22m) | 2.03 GB

Learn why, when, and how to maximize the quality of your data to optimize data-based decisions

All of our decisions are based on data. Our sense organs gather data, our memories are data, and our gut instincts are data. If you want to make good decisions, you need high-quality data.

This course is about data quality: what it means, why it’s important, and how you can increase the quality of your data.

In this course, you will learn:

High-level strategies for ensuring high data quality, including terminology, data documentation and management, and the different research phases in which you can check and increase data quality.

Qualitative and quantitative methods for evaluating data quality, including visual inspection, error rates, and outliers. Python code is provided to show how to implement these visualizations and scoring methods using pandas, numpy, seaborn, and matplotlib.

Specific methods and algorithms for cleaning data and rejecting bad or unusual data. As above, Python code is provided to show how to implement these procedures using pandas, numpy, seaborn, and matplotlib.
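To give a flavor of the scoring and cleaning methods listed above, here is a minimal sketch (not the course's own code) that computes a missing-data rate and flags outliers with the z-score and modified z-score methods. The function names, the toy `height_cm` data, and the thresholds (3.0, and 3.5 with the 0.6745 scaling constant) are illustrative assumptions based on common conventions.

```python
import numpy as np
import pandas as pd


def missing_rate(df):
    """Fraction of missing values per column (a simple error-rate score)."""
    return df.isna().mean()


def zscore_outliers(x, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold.

    The threshold of 3.0 is an assumed, commonly used default.
    """
    x = pd.Series(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return z.abs() > threshold


def modified_zscore_outliers(x, threshold=3.5):
    """Flag outliers using the median and the median absolute deviation (MAD).

    The 0.6745 constant and the 3.5 threshold are common conventions,
    assumed here. If the MAD is zero, the scores are undefined (inf or NaN).
    """
    x = pd.Series(x, dtype=float)
    med = x.median()
    mad = (x - med).abs().median()
    mz = 0.6745 * (x - med) / mad
    return mz.abs() > threshold


# Hypothetical toy data: one implausible value (400) and one missing value.
df = pd.DataFrame(
    {"height_cm": [170, 172, 168, 171, 169, 173, 167, 400, np.nan, 170]}
)

print(missing_rate(df))

z_flags = zscore_outliers(df["height_cm"])
mz_flags = modified_zscore_outliers(df["height_cm"])
print(pd.DataFrame({"z": z_flags, "modified_z": mz_flags}))

# One possible cleaning step: mark flagged values as missing.
cleaned = df.copy()
cleaned.loc[mz_flags, "height_cm"] = np.nan
```

On this small sample the plain z-score method does not flag the value 400, because the outlier itself inflates the mean and standard deviation, while the median-based modified z-score does flag it. Whether to remove, replace, or keep a flagged value depends on the data and the analysis.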

This course is for:

Data practitioners who want to understand both the high-level strategies and the low-level procedures for evaluating and improving data quality.

Managers, clients, and collaborators who want to understand the importance of data quality, even if they are not working directly with data.

What you’ll learn

  • Strategies for increasing data quality
  • Ways to assess data quality
  • Interpreting data visualizations
  • How to spot problems in data

Table of Contents

1 Is this course right for you?

Download course materials (Python code)
2 Download the code

Why data quality matters
3 Section summary
4 Is data or are data?
5 On the origins and quality of data
6 GIGO: garbage in, garbage out
7 Data quality influences data-driven decisions

Ensuring high data quality
8 Section summary
9 Data management
10 Data documentation
11 Data audits
12 Data cleaning phases
13 Improve quality before getting data
14 Improve quality during data collection
15 Improve quality after data collection
16 Improve quality during data analysis
17 Risks of biased results

Assessing data quality
18 Section summary
19 Qualitative vs. quantitative quality assessments
20 Qualitative assessments via visual inspection
21 Code: Visualizing data distributions
22 Variance assessments
23 Correlations and correlation matrices
24 Data error rates
25 Sample sizes
26 Code: Measuring data quality

Data transformations
27 Section summary
28 Z-score scaling
29 Min-max scaling
30 Binning, rounding
31 Unit normalization
32 Rank transform
33 Nonlinear transformations
34 Code: Transforming data

Outliers and missing data
35 Section summary
36 What are outliers?
37 The z-score method
38 The modified z-score method
39 Dealing with missing data
40 Code: Dealing with bad or missing data

Be a high-quality data scientist
41 Section summary
42 Keeping up with data science developments
43 Can you know everything?
44 What data scientists want

45 Bonus material