Beyond assertion: setup and teardown



SLIDE 1

Beyond assertion: setup and teardown

UNIT TESTING FOR DATA SCIENCE IN PYTHON

Dibya Chakravorty

Test Automation Engineer

SLIDE 2

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

raw

1,801    201,411
1,767565,112
2,002    333,209
1990    782,911
1,285    389129

SLIDE 3

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

Pipeline: raw → row_to_list()

1,801    201,411
1,767565,112        # dirty row, no tab
2,002    333,209
1990    782,911
1,285    389129

SLIDE 4

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

Pipeline: raw → row_to_list()

1,801    201,411
2,002    333,209
1990    782,911
1,285    389129

SLIDE 5

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

Pipeline: raw → row_to_list() → convert_to_int()

1,801    201,411
2,002    333,209
1990    782,911     # dirty row, no comma
1,285    389129     # dirty row, no comma

SLIDE 6

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

Pipeline: raw → row_to_list() → convert_to_int()

1,801    201,411
2,002    333,209

SLIDE 7

The preprocessing function

def preprocess(raw_data_file_path, clean_data_file_path):
    ...

Pipeline: raw → row_to_list() → convert_to_int() → clean

1801    201411
2002    333209

SLIDE 8

Environment preconditions

preprocess() needs a raw data file in the environment to run.

The environment: raw

SLIDE 9

Environment modification

preprocess() modifies the environment by creating a clean data file.

The environment: raw, clean

SLIDE 10

Testing the preprocessing function

def test_on_raw_data():

The environment

SLIDE 11

Step 1: Setup

def test_on_raw_data():
    # Setup: create the raw data file

Setup brings the environment to a state where testing can begin.

The environment: raw

SLIDE 12

Step 2: Assert

def test_on_raw_data():
    # Setup: create the raw data file
    preprocess(raw_data_file_path, clean_data_file_path)
    with open(clean_data_file_path) as f:
        lines = f.readlines()
    first_line = lines[0]
    assert first_line == "1801\t201411\n"
    second_line = lines[1]
    assert second_line == "2002\t333209\n"

The environment: raw, clean

SLIDE 13

Step 3: Teardown

def test_on_raw_data():
    # Setup: create the raw data file
    preprocess(raw_data_file_path, clean_data_file_path)
    with open(clean_data_file_path) as f:
        lines = f.readlines()
    first_line = lines[0]
    assert first_line == "1801\t201411\n"
    second_line = lines[1]
    assert second_line == "2002\t333209\n"
    # Teardown: remove raw and clean data file

Teardown brings the environment back to its initial state.

The environment
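The three steps can be tried out by hand before introducing fixtures. The sketch below is only an illustration: the course's preprocess() body is not shown in the slides, so the version here is a hypothetical stand-in that keeps rows with two tab-separated entries that both contain a comma. It also writes into a scratch directory from tempfile.mkdtemp() rather than the working directory, which is an assumption of this example.

```python
import os
import tempfile


def preprocess(raw_data_file_path, clean_data_file_path):
    """Hypothetical stand-in for the course's preprocess().

    Keeps rows with exactly two tab-separated entries that both contain
    a thousands comma, and writes them out with the commas removed.
    """
    with open(raw_data_file_path) as raw, open(clean_data_file_path, "w") as clean:
        for row in raw:
            entries = row.rstrip("\n").split("\t")
            if len(entries) != 2 or not all("," in entry for entry in entries):
                continue  # dirty row: skip it
            clean.write("\t".join(e.replace(",", "") for e in entries) + "\n")


def test_on_raw_data():
    # Setup: create the raw data file
    workdir = tempfile.mkdtemp()
    raw_data_file_path = os.path.join(workdir, "raw.txt")
    clean_data_file_path = os.path.join(workdir, "clean.txt")
    with open(raw_data_file_path, "w") as f:
        f.write("1,801\t201,411\n"
                "1,767565,112\n"
                "2,002\t333,209\n"
                "1990\t782,911\n"
                "1,285\t389129\n")
    try:
        # Assert
        preprocess(raw_data_file_path, clean_data_file_path)
        with open(clean_data_file_path) as f:
            lines = f.readlines()
        assert lines == ["1801\t201411\n", "2002\t333209\n"]
    finally:
        # Teardown: remove raw and clean data file, even if an assert failed
        for path in (raw_data_file_path, clean_data_file_path):
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(workdir)


test_on_raw_data()
```

The try/finally is what guarantees the teardown step here. Writing this by hand in every test is tedious, which is the motivation for moving setup and teardown out of the test body.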

SLIDE 14

The new workflow

Old workflow: assert
New workflow: setup → assert → teardown

SLIDE 15

Fixture

import pytest

@pytest.fixture
def my_fixture():
    # Do setup here
    return data

def test_something(my_fixture):
    ...
    data = my_fixture
    ...

SLIDE 16

Fixture

import pytest

@pytest.fixture
def my_fixture():
    # Do setup here
    yield data  # Use yield instead of return
    # Do teardown here

def test_something(my_fixture):
    ...
    data = my_fixture
    ...

SLIDE 17

Test

import os
import pytest

def test_on_raw_data():

SLIDE 18

Fixture

@pytest.fixture
def raw_and_clean_data_file():
    raw_data_file_path = "raw.txt"
    clean_data_file_path = "clean.txt"
    with open(raw_data_file_path, "w") as f:
        f.write("1,801\t201,411\n"
                "1,767565,112\n"
                "2,002\t333,209\n"
                "1990\t782,911\n"
                "1,285\t389129\n"
                )
    yield raw_data_file_path, clean_data_file_path
    os.remove(raw_data_file_path)
    os.remove(clean_data_file_path)

Test

import os
import pytest

def test_on_raw_data(raw_and_clean_data_file):
    raw_path, clean_path = raw_and_clean_data_file
    preprocess(raw_path, clean_path)
    with open(clean_path) as f:
        lines = f.readlines()
    first_line = lines[0]
    assert first_line == "1801\t201411\n"
    second_line = lines[1]
    assert second_line == "2002\t333209\n"

SLIDE 19

The built-in tmpdir fixture

Setup: create a temporary directory. Teardown: delete the temporary directory along with contents.
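The same create-then-delete idea exists in the standard library, which makes it easy to observe outside pytest. The sketch below uses tempfile.TemporaryDirectory as an analogue of tmpdir; it is not how pytest implements the fixture, and the file name is just the one used in the slides.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:   # setup: create the directory
    raw_data_file_path = os.path.join(tmp, "raw.txt")
    with open(raw_data_file_path, "w") as f:
        f.write("1,801\t201,411\n")
    existed_during_test = os.path.exists(raw_data_file_path)

# Teardown has happened: the directory and its contents are gone
exists_after_test = os.path.exists(raw_data_file_path)
print(existed_during_test, exists_after_test)
```

Because the directory is removed along with everything in it, files created inside need no individual cleanup code.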

SLIDE 20

tmpdir and fixture chaining

Setup of tmpdir() → setup of raw_and_clean_data_file() → test → teardown of raw_and_clean_data_file() → teardown of tmpdir().

@pytest.fixture
def raw_and_clean_data_file(tmpdir):
    raw_data_file_path = tmpdir.join("raw.txt")
    clean_data_file_path = tmpdir.join("clean.txt")
    with open(raw_data_file_path, "w") as f:
        f.write("1,801\t201,411\n"
                "1,767565,112\n"
                "2,002\t333,209\n"
                "1990\t782,911\n"
                "1,285\t389129\n"
                )
    yield raw_data_file_path, clean_data_file_path
    # No teardown code necessary

SLIDE 21

Let's practice setup and teardown using fixtures!

SLIDE 22

Mocking

SLIDE 23

The preprocessing function

raw

SLIDE 24

The preprocessing function

Pipeline: raw → row_to_list()

SLIDE 25

The preprocessing function

Pipeline: raw → row_to_list() → convert_to_int() → clean

pytest -k "TestPreprocess"
=============== test session starts ================
...
collected 21 items / 20 deselected / 1 selected

data/test_preprocessing_helpers.py .          [100%]

===== 1 passed, 20 deselected in 0.61 seconds ======

SLIDE 26

Test result depends on dependencies

Pipeline: raw → row_to_list() → convert_to_int() → clean

pytest -k "TestPreprocess"
=============== test session starts ================
...
collected 21 items / 20 deselected / 1 selected

data/test_preprocessing_helpers.py .          [100%]

===== 1 passed, 20 deselected in 0.61 seconds ======

SLIDE 27

Test result depends on dependencies

Pipeline: raw → row_to_list() → convert_to_int() → clean

pytest -k "TestPreprocess"
=============== test session starts ================
...
collected 21 items / 20 deselected / 1 selected

data/test_preprocessing_helpers.py F          [100%]

===================== FAILURES =====================
_________ TestPreprocess.test_on_raw_data __________

    def test_on_raw_data(self, raw_and_clean_data_file):
        raw_path, clean_path = raw_and_clean_data_file
        preprocess(raw_path, clean_path)
        with open(clean_path, "r") as f:
            lines = f.readlines()
>       first_line = lines[0]
E       IndexError: list index out of range

data/test_preprocessing_helpers.py:121: IndexError
===== 1 failed, 20 deselected in 0.68 seconds ======

SLIDE 28

Test result depends on dependencies

The test result should indicate bugs in the function under test, i.e. preprocess(), not in its dependencies, e.g. row_to_list() or convert_to_int().

SLIDE 29

Mocking: testing functions independently of dependencies

Packages for mocking in pytest

pytest-mock: Install using pip install pytest-mock.
unittest.mock: Python standard library package.

SLIDE 30

MagicMock() and mocker.patch()

Pipeline: raw → row_to_list() → convert_to_int() → clean

SLIDE 31

MagicMock() and mocker.patch()

Pipeline: raw → row_to_list() → convert_to_int() → clean, with row_to_list() replaced by unittest.mock.MagicMock()

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(...)

SLIDE 32

MagicMock() and mocker.patch()

Theoretical structure of mocker.patch()

mocker.patch("<dependency name with module name>")

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(...)

SLIDE 33

MagicMock() and mocker.patch()

Theoretical structure of mocker.patch()

mocker.patch("data.preprocessing_helpers.row_to_list") returns a unittest.mock.MagicMock()

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list"
    )

SLIDE 34

Making the MagicMock() bug-free

Raw data

1,801    201,411
1,767565,112
2,002    333,209
1990    782,911
1,285    389129

def row_to_list_bug_free(row):
    return_values = {
        "1,801\t201,411\n": ["1,801", "201,411"],
        "1,767565,112\n": None,
        "2,002\t333,209\n": ["2,002", "333,209"],
        "1990\t782,911\n": ["1990", "782,911"],
        "1,285\t389129\n": ["1,285", "389129"],
    }
    return return_values[row]

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list"
    )
    row_to_list_mock.side_effect = row_to_list_bug_free

SLIDE 35

Side effect

Raw data

1,801    201,411
1,767565,112
2,002    333,209
1990    782,911
1,285    389129

def row_to_list_bug_free(row):
    return_values = {
        "1,801\t201,411\n": ["1,801", "201,411"],
        "1,767565,112\n": None,
        "2,002\t333,209\n": ["2,002", "333,209"],
        "1990\t782,911\n": ["1990", "782,911"],
        "1,285\t389129\n": ["1,285", "389129"],
    }
    return return_values[row]

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list",
        side_effect=row_to_list_bug_free
    )
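The side_effect mechanism can be tried with the standard library alone. In the sketch below, helpers is a hypothetical stand-in for the data.preprocessing_helpers module, keep_valid_rows() is a hypothetical function under test, and mock.patch.object plays the role that mocker.patch plays in the slides. The real dependency is deliberately broken to show that the test no longer relies on it.

```python
from types import SimpleNamespace
from unittest import mock


def buggy_row_to_list(row):
    raise RuntimeError("pretend this dependency is broken")


# Hypothetical stand-in for the data.preprocessing_helpers module
helpers = SimpleNamespace(row_to_list=buggy_row_to_list)


def keep_valid_rows(rows):
    """Hypothetical function under test; it depends on helpers.row_to_list()."""
    parsed = [helpers.row_to_list(row) for row in rows]
    return [entries for entries in parsed if entries is not None]


def row_to_list_bug_free(row):
    return_values = {
        "1,801\t201,411\n": ["1,801", "201,411"],
        "1,767565,112\n": None,
    }
    return return_values[row]


rows = ["1,801\t201,411\n", "1,767565,112\n"]
with mock.patch.object(helpers, "row_to_list",
                       side_effect=row_to_list_bug_free) as row_to_list_mock:
    # Inside the block, helpers.row_to_list is a MagicMock that forwards
    # every call to row_to_list_bug_free
    result = keep_valid_rows(rows)

print(result)
```

Once the with block exits, the original attribute is restored, so other tests see the unpatched dependency again.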

SLIDE 36

Bug-free replacement of dependency

Pipeline: raw → row_to_list_mock (bug-free, replacing row_to_list()) → convert_to_int() → clean

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list",
        side_effect=row_to_list_bug_free
    )
    preprocess(raw_path, clean_path)

SLIDE 37

Checking the arguments

The call_args_list attribute returns a list of arguments that the mock was called with.

row_to_list_mock.call_args_list

[call("1,801\t201,411\n"),
 call("1,767565,112\n"),
 call("2,002\t333,209\n"),
 call("1990\t782,911\n"),
 call("1,285\t389129\n")
 ]

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list",
        side_effect=row_to_list_bug_free
    )
    preprocess(raw_path, clean_path)

SLIDE 38

Checking the arguments

The call_args_list attribute returns a list of arguments that the mock was called with.

row_to_list_mock.call_args_list

[call("1,801\t201,411\n"),
 call("1,767565,112\n"),
 call("2,002\t333,209\n"),
 call("1990\t782,911\n"),
 call("1,285\t389129\n")
 ]

from unittest.mock import call

def test_on_raw_data(raw_and_clean_data_file, mocker):
    raw_path, clean_path = raw_and_clean_data_file
    row_to_list_mock = mocker.patch(
        "data.preprocessing_helpers.row_to_list",
        side_effect=row_to_list_bug_free
    )
    preprocess(raw_path, clean_path)
    assert row_to_list_mock.call_args_list == [
        call("1,801\t201,411\n"),
        call("1,767565,112\n"),
        call("2,002\t333,209\n"),
        call("1990\t782,911\n"),
        call("1,285\t389129\n")
    ]
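A standalone MagicMock is enough to see how call_args_list behaves. This is a small sketch using only unittest.mock, with two of the rows from the raw data. The mock is called directly here instead of through preprocess(), which is a simplification for illustration.

```python
from unittest.mock import MagicMock, call

row_to_list_mock = MagicMock(return_value=None)

# Call the mock the way the function under test would
for row in ["1,801\t201,411\n", "1,767565,112\n"]:
    row_to_list_mock(row)

# Each recorded call compares equal to a call(...) object with the same arguments
assert row_to_list_mock.call_args_list == [
    call("1,801\t201,411\n"),
    call("1,767565,112\n"),
]

# The list is ordered, so the same calls in another order do not match
assert row_to_list_mock.call_args_list != [
    call("1,767565,112\n"),
    call("1,801\t201,411\n"),
]
```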

SLIDE 39

Dependency buggy, function bug-free, test still passes!

pytest -k "TestRowToList"
=========================== test session starts ============================
collected 21 items / 14 deselected / 7 selected

data/test_preprocessing_helpers.py .....FF                            [100%]

================================= FAILURES =================================
_________________ TestRowToList.test_on_normal_argument_1 __________________
...
_________________ TestRowToList.test_on_normal_argument_2 __________________
...
============ 2 failed, 5 passed, 14 deselected in 0.70 seconds =============

SLIDE 40

Dependency buggy, function bug-free, test still passes!

pytest -k "TestPreprocess"
=========================== test session starts ============================
collected 21 items / 20 deselected / 1 selected

data/test_preprocessing_helpers.py .                                  [100%]

================= 1 passed, 20 deselected in 0.63 seconds ==================

SLIDE 41

Let's practice mocking!

SLIDE 42

Testing models

SLIDE 43

Functions we have tested so far

preprocess()
get_data_as_numpy_array()
split_into_training_and_testing_sets()

SLIDE 44

Raw data to clean data

from data.preprocessing_helpers import preprocess
from features.as_numpy import get_data_as_numpy_array
from models.train import (
    split_into_training_and_testing_sets
)

preprocess("data/raw/housing_data.txt",
           "data/clean/clean_housing_data.txt"
           )

data
|-- raw
|   |-- housing_data.txt
|-- clean
|
src
tests

data/raw/housing_data.txt

2,081    314,942
1,059    186,606
         293,410    <-- row with missing area
1,148    206,186
...

SLIDE 45

Raw data to clean data

from data.preprocessing_helpers import preprocess
from features.as_numpy import get_data_as_numpy_array
from models.train import (
    split_into_training_and_testing_sets
)

preprocess("data/raw/housing_data.txt",
           "data/clean/clean_housing_data.txt"
           )

data
|-- raw
|   |-- housing_data.txt
|-- clean
|   |-- clean_housing_data.txt
src
tests

data/clean/clean_housing_data.txt

2081    314942
1059    186606
1148    206186
...

SLIDE 46

Clean data to NumPy array

from data.preprocessing_helpers import preprocess
from features.as_numpy import get_data_as_numpy_array
from models.train import (
    split_into_training_and_testing_sets
)

preprocess("data/raw/housing_data.txt",
           "data/clean/clean_housing_data.txt"
           )
data = get_data_as_numpy_array(
    "data/clean/clean_housing_data.txt", 2
)

get_data_as_numpy_array(
    "data/clean/clean_housing_data.txt", 2
)

array([[  2081., 314942.],
       [  1059., 186606.],
       [  1148., 206186.]
       ...
       ]
      )

SLIDE 47

Splitting into training and testing sets

from data.preprocessing_helpers import preprocess
from features.as_numpy import get_data_as_numpy_array
from models.train import (
    split_into_training_and_testing_sets
)

preprocess("data/raw/housing_data.txt",
           "data/clean/clean_housing_data.txt"
           )
data = get_data_as_numpy_array(
    "data/clean/clean_housing_data.txt", 2
)
training_set, testing_set = (
    split_into_training_and_testing_sets(data)
)

split_into_training_and_testing_sets(data)

(array([[1148, 206186],    # Training set (3/4)
        [2081, 314942],
        ...
        ]
       ),
 array([[1059, 186606]     # Testing set (1/4)
        ...
        ]
       )
 )

SLIDE 48

Functions are well tested - thanks to you!

SLIDE 49

The linear regression model

def train_model(training_set):

SLIDE 50

The linear regression model

from scipy.stats import linregress

def train_model(training_set):
    slope, intercept, _, _, _ = linregress(training_set[:, 0],
                                           training_set[:, 1])
    return slope, intercept

SLIDE 51

Return values difficult to compute manually

SLIDE 52

Return values difficult to compute manually

SLIDE 53

Return values difficult to compute manually

Cannot test train_model() without knowing expected return values.

SLIDE 54

True for all data science models

SLIDE 55

Trick 1: Use dataset where return value is known

import pytest
import numpy as np
from models.train import train_model

def test_on_linear_data():
    test_argument = np.array([[1.0, 3.0],
                              [2.0, 5.0],
                              [3.0, 7.0]
                              ]
                             )

SLIDE 56

Trick 1: Use dataset where return value is known

import pytest
import numpy as np
from models.train import train_model

def test_on_linear_data():
    test_argument = np.array([[1.0, 3.0],
                              [2.0, 5.0],
                              [3.0, 7.0]
                              ]
                             )
    expected_slope = 2.0
    expected_intercept = 1.0
    slope, intercept = train_model(test_argument)
    assert slope == pytest.approx(expected_slope)
    assert intercept == pytest.approx(expected_intercept)

SLIDE 57

Trick 2: Use inequalities

import numpy as np
from models.train import train_model

def test_on_positively_correlated_data():
    test_argument = np.array([[1.0, 4.0], [2.0, 4.0],
                              [3.0, 9.0], [4.0, 10.0],
                              [5.0, 7.0], [6.0, 13.0],
                              ]
                             )

SLIDE 58

Trick 2: Use inequalities

import numpy as np
from models.train import train_model

def test_on_positively_correlated_data():
    test_argument = np.array([[1.0, 4.0], [2.0, 4.0],
                              [3.0, 9.0], [4.0, 10.0],
                              [5.0, 7.0], [6.0, 13.0],
                              ]
                             )
    slope, intercept = train_model(test_argument)
    assert slope > 0
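Both tricks can be tried without SciPy or NumPy installed. The sketch below swaps in a pure-Python least-squares fit as a hypothetical stand-in for train_model(); the course version uses scipy.stats.linregress. It also uses math.isclose in place of pytest.approx so that it runs without pytest.

```python
import math


def train_model(training_set):
    """Hypothetical stand-in: ordinary least squares for a single feature."""
    xs = [row[0] for row in training_set]
    ys = [row[1] for row in training_set]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Trick 1: the points lie exactly on y = 2x + 1, so the answer is known
linear_slope, linear_intercept = train_model(
    [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
)
assert math.isclose(linear_slope, 2.0)
assert math.isclose(linear_intercept, 1.0)

# Trick 2: the exact slope is awkward to compute by hand, but its sign is not
noisy_slope, _ = train_model(
    [[1.0, 4.0], [2.0, 4.0], [3.0, 9.0],
     [4.0, 10.0], [5.0, 7.0], [6.0, 13.0]]
)
assert noisy_slope > 0
```

Floating-point results are compared with a tolerance rather than with ==, which is the same reason the slides use pytest.approx.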

SLIDE 59

Recommendations

Do not leave models untested just because they are complex.
Perform as many sanity checks as possible.

SLIDE 60

Using the model

from data.preprocessing_helpers import preprocess
from features.as_numpy import get_data_as_numpy_array
from models.train import (
    split_into_training_and_testing_sets, train_model
)

preprocess("data/raw/housing_data.txt",
           "data/clean/clean_housing_data.txt"
           )
data = get_data_as_numpy_array(
    "data/clean/clean_housing_data.txt", 2
)
training_set, testing_set = (
    split_into_training_and_testing_sets(data)
)
slope, intercept = train_model(training_set)

train_model(training_set)

151.78430060614986 17140.77537937442

SLIDE 61

Testing model performance

def model_test(testing_set, slope, intercept):
    """Return r^2 of fit"""

Returns a quantity r^2.
Indicates how well the model performs on unseen data.
Usually, 0 ≤ r^2 ≤ 1.

r^2 = 1 indicates perfect fit.
r^2 = 0 indicates no fit.

Complicated to compute r^2 manually.
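Even though r^2 is tedious to compute by hand in general, its two boundary cases make good sanity tests. The sketch below assumes the usual definition, one minus the residual sum of squares divided by the total sum of squares. The body of model_test() is not shown in the slides, so this implementation is a hypothetical stand-in.

```python
def model_test(testing_set, slope, intercept):
    """Return r^2 of fit (assumed definition: 1 - SS_res / SS_tot)."""
    xs = [row[0] for row in testing_set]
    ys = [row[1] for row in testing_set]
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


testing_set = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]

# The line y = 2x + 1 passes through every point: perfect fit
perfect = model_test(testing_set, slope=2.0, intercept=1.0)

# A flat line at the mean of y explains none of the variation: no fit
flat = model_test(testing_set, slope=0.0, intercept=5.0)

print(perfect, flat)
```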

SLIDE 62

Let's practice writing sanity tests!

SLIDE 63

Testing plots

SLIDE 64

Pizza without cheese!

SLIDE 65

This lesson: testing matplotlib visualizations

SLIDE 66

The plotting function

data/
src/
|-- data/
|-- features/
|-- models/
|-- visualization
|   |-- __init__.py
tests/

SLIDE 67

The plotting function

plots.py

def get_plot_for_best_fit_line(slope, intercept, x_array, y_array, title):
    """
    slope: slope of best fit line
    intercept: intercept of best fit line
    """

data/
src/
|-- data/
|-- features/
|-- models/
|-- visualization
|   |-- __init__.py
|   |-- plots.py
tests/

SLIDE 68

The plotting function

plots.py

def get_plot_for_best_fit_line(slope, intercept, x_array, y_array, title):
    """
    slope: slope of best fit line
    intercept: intercept of best fit line
    x_array: array containing housing areas
    y_array: array containing housing prices
    """

data/
src/
|-- data/
|-- features/
|-- models/
|-- visualization
|   |-- __init__.py
|   |-- plots.py
tests/

SLIDE 69

The plotting function

plots.py

def get_plot_for_best_fit_line(slope, intercept, x_array, y_array, title):
    """
    slope: slope of best fit line
    intercept: intercept of best fit line
    x_array: array containing housing areas
    y_array: array containing housing prices
    title: title of the plot
    """

data/
src/
|-- data/
|-- features/
|-- models/
|-- visualization
|   |-- __init__.py
|   |-- plots.py
tests/

SLIDE 70

The plotting function

plots.py

def get_plot_for_best_fit_line(slope, intercept, x_array, y_array, title):
    """
    slope: slope of best fit line
    intercept: intercept of best fit line
    x_array: array containing housing areas
    y_array: array containing housing prices
    title: title of the plot

    Returns: matplotlib.figure.Figure()
    """

data/
src/
|-- data/
|-- features/
|-- models/
|-- visualization
|   |-- __init__.py
|   |-- plots.py
tests/

SLIDE 71

Training plot

...
from visualization import get_plot_for_best_fit_line

preprocess(...)
data = get_data_as_numpy_array(...)
training_set, testing_set = (
    split_into_training_and_testing_sets(data)
)
slope, intercept = train_model(training_set)
get_plot_for_best_fit_line(slope, intercept,
                           training_set[:, 0], training_set[:, 1],
                           "Training"
                           )

SLIDE 72

Testing plot

...
from visualization import get_plot_for_best_fit_line

preprocess(...)
data = get_data_as_numpy_array(...)
training_set, testing_set = (
    split_into_training_and_testing_sets(data)
)
slope, intercept = train_model(training_set)
get_plot_for_best_fit_line(slope, intercept,
                           training_set[:, 0], training_set[:, 1],
                           "Training"
                           )
get_plot_for_best_fit_line(slope, intercept,
                           testing_set[:, 0], testing_set[:, 1],
                           "Testing"
                           )

SLIDE 73

Don't test properties individually

matplotlib.figure.Figure()

Axes: configuration, style
Data: style
Annotations: style
...

SLIDE 74

Testing strategy for plots

SLIDE 75

Testing strategy for plots

Stages: One-time baseline generation → Testing

SLIDE 76

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments

SLIDE 77

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments

SLIDE 78

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image

SLIDE 79

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image → Image looks OK?

SLIDE 80

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image → Image looks OK? (Yes: store image as baseline image)

SLIDE 81

One-time baseline generation

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image → Image looks OK? (Yes: store image as baseline image; No: fix plotting function)

SLIDE 82

Testing

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image → Image looks OK? (Yes: store image as baseline image; No: fix plotting function)

Testing: Call plotting function on test arguments → Convert Figure() to PNG image

SLIDE 83

Testing

Stages: One-time baseline generation → Testing

Baseline generation: Decide on test arguments → Call plotting function on test arguments → Convert Figure() to PNG image → Image looks OK? (Yes: store image as baseline image; No: fix plotting function)

Testing: Call plotting function on test arguments → Convert Figure() to PNG image → Compare with the stored baseline image

SLIDE 84

pytest-mpl

Knows how to ignore OS-related differences.
Makes it easy to generate baseline images.

pip install pytest-mpl

SLIDE 85

An example test

import pytest
import numpy as np
from visualization import get_plot_for_best_fit_line

def test_plot_for_linear_data():
    slope = 2.0
    intercept = 1.0
    x_array = np.array([1.0, 2.0, 3.0])  # Linear data set
    y_array = np.array([3.0, 5.0, 7.0])
    title = "Test plot for linear data"
    return get_plot_for_best_fit_line(slope, intercept,
                                      x_array, y_array, title)

SLIDE 86

An example test

import pytest
import numpy as np
from visualization import get_plot_for_best_fit_line

@pytest.mark.mpl_image_compare  # Under the hood baseline generation and comparison
def test_plot_for_linear_data():
    slope = 2.0
    intercept = 1.0
    x_array = np.array([1.0, 2.0, 3.0])  # Linear data set
    y_array = np.array([3.0, 5.0, 7.0])
    title = "Test plot for linear data"
    return get_plot_for_best_fit_line(slope, intercept,
                                      x_array, y_array, title)

SLIDE 87

Generating the baseline image

Generate baseline image

!pytest -k "test_plot_for_linear_data" --mpl-generate-path visualization/baseline

data/
src/
tests/
|-- data/
|-- features/
|-- models/
|-- visualization
    |-- __init__.py
    |-- test_plots.py    # Test module
    |-- baseline         # Contains baselines

SLIDE 88

Verify the baseline image

data/
src/
tests/
|-- data/
|-- features/
|-- models/
|-- visualization
    |-- __init__.py
    |-- test_plots.py    # Test module
    |-- baseline         # Contains baselines
        |-- test_plot_for_linear_data.png

SLIDE 89

Run the test

!pytest -k "test_plot_for_linear_data" --mpl
======================= test session starts =======================
...
collected 24 items / 23 deselected / 1 selected

visualization/test_plots.py .                                [100%]

============= 1 passed, 23 deselected in 0.68 seconds =============

SLIDE 90

Reading failure reports

!pytest -k "test_plot_for_linear_data" --mpl
============================ FAILURES =============================
_______ TestGetPlotForBestFitLine.test_plot_for_linear_data _______
Error: Image files did not match.
  RMS Value: 11.191347848524174
  Expected: /tmp/tmplcbtsb10/baseline-test_plot_for_linear_data.png
  Actual: /tmp/tmplcbtsb10/test_plot_for_linear_data.png
  Difference: /tmp/tmplcbtsb10/test_plot_for_linear_data-failed-diff.png
  Tolerance: 2
============= 1 failed, 36 deselected in 1.13 seconds =============
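The RMS value and the tolerance in the report can be read as a distance between two images and a threshold on that distance. The sketch below is only an illustration of a root-mean-square difference over flat lists of made-up grayscale pixel values. It is not pytest-mpl's implementation, and rms_difference() is a hypothetical helper.

```python
import math


def rms_difference(baseline_pixels, actual_pixels):
    """Root-mean-square difference between two equally sized pixel lists."""
    if len(baseline_pixels) != len(actual_pixels):
        raise ValueError("images must have the same number of pixels")
    squared = [(a - b) ** 2 for a, b in zip(baseline_pixels, actual_pixels)]
    return math.sqrt(sum(squared) / len(squared))


TOLERANCE = 2  # same number as the "Tolerance: 2" line in the report

baseline = [0, 128, 255, 64]     # made-up grayscale values
identical = list(baseline)
shifted = [10, 118, 245, 74]     # every pixel is off by 10

rms_identical = rms_difference(baseline, identical)
rms_shifted = rms_difference(baseline, shifted)

print(rms_identical <= TOLERANCE)   # within tolerance: images match
print(rms_shifted <= TOLERANCE)     # outside tolerance: images differ
```

Read this way, the report above describes a comparison that failed because the RMS value of about 11.19 exceeds the tolerance of 2.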

SLIDE 91

Yummy!

SLIDE 92

Let's test plots!

SLIDE 93

Congratulations

SLIDE 94

SLIDE 95

You've written so many tests

SLIDE 96

You've written so many tests

SLIDE 97

You've written so many tests

SLIDE 98

You've written so many tests

SLIDE 99

You learned a lot

SLIDE 100

Testing saves time and effort.

pytest
Testing return values and exceptions.
Running tests and reading the test result report.

Best practices
Well tested function using normal, special and bad arguments.
TDD, where tests get written before implementation.
Test organization and management.

Advanced skills
Setup and teardown with fixtures, mocking.
Sanity tests for data science models.
Plot testing.

SLIDE 101

Code for this course

https://github.com/gutfeeling/univariate-linear-regression

SLIDE 102

Icon sources

Icons made by the following authors from flaticon.com.

Freepik
Smashicons
Vectors Market
Kiranshastry
Dimitry Miroliubov
Creaticca Creative Agency
Gregor Cresnar

SLIDE 103

Image sources

1. https://chibird.com/post/20998191414/i-make-a-lot-of-procrastination-drawings-theyre
2. http://www.dekoleidenschaft.de/ratgeber/10-tipps-fuer-mehr-ordnung-im-kleiderschrank/
3. http://me-monaco.me/paper-storage-box-with-lid/
4. https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249
5. https://towardsdatascience.com/demystifying-support-vector-machines-8453b39f7368
6. https://www.bbc.co.uk/bbcthree/article/b290ff0e-1d75-43b1-8ff1-a9ac80d4d842

SLIDE 104

I wish you all the best!