{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", " ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", " ps2_blja_problem1\n", "\n", "Submit your homework within one week until next Monday, 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "da465154543cb14b6a58467e18d23899", "grade": false, "grade_id": "cell-09e40ef55e74ddbf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Introduction to Data Science\n", "## Lab 8: Cross-validation for a diabetes data set" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "96c70e1e5d52ad0fd39374d921d1f24e", "grade": false, "grade_id": "cell-6a9c350b8253c1c6", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part A: Importing the data set" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "cca43054d388b13e54a7d7219a290adf", "grade": false, "grade_id": "cell-e9f33802d3e4ed8f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The diabetes data set contains ten measurements (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of the `n = 442` patients.\n", "\n", "The response variable is a quantitative measure of disease progression one year after baseline.\n", "\n", "**Task**: The data set is part of scikit learn, you can import it by executing the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "60a65c36d4e824c587423b2ff9770ba4", "grade": false, "grade_id": "cell-64880bb943a3eeb3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "from sklearn import datasets\n", "diabetes = datasets.load_diabetes()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "33e371c44f82b5e59167810ad53e13d6", "grade": false, "grade_id": "cell-233004448533d63f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Here, `diabetes` will be a dictionary.\n", "A *dictionary* is an unordered collection of items.\n", "While other compound data types have only `values` as elements (a *list* for example), a dictionary consists of `key: value` pairs.\n", "\n", "**Task**: You can return the keys using the method `.keys()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "56a1c617b946cc02d136ac354ce8580d", "grade": false, "grade_id": "cell-230736cbe8e06f83", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "diabetes.keys()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "91781a52b045bf88ce5a8a43053bd49a", "grade": false, "grade_id": "cell-f0857d4c565e2188", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Here, you find that the *dictionary* diabetes contains the keys\n", "\n", " 'data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'\n", "\n", "Since `DESCR` sounds like description, we print its *value* by the following command\n", " \n", " print(diabetes.DESCR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "720b643071033e63bb59cadd6d81b9b2", "grade": false, "grade_id": "cell-d6d1064cfde4400b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print(diabetes.DESCR)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "5a2b4d9ac718e0d022221030fa2e281b", "grade": false, "grade_id": "cell-9d7423f254773e52", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Your first task will be to create a `pandas.DataFrame` to hold this information.\n", "\n", "**Task (2 points)**:\n", "Create a pandas data frame `X` holding the ten predictor variables. You should name the columns in the data frame using the optional argument `columns=cols`, where `cols` is given by\n", " \n", " cols = [\"age\", \"sex\", \"bmi\", \"map\", \"tc\",\n", " \"ldl\", \"hdl\", \"tch\", \"ltg\", \"glu\"]\n", " \n", "Store the response variables as an numpy array `y`\n", "\n", "**Hint**:\n", "As in the iris data set, the diabetes data set is as a python dictionary.\n", "The predictor variables can be accessed by `diabetes.data`, the responses via `diabetes.target`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "171e20fd57fab2b905c00bf35dc7604b", "grade": false, "grade_id": "cell-db36fecb18abea08", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import pandas as pd\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "8af74716c4794ab6a3efb2e5d1eca262", "grade": true, "grade_id": "cell-29c0a4f3fa57f9b3", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X.shape == (442,10)\n", "assert all(X.columns == [\"age\", \"sex\", \"bmi\", \"map\", \"tc\", \"ldl\", \"hdl\", \"tch\", \"ltg\", \"glu\"])\n", "assert abs(X.age.mean()) < 1e-10" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e0217ba271e8f6a96547a693e814d8eb", "grade": false, "grade_id": "cell-671c2584a7cbb043", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In the following, we want to try two different estimation approaches:\n", "1. At first, we use a plain training-test set approach, where we exclude $1/5$ of the data from training.\n", "2. Our second approach is to estimate $5$ different models using 5-fold cross-validation" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f04e342ee40d9681f67a58634bb6defc", "grade": false, "grade_id": "cell-33c796b1ed15517f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part B: Simple splitting into training and validation set" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b206583ebc778f1220606cb32f5d5236", "grade": false, "grade_id": "cell-c1f1f7d6b17407a1", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In this part, we want to train a linear model using a subset of our samples.\n", "We have done this by hand so far, but there are also methods provided by `sklearn` which will do this work for us.\n", "\n", "Use the function `train_test_split` from the module `sklearn.model_selection` to divide your data inta a training and a validation set. SInce this selection is made randomly, you should set the optional input `random_state` to fix the seed of the random number generator to ensure comparability, e.g., by setting `random_state = 1`.\n", "\n", "**Task (1 point)**: Split your data into a training and a test set using the function `train_test_split`.\n", "Your *test set should contain 20\\% of the data*.\n", "Use `random_state=1`.\n", "Store your sets in variables `X_train, X_test, y_train, y_test`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b6e437dfd4263f73d0772f81208be3d3", "grade": false, "grade_id": "cell-253d355e804f66c1", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5ff33f7c9565e8eee8833ae10d020ded", "grade": true, "grade_id": "cell-5c8e4e639d5436be", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X_train.shape == (353,10)\n", "assert y_test.shape == (89,)\n", "assert abs(y_test.mean() - 147.20224719101122) < 1e-8\n", "assert abs(X_test.age.var() - 0.0023730443017513166) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task (1 point)**:\n", "Fit a *LinearRegression model* to your **training** data.\n", "Use the appropriate method from `sklearn`.\n", "\n", "Use your model to predict the response on the test set and store your prediction in a variable `test_pred`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6a3ba410a65a7e6c817eb978e4109b80", "grade": false, "grade_id": "cell-9d2306f93189f453", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "3630d23a04894298c3b05029de0b157f", "grade": true, "grade_id": "cell-44c6b5d6f023a572", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(test_pred.mean() - 143.7088817962804) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "a8cd7f8f8f968b6f11c13d01d023d564", "grade": false, "grade_id": "cell-cfb671f87ebfc359", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Until now, our plots were always of the type predictor against response or against regression line.\n", "Another way to display the quality of a regression fit is to plot the true values against the predicted values.\n", "The closer the values are to the identity $f(x) = x$, the better the fit.\n", "\n", "**Task (2 points)**:\n", "Produce a scatterplot of the true values in the validation response against the predicted values. Draw also a line corresponding to the *ideal prediction*, i.e., each prediction is equal to its true value.\n", "Label the axes accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8ebe308a38f31a94c988dc7ab0c147ed", "grade": true, "grade_id": "cell-0d7703946fd94797", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4e9e52aed23b79c9f26595cfa240628e", "grade": false, "grade_id": "cell-0bfc37e71a4e2058", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: Compute the mean squared error $\\text{MSE}_\\text{val}$ on the validation set.\n", "You can either use the method `mean_squared_error` from the module `sklearn.metrics`, or you can implement it by yourself.\n", "Store the mean squared error in a variable `mse_test`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a835cb6a9fe99c014d786c4b3224118c", "grade": false, "grade_id": "cell-0cf189f73fe1b8bb", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "10b811c5dee79e74c1905c38bf9bb2bb", "grade": true, "grade_id": "cell-39655950c0337042", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(mse_test - 2992.5576814529445) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "302266d9dfa009268d1ed0539321cb42", "grade": false, "grade_id": "cell-6e3f79f47f759691", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: What is the proportion of variability that is explained by this linear fit. Store your answer in a variable `expl_var`.\n", "\n", "*Remember*: A `LinearRegression` has a method that computes exactly this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "35c5865db1c7dbf69560091ebb9f6c0d", "grade": false, "grade_id": "cell-02da9e3b706ff219", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5a5f4ede57677febecab982bb1b518cd", "grade": true, "grade_id": "cell-cd94b31e3414158f", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(expl_var - 0.43843604017332694) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2895bb3f81c3b25c143ac1e3b17dcbac", "grade": false, "grade_id": "cell-63670992fba16704", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part C: K-Fold Cross-Validation" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8995999bcba98f079fe9ed13c47427a7", "grade": false, "grade_id": "cell-e8c06d868e91a629", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Next, we want to use cross-validation to select our model.\n", "Scikit-learn is a powerful library and possesses numerous modules and functions.\n", "Here, we explore the function `cross_val_score`, which can be imported by\n", "\n", " from sklearn.model_selection import cross_val_score\n", " \n", "This function performs K-fold cross-validation and returns a score for each fold (this is the $R^2$-score by default).\n", " \n", "**Task**: Please read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) and import the function `cross_val_score`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "df0129b8049eac98c44bf8eb16284407", "grade": false, "grade_id": "cell-13e697d4b0ee21f2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f7ec837aa594008a7463680567e93b59", "grade": false, "grade_id": "cell-b40037e4f1431811", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The functions expects as a first argument an `estimator`.\n", "We are informed by the documentation that this should be an \"estimator object implementing \\[the method\\] ‘fit’\".\n", "\n", "This is fulfilled by all estimation methods used so far (e.g. linear models, logistic regression, LDA).\n", "In the case of a linear regression fit, this could be\n", " \n", " model = linear_model.LinearRegression()\n", "\n", "**Task (1 point)**: Perform a 5-fold cross-validation for a linear model on the diabetes data set and print the scores.\n", "Store the output of the function `cross_val_score` in a variable `cv_scores`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "278b6e777f3d6fb0e14441fe1cb94773", "grade": false, "grade_id": "cell-c553fd69d4271177", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d25ec6da7f8965685f11c6ebb2073562", "grade": true, "grade_id": "cell-99d619405f87a651", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert (cv_scores.mean() - 0.48231812211149394) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6ea091c1daace19403d9e9849cfc441d", "grade": false, "grade_id": "cell-1243681c5a0df9e3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: Use the function `cross_val_predict` in the module `sklearn.model_selection` to make prediction on the diabetes data set.\n", "Store your answer in a variable `cv_pred`. Use again 5 folds." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ebefdea203e886bf52e1ba3b784ceb34", "grade": false, "grade_id": "cell-007de9e596ef7ff7", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f5cda670c79c6416ec792129c62ac9c4", "grade": true, "grade_id": "cell-c0de606ebc88d567", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert cv_pred.shape == (442,)\n", "assert abs(cv_pred.mean() - 151.7873610258396) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d963193ea6d66d843c72229747ac4cad", "grade": false, "grade_id": "cell-91ce7cdbeca42c0e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: Make a scatterplot of the true values in the test response against the predicted values similar to the one in **Part B**, but now using all of the data. Label the axes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "739cceb0711dfd4d999913729d695c13", "grade": true, "grade_id": "cell-688b3f2f9f7696d3", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "cb010681074690de7cbe85a135d2429a", "grade": false, "grade_id": "cell-f37498936a09e60f", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: Compute the $R^2$-score this model and store it in a variable `accuracy`. You can use the function `r2_score` from the module `sklearn.metrics`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "9edd8eb48fb829f877d9c890099b64bb", "grade": false, "grade_id": "cell-924703d0d9799788", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(\"Cross-validated Accuracy:\", accuracy)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "0ffb8018ee2ae036e1c3f47e31550511", "grade": true, "grade_id": "cell-bec19fb2b666940c", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert abs(accuracy - 0.49532382463572844) < 1e-8" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "b3db316ac354719bddc3db6578e7e5f9", "grade": false, "grade_id": "cell-d59732cd980a7b79", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Caution**: Altough this $R^2$-score is higher than the score for the training/validation set split, they are not really comparable since we computed them on different subsets of the data.\n", "To get a more reliable comparison, we must keep part of the data as a so-called *hold-out* data set to be used for estimating the true learning error." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }