{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Sheet 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 1: Cross-validation methods provided by Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to experiment with the cross-validation methods that `sklearn` provides.\n", "\n", "**Task**: For this, generate a *toy* dataset containing only the numbers from 0 to 9, i.e.,\n", "\n", "    X = range(10)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X = list(range(10))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "X?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Leave One Out Cross-Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class `LeaveOneOut` implements a simple cross-validation scheme.\n", "Each training set is created by taking all the samples except one; the test set consists of the single remaining sample.\n", "Thus, for `n` samples, we have `n` different training sets and `n` different test sets.\n", "Leave-one-out cross-validation (LOOCV) can be computationally expensive for large datasets, since one model has to be fitted per sample.\n", "\n", "You can import `LeaveOneOut` by\n", "\n", "    from sklearn.model_selection import LeaveOneOut\n", " \n", "The documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut).\n", "\n", "With\n", "\n", "    loo = LeaveOneOut()\n", " \n", "you create a cross-validator object. Its `split` method returns a so-called *iterator* in Python.\n", "An iterator is an object that can be iterated upon, meaning that you can traverse through all its values.\n", "\n", "The command\n", "\n", "    S = loo.split(X)\n", "\n", "generates a leave-one-out 
cross-validation iterator `S` across the set/list/array `X`.\n", "\n", "**Task**: Execute the above commands." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import LeaveOneOut\n", "loo = LeaveOneOut()\n", "S = loo.split(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, you can always access the next item in the iterator `S` by typing\n", "\n", " next(S)\n", " \n", "**Task**: Try this out multiple times and see what changes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0, 1, 3, 4, 5, 6, 7, 8, 9]), array([2]))" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(S)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, iterators are used in loops:\n", "\n", " for train, test in loo.split(X):\n", " print(\"Training set: %s\\t Test set: %s\" % (train, test))\n", "\n", "**Task**: Try it!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set: [1 2 3 4 5 6 7 8 9]\t Test set: [0]\n", "Training set: [0 2 3 4 5 6 7 8 9]\t Test set: [1]\n", "Training set: [0 1 3 4 5 6 7 8 9]\t Test set: [2]\n", "Training set: [0 1 2 4 5 6 7 8 9]\t Test set: [3]\n", "Training set: [0 1 2 3 5 6 7 8 9]\t Test set: [4]\n", "Training set: [0 1 2 3 4 6 7 8 9]\t Test set: [5]\n", "Training set: [0 1 2 3 4 5 7 8 9]\t Test set: [6]\n", "Training set: [0 1 2 3 4 5 6 8 9]\t Test set: [7]\n", "Training set: [0 1 2 3 4 5 6 7 9]\t Test set: [8]\n", "Training set: [0 1 2 3 4 5 6 7 8]\t Test set: [9]\n" ] } ], "source": [ "for train, test in loo.split(X):\n", "    print(\"Training set: %s\\t Test set: %s\" % (train, test))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "X = list(X)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X[train[0]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Fold Cross-Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The class `KFold` divides the samples into `k` groups of equal size (if possible), called folds.\n", "If $k=n$, this is equivalent to the leave-one-out strategy.\n", "The prediction function is learned using `k−1` folds, and the omitted fold is used for testing." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import `KFold` by\n", "\n", "    from sklearn.model_selection import KFold\n", "\n", "Check out the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).\n", "\n", "**Task**: As for LOOCV, create a test example that shows the behaviour of `KFold`.\n", "For `n_splits=2`, you should obtain\n", "\n", "    Training set: [5 6 7 8 9]\t Test set: [0 1 2 3 4]\n", "    Training set: [0 1 2 3 4]\t Test set: [5 6 7 8 9]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set: [5 6 7 8 9]\t Test set: [0 1 2 3 4]\n", "Training set: [0 1 2 3 4]\t Test set: [5 6 7 8 9]\n" ] } ], "source": [ "from sklearn.model_selection import KFold\n", "\n", "kf = KFold(n_splits=2)\n", "for train, test in kf.split(X):\n", "    print(\"Training set: %s\\t Test set: %s\" % (train, test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 2: Cross-validation for a diabetes data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The diabetes data set contains ten measurements (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of the `n = 442` patients.\n", "\n", "The response variable is a quantitative measure of disease progression one year after baseline.\n", "\n", "**Task**: The data set is part of scikit-learn. Load it using\n", "\n", "    from sklearn import datasets\n", "    diabetes = datasets.load_diabetes()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { 
"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "diabetes = datasets.load_diabetes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a pandas data frame to hold this information.\n", "\n", "**Task**:\n", "Create a pandas data frame `X` holding the ten predictor variables. You should name the columns in the data frame using the optional argument `columns=cols`, where `cols` is given by\n", " \n", " cols = [\"age\", \"sex\", \"bmi\", \"map\", \"tc\",\n", " \"ldl\", \"hdl\", \"tch\", \"ltg\", \"glu\"]\n", " \n", "Store the response variables as an numpy array `y`\n", "\n", "**Hint**:\n", "As in the iris data set, the diabetes dataset is as a python dictionary. The predictor variables can be accessed by `diabetes.data`, the responses via `diabetes.target`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "cols = [\"age\", \"sex\", \"bmi\", \"map\", \"tc\",\n", " \"ldl\", \"hdl\", \"tch\", \"ltg\", \"glu\"]\n", "X = pd.DataFrame(diabetes.data, columns=cols)\n", "y = diabetes.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to try two different estimation approaches here.\n", "1. At first, we use a plain training set/validation set approach, where we exclude $1/5$ of the data from training.\n", "2. 
Our second approach is to estimate $5$ different models using 5-fold cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1st approach: Simple splitting into training and validation set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part, we want to train a linear model using a subset of our samples.\n", "We have done this by hand so far, but there are also methods provided by `sklearn` which will do this work for us.\n", "Use the function `train_test_split` from the module `sklearn.model_selection` to divide your data into a training and a validation set. Since this selection is made randomly, you should set the optional argument `random_state` to fix the seed of the random number generator, so that your results are reproducible and comparable, e.g., by setting `random_state=1`.\n", "\n", "**Task**: Split your data into a training and a validation set using the function `train_test_split`.\n", "Your validation set should contain 20\\% of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Check the size of your sets. The training set should contain 353 samples, while the test set contains 89." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(353, 10) (353,)\n", "(89, 10) (89,)\n" ] } ], "source": [ "print(X_train.shape, y_train.shape)\n", "print(X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**:\n", "Fit a linear regression model to your **training** data. Use the appropriate method in `sklearn`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "lm = linear_model.LinearRegression()\n", "test_model = lm.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Use your model to predict the response on the validation set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "test_pred = test_model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until now, our plots always showed a predictor against the response or against the regression line.\n", "Another way to display the quality of a regression fit is to plot the true values against the predicted values.\n", "The closer the points lie to the identity line $f(x) = x$, the better the fit.\n", "\n", "**Task**:\n", "Produce a scatterplot of the true values of the validation response against the predicted values. Label the axes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "/* Put everything inside the global mpl namespace */\n", "window.mpl = {};\n", "\n", "\n", "mpl.get_websocket_type = function() {\n", " if (typeof(WebSocket) !== 'undefined') {\n", " return WebSocket;\n", " } else if (typeof(MozWebSocket) !== 'undefined') {\n", " return MozWebSocket;\n", " } else {\n", " alert('Your browser does not have WebSocket support.' +\n", " 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n", " 'Firefox 4 and 5 are also supported but you ' +\n", " 'have to enable WebSockets in about:config.');\n", " };\n", "}\n", "\n", "mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n", " this.id = figure_id;\n", "\n", " this.ws = websocket;\n", "\n", " this.supports_binary = (this.ws.binaryType != undefined);\n", "\n", " if (!this.supports_binary) {\n", " var warnings = document.getElementById(\"mpl-warnings\");\n", " if (warnings) {\n", " warnings.style.display = 'block';\n", " warnings.textContent = (\n", " \"This browser does not support binary websocket messages. \" +\n", " \"Performance may be slow.\");\n", " }\n", " }\n", "\n", " this.imageObj = new Image();\n", "\n", " this.context = undefined;\n", " this.message = undefined;\n", " this.canvas = undefined;\n", " this.rubberband_canvas = undefined;\n", " this.rubberband_context = undefined;\n", " this.format_dropdown = undefined;\n", "\n", " this.image_mode = 'full';\n", "\n", " this.root = $('
');\n", " this._root_extra_style(this.root)\n", " this.root.attr('style', 'display: inline-block');\n", "\n", " $(parent_element).append(this.root);\n", "\n", " this._init_header(this);\n", " this._init_canvas(this);\n", " this._init_toolbar(this);\n", "\n", " var fig = this;\n", "\n", " this.waiting = false;\n", "\n", " this.ws.onopen = function () {\n", " fig.send_message(\"supports_binary\", {value: fig.supports_binary});\n", " fig.send_message(\"send_image_mode\", {});\n", " if (mpl.ratio != 1) {\n", " fig.send_message(\"set_dpi_ratio\", {'dpi_ratio': mpl.ratio});\n", " }\n", " fig.send_message(\"refresh\", {});\n", " }\n", "\n", " this.imageObj.onload = function() {\n", " if (fig.image_mode == 'full') {\n", " // Full images could contain transparency (where diff images\n", " // almost always do), so we need to clear the canvas so that\n", " // there is no ghosting.\n", " fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n", " }\n", " fig.context.drawImage(fig.imageObj, 0, 0);\n", " };\n", "\n", " this.imageObj.onunload = function() {\n", " fig.ws.close();\n", " }\n", "\n", " this.ws.onmessage = this._make_on_message_function(this);\n", "\n", " this.ondownload = ondownload;\n", "}\n", "\n", "mpl.figure.prototype._init_header = function() {\n", " var titlebar = $(\n", " '');\n", " var titletext = $(\n", " '');\n", " titlebar.append(titletext)\n", " this.root.append(titlebar);\n", " this.header = titletext[0];\n", "}\n", "\n", "\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "\n", "mpl.figure.prototype._root_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "mpl.figure.prototype._init_canvas = function() {\n", " var fig = this;\n", "\n", " var canvas_div = $('');\n", "\n", " canvas_div.attr('style', 'position: relative; clear: both; outline: 0');\n", "\n", " function canvas_keyboard_event(event) {\n", " return fig.key_event(event, event['data']);\n", " }\n", "\n", " 
canvas_div.keydown('key_press', canvas_keyboard_event);\n", " canvas_div.keyup('key_release', canvas_keyboard_event);\n", " this.canvas_div = canvas_div\n", " this._canvas_extra_style(canvas_div)\n", " this.root.append(canvas_div);\n", "\n", " var canvas = $('');\n", " canvas.addClass('mpl-canvas');\n", " canvas.attr('style', \"left: 0; top: 0; z-index: 0; outline: 0\")\n", "\n", " this.canvas = canvas[0];\n", " this.context = canvas[0].getContext(\"2d\");\n", "\n", " var backingStore = this.context.backingStorePixelRatio ||\n", "\tthis.context.webkitBackingStorePixelRatio ||\n", "\tthis.context.mozBackingStorePixelRatio ||\n", "\tthis.context.msBackingStorePixelRatio ||\n", "\tthis.context.oBackingStorePixelRatio ||\n", "\tthis.context.backingStorePixelRatio || 1;\n", "\n", " mpl.ratio = (window.devicePixelRatio || 1) / backingStore;\n", "\n", " var rubberband = $('');\n", " rubberband.attr('style', \"position: absolute; left: 0; top: 0; z-index: 1;\")\n", "\n", " var pass_mouse_events = true;\n", "\n", " canvas_div.resizable({\n", " start: function(event, ui) {\n", " pass_mouse_events = false;\n", " },\n", " resize: function(event, ui) {\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " stop: function(event, ui) {\n", " pass_mouse_events = true;\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " });\n", "\n", " function mouse_event_fn(event) {\n", " if (pass_mouse_events)\n", " return fig.mouse_event(event, event['data']);\n", " }\n", "\n", " rubberband.mousedown('button_press', mouse_event_fn);\n", " rubberband.mouseup('button_release', mouse_event_fn);\n", " // Throttle sequential mouse events to 1 every 20ms.\n", " rubberband.mousemove('motion_notify', mouse_event_fn);\n", "\n", " rubberband.mouseenter('figure_enter', mouse_event_fn);\n", " rubberband.mouseleave('figure_leave', mouse_event_fn);\n", "\n", " canvas_div.on(\"wheel\", function (event) {\n", " event = event.originalEvent;\n", " event['data'] = 'scroll'\n", 
" if (event.deltaY < 0) {\n", " event.step = 1;\n", " } else {\n", " event.step = -1;\n", " }\n", " mouse_event_fn(event);\n", " });\n", "\n", " canvas_div.append(canvas);\n", " canvas_div.append(rubberband);\n", "\n", " this.rubberband = rubberband;\n", " this.rubberband_canvas = rubberband[0];\n", " this.rubberband_context = rubberband[0].getContext(\"2d\");\n", " this.rubberband_context.strokeStyle = \"#000000\";\n", "\n", " this._resize_canvas = function(width, height) {\n", " // Keep the size of the canvas, canvas container, and rubber band\n", " // canvas in synch.\n", " canvas_div.css('width', width)\n", " canvas_div.css('height', height)\n", "\n", " canvas.attr('width', width * mpl.ratio);\n", " canvas.attr('height', height * mpl.ratio);\n", " canvas.attr('style', 'width: ' + width + 'px; height: ' + height + 'px;');\n", "\n", " rubberband.attr('width', width);\n", " rubberband.attr('height', height);\n", " }\n", "\n", " // Set the figure to an initial 600x600px, this will subsequently be updated\n", " // upon first draw.\n", " this._resize_canvas(600, 600);\n", "\n", " // Disable right mouse context menu.\n", " $(this.rubberband_canvas).bind(\"contextmenu\",function(e){\n", " return false;\n", " });\n", "\n", " function set_focus () {\n", " canvas.focus();\n", " canvas_div.focus();\n", " }\n", "\n", " window.setTimeout(set_focus, 100);\n", "}\n", "\n", "mpl.figure.prototype._init_toolbar = function() {\n", " var fig = this;\n", "\n", " var nav_element = $('')\n", " nav_element.attr('style', 'width: 100%');\n", " this.root.append(nav_element);\n", "\n", " // Define a callback function for later on.\n", " function toolbar_event(event) {\n", " return fig.toolbar_button_onclick(event['data']);\n", " }\n", " function toolbar_mouse_event(event) {\n", " return fig.toolbar_button_onmouseover(event['data']);\n", " }\n", "\n", " for(var toolbar_ind in mpl.toolbar_items) {\n", " var name = mpl.toolbar_items[toolbar_ind][0];\n", " var tooltip = 
mpl.toolbar_items[toolbar_ind][1];\n", " var image = mpl.toolbar_items[toolbar_ind][2];\n", " var method_name = mpl.toolbar_items[toolbar_ind][3];\n", "\n", " if (!name) {\n", " // put a spacer in here.\n", " continue;\n", " }\n", " var button = $('');\n", " button.addClass('ui-button ui-widget ui-state-default ui-corner-all ' +\n", " 'ui-button-icon-only');\n", " button.attr('role', 'button');\n", " button.attr('aria-disabled', 'false');\n", " button.click(method_name, toolbar_event);\n", " button.mouseover(tooltip, toolbar_mouse_event);\n", "\n", " var icon_img = $('');\n", " icon_img.addClass('ui-button-icon-primary ui-icon');\n", " icon_img.addClass(image);\n", " icon_img.addClass('ui-corner-all');\n", "\n", " var tooltip_span = $('');\n", " tooltip_span.addClass('ui-button-text');\n", " tooltip_span.html(tooltip);\n", "\n", " button.append(icon_img);\n", " button.append(tooltip_span);\n", "\n", " nav_element.append(button);\n", " }\n", "\n", " var fmt_picker_span = $('');\n", "\n", " var fmt_picker = $('');\n", " fmt_picker.addClass('mpl-toolbar-option ui-widget ui-widget-content');\n", " fmt_picker_span.append(fmt_picker);\n", " nav_element.append(fmt_picker_span);\n", " this.format_dropdown = fmt_picker[0];\n", "\n", " for (var ind in mpl.extensions) {\n", " var fmt = mpl.extensions[ind];\n", " var option = $(\n", " '', {selected: fmt === mpl.default_extension}).html(fmt);\n", " fmt_picker.append(option)\n", " }\n", "\n", " // Add hover states to the ui-buttons\n", " $( \".ui-button\" ).hover(\n", " function() { $(this).addClass(\"ui-state-hover\");},\n", " function() { $(this).removeClass(\"ui-state-hover\");}\n", " );\n", "\n", " var status_bar = $('