{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", " ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", " \n", "for example\n", " \n", " ps2_blja_problem1\n", "\n", "Submit your homework within one week until next Monday, 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4c51662f251928d5c6821ab4ff7deb7a", "grade": false, "grade_id": "cell-09e40ef55e74ddbf", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Introduction to Data Science\n", "## Lab 8: Cross-validation methods provided by Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "610c3dae12809a51f1dff73941de9df0", "grade": false, "grade_id": "cell-396894a2b8c065ce", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part A: Generation of a toy data set" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2c800e8b4aa3da40efb3b93684409aab", "grade": false, "grade_id": "cell-2682f6f5a719ca20", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "We want to experiment with the methods `sklearn` provides to us.\n", "\n", "**Task (1 point)**: Generate a *toy* dataset `X`.\n", "It should be a 1-dimensional `numpy.ndarray` containing only the numbers from 1 to 10." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d2865395806b9b67e64b8ea416e43443", "grade": false, "grade_id": "cell-d03013d433692b05", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "fa022859242ba2eeb8ed8ce0748f2586", "grade": true, "grade_id": "cell-6261d31e38956add", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert type(X) == np.ndarray\n", "assert X.shape == (10,)\n", "assert X.mean() == 5.5\n", "assert X.var() == 8.25" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "15fa6c86753f771ccb3bb312f1ed2378", "grade": false, "grade_id": "cell-09b755ff0f4771f2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part B: Leave-One-Out Cross-Validation" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "f44dff4cc498bb3835f3dd5f0b171b0a", "grade": false, "grade_id": "cell-da37fe9cbe9977bc", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The function `LeaveOneOut` is a simple cross-validation method.\n", "Each training set is created by taking all the samples except one, the test set consisting of the single remaining sample.\n", "Thus, for `n` samples, we have `n` different training sets and `n` different test sets.\n", "Leave-one-out cross-validation (LOOCV) can be computationally expensive for large datasets.\n", "\n", "You can import the function `LeaveOneOut` by\n", "\n", " from sklearn.model_selection import LeaveOneOut\n", " \n", "The documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut).\n", "\n", "The command\n", "\n", " S = LeaveOneOut().split(X)\n", "\n", "generates a leave-one-out cross-validation iterator `S` across the set/list/array `X`.\n", "An *iterator* is an object that can be iterated upon, meaning that you can traverse through all its values.\n", "\n", "Once, you've set up an iterator you would typically access its train and test set within a loop and do the data science stuff that you want.\n", "\n", "**Task (2 points)**: Set up a leave-one-out cross validation iterator for the data set `X`.\n", "Afterwards, set up a loop which prints the training and test data set in each iteration.\n", "Your output should look like this:\n", "\n", " Training set: [ 2 3 4 5 6 7 8 9 10]\t Test set: [1]\n", " Training set: [ 1 3 4 5 6 7 8 9 10]\t Test set: [2]\n", " Training set: [ 1 2 4 5 6 7 8 9 10]\t Test set: [3]\n", " Training set: [ 1 2 3 5 6 7 8 9 10]\t Test set: [4]\n", " Training set: [ 1 2 3 4 6 7 8 9 10]\t Test set: [5]\n", " Training set: [ 1 2 3 4 5 7 8 9 10]\t Test set: [6]\n", " Training set: [ 1 2 3 4 5 6 8 9 10]\t Test set: [7]\n", " Training set: [ 1 2 3 4 5 6 7 9 10]\t Test set: [8]\n", " Training set: [ 1 2 3 4 5 6 7 8 10]\t Test set: [9]\n", " Training set: [1 2 3 4 5 6 7 8 9]\t Test set: [10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "66dccda14e2ad9dabcc45d9f6a7ecdb8", "grade": true, "grade_id": "cell-0f0f095941c2002a", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from sklearn.model_selection import LeaveOneOut\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part C: K-Fold cross validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `KFold` divides all the samples into `k` groups of samples called folds (if $k=n$, this is equivalent to the Leave-One-Out strategy) of equal sizes (if possible).\n", "The prediction function is learned using `k−1` folds, and the omitted fold is used for testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can import the function `KFold` by\n", "\n", " from sklearn.model_selection import KFold\n", "\n", "Check out the documentation of the function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).\n", "\n", "**Task (2 points)**: As for LOOCV, use the data set `X` and create a test example that shows the behaviour of the function.\n", "For `n_splits=2`, you should obtain\n", "\n", " Training set: [5 6 7 8 9]\t Test set: [0 1 2 3 4]\n", " Training set: [0 1 2 3 4]\t Test set: [5 6 7 8 9]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b8beefb211140b54194e709e1a77bea8", "grade": true, "grade_id": "cell-02051e22719c142c", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }