{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n",
    "\n",
    "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n",
    "\n",
    "Rename this problem sheet as follows:\n",
    "\n",
    "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n",
    "    \n",
    "for example\n",
    "    \n",
    "    ps2_blja_problem1\n",
    "\n",
    "Submit your homework within one week until next Monday, 9 a.m."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NAME = \"\"\n",
    "EMAIL = \"\"\n",
    "USERNAME = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "83be57aa54b869534a3f21c3602df38c",
     "grade": false,
     "grade_id": "cell-2f2cbd6f50d7ae76",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Introduction to Data Science\n",
    "\n",
    "## Lab 8: Multi-class classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "dd840324c8b525bc29069122d8a53d6f",
     "grade": false,
     "grade_id": "cell-cbc0379fbf3909bb",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "### Part A: Exploration of the flower petal data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "45ef14557ea09dfdd2db60809b0f95ab",
     "grade": false,
     "grade_id": "cell-ea3f7ad38a908c79",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "The flower petal data set consists of 3 different types of irises’ (Setosa, Versicolour, and Virginica) petal and sepal size.\n",
    "\n",
    "Given the predictors\n",
    "- 1st column: sepal length,\n",
    "- 2nd column: sepal width,\n",
    "- 3rd column: petal length and\n",
    "- 4th column: petal width,\n",
    "\n",
    "our goal is to predict the correct class (0-Setosa, 1-Versicolour or 2-Virginica).\n",
    "\n",
    "The data set is part of `scikit-learn`'s datasets module and can be imported with the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "316d0bf0b445eae3759b7a3922c06661",
     "grade": false,
     "grade_id": "cell-a8324ad376a0593b",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "dbf44cc176c08085b88eefea503279b4",
     "grade": false,
     "grade_id": "cell-c042ae9393ae8933",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "The data comes as a dictionary. You can access the predictors using `iris.data` and the classes using `iris.target`.\n",
    "\n",
    "**Task (1 point)**: Store the predictors in a variable `X` and the response in a variable `y`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "fc364188b7041d84fc05410c7c7fd076",
     "grade": false,
     "grade_id": "cell-4c5d0ba46cdd1586",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "1a0e11a3e3b5be144664352cb186dd2c",
     "grade": true,
     "grade_id": "cell-53d8fde18f295bd5",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert X.shape == (150,4)\n",
    "assert y.shape == (150,)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "32545404c2a15f75ae687623895a5dbf",
     "grade": false,
     "grade_id": "cell-a26b22e257317813",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task (2 points)**: Plot the sepal length on the x-axis and the sepal width on the y-axis. Color each of the three types of irises differently.\n",
    "Add a legend that gives the correct iris type (0-Setosa, 1-Versicolour, 2-Virginica)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "ab1bbc32c4721156186fcbe765fd301f",
     "grade": true,
     "grade_id": "cell-a99e46c9b162136b",
     "locked": false,
     "points": 2,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "e7ece228cba506adebffe2694f4bd56c",
     "grade": false,
     "grade_id": "cell-21f5a94901836304",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task (2 points)**:\n",
    "Split your data into a training and a test set.\n",
    "Put the first 40 samples within each class in the training set and the remaining samples in a test data set.\n",
    "\n",
    "Store the training set in variables `Xtrain` and `ytrain`, and the test set in variables `Xtest` and `ytest`, resp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "8f9a730d0542df47c71a8058bec6f970",
     "grade": false,
     "grade_id": "cell-3397ff5382060b53",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "369a39331af3016d2d00d3e9ac594664",
     "grade": true,
     "grade_id": "cell-4cff00529ce30e6d",
     "locked": true,
     "points": 2,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert Xtrain.shape == (120,4)\n",
    "assert Xtest.shape == (30,4)\n",
    "assert ytrain.shape == (120,)\n",
    "assert ytest.shape == (30,)\n",
    "assert abs(Xtrain.mean() - 3.485208333333333) < 1e-10\n",
    "assert abs(ytrain.mean() - 1) < 1e-10\n",
    "assert abs(ytest.mean() - 1) < 1e-10\n",
    "assert abs(Xtest.mean() - 3.3816666666666673) < 1e-10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "66af2645db5fe1d9a750c08a09fc3890",
     "grade": false,
     "grade_id": "cell-121f56ad6127c490",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "### Part B: Linear discriminant analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "54779c677544b85e505984439ca72b34",
     "grade": false,
     "grade_id": "cell-9f1175b6693317b1",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "In the lecture you've heard about the classification method called\n",
    "*Linear discriminant analysis (LDA)*.\n",
    "\n",
    "**Task (1 points)**: Find a way using `scikit-learn` to accomplish a linear discriminant analysis on the **training data set**.\n",
    "\n",
    "Perform an LDA using only the first two predictors, i.e., `sepal length` and `sepal width`.\n",
    "Store your trained model in the variable `lda`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "2bdac8840317d9dfc921f6a5bbf310ce",
     "grade": false,
     "grade_id": "cell-02fee2efc178e8e6",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "b07e83cf9b33eaa8ae9c234eaafb894a",
     "grade": true,
     "grade_id": "cell-f84d0268df55c5ca",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'lda' in locals()\n",
    "assert abs(lda.predict_proba([[2.1,1.1]])[0][0] - 0.7867422283434491) < 1e-10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "bac8f7a08275380139c2dfee0f36a138",
     "grade": false,
     "grade_id": "cell-c43f7abfa5544dd3",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task (1 point)**: What is the proportion of correctly classified irises in the **test data set**? Store your answer in the variable `prop1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "339a49a4db9e080c9f6c9cb969ebfaf3",
     "grade": false,
     "grade_id": "cell-902ceff2ae133619",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "eaa59f65ea5e635564530754f01604d4",
     "grade": true,
     "grade_id": "cell-1cc1e8c94bb792dd",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'prop1' in locals()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "00a9de394cb7b20780c991c359776141",
     "grade": false,
     "grade_id": "cell-b865e2c9260730db",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task (2 points)**: Now, incorporate all of the predictors and perform a second linear discriminant analysis using **only the training data**.\n",
    "How does the proportion of correct classifications change (for the **test data**)?\n",
    "Store the proportion of correct classifications for the test set in the variable `prop2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "72b46f63e18e7aa428d404a3fe900eb8",
     "grade": false,
     "grade_id": "cell-8b7d19220834da29",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "6793c7b8a2c398988ebff1e9f47ff3e4",
     "grade": true,
     "grade_id": "cell-e8a557d63a9a00ad",
     "locked": true,
     "points": 2,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'prop2' in locals()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "046558f7c8a27d81c33a16c5a9be2f01",
     "grade": false,
     "grade_id": "cell-8c53e0fecaf4f120",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "### Part C: Multi-class logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "21dd54f7a96e1dfeba0c7ce8e193ba52",
     "grade": false,
     "grade_id": "cell-82ffaac2c14cc54d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Here, we want to apply logistic regression in a multi-class setting.\n",
    "\n",
    "#### One vs. rest approach\n",
    "One way to extend the logistic regression to a setting with $k$ classes is by training not one but $k$ models for $k > 2$, one for each individual class.\n",
    "As the name suggests, we train one model for each individual class $i$ and try to fit a logistic regression model to a modified data set, where the responses of members belonging to class $i$ are set to `True` and **all** others are set to `False`, i.e. we keep class $i$, set their responses to `True` and modify the responses of the remaining data and set those to `False`.\n",
    "\n",
    "Fortunately, this **one vs. rest** approach is implemented for many models, and we can train it using the functions which have been already used for the *simple logistic regression* problem.\n",
    "\n",
    "**Task (1 point)**: Train a logistic regression model (on our **training data**) with the following parameters: \n",
    "- penalty parameter: `C = 1e10` \n",
    "- solver: `solver = 'liblinear'`\n",
    "- multi-class option active: `multi_class`\n",
    "\n",
    "Store your model in the variable `lr`."
   ]
  },
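  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of how the $k$ binary models described above can be combined (the notation $\\beta_{i0}$, $\\beta_i$, $\\sigma$ and $\\hat{p}_i$ is introduced here for illustration only and is not taken from the lecture): model $i$ estimates the probability that a sample $x$ belongs to class $i$ rather than to any other class,\n",
    "\n",
    "$$\\hat{p}_i(x) = \\sigma(\\beta_{i0} + \\beta_i^\\top x), \\qquad \\sigma(z) = \\frac{1}{1 + e^{-z}}, \\qquad i = 0, \\dots, k-1,$$\n",
    "\n",
    "and one common rule is to predict the class whose model is most confident:\n",
    "\n",
    "$$\\hat{y}(x) = \\arg\\max_{i} \\, \\hat{p}_i(x).$$\n",
    "\n",
    "Because the $k$ models are fitted independently, the values $\\hat{p}_i(x)$ need not sum to one; dividing each by $\\sum_j \\hat{p}_j(x)$ is one way to turn them into class probabilities."
   ]
  },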
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "c429067e6a3516601d75d204be8e147a",
     "grade": false,
     "grade_id": "cell-719cc6f0de88caea",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "eeddcca889beca55559b024f1068b208",
     "grade": true,
     "grade_id": "cell-0293e9d4d1d99ce4",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'lr' in locals()\n",
    "assert abs(lr.predict_proba([[2.1,1.1,1.1,1.0]])[0][0] - 0.07823538026445785) < 1e-6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "1eb21c09eca81b108f91021846eb3159",
     "grade": false,
     "grade_id": "cell-74d1b2bc5747517e",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "**Task (1 point)**: Store the proportion of correct classifications for the **test set** in the variable `prop0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "8b26fc8bf797eda19a0652ef26dac6fe",
     "grade": false,
     "grade_id": "cell-a0082b233f668687",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "7d4761e4fe9a4ef8aa30252ef4498179",
     "grade": true,
     "grade_id": "cell-d632bb14fc6d2528",
     "locked": true,
     "points": 1,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'prop0' in locals()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}