{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name below.\n", "\n", "Rename this problem sheet as follows:\n", "\n", "    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}\n", "    \n", "for example\n", "    \n", "    ps2_blja_problem1\n", "\n", "Submit your homework within one week, i.e., by next Monday at 9 a.m." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "EMAIL = \"\"\n", "USERNAME = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "83be57aa54b869534a3f21c3602df38c", "grade": false, "grade_id": "cell-2f2cbd6f50d7ae76", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "# Introduction to Data Science\n", "\n", "## Lab 8: Multi-class classification" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "dd840324c8b525bc29069122d8a53d6f", "grade": false, "grade_id": "cell-cbc0379fbf3909bb", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part A: Exploration of the flower petal data set" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "45ef14557ea09dfdd2db60809b0f95ab", "grade": false, "grade_id": "cell-ea3f7ad38a908c79", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The flower 
petal data set consists of the sepal and petal sizes of 3 different types of irises (Setosa, Versicolour, and Virginica).\n", "\n", "Given the predictors\n", "- 1st column: sepal length,\n", "- 2nd column: sepal width,\n", "- 3rd column: petal length and\n", "- 4th column: petal width,\n", "\n", "our goal is to predict the correct class (0-Setosa, 1-Versicolour or 2-Virginica).\n", "\n", "The data set is part of `scikit-learn`'s datasets module and can be imported with the following commands:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "316d0bf0b445eae3759b7a3922c06661", "grade": false, "grade_id": "cell-a8324ad376a0593b", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "from sklearn import datasets\n", "iris = datasets.load_iris()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "dbf44cc176c08085b88eefea503279b4", "grade": false, "grade_id": "cell-c042ae9393ae8933", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The data comes as a dictionary-like object. You can access the predictors using `iris.data` and the classes using `iris.target`.\n", "\n", "**Task (1 point)**: Store the predictors in a variable `X` and the response in a variable `y`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fc364188b7041d84fc05410c7c7fd076", "grade": false, "grade_id": "cell-4c5d0ba46cdd1586", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "1a0e11a3e3b5be144664352cb186dd2c", "grade": true, "grade_id": "cell-53d8fde18f295bd5", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert X.shape == (150,4)\n", "assert y.shape == (150,)" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "32545404c2a15f75ae687623895a5dbf", "grade": false, "grade_id": "cell-a26b22e257317813", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (2 points)**: Plot the sepal length on the x-axis and the sepal width on the y-axis. 
Color each of the three types of irises differently.\n", "Add a legend that gives the correct iris type (0-Setosa, 1-Versicolour, 2-Virginica)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ab1bbc32c4721156186fcbe765fd301f", "grade": true, "grade_id": "cell-a99e46c9b162136b", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "e7ece228cba506adebffe2694f4bd56c", "grade": false, "grade_id": "cell-21f5a94901836304", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (2 points)**:\n", "Split your data into a training and a test set.\n", "Put the first 40 samples of each class into the training set and the remaining samples into the test set.\n", "\n", "Store the training set in variables `Xtrain` and `ytrain`, and the test set in variables `Xtest` and `ytest`, respectively." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8f9a730d0542df47c71a8058bec6f970", "grade": false, "grade_id": "cell-3397ff5382060b53", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import numpy as np\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "369a39331af3016d2d00d3e9ac594664", "grade": true, "grade_id": "cell-4cff00529ce30e6d", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert Xtrain.shape == (120,4)\n", "assert Xtest.shape == (30,4)\n", "assert ytrain.shape == (120,)\n", "assert ytest.shape == (30,)\n", "assert abs(Xtrain.mean() - 3.485208333333333) < 1e-10\n", "assert abs(ytrain.mean() - 1) < 1e-10\n", "assert abs(ytest.mean() - 1) < 1e-10\n", "assert abs(Xtest.mean() - 3.3816666666666673) < 1e-10" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "66af2645db5fe1d9a750c08a09fc3890", "grade": false, "grade_id": "cell-121f56ad6127c490", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part B: Linear discriminant analysis" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "54779c677544b85e505984439ca72b34", "grade": false, "grade_id": "cell-9f1175b6693317b1", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "In the lecture you've heard about the classification method called\n", "*Linear discriminant analysis (LDA)*.\n", "\n", "**Task (1 point)**: Find a way to perform a linear discriminant analysis with `scikit-learn` on the **training 
data set**.\n", "\n", "Perform an LDA using only the first two predictors, i.e., `sepal length` and `sepal width`.\n", "Store your trained model in the variable `lda`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "2bdac8840317d9dfc921f6a5bbf310ce", "grade": false, "grade_id": "cell-02fee2efc178e8e6", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b07e83cf9b33eaa8ae9c234eaafb894a", "grade": true, "grade_id": "cell-f84d0268df55c5ca", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'lda' in locals()\n", "assert abs(lda.predict_proba([[2.1,1.1]])[0][0] - 0.7867422283434491) < 1e-10" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "bac8f7a08275380139c2dfee0f36a138", "grade": false, "grade_id": "cell-c43f7abfa5544dd3", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: What is the proportion of correctly classified irises in the **test data set**? Store your answer in the variable `prop1`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "339a49a4db9e080c9f6c9cb969ebfaf3", "grade": false, "grade_id": "cell-902ceff2ae133619", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "eaa59f65ea5e635564530754f01604d4", "grade": true, "grade_id": "cell-1cc1e8c94bb792dd", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'prop1' in locals()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "00a9de394cb7b20780c991c359776141", "grade": false, "grade_id": "cell-b865e2c9260730db", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (2 points)**: Now, incorporate all of the predictors and perform a second linear discriminant analysis using **only the training data**.\n", "How does the proportion of correct classifications change (for the **test data**)?\n", "Store the proportion of correct classifications for the test set in the variable `prop2`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "72b46f63e18e7aa428d404a3fe900eb8", "grade": false, "grade_id": "cell-8b7d19220834da29", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6793c7b8a2c398988ebff1e9f47ff3e4", "grade": true, "grade_id": "cell-e8a557d63a9a00ad", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'prop2' in locals()" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "046558f7c8a27d81c33a16c5a9be2f01", "grade": false, "grade_id": "cell-8c53e0fecaf4f120", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "### Part C: Multi-class logistic regression" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "21dd54f7a96e1dfeba0c7ce8e193ba52", "grade": false, "grade_id": "cell-82ffaac2c14cc54d", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "Here, we want to apply logistic regression in a multi-class setting.\n", "\n", "#### One vs. rest approach\n", "One way to extend logistic regression to a setting with $k > 2$ classes is to train not one but $k$ models, one for each individual class.\n", "As the name suggests, for each class $i$ we fit a logistic regression model to a modified data set, where the responses of members belonging to class $i$ are set to `True` and **all** others are set to `False`, i.e. 
we keep class $i$, set its responses to `True`, and set the responses of all remaining data to `False`.\n", "\n", "Fortunately, this **one vs. rest** approach is implemented for many models, and we can train it using the functions which have already been used for the *simple logistic regression* problem.\n", "\n", "**Task (1 point)**: Train a logistic regression model (on our **training data**) with the following parameters: \n", "- penalty parameter (inverse regularization strength): `C = 1e10` \n", "- solver: `solver = 'liblinear'`\n", "- multi-class option: `multi_class = 'ovr'` (one vs. rest)\n", "\n", "Store your model in the variable `lr`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c429067e6a3516601d75d204be8e147a", "grade": false, "grade_id": "cell-719cc6f0de88caea", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "eeddcca889beca55559b024f1068b208", "grade": true, "grade_id": "cell-0293e9d4d1d99ce4", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'lr' in locals()\n", "assert abs(lr.predict_proba([[2.1,1.1,1.1,1.0]])[0][0] - 0.07823538026445785) < 1e-6" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1eb21c09eca81b108f91021846eb3159", "grade": false, "grade_id": "cell-74d1b2bc5747517e", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Task (1 point)**: Store the proportion of correct classifications for the **test set** in the variable `prop0`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8b26fc8bf797eda19a0652ef26dac6fe", "grade": false, "grade_id": "cell-a0082b233f668687", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7d4761e4fe9a4ef8aa30252ef4498179", "grade": true, "grade_id": "cell-d632bb14fc6d2528", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert 'prop0' in locals()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }