{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 1 - Implementing the logistic function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**:\n", "Implement the logistic function\n", "\n", "$$ \\sigma(x) = \\frac{e^x}{1+e^x} = \\frac{1}{1+e^{-x}}$$" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "#def sigma(x):\n", "    # Put your definition here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def sigma(x):\n", "    # Logistic function in the form 1 / (1 + exp(-x))\n", "    s = np.exp(-x)\n", "    return 1 / (1. + s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to investigate how the shape of the logistic function changes for an affine linear input, i.e.,\n", "\n", "$$ \\sigma(\\beta_0 + \\beta_1 x) $$\n", "\n", "for different values of $\\beta_0$ and $\\beta_1$.\n", "\n", "**Task**: Take your time and try different values.\n", "What happens for negative/positive values of $\\beta_1$?\n", "What role does $\\beta_0$ play?\n", "\n", "**You have nothing to implement here, only evaluate the cells below.**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def my_sigma(x, b0, b1):\n", "    # Logistic function of the affine linear input b0 + b1 * x\n", "    return sigma(b0 + b1 * x)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8cbc0286157147d589ebb702655091b5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(FloatSlider(value=0.0, description='b0', max=10.0, min=-10.0, step=1.0), FloatSlider(val…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from ipywidgets import interactive\n", "def f(b0, b1):\n", "    plt.figure(1)\n", "    x = np.linspace(-10, 10, 1001)\n", "    plt.plot(x, sigma(b0 + b1*x))\n", "    # Horizontal reference line at probability 0.5\n", "    plt.plot(x, 0.5*np.ones(x.shape))\n", "    plt.ylim(-0.1, 1.1)\n", "    plt.xlabel('x')\n", "    plt.ylabel('p(x)')\n", "    plt.show()\n", "\n", "interactive_plot = interactive(f, b0=(-10.0, 10.0, 1.0), b1=(-3., 3., 0.2))\n", "output = interactive_plot.children[-1]\n", "output.layout.height = '350px'\n", "interactive_plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2 - Logistic regression in practice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, we want to investigate the `Default` data set known from the lecture.\n", "We first load the necessary modules.\n", "The command\n", "\n", "    plt.rcParams['figure.figsize'] = [13, 5]\n", "\n", "changes the size of the figure (in inches)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = [13, 5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Download the file `Default.csv` from the webpage and read it using the `pandas` function `read_csv`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "D = pd.read_csv('./datasets/Default.csv', index_col=0, decimal=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Inspect the data using the methods you've learned so far, e.g., `describe`, `hist`, `head`, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " default student balance income\n", "1 No No 729.526495 44361.625074\n", "2 No Yes 817.180407 12106.134700\n", "3 No No 1073.549164 31767.138947\n", "4 No No 529.250605 35704.493935\n", "5 No No 785.655883 38463.495879" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " balance income\n", "count 10000.000000 10000.000000\n", "mean 835.374886 33516.981876\n", "std 483.714985 13336.639563\n", "min 0.000000 771.967729\n", "25% 481.731105 21340.462903\n", "50% 823.636973 34552.644802\n", "75% 1166.308386 43807.729272\n", "max 2654.322576 73554.233495" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D.describe()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[,\n", " ]],\n", " dtype=object)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "D.hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Observation**: If you try the `describe` function you should see that the predictors `default` and `student` are not part of the summary.\n", "This is due to the fact that these values were read in by the `read_csv` function as strings. We know from the lecture that these predictors are categorical (in particular binary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to process these values we convert them to the data type `boolean`, i.e., we replace the `String` objects in the columns `default` and `student` by `Boolean`'s.\n", "There are a lot of ways to accomplish this task; the easiest might be\n", "\n", " D.replace(to_replace='No',value=False,inplace=True)\n", " \n", "**Task**: Replace every 'No' and 'Yes' in the `DataFrame` by the values `False` and `True`, resp." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "D.replace(to_replace='No',value=False,inplace=True)\n", "D.replace(to_replace='Yes',value=True,inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to plot both, the `income` and `balance` predictors as boxplots as a function of the `default` status.\n", "\n", "**Task**: Complete the plotting command in the following cell. What do you observe?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1,2)\n", "D.boxplot(column='balance',by='default', ax=ax[0]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1,2)\n", "D.boxplot(column='balance',by='default', ax=ax[0]);\n", "D.boxplot(column='income',by='default', ax=ax[1]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**: It seems that the credit card balance has a large effect on the default status, while the income seems not to predict the default status very well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to fit a logistic regression model to our data.\n", "Use the `LogisticRegression` function in the module `sklearn.linear_model`.\n", "The behaviour is similar to a `LinearRegression` fit.\n", "\n", "You can find the documentation of this function [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n", "There are a lot of optional arguments, the most important might be the unimpressive looking parameter `C`, which determines the strength of regularization used in the algorithm that solves the maximum likelihood problem.\n", "\n", "We will discuss regularization later in the lecture as well as in the labs. For now, it suffices if you keep the following in mind:\n", "\n", "**The larger you choose `C`, the less the problem will be regularized.**\n", "\n", "**Task**: Fit a logistic regression model that predicts the probability of `default` using `balance` as predictor. You should obtain the following values: $\\beta_0: -10.6513$, $\\beta_\\text{balance}: 0.0055$.\n", "\n", "Choose the following optional parameters:\n", "* set the regularization parameter `C = 1e10` (which is the scientific notation of $C = 10^{10}$, and thus very large)\n", "* set the error tolerance to `tol=1e-10`\n", "* set the solver to `solver = 'liblinear'`\n", "\n", "in this and the upcoming problems." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "lr = LogisticRegression(solver='liblinear', tol=1e-10, C=1e10)\n", "reg = lr.fit(D.balance.values.reshape(-1,1), D.default)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Print the intercept as well as the coefficients in a nice way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The log-odds can be computed by: -10.651 + 0.005 x balance\n" ] } ], "source": [ "print('The log-odds can be computed by: %6.3f + %6.3f x balance' % (reg.intercept_[0], reg.coef_[0][0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**:\n", "Predict the probability of `default` for `balance` values of $\\$1,000$ and $\\$2,000$, respectively.\n", "Use the method `predict_proba`.\n", "Interpret the results." 
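] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probability returned by `predict_proba` is the logistic function applied to the log-odds, so it can also be reproduced by hand.\n", "The following cell is a minimal sketch of that calculation. It assumes the rounded values $\\beta_0 = -10.6513$ and $\\beta_\\text{balance} = 0.0055$ quoted in the fitting task; the names `beta_0`, `beta_balance`, `balances` and `p_default` are chosen for illustration only.\n", "Because the coefficients are rounded, the results should agree with the fitted model only to roughly three decimal places." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Assumption: rounded coefficients as quoted in the fitting task,\n", "# not the full-precision values stored in the fitted model.\n", "beta_0 = -10.6513\n", "beta_balance = 0.0055\n", "\n", "def sigma(x):\n", "    # Logistic function 1 / (1 + exp(-x))\n", "    return 1.0 / (1.0 + np.exp(-x))\n", "\n", "# Balances for which the probability of default is evaluated\n", "balances = np.array([1000.0, 2000.0])\n", "\n", "# Log-odds are affine linear in the balance\n", "log_odds = beta_0 + beta_balance * balances\n", "\n", "# Applying the logistic function turns log-odds into probabilities\n", "p_default = sigma(log_odds)\n", "print(p_default)"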
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.99424779, 0.00575221],\n", "       [0.41423209, 0.58576791]])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg.predict_proba(np.array([1000.,2000.]).reshape(-1,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**: The second column of the output contains the probability of default.\n", "The probability of default of an individual with a credit card balance of $\\$1,000$ is approximately 0.58\\%.\n", "The probability of default of an individual with a credit card balance of $\\$2,000$ is approximately 58.6\\%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we want to incorporate the predictors `income` and `student` status as well. This can be done easily using the same methods." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "lr2 = LogisticRegression(solver='liblinear', tol=1e-10, C=1e10)\n", "X = D.loc[:,['balance','income','student']]\n", "y = D.loc[:,'default']\n", "reg2 = lr2.fit(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**: Print the intercept as well as the coefficients in a nice way." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The log-odds can be computed by: -10.869 + 0.006 x balance + 0.000 x income + -0.647 x student\n" ] } ], "source": [ "print('The log-odds can be computed by: %6.3f + %6.3f x balance + %6.3f x income + %6.3f x student'\n", "      % ((reg2.intercept_[0],) + tuple(reg2.coef_[0])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Task**:\n", "What is the default probability of a student and of a non-student, each with a credit card balance of $\\$1,500$ and an income of $\\$40,000$?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.94211806, 0.05788194],\n", "       [0.89500808, 0.10499192]])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg2.predict_proba(np.array([[1500,40000,1],[1500,40000,0]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**: A student with a credit card balance of $\\$1,500$ and an income of $\\$40,000$ has an estimated probability of default of 5.8\\%, while a non-student with the same balance and income has a probability of default of 10.5\\%." 
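] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since both individuals share the same balance and income, the difference between their log-odds isolates the `student` coefficient.\n", "The cell below is a small sketch of this cross-check. It takes the two probabilities from the `predict_proba` output above as given; the helper `log_odds` and the variable names are illustrative assumptions.\n", "The difference should come out close to the printed coefficient of $-0.647$, and exponentiating it gives the odds ratio between a student and a non-student." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Probabilities of default copied from the predict_proba output above\n", "p_student = 0.05788194\n", "p_non_student = 0.10499192\n", "\n", "def log_odds(p):\n", "    # Inverse of the logistic function: log(p / (1 - p))\n", "    return np.log(p / (1.0 - p))\n", "\n", "# Balance and income are identical for both individuals, so their\n", "# contributions cancel and only the student term remains.\n", "diff = log_odds(p_student) - log_odds(p_non_student)\n", "\n", "# Odds of default for a student relative to a non-student\n", "odds_ratio = np.exp(diff)\n", "\n", "print('difference in log-odds: %6.3f' % diff)\n", "print('odds ratio student vs. non-student: %5.3f' % odds_ratio)"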
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }