{ "metadata": { "name": "", "signature": "sha256:f8bfb9094f511867c12a4dd94fdfbd1f0c34cbaf45b2d1ec75ec7c80da588e93" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "At its peak around 2000, Enron was one of the ten largest companies in the U.S. This multi-billion-dollar company then collapsed suddenly: thousands of people lost their jobs, and some executives went to jail. The company was found to have committed corporate fraud, and its bankruptcy became one of the most complex bankruptcy cases in U.S. history. The fraud was so massive that one may wonder how it lasted until 2001. Some have even said that the Enron bankruptcy was closely tied to the 9/11 incident.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the collapse, the company's emails were made public. This is a huge, real dataset: a corpus of actual communication between people." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the Enron dataset as our machine learning problem. We will follow Katie's experience in exploring the dataset and learn from her mistakes. We will apply various techniques and algorithms to it: clustering (an unsupervised learning algorithm) to uncover who is a board director and who is ordinary staff, regression to examine the relationship between salary and bonus, and recommender systems based on the conversations people had about movies. We will also identify some of the bugs and outliers in the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset, although more than a decade old, is still applicable to modern learning. It is real, not synthesized, and can be used as a project. We will explore this dataset, come up with interesting questions, and become experts through it." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Person of Interest (POI)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Indicted\n", "* Settled without admitting guilt\n", "* Testified in exchange for immunity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/1.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A POI is a person of interest to the government. A list published on the web names 35 people involved in the Enron fraud, and Katie compiled her list by hand. The true number could be higher or lower than the one she mentioned." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Accuracy vs. Training Set Size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose, as in the plot, that we have a dataset of 1000 examples. Machine learning practitioners usually split a dataset 80:20 (train, test); some split it 60:20:20 (train, cv, test). Here we plot accuracy as the training set grows in increments of 200 examples. The curve increases monotonically: accuracy keeps getting better, but the slope flattens toward the end." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we can plot a curve like this, we know that with only 400 examples we should get more data. With 800 examples, we can also predict from the curve that increasing to 1000 examples will bring only a small performance gain." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, more data can be better than a finely tuned algorithm, but this is only partially true. More data can fix high variance. If your algorithm has high bias, however, increasing the size of your dataset won't fix its performance. For more information, please check my other [blog post](http://napitupulu-jon.appspot.com/posts/Data-for-Machine-Learning.html)." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Enron Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After downloading the data, I can see that of all 150 people in the Enron data, only 5 POIs (marked with Y in the list Katie mentioned) are in the dataset. This is generally a problem, since 5 examples wouldn't be enough. Let's approach this from a different angle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/3.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the types of data that we typically find in a dataset.\n", "\n", "For example:\n", "\n", "* Numerical: salary, number of emails\n", "* Categorical: job title\n", "* Time series: timestamps on emails\n", "* Text: contents of emails, to/from fields of emails" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Enron Mini-Project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We\u2019ve combined the email and finance data into a single dataset, which you\u2019ll explore in this mini-project." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%load explore_enron_data.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "#!/usr/bin/python\n", "\n", "\"\"\" \n", "    starter code for exploring the Enron dataset (emails + finances) \n", "    loads up the dataset (pickled dict of dicts)\n", "\n", "    the dataset has the form\n", "    enron_data[\"LASTNAME FIRSTNAME MIDDLEINITIAL\"] = { features_dict }\n", "\n", "    {features_dict} is a dictionary of features associated with that person\n", "    you should explore features_dict as part of the mini-project,\n", "    but here's an example to get you started:\n", "\n", "    enron_data[\"SKILLING JEFFREY K\"][\"bonus\"] = 5600000\n", "    \n", "\"\"\"\n", "\n", "import pickle\n", "\n", "# load the pickled dict of dicts; the with-block closes the file afterwards\n", "with open(\"../final_project/final_project_dataset.pkl\", \"r\") as f:\n", "    enron_data = pickle.load(f)\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['SKILLING JEFFREY K']['bonus']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "5600000" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/4.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(enron_data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "146" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/5.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(enron_data['SKILLING JEFFREY K'])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "21" ] } ], "prompt_number": 9 }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/6.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# count the people flagged as a person of interest\n", "count = 0\n", "for user in enron_data:\n", "    if enron_data[user]['poi'] == True:\n", "        count+=1\n", "print count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "18\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "%load ../final_project/poi_email_addresses.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "%load ../final_project/poi_names.txt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm\n", "\n", "(y) Lay, Kenneth\n", "(y) Skilling, Jeffrey\n", "(n) Howard, Kevin\n", "(n) Krautz, Michael\n", "(n) Yeager, Scott\n", "(n) Hirko, Joseph\n", "(n) Shelby, Rex\n", "(n) Bermingham, David\n", "(n) Darby, Giles\n", "(n) Mulgrew, Gary\n", "(n) Bayley, Daniel\n", "(n) Brown, James\n", "(n) Furst, Robert\n", "(n) Fuhs, William\n", "(n) Causey, Richard\n", "(n) Calger, Christopher\n", "(n) DeSpain, Timothy\n", "(n) Hannon, Kevin\n", "(n) Koenig, Mark\n", "(y) Forney, John\n", "(n) Rice, Kenneth\n", "(n) Rieker, Paula\n", "(n) Fastow, Lea\n", "(n) Fastow, Andrew\n", "(y) Delainey, David\n", "(n) Glisan, Ben\n", "(n) Richter, Jeffrey\n", "(n) Lawyer, Larry\n", "(n) Belden, Timothy\n", "(n) Kopper, Michael\n", "(n) Duncan, David\n", "(n) Bowen, Raymond\n", "(n) Colwell, Wesley\n", "(n) Boyle, Dan\n", "(n) Loehr, Christopher\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def poiEmails():\n", " email_list = [\"kenneth_lay@enron.net\", \n", " \"kenneth_lay@enron.com\",\n", " \"klay.enron@enron.com\",\n", " 
\"kenneth.lay@enron.com\", \n", " \"klay@enron.com\",\n", " \"layk@enron.com\",\n", " \"chairman.ken@enron.com\",\n", " \"jeffreyskilling@yahoo.com\",\n", " \"jeff_skilling@enron.com\",\n", " \"jskilling@enron.com\",\n", " \"effrey.skilling@enron.com\",\n", " \"skilling@enron.com\",\n", " \"jeffrey.k.skilling@enron.com\",\n", " \"jeff.skilling@enron.com\",\n", " \"kevin_a_howard.enronxgate.enron@enron.net\",\n", " \"kevin.howard@enron.com\",\n", " \"kevin.howard@enron.net\",\n", " \"kevin.howard@gcm.com\",\n", " \"michael.krautz@enron.com\"\n", " \"scott.yeager@enron.com\",\n", " \"syeager@fyi-net.com\",\n", " \"scott_yeager@enron.net\",\n", " \"syeager@flash.net\",\n", " \"joe'.'hirko@enron.com\", \n", " \"joe.hirko@enron.com\", \n", " \"rex.shelby@enron.com\", \n", " \"rex.shelby@enron.nt\", \n", " \"rex_shelby@enron.net\",\n", " \"jbrown@enron.com\",\n", " \"james.brown@enron.com\", \n", " \"rick.causey@enron.com\", \n", " \"richard.causey@enron.com\", \n", " \"rcausey@enron.com\",\n", " \"calger@enron.com\",\n", " \"chris.calger@enron.com\", \n", " \"christopher.calger@enron.com\", \n", " \"ccalger@enron.com\",\n", " \"tim_despain.enronxgate.enron@enron.net\", \n", " \"tim.despain@enron.com\",\n", " \"kevin_hannon@enron.com\", \n", " \"kevin'.'hannon@enron.com\", \n", " \"kevin_hannon@enron.net\", \n", " \"kevin.hannon@enron.com\",\n", " \"mkoenig@enron.com\", \n", " \"mark.koenig@enron.com\",\n", " \"m..forney@enron.com\",\n", " \"ken'.'rice@enron.com\", \n", " \"ken.rice@enron.com\",\n", " \"ken_rice@enron.com\", \n", " \"ken_rice@enron.net\",\n", " \"paula.rieker@enron.com\",\n", " \"prieker@enron.com\", \n", " \"andrew.fastow@enron.com\", \n", " \"lfastow@pdq.net\", \n", " \"andrew.s.fastow@enron.com\", \n", " \"lfastow@pop.pdq.net\", \n", " \"andy.fastow@enron.com\",\n", " \"david.w.delainey@enron.com\", \n", " \"delainey.dave@enron.com\", \n", " \"'delainey@enron.com\", \n", " \"david.delainey@enron.com\", \n", " \"'david.delainey'@enron.com\", \n", " 
\"dave.delainey@enron.com\", \n", " \"delainey'.'david@enron.com\",\n", " \"ben.glisan@enron.com\", \n", " \"bglisan@enron.com\", \n", " \"ben_f_glisan@enron.com\", \n", " \"ben'.'glisan@enron.com\",\n", " \"jeff.richter@enron.com\", \n", " \"jrichter@nwlink.com\",\n", " \"lawrencelawyer@aol.com\", \n", " \"lawyer'.'larry@enron.com\", \n", " \"larry_lawyer@enron.com\", \n", " \"llawyer@enron.com\", \n", " \"larry.lawyer@enron.com\", \n", " \"lawrence.lawyer@enron.com\",\n", " \"tbelden@enron.com\", \n", " \"tim.belden@enron.com\", \n", " \"tim_belden@pgn.com\", \n", " \"tbelden@ect.enron.com\",\n", " \"michael.kopper@enron.com\",\n", " \"dave.duncan@enron.com\", \n", " \"dave.duncan@cipco.org\", \n", " \"duncan.dave@enron.com\",\n", " \"ray.bowen@enron.com\", \n", " \"raymond.bowen@enron.com\", \n", " \"'bowen@enron.com\",\n", " \"wes.colwell@enron.com\",\n", " \"dan.boyle@enron.com\",\n", " \"cloehr@enron.com\", \n", " \"chris.loehr@enron.com\"\n", " ]\n", " return email_list\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many POI\u2019s were there total? 
(Use the names file, not the email addresses, since many folks have more than one address and a few didn\u2019t work for Enron, so we don\u2019t have their emails.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(poiEmails())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "90" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "fo = open('../final_project/poi_names.txt','r')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "fr = fo.readlines()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/7.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(fr[2:])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 40, "text": [ "35" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "fo.close()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we have many of the POIs in our E+F dataset, but not all of them. Why is that a potential problem?\n", "\n", "We will return to this later to explain how a POI could end up not being in the Enron E+F dataset, so you fully understand the issue before moving on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/8.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few things you could say here, but our main thought is about having enough data to really learn the patterns. In general, more data is always better--only having 18 data points doesn't give you that many examples to learn from." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/9.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 43, "text": [ "['METTS MARK',\n", " 'BAXTER JOHN C',\n", " 'ELLIOTT STEVEN',\n", " 'CORDES WILLIAM R',\n", " 'HANNON KEVIN P',\n", " 'MORDAUNT KRISTINA M',\n", " 'MEYER ROCKFORD G',\n", " 'MCMAHON JEFFREY',\n", " 'HORTON STANLEY C',\n", " 'PIPER GREGORY F',\n", " 'HUMPHREY GENE E',\n", " 'UMANOFF ADAM S',\n", " 'BLACHMAN JEREMY M',\n", " 'SUNDE MARTIN',\n", " 'GIBBS DANA R',\n", " 'LOWRY CHARLES P',\n", " 'COLWELL WESLEY',\n", " 'MULLER MARK S',\n", " 'JACKSON CHARLENE R',\n", " 'WESTFAHL RICHARD K',\n", " 'WALTERS GARETH W',\n", " 'WALLS JR ROBERT H',\n", " 'KITCHEN LOUISE',\n", " 'CHAN RONNIE',\n", " 'BELFER ROBERT',\n", " 'SHANKMAN JEFFREY A',\n", " 'WODRASKA JOHN',\n", " 'BERGSIEKER RICHARD P',\n", " 'URQUHART JOHN A',\n", " 'BIBI PHILIPPE A',\n", " 'RIEKER PAULA H',\n", " 'WHALEY DAVID A',\n", " 'BECK SALLY W',\n", " 'HAUG DAVID L',\n", " 'ECHOLS JOHN B',\n", " 'MENDELSOHN JOHN',\n", " 'HICKERSON GARY J',\n", " 'CLINE KENNETH W',\n", " 'LEWIS RICHARD',\n", " 'HAYES ROBERT E',\n", " 'MCCARTY DANNY J',\n", " 'KOPPER MICHAEL J',\n", " 'LEFF DANIEL P',\n", " 'LAVORATO JOHN J',\n", " 'BERBERIAN DAVID',\n", " 'DETMERING TIMOTHY J',\n", " 'WAKEHAM JOHN',\n", " 'POWERS WILLIAM',\n", " 'GOLD JOSEPH',\n", " 'BANNANTINE JAMES M',\n", " 'DUNCAN JOHN H',\n", " 'SHAPIRO RICHARD S',\n", " 'SHERRIFF JOHN R',\n", " 'SHELBY REX',\n", " 'LEMAISTRE CHARLES',\n", " 'DEFFNER JOSEPH M',\n", " 'KISHKILL JOSEPH G',\n", " 'WHALLEY LAWRENCE G',\n", " 'MCCONNELL MICHAEL S',\n", " 'PIRO JIM',\n", " 'DELAINEY DAVID W',\n", " 'SULLIVAN-SHAKLOVITZ COLLEEN',\n", " 'WROBEL BRUCE',\n", " 'LINDHOLM TOD A',\n", " 'MEYER JEROME J',\n", " 'LAY KENNETH L',\n", " 'BUTTS ROBERT H',\n", " 'OLSON CINDY 
K',\n", " 'MCDONALD REBECCA',\n", " 'CUMBERLAND MICHAEL S',\n", " 'GAHN ROBERT S',\n", " 'MCCLELLAN GEORGE',\n", " 'HERMANN ROBERT J',\n", " 'SCRIMSHAW MATTHEW',\n", " 'GATHMANN WILLIAM D',\n", " 'HAEDICKE MARK E',\n", " 'BOWEN JR RAYMOND M',\n", " 'GILLIS JOHN',\n", " 'FITZGERALD JAY L',\n", " 'MORAN MICHAEL P',\n", " 'REDMOND BRIAN L',\n", " 'BAZELIDES PHILIP J',\n", " 'BELDEN TIMOTHY N',\n", " 'DURAN WILLIAM D',\n", " 'THORN TERENCE H',\n", " 'FASTOW ANDREW S',\n", " 'FOY JOE',\n", " 'CALGER CHRISTOPHER F',\n", " 'RICE KENNETH D',\n", " 'KAMINSKI WINCENTY J',\n", " 'LOCKHART EUGENE E',\n", " 'COX DAVID',\n", " 'OVERDYKE JR JERE C',\n", " 'PEREIRA PAULO V. FERRAZ',\n", " 'STABLER FRANK',\n", " 'SKILLING JEFFREY K',\n", " 'BLAKE JR. NORMAN P',\n", " 'SHERRICK JEFFREY B',\n", " 'PRENTICE JAMES',\n", " 'GRAY RODNEY',\n", " 'PICKERING MARK R',\n", " 'THE TRAVEL AGENCY IN THE PARK',\n", " 'NOLES JAMES L',\n", " 'KEAN STEVEN J',\n", " 'TOTAL',\n", " 'FOWLER PEGGY',\n", " 'WASAFF GEORGE',\n", " 'WHITE JR THOMAS E',\n", " 'CHRISTODOULOU DIOMEDES',\n", " 'ALLEN PHILLIP K',\n", " 'SHARP VICTORIA T',\n", " 'JAEDICKE ROBERT',\n", " 'WINOKUR JR. HERBERT S',\n", " 'BROWN MICHAEL',\n", " 'BADUM JAMES P',\n", " 'HUGHES JAMES A',\n", " 'REYNOLDS LAWRENCE',\n", " 'DIMICHELE RICHARD G',\n", " 'BHATNAGAR SANJAY',\n", " 'CARTER REBECCA C',\n", " 'BUCHANAN HAROLD G',\n", " 'YEAP SOON',\n", " 'MURRAY JULIA H',\n", " 'GARLAND C KEVIN',\n", " 'DODSON KEITH',\n", " 'YEAGER F SCOTT',\n", " 'HIRKO JOSEPH',\n", " 'DIETRICH JANET R',\n", " 'DERRICK JR. 
JAMES V',\n", " 'FREVERT MARK A',\n", " 'PAI LOU L',\n", " 'BAY FRANKLIN R',\n", " 'HAYSLETT RODERICK J',\n", " 'FUGH JOHN L',\n", " 'FALLON JAMES B',\n", " 'KOENIG MARK E',\n", " 'SAVAGE FRANK',\n", " 'IZZO LAWRENCE L',\n", " 'TILNEY ELIZABETH A',\n", " 'MARTIN AMANDA K',\n", " 'BUY RICHARD B',\n", " 'GRAMM WENDY L',\n", " 'CAUSEY RICHARD A',\n", " 'TAYLOR MITCHELL S',\n", " 'DONAHUE JR JEFFREY M',\n", " 'GLISAN JR BEN F']" ] } ], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['SKILLING JEFFREY K'].keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 42, "text": [ "['salary',\n", " 'to_messages',\n", " 'deferral_payments',\n", " 'total_payments',\n", " 'exercised_stock_options',\n", " 'bonus',\n", " 'restricted_stock',\n", " 'shared_receipt_with_poi',\n", " 'restricted_stock_deferred',\n", " 'total_stock_value',\n", " 'expenses',\n", " 'loan_advances',\n", " 'from_messages',\n", " 'other',\n", " 'from_this_person_to_poi',\n", " 'poi',\n", " 'director_fees',\n", " 'deferred_income',\n", " 'long_term_incentive',\n", " 'email_address',\n", " 'from_poi_to_this_person']" ] } ], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['PRENTICE JAMES']['total_stock_value']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 44, "text": [ "1095040" ] } ], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/10.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['COLWELL WESLEY']['from_this_person_to_poi']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 45, "text": [ "11" ] } ], "prompt_number": 45 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/11.jpg)" ] }, { 
"cell_type": "code", "collapsed": false, "input": [ "enron_data['SKILLING JEFFREY K']['exercised_stock_options']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 46, "text": [ "19250000" ] } ], "prompt_number": 46 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Research the Enron Fraud" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the coming lessons, we\u2019ll talk about how the best features are often motivated by our human understanding of the problem at hand. In this case, that means knowing a little about the story of the Enron fraud.\n", "\n", "If you have an hour and a half to spare, \u201cEnron: The Smartest Guys in the Room\u201d is a documentary that gives an amazing overview of the story. Alternatively, there are plenty of archival newspaper stories that chronicle the rise and fall of Enron." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/12.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/13.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/14.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/15.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/16.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(enron_data.keys())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 48, "text": [ "['ALLEN PHILLIP K',\n", " 'BADUM JAMES P',\n", " 'BANNANTINE JAMES M',\n", " 'BAXTER JOHN C',\n", " 'BAY FRANKLIN R',\n", " 'BAZELIDES PHILIP J',\n", " 'BECK SALLY W',\n", " 'BELDEN TIMOTHY N',\n", " 'BELFER ROBERT',\n", " 'BERBERIAN DAVID',\n", " 'BERGSIEKER RICHARD P',\n", " 'BHATNAGAR SANJAY',\n", " 'BIBI 
PHILIPPE A',\n", " 'BLACHMAN JEREMY M',\n", " 'BLAKE JR. NORMAN P',\n", " 'BOWEN JR RAYMOND M',\n", " 'BROWN MICHAEL',\n", " 'BUCHANAN HAROLD G',\n", " 'BUTTS ROBERT H',\n", " 'BUY RICHARD B',\n", " 'CALGER CHRISTOPHER F',\n", " 'CARTER REBECCA C',\n", " 'CAUSEY RICHARD A',\n", " 'CHAN RONNIE',\n", " 'CHRISTODOULOU DIOMEDES',\n", " 'CLINE KENNETH W',\n", " 'COLWELL WESLEY',\n", " 'CORDES WILLIAM R',\n", " 'COX DAVID',\n", " 'CUMBERLAND MICHAEL S',\n", " 'DEFFNER JOSEPH M',\n", " 'DELAINEY DAVID W',\n", " 'DERRICK JR. JAMES V',\n", " 'DETMERING TIMOTHY J',\n", " 'DIETRICH JANET R',\n", " 'DIMICHELE RICHARD G',\n", " 'DODSON KEITH',\n", " 'DONAHUE JR JEFFREY M',\n", " 'DUNCAN JOHN H',\n", " 'DURAN WILLIAM D',\n", " 'ECHOLS JOHN B',\n", " 'ELLIOTT STEVEN',\n", " 'FALLON JAMES B',\n", " 'FASTOW ANDREW S',\n", " 'FITZGERALD JAY L',\n", " 'FOWLER PEGGY',\n", " 'FOY JOE',\n", " 'FREVERT MARK A',\n", " 'FUGH JOHN L',\n", " 'GAHN ROBERT S',\n", " 'GARLAND C KEVIN',\n", " 'GATHMANN WILLIAM D',\n", " 'GIBBS DANA R',\n", " 'GILLIS JOHN',\n", " 'GLISAN JR BEN F',\n", " 'GOLD JOSEPH',\n", " 'GRAMM WENDY L',\n", " 'GRAY RODNEY',\n", " 'HAEDICKE MARK E',\n", " 'HANNON KEVIN P',\n", " 'HAUG DAVID L',\n", " 'HAYES ROBERT E',\n", " 'HAYSLETT RODERICK J',\n", " 'HERMANN ROBERT J',\n", " 'HICKERSON GARY J',\n", " 'HIRKO JOSEPH',\n", " 'HORTON STANLEY C',\n", " 'HUGHES JAMES A',\n", " 'HUMPHREY GENE E',\n", " 'IZZO LAWRENCE L',\n", " 'JACKSON CHARLENE R',\n", " 'JAEDICKE ROBERT',\n", " 'KAMINSKI WINCENTY J',\n", " 'KEAN STEVEN J',\n", " 'KISHKILL JOSEPH G',\n", " 'KITCHEN LOUISE',\n", " 'KOENIG MARK E',\n", " 'KOPPER MICHAEL J',\n", " 'LAVORATO JOHN J',\n", " 'LAY KENNETH L',\n", " 'LEFF DANIEL P',\n", " 'LEMAISTRE CHARLES',\n", " 'LEWIS RICHARD',\n", " 'LINDHOLM TOD A',\n", " 'LOCKHART EUGENE E',\n", " 'LOWRY CHARLES P',\n", " 'MARTIN AMANDA K',\n", " 'MCCARTY DANNY J',\n", " 'MCCLELLAN GEORGE',\n", " 'MCCONNELL MICHAEL S',\n", " 'MCDONALD REBECCA',\n", " 'MCMAHON JEFFREY',\n", " 
'MENDELSOHN JOHN',\n", " 'METTS MARK',\n", " 'MEYER JEROME J',\n", " 'MEYER ROCKFORD G',\n", " 'MORAN MICHAEL P',\n", " 'MORDAUNT KRISTINA M',\n", " 'MULLER MARK S',\n", " 'MURRAY JULIA H',\n", " 'NOLES JAMES L',\n", " 'OLSON CINDY K',\n", " 'OVERDYKE JR JERE C',\n", " 'PAI LOU L',\n", " 'PEREIRA PAULO V. FERRAZ',\n", " 'PICKERING MARK R',\n", " 'PIPER GREGORY F',\n", " 'PIRO JIM',\n", " 'POWERS WILLIAM',\n", " 'PRENTICE JAMES',\n", " 'REDMOND BRIAN L',\n", " 'REYNOLDS LAWRENCE',\n", " 'RICE KENNETH D',\n", " 'RIEKER PAULA H',\n", " 'SAVAGE FRANK',\n", " 'SCRIMSHAW MATTHEW',\n", " 'SHANKMAN JEFFREY A',\n", " 'SHAPIRO RICHARD S',\n", " 'SHARP VICTORIA T',\n", " 'SHELBY REX',\n", " 'SHERRICK JEFFREY B',\n", " 'SHERRIFF JOHN R',\n", " 'SKILLING JEFFREY K',\n", " 'STABLER FRANK',\n", " 'SULLIVAN-SHAKLOVITZ COLLEEN',\n", " 'SUNDE MARTIN',\n", " 'TAYLOR MITCHELL S',\n", " 'THE TRAVEL AGENCY IN THE PARK',\n", " 'THORN TERENCE H',\n", " 'TILNEY ELIZABETH A',\n", " 'TOTAL',\n", " 'UMANOFF ADAM S',\n", " 'URQUHART JOHN A',\n", " 'WAKEHAM JOHN',\n", " 'WALLS JR ROBERT H',\n", " 'WALTERS GARETH W',\n", " 'WASAFF GEORGE',\n", " 'WESTFAHL RICHARD K',\n", " 'WHALEY DAVID A',\n", " 'WHALLEY LAWRENCE G',\n", " 'WHITE JR THOMAS E',\n", " 'WINOKUR JR. 
HERBERT S',\n", " 'WODRASKA JOHN',\n", " 'WROBEL BRUCE',\n", " 'YEAGER F SCOTT',\n", " 'YEAP SOON']" ] } ], "prompt_number": 48 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['SKILLING JEFFREY K']['total_payments']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 49, "text": [ "8682716" ] } ], "prompt_number": 49 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['LAY KENNETH L']['total_payments']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 50, "text": [ "103559793" ] } ], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['FASTOW ANDREW S']['total_payments']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 51, "text": [ "2424083" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/17.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "enron_data['FASTOW ANDREW S']['deferral_payments']\n" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 57, "text": [ "'NaN'" ] } ], "prompt_number": 57 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/18.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# count the people with a known salary and a known email address\n", "count_salary = 0\n", "count_email = 0\n", "for key in enron_data.keys():\n", "    if enron_data[key]['salary'] != 'NaN':\n", "        count_salary+=1\n", "    if enron_data[key]['email_address'] != 'NaN':\n", "        count_email+=1\n", "print count_salary\n", "print count_email" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "95\n", "111\n" ] } ], "prompt_number": 59 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Dict to Array 
Conversion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Python dictionary can\u2019t be read directly into an sklearn classification or regression algorithm; instead, it needs a numpy array or a list of lists (each element of the list (itself a list) is a data point, and the elements of the smaller list are the features of that point).\n", "\n", "We\u2019ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.\n", "\n", "In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load ../tools/feature_format.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 60 }, { "cell_type": "code", "collapsed": false, "input": [ "#!/usr/bin/python\n", "\n", "\"\"\" \n", " A general tool for converting data from the\n", " dictionary format to an (n x k) python list that's \n", " ready for training an sklearn algorithm\n", "\n", " n--no. of key-value pairs in dictionary\n", " k--no. 
of features being extracted\n", "\n", " dictionary keys are names of persons in dataset\n", " dictionary values are dictionaries, where each\n", " key-value pair in the dict is the name\n", " of a feature, and its value for that person\n", "\n", " In addition to converting a dictionary to a numpy \n", " array, you may want to separate the labels from the\n", " features--this is what targetFeatureSplit is for\n", "\n", " so, if you want to have the poi label as the target,\n", " and the features you want to use are the person's\n", " salary and bonus, here's what you would do:\n", "\n", " feature_list = [\"poi\", \"salary\", \"bonus\"] \n", " data_array = featureFormat( data_dictionary, feature_list )\n", " label, features = targetFeatureSplit(data_array)\n", "\n", " the line above (targetFeatureSplit) assumes that the\n", " label is the _first_ item in feature_list--very important\n", " that poi is listed first!\n", "\"\"\"\n", "\n", "\n", "import numpy as np\n", "\n", "def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False ):\n", "    \"\"\" convert dictionary to numpy array of features\n", "        remove_NaN=True will convert \"NaN\" string to 0.0\n", "        remove_all_zeroes=True will omit any data points for which\n", "            all the features you seek are 0.0\n", "        remove_any_zeroes=True will omit any data points for which\n", "            any of the features you seek are 0.0\n", "    \"\"\"\n", "\n", "\n", "    return_list = []\n", "\n", "    for key in dictionary.keys():\n", "        tmp_list = []\n", "        for feature in features:\n", "            try:\n", "                value = dictionary[key][feature]\n", "            except KeyError:\n", "                print \"error: key \", feature, \" not present\"\n", "                return\n", "            if value==\"NaN\" and remove_NaN:\n", "                value = 0\n", "            tmp_list.append( float(value) )\n", "\n", "        ### keep the data point unless one of the filters below rejects it\n", "        ### (previously nothing was kept when remove_all_zeroes was False)\n", "        append = True\n", "\n", "        ### if all features are zero and you want to remove\n", "        ### data points that are all zero, do that here\n", "        if remove_all_zeroes:\n", "            append = False\n", "            for item in tmp_list:\n", "                if item != 0 and item != \"NaN\":\n", "                    append = True\n", "                    break\n", "\n", "        ### if any features for a given data point are zero\n", "        ### and you want to remove data points with any zeroes,\n", "        ### handle that here\n", "        if remove_any_zeroes:\n", "            if 0 in tmp_list or \"NaN\" in tmp_list:\n", "                append = False\n", "        if append:\n", "            return_list.append( np.array(tmp_list) )\n", "\n", "\n", "    return np.array(return_list)\n", "\n", "\n",
"def targetFeatureSplit( data ):\n", "    \"\"\" \n", "        given a numpy array like the one returned from\n", "        featureFormat, separate out the first feature\n", "        and put it into its own list (this should be the \n", "        quantity you want to predict)\n", "\n", "        return targets and features as separate lists\n", "\n", "        (sklearn can generally handle both lists and numpy arrays as \n", "        input formats when training/predicting)\n", "    \"\"\"\n", "\n", "    target = []\n", "    features = []\n", "    for item in data:\n", "        target.append( item[0] )\n", "        features.append( item[1:] )\n", "\n", "    return target, features\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Mission POI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That\u2019s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POI\u2019s (those absences propagated through to the final dataset). On the other hand, for many of these \u201cmissing\u201d POI\u2019s, we do have emails.\n", "\n", "While it would be straightforward to add these POI\u2019s and their email information to the E+F dataset, and just put \u201cNaN\u201d for their financial information, this could introduce a subtle problem. 
You will walk through that here.\n", "\n", "How many people in the E+F dataset (as it currently exists) have \u201cNaN\u201d for their total payments? What percentage of people in the dataset as a whole is this?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/19.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# count people whose total_payments is missing (\"NaN\")\n", "count_NaN_tp = 0\n", "for key in enron_data.keys():\n", "    if enron_data[key]['total_payments'] == 'NaN':\n", "        count_NaN_tp+=1\n", "print count_NaN_tp\n", "print float(count_NaN_tp)/len(enron_data.keys())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "21\n", "0.143835616438\n" ] } ], "prompt_number": 62 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/20.jpg)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# count POIs whose total_payments is missing (\"NaN\")\n", "count_NaN_tp = 0\n", "for key in enron_data.keys():\n", "    if enron_data[key]['total_payments'] == 'NaN' and enron_data[key]['poi'] == True:\n", "        count_NaN_tp+=1\n", "print count_NaN_tp\n", "print float(count_NaN_tp)/len(enron_data.keys())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0\n", "0.0\n" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/21.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes, correct. 
No training points would have \"NaN\" for total_payments when the class label is \"POI\"." ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(enron_data.keys())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 66, "text": [ "146" ] } ], "prompt_number": 66 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/22.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After adding 10 new POIs with \"NaN\" for total_payments, there are 156 people in the dataset, 31 of whom have \"NaN\" for total_payments. That is about 20% of the dataset.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "count = 0\n", "for user in enron_data:\n", "    if enron_data[user]['poi'] == True and enron_data[user]['total_payments'] == 'NaN':\n", "        count+=1\n", "print count" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0\n" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/23.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now there are 28 POIs, 10 of whom have \"NaN\" for total_payments.\n", "\n", "That's 36% of the POIs who have \"NaN\" for total_payments, a big jump from the 0% before." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpeg](../galleries/datasets-questions/24.jpg)" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Final Notes from Instructor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding in the new POIs in this example, none of whom we have financial information for, has introduced a subtle problem, that our lack of financial information about them can be picked up by an algorithm as a clue that they\u2019re POIs. 
Another way to think about this is that there\u2019s now a difference in how we generated the data for our two classes--non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do--suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn\u2019t on the spreadsheet. Then all their financial data would be \u201cNaN\u201d, but the person is very likely not a POI (there are many more non-POIs than POIs in the world, and even at Enron)--you\u2019d be likely to accidentally identify them as a POI, though!\n", "\n", "This goes to show that, when generating or augmenting a dataset, you should be exceptionally careful if your data are coming from different sources for different classes. It can easily lead to the type of bias or mistake that we showed here. There are ways to deal with this; for example, you wouldn\u2019t have to worry about this problem if you used only email data--in that case, discrepancies in the financial data wouldn\u2019t matter because financial features aren\u2019t being used. There are also more sophisticated ways of estimating how much of an effect these biases can have on your final answer; those are beyond the scope of this course.\n", "\n", "For now, the takeaway message is to be very careful about introducing features that come from different sources depending on the class! It\u2019s a classic way to accidentally introduce biases and mistakes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">Resources:\n", "\n", "> * http://en.wikipedia.org/wiki/Enron\n", "> * https://www.udacity.com/course/viewer#!/c-ud120/l-2291728537/m-2473678541" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }