2. Database exploration

2.1. Statistics

First step, I explore the database by loading it. Then, I realize some basic statistics. TO do that , I transform python dict to pandas dataframe.


### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
features_list = [myTools.CONST_POI_LABEL,myTools.CONST_SALARY_LABEL,myTools.CONST_BONUS_LABEL,myTools.CONST_TOTAL_PAYMENTS_LABEL,\
				myTools.CONST_TOTAL_STOCK_VALUE_LABEL,myTools.CONST_EXERCISED_STOCK_OPTIONS_LABEL,myTools.CONST_RESTRICTED_STOCK_LABEL]

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

#------------------------------------------------------------------
# DATA EXPLORATION
#------------------------------------------------------------------

EnronDataFrame = pd.DataFrame.from_dict(data_dict, orient='index')

# export ENRON database for documentation
EnronDataFrame.to_csv("ENRON_DATABASE.csv",sep=";")
EnronDataFrame.head(10).to_csv("ENRON_DATABASE_REDUCED.csv",sep=";")

print("Enron database statistic:")
print(EnronDataFrame.describe().transpose())

print(str("Number of data points: {}").format(EnronDataFrame[myTools.CONST_POI_LABEL].size))
print(str("Number of person of interest: {}").format(EnronDataFrame[myTools.CONST_POI_LABEL].sum()))
print(str("Number of non person of interest: {}").format(EnronDataFrame[myTools.CONST_POI_LABEL].size-EnronDataFrame[myTools.CONST_POI_LABEL].sum()))
print(str("number of features used: {}. details:({})").format(len(features_list),features_list))

FeatureNaNNumber=dict()
for feature in myTools.CONST_ALL_FEATURES:
	FeatureNaNNumber[feature]=(EnronDataFrame[feature]=="NaN").sum()
FeatureNaNNumberDataFrame = pd.DataFrame.from_dict(FeatureNaNNumber, orient='index',columns=['NaN number'])
print("Number of NaN by feature:",FeatureNaNNumberDataFrame.sort_values(by='NaN number',ascending=False))


ENRON database

salary

to_messages

deferral_payments

total_payments

loan_advances

bonus

email_address

restricted_stock_deferred

deferred_income

total_stock_value

expenses

from_poi_to_this_person

exercised_stock_options

from_messages

other

from_this_person_to_poi

poi

long_term_incentive

shared_receipt_with_poi

restricted_stock

director_fees

ALLEN PHILLIP K

201955

2902

2869717

4484442

NaN

4175000

phillip.allen@enron.com

-126027

-3081055

1729541

13868

47

1729541

2195

152

65

False

304805

1407

126027

NaN

BADUM JAMES P

NaN

NaN

178980

182466

NaN

NaN

NaN

NaN

NaN

257817

3486

NaN

257817

NaN

NaN

NaN

False

NaN

NaN

NaN

NaN

BANNANTINE JAMES M

477

566

NaN

916197

NaN

NaN

james.bannantine@enron.com

-560222

-5104

5243487

56301

39

4046157

29

864523

0

False

NaN

465

1757552

NaN

BAXTER JOHN C

267102

NaN

1295738

5634343

NaN

1200000

NaN

NaN

-1386055

10623258

11200

NaN

6680544

NaN

2660303

NaN

False

1586055

NaN

3942714

NaN

BAY FRANKLIN R

239671

NaN

260455

827696

NaN

400000

frank.bay@enron.com

-82782

-201641

63014

129142

NaN

NaN

NaN

69

NaN

False

NaN

NaN

145796

NaN

BAZELIDES PHILIP J

80818

NaN

684694

860136

NaN

NaN

NaN

NaN

NaN

1599641

NaN

NaN

1599641

NaN

874

NaN

False

93750

NaN

NaN

NaN

BECK SALLY W

231330

7315

NaN

969068

NaN

700000

sally.beck@enron.com

NaN

NaN

126027

37172

144

NaN

4343

566

386

False

NaN

2639

126027

NaN

BELDEN TIMOTHY N

213999

7991

2144013

5501630

NaN

5249999

tim.belden@enron.com

NaN

-2334434

1110705

17355

228

953136

484

210698

108

True

NaN

5521

157569

NaN

BELFER ROBERT

NaN

NaN

-102500

102500

NaN

NaN

NaN

44093

NaN

-44093

NaN

NaN

3285

NaN

NaN

NaN

False

NaN

NaN

NaN

3285

BERBERIAN DAVID

216582

NaN

NaN

228474

NaN

NaN

david.berberian@enron.com

NaN

NaN

2493616

11892

NaN

1624396

NaN

NaN

NaN

False

NaN

NaN

869220

NaN

  • Number of data points: 146

  • Number of person of interest: 18

  • Number of non person of interest: 128

  • number of features used: 6. details:([‘poi’, ‘salary’, ‘bonus’, ‘total_payments’, ‘total_stock_value’, ‘exercised_stock_options’,’restricted_stock’])

Missing value repartition

Missing value repartition

features

NaN number

loan_advances

142

director_fees

129

restricted_stock_deferred

128

deferral_payments

107

deferred_income

97

long_term_incentive

80

bonus

64

shared_receipt_with_poi

60

from_this_person_to_poi

60

from_messages

60

from_poi_to_this_person

60

to_messages

60

other

53

salary

51

expenses

51

exercised_stock_options

44

restricted_stock

36

email_address

35

total_payments

21

total_stock_value

20

poi

0

Conclusion, it will be better to select features with low NaN rate (eg total_payments). The selected features are confirm:

  • salary

  • bonus

  • total_payments

  • total_stock_value

  • exercised_stock_options

  • restricted_stock

2.2. Outliers

In this step, I want to detect outliers and think about the associated added value.


#------------------------------------------------------------------
# OUTLIER INVERTIGATION
#------------------------------------------------------------------

# Convert all string type of selected feature to numeric. In case of error, the value is set to NaN
EnronDataFrame=EnronDataFrame[features_list]
for feature in features_list:
	if myTools.isNumeric(feature):
		EnronDataFrame[feature]=EnronDataFrame[feature].apply(pd.to_numeric,errors='coerce')

# Raplace NaN value to 0.0 on all data frame not only on selected features
EnronDataFrame=EnronDataFrame.fillna(0.0)

myTools.pyplot_scatter(data=EnronDataFrame,label_x=myTools.CONST_SALARY_LABEL,label_y=myTools.CONST_BONUS_LABEL,fileName="withOutlier.png",show=True,title="ORIGINAL DATA REPARTITION")

# Display 5 first employe for each feature. Objective: identify name of plotted data
print("Outlier:")
for feature in features_list:
	print(str("Sorted by {}:{}").format(feature,EnronDataFrame.sort_values(by=feature,ascending=False).head(5)))
	print()

# remove OUTLIER TOTAL from data frame
EnronDataFrame=EnronDataFrame.drop(["TOTAL"])

myTools.pyplot_scatter(data=EnronDataFrame,label_x=myTools.CONST_SALARY_LABEL,label_y=myTools.CONST_BONUS_LABEL,fileName="withoutOutlier.png",show=True,title="DATA REPARTITION WITHOUT OUTLIER")

../_images/withOutlier.png

I clearly identify inside this picture an outlier. To identify it, I sort database by all feature:

Database sorted by salary

Database sorted by salary

name

poi

salary

bonus

total_payments

total_stock_value

exercised_stock_options

TOTAL

False

26704229.0

97343619.0

309886585.0

434509511.0

311764000.0

SKILLING JEFFREY K

True

1111258.0

5600000.0

8682716.0

26093672.0

19250000.0

LAY KENNETH L

True

1072321.0

7000000.0

103559793.0

49110078.0

34348384.0

FREVERT MARK A

False

1060932.0

2000000.0

17252530.0

14622185.0

10433518.0

PICKERING MARK R

False

655037.0

300000.0

1386690.0

28798.0

28798.0

Database sorted by bonus

Database sorted by bonus

name:

poi

salary

bonus

total_payments

total_stock_value

exercised_stock_options

TOTAL

False

26704229.0

97343619.0

309886585.0

434509511.0

311764000.0

LAVORATO JOHN J

False

339288.0

8000000.0

10425757.0

5167144.0

4158995.0

LAY KENNETH L

True

1072321.0

7000000.0

103559793.0

49110078.0

34348384.0

SKILLING JEFFREY K

True

1111258.0

5600000.0

8682716.0

26093672.0

19250000.0

BELDEN TIMOTHY N

True

213999.0

5249999.0

5501630.0

1110705.0

953136.0

The TOTAL field is an error of loaded database. It doesn’t represent an Enron employe. Wed can remove it.

../_images/withoutOutlier.png

Based on this graph, we can consider two other outliers :

  • FREVERT MARK A: with high salary but low bonus

  • LAVORATO JOHN J : with high bonus and low salary

Neverthelless, this data correspond to real persons. This incoherence raise probably an indicator for POI. I keep this data.

Warning

I have two possibility for person with NaN value on features. Drop it or replace NaN to 0.0. If i drop it, the number of available data for algorithm is reduced and so the score is not suffisant. I choose to replace NaN value to 0.0 even if it is not the same information.