Kaggle : New York City Taxi Trip Duration

25 Apr 2018 on Project

1 EDA (Exploratory Data Analysis)

purpose of EDA

Suggest hypotheses about the causes of observed phenomena
Assess assumptions on which statistical inference will be based
Support the selection of appropriate statistical tools and techniques
Provide a basis for further data collection through surveys or experiments

EDA methods

Graphical techniques used in EDA are:
- boxplot
  - detailed features (datetime by month, day of week, hour)
- histogram or barplot (distribution); each bin covers a range of values
  - original features (pickup lat/long, dropoff lat/long, duration, passenger count, flag)
  - detailed features (datetime by month, day of week, hour)
- scatter plot
  - duration vs distance, to check for odd data
- parallel coordinates vs colormaps vs Andrews curves charts
- odds ratio?
Quantitative methods:
- trimean (Tukey's trimean)
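
The trimean mentioned above is a weighted average of the three quartiles, (Q1 + 2 * median + Q3) / 4, which makes it less sensitive to extreme values than the mean. A minimal sketch, assuming NumPy's default percentile interpolation; the function name trimean is hypothetical, and the five durations are the trip_duration values of the first five rows of train.csv shown in this post.

```python
import numpy as np

def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q1 + 2 * q2 + q3) / 4

# trip_duration (seconds) of the first five training rows
durations = [429, 435, 455, 663, 2124]

print("mean   :", np.mean(durations))
print("trimean:", trimean(durations))
```

The single long trip (2124 seconds) pulls the mean well above the trimean, which is the kind of robustness that matters for a skewed target like trip duration.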

1.1 Understanding data

from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from math import sin, cos, sqrt, atan2, radians
import seaborn as sns
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.cluster import MeanShift
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy.stats import norm, skew
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

/home/jk/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

train = pd.read_csv("train.csv")
train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	2124
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	429
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	435

test = pd.read_csv("test.csv")
test.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id3004672	1	2016-06-30 23:59:58	1	-73.988129	40.732029	-73.990173	40.756680	N
1	id3505355	1	2016-06-30 23:59:53	1	-73.964203	40.679993	-73.959808	40.655403	N
2	id1217141	1	2016-06-30 23:59:47	1	-73.997437	40.737583	-73.986160	40.729523	N
3	id2150126	2	2016-06-30 23:59:41	1	-73.956070	40.771900	-73.986427	40.730469	N
4	id1598245	1	2016-06-30 23:59:33	1	-73.970215	40.761475	-73.961510	40.755890	N

sample_submission = pd.read_csv("sample_submission.csv")
sample_submission.head()

	id	trip_duration
0	id3004672	959
1	id3505355	959
2	id1217141	959
3	id2150126	959
4	id1598245	959

1.1.a Data type and unit

unit

1. latitude / longitude = decimal degrees

111.32 mm per 0.000001° / 11.132 m per 0.0001° / 1.1132 km per 0.01° / 111.32 km per 1.0°
(these figures hold for latitude; a degree of longitude is shorter away from the equator)
14 decimal degree
ex) 40.767937 , -73.982155

2. datetime = year-month-day hour:minute:second

3. vendor_id = 1, 2

4. passenger_count = 0 - 9

5. store_and_fwd_flag = N, Y

6. duration = seconds

ex) 455 sec = 7 min 35 sec
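
The 111.32 km per degree figure in the unit list applies to latitude. For longitude the length of one degree shrinks with the cosine of the latitude. A rough sketch of that conversion, assuming a spherical Earth; the helper name km_per_deg_lon and the representative latitude 40.75 are assumptions made for illustration.

```python
import math

KM_PER_DEG_LAT = 111.32  # figure quoted in the unit list

def km_per_deg_lon(lat_deg):
    """Approximate length in km of one degree of longitude at a given latitude."""
    return KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))

nyc_lat = 40.75  # assumed representative latitude for New York City
print("km per degree of longitude near NYC:", round(km_per_deg_lon(nyc_lat), 2))

# ground distance covered by the n-th decimal place of a latitude value
for places in (0, 2, 4, 6):
    metres = KM_PER_DEG_LAT * 1000 * 10 ** -places
    print(places, "decimal places ->", round(metres, 5), "m")
```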

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null object
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null object
trip_duration         1458644 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625134 entries, 0 to 625133
Data columns (total 9 columns):
id                    625134 non-null object
vendor_id             625134 non-null int64
pickup_datetime       625134 non-null object
passenger_count       625134 non-null int64
pickup_longitude      625134 non-null float64
pickup_latitude       625134 non-null float64
dropoff_longitude     625134 non-null float64
dropoff_latitude      625134 non-null float64
store_and_fwd_flag    625134 non-null object
dtypes: float64(4), int64(2), object(3)
memory usage: 42.9+ MB

train data

about 1.46M rows (1,458,644), 11 columns

test data

about 0.63M rows (625,134), 9 columns (no dropoff_datetime or trip_duration)

1.1.b Missing Data check

# no missing values
train.isnull().sum()

id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64

test.isnull().sum()

id                    0
vendor_id             0
pickup_datetime       0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
dtype: int64

1.1.c Trip duration

trip duration calculation validation

train["pickup_datetime"] =  pd.to_datetime(train["pickup_datetime"])
train["dropoff_datetime"] =  pd.to_datetime(train["dropoff_datetime"])
sample_duration = train["dropoff_datetime"] - train["pickup_datetime"]
sample_duration_sec = sample_duration.dt.total_seconds().astype('int')
train['trip_sec'] =  sample_duration_sec

train_d = train[train["trip_duration"] != train["trip_sec"]]
print(len(train_d))

if len(train_d) == 0:
    train = train.drop(['trip_sec'], axis=1)

drop odd data

# drop trips whose duration is 100,000 seconds (about 27.8 hours) or longer
print(len(train.loc[train.trip_duration > 100000]))

print("before drop : ", len(train))
train = train[train["trip_duration"] < 100000]
print("after  drop : ", len(train))

before drop :  1458644
after  drop :  1458640

trip duration visualization

# log-transformed trip duration: distribution (left) and normal probability plot (right)
plt.figure(figsize=(16,8))
plt.subplot(121)
sns.distplot(np.log(train["trip_duration"]))
plt.subplot(122)
stats.probplot(np.log(train["trip_duration"]), plot=plt)
plt.tight_layout()
plt.show()

1.2 Feature Engineering & Data Cleaning

date time convert

train = train.drop("dropoff_datetime", axis=1)

# convert pickup_datetime from object to datetime
train["pickup_datetime"] =  pd.to_datetime(train["pickup_datetime"])
test["pickup_datetime"] =  pd.to_datetime(test["pickup_datetime"])

#day of week
#Monday=0, Sunday=6
train["pick_dayofweek"] = train["pickup_datetime"].dt.dayofweek
test["pick_dayofweek"] = test["pickup_datetime"].dt.dayofweek

# month, day of month, and hour of the pickup time
train["pick_month"] = train["pickup_datetime"].dt.month
train["pick_day"] = train["pickup_datetime"].dt.day
train["pick_hour"] = train["pickup_datetime"].dt.hour
# train["pick_min"] = train["pickup_datetime"].dt.minute
# train["pick_sec"] = train["pickup_datetime"].dt.second

test["pick_month"] = test["pickup_datetime"].dt.month
test["pick_day"] = test["pickup_datetime"].dt.day
test["pick_hour"] = test["pickup_datetime"].dt.hour
# test["pick_min"] = test["pickup_datetime"].dt.minute
# test["pick_sec"] = test["pickup_datetime"].dt.second

train.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	pick_dayofweek	pick_month	pick_day	pick_hour
0	id2875421	2	2016-03-14 17:24:55	1	-73.982155	40.767937	-73.964630	40.765602	N	455	0	3	14	17
1	id2377394	1	2016-06-12 00:43:35	1	-73.980415	40.738564	-73.999481	40.731152	N	663	6	6	12	0
2	id3858529	2	2016-01-19 11:35:24	1	-73.979027	40.763939	-74.005333	40.710087	N	2124	1	1	19	11
3	id3504673	2	2016-04-06 19:32:31	1	-74.010040	40.719971	-74.012268	40.706718	N	429	2	4	6	19
4	id2181028	2	2016-03-26 13:30:55	1	-73.973053	40.793209	-73.972923	40.782520	N	435	5	3	26	13

national holiday

from bs4 import BeautifulSoup
import requests

#wrapper > div:nth-child(3) > div.twelve.columns > table.list-table > tbody > tr:nth-child(2) > td:nth-child(2)

# scrape the 2016 US holiday table; the date is every fifth <td>, starting at the second
response = requests.get("https://www.officeholidays.com/countries/usa/2016.php")
bs = BeautifulSoup(response.content, "html.parser")
tds = bs.select("table td")
date_cells = tds[1::5]

# prefix the year and join the first two tokens of each date cell with "-"
li = []
for cell in date_cells[0:14]:
    tokens = cell.text.strip().split(" ")
    li.append("2016-" + tokens[0] + "-" + tokens[1])

holi = pd.DataFrame()
holi['date'] = li
holi['date'] = pd.to_datetime(holi['date'])
holi

	date
0	2016-01-01
1	2016-01-18
2	2016-02-15
3	2016-04-15
4	2016-05-08
5	2016-05-30
6	2016-06-19
7	2016-07-04
8	2016-09-05
9	2016-10-10
10	2016-11-11
11	2016-11-24
12	2016-11-25
13	2016-12-26

select_date = list(holi["date"].astype("str"))
holiday = train.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
train["holiday"] = holiday

select_date = list(holi["date"].astype("str"))
holiday = test.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
test["holiday"] = holiday

train['holiday'] = 1 * (train.holiday == True)
test['holiday'] = 1 * (test.holiday == True)
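
The holiday flag can also be built without string conversion or a live scrape. The sketch below hard-codes the 14 dates printed in the holi table above and compares them with the pickup date truncated to midnight. The three sample pickups are illustrative: the first is the first training row, the other two are made-up timestamps that fall on listed holidays.

```python
import pandas as pd

# dates copied from the holi table in this post
holiday_dates = pd.to_datetime([
    "2016-01-01", "2016-01-18", "2016-02-15", "2016-04-15", "2016-05-08",
    "2016-05-30", "2016-06-19", "2016-07-04", "2016-09-05", "2016-10-10",
    "2016-11-11", "2016-11-24", "2016-11-25", "2016-12-26",
])

# illustrative pickups: one ordinary day, two holidays
pickups = pd.Series(pd.to_datetime([
    "2016-03-14 17:24:55",
    "2016-01-18 09:00:00",
    "2016-05-30 23:59:59",
]))

# normalize() drops the time of day so the comparison is date against date
is_holiday = pickups.dt.normalize().isin(holiday_dates).astype(int)
print(is_holiday.tolist())
```

The same expression applied to train.pickup_datetime and test.pickup_datetime would yield a 0/1 column directly.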

New York City Weather Event

from selenium import webdriver

driver =  webdriver.PhantomJS()    
driver.get("https://www.weather.gov/okx/stormevents")

date = driver.find_elements_by_css_selector('#pagebody > div:nth-child(3) > div > table > tbody > tr > td ul:nth-child(6)')
lis = date[0].find_elements_by_css_selector('li')

li_wea = []
count = 0

rows=[]
for i in lis:
    li_wea.append(i.text)
    li_wea[count] = '2016 ' + li_wea[count]
    li_wea[count] = li_wea[count].split(" ")

    li_wea[count] = li_wea[count][0] + "-" + li_wea[count][1] + "-" + li_wea[count][2] + "-" + li_wea[count][3]
#     rows.append([li_wea[count][0], li_wea[count][1], li_wea[count][2] ,li_wea[count][3]])
    count += 1


#rows

new1 = pd.DataFrame(li_wea, columns=['old'])
new1['date'] = new1['old'].str.extract(r'(\d\d\d\d-...-\d\d)', expand=True)

# manual corrections for individual rows (.loc avoids chained assignment)
new1.loc[4, 'date'] = '2016-Feb-05'
new1.loc[5, 'date'] = '2016-Feb-08'
new1.loc[11, 'date'] = '2016-Apr-03'
new1.loc[12, 'date'] = '2016-Apr-04'
new1.loc[14, 'date'] = '2016-June-28'
new1.loc[15, 'date'] = '2016-July-18'
new1.loc[16, 'date'] = '2016-July-29'
new1.loc[17, 'date'] = '2016-July-31'
new1.loc[25, 'date'] = '2016-Oct-08'
new1.loc[35, 'date'] = '2016-Dec-05'
new1 = new1.drop('old', axis=1)

# two extra dates added by hand; pd.concat replaces the older DataFrame.append
new2 = pd.DataFrame(['2016-August-01', '2016-Dec-01'], columns=['date'])
new1 = pd.concat([new1, new2], ignore_index=True).dropna()
new1['date'] = pd.to_datetime(new1['date'])
new1

	date
0	2016-01-10
1	2016-01-13
2	2016-01-17
3	2016-01-23
4	2016-02-05
5	2016-02-08
6	2016-02-15
7	2016-02-24
8	2016-03-14
9	2016-03-21
10	2016-03-28
11	2016-04-03
12	2016-04-04
13	2016-05-30
14	2016-06-28
15	2016-07-18
16	2016-07-29
17	2016-07-31
18	2016-08-10
19	2016-08-11
20	2016-08-12
21	2016-08-13
22	2016-08-20
24	2016-09-19
25	2016-10-08
26	2016-10-22
27	2016-10-22
28	2016-10-27
29	2016-10-30
30	2016-11-11
31	2016-11-14
32	2016-11-20
33	2016-11-29
34	2016-11-30
35	2016-12-05
36	2016-12-15
37	2016-12-17
38	2016-12-18
39	2016-08-01
40	2016-12-01

select_date = list(new1["date"].astype("str"))
weather = train.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
train["weather"] = weather

select_date = list(new1["date"].astype("str"))
weather = test.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
test["weather"] = weather

train['weather'] = 1 * (train.weather == True)
test['weather'] = 1 * (test.weather == True)

driver.close()

1.2.b Distance between pickup and dropoff location

Haversine (great-circle) distance, stored as uclidean

def uclidean(lat1, lng1, lat2, lng2):
    # despite its name, this returns the haversine (great-circle) distance in km
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    R = 6371.0  # mean Earth radius in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * R * np.arcsin(np.sqrt(d))
    return h

train['uclidean'] = uclidean(train.pickup_latitude, train.pickup_longitude, train.dropoff_latitude, train.dropoff_longitude)
test['uclidean'] = uclidean(test.pickup_latitude, test.pickup_longitude, test.dropoff_latitude, test.dropoff_longitude)

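Manhattan distance

# sum of the absolute latitude and longitude differences, scaled by a rough 113.2 km per degree
# (one degree of latitude is about 111.32 km; one degree of longitude is shorter at New York's latitude)
train['manhatan'] = (abs(train.dropoff_longitude - train.pickup_longitude) + abs(train.dropoff_latitude - train.pickup_latitude)) * 113.2
test['manhatan'] = (abs(test.dropoff_longitude - test.pickup_longitude) + abs(test.dropoff_latitude - test.pickup_latitude)) * 113.2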
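
Because a single km-per-degree constant overstates east-west distances at New York's latitude, one alternative is to add a north-south haversine leg and an east-west haversine leg. This is a sketch of that alternative, not the feature used for the scores in this post; haversine_km and manhattan_km are hypothetical names, and the coordinates are those of the first training row.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    d = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(d))

def manhattan_km(lat1, lng1, lat2, lng2):
    """North-south leg plus east-west leg, each measured along the sphere."""
    north_south = haversine_km(lat1, lng1, lat2, lng1)
    east_west = haversine_km(lat1, lng1, lat1, lng2)
    return north_south + east_west

# first row of train.head(): (latitude, longitude)
pickup = (40.767937, -73.982155)
dropoff = (40.765602, -73.964630)

print("haversine:", haversine_km(*pickup, *dropoff))
print("manhattan:", manhattan_km(*pickup, *dropoff))
```

For this row the two-leg distance comes out smaller than the 2.248 in the manhatan column, since the trip is mostly east-west and the longitude difference is no longer scaled as if it were latitude.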

direction

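def direction(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    # initial bearing from pickup to dropoff in degrees (-180 to 180, 0 = north, 90 = east)
    pickup_lat_rads = np.radians(pickup_lat)
    pickup_long_rads = np.radians(pickup_long)
    dropoff_lat_rads = np.radians(dropoff_lat)
    dropoff_long_rads = np.radians(dropoff_long)
    # both longitudes are already in radians, so the difference must not be converted again
    # (the notebook run wrapped this difference in a second np.radians call;
    #  the direction values printed in this post come from that version)
    long_delta_rads = dropoff_long_rads - pickup_long_rads

    y = np.sin(long_delta_rads) * np.cos(dropoff_lat_rads)
    x = (np.cos(pickup_lat_rads) * np.sin(dropoff_lat_rads) - np.sin(pickup_lat_rads) * np.cos(dropoff_lat_rads) * np.cos(long_delta_rads))

    return np.degrees(np.arctan2(y, x))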

train['direction'] = direction(train.pickup_latitude, train.pickup_longitude, train.dropoff_latitude, train.dropoff_longitude)
test['direction'] = direction(test.pickup_latitude, test.pickup_longitude, test.dropoff_latitude, test.dropoff_longitude)
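
A quick way to sanity-check any bearing function is to feed it trips with a known direction: due north should give 0 degrees and due east about 90. The sketch below uses the standard initial-bearing formula with the longitude difference converted to radians exactly once. The name bearing_deg and the two synthetic trips are assumptions for illustration; the third trip is the first training row.

```python
import math

def bearing_deg(lat1, lng1, lat2, lng2):
    """Initial bearing from point 1 to point 2 in degrees (-180 to 180, 0 = north)."""
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    dlng = lng2 - lng1
    y = math.sin(dlng) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlng)
    return math.degrees(math.atan2(y, x))

north = bearing_deg(40.75, -73.98, 40.76, -73.98)   # synthetic trip, due north
east = bearing_deg(40.75, -73.98, 40.75, -73.97)    # synthetic trip, due east
first_row = bearing_deg(40.767937, -73.982155, 40.765602, -73.964630)

print(north, east, first_row)
```

The first training row runs east and slightly south, so its bearing should land close to 100 degrees.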

1.2.d.2 Spatial Data Analysis

Types of spatial analysis

FA (factor analysis)
- Euclidean metric => PCA (principal component analysis)
- Chi-square distance => correspondence analysis (similar to PCA, but better suited to categorical data)
- Generalized Mahalanobis distance => discriminant analysis

stack-up coordinates data

coord_pick_lat = pd.concat([train['pickup_latitude'], test['pickup_latitude']], axis=0)
coord_pick_lon = pd.concat([train['pickup_longitude'], test['pickup_longitude']], axis=0)
coord_drop_lat = pd.concat([train['dropoff_latitude'], test['dropoff_latitude']], axis=0)
coord_drop_lon = pd.concat([train['dropoff_longitude'], test['dropoff_longitude']], axis=0)

coord_pick = pd.concat([coord_pick_lat, coord_pick_lon], axis=1)
coord_drop = pd.concat([coord_drop_lat, coord_drop_lon], axis=1)

coord_lat = pd.concat([train['pickup_latitude'], train['dropoff_latitude'], test['pickup_latitude'], test['dropoff_latitude']], axis=0)
coord_lon = pd.concat([train['pickup_longitude'], train['dropoff_longitude'], test['pickup_longitude'], test['dropoff_longitude']], axis=0)
coord_all = pd.concat([coord_lat, coord_lon], axis=1)
coord_all.columns = ['lat', 'lon']

coordinates scatter plot

# New York City center is roughly (40.7128, -74.0060)
city_lon_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

sns.lmplot(x='pickup_latitude', y='pickup_longitude', data=coord_pick, fit_reg=False, scatter_kws={"s": 1}, size=10)
plt.ylim(city_lon_border)
plt.xlim(city_lat_border)
plt.title('Pick up')
plt.show()

sns.lmplot(x='dropoff_latitude', y='dropoff_longitude', data=coord_drop, fit_reg=False, scatter_kws={"s": 1}, size=10)
plt.ylim(city_lon_border)
plt.xlim(city_lat_border)
plt.title('Drop off')
plt.show()

PCA

# .values passes plain arrays, so the differing column names between fit and transform do not matter
pca = PCA(random_state=0).fit(coord_all.values)

# project pickup and dropoff coordinates onto the two principal components
for frame in (train, test):
    pick = pca.transform(frame[['pickup_latitude', 'pickup_longitude']].values)
    drop = pca.transform(frame[['dropoff_latitude', 'dropoff_longitude']].values)
    frame['pick_pca0'] = pick[:, 0]
    frame['pick_pca1'] = pick[:, 1]
    frame['drop_pca0'] = drop[:, 0]
    frame['drop_pca1'] = drop[:, 1]

train.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	...	pick_hour	weather	uclidean	manhatan	direction	pick_pca0	pick_pca1	drop_pca0	drop_pca1
0	id2875421	2	2016-03-14 17:24:55	1	-73.982155	40.767937	-73.964630	40.765602	N	455	...	17	1	1.498521	2.248074	174.333195	0.007691	0.017053	-0.009667	0.013695
1	id2377394	1	2016-06-12 00:43:35	1	-73.980415	40.738564	-73.999481	40.731152	N	663	...	0	0	1.805507	2.997289	-178.051506	0.007677	-0.012371	0.027145	-0.018652
2	id3858529	2	2016-01-19 11:35:24	1	-73.979027	40.763939	-74.005333	40.710087	N	2124	...	11	0	6.385098	9.073912	-179.629721	0.004803	0.012879	0.034222	-0.039337
3	id3504673	2	2016-04-06 19:32:31	1	-74.010040	40.719971	-74.012268	40.706718	N	429	...	19	0	1.485498	1.752341	-179.872566	0.038342	-0.029194	0.041343	-0.042293
4	id2181028	2	2016-03-26 13:30:55	1	-73.973053	40.793209	-73.972923	40.782520	N	435	...	13	0	1.188588	1.224652	179.990812	-0.002877	0.041748	-0.002380	0.031070

5 rows × 23 columns

1.2.d.3 Coordinates Clustering

Gaussian Mixture

from sklearn.mixture import GaussianMixture

gaus_pick = GaussianMixture(n_components=20).fit(coord_pick)
gaus_drop = GaussianMixture(n_components=20).fit(coord_drop)

train['gaus_pick'] = gaus_pick.predict(train[['pickup_latitude', 'pickup_longitude']])
test['gaus_pick'] = gaus_pick.predict(test[['pickup_latitude', 'pickup_longitude']])
train['gaus_drop'] = gaus_drop.predict(train[['dropoff_latitude', 'dropoff_longitude']])
test['gaus_drop'] = gaus_drop.predict(test[['dropoff_latitude', 'dropoff_longitude']])

1.2.d.4 Time data manipulating

office hour

# split the hour range 0-23 into five equal-width bins and label them
labels = ["dawn", "morning", "afternoon", "evening", "night"]
train['office'] = pd.cut(train['pick_hour'], 5, labels=labels)
test['office'] = pd.cut(test['pick_hour'], 5, labels=labels)
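
With an integer bin count, pd.cut divides the observed range into equal-width intervals, so the five labels do not cover the same number of hours. A small sketch that makes the hour-to-label mapping explicit, assuming the pickup hours span the full 0 to 23 range:

```python
import pandas as pd

labels = ["dawn", "morning", "afternoon", "evening", "night"]
hours = pd.Series(range(24), name="pick_hour")

# same call as above, applied to every possible hour
buckets = pd.cut(hours, 5, labels=labels)

for label in labels:
    in_bucket = hours[buckets == label]
    print(label, in_bucket.min(), "-", in_bucket.max())
```

The interval edges fall at multiples of 4.6, so "afternoon" covers only hours 10 to 13 while the other labels cover five hours each. Passing explicit bin edges to pd.cut would be the way to get hand-picked boundaries.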

train.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	...	uclidean	manhatan	direction	pick_pca0	pick_pca1	drop_pca0	drop_pca1	gaus_pick	gaus_drop	office
0	id2875421	2	2016-03-14 17:24:55	1	-73.982155	40.767937	-73.964630	40.765602	N	455	...	1.498521	2.248074	174.333195	0.007691	0.017053	-0.009667	0.013695	15	1	evening
1	id2377394	1	2016-06-12 00:43:35	1	-73.980415	40.738564	-73.999481	40.731152	N	663	...	1.805507	2.997289	-178.051506	0.007677	-0.012371	0.027145	-0.018652	8	13	dawn
2	id3858529	2	2016-01-19 11:35:24	1	-73.979027	40.763939	-74.005333	40.710087	N	2124	...	6.385098	9.073912	-179.629721	0.004803	0.012879	0.034222	-0.039337	3	0	afternoon
3	id3504673	2	2016-04-06 19:32:31	1	-74.010040	40.719971	-74.012268	40.706718	N	429	...	1.485498	1.752341	-179.872566	0.038342	-0.029194	0.041343	-0.042293	4	0	night
4	id2181028	2	2016-03-26 13:30:55	1	-73.973053	40.793209	-73.972923	40.782520	N	435	...	1.188588	1.224652	179.990812	-0.002877	0.041748	-0.002380	0.031070	15	15	afternoon

5 rows × 26 columns

weekend

train['weekend'] = 1 * ((train["pick_dayofweek"] == 5) | (train["pick_dayofweek"] == 6))
test['weekend'] = 1 * ((test["pick_dayofweek"] == 5) | (test["pick_dayofweek"] == 6))

train.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	...	manhatan	direction	pick_pca0	pick_pca1	drop_pca0	drop_pca1	gaus_pick	gaus_drop	office	weekend
0	id2875421	2	2016-03-14 17:24:55	1	-73.982155	40.767937	-73.964630	40.765602	N	455	...	2.248074	174.333195	0.007691	0.017053	-0.009667	0.013695	15	1	evening	0
1	id2377394	1	2016-06-12 00:43:35	1	-73.980415	40.738564	-73.999481	40.731152	N	663	...	2.997289	-178.051506	0.007677	-0.012371	0.027145	-0.018652	8	13	dawn	1
2	id3858529	2	2016-01-19 11:35:24	1	-73.979027	40.763939	-74.005333	40.710087	N	2124	...	9.073912	-179.629721	0.004803	0.012879	0.034222	-0.039337	3	0	afternoon	0
3	id3504673	2	2016-04-06 19:32:31	1	-74.010040	40.719971	-74.012268	40.706718	N	429	...	1.752341	-179.872566	0.038342	-0.029194	0.041343	-0.042293	4	0	night	0
4	id2181028	2	2016-03-26 13:30:55	1	-73.973053	40.793209	-73.972923	40.782520	N	435	...	1.224652	179.990812	-0.002877	0.041748	-0.002380	0.031070	15	15	afternoon	1

5 rows × 27 columns

3. Modeling

evaluation metric

Root Mean Squared Logarithmic Error

$\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$

Where:

ϵ is the RMSLE value (score),
n is the total number of observations in the (public/private) data set,
p_i is your prediction of trip duration,
a_i is the actual trip duration for observation i, and
log(x) is the natural logarithm of x.
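
The metric translates directly into a few lines of NumPy. A minimal sketch for local scoring, where the function name rmsle is hypothetical, the actual values are the trip_duration entries of the first five training rows, and 959 is the constant found in sample_submission.csv:

```python
import numpy as np

def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(x + 1) as in the formula."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

actual = np.array([455, 663, 2124, 429, 435])  # first five training rows
constant = np.full(5, 959)                     # constant prediction from the sample submission

print(rmsle(actual, actual))     # a perfect prediction scores 0.0
print(rmsle(constant, actual))
```

Training on np.log1p(trip_duration) and inverting with np.expm1 would match the +1 in the metric exactly. The modeling code in this post uses np.log and np.exp instead, which differs only slightly for durations of hundreds of seconds.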

data type manipulation

encode categorical data

train['store_and_fwd_flag'] = 1 * (train.store_and_fwd_flag.values == 'Y')
test['store_and_fwd_flag'] = 1 * (test.store_and_fwd_flag.values == 'Y')

input data shape check

train = train.drop('id', axis=1)
test = test.drop('id', axis=1)
train = train.drop(['pickup_datetime'], axis=1)
test = test.drop(['pickup_datetime'], axis=1)
print(train.shape, test.shape)

(1458640, 25) (625134, 24)

train = pd.get_dummies(train)
test = pd.get_dummies(test)

X_train = train.drop(['trip_duration'], axis=1)
y_train = train['trip_duration']
y_log = np.log(y_train)
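
pd.get_dummies is applied to train and test separately, so the two frames only line up if every category appears in both. Here that holds because office carries the same five labels in each frame, but a defensive alignment step is cheap. The sketch below uses two tiny hypothetical frames (train_part, test_part) to show the idea:

```python
import pandas as pd

# tiny illustrative frames; "office" has different categories in each
train_part = pd.DataFrame({
    "pick_hour": [0, 11],
    "office": ["dawn", "afternoon"],
    "trip_duration": [663, 2124],
})
test_part = pd.DataFrame({"pick_hour": [19], "office": ["night"]})

X_train = pd.get_dummies(train_part.drop("trip_duration", axis=1))
X_test = pd.get_dummies(test_part)

# force test to have exactly the training columns, in the same order;
# dummy columns missing from test are filled with 0, extra ones are dropped
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test.columns))
```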

lightgbm

model_log = lgb.LGBMRegressor(n_estimators=12500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1).fit(X_train, y_log)

y_pred = model_log.predict(test)

y_exp = np.exp(y_pred)

sub = pd.DataFrame(columns= ['id', 'trip_duration'])
sub['id'] = sample_submission["id"]
sub['trip_duration'] = y_exp
sub.to_csv('sub_lgb_exp1.csv',index=False)

!kaggle competitions submit -c nyc-taxi-trip-duration -f sub_lgb_exp1.csv -m "Message"

#n_estimators=500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.39475, 0.39786
#n_estimators=1000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.39072, 0.39368
#n_estimators=1500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38848, 0.39147
#n_estimators=2000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38670, 0.38967
#n_estimators=3000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38499, 0.38761
#n_estimators=4000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38368, 0.38634
#n_estimators=5000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38295, 0.38553
#n_estimators=10000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38206, 0.38433


#top score #top 16%
#n_estimators=15000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38180, 0.38384


#n_estimators=20000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38195, 0.38401

Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/jk/.kaggle/kaggle.json'
Successfully submitted to New York City Taxi Trip Duration

lgb.plot_importance(model_log)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4bc3ef5cf8>

Cross Validation

# from sklearn.model_selection import cross_val_score

# cross_lgb = cross_val_score(model_log, X_train, y_log, cv=2, n_jobs=-1)
# cross_lgb
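
Since the target is already on a log scale, the RMSE of a cross-validated model on that target approximates the competition's RMSLE, which gives a local score without spending a submission. The sketch below shows the mechanics on synthetic data with LinearRegression as a stand-in. The data, the feature, and the model are all assumptions for illustration, and the scoring string needs a reasonably recent scikit-learn. In the notebook, model_log, X_train, and y_log would take their place, at a much higher compute cost.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in data: duration grows with distance, with multiplicative noise
rng = np.random.default_rng(0)
distance_km = rng.uniform(0.5, 20.0, size=500)
duration_s = 120 + 150 * distance_km * rng.lognormal(0.0, 0.3, size=500)

X = np.log(distance_km).reshape(-1, 1)
y_log = np.log1p(duration_s)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y_log,
    cv=cv, scoring="neg_root_mean_squared_error",
)

# scikit-learn reports negated errors so that larger is always better
rmsle_per_fold = -scores
print(rmsle_per_fold, rmsle_per_fold.mean())
```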

OLS

OLS_model = sm.OLS(y_log, X_train).fit()
print(OLS_model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          trip_duration   R-squared:                       0.358
Model:                            OLS   Adj. R-squared:                  0.358
Method:                 Least Squares   F-statistic:                 3.540e+04
Date:                Thu, 26 Apr 2018   Prob (F-statistic):               0.00
Time:                        16:15:42   Log-Likelihood:            -1.4198e+06
No. Observations:             1458640   AIC:                         2.840e+06
Df Residuals:                 1458616   BIC:                         2.840e+06
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
vendor_id              0.0199      0.001     17.901      0.000       0.018       0.022
passenger_count        0.0079      0.000     18.698      0.000       0.007       0.009
pickup_longitude      -0.9181      0.007   -131.463      0.000      -0.932      -0.904
pickup_latitude       -0.4466      0.010    -46.867      0.000      -0.465      -0.428
dropoff_longitude      0.4693      0.007     71.526      0.000       0.456       0.482
dropoff_latitude      -0.2251      0.008    -26.933      0.000      -0.241      -0.209
store_and_fwd_flag     0.0007      0.007      0.098      0.922      -0.013       0.015
pick_dayofweek         0.0153      0.000     34.435      0.000       0.014       0.016
pick_month             0.0170      0.000     53.156      0.000       0.016       0.018
pick_day               0.0007    6.1e-05     10.770      0.000       0.001       0.001
pick_hour              0.0068      0.000     16.933      0.000       0.006       0.008
holiday               -0.1176      0.003    -40.438      0.000      -0.123      -0.112
weather               -0.0499      0.002    -23.725      0.000      -0.054      -0.046
uclidean               0.1655      0.001    177.266      0.000       0.164       0.167
manhatan              -0.0367      0.001    -61.396      0.000      -0.038      -0.036
direction             -0.0003   4.85e-06    -53.675      0.000      -0.000      -0.000
pick_pca0              0.5870      0.008     76.411      0.000       0.572       0.602
pick_pca1             -0.6161      0.011    -55.641      0.000      -0.638      -0.594
drop_pca0             -0.8110      0.008   -102.752      0.000      -0.827      -0.796
drop_pca1             -0.4763      0.009    -50.474      0.000      -0.495      -0.458
gaus_pick              0.0009   8.42e-05     10.158      0.000       0.001       0.001
gaus_drop              0.0055   8.01e-05     68.585      0.000       0.005       0.006
weekend               -0.1547      0.002    -80.970      0.000      -0.158      -0.151
office_dawn           -0.0790      0.004    -18.453      0.000      -0.087      -0.071
office_morning        -0.0312      0.002    -15.946      0.000      -0.035      -0.027
office_afternoon       0.1163      0.001    105.402      0.000       0.114       0.118
office_evening         0.0868      0.002     41.610      0.000       0.083       0.091
office_night          -0.0879      0.004    -22.720      0.000      -0.095      -0.080
==============================================================================
Omnibus:                  2672070.271   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):     253169639877.915
Skew:                         -12.080   Prob(JB):                         0.00
Kurtosis:                    2043.831   Cond. No.                     1.28e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.65e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Y_test = OLS_model.predict(test)
Y_test_exp = np.exp(Y_test)


sub = pd.DataFrame(columns= ['id', 'trip_duration'])
sub['id'] = sample_submission["id"]
sub['trip_duration'] = Y_test_exp
sub.to_csv('submission_OLS.csv',index=False)
!kaggle competitions submit -c nyc-taxi-trip-duration -f submission_OLS.csv -m "Message"

Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/jk/.kaggle/kaggle.json'
Successfully submitted to New York City Taxi Trip Duration

Appendix

1. degree of decimal

0.000001° = 111.32 mm (about 11 cm)

2. spatial data analysis

PCA
discriminant analysis

3. clustering

K means
K nearest neighbor
Expectation Maximization

4. ensemble methods

aggregation
boosting
