Kaggle : New York City Taxi Trip Duration


1 EDA (Exploratory Data Analysis)

purpose of EDA

  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

EDA methods

  • Graphical techniques used in EDA:
    • boxplot (see the sketch after this list)
      • derived features (datetime by month, day of week, hour)
    • histogram or barplot (distribution) # bin = range of values
      • original features (pickup lat/long, dropoff lat/long, duration, passenger count, flag)
      • derived features (datetime by month, day of week, hour)
    • scatter plot
      • duration vs distance, to check for odd data
    • parallel coordinates vs colormaps vs Andrews curves charts
    • odds ratio?
  • Quantitative methods:
    • trimean (Tukey's trimean?)
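As a minimal sketch of the boxplot idea on this competition's data (assuming the train.csv described in 1.1 below is in the working directory):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the competition data and parse the pickup timestamp
train = pd.read_csv("train.csv", parse_dates=["pickup_datetime"])

# boxplot of log trip duration by pickup hour; the log tames the heavy right tail
train["pick_hour"] = train["pickup_datetime"].dt.hour
sns.boxplot(x=train["pick_hour"], y=np.log(train["trip_duration"]))
plt.xlabel("pickup hour")
plt.ylabel("log(trip_duration)")
plt.show()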

1.1 Understanding data

from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from math import sin, cos, sqrt, atan2, radians
import seaborn as sns
import lightgbm as lgb
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.cluster import MeanShift
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import train_test_split

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy.stats import norm, skew
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
train = pd.read_csv("train.csv")
train.head()
          id  vendor_id      pickup_datetime     dropoff_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration
0  id2875421          2  2016-03-14 17:24:55  2016-03-14 17:32:30                1        -73.982155        40.767937         -73.964630         40.765602                  N            455
1  id2377394          1  2016-06-12 00:43:35  2016-06-12 00:54:38                1        -73.980415        40.738564         -73.999481         40.731152                  N            663
2  id3858529          2  2016-01-19 11:35:24  2016-01-19 12:10:48                1        -73.979027        40.763939         -74.005333         40.710087                  N           2124
3  id3504673          2  2016-04-06 19:32:31  2016-04-06 19:39:40                1        -74.010040        40.719971         -74.012268         40.706718                  N            429
4  id2181028          2  2016-03-26 13:30:55  2016-03-26 13:38:10                1        -73.973053        40.793209         -73.972923         40.782520                  N            435
test = pd.read_csv("test.csv")
test.head()
          id  vendor_id      pickup_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag
0  id3004672          1  2016-06-30 23:59:58                1        -73.988129        40.732029         -73.990173         40.756680                  N
1  id3505355          1  2016-06-30 23:59:53                1        -73.964203        40.679993         -73.959808         40.655403                  N
2  id1217141          1  2016-06-30 23:59:47                1        -73.997437        40.737583         -73.986160         40.729523                  N
3  id2150126          2  2016-06-30 23:59:41                1        -73.956070        40.771900         -73.986427         40.730469                  N
4  id1598245          1  2016-06-30 23:59:33                1        -73.970215        40.761475         -73.961510         40.755890                  N
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission.head()
          id  trip_duration
0  id3004672            959
1  id3505355            959
2  id1217141            959
3  id2150126            959
4  id1598245            959

1.1.a Data type and unit

unit

1. latitude / longitude = decimal degrees (see the conversion snippet after this list)

  • 111.32 mm per 0.000001° / 11.132 m per 0.0001° / 1.1132 km per 0.01° / 111.32 km per 1.0°
  • ex) 40.767937, -73.982155 (6 decimal places)

2. datetime = year-month-day hour:minute:second

3. vendor_id = 1, 2

4. passenger_count = 0 - 9

5. store_and_fwd_flag = N, Y

6. trip_duration = seconds

  • ex) 455 sec = 7 min 35 sec
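The conversions above are easy to sanity-check in code. A minimal sketch (111.32 km per degree is the value for latitude, and for longitude only at the equator, so treat it as approximate at NYC's latitude):

# approximate ground distance of a given change in latitude
KM_PER_DEGREE = 111.32

for delta_deg in (0.000001, 0.0001, 0.01, 1.0):
    meters = delta_deg * KM_PER_DEGREE * 1000
    print(f"{delta_deg}° ≈ {meters:g} m")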
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null object
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null object
trip_duration         1458644 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625134 entries, 0 to 625133
Data columns (total 9 columns):
id                    625134 non-null object
vendor_id             625134 non-null int64
pickup_datetime       625134 non-null object
passenger_count       625134 non-null int64
pickup_longitude      625134 non-null float64
pickup_latitude       625134 non-null float64
dropoff_longitude     625134 non-null float64
dropoff_latitude      625134 non-null float64
store_and_fwd_flag    625134 non-null object
dtypes: float64(4), int64(2), object(3)
memory usage: 42.9+ MB

train data

  • 1,458,644 rows (~1.46M), 11 columns

test data

  • 625,134 rows (~0.63M), 9 columns (no dropoff_datetime or trip_duration)

1.1.b Missing Data check

# no missing data in either set
train.isnull().sum()
id                    0
vendor_id             0
pickup_datetime       0
dropoff_datetime      0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
trip_duration         0
dtype: int64
test.isnull().sum()
id                    0
vendor_id             0
pickup_datetime       0
passenger_count       0
pickup_longitude      0
pickup_latitude       0
dropoff_longitude     0
dropoff_latitude      0
store_and_fwd_flag    0
dtype: int64

1.1.c Trip duration

trip duration calculation validation

train["pickup_datetime"] =  pd.to_datetime(train["pickup_datetime"])
train["dropoff_datetime"] =  pd.to_datetime(train["dropoff_datetime"])
sample_duration = train["dropoff_datetime"] - train["pickup_datetime"]
sample_duration_sec = sample_duration.dt.total_seconds().astype('int')
train['trip_sec'] =  sample_duration_sec
train_d = train[train["trip_duration"] != train["trip_sec"]]
print(len(train_d))

if len(train_d) == 0:
    train = train.drop(['trip_sec'], axis=1)
0

drop odd data

# drop trips with duration over 100,000 seconds (~28 hours)
print(len(train.loc[train.trip_duration > 100000]))
4
print("before drop : ", len(train))
train = train[train["trip_duration"] < 100000]
print("after  drop : ", len(train))
before drop :  1458644
after  drop :  1458640

trip duration visualization

# log-transformed trip duration: distribution and normal Q-Q plot
plt.figure(figsize=(16,8))
plt.subplot(121)
sns.distplot(np.log(train["trip_duration"]))
plt.subplot(122)
stats.probplot(np.log(train["trip_duration"]), plot=plt)
plt.tight_layout()
plt.show()

[figure: distribution of log(trip_duration) and normal Q-Q plot]

1.2 Feature Engineering & Data Cleaning

datetime conversion

train = train.drop("dropoff_datetime", axis=1)
# data type convert to datetime from object
train["pickup_datetime"] =  pd.to_datetime(train["pickup_datetime"])
test["pickup_datetime"] =  pd.to_datetime(test["pickup_datetime"])
#day of week
#Monday=0, Sunday=6
train["pick_dayofweek"] = train["pickup_datetime"].dt.dayofweek
test["pick_dayofweek"] = test["pickup_datetime"].dt.dayofweek
# month / day / hour via the vectorized .dt accessor
# (equivalent to .apply(lambda x: x.month) etc., but faster)
train["pick_month"] = train["pickup_datetime"].dt.month
train["pick_day"] = train["pickup_datetime"].dt.day
train["pick_hour"] = train["pickup_datetime"].dt.hour
# train["pick_min"] = train["pickup_datetime"].dt.minute
# train["pick_sec"] = train["pickup_datetime"].dt.second

test["pick_month"] = test["pickup_datetime"].dt.month
test["pick_day"] = test["pickup_datetime"].dt.day
test["pick_hour"] = test["pickup_datetime"].dt.hour
# test["pick_min"] = test["pickup_datetime"].dt.minute
# test["pick_sec"] = test["pickup_datetime"].dt.second
train.head()
          id  vendor_id      pickup_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration  pick_dayofweek  pick_month  pick_day  pick_hour
0  id2875421          2  2016-03-14 17:24:55                1        -73.982155        40.767937         -73.964630         40.765602                  N            455               0           3        14         17
1  id2377394          1  2016-06-12 00:43:35                1        -73.980415        40.738564         -73.999481         40.731152                  N            663               6           6        12          0
2  id3858529          2  2016-01-19 11:35:24                1        -73.979027        40.763939         -74.005333         40.710087                  N           2124               1           1        19         11
3  id3504673          2  2016-04-06 19:32:31                1        -74.010040        40.719971         -74.012268         40.706718                  N            429               2           4         6         19
4  id2181028          2  2016-03-26 13:30:55                1        -73.973053        40.793209         -73.972923         40.782520                  N            435               5           3        26         13

national holiday

from bs4 import BeautifulSoup
import requests

# scrape the 2016 US holiday table from officeholidays.com
response = requests.get("https://www.officeholidays.com/countries/usa/2016.php")
bs = BeautifulSoup(response.content, "html.parser")
trs = bs.select("table td")
trs1 = trs[1::5]   # every 5th cell starting at index 1 holds the holiday date
li = []
holi = pd.DataFrame()
count = 0

for i in trs1[0:14]:
    li.append((i.text).strip())
    li[count] = '2016 ' + li[count]
    li[count] = li[count].split(" ")
    li[count] = li[count][0] + "-" + li[count][1] + '-' + li[count][2]
    count += 1

holi['date'] = li
holi['date'] = pd.to_datetime(holi['date'])
holi
         date
0  2016-01-01
1  2016-01-18
2  2016-02-15
3  2016-04-15
4  2016-05-08
5  2016-05-30
6  2016-06-19
7  2016-07-04
8  2016-09-05
9  2016-10-10
10 2016-11-11
11 2016-11-24
12 2016-11-25
13 2016-12-26
select_date = list(holi["date"].astype("str"))
holiday = train.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
train["holiday"] = holiday
select_date = list(holi["date"].astype("str"))
holiday = test.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
test["holiday"] = holiday
train['holiday'] = 1 * (train.holiday == True)
test['holiday'] = 1 * (test.holiday == True)
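Web pages change, so if the scrape breaks, a more robust sketch is to hard-code the 14 dates recovered above and build the same flag directly:

# the 2016 dates scraped above, fixed as a literal list so the feature is reproducible offline
holiday_dates = pd.to_datetime([
    "2016-01-01", "2016-01-18", "2016-02-15", "2016-04-15", "2016-05-08",
    "2016-05-30", "2016-06-19", "2016-07-04", "2016-09-05", "2016-10-10",
    "2016-11-11", "2016-11-24", "2016-11-25", "2016-12-26",
])

# flag rows whose pickup date falls on one of the holidays (equivalent to the flag above)
train["holiday"] = train["pickup_datetime"].dt.normalize().isin(holiday_dates).astype(int)
test["holiday"] = test["pickup_datetime"].dt.normalize().isin(holiday_dates).astype(int)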

New York City Weather Event

from selenium import webdriver

# PhantomJS was current when this was written; newer selenium versions use a headless browser instead
driver = webdriver.PhantomJS()
driver.get("https://www.weather.gov/okx/stormevents")
date = driver.find_elements_by_css_selector('#pagebody > div:nth-child(3) > div > table > tbody > tr > td ul:nth-child(6)')
lis = date[0].find_elements_by_css_selector('li')
li_wea = []
count = 0

for i in lis:
    li_wea.append(i.text)
    li_wea[count] = '2016 ' + li_wea[count]
    li_wea[count] = li_wea[count].split(" ")
    li_wea[count] = li_wea[count][0] + "-" + li_wea[count][1] + "-" + li_wea[count][2] + "-" + li_wea[count][3]
    count += 1

new1 = pd.DataFrame(li_wea, columns=['old'])
new1['date'] = new1['old'].str.extract(r'(\d\d\d\d-...-\d\d)', expand=True)
# hand-patch the entries the regex did not capture
new1['date'][4] = '2016-Feb-05'
new1['date'][5] = '2016-Feb-08'
new1['date'][11] = '2016-Apr-03'
new1['date'][12] = '2016-Apr-04'
new1['date'][14] = '2016-June-28'
new1['date'][15] = '2016-July-18'
new1['date'][16] = '2016-July-29'
new1['date'][17] = '2016-July-31'
new1['date'][25] = '2016-Oct-08'
new1['date'][35] = '2016-Dec-05'
new1 = new1.drop('old', axis=1)
new2 = pd.DataFrame(['2016-August-01', '2016-Dec-01'], columns=['date'])
new1 = new1.append(new2, ignore_index=True).dropna()
new1['date'] = pd.to_datetime(new1['date'])
new1
         date
0  2016-01-10
1  2016-01-13
2  2016-01-17
3  2016-01-23
4  2016-02-05
5  2016-02-08
6  2016-02-15
7  2016-02-24
8  2016-03-14
9  2016-03-21
10 2016-03-28
11 2016-04-03
12 2016-04-04
13 2016-05-30
14 2016-06-28
15 2016-07-18
16 2016-07-29
17 2016-07-31
18 2016-08-10
19 2016-08-11
20 2016-08-12
21 2016-08-13
22 2016-08-20
24 2016-09-19
25 2016-10-08
26 2016-10-22
27 2016-10-22
28 2016-10-27
29 2016-10-30
30 2016-11-11
31 2016-11-14
32 2016-11-20
33 2016-11-29
34 2016-11-30
35 2016-12-05
36 2016-12-15
37 2016-12-17
38 2016-12-18
39 2016-08-01
40 2016-12-01
select_date = list(new1["date"].astype("str"))
weather = train.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
train["weather"] = weather
select_date = list(new1["date"].astype("str"))
weather = test.pickup_datetime.apply(lambda x : str(x.date())).isin(select_date)
test["weather"] = weather
train['weather'] = 1 * (train.weather == True)
test['weather'] = 1 * (test.weather == True)
driver.close()

1.2.b Distance between pickup and dropoff location

haversine distance

Despite the uclidean name kept in the code (and in the column it produces), this is the haversine formula for great-circle distance, not the straight-line Euclidean distance.

def uclidean(lat1, lng1, lat2, lng2):
    # haversine great-circle distance in km (R = Earth's mean radius)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    R = 6371.0
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * R * np.arcsin(np.sqrt(d))
    return h
train['uclidean'] = uclidean(train.pickup_latitude, train.pickup_longitude, train.dropoff_latitude, train.dropoff_longitude)
test['uclidean'] = uclidean(test.pickup_latitude, test.pickup_longitude, test.dropoff_latitude, test.dropoff_longitude)

Manhattan distance

# L1 distance: sum of absolute lat/long deltas, scaled by 113.2 as a rough degrees-to-km factor
# (cf. the ~111.32 km per degree noted in 1.1.a; the column keeps the 'manhatan' spelling)
train['manhatan'] = (abs(train.dropoff_longitude - train.pickup_longitude) + abs(train.dropoff_latitude - train.pickup_latitude)) * 113.2
test['manhatan'] = (abs(test.dropoff_longitude - test.pickup_longitude) + abs(test.dropoff_latitude - test.pickup_latitude)) * 113.2

direction

def direction(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    # initial compass bearing from pickup to dropoff, in degrees in (-180, 180]
    pickup_lat_rads = np.radians(pickup_lat)
    pickup_long_rads = np.radians(pickup_long)
    dropoff_lat_rads = np.radians(dropoff_lat)
    dropoff_long_rads = np.radians(dropoff_long)
    # the operands are already in radians, so the delta must not be converted again
    long_delta_rads = dropoff_long_rads - pickup_long_rads

    y = np.sin(long_delta_rads) * np.cos(dropoff_lat_rads)
    x = (np.cos(pickup_lat_rads) * np.sin(dropoff_lat_rads) - np.sin(pickup_lat_rads) * np.cos(dropoff_lat_rads) * np.cos(long_delta_rads))

    return np.degrees(np.arctan2(y, x))
train['direction'] = direction(train.pickup_latitude, train.pickup_longitude, train.dropoff_latitude, train.dropoff_longitude)
test['direction'] = direction(test.pickup_latitude, test.pickup_longitude, test.dropoff_latitude, test.dropoff_longitude)

1.2.d.2 Spatial Data Analysis

Types of spatial analysis

  • FA(factor analysis)
    • Euclidean metric = > PCA(principal component analysis)
    • Chi-Square distance => Correspondence Analysis (similar to PCA, but better for categrorical data)
    • Generalized Mahalanobis distance => Discriminant Analysis

stack up coordinate data

coord_pick_lat = pd.concat([train['pickup_latitude'], test['pickup_latitude']], axis=0)
coord_pick_lon = pd.concat([train['pickup_longitude'], test['pickup_longitude']], axis=0)
coord_drop_lat = pd.concat([train['dropoff_latitude'], test['dropoff_latitude']], axis=0)
coord_drop_lon = pd.concat([train['dropoff_longitude'], test['dropoff_longitude']], axis=0)

coord_pick = pd.concat([coord_pick_lat, coord_pick_lon], axis=1)
coord_drop = pd.concat([coord_drop_lat, coord_drop_lon], axis=1)

coord_lat = pd.concat([train['pickup_latitude'], train['dropoff_latitude'], test['pickup_latitude'], test['dropoff_latitude']], axis=0)
coord_lon = pd.concat([train['pickup_longitude'], train['dropoff_longitude'], test['pickup_longitude'], test['dropoff_longitude']], axis=0)
coord_all = pd.concat([coord_lat, coord_lon], axis=1)
coord_all.columns = ['lat', 'lon']

coordinates scatter plot

# new york city coordinates: approx. (40.7128, -74.0060)
city_lon_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)
sns.lmplot(x='pickup_latitude', y='pickup_longitude', data=coord_pick, fit_reg=False, scatter_kws={"s": 1}, size=10)
plt.ylim(city_lon_border)
plt.xlim(city_lat_border)
plt.title('Pick up')
plt.show()

[figure: pickup locations scatter plot]

sns.lmplot(x='dropoff_latitude', y='dropoff_longitude', data=coord_drop, fit_reg=False, scatter_kws={"s": 1}, size=10)
plt.ylim(city_lon_border)
plt.xlim(city_lat_border)
plt.title('Drop off')
plt.show()

[figure: dropoff locations scatter plot]

PCA

pca = PCA(random_state=0).fit(coord_all)
# PCA on raw coordinates is just a rotation of (lat, lon) onto the principal
# axes of the point cloud, which roughly follow the city's street grid
train['pick_pca0'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 0]
train['pick_pca1'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 1]
train['drop_pca0'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
train['drop_pca1'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 1]
test['pick_pca0'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 0]
test['pick_pca1'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 1]
test['drop_pca0'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
test['drop_pca1'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 1]
train.head()
          id  vendor_id      pickup_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration  ...
0  id2875421          2  2016-03-14 17:24:55                1        -73.982155        40.767937         -73.964630         40.765602                  N            455  ...
1  id2377394          1  2016-06-12 00:43:35                1        -73.980415        40.738564         -73.999481         40.731152                  N            663  ...
2  id3858529          2  2016-01-19 11:35:24                1        -73.979027        40.763939         -74.005333         40.710087                  N           2124  ...
3  id3504673          2  2016-04-06 19:32:31                1        -74.010040        40.719971         -74.012268         40.706718                  N            429  ...
4  id2181028          2  2016-03-26 13:30:55                1        -73.973053        40.793209         -73.972923         40.782520                  N            435  ...

   pick_hour  holiday  weather  uclidean  manhatan   direction  pick_pca0  pick_pca1  drop_pca0  drop_pca1
0         17        0        1  1.498521  2.248074  174.333195   0.007691   0.017053  -0.009667   0.013695
1          0        0        0  1.805507  2.997289 -178.051506   0.007677  -0.012371   0.027145  -0.018652
2         11        0        0  6.385098  9.073912 -179.629721   0.004803   0.012879   0.034222  -0.039337
3         19        0        0  1.485498  1.752341 -179.872566   0.038342  -0.029194   0.041343  -0.042293
4         13        0        0  1.188588  1.224652  179.990812  -0.002877   0.041748  -0.002380   0.031070

5 rows × 23 columns
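A quick way to see what the rotation did is to inspect the fitted axes and their variance shares:

# each row of pca.components_ is a principal axis expressed in (lat, lon) coordinates;
# explained_variance_ratio_ shows how the cloud's spread splits across the two axes
print(pca.components_)
print(pca.explained_variance_ratio_)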

1.2.d.3 Coordinates Clustering

Gaussian Mixture

from sklearn.mixture import GaussianMixture
gaus_pick = GaussianMixture(n_components=20).fit(coord_pick)
gaus_drop = GaussianMixture(n_components=20).fit(coord_drop)
train['gaus_pick'] = gaus_pick.predict(train[['pickup_latitude', 'pickup_longitude']])
test['gaus_pick'] = gaus_pick.predict(test[['pickup_latitude', 'pickup_longitude']])
train['gaus_drop'] = gaus_drop.predict(train[['dropoff_latitude', 'dropoff_longitude']])
test['gaus_drop'] = gaus_drop.predict(test[['dropoff_latitude', 'dropoff_longitude']])
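The choice of n_components=20 is a judgment call. One standard check is the BIC score that sklearn's GaussianMixture exposes; a sketch, fit on a subsample for speed:

from sklearn.mixture import GaussianMixture

# lower BIC is better; subsample to keep the loop fast
sample = coord_pick.sample(50000, random_state=0)
for k in (5, 10, 20, 40):
    gm = GaussianMixture(n_components=k, random_state=0).fit(sample)
    print(k, gm.bic(sample))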

1.2.d.4 Time data manipulation

office hour

# cut the 0-23 hour range into 5 equal-width bins and label them
labels = ["dawn", "morning", "afternoon", "evening", "night"]
cats1 = pd.cut(train['pick_hour'], 5, labels = labels)
cats2 = pd.cut(test['pick_hour'], 5, labels = labels)
train['office'] = cats1
test['office'] = cats2
train.head()
          id  vendor_id      pickup_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration  ...
0  id2875421          2  2016-03-14 17:24:55                1        -73.982155        40.767937         -73.964630         40.765602                  N            455  ...
1  id2377394          1  2016-06-12 00:43:35                1        -73.980415        40.738564         -73.999481         40.731152                  N            663  ...
2  id3858529          2  2016-01-19 11:35:24                1        -73.979027        40.763939         -74.005333         40.710087                  N           2124  ...
3  id3504673          2  2016-04-06 19:32:31                1        -74.010040        40.719971         -74.012268         40.706718                  N            429  ...
4  id2181028          2  2016-03-26 13:30:55                1        -73.973053        40.793209         -73.972923         40.782520                  N            435  ...

   uclidean  manhatan   direction  pick_pca0  pick_pca1  drop_pca0  drop_pca1  gaus_pick  gaus_drop     office
0  1.498521  2.248074  174.333195   0.007691   0.017053  -0.009667   0.013695         15          1    evening
1  1.805507  2.997289 -178.051506   0.007677  -0.012371   0.027145  -0.018652          8         13       dawn
2  6.385098  9.073912 -179.629721   0.004803   0.012879   0.034222  -0.039337          3          0  afternoon
3  1.485498  1.752341 -179.872566   0.038342  -0.029194   0.041343  -0.042293          4          0      night
4  1.188588  1.224652  179.990812  -0.002877   0.041748  -0.002380   0.031070         15         15  afternoon

5 rows × 26 columns
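pd.cut with an integer bin count makes equal-width bins over the observed 0-23 hour range, so each label spans roughly 4.6 hours; retbins=True confirms the edges:

# retbins=True also returns the bin edges pd.cut actually used
_, edges = pd.cut(train['pick_hour'], 5, labels=labels, retbins=True)
print(edges)  # approximately [-0.023, 4.6, 9.2, 13.8, 18.4, 23.0]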

weekend

train['weekend'] = 1 * ((train["pick_dayofweek"] == 5) | (train["pick_dayofweek"] == 6))
test['weekend'] = 1 * ((test["pick_dayofweek"] == 5) | (test["pick_dayofweek"] == 6))
train.head()
          id  vendor_id      pickup_datetime  passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration  ...
0  id2875421          2  2016-03-14 17:24:55                1        -73.982155        40.767937         -73.964630         40.765602                  N            455  ...
1  id2377394          1  2016-06-12 00:43:35                1        -73.980415        40.738564         -73.999481         40.731152                  N            663  ...
2  id3858529          2  2016-01-19 11:35:24                1        -73.979027        40.763939         -74.005333         40.710087                  N           2124  ...
3  id3504673          2  2016-04-06 19:32:31                1        -74.010040        40.719971         -74.012268         40.706718                  N            429  ...
4  id2181028          2  2016-03-26 13:30:55                1        -73.973053        40.793209         -73.972923         40.782520                  N            435  ...

   manhatan   direction  pick_pca0  pick_pca1  drop_pca0  drop_pca1  gaus_pick  gaus_drop     office  weekend
0  2.248074  174.333195   0.007691   0.017053  -0.009667   0.013695         15          1    evening        0
1  2.997289 -178.051506   0.007677  -0.012371   0.027145  -0.018652          8         13       dawn        1
2  9.073912 -179.629721   0.004803   0.012879   0.034222  -0.039337          3          0  afternoon        0
3  1.752341 -179.872566   0.038342  -0.029194   0.041343  -0.042293          4          0      night        0
4  1.224652  179.990812  -0.002877   0.041748  -0.002380   0.031070         15         15  afternoon        1

5 rows × 27 columns

3. Modeling

evaluation metric

Root Mean Squared Logarithmic Error

$\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$

Where:

  • ε is the RMSLE value (score)
  • n is the total number of observations in the (public/private) data set
  • p_i is your prediction of trip duration for observation i
  • a_i is the actual trip duration for observation i
  • log(x) is the natural logarithm of x
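A direct implementation makes the connection to the log target explicit: minimizing plain MSE on log(y) is, up to the +1 shift, the same objective, which is why the models below are trained on np.log(trip_duration). A sketch:

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error as defined above; np.log1p(x) = log(x + 1)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))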

data type manipulation

  • convert categorical data to numeric encoding (store_and_fwd_flag: Y/N → 1/0)
train['store_and_fwd_flag'] = 1 * (train.store_and_fwd_flag.values == 'Y')
test['store_and_fwd_flag'] = 1 * (test.store_and_fwd_flag.values == 'Y')

input data shape check

train = train.drop('id', axis=1)
test = test.drop('id', axis=1)
train = train.drop(['pickup_datetime'], axis=1)
test = test.drop(['pickup_datetime'], axis=1)
print(train.shape, test.shape)
(1458640, 25) (625134, 24)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
X_train = train.drop(['trip_duration'], axis=1)
y_train = train['trip_duration']
y_log = np.log(y_train)

lightgbm

model_log = lgb.LGBMRegressor(n_estimators=12500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1).fit(X_train, y_log)
y_pred = model_log.predict(test)
y_exp = np.exp(y_pred)

sub = pd.DataFrame(columns= ['id', 'trip_duration'])
sub['id'] = sample_submission["id"]
sub['trip_duration'] = y_exp
sub.to_csv('sub_lgb_exp1.csv',index=False)

!kaggle competitions submit -c nyc-taxi-trip-duration -f sub_lgb_exp1.csv -m "Message"

#n_estimators=500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.39475, 0.39786
#n_estimators=1000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.39072, 0.39368
#n_estimators=1500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38848, 0.39147
#n_estimators=2000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38670, 0.38967
#n_estimators=3000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38499, 0.38761
#n_estimators=4000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38368, 0.38634
#n_estimators=5000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38295, 0.38553
#n_estimators=10000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38206, 0.38433


#top score #top 16%
#n_estimators=15000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38180, 0.38384


#n_estimators=20000, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1
#0.38195, 0.38401
Successfully submitted to New York City Taxi Trip Duration
lgb.plot_importance(model_log)

[figure: LightGBM feature importance plot]

Cross Validation

# from sklearn.model_selection import cross_val_score   # sklearn.cross_validation is deprecated
# cross_lgb = cross_val_score(model_log, X_train, y_log, cv=2, n_jobs=-1)
# cross_lgb
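A runnable version with the modern import (a sketch; a smaller model stands in for the 12,500-tree one so the folds finish quickly):

from sklearn.model_selection import cross_val_score

# negated MSE on the log target; sqrt of the negated scores reads as an RMSLE-style error
cv_model = lgb.LGBMRegressor(n_estimators=500, reg_alpha=0.5, reg_lambda=0.5, n_jobs=-1)
scores = cross_val_score(cv_model, X_train, y_log, cv=3, scoring='neg_mean_squared_error')
print(np.sqrt(-scores))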

OLS

# note: sm.OLS does not add an intercept automatically, so this fits without a constant term
OLS_model = sm.OLS(y_log, X_train).fit()
print(OLS_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          trip_duration   R-squared:                       0.358
Model:                            OLS   Adj. R-squared:                  0.358
Method:                 Least Squares   F-statistic:                 3.540e+04
Date:                Thu, 26 Apr 2018   Prob (F-statistic):               0.00
Time:                        16:15:42   Log-Likelihood:            -1.4198e+06
No. Observations:             1458640   AIC:                         2.840e+06
Df Residuals:                 1458616   BIC:                         2.840e+06
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
vendor_id              0.0199      0.001     17.901      0.000       0.018       0.022
passenger_count        0.0079      0.000     18.698      0.000       0.007       0.009
pickup_longitude      -0.9181      0.007   -131.463      0.000      -0.932      -0.904
pickup_latitude       -0.4466      0.010    -46.867      0.000      -0.465      -0.428
dropoff_longitude      0.4693      0.007     71.526      0.000       0.456       0.482
dropoff_latitude      -0.2251      0.008    -26.933      0.000      -0.241      -0.209
store_and_fwd_flag     0.0007      0.007      0.098      0.922      -0.013       0.015
pick_dayofweek         0.0153      0.000     34.435      0.000       0.014       0.016
pick_month             0.0170      0.000     53.156      0.000       0.016       0.018
pick_day               0.0007    6.1e-05     10.770      0.000       0.001       0.001
pick_hour              0.0068      0.000     16.933      0.000       0.006       0.008
holiday               -0.1176      0.003    -40.438      0.000      -0.123      -0.112
weather               -0.0499      0.002    -23.725      0.000      -0.054      -0.046
uclidean               0.1655      0.001    177.266      0.000       0.164       0.167
manhatan              -0.0367      0.001    -61.396      0.000      -0.038      -0.036
direction             -0.0003   4.85e-06    -53.675      0.000      -0.000      -0.000
pick_pca0              0.5870      0.008     76.411      0.000       0.572       0.602
pick_pca1             -0.6161      0.011    -55.641      0.000      -0.638      -0.594
drop_pca0             -0.8110      0.008   -102.752      0.000      -0.827      -0.796
drop_pca1             -0.4763      0.009    -50.474      0.000      -0.495      -0.458
gaus_pick              0.0009   8.42e-05     10.158      0.000       0.001       0.001
gaus_drop              0.0055   8.01e-05     68.585      0.000       0.005       0.006
weekend               -0.1547      0.002    -80.970      0.000      -0.158      -0.151
office_dawn           -0.0790      0.004    -18.453      0.000      -0.087      -0.071
office_morning        -0.0312      0.002    -15.946      0.000      -0.035      -0.027
office_afternoon       0.1163      0.001    105.402      0.000       0.114       0.118
office_evening         0.0868      0.002     41.610      0.000       0.083       0.091
office_night          -0.0879      0.004    -22.720      0.000      -0.095      -0.080
==============================================================================
Omnibus:                  2672070.271   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):     253169639877.915
Skew:                         -12.080   Prob(JB):                         0.00
Kurtosis:                    2043.831   Cond. No.                     1.28e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.65e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
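The multicollinearity warning is expected: uclidean, manhatan, and the PCA columns are near-linear functions of the raw coordinate columns. The variance_inflation_factor imported at the top can quantify which columns are redundant; a sketch on a subsample for speed:

# VIF much greater than ~10 flags a column that is close to a linear combination of the others
X_sub = X_train.sample(20000, random_state=0).astype(float)
for i, col in enumerate(X_sub.columns):
    print(col, variance_inflation_factor(X_sub.values, i))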
Y_test = OLS_model.predict(test)
Y_test_exp = np.exp(Y_test)


sub = pd.DataFrame(columns= ['id', 'trip_duration'])
sub['id'] = sample_submission["id"]
sub['trip_duration'] = Y_test_exp
sub.to_csv('submission_OLS.csv',index=False)
!kaggle competitions submit -c nyc-taxi-trip-duration -f submission_OLS.csv -m "Message"
Successfully submitted to New York City Taxi Trip Duration

Appendix

1. decimal degrees

  • 0.000001° ≈ 111.32 mm (see the unit table in 1.1.a)

2. spatial data analysis

  • PCA
  • discriminant analysis

3. clustering

  • K means
  • K nearest neighbor
  • Expectation Maximization

4. ensemble methods

  • bagging (bootstrap aggregation)
  • boosting
