Let’s Understand Feature Engineering with NYC Taxi Fare Prediction

Rishabh R Rahatgaonkar
3 min read · Mar 6, 2021

Feature engineering is one of the crucial steps of any data science process. It plays a huge role in improving the predictive power of machine learning models. In this article we will walk through some feature engineering concepts using the NYC Taxi Fare Prediction dataset.

But before that, we have to understand what feature engineering actually is.

Feature engineering can be split into two terms: “Feature” + “Engineering”.

“Feature” refers to the attributes or columns present in the dataset, and “engineering” means creating new features out of existing ones in order to improve the predictive power of machine learning models. Since this article is focused on feature engineering, I will skip the data exploration, or Exploratory Data Analysis (EDA), part and jump directly to the feature engineering part. The link to the complete code is given at the end of the article.

Now it’s time to see what our dataset looks like.

NYC Train Dataset

The dataset consists of 8 columns (features), including geographic and datetime columns, and 40,000 rows.

The dataset is split in an 80:20 ratio: 80% of the data is kept for training and 20% for the test set. Because the dataset is small, it is only divided into train and test sets, and cross-validation is performed in order to evaluate the model.

So what we will do is create a function that extracts the necessary features from the datetime column, such as day, month, year and time.
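To make the 80:20 split and the cross-validation step concrete, here is a minimal, dependency-free sketch that works on row indices only. The function names `train_test_split_indices` and `k_fold_indices`, the seed, and the choice of 5 folds are assumptions made for illustration and are not taken from the article's code; in practice, scikit-learn's `train_test_split` and `KFold` cover the same ground.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


train_idx, test_idx = train_test_split_indices(40_000)
print(len(train_idx), len(test_idx))  # 32000 8000

for fit_idx, val_idx in k_fold_indices(train_idx, k=5):
    print(len(fit_idx), len(val_idx))  # 25600 6400 on every fold
```

The test indices are set aside once and never appear in any fold, so the cross-validation scores are computed on training data only.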

import pandas as pd

def extract_features(dataframe):
    # Parse the raw pickup timestamp into a pandas datetime column
    dataframe['pickup_datetime'] = pd.to_datetime(dataframe['pickup_datetime'])

    day = dataframe['pickup_datetime'].dt.day
    month = dataframe['pickup_datetime'].dt.month
    year = dataframe['pickup_datetime'].dt.year
    time = dataframe['pickup_datetime'].dt.time

    # Creating Features
    dataframe['day'] = day
    dataframe['month'] = month
    dataframe['year'] = year
    dataframe['time'] = time

    return dataframe

train_df = extract_features(train_df)
test_df = extract_features(test_df)

Now we will create another function that computes the trip distance from the geographical features: pickup latitude, pickup longitude, drop-off latitude and drop-off longitude.

import geopy.distance

def compute_distance(dataframe):
    # Geodesic distance in kilometres between pickup and drop-off points
    for index, rows in dataframe.iterrows():
        try:
            pickup_latitude = rows['pickup_latitude']
            pickup_longitude = rows['pickup_longitude']
            dropoff_latitude = rows['dropoff_latitude']
            dropoff_longitude = rows['dropoff_longitude']

            pickup_coords = (pickup_latitude, pickup_longitude)
            dropoff_coords = (dropoff_latitude, dropoff_longitude)

            distance = geopy.distance.geodesic(pickup_coords, dropoff_coords).km
            dataframe.loc[index, 'distance(KM)'] = distance

        except ValueError:
            # geopy rejects invalid coordinates (e.g. latitude outside [-90, 90]);
            # skip the row so its distance stays missing
            continue

    return dataframe

train_df = compute_distance(train_df)
test_df = compute_distance(test_df)
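As a quick sanity check on the distance feature, the haversine formula gives a spherical approximation of the same quantity using only the standard library. This is a sketch, not the method used in the article: `geodesic` works on an ellipsoidal Earth model, so the two results usually differ by a fraction of a percent. The function name `haversine_km`, the mean Earth radius constant, and the sample coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical pickup and drop-off points inside New York City
print(haversine_km(40.7580, -73.9855, 40.6413, -73.7781))
```

Because the formula is plain arithmetic, it can also be applied to whole columns at once, which avoids the row-by-row loop when the dataset grows.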

Next, we create a function that separates the time into hours, minutes and seconds.

def separate_time(dataframe):
    # Work on the "HH:MM:SS" string form of the time column
    dataframe['time'] = dataframe['time'].astype(str)
    for index, rows in dataframe.iterrows():
        time = rows['time'].split(":")
        # Store numbers rather than strings so the values can be compared later
        dataframe.loc[index, 'hour'] = int(time[0])
        dataframe.loc[index, 'minutes'] = int(time[1])
        dataframe.loc[index, 'seconds'] = int(time[2])

    return dataframe

train_df = separate_time(train_df)
test_df = separate_time(test_df)
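To see what these fields look like for a single timestamp, here is a small standard-library sketch. The sample value and its `YYYY-MM-DD HH:MM:SS UTC` layout are assumptions about how the raw `pickup_datetime` strings look, and `time_parts` is a hypothetical helper name. In pandas, the `.dt.hour`, `.dt.minute` and `.dt.second` accessors return the same integers directly from a datetime column, without any string splitting.

```python
from datetime import datetime


def time_parts(raw):
    """Split a raw timestamp string into integer date and time features."""
    # Drop the trailing " UTC" marker (assumed format) before parsing
    parsed = datetime.strptime(raw.replace(" UTC", ""), "%Y-%m-%d %H:%M:%S")
    return {
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "hour": parsed.hour,
        "minutes": parsed.minute,
        "seconds": parsed.second,
    }


print(time_parts("2009-06-15 17:26:21 UTC"))
# {'day': 15, 'month': 6, 'year': 2009, 'hour': 17, 'minutes': 26, 'seconds': 21}
```

Keeping the values as integers matters for the next step, where the hour is compared against numeric thresholds.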

After separating the time into hours, minutes and seconds, we convert the hour feature into a part of the day, such as early morning, morning, evening or night.

def get_part_day(x):
    # Accept the hour as a string or a number; comparisons need a number
    x = int(x)
    if (x > 4) and (x <= 8):
        return 'Early Morning'
    elif (x > 8) and (x <= 12):
        return 'Morning'
    elif (x > 12) and (x <= 16):
        return 'Noon'
    elif (x > 16) and (x <= 20):
        return 'Eve'
    elif (x > 20) and (x <= 24):
        return 'Night'
    elif (x <= 4):
        return 'Late Night'

train_df['session'] = train_df['hour'].apply(get_part_day)
test_df['session'] = test_df['hour'].apply(get_part_day)
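The same bucketing can also be written as a lookup table, which keeps the boundaries in one place and makes them easy to adjust. The sketch below mirrors the ranges used in `get_part_day`; the names `UPPER_BOUNDS`, `LABELS` and `part_of_day` are assumptions introduced for illustration.

```python
from bisect import bisect_left

# Inclusive upper bound of each bucket, in the same order as LABELS
UPPER_BOUNDS = [4, 8, 12, 16, 20, 24]
LABELS = ['Late Night', 'Early Morning', 'Morning', 'Noon', 'Eve', 'Night']


def part_of_day(hour):
    """Map an hour (0 to 24, string or number) to its part-of-day label."""
    hour = int(hour)
    if not 0 <= hour <= 24:
        raise ValueError(f"hour out of range: {hour}")
    # bisect_left finds the first bucket whose upper bound is >= hour
    return LABELS[bisect_left(UPPER_BOUNDS, hour)]


for h in (0, 4, 5, 12, 13, 17, 23):
    print(h, part_of_day(h))
```

Unlike the chain of `elif` branches, this version raises an error for an out-of-range hour instead of silently returning `None`.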

Now we will create a function that converts the passenger count, which ranges from 1 to 6, into two main categories (Single Passenger and Multi Passenger), with a fallback "No Passenger" class for counts below 1.

def get_passenger(x):
    if x == 1:
        return "Single Passenger"
    elif x > 1:
        return "Multi Passenger"
    else:
        return "No Passenger"

train_df['Passenger(Class)'] = train_df['passenger_count'].apply(get_passenger)
test_df['Passenger(Class)'] = test_df['passenger_count'].apply(get_passenger)

That’s all for now. If you found this article useful, give it a clap, and until then,

Happy Learning!

Complete code link: https://github.com/rishabh706/NYC-Taxi-Fare-Prediction
