Let’s Understand Feature Engineering with NYC Taxi Fare Prediction
Feature engineering is one of the crucial steps of any data science process, and it plays a huge role in improving the predictive power of machine learning models. In this article we will work through some feature engineering concepts on the NYC Taxi Fare Prediction dataset.
But before that, we have to understand what feature engineering is actually about.
Feature engineering can be split into two terms: "Feature" + "Engineering".
A feature is an attribute or column present in the dataset, and engineering means creating new features out of existing ones in order to improve the predictive power of machine learning models. Since this article is focused on feature engineering, I will skip the data exploration, or Exploratory Data Analysis (EDA), part and jump directly to the feature engineering part. The complete code link is given at the end of the article.
Now it's time to see what our dataset looks like.
The dataset consists of 8 columns (features), covering geographic and datetime information, and 40,000 rows.
The dataset is split in an 80:20 ratio: 80% of the data is kept for training and 20% for the test set. Because the dataset is small, it is only divided into train and test sets, and cross-validation is performed in order to evaluate the model.
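As a rough illustration of the 80:20 split and the cross-validation folds described above, here is a minimal sketch using only pandas and NumPy. The toy frame `toy_df`, the seed 42 and the choice of 5 folds are assumptions made for this example; the original notebook may well use scikit-learn's `train_test_split` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the taxi data (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "fare_amount": np.arange(100, dtype=float),
    "passenger_count": np.tile([1, 2, 3, 4], 25),
})

# 80% of the rows go to training, the remaining 20% to the test set
split_train = toy_df.sample(frac=0.8, random_state=42)
split_test = toy_df.drop(split_train.index)

# Shuffle the training indices and cut them into 5 folds for cross-validation
rng = np.random.default_rng(42)
shuffled_idx = rng.permutation(split_train.index.to_numpy())
folds = np.array_split(shuffled_idx, 5)

for i, val_idx in enumerate(folds):
    # Every index that is not in the validation fold is used for fitting
    fit_idx = np.setdiff1d(shuffled_idx, val_idx)
    print(f"fold {i}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
```

Each fold serves once as the validation set while the model is fitted on the other four, so every training row is used for validation exactly once.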
The first thing we will do is create a function that extracts the necessary features from the datetime column: day, month, year and time.
import pandas as pd

def extract_features(dataframe):
    # Parse the raw strings into datetimes so that the .dt accessor is available
    dataframe['pickup_datetime'] = pd.to_datetime(dataframe['pickup_datetime'])
    day = dataframe['pickup_datetime'].dt.day
    month = dataframe['pickup_datetime'].dt.month
    year = dataframe['pickup_datetime'].dt.year
    time = dataframe['pickup_datetime'].dt.time
    # Creating features
    dataframe['day'] = day
    dataframe['month'] = month
    dataframe['year'] = year
    dataframe['time'] = time
    return dataframe

train_df = extract_features(train_df)
test_df = extract_features(test_df)
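To see what the `.dt` accessor produces, here is a small self-contained sketch on two made-up timestamps. The frame `toy_df` and its values are hypothetical, and the `weekday` column is an extra feature that is not part of the original article; it is included only to show that further calendar features can be derived the same way.

```python
import pandas as pd

# Two made-up pickup timestamps (hypothetical, not rows from the real dataset)
toy_df = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21", "2010-01-05 16:52:16"],
})
toy_df["pickup_datetime"] = pd.to_datetime(toy_df["pickup_datetime"])

# The same calendar features as in extract_features
toy_df["day"] = toy_df["pickup_datetime"].dt.day
toy_df["month"] = toy_df["pickup_datetime"].dt.month
toy_df["year"] = toy_df["pickup_datetime"].dt.year

# Extra feature, not in the original article: Monday=0 ... Sunday=6
toy_df["weekday"] = toy_df["pickup_datetime"].dt.dayofweek

print(toy_df)
```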
Now we will create another function that computes the trip distance from the geographical features: pickup latitude, pickup longitude, drop-off latitude and drop-off longitude.
import geopy.distance

def compute_distance(dataframe):
    for index, rows in dataframe.iterrows():
        try:
            pickup_latitude = rows['pickup_latitude']
            pickup_longitude = rows['pickup_longitude']
            dropoff_latitude = rows['dropoff_latitude']
            dropoff_longitude = rows['dropoff_longitude']
            pickup_coords = (pickup_latitude, pickup_longitude)
            dropoff_coords = (dropoff_latitude, dropoff_longitude)
            # Geodesic distance between pickup and drop-off, in kilometres
            distance = geopy.distance.geodesic(pickup_coords, dropoff_coords).km
            dataframe.loc[index, 'distance(KM)'] = distance
        except Exception:
            # Rows with invalid coordinates are skipped, so their distance stays missing
            continue
    return dataframe

train_df = compute_distance(train_df)
test_df = compute_distance(test_df)
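Looping over rows with `iterrows` can be slow on larger frames. As a possible alternative, the sketch below computes the distance for all rows at once with the haversine formula. Note that this is a swapped-in technique: haversine treats the Earth as a sphere, whereas the article uses geopy's geodesic distance, so the two values will differ slightly. The function name `haversine_km`, the radius constant and the toy coordinates are assumptions made for this example.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of the spherical model

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical coordinates: a zero-length trip and a trip of one degree of latitude
toy_df = pd.DataFrame({
    "pickup_latitude": [40.75, 40.0],
    "pickup_longitude": [-73.99, -74.0],
    "dropoff_latitude": [40.75, 41.0],
    "dropoff_longitude": [-73.99, -74.0],
})

# One vectorized call replaces the row-by-row loop
toy_df["distance(KM)"] = haversine_km(
    toy_df["pickup_latitude"], toy_df["pickup_longitude"],
    toy_df["dropoff_latitude"], toy_df["dropoff_longitude"],
)
print(toy_df["distance(KM)"])
```

Unlike the loop, this version does not skip invalid coordinates, so out-of-range values would need to be filtered beforehand.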
Next, we create a function that separates the time into hours, minutes and seconds.
def separate_time(dataframe):
    dataframe['time'] = dataframe['time'].astype(str)
    for index, rows in dataframe.iterrows():
        time = rows['time'].split(":")
        # Store the parts as numbers so they can be compared numerically later
        dataframe.loc[index, 'hour'] = int(time[0])
        dataframe.loc[index, 'minutes'] = int(time[1])
        dataframe.loc[index, 'seconds'] = int(time[2])
    return dataframe

train_df = separate_time(train_df)
test_df = separate_time(test_df)
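An alternative to splitting the time string is to read the components straight from the datetime column through the `.dt` accessor, which yields integer columns without a loop. This is a sketch on made-up timestamps, not the article's own code; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Made-up timestamps (hypothetical, not rows from the real dataset)
toy_df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2009-06-15 17:26:21", "2010-01-05 04:05:06"]),
})

# Each accessor returns an integer column for the whole frame at once
toy_df["hour"] = toy_df["pickup_datetime"].dt.hour
toy_df["minutes"] = toy_df["pickup_datetime"].dt.minute
toy_df["seconds"] = toy_df["pickup_datetime"].dt.second

print(toy_df[["hour", "minutes", "seconds"]])
```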
After separating the time into hours, minutes and seconds, we convert the hour feature into a part of the day, such as early morning, morning, evening or night.
def get_part_day(x):
    x = int(x)  # the hour may arrive as a string such as '17'
    if (x > 4) and (x <= 8):
        return 'Early Morning'
    elif (x > 8) and (x <= 12):
        return 'Morning'
    elif (x > 12) and (x <= 16):
        return 'Noon'
    elif (x > 16) and (x <= 20):
        return 'Eve'
    elif (x > 20) and (x <= 24):
        return 'Night'
    elif (x <= 4):
        return 'Late Night'

train_df['session'] = train_df['hour'].apply(get_part_day)
test_df['session'] = test_df['hour'].apply(get_part_day)
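The same hour ranges can also be expressed as bins with `pd.cut`, which labels a whole column in one call. This is a sketch on a handful of sample hours, assuming the hour is already numeric; the bin edges and labels mirror the `if`/`elif` ranges above, with -1 as the lower edge so that hour 0 falls into the first bin.

```python
import pandas as pd

# Sample hours chosen to hit each range and its boundaries
hours = pd.Series([0, 4, 5, 8, 9, 13, 17, 21, 23])

# Intervals are right-closed by default: (-1, 4], (4, 8], (8, 12], ...
bins = [-1, 4, 8, 12, 16, 20, 24]
labels = ["Late Night", "Early Morning", "Morning", "Noon", "Eve", "Night"]

session = pd.cut(hours, bins=bins, labels=labels)
print(pd.DataFrame({"hour": hours, "session": session}))
```

The result is a categorical column, which is often convenient when the feature is later encoded for a model.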
Now we will create a function that converts the passenger count, which ranges from 1 to 6, into two main categories (Single Passenger and Multi Passenger), with a fallback "No Passenger" label for counts below 1.
def get_passenger(x):
    if x == 1:
        return "Single Passenger"
    elif x > 1:
        return "Multi Passenger"
    else:
        return "No Passenger"

train_df['Passenger(Class)'] = train_df['passenger_count'].apply(get_passenger)
test_df['Passenger(Class)'] = test_df['passenger_count'].apply(get_passenger)
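For comparison, the same mapping can be written without `apply` by using NumPy's `np.select`, which picks a label per row from a list of conditions. This is only a sketch on a hypothetical `toy_df`; the labels are the ones used in `get_passenger`.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger counts, including a 0 to exercise the fallback label
toy_df = pd.DataFrame({"passenger_count": [0, 1, 2, 6]})

# Conditions are checked in order; rows matching none get the default label
conditions = [toy_df["passenger_count"] == 1, toy_df["passenger_count"] > 1]
choices = ["Single Passenger", "Multi Passenger"]
toy_df["Passenger(Class)"] = np.select(conditions, choices, default="No Passenger")

print(toy_df)
```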
That's all for now. If you found this article useful, give it a clap. Until next time,
Happy Learning!
Complete code link: https://github.com/rishabh706/NYC-Taxi-Fare-Prediction