Predicting number of Bike-share Users

Our goal is to build and tune machine learning models that predict how many ride-share bikes will be used in any given one-hour period, using the information available about that time and day.

Data-set used

The data-set we are using is from the University of California Irvine’s Machine Learning Repository.

[...] Values are divided by 41 (max)
Normalized feeling temperature in Celsius. Values are divided by 50 (max)
Normalized humidity. Values are divided by 100 (max)
Normalized wind speed. Values are divided by 67 (max)
Count of casual users
Count of registered users
Count of total rental bikes, including both casual and registered

At first glance, the number of data points far exceeds the number of features, which makes this a “skinny” data-set, a shape generally considered ideal for ML.

Exploratory data analysis

Before processing a data-set with algorithms, it’s always a good idea to explore it visually. Using the ggplot2 and ggExtra packages, we can quickly make some plots to investigate how the bicycle usage count is affected by the available features.

Once again, this intuitively makes sense, as users may also be discouraged from biking when it’s too hot outside.

Scatter plot: Humidity vs Usage

There seems to be a negative correlation between humidity and the usage rate, with a linear fit being very close to the best curve fit for all of the data (excluding some outliers with very low humidity).

We may conclude that it makes more sense to predict these two counts (casual and registered) separately and add up the tallies to find the total count. However, when I tried that, I found the final predictions to be less accurate than what we get if we simply predict the overall count.

However, with some simple data manipulation (more on this in the next section), we can change this to represent the usage rate based on the temporal distance to 4 am, and find a somewhat linear fit (see below).
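
As a rough illustration of the "temporal distance to 4 am" idea, here is a minimal Python sketch. The article's own analysis uses R packages, and the exact transformation is not shown in this section, so this is only one plausible reading: it assumes hours are coded 0 to 23 and that the distance wraps around midnight (so 11 pm counts as 5 hours from 4 am, not 19). The function name `hours_from_4am` is made up for this example.

```python
def hours_from_4am(hour: int, anchor: int = 4, period: int = 24) -> int:
    """Shortest distance, in hours, between `hour` and `anchor` on a 24-hour clock.

    Assumption: the distance wraps around midnight, so the result is always
    between 0 and period // 2.
    """
    diff = abs(hour - anchor) % period
    return min(diff, period - diff)


# Map every hour of the day (0-23) to its distance from 4 am.
distances = {hour: hours_from_4am(hour) for hour in range(24)}

print(distances[4])   # 0
print(distances[16])  # 12
print(distances[23])  # 5
```

With this encoding, hours on either side of 4 am that are equally far from it receive the same value, which is what lets a roughly V-shaped hour vs usage pattern collapse into something closer to a straight line.
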
Note: having features that linearly predict the outcome is ideal, as it reduces the need for complex non-linear ML algorithms.

A somewhat similar trend can also be observed in the month vs usage plot (below), with an evidently higher usage rate during the warmer summer months and the lowest usage in January. With some manipulation similar to the previous plot, this data can also be used to represent usage based on the temporal distance to the month of January.

Since we have a massive data-set with 17,000+ data points, we can expand these features and still have a “skinny” data-set, so we don’t risk over-fitting.

Since we are using only non-complex regression algorithms in this project, we can afford to add computational complexity for better accuracy in our predictions.

Modifying cyclic variable values to represent temporal distance from a single time-point
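
The step named above can be sketched for several cyclic columns at once. This is a hedged Python illustration, not the author's code (the article works in R): the column names `hr` and `mnth`, the sample rows, and the helper `cyclic_distance` are assumptions for the example, with hours coded 0 to 23, months coded 1 to 12, and distances wrapping around the end of the cycle.

```python
def cyclic_distance(value: int, anchor: int, period: int) -> int:
    """Shortest distance between `value` and `anchor` on a cycle of length `period`."""
    diff = abs(value - anchor) % period
    return min(diff, period - diff)


# Hypothetical rows; "hr" (0-23) and "mnth" (1-12) are assumed column names.
rows = [
    {"hr": 3, "mnth": 1},
    {"hr": 17, "mnth": 7},
    {"hr": 23, "mnth": 12},
]

for row in rows:
    # Distance from 4 am on a 24-hour cycle.
    row["hr_dist"] = cyclic_distance(row["hr"], anchor=4, period=24)
    # Distance from January (month 1) on a 12-month cycle.
    row["mnth_dist"] = cyclic_distance(row["mnth"], anchor=1, period=12)

for row in rows:
    print(row)
```

Under the wrap-around assumption, December ends up one month away from January rather than eleven, so the two coldest months sit next to each other in the transformed feature.
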
