Feature Engineering Time

Let’s just describe direction as a 2D vector (x, y), not a single scalar that is the “heading”.

Yep, let’s do that thing, with time.

Let’s map the times to the circumference of a 24-hour clock.

If we want to cluster, we cluster in 2D space (x, y).

If we want to take an average, we take the centroid of the points (their center of gravity), averaging x and y separately.

If we want to use the time as a feature in a model, we use the vector (x, y).

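What happens when the average is not on the circle?

Any average of more than one distinct point will be inside the circle; the wider the spread of the points, the further from the circumference the average will be. That is fine: the angle of the average still gives the mean time, and its distance from the center tells us how spread out the points are.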
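To see the shrinking radius in numbers, here is a minimal sketch. It assumes times are given as decimal hours and mapped to the unit circle at 15° per hour; the helper name mean_radius is made up for this illustration.

```python
import math

def mean_radius(hours_list):
    """Distance from the circle's center to the average of the mapped points."""
    angles = [math.radians(h * 15.0) for h in hours_list]  # 15 degrees per hour
    mean_x = sum(math.cos(a) for a in angles) / len(angles)
    mean_y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(mean_x, mean_y)

tight = mean_radius([11.5, 12.0, 12.5])   # times within one hour of each other
wide = mean_radius([6.0, 12.0, 18.0])     # times spread over half a day
opposite = mean_radius([0.0, 12.0])       # midnight and noon cancel out
```

The tight group averages to a point almost on the circumference, the wide group lands about a third of the way out, and two opposite times average to the center, where no direction (and so no mean time) is defined.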

Summary

By mapping time of day to a 2D circle that represents 24 hours, we can remove the discontinuity that occurs when the hour hand goes from 23 to 0.

This would also work for day of year, or actual direction.

This new 2D vector (which can simply be two columns in a model) behaves very naturally: it defines a metric that puts 23:59 and 00:01 close together.

Similarly it puts Dec 31st and Jan 1st right next to each other.

NNE and NNW do not care that they are on different sides of 360° (0°).

Those points of discontinuity were arbitrary to begin with, but so long as we were using only a single dimension they had to occur somewhere.

In 2D things are nice again.
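As a quick check of the "close together" claim, the sketch below compares the scalar gap between 23:59 and 00:01 with the straight-line distance between their points on the unit circle. The function names are hypothetical, and the 15°-per-hour mapping is the one described in the appendix.

```python
import math

def time_to_xy(hours):
    """Map decimal hours in [0, 24) to (x, y) on the unit circle."""
    angle = math.radians(hours * 15.0)
    return math.cos(angle), math.sin(angle)

def circle_distance(h1, h2):
    """Euclidean (chord) distance between two times in the 2D representation."""
    return math.dist(time_to_xy(h1), time_to_xy(h2))

late = 23 + 59 / 60    # 23:59
early = 1 / 60         # 00:01

scalar_gap = abs(late - early)              # nearly 24 hours apart as scalars
circle_gap = circle_distance(late, early)   # a tiny chord on the circle
noon_gap = circle_distance(0.0, 12.0)       # opposite times: the full diameter
```

The largest possible distance is 2 (the diameter), reached by times exactly 12 hours apart, so the two minutes around midnight end up as a very small fraction of that.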

Code:

You can clone this notebook to play around with the code (it also contains a workable example of PySpark calculating the averages).

Appendix, math and how-to:

Mapping the time to the circumference of the circle is straightforward.

Convert the time to a single number in [0, 24) {or [0, 1), this might be free}, so 12:36 could be 12.6 {or 0.525}.

Then stretch that number to cover 360° by multiplying by 15° {or by 360°}.

Then take the sine and cosine of that angle to get the Y and X coordinates.
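The three steps above can be sketched in Python as follows. This is only an illustration: the function name time_to_xy is made up here, and it assumes the time has already been converted to decimal hours.

```python
import math

def time_to_xy(hours):
    """Map decimal hours in [0, 24) to a point (x, y) on the unit circle."""
    angle_deg = hours * 15.0             # stretch: 24 hours * 15 deg/hour = 360 deg
    angle_rad = math.radians(angle_deg)  # trig functions expect radians
    return math.cos(angle_rad), math.sin(angle_rad)  # (x, y)

hours = 12 + 36 / 60      # 12:36 as a single number, 12.6
x, y = time_to_xy(hours)  # the point at 189 degrees

midnight = time_to_xy(0.0)   # lands at (1, 0)
six_am = time_to_xy(6.0)     # lands at roughly (0, 1)
```

With this orientation midnight sits on the positive x axis and time runs counterclockwise. Any other starting point or direction works just as well, as long as the same mapping is used everywhere.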

(Happy note: SQL has trigonometric functions; don’t forget to convert to radians.)

To calculate the average, simply take the average of X and Y independently.

To transform back to time, use atan2(E(Y), E(X)), then reverse the transformations (radians to degrees, adding 360° to any negative angle, and un-stretch to 24H).

Bonus, STDDEV:

Calculate the radius r of the average point with the Pythagorean theorem, r = sqrt(E(X)² + E(Y)²).

The arc that represents the standard deviation is given by arccos(r).

Proving this is left as an exercise for the reader.
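For anyone who wants to try the averaging, the trip back to a clock time, and the arccos(r) spread on a small example, here is a hedged sketch. The function name circular_mean_time is invented for this illustration, the input is assumed to be decimal hours, and the spread simply follows the arccos(r) formula as stated above.

```python
import math

def circular_mean_time(hours_list):
    """Return (mean_hours, r, spread_hours) for a list of decimal hours."""
    angles = [math.radians(h * 15.0) for h in hours_list]
    mean_x = sum(math.cos(a) for a in angles) / len(angles)  # E(X)
    mean_y = sum(math.sin(a) for a in angles) / len(angles)  # E(Y)

    # atan2 returns an angle in (-180, 180] degrees; wrap it into [0, 360).
    mean_deg = math.degrees(math.atan2(mean_y, mean_x)) % 360.0
    # Un-stretch to hours; the extra modulo guards against a rounded 24.0.
    mean_hours = (mean_deg / 15.0) % 24.0

    r = math.hypot(mean_x, mean_y)  # radius of the average point
    # Spread per the arccos(r) formula, converted from degrees to hours.
    spread_hours = math.degrees(math.acos(min(1.0, r))) / 15.0
    return mean_hours, r, spread_hours

# 22:00 and 01:00 straddle midnight; a plain scalar average would give 11:30.
mean_hours, r, spread_hours = circular_mean_time([22.0, 1.0])
```

For these two times the mean comes out at 23:30 and the spread at 1.5 hours, which is exactly how far each time sits from the mean. If the average lands at the center of the circle (r = 0), the mean time is undefined and the result of atan2 should not be trusted.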

