How to go from bias to buyer

No, I was not.

Imagine a distribution of apartments ranging from underpriced to overpriced.

The underpriced apartments sell quickly and maybe even above the initial asking price.

The middle of the spectrum, the units with a fair asking price, also sell without major issues and leave the market, so that both the underpriced and the fair-priced parts of the spectrum disappear out of sight.

We only observe the units which were priced too high and need to try again with a more competitive price tag.

We never see the underpriced ones being uploaded with the higher price that they were sold for.

A form of survivorship bias is at play here and not necessarily the collapse of the Norwegian real estate market.
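
To make the selection effect concrete, here is a small simulation sketch. All numbers are made up: the fair values, the size of the pricing errors, and the rule that anything asked at more than 3% above fair value fails to sell and comes back at fair value are assumptions for illustration, not estimates from the Finn data.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
fair_value = rng.normal(4_000_000, 500_000, size=n)  # hypothetical fair prices in NOK
mispricing = rng.normal(0.0, 0.05, size=n)           # asking price relative to fair value
asking = fair_value * (1 + mispricing)

# Assumption: ads priced more than 3% above fair value do not sell and are
# relisted at fair value. Everything else sells and leaves the market.
relisted = mispricing > 0.03
price_change = fair_value[relisted] / asking[relisted] - 1

share_relisted = relisted.mean()
print(f"share of ads that come back: {share_relisted:.1%}")
print(f"mean price change among relisted ads: {price_change.mean():.1%}")
print(f"relisted ads with a price increase: {(price_change > 0).sum()}")
```

Even though the pricing errors are symmetric around zero in this sketch, every ad that reappears does so with a price cut, simply because of which ads get to reappear.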

I still think it is useful information to have if you are interested in a particular apartment, and you observe that it did not sell for a certain price and came back cheaper.

Imagine you would go to a viewing for such an apartment.

You already know that the market rejected it for a certain price and you would probably not make a bid in the area of the first, unsuccessful price.

It might also give a tiny bit of negotiation power to know that the seller side is potentially growing impatient.

After a few weeks I had found an ad for an apartment I really liked.

It seemed relatively cheap, but with only the average m² price shown on Finn.no, how could I be certain?

I thought a predictive model trained on my data set and predicting the price of the apartment of my interest might help.

After cleaning the data, it is always a good idea to explore the data a bit and look for features that can be transformed or dropped for a better fit and performance.

Let’s have a first look at the general relationship between floor space in m² and the price of housing in Oslo, with a linear fit laid on top of it.

We observe the expected positive relationship and a common attribute of scatterplots of the housing market: the variance in prices increases with the size of the unit.

While the estimator in a regression model with such a spreading shape remains unbiased and predictive models would still be workable, there is usually a simple way to account for the phenomenon (called heteroscedasticity).

Instead of a linear relationship between size and price, we can take the logarithm of both variables.

Looking at the logarithmic scatterplot we see that the spread of the scatter decreased.
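
A quick way to check this numerically, sketched on synthetic data rather than on my Finn data set: the prices below are generated with multiplicative noise (an assumption that mimics the fan shape), and the helper `residual_spread` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: price grows with size, and the noise is multiplicative
size = rng.uniform(20, 200, size=5_000)                          # m²
price = 80_000 * size**0.9 * np.exp(rng.normal(0, 0.2, 5_000))   # NOK

def residual_spread(x, y):
    """Fit y = a + b*x by least squares and return the residual standard
    deviation for the smaller and the larger half of x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (intercept + slope * x)
    small = x < np.median(x)
    return resid[small].std(), resid[~small].std()

lin_small, lin_large = residual_spread(size, price)
log_small, log_large = residual_spread(np.log(size), np.log(price))

print(f"levels: spread ratio large/small = {lin_large / lin_small:.2f}")
print(f"logs:   spread ratio large/small = {log_large / log_small:.2f}")
```

In levels the residuals of the larger units are clearly more spread out than those of the smaller ones, while in logs the two halves have roughly the same spread.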

Next, I had to select meaningful features from the wide range of collected attributes.

I have seen people use scatterplots for feature selection.

They explore alleged causality between the explanatory and the target variable by eye-balling the plot.

Sometimes the conclusion is that the feature has no impact on the target variable because the graph shows a horizontal line fit.

But a scatterplot is simply a two-dimensional representation of a multi-dimensional relationship and as such, it doesn’t tell the whole story.

Imagine this example: let’s say we have houses of different sizes in different locations.

The smallest one is directly in the popular city center.

The other ones are located with increasing distance to the city center and are also increasingly getting larger with distance.

The negative effect of the increasing distance to the center perfectly cancels out the price increase that the additional floor space would bring.

If you plot a scatterplot of these observations with the price on the y- and the size on the x-axis, you could observe a horizontal line and wrongly conclude that the size has absolutely no impact on the price.

The plot shows only two dimensions in isolation but gives no information on how some features interact with each other.
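
The house example can be sketched in a few lines. The price rule (+40,000 per m², -400,000 per km) and all data are invented for illustration. I add a little noise so that size and distance are not perfectly collinear, because otherwise no regression could separate the two effects.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical houses: the further from the center, the larger they get
distance = rng.uniform(0, 10, n)                     # km to the city center
size = 50 + 10 * distance + rng.normal(0, 5, n)      # m²

# Assumed price rule: the distance penalty offsets the size premium
price = (1_000_000 + 40_000 * size - 400_000 * distance
         + rng.normal(0, 50_000, n))

# What the scatterplot shows: price regressed on size alone
slope_simple, _ = np.polyfit(size, price, deg=1)

# What a model with both features recovers
design = np.column_stack([np.ones(n), size, distance])
coefs, *_ = np.linalg.lstsq(design, price, rcond=None)
_, coef_size, coef_distance = coefs

print(f"size effect, size only:        {slope_simple:,.0f} per m²")
print(f"size effect, with distance:    {coef_size:,.0f} per m²")
print(f"distance effect, with size:    {coef_distance:,.0f} per km")
```

The slope from the two-dimensional view is close to zero, while the regression that includes distance recovers a size effect close to the one the data was generated with.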

To investigate such interaction, a correlation matrix is often used.

In my case, however, it is important to keep in mind how the data came into existence.

Not every observation had a value for every feature on the Finn website.

The realtors simply listed whatever they thought was worth listing.

Seeing that many variables appear frequently with the exact same wording, there might be a pre-selection of attributes they can choose from, but many “exotic” features appear only once for a specific ad, suggesting they were entered manually.

Some seemingly reliable variables therefore turn out to be unreliable and also show a misleading degree of correlation with other features.

Some ads, for example, stated that the house or apartment is “central”.

And a major driver of prices is of course the area it is situated in, or as every realtor ever preaches: “location, location, location!”

But not every housing unit that is centrally located also received this attribute in the data set.

Only the ones for which the realtor wanted to emphasize this quality in particular.

 Better than relying on such a weak feature is to create zip code dummy variables from the address of the housing unit.

We cannot list each individual factor that makes a certain location attractive and we can certainly not collect it from Finn, but we somehow know or accept that some post codes are just generally good locations.

Zip code dummies capture the entire unexplained attractiveness of a location which includes being central, having great schools or a low crime rate.

Since post codes are usually geographically clustered (0251 is next to 0252 and so on), we use a higher level, say 3-digits, to group postcodes and make sure we have enough observations per post code dummy.
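
The grouping step itself could look like the sketch below. The listings are invented, and `post_code` is an assumed name for the column holding the full four-digit post code.

```python
import pandas as pd

# Hypothetical listings; 'post_code' is an assumed column name
df = pd.DataFrame({
    "post_code": ["0251", "0252", "0255", "0562", "0566", "0171"],
    "price": [5_200_000, 4_900_000, 6_100_000, 3_800_000, 4_000_000, 4_500_000],
})

# Keep post codes as strings so the leading zero survives, then cut to 3 digits
df["post_code3"] = df["post_code"].str[:3]

# Check how many observations each group gets before turning it into dummies
counts = df["post_code3"].value_counts()
print(counts)
```

If a group ends up with only a handful of observations, a coarser level (two digits) is an option, at the cost of a less precise location effect.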

In order to get post code dummies, you can use the very handy “get_dummies” function:

    df = pd.concat([df, pd.get_dummies(df.post_code3)], axis=1)

It is also important to keep in mind that ads are ads (duh) and as such only list positive attributes and sometimes conveniently leave out the ones that might be perceived negatively (apartments on the ground floor often “forgot” to state the floor they were on).

There is very little I can do about this.

But even without this complication, two apartments of equal attributes in the data do not have to be alike.

No attribute in my data captures whether the view out of the windows of a flat is blocked by the outside wall of the neighboring building, or whether the bathroom looks like a half-rotten fungus swinger party from the 70s.

The data captures many things, but aesthetics are not among them.

So the prediction will also not take the condition of an apartment into account.

The continuous variables on the other hand are quite reliable as they consist of the most basic information.

From left to right: floorspace, price (in NOK), running costs, floor, energylabel (converted from letters), construction year.

Of the categorical variables I found the most useful to be: dummies for the type of housing and ownership, zip codes, the presence of a balcony, garden, fireplace (very common and popular in Norway → 25% of all observations!), elevator, garage or parking spot.

I also kept some more “exotic” dummies if they were stated frequently enough (in more than 10% of observations), but dropped obsolete ones like “basic access to the sewage”.

These variables include, for example, “janitor-service”.

I ran two different models: a linear regression in log specification and a gradient boosting model.

I separated the one observation I wanted to predict before training the model, so that the observation is truly “new”.

For the linear model I skipped the train-test split, as overfitting is much less of a concern when you restrict the functional form of your estimator.

    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LinearRegression

    # Target is the log price; every other column is a feature
    Y = df.price_log
    X = df.drop(columns='price_log')

    regressor = LinearRegression()
    regressor.fit(X, Y)

    # In-sample fit on the full data set
    print('Linear Regression R squared: %.4f' % regressor.score(X, Y))
    mse_linear = mean_squared_error(Y, regressor.predict(X))
    print("MSE: %.4f" % mse_linear)

    # Predict the log price of the ad that was held out
    regressor.predict(ad_for_pred)

For the gradient boosting model I tried a few parameter settings but did not spend too much time optimizing them, as the prediction did not change significantly with each change.

The R² was a bit higher for the test set than for the entire set in the linear model (0.89 vs 0.85) and the mean squared error decreased (0.012 vs 0.…).


The model is easy to set up:

    from sklearn.model_selection import train_test_split
    from sklearn import ensemble

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

    # 'squared_error' is the current name of the least-squares loss ('ls' in older scikit-learn versions)
    params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 30,
              'min_samples_leaf': 10, 'learning_rate': 0.1, 'loss': 'squared_error'}
    regressor_gr_boost = ensemble.GradientBoostingRegressor(**params)
    regressor_gr_boost.fit(X_train, y_train)

    mse_gb = mean_squared_error(y_test, regressor_gr_boost.predict(X_test))
    print("MSE: %.4f" % mse_gb)
    print('Gradient Boosting training set R squared: %.4f' % regressor_gr_boost.score(X_train, y_train))
    print('Gradient Boosting test set R squared: %.4f' % regressor_gr_boost.score(X_test, y_test))

    # Predict the log price of the ad that was held out
    regressor_gr_boost.predict(ad_for_pred)

The linear model’s predicted price for the apartment I was interested in was about 5% higher than the asking price listed on Finn.

The gradient boosting model showed an even larger difference of 8%.
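
Since both models are trained on the log price, their output has to be converted back before it can be compared to an asking price. A minimal sketch of that step, assuming `price_log` is the natural logarithm of the price; the helper name `pct_above_asking` and the numbers are made up for illustration and are not the actual figures of the apartment.

```python
import numpy as np

def pct_above_asking(predicted_log_price, asking_price):
    """Convert a log-price prediction back to currency and return the
    relative difference to the asking price."""
    predicted_price = np.exp(predicted_log_price)
    return predicted_price / asking_price - 1

# Made-up numbers for illustration only
asking_price = 4_000_000
predicted_log_price = np.log(4_200_000)

diff = pct_above_asking(predicted_log_price, asking_price)
print(f"prediction is {diff:.1%} above the asking price")
```

Simply exponentiating a log prediction tends to sit a little below the expected price when the errors are roughly normal on the log scale, which is one more reason to read the result as a rough indication rather than an exact figure.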

The exact prediction is in this case of minor importance and certainly not correct to the euro given the data quality flaws discussed before.

But we get a clear indication that the apartment is on the low end of the price range given its features.

I decided to go to the viewing to see if the apartment was indeed great value for money.

It was located on a lively street with many shops and cafes, but also some traffic and even a tram line which was probably a reason for the lower price.

I still thought it was relatively cheap, as my prediction gave me a sense of the prices I would have to expect if I were to look for a comparable apartment in a quieter side street.

In fact, I ended up buying (true story, not just for the narrative of this article).

The data analysis did not take the decision of which apartment to buy off my shoulders, but it provided decision support.

And when I take out a loan of a few hundred thousand euros, I’ll happily take any support I can get.

