Know your data — Pricing diamonds using scatterplots and predictive models

On the supply side, larger contiguous chunks of diamonds without significant flaws are probably much harder to find than smaller ones.

This may help explain the exponential-looking curve — and I thought I noticed this when I was shopping for a diamond for my soon-to-be wife.

Of course, this is related to the fact that the weight of a diamond is a function of volume, and volume is a function of x * y * z, suggesting that we might be especially interested in the cube root of carat weight.

On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers.

Many less-than-one-carat customers would surely never buy a diamond were it not for the social norm of presenting one when proposing.

And, there are fewer consumers who can afford a diamond larger than one carat.

Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.

Often the distribution of any monetary variable will be highly skewed and vary over orders of magnitude.

This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year-on-year inflation) that produce the ultimate price/dollar amount.

Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more take a look at this guest post on Tal Galili’s blog).

Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution).

In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds.
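To make the skew argument concrete, here is a minimal Python sketch using simulated data, not the diamonds data set. It assumes prices behave roughly like log-normal draws (a stand-in for the multiplicative processes described above) and compares the skewness before and after a log10 transform. The `skewness` helper and the parameters 8.0 and 1.0 are illustrative choices.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


# Simulated prices, NOT the diamonds data: log-normal draws stand in for a
# monetary variable produced by multiplicative processes.
rng = random.Random(42)
prices = [rng.lognormvariate(8.0, 1.0) for _ in range(10_000)]
log_prices = [math.log10(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of raw prices:   {raw_skew:.2f}")
print(f"skewness of log10 prices: {log_skew:.2f}")
```

The raw values should show a long right tail (large positive skewness), while the log10 values should sit close to zero. Real price data will rarely be this clean, as the bimodality noted above suggests.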

Let’s re-plot our data, but now let’s put price on a log10 scale:

Better, though still a little funky. Let’s try using the cube root of carat as we speculated about above:

Nice, looks like an almost-linear relationship after applying the transformations above to get our variables on a nice scale.
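The two transformations are simple enough to write down directly. This Python sketch applies them to a few made-up (carat, price) pairs; the `transform` helper and the example values are hypothetical and only show what goes on each axis.

```python
import math


def transform(carat, price):
    """Return (cube root of carat, log10 of price) for one diamond."""
    return carat ** (1 / 3), math.log10(price)


# Made-up (carat, price) pairs for illustration only.
examples = [(0.3, 500.0), (1.0, 5000.0), (2.0, 15000.0)]
for carat, price in examples:
    x, y = transform(carat, price)
    print(f"carat={carat:<4} price={price:>8.0f} -> x={x:.3f}, y={y:.3f}")
```

Plotting y against x for the full data set is what produces the almost-linear cloud described above.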

Overplotting

Note that until now I haven’t done anything about overplotting, where multiple points take on the same value, often due to rounding.

Indeed, price is rounded to dollars and carats are rounded to two digits.

Not bad, though when we’ve got this much data we’re going to have some serious overplotting.

Often you can deal with this by making your points smaller, using “jittering” to randomly shift points to make multiple points visible, and using transparency, which can be done in ggplot using the “alpha” parameter.

This gives us a better sense of how dense and sparse our data is at key places.
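Plotting libraries usually handle jittering for you, but the idea fits in a few lines. The Python sketch below shifts each value by a small uniform random amount no larger than half the rounding unit, so points that were rounded to the same value separate without crossing into a neighbouring value. The `jitter` function and its `width` and `seed` parameters are illustrative names, not part of any plotting API.

```python
import random


def jitter(values, width, seed=0):
    """Shift each value by uniform noise in [-width/2, +width/2]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width / 2, width / 2) for v in values]


# Carat is rounded to two decimals, so many diamonds share a value like 0.30.
carats = [0.30, 0.30, 0.30, 0.31, 0.31]
jittered = jitter(carats, width=0.01)
print(jittered)
```

Choosing `width` equal to the rounding unit (0.01 carat here) keeps the noise within the precision that rounding already discarded.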

Using Color to Understand Qualitative Factors

When I was looking around at diamonds, I also noticed that clarity seemed to factor into price.

Of course, many consumers are looking for a diamond of a certain size, so we shouldn’t expect clarity to be as strong a factor as carat weight.

And I must admit that even though my grandparents were jewelers, I initially had a hard time discerning a diamond rated VVS1 from one rated SI2.

Surely most people need a loupe to tell the difference.

And, according to BlueNile, the cut of a diamond has a much more consequential impact on that “fiery” quality that jewelers describe as the quintessential characteristic of a diamond.

On clarity, the website states, “Many of these imperfections are microscopic, and do not affect a diamond’s beauty in any discernible way.”

Yet, clarity seems to explain an awful lot of the remaining variance in price when we visualize it as a color on our plot:

Despite what BlueNile says, we don’t see as much variation on cut (though most diamonds in this data set are ideal cut anyway):

Color seems to explain some of the variance in price as well, though BlueNile states that all color grades from D-J are basically not noticeable.

At this point, we’ve got a pretty good idea of how we might model price.

But there are a few problems with our 2008 data — not only do we need to account for inflation but the diamond market is quite different now than it was in 2008.

In fact, when I fit models to this data then attempted to predict the price of diamonds I found on the market, I kept getting predictions that were far too low.

After some additional digging, I found the Global Diamond Report.

It turns out that prices plummeted in 2008 due to the global financial crisis, and since then prices (at least for wholesale polished diamonds) have grown at roughly a 6 percent compound annual rate.
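As a rough illustration of what a 6 percent compound annual rate implies, the Python sketch below grows a hypothetical 2008 price forward by a fixed number of years. The `compound` helper, the $1,000 starting price, and the 5-year horizon are assumptions for the example, and applying one flat rate to every diamond is a simplification.

```python
def compound(price, annual_rate, years):
    """Grow a price at a constant compound annual rate."""
    return price * (1 + annual_rate) ** years


# Hypothetical: a diamond priced at $1,000, grown at 6% per year for 5 years.
adjusted = compound(1000.0, 0.06, 5)
print(f"${adjusted:,.2f}")
```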

The rapidly-growing number of couples in China buying diamond engagement rings might also help explain this increase.

After looking at data on PriceScope, I realized that diamond prices grew unevenly across different carat sizes, meaning that the model I initially estimated couldn’t simply be adjusted by inflation.

While I could have done ok with that model, I really wanted to estimate a new model based on fresh data.

Thankfully I was able to put together a Python script to scrape diamondse.info without too much trouble.

This dataset is about 10 times the size of the 2008 diamonds data set and features diamonds from all over the world certified by an array of authorities besides just the Gemological Institute of America (GIA).

You can read in this data as follows (be forewarned: it’s over 500K rows):

My GitHub repository has the code necessary to replicate each of the figures above. Most look quite similar, though this data set contains much more expensive diamonds than the original.

Regardless of whether you’re using the original diamonds data set or the current larger diamonds data set, you can estimate a model based on what we learned from our scatterplots.

We’ll regress log-price on carat, the cube root of carat, clarity, cut and color.

I’m using only GIA-certified diamonds in this model and looking only at diamonds under $10K because these are the type of diamonds sold at most retailers I’ve seen and hence the kind I care most about.

By trimming the most expensive diamonds from the dataset, our model will also be less likely to be thrown off by outliers at the high end of price and carat.
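The trimming step amounts to a simple row filter. In this Python sketch the rows are made up and the keys `cert` and `price` are assumed column names, so adjust them to whatever the data set actually uses.

```python
def trim(rows, max_price=10_000, cert="GIA"):
    """Keep rows certified by `cert` and priced under `max_price`."""
    return [r for r in rows if r["cert"] == cert and r["price"] < max_price]


# Made-up rows; the keys "cert" and "price" are assumed column names.
rows = [
    {"carat": 0.9, "price": 4200, "cert": "GIA"},
    {"carat": 2.5, "price": 31000, "cert": "GIA"},
    {"carat": 1.0, "price": 5100, "cert": "IGI"},
]
kept = trim(rows)
print(kept)
```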

The new data set has mostly the same columns as the old one, so we can just run the following (if you want to run it on the old data set, just set data=diamonds).

Here are the results for my scraped data set:

Now those are some very nice R-squared values: we are accounting for almost all of the variance in price with the 4Cs.

If we want to know whether the price for a diamond is reasonable, we can now use this model and exponentiate the result (since we took the log of price).

We need to multiply the result by exp(sigma²/2), because once we exponentiate, our error is no longer zero in expectation:

To dig further into that last step, have a look at the Wikipedia page on log-normal distributed variables.
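The correction follows from the mean of a log-normal variable: if ln(Y) is normal with mean mu and variance sigma², then E[Y] = exp(mu + sigma²/2). Here is a minimal Python sketch of that back-transformation, assuming the model was fit on the natural log of price and that `sigma` is the residual standard error on the log scale. The function name and the numbers 8.5 and 0.15 are hypothetical, not results from the models above.

```python
import math


def expected_price(log_prediction, sigma):
    """Back-transform a prediction of ln(price) to the expected price.

    Multiplies exp(log_prediction) by exp(sigma^2 / 2), the log-normal
    mean correction; sigma is the residual standard error on the log scale.
    """
    return math.exp(log_prediction) * math.exp(sigma ** 2 / 2)


# Hypothetical numbers: the model predicts ln(price) = 8.5 with sigma = 0.15.
naive = math.exp(8.5)
corrected = expected_price(8.5, 0.15)
print(f"naive: {naive:.0f}, corrected: {corrected:.0f}")
```

With a small residual standard error the correction is modest, but skipping it biases every back-transformed prediction downward.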

Thanks to Miguel for catching this.

Let’s take a look at an example from Blue Nile.

I’ll use the full model, m4.

The results yield an expected value for price given the characteristics of our diamond and the upper and lower bounds of a 95% CI. Note that because this is a linear model, predict() is just multiplying each model coefficient by the corresponding value in our data and summing the results.
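To see what that sum of products looks like, here is a small Python sketch. The coefficients are invented for illustration and are NOT the fitted values of m4, and the term names are just labels. It covers only the point prediction, not the interval.

```python
def linear_predictor(coefficients, features):
    """Sum of coefficient * feature value over all model terms."""
    return sum(coefficients[name] * features[name] for name in coefficients)


# Hypothetical coefficients for a log-price model; NOT the fitted values of m4.
coefficients = {"(Intercept)": 1.0, "carat^(1/3)": 8.0, "carat": -0.7}

carat = 1.0
features = {"(Intercept)": 1.0, "carat^(1/3)": carat ** (1 / 3), "carat": carat}
log_price = linear_predictor(coefficients, features)
print(log_price)
```

Categorical terms such as clarity, cut and color enter the same way, typically as indicator columns that take the value 0 or 1 (this depends on the contrast coding the model uses).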

Turns out that this diamond is a touch pricier than the expected value under the full model, though it is by no means outside our 95% CI.

BlueNile has by most accounts a better reputation than diamondse.info, however, and reputation is worth a lot in a business that relies on easy-to-forge certificates and one in which the non-expert can be easily fooled.

This illustrates an important point about generalizing a model from one data set to another.

First, there may be important differences between data sets — as I’ve speculated about above — making the estimates systematically biased.

Second, there is overfitting: our model may be fitting noise present in the data set.

Even a model cross-validated against out-of-sample predictions can be over-fit to noise that results in differences between data sets.

Of course, while this model may give you a sense of whether your diamond is a rip-off against diamondse.info diamonds, it’s not clear that diamondse.info should be regarded as a source of universal truth about whether the price of a diamond is reasonable.

Nonetheless, to have the expected price at diamondse.info with a 95% interval is a lot more information than we had about the price we should be willing to pay for a diamond before we started this exercise.

An important point: even though we can predict diamondse.info prices almost perfectly based on a function of the 4Cs, one thing that you should NOT conclude from this exercise is that where you buy your diamond is irrelevant, which apparently used to be conventional wisdom in some circles.

You will almost surely pay more if you buy the same diamond at Tiffany’s versus Costco.

But Costco sells some pricy diamonds as well.

Regardless, you can use this kind of model to give you an indication of whether you’re overpaying.

Of course, the value of a natural diamond is largely socially constructed.

Like money, diamonds are only valuable because society says they are. There are no obvious economic efficiencies to be gained or return on investment in a diamond, except perhaps in a very subjective sense concerning your relationship with your significant other.

To get a sense for just how much value is socially constructed, you can compare the price of a natural diamond to that of a synthetic diamond, which thanks to recent technological developments is of comparable quality to a “natural” diamond.

Of course, natural diamonds fetch a dramatically higher price.

One last thing — there are few guarantees in life, and I offer none here.

Though what we have here seems pretty good, data and models are never infallible, and obviously you can still get taken (or be persuaded to pass on a great deal) based on this model.

Always shop with a reputable dealer, and make sure her incentives are aligned against selling you an overpriced diamond or, worse, one that doesn’t match its certificate.

There’s no substitute for establishing a personal connection and lasting business relationship with an established jeweler you can trust.

Plotting your data can help you understand it and can yield key insights.

But even scatterplot visualizations can be deceptive if you’re not careful.

Consider another data set that comes with the alr3 package: soil temperature data from Mitchell, Nebraska, collected by Kenneth G. Hubbard from 1976–1992, which I came across in Weisberg, S. Applied Linear Regression, 3rd edition. New York: Wiley (from which I’ve shamelessly stolen this example).

Let’s plot the data, naively:

Looks kinda like noise.

What’s the story here?

When all else fails, think about it.

What’s on the X-axis? Month.

What’s on the Y-axis? Temperature.

Hmm, well there are seasons in Nebraska, so temperature should fluctuate every 12 months.

But we’ve put more than 200 months in a pretty tight space.

Let’s stretch it out and see how it looks:

Don’t make that mistake.

That concludes part I of this series on scatterplots.

Part II will illustrate the advantages of using facets/panels/small multiples, and show how tools to fit trendlines including linear regression and local regression (loess) can help yield additional insight about your data.

You can also learn more about exploratory data analysis via this Udacity course taught by my colleagues Dean Eckles and Moira Burke, and Chris Saden, which will be coming out in the next few weeks.

Originally published at https://solomonmg.


