Foreword: So, I fell in love with a girl. I want to marry her. To do this, I need to get a diamond ring. And none of that is really relevant to this discussion
This is going to be long. It's going to have math. You'll probably get pretty bored reading this. But I hope that somewhere, someone will find my research useful. Enjoy.
I'd like to answer a seemingly simple question, "How much should I expect to pay for a diamond?" Unfortunately I'm a jaded human being and I don't think I'll ever get an honest answer if I just ask somebody. So, no personal anecdotes, no talking to sales people, let's just look at solid numbers. Thankfully, we have the Internet. This is going to be a very focused analysis of round brilliant loose stones only. I'm not looking at settings. And I'm not looking at any other fancy cuts. I'm lucky in that I know what my girlfriend wants.
So here's what I've done so far. I went to http://www.diamondse.info This is a large database of diamonds. These are real diamonds that I can buy with a few clicks. I assume it's large enough to be a statistically significant sample of the full population of stones on the market. And more importantly to me, relatively easy to data mine. I wrote a script to crawl this website and extract 306560 individual stones with prices. I limited my search for round brilliant stones priced $0-$10,000 USD. My data set was taken 8/25/2014.
My expectation is that price can be determined as a function as:
PRICE = F(CARAT, CUT, COLOR, CLARITY)
specifically:
Price = C0 + C1*Carart + C2*Cut + C3*Color + C4*Clarity
where C0, C1, C2, C3, and C4 are constants.
My first pass is going assume a linear relationship and use multiple linear regression. Multiple linear regression requires integer input, so I'm going to map color, cut and clarity codes to numbers. Because I want my coefficients to have similar weights, and I'm going to pretend that a "typical" carat is 1, I'm going to scale the other factors from 0-1.
Cut Mapping:
Hearts & Arrows = 1
Ideal = 5/6
Very Good = 4/6
Good = 3/6
Fair = 2/6
Poor = 1/6
Color Mapping
D=1
E=0.9
F=0.8
G=0.7
H=0.6
I=0.5
J=0.4
K=0.3
L=0.2
M=0.1
Clarity
F=1
IF=10/11
VVS1=9/11
VVS2=8/11
VS1=7/11
VS2=6/11
SI1=5/11
SI2=4/11
I1=3/11
I2=2/11
I3=1/11
I put all that into Excel and run a Regression Analysis and:
PRICE = -9172.929 + 9073.0*Carat + 1538.85*Cut + 3173.84219*Color + 4315.93*Clarity
With R^2 = 0.80164
Looks like a decent model at first glance. But there are plenty of problems with this model that I'll get to addressing.
I'd like to answer a seemingly simple question, "How much should I expect to pay for a diamond?" Unfortunately I'm a jaded human being and I don't think I'll ever get an honest answer if I just ask somebody. So, no personal anecdotes, no talking to sales people, let's just look at solid numbers. Thankfully, we have the Internet. This is going to be a very focused analysis of round brilliant loose stones only. I'm not looking at settings. And I'm not looking at any other fancy cuts. I'm lucky in that I know what my girlfriend wants.
So here's what I've done so far. I went to http://www.diamondse.info This is a large database of diamonds. These are real diamonds that I can buy with a few clicks. I assume it's large enough to be a statistically significant sample of the full population of stones on the market. And more importantly to me, relatively easy to data mine. I wrote a script to crawl this website and extract 306560 individual stones with prices. I limited my search for round brilliant stones priced $0-$10,000 USD. My data set was taken 8/25/2014.
My expectation is that price can be determined as a function as:
PRICE = F(CARAT, CUT, COLOR, CLARITY)
specifically:
Price = C0 + C1*Carart + C2*Cut + C3*Color + C4*Clarity
where C0, C1, C2, C3, and C4 are constants.
My first pass is going assume a linear relationship and use multiple linear regression. Multiple linear regression requires integer input, so I'm going to map color, cut and clarity codes to numbers. Because I want my coefficients to have similar weights, and I'm going to pretend that a "typical" carat is 1, I'm going to scale the other factors from 0-1.
Cut Mapping:
Hearts & Arrows = 1
Ideal = 5/6
Very Good = 4/6
Good = 3/6
Fair = 2/6
Poor = 1/6
Color Mapping
D=1
E=0.9
F=0.8
G=0.7
H=0.6
I=0.5
J=0.4
K=0.3
L=0.2
M=0.1
Clarity
F=1
IF=10/11
VVS1=9/11
VVS2=8/11
VS1=7/11
VS2=6/11
SI1=5/11
SI2=4/11
I1=3/11
I2=2/11
I3=1/11
I put all that into Excel and run a Regression Analysis and:
PRICE = -9172.929 + 9073.0*Carat + 1538.85*Cut + 3173.84219*Color + 4315.93*Clarity
With R^2 = 0.80164
Looks like a decent model at first glance. But there are plenty of problems with this model that I'll get to addressing.