What happens when we categorize an independent variable in regression?

Suppose we have a slightly complex quadratic relationship such as y = 3*(x-4)² + 4x + e, where I have put the (x-4) in parens to make it clear where the squared term is centered.

We can model this:

    n <- 100                                   # sample size (N = 100)
    x <- rnorm(n)                              # x is normally distributed with mean 0 and sd 1
    y <- 3*(x-4)^2 + 4*x + rnorm(n, 0, 10)     # y is related to x
    x2 <- cut(x, quantile(x, seq(0, 1, .5)), include.lowest = TRUE)    # cuts x into 2 parts
    x10 <- cut(x, quantile(x, seq(0, 1, .1)), include.lowest = TRUE)   # cuts x into deciles

I am not going to show the linear version of x; I know it doesn’t work, and the diagnostics would show that it doesn’t work.

But what about deciles? The coefficients look like this:

    Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)                       92.602      1.142   81.10   <2e-16
    as.factor(x10)(-1.21,-0.849]     -20.189      1.615  -12.50   <2e-16
    as.factor(x10)(-0.849,-0.538]    -29.849      1.615  -18.48   <2e-16
    as.factor(x10)(-0.538,-0.285]    -37.904      1.615  -23.47   <2e-16
    as.factor(x10)(-0.285,-0.0398]   -40.362      1.615  -25.00   <2e-16
    as.factor(x10)(-0.0398,0.193]    -45.895      1.615  -28.42   <2e-16
    as.factor(x10)(0.193,0.466]      -49.652      1.615  -30.75   <2e-16
    as.factor(x10)(0.466,0.761]      -56.066      1.615  -34.72   <2e-16
    as.factor(x10)(0.761,1.33]       -60.625      1.615  -37.54   <2e-16
    as.factor(x10)(1.33,3.2]         -68.871      1.615  -42.65   <2e-16

I removed the significance stars (*) because every coefficient is significant.
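The post does not show the call behind this table. Judging by the row labels, it was probably an `lm()` fit on `as.factor(x10)`; here is a minimal sketch under that assumption. The seed and the model name `mdecile` are mine, so the numbers will not match the table.

```r
set.seed(1)                                  # assumed seed, only for reproducibility
n <- 100
x <- rnorm(n)
y <- 3 * (x - 4)^2 + 4 * x + rnorm(n, 0, 10)
x10 <- cut(x, quantile(x, seq(0, 1, 0.1)), include.lowest = TRUE)

# One intercept (the lowest decile) plus nine offsets, one per remaining decile.
mdecile <- lm(y ~ as.factor(x10))            # hypothetical model name
print(round(coef(summary(mdecile)), 3))

# The fitted value for every observation is simply the mean of y in its decile,
# so the model can only ever describe ten flat steps.
same_as_bin_means <- isTRUE(all.equal(unname(fitted(mdecile)), ave(y, x10)))
print(same_as_bin_means)
```

The last check is the point: however significant the coefficients are, the model knows nothing about how y changes inside a bin.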

But what about the graph? Not very good.

OK, I made it hard by using (x-4), which puts the bottom of the curve near the right edge of the data.

But there’s not much hint here that the relationship is quadratic. And other curves fare just as badly. Categorizing just doesn’t work.
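To see the problem for yourself, here is a rough sketch (my own, not the author's plotting code) that draws the decile fit as a step function over the true curve. The seed, the helper `truth()` and the names `grid`, `grid10` and `step_fit` are assumptions made for illustration.

```r
set.seed(1)                                  # assumed seed
n <- 100
x <- rnorm(n)
truth <- function(x) 3 * (x - 4)^2 + 4 * x   # mean function used in the simulation code
y <- truth(x) + rnorm(n, 0, 10)

breaks <- quantile(x, seq(0, 1, 0.1))
x10 <- cut(x, breaks, include.lowest = TRUE)
mdecile <- lm(y ~ as.factor(x10))

# Evaluate the fit on a fine grid inside the observed range of x,
# binning the grid with the same breaks so the factor levels match.
grid <- seq(min(x), max(x), length.out = 200)
grid10 <- cut(grid, breaks, include.lowest = TRUE)
step_fit <- predict(mdecile, newdata = data.frame(x10 = grid10))

plot(x, y, col = "grey60", pch = 16)
lines(grid, truth(grid), lwd = 2)                        # true curve
lines(grid, step_fit, type = "s", col = "red", lwd = 2)  # decile model
```

The outer deciles cover the widest stretches of x, so that is where a single flat step has to stand in for the largest change in the true curve.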

What does work? Well, a few things. But one thing is restricted cubic splines.

I’m not going to go into those now, but here’s how to model them (using the defaults from Frank Harrell’s rms package):

    library(Hmisc)
    library(rms)
    mspline <- ols(y ~ rcs(x))

The table of coefficients is not too clear without guidance. Here it is:

                    Coef      S.E.       t   Pr(>|t|)
    Intercept     0.4845    1.7686    0.27     0.7842
    x           -20.5336    1.2484  -16.45    <0.0001
    x'           39.8415    7.3191    5.44    <0.0001
    x''        -155.4683   41.8642   -3.71     0.0002
    x'''        216.2249   67.7194    3.19     0.0015

But, the plot! Oh, the plot is clear to all: it goes down, it flattens, it starts to go up at the end. It’s just about exact.
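If you want to draw a plot like this without rms, here is a sketch in base R. It swaps in `splines::ns()`, a natural cubic spline, which is the same kind of curve as a restricted cubic spline. The knot quantiles below (0.05, 0.275, 0.5, 0.725, 0.95) are what I believe rms uses for five knots, so treat them as an assumption; the seed and the names `truth`, `mns`, `grid` and `spline_fit` are mine too.

```r
library(splines)                             # ships with base R

set.seed(1)                                  # assumed seed
n <- 100
x <- rnorm(n)
truth <- function(x) 3 * (x - 4)^2 + 4 * x   # mean function used in the simulation code
y <- truth(x) + rnorm(n, 0, 10)

# Five knots: three interior, two boundary. Beyond the boundary knots
# a natural spline is constrained to be linear.
k <- quantile(x, c(0.05, 0.275, 0.5, 0.725, 0.95))   # assumed knot placement
mns <- lm(y ~ ns(x, knots = k[2:4], Boundary.knots = k[c(1, 5)]))

grid <- seq(min(x), max(x), length.out = 200)
spline_fit <- predict(mns, newdata = data.frame(x = grid))

plot(x, y, col = "grey60", pch = 16)
lines(grid, truth(grid), lwd = 2)                 # true curve
lines(grid, spline_fit, col = "blue", lwd = 2)    # spline fit
```

This fit has an intercept plus four spline terms, the same count as the rms table. The coefficients themselves are not comparable, because the two packages build the spline basis differently. With rms itself the usual route to a plot is, as far as I know, `datadist()` followed by `Predict()`, but check the package documentation.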

Don’t categorize IVs. It doesn’t work. There are better methods (and if you’d like me to write a post with more about splines and such, let me know in the comments).
