Friday, July 6, 2018

The "control" variables

Where I want to start trying to understand "regression".


Pondering The Real Effects of Debt by Cecchetti et al

I've never done regression. So maybe I don't know what I'm talking about. It wouldn't be the first time that happened. If I have something wrong, please do point it out.


I googled controlling for variables in regression. Got a bunch of quotables. Here are the first three:
What does it mean when you control for a variable?
Importantly, regression automatically controls for every variable that you include in the model. What does it mean to control for the variables in the model? It means that when you look at the effect of one variable in the model, you are holding constant all of the other predictors in the model.
A Tribute to Regression Analysis - Minitab Blog
blog.minitab.com/blog/adventures-in-statistics-2/a-tribute-to-regression-analysis
Control variables are variables that you "are holding constant".
Why do you control for variables?
The number of dependent variables in an experiment varies, but there can be more than one. Experiments also have controlled variables. Controlled variables are quantities that a scientist wants to remain constant, and she must observe them as carefully as the dependent variables.
Variables in Your Science Fair Project - Science Buddies
https://www.sciencebuddies.org/science-fair-projects/science-fair/variables
Control variables are variables you want "to remain constant".
Why do we need to control variables?
The control variable strongly influences experimental results, and it is held constant during the experiment in order to test the relative relationship of the dependent and independent variables. The control variable itself is not of primary interest to the experimenter.
Control variable - Wikipedia
https://en.wikipedia.org/wiki/Control_variable
Control variables are variables that are "held constant".

That's interesting because, normally, a value is either a constant or a variable.


The same Google search turned up this bit:
What are control variables in regression?
A control variable enters a regression in the same way as an independent variable - the method is the same. But the interpretation is different. Control variables are usually variables that you are not particularly interested in, but that are related to the dependent variable.
What are control variables and how do I use them in regression ...
https://www.quora.com/What-are-control-variables-and-how-do-I-use-them-in-regressio...
This, okay, I like it; this makes some sense to me.

The dependent variable is, say, economic growth. We want to know what it depends on.

Or actually, we think we know it depends on some "independent" variables, and we think we know what they are. So we use them in our regression, to see if we are correct. For example, we think economic growth depends on debt: on government debt and household debt and corporate debt. So we have three independent variables, all changing, and having or not having an impact on economic growth.

So then we decide to look at these independent variables one at a time. So we pick one and call it the "independent" variable, and we make the other two "control" variables. We hold them constant, so that we can see the effect that the independent variable has on economic growth.

The crude version of this might be simply to ignore the control variables and pretend that the independent variable is the only thing that influences economic growth. In the more sophisticated version you run the test repeatedly, giving each control variable a chance to be the independent variable.
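For comparison, here is a tiny sketch (all numbers invented) of the mechanics the Quora answer describes: the control variables stay in the equation as extra columns, and the fitting procedure estimates every coefficient at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented data for the three kinds of debt in the post
gov_debt = rng.normal(size=n)
hh_debt = rng.normal(size=n)
corp_debt = rng.normal(size=n)

# An invented "true" relationship, plus a little noise
growth = 0.5 * gov_debt + 1.0 * hh_debt - 0.3 * corp_debt \
         + rng.normal(scale=0.1, size=n)

# "Controlling for" household and corporate debt means including
# them as columns next to government debt, not freezing their values.
X = np.column_stack([np.ones(n), gov_debt, hh_debt, corp_debt])
coefs, *_ = np.linalg.lstsq(X, growth, rcond=None)
print(coefs.round(2))  # intercept near 0, then roughly 0.5, 1.0, -0.3
```

Whether this machinery makes economic sense when the different debts move together is, of course, the question the rest of the post worries about.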


Maybe I'm misinterpreting what it means to hold something "constant". But it seems pretty strange to me, this idea that if you make something a control variable, it suddenly has no effect on the outcome.

I mean, suppose we take Federal debt as our independent variable. And we make household debt and corporate debt our control variables, holding them constant. So then we can see how the Federal debt affects economic growth.

But this is some kind of nonsense, because economic growth is influenced by household debt and by corporate debt as well as by the Federal debt. If we hold household and corporate debt constant, what the regression tells us is that all the effects of debt on economic growth are due to the Federal debt and the Federal debt alone.

Suppose for example that increases in household and corporate debt always boost economic growth. Well, in the real world those increases really happened. Therefore, economic growth was boosted up by those increases. But our regression tells us that all of that boost was due to the Federal debt. Pretty ridiculous, no?

Suppose instead that increases in household and corporate debt always harm economic growth. Again, in the real world those increases actually happened. Therefore, in the real world, economic growth was held down by those increases. But our regression says the Federal debt is responsible for that poor growth. This is pretty ridiculous, too.

Yet until 2008 or so, that's how economics was done. Some economists still do it that way, asserting that all our economic troubles are due to the Federal debt and none to any other debt. To my mind, that's obviously wrong.

I don't know anything about regression, so I probably shouldn't talk. But it seems to me the idea that you can just "control" some variables to find out how your one "independent" variable affects economic growth is more grotesquely absurd than anything that was done before 2008.

The outcome we got is the outcome we got. You can see it on a graph of GDP growth. Even if it is true that growth is affected by all the components of debt, we're not going to see the GDP growth of the last 70 years change when we hold most of those components constant.

Let's say that the Federal debt is always bad for economic growth; and that household and corporate debt is always good for economic growth. (Until 2008, this is more or less what economists thought.) Well, if we control for household and corporate debt, we should see only the bad effects of the Federal debt. Economic growth should be less. The growth numbers should go down. But that doesn't happen. The growth numbers don't change.

If we suddenly switch variables then, controlling for household and Federal debt, we should see only the good effects from corporate debt. Economic growth should be higher. But obviously, it isn't. The growth numbers don't change. The outcome doesn't change retroactively when you control for different variables.


I was a math major in school. But I've never done regression. So maybe I don't know what the hell I'm talking about. If I have something wrong, or if I seem to have something wrong, please do point it out.

8 comments:

The Arthurian said...

"Suppose for example that increases in household and corporate debt always boost economic growth... Suppose instead that increases in household and corporate debt always harm economic growth."

For the record, it is more reasonable to suppose that increases in household and corporate debt tend to boost economic growth when the level of that debt is low, and tend to harm economic growth when the level is high.

Jerry said...

IF the function is linear, then it's totally valid to consider the variables one at a time. (this is mathematically provable)

Maybe like -- this defines a plane tilted through space:
z = x + 2y

I start out standing at x=1, y=1 (so z=3). If I move to x=2, y=2, how far up do I go?
I can hold y constant at y=1, and move x to 2 (z increases by 1). Then I can independently think about holding x constant at x=1 and moving y to 2 (z increases by 2). Adding those two independent increases together (z+1, z+2) gives me the right answer (3+1+2 = z = 6 = 2+2*2).
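This decomposition is easy to verify; a few lines of Python (mine, not part of Jerry's comment) confirm the arithmetic:

```python
def z(x, y):
    return x + 2 * y  # the tilted plane

dx = z(2, 1) - z(1, 1)     # hold y, move x: +1
dy = z(1, 2) - z(1, 1)     # hold x, move y: +2
both = z(2, 2) - z(1, 1)   # move both at once: +3

# For a linear function the separate effects add up exactly.
print(dx + dy == both)  # True
```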

But if the function is not linear, then it's less valid. e.g., instead of a flat plane, consider this curvy shape:
z = x * y

starting at x=1,y=1 gives me z=1
holding y constant at y=1, moving x to 2, would increase z to 2 (so the difference is +1 (=2-1)).
holding x constant at x=1, moving y to 2, would also increase z to 2.
Adding up those independent considerations might lead me to expect that increasing both x and y to 2 simultaneously (i.e. without holding anything constant) would increase z to 3 (= 1 + 1 + 1).
But, that is not correct because 2*2=4, not 3.
This error is because the function is nonlinear.

So: linear regression is only a completely valid thing to do if the system you're looking at is a linear system.

But, in practice, it can be a "mostly valid" thing to do if the changes you're looking at are small. This is basically because as long as you're only moving around a little bit on the curved surface, it looks flat (like the earth). It's "approximately linear". So then, to the extent that the approximation is valid, the regression works.

In our z=x*y example, when we moved by 1, our relative error in the difference was something like 33% (off by one, on a total of 3). But if we only moved by 0.1 instead of by 1, we would have less relative error. Adding up the two 0.1's in z difference gives a 0.2 difference in z "expected". The correct answer is 0.21. So this is only an error of 5%, because the curvy shape is "pretty flat" as long as you're only moving around by 0.1, and the linear approximation is closer to correct. Whereas it is not very flat at all if you are moving around by 1.
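The 33% and 5% figures check out; a short Python sketch (mine, not part of the comment) reproduces them:

```python
def z(x, y):
    return x * y  # the curvy surface

# Relative error from adding up the two "hold one constant" moves,
# compared with actually moving both coordinates at once.
def rel_error(step):
    predicted = (z(1 + step, 1) - z(1, 1)) + (z(1, 1 + step) - z(1, 1))
    actual = z(1 + step, 1 + step) - z(1, 1)
    return (actual - predicted) / actual

print(round(rel_error(1.0), 3))  # 0.333: about 33% for a big move
print(round(rel_error(0.1), 3))  # 0.048: about 5% for a small move
```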

So: linear regression can be a "mostly valid" way to examine a nonlinear system, as long as you're studying small movements in the system. As the movements get larger, the regression gets less valid (more error).

How you can tell whether a movement is "large" or "small", in this sense, can be a little subtle. Looking at the earth or at the plane or the saddle curve, it's not hard to figure out. But I remember a lot of physics homework problems (and published papers, for that matter) coming out wrong if you get that "linear approximation" part wrong.


Jerry said...

"suppose that increases in household and corporate debt tend to boost economic growth when the level of that debt is low, and tend to harm economic growth when the level is high" -- this is describing a non-linear curve where you would have to think more carefully about using linear regression.

The regression would work when you're low on the curve ("adding debt always boosts growth, i can use the regression to find out how many dollars of growth a dollar of debt will get me") or high on the curve ("adding debt always harms growth, i can use the regression to figure out how many dollars of growth will be lost by adding a dollar of debt") -- but it would not work if you're looking at a larger set of data that spans both sides of the curve. Since it's not flat.
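Jerry's point about data spanning both sides of the curve can be illustrated with a made-up, hump-shaped debt-versus-growth curve (purely illustrative numbers):

```python
import numpy as np

# Invented hump: growth rises with debt at low levels, falls at high levels.
def growth(debt):
    return debt * (2.0 - debt)

# Slope of a straight-line fit to the curve over the given debt levels.
def fitted_slope(debt_levels):
    X = np.column_stack([np.ones(len(debt_levels)), debt_levels])
    coefs, *_ = np.linalg.lstsq(X, growth(debt_levels), rcond=None)
    return coefs[1]

low = np.linspace(0.0, 0.5, 50)    # staying on the low-debt side
full = np.linspace(0.0, 2.0, 50)   # spanning both sides of the hump

print(fitted_slope(low))   # clearly positive: more debt, more growth
print(fitted_slope(full))  # near zero: the two sides average out
```

Within either half, the fitted slope is a fair local description; across the whole range, it describes nothing.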

The Arthurian said...

Great explanation, Jerry! Thank you.

The Arthurian said...

A little more on "control for" ...

See Wage discrimination and the dangers of ‘controlling for’ confounders at Syll's.

The Arthurian said...

Here's my complaint, from the post:
"So then we decide to look at these independent variables one at a time. So we pick one and call it the "independent" variable, and we make the other two "control" variables. We hold them constant, so that we can see the effect that the independent variable has on economic growth...

But it seems pretty strange to me, this idea that if you make something a control variable, it suddenly has no effect on the outcome.
"

I'm looking at a recent post by David Glasner: James Buchanan Calling the Kettle Black. An excerpt:
"The second problem with Buchanan’s position is less straightforward and less well-known, but more important, than the first. The inverse relationship by which Buchanan set such great store is valid only if qualified by a ceteris paribus condition. Demand is a function of many variables of which price is only one. So the inverse relationship between price and quantity demanded is premised on the assumption that all the other variables affecting demand are held (at least approximately) constant...

But in some markets the factors affecting demand are themselves interrelated so that the ceteris paribus assumption can’t be maintained. Such markets can’t be analyzed in isolation, they can only be analyzed as a system in which all the variables are jointly determined.
"

Glasner's second paragraph there, that's my complaint. Let me restate my complaint in his terms:

In some cases the factors affecting growth are themselves interrelated so that the ceteris paribus assumption can’t be maintained. Such cases can’t be analyzed in isolation, they can only be analyzed as a system in which all the variables are jointly determined.

Or as I have it: It seems pretty strange to me, this idea that if you make something a control variable, it suddenly has no effect on the outcome.

I am thinking specifically of the analysis in The Real Effects of Debt by Cecchetti et al.

The Arthurian said...

My complaint:
"Maybe I'm misinterpreting what it means to hold something 'constant'. But it seems pretty strange to me, this idea that if you make something a control variable, it suddenly has no effect on the outcome."

Okay. At Cold and dark stars, in Crisis Theory: The Decline of Capitalism As The Growth of Expensive and Fragile Complexity, a different topic, but a better statement of what I wanted to say:

"[T]he capitalist world system is complex and nonlinear. It is complex because it is made of various interlocking parts (firms, individuals, governments, etc.) that form causal chains that connect across planetary scales. It is nonlinear because the behaviour of the system is not simply the “sum” of the interlocking parts, as the parts depend on each other. Therefore one cannot really study the individual components in isolation and then understand the whole system by adding these components. In other words, interdependence of the units within capitalism makes the system nonlinear."

Exactly.

The Arthurian said...

Gregory Norton writes:
"You want to find the relationship between height at the shoulders and body weight for household pets, but you suspect that the relationship may be different for dogs, cats, rabbits, birds, and hamsters. So you do the regression with species of animal as a control variable: Separate regressions for dogs, cats, rabbits, birds, and hamsters... That is my understanding: control variables are used to select the sample population. Each member of the population will have the same values of the control variables."
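Norton's reading (controls as sample selectors) might look something like this in code, with invented pet data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: each species gets its own height-to-weight slope.
true_slope = {"dog": 3.0, "cat": 1.5, "rabbit": 1.0}
pets = {}
for species, slope in true_slope.items():
    height = rng.uniform(0.2, 0.8, size=50)                   # shoulder height
    weight = slope * height + rng.normal(scale=0.05, size=50)
    pets[species] = (height, weight)

# "Controlling for" species in Norton's sense: one regression per group,
# so every observation within a regression has the same species.
fitted = {}
for species, (height, weight) in pets.items():
    X = np.column_stack([np.ones(len(height)), height])
    coefs, *_ = np.linalg.lstsq(X, weight, rcond=None)
    fitted[species] = coefs[1]
    print(species, round(fitted[species], 2))  # recovers each species' slope
```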

Okay. But suppose you want to find the relationship between DEBT and GDP for THE USA, but you suspect that the relationship may be different for GOVERNMENT, BUSINESS, CONSUMERS, birds, and hamsters. So you do the regression with species of ECONOMIC ACTIVITY as a control variable: Separate regressions for GOVERNMENT, BUSINESS, CONSUMERS, birds, and hamsters.

Well, you can select different measures of debt for government, business and consumers. But you cannot select different measures of GDP. GDP is what it is. It's like evaluating dogs and cats and rabbits using the body weight of all mammals.

Unless... Cecchetti and them do a variety of countries, so they get a variety of 'body weight' measures, a variety of GDP.

But all those GDPs are for countries that have government AND business AND consumers. So you would have to look at nuances of difference between species of economic activity, and hope there is enough information there to get some kind of worthwhile result.

I still think it's iffy.