Generate predictions that are orthogonal (uncorrelated) to a given variable

I have an X matrix, a y variable, and another variable ORTHO_VAR. I need to predict the y variable using X; however, the predictions from that model need to be orthogonal to ORTHO_VAR while being as correlated with y as possible.

I would prefer that the predictions are generated with a non-parametric method such as xgboost.XGBRegressor but I could use a linear method if absolutely necessary.

This code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

# Create regression dataset with two correlated targets
X, y = make_regression(n_features=20, random_state=245, n_targets=2)
# Use a '{}' placeholder so that each feature column gets a unique name
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

# Correlation should be low or preferably zero
pred_corr_w_ortho = df.corr().abs()['yhat']['ortho_var']
assert pred_corr_w_ortho < 0.01, pred_corr_w_ortho

Returns this:

---------------------------------------------------------------------------
AssertionError 
 1 pred_corr_w_ortho = df.corr().abs()['yhat']['ortho_var']
----> 2 assert pred_corr_w_ortho < 0.05, pred_corr_w_ortho

AssertionError: 0.5895885756753665

...and I would like something that maintains as much predictive accuracy as possible while remaining orthogonal to ORTHO_VAR

edited 3 hours ago

Esmailian

3,301420

asked 2 days ago

Chris

1448

This question has an open bounty worth +50 reputation from Chris, ending in 7 days (2019-04-22 12:40:21Z).

This question has not received enough attention.

Could use some help on this one. Need something that produces a yhat that is orthogonal to another variable in the dataset (ORTHO_VAR) but does a good job sorting the target variable (y) in the correct order.

What is the correlation of TARGET with ORTHO_VAR?
– Esmailian
18 hours ago

Good question. They are indeed correlated (let’s say 50%), so the predictions will likely suffer in terms of accuracy by being made orthogonal.
– Chris
12 hours ago

1 Answer

This requirement can be satisfied by adding sufficient noise to the predictions $\hat{y}$ to decorrelate them from the orthogonal values $v$. Ideally, if $\hat{y}$ is already decorrelated from $v$, no noise would be added to $\hat{y}$, and thus $\hat{y}$ would stay maximally correlated with $y$.

Mathematically, we want to create $\hat{y}'=\hat{y}+\epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon)$, to satisfy $$r_{\hat{y}'v} = \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_v} < \delta$$ for an arbitrary threshold $\delta$. Now, let's expand this inequality to find a lower bound for the std of the noise $\epsilon$, i.e. $\sigma_\epsilon$.
$$\begin{align*}
\sigma_{\hat{y}'}^2&=\sigma_{\hat{y}}^2 + \sigma_\epsilon^2,\\
\sigma_{\hat{y}'v}&=\Bbb E\left[(\hat{y}+\epsilon - \mu_{\hat{y}} - \overbrace{\mu_\epsilon}^{=0})(v-\mu_v)\right]\\
&=\Bbb E\left[(\hat{y} - \mu_{\hat{y}})(v-\mu_v)\right]+\overbrace{\Bbb E\left[\epsilon(v-\mu_v)\right]}^{=0}\\
&=\sigma_{\hat{y}v},\\
r_{\hat{y}'v} &= \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_v} =\frac{\sigma_{\hat{y}v}}{\sigma_v \sqrt{\sigma_{\hat{y}}^2+\sigma_\epsilon^2}} < \delta\\
&\Rightarrow \sigma_{\hat{y}}\sqrt{\left(\frac{r_{\hat{y}v}}{\delta}\right)^2 - 1} < \sigma_\epsilon
\end{align*}$$

Since all the variables on the left side of the inequality can be calculated, we can sample noise from $\mathcal{N}(0, \sigma_\epsilon)$ and add it to $\hat{y}$ to satisfy the original inequality.
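As a quick sanity check of the bound, here is a minimal sketch on synthetic data, using NumPy only. The helper name `noise_std_lower_bound`, the stand-in arrays `v` and `y_hat`, the threshold `delta = 0.1`, and the 20% safety margin are all assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

def noise_std_lower_bound(pred, ortho, delta):
    """Lower bound on the noise std that brings |corr(pred + noise, ortho)| below delta."""
    r = abs(np.corrcoef(pred, ortho)[0, 1])
    if r <= delta:
        return 0.0  # already within the threshold, no noise needed
    return np.std(pred, ddof=1) * np.sqrt((r / delta) ** 2 - 1.0)

rng = np.random.default_rng(0)
n = 200_000
v = rng.normal(size=n)                      # hypothetical stand-in for ORTHO_VAR
y_hat = 0.6 * v + 0.8 * rng.normal(size=n)  # hypothetical predictions, corr with v about 0.6

delta = 0.1
sigma_eps = noise_std_lower_bound(y_hat, v, delta)

# Sample noise slightly above the bound (20% margin, an arbitrary choice)
noisy = y_hat + rng.normal(0.0, 1.2 * sigma_eps, size=n)

corr_before = np.corrcoef(y_hat, v)[0, 1]
corr_after = np.corrcoef(noisy, v)[0, 1]
print('corr before:', corr_before)
print('corr after: ', corr_after)
```

The margin is there because the bound is strict and the sample correlation fluctuates around its population value, so sampling exactly at the bound may land on either side of `delta`.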

Here is the code that applies this bound to the example from the question:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

seed = 245
# Create regression dataset with two correlated targets
X, y = make_regression(n_samples=10000, n_features=20, random_state=seed, n_targets=2)
# Use a '{}' placeholder so that each feature column gets a unique name
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

delta = 0.01

# std of noise required to be added to y_hat to bring the correlation
# of y_hat with ORTHO_VAR below delta
std_y_hat = np.std(df['yhat'], ddof=1)
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
std_noise_lower_bound = std_y_hat * np.sqrt((corr_y_hat_ortho_var / delta)**2 - 1.0)
std_noise = max(0, std_noise_lower_bound) + 1
print('delta: ', delta)
print('std_y_hat: ', std_y_hat)
print('corr_y_hat_target: ', corr_y_hat_target)
print('corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
print('std_noise_lower_bound: ', std_noise_lower_bound)
print('std_noise: ', std_noise)

# add noise
np.random.seed(seed)
noises = np.random.normal(0, std_noise, len(df['yhat']))
noises -= np.mean(noises) # remove slight deviations from zero mean
print('noise_samples: mean:', np.mean(noises), ', std: ', np.std(noises))
df['yhat'] = df['yhat'] + noises

# measure new correlation
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
print('new corr_y_hat_target: ', corr_y_hat_target)
print('new corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
# Correlation should be low or preferably zero
assert corr_y_hat_ortho_var < delta, corr_y_hat_ortho_var
assert -delta < corr_y_hat_ortho_var, corr_y_hat_ortho_var

which outputs:

delta: 0.01
std_y_hat: 69.48568725585938
corr_y_hat_target: 0.8207672834857673
corr_y_hat_ortho_var: 0.7663936356880843
std_noise_lower_bound: 5324.885500165032
std_noise: 5325.885500165032
noise_samples: mean: 1.1059455573558807e-13 , std: 5373.914830034988
new corr_y_hat_target: -0.004125016071865934
new corr_y_hat_ortho_var: -0.000541131379457552

You can experiment with other deltas. By comparing std_y_hat with std_noise_lower_bound, you can see that a huge amount of noise must be added to $\hat{y}$ to bring its correlation with $v$ below $0.01$, which dramatically decorrelates $\hat{y}$ from $y$ too.
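The reason both correlations collapse together can be sketched numerically. Noise that is independent of both $v$ and $y$ leaves the covariances unchanged and only inflates the std of the predictions, so every correlation of the predictions is scaled by the same factor. The helper `shrink_factor` below is a hypothetical name introduced for illustration; the four input values are copied from the printed output above.

```python
import math

def shrink_factor(std_pred, std_noise):
    """Factor by which independent zero-mean noise scales every correlation of the predictions."""
    return std_pred / math.sqrt(std_pred ** 2 + std_noise ** 2)

# Values copied from the printed output of the run above
std_y_hat = 69.48568725585938
std_noise = 5325.885500165032
corr_y_hat_target = 0.8207672834857673
corr_y_hat_ortho_var = 0.7663936356880843

k = shrink_factor(std_y_hat, std_noise)
expected_corr_ortho = k * corr_y_hat_ortho_var
expected_corr_target = k * corr_y_hat_target
print('shrink factor:          ', k)
print('expected corr with v:   ', expected_corr_ortho)
print('expected corr with y:   ', expected_corr_target)
```

Both expected correlations come out near 0.01, so under this approach the correlation with the target cannot stay high once the correlation with ORTHO_VAR has been pushed that low. The measured values differ from these population values because of sampling noise.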

Note: The assertion might fail for very small thresholds $\delta$ due to an insufficient sample count.
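To get a feel for the sample-count issue, the following sketch estimates how much the sample correlation of two independent variables fluctuates for a given number of rows. The function `corr_spread` and the trial count are assumptions chosen for illustration. The spread is roughly $1/\sqrt{n}$, which is about 0.1 for the 100 rows that `make_regression` generates by default and about 0.01 for the 10000 rows used above.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_spread(n, trials=500):
    """Std of the sample correlation between two independent normal vectors of length n."""
    corrs = [
        np.corrcoef(rng.normal(size=n), rng.normal(size=n))[0, 1]
        for _ in range(trials)
    ]
    return float(np.std(corrs))

spread_small = corr_spread(100)     # default sample count of make_regression
spread_large = corr_spread(10_000)  # sample count used in the code above

print('n=100:   spread', spread_small, ' 1/sqrt(n)', 1 / np.sqrt(100))
print('n=10000: spread', spread_large, ' 1/sqrt(n)', 1 / np.sqrt(10_000))
```

A threshold well below this spread is hard to meet reliably with a single noise draw, whatever the noise std.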

answered 9 hours ago

Esmailian

3,301420

Is the lack of correlation between yhat and the target variable related to the high correlation between the two target variables (i.e. my fault)? Ideally we would want new corr_y_hat_target to be as high as possible, with new corr_y_hat_ortho_var as low as possible.
– Chris
9 hours ago

@Chris Indirectly, yes. If target had a low correlation with orth, y_hat (which has a high correlation with target) would also have had a low correlation with orth. As a result, little noise would have been added to y_hat, and its correlation with target would have changed only slightly.
– Esmailian
9 hours ago
