Generate predictions that are orthogonal (uncorrelated) to a given variable

I have an X matrix, a y variable, and another variable ORTHO_VAR. I need to predict y using X; however, the predictions from that model need to be orthogonal to ORTHO_VAR while being as correlated with y as possible.

I would prefer that the predictions are generated with a non-parametric method such as xgboost.XGBRegressor, but I could use a linear method if absolutely necessary.

This code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

# Create regression dataset with two correlated targets
X, y = make_regression(n_features=20, random_state=245, n_targets=2)
# Give every feature column a distinct name (var0, var1, ...)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

# Correlation should be low or preferably zero
pred_corr_w_ortho = df.corr().abs()['yhat']['ortho_var']
assert pred_corr_w_ortho < 0.01, pred_corr_w_ortho

Returns this:

---------------------------------------------------------------------------
AssertionError
1 pred_corr_w_ortho = df.corr().abs()['yhat']['ortho_var']
----> 2 assert pred_corr_w_ortho < 0.05, pred_corr_w_ortho

AssertionError: 0.5895885756753665

...and I would like something that maintains as much predictive accuracy as possible while remaining orthogonal to ORTHO_VAR.





This question has an open bounty worth +50 reputation from Chris, ending in 7 days (at 2019-04-22 12:40:21Z).

This question has not received enough attention.

Could use some help on this one. I need something that produces a yhat that is orthogonal to another variable in the dataset (ORTHO_VAR) but still does a good job of sorting the target variable (y) into the correct order.











  • What is the correlation of TARGET with ORTHO_VAR?
    – Esmailian, 18 hours ago










  • Good question. They are indeed correlated (let's say 50%), so the predictions will likely suffer in terms of accuracy by being made orthogonal.
    – Chris, 12 hours ago















Tags: correlation

asked 2 days ago by Chris; edited 3 hours ago by Esmailian






1 Answer
This requirement can be satisfied by adding sufficient noise to the predictions $\hat{y}$ to decorrelate them from the orthogonal values $v$. Ideally, if $\hat{y}$ is already decorrelated from $v$, no noise would be added to $\hat{y}$, and $\hat{y}$ would remain maximally correlated with $y$.

Mathematically, we want to create $\hat{y}'=\hat{y}+\epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon})$, drawn independently of $\hat{y}$ and $v$, to satisfy $$r_{\hat{y}'v} = \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_v} < \delta$$ for an arbitrary threshold $\delta$. Now, let's expand this inequality to find a lower bound for the std of the noise $\epsilon$, i.e. $\sigma_{\epsilon}$.
$$\begin{align*}
\sigma_{\hat{y}'}^2&=\sigma_{\hat{y}}^2 + \sigma_{\epsilon}^2,\\
\sigma_{\hat{y}'v}&=\Bbb E\left[(\hat{y}+\epsilon - \mu_{\hat{y}} - \overbrace{\mu_{\epsilon}}^{=0})(v-\mu_v)\right]\\
&=\Bbb E\left[(\hat{y} - \mu_{\hat{y}})(v-\mu_v)\right]+\overbrace{\Bbb E\left[\epsilon(v-\mu_v)\right]}^{=0}\\
&=\sigma_{\hat{y}v},\\
r_{\hat{y}'v} &= \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_v} =\frac{\sigma_{\hat{y}v}}{\sigma_v \sqrt{\sigma_{\hat{y}}^2+\sigma_{\epsilon}^2}} < \delta\\
&\Rightarrow \sigma_{\hat{y}}\sqrt{\left(\frac{r_{\hat{y}v}}{\delta}\right)^2 - 1} < \sigma_{\epsilon}
\end{align*}$$

Since every quantity on the left-hand side of the inequality can be computed from the data, we can sample noise from $\mathcal{N}(0, \sigma_{\epsilon})$ and add it to $\hat{y}$ to satisfy the original inequality.

Here is code that does exactly this:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

seed = 245
# Create regression dataset with two correlated targets
X, y = make_regression(n_samples=10000, n_features=20, random_state=seed, n_targets=2)
# Give every feature column a distinct name (var0, var1, ...)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

delta = 0.01

# std of noise required to be added to y_hat to bring the correlation
# of y_hat with ORTHO_VAR below delta
std_y_hat = np.std(df['yhat'], ddof=1)
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
# Clamp at zero so the square root stays real when |corr| is already below delta
std_noise_lower_bound = std_y_hat * np.sqrt(max(0.0, (corr_y_hat_ortho_var / delta)**2 - 1.0))
# Add a small margin so the noise std is strictly above the bound
std_noise = std_noise_lower_bound + 1
print('delta: ', delta)
print('std_y_hat: ', std_y_hat)
print('corr_y_hat_target: ', corr_y_hat_target)
print('corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
print('std_noise_lower_bound: ', std_noise_lower_bound)
print('std_noise: ', std_noise)

# add noise
np.random.seed(seed)
noises = np.random.normal(0, std_noise, len(df['yhat']))
noises -= np.mean(noises) # remove slight deviations from zero mean
print('noise_samples: mean:', np.mean(noises), ', std: ', np.std(noises))
df['yhat'] = df['yhat'] + noises

# measure new correlation
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
print('new corr_y_hat_target: ', corr_y_hat_target)
print('new corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
# Correlation should be low or preferably zero
assert corr_y_hat_ortho_var < delta, corr_y_hat_ortho_var
assert -delta < corr_y_hat_ortho_var, corr_y_hat_ortho_var

which outputs:

delta: 0.01
std_y_hat: 69.48568725585938
corr_y_hat_target: 0.8207672834857673
corr_y_hat_ortho_var: 0.7663936356880843
std_noise_lower_bound: 5324.885500165032
std_noise: 5325.885500165032
noise_samples: mean: 1.1059455573558807e-13 , std: 5373.914830034988
new corr_y_hat_target: -0.004125016071865934
new corr_y_hat_ortho_var: -0.000541131379457552

You can experiment with other deltas. By comparing std_y_hat with std_noise_lower_bound, you can see that a huge amount of noise must be added to $\hat{y}$ to bring its correlation with $v$ below $0.01$, which dramatically decorrelates $\hat{y}$ from $y$ too.

Note: The assertion might fail for very small thresholds $\delta$ due to insufficient sample count.
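As a quick sanity check on the lower bound derived in the answer, the sketch below plugs the values printed in the answer's output into the formula, using only the standard library. The function name `noise_std_lower_bound` is made up for illustration, and the clamp to zero (no noise needed when the correlation is already within the threshold) is an added assumption, not part of the derivation.

```python
import math

def noise_std_lower_bound(std_y_hat, corr_y_hat_v, delta):
    """Hypothetical helper: sigma_yhat * sqrt((r / delta)^2 - 1), clamped at 0."""
    ratio_sq = (corr_y_hat_v / delta) ** 2
    # Assumption: when |r| <= delta the bound is taken as 0 (no noise required).
    return std_y_hat * math.sqrt(max(0.0, ratio_sq - 1.0))

# Values copied from the output printed in the answer.
std_y_hat = 69.48568725585938
corr_y_hat_v = 0.7663936356880843
delta = 0.01

bound = noise_std_lower_bound(std_y_hat, corr_y_hat_v, delta)

# Population correlation with v once noise of exactly this std has been added:
# r * sigma_yhat / sqrt(sigma_yhat^2 + sigma_eps^2), which should equal delta.
corr_at_bound = corr_y_hat_v * std_y_hat / math.sqrt(std_y_hat ** 2 + bound ** 2)

print('std_noise_lower_bound:', bound)
print('corr with v at the bound:', corr_at_bound)
```

The bound should agree with the `std_noise_lower_bound` line in the output above, and the correlation at the bound lands on `delta`, which is why the answer's code adds a small margin on top.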












  • Is the lack of correlation between yhat and the target variable related to the high correlation between the two target variables (i.e. my fault)? Ideally we would want new corr_y_hat_target to be as high as possible and new corr_y_hat_ortho_var to be as low as possible.
    – Chris, 9 hours ago











  • @Chris Indirectly yes. If target had a low correlation with orth, y_hat (which has a high correlation with target) would also have had a low correlation with orth. As a result, less noise would have been added to y_hat and its correlation with target would have changed only slightly.
    – Esmailian, 9 hours ago
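To put rough numbers on this comment: under the answer's setup, noise that is independent of everything else shrinks every correlation of y_hat by the same factor, $\sigma_{\hat{y}}/\sqrt{\sigma_{\hat{y}}^2+\sigma_{\epsilon}^2}$, and at the lower bound that factor is $\delta/|r_{\hat{y}v}|$. The sketch below applies this to the reported target correlation. The function name is hypothetical, and every `corr_y_hat_v` value except the first is an invented illustration, not a measured result.

```python
def target_corr_after_noise(corr_y_hat_target, corr_y_hat_v, delta):
    """Hypothetical helper: expected correlation with the target after adding
    noise whose std sits exactly at the lower bound from the answer."""
    if abs(corr_y_hat_v) <= delta:
        # Already within the threshold: no noise is added, nothing changes.
        return corr_y_hat_target
    # Every correlation is scaled by delta / |r_yhat_v| at the bound.
    return corr_y_hat_target * delta / abs(corr_y_hat_v)

corr_y_hat_target = 0.8207672834857673  # reported in the answer's output
delta = 0.01

# First value is the reported correlation; the rest are made-up scenarios.
for corr_y_hat_v in (0.7663936356880843, 0.1, 0.02, 0.005):
    kept = target_corr_after_noise(corr_y_hat_target, corr_y_hat_v, delta)
    print('corr_y_hat_v =', corr_y_hat_v, '-> expected corr with target =', kept)
```

These are population values. A sample correlation near zero typically scatters by about 1/sqrt(n), so with 10000 samples a measured value can easily differ from the expected one by around 0.01.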












3












3








3





$begingroup$

This requirement can be satisfied by adding sufficient noise to predictions $haty$ to decorrelate them from orthogonal values $v$. Ideally, if $haty$ is already decorrelated from $v$, no noise would be added to $haty$, thus $haty$ would be maximally correlated with $y$.



Mathematically, we want to create $haty'=haty+epsilon$ from $epsilon sim mathcalN(0, sigma_epsilon)$, to satisfy $$r_haty'v = fracsigma_haty'vsigma_haty'sigma_v < delta$$ for arbitrary threshold $delta$. Now, lets expand this inequality to find a lower-bound for std of noise $epsilon$, i.e. $sigma_epsilon$.
$$beginalign*
sigma_haty'^2&=sigma_haty^2 + sigma_epsilon^2,\
sigma_haty'v&=Bbb Eleft[(haty+epsilon - mu_haty - overbracemu_epsilon^=0)(v-mu_v)right]\
&=Bbb Eleft[(haty - mu_haty)(v-mu_v)right]+overbraceBbb Eleft[epsilon(v-mu_v)right]^=0&\
&=sigma_hatyv,\
r_haty'v &= fracsigma_haty'vsigma_haty'sigma_v =fracsigma_hatyvsigma_v sqrtsigma_haty^2+sigma_epsilon^2 < delta\
&Rightarrow sigma_hatysqrtleft(fracr_hatyvdeltaright)^2 - 1 < sigma_epsilon
endalign*$$



Since all the variables in the left side of inequality can be calculated, we can sample noises from $mathcalN(0, sigma_epsilon)$ and add them to $haty$ to satisfy the original inequality.



Here is a code that does the exact same thing:



import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

seed = 245
# Create regression dataset with two correlated targets
X, y = make_regression(n_samples=10000, n_features=20, random_state=seed, n_targets=2)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

delta = 0.01

# std of noise required to be added to y_hat to bring the correlation
# of y_hat with ORTHO_VAR below delta
std_y_hat = np.std(df['yhat'], ddof=1)
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
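# clip at zero so that no NaN is produced when |corr| is already below delta
std_noise_lower_bound = std_y_hat * np.sqrt(max(0.0, (corr_y_hat_ortho_var / delta)**2 - 1.0))
std_noise = std_noise_lower_bound + 1  # small margin above the lower bound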
print('delta: ', delta)
print('std_y_hat: ', std_y_hat)
print('corr_y_hat_target: ', corr_y_hat_target)
print('corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
print('std_noise_lower_bound: ', std_noise_lower_bound)
print('std_noise: ', std_noise)

# add noise
np.random.seed(seed)
noises = np.random.normal(0, std_noise, len(df['yhat']))
noises -= np.mean(noises) # remove slight deviations from zero mean
print('noise_samples: mean:', np.mean(noises), ', std: ', np.std(noises))
df['yhat'] = df['yhat'] + noises

# measure new correlation
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
print('new corr_y_hat_target: ', corr_y_hat_target)
print('new corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
# Correlation should be low or preferably zero
assert corr_y_hat_ortho_var < delta, corr_y_hat_ortho_var
assert -delta < corr_y_hat_ortho_var, corr_y_hat_ortho_var


which outputs:



delta: 0.01
std_y_hat: 69.48568725585938
corr_y_hat_target: 0.8207672834857673
corr_y_hat_ortho_var: 0.7663936356880843
std_noise_lower_bound: 5324.885500165032
std_noise: 5325.885500165032
noise_samples: mean: 1.1059455573558807e-13 , std: 5373.914830034988
new corr_y_hat_target: -0.004125016071865934
new corr_y_hat_ortho_var: -0.000541131379457552


You can experiment with other deltas. By comparing std_y_hat with std_noise_lower_bound, you can see that a huge amount of noise must be added to $\hat{y}$ to bring its correlation with $v$ below $0.01$, which dramatically decorrelates $\hat{y}$ from $y$ too.
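As a rough check, the printed numbers can be plugged back into the bound. The last two lines apply the same covariance argument to $y$ instead of $v$, which additionally assumes that $\epsilon$ is uncorrelated with $y$; they are expected values, so the sampled correlations will scatter around them.

$$\begin{align*}
\sigma_{\hat{y}}\sqrt{\left(\frac{r_{\hat{y}v}}{\delta}\right)^2 - 1} &\approx 69.486\sqrt{\left(\frac{0.7664}{0.01}\right)^2 - 1} \approx 5324.9,\\
\frac{\sigma_{\hat{y}}}{\sqrt{\sigma_{\hat{y}}^2+\sigma_\epsilon^2}} &\approx \frac{69.486}{\sqrt{69.486^2 + 5325.9^2}} \approx 0.0130,\\
r_{\hat{y}'v} &\approx 0.7664 \times 0.0130 \approx 0.0100,\\
r_{\hat{y}'y} &\approx 0.8208 \times 0.0130 \approx 0.0107
\end{align*}$$

So the noise needed to push the correlation with $v$ down to about $0.01$ is expected to push the correlation with $y$ down to about $0.011$ as well.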



Note: The assertion might fail for very small thresholds $\delta$ due to insufficient sample count.
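To get a feel for how small $\delta$ can be for a given sample count, here is a minimal sketch (not part of the code above; the helper name `sample_corr_spread` and its parameters are made up for illustration). It estimates how much the sample correlation of two independent normal vectors fluctuates and prints it next to the usual rule of thumb $1/\sqrt{n}$.

```python
import numpy as np

def sample_corr_spread(n_samples, n_trials=500, seed=0):
    """Std of the sample Pearson correlation between two independent
    standard-normal vectors of length n_samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    corrs = [
        np.corrcoef(rng.standard_normal(n_samples),
                    rng.standard_normal(n_samples))[0, 1]
        for _ in range(n_trials)
    ]
    return float(np.std(corrs))

# Compare the simulated spread with the 1/sqrt(n) rule of thumb
for n in (100, 1000, 10000):
    print(n, sample_corr_spread(n), 1 / np.sqrt(n))
```

With 10000 samples the spread is roughly $0.01$, the same order as the $\delta$ used above, so thresholds near or below $1/\sqrt{n}$ are hard to verify reliably on a single sample.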















$endgroup$













answered 9 hours ago









Esmailian











  • $begingroup$
    Is the lack of correlation between yhat and the target variable related to the high correlation between the two target variables (i.e. my fault)? Ideally we would want new corr_y_hat_target to be as high as possible and new corr_y_hat_ortho_var to be as low as possible.
    $endgroup$
    – Chris
    9 hours ago











  • $begingroup$
    @Chris Indirectly yes. If target had a low correlation with orth, y_hat (which has a high correlation with target) would also have had a low correlation with orth. As a result, only a little noise would have been added to y_hat, and its correlation with target would have changed only slightly.
    $endgroup$
    – Esmailian
    9 hours ago


































