Why did subsetting versus including a factor change the coefficients of a logistic regression in R?

The coefficients changed a lot between a model that includes every level of a factor and a model fitted to the subset of the data for just one level of that factor.

I am fitting a logistic regression of disease on contact exposure. The data come from several sites, so in the first model (ml1) I included site as a factor. In the second model (ml2) I focused on a single site, WB, by restricting the data to that site with the subset argument.

    ml1 <- glm(disease ~ x + factor(site) + factor(anycontact) + factor(comecat),
               data = gianalysis_bd, family = binomial)
    summary(ml1)

    Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
    (Intercept)          -3.44400    0.25761 -13.369  < 2e-16 ***
    x                     0.24559    0.08309   2.956 0.003121 **
    factor(site)FB        0.03967    0.15177   0.261 0.793792
    factor(site)GB       -0.54896    0.16538  -3.319 0.000902 ***
    factor(site)HB        0.39635    0.14699   2.696 0.007010 **
    factor(site)SB       -0.13887    0.14347  -0.968 0.333069
    factor(site)WB       -0.06200    0.14647  -0.423 0.672067
    factor(site)WP       -0.03706    0.15388  -0.241 0.809677
    factor(anycontact)1   0.40856    0.06846   5.968 2.41e-09 ***
    factor(comecat)2      0.02260    0.07184   0.315 0.753037
    factor(comecat)3      0.11195    0.07574   1.478 0.139405

    ml2 <- glm(disease ~ x + factor(anycontact) + factor(comecat),
               data = gianalysis_bd, subset = site == "WB", family = binomial)
    summary(ml2)

    Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)          -3.4016     0.4347  -7.825 5.06e-15 ***
    x                     0.1421     0.1454   0.977  0.32834
    factor(anycontact)1   0.7380     0.2590   2.850  0.00438 **
    factor(comecat)2     -0.4049     0.2042  -1.983  0.04738 *
    factor(comecat)3      0.1136     0.2182   0.520  0.60273

However, the coefficient for factor(anycontact)1 changed substantially, increasing from 0.4086 in ml1 to 0.7380 in ml2. I cannot tell why that happened; I expected it to be the same in both models. Can someone explain the difference between the two models and the reason the estimates differ? Thank you very much.







r logistic














asked 5 hours ago by bb ww; edited 2 hours ago by EdM




migrated from stackoverflow.com 3 hours ago


This question came from our site for professional and enthusiast programmers.





































Can you please reword the question to make clear that you are in fact training two different models, one specifically for "WB" and another across all sites?
– behold, 3 hours ago










2 Answers
Without knowing more about the details of your data, it's hard to say precisely what is going on in your case, but here are two possibilities.

First, in any regression model, omitting predictors that are correlated with the included predictors can change the coefficients of the included predictors, and can even go so far as to reverse their signs, as in Simpson's paradox.

Second, in models like logistic or Cox proportional hazards regression, omitting any predictor related to outcome can bias the coefficient values, even if the omitted predictor is not correlated with the included predictors. A linked answer provides an analytic demonstration for a similar approach, probit modeling.
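
The second point can be checked with a small simulation. The sketch below is illustrative only: the variable names (`exposure`, `z`), the effect sizes, and the sample size are assumptions and are not taken from the question's data. Here `z` is generated independently of `exposure`, yet leaving it out of the model should pull the estimated `exposure` coefficient toward zero.

```r
# Illustrative simulation; every name and value here is an assumption.
set.seed(1)
n <- 100000
exposure <- rbinom(n, 1, 0.5)      # binary predictor of interest
z <- rnorm(n, sd = 2)              # strong outcome predictor, independent of exposure
eta <- -1 + 1 * exposure + 1 * z   # true log-odds; the exposure coefficient is 1
y <- rbinom(n, 1, plogis(eta))

full    <- glm(y ~ exposure + z, family = binomial)  # correctly specified
reduced <- glm(y ~ exposure, family = binomial)      # z omitted

b_full    <- unname(coef(full)["exposure"])
b_reduced <- unname(coef(reduced)["exposure"])

# The full model should recover a value near 1; the reduced model a smaller one.
print(c(full = b_full, reduced = b_reduced))
```

Because `z` and `exposure` are independent by construction, the gap between the two estimates cannot be explained by confounding; it reflects the way the logistic model averages over an unmodeled source of variation in the outcome.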



In your example, not only did the coefficient for factor(anycontact)1 change from the full model when the analysis was restricted to the subset, but so did the values and apparent significance of the coefficients for x and factor(comecat)2. I suspect that the reasons for these differences lie in some combination of the correlations among these predictors and how those correlations might change between the entire data set and the subset.
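
One way to probe whether the contact effect really differs by site, rather than varying by chance, is to add a site-by-contact interaction and compare it with the additive model. The sketch below runs on simulated data, not on `gianalysis_bd`: the site labels, effect sizes, and sample size are made up for illustration, and the contact effect is deliberately set larger at site "WB".

```r
# Simulated stand-in data; labels and effect sizes are assumptions.
set.seed(42)
n <- 30000
site <- factor(sample(c("AB", "HB", "WB"), n, replace = TRUE))
anycontact <- rbinom(n, 1, 0.5)
# Assumed truth: log-odds ratio for contact is 0.3 at AB and HB, 1.0 at WB
beta_contact <- ifelse(site == "WB", 1.0, 0.3)
disease <- rbinom(n, 1, plogis(-2 + beta_contact * anycontact))
d <- data.frame(disease, site, anycontact)

additive <- glm(disease ~ site + anycontact, data = d, family = binomial)
with_int <- glm(disease ~ site * anycontact, data = d, family = binomial)
wb_only  <- glm(disease ~ anycontact, data = d,
                subset = site == "WB", family = binomial)

# The additive model forces one common contact effect across sites;
# the subset fit estimates the effect at WB alone.
pooled_est <- unname(coef(additive)["anycontact"])
wb_est     <- unname(coef(wb_only)["anycontact"])

# WB-specific effect implied by the interaction model
wb_from_int <- unname(coef(with_int)["anycontact"] +
                      coef(with_int)["siteWB:anycontact"])

# Likelihood-ratio test of the interaction terms
lrt <- anova(additive, with_int, test = "Chisq")
p_interaction <- lrt[2, "Pr(>Chi)"]

print(c(pooled = pooled_est, wb_only = wb_est,
        wb_from_interaction = wb_from_int, p_interaction = p_interaction))
```

In this simplified setup the subset fit and the interaction model give the same WB-specific estimate, while the additive model reports a compromise across sites. With the real data, a large p-value for the interaction would suggest that the difference between 0.41 and 0.74 is within what sampling variation in the smaller WB subset could produce.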

















answered 2 hours ago by EdM























I think it makes sense for the site-specific model for "WB" to differ from a model for all sites combined.

In terms of sites, there appear to be three groups: "HB", "GB", and "Not HB/GB". Only HB and GB have significant coefficients, with low p-values.

I think that if you run the regression for "Not HB/GB", it should yield a model similar to the one you fitted for "WB" alone. Can you try that and post the results?
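
The suggested refit might look like the following. This is only a sketch on simulated stand-in data: the data frame `d`, its site labels, and the effect sizes are made up for illustration. With the real data one would substitute `gianalysis_bd` and the full model formula from the question.

```r
# Simulated stand-in data; names and effect sizes are assumptions.
set.seed(7)
n <- 40000
d <- data.frame(
  site = factor(sample(c("AB", "FB", "GB", "HB", "WB"), n, replace = TRUE)),
  anycontact = rbinom(n, 1, 0.4)
)
site_shift <- c(AB = 0, FB = 0, GB = -0.5, HB = 0.4, WB = 0)  # assumed site effects
d$disease <- rbinom(
  n, 1,
  plogis(-2 + site_shift[as.character(d$site)] + 0.4 * d$anycontact)
)

# Fit on every site except HB and GB, keeping site as a factor
keep <- !(d$site %in% c("HB", "GB"))
not_hb_gb <- glm(disease ~ factor(site) + factor(anycontact),
                 data = d[keep, ], family = binomial)

# Fit on WB alone, as in the question's second model
wb_only <- glm(disease ~ factor(anycontact),
               data = d, subset = site == "WB", family = binomial)

# Estimate and standard error of the contact coefficient in each fit
contact_rows <- rbind(
  not_hb_gb = coef(summary(not_hb_gb))["factor(anycontact)1", 1:2],
  wb_only   = coef(summary(wb_only))["factor(anycontact)1", 1:2]
)
print(contact_rows)
```

Calling `factor(site)` inside the formula drops the levels that no longer occur after subsetting, so no coefficients are estimated for the excluded sites. The WB-only fit uses fewer observations, so its standard error is expected to be larger.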

















answered 3 hours ago by behold


























