Intuition behind using non-hypercubic kernels in density estimation















Suppose that we perform density estimation in $m$-dimensional space: we estimate the value $p(a)$ at some point $a$ given observations $x_1, \dots, x_n$.



It is known that if a region $A \subset \mathbb{R}^m$ containing $a$ is small enough for the density to be treated as constant on $A$, then we can make the following estimate:
$$ p(a) \approx \frac{k / n}{|A|} $$
where $k$ is the number of observations that lie in $A$ and $|A|$ is the Lebesgue measure of $A$.



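Let the parameter $h$ be small enough for the density to be treated as constant inside the hypercube centered at $a$ with side length $h$. The volume of this hypercube is $h^m$, and a point $x$ lies inside it iff $K\left(\frac{x-a}{h}\right) = 1$, where
$$K(u) = \begin{cases}
1, & \text{if } |u^j| \le \frac{1}{2} \text{ for every coordinate } j = 1, \dots, m \\
0, & \text{otherwise}
\end{cases}$$
and $u^j$ denotes the $j$-th coordinate of $u$. With $u = \frac{x-a}{h}$ the condition reads $\frac{|x^j - a^j|}{h} \le \frac{1}{2}$.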
It's easy to see that the number of observations inside this hypercube equals
$$k = \sum_{i=1}^n K\left(\frac{x_i-a}{h}\right)$$
and so the estimate described above takes the following form:
$$p(a) \approx \frac{1}{n h^m} \sum_{i=1}^n K\left(\frac{x_i-a}{h}\right) $$
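For concreteness, the hypercube estimate above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `box_kernel` and `parzen_estimate` are made up here, and the uniform test data is chosen so that the true density is known to be 1.

```python
import numpy as np

def box_kernel(u):
    """Return 1.0 for rows of u inside the unit hypercube centered at 0, else 0.0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(a, x, h):
    """Estimate p(a) as k / (n * h**m) from observations x of shape (n, m)."""
    x = np.atleast_2d(x)
    n, m = x.shape
    k = box_kernel((x - a) / h).sum()  # number of observations inside the cube
    return k / (n * h ** m)

# Assumed toy data: uniform on the unit square, so the true density is 1 there.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(10_000, 2))
p_hat = parzen_estimate(np.array([0.5, 0.5]), x, h=0.2)
print(p_hat)  # should be close to 1
```

Swapping `box_kernel` for any other non-negative function that integrates to 1 leaves the rest of the code unchanged, which is exactly the step the question asks about.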



We can interpret $K$ as a "weight" given to each observation. One drawback of the hypercubic approach is that all observations lying inside the hypercube get equal weight despite being at different distances from $a$. Another drawback is that the resulting estimate is not continuous. As I understand it, these are the main reasons for using non-hypercubic kernels such as the Gaussian kernel, which gives more weight to points close to $a$ and yields a continuous estimate.



But I have trouble interpreting the use of such kernels. The sum $\sum_{i=1}^n K\left(\frac{x_i-a}{h}\right)$ is no longer equal to $k$, so we can't justify these kernels by the formula $p(a) \approx \frac{k / n}{|A|}$. So here are my questions: how do we justify the use of smooth kernels, and how can one interpret it?



      Thank you for any ideas.







      probability
















      asked Jan 16 '18 at 18:36









Igor



























          2 Answers



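Histograms and other methods based on binning have a number of well-known problems. The choice of anchor point (where the bin grid starts), among other things, can introduce artificial patterns that make interpretation unreliable. Smooth kernels are centered on the data points rather than on a fixed grid, so they don't depend on an anchor point, and they smooth out the noise.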



Smooth kernels also make it easier to get a single overall picture of the data, because the estimate at each location takes neighboring points into account and extends smoothly into areas where no data is observed.
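The anchor-point effect on histograms can be seen on a tiny made-up data set. The sketch below is only illustrative: `histogram_density` is a hypothetical helper, and the six data values are chosen by hand so that shifting the bin origin by half a bin width visibly changes the picture.

```python
import numpy as np

def histogram_density(x, origin, width, n_bins):
    """Histogram density estimate with bins starting at `origin`."""
    edges = origin + width * np.arange(n_bins + 1)
    counts, _ = np.histogram(x, bins=edges)
    return counts / (len(x) * width)

# Assumed toy data: two tight clusters around 1 and 2.
x = np.array([0.9, 1.0, 1.1, 1.9, 2.0, 2.1])

d0 = histogram_density(x, origin=0.0, width=1.0, n_bins=3)  # edges 0, 1, 2, 3
d1 = histogram_density(x, origin=0.5, width=1.0, n_bins=3)  # edges 0.5, 1.5, 2.5, 3.5
print(d0)  # clusters are split across bin edges
print(d1)  # each cluster falls into its own bin
```

The same data and the same bin width give two different shapes, purely because of where the grid starts. A kernel estimate has no such grid to shift.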



                  Smooth kernels can also be justified by their favorable statistical properties. Popular methods like fastKDE use the fact that one can find "an empirical kernel that is optimal in the sense that the integrated, squared difference between the resulting KDE and the true PDF is minimized."

















                  answered Jan 16 '18 at 19:28









oW_
























If we're estimating the density of a continuous distribution, perhaps we should introduce an integral here. A kernel should satisfy $\int_{-\infty}^{\infty}K(x)\,dx = 1$. It is then relatively easy to see that an estimate of $f(x)$, in one dimension $\hat{f}(x) = \frac{1}{nh}\sum_{j=1}^n K\left(\frac{x-x_j}{h}\right)$, has the following property:



$\int_{-\infty}^{\infty}\hat{f}(x)\,dx = \frac{1}{n}\sum_{j=1}^n\int_{-\infty}^{\infty}\frac{1}{h}K\left(\frac{x-x_j}{h}\right)dx$
$= \frac{1}{n}\sum_{j=1}^n 1 = 1$. Since the kernel, and therefore the estimate of the pdf, is also non-negative, our hat function is itself a probability density function.
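The normalization argument can be checked numerically. The following is a minimal sketch assuming a Gaussian kernel, 200 made-up standard normal samples, and a bandwidth of 0.3; the names `gaussian_kernel` and `kde` are illustrative and not from any particular library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: non-negative and integrates to 1."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(grid, x, h):
    """f_hat(t) = 1 / (n h) * sum_j K((t - x_j) / h) for each t in grid."""
    u = (grid[:, None] - x[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(x) * h)

rng = np.random.default_rng(0)
x = rng.normal(size=200)                # assumed toy sample
grid = np.linspace(-10.0, 10.0, 4001)   # wide enough to cover the tails
f_hat = kde(grid, x, h=0.3)

dx = grid[1] - grid[0]
total = float(np.sum(f_hat) * dx)       # Riemann-sum approximation of the integral
print(total)  # should be very close to 1
```

Note that the individual kernel values are weights, not counts: their sum over the sample is generally not an integer, yet the resulting function still integrates to 1.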



Now for a bit more detail: $\hat{f}(x)$ is usually derived from the definition of the derivative of the empirical CDF. So instead of justifying it the way you would a Parzen window, you justify it from what it means to be a pdf and what you want a good estimate of that pdf to be.
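One way to spell out that derivation, as a sketch in one dimension, with $F_n$ denoting the empirical CDF (notation assumed here): approximate the derivative of the CDF by a central difference with step $h$,

$$\hat{f}(x) = \frac{F_n\left(x + \frac{h}{2}\right) - F_n\left(x - \frac{h}{2}\right)}{h} = \frac{1}{nh}\sum_{j=1}^n \mathbf{1}\left[\left|\frac{x - x_j}{h}\right| \le \frac{1}{2}\right],$$

which, up to the convention at the interval endpoints, is the hypercube (box) kernel estimate. Replacing the indicator with any $K \ge 0$ satisfying $\int_{-\infty}^{\infty} K(u)\,du = 1$ gives

$$\hat{f}(x) = \frac{1}{nh}\sum_{j=1}^n K\left(\frac{x - x_j}{h}\right),$$

so the sum is best read as a weighted count of nearby observations rather than a literal count $k$.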



Edit: with regard to kNN and your estimator, I think it's also important to realize that for any fixed point the nearest-neighbor estimate is a kernel estimate. However, it is a different estimate for each point. The kernel estimate still remains a density estimate because each individual term is a density, so overall it is a linear combination of densities. Furthermore, the coefficients of the individual estimates sum to 1.















                          edited Jan 16 '18 at 20:24

























                          answered Jan 16 '18 at 20:09









Tophat


























