


Why keep vocabulary and posting list separate in a search engine















I am taking a class in information retrieval. We learned that the index of a search engine has (possibly among other things):

  1. A vocabulary mapping terms to their statistics (frequency, type, ...) and

  2. A posting list mapping terms to the documents where they occur (with or without positions, fields, ...)

These are separate data structures. I understand why that information is needed and what for, but I don't understand why we want to keep them separate. Why can't we have one data structure that maps terms to both statistics and documents?

My current guess is that the vocabulary is much smaller, so it can be read from memory. We could then use the statistics to drop query terms that are unlikely to be useful, or to look for misspellings in the query, without having to touch the large posting list.

Is this correct, or is there another reason to keep the vocabulary and the posting list separate?
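To make the guess in the question concrete, here is a minimal sketch of pruning a query using only an in-memory vocabulary. Everything in it is an assumption for illustration and not from the course material: the names `TermStats` and `prune_query`, the `max_df_ratio` threshold, and the toy document frequencies.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    doc_freq: int  # number of documents that contain the term


def prune_query(terms, vocabulary, num_docs, max_df_ratio=0.5):
    """Keep query terms that are known and not too common.

    Only the in-memory vocabulary is consulted; no posting list is read.
    """
    kept = []
    for term in terms:
        stats = vocabulary.get(term)
        if stats is None:
            # Unknown term: a candidate for spelling correction.
            continue
        if stats.doc_freq / num_docs > max_df_ratio:
            # Appears in most documents: unlikely to help discriminate.
            continue
        kept.append(term)
    return kept


# Hypothetical statistics for a collection of 1000 documents.
vocabulary = {
    "the": TermStats(doc_freq=980),
    "search": TermStats(doc_freq=120),
    "engine": TermStats(doc_freq=45),
}

print(prune_query(["the", "search", "engnie"], vocabulary, num_docs=1000))
# prints ['search']
```

Here "the" is dropped as too common and the misspelled "engnie" is dropped as unknown, all without touching any postings.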











      information-retrieval search indexing














asked Apr 6 '16 at 15:53 by icehawk (edited Apr 6 '16 at 20:13)



























          1 Answer












There are many reasons (performance, design, storage, compression, evaluation of data structures). The main one is that these structures have been validated in practice, but you can design your own data structure and show a new way of doing it.

Even Google has a paper verifying that its design works well. If you have your own data structure and design, I suggest you choose a collection of appropriate size for your experiment, measure precision and recall on it, and try it out.

Another reason is that the two kinds of information have different requirements for performance, compression, storage and hardware.

The cost per gigabyte is very different for RAM and for hard disk. When you run a lot of servers, cost reduction becomes a reason to favor low-cost storage and modest hardware requirements.

This becomes clear when you have a collection with terabytes of data covering many cultures or many countries (Google has a paper about its cluster).

Keep it simple: a 50 GB collection is a small collection, but how much does a server with 60 GB of RAM cost, and how much does a server with 50 GB of hard disk cost?

The answer is clear, and it is reflected in the data structures: a B-tree is good for secondary memory, while simple hash tables are good and fast in main memory.

But you decide.

A good old reference is chapter 5 of Managing Gigabytes (https://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?s=books&ie=UTF8&qid=1474990622&sr=1-1&keywords=managing+gigabytes)

The chapter shows a table with different results for many data structures and types of memory.

Google's paper shows a way to verify your implementation and design:
http://infolab.stanford.edu/~backrub/google.html

Think about the cluster and jobs:
http://static.googleusercontent.com/media/research.google.com/pt-BR//pubs/archive/43438.pdf
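The split between main memory and secondary memory described in this answer could look roughly like the sketch below: a small hash table (a Python `dict`) kept in RAM that stores each term's document frequency and a byte offset, and a postings file on disk that is read only when a term is actually needed. The function names, the fixed 4-byte encoding of document IDs, and the toy data are all assumptions for illustration; real engines typically compress postings and may use a B-tree or similar structure on disk.

```python
import os
import struct
import tempfile


def build_index(postings, path):
    """Write all posting lists to `path`; return the in-memory vocabulary.

    The vocabulary maps term -> (doc_freq, byte offset into the file).
    """
    vocabulary = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            doc_ids = sorted(postings[term])
            offset = f.tell()
            # Each doc ID is stored as an unsigned 32-bit little-endian int.
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
            vocabulary[term] = (len(doc_ids), offset)
    return vocabulary


def read_postings(term, vocabulary, path):
    """Look the term up in RAM, then do one seek and one read on disk."""
    entry = vocabulary.get(term)
    if entry is None:
        return []  # unknown term: the disk is never touched
    doc_freq, offset = entry
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * doc_freq)
    return list(struct.unpack(f"<{doc_freq}I", data))


# Toy collection (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "postings.bin")
vocab = build_index({"search": [1, 4, 9], "engine": [4, 7]}, path)

print(vocab["search"][0])                    # document frequency, from RAM only
print(read_postings("engine", vocab, path))  # doc IDs, read from disk
```

With this layout the statistics for every term stay cheap to consult, while the much larger posting lists only cost disk I/O for the terms a query actually uses.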























answered Sep 27 '16 at 15:47 by Intruso


























