Why keep vocabulary and posting list separate in a search engine
I am taking a class in information retrieval. We learned that the index of a search engine has (possibly among other things):
- A vocabulary mapping terms to their statistics (frequency, type, ...) and
- A posting list mapping terms to the documents where they occur (with or without positions, fields, ...)
These are separate data structures. I understand why this information is needed and what it is used for, but I don't understand why we want to keep them separate. Why can't we have one data structure that maps terms to both statistics and documents?
My current guess is that the vocabulary is much smaller, so we could keep it in memory. We could then use the statistics to drop query terms that are unlikely to be useful, or to look for misspellings in the query, without having to touch the large posting list.
Is this correct, or is there another reason to keep the vocabulary and the posting list separate?
information-retrieval search indexing
asked Apr 6 '16 at 15:53 by icehawk, edited Apr 6 '16 at 20:13
1 Answer
There are many reasons (performance, design, storage, compression, evaluation of data structures). The main one is that these structures have been validated in practice, but you are free to design your own data structure and show that it offers a new way of doing things.
Even Google published a paper showing that its design works. If you have your own data structure and design, I suggest choosing a collection of a suitable size for your experiment and measuring precision and recall on that collection.
Another reason is that the two kinds of information have different requirements in terms of performance, compression, storage and hardware.
The cost per gigabyte differs greatly between RAM and hard disk. When you run many servers, cost reduction is a reason to favour low-cost storage and modest hardware requirements.
This becomes clear when the collection holds terabytes of data covering many cultures or many countries (Google has a paper about its cluster).
To keep it simple: a 50 GB collection is a small collection, but how much does a server with 60 GB of RAM cost, and how much does a server with 50 GB of hard disk cost?
The answer is clear, and it is reflected in the data structures: B-trees work well in secondary memory, while simple hash tables are good and fast in main memory.
But you decide.
A good older reference is chapter 5 of Managing Gigabytes (https://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?s=books&ie=UTF8&qid=1474990622&sr=1-1&keywords=managing+gigabytes)
The chapter includes a table with results for several data structures and types of memory.
Google's paper shows a way to verify your implementation and design:
http://infolab.stanford.edu/~backrub/google.html
On the cluster and its jobs:
http://static.googleusercontent.com/media/research.google.com/pt-BR//pubs/archive/43438.pdf
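As a rough illustration of the split between main memory and secondary storage, here is a minimal Python sketch; it is not taken from either reference. It assumes a small vocabulary held in a plain dict (term -> document frequency, byte offset, byte length) and a postings file on disk that is read only for the terms a query actually needs. The function names (`build_index`, `read_postings`, `search`), the fixed 4-byte encoding and the `max_df_ratio` cutoff are all illustrative assumptions; a real engine would compress the postings and use a more careful dictionary structure.

```python
import struct


def build_index(docs, postings_path):
    """Write postings to disk and return the in-memory vocabulary.

    docs: dict mapping an integer doc ID to its text.
    Vocabulary entry: term -> (document frequency, byte offset, byte length).
    """
    inverted = {}
    for doc_id in sorted(docs):
        # set() so each document is counted once per term
        for term in set(docs[doc_id].lower().split()):
            inverted.setdefault(term, []).append(doc_id)

    vocabulary = {}
    with open(postings_path, "wb") as f:
        for term in sorted(inverted):
            doc_ids = inverted[term]
            # Uncompressed little-endian 32-bit doc IDs (assumption for brevity)
            blob = struct.pack(f"<{len(doc_ids)}I", *doc_ids)
            vocabulary[term] = (len(doc_ids), f.tell(), len(blob))
            f.write(blob)
    return vocabulary


def read_postings(term, vocabulary, postings_path):
    """Seek to one term's posting list and read only that slice of the file."""
    _, offset, length = vocabulary[term]
    with open(postings_path, "rb") as f:
        f.seek(offset)
        blob = f.read(length)
    return list(struct.unpack(f"<{length // 4}I", blob))


def search(query, vocabulary, postings_path, num_docs, max_df_ratio=0.5):
    """AND query over the terms that survive vocabulary-only filtering.

    Unknown terms and terms that occur in more than max_df_ratio of the
    documents are dropped using the in-memory statistics alone, so their
    posting lists are never read from disk. (Ignoring unknown terms is a
    simplification made for this sketch.)
    """
    terms = [
        t for t in query.lower().split()
        if t in vocabulary and vocabulary[t][0] / num_docs <= max_df_ratio
    ]
    if not terms:
        return []
    # Start with the rarest term so the intermediate result stays small
    terms.sort(key=lambda t: vocabulary[t][0])
    result = set(read_postings(terms[0], vocabulary, postings_path))
    for term in terms[1:]:
        result &= set(read_postings(term, vocabulary, postings_path))
    return sorted(result)
```

With a handful of tiny documents, a query such as `the cat` can discard `the` from the vocabulary statistics alone and then read only the postings for `cat`, which is the kind of saving the question guesses at.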
answered Sep 27 '16 at 15:47 by Intruso