Why keep vocabulary and posting list separate in a search engine


I am taking a class in information retrieval. We learned that the index of a search engine has (possibly among other things):

A vocabulary mapping terms to their statistics (frequency, type, ...) and

A posting list mapping terms to the documents in which they occur (with or without positions, fields, ...)

These are separate data structures. I understand why this information is needed and what it is used for. But I don't understand why we want to keep them separate. Why can't we have one data structure that maps terms to both statistics and documents?

I am currently thinking it might be because the vocabulary is much smaller, so we can keep it in memory. We could then use the statistics to remove query terms that are unlikely to be useful, or to find misspellings in the query, without having to touch the large posting list.

Is this correct or is there another reason to keep vocabulary and posting list separate?

edited Apr 6 '16 at 20:13

asked Apr 6 '16 at 15:53

icehawk


information-retrieval search indexing


1 Answer

There are many reasons for it (performance, design, storage, compression, evaluation of data structures). The main one is that these structures have been verified in practice, but you can design your own data structure and show a new way of doing it.

Even Google has a paper verifying that its design works well. If you have your own data structure and design, I suggest you choose a collection of a suitable size for your experiment, measure precision and recall on that collection, and try it.
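If you want to measure precision and recall for your own design, a minimal sketch could look like this. The function name and the example document ids are made up for illustration, and the set of relevant documents is assumed to come from your own relevance judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the index under test
    relevant:  document ids judged relevant for the query (assumed known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so an empty result does not divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical ids: the engine returned 4 documents, 3 are relevant overall.
    p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same queries against two index designs and comparing these numbers tells you whether a change in data structure altered the results, not just the speed.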

Another reason is that the two kinds of information have different requirements in terms of performance, compression, storage and hardware.

The cost per gigabyte is very different for RAM and hard disk. When you run many servers, cost reduction becomes a reason to favor low-cost storage and modest hardware requirements.

This becomes clear when you have a collection with terabytes of data covering many cultures or countries (Google has a paper about its cluster).

To keep it simple: a 50 GB collection is a small collection, but how much does a server with 60 GB of RAM cost, and how much does a server with 50 GB of hard disk cost?

The answer is clear, and it is reflected in the data structures: a B-tree is good for secondary memory, while simple hash tables are good and fast in main memory.
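As a rough sketch of that split (an illustration only, not how any particular engine does it): the vocabulary below is a plain hash table in main memory that stores each term's document frequency and a byte offset, and the posting lists live in a file on disk. All function names, the file layout and the `max_fraction` threshold are assumptions made for this example.

```python
import os
import struct
import tempfile


def build_index(docs, postings_path):
    """Write posting lists to disk and return the in-memory vocabulary."""
    inverted = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            inverted.setdefault(term, []).append(doc_id)

    vocabulary = {}  # term -> (document frequency, byte offset in postings file)
    with open(postings_path, "wb") as f:
        for term in sorted(inverted):
            doc_ids = inverted[term]
            vocabulary[term] = (len(doc_ids), f.tell())
            # Each document id is stored as a 4-byte little-endian unsigned int.
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return vocabulary


def read_postings(vocabulary, postings_path, term):
    """Fetch one posting list from disk with a single seek."""
    entry = vocabulary.get(term)
    if entry is None:
        return []  # unknown term: answered from memory, no disk access
    doc_freq, offset = entry
    with open(postings_path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * doc_freq)
    return list(struct.unpack(f"<{doc_freq}I", data))


def selective_terms(vocabulary, query_terms, num_docs, max_fraction=0.8):
    """Drop unknown and very common terms using only in-memory statistics."""
    return [
        t for t in query_terms
        if t in vocabulary and vocabulary[t][0] / num_docs <= max_fraction
    ]


if __name__ == "__main__":
    docs = ["the cat sat", "the dog barked", "the cat and the dog"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "postings.bin")
        vocab = build_index(docs, path)
        terms = selective_terms(vocab, ["the", "cat", "zebra"], len(docs))
        for term in terms:
            print(term, read_postings(vocab, path, term))
```

Here the query is pruned without opening the postings file at all: "zebra" is not in the vocabulary and "the" occurs in every document, so only the posting list for "cat" is read from disk. With a single combined structure, the same decisions would require loading the large part of the index as well.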

But you decide.

A good old reference is chapter 5 of Managing Gigabytes (https://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?s=books&ie=UTF8&qid=1474990622&sr=1-1&keywords=managing+gigabytes)

The chapter shows a table with different results for many data structures and types of memory.

Google's paper shows a way to verify your implementation and design:
http://infolab.stanford.edu/~backrub/google.html

Think about the cluster and jobs:
http://static.googleusercontent.com/media/research.google.com/pt-BR//pubs/archive/43438.pdf

answered Sep 27 '16 at 15:47

Intruso
