CC5212-1 Procesamiento Masivo de Datos...

71
CC5212-1 PROCESAMIENTO MASIVO DE DATOS O TOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan [email protected]

Transcript of CC5212-1 Procesamiento Masivo de Datos...

Page 1: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

CC5212-1

PROCESAMIENTO MASIVO DE DATOS

OTOÑO 2017

Lecture 7: Information Retrieval II

Aidan Hogan

[email protected]

Page 2: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How does Google know about the Web?

Page 3: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Inverted Index: Example

Term List Posting List

a (1,2,…)

american (1,5,…)

and (1,2,…)

by (1,2,…)

directed (1,2,…)

drama (1,16,…)

… …

Inverted index:

1

Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.

Page 4: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

• Inverted Index

– They built one so you don’t have to!

– Open Source in Java

Apache Lucene

Page 5: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

• Inverted Index

– They built one so you don’t have to!

– Open Source in Java

Apache Lucene

Page 6: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

INFORMATION RETRIEVAL:

RANKING

Page 7: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How Does Google Get Such Good Results?

obama

Page 8: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How Does Google Get Such Good Results?

aidan hogan

Page 9: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How does Google Get Such Good Results?

Page 10: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Two Sides to Ranking: Relevance

Page 11: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Two Sides to Ranking: Importance

>

Page 12: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

RANKING:

RELEVANCE

Page 13: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Example Query

Which of these three keyword terms is most “important”?

Page 14: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Matches in a Document

freedom• 7 occurrences

Page 15: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Matches in a Document

freedom• 7 occurrences

movie• 16 occurrences

Page 16: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Matches in a Document

freedom• 7 occurrences

movie• 16 occurrences

wallace• 88 occurrences

Page 17: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Usefulness of Words

movie• occurs very frequently

freedom• occurs frequently

wallace• occurs occassionally

Page 18: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Estimating Relevance

• Rare words more important than common words

– wallace (49M) more important than freedom (198M)

more important than movie (835M)

• Words occurring more frequently in a document

indicate higher relevance

– wallace (88) more matches than movie (16) more

matches than freedom (7)

Page 19: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

• TF: Term Frequency

– Measures occurrences of a term in a document

– … various options

• Raw count of occurrences

• Logarithmically scaled

• Normalised by document length

• A combination / something else

Page 20: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

• IDF: Inverse Document Frequency

– How common a term is across all documents

– …

• Logarithmically scaled document occurrences

• Note: The more rare, the larger the value

Page 21: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

• TF–IDF: Combine Term Frequency and Inverse

Document Frequency:

• Score for a query

– Let query

– Score for a query:

(There are other possibilities)

Page 22: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 23: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 24: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 25: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 26: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 27: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 28: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

Term Frequency Inverse Document Frequency

Page 29: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Vector Space Model (a mention)

Page 30: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Vector Space Model (a mention)

Page 31: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Vector Space Model (a mention)

Page 32: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Vector Space Model (a mention)

• Cosine Similarity

• Note:

Page 33: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Relevance Measure: TF–IDF

• TF–IDF: Combine Term Frequency and Inverse

Document Frequency:

• Score for a query

– Let query

– Score for a query:

(There are other possibilities)

… we could also use cosine similarity between

query and document using TF–IDF weights

Page 34: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Two Sides to Ranking: Relevance

Page 35: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Field-Based Boosting

• Not all text is equal: titles, headers, etc.

Page 36: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Anchor Text

• See how the Web views/tags a page

Page 37: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Anchor Text

• See how the Web views/tags a page

Page 38: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

• Inverted Index

– They built one so you don’t have to!

– Open Source in Java

Lucene uses relevance scoring

Page 39: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

RANKING:

IMPORTANCE

Page 40: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Two Sides to Ranking: Importance

>

How could we determine that Barack Obama is more important

than Mount Obama as a search result on the Web?

Page 41: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Analysis

Which will have more links from other pages?

The Wikipedia article for Mount Obama?

The Wikipedia article for Barack Obama?

Page 42: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Analysis

• Consider links as votes of confidence in a page

• A hyperlink is the open Web’s version of …

(… even if the page is linked in a negative way.)

Page 43: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Analysis

So if we just count links to a page we can

determine its importance and we are done?

Page 44: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Spamming

Page 45: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Importance

So which should count for more?

A link from http://en.wikipedia.org/wiki/Barack_Obama?

Or a link from http://freev1agra.com/shop.html?

Page 46: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Link Importance

Maybe we could consider links from some

domains as having more “vote”?

Page 47: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank

Page 48: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank

• Not just a count of inlinks

– A link from a more important page is more important

– A link from a page with fewer links is more important

∴ A page with lots of inlinks from important pages (which

have few outlinks) is more important

Page 49: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank is Recursive

• Not just a count of inlinks

– A link from a more important page is more important

– A link from a page with fewer links is more important

∴ A page with lots of inlinks from important pages (which

have few outlinks) is more important

Page 50: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank Model

• The Web: a directed graph

Vertices(pages)

Edges(links)

f a

e b

d c

0.2650.225

0.138 0.127

0.0740.172

Which vertex is most important?

Page 51: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank Model

• The Web: a directed graph

Vertices(pages)

Edges(links)

f a

e b

d c

Page 52: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank Model

Vertices(pages)

Edges(links)

f

e b

d

Page 53: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank Model

Vertices(pages)

Edges(links)

f

e b

d c

a

Page 54: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

Page 55: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

Page 56: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

g

What would happen with gover time?

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

• PageRank applied iteratively for each hop: score indicates probability of being at that page after that many hops

Page 57: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

g

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops

• If the surfer reaches a page without links, the surfer randomly jumps to another page

Page 58: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

g i

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops

• If the surfer reaches a page without links, the surfer randomly jumps to another page

What would happen with g and i over time?

Page 59: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

f a

e b

d c

g i

What would happen with g and i over time?

= someone surfing the web, clicking links randomly

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops

• If the surfer reaches a page without links, the surfer randomly jumps to another page

Page 60: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Random Surfer Model

= someone surfing the web, clicking links randomly

f a

e b

d c

• What is the probability of being at page x after n hops?

• Initial state: surfer equally likely to start at any node

• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops

• If the surfer reaches a page without links, the surfer randomly jumps to another page

• The surfer will jump to a random page at any time with a probability 1 – d … this avoids traps and ensures convergence!

g i

Page 61: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank Model: Final Version

• The Web: a directed graph

Vertices(pages)

Edges(links)

f a

e b

d c

Page 62: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

PageRank: Benefits

More robust than a simple link count

Fewer ties than link counting

Scalable to approximate (for sparse graphs)

Convergence guaranteed

Page 63: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Two Sides to Ranking: Importance

>

Page 64: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

GOOGLE: A GUESS

Page 65: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How Modern Google ranks results (maybe)

According to survey of SEO experts, not people in Google

Page 66: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How Modern Google ranks results (maybe)

According to survey of SEO experts, not people in Google

Why so secretive?

Page 67: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

INFORMATION RETRIEVAL:

RECAP

Page 68: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

How Does Google Get Such Good Results?

aidan hogan

Page 69: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Ranking in Information Retrieval

• Relevance: Is the document relevant for the query?

– Term Frequency * Inverse Document Frequency

– Cosine similarity

• Importance: Is the document a popular one?

– Links analysis

– PageRank

Page 70: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Ranking: Science or Art?

Page 71: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:

Questions?