Antonio Fernández Antafiles.meetup.com/16506412/PresentationBIGDAP.pdfAntonio Fernández Anta Joint...

Antonio Fernández Anta Joint work with Luis F. Chiroque, Héctor Cordobés,

Rafael A. García Leiva, Philippe Morere,

Lorenzo Ornella, Fernando Pérez, and Agustín Santos

•  Recommendation Engines (RE) suggest items to users

•  RE are becoming highly popular in many different contexts

-  Shopping websites (Amazon)

-  Content distribution (Netflix, Spotify)

-  Online social networks (Facebook, Twitter, LinkedIn)

•  Some authors talk about “the age of recommendation” versus the age of search (Chris Anderson, The Long Tail)

•  Most modern RE are based on collaborative filtering: Recommendations are based on

- Historic data

- Similarity between users

- Similarity between items

•  Multiple metrics to quantify similarity: Euclidean distance, Cosine similarity, Pearson correlation similarity, etc.
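
As an illustration of the similarity metrics named above, here is a minimal sketch in plain Python. The function names and the sample rating vectors are hypothetical, and the code assumes both vectors have the same length and are not all zeros (or constant, for Pearson).

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (assumes neither is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    # Pearson correlation equals the cosine similarity of the mean-centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

# Hypothetical ratings given by two users to the same four items.
alice = [5.0, 3.0, 4.0, 1.0]
bob = [4.0, 2.0, 5.0, 1.0]
print(euclidean_distance(alice, bob))
print(cosine_similarity(alice, bob))
print(pearson_correlation(alice, bob))
```

Euclidean distance is a dissimilarity (smaller means closer), while the other two are similarities (larger means closer), so they are not interchangeable without a conversion.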

•  The users and items are typically connected as a bipartite graph

•  Considering users as nodes and items as hyperedges (or vice versa) we obtain hypergraphs (that can be transformed into graphs of users or items)

•  Graph theory and network analysis concepts can be useful (as before)

- Google, PageRank

- Natural Language Processing

•  We explore graph-based approaches for recommendation engines

•  Apply them to an ecosystem of smartphone apps.

- Apps are advertised in other apps via banners

- Performance metric is click-through rate (CTR)

•  Large (big) data available

- More than a billion records

- Millions of users

- Hundreds of items (apps)

•  Recommendation engines based on

- Collaborative filtering (in a wide sense)

- Graph theory concepts

•  The engines were evaluated in the real world (and showed good performance)

•  Involves processing the historical data available with big data technologies: Hadoop, Elastic MapReduce, Pig

•  The processing is done in a few hours thanks to these technologies

•  There are more than 100 GiB of historical data

- Millions of users

- 300 applications

- More than 1,400 million records

•  The data is in multiple tables

•  The first process is to “clean” the data

•  This involves joining several tables

•  The output of this process is a file with records that contain

- User

- Application advertised

- Running application (publisher app)

- Action (advertisement, click)

- Date and time
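
The record layout listed above could be represented as follows. This is only a sketch: the field names, the tab-separated layout, the ISO timestamp format, and the action labels "advertisement" and "click" are assumptions for illustration, not the format used in the actual pipeline.

```python
from datetime import datetime
from typing import NamedTuple

class CleanRecord(NamedTuple):
    user: str
    advertised_app: str   # app shown in the banner
    publisher_app: str    # app the user was running
    action: str           # "advertisement" (banner shown) or "click"
    timestamp: datetime

def parse_record(line, sep="\t"):
    # Hypothetical text layout: one record per line, five separated fields.
    user, advertised, publisher, action, stamp = line.rstrip("\n").split(sep)
    if action not in ("advertisement", "click"):
        raise ValueError(f"unknown action: {action!r}")
    return CleanRecord(user, advertised, publisher, action,
                       datetime.fromisoformat(stamp))

record = parse_record("u42\tquiz\thoroscope\tclick\t2014-03-01T10:15:00")
print(record.publisher_app, record.action)
```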

•  In MySQL we started one of the join operations (not the largest) and stopped it after 30 hours, when it was no more than 15% complete

•  We have used Hadoop on Amazon Elastic MapReduce and Pig scripts to process the data

•  The output has more than 700 million records

•  From the clean data the graphs used by the RE have been generated

•  This process is less time consuming since it typically involves data aggregation

[Figure: graph of the apps in the ecosystem. Nodes are labeled with app names such as tontometro, horoscopo, testvida, liedetector, and myersbriggs.]

•  Common users graph: the apps are the nodes, and undirected links are weighted by the number of users that the two apps have in common

[Figure: example common users graph with link weights 4, 45, 7, 22, 5, 15, and 25.]
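
A graph of apps weighted by common users can be built by simple aggregation. The sketch below assumes the input is a list of (user, app) pairs, for instance derived from the user and publisher app fields of the cleaned records. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def common_users_graph(usage):
    """usage: iterable of (user, app) pairs.
    Returns {(app_a, app_b): number of users who have both}, with app_a < app_b."""
    apps_by_user = defaultdict(set)
    for user, app in usage:
        apps_by_user[user].add(app)
    weights = defaultdict(int)
    for apps in apps_by_user.values():
        # Each user adds 1 to the link between every pair of their apps.
        for a, b in combinations(sorted(apps), 2):
            weights[(a, b)] += 1
    return dict(weights)

usage = [
    ("u1", "quiz"), ("u1", "horoscope"),
    ("u2", "quiz"), ("u2", "horoscope"), ("u2", "memory"),
    ("u3", "memory"), ("u3", "quiz"),
]
print(common_users_graph(usage))
```

With a few hundred apps the resulting graph has at most tens of thousands of links, however many users contribute to it.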

•  Shared users: The apps with largest number of users shared with the publisher app are preferred

[Figure: the example graph with the publisher app and the recommended app marked.]
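
The shared users rule can be sketched as a lookup over the links of the publisher app. The graph representation, the function name, the app names, and the tie-breaking by name are assumptions made for this example.

```python
def recommend_shared_users(weights, publisher, k=1):
    """weights: {(app_a, app_b): shared users} for an undirected graph.
    Returns the k neighbors of the publisher app with the most shared users."""
    scores = {}
    for (a, b), w in weights.items():
        if a == publisher:
            scores[b] = w
        elif b == publisher:
            scores[a] = w
    # Largest weight first; ties broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda app: (-scores[app], app))[:k]

# Hypothetical graph over four apps.
weights = {("A", "B"): 45, ("A", "C"): 7, ("B", "C"): 22, ("A", "D"): 4}
print(recommend_shared_users(weights, "A", k=2))
```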

•  Filtering algorithm:

- Let $v$ be the binary vector of the requesting user's apps, and $M$ the adjacency matrix of the common users graph

- The apps whose positions have the largest values in $v^{T} M$ are preferred

[Figure: filtering example on the common users graph, with the recommended app marked.]
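
The filtering rule amounts to one vector-matrix product. The sketch below uses plain Python lists and a hypothetical 4-app graph. Excluding apps the user already has is an assumption of this example, not something stated in the slides.

```python
def filtering_scores(v, M):
    # Entry j of v^T M: total weight between app j and the apps the user has.
    n = len(M)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]

def recommend_filtering(apps, v, M, k=1):
    scores = filtering_scores(v, M)
    # Assumption: apps the requesting user already has are not recommended.
    candidates = [j for j in range(len(apps)) if not v[j]]
    candidates.sort(key=lambda j: (-scores[j], apps[j]))
    return [apps[j] for j in candidates[:k]]

apps = ["A", "B", "C", "D"]
# Hypothetical symmetric adjacency matrix of the common users graph.
M = [
    [0, 45, 7, 4],
    [45, 0, 22, 0],
    [7, 22, 0, 25],
    [4, 0, 25, 0],
]
v = [1, 1, 0, 0]  # the requesting user has apps A and B
print(filtering_scores(v, M))
print(recommend_filtering(apps, v, M))
```

Unlike the shared users rule, which looks only at the publisher app, this score adds up evidence from every app the requesting user has.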

•  Common users graph, modulated by age (user weight decreases exponentially with age)

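$\mathrm{Weight}(app_1, app_2) = \sum_{u} \delta^{\mathrm{age}(u)}$, where the sum runs over the users $u$ common to both apps and $0 < \delta < 1$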

•  Aged shared users: Same as shared users in the aged graph

•  Aged filtering: Same as filtering in the aged graph
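
The aged graph differs from the common users graph only in what each user contributes to a link. In the sketch below the unit of age, the value of delta, and all names are assumptions. The slides say only that the user weight decreases exponentially with age.

```python
from collections import defaultdict
from itertools import combinations

def aged_common_users_graph(apps_by_user, age, delta):
    """apps_by_user: {user: set of apps}; age: {user: age in some fixed unit};
    delta: decay factor with 0 < delta < 1 (hypothetical parameter)."""
    weights = defaultdict(float)
    for user, apps in apps_by_user.items():
        user_weight = delta ** age[user]  # exponentially smaller for older users
        for a, b in combinations(sorted(apps), 2):
            weights[(a, b)] += user_weight
    return dict(weights)

apps_by_user = {"u1": {"quiz", "horoscope"}, "u2": {"quiz", "horoscope"}}
age = {"u1": 0, "u2": 2}
graph = aged_common_users_graph(apps_by_user, age, delta=0.5)
print(graph)  # the single link has weight 0.5**0 + 0.5**2 = 1.25
```

Setting delta to 1 would recover the plain common users graph, since every user would then contribute exactly 1.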

•  The CTR graph is a directed graph whose links are weighted by the click frequency of the banner of the advertised app (head of the link) when it is shown in the publisher app (tail of the link)

[Figure: example CTR graph with links labeled by click fractions 4/12, 4/50, 7/14, 22/40, 5/20, 6/10, and 25/50.]
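
A CTR graph of this kind can be aggregated from the cleaned records. The sketch assumes each record is reduced to a (publisher, advertised, action) triple, that the action labels are "advertisement" and "click", and that every click is preceded by an advertisement record for the same pair. None of this is confirmed by the slides.

```python
from collections import defaultdict

def ctr_graph(records):
    """records: iterable of (publisher_app, advertised_app, action).
    Returns {(publisher, advertised): clicks / advertisements}."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for publisher, advertised, action in records:
        edge = (publisher, advertised)  # directed: tail is the publisher app
        if action == "advertisement":
            shown[edge] += 1
        elif action == "click":
            clicked[edge] += 1
    # Pairs with clicks but no recorded advertisement are dropped.
    return {edge: clicked[edge] / n for edge, n in shown.items()}

records = (
    [("quiz", "horoscope", "advertisement")] * 4
    + [("quiz", "horoscope", "click")]
    + [("horoscope", "quiz", "advertisement")] * 2
)
print(ctr_graph(records))
```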

•  Maxflow algorithm:

- Recommendation algorithm used to promote specific applications

- The apps with largest maxflow in the CTR graph to the promoted apps are preferred

[Figure: maxflow example on the CTR graph, with link weights .33, .08, .5, .55, .25, .6, and .4, and with the source, the promoted app, and the recommended app marked.]
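
One possible reading of the maxflow rule is to use the CTR weights as link capacities and rank candidate apps by their maximum flow to a single promoted app. The sketch below implements that reading with a plain Edmonds-Karp routine. The graph, the app names, and the restriction to one promoted app are assumptions of this example, which is not necessarily the implementation that was deployed.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink, eps=1e-12):
    """capacity: {(u, v): c} on a directed graph. Edmonds-Karp max flow value."""
    if source == sink:
        raise ValueError("source and sink must differ")
    residual = defaultdict(float)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)  # allows traversal of reverse residual links
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > eps:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck

def rank_by_maxflow(capacity, candidates, promoted):
    scores = {app: max_flow(capacity, app, promoted)
              for app in candidates if app != promoted}
    return sorted(scores, key=lambda app: (-scores[app], app))

# Hypothetical CTR graph; "P" is the promoted app.
capacity = {("A", "B"): 0.5, ("A", "C"): 0.25, ("B", "P"): 0.33,
            ("C", "P"): 0.6, ("B", "C"): 0.08}
print(rank_by_maxflow(capacity, ["A", "B", "C"], "P"))
```

Because a maximum flow accounts for all paths to the promoted app rather than only direct links, an app with no direct link to it can still rank highly.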

•  For reference there are two basic recommendation engines:

- Random: Engine that suggests random applications

- Static promotion: Always returns the promoted apps

•  The different algorithms have been tested over a week in the real system

•  These values improve over the current CTR

Algorithm            CTR
Random               1.57%
Shared users         1.64%
Aged shared users    1.69%
Filtering            1.51%
Aged filtering       1.71%
Static promotion     1.45%
Maxflow              1.86%

Aging is useful

Global view
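
To compare the engines on a common scale, the reported CTR values can be expressed as a relative change with respect to the random engine. The choice of Random as the baseline is an assumption of this sketch. The CTR figures are the ones reported above.

```python
# CTR values (in percent) as reported for the one-week test.
ctr = {
    "Random": 1.57,
    "Shared users": 1.64,
    "Aged shared users": 1.69,
    "Filtering": 1.51,
    "Aged filtering": 1.71,
    "Static promotion": 1.45,
    "Maxflow": 1.86,
}

def relative_lift(ctr, baseline="Random"):
    # (value - baseline) / baseline; positive means better than the baseline.
    base = ctr[baseline]
    return {name: (value - base) / base for name, value in ctr.items()}

lift = relative_lift(ctr)
for name in sorted(lift, key=lift.get, reverse=True):
    print(f"{name:20s} {lift[name]:+.1%}")
```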

•  Graph analytics can be useful in the development of recommendation engines

•  Big data technology allowed us to process historical data and produce graphs

•  The graphs generated are small. They could be processed with classical technologies

•  Current MapReduce technologies do not seem to be the solution for large graph analysis

•  Explore technologies that are more suited for large graph analytics:

- Graphlab, GraphChi

- Spark, GraphX

- Stratosphere, Flink, Spargel

•  Devise ways to process incremental data

•  Design and testing of new recommendation algorithms that use larger graphs

Thank you!
