files.meetup.com/16506412/PresentationBIGDAP.pdf
Antonio Fernández Anta
Joint work with Luis F. Chiroque, Héctor Cordobés, Rafael A. García Leiva, Philippe Morere, Lorenzo Ornella, Fernando Pérez, and Agustín Santos
• Recommendation Engines (RE) suggest items to users
• RE are becoming highly popular in many different contexts
- Shopping websites (Amazon)
- Content distribution (Netflix, Spotify)
- Online social networks (Facebook, Twitter, LinkedIn)
• Some authors talk about “the age of recommendation” versus the age of search (Chris Anderson, The Long Tail)
• Most modern RE are based on collaborative filtering: Recommendations are based on
- Historic data
- Similarity between users
- Similarity between items
• Multiple metrics to quantify similarity: Euclidean distance, Cosine similarity, Pearson correlation similarity, etc.
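As an illustration of the three metrics named above, here is a minimal sketch in plain Python. The function names and the representation of a user or item as a list of numbers (for example, per-app usage counts) are assumptions made for this example, not something specified in the talk.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation of two vectors; 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Note that Euclidean distance is a dissimilarity (smaller is closer), while the other two are similarities (larger is closer).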
• The users and items are typically connected as a bipartite graph
• Considering users as nodes and items as hyperedges (or vice versa), we obtain hypergraphs (which can be transformed into graphs of users or items)
• Graph theory and network analysis concepts can be useful (as before)
- Google, PageRank
- Natural Language Processing
• We explore graph-based approaches for recommendation engines
• We apply them to an ecosystem of smartphone apps
- Apps are advertised in other apps via banners
- Performance metric is click-through rate (CTR)
• Large (big) data available
- More than a billion records
- Millions of users
- Hundreds of items (apps)
• Recommendation engines based on
- Collaborative filtering (in wide sense)
- Graph theory concepts
• The engines were evaluated in the real world (and showed good performance)
• This involves processing the available historical data with big data technologies: Hadoop, Elastic MapReduce, Pig
• The processing is done in a few hours thanks to these technologies
• There are more than 100 GiB of historical data
- Millions of users
- 300 applications
- More than 1,400 million records
• The data is in multiple tables
• The first process is to “clean” the data
• This involves joining several tables
• The output of this process is a file with records that contain
- User
- Application advertised
- Running application (publisher app)
- Action (advertisement, click)
- Date and time
• In MySQL we started one of the join operations (not the largest) and stopped it after 30 hours, when it had reached no more than 15% completion
• We have used Hadoop on Amazon Elastic MapReduce and Pig scripts to process the data
• The output has more than 700 million records
• From the clean data the graphs used by the RE have been generated
• This process is less time consuming since it typically involves data aggregation
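To make the aggregation step concrete, here is a minimal in-memory sketch in Python (the talk used Pig on Hadoop, not this code). It assumes each clean record is a tuple (user, advertised app, publisher app, action, timestamp) with action equal to "advertisement" or "click", and that a click is stored as its own record; the function name and the exact field encoding are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_ctr(records):
    """Count banner impressions and clicks per (publisher, advertised) pair.

    Returns {(publisher, advertised): (impressions, clicks)}.
    Assumed record layout: (user, advertised, publisher, action, timestamp).
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for _user, advertised, publisher, action, _timestamp in records:
        key = (publisher, advertised)
        if action == "advertisement":
            impressions[key] += 1
        elif action == "click":
            clicks[key] += 1
    # Only pairs with at least one impression get an entry.
    return {key: (shown, clicks.get(key, 0)) for key, shown in impressions.items()}
```

Dividing clicks by impressions for each pair gives a per-link click frequency; the hundreds of millions of input records collapse into at most one entry per pair of apps, which is why this stage is cheap compared with the joins.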
[Figure: graph of the apps in the ecosystem; nodes are labeled with app names such as tontometro, tepille, horoscopo, testvida, biorritmo, pesoideal]
• Common users graph: apps are nodes, and undirected links are weighted by the number of common users
[Figure: example common users graph with weighted links]
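A minimal sketch of how such a graph could be built in Python, assuming the input is a mapping from each user to the set of apps that user has been seen in. The function name and the use of a frozenset of two app names as the key of an undirected link are choices made for this example.

```python
from collections import defaultdict
from itertools import combinations


def common_users_graph(user_apps):
    """Build the undirected common users graph.

    user_apps: {user: set of apps}.
    Returns {frozenset({app_a, app_b}): number of users that have both apps}.
    """
    weights = defaultdict(int)
    for apps in user_apps.values():
        # Every pair of apps owned by the same user gains one common user.
        for a, b in combinations(sorted(apps), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)
```

With a few hundred apps the result has at most a few tens of thousands of links, however many users contribute to it.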
• Shared users: The apps with largest number of users shared with the publisher app are preferred
[Figure: common users graph with weighted links; the publisher app and the recommended app are marked]
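The shared users rule can be sketched as a simple ranking of the publisher app's neighbors by link weight. The adjacency-dictionary layout, the function name, and the alphabetical tie-break are assumptions made for this illustration.

```python
def recommend_shared_users(graph, publisher, k=1):
    """Return the k apps sharing the most users with the publisher app.

    graph: {app: {neighbor_app: number of common users}} (symmetric).
    Ties are broken alphabetically so the result is deterministic.
    """
    neighbors = graph.get(publisher, {})
    ranked = sorted(neighbors, key=lambda app: (-neighbors[app], app))
    return ranked[:k]


# Hypothetical example: publisher "P" shares 45 users with "A", 22 with "C", 7 with "B".
example_graph = {
    "P": {"A": 45, "B": 7, "C": 22},
    "A": {"P": 45},
    "B": {"P": 7},
    "C": {"P": 22},
}
top_two = recommend_shared_users(example_graph, "P", k=2)
```

This rule depends only on the app in which the banner is shown, not on the user requesting it.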
• Filtering algorithm:
- Let v be the binary vector of the requesting user's apps, and M the adjacency matrix of the common users graph
- The apps whose positions have the largest values in v^T M are preferred
[Figure: filtering example on the common users graph; the recommended app is marked]
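The product v^T M can be sketched in plain Python as follows. Apps are identified by their index, M is a list of rows, and the `skip_owned` flag (excluding apps the user already has) is an assumption added for the example; the slides do not say whether such apps are excluded.

```python
def filtering_scores(v, M):
    """Compute v^T M: entry j sums the link weights between app j and the user's apps."""
    n = len(M)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]


def recommend_filtering(v, M, k=1, skip_owned=False):
    """Return the indices of the k apps with the largest scores in v^T M."""
    scores = filtering_scores(v, M)
    candidates = [j for j in range(len(scores)) if not (skip_owned and v[j])]
    candidates.sort(key=lambda j: (-scores[j], j))
    return candidates[:k]


# Hypothetical 4-app common users graph; the user has apps 0 and 1.
M = [
    [0, 7, 22, 0],
    [7, 0, 25, 5],
    [22, 25, 0, 0],
    [0, 5, 0, 0],
]
v = [1, 1, 0, 0]
scores = filtering_scores(v, M)  # [7, 7, 47, 5]
```

Unlike the shared users rule, the score here depends on the whole set of apps of the requesting user, so two users seeing a banner in the same publisher app can receive different recommendations.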
• Common users graph, modulated by age (user weight decreases exponentially with age)
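$\mathrm{Weight}(app_1, app_2) = \sum_{u} \delta^{\mathrm{age}(u)}$, where u ranges over the users common to app1 and app2 and 0 < δ < 1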
• Aged shared users: Same as shared users in the aged graph
• Aged filtering: Same as filtering in the aged graph
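A minimal sketch of the aged weights in Python, assuming each common user u contributes delta raised to age(u) to a link. How age is measured (for instance, days since the user's last activity) and the value of delta are not given in the slides; the `user_age` mapping and the default `delta=0.5` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def aged_common_users_graph(user_apps, user_age, delta=0.5):
    """Common users graph in which each user counts delta ** age instead of 1.

    user_apps: {user: set of apps}; user_age: {user: non-negative age}.
    Returns {frozenset({app_a, app_b}): aged weight}.
    """
    weights = defaultdict(float)
    for user, apps in user_apps.items():
        contribution = delta ** user_age[user]  # decays exponentially with age
        for a, b in combinations(sorted(apps), 2):
            weights[frozenset((a, b))] += contribution
    return dict(weights)
```

With delta = 0.5, a user of age 0 counts as 1 and a user of age 2 counts as 0.25, so a link shared by both gets weight 1.25 instead of 2. The shared users and filtering rules can then be applied unchanged to these weights.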
• The CTR graph is a directed graph with links weighted by the frequency of clicks on the banner of the advertised application (head) when shown in the publisher app (tail)
[Figure: example CTR graph with links labeled 4/12, 4/50, 7/14, 22/40, 5/20, 6/10, 25/50]
• Maxflow algorithm:
- Recommendation algorithm used to promote specific applications
- The apps with largest maxflow in the CTR graph to the promoted apps are preferred
[Figure: maxflow example on the CTR graph, showing the source, the promoted app, and the recommended app]
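One way to sketch the maxflow rule in Python is with a textbook Edmonds-Karp implementation, treating each CTR link weight as a capacity and ranking candidate apps by their maximum flow to a single promoted app. Using CTR values directly as capacities, handling only one promoted app, and all function names are assumptions made for this example; the slides do not give these details.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow. capacity: {u: {v: capacity of link u -> v}}."""
    if source == sink:
        return float("inf")
    residual = defaultdict(lambda: defaultdict(float))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0.0  # make sure the reverse link exists
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def recommend_maxflow(ctr_graph, promoted, candidates, k=1):
    """Rank candidate apps by their maximum flow to the promoted app."""
    scored = [(max_flow(ctr_graph, app, promoted), app)
              for app in candidates if app != promoted]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [app for _, app in scored[:k]]


# Hypothetical CTR graph: link u -> v weighted by the CTR of v's banner shown in u.
ctr_graph = {
    "A": {"B": 0.5, "P": 0.25},
    "B": {"P": 0.25},
    "C": {"P": 0.125},
}
```

Because the flow accounts for indirect paths (here A reaches P both directly and through B), an app can rank high even when its direct link to the promoted app is weak.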
• For reference there are two basic recommendation engines:
- Random: Engine that suggests random applications
- Static promotion: Always returns the promoted apps
• The different algorithms have been tested over a week in the real system
• These values improve over the current CTR
Algorithm            CTR
Random               1.57%
Shared users         1.64%
Aged shared users    1.69%
Filtering            1.51%
Aged filtering       1.71%
Static promotion     1.45%
Maxflow              1.86%
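• Aging is useful: both aged variants achieve a higher CTR than their non-aged counterparts
• A global view of the graph is useful: Maxflow achieves the highest CTR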
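To put the table in perspective, the relative change of each algorithm with respect to the Random baseline can be computed directly from the reported values. The choice of Random as the baseline and the function name are assumptions made for this sketch; the CTR of the system before the test is not given in the slides.

```python
# CTR values (in percent) as reported in the table.
ctr = {
    "Random": 1.57,
    "Shared users": 1.64,
    "Aged shared users": 1.69,
    "Filtering": 1.51,
    "Aged filtering": 1.71,
    "Static promotion": 1.45,
    "Maxflow": 1.86,
}


def relative_lift(ctr_by_algorithm, baseline="Random"):
    """Relative CTR change of each algorithm with respect to the baseline."""
    base = ctr_by_algorithm[baseline]
    return {name: (value - base) / base for name, value in ctr_by_algorithm.items()}


lift = relative_lift(ctr)
best = max(ctr, key=ctr.get)
```

For Maxflow this gives (1.86 - 1.57) / 1.57, a relative gain of roughly 18% over Random, while Filtering and Static promotion fall below the Random baseline.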
• Graph analytics can be useful in the development of recommendation engines
• Big data technology allowed us to process historical data and produce graphs
• The graphs generated are small. They could be processed with classical technologies
• Current MapReduce technologies do not seem to be the solution for large graph analysis
• Explore technologies that are better suited for large graph analytics:
- GraphLab, GraphChi
- Spark, GraphX
- Stratosphere, Flink, Spargel
• Devise ways to process incremental data
• Design and testing of new recommendation algorithms that use larger graphs
Thank you!