Recommender System
Preference Learning
Computational Advertising

Recommender System

Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user. RSs are primarily directed towards individuals who lack sufficient personal experience or competence to evaluate the potentially overwhelming number of alternative items that a Web site, for example, may offer.

Xavier Amatriain discusses the traditional definition and its data mining core.

Traditional definition: a recommender system estimates a utility function that automatically predicts how much a user will like an item.

User interest is reflected implicitly in interaction history, demographics, and context, so inferring it is a typical data mining problem. A recommender system should match a context to a collection of information objects; methods known as Deep Matching Models for Recommendation take this view. Like other applications of machine learning, it can be described in the form representation + evaluation + optimization. Here we focus on representation and evaluation.

https://github.com/hongleizhang/RSPapers
https://github.com/chihming/competitive-recsys
https://rsbd2019.wordpress.com/
https://github.com/familyld/AwesomeRecSysPaper/
https://github.com/benfred/implicit
https://github.com/YuyangZhangFTD/awesome-RecSys-papers
https://github.com/daicoolb/RecommenderSystem-Paper
https://github.com/grahamjenson/list_of_recommender_systems
https://www.zhihu.com/question/20465266/answer/142867207
http://www.mbmlbook.com/Recommender.html
https://wiki.recsys.acm.org/index.php/Main_Page
Recommendation algorithms that directly optimize item ranking
Recommender systems meet deep learning
Large-Scale Recommender Systems@UTexas
Alan Said's publication
MyMediaLite Recommender System Library
Recommender System Algorithms @ deitel.com
Workshop on Recommender Systems: Algorithms and Evaluation
Semantic Recommender Systems. Analysis of the state of the topic
Recommender Systems (2019/1)
Recommender systems & ranking
Large scale recommender systems
Symposium on Semantic Computing and Personalization
International Workshop on Web Personalization, Recommender Systems, and Social Media (WPRSM2018)
https://www.comp.hkbu.edu.hk/mdm2019/files/slides/keynote_xie.pdf

Evolution of the Recommender Problem:

Rating
Ranking
Page Optimization
Context-aware Recommendations

------------------
Collaborative Filtering (CF)
Content-Based Filtering (CBF)
Demographic Filtering (DF)
Knowledge-Based Filtering (KBF)
Hybrid Recommendation Systems

Evaluation of Recommendation System

The evaluation of machine learning algorithms depends on the task. Evaluating a recommender system can be cast as evaluating a standard machine learning model, such as a regression or classification model. In the following methods we consider only mathematically convenient measures; the Gini index, coverage, and other more realistic factors are not discussed.

Recommendation Strategies
Evaluating recommender systems
Distance Metrics for Fun and Profit
Recsys2018 evaluation: tutorial

Collaborative Filtering

There are 3 kinds of collaborative filtering: user-based, item-based and model-based collaborative filtering.

User-based methods rely on similarities between users. If users ${u}$ and ${v}$ are very similar, we may recommend to user ${v}$ the items that user ${u}$ bought, with the explanation that similar users (for example, your friends) also bought them.

Item-based methods rely on similarities between items. If someone adds a toothbrush to a shopping list, it is reasonable to recommend toothpaste. The explanation is that you bought item $X$, and people who bought $X$ also bought $Y$. In what follows we focus on model-based collaborative filtering.
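As a rough illustration of the item-based idea, the sketch below scores unseen items by their cosine similarity to the items a user already has. The toy matrix `R` and the helper names `item_cosine_similarity` and `recommend_items` are assumptions made for this example, not part of any library.

```python
import numpy as np

# Toy user-item purchase matrix (rows: users, columns: items); values are illustrative.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

def item_cosine_similarity(R):
    """Cosine similarity between the item columns of R."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for items nobody bought
    S = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)         # an item should not recommend itself
    return S

def recommend_items(R, user, top_n=2):
    """Rank unseen items by summed similarity to the user's items."""
    S = item_cosine_similarity(R)
    scores = R[user] @ S
    scores[R[user] > 0] = -np.inf    # do not recommend items already bought
    return np.argsort(-scores)[:top_n]

item_sim = item_cosine_similarity(R)
recs_user0 = recommend_items(R, user=0)
```

The explanation "people who bought $X$ also bought $Y$" corresponds to a large entry `item_sim[X, Y]`.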

Collaborative filtering explained in detail
A deep dive into recommendation engine algorithms: collaborative filtering

Matrix Completion

Matrix completion fills in the missing entries of a partially observed matrix $X$, for example by solving

$$ \min_{Z} \operatorname{rank}(Z) \quad \text{s.t.} \quad \sum_{(i,j):\text{Observed}}(Z_{(i,j)}-X_{(i,j)})^2\leq \delta. $$

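Note that rank minimization is NP-hard in general and that the rank of a matrix is not robust to small perturbations, so the rank is usually replaced by a convex surrogate, the nuclear norm ${\|\cdot\|}_{\ast}$.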

We can apply the customized proximal point algorithm (PPA) to the matrix completion problem

$$ \min_{Z} {\|Z\|}_{\ast} \quad \text{s.t.} \quad Z_{\Omega} = X_{\Omega}. $$

Let ${Y}\in\mathbb{R}^{n\times n}$ be the Lagrange multiplier for the constraint $Z_{\Omega} = X_{\Omega}$; the Lagrangian is $$ L(Z,Y) = {\|Z\|}_{\ast} - \left<Y, Z_{\Omega} - X_{\Omega}\right>. $$

Produce $Y^{k+1}$ by $$Y^{k+1}=\arg\max_{Y} \left\{L(2Z^k-Z^{k-1},Y)-\frac{s}{2}\|Y-Y^k\|_F^2\right\};$$
Produce $Z^{k+1}$ by $$Z^{k+1}=\arg\min_{Z} \left\{L(Z,Y^{k+1}) + \frac{r}{2}\|Z-Z^k\|_F^2\right\}.$$

Rahul Mazumder, Trevor Hastie, Robert Tibshirani reformulate it as the following:

$$ \min_{Z} f_{\lambda}(Z)=\frac{1}{2}{\|P_{\Omega}(Z-X)\|}_F^2 + \lambda {\|Z\|}_{\ast} $$

where $X$ is the observed matrix, $P_{\Omega}$ is the projector onto the observed entries $\Omega$, and ${\|\cdot\|}_{\ast}$ is the nuclear norm of a matrix.
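One way to minimize this nuclear-norm regularized objective is the Soft-Impute iteration of Mazumder, Hastie and Tibshirani: fill the missing entries with the current estimate, then soft-threshold the singular values. The following is a minimal dense-matrix sketch; the function names, the toy data and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def svt(A, tau):
    """Singular value soft-thresholding: shrink the singular values of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_impute(X, mask, lam, n_iters=200):
    """Iterate Z <- SVT_lam(P_Omega(X) + P_Omega^perp(Z)), starting from Z = 0."""
    Z = np.zeros_like(X)
    for _ in range(n_iters):
        filled = np.where(mask, X, Z)   # observed entries from X, the rest from Z
        Z = svt(filled, lam)
    return Z

def objective(Z, X, mask, lam):
    """0.5 * ||P_Omega(Z - X)||_F^2 + lam * ||Z||_*"""
    resid = np.where(mask, Z - X, 0.0)
    nuclear = np.linalg.svd(Z, compute_uv=False).sum()
    return 0.5 * np.sum(resid ** 2) + lam * nuclear

# Toy rank-1 matrix with roughly 30% of the entries hidden.
rng = np.random.default_rng(0)
X_full = np.outer(rng.standard_normal(6), rng.standard_normal(5))
mask = rng.random(X_full.shape) < 0.7
X_obs = np.where(mask, X_full, 0.0)
Z_hat = soft_impute(X_obs, mask, lam=0.1)
```

A full SVD per iteration is only practical for small matrices; large-scale implementations rely on low-rank SVD routines instead.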

A SINGULAR VALUE THRESHOLDING ALGORITHM FOR MATRIX COMPLETION
Matrix and Tensor Decomposition in Recommender Systems
Low-Rank Matrix Recovery
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization

Matrix Completion/Sensing as NonConvex Optimization Problem
Exact Matrix Completion via Convex Optimization
Customized PPA for convex optimization
Matrix Completion.m

Maximum Margin Matrix Factorization

A novel approach to collaborative prediction is presented, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and present generalization error bounds based on analyzing the Rademacher complexity of low-norm factorizations.

Consider soft-margin learning, where we minimize a trade-off between the trace norm of $Z$ and its hinge loss relative to the observed entries $X_O$: $$ \min_{Z} {\|Z\|}_{\ast} + c \sum_{(ui)\in O}\max(0, 1 - Z_{ui}X_{ui}), $$ where ${\|\cdot\|}_{\ast}$ denotes the trace (nuclear) norm.

It can be rewritten as a semidefinite program (SDP): $$ \min_{A, B, Z, \xi} \frac{1}{2}(\operatorname{tr}(A)+\operatorname{tr}(B))+c\sum_{(ui)\in O}\xi_{ui} \quad \text{s.t.} \quad \begin{bmatrix} A & Z \\ Z^T & B \end{bmatrix} \succeq 0, \quad Z_{ui}X_{ui}\geq 1- \xi_{ui}, \quad \xi_{ui}\geq 0 \quad \forall (ui)\in O, $$ where $c$ is a trade-off constant.

Maximum Margin Matrix Factorization
Fast Maximum Margin Matrix Factorization for Collaborative Prediction
Maximum Margin Matrix Factorization by Nathan Srebro

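This technique should not be confused with nonnegative matrix factorization (NMF), which constrains the factors to be entrywise nonnegative rather than bounding their norms.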

$\color{red}{Note:}$ The data sets we most frequently encounter in collaborative prediction problems consist of ordinal ratings $X_{ij} \in \{1, 2, \dots, R\}$, such as $\{1, 2, 3, 4, 5\}$. To relate the real-valued $Z_{ij}$ to the discrete $X_{ij}$, we use $R - 1$ thresholds $\theta_{1}, \dots, \theta_{R-1}$.

SVD and Beyond

Suppose we have collected user ${u}$'s explicit rating of item ${i}$, $R_{[u][i]}$, and all such ratings make up a matrix $R=(R_{[u][i]})$. No user can rate every item, so the matrix is incomplete and most of its entries are missing. The SVD-style approach factorizes the matrix into a product of two low-dimensional factor matrices, $$ \hat{R} = P^{T}Q. $$

We can then predict the score $R_{[u][i]}$ via $$ \hat{R}_{[u][i]} = \hat{r}_{u,i} = \left<P_u, Q_i\right> = \sum_f p_{u,f} q_{i,f}, $$

where $P_u$ and $Q_i$ are the ${u}$-th column of ${P}$ and the ${i}$-th column of ${Q}$, respectively. We define the cost function

$$ C(P,Q) = \sum_{(u,i):\text{Observed}}(r_{u,i}-\hat{r}_{u,i})^{2}=\sum_{(u,i):\text{Observed}}\Big(r_{u,i}-\sum_f p_{u,f}q_{i,f}\Big)^{2} $$ and seek $\arg\min_{P, Q} C(P, Q)$.

Additionally, we can add a regularization term to the cost function to avoid overfitting:

$$ C(P,Q) = \sum_{(u,i):\text{Observed}}\Big(r_{u,i}-\sum_f p_{u,f}q_{i,f}\Big)^{2}+\lambda_u\|P_u\|^2+\lambda_i\|Q_i\|^2, $$

where in practice $\lambda_u$ is usually set equal to $\lambda_i$.

It is called the regularized singular value decomposition or Regularized SVD.

Funk-SVD also accounts for user and item biases. It predicts the scores by $$ \hat{r}_{u,i} = \mu + b_u + b_i + \left< P_u, Q_i \right>, $$ where $\mu$, $b_u$, and $b_i$ are the global mean rating, the bias of user ${u}$, and the bias of item ${i}$, respectively. The cost function is defined as $$ \min\sum_{(u,i):\text{Observed}}(r_{u,i} - \hat{r}_{u,i})^2 + \lambda (\|P_u\|^2+\|Q_i\|^2+|b_i|^2+|b_u|^2). $$
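The biased model is commonly fitted with stochastic gradient descent over the observed ratings. The sketch below is one such loop under stated assumptions: the factors are stored as rows of `P` and `Q` rather than columns, a single $\lambda$ is shared by all parameters, and the toy ratings, learning rate and epoch count are arbitrary choices for illustration.

```python
import numpy as np

def train_biased_mf(ratings, n_users, n_items, k=2, lr=0.02, lam=0.02,
                    epochs=300, seed=0):
    """SGD on sum (r - r_hat)^2 + lam * (|P_u|^2 + |Q_i|^2 + b_u^2 + b_i^2)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # row u is the factor vector P_u
    Q = 0.1 * rng.standard_normal((n_items, k))   # row i is the factor vector Q_i
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = float(np.mean([r for _, _, r in ratings]))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
            # Both factor updates use the values from before this step.
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - lam * P[u]),
                          Q[i] + lr * (err * P[u] - lam * Q[i]))
    return mu, b_u, b_i, P, Q

def rmse(ratings, predict):
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings])))

# Toy (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 1.0), (3, 0, 5.0), (3, 2, 2.0)]
mu, b_u, b_i, P, Q = train_biased_mf(ratings, n_users=4, n_items=3)

rmse_model = rmse(ratings, lambda u, i: mu + b_u[u] + b_i[i] + P[u] @ Q[i])
rmse_mean = rmse(ratings, lambda u, i: mu)
```

Comparing `rmse_model` with `rmse_mean` on the training triples only checks that the fit improves on the global-mean baseline; it says nothing about held-out accuracy.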

SVD ++ predicts the scores by

$$ \hat{r}_{u,i} = \mu + b_u + b_i + \Big(P_u + |N(u)|^{-0.5}\sum_{j\in N(u)} y_j\Big)^{T} Q_i, $$ where $N(u)$ is the set of items for which user ${u}$ has given implicit feedback and $y_j$ is the implicit-feedback factor vector of item ${j}$. The prediction decomposes into 3 parts:

$\mu + b_u + b_i$ is the baseline prediction;
$\left<P_u, Q_i\right>$ is the latent factor (SVD) model of the rating matrix;
$\left<|N(u)|^{-0.5}\sum_{j\in N(u)} y_j, Q_i\right>$ is the implicit feedback term.

We learn the values of involved parameters by minimizing the regularized squared error function.
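To make the three parts of the SVD++ prediction concrete, here is a small prediction-only sketch. It assumes the factors are stored as rows of `P`, `Q` and `Y`, and that `N_u` is a plain dictionary from a user index to the list of items with implicit feedback; these names and the random toy parameters are illustrative, and training is not shown.

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, P, Q, Y, N_u, u, i):
    """Baseline + <P_u, Q_i> + <|N(u)|^{-1/2} * sum_{j in N(u)} y_j, Q_i>."""
    items = N_u.get(u, [])
    if len(items) > 0:
        implicit = Y[items].sum(axis=0) / np.sqrt(len(items))
    else:
        implicit = np.zeros(P.shape[1])   # no implicit feedback for this user
    return mu + b_u[u] + b_i[i] + (P[u] + implicit) @ Q[i]

# Toy parameters: 2 users, 3 items, 2 latent factors.
rng = np.random.default_rng(1)
P = rng.standard_normal((2, 2))
Q = rng.standard_normal((3, 2))
Y = rng.standard_normal((3, 2))
b_u = np.array([0.1, -0.2])
b_i = np.array([0.3, 0.0, -0.1])
mu = 3.5
N_u = {0: [0, 2], 1: []}

pred_with_feedback = svdpp_predict(mu, b_u, b_i, P, Q, Y, N_u, u=0, i=1)
pred_without_feedback = svdpp_predict(mu, b_u, b_i, P, Q, Y, N_u, u=1, i=1)
```

For a user with an empty $N(u)$ the prediction reduces to the biased factor model.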

Biased Regularized Incremental Simultaneous Matrix Factorization@orange3-recommender
SVD++@orange3-recommender
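Matrix factorization: SVD and SVD++
SVD++: improving matrix-factorization-based collaborative filtering for recommender systems
https://zhuanlan.zhihu.com/p/42269534
Collaborative filtering with SVD++ (the section on algorithm principles is mainly quoted from other authors)
SVD++ recommender system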

Probabilistic Matrix Factorization

In linear regression, the least squares method is equivalent to maximum likelihood estimation when the errors are i.i.d. zero-mean Gaussian.

Regularized SVD
$C(P,Q) = \sum_{(u,i):Observed}(r_{(u,i)}-\sum_f p_{(u,f)} q_{(i,f)})^{2}+\lambda_u\|P_u\|^2+\lambda_i\|Q_i\|^2$

Probabilistic model
$r_{u,i}\sim N(\sum_f p_{(u,f)} q_{(i,f)},\sigma^2), P_u\sim N(0,\sigma_u^2 I), Q_i\sim N(0,\sigma_i^2 I)$

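The prior variances $\sigma_u^2$ and $\sigma_i^2$ determine the regularization weights $\lambda_u$ and $\lambda_i$: the smaller the prior variance, the stronger the regularization.

So we can reformulate the optimization problem as maximum a posteriori (MAP) estimation in this probabilistic model.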
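The link between the two formulations can be made explicit with a short derivation. Under the Gaussian likelihood and priors stated above, and treating $\sigma^2, \sigma_u^2, \sigma_i^2$ as fixed, the negative log-posterior is

$$ -\log p(P,Q\mid R) = \frac{1}{2\sigma^2}\sum_{(u,i):\text{Observed}}\Big(r_{u,i}-\sum_f p_{u,f}q_{i,f}\Big)^2 + \frac{1}{2\sigma_u^2}\sum_u\|P_u\|^2 + \frac{1}{2\sigma_i^2}\sum_i\|Q_i\|^2 + \text{const}. $$

Multiplying by $2\sigma^2$ gives a squared-error cost with quadratic penalties on the factors, with weights

$$ \lambda_u = \frac{\sigma^2}{\sigma_u^2}, \qquad \lambda_i = \frac{\sigma^2}{\sigma_i^2}, $$

so maximizing the posterior amounts to minimizing a regularized SVD cost in which every $P_u$ and every $Q_i$ is penalized.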

Latent Factor Models for Web Recommender Systems
Regression-based Latent Factor Models @CS 732 - Spring 2018 - Advanced Machine Learning by Zhi Wei
Probabilistic Matrix Factorization
Indexable Probabilistic Matrix Factorization for Maximum Inner Product Search

Poisson Factorization

We develop a Bayesian Poisson matrix factorization model for forming recommendations from sparse user behavior data. These data are large user/item matrices where each user has provided feedback on only a small subset of items, either explicitly (e.g., through star ratings) or implicitly (e.g., through views or purchases). In contrast to traditional matrix factorization approaches, Poisson factorization implicitly models each user's limited attention to consume items. Moreover, because of the mathematical form of the Poisson likelihood, the model needs only to explicitly consider the observed entries in the matrix, leading to both scalable computation and good predictive performance. We develop a variational inference algorithm for approximate posterior inference that scales up to massive data sets. This is an efficient algorithm that iterates over the observed entries and adjusts an approximate posterior over the user/item representations. We apply our method to large real-world user data containing users rating movies, users listening to songs, and users reading scientific papers. In all these settings, Bayesian Poisson factorization outperforms state-of-the-art matrix factorization methods.

https://lkpy.readthedocs.io/en/stable/hpf.html
https://hpfrec.readthedocs.io/en/latest/
Scalable Recommendation with Hierarchical Poisson Factorization
Dynamic Poisson Factorization
Coupled Poisson Factorization Integrated with User/Item Metadata for Modeling Popular and Sparse Ratings in Scalable Recommendation
Coupled Compound Poisson Factorization
https://github.com/mehmetbasbug/ccpf

Collaborative Less-is-More Filtering (CLiMF)

Sometimes the only user information we can collect is implicit, such as clicks on items.

In CLiMF the model parameters are learned by directly maximizing the Mean Reciprocal Rank (MRR).

Its objective function is $$ F(U,V)=\sum_{i=1}^{M}\sum_{j=1}^{N} Y_{ij} \Big[\ln g(U_{i}^{T}V_{j})+\sum_{k=1}^{N}\ln \big(1 - Y_{ik}\, g(U_{i}^{T}V_{k}-U_{i}^{T}V_{j})\big)\Big] -\frac{\lambda}{2}({\|U\|}^2 + {\|V\|}^2), $$

where $M$ and $N$ are the numbers of users and items, respectively. Additionally, $\lambda$ denotes the regularization coefficient and $Y_{ij}$ denotes the binary relevance score of item ${j}$ to user ${i}$, i.e., $Y_{ij} = 1$ if item ${j}$ is relevant to user ${i}$, 0 otherwise. The function $g$ is the logistic function $g(x)=\frac{1}{1+\exp(-x)}$. The vector $U_i$ denotes a d-dimensional latent factor vector for user ${i}$, and $V_j$ a d-dimensional latent factor vector for item ${j}$.

| Numbers | | Factors | | Others | |
|---|---|---|---|---|---|
| $M$ | the number of users | $U_i$ | latent factor vector for user ${i}$ | $Y_{ij}$ | binary relevance score |
| $N$ | the number of items | $V_j$ | latent factor vector for item ${j}$ | $g$ | logistic function |

We use stochastic gradient ascent to maximize the objective function.
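Before writing the ascent step, it can help to evaluate the objective directly. The sketch below computes $F(U,V)$ for dense toy arrays, using the form in the CLiMF paper where the inner sum is weighted by $Y_{ik}$. The function name `climf_objective` and the toy data are assumptions, and the gradient step itself is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def climf_objective(U, V, Y, lam):
    """Smoothed lower bound of MRR; U is (M, d), V is (N, d), Y is (M, N) binary."""
    f = U @ V.T                                    # f[i, j] = U_i^T V_j
    diff = f[:, None, :] - f[:, :, None]           # diff[i, j, k] = f[i, k] - f[i, j]
    # inner[i, j] = sum_k ln(1 - Y_ik * g(f_ik - f_ij))
    inner = np.log(1.0 - Y[:, None, :] * sigmoid(diff)).sum(axis=2)
    total = (Y * (np.log(sigmoid(f)) + inner)).sum()
    return total - 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

# Toy problem: 3 users, 4 items, 2 latent factors.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((3, 2))
V = 0.1 * rng.standard_normal((4, 2))
Y = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
F_value = climf_objective(U, V, Y, lam=0.01)
```

The three-dimensional `diff` array costs $O(MN^2)$ memory, so this form is only suitable for checking a small example; practical implementations loop over each user's relevant items.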

Collaborative Less-is-More Filtering@orange3-recommendation
https://dl.acm.org/citation.cfm?id=2540581
Collaborative Less-is-More Filtering python Implementation
CLiMF: Collaborative Less-Is-More Filtering

Matrix Factorization for Implicit Feedback

Another advantage of collaborative filtering or matrix completion is that it still applies when the matrix entries are binary or carry only implicit information. See, for example:

BPR: Bayesian Personalized Ranking from Implicit Feedback,
Applications of the conjugate gradient method for implicit feedback collaborative filtering,
Intro to Implicit Matrix Factorization,
a curated list on github.com.

Explicit and implicit feedback

Weighted regularized matrix factorization (WRMF) is a simple modification of the regularized SVD loss function:

$$ {C(P,Q)}_{WRMF} = \sum_{(u,i)}c_{u,i}\Big(I_{u,i} - \sum_f p_{u,f}q_{i,f}\Big)^{2} + \underbrace{\lambda_u\|P_u\|^2 + \lambda_i\|Q_i\|^2}_{\text{regularization terms}}, $$ where the sum now runs over all user-item pairs, not only those with an interaction.

We assume that $I_{u,i} = 1$ if user ${u}$ has interacted at all with item ${i}$, and $I_{u,i} = 0$ otherwise. If we take $d_{u,i}$ to be the number of times user ${u}$ has clicked on item ${i}$ on a website, then $$c_{u,i}=1+\alpha d_{u,i},$$ where $\alpha$ is a hyperparameter determined by cross-validation. The new term $C=(c_{u,i})$ in the cost function is called the confidence matrix.

WRMF does not assume that a user who has not interacted with an item certainly dislikes it. It does assign such a pair the negative preference $I_{u,i}=0$, but only with the minimal confidence $c_{u,i}=1$, and the hyperparameter $\alpha$ controls how much more weight observed interactions receive.

Alternating least squares (ALS) solves this optimization problem by fixing one factor matrix and solving for the other in closed form, obtained by setting the gradient to zero, and then alternating between the two.
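The closed-form ALS update can be sketched for small dense matrices as follows. With $Q$ fixed, setting the gradient with respect to $P_u$ to zero gives $(Q^T C^u Q + \lambda I) P_u = Q^T C^u I_u$, where $C^u$ is the diagonal matrix of user ${u}$'s confidences. The code assumes a single $\lambda$ for both factor matrices, a sum over all user-item pairs, and factors stored as rows; the function names and the toy click counts `D` are illustrative.

```python
import numpy as np

def wrmf_loss(P, Q, I, C, lam):
    """Weighted squared error over all pairs plus L2 regularization."""
    return np.sum(C * (I - P @ Q.T) ** 2) + lam * (np.sum(P ** 2) + np.sum(Q ** 2))

def als_half_step(fixed, I, C, lam):
    """Solve for every row of the free factor matrix while `fixed` is held constant."""
    k = fixed.shape[1]
    out = np.empty((I.shape[0], k))
    for u in range(I.shape[0]):
        weighted = fixed.T * C[u]                  # (k, n): columns scaled by confidence
        A = weighted @ fixed + lam * np.eye(k)
        b = weighted @ I[u]
        out[u] = np.linalg.solve(A, b)
    return out

def wrmf_als(D, k=2, alpha=10.0, lam=0.1, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    I = (D > 0).astype(float)                      # binary preference
    C = 1.0 + alpha * D                            # confidence c = 1 + alpha * d
    P = 0.1 * rng.standard_normal((D.shape[0], k))
    Q = 0.1 * rng.standard_normal((D.shape[1], k))
    losses = []
    for _ in range(n_iters):
        P = als_half_step(Q, I, C, lam)            # update users with items fixed
        Q = als_half_step(P, I.T, C.T, lam)        # update items with users fixed
        losses.append(wrmf_loss(P, Q, I, C, lam))
    return P, Q, losses

# Toy click-count matrix d_{u,i} (rows: users, columns: items).
D = np.array([[3, 0, 1, 0],
              [0, 2, 0, 0],
              [1, 0, 0, 4],
              [0, 1, 2, 0]], dtype=float)
P, Q, losses = wrmf_als(D)
```

Because each half step exactly minimizes the loss over one block of variables, the recorded losses should never increase. Production implementations avoid the dense per-user solve by exploiting the sparsity of $C^u - I$.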

Faster Implicit Matrix Factorization
CUDA Tutorial: Implicit Matrix Factorization on the GPU
Intro to Implicit Matrix Factorization: Classic ALS with Sketchfab Models
Logistic Matrix Factorization for Implicit Feedback Data

Discrete Collaborative Filtering

Discrete Collaborative Filtering
https://arxiv.org/abs/2003.10719
https://github.com/hanwangzhang/Discrete-Collaborative-Filtering
Discrete Content-aware Matrix Factorization
Discrete Factorization Machines for Fast Feature-based Recommendation
Binomial Matrix Factorization for Discrete Collaborative Filtering
Discrete Matrix Factorization and Extension for Fast Item Recommendation
Discrete Ranking-based Matrix Factorization with Self-Paced Learning
https://github.com/yixianqianzy/drmf-spl

Recommendation with Implicit Information

Collaborative Filtering for Implicit Feedback Datasets
A Generic Framework for Learning Explicit and Implicit User-Item Couplings in Recommendation
Recommending Based on Implicit Feedback
Fast Collaborative Filtering from Implicit Feedback with Provable Guarantees
http://nicolas-hug.com/blog/matrix_facto_1
http://nicolas-hug.com/blog/matrix_facto_2
http://nicolas-hug.com/blog/matrix_facto_3
A recommender systems development and evaluation package by Mendeley
https://mendeley.github.io/mrec/
Fast Python Collaborative Filtering for Implicit Feedback Datasets
Alternating Least Squares Method for Collaborative Filtering
Implicit Feedback and Collaborative Filtering

BPR: Bayesian Personalized Ranking from Implicit Feedback
Collaborative Filtering for Implicit Feedback Datasets
Improving Pairwise Learning for Item Recommendation from Implicit Feedback
A-RecSys : a Tensorflow Toolkit for Implicit Recommendation Tasks
http://lyst.github.io/lightfm/docs/examples/warp_loss.html

Matrix factorization for recommender system@Wikiwand
http://www.cnblogs.com/DjangoBlog/archive/2014/06/05/3770374.html
Learning to Rank Sketchfab Models with LightFM
Finding Similar Music using Matrix Factorization
Top-N Recommendations from Implicit Feedback Leveraging Linked Open Data

Inductive Matrix Completion

One possible improvement of this cost function is that we may design more appropriate loss function other than the squared error function.

Inductive Matrix Completion (IMC) is an algorithm for recommender systems with side-information of users and items. The IMC formulation incorporates features associated with rows (users) and columns (items) in matrix completion, so that it enables predictions for users or items that were not seen during training, and for which only features are known but no dyadic information (such as ratings or linkages).

IMC assumes that the associations matrix is generated by applying feature vectors associated with its rows as well as columns to a low-rank matrix ${Z}$. The goal is to recover ${Z}$ using observations from ${P}$.

The inputs $x_i, y_j$ are feature vectors. The entry $P_{(i, j)}$ of the matrix is modeled as $P_{(i, j)}=x_i^T Z y_j$, and ${Z}$ is recovered in the factored form $Z=WH^T$:

$$ \min_{W, H} \sum_{(i,j)\in \Omega}\ell(P_{(i,j)}, x_i^T W H^T y_j) + \frac{\lambda}{2}(\| W \|_F^2+\| H \|_F^2). $$ The loss function $\ell$ penalizes the deviation of estimated entries from the observations. Common choices include the squared error $\ell(a,b)=(a-b)^2$ and the logistic loss $\ell(a,b) = \log(1 + \exp(-ab))$.

Inductive Matrix Completion for Recommender Systems with Side-Information
Inductive Matrix Completion for Predicting Gene-Disease Associations

More on Matrix Factorization

The Advanced Matrix Factorization Jungle
Non-negative Matrix Factorizations
http://people.eecs.berkeley.edu/~yima/
New tools for recovering low-rank matrices from incomplete or corrupted observations by Yi Ma@UCB
DiFacto — Distributed Factorization Machines
Learning with Nonnegative Matrix Factorizations
Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Taming Nonconvexity in Information Science, tutorial at ITW 2018.
Nonnegative Matrix Factorization by Optimization on the Stiefel Manifold with SVD Initialization
Matrix and Tensor Completion Algorithms
Parallel matrix factorization for low-rank tensor completion
https://canyilu.github.io/publications/
http://people.eecs.berkeley.edu/~yima/matrix-rank/references.html
A Library of ADMM for Sparse and Low-rank Optimization
https://arxiv.org/abs/1603.06038

Beyond Matrix Completion

There are 2 common techniques in recommender systems:

The goal of matrix factorization techniques in RS is to determine a low-rank approximation of the user-item rating matrix by decomposing it into a product of (user and item) matrices of lower dimensionality (latent factors).
The idea of ensemble methods is to combine multiple alternative machine learning models to obtain more accurate predictions.

There are 2 disadvantages of Matrix Completion:

Postdiction $\neq$ prediction
- Needs initial post data
- Predicts poorly on a random set of items the user has not rated
- Repeatedly recommends items that were already purchased
- The evaluation method of the Netflix Prize is misleading: RMSE (regression) vs. rank-based measures (sorting)
Quality factors beyond accuracy
- Why quality factors are needed:
- Novelty, diversity and unexpectedness (how exactly to recommend new things to users)
- Dependence on context and on the specific problem
- Interaction with users: conversational recommender systems
- Example of context and interaction: To Be Continued: Helping you find shows to continue watching on Netflix (search for the "context")
- Manipulation resistance
- Recommendations may be optimal for sellers rather than users; this calls for transparency and explanation strategies (nearly a moral problem)

From Algorithms to Systems

Beyond the computer science perspective.

Putting the user back in the loop.

Toward a more comprehensive characterization of the recommendation task.

Collaborative filtering has become a key tool in recommender systems. The Netflix competition was instrumental in this context to further development of scalable tools. At its heart lies the minimization of the Root Mean Squares Error (RMSE) which helps to decide upon the quality of a recommender system. Moreover, minimizing the RMSE comes with desirable guarantees of statistical consistency. In this talk I make the case that RMSE minimization is a poor choice for a number of reasons: firstly, review scores are anything but Gaussian distributed, often exhibiting asymmetry and bimodality in their scores. Secondly, in a retrieval setting accuracy matters primarily for the top rated items. Finally, such ratings are highly context dependent and should only be considered in interaction with a user. I will show how this can be accomplished easily by relatively minor changes to existing systems.

https://www.researchgate.net/project/Proactive-Recommendation-Delivery
Beyond Matrix Completion of the traditional Recommender System
Recommender systems: beyond matrix completion
Notes of "Recommender Systems - Beyond Matrix Completion"
Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions

Factorization Machines(FM)

Matrix completion models used in recommender systems, such as regularized SVD, are linear in the latent features and use only the user-item interaction data. Factorization Machines (FM) are inspired by these earlier factorization models. An FM represents each feature by an embedding vector and models the second-order feature interactions: $$ \begin{aligned} \hat{y} &= w_0 + \sum_{i=1}^{n} w_i x_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j \\ &= \underbrace{w_0 + \left<w, x\right>}_{\text{First-order: Linear Regression}} + \underbrace{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j}_{\text{Second-order: pair-wise interactions between features}} \end{aligned} $$

where the model parameters that have to be estimated are $$ w_0 \in \mathbb{R}, w\in\mathbb{R}^n, V\in\mathbb{R}^{n\times k}. $$

Here $\left<\cdot,\cdot\right>$ is the dot (inner) product of two vectors, so that $\left<v_i, v_j\right>=\sum_{f=1}^{k}v_{i,f} \cdot v_{j,f}$. A row $v_i$ of ${V}$ describes the ${i}$-th variable $x_i$ with ${k}$ factors.

The linear regression $w_0 + \sum_{i=1}^{n} w_i x_i$ is called the first-order part; the pairwise interactions between features, $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j$, are called the second-order part.

But why is it called a factorization machine? Where is the factorization? If ${[W]}_{ij}=w_{ij}= \left<v_i, v_j\right>$, that is, $W=V V^T$, then the second-order part satisfies $$\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j=\frac{1}{2}\Big(x^TWx-\sum_{i=1}^{n}w_{ii}x_i^2\Big)=\frac{1}{2}\sum_{i\neq j}w_{ij}x_i x_j,$$ so the model factorizes the interaction weight matrix $W$ into the product of $V$ and $V^T$.

To reduce the computational complexity, the second-order part $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j$ is rewritten in the following form: $$\frac{1}{2}\sum_{l=1}^{k}\Big\{\Big[\sum_{i=1}^{n}v_{il}x_i\Big]^2-\sum_{i=1}^{n}(v_{il}x_i)^2\Big\}.$$ This shows that the second-order part can be computed in $O(kn)$ time instead of the $O(kn^2)$ of the naive double sum.
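The rewriting can be checked numerically. The sketch below computes the second-order term both ways, with the naive double sum over pairs $i<j$ and with the linear-time formula, on random toy data. The function names are illustrative, and `V` is stored with one row per feature.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """O(k n^2): explicit sum over all pairs i < j of <v_i, v_j> x_i x_j."""
    n = len(x)
    return sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(n - 1) for j in range(i + 1, n))

def fm_pairwise_fast(x, V):
    """O(k n): 0.5 * sum_l [ (sum_i v_il x_i)^2 - sum_i (v_il x_i)^2 ]."""
    xv = V.T @ x                         # shape (k,): sum_i v_il x_i for each factor l
    x2v2 = (V.T ** 2) @ (x ** 2)         # shape (k,): sum_i (v_il x_i)^2
    return 0.5 * np.sum(xv ** 2 - x2v2)

def fm_predict(x, w0, w, V):
    """Full FM prediction: bias + linear part + pairwise interactions."""
    return w0 + w @ x + fm_pairwise_fast(x, V)

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.standard_normal(n)
V = rng.standard_normal((n, k))
w = rng.standard_normal(n)

naive_value = fm_pairwise_naive(x, V)
fast_value = fm_pairwise_fast(x, V)
y_hat = fm_predict(x, w0=0.5, w=w, V=V)
```

With sparse inputs, only the nonzero entries of `x` contribute to either sum, which is what makes the linear-time form attractive for recommender data.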

The next step is to find the optimal model parameters with numerical optimization methods. Optimality is usually defined through a loss function $\ell$, and the task is to minimize the sum of losses over the observed data $S=\{(x_i,y_i)\}_{i=1}^N$: $$\arg\min_{\Theta}\sum_{(x_i,y_i)\in S}\ell(\hat{y}(x_i), y_i).$$

Chapter 9: ML Made Accessible, the Factorization Family
The FM Algorithm (Factorization Machine)
Principles of the Factorization Machines recommendation algorithm by 刘建平Pinard
Factorization Machines for Recommendation Systems
http://www.libfm.org/
https://github.com/ibayer/fastFM
https://www.d2l.ai/chapter_recommender-systems/fm.html

FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. It is shown that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly.

In Factorization Machines, the factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs.

And sparse factorization machines enforce group sparsity to remove the effect of a feature of user or a feature of item.

Polynomial networks and factorization machines are two recently-proposed models that can efficiently use feature interactions in classification and regression tasks. In this paper, we revisit both models from a unified perspective. Based on this new view, we study the properties of both models and propose new efficient training algorithms. Key to our approach is to cast parameter learning as a low-rank symmetric tensor estimation problem, which we solve by multi-convex optimization. We demonstrate our approach on regression and recommender system tasks.

https://arxiv.org/abs/1607.08810
Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms
https://www.csie.ntu.edu.tw/~cjlin/talks/sdm2015.pdf
https://www.ismll.uni-hildesheim.de/pub/pdfs/RendleFreudenthaler2010-FPMC.pdf
https://mlconf.com/speakers/steffen-rendle/
http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/FM.pdf
Synergies that Matter: Efficient Interaction Selection via Sparse Factorization Machine
http://csse.szu.edu.cn/staff/panwk/recommendation/Sequence/FPMC.pdf

Field-aware Factorization Machine(FFM)

In FMs, every feature has only one latent vector, which is used to learn its latent effect with every other feature. In FFMs, each feature has several latent vectors, and the one used in an inner product depends on the field of the other feature. Mathematically, $$ \hat{y}=\sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n}\left<v_{j_1,f_2}, v_{j_2,f_1}\right> x_{j_1} x_{j_2}, $$ where $f_1$ and $f_2$ are the fields of $j_1$ and $j_2$, respectively.
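A direct, unoptimized sketch of the field-aware interaction term is given below. It assumes a hypothetical parameter layout in which `V[j, f]` is the latent vector that feature `j` uses against field `f`, and `fields[j]` is the field of feature `j`; neither name comes from libffm.

```python
import numpy as np

def ffm_score(x, fields, V):
    """Sum over pairs j1 < j2 of <V[j1, field(j2)], V[j2, field(j1)]> * x_j1 * x_j2."""
    n = len(x)
    total = 0.0
    for j1 in range(n):
        if x[j1] == 0:
            continue                     # zero features contribute nothing
        for j2 in range(j1 + 1, n):
            if x[j2] == 0:
                continue
            total += (V[j1, fields[j2]] @ V[j2, fields[j1]]) * x[j1] * x[j2]
    return total

# Toy setup: 4 features in 2 fields, 3 latent factors per (feature, field) pair.
rng = np.random.default_rng(0)
n, n_fields, k = 4, 2, 3
x = np.array([1.0, 0.0, 2.0, 1.0])
fields = np.array([0, 0, 1, 1])
V = rng.standard_normal((n, n_fields, k))
score = ffm_score(x, fields, V)
```

When all features share one field, each feature has a single latent vector and the score reduces to the second-order term of a plain FM.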

https://www.csie.ntu.edu.tw/~cjlin/
https://github.com/ycjuan/libffm
Yuchin Juan at ACEMAP
Field-aware Factorization Machines for CTR Prediction
https://blog.csdn.net/mmc2015/article/details/51760681
https://www.aaai.org/ojs/index.php/AAAI/article/view/4267/4145
https://ailab.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/

Convex Factorization Machines

Factorization machines are a generic framework which allows to mimic many factorization models simply by feature engineering. In this way, they combine the high predictive accuracy of factorization models with the flexibility of feature engineering. Unfortunately, factorization machines involve a non-convex optimization problem and are thus subject to bad local minima. In this paper, we propose a convex formulation of factorization machines based on the nuclear norm. Our formulation imposes fewer restrictions on the learned model and is thus more general than the original formulation. To solve the corresponding optimization problem, we present an efficient globally-convergent two-block coordinate descent algorithm. Empirically, we demonstrate that our approach achieves comparable or better predictive accuracy than the original factorization machines on 4 recommendation tasks and scales to datasets with 10 million samples.

The objective function to optimize is the regularized (structured) empirical loss: $$\sum_{i}\ell(\hat{y}(x_i), y_i)+\frac{\alpha}{2}\|w\|_2^2+\beta{\|Z\|}_{\ast},$$ where $Z$ is the matrix of pairwise interaction weights, which takes the place of the factorized $VV^T$, and ${\|Z\|}_{\ast}$ is the nuclear norm of $Z$.

Online Compact Convexified Factorization Machine
http://mblondel.org/talks/mblondel-cambridge-2015-09.pdf
https://bigdata.nii.ac.jp/eratokansyasai4/wp-content/uploads/2017/09/929e8b7e82a0043cc993d328bfbb400e.pdf
https://maidousj.github.io/2020/06/02/Convex-FM/
http://www.yichang-cs.com/yahoo/KDD17_FM.pdf
https://arxiv.org/abs/1507.01073
http://talks.cam.ac.uk/talk/index/60262

Higher-Order Factorization Machines

In the Factorization Machines paper, the $d$-way factorization machine is proposed in the following form: $$\hat{y}=w_0+\left<x, w\right>+\sum_{m=2}^d\sum_{n_1=1}^n\cdots\sum_{n_m=n_{m-1}+1}^n\Big(\prod_{j=1}^{m}x_{n_j}\Big)\Big(\sum_{f=1}^k\prod_{j=1}^{m}v_{n_j,f}^{(m)}\Big).$$ The Higher-Order Factorization Machines paper notes that, despite increasing interest in FMs, no efficient training algorithm for higher-order FMs (HOFMs) existed at the time, and it sets out to provide one.

An FM can be viewed as second-order polynomial regression with lower computational complexity, and equivalently as ANOVA kernel regression of degree 2. This is why FMs generalize to Higher-Order Factorization Machines (HOFMs) based on the ANOVA kernel.
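The ANOVA kernel of degree $m$, $A^m(p, x) = \sum_{j_1 < \dots < j_m} \prod_{t=1}^{m} p_{j_t} x_{j_t}$, can be evaluated with a simple dynamic program instead of enumerating all index combinations. The sketch below is an illustrative implementation with a brute-force check; the function names and the dictionary `factors`, which maps an order $m$ to an $n \times k$ parameter matrix, are assumptions for this example.

```python
import numpy as np
from itertools import combinations

def anova_kernel(p, x, m):
    """A^m(p, x) in O(n m) time via a[t] <- a[t] + p_j x_j * a[t - 1]."""
    a = np.zeros(m + 1)
    a[0] = 1.0
    for j in range(len(x)):
        z = p[j] * x[j]
        for t in range(min(j + 1, m), 0, -1):   # high to low, so a[t - 1] is still the old value
            a[t] += z * a[t - 1]
    return a[m]

def anova_kernel_brute(p, x, m):
    """Reference implementation: enumerate every combination of m distinct indices."""
    z = p * x
    return sum(np.prod(z[list(c)]) for c in combinations(range(len(x)), m))

def hofm_predict(x, w0, w, factors):
    """w0 + <w, x> + sum over orders m and latent dimensions f of A^m(V_m[:, f], x)."""
    y = w0 + w @ x
    for m, Vm in factors.items():
        y += sum(anova_kernel(Vm[:, f], x, m) for f in range(Vm.shape[1]))
    return y

rng = np.random.default_rng(0)
n, k = 5, 2
x = rng.standard_normal(n)
w = rng.standard_normal(n)
factors = {2: rng.standard_normal((n, k)), 3: rng.standard_normal((n, k))}
y_hat = hofm_predict(x, 0.1, w, factors)
```

With only the order-2 entry in `factors`, the prediction coincides with the ordinary FM model, which is the sense in which FMs are degree-2 ANOVA kernel regression.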

https://people.eecs.berkeley.edu/~jordan/kernels/0521813972c09_p291-326.pdf
http://mblondel.org/talks/mblondel-stair-2016-09.pdf
https://rdrr.io/cran/FactoRizationMachines/man/010-FactoRizationMachines.html
https://papers.nips.cc/paper/6144-higher-order-factorization-machines.pdf
https://ideas.repec.org/p/zbw/iwqwdp/132017.html
http://www.kecl.ntt.co.jp/as/members/ueda/

Deep Learning for Recommender System

Deep learning is powerful at processing visual and textual information, which helps to uncover user interests; examples include Deep Interest Network and xDeepFM.

Deep learning models for recommender systems can be traced back to the restricted Boltzmann machine. Deep learning models are powerful information extractors, and they are now widely used in recommender systems, for example in spotlight.

What role does deep learning play in recommender systems? On the one hand, deep learning helps to match users and items based on the history of their interactions, as in deep matching and deep collaborative learning. Mathematically, it is a function that evaluates how likely the user is to interact with an item in some context: $f(X_U, X_I, X_C)$, where $X_U, X_I, X_C$ are the features of the user, the item, and the context, respectively. On the other hand, deep learning serves as a representation method that embeds high-dimensional sparse data into a semantic space.

A review on deep learning for recommender systems: challenges and remedies
Deep Learning Recommendation Model for Personalization and Recommendation Systems
http://tw991.github.io/
https://dlp-kdd.github.io/
https://recsys.acm.org/recsys17/workshops/
https://recsys.acm.org/recsys17/dlrs/
https://dl.acm.org/citation.cfm?id=3125486
The 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data with KDD 2019 (DLP-KDD 2019)
https://recsys.acm.org/recsys19/session-3/
http://bdsc.lab.uic.edu/docs/survey-critique-deep.pdf

Deep Learning Meets Recommendation Systems
Using Keras' Pretrained Neural Networks for Visual Similarity Recommendations
Recommending music on Spotify with deep learning
https://bdsc.lab.uic.edu/docs/survey-critique-deep.pdf
Deep Learning based Recommender System
http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=101417&copyownerid=158713
https://786121244.github.io/NeuRec-Workshop/

Restricted Boltzmann Machines for Collaborative Filtering

Let ${V}$ be a $K\times m$ observed binary indicator matrix with $v_i^k = 1$ if the user rated item ${i}$ as ${k}$ and ${0}$ otherwise. We also let $h_j$, $j = 1, \dots, F,$ be the binary values of hidden (latent) variables, that can be thought of as representing stochastic binary features that have different values for different users.

We use a conditional multinomial distribution (a “softmax”) for modeling each column of the observed "visible" binary rating matrix ${V}$ and a conditional Bernoulli distribution for modeling "hidden" user features ${h}$: $$ \begin{aligned} p(v_i^k = 1 \mid h) &= \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{i,j}^{k})}{\sum_{l=1}^{K}\exp( b_i^l + \sum_{j=1}^{F} h_j W_{i, j}^{l})}, \\ p( h_j = 1 \mid V) &= \sigma\Big(b_j + \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k W_{i,j}^k\Big), \end{aligned} $$ where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the logistic function, $W_{i,j}^{k}$ is a symmetric interaction parameter between feature ${j}$ and rating ${k}$ of item ${i}$, $b_i^k$ is the bias of rating ${k}$ for item ${i}$, and $b_j$ is the bias of feature $j$.
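The two conditional distributions can be evaluated with a few lines of NumPy. The sketch below assumes the weights are stored as an array `W` of shape `(K, m, F)` so that `W[k, i, j]` plays the role of $W_{i,j}^{k}$, uses the standard softmax with bias $b_i^l$ in the $l$-th term of the normalizer, and treats an all-zero column of `V` as an unrated item; the names and the random toy parameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(V, W, b_h):
    """p(h_j = 1 | V) for all j; V is (K, m), W is (K, m, F), b_h is (F,)."""
    return sigmoid(b_h + np.einsum('ki,kij->j', V, W))

def visible_probs(h, W, b_v):
    """p(v_i^k = 1 | h) as a (K, m) array; each column is a softmax over the K ratings."""
    logits = b_v + np.einsum('j,kij->ki', h, W)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

# Toy sizes: K = 5 rating levels, m = 4 items, F = 3 hidden features.
rng = np.random.default_rng(0)
K, m, F = 5, 4, 3
W = 0.1 * rng.standard_normal((K, m, F))
b_v = np.zeros((K, m))
b_h = np.zeros(F)

V = np.zeros((K, m))
V[4, 0] = 1.0      # item 0 rated with the highest level (index 4)
V[1, 2] = 1.0      # item 2 rated with level index 1; items 1 and 3 are unrated

p_h = hidden_probs(V, W, b_h)
h_sample = (rng.random(F) < p_h).astype(float)   # one Bernoulli sample of the hidden units
p_v = visible_probs(h_sample, W, b_v)
```

Because unrated items have all-zero columns in `V`, they drop out of the sum inside `hidden_probs`.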

The marginal distribution over the visible ratings ${V}$ is $$ p(V) = \sum_{h}\frac{\exp(-E(V,h))}{\sum_{V^{\prime},h^{\prime}} \exp(-E(V^{\prime},h^{\prime}))} $$ with an "energy" term given by:

$$ E(V,h) = -\sum_{i=1}^{m}\sum_{j=1}^{F}\sum_{k=1}^{K}W_{i,j}^{k} h_j v_i^k - \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k b_i^k -\sum_{j=1}^{F} h_j b_j. $$ Items with missing ratings do not contribute to the energy function.

The parameter updates required to perform gradient ascent in the log-likelihood over the visible ratings ${V}$ are $$ \Delta W_{i,j}^{k} = \epsilon \frac{\partial\log(p(V))}{\partial W_{i,j}^{k}} $$ where $\epsilon$ is the learning rate. The exact gradient involves an expectation under the model distribution, which is intractable, so the authors approximate it with Contrastive Divergence.
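
The conditionals and the update rule above can be sketched in NumPy as a single Contrastive Divergence step (CD-1) for one user. This is a minimal sketch, not the authors' implementation: the array layout (`V` as a `K x m` matrix, `W` as a `K x m x F` tensor), the `rated` mask, and the use of mean-field probabilities instead of sampled ratings in the reconstruction are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(V, W, b_hid):
    # p(h_j = 1 | V) = sigmoid(b_j + sum_i sum_k v_i^k W_ij^k)
    return sigmoid(b_hid + np.einsum("ki,kij->j", V, W))

def visible_probs(h, W, b_vis):
    # p(v_i^k = 1 | h): softmax over the K rating values of each item i
    logits = b_vis + np.einsum("kij,j->ki", W, h)      # shape (K, m)
    logits -= logits.max(axis=0, keepdims=True)        # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

def cd1_update(V, rated, W, b_vis, b_hid, lr, rng):
    # V: (K, m) one-hot columns for rated items, zero columns otherwise.
    # rated: (m,) boolean mask; unrated items must not contribute.
    p_h0 = hidden_probs(V, W, b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden units
    p_v1 = visible_probs(h0, W, b_vis) * rated           # mean-field reconstruction
    p_h1 = hidden_probs(p_v1, W, b_hid)
    # positive phase minus negative phase
    W += lr * (np.einsum("ki,j->kij", V, p_h0) - np.einsum("ki,j->kij", p_v1, p_h1))
    b_vis += lr * (V - p_v1)
    b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid
```

Because the columns of unrated items are zero in both phases, their weights receive no gradient, which mirrors the statement that missing ratings do not enter the energy.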

We can also model “hidden” user features $h$ as Gaussian latent variables: $$ \begin{aligned} p(v_i^k = 1 \mid h) &= \frac{\exp(b_i^k+\sum_{j=1}^{F}h_j W_{i,j}^{k})}{\sum_{l=1}^{K}\exp(b_i^l+\sum_{j=1}^{F}h_j W_{i,j}^{l})}, \\ p( h_j = h \mid V) &= \frac{1}{\sqrt{2\pi}\sigma_j} \exp\left(-\frac{(h - b_j -\sigma_j \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k W_{i,j}^k)^2}{2\sigma_j^2}\right), \end{aligned} $$ where $\sigma_j^2$ is the variance of the hidden unit ${j}$.

https://www.cnblogs.com/pinard/p/6530523.html
https://www.cnblogs.com/kemaswill/p/3269138.html
Restricted Boltzmann Machines for Collaborative Filtering
Building a Book Recommender System using Restricted Boltzmann Machines
On Contrastive Divergence Learning
http://deeplearning.net/tutorial/rbm.html
RBM notebook from Microsoft

AutoRec for Collaborative Filtering

AutoRec is a novel autoencoder framework for collaborative filtering (CF). Empirically, AutoRec’s compact and efficiently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBMCF and LLORMA) on the Movielens and Netflix datasets.

Formally, the objective function for the Item-based AutoRec (I-AutoRec) model is, for regularization strength $\lambda > 0$,

$$\min_{\theta}\sum_{i=1}^{n} \|r^{i}-h(r^{i}|\theta)\|_{O}^2 +\frac{\lambda}{2}\left(\|W\|_F^{2}+ \|V\|_F^{2}\right)$$

where ${r^{i}\in\mathbb{R}^{d}, i=1,2,\dots,n}$ is a partially observed vector and $\|\cdot\|_{O}^2$ means that we only consider the contribution of observed ratings. The function $h(r|\theta)$ is the reconstruction of input $r\in\mathbb{R}^{d}$:

$$h(r|\theta) = f(W\cdot g(Vr+\mu)+b)$$

for activation functions $f, g$ as described in dimension reduction. Here $\theta = \{W, V, \mu, b\}$.
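
As an illustration of how the masked objective is evaluated, the following NumPy sketch computes the reconstruction $h(r|\theta)$ and the regularized loss over observed entries only. It assumes $g$ is the sigmoid and $f$ the identity, and that unobserved ratings are stored as zeros with a separate 0/1 `mask`; those choices and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autorec_reconstruct(r, W, V, mu, b):
    # h(r | theta) = f(W g(V r + mu) + b), here with g = sigmoid and f = identity
    return W @ sigmoid(V @ r + mu) + b

def autorec_objective(R, mask, W, V, mu, b, lam):
    # R:    (n, d) rating vectors, zeros at unobserved positions
    # mask: (n, d) 1.0 where a rating is observed, 0.0 otherwise
    loss = 0.0
    for r, observed in zip(R, mask):
        err = (r - autorec_reconstruct(r, W, V, mu, b)) * observed
        loss += np.sum(err ** 2)               # squared error on observed entries only
    return loss + 0.5 * lam * (np.sum(W ** 2) + np.sum(V ** 2))
```

In practice the gradient is also masked in the same way, so weights attached to unobserved inputs are not updated for that rating vector.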

Reading notes on "AutoRec: Autoencoders Meet Collaborative Filtering" (WWW 2015)
AutoRec: Autoencoders Meet Collaborative Filtering

Deep crossing

https://zhuanlan.zhihu.com/p/91057914
https://www.microsoft.com/en-us/research/people/xingx/
https://www.pianshen.com/article/31571380403/
https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf

Neural collaborative filtering

This model leverages the flexibility and non-linearity of neural networks to replace the dot product of matrix factorization, aiming to enhance model expressiveness. Specifically, the model consists of two subnetworks, generalized matrix factorization (GMF) and an MLP, and models the user-item interaction through these two pathways instead of a simple inner product. The outputs of the two subnetworks are concatenated to compute the final prediction score. Unlike the rating prediction task in AutoRec, this model generates a ranked recommendation list for each user based on implicit feedback. We will use the personalized ranking loss introduced in the last section to train this model.
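
A forward pass through the two pathways can be sketched as follows. The parameter names (`P_gmf`, `Q_gmf`, `P_mlp`, `Q_mlp`, `h`), the ReLU activations, and the BPR-style pairwise loss are assumptions chosen for illustration; this is not the reference implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_users, n_items, k, hidden, rng):
    sizes = [2 * k] + list(hidden)
    return {
        "P_gmf": 0.1 * rng.standard_normal((n_users, k)),
        "Q_gmf": 0.1 * rng.standard_normal((n_items, k)),
        "P_mlp": 0.1 * rng.standard_normal((n_users, k)),
        "Q_mlp": 0.1 * rng.standard_normal((n_items, k)),
        "mlp_layers": [(0.1 * rng.standard_normal((o, i)), np.zeros(o))
                       for i, o in zip(sizes[:-1], sizes[1:])],
        "h": 0.1 * rng.standard_normal(k + sizes[-1]),
    }

def neumf_score(u, i, params):
    # GMF pathway: element-wise product of user and item embeddings
    gmf = params["P_gmf"][u] * params["Q_gmf"][i]
    # MLP pathway: concatenated embeddings passed through dense layers
    x = np.concatenate([params["P_mlp"][u], params["Q_mlp"][i]])
    for W_l, b_l in params["mlp_layers"]:
        x = relu(W_l @ x + b_l)
    # Concatenate both pathways and project to a single score
    return sigmoid(params["h"] @ np.concatenate([gmf, x]))

def bpr_loss(pos_score, neg_score):
    # -log sigmoid(pos - neg), written in a numerically stable form
    return np.logaddexp(0.0, -(pos_score - neg_score))
```

Training would sample, for each observed (user, item) pair, an unobserved item as the negative and minimize `bpr_loss` over those triples.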

Neural Collaborative Filtering
https://d2l.ai/chapter_recommender-systems/neumf.html
https://github.com/hexiangnan/neural_collaborative_filtering
Neural Collaborative Filtering vs. Matrix Factorization Revisited
Outer Product-based Neural Collaborative Filtering

Collaborative deep learning for RecSys

Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

Given part of the ratings in ${R}$ and the content information $X_c$, the problem is to predict the other ratings in ${R}$, where row ${j}$ of the content information matrix $X_c$ is the bag-of-words vector $X_{c;j\ast}$ for item ${j}$ based on a vocabulary of size ${S}$.

A stacked denoising autoencoder (SDAE) is a feedforward neural network that learns representations (encodings) of the input data by learning to predict the clean input from a corrupted version of it. Using the Bayesian SDAE as a component, the generative process of CDL is defined as follows:

For each layer ${l}$ of the SDAE network,
- For each column ${n}$ of the weight matrix $W_l$, draw $$W_{l;\ast n} \sim \mathcal{N}(0,\lambda_w^{-1} I_{K_l}).$$
- Draw the bias vector $$b_l \sim \mathcal{N}(0,\lambda_w^{-1} I_{K_l}).$$
- For each row ${j}$ of $X_l$, draw $$X_{l;j\ast}\sim \mathcal{N}(\sigma(X_{l-1;j\ast}W_l + b_l), \lambda_s^{-1} I_{K_l}).$$
For each item ${j}$,
- Draw a clean input $$X_{c;j\ast}\sim \mathcal{N}(X_{L, j\ast}, \lambda_n^{-1} I_{K_l}).$$
- Draw a latent item offset vector $\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_{K_l})$ and then set the latent item vector to be: $$v_j=\epsilon_j+X^T_{\frac{L}{2}, j\ast}.$$
Draw a latent user vector for each user ${i}$: $$u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_{K_l}).$$
Draw a rating $R_{ij}$ for each user-item pair $(i; j)$: $$R_{ij}\sim \mathcal{N}(u_i^T v_j, C_{ij}^{-1}).$$

Here $\lambda_w, \lambda_s, \lambda_n, \lambda_u$ and $\lambda_v$ are hyperparameters and $C_{ij}$ is a confidence parameter similar to that for CTR ($C_{ij} = a$ if $R_{ij} = 1$ and $C_{ij} = b$ otherwise).

The joint log-likelihood of these parameters is $$\begin{aligned} L = &-\frac{\lambda_u}{2}\sum_{i} \|u_i\|_2^2-\frac{\lambda_w}{2}\sum_{l} \left(\|W_l\|_F^2+\|b_l\|_2^2\right) \\ &-\frac{\lambda_v}{2}\sum_{j} \|v_j - X^T_{\frac{L}{2};j\ast}\|_2^2-\frac{\lambda_n}{2}\sum_{j} \|X_{c;j\ast}-X_{L;j\ast}\|_2^2 \\ &-\frac{\lambda_s}{2}\sum_{l}\sum_{j} \|\sigma(X_{l-1;j\ast}W_l + b_l)-X_{l;j\ast}\|_2^2 -\sum_{i,j} \frac{C_{ij}}{2}(R_{ij}-u_i^Tv_j)^2. \end{aligned}$$

It is not easy to prove that it converges.

http://www.winsty.net/
http://www.wanghao.in/
https://www.cse.ust.hk/~dyyeung/
Collaborative Deep Learning for Recommender Systems
Deep Learning for Recommender Systems
https://github.com/robi56/Deep-Learning-for-Recommendation-Systems
A deep-learning-based hybrid collaborative filtering model for recommender systems
CoupledCF: Learning Explicit and Implicit User-item Couplings in Recommendation for Deep Collaborative Filtering

Wide & Deep Model

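The output of this model is $$ P(Y=1|x) = \sigma(W_{wide}^T[x,\phi(x)] + W_{deep}^T a^{(l_f)}+b) $$ where $\phi(x)$ denotes the cross-product transformations of the raw features $x$ and $a^{(l_f)}$ is the activation of the final hidden layer of the deep network. The wide part is a linear model over the sparse (categorical) features and their cross-products, which memorizes feature co-occurrences; the deep part is a feed-forward network over dense embeddings of the categorical features together with the continuous features, which generalizes to unseen feature combinations.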
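
The output formula can be made concrete with a small NumPy sketch. The dictionary keys (`pairs`, `E`, `layers`, `w_wide`, `w_deep`, `b`) and the choice of ReLU hidden layers are hypothetical names and assumptions for illustration, not the TensorFlow API.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_product(x_sparse, pairs):
    # phi(x): one cross feature per selected pair of binary features
    return np.array([x_sparse[a] * x_sparse[b] for a, b in pairs])

def wide_deep_prob(x_sparse, emb_ids, x_dense, p):
    # Wide part: raw sparse features plus their cross-products
    wide_in = np.concatenate([x_sparse, cross_product(x_sparse, p["pairs"])])
    # Deep part: embeddings of categorical ids concatenated with dense features
    a = np.concatenate([p["E"][emb_ids].ravel(), x_dense])
    for W_l, b_l in p["layers"]:
        a = relu(W_l @ a + b_l)
    # Joint logit: both parts are summed before the sigmoid
    return sigmoid(p["w_wide"] @ wide_in + p["w_deep"] @ a + p["b"])
```

Because the two logits are added before the sigmoid, both parts are trained jointly against one loss rather than being ensembled after separate training.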

https://arxiv.org/pdf/1606.07792.pdf
Wide & Deep Learning: Better Together with TensorFlow, Wednesday, June 29, 2016
Wide & Deep
https://www.sohu.com/a/190148302_115128

Deep FM

DeepFM combines FM and a DNN to learn both second-order and higher-order feature interactions: $$\hat{y}=\sigma(y_{FM} + y_{DNN})$$ where $\sigma$ is the sigmoid function so that $\hat{y}\in[0, 1]$ is the predicted CTR, $y_{FM}$ is the output of the FM component, and $y_{DNN}$ is the output of the deep component.

The FM component is a factorization machine and the output of FM is the summation of an Addition unit and a number of Inner Product units:

$$ y_{FM} = \left<w, x\right>+\sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n}\left<v_{j_1}, v_{j_2}\right> x_{j_1} x_{j_2}. $$
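
The pairwise term, summed over feature pairs $j_1 < j_2$, looks quadratic in the number of features, but it can be computed in linear time with the standard identity $\sum_{j_1<j_2}\left<v_{j_1}, v_{j_2}\right>x_{j_1}x_{j_2} = \frac{1}{2}\sum_{f}\left[(\sum_j v_{j,f}x_j)^2 - \sum_j v_{j,f}^2 x_j^2\right]$. The sketch below checks the fast form against the naive double loop; the function names are illustrative.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    # Direct double sum over pairs j1 < j2: O(n^2 k)
    n = len(x)
    total = 0.0
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            total += (V[j1] @ V[j2]) * x[j1] * x[j2]
    return total

def fm_pairwise_fast(x, V):
    # 0.5 * sum_f [ (sum_j v_jf x_j)^2 - sum_j (v_jf x_j)^2 ]: O(n k)
    xv = V * x[:, None]
    return 0.5 * np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))

def fm_output(x, w, V):
    # Addition unit <w, x> plus the inner-product units
    return w @ x + fm_pairwise_fast(x, V)
```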

The deep component is a feed-forward neural network, which is used to learn high-order feature interactions. As a personal conjecture (not a claim from the paper), an activation function built on $e^x$ can be expanded as the power series $e^x=1+x+\frac{x^2}{2!}+\dots+\frac{x^n}{n!}+\dots$, which contains interaction terms of every order.

We would like to point out the two interesting features of this network structure:

while the lengths of different input field vectors can be different, their embeddings are of the same size $(k)$;
the latent feature vectors $(V)$ in FM now serve as network weights which are learned and used to compress the input field vectors to the embedding vectors.

It is worth pointing out that FM component and deep component share the same feature embedding, which brings two important benefits:

it learns both low- and high-order feature interactions from raw features;
there is no need for expert feature engineering of the input.

https://zhuanlan.zhihu.com/p/27999355
https://zhuanlan.zhihu.com/p/25343518
https://zhuanlan.zhihu.com/p/32127194
https://arxiv.org/pdf/1703.04247.pdf
CTR prediction algorithms: FM, FFM, DeepFM, and practice

Neural Factorization Machines

$$ \hat{y} = w_0 + \left<w, x\right> + f(x) $$ where the first and second terms are the linear regression part similar to that for FM, which models global bias of data and weight of features. The third term $f(x)$ is the core component of NFM for modelling feature interactions, which is a multi-layered feedforward neural network.

The Bi-Interaction layer, which performs Bi-Interaction pooling over the feature embeddings, is the new component that NFM introduces into the neural network.
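
Bi-Interaction pooling compresses the set of feature embeddings into one vector, $f_{BI} = \sum_{i}\sum_{j>i} x_i v_i \odot x_j v_j$, which is then fed to the hidden layers of $f(x)$. A minimal NumPy sketch, assuming dense inputs `x` and an embedding matrix `V` with one row per feature:

```python
import numpy as np

def bi_interaction_pooling(x, V):
    # f_BI = 0.5 * [ (sum_i x_i v_i)^2 - sum_i (x_i v_i)^2 ], all element-wise
    xv = V * x[:, None]
    return 0.5 * (xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))

def nfm_output(x, w0, w, V, layers, h):
    # y = w0 + <w, x> + h^T MLP(f_BI); ReLU hidden layers are an assumption
    z = bi_interaction_pooling(x, V)
    for W_l, b_l in layers:
        z = np.maximum(W_l @ z + b_l, 0.0)
    return w0 + w @ x + h @ z
```

Summing the entries of the pooled vector recovers the second-order term of a plain FM, so with no hidden layers and `h` set to all ones this sketch reduces to FM.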

http://staff.ustc.edu.cn/~hexn/
https://github.com/hexiangnan/neural_factorization_machine
LibRec algorithm of the week: NFM (SIGIR'17)

Attentional Factorization Machines

Attentional Factorization Machine (AFM) learns the importance of each feature interaction from data via a neural attention network.

We employ the attention mechanism on feature interactions by performing a weighted sum on the interacted vectors:

$$\sum_{(i, j)} a_{(i, j)}(V_i \odot V_j) x_i x_j$$

where $a_{(i, j)}$ is the attention score of the interaction between features $i$ and $j$, and $\odot$ denotes the element-wise product.
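
The attention-weighted sum can be sketched as below. The attention network used here, a one-hidden-layer MLP with ReLU followed by a softmax over all pairs of non-zero features, follows the usual description of AFM, but the parameter names (`W_att`, `b_att`, `h_att`) are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def afm_pooling(x, V, W_att, b_att, h_att):
    # Interacted vectors (v_i ⊙ v_j) x_i x_j for every pair of non-zero features
    active = np.flatnonzero(x)
    pairs = list(combinations(active, 2))
    inter = np.array([V[i] * V[j] * x[i] * x[j] for i, j in pairs])   # (P, k)
    # Attention network: score each interaction, then softmax over pairs
    scores = np.maximum(inter @ W_att.T + b_att, 0.0) @ h_att          # (P,)
    scores = scores - scores.max()
    a = np.exp(scores)
    a = a / a.sum()
    # Weighted sum of the interacted vectors
    return a @ inter, a
```

The pooled vector is finally projected to a scalar and added to the linear part of the model, as in FM.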

https://www.comp.nus.edu.sg/~xiangnan/papers/ijcai17-afm.pdf
https://arxiv.org/abs/1708.04617
http://blog.leanote.com/post/ryan_fan/Attention-FM%EF%BC%88AFM%EF%BC%89
https://www.cnblogs.com/Lee-yl/p/9643098.html

xDeepFM

It mainly consists of three parts: an Embedding Layer, a Compressed Interaction Network (CIN), and a DNN.

KDD 2018 | New progress in feature construction for recommender systems: the eXtreme Deep Factorization Machine model
xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
https://arxiv.org/abs/1803.05170
xDeepFM, which reportedly combines RNN and CNN
Recommender Systems Meet Deep Learning (22): xDeepFM, the upgraded version of DeepFM

RepeatNet

https://arxiv.org/pdf/1806.08977.pdf
RepeatNet: A Repeat Aware Neural Recommendation Machine for Session-based Recommendation
https://github.com/PengjieRen/RepeatNet
https://github.com/PengjieRen/RepeatNet-pytorch
https://xamat.github.io/pubs/recsys12-tutorial.pdf

Deep Knowledge-aware Network for News Recommendation
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
https://www.cnblogs.com/pinard/p/6370127.html
https://www.jianshu.com/p/6f1c2643d31b
https://blog.csdn.net/John_xyz/article/details/78933253
https://zhuanlan.zhihu.com/p/38613747
Recommender Systems with Deep Learning
Applications of deep learning in sequential recommendation
Factorization Machines explained in simple terms (series)
Quick paper read: Deep Neural Networks for YouTube Recommendations

Deep Matrix Factorization

Matrix Factorization is a widely used collaborative filtering method in recommender systems. However, most such methods assume that the rating data is missing at random (MAR), which may not be very common. Some users may only rate the movies they like, so the inferences of previous models will be biased. In this paper, we propose a deep matrix factorization method based on the missing not at random (MNAR) assumption. As far as we know, this is the first model that uses deep learning to address the MNAR issue. The model consists of a complete data model (CDM) and a missing data model (MDM), which are both learned by neural networks. The CDM is nonlinearly determined by two factors, the user latent features and the item latent features, like other matrix factorization methods. The MDM also uses these two factors but takes the rating value as extra information during training. We use variational Bayesian inference to generate the posterior distribution of our proposed model. Through extensive experiments on different kinds of datasets, our proposed model produces gains in some widely used metrics compared with several state-of-the-art models. We also explore the performance of our model within different experimental settings.

Deep Matrix Factorization Models for Recommender Systems
Deep Matrix Factorization for Recommender Systems with Missing Data not at Random

Deep Matching Models for Recommendation

It is essential for a recommender system to find the items that match the users' demands. It differs from web search in that a recommender system provides item information even if the users' demands or general interests are not stated explicitly. It sounds like a modern crystal ball that reads your mind.

In A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems the authors propose to extract rich features from user’s browsing and search histories to model user’s interests. The underlying assumption is that, users’ historical online activities reflect a lot about user’s background and preference, and therefore provide a precise insight of what items and topics users might be interested in.

The training data set and the test data are $\{(\mathrm{X}_i, y_i, r_i)\mid i =1, 2, \cdots, n\}$ and $(\mathrm{X}_{n+1}, y_{n+1})$, respectively. The matching model is trained on the training data set: a class of "matching functions" $\mathcal F= \{r(\mathrm{X}, y)\}$ is defined, where the value of each function $r(\mathrm{X}, y)\in \mathcal F$ is a real number from a set of numbers $R$, and $r_{n+1}$ is predicted as $r_{n+1} = r(\mathrm{X}_{n+1}, y_{n+1})$.

The data is assumed to be generated according to the distributions $(x, y) \sim P(X,Y)$, $r \sim P(R \mid X,Y)$. The goal of the learning task is to select a matching function $r(x, y)$ from the class $\mathcal F$ based on the observation of the training data. The learning task, then, becomes the following optimization problem: $$\arg\min_{r\in \mathcal F}\sum_{i=1}^{n}L(r_i, r(x_i, y_i))+\Omega(r)$$ where $L(\cdot, \cdot)$ denotes a loss function and $\Omega(\cdot)$ denotes regularization.

In fact, the inputs x and y can be instances (IDs), feature vectors, and structured objects, and thus the task can be carried out at instance level, feature level, and structure level.

And $r(x, y)$ is supposed to be non-negative in some cases.

Framework of Matching
Output: MLP
Aggregation: Pooling, Concatenation
Interaction: Matrix, Tensor
Representation: MLP, CNN, LSTM
Input: ID Vectors $\mathrm{X}$, Feature Vectors $y$

Sometimes, the matching model and the ranking model are combined and trained together with a pairwise loss. Deep matching models take the ID vectors and features together as the input to a deep neural network that learns the matching scores. Examples include Deep Matrix Factorization, AutoRec, Collaborative Denoising Auto-Encoder, Deep User and Image Feature, Attentive Collaborative Filtering, and Collaborative Knowledge Base Embedding.

semantic-based matching models

https://sites.google.com/site/nkxujun/
http://sonyis.me/dnn.html
https://akmenon.github.io/
https://sigir.org/sigir2018/program/tutorials/
Learning to Match
Deep Learning for Matching in Search and Recommendation
Facilitating the design, comparison and sharing of deep text matching models.
Framework and Principles of Matching Technologies
A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems
Learning to Match using Local and Distributed Representations of Text for Web Search
https://github.com/super-zhangchao/learning-to-match
https://deepctr-doc.readthedocs.io/en/latest/Features.html

Embedding methods for RecSys

https://u.osu.edu/cep1/
https://labs.pinterest.com/publications/embeddings/
https://theory.cs.northwestern.edu/events/embeddings/
https://recsys.acm.org/recsys18/tutorials/
https://dawenl.github.io/publications/LiangACB16-cofactor.pdf
https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18c.pdf
http://diposit.ub.edu/dspace/bitstream/2445/130481/3/memoria.pdf
https://cseweb.ucsd.edu/~jmcauley/workshops/scmls20/
https://www.ismll.uni-hildesheim.de/pub/pdfs/Ahmed_RecSys19.pdf
https://www.cc.gatech.edu/~lsong/papers/rnn_coevolve.pdf
Unified Collaborative Filtering over Graph Embeddings
Hyperbolic embedding
Item-based Collaborative Filtering with BER
https://github.com/xiangwang1223/tree_enhanced_embedding_model

Hyperbolic Recommender Systems

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. Hyperbolic Recommender Systems investigate the notion of learning user and item representations in hyperbolic space.

Given a user ${u}$ and an item ${v}$ that both lie in the Poincaré ball $B^n$, the distance between two points $x, y \in B^n$ is given by $$d_p(x, y)=\cosh^{-1}\left(1+2\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right).$$
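
For reference, the Poincaré distance is straightforward to compute. The sketch below adds a small `eps` guard on the denominator, which is an implementation assumption to avoid division by zero for points numerically on the boundary of the ball.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    # d_p(x, y) = arccosh(1 + 2 ||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps))
```

Distances grow rapidly as points approach the boundary: from the origin, a point at Euclidean norm $\rho$ lies at hyperbolic distance $2\tanh^{-1}(\rho)$.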

Hyperbolic Bayesian Personalized Ranking (HyperBPR) leverages BPR pairwise learning to minimize the pairwise ranking loss between the positive and negative items. Given a user ${u}$ and an item ${v}$ that both lie in the Poincaré ball $B^n$, we take: $$\alpha(u, v) = f(d_p(u, v))$$ where $f(\cdot)$ is simply chosen to be a linear function $f(x) = \beta x + c$, in which $\beta\in\mathbb{R}$ and $c\in\mathbb{R}$ are scalar parameters learned along with the network. The objective function is defined as follows: $$\arg\min_{\Theta} \sum_{i, j, k} -\ln\left(\sigma\left(\alpha(u_i, v_j) - \alpha(u_i, v_k)\right)\right) + \lambda \|\Theta\|_2^2$$

where $(i, j, k)$ is the triplet that belongs to the set ${D}$ that contains all pairs of positive and negative items for each user; $\sigma$ is the logistic sigmoid function; $\Theta$ represents the model parameters; and $\lambda$ is the regularization parameter.
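
The objective can be evaluated with a few lines of NumPy. This sketch is an illustration under stated assumptions: only the user and item embeddings are regularized (the scalars `beta` and `c` are left out of the penalty), and the array names `U` and `items` are hypothetical.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps))

def hyperbpr_loss(U, items, triplets, beta, c, lam):
    # triplets: (user i, positive item j, negative item k)
    loss = 0.0
    for i, j, k in triplets:
        a_pos = beta * poincare_distance(U[i], items[j]) + c
        a_neg = beta * poincare_distance(U[i], items[k]) + c
        # -ln sigmoid(a_pos - a_neg), computed stably
        loss += np.logaddexp(0.0, -(a_pos - a_neg))
    return loss + lam * (np.sum(U ** 2) + np.sum(items ** 2))
```

With a negative `beta`, the loss is small when the positive item is closer to the user than the negative one, so the learned sign of `beta` determines how distance maps to preference.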

The parameters of the model are learned with Riemannian stochastic gradient descent (RSGD).

Stochastic gradient descent on Riemannian manifolds
Hyperbolic Recommender Systems
Scalable Hyperbolic Recommender Systems

Prod2Vec

Product embedding

Based on item-item co-occurrence in transaction sequences (co-purchased products)
Uses the word-embedding approach: low-dimensional, distributed embeddings of words learned from word sequences in text documents
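
The analogy with word embeddings can be made concrete: each purchase sequence plays the role of a sentence and each product the role of a word. The sketch below only generates the skip-gram (target, context) training pairs and their co-occurrence counts; the window size and product ids are illustrative assumptions, and any word2vec-style trainer could consume the pairs.

```python
from collections import Counter

def skipgram_pairs(sequences, window=2):
    # For each product, pair it with the products bought within `window`
    # positions before or after it in the same sequence.
    pairs = []
    for seq in sequences:
        for t, target in enumerate(seq):
            lo, hi = max(0, t - window), min(len(seq), t + window + 1)
            pairs.extend((target, seq[c]) for c in range(lo, hi) if c != t)
    return pairs

def cooccurrence_counts(sequences, window=2):
    # Item-item co-occurrence statistics implied by the same windows
    return Counter(skipgram_pairs(sequences, window))
```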

https://astro.temple.edu/~tuc17157/pdfs/grbovic2015kddB.pdf
http://www.majumderb.com/prod2vec_initial_report.pdf
https://dl.acm.org/citation.cfm?id=2959166

Item2vec

https://www.cnblogs.com/hellojamest/p/11766401.html
http://ceur-ws.org/Vol-1688/paper-13.pdf

Meta-Prod2Vec

Meta-Prod2Vec is a novel method to compute item similarities for recommendation that leverages existing item metadata.

https://arxiv.org/abs/1607.07326
http://labs.criteo.com/2016/09/meta-prod2vec-product-embeddings-using-side-information-recommendation/

Graph Embeddings for RecSys

https://github.com/aarivan/Graph-Embeddings-for-Recommender-Systems

proNet

http://cherry.cs.nccu.edu.tw/~g10018/portfolio/slides/pronet.pdf
https://github.com/haowei01/proNet-core

Modularize Graph Embedding for Recommendation

We can take the recommendation as Link Prediction on Graphs.

Efficient retrieval from approximate nearest neighbor (ANN) search methods.
Efficient pairwise comparison due to dimensionality reduction (DR)
Reduced space complexity due to DR
Transfer learning with pretrained embeddings

So graph embedding is great for recommendation:

Reduces data sparsity and cold start via integrating auxiliary information
Provides holistic view of REC problem and jointly mines different relations in terms of graph structures
Trains fast, compares fast, and retrieves fast while taking less space

To address the challenges of graph embedding, we need to modularize graph embedding for adaptability.

Extracts graph structures from the dataset while remaining type-agnostic to the sampled entities, i.e., nodes and edges
Converts entities into spatial features via embedding stacking operations, e.g., lookup, pooling (average, etc.)
Preserves entity relatedness as spatial properties with customizable similarity metrics and loss functions

https://github.com/cnclabs/smore
https://github.com/chihming/awesome-network-embedding
http://cherry.cs.nccu.edu.tw/~g10018/recsys19_smore.pdf
http://staff.ustc.edu.cn/~hexn/papers/sigir19-NGCF.pdf
https://www.slideshare.net/changecandy/recsys19-smore

Graph-based RecSys

Graph is an important structure for System II intelligence, with the universal representation ability to capture the relationship between different variables, and support interpretability, causality, and transferability / inductive generalization. Traditional logic and symbolic reasoning over graphs has relied on methods and tools which are very different from deep learning models, such as the Prolog language, SMT solvers, constrained optimization and discrete algorithms. Is such a methodology separation between System I and System II intelligence necessary? How to build a flexible, effective and efficient bridge to smoothly connect these two systems, and create higher order artificial intelligence?

Graph neural networks have emerged as the tool of choice for graph representation learning, which has led to impressive progress in many classification and regression problems such as chemical synthesis, 3D-vision, recommender systems and social network analysis. However, prediction and classification tasks can be very different from logic/symbolic reasoning.

In this tutorial, we revisit the recommendation problem from the perspective of graph learning. Common data sources for recommendation can be organized into graphs, such as user-item interactions (bipartite graphs), social networks, item knowledge graphs (heterogeneous graphs), among others. Such a graph-based organization connects the isolated data instances, bringing benefits for exploiting high-order connectivities that encode meaningful patterns for collaborative filtering, content-based filtering, social influence modeling and knowledge-aware reasoning. Together with the recent success of graph neural networks (GNNs), graph-based models have exhibited the potential to be the technologies for next generation recommendation systems. The tutorial provides a review on graph-based learning methods for recommendation, with special focus on recent developments of GNNs and knowledge graph-enhanced recommendation. By introducing this emerging and promising area in the tutorial, we expect the audience can get deep understanding and accurate insight on the spaces, stimulate more ideas and discussions, and promote developments of technologies.

https://next-nus.github.io/
https://logicalreasoninggnn.github.io/
https://arxiv.org/abs/1902.07243
https://zhuanlan.zhihu.com/p/66521058
Graph-search based Recommendation system
https://next-nus.github.io/slides/tuto-cikm2019-public.pdf
Multi-behavior Recommendation with Graph Convolutional Networks

Deep Geometric Matrix Completion

It’s easy to observe how better matrix completions can be achieved by considering the sparse matrix as defined over two different graphs: a user graph and an item graph. From a signal processing point of view, the matrix ${X}$ can be considered as a bi-dimensional signal defined over two distinct domains. Instead of resorting to multi-graph convolutions realized over the entire matrix ${X}$, two independent single-graph GCNs (graph convolution networks) can be applied to the factor matrices ${W}$ and ${H}$ of the low-rank factorization $X = WH^T$.

Given the aforementioned multi-graph convolutional layers, the last step that remains concerns the choice of the architecture to use for reconstructing the missing information. Every (user, item) pair in the multi-graph approach and every user/item in the separable one present in this case an independent state, which is updated (at every step) by means of the features produced by the selected GCN.

What are good application tasks for graph convolution networks? (answer by superbrother on Zhihu)
https://arxiv.org/abs/1704.06803
Deep Geometric Matrix Completion: a Geometric Deep Learning approach to Recommender Systems
Talk: Deep Geometric Matrix Completion

PinSage

http://snap.stanford.edu/graphsage/
http://cedric.cnam.fr/~thomen/journal_club/19-10-18.pdf
https://sites.google.com/view/ruining-he/
https://samsiatrtp.wordpress.com/category/program/computational-advertising/
https://docs.dgl.ai/en/latest/_modules/dgl/sampling/pinsage.html
https://www.hotbak.net/key/%E5%9B%BE%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E7%94%A8%E4%BA%8E%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E9%97%AE%E9%A2%98PinSage.html

Spectral Collaborative Filtering

https://www.cs.uic.edu/~clu/doc/recsys18_spectralCF.pdf

LightGCN

https://github.com/kuandeng/LightGCN
https://blog.csdn.net/qq_39388410/article/details/106970194
http://staff.ustc.edu.cn/~hexn/

GraphRec

https://github.com/wenqifan03/GraphRec-WWW19
https://daiwk.github.io/posts/dl-graph-recommendations.html

Feature Interaction Selection in RecSys

A feature interaction is some way in which a feature or features modify or influence another feature in defining overall system behavior.

https://www.tinymind.cn/articles/4233?from=articles_commend
HOP-rec: high-order proximity for implicit recommendation
https://archsummit.infoq.cn/2020/shenzhen/presentation/2330
Bayesian Personalized Feature Interaction Selection for Factorization Machines

Generally, feature interactions matter in recommender systems.

Attribute interactions are the irreducible dependencies between attributes. Interactions underlie feature relevance and selection, the structure of joint probability and classification models: if and only if the attributes interact, they should be connected. While the issue of 2-way interactions, especially of those between an attribute and the label, has already been addressed, we introduce an operational definition of a generalized n-way interaction by highlighting two models: the reductionistic part-to-whole approximation, where the model of the whole is reconstructed from models of the parts, and the holistic reference model, where the whole is modelled directly. An interaction is deemed significant if these two models are significantly different. Correlation is a special case of attribute interaction.

http://pamelazave.com/fi.html
https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/chen-2019-bayesian.pdf
Experimentation with fairness-aware recommendation using librec-auto
http://www.inf.unibz.it/~ricci/papers/intro-rec-sys-handbook.pdf
https://pycaret.org/feature-interaction/
https://www.cs.cmu.edu/~ckaestne/pdf/icse12.pdf
https://www.public.asu.edu/~huanliu/papers/ijcai07.pdf
https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1189&context=cseconfwork
http://stat.columbia.edu/~jakulin/Int/interaction-slides.pdf
https://iclr.cc/virtual_2020/poster_BkgnhTEtDS.html
http://stat.columbia.edu/~jakulin/Int/
https://christophm.github.io/interpretable-ml-book/interaction.html

AutoCross

By performing beam search in a tree-structured space, AutoCross enables efficient generation of high-order cross features, which have not yet been explored by existing work.

AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications
https://aijishu.com/a/1060000000081601

Product-based Neural Network

Faced with extreme sparsity, traditional models may be limited to mining shallow patterns from the data, i.e., low-order feature combinations. Deep models such as deep neural networks, on the other hand, cannot be applied directly to the high-dimensional input because of the huge feature space.

https://arxiv.org/abs/1611.00144
https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf
https://dl.acm.org/doi/10.1145/3233770
https://github.com/Atomu2014/product-nets
https://app.dimensions.ai/details/publication/pub.1007555309

Deep Crossing

The Deep Crossing model is a deep neural network that automatically combines features to produce superior models. The input of Deep Crossing is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well as a cascade of Residual Units.

https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf

AutoGroup

AutoGroup casts the selection of feature interactions as a structural optimization problem. In a nutshell, AutoGroup first automatically groups useful features into a number of feature sets. Then, it generates interactions of any order from these feature sets using a novel interaction function. The main contribution of AutoGroup is that it performs both dimensionality reduction and feature selection which are not seen in previous models.

Efficient Sparse Modeling with Automatic Feature Grouping
AutoGroup: Automatic Feature Grouping for Modelling Explicit High-Order Feature Interactions in CTR Prediction
https://zhuanlan.zhihu.com/p/136594025

AutoFIS

AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing the architecture parameters.

AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction
https://github.com/zhuchenxv/AutoFIS

AutoFeature

AutoInt automatically learns the high-order feature interactions of input features. The algorithm is very general and can be applied to both numerical and categorical input features. Specifically, both the numerical and categorical features are mapped into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections explicitly models the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled.
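
One multi-head self-attentive layer with a residual connection can be sketched in NumPy as follows. The tensor shapes, the unscaled dot-product attention, and the ReLU applied after the residual are assumptions made for this illustration; the names `Wq`, `Wk`, `Wv`, `Wres` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(E, Wq_h, Wk_h):
    # Row m holds how strongly field m attends to every other field
    return softmax((E @ Wq_h) @ (E @ Wk_h).T, axis=1)

def interacting_layer(E, Wq, Wk, Wv, Wres):
    # E: (M, d) field embeddings; Wq, Wk, Wv: (H, d, d_head); Wres: (d, H * d_head)
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        A = attention_weights(E, Wq_h, Wk_h)       # (M, M)
        heads.append(A @ (E @ Wv_h))               # (M, d_head)
    combined = np.concatenate(heads, axis=1)       # (M, H * d_head)
    # Residual connection keeps the raw (lower-order) features available
    return np.maximum(combined + E @ Wres, 0.0)
```

Stacking several such layers lets each field representation mix information from combinations formed in earlier layers, which is how different orders of interaction arise.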

AutoFeature: Searching for Feature Interactions and Their Architectures for Click-through Rate Prediction
https://arxiv.org/abs/1810.11921

Automated Embedding

AutoEmb can enable various embedding dimensions according to the popularity in an automated and dynamic manner.

AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations
Automated Embedding Size Search in Deep Recommender Systems
https://www.cse.msu.edu/~zhaoxi35/
https://sites.google.com/view/kdd20-marketplace-autorecsys/

Ensemble Methods for Recommender System

A recommender system can be treated as a regression or classification task, so ensemble methods apply to it directly: BellKor's Pragmatic Chaos used a blended solution to win the Netflix Prize. In essence this is bagging or blending, an ensemble strategy that combines separately trained predictors in order to avoid over-fitting and reduce variance.

This section focuses on boosting, which reduces the error and improves performance by building on a weak learner.

There are two common methods to construct a stronger learner from a weaker learner: (1) reweight the samples and learn from the error: AdaBoost; (2) train another learner to approximate the error: Gradient Boosting.
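
The second strategy, fitting each new learner to the current error, can be illustrated with squared loss, where the negative gradient is simply the residual. The sketch below uses one-dimensional regression stumps as the weak learner; the stump, the learning rate, and the function names are assumptions chosen to keep the example small, not a recommender-specific method.

```python
import numpy as np

def fit_stump(x, residual):
    # Weak learner: best single-threshold piecewise-constant fit to the residual
    best = None
    for thr in np.unique(x)[:-1]:
        left = residual[x <= thr].mean()
        right = residual[x > thr].mean()
        err = np.sum((residual - np.where(x <= thr, left, right)) ** 2)
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda z: np.where(z <= thr, left, right)

def gradient_boost(x, y, n_rounds=20, lr=0.5):
    pred = np.full(len(y), float(np.mean(y)))   # start from the constant model
    learners, history = [], []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)          # fit the current residual
        pred = pred + lr * stump(x)             # add a damped correction
        learners.append(stump)
        history.append(float(np.mean((y - pred) ** 2)))
    return learners, history
```

The training error recorded in `history` never increases from one round to the next, since each stump is a least-squares fit to the residual and the step size lies in $(0, 1]$.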

General Functional Matrix Factorization Using Gradient Boosting
recsys2019

BellKor's Pragmatic Chaos

So far, we have treated the recommendation task as a regression problem, which is common in machine learning. Boosting or stacking methods may help to enhance these methods.

A key to achieving highly competitive results on the Netflix data is usage of sophisticated blending schemes, which combine the multiple individual predictors into a single final solution. This significant component was managed by our colleagues at the Big Chaos team. Still, we were producing a few blended solutions, which were later incorporated as individual predictors in the final blend. Our blending techniques were applied to three distinct sets of predictors. First is a set of 454 predictors, which represent all predictors of the BellKor’s Pragmatic Chaos team for which we have matching Probe and Qualifying results. Second, is a set of 75 predictors, which the BigChaos team picked out of the 454 predictors by forward selection. Finally, a set of 24 BellKor predictors for which we had matching Probe and Qualifying results. (From the Netflix Prize report.)

https://www.netflixprize.com/community/topic_1537.html
https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf

BoostFM

BoostFM integrates boosting into factorization models during the process of item ranking. Specifically, BoostFM is an adaptive boosting framework that linearly combines multiple homogeneous component recommenders, which are repeatedly constructed on the basis of the individual FM model by a re-weighting scheme.

BoostFM

Input: The observed context-item interactions or training data $S = \{(\mathbf{x}_i, y_i)\}$, parameters $E$ and $T$.
Output: The strong recommender $g^{(T)}$.
Initialize $Q_{ci}^{(1)}=1/|S|$, $g^{(0)}=0$, $\forall (c, i)\in S$.
for $t = 1 \to T$ do
1. Create component recommender $\hat{y}^{(t)}$ with $\mathbf{Q}^{(t)}$ on $S$, i.e., the Component Recommender Learning Algorithm;
2. Compute the ranking accuracy $E[\hat{r}(c, i, \hat{y}^{(t)})], \forall (c,i) \in S$;
3. Compute the coefficient $\beta_t$, $$ \beta_t = \ln \left(\frac{\sum_{(c,i) \in S} Q^{(t)}_{ci}\{1 + E[\hat{r}(c, i, \hat{y}^{(t)})]\}}{\sum_{(c,i) \in S} Q^{(t)}_{ci}\{1- E[\hat{r}(c, i, \hat{y}^{(t)})]\}}\right)^{\frac{1}{2}} ; $$
4. Create the strong recommender $g^{(t)}$, $$ g^{(t)} = \sum_{h=1}^{t} \beta_h \hat{y}^{(h)} ;$$
5. Update the weight distribution $\mathbf{Q}^{(t+1)}$, $$ Q^{(t+1)}_{ci} = \frac{\exp(-E[\hat{r}(c, i, g^{(t)})])}{\sum_{(c,i)\in S} \exp(-E[\hat{r}(c, i, g^{(t)})])} ; $$
end for
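The coefficient and re-weighting steps of the loop above can be sketched as two small functions. This is a minimal sketch, assuming the ranking accuracy $E$ is already available as one value in $[0, 1]$ per observed interaction; the function names and the `eps` guard against a zero denominator are additions for illustration.

```python
import numpy as np

def beta_coefficient(Q, acc, eps=1e-12):
    """beta_t = 0.5 * ln( sum Q*(1+E) / sum Q*(1-E) ) for per-interaction accuracies acc."""
    num = np.sum(Q * (1.0 + acc))
    den = np.sum(Q * (1.0 - acc))
    return 0.5 * np.log(num / (den + eps))   # eps avoids division by zero when every E equals 1

def update_weights(acc_strong):
    """New distribution: interactions ranked poorly by the strong recommender gain weight."""
    w = np.exp(-acc_strong)
    return w / w.sum()
```

With uniform weights and an accuracy of 0.5 on every interaction, the coefficient is $\frac{1}{2}\ln 3$; an interaction with accuracy 0 receives a larger weight than one with accuracy 1.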

Component Recommender

Naturally, it is feasible to exploit the L2R techniques to optimize Factorization Machines (FM). There are two major approaches in the field of L2R, namely, pairwise and listwise approaches. In the following, we demonstrate ranking factorization machines with both pairwise and listwise optimization.

Weighted Pairwise FM (WPFM)

Weighted ‘Listwise’ FM (WLFM)

BoostFM: Boosted Factorization Machines for Top-N Feature-based Recommendation
http://wnzhang.net/
https://fajieyuan.github.io/
https://www.librec.net/luckymoon.me/
The author’s final accepted version.

Adaptive Boosting Personalized Ranking (AdaBPR)

AdaBPR (Adaptive Boosting Personalized Ranking) is a boosting algorithm for top-N item recommendation using users' implicit feedback. In this framework, multiple homogeneous component recommenders are linearly combined to achieve more accurate recommendation. The component recommenders are learned based on a re-weighting strategy that assigns a dynamic weight to each observed user-item interaction.

Here explicit feedback refers to users' ratings to items while implicit feedback is derived from users' interactions with items, e.g., number of times a user plays a song.

The primary idea of applying boosting for item recommendation is to learn a set of homogeneous component recommenders and then create an ensemble of the component recommenders to predict users' preferences.

Here, we use a linear combination of component recommenders as the final recommendation model $$f=\sum_{t=1}^{T}\alpha_t f_t.$$

In the training process, AdaBPR runs for ${T}$ rounds, and the component recommender $f_t$ is created at the $t$-th round by $$ \arg\min_{f_t\in\mathbb{H}} \sum_{(u,i)\in\mathbb{O}} \beta_{u} \exp\left\{-E\left(\pi\left(u,i,\sum_{n=1}^{t}\alpha_n f_n\right)\right)\right\}, $$

where the notations are listed as follows:

$\mathbb{H}$ is the set of possible component recommenders such as collaborative ranking algorithms;
$E(\pi(u,i,f))$ denotes the ranking accuracy associated with each observed interaction pair;
$\pi(u,i,f)$ is the rank position of item ${i}$ in the ranked item list of ${u}$, resulted by a learned ranking model ${f}$;
$\mathbb{O}$ is the set of all observed user-item interactions;
$\beta_{u}$ is defined as the reciprocal of the number of user $u$'s historical items, $\beta_{u}=\frac{1}{|V_{u}^{+}|}$ ($V_{u}^{+}$ is the set of historical items of ${u}$).
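To make the notation concrete, the objective can be evaluated for a given ensemble as follows. This is a sketch under explicit assumptions: scores are dense user-by-item matrices, and the accuracy measure $E$ is taken to be the reciprocal rank $1/\pi(u,i,f)$, which is only one possible choice and is not prescribed by the text above. The names `rank_positions` and `adabpr_objective` are hypothetical.

```python
import numpy as np

def rank_positions(scores):
    """1-based rank of every item per user (row); a higher score gives a better rank."""
    order = np.argsort(-scores, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks

def adabpr_objective(component_scores, alphas, observed):
    """Sum over observed (u, i) of beta_u * exp(-E(pi(u, i, f))) with f = sum_t alpha_t f_t."""
    f = sum(a * s for a, s in zip(alphas, component_scores))
    ranks = rank_positions(f)
    n_hist = np.bincount([u for u, _ in observed], minlength=f.shape[0])
    loss = 0.0
    for u, i in observed:
        E = 1.0 / ranks[u, i]                 # assumed accuracy measure: reciprocal rank
        loss += (1.0 / n_hist[u]) * np.exp(-E)
    return loss
```

An observed item ranked first contributes $e^{-1}$, while one ranked lower contributes more, so minimizing the objective pushes observed items up the list.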

A Boosting Algorithm for Item Recommendation with Implicit Feedback
The review @Arivin's blog

Gradient Boosting Factorization Machines

The Gradient Boosting Factorization Machine (GBFM) model incorporates a feature selection algorithm and Factorization Machines into a unified framework.

Gradient Boosting Factorization Machine Model

Input: Training Data $S ={(\mathbf{x}_i, y_i)}$.

Output: $\hat{y}_S(\mathrm{x}) = \hat{y}_0(\mathrm{x}) + \sum_{s=1}^{S}\left<v_{si}, v_{sj}\right>$.

Initialize rating prediction function as $\hat{y}_0(x)$

for $s = 1 \to S$ do

Select interaction feature $C_p$ and $C_q$ from Greedy Feature Selection Algorithm;

Estimate latent feature matrices $V_p$ and $V_q$;

Update $\hat{y}_s(\mathrm{x}) = \hat{y}_{s-1}(\mathrm{x}) + \sum_{i\in C_p}\sum_{j\in C_q} \mathbb{I}[i,j\in \mathrm{x}]\left<V_{p}^{i}, V_{q}^{j}\right>$

end for

where $s$ is the iteration step of the learning algorithm. At step $s$, we greedily select two interaction features $C_p$ and $C_q$. $\mathbb{I}$ is the indicator function, whose value is 1 if the condition holds and 0 otherwise.

Greedy Feature Selection Algorithm

From the view of the gradient boosting machine, at each step $s$ we would like to search for a function ${f}$ in the function space ${F}$ that minimizes the objective function: $$L=\sum_{i}\ell(\hat{y}_s(\mathrm{x}_i), y_i)+\Omega(f)$$

where $\hat{y}_s(\mathrm{x}) = \hat{y}_{s-1}(\mathrm{x}) + \alpha_s f_s(\mathrm{x})$.

We heuristically assume that the function ${f}$ has the following form: $$ f_{\ell}(\mathrm{x})=\prod_{t=1}^{\ell} q_{C_{i}(t)}(\mathrm{x}) $$ where the function $q$ maps the latent feature vector $\mathrm{x}$ to the real value domain $$ q_{C_{i}(t)}(\mathrm{x})=\sum_{j\in C_{i}(t)}\mathbb{I}[j\in \mathrm{x}]w_{t} $$

For a general convex loss function $\ell$, it is hard to search for the function ${f}$ that optimizes the objective function $L=\sum_{i}\ell(\hat{y}_s(\mathrm{x}_i), y_i)+\Omega(f)$.

The most common way is to approximate it by least-square minimization, i.e., $\ell=\|\cdot\|_2^2$. As in XGBoost, the method takes the second-order Taylor expansion of the loss function $\ell$, and the problem reduces to finding the $i(t)$-th feature that solves:

$$\arg\min_{i(t)\in \{0, \dots, m\}} \sum_{i=1}^{n} h_i\left(\frac{g_i}{h_i}-f_{t-1}(\mathrm{x}_i)\, q_{C_{i}(t)}(\mathrm{x}_i)\right)^2 + \|\theta\|_2^2 $$ where $g_i$ and $h_i$ denote the negative first derivative and the second derivative of the loss at instance ${i}$.
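The selection criterion above can be illustrated with a small sketch that scores each candidate categorical field and keeps the one with the lowest regularized weighted squared error. Several simplifications are assumptions of this sketch rather than part of the source: each field has exactly one active value per instance, one weight is learned per value in closed form, and the regularizer is `lam` times the squared weights. `score_field` and `select_field` are hypothetical names.

```python
import numpy as np

def score_field(codes, g, h, f_prev, lam=1.0):
    """Weighted least-squares score of one categorical field (lower is better).

    codes:  (n,) integer value of this field for each instance
    g, h:   (n,) negative first and second derivatives of the loss (h > 0 assumed)
    f_prev: (n,) value of the product built so far, f_{t-1}(x_i)
    """
    total = 0.0
    for v in np.unique(codes):
        m = codes == v
        # Closed-form ridge solution for the weight attached to value v.
        w = np.sum(g[m] * f_prev[m]) / (np.sum(h[m] * f_prev[m] ** 2) + lam)
        total += np.sum(h[m] * (g[m] / h[m] - f_prev[m] * w) ** 2) + lam * w ** 2
    return total

def select_field(fields, g, h, f_prev, lam=1.0):
    """Index of the candidate field with the smallest score."""
    scores = [score_field(c, g, h, f_prev, lam) for c in fields]
    return int(np.argmin(scores))
```

With squared loss $\frac{1}{2}(y-\hat{y})^2$, the inputs would be $g_i = y_i - \hat{y}(\mathrm{x}_i)$ and $h_i = 1$, so a field whose values line up with the residuals gets the lowest score.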

Gradient boosting factorization machines

Gradient Boosted Categorical Embedding and Numerical Trees

Gradient Boosted Categorical Embedding and Numerical Trees (GB-CENT) combines tree-based models and matrix-based embedding models in order to handle both numerical features and large-cardinality categorical features. A prediction is based on:

Bias terms from each categorical feature.
Dot-product of embedding features of two categorical features, e.g., user-side vs. item-side.
Per-category decision trees based on numerical features: an ensemble of numerical decision trees where each tree is based on one categorical feature.

In detail, the prediction is: $$ \hat{y}(x) = \underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\left(\sum_{a_i\in U(a)} Q_{a_i}\right)^{T}\left(\sum_{a_i\in I(a)} Q_{a_i}\right)}_{\text{factors}}}_{\text{CAT-E}} + \underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}. $$ It is decomposed as in the following table.

Ingredients	Formulae	Features
Factorization Machines	$\underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{\text{bias}} + \underbrace{\left(\sum_{a_i\in U(a)} Q_{a_i}\right)^{T}\left(\sum_{a_i\in I(a)} Q_{a_i}\right)}_{\text{factors}}}_{\text{CAT-E}}$	Categorical Features
GBDT	$\underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{\text{CAT-NT}}$	Numerical Features
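The decomposition can be read as a three-term prediction function. The sketch below assumes already-trained parameters stored in plain dictionaries and trees represented as Python callables over the numerical features; all names (`gb_cent_predict`, `bias`, `emb`, `trees`) are hypothetical and only illustrate how the terms add up.

```python
import numpy as np

def gb_cent_predict(user_feats, item_feats, numeric, bias, emb, trees):
    """Bias + embedding dot-product (CAT-E) + per-category numerical trees (CAT-NT)."""
    cats = list(user_feats) + list(item_feats)
    b = sum(bias[a] for a in cats)                          # bias of every categorical value
    u_vec = sum(emb[a] for a in user_feats)                 # summed user-side embeddings
    i_vec = sum(emb[a] for a in item_feats)                 # summed item-side embeddings
    t = sum(trees[a](numeric) for a in cats if a in trees)  # trees attached to active categories
    return b + float(u_vec @ i_vec) + t
```

A usage example with one user value, one item value, and a single stump on a hypothetical `price` feature:

```python
bias = {"u1": 0.1, "i1": 0.2}
emb = {"u1": np.array([1.0, 0.0]), "i1": np.array([0.5, 2.0])}
trees = {"i1": lambda b: 1.0 if b["price"] > 10 else -1.0}
y_hat = gb_cent_predict(["u1"], ["i1"], {"price": 20}, bias, emb, trees)
```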

http://www.hongliangjie.com/talks/GB-CENT_SD_2017-02-22.pdf
http://www.hongliangjie.com/talks/GB-CENT_SantaClara_2017-03-28.pdf
http://www.hongliangjie.com/talks/GB-CENT_Lehigh_2017-04-12.pdf
http://www.hongliangjie.com/talks/GB-CENT_PopUp_2017-06-14.pdf
http://www.hongliangjie.com/talks/GB-CENT_CAS_2017-06-23.pdf
http://www.hongliangjie.com/talks/GB-CENT_Boston_2017-09-07.pdf
Talk: Gradient Boosted Categorical Embedding and Numerical Trees
Paper: Gradient Boosted Categorical Embedding and Numerical Trees
https://qzhao2018.github.io/

Deep Embedding Networks and Gradient Boosting Decision Trees

http://www.simflow.net/Publications/Papers/Year2018/Baibing.pdf
https://doogkong.github.io/proposal_2019.html

Tree-based Deep Model for Recommender Systems

By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation.

Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model.

Recommendation problem is basically to retrieve a set of most relevant or preferred items for each user request from the entire corpus. In the practice of large-scale recommendation, the algorithm design should strike a balance between accuracy and efficiency.

The above methods include two stages/models: (1) estimate the user's preferences from history or other information; (2) retrieve items according to the predicted preferences.

TDM uses a tree hierarchy to organize items, and each leaf node in the tree corresponds to an item. Like a max-heap, TDM assumes that each user-node preference is determined by the largest preference among the node's children. The main idea is to predict user interests from coarse to fine by traversing tree nodes in a top-down fashion and making decisions for each user-node pair.

Each item in the corpus is firstly assigned to a leaf node of a tree hierarchy $\mathcal{T}$. The non-leaf nodes can be seen as a coarser abstraction of their children. In retrieval, the user information combined with the node to score is firstly vectorized to a user preference representation as the input of a deep neural network $\mathcal{M}$ (e.g. fully connected networks). While retrieving for the top-k items (leaf nodes), a top-down beam search strategy is carried out level by level.

TDM uses a tree as index and proposes a max-heap like probability formulation on the tree, where the user preference for each non-leaf node $n$ in level $l$ is derived as: $$p^{(l)}(n \mid u)=\frac{\max_{n_c\in\{\text{children of node } n \text{ in level } l+1\}} p^{(l+1)}(n_c \mid u)}{\alpha^{(l)}}$$

where $p^{(l)}(n \mid u)$ is the ground truth probability that the user $u$ prefers the node $n$, and $\alpha^{(l)}$ is the normalization term of level $l$. The above formulation means that the ground truth user-node probability on a node equals the maximum user-node probability of its children divided by a normalization term. Therefore, the top-k nodes in level $l$ must be contained in the children of the top-k nodes in level $l-1$, and the retrieval for top-k leaf items can be restricted to the top-k nodes in each layer without losing accuracy. Based on this, TDM turns the recommendation task into a hierarchical retrieval problem. By a top-down retrieval process, the candidate items are selected gradually from coarse to detailed.
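The layer-wise top-k retrieval can be sketched as a beam search over the tree. In this minimal sketch the tree is a plain dictionary from node to children, and `score` stands in for the preference model $\mathcal{M}$ evaluated for a fixed user; both are assumptions made for illustration.

```python
def beam_search(children, root, score, k):
    """Top-down retrieval that keeps only the k best-scoring nodes per level.

    children: dict mapping a node to its list of child nodes (leaves are absent or empty)
    score:    callable node -> preference score for the current user (hypothetical model)
    """
    beam, leaves = [root], []
    while beam:
        candidates = []
        for node in beam:
            kids = children.get(node, [])
            if kids:
                candidates.extend(kids)   # expand internal nodes
            else:
                leaves.append(node)       # a leaf is a retrieved item
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return sorted(leaves, key=score, reverse=True)[:k]
```

Only about $k$ times the branching factor nodes are scored per level, which is where the logarithmic cost in the corpus size comes from when the tree is balanced.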

According to the retrieval process, the recommendation accuracy of TDM is determined by the quality of the user preference model $\mathcal M$ and tree index $\mathcal T$. Given n pairs of positive training data $(u_i, c_i)$, which means the user $u_i$ is interested in the target item $c_i$, $\mathcal T$ determines which non-leaf nodes $\mathcal M$ should select to achieve $c_i$ for $u_i$.

Denote $p(\pi(c_i)|u_i; \pi)$ as user $u_i$'s preference probability over leaf node $\pi(c_i)$ given a user-item pair $(u_i, c_i)$, where $\pi(\cdot)$ is a projection function that projects an item to a leaf node in $\mathcal T$. Note that the projection function $\pi(\cdot)$ actually determines the item hierarchy in the tree. The model $\mathcal M$ is used to estimate and output the user-node preference $\hat{p}(\pi(c_i)|u_i;\theta, \pi)$ given $\theta$ as model parameters. If the pair $(u_i , c_i)$ is a positive sample, we have the ground truth preference $p(\pi(c_i)|u_i; \pi)=1$. According to the max-heap property, the user preference probability of all of $\pi(c_i)$'s ancestor nodes, i.e., $\{p(b_j(\pi(c_i))|u_i; \pi)\}_{j=0}^{l_{max}}$, should also be 1, in which $b_j(\cdot)$ is the projection from a node to its ancestor node in level $j$ and $l_{max}$ is the max level in $\mathcal T$. To fit such a user-node preference distribution, the global loss function is formulated as

$$L(\theta, \mathcal T)= -\sum_{i=1}^n \sum_{j=0}^{l_{max}}\log \hat{p}(b_j(\pi(c_i))|u_i; \theta, \pi)$$

where we sum up the negative logarithm of predicted user-node preference probability on all the positive training samples and their ancestor user-node pairs as the global empirical loss.
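As a worked illustration of this loss, the sketch below sums the negative log-probabilities along the path from each positive item's leaf up to the root. It assumes a binary tree stored with heap indexing (root 0, parent of node `n` is `(n - 1) // 2`) and a hypothetical callable `p_hat(user, node)` in place of the model $\mathcal{M}$; the mapping `leaf_of` plays the role of $\pi(\cdot)$.

```python
import math

def ancestors(leaf):
    """Nodes on the path from a leaf up to the root in a heap-indexed binary tree."""
    path = [leaf]
    while path[-1] != 0:
        path.append((path[-1] - 1) // 2)
    return path

def tdm_loss(samples, leaf_of, p_hat):
    """Negative log-likelihood over every positive (user, item) pair and its ancestor nodes."""
    return -sum(
        math.log(p_hat(u, node))
        for u, c in samples
        for node in ancestors(leaf_of[c])
    )
```

Changing `leaf_of` changes which ancestors each sample touches, which is why the index and the model are optimized jointly.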

https://github.com/DeepGraphLearning/RecommenderSystems
https://github.com/DeepGraphLearning
https://jian-tang.com/
Learning Tree-based Deep Model for Recommender Systems
Joint Optimization of Tree-based Index and Deep Model for Recommender Systems
https://developer.aliyun.com/article/720309
Learning Tree-based Deep Model for Recommender Systems (in Chinese)
Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
Multi-Interest Network with Dynamic Routing for Recommendation at Tmall
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

The core of TDM is to regard the recommendation as ranking.

Context-aware Recommendations

Context-aware information is widely available in various ways and is becoming more and more important for enhancing retrieval performance and recommendation results. The current main issue to cope with is not only recommending or retrieving the most relevant items and content, but defining them ad hoc. Further relevant issues are personalizing and adapting the information and the way it is displayed to the user’s current situation and interests.

https://www.baltrunas.info/context-aware
https://www.kdd.org/kdd2014/tutorials/KDD-%20The%20RecommenderProblemRevisited-Part2.pdf
Workshop on Context-Aware Recommender Systems (CARS-2009)
Context-aware Recommendation Using Role-based Trust Network
Context-Aware Recommendations Based on Deep Learning Frameworks
CARS2: Learning Context-aware Representations for Context-aware Recommendations
http://carr-workshop.org/

Context-Aware Factorization Machines

Fast Context-aware Recommendations with Factorization Machines
Optimizing Factorization Machines for Top-N Context-Aware Recommendations
Factorization Models for Context-/Time-Aware Movie Recommendations
Gaussian Process Factorization Machines for Context-aware Recommendations

Sequential Recommender Systems

https://project.inria.fr/sr4sg/home/
https://jiaxit.github.io/resources/wsdm18caser.pdf
https://www.ijcai.org/Proceedings/2019/0883.pdf
Adaptive-Hierarchical-Translation-Sequential
https://www.hongliangjie.com/publications/wsdm2020.pdf
Disentangled Self-Supervision in Sequential Recommenders

Top-N recommendation

http://yongfeng.me/attach/jrl-cikm17.pdf
Local Item-Item Models for Top-N Recommendation
Improving Top-N Recommendation with Heterogeneous Loss
https://blog.csdn.net/lthirdonel/article/details/80021282
Top-N Recommendations from Implicit Feedback Leveraging Linked Open Data

Explainable Recommendations

Explainable recommendation and search attempt to develop models or methods that not only generate high-quality recommendation or search results, but also intuitive explanations of the results for users or system designers, which can help improve the system transparency, persuasiveness, trustworthiness, and effectiveness, etc.

Providing personalized explanations for recommendations can help users to understand the underlying insight of the recommendation results, which is helpful to the effectiveness, transparency, persuasiveness and trustworthiness of recommender systems. Current explainable recommendation models mostly generate textual explanations based on pre-defined sentence templates. However, the expressive power of template-based explanation sentences is limited to the pre-defined expressions, and manually defining the expressions requires significant human effort.

Explainable Recommendation and Search @ rutgers
Explainable Recommendation: A Survey and New Perspectives
Explainable Entity-based Recommendations with Knowledge Graphs
2018 Workshop on Explainable Recommendation and Search (EARS 2018)
EARS 2019
Explainable Recommendation and Search (EARS)
TEM: Tree-enhanced Embedding Model for Explainable Recommendation
https://ears2019.github.io/
Explainable Recommendation for Self-Regulated Learning
Dynamic Explainable Recommendation based on Neural Attentive Models
https://github.com/fridsamt/Explainable-Recommendation
Explainable Recommendation for Event Sequences: A Visual Analytics Approach by Fan Du
https://wise.cs.rutgers.edu/code/
http://www.cs.cmu.edu/~rkanjira/thesis/rose_proposal.pdf
http://jamesmc.com/publications
FIRST INTERNATIONAL WORKSHOP ON DEEP MATCHING IN PRACTICAL APPLICATIONS
Explainable Matrix Factorization for Collaborative Filtering

Social Recommendation

We present a novel framework for studying recommendation algorithms in terms of the ‘jumps’ that they make to connect people to artifacts. This approach emphasizes reachability via an algorithm within the implicit graph structure underlying a recommender dataset and allows us to consider questions relating algorithmic parameters to properties of the datasets.

Social Recommender Systems (SRSs) aim to alleviate information overload over social media users by presenting the most attractive and relevant content, often using personalization techniques adapted for the specific user. SRSs also aim at increasing adoption, engagement, and participation of new and existing users of social media sites. In addition to recommending content to consume, new types of recommendations emerge within social media, such as of people and communities to connect to, to follow, or to join.

User-item and user-user interactions usually take the form of a graph or network structure. Moreover, the graph is dynamic, so the model needs to handle new nodes without retraining.

6th International Workshop on Social Recommender Systems (SRS 2015)
http://www.comp.hkbu.edu.hk/~lichen/srs2010/
http://www.comp.hkbu.edu.hk/~lichen/srs2012/
http://www.comp.hkbu.edu.hk/~lichen/srs2011/
1st International Workshop on Adaptation, Personalization and REcommendation in the Social-semantic Web, 7th Extended Semantic Web Conference (ESWC 2010)
Recommendation and Advertising in Online Social Networks
Fairness and Discrimination in Retrieval and Recommendation
https://fairumap.wordpress.com/
Recommender Systems with Social Regularization
Do Social Explanations Work? Studying and Modeling the Effects of Social Explanations in Recommender Systems
Existing Methods for Including Social Networks until 2015
Social Recommendation With Evolutionary Opinion Dynamics
Workshop on Responsible Recommendation
https://recsys.acm.org/recsys18/fatrec/
A Probabilistic Model for Using Social Networks in Personalized Item Recommendation
Product Recommendation and Rating Prediction based on Multi-modal Social Networks
Graph Neural Networks for Social Recommendation
Studying Recommendation Algorithms by Graph Analysis
Low-rank Linear Cold-Start Recommendation from Social Data
Accurate and scalable social recommendation using mixed-membership stochastic block models
Social Choice Theory and Recommender Systems
https://raweb.inria.fr/rapportsactivite/RA2004/axis/uid26.html

SocialMF: MF with social trust propagation

SocialMF is based on two assumptions of trust-aware recommenders:

users have similar tastes to the other users they trust;
trust is transitive, so it propagates to indirect neighbors in the social network.
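The first assumption is commonly encoded as a penalty that pulls each user's latent vector towards the trust-weighted average of the vectors of the users they trust. The sketch below computes such a penalty; it assumes a row-normalized trust matrix in which every user trusts at least one other user, and the name `social_regularizer` is hypothetical.

```python
import numpy as np

def social_regularizer(U, T):
    """0.5 * sum_u || U_u - sum_v T[u, v] * U_v ||^2.

    U: (n_users, d) user latent factors
    T: (n_users, n_users) row-normalized trust weights (T[u, v] > 0 if u trusts v)
    """
    diff = U - T @ U          # each user's deviation from the average of trusted neighbors
    return 0.5 * np.sum(diff ** 2)
```

Because a neighbor's vector is itself pulled towards that neighbor's trusted users, minimizing this term during training propagates influence to indirect neighbors, which reflects the second assumption.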

https://github.com/grahamjenson/list_of_recommender_systems
https://www.librec.net/doc/librec-v1.1/librec/rating/SocialMF.html
A matrix factorization technique with trust propagation for recommendation in social networks

Algorithmic Bias in Search and Recommendation

Both search and recommendation algorithms provide results based on their relevance for the current user. In order to do so, such a relevance is usually computed by models trained on historical data, which is biased in most cases. Hence, the results produced by these algorithms naturally propagate, and frequently reinforce, biases hidden in the data, consequently strengthening inequalities. Being able to measure, characterize, and mitigate these biases while keeping high effectiveness is a topic of central interest for the information retrieval community.

http://bias.disim.univaq.it/
https://www.mirkomarras.com/publication/bias2020/
FAIRNESS, ACCOUNTABILITY AND TRANSPARENCY IN RECOMMENDER SYSTEMS
On the Need for Fairness in Financial Recommendation Engines
What are you optimizing for? Aligning Recommender Systems with Human Values
Algorithms are not neutral: Bias in collaborative filtering

Knowledge Graph and Recommender System

Items usually correspond to entities in many fields, such as books, movies and music, which makes it possible to transfer information between them. The information involved in recommender systems and in knowledge graphs is complementary, revealing the connectivity among items or between users and items. In terms of models, the two tasks both rank candidates for a target according to either implicit or explicit relations. For example, KG completion is to find the correct movies (e.g., Death Becomes Her) for the person Robert Zemeckis given the explicit relation Director Of. Item recommendation aims at recommending movies for a target user that satisfy some implicit preference. Therefore, we are to fill the gap between item recommendation and KG completion via a joint model, and systematically investigate how the two tasks impact each other.

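Recommendation algorithms not accurate enough? Let knowledge graphs solve it (in Chinese)
How can knowledge graph feature learning be applied to recommender systems? (in Chinese)
Explainable recommender systems: a special skill that hits the user's psychology in one move (in Chinese)
Applying deep learning and knowledge graphs to search-ad ranking at Meituan (in Chinese)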
Unifying Knowledge Graph Learning and Recommendation: Towards a Better Understanding of User Preferences
Explainable Reasoning over Knowledge Graphs for Recommendation

https://github.com/BaeSeulki/WhySoMuch
https://github.com/numb3r3/kgraph-rec

https://www.cs.cmu.edu/~wcohen/postscript/recsys-2016.pdf
https://ieeexplore.ieee.org/abstract/document/9216015/
https://github.com/hwwang55/KGNN-LS
https://github.com/hwwang55/KGCN
https://xiangwang1223.github.io/papers/KGAT_final.pdf
https://xiangwang1223.github.io/
https://arxiv.org/abs/2003.05753
https://arxiv.org/abs/2102.07057
https://www.comp.nus.edu.sg/~xiangnan/papers/www19-KGRec.pdf

RippleNet

https://arxiv.org/pdf/1803.03467.pdf
https://github.com/hwwang55/RippleNet
https://caojiangxia.github.io/RippleNet/

Reinforcement Learning and Recommender System

Services that introduce stores to users on the Internet are increasing in recent years. Each service conducts thorough analyses in order to display stores matching each user's preferences. In the field of recommendation, collaborative filtering performs well when there is sufficient click information from users. Generally, when building a user-item matrix, data sparseness becomes a problem. It is especially difficult to handle new users. When sufficient data cannot be obtained, a multi-armed bandit algorithm is applied. Bandit algorithms advance learning by testing each of a variety of options sufficiently and obtaining rewards (i.e. feedback). It is practically impossible to learn everything when the number of items to be learned periodically increases. The problem of having to collect sufficient data for a new user of a service is the same as the problem that collaborative filtering faces. In order to solve this problem, we propose a recommender system based on deep reinforcement learning. In deep reinforcement learning, a multilayer neural network is used to update the value function.
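The explore/exploit idea behind bandit algorithms can be illustrated with UCB1, one standard multi-armed bandit strategy. It is chosen here only as an example; the passage above does not say which bandit algorithm is used. Each arm would correspond to a candidate item or store, and the reward to user feedback such as a click.

```python
import math

class UCB1:
    """Upper-confidence-bound bandit: try every arm once, then pick the best optimistic estimate."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms     # number of times each arm was shown
        self.values = [0.0] * n_arms   # running mean reward of each arm

    def select(self):
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm             # initial exploration: every arm is tested once
        total = sum(self.counts)
        ucb = [
            v + math.sqrt(2.0 * math.log(total) / c)   # mean plus exploration bonus
            for v, c in zip(self.values, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The bonus term shrinks as an arm is shown more often, so rarely shown arms keep being tested while the arm with the best observed feedback receives most of the traffic.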

https://www.ashudeepsingh.com/publications/
http://www.cse.msu.edu/~zhaoxi35/
https://github.com/Jinjiarui/rl4rs-papers

https://recsys.acm.org/recsys19/reveal/
https://recsys.acm.org/recsys20/reveal/

Deep reinforcement learning for recommender systems
Deep Reinforcement Learning for Page-wise Recommendations
A Reinforcement Learning Framework for Explainable Recommendation
TPGR: Large-scale Interactive Recommendation with Tree-structured Policy Gradient

Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits
Learning from logged bandit feedback
Improving the Quality of Top-N Recommendation
ParsRec: Meta-Learning Recommendations for Bibliographic Reference Parsing
Technical Evolution and Business Innovation of Reinforcement Learning at Alibaba (in Chinese)
Closing the loop with the real world: reinforcement and robust estimators for recommendation

Traditional Approaches	Beyond Traditional Methods
Collaborative Filtering	Tensor Factorization & Factorization Machines
Content-Based Recommendation	Social Recommendations
Item-based Recommendation	Learning to rank
Hybrid Approaches	MAB Explore/Exploit

https://github.com/wzhe06/Reco-papers
https://github.com/hongleizhang/RSPapers
https://github.com/hongleizhang/RSAlgorithms
https://zhuanlan.zhihu.com/p/26977788
https://zhuanlan.zhihu.com/p/45097523
https://www.zhihu.com/question/20830906
https://www.zhihu.com/question/56806755/answer/150755503

Adversarial Learning for Recommender Systems

https://github.com/sisinflab/adversarial-recommender-systems-survey
https://github.com/EdisonLeeeee/RS-Adversarial-Learning
https://yasdel.github.io/
Adversarial Personalized Ranking for Recommendation
Deep Adversarial Social Recommendation
https://yasdel.github.io/files/RecSys20_tutorial.pdf
Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
http://tommasodinoia.com/

Generative Adversarial Networks for Recommender Systems

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System
RecGAN: Recurrent Generative Adversarial Networks for Recommendation Systems
Enhancing Social Recommendation with Adversarial Graph Convolutional Networks

Adversarial Attacks

Adversarial Attacks on an oblivious recommender
https://www-users.cs.umn.edu/~baner029/papers/19/adv_attack.pdf

Adversarial Training

Adversarial Training Towards Robust Multimedia Recommender System

https://espace.library.uq.edu.au/view/UQ:b731966
Generating Reliable Friends via Adversarial Training to Improve Social Recommendation
https://github.com/laugh12321/adversarial_multi_feedback_ranking

Health Recommender Systems

Recommendations are becoming ever more important in health settings, with the aim of helping people live healthier lives. Three previous workshops on Health Recommender Systems (HRS) have incorporated diverse research fields and problems in which recommender systems can improve our awareness, understanding and behaviour regarding our own, and the general public's, health. At the same time, these application areas bring new challenges into the recommender community. Recommendations that influence the health status of a patient need to be legally sound and, as such, today, they often involve a human in the loop to make sure the recommendations are appropriate. To make the recommender infallible, complex domain-specific user models need to be created, which creates privacy issues. While trust in a recommendation needs to be explicitly earned through, for example, transparency, explanations and empowerment, other systems might want to persuade users into taking beneficial actions that would not be willingly chosen otherwise. Multiple and diverse stakeholders in health systems produce further challenges.

Taking the patient's perspective, simple interaction and safety against harmful recommendations might be the prioritized concern.
For clinicians and experts, on the other hand, what matters is precise and accurate content.
Healthcare and insurance providers and clinics all have other priorities.

This workshop will deepen the discussions started at the three prior workshops and will work towards further development of the research topics in Health Recommender Systems.

http://132.199.138.79/healthrecsys/papers/index.html
http://ceur-ws.org/Vol-1953/
https://recsys.acm.org/recsys18/healthrecsys/
https://healthrecsys.github.io/2019/
https://www.vis.uni-konstanz.de/en/members/schaefer/
https://www.christophtrattner.info/
Towards Health (Aware) Recommender Systems
HealthRecSys 2018 Health Recommender Systems
UMUAI: Special Issue on Recommender Systems for Health and Wellbeing
SeWeBMeDa 2019 Semantic Web Solutions for Large-Scale Biomedical Data Analytics
2019 KDD Workshop on Applied Data Science for Healthcare
https://dshealthkdd.github.io/dshealth-2019/#papers
Health Recommender Systems: Concepts, Requirements, Technical Basics and Challenges
Health Recommender System using Big data analytics
Health Recommender System in Social Networks: A Case of Facebook
Health Recommender research project
Consumers’ intention to use health recommendation systems to receive personalized nutrition advice
Visual instance-based recommendation system for medical data mining
Health Recommender System design in the context of CAREGIVERSPRO-MMD Project
Personalized Recommendation System for Medical Assistance using Hybrid Filtering
Designing a Mobile Recommender System for Treatment Adherence Improvement among Hypertensives
https://caregiversprommd-project.eu/
https://wiki.aalto.fi/display/~llahti@aalto.fi/Lauri+Lahti
https://fruct.org/
A Systematic Literature Review on Health Recommender Systems
http://people.dbmi.columbia.edu/noemie/
MACHINE LEARNING FOR HEALTHCARE (MLHC)
Microsoft’s focus on transforming healthcare: Intelligent health through AI and the cloud
https://www.cs.ubc.ca/~rng/
http://homepages.inf.ed.ac.uk/ckiw/
http://groups.csail.mit.edu/medg/people/psz/home/Pete_MEDG_site/Home.html
MIT CSAIL Clinical Decision Making Group

Recommender System for Doctors

Finding a primary care doctor is simpler than it used to be, thanks to on-demand services like ZocDoc, SimplyBook, and Doodle. But matching up with a clinician who’s compatible with your (or your family’s) personality is another story.

Which Doctor to Trust: A Recommender System for Identifying the Right Doctors
Recommendation of Doctors and Medicines Using Review Mining

The recommender system is the core component of the social network named HealthNet (HN). The recommendation algorithm first computes similarities among patients, and then generates a ranked list of doctors and hospitals suitable for a given patient profile, by exploiting health data shared by the community. Accordingly, the HN user can find her most similar patients, look how they cured their diseases, and receive suggestions for solving her problem.

HN is implemented as a standard social network where users are patients. The first interaction with the system is the registration step. Then, the patient can enter personal health data: conditions, treatments (e.g., drugs, dosages, side effects, surgeries), health indicators (e.g., blood pressure, body weight, laboratory analysis, etc.), consulted doctors, hospitalizations. In this way, HN centralizes individual health data and allows a simple and organized access to them.

The Recommender System is the core component of HN. It exploits patient profiles to suggest similar patients, doctors, and hospitals (the list of suggested patients, doctors, and hospitals can be further filtered by position and disease). The similarity between two patients $p, p^{\prime}$ is computed in terms of conditions and treatments. The semantic matching between conditions exploits the HN disease hierarchy. More formally, the similarity score between two patients is computed as follows: $$s(p, p^{\prime}) = \alpha\frac{\sum_{i=1}^{k}\sum_{j=1}^{n}s_c(p_{c_i}, p^{\prime}_{c_j})}{kn} + (1-\alpha)\frac{\sum_{i=1}^{z}\sum_{j=1}^{r}s_t(p_{t_i}, p^{\prime}_{t_j})}{zr}$$ where $k$ (respectively $n$) is the number of conditions $p$ (respectively $p^{\prime}$) is affected by, $p_c$ is a condition of the patient $p$, $z$ (respectively $r$) is the number of treatments for $p$ (respectively $p^{\prime}$), and $p_t$ is a treatment for the patient $p$. The condition similarity $s_c$ and the treatment similarity $s_t$ are computed as follows: $$s_c(p_{c_i}, p^{\prime}_{c_j}) = \begin{cases} \log\frac{p_{c_i}}{p^{\prime}_{c_j}}, &\text{if $c_i=c_j$}\\ \frac{1}{sp(c_i, c_j)}, &\text{otherwise} \end{cases}, \quad s_t(p_{t_i}, p^{\prime}_{t_j}) = \begin{cases} 1, &\text{if $t_i=t_j$}\\ 0, &\text{otherwise} \end{cases}. $$
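The weighted average in the similarity score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_condition_similarity` and `TOY_PATH_LENGTHS` are hypothetical stand-ins, $sp(c_i, c_j)$ is read as a shortest-path length in the disease hierarchy, and identical conditions are simply scored 1.0 because the quantities needed for the log-ratio term are not specified in the text.

```python
# Hypothetical shortest-path lengths between conditions in a toy disease hierarchy.
TOY_PATH_LENGTHS = {frozenset({"hypertension", "heart failure"}): 2}


def toy_condition_similarity(c1: str, c2: str) -> float:
    """Stand-in for s_c (an assumption, not the HealthNet definition)."""
    if c1 == c2:
        return 1.0  # assumed value; the text uses a log-ratio term here
    return 1.0 / TOY_PATH_LENGTHS[frozenset({c1, c2})]


def treatment_similarity(t1: str, t2: str) -> float:
    """s_t: 1 if the two treatments are identical, 0 otherwise."""
    return 1.0 if t1 == t2 else 0.0


def patient_similarity(conds_p, conds_q, treats_p, treats_q,
                       cond_sim=toy_condition_similarity, alpha=0.5):
    """alpha * mean pairwise condition similarity
    + (1 - alpha) * mean pairwise treatment similarity."""
    if not (conds_p and conds_q and treats_p and treats_q):
        raise ValueError("both patients need at least one condition and one treatment")
    cond_total = sum(cond_sim(a, b) for a in conds_p for b in conds_q)
    treat_total = sum(treatment_similarity(a, b) for a in treats_p for b in treats_q)
    cond_term = cond_total / (len(conds_p) * len(conds_q))
    treat_term = treat_total / (len(treats_p) * len(treats_q))
    return alpha * cond_term + (1 - alpha) * treat_term


score = patient_similarity(
    ["hypertension"], ["hypertension", "heart failure"],
    ["ramipril"], ["ramipril", "furosemide"],
)
print(score)  # 0.5 * 0.75 + 0.5 * 0.5 = 0.625
```

Passing the condition similarity as a function keeps the averaging logic independent of how the disease hierarchy is actually represented.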

A Recommender System for Connecting Patients to the Right Doctors in the HealthNet Social Network

Patient-Doctor Matchmaking

A patient-doctor matchmaking system can be viewed from several perspectives:

From patients’ perspectives, such systems should provide explainable recommendations and safeguard against poor recommendations in order to be trustworthy.
From the perspective of healthcare professionals, these systems need to provide suitable recommendations based on their domain knowledge and experience.
More generally, insurance companies and healthcare institutes are interested in improving recommendation rates through research and reaping the potential benefits of these recommendation systems.

The features include demographic data, behavioral data, ICD-9 codes, interactions, and the number of visits to the doctor.

A Hybrid Recommender System for Patient-Doctor Matchmaking in Primary Care performs hybrid matrix factorization (MF) and recommends each patient a list of family doctors according to the level of information available about them. This is achieved by learning latent representations for patients and doctors from their interactions and metadata.

Given the different levels of information available about different patients, five use cases are proposed to make doctor recommendations in different scenarios.

The patient-doctor interaction matrix $Y \in \mathbb{R}^{M\times N}$ is defined as: $$y_{ij} = \begin{cases} 1, &\text{if an interaction between patient $i$ and doctor $j$ exists}\\ 0, &\text{otherwise} \end{cases} $$

MF learns latent vectors $\mathbf{p}_i$ and $\mathbf{q}_j$ such that the predicted score $\hat{y}_{ij}$ for an unobserved entry is given by a sigmoid of the inner product of the latent patient and doctor representations: $$\hat{y}_{ij}=g(i,j\mid \mathbf{p}_i, \mathbf{q}_j)=g(\mathbf{p}_i\cdot \mathbf{q}_j)=\frac{1}{1+\exp(-\langle\mathbf{p}_i,\mathbf{q}_j\rangle)}.$$
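A minimal NumPy sketch of the interaction matrix and the MF scoring step may help here. It assumes $g$ is the standard logistic sigmoid; the matrix sizes, the example interactions, and the randomly initialized (untrained) latent vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np


def build_interaction_matrix(pairs, n_patients, n_doctors):
    """Y[i, j] = 1 if patient i has interacted with doctor j, else 0."""
    Y = np.zeros((n_patients, n_doctors))
    for i, j in pairs:
        Y[i, j] = 1.0
    return Y


def predict_scores(P, Q):
    """Sigmoid of the inner products between patient and doctor vectors."""
    return 1.0 / (1.0 + np.exp(-(P @ Q.T)))


rng = np.random.default_rng(0)
M, N, dim = 4, 3, 2                               # toy sizes (assumed)
Y = build_interaction_matrix([(0, 1), (2, 0), (3, 2)], M, N)
P = rng.normal(scale=0.1, size=(M, dim))          # latent patient vectors p_i
Q = rng.normal(scale=0.1, size=(N, dim))          # latent doctor vectors q_j
Y_hat = predict_scores(P, Q)                      # one score per (patient, doctor)
print(Y_hat.shape)  # (4, 3)
```

Because the vectors are untrained, all scores stay close to 0.5; training moves the scores of observed pairs above those of unobserved ones.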

The model is then trained as a learning-to-rank task using the Weighted Approximate-Rank Pairwise (WARP) loss. For each observed interaction $y_{ij}$, WARP samples a negative doctor $d$, computes the difference between the predicted scores $\hat{y}_{ij}$ and $\hat{y}_{id}$, and performs a gradient update to rank the positive doctor higher if the difference is negative, i.e., a rank violation is found. Otherwise, it continues sampling negative doctors until it identifies a violating example. The number of negative doctors $d$ that must be sampled before a violation is found serves as an estimate of the rank of doctor $j$ for patient $i$: the more samples are needed, the closer doctor $j$ already is to the top of the ranking.
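The sampling loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it scores with the raw inner product (the sigmoid is monotone, so the ranking is the same), omits the margin used in many WARP implementations, and takes the rank estimate $\lfloor (N-1)/\text{trials} \rfloor$ and the harmonic weighting from the original WARP formulation rather than from the text. The function names are hypothetical.

```python
import numpy as np


def warp_sample(scores, positive, observed, rng, max_trials=None):
    """Sample negative doctors until one outranks the positive doctor.

    Returns (negative_index, estimated_rank), or None if no violation is found.
    """
    n_items = len(scores)
    negatives = [d for d in range(n_items) if d not in observed]
    if max_trials is None:
        max_trials = len(negatives)
    for trial in range(1, max_trials + 1):
        d = int(rng.choice(negatives))
        if scores[positive] - scores[d] < 0:      # rank violation
            return d, (n_items - 1) // trial      # few trials => poor rank
    return None


def warp_update(P, Q, i, j, observed, rng, lr=0.05):
    """One WARP-style gradient step for patient i and positive doctor j (in place)."""
    hit = warp_sample(Q @ P[i], j, observed, rng)
    if hit is None:
        return False
    d, est_rank = hit
    weight = sum(1.0 / r for r in range(1, est_rank + 1))  # harmonic rank weight
    p_old = P[i].copy()
    P[i] += lr * weight * (Q[j] - Q[d])   # move patient toward positive doctor
    Q[j] += lr * weight * p_old           # raise the positive doctor's score
    Q[d] -= lr * weight * p_old           # lower the violating doctor's score
    return True
```

A violation found on the first draw means many doctors probably outrank the positive one, so the update receives a large weight; a violation found only after many draws receives a small one.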

We can model the trust $T_{ij} (t)$ between a patient $i$ and a family doctor $j$ at time $t$, given both the frequency and recency of their consultation history, as: $$T_{ij} (t)=\sum_{t}\sum_{k}\frac{C_{ij}(t)e^{-\lambda t}}{C_{ik}(t)}$$ where $\lambda$ is the annualized discount rate of the exponential decay function, treated as a hyper-parameter during model training, and $C_{ij}(t)$ is the number of consultations between patient $i$ and doctor $j$ until year $t$, normalized by the total number of the patient's consultations with the $k$ doctors, $C_{ik} (t)$, thus far.
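To make the interplay of frequency and recency concrete, here is a small sketch of one possible reading of the trust formula. It is an assumption-heavy illustration: it uses per-year consultation counts, normalizes by the patient's total consultations in that year, and interprets the decay exponent as the number of years elapsed before a hypothetical `current_year`. The function name and the toy data are invented for the example.

```python
import math


def trust_score(consults_by_year, doctor, current_year, decay):
    """Decayed share of one patient's consultations that went to `doctor`.

    consults_by_year: {year: {doctor_id: consultation_count}} for one patient.
    """
    total = 0.0
    for year, counts in consults_by_year.items():
        year_total = sum(counts.values())
        if year_total == 0:
            continue                                   # no consultations that year
        share = counts.get(doctor, 0) / year_total     # frequency
        age = current_year - year                      # recency, in years (assumed)
        total += share * math.exp(-decay * age)
    return total


history = {2018: {"A": 2, "B": 2}, 2019: {"A": 3, "B": 1}}
lam = math.log(2)  # hypothetical decay rate: weight halves every year
print(trust_score(history, "A", 2019, lam))  # about 1.0
print(trust_score(history, "B", 2019, lam))  # about 0.5
```

Doctor A scores higher both because A received a larger share of consultations and because that share grew in the most recent year.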

Collaborative Filtering for Implicit Feedback Datasets

AI Researchers use AI to match patients with primary care doctors
A Hybrid Recommender System for Patient-Doctor Matchmaking in Primary Care
http://www.suggestadoctor.com/
https://www.researchgate.net/profile/Bo_Jin16
https://orcid.org/0000-0002-4209-4637
https://nyulangone.org/doctors
https://destrin.smalldata.io/
https://smalldata.io/
http://www.www2015.it/industrial-track/
http://www.itu.dk/~bardram/pmwiki/
http://www.bardram.net/
http://www.cachet.dk/
https://www.researchgate.net/profile/Jakob_Bardram
https://wp.cs.ucl.ac.uk/acm-digitalhealth-2015/alberto-sanna/
https://www.acm-digitalhealth.org/2018/committee/alberto-sanna/index.html
https://research.hsr.it/en/index.html

DeepReco

DeepReco: Deep Learning Based Health Recommender System Using Collaborative Filtering
https://utd-ir.tdl.org/handle/10735.1/7542

Resource on RecSys

http://www.cs.ucr.edu/~cshelton/
http://hst.mit.edu/users/rgmarkmitedu
http://erichorvitz.com/
https://www.hms.harvard.edu/dms/neuroscience/fac/Kohane.php
https://www.khoury.northeastern.edu/people/carla-brodley/
https://mquad.github.io/
https://www.aau.at/en/ainf/research-groups/infsys/team/dietmar-jannach/
https://xamat.github.io/
http://presnick.people.si.umich.edu/
https://www.stern.nyu.edu/faculty/bio/alexander-tuzhilin
http://people.stern.nyu.edu/atuzhili/
https://www.researchgate.net/profile/Markus_Zanker
https://cseweb.ucsd.edu/~jmcauley/datasets.html

https://zhuanlan.zhihu.com/p/87293483
https://www.zhihu.com/question/336304380/answer/784976195
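Latest! Must-Read 2019 Papers on Deep Recommender Systems and CTR Prediction from Five Top Conferences (article by 深度传送门 on Zhihu)
Applications of Deep Learning in Search and Recommender Systems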
CSE 258: Web Mining and Recommender Systems
CSE 291: Trends in Recommender Systems and Human Behavioral Modeling
THE AAAI-19 WORKSHOP ON RECOMMENDER SYSTEMS AND NATURAL LANGUAGE PROCESSING (RECNLP)
Information Recommendation for Online Scientific Communities, Purdue University, Luo Si, Gerhard Klimeck and Michael McLennan
Recommendations for all : solving thousands of recommendation problems a day
http://staff.ustc.edu.cn/~hexn/
Learning Item-Interaction Embeddings for User Recommendations
Summary of RecSys
How Netflix’s Recommendations System Works
Personalized Recommender Systems: Five Research Hotspots Worth Following
How Does Spotify Know You So Well?
A Collection of Recommender System Papers
http://csse.szu.edu.cn/staff/panwk/recommendation/
https://hong.xmu.edu.cn/Services___fw/Recommender_System.htm
https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3
https://buildingrecommenders.wordpress.com/
https://homepages.dcc.ufmg.br/~rodrygo/recsys-2019-1/
https://developers.google.com/machine-learning/recommendation/
https://sites.google.com/view/lianghu/home/tutorials/ijcai2019
https://acmrecsys.github.io/rsss2019/
https://github.com/alibaba/x-deeplearning/wiki
https://apple.github.io/turicreate/docs/userguide/recommender/

Labs

http://www.christophtrattner.com/
http://elizabethchurchill.com/presentations/
http://www.that-recsys-lab.net/
https://www.christophtrattner.info/publications.html
https://www.ludovicoboratto.com/publications/
https://www.ucsm.info/publications
http://recsys.deib.polimi.it/
https://www.know-center.tugraz.at/en/publications/publications/
https://qcri.academia.edu/LuisLuque
http://www.martijnwillemsen.nl/recommenderlab/
https://cseweb.ucsd.edu/~jmcauley/
https://github.com/mJackie/RecSys
https://piret.gitlab.io/fatrec/
https://ailab.criteo.com/publications/
https://layer6.ai/
https://cseweb.ucsd.edu/~jmcauley/career.html
Recommender Systems
https://libraries.io/github/computational-class
http://www.52caml.com/
http://www-scf.usc.edu/~kuanl/
http://www.yjzheng.com/

Workshop

DLRS 2018 : 3rd Workshop on Deep Learning for Recommender Systems
Deep Learning based Recommender System: A Survey and New Perspectives
$5^{th}$ International Workshop on Machine Learning Methods for Recommender Systems
MoST-Rec 2019: Workshop on Model Selection and Parameter Tuning in Recommender Systems
2018 Personalization, Recommendation and Search (PRS) Workshop
WIDE & DEEP RECOMMENDER SYSTEMS AT PAPI
Interdisciplinary Workshop on Recommender Systems
2nd FATREC Workshop: Responsible Recommendation

Social Media Mining: An Introduction
The 2nd International Workshop on ExplainAble Recommendation and Search (EARS 2019)
NLP meets RecSys
http://dmml.asu.edu/smm/slide/SMM-Slides-ch9.pdf
PRS 2019
https://dlp-kdd.github.io/
https://recsys.acm.org/blog/

Implementation

Surprise: a Python scikit for building and analyzing recommender systems
Orange3-Recommendation: a Python library that extends Orange3 to include support for recommender systems.
MyMediaLite: a recommender system library for the Common Language Runtime
http://www.mymediaproject.org/
Workshop: Building Recommender Systems with Apache Spark 2.x
A Leading Java Library for Recommender Systems
lenskit: Python Tools for Recommender Experiments
Samantha - A generic recommender and predictor server

http://libfm.org/
https://github.com/srendle/libfm
https://www.csie.ntu.edu.tw/~cjlin/libffm/
https://github.com/srendle
https://github.com/lyst/lightfm
https://github.com/guoguibing/librec
https://recbole.io/index.html

TensorFlow implementation of an arbitrary order Factorization Machine
https://github.com/tensorflow/recommenders
https://github.com/fuhailin/DeePray

https://github.com/gasevi/pyreclab
https://github.com/cheungdaven/DeepRec
https://github.com/cyhong549/DeepFM-Keras
https://github.com/grahamjenson/list_of_recommender_systems
https://github.com/maciejkula/spotlight
https://github.com/Microsoft/Recommenders
https://github.com/alibaba/euler
https://github.com/alibaba/x-deeplearning/wiki/

Preference Learning

Roughly speaking, preference learning is about methods for learning preference models from explicit or implicit preference information, typically used for predicting the preferences of an individual or a group of individuals. Approaches relevant to this area range from learning special types of preference models, such as lexicographic orders, over "learning to rank" for information retrieval to collaborative filtering techniques for recommender systems.

http://www.ke.tu-darmstadt.de/events/PL-10/
http://www.preference-learning.org/
Preference Learning: Problems and Applications in AI (PL-12) ECAI-12 Workshop, Montpellier
From Multiple Criteria Decision Aid to Preference Learning
Research Workshop on AI for Preference Learning: Sentiment, Comparison, and Recommendation
PREFERENCE LEARNING WHAT DEFINES AN OPTIMAL SHIFT SCHEDULE?
Preference Learning: A Tutorial Introduction
Preference Learning: An Introduction
http://plt.institutedigitalgames.com/
http://www.gatsby.ucl.ac.uk/~chuwei/

Pairwise Preference Learning and Ranking

Pairwise Preference Learning and Ranking
Preference Learning and Ranking by Pairwise Comparison
Preference Uncertainty, Preference Learning, and Paired Comparison Experiments
Learning Mallows Models with Pairwise Preferences

Collaborative Preference Learning

Collaborative Gaussian Processes for Preference Learning
Poster Collaborative Gaussian Processes for Preference Learning
Scalable Collaborative Bayesian Preference Learning
Collaborative Preference Learning: A Case Study
Collaborative Context-aware Preference Learning
Neural Collaborative Preference Learning with Pairwise Comparisons

Preference Learning and Gaussian Processes

Multi-Task Preference Learning with Gaussian Processes
Gaussian Process Preference Elicitation
Extensions of Gaussian Processes for Ranking: Semi-supervised and Active Learning
Preference Learning with Gaussian Processes
https://github.com/neilhoulsby/pref_learning
Fast Active Exploration for Link-Based Preference Learning using Gaussian Processes

Preference Learning and Choice Model

Preference learning has been studied for several decades and has drawn increasing attention in recent years due to its importance in web applications, such as ad serving, search, and electronic commerce. In all of these applications, we observe (often discrete) choices that reflect relative preferences among several items, e.g. products, songs, web pages or documents. Moreover, observations are in many cases censored. Hence, the goal is to reconstruct the overall model of preferences by, for example, learning a general ordering function based on the partially observed decisions. Choice models try to predict the specific choices individuals (or groups of individuals) make when offered a possibly very large number of alternatives. Traditionally, they are concerned with the decision process of individuals and have been studied independently in machine learning, data and web mining, econometrics, and psychology. However, these diverse communities have had few interactions in the past. One goal of this workshop is to foster interdisciplinary exchange, by encouraging abstraction of the underlying problem (and solution) characteristics.

Choice Models and Preference Learning Workshop
The 12th International Conference on Modeling Decisions for Artificial Intelligence
Choice Models and Preference Learning: NIPS workshop, 17 December 2011, Sierra Nevada, Spain
Active Preference Learning with Discrete Choice Data
A rational model of preference learning and choice prediction by children
Modeling preference evolution in discrete choice models: A Bayesian state-space approach

Modeling Users’ Preferences

The ever-growing nature of user generated data in online systems poses obvious challenges on how we process such data. Typically, this issue is regarded as a scalability problem and has been mainly addressed with distributed algorithms able to train on massive amounts of data in short time windows. However, data is inevitably adding up at high speeds. Eventually one needs to discard or archive some of it. Moreover, the dynamic nature of data in user modeling and recommender systems, such as change of user preferences, and the continuous introduction of new users and items make it increasingly difficult to maintain up-to-date, accurate recommendation models.

Workshop on Online Recommender Systems and User Modeling
Modeling Users’ Preferences and Social Links in Social Networking Services: A Joint-Evolving Perspective
Modeling and Learning User Preferences Over Sets
Modeling the Dynamics of User Preferences in Coupled Tensor Factorization
Modeling Users’ Mobile App Privacy Preferences: Restoring Usability in a Sea of Permission Settings
Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation
https://fangyuan1st.github.io/paper/ECML16_SEQ_slides.pdf
Deep Modeling of the Evolution of User Preferences and Item Attributes in Dynamic Social Networks

Computational Advertising

Advertising, recommendation, and search are the three foundation stones of the e-economy.

https://www.ecommercefoundation.org/reports

Online advertising has grown over the past decade to over 26 billion dollars in recorded revenue in 2010. The revenues generated are based on different pricing models that can be fundamentally grouped into two types: cost per (thousand) impressions (CPM) and cost per action (CPA), where an action can be a click, signing up with the advertiser, a sale, or any other measurable outcome. A web publisher generating revenues by selling advertising space on its site can offer either a CPM or CPA contract. We analyze the conditions under which the two parties agree on each contract type, accounting for the relative risk experienced by each party.
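To compare the two pricing models on a common scale, a CPA price can be converted into expected revenue per thousand impressions (often called eCPM). The sketch below is a simplification: it assumes a known action rate and risk-neutral parties, whereas the analysis summarized above explicitly accounts for risk. The function names and the numbers are hypothetical.

```python
def ecpm_from_cpa(cpa_price: float, action_rate: float) -> float:
    """Expected revenue per 1000 impressions under a CPA contract.

    action_rate is the probability that one impression leads to a paid action.
    """
    return 1000.0 * action_rate * cpa_price


def preferred_contract(cpm_price: float, cpa_price: float, action_rate: float) -> str:
    """Contract with the higher expected publisher revenue (risk ignored)."""
    return "CPA" if ecpm_from_cpa(cpa_price, action_rate) > cpm_price else "CPM"


# Toy numbers: $2.00 per thousand impressions vs. $0.50 per click at a 0.5% click rate.
print(ecpm_from_cpa(0.50, 0.005))             # about 2.5 dollars per 1000 impressions
print(preferred_contract(2.00, 0.50, 0.005))  # CPA
```

Under CPM the advertiser bears the risk of a low action rate, while under CPA the publisher does, which is why an expected-value comparison alone does not settle the choice.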

Information technology companies such as Google, Facebook, and Alibaba rely heavily on online advertising. An advertisement is simply information, but information that users rarely welcome. Advertising is in fact harder than recommendation, because less is known about the context in which the advertisement is placed.

Liangjie Hong shares three challenges of computational advertising at Etsy, which serve as the titles of the following subsections.

Why Does Advertising Need Computation?
A Collection of Computational Advertising Resources
ONLINE VIDEO ADVERTISING: All you need to know in 2019
Computational Advertising
Computational Advertising and Machine Learning
https://headerbidding.co/category/adops/
Deep Learning Based Modeling in Computational Advertising: A Winning Formula
Computational Marketing
Data Science and Analytics in Computational Advertising
Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
Text Mining in Computational Advertising
https://stat.duke.edu/people/david-l-banks
http://wnzhang.net/teaching/ee448/
https://recsys.acm.org/recsys08/keynotes/
https://www.researchgate.net/profile/Andrei_Broder

Click-Through Rate Modeling

GBRT+LR

Given a feature vector ${x}$, the GBRT routes it through each of its trees; the leaf that ${x}$ reaches in each tree is treated as a categorical feature, and these transformed features are then fed into logistic regression.

Practical Lessons from Predicting Clicks on Ads at Facebook (or the blog) uses GBRT to construct suitable features and LR to map these features to a click probability in the interval $[0,1]$. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale).
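The GBRT+LR pipeline can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the production setup from the paper: the dataset, the hyper-parameters, and the helper names `leaf_indices` and `predict_ctr` are assumptions. The trees are fitted on one half of the data and the logistic regression on the other half, so that the linear model is not trained on leaves the trees have already overfitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for click logs (label 1 = click).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_gbrt, X_lr, y_gbrt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: boosted trees learn non-linear feature combinations.
gbrt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbrt.fit(X_gbrt, y_gbrt)


def leaf_indices(model, X):
    """Index of the leaf each sample reaches in every tree.

    For binary classification, apply() has shape (n_samples, n_estimators, 1).
    """
    return model.apply(X)[:, :, 0]


# Step 2: one-hot encode the leaf indices and fit logistic regression on them.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaf_indices(gbrt, X_lr)), y_lr)


def predict_ctr(X_new):
    """Predicted click probability in [0, 1] for each row of X_new."""
    return lr.predict_proba(encoder.transform(leaf_indices(gbrt, X_new)))[:, 1]


ctr = predict_ctr(X[:5])
print(ctr.shape)  # (5,)
```

Each tree contributes one categorical feature (its leaf index), so the logistic regression effectively learns a weight for every leaf of every tree.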

Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction
A Chat About Deep Learning in CTR Prediction
Deep Models at DeepCTR
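The Pearl Set in Internet Technology: A Discussion of Advances in CTR Prediction in the Deep Learning Era
CTR Prediction Algorithms: FM, FFM, DeepFM, and Practice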
Turning Clicks into Purchases
https://github.com/shenweichen/DeepCTR
https://github.com/wzhe06/CTRmodel
https://github.com/cnkuangshi/LightCTR
https://github.com/evah/CTR_Prediction
http://2016.qconshanghai.com/track/3025/
https://blog.csdn.net/u011747443/article/details/68928447

Conversion Rate Modeling

Post-Click Conversion Modeling and Analysis for Non-Guaranteed Delivery Display Advertising
Estimating Conversion Rate in Display Advertising from Past Performance Data
https://www.optimizesmart.com/

Bid Optimization

A collection of research and survey papers of real-time bidding (RTB) based display advertising techniques.

http://yelp.github.io/MOE/
http://www.hongliangjie.com/talks/AICon2018.pdf
https://sites.google.com/view/tsmo2018/invited-talks
https://matinathomaidou.github.io/research/
https://www.usermind.com/

User Engagement

User engagement measures whether users find value in a product or service. Engagement can be measured by a variety or combination of activities such as downloads, clicks, shares, and more. Highly engaged users are generally more profitable, provided that their activities are tied to valuable outcomes such as purchases, signups, subscriptions, or clicks.

WHAT IS USER ENGAGEMENT?
What is Customer Engagement, and Why is it Important?
What is user engagement? A conceptual framework for defining user engagement with technology
How to apply AI for customer engagement
The future of customer engagement
Second Uber Science Symposium: Exploring Advances in Behavioral Science
Measuring User Engagement
https://uberbehavioralsciencesymposium.splashthat.com/
https://inlabdigital.com/
https://www.futurelab.net/
http://www.ueo-workshop.com/

The User Engagement Optimization Workshop2
The User Engagement Optimization Workshop1
EVALUATION OF USER EXPERIENCE IN MOBILE ADVERTISI
WWW 2019 Tutorial on Online User Engagement
http://www.ueo-workshop.com/program/
https://www.nngroup.com/
https://labtomarket.eu/
http://research.google.com/pubs/AmrAhmed.html
https://home.ubalt.edu/ntsbarsh/business-stat/opre504.htm
https://www.nersc.gov/about/nersc-staff/user-engagement/
https://www.microsoft.com/en-us/research/people/eladyt/
http://yom-tov.info/

User Modeling

User models are used to generate or adapt user interfaces at runtime, to address particular user needs and preferences. User models are also known as user profiles, personas or archetypes. They can be used by designers and developers for personalisation purposes and to increase the usability and accessibility of products and services.

https://www.um.org/
https://www.um.org/umap2020/
https://www.um.org/awards/best-paper-awards
https://www.w3.org/WAI/RD/wiki/User_modeling
https://www2018.thewebconf.org/program/user-modeling/
http://kdd2018tutorial-behavior.datasciences.org/
https://www2019.thewebconf.org/research-track/user-modeling-personalization-and-experience
User Modeling: Recent Work, Prospects and Hazards
Research on the Use, Characteristics, and Impact of e-Commerce Product Recommendation Agents: A Review and Update for 2007–2012
Beyond Bags of Words: Modeling Implicit User Preferences in Information Retrieval
Modeling User Exposure in Recommendation
E-Commerce Product Recommendation Agents: Use, Characteristics, and Impact
Modeling User Preferences and Mediating Agents in Electronic Commerce
http://www.humanize-workshop.org/
http://iwum.org/

Resource