Understanding BPR, COFISET and BPRH
BPR, COFISET, and BPRH all tackle collaborative filtering with matrix factorization and implicit feedback. This task differs from 5-star rating estimation in that only certain user actions (e.g., purchase, view, like, dislike) are observed.
This post includes my understanding of these three models.
TODO
- BPR
- Backgrounds for COFISET
- Backgrounds for BPRH
- SGD for COFISET and BPRH
- Implementation of BPRH in a Jupyter Notebook on a small, simple example
- Full implementation of BPRH
BPR
UNDER CONSTRUCTION
Implementation
You may find several versions of BPR implementations below.
- https://github.com/jokingcoco/BPR_Bayesian-Personalized-Ranking_MPR_Multiple-Pairwise-Ranking
- https://github.com/valerystrizh/bpr
- https://github.com/lyst/lightfm
- https://github.com/sh0416/bpr
- https://github.com/guoguibing/librec/search?q=BPR&unscoped_q=BPR
COFISET
Item-sets and Pairwise Preferences
An item-set is a set of items. A user $u$'s preference over an item-set is a function of $u$'s preferences on the items in that item-set. For example, $u$'s preference on item-set $\mathcal{P}$ can be defined as $\hat{r}_{u \mathcal{P}} = |\mathcal{P}|^{-1} \sum_{i \in \mathcal{P}} \hat{r}_{ui}$, where $\hat{r}_{ui}$ is $u$'s estimated preference on item $i$. A user $u$'s pairwise preference over two item-sets $\mathcal{P}, \mathcal{A}$ is then defined as $\hat{r}_{u \mathcal{P} \mathcal{A}} = \hat{r}_{u \mathcal{P}} - \hat{r}_{u \mathcal{A}}$. COFISET assumes the pairwise preference $\hat{r}_{u \mathcal{P} \mathcal{A}} = \hat{r}_{u \mathcal{P}} - \hat{r}_{u \mathcal{A}} > 0$ for $\mathcal{P} \subseteq \mathcal{I}^{tr}_{u}$ and $\mathcal{A} \subseteq \mathcal{I} \setminus \mathcal{I}^{tr}_{u}$, where $\mathcal{P} \subseteq \mathcal{I}^{tr}_{u}$ is a set of items with observed feedback from user $u$ and $\mathcal{A} \subseteq \mathcal{I} \setminus \mathcal{I}^{tr}_{u}$ is a set of items without observed feedback from user $u$.
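To make these definitions concrete, here is a minimal numpy sketch (toy dimensions, random initialization; the names are mine, not from the paper) of item-level preferences, item-set preferences, and the pairwise preference between two item-sets:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 10, 4

# Illustrative model parameters.
U = rng.normal(scale=0.1, size=(n_users, d))  # user latent factors
V = rng.normal(scale=0.1, size=(n_items, d))  # item latent factors
b = np.zeros(n_items)                         # item biases

def r_hat(u, i):
    """Estimated preference of user u on item i: U_u^T V_i + b_i."""
    return U[u] @ V[i] + b[i]

def set_preference(u, item_set):
    """User u's preference on an item-set: the mean of the item preferences."""
    return np.mean([r_hat(u, i) for i in item_set])

# Pairwise preference of user u over item-sets P (observed) and A (unobserved).
u, P, A = 0, [1, 3, 4], [2, 7]
r_uPA = set_preference(u, P) - set_preference(u, A)  # COFISET assumes this > 0
```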
Objective Function and Optimization
COFISET’s relaxed optimization problem over all users is
$$ \min_{\Theta} \sum_{u \in \mathcal{U}^{tr}} \sum_{\mathcal{P} \subseteq \mathcal{I}_{u}^{tr}} \sum_{\mathcal{A} \subseteq \mathcal{I} \setminus \mathcal{I}_{u}^{tr}} \left[ \mathcal{L}(u, \mathcal{P}, \mathcal{A}) + \mathcal{R}(u, \mathcal{P}, \mathcal{A}) \right], $$ where $\Theta = \{ U_{u} \in \mathbb{R}^{ d \times 1}, V_{i} \in \mathbb{R}^{d \times 1}, b_i \in \mathbb{R}, u \in \mathcal{U}^{tr}, i \in \mathcal{I}^{tr} \}$ is the set of model parameters, the loss term is $\mathcal{L}(u, \mathcal{P}, \mathcal{A}) = - \ln{\sigma(\hat{r}_{u\mathcal{P}\mathcal{A}})}$ with $\sigma(z) = 1/(1+\exp{(-z)})$, the pairwise preference of user $u$ between the two item-sets $\mathcal{P}$ and $\mathcal{A}$ is $\hat{r}_{u\mathcal{P}\mathcal{A}} = \hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}$, the estimated preferences of user $u$ on item-sets $\mathcal{P}$ and $\mathcal{A}$ are $\hat{r}_{u\mathcal{P}} = |\mathcal{P}|^{-1} \sum_{i \in \mathcal{P}} \hat{r}_{ui}$ and $\hat{r}_{u\mathcal{A}} = |\mathcal{A}|^{-1} \sum_{j \in \mathcal{A}} \hat{r}_{uj}$, the estimated preference of user $u$ on item $i$ is $\hat{r}_{ui} = U_{u}^\top V_{i} + b_i$, and the regularization term is $\mathcal{R}(u, \mathcal{P}, \mathcal{A}) = \frac{\alpha_u}{2} \| U_u \|^2 + \sum_{i \in \mathcal{P}} [\frac{\alpha_V}{2} \| V_i \|^2 + \frac{\beta_V}{2} b_i^2] + \sum_{j \in \mathcal{A}} [\frac{\alpha_V}{2} \| V_j \|^2 + \frac{\beta_V}{2} b_j^2]$.
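Continuing the sketch above, the per-sample objective (loss plus regularization) for one $(u, \mathcal{P}, \mathcal{A})$ triple can be evaluated as follows; the hyperparameter values are illustrative:

```python
def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cofiset_objective(u, P, A, alpha_u=0.01, alpha_v=0.01, beta_v=0.01):
    """Loss plus regularization for one sampled (u, P, A) triple."""
    r_uPA = set_preference(u, P) - set_preference(u, A)
    loss = -np.log(sigmoid(r_uPA))
    reg = 0.5 * alpha_u * U[u] @ U[u]
    for x in list(P) + list(A):
        reg += 0.5 * alpha_v * V[x] @ V[x] + 0.5 * beta_v * b[x] ** 2
    return loss + reg
```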
Corresponding Gradients in SGD
Stochastic gradient descent is applied in COFISET’s optimization. The gradients of $\mathcal{L}(u, \mathcal{P}, \mathcal{A})$ and $\mathcal{R}(u, \mathcal{P}, \mathcal{A})$ with respect to the model parameters $\Theta$ are
$$ \begin{aligned} \dfrac{\partial \mathcal{L}}{\partial U_{u}} & = - \dfrac{\partial \ln \sigma}{\partial \sigma} \dfrac{\partial \sigma}{\partial \hat{r}_{u \mathcal{P} \mathcal{A}} } \dfrac{\partial \hat{r}_{u \mathcal{P} \mathcal{A}}}{\partial U_{u}} \\ & = - \sigma^{-1}(\hat{r}_{u \mathcal{P} \mathcal{A}}) \sigma(\hat{r}_{u \mathcal{P} \mathcal{A}}) (1-\sigma(\hat{r}_{u \mathcal{P} \mathcal{A}})) \dfrac{\partial (\hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}) }{\partial U_{u}}\\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) \left( |\mathcal{P}|^{-1} \sum_{i \in \mathcal{P}} V_{i} - |\mathcal{A}|^{-1} \sum_{j \in \mathcal{A}} V_{j} \right) \\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (\overline{V}_{\mathcal{P}} - \overline{V}_{\mathcal{A}}), \forall u \in \mathcal{U}^{tr} \\ \dfrac{\partial \mathcal{R}}{\partial U_u} & = \alpha_{u} U_{u}, \forall u \in \mathcal{U}^{tr} \\ \end{aligned} $$ $$\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial V_{i}} & = - \dfrac{\partial \ln \sigma}{\partial \sigma} \dfrac{\partial \sigma}{\partial \hat{r}_{u \mathcal{P} \mathcal{A}} } \dfrac{\partial \hat{r}_{u \mathcal{P} \mathcal{A}}}{\partial V_{i}} \\ & = - \sigma^{-1}(\hat{r}_{u \mathcal{P} \mathcal{A}}) \sigma(\hat{r}_{u \mathcal{P} \mathcal{A}}) (1-\sigma(\hat{r}_{u \mathcal{P} \mathcal{A}})) \dfrac{\partial (\hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}) }{\partial V_{i}}\\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (U_{u} / |\mathcal{P}|), \forall i \in \mathcal{P} \\ \dfrac{\partial \mathcal{R}}{\partial V_i} & = \alpha_{V} V_{i}, \forall i \in \mathcal{P} \\ \dfrac{\partial \mathcal{L}}{\partial b_i} & = - \dfrac{\partial \ln \sigma}{\partial \sigma} \dfrac{\partial \sigma}{\partial \hat{r}_{u \mathcal{P} \mathcal{A}} } \dfrac{\partial \hat{r}_{u \mathcal{P} \mathcal{A}}}{\partial b_{i}} \\ & = - \sigma^{-1}(\hat{r}_{u \mathcal{P} \mathcal{A}}) \sigma(\hat{r}_{u \mathcal{P} \mathcal{A}}) (1-\sigma(\hat{r}_{u \mathcal{P} \mathcal{A}})) \dfrac{\partial (\hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}) }{\partial b_{i}}\\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) / |\mathcal{P}|, \forall i \in \mathcal{P} \\ \dfrac{\partial \mathcal{R}}{\partial b_i} & = \beta_V b_{i}, \forall i \in \mathcal{P} \\ \end{aligned} $$ $$\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial V_j} & = - \dfrac{\partial \ln \sigma}{\partial \sigma} \dfrac{\partial \sigma}{\partial \hat{r}_{u \mathcal{P} \mathcal{A}} } \dfrac{\partial \hat{r}_{u \mathcal{P} \mathcal{A}}}{\partial V_{j}} \\ & = - \sigma^{-1}(\hat{r}_{u \mathcal{P} \mathcal{A}}) \sigma(\hat{r}_{u \mathcal{P} \mathcal{A}}) (1-\sigma(\hat{r}_{u \mathcal{P} \mathcal{A}})) \dfrac{\partial (\hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}) }{\partial V_{j}}\\ & = \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (U_{u} / |\mathcal{A}|), \forall j \in \mathcal{A} \\ \dfrac{\partial \mathcal{R}}{\partial V_j} & = \alpha_{V} V_{j}, \forall j \in \mathcal{A} \\ \dfrac{\partial \mathcal{L}}{\partial b_j} & = - \dfrac{\partial \ln \sigma}{\partial \sigma} \dfrac{\partial \sigma}{\partial \hat{r}_{u \mathcal{P} \mathcal{A}} } \dfrac{\partial \hat{r}_{u \mathcal{P} \mathcal{A}}}{\partial b_{j}} \\ & = - \sigma^{-1}(\hat{r}_{u \mathcal{P} \mathcal{A}}) \sigma(\hat{r}_{u \mathcal{P} \mathcal{A}}) (1-\sigma(\hat{r}_{u \mathcal{P} \mathcal{A}})) \dfrac{\partial (\hat{r}_{u\mathcal{P}} - \hat{r}_{u\mathcal{A}}) }{\partial b_{j}}\\ & = \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) / |\mathcal{A}|, \forall j \in \mathcal{A} \\ \dfrac{\partial \mathcal{R}}{\partial b_j} & = \beta_{V} b_{j}, \forall j \in \mathcal{A} \end{aligned} $$
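A minimal sketch of these per-sample gradients, continuing the toy setup above (the function name and hyperparameter values are mine):

```python
def cofiset_gradients(u, P, A, alpha_u=0.01, alpha_v=0.01, beta_v=0.01):
    """Per-sample gradients of L + R, mirroring the formulas above."""
    r_uPA = set_preference(u, P) - set_preference(u, A)
    s = sigmoid(-r_uPA)              # sigma(-r_uPA)
    V_P = V[P].mean(axis=0)          # mean item factors over P
    V_A = V[A].mean(axis=0)          # mean item factors over A
    grad_U = -s * (V_P - V_A) + alpha_u * U[u]
    grad_V = {i: -s * U[u] / len(P) + alpha_v * V[i] for i in P}
    grad_V.update({j: s * U[u] / len(A) + alpha_v * V[j] for j in A})
    grad_b = {i: -s / len(P) + beta_v * b[i] for i in P}
    grad_b.update({j: s / len(A) + beta_v * b[j] for j in A})
    return grad_U, grad_V, grad_b
```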
Updating Rules
Combining the gradients of the loss term and the regularization term, the final gradients for each variable $U_{u}, V_{i}, V_{j}, b_{i}, b_{j}$, with $u \in \mathcal{U}^{tr}, i \in \mathcal{P}, j \in \mathcal{A}$, are
$$ \begin{aligned} \nabla U_{u} & = \dfrac{\partial \mathcal{L}}{\partial U_{u}} + \dfrac{\partial \mathcal{R}}{\partial U_{u}} \\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (\overline{V}_{\mathcal{P}} - \overline{V}_{\mathcal{A}}) + \alpha_{u} U_{u} \\ \nabla V_{i} & = \dfrac{\partial \mathcal{L}}{\partial V_{i}} + \dfrac{\partial \mathcal{R}}{\partial V_{i}} \\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (U_{u} / |\mathcal{P}|) + \alpha_{V} V_{i} \\ \nabla V_{j} & = \dfrac{\partial \mathcal{L}}{\partial V_{j}} + \dfrac{\partial \mathcal{R}}{\partial V_{j}} \\ & = \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) (U_{u} / |\mathcal{A}|) + \alpha_{V} V_{j} \\ \nabla b_{i} & = \dfrac{\partial \mathcal{L}}{\partial b_{i}} + \dfrac{\partial \mathcal{R}}{\partial b_{i}} \\ & = - \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) / |\mathcal{P}| + \beta_V b_{i}\\ \nabla b_{j} & = \dfrac{\partial \mathcal{L}}{\partial b_{j}} + \dfrac{\partial \mathcal{R}}{\partial b_{j}} \\ & = \sigma(-\hat{r}_{u \mathcal{P} \mathcal{A}}) / |\mathcal{A}| + \beta_{V} b_{j} \end{aligned} $$ And the updating rules for each variable are $$ \begin{aligned} U_{u}^{t+1} & = U_{u}^{t} - \gamma \nabla U_{u}^{t}, u \in \mathcal{U}^{tr} \\ V_{i}^{t+1} & = V_{i}^{t} - \gamma \nabla V_{i}^{t}, i \in \mathcal{P} \\ V_{j}^{t+1} & = V_{j}^{t} - \gamma \nabla V_{j}^{t}, j \in \mathcal{A} \\ b_{i}^{t+1} & = b_{i}^{t} - \gamma \nabla b_{i}^{t}, i \in \mathcal{P} \\ b_{j}^{t+1} & = b_{j}^{t} - \gamma \nabla b_{j}^{t}, j \in \mathcal{A} \end{aligned} $$
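Putting the pieces together, one SGD step for a sampled $(u, \mathcal{P}, \mathcal{A})$ triple could look like this sketch (continuing the toy setup; the learning rate is illustrative):

```python
def cofiset_sgd_step(u, P, A, gamma=0.05, **reg):
    """One SGD step: evaluate per-sample gradients, then descend."""
    grad_U, grad_V, grad_b = cofiset_gradients(u, P, A, **reg)
    U[u] -= gamma * grad_U
    for x, g in grad_V.items():
        V[x] -= gamma * g
    for x, g in grad_b.items():
        b[x] -= gamma * g

# One illustrative step on the triple sampled earlier.
cofiset_sgd_step(u, P, A)
```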
Java Implementation
You may find an implementation of COFISET in LibRec, a well-known Java library for recommender systems.
BPRH
Extended Pairwise Preferences
Following COFISET, BPRH extends pairwise preferences over item-sets. Besides the target action (e.g., purchase), a second type of user action, called an auxiliary action, is introduced. Viewing, liking, and disliking are examples of auxiliary actions. The assumption of pairwise preferences in BPRH becomes $\hat{r}_{u \mathcal{I}} > \hat{r}_{u \mathcal{J}} > \hat{r}_{u \mathcal{K}}$ for $\mathcal{I} \subseteq \mathcal{I}_{t}^{u}, \mathcal{J} \subseteq \mathcal{I}_{oa}^{u}, \mathcal{K} \subseteq \mathcal{I}_{n}^{u}$, where $\mathcal{I} \subseteq \mathcal{I}_{t}^{u}$ is a set of items with target feedback from user $u$, $\mathcal{J} \subseteq \mathcal{I}_{oa}^{u}$ is a set of items with only auxiliary feedback from user $u$, and $\mathcal{K} \subseteq \mathcal{I}_{n}^{u}$ is a set of items without any observed feedback from user $u$. It follows that $\hat{r}_{u\mathcal{I}\mathcal{J}} > 0$ and $\hat{r}_{u\mathcal{J}\mathcal{K}} > 0$ for user $u$.
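For intuition, here is a small sketch of how one user's items might be partitioned into the three sets (the feedback sets are made-up toy data, continuing the setup above):

```python
# Hypothetical observed feedback for one user; item ids are illustrative.
target_items    = {1, 4}       # e.g. purchased
auxiliary_items = {2, 4, 7}    # e.g. viewed
all_items       = set(range(n_items))

I_t  = target_items                                  # target feedback
I_oa = auxiliary_items - target_items                # ONLY auxiliary feedback
I_n  = all_items - target_items - auxiliary_items    # no observed feedback

# Sample one (I, J, K) triple of item-sets for an SGD step.
I, J = sorted(I_t), sorted(I_oa)
K = rng.choice(sorted(I_n), size=3, replace=False).tolist()
```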
Objective Function and Optimization
BPRH solves the following maximization over all users:
$$ \begin{aligned} \max_{\Theta} \quad & f(\Theta) - \mathcal{R}(\Theta), \end{aligned}$$ where $f(\Theta) = \sum_{u \in U} \left\lbrace \sum_{ \mathcal{I} \subseteq I^{u}_{t}, \mathcal{J} \subseteq I^{u}_{oa} } {\ln \sigma \left( \dfrac{\hat{r}_{u \mathcal{I} \mathcal{J}} (\Theta) }{\alpha_{u}} \right) } + \sum_{ \mathcal{J} \subseteq I^{u}_{oa}, \mathcal{K} \subseteq I^{u}_{n}} {\ln \sigma \left( \hat{r}_{u \mathcal{J} \mathcal{K}} (\Theta) \right) } \right\rbrace$ is the negative of the loss term, $\Theta = \{ U_{u} \in \mathbb{R}^{d \times 1}, V_{i} \in \mathbb{R}^{d \times 1}, b_i \in \mathbb{R}, u \in \mathcal{U}, i \in I \}$ is the set of model parameters, the pairwise preference of user $u$ between item-sets $\mathcal{I}$ and $\mathcal{J}$ is $\hat{r}_{u\mathcal{I}\mathcal{J}} = \hat{r}_{u\mathcal{I}} - \hat{r}_{u\mathcal{J}}$, the pairwise preference of user $u$ between item-sets $\mathcal{J}$ and $\mathcal{K}$ is $\hat{r}_{u\mathcal{J}\mathcal{K}} = \hat{r}_{u\mathcal{J}} - \hat{r}_{u\mathcal{K}}$, the estimated preferences of user $u$ on item-sets $\mathcal{I}, \mathcal{J}, \mathcal{K}$ are $\hat{r}_{u\mathcal{I}} = |\mathcal{I}|^{-1} \sum_{i \in \mathcal{I}} \hat{r}_{ui}$, $\hat{r}_{u\mathcal{J}} = |\mathcal{J}|^{-1} \sum_{j \in \mathcal{J}} \hat{r}_{uj}$, and $\hat{r}_{u\mathcal{K}} = |\mathcal{K}|^{-1} \sum_{k \in \mathcal{K}} \hat{r}_{uk}$, the estimated preference of user $u$ on item $i$ is $\hat{r}_{ui} = U_{u}^\top V_{i} + b_i$, and the regularization term is $$ \begin{aligned} \mathcal{R}(\Theta) = \sum_{u \in U} \sum_{\mathcal{I} \subseteq I^{u}_{t}, \mathcal{J} \subseteq I^{u}_{oa}, \mathcal{K} \subseteq I^{u}_{n}} \{ & \lambda_u \| U_{u} \|^2 \\ & + \lambda_v \left( \| V_{\mathcal{I}} \|^2 + \| V_{\mathcal{J}} \|^2 + \| V_{\mathcal{K}} \|^2 \right) \\ & + \lambda_b \left( b_{\mathcal{I}}^2 + b_{\mathcal{J}}^2 + b_{\mathcal{K}}^2 \right) \}, \end{aligned} $$ with $V_{\mathcal{I}} = |\mathcal{I}|^{-1} \sum_{i \in \mathcal{I}} V_i$, $V_{\mathcal{J}} = |\mathcal{J}|^{-1} \sum_{j \in \mathcal{J}} V_j$, $V_{\mathcal{K}} = |\mathcal{K}|^{-1} \sum_{k \in \mathcal{K}} V_k$, $b_{\mathcal{I}} = |\mathcal{I}|^{-1} \sum_{i \in \mathcal{I}} b_i$, $b_{\mathcal{J}} = |\mathcal{J}|^{-1} \sum_{j \in \mathcal{J}} b_j$, and $b_{\mathcal{K}} = |\mathcal{K}|^{-1} \sum_{k \in \mathcal{K}} b_k$. $\alpha_{u}$ is a parameter in $f(\Theta)$ that controls the contribution of $\hat{r}_{u \mathcal{I} \mathcal{J}} (\Theta)$ and is computed from the data.
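Continuing the sketch, the per-sample contribution $f - \mathcal{R}$ for one $(u, \mathcal{I}, \mathcal{J}, \mathcal{K})$ sample can be evaluated as follows (hyperparameter values are illustrative; in BPRH, $\alpha_u$ would be computed from the data rather than fixed):

```python
def bprh_objective(u, I, J, K, alpha_u=1.0, lam_u=0.01, lam_v=0.01, lam_b=0.01):
    """f - R for one sampled (u, I, J, K), with set-level regularization."""
    r_uIJ = set_preference(u, I) - set_preference(u, J)
    r_uJK = set_preference(u, J) - set_preference(u, K)
    f = np.log(sigmoid(r_uIJ / alpha_u)) + np.log(sigmoid(r_uJK))
    V_I, V_J, V_K = V[I].mean(0), V[J].mean(0), V[K].mean(0)
    b_I, b_J, b_K = b[I].mean(), b[J].mean(), b[K].mean()
    R = (lam_u * U[u] @ U[u]
         + lam_v * (V_I @ V_I + V_J @ V_J + V_K @ V_K)
         + lam_b * (b_I ** 2 + b_J ** 2 + b_K ** 2))
    return f - R
```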
Corresponding Gradients in SGD
Stochastic gradient descent is used in the optimization. The gradients of the negative loss term $f(\Theta)$ and the regularization term $\mathcal{R}(\Theta)$ with respect to the model parameters $\Theta$ are
$$\begin{aligned} \dfrac{\partial f}{\partial U_{u}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} ) }{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} } \dfrac{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} }{\partial U_{u}} + \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{J}\mathcal{K}})}{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} } \dfrac{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} }{\partial U_{u}} \\ & = \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) \left[ |\mathcal{I}|^{-1} \sum_{i \in \mathcal{I}} V_{i} - |\mathcal{J}|^{-1} \sum_{j \in \mathcal{J}} V_{j} \right] \\ & + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \left[ |\mathcal{J}|^{-1} \sum_{j \in \mathcal{J}} V_{j} - |\mathcal{K}|^{-1} \sum_{k \in \mathcal{K}} V_{k} \right] \\ & = \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) \left[ \overline{V}_{\mathcal{I}} - \overline{V}_{\mathcal{J}} \right] + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \left[ \overline{V}_{\mathcal{J}} - \overline{V}_{\mathcal{K}} \right], \forall u \in U \\ \dfrac{\partial \mathcal{R}}{\partial U_{u}} & = 2 \lambda_{u} U_{u}, \forall u \in U\\ \end{aligned}$$ $$\begin{aligned} \dfrac{\partial f}{\partial V_{i}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} ) }{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} } \dfrac{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} }{\partial V_{i}} \\ & = (|\mathcal{I}| \alpha_{u})^{-1} \sigma(- \alpha_u^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) U_{u}, \forall i \in \mathcal{I}\\ \dfrac{\partial \mathcal{R}}{\partial V_{i}} & = \lambda_{v} \dfrac{\partial \| V_{\mathcal{I}} \|^2 }{\partial V_{\mathcal{I}}} \dfrac{\partial V_{\mathcal{I}}}{\partial V_{i}} = 2 |\mathcal{I}|^{-1} \lambda_{v} V_{\mathcal{I}}, \forall i \in \mathcal{I} \\ \dfrac{\partial f}{\partial b_{i}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} ) }{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} } \dfrac{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} }{\partial b_{i}} \\ & = (|\mathcal{I}| \alpha_{u})^{-1} \sigma(- \alpha_u^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ), \forall i \in \mathcal{I}\\ \dfrac{\partial \mathcal{R}}{\partial b_{i}} & = \lambda_b \dfrac{\partial b_{\mathcal{I}}^2 }{\partial b_{\mathcal{I}}} \dfrac{\partial b_{\mathcal{I}}}{\partial b_{i}} = 2 \lambda_b |\mathcal{I}|^{-1} b_{\mathcal{I}}, \forall i \in \mathcal{I} \\ \end{aligned}$$ $$\begin{aligned} \dfrac{\partial f}{\partial V_{j}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} ) }{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} } \dfrac{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} }{\partial V_{j} } + \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{J}\mathcal{K}})}{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} } \dfrac{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} }{\partial V_{j}} \\ & = - ( |\mathcal{J}| \alpha_{u})^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) U_{u} + |\mathcal{J}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) U_{u} \\ & = |\mathcal{J}|^{-1} \left[ - \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \right] U_{u}, \forall j \in \mathcal{J} \\ \dfrac{\partial \mathcal{R}}{\partial V_{j}} & = \lambda_{v} \dfrac{\partial \| V_{\mathcal{J}} \|^2 }{\partial V_{\mathcal{J}}} \dfrac{\partial V_{\mathcal{J}}}{\partial V_{j}} = 2 |\mathcal{J}|^{-1} \lambda_{v} V_{\mathcal{J}}, \forall j \in \mathcal{J} \\ \dfrac{\partial f}{\partial b_{j}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} ) }{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} } \dfrac{\partial \hat{r}_{u\mathcal{I}\mathcal{J}} / \alpha_{u} }{\partial b_{j} } + \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{J}\mathcal{K}})}{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} } \dfrac{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} }{\partial b_{j}} \\ & = - ( |\mathcal{J}| \alpha_{u})^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) + |\mathcal{J}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \\ & = |\mathcal{J}|^{-1} \left[ - \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \right], \forall j \in \mathcal{J} \\ \dfrac{\partial \mathcal{R}}{\partial b_{j}} & = \lambda_b \dfrac{\partial b_{\mathcal{J}}^2 }{\partial b_{\mathcal{J}}} \dfrac{\partial b_{\mathcal{J}}}{\partial b_{j}} = 2 \lambda_b |\mathcal{J}|^{-1} b_{\mathcal{J}}, \forall j \in \mathcal{J} \\ \end{aligned}$$ $$\begin{aligned} \dfrac{\partial f}{\partial V_{k}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{J}\mathcal{K}}) }{\partial \hat{r}_{u\mathcal{J}\mathcal{K}}} \dfrac{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} }{\partial V_{k}} \\ & = - |\mathcal{K}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) U_{u}, \forall k \in \mathcal{K}\\ \dfrac{\partial \mathcal{R}}{\partial V_{k}} & = \lambda_{v} \dfrac{\partial \| V_{\mathcal{K}} \|^2 }{\partial V_{\mathcal{K}}} \dfrac{\partial V_{\mathcal{K}}}{\partial V_{k}} = 2 |\mathcal{K}|^{-1} \lambda_{v} V_{\mathcal{K}}, \forall k \in \mathcal{K}\\ \dfrac{\partial f}{\partial b_{k}} & = \dfrac{\partial \ln(\sigma)}{\partial \sigma} \dfrac{\partial \sigma(\hat{r}_{u\mathcal{J}\mathcal{K}} ) }{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} } \dfrac{\partial \hat{r}_{u\mathcal{J}\mathcal{K}} }{\partial b_{k}} \\ & = - |\mathcal{K}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ), \forall k \in \mathcal{K}\\ \dfrac{\partial \mathcal{R}}{\partial b_{k}} & = \lambda_b \dfrac{\partial b_{\mathcal{K}}^2 }{\partial b_{\mathcal{K}}} \dfrac{\partial b_{\mathcal{K}}}{\partial b_{k}} = 2 \lambda_b |\mathcal{K}|^{-1} b_{\mathcal{K}}, \forall k \in \mathcal{K}\\ \end{aligned}$$
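In code, these per-sample gradients collapse nicely: because both the set preferences and the regularizer act on set averages, every item in the same set shares one gradient vector. A sketch, continuing the setup above (names and hyperparameter values are mine):

```python
def bprh_gradients(u, I, J, K, alpha_u=1.0, lam_u=0.01, lam_v=0.01, lam_b=0.01):
    """Per-sample gradients of f - R, one shared vector per item-set."""
    r_uIJ = set_preference(u, I) - set_preference(u, J)
    r_uJK = set_preference(u, J) - set_preference(u, K)
    s_IJ = sigmoid(-r_uIJ / alpha_u) / alpha_u  # sigma(-r_uIJ/alpha_u)/alpha_u
    s_JK = sigmoid(-r_uJK)                      # sigma(-r_uJK)
    V_I, V_J, V_K = V[I].mean(0), V[J].mean(0), V[K].mean(0)
    b_I, b_J, b_K = b[I].mean(), b[J].mean(), b[K].mean()
    return {
        "U":   s_IJ * (V_I - V_J) + s_JK * (V_J - V_K) - 2 * lam_u * U[u],
        "V_i": s_IJ * U[u] / len(I) - 2 * lam_v * V_I / len(I),
        "V_j": (-s_IJ + s_JK) * U[u] / len(J) - 2 * lam_v * V_J / len(J),
        "V_k": -s_JK * U[u] / len(K) - 2 * lam_v * V_K / len(K),
        "b_i": s_IJ / len(I) - 2 * lam_b * b_I / len(I),
        "b_j": (-s_IJ + s_JK) / len(J) - 2 * lam_b * b_J / len(J),
        "b_k": -s_JK / len(K) - 2 * lam_b * b_K / len(K),
    }
```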
Updating Rules
Combining the gradients of the negative loss term and the regularization term, the final gradients for each variable $U_{u}, V_{i}, V_{j}, V_{k}, b_{i}, b_{j}, b_{k}$, with $u \in U, i \in \mathcal{I}, j \in \mathcal{J}, k \in \mathcal{K}$, are
$$ \begin{aligned} \nabla U_{u} & = \dfrac{\partial f}{\partial U_{u}} - \dfrac{\partial \mathcal{R}}{\partial U_{u}} \\ & = \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) \left[ \overline{V}_{\mathcal{I}} - \overline{V}_{\mathcal{J}} \right] + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \left[ \overline{V}_{\mathcal{J}} - \overline{V}_{\mathcal{K}} \right] - 2 \lambda_{u} U_{u}, \forall u \in U \\ \nabla V_{i} & = \dfrac{\partial f}{\partial V_{i}} - \dfrac{\partial \mathcal{R}}{\partial V_{i}} \\ & = (|\mathcal{I}| \alpha_{u})^{-1} \sigma(- \alpha_u^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) U_{u} - 2 |\mathcal{I}|^{-1} \lambda_{v} V_{\mathcal{I}}, \forall i \in \mathcal{I}\\ \nabla V_{j} & = \dfrac{\partial f}{\partial V_{j}} - \dfrac{\partial \mathcal{R}}{\partial V_{j}} \\ & = |\mathcal{J}|^{-1} \left[ - \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \right] U_{u} - 2 |\mathcal{J}|^{-1} \lambda_{v} V_{\mathcal{J}}, \forall j \in \mathcal{J}\\ \nabla V_{k} & = \dfrac{\partial f}{\partial V_{k}} - \dfrac{\partial \mathcal{R}}{\partial V_{k}} \\ & = - |\mathcal{K}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) U_{u} - 2 |\mathcal{K}|^{-1} \lambda_{v} V_{\mathcal{K}}, \forall k \in \mathcal{K}\\ \nabla b_{i} & = \dfrac{\partial f}{\partial b_{i}} - \dfrac{\partial \mathcal{R}}{\partial b_{i}} \\ & = (|\mathcal{I}| \alpha_{u})^{-1} \sigma(- \alpha_u^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) - 2 \lambda_b |\mathcal{I}|^{-1} b_{\mathcal{I}}, \forall i \in \mathcal{I}\\ \nabla b_{j} & = \dfrac{\partial f}{\partial b_{j}} - \dfrac{\partial \mathcal{R}}{\partial b_{j}} \\ & = |\mathcal{J}|^{-1} \left[ - \alpha_{u}^{-1} \sigma(- \alpha_{u}^{-1} \hat{r}_{u\mathcal{I}\mathcal{J}} ) + \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) \right] - 2 \lambda_b |\mathcal{J}|^{-1} b_{\mathcal{J}}, \forall j \in \mathcal{J}\\ \nabla b_{k} & = \dfrac{\partial f}{\partial b_{k}} - \dfrac{\partial \mathcal{R}}{\partial b_{k}} \\ & = - |\mathcal{K}|^{-1} \sigma(- \hat{r}_{u\mathcal{J}\mathcal{K}} ) - 2 \lambda_b |\mathcal{K}|^{-1} b_{\mathcal{K}}, \forall k \in \mathcal{K}\\ \end{aligned} $$ And the updating rules for each variable (gradient ascent, since the objective is maximized) are $$ \begin{aligned} U_{u}^{t+1} & = U_{u}^{t} + \gamma \nabla U_{u}^{t}, u \in U \\ V_{i}^{t+1} & = V_{i}^{t} + \gamma \nabla V_{i}^{t}, i \in \mathcal{I} \\ V_{j}^{t+1} & = V_{j}^{t} + \gamma \nabla V_{j}^{t}, j \in \mathcal{J} \\ V_{k}^{t+1} & = V_{k}^{t} + \gamma \nabla V_{k}^{t}, k \in \mathcal{K} \\ b_{i}^{t+1} & = b_{i}^{t} + \gamma \nabla b_{i}^{t}, i \in \mathcal{I} \\ b_{j}^{t+1} & = b_{j}^{t} + \gamma \nabla b_{j}^{t}, j \in \mathcal{J} \\ b_{k}^{t+1} & = b_{k}^{t} + \gamma \nabla b_{k}^{t}, k \in \mathcal{K} \end{aligned} $$
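A sketch of one ascent step, continuing the setup above (note the plus sign: BPRH maximizes its objective, unlike COFISET's descent):

```python
def bprh_sgd_step(u, I, J, K, gamma=0.05, **hyper):
    """One stochastic gradient ASCENT step for a sampled (u, I, J, K)."""
    g = bprh_gradients(u, I, J, K, **hyper)
    U[u] += gamma * g["U"]
    for i in I:
        V[i] += gamma * g["V_i"]
        b[i] += gamma * g["b_i"]
    for j in J:
        V[j] += gamma * g["V_j"]
        b[j] += gamma * g["b_j"]
    for k in K:
        V[k] += gamma * g["V_k"]
        b[k] += gamma * g["b_k"]
```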
Online Updating for New Users
Rendle and Schmidt-Thieme (2008) proposed an online updating scheme for their regularized kernel matrix factorization model trained with stochastic gradient descent. Since BPRH also uses SGD to maximize its objective, we can follow their method to update the user and item matrices when a new user or item arrives (the cold-start problem). You can see the implementation code at bprH_gpu.py - Line 656.
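The idea, sketched below under the toy setup (a simplification of the fold-in strategy, not the exact code from bprH_gpu.py): when a new user arrives with some feedback, keep the trained item parameters fixed and run SGD on that user's factor vector alone.

```python
def fold_in_new_user(I, J, K, n_steps=200, gamma=0.05, alpha_u=1.0, lam_u=0.01):
    """Estimate a factor vector for a brand-new user from their feedback,
    keeping the trained item factors V and biases b fixed."""
    u_vec = rng.normal(scale=0.1, size=d)  # fresh user factors
    V_I, V_J, V_K = V[I].mean(0), V[J].mean(0), V[K].mean(0)
    b_I, b_J, b_K = b[I].mean(), b[J].mean(), b[K].mean()
    for _ in range(n_steps):
        r_uIJ = u_vec @ (V_I - V_J) + (b_I - b_J)
        r_uJK = u_vec @ (V_J - V_K) + (b_J - b_K)
        grad = (sigmoid(-r_uIJ / alpha_u) / alpha_u * (V_I - V_J)
                + sigmoid(-r_uJK) * (V_J - V_K)
                - 2 * lam_u * u_vec)
        u_vec += gamma * grad  # ascent on the same BPRH objective
    return u_vec
```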
Python Implementation
You can star my BPRH repository and see the details of the implementation at
References
- Rendle, Steffen, et al. “BPR: Bayesian personalized ranking from implicit feedback.” arXiv preprint arXiv:1205.2618 (2012).
- Pan, Weike, and Li Chen. “CoFiSet: Collaborative filtering via learning pairwise preferences over item-sets.” Proceedings of the 2013 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2013.
- Qiu, Huihuai, et al. “BPRH: Bayesian personalized ranking for heterogeneous implicit feedback.” Information Sciences 453 (2018): 80-98.
- Rendle, Steffen, and Lars Schmidt-Thieme. “Online-updating regularized kernel matrix factorization models for large-scale recommender systems.” Proceedings of the 2008 ACM Conference on Recommender Systems. 2008.