Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation

Collaborative filtering often suffers from sparsity and cold start problems in real recommendation scenarios. Therefore, researchers and engineers usually use side information to address these issues and improve the performance of recommender systems. In this paper, we consider knowledge graphs as the source of side information. We propose MKR, a Multi-task feature learning approach for Knowledge graph enhanced Recommendation. MKR is a deep end-to-end framework that uses a knowledge graph embedding task to assist the recommendation task. The two tasks are associated by cross&compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in the knowledge graph. We prove that cross&compress units have sufficient capability of polynomial approximation, and show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning. Through extensive experiments on real-world datasets, we demonstrate that MKR achieves substantial gains in movie, book, music, and news recommendation over state-of-the-art baselines. MKR is also shown to maintain decent performance even when user-item interactions are sparse.


INTRODUCTION
Recommender systems (RS) aim to address the information explosion and meet users' personalized interests. One of the most popular recommendation techniques is collaborative filtering (CF) [11], which utilizes users' historical interactions and makes recommendations based on their common preferences. However, CF-based methods usually suffer from the sparsity of user-item interactions and the cold start problem. Therefore, researchers propose using side information in recommender systems, including social networks [10], attributes [30], and multimedia (e.g., texts [29], images [40]). Knowledge graphs (KGs) are one type of side information for RS, which usually contain fruitful facts and connections about items. Recently, researchers have proposed several academic and commercial KGs, such as NELL, DBpedia, Google Knowledge Graph, and Microsoft Satori. Due to its high dimensionality and heterogeneity, a KG is usually pre-processed by knowledge graph embedding (KGE) methods [27], which embed entities and relations into low-dimensional vector spaces while preserving the inherent structure of the KG.

Existing KG-aware methods
Inspired by the success of applying KGs in a wide variety of tasks, researchers have recently tried to utilize KGs to improve the performance of recommender systems [31,32,39,40,45]. Personalized Entity Recommendation (PER) [39] and Factorization Machine with Group lasso (FMG) [45] treat the KG as a heterogeneous information network, and extract meta-path/meta-graph based latent features to represent the connectivity between users and items along different types of relation paths/graphs. However, PER and FMG rely heavily on manually designed meta-paths/meta-graphs, which limits their application in generic recommendation scenarios. Deep Knowledge-aware Network (DKN) [32] designs a CNN framework to combine entity embeddings with word embeddings for news recommendation. However, the entity embeddings must be obtained before DKN is used, so DKN cannot be trained end to end. Another concern about DKN is that it can hardly incorporate side information other than texts. RippleNet [31] is a memory-network-like model that propagates users' potential preferences in the KG and explores their hierarchical interests. But the importance of relations is weakly characterized in RippleNet, because the embedding matrix of a relation R can hardly be trained to capture the sense of importance in the quadratic form $v^\top R h$ (v and h are embedding vectors of two entities). Collaborative Knowledge base Embedding (CKE) [40] combines CF with structural knowledge, textual knowledge, and visual knowledge in a unified framework. However, the KGE module in CKE (i.e., TransR [13]) is more suitable for in-graph applications (such as KG completion and link prediction) than for recommendation. In addition, the CF module and the KGE module are loosely coupled in CKE under a Bayesian framework, making the supervision from the KG less obvious for recommender systems.

The proposed approach
To address the limitations of previous work, we propose MKR, a multi-task learning (MTL) approach for knowledge graph enhanced recommendation. MKR is a generic, end-to-end deep recommendation framework, which aims to utilize the KGE task to assist the recommendation task. Note that the two tasks are not mutually independent, but are highly correlated since an item in RS may be associated with one or more entities in the KG. Therefore, an item and its corresponding entity are likely to have a similar proximity structure in RS and KG, and share similar features in low-level and non-task-specific latent feature spaces [15]. We will further validate the similarity in the experiments section. To model the shared features between items and entities, we design a cross&compress unit in MKR. The cross&compress unit explicitly models high-order interactions between item and entity features, and automatically controls the cross knowledge transfer for both tasks. Through cross&compress units, representations of items and entities can complement each other, helping both tasks avoid fitting noise and improving generalization. The whole framework can be trained by alternately optimizing the two tasks with different frequencies, which endows MKR with high flexibility and adaptability in real recommendation scenarios.
We probe the expressive capability of MKR and show, through theoretical analysis, that the cross&compress unit is capable of approximating sufficiently high-order feature interactions between items and entities. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning, including factorization machines [22,23], deep&cross network [34], and cross-stitch network [18]. Empirically, we evaluate our method in four recommendation scenarios, i.e., movie, book, music, and news recommendations. The results demonstrate that MKR achieves substantial gains over state-of-the-art baselines in both click-through rate (CTR) prediction (e.g., 11.6% AUC improvements on average for movies) and top-K recommendation (e.g., 66.4% Recall@10 improvements on average for books). MKR is also shown to maintain satisfactory performance even when user-item interactions are sparse.

Contribution
It is worth noting that the problem studied in this paper can also be modeled as cross-domain recommendation [26] or transfer learning [21], since we care more about the performance of the recommendation task. However, the key observation is that although cross-domain recommendation and transfer learning have a single objective for the target domain, their loss functions still contain constraint terms for measuring the data distribution in the source domain or the similarity between two domains. In our proposed MKR, the KGE task serves explicitly as the constraint term that provides regularization for recommender systems. We would like to emphasize that the major contribution of this paper is exactly modeling the problem as multi-task learning: we go a step further than cross-domain recommendation and transfer learning by finding that the inter-task similarity is helpful not only to recommender systems but also to knowledge graph embedding, as shown in the theoretical analysis and experiment results.

OUR APPROACH
In this section, we first formulate the knowledge graph enhanced recommendation problem, then introduce the framework of MKR and present the design of the cross&compress unit, recommendation module and KGE module in detail. We lastly discuss the learning algorithm for MKR.

Problem Formulation
We formulate the knowledge graph enhanced recommendation problem in this paper as follows. In a typical recommendation scenario, we have a set of M users $U = \{u_1, u_2, \ldots, u_M\}$ and a set of N items $V = \{v_1, v_2, \ldots, v_N\}$. The user-item interaction matrix $Y \in \mathbb{R}^{M \times N}$ is defined according to users' implicit feedback, where $y_{uv} = 1$ indicates that user u engaged with item v, such as behaviors of clicking, watching, browsing, or purchasing; otherwise $y_{uv} = 0$. We also have access to a knowledge graph G, which is comprised of entity-relation-entity triples (h, r, t). Here h, r, and t denote the head, relation, and tail of a knowledge triple, respectively. For example, the triple (Quentin Tarantino, film.director.film, Pulp Fiction) states the fact that Quentin Tarantino directs the film Pulp Fiction. In many recommendation scenarios, an item v ∈ V may be associated with one or more entities in G. For example, in movie recommendation, the item "Pulp Fiction" is linked with its namesake in a knowledge graph, while in news recommendation, news with the title "Trump pledges aid to Silicon Valley during tech meeting" is linked with entities "Donald Trump" and "Silicon Valley" in a knowledge graph.
Given the user-item interaction matrix Y as well as the knowledge graph G, we aim to predict whether user u has potential interest in item v with which he has had no interaction before. Our goal is to learn a prediction function $\hat{y}_{uv} = F(u, v \mid \Theta, Y, G)$, where $\hat{y}_{uv}$ denotes the probability that user u will engage with item v, and $\Theta$ denotes the model parameters of function F.

Framework
The framework of MKR is illustrated in Figure 1a. MKR consists of three main components: recommendation module, KGE module, and cross&compress units. (1) The recommendation module on the left takes a user and an item as input, and uses a multi-layer perceptron (MLP) and cross&compress units to extract short and dense features for the user and the item, respectively. The extracted features are then fed into another MLP together to output the predicted probability. (2) Similar to the left part, the KGE module in the right part also uses multiple layers to extract features from the head and relation of a knowledge triple, and outputs the representation of the predicted tail under the supervision of a score function f and the real tail.

Cross&compress Unit
To model feature interactions between items and entities, we design a cross&compress unit in the MKR framework. As shown in Figure 1b, for item v and one of its associated entities e, we first construct d × d pairwise interactions of their latent features $v_l \in \mathbb{R}^d$ and $e_l \in \mathbb{R}^d$ from layer l:

$$C_l = v_l e_l^\top = \begin{bmatrix} v_l^{(1)} e_l^{(1)} & \cdots & v_l^{(1)} e_l^{(d)} \\ \vdots & \ddots & \vdots \\ v_l^{(d)} e_l^{(1)} & \cdots & v_l^{(d)} e_l^{(d)} \end{bmatrix}, \tag{1}$$

where $C_l \in \mathbb{R}^{d \times d}$ is the cross feature matrix of layer l, and d is the dimension of hidden layers. This is called the cross operation, since each possible feature interaction $v_l^{(i)} e_l^{(j)}$, $\forall (i, j) \in \{1, \ldots, d\}^2$, between item v and its associated entity e is modeled explicitly in the cross feature matrix. We then output the feature vectors of items and entities for the next layer by projecting the cross feature matrix into their latent representation spaces:

$$v_{l+1} = C_l w_l^{VV} + C_l^\top w_l^{EV} + b_l^{V} = v_l e_l^\top w_l^{VV} + e_l v_l^\top w_l^{EV} + b_l^{V},$$
$$e_{l+1} = C_l w_l^{VE} + C_l^\top w_l^{EE} + b_l^{E} = v_l e_l^\top w_l^{VE} + e_l v_l^\top w_l^{EE} + b_l^{E}, \tag{2}$$

where $w_l^{\cdot\cdot} \in \mathbb{R}^d$ and $b_l^{\cdot} \in \mathbb{R}^d$ are trainable weight and bias vectors. This is called the compress operation, since the weight vectors project the cross feature matrix from the $\mathbb{R}^{d \times d}$ space back to the feature spaces $\mathbb{R}^d$. Note that in Eq. (2), the cross feature matrix is compressed along both horizontal and vertical directions (by operating on $C_l$ and $C_l^\top$) for the sake of symmetry; we provide more insight into this design in Section 3.2. For simplicity, the cross&compress unit is denoted as:

$$[v_{l+1}, e_{l+1}] = \mathcal{C}(v_l, e_l), \tag{3}$$

and we use a suffix [v] or [e] to distinguish its two outputs in the remainder of this paper. Through cross&compress units, MKR can adaptively adjust the weights of knowledge transfer and learn the relevance between the two tasks.
It should be noted that cross&compress units should only exist in low-level layers of MKR, as shown in Figure 1a. This is because: (1) In deep architectures, features usually transform from general to specific along the network, and feature transferability drops significantly in higher layers with increasing task dissimilarity [38]. Therefore, sharing high-level layers risks negative transfer, especially for the heterogeneous tasks in MKR. (2) In high-level layers of MKR, item features are mixed with user features, and entity features are mixed with relation features. The mixed features are not suitable for sharing since they have no explicit association.
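The cross and compress operations described above can be illustrated with a small sketch. It assumes the cross operation is the outer product of the item and entity feature vectors and that the compress step multiplies the cross matrix and its transpose by weight vectors, as the text describes. The function name `cross_compress`, the argument names, and the use of plain Python lists are illustrative choices, not the authors' implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_compress(v, e, w_vv, w_ev, w_ve, w_ee, b_v, b_e):
    """One cross&compress layer on d-dimensional item/entity features."""
    d = len(v)
    # Cross: C[i][j] = v[i] * e[j], every pairwise item-entity interaction.
    C = [[v[i] * e[j] for j in range(d)] for i in range(d)]
    # Transposed cross matrix, used to compress along the other direction.
    Ct = [[C[j][i] for j in range(d)] for i in range(d)]
    # Compress: project the d x d matrix back to R^d with weight vectors.
    v_next = [dot(C[i], w_vv) + dot(Ct[i], w_ev) + b_v[i] for i in range(d)]
    e_next = [dot(C[i], w_ve) + dot(Ct[i], w_ee) + b_e[i] for i in range(d)]
    return v_next, e_next

# Toy example with d = 2 and zero biases (all values are made up).
v_next, e_next = cross_compress(
    v=[1, 2], e=[3, 4],
    w_vv=[1, 0], w_ev=[0, 1], w_ve=[0, 0], w_ee=[1, 1],
    b_v=[0, 0], b_e=[0, 0],
)
print(v_next, e_next)  # [9, 14] [9, 12]
```

Although the cross matrix has d × d entries, each output has only d components, so the unit adds just four weight vectors and two bias vectors per layer.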

Recommendation Module
The input of the recommendation module in MKR consists of two raw feature vectors u and v that describe user u and item v, respectively. u and v can be customized as one-hot ID [8], attributes [30], bag-of-words [29], or their combinations, based on the application scenario. Given user u's raw feature vector u, we use an L-layer MLP to extract the user's latent condensed feature:

$$u_L = \mathcal{M}(\mathcal{M}(\cdots \mathcal{M}(u))) = \mathcal{M}^L(u), \tag{4}$$

where $\mathcal{M}(x) = \sigma(Wx + b)$ is a fully-connected neural network layer with weight W, bias b, and nonlinear activation function $\sigma(\cdot)$. For item v, we use L cross&compress units to extract its feature:

$$v_L = \mathbb{E}_{e \sim S(v)} \left[ \mathcal{C}^L(v, e)[v] \right], \tag{5}$$

where S(v) is the set of associated entities of item v.
After obtaining user u's latent feature $u_L$ and item v's latent feature $v_L$, we combine the two pathways by a predicting function $f_{RS}$, for example, an inner product or an H-layer MLP. The final predicted probability of user u engaging with item v is:

$$\hat{y}_{uv} = \sigma(f_{RS}(u_L, v_L)). \tag{6}$$
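A minimal forward-pass sketch of the recommendation module follows. It assumes a tanh activation in the MLP layers, a single sampled entity e in place of the expectation over S(v), an inner product for f_RS, and a sigmoid to turn the result into a probability. The parameter containers (`user_layers`, `cc_layers`) and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, W, b, act=math.tanh):
    """Fully-connected layer act(W x + b); W is a list of rows."""
    return [act(dot(row, x) + bi) for row, bi in zip(W, b)]

def cross_compress(v, e, p):
    """One cross&compress layer in vector form; p holds weights and biases."""
    e_vv, e_ve = dot(e, p["w_vv"]), dot(e, p["w_ve"])
    v_ev, v_ee = dot(v, p["w_ev"]), dot(v, p["w_ee"])
    v_next = [vi * e_vv + ei * v_ev + bi for vi, ei, bi in zip(v, e, p["b_v"])]
    e_next = [vi * e_ve + ei * v_ee + bi for vi, ei, bi in zip(v, e, p["b_e"])]
    return v_next, e_next

def predict(u, v, e, user_layers, cc_layers):
    """Predicted engagement probability for one (user, item, entity) triple."""
    for W, b in user_layers:          # user pathway: stacked MLP layers
        u = dense(u, W, b)
    for p in cc_layers:               # item pathway: cross&compress units
        v, e = cross_compress(v, e, p)
    logit = dot(u, v)                 # f_RS assumed to be an inner product
    return 1.0 / (1.0 + math.exp(-logit))

# Toy usage with d = 2 and one layer on each pathway (values are made up).
layer = {"w_vv": [0.5, 0.5], "w_ev": [0.1, 0.1], "w_ve": [0.2, 0.0],
         "w_ee": [0.0, 0.3], "b_v": [0.0, 0.0], "b_e": [0.0, 0.0]}
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
p_click = predict([1.0, -1.0], [0.5, 0.2], [0.3, 0.4], [identity], [layer])
```

With several entities per item, the expectation over S(v) could be approximated by averaging `predict`-style item features over sampled entities; the sketch uses one sample for brevity.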

Knowledge Graph Embedding Module
Knowledge graph embedding aims to embed entities and relations into continuous vector spaces while preserving their structure. Recently, researchers have proposed a great many KGE methods, including translational distance models [2,13] and semantic matching models [14,19]. In MKR, we propose a deep semantic matching architecture for the KGE module. Similar to the recommendation module, for a given knowledge triple (h, r, t), we first utilize multiple cross&compress units and nonlinear layers to process the raw feature vectors of head h and relation r (including ID [13], types [36], textual description [35], etc.), respectively. Their latent features are then concatenated together, followed by a K-layer MLP for predicting tail t:

$$h_L = \mathbb{E}_{v \sim S(h)} \left[ \mathcal{C}^L(v, h)[e] \right], \quad r_L = \mathcal{M}^L(r), \quad \hat{t} = \mathcal{M}^K\!\left( \begin{bmatrix} h_L \\ r_L \end{bmatrix} \right), \tag{7}$$

where S(h) is the set of associated items of entity h, and $\hat{t}$ is the predicted vector of tail t. Finally, the score of the triple (h, r, t) is calculated using a score (similarity) function $f_{KG}$:

$$\mathrm{score}(h, r, t) = f_{KG}(t, \hat{t}), \tag{8}$$

where t is the real feature vector of t. In this paper, we use the normalized inner product $f_{KG}(t, \hat{t}) = \sigma(t^\top \hat{t})$ as the choice of score function [18], but other forms of (dis)similarity metrics can also be applied here, such as the Kullback-Leibler divergence.
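The tail prediction and scoring steps can be sketched as follows. The sketch assumes the head and relation latent features have already been computed, uses tanh as the activation of the K-layer MLP, and takes the sigmoid of the inner product as the score, in line with the normalized inner product mentioned above. The function names and toy weights are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, W, b, act=math.tanh):
    """Fully-connected layer act(W x + b); W is a list of rows."""
    return [act(dot(row, x) + bi) for row, bi in zip(W, b)]

def predict_tail(h_latent, r_latent, layers):
    """Concatenate head and relation features, then apply the MLP layers."""
    x = list(h_latent) + list(r_latent)
    for W, b in layers:
        x = dense(x, W, b)
    return x

def score(t_real, t_pred):
    """Normalized inner product sigma(t^T t_hat) between real and predicted tail."""
    return 1.0 / (1.0 + math.exp(-dot(t_real, t_pred)))

# Toy usage: d = 2, one MLP layer mapping the 4-dim concatenation to 2 dims.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
t_hat = predict_tail([0.0, 5.0], [7.0, 0.0], [(W, [0.0, 0.0])])
s = score([1.0, 1.0], t_hat)
```

True triples should receive scores close to 1 and corrupted triples scores close to 0 after training; the untrained toy weights here only exercise the shapes.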

Learning Algorithm
The complete loss function of MKR is as follows:

$$\mathcal{L} = \mathcal{L}_{RS} + \mathcal{L}_{KG} + \mathcal{L}_{REG} = \sum_{u \in U, v \in V} \mathcal{J}(\hat{y}_{uv}, y_{uv}) - \lambda_1 \Big( \sum_{(h,r,t) \in G} \mathrm{score}(h, r, t) - \sum_{(h',r,t') \notin G} \mathrm{score}(h', r, t') \Big) + \lambda_2 \lVert W \rVert_2^2. \tag{9}$$

In Eq. (9), the first term measures the loss in the recommendation module, where u and v traverse the set of users and the set of items, respectively, and $\mathcal{J}$ is the cross-entropy function. The second term calculates the loss in the KGE module, in which we aim to increase the score for all true triples while reducing the score for all false triples. The last term is the regularization term for preventing overfitting, and $\lambda_1$ and $\lambda_2$ are the balancing parameters ($\lambda_1$ can be seen as the ratio of the learning rates of the two tasks). Note that the loss function in Eq. (9) traverses all possible user-item pairs and knowledge triples. To make computation more efficient, following [17], we use a negative sampling strategy during training. The learning algorithm of MKR is presented in Algorithm 1, in which a training epoch consists of two stages: the recommendation task (lines 3-7) and the KGE task (lines 8-10). In each iteration, we repeat training on the recommendation task t times (t is a hyper-parameter and normally t > 1) before training on the KGE task once, since we are more focused on improving recommendation performance. We will discuss the choice of t in the experiments section.

Algorithm 1 Multi-Task Training for MKR
Require: Interaction matrix Y, knowledge graph G
Ensure: Prediction function F(u, v | Θ, Y, G)
 1: Initialize all parameters
 2: for number of training iterations do
      // recommendation task
 3:   for t steps do
 4:     Sample minibatch of positive and negative interactions from Y;
 5:     Sample e ∼ S(v) for each item v in the minibatch;
 6:     Update parameters of F by gradient descent on Eq. (1)-(6), (9);
 7:   end for
      // knowledge graph embedding task
 8:   Sample minibatch of true and false triples from G;
 9:   Sample v ∼ S(h) for each head h in the minibatch;
10:   Update parameters of F by gradient descent on Eq. (1)-(3), (7)-(9);
11: end for
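The alternating schedule of Algorithm 1 can be sketched in a few lines. The sketch only shows the order in which the two tasks are visited; the callables `rs_step` and `kge_step` are hypothetical placeholders for the minibatch sampling and gradient updates of each task.

```python
def training_schedule(num_iterations, t):
    """Yield the task trained at each step: t RS steps, then one KGE step."""
    for _ in range(num_iterations):
        for _ in range(t):
            yield "rs"
        yield "kge"

def train(num_iterations, t, rs_step, kge_step):
    """Run the alternating schedule with user-supplied update functions."""
    for task in training_schedule(num_iterations, t):
        if task == "rs":
            rs_step()     # sample interactions and entities, update on RS loss
        else:
            kge_step()    # sample triples and items, update on KGE loss

# Toy usage: count how often each task is trained.
counts = {"rs": 0, "kge": 0}
train(
    num_iterations=4, t=3,
    rs_step=lambda: counts.update(rs=counts["rs"] + 1),
    kge_step=lambda: counts.update(kge=counts["kge"] + 1),
)
print(counts)  # {'rs': 12, 'kge': 4}
```

Raising t shifts more of the update budget to the recommendation task, which matches the stated focus on recommendation performance.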

THEORETICAL ANALYSIS
In this section, we prove that cross&compress units have sufficient capability of polynomial approximation. We also show that MKR is a generalized framework over several representative methods of recommender systems and multi-task learning.

Polynomial Approximation
According to the Weierstrass approximation theorem [25], any function under certain smoothness assumptions can be approximated by a polynomial to an arbitrary accuracy. Therefore, we examine the ability of the cross&compress unit to approximate high-order interactions. We show that cross&compress units can model the order of item-entity feature interaction up to exponential degree:

Theorem 1. Denote the input of item and entity in the MKR network as $v = [v_1 \cdots v_d]^\top$ and $e = [e_1 \cdots e_d]^\top$, respectively. Then the cross terms about v and e in $\lVert v_L \rVert_1$ and $\lVert e_L \rVert_1$ (the L1-norms of $v_L$ and $e_L$) with maximal degree are $k_{\alpha,\beta}\, v_1^{\alpha_1} \cdots v_d^{\alpha_d} e_1^{\beta_1} \cdots e_d^{\beta_d}$, where $k_{\alpha,\beta} \in \mathbb{R}$, $\alpha = [\alpha_1 \cdots \alpha_d]$, $\beta = [\beta_1 \cdots \beta_d]$, $\alpha_i, \beta_i \in \mathbb{N}$ for $i \in \{1, \ldots, d\}$, $\alpha_1 + \cdots + \alpha_d = 2^{L-1}$, and $\beta_1 + \cdots + \beta_d = 2^{L-1}$ ($L \geq 1$, $v_0 = v$, $e_0 = e$).

In recommender systems, $\prod_{i=1}^{d} v_i^{\alpha_i} e_i^{\beta_i}$ is also called a combinatorial feature, as it measures the interactions of multiple original features. Theorem 1 states that cross&compress units can automatically model the combinatorial features of items and entities for sufficiently high order, which demonstrates the superior approximation capacity of MKR as compared with existing work such as Wide&Deep [3], factorization machines [22,23], and DCN [34]. The proof of Theorem 1 is provided in the Appendix. Note that Theorem 1 gives a theoretical view of the polynomial approximation ability of the cross&compress unit rather than providing guarantees on its actual performance. We will empirically evaluate the cross&compress unit in the experiments section.
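The exponential growth of the interaction degree can be checked on a toy case. The sketch below assumes scalar features (d = 1), where the cross operation reduces to the product v_l e_l and the compress step to an affine map, and tracks v_L as a polynomial in the raw inputs v and e. Weights and biases are set to 1 so that no terms cancel. It is an illustration consistent with a maximal degree of 2^(L-1) in each of v and e, not a substitute for the proof.

```python
def poly_mul(p, q):
    """Multiply polynomials stored as {(deg_v, deg_e): coefficient}."""
    out = {}
    for (a, b), x in p.items():
        for (c, d), y in q.items():
            key = (a + c, b + d)
            out[key] = out.get(key, 0) + x * y
    return out

def poly_affine(p, scale, bias):
    """Return scale * p + bias as a polynomial."""
    out = {k: scale * c for k, c in p.items()}
    out[(0, 0)] = out.get((0, 0), 0) + bias
    return out

def max_cross_degree(num_layers, weight=1, bias=1):
    """Degrees (in v, in e) of the highest-order term of v_L when d = 1."""
    v, e = {(1, 0): 1}, {(0, 1): 1}      # layer-0 features: raw v and e
    for _ in range(num_layers):
        cross = poly_mul(v, e)           # cross: C_l = v_l * e_l (a 1 x 1 matrix)
        v = poly_affine(cross, weight, bias)   # compress for the item side
        e = poly_affine(cross, weight, bias)   # compress for the entity side
    return max(v)                        # all monomials here are (k, k)

for L in range(1, 5):
    print(L, max_cross_degree(L))  # (1, 1), (2, 2), (4, 4), (8, 8)
```

Each additional layer doubles the degree because the new feature is a product of two polynomials that both already have the previous maximal degree.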

Unified View of Representative Methods
In the following we provide a unified view of several representative models in recommender systems and multi-task learning, by showing that they are restricted versions of or theoretically related to MKR. This justifies the design of cross&compress unit and conceptually explains its strong empirical performance as compared to baselines.
3.2.1 Factorization machines. Factorization machines [22,23] are a generic method for recommender systems. Given an input feature vector, FMs model all interactions between variables in the input vector using factorized parameters, thus being able to estimate interactions in problems with huge sparsity such as recommender systems. The model equation for a 2-degree factorization machine is defined as

$$\hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle x_i x_j, \tag{10}$$

where $x_i$ is the i-th unit of input vector x, $w_{\cdot}$ is a weight scalar, $v_{\cdot}$ is a weight vector, and $\langle \cdot, \cdot \rangle$ is the dot product of two vectors. We show that the essence of FM is conceptually similar to a 1-layer cross&compress unit:

Proposition 1. The L1-norm of $v_1$ and $e_1$ can be written in the following form:

$$\lVert v_1 \rVert_1 \ (\text{or } \lVert e_1 \rVert_1) = \Big| b + \sum_{i=1}^{d} \sum_{j=1}^{d} \langle w_i, w_j \rangle v_i e_j \Big|, \tag{11}$$

where $\langle w_i, w_j \rangle = w_i + w_j$ is the sum of two scalars.
It is interesting to notice that, instead of factorizing the weight parameter of x i x j into the dot product of two vectors as in FM, the weight of term v i e j is factorized into the sum of two scalars in cross&compress unit to reduce the number of parameters and increase robustness of the model.
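The sum-of-two-scalars factorization can be checked numerically for the item output of a single cross&compress layer. The sketch assumes the layer has the form v_1 = v (e^T w_vv) + e (v^T w_ev) + b_v, and compares the sum of the components of v_1 with the double sum over v_i e_j whose coefficient is w_ev[i] + w_vv[j]. The sum of components coincides with the L1-norm only when all components share a sign, which holds in the toy example. All names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_layer_item(v, e, w_vv, w_ev, b_v):
    """Item output of one layer: v (e^T w_vv) + e (v^T w_ev) + b_v."""
    s, t = dot(e, w_vv), dot(v, w_ev)
    return [vi * s + ei * t + bi for vi, ei, bi in zip(v, e, b_v)]

def factorized_sum(v, e, w_vv, w_ev, b_v):
    """b + sum_ij (w_ev[i] + w_vv[j]) * v[i] * e[j], with b = sum(b_v)."""
    d = len(v)
    pairwise = sum((w_ev[i] + w_vv[j]) * v[i] * e[j]
                   for i in range(d) for j in range(d))
    return sum(b_v) + pairwise

v, e = [1, 2], [3, 4]
w_vv, w_ev, b_v = [1, 0], [0, 1], [1, 1]
v1 = first_layer_item(v, e, w_vv, w_ev, b_v)
print(v1, sum(abs(x) for x in v1), factorized_sum(v, e, w_vv, w_ev, b_v))
# [10, 15] 25 25
```

Each pair (i, j) needs only two scalars rather than two k-dimensional vectors as in FM, which is the parameter saving the text points out.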

Deep&Cross Network.
DCN [34] learns explicit and high-order cross features by introducing the layers:

$$x_{l+1} = x_0 x_l^\top w_l + x_l + b_l, \tag{12}$$

where $x_l$, $w_l$, and $b_l$ are the representation, weight, and bias of the l-th layer. We demonstrate the link between DCN and MKR by the following proposition:

Proposition 2. In the formula of $v_{l+1}$ in Eq. (2), if we restrict $w_l^{VV}$ in the first term to satisfy $e_l^\top w_l^{VV} = 1$ and restrict $e_l$ in the second term to be $e_0$ (and impose similar restrictions on $e_{l+1}$), the cross&compress unit is then conceptually equivalent to the DCN layer in the sense of multi-task learning:

$$v_{l+1} = e_0 v_l^\top w_l^{EV} + v_l + b_l^{V},$$
$$e_{l+1} = v_0 e_l^\top w_l^{VE} + e_l + b_l^{E}. \tag{13}$$

It can be proven that the polynomial approximation ability of the above DCN-equivalent version (i.e., the maximal degree of cross terms in $v_l$ and $e_l$) is $O(l)$, which is weaker than that of the original cross&compress units with $O(2^l)$ approximation ability.
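The restriction described above can be checked on a toy example. The sketch assumes the standard DCN cross layer x_{l+1} = x_0 (x_l^T w_l) + x_l + b_l and the item update v_{l+1} = v_l (e_l^T w_vv) + e_l (v_l^T w_ev) + b_v of the cross&compress unit. The particular choice w_vv = e_l / (e_l^T e_l) is just one way to satisfy e_l^T w_vv = 1, and all names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dcn_layer(x0, xl, w, b):
    """DCN cross layer: x_{l+1} = x0 * (xl^T w) + xl + b."""
    s = dot(xl, w)
    return [a * s + c + bi for a, c, bi in zip(x0, xl, b)]

def restricted_item_update(v_l, e_l, e_0, w_ev, b_v):
    """Item update of the cross&compress unit under the two restrictions."""
    norm_sq = dot(e_l, e_l)
    w_vv = [x / norm_sq for x in e_l]               # ensures e_l^T w_vv = 1
    first = [vi * dot(e_l, w_vv) for vi in v_l]     # v_l (e_l^T w_vv) = v_l
    second = [ei * dot(v_l, w_ev) for ei in e_0]    # e_l replaced by e_0
    return [f + s + bi for f, s, bi in zip(first, second, b_v)]

v_l, e_l, e_0 = [1.0, 2.0], [2.0, 0.0], [3.0, 4.0]
w_ev, b_v = [1.0, 1.0], [0.5, 0.5]
restricted = restricted_item_update(v_l, e_l, e_0, w_ev, b_v)
dcn = dcn_layer(e_0, v_l, w_ev, b_v)   # x0 comes from the other task
print(restricted, dcn)  # [10.5, 14.5] [10.5, 14.5]
```

The only difference from a single-task DCN layer is that the layer-0 input multiplying the cross term comes from the other task's pathway.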

Cross-stitch Networks.
Cross-stitch networks [18] are a multi-task learning model in convolutional networks, in which the designed cross-stitch unit can learn a combination of shared and task-specific representations between two tasks. Specifically, given two activation maps $x_A$ and $x_B$ from layer l for both tasks, cross-stitch networks learn linear combinations $\tilde{x}_A$ and $\tilde{x}_B$ of both input activations and feed these combinations as input to the next layers' filters. The formula at location (i, j) in the activation map is

$$\begin{bmatrix} \tilde{x}_A^{ij} \\ \tilde{x}_B^{ij} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x_A^{ij} \\ x_B^{ij} \end{bmatrix}, \tag{14}$$

where the $\alpha$'s are trainable transfer weights of representations between task A and task B. We show that the cross-stitch unit in Eq. (14) is a simplified version of our cross&compress unit by the following proposition:

Proposition 3. If we omit all biases in Eq. (2), the cross&compress unit can be written as

$$\begin{bmatrix} v_{l+1} \\ e_{l+1} \end{bmatrix} = \begin{bmatrix} e_l^\top w_l^{VV} & v_l^\top w_l^{EV} \\ e_l^\top w_l^{VE} & v_l^\top w_l^{EE} \end{bmatrix} \begin{bmatrix} v_l \\ e_l \end{bmatrix}. \tag{15}$$

The transfer matrix in Eq. (15) serves as the cross-stitch unit $[\alpha_{AA}\ \alpha_{AB};\ \alpha_{BA}\ \alpha_{BB}]$ in Eq. (14). Like cross-stitch networks, the MKR network can decide to make certain layers task-specific by setting $v_l^\top w_l^{EV}$ ($\alpha_{AB}$) or $e_l^\top w_l^{VE}$ ($\alpha_{BA}$) to zero, or choose a more shared representation by assigning a higher value to them. But the transfer matrix is more fine-grained in the cross&compress unit, because the transfer weights are dot products of two vectors rather than scalars. It is rather interesting to notice that Eq. (15) can also be regarded as an attention mechanism [1], as the computation of transfer weights involves the feature vectors $v_l$ and $e_l$ themselves.
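The transfer-matrix view can be verified numerically. The sketch below assumes the bias-free cross&compress unit is computed from the outer-product cross matrix and compares it with a 2 × 2 transfer matrix whose entries are the dot products e^T w_vv, v^T w_ev, e^T w_ve, and v^T w_ee. Function names and the toy values are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bias_free_unit(v, e, w_vv, w_ev, w_ve, w_ee):
    """Bias-free cross&compress unit computed from the explicit cross matrix."""
    d = len(v)
    C = [[v[i] * e[j] for j in range(d)] for i in range(d)]
    Ct = [[C[j][i] for j in range(d)] for i in range(d)]
    v_next = [dot(C[i], w_vv) + dot(Ct[i], w_ev) for i in range(d)]
    e_next = [dot(C[i], w_ve) + dot(Ct[i], w_ee) for i in range(d)]
    return v_next, e_next

def transfer_matrix(v, e, w_vv, w_ev, w_ve, w_ee):
    """2 x 2 matrix of input-dependent transfer weights."""
    return [[dot(e, w_vv), dot(v, w_ev)],
            [dot(e, w_ve), dot(v, w_ee)]]

def apply_transfer(A, v, e):
    """Mix the two feature vectors with scalar transfer weights."""
    v_next = [A[0][0] * vi + A[0][1] * ei for vi, ei in zip(v, e)]
    e_next = [A[1][0] * vi + A[1][1] * ei for vi, ei in zip(v, e)]
    return v_next, e_next

args = ([1, 2], [3, 4], [1, 0], [0, 1], [0, 0], [1, 1])
A = transfer_matrix(*args)
print(A)                                    # [[3, 2], [0, 3]]
print(bias_free_unit(*args))                # ([9, 14], [9, 12])
print(apply_transfer(A, args[0], args[1]))  # ([9, 14], [9, 12])
```

Here the entry A[1][0] is zero, so the entity pathway receives nothing from the item features in this layer, which mirrors the task-specific setting discussed above.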

EXPERIMENTS
In this section, we evaluate the performance of MKR in four real-world recommendation scenarios: movie, book, music, and news.

Datasets
We utilize the following four datasets in our experiments:
• MovieLens-1M is a widely used benchmark dataset in movie recommendations, which consists of approximately 1 million explicit ratings (ranging from 1 to 5) on the MovieLens website.
• Book-Crossing dataset contains 1,149,780 explicit ratings (ranging from 0 to 10) of books in the Book-Crossing community.
• Last.FM dataset contains musician listening information from a set of 2 thousand users of the Last.fm online music system.
Since MovieLens-1M, Book-Crossing, and Last.FM are explicit feedback data (Last.FM provides the listening count as weight for each user-item interaction), we transform them into implicit feedback where each entry is marked with 1 indicating that the user has rated the item positively, and sample an unwatched set marked as 0 for each user. The threshold of positive rating is 4 for MovieLens-1M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity.
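The conversion from explicit to implicit feedback can be sketched as below. Two details are assumptions made for illustration and are not stated in the text: a rating counts as positive when it is greater than or equal to the threshold, and the sampled unwatched set has the same size as the user's positive set. The function name and data layout are hypothetical.

```python
import random

def to_implicit(ratings, all_items, threshold=None, seed=0):
    """Turn (user, item, rating) records into (user, item, label) samples."""
    rng = random.Random(seed)
    positives, seen = {}, {}
    for user, item, rating in ratings:
        seen.setdefault(user, set()).add(item)
        # threshold=None keeps every interaction as positive.
        if threshold is None or rating >= threshold:
            positives.setdefault(user, set()).add(item)
    samples = []
    for user, pos in positives.items():
        # Negatives are drawn from items the user never interacted with.
        candidates = sorted(set(all_items) - seen[user])
        negatives = rng.sample(candidates, min(len(pos), len(candidates)))
        samples += [(user, item, 1) for item in sorted(pos)]
        samples += [(user, item, 0) for item in negatives]
    return samples

ratings = [("u1", "a", 5), ("u1", "b", 2), ("u1", "c", 4)]
samples = to_implicit(ratings, ["a", "b", "c", "d", "e"], threshold=4)
```

In this toy example item "b" is rated below the threshold, so it is neither a positive nor eligible as a negative, since the user has already interacted with it.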
We use Microsoft Satori to construct the KG for each dataset. We first select a subset of triples from the whole KG with a confidence level greater than 0.9. For MovieLens-1M and Book-Crossing, we additionally select a subset of triples from the sub-KG whose relation name contains "film" or "book" respectively to further reduce KG size.
Given the sub-KGs, for MovieLens-1M, Book-Crossing, and Last.FM, we collect the IDs of all valid movies, books, or musicians by matching their names with the tails of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail), respectively. For simplicity, items with no matched entity or with multiple matched entities are excluded. We then match the IDs with the head and tail of all KG triples and select all well-matched triples from the sub-KG. The construction process is similar for Bing-News except that: (1) we use entity linking tools to extract entities in news titles; (2) we do not impose restrictions on the names of relations since the entities in news titles are not within one particular domain. The basic statistics of the four datasets are presented in Table 1. Note that the numbers of users, items, and interactions are smaller than in the original datasets since we filtered out items with no corresponding entity in the KG.
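The name-matching step can be sketched as follows. The sketch assumes triples are (head, relation, tail) tuples of strings and that a triple is kept when either its head or its tail is a matched entity; that selection rule is an assumption, since the text does not spell out what "well-matched" means. The entity IDs such as "m1" are invented for the example.

```python
def match_items_to_entities(item_names, triples, name_relation):
    """Map item name -> entity ID, keeping only unambiguous matches."""
    by_name = {}
    for head, relation, tail in triples:
        if relation == name_relation:
            by_name.setdefault(tail, set()).add(head)
    # Items with zero or several candidate entities are dropped.
    return {name: next(iter(by_name[name]))
            for name in item_names
            if len(by_name.get(name, ())) == 1}

def select_triples(triples, entity_ids):
    """Keep triples whose head or tail is one of the matched entities."""
    return [t for t in triples if t[0] in entity_ids or t[2] in entity_ids]

triples = [
    ("m1", "film.film.name", "Pulp Fiction"),
    ("m2", "film.film.name", "Heat"),
    ("m3", "film.film.name", "Heat"),          # ambiguous name
    ("p1", "film.director.film", "m1"),
    ("p2", "film.director.film", "m9"),
]
matched = match_items_to_entities(
    ["Pulp Fiction", "Heat", "Unknown"], triples, "film.film.name")
sub_kg = select_triples(triples, set(matched.values()))
```

Here "Heat" is dropped because two entities carry that name, and "Unknown" is dropped because no entity does.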

Baselines
We compare our proposed MKR with the following baselines. Unless otherwise specified, the hyper-parameter settings of baselines are the same as reported in their original papers or as default in their codes.
• PER [39] treats the KG as a heterogeneous information network and extracts meta-path based features to represent the connectivity between users and items. In this paper, we use manually designed user-item-attribute-item paths as features, i.e., "user-movie-director-movie", "user-movie-genre-movie", and "user-movie-star-movie" for MovieLens-1M; "user-book-author-book" and "user-book-genre-book" for Book-Crossing; "user-musician-genre-musician", "user-musician-country-musician", and "user-musician-age-musician" (age is discretized) for Last.FM. Note that PER cannot be applied to news recommendation because it is hard to pre-define meta-paths for entities in news.
• CKE [40] combines CF with structural, textual, and visual knowledge in a unified framework for recommendation. We implement CKE as CF plus the structural knowledge module in this paper. The dimension of user and item embeddings for the four datasets is set to 64, 128, 32, and 64, respectively. The dimension of entity embeddings is 32.
• DKN [32] treats entity embedding and word embedding as multiple channels and combines them together in a CNN for CTR prediction. In this paper, we use movie/book names and news titles as textual input for DKN. The dimension of word embedding and entity embedding is 64, and the number of filters is 128 for each of the window sizes 1, 2, and 3.
Footnote 13: https://www.bing.com/news

Experiments setup
In MKR, we set the number of high-level layers K = 1, $f_{RS}$ as the inner product, and $\lambda_2 = 10^{-6}$ for all four datasets; the other hyper-parameters are given in Table 1. The settings of hyper-parameters are determined by optimizing AUC on a validation set. For each dataset, the ratio of training, validation, and test set is 6 : 2 : 2. Each experiment is repeated 3 times, and the average performance is reported. We evaluate our method in two experiment scenarios: (1) In click-through rate (CTR) prediction, we apply the trained model to each interaction in the test set and output the predicted click probability. We use AUC and Accuracy to evaluate the performance of CTR prediction. (2) In top-K recommendation, we use the trained model to select the K items with the highest predicted click probability for each user in the test set, and choose Precision@K and Recall@K to evaluate the recommended sets.
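For reference, the evaluation metrics named above can be computed as in the following sketch. It uses the pairwise definition of AUC (the fraction of positive-negative pairs ranked correctly, with ties counted as one half) and the usual definitions of Precision@K and Recall@K. The function names are illustrative and this is not the authors' evaluation code.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_at_k(ranked_items, relevant, k):
    """Precision@K and Recall@K for one user's ranked recommendation list."""
    hits = len(set(ranked_items[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

print(auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))                       # 0.75
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"a", "d", "e"}, 2)
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a sort-based computation would be preferable on full test sets.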

Empirical study
We conduct an empirical study to investigate the correlation between items in RS and their corresponding entities in the KG. Specifically, we aim to reveal how the number of common neighbors of an item pair in the KG changes with their number of common raters in RS. To this end, we first randomly sample 1 million item pairs from MovieLens-1M. We then classify each pair into 5 categories based on the number of their common raters in RS, and count the average number of common neighbors in the KG for each category. The result is presented in Figure 2a, which clearly shows that if two items have more common raters in RS, they are likely to share more common neighbors in the KG. Figure 2b shows the positive correlation from the opposite direction. The above findings empirically demonstrate that items share a similar proximity structure in the KG and RS, and thus the cross knowledge transfer of items benefits both the recommendation and KGE tasks in MKR.
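The statistic behind this study can be sketched as follows. The sketch assumes each item comes with a set of users who rated it and a set of neighboring KG entities. The bucket boundaries passed as `bins` are hypothetical, since the category definitions are not given in the text.

```python
def avg_common_neighbors_by_raters(pairs, raters, kg_neighbors, bins):
    """Average number of common KG neighbors per common-rater bucket."""
    totals = {bucket: [0, 0] for bucket in bins}   # bucket -> [sum, count]
    for i, j in pairs:
        n_raters = len(raters[i] & raters[j])
        for lo, hi in bins:
            if lo <= n_raters <= hi:
                totals[(lo, hi)][0] += len(kg_neighbors[i] & kg_neighbors[j])
                totals[(lo, hi)][1] += 1
                break
    return {bucket: s / n for bucket, (s, n) in totals.items() if n}

# Toy data: three items, their raters, and their KG neighbors.
raters = {"x": {1, 2, 3}, "y": {2, 3}, "z": {9}}
kg_neighbors = {"x": {"d1", "g1"}, "y": {"d1", "g1"}, "z": {"g1"}}
pairs = [("x", "y"), ("x", "z"), ("y", "z")]
result = avg_common_neighbors_by_raters(
    pairs, raters, kg_neighbors, bins=[(0, 0), (1, 5)])
print(result)  # {(0, 0): 1.0, (1, 5): 2.0}
```

A rising average across buckets, as in the toy output, is the pattern the study reports for Figure 2a.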

Comparison with baselines.
The results of all methods in CTR prediction and top-K recommendation are presented in Table 2 and Figures 3 and 4, respectively, from which we have the following observations:
• PER performs poorly on movie, book, and music recommendation because the user-defined meta-paths can hardly be optimal in reality. Moreover, PER cannot be applied to news recommendation since entities in news titles are not within one particular domain.
• CKE performs better in movie, book, and music recommendation than in news recommendation. This may be because MovieLens-1M, Book-Crossing, and Last.FM are much denser than Bing-News, which is more favorable for the collaborative filtering part in CKE.
• DKN performs best in news recommendation compared with other baselines, but performs worst in other scenarios. This is because movie, book, and musician names are too short and ambiguous to provide useful information.
• RippleNet performs best among all baselines, and even outperforms MKR on MovieLens-1M. This demonstrates that RippleNet can precisely capture user interests, especially in the case where user-item interactions are dense. However, RippleNet is more sensitive to the density of datasets, as it performs worse than MKR on Book-Crossing, Last.FM, and Bing-News. We will further study their performance in sparse scenarios in Section 4.5.3.
• In general, our MKR performs best among all methods on the four datasets. Specifically, MKR achieves average Accuracy gains of 11.6%, 11.5%, 12.7%, and 8.7% in movie, book, music, and news recommendation, respectively, which demonstrates the efficacy of the multi-task learning framework in MKR. Note that the top-K metrics are much lower for Bing-News because the number of news items is significantly larger than the number of movies, books, and musicians.

Comparison with MKR variants.
We further compare MKR with its three variants to demonstrate the efficacy of the cross&compress unit:
• MKR-1L is MKR with one layer of cross&compress unit, which corresponds to the FM model according to Proposition 1. Note that MKR-1L is actually MKR in the experiments for MovieLens-1M.
• MKR-DCN is the variant of MKR restricted to the DCN-equivalent form of the cross&compress unit discussed in Section 3.2.
• MKR-stitch is the variant of MKR that corresponds to the cross-stitch network, using the cross-stitch unit of Eq. (14) in place of the cross&compress unit.
From Table 2 we observe that MKR outperforms MKR-1L and MKR-DCN, which shows that modeling high-order interactions between item and entity features is helpful for maintaining decent performance. MKR also achieves better scores than MKR-stitch. This validates the efficacy of fine-grained control on knowledge transfer in MKR compared with the simple cross-stitch units.

4.5.3 Results in sparse scenarios.
One major goal of using the knowledge graph in MKR is to alleviate the sparsity and the cold start problem of recommender systems. To investigate the efficacy of the KGE module in sparse scenarios, we vary the ratio r of the training set of MovieLens-1M from 100% to 10% (while the validation and test sets are kept fixed), and report the AUC in CTR prediction for all methods. The results are shown in Table 3. We observe that the performance of all methods deteriorates as the training set shrinks. When r = 10%, the AUC score decreases by 15[...] compared with the case when the full training set is used (r = 100%). In contrast, the AUC score of MKR only decreases by 5.3%, which demonstrates that MKR can still maintain decent performance even when the user-item interactions are sparse. We also notice that MKR performs better than RippleNet in sparse scenarios, which is in accordance with our observation in Section 4.5.1 that RippleNet is more sensitive to the density of user-item interactions.
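The sparse-scenario protocol amounts to subsampling the training interactions while leaving the validation and test sets untouched. A minimal sketch, assuming uniform random subsampling with a fixed seed (the sampling scheme is an assumption; the text only specifies the ratios), is:

```python
import random

def subsample_training(train, ratio, seed=0):
    """Keep a uniformly sampled fraction of the training interactions."""
    rng = random.Random(seed)
    k = int(round(len(train) * ratio))
    return rng.sample(train, k)

train = list(range(10))          # stand-in for training interactions
sparse = subsample_training(train, 0.1)
full = subsample_training(train, 1.0)
```

Fixing the seed keeps the subsets reproducible across methods, so every model is compared on the same reduced training data.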

Results on KGE side.
Although the goal of MKR is to utilize the KG to assist with recommendation, it is still interesting to investigate whether the RS task benefits the KGE task, since the principle of multi-task learning is to leverage shared information to help improve the performance of all tasks [42]. We present the RMSE (root mean square error) between predicted and real vectors of tails in the KGE task in Table 4. We find that the existence of the RS module can indeed reduce the prediction error by 1.9% ∼ 6.4%. The results show that the cross&compress units are able to learn general and shared features that mutually benefit both sides of MKR.
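The error measure used on the KGE side can be sketched as below. The sketch assumes the squared error is averaged over all components of all tail vectors before taking the square root; the exact normalization is not stated in the text.

```python
import math

def rmse(predicted, real):
    """Root mean square error between lists of equal-length vectors."""
    squared = [(p - r) ** 2
               for p_vec, r_vec in zip(predicted, real)
               for p, r in zip(p_vec, r_vec)]
    return math.sqrt(sum(squared) / len(squared))

print(rmse([[0.0, 0.0]], [[2.0, 2.0]]))  # 2.0
```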

Parameter Sensitivity
4.6.1 Impact of KG size. We vary the size of the KG to further investigate the efficacy of KG usage. The AUC results on Bing-News are plotted in Figure 5a. Specifically, the AUC and Accuracy are enhanced by 13.6% and 11.8%, respectively, as the KG ratio increases from 0.1 to 1.0 in three scenarios. This is because the Bing-News dataset is extremely sparse, making the effect of KG usage rather obvious.
4.6.2 Impact of RS training frequency. We investigate the influence of the parameter t in MKR by varying t from 1 to 10 while keeping other parameters fixed. The results are presented in Figure 5b. We observe that MKR achieves the best performance when t = 5. This is because too high a training frequency of the KGE module will mislead the objective function of MKR, while too low a training frequency of the KGE module cannot make full use of the knowledge transferred from the KG.
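The role of t can be illustrated with a small scheduling sketch. It assumes that t counts the RS updates performed before each single KGE update, so a larger t means a lower relative KGE training frequency. The names `train_alternating`, `rs_step`, and `kge_step` are hypothetical placeholders, not part of any released code.

```python
def train_alternating(rs_step, kge_step, num_iterations, t):
    """Run `t` RS updates followed by one KGE update, `num_iterations` times."""
    if t < 1:
        raise ValueError("t must be a positive integer")
    for _ in range(num_iterations):
        for _ in range(t):
            rs_step()   # one update on the recommendation objective
        kge_step()      # one update on the KGE objective


# Record the order of updates instead of training a real model.
log = []
train_alternating(lambda: log.append("RS"), lambda: log.append("KGE"),
                  num_iterations=2, t=3)
print(log)  # ['RS', 'RS', 'RS', 'KGE', 'RS', 'RS', 'RS', 'KGE']
```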

Impact of embedding dimension.
We also show how the embedding dimension of users, items, and entities affects the performance of MKR in Figure 5c. We find that the performance initially improves as the dimension increases, because more bits in the embedding layer can encode more useful information. However, the performance drops when the dimension increases further, as too many dimensions may introduce noise that misleads the subsequent prediction.

RELATED WORK

Knowledge Graph Embedding
The KGE module in MKR connects to a large body of work on KGE methods. KGE is used to embed entities and relations in a knowledge graph into low-dimensional vector spaces while still preserving the structural information [33]. KGE methods can be classified into the following two categories: (1) Translational distance models exploit distance-based scoring functions when learning representations of entities and relations, such as TransE [2], TransH [35], and TransR [13]; (2) Semantic matching models measure the plausibility of knowledge triples by matching latent semantics of entities and relations, such as RESCAL [20], ANALOGY [19], and HolE [14]. Recently, researchers have also proposed incorporating auxiliary information, such as entity types [36], logic rules [24], and textual descriptions [46], to assist KGE. The above KGE methods can also be incorporated into MKR as the implementation of the KGE module, but note that the cross&compress unit in MKR needs to be redesigned accordingly. Exploring other designs of the KGE module as well as the corresponding bridging unit is also an important direction of future work.
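As a concrete instance of a distance-based scoring function, the sketch below scores a triple in the style of TransE, where a plausible triple (h, r, t) should satisfy h + r ≈ t. The L2 norm is assumed here (TransE can also use the L1 norm), and the two-dimensional embeddings are toy values chosen for illustration.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(head, relation, tail)))


# Toy embeddings: the first tail equals h + r, the second is a corrupted tail.
head, relation = [1.0, 0.0], [0.0, 1.0]
true_tail, corrupted_tail = [1.0, 1.0], [4.0, 5.0]

print(transe_score(head, relation, true_tail) >
      transe_score(head, relation, corrupted_tail))  # True
```

A higher (less negative) score indicates a more plausible triple, which is why the matching tail outranks the corrupted one.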

Multi-Task Learning
Multi-task learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks [42]. All of the learning tasks are assumed to be related to each other, and it is found that learning these tasks jointly can lead to performance improvement compared with learning them individually. In general, MTL algorithms can be classified into several categories, including the feature learning approach [34,41], low-rank approach [7,16], task clustering approach [47], task relation learning approach [12], and decomposition approach [6]. For example, the cross-stitch network [41] determines the inputs of hidden layers in different tasks by a knowledge transfer matrix; Zhou et al. [47] aim to cluster tasks by identifying representative tasks which are a subset of the given m tasks, i.e., if task T_i is selected by task T_j as a representative task, then it is expected that the model parameters for T_j are similar to those of T_i. MTL can also be combined with other learning paradigms to further improve the performance of learning tasks, including semi-supervised learning, active learning, unsupervised learning, and reinforcement learning.
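The knowledge transfer matrix of a cross-stitch unit can be illustrated as follows. The sketch assumes the usual two-task form, in which each task's next-layer input is a learned linear combination of both tasks' activations; the function name, the 2x2 layout of `alpha`, and the numeric values are choices made for this example.

```python
def cross_stitch(x_a, x_b, alpha):
    """Mix the activations of tasks A and B with a 2x2 transfer matrix.

    alpha = [[a_AA, a_AB], [a_BA, a_BB]]; row 0 produces the new input
    for task A and row 1 the new input for task B.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * a + a_ab * b for a, b in zip(x_a, x_b)]
    out_b = [a_ba * a + a_bb * b for a, b in zip(x_a, x_b)]
    return out_a, out_b


x_a, x_b = [1.0, 2.0], [3.0, 4.0]

# Identity matrix: no sharing, each task keeps its own activations.
print(cross_stitch(x_a, x_b, [[1.0, 0.0], [0.0, 1.0]]))  # ([1.0, 2.0], [3.0, 4.0])

# Uniform matrix: both tasks receive the average of the two activations.
print(cross_stitch(x_a, x_b, [[0.5, 0.5], [0.5, 0.5]]))  # ([2.0, 3.0], [2.0, 3.0])
```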
Our work can be seen as an asymmetric multi-task learning framework [37,43,44], in which we aim to utilize the connection between RS and KG to help improve their performance, and the two tasks are trained with different frequencies.

Deep Recommender Systems
Recently, deep learning has been revolutionizing recommender systems and achieving better performance in many recommendation scenarios. Roughly speaking, deep recommender systems can be classified into two categories: (1) Using deep neural networks to process the raw features of users or items [5,28,29,30,40]. For example, Collaborative Deep Learning [29] designs autoencoders to extract short and dense features from textual input and feeds the features into a collaborative filtering module; DeepFM [5] combines factorization machines for recommendation and deep learning for feature learning in a neural network architecture. (2) Using deep neural networks to model the interaction among users and items [3,4,8,9]. For example, Neural Collaborative Filtering [8] replaces the inner product with a neural architecture to model the user-item interaction. The major difference between these methods and ours is that MKR deploys a multi-task learning framework that utilizes the knowledge from a KG to assist recommendation.
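The difference between an inner-product interaction and a neural interaction can be sketched as below. This is only an illustration of the idea: the one-hidden-layer network over concatenated embeddings, the function names, and all weights are assumptions of this example and do not reproduce the architecture of [8].

```python
import math


def inner_product_score(user, item):
    """Matrix-factorization-style interaction: a fixed inner product."""
    return sum(u * i for u, i in zip(user, item))


def mlp_score(user, item, hidden_weights, hidden_bias, output_weights, output_bias):
    """Learned interaction: one ReLU hidden layer over [user; item], then a sigmoid."""
    x = list(user) + list(item)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_bias)]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-logit))


user, item = [1.0, 2.0], [3.0, 4.0]
print(inner_product_score(user, item))  # 1*3 + 2*4 = 11.0

# Arbitrary illustrative weights; in practice these are learned from data.
score = mlp_score(user, item,
                  hidden_weights=[[0.1, 0.0, 0.0, 0.1], [0.0, -0.1, 0.1, 0.0]],
                  hidden_bias=[0.0, 0.0],
                  output_weights=[0.5, -0.5],
                  output_bias=0.0)
print(0.0 < score < 1.0)  # True: the sigmoid output is a probability-like score
```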

CONCLUSIONS AND FUTURE WORK
This paper proposes MKR, a multi-task learning approach for knowledge graph enhanced recommendation. MKR is a deep end-to-end framework that consists of two parts: the recommendation module and the KGE module. Both modules adopt multiple nonlinear layers to extract latent features from inputs and fit the complicated interactions of user-item and head-relation pairs. Since the two tasks are not independent but connected by items and entities, we design a cross&compress unit in MKR to associate the two tasks, which can automatically learn high-order interactions of item and entity features and transfer knowledge between the two tasks. We conduct extensive experiments in four recommendation scenarios. The results demonstrate the significant superiority of MKR over strong baselines and the efficacy of the usage of KG.
For future work, we plan to investigate other types of neural networks (such as CNN) in the MKR framework. We will also incorporate other KGE methods as the implementation of the KGE module in MKR by redesigning the cross&compress unit.
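As an illustration of the cross&compress unit summarized above, the following sketch implements one layer under the assumption that the cross step forms the outer product C = v e^T of the item and entity vectors, and the compress step projects C and its transpose back to d dimensions with weight vectors plus biases. The function and argument names are chosen for this sketch, and the numeric inputs are arbitrary.

```python
def cross_compress(v, e, w_vv, w_ev, w_ve, w_ee, b_v, b_e):
    """One cross&compress layer on item vector v and entity vector e."""
    d = len(v)
    # Cross: every pairwise product v_i * e_j, i.e. C = v e^T.
    c = [[v[i] * e[j] for j in range(d)] for i in range(d)]
    c_t = [[c[j][i] for j in range(d)] for i in range(d)]

    def matvec(m, w):
        return [sum(m[i][j] * w[j] for j in range(d)) for i in range(d)]

    # Compress: project C and C^T back to d dimensions for each side.
    v_next = [a + b + bias
              for a, b, bias in zip(matvec(c, w_vv), matvec(c_t, w_ev), b_v)]
    e_next = [a + b + bias
              for a, b, bias in zip(matvec(c, w_ve), matvec(c_t, w_ee), b_e)]
    return v_next, e_next


v_next, e_next = cross_compress(
    v=[1.0, 2.0], e=[3.0, 4.0],
    w_vv=[1.0, 0.0], w_ev=[0.0, 1.0],
    w_ve=[0.0, 0.0], w_ee=[1.0, 1.0],
    b_v=[0.0, 0.0], b_e=[0.0, 0.0],
)
print(v_next)  # [9.0, 14.0]
print(e_next)  # [9.0, 12.0]
```

Both outputs depend on products of item and entity components, which is how stacking such layers yields higher-order interactions between the two feature vectors.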

APPENDIX
A Proof of Theorem 1
Proof. We prove the theorem by induction: Base case: When l = 1, Therefore, we have .

B Proof of Proposition 1
Proof. In the proof of Theorem 1 in Appendix A, we have shown that It is easy to see that w i = w . The proof is similar for ∥e 1 ∥ 1 .
We omit the proofs for Proposition 2 and Proposition 3 as they are straightforward.