Geographical Research (地理研究), 2018, 37(4): 814-824. https://doi.org/10.11821/dlyj201804014

Research article

基于混合过滤的地学数据个性化推荐方法设计与实现

王末¹,², 郑晓欢³, 王卷乐⁴,⁵,⁶, 柏永青⁴,⁵

1. 中国农业科学院农业信息研究所,北京 100081
2. 农业部农业大数据重点实验室,北京 100081
3. 中国科学院办公厅,北京 100864
4. 中国科学院地理科学与资源研究所,资源与环境信息系统国家重点实验室,北京 100101
5. 中国科学院大学,北京 100049
6. 江苏省地理信息资源开发与利用协同创新中心,南京 210023

A hybrid personalized data recommendation approach for geoscience data sharing

WANG Mo¹,², ZHENG Xiaohuan³, WANG Juanle⁴,⁵,⁶, BAI Yongqing⁴,⁵

1. Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China
2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture, Beijing 100081, China
3. Office of General Affairs, Chinese Academy of Sciences, Beijing 100864, China
4. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
5. University of Chinese Academy of Sciences, Beijing 100049, China
6. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China

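Corresponding author: WANG Juanle (王卷乐, 1976- ), male, PhD, research professor; research interests: scientific data sharing, geographic information systems, and remote sensing applications. E-mail: wangjl@igsnrr.ac.cn

Received: 2017-10-11

Revised: 2018-02-1

Published online: 2018-04-20

Copyright: 2018 Editorial Office of Geographical Research (《地理研究》编辑部)

Funding: National Science and Technology Infrastructure Platform construction project (2005DKA32300); Chinese Academy of Sciences service project for the cultivation and construction of characteristic institutes (TSYJS03); China Knowledge Centre for Engineering Sciences and Technology construction project (CKCEST-2017-3-1); research and construction project for an agricultural science data mining and analysis platform (JBYW-AII-2017-32); Agricultural Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII)

Author biography: WANG Mo (王末, 1987- ), male, assistant research professor; research interests: geoscience data sharing and mining. E-mail: wangm.13b@igsnrr.ac.cn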

摘要

推荐系统是帮助互联网用户克服信息过剩的有效工具。在地学数据共享领域,较其他物品的内容属性,地学数据具有更加丰富的时空属性,这也给地学数据推荐带来挑战。针对地学数据的特点,为地学数据共享推荐服务开发了一种动态加权的混合过滤方法。该方法分别采用协同过滤和基于内容过滤算法预测用户对数据的兴趣度,再以训练模型计算最优加权权重,计算最终预测评分。在数据获取阶段,通过用户访问日志数据,采用Jenks Natural Break算法分析用户访问记录获取用户的数据兴趣度。在基于内容过滤部分,通过数据的空间、时间及内容属性计算数据相似度,并以用户历史行为为依据计算用户兴趣。在协同过滤和基于内容过滤中分别采用k-NN算法计算用户对未访问数据的预测评分,并进行加权求和。通过训练集,对理想权重值及用户的共同评价度(co-rating level)进行建模,拟合二者的关系。该模型被应用于混合过滤的权重调整,以获得最优的加权方程。测试结果显示,结合数据时空属性的混合过滤方法的准确度和召回率,较单一的协同过滤或基于内容过滤方法有显著提高。

关键词: 地理空间数据 ; 推荐系统 ; 混合过滤 ; 科学数据共享

Abstract

Recommender systems are effective tools that help Internet users cope with information overload. In the geoscience data sharing domain, items (datasets) carry richer spatial and temporal attributes than regular items (e.g. books, movies, music), which makes high-performance recommendation for geoscience data more challenging. This study proposes an approach that combines content-based filtering (CBF) with item-based collaborative filtering (CF) using dynamic weights. The approach draws on both the predictive ability of collaborative filtering and the item content information that mitigates the data sparsity and early-rater problems. Users' ratings of items were first derived from their historical visiting time with Jenks Natural Breaks. In the CBF part, spatial, temporal, and thematic information was extracted from geoscience datasets to compute item similarity. Predicted ratings were computed with the k-NN method separately for CBF and CF and then combined with dynamic weights. With a training dataset, we searched for the model that best describes the relationship between the ideal weights and users' co-rating level; a logarithmic function was identified as the best model. The model was then applied to tune the weights of CF and CBF for each user-item pair on a test dataset. Evaluation results showed that the dynamically weighted approach outperformed both CF alone and CBF alone in terms of Precision and Recall.

Keywords: recommender system ; geoscience data ; hybrid filtering ; science data sharing


Cite this article:

王末, 郑晓欢, 王卷乐, 柏永青. 基于混合过滤的地学数据个性化推荐方法设计与实现[J]. 地理研究, 2018, 37(4): 814-824 https://doi.org/10.11821/dlyj201804014

WANG Mo, ZHENG Xiaohuan, WANG Juanle, BAI Yongqing. A hybrid personalized data recommendation approach for geoscience data sharing[J]. Geographical Research, 2018, 37(4): 814-824 https://doi.org/10.11821/dlyj201804014

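1 Introduction

Data are a basic prerequisite for scientific research [1]. Today, the geosciences generate, collect, and store massive volumes of scientific data every day at an unprecedented rate. Data sharing is an important way to put these data to effective use, and resource discovery is one of the basic functions of a data sharing service. However, the intrinsic attributes of geoscience data include spatial, temporal, and thematic information, and conventional retrieval techniques may not meet users' needs with respect to these attributes. Faced with massive data holdings, researchers struggle to find the data they need. Personalized recommendation is an effective way to address this information overload. Personalized recommender systems have been applied successfully in many fields, including multimedia content (music, movies, etc.) [2,3,4], e-learning [5,6], e-commerce [7,8], and web search [9,10]. To date, however, personalized recommendation methods designed for scientific data sharing services are still lacking.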

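A personalized recommender system is a Web application that learns user preferences, predicts user needs from those preferences, and makes personalized recommendations from a large set of candidate options [11]. Common types of personalized recommendation algorithms are collaborative filtering, content-based filtering, and demographic filtering. Collaborative filtering relies on the ratings that users have in common to compute user-to-user similarity and recommends the items a user likes to similar users. Content-based filtering computes item-to-item similarity from item attributes and recommends items whose attributes resemble those in the user's interest history. Demographic filtering computes user similarity from users' social attributes (such as age, gender, region, and occupation), groups users into types, and recommends accordingly. Each method has its own strengths and weaknesses, and no single algorithm suits every application scenario. When a large amount of rating data is available, collaborative filtering usually performs better than content-based filtering [12], but its performance is easily degraded by data sparsity. Because it needs no ratings from other users, content-based filtering avoids this problem. Demographic filtering is constrained by privacy concerns: the information most important to the algorithm is often private information that users are unwilling to disclose, so this type of algorithm is rarely adopted in practice. For these reasons, hybrid algorithms that combine several filtering methods can exploit the strengths of each while avoiding their weaknesses, and thereby achieve better recommendations [13].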

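Since recommender systems became a subject of academic research, many types of hybrid filtering have been proposed. The two most widely used combine collaborative filtering with content-based filtering [14,15] and collaborative filtering with demographic filtering [16]. Collaborative and content-based filtering are generally combined in one of four ways [17]. The first computes the recommendations of the collaborative and content-based algorithms separately and outputs a weighted combination of the two [14,18,19,20,21]. The second incorporates content-based ideas into collaborative filtering and recommends in the collaborative manner [14,22]. The third builds a new model that fuses features from both collaborative and content-based filtering [23,24]. The fourth incorporates collaborative ideas into content-based filtering and recommends in the content-based manner [25].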

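Hybrid filtering algorithms pursue different goals in different application scenarios. The most common design goal is to improve recommendation accuracy [14,16]. In some scenarios the goal is to overcome the cold-start problem. In addition, because recommender systems must process large volumes of data, some hybrid algorithms aim to improve computational efficiency. Given this potential, hybrid algorithms have been studied and applied in many fields, for example to books [26], movies [4,27], and music [28,29]. Beyond such product recommendations, hybrid filtering has also been used to recommend news [19,30], e-learning courses [31,32,33], digital books [34], and travel destinations [35,36]. In geoscience data sharing, however, specialized data recommendation methods are lacking.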

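Compared with conventional recommendation applications, the needs of geospatial data users are more specialized and more complex. Because geospatial data carry complex spatial and temporal attribute information, recommendation algorithms face a harder data-similarity computation problem, and geoscience data recommendation poses greater challenges than the recommendation of multimedia content, products, or text. Moreover, data sharing websites are designed differently from commercial websites and usually lack a rating system, so users' behavioral preferences are harder to obtain. To address these challenges, this study takes user behavior on the National Earth System Science Data Sharing Platform (国家地球系统科学数据共享平台) as its object and develops a hybrid geoscience data recommendation method based on collaborative filtering and content-based filtering.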

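2 Design of the hybrid filtering method for geoscience data recommendation

As noted in the introduction, collaborative filtering recommends well when enough user ratings are available. In practice, however, many items lack a sufficient number of ratings, which lowers the confidence of the computed item-to-item similarity. To overcome this data sparsity, when too few users have rated both of two items, the item similarity can instead be computed from item content attributes. For geospatial data, this study designs a dynamic hybrid of collaborative and content-based filtering that overcomes data sparsity in the collaborative setting. The method exploits both the predictive accuracy of collaborative filtering and the ability of content-based filtering to compute item similarity under sparse data. Figure 1 shows the workflow.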

图1   混合过滤地理空间数据推荐方法流程图

Fig. 1   Workflow of the proposed hybrid filtering algorithm

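In this study an item is a geospatial dataset. Item similarity is computed separately by collaborative filtering and by content-based filtering. In the content-based part, the theme, spatial extent, and temporal extent of each dataset are extracted and used to compute dataset similarity. In the collaborative part, dataset similarity is computed from the ratings users have given in common. For each user-dataset pair, collaborative filtering and content-based filtering each predict the user's rating, and the final predicted rating is a dynamically weighted combination of the two. The rationale is that the more co-ratings two items share, the stronger the predictive power of collaborative filtering, so it receives a higher weight; the fewer the co-ratings, the relatively stronger content-based filtering becomes, so it receives the higher weight.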

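2.1 Data similarity in content-based filtering

The first step in predicting ratings with content-based filtering is to determine item similarity. Items are usually described by a set of attribute values; representing each item as an attribute vector allows similarity to be computed by vector operations. Because the spatial and temporal attributes of geospatial data usually correspond to ranges, they are harder to express as single values than the attributes of products, movies, or music in conventional applications, and similarity is correspondingly harder to compute. This study defines the attributes of geospatial data along three dimensions: spatial extent, temporal extent, and thematic content. The values of the three attributes are extracted from dataset metadata, and a similarity is computed for each dimension. The overall similarity is the weighted sum of the three:

$$Sim_{con} = W_c \times Sim_{sub} + W_s \times Sim_{spa} + W_t \times Sim_{time} \tag{1}$$

where $Sim_{sub}$, $Sim_{spa}$, and $Sim_{time}$ are the thematic, spatial-extent, and temporal-extent similarities, and $W_c$, $W_s$, and $W_t$ are their weights. Note that, from the standpoint of what geoscience data users need, the similarity of dataset A to dataset B may differ from that of B to A. For example, if A contains B along some dimension, then A can satisfy a user who needs B, whereas B only partly satisfies a user who needs A. The resulting similarity matrix is therefore asymmetric; this study calls it a one-way (directional) similarity. The three component similarities are described in the subsections below.

The weights in Eq. (1) were determined by expert scoring, adopting the values from a study of geospatial semantic relatedness [37]. After consulting experts in geoscience data sharing, geospatial semantics, and ontology, the weights for thematic content, spatial extent, and temporal extent were set to 0.41, 0.35, and 0.24. Writing $Sim_c$, $Sim_s$, and $Sim_t$ for the three component similarities, the formula becomes:

$$Sim_{con} = 0.41 \times Sim_c + 0.35 \times Sim_s + 0.24 \times Sim_t \tag{2}$$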
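The weighted sum in Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `content_similarity` and the range check are assumptions; only the three weights come from the text.

```python
# Weights reported in the text for Eq. (2); all names here are illustrative.
W_THEME, W_SPACE, W_TIME = 0.41, 0.35, 0.24


def content_similarity(sim_theme: float, sim_space: float, sim_time: float) -> float:
    """Weighted sum of the three component similarities, each expected in [0, 1]."""
    for s in (sim_theme, sim_space, sim_time):
        if not 0.0 <= s <= 1.0:
            raise ValueError("component similarities must lie in [0, 1]")
    return W_THEME * sim_theme + W_SPACE * sim_space + W_TIME * sim_time


# Same theme, half of the spatial extent covered, no temporal overlap.
score = content_similarity(1.0, 0.5, 0.0)
```

Because each component is directional, the function would be called once per ordered pair of datasets to fill the asymmetric similarity matrix.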

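2.1.1 Thematic similarity

The thematic similarity of geospatial data resembles the content similarity of books, movies, or music in conventional recommender applications and is determined by descriptive attributes of the content. This study derives it from two attributes, keywords and classification hierarchy:

$$Sim_c = W_{ck} \times Sim_{ck} + W_{cc} \times Sim_{cc} \tag{3}$$

where $Sim_{ck}$ and $Sim_{cc}$ are the keyword similarity and the classification-hierarchy similarity, and $W_{ck}$ and $W_{cc}$ are their weights, with $W_{ck} + W_{cc} = 1$. The weights depend on domain knowledge; this study uses $W_{ck} = W_{cc} = 0.5$.

Each item (geospatial dataset) is described by a number of keywords. If the keyword sets of datasets $i$ and $j$ are $KW_i$ and $KW_j$, the keyword similarity is:

$$Sim_{ck}(i,j) = \frac{|KW_i \cap KW_j|}{|KW_i|} \tag{4}$$

Similarly, the classification-hierarchy similarity is computed from the overlap between the two datasets' classification paths. For example, let the classification paths of datasets $i$ and $j$ be:

$H_i: D_1 \rightarrow E_1 \rightarrow F_1 \rightarrow G_1$

$H_j: D_1 \rightarrow E_1 \rightarrow F_2 \rightarrow G_2$

If the depth of a classification path is written $|H_i|$, the overlap of the paths of $i$ and $j$ is $|H_i \cap H_j| - 1$. The classification-hierarchy similarity is:

$$Sim_{cc}(i,j) = \frac{|H_i \cap H_j| - 1}{|H_i| - 1} \tag{5}$$

In this example the classification-hierarchy similarity is 1/3.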
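A minimal Python sketch of Eqs. (3) to (5) follows. It assumes that the overlap of two classification paths is their shared prefix starting at the root, which reproduces the 1/3 of the example; the function names and the handling of empty keyword sets and one-level paths are illustrative assumptions.

```python
def keyword_similarity(kw_i: set, kw_j: set) -> float:
    """Eq. (4): share of dataset i's keywords that also describe dataset j."""
    if not kw_i:
        return 0.0  # assumption: no keywords means no measurable similarity
    return len(kw_i & kw_j) / len(kw_i)


def category_similarity(path_i: list, path_j: list) -> float:
    """Eq. (5): (shared levels - 1) / (depth of i - 1), paths listed root first."""
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    if len(path_i) <= 1:
        return 0.0  # assumption: a one-level path gives no usable ratio
    return max(shared - 1, 0) / (len(path_i) - 1)


def thematic_similarity(kw_i, kw_j, path_i, path_j, w_ck=0.5, w_cc=0.5) -> float:
    """Eq. (3) with the equal weights used in the text."""
    return w_ck * keyword_similarity(kw_i, kw_j) + w_cc * category_similarity(path_i, path_j)


# The classification example from the text.
example = category_similarity(["D1", "E1", "F1", "G1"], ["D1", "E1", "F2", "G2"])
```

Both measures divide by a property of dataset i only, so swapping the arguments generally changes the result, in line with the directional similarity described earlier.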

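2.1.2 Spatial-extent similarity

Compared with products, movies, or music, a distinguishing feature of geospatial data is its spatial attribute. The most direct way to compute the spatial similarity of two geospatial datasets is to compute their topological relationship and determine how much their spatial extents overlap [38]. However, computing topological relationships with a geographic information system is expensive, which makes this approach impractical when a geoscience data sharing platform must process large numbers of geospatial datasets. A geospatial ontology, by contrast, records the spatial relationships between place names and supports fast spatial-relationship queries, so it is suited to querying spatial relationships across large numbers of datasets. In recent years several spatial information retrieval studies have used geospatial ontologies as semantic retrieval tools [39,40].

The geospatial data shared on the platform are mainly in raster, vector, and tabular formats. Regardless of format, every spatial dataset can be classified by geometry type as polygon, line, or point data. From the standpoint of users' data needs, spatial-extent similarity for the three types is computed according to the following rules:

(1) The similarity between datasets of different geometry types depends on whether their spatial locations overlap. The area of point and line data is negligible. If the location of a point or line dataset is contained in a polygon dataset, its similarity to the polygon dataset is 1; conversely, the similarity of the polygon dataset to the point or line dataset is 0. The formula is:

$$Sim_s(i,j) = \frac{|i \cap j|}{|i|} \tag{6}$$

(2) The similarity between two point datasets depends on whether their locations are identical: 1 if identical, 0 otherwise.

(3) The similarity between two line datasets or two polygon datasets is determined by their degree of overlap and is computed with Eq. (6).

The spatial-extent similarity above is an approximation that models how users perceive their data needs and is computed from the semantic relationships between geographic place names. Its precision depends on the level of the place names recorded in the metadata (e.g., county or township level). From the standpoint of users' data needs, the similarity between two spatial extents $i$ and $j$ is directional (Eq. 6): the similarity of $i$ to $j$ differs from that of $j$ to $i$. The resulting spatial-extent similarity matrix is therefore also asymmetric.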
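The three rules can be sketched as follows. The sketch replaces the ontology lookup with a strong simplifying assumption: each extent is given as the set of finest-level place names it covers (a hypothetical representation, not the ontology of the study), so containment and overlap reduce to set operations. The `Extent` class and the function name are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    kind: str         # "point", "line", or "polygon"
    units: frozenset  # hypothetical: finest-level place names covered by the dataset


def spatial_similarity(i: Extent, j: Extent) -> float:
    """Directional similarity of extent i to extent j following rules (1) to (3)."""
    if not i.units:
        return 0.0
    if i.kind == "polygon" and j.kind in ("point", "line"):
        return 0.0                                   # rule (1): polygon to point/line
    if i.kind in ("point", "line") and j.kind == "polygon":
        return 1.0 if i.units <= j.units else 0.0    # rule (1): contained or not
    if i.kind == "point" and j.kind == "point":
        return 1.0 if i.units == j.units else 0.0    # rule (2): identical location
    return len(i.units & j.units) / len(i.units)     # rule (3), Eq. (6)


province = Extent("polygon", frozenset({"A", "B", "C", "D"}))
county = Extent("polygon", frozenset({"A"}))
station = Extent("point", frozenset({"A"}))
```

Here `spatial_similarity(county, province)` is 1.0 while the reverse direction is 0.25, which illustrates why the similarity matrix is asymmetric. A point-to-line pair is not singled out in the rules; the sketch lets it fall through to Eq. (6).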

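2.1.3 Temporal-extent similarity

Because time is one-dimensional, its similarity is relatively simple to compute. The computation must account for the type of the temporal attribute, which is either a time point or a time range. The similarity of temporal attributes A and B is determined by their degree of overlap. With $|A|$ and $|B|$ denoting the lengths of A and B, the similarity of A to B is:

$$Sim_t(A,B) = \frac{|A \cap B|}{|A|} \tag{7}$$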
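Eq. (7) can be sketched for year-based extents as below. The inclusive (start_year, end_year) representation, under which a time point has length 1, is an assumption made for illustration; the source does not state the time unit.

```python
def temporal_similarity(a: tuple, b: tuple) -> float:
    """Eq. (7): length of the overlap of A and B divided by the length of A.

    Extents are (start_year, end_year) with both ends inclusive (an assumed
    convention), so a time point is written as (year, year).
    """
    a_start, a_end = a
    b_start, b_end = b
    length_a = a_end - a_start + 1
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    return max(overlap, 0) / length_a


half = temporal_similarity((2000, 2009), (2005, 2014))       # 5 of 10 years overlap
point_in_range = temporal_similarity((2005, 2005), (2000, 2010))
```

As with the spatial case, a single year inside a decade-long range is fully similar to the range, while the range is only slightly similar to the single year.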

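2.2 Data similarity in collaborative filtering

2.2.1 Item similarity

Collaborative filtering comes in two forms, user-based (User-based CF) and item-based (Item-based CF). User-based collaborative filtering predicts a user's preferences from the group of users who share the user's interests; item-based collaborative filtering computes item similarity from the ratings users have given in common and predicts preferences from the user's history. A scientific data sharing platform provides a highly specialized service whose users come mainly from universities and research institutes. Researchers tend to keep their research interests for some time and remain interested in scientific data on a given theme. From this standpoint, item-based collaborative filtering suits this application better.

Cosine similarity is the most commonly used similarity measure in item-based collaborative filtering [41]. However, it ignores differences in users' rating habits: some users give high ratings readily, while others rarely do. Adjusted cosine similarity overcomes this problem. Let $U$ be the set of users who have rated both item $a$ and item $b$, $r_{u,a}$ the rating of user $u$ for item $a$, and $\bar{r}_u$ the mean of all ratings by user $u$. The cosine similarity $sim^{cos}$ and the adjusted cosine similarity $sim^{adj\_cos}$ are:

$$sim^{cos}_{a,b} = \frac{\sum_{u \in U} r_{u,a} \times r_{u,b}}{\sqrt{\sum_{u \in U} r_{u,a}^2}\,\sqrt{\sum_{u \in U} r_{u,b}^2}} \tag{8}$$

$$sim^{adj\_cos}_{a,b} = \frac{\sum_{u \in U} (r_{u,a} - \bar{r}_u)(r_{u,b} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,a} - \bar{r}_u)^2}\,\sqrt{\sum_{u \in U} (r_{u,b} - \bar{r}_u)^2}} \tag{9}$$

Adjusted cosine similarity ranges from -1 to 1, whereas the content-based item similarity of Section 2.1 ranges from 0 to 1. To fuse the two recommendation methods, their similarity ranges must match. To avoid this problem, this study corrects for users' rating habits when deriving the ratings and then uses cosine similarity (range 0 to 1) to compute item similarity. The rating derivation is described in the next subsection.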
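The cosine similarity of Eq. (8) can be sketched as below for ratings held in dictionaries keyed by user. Returning the number of co-rating users alongside the similarity is an illustrative choice, since that count is what the co-rating level later builds on; the function name and data layout are assumptions.

```python
import math


def item_cosine(ratings_a: dict, ratings_b: dict):
    """Eq. (8) over U, the users who rated both items.

    ratings_a / ratings_b map user id -> rating (1 to 5) for items a and b.
    Returns (similarity, number of co-rating users).
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0, 0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings_b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b), len(common)


sim, co_count = item_cosine({"u1": 5, "u2": 3}, {"u1": 5, "u2": 3, "u3": 1})
```

With strictly positive ratings the result stays within 0 to 1, which is the property the text relies on when it prefers plain cosine over the adjusted variant.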

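2.2.2 Item ratings

Commercial websites usually infer user interest from behaviors such as rating, browsing, bookmarking, and purchasing. A scientific data sharing platform can likewise infer interest from browsing and downloading. This study targets the behavior of all users, both anonymous and registered. Some shared datasets cannot be downloaded directly, and the platform offers no explicit rating system. This study therefore infers ratings from browsing time, using Jenks Natural Breaks, a classification method from cartography. The method minimizes the variation within each class and can be regarded as a one-dimensional k-means algorithm. Applied per user, it removes differences in users' browsing habits.

First, the cumulative time each user has spent on each dataset is obtained from the log data. Then, for each user, Jenks Natural Breaks divides the browsing times into five classes, which correspond to ratings of 1 to 5. Table 1 gives an example for one user: the cumulative browsing times range from 1 to 30 min, and the five classes are [1,2], (2,5], (5,7], (7,13], and (13,30].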

表1   Jenks Natural Breaks划分用户浏览时间示例

Tab. 1   Jenks Natural Breaks for item rating assignment

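| Dataset | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Browsing time (min) | 13 | 2 | 1 | 7 | 30 | 25 | 5 |
| Rating | 4 | 1 | 1 | 3 | 5 | 5 | 2 |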
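The mapping from browsing time to rating in Table 1 can be sketched as below. The sketch takes the five class limits quoted in the text as given and does not compute them; deriving the breaks would need a Jenks Natural Breaks implementation, which is not shown. The function name is illustrative.

```python
from bisect import bisect_left


def rating_from_breaks(minutes: float, upper_bounds: list) -> int:
    """Map a browsing time to a rating of 1..len(upper_bounds).

    upper_bounds holds the inclusive upper limit of each class in ascending
    order, e.g. [2, 5, 7, 13, 30] for [1,2], (2,5], (5,7], (7,13], (13,30].
    """
    idx = bisect_left(upper_bounds, minutes)
    return min(idx, len(upper_bounds) - 1) + 1  # times above the last limit get the top rating


breaks = [2, 5, 7, 13, 30]  # class limits quoted in the text
times = {"A": 13, "B": 2, "C": 1, "D": 7, "E": 30, "F": 25, "G": 5}
ratings = {name: rating_from_breaks(t, breaks) for name, t in times.items()}
# ratings == {"A": 4, "B": 1, "C": 1, "D": 3, "E": 5, "F": 5, "G": 2}
```

Because the breaks are computed separately for every user, the same browsing time can map to different ratings for different users.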


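2.3 Dynamically weighted hybrid filtering model

The proposed hybrid model dynamically weights the predictions of collaborative filtering and content-based filtering, yielding a different rating-prediction model for different users. Let $pred_{CF}$ and $pred_{CBF}$ be the ratings that collaborative filtering and content-based filtering predict for user $u$ and dataset $i$, and let $\beta$ be the weight of content-based filtering, so that the weight of collaborative filtering is $(1-\beta)$. The hybrid prediction is:

$$pred^{weighted}_{u,i} = \beta \times pred_{CBF} + (1-\beta) \times pred_{CF} \tag{10}$$

The ratings predicted by the two methods must share the same range. Both use the k-nearest-neighbor (k-NN) method to compute predicted ratings. The predicted rating of user $u$ for dataset $i$ is:

$$pred_{u,i} = \frac{\sum_{r} sim(r,i) \times r_{u,r}}{\sum_{r} sim(r,i)} \tag{11}$$

where $r$ ranges over the datasets that user $u$ has rated, and $sim(r,i)$ is the similarity between dataset $r$ and dataset $i$. As noted above, in collaborative filtering the similarity between two datasets depends on the ratings that users have given to both. The more users have rated both datasets, the higher the confidence of the similarity, and the higher the weight collaborative filtering should receive. How the number of co-ratings should change the weight to obtain the best model is, however, unknown. This study fits a regression model to the relationship between the weight and the number of co-ratings and then applies the model in the dynamic recommendation model. From Eq. (10), the weight $\beta$ can be written as:

$$\beta = \frac{pred^{weighted}_{u,i} - pred_{CF}}{pred_{CBF} - pred_{CF}} \tag{12}$$

If $pred^{weighted}_{u,i}$ is set to the actual rating of user $u$ for dataset $i$, the resulting $\beta$ is the ideal weight. This study defines the co-rating level (CL) as the mean number of co-ratings over the $k$ nearest-neighbor datasets used to predict the rating in collaborative filtering:

$$CL(u,i) = \frac{\sum_{j=1}^{k} cn_j}{k} \tag{13}$$

where $cn_j$ is the number of co-ratings of dataset $i$ and neighbor dataset $j$. In this way $\beta$ and $CL$ can be computed for the training samples. In theory $\beta$ lies in [0,1], but because of the uncertainty of user behavior the computed $\beta$ may fall outside this range; this study treats such values as invalid. The valid $\beta$ and $CL$ values of the samples are used to fit the relationship:

$$\beta = f(CL) \tag{14}$$

Substituting this fitted function into Eq. (10) gives the dynamic prediction model.

In collaborative filtering, the ratings derived from browsing history have five levels, so the method needs enough browsed datasets (at least 5) to distinguish degrees of user interest. If a user has browsed fewer than 5 datasets, only content-based filtering is used to predict ratings. In addition, as $CL$ increases the weight of collaborative filtering increases until it reaches 1. Let $thre$ be the $CL$ threshold at which the weight of collaborative filtering is 1, and $n_u$ the number of datasets user $u$ has visited. The final prediction model is:

$$pred_{u,i} = \begin{cases} pred_{CBF}(u,i), & n_u < 5 \\ pred^{weighted}_{u,i}, & n_u \ge 5,\ CL < thre \\ pred_{CF}(u,i), & CL > thre \end{cases} \tag{15}$$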
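Eqs. (10), (11), (13), and (15) can be sketched together as below. This is a hedged illustration rather than the authors' implementation: all function names and data layouts are assumptions, `beta_of_cl` stands for whatever fitted function f(CL) is supplied, and beta is used as the weight of the content-based prediction exactly as Eq. (10) defines it. Clamping beta to [0, 1], falling back to the content-based prediction when no collaborative prediction exists, and treating CL equal to the threshold as the weighted case are further assumptions.

```python
def knn_predict(user_ratings: dict, sims_to_target: dict, k: int):
    """Eq. (11): similarity-weighted mean over the k rated datasets most similar to the target.

    user_ratings: {dataset: rating given by the user}
    sims_to_target: {dataset: sim(dataset, target)}
    Returns (prediction or None, list of neighbour datasets used).
    """
    candidates = [d for d in user_ratings if sims_to_target.get(d, 0.0) > 0.0]
    neighbours = sorted(candidates, key=lambda d: sims_to_target[d], reverse=True)[:k]
    denom = sum(sims_to_target[d] for d in neighbours)
    if denom == 0:
        return None, neighbours
    num = sum(sims_to_target[d] * user_ratings[d] for d in neighbours)
    return num / denom, neighbours


def co_rating_level(neighbours: list, co_counts: dict, k: int) -> float:
    """Eq. (13): co-rating counts between the target and its neighbours, summed and divided by k."""
    return sum(co_counts.get(d, 0) for d in neighbours) / k


def hybrid_predict(pred_cbf, pred_cf, n_u, cl, beta_of_cl, thre):
    """Eq. (15), with beta weighting the content-based prediction as in Eq. (10)."""
    if n_u < 5 or pred_cf is None:
        return pred_cbf                      # too little history for collaborative filtering
    if cl > thre:
        return pred_cf                       # co-rating level above the threshold
    beta = min(max(beta_of_cl(cl), 0.0), 1.0)
    return beta * pred_cbf + (1.0 - beta) * pred_cf


pred, used = knn_predict({"a": 4, "b": 2, "c": 5}, {"a": 0.5, "b": 0.5, "c": 0.1}, k=2)
level = co_rating_level(used, {"a": 10, "b": 20}, k=2)
```

In a full pipeline `knn_predict` would be called twice per user-dataset pair, once with the collaborative similarities and once with the content-based ones, before `hybrid_predict` combines the two results.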

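3 Data sources and experimental design

3.1 Data sources

3.1.1 Server log data

Server logs are the source of user behavior data in this study. The 2015 server logs, 12,062,607 records in total, were used in the experiment. The logs are stored in NCSA ECLF format; each record contains the user IP, access time, method, requested URL, status, referrer, client information, and so on.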
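A record in this format can be parsed with a regular expression, as in the sketch below. It assumes the common "combined" layout of the extended common log format (host, identity, user, bracketed timestamp, request line, status, size, referrer, user agent); the sample line, host, and URLs are invented for illustration and do not come from the platform's logs.

```python
import re
from datetime import datetime

# Assumed combined-log layout; the field names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str):
    """Return a dict of fields for one log record, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec


# Hypothetical record, not taken from the study's data.
sample = ('192.0.2.1 - - [10/Oct/2015:13:55:36 +0800] '
          '"GET /dataset/123 HTTP/1.1" 200 2326 '
          '"http://example.org/search" "Mozilla/5.0"')
record = parse_line(sample)
```

The parsed fields are what the later cleaning steps operate on, for example dropping records whose URL points to an image or whose status marks a failed request.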

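3.1.2 Dataset metadata

The metadata of a geospatial dataset describe its thematic content, spatial extent, temporal extent, and other information, and are the source for computing dataset similarity in content-based filtering. From the several thousand datasets shared on the Earth System Science Data Sharing Platform, 200 sample datasets were selected at random, and their classification, keywords, spatial extent, and temporal extent were extracted from the metadata.

3.1.3 Geospatial ontology

This study uses the geospatial ontology that Wang Dongxu et al. developed for geoscience data sharing [42]. With an ontology query tool, the spatial topological relationships between geographic place names can be retrieved and used to compute spatial similarity between datasets.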

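3.2 Data preprocessing

The raw data contain a great deal of redundant information and cannot be used directly, so extensive preprocessing was performed. For the server logs, the steps were data cleaning, user identification, session identification, and access-time computation. Data cleaning removes records irrelevant to the mining task, including browser requests for images and style files, requests from web crawlers, and failed requests. User identification distinguishes individual users. Session identification then divides each user's accesses into separate visit periods, each treated as one complete visit. The preprocessing follows the method the authors developed for geoscience data sharing platforms [43], which has performed well in applied studies [44]. After session identification, access time is computed from the timestamps of adjacent access records.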
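The last step, computing access time from adjacent timestamps, can be sketched as below. The source does not describe how sessions are delimited, so the 30-minute inactivity gap is a common heuristic assumed here, not the authors' method [43]; skipping the final page of a session, whose duration cannot be observed, is likewise an assumption.

```python
from collections import defaultdict


def dwell_minutes(visits: list, session_gap_min: float = 30) -> dict:
    """Accumulate browsing time per dataset for ONE user.

    visits: list of (timestamp_in_minutes, dataset_id), sorted by time.
    The time spent on a page is the gap to the user's next request, as long
    as both requests fall in the same session (gap <= session_gap_min).
    """
    total = defaultdict(float)
    for (t0, d0), (t1, _) in zip(visits, visits[1:]):
        gap = t1 - t0
        if gap <= session_gap_min:
            total[d0] += gap
        # otherwise the request at t0 closed a session; its duration is unknown and skipped
    return dict(total)


visits = [(0, "A"), (5, "B"), (12, "A"), (100, "C"), (103, "A")]
time_per_dataset = dwell_minutes(visits)
# time_per_dataset == {"A": 5.0, "B": 7.0, "C": 3.0}
```

The per-dataset totals are the cumulative browsing times that the rating step then classifies into five levels.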

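In addition, the metadata of the geospatial datasets had to be extracted and converted. The metadata tables record the spatial, temporal, and thematic information as text. An extraction and conversion program was developed for the table format to obtain each dataset's spatial descriptors, temporal-extent description, and thematic keywords.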

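3.3 Experimental design

A total of 200 datasets shared on the platform were selected at random, and users' ratings for them were computed from their access history. After preprocessing, 117,375 ratings from 7,287 active users were obtained. Of these ratings, 70% formed the training set used to compute similarities in the recommender, 10% were used to model the relationship between the weight and CL (the modeling set), and the remaining 20% were used to test recommendation performance (the test set). The recommendation algorithm was implemented in Python. The Java Jena framework was used to query the geospatial ontology when computing dataset similarity for content-based filtering.

The similarity methods of collaborative filtering and content-based filtering were each applied to the training set, giving one dataset similarity matrix per method. While the collaborative similarity matrix was computed, a CL matrix was also recorded, storing how many co-ratings each collaborative similarity was based on. The k-NN algorithm was then used to compute the collaborative and content-based predicted ratings for the user-dataset pairs in the modeling set. From Eq. (10), the ideal weight is:

$$\beta = \frac{r(u,i) - pred_{CF}}{pred_{CBF} - pred_{CF}} \tag{16}$$

where $r(u,i)$ is the user's actual rating, $pred_{CF}$ the rating predicted by collaborative filtering, and $pred_{CBF}$ the rating predicted by content-based filtering. Once the ideal weights $\beta$ are obtained, the relationship between $\beta$ and $CL$ can be fitted from the corresponding $CL$ values. The fitted function is then applied to Eq. (10) to obtain the dynamically weighted hybrid model, which is applied to the test set to evaluate recommendation performance.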
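The ideal weight of Eq. (16), together with the rule that values outside [0, 1] are discarded, can be sketched as below. The function name is illustrative, and returning `None` when the two predictions coincide (where the ratio is undefined) is an assumption.

```python
def ideal_beta(actual: float, pred_cf: float, pred_cbf: float):
    """Eq. (16): the weight that would reproduce the actual rating exactly.

    Returns None when the two predictions coincide or when beta falls
    outside [0, 1], which the text treats as invalid.
    """
    if pred_cbf == pred_cf:
        return None
    beta = (actual - pred_cf) / (pred_cbf - pred_cf)
    return beta if 0.0 <= beta <= 1.0 else None


# The actual rating lies halfway between the two predictions.
beta_mid = ideal_beta(3.5, pred_cf=3.0, pred_cbf=4.0)
# The actual rating lies outside the interval spanned by the predictions.
beta_invalid = ideal_beta(5.0, pred_cf=3.0, pred_cbf=4.0)
```

A valid beta exists only when the actual rating lies between the two predictions, which helps explain why many modeling-set samples can end up invalid.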

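For each user in the test set, the method produces the 5 datasets with the highest predicted ratings. Precision and Recall are used to evaluate recommendation performance. The experiments tested collaborative filtering, content-based filtering, and the hybrid model under different values of k (in the k-NN algorithm).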
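Precision and Recall for a top-5 list can be sketched as below. The source does not spell out which test-set datasets count as relevant for a user, so the `relevant` set is left to the caller and is an assumption of this sketch, as is the function name.

```python
def precision_recall_at_n(recommended: list, relevant: set, n: int = 5):
    """Precision and Recall of the first n recommendations for one user.

    recommended: datasets ordered by predicted rating, highest first.
    relevant: datasets the user is considered to want (definition assumed).
    """
    top = recommended[:n]
    hits = sum(1 for d in top if d in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall_at_n(["a", "b", "c", "d", "e"], {"a", "c", "x"})
# p == 0.4 (2 hits in 5 recommendations); r == 2/3 (2 of 3 relevant datasets found)
```

Averaging the two values over all test users gives one Precision and one Recall figure per method and per value of k.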

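4 Results

The ideal-weight calculation shows that for a large share (62%) of the 11,738 ratings in the modeling set, $\beta$ falls outside [0,1]. The remaining 38% were therefore used to fit the relationship between $\beta$ and $CL$. The relationship is best described by a logarithmic function (Fig. 2), given in Eq. (17), with a coefficient of determination ($R^2$) of 0.328.

$$\beta = 0.581 + 0.059 \times \ln(CL) \tag{17}$$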
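A logarithmic model of this form can be fitted by ordinary least squares on ln(CL), as sketched below. The fitting routine is a generic assumption, since the source does not state which regression tool was used, and the three sample points are synthetic values placed exactly on the reported curve, not data from the study.

```python
import math


def fit_log_model(cl_values: list, beta_values: list):
    """Least-squares fit of beta = a + b * ln(CL); returns (a, b)."""
    xs = [math.log(c) for c in cl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(beta_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, beta_values))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def cl_threshold(a: float, b: float) -> float:
    """CL at which the fitted beta reaches 1, i.e. exp((1 - a) / b)."""
    return math.exp((1.0 - a) / b)


# Synthetic points lying on beta = 0.581 + 0.059 * ln(CL).
a, b = fit_log_model([1.0, math.e, math.e ** 2], [0.581, 0.640, 0.699])
threshold = cl_threshold(0.581, 0.059)
```

With the rounded coefficients of Eq. (17) the threshold comes out at roughly 1.2 × 10³, close to the reported 1211; the small gap is consistent with rounding of the published coefficients. Note that Eq. (10) defines β as the weight of the content-based prediction, whereas the threshold is read in the text as the point where collaborative filtering is used alone; the sketch reproduces only the arithmetic and does not settle that reading.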

图2   理想权重β和CL散点及拟合图

Fig. 2   Scatter plot of ideal weight and co-rating level (CL)

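Setting $\beta = 1$ gives a $CL$ threshold of 1211, meaning that only the collaborative filtering result is used when $CL > 1211$. The rating-prediction equation obtained from this experiment is:

$$pred_{u,i} = \begin{cases} pred_{CBF}, & n_u < 5 \\ (0.581 + 0.059 \times \ln(CL)) \times pred_{CBF} + (0.419 - 0.059 \times \ln(CL)) \times pred_{CF}, & n_u \ge 5,\ CL \le 1211 \\ pred_{CF}, & CL > 1211 \end{cases} \tag{18}$$

The recommendation procedure is as follows. The program first checks the user's rating history. If the number of historical ratings is below 5, only content-based filtering is used; if the number is at least 5 and the $CL$ of the dataset whose rating is being predicted is at most 1211, the hybrid algorithm is used; if $CL > 1211$, only collaborative filtering is used.

Figure 3 compares recommendation performance. The hybrid model improves markedly on both collaborative filtering and content-based filtering in Precision and Recall. The hybrid model performs best at k = 10, with a Precision of 0.271 and a Recall of 0.424; collaborative filtering performs best at k = 15, with a Precision of 0.216 and a Recall of 0.338; content-based filtering performs best at k = 10, with a Precision of 0.153 and a Recall of 0.239.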

图3   准确度和召回率评价结果

Fig. 3   Precision (left) and Recall (right) evaluation of CBF, item-based CF and proposed Hybrid approach

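5 Conclusions and discussion

This study proposed a hybrid recommendation algorithm for geospatial data. It addresses similarity computation in content-based filtering through three components: spatial-extent similarity, temporal-extent similarity, and thematic similarity. The experiments show that the proposed dynamically weighted hybrid filtering clearly outperforms collaborative filtering or content-based filtering alone. Using the spatiotemporal attributes of geospatial data as inputs to the recommender improves recommendation performance, and the approach can be applied to geospatial data web services. The use of Jenks Natural Breaks to differentiate degrees of user interest may also be useful for user behavior research in other fields.

Personalized recommendation of geoscience data involves harder spatial computations than the recommendation of text or multimedia content. Users also place higher demands on what is recommended, and their needs are poorly substitutable. The experiments show that even when spatial and temporal extents are included in the similarity computation, content-based filtering alone still performs worse than collaborative filtering and the hybrid method. This reflects how difficult it is to predict users' data needs, possibly because users weigh many factors that are hard to quantify, such as data source and data quality, when acquiring geospatial data.

Acknowledgments: We thank the Earth System Science Data Sharing Platform of the National Science and Technology Infrastructure Platform for providing data support for this study.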

The authors have declared that no competing interests exist.


参考文献

[1] Tenopir C, Allard S, Douglass K, et al.

Data sharing by scientists: Practices and perceptions

. Plos One, 2011, 6(6): e21101.

https://doi.org/10.1371/journal.pone.0021101      URL      PMID: 21738610      [本文引用: 1]      摘要

Scientific research in the 21st century is more data intensive and collaborative than in the past. It is important to study the data practices of researchers--data accessibility, discovery, re-use, preservation and, particularly, data sharing. Data sharing is a valuable part of the scientific method allowing for verification of results and extending research from prior results. A total of 1329 scientists participated in this survey exploring current data sharing practices and perceptions of the barriers and enablers of data sharing. Scientists do not make their data electronically available to others for various reasons, including insufficient time and lack of funding. Most respondents are satisfied with their current processes for the initial and short-term parts of the data or research lifecycle (collecting their research data; searching for, describing or cataloging, analyzing, and short-term storage of their data) but are not satisfied with long-term data preservation. Many organizations do not provide support to their researchers for data management both in the short- and long-term. If certain conditions are met (such as formal citation and sharing reprints) respondents agree they are willing to share their data. There are also significant differences and approaches in data management practices based on primary funding agency, subject discipline, age, work focus, and world region. Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves. New mandates for data management plans from NSF and other federal agencies and world-wide attention to the need to share and preserve data could lead to changes. Large scale programs, such as the NSF-sponsored DataNET (including projects like DataONE) will both bring attention and resources to the issue and make it easier for scientists to apply sound data management principles.
[2] Kaššák O, Kompan M, Bieliková M.

Personalized hybrid recommendation for group of users: Top-N multimedia recommender

. Information Processing & Management, 2016, 52(3): 459-477.

https://doi.org/10.1016/j.ipm.2015.10.001      URL      [本文引用: 1]      摘要

Nowadays, the increasing demand for group recommendations can be observed. In this paper we address the problem of recommendation performance for groups of users (group recommendation). We focus on the performance of very Top-N recommendations, which are important when recommending the long lasting items (only a few such items are consumed per session, e.g. movie). To improve existing group recommenders we propose a mixed hybrid recommender for groups combining content-based and collaborative strategies. The principle of proposed group recommender is to generate content and collaborative recommendations for each user, apply an aggregation strategy to solve the group conflict preferences for the content and collaborative sets separately, and finally reorder the collaborative candidates based on the content-based ones. It is based on an idea that candidates recommended by both recommendation strategies at the same time are presumably more appropriate for the group than the candidates recommended by individual strategies. The evaluation is performed by several experiments in the multimedia domain (as typical representative for group recommendations). Both, online and offline experiments were performed in order to compare real users鈥 satisfaction to the standard group recommenders and also, to compare performance of proposed approach to the state-of-the-art recommenders based on the MovieLens dataset. Finally, we experimented with the proposed hybrid recommender to generate the recommendation for a group of size one (i.e. single user recommendation). Obtained results, support our hypothesis that proposed mixed hybrid approach improves the precision of the recommendation for groups of users and for the single-user recommendation respectively on very Top-N recommended items.
[3] Lee S K, Cho Y H, Kim S H.

Collaborative filtering with ordinal scale-based implicit ratings for mobile music recommendations

. Information Sciences, 2010, 180(11): 2142-2155.

https://doi.org/10.1016/j.ins.2010.02.004      URL      [本文引用: 1]      摘要

Collaborative filtering (CF)-based recommender systems represent a promising solution for the rapidly growing mobile music market. However, in the mobile Web environment, a traditional CF system that uses explicit ratings to collect user preferences has a limitation: mobile customers find it difficult to rate their tastes directly because of poor interfaces and high telecommunication costs. Implicit ratings are more desirable for the mobile Web, but commonly used cardinal (interval, ratio) scales for representing preferences are also unsatisfactory because they may increase estimation errors. In this paper, we propose a CF-based recommendation methodology based on both implicit ratings and less ambitious ordinal scales. A mobile Web usage mining (mWUM) technique is suggested as an implicit rating approach, and a specific consensus model typically used in multi-criteria decision-making (MCDM) is employed to generate an ordinal scale-based customer profile. An experiment with the participation of real mobile Web customers shows that the proposed methodology provides better performance than existing CF algorithms in the mobile Web environment.
[4] Wei S, Zheng X, Chen D, et al.

A hybrid approach for movie recommendation via tags and ratings

. Electronic Commerce Research & Applications, 2016, 18(C): 83-94.

https://doi.org/10.1016/j.elerap.2016.01.003      URL      [本文引用: 2]      摘要

Selecting a movie often requires users to perform numerous operations when faced with vast resources from online movie platforms. Personalized recommendation services can effectively solve this problem by using annotating information from users. However, such current services are less accurate than expected because of their lack of comprehensive consideration for annotation. Thus, in this study, we propose a hybrid movie recommendation approach using tags and ratings. We built this model through the following processes. First, we constructed social movie networks and a preference-topic model. Then, we extracted, normalized, and reconditioned the social tags according to user preference based on social content annotation. Finally, we enhanced the recommendation model by using supplementary information based on user historical ratings. This model aims to improve fusion ability by applying the potential effect of two aspects generated by users. One aspect is the personalized scoring system and the singular value decomposition algorithm, the other aspect is the tag annotation system and topic model. Experimental results show that the proposed method significantly outperforms three categories of recommendation approaches, namely, user-based collaborative filtering (CF), model-based CF, and topic model based CF.
[5] Bobadilla J, Serradilla F, Hernando A.

Collaborative filtering adapted to recommender systems of e-learning

. Knowledge-Based Systems, 2009, 22(4): 261-265.

https://doi.org/10.1016/j.knosys.2009.01.008      URL      [本文引用: 1]      摘要

In the context of e-learning recommender systems, we propose that the users with greater knowledge (for example, those who have obtained better results in various tests) have greater weight in the calculation of the recommendations than the users with less knowledge. To achieve this objective, we have designed some new equations in the nucleus of the memory-based collaborative filtering, in such a way that the existent equations are extended to collect and process the information relative to the scores obtained by each user in a variable number of level tests.
[6] Zaíane O R.

Building a recommender agent for e-learning systems. Computers in Education, 2002. Proceedings. International Conference on

. IEEE, 2002: 55-59.

[本文引用: 1]     

[7] Huang Z, Zeng D, Chen H.

A comparison of collaborative-filtering recommendation algorithms for e-commerce

. IEEE Intelligent Systems, 2007, 22(5): 68-78.

https://doi.org/10.1109/MIS.2007.4338497      URL      [本文引用: 1]     

[8] Jcastroschez J, Miguel R, Vallejo D, et al.

A highly adaptive recommender system based on fuzzy logic for B2C e-commerce portals

. Expert Systems with Applications, 2011, 38(3): 2441-2454.

https://doi.org/10.1016/j.eswa.2010.08.033      URL      [本文引用: 1]      摘要

Past years have witnessed a growing interest in e-commerce as a strategy for improving business. Several paradigms have arisen from the e-commerce field in recent years which try to support different business activities, such as B2C and C2C. This paper introduces a prototype of e-commerce portal, called e-Zoco, of which main features are: (i) a catalogue service intended to arrange product categories hierarchically and describe them through sets of attributes, (ii) a product selection service able to deal with imprecise and vague search preferences which returns a set of results clustered in accordance with their potential relevance to the user, and (iii) a rule-based knowledge learning service to provide the users with knowledge about the existing relationships among the attributes that describe a given product category. The portal prototype is supported by a multi-agent infrastructure composed of a set of agents responsible for providing these and other services.
[9] He Q, Jiang D, Liao Z, et al.

Web query recommendation via sequential query prediction. Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on

. IEEE, 2009: 1443-1454.

[本文引用: 1]     

[10] Mcnally K, Coyle M, Briggs P, et al.

A case study of collaboration and reputation in social web search

. Acm Transactions on Intelligent Systems & Technology, 2011, 3(1): 1-29.

https://doi.org/10.1145/2036264.2036268      URL      [本文引用: 1]      摘要

Although collaborative searching is not supported by mainstream search engines, recent research has highlighted the inherently collaborative nature of many Web search tasks. In this article, we describe HeyStaks, a collaborative Web search framework that is designed to complement mainstream search engines. At search time, HeyStaks learns from the search activities of other users and leverages this information to generate recommendations based on results that others have found relevant for similar searches. The key contribution of this article is to extend the HeyStaks social search model by considering the search expertise, or reputation, of HeyStaks users and using this information to enhance the result recommendation process. In particular, we propose a reputation model for HeyStaks users that utilise the implicit collaboration events that take place between users as recommendations are made and selected. We describe a live-user trial of HeyStaks that demonstrates the relevance of its core recommendations and the ability of the reputation model to further improve recommendation quality. Our findings indicate that incorporating reputation into the recommendation process further improves the relevance of HeyStaks recommendations by up to 40&percnt;.
[11] De Campos L M, Fernandezluna J M, Huete J F, et al.

Combining content-based and collaborative recommendations: A hybrid approach based on Bayesian networks

. International Journal of Approximate Reasoning, 2010, 51(7): 785-799.

https://doi.org/10.1016/j.ijar.2010.04.001      URL      [本文引用: 1]      摘要

Recommender systems enable users to access products or articles that they would otherwise not be aware of due to the wealth of information to be found on the Internet. The two traditional recommendation techniques are content-based and collaborative filtering. While both methods have their advantages, they also have certain disadvantages, some of which can be solved by combining both techniques to improve the quality of the recommendation. The resulting system is known as a hybrid recommender system. In the context of artificial intelligence, Bayesian networks have been widely and successfully applied to problems with a high level of uncertainty. The field of recommendation represents a very interesting testing ground to put these probabilistic tools into practice. This paper therefore presents a new Bayesian network model to deal with the problem of hybrid recommendation by combining content-based and collaborative features. It has been tailored to the problem in hand and is equipped with a flexible topology and efficient mechanisms to estimate the required probability distributions so that probabilistic inference may be performed. The effectiveness of the model is demonstrated using the MovieLens and IMDB data sets.
[12] Burke R.

Hybrid recommender systems: Survey and experiments

. User Modeling and User-Adapted Interaction, 2002, 12(4): 331-370.

https://doi.org/10.1023/A:1021240730564      URL      [本文引用: 1]     

[13] Porcel C, Tejeda-Lorente A, Martínez M A, et al.

A hybrid recommender system for the selective dissemination of research resources in a technology transfer office

. Information Sciences, 2012, 184(1): 1-19.

https://doi.org/10.1016/j.ins.2011.08.026      URL      [本文引用: 1]      摘要

Recommender systems could be used to help users in their access processes to relevant information. Hybrid recommender systems represent a promising solution for multiple applications. In this paper we propose a hybrid fuzzy linguistic recommender system to help the Technology Transfer Office staff in the dissemination of research resources interesting for the users. The system recommends users both specialized and complementary research resources and additionally, it discovers potential collaboration possibilities in order to form multidisciplinary working groups. Thus, this system becomes an application that can be used to help the Technology Transfer Office staff to selectively disseminate the research knowledge and to increase its information discovering properties and personalization capacities in an academic environment.
[14] Li Q, Kim B M.

Clustering approach for hybrid recommender system

. Web Intelligence, 2003: 33-38.

https://doi.org/10.1109/WI.2003.1241167      URL      [本文引用: 4]      摘要

Recommender system is a kind of Web intelligence techniques to make a daily information filtering for people. Clustering techniques have been applied to the item-based collaborative filtering framework to solve the cold start problem. It also suggests a way to integrate the content information into the collaborative filtering. Extensive experiments have been conducted on MovieLens data to analyze the characteristics of our technique. The results show that our approach contributes to the improvement of prediction quality of the item-based collaborative filtering, especially for the cold start problem.
[15] Spiegel S, Kunegis J, Li F, et al.

Hydra: A hybrid recommender system [cross-linked rating and content information]

. Conference on Information and Knowledge Management, 2009: 75-80.

https://doi.org/10.1145/1651274.1651289      URL      [本文引用: 1]      摘要

This paper discusses the combination of collaborative and content-based filtering in the context of web-based recommender systems. In particular, we link the well-known MovieLens rating data with supplementary IMDB content information. The resulting network of user-item relations and associated content features is converted into a unified mathematical model, which is applicable to our underlying neighbor-based prediction algorithm. By means of various experiments, we demonstrate the influence of supplementary user as well as item features on the prediction accuracy of Hydra, our proposed hybrid recommender. In order to decrease system runtime and to reveal latent user and item relations, we factorize our hybrid model via singular value decomposition (SVD).
[16] Alejandro Bellogín, Castells P, Chavarriaga E.

An empirical comparison of social, collaborative filtering, and hybrid recommenders

. Acm Transactions on Intelligent Systems & Technology, 2013, 4(1): 1-29.

https://doi.org/10.1145/2414425.2414439      URL      [本文引用: 2]      摘要

In the Social Web, a number of diverse recommendation approaches have been proposed to exploit the user generated contents available in the Web, such as rating, tagging, and social networking information. In general, these approaches naturally require the availability of a wide amount of these user preferences. This may represent an important limitation for real applications, and may be somewhat unnoticed in studies focusing on overall precision, in which a failure to produce recommendations gets blurred when averaging the obtained results or, even worse, is just not accounted for, as users with no recommendations are typically excluded from the performance calculations. In this article, we propose a coverage metric that uncovers and compensates for the incompleteness of performance evaluations based only on precision. We use this metric together with precision metrics in an empirical comparison of several social, collaborative filtering, and hybrid recommenders. The obtained results show that a better balance between precision and coverage can be achieved by combining social-based filtering (high accuracy, low coverage) and collaborative filtering (low accuracy, high coverage) recommendation techniques. We thus explore several hybrid recommendation approaches to balance this trade-off. In particular, we compare, on the one hand, techniques integrating collaborative and social information into a single model, and on the other, linear combinations of recommenders. For the last approach, we also propose a novel strategy to dynamically adjust the weight of each recommender on a user-basis, utilizing graph measures as indicators of the target user's connectedness and relevance in a social network.
[17] Bobadilla J, Ortega F, Hernando A, et al. Recommender systems survey. Knowledge-Based Systems, 2013, 46: 109-132. https://doi.org/10.1016/j.knosys.2013.03.012

[18] Billsus D, Pazzani M J. User modeling for adaptive news access. User Modeling and User-Adapted Interaction, 2000, 10(2): 147-180. https://doi.org/10.1023/A:1026501525781

We present a framework for adaptive news access, based on machine learning techniques specifically designed for this task. First, we focus on the system's general functionality and system architecture. We then describe the interface and design of two deployed news agents that are part of the described architecture. While the first agent provides personalized news through a web-based interface, the second system is geared towards wireless information devices such as PDAs (personal digital assistants) and cell phones. Based on implicit and explicit user feedback, our agents use a machine learning algorithm to induce individual user models. Motivated by general shortcomings of other user modeling systems for Information Retrieval applications, as well as the specific requirements of news classification, we propose the induction of hybrid user models that consist of separate models for short-term and long-term interests. Furthermore, we illustrate how the described algorithm can be used to address an important issue that has thus far received little attention in the Information Retrieval community: a user's information need changes as a direct result of interaction with information. We empirically evaluate the system's performance based on data collected from regular system users. The goal of the evaluation is not only to understand the performance contributions of the algorithm's individual components, but also to assess the overall utility of the proposed user modeling techniques from a user perspective. Our results provide empirical evidence for the utility of the hybrid user model, and suggest that effective personalization can be achieved without requiring any extra effort from the user.
[19] Claypool M, Gokhale A, Miranda T, et al. Combining content-based and collaborative filters in an online newspaper. Proceedings of ACM SIGIR Workshop on Recommender Systems, 1999: 60.

The explosive growth of mailing lists, Web sites and Usenet news demands effective filtering solutions. Collaborative filtering combines the informed opinions of humans to make personalized, accurate predictions. Content-based filtering uses the speed of computers to make complete, fast predictions. In this work, we present a new filtering approach that combines the coverage and speed of content-filters with the depth of collaborative filtering. We apply our research approach to an online newspaper, an as yet untapped opportunity for filters useful to the wide-spread news reading populace. We present the design of our filtering system and describe the results from preliminary experiments that suggest merits to our approach.
[20] Marx P, Hennig-Thurau T, Marchand A, et al. Increasing consumers' understanding of recommender results: A preference-based hybrid algorithm with strong explanatory power. Conference on Recommender Systems, 2010: 297-300. https://doi.org/10.1145/1864708.1864771

Recommender systems are intended to assist consumers by making choices from a large scope of items. While most recommender research focuses on improving the accuracy of recommender algorithms, this paper stresses the role of explanations for recommended items for gaining acceptance and trust. Specifically, we present a method which is capable of providing detailed explanations of recommendations while exhibiting reasonable prediction accuracy. The method models the users' ratings as a function of their utility part-worths for those item attributes which influence the users' evaluation behavior, with part-worth being estimated through a set of auxiliary regressions and constrained optimization of their results. We provide evidence that under certain conditions the proposed method is superior to established recommender approaches not only regarding its ability to provide detailed explanations but also in terms of prediction accuracy. We further show that a hybrid recommendation algorithm can rely on the content-based component for a majority of the users, switching to collaborative recommendation only for about one third of the user base.
[21] Tran T, Cohen R. Hybrid recommender systems for electronic commerce. National Conference on Artificial Intelligence, 2000.

In electronic commerce applications, prospective buyers may be interested in receiving recommendations to assist with their purchasing decisions. Previous research has described two main models for automated recommender systems-collaborative filtering
[22] Melville P, Mooney R J, Nagarajan R, et al. Content-boosted collaborative filtering for improved recommendations. National Conference on Artificial Intelligence, 2002: 187-192.

Most recommender systems use Collaborative Filtering or Content-based methods to predict new items of interest for a user. While both methods have their own advantages, individually they fail to provide good recommendations in many situations. Incorporating components from both methods, a hybrid recommender system can overcome these shortcomings. In this paper, we present an elegant and effective framework for combining content and collaboration. Our approach uses a content-based predictor to enhance existing user data, and then provides personalized suggestions through collaborative filtering. We present experimental results that show how this approach, Content-Boosted Collaborative Filtering, performs better than a pure content-based predictor, pure collaborative filter, and a naive hybrid approach.
[23] de Campos L M, Fernández-Luna J M, Huete J F, et al. Combining content-based and collaborative recommendations: A hybrid approach based on Bayesian networks. International Journal of Approximate Reasoning, 2010, 51(7): 785-799. https://doi.org/10.1016/j.ijar.2010.04.001

Recommender systems enable users to access products or articles that they would otherwise not be aware of due to the wealth of information to be found on the Internet. The two traditional recommendation techniques are content-based and collaborative filtering. While both methods have their advantages, they also have certain disadvantages, some of which can be solved by combining both techniques to improve the quality of the recommendation. The resulting system is known as a hybrid recommender system. In the context of artificial intelligence, Bayesian networks have been widely and successfully applied to problems with a high level of uncertainty. The field of recommendation represents a very interesting testing ground to put these probabilistic tools into practice. This paper therefore presents a new Bayesian network model to deal with the problem of hybrid recommendation by combining content-based and collaborative features. It has been tailored to the problem in hand and is equipped with a flexible topology and efficient mechanisms to estimate the required probability distributions so that probabilistic inference may be performed. The effectiveness of the model is demonstrated using the MovieLens and IMDB data sets.
[24] Fouss F, Pirotte A, Renders J, et al. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(3): 355-369. https://doi.org/10.1109/TKDE.2007.46

This work presents a new perspective on characterizing the similarity between elements of a database or, more generally, nodes of a weighted and undirected graph. It is based on a Markov-chain model of random walk through the database. More precisely, we compute quantities (the average commute time, the pseudoinverse of the Laplacian matrix of the graph, etc.) that provide similarities between any pair of nodes, having the nice property of increasing when the number of paths connecting those elements increases and when the "length" of paths decreases. It turns out that the square root of the average commute time is a Euclidean distance and that the pseudoinverse of the Laplacian matrix is a kernel matrix (its elements are inner products closely related to commute times). A principal component analysis (PCA) of the graph is introduced for computing the subspace projection of the node vectors in a manner that preserves as much variance as possible in terms of the Euclidean commute-time distance. This graph PCA provides a nice interpretation to the "Fiedler vector," widely used for graph partitioning. The model is evaluated on a collaborative-recommendation task where suggestions are made about which movies people should watch based upon what they watched in the past. Experimental results on the MovieLens database show that the Laplacian-based similarities perform well in comparison with other methods. The model, which nicely fits into the so-called "statistical relational learning" framework, could also be used to compute document or word similarities, and, more generally, it could be applied to machine-learning and pattern-recognition tasks involving a relational database
[25] Mooney R J, Roy L. Content-based book recommending using learning for text categorization. Proceedings of the Fifth ACM Conference on Digital Libraries. ACM, 2000: 195-204. https://doi.org/10.1145/336597.336662

Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use collaborative filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.
[26] Vaz P C, Matos D M D, Martins B, et al. Improving a hybrid literary book recommendation system through author ranking. Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries. Washington, DC, USA: ACM, 2012: 387-388. https://doi.org/10.1145/2232817.2232904

Literary reading is an important activity for individuals and can be a long term commitment, making book choice an important task for book lovers and public library users. In this paper, we present a hybrid recommendation system to help readers decide which book to read next. We study book and author recommendations in a hybrid recommendation setting and test our algorithm on the LitRec data set. Our hybrid method combines two item-based collaborative filtering algorithms to predict books and authors that the user will like. Author predictions are expanded into a booklist that is subsequently aggregated with the former book predictions. Finally, the resulting booklist is used to yield the top-n book recommendations. By means of various experiments, we demonstrate that author recommendation can improve overall book recommendation.
[27] Lommatzsch A, Kille B, Kim J W, et al. An adaptive hybrid movie recommender based on semantic data. Proceedings of the 10th Conference on Open Research Areas in Information Retrieval. Lisbon, Portugal: Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, 2013: 217-228.

[28] Domingues M A, Gouyon F, Jorge A M, et al. Combining usage and content in an online music recommendation system for music in the long-tail. World Wide Web, 2012: 925-930. https://doi.org/10.1145/2187980.2188224

In this paper we propose a hybrid music recommender system, which combines usage and content data. We describe an online evaluation experiment performed in real time on a commercial music web site, specialised in content from the very long tail of music content. We compare it against two stand-alone recommenders, the first system based on usage and the second one based on content data. The results show that the proposed hybrid recommender shows advantages with respect to usage- and content-based systems, namely, higher user absolute acceptance rate, higher user activity rate and higher user loyalty.
[29] Yoshii K, Goto M, Komatani K, et al. Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences. International Symposium/Conference on Music Information Retrieval, 2006: 296-301. https://doi.org/10.1007/978-1-4020-5587-4_10

This paper presents a hybrid music recommendation method that solves problems of two prominent conventional methods: collaborative filtering and content-based recommendation. The former cannot recommend musical pieces that have no ratings because recommendations are based on actual user ratings. In addition, artist variety in recommended pieces tends to be poor. The latter, which recommends musical pieces that are similar to users' favorites in terms of music content, has not been fully investigated. This induces unreliability in modeling of user preferences; the content similarity does not completely reflect the preferences. Our method integrates both rating and content data by using a Bayesian network called an aspect model. Unobservable user preferences are directly represented by introducing latent variables, which are statistically estimated. To verify our method, we conducted experiments by using actual audio signals of Japanese songs and the corresponding rating data collected from Amazon. The results showed that our method outperforms the two conventional methods in terms of recommendation accuracy and artist variety and can reasonably recommend pieces even if they have no ratings.
[30] Wen H, Fang L, Guan L. A hybrid approach for personalized recommendation of news on the web. Expert Systems with Applications, 2012, 39(5): 5806-5814. https://doi.org/10.1016/j.eswa.2011.11.087

A hybrid method for personalized recommendation of news on the Web is presented, which provides Web users with an autonomous tool that is able to minimize repetitive and tedious Web surfing. The proposed approach classifies Web pages by calculating the respective weights of terms. A user’s interest and preference models are generated by analyzing the user’s navigational history. Based on the content of the Web pages and on a user’s interest and preference models, the recommender system suggests news Web pages to the user who is likely interested in the related topics. Moreover, the technique of collaborative filtering, which aims to choose the trusted users, is employed to improve the performance of the recommender system. Experiments are carried out in order to demonstrate the effectiveness of the proposed method. In the experiments, Web news items are classified and recommended to Web users by matching the users’ interests with the contents of the news.
[31] Cobos C, Rodriguez O, Rivera J, et al. A hybrid system of pedagogical pattern recommendations based on singular value decomposition and variable data attributes. Information Processing & Management, 2013, 49(3): 607-625. https://doi.org/10.1016/j.ipm.2012.12.002

To carry out effective teaching/learning processes, lecturers in a variety of educational institutions frequently need support. They therefore resort to advice from more experienced lecturers, to formal training processes such as specializations, master or doctoral degrees, or to self-training. High costs in time and money are invariably involved in the processes of formal training, while self-training and advice each bring their own specific risks (e.g. of following new trends that are not fully evaluated or the risk of applying techniques that are inappropriate in specific contexts). This paper presents a system that allows lecturers to define their best teaching strategies for use in the context of a specific class. The context is defined by: the specific characteristics of the subject being treated, the specific objectives that are expected to be achieved in the classroom session, the profile of the students on the course, the dominant characteristics of the teacher, and the classroom environment for each session, among others. The system presented is the Recommendation System of Pedagogical Patterns (RSPP). To construct the RSPP, an ontology representing the pedagogical patterns and their interaction with the fundamentals of the educational process was defined. A web information system was also defined to record information on courses, students, lecturers, etc.; an option based on a unified hybrid model (for content and collaborative filtering) of recommendations for pedagogical patterns was further added to the system. RSPP features a minable view, a tabular structure that summarizes and organizes the information registered in the rest of the system as well as facilitating the task of recommendation. The data recorded in the minable view is taken to a latent space, where noise is reduced and the essence of the information contained in the structure is distilled. This process makes use of Singular Value Decomposition (SVD), commonly used by information retrieval and recommendation systems. Satisfactory results both in the accuracy of the recommendations and in the use of the general application open the door for further research and expand the role of recommender systems in educational teacher support processes.
[32] Salehi M, Kamalabadi I N. Hybrid recommendation approach for learning material based on sequential pattern of the accessed material and the learner's preference tree. Knowledge-Based Systems, 2013, 48: 57-69. https://doi.org/10.1016/j.knosys.2013.04.012

The explosion of the learning materials in personal learning environments has caused difficulties to locate appropriate learning materials to learners. Personalized recommendations have been used to support the activities of learners in personal learning environments and this technology can deliver suitable learning materials to learners. In order to improve the quality of recommendations, this research considers the multidimensional attributes of material, rating of learners, and the order and sequential patterns of the learner accessed material in a unified model. The proposed approach has two modules. In the sequential-based recommendation module, latent patterns of accessing materials are discovered and presented in two formats including the weighted association rules and the compact tree structure (called Pattern-tree). In the attribute-based module, after clustering the learners using latent patterns by K-means algorithm, the learner preference tree (LPT) is introduced to consider the multidimensional attributes of materials, rating of learners, and also order of the accessed materials. The mixed, weighted, and cascade hybrid methods are employed to generate the final combined recommendations. The experiments show that the proposed approach outperforms the previous algorithms in terms of precision, recall, and intra-list similarity measure. The main contributions are improvement of the recommendations quality and alleviation of the sparsity problem by combining the contextual information, including order and sequential patterns of the accessed material, rating of learners, and the multidimensional attributes of materials.
[33] Zhuhadar L, Nasraoui O, Wyatt R, et al. Multi-model ontology-based hybrid recommender system in e-learning domain. Web Intelligence, 2009, (3): 91-95. https://doi.org/10.1109/WI-IAT.2009.238

This paper introduces a multi-model ontology-based framework for semantic search of educational content in E-learning repository of courses, lectures, multimedia resources, etc. This hybrid recommender system is driven by two types of recommendations: content-based (domain ontology model) and rule-based (learner’s interest-based and cluster-based). The domain ontology is used to represent the learning materials. In this context, the ontology is composed by a hierarchy of concepts and sub-concepts. Whereas, the learner’s ontology model represents a subset of the domain ontology, and the cluster-based recommendations are added as additional semantic recommendations to the model. Combining the content-based with the rule-based provides the user with hybrid recommendations. All of them influenced the re-ranking of the retrieved documents with different weights. Our proposed approach has been implemented on the HyperManyMedia platform.
[34] Vellino A, Zeber D. A hybrid, multi-dimensional recommender for journal articles in a scientific digital library. Web Intelligence, 2007: 111-114. https://doi.org/10.1109/WI-IATW.2007.29

A recommender system for scientific scholarly articles that is both hybrid (content and collaborative filtering based) and multi-dimensional (across metadata categories such as subject hierarchies, journal clusters and keyphrases) can improve scientists' ability to discover new knowledge from a digital library. Providing users with an interface which enables the filtering of recommendations across these multiple dimensions can simultaneously provide explanations for the recommendations and increase the user's control over how the recommender behaves.
[35] Al-Hassan M, Lu H, Lu J. A semantic enhanced hybrid recommendation approach: A case study of e-Government tourism service recommendation system. Decision Support Systems, 2015, 72: 97-109. https://doi.org/10.1016/j.dss.2015.02.001

Recommender systems are effectively used as a personalized information filtering technology to automatically predict and identify a set of interesting items on behalf of users according to their personal needs and preferences. Collaborative Filtering (CF) approach is commonly used in the context of recommender systems; however, obtaining better prediction accuracy and overcoming the main limitations of the standard CF recommendation algorithms, such as sparsity and cold-start item problems, remain a significant challenge. Recent developments in personalization and recommendation techniques support the use of semantic enhanced hybrid recommender systems, which incorporate ontology-based semantic similarity measure with other recommendation approaches to improve the quality of recommendations. Consequently, this paper presents the effectiveness of utilizing semantic knowledge of items to enhance the recommendation quality. It proposes a new Inferential Ontology-based Semantic Similarity (IOBSS) measure to evaluate semantic similarity between items in a specific domain of interest by taking into account their explicit hierarchical relationships, shared attributes and implicit relationships. The paper further proposes a hybrid semantic enhanced recommendation approach by combining the new IOBSS measure and the standard item-based CF approach. A set of experiments with promising results validates the effectiveness of the proposed hybrid approach, using a case study of the Australian e-Government tourism services.
[36] Chen J, Chao K, Shah N, et al. Hybrid recommendation system for tourism. International Conference on E-business Engineering, 2013: 156-161. https://doi.org/10.1109/ICEBE.2013.24

This paper adopts item-based collaborative filtering to predict the interests of an active tourist by collecting preferences or taste information from a number of other tourists. Our proposed mechanism is able to predict a set of recommended tourism places from elicited place ratings (e.g., ratings of 1 through 5 stars) for the active tourist before traveling. Furthermore, given restrictions on traveling factors, such as budget and time, the recommendation system will refine the exact set of tourism places by applying a genetic algorithm mechanism. Finally, the system schedules a minimum-cost traveling path through the set of selected places by using the genetic algorithm approach. Our proposed hybrid recommendation algorithm focuses on the refining efficiency and provides multi-functional tourism information.
[37] Zhao Hongwei, Zhu Yunqiang, Yang Hongwei, et al. The semantic relevancy computation model on essential features of geospatial data. Geographical Research (地理研究), 2016, 35(1): 58-70. (in Chinese) https://doi.org/10.11821/dlyj201601006

Linked data is an effective way to integrate multi-source, heterogeneous geospatial data across web domains, and semantically rich links are the key to discovering target data accurately and quickly. Based on the semantic relationships among geospatial data in space, time, and content, the paper proposes a model for computing the semantic relevancy of the essential features of geospatial data. By building an index system of links between essential features, the semantic relevancy of geospatial data is computed level by level. Unlike traditional approaches to semantic relevancy, the model uses geographic metadata as its corpus, takes full account of the characteristics of geospatial data and of the differing importance of space, time, and content in retrieval, and applies geometric operations, numerical operations, word semantic similarity computation, and category-hierarchy relevancy computation respectively. The model is simple to build, applies to multi-source heterogeneous data, and combines mathematical operations with expert knowledge. Experiments show that the model computes the semantic relevancy of the essential features of geospatial data effectively and is extensible to a degree.
[38] Schneider M. Computing the topological relationship of complex regions. Database and Expert Systems Applications: 15th International Conference, DEXA 2004, Zaragoza, Spain, August 30-September 3, 2004, Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004: 844-853.

[39] Bowers S, Lin K, Ludascher B. On integrating scientific resources through semantic registration. Proceedings of the 16th International Conference on Scientific and Statistical Database Management. IEEE, 2004: 349-352.

[40] Fox P, McGuinness D L, Cinquini L, et al. Ontology-supported scientific data frameworks: The Virtual Solar-Terrestrial Observatory experience. Computers & Geosciences, 2009, 35(4): 724-738. https://doi.org/10.1016/j.cageo.2007.12.019

We have developed a semantic data framework that supports interdisciplinary virtual observatory projects across the fields of solar physics, space physics and solar-terrestrial physics. This work required a formal, machine understandable representation for concepts, relations and attributes of physical quantities in the domains of interest as well as their underlying data representations. To fulfill this need, we developed a set of solar-terrestrial ontologies as formal encodings of the knowledge in the Ontology Web Language Description Logic (OWL-DL) format. We present our knowledge representation and reasoning needs motivated by the context of Virtual Observatories, from fields spanning upper atmospheric terrestrial physics to solar physics, whose intent is to provide access to observational datasets. The resulting data framework is built upon semantic web methodologies and technologies and provides virtual access to distributed and heterogeneous sets of data as if all resources appear to be organized, stored and retrieved from a local environment. Our conclusion is that the combination of use case-driven, small and modular ontology development, coupled with free and open-source software tools and languages provides sufficient expressiveness and capabilities for an initial production implementation and sets the stage for a more complete semantic-enablement of future frameworks.
[41] Jannach D, Zanker M, Felfernig A, et al. Recommender Systems: An Introduction. Cambridge: Cambridge University Press, 2010.

[42] Wang Dongxu, Zhu Yunqiang, Pan Peng, et al. Construction of geodata spatial ontology and its application in data retrieval. Journal of Geo-information Science (地球信息科学学报), 2016, 18(4): 443-452. (in Chinese) https://doi.org/10.3724/SP.J.1047.2016.00443

With the arrival of the new geographic information era, geographic data are growing explosively. Finding the data people need accurately and promptly within massive volumes of geographic data, and intelligently recommending related data to users, has become a pressing problem. Traditional data discovery methods centered on string matching of keywords and subject terms suffer from incomplete and imprecise retrieval. To address this, the paper gives a detailed representation of the concepts, attributes, relations, rules, and corresponding instances in geographic space, proposes a preliminary framework for constructing a geospatial ontology, and on that basis builds a fairly complete geodata spatial ontology to link geographic data intelligently. The ontology is then applied in the Earth System Science Data Sharing Platform. The results show that introducing the geodata spatial ontology markedly improves the recall and precision of retrieval and also enables intelligent recommendation of related data. The ontology is of significant value for the precise discovery and sharing of geographic data in the big-data era.
[43] Wang M, Wang J. A data preprocessing framework of geoscience data sharing portal for user behavior mining. 23rd International Conference on Geoinformatics, IEEE, 2015: 1-5. https://doi.org/10.1109/GEOINFORMATICS.2015.7378637

Science data sharing has many advantages for both scientific research and education. Knowing about behaviors of science data sharing participants is valuable to support informed decision making on data sharing policy and data sharing website design. Nowadays, data sharing is mainly carried out through the Internet, and web usage mining provides an ideal approach to uncover user behaviors of data sharing. This paper presents a data preprocessing framework for further user behavior mining of a geoscience data sharing portal (geodata.cn). The preprocessing steps included data cleaning, user identification, session identification, and data modeling. Web server logs served as the major data source of this study. Heuristic algorithms were employed to accomplish data cleaning and user identification. Different session identification methods were applied for comparison. Users' geolocations were identified using an online Geo-IP lookup tool, which provides geographical coordinates of an IP address. On the basis of all the preprocessing procedures, a web usage data model of the science data sharing portal was proposed for further user behavior mining, such as user classification and spatial association rules mining.
[44] Wang Mo, Wang Juanle. A study on the user behavior of geoscience data sharing based on web usage mining. Journal of Geo-Information Science (地球信息科学学报), 2016, 18(9): 1174-1183. (in Chinese) https://doi.org/10.3724/SP.J.1047.2016.01174

Understanding the behavior of users of scientific data sharing is an important reference for delivering efficient and precise data sharing services. Based on web server logs and service records of the National Earth System Science Data Sharing Platform, the paper uses spatial data mining and web usage mining to explore user behavior patterns in Earth system science data sharing. In the preprocessing stage, user identification, session identification, and location identification are completed, and the data are spatially modeled and loaded into a spatial database. In the mining stage, spatial "hot spot" analysis is applied to the numbers of page views, sessions, and dataset views generated by users to identify regional differences in behavior. For data browsing and download behavior, the FP-growth algorithm is used to mine association rules between users and data, revealing high-frequency patterns in how users attend to and use data. The analysis shows that: (1) the platform's users are distributed across all provinces and municipalities in China, with the most users in Beijing, Shandong, and Jiangsu; this distribution correlates weakly with the distribution of university students nationwide but more strongly with the spatial distribution of students at "Project 211" universities; (2) the hot spot analysis shows that Beijing, Tianjin, and northern Hebei are hot spots for page views, data views, and sessions alike, whereas the identified "cold spots" differ considerably, and data access cold spots in particular are widespread, for example in the southern coastal provinces, Henan, Shandong, and Sichuan; (3) association rule mining finds several frequent itemsets and association rules for data browsing. Frequent download items match frequent browsing patterns well, but download behavior shows no clear association rules. The paper provides a method for mining user behavior patterns that combines web usage mining with spatial data mining, and the method can also be applied to other types of websites.
