 
Removing Duplicates With Filtering
 
Merging, joining or grouping results from several mashables can produce duplicate items in the combined result. You can remove duplicates with the <filter> statement and a filter expression that uses XPath axes to compare each item with the preceding or following items. See XPath Axes for basic information on this XPath feature.
To remove duplicates in a mashup, first merge, join or group the results. If needed, sort the combined results by the key field that determines uniqueness, so that duplicates are contiguous.
Then use <filter> with a filtering expression that compares the key value of each item with the key value of the preceding or following items to determine whether the item is unique. See Unique Filter Example for an example of removing duplicates.
Unique Filter Example
This example merges the results from two RSS mashables and then checks the title of each item to remove duplicates:
<mashup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openmashup.org/schemas/v1.0/EMML/
    ../xsd/EMMLPrestoSpec.xsd"
    xmlns="http://www.openmashup.org/schemas/v1.0/EMML"
    xmlns:macro="http://www.openmashup.org/schemas/v1.0/EMMLMacro"
    xmlns:presto="http://www.jackbe.com/v1.0/EMMLPrestoExtensions"
    name="MergeFeeds">
  <output name="result" type="document"/>
  <!-- invoke two RSS feeds -->
  <directinvoke outputvariable="$feed1"
      endpoint="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml" />
  <directinvoke outputvariable="$feed2"
      endpoint="http://www.nytimes.com/services/xml/rss/nyt/World.xml" />
  <!-- merge the results -->
  <merge inputvariables="$feed1, $feed2" outputvariable="$result"/>
  <!-- filter for unique items based on title -->
  <filter inputvariable="$result" outputvariable="$result"
      filterexpr="/rss/channel/item[not(preceding::title = ./title)]" />
</mashup>
The filtering expression uses:
* The not() XPath function to negate the comparison. The expression selects only items that have no preceding item with a matching title, so the first occurrence of each title is kept.
* The preceding axis to check all previous titles in the merged feeds against the title of the current item.
  Because of the structure of RSS results, you could also use preceding-sibling::item/title. If you sort the results by item/title, you could instead check only the closest item title with preceding-sibling::item[1]/title to rule out duplicates.
* The . in ./title, which is the short syntax for the current context node. ./title selects the child title of the current item so that it can be compared to all previous titles.
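As a minimal sketch of the alternatives mentioned above, the <filter> statement in the example could be written in either of the following ways. Both assume that all item elements in $result are siblings under a single channel, as described for RSS results. The second form is only correct if $result has already been sorted by item/title. Note that the preceding axis also matches title elements outside of item, such as the title of the channel itself, which preceding-sibling::item/title does not. These fragments are illustrative and have not been verified against a running server. Use one of them, not both:

```xml
<!-- Alternative 1: compare against the titles of all earlier sibling items.
     Works on unsorted results; keeps the first occurrence of each title. -->
<filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding-sibling::item/title = ./title)]" />

<!-- Alternative 2: compare against the closest earlier sibling item only.
     Assumes $result is already sorted by item/title, so that items with
     the same title are contiguous. -->
<filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding-sibling::item[1]/title = ./title)]" />
```

In the second form, [1] on the preceding-sibling axis selects the nearest preceding item, because positions on a reverse axis are counted backwards from the context node.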

Copyright © 2013-2018 | Software AG, Darmstadt, Germany and/or Software AG USA, Inc., Reston, VA, USA, and/or its subsidiaries and/or its affiliates and/or their licensors.
Innovation Release