Removing Duplicates With Filtering
Merging, joining, or grouping results from several mashables can produce duplicate items in the combined result. You can remove duplicates with the <filter> statement and a filter expression that uses XPath axes to compare each item with the preceding or following items. See XPath Axes for basic information on this XPath feature.
To remove duplicates in a mashup, first merge, join, or group the results. If needed, sort the combined results on the key field that determines uniqueness, so that duplicates are contiguous.
Then use <filter> with a filtering expression that compares the key value of the current item with that of the preceding or following item to determine whether the current item is unique. See Unique Filter Example for an example of removing duplicates.
Unique Filter Example
This example merges the results from two RSS mashables and then checks the title of each item to remove duplicates:
<mashup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.openmashup.org/schemas/v1.0/EMML/
                            ../xsd/EMMLPrestoSpec.xsd"
        xmlns="http://www.openmashup.org/schemas/v1.0/EMML"
        xmlns:macro="http://www.openmashup.org/schemas/v1.0/EMMLMacro"
        xmlns:presto="http://www.jackbe.com/v1.0/EMMLPrestoExtensions"
        name="MergeFeeds">
  <output name="result" type="document"/>
  <!-- invoke two RSS feeds -->
  <directinvoke outputvariable="$feed1"
      endpoint="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml" />
  <directinvoke outputvariable="$feed2"
      endpoint="http://www.nytimes.com/services/xml/rss/nyt/World.xml" />
  <!-- merge the results -->
  <merge inputvariables="$feed1, $feed2" outputvariable="$result"/>
  <!-- filter for unique items based on title -->
  <filter inputvariable="$result" outputvariable="$result"
      filterexpr="/rss/channel/item[not(preceding::title = ./title)]" />
</mashup>
The filtering expression uses:
The not() XPath function to negate the comparison. It selects only items that have no preceding item with a matching title.
The preceding axis to check all previous titles in the merged feeds against the title of the current item.
Because of the structure of RSS results, you could also use preceding-sibling::item/title. If you sort the results based on item/title, you could also check just the closest item title with preceding-sibling::item[1]/title to rule out duplicates.
The . in ./title, which is the short syntax for the current context node. This selects the child title of the current context node to compare it with all previous titles.
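As a rough sketch of the preceding-sibling alternatives mentioned above, either of the following <filter> statements could stand in for the one in the example. Both reuse the $result variable and the /rss/channel/item path from the example, and both assume that the merged items end up as siblings under a single channel. The second also assumes that $result was sorted by item/title beforehand, so that duplicates are adjacent. Use only one of them in a given mashup.

```xml
<!-- Alternative 1: compare the current title with the titles of all
     earlier sibling items (assumes items share one parent channel) -->
<filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding-sibling::item/title = ./title)]" />

<!-- Alternative 2: assumes $result is already sorted by item/title,
     so only the immediately preceding item needs to be checked -->
<filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding-sibling::item[1]/title = ./title)]" />
```

Unlike preceding::title, these forms compare item titles only, so a title element that belongs to the channel itself and appears before the items is not part of the comparison.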