Removing Duplicates With Filtering

Merging, joining or grouping results from several mashables can produce duplicate items in the combined result. You can remove duplicates with the <filter> statement and a filter expression that uses XPath axes to compare each item with preceding or following items. See XPath Axes for basic information on this XPath feature.

To remove duplicates in a mashup, first merge, join or group the results. If the filter will compare each item only with its immediate neighbor, also sort the combined results on the key field that determines uniqueness, so that duplicates are contiguous.

Then use <filter> with a filtering expression that compares the key value of the current item with that of the preceding or following items to determine whether the current item is unique. See Unique Filter Example for an example of removing duplicates.
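As a minimal sketch of the filtering step, assume the combined RSS result is held in a variable named $result and that the item title is the key field. Both the variable name and the /rss/channel/item path are assumptions for illustration. A filter that compares against following items might then look like this; note that it keeps the last occurrence of each title rather than the first:

```xml
<!-- Sketch only: assumes $result holds a merged RSS document and that
     title is the key field that determines uniqueness. -->
<!-- Select items that have no later sibling item with the same title,
     so the last occurrence of each title is kept. -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(following-sibling::item/title = ./title)]" />
```

Comparing against preceding items instead keeps the first occurrence, which is usually the more natural choice when the order of the merged feeds matters.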

Unique Filter Example

This example merges the results from two RSS mashables and then checks the title of each item to remove duplicates:

<mashup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.openmashup.org/schemas/v1.0/EMML/
    ../xsd/EMMLPrestoSpec.xsd"
  xmlns="http://www.openmashup.org/schemas/v1.0/EMML"
  xmlns:macro="http://www.openmashup.org/schemas/v1.0/EMMLMacro"
  xmlns:presto="http://www.jackbe.com/v1.0/EMMLPrestoExtensions"
  name="MergeFeeds">
  <output name="result" type="document"/>

  <!-- Retrieve the two RSS feeds -->
  <directinvoke outputvariable="$feed1"
    endpoint="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml" />
  <directinvoke outputvariable="$feed2"
    endpoint="http://www.nytimes.com/services/xml/rss/nyt/World.xml" />

  <!-- Combine both feeds into a single result -->
  <merge inputvariables="$feed1, $feed2" outputvariable="$result"/>

  <!-- Keep only items whose title does not match any preceding title -->
  <filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding::title = ./title)]" />
</mashup>

The filtering expression uses:

The not() XPath function to negate the comparison. It selects only items that have no preceding items with matching titles.

The preceding axis to check all previous titles in the merged feeds against the title for the current item.

Because of the structure of RSS results, you could also use preceding-sibling::item/title. If you sort the results on item/title, you could instead check only the closest preceding item title with preceding-sibling::item[1]/title to rule out duplicates.

The . in ./title is the abbreviated syntax for the current context node. This selects the title child of the current item so that it can be compared to all previous titles.
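The preceding-sibling alternatives mentioned above can be sketched as follows, reusing the $result variable from the example. Use one of the two statements, not both. The second is a sketch that assumes the items have already been sorted by title; the sort step itself is not shown here.

```xml
<!-- Alternative 1: compare only against the titles of earlier sibling items. -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(preceding-sibling::item/title = ./title)]" />

<!-- Alternative 2 (assumes items are already sorted by title):
     compare only against the nearest preceding item. -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(preceding-sibling::item[1]/title = ./title)]" />
```

On a reverse axis such as preceding-sibling, the position predicate [1] selects the node closest to the context node, which is why the second expression only works when duplicates are adjacent.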

Copyright © 2013-2016 Software AG, Darmstadt, Germany.
