
Want to get rid of documents with duplicate content?

Don Pinto, Principal Product Manager, Couchbase

December 16, 2014

4 min read

Whether you’re combining data from two different data sources, have multiple purchases from the same customer, or just entered the same data in a web form twice, it seems like everyone faces the problem of duplicate data at one point or another.

In this blog post, we’ll look at using views in Couchbase Server 2.0 to find matching fields among documents and retain only the non-duplicate documents. For the sake of this example, assume each document has three common user-specified fields: first_name, last_name and postal_code. Using the Ruby client for Couchbase Server and the faker Ruby gem, you can build a simple data generator to load some sample duplicate data into Couchbase. To use Ruby with Couchbase, you need to download the Ruby SDK.

Here is an execution sample:

$ ruby ./generate.rb --help

Usage: generate.rb [options]
   -h, --hostname HOSTNAME          Hostname to connect to (default: 127.0.0.1:8091)
   -u, --user USERNAME              Username to log with (default: none)
   -p, --passwd PASSWORD            Password to log with (default: none)
   -b, --bucket NAME                Name of the bucket to connect to (default: default)
   -t, --total-records NUM          The total number of the records to generate (default: 10000)
   -d, --duplicate-rate NUM         Each NUM-th record will be duplicate (default: 30)
   -?, --help                       Show this message

$ ruby ./generate.rb -t 1000 -d 5
     1000 / 1000

Each document in Couchbase has a user-specified key, which is accessible as meta.id in the map function of the view. In Figure 1 below, multiple documents have been loaded into Couchbase Server using the data generator client above.

Step 1

Write a custom map function that emits the document ID (meta.id) of every document, keyed by the fields that define a duplicate (first_name, last_name, postal_code in this case).

function (doc, meta) {
  emit([doc.first_name + '-' + doc.last_name + '-' + doc.postal_code], meta.id);
}

The map function defines when two documents are duplicates. According to the map function above, two documents are duplicates when the first name, last name and postal code match. We use '-' as a separator to prevent aliasing of the data when we concatenate the first name, last name and postal code.
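
The aliasing problem is easy to see with two made-up people whose fields run together when concatenated without a separator. The sample values and the helper names plainKey and dashKey below are illustrative only.

```javascript
// Illustrative values: two different people whose fields collide
// when concatenated without a separator.
const personA = { first_name: 'Anna', last_name: 'Lee', postal_code: '12345' };
const personB = { first_name: 'Ann', last_name: 'aLee', postal_code: '12345' };

const plainKey = (d) => d.first_name + d.last_name + d.postal_code;
const dashKey = (d) => d.first_name + '-' + d.last_name + '-' + d.postal_code;

console.log(plainKey(personA) === plainKey(personB)); // true: both are "AnnaLee12345"
console.log(dashKey(personA) === dashKey(personB));   // false

// A separator only helps if it cannot appear inside a field.
const personC = { first_name: 'Mary-Ann', last_name: 'Lee', postal_code: '12345' };
const personD = { first_name: 'Mary', last_name: 'Ann-Lee', postal_code: '12345' };
console.log(dashKey(personC) === dashKey(personD));   // true: both are "Mary-Ann-Lee-12345"
```

If hyphenated names are possible in your data, one alternative worth considering is emitting the three fields as separate elements of the array key, since view keys can be arrays. That variation is not part of the original example.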

Step 2

The reduce function looks like this:

function (keys, values, rereduce) {
  if (rereduce) {
    var res = [];
    for (var i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  } else {
    return values;
  }
}

After grouping, if there is more than one meta.id value for a key, we concatenate them to get a list of meta.id’s referring to duplicate documents.
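
The map, group and reduce steps can be tried outside the server with a small local simulation. This is only a sketch: in Couchbase the view engine calls the functions itself, and the sample documents, the mapDoc wrapper (which takes emit as a parameter so it can run standalone), and the way each group is split into two partial batches are assumptions chosen to exercise the rereduce branch.

```javascript
// Local simulation only; in Couchbase the view engine invokes these functions.
function mapDoc(doc, meta, emit) {
  emit([doc.first_name + '-' + doc.last_name + '-' + doc.postal_code], meta.id);
}

function reduce(keys, values, rereduce) {
  if (rereduce) {
    // values is a list of arrays returned by earlier reduce calls
    let res = [];
    for (let i = 0; i < values.length; i++) {
      res = res.concat(values[i]);
    }
    return res;
  }
  return values; // first pass: values is already the list of meta.id's
}

// Hypothetical sample documents, keyed by meta.id.
const docs = {
  id18: { first_name: 'Bob', last_name: 'Moss', postal_code: '54321' },
  id19: { first_name: 'Ann', last_name: 'Lee', postal_code: '12345' },
  id20: { first_name: 'Ann', last_name: 'Lee', postal_code: '12345' },
  id21: { first_name: 'Ann', last_name: 'Lee', postal_code: '12345' },
};

// Group emitted rows by key, as a grouped query would.
const groups = new Map();
for (const [id, doc] of Object.entries(docs)) {
  mapDoc(doc, { id }, (key, value) => {
    const k = JSON.stringify(key);
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(value);
  });
}

// Reduce each group in two partial batches, then rereduce the partial results.
const reduced = {};
for (const [k, ids] of groups) {
  const mid = Math.ceil(ids.length / 2);
  const partials = [ids.slice(0, mid), ids.slice(mid)]
    .filter((p) => p.length > 0)
    .map((p) => reduce(null, p, false));
  reduced[k] = reduce(null, partials, true);
}
console.log(reduced);
```

Note that the reduced value grows with the number of documents sharing a key, so keys with very many duplicates produce correspondingly large results.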

Step 3

The core part of the data cleaner is written in Ruby.

require 'couchbase'

# `options` is assumed to hold the parsed command-line settings
connection = Couchbase.connect(options)
ddoc = connection.design_docs[options[:design_document]]
view = ddoc.send(options[:view])

connection.run do
  view.each(:group => true) do |doc|
    dup_num = doc.value.size
    if dup_num > 1
      puts "left doc #{doc.value[0]}, "
      # delete documents from second to last
      connection.delete(doc.value[1..-1])
      # one document is kept, so dup_num - 1 were removed
      puts "removed #{dup_num - 1} duplicate(s)"
    end
  end
end

Connect to Couchbase Server and query the view. The value field is an array of meta.id’s that correspond to duplicate documents (matching first name, last name and postal code). If the array size is greater than 1, we delete all the documents except the one corresponding to the first meta.id.

If the number of meta.id’s in the value array is greater than 1, there are duplicate documents corresponding to that key. As shown in the figure above, id19 and id20 are duplicate documents.
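
The selection rule in the cleaner can be checked in isolation before running it against real data. The sketch below mirrors the Ruby logic (keep doc.value[0], delete doc.value[1..-1]) but only builds a plan and deletes nothing. The function name planCleanup and the sample rows are hypothetical.

```javascript
// Rows shaped like grouped view results: a key plus an array of meta.id values.
// Only keys with more than one id produce a plan entry.
function planCleanup(rows) {
  const plan = [];
  for (const row of rows) {
    if (row.value.length > 1) {
      plan.push({ keep: row.value[0], remove: row.value.slice(1) });
    }
  }
  return plan;
}

// Made-up rows for illustration.
const rows = [
  { key: ['Bob-Moss-54321'], value: ['id18'] },
  { key: ['Ann-Lee-12345'], value: ['id19', 'id20'] },
];

console.log(planCleanup(rows));
// [ { keep: 'id19', remove: [ 'id20' ] } ]
```

Printing the plan first is a cheap way to confirm which document survives for each key before issuing any deletes.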

The output of the data cleaner script looks like –

As shown in the figure below, duplicate documents are now eliminated.

Enjoy!

—

Thanks to Sergey for putting together the Ruby code.


Author

Posted by Don Pinto, Principal Product Manager, Couchbase

Don Pinto is a Principal Product Manager at Couchbase and is currently focused on advancing the capabilities of Couchbase Server. He is extremely passionate about data technology, and in the past has authored several articles on Couchbase Server including technical blogs and white papers. Prior to joining Couchbase, Don spent several years at IBM where he maintained the role of software developer in the DB2 information management group and most recently as a program manager on the SQL Server team at Microsoft. Don holds a master’s degree in computer science and a bachelor’s in computer engineering from the University of Toronto, Canada.


4 responses

Stephane

December 11, 2015 at 15:14

It gives me a “Reduction too large” error.

1. Stephane
  
  December 11, 2015 at 15:57
  
  When I get rid of the reduce code chunk then the error disappears. But I suppose I need that reduce code…

Dejan Sunderic

September 2, 2016 at 2:22

Is it possible to do this in N1QL? It should be faster than from the client.

1. Matt Ingenthron
  
  September 2, 2016 at 5:18
  
  Yes it’d certainly be possible to do something similar with N1QL in 4.0 and later. This blog was written for 2.0 originally. That said, Couchbase is deployed as a distributed system, so a N1QL procedure running would perform the same as a client would. It really is a client to the underlying data. You get a benefit in some cases by running the query service co-located with the data, but you can certainly do that with other programs too.
