Updates, UnsavedRevisions and Attachments - Best Practice for Syncing

Hi,

Can I please get some advice on the best practice for dealing with Attachments to reduce the amount of data that gets synced?

I have a class of documents, each of which contains zero or more attachments.
I can load documents and their attachments ok. I can save updates to the docs and attachments ok if I overwrite all attachments every time, which is what I initially did.

However, I was thinking (perhaps wrongly) that updating attachments that haven’t changed seems wrong, since it might force them to be synced when it is not necessary. So I tried to be smart. When updating a doc+attachments, I tried merging in the new attachments with minimal changes. Specifically (pseudocode):

- given a newDoc (properties) and newAttachments (list of files)
- create a new UnsavedRevision of the existing document.
- update the properties of the UnsavedRevision using newDoc.
- look for any attachments in UnsavedRevision that do not exist in newAttachments
    - delete attachment from UnsavedRevision
- look for any attachments in newAttachments that don't exist in UnsavedRevision
    - add attachment to UnsavedRevision
- for all other attachments (i.e. that exist in UnsavedRevision and newAttachments)
    - compare the attachment data. If unchanged, do nothing.
    - otherwise add attachment (overwrites).
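
As a rough illustration of the steps above, independent of the Couchbase Lite types, the merge decision can be sketched as a pure function. `AttachmentMergePlan` and `Plan` are hypothetical names, and attachments are simplified to name-to-bytes dictionaries, which is an assumption made for the sketch rather than the real API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AttachmentMergePlan
{
    // Hypothetical helper: decides which attachment names to remove and
    // which to (re)write, without touching any database objects.
    public static (List<string> ToRemove, List<string> ToSet) Plan(
        IDictionary<string, byte[]> existing,
        IDictionary<string, byte[]> incoming)
    {
        // Names present in the old revision but absent from the new set.
        var toRemove = existing.Keys
            .Where(name => !incoming.ContainsKey(name))
            .ToList();

        // Names that are new, or whose bytes differ from the stored bytes.
        var toSet = incoming
            .Where(kv => !existing.TryGetValue(kv.Key, out var old)
                         || !old.SequenceEqual(kv.Value))
            .Select(kv => kv.Key)
            .ToList();

        return (toRemove, toSet);
    }
}
```

The caller would then remove every name in `ToRemove` from the unsaved revision and set every name in `ToSet`; unchanged attachments are left alone.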

Sadly - the next time I try to read this document, it does contain attachments, but the data is empty and mime type is corrupt.

So let me ask…
If I am about to update a document+attachments, and I want to be sure to minimise syncing traffic, what is the best procedure?

Is it to always “addAttachment”? Will the sync agent be smart enough to look for file changes and only sync changed attachments?

Any advice most gratefully received.
Thanks.
Paul.

p.s. this is C#/Mac/CouchbaseLite 1.4/Sync Gateway running on Ubuntu.

That pseudocode seems correct. This may be an issue with the implementation … any ideas, @borrrden?

Could you give an example, describing the previous and new attachments, and what happens? You don’t need to show code, but a dump of the document JSON’s _attachments property, before and after, would be very helpful.

I definitely would not rule out some kind of implementation error since I’ve recently made changes to this area due to other issues I found, but I’ve never seen or heard of this kind of behavior before. The sync process will NOT do diffs on attachments, so it is best to not update the attachment if you don’t have to. Could you say a bit more about the corrupt mime type and perhaps give an example of the flow that is happening and the results you are seeing?
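
Since nothing is diffed during sync, one possible way to decide whether an attachment needs rewriting, without loading the old body, is to compare digests. This is only a sketch: it assumes the stored attachment metadata carries a `digest` entry of the form `sha1-` followed by the Base64-encoded SHA-1 of the content, which should be checked against the metadata of a real attachment before relying on it. `AttachmentDigest`, `DigestOf` and `IsUnchanged` are hypothetical names, not part of the library.

```csharp
using System;
using System.Security.Cryptography;

public static class AttachmentDigest
{
    // Builds a digest string in the assumed "sha1-<base64>" format.
    public static string DigestOf(byte[] data)
    {
        using (var sha1 = SHA1.Create())
        {
            return "sha1-" + Convert.ToBase64String(sha1.ComputeHash(data));
        }
    }

    // True when the stored digest matches the digest of the new bytes,
    // i.e. the attachment does not need to be set again.
    public static bool IsUnchanged(string storedDigest, byte[] newData)
    {
        return storedDigest != null
            && string.Equals(storedDigest, DigestOf(newData), StringComparison.Ordinal);
    }
}
```

If the stored digest is missing or in a different format, `IsUnchanged` returns false and the attachment is simply written again, which is the safe default.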

Thanks for the response @jens and @borrrden.

Let me apologise in advance for the length of this… I’ll try to keep it as short as I can.

First - two types used to keep track of object properties loaded from the DB, and attachments loaded from the DB or being sent to the DB.

// A using alias must name the type with its full namespace.
using DBProperties = System.Collections.Generic.IDictionary<string, object>;
public class AttachmentInfo
{
    public string fieldname;
    public string mimetype;
    public byte[] data;
}

Now the core routines: Update a document, and Merge in a set of attachments. Merge is what I am trying to optimise to reduce sync traffic.

public string Update(DBProperties new_props, List<AttachmentInfo> attachments = null)
{
    log.Debug("About to update {0}", new_props);
    if (!new_props.ContainsKey("_id"))
        throw new MissingIDException();
    var document = database.GetExistingDocument((string)new_props["_id"]);
    if (document == null)
        throw new MissingDataException();
    var rev = document.CreateRevision();
    if (rev == null)
        throw new FailedUpdateException();
    //    Set the properties.
    MergeProperties(rev.Properties, new_props);
    //    Set the attachments
    if (attachments != null)
        MergeAttachments(rev, attachments);
    rev.Save();
    return document.Id;
}
public void MergeAttachments(UnsavedRevision rev, List<AttachmentInfo> new_attachments)
{
    if (rev.Attachments != null)
    {
        //    Create a list of attachments that need to be removed (exist in Attachments, don't exist in new_attachments).
        var to_be_removed = new List<string>();
        foreach (var att in rev.Attachments)
        {
            if (new_attachments.Find((AttachmentInfo a) => a.fieldname == att.Name) == null)
                to_be_removed.Add(att.Name);
        }
        //    Now go through the attachments removing all that are no longer needed.
        foreach (var name in to_be_removed)
            rev.RemoveAttachment(name);
    }
    //    Now let's check the incoming attachments...
    foreach (var new_att in new_attachments)
    {
        var old_att = rev.GetAttachment(new_att.fieldname);
        if (old_att != null)
        {
            //    The attachment exists in both NEW and OLD attachment lists. So compare the data to see if it's changed
            var old_data = old_att.Content.ToArray();
            //    **IF I COMMENT THE IF STMT AND *ALWAYS* SetAttachment, THIS WORKS!**
            if (!old_data.SequenceEqual(new_att.data))
                //    Attachment has changed... so update it.
                rev.SetAttachment(new_att.fieldname, new_att.mimetype, new_att.data);
        }
        else
        {
            //    This is a new attachment. Just add it in.
            rev.SetAttachment(new_att.fieldname, new_att.mimetype, new_att.data);
        }
    }
}

Note the comment above. If I comment out the IF statement (the check to see if the data has changed),
then each attachment is set every time… and this works. I have sat there loading and saving many objects very reliably.
If I uncomment the IF statement so that ONLY changed attachments get SetAttachment, then I get failures, but not at the point of saving. It's at the next attachment load.

The failure mode is very strange, but very reliable. To clarify the process for failure:

  • I load a document and its attachments.
  • Make a change to the properties
  • Attempt to update the document and its attachments
    • this appears to work
  • Load a different document
  • Load that document’s attachments.
    • this fails.

And here’s the code that loads the attachments for a given document:

public List<AttachmentInfo> GetAttachments(string _id)
{
    var document = database.GetExistingDocument(_id);
    if (document == null)
    {
        return null;
    }
    var revision = document.CurrentRevision;
    if (revision == null)
    {
        return null;
    }
    var attachments = revision.Attachments;
    List<AttachmentInfo> ret = new List<AttachmentInfo>();
    if (attachments == null)
        return ret;
    foreach (var attachment in attachments)
    {
        var ai = new AttachmentInfo();
        ai.fieldname = attachment.Name;
        ai.mimetype = attachment.ContentType;  // <<<< FAILS HERE
        ai.data = (byte[])attachment.Content;
        ret.Add(ai);
    }
    return ret;
}

The failure occurs at the line indicated above.
I have uploaded two images of the Xamarin debugger showing the contents of attachments at the point of failure, PLUS (and this is where it is a little weird) I have an image of the attachments from the on-disk DB (i.e. using CouchbaseLiteViewer).

The on disk attachments are completely healthy (there are 7).
The loaded attachments are corrupt - not all 7 of them, just the first few.

So changing that ONE line in the code that sets the attachments during save has the effect of causing the NEXT attachment fetch to fail.

Hope this helps.
Thanks for your help with it.
Paul.

I don’t see anything wrong here so far, but one odd thing I notice is that the Metadata in your debugger shot only has two items in it. The exception on the ContentType property also indicates that the metadata is missing the content_type key (which is weird since the second picture shows it present in all of them). Could you examine that and see what you find inside? The “first few” in your picture also happen to be pointing to the same exact file. Do you notice any relation between that and this problem? Is it only the first three that are doing this? If so, then there might be something going on here dealing with several different attachments on the same file (which should be fine, and if it’s not there is a problem).

Just an update here…

I’ve been trying to create a small fragment of code that exhibits the problem. Something of the form:

for (var i=0; i<5; i++) {
    load a document + attachments
    make a change
    save a document
}

I have been able to do that, and cause the problem to occur. I’m working to reduce the code that exhibits the problem to be as small as possible.

As to your other question of the duplicate images.
You are correct - I am using test data with duplicate objects. This ensures that my code correctly creates unique names for the attachments.

I have noticed that I get two exceptions regularly with attachments:
a) InvalidCastException - Object must implement IConvertible - inside SetAttachment and RemoveAttachment
b) IOException - ERROR_ALREADY_EXISTS - on Save.

I figured the second one is internally handled by the code watching for duplicate images, and de-duping them.
The first seems strange. I’m guessing it’s a problem with the string parameter “name” since that is in common between both calls. I know I’m passing in a string parameter (see AttachmentInfo class up above).

I haven’t worried too much about these exceptions because everything works (if I use SetAttachment on every attachment, whether or not it’s changed).

Regards.
Paul.

When you say “get” exceptions, do you mean they bubble up to your code? I suppose not if the code is continuing. The first one usually happens when I try to cast an object from a dictionary, which in the Couchbase world means it is usually System.Object and so I go through a dance to coax out the value or give back null, including attempting to use Convert.ChangeType. I don’t see anywhere that I log that though, so that’s a bit concerning.

The repro case would likely allow me to get to the bottom of this in a very short time. It doesn’t necessarily have to be super small, if you are comfortable with the contents of it. As long as it runs and reproduces then it will be extremely helpful, and should be put as an issue on the repo.

The exceptions do not bubble up to my code - but they are caught by the debugger. I just hit ‘continue’.

Images of the exceptions…

Hey there @borrrden. Sorry to go quiet for a while - other requirements needed to be taken care of so I’ve been doing this in the background.

I finally have a very small blob of code that reproduces some funny behaviour with de-duping of attachments - when those duplicates are attached to different documents.

This topic has got kinda long, so I’ll put it all in a new topic.
Cheers.
Paul.