Bizarre matching functionality results - we've halted matching for this species because of this issue

In which Wildbook did the issue occur? ACW
What operating system were you using? Win 10
What web browser were you using? latest chrome
What is your role on the site? admin & researcher

What happened?
My researcher had reported some issues that I wanted to replicate with her on our weekly call this morning. Her matching process is to go chronologically through each ID’d individual’s record, select the assigned encounters, run matching from each one for each L/R viewpoint, find all of the unassigned encounters in the system that match, and then match these unknowns to the target ID’d encounter.
Here’s the original match results link she sent me, which I used for all of the steps listed below:

https://africancarnivore.wildbook.org/iaResults.jsp?taskId=5486b0ab-cce5-421b-a10f-4ffb26b0d8b6

The first issue found, reported as a bug by me yesterday, was that the assigned encounter that was selected was not the one that was being displayed in the match results as the target encounter. Instead, the system was displaying the unassigned copy of that same image.

We expect to find a duplicate copy of ID’d photos in the system because the ID kits each contain a renamed copy of the raw tourist pic. The raw tourist pic remained as is in the relevant sighting folder, which was uploaded, as were the ID kits.
However, we do not expect the system to swap the assigned encounter selected for match results for the unassigned copy. I was able to replicate this issue, so I reported it yesterday.

This morning, we found more bizarre behaviour.

From the same match results page, since we knew that the target image was CH00051 and that the 1st match result, an unassigned encounter, was a match to the target, I did the following:

  1. I checked the box and went to select CH00051 from the dropdown menu in the top yellow bar.
  2. I typed in “CH0005”, selected CH00051 from the dropdown menu, clicked on
  3. Result: a UUID was displayed on the target encounter record as well as the matched encounter record.
  4. I clicked on the UUID to see if it took me to CH00051’s individual record. It did not. It took me to that new UUID’s individual record, with the 2 recently matched encounters as the only 2 in the encounters list for that individual. If this had been CH00051’s individual record, it should have shown at least 1 additional encounter in the list, since CH00051 pre-existed the match done above. It did not have any additional encounters in the list.
  5. I opened each of the assigned encounters (for the UUID individual record) and removed them both from that individual, effectively deleting that new individual.
  6. I went back to the match results page that we’d started with, refreshed it and noted that the target and the 1st proposed match were no longer assigned to anyone.
  7. I repeated steps 1 & 2 above. Again, this resulted in a UUID displayed as the assigned individual, rather than the name “CH00051”.
  8. This time, when I clicked on the UUID link, the individual record for CH00051 appeared.
  9. I scrolled down to the encounters list in this CH00051 record and it displayed only the 2 encounters that I had just matched, no other encounters were attached to this individual ID.
  10. I went back to the tab with the results page, refreshed it, hoping the UUID would disappear and the name CH00051 would appear as the assigned individual for the 2 matched encounters but it did not.
  11. I went to > Individuals > View all > filtered the results to find CH00051. The results showed that there was only 1 CH00051 record with only 2 encounters assigned.
  12. I went back to the 2 encounters that had been assigned through the steps above, and removed them from their assigned individual.
  13. Result (as expected): CH00051 no longer exists in the system.

But it should still exist, since there were encounters assigned to it when the researcher and I initiated the matching in the first place. She selected an encounter from CH00051’s original individual record and ran matching, but ended up with a different, unassigned encounter as the target image. The alternate reference box indicated that this image was a duplicate of one in an assigned encounter (which was the original encounter record selected to be the target).

So what happened to the original encounter(s) assigned to CH00051, uploaded as part of the ID kit?

Note that I previously reported a similar issue regarding CH00033 (Match dropdown has duplicate of CH00033 & ID shows as alpha-numeric instead of CH00033). At the time, we assumed that the researcher had somehow erroneously created a 2nd CH00033 record, and that either she had somehow created, or we had originally uploaded, an ID’d individual without a name, so the system displayed only the UUID in the other ID’d individual record.
Now, I suspect that what we saw then wasn’t user error or upload error, but rather the same as what I’ve described above.

This does not happen every time this researcher runs matching. She’s been going systematically from CH00001 to now CH00050+, following the same steps for each match run. This is only the 2nd time she’s encountered the results above.

Per above, I’ve asked the researcher to stop matching for now until we get some insight into this issue.

I’m happy to jump on a call to clarify any of the points above. I suspect, because it’s not happening every time the same steps are followed, this is going to be a tough one to investigate & resolve.

thanks
Maureen
cc: @PaulK

I believe there’s an issue with your process at work here based on this:

It is expected for us to find a duplicate copy of ID’d photos in the system because the ID kits each contain a re-named copy of the raw tourist pic. The raw tourist pic remained as is in the relevant sighting folder, which was uploaded, as were the ID kits.

The system does not support this behavior. IA does not recognize duplicate images as different; it returns an image ID, which applies to all identical images, and Wildbook selects from the related encounters at random. This also can lead to one image assigned to different encounters/individuals, which causes a 606 error.
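
To make the mechanism above concrete, here is a rough toy model of it. This is not Wildbook or IA code; the encounter names, the lookup table, and the function names are all made up for illustration. It only shows why a random pick among encounters that share one image would produce the "wrong target" result on some runs and not on others.

```python
import random

# Hypothetical data: two encounters holding byte-identical copies of one photo.
# IA is modelled as seeing a single image ID for both copies.
ENCOUNTERS_BY_IMAGE = {
    "image-A": ["enc-assigned-CH00051", "enc-unassigned-copy"],
}


def pick_target_encounter(image_id, rng):
    """Return one of the encounters that share image_id, chosen at random."""
    return rng.choice(ENCOUNTERS_BY_IMAGE[image_id])


def simulate(runs, seed=0):
    """Count how often each encounter is picked over many matching runs."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        encounter = pick_target_encounter("image-A", rng)
        counts[encounter] = counts.get(encounter, 0) + 1
    return counts


if __name__ == "__main__":
    # Both encounters show up as the target across runs, so the swap looks intermittent.
    print(simulate(1000))
```

Under this assumption, the encounter a user selected and the encounter shown as the target can differ on any given run, which would fit an issue that does not reproduce every time.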

To clarify, and trying not to sound too defensive, this wasn’t a deliberate process; we were unaware that duplicate images with different names were retained in the source data this way when we uploaded both the ID’d dataset and the census data. Since they have different file names, there was and is really no way for us to find these and remove the duplicates prior to upload.

Next, it’s not clear to us that the system doesn’t support this; it identifies the duplicate for us and tags it in the “alternate reference” box in the match results. Obviously, it also allows this in uploads. Both of which could imply that the system does support it to some degree.

So we were unaware of the impact of this scenario on matching. Are you saying that this duplicate image scenario is the source of all of the issues I listed here (except for what’s in yesterday’s ticket, now being tracked under WB-1154)? Can you confirm that’s what your reply means?

thanks
M

cc: @PaulK

Over the course of setting up your Wildbook, we mentioned some of the more common issues that can crop up, including duplicate images impacting match results. This included discussions around bulk import allowing duplicate images (it has no context of the image; IA does) and the point that things like “alternate reference” act as signposts that something is off. That led to the development of the administrative tools that help resolve things like 606 errors by allowing admins to actively seek out data issues.
That being said, I will make a point of updating documentation to reflect this information, and make a note for others we onboard that this is an issue whose importance we must underscore.
What I can say with certainty is that this is a major problem that is causing the match result confusion. It could lead to additional issues, such as viewpoint misassignment, because errors cascade in a machine learning platform.

Thanks,
Tanya

A little addendum about finding and removing duplicates:

When you see one of these alternate references, choose which of them you wish to keep and delete the other so it doesn’t come up again.

As an additional step, visit the link to the import page (found at the very bottom of the Metadata section) for the encounter you are going to delete. It seems likely to me that an import that created one duplicate might be responsible for others, so it is a good starting point to find and remove them. If the import is all duplicates, you might be able to save some time by using the new import deletion button to get rid of them all.

If you aren’t sure whether one of the encounters in an import is a duplicate or not, then run a matching job. If it is, you will see an ‘Alternate Reference’ in iaResults, just like with the original encounter, and can safely delete it.

Hi folks,

Thanks for clarifying. I think something that might have been helpful to us, and will be to other new users, is a better understanding from the get-go of the issues that users can cause and the severity of their impact. We received a lot of information and advice about what to do and what not to do, well before we had any decent understanding of how the system worked. We’re still learning.

In the case of “alternate reference”, when I specifically asked what this was in a support ticket filed 2 months ago, the response was that “this is just to make you aware that the annotation exists elsewhere in the system to help you avoid duplication and aid data curation.” I was not told that it will break matching. If I had known that then, we’d have avoided having the researcher get very deep into her matching process without correcting these.

I also feel that just being told (although we weren’t) that duplicates cause issues with matching is too vague and lacks specificity that could help drive some urgent action, particularly with new users to the system.

With the new data integrity tool, knowing that annotations are assigned to 2 different individuals is interesting and helpful but, for our WD researchers, not an urgent concern because, from their perspective, all these individuals exist as distinct individuals, so the problem is purely that an image has been assigned to one of them incorrectly in each instance. But I’m guessing now that the impact is far more severe than that from an ML perspective, although I have no basis for that assumption other than extrapolating from this scenario.

As to cleanup, unfortunately Colin, your recommended approach won’t work for this particular dataset and, I suspect, others. There is no single import that created the duplicates. ID kits were built from what the researcher decided were the best L & R viewpoints they could find in a dataset of thousands of tourist images. So finding these duplicates would mean determining which images in the system are the original ID kit images for each individual, then opening each of the thousands of tourist images to find the matches. Even doing it via matching in the system is massive - the first 12 best matches do not necessarily always surface the “alternate reference” images. But obviously, we’ll need to figure out a way.
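
One possible way to narrow this down outside of Wildbook, sketched here under some assumptions: if the ID kit images are exact renamed copies of the raw tourist files (same bytes, different name), then comparing a content hash of each file would pair them up without opening any images. The folder layout, the function names, and the list of file extensions below are hypothetical. This will not catch copies that were cropped, resized, or re-saved, since those no longer have identical bytes.

```python
import hashlib
from pathlib import Path

# Assumed image extensions; adjust to whatever the dataset actually contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks so large photos are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_files(root):
    """All image files under root (recursively), in a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def find_renamed_copies(id_kit_dir, sightings_dir):
    """Map each ID kit image to the raw sighting files with identical bytes."""
    kit_by_digest = {}
    for kit_file in image_files(id_kit_dir):
        kit_by_digest.setdefault(file_digest(kit_file), []).append(kit_file)

    matches = {}
    for raw_file in image_files(sightings_dir):
        for kit_file in kit_by_digest.get(file_digest(raw_file), []):
            matches.setdefault(kit_file, []).append(raw_file)
    return matches
```

Calling `find_renamed_copies("id_kits", "sightings")` on local copies of the two source folders (both folder names are placeholders) would return a dictionary from each ID kit file to its byte-identical originals. That list could then guide which encounters to look up and delete in the system.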

Meanwhile, I’d recommend being clearer about showstopper-type issues over minor inconvenience issues, or even just a Do’s & Don’ts type list for new users.

cc: @PaulK