Identifying People in Images

In the olden days, family photos went in a shoe box, and if you were lucky, someone wrote on the back of it to tell you who was in the photo. If nobody did that, chances are that a priceless old family photo would eventually become a worthless photo because nobody knows who it's of.

I went on a slide-scanning rampage a couple years ago and now have 15,000 images taken from slides, along with hundreds of really old family photos, and hundreds more of my own digital camera photos. I would like to identify who is in each photo, but I'm still waiting for the right metadata definition to come along so that doing so would not be a waste of time.

JPEG, TIFF and other image file formats typically allow metadata to be stored inside the file. This allows information about the image such as the date it was taken, keywords, a caption, a title, where it was taken, etc., to be stored inside the image file itself, so when you send someone a copy of the file, they're getting information about the image at the same time.

Unfortunately, although you can add keywords and captions, there is not yet a well-established standard for identifying people in photographs. I have tried putting everyone's name in a caption, using a comma-separated list of names from left to right, and a semicolon between rows, back to front. But there is often some ambiguity with this ("was the baby on the lady's lap part of row 1 or 2?"), and it is just my own convention. Also, often there are only one or two people I know in a group shot, so I have to resort to "3rd from the left, second row" type descriptions, which is annoying. It also gets old after captioning many pictures with the same people in them, and it uses up caption space that you would sometimes rather use for a description.
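To make that caption convention concrete, here is a minimal Python sketch of how a program might read it back. The function name and the sample names are made up for illustration, and it assumes "back to front" means the back row is listed first. It also shows the weakness: the parser recovers rows and order, but has no idea where anyone actually is in the picture.

```python
def parse_caption_names(caption):
    """Split a caption such as "Ann, Bob; Cy, Di" into rows of names.

    Assumed convention: rows are separated by semicolons (back row first),
    and names within a row are separated by commas, left to right.
    """
    rows = []
    for row in caption.split(";"):
        # Strip whitespace and skip empty entries such as a trailing comma.
        names = [name.strip() for name in row.split(",") if name.strip()]
        if names:
            rows.append(names)
    return rows


# Hypothetical caption: two people in the back row, three in the front.
rows = parse_caption_names("Grandpa Joe, Aunt May; Sam, unknown, Lucy")
for number, names in enumerate(rows, start=1):
    print(f"row {number} (from the back): {names}")
```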

Flickr allows you to draw a rectangle in an image and add a note that is associated with that rectangle. That's a great idea, and apparently they got it from "fotonotes", which does the same thing.

Fotonotes stores a chunk of XML text in "the 8th JPEG header", which is a good start, but I don't know whether such metadata can also be stored in TIFF files or not, and it doesn't seem likely that such headers would survive being processed by image-editing utilities that don't know about them.

IPTC is the most common metadata format used in JPEG and TIFF files, but Adobe has been pushing its extensible "XMP" format for storing metadata in image files. Adobe's various programs (like Photoshop) recognize XMP metadata and preserve it whenever you manipulate or convert the files, so it may well be the best place to store the kind of metadata I'm talking about. Microsoft Vista's image browser apparently supports XMP as well. Since XMP is "extensible" (and is an open format), it should be possible to define tags in XMP that would support a rectangle with a note on it.

Does anyone know how to do that? If we could get that figured out, then the next step would be to get several big players (e.g., Adobe, Google, Flickr, Vista, iPhoto, etc.) to support those tags so that tagging your photos that way really does preserve the information for posterity instead of just being a proprietary tag that no software recognizes.

Resolution Changes
It is quite common to change the resolution of a photo for various uses, and one would not want the rectangles to break when that happens. The way Flickr handles it is that they treat the longest dimension as though it were 500 pixels, and then specify their coordinates accordingly. So if you draw a rectangle from 100,200 to 500,800 in an image that is 1500x3000, then they would take the longest dimension (3000), treat it as though it were 500, and scale the coordinates accordingly: in this case, multiply each coordinate by 500/3000=1/6, giving a rectangle from roughly 16.7,33.3 to 83.3,133.3. That way, if you reduce (or enlarge) the image, the coordinates can stay the same. If you crop, you're still hosed, although you would likely be able to manually "repair" the rectangles without knowing who's who in the picture by looking at how they are related to each other (i.e., cropping would slide rectangles up and to the left of where they belong, so if enough of the image is left, you might be able to tell where the rectangles go). Of course, software that is aware of the rectangles could translate them when a crop is done.
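The scaling arithmetic described above can be sketched in a few lines of Python. This is only my reading of the approach, not Flickr's actual code; the function names and the corner-based (x1, y1, x2, y2) layout are assumptions for illustration.

```python
def normalize_rect(rect, image_size, longest=500):
    """Scale (x1, y1, x2, y2) pixel corners as if the longest side were `longest`."""
    width, height = image_size
    scale = longest / max(width, height)
    return tuple(coord * scale for coord in rect)


def denormalize_rect(rect, image_size, longest=500):
    """Map normalized corners back to pixels for an image of the given size."""
    width, height = image_size
    scale = max(width, height) / longest
    return tuple(coord * scale for coord in rect)


# The example from the text: a 1500x3000 image, so each coordinate is multiplied by 1/6.
normalized = normalize_rect((100, 200, 500, 800), (1500, 3000))

# After the image is reduced to 750x1500, the same stored numbers still work.
restored = denormalize_rect(normalized, (750, 1500))
print([round(value, 1) for value in normalized])
print([round(value, 1) for value in restored])
```

Because only the ratio to the longest side is stored, the rectangle needs no update when the image is resized, as long as the aspect ratio is preserved.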

Another approach would be to store the original dimension of the image along with the coordinates of the rectangles. Then a rectangle-aware utility could modify both, but a utility that was not aware could pass both along as-is, and a later utility could recognize that the dimensions of the current image are not the same as the dimensions back when the rectangles were created, and could thus fix both.
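As a rough sketch of that second approach, a later utility could compare the stored original size with the current image size and rescale the rectangle to match. The function name, the (x, y, w, h) layout, and the sample numbers are assumptions for illustration, and this handles resizing only, not cropping.

```python
def rescale_rect(rect, orig_size, current_size):
    """Rescale an (x, y, w, h) rectangle recorded against orig_size to current_size.

    Returns the updated rectangle together with the new reference size, so
    both can be written back to the metadata at the same time.
    """
    x, y, w, h = rect
    orig_w, orig_h = orig_size
    cur_w, cur_h = current_size
    if (orig_w, orig_h) == (cur_w, cur_h):
        # Nothing changed since the rectangle was drawn.
        return rect, orig_size
    scale_x = cur_w / orig_w
    scale_y = cur_h / orig_h
    new_rect = (x * scale_x, y * scale_y, w * scale_x, h * scale_y)
    return new_rect, current_size


# A rectangle drawn on a 1500x3000 image, later found in a 750x1500 copy.
new_rect, new_size = rescale_rect((100, 200, 400, 600), (1500, 3000), (750, 1500))
print(new_rect, new_size)
```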

Metadata content

I'd recommend that the metadata contain this much information:

rectangle: x,y,w,h
origImageSize: w,h // size of the image at the time the rectangle was specified
noteType: if unspecified, then treat as a generic tag. But when specified, types could include:
- person (name and/or description of a person)
- text (for typing what text appears in the image at that point,
  e.g., handwritten text or OCR'd text or a sign or billboard)
- animal (name and/or description of a pet or animal in the wild)
- etc.
title (name or other short description, as might appear briefly just below someone's face in a screensaver)
description (more detailed description about the person or thing)
URL - URL to a web site with more information about this person or thing, such as a link to a Wikipedia article
id - unique id for this person or thing (either a globally unique ID or just unique within a user's collection?). Maybe a URN or URI actually makes more sense here, since the type of ID would be important for the ID to mean anything.

Something like that would probably do it.
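To show how those fields might hang together, here is a small Python sketch of one note as a data structure. It is not an XMP schema; the class name, the snake_case field names, and the sample values are all invented for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class RegionNote:
    """One annotated rectangle in an image (hypothetical structure)."""
    rectangle: Tuple[int, int, int, int]   # x, y, w, h
    orig_image_size: Tuple[int, int]       # w, h when the rectangle was drawn
    note_type: Optional[str] = None        # "person", "text", "animal", ...
    title: str = ""                        # short name or label
    description: str = ""                  # longer free-form description
    url: Optional[str] = None              # link to more information
    note_id: Optional[str] = None          # unique id, e.g. a URN or URI

    def effective_type(self) -> str:
        # An unspecified type is treated as a generic tag.
        return self.note_type or "generic"


# Made-up example values.
note = RegionNote(
    rectangle=(120, 80, 60, 90),
    orig_image_size=(1500, 3000),
    note_type="person",
    title="Grandma Ruth",
    description="Seated, holding the baby",
)
print(note.effective_type())
print(asdict(note))
```

Keeping the original image size next to each rectangle is what lets a later tool notice that the image has been resized since the note was made.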

I have posted queries at a couple of places online. Here are the links:

First, I was hoping someone could figure out how to define Adobe XMP tags for identifying people in photos, so I posted a message on Adobe's XMP forum.

Second, I was thinking that Google's Picasa image editing program has enough momentum that they might be willing to help define the tags, and then add features in their program to help users add them to photos efficiently. They would have additional motivation to do so, since "Google Image" requires images to be tagged anyway. In that post, I started dreaming about using face and voice recognition technology to help speed up tagging of hundreds or thousands of photos, though those are really bells and whistles that could be developed after the core tagging format and tools are developed. Here is the message I posted on Google's Picasa forum.

A genealogy software developer said that he used the very popular Lead Tools SDK to handle image input, output, and perhaps even tagging. They do apparently handle all sorts of annotations, so they might be interested in helping to develop a standard as well. They even have their own proprietary XML format for the tags, but it would be nice if they had a format that survived various image editing applications.

I also came across FotoTagger, which is a free Windows application that lets you annotate images, and appears to allow you to upload photos to FotoTagger galleries and/or Flickr. I don't know yet if it lets you draw a rectangle around something or not. Examples showed text boxes with arrows pointing to something, and another one had a bunch of names of people in text boxes that were scattered around the outside of an image in such a way that I couldn't actually tell who was who.

Perhaps the most promising avenue for getting a standard really defined, however, is to contact someone at IPTC, the standards organization that defines metadata in photos. I e-mailed IPTC and got this response from the managing director:

Thanks for your inquiry and the very quick reply is: to assign metadata only to regions of a photo is already on our agenda but I feel the most challenging issue is to implement this into the XMP technology (this XML in the header) as this is developed by Adobe. But let's see.
Btw: this is part of our Photo Metadata Roadmap, check it out in full at www.iptc.org/photometadata

So it is encouraging to think that IPTC may already be pushing for this standard. Once it is available--and adopted into XMP--then various utilities and applications can be used to start annotating all of our photos in bulk so that we can keep track of who is who in them, hopefully with links to external URIs that contain more information about them when appropriate.

One thing I noticed in the IPTC photo metadata white paper was that as it talked about Adobe's XMP, it said:

The data model explains that each metadata field...has three components: the media content it refers to (usually this is the media content encapsulated by the same file, e.g. the photo in a JPEG file); the semantics of this field; and the value applied to the field, which must comply with its semantics.

So it would seem that "the media content it refers to" could, instead of just referring to the entire JPEG file, refer to a region within that file. Specifically, XMP attaches attribute values to resources, so perhaps a resource could be defined that refers to a region of an image instead of the entire image, and then various metadata tags could be attached to that.

Sunday, October 23, 2011

Tuesday, July 10, 2007

Thursday, June 21, 2007
