m0rin09ma3 dup_images 20130730



Prerequisite

I installed the PIL module for this assignment in my 'virt1' virtual environment. I also prepared two image files for testing, one with EXIF data and one without.

(virt1) $ pip list
PIL (1.1.7)
(virt1) $ python dup_images.py -h
usage: dup_images.py [-h] D [D ...]

find duplicated image files.

positional arguments:
  D           directory to be searched

optional arguments:
  -h, --help  show this help message and exit

Sample output. The current directory contains exif.jpg and nonexif.jpg. /tmp/dir_1/ contains a.jpg, which is a renamed copy of ./exif.jpg. /tmp/dir_2/ contains exif.jpg, which is a copy of ./exif.jpg under the same name.

(virt1) $ python dup_images.py /home/ska /usr/bin ./ abc /tmp/dir_1 /tmp/dir_2/
Warning: "/home/ska" is not a directory.
Warning: "abc" is not a directory.
Warning: "./nonexif.jpg" has no EXIF data.
Found a match: "/tmp/dir_2/exif.jpg"    "./exif.jpg"
Found a match: "/tmp/dir_2/exif.jpg"    "/tmp/dir_1/a.jpg"
Found a match: "/tmp/dir_1/a.jpg"       "./exif.jpg"

A link to the source code.

Explanation

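This program finds duplicate JPEG image files in the directories you pass as command-line arguments. Only files matching *.jpg directly inside each directory are considered. Note that arguments that are not directories and files that have no EXIF data are skipped with a warning, so they never take part in the comparison.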

First of all, the main function:

def main():
    """
    0. get command line arguments
    1. find image files from directory
    2. find a match by comparing exif info
    """
    list_directory = parse_command_line()
    #print list_directory.directory

    list_image_file = find_image_files(list_directory)
    #print list_image_file
    if not list_image_file:
        print 'no image file found in the directory.'
        return 1

    return find_a_match(list_image_file)

if __name__ == '__main__':
    exit(main())

Next, parse_command_line parses the command-line arguments with the argparse module.

def parse_command_line():
    """ The user must specify at least one directory """
    parser = argparse.ArgumentParser(description='find duplicated image files.')
    parser.add_argument('directory', metavar='D', type=str, nargs='+',
                        help='directory to be searched')
    args = parser.parse_args()

    return args
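
To make the effect of nargs='+' concrete, here is a minimal sketch that builds the same parser and feeds it an explicit argument list instead of reading sys.argv. The helper name build_parser is only for illustration and is not part of dup_images.py. The snippet calls print() with a single argument, so it should run on both Python 2.7 and Python 3.

```python
import argparse


def build_parser():
    """Build the same parser as parse_command_line() (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='find duplicated image files.')
    parser.add_argument('directory', metavar='D', type=str, nargs='+',
                        help='directory to be searched')
    return parser


# Passing a list to parse_args() avoids touching the real command line.
args = build_parser().parse_args(['./', '/tmp/dir_1', '/tmp/dir_2/'])

# nargs='+' collects every positional argument into one list.
print(args.directory)
```

Because of nargs='+', running the script with no directory at all makes argparse print the usage message and exit instead of returning an empty list.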

Then find_image_files collects the JPEG files (*.jpg) from each valid directory and returns them as a single list. It uses the os and glob modules.

def find_image_files(list_directory):
    """ Find image files only from valid directory """
    image_files = []
    for directory in list_directory.directory:
        if os.path.isdir(directory):
            image_files.extend(glob.glob(os.path.join(directory, '*.jpg')))
        else:
            print 'Warning: "%s" is not a directory.' % directory

    return image_files
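
One limitation of the '*.jpg' pattern is that on case-sensitive file systems it misses files named .JPG or .jpeg. The following is a minimal sketch of a possible variant, not the code used in dup_images.py: the name find_jpeg_files, the JPEG_EXTENSIONS tuple and the choice to return the skipped arguments instead of printing warnings are all assumptions made for illustration.

```python
import os
import tempfile

JPEG_EXTENSIONS = ('.jpg', '.jpeg')  # assumed set of accepted extensions


def find_jpeg_files(directories):
    """Return (image_files, skipped).

    image_files: JPEG paths found directly inside each valid directory.
    skipped: arguments that were not directories.
    """
    image_files = []
    skipped = []
    for directory in directories:
        if not os.path.isdir(directory):
            skipped.append(directory)
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Compare the lower-cased extension so .JPG also matches.
            extension = os.path.splitext(name)[1].lower()
            if extension in JPEG_EXTENSIONS and os.path.isfile(path):
                image_files.append(path)
    return image_files, skipped


# Small demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
for name in ('a.jpg', 'b.JPG', 'c.jpeg', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

found, skipped = find_jpeg_files([demo_dir, os.path.join(demo_dir, 'missing')])
print([os.path.basename(path) for path in found])
print(skipped)
```

Like the original, this sketch does not descend into subdirectories.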

Finally, the somewhat longer and more fragile code that finds the duplicates. I use two parallel lists. The first holds the paths of the image files that have EXIF data (a separate helper function, get_exif_data, reads the EXIF data from a file). The second holds the corresponding EXIF dictionaries. The entry at index i in one list belongs to the entry at index i in the other, so whoever maintains this code must keep both lists in the same order.

def find_a_match(list_image_file):
    """ Find duplicates only from files with EXIF data """
    list_image_file_with_exif = []
    list_exif_data = []
    for image_file in list_image_file:
        dict_exif_data = get_exif_data(image_file)
        #print dict_exif_data
        if not dict_exif_data:
            print 'Warning: "%s" has no EXIF data.' % image_file
        else: # Append to both lists together so their indexes stay aligned
            list_image_file_with_exif.append(image_file)
            list_exif_data.append(dict_exif_data)

    #print list_image_file_with_exif
    #print list_exif_data
    total = len(list_image_file_with_exif)
    # Is there a better approach for finding a match? I'm keen to learn what others are doing ;)
    for i in range(total-1, 0, -1):
        for j in range(i):
            #print 'cmp(dict_%d, dict_%d)' % (i, j),
            if list_exif_data[i] == list_exif_data[j]:
                print 'Found a match: "%s"\t"%s"' % (
                                      list_image_file_with_exif[i],
                                      list_image_file_with_exif[j])

    return 0

def get_exif_data(fname):
    """ Get embedded EXIF data from image file. """
    exif_data = {}
    try:
        img = Image.open(fname)
    except IOError:
        print 'Error: IOError ' + fname
    else:
        if hasattr(img, '_getexif'):
            exif = img._getexif()
            if exif is not None:
                for tag, value in exif.items():
                    decoded = TAGS.get(tag, tag)
                    exif_data[decoded] = value

    return exif_data
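
As one possible answer to the "better approach" question in the comment above, here is a minimal sketch that groups files by a fingerprint of their EXIF data using a dictionary. It keeps each path together with its EXIF data, so there are no parallel lists to keep in sync, and it needs one pass over the files instead of comparing every pair. The names exif_fingerprint and group_duplicates and the sample EXIF values are hypothetical. The fingerprint is built with repr(), which is an assumption: two dicts that compare equal with == but print differently (for example 1 and 1.0) would not be grouped.

```python
from collections import defaultdict


def exif_fingerprint(exif_data):
    """Turn an EXIF dict into a hashable key that ignores key order."""
    # Sort by repr() of the tag because decoded tags can be str or int.
    items = sorted(exif_data.items(), key=lambda item: repr(item[0]))
    return repr(items)


def group_duplicates(exif_by_path):
    """Return lists of paths that share the same EXIF fingerprint."""
    groups = defaultdict(list)
    for path, exif_data in exif_by_path.items():
        if exif_data:  # files without EXIF data are ignored
            groups[exif_fingerprint(exif_data)].append(path)
    # Only groups with two or more files are duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical EXIF dicts standing in for get_exif_data() results.
sample = {
    './exif.jpg': {'Make': 'ACME', 'DateTime': '2013:07:30 10:00:00'},
    '/tmp/dir_1/a.jpg': {'DateTime': '2013:07:30 10:00:00', 'Make': 'ACME'},
    '/tmp/dir_2/exif.jpg': {'Make': 'ACME', 'DateTime': '2013:07:30 10:00:00'},
    './other.jpg': {'Make': 'ACME', 'DateTime': '2013:07:31 09:00:00'},
    './nonexif.jpg': {},
}
print(group_duplicates(sample))
```

With this shape, three copies of one image are reported as a single group rather than as three separate pairs.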
Contents © 2013 dgplug - Powered by Nikola