diff --git a/README.md b/README.md index 35f4562..fce13da 100644 --- a/README.md +++ b/README.md @@ -6,23 +6,28 @@ table of contents 1. [summary](https://github.com/kieranjol/IFIscripts#summary) 2. [Arrangement](https://github.com/kieranjol/IFIscripts#arrangement) - * [sipcreator.py](https://github.com/kieranjol/IFIscripts#sipcreatorpy) -3. [Transcodes](https://github.com/kieranjol/IFIscripts#transcodes) + * [sipcreator.py](https://github.com/kieranjol/IFIscripts#sipcreator) +3. [PREMIS](https://github.com/kieranjol/IFIscripts#PREMIS) + * [premisobjects.py](https://github.com/kieranjol/IFIscripts#premisobjectspy) + * [logs2premis.py](https://github.com/kieranjol/IFIscripts#logs2premispy) + * [makepremis.py](https://github.com/kieranjol/IFIscripts#makepremispremispy) + * [premiscsv2xml.py](https://github.com/kieranjol/IFIscripts#premiscsv2xmlpy) +4. [Transcodes](https://github.com/kieranjol/IFIscripts#transcodes) * [makeffv1.py](https://github.com/kieranjol/IFIscripts#makeffv1py) * [bitc.py](https://github.com/kieranjol/IFIscripts#bitcpy) * [prores.py](https://github.com/kieranjol/IFIscripts#prorespy) * [concat.py](https://github.com/kieranjol/IFIscripts#concatpy) -4. [Digital Cinema Package Scripts](https://github.com/kieranjol/IFIscripts#digital-cinema-package-scripts) +5. [Digital Cinema Package Scripts](https://github.com/kieranjol/IFIscripts#digital-cinema-package-scripts) * [dcpaccess.py](https://github.com/kieranjol/IFIscripts#dcpaccesspy) * [dcpfixity.py](https://github.com/kieranjol/IFIscripts#dcpfixitypy) * [dcpsubs2srt.py](https://github.com/kieranjol/IFIscripts#dcpsubs2srtpy) -5. [Fixity Scripts](https://github.com/kieranjol/IFIscripts#fixity-scripts) +6. 
[Fixity Scripts](https://github.com/kieranjol/IFIscripts#fixity-scripts) * [copyit.py](https://github.com/kieranjol/IFIscripts#copyitpy) * [manifest.py](https://github.com/kieranjol/IFIscripts#manifestpy) * [sha512deep.py](https://github.com/kieranjol/IFIscripts#sha512deeppy) * [validate.py](https://github.com/kieranjol/IFIscripts#validatepy) * [batchfixity.py](https://github.com/kieranjol/IFIscripts#batchfixitypy) -6. [Image Sequences](https://github.com/kieranjol/IFIscripts#image-sequences) +7. [Image Sequences](https://github.com/kieranjol/IFIscripts#image-sequences) * [makedpx.py](https://github.com/kieranjol/IFIscripts#makedpxpy) * [seq2ffv1.py](https://github.com/kieranjol/IFIscripts#seq2ffv1py) * [seq2prores.py](https://github.com/kieranjol/IFIscripts#seq2prorespy) @@ -33,10 +38,10 @@ table of contents * [seq2dv.py](https://github.com/kieranjol/IFIscripts#seq2dvpy) * [batchmetadata.py](https://github.com/kieranjol/IFIscripts#batchmetadata) * [batchrename.py](https://github.com/kieranjol/IFIscripts#batchrename) -7. [Quality Control](https://github.com/kieranjol/IFIscripts#quality-control) +8. [Quality Control](https://github.com/kieranjol/IFIscripts#quality-control) * [qctools.py](https://github.com/kieranjol/IFIscripts#qctoolspy) +9. [Specific Workflows](https://github.com/kieranjol/IFIscripts#specific-workflows) * [ffv1mkvvalidate.py](https://github.com/kieranjol/IFIscripts#ffv1mkvvalidatespy) -8. [Specific Workflows](https://github.com/kieranjol/IFIscripts#specific-workflows) * [mezzaninecheck.py](https://github.com/kieranjol/IFIscripts#mezzaninecheckpy) * [loopline.py](https://github.com/kieranjol/IFIscripts#looplinepy) * [masscopy.py](https://github.com/kieranjol/IFIscripts#masscopypy) @@ -47,7 +52,7 @@ table of contents * [giffer.py](https://github.com/kieranjol/IFIscripts#gifferpy) * [makeuuid.py](https://github.com/kieranjol/IFIscripts#makeuuidpy) * [durationcheck.py](https://github.com/kieranjol/IFIscripts#durationcheck.py) -10. 
[Experimental-Premis](https://github.com/kieranjol/IFIscripts#experimental-premis) +11. [Experimental-Premis](https://github.com/kieranjol/IFIscripts#experimental-premis) + * [premis.py](https://github.com/kieranjol/IFIscripts#premispy) + * [revtmd.py](https://github.com/kieranjol/IFIscripts#revtmdpy) + * [as11fixity.py](https://github.com/kieranjol/IFIscripts#as11fixitypy) @@ -71,6 +76,32 @@ Note: Documentation template has been copied from [mediamicroservices](https://g * Usage for more than one directory - `sipcreator.py -i /path/to/directory_name1 /path/to/directory_name2 -o /path/to/output_folder` * Run `sipcreator.py -h` for all options. +## PREMIS ## + +### makepremis.py ### +* Creates PREMIS CSV and XML descriptions by launching other IFIscripts, such as logs2premis.py, premisobjects.py and premiscsv2xml.py. +* Assumptions for now: the representation UUID already exists as part of the SIP/AIP folder structure. Find a way to supply this, probably via argparse. +* For more information, run `pydoc makepremis` +* Usage: `makepremis.py /path/to/input_directory -event_csv path/to/events.csv -object_csv path/to/objects.csv` + +### premisobjects.py ### +* Creates a somewhat PREMIS-compliant CSV file describing the objects in a package. premiscsv2xml.py transforms these CSV files into XML. +* As the flat CSV structure prevents maintaining some of the relationships between units, some semantic units have been merged. For example, `relationship_structural_includes` is really a combination of the `relationshipType` and `relationshipSubType` units, which have the values `Structural` and `Includes` respectively. +* Assumptions for now: the representation UUID already exists as part of the SIP/AIP folder structure. Find a way to supply this, probably via argparse.
+* For more information, run `pydoc premisobjects` +* Usage: `premisobjects.py -i path/to/SIP -m path/to/manifest.md5 -o path/to/output.csv` + +### logs2premis.py ### +* Extracts preservation events from an IFI plain text log file and converts them to a CSV using the PREMIS data dictionary. +* For more information, run `pydoc logs2premis` +* Usage: `logs2premis.py -i path/to/logfile.log -o path/to/output.csv -object_csv path/to/objects.csv` + +### premiscsv2xml.py ### +* Transforms PREMIS CSV files into XML. +* For more information, run `pydoc premiscsv2xml` +* Usage: `premiscsv2xml.py -ev path/to/events.csv -i path/to/objects.csv` + + ## Transcodes ## ### makeffv1.py ### diff --git a/ififuncs.py b/ififuncs.py index 2e4b480..150e596 100755 --- a/ififuncs.py +++ b/ififuncs.py @@ -855,6 +855,51 @@ def checksum_replace(manifest, logname): for lines in updated_manifest: fo.write(lines) +def get_pronom_format(filename): + ''' + Uses siegfried to return a tuple that contains: + pronom_id, authority, siegfried version + ''' + siegfried_json = subprocess.check_output( + ['sf', '-json', filename] + ) + json_object = json.loads(siegfried_json) + pronom_id = str(json_object['files'][0]['matches'][0]['id']) + authority = str(json_object['files'][0]['matches'][0]['ns']) + version = str(json_object['siegfried']) + return (pronom_id, authority, version) + +def get_checksum(manifest, filename): + ''' + Extracts the checksum and path of a file within a manifest, returning both as a tuple. + ''' + if os.path.isfile(manifest): + with open(manifest, 'r') as manifest_object: + manifest_lines = manifest_object.readlines() + for md5 in manifest_lines: + if 'objects' in md5: + if filename in md5: + return md5[:32], md5[34:].rstrip() + +def find_representation_uuid(source): + ''' + Extracts the representation UUID from the SIP directory structure, so that it can be used by other scripts.
+ ''' + for root, _, _ in os.walk(source): + if 'objects' in root: + return os.path.basename(os.path.dirname(root)) + +def extract_metadata(csv_file): + ''' + Read the PREMIS csv and store the metadata in a list of dictionaries. + ''' + object_dictionaries = [] + input_file = csv.DictReader(open(csv_file)) + for rows in input_file: + object_dictionaries.append(rows) + return object_dictionaries + def img_seq_pixfmt(start_number, path): ''' Determine the pixel format of an image sequence diff --git a/logs2premis.py b/logs2premis.py new file mode 100755 index 0000000..9d6b34d --- /dev/null +++ b/logs2premis.py @@ -0,0 +1,226 @@ +#!/usr/bin/env python +''' +Extracts preservation events from an IFI plain text log file and converts +to a CSV using the PREMIS data dictionary +''' +import os +import sys +import csv +import shutil +import argparse +# from lxml import etree +import ififuncs + + +def find_events(logfile, output): + ''' + A very hacky attempt to extract the relevant preservation events from our + log files. 
+ ''' + sip_test = os.path.basename(logfile).replace('_sip_log.log', '') + if ififuncs.validate_uuid4(sip_test) != False: + linking_object_identifier_value = sip_test + with open(logfile, 'r') as logfile_object: + log_lines = logfile_object.readlines() + for event_test in log_lines: + if 'eventDetail=copyit.py' in event_test: + logsplit = event_test.split(',') + for line_fragment in logsplit: + manifest_event = line_fragment.replace( + 'eventDetail', '' + ).replace('\n', '').split('=')[1] + object_info = ififuncs.extract_metadata('objects.csv') + object_locations = {} + for i in object_info: + object_locations[ + i['contentLocationValue'] + ] = i['objectIdentifier'].split(', ')[1].replace(']', '') + for log_entry in log_lines: + valid_entries = [ + 'eventType', + 'eventDetail=sipcreator.py', + 'eventDetail=Mediatrace', + 'eventDetail=Technical', + 'eventDetail=copyit.py' + ] + for entry in valid_entries: + if entry in log_entry: + break_loop = '' + event_outcome = '' + event_detail = '' + event_outcome_detail_note = '' + event_type = '' + event_row = [] + datetime = log_entry[:19] + logsplit = log_entry.split(',') + for line_fragment in logsplit: + if 'eventType' in line_fragment: + if 'EVENT =' in line_fragment: + line_fragment = line_fragment.split('EVENT =')[1] + event_type = line_fragment.replace( + ' eventType=', '' + ).replace('assignement', 'assignment') + if ' value' in line_fragment: + # this assumes that the value is the outcome of an identifier assigment. + event_outcome = line_fragment[7:].replace('\n', '') + # we are less concerned with events starting. + if 'status=started' in line_fragment: + break_loop = 'continue' + if 'Generating destination manifest:' in line_fragment: + break_loop = '' + event_detail = manifest_event + # ugh, this might run multiple times. 
+ if 'eventDetail=sipcreator.py' in log_entry: + event_type = 'Information Package Creation' + event_detail = line_fragment.replace( + 'eventDetail', '' + ).replace('\n', '').split('=')[1] + event_outcome_detail_note = 'Submission Information Package' + if ('eventDetail=Mediatrace' in log_entry) or ('eventDetail=Technical' in log_entry): + event_type = 'metadata extraction' + event_detail = log_entry.split( + 'eventDetail=', 1 + )[1].split(',')[0] + event_outcome = log_entry.split( + 'eventOutcome=', 1 + )[1].replace(', agentName=mediainfo', '').replace('\n', '') + if 'eventDetail=Mediatrace' in log_entry: + event_outcome = event_outcome.replace('mediainfo.xml', 'mediatrace.xml') + for x in object_locations: + ''' + This is trying to get the UUID of the source object + that relates to the mediainfo xmls. This is + achieved via a dictionary. + ''' + if 'objects' in x: + a = os.path.basename(event_outcome).replace('_mediainfo.xml', '').replace('_mediatrace.xml', '')[:-1] + b = os.path.basename(x) + if a == b: + linking_object_identifier_value = object_locations[x].replace('\'', '') + if (break_loop == 'continue') or (event_type == ''): + continue + print event_type + event_row = [ + 'UUID', ififuncs.create_uuid(), + event_type, datetime, event_detail, + '', + event_outcome, '', + event_outcome_detail_note, '', + '', '', + '', 'UUID', + linking_object_identifier_value, '' + ] + ififuncs.append_csv(output, event_row) + + +def update_objects(output, objects_csv): + ''' + Update the object description with the linkingEventIdentifiers + ''' + link_dict = {} + event_dicts = ififuncs.extract_metadata(output) + for i in event_dicts: + a = i['eventIdentifierValue'] + try: + link_dict[i['linkingObjectIdentifierValue']] += a + '|' + except KeyError: + link_dict[i['linkingObjectIdentifierValue']] = a + '|' + print link_dict + object_dicts = ififuncs.extract_metadata(objects_csv) + for x in object_dicts: + for link in link_dict: + if link == x['objectIdentifier'].split(', 
')[1].replace(']', '').replace('\'', ''): + x['linkingEventIdentifierValue'] = link_dict[link] + premis_object_units = [ + 'objectIdentifier', + 'objectCategory', + 'messageDigestAlgorithm', 'messageDigest', 'messageDigestOriginator', + 'size', 'formatName', 'formatVersion', + 'formatRegistryName', 'formatRegistryKey', 'formatRegistryRole', + 'objectCharacteristicsExtension', 'originalName', + 'contentLocationType', 'contentLocationValue', + 'relatedObjectIdentifierType', 'relatedObjectIdentifierValue', + 'relatedObjectSequence', + 'relatedEventIdentifierType', 'relatedEventIdentifierValue', + 'relatedEventSequence', + 'linkingEventIdentifierType', 'linkingEventIdentifierValue', + 'relationship_structural_includes', + 'relationship_structural_isincludedin', + 'relationship_structural_represents', + 'relationship_structural_hasroot', + 'relationship_derivation_hassource' + ] + with open('mycsvfile.csv', 'wb') as f: + counter = 0 + for i in object_dicts: + w = csv.DictWriter(f, fieldnames=premis_object_units) + if counter == 0: + w.writeheader() + counter += 1 + w.writerow(i) + shutil.move('mycsvfile.csv', objects_csv) + + +def make_events_csv(output): + ''' + Generates a CSV with PREMIS-esque headings. Currently it's just called + 'bla.csv' but it will probably be called: + UUID_premisevents.csv + and sit in the metadata directory. + ''' + premis_events = [ + 'eventIdentifierType', 'eventIdentifierValue', + 'eventType', 'eventDateTime', 'eventDetail', + 'eventDetailExtension', + 'eventOutcome', 'eventOutcomeDetail', + 'eventOutcomeDetailNote', 'eventOutcomeDetailExtension', + 'linkingAgentIdentifierType', 'linkingAgentIdentifierValue', + 'linkingAgentIdentifierRole', 'linkingObjectIdentifierType', + 'linkingObjectIdentifierValue', 'linkingObjectRole' + ] + ififuncs.create_csv(output, premis_events) + + +def parse_args(args_): + ''' + Parse command line arguments. 
+ ''' + parser = argparse.ArgumentParser( + description='Describes events using the PREMIS data dictionary via CSV.' + ' Written by Kieran O\'Leary.' + ) + parser.add_argument( + '-i', + help='full path of a log textfile', required=True + ) + parser.add_argument( + '-o', + help='full path of output csv', required=True + ) + parser.add_argument( + '-object_csv', + help='full path of object description csv', required=True + ) + parser.add_argument( + '-user', + help='Declare who you are. If this is not set, you will be prompted.' + ) + parsed_args = parser.parse_args(args_) + return parsed_args + + +def main(args_): + ''' + Launches all the other functions when run from the command line. + ''' + args = parse_args(args_) + logfile = args.i + output = args.o + objects_csv = args.object_csv + make_events_csv(output) + find_events(logfile, output) + update_objects(output, objects_csv) + + +if __name__ == '__main__': + main(sys.argv[1:]) diff --git a/makepremis.py b/makepremis.py new file mode 100755 index 0000000..5180869 --- /dev/null +++ b/makepremis.py @@ -0,0 +1,81 @@ +#!/usr/bin/env python +''' +Creates PREMIS CSV and XML descriptions by launching other IFIscripts, +such as logs2premis.py, premisobjects.py and premiscsv2xml.py. +''' +import os +import argparse +import premisobjects +import premiscsv2xml +import logs2premis + + +def parse_args(): + ''' + Parse command line arguments. + ''' + parser = argparse.ArgumentParser( + description='Creates PREMIS CSV and XML descriptions by launching ' + 'other IFIscripts, such as logs2premis.py, premisobjects.py and ' + 'premiscsv2xml.py.' + ' Written by Kieran O\'Leary.' + ) + parser.add_argument( + 'input', + help='full path to your input directory' + ) + parser.add_argument( + '-user', + help='Declare who you are. If this is not set, you will be prompted.' + ) + parser.add_argument( + '-object_csv', required=True, + help='full path and filename of the output objects CSV.'
+ ) + parser.add_argument( + '-event_csv', required=True, + help='full path and filename of the output events CSV' + ) + parsed_args = parser.parse_args() + return parsed_args + + +def launch_scripts(source, args): + ''' + Launches premisobjects, logs2premis and premiscsv2xml in the input directory. + ''' + for root, _, _ in os.walk(source): + if os.path.basename(root) == 'objects': + objects_csv = args.object_csv + events_csv = args.event_csv + uuid_dir = os.path.dirname(root) + logs_dir = os.path.join( + uuid_dir, 'logs' + ) + logname = os.path.join( + logs_dir, os.path.basename(uuid_dir + '_sip_log.log') + ) + manifest = os.path.join( + os.path.dirname(uuid_dir), os.path.basename(uuid_dir + '_manifest.md5') + ) + premisobjects.main( + ['-i', root, '-m', manifest, '-o', objects_csv] + ) + logs2premis.main( + ['-i', logname, '-object_csv', objects_csv, '-o', events_csv] + ) + premiscsv2xml.main( + ['-i', objects_csv, '-ev', events_csv] + ) + + +def main(): + ''' + Launches the other functions when called from the command line. + ''' + args = parse_args() + source = args.input + launch_scripts(source, args) + +if __name__ == '__main__': + main() diff --git a/premiscsv2xml.py b/premiscsv2xml.py new file mode 100755 index 0000000..610799a --- /dev/null +++ b/premiscsv2xml.py @@ -0,0 +1,228 @@ +#!/usr/bin/env python +''' +Takes PREMIS CSV files, as generated by premisobjects.py and logs2premis.py, +and transforms them into XML. +''' +import sys +import argparse +from lxml import etree +import ififuncs + + +def write_premis(doc, premisxml): + ''' + Writes the PREMIS object to a file. + ''' + with open(premisxml, 'w') as out_file: + doc.write(out_file, pretty_print=True) + + +def create_unit(index, parent, unitname): + ''' + Helper function that adds an XML element.
+ ''' + premis_namespace = "http://www.loc.gov/premis/v3" + unitname = etree.Element("{%s}%s" % (premis_namespace, unitname)) + parent.insert(index, unitname) + return unitname + + +def setup_xml(): + ''' + Creates the root PREMIS lxml object. + Actual metadata generation should be moved to other functions. + ''' + # etree.fromstring() cannot parse an empty string, so a minimal + # PREMIS 3.0 root element is assumed here. + namespace = ( + '<premis:premis xmlns:premis="http://www.loc.gov/premis/v3" ' + 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ' + 'version="3.0"></premis:premis>' + ) + premis = etree.fromstring(namespace) + return premis + + +def describe_objects(premis, object_dictionaries): + ''' + Converts the CSV object metadata into PREMIS XML. + ''' + xsi_namespace = "http://www.w3.org/2001/XMLSchema-instance" + for objects in object_dictionaries: + id_list = objects['objectIdentifier'].replace( + '[', '' + ).replace(']', '').replace('\'', '').split(', ') + object_parent = create_unit( + 0, premis, 'object' + ) + object_parent.attrib[ + "{%s}type" % xsi_namespace + ] = "premis:%s" % objects['objectCategory'] + object_identifier_uuid = create_unit( + 2, object_parent, 'objectIdentifier' + ) + object_identifier_uuid_type = create_unit( + 1, object_identifier_uuid, 'objectIdentifierType' + ) + object_identifier_uuid_value = create_unit( + 2, object_identifier_uuid, 'objectIdentifierValue' + ) + object_identifier_uuid_type.text = id_list[0] + object_identifier_uuid_value.text = id_list[1] + if objects['objectCategory'] == 'file': + object_characteristics = create_unit( + 5, object_parent, 'objectCharacteristics' + ) + storage = create_unit( + 7, object_parent, 'storage' + ) + content_location = create_unit( + 0, storage, 'contentLocation' + ) + content_location_type = create_unit( + 0, content_location, 'contentLocationType' + ) + content_location_value = create_unit( + 1, content_location, 'contentLocationValue' + ) + fixity = create_unit( + 0, object_characteristics, 'fixity' + ) + size = create_unit( + 1, object_characteristics, 'size' + ) + format_ = create_unit( + 2, object_characteristics, 'format' + ) + format_registry = create_unit( + 1, format_, 'formatRegistry' + ) + format_registry_name = 
create_unit( + 0, format_registry, 'formatRegistryName' + ) + format_registry_key = create_unit( + 1, format_registry, 'formatRegistryKey' + ) + format_registry_role = create_unit( + 2, format_registry, 'formatRegistryRole' + ) + size.text = objects['size'] + message_digest_algorithm = create_unit( + 0, fixity, 'messageDigestAlgorithm' + ) + message_digest = create_unit( + 1, fixity, 'messageDigest' + ) + message_digest_originator = create_unit( + 2, fixity, 'messageDigestOriginator' + ) + message_digest_originator.text = objects['messageDigestOriginator'] + message_digest.text = objects['messageDigest'] + message_digest_algorithm.text = objects['messageDigestAlgorithm'] + format_registry_name.text = objects['formatRegistryName'] + format_registry_key.text = objects['formatRegistryKey'] + format_registry_role.text = objects['formatRegistryRole'] + content_location_type.text = objects['contentLocationType'] + content_location_value.text = objects['contentLocationValue'] + linked_events = objects['linkingEventIdentifierValue'].split('|') + for event in linked_events: + if event != '': + linking_event_identifier = create_unit( + 99, object_parent, 'linkingEventIdentifier' + ) + linking_event_identifier_type = create_unit( + 1, linking_event_identifier, 'linkingEventIdentifierType' + ) + linking_event_identifier_value = create_unit( + 2, linking_event_identifier, 'linkingEventIdentifierValue' + ) + linking_event_identifier_type.text = 'UUID' + linking_event_identifier_value.text = event + return premis + + +def describe_events(premis, event_dictionaries): + ''' + Converts the CSV event metadata into PREMIS XML.
+ ''' + for x in event_dictionaries: + event_parent = create_unit( + 99, premis, 'event' + ) + event_identifier_uuid = create_unit( + 1, event_parent, 'eventIdentifier' + ) + event_identifier_uuid_type = create_unit( + 1, event_identifier_uuid, 'eventIdentifierType' + ) + event_identifier_uuid_value = create_unit( + 2, event_identifier_uuid, 'eventIdentifierValue' + ) + event_type = create_unit( + 1, event_parent, 'eventType' + ) + event_date_time = create_unit( + 2, event_parent, 'eventDateTime' + ) + event_detail_information = create_unit( + 3, event_parent, 'eventDetailInformation' + ) + event_detail = create_unit( + 1, event_detail_information, 'eventDetail' + ) + event_outcome_information = create_unit( + 4, event_parent, 'eventOutcomeInformation' + ) + event_outcome = create_unit( + 1, event_outcome_information, 'eventOutcome' + ) + event_outcome_detail = create_unit( + 2, event_outcome_information, 'eventOutcomeDetail' + ) + event_outcome_detail_note = create_unit( + 1, event_outcome_detail, 'eventOutcomeDetailNote' + ) + event_identifier_uuid_type.text = x['eventIdentifierType'] + event_identifier_uuid_value.text = x['eventIdentifierValue'] + event_type.text = x['eventType'] + event_date_time.text = x['eventDateTime'] + event_detail.text = x['eventDetail'] + event_outcome.text = x['eventOutcome'] + event_outcome_detail_note.text = x['eventOutcomeDetailNote'] + print(etree.tostring(premis, pretty_print=True)) + + +def parse_args(args_): + ''' + Parse command line arguments. + ''' + parser = argparse.ArgumentParser( + description='Converts PREMIS CSV to XML' + ' Written by Kieran O\'Leary.' + ) + parser.add_argument( + '-i', + help='full path of objects csv', required=True + ) + parser.add_argument( + '-ev', + help='full path of events csv', required=True + ) + parser.add_argument( + '-user', + help='Declare who you are. If this is not set, you will be prompted.' 
+ ) + parsed_args = parser.parse_args(args_) + return parsed_args + + +def main(args_): + ''' + Launches all the other functions when run from the command line. + For debugging purposes, the contents of the CSV are printed to screen. + ''' + args = parse_args(args_) + csv_file = args.i + events_csv = args.ev + object_dictionaries = ififuncs.extract_metadata(csv_file) + event_dictionaries = ififuncs.extract_metadata(events_csv) + premis = setup_xml() + premis = describe_objects(premis, object_dictionaries) + describe_events(premis, event_dictionaries) + + +if __name__ == '__main__': + main(sys.argv[1:]) diff --git a/premisobjects.py b/premisobjects.py new file mode 100755 index 0000000..883d5e8 --- /dev/null +++ b/premisobjects.py @@ -0,0 +1,205 @@ +#!/usr/bin/env python +''' +Creates a somewhat PREMIS-compliant CSV file describing objects in a package. +premiscsv2xml.py transforms these CSV files into XML. +As the flat CSV structure prevents maintaining some of the complex +relationships between units, some semantic units have been merged, for example: +relationship_structural_includes is really a combination of the +relationshipType and relationshipSubType units, which have the values +Structural and Includes respectively. + +todo: +Document identifier assignment for files and IE. Probably in events sheet? +This would ideally just add to the log in the helper script. +Allow for derivation to be entered. +Link the mediainfo XML in /metadata to the objectCharacteristicsExtension field. + + +Assumptions for now: the representation UUID already exists as part of the +SIP/AIP folder structure. Find a way to supply this, probably via argparse. +''' + +import os +import sys +import argparse +import ififuncs + + +def make_skeleton_csv(output): + ''' + Generates a CSV with PREMIS-esque headings. Currently it's just called + 'objects.csv' but it will probably be called: + UUID_premisobjects.csv + and sit in the metadata directory.
+ ''' + premis_object_units = [ + 'objectIdentifier', + 'objectCategory', + 'messageDigestAlgorithm', 'messageDigest', 'messageDigestOriginator', + 'size', 'formatName', 'formatVersion', + 'formatRegistryName', 'formatRegistryKey', 'formatRegistryRole', + 'objectCharacteristicsExtension', 'originalName', + 'contentLocationType', 'contentLocationValue', + 'relatedObjectIdentifierType', 'relatedObjectIdentifierValue', + 'relatedObjectSequence', + 'relatedEventIdentifierType', 'relatedEventIdentifierValue', + 'relatedEventSequence', + 'linkingEventIdentifierType', 'linkingEventIdentifierValue', + 'relationship_structural_includes', + 'relationship_structural_isincludedin', + 'relationship_structural_represents', + 'relationship_structural_hasroot', + 'relationship_derivation_hassource' + ] + ififuncs.create_csv(output, premis_object_units) + + +def file_description(source, manifest, representation_uuid, output): + ''' + Generate PREMIS descriptions for items and write to CSV. + ''' + item_ids = [] + for root, _, filenames in os.walk(source): + filenames = [f for f in filenames if f[0] != '.'] + for item in filenames: + md5, uri = ififuncs.get_checksum(manifest, item) + item_uuid = ififuncs.create_uuid() + full_path = os.path.join(root, item) + print 'Using Siegfried to analyze %s' % item + pronom_id, authority, version = ififuncs.get_pronom_format( + full_path + ) + item_dictionary = {} + item_dictionary['objectIdentifier'] = ['UUID', item_uuid] + item_dictionary['objectCategory'] = 'file' + item_dictionary['size'] = str(os.path.getsize(full_path)) + item_dictionary['originalName'] = item + item_dictionary['relationship_structural_isincludedin'] = representation_uuid + item_ids.append(item_uuid) + file_data = [ + item_dictionary['objectIdentifier'], + item_dictionary['objectCategory'], + 'md5', md5, 'internal', + item_dictionary['size'], '', '', + authority, pronom_id, 'identification', + '', item, + 'uri', uri, + '', '', + '', + '', '', + '', + '', '', + '', + 
item_dictionary['relationship_structural_isincludedin'], + '', + '', + '' + ] + ififuncs.append_csv(output, file_data) + return item_ids + +def build_relationships(): + ''' + Placeholder function that will produce a CSV containing the relationships + within a PREMIS object description. + ''' + relationships = [ + "relationship_uuid", + "objectIdentifierValue", + "relationshipType", + "relationshipSubType", + "relatedObjectIdentifierType", + "relatedObjectIdentifierValue", + "relatedEventIdentifierType", + "relatedEventIdentifierValue", + "relatedEventSequence" + ] +def representation_description(representation_uuid, item_ids, output): + ''' + Generate PREMIS descriptions for a representation and write to CSV. + ''' + representation_dictionary = {} + representation_dictionary['objectIdentifier'] = ['UUID', representation_uuid] + representation_dictionary['objectCategory'] = 'representation' + representation_dictionary['relationship_structural_includes'] = '' + for item_id in item_ids: + representation_dictionary['relationship_structural_includes'] += item_id + '|' + representation_data = [ + representation_dictionary['objectIdentifier'], + representation_dictionary['objectCategory'], + '', '', '', + '', '', '', + '', '', '', + '', '', + '', '', + '', '', + '', + '', '', + '', + '', '', + representation_dictionary['relationship_structural_includes'], + '', + '', + '', + '' + ] + ififuncs.append_csv(output, representation_data) + + +def intellectual_entity_description(): + ''' + Generate PREMIS descriptions for Intellectual Entities and write to CSV. + ''' + intellectual_entity_dictionary = {} + intellectual_entity_dictionary['objectIdentifier'] = ['UUID', ififuncs.create_uuid()] + intellectual_entity_dictionary['objectCategory'] = 'intellectual entity' + #print intellectual_entity_dictionary + + +def parse_args(args_): + ''' + Parse command line arguments. 
+ ''' + parser = argparse.ArgumentParser( + description='Describes objects using the PREMIS data dictionary via CSV.' + ' Written by Kieran O\'Leary.' + ) + parser.add_argument( + '-i', + help='full path of input objects directory', required=True + ) + parser.add_argument( + '-o', '-output', + help='full path of the output CSV file', required=True + ) + parser.add_argument( + '-m', '-manifest', + help='full path to a pre-existing manifest', required=True + ) + parser.add_argument( + '-user', + help='Declare who you are. If this is not set, you will be prompted.' + ) + parsed_args = parser.parse_args(args_) + return parsed_args + + +def main(args_): + ''' + Launches all the other functions when run from the command line. + ''' + args = parse_args(args_) + source = args.i + output = args.o + manifest = args.m + make_skeleton_csv(output) + representation_uuid = ififuncs.find_representation_uuid(source) + item_ids = file_description(source, manifest, representation_uuid, output) + #intellectual_entity_description() + representation_description(representation_uuid, item_ids, output) + + +if __name__ == '__main__': + main(sys.argv[1:]) +
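A note on `ififuncs.get_checksum`: it relies on the fixed layout of md5 manifest lines, where the first 32 characters are the hex digest and the path begins at index 34, after the two-character separator. A minimal sketch of that slicing, using a made-up manifest line (the digest and filename below are illustrative):

```python
def parse_md5_line(line):
    # An md5 manifest line looks like '<32 hex chars>  <path>':
    # characters 0-31 hold the digest, the path starts at index 34.
    return line[:32], line[34:].rstrip()

sample = 'd41d8cd98f00b204e9800998ecf8427e  objects/example.mov\n'
digest, path = parse_md5_line(sample)
print(digest)  # d41d8cd98f00b204e9800998ecf8427e
print(path)    # objects/example.mov
```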
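`ififuncs.get_pronom_format` shells out to siegfried (`sf -json`) and picks three fields out of the returned JSON. The structure it assumes can be sketched with a stubbed response (the PRONOM id, filename and version strings below are made up for illustration, not real siegfried output):

```python
import json

# Stub of the JSON structure that `sf -json <file>` is expected to
# return; the values here are illustrative only.
siegfried_json = json.dumps({
    'siegfried': '1.0.0',
    'files': [{
        'filename': 'example.mov',
        'matches': [{'ns': 'pronom', 'id': 'x-fmt/384'}]
    }]
})

# The same lookups that get_pronom_format performs on the real output.
json_object = json.loads(siegfried_json)
pronom_id = str(json_object['files'][0]['matches'][0]['id'])
authority = str(json_object['files'][0]['matches'][0]['ns'])
version = str(json_object['siegfried'])
print((pronom_id, authority, version))
```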
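`logs2premis.update_objects` stores one-to-many event links in a single CSV cell by joining event UUIDs with `|` (leaving a trailing separator), and `premiscsv2xml.describe_objects` splits that cell and skips the empty tail. The round trip in isolation (the identifiers are placeholders):

```python
# Joining: one CSV cell holds every linked event identifier.
event_ids = ['event-uuid-1', 'event-uuid-2']
cell = ''
for event_id in event_ids:
    cell += event_id + '|'     # note the trailing '|'

# Splitting: the trailing separator yields an empty string to skip.
recovered = [e for e in cell.split('|') if e != '']
print(cell)       # event-uuid-1|event-uuid-2|
print(recovered)  # ['event-uuid-1', 'event-uuid-2']
```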
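`ififuncs.find_representation_uuid` assumes the SIP layout places the `objects` folder directly inside the representation-UUID directory, so the UUID is the basename of the parent of `objects`. Sketched with a hypothetical path:

```python
import os

# Hypothetical SIP layout: <sip>/<representation-uuid>/objects/...
root = os.path.join(
    'sip', 'a6e63513-0b8f-4f7d-9f7a-1c2d3e4f5a6b', 'objects'
)
representation_uuid = os.path.basename(os.path.dirname(root))
print(representation_uuid)  # a6e63513-0b8f-4f7d-9f7a-1c2d3e4f5a6b
```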
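`update_objects` writes the amended object rows with `csv.DictWriter`, using a counter so the header is only emitted once. The same pattern in miniature (in-memory here, with placeholder fieldnames and rows), showing that the writer can equally be created once before the loop:

```python
import csv
import io

fieldnames = ['objectIdentifier', 'objectCategory']
rows = [
    {'objectIdentifier': 'uuid-1', 'objectCategory': 'file'},
    {'objectIdentifier': 'uuid-2', 'objectCategory': 'representation'},
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()        # header exactly once
for row in rows:
    writer.writerow(row)    # one line per object dictionary

print(out.getvalue().splitlines())
```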