cardinal_pythonlib.openxml.find_recovered_openxml


Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).

This file is part of cardinal_pythonlib.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Tool to recognize and rescue Microsoft Office OpenXML files, even if they have garbage appended to them. See the command-line help for details.

Version history:

  • Written 28 Sep 2017.

Notes:

  • Use the vbindiff tool to show how two binary files differ.

Output from zip -FF bad.zip --out good.zip:

Fix archive (-FF) - salvage what can
    zip warning: Missing end (EOCDR) signature - either this archive
                     is not readable or the end is damaged
Is this a single-disk archive?  (y/n):

… and note there are some tabs in that, too.

More zip -FF output:

Fix archive (-FF) - salvage what can
 Found end record (EOCDR) - says expect 50828 splits
  Found archive comment
Scanning for entries...


Could not find:
  /home/rudolf/tmp/ziptest/00008470.z01

Hit c      (change path to where this split file is)
    s      (skip this split)
    q      (abort archive - quit)
    e      (end this archive - no more splits)
    z      (look for .zip split - the last split)
 or ENTER  (try reading this split again):

More zip -FF output:

zip: malloc.c:2394: sysmalloc: ...

… this heralds a crash in zip. We need to kill it; otherwise it just sits there doing nothing and not asking for any input. Presumably this means the file is badly corrupted (or not a zip at all).
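
One way to guard against the hang described above is to run the external tool under a time limit and kill it when the limit expires. The following is a minimal sketch, not this module's actual implementation: the helper name run_with_timeout and the 60-second default are assumptions made for illustration, and the sketch makes no attempt to answer zip's interactive prompts.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(args: List[str], timeout_s: float = 60.0) -> Optional[str]:
    """
    Run an external command and return its combined stdout/stderr text.
    Return None if the command had to be killed after exceeding timeout_s.
    (Hypothetical helper; not part of cardinal_pythonlib.)
    """
    proc = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,  # never let the child wait for typed input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return output
    except subprocess.TimeoutExpired:
        proc.kill()         # the process is stuck; terminate it
        proc.communicate()  # reap the process and release the pipes
        return None
```

A caller might invoke it as run_with_timeout(["zip", "-FF", "bad.zip", "--out", "good.zip"]) and treat a None result as "badly corrupted, or not a zip at all".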

class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedOpenXmlReader(filename: str, show_zip_output: bool = False)

Class to read a potentially corrupted OpenXML file. As it is created, it sets its file_type member to the detected OpenXML file type, if it can.
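
To illustrate what detecting the OpenXML file type can involve, the sketch below looks inside a zip archive for the part names that Microsoft Office conventionally writes for each document type. It is an assumption-laden illustration rather than this class's implementation: guess_openxml_type and MARKERS are hypothetical names, the marker paths are the conventional ones rather than a guarantee, and the standard-library zipfile module is used in place of the external zip tool, so archives that zipfile cannot read simply yield None.

```python
import zipfile
from typing import Optional

# Conventional main part inside the archive -> file type.
# (Hypothetical lookup table for this sketch.)
MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def guess_openxml_type(filename: str) -> Optional[str]:
    """
    Return "docx", "xlsx" or "pptx" if the archive contains the matching
    marker file; return None if the file is unreadable, is not a zip
    archive, or contains none of the markers.
    """
    try:
        with zipfile.ZipFile(filename) as zf:
            names = set(zf.namelist())
    except (zipfile.BadZipFile, OSError):
        return None
    for marker, file_type in MARKERS.items():
        if marker in names:
            return file_type
    return None
```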

class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedZipReader(filename: str, show_zip_output: bool = False)

Class to open a zip file, even one that is corrupted, and detect the files within.

Parameters:
  • filename – filename of the .zip file (or corrupted .zip file) to open
  • show_zip_output – show the output of the external zip tool?

move_to(destination_filename: str, alter_if_clash: bool = True) → None

Move the file to which this class refers to a new location. The function will not overwrite existing files (but offers the option to rename files slightly to avoid a clash).

Parameters:
  • destination_filename – filename to move to
  • alter_if_clash – if True (the default), appends numbers to the filename if the destination already exists, so that the move can proceed.
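
The non-overwriting behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: move_without_overwrite is a hypothetical name, the "_1", "_2", ... suffix format is invented for the example, and raising FileExistsError when alter_if_clash is False is this sketch's choice rather than documented behaviour.

```python
import os
import shutil


def move_without_overwrite(source: str, destination: str,
                           alter_if_clash: bool = True) -> str:
    """
    Move source to destination without overwriting an existing file.
    Returns the path actually used. (Hypothetical helper.)
    """
    final = destination
    if os.path.exists(final):
        if not alter_if_clash:
            # Assumption: refuse outright rather than overwrite.
            raise FileExistsError(final)
        root, ext = os.path.splitext(destination)
        n = 1
        # Try name_1.ext, name_2.ext, ... until an unused name is found.
        while os.path.exists(final):
            final = f"{root}_{n}{ext}"
            n += 1
    shutil.move(source, final)
    return final
```

The existence check and the move are separate steps, so this sketch is not safe against another process creating the destination in between.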

cardinal_pythonlib.openxml.find_recovered_openxml.main() → None

Command-line handler for the find_recovered_openxml tool. Use the --help option for help.

cardinal_pythonlib.openxml.find_recovered_openxml.process_file(filename: str, filetypes: List[str], move_to: str, delete_if_not_specified_file_type: bool, show_zip_output: bool) → None

Deals with an OpenXML file, including one that is potentially corrupted.

Parameters:
  • filename – filename to process
  • filetypes – list of filetypes that we care about, e.g. ['docx', 'pptx', 'xlsx'].
  • move_to – move matching files to this directory
  • delete_if_not_specified_file_type – if True, and the file is not a type specified in filetypes, then delete the file.
  • show_zip_output – show the output from the external zip tool?