cardinal_pythonlib.openxml.find_recovered_openxml
Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).
This file is part of cardinal_pythonlib.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Tool to recognize and rescue Microsoft Office OpenXML files, even if they have garbage appended to them. See the command-line help for details.
Version history:
Written 28 Sep 2017.
Notes:
Use the vbindiff tool to show how two binary files differ.
Output from zip -FF bad.zip --out good.zip:
Fix archive (-FF) - salvage what can
zip warning: Missing end (EOCDR) signature - either this archive
is not readable or the end is damaged
Is this a single-disk archive? (y/n):
… and note there are some tabs in that, too.
More zip -FF output:
Fix archive (-FF) - salvage what can
Found end record (EOCDR) - says expect 50828 splits
Found archive comment
Scanning for entries...
Could not find:
/home/rudolf/tmp/ziptest/00008470.z01
Hit c (change path to where this split file is)
s (skip this split)
q (abort archive - quit)
e (end this archive - no more splits)
z (look for .zip split - the last split)
or ENTER (try reading this split again):
More zip -FF output:
zip: malloc.c:2394: sysmalloc: ...
… this heralds a crash in zip. We need to kill it; otherwise it just sits there doing nothing and not asking for any input. Presumably this means the file is badly corrupted (or not a zip at all).
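The notes above describe two awkward behaviours of zip -FF: it may prompt for input, and it may hang after a crash and need killing. A minimal sketch of one way to cope with both from Python follows. The helper name run_with_timeout and the 10-second default are hypothetical and are not taken from this module; the child process in the demonstration is the Python interpreter itself, standing in for zip.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(
    cmd: List[str], stdin_text: str = "", timeout_s: float = 10.0
) -> Optional[subprocess.CompletedProcess]:
    """
    Hypothetical helper: run an external command, feed it stdin_text, and
    give up after timeout_s seconds. Returns None if it had to be killed.
    """
    try:
        return subprocess.run(
            cmd,
            input=stdin_text,          # canned answers to interactive prompts
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture warnings along with output
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None


# Stand-in for a hung external tool: a child that just sleeps.
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
)
print(hung)  # None, because the child was killed after the timeout
```

For the real tool, the call might look like run_with_timeout(["zip", "-FF", "bad.zip", "--out", "good.zip"], stdin_text="y\n"), where the "y" answers the single-disk question shown in the first output above. Whether a canned answer is enough for the split-file menu in the second output is not something this sketch addresses.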
- class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedOpenXmlReader(filename: str, show_zip_output: bool = False)[source]
Class to read a potentially corrupted OpenXML file. As it is created, it sets its file_type member to the detected OpenXML file type, if it can.
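How the class detects the file type is not described here. As an illustrative sketch only, one plausible approach is to look at the names of the files inside the archive: OpenXML packages conventionally keep their main parts under word/ (docx), ppt/ (pptx), or xl/ (xlsx). The names MARKERS, guess_openxml_type, and guess_type_of_zip are hypothetical, and this sketch reads an intact zip with the standard zipfile module rather than a salvaged one.

```python
import io
import zipfile
from typing import Iterable, Optional

# Assumed mapping from file type to the directory that marks it.
MARKERS = {"docx": "word/", "pptx": "ppt/", "xlsx": "xl/"}


def guess_openxml_type(names: Iterable[str]) -> Optional[str]:
    """Guess the OpenXML type from a list of archive member names."""
    names = list(names)
    for file_type, prefix in MARKERS.items():
        if any(name.startswith(prefix) for name in names):
            return file_type
    return None  # not recognizably OpenXML


def guess_type_of_zip(data: bytes) -> Optional[str]:
    """Convenience wrapper for an intact zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return guess_openxml_type(zf.namelist())
```

Because guess_openxml_type only needs a list of names, the same check could in principle be applied to whatever member list a recovery step produces.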
- class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedZipReader(filename: str, show_zip_output: bool = False)[source]
Class to open a zip file, even one that is corrupted, and detect the files within.
- Parameters:
- cardinal_pythonlib.openxml.find_recovered_openxml.main() → None [source]
Command-line handler for the find_recovered_openxml tool. Use the --help option for help.
- cardinal_pythonlib.openxml.find_recovered_openxml.process_file(filename: str, filetypes: List[str], move_to: str, delete_if_not_specified_file_type: bool, show_zip_output: bool) → None [source]
Deals with an OpenXML file, including if it is potentially corrupted.
- Parameters:
filename – filename to process
filetypes – list of filetypes that we care about, e.g. ['docx', 'pptx', 'xlsx']
move_to – move matching files to this directory
delete_if_not_specified_file_type – if True, and the file is not a type specified in filetypes, then delete the file
show_zip_output – show the output from the external zip tool?
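The interaction between filetypes, move_to, and delete_if_not_specified_file_type can be summarized as a small decision function. This is a sketch of the behaviour as the parameter descriptions state it, not the library's own code. The name decide_action and the returned strings are hypothetical, and treating an undetected type (None) the same as a non-matching type is an assumption.

```python
from typing import List, Optional


def decide_action(
    detected_type: Optional[str],
    filetypes: List[str],
    delete_if_not_specified_file_type: bool,
) -> str:
    """
    Hypothetical summary of what happens to one file:
    "move" (to the move_to directory), "delete", or "leave".
    """
    if detected_type in filetypes:
        return "move"
    # Assumption: an undetected type (None) counts as "not specified".
    if delete_if_not_specified_file_type:
        return "delete"
    return "leave"


print(decide_action("docx", ["docx", "pptx", "xlsx"], False))  # move
print(decide_action("xlsx", ["docx"], True))                   # delete
print(decide_action("xlsx", ["docx"], False))                  # leave
```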