cardinal_pythonlib.openxml.find_recovered_openxml
Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).
This file is part of cardinal_pythonlib.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Tool to recognize and rescue Microsoft Office OpenXML files, even if they have garbage appended to them. See the command-line help for details.
Version history:
- Written 28 Sep 2017.
Notes:
- Use the vbindiff tool to show how two binary files differ.
Output from zip -FF bad.zip --out good.zip:
Fix archive (-FF) - salvage what can
zip warning: Missing end (EOCDR) signature - either this archive
is not readable or the end is damaged
Is this a single-disk archive? (y/n):
… and note there are some tabs in that, too.
More zip -FF output:
Fix archive (-FF) - salvage what can
Found end record (EOCDR) - says expect 50828 splits
Found archive comment
Scanning for entries...
Could not find:
/home/rudolf/tmp/ziptest/00008470.z01
Hit c (change path to where this split file is)
s (skip this split)
q (abort archive - quit)
e (end this archive - no more splits)
z (look for .zip split - the last split)
or ENTER (try reading this split again):
More zip -FF output:
zip: malloc.c:2394: sysmalloc: ...
… this heralds a crash in zip. We need to kill it; otherwise it just sits there doing nothing and not asking for any input. Presumably this means the file is badly corrupted (or not a zip at all).
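The need to kill a hung external tool can be illustrated with a short sketch. This is not the module's own implementation; the function name run_with_timeout and the timeout-based approach are assumptions made for illustration. It runs a command, collects its output, and kills the process if it has not finished within a time limit.

```python
import subprocess
from typing import List, Optional, Tuple


def run_with_timeout(args: List[str],
                     timeout_s: float) -> Tuple[Optional[int], str]:
    """
    Hypothetical helper: run a command, capturing stdout and stderr together.

    Returns (return_code, output). If the command does not finish within
    timeout_s seconds, it is killed and return_code is None.
    """
    proc = subprocess.Popen(
        args,
        stdin=subprocess.DEVNULL,  # give the child an empty stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        proc.kill()  # the tool appears hung; terminate it
        output, _ = proc.communicate()  # reap the process, collect output
        return None, output
```

It could be called as run_with_timeout(["zip", "-FF", "bad.zip", "--out", "good.zip"], timeout_s=10), where the 10-second limit is an arbitrary choice. A fixed timeout is the simplest option; watching the output for the sysmalloc message would be an alternative.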
class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedOpenXmlReader(filename: str, show_zip_output: bool = False)
Class to read a potentially corrupted OpenXML file. As it is created, it sets its file_type member to the detected OpenXML file type, if it can.
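As an illustration of file type detection, the sketch below guesses the type from the names of the files inside the archive. It relies on the usual OpenXML package layout, in which word processing documents keep their parts under word/, presentations under ppt/, and spreadsheets under xl/. The names MARKER_DIRS, guess_openxml_type, and guess_openxml_type_of_file are hypothetical, and this is not necessarily how CorruptedOpenXmlReader decides.

```python
import zipfile
from typing import Iterable, Optional

# Assumed mapping from a top-level folder inside the archive to a file type.
MARKER_DIRS = {"word/": "docx", "ppt/": "pptx", "xl/": "xlsx"}


def guess_openxml_type(member_names: Iterable[str]) -> Optional[str]:
    """
    Guess the OpenXML file type from the archive's member names.
    Returns None if no known marker folder is present.
    """
    for name in member_names:
        for prefix, file_type in MARKER_DIRS.items():
            if name.startswith(prefix):
                return file_type
    return None


def guess_openxml_type_of_file(file) -> Optional[str]:
    """
    As above, for a readable zip given as a filename or file-like object.
    """
    with zipfile.ZipFile(file) as zf:
        return guess_openxml_type(zf.namelist())
```

Note that zipfile.ZipFile raises zipfile.BadZipFile for many damaged archives, so this sketch only covers files that are readable as they stand or that have already been repaired.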
class cardinal_pythonlib.openxml.find_recovered_openxml.CorruptedZipReader(filename: str, show_zip_output: bool = False)
Class to open a zip file, even one that is corrupted, and detect the files within.
Parameters:
- filename – filename of the .zip file (or corrupted .zip file) to open
- show_zip_output – show the output of the external zip tool?
move_to(destination_filename: str, alter_if_clash: bool = True) → None
Move the file to which this class refers to a new location. The function will not overwrite existing files (but offers the option to rename files slightly to avoid a clash).
Parameters:
- destination_filename – filename to move to
- alter_if_clash – if True (the default), appends numbers to the filename if the destination already exists, so that the move can proceed.
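A minimal sketch of a move that never overwrites might look like the following. The function name move_without_overwrite, the "_1", "_2" suffix scheme, and raising FileExistsError when alter_if_clash is False are all assumptions for illustration; the documentation above does not state what the real method does in those respects.

```python
import os
import shutil


def move_without_overwrite(src: str, dest: str,
                           alter_if_clash: bool = True) -> str:
    """
    Hypothetical helper: move src to dest without overwriting anything.
    Returns the path actually used.
    """
    final = dest
    if os.path.exists(final):
        if not alter_if_clash:
            # Assumption: refuse loudly rather than overwrite.
            raise FileExistsError(f"Destination exists: {dest!r}")
        root, ext = os.path.splitext(dest)
        n = 1
        while os.path.exists(final):
            # Assumed naming scheme: out.docx -> out_1.docx, out_2.docx, ...
            final = f"{root}_{n}{ext}"
            n += 1
    shutil.move(src, final)
    return final
```

The existence check and the move are separate steps, so this sketch is not safe if another process writes to the same directory at the same time.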
cardinal_pythonlib.openxml.find_recovered_openxml.main() → None
Command-line handler for the find_recovered_openxml tool. Use the --help option for help.
cardinal_pythonlib.openxml.find_recovered_openxml.process_file(filename: str, filetypes: List[str], move_to: str, delete_if_not_specified_file_type: bool, show_zip_output: bool) → None
Deals with an OpenXML file, including one that is potentially corrupted.
Parameters:
- filename – filename to process
- filetypes – list of filetypes that we care about, e.g. ['docx', 'pptx', 'xlsx'].
- move_to – move matching files to this directory
- delete_if_not_specified_file_type – if True, and the file is not a type specified in filetypes, then delete the file.
- show_zip_output – show the output from the external zip tool?
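The interaction between filetypes and delete_if_not_specified_file_type can be summarized as a small decision function. This is a sketch of the behaviour as described by the parameters above, not the function's actual code; decide_action and its string return values are hypothetical, and it assumes a file whose type could not be detected is treated as not being a specified type.

```python
from typing import List, Optional


def decide_action(detected_type: Optional[str],
                  filetypes: List[str],
                  delete_if_not_specified_file_type: bool) -> str:
    """
    Hypothetical helper: return "move", "delete", or "leave" for a file
    whose detected OpenXML type is detected_type (None if undetected).
    """
    if detected_type in filetypes:
        return "move"  # a type we care about: move it to the target directory
    if delete_if_not_specified_file_type:
        return "delete"  # not a wanted type, and deletion was requested
    return "leave"  # not a wanted type; leave the file where it is
```

For example, decide_action("xlsx", ["docx", "pptx"], False) returns "leave", whereas the same call with the final argument set to True returns "delete".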