cardinal_pythonlib.extract_text
Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).
This file is part of cardinal_pythonlib.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Converts a bunch of stuff to text, either from external files or from in-memory binary objects (BLOBs).
Prerequisites:
sudo apt-get install antiword
pip install docx pdfminer
Author: Rudolf Cardinal (rudolf@pobox.com)
Created: Feb 2015
Last update: 24 Sep 2015
See also:
Word
PDF
RTF
Multi-purpose:
DOCX
- class cardinal_pythonlib.extract_text.CustomDocxParagraph(text: str = '')[source]
Represents a paragraph of text in a DOCX file.
- class cardinal_pythonlib.extract_text.CustomDocxTable(rows: List[CustomDocxTableRow] | None = None)[source]
Represents a table of a DOCX file. May contain several rows.
- class cardinal_pythonlib.extract_text.CustomDocxTableCell(paragraphs: List[CustomDocxParagraph] | None = None)[source]
Represents a cell within a table of a DOCX file. May contain several paragraphs.
- class cardinal_pythonlib.extract_text.CustomDocxTableRow(cells: List[CustomDocxTableCell] | None = None)[source]
Represents a row within a table of a DOCX file. May contain several cells (one per column).
- class cardinal_pythonlib.extract_text.DocxFragment(text: str, wordwrap: bool = True)[source]
Representation of a line, or multiple lines, which may or may not need word-wrapping.
- class cardinal_pythonlib.extract_text.TextProcessingConfig(encoding: str | None = None, width: int = 120, min_col_width: int = 15, plain: bool = False, semiplain: bool = False, docx_in_order: bool = True, horizontal_char='─', vertical_char='│', junction_char='┼', plain_table_start: str | None = None, plain_table_end: str | None = None, plain_table_col_boundary: str | None = None, plain_table_row_boundary: str | None = None, rstrip: bool = True)[source]
Class to manage control parameters for text extraction, without having to pass a lot of mysterious **kwargs around and lose track of what they mean.
All converter functions take one of these objects as a parameter.
- Parameters:
encoding – optional text file encoding to try in addition to sys.getdefaultencoding()
width – overall word-wrapping width
min_col_width – minimum column width for tables
plain – as plain as possible (e.g. for natural language processing); see docx_process_table()
semiplain – quite plain, but with some ASCII art representation of the table structure
docx_in_order – for DOCX files: if True, process paragraphs and tables in the order they occur; if False, process all paragraphs followed by all tables
rstrip – right-strip whitespace from all lines?
horizontal_char – horizontal character to use with PrettyTable, e.g. - or ─
vertical_char – vertical character to use with PrettyTable, e.g. | or │
junction_char – junction character to use with PrettyTable, e.g. + or ┼
plain_table_start – table start line to use with plain=True
plain_table_end – table end line to use with plain=True
plain_table_col_boundary – boundary between columns to use with plain=True
plain_table_row_boundary – boundary between rows to use with plain=True
Example of a DOCX table processed with:
plain=False, semiplain=False
┼─────────────┼─────────────┼
│ Row 1 col 1 │ Row 1 col 2 │
┼─────────────┼─────────────┼
│ Row 2 col 1 │ Row 2 col 2 │
┼─────────────┼─────────────┼
plain=False, semiplain=True
─────────────────────────────
Row 1 col 1
─────────────────────────────
Row 1 col 2
─────────────────────────────
Row 2 col 1
─────────────────────────────
Row 2 col 2
─────────────────────────────
plain=True
╔═════════════════════════════════════════════════════════════════╗
Row 1 col 1
───────────────────────────────────────────────────────────────────
Row 1 col 2
═══════════════════════════════════════════════════════════════════
Row 2 col 1
───────────────────────────────────────────────────────────────────
Row 2 col 2
╚═════════════════════════════════════════════════════════════════╝
The plain format is probably better, in general, for NLP, and is definitely clearer with nested tables (for which the word-wrapping algorithm is imperfect). We avoid “heavy” box drawing as it has a higher chance of being mangled under Windows.
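To make the plain layout concrete, here is a minimal standalone sketch that produces output of the same shape: one cell per line, a column-boundary line between cells of a row, and a row-boundary line between rows. The function render_plain_table and its default boundary strings are hypothetical and only mirror the plain_table_* options described above; this is not the library's implementation.

```python
from typing import List


def render_plain_table(
    rows: List[List[str]],
    start: str = "╔" + "═" * 20 + "╗",
    end: str = "╚" + "═" * 20 + "╝",
    col_boundary: str = "─" * 22,
    row_boundary: str = "═" * 22,
) -> str:
    """Hypothetical helper: put each cell on its own line, separated by
    boundary lines, and wrap the whole table in start/end lines."""
    # Cells within one row are separated by the column boundary.
    row_texts = [f"\n{col_boundary}\n".join(cells) for cells in rows]
    # Rows are separated by the row boundary.
    body = f"\n{row_boundary}\n".join(row_texts)
    return f"{start}\n{body}\n{end}"


table = [["Row 1 col 1", "Row 1 col 2"], ["Row 2 col 1", "Row 2 col 2"]]
print(render_plain_table(table))
```

Because every cell sits on its own line, nested tables and long cell text cannot break the column alignment, which is one reason a layout like this may suit NLP pipelines.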
- cardinal_pythonlib.extract_text.availability_anything() bool [source]
Is a generic “something-to-text” processor available?
- cardinal_pythonlib.extract_text.convert_anything_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Convert arbitrary files to text, using strings or strings2. (strings is a standard Unix command to get text from any old rubbish.)
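For readers unfamiliar with strings, the idea can be approximated in pure Python: scan the bytes for runs of printable ASCII characters of some minimum length. The sketch below is an illustration of that idea only; the function name printable_runs and the default minimum length of 4 are assumptions, and the library itself calls the external tool rather than doing this.

```python
import re
from typing import List


def printable_runs(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII (space to tilde) that are at least
    min_length characters long, in order of appearance."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


sample = b"\x00\x01Hello, world\xff\xfeab\x00text here\x00"
print(printable_runs(sample))
```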
- cardinal_pythonlib.extract_text.convert_doc_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts Microsoft Word DOC files to text.
- cardinal_pythonlib.extract_text.convert_docx_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts a DOCX file to text. Pass either a filename or a binary object.
- Args:
filename: filename to process
blob: binary bytes object to process
config: TextProcessingConfig control object
- Returns:
text contents
Notes:
Old docx (https://pypi.python.org/pypi/python-docx) has been superseded (see https://github.com/mikemaccana/python-docx).
  docx.opendocx(file) uses zipfile.ZipFile, which can take either a filename or a file-like object (https://docs.python.org/2/library/zipfile.html).
  Method was:
    with get_filelikeobject(filename, blob) as fp:
        document = docx.opendocx(fp)
        paratextlist = docx.getdocumenttext(document)
    return '\n\n'.join(paratextlist)
Newer docx is python-docx.
  However, it uses lxml, which has C dependencies, so it doesn’t always install properly on e.g. bare Windows machines.
  PERFORMANCE of my method:
    nice table formatting
    but tables grouped at end, not in sensible places
    can iterate via doc.paragraphs and doc.tables but not in true document order, it seems
    others have noted this too:
docx2txt is at https://pypi.python.org/pypi/docx2txt/0.6; this is pure Python. Its command-line function appears to be for Python 2 only (2016-04-21: crashes under Python 3, due to an encoding bug). However, it seems fine as a library. It doesn’t handle in-memory blobs properly, though, so we need to extend it.
  PERFORMANCE OF ITS process() function:
    all text comes out
    table text is in a sensible place
    table formatting is lost.
Other manual methods (not yet implemented): https://etienned.github.io/posts/extract-text-from-word-docx-simply/.
  Looks like it won’t deal with header stuff (etc.) that docx2txt handles.
Upshot: we need a DIY version.
See also this “compile lots of techniques” library, which has C dependencies: https://textract.readthedocs.org/en/latest/
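The "DIY" approach mentioned in these notes can be illustrated with the standard library alone: a DOCX file is a zip archive, and its main body lives in word/document.xml. The sketch below walks that XML in document order, so text inside tables appears where it occurs rather than at the end. It is a simplified illustration, not the library's code: the function name docx_paragraph_texts is hypothetical, and it ignores headers, footers and table formatting.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML namespace used by w:p (paragraph) and w:t (text) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(blob: bytes) -> List[str]:
    """Return the text of each paragraph in an in-memory DOCX file, in
    document order (including paragraphs nested inside table cells)."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Since the function takes bytes, it works equally for files read from disk and for BLOBs fetched from a database.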
- cardinal_pythonlib.extract_text.convert_html_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts HTML to text.
- cardinal_pythonlib.extract_text.convert_odt_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts an OpenOffice ODT file to text.
Pass either a filename or a binary object.
- cardinal_pythonlib.extract_text.convert_pdf_to_txt(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts a PDF file to text. Pass either a filename or a binary object.
- cardinal_pythonlib.extract_text.convert_rtf_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts RTF to text.
- cardinal_pythonlib.extract_text.convert_xml_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts XML to text.
- cardinal_pythonlib.extract_text.document_to_text(filename: str | None = None, blob: bytes | None = None, extension: str | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Converts a document to text.
This function selects a processor based on the file extension (either from the filename, or, in the case of a BLOB, the extension specified manually via the extension parameter).
Pass either a filename or a binary object.
- Parameters:
filename – the filename to read
blob – binary content (alternative to filename)
extension – file extension, used as a hint when blob is used
config – an optional TextProcessingConfig object
- Returns:
Returns a string if the file was processed (potentially an empty string).
- Raises:
Raises an exception for malformed arguments, missing files, bad filetypes, etc.
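The extension-based dispatch described above can be sketched as a lookup table from extension to converter. Everything named here (CONVERTERS, to_text, the single .txt converter, and the exact error messages) is hypothetical and chosen for illustration; the real function supports many more formats and its argument checking may differ.

```python
import os
from typing import Callable, Dict, Optional


def _plain_text(data: bytes) -> str:
    """Trivial converter used to keep the example self-contained."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry: extension (with leading dot) -> converter.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {".txt": _plain_text}


def to_text(
    filename: Optional[str] = None,
    blob: Optional[bytes] = None,
    extension: Optional[str] = None,
) -> str:
    """Pick a converter from the file extension and apply it."""
    if (filename is None) == (blob is None):
        raise ValueError("Pass exactly one of filename or blob")
    if filename is not None:
        # Assumption: the filename's own extension takes priority.
        extension = os.path.splitext(filename)[1]
        with open(filename, "rb") as f:
            blob = f.read()
    if not extension:
        raise ValueError("Cannot determine the file extension")
    ext = extension.lower()
    if not ext.startswith("."):
        ext = "." + ext
    if ext not in CONVERTERS:
        raise ValueError(f"No converter for extension {ext!r}")
    return CONVERTERS[ext](blob)


print(to_text(blob=b"hello", extension="txt"))
```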
- cardinal_pythonlib.extract_text.docx_gen_fragments_from_xml_node(node: Element, level: int, config: TextProcessingConfig) Generator[DocxFragment, None, None] [source]
Generates text fragments from an XML node within a DOCX file.
- Parameters:
node – an XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object
- Yields:
DocxFragment objects containing the text of the node
- cardinal_pythonlib.extract_text.docx_gen_wordwrapped_fragments(fragments: Iterable[DocxFragment], width: int) Generator[str, None, None] [source]
Generates word-wrapped fragments.
- cardinal_pythonlib.extract_text.docx_process_table(table: CustomDocxTable, config: TextProcessingConfig) str [source]
Converts a DOCX table to text.
Structure representing a DOCX table:
table
    .rows[]
        .cells[]
            .paragraphs[]
                .text
That’s the structure of a docx.table.Table object, but also of our homebrew creation, CustomDocxTable.
The plain and semiplain options are implemented via the TextProcessingConfig.
Note also that the grids in DOCX files can have varying number of cells per row, e.g.
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 |
+---+---+
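One common way to cope with such ragged grids before drawing them is to pad short rows with empty cells so that every row has the same number of columns. Whether docx_process_table does exactly this is not stated here, so treat the sketch below (and the name pad_rows) as an assumption for illustration.

```python
from typing import List


def pad_rows(rows: List[List[str]], fill: str = "") -> List[List[str]]:
    """Pad each row on the right so all rows match the widest row."""
    ncols = max((len(row) for row in rows), default=0)
    return [row + [fill] * (ncols - len(row)) for row in rows]


ragged = [["1", "2", "3"], ["1", "2"]]
print(pad_rows(ragged))
```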
- cardinal_pythonlib.extract_text.docx_table_from_xml_node(table_node: Element, level: int, config: TextProcessingConfig) str [source]
Converts an XML node representing a DOCX table into a textual representation.
- Parameters:
table_node – XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object
- Returns:
string representation
- cardinal_pythonlib.extract_text.docx_text_from_xml(xml: str, config: TextProcessingConfig) str [source]
Converts an XML tree of a DOCX file to string contents.
- Parameters:
xml – raw XML text
config – TextProcessingConfig control object
- Returns:
contents as a string
- cardinal_pythonlib.extract_text.docx_text_from_xml_node(node: Element, level: int, config: TextProcessingConfig) str [source]
Returns text from an XML node within a DOCX file.
- Parameters:
node – an XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object
- Returns:
contents as a string
- cardinal_pythonlib.extract_text.docx_wordwrap_fragments(fragments: Iterable[DocxFragment], width: int) str [source]
Joins multiple fragments and word-wraps them as necessary.
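The idea of mixing wrappable and non-wrappable fragments can be sketched with textwrap from the standard library: ordinary text is wrapped to the target width, while pre-formatted fragments (such as an already-drawn table) pass through untouched. The Fragment class below is a hypothetical stand-in for DocxFragment, and this sketch wraps each fragment separately, which may differ from how the library joins adjacent fragments.

```python
import textwrap
from typing import Iterable, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical stand-in for DocxFragment."""
    text: str
    wordwrap: bool = True


def wrap_fragments(fragments: Iterable[Fragment], width: int) -> str:
    """Word-wrap fragments flagged for wrapping; keep the rest verbatim."""
    lines = []
    for frag in fragments:
        if frag.wordwrap:
            # textwrap.wrap() returns [] for empty text; keep a blank line.
            lines.extend(textwrap.wrap(frag.text, width=width) or [""])
        else:
            lines.extend(frag.text.splitlines() or [""])
    return "\n".join(lines)


print(wrap_fragments(
    [Fragment("aaa bbb ccc"), Fragment("+---+\n| x |", wordwrap=False)],
    width=7,
))
```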
- cardinal_pythonlib.extract_text.does_unrtf_support_quiet() bool [source]
The unrtf tool supports the ‘--quiet’ argument from a version that I’m not quite sure of, where 0.19.3 < version <= 0.21.9. We check against 0.21.9 here.
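A version check of this kind is safest when done on numeric tuples rather than strings, since "0.21.10" sorts before "0.21.9" as text. The sketch below shows that comparison in isolation; the helper names are hypothetical, and how the library obtains the installed unrtf version is not shown.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '0.21.9' into (0, 21, 9)."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_quiet(version: str, minimum: str = "0.21.9") -> bool:
    """True if version is at least the minimum known to accept --quiet."""
    return parse_version(version) >= parse_version(minimum)


for candidate in ("0.19.3", "0.21.9", "0.21.10"):
    print(candidate, supports_quiet(candidate))
```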
- cardinal_pythonlib.extract_text.gen_xml_files_from_docx(fp: BinaryIO) Iterator[str] [source]
Generate XML files (as strings) from a DOCX file.
- Parameters:
fp – BinaryIO object for reading the .DOCX file
- Yields:
the string contents of each individual XML file within the .DOCX file
- Raises:
zipfile.BadZipFile – if the zip is unreadable (encrypted?)
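Since a DOCX file is a zip archive, a generator of this kind can be sketched with zipfile alone. The version below yields every member whose name ends in .xml; that selection rule, the UTF-8 decoding, and the name xml_members are assumptions, and the library may restrict itself to particular members.

```python
import io
import zipfile
from typing import BinaryIO, Iterator


def xml_members(fp: BinaryIO) -> Iterator[str]:
    """Yield the decoded contents of each .xml member of a zip archive.
    Raises zipfile.BadZipFile (on first iteration) if fp is not a zip."""
    with zipfile.ZipFile(fp) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                yield archive.read(name).decode("utf-8")


# Build a tiny archive in memory to demonstrate.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("word/document.xml", "<doc/>")
    z.writestr("word/media/image1.png", b"\x89PNG")
buffer.seek(0)
print(list(xml_members(buffer)))
```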
- cardinal_pythonlib.extract_text.get_chardet_encoding(binary_contents: bytes) str | None [source]
Guess the character set encoding of the specified binary_contents.
- cardinal_pythonlib.extract_text.get_cmd_output(*args, encoding: str = 'utf-8') str [source]
Returns text output of a command.
- cardinal_pythonlib.extract_text.get_cmd_output_from_stdin(stdint_content_binary: bytes, *args, encoding: str = 'utf-8') str [source]
Returns text output of a command, passing binary data in via stdin.
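Feeding bytes to a command's stdin and decoding its stdout is a standard subprocess pattern. The sketch below shows it with a hypothetical helper, run_with_stdin; the use of check=True and errors="replace" are choices made for this example and may not match the library's behaviour.

```python
import subprocess
import sys


def run_with_stdin(data: bytes, *args: str, encoding: str = "utf-8") -> str:
    """Run a command, send data to its stdin, and return decoded stdout."""
    result = subprocess.run(
        args, input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Use the current Python interpreter as a portable stand-in for a
    # converter tool: it upper-cases whatever arrives on stdin.
    script = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    print(run_with_stdin(b"hello", sys.executable, "-c", script))
```

Passing the BLOB via stdin avoids writing a temporary file when the external tool can read from standard input.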
- cardinal_pythonlib.extract_text.get_file_contents(filename: str | None = None, blob: bytes | None = None) bytes [source]
Returns the binary contents of a file, or of a BLOB.
- cardinal_pythonlib.extract_text.get_file_contents_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) str [source]
Returns the string contents of a file, or of a BLOB.
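The encoding option of TextProcessingConfig is described as an encoding to try in addition to sys.getdefaultencoding(). A minimal sketch of that "try several encodings" idea follows; the order of attempts, the final fallback with replacement characters, and the name decode_bytes are all assumptions rather than the library's documented behaviour.

```python
import sys
from typing import Optional


def decode_bytes(data: bytes, encoding: Optional[str] = None) -> str:
    """Try the caller's encoding first, then the interpreter default."""
    candidates = [e for e in (encoding, sys.getdefaultencoding()) if e]
    for candidate in candidates:
        try:
            return data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one
    # Last resort: decode with the final candidate, replacing bad bytes.
    return data.decode(candidates[-1], errors="replace")


print(decode_bytes("café".encode("latin-1"), encoding="latin-1"))
```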
- cardinal_pythonlib.extract_text.get_filelikeobject(filename: str | None = None, blob: bytes | None = None) BinaryIO [source]
Open a file-like object.
Guard the use of this function with a with statement.
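The pattern of returning one file-like object for either a filename or a BLOB can be sketched with open() and io.BytesIO, both of which are context managers. The helper name open_binary and its argument check are hypothetical.

```python
import io
from typing import BinaryIO, Optional


def open_binary(
    filename: Optional[str] = None, blob: Optional[bytes] = None
) -> BinaryIO:
    """Return a readable binary file-like object for a file or a BLOB."""
    if (filename is None) == (blob is None):
        raise ValueError("Pass exactly one of filename or blob")
    if filename is not None:
        return open(filename, "rb")
    return io.BytesIO(blob)


# The with statement closes the object whichever kind it is.
with open_binary(blob=b"abc") as fp:
    data = fp.read()
print(data, fp.closed)
```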
- cardinal_pythonlib.extract_text.is_text_extractor_available(extension: str) bool [source]
Is a text extractor available for the specified extension?
- cardinal_pythonlib.extract_text.main() None [source]
Command-line processor. See --help for details.
- cardinal_pythonlib.extract_text.require_text_extractor(extension: str) None [source]
Require that a text extractor is available for the specified extension, or raise ValueError.
- cardinal_pythonlib.extract_text.rstrip_all_lines(text: str) str [source]
Right-strips all lines in a string and returns the result.
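As an illustration, right-stripping every line can be written in one expression. This sketch (with the hypothetical name rstrip_lines) drops a trailing newline at the end of the text, which may or may not match the library function.

```python
def rstrip_lines(text: str) -> str:
    """Remove trailing whitespace from each line; leading whitespace stays."""
    return "\n".join(line.rstrip() for line in text.splitlines())


print(repr(rstrip_lines("a  \nb\t\n  c")))
```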
- cardinal_pythonlib.extract_text.update_external_tools(tooldict: Dict[str, str]) None [source]
Update the global map of tools.
- Parameters:
tooldict – dictionary whose keys are tool names and whose values are paths to the executables