cardinal_pythonlib.extract_text

Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).

This file is part of cardinal_pythonlib.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Converts a bunch of stuff to text, either from external files or from in-memory binary objects (BLOBs).

Prerequisites:

sudo apt-get install antiword
pip install docx pdfminer

Author: Rudolf Cardinal (rudolf@pobox.com)
Created: Feb 2015
Last update: 24 Sep 2015

See also:

class cardinal_pythonlib.extract_text.CustomDocxParagraph(text: str = '')[source]: Represents a paragraph of text in a DOCX file.

class cardinal_pythonlib.extract_text.CustomDocxTable(rows: List[CustomDocxTableRow] | None = None)[source]: Represents a table of a DOCX file. May contain several rows.

class cardinal_pythonlib.extract_text.CustomDocxTableCell(paragraphs: List[CustomDocxParagraph] | None = None)[source]: Represents a cell within a table of a DOCX file. May contain several paragraphs.

class cardinal_pythonlib.extract_text.CustomDocxTableRow(cells: List[CustomDocxTableCell] | None = None)[source]: Represents a row within a table of a DOCX file. May contain several cells (one per column).

class cardinal_pythonlib.extract_text.DocxFragment(text: str, wordwrap: bool = True)[source]: Representation of a line, or multiple lines, which may or may not need word-wrapping.

class cardinal_pythonlib.extract_text.TextProcessingConfig(encoding: str | None = None, width: int = 120, min_col_width: int = 15, plain: bool = False, semiplain: bool = False, docx_in_order: bool = True, horizontal_char='─', vertical_char='│', junction_char='┼', plain_table_start: str | None = None, plain_table_end: str | None = None, plain_table_col_boundary: str | None = None, plain_table_row_boundary: str | None = None, rstrip: bool = True)[source]

Class to manage control parameters for text extraction, without having to pass a lot of mysterious **kwargs around and lose track of what they mean.

All converter functions take one of these objects as a parameter.

Parameters:

encoding – optional text file encoding to try in addition to sys.getdefaultencoding()
width – overall word-wrapping width
min_col_width – minimum column width for tables
plain – as plain as possible (e.g. for natural language processing); see docx_process_table()
semiplain – quite plain, but with some ASCII art representation of the table structure
docx_in_order – for DOCX files: if True, process paragraphs and tables in the order they occur; if False, process all paragraphs followed by all tables
rstrip – right-strip whitespace from all lines?
horizontal_char – horizontal character to use with PrettyTable, e.g. - or ─
vertical_char – vertical character to use with PrettyTable, e.g. | or │
junction_char – junction character to use with PrettyTable, e.g. + or ┼
plain_table_start – table start line to use with plain=True
plain_table_end – table end line to use with plain=True
plain_table_col_boundary – boundary between columns to use with plain=True
plain_table_row_boundary – boundary between rows to use with plain=True

Example of a DOCX table processed with:

plain=False, semiplain=False

┼─────────────┼─────────────┼
│ Row 1 col 1 │ Row 1 col 2 │
┼─────────────┼─────────────┼
│ Row 2 col 1 │ Row 2 col 2 │
┼─────────────┼─────────────┼

plain=False, semiplain=True

─────────────────────────────
  Row 1 col 1
─────────────────────────────
                Row 1 col 2
─────────────────────────────
  Row 2 col 1
─────────────────────────────
                Row 2 col 2
─────────────────────────────

plain=True

╔═════════════════════════════════════════════════════════════════╗
Row 1 col 1
───────────────────────────────────────────────────────────────────
Row 1 col 2
═══════════════════════════════════════════════════════════════════
Row 2 col 1
───────────────────────────────────────────────────────────────────
Row 2 col 2
╚═════════════════════════════════════════════════════════════════╝

The plain format is probably better, in general, for NLP, and is definitely clearer with nested tables (for which the word-wrapping algorithm is imperfect). We avoid “heavy” box drawing as it has a higher chance of being mangled under Windows.
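
The plain=True layout shown above can be approximated with a few lines of standard-library Python. This is only a sketch under stated assumptions, not the library's implementation: the helper name render_plain_table and the default width of 67 characters are hypothetical, chosen to match the example output.

```python
from typing import List


def render_plain_table(rows: List[List[str]], width: int = 67) -> str:
    """Render one cell per line, in the style of the plain=True example.

    Hypothetical helper; the boundary strings mimic the example above.
    """
    start = "╔" + "═" * (width - 2) + "╗"   # table start line
    end = "╚" + "═" * (width - 2) + "╝"     # table end line
    col_boundary = "─" * width               # between cells of one row
    row_boundary = "═" * width               # between rows
    rendered_rows = [
        ("\n" + col_boundary + "\n").join(cells) for cells in rows
    ]
    body = ("\n" + row_boundary + "\n").join(rendered_rows)
    return "\n".join([start, body, end])


table = [["Row 1 col 1", "Row 1 col 2"], ["Row 2 col 1", "Row 2 col 2"]]
print(render_plain_table(table))
```

Because every cell sits on its own line, nested tables and long cells need no column-width calculation, which is why this layout is robust.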

cardinal_pythonlib.extract_text.availability_anything() → bool[source]: Is a generic “something-to-text” processor available?

cardinal_pythonlib.extract_text.availability_doc() → bool[source]: Is a DOC processor available?

cardinal_pythonlib.extract_text.availability_pdf() → bool[source]: Is a PDF-to-text tool available?

cardinal_pythonlib.extract_text.availability_rtf() → bool[source]: Is an RTF processor available?

cardinal_pythonlib.extract_text.convert_anything_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Convert arbitrary files to text, using strings or strings2. (strings is a standard Unix command to get text from any old rubbish.)

cardinal_pythonlib.extract_text.convert_doc_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Converts Microsoft Word DOC files to text.

cardinal_pythonlib.extract_text.convert_docx_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts a DOCX file to text. Pass either a filename or a binary object.

Args:
filename: filename to process
blob: binary bytes object to process
config: TextProcessingConfig control object

Returns:
text contents

Notes:
Old docx (https://pypi.python.org/pypi/python-docx) has been superseded (see https://github.com/mikemaccana/python-docx).
docx.opendocx(file) uses zipfile.ZipFile, which can take either a filename or a file-like object (https://docs.python.org/2/library/zipfile.html).
Method was:
with get_filelikeobject(filename, blob) as fp:
    document = docx.opendocx(fp)
    paratextlist = docx.getdocumenttext(document)
return '\n\n'.join(paratextlist)

Newer docx is python-docx

https://pypi.python.org/pypi/python-docx

https://python-docx.readthedocs.org/en/latest/

https://stackoverflow.com/questions/25228106

However, it uses lxml, which has C dependencies, so it doesn’t always install properly on e.g. bare Windows machines.

PERFORMANCE of my method:

nice table formatting

but tables grouped at end, not in sensible places

can iterate via doc.paragraphs and doc.tables but not in true document order, it seems

others have noted this too:

https://github.com/python-openxml/python-docx/issues/40

https://github.com/deanmalmgren/textract/pull/92

docx2txt is at https://pypi.python.org/pypi/docx2txt/0.6; this is pure Python. Its command-line function appears to be for Python 2 only (2016-04-21: crashes under Python 3; is due to an encoding bug). However, it seems fine as a library. It doesn’t handle in-memory blobs properly, though, so we need to extend it.

PERFORMANCE OF ITS process() function:

all text comes out

table text is in a sensible place

table formatting is lost.

Other manual methods (not yet implemented): https://etienned.github.io/posts/extract-text-from-word-docx-simply/.

Looks like it won’t deal with header stuff (etc.) that docx2txt handles.

Upshot: we need a DIY version.

See also this “compile lots of techniques” library, which has C dependencies: https://textract.readthedocs.org/en/latest/

cardinal_pythonlib.extract_text.convert_html_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Converts HTML to text.

cardinal_pythonlib.extract_text.convert_odt_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts an OpenOffice ODT file to text.

Pass either a filename or a binary object.

cardinal_pythonlib.extract_text.convert_pdf_to_txt(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Converts a PDF file to text. Pass either a filename or a binary object.

cardinal_pythonlib.extract_text.convert_rtf_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Converts RTF to text.

cardinal_pythonlib.extract_text.convert_xml_to_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Converts XML to text.

cardinal_pythonlib.extract_text.document_to_text(filename: str | None = None, blob: bytes | None = None, extension: str | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts a document to text.

This function selects a processor based on the file extension (either from the filename, or, in the case of a BLOB, the extension specified manually via the extension parameter).

Pass either a filename or a binary object.

Parameters:

filename – the filename to read
blob – binary content (alternative to filename)
extension – file extension, used as a hint when blob is used
config – an optional TextProcessingConfig object

Returns:

Returns a string if the file was processed (potentially an empty string).

Raises:

Raises an exception for malformed arguments, missing files, bad filetypes, etc.
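
The extension-based selection described above can be sketched as a dictionary lookup. Everything here is an assumption for illustration: the names sketch_document_to_text, CONVERTERS and _text_converter are hypothetical, only a trivial plain-text converter is registered, and the sketch raises ValueError for an unknown extension, which may not match what the real function does.

```python
import os
from typing import Callable, Dict, Optional


def _text_converter(filename: Optional[str], blob: Optional[bytes]) -> str:
    """Hypothetical stand-in for a real convert_*_to_text function."""
    if blob is not None:
        return blob.decode("utf-8")
    with open(filename, "rb") as f:
        return f.read().decode("utf-8")


# Hypothetical registry mapping lower-case extensions to converters.
CONVERTERS: Dict[str, Callable[[Optional[str], Optional[bytes]], str]] = {
    ".txt": _text_converter,
}


def sketch_document_to_text(
    filename: Optional[str] = None,
    blob: Optional[bytes] = None,
    extension: Optional[str] = None,
) -> str:
    """Choose a converter by extension; exactly one source must be given."""
    if (filename is None) == (blob is None):
        raise ValueError("Specify exactly one of filename or blob")
    if filename is not None:
        # For files, the extension comes from the filename itself.
        extension = os.path.splitext(filename)[1]
    if not extension:
        raise ValueError("No extension available to choose a converter")
    extension = extension.lower()
    if not extension.startswith("."):
        extension = "." + extension
    try:
        converter = CONVERTERS[extension]
    except KeyError:
        raise ValueError(f"No converter for extension {extension!r}") from None
    return converter(filename, blob)


print(sketch_document_to_text(blob=b"hello", extension="txt"))
```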

cardinal_pythonlib.extract_text.docx_gen_fragments_from_xml_node(node: Element, level: int, config: TextProcessingConfig) → Generator[DocxFragment, None, None][source]

Generates text fragments from an XML node within a DOCX file.

Parameters:

node – an XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object

Yields:

DocxFragment objects

cardinal_pythonlib.extract_text.docx_gen_wordwrapped_fragments(fragments: Iterable[DocxFragment], width: int) → Generator[str, None, None][source]: Generates word-wrapped fragments.

cardinal_pythonlib.extract_text.docx_process_table(table: CustomDocxTable, config: TextProcessingConfig) → str[source]

Converts a DOCX table to text.

Structure representing a DOCX table:

table
    .rows[]
        .cells[]
            .paragraphs[]
                .text

That’s the structure of a docx.table.Table object, but also of our homebrew creation, CustomDocxTable.

The plain and semiplain options are implemented via the TextProcessingConfig.
Note also that the grids in DOCX files can have a varying number of cells per row, e.g.
```
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 |
+---+---+
```
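
One way to cope with ragged rows like these before laying out a grid is to pad every row to the widest row. This is a sketch of that idea, not necessarily what docx_process_table does; the helper name pad_ragged_rows is hypothetical.

```python
from typing import List


def pad_ragged_rows(rows: List[List[str]], filler: str = "") -> List[List[str]]:
    """Pad each row with filler cells so all rows have the same length."""
    ncols = max((len(row) for row in rows), default=0)
    return [row + [filler] * (ncols - len(row)) for row in rows]


grid = pad_ragged_rows([["1", "2", "3"], ["1", "2"]])
print(grid)
```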

cardinal_pythonlib.extract_text.docx_table_from_xml_node(table_node: Element, level: int, config: TextProcessingConfig) → str[source]

Converts an XML node representing a DOCX table into a textual representation.

Parameters:

table_node – XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object

Returns:

string representation

cardinal_pythonlib.extract_text.docx_text_from_xml(xml: str, config: TextProcessingConfig) → str[source]

Converts an XML tree of a DOCX file to string contents.

Parameters:

xml – raw XML text
config – TextProcessingConfig control object

Returns:

contents as a string
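
To show the general idea of turning DOCX XML into text, here is a minimal sketch using only xml.etree.ElementTree. It is an assumption-laden simplification, not the library's algorithm: the function name sketch_docx_text_from_xml is hypothetical, and the sketch ignores table structure, word-wrapping and the config object, simply joining the text runs of each paragraph in document order.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a DOCX file.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def sketch_docx_text_from_xml(xml: str) -> str:
    """Return one line per <w:p> paragraph, joining its <w:t> text runs."""
    root = ET.fromstring(xml)
    paragraphs = []
    for para in root.iter(W + "p"):  # iter() walks in document order
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


sample = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
print(sketch_docx_text_from_xml(sample))
```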

cardinal_pythonlib.extract_text.docx_text_from_xml_node(node: Element, level: int, config: TextProcessingConfig) → str[source]

Returns text from an XML node within a DOCX file.

Parameters:

node – an XML node
level – current level in XML hierarchy (used for recursion; start level is 0)
config – TextProcessingConfig control object

Returns:

contents as a string

cardinal_pythonlib.extract_text.docx_wordwrap_fragments(fragments: Iterable[DocxFragment], width: int) → str[source]: Joins multiple fragments and word-wraps them as necessary.
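
A rough sketch of joining fragments where only some need wrapping, using textwrap from the standard library. The Fragment class is a hypothetical stand-in for DocxFragment, and the policy shown (concatenate adjacent wrappable fragments, wrap them together, pass pre-formatted fragments through untouched) is an assumption about what "as necessary" could mean, not a description of the real function.

```python
import textwrap
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Fragment:
    """Hypothetical stand-in for DocxFragment."""
    text: str
    wordwrap: bool = True


def sketch_wordwrap_fragments(fragments: Iterable[Fragment], width: int) -> str:
    """Wrap runs of wrappable fragments; emit other fragments unchanged."""
    pieces: List[str] = []
    pending = ""  # accumulated text that still needs wrapping
    for frag in fragments:
        if frag.wordwrap:
            pending += frag.text
            continue
        if pending:
            pieces.append(textwrap.fill(pending, width))
            pending = ""
        pieces.append(frag.text)  # e.g. a pre-formatted table
    if pending:
        pieces.append(textwrap.fill(pending, width))
    return "\n".join(pieces)


frags = [
    Fragment("aaa bbb "),
    Fragment("ccc"),
    Fragment("+---------+", wordwrap=False),
]
print(sketch_wordwrap_fragments(frags, 7))
```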

cardinal_pythonlib.extract_text.does_unrtf_support_quiet() → bool[source]: The unrtf tool supports the ‘--quiet’ argument from a version that I’m not quite sure of, where 0.19.3 < version <= 0.21.9. We check against 0.21.9 here.
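
The version check itself can be sketched as a tuple comparison. How the real function obtains the installed unrtf version is not covered here; the names parse_version and sketch_supports_quiet are hypothetical, and the sketch assumes a purely numeric dotted version string.

```python
from typing import Tuple

# Threshold taken from the description above.
UNRTF_QUIET_MIN_VERSION = (0, 21, 9)


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert a string such as '0.21.9' into (0, 21, 9)."""
    return tuple(int(part) for part in version.strip().split("."))


def sketch_supports_quiet(version: str) -> bool:
    """True if the given version is at least the threshold."""
    return parse_version(version) >= UNRTF_QUIET_MIN_VERSION


print(sketch_supports_quiet("0.19.3"))
print(sketch_supports_quiet("0.21.10"))
```

Comparing integer tuples rather than raw strings matters: as strings, "0.21.10" would sort before "0.21.9".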

cardinal_pythonlib.extract_text.gen_xml_files_from_docx(fp: BinaryIO) → Iterator[str][source]

Generate XML files (as strings) from a DOCX file.

Parameters: fp – BinaryIO object for reading the .DOCX file
Yields: the string contents of each individual XML file within the .DOCX file
Raises: zipfile.BadZipFile – if the zip is unreadable (encrypted?)
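
Since a DOCX file is a zip archive, a generator of this kind can be sketched with zipfile. This is an illustrative assumption rather than the library's code: the name sketch_gen_xml_files is hypothetical, and the sketch yields every member ending in .xml, whereas the real function may select specific parts in a specific order.

```python
import io
import zipfile
from typing import BinaryIO, Iterator


def sketch_gen_xml_files(fp: BinaryIO) -> Iterator[str]:
    """Yield the decoded contents of each .xml member of a zip archive."""
    with zipfile.ZipFile(fp) as z:  # raises zipfile.BadZipFile if not a zip
        for name in z.namelist():
            if name.endswith(".xml"):
                yield z.read(name).decode("utf-8")


# Build a tiny in-memory archive to stand in for a DOCX file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<doc>body</doc>")
    z.writestr("word/media/image1.png", b"\x89PNG")
buf.seek(0)
print(list(sketch_gen_xml_files(buf)))
```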

cardinal_pythonlib.extract_text.get_chardet_encoding(binary_contents: bytes) → str | None[source]: Guess the character set encoding of the specified binary_contents.

cardinal_pythonlib.extract_text.get_cmd_output(*args, encoding: str = 'utf-8') → str[source]: Returns text output of a command.

cardinal_pythonlib.extract_text.get_cmd_output_from_stdin(stdint_content_binary: bytes, *args, encoding: str = 'utf-8') → str[source]: Returns text output of a command, passing binary data in via stdin.
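
Passing binary data to an external command via stdin and decoding its output can be sketched with subprocess.run. The function name sketch_cmd_output_from_stdin is hypothetical, and the choices of check=True and strict decoding are assumptions; the real function's error handling may differ. The demonstration uses the running Python interpreter as the external command so that it works without extra tools.

```python
import subprocess
import sys


def sketch_cmd_output_from_stdin(
    stdin_bytes: bytes, *args: str, encoding: str = "utf-8"
) -> str:
    """Run a command, feed it stdin_bytes, and return its decoded stdout."""
    result = subprocess.run(
        args,
        input=stdin_bytes,        # sent to the child's stdin
        stdout=subprocess.PIPE,   # capture the child's stdout
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout.decode(encoding)


upper = sketch_cmd_output_from_stdin(
    b"hello",
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
)
print(upper)
```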

cardinal_pythonlib.extract_text.get_file_contents(filename: str | None = None, blob: bytes | None = None) → bytes[source]: Returns the binary contents of a file, or of a BLOB.

cardinal_pythonlib.extract_text.get_file_contents_text(filename: str | None = None, blob: bytes | None = None, config: ~cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]: Returns the string contents of a file, or of a BLOB.

cardinal_pythonlib.extract_text.get_filelikeobject(filename: str | None = None, blob: bytes | None = None) → BinaryIO[source]

Open a file-like object.

Guard the use of this function with a with statement, so that the object is closed afterwards.

Parameters:

filename – for specifying via a filename
blob – for specifying via an in-memory bytes object

Returns:

a BinaryIO object
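
The filename-or-blob pattern can be sketched with open() and io.BytesIO, both of which return objects usable in a with statement. The name sketch_get_filelikeobject is hypothetical, and the exactly-one-argument check is an assumption about sensible behaviour rather than a statement of what the library enforces.

```python
import io
from typing import BinaryIO, Optional


def sketch_get_filelikeobject(
    filename: Optional[str] = None, blob: Optional[bytes] = None
) -> BinaryIO:
    """Return a binary file-like object for a file on disk or a bytes blob."""
    if (filename is None) == (blob is None):
        raise ValueError("Specify exactly one of filename or blob")
    if filename is not None:
        return open(filename, "rb")
    return io.BytesIO(blob)


with sketch_get_filelikeobject(blob=b"in-memory data") as fp:
    data = fp.read()
print(data)
```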

cardinal_pythonlib.extract_text.is_text_extractor_available(extension: str) → bool[source]: Is a text extractor available for the specified extension?

cardinal_pythonlib.extract_text.main() → None[source]: Command-line processor. See --help for details.

cardinal_pythonlib.extract_text.require_text_extractor(extension: str) → None[source]: Require that a text extractor is available for the specified extension, or raise ValueError.

cardinal_pythonlib.extract_text.rstrip_all_lines(text: str) → str[source]: Right-strips all lines in a string and returns the result.
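
Right-stripping every line is a one-liner in Python. The sketch below (hypothetical name sketch_rstrip_all_lines) shows one plausible form; note that, because it uses splitlines(), a trailing newline on the input is not preserved, which may or may not match the real function.

```python
def sketch_rstrip_all_lines(text: str) -> str:
    """Strip trailing whitespace from each line, keeping leading whitespace."""
    return "\n".join(line.rstrip() for line in text.splitlines())


print(repr(sketch_rstrip_all_lines("a  \nb\t\n  c")))
```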

cardinal_pythonlib.extract_text.update_external_tools(tooldict: Dict[str, str]) → None[source]

Update the global map of tools.

Parameters: tooldict – dictionary whose keys are tool names and whose values are paths to the executables
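
A global tool map of this kind, together with an availability check, might look like the following sketch. All names (TOOLS, sketch_update_tools, sketch_tool_available) and the initial map contents are hypothetical, and using shutil.which to test availability is an assumption, not a description of the library's availability_*() functions.

```python
import shutil
import sys
from typing import Dict

# Hypothetical global map: tool name -> executable name or path.
TOOLS: Dict[str, str] = {"antiword": "antiword", "unrtf": "unrtf"}


def sketch_update_tools(tooldict: Dict[str, str]) -> None:
    """Merge user-specified tool paths into the global map."""
    TOOLS.update(tooldict)


def sketch_tool_available(name: str) -> bool:
    """True if the named tool is in the map and resolves to an executable."""
    path = TOOLS.get(name)
    return bool(path) and shutil.which(path) is not None


# Register the running interpreter under a made-up tool name.
sketch_update_tools({"python": sys.executable})
print(sketch_tool_available("python"))
```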

cardinal_pythonlib.extract_text.wordwrap(text: str, width: int) → str[source]

Word-wraps text.

Parameters:

text – text to process (will be treated as a single line)
width – width to word-wrap to (or 0 to skip word wrapping)

Returns:

wrapped text

from cardinal_pythonlib.extract_text import wordwrap

# Build one very long line, then wrap it to 80 columns.
text = "Here is a very long line that may be word-wrapped. " * 50
print(wordwrap(text, 80))