cardinal_pythonlib.extract_text


Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).

This file is part of cardinal_pythonlib.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Converts a bunch of stuff to text, either from external files or from in-memory binary objects (BLOBs).

Prerequisites:

sudo apt-get install antiword
pip install docx pdfminer
  • Author: Rudolf Cardinal (rudolf@pobox.com)
  • Created: Feb 2015
  • Last update: 24 Sep 2015

class cardinal_pythonlib.extract_text.CustomDocxParagraph(text: str = '')[source]

Represents a paragraph of text in a DOCX file.

class cardinal_pythonlib.extract_text.CustomDocxTable(rows: List[cardinal_pythonlib.extract_text.CustomDocxTableRow] = None)[source]

Represents a table of a DOCX file. May contain several rows.

class cardinal_pythonlib.extract_text.CustomDocxTableCell(paragraphs: List[cardinal_pythonlib.extract_text.CustomDocxParagraph] = None)[source]

Represents a cell within a table of a DOCX file. May contain several paragraphs.

class cardinal_pythonlib.extract_text.CustomDocxTableRow(cells: List[cardinal_pythonlib.extract_text.CustomDocxTableCell] = None)[source]

Represents a row within a table of a DOCX file. May contain several cells (one per column).

class cardinal_pythonlib.extract_text.DocxFragment(text: str, wordwrap: bool = True)[source]

Representation of a line, or multiple lines, which may or may not need word-wrapping.

class cardinal_pythonlib.extract_text.TextProcessingConfig(encoding: str = None, width: int = 120, min_col_width: int = 15, plain: bool = False, semiplain: bool = False, docx_in_order: bool = True, horizontal_char='─', vertical_char='│', junction_char='┼', plain_table_start: str = None, plain_table_end: str = None, plain_table_col_boundary: str = None, plain_table_row_boundary: str = None, rstrip: bool = True)[source]

Class to manage control parameters for text extraction, without having to pass a lot of mysterious **kwargs around and lose track of what it means.

All converter functions take one of these objects as a parameter.

Parameters:
  • encoding – optional text file encoding to try in addition to sys.getdefaultencoding().
  • width – overall word-wrapping width
  • min_col_width – minimum column width for tables
  • plain – as plain as possible (e.g. for natural language processing); see docx_process_table().
  • semiplain – quite plain, but with some ASCII art representation of the table structure.
  • docx_in_order – for DOCX files: if True, process paragraphs and tables in the order they occur; if False, process all paragraphs followed by all tables
  • rstrip – Right-strip whitespace from all lines?
  • horizontal_char – horizontal character to use with PrettyTable, e.g. - or ─
  • vertical_char – vertical character to use with PrettyTable, e.g. | or │
  • junction_char – junction character to use with PrettyTable, e.g. + or ┼
  • plain_table_start – table start line to use with plain=True
  • plain_table_end – table end line to use with plain=True
  • plain_table_col_boundary – boundary between columns to use with plain=True
  • plain_table_row_boundary – boundary between rows to use with plain=True

Example of a DOCX table processed with:

  • plain=False, semiplain=False

    ┼─────────────┼─────────────┼
    │ Row 1 col 1 │ Row 1 col 2 │
    ┼─────────────┼─────────────┼
    │ Row 2 col 1 │ Row 2 col 2 │
    ┼─────────────┼─────────────┼
    
  • plain=False, semiplain=True

    ─────────────────────────────
      Row 1 col 1
    ─────────────────────────────
                    Row 1 col 2
    ─────────────────────────────
      Row 2 col 1
    ─────────────────────────────
                    Row 2 col 2
    ─────────────────────────────
    
  • plain=True

    ╔═════════════════════════════════════════════════════════════════╗
    Row 1 col 1
    ───────────────────────────────────────────────────────────────────
    Row 1 col 2
    ═══════════════════════════════════════════════════════════════════
    Row 2 col 1
    ───────────────────────────────────────────────────────────────────
    Row 2 col 2
    ╚═════════════════════════════════════════════════════════════════╝
    

The plain format is probably better, in general, for NLP, and is definitely clearer with nested tables (for which the word-wrapping algorithm is imperfect). We avoid “heavy” box drawing as it has a higher chance of being mangled under Windows.
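
As a rough illustration of the plain=True layout shown above, the sketch below writes one cell per line, with a light rule between cells and a double rule between rows. It is not the library's implementation: the function render_plain_table and its arguments are hypothetical and only loosely echo the plain_table_* parameters.

```python
from typing import List


def render_plain_table(
    rows: List[List[str]],
    width: int = 67,
    start_chars: str = "╔═╗",
    end_chars: str = "╚═╝",
    col_boundary_char: str = "─",
    row_boundary_char: str = "═",
) -> str:
    """Render a table with one cell per line (hypothetical sketch)."""
    # Build the opening and closing lines from (left, fill, right) characters.
    left, fill, right = start_chars
    start_line = left + fill * (width - 2) + right
    left, fill, right = end_chars
    end_line = left + fill * (width - 2) + right
    col_boundary = col_boundary_char * width
    row_boundary = row_boundary_char * width

    lines = [start_line]
    for row_index, row in enumerate(rows):
        if row_index > 0:
            lines.append(row_boundary)  # separates one row from the next
        for col_index, cell in enumerate(row):
            if col_index > 0:
                lines.append(col_boundary)  # separates cells within a row
            lines.append(cell)
    lines.append(end_line)
    return "\n".join(lines)


table = [["Row 1 col 1", "Row 1 col 2"], ["Row 2 col 1", "Row 2 col 2"]]
plain_text = render_plain_table(table)
print(plain_text)
```

Because every cell starts at the left margin, nested tables need no column-width arithmetic, which is one reason this layout copes with them more gracefully.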

cardinal_pythonlib.extract_text.availability_anything() → bool[source]

Is a generic “something-to-text” processor available?

cardinal_pythonlib.extract_text.availability_doc() → bool[source]

Is a DOC processor available?

cardinal_pythonlib.extract_text.availability_pdf() → bool[source]

Is a PDF-to-text tool available?

cardinal_pythonlib.extract_text.availability_rtf() → bool[source]

Is an RTF processor available?

cardinal_pythonlib.extract_text.convert_anything_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Convert arbitrary files to text, using strings or strings2. (strings is a standard Unix command to get text from any old rubbish.)

cardinal_pythonlib.extract_text.convert_doc_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts Microsoft Word DOC files to text.

cardinal_pythonlib.extract_text.convert_docx_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts a DOCX file to text. Pass either a filename or a binary object.

Parameters:
  • filename – filename to process
  • blob – binary bytes object to process
  • config – TextProcessingConfig control object
Returns:
text contents

cardinal_pythonlib.extract_text.convert_html_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts HTML to text.

cardinal_pythonlib.extract_text.convert_odt_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts an OpenOffice ODT file to text.

Pass either a filename or a binary object.

cardinal_pythonlib.extract_text.convert_pdf_to_txt(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts a PDF file to text. Pass either a filename or a binary object.

cardinal_pythonlib.extract_text.convert_rtf_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts RTF to text.

cardinal_pythonlib.extract_text.convert_xml_to_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts XML to text.

cardinal_pythonlib.extract_text.document_to_text(filename: str = None, blob: bytes = None, extension: str = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Converts a document to text.

This function selects a processor based on the file extension (either from the filename, or, in the case of a BLOB, the extension specified manually via the extension parameter).

Pass either a filename or a binary object.

Parameters:
  • filename – the filename to read
  • blob – binary content (alternative to filename)
  • extension – file extension, used as a hint when blob is used
  • config – an optional TextProcessingConfig object
Returns:

Returns a string if the file was processed (potentially an empty string).

Raises:
  • Raises an exception for malformed arguments, missing files, bad filetypes, etc.
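
The extension-based selection described above can be sketched with a small dispatch table. This is a minimal illustration under stated assumptions, not the library's code: the name sketch_document_to_text, the CONVERTERS map, and the two toy converters are all hypothetical, and only two extensions are handled.

```python
import os
import re
from typing import Callable, Dict, Optional


def _plain_to_text(data: bytes) -> str:
    """Toy converter: decode bytes as UTF-8."""
    return data.decode("utf-8", errors="replace")


def _markup_to_text(data: bytes) -> str:
    """Toy converter: crudely strip anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", data.decode("utf-8", errors="replace"))


# Hypothetical extension-to-converter map (not the library's own table).
CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    ".txt": _plain_to_text,
    ".html": _markup_to_text,
}


def sketch_document_to_text(
    filename: Optional[str] = None,
    blob: Optional[bytes] = None,
    extension: Optional[str] = None,
) -> str:
    """Pick a converter from the extension; raise ValueError on bad input."""
    if (filename is None) == (blob is None):
        raise ValueError("Pass exactly one of filename or blob")
    if filename is not None:
        # In this sketch the filename's own extension is always used.
        extension = os.path.splitext(filename)[1]
        with open(filename, "rb") as f:
            data = f.read()
    else:
        data = blob
    if not extension:
        raise ValueError("No extension available to choose a converter")
    # Normalise "HTML", "html" and ".html" to the same key.
    extension = extension.lower()
    if not extension.startswith("."):
        extension = "." + extension
    if extension not in CONVERTERS:
        raise ValueError(f"No converter for extension {extension!r}")
    return CONVERTERS[extension](data)


print(sketch_document_to_text(blob=b"<p>Hello</p>", extension="html"))
```

With a BLOB there is no filename to inspect, which is why the extension has to be supplied separately as a hint.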
cardinal_pythonlib.extract_text.docx_gen_fragments_from_xml_node(node: xml.etree.ElementTree.Element, level: int, config: cardinal_pythonlib.extract_text.TextProcessingConfig) → Generator[cardinal_pythonlib.extract_text.DocxFragment, None, None][source]

Returns text from an XML node within a DOCX file.

Parameters:
  • node – an XML node
  • level – current level in XML hierarchy (used for recursion; start level is 0)
  • config – TextProcessingConfig control object
Yields:

DocxFragment objects

cardinal_pythonlib.extract_text.docx_gen_wordwrapped_fragments(fragments: Iterable[cardinal_pythonlib.extract_text.DocxFragment], width: int) → Generator[str, None, None][source]

Generates word-wrapped fragments.

cardinal_pythonlib.extract_text.docx_process_table(table: cardinal_pythonlib.extract_text.CustomDocxTable, config: cardinal_pythonlib.extract_text.TextProcessingConfig) → str[source]

Converts a DOCX table to text.

Structure representing a DOCX table:

table
    .rows[]
        .cells[]
            .paragraphs[]
                .text

That’s the structure of a docx.table.Table object, but also of our homebrew creation, CustomDocxTable.

  • The plain and semiplain options are implemented via the TextProcessingConfig.

  • Note also that the grids in DOCX files can have varying number of cells per row, e.g.

    +---+---+---+
    | 1 | 2 | 3 |
    +---+---+---+
    | 1 | 2 |
    +---+---+
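
To make the table.rows[].cells[].paragraphs[].text structure and the ragged-row case concrete, here is a small stand-alone sketch. The dataclasses and the helpers table_to_grid and make_row are hypothetical stand-ins written for this example; they are not the CustomDocx* classes themselves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    text: str = ""


@dataclass
class Cell:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    rows: List[Row] = field(default_factory=list)


def table_to_grid(table: Table) -> List[List[str]]:
    """Flatten a table to a rectangular grid, padding short rows with ''."""
    n_cols = max((len(row.cells) for row in table.rows), default=0)
    grid = []
    for row in table.rows:
        # Join the paragraphs of each cell into a single string.
        texts = ["\n".join(p.text for p in cell.paragraphs) for cell in row.cells]
        # Pad rows that have fewer cells than the widest row.
        texts += [""] * (n_cols - len(texts))
        grid.append(texts)
    return grid


def make_row(*texts: str) -> Row:
    """Build a row whose cells each hold one paragraph."""
    return Row(cells=[Cell(paragraphs=[Paragraph(t)]) for t in texts])


ragged = Table(rows=[make_row("1", "2", "3"), make_row("1", "2")])
print(table_to_grid(ragged))  # [['1', '2', '3'], ['1', '2', '']]
```

Padding to the widest row is only one possible policy; it is chosen here because a rectangular grid is the easiest shape to hand to a table formatter.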
    
cardinal_pythonlib.extract_text.docx_table_from_xml_node(table_node: xml.etree.ElementTree.Element, level: int, config: cardinal_pythonlib.extract_text.TextProcessingConfig) → str[source]

Converts an XML node representing a DOCX table into a textual representation.

Parameters:
  • table_node – XML node
  • level – current level in XML hierarchy (used for recursion; start level is 0)
  • config – TextProcessingConfig control object
Returns:

string representation

cardinal_pythonlib.extract_text.docx_text_from_xml(xml: str, config: cardinal_pythonlib.extract_text.TextProcessingConfig) → str[source]

Converts an XML tree of a DOCX file to string contents.

Parameters:
  • xml – the XML to process, as a string
  • config – TextProcessingConfig control object
Returns:

contents as a string

cardinal_pythonlib.extract_text.docx_text_from_xml_node(node: xml.etree.ElementTree.Element, level: int, config: cardinal_pythonlib.extract_text.TextProcessingConfig) → str[source]

Returns text from an XML node within a DOCX file.

Parameters:
  • node – an XML node
  • level – current level in XML hierarchy (used for recursion; start level is 0)
  • config – TextProcessingConfig control object
Returns:

contents as a string

cardinal_pythonlib.extract_text.docx_wordwrap_fragments(fragments: Iterable[cardinal_pythonlib.extract_text.DocxFragment], width: int) → str[source]

Joins multiple fragments and word-wraps them as necessary.

cardinal_pythonlib.extract_text.does_unrtf_support_quiet() → bool[source]

The unrtf tool supports the --quiet argument from a version that I’m not quite sure of, where 0.19.3 < version <= 0.21.9. We check against 0.21.9 here.
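
A version check of this kind is safest when done on numeric tuples rather than strings, since "0.21.10" sorts before "0.21.9" as text. The sketch below shows the comparison only; parse_version and supports_quiet are hypothetical names, and how the installed version string is obtained from unrtf is left out.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '0.21.9' into (0, 21, 9)."""
    return tuple(int(part) for part in version.split("."))


def supports_quiet(installed_version: str, required: str = "0.21.9") -> bool:
    """True if the installed version is at least the required one."""
    return parse_version(installed_version) >= parse_version(required)


print(supports_quiet("0.19.3"))   # False
print(supports_quiet("0.21.9"))   # True
print(supports_quiet("0.21.10"))  # True
```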

cardinal_pythonlib.extract_text.gen_xml_files_from_docx(fp: BinaryIO) → Iterator[str][source]

Generate XML files (as strings) from a DOCX file.

Parameters:
  • fp – BinaryIO object for reading the .DOCX file
Yields:

the string contents of each individual XML file within the .DOCX file

Raises:
  • zipfile.BadZipFile – if the zip is unreadable (encrypted?)
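
A DOCX file is a zip archive, so the idea can be sketched with the standard zipfile module. This is an illustration under assumptions rather than the library's code: sketch_gen_xml_files is a hypothetical name, it yields every member ending in .xml (the real function may select files differently), and it assumes UTF-8.

```python
import io
import zipfile
from typing import BinaryIO, Iterator


def sketch_gen_xml_files(fp: BinaryIO) -> Iterator[str]:
    """Yield the decoded contents of each .xml member of a zip archive."""
    # zipfile.ZipFile raises zipfile.BadZipFile if fp is not a valid zip.
    with zipfile.ZipFile(fp) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                yield archive.read(name).decode("utf-8")


# Build a tiny stand-in archive in memory (not a valid Word document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<doc>Hello</doc>")
    archive.writestr("word/media/image1.png", b"\x89PNG")
buffer.seek(0)

xml_strings = list(sketch_gen_xml_files(buffer))
print(xml_strings)  # ['<doc>Hello</doc>']
```

Because this is a generator, the archive is only opened, and any BadZipFile only raised, once iteration starts.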
cardinal_pythonlib.extract_text.get_chardet_encoding(binary_contents: bytes) → Optional[str][source]

Guess the character set encoding of the specified binary_contents.

cardinal_pythonlib.extract_text.get_cmd_output(*args, encoding: str = 'utf-8') → str[source]

Returns text output of a command.

cardinal_pythonlib.extract_text.get_cmd_output_from_stdin(stdint_content_binary: bytes, *args, encoding: str = 'utf-8') → str[source]

Returns text output of a command, passing binary data in via stdin.
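
One way to feed binary data to a command and capture its text output is subprocess.run with the input argument. The sketch below assumes that approach; sketch_cmd_output_from_stdin is a hypothetical name, and the error handling (check=True, ignoring undecodable bytes) is a choice made for this example.

```python
import subprocess
import sys


def sketch_cmd_output_from_stdin(
    stdin_bytes: bytes, *args: str, encoding: str = "utf-8"
) -> str:
    """Run a command, send stdin_bytes to it, and return decoded stdout."""
    completed = subprocess.run(
        args,
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.decode(encoding, errors="ignore")


# Use the running Python interpreter as a portable stand-in for a real tool.
upper = sketch_cmd_output_from_stdin(
    b"hello",
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
)
print(upper)  # HELLO
```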

cardinal_pythonlib.extract_text.get_file_contents(filename: str = None, blob: bytes = None) → bytes[source]

Returns the binary contents of a file, or of a BLOB.

cardinal_pythonlib.extract_text.get_file_contents_text(filename: str = None, blob: bytes = None, config: cardinal_pythonlib.extract_text.TextProcessingConfig = <cardinal_pythonlib.extract_text.TextProcessingConfig object>) → str[source]

Returns the string contents of a file, or of a BLOB.

cardinal_pythonlib.extract_text.get_filelikeobject(filename: str = None, blob: bytes = None) → BinaryIO[source]

Open a file-like object.

Guard the use of this function with a with statement, so that the object is closed when you have finished with it.

Parameters:
  • filename – for specifying via a filename
  • blob – for specifying via an in-memory bytes object
Returns:

a BinaryIO object
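
The filename-or-BLOB pattern can be sketched as follows, assuming an ordinary binary file handle for filenames and io.BytesIO for in-memory data. The name sketch_filelikeobject is hypothetical and the argument checking is this example's own choice.

```python
import io
from typing import BinaryIO, Optional


def sketch_filelikeobject(
    filename: Optional[str] = None, blob: Optional[bytes] = None
) -> BinaryIO:
    """Return a binary file-like object for a filename or a bytes object."""
    if (filename is None) == (blob is None):
        raise ValueError("Pass exactly one of filename or blob")
    if filename is not None:
        return open(filename, "rb")
    return io.BytesIO(blob)


# Both return types are context managers, so "with" closes them reliably.
with sketch_filelikeobject(blob=b"in-memory bytes") as fp:
    first_bytes = fp.read(9)
print(first_bytes)  # b'in-memory'
```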

cardinal_pythonlib.extract_text.is_text_extractor_available(extension: str) → bool[source]

Is a text extractor available for the specified extension?

cardinal_pythonlib.extract_text.main() → None[source]

Command-line processor. See --help for details.

cardinal_pythonlib.extract_text.require_text_extractor(extension: str) → None[source]

Require that a text extractor is available for the specified extension, or raise ValueError.

cardinal_pythonlib.extract_text.rstrip_all_lines(text: str) → str[source]

Right-strips all lines in a string and returns the result.
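
A minimal sketch of the idea, assuming lines are split with str.splitlines() and rejoined with newlines (so a trailing newline in the input is not preserved); sketch_rstrip_all_lines is a hypothetical name.

```python
def sketch_rstrip_all_lines(text: str) -> str:
    """Remove trailing whitespace from every line of text."""
    return "\n".join(line.rstrip() for line in text.splitlines())


stripped = sketch_rstrip_all_lines("a  \n b\t\nc")
print(repr(stripped))  # 'a\n b\nc'
```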

cardinal_pythonlib.extract_text.update_external_tools(tooldict: Dict[str, str]) → None[source]

Update the global map of tools.

Parameters:
  • tooldict – dictionary whose keys are tool names and whose values are paths to the executables
cardinal_pythonlib.extract_text.wordwrap(text: str, width: int) → str[source]

Word-wraps text.

Parameters:
  • text – text to process (will be treated as a single line)
  • width – width to word-wrap to (or 0 to skip word wrapping)
Returns:

wrapped text

from cardinal_pythonlib.extract_text import wordwrap

# Build one very long line, then wrap it to 80 columns.
text = "Here is a very long line that may be word-wrapped. " * 50
print(wordwrap(text, 80))