cardinal_pythonlib.spreadsheets
Original code copyright (C) 2009-2022 Rudolf Cardinal (rudolf@pobox.com).
This file is part of cardinal_pythonlib.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Manipulate spreadsheets.
Note:
openpyxl is dreadfully slow. Its results are picklable, but not sensibly so (e.g. generating a >500Mb picklefile from a 12Mb spreadsheet.
xlrd is much faster, but we can’t pickle its results.
- class cardinal_pythonlib.spreadsheets.RowHolder(sheetholder: SheetHolder, row: int)[source]
Class to read from a single row of a spreadsheet.
The intended use is to create something like a dataclass, but one that knows its spreadsheet structure. Like this:
from cardinal_pythonlib.spreadsheets import RowHolder, SheetHolder class ReferralSheetHolder(SheetHolder): SHEET_NAME = "Patient Referrals 2018-19" def gen_referral_rows(self) -> Iterable["ReferralRow"]: for rownum in self.gen_row_numbers_excluding_header_row(): yield ReferralRow(self, rownum) class ReferralRow(RowHolder): def __init__(self, sheetholder: SheetHolder, row: int) -> None: super().__init__(sheetholder, row) self.inc_next_col() # column 0: query period; ignore self.patient_id = self.str_int_pp() self.referral_id_within_patient = self.int_pp() self.age_at_referral_int = self.int_pp() self.ethnicity = self.str_pp() self.gender = self.str_pp(check_header="Gender") def import_referrals(book: Book) -> None: sheet = ReferralSheetHolder(book) for referral in sheet.gen_referral_rows(): pass # do something useful here
- bool_pp(default: bool | None = None, true_values_lowercase: List[Any] | None = None, false_values_lowercase: List[Any] | None = None, unknown_values_lowercase: List[Any] | None = None, check_header: str | Sequence[str] | None = None) bool | None [source]
Reads a boolean value, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- date_pp(default: date | None = None, check_header: str | Sequence[str] | None = None) date | None [source]
Reads a date, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- datetime_pp(default: datetime | None = None, check_header: str | Sequence[str] | None = None) datetime | None [source]
Reads a datetime, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- decimal_pp(default: float | None = None, check_header: str | Sequence[str] | None = None, dp: int | None = None, rounding: str = 'ROUND_HALF_UP') Decimal | None [source]
Reads a Decimal, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- ensure_header(col: int, header: str | Sequence[str]) None [source]
Ensures the column has an appropriate heading value, or raises
ValueError
.
- ensure_heading(col: int, header: str | Sequence[str]) None [source]
Synonym for
ensure_header()
.
- ensure_next_col_header(header: str | Sequence[str]) None [source]
Ensures the next column has an appropriate heading value, or raises
ValueError
.
- ensure_next_col_heading(header: str | Sequence[str]) None [source]
Synonym for
ensure_next_col_header()
.
- float_pp(default: float | None = None, check_header: str | Sequence[str] | None = None) float | None [source]
Reads a float, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- int_pp(default: int | None = None, check_header: str | Sequence[str] | None = None) int | None [source]
Reads an int, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- property next_col: int
Returns the column number (0-based) that will be used by the next automatic read operation.
- none_pp(check_header: str | Sequence[str] | None = None) None [source]
Reads a null value, and ensures that it is null; then increments the “current” column. Optionally, checks that the header for this column is as expected.
- property row_one_based: int
One-based row number.
- property row_zero_based: int
Zero-based row number.
- str_int_pp(default: str | None = None, check_header: str | Sequence[str] | None = None) str | None [source]
Reads an integer as a string, then increments the “current” column. Optionally, checks that the header for this column is as expected.
- str_nonfloat_pp(default: str | None = None, check_header: str | Sequence[str] | None = None) str | None [source]
Reads something that may be a string or numeric, but if it’s numeric, it’s integer (not float). Then increments the “current” column. Optionally, checks that the header for this column is as expected.
- class cardinal_pythonlib.spreadsheets.SheetHolder(book: Book | None = None, sheet_name: str | None = None, sheet_index: int | None = None, sheet: Sheet | None = None, header_row_zero_based: int | None = None, first_data_row_zero_based: int | None = None, null_values: List[Any] | None = None, bool_true_values_lowercase: List[Any] | None = None, bool_false_values_lowercase: List[Any] | None = None, bool_unknown_values_lowercase: List[Any] | None = None, debug_max_rows_per_sheet: int | None = None)[source]
Class to read from an Excel spreadsheet.
There are two ways to specify the sheet:
Provide a workbook via
book
and…a sheet number, or
a sheet name.
Provide a worksheet directly via
sheet
.
You can specify the following as
_init__
parameters or (via their capitalized versions) by subclassing:sheet_name
header_row_zero_based
null_values
bool_true_values_lowercase
bool_false_values_lowercase
bool_unknown_values_lowercase
Initialization parameters take priority over subclassed values.
- Parameters:
book¶ – Workbook, from which a worksheet should be selected.
sheet_name¶ – Name of a sheet to select from within
book
.sheet_index¶ – Index (zero-based) of a sheet to select from within
book
.sheet¶ – Worksheet, provided directly.
header_row_zero_based¶ – Row number (zero-based) of the header row.
first_data_row_zero_based¶ – Row number (zero-based) of the first row containing data.
null_values¶ – Values to treat as null (blank) values, converted to Python
None
.bool_true_values_lowercase¶ – Values to treat, by default, as
True
in Boolean columns.bool_false_values_lowercase¶ – Values to treat, by default, as
False
in Boolean columns.bool_unknown_values_lowercase¶ – Values to treat, by default, as missing/unknown in Boolean columns.
debug_max_rows_per_sheet¶ – Debugging option: the maximum number of data rows to process.
- ensure_header(col: int, header: str | Sequence[str]) None [source]
Ensures that the header is correct for a specified column, or raise
ValueError
.You can specify a single correct heading or a sequence (e.g. list) of them.
- ensure_heading(col: int, heading: str | Sequence[str]) None [source]
Synonym for
ensure_header()
.
- gen_row_numbers_excluding_header_row(with_counter: bool = True) Iterable[int] [source]
Generates row numbers.
xlrd uses 0-based numbering, so row 1 is the first beyond a header row.
- gen_rows_excluding_header_row(with_counter: bool = True) Iterable[Sequence[Cell]] [source]
Generates rows. AVOID; index-based access is faster.
xlrd uses 0-based numbering, so row 1 is the first beyond a header row.
- property headers: List[str]
Returns all headings.
- property n_data_rows: int
Total number of data rows (below any header row).
- property n_rows: int
Total number of rows.
- read_bool(row: int, col: int, default: bool | None = None, true_values_lowercase: List[Any] | None = None, false_values_lowercase: List[Any] | None = None, unknown_values_lowercase: List[Any] | None = None, check_header: str | Sequence[str] | None = None) bool | None [source]
Reads a boolean value.
- read_date(row: int, col: int, default: date | None = None, check_header: str | Sequence[str] | None = None) date | None [source]
Reads a date from an Excel spreadsheet
https://stackoverflow.com/questions/32430679/how-to-read-dates-using-xlrd
- read_datetime(row: int, col: int, default: datetime | None = None, check_header: str | Sequence[str] | None = None) datetime | None [source]
Reads a datetime from an Excel spreadsheet via xlrd.
https://stackoverflow.com/questions/32430679/how-to-read-dates-using-xlrd
- read_decimal(row: int, col: int, default: Decimal | None = None, check_header: str | Sequence[str] | None = None, dp: int | None = None, rounding: str = 'ROUND_HALF_UP') Decimal | None [source]
Reads a Decimal from the spreadsheet.
If
dp
is notNone
, force the result to a specified number of decimal places, using the specified rounding method.
- read_float(row: int, col: int, default: float | None = None, check_header: str | Sequence[str] | None = None) float | None [source]
Reads a float from the spreadsheet.
- read_int(row: int, col: int, default: int | None = None, check_header: str | Sequence[str] | None = None) int | None [source]
Reads an integer from a spreadsheet.
- read_none(row: int, col: int, check_header: str | Sequence[str] | None = None) None [source]
Reads a value, and checks that it is a none/null value
- read_str(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None [source]
Reads a string from a spreadsheet.
- read_str_int(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None [source]
Reads a string version of an integer. (This prevents e.g. “2” being read as a floating-point value of “2.0” then converted to a string.)
- read_str_nonfloat(row: int, col: int, default: str | None = None, check_header: str | Sequence[str] | None = None) str | None [source]
Reads something that may be a string or numeric, but if it’s numeric, it’s integer (not float). (This prevents e.g. “2” being read as a floating-point value of “2.0” then converted to a string.)
- read_value(row: int, col: int, check_header: str | Sequence[str] | None = None) Any [source]
Retrieves a value from a cell of a spreadsheet.
Optionally, check that the heading for this column is correct (see
ensure_header()
).
- property sheet_name: str
Name of the sheet within the workbook (file).
- cardinal_pythonlib.spreadsheets.all_same(items: Iterable[Any]) bool [source]
Are all the items the same?
… though we will also allow “no items” to pass the test.
- cardinal_pythonlib.spreadsheets.attr_all_same(items: Sequence[Any], attr: str) bool [source]
Returns whether the value of an attribute is the same across a collection of items.
- cardinal_pythonlib.spreadsheets.check_attr_all_same(items: Sequence[Any], attr: str, id_attr: str | None = None, fail_if_different: bool = True, ignore_none: bool = False) None [source]
Checks if the value of an attribute is the same across a collection of items, and takes some action if not.
- Parameters:
items¶ – Items to check
attr¶ – Name of attribute whose value should be compared across items.
id_attr¶ – If the attributes are not all the same, use the value of this attribute from the first item to give some identifying context to the failure message.
fail_if_different¶ – If true, raises
ValueError
on failure; otherwise, prints a warning to the log.ignore_none¶ – Ignore
None
values?
- cardinal_pythonlib.spreadsheets.colnum_zb_from_alphacol(alphacol: str) int [source]
Reverses
column_lettering()
, generating a zero-based column index from an alphabetical name (A to Z, AA to AZ, etc.).
- cardinal_pythonlib.spreadsheets.column_lettering(colnum: int) str [source]
Converts a zero-based column index into a spreadsheet-style column name (A[0] to Z[25], then AA[26] to AZ[51], etc). Basically, it’s almost base 26, but without a proper sense of zero (in that A is zero, but AA is 26).
- cardinal_pythonlib.spreadsheets.load_workbook(spreadsheet_filename: str) Book [source]
Load a workbook.
- cardinal_pythonlib.spreadsheets.none_or_blank_string(x: Any) bool [source]
Is
x
eitherNone
or a string that is empty or contains nothing but whitespace?
- cardinal_pythonlib.spreadsheets.prefer_attr_all_same(items: Sequence[Any], attr: str, id_attr: str, ignore_none: bool = False) None [source]
Warn if the
attr
attribute of each item initems
is not the same. Seecheck_attr_all_same()
.
- cardinal_pythonlib.spreadsheets.read_int_row(row: Sequence[Cell], colnum: int) int | None [source]
Reads an integer from a spreadsheet, given a row.
AVOID: slower than index access (see
SheetHolder
,RowHolder
).
- cardinal_pythonlib.spreadsheets.read_value_row(row: Sequence[Cell], colnum: int) Any [source]
Retrieves a value from a cell of a spreadsheet, given a row.
AVOID: slower than index access (see
SheetHolder
,RowHolder
).
- cardinal_pythonlib.spreadsheets.require_attr_all_same(items: Sequence[Any], attr: str, id_attr: str, ignore_none: bool = False) None [source]
Raise if the
attr
attribute of each item initems
is not the same. Seecheck_attr_all_same()
.