Utilities

Basic Utils Functions

impresso_commons.utils.utils.bytes_to(bytes_nb: int, to_unit: str, bsize: int = 1024) → float

Convert bytes to the specified unit.

Supported target units:
  • 'k' (kilobytes)

  • 'm' (megabytes)

  • 'g' (gigabytes)

  • 't' (terabytes)

  • 'p' (petabytes)

  • 'e' (exabytes)

Parameters:
  • bytes_nb (int) – The number of bytes to be converted.

  • to_unit (str) – The target unit for conversion.

  • bsize (int, optional) – The base size used for conversion (default is 1024).

Returns:

The converted value in the specified unit.

Return type:

float

Raises:

KeyError – If the specified target unit is not supported.
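Since the conversion is simply a division by a power of the base, its behaviour can be sketched in a few lines of plain Python (bytes_to_sketch is a hypothetical stand-in for illustration, not the library function):

```python
def bytes_to_sketch(bytes_nb: int, to_unit: str, bsize: int = 1024) -> float:
    # exponent of the base for each supported target unit
    units = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    # a KeyError is raised if to_unit is not one of the supported units
    return float(bytes_nb) / (bsize ** units[to_unit])

print(bytes_to_sketch(1_048_576, "m"))  # 1.0
```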

impresso_commons.utils.utils.chunk(list, chunksize)

Yield successive chunksize-sized chunks from list.
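The chunking behaviour amounts to slicing the list in chunksize-sized steps; a minimal equivalent sketch (chunk_sketch is illustrative, not the library function):

```python
def chunk_sketch(lst, chunksize):
    # yield successive chunksize-sized slices; the last one may be shorter
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

parts = list(chunk_sketch(list(range(7)), 3))
print(parts)  # [[0, 1, 2], [3, 4, 5], [6]]
```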

impresso_commons.utils.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'impresso_commons') → PosixPath

Return the resource at path in package, using a context manager.

Note

The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.

Parameters:
  • file_manager (contextlib.ExitStack) – Context manager.

  • path (str) – Path to the desired resource in given package.

  • package (str, optional) – Package name. Defaults to “impresso_commons”.

Returns:

Path to desired managed resource.

Return type:

pathlib.PosixPath
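The ExitStack pattern described in the note can be sketched with importlib.resources from the standard library. The stdlib json package stands in for impresso_commons here, and get_pkg_resource_sketch is illustrative only:

```python
from contextlib import ExitStack
from importlib import resources

def get_pkg_resource_sketch(file_manager: ExitStack, path: str, package: str):
    # materialise the packaged resource on disk and register its cleanup
    # on the caller-provided ExitStack
    ref = resources.files(package) / path
    return file_manager.enter_context(resources.as_file(ref))

file_manager = ExitStack()           # instantiated prior to the call
res_path = get_pkg_resource_sketch(file_manager, "__init__.py", "json")
print(res_path.name)                 # __init__.py
file_manager.close()                 # close once the resource is no longer needed
```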

impresso_commons.utils.utils.init_logger(level: int = 20, file: str | None = None) → RootLogger

Initialises the root logger.

Parameters:
  • level (int, optional) – desired level of logging. Defaults to logging.INFO.

  • file (str | None, optional) – Path of a file to log to, if any. Defaults to None.

Returns:

the initialised logger

Return type:

logging.RootLogger

impresso_commons.utils.utils.parse_json(filename)
impresso_commons.utils.utils.validate_against_schema(json_to_validate: dict[str, Any], path_to_schema: str = 'schemas/json/versioning/manifest.schema.json') → None

Validate a dict corresponding to a JSON against a provided JSON schema.

Parameters:
  • json_to_validate (dict[str, Any]) – JSON data to validate against a schema.

  • path_to_schema (str, optional) – Path to the JSON schema to validate against. Defaults to “schemas/json/versioning/manifest.schema.json”.

Raises:

e – If the provided JSON cannot be validated against the provided schema.

S3 Utils Functions

Reusable functions to read/write data from/to our S3 drive. Warning: two boto libraries (boto and boto3) are in use, and both need to be kept until third-party dependency issues are resolved.

impresso_commons.utils.s3.alternative_read_text(s3_key: str, s3_credentials: dict, line_by_line: bool = True) → list[str] | str

Read a line-separated text file (e.g. *.jsonl.bz2) from S3.

Note

The reason for this function is a bug in dask.bag.read_text(), which raises a FileNotFoundError on buckets containing 1000 or more keys.

impresso_commons.utils.s3.fixed_s3fs_glob(path: str, boto3_bucket=None)

A custom glob function (from Benoit, impresso-pyimages package), as the s3fs one seems unable to list more than 1000 elements on the SWITCH S3.

Parameters:
  • path (str) – the glob pattern to match

impresso_commons.utils.s3.get_boto3_bucket(bucket_name: str)
impresso_commons.utils.s3.get_bucket(name, create=False, versioning=True)

Create a boto s3 connection and return the requested bucket.

It is possible to ask for a new bucket with the specified name to be created (in case it does not exist) and, optionally, to turn on versioning for the newly created bucket.

>>> b = get_bucket('testb', create=False)
>>> b = get_bucket('testb', create=True)
>>> b = get_bucket('testb', create=True, versioning=False)

Note

This function is deprecated; please use get_or_create_bucket or get_boto3_bucket instead.

Parameters:
  • name (string) – the bucket’s name

  • create (boolean) – creates the bucket if not yet existing

  • versioning (boolean) – whether the new bucket should be versioned

Returns:

an s3 bucket

Return type:

boto3.resources.factory.s3.Bucket

impresso_commons.utils.s3.get_bucket_boto3(name, create=False, versioning=True)

Get a boto3 s3 resource and return the requested bucket.

It is possible to ask for a new bucket with the specified name to be created (in case it does not exist) and, optionally, to turn on versioning for the newly created bucket.

>>> b = get_bucket('testb', create=False)
>>> b = get_bucket('testb', create=True)
>>> b = get_bucket('testb', create=True, versioning=False)

Note

This function is deprecated; please use get_or_create_bucket or get_boto3_bucket instead.

Parameters:
  • name (string) – the bucket’s name

  • create (boolean) – creates the bucket if not yet existing

  • versioning (boolean) – whether the new bucket should be versioned

Returns:

an s3 bucket

Return type:

boto3.resources.factory.s3.Bucket

impresso_commons.utils.s3.get_or_create_bucket(name, create=False, versioning=True)

Create a boto3 s3 connection and return the requested bucket.

It is possible to ask for a new bucket with the specified name to be created (in case it does not exist) and, optionally, to turn on versioning for the newly created bucket.

>>> b = get_or_create_bucket('testb', create=False)
>>> b = get_or_create_bucket('testb', create=True)
>>> b = get_or_create_bucket('testb', create=True, versioning=False)

Parameters:
  • name (string) – the bucket’s name

  • create (boolean) – creates the bucket if not yet existing

  • versioning (boolean) – whether the new bucket should be versioned

Returns:

an s3 bucket

Return type:

boto3.resources.factory.s3.Bucket

impresso_commons.utils.s3.get_s3_client(host_url='https://os.zhdk.cloud.switch.ch/')
impresso_commons.utils.s3.get_s3_connection(host='os.zhdk.cloud.switch.ch')

Create a boto connection to impresso’s S3 drive.

Assumes that two environment variables are set: SE_ACCESS_KEY and SE_SECRET_KEY.

Note

This function is deprecated, as it uses boto instead of boto3; please use get_s3_resource instead.

Parameters:

host (string) – the s3 endpoint’s host name

Return type:

boto3.resources.factory.s3.ServiceResource

impresso_commons.utils.s3.get_s3_object_size(bucket_name, key)

Get the size of an object (key) in an S3 bucket.

Parameters:
  • bucket_name (str) – The name of the S3 bucket.

  • key (str) – The key (object) whose size you want to retrieve.

Returns:

The size of the object in bytes, or None if the object doesn’t exist.

Return type:

int or None

impresso_commons.utils.s3.get_s3_resource(host_url='https://os.zhdk.cloud.switch.ch/')

Get a boto3 resource object related to an S3 drive.

Assumes that two environment variables are set: SE_ACCESS_KEY and SE_SECRET_KEY.

Parameters:

host_url (string) – the s3 endpoint’s URL

Return type:

boto3.resources.factory.s3.ServiceResource

impresso_commons.utils.s3.get_s3_versions(bucket_name, key_name)

Get versioning information for a given key.

NB: it assumes a versioned bucket.

Parameters:
  • bucket_name (string) – the bucket’s name

  • key_name (string) – the key’s name

Returns:

for each version, the version id and the last modified date

Return type:

a list of tuples, where each tuple holds the version id (a string) and the last modified date (a datetime instance).

impresso_commons.utils.s3.get_s3_versions_client(client, bucket_name, key_name)
impresso_commons.utils.s3.get_storage_options()
impresso_commons.utils.s3.read_jsonlines(key_name, bucket_name)

Given an S3 key pointing to a jsonl.bz2 archive, extract and return its lines (one JSON doc per line).

Usage example:

>>> lines = db.from_sequence(read_jsonlines(key_name, bucket_name))
>>> print(lines.count().compute())
>>> lines.map(json.loads).pluck('ft').take(10)

Parameters:
  • key_name (str) – name of the key, without S3 prefix

  • bucket_name (str) – name of the bucket
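Locally, the jsonl.bz2 format these helpers consume can be produced and read back with the standard bz2 and json modules (the ids below are illustrative placeholders):

```python
import bz2
import json
import os
import tempfile

docs = [{"id": "GDL-1991-01-01-a-i0001", "ft": "hello"},
        {"id": "GDL-1991-01-01-a-i0002", "ft": "world"}]

# write one JSON document per line, bz2-compressed
tmp = tempfile.NamedTemporaryFile(suffix=".jsonl.bz2", delete=False)
tmp.close()
with bz2.open(tmp.name, "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# read the lines back, one JSON document per line
with bz2.open(tmp.name, "rt", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
os.unlink(tmp.name)
print([d["ft"] for d in lines])  # ['hello', 'world']
```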

impresso_commons.utils.s3.readtext_jsonlines(key_name, bucket_name)

Given an S3 key pointing to a jsonl.bz2 archive, extract and return its lines (one JSON doc per line) with limited textual information, leaving out OCR metadata (boxes, offsets). This can serve as the starting point for purely textual processing (NE, text reuse, topics).

Usage example:

>>> lines = db.from_sequence(readtext_jsonlines(key_name, bucket_name))
>>> print(lines.count().compute())
>>> lines.map(json.loads).pluck('ft').take(10)

Parameters:
  • key_name (str) – name of the key, without S3 prefix

  • bucket_name (str) – name of the bucket

Returns:

JSON formatted str

impresso_commons.utils.s3.s3_get_articles(issue, bucket, workers=None)

Read a newspaper issue from S3 and return the articles it contains.

NB: Content items with type = “ad” (advertisement) are filtered out.

Parameters:
  • issue (an instance of impresso_commons.path.IssueDir) – the newspaper issue

  • bucket (boto.s3.bucket.Bucket) – the input s3 bucket

  • workers – number of workers for the iter_bucket function. If None, will be the number of detected CPUs.

Returns:

a list of articles (dictionaries)

impresso_commons.utils.s3.s3_get_pages(issue_id, page_names, bucket)

Read in canonical text data for all pages in a given newspaper issue.

Parameters:
  • issue_id (string) – the canonical issue id (e.g. “IMP-1990-03-15-a”)

  • page_names (list of strings) – a list of canonical page filenames (e.g. “IMP-1990-03-15-a-p0001.json”)

  • bucket (instance of boto.Bucket) – the s3 bucket where the pages to be read are stored

Returns:

a dictionary with page filenames as keys, and JSON data as values.

impresso_commons.utils.s3.upload(partition_name, newspaper_prefix=None, bucket_name=None)
impresso_commons.utils.s3.upload_to_s3(local_path: str, path_within_bucket: str, bucket_name: str) → bool

Upload a file to an S3 bucket.

Parameters:
  • local_path (str) – The local file path to upload.

  • path_within_bucket (str) – The path within the bucket where the file will be uploaded.

  • bucket_name (str) – The name of the S3 bucket (without any partitions).

Returns:

True if the upload is successful, False otherwise.

Return type:

bool

Dask Utils Functions

Utilities which help prepare data for parallel or distributed computing, in a dask-oriented view.

Usage:

daskutils.py partition --config-file=<cf> --nb-partitions=<p> [--log-file=<f> --verbose]

Options:
--config-file=<cf>

json configuration dict specifying various arguments

impresso_commons.utils.daskutils.create_even_partitions(bucket, config_newspapers, output_dir, local_fs=False, keep_full=False, nb_partition=500)

Convert yearly bz2 archives to even bz2 archives, i.e. partitions.

Enables efficient (distributed) processing by bypassing the size discrepancies of newspaper archives. N.B.: articles in the resulting partitions are shuffled. Warning: choose config_newspapers carefully, as it decides what goes into the partitions and is loaded into memory.

Parameters:
  • bucket – name of the bucket where the files to partition are

  • config_newspapers – json dict specifying the sources to consider (name(s) of newspaper(s) and year span(s))

  • output_dir – classic FS repository where to write the produced partitions

  • local_fs

  • keep_full – whether to filter out metadata or not (i.e. keeping only text and leaving out coordinates)

  • nb_partition – number of partitions

Returns:

None

impresso_commons.utils.daskutils.main(args)
impresso_commons.utils.daskutils.partitioner(bag, path, nbpart)

Partition a bag into n partitions and write each partition to a file.
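The even-partitioning idea (shuffle, then deal items out into n buckets of near-equal size) can be sketched without dask; even_partitions_sketch is illustrative, not the library function:

```python
import random

def even_partitions_sketch(items, nb_partitions, seed=0):
    # shuffle so that a single newspaper/year does not dominate one partition
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # deal items out round-robin into nb_partitions near-equal buckets
    partitions = [[] for _ in range(nb_partitions)]
    for i, item in enumerate(shuffled):
        partitions[i % nb_partitions].append(item)
    return partitions

parts = even_partitions_sketch(range(10), 3)
print([len(p) for p in parts])  # [4, 3, 3]
```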

Apache UIMA XMI Utils Functions

Utility functions to export data in Apache UIMA XMI format.

Retrieves from S3 IIIF links for a set of canonical pages where the input content items are found.

impresso_commons.utils.uima.rebuilt2xmi(ci, output_dir, typesystem_path, iiif_mappings, pct_coordinates=False) → str

Converts a rebuilt ContentItem into Apache UIMA/XMI format.

The resulting file will be named after the content item’s ID, adding the .xmi extension.

Parameters:
  • ci (impresso_commons.classes.ContentItem) – the content item to be converted

  • output_dir (str) – the path to the output directory

  • typesystem_path (str) – TypeSystem file containing definitions of annotation layers.

Config File Loader

A class to load configuration files in json format, handling different task settings.

class impresso_commons.utils.config_loader.Base

Bases: object

Base class for initial loading and default checking methods

check_bucket(string, attribute)
check_params(config_dict, keys)
classmethod from_json(json_file)
to_dict()
class impresso_commons.utils.config_loader.PartitionerConfig(config_dict)

Bases: Base

impresso_commons.utils.config_loader.main()

Proper generic configuration handling will follow.

As of now, here is a generic config:

{
  "solr_server" : URL of the server,
  "solr_core": name of the core,
  "s3_host": S3 host,
  "s3_bucket_rebuilt": s3 bucket,
  "s3_bucket_partitions": s3 bucket,
  "s3_bucket_processed": s3 bucket,
  "key_batches": number of keys per batch,
  "number_partitions": number of partitions,
  "newspapers" : {              // newspaper config, to be detailed
        "GDL": [1991,1998]
    }
}