Utilities
Basic Utils Functions
- impresso_commons.utils.utils.bytes_to(bytes_nb: int, to_unit: str, bsize: int = 1024) float
Convert bytes to the specified unit.
Supported target units: ‘k’ (kilobytes), ‘m’ (megabytes), ‘g’ (gigabytes), ‘t’ (terabytes), ‘p’ (petabytes), ‘e’ (exabytes).
- Parameters:
bytes_nb (int) – The number of bytes to be converted.
to_unit (str) – The target unit for conversion.
bsize (int, optional) – The base size used for conversion (default is 1024).
- Returns:
The converted value in the specified unit.
- Return type:
float
- Raises:
KeyError – If the specified target unit is not supported.
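The conversion described above can be sketched as follows. This is a minimal illustration of the documented behaviour, not the library's source: the name bytes_to_sketch and the exponent table are assumptions made for the example.

```python
def bytes_to_sketch(bytes_nb: int, to_unit: str, bsize: int = 1024) -> float:
    """Hypothetical stand-in mirroring the documented behaviour of bytes_to."""
    # Each unit is one more power of the base size than the previous one.
    exponents = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    # An unsupported unit raises KeyError from the dict lookup, as documented.
    return float(bytes_nb) / (bsize ** exponents[to_unit])


print(bytes_to_sketch(2048, "k"))           # 2.0
print(bytes_to_sketch(5 * 1024 ** 2, "m"))  # 5.0
```

Passing bsize=1000 would give decimal (SI) units instead of binary ones.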
- impresso_commons.utils.utils.chunk(list, chunksize)
Yield successive chunks of size chunksize from list.
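A generator of this kind can be sketched in a few lines. The name chunk_sketch is hypothetical, and the sketch assumes a sliceable sequence; the last chunk is shorter when the length is not a multiple of chunksize.

```python
from typing import Iterator, Sequence


def chunk_sketch(items: Sequence, chunksize: int) -> Iterator[Sequence]:
    """Hypothetical stand-in: yield consecutive slices of at most chunksize items."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


print(list(chunk_sketch([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```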
- impresso_commons.utils.utils.get_pkg_resource(file_manager: ExitStack, path: str, package: str = 'impresso_commons') PosixPath
Return the resource at path in package, using a context manager.
Note
The context manager file_manager needs to be instantiated prior to calling this function and should be closed once the package resource is no longer of use.
- Parameters:
file_manager (contextlib.ExitStack) – Context manager.
path (str) – Path to the desired resource in given package.
package (str, optional) – Package name. Defaults to “impresso_commons”.
- Returns:
Path to desired managed resource.
- Return type:
pathlib.PosixPath
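The calling pattern described in the note (create the context manager first, close it when the resource is no longer needed) might look like the sketch below. The helper get_resource_sketch is a hypothetical stand-in built on the standard importlib.resources API (Python 3.9+), and the standard-library package json is used only so that the example runs without impresso_commons installed.

```python
from contextlib import ExitStack
from importlib.resources import as_file, files
from pathlib import Path


def get_resource_sketch(file_manager: ExitStack, path: str, package: str) -> Path:
    """Hypothetical stand-in: resolve `path` inside `package` under `file_manager`."""
    resource = files(package) / path
    # as_file() provides a real filesystem path; registering it on the stack
    # keeps that path valid until the stack is closed.
    return file_manager.enter_context(as_file(resource))


# The caller owns the ExitStack and closes it once the resource is no longer needed.
with ExitStack() as file_manager:
    resource_path = get_resource_sketch(file_manager, "__init__.py", "json")
    print(resource_path.exists())
```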
- impresso_commons.utils.utils.init_logger(level: int = 20, file: str | None = None) RootLogger
Initialises the root logger.
- Parameters:
level (int, optional) – desired level of logging. Defaults to logging.INFO.
file (str | None, optional) – path of the file to log to, if any. Defaults to None.
- Returns:
the initialised logger
- Return type:
logging.RootLogger
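Root-logger initialisation of this kind can be sketched with the standard logging module. The function name, the message format, and the choice of handler (a file handler when file is given, a console handler otherwise) are assumptions, not the library's confirmed behaviour. The default level of 20 in the signature is the numeric value of logging.INFO.

```python
import logging
from typing import Optional


def init_logger_sketch(level: int = logging.INFO, file: Optional[str] = None) -> logging.Logger:
    """Hypothetical stand-in: configure and return the root logger."""
    root_logger = logging.getLogger()
    root_logger.setLevel(level)
    # Assumption: write to the given file if there is one, else to the console.
    handler = logging.FileHandler(file) if file is not None else logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
    root_logger.addHandler(handler)
    return root_logger


logger = init_logger_sketch()
logger.info("logger ready")
```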
- impresso_commons.utils.utils.parse_json(filename)
- impresso_commons.utils.utils.validate_against_schema(json_to_validate: dict[str, Any], path_to_schema: str = 'schemas/json/versioning/manifest.schema.json') None
Validate a dict corresponding to a JSON against a provided JSON schema.
- Parameters:
json_to_validate (dict[str, Any]) – JSON data to validate against a schema.
path_to_schema (str, optional) – Path to the JSON schema to validate against. Defaults to “schemas/json/versioning/manifest.schema.json”.
- Raises:
e – The provided JSON could not be validated against the provided schema.
S3 Utils Functions
Reusable functions to read/write data from/to our S3 drive. Warning: two boto libraries (boto and boto3) are used, and both need to be kept until third-party library dependencies are resolved.
- impresso_commons.utils.s3.alternative_read_text(s3_key: str, s3_credentials: dict, line_by_line: bool = True) list[str] | str
Read from S3 a line-separated text file (e.g. *.jsonl.bz2).
Note
The reason for this function is a bug in dask.bag.read_text() which breaks on buckets having >= 1000 keys. It raises a FileNotFoundError.
- impresso_commons.utils.s3.fixed_s3fs_glob(path: str, boto3_bucket=None)
A custom glob function, taken from Benoit's impresso-pyimages package, as the s3fs one seems unable to list more than 1000 elements on the Switch S3.
- impresso_commons.utils.s3.get_boto3_bucket(bucket_name: str)
- impresso_commons.utils.s3.get_bucket(name, create=False, versioning=True)
Create a boto S3 connection and return the requested bucket.
A new bucket with the specified name can be created if it does not exist yet, and versioning can optionally be turned on for the newly created bucket.
>>> b = get_bucket('testb', create=False)
>>> b = get_bucket('testb', create=True)
>>> b = get_bucket('testb', create=True, versioning=False)
Note
This function is deprecated; please use get_or_create_bucket or get_boto3_bucket instead.
- Parameters:
name (string) – the bucket’s name
create (boolean) – creates the bucket if not yet existing
versioning (boolean) – whether the new bucket should be versioned
- Returns:
an s3 bucket
- Return type:
boto3.resources.factory.s3.Bucket
- impresso_commons.utils.s3.get_bucket_boto3(name, create=False, versioning=True)
Get a boto3 S3 resource and return the requested bucket.
A new bucket with the specified name can be created if it does not exist yet, and versioning can optionally be turned on for the newly created bucket.
>>> b = get_bucket_boto3('testb', create=False)
>>> b = get_bucket_boto3('testb', create=True)
>>> b = get_bucket_boto3('testb', create=True, versioning=False)
Note
This function is deprecated; please use get_or_create_bucket or get_boto3_bucket instead.
- Parameters:
name (string) – the bucket’s name
create (boolean) – creates the bucket if not yet existing
versioning (boolean) – whether the new bucket should be versioned
- Returns:
an s3 bucket
- Return type:
boto3.resources.factory.s3.Bucket
- impresso_commons.utils.s3.get_or_create_bucket(name, create=False, versioning=True)
Create a boto3 S3 connection and return the requested bucket.
A new bucket with the specified name can be created if it does not exist yet, and versioning can optionally be turned on for the newly created bucket.
>>> b = get_or_create_bucket('testb', create=False)
>>> b = get_or_create_bucket('testb', create=True)
>>> b = get_or_create_bucket('testb', create=True, versioning=False)
- Parameters:
name (string) – the bucket’s name
create (boolean) – creates the bucket if not yet existing
versioning (boolean) – whether the new bucket should be versioned
- Returns:
an s3 bucket
- Return type:
boto3.resources.factory.s3.Bucket
- impresso_commons.utils.s3.get_s3_client(host_url='https://os.zhdk.cloud.switch.ch/')
- impresso_commons.utils.s3.get_s3_connection(host='os.zhdk.cloud.switch.ch')
Create a boto connection to impresso’s S3 drive.
Assumes that two environment variables are set: SE_ACCESS_KEY and SE_SECRET_KEY.
Note
This function is deprecated, as it uses boto instead of boto3; please use get_s3_resource instead.
- Parameters:
host (string) – the S3 endpoint’s host name
- Return type:
boto3.resources.factory.s3.ServiceResource
- impresso_commons.utils.s3.get_s3_object_size(bucket_name, key)
Get the size of an object (key) in an S3 bucket.
- Parameters:
bucket_name (str) – The name of the S3 bucket.
key (str) – The key (object) whose size you want to retrieve.
- Returns:
The size of the object in bytes, or None if the object doesn’t exist.
- Return type:
int
- impresso_commons.utils.s3.get_s3_resource(host_url='https://os.zhdk.cloud.switch.ch/')
Get a boto3 resource object related to an S3 drive.
Assumes that two environment variables are set: SE_ACCESS_KEY and SE_SECRET_KEY.
- Parameters:
host_url (string) – the s3 endpoint’s URL
- Return type:
boto3.resources.factory.s3.ServiceResource
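The environment-variable requirement can be made explicit with a small helper that fails early when a credential is missing. The helper name s3_resource_kwargs is hypothetical and the sketch deliberately stops short of importing boto3; it only collects keyword arguments that boto3.resource accepts. The credential values set below are placeholders.

```python
import os

DEFAULT_HOST_URL = "https://os.zhdk.cloud.switch.ch/"


def s3_resource_kwargs(host_url: str = DEFAULT_HOST_URL) -> dict:
    """Hypothetical helper: collect the arguments for boto3.resource("s3", ...)."""
    try:
        access_key = os.environ["SE_ACCESS_KEY"]
        secret_key = os.environ["SE_SECRET_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from missing
    return {
        "endpoint_url": host_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


# Placeholder credentials so that the sketch runs; real values come from the environment.
os.environ.setdefault("SE_ACCESS_KEY", "example-access-key")
os.environ.setdefault("SE_SECRET_KEY", "example-secret-key")
print(s3_resource_kwargs()["endpoint_url"])

# With boto3 installed, the resource could then be created with:
#   boto3.resource("s3", **s3_resource_kwargs())
```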
- impresso_commons.utils.s3.get_s3_versions(bucket_name, key_name)
Get versioning information for a given key.
NB: it assumes a versioned bucket.
- Parameters:
bucket_name (string) – the bucket’s name
key_name (string) – the key’s name
- Returns:
for each version, the version id and the last modified date
- Return type:
a list of tuples, where tuple[0] is a string and tuple[1] a datetime instance.
- impresso_commons.utils.s3.get_s3_versions_client(client, bucket_name, key_name)
- impresso_commons.utils.s3.get_storage_options()
- impresso_commons.utils.s3.read_jsonlines(key_name, bucket_name)
Given an S3 key pointing to a jsonl.bz2 archive, extract and return its lines (one JSON document per line).
Usage example:
>>> lines = db.from_sequence(read_jsonlines(key_name, bucket_name))
>>> print(lines.count().compute())
>>> lines.map(json.loads).pluck('ft').take(10)
- Parameters:
bucket_name (str) – name of bucket
key_name (str) – name of key, without S3 prefix
- impresso_commons.utils.s3.readtext_jsonlines(key_name, bucket_name)
Given an S3 key pointing to a jsonl.bz2 archive, extract and return its lines (one JSON document per line) with limited textual information, leaving out OCR metadata (box, offsets). This can serve as the starting point for pure textual processing (NE, text-reuse, topics).
Usage example:
>>> lines = db.from_sequence(readtext_jsonlines(key_name, bucket_name))
>>> print(lines.count().compute())
>>> lines.map(json.loads).pluck('ft').take(10)
- Parameters:
bucket_name (str) – name of bucket
key_name (str) – name of key, without S3 prefix
- Returns:
JSON formatted str
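The "one JSON document per line" layout of a jsonl.bz2 archive can be illustrated locally, without S3 or dask. The sketch below uses only the standard library; the helper iter_jsonl_bz2 is hypothetical, and the sample documents (including their "id" field) are made up for the example.

```python
import bz2
import io
import json
from typing import Iterator


def iter_jsonl_bz2(raw: bytes) -> Iterator[str]:
    """Hypothetical helper: yield the non-empty lines of a bz2-compressed JSON Lines payload."""
    with bz2.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:
                yield line


# Build a tiny in-memory archive; the documents are sample data only.
docs = [{"id": "doc-1", "ft": "first text"}, {"id": "doc-2", "ft": "second text"}]
archive = bz2.compress("\n".join(json.dumps(d) for d in docs).encode("utf-8"))

lines = list(iter_jsonl_bz2(archive))
print(len(lines))                                  # 2
print([json.loads(line)["ft"] for line in lines])  # ['first text', 'second text']
```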
- impresso_commons.utils.s3.s3_get_articles(issue, bucket, workers=None)
Read a newspaper issue from S3 and return the articles it contains.
NB: Content items with type = “ad” (advertisement) are filtered out.
- Parameters:
issue (an instance of impresso_commons.path.IssueDir) – the newspaper issue
bucket (boto.s3.bucket.Bucket) – the input s3 bucket
workers – number of workers for the iter_bucket function. If None, will be the number of detected CPUs.
- Returns:
a list of articles (dictionaries)
- impresso_commons.utils.s3.s3_get_pages(issue_id, page_names, bucket)
Read in canonical text data for all pages in a given newspaper issue.
- Parameters:
issue_id (string) – the canonical issue id (e.g. “IMP-1990-03-15-a”)
page_names (list of strings) – a list of canonical page filenames (e.g. “IMP-1990-03-15-a-p0001.json”)
bucket (instance of boto.Bucket) – the s3 bucket where the pages to be read are stored
- Returns:
a dictionary with page filenames as keys, and JSON data as values.
- impresso_commons.utils.s3.upload(partition_name, newspaper_prefix=None, bucket_name=None)
- impresso_commons.utils.s3.upload_to_s3(local_path: str, path_within_bucket: str, bucket_name: str) bool
Upload a file to an S3 bucket.
- Parameters:
local_path (str) – The local file path to upload.
path_within_bucket (str) – The path within the bucket where the file will be uploaded.
bucket_name (str) – The name of the S3 bucket (without any partitions).
- Returns:
True if the upload is successful, False otherwise.
- Return type:
bool
Dask Utils Functions
Utilities that help prepare data for parallel or distributed computing with dask.
- Usage:
daskutils.py partition --config-file=<cf> --nb-partitions=<p> [--log-file=<f> --verbose]
- Options:
- --config-file=<cf>
json configuration dict specifying various arguments
- impresso_commons.utils.daskutils.create_even_partitions(bucket, config_newspapers, output_dir, local_fs=False, keep_full=False, nb_partition=500)
Convert yearly bz2 archives to even bz2 archives, i.e. partitions.
Enables efficient (distributed) processing, bypassing the size discrepancies of newspaper archives. N.B.: articles are shuffled in the resulting partitions. Warning: choose config_newspapers carefully, as it determines what ends up in the partitions and is loaded into memory.
- Parameters:
bucket – name of the bucket where the files to partition are
config_newspapers – json dict specifying the sources to consider (name(s) of newspaper(s) and year span(s))
output_dir – classic FS repository where to write the produced partitions
local_fs
keep_full – whether to keep the full content; if False, metadata is filtered out (i.e. only text is kept and coordinates are left out)
nb_partition – number of partitions
- Returns:
None
- impresso_commons.utils.daskutils.main(args)
- impresso_commons.utils.daskutils.partitioner(bag, path, nbpart)
Partition a bag into nbpart partitions and write each partition to a file.
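The idea of splitting a collection into a fixed number of near-even partitions and writing each one to its own file can be sketched without dask. Everything here is an assumption made for illustration: a plain Python list stands in for the dask bag, and the function name, the file naming scheme, and the uncompressed JSON Lines output are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_partitions(items: List[dict], path: str, nbpart: int) -> List[Path]:
    """Hypothetical stand-in: split items into nbpart near-even parts, one file each."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index in range(nbpart):
        # Round-robin slicing keeps partition sizes within one item of each other.
        part = items[index::nbpart]
        target = out_dir / f"{index}.jsonl"
        target.write_text("".join(json.dumps(item) + "\n" for item in part), encoding="utf-8")
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    files = write_partitions([{"n": i} for i in range(10)], tmp, nbpart=3)
    print([len(f.read_text(encoding="utf-8").splitlines()) for f in files])  # [4, 3, 3]
```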
Apache UIMA XMI Utils Functions
Utility functions to export data in Apache UIMA XMI format.
- impresso_commons.utils.uima.compute_image_links(ci: ContentItem, padding: int = 20, iiif_endpoint: str = 'https://dhlabsrv17.epfl.ch/iiif_impresso/', iiif_links: Dict[str, str] = None, pct: bool = False)
Compute image links for a content item.
- Parameters:
ci (ContentItem) – the content item.
padding (int) – Defaults to 20.
- impresso_commons.utils.uima.get_iiif_links(contentitems: List[ContentItem], canonical_bucket: str)
Retrieve from S3 the IIIF links for the canonical pages on which the input content items are found.
- impresso_commons.utils.uima.rebuilt2xmi(ci, output_dir, typesystem_path, iiif_mappings, pct_coordinates=False) str
Converts a rebuilt ContentItem into Apache UIMA/XMI format.
The resulting file will be named after the content item’s ID, adding the .xmi extension.
- Parameters:
ci (impresso_commons.classes.ContentItem) – the content item to be converted
output_dir (str) – the path to the output directory
typesystem_path (str) – TypeSystem file containing definitions of annotation layers.
Config File Loader
A class to load configuration files in JSON format, handling different task settings.
- class impresso_commons.utils.config_loader.Base
Bases:
object
Base class for initial loading and default checking methods
- check_bucket(string, attribute)
- check_params(config_dict, keys)
- classmethod from_json(json_file)
- to_dict()
- impresso_commons.utils.config_loader.main()
Proper generic configuration handling will follow.
As of now, here is a generic config:
{
"solr_server" : URL of the server,
"solr_core": name of the core,
"s3_host": S3 host,
"s3_bucket_rebuilt": s3 bucket,
"s3_bucket_partitions": s3 bucket,
"s3_bucket_processed": s3 bucket,
"key_batches": number of keys in a batch,
"number_partitions": number of partitions,
"newspapers" : { // newspaper config, to be detailed
"GDL": [1991,1998]
}
}
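Loading such a configuration and checking that the expected keys are present could look like the sketch below. It is not the Base class above: load_config_sketch and REQUIRED_KEYS are hypothetical, and every value except the "GDL" year span is a placeholder. Note that the `//` comment in the template has to be removed before the text is valid JSON.

```python
import json

# Keys taken from the generic config template above.
REQUIRED_KEYS = [
    "solr_server", "solr_core", "s3_host",
    "s3_bucket_rebuilt", "s3_bucket_partitions", "s3_bucket_processed",
    "key_batches", "number_partitions", "newspapers",
]


def load_config_sketch(text: str, keys=REQUIRED_KEYS) -> dict:
    """Hypothetical helper: parse a JSON config and fail if a key is missing."""
    config = json.loads(text)
    missing = [key for key in keys if key not in config]
    if missing:
        raise KeyError(f"missing configuration keys: {missing}")
    return config


# Placeholder values, for illustration only.
example = """
{
  "solr_server": "https://solr.example.org",
  "solr_core": "example_core",
  "s3_host": "s3.example.org",
  "s3_bucket_rebuilt": "example-rebuilt",
  "s3_bucket_partitions": "example-partitions",
  "s3_bucket_processed": "example-processed",
  "key_batches": 1000,
  "number_partitions": 500,
  "newspapers": {"GDL": [1991, 1998]}
}
"""

config = load_config_sketch(example)
print(config["newspapers"])  # {'GDL': [1991, 1998]}
```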