Digester

digesters.digester.Digester(*args, **kwargs)

Bases: ABC

The Digester class is an abstract base class designed to assist with digesting data from various computational chemistry and biology resources. This class provides a framework for processing, parsing, and validating data extracted from simulations, geometry optimizations, and other computational methods.

checks() classmethod

Perform basic checks to raise warnings or errors before digesting.

This method should be overridden in the child class to include specific checks required for the particular type of data being processed.

digest(ensemble_schema=None, digester_args=None, digester_kwargs=None, parallelize=False, max_workers=None) classmethod

Given the inputs in digester_args and digester_kwargs, digest all possible atomistic frames and populate an EnsembleSchema.

This method initializes an EnsembleSchema, calls the checks method to perform preliminary checks, prepares the inputs for digestion by calling the prepare_inputs_digester method, and iterates through each step of the digestion process to append frames to the EnsembleSchema.

This method keeps the entire EnsembleSchema in memory. If you have a large amount of data, consider using digest_chunks.

PARAMETER DESCRIPTION
ensemble_schema

The schema to which frames will be appended.

TYPE: EnsembleSchema | None DEFAULT: None

digester_args

Arguments to pass into prepare_inputs_digester.

TYPE: tuple[Any, ...] | None DEFAULT: None

digester_kwargs

Keyword arguments to pass into prepare_inputs_digester.

TYPE: dict[str, Any] | None DEFAULT: None

parallelize

Execute concurrently.

TYPE: bool DEFAULT: False

max_workers

Maximum number of workers for concurrent operation.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
EnsembleSchema

The updated schema with frames extracted from a digester.
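The order of operations described above (create the schema, run checks, prepare inputs, then append one frame at a time) can be sketched without the library. This is a standalone illustration, not the library's code: `ToyEnsemble`, `ToyDigester`, and the `values` input are hypothetical stand-ins, and a plain list of dictionaries stands in for an EnsembleSchema.

```python
from typing import Any, Iterator, Optional


class ToyEnsemble:
    """Hypothetical stand-in for EnsembleSchema: a plain list of frames."""

    def __init__(self) -> None:
        self.frames: list[dict[str, Any]] = []


class ToyDigester:
    """Standalone sketch of the digest flow; names here are assumptions."""

    @classmethod
    def checks(cls) -> None:
        # A child class would raise warnings or errors here.
        pass

    @classmethod
    def prepare_inputs_digester(cls, values: list[float]) -> dict[str, Any]:
        return {"values": values}

    @classmethod
    def gen_inputs_frame(cls, inputs: dict[str, Any]) -> Iterator[dict[str, Any]]:
        for value in inputs["values"]:
            yield {"value": value}

    @classmethod
    def digest(
        cls,
        ensemble: Optional[ToyEnsemble] = None,
        digester_args: Optional[tuple] = None,
        digester_kwargs: Optional[dict] = None,
    ) -> ToyEnsemble:
        # 1. Start from the given schema or create an empty one.
        ensemble = ensemble if ensemble is not None else ToyEnsemble()
        # 2. Run preliminary checks.
        cls.checks()
        # 3. Prepare the digester inputs from the positional and keyword arguments.
        inputs = cls.prepare_inputs_digester(
            *(digester_args or ()), **(digester_kwargs or {})
        )
        # 4. Append every frame; the whole ensemble stays in memory.
        for inputs_frame in cls.gen_inputs_frame(inputs):
            ensemble.frames.append(inputs_frame)
        return ensemble


ensemble = ToyDigester.digest(digester_args=([1.0, 2.0, 3.0],))
```

Because every frame is appended to the same object, memory use grows with the number of frames, which is the motivation for digest_chunks.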

digest_chunks(ensemble_schema=None, digester_args=None, digester_kwargs=None, chunk_size=100, parallelize=False, max_workers=None) classmethod

Same as digest, but instead of returning one complete EnsembleSchema, it yields an EnsembleSchema each time chunk_size frames have been processed.

PARAMETER DESCRIPTION
ensemble_schema

The schema to which frames will be appended.

TYPE: EnsembleSchema | None DEFAULT: None

digester_args

Arguments to pass into prepare_inputs_digester.

TYPE: tuple[Any, ...] | None DEFAULT: None

digester_kwargs

Keyword arguments to pass into prepare_inputs_digester.

TYPE: dict[str, Any] | None DEFAULT: None

chunk_size

Number of frames to process before yielding an EnsembleSchema.

TYPE: int DEFAULT: 100

parallelize

Execute concurrently.

TYPE: bool DEFAULT: False

max_workers

Maximum number of workers for concurrent operation.

TYPE: int | None DEFAULT: None
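The chunking idea can be illustrated with a small generator. This is a minimal sketch under stated assumptions: `chunked_frames` is a hypothetical helper, plain lists stand in for EnsembleSchema objects, and yielding a final partial chunk is an assumption that the text above does not confirm.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def chunked_frames(
    frames: Iterable[dict[str, Any]], chunk_size: int = 100
) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of at most chunk_size frames from any iterable of frames."""
    iterator = iter(frames)
    while True:
        # Pull up to chunk_size frames without materializing the rest.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# A lazy source of 250 hypothetical frames.
frames = ({"i_frame": i} for i in range(250))
chunk_lengths = [len(chunk) for chunk in chunked_frames(frames, chunk_size=100)]
```

Only one chunk is held in memory at a time, so each chunk can be written to disk or otherwise consumed before the next is built.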

digest_frame(inputs_frame, schema_map, cadence_eval='molecule') classmethod

Digest a single frame of input data into a MoleculeSchema (schemas.atomistic.MoleculeSchema).

This method processes a single frame of data by invoking static methods implemented in the child digester class. These static methods, each decorated with a SchemaUUID (digesters.ids.SchemaUUID), are responsible for processing specific parts of the frame input data and returning key-value pairs that correspond to fields in the MoleculeSchema.

PARAMETER DESCRIPTION
inputs_frame

The inputs for the frame digestion process. This dictionary should contain all necessary data for processing a single frame.

TYPE: dict[str, Any]

schema_map

A mapping of UUIDs to field keys, as returned by get_schema_map.

TYPE: dict[str, dict[str, str]]

cadence_eval

Cadence of properties to evaluate and digest.

TYPE: Literal['molecule', 'ensemble'] DEFAULT: 'molecule'

RETURNS DESCRIPTION
dict[str, Any]

Data parsed or computed for this frame. Keys are field keys and values are the data from this frame.

RAISES DESCRIPTION
AttributeError

If the static method corresponding to a field's UUID is not found in the class.

Exception

For any other exceptions that occur during the processing of the frame.

Notes
  • The method relies on metadata defined within the fields of MoleculeSchema to determine which static method to call for processing each field.
  • Each field in the MoleculeSchema should have metadata that includes a 'uuid' and optionally a 'cadence'. The 'cadence' should be set to 'molecule' to indicate that the field is processed per molecule.
  • Static methods in the child class should be decorated with @SchemaUUID to associate them with the corresponding fields in MoleculeSchema.
Example

Suppose inputs_frame contains data for atomic coordinates. The static method decorated with the matching UUID is called to process these coordinates, and the resulting values are assigned to the corresponding field in MoleculeSchema.
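The UUID-based dispatch described in this section can be sketched as follows. This is not the library's implementation: `digest_frame_sketch` and `ToyDigester` are hypothetical, the inner layout of schema_map (`field_key` and `cadence` entries) is an assumption, and passing the frame inputs to each static method as keyword arguments is also an assumption.

```python
from typing import Any


class ToyDigester:
    """Hypothetical child digester with one field-processing static method."""

    @staticmethod
    def coordinates(raw_coords: list[list[str]], **_: Any) -> list[list[float]]:
        # Convert raw string coordinates to floats.
        return [[float(x) for x in atom] for atom in raw_coords]


def digest_frame_sketch(
    digester: type,
    inputs_frame: dict[str, Any],
    schema_map: dict[str, dict[str, str]],
    uuid_map: dict[str, str],
    cadence_eval: str = "molecule",
) -> dict[str, Any]:
    """Call the method registered for each UUID and collect the results."""
    data: dict[str, Any] = {}
    for uuid, field_info in schema_map.items():
        # Skip fields evaluated at a different cadence (assumed layout).
        if field_info.get("cadence", "molecule") != cadence_eval:
            continue
        method_name = uuid_map.get(uuid)
        if method_name is None or not hasattr(digester, method_name):
            raise AttributeError(f"No method found for UUID {uuid}")
        data[field_info["field_key"]] = getattr(digester, method_name)(**inputs_frame)
    return data


COORD_UUID = "81c7cec9-beec-4126-b6d8-91bee28951d6"
schema_map = {
    COORD_UUID: {"field_key": "coordinates", "cadence": "molecule"},
    "hypothetical-ensemble-uuid": {"field_key": "temperature", "cadence": "ensemble"},
}
uuid_map = {COORD_UUID: "coordinates"}
inputs_frame = {"raw_coords": [["0.0", "0.0", "0.5"]]}
frame_data = digest_frame_sketch(ToyDigester, inputs_frame, schema_map, uuid_map)
```

With the default cadence of 'molecule', the ensemble-level entry is skipped and only the coordinates field is populated.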

gen_inputs_frame(inputs_digester) classmethod

Generate inputs for each frame starting from a specific frame.

PARAMETER DESCRIPTION
inputs_digester

The initial inputs for the digestion process.

TYPE: dict[str, Any]

YIELDS DESCRIPTION
dict[str, Any]

A generator yielding input dictionaries for each frame.
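One plausible way such a generator could be driven by get_inputs_frame and next_frame is sketched below. The function bodies, the `energies` and `i_frame` keys, and the stopping condition are all assumptions made for illustration; the documentation above does not state how the real generator detects the last frame.

```python
from typing import Any, Iterator


def get_inputs_frame(inputs_digester: dict[str, Any]) -> dict[str, Any]:
    # Build the keyword arguments for the current frame.
    i_frame = inputs_digester["i_frame"]
    return {"energy": inputs_digester["energies"][i_frame]}


def next_frame(inputs_digester: dict[str, Any]) -> dict[str, Any]:
    # Advance the (hypothetical) frame counter.
    inputs_digester["i_frame"] += 1
    return inputs_digester


def gen_inputs_frame(inputs_digester: dict[str, Any]) -> Iterator[dict[str, Any]]:
    # Assumed stopping condition: the counter runs past the available data.
    while inputs_digester["i_frame"] < len(inputs_digester["energies"]):
        yield get_inputs_frame(inputs_digester)
        inputs_digester = next_frame(inputs_digester)


frames = list(gen_inputs_frame({"energies": [-1.0, -1.5], "i_frame": 0}))
```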

get_inputs_frame(inputs_digester) abstractmethod classmethod

Build a dictionary of keyword arguments for the current frame specified in inputs_digester. This is called for every frame.

PARAMETER DESCRIPTION
inputs_digester

A dictionary of inputs for the digestion process.

TYPE: dict[str, Any]

RETURNS DESCRIPTION
dict[str, Any]

A dictionary of keyword arguments for the current frame.

get_uuid_map() classmethod

Update the function UUID map by inspecting the class methods decorated with @SchemaUUID (digesters.ids.SchemaUUID).

This method scans through all the methods in the class, identifies those decorated with @SchemaUUID, and constructs a dictionary mapping the UUIDs to the method names. This map is used to dynamically call methods based on their UUIDs during the data digestion process.

By using @SchemaUUID, each method that processes a part of the input data can be easily identified and called based on its UUID. This allows for a flexible and dynamic way to handle various data processing tasks, ensuring that each piece of data is processed by the appropriate method.

RETURNS DESCRIPTION

A dictionary mapping UUIDs to method names.

Example

If a method called coordinates is decorated with @SchemaUUID("81c7cec9-beec-4126-b6d8-91bee28951d6"), the returned dictionary will include an entry: {"81c7cec9-beec-4126-b6d8-91bee28951d6": "coordinates"}

Notes

This method only includes methods that

  • are callable,
  • do not start with __, and
  • have the __uuid__ attribute.
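The three conditions above translate into a short scan over the class attributes. In this minimal sketch, `schema_uuid` is a hypothetical stand-in for @SchemaUUID that only sets a `__uuid__` attribute; the real decorator may do more.

```python
from typing import Callable


def schema_uuid(uuid: str) -> Callable:
    """Hypothetical stand-in for @SchemaUUID: tag a function with a UUID."""

    def decorator(func: Callable) -> Callable:
        func.__uuid__ = uuid
        return func

    return decorator


class ToyDigester:
    @staticmethod
    @schema_uuid("81c7cec9-beec-4126-b6d8-91bee28951d6")
    def coordinates(raw_coords):
        return raw_coords

    @staticmethod
    def helper():
        # Not decorated, so it must not appear in the map.
        return None

    @classmethod
    def get_uuid_map(cls) -> dict[str, str]:
        uuid_map: dict[str, str] = {}
        for name in dir(cls):
            if name.startswith("__"):
                continue  # skip dunder attributes
            attr = getattr(cls, name)
            if callable(attr) and hasattr(attr, "__uuid__"):
                uuid_map[attr.__uuid__] = name
        return uuid_map


uuid_map = ToyDigester.get_uuid_map()
```

Here `uuid_map` contains a single entry that maps the UUID to "coordinates", matching the example above, while the undecorated `helper` is ignored.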

next_frame(inputs_digester) abstractmethod classmethod

Advance the digester inputs to the next frame in the data. This abstract method must be implemented in any child class as each data source may have a different way of advancing to the next frame.

PARAMETER DESCRIPTION
inputs_digester

A dictionary of inputs for the digestion process.

RETURNS DESCRIPTION
dict[str, Any]

The dictionary of inputs for the digestion process, advanced to the next frame.

prepare_inputs_digester(*args, **kwargs) abstractmethod classmethod

Prepare and return the inputs necessary to start the digestion process.

This abstract method must be implemented in any child class. It should return a dictionary of inputs that will be used by get_inputs_frame.

PARAMETER DESCRIPTION
*args

Variable length argument list.

TYPE: Any DEFAULT: ()

**kwargs

Arbitrary keyword arguments.

TYPE: Collection[Any] DEFAULT: {}

RETURNS DESCRIPTION
dict[str, Any]

A dictionary of inputs for the frame digesting process.
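To show how a child class might fill in the three abstract hooks (prepare_inputs_digester, get_inputs_frame, and next_frame), here is a self-contained sketch. `DigesterSketch` is a hypothetical base that only mirrors the documented method names; it is not the real Digester. `LineDigester`, the `text` and `skip_blank` parameters, and the `lines` and `i_frame` keys are likewise invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class DigesterSketch(ABC):
    """Hypothetical base mirroring the abstract hooks documented here."""

    @classmethod
    @abstractmethod
    def prepare_inputs_digester(cls, *args: Any, **kwargs: Any) -> dict[str, Any]:
        ...

    @classmethod
    @abstractmethod
    def get_inputs_frame(cls, inputs_digester: dict[str, Any]) -> dict[str, Any]:
        ...

    @classmethod
    @abstractmethod
    def next_frame(cls, inputs_digester: dict[str, Any]) -> dict[str, Any]:
        ...


class LineDigester(DigesterSketch):
    """Treats every non-blank line of a text as one frame."""

    @classmethod
    def prepare_inputs_digester(cls, text: str, skip_blank: bool = True) -> dict[str, Any]:
        lines = text.splitlines()
        if skip_blank:
            lines = [line for line in lines if line.strip()]
        # Everything later steps need goes into one dictionary.
        return {"lines": lines, "i_frame": 0}

    @classmethod
    def get_inputs_frame(cls, inputs_digester: dict[str, Any]) -> dict[str, Any]:
        return {"line": inputs_digester["lines"][inputs_digester["i_frame"]]}

    @classmethod
    def next_frame(cls, inputs_digester: dict[str, Any]) -> dict[str, Any]:
        inputs_digester["i_frame"] += 1
        return inputs_digester


inputs = LineDigester.prepare_inputs_digester("a\n\nb", skip_blank=True)
first = LineDigester.get_inputs_frame(inputs)
second = LineDigester.get_inputs_frame(LineDigester.next_frame(inputs))
```

Keeping all state in the returned dictionary, rather than on the class, fits the classmethod-based design: the same digester class can process several data sources without shared mutable state.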