fcollections.core


Files collections.

This module brings together the classic operations done on a collection of files (in other words, a dataset stored in multiple files). These operations are: layout (structure of the folders) minimal walk; file name: parsing, filtering and interpretation; file reading: data loading, combination and post-processing.

Functions

`compose`(func1, *func2)	Compose multiple functions that preprocess an xarray Dataset.
`group_metadata_from_netcdf`(nds)	Extract metadata from a netcdf dataset.
`walk`(node, visitor)	Recursive walk of a file system tree.

Classes

`FileNameField`(name[, default, description])
`FileNameFieldDatetime`(name, date_fmt[, ...])	Numpy datetime value.
`FileNameFieldDateDelta`(name, date_fmt, delta)	Numpy datetime value.
`FileNameFieldDateJulian`(name, reference[, ...])
`FileNameFieldEnum`(name, enum_cls[, ...])	Enum field for files selection.
`FileNameFieldFloat`(name[, default, description])	Float value.
`FileNameFieldInteger`(name[, default, ...])	Integer value.
`FileNameFieldString`(name[, default, description])
`FileNameFieldPeriod`(name, date_fmt[, ...])	Period value.
`FileNameFieldISODuration`(name[, default, ...])	ISO8601 duration codes field (PT1D, P1W, ...)
`FileNameConvention`(regex, fields[, ...])	Parse or generate filenames with a convention definition.
`IFilesReader`()	Interface for reading multiple files on a specific file system.
`OpenMfDataset`([xarray_options])	Xarray implementation of IFilesReader interface.
`FilesDatabase`(path[, fs, enable_layouts, ...])	Abstract database mapping.
`SubsetsUnmixer`(partition_keys, auto_pick_last)
`Deduplicator`(unique, auto_pick_last)
`RecordFilter`(fields, **references)	Utility class for filtering values.
`FileNameFieldDateJulianDelta`(name, delta, ...)	Datetime value given as a julian day.
`DownloadMixin`()
`CaseType`(*values)
`PeriodMixin`()
`GroupMetadata`(name, variables, subgroups, ...)	Metadata for a group of variables.
`VariableMetadata`(name, dtype, dimensions, ...)	Metadata of a variable.
`DiscreteTimesMixin`([sampling])
`ITemporalMixin`()
`IPredicate`()	Interface for defining a complex predicate.
`Layout`(conventions)	Implements a ILayout with a succession of conventions.
`ICodec`()	Coder-Decoder interface.
`ITester`()	Compare two objects of types U and T.
`ILayout`()	Information about a multiple Tree levels.
`INode`(name, info, level)	Representation of a file system path.
`FileNode`(name, info, level)	File node of a file system tree.
`DirNode`(name, info, fs, level[, follow_symlinks])	Directory node of a file system tree.
`IVisitor`()	Visitor processing an `INode`.
`LayoutVisitor`(layouts[, stat_fields, ...])	Visitor with node interpretation and branch exploration hints.
`NoLayoutVisitor`(convention, record_filter[, ...])	Visitor with file node interpretation only.
`LayoutMismatchHandling`(*values)	Possibilities when a folder or file node does not match any layout.
`StandardVisitor`()	Visitor for producing the equivalent of `fsspec.spec.AbstractFileSystem.walk()`.
`VisitResult`(explore_next, payload, ...)	Result of a visit.
`FileSystemMetadataCollector`(layouts, root_node)	Filtered discovery and aggregation of filesystem metadata.

Exceptions

`FileListingError`	Raised when an error occurs during files discovery and interpretation.
`NotExistingPathError`
`DecodingError`	Raised by a codec if a string cannot be properly decoded.
`VisitError`	Raised by the visitor during node visit.
`LayoutMismatchError`	Raised if none of the layouts match the actual file system structure.

class fcollections.core.CaseType(*values)[source]#

Bases: Enum

lower = 2#

upper = 1#

exception fcollections.core.DecodingError[source]#

Bases: Exception

Raised by a codec if a string cannot be properly decoded.

class fcollections.core.Deduplicator(unique: 'tuple[str, ...]', auto_pick_last: 'tuple[str, ...]' = <factory>)[source]#

Bases: object

auto_pick_last: tuple[str, ...]#

property keys: set[str]#

unique: tuple[str, ...]#

class fcollections.core.DirNode(name, info, fs: AbstractFileSystem, level: int, follow_symlinks: bool = False)[source]#

Bases: INode

Directory node of a file system tree.

Parameters:

name – Name of the node. Not to be confused with the full path that should be contained in the info parameter
info – Additional information. The entry name - representing the full path - is expected to be in this parameter. Other information will depend on the fsspec implementation
level – Nesting level of the current node with respect to the tree root
fs – File system hosting the node. Useful to list the children
follow_symlinks – If False, symbolic links will be marked as file nodes instead of directory nodes, and will not be explored

accept(visitor: LayoutVisitor) → VisitResult[source]#

Accept a visitor.

This method should trigger operations in the visitor. The visitor computes the desired result, and the node is responsible for emitting said result to the walk operation.

Returns:: The visit result

See also

walk: Walk operation handling the tree traversal

children() → Iterable[INode][source]#

List child nodes.

Returns:: The child nodes, either files or folders

class fcollections.core.DiscreteTimesMixin(sampling: np.timedelta64 | None = None)[source]#

Bases: ITemporalMixin

time_coverage(**filters) → Period | None[source]#

Find the time extent of the netcdf files.

Returns:: A Period representing the period covered by the data

time_holes(**filters) → Generator[Period, None, None][source]#

Find the holes in time coverage.

Returns:: A generator yielding Period representing holes in the data
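
The hole detection described above can be sketched with the standard library only. This is an illustration under stated assumptions, not the mixin's implementation: the function name `find_time_holes` is hypothetical, each timestamp is assumed to cover one sampling step, and a hole is assumed to start one step after the last covered time.

```python
from datetime import datetime, timedelta


def find_time_holes(times, sampling):
    """Yield (start, stop) pairs for gaps larger than the sampling step.

    Hypothetical helper: each time is assumed to cover [t, t + sampling).
    """
    ordered = sorted(times)
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > sampling:
            # The gap starts where the previous sample stops being valid
            yield (previous + sampling, current)


times = [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 5)]
holes = list(find_time_holes(times, timedelta(days=1)))
```

With daily sampling and a missing 3rd and 4th of January, the sketch reports a single hole between the 3rd and the 5th.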

class fcollections.core.DownloadMixin[source]#

Bases: ABC

download(files: list[str], local_path: str, force_download: bool = False)[source]#

Retrieve files from FTP to local path.

Parameters:

files (str) – list of file paths to copy locally
local_path (str) – local path to copy files to
force_download (boolean) – download the files even if they already exist locally (True), or skip the files that already exist locally (False)

Return type:

the list of downloaded files

abstract property fs: AbstractFileSystem#: The mixin relies on this attribute to build new functionalities.

exception fcollections.core.FileListingError[source]#

Bases: Exception

Raised when an error occurs during files discovery and interpretation.

class fcollections.core.FileNameConvention(regex: Pattern, fields: list[FileNameField], generation_string: str | None = None)[source]#

Bases: object

Parse or generate filenames with a convention definition.

The convention is expressed as both a regex and a simple string to handle both parsing and generation. The generation string can be omitted and set to None if the convention is only used to parse files.

fields: list[FileNameField]#: List of fields, each field name must correspond to a group in the regex pattern.

generate(**kwargs)[source]#

generation_string: str | None = None#

String that will be formatted with the input objects to generate a string.

The string can use the formatting language described in help(‘FORMATTING’). In addition, the formatting can be delegated to each field's encode method by specifying the field name fn spec. This allows handling more complex objects such as Period. For example with a FileNameFieldInteger and a FileNameFieldPeriod defined: ‘{cycle_number:>03d}_{period!f}’ -> ‘003_20230102_20240201’
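
The idea of pairing a regex (for parsing) with a format string (for generation) can be sketched without the library. Everything here is an assumption made for illustration: the pattern, the `data_` prefix, and the helper names are hypothetical. Standard `str.format` only accepts the `!r`, `!s` and `!a` conversions, so this sketch encodes the period explicitly before formatting instead of using `!f`.

```python
import re
from datetime import datetime

# Hypothetical convention: one named group per field
PATTERN = re.compile(
    r"data_(?P<cycle_number>\d{3})_(?P<start>\d{8})_(?P<stop>\d{8})\.nc"
)
GENERATION_STRING = "data_{cycle_number:>03d}_{period}.nc"


def parse(filename):
    """Return (cycle_number, (start, stop)) or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    start = datetime.strptime(match["start"], "%Y%m%d")
    stop = datetime.strptime(match["stop"], "%Y%m%d")
    return int(match["cycle_number"]), (start, stop)


def encode_period(period):
    """Encode a (start, stop) pair the way the pattern expects it."""
    return "_".join(date.strftime("%Y%m%d") for date in period)


def generate(cycle_number, period):
    """Build a file name from decoded values."""
    return GENERATION_STRING.format(
        cycle_number=cycle_number, period=encode_period(period)
    )


record = parse("data_003_20230102_20240201.nc")
name = generate(*record)
```

Parsing followed by generation round-trips the file name, which is a convenient sanity check when writing a new convention.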

get_field(name: str) → FileNameField[source]#

Retrieve a field from its name.

Only the first matching field is returned. It is assumed that the convention has fields with distinct names.

Parameters:: name – Name of the field to seek
Returns:: The requested FileNameField
Raises:: KeyError – In case the requested field has no match in the convention

match(filename: str) → Any[source]#

parse(match_object: Match) → tuple[source]#

regex: Pattern#: Pattern for filename matching.

class fcollections.core.FileNameField(name: str, default: T | None = None, description: str = '')[source]#

Bases: ICodec[T], ITester[U, T]

property description: str#

class fcollections.core.FileNameFieldDateDelta(name: str, date_fmt: str | list, delta: timedelta64, include_stop: bool = False, default: Period | None = None, description: str = '')[source]#

Bases: FileNameField, PeriodTester, PeriodDeltaCodec, DateTimeCodec

Numpy datetime value.

name#

name of the field

Type:: str

date_fmt#

date format

Type:: str

delta#

time delta

Type:: np.timedelta64

include_stop#

Whether the delta is included or not, defaults to False

Type:: bool

class fcollections.core.FileNameFieldDateJulian(name: str, reference: datetime64, default: Period | None = None, description: str = '', julian_day_format: str = 'days_hours')[source]#: Bases: FileNameField, DateTimeTester, JulianDayCodec

class fcollections.core.FileNameFieldDateJulianDelta(name: str, delta: timedelta64, reference: datetime64, include_stop: bool = False, default: Period | None = None, description: str = '', julian_day_format: str = 'days')[source]#

Bases: FileNameField, PeriodTester, PeriodDeltaCodec, JulianDayCodec

Datetime value given as a julian day.

name#

name of the field

Type:: str

delta#

time delta

Type:: np.timedelta64

reference#: Reference date for the given julian days

include_stop#

Whether the delta is included or not, defaults to False

Type:: bool

julian_day_format#: Whether the julian day is expected as ‘days’, ‘days_hours’ or ‘fractional’. For example 24000, 24000_06 or 24000.25
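
The three julian day notations can be decoded as in the sketch below. It is not the library's codec: the function name `decode_julian_day` and the 1950-01-01 reference date are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical reference date, used only for this illustration
REFERENCE = datetime(1950, 1, 1)


def decode_julian_day(text, reference, julian_day_format="days"):
    """Convert a julian day string to a datetime relative to a reference."""
    if julian_day_format == "days":
        return reference + timedelta(days=int(text))
    if julian_day_format == "days_hours":
        days, hours = text.split("_")
        return reference + timedelta(days=int(days), hours=int(hours))
    if julian_day_format == "fractional":
        return reference + timedelta(days=float(text))
    raise ValueError(f"Unknown julian day format: {julian_day_format!r}")


whole = decode_julian_day("24000", REFERENCE, "days")
with_hours = decode_julian_day("24000_06", REFERENCE, "days_hours")
fractional = decode_julian_day("24000.25", REFERENCE, "fractional")
```

In this sketch `24000_06` and `24000.25` decode to the same instant, six hours after the value decoded from `24000`.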

class fcollections.core.FileNameFieldDatetime(name: str, date_fmt: str | list, default: timedelta64 | None = None, description: str = '')[source]#

Bases: FileNameField, DateTimeTester, DateTimeCodec

Numpy datetime value.

name#

name of the field

Type:: str

date_fmt#

date format

Type:: str|List

class fcollections.core.FileNameFieldEnum(name: str, enum_cls: type[Enum], case_type_decoded: CaseType | None = None, case_type_encoded: CaseType | None = None, underscore_encoded: bool = True, default: type[Enum] | None = None, description: str = '')[source]#

Bases: FileNameField, EnumTester, EnumCodec

Enum field for files selection.

name#

name of the field

Type:: str

enum_cls#: enum class

choices() → list[str][source]#

class fcollections.core.FileNameFieldFloat(name: str, default: T | None = None, description: str = '')[source]#

Bases: FileNameField, FloatTester, FloatCodec

Float value.

name#

name of the field

Type:: str

class fcollections.core.FileNameFieldISODuration(name: str, default: T | None = None, description: str = '')[source]#

Bases: FileNameField, ISODurationCodec

ISO8601 duration codes field (PT1D, P1W, …)

sanitize(reference: str | ISODuration) → ISODuration[source]#

Cast to one of the types handled by this tester.

Parameters:: reference – The reference object to cast
Returns:: The input cast to the proper type

property test_description: str#: User-friendly description of the possible types for the reference.

property type: type[ISODuration]#: Type of the tested field.

class fcollections.core.FileNameFieldInteger(name: str, default: T | None = None, description: str = '')[source]#

Bases: FileNameField, IntegerTester, IntegerCodec

Integer value.

name#

name of the field

Type:: str

class fcollections.core.FileNameFieldPeriod(name: str, date_fmt: str, separator='_', default: Period | None = None, description: str = '')[source]#

Bases: FileNameField, PeriodTester, PeriodCodec

Period value.

name#

name of the field

Type:: str

date_fmt#

date format

Type:: str

separator#

dates separator. Default: ‘_’

Type:: str

class fcollections.core.FileNameFieldString(name: str, default: T | None = None, description: str = '')[source]#: Bases: FileNameField, StringTester, StringCodec

class fcollections.core.FileNode(name: str, info: dict[str, Any], level: int)[source]#

Bases: INode

File node of a file system tree.

accept(visitor: LayoutVisitor) → VisitResult[source]#

Accept a visitor.

This method should trigger operations in the visitor. The visitor computes the desired result, and the node is responsible for emitting said result to the walk operation.

Returns:: The visit result

See also

walk: Walk operation handling the tree traversal

children() → Iterator[INode][source]#

List child nodes.

Returns:: An empty list (files have no children)

class fcollections.core.FileSystemMetadataCollector(layouts: list[Layout], root_node: INode)[source]#

Bases: object

Filtered discovery and aggregation of filesystem metadata.

Notes

The aggregation has yet to be implemented.
Only files’ metadata can be collected in the current implementation

Parameters:

path – path of directory containing files
layouts – Succession of conventions describing how to interpret the folder and file nodes
root_node – Root node representing an explorable tree. Usually represent the parent directory on a file system

discover(predicates: tuple[Callable[[tuple[Any, ...]], bool], ...] = (), stat_fields: tuple[str] = (), enable_layouts: bool = True, **filters) → DataFrame[source]#

Parameters:

predicates – Complex predicates for filtering a file’s record
stat_fields – Name of the file metadata fields that should be returned in the record. The info that can be retrieved is dependent on the file system implementation. Check the filesystem ls method to get the available stat fields
enable_layouts – Set to True to use the layouts for directory names parsing. This will speed up the listing, but may raise an error if some directories do not match the declared layouts. Set to False to scan the entire directory and parse the files only
**filters – filters for files/folders selection over the fields declared in the layouts. Each field can accept a different filter value depending on the underlying FileNameField subclass

Yields:

The record matching the files

Raises:

KeyError – In case some of the requested stat_fields are not available for the current file system
LayoutMismatchError – In case enable_layouts is True and a mismatch between the layouts and the actual files is detected
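
The interplay between keyword filters and predicates can be pictured with a small standalone sketch. It does not use the collector: the field names, the `select` function and the equality-only filters are assumptions made for the example (the actual fields accept richer filter values depending on their type).

```python
# Hypothetical record layout: one tuple per file
FIELDS = ("cycle_number", "subset", "filename")


def select(records, predicates=(), **filters):
    """Yield the records passing every keyword filter and every predicate."""
    for record in records:
        named = dict(zip(FIELDS, record))
        # Keyword filters compare a single field to a reference value
        if any(named[key] != value for key, value in filters.items()):
            continue
        # Predicates receive the whole record and can combine fields
        if all(predicate(record) for predicate in predicates):
            yield record


records = [
    (1, "Expert", "a.nc"),
    (2, "Expert", "b.nc"),
    (2, "Basic", "c.nc"),
]
experts = list(select(records, subset="Expert"))
late_experts = list(
    select(records, predicates=(lambda r: r[0] > 1,), subset="Expert")
)
```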

to_dataframe(predicates: tuple[Callable[[tuple[Any, ...]], bool], ...] = (), stat_fields: tuple[str] = (), enable_layouts: bool = True, **filters) → DataFrame[source]#

Parameters:

predicates – Complex predicates for filtering a file or folder’s record
stat_fields – Name of the file or folder metadata fields that should be returned in the record. The info that can be retrieved is dependent on the file system implementation. Check the filesystem ls method to get the available stat fields
enable_layouts – Set to True to use the layouts for directory names parsing. This will speed up the listing, but may raise an error if some directories do not match the declared layouts. Set to False to scan the entire directory and parse the files only
**filters – filters for files/folders selection over the fields declared in the file name convention and layout (optional). Each field can accept a different filter value depending on the underlying FileNameField subclass

Yields:

A pandas DataFrame containing all selected filenames, plus one column per requested field

Raises:

KeyError – In case some of the requested stat_fields are not available for the current file system
VisitError – In case enable_layouts is True and a mismatch between the layouts and the actual files is detected

class fcollections.core.FilesDatabase(path: str, fs: AbstractFileSystem = LocalFileSystem(), enable_layouts: bool = True, follow_symlinks: bool = False)[source]#

Bases: object

Abstract database mapping.

Parameters:

path – path to a directory containing NetCDF files
fs – File system hosting the files. Can be used to access local or remote (S3, FTP, …) file systems. Underlying readers may not be compatible with all file systems implementations
enable_layouts – Set to True to use the layouts for directory names parsing. This will speed up the listing, but may raise an error if some directory does not match the pre-configured layouts. Set to False to scan the entire directory and parse the files only
follow_symlinks – If False, symbolic links will be marked as file nodes instead of directory nodes, and will not be explored

discoverer#: File discoverer. Walks in a folder (can be on a remote file system), parses the listed files and filters them.

deduplicator: Deduplicator | None = None#: Deduplicate the file metadata table of a unique subset (after unmixing).

layouts: list[Layout] | None = None#

Semantic describing how the files are organized.

Useful to extract information and have an efficient file system scanning. The pre-configured layouts can mismatch the current files organization, in which case users can build their own or set enable_layouts to False.

list_files(*args, **kwargs)#

map(*args, **kwargs)#

metadata_injection: dict[str, tuple[str, ...]] | None = None#

Configures how metadata from the files listing can be injected in a dataset returned from the read.

The keys are the columns of the file metadata table, the values are tuples of dimensions for insertion.

property parser: FileNameConvention#

Amongst all name conventions, get the one managing the files.

Returns:: Files name parser

predicate_classes: list[type[IPredicate]] | None = None#

List of predicates that are built at each query.

The predicates intercept the input parameters to build a custom record predicate. Usually, it is a complex test involving auxiliary data, such as ground track footprints or half_orbit/periods tables.

query(*args, **kwargs)#

reader: IFilesReader | None = None#: Files reader.

sort_keys: list[str] | str | None = None#

Keys that specify the fields used to sort the records extracted from the filenames.

Useful to order the files prior to reading them.

unmixer: SubsetsUnmixer | None = None#: Specify how to interpret the file metadata table to unmix subsets.

variables_info(*args, **kwargs)#

class fcollections.core.GroupMetadata(name: str, variables: list[VariableMetadata], subgroups: list[GroupMetadata], attributes: dict[str, str], dimensions: dict[str, int])[source]#

Bases: object

Metadata for a group of variables.

A dataset may be organized as a simple set of variables, or adopt a more complex tree-like structure. This dataclass reflects the most complex case where we can have an indefinite number of nesting levels. The simplest case (no concept of groups) is naturally well-contained within this model.

apply(callable: tp.Callable[[GroupMetadata]])[source]#

Apply a callable to the metadata tree.

Useful to modify the tree in place.

Parameters:: callable – The function to apply to each node

attributes: dict[str, str]#: Dictionary of attributes specific to the group.

dimensions: dict[str, int]#: Name and size of the dimensions contained in the group.

flatten() → list[dict[str, Any]][source]#

Flatten the tree structure to a dictionary.

Group names will be converted to absolute paths with ‘/’ separator.

Returns:: A dictionary containing all the groups, with keys containing paths linked to the tree structure
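
Flattening a group tree into absolute paths can be sketched as follows. The `Group` dataclass is a simplified stand-in, not the library's GroupMetadata, and returning a single dictionary keyed by path is an assumption of this example.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Simplified, hypothetical stand-in for a metadata group."""

    name: str
    variables: list = field(default_factory=list)
    subgroups: list = field(default_factory=list)


def flatten(group, parent=""):
    """Map each group's absolute path ('/' separator) to its variables."""
    if group.name == "/":
        path = "/"
    else:
        path = parent.rstrip("/") + "/" + group.name
    flat = {path: group.variables}
    for subgroup in group.subgroups:
        flat.update(flatten(subgroup, path))
    return flat


root = Group("/", ["time"], [Group("left", ["ssh"], [Group("inner", ["x"])])])
flat = flatten(root)
```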

name: str#: Name of the group (can be set to ‘/’ when no nesting is needed).

nodes(path: str) → list[GroupMetadata][source]#

Walk the metadata tree and retrieve the nodes along a given path.

Parameters:: path – Absolute path for the node to find. The path separator is ‘/’. For example, a path [root, first_level, second_level] can be given as root/first_level/second_level or /root/first_level/second_level (the prepending ‘/’ will be stripped)
Returns:: List of nodes that are part of the path, starting with the root node and ending with the last node of the path
Raises:: ValueError – In case nodes are missing in the path

subgroups: list[GroupMetadata]#: Nested groups.

variables: list[VariableMetadata]#: List of variables contained in the group.

class fcollections.core.ICodec[source]#

Bases: ABC, Generic[T]

Coder-Decoder interface.

A codec defines how to encode/decode strings to/from a given T class object.

abstractmethod decode(input_string: str) → T[source]#

Decode an input string and generate a Generic[T] object.

Parameters:: input_string – The input string
Returns:: The decoded Generic[T] object
Raises:: DecodingError – If the input string decoding fails

abstractmethod encode(data: T) → str[source]#

Encode a Generic[T] object into a string.

Parameters:: data – The input Generic[T] object
Returns:: The encoded string
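
A minimal sketch of the coder-decoder idea, written independently of the library: the `Codec` base class, the local `DecodingError` and the `ZeroPaddedIntCodec` implementation are all hypothetical names that mirror the interface described above.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class DecodingError(Exception):
    """Local stand-in for the exception raised on decoding failure."""


class Codec(ABC, Generic[T]):
    """Hypothetical interface: string <-> object conversion."""

    @abstractmethod
    def decode(self, input_string: str) -> T: ...

    @abstractmethod
    def encode(self, data: T) -> str: ...


class ZeroPaddedIntCodec(Codec[int]):
    """Integers written with a fixed number of digits, e.g. '003'."""

    def __init__(self, width: int):
        self.width = width

    def decode(self, input_string: str) -> int:
        try:
            return int(input_string)
        except ValueError as exc:
            raise DecodingError(f"Not an integer: {input_string!r}") from exc

    def encode(self, data: int) -> str:
        return f"{data:0{self.width}d}"


codec = ZeroPaddedIntCodec(width=3)
```

Wrapping the built-in `ValueError` in a dedicated exception lets callers distinguish a malformed file name from an unrelated programming error.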

class fcollections.core.IFilesReader[source]#

Bases: ABC

Interface for reading multiple files on a specific file system.

Implementations of this interface can be called to peek at the metadata of a dataset, or to read a selection of variables.

abstractmethod metadata(file: str, fs: AbstractFileSystem = fs_loc.LocalFileSystem()) → GroupMetadata[source]#

Load the metadata of the given file.

Useful to get information about the structure of the dataset, and which variables, dimensions and coordinates are available for reading.

Parameters:

file – File from which the metadata is read
fs – File system hosting the file

Returns:

A GroupMetadata containing the variables, dimensions, attributes and subgroups

abstractmethod read(files: list[str] | list[list[str]], selected_variables: list[str] | None = None, fs: AbstractFileSystem = fs_loc.LocalFileSystem(), **kwargs: Any) → Dataset[source]#

Read a list of files.

Parameters:

files – List of the files to read
selected_variables – Variables that need to be read. Set to None to read everything
fs – File system hosting the files

Returns:

An xarray dataset containing the selected variables

class fcollections.core.ILayout[source]#

Bases: object

Information about multiple tree levels.

Given a Tree (ex a filesystem) with a structure of N homogeneous levels, the layout will associate each level with a FileNameConvention to extract useful information. This information can then be leveraged by building filters to speed up the tree visitation.

For example, let’s consider a set of altimetry data files, organized in pre-defined folders: v1/Expert/cycle_001, v1/Expert/cycle_002, v2/Basic/cycle_001, … The first level contains information about the version, the second level about the subset, and the last level about the cycle number. The layout will declare three FileNameConvention to ‘know’ about the tree structure. Then, filters - for example subset=’Expert’ - can be set to select only a subpart of the tree, greatly improving the visitation performance.
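
The altimetry example can be turned into a small standalone sketch. It only illustrates the principle of one convention per level plus filters. The regexes, the `LEVELS` table and the helper names are assumptions, and a real walk would prune branches while descending rather than testing complete paths.

```python
import re

# Hypothetical conventions: one (regex, converters) pair per tree level
LEVELS = [
    (re.compile(r"v(?P<version>\d+)"), {"version": int}),
    (re.compile(r"(?P<subset>\w+)"), {"subset": str}),
    (re.compile(r"cycle_(?P<cycle_number>\d+)"), {"cycle_number": int}),
]


def level_matches(level, node, filters):
    """Check one path part against the convention and filters of its level."""
    regex, converters = LEVELS[level]
    match = regex.fullmatch(node)
    if match is None:
        return False
    record = {key: converters[key](value) for key, value in match.groupdict().items()}
    # Filters on fields that belong to other levels are ignored here
    return all(record[key] == value for key, value in filters.items() if key in record)


def select(paths, filters):
    """Keep the paths whose every level passes the filters."""
    return [
        path
        for path in paths
        if all(level_matches(i, part, filters) for i, part in enumerate(path.split("/")))
    ]


paths = ["v1/Expert/cycle_001", "v1/Expert/cycle_002", "v2/Basic/cycle_001"]
experts = select(paths, {"subset": "Expert"})
first_cycles = select(paths, {"cycle_number": 1})
```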

abstractmethod generate(root: str, **fields: Any) → str[source]#

Generate a path from the fields.

Parameters:

root – The root path
fields – key/values for interpolating the conventions

Returns:

A path

Raises:

ValueError – In case one of the fields required to generate the path is missing
ValueError – In case one of the fields required to generate the path has an improper value

abstract property names: set[str]#: Names of the supported filters.

abstractmethod set_filters(**references: Any)[source]#

Set filters used to check if a path complies with the layout.

Parameters:: **references – Key/values matching at least one of the underlying conventions

abstractmethod test(level: int, node: str) → bool[source]#

Checks if a path part matches the current filters.

Parameters:

node – Path part that needs to be checked
level – Level of the current path part among the layout conventions

Returns:

True if the path part is selected with the current filters, False otherwise

class fcollections.core.INode(name: str, info: dict[str, Any], level: int)[source]#

Bases: ABC

Representation of a file system path.

Parameters:

name – Name of the node. Not to be confused with the full path that should be contained in the info parameter
info – Additional information. name - representing the full path - is expected to be in this parameter. Other information will depend on the fsspec implementations
level – Nesting level of the current node with respect to the tree root

abstractmethod accept(visitor: LayoutVisitor) → VisitResult[source]#

Accept a visitor.

This method should trigger operations in the visitor. The visitor computes the desired result, and the node is responsible for emitting said result to the walk operation.

Returns:: The visit result

See also

walk: Walk operation handling the tree traversal

abstractmethod children() → Iterator[INode][source]#

List child nodes.

Returns:: The child nodes, either files or folders

class fcollections.core.IPredicate[source]#

Bases: ABC

Interface for defining a complex predicate.

This predicate will be used to filter records from file names listing and parsing.

indexes#: Attributes

*args: Any input that will be used to create the predicate

abstractmethod classmethod parameters() → tuple[str, ...][source]#: Initialization parameters name for the class.

abstractmethod classmethod record_fields() → tuple[str, ...][source]#: Record fields needed by the predicate.

class fcollections.core.ITemporalMixin[source]#

Bases: ABC

abstractmethod list_files(*args, **kwargs) → pda_t.DataFrame[source]#: The mixin relies on this method to build new functionalities.

abstractmethod time_coverage(**filters) → Period | None[source]#

Find the time extent of the netcdf files.

Returns:: A Period representing the period covered by the data

abstractmethod time_holes(**filters)[source]#

Find the holes in time coverage.

Returns:: A generator yielding Period representing holes in the data

class fcollections.core.ITester[source]#

Bases: ABC, Generic[U, T]

Compare two objects of types U and T.

This interface can be used to define filters that need to compare objects with different but close types. For example, an integer with another integer or a list of integers.

In addition to the testing functionality, this interface also provides a way to cast an object to one of the expected U types. This is useful for sanitizing user inputs that are in the simplest possible types. One such example is the automatic building of a np.datetime64 from a string given by the user (‘2024-01-01’).
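
The sanitize-then-test workflow can be sketched with plain integers. The `IntegerTester` class below is a hypothetical standalone example, not the library's tester: it accepts a reference given as an integer, a string, or an iterable of either.

```python
class IntegerTester:
    """Hypothetical tester: the reference is an int or a list of ints."""

    def sanitize(self, reference):
        """Cast user input to an int or a list of ints."""
        if isinstance(reference, (str, int)):
            return int(reference)
        return [int(item) for item in reference]

    def test(self, reference, tested):
        """Return True if the tested int equals or belongs to the reference."""
        if isinstance(reference, list):
            return tested in reference
        return tested == reference


tester = IntegerTester()
single = tester.sanitize("3")
several = tester.sanitize(["1", "2"])
```

Sanitizing once, before the file listing starts, keeps the per-file test cheap.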

sanitize(reference: Any) → U[source]#

Cast to one of the types handled by this tester.

Parameters:: reference – The reference object to cast
Returns:: The input cast to the proper type

test(reference: U, tested: T) → bool[source]#

Compare two objects of similar types.

Parameters:

reference – The reference object
tested – The tested object

Returns:

True if the test is successful, False otherwise

abstract property test_description: str#: User-friendly description of the possible types for the reference.

abstract property type: type[T]#: Type of the tested field.

property type_name: str#: Type name of the tested field for signature parameters.

class fcollections.core.IVisitor[source]#

Bases: ABC

Visitor processing an INode.

Visitors interpret a node and return information from it. It is up to the implementation to define which information it can get from the node. Some implementations will only return the node path, others will try to interpret it using semantic definitions.

An important characteristic of the visitor is its ability to advance from a previous visit result. This gives flexibility to implement specific states during the tree traversal.

Additional metadata about the visit is also returned by the visitor. This information should be used for tree traversal and visitor advancement only, and not returned by the walk operation.
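
The visitor mechanics can be sketched on an in-memory tree. This is an illustration, not the library's walk: the nested dictionary standing in for the file system, the `Result` dataclass and the `PathVisitor` class are hypothetical, and the visitor methods take plain names and paths instead of node objects.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical visit result: traversal hint plus payload."""

    explore_next: bool
    payload: object


class PathVisitor:
    """Collect file paths and skip hidden directories."""

    def visit_dir(self, name, path):
        return Result(explore_next=not name.startswith("."), payload=None)

    def visit_file(self, name, path):
        return Result(explore_next=False, payload=path)

    def advance(self, result):
        # Stateless visitor: returning the same instance is enough
        return self


def walk(tree, visitor, path=""):
    """Recursively visit a nested dict (dict = directory, None = file)."""
    for name, child in tree.items():
        child_path = f"{path}/{name}"
        if isinstance(child, dict):
            result = visitor.visit_dir(name, child_path)
            if result.explore_next:
                yield from walk(child, visitor.advance(result), child_path)
        else:
            result = visitor.visit_file(name, child_path)
            if result.payload is not None:
                yield result.payload


tree = {"a": {"x.nc": None, ".cache": {"y.nc": None}}, "b.nc": None}
found = list(walk(tree, PathVisitor()))
```

Only the payload leaves the walk. The `explore_next` hint stays internal to the traversal, in line with the description above.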

abstractmethod advance(result: VisitResult) → IVisitor[source]#

Advance the visitor.

The advancement can either return a reference or a copy of the visitor. If a per-branch state is needed, it is advised to return a copy.

Parameters:: result – Previous result of a visit. Originally intended to be the visit result of the parent node.
Returns:: The current visitor or a copy with a modified state

abstractmethod visit_dir(dir_node: DirNode) → VisitResult[source]#

Visits a directory node.

Parameters:: dir_node – The directory node to visit
Returns:: Node information and visit metadata.

abstractmethod visit_file(file_node: DirNode) → VisitResult[source]#

Visits a file node.

Parameters:: file_node – The file node to visit
Returns:: Node information and visit metadata.

class fcollections.core.Layout(conventions: list[FileNameConvention])[source]#

Bases: ILayout

Implements a ILayout with a succession of conventions.

Parameters:: conventions – List of conventions, with the first element matching the tree root, and the last element the last level before the leaves

See also

fcollections.core.FileNameConvention: the equivalent for the tree leaves (file names)

generate(root: str, **fields: Any) → str[source]#

Generate a path from the fields.

Parameters:

root – The root path
fields – key/values for interpolating the conventions

Returns:

A path

Raises:

ValueError – In case one of the fields required to generate the path is missing
ValueError – In case one of the fields required to generate the path has an improper value

property names: set[str]#: Names of the supported filters.

parse_node(level: int, node: str) → tuple[Any, ...][source]#

Interprets a node name.

Parameters:

level – Depth in the layout. Depth in the layout is the depth of the node with respect to its root minus 1. There is no semantic for the root node, which explains this discrepancy between layout depth and tree depth
node – Node name (not its full path)

Returns:

Structure information about the node

set_filters(**references: Any)[source]#

Set filters used to check if a path complies with the layout.

Parameters:: **references – Key/values matching at least one of the underlying conventions

test_record(level: int, record: tuple[Any, ...]) → bool[source]#

Checks if the node information matches the filters.

The test will look for filters at the considered layout depth, and apply them on the record.

Parameters:

level – Depth in the layout. Depth in the layout is the depth of the node with respect to its root minus 1. There is no semantic for the root node, which explains this discrepancy between layout depth and tree depth
record – Interpreted node information

Returns:

True if the node matches the filters, false otherwise

exception fcollections.core.LayoutMismatchError[source]#

Bases: VisitError

Raised if none of the layouts match the actual file system structure.

class fcollections.core.LayoutMismatchHandling(*values)[source]#

Bases: Enum

Possibilities when a folder or file node does not match any layout.

IGNORE = 3#: Ignore the mismatch.

RAISE = 1#: Raise an exception.

WARN = 2#: Warn the user.
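
How the three policies might be dispatched can be sketched as below. The enum values mirror the ones listed above, but the `MismatchHandling`, `MismatchError` and `handle_mismatch` names are hypothetical and the logic is an assumption, not the visitor's code.

```python
import warnings
from enum import Enum


class MismatchHandling(Enum):
    """Local copy of the three policies, for illustration only."""

    RAISE = 1
    WARN = 2
    IGNORE = 3


class MismatchError(Exception):
    """Local stand-in for the error raised on a layout mismatch."""


def handle_mismatch(node_name, handling):
    """Apply a policy to a mismatching node; return False to stop exploring."""
    message = f"{node_name!r} does not match any layout"
    if handling is MismatchHandling.RAISE:
        raise MismatchError(message)
    if handling is MismatchHandling.WARN:
        warnings.warn(message, UserWarning)
    # WARN and IGNORE both skip the node without interrupting the walk
    return False
```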

class fcollections.core.LayoutVisitor(layouts: list[Layout], stat_fields: Iterable[str] = tuple(), on_mismatch_directory: LayoutMismatchHandling = LayoutMismatchHandling.RAISE, on_mismatch_file: LayoutMismatchHandling = LayoutMismatchHandling.IGNORE)[source]#

Bases: IVisitor

Visitor with node interpretation and branch exploration hints.

The layouts will try to interpret a node and get a record of structured information. Layouts also include filters that are applied to give a hint about tree exploration: if all layouts exclude the current node, exploration should not continue.

Parameters:

layouts – Semantic definitions for interpreting and testing node meanings
stat_fields – List of node metadata to add to the record
on_mismatch_directory – Behavior on mismatch for directories
on_mismatch_file – Behavior on mismatch for files

advance(result: VisitResult) → LayoutVisitor[source]#

Advance the visitor.

The advancement can either return a reference or a copy of the visitor. If a per-branch state is needed, it is advised to return a copy.

Parameters:: result – Previous result of a visit. Originally intended to be the visit result of the parent node.
Returns:: The current visitor or a copy with a modified state

visit_dir(dir_node: DirNode) → VisitResult[source]#

Visits a directory node.

The directory node path is parsed into a structured node. If none of the layouts is able to parse the node, it means we are in uncharted territory: the tree traversal hint in the visit result will state that we should not continue exploring.

In addition, layout filters are applied to the node information. If all layouts exclude the node, it means no nodes of interest are in this branch: we want to terminate the current branch exploration as soon as possible to speed up the walk operation.

Multiple layouts means multiple semantics are possible. This is the case in a heterogeneous folder. When exploring a branch, some layouts may not match the branch semantic. These are pruned as soon as possible, but only for the current branch.

Warns:: UserWarning – In case the dir_node does not match any configured layout and on_mismatch_directory is set to WARN
Raises:: LayoutMismatchError – In case the dir_node does not match any configured layout and on_mismatch_directory is set to RAISE
Returns:: Node information and visit metadata. The visit metadata includes a tree traversal hint for further exploration, and the surviving layouts that match the current branch
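The pruning of layouts per branch can be sketched as follows. This is a simplified illustration under stated assumptions: each hypothetical layout is reduced to one regular expression per directory level, and `LAYOUTS` and this standalone `visit_dir` function are invented for the example rather than taken from the library.

```python
import re

# Hypothetical layouts: one regular expression per directory level
LAYOUTS = {
    "by_year": [r"\d{4}", r"\d{2}"],
    "by_mission": [r"[a-z]+", r"\d{4}"],
}


def visit_dir(name: str, depth: int, candidates: list[str]) -> tuple[bool, list[str]]:
    """Return (explore_next, surviving layout names) for one directory."""
    surviving = [
        key for key in candidates
        if depth < len(LAYOUTS[key]) and re.fullmatch(LAYOUTS[key][depth], name)
    ]
    # Exploration continues only while at least one layout still matches
    return bool(surviving), surviving


explore, survivors = visit_dir("2023", 0, list(LAYOUTS))
```

Here only `by_year` survives the first level, so deeper directories of that branch are tested against a single layout, while other branches still start from the full list.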

visit_file(file_node: FileNode) → VisitResult[source]#

Visits a file node.

The file node is interpreted to generate a record of structured information. The content of this record depends on the layouts definition. If the interpretation fails, the visit result will not include any information about the node.

Layout filters are also applied to the node record. If all layouts exclude the node, the visit result will not include any information about the node.

Raises:: KeyError – If the requested stat_fields keys are unknown for the given fsspec implementation
Returns:: Node information and visit metadata. For file node, no further exploration should be needed. In this case, surviving layouts are not relevant and will not be included in the visit result.

class fcollections.core.NoLayoutVisitor(convention: FileNameConvention, record_filter: RecordFilter, stat_fields: Iterable[str] = tuple())[source]#

Bases: IVisitor

Visitor with file node interpretation only.

The given convention will interpret the file nodes, the folders are not interpreted.

Parameters:

convention – Semantic definitions for interpreting a file node
record_filter – Tester for the file node information
stat_fields – List of node metadata to add to the record

advance(result: VisitResult) → IVisitor[source]#

Advance the visitor.

The advancement can either return a reference or a copy of the visitor. If a per-branch state is needed, it is advised to return a copy.

Parameters:: result – Previous result of a visit. Originally intended to be the visit result of the parent node.
Returns:: The current visitor or a copy with a modified state

visit_dir(dir_node: DirNode) → VisitResult[source]#

Visits a directory node.

Transparent visit of a directory node. The visit will not return any information about the node. The metadata will always hint at continuing the branch exploration.

Parameters:: dir_node – The directory node to visit
Returns:: Node information and visit metadata.

visit_file(file_node: DirNode) → VisitResult[source]#

Visits a file node.

Parameters:: file_node – The file node to visit
Returns:: Node information and visit metadata.

exception fcollections.core.NotExistingPathError[source]#: Bases: Exception

class fcollections.core.OpenMfDataset(xarray_options: dict[str, str] | None = None)[source]#

Bases: IFilesReader

Xarray implementation of IFilesReader interface.

This implementation is a simple wrapper around the xarray.open_mfdataset function. The function's parameters are expected to be given to the reader as a dictionary, except for the preprocess argument, which should be given to the read method.

Parameters:: xarray_options – xarray.open_mfdataset reading options. Set to None to keep xarray defaults

See also

xarray.open_mfdataset: The wrapped reading function

metadata(file: str, fs: AbstractFileSystem = fs_loc.LocalFileSystem()) → GroupMetadata[source]#

Load the metadata of the given file.

Useful to get information about the structure of the dataset, and which variables, dimensions and coordinates are available for reading.

Parameters:

file – File from which the metadata is read
fs – File system hosting the file

Returns:

A GroupMetadata containing the variables, dimensions, attributes and subgroups

read(files: list[str] | list[list[str]], selected_variables: list[str] | None = None, fs: AbstractFileSystem = fs_loc.LocalFileSystem(), preprocess: Callable[[Dataset], Dataset] | None = None, **kwargs: Any) → Dataset[source]#

Read a list of files.

Parameters:

files – List of the files to read
selected_variables – Variables that need to be read. Set to None to read everything
fs – File system hosting the files
preprocess – Preprocessor for open_mfdataset

Returns:

An xarray dataset containing the selected variables

class fcollections.core.PeriodMixin[source]#

Bases: ITemporalMixin

time_coverage(**filters) → Period | None[source]#

Find the time extent of the netcdf files.

Returns:: A Period representing the period covered by the data

time_holes(**filters) → Generator[Period, None, None][source]#

Find the holes in time coverage.

Returns:: A generator yielding Period representing holes in the data
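One way to picture the search for holes is the sketch below. It assumes each file covers a plain `(start, stop)` tuple of `datetime` objects; `iter_holes` is a hypothetical helper written for this example and is not the mixin's implementation, which works on Period objects.

```python
from datetime import datetime
from typing import Iterator

Interval = tuple[datetime, datetime]


def iter_holes(periods: list[Interval]) -> Iterator[Interval]:
    """Yield the gaps between periods given as (start, stop) tuples."""
    ordered = sorted(periods)
    if not ordered:
        return
    covered_until = ordered[0][1]
    for start, stop in ordered[1:]:
        if start > covered_until:
            # Nothing covers the span between the two periods
            yield (covered_until, start)
        covered_until = max(covered_until, stop)


periods = [
    (datetime(2024, 1, 1), datetime(2024, 1, 2)),
    (datetime(2024, 1, 4), datetime(2024, 1, 5)),
    (datetime(2024, 1, 2), datetime(2024, 1, 3)),
]
holes = list(iter_holes(periods))
```

In this example the only gap lies between 3 January and 4 January; the overall coverage would run from the earliest start to the latest stop.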

class fcollections.core.RecordFilter(fields: list[FileNameField], **references)[source]#

Bases: object

Utility class for filtering values.

fields#

the fields to filter

Type:: List[FileNameField]

**references: the values of fields used for selection

test(record)[source]#

Test if a record is filtered.

Parameters:: record – record to filter
Returns:: true if the record is filtered
Return type:: boolean

class fcollections.core.StandardVisitor[source]#

Bases: IVisitor

Visitor for producing the equivalent of fsspec.spec.AbstractFileSystem.walk().

The useful information is a tuple (root, dirs, files) that mimics the standard output of a walk operation.

No additional metadata related to the visit itself is returned.
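For readers unfamiliar with the (root, dirs, files) shape, the sketch below prints it using the standard library's `os.walk` on a temporary folder. It does not use StandardVisitor itself; the folder and file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "2024"))
    open(os.path.join(tmp, "2024", "data.nc"), "w").close()
    open(os.path.join(tmp, "readme.txt"), "w").close()

    # Each item is (root, dirs, files); roots are made relative for display
    entries = [
        (os.path.relpath(root, tmp), sorted(dirs), sorted(files))
        for root, dirs, files in os.walk(tmp)
    ]
```

Each tuple lists one folder, its direct subfolders and its direct files, which is the kind of payload the visitor is described as producing.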

advance(result: VisitResult) → StandardVisitor[source]#

Advance the visitor.

The advancement can either return a reference or a copy of the visitor. If a per-branch state is needed, it is advised to return a copy.

Parameters:: result – Previous result of a visit. Originally intended to be the visit result of the parent node.
Returns:: The current visitor or a copy with a modified state

visit_dir(dir_node: DirNode) → VisitResult[source]#

Visits a directory node.

Parameters:: dir_node – The directory node to visit
Returns:: Node information and visit metadata.

visit_file(file_node: DirNode) → VisitResult[source]#

Visits a file node.

Parameters:: file_node – The file node to visit
Returns:: Node information and visit metadata.

class fcollections.core.SubsetsUnmixer(partition_keys: tuple[str, ...] | dict[str, Callable | None], auto_pick_last: tuple[str, ...] = <factory>)[source]#

Bases: object

auto_pick_last: tuple[str, ...]#

property keys: set[str]#

partition_keys: tuple[str, ...] | dict[str, Callable | None]#

class fcollections.core.VariableMetadata(name: str, dtype: dtype, dimensions: tuple[str, ...], attributes: dict[str, str])[source]#

Bases: object

Metadata of a variable.

attributes: dict[str, str]#: Dictionary of attributes specific to the variable.

dimensions: tuple[str, ...]#: Dimensions’ names.

dtype: dtype#: Type of the variable as a numpy dtype.

name: str#: Name of the variable.

exception fcollections.core.VisitError[source]#

Bases: Exception

Raised by the visitor during node visit.

class fcollections.core.VisitResult(explore_next: bool, payload: Any | None = None, surviving_layouts: list[Layout] = <factory>)[source]#

Bases: object

Result of a visit.

The result type is defined by the IVisitor implementations.

Additional information related to semantic definition contained in layouts (Layout) is given for further advancement of the visitors.

Tree traversal can also use exploration hints given by the visitors to decide whether the current branch should be explored.

See also

walk: Handle tree traversal

explore_next: bool#: True if we should continue to explore the current branch.

payload: Any | None = None#: Post processing result of a node by the visitor.

surviving_layouts: list[Layout]#: LayoutVisitor only, used to know which semantic is still valid for the current branch.

fcollections.core.compose(func1: Callable[[Dataset], Dataset], *func2: Callable[[Dataset], Dataset] | None) → Callable[[Dataset], Dataset][source]#

Compose multiple functions that preprocess an xarray Dataset.

Before calling xr.open_mfdataset, it is useful to set up various preprocessing steps. For example, one might want to crop a subset of the dataset, and then create an index before xarray combines the files.

This function is a utility that makes it easier to chain such preprocessors.

The call order is the same as the input arguments: func1 is called first, func2[0] is called second and so on.

Parameters:

func1 – First preprocessor. Cannot be None
*func2 – Subsequent preprocessors. None elements will be ignored

Returns:

The chained functions

See also

xarray.open_mfdataset: method that takes chained preprocessors as an input
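The documented call order and the handling of None can be illustrated with a minimal sketch. `compose_sketch` is a hypothetical reimplementation written for this example, and plain dictionaries stand in for xarray Datasets so that the block stays self-contained.

```python
from typing import Callable, Optional

Preprocessor = Callable[[dict], dict]


def compose_sketch(func1: Preprocessor, *func2: Optional[Preprocessor]) -> Preprocessor:
    """Chain preprocessors in argument order, skipping None entries."""
    steps = [func1, *(f for f in func2 if f is not None)]

    def chained(data: dict) -> dict:
        for step in steps:
            data = step(data)
        return data

    return chained


def add_one(data: dict) -> dict:
    return {name: value + 1 for name, value in data.items()}


def double(data: dict) -> dict:
    return {name: value * 2 for name, value in data.items()}


# add_one runs first, None is skipped, double runs last: (1 + 1) * 2
pipeline = compose_sketch(add_one, None, double)
result = pipeline({"x": 1})
```

Accepting None entries lets callers pass optional preprocessors straight through without filtering them beforehand.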

fcollections.core.group_metadata_from_netcdf(nds: nc4.Dataset) → GroupMetadata[source]#

Extract metadata from a netcdf dataset.

Parameters:: nds – The netcdf dataset from which we want the metadata
Returns:: The associated GroupMetadata

fcollections.core.walk(node: INode, visitor: IVisitor) → Iterator[Any][source]#

Recursive walk of a file system tree.

This is a reimplementation of the similar os.walk() and fsspec.spec.AbstractFileSystem.walk(). The motivation for the reimplementation is that we need to inject some complex logic (node parsing and branch exploration) during the tree traversal.

Parameters:

node – File or folder node representing a path on the filesystem
visitor – Visitor that will process the node and produce some results

Raises:

VisitError – Raised by the visitor to signal something went wrong during a node visit

Yields:

The results of all visits in the tree. The result type will depend on the visitor implementation

See also

StandardVisitor: Visitor returning (root, dirs, files) tuples similar to a conventional walk
LayoutVisitor: Visitor that can interpret the node paths and return structured information
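The interplay between the walk and a visitor can be sketched on a toy tree. Everything below is hypothetical and simplified: `Result` stands in for VisitResult, `SkipHiddenVisitor` is an invented visitor, and a nested dictionary replaces real file system nodes. It only illustrates how an exploration hint can prune a branch and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class Result:
    """Hypothetical stand-in for VisitResult."""
    explore_next: bool
    payload: Any = None


class SkipHiddenVisitor:
    """Hypothetical visitor: reports files, prunes folders starting with '.'."""

    def visit_dir(self, name: str) -> Result:
        return Result(explore_next=not name.startswith("."))

    def visit_file(self, name: str) -> Result:
        return Result(explore_next=False, payload=name)

    def advance(self, result: Result) -> "SkipHiddenVisitor":
        return self  # stateless, so sharing the reference is enough


def walk_sketch(name: str, node: Any, visitor: SkipHiddenVisitor) -> Iterator[Any]:
    """Depth-first walk over a nested dict (dict = folder, None = file)."""
    if node is None:
        result = visitor.visit_file(name)
        if result.payload is not None:
            yield result.payload
        return
    result = visitor.visit_dir(name)
    if result.payload is not None:
        yield result.payload
    if result.explore_next:
        # The visitor may return a copy carrying per-branch state
        child_visitor = visitor.advance(result)
        for child_name, child in node.items():
            yield from walk_sketch(child_name, child, child_visitor)


tree = {"2024": {"a.nc": None, "b.nc": None}, ".cache": {"tmp.nc": None}}
found = list(walk_sketch("root", tree, SkipHiddenVisitor()))
```

The `.cache` folder is never entered because its visit result carries a negative exploration hint, so `tmp.nc` is not reported.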
