dwrappr.dataset module

class dwrappr.dataset.DataPoint(x: ndarray, y: ndarray | None = None)[source]

Bases: object

Represents a data point with associated x and optional y data arrays.

This class encapsulates a data point represented by a Numpy array x and an optional associated array y. It validates inputs during initialization to ensure they are Numpy arrays, and supports saving itself to and loading from joblib files via helper methods.

Attributes:
x: np.ndarray

The primary data array. Must be a Numpy array.

y: Optional[np.ndarray]

The optional secondary data array; either a Numpy array or None.
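The validation described above can be sketched as follows. Note that `DataPointSketch` is a hypothetical stand-in written for illustration, not dwrappr's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class DataPointSketch:
    """Illustrative stand-in for dwrappr's DataPoint (hypothetical)."""
    x: np.ndarray
    y: Optional[np.ndarray] = None

    def __post_init__(self):
        # Reject anything that is not a Numpy array, as described above.
        if not isinstance(self.x, np.ndarray):
            raise TypeError("x must be a numpy.ndarray")
        if self.y is not None and not isinstance(self.y, np.ndarray):
            raise TypeError("y must be a numpy.ndarray or None")


dp = DataPointSketch(x=np.array([12]), y=np.array([0]))
```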

classmethod load(filepath: str) DataPoint[source]

Load a DataPoint object from a .joblib file.

This method reads a .joblib file from the given filepath, validates its extension, and loads the data to instantiate a DataPoint object.

Args:

filepath (str): Path to the .joblib file to be loaded. The file must have a ‘.joblib’ extension.

Returns:

DataPoint: An instance of DataPoint created using the data in the file.

Raises:

ValueError: If the file does not have a ‘.joblib’ extension.

save(filepath: str) None[source]

Saves the object’s data to a specified file in Joblib format.

Raises an error if the specified file does not have a ‘.joblib’ extension. The object’s data is converted to a dictionary representation before being written.

Args:

filepath (str): The path to the file where the object’s data will be saved.

Raises:

ValueError: If the specified filepath does not end with ‘.joblib’.

x: ndarray
y: ndarray | None = None
class dwrappr.dataset.DataSet(datapoints: list[~dwrappr.dataset.DataPoint] = <factory>, dtypes: dict[str, str] = <factory>, meta: ~dwrappr.dataset.DataSetMeta = <factory>)[source]

Bases: object

Represents a dataset consisting of data points, metadata, and associated attributes.

The DataSet class is designed to store and manipulate a collection of data points, along with metadata and data types for features and targets. It provides multiple methods and properties for retrieving subsets of the dataset, accessing features and targets in various formats (e.g., numpy array, pandas DataFrame, PyTorch tensor), and loading/saving datasets.

Attributes:
datapoints: List[DataPoint]

A list of data point objects that make up the dataset.

dtypes: dict[str, str]

A dictionary mapping column names to their data types.

meta: DataSetMeta

Metadata object that contains information such as feature names, target names, and dataset name.

property as_df: DataFrame

Returns the dataset object as a pandas DataFrame.

This property converts the stored DataPoints into a DataFrame. It concatenates the x and y DataFrames along the columns axis and applies the stored data types to the resulting DataFrame before returning it.

Returns:

pd.DataFrame: A DataFrame representation of the stored DataPoints, with the stored data types applied.
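The concatenation described above can be sketched with plain pandas. The example values are taken from the DataSet repr shown later on this page; the real property derives them from the stored DataPoints:

```python
import pandas as pd

# x and y parts of four example datapoints, as separate DataFrames.
x_df = pd.DataFrame({"feature": [12, 7, 15, 9]})
y_df = pd.DataFrame({"target": [0, 1, 0, 1]})
dtypes = {"feature": "int64", "target": "int64"}

# Concatenate along the columns axis and re-apply the stored dtypes,
# as the as_df property does.
df = pd.concat([x_df, y_df], axis=1).astype(dtypes)
```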

datapoints: list[DataPoint]
dtypes: dict[str, str]
property feature_names: list[str]

Returns the names of features used in the metadata.

This property provides access to the feature names attribute present in the metadata object. It retrieves and returns the list of feature names.

Returns:

List[str]: The list of feature names.

Example:
>>> ds.feature_names
['feature']
classmethod from_dataframe(df: DataFrame, meta: DataSetMeta, check_df=True) DataSet[source]

Create a new DataSet instance from a given pandas DataFrame and metadata.

This method constructs a DataSet object from a DataFrame by extracting features and target values based on the provided metadata. It also ensures that the DataFrame aligns with the metadata specifications and performs a validation check if enabled. Additionally, the method captures the data types of the specified feature and target columns so that the DataSet can later be converted back to a DataFrame.

Args:

df (pd.DataFrame): The input DataFrame containing data structured according to the provided metadata.

meta (DataSetMeta): Metadata object that specifies feature and target column names, among other dataset properties.

check_df (bool, optional): A flag that determines whether to validate the DataFrame against the metadata. Default is True.

Returns:

DataSet: A new DataSet instance populated with DataPoint objects derived from the input DataFrame.

Examples:
>>> import pandas as pd
>>> from dwrappr import DataSet, DataSetMeta
>>> file_path_meta = r"dwrappr\examples\data\example_dataset_meta.json"
>>> meta = DataSetMeta.load(file_path_meta)
>>> file_path_data = r"dwrappr\examples\data\example_data.csv"
>>> df = pd.read_csv(file_path_data)
>>> ds = DataSet.from_dataframe(df = df, meta = meta)
>>> ds
DataSet(datapoints=[DataPoint(x=array([12]), y=array([0])), DataPoint(x=array([7]), y=array([1])), DataPoint(x=array([15]), y=array([0])), DataPoint(x=array([9]), y=array([1]))], dtypes={'feature': dtype('int64'), 'target': dtype('int64')}, meta=DataSetMeta(name='example_data', time_series='False', synthetic_data='True', feature_names=['feature'], target_names=['target'], origin=None, year=None, url=None, sector=None, target_type=None, description=None))
classmethod from_list(features: list, meta: DataSetMeta, targets: list = None) DataSet[source]

Creates a DataSet object from given lists of features and targets along with a DataSetMeta instance.

Args:

features (list): A list containing the feature data, where each sub-list represents a row of feature values.

meta (DataSetMeta): The metadata associated with the dataset, including feature and target names.

targets (list, optional): A list containing the target data, where each sub-list represents a row of target values. Defaults to None.

Returns:

DataSet: Returns an instance of the DataSet object created from the provided features, targets, and metadata.
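The row pairing that from_list performs can be sketched as follows; this is a hypothetical reconstruction of the documented behaviour using plain numpy, not the library's code:

```python
import numpy as np

features = [[12], [7], [15], [9]]
targets = [[0], [1], [0], [1]]

# Each row of features is paired with the matching row of targets,
# mirroring how from_list builds one DataPoint per row.
pairs = [(np.asarray(x), np.asarray(y)) for x, y in zip(features, targets)]
```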

static get_available_datasets_in_folder(path: str) DataFrame[source]

Gets available datasets from a specified folder and combines them into a single DataFrame.

Scans the folder to identify dataset metadata, retrieves the datasets, and concatenates them into one DataFrame.

Args:

path (str): The file path to the folder containing datasets.

Returns:

pd.DataFrame: A DataFrame containing the combined data from all datasets found in the folder.

classmethod load(filepath: str) DataSet[source]

Loads a dataset object.

This function loads a DataSet object from a file in .joblib format, reconstructing components such as DataPoint and DataSetMeta objects. It assumes the file contains serialized elements suitable for creating a DataSet.

Args:

filepath (str): The path to the .joblib file from which the DataSet will be loaded.

Returns:

DataSet: A fully reconstructed DataSet instance based on the data provided in the file.

Raises:

ValueError: If the provided file does not have the .joblib extension.

meta: DataSetMeta
property name: str

Returns the name attribute of the meta property.

This property retrieves the name stored in the meta attribute. It does not accept any arguments and directly returns the name as a string.

Returns:

str: The name value associated with the meta attribute.

Example:
>>> ds.name
'example_data'
property num_datapoints: int

Returns the number of datapoints in the dataset.

This property calculates the total count of datapoints currently present and provides this information as an integer.

Returns:

int: The total number of datapoints in the dataset.

save(filepath: str, drop_meta_json: bool = True) None[source]

Saves the current object state to a specified file path, optionally excluding a meta JSON file. Ensures the file has the correct extension before saving.

Args:

filepath (str): The file path to save the object to. Must end with ‘.joblib’.

drop_meta_json (bool): Whether to skip saving the meta JSON file. Defaults to True.

Raises:

ValueError: If the provided file path does not have a ‘.joblib’ extension.

Returns:

None

split_dataset(first_ds_size: float, shuffle: bool = True, random_state: int = 42, group_by_features: List[str] | None = None) Tuple[DataSet, DataSet][source]

Splits the dataset into two subsets based on a specified ratio. The split can optionally group data points by specific feature values to ensure grouped subsets stay intact.

Args:

first_ds_size (float): Proportion of the dataset to assign to the first subset. Must be a value between 0 and 1.

shuffle (bool, optional): Whether to shuffle the dataset or groups before splitting. Defaults to True.

random_state (int, optional): Random seed for reproducibility of shuffling. Defaults to 42.

group_by_features (List[str], optional): List of feature names to group data points by before splitting. If None, no grouping is applied. Defaults to None.

Returns:

Tuple[‘DataSet’, ‘DataSet’]: A tuple containing the two resulting datasets after the split.

Example:
>>> ds.split_dataset(0.5)
(DataSet(datapoints=[DataPoint(x=array([12]), y=array([0])), DataPoint(x=array([9]), y=array([1]))], dtypes={'feature': dtype('int64'), 'target': dtype('int64')}, meta=DataSetMeta(name='example_data', time_series='False', synthetic_data='True', feature_names=['feature'], target_names=['target'], origin=None, year=None, url=None, sector=None, target_type=None, description=None)),

DataSet(datapoints=[DataPoint(x=array([7]), y=array([1])), DataPoint(x=array([15]), y=array([0]))], dtypes={'feature': dtype('int64'), 'target': dtype('int64')}, meta=DataSetMeta(name='example_data', time_series='False', synthetic_data='True', feature_names=['feature'], target_names=['target'], origin=None, year=None, url=None, sector=None, target_type=None, description=None)))
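The group-aware behaviour can be sketched with plain numpy: indices are grouped by a feature value, whole groups are shuffled with a fixed seed, and groups are assigned to the first subset until the requested proportion is reached. `grouped_split` is an illustration of the idea, not the library's implementation:

```python
from collections import defaultdict

import numpy as np


def grouped_split(values, first_ds_size, seed=42):
    """Split the indices of `values` so rows sharing a value stay together."""
    groups = defaultdict(list)
    for i, v in enumerate(values):
        groups[v].append(i)
    keys = list(groups)
    np.random.default_rng(seed).shuffle(keys)  # shuffle whole groups, not rows
    first, cutoff = [], int(round(first_ds_size * len(values)))
    for k in keys:
        if len(first) >= cutoff:
            break
        first.extend(groups[k])
    second = [i for i in range(len(values)) if i not in set(first)]
    return first, second


# Four rows, two groups of two: each group lands wholly in one subset.
a, b = grouped_split(["g1", "g1", "g2", "g2"], 0.5)
```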

property target_names: list[str]

Returns the list of target names specified in the metadata.

The method fetches and provides a list containing the target names which are stored in the meta attribute. The list represents the names or labels that correspond to target values in a dataset or similar context.

Returns:

list[str]: A list of target names.

Example:
>>> ds.target_names
['target']
property x: array

Returns the x-coordinates of all datapoints in the current object.

This property compiles a list of the x-values from all elements in the ‘datapoints’ attribute and returns them as a NumPy array. The returned array provides a structured format of the x-coordinates for further computations or manipulations.

Returns:

np.array: A NumPy array containing the x-coordinates of the datapoints in the object.

Example:
>>> ds.x
array([[12],
       [ 7],
       [15],
       [ 9]])
property x_as_df: DataFrame

Returns the x attribute as a pandas DataFrame.

Provides a property method to process and return the x attribute formatted as a pandas DataFrame with updated data types. The output DataFrame’s schema is adjusted according to the stored metadata and type definitions.

Returns:

pd.DataFrame: A pandas DataFrame created from the x attribute, with columns named according to meta.feature_names and updated data types based on the metadata settings.

property x_as_tensor: torch.Tensor

Returns the x attribute of the instance as a PyTorch tensor.

This property converts the x attribute to a PyTorch tensor of type torch.float32. It requires PyTorch to be installed in the environment.

Returns:

torch.Tensor: The x attribute of the instance converted to a tensor.

Raises:

ImportError: If PyTorch is not installed in the environment.

property y: ndarray | None

Returns the y values extracted from all datapoints as a NumPy array.

Y values correspond to the ‘y’ attribute of each datapoint in the list of datapoints. If no datapoints are present, it returns None.

Returns:

Optional[np.ndarray]: A NumPy array of y values from datapoints, or None if no datapoints exist.

Example:
>>> ds.y
array([[0],
       [1],
       [0],
       [1]])
property y_as_df: DataFrame

Returns the target variable as a pandas DataFrame.

This property provides a DataFrame representation of the target variable with column names corresponding to the target_names attribute. It also ensures that the DataFrame’s data types are updated consistent with any pre-defined data type information.

Returns:

pd.DataFrame: The target variable represented as a pandas DataFrame with appropriately updated data types.

property y_as_tensor: torch.Tensor

Returns the attribute ‘y’ as a PyTorch tensor.

This property converts the ‘y’ attribute of the object into a PyTorch tensor with a data type of float32. It requires PyTorch to be installed, and will raise an ImportError if it is not available.

Returns:

torch.Tensor: The attribute ‘y’ represented as a tensor of type torch.float32.

Raises:

ImportError: If PyTorch library is not installed.

class dwrappr.dataset.DataSetMeta(name: str, time_series: bool, synthetic_data: bool, feature_names: ~typing.List[str], target_names: ~typing.List[str] = <factory>, origin: str = None, year: str = None, url: str = None, sector: str = None, target_type: str = None, description: str = None)[source]

Bases: object

Represents metadata information for a dataset, including details about its attributes, usage, and related files.

This class is designed to store and manipulate metadata for datasets, providing an interface for converting metadata to a pandas DataFrame, loading metadata from JSON files and scanning directories for metadata files.

Attributes:
name: str

Name of the dataset.

time_series: bool

Indicates whether the dataset contains time series data.

synthetic_data: bool

Indicates whether the dataset contains synthetic data.

feature_names: List[str]

List of feature names in the dataset.

target_names: List[str]

List of target names in the dataset.

origin: str

Source or origin of the dataset.

year: str

Year associated with the dataset.

url: str

URL to access further information about the dataset.

sector: str

Sector to which the dataset belongs.

target_type: str

Type of the target variable (e.g., ‘classification’, ‘regression’).

description: str

Description or additional details about the dataset.

property as_df: DataFrame

Returns the available metadata as a DataFrame.

This property converts the object’s attributes into a pandas DataFrame. Lists in attributes are transformed into comma-separated strings for better readability.

Returns:

pd.DataFrame: A single-row DataFrame representing the object’s metadata and attributes.
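The list-flattening described above can be sketched with dataclasses and pandas; `MetaSketch` is a hypothetical stand-in for DataSetMeta used only for illustration:

```python
from dataclasses import asdict, dataclass, field
from typing import List

import pandas as pd


@dataclass
class MetaSketch:
    """Hypothetical stand-in for DataSetMeta."""
    name: str
    feature_names: List[str] = field(default_factory=list)


meta = MetaSketch(name="example_data", feature_names=["f1", "f2"])

# Lists become comma-separated strings; everything else passes through.
row = {k: ", ".join(v) if isinstance(v, list) else v
       for k, v in asdict(meta).items()}
meta_df = pd.DataFrame([row])  # a single-row DataFrame of the metadata
```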

description: str = None
feature_names: List[str]
classmethod load(filepath: str) DataSetMeta[source]

Loads an instance of DataSetMeta from a JSON file.

This class method reads a JSON file and initializes an instance of DataSetMeta using the contents of the file. If the file provided does not have a .json extension, a ValueError is raised.

Args:

filepath (str): The path to the JSON file that contains the data needed to initialize a DataSetMeta instance.

Returns:

DataSetMeta: An instance of DataSetMeta initialized with the data from the JSON file.

Raises:

ValueError: If the file specified by ‘filepath’ does not have a ‘.json’ extension.

Example:
>>> file_path_meta = r"dwrappr\examples\data\example_dataset_meta.json"
>>> meta = DataSetMeta.load(file_path_meta)
>>> meta
DataSetMeta(name='example_data', time_series='False', synthetic_data='True', feature_names=['feature'], target_names=['target'], origin=None, year=None, url=None, sector=None, target_type=None, description=None)
name: str
origin: str = None
save(filepath: str) None[source]

Saves the instance data to a specified JSON file.

The method ensures that the file has a ‘.json’ extension before attempting to save the instance data. If the extension is incorrect, a ValueError is raised. The instance is first converted to a dictionary representation and then written to the specified file path.

Args:

filepath (str): The path to the file where the instance data will be saved. The file must have a ‘.json’ extension.

Returns:

None

Raises:

ValueError: Raised if the file does not have a ‘.json’ extension.
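The save path can be sketched with the standard library, assuming only the behaviour documented above; `MiniMeta` and `save_meta` are hypothetical names used for illustration:

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class MiniMeta:
    """Hypothetical minimal stand-in for DataSetMeta."""
    name: str


def save_meta(meta, filepath: str) -> None:
    # Enforce the '.json' extension first, then dump the dict representation.
    if not filepath.endswith(".json"):
        raise ValueError(f"Expected a '.json' file, got: {filepath}")
    with open(filepath, "w") as fh:
        json.dump(asdict(meta), fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_meta.json")
    save_meta(MiniMeta(name="example_data"), path)
    with open(path) as fh:
        loaded = json.load(fh)
```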

classmethod scan_for_meta(path: str, recursive: bool = True) List[DataSetMeta][source]

Scans the directory for metadata and corresponding dataset objects.

This function scans a specified directory for dataset object files (‘.joblib’) and their corresponding metadata files (‘_meta.json’), pairs them, and returns a list of DataSetMeta instances; unpaired files are logged.

Args:

path (str): The root directory path to scan for metadata and dataset object files.

recursive (bool, optional): Indicates whether subdirectories should also be scanned. Defaults to True.

Returns:

List[DataSetMeta]: A list of DataSetMeta objects for which both a ‘.joblib’ dataset file and a matching ‘_meta.json’ metadata file were found. A warning is logged for any file missing its counterpart.

Example:
>>> DataSetMeta.scan_for_meta(r"dwrappr\examples\data")
[DataSetMeta(name='example_data', time_series='False', synthetic_data='True', feature_names=['feature'], target_names=['target'], origin=None, year=None, url=None, sector=None, target_type=None, description=None)]
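The pairing rule can be sketched with pathlib. `find_pairs` is a hypothetical helper, and the `<stem>_meta.json` naming is an assumption made for illustration; the library's own matching logic may differ:

```python
import tempfile
from pathlib import Path


def find_pairs(root):
    """Return stems having both '<stem>.joblib' and '<stem>_meta.json'."""
    root = Path(root)
    stems = {p.stem for p in root.rglob("*.joblib")}
    return sorted(s for s in stems if (root / f"{s}_meta.json").exists())


with tempfile.TemporaryDirectory() as tmp:
    # One paired dataset and one orphan without metadata.
    Path(tmp, "example_data.joblib").touch()
    Path(tmp, "example_data_meta.json").touch()
    Path(tmp, "orphan.joblib").touch()
    paired = find_pairs(tmp)
```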
sector: str = None
synthetic_data: bool
target_names: List[str]
target_type: str = None
time_series: bool
url: str = None
year: str = None