📦 dwrappr
A lightweight and extensible Python package for managing data, tailored for researchers working with structured data. In addition to general data management features, the package introduces a data structure specifically optimized for ML research. This common format enables researchers to efficiently test new algorithms and methods, streamlining collaboration and ensuring consistency in data management across projects.
🧩 Features
🗂️ Consistent dataset object structure for handling structured data in ML use cases
📁 Support for building a file-based internal dataset collaboration platform for researchers
🧰 General utilities for managing data, such as saving and loading
🚀 Quickstart
To run the quickstart examples and get an overview of dwrappr's functionality, have a look at IEEE_examples.
Additional functionalities are showcased in:
loading_dataset_from_file.py: Shows how to load a dataset from an existing dataset file
scanning_folder_for_datasets.py: Shows how to scan a folder for available datasets
dataset_functionalities.py: Shows some of the main functionality of the DataSet class.
🔍 Functionality Insights
Scan folder for datasets
from dwrappr import DataSet

DATASET_FOLDER = "./data/datasets/"

# returns an overview of all datasets found in the folder
available_datasets = DataSet.get_available_datasets_in_folder(
    DATASET_FOLDER
)
available_datasets.T
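Conceptually, scanning a folder like this amounts to collecting the dataset files in a directory. The following is a minimal, hypothetical sketch using only the standard library; the `*.joblib` pattern is inferred from the file extension used in the examples below, not taken from dwrappr's actual implementation:

```python
from pathlib import Path
import tempfile

def scan_for_datasets(folder: str, pattern: str = "*.joblib") -> list[str]:
    """Return the sorted names of all dataset files found in `folder`."""
    return sorted(p.name for p in Path(folder).glob(pattern))

# demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a_ds.joblib", "b_ds.joblib", "notes.txt"):
        (Path(tmp) / name).touch()
    print(scan_for_datasets(tmp))  # only the .joblib files are listed
```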
Loading a specific dataset
DATASET_FILEPATH = "./data/datasets/manufacturing_process_ds.joblib"
ds = DataSet.load(DATASET_FILEPATH)
Generating a dataset from raw data
import pandas as pd
from dwrappr import DataSet, DataSetMeta

RAW_DATA_FILEPATH = "./data/raw_data.csv"

# load raw data into a pandas.DataFrame
df = pd.read_csv(RAW_DATA_FILEPATH)

"""
<some manual dataset preprocessing steps,
e.g. dropping missing values and changing dtypes>
"""
# define dataset metadata
meta = DataSetMeta(
    name="example_dataset",
    synthetic_data=True,
    time_series=False,
    feature_names=["feature"],
    target_names=["target"]
)
# generate DataSet from the DataFrame and metadata
ds = DataSet.from_dataframe(
    df=df,
    meta=meta
)

# save dataset to file
ds.save("./data/example_dataset.joblib", drop_meta_json=True)
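Saving and loading a dataset is a plain serialization round trip. As a rough stand-in for dwrappr's joblib-based file format, the same idea can be sketched with the standard library's pickle; the dictionary layout here is purely illustrative, not a dwrappr API:

```python
import pickle
import tempfile
from pathlib import Path

# illustrative dataset record: metadata plus data rows (not dwrappr's format)
dataset = {
    "meta": {"name": "example_dataset", "synthetic_data": True},
    "rows": [[1.0, 2.0], [3.0, 4.0]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_dataset.pkl"
    path.write_bytes(pickle.dumps(dataset))      # save to file
    restored = pickle.loads(path.read_bytes())   # load it back
    print(restored["meta"]["name"])  # -> example_dataset
```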
Split dataset
(train/test split)
import numpy as np
import pandas as pd
n_instances = 100
# Create the 'product_id' feature with 7 different categorical values
product_ids = np.random.choice(['1001', '2002', '3003', '4004', '5005', '6006', '7007'], size=n_instances)
# Generate two additional numeric features
feature_1 = np.random.rand(n_instances) * 100 # Random numbers between 0 and 100
feature_2 = np.random.rand(n_instances) * 50 # Random numbers between 0 and 50
# Generate a numeric target
target = feature_1 * 0.5 + feature_2 * 0.3 + np.random.randn(n_instances) * 5 # Adding some noise
# Create a DataFrame
df = pd.DataFrame({
    'product_id': product_ids,
    'feature_1': feature_1,
    'feature_2': feature_2,
    'target': target
})
ds = DataSet.from_dataframe(
    df=df,
    meta=DataSetMeta(
        name="example_dataset",
        synthetic_data=True,
        time_series=False,
        feature_names=["product_id", "feature_1", "feature_2"],
        target_names=["target"]
    )
)

train_ds, test_ds = ds.split_dataset(
    first_ds_size=0.5,
    shuffle=True,
    group_by_features=["product_id"]
)
🆘 Help
See Documentation for details.
🛠️ Package Installation
Full version:
pip install dwrappr

Light version (excluding the sklearn library):
pip install dwrappr[light]

(Keep the package updated with pip install dwrappr --upgrade.)
🧠 Maintainer
This project is maintained by Nils.