Elemeta: Extract metafeatures from unstructured data

Lior Durahly

April 24th, 2023

Depending on who you ask, unstructured data represents 80% to 90% of all new enterprise data, and it’s growing 3X faster than structured data. And with more and more models like DALL·E and ChatGPT being released daily, we’re tapping into unstructured data for machine learning more than ever before. With that said, one piece of feedback we’ve heard across the board from practitioners is that general architectural understanding of, and intuition into, how these models make decisions is vague at best, and the models themselves are far from interpretable.

Over the past few months, the data science team at Superwise has been working on this problem so that practitioners leveraging NLP and vision can enjoy monitoring, interpretability, and explainability similar to what’s available to their tabular counterparts. Today, we’re excited to release the version 1.0 beta of Elemeta, our open-source library for exploring, monitoring, and extracting features from unstructured data.

The road to Elemeta

Initially, when the team kicked off this project, it was meant to be part of the Superwise model observability platform, extending our capabilities and use case coverage into NLP and vision. While we can definitely check off those goals, we saw, both internally and through our beta users, that people were getting excited about how the package could potentially be applied to use cases beyond ML monitoring (more on that in a bit). That’s when we decided to make a shift and open-source the library.

Elemeta is based on a concept we call metafeatures (it’s not precisely metadata, but not that far from it in some cases). Metafeatures (currently focused on NLP) are metrics extracted from unstructured data that enable you to explore, model, and monitor NLP use cases through enriched tabular representations. 

Example of Elemeta metafeature extraction
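To make the idea concrete, here is a minimal sketch of extracting metafeatures from a single string. It assumes the MetafeatureExtractorsRunner entry point described in Elemeta’s documentation; the exact module path and output keys may vary between versions, so treat it as a sketch rather than exact output.

# Minimal sketch: extract metafeatures from one piece of text.
# The import path and method name follow Elemeta's docs but may differ
# between versions - treat them as assumptions, not a fixed contract.
from elemeta.nlp.metafeature_extractors_runner import MetafeatureExtractorsRunner

runner = MetafeatureExtractorsRunner()
metafeatures = runner.run("I love cake!!! 🎂")

# `metafeatures` is a plain dict mapping metafeature names to values,
# along the lines of:
# {"emoji_count": 1, "special_chars_count": 3, "sentiment_subjectivity": 0.75, ...}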

Metafeatures 

Elemeta already ships with an extensive set of out-of-the-box metafeatures such as SpecialCharsCount, EmojiCount, OutOfVocabularyCount, SentimentSubjectivity, etc. Additionally, you can use the low-level API to create custom metafeature extractors that fit your specific needs.
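As a quick taste, the built-in extractors can also be used directly on a string. The import paths below are assumptions based on Elemeta’s documented package layout and may differ in your installed version.

# Sketch: using two of the built-in extractors directly.
# Import paths are assumptions - check the Elemeta docs for your version.
from elemeta.nlp.extractors.high_level.emoji_count import EmojiCount
from elemeta.nlp.extractors.high_level.sentiment_subjectivity import SentimentSubjectivity

EmojiCount().extract("Elemeta is out! 🎉🎉")                   # -> 2 (counts emojis in the text)
SentimentSubjectivity().extract("I think this is wonderful")  # -> subjectivity score, typically in [0, 1]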

For example, if we want to create a custom IsPalindromeExtractor that returns whether the given text is a palindrome:

# Import path may differ between Elemeta versions; check the docs for your version.
from elemeta.nlp.extractors.low_level.abstract_metadata_extractor import AbstractMetadataExtractor

class IsPalindromeExtractor(AbstractMetadataExtractor):
    def extract(self, text: str) -> bool:
        # Ignore spaces and casing, then compare the text to its reverse
        normalized_text = text.replace(" ", "").lower()
        return normalized_text == normalized_text[::-1]

ipe = IsPalindromeExtractor()

Calling it will then return:

ipe("cat")
False
ipe("taco cat")
True

Within Elemeta, metafeatures are currently split into two groups: statistical metrics and contextual metrics. Statistical metrics calculate technical values such as word length, word count, etc., while contextual metrics extract information about the context of the text. Statistical metrics are language agnostic, while contextual metrics currently support English and, to some extent, other Indo-European languages (untested).
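To make the distinction concrete, here is a small sketch (the import path is again an assumption to verify against your version): a statistical extractor behaves the same regardless of language, while contextual extractors lean on English-language models.

# Sketch: statistical metrics are language agnostic.
# Import path is an assumption - check the Elemeta docs for your version.
from elemeta.nlp.extractors.high_level.special_chars_count import SpecialCharsCount

counter = SpecialCharsCount()
counter.extract("Hello, world!")   # counts special characters in English text
counter.extract("¡Hola, mundo!")   # works the same way on Spanish text

# Contextual extractors such as SentimentSubjectivity, by contrast, rely on
# English-language models, so their output on non-English text should be
# treated with caution.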

Getting started with Elemeta 

To get started, simply run

pip install elemeta

And use our getting started guide to get going.

From there, you’ll find a set of Colab notebooks that can help you dig deeper into the use cases and metafeatures and explore, model, and monitor NLP with Elemeta.
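If you want a quick end-to-end feel before opening the notebooks, a typical first step is enriching a pandas DataFrame of raw text. The snippet below is a sketch; run_on_dataframe and the import path follow Elemeta’s documentation and should be verified against the version you install.

# Sketch: enrich a DataFrame of raw text with metafeature columns.
# Class, method, and import names follow Elemeta's docs; verify them
# against your installed version.
import pandas as pd
from elemeta.nlp.metafeature_extractors_runner import MetafeatureExtractorsRunner

df = pd.DataFrame({"text": ["I love cake!!! 🎂", "Just another plain sentence."]})

runner = MetafeatureExtractorsRunner()
enriched_df = runner.run_on_dataframe(dataframe=df, text_column="text")

# enriched_df keeps the original "text" column and adds one column per
# metafeature, giving a tabular view of the unstructured data.
print(enriched_df.columns)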

Elemeta use cases

We see Elemeta being applied to three core use cases: exploratory data analysis (EDA), modeling, and model monitoring. But as we hinted above, we’ve already heard from beta users about additional potential use cases we hadn’t thought of. So don’t stick to how we think Elemeta should be used; we look forward to seeing how the community uses it.

  • Exploratory Data Analysis (EDA) – extract useful metadata information on unstructured data to analyze, investigate, and summarize the main characteristics and employ data visualization methods.
  • Data and model monitoring – utilize structured ML monitoring techniques in addition to the typical latent embedding visualizations.
  • Feature extraction & modeling – engineer alternative features for use in simpler models such as decision trees (coming soon; a rough sketch of the idea follows below).
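
As a taste of the modeling use case, here is a rough sketch of feeding metafeature columns into a decision tree. It uses plain pandas and scikit-learn on top of an already-enriched DataFrame; the enriched_df and label column are illustrative assumptions, not Elemeta functionality.

# Sketch: use metafeature columns as inputs to a simple model.
# Assumes `enriched_df` was produced as in the earlier DataFrame example and
# already contains a ground-truth `label` column - both illustrative
# assumptions. Any non-numeric metafeature columns would need to be dropped
# or encoded first.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features = enriched_df.drop(columns=["text", "label"])
labels = enriched_df["label"]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))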

What’s on the roadmap for Elemeta

We’ve only just gotten started with Elemeta. While there are already a few areas we know we’re going to invest in, such as image extractors and additional language coverage, we’ve already had input from beta users on expansions we didn’t initially think about. That’s precisely why we decided to make Elemeta a free, open-source project for the community. We want to know what metafeatures you need for your use cases and domains, and we are more than happy to accept community contributions! So if you’re working with NLP and need better exploratory data analysis, feature extraction, or monitoring, check out the Elemeta repo, take it for a spin with our Colab notebooks, and star/follow the repo (show some ❤️) to get notified as soon as there’s a new release.
