What are Label Spans?

May 4, 2022

Spans are a new Indico Data innovation that allows your workflows to perform to their full potential. Spans are the first step in a major transformation for the Indico platform and its users, even if things don’t appear to be different on your end.

What are spans?

In the Indico platform, Spans are a new way of representing data that standardizes model input and output. Spans are the first step in allowing users to use Indico’s basic building blocks to create workflows that we’ve never dared to imagine! They’re essential for making room for new features, model combinations, and more.

So what exactly are spans? A Span represents data at a sub-document level. Rather than viewing the entire document, our models can use Spans to break a document down into smaller pieces of data. Before spans, our models had to adhere to a set of strict rules when it came to data representation.

Models can break up data into different-sized chunks with spans, enabling a slew of new features and enhancing the power of our platform. A data unit can now be an extraction label, an image, or a variety of other things. And spans make it possible to do all of this while maintaining the context of the original document, making it easier to understand for humans as well. This is the stuff of fairy tales!

What issues are addressed by spans?

Indico’s models were previously overly specific – each model required its data and training labels to be structured in a specific way (encoded text vs. raw files, strings vs. lists of dictionaries, etc.). For us and you, our users, the most significant limitations were:

Complex model integration into the Indico platform
The inability to connect certain model types within a workflow. You’ve sensed it. We’ve sensed it. Models cannot be linked: our app currently does not support classifying clauses extracted from legal documents, or extracting an address and then breaking it up into address components (street, street number, city, zip, state, etc.) with another extraction model, or many other useful combinations. The reason is that combining the specific I/O requirements of these models would require writing long, brittle code.

Furthermore, the platform was built in such a way that training labels and model predictions always received the entire input file as input. So, what happens when users want to process files that are hundreds of pages long but only want to classify the header page? When models thrive on learning from specific and relevant data, overloading them with information can have a significant impact on performance. Until now, there was no way to “narrow the focus” on the platform.

Finally, it was previously impossible to label non-continuous data. A label had to apply to a continuous range of characters for text extraction models. A label had to be applied to a single bounding box for image extraction models. However, because the data we work with is unstructured, there is no guarantee that the single item you want to extract will always appear together.

Consider the following example: we want to extract the entire address from this document. Simple left-to-right highlighting would capture not only the address but also the company motto and website on the left.

To avoid this, you’d have to create individual labels for each address line and stitch them back together in post-processing, an inconvenient process that adds extra steps to your workflow. Label spans solve this problem by redefining how we capture data segments to account for the wide range of situations that our app encounters.

How are spans addressing this issue?

A Span (or, as we call them as a unit, a Span-Group) is a fancy word for a “portion of data.” It’s an object with an underlying reference to the original source of data and a representation of a portion of that original source.

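A “text” span group is represented by a list of start/end/page_num character ranges in the source text. In the example below, offsets are zero-based, count across the whole document, and the end offset is inclusive.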

Document

– – Page 1 – –

Hello my name is Foo

– – Page 2 – –

This is my friend Bar

Representative Span-Group

[
  { start: 0, end: 19, page_num: 0 },
  { start: 21, end: 41, page_num: 1 }
]
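To make the offsets concrete, here is a minimal Python sketch of the text Span-Group above. The names `TextSpan` and `span_group_text` are hypothetical, not part of the Indico API, and the sketch assumes the two pages are joined by a single newline character, which is what the offsets in the example imply.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextSpan:
    start: int     # offset of the first character in the full document text
    end: int       # offset of the last character (inclusive, as in the example)
    page_num: int  # zero-based page index


def span_group_text(source: str, spans: List[TextSpan]) -> List[str]:
    """Return the text covered by each span, in order."""
    return [source[s.start : s.end + 1] for s in spans]


# Assumption: pages are concatenated with one newline between them.
source = "Hello my name is Foo\nThis is my friend Bar"
group = [TextSpan(0, 19, 0), TextSpan(21, 41, 1)]

print(span_group_text(source, group))
# ['Hello my name is Foo', 'This is my friend Bar']
```

The group only stores positions; the text itself always comes from the original source.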

An image span group is represented by a list of top/bottom/left/right/page_num bounding boxes on the source image.

Image

Representative Span-Group

[
  {
    top: 0,
    bottom: 100,
    left: 0,
    right: 100,
    page_num: 0
  }
]
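As a rough illustration, an image span can be applied to pixel data like this. The sketch is hypothetical: `BoxSpan` and `crop` are made-up names, and it assumes pixel coordinates with the origin at the top-left corner and exclusive bottom/right edges, none of which the post specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoxSpan:
    top: int
    bottom: int
    left: int
    right: int
    page_num: int


def crop(page_pixels: List[List[int]], box: BoxSpan) -> List[List[int]]:
    """Return the rows and columns of one page that fall inside the box.

    Assumption: top/left are inclusive and bottom/right are exclusive.
    """
    return [row[box.left : box.right] for row in page_pixels[box.top : box.bottom]]


# A blank 150 x 200 "page" stored as rows of grayscale values.
page = [[0] * 200 for _ in range(150)]
region = crop(page, BoxSpan(top=0, bottom=100, left=0, right=100, page_num=0))

print(len(region), len(region[0]))  # 100 100
```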

In the spans platform, all data is represented as Span-Groups; gone are the days of “sometimes files, sometimes URLs, sometimes raw text, and sometimes feature vectors.” Furthermore, a SpanGroup can be interpreted to meet the needs of any model! A text document’s SpanGroup can be converted to a bounding-box SpanGroup by simply referencing the source file, and vice versa.
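One way such a conversion could work is sketched below. It assumes the source file provides token-level positions (for example from OCR) that carry both character offsets and pixel boxes; that token table, and the function `text_span_to_box`, are assumptions made for illustration and not a description of how the Indico platform does it.

```python
from typing import Dict, List


def text_span_to_box(span: Dict[str, int], tokens: List[Dict[str, int]]) -> Dict[str, int]:
    """Union the boxes of all tokens that overlap a text span (inclusive offsets)."""
    hits = [
        t for t in tokens
        if t["page_num"] == span["page_num"]
        and t["start"] <= span["end"]
        and t["end"] >= span["start"]
    ]
    if not hits:
        raise ValueError("no tokens overlap the span")
    return {
        "top": min(t["top"] for t in hits),
        "bottom": max(t["bottom"] for t in hits),
        "left": min(t["left"] for t in hits),
        "right": max(t["right"] for t in hits),
        "page_num": span["page_num"],
    }


# Hypothetical token positions for "Hello my name is Foo" on page 0.
tokens = [
    {"start": 0, "end": 4, "page_num": 0, "top": 10, "bottom": 22, "left": 10, "right": 50},     # Hello
    {"start": 9, "end": 12, "page_num": 0, "top": 10, "bottom": 22, "left": 78, "right": 112},   # name
    {"start": 17, "end": 19, "page_num": 0, "top": 10, "bottom": 22, "left": 136, "right": 160}, # Foo
]

box = text_span_to_box({"start": 17, "end": 19, "page_num": 0}, tokens)
print(box)
# {'top': 10, 'bottom': 22, 'left': 136, 'right': 160, 'page_num': 0}
```

Going the other way would use the same table in reverse: collect the tokens whose boxes fall inside a bounding box and take their character offsets.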

The most significant advantage is that models can generate spans as well! Consider an extraction model, which extracts a portion of a document or image; an extraction label is simply a combination of “class name and SpanGroup.” In our previous Document example, an extraction model that predicted “Person Name” would do the following.

Document -> SpanGroup -> Prediction

Prediction (as you might know it today)

[
    {label: "Person Name", start: 17, end: 19, text: "Foo"},
    {label: "Person Name", start: 39, end: 41, text: "Bar"}
]

Using the same SpanGroup data structure as before, each extraction becomes its own SpanGroup!

Prediction w/ SpanGroups

[
  {label: "Person Name", spans: [{start: 17, end: 19, page_num: 0}]},
  {label: "Person Name", spans: [{start: 39, end: 41, page_num: 1}]}
]
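The two prediction formats carry the same offsets, so one can be derived from the other. The following sketch is illustrative only: the function names are hypothetical, and it assumes the page number can be looked up from the document's own Span-Group, since the older format does not include one.

```python
from typing import Dict, List


def page_for_offset(pages: List[Dict[str, int]], offset: int) -> int:
    """Find the page whose inclusive character range contains the offset."""
    for page in pages:
        if page["start"] <= offset <= page["end"]:
            return page["page_num"]
    raise ValueError(f"offset {offset} is not on any page")


def to_span_group_prediction(pred: Dict, pages: List[Dict[str, int]]) -> Dict:
    """Wrap a flat start/end prediction in a one-element span list."""
    return {
        "label": pred["label"],
        "spans": [{
            "start": pred["start"],
            "end": pred["end"],
            "page_num": page_for_offset(pages, pred["start"]),
        }],
    }


# The document Span-Group and flat predictions from the example.
pages = [
    {"start": 0, "end": 19, "page_num": 0},
    {"start": 21, "end": 41, "page_num": 1},
]
flat = [
    {"label": "Person Name", "start": 17, "end": 19, "text": "Foo"},
    {"label": "Person Name", "start": 39, "end": 41, "text": "Bar"},
]

converted = [to_span_group_prediction(p, pages) for p in flat]
print(converted)
```

The `text` field is dropped in the converted form because it can always be recovered from the source document using the span offsets.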

Readers with a keen eye may wonder why these are “nested”: why does each extraction hold a list of spans instead of a single start-end span? What a great question! The list is what lets us handle non-continuous extractions. Returning to the address-labeling example from earlier, we can now accurately label and predict the address on the document.

Address Label

{
  label: "Address",
  spans: [
    {start: 12, end: 16, page_num: 0},
    {start: 43, end: 63, page_num: 0},
    {start: 78, end: 98, page_num: 0}
  ]
}
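Here is a small sketch of how a multi-span label like this might be resolved back into a single value. The letterhead text is invented for illustration, so its offsets differ from the ones in the label above, and `find_span` and `resolve` are hypothetical helpers rather than platform functions.

```python
from typing import Dict


def find_span(text: str, fragment: str, page_num: int = 0) -> Dict[str, int]:
    """Locate a fragment and return an inclusive start/end character range."""
    start = text.index(fragment)
    return {"start": start, "end": start + len(fragment) - 1, "page_num": page_num}


def resolve(text: str, label: Dict) -> str:
    """Join the text of every span in the group, in order."""
    return " ".join(text[s["start"] : s["end"] + 1] for s in label["spans"])


# Made-up letterhead: motto and website on the left, address lines on the right.
text = (
    "Quality since 1950    123 Main Street\n"
    "www.acme.example      Springfield, IL\n"
    "                      62704\n"
)

address = {
    "label": "Address",
    "spans": [find_span(text, part) for part in ("123 Main Street", "Springfield, IL", "62704")],
}

print(resolve(text, address))  # 123 Main Street Springfield, IL 62704
```

A single left-to-right range over the same text would have swept up the motto and the website; three short spans skip them.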

This also means that model labels or predictions can be used as data sources for downstream models. At last, we have a way to connect any model to any other model.
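To illustrate the idea of chaining, the sketch below feeds an upstream "Address" prediction into a second, toy extractor that only looks inside the upstream spans. Both the sample text and `downstream_zip_extractor` are hypothetical stand-ins (a regular expression plays the role of the second model); the point is only that the downstream output is again a label plus spans anchored to the original document.

```python
import re
from typing import Dict, List


def downstream_zip_extractor(source: str, upstream: Dict) -> List[Dict]:
    """Toy second-stage model: search for 5-digit zip codes inside upstream spans only."""
    results = []
    for span in upstream["spans"]:
        fragment = source[span["start"] : span["end"] + 1]
        for match in re.finditer(r"\b\d{5}\b", fragment):
            results.append({
                "label": "Zip",
                "spans": [{
                    # Shift fragment-relative offsets back to document offsets.
                    "start": span["start"] + match.start(),
                    "end": span["start"] + match.end() - 1,
                    "page_num": span["page_num"],
                }],
            })
    return results


source = "Invoice 10001 for ACME. Ship to: 123 Main Street, Springfield, IL 62704"
addr_start = source.index("123 Main")
address = {
    "label": "Address",
    "spans": [{"start": addr_start, "end": len(source) - 1, "page_num": 0}],
}

zips = downstream_zip_extractor(source, address)
first = zips[0]["spans"][0]
print(source[first["start"] : first["end"] + 1])  # 62704
```

The invoice number `10001` is never considered, because it lies outside the upstream span. That is the "narrowed focus" the spans design is meant to provide.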

A SpanGroup, as a bonus, always refers to its original source data. This means that labelers will always be able to view the entire image or file, even if the model you’re labeling for is only interested in a specific SpanGroup on that file. We want to assist our users in narrowing the focus of their models while still allowing them to see and understand all of their data.

Conclusion

If you’re thinking to yourself, “Wow, this is amazing! When will I be able to use these?”, we have some fantastic news for you! The wait will not be long – spans will be added to the Indico platform in our upcoming 5.1 release, which is scheduled for April 2022.
