Supported data types

Supported file types

Firenze supports uploading files in four formats: CSV, TSV, XLSX, and JSONL.

.csv - a comma-separated values file is a delimited text file that uses commas to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by commas. If a field contains a comma, enclose it in double quotation marks to prevent it from being split across multiple fields.
.tsv - a tab-separated values file is a delimited text file that uses tabs to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by tabs.
.xlsx - a spreadsheet file created with Microsoft Excel. Each row of the file is a data record, and each column holds one piece of information about that record. Note that Firenze only uses the data from the first sheet of the file.
.jsonl - a file in the JSON Lines format, which contains a single JSON object per line. In each object, the key indicates the type of information and the value contains the actual information.
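
The quoting rule for CSV and the one-object-per-line rule for JSONL can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names `to_csv` and `to_jsonl` are hypothetical and not part of Firenze.

```python
import csv
import io
import json

# Sample records; the second text value contains a comma.
rows = [
    {"key": "1", "text": "lorem ipsum", "class": "foo"},
    {"key": "3", "text": "amet, consectetur", "class": "foobar"},
]


def to_csv(records):
    """Serialize records to CSV text (hypothetical helper).

    With its default settings, the csv module wraps any field that
    contains a comma in double quotation marks.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["key", "text", "class"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_jsonl(records):
    """Serialize records to JSON Lines: one JSON object per line (hypothetical helper)."""
    return "".join(json.dumps(record) + "\n" for record in records)


csv_text = to_csv(rows)
jsonl_text = to_jsonl(rows)
print(csv_text)
print(jsonl_text)
```

Using a CSV library rather than joining strings by hand avoids forgetting the quotation marks around fields that contain commas.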

If your preferred format is not supported, please submit a feature request.

Data content

Firenze does not require any specific column names. The columns used for training and prediction are selected during the data upload process. However, we recommend naming the column or field that contains a unique identifier “key”. This allows you to trace specific data entries throughout Firenze. If no “key” column or field is provided, Firenze generates one.

Text classification

Text classification data can be used to train a multi-class single-label text classifier or a multi-class multi-label text classifier. Depending on the model type, a slightly different input format is required.

A multi-class single-label text classifier predicts a single label for each text that is provided. Therefore, each training data entry should contain a single label.

CSV format

key,text,class
1,lorem ipsum,foo
2,dolor sit,bar
3,"amet, consectetur",foobar

JSONL format

{"key": "1", "text": "lorem ipsum", "class": "foo"}
{"key": "2", "text": "dolor sit", "class": "bar"}
{"key": "3", "text": "amet, consectetur", "class": "foobar"}

A multi-class multi-label text classifier predicts one or more labels for each text that is provided. Therefore, each training data entry should contain one or more labels. In TSV, CSV, and XLSX files, the labels should be in the same field and separated by commas; in CSV files, a field with multiple labels must be enclosed in double quotation marks. In JSONL files, the labels should be contained in a list.

CSV format

key,text,classes
1,lorem ipsum,foo
2,dolor sit,bar
3,"amet, consectetur","foo,bar"

JSONL format

{"key": "1", "text": "lorem ipsum", "classes": ["foo"]}
{"key": "2", "text": "dolor sit", "classes": ["bar"]}
{"key": "3", "text": "amet, consectetur", "classes": ["foo", "bar"]}
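
To illustrate how the two multi-label layouts relate, the sketch below converts the comma-separated `classes` field of a CSV file into the list form used in JSONL. The function name `multilabel_csv_to_jsonl` is hypothetical, and the conversion is only an assumption about how the formats map onto each other, not a description of Firenze's internals.

```python
import csv
import io
import json

# The multi-label CSV sample; the last row holds two labels in one quoted field.
csv_text = (
    "key,text,classes\n"
    "1,lorem ipsum,foo\n"
    "2,dolor sit,bar\n"
    '3,"amet, consectetur","foo,bar"\n'
)


def multilabel_csv_to_jsonl(text):
    """Convert multi-label CSV text to JSON Lines (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row in reader:
        # Split the single labels field on commas and drop empty entries.
        labels = [
            label.strip() for label in row["classes"].split(",") if label.strip()
        ]
        lines.append(
            json.dumps({"key": row["key"], "text": row["text"], "classes": labels})
        )
    return "\n".join(lines)


jsonl_text = multilabel_csv_to_jsonl(csv_text)
print(jsonl_text)
```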

Image classification

Image classification data can be used to train a multi-class single-label image classifier. A slightly different input format is required for images.

ZIP format

For zip files, two formats are supported: the annotated and unannotated variants. The images inside the zip file should be of type .jpg or .png; other files will not be used.

When a zip file contains a mix of annotated and unannotated data, only the annotated data is used.

Folder structures nested more deeply than those shown in the examples are not supported.

For unannotated data, no folders should be present. The images should be directly in the root of the zip file.

*.zip
├── unannotated-image1.jpg
├── unannotated-image2.png
├── unannotated-image3.jpg
├── unannotated-image4.png
└── unannotated-image5.jpg

For annotated data, the folders are used as labels for the images.

*.zip
├── foo
│   ├── foo-image1.jpg
│   ├── foo-image2.png
│   └── foo-image3.jpg
└── bar
    ├── bar-image1.png
    ├── bar-image2.jpg
    └── bar-image3.png
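
The annotated layout can be produced programmatically. The following is a minimal sketch using Python's standard `zipfile` module; `build_annotated_zip` and `labels_in_zip` are hypothetical helper names, the image bytes are placeholders rather than real images, and the way labels are read back is an assumption that mirrors the folder rule described above, not Firenze's actual parser.

```python
import io
import zipfile


def build_annotated_zip(images_by_label):
    """Build an in-memory zip with one top-level folder per label (hypothetical helper).

    images_by_label maps a label to a dict of {filename: image bytes}.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for label, images in images_by_label.items():
            for filename, data in images.items():
                # The folder name acts as the label for the image.
                archive.writestr(f"{label}/{filename}", data)
    return buffer.getvalue()


def labels_in_zip(zip_bytes):
    """Map each image path to its folder label, skipping non-image files and deeper nesting."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            if len(parts) == 2 and parts[1].lower().endswith((".jpg", ".png")):
                result[name] = parts[0]
    return result


zip_bytes = build_annotated_zip(
    {
        "foo": {"foo-image1.jpg": b"placeholder"},
        "bar": {"bar-image1.png": b"placeholder"},
    }
)
labels = labels_in_zip(zip_bytes)
print(labels)
```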

JSONL format

When using the format listed below, the image field should contain the image as a base64-encoded string. The class field is optional.

{"key": "1", "image": "base64 encoded string", "class": "foo"}
{"key": "2", "image": "base64 encoded string", "class": "bar"}
{"key": "3", "image": "base64 encoded string", "class": "baz"}
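
A record of this shape could be produced as sketched below. The helper name `image_record` is hypothetical, and the bytes used here are a placeholder rather than a real image; in practice the bytes would be read from a .jpg or .png file.

```python
import base64
import json


def image_record(key, image_bytes, label=None):
    """Build one JSONL line for an image (hypothetical helper).

    The raw image bytes are base64-encoded and stored as an ASCII string.
    The "class" field is only added when a label is given.
    """
    record = {
        "key": key,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    if label is not None:
        record["class"] = label
    return json.dumps(record)


# Placeholder bytes stand in for the contents of an image file.
line = image_record("1", b"\x89PNG placeholder bytes", "foo")
unlabeled_line = image_record("2", b"\x89PNG placeholder bytes")
print(line)
```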