Overview
Every Job in Supplier Data Manager starts with an input file.
By default, when using a file type that supports multiple sheets (such as Excel), only the first sheet is processed. However, this behavior can be customized using Custom Input Formats in the SDM admin.
Custom Input Formats
SDM admin > Home > Files > Input formats
Custom Input Formats are predefined templates that specify how multi-sheet files should be processed.
They allow for more complex data imports by defining:
- Which sheets should be present in the file
- What columns each sheet should contain
- How different sheets could/should be joined together
- Which columns are required and which are optional
- Any unique identifiers across the dataset
Custom Input Formats serve two primary functions:
- Matching: They define criteria to match against uploaded files.
- Transformation: They specify how data should be extracted and combined from matching files into the Pandas Dataframe that will be used by the job.
Data Transformation
Once a file matches a Custom Input Format:
- The system extracts data based on the format's specifications:
- It includes all specified columns.
- If a
null
entry is present in the columns list, all unnamed columns are also included.
- Sheets are joined together using pandas join operations as specified in the format.
Custom Input Formats can be defined and managed in the admin panel:
Input Upload Process
The following diagram illustrates the process of uploading a file and matching it with Custom Input Formats:
Configuration of Input Format
The configuration outlined in this document aims at providing a more digestable version of what is presented in the API docs. However the API docs are always up to date and should remain a source of truth. It is recommended to double check the info presented here against it.
Structure
The configuration object has the following structure:
{
"sheets": [ ... ],
"unique_columns": [ ... ]
}
Options
Sheets (required)
An array of objects, each representing a sheet in the Excel file. Each sheet object has the following properties:
- name (string, required): The name of the sheet in the Excel file.
-
columns (array of objects, required): Specifies the columns to be processed in the sheet. Each column object can have one of two structures:
- Standard column:
- name (string): The name of the column in the Excel file.
- Name-value pair column:
- name_from (string): The name of the column containing the attribute names.
-
value_from (string): The name of the column containing the attribute values.
You cannot use name and name_from/value_from in the same column object.
- Standard column:
- skip_rows (integer, optional): The number of rows to skip at the beginning of the sheet. Useful for files with header information at the top.
- drop_duplicated_columns (boolean, optional, default: false): When set to true, it removes duplicate columns across multiple sheets. This is useful when the same information appears in multiple sheets.
-
join (object, optional): Specifies how this sheet should be joined with the previous sheet. Properties include:
- on (array of strings): The column(s) to join on.
-
how (string): The type of join to perform (e.g., "left", "right", "inner", "outer").
For more information on the join options, please refer to the Pandas docs as this is what is used behind the scenes.
- required (boolean, optional, default: true): Indicates whether this sheet must be present in the Excel file for the format to match.
Unique_columns (optional)
An array of strings representing the column names that serve as unique identifiers across all sheets. These columns are typically used for joining data between sheets.
Example
Here's an example of an input format demonstrating various features:
{
"sheets": [
{
"name": "Products",
"columns": [
{ "name": "ProductID" },
{ "name": "ProductName" },
{ "name": "Category" }
],
"skip_rows": 2,
"required": true
},
{
"name": "Attributes",
"columns": [
{ "name": "ProductID" },
{ "name_from": "AttributeName_3", "value_from": "AttributeValue_3" }
],
"drop_duplicated_columns": true,
"join": {
"on": ["ProductID"],
"how": "left"
}
}
],
"unique_columns": ["ProductID"]
}
Use case
Consider a Custom Input Format that specifies:
{
"sheets": [
{
"name": "01_COMMERCE",
"columns": [
{"name": "REFCIALE"},
{"name": "LIBELLE240"},
{"name": "LIBELLE80"},
{"name": "GTIN13"},
null
],
"required": true
},
{
"name": "02_LOGISTIQUE",
"columns": [
{"name": "MARQUE"},
{"name": "REFCIALE"},
null
],
"required": true,
"join": {
"on": ["REFCIALE"],
"how": "left"
}
}
],
"unique_columns": ["REFCIALE"]
}
This format will:
- Match files containing sheets named "
01_COMMERCE
" and "02_LOGISTIQUE
". - Ensure specified columns exist in each sheet.
- Include all columns from both sheets in the final dataset (due to the
null
incolumns
). - Join the "
02_LOGISTIQUE
" sheet to "01_COMMERCE
" using a left join on the "REFCIALE
" column.
Implications for what’s next in the job
- The resulting dataframe used in jobs will contain only the data specified by the Custom Input Format.
- By removing
null
entries from the columns list, you can restrict the dataframe to only specified columns.