Inputs

Summary

Overview

Every Job in Supplier Data Manager starts with an input file. 

By default, when using a file type that supports multiple sheets (such as Excel), only the first sheet is processed. However, this behavior can be customized using Custom Input Formats in the SDM admin.

Custom Input Formats

SDM admin > Home > Files > Input formats

Custom Input Formats are predefined templates that specify how multi-sheet files should be processed.

They allow for more complex data imports by defining:

  1. Which sheets should be present in the file
  2. What columns each sheet should contain
  3. How different sheets could/should be joined together
  4. Which columns are required and which are optional
  5. Any unique identifiers across the dataset

Custom Input Formats serve two primary functions:

  1. Matching: They define criteria to match against uploaded files.
  2. Transformation: They specify how data should be extracted and combined from matching files into the Pandas Dataframe that will be used by the job.

Data Transformation

Once a file matches a Custom Input Format:

  1. The system extracts data based on the format's specifications:
    • It includes all specified columns.
    • If a null entry is present in the columns list, all unnamed columns are also included.
  2. Sheets are joined together using pandas join operations as specified in the format.

Custom Input Formats can be defined and managed in the admin panel:

Input Upload Process

The following diagram illustrates the process of uploading a file and matching it with Custom Input Formats:

Configuration of Input Format

The configuration outlined in this document aims at providing a more digestable version of what is presented in the API docs. However the API docs are always up to date and should remain a source of truth. It is recommended to double check the info presented here against it.

 

Structure

The configuration object has the following structure:

{
  "sheets": [ ... ],
  "unique_columns": [ ... ]
}

Options

Sheets (required)

An array of objects, each representing a sheet in the Excel file. Each sheet object has the following properties:

  • name (string, required): The name of the sheet in the Excel file.
  • columns (array of objects, required): Specifies the columns to be processed in the sheet. Each column object can have one of two structures:
    • Standard column:
      • name (string): The name of the column in the Excel file.
    • Name-value pair column:
      • name_from (string): The name of the column containing the attribute names.
      • value_from (string): The name of the column containing the attribute values.

        You cannot use name and name_from/value_from in the same column object.

         
  • skip_rows (integer, optional): The number of rows to skip at the beginning of the sheet. Useful for files with header information at the top.
  • drop_duplicated_columns (boolean, optional, default: false): When set to true, it removes duplicate columns across multiple sheets. This is useful when the same information appears in multiple sheets.
  • join (object, optional): Specifies how this sheet should be joined with the previous sheet. Properties include:
    • on (array of strings): The column(s) to join on.
    • how (string): The type of join to perform (e.g., "left", "right", "inner", "outer").

      For more information on the join options, please refer to the Pandas docs as this is what is used behind the scenes.

       
  • required (boolean, optional, default: true): Indicates whether this sheet must be present in the Excel file for the format to match.

Unique_columns (optional)

An array of strings representing the column names that serve as unique identifiers across all sheets. These columns are typically used for joining data between sheets.

Example

Here's an example of an input format demonstrating various features:

{
  "sheets": [
    {
      "name": "Products",
      "columns": [
        { "name": "ProductID" },
        { "name": "ProductName" },
        { "name": "Category" }
      ],
      "skip_rows": 2,
      "required": true
    },
    {
      "name": "Attributes",
      "columns": [
        { "name": "ProductID" },
        { "name_from": "AttributeName_3", "value_from": "AttributeValue_3" }
      ],
      "drop_duplicated_columns": true,
      "join": {
        "on": ["ProductID"],
        "how": "left"
      }
    }
  ],
  "unique_columns": ["ProductID"]
}

Use case

Consider a Custom Input Format that specifies:

{
  "sheets": [
    {
      "name": "01_COMMERCE",
      "columns": [
        {"name": "REFCIALE"},
        {"name": "LIBELLE240"},
        {"name": "LIBELLE80"},
        {"name": "GTIN13"},
        null
      ],
      "required": true
    },
    {
      "name": "02_LOGISTIQUE",
      "columns": [
        {"name": "MARQUE"},
        {"name": "REFCIALE"},
        null
      ],
      "required": true,
      "join": {
        "on": ["REFCIALE"],
        "how": "left"
      }
    }
  ],
  "unique_columns": ["REFCIALE"]
}

This format will:

  1. Match files containing sheets named "01_COMMERCE" and "02_LOGISTIQUE".
  2. Ensure specified columns exist in each sheet.
  3. Include all columns from both sheets in the final dataset (due to the null in columns).
  4. Join the "02_LOGISTIQUE" sheet to "01_COMMERCE" using a left join on the "REFCIALE" column.

Implications for what’s next in the job

  • The resulting dataframe used in jobs will contain only the data specified by the Custom Input Format.
  • By removing null entries from the columns list, you can restrict the dataframe to only specified columns.