Overview
Every Job in SDM system starts with an input file.
By default, when using a file type that supports multiple sheets (like Excel), only the first sheet will be processed. However, this behavior can be customized using Custom Input Formats in the SDM admin.
Custom Input Formats
SDM admin > Home > Files > Input formats
Custom Input Formats are predefined templates that specify how multi-sheet files should be processed.
They allow for more complex data imports by defining:
- Which sheets should be present in the file
- What columns each sheet should contain
- How different sheets could/should be joined together
- Which columns are required and which are optional
- Any unique identifiers across the dataset
Custom Input Formats serve two primary functions:
- Matching: They define criteria to match against uploaded files.
- Transformation: They specify how data should be extracted and combined from matching files into the Pandas Dataframe that will be used by the job.
Data Transformation
Once a file matches a Custom Input Format:
- The system extracts data based on the format's specifications:
- It includes all specified columns.
- If a
null
entry is present in the columns list, all unnamed columns are also included.
- Sheets are joined together using pandas join operations as specified in the format.
Custom Input Formats can be defined and managed in the admin panel:
Job creation
1. Using the interface
When uploading a file, the system handles Custom Input Formats in the following way:
Default Behavior: The uploaded file is automatically compared against all available Custom Input Formats for the organization. The first format that matches the file structure will be used. If there is no match, or there are no Custom Input Formats in the first place, then the default reader will be used for that file.
Advanced Options: Users can fine-tune this behavior using the advanced options in the job creation form.
- No Input Format: This option is a bit of a misnomer as it uses the default behavior, attempting to match the file against existing Custom Input Formats.
- Specific Format Selection: Users can manually select a specific Custom Input Format to use. This is particularly useful when a file could potentially match multiple formats, ensuring predictable processing.
2. Using FTP
Enable SDM FTP for Inputs
FTP access needs to be enabled at 2 different levels:
- Organization level First check the
Ftp allowed
checkbox
- User level, check “ftp allowed”
Connect to SDM FTP
- Connect to
ftp.sdm.akeneo.cloud
(ftp
+ environment URL) and theusername
andpassword
that you use to connect to the app.
- Once connected to the FTP you should be seeing 2 directories
inputs
andoutputs
- Navigate to the
inputs
directory. You should see a list of all the projects in your organization as directories. Simply drop an input file (CSV or XLS) in a directory to trigger a job creation for that project.
FTP Jobs do not support medias, and will only accept CSV or XLS files.
Retrieving the outputs of a job through FTP
Similarly, the outputs
directory lists all the existing projects in your organization. If you navigate to one of them you should be able to download all of the output files created for the jobs of that project.
Input Upload Process
The following diagram illustrates the process of uploading a file and matching it with Custom Input Formats:
Configuration
The configuration outlined in this document aims at providing a more digestable version of what is presented in the API docs. However the API docs are always up to date and should remain a source of truth. It is recommended to double check the info presented here against it.
Configuration Structure
The configuration object has the following structure:
{
"sheets": [ ... ],
"unique_columns": [ ... ]
}
Configuration Options
Sheets (required)
An array of objects, each representing a sheet in the Excel file. Each sheet object has the following properties:
- name (string, required): The name of the sheet in the Excel file.
-
columns (array of objects, required): Specifies the columns to be processed in the sheet. Each column object can have one of two structures:
- Standard column:
- name (string): The name of the column in the Excel file.
- Name-value pair column:
- name_from (string): The name of the column containing the attribute names.
-
value_from (string): The name of the column containing the attribute values.
You cannot use name and name_from/value_from in the same column object.
- Standard column:
- skip_rows (integer, optional): The number of rows to skip at the beginning of the sheet. Useful for files with header information at the top.
- drop_duplicated_columns (boolean, optional, default: false): When set to true, it removes duplicate columns across multiple sheets. This is useful when the same information appears in multiple sheets.
-
join (object, optional): Specifies how this sheet should be joined with the previous sheet. Properties include:
- on (array of strings): The column(s) to join on.
-
how (string): The type of join to perform (e.g., "left", "right", "inner", "outer").
For more information on the join options, please refer to the Pandas docs as this is what is used behind the scenes.
- required (boolean, optional, default: true): Indicates whether this sheet must be present in the Excel file for the format to match.
Unique_columns (optional)
An array of strings representing the column names that serve as unique identifiers across all sheets. These columns are typically used for joining data between sheets.
Configuration Example
Here's an example configuration demonstrating various features:
{
"params": {
"sheets": [
{
"name": "Products",
"columns": [
{ "name": "ProductID" },
{ "name": "ProductName" },
{ "name": "Category" }
],
"skip_rows": 2,
"required": true
},
{
"name": "Attributes",
"columns": [
{ "name": "ProductID" },
{ "name_from": "AttributeName_3", "value_from": "AttributeValue_3" }
],
"drop_duplicated_columns": true,
"join": {
"on": ["ProductID"],
"how": "left"
}
}
],
"unique_columns": ["ProductID"]
}
}
Use case
Consider a Custom Input Format that specifies:
{
"sheets": [
{
"name": "01_COMMERCE",
"columns": ["REFCIALE", "LIBELLE240", "LIBELLE80", "GTIN13", null],
"required": true
},
{
"name": "02_LOGISTIQUE",
"columns": ["MARQUE", "REFCIALE", null],
"required": true,
"join": {
"on": ["REFCIALE"],
"how": "left"
}
}
],
"unique_columns": ["REFCIALE"]
}
This format will:
- Match files containing sheets named "
01_COMMERCE
" and "02_LOGISTIQUE
". - Ensure specified columns exist in each sheet.
- Include all columns from both sheets in the final dataset (due to the
null
incolumns
). - Join the "
02_LOGISTIQUE
" sheet to "01_COMMERCE
" using a left join on the "REFCIALE
" column.
Implications for what’s next in the job
- The resulting dataframe used in jobs will contain only the data specified by the Custom Input Format.
- By removing
null
entries from the columns list, you can restrict the dataframe to only specified columns.