Overview
The Classification module automates product classification within a predefined hierarchical structure. It can use artificial intelligence (AI) to categorize products automatically.
The module is configured through the SDM admin, with customizable settings for field definitions, hierarchy setup, and AI model behavior.
Use Cases
- Basic classification: The module automatically suggests categories for products based on their attributes, using AI models. Products are grouped by fields such as product_code or color, and then categorized using a hierarchical structure. Without AI, the module displays all product rows so that they can be categorized manually.
- Manual validation: After classification by the AI, users can manually verify and correct the classifications through the UI.
- Hierarchical storage: The system can store categorized data across multiple hierarchy levels and output them to specific fields for downstream processes.
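As a rough illustration of the grouping mentioned in the first use case, the sketch below groups rows by product_code and color in plain Python. The function name group_rows and the sample rows are hypothetical and are not part of the module.

```python
from collections import defaultdict


def group_rows(rows, group_fields):
    """Group product rows that share the same values for group_fields."""
    groups = defaultdict(list)
    for row in rows:
        # The grouping key is the tuple of values of the chosen fields.
        key = tuple(row[field] for field in group_fields)
        groups[key].append(row)
    return dict(groups)


# Hypothetical sample data: one t-shirt model in two colors and two sizes.
rows = [
    {"product_code": "TS01", "color": "blue", "size": "S"},
    {"product_code": "TS01", "color": "blue", "size": "M"},
    {"product_code": "TS01", "color": "red", "size": "S"},
]

groups = group_rows(rows, ["product_code", "color"])
for key, members in groups.items():
    print(key, [member["size"] for member in members])
```

Here the two blue rows (sizes S and M) end up in the same group, so they would receive the same category.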
Interface
The module's user interface is divided into three tabs to assist users with classification tasks:
- Rows to Check: Displays products that need classification or verification.
- Rows Checked by You: Displays products users have manually validated.
- Rows Checked by the AI: Displays products categorized by the AI without human intervention.
When a user validates a classification, the inter dataframe is updated with internal columns (whose names contain _UNiFAi), and the product moves from Rows to Check to Rows Checked by You.
The frontend interface provides tools for reviewing the classifications made by the AI, so users can validate or manually correct the automated results.
The configuration described below includes an "audit rate", which controls how many product rows are shown to the user for checking.
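To give an intuition for the audit rate, here is a minimal sketch of how a rate between 0 and 1 could send a share of high-confidence predictions back to the user. The confidence threshold, the per-row random draw, and every name in the snippet are assumptions made for illustration; the module's actual selection logic is not described on this page.

```python
import random


def select_for_audit(predictions, audit_rate, confidence_threshold=0.9, seed=None):
    """Split predictions into rows to check and rows left automated.

    Assumption: low-confidence rows always need checking, and each
    high-confidence row is audited with probability audit_rate.
    """
    rng = random.Random(seed)
    to_check, automated = [], []
    for prediction in predictions:
        if prediction["confidence"] < confidence_threshold:
            to_check.append(prediction)
        elif rng.random() < audit_rate:
            # High confidence, but picked for audit.
            to_check.append(prediction)
        else:
            automated.append(prediction)
    return to_check, automated


# Hypothetical predictions with a confidence score each.
predictions = [
    {"id": 0, "confidence": 0.99},
    {"id": 1, "confidence": 0.95},
    {"id": 2, "confidence": 0.50},
]
to_check, automated = select_for_audit(predictions, audit_rate=0.2, seed=42)
print(len(to_check), "rows to check,", len(automated), "rows automated")
```

With a rate of 0, only the low-confidence row is shown to the user; with a rate of 1, every row is shown.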
Configuration
Always refer to the API docs. They are kept up to date and remain the source of truth, so it is recommended to double-check the information presented here against them.
The classification process involves two key configuration sections: Business Configuration (Params) and AI Configuration (Model Config).
The classification also requires the relevant hierarchies to be configured (see Hierarchy Configuration).
Business Configuration (Params)
Params stores the initial configuration.
- groupby: Each product that has the same values for the fields specified in the group_by will share the same category. For example, in the case of Intersport, products with the same CODE_MODELE and CODE_COULEUR will be grouped under the same category. Some products, like a t-shirt, may have multiple size variations, but they should all belong to the same category.
- automated_audit_rate: A rate between 0 and 1 defining how often predicted results are audited for accuracy. The audit pushes some results that have a high confidence from CoreAI to the user as "needing verification", in order to check that the model is correct. The higher the audit rate, the more products are presented to the user.
- truth_store: Reference base (database cache) used to avoid re-categorizing already categorized products. If a product has already been categorized in the Reference Base, the truth_store tells the module to use that category instead of running classification again.
- categorisation_fields: Multiple fields can be used for classification. The fields are flexible and not bound to specific types. For example, a product can be in more than one category.
- name: Used internally to identify the categories (columns in the inter dataframe).
- label: Localizable label for frontend display.
- hierarchy: The name of the hierarchy, used to define the project structure for classification. Note that the hierarchy is looked up by its name, where you would usually use the object id.
- output: Describes how the categories will be written for the next step, represented as a list of lists.
  - name: The column in which to store the value; null to discard.
  - label: Output column label visible on the frontend.
  - required: Defaults to true; controls whether the level is mandatory. When a hierarchy level is missing, an error is raised if required is true, and nothing is written if required is false.
  - nb_levels: Specifies how many levels of the hierarchy to use.
    - null: if it is the first one, it acts like a pad start, and SDM simply counts the levels left after it.
    - int: the number of levels to target.
  - separator: Joins hierarchy levels if multiple is true.
Example:
If we take this list of categories: Code_FEDAS_UNIFAI: [["2", "200", "20012", "200124", "HO"]]
and then the following configuration:
{
  "output": [
    {
      "name": null,
      "label": {},
      "required": true,
      "nb_levels": 1
    },
    {
      "name": "Code FEDAS",
      "label": {},
      "required": true,
      "nb_levels": 3
    },
    {
      "name": "Code Genre",
      "label": {},
      "required": false,
      "nb_levels": 1
    }
  ]
}
Then we'll write "200", "20012", "200124" in the column Code FEDAS, since we ignore the first level and then write the next 3, and "HO" in the column Code Genre. Taking this same example to illustrate the required parameter: if the last value "HO" had not been provided, the column Code Genre would be empty. If required had been true, an error would have been raised.
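The example above can be reproduced with a few lines of Python. This is only a sketch of the slicing logic as described here, not SDM's implementation: the function name split_levels is hypothetical, and the sketch only handles integer values of nb_levels (the null case is left out).

```python
def split_levels(path, output):
    """Map one hierarchy path onto output columns, following nb_levels."""
    result = {}
    position = 0
    for entry in output:
        count = entry["nb_levels"]
        levels = path[position:position + count]
        position += count
        if len(levels) < count:
            # A hierarchy level is missing for this entry.
            if entry.get("required", True):
                raise ValueError(f"missing hierarchy level for {entry['name']!r}")
            continue  # required is false: nothing is written
        if entry["name"] is not None:
            result[entry["name"]] = levels
        # name is None (null in JSON): the levels are discarded.
    return result


output = [
    {"name": None, "label": {}, "required": True, "nb_levels": 1},
    {"name": "Code FEDAS", "label": {}, "required": True, "nb_levels": 3},
    {"name": "Code Genre", "label": {}, "required": False, "nb_levels": 1},
]

print(split_levels(["2", "200", "20012", "200124", "HO"], output))
# {'Code FEDAS': ['200', '20012', '200124'], 'Code Genre': ['HO']}
```

Calling split_levels with the path ["2", "200", "20012", "200124"] leaves Code Genre unwritten, and raises a ValueError if required is switched to true for that entry.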
- check_representation_conflicts: Ensures that outputs represent a valid hierarchy. Errors are raised if ambiguity exists between hierarchy nodes.
  - For example, if we have hierarchies such as A/AA/ONE and B/BB/ONE, and the output is set as [{name: null, nb_levels: 2}, {name: "category", nb_levels: 1}], an error will occur when saving the configuration. This option is important when exporting data that is later used in other systems like an ERP. In the example above, if only “ONE” is written in the output column, there is no way later on to know which hierarchy was meant.
- multiple: Allows multiple classifications per item.
- multiple_separator: Defines a separator for multiple categories.
- allow_no_category: Defaults to false. Allows moving to the next step if no category is found for a product.
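The ambiguity that check_representation_conflicts guards against can be sketched as follows. This is an illustrative check written under assumptions (hierarchies given as lists of node names, integer nb_levels only, hypothetical function name); it is not the validation code that runs when the configuration is saved.

```python
def find_representation_conflicts(paths, output):
    """Return pairs of hierarchy paths that produce the same written output."""
    seen = {}
    conflicts = []
    for path in paths:
        written = []
        position = 0
        for entry in output:
            count = entry["nb_levels"]
            if entry["name"] is not None:
                # Only levels stored in a named column survive in the output.
                written.append(tuple(path[position:position + count]))
            position += count
        key = tuple(written)
        if key in seen and seen[key] != path:
            conflicts.append((seen[key], path))
        else:
            seen[key] = path
    return conflicts


paths = [["A", "AA", "ONE"], ["B", "BB", "ONE"]]
output = [
    {"name": None, "nb_levels": 2},
    {"name": "category", "nb_levels": 1},
]
conflicts = find_representation_conflicts(paths, output)
print(conflicts)
# [(['A', 'AA', 'ONE'], ['B', 'BB', 'ONE'])]
```

Both paths write only "ONE", so they are reported as a conflict. Writing all three levels to the column instead removes the ambiguity.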
AI Configuration (Model Config)
Check our API documentation for more information.
- sources: Fields used for AI prediction.
- use_model: Specifies the AI model backend (null, coreai_api, demo).
- model_results: Stores simulated results when using the demo model.
- additional_config
  - model_field_mappings: Defines the mappings for fields used in the coreai_api model.
    - field_name: Related source field.
    - model_type: Trained model or zero-shot model (zero-shot is the AI Classification agent, cf. guide).
    - ai_provider: Which API provider is used for automation.
    - model_capability: Which model capability is used for automation.
    - confidence_status:
      - automated => All categories mapped by AI will be validated automatically but still be available for review in the UI. The unmapped categories will also be added for checking in the UI.
      - to_check => Categories are automatically placed for manual review in the UI.
Example:
"model_field_mappings": [
  {
    "field_name": "categories",
    "model_type": "zero_shot",
    "ai_provider": "openai",
    "model_capability": "cost_effective",
    "confidence_status": "to_check"
  }
]
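The effect of confidence_status can be summarized with a small routing function. This is a sketch under assumptions: the mapping of each status to an interface tab is inferred from the tab descriptions in the Interface section, the function name is hypothetical, and the audit rate is ignored.

```python
def route_row(predicted_category, confidence_status):
    """Return the tab in which a row would land after AI classification."""
    if confidence_status == "to_check":
        # Everything is placed for manual review.
        return "Rows to Check"
    if confidence_status == "automated":
        if predicted_category is None:
            # Unmapped categories are still sent for checking.
            return "Rows to Check"
        return "Rows Checked by the AI"
    raise ValueError(f"unknown confidence_status: {confidence_status!r}")


print(route_row("200124", "to_check"))
print(route_row("200124", "automated"))
print(route_row(None, "automated"))
```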
Hierarchy Configuration
The Hierarchy feature is used to represent the category trees. It uses a recursive model to organize data into a multi-level structure.
Each node in the hierarchy represents what the interface will display and is defined as follows:
- Name: A unique identifier for the node, similar to the code used in the PIM system.
- Label: The display name for the node, which can be localized to support different languages.
- Selectable: By default, nodes without children are selectable. Parent nodes can be made selectable, but are not by default.
- Children: Defines sub-nodes under a parent node, creating a recursive hierarchy.
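A recursive node of this kind could be modeled as below. The class and function names are hypothetical, and representing the label as a dictionary keyed by locale is an assumption; only the four properties and the default rule for Selectable come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    label: Dict[str, str] = field(default_factory=dict)  # assumed: locale -> text
    selectable: Optional[bool] = None  # None means "use the default rule"
    children: List["Node"] = field(default_factory=list)

    def is_selectable(self) -> bool:
        if self.selectable is not None:
            return self.selectable
        # Default rule: nodes without children are selectable.
        return not self.children


def selectable_paths(node, prefix=()):
    """Yield the path of every selectable node, walking the tree recursively."""
    path = prefix + (node.name,)
    if node.is_selectable():
        yield path
    for child in node.children:
        yield from selectable_paths(child, path)


tree = Node("A", children=[
    Node("AA", selectable=True, children=[Node("ONE"), Node("TWO")]),
])
paths = list(selectable_paths(tree))
print(paths)
# [('A', 'AA'), ('A', 'AA', 'ONE'), ('A', 'AA', 'TWO')]
```

The root A is not selectable because it has children and no explicit setting, while AA is selectable only because it was marked as such.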
Limitations and Known Issues
- Maximum number of products (rows) per job: 100,000
- Maximum number of attributes: 50
- Editing categories: It is currently complicated to edit a category outside of the classification process itself.
- Overwriting existing categories: If a product is already categorized in the original file, the system will still re-categorize it. This can lead to inconsistent results, with different classification outputs across runs.