XML files and custom Input Formats

Summary

Overview

Custom XML Input Formats (more infrmation: Inputs) are predefined templates that specify how XML files should be processed. It allows for configurable extraction of structured data based on XML paths and element relationships.

Adding a custom input format for your expected XML files is mandatory to upload XML files.

What is XML?

XML is one of our supported input formats. While less commonnly used by clients than CSV or XLS, XML's flexible structure often requires more configuration to properly extract data. Our parsing system is designed to handle complex XML structures through customizable extraction rules that can navigate hierarchical data, process attributes, and manage namespaces. For example, a typical product XML might look like:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <id>1</id>
    <name>Product One</name>
    <price currency="USD">19.99</price>
    <categories>
      <category>Electronics</category>
      <category>Gadgets</category>
    </categories>
  </product>
  <product>
    <id>2</id>
    <name>Product Two</name>
    <price currency="USD">29.99</price>
    <categories>
      <category>Home</category>
      <category>Kitchen</category>
    </categories>
  </product>
</products>

XML Namespaces

XML namespaces help avoid element name conflicts when combining XML documents from different sources. They're defined using the xmlns attribute and are represented as URIs.

Example of an XML document with namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<ProductFeed xmlns="<http://www.example.com/xsd/product/feed/0.2>">
  <Products xmlns="<http://www.example.com/xsd/product/0.3>">
    <Product xmlns="<http://www.example.com/xsd/product/0.3>">
      <Id xmlns="<http://www.example.com/xsd/product/0.3>">1</Id>
      <Name xmlns="<http://www.example.com/xsd/product/0.3>">Product One</Name>
      <Price xmlns="<http://www.example.com/xsd/product/0.3>" currency="USD">19.99</Price>
    </Product>
  </Products>
</ProductFeed>

Handling Namespaces in SDM

When parsing XML files with namespaces, you have two options:

Remove namespaces: Treats all elements as if they had no namespace, simplifying the parsing process. This works well for most data extraction scenarios.
Preserve namespaces (default): Keeps the namespace information intact, which is necessary when dealing with elements that have the same local name but different namespaces.

The `remove_ns` Option

You can control namespace handling using the remove_ns parameter in your XML format configuration:

{
  "sources": [
    {
      "root_paths": [["ProductFeed", "Products", "Product"]],
      "remove_ns": true,  // Default: false
      "columns": [...]
    }
  ]
}

When to use `remove_ns: true`

For simpler (most) data extraction scenarios
When namespace information isn't relevant
When you want to use simpler XPath expressions

When to use `remove_ns: false` (default)

When working with elements that have the same local name but different namespaces
When namespace information is semantically important

Example

Consider this XML with namespaces:

<ns1:product xmlns:ns1="<http://example.com/ns1>" xmlns:ns2="<http://example.com/ns2>">
  <ns1:id>1</ns1:id>
  <ns1:name>Product One</ns1:name>
  <ns2:price currency="USD">19.99</ns2:price>
</ns1:product>

With remove_ns: true, the parser treats this as:

<product>
  <id>1</id>
  <name>Product One</name>
  <price currency="USD">19.99</price>
</product>

This makes it easier to extract data using simple paths like ["product", "id"].

With remove_ns: false (default), you would need to include the namespaces in your paths, which is more complex but preserves the full information. For example, you would need to use paths like ["ns1:product", "ns1:id"].

How Custom XML Input Formats work

SpecificXMLFormat allows you to define how to extract data from XML files by specifying:

Where to find product listings in the XML structure (root paths)
Which XML elements to extract as columns
How to name these columns
How to handle XML attributes

Workflow

Configuration Options

Sources

A SpecificXMLFormat can have multiple "sources" defined. Each source represents a different extraction pattern for the XML file. The system tries each source and uses the first one that matches.

For each source, you can define:

Root Paths

Root paths define where to find product listings in the XML structure. Each root path is a sequence of XML tags that leads to the elements containing product data.

The topmost parent element of the XML file (the document root) does not need to be included in the root paths. The system automatically starts searching from the document root.

Example configuration:

"root_paths": [["products", "product"]]

This would match all <product> elements under the <products> element in this XML:

<?xml version="1.0" encoding="UTF-8"?>
<products_feed>
  <products>
    <product>
      <id>1</id>
      <name>Product One</name>
    </product>
    <product>
      <id>2</id>
      <name>Product Two</name>
    </product>
  </products>
</products_feed>

Notice that products_feed (the document root) is not included in the root path, but the system still correctly finds the <product> elements.

Each <product> element is processed as a separate row in the resulting data.

XML data extraction focuses on repeating elements at a single structural level. Cross-referencing data from parent elements, sibling nodes, or different document sections is not currently supported.

Exclude Paths

Specify paths to exclude from parsing when you want to ignore certain elements within your product listings.

Example configuration:

"exclude": [["product", "internal_data"]]

With this XML:

<products>
  <product>
    <id>1</id>
    <name>Product One</name>
    <internal_data>
      <cost>10.00</cost>
      <supplier_id>S123</supplier_id>
    </internal_data>
    <price>19.99</price>
  </product>
</products>

The parser would extract id, name, and price, but would ignore the entire internal_data element and its children (cost and supplier_id).

Name Mode

The name mode determines how column names are generated from XML paths:

join: Joins all tags in the path using a separator (default: /)
last: Uses only the last tag in the path as the column name

Example configuration:

"name_mode": "join"

With this XML:

<product>
  <details>
    <color>Red</color>
    <size>Large</size>
  </details>
</product>

Using name_mode: "join" would create columns named:

details/color with value "Red"
details/size with value "Large"

While name_mode: "last" creates columns named:

color with value "Red"
size with value "Large"

Name Separator

When using the join name mode, this defines the separator used between tags.

Example configuration:

"name_mode": "join",
"name_separator": "_"

With the same XML as above, this creates columns named:

details_color with value "Red"
details_size with value "Large"

Name With Attributes

Determines whether XML attributes should be included in column names.

Example configuration:

"name_with_attributes": true

With this XML:

<product>
  <price currency="USD" tax="included">19.99</price>
  <weight unit="kg">1.5</weight>
</product>

This creates columns that include the attributes in their names:

price_currency:USD_tax:included with value "19.99"
weight_unit:kg with value "1.5"

Attributes Template

When name_with_attributes is true, this template defines how attributes are formatted in column names.

Example configuration:

"name_with_attributes": true,
"attributes_template": "-{name}({value})"

With the same XML, this creates a column named:

price-currency(USD)-tax(included) with value "19.99"

Columns

For each source, you define which columns to extract. Each column can be configured with:

Path

The path to the XML element to extract, relative to the root path.

Example configuration:

"path": ["name"]

With this XML:

<product>
  <id>1</id>
  <name>Product One</name>
  <price>19.99</price>
</product>

This extracts only the content of the <name> element, creating a column with the value "Product One".

Name

An optional custom name for the column. If not provided, the name is generated based on the path and name mode.

Example configuration:

"path": ["name"],
"name": "product_name"

With the same XML, instead of using "name" as the column name, it uses "product_name" for the column containing "Product One".

Name From / Value From

These options allow you to dynamically extract column names and values from the XML. This is useful for handling XML structures where the column names themselves are stored in the XML.

Example configuration:

"path": ["attributes", "attribute"],
"name_from": ["name"],
"value_from": ["value"]

With this XML:

<product>
  <id>1</id>
  <attributes>
    <attribute>
      <name>Color</name>
      <value>Red</value>
    </attribute>
    <attribute>
      <name>Size</name>
      <value>Large</value>
    </attribute>
    <attribute>
      <name>Material</name>
      <value>Cotton</value>
    </attribute>
  </attributes>
</product>

This dynamically creates three columns:

Color with value "Red"
Size with value "Large"
Material with value "Cotton"

This is useful for handling variable attribute sets where you don't know all possible attributes in advance.

XPath Predicates for Attribute Filtering

You can use XPath predicates in path elements to filter by attributes. This is particularly useful for multilingual XML files.

Supported predicates:

element[not(@attribute)] - Match elements WITHOUT a specific attribute
element[@attribute='value'] - Match elements with a specific attribute value
element[@attribute] - Match elements that HAVE a specific attribute (any value)

Example: Multilingual XML

Consider this XML with default and English translations:

<RECIPE>
  <TITLE>Spaghetti alla Carbonara</TITLE>
  <TITLE lang="eng">Pasta Carbonara</TITLE>
  <DESCRIPTION>Plat italien à base d'œufs, de pecorino et de pancetta.</DESCRIPTION>
  <DESCRIPTION lang="eng">Italian dish made with eggs, pecorino, and pancetta.</DESCRIPTION>
  <COOKTIME>20</COOKTIME>
</RECIPE>

You can extract specific language versions:

"columns": [
  {
    "path": ["RECIPE", "TITLE[not(@lang)]"],
    "name": "title_default"
  },
  {
    "path": ["RECIPE", "TITLE[@lang='eng']"],
    "name": "title_english"
  },
  {
    "path": ["RECIPE", "DESCRIPTION[not(@lang)]"],
    "name": "description_default"
  },
  {
    "path": ["RECIPE", "DESCRIPTION[@lang='eng']"],
    "name": "description_english"
  }
]

This creates four columns:

title_default: "Spaghetti alla Carbonara" (no lang attribute)
title_english: "Pasta Carbonara" (lang="eng")
description_default: "Plat italien à base d'œufs, de pecorino et de pancetta."
description_english: "Italian dish made with eggs, pecorino, and pancetta."

Using XPath Predicates with Exclude Paths

XPath predicates can also be used in the exclude array to filter out specific elements based on their attributes. This is particularly useful when you want to exclude certain language versions while keeping others.

Example: Excluding English language attributes while keeping German:

{
  "sources": [
    {
      "root_paths": [["CATALOG", "PRODUCT"]],
      "exclude": [
        ["PRODUCT_DETAILS", "DESCRIPTION[@lang='eng']"],
        ["PRODUCT_DETAILS", "KEYWORD[@lang='eng']"],
        ["PRODUCT_FEATURES", "FEATURE", "FNAME[@lang='eng']"],
        ["PRODUCT_FEATURES", "FEATURE", "FVALUE[@lang='eng']"]
      ],
      "columns": [{"path": [null]}]
    }
  ]
}

With this XML:

<PRODUCT>
  <PRODUCT_DETAILS>
    <DESCRIPTION>Deutsche Beschreibung</DESCRIPTION>
    <DESCRIPTION lang="deu">Deutsche Beschreibung</DESCRIPTION>
    <DESCRIPTION lang="eng">English Description</DESCRIPTION>
    <KEYWORD lang="deu">Schlüsselwort</KEYWORD>
    <KEYWORD lang="eng">Keyword</KEYWORD>
  </PRODUCT_DETAILS>
  <PRODUCT_FEATURES>
    <FEATURE>
      <FNAME>Eigenschaft</FNAME>
      <FNAME lang="deu">Eigenschaft</FNAME>
      <FNAME lang="eng">Property</FNAME>
      <FVALUE>Wert</FVALUE>
      <FVALUE lang="eng">Value</FVALUE>
    </FEATURE>
  </PRODUCT_FEATURES>
</PRODUCT>

The parser will exclude all elements with lang="eng", keeping only the default (no lang attribute) and lang="deu" versions. This results in extracting:

"Deutsche Beschreibung" (both default and lang="deu")
"Schlüsselwort" (lang="deu")
"Eigenschaft" (both default and lang="deu")
"Wert" (default, no lang attribute)

While excluding:

"English Description" (lang="eng")
"Keyword" (lang="eng")
"Property" (lang="eng")
"Value" (lang="eng")

Performance Note

XPath predicates are slightly slower than simple tag matching. Use them only where needed for best performance.

Example Configuration

Here's a complete example of a SpecificXMLFormat configuration:

{
  "name": "Product Catalog Format",
  "params": {
    "sources": [
      {
        "root_paths": [["catalog", "products", "product"]],
        "name_mode": "join",
        "name_separator": "_",
        "name_with_attributes": true,
        "attributes_template": "_{name}:{value}",
        "exclude": [["product", "internal"]],
        "remove_ns": true,
        "columns": [
          {
            "path": ["id"],
            "name": "product_id"
          },
          {
            "path": ["name"],
            "name": "product_name"
          },
          {
            "path": ["price"],
            "name_with_attributes": true
          },
          {
            "path": ["attributes", "attribute"],
            "name_from": ["name"],
            "value_from": ["value"]
          }
        ]
      }
    ]
  }
}

Processing Flow

Best Practices

Start with root paths: Correctly identify where your product data is located in the XML structure.
Use exclude paths to ignore irrelevant or sensitive data.
Choose the appropriate name mode:
- Use join when you want to preserve the full path hierarchy in column names
- Use last when you only care about the leaf element names
Use custom column names for clarity and consistency, especially when integrating with other systems.
Handle attributes appropriately: Use name_with_attributes and attributes_template to include XML attributes in your data.
Test with sample files: Always test your format configuration with representative sample files.
Consider namespace handling: Use the remove_ns parameter appropriately based on your XML structure and requirements.

Troubleshooting

Common Issues

No data extracted: Check that your root paths correctly identify the product elements.
Missing columns: Verify that the column paths are correct relative to the root paths.
Encoding issues: If you encounter encoding problems, specify the encoding explicitly when uploading the file.
Performance with large files: For very large XML files, consider using more specific root paths to limit the amount of data processed.
Namespace issues: If you're having trouble with namespaced XML:
- Try setting remove_ns: true to simplify parsing
- If you need to preserve namespaces (the default behavior), ensure your paths account for them

Advanced Usage

Dynamic Column Names

The name_from and value_from options allow for extracting data where the column names themselves are stored in the XML. This is particularly useful for handling attribute-value pair structures.

Example XML:

<product>
  <id>123</id>
  <attributes>
    <attribute>
      <name>Color</name>
      <value>Red</value>
    </attribute>
    <attribute>
      <name>Size</name>
      <value>Large</value>
    </attribute>
  </attributes>
</product>

Configuration:

{
  "path": ["attributes", "attribute"],
  "name_from": ["name"],
  "value_from": ["value"]
}

This would create columns named "Color" and "Size" with values "Red" and "Large" respectively.

Handling XML Attributes

XML attributes can contain important data. Use the name_with_attributes and attributes_template options to include them in your extracted data.

Example XML:

<product>
  <price currency="USD">19.99</price>
</product>

With name_with_attributes: true and attributes_template: "_{name}:{value}", this creates a column named "price_currency:USD" with the value "19.99".

Working with Namespaced XML

When working with heavily namespaced XML (like SOAP responses or industry-standard formats), you have two approaches:

Approach 1: Remove Namespaces (Simpler)

Set remove_ns: true to strip all namespace information:

{
  "sources": [
    {
      "root_paths": [["products", "product"]],
      "remove_ns": true,
      "columns": [...]
    }
  ]
}

This allows you to use simple paths like ["product", "id"] regardless of the original namespaces.

Approach 2: Preserve Namespaces (More Precise, Default)

Keep the default remove_ns: false to preserve namespace information:

{
  "sources": [
    {
      "root_paths": [["ns1:products", "ns1:product"]],
      "remove_ns": false,
      "columns": [{"path": ["ns1:id"]}, {"path": ["ns2:price"]}]
    }
  ]
}

This is useful when:

You need to distinguish between elements with the same name but different namespaces
You're validating against a schema
The semantic meaning of the namespaces is important to your data

Limitations

Please take a look at our limitations: What product file formats and size are accepted?

Akeneo Help Center

Overview

What is XML?

XML Namespaces

Handling Namespaces in SDM

The remove_ns Option

When to use remove_ns: true

When to use remove_ns: false (default)

Example

How Custom XML Input Formats work

Workflow

Configuration Options

Sources

Root Paths

Exclude Paths

Name Mode

Name Separator

Name With Attributes

Attributes Template

Columns

Path

Name

Name From / Value From

XPath Predicates for Attribute Filtering

Performance Note

Example Configuration

Processing Flow

Best Practices

Troubleshooting

Common Issues

Advanced Usage

Dynamic Column Names

Handling XML Attributes

Working with Namespaced XML

Approach 1: Remove Namespaces (Simpler)

Approach 2: Preserve Namespaces (More Precise, Default)

Limitations

Want to find out more? Take a look at these related articles

The `remove_ns` Option

When to use `remove_ns: true`

When to use `remove_ns: false` (default)