XML files and custom Input Formats

Summary

Overview

Custom XML Input Formats (more infrmation: Inputs) are predefined templates that specify how XML files should be processed. It allows for configurable extraction of structured data based on XML paths and element relationships.

What is XML?

XML is one of our supported input formats. While less commonnly used by clients than CSV or XLS, XML's flexible structure often requires more configuration to properly extract data. Our parsing system is designed to handle complex XML structures through customizable extraction rules that can navigate hierarchical data, process attributes, and manage namespaces. For example, a typical product XML might look like:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <id>1</id>
    <name>Product One</name>
    <price currency="USD">19.99</price>
    <categories>
      <category>Electronics</category>
      <category>Gadgets</category>
    </categories>
  </product>
  <product>
    <id>2</id>
    <name>Product Two</name>
    <price currency="USD">29.99</price>
    <categories>
      <category>Home</category>
      <category>Kitchen</category>
    </categories>
  </product>
</products>

XML Namespaces

XML namespaces help avoid element name conflicts when combining XML documents from different sources. They're defined using the xmlns attribute and are represented as URIs.

Example of an XML document with namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<ProductFeed xmlns="<http://www.example.com/xsd/product/feed/0.2>">
  <Products xmlns="<http://www.example.com/xsd/product/0.3>">
    <Product xmlns="<http://www.example.com/xsd/product/0.3>">
      <Id xmlns="<http://www.example.com/xsd/product/0.3>">1</Id>
      <Name xmlns="<http://www.example.com/xsd/product/0.3>">Product One</Name>
      <Price xmlns="<http://www.example.com/xsd/product/0.3>" currency="USD">19.99</Price>
    </Product>
  </Products>
</ProductFeed>

Handling Namespaces in SDM

When parsing XML files with namespaces, you have two options:

  1. Remove namespaces: Treats all elements as if they had no namespace, simplifying the parsing process. This works well for most data extraction scenarios.
  2. Preserve namespaces (default): Keeps the namespace information intact, which is necessary when dealing with elements that have the same local name but different namespaces.

The remove_ns Option

You can control namespace handling using the remove_ns parameter in your XML format configuration:

{
  "sources": [
    {
      "root_paths": [["ProductFeed", "Products", "Product"]],
      "remove_ns": true,  // Default: false
      "columns": [...]
    }
  ]
}

When to use remove_ns: true

  • For simpler (most) data extraction scenarios
  • When namespace information isn't relevant
  • When you want to use simpler XPath expressions

When to use remove_ns: false (default)

  • When working with elements that have the same local name but different namespaces
  • When namespace information is semantically important

Example

Consider this XML with namespaces:

<ns1:product xmlns:ns1="<http://example.com/ns1>" xmlns:ns2="<http://example.com/ns2>">
  <ns1:id>1</ns1:id>
  <ns1:name>Product One</ns1:name>
  <ns2:price currency="USD">19.99</ns2:price>
</ns1:product>

With remove_ns: true, the parser treats this as:

<product>
  <id>1</id>
  <name>Product One</name>
  <price currency="USD">19.99</price>
</product>

This makes it easier to extract data using simple paths like ["product", "id"].

With remove_ns: false (default), you would need to include the namespaces in your paths, which is more complex but preserves the full information. For example, you would need to use paths like ["ns1:product", "ns1:id"].

How Custom XML Input Formats work

SpecificXMLFormat allows you to define how to extract data from XML files by specifying:

  1. Where to find product listings in the XML structure (root paths)
  2. Which XML elements to extract as columns
  3. How to name these columns
  4. How to handle XML attributes

Workflow

Configuration Options

Sources

A SpecificXMLFormat can have multiple "sources" defined. Each source represents a different extraction pattern for the XML file. The system tries each source and uses the first one that matches.

For each source, you can define:

Root Paths

Root paths define where to find product listings in the XML structure. Each root path is a sequence of XML tags that leads to the elements containing product data.

The topmost parent element of the XML file (the document root) does not need to be included in the root paths. The system automatically starts searching from the document root.

 

Example configuration:

"root_paths": [["products", "product"]]

This would match all <product> elements under the <products> element in this XML:

<?xml version="1.0" encoding="UTF-8"?>
<products_feed>
  <products>
    <product>
      <id>1</id>
      <name>Product One</name>
    </product>
    <product>
      <id>2</id>
      <name>Product Two</name>
    </product>
  </products>
</products_feed>

Notice that products_feed (the document root) is not included in the root path, but the system still correctly finds the <product> elements.

Each <product> element is processed as a separate row in the resulting data.

Exclude Paths

Specify paths to exclude from parsing when you want to ignore certain elements within your product listings.

Example configuration:

"exclude": [["product", "internal_data"]]

With this XML:

<products>
  <product>
    <id>1</id>
    <name>Product One</name>
    <internal_data>
      <cost>10.00</cost>
      <supplier_id>S123</supplier_id>
    </internal_data>
    <price>19.99</price>
  </product>
</products>

The parser would extract id, name, and price, but would ignore the entire internal_data element and its children (cost and supplier_id).

Name Mode

The name mode determines how column names are generated from XML paths:

  • join: Joins all tags in the path using a separator (default: /)
  • last: Uses only the last tag in the path as the column name

Example configuration:

"name_mode": "join"

With this XML:

<product>
  <details>
    <color>Red</color>
    <size>Large</size>
  </details>
</product>

Using name_mode: "join" would create columns named:

  • details/color with value "Red"
  • details/size with value "Large"

While name_mode: "last" creates columns named:

  • color with value "Red"
  • size with value "Large"

Name Separator

When using the join name mode, this defines the separator used between tags.

Example configuration:

"name_mode": "join",
"name_separator": "_"

With the same XML as above, this creates columns named:

  • details_color with value "Red"
  • details_size with value "Large"

Name With Attributes

Determines whether XML attributes should be included in column names.

Example configuration:

"name_with_attributes": true

With this XML:

<product>
  <price currency="USD" tax="included">19.99</price>
  <weight unit="kg">1.5</weight>
</product>

This creates columns that include the attributes in their names:

  • price_currency:USD_tax:included with value "19.99"
  • weight_unit:kg with value "1.5"

Attributes Template

When name_with_attributes is true, this template defines how attributes are formatted in column names.

Example configuration:

"name_with_attributes": true,
"attributes_template": "-{name}({value})"

With the same XML, this creates a column named:

  • price-currency(USD)-tax(included) with value "19.99"

Columns

For each source, you define which columns to extract. Each column can be configured with:

Path

The path to the XML element to extract, relative to the root path.

Example configuration:

"path": ["name"]

With this XML:

<product>
  <id>1</id>
  <name>Product One</name>
  <price>19.99</price>
</product>

This extracts only the content of the <name> element, creating a column with the value "Product One".

Name

An optional custom name for the column. If not provided, the name is generated based on the path and name mode.

Example configuration:

"path": ["name"],
"name": "product_name"

With the same XML, instead of using "name" as the column name, it uses "product_name" for the column containing "Product One".

Name From / Value From

These options allow you to dynamically extract column names and values from the XML. This is useful for handling XML structures where the column names themselves are stored in the XML.

Example configuration:

"path": ["attributes", "attribute"],
"name_from": ["name"],
"value_from": ["value"]

With this XML:

<product>
  <id>1</id>
  <attributes>
    <attribute>
      <name>Color</name>
      <value>Red</value>
    </attribute>
    <attribute>
      <name>Size</name>
      <value>Large</value>
    </attribute>
    <attribute>
      <name>Material</name>
      <value>Cotton</value>
    </attribute>
  </attributes>
</product>

This dynamically creates three columns:

  • Color with value "Red"
  • Size with value "Large"
  • Material with value "Cotton"

This is useful for handling variable attribute sets where you don't know all possible attributes in advance.

Example Configuration

Here's a complete example of a SpecificXMLFormat configuration:

{
  "name": "Product Catalog Format",
  "params": {
    "sources": [
      {
        "root_paths": [["catalog", "products", "product"]],
        "name_mode": "join",
        "name_separator": "_",
        "name_with_attributes": true,
        "attributes_template": "_{name}:{value}",
        "exclude": [["product", "internal"]],
        "remove_ns": true,
        "columns": [
          {
            "path": ["id"],
            "name": "product_id"
          },
          {
            "path": ["name"],
            "name": "product_name"
          },
          {
            "path": ["price"],
            "name_with_attributes": true
          },
          {
            "path": ["attributes", "attribute"],
            "name_from": ["name"],
            "value_from": ["value"]
          }
        ]
      }
    ]
  }
}

Processing Flow

Best Practices

  1. Start with root paths: Correctly identify where your product data is located in the XML structure.
  2. Use exclude paths to ignore irrelevant or sensitive data.
  3. Choose the appropriate name mode:
    • Use join when you want to preserve the full path hierarchy in column names
    • Use last when you only care about the leaf element names
  4. Use custom column names for clarity and consistency, especially when integrating with other systems.
  5. Handle attributes appropriately: Use name_with_attributes and attributes_template to include XML attributes in your data.
  6. Test with sample files: Always test your format configuration with representative sample files.
  7. Consider namespace handling: Use the remove_ns parameter appropriately based on your XML structure and requirements.

Troubleshooting

Common Issues

  1. No data extracted: Check that your root paths correctly identify the product elements.
  2. Missing columns: Verify that the column paths are correct relative to the root paths.
  3. Encoding issues: If you encounter encoding problems, specify the encoding explicitly when uploading the file.
  4. Performance with large files: For very large XML files, consider using more specific root paths to limit the amount of data processed.
  5. Namespace issues: If you're having trouble with namespaced XML:
    • Try setting remove_ns: true to simplify parsing
    • If you need to preserve namespaces (the default behavior), ensure your paths account for them

Advanced Usage

Dynamic Column Names

The name_from and value_from options allow for extracting data where the column names themselves are stored in the XML. This is particularly useful for handling attribute-value pair structures.

Example XML:

<product>
  <id>123</id>
  <attributes>
    <attribute>
      <name>Color</name>
      <value>Red</value>
    </attribute>
    <attribute>
      <name>Size</name>
      <value>Large</value>
    </attribute>
  </attributes>
</product>

Configuration:

{
  "path": ["attributes", "attribute"],
  "name_from": ["name"],
  "value_from": ["value"]
}

This would create columns named "Color" and "Size" with values "Red" and "Large" respectively.

Handling XML Attributes

XML attributes can contain important data. Use the name_with_attributes and attributes_template options to include them in your extracted data.

Example XML:

<product>
  <price currency="USD">19.99</price>
</product>

With name_with_attributes: true and attributes_template: "_{name}:{value}", this creates a column named "price_currency:USD" with the value "19.99".

Working with Namespaced XML

When working with heavily namespaced XML (like SOAP responses or industry-standard formats), you have two approaches:

Approach 1: Remove Namespaces (Simpler)

Set remove_ns: true to strip all namespace information:

{
  "sources": [
    {
      "root_paths": [["products", "product"]],
      "remove_ns": true,
      "columns": [...]
    }
  ]
}

This allows you to use simple paths like ["product", "id"] regardless of the original namespaces.

Approach 2: Preserve Namespaces (More Precise, Default)

Keep the default remove_ns: false to preserve namespace information:

{
  "sources": [
    {
      "root_paths": [["ns1:products", "ns1:product"]],
      "remove_ns": false,
      "columns": [{"path": ["ns1:id"]}, {"path": ["ns2:price"]}]
    }
  ]
}

This is useful when:

  • You need to distinguish between elements with the same name but different namespaces
  • You're validating against a schema
  • The semantic meaning of the namespaces is important to your data