Overview
Custom XML Input Formats (more infrmation: Inputs) are predefined templates that specify how XML files should be processed. It allows for configurable extraction of structured data based on XML paths and element relationships.
What is XML?
XML is one of our supported input formats. While less commonnly used by clients than CSV or XLS, XML's flexible structure often requires more configuration to properly extract data. Our parsing system is designed to handle complex XML structures through customizable extraction rules that can navigate hierarchical data, process attributes, and manage namespaces. For example, a typical product XML might look like:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<id>1</id>
<name>Product One</name>
<price currency="USD">19.99</price>
<categories>
<category>Electronics</category>
<category>Gadgets</category>
</categories>
</product>
<product>
<id>2</id>
<name>Product Two</name>
<price currency="USD">29.99</price>
<categories>
<category>Home</category>
<category>Kitchen</category>
</categories>
</product>
</products>
XML Namespaces
XML namespaces help avoid element name conflicts when combining XML documents from different sources. They're defined using the xmlns
attribute and are represented as URIs.
Example of an XML document with namespaces:
<?xml version="1.0" encoding="UTF-8"?>
<ProductFeed xmlns="<http://www.example.com/xsd/product/feed/0.2>">
<Products xmlns="<http://www.example.com/xsd/product/0.3>">
<Product xmlns="<http://www.example.com/xsd/product/0.3>">
<Id xmlns="<http://www.example.com/xsd/product/0.3>">1</Id>
<Name xmlns="<http://www.example.com/xsd/product/0.3>">Product One</Name>
<Price xmlns="<http://www.example.com/xsd/product/0.3>" currency="USD">19.99</Price>
</Product>
</Products>
</ProductFeed>
Handling Namespaces in SDM
When parsing XML files with namespaces, you have two options:
- Remove namespaces: Treats all elements as if they had no namespace, simplifying the parsing process. This works well for most data extraction scenarios.
- Preserve namespaces (default): Keeps the namespace information intact, which is necessary when dealing with elements that have the same local name but different namespaces.
The remove_ns
Option
You can control namespace handling using the remove_ns
parameter in your XML format configuration:
{
"sources": [
{
"root_paths": [["ProductFeed", "Products", "Product"]],
"remove_ns": true, // Default: false
"columns": [...]
}
]
}
When to use remove_ns: true
- For simpler (most) data extraction scenarios
- When namespace information isn't relevant
- When you want to use simpler XPath expressions
When to use remove_ns: false
(default)
- When working with elements that have the same local name but different namespaces
- When namespace information is semantically important
Example
Consider this XML with namespaces:
<ns1:product xmlns:ns1="<http://example.com/ns1>" xmlns:ns2="<http://example.com/ns2>">
<ns1:id>1</ns1:id>
<ns1:name>Product One</ns1:name>
<ns2:price currency="USD">19.99</ns2:price>
</ns1:product>
With remove_ns: true
, the parser treats this as:
<product>
<id>1</id>
<name>Product One</name>
<price currency="USD">19.99</price>
</product>
This makes it easier to extract data using simple paths like ["product", "id"]
.
With remove_ns: false
(default), you would need to include the namespaces in your paths, which is more complex but preserves the full information. For example, you would need to use paths like ["ns1:product", "ns1:id"]
.
How Custom XML Input Formats work
SpecificXMLFormat allows you to define how to extract data from XML files by specifying:
- Where to find product listings in the XML structure (root paths)
- Which XML elements to extract as columns
- How to name these columns
- How to handle XML attributes
Workflow

Configuration Options
Sources
A SpecificXMLFormat can have multiple "sources" defined. Each source represents a different extraction pattern for the XML file. The system tries each source and uses the first one that matches.
For each source, you can define:
Root Paths
Root paths define where to find product listings in the XML structure. Each root path is a sequence of XML tags that leads to the elements containing product data.
The topmost parent element of the XML file (the document root) does not need to be included in the root paths. The system automatically starts searching from the document root.
Example configuration:
"root_paths": [["products", "product"]]
This would match all <product>
elements under the <products>
element in this XML:
<?xml version="1.0" encoding="UTF-8"?>
<products_feed>
<products>
<product>
<id>1</id>
<name>Product One</name>
</product>
<product>
<id>2</id>
<name>Product Two</name>
</product>
</products>
</products_feed>
Notice that products_feed
(the document root) is not included in the root path, but the system still correctly finds the <product>
elements.
Each <product>
element is processed as a separate row in the resulting data.
Exclude Paths
Specify paths to exclude from parsing when you want to ignore certain elements within your product listings.
Example configuration:
"exclude": [["product", "internal_data"]]
With this XML:
<products>
<product>
<id>1</id>
<name>Product One</name>
<internal_data>
<cost>10.00</cost>
<supplier_id>S123</supplier_id>
</internal_data>
<price>19.99</price>
</product>
</products>
The parser would extract id
, name
, and price
, but would ignore the entire internal_data
element and its children (cost
and supplier_id
).
Name Mode
The name mode determines how column names are generated from XML paths:
-
join
: Joins all tags in the path using a separator (default:/
) -
last
: Uses only the last tag in the path as the column name
Example configuration:
"name_mode": "join"
With this XML:
<product>
<details>
<color>Red</color>
<size>Large</size>
</details>
</product>
Using name_mode: "join"
would create columns named:
-
details/color
with value "Red" -
details/size
with value "Large"
While name_mode: "last"
creates columns named:
-
color
with value "Red" -
size
with value "Large"
Name Separator
When using the join
name mode, this defines the separator used between tags.
Example configuration:
"name_mode": "join",
"name_separator": "_"
With the same XML as above, this creates columns named:
-
details_color
with value "Red" -
details_size
with value "Large"
Name With Attributes
Determines whether XML attributes should be included in column names.
Example configuration:
"name_with_attributes": true
With this XML:
<product>
<price currency="USD" tax="included">19.99</price>
<weight unit="kg">1.5</weight>
</product>
This creates columns that include the attributes in their names:
-
price_currency:USD_tax:included
with value "19.99" -
weight_unit:kg
with value "1.5"
Attributes Template
When name_with_attributes
is true
, this template defines how attributes are formatted in column names.
Example configuration:
"name_with_attributes": true,
"attributes_template": "-{name}({value})"
With the same XML, this creates a column named:
-
price-currency(USD)-tax(included)
with value "19.99"
Columns
For each source, you define which columns to extract. Each column can be configured with:
Path
The path to the XML element to extract, relative to the root path.
Example configuration:
"path": ["name"]
With this XML:
<product>
<id>1</id>
<name>Product One</name>
<price>19.99</price>
</product>
This extracts only the content of the <name>
element, creating a column with the value "Product One".
Name
An optional custom name for the column. If not provided, the name is generated based on the path and name mode.
Example configuration:
"path": ["name"],
"name": "product_name"
With the same XML, instead of using "name" as the column name, it uses "product_name" for the column containing "Product One".
Name From / Value From
These options allow you to dynamically extract column names and values from the XML. This is useful for handling XML structures where the column names themselves are stored in the XML.
Example configuration:
"path": ["attributes", "attribute"],
"name_from": ["name"],
"value_from": ["value"]
With this XML:
<product>
<id>1</id>
<attributes>
<attribute>
<name>Color</name>
<value>Red</value>
</attribute>
<attribute>
<name>Size</name>
<value>Large</value>
</attribute>
<attribute>
<name>Material</name>
<value>Cotton</value>
</attribute>
</attributes>
</product>
This dynamically creates three columns:
-
Color
with value "Red" -
Size
with value "Large" -
Material
with value "Cotton"
This is useful for handling variable attribute sets where you don't know all possible attributes in advance.
Example Configuration
Here's a complete example of a SpecificXMLFormat configuration:
{
"name": "Product Catalog Format",
"params": {
"sources": [
{
"root_paths": [["catalog", "products", "product"]],
"name_mode": "join",
"name_separator": "_",
"name_with_attributes": true,
"attributes_template": "_{name}:{value}",
"exclude": [["product", "internal"]],
"remove_ns": true,
"columns": [
{
"path": ["id"],
"name": "product_id"
},
{
"path": ["name"],
"name": "product_name"
},
{
"path": ["price"],
"name_with_attributes": true
},
{
"path": ["attributes", "attribute"],
"name_from": ["name"],
"value_from": ["value"]
}
]
}
]
}
}
Processing Flow

Best Practices
- Start with root paths: Correctly identify where your product data is located in the XML structure.
- Use exclude paths to ignore irrelevant or sensitive data.
-
Choose the appropriate name mode:
- Use
join
when you want to preserve the full path hierarchy in column names - Use
last
when you only care about the leaf element names
- Use
- Use custom column names for clarity and consistency, especially when integrating with other systems.
-
Handle attributes appropriately: Use
name_with_attributes
andattributes_template
to include XML attributes in your data. - Test with sample files: Always test your format configuration with representative sample files.
-
Consider namespace handling: Use the
remove_ns
parameter appropriately based on your XML structure and requirements.
Troubleshooting
Common Issues
- No data extracted: Check that your root paths correctly identify the product elements.
- Missing columns: Verify that the column paths are correct relative to the root paths.
- Encoding issues: If you encounter encoding problems, specify the encoding explicitly when uploading the file.
- Performance with large files: For very large XML files, consider using more specific root paths to limit the amount of data processed.
-
Namespace issues: If you're having trouble with namespaced XML:
- Try setting
remove_ns: true
to simplify parsing - If you need to preserve namespaces (the default behavior), ensure your paths account for them
- Try setting
Advanced Usage
Dynamic Column Names
The name_from
and value_from
options allow for extracting data where the column names themselves are stored in the XML. This is particularly useful for handling attribute-value pair structures.
Example XML:
<product>
<id>123</id>
<attributes>
<attribute>
<name>Color</name>
<value>Red</value>
</attribute>
<attribute>
<name>Size</name>
<value>Large</value>
</attribute>
</attributes>
</product>
Configuration:
{
"path": ["attributes", "attribute"],
"name_from": ["name"],
"value_from": ["value"]
}
This would create columns named "Color" and "Size" with values "Red" and "Large" respectively.
Handling XML Attributes
XML attributes can contain important data. Use the name_with_attributes
and attributes_template
options to include them in your extracted data.
Example XML:
<product>
<price currency="USD">19.99</price>
</product>
With name_with_attributes: true
and attributes_template: "_{name}:{value}"
, this creates a column named "price_currency:USD" with the value "19.99".
Working with Namespaced XML
When working with heavily namespaced XML (like SOAP responses or industry-standard formats), you have two approaches:
Approach 1: Remove Namespaces (Simpler)
Set remove_ns: true
to strip all namespace information:
{
"sources": [
{
"root_paths": [["products", "product"]],
"remove_ns": true,
"columns": [...]
}
]
}
This allows you to use simple paths like ["product", "id"]
regardless of the original namespaces.
Approach 2: Preserve Namespaces (More Precise, Default)
Keep the default remove_ns: false
to preserve namespace information:
{
"sources": [
{
"root_paths": [["ns1:products", "ns1:product"]],
"remove_ns": false,
"columns": [{"path": ["ns1:id"]}, {"path": ["ns2:price"]}]
}
]
}
This is useful when:
- You need to distinguish between elements with the same name but different namespaces
- You're validating against a schema
- The semantic meaning of the namespaces is important to your data