Вопрос

I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums, tried both lxml and xmlutils.xml2csv, but I can't get it to work. It's GPS data from a Garmin GPS device.

Here's what my XML file looks like, shortened of course:

<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
  <trk>
      <name>2013-12-03T21:08:56Z</name>
      <trkseg>
          <trkpt lat="45.4852855" lon="-122.6347885">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:08:56Z</time>
          </trkpt>
          <trkpt lat="45.4852961" lon="-122.6347926">
              <ele>0.0000000</ele>
              <time>2013-12-03T21:09:00Z</time>
          </trkpt>
          <trkpt lat="45.4852982" lon="-122.6347897">
              <ele>0.2000000</ele>
              <time>2013-12-03T21:09:01Z</time>
          </trkpt>
      </trkseg>
  </trk>
</gpx>

There are several trk tags in my massive XML file, but I can manage to separate them out -- they represent different "segments" or trips on the GPS device. All I want is a CSV file that plots something like this:

LAT         LON         TIME         ELE
45.4...     -122.6...   2013-12...   0.00...
...         ...         ...          ...

Here's the code I have so far:

## Call libraries
import csv
from xmlutils.xml2csv import xml2csv

inputs = "myfile.xml"
output = "myfile.csv"

converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")

This is another alternative code. It merely outputs a CSV file with no data, just the headers lat and lon.

import csv
import lxml.etree

x = '''
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
  <name>2013-12-03T21:08:56Z</name>
  <trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:08:56Z</time>
    </trkpt>
    <trkpt lat="45.4852961" lon="-122.6347926">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:09:00Z</time>
    </trkpt>
    <trkpt lat="45.4852982" lon="-122.6347897">
      <ele>0.2000000</ele>
      <time>2013-12-03T21:09:01Z</time>
    </trkpt>
  </trkseg>
</trk>
</gpx>
'''

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)

How do I do this? Please realize I'm a novice, so a more comprehensive explanation would be super awesome!

Это было полезно?

Решение

This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.

The namespaces used in the document are defined at the top:

xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"

So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which doesn't have a short form after the xmlns, is called the default namespace, and it applies to all elements in the document that don't explicitely use a namespace - so it applies to your <trkpt /> elements as well.

Therefore you would need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.

In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon). Also, because the time and ele elements also use the default namespace you'll have to use the {namespace}element syntax again to select those nodes.

So you could use something like this:

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('{%s}trkpt' % NS):
        lat = trkpt.get('lat')
        lon = trkpt.get('lon')
        ele = trkpt.find('{%s}ele' % NS).text
        time = trkpt.find('{%s}time' % NS).text

        row = lat, lon, ele, time
        writer.writerow(row)

For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML Namespaces. Also see GPS eXchange Format for some details on the .gpx format.

Другие советы

Apologies for using already-made tools here, but this did the job with your data :

  1. Convert XML to JSON : http://convertjson.com/xml-to-json.htm
  2. Take that JSON and convert JSON to CSV : https://konklone.io/json/

It worked like a charm with your data.

ele,time,_lat,_lon
0.0000000,2013-12-03T21:08:56Z,45.4852855,-122.6347885
0.0000000,2013-12-03T21:09:00Z,45.4852961,-122.6347926
0.2000000,2013-12-03T21:09:01Z,45.4852982,-122.6347897

So for coding, I reckon XML > JSON > CSV may be a good approach. You many find the relevant scripts pointed to in those links.

I wrote gpxcsv exactly for this case, and to handle gpx extension fields that other converters omit.

As easy as:

from gpxcsv import gpxtolist
import pandas as pd

df = pd.DataFrame(
    pxtolist('myfile.gpx'))

for a dataframe, or a command line tool exists to just create a .csv or .json file, preserving as many columns in the trackpoint as it finds using the tags as the column names.

Source code of the project on github.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top