Using Go to parse non-UTF8 XML feeds

Posted by – October 6, 2014

As a learning exercise, I'm rewriting my Nowplaying Clojure web application in Go. In Clojure, the clojure.xml package handled this non-UTF-8 XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<nexgen_audio_export>
  <audio ID="id_1667331726_30393658">
    <type>Song</type>
    <status>Playing</status>
    <played_time>09:41:18</played_time>
    <composer>Frederic Delius</composer>
    <title>Violin Sonata No.1</title>
    <artist>Tasmin Little, violin; Piers Lane, piano</artist>
  </audio>
</nexgen_audio_export>

without complaint, but in Go I got this error:

xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil

when I tried my first version:

type Piece struct {
  Title    string
  Composer string
}
 
type SecondInversionFeed struct {
  XMLName xml.Name             `xml:"nexgen_audio_export"`
  Audio   SecondInversionAudio `xml:"audio"`
}
 
type SecondInversionAudio struct {
  Title    string `xml:"title"`
  Composer string `xml:"composer"`
}
 
func translateSecondInversion(data []byte) Piece {
  var feed SecondInversionFeed
  err := xml.Unmarshal(data, &feed)
  if err != nil {
    log.Fatalf("Unmarshal error: %v", err)
  }
  return Piece{feed.Audio.Title, feed.Audio.Composer}
}

I read this Stack Overflow thread a few times, but I still wasn't sure how to use go-charset or some other library to accomplish my task.

I first tried using go-charset to translate the file to UTF-8 and then pass the result to Unmarshal, but the declaration at the top:

  <?xml version="1.0" encoding="ISO-8859-1"?>

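still caused the same error, because the decoder reacts to the declared encoding rather than to the actual bytes. I then realized that the Unmarshal function simply creates a new Decoder with no CharsetReader set, so I had to create the Decoder myself and assign charset.NewReader to its CharsetReader field. The xml package then calls that function to translate my XML data.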
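To make the failure concrete, here is a minimal sketch that uses only the standard library. The names utf8Feed and unmarshalTitle are made up for this illustration. The data is plain ASCII, so it is already valid UTF-8, yet Unmarshal still refuses it because of the declaration:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// utf8Feed is already valid UTF-8 (plain ASCII, in fact), but its
// declaration still claims ISO-8859-1. Hypothetical sample data.
var utf8Feed = []byte(`<?xml version="1.0" encoding="ISO-8859-1"?>
<nexgen_audio_export>
  <audio><title>Violin Sonata No.1</title></audio>
</nexgen_audio_export>`)

// unmarshalTitle goes through xml.Unmarshal, which builds its own Decoder
// and leaves CharsetReader nil.
func unmarshalTitle(data []byte) (string, error) {
	var feed struct {
		Title string `xml:"audio>title"`
	}
	err := xml.Unmarshal(data, &feed)
	return feed.Title, err
}

func main() {
	_, err := unmarshalTitle(utf8Feed)
	fmt.Println("error:", err)
}
```

The error comes back even though no byte in the input needs converting, which is why pre-translating the file was not enough on its own.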

Here is a small program that demonstrates my approach:

package main
 
import (
  "bytes"
  "code.google.com/p/go-charset/charset"
  _ "code.google.com/p/go-charset/data"
  "encoding/xml"
  "fmt"
)
 
type Feed struct {
  XMLName xml.Name  `xml:"nexgen_audio_export"`
  Audio   FeedAudio `xml:"audio"`
}
 
type FeedAudio struct {
  Title    string `xml:"title"`
  Composer string `xml:"composer"`
}
 
func main() {
  // The XML spec expects the declaration at the very start of the document,
  // so the raw string begins right after the opening backtick.
  xmlData := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?>
  <nexgen_audio_export>
    <audio ID="id_1667331726_30393658">
      <type>Song</type>
      <status>Playing</status>
      <played_time>09:41:18</played_time>
      <composer>Frederic Delius</composer>
      <title>Violin Sonata No.1</title>
      <artist>Tasmin Little, violin; Piers Lane, piano</artist>
    </audio>
  </nexgen_audio_export>
`)
  var feed Feed

  // xml.Unmarshal gives no way to set CharsetReader, so build the Decoder by hand.
  decoder := xml.NewDecoder(bytes.NewReader(xmlData))
  decoder.CharsetReader = charset.NewReader
  if err := decoder.Decode(&feed); err != nil {
    fmt.Println("decoder error:", err)
    return
  }
  fmt.Println(feed.Audio.Title)
}