The XML 1.0 and 1.1 standards define some ranges of Unicode code points which are valid, and some "compatibility characters" which should not be used. CDS Invenio (a FOSS CMS) already has some code to clean up text to remove invalid characters, but it doesn't remove the compatibility characters. Using the existing code for HTML 4.01 made the W3C Markup Validation Service complain, so I wanted to exclude the compatibility character ranges from the valid ranges, and get the most concise hexadecimal ranges corresponding to the resulting set to plug into a Python regular expression. Here's the resultingsloppy and ugly code (I'll post updated code and/or a link to the source repository if this is included at some point):

 # -*- coding: utf-8 -*- ## Copyright (C) 2009 CERN. ##
  ## This file is free software; you can redistribute it and/or ## modify it
  under the terms of the GNU General Public License as ## published by the Free
  Software Foundation; either version 2 of the ## License, or (at your option)
  any later version. ## ## CDS Invenio is distributed in the hope that it will
  be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of
  ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General
  Public License for more details. ## ## You should have received a copy of the
  GNU General Public License ## along with CDS Invenio; if not, write to the
  Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA
  02111-1307, USA. """Creates the minimal set of Unicode character ranges for
  valid XML 1.0 and 1.1 characters minus the compatibility changes"""
  INCLUDE_XML10 = "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] \ |
  [#x10000-#x10FFFF]" EXCLUDE_XML10 = "[#x7F-#x84], [#x86-#x9F],
  [#xFDD0-#xFDEF], \ [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \
  [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], \ [#x7FFFE-#x7FFFF],
  [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], \ [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF],
  [#xCFFFE-#xCFFFF], \ [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
  \ [#x10FFFE-#x10FFFF]" INCLUDE_XML11 = "[#x1-#xD7FF] | [#xE000-#xFFFD] |
  [#x10000-#x10FFFF]" EXCLUDE_XML11 = "[#x1-#x8], [#xB-#xC], [#xE-#x1F],
  [#x7F-#x84], [#x86-#x9F], \ [#xFDD0-#xFDDF], \ [#x1FFFE-#x1FFFF],
  [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], \ [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF],
  [#x6FFFE-#x6FFFF], \ [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
  \ [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], \
  [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], \
  [#x10FFFE-#x10FFFF]" def cleanup(value): """Prepare string for conversion to
  hex ranges @param value: String with ranges @return: String with ranges"""
  return value.replace('#', '0').translate(None, '[]') def list_to_range(value):
  """Convert a list of strings (ranges and not) @param value: List of strings
  corresponding to hexadecimal numbers and ranges @return: List of numbers"""
  result = [] for item in value: if item.find('-') == -1:
  result.append(int(item, 16)) else: numbers = [int(hex_str, 16) for hex_str in
  item.split('-')] result.extend(range(numbers[0], numbers[1] + 1)) return
  result def range_minus(include_range, exclude_range): """Subtract one range
  from another @param include_range: String from
  http://www.w3.org/TR/xml/#charsets or http://www.w3.org/TR/xml11/#charsets
  @param exclude_range: Ditto @return: String with hex numbers and ranges"""
  include_range = cleanup(include_range) includes = include_range.split(' | ')
  exclude_range = cleanup(exclude_range) excludes = exclude_range.split(', ')
  include_numbers = list_to_range(includes) exclude_numbers =
  list_to_range(excludes) numbers = set([ number for number in include_numbers
  if number not in exclude_numbers]) lows = [ number for number in numbers if
  number - 1 not in numbers] highs = [ number for number in numbers if number +
  1 not in numbers] result = zip(lows, highs) result_hex = [ '\\U%0*X-\\U%0*X' %
  (8, pair[0], 8, pair[1]) for pair in result] result_hex = [ text.replace('-' +
  text[:10], '') for text in result_hex] # Single ranges result_hex = [
  text.replace('\\U0000', '\\u') for text in result_hex] # Shorten where
  possible return '\n'.join(result_hex) print 'XML 1.0:\n' +
  range_minus(INCLUDE_XML10, EXCLUDE_XML10) + '\n' print 'XML 1.1:\n' +
  range_minus(INCLUDE_XML11, EXCLUDE_XML11) 

In case you just want the results, here you go:

XML 1.0:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDF0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD

XML 1.1:
\u0009-\u000A
\u000D
\u0020-\u007E
\u0085
\u00A0-\uD7FF
\uE000-\uFDCF
\uFDE0-\uFFFD
\U00010000-\U0001FFFD
\U00020000-\U0002FFFD
\U00030000-\U0003FFFD
\U00040000-\U0004FFFD
\U00050000-\U0005FFFD
\U00060000-\U0006FFFD
\U00070000-\U0007FFFD
\U00080000-\U0008FFFD
\U00090000-\U0009FFFD
\U000A0000-\U000AFFFD
\U000B0000-\U000BFFFD
\U000C0000-\U000CFFFD
\U000D0000-\U000DFFFD
\U000E0000-\U000EFFFD
\U000F0000-\U000FFFFD
\U00100000-\U0010FFFD