Read Unicode characters from command-line arguments in Python 2.x on Windows

https://stackoverflow.com/questions/846850

21-08-2019

Question

I want my Python script to be able to read Unicode command-line arguments on Windows. However, sys.argv appears to be a list of byte strings encoded in some local encoding rather than Unicode. How can I read the command line in full Unicode?

Code example: argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC, which is set up for a Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

I believe that is Shift-JIS encoded, and it "works" for that file name. But it breaks for file names containing characters that are not in the Shift-JIS character set, and the final "open" call fails:

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'
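
To make the Shift-JIS observation concrete, here is a minimal sketch (written to run on Python 2.7 or 3.x) that encodes the first file name with Python's cp932 codec and then tries the second one. It assumes cp932 is the code page in play; the question only says the PC is set up for a Japanese code page. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import binascii

# \u30fb is the katakana middle dot used in the file name above.
name = u"PC\u30fbソフト申請書08.09.24.doc"

# Every character of this name exists in cp932, so encoding succeeds.
hex_bytes = binascii.hexlify(name.encode("cp932")).decode("ascii")
print(hex_bytes)

# U+00F6 (o with diaeresis) has no cp932 representation, so a strict
# encode raises instead of producing bytes.
try:
    u"J\u00f6rgen.txt".encode("cp932")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

The first print reproduces the hex dump shown above. The strict encode of the second name fails, whereas the traceback above suggests that Windows substituted a plain "o" when it converted the command line, which is why the file could not be found.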

Note: I'm talking about Python 2.x here, not Python 3.0. I've found that Python 3.0 provides sys.argv as proper Unicode. But it's still a bit early to move to Python 3.0 (because of the lack of third-party library support).

Update:

A few answers have said I should decode sys.argv according to whatever encoding it is in. The problem is that this encoding is not full Unicode, so some characters cannot be represented.
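
As a small illustration of why decoding cannot help here, this sketch decodes the exact bytes from the Jörgen.txt run above. cp932 is again an assumption about the code page; the point holds for any codec, because the original character is no longer in the bytes.

```python
from __future__ import print_function
import binascii

# Bytes the script actually received, copied from the hex dump above.
received = binascii.unhexlify("4a6f7267656e2e747874")

# Decoding succeeds, but it can only return what was received.
decoded = received.decode("cp932")
print(repr(decoded))
```

The decoded value contains a plain "o", so no choice of encoding on the Python side brings the original file name back.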

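Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some that are not in the system default code page. When a character cannot be represented in the current code page encoding, my Python script does not receive the correct Unicode file name through sys.argv.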

Surely there is some Windows API to read the command line in full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.

Solution

Here is a solution that is just what I'm looking for. It calls the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

However, I've made several changes to simplify its usage and to handle certain uses better. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses kernel32.GetCommandLineW and shell32.CommandLineToArgvW to get
    sys.argv as a list of Unicode strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """

    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply to do:

import sys
import win32_unicode_argv

From then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
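
For illustration, here is a minimal sketch of feeding a list of Unicode strings to optparse. The argument list is hard-coded so the snippet runs on any platform and on Python 2.7 or 3.x; on Windows it would come from sys.argv after importing win32_unicode_argv. The parse_args helper and the -o/--output option are hypothetical, not part of the original script.

```python
from __future__ import print_function
from optparse import OptionParser

def parse_args(argv):
    """Parse a sys.argv-style list whose items are Unicode strings."""
    parser = OptionParser(usage="%prog [options] FILE...")
    # Hypothetical option, only here to show a Unicode option value.
    parser.add_option("-o", "--output", dest="output", default=None,
                      help="path to write to")
    # Skip argv[0], the script name, as optparse does by default.
    return parser.parse_args(argv[1:])

# Stand-in for the Unicode sys.argv produced by win32_unicode_argv.
fake_argv = [u"argv.py", u"-o", u"r\u00e9sum\u00e9.txt", u"J\u00f6rgen.txt"]
options, paths = parse_args(fake_argv)
print(repr(options.output), repr(paths))
```

Both the option value and the positional arguments come back as the same Unicode strings that went in, so they can be passed straight to open().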

Other tips

Dealing with encodings is very confusing.

I believe that when you enter data on the command line, it is encoded in whatever your system encoding is, not as Unicode. (Even copy/paste should do this.)

So it should be correct to decode it to Unicode using the system encoding:

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running the following produces the output below:
Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

The file "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I encoded the file as utf8 using Windows Notepad. I'm a little stumped as to why there is a "?" at the beginning when printing. Does it have something to do with how Notepad saves utf8?)
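
A likely explanation for the leading "?", offered as an assumption rather than something verified on the answerer's machine: Notepad writes a UTF-8 byte order mark (U+FEFF) at the start of the file, the "utf-8" codec keeps it as a character, and that character has no representation in the system encoding. Python's "utf-8-sig" codec strips it. The sketch below uses io.open in place of codecs.open so that it runs unchanged on Python 2.7 and 3.x, and a temporary file stands in for the file saved by Notepad.

```python
from __future__ import print_function
import codecs
import io
import os
import tempfile

# Create a file that starts with a UTF-8 BOM, like one saved by Notepad.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + u"\u65e5\u672c\u8a9e".encode("utf-8"))

    # "utf-8" keeps the BOM as a leading U+FEFF character.
    with io.open(path, "r", encoding="utf-8") as f:
        with_bom = f.read()

    # "utf-8-sig" drops a leading BOM if one is present.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

codecs.open accepts the same "utf-8-sig" encoding name, so changing the encoding argument in the script above should be enough to remove the stray character.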

You can use the string's "decode" method or the built-in unicode() to convert an encoded byte string to Unicode.

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

Also, if you are dealing with encoded files, you may want to use codecs.open() instead of the built-in open(). It lets you specify the file's encoding and then transparently decodes the content to Unicode using that encoding.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be a unicode object.

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

If I'm misunderstanding something, please let me know.

If you haven't already, I recommend reading Joel's article on Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

You may have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
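
Rather than reading the registry by hand, the active code pages can also be queried at run time. The sketch below is one way to do that, assuming the kernel32 functions GetACP and GetOEMCP are reachable through ctypes; the helper name candidate_encodings is hypothetical. Which of the two code pages applies to a given byte string is something to verify on the machine in question, and on other platforms the sketch simply falls back to locale.getpreferredencoding().

```python
from __future__ import print_function
import locale
import sys

def candidate_encodings():
    """Return codec names worth trying when decoding byte-string arguments."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        # GetACP: ANSI code page; GetOEMCP: OEM (console) code page.
        return ["cp%d" % kernel32.GetACP(), "cp%d" % kernel32.GetOEMCP()]
    # Non-Windows fallback so the sketch runs anywhere.
    return [locale.getpreferredencoding()]

print(candidate_encodings())
```

Decoding with either name only recovers characters that exist in that code page, so this does not address the file names described in the question's update.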

The command line might be in a Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]

License: CC-BY-SA with attribution
