gaborous's answer uses `git fast-import`, which could fail on log messages not encoded in UTF-8.
That will work better with Git 2.23 (Q2 2019): The "`git fast-export/import`" pair has been taught to handle commits with log messages in encoding other than UTF-8 better.
See commit e80001f, commit 57a8be2, commit ccbfc96, commit 3edfcc6, commit 32615ce (14 May 2019) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 66dc7b6, 13 Jun 2019)
`fast-export`: do automatic reencoding of commit messages only if requested
Automatic re-encoding of commit messages (and dropping of the encoding header) hurts attempts to do reversible history rewrites (e.g. sha1sum <-> sha256sum transitions, some subtree rewrites), and seems inconsistent with the general principle followed elsewhere in `fast-export` of requiring explicit user requests to modify the output (e.g. `--signed-tags=strip`, `--tag-of-filtered-object=rewrite`).
Add a `--reencode` flag that the user can use to specify, and like other fast-export flags, default it to 'abort'.
That means `Documentation/git-fast-export` now includes:
--reencode=(yes|no|abort)::
Specify how to handle the `encoding` header in commit objects.
- When asking to 'abort' (which is the default), this program will die when encountering such a commit object.
- With 'yes', the commit message will be reencoded into UTF-8.
- With 'no', the original encoding will be preserved.
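As a rough illustration of how the three values might be used when copying history between repositories, here is a minimal Python sketch. The helper names `fast_export_argv` and `copy_history` are hypothetical (they are not part of Git); only the `git fast-export --reencode=...` and `git fast-import` commands come from the documentation quoted above, and the sketch assumes Git 2.23 or later is on the `PATH`.

```python
import subprocess

# Values accepted by --reencode, per the documentation quoted above.
VALID_REENCODE = ("yes", "no", "abort")


def fast_export_argv(repo, reencode="abort"):
    """Build the argv for exporting every ref of `repo` (hypothetical helper)."""
    if reencode not in VALID_REENCODE:
        raise ValueError(f"--reencode must be one of {VALID_REENCODE}, got {reencode!r}")
    return ["git", "-C", repo, "fast-export", f"--reencode={reencode}", "--all"]


def copy_history(src_repo, dst_repo, reencode="yes"):
    """Pipe `git fast-export` from src_repo into `git fast-import` in dst_repo."""
    exporter = subprocess.Popen(
        fast_export_argv(src_repo, reencode), stdout=subprocess.PIPE
    )
    try:
        subprocess.run(
            ["git", "-C", dst_repo, "fast-import"],
            stdin=exporter.stdout,
            check=True,
        )
    finally:
        exporter.stdout.close()
        returncode = exporter.wait()
    # With the default 'abort', the exporter exits non-zero on a commit
    # that carries an encoding header, so surface that failure here.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, "git fast-export")
```

Passing `reencode="no"` would be the choice for a reversible rewrite, since the original bytes and the encoding header are then left alone.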
`fast-export`: avoid stripping encoding header if we cannot reencode
When `fast-export` encounters a commit with an 'encoding' header, it tries to reencode in UTF-8 and then drops the encoding header.
However, if it fails to reencode in UTF-8 because e.g. one of the characters in the commit message was invalid in the old encoding, then we need to retain the original encoding or otherwise we lose information needed to understand all the other (valid) characters in the original commit message.
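The fallback described above can be modeled in a few lines of Python. This is only a sketch of the idea, not Git's actual C implementation; the function name `reencode_or_keep` is hypothetical, and Python's codecs stand in for the iconv-based conversion Git performs.

```python
def reencode_or_keep(message, encoding):
    """Return (message_bytes, encoding_header).

    If `message` decodes cleanly from `encoding`, return its UTF-8 form
    and None (the header can be dropped). Otherwise return the original
    bytes together with the original encoding name, so that no
    information about the valid characters is lost.
    """
    try:
        text = message.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Invalid byte for this encoding, or unknown encoding name:
        # keep the message and its header untouched.
        return message, encoding
    return text.encode("utf-8"), None


# A valid iso-8859-7 message is converted and loses its header.
print(reencode_or_keep(b"Pi: \xf0", "iso-8859-7"))
# A message containing a byte that is undefined in iso-8859-7 (0xFF)
# is passed through unchanged, header included.
print(reencode_or_keep(b"Pi: \xf0 \xff", "iso-8859-7"))
```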
`fast-import`: support 'encoding' commit header
Since git supports commit messages with an encoding other than UTF-8, allow `fast-import` to import such commits.
This may be useful for folks who do not want to reencode commit messages from an external system, and may also be useful to achieve reversible history rewrites (e.g. sha1sum <-> sha256sum transitions or subtree work) with Git repositories that have used specialized encodings in their commit history.
`Documentation/git-fast-import` now includes:
`encoding`
The optional `encoding` command indicates the encoding of the commit message.
Most commits are UTF-8 and the encoding is omitted, but this allows importing commit messages into git without first reencoding them.
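To make the placement of the `encoding` command concrete, here is a minimal Python sketch that builds a `commit` command for a fast-import stream. The helper `commit_command` is hypothetical, and the ref name, committer identity and timestamp are made-up illustration values; the sketch assumes the `encoding` line sits between the `committer` line and the `data` command, and it emits a commit with no file changes.

```python
def commit_command(ref, committer, message, encoding=None):
    """Build the bytes of a fast-import `commit` command (hypothetical helper).

    `message` is passed as raw bytes so that it can stay in its original
    encoding; `encoding` names that encoding, or is None for UTF-8.
    """
    parts = [
        b"commit " + ref.encode("ascii") + b"\n",
        b"committer " + committer.encode("utf-8") + b"\n",
    ]
    if encoding is not None:
        # Optional: tells fast-import which encoding the message uses.
        parts.append(b"encoding " + encoding.encode("ascii") + b"\n")
    # `data <count>` is followed by exactly <count> raw bytes.
    parts.append(b"data %d\n" % len(message))
    parts.append(message + b"\n")
    return b"".join(parts)


stream = commit_command(
    "refs/heads/main",
    "A U Thor <author@example.com> 1234567890 +0000",
    "Pi: π".encode("iso-8859-7"),
    encoding="iso-8859-7",
)
print(stream)
```

The resulting bytes could be written to the standard input of `git fast-import` run inside an empty repository.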
See the test that uses an author with non-ASCII characters in the name, but no special commit message.
It checks that the reencoding into UTF-8 worked by checking the size of the resulting commit object:
The commit object, if not re-encoded, would be 240 bytes.
- Removing the "`encoding iso-8859-7\n`" header drops 20 bytes.
- Re-encoding the Pi character π from `\xF0` (`\360`) in iso-8859-7 to `\xCF\x80` (`\317\200`) in UTF-8 adds a byte.
Check for the expected size.
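The arithmetic behind that expected size can be reproduced with a short Python sketch. The 240-byte starting figure is taken from the test description above; the variable names are only illustrative.

```python
# Size of the commit object if it were not re-encoded (from the test description).
original_size = 240

# The header that is dropped once the message is in UTF-8.
header = b"encoding iso-8859-7\n"

# The same character in both encodings.
pi_iso = "π".encode("iso-8859-7")   # b'\xf0'      (octal \360)
pi_utf8 = "π".encode("utf-8")       # b'\xcf\x80'  (octal \317\200)

expected_size = original_size - len(header) + (len(pi_utf8) - len(pi_iso))

print(len(header))      # 20
print(expected_size)    # 221
```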
And with Git 2.29 (Q4 2020), the pack header created for import is better managed.
See commit 7744a5d, commit 014f144, commit ccb181d (06 Sep 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 9b80744, 18 Sep 2020)
`fast-import`: use `write_pack_header()`
Signed-off-by: René Scharfe
Call `write_pack_header()` to hash and write a pack header instead of open-coding this function.
This gets rid of duplicate code and of the magic version number 2 -- which has been used here since c90be46abd ("Changed fast-import's pack header creation to use `pack.h`", 2006-08-16, Git v1.5.0-rc4 -- merge) and in `pack.h` (again) since 29f049a0c2 (Revert "move pack creation to version 3", 2006-10-14, Git v1.4.3).