Python: Convert RTF file to Unicode?
I am trying to convert the lines in an RTF file into a series of Unicode strings, and then run a regex match on those lines. (I need them to be Unicode so that I can output them to another file.)
However, my regex match is not working, I think because the lines are not being converted to Unicode properly.
This is my code:
    import re

    usefulLines = []
    textData = {}

    # the regex pattern for an entry in the DB (e.g. SUF 76,22): it is enough
    # for us to match on three upper-case characters plus a space
    entryPattern = '^([A-Z]{3})[\s].*$'

    f = open('textbase_1a.rtf', 'Ur')
    fileLines = f.readlines()

    # get the matching line numbers, and store them in usefulLines
    for i, line in enumerate(fileLines):
        #line = line.decode('utf-16be')  # this causes an error: I don't really know what encoding the RTF file is in...
        line = line.decode('mac_roman')
        print line
        if re.match(entryPattern, line):
            # now retrieve the following lines, all the way up until we get a blank line
            print "match: " + str(i)
            usefulLines.append(i)
At the moment, this prints all the lines, but it prints nothing for the match, though it should match. In addition, for some reason the lines are printed with '\par' in them, and when I try to write them to an output file they look very strange.
Part of the problem is that I do not know what encoding to specify.
If I use entryPattern = '^.*$' then I do get matches.
Can someone help?
You have not decoded the RTF file. RTF files are not just simple text files. For example, a file containing "äöü" looks like this when opened in a text editor:
{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par
}
So the characters "äöü" have been encoded as windows-1252, as declared at the beginning of the file by \ansicpg1252 (äöü = 0xE4 0xF6 0xFC).
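To make the encoding point concrete, here is a minimal Python 3 sketch that reads the code page from the \ansicpg control word and decodes the \'xx escapes with it. The function names are hypothetical, and the "cp" + number mapping is an assumption that holds for 1252 but not for every code page RTF can declare (Mac Roman, for instance, is declared as \ansicpg10000 but is called mac_roman in Python).

```python
import re

def declared_codec(rtf, default="cp1252"):
    """Guess a Python codec name from the \\ansicpgN control word (assumption: 'cp' + N exists)."""
    m = re.search(r"\\ansicpg(\d+)", rtf)
    return "cp" + m.group(1) if m else default

def decode_hex_escapes(rtf, codec):
    """Replace each run of \\'xx escapes with the characters those bytes encode."""
    def repl(match):
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec)
    return re.sub(r"(?:\\'[0-9a-fA-F]{2})+", repl, rtf)

header = r"{\rtf1\ansi\ansicpg1252\deff0\deflang1031"
body = r"\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par"

codec = declared_codec(header)
decoded = decode_hex_escapes(body, codec)
print(codec)
print(decoded)
```

This only resolves the escaped bytes. The control words such as \pard and \par are still in the string, which is why lines read straight from the file look strange and do not match the pattern.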
To read the RTF, you first need something that converts RTF to plain text (this has already been asked about).
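As a rough illustration of that conversion step, the Python 3 sketch below strips the RTF markup from a small sample and then applies the pattern from the question to the resulting lines. It is a naive sketch, not a full RTF parser: it assumes a single-byte code page, ignores \u Unicode escapes, and skips only the destinations listed in SKIPPED. The names TOKEN, SKIPPED and naive_rtf_to_text are hypothetical, and the "SUF 76,22" line was added to the sample purely for illustration.

```python
import re

TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"      # 1: hex-escaped byte such as \'e4
    r"|\\([a-z]+)(-?\d+)? ?"    # 2, 3: control word, optional number, optional delimiter space
    r"|\\([^a-z])"              # 4: control symbol such as \* or \{
    r"|([{}])"                  # 5: group delimiters
    r"|[\r\n]+"                 # raw line breaks carry no meaning in RTF
    r"|(.)"                     # 6: plain character
)
# Destinations whose contents are not document text (an incomplete, assumed list).
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info", "generator"}

def naive_rtf_to_text(rtf, codec="cp1252"):
    out = []
    ignoring = False   # True while inside a group we want to drop
    stack = []         # saved 'ignoring' state of the enclosing groups
    for m in TOKEN.finditer(rtf):
        hex_byte, word, _param, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            ignoring = stack.pop() if stack else False
        elif symbol == "*" or word in SKIPPED:
            ignoring = True
        elif ignoring:
            continue
        elif hex_byte:
            out.append(bytes([int(hex_byte, 16)]).decode(codec))
        elif word in ("par", "line"):
            out.append("\n")
        elif symbol in ("\\", "{", "}"):
            out.append(symbol)
        elif char:
            out.append(char)
    return "".join(out)

SAMPLE = r"""{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 SUF 76,22\par
\'e4\'f6\'fc\par
}
"""

entry_pattern = re.compile(r"^([A-Z]{3})\s.*$")
text = naive_rtf_to_text(SAMPLE)
useful_lines = [i for i, line in enumerate(text.splitlines())
                if entry_pattern.match(line)]
print(repr(text))
print(useful_lines)
```

Once the markup is gone, the text is already a Unicode str in Python 3, so the regex can be applied directly and the lines written to another file with whatever encoding is wanted. For real documents, a dedicated RTF converter is likely to be more reliable than this sketch.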