优化一个被多次调用的简单解析器

https://stackoverflow.com//questions/24063219

02-01-2020
|

题

我使用以下命令为自定义文件编写了一个解析器 attoparsec. 。分析报告表明，大约 67% 的内存分配是在名为 tab, ，这也是消耗最多时间的。这 tab 功能非常简单：

tab :: Parser Char
tab = char '\t'

整个分析报告如下：

       ASnapshotParser +RTS -p -h -RTS

    total time  =       37.88 secs   (37882 ticks @ 1000 us, 1 processor)
    total alloc = 54,255,105,384 bytes  (excludes profiling overheads)

COST CENTRE    MODULE                %time %alloc

tab            Main                   83.1   67.7
main           Main                    6.4    4.2
readTextDevice Data.Text.IO.Internal   5.5   24.0
snapshotParser Main                    4.7    4.0


                                                             individual     inherited
COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                     75           0    0.0    0.0   100.0  100.0
 CAF               Main                    149           0    0.0    0.0   100.0  100.0
  tab              Main                    156           1    0.0    0.0     0.0    0.0
  snapshotParser   Main                    153           1    0.0    0.0     0.0    0.0
  main             Main                    150           1    6.4    4.2   100.0  100.0
   doStuff         Main                    152     1000398    0.3    0.0    88.1   71.8
    snapshotParser Main                    154           0    4.7    4.0    87.7   71.7
     tab           Main                    157           0   83.1   67.7    83.1   67.7
   readTextDevice  Data.Text.IO.Internal   151       40145    5.5   24.0     5.5   24.0
 CAF               Data.Text.Array         142           0    0.0    0.0     0.0    0.0
 CAF               Data.Text.Internal      140           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.FD        122           0    0.0    0.0     0.0    0.0
 CAF               GHC.Conc.Signal         103           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding         101           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.FD               100           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding.Iconv    89           0    0.0    0.0     0.0    0.0
  main             Main                    155           0    0.0    0.0     0.0    0.0

我该如何优化这个？

整个代码因为解析器在这里。我正在解析的文件大约 77MB。

解决方案

tab 是替罪羊。如果你定义 boo :: Parser (); boo = return () 并插入一个 boo 在每次绑定之前 snapshotParser 定义，成本分配将类似于：

 main             Main                    255           0   11.8   13.8   100.0  100.0
  doStuff         Main                    258     2097153    1.1    0.5    86.2   86.2
   snapshotParser Main                    260           0    0.4    0.1    85.1   85.7
    boo           Main                    262           0   71.0   73.2    84.8   85.5
     tab          Main                    265           0   13.8   12.3    13.8   12.3

因此，分析器似乎正在转移解析结果分配的责任，这可能是由于广泛内联 attoparsec 代码，正如 John L 在评论中建议的那样。

至于性能问题，关键是，当您解析 77MB 文本文件以构建包含一百万个元素的列表时，您希望文件处理是惰性的，而不是严格的。一旦解决了这个问题，就可以解耦 I/O 并进行解析 doStuff 在没有累加器的情况下构建快照列表也很有帮助。这是考虑到这一点的程序的修改版本。

{-# LANGUAGE BangPatterns #-}
module Main where

import Data.Maybe
import Data.Attoparsec.Text.Lazy
import Control.Applicative
import qualified Data.Text.Lazy.IO as TL
import Data.Text (Text)
import qualified Data.Text.Lazy as TL

buildStuff :: TL.Text -> [Snapshot]
buildStuff text = case maybeResult (parse endOfInput text) of
  Just _ -> []
  Nothing -> case parse snapshotParser text of
      Done !i !r -> r : buildStuff i
      Fail _ _ _ -> []

main :: IO ()
main = do
  text <- TL.readFile "./snap.dat"
  let ss = buildStuff text
  print $ listToMaybe ss
    >> Just (fromIntegral (length $ show ss) / fromIntegral (length ss))

newtype VehicleId = VehicleId Int deriving Show
newtype Time = Time Int deriving Show
newtype LinkID = LinkID Int deriving Show
newtype NodeID = NodeID Int deriving Show
newtype LaneID = LaneID Int deriving Show

tab :: Parser Char
tab = char '\t'

-- UNPACK pragmas. GHC 7.8 unboxes small strict fields automatically;
-- however, it seems we still need the pragmas while profiling. 
data Snapshot = Snapshot {
  vehicle :: {-# UNPACK #-} !VehicleId,
  time :: {-# UNPACK #-} !Time,
  link :: {-# UNPACK #-} !LinkID,
  node :: {-# UNPACK #-} !NodeID,
  lane :: {-# UNPACK #-} !LaneID,
  distance :: {-# UNPACK #-} !Double,
  velocity :: {-# UNPACK #-} !Double,
  vehtype :: {-# UNPACK #-} !Int,
  acceler :: {-# UNPACK #-} !Double,
  driver :: {-# UNPACK #-} !Int,
  passengers :: {-# UNPACK #-} !Int,
  easting :: {-# UNPACK #-} !Double,
  northing :: {-# UNPACK #-} !Double,
  elevation :: {-# UNPACK #-} !Double,
  azimuth :: {-# UNPACK #-} !Double,
  user :: {-# UNPACK #-} !Int
  } deriving (Show)

-- No need for bang patterns here.
snapshotParser :: Parser Snapshot
snapshotParser = do
  sveh <- decimal
  tab
  stime <- decimal
  tab
  slink <- decimal
  tab
  snode <- decimal
  tab
  slane <- decimal
  tab
  sdistance <- double
  tab
  svelocity <- double
  tab
  svehtype <- decimal
  tab
  sacceler <- double
  tab
  sdriver <- decimal
  tab
  spassengers <- decimal
  tab
  seasting <- double
  tab
  snorthing <- double
  tab
  selevation <- double
  tab
  sazimuth <- double
  tab
  suser <- decimal
  endOfLine <|> endOfInput
  return $ Snapshot
    (VehicleId sveh) (Time stime) (LinkID slink) (NodeID snode)
    (LaneID slane) sdistance svelocity svehtype sacceler sdriver
    spassengers seasting snorthing selevation sazimuth suser

即使您将整个快照列表强制放入内存，此版本也应该具有可接受的性能，就像我在 main 这里。要衡量什么是“可接受的”，请记住，考虑到每个字段中有 16 个（小，未装箱）字段 Snapshot 加上高架的 Snapshot 和列表构造函数，我们讨论的是每个列表单元 152 字节，这对于您的测试数据来说大约为 152MB。无论如何，这个版本是尽可能懒惰的，正如您通过删除中的除法所看到的那样 main, ，或将其替换为 last ss.

注意：我的测试是使用 attoparsec-0.12 完成的。

其他提示

将 attoparsec 更新到最新版本后（0.12.0.0），执行时间从 38 秒减少到 16 秒。加速速度超过 50%。它消耗的内存也急剧减少。正如@JohnL 指出的，启用分析后，结果差异很大。当我尝试使用最新版本的 attoparsec 库对其进行分析时，执行整个程序大约需要 64 秒。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow