从文法产生的句子与终端的给定数量

https://stackoverflow.com/questions/966576

12-09-2019
|

题

假设你已经有了一个玩具语法，如：（更新使输出看起来更自然）的

S -> ${NP} ${VP} | ${S} and ${S} | ${S}, after which ${S}

NP -> the ${N} | the ${A} ${N} | the ${A} ${A} ${N}

VP -> ${V} ${NP}

N -> dog | fish | bird | wizard

V -> kicks | meets | marries

A -> red | striped | spotted

如，“狗踢向导红色”，“鸟满足斑鱼或向导结婚条纹狗”

如何从这个语法根据它必须包含一个总的名词 VS +作为+ Ns的约束产生一个句子。给定一个整数的句子必须包含许多终端。（注意，当然在此语法的最小可能的名词的是3）。

解决方案

以下Python代码将生成具有终端的给定数量的随机句子。它的工作原理通过计数的方法来产生给定长度的一个句子的数量，产生一个大的随机数，并计算指示句子。计数完成递归，与记忆化。一个空的右侧产生1句如果n是0，否则为0的句子。以计算一个非空的右侧产生的句子的数量，求和I，在右手侧使用的第一符号的终端的数目。对于每个i，乘的可能性的数量为右手侧的通过用于第一符号可能性的数目的其余部分。如果第一符号是一个终端，有1种可能性，如果i是1，否则为0。如果第一符号是一个非末端，总结在每个替代的可能性。为了避免无限循环，我们必须要小心，当数量为0修剪递归调用。这可能还是无限循环，如果语法有一个句子的无限多的推导。例如，在语法

S -> S S
S ->

有空句子的无限多的推导：S => S =>取值S => S =>取值S => S S S =>等找到一个特定的句子中的代码是代码的简单修改指望他们。这个代码是相当有效的，产生在不到一秒钟的100个句子与每个100分的端子。

import collections
import random

class Grammar:
    def __init__(self):
        self.prods = collections.defaultdict(list)
        self.numsent = {}
        self.weight = {}

    def prod(self, lhs, *rhs):
        self.prods[lhs].append(rhs)
        self.numsent.clear()

    def countsent(self, rhs, n):
        if n < 0:
            return 0
        elif not rhs:
            return 1 if n == 0 else 0
        args = (rhs, n)
        if args not in self.numsent:
            sym = rhs[0]
            rest = rhs[1:]
            total = 0
            if sym in self.prods:
                for i in xrange(1, n + 1):
                    numrest = self.countsent(rest, n - i)
                    if numrest > 0:
                        for rhs1 in self.prods[sym]:
                            total += self.countsent(rhs1, i) * numrest
            else:
                total += self.countsent(rest, n - self.weight.get(sym, 1))
            self.numsent[args] = total
        return self.numsent[args]

    def getsent(self, rhs, n, j):
        assert 0 <= j < self.countsent(rhs, n)
        if not rhs:
            return ()
        sym = rhs[0]
        rest = rhs[1:]
        if sym in self.prods:
            for i in xrange(1, n + 1):
                numrest = self.countsent(rest, n - i)
                if numrest > 0:
                    for rhs1 in self.prods[sym]:
                        dj = self.countsent(rhs1, i) * numrest
                        if dj > j:
                            j1, j2 = divmod(j, numrest)
                            return self.getsent(rhs1, i, j1) + self.getsent(rest, n - i, j2)
                        j -= dj
            assert False
        else:
            return (sym,) + self.getsent(rest, n - self.weight.get(sym, 1), j)

    def randsent(self, sym, n):
        return self.getsent((sym,), n, random.randrange(self.countsent((sym,), n)))

if __name__ == '__main__':
    g = Grammar()
    g.prod('S', 'NP', 'VP')
    g.prod('S', 'S', 'and', 'S')
    g.prod('S', 'S', 'after', 'which', 'S')
    g.prod('NP', 'the', 'N')
    g.prod('NP', 'the', 'A', 'N')
    g.prod('NP', 'the', 'A', 'A', 'N')
    g.prod('VP', 'V', 'NP')
    g.prod('N', 'dog')
    g.prod('N', 'fish')
    g.prod('N', 'bird')
    g.prod('N', 'wizard')
    g.prod('V', 'kicks')
    g.prod('V', 'meets')
    g.prod('V', 'marries')
    g.prod('A', 'red')
    g.prod('A', 'striped')
    g.prod('A', 'spotted')
    g.weight.update({'and': 0, 'after': 0, 'which': 0, 'the': 0})
    for i in xrange(100):
        print ' '.join(g.randsent('S', 3))

其他提示

也许不是最好的解决办法，但我会通过递归语法的每个规则的工作我的方式，直到我已经超过了约束，则弹回，并探讨沿着语法的路径。保留所有满足约束的句子，扔掉所有不句子。

例如，其中n = 3：

的S - >（$ {NP} $ {VP}） - >（（使用$ {N}）$ {VP}） - >（（所述（狗）$ {VP}） - > ... - >（（所述（狗）（（踢腿）（的$ {NP}）））） - >（（所述（狗）（（踢腿）（第（狗）））））

和然后弹回

（（所述（狗）（（踢腿）（在$ {N}）））） - >（（所述（狗）（（踢腿）（第（鱼）））））

和一小会儿后...

（（所述（狗）（$ {V} $ {N}））） - >（（所述（狗）（（符合）$ {N}））） - >（（所述（狗）（（符合）的（狗））））

等

基本上深度优先图搜索，只有正在构建的曲线图你正在寻找它（和你停止构建超过约束份）。

此问题包含一个类别错误。您指定的语法具有上下文无关语法的外观，但存在成为终端节点的特定数目的要求需要递归可枚举语法。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow