Aggressive reduplication and dissimilation in Sundanese

Most cases of long-distance consonant dissimilation can be characterized as local (occurring across a vowel) or unbounded (occurring at all distances). The only known exception is rhotic dissimilation in Sundanese (Cohn 1992; Bennett 2015a,b), which applies in certain non-local contexts only. Following a suggestion by Zuraw (2002:433), I show that the pattern can be analyzed in a co-occurrence-based framework (Suzuki 1998) by invoking two unbounded co-occurrence constraints, *[r]...[r] and *[l]...[l], whose effects in local contexts are obscured by a drive for identity between adjacent syllables. Statistical trends in the lexicon are consistent with this analysis. I compare the predictions of this analysis to those of Bennett’s (2015a,b) and suggest that the present proposal is preferable.


Introduction
Most cases of long-distance consonant dissimilation can be characterized as local or unbounded. In the local cases, alternations occur only across a single vowel (or, alternatively, between adjacent syllables; the difference between these characterizations is not important here). An example of local dissimilation comes from Yimas (Foley 1991), where the inchoative suffix /aRa/ dissimilates to [ata] given an [R]-final root (1b) but not otherwise (1c).
(3) Non-local-only rhotic dissimilation in Sundanese (Cohn 1992)
My interest lies in how the Sundanese data bear on predictions of two competing theories of dissimilation. The theories are Suzuki's (1998) Generalized OCP (or GOCP), which treats dissimilation as the result of antisimilarity constraints, and Bennett's (2015) Surface Correspondence Theory of Dissimilation (or SCTD), which treats dissimilation as a way of avoiding similarity-based surface correspondence. Both theories can generate non-local-only dissimilation, but they do so in different ways. Under the GOCP, non-local-only dissimilation is only possible given the interaction of a preference for unbounded dissimilation with an overriding dispreference for the result of local dissimilation. The SCTD, by contrast, provides an explicit provision for non-local-only dissimilation: cases like (3) can be generated directly, without appealing to any dispreference for the result of local dissimilation. The remainder of this section introduces the GOCP and SCTD and explicates their predictions regarding the character of non-local dissimilation. Following this I show that the Sundanese case, previewed in (3), is consistent with the more restrictive predictions of the GOCP.
Suzuki's (1998) GOCP proposes that dissimilation is motivated by constraints of the form *X…Y, where X and Y are entities whose co-occurrence is dispreferred (for earlier constraint-based analyses of dissimilation see Holton 1995; Alderete 1997; Myers 1997; a.o.). Each *X…Y constraint stands for a family of constraints, where "…" denotes intervening material of differing lengths. To explore the theory's predictions regarding non-local dissimilation, we will consider two constraints from the *[-lateral]…[-lateral] family: one that penalizes co-occurring rhotics separated by only a mora (4), and one that penalizes each pair of rhotics occurring within the word (5).
(Throughout this paper I assume that laterals are [+lateral], rhotics are [-lateral], and that no other segments are specified for [±lateral].)
A factorial typology of (4-5), together with Ident-[±lateral] ("assign one violation for each input [αlateral] segment whose output correspondent is [-αlateral]"), predicts two kinds of dissimilation: local (as in Yimas) and unbounded (as in Georgian). Cases of non-local-only dissimilation are not predicted, as neither *[r]µ[r] nor *[r]…[r] penalizes only non-local co-occurrence (6).
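The factorial typology described here can be checked mechanically. The following is a minimal sketch, not the paper's own computation: the inputs /rara/ and /ratara/ are hypothetical stand-ins for a local and a non-local context, the candidate set is restricted to outputs that map each /r/ to [r] or [l], and the function names are illustrative only.

```python
from itertools import permutations, product

VOWELS = set("aeiou")  # assumption: each vowel counts as one mora

def local_rr(form):
    """*[r]µ[r]: pairs of [r] separated by exactly one vowel."""
    return sum(1 for i in range(len(form) - 2)
               if form[i] == "r" and form[i + 1] in VOWELS and form[i + 2] == "r")

def unbounded_rr(form):
    """*[r]…[r]: every pair of [r]s in the word."""
    n = form.count("r")
    return n * (n - 1) // 2

def ident_lateral(inp, out):
    """Ident-[±lateral]: segments whose [±lateral] value changed."""
    return sum(1 for a, b in zip(inp, out) if a != b)

CONSTRAINTS = {
    "*[r]µ[r]": lambda inp, out: local_rr(out),
    "*[r]…[r]": lambda inp, out: unbounded_rr(out),
    "Ident-[±lateral]": ident_lateral,
}

def candidates(inp):
    """Toy Gen: each /r/ may surface as [r] or [l]; nothing else changes."""
    slots = [("r", "l") if seg == "r" else (seg,) for seg in inp]
    return ["".join(c) for c in product(*slots)]

def winners(inp, ranking):
    """Candidates with the best violation profile under a strict ranking."""
    def profile(out):
        return tuple(CONSTRAINTS[name](inp, out) for name in ranking)
    cands = candidates(inp)
    best = min(profile(c) for c in cands)
    return [c for c in cands if profile(c) == best]

def typology(local_input="rara", nonlocal_input="ratara"):
    """Set of (dissimilates locally?, dissimilates non-locally?) over all rankings."""
    patterns = set()
    for ranking in permutations(CONSTRAINTS):
        loc = local_input not in winners(local_input, ranking)
        non = nonlocal_input not in winners(nonlocal_input, ranking)
        patterns.add((loc, non))
    return patterns

print(sorted(typology()))
```

Under these assumptions the six rankings yield three patterns: no dissimilation, local dissimilation only, and dissimilation in both contexts. The pattern (False, True), dissimilation in the non-local context only, does not arise.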

Non-local dissimilation in the GOCP
Factorial typology of (4), (5)
The main point is that non-local dissimilation is not a basic prediction of the GOCP. Rather, it emerges from an interaction of constraints that prefer unbounded dissimilation with others that disprefer the result of local dissimilation. Note that these other constraints need not promote local assimilation, as the role played by *[αlat]µ[-αlat] above can be played by any other constraint that disprefers (8b). To give another example, (8b) could also be ruled out by a positional faithfulness constraint that protects the root-initial segment; in such a case we might expect to find other evidence of positional faithfulness to the root-initial segment. A related case occurs in Zulu (Beckman 1998; Bennett 2015b), where labial palatalization triggered by a suffixed /w/ (11a) fails to apply if the targeted labial is root-initial (11b). External evidence suggesting that root-initial consonants are privileged comes from the larger inventory of consonants licensed initially and the fact that long-distance laryngeal harmony is controlled by the root-initial consonant (Hansson 2010:122-126; see Beckman 1998; Nevins & Levine 2012 on initial syllable faithfulness).

Non-local dissimilation in the SCTD
In Bennett's (2015b) SCTD, dissimilation avoids an otherwise required correspondence relation among consonants. Correspondence between surface segments is required by a set of Corr·[F] constraints, which penalize pairs of segments sharing some featural specification [F] that do not stand in correspondence with one another (see also Rose & Walker 2004; Hansson 2010).
Under this theory, long-distance consonant assimilation and dissimilation are two sides of the same coin: the same constraints that generate assimilation also generate dissimilation. The SCTD thus predicts a set of relationships between the typologies of long-distance assimilation and dissimilation (Bennett 2015b: Ch. 9).
(15) CC·SyllAdj: 'Cs in the same correspondence class must inhabit a contiguous span of syllables' (≈ 'correspondence cannot skip across an inert intervening syllable')
For each distinct pair of output consonants X and Y, assign a violation if:
a. X and Y are in the same surface correspondence class
b. X and Y are in distinct syllables, Σx and Σy
c. there is some syllable Σz that precedes Σy, and is preceded by Σx
d. Σz contains no members of the same surface correspondence class as X and Y
Local assimilation results when CC·SyllAdj dominates an otherwise active Corr·[F] constraint. The example in (16-17) builds on (14). When two place-distinct rhotics are in adjacent syllables, they correspond and place-assimilate (16). When the two rhotics are separated by a syllable, however, the ranking CC·SyllAdj ≫ Corr·[-lateral] makes correspondence impossible (17). The best option given the ranking in (16-17) is (17a), where the two rhotics do not correspond and do not assimilate.
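The violation conditions of CC·SyllAdj can be made concrete with a small counter. This is only a sketch under stated assumptions: the data structure (a list of syllables, each a list of segment and correspondence-class pairs, with None marking segments that correspond with nothing) and the function name are illustrative and are not taken from Bennett (2015b).

```python
from itertools import combinations

def cc_sylladj(syllables):
    """Count pairs of same-class consonants that skip an inert syllable.

    syllables: list of syllables; each syllable is a list of
    (segment, corr_class) pairs, corr_class=None for non-correspondents.
    """
    # Flatten to (syllable index, class), keeping only corresponding segments.
    members = [(si, cls)
               for si, syl in enumerate(syllables)
               for _, cls in syl if cls is not None]
    violations = 0
    for (sx, cx), (sy, cy) in combinations(members, 2):
        if cx != cy or sx == sy:      # conditions (a) and (b)
            continue
        for sz in range(sx + 1, sy):  # condition (c): syllables strictly between
            # condition (d): the intervening syllable holds no class member
            if all(cls != cx for _, cls in syllables[sz]):
                violations += 1
                break                 # one violation per pair
    return violations

# Hypothetical forms: both [r]s are placed in correspondence class 1.
ra_ta_ra = [[("r", 1), ("a", None)], [("t", None), ("a", None)], [("r", 1), ("a", None)]]
ra_ra_ta = [[("r", 1), ("a", None)], [("r", 1), ("a", None)], [("t", None), ("a", None)]]
print(cc_sylladj(ra_ta_ra), cc_sylladj(ra_ra_ta))
```

Correspondence across the inert syllable in ra.ta.ra incurs one violation, while correspondence between adjacent syllables in ra.ra.ta incurs none, which is the asymmetry that the ranking CC·SyllAdj ≫ Corr·[-lateral] exploits.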
(16) CC·SyllAdj compels local assimilation
The system in (18-19) is thus one which exhibits non-local-only dissimilation, in the absence of any extrinsic factor that prevents local dissimilation. This is the type of system that the GOCP does not predict. In addition to the type of system in (18-19), the SCTD, like the GOCP, predicts a range of systems in which constraints promoting unbounded dissimilation interact with those that exert (dis)preferences in local contexts. To give one example: a system differing from (18-19) in that CC·ID-[±ant] ≫ ID-[±ant] yields the mappings /õa-rata/ → [õ x a-õ x ata], /õa-tara/ → [õ x a-tal y a]; this is a system in which non-local dissimilation co-exists with local assimilation. To give another: it is possible to analyze the mappings in (18-19) as an interaction between unbounded rhotic dissimilation and a positional faithfulness constraint (here Ident-σ₁, after Becker et al. 2012) protecting root-initial syllables, as shown in (20-21).
(20) Positional faithfulness blocks local dissimilation /õa-rata/
The important point here is a difference in the SCTD and GOCP's predictions regarding the character of non-local-only dissimilation. As discussed above, the GOCP predicts that non-local-only dissimilation must be linked to some interacting constraint that disprefers the consequences of local dissimilation. The SCTD, by contrast, makes no such prediction. While it is possible for a case of non-local-only dissimilation to coexist with some external factor, this is not necessary, as non-local-only dissimilation is also predicted to exist on its own (as in (18-19)). This difference in prediction is due to a difference in the type of constraint interactions that generate non-local-only dissimilation. In the GOCP, non-local-only dissimilation occurs when local dissimilation is penalized; in the SCTD it occurs when local dissimilation is not motivated.
With respect to locality effects in dissimilation, the SCTD predicts a superset of those systems predicted by the GOCP, as non-local dissimilation can occur in both the presence and absence of external constraints that hold in local contexts. To show that the SCTD's comparative lack of restrictiveness in this domain is justified, it would be necessary to find cases of non-local dissimilation that are not obvious candidates for a GOCP-based analysis. One example of such a case could be a language with the mappings in (18-19) and (20-21) where there is no external evidence for initial-syllable faithfulness. More broadly, these would be cases of non-local dissimilation where there is no apparent reason why local dissimilation should fail.

Roadmap
The rest of the paper argues that Sundanese non-local-only dissimilation does not uniquely support the SCTD's predictions regarding locality, as a GOCP-based analysis is available. Developing a suggestion by Zuraw (2002:433), I show that the full pattern can be analyzed as resulting from the interaction of two distinct pressures: unbounded co-occurrence restrictions on [r]s and [l]s, whose effects in local contexts are obscured by a language-wide desire for identity between adjacent syllables (Section 2). Building on results presented by Cohn (1992), I show that statistical trends in the lexicon are consistent with this analysis: words containing multiple [r]s and [l]s are underattested relative to naïve expectations, and identity between adjacent syllables is overattested relative to naïve expectations (Section 3). Given the success of a GOCP-based analysis in accounting for the Sundanese pattern, the extant typology of locality in dissimilation provides us with little reason to adopt the less restrictive SCTD. Some implications for the analysis of long-distance consonant interactions more generally are discussed in the conclusions (Section 4).

Sundanese assimilation and dissimilation: Data and analysis
Sundanese exhibits a complex pattern of liquid assimilation and dissimilation, manifested primarily as allomorphy between [ar] and [al] (though see Section 3 for discussion of related effects in the lexicon). The allomorphs [ar] and [al] are exponents of a plural affix that appears before the first vowel in the stem. It is a productive verbal affix and is also used with a small, likely closed class of nouns (Robins 1959:343). As discussed by Cohn (1992), Bennett (2015a,b) and others, the choice between [ar] and [al] depends on the presence of other liquids ([r] and [l]) within the word, as well as their location relative to the affixal liquid. The data considered throughout most of this section are in Table 1; the presentation follows Bennett (2015b:315), but with some reordering and my comments.
I follow Cohn (1992:207) in assuming that the affix's underlying form is /ar/, as [ar] surfaces when the root contains no other [r] (a-b). When the root contains an [r], however, the affix generally surfaces as [al] (c-e). These alternations suggest a general process of [r]-dissimilation: co-occurrence of two [r]s is avoided by mapping the affixal /r/ to [l]. There are two kinds of exception to this pattern, both of which suggest processes of local liquid assimilation. First, if the stem-initial onset is [l], [al] surfaces unexpectedly (f-g). The result is agreement between the stem's first two syllable onsets for [+lateral]. Second, if one of the syllables adjacent to the affixal /r/ has an /r/ onset (and the root-initial consonant is not [l]), [ar] surfaces unexpectedly (h-i). The result is agreement among onsets of adjacent syllables for [-lateral].
Bennett (2015a,b) proposes an analysis of these facts within the SCTD. The premise of the analysis is that correspondence among liquids is only possible when the liquids inhabit adjacent syllable onsets. This requirement is enforced by CC·SyllAdj (15) as well as CC·SRole, which requires corresponding consonants to have the same syllabic role. In adjacent syllable onsets, where liquids must correspond, they are forced to assimilate for [±lateral] by CC·Ident-[±lateral]. In all other contexts, liquids cannot correspond, so satisfaction of the relevant Corr constraints dictates that they must dissimilate for [±lateral]. The overall analysis is one in which the complementarity between assimilation and dissimilation observable in Table 1 is derived by constraints that limit the contexts in which liquids can correspond.
Arguments for the SCTD-based analysis of Sundanese come in part from its ability to derive this complementarity, and in part from difficulties that the data pose to co-occurrence-based theories of dissimilation (like the GOCP). Namely, it is difficult for theories invoking constraints like *X…Y to explain why [r] co-occurrence is permitted only in adjacent syllables. Bennett (2015a:375) notes that "with enough wrangling, the co-occurrence constraint approach can be made to accommodate the Sundanese data", but that "such elaborations require extra stipulations beyond the theoretical machinery of co-occurrence constraints, and they miss a significant insight about Sundanese: the connection between assimilation and dissimilation."
Even granting these advantages, there are reasons why pursuing a co-occurrence-based analysis of Sundanese assimilation and dissimilation is justified. First, as discussed above, the SCTD makes less restrictive predictions regarding the character of non-local-only dissimilation. Second, there is evidence that the SCTD's predictions fail to line up with the types of long-distance consonant interactions that are learnable. As Section 4 summarizes in more detail, a series of artificial grammar learning experiments by McMullin & Hansson (2016, 2019) have shown that the types of dissimilatory patterns learned by participants in artificial grammar studies correspond to the types of patterns predicted by the GOCP: local and unbounded dissimilation, plus non-local-only dissimilation with concomitant local assimilation. Crucially, participants had difficulty learning non-local-only dissimilation when not accompanied by local assimilation, the only pattern type exclusively predicted by the SCTD.

Co-occurrence-based analysis
The analysis of the Sundanese data proceeds in three parts. First (Section 2.1.1), I provide an analysis of [r]-dissimilation, as observed in Table 1's c-e. Second (Section 2.1.2), I provide an analysis of [r]-assimilation and [l]-assimilation (Table 1's f-i) in terms of aggressive reduplication (Zuraw 2002). Third (Section 2.1.3), I fix a problem with the analysis by positing an additional process of [l]-dissimilation. Similarities and differences between the proposed analysis and related co-occurrence-based analyses (Suzuki 1999; Hansson 2001) are discussed in Section 2.3.

Analyzing [r]-dissimilation
The preference for the [

Aggressive reduplication, [r]-assimilation, and [l]-assimilation
To explain the problem posed by [c-ar-uriga] and [r-ar-ahɨt], I propose that in Sundanese there is a more general drive for adjacent syllables to be "coupled" in a reduplication-like structure (as suggested by Zuraw 2002:433). Zuraw (2002) argues that such a drive, which she terms aggressive reduplication, encourages a heightening of self-similarity between adjacent, phonologically similar constituents. For example, Zuraw interprets the frequent misspellings of English pompon as pompom, and sherbet as sherbert (among others), as the result of aggressive reduplication. In the case of pompon, the misspelling pompom results in total identity between the word's two syllables. In the case of sherbet, the misspelling sherbert results in nucleus identity. Beyond English, a desire to preserve word-internal self-similarity in Tagalog can impede an otherwise productive word-final vowel raising process, if the result of raising would be a reduction in similarity between the final and penultimate syllables (see Zuraw 2002:410ff for more details).
Zuraw's proposal has two crucial components. The first is Redup, which promotes word-internal coupling. While Zuraw's (p. 405) definition of Redup is deliberately simple ("A word must contain some substrings that are coupled"), I adopt, for expositional reasons, a more specific definition that requires coupled substrings to be adjacent syllables (23).
(23) Redup: Assign one * if a word does not contain adjacent coupled syllables.
Whether or not a candidate has coupled substrings, and where these coupled substrings are located, is determined by Gen. Again for expositional simplicity I make two limitations to the candidates that Gen can produce. First, a coupled substring must be isomorphic with a syllable: given /pabada/, for example,  following Zuraw 2002.) In this section, what will be of interest is whether or not the onsets of coupled syllables are identical. While it would be possible to encode these requirements as the combination of κκ·Max (ensuring that onsets contain the same number of segments) and a set of κκ·Ident- [F] constraints (ensuring that onsets contain the same segments), I depart from Zuraw in assuming that faithfulness constraints along the κκ dimension can also evaluate entire syllabic constituents. Thus the requirement for onset identity among coupled syllables is formalized as κκ·Ident-[onset] ((24); see Suzuki 1999 for a similar proposal).
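The way Redup and κκ·Ident-[onset] evaluate a candidate can be sketched as two violation counters. This is an illustration under simplifying assumptions, not the paper's formal definitions: a word is reduced to its list of syllable onsets, Gen is assumed to supply at most one coupled pair of adjacent syllables, and κκ·Ident-[onset] is assumed to assign one violation when the coupled syllables' onsets differ. The function names are illustrative.

```python
def adjacent_couplings(n_syllables):
    """Couplings a simplified Gen may supply: none, or one adjacent pair (i, i+1)."""
    return [None] + [(i, i + 1) for i in range(n_syllables - 1)]

def redup(coupling):
    """(23) Redup: one violation if the word has no adjacent coupled syllables."""
    return 0 if coupling is not None else 1

def kk_ident_onset(onsets, coupling):
    """Assumed κκ·Ident-[onset]: one violation if coupled onsets are not identical."""
    if coupling is None:
        return 0  # vacuously satisfied: nothing is coupled
    i, j = coupling
    return 0 if onsets[i] == onsets[j] else 1

# Onsets of ca.ru.ri.ga, the syllabification of [c-ar-uriga].
onsets = ["c", "r", "r", "g"]
for coupling in adjacent_couplings(len(onsets)):
    print(coupling, redup(coupling), kk_ident_onset(onsets, coupling))
```

For ca.ru.ri.ga, only the coupling of the second and third syllables satisfies both constraints at once, which is the configuration that the analysis uses to license the two [r]s.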
With this in place, we can continue with the analysis of [c-ar-uriga] and [r-ar-ahɨt]. To derive the fact that co-occurring [r]s are permitted in adjacent syllable onsets, I propose that Sundanese prioritizes coupling over
5 As suggested by Zuraw (in a more general way, p. 405), these limitations could be derived through the interaction of a more general Redup with constraints that govern reduplicant size and placement. Sundanese has initial-syllable partial reduplication; reduplicants are always adjacent to their bases, and there are no instances of multiple reduplication that I am aware of (see Robins 1959 on partial reduplication in Sundanese, and Hansson 2010:289 for previous discussion of the connection between partial reduplication and /ar/ allomorphy). As the nature of Redup is not the focus of this paper, I do not explore this alternative.
a. c-ar-uriga a. c-ar-ombrek a. ŋ-ar-umbara
Finally, for forms like /ar-hormat/ (22), the analysis correctly predicts that h-al-ormat is the winner. This is because [l] and [r] cannot correspond: the liquids are contained in the same syllable (ha.lor.mat), and I assume that coupled substrings must be isomorphic to a syllable (*ha[lo]κ[r]κmat).

Analyzing [l]-dissimilation
The current analysis incorrectly predicts that /ar-gɨlis/ should surface as g-
a. g-ar-ɨlis
The next part of the pattern to explain is why /ar-lɨtik/ surfaces as [l-al-ɨtik] and /ar-liren/ surfaces as [l-al-iren]. As shown in (30)
a. l-ar-iren
To account for these data, I propose that coupling is preferred between the first two syllables of the word, and that this context-sensitive preference for coupling overrides the prohibition on [l] co-occurrence. I formalize the preference for initial coupling as a context-sensitive version of Redup, Redup-σ₁σ₂ (31), which requires that the stem's first two syllables be coupled.
(31) Redup-σ₁σ₂: Assign one * if the first two syllables of the stem are not coupled.
To derive the result that /ar-liren/ surfaces as
(32) confirms that, with this ranking in place, [l-a]κ[l-i]κren (32d) is correctly selected as optimal. The overall effect is that [l] co-occurrence is permitted only if it results in onset identity between the first two syllables.
The ranking κκ·Ident-[onset] ≫ Redup-σ₁σ₂ is motivated by further consideration of forms like [c-ar-uriga], where this analysis assumes that [r] co-occurrence is permitted because the second and third syllables are coupled. Tableau (33) demonstrates that if the ranking between these two constraints were reversed, as
To allow coupling to occur outside of the stem-initial context, then, κκ·Ident-[onset] must dominate Redup-σ₁σ₂. With this final ranking in place, the analysis can now account for all data in Table 1.
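The ranking argument for κκ·Ident-[onset] ≫ Redup-σ₁σ₂ can be illustrated with a two-constraint comparison. The sketch below holds the segmental string constant and compares only where coupling is placed; all other constraints in the analysis are deliberately left out, κκ·Ident-[onset] is assumed to assign one violation when coupled onsets differ, and the function and dictionary names are illustrative.

```python
def redup_s1s2(onsets, coupling):
    """(31) Redup-σ₁σ₂: one violation unless the first two syllables are coupled."""
    return 0 if coupling == (0, 1) else 1

def kk_ident_onset(onsets, coupling):
    """Assumed κκ·Ident-[onset]: one violation if coupled onsets differ."""
    i, j = coupling
    return 0 if onsets[i] == onsets[j] else 1

CONSTRAINTS = {"kk_ident_onset": kk_ident_onset, "redup_s1s2": redup_s1s2}

def winner(onsets, couplings, ranking):
    """Coupling with the best violation profile under a strict ranking."""
    def profile(coupling):
        return tuple(CONSTRAINTS[name](onsets, coupling) for name in ranking)
    return min(couplings, key=profile)

proposed = ["kk_ident_onset", "redup_s1s2"]   # κκ·Ident-[onset] ≫ Redup-σ₁σ₂
reversed_ = ["redup_s1s2", "kk_ident_onset"]  # the rejected ranking

caruriga = ["c", "r", "r", "g"]   # ca.ru.ri.ga
laliren = ["l", "l", "r"]         # la.li.ren
options = [(0, 1), (1, 2)]        # couple σ1σ2 or σ2σ3

print(winner(caruriga, options, proposed))   # coupling of σ2σ3
print(winner(caruriga, options, reversed_))  # coupling of σ1σ2
print(winner(laliren, options, proposed))    # coupling of σ1σ2
```

With the proposed ranking, ca.ru.ri.ga couples its second and third syllables, where the onsets match; with the reversed ranking, coupling is forced onto the first two syllables despite the onset mismatch. For la.li.ren the two constraints agree, so initial coupling wins under either ranking.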

Summary of analysis
The proposed analysis of the Sundanese data is summarized in Figure 1, with winner-loser pairs provided to illustrate each ranking argument. Ident·Root-[±lateral] ≫ Redup-σ₁σ₂ was not established in the analysis above but is included and justified below for completeness's sake.
As mentioned throughout, a number of other constraints regulate aggressive reduplication in these infixed forms. Some of these constraints dominate κκ·Ident-[onset]: constraints demanding faithfulness to features other than [±lateral], for example, are necessary to explain why /ar-kusut/ does not surface as [k-a]κ[ku]κsut. Some of these constraints are lower-ranked: both κκ·Ident-[nucleus] and κκ·Ident-[coda] must be dominated by Redup, as coupling is allowed word-medially despite a lack of nucleus and coda identity. Finally, in some cases the position of these constraints in the hierarchy is unclear: Max and Dep must dominate Redup to explain why /ar-combrek/ is not c-a[r-om]κ[rek]κ or c-a[br-om]κ[brek]κ, but a lack of relevant forms that contain clusters makes their rankings with respect to Redup-σ₁σ₂ impossible to establish.
One form provided by Bennett (2015b:142), [al-ulur] 'lower on a rope (pl.)', poses a problem for this analysis. 8 The ranking in Figure 1 predicts that it should surface instead as [ar-ulur]
a. ar-ulur
There are at least two ways to make sense of this apparent exception. One is to treat it as just that, an exception, and to claim that /ulur/ must exceptionally be realized with the plural allomorph [al]. Such a provision must be part of the analysis in any case, as lexical exceptions exist: Robins (1959:344)
-ulur] is unlike all other forms considered here in that it is vowel-initial, and the affix /ar/ surfaces as a prefix. If Redup-σ₁σ₂ were revised to demand that the first two syllables containing root material must be coupled, candidates (34b,d) would satisfy Redup-σ₁σ₂ and a[l-u]κ[lur]κ (34d) would be correctly chosen as the winner. 9 It is difficult to know at present which of these solutions is more plausible.
past descriptions. The distribution of [ʔ] is predictable (Robins 1959), and from this I infer that it is not part of a root's underlying representation; whether and where it surfaces is not important here.
9 A variant of this would be to claim that Redup-σ₁σ₂ requires coupling between the stem's first and second syllables, as claimed in (31), but that onsetless syllables cannot function as stem-initial syllables (for typological evidence supporting this idea as well as a formal implementation, see Downing 1998).
The analysis predicts that /arar/ should surface as [alar] given a lateral-initial root (hypothetical /alar-liren/ → [l-alar-iren]), but no data that I am aware of bears on this prediction.

Comparison with alternatives
The Sundanese data discussed here have been analyzed several times before, and the analysis proposed in this section bears some resemblance to prior analyses by Suzuki (1999) and Hansson (2001). Some major similarities and differences among these analyses are highlighted below.
The proposed analysis shares with Suzuki (1999) and Hansson (2001) an appeal to *[r]…[r], to account for the realization of /ar/ as [al] in forms like [h-al-ormat]. It also shares with these analyses an appeal to some form of surface correspondence that holds between adjacent syllables, as well as a constraint requiring identity among correspondents. The specifics of how this is accomplished differ. Suzuki's (1999) proposal is closer to the present one: he assumes that any two adjacent syllables can stand in correspondence, and appeals to Identσ₁σ₂[Ons] ("Adjacent syllables have an identical onset specifications [sic]") to derive [r] co-occurrence in forms like [c-ar-uriga]. Hansson (2001) appeals to two constraints which together derive [r] co-occurrence: Corr-[lat] Ons(σ1-σ2) ("liquids in adjacent syllable onsets must correspond") and Ident[lat]-CC ("corresponding consonants must agree for [±lateral]").
The proposed analysis differs from prior work in how [l] co-occurrence is analyzed. Suzuki (1999) and Hansson (2001) account for this aspect of the pattern by proposing that two distinct types of correspondence are active in Sundanese: base-reduplicant correspondence (which holds between the first two syllables) and another form of surface correspondence (which can hold between all pairs of adjacent syllables). The limitation of [l] co-occurrence to initial position is then derived by assuming that different faithfulness constraints hold within these two correspondence dimensions. Under Hansson's (2001) proposal, the existence of [l] and [r] co-occurrence in initial position shows us that Ident[±lat]-BR is high-ranked; the existence of only [r] co-occurrence elsewhere shows us that Ident[-lat]-CC is high-ranked but Ident[+lat]-CC is not. if one of the consonants is stem-initial.)
Thus the proposed analysis is novel in two ways. First, it attributes the special behavior of σ₁σ₂ to a distinct pressure for correspondence in this context. Second, it attributes the absence of [l] co-occurrence outside of σ₁σ₂ to a co-occurrence constraint, *[l]…[l]. The overall characterization of the pattern is one in which marked configurations (here co-occurring [r]s and [l]s) are licensed in order to enhance the self-similarity of adjacent syllables. This is distinct from the characterizations argued for in prior work.

Evidence from the lexicon
As discussed above, co-occurrence-based theories of dissimilation predict that non-local-only dissimilation must coexist with an interacting pressure that disprefers the result of local dissimilation. Under the analysis above, Sundanese instantiates this prediction: dissimilation of [r]s and [l]s is obscured in local contexts by a general desire for identity between adjacent syllables.
Previous work suggests independent evidence consistent with aspects of this analysis. Regarding rhotic dissimilation, Cohn (1992:213) notes that loanwords with multiple [r]s often undergo optional dissimilation (rapor, lapor, or rapot for 'report'; direktur or dalektur for 'director'). In addition, she attributes to Eringa (1949) the observation that other morphologically complex forms optionally exhibit rhotic dissimilation as well (e.g. pira(N)+kadar 'type+fate' optionally maps to pilakadar 'only'). These facts are consistent with a system in which *[r]…[r] is active. Regarding aggressive reduplication: Cohn's (1992:213-214) investigation of Lembaga Basa & Sastra Sunda (1985), a large Sundanese dictionary, reveals that 105 of the dictionary's [r]-initial entries have co-occurring [r]s. In 87 of these, the [r]s are onsets of adjacent syllables that also have identical nuclei (e.g. rara 'braid', rorod 'pull in (as a string of a kite)', ragrag 'fall'). Zuraw (2002:433) notes that the observed correlation between [r] co-occurrence and nucleus-matching is consistent with an interaction between dissimilation and aggressive reduplication, as "successive liquid onsets that escape a general dissimilation process are likely to belong to strings that are similar in other ways".
This section replicates and expands on Cohn's findings by providing evidence that trends in the Sundanese lexicon are consistent with the activity of liquid dissimilation and aggressive reduplication. This evidence and its relationship to the analysis is previewed below.
• Evidence for aggressive reduplication in adjacent syllables: If onset identity is a prerequisite for coupling in Sundanese, we might expect that syllables with identical onsets are more likely than expected to be similar in other ways. This is because if an adjacent pair of syllables satisfies κκ·Ident-[onset], coupling is compelled by Redup. These coupled syllables are then evaluated by further κκ·Ident constraints, like κκ·Ident-[nucleus] and κκ·Ident-[coda]. (While input-output faithfulness constraints prevented us from seeing the effects of these further constraints in Section 2, we might expect to find effects in the larger lexicon, as faithfulness constraints do not play a role in lexical innovation; see discussion in Section 3.2.4.) I show that this prediction is borne out in the relationship between onset and nucleus identity. When syllables are adjacent, there is a statistically significant correlation between onset-matching and nucleus-matching: syllables with matching onsets are disproportionately likely to have matching nuclei. For non-adjacent syllables, no such correlation exists. These findings are consistent with Section 2's claim that Redup requires coupling only between adjacent syllables, and that a family of κκ·Ident constraints promotes identity among coupled syllables.
• Evidence for aggressive reduplication in σ₁σ₂: Evidence consistent with the claim that aggressive reduplication is specifically preferred between the first two syllables (formalized in Section 2 as Redup-σ₁σ₂) comes from patterns of onset-matching. Namely, the onsets of σ₁ and σ₂ are more likely to be identical than is predicted by the frequency of individual onsets in these positions. Importantly, this preference for onset-matching does not hold in σ₂σ₃ or σ₁σ₃ (when other processes promoting identity are controlled for; see Section 3.2.3).

• Evidence for restrictions on multiple [r]s and multiple [l]s: If there are active co-occurrence restrictions on multiple [r]s and [l]s (formalized in Section 2 as *[r]…[r] and *[l]…[l]) we should find words containing multiple [r]s or [l]s to be significantly less frequent than expected. I show that this is true throughout the Sundanese lexicon, even in contexts where identity is otherwise preferred (like the onsets of σ₁σ₂; see discussion in Section 3.2.1).
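The first of these predictions, that onset-matching and nucleus-matching go together in adjacent syllables, amounts to an association in a 2×2 table. The sketch below shows only the tabulation and a simple comparison of proportions; it is not the statistical test used in the paper. The seven syllable pairs are taken from words cited in this paper and are far too few to support any conclusion, and the function names are illustrative.

```python
from collections import Counter

def match_table(pairs):
    """Cross-tabulate (onsets match?, nuclei match?) over syllable pairs.

    pairs: iterable of ((onset1, nucleus1), (onset2, nucleus2)).
    """
    table = Counter()
    for (on1, nuc1), (on2, nuc2) in pairs:
        table[(on1 == on2, nuc1 == nuc2)] += 1
    return table

def nucleus_match_rate(table, onset_match):
    """Proportion of pairs with matching nuclei, given the onset-match status."""
    total = table[(onset_match, True)] + table[(onset_match, False)]
    return table[(onset_match, True)] / total if total else float("nan")

# Adjacent-syllable pairs from words cited in the text (illustration only).
ADJACENT = [
    (("k", "e"), ("k", "e")),  # ke.ke(.ba)
    (("k", "e"), ("b", "a")),  # (ke.)ke.ba
    (("b", "a"), ("c", "a")),  # (am.)ba.cak
    (("k", "e"), ("d", "e")),  # ke.de(.plik)
    (("k", "a"), ("r", "i")),  # ka.ri
    (("r", "a"), ("r", "a")),  # ra.ra
    (("r", "o"), ("r", "o")),  # ro.rod
]

table = match_table(ADJACENT)
print(nucleus_match_rate(table, onset_match=True))
print(nucleus_match_rate(table, onset_match=False))
```

If aggressive reduplication is active, the first proportion should exceed the second for adjacent syllables, and the two should be comparable for non-adjacent syllables.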
The main point of this section is that trends in the Sundanese lexicon are consistent with each of the markedness constraints proposed in Section 2. These findings are thus consistent with the claim that non-local dissimilation in Sundanese can be analyzed as the interaction between unbounded dissimilation and a preference for identity between adjacent syllables. Section 3.1 discusses methodological aspects of this study, including information about the data source and the statistical models. Context-by-context results are presented in Section 3.2. Section 3.3 provides a potential learnability-based reason why we should take seriously these links between /ar/ allomorphy and the lexicon. A further corpus study suggests that /ar/-affixed forms supporting the crucial rankings in Figure  1 are likely rare, yet Sundanese children have no problem acquiring the correct grammar: the pattern has been stable for decades. The trends established in Section 3.2 raise the possibility that the relevant constraints and their ranking are discoverable from the lexicon, and that successful acquisition of /ar/ allomorphy may not require much exposure to /ar/-affixed forms.

Methods
The lexicon study discussed in this section is based on a wordlist that contains 11,913 headwords from Lembaga Basa & Sastra Sunda (1985), excluding only those that were explicitly marked as borrowings (from Arabic, English, Javanese, Malay, and a number of other languages). Each word in this list was syllabified according to Cohn's (1992:205) description of cluster phonotactics in the Sundanese root pattern. For clarity, her description of the canonical Sundanese root is replicated in Figure 2. Following this description meant that a word like ablag 'large, spacious' was syllabified as a.blag and a word like ambacak 'scattered' was syllabified as am.ba.cak. A small number of words contained triconsonantal or longer clusters not explicitly described by Cohn; in these cases, the first consonant was assigned to a syllable coda and the rest to the following onset (tasblaŋ 'finished study / nothing more to learn; from dusk to dawn (awake all night)', for example, was syllabified as tas.blaŋ). Unsyllabified and syllabified versions of the wordlist are available as supplementary materials.
The lexicon analysis takes into account only disyllabic and trisyllabic words. This limitation was made because most quadrisyllabic or longer forms in Lembaga Basa & Sastra Sunda (1985) appear to be morphologically complex or are likely unmarked borrowings (e.g. afghanistan 'Afghanistan'). 10 In particular, a large number of the longer forms appear to be fully reduplicated roots (e.g. alangahéléngeh 'shy smile', alunalun 'square, plaza'; see Van Syoc 1959:78-80 on morphological reduplication in Sundanese). Part of our interest here is in the extent of evidence for the activity of aggressive reduplication, so including morphologically reduplicated forms would bias the results.
Because each word was maximally three syllables, there were a total of three syllabic contexts to investigate: the first and second syllables (σ 1 σ 2), the second and third (σ 2 σ 3), and the first and third (σ 1 σ 3). For each context, the forms considered were only those that had a native (i.e. not /f v z P/) singleton onset in both positions. 11 Thus words like ke.ke.ba 'a bag/container made out of bamboo', where all syllables have singleton onsets, are considered for all contexts (σ 1 σ 2, σ 2 σ 3, and σ 1 σ 3). Words like am.ba.cak, where one syllable has no onset, are only considered for a subset of the contexts (here only σ 2 σ 3). Words like ke.de.plik 'very thick', where one syllable has a complex onset, are also only considered for a subset of the contexts (here only σ 1 σ 2). Finally, words like ka.ri 'leftovers' are only considered for σ 1 σ 2, as they lack a third syllable.
To determine the frequency of onset pairs relative to expectation, loglinear models were fit to each of the datasets in (35). Loglinear models were chosen as they are a statistically sound way of analyzing count data (see Wilson & Obdeyn 2009 for discussion). For each model, the dependent variable was the number of times a particular onset-onset pair was attested. The independent variables included a predictor for identity (is the onset-onset pair composed of two identical consonants?) and one predictor per onset segment per position. For example, if the possible syllable onsets for a given language are /p t l k/, this results in eight segmental predictors: four for the segments in first position (p 1, t 1, l 1, k 1) and four for the segments in second position (p 2, t 2, l 2, k 2). Each predictor assigned a 1 if that segment was present in the specified position and a 0 if it wasn't. In addition, one predictor was included for each identical onset pair of interest (e.g. l 12). The schematic example in Table 2 illustrates the structure of the model inputs for a made-up language whose possible onsets are /p t l k/ and where the rate of [l] co-occurrence is of interest.
Two models were fit to each subset of the data. In the baseline model, the counts were modeled as a function of only the segmental predictors (p 1, p 2, etc.). This model was then queried for a set of fitted values (with R's fitted.values function) that reflect how frequent each pair is predicted to be, given no constraints on onset-onset combination. If the pair is more frequent than predicted, it is overattested relative to naïve expectation; if it is less frequent than predicted, it is underattested. Following this, predictors that reference identity (above as Identity, l 12) were added to the model. The Identity predictor was included to let the model assess whether or not pairs of identical onsets, as a class, are overattested or underattested. The predictors for identical onset combinations (like l 12) were included to let the model determine if individual pairs of identical onsets are overattested or underattested, relative to the expectations set by the frequency of identical pairs (as a class) and the independent frequency of each member of the pair. In this way, these models allow us to evaluate evidence for a potential identity preference (which would manifest as significant overattestation of identical pairs) as well as evidence for co-occurrence restrictions on [r]s and [l]s (which would manifest as underattestation of those specific pairs). All loglinear models were fit with the bayesglm function of R's arm package (Gelman & Hill 2006) and the quasipoisson family. 12
For each context, further evidence for aggressive reduplication was evaluated by determining if onset-matching was significantly correlated with nucleus-matching. This was done by splitting the forms into four groups, according to (i) whether or not their onsets match and (ii) whether or not their nuclei match, and performing chi-squared tests on the resulting contingency tables.
10 Regarding the issue of loanwords, an anonymous reviewer asks about the influence of Javanese loans on Sundanese lexical statistics. Words marked as Javanese loans in the dictionary (e.g. kecut 'sour') have been excluded, as per the discussion above. Abby Cohn (p.c.) notes that Javanese loans are common in the high register of Sundanese, but that many speakers do not command the high register, and that it would be surprising if speakers were aware which words are of Javanese origin and which aren't.
11 The limitation to native singleton onsets was made largely to simplify the statistics and the data visualizations, but also in part because non-native and cluster onsets are infrequent. Widening the corpus to contain these forms does not qualitatively change the results or any of the conclusions drawn from them. (Note that while [P] is a native Sundanese phone, its distribution is predictable and it is not written. Instances of it in the dictionary are likely not due to this predictable pattern; Abby Cohn (p.c.) notes that kaPbah 'a place in Mecca', for example, is likely an Arabic loan. See Robins, 1959, for further discussion of [P].)
12 The bayesglm function was selected as Bayesian regression was found to be uniquely capable of accommodating the numerous 0s in the Sundanese count data. The quasipoisson family is appropriate for these data because in all relevant subsets, the variance in frequency is larger than the mean.
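The selection of onset pairs and the logic of the baseline comparison can be sketched as follows. This is an illustrative Python sketch, not the R code behind the reported models. The function names, the vowel set, and the toy counts are assumptions, and each segment is assumed to be one character. For the baseline, the sketch uses the fact that an unpenalized Poisson loglinear model with only positional predictors has fitted values equal to (first-position total × second-position total) / grand total. Fitted values from bayesglm may differ somewhat from these because of its priors.

```python
from collections import Counter

VOWELS = set("aiueoé")    # assumed vowel symbols, one character per segment
NON_NATIVE = set("fvzP")  # onsets excluded as non-native in the text

def onset(syllable):
    """Return the consonants that precede the first vowel of a syllable."""
    for i, seg in enumerate(syllable):
        if seg in VOWELS:
            return syllable[:i]
    return syllable

def onset_pairs(word):
    """Map each context, e.g. (1, 2), to its onset pair if both onsets qualify."""
    sylls = word.split(".")
    if len(sylls) not in (2, 3):
        return {}
    onsets = [onset(s) for s in sylls]
    ok = [len(o) == 1 and o not in NON_NATIVE for o in onsets]
    pairs = {}
    for i, j in ((0, 1), (1, 2), (0, 2)):
        if j < len(sylls) and ok[i] and ok[j]:
            pairs[(i + 1, j + 1)] = (onsets[i], onsets[j])
    return pairs

def expected_counts(observed):
    """Expected count of each onset pair if the two positions were independent."""
    total = sum(observed.values())
    first, second = Counter(), Counter()
    for (c1, c2), n in observed.items():
        first[c1] += n
        second[c2] += n
    return {(c1, c2): first[c1] * second[c2] / total
            for c1 in first for c2 in second}

# Words from the text, already syllabified.
words = ["ke.ke.ba", "am.ba.cak", "ke.de.plik", "ka.ri"]
contexts = {w: sorted(onset_pairs(w)) for w in words}

# Hypothetical counts for a toy language, only to show the comparison.
toy = Counter({("p", "p"): 4, ("p", "t"): 2, ("t", "p"): 2, ("t", "t"): 2})
expected = expected_counts(toy)
# A ratio above 1 means overattested, below 1 means underattested.
ratio = {pair: toy[pair] / expected[pair] for pair in expected}
```

In the toy table, both identical pairs come out with a ratio above 1, which is the kind of pattern the Identity predictor is meant to capture.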

Results
The results of the lexicon study are presented by-context below (first σ 1 σ 2 , then σ 2 σ 3 , then σ 1 σ 3 ). Note that the goal of this subsection is not to provide a comprehensive description and analysis of all trends in the Sundanese lexicon; the goal is only to discuss results that bear on the analysis in Section 2. Materials that provide a more complete picture of Sundanese lexical statistics are available as supplementary materials.

Results for σ 1 σ 2
Results of the loglinear models for the σ 1 σ 2 context suggest a dispreference for co-occurring [r]s and [l]s modulated by a co-existing preference for identity. This is visible in Figure 3, which plots the baseline model's predicted count for a given onset pair against its observed count. 13 Identical pairs are represented with black dots and all other pairs are represented with gray. 14 Dots above the identity line denote pairs that are more frequent than expected, given the individual probabilities of each onset; dots below the line denote pairs that are less frequent than expected.
Figure 3 shows that identical σ 1 σ 2 onset pairs are overattested relative to expectation: identity is linked to a boost in frequency that cannot be explained only by reference to the independent frequency of the pair's members. In addition, l+l and r+r are underattested relative to other identical pairs. The results of the second loglinear model, which incorporates predictors referencing identity, confirm that these observations are unlikely to be due to chance. The positive coefficient in (36b) confirms that identity is linked to a significant increase in log frequency, and the negative coefficients in (36c-d) confirm that the log frequencies of l+l and r+r are lower than expected, relative to their position-specific frequencies (controlled for in (36e-h)) and the general frequency boost for identical segments. Thus in σ 1 σ 2, evidence for a similarity preference among adjacent syllables comes from the overattested status of identical onsets. Evidence for a restriction on co-occurring [l]s and [r]s comes from the underattested status of l+l and r+r.
Before moving on to address the patterns in σ 2 σ 3, it is necessary to address a potential confound. Sundanese employs partial reduplication in a variety of morphological contexts, as attested in pairs like basa 'language' and ba-basan 'proverb', saur 'to speak' and sa-sauran 'to talk together', tani 'agriculture' and ta-tanen 'to farm', and others (Robins 1959:360-361). It is possible that the preference for adjacent syllable identity in this context could be due to the dictionary's inclusion of a large number of morphologically reduplicated forms.
To determine whether or not this alternative interpretation of the results is plausible, I limited the Sundanese roots under investigation to those of the shape CVx.CVx, where x is an optional coda. The majority (97%) of words in the dictionary are two syllables or longer, suggesting a dispreference for monosyllabic words. 15 Given this, it is reasonable to expect that most disyllabic words are not morphologically reduplicated. Words consisting of two identical syllables were however excluded if the repeated syllable was recorded as a monosyllabic word; these exclusions brought the number of forms considered down from 6,409 to 6,373. Figure 4 demonstrates that, in this subset of the data, identical onsets are still overattested. A loglinear model similarly finds a boost in frequency for identical pairs (p < .001) and a decrease in frequency for r+r (p < .05) and l+l (p = .06). These findings suggest that morphological reduplication is not responsible for the preference for identical onsets apparent in Figure 3. Similarly, morphological reduplication is likely not responsible for the link between onset-matching and nucleus-matching. Even when we focus on the subset of disyllabic forms, syllables with matching onsets are still disproportionately likely to have matching nuclei (38).
(38) Onset-matching encourages nucleus-matching in σ 1 σ 2 (χ 2 (1) = 409.33, p < .001)
                  Nucleus match   Nucleus mismatch
  Onset match     473             159
  Onset mismatch  1932            3809
In short, the properties of σ 1 σ 2 investigated in this section are consistent with the analysis proposed in Section 2: we see a preference for identical onsets, as well as a dispreference for [l]…[l] and [r]…[r]. (Note however that the interrelation between the preference for identical onsets and the co-occurrence constraints is not predicted by the analysis, as the analysis is silent on how these pressures should interact in gradient lexical data.) Furthermore, it is unlikely that the observed preference for self-similarity between σ 1 and σ 2 can be attributed to morphological reduplication: the preference is also observed within a set of forms that are likely not morphologically reduplicated.

Results for σ 2 σ 3
The patterns observed in σ 2 σ 3 differ from those in σ 1 σ 2 as a function of the rate of onset-matching. Figure 5 makes it clear that in this context there is no preference for identity among adjacent syllable onsets. But like the patterns for σ 1 σ 2, r+r and l+l behave differently than the rest of the identical pairs. While most identical pairs are fairly close to the identity line (their frequency is predictable given the independent frequencies of their members), r+r and l+l are well below it. These two findings were confirmed by adding identity-related predictors to the baseline model. The results (in (39)) confirm the observations made on the basis of Figure 5. The predictor for onset identity is not significant: whether or not a pair of onsets is identical has no independent effect on its log frequency. The r 23 and l 23 predictors are however both significant, and the negative coefficients indicate that these pairs are less frequent than expected.
(39) Partial results of loglinear model for σ 2 σ 3 onset pairs (full results in the appendix)
The results for σ 2 σ 3 are similar to those for σ 1 σ 2 in that syllables with identical onsets are disproportionately likely to have matching nuclei. This is evident in (40), where 66.7% of syllable pairs with matching onsets but only 49.6% of syllable pairs with mismatching onsets have matching nuclei.
(40) Onset-matching encourages nucleus-matching in σ 2 σ 3 (χ 2 (1) = 13.77, p < .001)
                  Nucleus match   Nucleus mismatch
  Onset match     86              43
  Onset mismatch  1438            1463
In sum, underattestation of r+r and l+l is consistent with the activity of *[r]…[r] and *[l]…[l]. The observation in (40) that similarity along one dimension is correlated with similarity along another is consistent with a preference for self-similarity between all adjacent pairs of syllables and not just σ 1 σ 2. Finally, the preference for onset-matching in σ 1 σ 2 but not σ 2 σ 3 is potentially attributable to a higher drive for self-similarity for σ 1 σ 2; this is consistent with the activity of Redup-σ 1 σ 2.
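The chi-squared tests on the contingency tables in (38) and (40) can be checked with a short sketch. The function name `chi2_yates` is an assumption. The sketch applies Yates' continuity correction, which R's chisq.test uses by default for 2×2 tables. That this variant was used is an inference from the reported statistics, not something stated in the text.

```python
def chi2_yates(a, b, c, d):
    """Chi-squared statistic (1 df) for the 2x2 table [[a, b], [c, d]],
    computed with Yates' continuity correction."""
    n = a + b + c + d
    diff = max(abs(a * d - b * c) - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: onset match, onset mismatch. Columns: nucleus match, nucleus mismatch.
table_38 = (473, 159, 1932, 3809)   # sigma1-sigma2, disyllabic subset
table_40 = (86, 43, 1438, 1463)     # sigma2-sigma3

chi2_38 = chi2_yates(*table_38)
chi2_40 = chi2_yates(*table_40)

# Share of pairs with matching nuclei in (40), by onset status.
match_rate_onset_match = 86 / (86 + 43)
match_rate_onset_mismatch = 1438 / (1438 + 1463)
```

Rounded to two decimals, the two statistics agree with the values reported in (38) and (40), and the two rates correspond to the 66.7% and 49.6% cited for (40).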
One generalization evident from the properties of σ 1 σ 2 and σ 2 σ 3 is that σ 2 's onset is frequently occupied by [l] or [r] ((36f,h); (39e,g)). One might ask if this is due to morphology, and, in particular, to the dictionary's potential inclusion of plural forms (like karusut, halormat). A search through Lembaga Basa & Sastra Sunda (1985) for potential singular-plural pairs, however, suggests that the dictionary does not record plurals. To identify potential plural forms, I created a list containing the subset of words considered here that have [a] as the rime of the first syllable and [l] or [r] as the onset of the second (e.g. garalaN 'a long scar', balida 'knife fish'). Possible singulars were identified by removing the al or ar from the potential plural. The majority of forms (722/935, or 77%) that qualify as potential plurals do not have a corresponding potential singular in the wordlist. Of the 214 potential plurals that do, 81 do not obey the generalizations regarding the distribution of [al∼ar] (these are forms like calacah 'cigarette ash', where ar is expected, or laruN 'missed', where al is expected), leaving 133 phonologically plausible plurals with a potential singular pair. Examples are dapon 'not determined, careless' and darapon 'at random', and jujur 'honest' and jalujur 'sewing with hand before using sewing machine'; pairs like o 'sound like about to vomit' and aro 'fly' were included even though the singular is likely subminimal. Given the small number of these forms relative to the size of the overall corpus (11,913 forms), it is unlikely that plurals are regularly recorded.
Nonetheless, I reran the statistics for σ 1 σ 2 and σ 2 σ 3 while excluding these 133 plausibly plural forms. There were no resulting qualitative changes. For σ 1 σ 2 , identical pairs are still overattested (p < .001), l+l and r+r are still underattested (p < .01 for both), and the presence of [l] or [r] in the second syllable's onset is still associated with an increase in log frequency (p < .05 for both). For σ 2 σ 3 , there is still no effect of identity on log frequency (p > .1), l+l and r+r are still underattested (p < .001 for both), and the presence of [l] or [r] in the second syllable is still associated with an increase in log frequency (p < .001 for both). Even if the 133 forms identified as plausible plurals are in fact plurals, it cannot be the case that their inclusion is responsible for the high frequency of [l] and [r] as the second syllable's onset. It is not clear to me that there is an insightful explanation for the high frequency of liquids in this position beyond some arbitrary phonotactic preference.

Results for σ 1 σ 3
The σ 1 σ 3 context differs from σ 1 σ 2 and σ 2 σ 3 in that it involves non-adjacent syllables. The analysis predicts that in this non-adjacent context there should be no drive for self-similarity, and (as a result) that combinations of [r]s and [l]s should be significantly underattested. Figure 6 plots the observed count for each σ 1 σ 3 onset pair against its predicted count. The shape of the σ 1 σ 3 data looks similar to the shape of the σ 1 σ 2 data: there is a preference for onset identity, with a concomitant dispreference for r+r and l+l (and additionally in this context, k+k).
To determine if these trends are meaningful, identity-based predictors were added to the baseline model. The results are consistent with Figure 6: there is a boost in log frequency for identity (41b) and a decrease in log frequency for r+r and l+l (41c-d) relative to other identical pairs and the independent frequencies of [r]s and [l]s (41e-h).
(41) Partial results of loglinear model for σ 1 σ 3 onset pairs (full results in the appendix)
The effect of identity is surprising, as the analysis does not predict a preference for self-similarity between non-adjacent syllables. A closer look at the 231 forms with identical onsets, however, suggests that this number is likely inflated by a type of discontinuous reduplication. Of these 231 forms, 74 have a third syllable that is composed of the first syllable's onset and the second syllable's rime (e.g. balingbing 'starfruit', corodcod 'shaky leg', harashas 'dried palm leaf', perekpek 'take a beating / got beaten'). While it is unclear if this process is synchronically active, similar patterns of discontinuous reduplication are attested in dialects of closely related Malay (see Kroeger 1989). 16 As it is possible that the self-similarity in these cases is due to some morphophonological process, it is worthwhile to consider what the data would look like were these 74 forms excluded. Figure 7 confirms that they do in fact look quite different. In Figure 7, the apparent preference for identity has vanished. A second loglinear model, fit to the data visualized in Figure 7, finds no increase or decrease in frequency associated with identity (p > .1) and near-significant frequency decrements associated with r+r and l+l (p < .1 for both); it is likely that the lack of significance in these cases is due to a lack of statistical power. 17 These results are consistent with the assumptions of Section 2's analysis: the co-occurrence restrictions, but not the drive for identity, hold in the non-local σ 1 σ 3 context.
The suggestion that there is no drive for identity in non-adjacent contexts is supported by the lack of a relationship between onset-matching and nucleus-matching in this context. Even when the 74 potentially reduplicated forms are included, the rate of onset-matching is not significantly correlated with the rate of nucleus-matching (42). When these 74 forms are excluded, the number of forms with matching onsets decreases (nucleus match, n=71; nucleus mismatch, n=74) and the correlation remains insignificant (χ 2 (1) = 1.00, p > .1).
(42) Onset-matching does not encourage nucleus-matching in σ 1 σ 3 (χ 2 (1) =
In sum, the σ 1 σ 3 data provide further evidence for co-occurrence restrictions on [r]s and [l]s: [l]s do not co-occur and [r]s co-occur only rarely. Furthermore, the lack of a relationship between onset-matching and nucleus-matching is consistent with the assumption encoded in Redup that corresponding substrings must be adjacent: in non-adjacent syllables, similarity along one dimension is not correlated with similarity along another. This conclusion is further supported by the lack of onset-matching in σ 1 σ 3, visible when the 74 potentially reduplicated forms are excluded.

Local summary
The markedness constraints proposed in Section 2 to account for /ar/ allomorphy potentially predict languagewide effects of liquid dissimilation (*[r]…[r], *[l]…[l]) and of self-similarity between adjacent syllables (Redup, Redup-σ 1 σ 2, and κκ·Ident). While corroborating evidence from other synchronic processes is limited, I have shown here that each of these constraints has echoes in the Sundanese lexicon.
One general finding is that l+l and r+r are dispreferred relative to their expected frequencies in all positions within the word, as would be expected if *[l]…[l] and *[r]…[r] were active. While the above discussion focuses only on their co-occurrence in onset position, co-occurrence is likely underattested in all contexts (as the analysis predicts). To examine the rates of co-occurrence more broadly, I searched through all forms in Lembaga Basa & Sastra Sunda (1985) (n=16,238) for words that contain more than one [r] or more than one [l]. For [r]: the dictionary contains only 247 forms with multiple [r]s, and 200 can be interpreted as involving total reduplication (biribiri 'thiamin deficiency') or partial/aggressive reduplication (rereb 'stay overnight on the road'). Many of the remaining 47 are likely loans, though they are not necessarily annotated as such (kolaborator 'a person who helps the opponent', barometer 'barometer', organisator 'a person capable of setting a meeting'). For [l]: 226 forms contain more than one [l], and 204 of these cases can be interpreted as involving total reduplication (lapatlapat 'blurry vision because the object is too far') or partial/aggressive reduplication (lalab 'vegetables served raw / salad', loloco 'mashing, pounding'). Again, of the remaining 22, many are loans (kolonial 'invasion'). The low frequency of [r] and [l] co-occurrence outside of reduplicative contexts is consistent with an analysis that treats /ar/ allomorphy as resulting in part from co-occurrence constraints on [r]s and [l]s. It is worth noting that while the existence of a dispreference for co-occurring [r]s is consistent with most analyses of Sundanese /ar/ allomorphy, the existence of a dispreference for cooccurring [l]s is uniquely consistent with the proposed analysis, as it is the only analysis I am aware of that posits *[l]…[l] (see Section 2.3).
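The dictionary search described above can be sketched as a two-step filter. This is a minimal illustration in Python with assumed function names. It only detects total reduplication (a form consisting of two identical halves). Identifying partial or aggressive reduplication, as in rereb or lalab, would require criteria that the text does not spell out, so such forms are left in the residue here.

```python
def has_multiple(word, segment):
    """True if the segment occurs more than once in the word."""
    return word.count(segment) > 1

def is_total_reduplication(word):
    """True if the word consists of two identical halves."""
    half = len(word) // 2
    return len(word) % 2 == 0 and half > 0 and word[:half] == word[half:]

# Forms cited in the text.
forms = ["biribiri", "lapatlapat", "kolaborator", "kolonial", "barometer", "rereb"]

multi_r = [w for w in forms if has_multiple(w, "r")]
multi_l = [w for w in forms if has_multiple(w, "l")]

# Forms with two [r]s that are not explained by total reduplication.
residue_r = [w for w in multi_r if not is_total_reduplication(w)]
```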
Another general finding is that in adjacent but not non-adjacent syllables, onset identity encourages nucleus identity (consistent with the activity of Redup and the κκ·Ident constraints it activates). In addition, onsets are more likely to match in σ 1 σ 2 than is naïvely expected (an observation consistent with the position-specific Redup-σ 1 σ 2). These findings hold even when potentially reduplicated forms are excluded from the analysis, underscoring the point that in Sundanese there exists an entirely phonological drive for self-similarity between adjacent syllables.
A word is necessary here regarding the relationship between these lexical trends and the analysis of /ar/ allomorphy. The analysis proposed in Section 2 uses categorical constraints, but the trends in the lexicon are gradient. There are at least two possible ways to understand why this difference exists. One possibility is that the right analysis of the alternations in Section 2 is actually a probabilistic analysis that makes gradient predictions about both alternations and the lexicon. Motivation for such a claim could come from variation in the realization of /ar/: perhaps /ar-liren/ is realized as [l-al-iren] most of the time, but less frequently as the more self-similar [l-il-iren]; perhaps /ar-curiga/ is usually realized as [c-ar-uriga] but occasionally as [c-ar-iriga]. Knowing whether or not this is the correct approach would require more data on speaker judgments and productions than is currently available. The second possibility is that the relationship between the grammar and the lexicon is indirect, in the way outlined by Martin (2007). Under this scenario, constraints like *[l]…[l], κκ·Ident-[onset], and κκ·Ident-[nucleus] play a role in determining which words are more likely to be coined and accepted by speakers, but do not act on those words directly. Thus the relative rarity of words containing multiple [r]s and [l]s, as well as the prevalence of words with self-similar adjacent syllables, is due not to any active phonological process but rather to speakers' relative unwillingness to accept and continue to use words that violate active markedness constraints. As many other pressures likely help shape the lexicon (e.g. a desire to faithfully render loanwords from languages with different phonotactics), it would be surprising if each active markedness constraint held categorically.

Lexical evidence and learnability
The discussion in Section 3.2 shows that the constraints proposed in my analysis of /ar/ allomorphy (Section 2) have echoes in the Sundanese lexicon. Recent work shows however that speakers are not always aware of statistically significant trends in the lexicon (e.g. Becker, Ketrez & Nevins 2011), so it is not necessarily the case that there should be a correlation between the constraints apparently implicated by statistical trends in the lexicon and the constraints that drive phonological alternations. The question then is why we should take seriously the lexical evidence outlined above as support for the analysis in Section 2.
This subsection outlines a potential learnability-based argument. One striking fact about descriptions of Sundanese /ar/ allomorphy is that, despite being a complex process limited to a single morphological context, it appears to be reliably acquired: descriptions of the pattern by Robins (1959), Van Syoc (1959), Cohn (1992), and Bennett (2015a,b) are mutually reinforcing (and all appear to rely at least in part on their own primary data). As it is a stable, reliably acquired pattern, its analysis should be easily learnable given the input available to a child. Based on evidence from a large Sundanese corpus, I suggest that ∼.02% of a learner's input would provide them with evidence that [ar∼al] alternations exist. In order to acquire these alternations, the learner would thus need to posit a complicated set of rankings based on comparatively few forms. Links between morphophonology and the lexicon would make the child's task easier, as the constraints and potentially the rankings among them could be induced at least in part from the larger lexicon.
This subsection focuses on quantifying the evidence for [ar∼al] alternations and stops short of implementing a computational learner to demonstrate that the necessary constraints and their ranking can be induced from the lexical evidence. The discussion remains speculative in this way because there is no currently implemented phonotactic learner that can induce the representations and constraints assumed by Zuraw (2002), nor is there a currently implemented learner that can find non-local-only dissimilation. While the Inductive Phonotactic Learner (Gouskova & Gallagher 2020) can discover non-local restrictions, to do so it must first discover a restriction that holds within a trigram (e.g. *X[]X). But the evidence for *r[]r and *l[]l is muted in Sundanese and thus not discoverable by any algorithm that requires evidence for a local co-occurrence restriction to justify searching for a non-local one. 18

Corpus and methodology
To approximate the frequency of words containing plural /ar/, I extracted all potential singular-plural pairs from the Sundanese An Crúbadán corpus (Scannell 2007), which comprises 713,970 tokens. Potential plurals were forms with an al or ar sequence that is both followed by and not preceded by a vowel; forms like laloba 'many, abundant, plenty pl.' were considered but forms like regional 'regional' were not. Potential singulars were identified by removing ar or al from the plural and searching the wordlist for the resulting singular. Thus for laloba, the wordlist was searched for loba. A singular-plural pair was recorded if the corresponding singular exists and has a higher token frequency than the plural. This frequency criterion was established based on a preliminary search through the corpus for singular-plural pairs identified in the extant literature on Sundanese /ar/ allomorphy (Robins 1959; Cohn 1992; Bennett 2015a,b). The findings indicate that productively derived plurals are less frequent than the singulars; the mean token frequency for words containing a singular form was 173.9 and the mean token frequency for words containing a plural form was 52.4. 19 Several examples with their associated frequencies are in (43). Given the general trend for singulars to be more frequent than the corresponding plurals, a search that limits plausible singular-plural pairs to those with a more frequent singular is justifiable.
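The search procedure just described can be sketched as follows. This is a hedged Python illustration rather than the script used for the study. The function names, the vowel set, and the token frequencies in `toy_freq` are hypothetical; only the word forms come from the text. Each segment is assumed to be written with one character.

```python
VOWELS = set("aiueoé")  # assumed vowel symbols

def candidate_singulars(word):
    """Remove each al/ar sequence that is followed by a vowel and not
    preceded by one, and return the resulting candidate singulars."""
    out = []
    for i in range(len(word) - 2):          # a following segment must exist
        if word[i:i + 2] in ("al", "ar"):
            preceded_by_vowel = i > 0 and word[i - 1] in VOWELS
            followed_by_vowel = word[i + 2] in VOWELS
            if followed_by_vowel and not preceded_by_vowel:
                out.append(word[:i] + word[i + 2:])
    return out

def plausible_pairs(freq):
    """Return (singular, plural) pairs in which the singular is attested
    and has a higher token frequency than the plural."""
    pairs = []
    for plural, n_plural in freq.items():
        for singular in candidate_singulars(plural):
            if freq.get(singular, 0) > n_plural:
                pairs.append((singular, plural))
    return pairs

# Hypothetical token frequencies, for illustration only.
toy_freq = {"loba": 120, "laloba": 15, "regional": 40, "hal": 300, "halal": 25}
pairs = plausible_pairs(toy_freq)
```

Because the procedure looks only at the string and at token frequency, it pairs hal with halal just as it pairs loba with laloba.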

Findings
The search discovered a total of 991 plausible singular-plural pairs. The token frequency of the plural forms sums to 6,239, meaning that approximately 0.1% of the tokens in the corpus are plausibly pluralized forms. This is a conservative estimate, as neither /ar/'s location nor the semantics of the plural were considered when deciding whether or not a pair was plausible. In other words, pairs like hal 'thing' and halal 'halal', tatu 'a wound from war or accident' and tatalu 'hitting the tip of one's fingers or palms against any hard surface to make sounds (music)' were counted as plausible pairs, even when the "plural" is likely a simplex word (halal) or the affix does not occur before the initial vowel (tatalu). (I included pairs like tatu-tatalu because some prefixes can attach outside of /ar/; a prefixed form from Robins 1959:344, where the affix does not occur before the initial vowel, is di-bawa, di-barawa 'to be carried (pl.)'. Semantics were not considered because glosses are not provided in the corpus.) Not all of these 991 plausible plurals are informative about the ranking governing /ar/ allomorphy, as most lack another liquid. Recall from Section 2 that in roots that do not contain a liquid, IO·Ident-[±lateral] prohibits /ar/ from being realized as anything but [ar]. The number of plausible plurals whose stem contains a liquid is much smaller, at 353, and their frequency amounts to 1,179 tokens. Assuming that the An Crúbadán corpus is broadly representative of the types of words that the Sundanese learner encounters, the implication is that only .02% of words the learner encounters would provide evidence as to the ranking of the various constraints proposed in Section 2.
20 While this may well be enough information for the learner to arrive at the correct ranking (alternations are salient, and .02% of a child's input is likely still a large number of words), the links established here between phonology and the lexicon mean that the child's acquisition of /ar/ allomorphy may be bolstered by trends discoverable in the lexicon. In other words, it may be easier for the Sundanese learner to discover the proposed analysis than an alternative that treats the /ar/ allomorphy as an idiosyncratic property divorced from the larger lexicon (cf. 2003 on how learners do not draw morphophonological generalizations from small amounts of data).
(Robins 1959:352) and so it is probable that [l]-assimilation has applied as expected in stem-initial position. Most of the apparent exceptions have a plausible reanalysis along these lines.

Discussion
This paper has shown that Sundanese /ar/ allomorphy can be analyzed as resulting from unbounded cooccurrence restrictions on [r]s and [l]s, whose effects in local contexts are obscured by a general desire for identity between adjacent syllables. Statistical trends from the lexicon are consistent with this analysis. I have suggested that this connection between /ar/ allomorphy and the lexicon may function as an argument for the proposed analysis, as the evidence that would be required for a learner to acquire the crucial rankings governing /ar/ allomorphy is otherwise likely infrequent.
Recall that our interest in the Sundanese data is in how they bear on the predictions of two competing theories of dissimilation: Suzuki's (1998) GOCP, in which dissimilation is motivated by co-occurrence constraints; and Bennett's (2015a,b) SCTD, in which dissimilation is a way of avoiding similarity-based surface correspondence. The GOCP predicts that non-local-only dissimilation should only arise given the coexistence of some independent pressure that disprefers the results of local dissimilation. As discussed above, Sundanese (which has the only known case of non-local-only dissimilation) fits this description. In addition to cases like Sundanese, the SCTD predicts cases of non-local-only dissimilation that cannot be analyzed by invoking constraints that disprefer the results of local dissimilation. This prediction is not supported by the typological data. Furthermore, results from artificial grammar learning experiments parallel the typological data. McMullin & Hansson (2016) show that participants are able to acquire the kinds of non-local dissimilation predicted by both the GOCP and the SCTD, where a non-local restriction on identical liquids (*lVCVl, *rVCVr; lVCVr, rVCVl) accompanies a restriction on local non-identical liquids (lVl, rVr; *lVr, *rVl). McMullin & Hansson (2019) however show that participants are not able to reliably learn non-local dissimilation when it is not accompanied by local assimilation, regardless of whether or not they are presented with overt evidence for non-alternation in local contexts. These findings suggest that the type of non-local dissimilation uniquely predicted by the SCTD is not only unattested but also unlearnable, and that the correct theory of dissimilation should not treat it as part of the learner's hypothesis space.
Prior work has shown that the SCTD fails to make sufficiently restrictive predictions in other domains as well. For example, Stanton (2017) shows that the GOCP predicts a more restricted typology of blocking in long-distance dissimilation than does the SCTD, and that all known relevant cases are consistent with the GOCP's predictions. In addition, Stanton (2016b) shows that the SCTD fails to derive a generalization regarding the role of similarity in dissimilation. Generally speaking, if a language disprefers co-occurrence of two less similar segments, it also disprefers co-occurrence of more similar segments (the only exceptional cases in this respect involve fully identical segments; see e.g. MacEachern 1997; Gallagher & Coon 2009; Gallagher 2013 for discussion and analysis). But the SCTD predicts the opposite similarity implication: all else being equal, dissimilation of two more similar segments should imply dissimilation of less similar segments. To give a concrete example, the SCTD can generate a system in which /p p/ and /p f/ can co-occur, but /p v/ is banned (Stanton 2016b:539). The typology of dissimilation offers no cases with this character.
An argument offered by Bennett (2015a,b) for the SCTD is that it unifies the analysis of long-distance assimilation and dissimilation: the theory's predictions regarding the typology of dissimilation follow directly from its analysis of the typology of assimilation. The work reported in this paper and cited above, however, suggests that the SCTD's predictions in the domains of locality and similarity are not sufficiently restrictive. These results, in turn, raise the question of whether Bennett's theoretically elegant unification of two disparate typologies should come at the expense of restrictiveness. My position is that it should not, and that the facts reviewed here support co-occurrence-based theories of dissimilation over available correspondence-based alternatives.

Appendix: full results of statistical models
This appendix contains full results for four statistical models: the σ₁σ₂ model summarized in (36), the σ₂σ₃ model summarized in (39), the σ₁σ₃ model summarized in (41), and the additional σ₁σ₃ model in which the 74 forms that plausibly exhibit discontinuous reduplication have been excluded (see Section 3.2.3 for discussion). Further variations on these models (like those that exclude plausible plurals) are not reported here, as the results did not differ qualitatively from those presented below.