What Changed Your Mind: The Roles of Dynamic Topics and Discourse in Argumentation Process

In a world full of uncertainty, debate and argumentation contribute to the progress of science and society. Despite increasing attention to characterizing human arguments, most progress so far has focused on the debate outcome, largely ignoring the dynamic patterns in argumentation processes. This paper presents a study that automatically analyzes the key factors in argument persuasiveness, beyond simply predicting who will persuade whom. Specifically, we propose a novel neural model that dynamically tracks the changes of latent topics and discourse in argumentative conversations, allowing the investigation of their roles in influencing the outcomes of persuasion. Extensive experiments are conducted on argumentative conversations from both social media and the Supreme Court. The results show that our model outperforms state-of-the-art models in identifying persuasive arguments by explicitly exploring dynamic factors of topic and discourse. We further analyze the effects of topics and discourse on persuasiveness and find that both are useful: topics provide concrete evidence, while superior discourse styles may bias participants, especially in social media arguments. In addition, we draw some findings from our empirical results that can help people better engage in future persuasive conversations.


INTRODUCTION
"The aim of argument, or of discussion, should not be victory, but progress." -Joseph Joubert

An argumentation process is a turn-taking dialogue mostly held to increase the acceptability of a controversial standpoint. In the process, a series of connected propositions (henceforth arguments) are put forward with the intention of justifying or refuting a standpoint before a rational judge [40]. Argumentation plays an essential role in making decisions, constructing knowledge, and bringing truths and better ideas to life [19]. Consequently, understanding argumentation processes will help individuals and human society better engage with conflicting stances and open up their minds to pros and cons [24]. Argumentation brings different ideas into collision to form thoughts and knowledge, helping to advance science and society [45]. However, making sense of argumentative conversations is a daunting task for human readers, mostly due to the varied viewpoints and evidence continuously put forward and the complicated interaction structure therein, not to mention the huge volume of argumentation data appearing on online platforms every day.
We hence study how to automatically understand argumentation processes, predicting who will persuade whom and figuring out why it happens. To date, much of the progress in persuasiveness prediction has focused on individual arguments, the wordings therein [13,44], and how they locally connect with other arguments [15,17]. In contrast, we examine the context and the dynamic progress of argumentative conversations, which goes beyond the study of argument-level persuasiveness. Some studies analyze argument interactions [14,43] to predict who will win the debate, but most of them focus on the outcome of argumentation instead of diving deep into the argumentation process [19,37]. The process, however, is arguably the essence of argumentation, revealing how participants collaborate to reshape and refine ideas.
In light of these missing points, we track the argumentation process and explicitly explore the dynamic patterns of what a discussion is centered around (henceforth topics) and how the participants voice their opinions in arguments (henceforth discourse), as well as how they affect the persuasion results. To illustrate the interplay of topics and discourse in argument persuasiveness, Figure 1 shows a Reddit conversation snippet from the ChangeMyView subreddit.¹ On ChangeMyView, an opinion holder first raises a viewpoint (henceforth OP, short for original post), followed by challengers' arguments attempting to change the opinion holder's mind. This example dialogue is formed of challengers' arguments against "learning a second language isn't worth it for most people anymore", which was the opinion holder's point of view.

arXiv:2002.03536v1 [cs.CL] 10 Feb 2020

OP: Translation software already exists and is pretty good. ... Most people putting in years of effort to learn a foreign language skill they might only use a couple times in their whole life, and will likely forget.
A1 [Evidence]: ... There is research that indicates "that those who spoke two or more languages had significantly better cognitive abilities compared to what would have been expected from their baseline test." ⟨url⟩. ... Another study found that "the language-learning participants ended up with increased density in their grey matter and that their white matter tissue had been strengthened." ⟨url⟩
A2 [Metaphor]: The common comparison is made to learning music, as /u/awesomeosprey has pointed out. I did some research into the matter. It seems that learning a musical instrument does have long-lasting benefits (⟨url⟩) that relate to "higher-order aspects of cognition." ...
... But a quick search and I have other sources: ⟨digit⟩ ⟨url⟩, ⟨digit⟩ ⟨url⟩, ⟨digit⟩ ⟨url⟩. The most interesting study is this one (⟨url⟩), but I can't find a complete version of it, sorry. Note: Study ⟨digit⟩ has an exceptionally small sample size. It's still interesting reading.
Figure 1: A ChangeMyView conversation snippet of challengers' arguments against the OP raised by the opinion holder, concerning "learning a second language isn't worth it anymore for most people". The red and italic words indicate the key points resulting in the challengers' victory. The words in [] are our interpretations of the arguments' discourse styles.
In this example, the challengers successfully persuaded the opinion holder to change their view. There are two probable reasons. First, strong evidence (reflected by topic words) is put forward, such as the research findings on cognitive abilities. Second, the challengers deploy skillful debating styles (captured by discourse words), such as the metaphor of learning music (in A2) and the reference to external information (in A4).
Motivated by these observations, we propose a novel neural framework that explicitly models how changes in discussion topics and discourse styles affect persuasion effectiveness. Our model first explores latent topics and discourse in arguments with word clusters. Furthermore, it tracks topic change and discourse flow in the argumentation process and automatically interprets the key factors indicating the success or failure of the persuasion. Coupling the advantages of neural topic models [28,49,51] and dynamic memory networks [23,46,54], we are able to explore dynamic topic and discourse representations indicative of persuasiveness in an end-to-end manner, together with persuasion outcome prediction. To the best of our knowledge, we are the first to explicitly model topics and discourse in argumentation processes and to investigate how their dynamic patterns contribute to argument persuasiveness.

¹ https://www.reddit.com/r/changemyview/
We carry out extensive experiments on argumentative conversations gathered from both social media and the U.S. Supreme Court. The results show that our model significantly outperforms state-of-the-art methods on both datasets, demonstrating its effectiveness in identifying persuasive arguments. For example, we achieve 70.2% accuracy when predicting winners in Supreme Court debates, compared with 63.1% obtained by logistic regression without explicitly exploiting dynamic topic and discourse features in argumentation processes. Based on the produced topics and discourse, we further analyze how they affect persuasiveness. The analysis indicates that topics (such as evidence and viewpoints) statistically contribute more to persuasion success, while a skillful discourse style may sometimes lead to victory. In addition, we summarize the key findings from our empirical results, which can help individuals better engage in future persuasions.
To sum up, our contributions are threefold:
• We are the first to study the argumentation process via dynamic analysis of latent topics and discourse, which reveals the key factors in argument persuasiveness.
• We propose a novel neural model to predict argumentation outcome via tracking dynamic topic and discourse patterns in the dialogue process.
• We provide an extensive empirical study on two real-world datasets that demonstrates the effectiveness of our model and sheds light on a better understanding and development of persuasive argumentation.

RELATED WORK

Argument Persuasiveness
Argument persuasiveness is a fast-growing sub-field of computational argumentation mining [35,41]. Previous work in this area mostly addresses the identification of convincing arguments [13,44] and viewpoints [14,19] from varying argumentation genres, such as social media discussions [37], political debates [4], and student essays [6]. In this line, many existing studies focus on crafting hand-made features [37,44], such as wordings and topic strengths [43,53], echoed words [2], semantic and syntactic rules [15,30], participants' personality [42], argument interactions and structure [29], and so forth. These methods, however, require a labor-intensive feature engineering process and hence generalize poorly to new domains. Recently, building upon the success of neural models in natural language processing (NLP), neural argumentation mining methods have been proposed to enable end-to-end learning of automatic features and argument persuasiveness. For example, Potash et al. [31] tailor a pointer network architecture to learn argument representations. Lin et al. [26] focus on incorporating external lexicons into an attentive neural network for argumentative component identification. These studies, however, ignore the dynamic nature of the argumentation process, where the persuasion features may change in a heated back-and-forth debate. Some other methods model argument interactions in persuasiveness prediction. Ji et al. [17] explore the argument-level interactions between a ChangeMyView original post (OP) and its following comments with a co-attention network. Jo et al. [19] investigate the interplay between the OP and its challenger's argument, explicitly identifying the amenable parts of the OP that are likely to be affected by good arguments. Compared with these studies, which focus on the interaction between the OP and comments, we dynamically track the entire argumentation flow and capture how topics and discourse therein change and affect persuasion outcomes.
Hidey and McKeown [14] employ sequence modeling to learn implicit persuasiveness signals from chronologically ordered arguments. In contrast, we explicitly capture the dynamic topics and discourse behaviors as the discussion moves forward, so that their roles in shaping persuasive arguments can be examined.

Conversation Process Understanding
Our work is also closely related to conversation process understanding. In this line, previous studies have shown the benefits of discovering the latent discourse structure, which shapes how utterances interact with each other and form the discussion flow through dialogue acts (e.g., making a statement, asking a question, and giving an example). Most of them extend the Hidden Markov Model (HMM) to produce distributional clusters of words that reflect latent discourse [8,33]. In discourse learning, features are exploited via modeling of the conversation tree structure [25], the relative position of sentences [20], topic content [32,52], and so forth.
In addition, recent progress in recurrent variational neural networks (a non-linear counterpart of the HMM) makes it possible to capture latent discourse structure in dialogues. For example, the latent variable RNN (LVRNN) and variational RNN (VRNN) have been adopted to model the latent conversation states in each turn [18,34]. Zeng et al. [49] jointly explore topic content and discourse behavior to better understand conversations, using word clusters to represent topics and discourse in microblog conversations. However, none of them captures how topics and discourse change in a conversation process and how these dynamic patterns affect argumentation persuasiveness, which is the gap our work fills.

STUDY DESIGN
In this section, we first introduce how we formulate our problem, followed by a detailed discussion on the experimental datasets.

Problem Formulation
In this paper, we define an argumentation process $C$ as a dynamic conversation process held by participants. It is formulated as a sequence of turns, denoted as $C = \{x_t\}_{t=1}^{T}$, where a turn $x_t$ refers to an argument and $T$ is the number of turns in the process. As discussed above, our work studies argument persuasiveness in the context of its discussion process, which, however, relies on subjective judgement: human performance on "yes-or-no" persuasiveness judgement is still close to random guessing [37]. In our study, we therefore view argument persuasiveness from a comparative perspective (instead of answering "yes or no") and formulate its prediction as a pairwise ranking problem under a debate $D$. Concretely, we construct the pairwise comparison setting to take a pair of argumentation processes $\langle C_i, C_j \rangle$ as input, where $C_i, C_j \in D$; scores $y_i$ and $y_j$ are assigned to measure their respective persuasiveness. Here $y_i > y_j$ means that $C_i$ has a better chance of winning the debate than $C_j$, and $y_i < y_j$ means the opposite. The goal of our paper is to predict which argumentation process from the input pair is relatively more persuasive and to analyze the key factors therein to reveal insights for argumentation study. Our problem setting can fit diverse scenarios to learn what a good persuasion should be. For example, it works for the classic Oxford-style debate involving two sides, where one argues "for" a statement and the other "against". The arguments from both sides can be defined as $C^f = \{x^f_t\}$ ("for" side) and $C^a = \{x^a_t\}$ ("against" side), which correspond to the input pair in our problem setting.

Data Description
We conduct our study in two scenarios: social media arguments, which tend to use colloquial and informal language, and Supreme Court debates, which exhibit a more formal language style. The social media arguments are gathered from the ChangeMyView subreddit, where challengers engage in the discussion in an attempt to change the opinion holder's view (stated in the original post, OP) [37]. As a multi-party conversation, a debate there has a tree structure formed by in-reply-to relations (a post can have multiple replies), and a path therein is defined as an argumentation process. We aim to predict which path has a better chance of being awarded a ∆ by the opinion holder, which indicates successful persuasion. For the Supreme Court debates, we aim to predict whether the petitioner or the respondent will win the case, given their corresponding conversational exchanges with the justices.
The ChangeMyView social media dataset (henceforth CMV) is built from a corpus released by Tan et al. [37], with argumentative conversations held from Jan 2013 to May 2015. As stated above, each discussion in CMV can be organized in a tree structure with in-reply-to relations (henceforth a debate tree), with its root representing the OP (the opinion holder's viewpoint). To construct our input data, following Tan et al. [37], we first filter out the trivial cases by removing the discussions with fewer than 10 challengers or those that do not contain a ∆. Then, we flatten the debate tree into conversation paths and remove replies with 50 words or fewer. Also removed are conversation paths involving fewer than two turns. Next, all challengers' replies remaining in a conversation path are considered as the turns of an argumentation process. For each debate tree, we form a positive candidate set with all the argumentation processes (paths) leading to a ∆, and include those without a ∆ in the negative candidate set. To formulate our pairwise inputs, we take the Cartesian product of the positive and negative candidate sets, which returns all the possible combinations of successful-unsuccessful argumentation process pairs in the debate.

The Supreme Court debate dataset (abbreviated as Court) was gathered by Danescu-Niculescu-Mizil et al. [9] from U.S. Supreme Court dialogues. In this corpus, the petitioner and the respondent take turns making conversational exchanges with the justices to defend themselves. Here the petitioner's utterances are taken to form its argumentation process, and likewise for the respondent's. For each case, we build the positive candidate set with argumentation processes from the winning side, and the negative set from its opponent. The pairwise inputs are formed following a procedure similar to that used for the CMV dataset.
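The path flattening and Cartesian product described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tree encoding (`{post_id: [reply_ids]}`), the function names, and the toy data are all assumptions, the tree is assumed to hold only challengers' replies below the root, and the word-count and challenger-count filters are omitted.

```python
from itertools import product


def flatten_paths(tree, root):
    """Return every root-to-leaf path of a reply tree given as {post_id: [reply_ids]}."""
    replies = tree.get(root, [])
    if not replies:
        return [[root]]
    return [[root] + path for reply in replies for path in flatten_paths(tree, reply)]


def build_pairs(tree, root, delta_posts, min_turns=2):
    """Pair every delta-winning process with every non-winning one (Cartesian product)."""
    # Drop the root (the OP): only the replies below it count as turns.
    processes = [path[1:] for path in flatten_paths(tree, root)]
    # Discard processes with fewer than `min_turns` turns.
    processes = [p for p in processes if len(p) >= min_turns]
    positive = [p for p in processes if any(post in delta_posts for post in p)]
    negative = [p for p in processes if not any(post in delta_posts for post in p)]
    return list(product(positive, negative))


# Hypothetical debate: "op" has replies "a" and "b"; "a" has replies "c" and "d".
toy_tree = {"op": ["a", "b"], "a": ["c", "d"]}
pairs = build_pairs(toy_tree, "op", delta_posts={"c"})
print(pairs)  # [(['a', 'c'], ['a', 'd'])]
```

In the toy example, the path ending in "b" has a single turn and is dropped, so only one successful-unsuccessful pair remains.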
In addition, we employ two strategies to further improve the quality of our data. First, to ensure that the argumentation processes in an input pair concern relevant topics, we measure their Jaccard similarity over the bag-of-words form. Following the practice in Tan et al. [37], we remove pairs with Jaccard similarity below 0.5, where the two conversations may not be on the same page. Second, as pointed out in previous studies [37], the number of argument turns can largely affect the debate outcome. We show the distribution of turn numbers over winning and losing argumentation processes in Figure 2 and observe that the winning ones tend to be shorter (with fewer turns). This might cause the model to learn the turn number as a trivial feature for persuasiveness prediction. To mitigate the effects of turn number and better study the roles of topics and discourse, we make sure that the pairwise processes fed to the model are equally long (have the same number of argument turns). To this end, we remove pairs whose negative process is shorter than the positive one and, for the rest, truncate the extra turns of the negative process. The statistics of our two datasets are shown in Table 1. As can be seen, there are more conversations in CMV than in Court. However, the Court debates involve more turns (26.9 vs. 3.6 turns on average per conversation). This might be because court debates are more serious and usually proceed in a back-and-forth fashion, while social media discussions are mostly casual and may end soon.
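The two filtering strategies above can be sketched as follows. This is an illustrative reading under stated assumptions: Jaccard similarity is computed on word sets pooled over the whole process and before truncation, and the function names and toy tokens are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between the word sets of two token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def filter_and_align(pos_turns, neg_turns, threshold=0.5):
    """Return an equally long (positive, negative) pair, or None if the pair is dropped.

    Each process is a list of turns and each turn is a list of tokens.
    """
    pos_words = [word for turn in pos_turns for word in turn]
    neg_words = [word for turn in neg_turns for word in turn]
    if jaccard(pos_words, neg_words) < threshold:
        return None  # the two processes are not on the same page
    if len(neg_turns) < len(pos_turns):
        return None  # the negative process is shorter than the positive one
    return pos_turns, neg_turns[:len(pos_turns)]  # truncate the extra negative turns


positive = [["a", "b"], ["c"]]
negative = [["a", "b"], ["c", "d"], ["e"]]
aligned = filter_and_align(positive, negative)
```

Here the two processes share three of five distinct words (similarity 0.6), so the pair is kept and the negative process is cut to two turns.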
It is worth noting that we do not feed the words from either opinion holders or justices into the model, to avoid possible bias in persuasiveness prediction. In doing so, we can focus on the linguistic features in participants' arguments that lead to good persuasion. Further, it enables our setting to be easily adapted to scenarios without third-party engagement (e.g., opinion holders and justices). In addition, for the CMV dataset, we consider the engagement of all challengers regardless of their ∆ records, unlike Tan et al. [37], who only examine the ∆ winners. This is because everyone's efforts may contribute to the final success (or failure) of an argumentation process. Therefore, all the challengers' arguments are taken into account in our persuasiveness analysis. For the same reason, we have more training instances than Tan et al. [37] (12,879 vs. 4,263).

DTDMN: DYNAMIC TOPIC-DISCOURSE MEMORY NETWORKS FOR ARGUMENT PERSUASIVENESS
This section presents our model, which predicts persuasiveness and dynamically discovers the key topic and discourse factors therein to explain the reasons behind the prediction. Our model, named dynamic topic-discourse memory networks (DTDMN), consists of three modules: one to learn latent topic and discourse factors from each argument (henceforth argument factor encoder), one to explore the change of topic and discourse factors in argumentation flows (henceforth dynamic process encoder), and one to identify the more persuasive conversation from the input pair (henceforth persuasiveness predictor). The model architecture is shown in Figure 3, with an overview presented in Section 4.1. Then, in Sections 4.2, 4.3, and 4.4, we describe the three modules in turn, followed by our learning objective in Section 4.5.

Model Overview
As described in Section 3, our model takes pairwise conversations as input. In training, we feed $\langle C^+; C^- \rangle$ into our model, where $C^+$ is a positive instance referring to a persuasive conversation. Likewise, $C^-$, the negative instance, denotes a failed persuasion. During testing, given two conversations, our model will recognize the one that is more persuasive. Each conversation $C$ is formed of a sequence of argumentative turns (henceforth arguments): $C = \langle x_1, \ldots, x_T \rangle$, where $T$ denotes the number of arguments in $C$.
For the $t$-th argument $x_t$, we capture argument-level representations, $z_t \in \mathbb{R}^K$ for the topic factor and $d_t \in \mathbb{R}^D$ for the discourse factor, from the input bag-of-words vector $x^{BoW}_t \in \mathbb{R}^V$, where $K$ is the number of topics, $D$ the number of discourse styles, and $V$ the vocabulary size. Then, $z_t$ and $d_t$ are fed into the dynamic memory, together with the word index sequence $x^{Seq}_t \in \mathbb{R}^L$, to update the memory state, where $L$ is the sequence length. The output of the dynamic memory networks is used to predict the persuasiveness score $y$ for each conversation, where higher scores indicate better persuasiveness. Our training target is to have $y^+ > y^-$ for $C^+$ and $C^-$.

Argument Factor Encoder
This section presents how we capture topic and discourse factors at the argument level. The subscript $t$ is omitted for simplicity. As mentioned in Section 4.1, we employ the latent variable $z$ to represent the argument's topic factor and $d$ its discourse factor. The modeling process is inspired by Zeng et al. [50] and based on the variational autoencoder (VAE) [22], reconstructing a given argument in the BoW form, $x^{BoW}$, conditioned on $z$ and $d$. Here $z$ is the topic mixture and $d$ is a one-hot vector denoting the discourse style.⁶ Specifically, the generation process for each word $w_n \in x^{BoW}$ is defined as:
where $f_*(\cdot)$ is a neural perceptron that linearly transforms inputs. For both latent topic and discourse factors, we employ word distributions to represent them. Here we consider the weight matrix of $f_{\phi^T}(\cdot)$ (after softmax normalization) as the topic-word distributions, $\phi^T$. Likewise, the weight matrix of $f_{\phi^D}(\cdot)$ is used to compute the discourse-word distributions, $\phi^D$. The other parameters, $\mu$, $\sigma$, and $\pi$, are learned from the input $x^{BoW}$ as follows:

$\mu = f_\mu(\tanh(f_e(x^{BoW}))), \quad \log \sigma = f_\sigma(\tanh(f_e(x^{BoW}))), \quad \pi = \mathrm{softmax}(f_\pi(x^{BoW})). \quad (2)$
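To make Equation (2) concrete, the sketch below computes $\mu$, $\log \sigma$, and $\pi$ from a toy bag-of-words vector and then draws $z$ and a relaxed $d$. Several parts are assumptions rather than the authors' implementation: the layers are randomly initialized stand-ins for trained parameters, $z$ is sampled with the standard VAE reparameterization trick, $d$ uses a plain Gumbel-Softmax with a hypothetical `temperature` parameter, and any further normalization of $z$ into a mixture is left out.

```python
import math
import random


def linear(layer, x):
    """Apply an affine layer given as (weight rows, bias)."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Hypothetical random initialization, standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out


def encode(x_bow, layers, rng, temperature=1.0):
    """Compute mu, log sigma, pi as in Equation (2), then sample z and a relaxed d."""
    hidden = [math.tanh(v) for v in linear(layers["e"], x_bow)]
    mu = linear(layers["mu"], hidden)
    log_sigma = linear(layers["sigma"], hidden)
    pi = softmax(linear(layers["pi"], x_bow))
    # Reparameterization (assumed): z = mu + sigma * eps with eps ~ N(0, 1).
    z = [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]
    # Gumbel-Softmax (assumed form): a differentiable, nearly one-hot sample from pi.
    noise = [-math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for _ in pi]
    d = softmax([(math.log(p + 1e-12) + g) / temperature for p, g in zip(pi, noise)])
    return z, d, pi


rng = random.Random(0)
V, H, K, D = 6, 4, 3, 2  # toy sizes: vocabulary, hidden units, topics, discourse styles
layers = {
    "e": init_layer(rng, H, V),
    "mu": init_layer(rng, K, H),
    "sigma": init_layer(rng, K, H),
    "pi": init_layer(rng, D, V),
}
z, d, pi = encode([1, 0, 2, 0, 0, 1], layers, rng)
```

Lowering `temperature` pushes `d` closer to a one-hot vector, at the cost of noisier gradients in a real training setup.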

Dynamic Process Encoder
Based on the topic and discourse factors learned at the argument level, here we discuss how to capture their dynamic patterns in the persuasion process. Our dynamic process encoder is inspired by the dynamic memory network (DMN) [23,46,54] and the topic memory mechanism [51]; it captures the indicative dynamic topic and discourse factors to interpret why a conversation can result in successful persuasion.
To be more specific, the memory weight $w_t \in \mathbb{R}^{K+D}$ is defined as the concatenation of the latent aspects $z_t$ and $d_t$: $w_t = [z_t; d_t]$, where $[\cdot;\cdot]$ represents concatenation. Once we have the memory weight, DTDMN retrieves and updates the memory according to the memory weight and the input argument. We employ a bidirectional attentive GRU [3,47] to encode the word index sequence $x^{Seq}_t$ into the hidden state $h^X_t \in \mathbb{R}^H$:
where $j \in [1, L]$ and $x_{t,j}$ is the $j$-th token in $x^{Seq}_t$. $\mathrm{attn}(\cdot)$ is the attention operator [3,27] that aggregates the token representations into a vector representation for $x^{Seq}_t$. Similar to Zhang et al. [54], we employ a forget gate to erase the retrieved memory. The erase vector is denoted as $e_t \in \mathbb{R}^E$, where $E$ is the dimension of memory embeddings. Afterwards, an augment gate is used to strengthen the retrieved memory. The augment vector is denoted as $a_t \in \mathbb{R}^E$. The overall update formulae for the episodic memory are:
where $M_{t,i} \in \mathbb{R}^E$ is the $i$-th row of the memory matrix $M_t$ and $\mathbf{1}$ is a row vector of all 1s. $W^{(e)}, W^{(a)} \in \mathbb{R}^{E \times H}$ and $b^{(e)}, b^{(a)} \in \mathbb{R}^E$ are the weight matrices and bias vectors for computing $e_t$ and $a_t$, respectively. The read content $r_t \in \mathbb{R}^E$ of the episodic memory $M_t$ is the weighted sum of the rows of the memory matrix: $r_t = \sum_{i=1}^{K+D} w_{t,i} M_{t,i}$.

⁶ We follow the setting of Zeng et al. [50] and apply Gumbel-Softmax relaxation for $d$.
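The erase-then-augment memory update and the weighted read can be illustrated with the sketch below. It follows the erase/add pattern of key-value memory networks in the spirit of Zhang et al. [54]; the sigmoid erase gate, the tanh augment gate, and the exact update rule are assumptions, since the formulae themselves are not reproduced here. All names and toy values are hypothetical.

```python
import math


def affine(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def update_memory(memory, weight, hidden, W_e, b_e, W_a, b_a):
    """Erase-then-augment update of a memory with one row per topic/discourse slot."""
    # Assumed gates: sigmoid for the erase vector, tanh for the augment vector.
    erase = [1.0 / (1.0 + math.exp(-v)) for v in affine(W_e, b_e, hidden)]
    augment = [math.tanh(v) for v in affine(W_a, b_a, hidden)]
    updated = []
    for w_i, row in zip(weight, memory):
        # Slots with a larger memory weight are erased and rewritten more strongly.
        kept = [m * (1.0 - w_i * e) for m, e in zip(row, erase)]
        updated.append([m + w_i * a for m, a in zip(kept, augment)])
    return updated


def read_memory(memory, weight):
    """Weighted sum of memory rows, using the memory weight w_t."""
    return [sum(w_i * row[j] for w_i, row in zip(weight, memory)) for j in range(len(memory[0]))]


# Toy setting: two memory slots, E = 2, H = 1, all-zero gate parameters.
memory = [[1.0, 1.0], [2.0, 2.0]]
weight = [1.0, 0.0]
hidden = [0.0]
zero_w, zero_b = [[0.0], [0.0]], [0.0, 0.0]
memory = update_memory(memory, weight, hidden, zero_w, zero_b, zero_w, zero_b)
read = read_memory(memory, weight)
```

With all-zero gate parameters the erase vector is 0.5 everywhere and the augment vector is 0, so the fully weighted first slot is halved while the unweighted second slot is left unchanged.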

Persuasiveness Predictor
For each conversation, DTDMN dynamically summarizes the read contents of the previous arguments in a conversation, $\{r_t\}_{t=1}^{T'}$, via an attentive GRU at the argument level:
Then we map $h^R$ to a score value $y$:
where $W^{(r)} \in \mathbb{R}^{1 \times E}$ and $b^{(r)} \in \mathbb{R}^1$ are the weight and bias for computing $y$.

Learning Objective
Argument Factor Learning. To model topic and discourse factors, in learning we maximize the variational lower bounds $\mathcal{L}_z$ for $z$ and $\mathcal{L}_d$ for $d$. The corresponding functions are defined as:
where $p(z)$ is the standard normal prior $\mathcal{N}(0, I)$ and $p(d)$ the uniform distribution $\mathrm{Unif}(0, 1)$. $q(z \mid x)$ and $q(d \mid x)$ are posterior probabilities that approximate how $z$ and $d$ are generated from the arguments. $p(x \mid z)$ and $p(x \mid d)$ represent the corpus likelihoods conditioned on these topic and discourse factors. The overall argument factor learning objective is to maximize:
where $\mathcal{L}_x$ is for reconstructing the argument $x$ from $z$ and $d$, and $\mathcal{L}_{MI}$ is the mutual information (MI) penalty (for separating topic and discourse words). The hyperparameter $\lambda$ balances $\mathcal{L}_{MI}$ against the other learning objectives. We leave out the details and refer the readers to Zeng et al. [50].
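For reference, the textbook form of a VAE variational lower bound uses exactly the quantities named above (prior, approximate posterior, and likelihood). The following is a sketch under that assumption and is not necessarily the exact objective used in this work; Zeng et al. [50] give the details.

```latex
\begin{align}
\mathcal{L}_z &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
              - D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big), \\
\mathcal{L}_d &= \mathbb{E}_{q(d \mid x)}\big[\log p(x \mid d)\big]
              - D_{\mathrm{KL}}\big(q(d \mid x) \,\|\, p(d)\big).
\end{align}
```

In both bounds, the first term rewards reconstruction of the argument and the KL term keeps the approximate posterior close to the prior.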
Persuasiveness Prediction Learning. In our setting, we aim to identify which conversation is more persuasive given an input of two conversations. Therefore, our goal is to have $C^+$ scored higher than $C^-$. We apply the pairwise cross-entropy loss to maximize the margin between $y^+$ and $y^-$ for $C^+$ and $C^-$, which is equivalent to minimizing:

Overall Learning Objective. The three components of our model can be jointly optimized by minimizing the objective function:
where $\mathcal{L}^t_{Factor}$ is the argument factor learning objective at the argument turn level.
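One common form of a pairwise cross-entropy loss is $-\log \sigma(y^+ - y^-)$, which shrinks as the persuasive conversation outscores the failed one by a wider margin. The sketch below uses that form as an assumption (the exact loss is not reproduced here) and adds a pairwise accuracy helper; the function names and toy scores are hypothetical.

```python
import math


def pairwise_loss(y_pos, y_neg):
    """-log sigmoid(y_pos - y_neg), written in a numerically stable form."""
    margin = y_pos - y_neg
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(scored_pairs):
    """Share of (y_pos, y_neg) pairs in which the persuasive side scores higher."""
    return sum(y_pos > y_neg for y_pos, y_neg in scored_pairs) / len(scored_pairs)


# Hypothetical model scores for four (positive, negative) conversation pairs.
scores = [(2.0, 0.5), (0.1, 0.4), (1.0, -1.0), (0.3, 0.2)]
mean_loss = sum(pairwise_loss(p, n) for p, n in scores) / len(scores)
accuracy = pairwise_accuracy(scores)
```

Because only the score difference enters the loss, the model is free to place scores on any scale, which suits a ranking formulation.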

EXPERIMENTAL SETUP
Data Preprocessing. We randomly split the dataset into 80% for training and 20% for test. Then, 20% of the training data is randomly selected for validation. For preprocessing, we take the following steps. First, non-English terms are filtered out. Then, quotations, digits, and links are replaced with the generic tags '⟨quote⟩', '⟨digit⟩', and '⟨url⟩', respectively. Next, we employ the natural language toolkit (NLTK) for tokenization. After that, all letters are converted to lowercase. Finally, words occurring fewer than 10 times are filtered out from the data.
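The tag replacement, lowercasing, and rare-word filtering steps can be sketched as follows. This is a simplified illustration, not the authors' pipeline: a regular expression stands in for NLTK tokenization, the tags are written in ASCII (`<url>`, `<digit>`), and quotation handling and non-English filtering are omitted. The patterns and function names are assumptions.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
DIGIT_RE = re.compile(r"\d+(?:\.\d+)?")
# Keep generic tags whole, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"<\w+>|\w+|[^\w\s]")


def normalize(text):
    """Replace links and digits with generic tags, lowercase, and tokenize."""
    text = URL_RE.sub(" <url> ", text)
    text = DIGIT_RE.sub(" <digit> ", text)
    return TOKEN_RE.findall(text.lower())


def drop_rare(documents, min_count=10):
    """Remove words that occur fewer than `min_count` times in the corpus."""
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in documents]


tokens = normalize("See http://example.com/page1 for the 2015 study!")
print(tokens)  # ['see', '<url>', 'for', 'the', '<digit>', 'study', '!']
```

Links are replaced before digits so that numbers inside a URL do not produce stray digit tags.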
Parameter Setting. We use the Gated Recurrent Unit (GRU) as the RNN cell. The hidden size of the GRU is set to 512, with a word dropout rate of 0.2. The dimensions of word embeddings and memory embeddings are both set to 200. We set λ = 0.01, following Zeng et al. [50], to balance the MI loss. All the other hyperparameters are tuned on the validation set by grid search. Optimization is performed using Adam [21]. In the learning process, we alternately update the parameters of the argument factor encoder and those of the rest of our model. We run our model for 80 epochs with early stopping [7].
Comparison Baselines.
• LR-Tfidf: Tan et al. [37] use logistic regression with bag-of-words features in the pairwise persuasiveness prediction task, achieving good performance compared with most handcrafted features. Here we implement logistic regression with TF-IDF-weighted n-gram features (LR-Tfidf). Similar to [37], we adopt ℓ1 regularization in the training stage to avoid overfitting.
• JTDM: The joint topic-discourse model (JTDM) [50] extracts topic and discourse features in an unsupervised way and can be used to replace our argument factor encoder. We use the mean of each argument's topic-discourse mixture as the feature of an input conversation, without considering the dynamics.
• HAtt-RNN: The hierarchical attention recursive neural network (HAtt-RNN) [48] uses a bi-directional GRU as the sequence encoder and includes two levels of attention mechanisms (i.e., word level and argument level) when constructing the representation of a conversation.
• DMN: The dynamic memory network (DMN) [23] is a neural sequence model that can encode the contextual history into the episodic memory component.
• DKVMN: The dynamic key-value memory network (DKVMN) [54] improves upon DMN by using one static matrix as the key to compute the memory reading weights and one dynamic matrix as the value for updating the memory states.

EXPERIMENTAL RESULTS
This section presents how the models perform on persuasiveness prediction. We report the main comparison results in Section 6.1, followed by topic and discourse interpretations in Section 6.2. Afterwards, we analyze the major parameters and errors in Sections 6.3 and 6.4, respectively.

Persuasiveness Prediction Comparison
We follow Tan et al. [37] to conduct pairwise classification. For the CMV dataset, we predict which conversation can win a ∆, and for the Court dataset, which side will win the case. In Table 2, we report the pairwise accuracy and F1 scores. For our model, we also display ablation results without considering topic, discourse, and memory structure, respectively. It is observed that:
• Topic and discourse factors are useful. By exploiting pre-learned latent topic and discourse factors, JTDM outperforms the LR-Tfidf baseline on both datasets. It even performs better than HAtt-RNN on Court debates. This observation implies that topic and discourse factors can be indicative of persuasive arguments.
• Neural models generally outperform the non-neural baselines. This indicates that neural models are able to learn deep persuasiveness features. We also find that the improvement over non-neural models is less significant on Court than on CMV. This may be partly attributed to the sparse training instances in the Court dataset, as shown in Table 1, which may result in overfitting. Nevertheless, our models alleviate such sparsity and achieve significantly better performance on both datasets.
• Process modeling is important to predict argument persuasiveness. We observe that LR-Tfidf and JTDM, with only word features encoded, perform worse compared to other methods that explore dynamic patterns in argumentation process. This shows that persuasion outcomes are also dependent on a dynamic process beyond word features.
• The dynamic memory mechanism is effective. Our full model obtains better results than its w/o memory variant. Also, DMN and DKVMN outperform the other baselines, which have no dynamic memory mechanism. These observations indicate that the dynamic memory mechanism is effective for modeling the argumentation process.
• Both dynamic topic and discourse factors contribute to argument persuasiveness. Our full model achieves better results than the w/o topic and w/o discourse ablations, which consider only dynamic discourse or only dynamic topic factors, respectively. The slightly better performance of w/o discourse over w/o topic suggests that topic factors might contribute more to persuasiveness, but coupling topics and discourse yields the best performance.

Interpretation on Topics and Discourse
To analyze the latent topics and discourse produced by our model, we carry out a qualitative analysis to investigate their interpretability.
Here we select the top 10 words from the distributions of some example topics and discourse factors and list them in Table 3 and Table 4, respectively. Some meaningful word clusters can be observed, reflecting varying debate topics and discourse skills on the two datasets. Interestingly, the latent discourse factors from CMV and Court, though learned separately, exhibit some overlap in their top 10 words, particularly for "pronoun", whose words refer to participants (e.g., "we") or to someone or something else (e.g., "he" and "they") in the discourse. We also note that the discourse style of "statistic" is represented by very different words. The reason is that Court debates are likely to involve lawsuit-related statistical evidence, hence the prominence of words like "records" and "proximate". We further explore why coherent topics and discourse styles can be learned, using the example conversation in Figure 1. In Figure 4, we visualize the topic and discourse assignments, highlighting the topic words (with $p(w \mid z) > p(w \mid d)$) in red and the rest in blue to indicate discourse style. The shade indicates the confidence level of each word assignment. We can see that our model can identify topic words, e.g., "language beyond", "found", and "learning participants ended", as well as discourse words, e.g., "<digit>" and "[". Topic and discourse words can thus be well distinguished, which allows us to discover meaningful latent factors and analyze the reasons behind persuasive arguments.

Parameter Analysis
Here we study how two important hyper-parameters of our model, the number of topics (K) and the number of discourse factors (D), affect model performance. In Figure 5, we show the persuasiveness prediction accuracy given varying K in (a) and varying D in (b).
As can be seen, for both topics and discourse, the curves of model performance are not monotonic. In particular, better accuracies are achieved with relatively larger topic numbers for CMV, with the best result observed at K = 50, whereas for Court the optimum topic number is K = 20. This may be because topics in Court debates are relatively more centralized, whereas a wider range of topics is discussed on social media (CMV). For discourse, we observe a similar trend on both the CMV and Court datasets. The best score is achieved when D = 10 for CMV and D = 8 for Court. This implies that the discourse styles used in both CMV and Court are somewhat limited.

Error Analysis
In this section, we look into the errors produced by our model in predicting argumentation persuasiveness, where three major types of errors are observed.
Error Type I: Wrong separation of topic words and discourse words. Errors in distinguishing topic and discourse words may result in erroneous persuasiveness predictions. For example, as shown in Figure 5, the word "cognition" should be considered a topic word, yet it is erroneously inferred to reflect discourse. Because of this cascading failure, the model output might be affected.
Error Type II: Preconception held by opinion holders. Sometimes opinion holders hold preconceptions about the debated subject, and their views are difficult for others to change. As shown in Figure 6, the opinion holder raised an issue related to the "abortion ban act". Although the challengers provide arguments with concrete evidence against the OP, they fail to obtain a ∆ due to the opinion holder's preconception. Such cases are prominent on social media, which makes understanding opinion holders' prior beliefs a challenge for better prediction of argumentation outcomes.
Error Type III: Lack of knowledge for judging the sufficiency of the evidence. In the court scenario, successful debates depend on how the lawyers use their persuasive skills to present their evidence or interpret their opponents' evidence. Judging the sufficiency of evidence is beyond what the current model can capture. The logic and sufficiency of evidence cannot be easily determined without external knowledge, e.g., law terminology and clauses. In future work, we will strengthen the reasoning process of the model by incorporating external knowledge sources.
Figure 4: Visualization of the topic-discourse assignment of the CMV conversation in Figure 1. The annotated blue words are prone to be discourse words, and the red are topic words. The shade is the word-level confidence of the current assignment.

DISCUSSION
In Section 6, we have shown the superior performance of our proposed model in identifying persuasive arguments. Here, we discuss how the latent topics and discourse signal argumentation outcomes. From our results, we further draw three suggestions, which might help individuals better engage in argumentative dialogues.

The Roles of Topics and Discourse in Argumentation Process
In Section 6.1, topics have shown slightly stronger effects on successful persuasion than discourse. Here we further analyze their roles in affecting persuasion outcomes. Similar trends are observed on both datasets, and we only discuss the results on the CMV dataset due to space limitations.
To investigate topic effects, we follow Wang et al. [43] and identify a topic as a strong argument topic when its likelihood is larger than a pre-defined threshold (set to 0.2 here). Then, in Figure 7(a), we show how the number of strong argument topics is distributed over winning arguments compared with losing ones. For discourse, we similarly show the discourse factor distributions on winning and losing arguments in Figure 7(b), labeling the factors with our interpretations of the discourse styles according to their associated word distributions. In the following, we discuss the findings from topics and discourse in turn.
Topic Roles. As can be seen in Figure 7(a), the winning side tends to put forward fewer topics in the argumentative process. This indicates that strong and focused argument points are more closely related to successful arguments than diverse topics, because arguing about too many things might overwhelm the opinion holder, which may lead to persuasion failure. A similar trend can also be observed on the Court dataset.
Discourse Roles. From Figure 7(b), we can see that discourse styles vary in their effects on the persuasiveness results. Specifically, personal pronoun and statistic are more likely to appear on the winning side than on the losing side. Their positive effects have also been previously reported [37,43]. Moreover, we find that conjunction, though not widely used, is clearly favored by winning sides. The benefit of conjunction may be related to the better logic it renders. The losing side, in contrast, favors the quotation discourse, which is used in CMV to quote and attack others' weak points. People may dislike such criticism, which has a negative impact on persuasiveness.
Figure 7: Distributions over winning and losing arguments of (a) the number of strong argument topics and (b) the discourse factors, displayed with our interpretation ("conj.": conjunction, "quot.": quotation, "cont.": contrast, "pron.": personal pronoun, and "num.": statistic). A two-sided Mann-Whitney rank test shows that the two distributions are significantly different for both sides (p < 0.01).
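The strong-argument-topic counting behind Figure 7(a) can be sketched in a few lines of Python. The 0.2 threshold comes from the text; the function names and the toy likelihood vectors are hypothetical and serve only to illustrate the procedure, not to reproduce the reported results.

```python
from collections import Counter

STRONG_TOPIC_THRESHOLD = 0.2  # threshold value stated in the text


def count_strong_topics(topic_likelihoods, threshold=STRONG_TOPIC_THRESHOLD):
    """Count topics whose likelihood is strictly larger than the threshold."""
    return sum(1 for p in topic_likelihoods if p > threshold)


def strong_topic_counts(arguments, threshold=STRONG_TOPIC_THRESHOLD):
    """Return one strong-topic count per argument."""
    return [count_strong_topics(a, threshold) for a in arguments]


# Toy, made-up topic likelihoods (not data from the paper).
winning = [[0.70, 0.10, 0.10, 0.10], [0.45, 0.35, 0.10, 0.10]]
losing = [[0.30, 0.25, 0.25, 0.20], [0.25, 0.25, 0.25, 0.25]]

winning_counts = strong_topic_counts(winning)
losing_counts = strong_topic_counts(losing)

# Histograms of "number of strong topics" for each side.
winning_hist = Counter(winning_counts)
losing_hist = Counter(losing_counts)
```

The two lists of counts could then be compared with a two-sided Mann-Whitney rank test, for instance via `scipy.stats.mannwhitneyu(winning_counts, losing_counts, alternative="two-sided")`, assuming SciPy is available.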

Discourse Effects over Turn Number
To provide more insights, we further study how discourse effects change over the argumentation process with varying conversation length (the number of turns). Here we focus on discourse effects instead of topics because discourse styles are commonly used in diverse arguments and exhibit shared patterns on the two datasets, while topics vary across scenarios. Specifically, we investigate the effects of the example discourse factors (listed in Table 4) over argumentation processes with varying turn numbers. The persuasiveness scores (computed with Eq. 8) are employed to measure the discourse effects, and the results on the two datasets are displayed in Figure 8. Our observations are as follows. First, in general all the discourse styles exhibit a decreasing trend in persuasiveness scores as more argumentative turns come in, although they appear to be more important in the initial few turns. Second, the same discourse style may demonstrate varying persuasiveness impacts in different debate scenarios. For example, pronouns usually express more personal emotion and tend to arouse empathy from others. Such discourse shows a positive effect on debates in the social media scenario, as shown in Figure 8(a), but its effectiveness in the Court scenario is less apparent. A similar observation can be made for quotations. Finally, we also observe that on the Court dataset, various discourse styles exhibit very similar effects on persuasion, as shown in Figure 8(b).
Figure 8: Effects of the example discourse styles (listed in Table 4) on the ultimate persuasiveness as the argumentation process continues. The horizontal axis indicates the number of argumentative turns, and the vertical axis the dynamic persuasiveness score (given by Eq. 8).

Case Study
Our DTDMN is designed to capture the topic shifting and discourse flow in an argumentative conversation, which allows us to interpret argument persuasiveness from the perspectives of topic and discourse dynamics. Here we take the CMV discussion in Figure 1 as an example to look into its persuasion process. Recall that the challengers put forward viewpoints centered around "the advantage to learn a second language", and they successfully change the opinion holder's mind with the good arguments delivered. In Figure 9(a), we visualize the dynamic memory weights w_t (see Eq. 3) for each turn. It is observed that our model highlights the "cognition" topic factor, which suggests that the cognitive research evidence (e.g., learning a musical instrument) might help the challengers win.
For discourse, the model highlights latent factors represented by words like '⟨url⟩', '⟨digits⟩', and 'more'. This suggests that effective discourse styles, such as quotation of links ('⟨url⟩') and statistic ('⟨digits⟩'), may also play an important role in persuasiveness.
To further study how each topic and discourse factor alone contributes to the persuasion in this example, we disable the effects of the other topics and discourse factors by masking w_t, and map the prediction score y in Eq. 8 to the [0, 1] range. We visualize the prediction scores in Figure 9(b) to depict the effect of each topic and discourse factor on persuasiveness. We observe that the "cognition" topic is still highlighted for all turns. This implies that our model still recognizes this topic as important, even without taking the discourse effects into account. For discourse, we notice that the quotation and statistic skills are considered useful for the first few turns, but their impacts later become negative. This might be because people tend to tire of excessive URL links and statistics that are not accompanied by more insightful opinions.
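The masking analysis described above can be illustrated with a small Python sketch. Everything here is a simplification under stated assumptions: `score_fn` is a hypothetical stand-in for the prediction score of Eq. 8, the weights are a flat list rather than the model's actual memory weights, and the logistic function is only one possible way to map a score to [0, 1], since the text does not specify the mapping.

```python
import math


def mask_weights(weights, keep_index):
    """Zero out every factor weight except the one at keep_index."""
    return [w if i == keep_index else 0.0 for i, w in enumerate(weights)]


def squash(score):
    """Map a real-valued score to (0, 1); the logistic choice is an assumption."""
    return 1.0 / (1.0 + math.exp(-score))


def factor_effects(weights, score_fn):
    """Score each factor in isolation by masking all the others."""
    return [
        squash(score_fn(mask_weights(weights, i)))
        for i in range(len(weights))
    ]


# Hypothetical linear scorer standing in for Eq. 8 (not the paper's model).
coefficients = [2.0, -1.0, 0.0]


def toy_score(weights):
    return sum(c * w for c, w in zip(coefficients, weights))


effects = factor_effects([0.5, 0.3, 0.2], toy_score)
```

Under this toy scorer, a value above 0.5 marks a factor that pushes toward successful persuasion when considered alone, and a value below 0.5 marks one that pushes against it.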

Suggestions on Argumentation
From the results, we draw some general suggestions on argumentation, which might help participants perform better in debates.
Topics are more important than discourse styles. In an argumentative conversation, opponents attempt to establish the validity of two positions by convincing each other and trying to win points in the debate [36]. Our study shows that topics contribute slightly more to persuasiveness than discourse. This holds especially in the later stages of the argumentation process, as suggested by the decreasing effects of discourse over turn number (see Figure 8). This is consistent with the finding of Van Dijk et al. [39], who point out that style and rhetoric are not the dominant factors in determining debate outcomes.
Strong and focused argument points are better than diverse topics. Strong arguments that are well supported with evidence and/or reasoning generally deliver more persuasive messages to the audience [5]. Our study reveals that successful argumentation usually conveys fewer and more focused topics. Diverse topics may distract the audience and expose more vulnerable points to the opponent.
Organize the points well and address them in a modest and concrete way. Argument discourse represents the cultural and situational realities of human reasoning, and is more sensitive to the audience in conversational debates [10]. Amossy [1] also claims that argumentativity constitutes an inherent feature of discourse. This advice works particularly well in social media arguments, where amateur debaters from the general public are likely to be affected by opponents' discourse skills. Accordingly, we see that personal pronoun (modest), statistic (concrete), and conjunction (well-organized) discourse are more likely to appear on the winning side.

Limitations of Our Study
In this paper, we use the CMV dataset following the previous argumentation mining setting [14,17,26,37]. Those studies use only the CMV dataset for evaluation. To better evaluate generalization performance, we also include the Supreme Court dataset, which exhibits different data statistics from CMV (e.g., fewer argumentation processes and more turns per process). However, this does not guarantee that our proposed model generalizes to other argumentation genres. Also, our findings are drawn from the experimental results on the CMV and Court datasets. These findings are consistent with prior studies in social linguistics, and provide some additional details. To further evaluate whether our model and empirical results are applicable in other scenarios, more experimental study is required on a diverse range of debate data to better understand human arguments. In addition, we mainly consider the topic and discourse factors in modeling the argumentation process. Other factors may relate to the persuasiveness of an argumentative conversation, such as the age [12], culture [38], and gender [16] of participants. For example, earlier research [11] on argumentation suggests that adults use advanced discourse strategies more consistently, frequently, and flexibly than adolescents do. Because such metadata is unavailable in our datasets, we could not easily incorporate these factors. Future research can consider building debate datasets that include side information such as demographic data.

CONCLUSION AND FUTURE WORK
In this paper, we propose to dynamically track both topic and discourse factors in conversational argumentation for persuasiveness prediction. The proposed neural model not only identifies persuasive arguments more accurately, but also provides insights into the usefulness of topics and discourse for successful persuasion. The findings drawn in this paper can facilitate the analysis of argument persuasiveness.