The website Footnote 2 was used to obtain tweet IDs Footnote 3; this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). [...] (i.e., the historic limit when requesting tweets based on a search query). The R package 'rtweet' and its complementary 'lookup_status' function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets' information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) Tweets that belonged to the top 0.5 percentile of user activity were removed because we considered them non-representative of the normal user population, such as profiles that created more than 2000 tweets within four weeks. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
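The user-related criteria (1) and (3) above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `filter_users`, the `(user_id, text)` tuple layout, and the reading of "top 0.5 percentile" as "the most active 0.5% of users, ranked by total tweet count" are all assumptions made for illustration. Criterion (2) is omitted because it depends on account metadata not described here.

```python
from collections import Counter


def filter_users(pre, post, top_fraction=0.005):
    """Filter tweets by user-level criteria (hypothetical helper).

    pre, post: lists of (user_id, text) tuples for the two periods.
    top_fraction: share of most active users to drop (assumed 0.5%).
    Returns the filtered (pre, post) lists.
    """
    # Count tweets per user across both periods.
    counts = Counter(user for user, _ in pre + post)

    # Rank users from most to least active and mark the top share as heavy users.
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    heavy_users = set(ranked[:n_drop])

    # Within-group design: keep only users who appear in both periods.
    in_both = {user for user, _ in pre} & {user for user, _ in post}
    keep = in_both - heavy_users

    filtered_pre = [tweet for tweet in pre if tweet[0] in keep]
    filtered_post = [tweet for tweet in post if tweet[0] in keep]
    return filtered_pre, filtered_post
```

Applying the activity cutoff before intersecting the user sets is one possible ordering; the order reported in Table 2 should take precedence where it differs.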
The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
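A minimal Python sketch of this text-cleaning step is given below. It is an illustration under stated assumptions, not the original R pipeline: the helper names `has_url` and `clean_text` and both regular expressions are hypothetical, non-ASCII characters are simply dropped (the original conversion may have transliterated them instead), and media URLs cannot be told apart from other URLs without the tweet metadata, so `has_url` flags any URL found in the text.

```python
import re

# Assumed patterns: any http(s) link, and any @-prefixed screen name.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text):
    """Return True if the tweet text contains a URL (candidate for exclusion)."""
    return URL_RE.search(text) is not None


def clean_text(text):
    """Strip screen names and line breaks, then reduce the text to ASCII."""
    text = MENTION_RE.sub(" ", text)                       # references to screen names
    text = text.replace("\r", " ").replace("\n", " ")      # line breaks
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    return " ".join(text.split())                          # collapse repeated whitespace
```

In such a pipeline, tweets for which `has_url` is true would be excluded before `clean_text` is applied to the remainder.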
Token and bigram analysis
The R package Footnote 5 'quanteda' was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC, and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
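To make the sign convention concrete, the following Python sketch computes a T-score for a single token. It assumes a common Church et al.-style approximation in which the variance of a relative frequency is estimated as f/N²; Eq. (1) and (2) of the article may differ in detail (for example, in smoothing), so this is only an illustration. The function name `t_score` and its parameter names are hypothetical.

```python
import math


def t_score(f_pre, n_pre, f_post, n_post):
    """T-score for one token (assumed Church et al.-style approximation).

    f_pre, f_post: token frequency in the pre-CLC and post-CLC datasets.
    n_pre, n_post: total number of tokens in each dataset.
    Positive values mean the token is relatively more frequent post-CLC.
    """
    p_pre = f_pre / n_pre      # relative frequency pre-CLC
    p_post = f_post / n_post   # relative frequency post-CLC

    # Assumed variance estimate of each relative frequency: f / N^2.
    variance = f_pre / n_pre**2 + f_post / n_post**2
    if variance == 0:
        return 0.0  # token absent from both datasets

    return (p_post - p_pre) / math.sqrt(variance)
```

Subtracting the pre-CLC relative frequency from the post-CLC one matches the interpretation above: tokens that became relatively more frequent after the change receive positive scores.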
Part-of-speech (POS) analysis
The R package Footnote 6 'openNLP' was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and [...]). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, identical analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger is consistent across both datasets. Therefore, we assume there are no systematic confounds.