Conference papers

"Don't worry, it's just noise": quantifying the impact of files treated as single textual units when they are really collections

Abstract : Literary works may contain many autonomous or semi-autonomous units, such as poems for the former or chapters for the latter. We hypothesize that such breaks in the flow of the text, if not accounted for when the text is processed, affect the application of the distributional hypothesis. We test this hypothesis on a large corpus of Latin works (20M tokens), treating each text file either as a single unit or as multiple "autonomous" units when analyzing selected words. For groups of rare words and for words specific to heavily segmented works, the results show that their semantic space differs for the most part between the two versions of the corpus. For the 1000 most frequent words of the corpus, variations become substantial as soon as the window defining the neighborhood is greater than or equal to 10 words.
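To make the contrast in the abstract concrete, the following is a minimal sketch of how a context window picks up different neighbors depending on whether a file is read as one unit or as several. It is not the paper's pipeline: the function name `cooccurrences`, the toy token lists, and the window size of 2 are all assumptions chosen for illustration.

```python
from collections import Counter


def cooccurrences(units, target, window):
    """Count the tokens found within `window` positions of `target`.

    `units` is a list of token lists; a window never crosses from one
    unit into the next. (Hypothetical helper, not taken from the paper.)
    """
    counts = Counter()
    for tokens in units:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Left context plus right context, excluding the target itself.
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


# Toy data invented for illustration: two short "poems" stored in one file.
poems = [
    "arma virumque cano".split(),
    "odi et amo".split(),
]

# Version A: the file is one unit, so windows run across the poem boundary.
merged = [[tok for poem in poems for tok in poem]]
# Version B: each poem is its own unit.
split = poems

as_file = cooccurrences(merged, "cano", window=2)
as_units = cooccurrences(split, "cano", window=2)

# Neighbors that appear only because the boundary was ignored.
spurious = set(as_file) - set(as_units)
print(sorted(spurious))
```

In this toy case the single-unit reading gives "cano" two extra neighbors that belong to the next poem. The larger the window, the more such cross-boundary neighbors can enter the counts.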

https://hal.archives-ouvertes.fr/hal-03481620
Contributor : Thibault Clérice
Submitted on : Wednesday, December 15, 2021 - 2:11:22 PM
Last modification on : Wednesday, March 16, 2022 - 3:43:16 AM
Long-term archiving on : Wednesday, March 16, 2022 - 7:00:52 PM

File

dont_worry_just_noise_ACL.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03481620, version 1

Citation

Thibault Clérice. "Don't worry, it's just noise": quantifying the impact of files treated as single textual units when they are really collections. Workshop on Natural Language Processing for Digital Humanities (NLP4DH), Dec 2021, Virtual, India. ⟨hal-03481620⟩

Metrics

Record views : 54
File downloads : 24