LTE: Urban, suburban and rural discourse

After I labeled all the data by topic, I took a quick look at the topic tallies, both overall and by newspaper, keeping in mind the newspapers’ locations. The overall percentages are shown in the graph below:

Some of these stats were expected, others surprising. I expected most of the letters to be about politics and government, and that is indeed the case. Overall, we can identify 4 topics that account for the majority of the letters: politics, government, community and society. Healthcare, business and education are also important. Several topics are so rare that each accounts for less than 5% of all letters (e.g., technology, sports and science).
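As an aside, tallies like these are straightforward to compute with pandas. Here is a minimal sketch, assuming the labeled letters live in a hypothetical letters.csv with topic and newspaper columns (the file and column names are my guesses, not the project's actual ones):

```python
import pandas as pd

# Hypothetical input: one row per letter, with (at least) the
# hand-labeled topic and the source newspaper.
df = pd.read_csv("letters.csv")  # assumed file name

# Overall topic percentages, sorted from most to least common.
overall = df["topic"].value_counts(normalize=True).mul(100).round(1)
print(overall)

# Topics that account for less than 5% of all letters.
rare = overall[overall < 5]
print("Rare topics:", list(rare.index))
```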

An interesting finding concerns the third most dominant topic. If you remember from the web-scraping post, the newspapers come from three different types of communities: urban, suburban and rural. I originally classified them as follows: urban (Chicago Tribune, Daily Herald), suburban (The Citizen, Times Union), and rural (Dubois County Free Press, Enquirer Democrat). However, after I looked at the topic data, I reconsidered my original classification.

Here are some data about the towns and counties where the newspapers are from.

| Newspaper name | Newspaper location | Miles from closest urban area | Population | Population density per square mile |
|---|---|---|---|---|
| Chicago Tribune | Chicago, IL | 0 | 10 million | 11,868 (town) |
| Times Union | Albany, NY | 0 | 1 million | 563 (town) |
| Daily Herald | Arlington Heights, IL | 27 | 75,101 | 7,633 (town) |
| The Citizen | Fayetteville, GA | 22 | 15,945 | 463 (county) |
| Dubois County Free Press | Jasper, IN | 80 | 15,038 | 98 (county) |
| Enquirer Democrat | Carlinville, IL | 60 | 5,917 | 55.4 (county) |

It is clear from the population density data above why I had originally classified the Daily Herald as an urban newspaper and the Times Union as a suburban one: the population density of Arlington Heights is much closer to that of Chicago, and the population density of Albany is much closer to that of Fayetteville, GA, which is a suburb of Atlanta. However, Arlington Heights is 27 miles from the center of Chicago, and the Daily Herald calls itself “Suburban Chicago’s Information Source”. Similarly, while the population density of Albany is not very high, it is nonetheless an urban center.

Additionally, I double-checked that the Dubois County Free Press and the Enquirer Democrat were indeed from rural areas. According to the US Department of Agriculture classification, Dubois County in Indiana has an “Urban population of 20,000 or more, not adjacent to a metro area”. Since it is not inside a metro area and its population is low, we can consider it rural, or at least much closer to rural than to suburban or urban. Macoupin County in Illinois, where the Enquirer Democrat is from, falls under “Counties in metro areas of 1 million population or more”, most likely because of its proximity to St. Louis, MO. However, with its low population density, this county is again much closer to rural than to suburban. Also, the neighboring Montgomery County is classified as “Nonmetro – Urban population of 2,500 to 19,999, adjacent to a metro area”. Clearly, the towns where the newspapers are located, Jasper, IN and Carlinville, IL, are just that, towns, but they are of a very different character than the other four towns/cities we are considering, and the surrounding areas they serve are more rural than the other four areas.
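In other words, the reclassification comes down to two signals: distance from an urban center and population density. As a toy illustration, here is that heuristic in code; the thresholds are my own guesses for illustration, not values derived from this analysis:

```python
def classify_area(miles_to_urban_center: float,
                  density_per_sq_mile: float) -> str:
    """Rough area-type heuristic; thresholds are illustrative only."""
    if miles_to_urban_center == 0:
        return "urban"        # the paper sits in the urban core itself
    if miles_to_urban_center < 40 and density_per_sq_mile > 400:
        return "suburban"     # close to a city, moderately dense
    return "rural"            # far from a city and/or sparse

# The six newspapers, with figures from the table above.
papers = {
    "Chicago Tribune":          (0, 11_868),
    "Times Union":              (0, 563),
    "Daily Herald":             (27, 7_633),
    "The Citizen":              (22, 463),
    "Dubois County Free Press": (80, 98),
    "Enquirer Democrat":        (60, 55.4),
}
for name, (miles, density) in papers.items():
    print(f"{name}: {classify_area(miles, density)}")
```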

The graph below shows the topic percentages in each of the 6 newspapers. Pay close attention to the distribution of the top three topics and see if you can find a pattern.
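A per-newspaper breakdown like this could be produced along the following lines, continuing the hypothetical df from the earlier sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

# df comes from the earlier sketch (hypothetical letters.csv).
# Percentage of each topic within each newspaper (rows sum to 100).
by_paper = (pd.crosstab(df["newspaper"], df["topic"], normalize="index")
              .mul(100).round(1))
print(by_paper)

# Grouped bar chart, one group of bars per newspaper.
by_paper.plot(kind="bar", figsize=(12, 6))
plt.ylabel("% of letters")
plt.tight_layout()
plt.show()
```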

Setting aside the two most dominant topics, politics and government, take a look at the community topic. In the urban newspapers, it takes a backseat to society and education, while in the rural newspapers it is either the third most common topic (Dubois County Free Press) or the first (Enquirer Democrat). The suburban newspapers follow the rural pattern, with community as the third most dominant topic after government and politics.

Here is an example of a letter that was tagged with the topic “community”:

A bumpy thrill ride is here on Dunton
It’s Spring once again in Arlington Heights and vacation season is upon us. I encourage all readers of the Daily Herald to visit “Craters of the Moon State Park” on the 300 block of South Dunton Avenue, just south of the downtown business district. Hiking trails into the craters should be opening soon.

The data shows that letters about local communities are more prominent in suburban and rural areas. While this is a preliminary analysis with just 6 data points, we can still speculate as to why that is. The most obvious reason is that people in suburbs and rural areas are more concerned with local issues, while people in cities have less of a sense of community. Another possible reason is letter selection: it is possible that city newspapers select fewer letters pertinent to local issues than suburban and rural ones do, or that they apply stricter selection overall. This would still reflect reader interest in community topics, depending on the area type. In any case, this is an interesting issue worth more investigation with more data points (and lots more tagging!). It would also be instructive to see whether there is a continuum from urban to suburban to rural areas, with community interest lowest in cities and highest in rural areas.

The next post will be about preprocessing the collected data for machine learning. Stay tuned!

LTE: Letters to the editor corpus analysis using machine learning

See if you can determine whether the author of the following texts is male or female.

Text A:

When my children were younger, my goal was to get them into the gifted program or even charter schools, because I just wanted a high-quality option. I was thankful that one of my daughters was accepted to Alexander Graham Bell Elementary School’s gifted program. But my other daughter, who was not as advanced, did not make it in. My oldest child, who attended Alexander Graham Bell, was able to get a great education.

Text B:

Gov. Bruce Rauner rocks the political boat. He has no personal agenda other than decent government. Former governors, of both parties, were all about typical politics: money, power and ego/legacy (see: Barack Obama). Of course they worked well together; they sold their principles to “get what they could,” not what they should. Democrats’ naivete regarding Illinois’ (and Chicago’s) blatant corruption is laughable. Illinois doesn’t deserve our honest and hardworking governor. It likes the stink of the status quo.

Did you pick text A as female-authored and text B as male-authored? That’s what I would have done, going along with societal stereotypes. But of course, I hand-picked texts that are just the opposite: A was written by a man, and B by a woman.

When deciding who wrote which text, which factors influenced your decision? The topic, probably. Maybe how personal it is. In some languages, it is almost trivial to determine the gender of the writer. For example, Spanish has different masculine and feminine adjective endings:

estoy cansada 'I am tired-f'
estoy cansado 'I am tired-m'

Slavic languages also make a gender distinction, with different verb endings in the past tense for male and female speakers. For example, in Polish:

ja wypiłam 'I drank-f'
ja wypiłem 'I drank-m'
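With morphology this overt, even a crude pattern match gets you surprisingly far. Here is a toy sketch for the Spanish case; the rule and patterns are purely illustrative and would misfire on invariant adjectives like inteligente:

```python
import re

# Toy rule: in a first-person Spanish clause like "estoy cansada",
# an adjective ending in -a after "estoy" suggests a female writer,
# -o a male one. Real text needs far more care (agreement targets,
# invariant adjectives, etc.).
def guess_gender_es(text: str) -> str:
    if re.search(r"\bestoy\s+\w+a\b", text, re.IGNORECASE):
        return "female?"
    if re.search(r"\bestoy\s+\w+o\b", text, re.IGNORECASE):
        return "male?"
    return "unknown"

print(guess_gender_es("Hoy estoy cansada"))  # female?
print(guess_gender_es("Hoy estoy cansado"))  # male?
```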

English does not have anything like this; however, there are clues that can give away the gender of the author. Sometimes there are clear indications, such as self-identifying expressions; for example, as a father. Other times, we can use our judgement based on perceived correlations, such as topic, which, as the two texts above show, can sometimes lead us astray.
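Self-identifying expressions like these are easy to search for mechanically. A small sketch with a hand-picked, certainly incomplete cue list of my own:

```python
import re

# A few self-identifying expressions; the lists are my own
# illustrative guesses, not the ones used in this project.
FEMALE_CUES = [r"\bas a mother\b", r"\bas a wife\b", r"\bas a grandmother\b"]
MALE_CUES   = [r"\bas a father\b", r"\bas a husband\b", r"\bas a grandfather\b"]

def self_identified_gender(text: str) -> str:
    lower = text.lower()
    if any(re.search(p, lower) for p in FEMALE_CUES):
        return "female"
    if any(re.search(p, lower) for p in MALE_CUES):
        return "male"
    return "unknown"

print(self_identified_gender("As a father of three, I oppose the new tax."))
```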

Topic leads to one of the many questions that arise when we consider differences between female and male writing: do women and men write about different things? I would argue that yes, they usually do; but I would like to have data to back that up. To find evidence, I collected a corpus of publicly available letters to the editor. While I started with the idea of looking at differences between male and female writing, the project evolved into something much larger. This is going to be a series of posts about it.

Here is an outline of the work that I did:

  1. Corpus collection. Where to get texts with the gender of the author known? My friend had a brilliant idea: letters to the editor. Usually (but not always), the letters are signed, and most of the names can be labeled as female or male. Also, most letters to the editor are freely available on the Internet. I scraped several sites to collect between 200 and 1500 letters per site.
  2. Corpus labeling. I wanted to compare supervised and unsupervised methods for topic identification in my letters to the editor (LTE) corpus. For the supervised topic labeling, I needed to manually tag the corpus. I developed a simple web-app to do that.
  3. Text extraction. The letters were all in HTML format, and they all contained other information in addition to the text of the letter itself, so I wrote a program to automatically extract the relevant text.
  4. Author identification. Since I had already tagged my corpus for author and topic, I wanted to see how well a machine learning algorithm would do at automatically extracting the author of a letter from the file, which is not as trivial a task as it may seem at first.
  5. Automatic topic assignment, supervised. I had labeled each of the letters with its topic, and in this stage I used that information to train a classifier that automatically assigns a topic given the text of a letter (see the first sketch after this list).
  6. Automatic topic assignment, unsupervised. A technique called LDA (latent Dirichlet allocation) allows us to group documents according to their similarity to each other. I used it to cluster the letters and to compare the resulting clusters to the topics from the supervised assignment task (see the second sketch after this list).
  7. Expectations and results. What are the theoretical expectations about the data? How do they compare to the results? Which topics are more common than others? What do men and women write about? Are there any differences by location?
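To make steps 5 and 6 a bit more concrete, here are two minimal scikit-learn sketches. Both assume lists texts (letter bodies) and labels (hand-assigned topics); whether this project actually used scikit-learn, and these exact settings, is my assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# texts: list of letter bodies; labels: list of hand-assigned topics.
# Both are assumed inputs, prepared in the labeling stage above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

And a corresponding sketch for the unsupervised side, fitting LDA over simple word counts and printing the top words of each discovered topic (the number of topics, 10, is again an illustrative guess):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vec = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
counts = vec.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per letter

# Top words per discovered topic, to compare against the hand labels.
words = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [words[j] for j in comp.argsort()[-8:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```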