
Definition: Content Analysis (education) from The SAGE Glossary of the Social and Behavioral Sciences

A data analysis strategy used to study the content of texts and discourses. Qualitatively oriented researchers who use content analysis (CA) seek to understand the meanings, symbols, and communicative nature of texts, through either deductive or inductive measurements. They recommend focusing on what readers do with a text, how they relate to texts, and the social meanings. Quantitatively oriented researchers use CA to determine the amount and frequency of words and concepts found in textual data. Both qualitative and quantitative approaches can provide useful information.

See also

Discourse Analysis

Summary Article: Content Analysis from Encyclopedia of Case Study Research

Content analysis is a tool of qualitative research used to determine the presence and meaning of concepts, terms, or words in one or more pieces of recorded communication. This systematic and replicable technique allows for compressing many words of text into fewer content categories based on explicit rules of coding in order to allow researchers to make inferences about the author (individuals, groups, organizations, or institutions), the audience, and their culture and time.

Conceptual Overview and Discussion

Content analysis became a relatively established method of systematic analysis during the 1940s. At first, content analysis was a time-consuming process, executed manually, prone to human error, and subject to serious time and resource constraints. Because of this, the technique was limited to examinations of texts for the frequency of occurrence of identified terms or to short texts, being deemed impractical for more complex investigations, for larger texts, or for most recorded communication other than written texts. By the 1950s, researchers had recognized the need for more sophisticated methods of analysis, and as a result they started to focus on concepts rather than words, and on semantic relationships rather than just the mere presence of certain words.

Since then, content analysis has been extended to almost every type of recorded communication, ranging from books, newspaper articles, historical documents, medical records, Web sites, speeches, and communiqués to theater, television programs, sketches and drawings, informal conversation, writing journals, interviews, classroom discussions, lectures, and manifestos of political parties. As a result, today this research technique is used in fields as varied as marketing and advertising, literature and rhetoric, media studies, ethnography and anthropology, cultural, gender and age studies, sociology, political science, psychology and cognitive science, theology, and religious studies.

Since the 1980s, content analysis has also been widely used in media analysis and media evaluation, often in combination with data on media circulation, frequency of publication, readership, and number of viewers or listeners. During recent decades, various software packages have greatly facilitated the execution of content analysis by allowing researchers to sift systematically through large volumes of data with relative ease, and to make inferences that can then be corroborated by using other methods of data collection and data analysis. Today it is widely recognized that the careful examination of communication patterns can help researchers learn a great deal about individuals, groups, organizations, institutions, and even the larger society in which they are embedded.


Content analysis is possible whenever there is a physical record of communication. This record of communication can be (a) created independently of the research process and internally by the individual or organization under study (e.g., newspaper articles, or archived documents detailing household consumption), (b) internally generated and externally directed (e.g., the verbatim transcripts of legislative hearings or committee debates generated by a number of parliaments around the world, which may reflect or obscure the political decision-making process), or (c) produced by the researchers themselves in view of the analysis that needs to be conducted (e.g., videotapes of television news programs or commercials, or of debates carried out in the legislature and/or town council). The population of available communications greatly influences the nature of the questions that can be answered through content analysis, as well as the reliability and validity of the final research results.

The most basic quantitative content analysis consists of a frequency count of words, although the assumption that the most frequently mentioned words reflect the greatest concerns does not always hold true. A concept's importance might be overestimated when the word has multiple meanings (as when a record includes references to cabinet “ministers” and religious “ministers,” and the researcher fails to set these meanings apart). Its importance can be underestimated when synonyms are used for stylistic reasons (e.g., an author alternates between the president's name, “Obama,” and “our head of state” in order not to repeat the word “president”) or when the author avoids raising the issue represented by the concept as a result of self-censorship in response to societal bias or political pressure (e.g., the author omits references to current political leaders for fear of censorship).
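A minimal Python sketch of the frequency count described above. The tokenizer (a simple regular expression) and the sample sentence are assumptions made for illustration; they do not come from any particular content analysis package.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Invented sample text; "ministers" could refer to cabinet or clergy.
sample = (
    "The ministers met on Monday. Two ministers resigned, "
    "and one minister later led a church service."
)
counts = word_frequencies(sample)
print(counts.most_common(3))
```

A raw count like this cannot tell the two senses of “ministers” apart, and it treats “minister” and “ministers” as different words, which is why a count alone can overestimate or underestimate a concept's importance.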

To avoid such problems, researchers first use frequency counts to identify words of potential interest, and then conduct a Key Word In Context (KWIC) search to test for the consistency of usage of words. Most qualitative research software programs allow researchers to read the whole sentence in order to see the word in context, a procedure that strengthens the validity of inferences made from the data. Newer software packages, which can differentiate between the different meanings of the same word based on context, have greatly reduced the level of difficulty in conducting content analysis and allowed for ever more sophisticated analyses. In addition, researchers must note that nonstandardized measures can lead to biased results. If, during the Cold War, the Americans uttered a total of 100,000 words, including 100 salient references to weapons proliferation, while the Soviets uttered 200,000 words, including 200 salient references, one might conclude that the Soviets were more interested in the issue than were the Americans. However, when the measure is standardized as the proportion of salient references among all words (0.1% in both cases), we would conclude that both sides were equally interested in the topic. Depending on the recorded communication under analysis, basic content analysis could also include space measurements (column length in the case of newspaper articles or advertisements) and time counts (for radio and television programs).
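The two checks described above can be sketched in Python as follows. The function names `kwic` and `rate_per_thousand`, the context window size, and the sample text are hypothetical choices made for this illustration; the word totals are the ones given in the Cold War example.

```python
import re


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def rate_per_thousand(references, total_words):
    """Standardize a raw count as references per 1,000 words."""
    return 1000 * references / total_words


# Invented text in which "ministers" is used in two different senses.
text = "The cabinet ministers voted today. Local ministers led the Sunday prayers."
lines = kwic(text, "ministers", window=2)
for left, word, right in lines:
    print(f"{left:>15} [{word}] {right}")

# Raw counts differ (100 vs. 200), but the standardized rates are equal.
american_rate = rate_per_thousand(100, 100_000)
soviet_rate = rate_per_thousand(200, 200_000)
print(american_rate, soviet_rate)
```

Reading the context around each hit is what lets the researcher separate the political from the religious sense; the standardized rate is what makes counts from texts of different lengths comparable.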

How It Works

More complex content analysis extends beyond word counts to code and categorize the data. Data are coded with coding protocols decided either before or during the analysis. In a priori coding, categories are established before the start of the analysis. Professional colleagues agree on the selected categories, the coding is applied to the data, and revisions are made, if needed, in order to maximize the mutual exclusivity and exhaustiveness of the categories. In emergent coding, categories are established after a preliminary examination of the data and during data analysis. In this case, at least two researchers review the material independently and select a set of features for inclusion on a checklist; reconcile any differences between their initial checklists; design a consolidated checklist to apply the coding independently; and finally check the reliability of the coding, aiming for at least a 95% agreement. If the level of reliability is not acceptable, the researchers repeat the process as many times as needed to obtain the desired reliability. If the level of reliability is the one desired, the coding is applied on a large-scale basis.
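The reliability check in emergent coding can be illustrated with a short Python sketch. It computes simple percent agreement between two coders, which does not correct for agreement that would occur by chance; the category labels and the 20 coded units are invented for the example, and the 95% threshold is the one mentioned above.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of units to which two coders assigned the same category."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must code the same number of units.")
    if not codes_a:
        raise ValueError("At least one coded unit is required.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * matches / len(codes_a)


# Hypothetical codes assigned independently by two coders to 20 units.
coder_1 = ["economy"] * 10 + ["welfare"] * 10
coder_2 = ["economy"] * 10 + ["welfare"] * 9 + ["economy"]

agreement = percent_agreement(coder_1, coder_2)
acceptable = agreement >= 95  # threshold taken from the text
print(agreement, acceptable)
```

If `acceptable` were false, the coders would reconcile their checklists and code again before applying the scheme on a large scale.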

To construct the categories, words with similar meanings and connotations are organized in mutually exclusive and exhaustive categories, which ensures that no word falls between two categories, all words are assigned to the categories, and the categories do not overlap. The text is broken down into manageable categories that could range from a word or a word sense to a phrase, a sentence, or even a theme, and then it is examined using either conceptual or relational analysis. Conceptual analysis establishes the existence and frequency of concepts represented by words or phrases in a given text. Relational analysis goes one step further to examine the relationships among different concepts in a given text. Dermot McKeone further differentiated prescriptive analysis from open analysis. While prescriptive analysis emphasizes a closely defined set of communication parameters, which can be specific messages or subject matter, open analysis identifies the dominant messages and main subject matter of a recorded communication.
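The difference between conceptual and relational analysis can be sketched in Python. This is only an illustration under stated assumptions: the category dictionary `CATEGORIES` is invented, sentences are split on punctuation, and a "relationship" is reduced to two concepts appearing in the same sentence, which is one simple way among many to operationalize it.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical coding dictionary: each category lists the words assigned to it.
CATEGORIES = {
    "economy": {"tax", "taxes", "trade", "market"},
    "welfare": {"health", "education", "pensions"},
}


def concepts_in(sentence):
    """Return the set of categories that have at least one word in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return {name for name, terms in CATEGORIES.items() if words & terms}


def analyze(text):
    """Count sentences mentioning each concept and each pair of concepts."""
    concept_counts = Counter()  # conceptual analysis: presence and frequency
    pair_counts = Counter()     # relational analysis: co-occurrence
    for sentence in re.split(r"[.!?]+", text):
        found = sorted(concepts_in(sentence))
        concept_counts.update(found)
        pair_counts.update(combinations(found, 2))
    return concept_counts, pair_counts


text = "Lower taxes will fund education. Trade creates jobs. Pensions must rise."
concepts, pairs = analyze(text)
print(concepts)
print(pairs)
```

Here each concept is mentioned in two sentences, but only one sentence links the two, which is the kind of information a purely conceptual count would miss.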

Coding units can be defined physically in terms of their natural or intuitive borders (e.g., letters, newspaper articles, communiqués, poems, or archival documents); syntactically by using the separations created by the author (e.g., words, sentences, or paragraphs); or referentially by employing the referential units created by the author (e.g., a text might refer to Barack Obama as “our president,” “President Obama,” “the 44th president of the United States” or just “Obama”). In addition, coding units can be defined by using propositional units that result from breaking the text down in order to examine underlying assumptions. For example, a sentence reading “Transitional justice was launched after the new democratic government replaced the dictatorship” is broken down into “The new democratic government replaced the dictatorship” and “Transitional justice was launched.”

Typically, content analysis uses sampling units, recording units, or context units. Sampling units, which can be words, sentences, or paragraphs, are the individual units we make descriptive and explanatory statements about. If we wish to examine novelists who wrote on transitional justice, then the individual writers included in our sample constitute our sampling units. Recording units can be ideas relevant for the analysis. For example, we might want to see if some novelists valued transitional justice for its ability to reevaluate the recent dictatorial past or for preventing future human rights trespasses. However, in some cases it might be difficult for the researcher to determine whether authors present transitional justice as a backward-looking or a forward-looking phenomenon by simply examining their assertions on transitional justice. In this case, researchers use context units, which allow assertions to be evaluated in the context of the writing. The researcher must decide whether the paragraph around the assertion, several paragraphs, or the entire writing is the appropriate context unit.

Klaus Krippendorff listed six questions that need to be addressed in every content analysis. These questions are: (1) Which data are analyzed? (2) How are they defined? (3) What is the population from which they are drawn? (4) What is the context relative to which the data are analyzed? (5) What are the boundaries of the analysis? (6) What is the target of the inferences? To allow for replication, data examined through content analysis must be durable in nature. Several problems can occur when written documents or other types of recorded communication are assembled for content analysis. When a significant number of documents from the population are missing or unavailable, the content analysis must be abandoned. When some documents match the requirements for analysis but cannot be coded because they are incomplete or contain ambiguous content, these documents must be abandoned.

Use in Political Science

Some of the best applications of content analysis in the area of political science have included determining authorship, identifying trends and patterns in documents, and monitoring shifts in public opinion. Using Bayesian techniques based on word frequency, in 1964 Frederick Mosteller and David Wallace showed that James Madison had indeed authored the disputed Federalist Papers. Three decades later, Don Foster used statistical methods to identify Joe Klein as the anonymous author of Primary Colors, the fictionalized account of Bill Clinton's 1992 quest for the American presidency. After repeated denials, Klein admitted writing the controversial insider's account, leading to unprecedented media interest in content analysis. Authorship is determined by examining the prior works of suspected authors (James Madison, in the case of the Federalist Papers, or Clinton's close collaborators in the case of Primary Colors) and correlating their frequency of key terms (nouns or function words) with that of the target text.
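The general idea of comparing key-term frequencies can be sketched in Python. This is a deliberately simplified stand-in, not the Bayesian method of Mosteller and Wallace or the procedure Foster used: it compares relative frequencies of a few function words with a plain absolute-difference distance. The word list, the texts, and the author labels are all invented for illustration.

```python
# Hypothetical list of function words whose usage rates are compared.
FUNCTION_WORDS = ("upon", "while", "whilst", "on")


def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    return [tokens.count(w) / len(tokens) for w in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(known_texts, disputed_text):
    """Name of the candidate whose profile is nearest to the disputed text."""
    target = profile(disputed_text)
    return min(
        known_texts,
        key=lambda name: distance(profile(known_texts[name]), target),
    )


# Invented writing samples for two candidate authors.
known = {
    "author_a": "upon reflection the plan rests upon trust while the states wait",
    "author_b": "on reflection the plan rests on trust whilst the states wait",
}
disputed = "the union depends on consent whilst the people deliberate on the law"

best_match = closest_author(known, disputed)
print(best_match)
```

Function words are useful for this purpose because authors use them habitually and largely independently of topic; a real study would need far longer texts and a statistical model of how much variation to expect.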

One of the most remarkable applications of content analysis to political science was undertaken as part of the Manifesto Research Group and the Comparative Manifestoes Project, which estimated policy preferences from the manifestos of a wide range of left-wing and right-wing political parties in more than 50 countries in Central and Eastern Europe, Western Europe, North America, and Asia over the 1944-1989 and 1990-2003 periods. Election programs were taken as indicators of the parties' policy emphasis and policy positions at a certain point in time, and were subjected to content analysis. The analysis ascertained party preferences with respect to foreign relations (anti-imperialism, military, peace, and European integration), freedom and democracy (respect for freedom, and constitutionalism), political system (decentralization, political corruption, and political authority), economy (planning, free enterprise, corporatism, and protectionism), welfare and quality of life (social justice, culture, education, and environmentalism), the fabric of society (traditional morality, law and order, multiculturalism, social harmony), and social groups (labor, farmers, underprivileged). The best-known research resulting from these projects was published by Ian Budge and Hans-Dieter Klingemann in 2001 and by Klingemann and Andrea Volkens in 2006.

While the database generated by the Manifesto Research Group and the Comparative Manifestoes Project is recognized as the most comprehensive and most extensively validated set of policy estimates enabling comparisons over time and space, critics have pointed out that the scheme used to code the political manifestos cannot be changed without jeopardizing its ability to enable meaningful comparative research. Thus, the shortcomings of the coding scheme, most notably its overlapping and missing categories, cannot be adequately addressed without recoding all manifestos, a time-consuming endeavor that many researchers believe would be useless since, by the time the recoding is completed, the new coding scheme would itself be outdated.

Other Uses

An exemplar of content analysis in psychiatry is James Rogers, Jamie Bromley, Christopher McNally, and David Lester's study of suicide notes that tested the motivational component of the existential-constructivist model of suicide. The content analysis of a sample of 40 suicide notes generally supported the four theoretical categories of somatic, relational (social), spiritual, and psychological motivations outlined by the literature. Psychological motivations were found to be the most prevalent, followed by relational, spiritual, and somatic concerns. Notes of completed suicides included more relational motivations than did those of suicide attempters. Older note writers showed more psychological and fewer spiritual motivations than did younger writers. Based on the study, the authors concluded that the existential-constructivist model of suicide was robust and parsimonious, but at the same time they recommended its revision to provide a stronger meaning-based understanding of suicidal behavior.

In business and management, Richard D'Aveni and Ian MacMillan used content analysis to examine the focus of attention of top managers in surviving and failing firms. By examining the letters sent to shareholders by senior managers of 57 bankrupt firms and 57 surviving firms, the researchers found that under normal circumstances managers pay equal attention to the internal and external environment. In times of crisis, managers of surviving firms pay more attention to the critical aspects of their external environment and their firms' output, while those of failing firms focus on the internal environment and their firms' input.

Sociologists David Schweingruber and Ronald Wohlstein used content analysis to examine myths about crowds in introductory sociology textbooks. The authors examined the paragraphs on crowds included in 20 introductory sociology textbooks, coding them for the presence of seven crowd myths, that is, claims about crowds that have no empirical support and have been rejected by scholars in the field. After discovering that the number of myths per book ranged from one to five, Schweingruber and Wohlstein made important suggestions for rewriting these chapters and for improving the book reviewing process.

Critical Summary

Content analysis is particularly useful for case study research when more sophisticated tools of analysis cannot be employed because they are more expensive or because their use is restricted by a number of ethical dilemmas. By using quantitative and qualitative interpretive analysis, the examination of available records of communication can allow researchers to gain a great deal of knowledge about individuals, groups, organizations, and institutions, provided that their examination is sensitive to both the context and the purpose of the communication.

See also

Critical Discourse Analysis, Document Analysis, Explanatory Case Study, Qualitative Analysis in Case Study, Quantitative Analysis in Case Study, Relational Analysis, Textual Analysis

Further Readings
  • Budge, I., & Klingemann, H.-D. (2001). Mapping policy preferences: Estimates for parties, electors and governments, 1945-1988. Oxford, UK: Oxford University Press.
  • D'Aveni, R. A., & MacMillan, I. C. (1990). Crisis and the content of managerial communications: A study of the focus of attention of top managers in surviving and failing firms. Administrative Science Quarterly, 35, 634-657.
  • Foster, D. (2001). Author unknown: On the trail of anonymous [Tales of a literary detective]. New York: Holt.
  • Klingemann, H.-D., & Volkens, A. (2006). Mapping policy preferences: II. Estimates for parties, electors and governments in Central and Eastern Europe, European Union, and OECD 1990-2003. Oxford, UK: Oxford University Press.
  • Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
  • McKeone, D. (1995). Measuring your media profile. Aldershot, UK, & Brookfield, VT: Gower Press.
  • Mosteller, F., & Wallace, D. A. (1964). Inference and disputed authorship: The Federalist. Reading, MA: Addison-Wesley.
  • Rogers, J. P., Bromley, J. L., McNally, C. J., & Lester, D. (2007). Content analysis of suicide notes as a test of the motivational component of the existential-constructivist model of suicide. Journal of Counseling and Development, 85, 182-188.
  • Schweingruber, D., & Wohlstein, R. T. (2005). The madding crowd goes to school: Myths about crowds in introductory sociology textbooks. Teaching Sociology, 33, 136-153.

Stan, Lavinia
Copyright © 2010 by SAGE Publications, Inc.
