Web Scraping

As another small programming project we look at web scraping: extracting useful text from web pages.

  • Each web site has its own style, so a general-purpose scraping approach is of limited use
  • Instead, the scraping methods have to be adapted to the style of each source
  • Since those styles tend to change over time, keeping the scraping procedures up to date takes considerable work

You probably need to install the Beautiful Soup package:

pip3 install beautifulsoup4 --user

First get some news:

In [188]:
import requests

sources = {'google': 'https://news.google.com/', 
           'yahoo': 'https://news.yahoo.com/' }
html = {}
for s in sources:
    # do not send these requests too often; the timeout avoids hanging forever
    html[s] = requests.get(sources[s], timeout=10).content
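
The advice not to send requests too often can be built into a small helper. The following is only a sketch: the function name fetch_all, the delay of one second, the User-Agent string and the injectable get parameter are all assumptions made for illustration, not part of the original notebook.

```python
import time

def fetch_all(sources, get=None, delay=1.0, timeout=10):
    """Fetch every URL in `sources`, pausing `delay` seconds between requests."""
    if get is None:
        import requests  # imported lazily so a stand-in `get` can be passed instead
        get = requests.get
    pages = {}
    for i, (name, url) in enumerate(sources.items()):
        if i > 0:
            time.sleep(delay)  # be polite: do not hammer the servers
        response = get(url, timeout=timeout,
                       headers={'User-Agent': 'scraping-tutorial/0.1'})  # made-up agent string
        response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
        pages[name] = response.content
    return pages
```

With the dictionary from the cell above, html = fetch_all(sources) would give the same kind of result as the loop, just more slowly and with errors reported early.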

The Beautiful Soup package supports a number of HTML parsers that split the content according to its structure.

Note that many sites only provide content via JavaScript; this makes web scraping a little more difficult, since the downloaded HTML then does not contain the text we are looking for.

In [189]:
from bs4 import BeautifulSoup

soup = {}
for s in sources:
    soup[s] = BeautifulSoup(html[s], 'html.parser')

With the structuring done we can now extract content via a number of methods.

We look at the web page in our browser and identify some news article text.

In the Yahoo news we find the tags around the text using the prettify() method:

In [190]:
pr = soup['yahoo'].prettify()
i = pr.find('A large truck was seen driving')
print(pr[i-1000:i+1000])
v class="C(#959595) Fz(11px) D(ib) Mb(6px)" data-reactid="185">
                       NBC News
                      </div>
                      <h3 class="Mb(5px)" data-reactid="186">
                       <a class="Fw(b) Fz(20px) Lh(23px) Fz(17px)--sm1024 Lh(19px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-reactid="187" href="/large-truck-seen-driving-crowd-235902932.html">
                        <u class="StretchedBox" data-reactid="188">
                        </u>
                        <!-- react-text: 189 -->
                        Truck seen driving into protesters in Minneapolis
                        <!-- /react-text -->
                       </a>
                      </h3>
                      <p class="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(2,38px) LineClamp(2,34px)--sm1024 M(0) D(n)--sm1024 Bxz(bb) Pb(2px)" data-reactid="190">
                       A large truck was seen driving at full speed into a crowd of protesters Sunday on a bridge in Minneapolis, sending people running for safety. The Minnesota Department of Public Safety called it "very disturbing actions by a truck driver on I-35W, inciting a crowd of peaceful demonstrators." The truck driver was injured and is under arrest, the department said.
                      </p>
                      <ul class="Cf Mt(12px) Fz(12px) Pos(r) Mt(8px)--sm1024" data-reactid="191">
                       <li class="Fl(start) W(50%) W(100%)--sm1024 Mb(8px)--sm1024" data-reactid="192">
                        <a class="Td(n) D(ib) Va(t) W(90%) Mend(10%) C(#0078ff)!:h C(#000) C(#959595):vi" data-reactid="193" href="/george-floyd-protests-live-updates-103135819.html">
                         <img alt="George Floyd protests in Minneapolis: Police use tear gas, smoke grenades; more than two dozen arrested" class="Fl(start) W(29%) Miw(65px) Maw(72px) Mend(10px) Trsdu(0s)! D(n)--sm1024 Bdrs(

After some careful study of the prettified version we find that

  • h3 is used for various headers
  • we want news headers and corresponding summary
  • look for h3 element followed immediately by a p element

Each element has a next_sibling attribute that refers to the next node on the same level of the parse tree; it is None if there is no such node.
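
One detail is worth knowing: the next sibling can be a text node (for example a line break between two tags) rather than a tag. The sketch below uses a made-up HTML snippet to show this, and shows find_next_sibling() as an alternative that skips text nodes. Note that find_next_sibling('p') is less strict than "followed immediately", since it would also skip over other tags that sit in between.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up snippet: a line break separates the h3 from the p
snippet = "<div><h3>Header</h3>\n<p>Summary text</p></div>"
doc = BeautifulSoup(snippet, 'html.parser')
h = doc.find('h3')

first = h.next_sibling                        # the whitespace text node, not the <p>
is_text = isinstance(first, NavigableString)  # True for text nodes
p = h.find_next_sibling('p')                  # skips text nodes, returns the next <p> tag

print(repr(first), is_text, p.get_text())
```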

Based on the pretty-printed parse tree above, we write:

In [191]:
news = []

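for h in soup['yahoo'].find_all('h3'):
    p = h.next_sibling
    # heuristic: keep the pair only if the sibling text is clearly longer
    # than the header, i.e. it looks like a summary rather than a label
    if p and len(p.text) > 2*len(h.text):
        e = {}
        e['source'] = 'yahoo'
        e['header'] = h.text
        e['text'] = p.text
        print(e)
        news.append(e)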
        
{'source': 'yahoo', 'header': 'Protesters tear through D.C. after National Guard troops and Secret Service keep them from the White House', 'text': "Downtown Washington, D.C., was filled with flames and broken glass in the early hours of Sunday morning as large groups of protesters moved through the city for the second straight night. The protesters caused extensive damage to businesses in the blocks surrounding the White House after a large contingent of law enforcement — including National Guard troops, the U.S. Park Police and the Secret Service — kept the demonstrators back from the president's residence. Protesters lit fires at multiple locations around the city and clashed with law enforcement, hurling fireworks and other projectiles at the officers."}
{'source': 'yahoo', 'header': "On all sides, fears of 'outside agitators' in Floyd protests\xa0", 'text': "From the earliest days of the civil rights era, officials have been quick to assert that demonstrations were the work of “outside agitators,” as a way of distracting from the protesters' grievances and mobilizing local opinion against them. Last week, as protests over the death of George Floyd at the hands of a Minneapolis police officer erupted around the nation, the phrase reemerged, amplified by social media and echoed across the political spectrum, from the Democratic mayor of Minneapolis\xa0to Attorney General William Barr and President Trump. Had the countless fires, broken windows and vandalized police vehicles seen in cities across the country, from Minneapolis to Atlanta, New York and Washington, D.C., been caused by mostly white, far-left antifascists?"}
{'source': 'yahoo', 'header': 'Israel police kill Palestinian they mistakenly thought was armed', 'text': 'Israeli police in annexed east Jerusalem on Saturday shot dead a disabled Palestinian they mistakenly thought was armed with a pistol, prompting furious condemnation from the Palestinians. The incident happened in the alleys of the walled Old City near Lions\' Gate, an access point mainly used by Palestinians. "Police units on patrol there spotted a suspect with a suspicious object that looked like a pistol," an Israeli police statement said.'}
{'source': 'yahoo', 'header': 'Letters to the Editor: Stacey Abrams lost in Georgia, but she could lift Biden as his VP.', 'text': "To the editor: I like what columnist Jonah Goldberg has to say about Joe Biden's potential picks for vice president, yet I disagree with his assessment of former Georgia gubernatorial candidate Stacey Abrams. Abrams is a winner. Maybe it behooves Goldberg to take a second look at Abrams and her qualifications."}
{'source': 'yahoo', 'header': 'Iran says it is ready to continue fuel shipments to Venezuela', 'text': 'Iran will continue fuel shipments to Venezuela if Caracas requests more supplies, the Iranian Foreign Ministry spokesman said on Monday, despite Washington\'s criticism of the trade between the two nations, which are both under U.S. sanctions. "Iran practises its free trade rights with Venezuela and we are ready to send more ships if Caracas demands more supplies from Iran," Abbas Mousavi told a weekly news conference broadcast live on state TV. Defying U.S. threats, Iran has sent a flotilla of five tankers of fuel to the South American oil-producing nation, which is suffering from a gasoline shortage.'}
{'source': 'yahoo', 'header': 'Louisville police and soldiers return fire, killing man', 'text': "Police officers and National Guard soldiers enforcing a curfew in Louisville killed a man early Monday when they returned fire after someone in a large group fired at them first, the city's police chief said. Chief Steve Conrad confirmed the shooting happened around 12:15 a.m. outside a food market on West Broadway, where police and the National Guard had been called to break up a large group of people gathering in defiance of the city's curfew. It recorded the sound of bullets being fired as groups of police and national guard soldiers crouched behind cars."}
{'source': 'yahoo', 'header': 'Truck seen driving into protesters in Minneapolis', 'text': 'A large truck was seen driving at full speed into a crowd of protesters Sunday on a bridge in Minneapolis, sending people running for safety. The Minnesota Department of Public Safety called it "very disturbing actions by a truck driver on I-35W, inciting a crowd of peaceful demonstrators." The truck driver was injured and is under arrest, the department said.'}
{'source': 'yahoo', 'header': 'Tiananmen: Police ban Hong Kong vigil for victims of 1989 crackdown', 'text': 'Hong Kong police have banned a vigil marking the Tiananmen Square crackdown for the first time in 30 years. Currently, Hong Kong and Macau are the only places in Chinese territory where people can commemorate the deadly 1989 crackdown on pro-democracy protesters. In mainland China, the authorities have banned even oblique references the events of June 4, which came after weeks of mass demonstrations that were tolerated by the government.'}
{'source': 'yahoo', 'header': 'Watch live: New York Governor Cuomo gives coronavirus update', 'text': 'New York Governor Andrew Cuomo is giving an update Monday on the state\'s response to the coronavirus pandemic. On Sunday,\xa0 Cuomo used his daily coronavirus briefing on Sunday to plead for calm after a night of unrest in cities throughout the state, saying "violence never works." Meanwhile, Cuomo said Sunday the number of daily coronavirus deaths in the state had dropped to 56, a low not seen since March 24.'}
{'source': 'yahoo', 'header': 'A black congresswoman was pepper-sprayed by police while marching with George Floyd protesters in Ohio', 'text': 'Chip Somodevilla/Getty Images Congresswoman Joyce Beatty represents Ohio\'s 3rd Congressional District in the House of Representatives. While marching in a protest regarding the death of George Floyd, Beatty, who is black, tried to deescalate a confrontation between protesters and police and was hit with pepper spray. "While it was peaceful, there were times when people got off the curb, into the streets, but too much force is not the answer to this," Beatty said.'}
{'source': 'yahoo', 'header': 'India expels Pakistan embassy officials for alleged spying', 'text': 'Two officials at Pakistan\'s High Commission in New Delhi were being expelled for "espionage activities", India\'s foreign ministry said Sunday, allegations its nuclear-armed rival called "baseless". Tensions are already heightened between the neighbouring foes over the Himalayan region of Kashmir, which was split between them in 1947 when they gained independence from Britain. "The government has declared both these officials persona non grata for indulging in activities incompatible with their status as members of a diplomatic mission," the ministry said in a statement.'}
{'source': 'yahoo', 'header': 'The coronavirus is disappearing in Italy, according to Italian doctors', 'text': 'PIERO CRUCIATTI/AFP via Getty Images Italy has been one of the worst-affected countries in the global coronavirus pandemic. However, the COVID-19 virus is now disappearing in the country according to Italian doctors Alberto Zangrillo, who heads a hospital in Milan, said that "in reality, the virus clinically no longer exists in Italy." A leading doctor in Genoa said that "the strength the virus had two months ago is not the same strength it has today."'}
{'source': 'yahoo', 'header': 'Journalists Under Attack Show How Trump’s Hate for the Press Has Spread', 'text': 'Journalists have been attacked all over the world while on the job covering protests for years, but never like they were this week in the United States during the George Floyd protests. At least half a dozen incidences of arrests and attacks were reported in protests across the United States this weekend. Others got less attention, like Los Angeles Times reporter Molly Hennessy-Fiske getting pelted with rubber bullets and tear gas or the two Los Angeles Times photographers who were briefly taken into custody.'}
{'source': 'yahoo', 'header': 'Thousands of Complaints Do Little to Change Police Ways', 'text': 'In nearly two decades with the Minneapolis Police Department, Derek Chauvin faced at least 17 misconduct complaints, none of which derailed his career. Over the years, civilian review boards came and went, and a federal review recommended that the troubled department improve its system for flagging problematic officers. All the while, Chauvin tussled with a man before firing two shots, critically wounding him.'}
{'source': 'yahoo', 'header': 'Most voters plan to cast early ballots in presidential race', 'text': 'Six-in-10 registered voters plan to vote early in the November general election, either by mail or at in-person early voting centers, according to a new TargetSmart + Dynata National Voter Insights Poll. Forty-one percent plan to vote by mail and 19 percent plan to vote in-person early. Another 36 percent plan to vote in-person at their polling place on Election Day.'}
{'source': 'yahoo', 'header': 'Protesters in some cities target Confederate monuments', 'text': 'Protesters demonstrating against the death of George Floyd, a black man who pleaded for air as a white police officer pressed his knee on his neck, targeted Confederate monuments in multiple cities. As tense protests swelled across the country Saturday into Sunday morning, monuments in Virginia, the Carolinas, Tennessee and Mississippi were defaced. The presence of Confederate monuments across the South — and elsewhere in the United States — has been challenged for years, and some of the monuments targeted were already under consideration for removal.'}
{'source': 'yahoo', 'header': 'Coronavirus: South Africans cheer as alcohol goes back on sale', 'text': 'Long queues have formed outside shops selling alcohol in South Africa after restrictions on its sale, imposed two months ago as part of measures to fight Covid-19, were lifted. Social media posts showed people, who had braved the morning chill, cheering as buyers emerged with their bottles. The alcohol ban was to allow police and hospitals to better focus on tackling the coronavirus, the authorities said.'}
{'source': 'yahoo', 'header': "Europe's factories starting to recover, Asia's pain worsens", 'text': "European manufacturers may be over the worst of a coronavirus-driven downturn, but Asia's pain deepened in May due to a slump in global trade, with export powerhouses Japan and South Korea seeing the sharpest falls in activity in over a decade, surveys showed. While factory activity still contracted sharply across Europe last month, purchasing managers said April lows had passed as governments on the continent began to ease the tough lockdown measures implemented to contain the spread of the virus. After crashing to its lowest reading in the survey's nearly 22-year history in April, IHS Markit's Manufacturing Purchasing Managers' Index (PMI) for the euro zone recovered somewhat last month, rising to 39.4 from 33.4."}
{'source': 'yahoo', 'header': 'Trump tweets do little to calm a nation on edge, as more violent protests rock cities', 'text': 'As violent protests continued for a fifth straight night over the death of an African-American man during an arrest by Minneapolis police, President Trump took advantage of the crisis to take a swipe at “the Democrat Mayor” of Minneapolis for failing to control the protests, praising a “great job” by the Minnesota National Guard. The National Guard “should have been used 2 days ago & there would not have been damage & Police Headquarters [sic] would not have been taken over & ruined,” Trump tweeted. As police clashed with demonstrators in New York, Los Angeles, Chicago and other cities, Trump, after returning to the White House from Florida where he witnessed the launch of two astronauts aboard the SpaceX rocket, was uncharacteristically reticent on Twitter.'}
{'source': 'yahoo', 'header': 'Bangladesh lifts virus lockdown, logs record deaths on same day', 'text': 'Bangladesh lifted its coronavirus lockdown Sunday, with millions heading back to work in densely populated cities and towns even as the country logged a record spike in deaths and new infections. "The lockdown has been lifted and we are heading almost towards our regular life," health department spokeswoman Nasima Sultana said, calling on those returning to work to wear masks and observe social distancing. The lifting comes as Bangladesh -- which on Friday took an emergency pandemic loan from the International Monetary Fund -- reported its biggest daily jump in infections Sunday, with 2,545 new cases and a record 40 deaths.'}
{'source': 'yahoo', 'header': 'Hong Kong Police Ban Annual Tiananmen Square Massacre Vigil', 'text': 'Hong Kong police have banned the annual candlelight vigil commemorating the\xa0Tiananmen Square massacre, the deadly 1989 crackdown on students demanding democracy in Beijing, just as tensions rise in the city over controversial national-security legislation. Police denied an application by the group that organizes the vigil in Victoria Park on Hong Kong Island, stating in a letter that the decision was due to concerns surrounding the coronavirus pandemic. “We are extremely disappointed and strongly object to this decision,” said Richard Tsoi, secretary of the organizing group, the Hong Kong Alliance in Support of Patriotic Democratic Movements of China.'}
{'source': 'yahoo', 'header': 'This high-tech Embraer private jet design seamlessly blends sustainability and technology. Take a look at Praeterra.', 'text': "Embraer Brazilian aircraft manufacturer Embraer's Praeterra design concept for its Praetor 600 business jet merges high-tech with sustainability. The aircraft interior features computer circuit board-like designs complemented by fiber-optic ceiling lighting and sidewalls lined with informational screens. Cabin materials are also sourced sustainably and developed in a way that allows them to have a second life once they're no longer required inside the aircraft."}
{'source': 'yahoo', 'header': 'Minneapolis police made 44 people unconscious with neck restraints since 2015', 'text': "Since the beginning of 2015, officers from the Minneapolis Police Department have rendered people unconscious with neck restraints 44 times, according to an NBC News analysis of police records. Minneapolis police used neck restraints at least 237 times during that span, and in 16 percent of the incidents the suspects and other individuals lost consciousness, the department's use-of-force records show. A lack of publicly available use-of-force data from other departments makes it difficult to compare Minneapolis to other cities of the same or any size."}
{'source': 'yahoo', 'header': "Burkina Faso gunmen 'kill dozens' at cattle market in Kompienga", 'text': 'Some 30 people have been killed in eastern Burkina Faso in a gun attack on a cattle market, reports say. Gunmen on motorbikes fired into the crowded market in Kompienga town around lunchtime on Saturday, eyewitnesses and residents said. It is unclear who was behind the attack, but Burkina Faso has seen a recent sharp rise in jihadist violence and inter-communal clashes.'}

Let us try the same thing for another source:

In [192]:
pr = soup['google'].prettify()
i = pr.find('U.S. cities see more protests')
print(pr[i-1000:i+1000])
ta-n-ham="true" data-n-vlb="0" jsaction=";rcuQ6b:npT2md; click:KjsqPd;EXlHgb:HQ4Dqd" jscontroller="mhFxVb" jsdata="oM6qxc;CBMiQGh0dHBzOi8vd3d3LmNiYy5jYS9uZXdzL3dvcmxkL3Byb3Rlc3RzLWdlb3JnZS1mbG95ZC11cy0xLjU1OTI4MjnSASBodHRwczovL3d3dy5jYmMuY2EvYW1wLzEuNTU5MjgyOQ;21" jslog="85008" jsmodel="QWGJif hT8rr">
             <a aria-hidden="true" class="VDXfz" href="./articles/CBMiQGh0dHBzOi8vd3d3LmNiYy5jYS9uZXdzL3dvcmxkL3Byb3Rlc3RzLWdlb3JnZS1mbG95ZC11cy0xLjU1OTI4MjnSASBodHRwczovL3d3dy5jYmMuY2EvYW1wLzEuNTU5MjgyOQ?hl=en-CA&amp;gl=CA&amp;ceid=CA%3Aen" jslog="95014; 4:https://www.cbc.ca/news/world/protests-george-floyd-us-1.5592829; track:click" jsname="hXwDdf" tabindex="-1" target="_blank">
             </a>
             <h3 class="ipQwMb ekueJc gEATFF RD0gLb">
              <a class="DY5T1d" href="./articles/CBMiQGh0dHBzOi8vd3d3LmNiYy5jYS9uZXdzL3dvcmxkL3Byb3Rlc3RzLWdlb3JnZS1mbG95ZC11cy0xLjU1OTI4MjnSASBodHRwczovL3d3dy5jYmMuY2EvYW1wLzEuNTU5MjgyOQ?hl=en-CA&amp;gl=CA&amp;ceid=CA%3Aen">
               U.S. cities see more protests, violent unrest over George Floyd's death
              </a>
             </h3>
             <div aria-hidden="true" class="Da10Tb gEABFF Rai5ob" jsname="jVqMGc">
              <span class="xBbh9">
               With cities wounded by days of violent unrest, America headed into a new week with neighbourhoods in shambles, urban streets on lockdown and shaken ...
              </span>
             </div>
             <div class="QmrVtf RD0gLb">
              <div class="SVJrMe gEAMFF" jsname="Hn1wIf">
               <span aria-hidden="true" class="DPvwYc N3ElHc hEsB5d eLNT1d uQIVzc" jsname="boXlNc">
                amp
               </span>
               <span aria-hidden="true" class="DPvwYc N3ElHc gQtGhf eLNT1d uQIVzc">
                video_youtube
               </span>
               <a class="wEwyrc AVN2gc uQIVzc Sksgp">
                CBC.ca
               </a>
               <time class="WW6dff uQIVzc Sksgp" datetime="2020-06-01T14:49:00Z">
     

Here we find content in h3 elements followed by a div element:

In [193]:
for h in soup['google'].find_all('h3'):
    d = h.next_sibling
    if d and d.name == 'div':
        e = {}
        e['source'] = 'google'
        e['header'] = h.text
        e['text'] = d.text
        print(e)
        news.append(e)
{'source': 'google', 'header': "U.S. cities see more protests, violent unrest over George Floyd's death", 'text': 'With cities wounded by days of violent unrest, America headed into a new week with neighbourhoods in shambles, urban streets on lockdown and shaken ...'}
{'source': 'google', 'header': 'Montrealers rally to protest police brutality and racism', 'text': "Sunday's demonstration was in solidarity with protests in the U.S. in the wake of the killing of George Floyd in Minneapolis."}
{'source': 'google', 'header': 'Hong Kong activist urges Canada, others to speak out against China security bill', 'text': 'One prominent Hong Kong activist and former lawmaker is urging Canada and the rest of the world to speak out against a Chinese national security law that she ...'}
{'source': 'google', 'header': 'Coronavirus: Provinces continue to loosen COVID-19 restrictions', 'text': 'As COVID-19 cases continue to decline in much of the country, some provinces are moving today to loosen more of the restrictions they implemented to slow the ...'}
{'source': 'google', 'header': "'We need justice': Protesters set fires near White House, smash storefronts, march against police brutality in U.S.", 'text': 'Authorities imposed curfews on dozens of cities, the most since the aftermath of the assassination of Martin Luther King Jr in 1968.'}
{'source': 'google', 'header': 'Hundreds attend peaceful Ottawa vigil for Regis Korchinski-Paquet', 'text': 'They came to Dundonald Park on Sunday afternoon by the hundreds, nearly all wearing masks and scrupulously following physical distancing rules. It was part ...'}
{'source': 'google', 'header': '2 COVID-19 deaths reported on last day of May', 'text': "Another two people have died of COVID-19, according to Ottawa Public Health's Sunday report, bringing the city's death toll to 244."}
{'source': 'google', 'header': 'Take a look inside a COVID unit at an Ontario long-term care home', 'text': 'ampvideo_youtubeToronto Star2 days agobookmark_bordersharemore_vert'}
{'source': 'google', 'header': 'COVID-19 in Sask: 1 death in North region, 1 new case in Regina', 'text': 'One more person is dead from COVID-19 in Saskatchewan, bringing the total number of deaths in the province to 11. The person was in their 70s and lived in ...'}
{'source': 'google', 'header': 'Rain pounds B.C.’s southeast, as Central Kootenay placed on ‘unprecedented’ flood evacuation alert', 'text': "British Columbia's Kootenays are being pounded by rain Sunday, as the Regional District of the Central Kootenay, with the exception of the cities of Castlegar ..."}
{'source': 'google', 'header': "Trump tweets Antifa will be labelled a terrorist organization but experts believe that's unconstitutional", 'text': 'U.S. President Donald Trump tweeted Sunday that the United States will designate Antifa as a terrorist organization, even though the U.S. government has no ...'}
{'source': 'google', 'header': "Protests put Trump and Biden's leadership to the test", 'text': "With just 156 days until the 2020 election, you'll be voting before you know it. Every Sunday, Chris Cillizza outlines the 5 BIG storylines you need to know to ..."}
{'source': 'google', 'header': "'No justice, no peace': Protests resume in New York for fourth day", 'text': 'New York City officials were looking for a peaceful way forward as the city entered a fourth day of protests against police brutality that have left police cars ...'}
{'source': 'google', 'header': 'How divisive is politics in the United States? I Inside Story', 'text': 'ampvideo_youtubeAl Jazeera EnglishYesterdaybookmark_bordersharemore_vert'}
{'source': 'google', 'header': 'Global stocks buoyant, dollar slips as economies start to unlock', 'text': 'World stocks hovered near three-month highs and the dollar was flat on Monday as optimism over economies opening up again boosted risk appetite, despite ...'}
{'source': 'google', 'header': 'At the open: TSX starts flat as oil prices drop', 'text': "Canada's main stock index opened lower on Monday, dragged down by energy stocks on falling oil prices, as fears of low demand for crude offset OPEC and ..."}
{'source': 'google', 'header': '18 new COVID-19 cases Sunday; Alberta recoveries over last week outweigh new cases', 'text': 'Based on stats given by the province, 89 per cent of Albertans who have tested positive for COVID-19 are now recovered.'}
{'source': 'google', 'header': 'Russia has no objection to earlier OPEC+ meeting: sources', 'text': 'MOSCOW/LONDON (Reuters) - Russia has no objection to the next meeting of OPEC and its allies, known as OPEC+, being brought forward to June 4 from the ...'}
{'source': 'google', 'header': 'Buffett-backed BYD to supply EV batteries to Ford', 'text': 'Daily news of the business leaders and top investors who make the markets : Warren Buffett, George Soros, Michael Bloomberg, Peter Lynch, Richard Branson,.'}
{'source': 'google', 'header': 'Redmibook 13 slated for a June 11th launch in India', 'text': 'Xiaomi will reportedly unveil its first-ever laptop for the Indian market on June 11th. It is rumored to be a rebranded Redmibook 13 model featuring 10th ...'}
{'source': 'google', 'header': "Sony says PS5 might not have the 'lowest price' in battle against Xbox Series X", 'text': "One of the big questions resting on the tip of gamers' tongues is, just what will the PS5 price tag be? A new interview with Sony Interactive Entertainment's ..."}
{'source': 'google', 'header': 'Lenovo Mirage VR S3 Standalone Headset with ThinkReality Is Ready to Empower Global Enterprises and Their Workers', 'text': 'Today, during the VR/AR Global Summit Online Conference, Lenovo™ (HKSE: 992) (ADR: LNVGY) announced the latest addition to its portfolio of commercial ...'}
{'source': 'google', 'header': 'Daily horoscope for Monday, June 1, 2020', 'text': 'There are no restrictions to shopping or important decisions today. The Moon is in Libra.'}
{'source': 'google', 'header': 'Singer Halsey slams arrest reports', 'text': 'Halsey has slammed reports suggesting she was arrested during the violent Los Angeles Black Lives Matter clashes between police and protesters on Saturday.'}
{'source': 'google', 'header': 'Mustang Drive-In set to reopen following provincial announcement', 'text': 'The Ontario government is allowing drive-in movie theatres to reopen amid the COVID-19 pandemic. The decision was announced on Saturday as the province ...'}
{'source': 'google', 'header': 'Elton John lays off band after losing US$75 million due to cancelled tour', 'text': 'Elton John has reportedly been left “bereft” after taking a significant US$75 million hit after the coronavirus forced him to cancel his farewell tour.'}
{'source': 'google', 'header': "Vanessa Bryant shares photo of Kobe in an 'I Can't Breathe' T-shirt", 'text': 'Widow of basketball icon posts Instagram message to fight for change and register to vote.'}
{'source': 'google', 'header': 'MLS players approve summer tournament', 'text': 'TORONTO — MLS players have approved taking part in a summer tournament in Orlando, agreeing to a "package of economic concessions" for the revamped ...'}
{'source': 'google', 'header': 'Video shows Jon Jones confront vandals during George Floyd protest in Albuquerque', 'text': "Jon Jones may not be planning on competing anytime soon but that won't stop him from fighting crime."}
{'source': 'google', 'header': "Masai Ujiri says the conversation about racism 'can no longer be avoided'", 'text': 'In a Globe and Mail column, Toronto Raptors president Masai Ujiri reflected on the death of George Floyd in Minneapolis and protests erupting across the United ...'}
{'source': 'google', 'header': "After Dragon's historic docking, America has more new spaceships on the way", 'text': "It flew really well, very crisp. We couldn't be happier.”"}
{'source': 'google', 'header': "Russian space agency calls Trump's reaction to SpaceX launch...", 'text': "SPACE-EXPLORATION/SPACEX-LAUNCH-RUSSIA (PIX):Russian space agency calls Trump's reaction to SpaceX launch."}
{'source': 'google', 'header': 'SpaceX Crew Dragon chalks up picture-perfect docking at International Space Station', 'text': "Nineteen hours after a spectacular Florida launch, SpaceX's Crew Dragon capsule caught up with the International Space Station early Sunday and glided in for ..."}
{'source': 'google', 'header': 'Tesla Sentry Mode Captures Falcon 9 & Crew Dragon Liftoff', 'text': 'May 31st, 2020 by Johnna Crider. Tesla Sentry Mode is great at capturing vandals and thieves, witnessing fistfights, and identifying people who like to be just ...'}
{'source': 'google', 'header': 'Union calls on province to take control of a Woodbridge long-term care home after COVID-19 outbreak', 'text': 'A long-term care home in Woodbridge says 18 people were sent to the hospital on Saturday as a COVID-19 outbreak continued at the facility. The Woodbridge ...'}
{'source': 'google', 'header': "NB health authority CEO says COVID-19 outbreak is 'worst possible scenario'", 'text': 'FREDERICTON — The chief executive of a New Brunswick health network says the ongoing COVID-19 outbreak in the north of the province is a worst-case ...'}
{'source': 'google', 'header': 'Restaurants, bars, some school services allowed today in Manitoba', 'text': 'Manitobans will be allowed to visit dine-in restaurants, drink in bars and go bowling today, as the province eases more restrictions that were put in place to help ...'}
{'source': 'google', 'header': 'At least 21 employees at St. Catharines greenhouse test positive for COVID-19', 'text': 'At least 21 employees at a St. Catharines greenhouse have tested positive for COVID-19. Officials with Pioneer Flower Farms say they began testing all ...'}
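
Since both sources follow the pattern "h3 followed by some other element", the two loops can be folded into one function with a per-source rule. This is only a sketch: the names RULES and extract are hypothetical, and for Yahoo it checks the tag name of the sibling instead of the length heuristic used above, so the results may differ slightly.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical per-source rules: which tag is expected right after each h3
RULES = {'yahoo': 'p', 'google': 'div'}

def extract(source, markup):
    """Return header/summary pairs for one source, following RULES."""
    wanted = RULES[source]
    items = []
    for h in BeautifulSoup(markup, 'html.parser').find_all('h3'):
        sib = h.next_sibling
        # keep only h3 elements whose direct sibling is the expected tag
        if isinstance(sib, Tag) and sib.name == wanted:
            items.append({'source': source,
                          'header': h.get_text(strip=True),
                          'text': sib.get_text(strip=True)})
    return items

# Made-up markup to illustrate the two rules
demo = '<h3>A</h3><div>B</div><h3>C</h3><p>D</p>'
print(extract('google', demo))
print(extract('yahoo', demo))
```

When a site changes its layout, only the entry in RULES (or, more likely, the rule logic for that source) needs to be revisited.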

Aggregating the Aggregators

Just as a proof of concept we use the following very simple approach:

  • exclude stop words from the headers
  • compute the similarity between an item from one source and every item from the other source
  • list each item together with its most similar counterpart

First we get a huge list of stop words created from news articles:

In [194]:
import wget

wget.download('https://github.com/vikasing/news-stopwords/raw/master/sw1k.csv', 'sw1k.csv')
Out[194]:
'sw1k (1).csv'

Use pandas read_csv() and keep only the 100 most frequent stop words:

In [195]:
import pandas as pd

stopw = set(pd.read_csv('sw1k.csv')['term'][:100])

print(stopw)
{'is', 'and', 'not', 'years', 'three', 'other', 'even', 'back', 'only', 'at', 'because', 'that', 'in', 'what', 'get', 'any', 'up', 'new', 'said', 'with', 'there', 'so', 'as', 'first', 'its', 'been', 'made', 'an', 'had', 't', 'could', 'are', 'it', 'one', 'people', 'all', 'than', 'which', 'most', 'being', 'two', 'make', 'against', 'who', 'i', 'they', 'no', 'how', 'them', 'have', 'during', 's', 'their', 'was', 'our', 'we', 'year', 'also', 'a', 'if', 'day', 'for', 'like', 'but', '1', 'about', 'may', 'into', 'while', 'be', 'of', 'you', 'before', 'from', 'his', 'or', 'more', 'on', 'by', 'just', 'would', 'last', 'out', 'will', 'where', 'can', 'this', 'now', 'do', 'over', 'some', 'has', 'the', 'were', 'to', 'when', 'us', 'after', 'he', 'time'}
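
All stop words in this set are lowercase and free of punctuation, so header words such as 'On' or 'protests,' will only match after some normalization. A minimal sketch of such a step follows; the function name keywords and the tiny demo_stop set are assumptions for illustration. Splitting on non-alphanumeric characters also turns "Europe's" into 'europe' and 's', and 's' is itself in the stop word list.

```python
import re

def keywords(header, stopwords):
    """Lowercase a header, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", header.lower())
    return {t for t in tokens if t not in stopwords}

demo_stop = {'the', 'in', 'of', 'on', 'all'}   # tiny stand-in for stopw
kw = keywords("On all sides, fears of 'outside agitators' in Floyd protests", demo_stop)
print(kw)
```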

Now we implement our naive, and not very successful, method:

In [196]:
import numpy as np

# keyword set per item: the header words that are not stop words
n = len(news)
for i in range(n):
    news[i]['kw'] = set([ w for w in news[i]['header'].split() if w not in stopw])
    #print(news[i]['kw'])

primary = 'yahoo'
for i in range(3):   # only the first three items, to keep the output short
    if news[i]['source'] != primary: continue
    print(news[i]['source']+':', news[i]['header'])
    for src in set(sources) - set([primary]):
        sim = []
        for j in range(n):
            if news[j]['source'] != src:
                sim.append(0)   # items from other sources never win
            else:
                x = news[i]['kw']
                y = news[j]['kw']
                # Jaccard similarity; the small offset ranks every item
                # from src above the zeros of the other sources
                sim.append(len(x.intersection(y)) / len(x.union(y)) + 0.001)
        idx = np.asarray(sim).argsort()[-1]   # index of the highest similarity
        print(news[idx]['source']+':', news[idx]['header'])
    print('')
yahoo: Protesters tear through D.C. after National Guard troops and Secret Service keep them from the White House
google: 'We need justice': Protesters set fires near White House, smash storefronts, march against police brutality in U.S.

yahoo: On all sides, fears of 'outside agitators' in Floyd protests 
google: Video shows Jon Jones confront vandals during George Floyd protest in Albuquerque

yahoo: Israel police kill Palestinian they mistakenly thought was armed
google: Montrealers rally to protest police brutality and racism

EXERCISES:

  • find more sources to scrape
  • find more applications to process the text from various sources