Data Science News Flash: 07-26-2019

The latest data science articles - algorithmically curated, ranked, and summarized just for you.


News Flash is a weekly publication that features the top news stories for a specific topic. The stories are algorithmically curated, evaluated for quality, and ranked so that you can stay on top of the most important developments. Additionally, the most important sentences for each story are extracted and displayed as highlights so you can get a sense of what each story is about. If you want more information for a particular story, just click the heading or image to read the entire article.

You can see the other topics we have News Flashes available for here and sign up to receive any that you're interested in.


17 Data Science Applications & Examples You Should Know

Highlights:

  • It's also based on really good math, according to the Washington Post: specifically, a type of data science known as network science, which essentially forecasts the growth of a user's social network based on the growth of similar users' networks.
  • Liverpool's soccer team almost won the 2019 Premier League championship with data science, which the team uses to ferret out and recruit undervalued soccer players.
  • The most cutting-edge data scientists, working in machine learning and AI, make models that automatically self-improve, noting and learning from their mistakes.
  • Facebook, of course, uses data science in various ways, but one of its buzzier data-driven features is the "People You May Know" sidebar, which appears on the social network's home screen.
  • Data analysts synthesize big data to answer concrete questions grounded in the past, e.g., "How has our subscriber base grown from 2016 to 2019?" In other words, they mine big data for insights on what's already happened.



The subtle art of really big data: Recursion Pharma maps the body

Highlights:

  • In the cell painting approach adopted by Recursion, the goal is to identify not just a handful but hundreds or thousands of features of cells that a machine learning program can then analyze.
  • "It's still a big data problem," says Mason Victors, chief technologist of Recursion, which is based in Salt Lake City, Utah.
  • Using its capital, Recursion is in Phase I clinical trials for the treatment of CCM, the problem Gibson was studying when he had the epiphany about cell painting and big data.
  • Because much of AI is affected by small statistical variations in the data, being cognizant of things such as distribution shifts may play a role in getting useful analysis out of the model.
  • By knowing the "vintage" of its data, if you will, the company can adjust its analysis to take into account how that drift may affect machine learning.
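The "vintage" idea above can be sketched as a simple drift check: compare a summary statistic of a newer data batch against an older one and flag large deviations. This is a minimal illustration, not Recursion's actual method; the feature values and the z-score threshold are invented for the example.

```python
# Hypothetical sketch: flag distribution shift between two "vintages" of a
# numeric feature by comparing the new batch's mean against the old batch's.
from statistics import mean, stdev

def drifted(old_batch, new_batch, z_threshold=2.0):
    """Return True if the new batch's mean deviates from the old batch's
    mean by more than z_threshold standard deviations of the old batch."""
    mu, sigma = mean(old_batch), stdev(old_batch)
    if sigma == 0:
        return mean(new_batch) != mu
    return abs(mean(new_batch) - mu) / sigma > z_threshold

old = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]    # earlier vintage of some cell feature
new = [1.6, 1.7, 1.5, 1.65, 1.55, 1.7]    # later vintage, shifted upward

print(drifted(old, new))  # prints True: a shift this large is flagged
```

In practice one would compare full distributions (for example with a two-sample test) rather than just means, but the principle is the same: segment data by collection time and test each segment before trusting the model on it.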



Is Your Data Ready for Machine Learning?

Highlights:

  • This is almost a separate engineering discipline, data engineering, which incorporates data sanitization, setting up the data infrastructure, and preparing data for analysis.
  • If best practices aren't used when the data is first collected, it is likely that the data will be generally inconsistent, occasionally incorrect, and in need of significant preprocessing before any analytics or machine learning.
  • We have used these data quality dimensions in our machine learning practice and have found them to be a very helpful way to understand the quality of the data we are working with.
  • While there are techniques to work around missing data when applying machine learning, as a best practice it is important to evaluate how complete the data is and set a threshold defining the required level of completeness.
  • We'd be lucky to find such perfect data in the real world, where many imperfect, raw datasets require proper data sanitization before you can even start thinking about applying machine learning.
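The completeness threshold mentioned above can be sketched in a few lines: measure the fraction of non-missing values per field and reject fields that fall below a required level. This is a generic illustration, not the article's implementation; the field names, records, and the 0.9 threshold are invented for the example.

```python
# Hypothetical sketch: evaluate per-field completeness of a dataset
# (records as dicts) and list fields below a required threshold.
def completeness(records, field):
    """Fraction of records where `field` is present and not None."""
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)

def fields_below_threshold(records, fields, threshold=0.9):
    """Return the fields whose completeness falls below the threshold."""
    return [f for f in fields if completeness(records, f) < threshold]

rows = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},
    {"age": None, "income": 61000},
    {"age": 41, "income": 48000},
    {"age": 37, "income": 55000},
]

# Both fields are 4/5 = 0.8 complete, below the 0.9 threshold.
print(fields_below_threshold(rows, ["age", "income"]))  # prints ['age', 'income']
```

Fields that fail the check would then be imputed, re-collected, or dropped before training, depending on how critical they are to the model.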



Informatica makes the case: Why data needs AI and vice versa

Highlights:

  • In a recent blog post, Amalgam Insights analyst Lynne Baer (no relation, at least that we know of), delivered a good overview of how Informatica's underlying CLAIRE machine learning engine is driving Informatica's product strategy.
  • The impact of using the wrong data on a machine learning or deep learning model, or not using an adequately sized cohort of data to train and then run the model, will be arguably far greater compared to working with compromised data on a static data science model.
  • The company believes that AI (in most cases here, machine learning) is the only way that enterprises will be able to keep the data quality issue under control.
  • CLAIRE itself is not a product or tool, but it's an umbrella for the machine learning capabilities that are sprinkled through Informatica's suite.
  • Examples include automating the parsing, transforming, and joining of variably structured data; tagging data so it can be identified for classification, governance, and privacy-sensitive handling; flagging potential data quality issues or places to de-duplicate; scanning a data set to generate data quality rules; scoring data so it can be labeled in a business glossary; and providing machine assistance in data discovery.



Squirrel AI Learning Attended AI & Big Data Expo Global 2019 in London

Highlights:

  • In addition, the MCM system developed by Squirrel AI Learning can break down students' models of thinking, capabilities, and methods of learning, and then provide training in these abilities and methods within a single subject according to each student's learning status.
  • Anand Mariappan, senior director of Reddit in charge of search and machine learning engineering, reviewed the history, current projects and future direction of Reddit's machine learning that covers data platforms, feed rankings, recommendations, user and channel similarities.
  • Based on user status evaluation engine and knowledge recommendation engine, Squirrel AI Learning will build a data model to detect the gaps of knowledge for each student accurately and efficiently and then recommend corresponding learning content according to these gaps.
  • By the beginning of this year, Squirrel AI Learning had set up nearly 2,000 learning centers in more than 300 cities across China, with nearly 2 million registered student users.
  • Mariappan said Reddit is currently developing machine learning programs to optimize personalized models, and has achieved amazing results in the early development phase by using models on TensorFlow to improve the quality of content recommendation.



Machine Learning vs. Climate Change: AI for the Greener Good

Highlights:

  • Controlling such a network of supply, storage, and demand in the presence of uncertainty and streams of data from millions of different sources is a job for machine learning.
  • Churning through satellite data, particularly where it requires feature recognition, is a job that machine learning algorithms already excel at.
  • When crisis hits, machine learning algorithms can trawl through aerial photography, satellite data, and even social media posts in real time, providing information to rescuers about where help is most needed.
  • As autonomous vehicles become increasingly prevalent, machine learning algorithms can optimize with emissions in mind and help cut down on the sector that accounts for a quarter of carbon dioxide emissions.
  • Now, a team of researchers from multiple institutions, including Coursera founder Andrew Ng, Google Chief Scientist John Platt, and Turing Award winner Yoshua Bengio, has published a 100-page research paper outlining some of the areas where machine learning is best placed to make a difference.



More to machine learning platforms than meets the AI

Highlights:

  • Companies can now store their data, run their applications and access extensive machine learning and AI tool sets and libraries in cloud systems.
  • Machine learning platforms have to provide data scientists with the functionality to integrate with data sources and popular open source tools as well as interfaces for nontechnical users and compatibility with a company's overall IT architecture.
  • Consumer products maker and online retailer Brandless Inc. began moving from custom-built Python notebooks running in on-premises machines to a scalable, cloud-based data science architecture using Amazon SageMaker, MLflow and Databricks.
  • For production-level projects, machine learning platforms need to be scalable as well as provide the ability to compare and retrain models and integrate with enterprise data systems and other technology infrastructure.
  • And as companies transition from proof-of-concept pilots and individual use cases to broader, enterprise-wide AI strategies, the machine learning platforms at the heart of many of those strategies often change as well.



Creating Reproducible Data Science Projects

Highlights:

  • If your team doesn't already have its own project structure, consider using tools such as Cookiecutter to generate a standard data science project folder structure for you.
  • This can be combined with DVC, a version control system designed to version the output of machine learning projects effectively, without pushing your large data and model files to Git.
  • Using an automatic code formatter such as Black will ensure that the code in your team's projects has a consistent style, improving readability.
  • This will enable the sharing of code across your team, avoiding duplicate and slightly edited versions of core data science code being scattered across your team's notebooks.
  • This blog article gives an overview of how we avoid this scenario by ensuring our data science projects and code are reproducible and production-ready from the outset.
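The workflow the highlights describe can be sketched as a few commands. This is an illustrative outline, not the article's exact setup: it assumes Cookiecutter, DVC, Git, and Black are installed, uses the widely adopted cookiecutter-data-science template as one example, and the project and file paths are placeholders.

```shell
# Scaffold a standard data science project layout (one common template).
cookiecutter https://github.com/drivendata/cookiecutter-data-science

# Track large data files with DVC so they stay out of Git history.
cd my_project
git init
dvc init
dvc add data/raw/dataset.csv            # writes a small .dvc pointer file
git add data/raw/dataset.csv.dvc .gitignore
git commit -m "Track raw dataset with DVC"

# Enforce a consistent code style across the team's source code.
black src/
```

With a DVC remote configured, `dvc push` uploads the data itself to shared storage while Git keeps only the lightweight pointer files, so collaborators can reproduce the project with `git clone` followed by `dvc pull`.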



A computing visionary looks beyond today's AI

Highlights:

  • It is something not unlike "convolutions" in the convolutional neural network of machine learning, which learn features of data at various places in a data sample by having signals from that one area repeatedly emphasized in the network.
  • Earlier this year, Terry Sejnowski of the Salk Institute in La Jolla, California, who is a pioneer in machine learning and was a mentor to Geoffrey Hinton, joined with researchers in his lab to publish research on spiking nets in which learning was accomplished via a transfer of parameters from an ordinary deep learning network.
  • "If you only want to continue the state of the art in AI, there is no reason to think about anything but deep learning," says Siegelmann.
  • That shortcoming is an artifact, in Siegelmann's view, of the fact that machine learning is crafted within the von Neumann machine, and its static, limited programming model.
  • Siegelmann, who holds two appointments, one with the University of Massachusetts at Amherst as professor of computer science, and one as a program manager at the Defense Advanced Research Projects Agency, DARPA, sat down with ZDNet to discuss where neuromorphic computing goes next, and the insights it can bring about artificial intelligence, especially why AI succeeds and fails.
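The weight sharing that the first highlight describes can be shown in miniature: in a 1-D convolution, one small kernel is applied at every position of the input, so a single set of weights "sees" every region of the data sample. This is a generic textbook illustration, not code from the article; the kernel and signal values are invented.

```python
# Minimal 1-D convolution sketch (valid mode, and really cross-correlation,
# as in most deep learning frameworks): the SAME kernel weights are reused
# at every position of the input signal.
def conv1d(signal, kernel):
    """Slide the shared kernel across the signal and return the responses."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

edge_detector = [-1.0, 1.0]              # one shared pair of weights
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    # a step edge in the input

# The shared weights respond only where the edge occurs.
print(conv1d(step, edge_detector))  # prints [0.0, 0.0, 1.0, 0.0, 0.0]
```

Because the same weights are emphasized at every location, the network learns a feature (here, an edge) wherever it appears in the sample, which is the repetition-across-positions idea the highlight compares to spiking-network mechanisms.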



Data science is transforming, thanks to automation

Highlights:

  • Meanwhile, data science automation that leverages AI and machine learning could immediately address some of the pressing issues that have plagued the data science process.
  • While it may still be in its nascent stages, automation in data science and analytics holds great potential to unlock the real business value of big data.
  • Data has emerged as one of the most valuable assets for companies across many industries, and investing in data science definitely promises lucrative returns.
  • Without clearly defined business goals or context, data scientists would not only be drowning in a large volume of data, but also producing complex analysis and models that have no business value.
  • Beyond that, data science automation could also empower a whole host of citizen data scientists, allowing them to experiment with business models while pushing the organization toward being more data-driven.



Produced and Sponsored by:

Innovative Data Science & Advanced Analytics Solutions



Provide Feedback | Unsubscribe