Using Machine Learning and Artificial Intelligence in eDiscovery Projects

Karl Schliep and Steven Gawthorpe discuss developments in the eDiscovery space and how BRG’s Artificial Intelligence and Machine Learning group uses technology to solve complex business problems.


TRANSCRIPT

MJ 00:00              Welcome to Force Multiplier, the official podcast of BRG's Global Applied Technology team. The GAT team, as we call ourselves, is a globally distributed team of software engineers, data scientists, graphic designers, and industry experts who serve clients through our BRG Drive™ web analytics platform. We're helping some of the world's largest and most innovative companies and governments transform data into actionable insights. I'm Michael Jelen. And in these conversations, we speak with people both internal and external to BRG to discuss how technology, and specifically software, acts as a force multiplier to extend the impact of people across any kind of professional organization.

Hi, everyone. Welcome back to Force Multiplier, GAT's podcast here at BRG. I'm joined today by two guests, Karl and Steven, from our machine learning team. Karl, Steven, so nice to have you. Thanks so much for joining. And, Karl, I'd love for you to start off and tell everyone a little bit about yourself.

KS 00:55               Hi. Yeah, I'm Karl Schliep. I'm a senior managing consultant at BRG, a PhD, and one of the lead data scientists in the Artificial Intelligence and Machine Learning group. A lot of what we do in this practice is create artificial intelligence and machine learning solutions for complex business problems. Those problems span a bunch of different sectors: banking, litigation, energy, things like that. And we're constantly looking for new places to apply our tools to give that edge to different groups around Berkeley Research Group.

Steven, do you want to talk about yourself a bit?

SG 01:31               Yeah. Hi. My name is Steven Gawthorpe, also a PhD, a senior managing consultant in data science alongside Karl. I've been with Berkeley Research Group for about nine months now. I came from the United Nations Office on Drugs and Crime prior to this, where I worked on projects involving natural language processing and open-source intelligence. Thank you for having me here today.

MJ 02:01              Awesome. Well, yes, thank you so much. I'm super excited. In our conversations, we've been talking about a couple of the different projects and things you've been working on. I think they're super exciting, and I think everyone else would love to hear a bit more about them. Maybe we can start off with what you're doing in the eDiscovery space and a new technology partner that you're going to be working with soon.

KS 02:20               Yeah, great. I'll start off with this one. So, in the eDiscovery space, we're trying to deal with a problem that is ubiquitous across a bunch of different fields: getting at the root of the litigation. Where are the documents that specifically tie into the arguments either the plaintiff or the defendant is trying to make? So, it's searching through millions of documents and identifying: here are the most important ones; here are the ones that we need to review and produce.

And so, in the eDiscovery space, what we found is there's not a lot of literature on identifying privileged documents. The proportion of privileged documents in a case is sometimes as low as 0.1 percent. You've got a million documents, and only 0.1 percent of them are these privileged documents that you can't produce.

So, what we do is, we use natural language processing and artificial intelligence to give a ranked score of the likelihood that a document is privileged, based on other documents that have been predefined as privileged. And what the academic literature and other software providers have doesn't quite solve the problem very well. So, what Steven and I have done is partner with a company called Snorkel, which uses weak supervision to help expedite this process. The idea is to take the very limited number of documents that have already been defined as privileged and abstract out the ideas: what about this document makes it privileged? Then we take those rules, even if each one applies to only 10 percent of the documents, and stack them on top of each other, so that if, say, five rules agree, we can say with confidence that this document is also privileged, based solely on documents that have already been defined.

Steven, do you want to talk a little bit more about Snorkel and how that works with weak supervision, and what weak supervision is?

SG 04:20               Yeah, sure. Weak supervision is pretty interesting. In machine learning and AI, one of the big foundations of all of it is labeled data. And it can be quite laborious to hand-label thousands upon thousands of documents as relevant or irrelevant, and so forth.

And so, what weak supervision does is offer a compromise that deals with this data-labeling problem, right? How it works is, you provide a series of heuristics or rules about whatever it is that you're trying to label. So, in the eDiscovery space, we might take a heuristic based on the to and from fields: emails coming from these key people. You'd build up a dictionary of important names. You might assign some other types of rule-based behaviors: important keywords, key term phrases, things of that nature.

And what it does is, you assign a certain number of rules, and you can find conflicts between these rules. But the weak supervision aspect resolves all the conflicts between rules that overlap and overgeneralize, and it creates an optimized output: a labeled sample. So weak supervision can actually help us label data thousands of times faster than if we were to do it manually, document by document. So, it's a very interesting technology that we have at our disposal.
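
[Editor's note: for readers who want to see what this looks like in practice, here is a minimal sketch using the open-source snorkel library's labeling-function API. The rules, email addresses, and toy documents are hypothetical illustrations, not rules from any actual matter.]

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_PRIVILEGED, PRIVILEGED = -1, 0, 1

# Hypothetical dictionary of key people (in-house and outside counsel).
COUNSEL = {"gc@company.example", "jane.doe@lawfirm.example"}

@labeling_function()
def lf_counsel_sender(x):
    # Rule: mail from known counsel suggests privilege.
    return PRIVILEGED if x.sender in COUNSEL else ABSTAIN

@labeling_function()
def lf_privilege_phrases(x):
    # Rule: boilerplate phrases that often mark privileged material.
    phrases = ("attorney-client", "legal advice", "work product")
    return PRIVILEGED if any(p in x.text.lower() for p in phrases) else ABSTAIN

@labeling_function()
def lf_newsletter(x):
    # Rule: mass-mail newsletters are almost never privileged.
    return NOT_PRIVILEGED if "unsubscribe" in x.text.lower() else ABSTAIN

# Toy corpus standing in for millions of real documents.
df = pd.DataFrame({
    "sender": ["gc@company.example", "news@vendor.example", "bob@company.example"],
    "text": [
        "Here is my legal advice on the merger terms.",
        "Monthly newsletter - click unsubscribe to opt out.",
        "Lunch on Friday?",
    ],
})

# Each rule labels only the slice of documents it applies to; the LabelModel
# learns how much to trust each rule and resolves their conflicts into one
# probabilistic label per document.
L = PandasLFApplier(lfs=[lf_counsel_sender, lf_privilege_phrases, lf_newsletter]).apply(df=df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=100, seed=42)
print(label_model.predict_proba(L))  # P(not privileged), P(privileged) per doc
```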

MJ 06:05              Got you. First of all, on the surface, it doesn't sound like this is something a lot of people are doing, but it seems like the speed you just mentioned—getting through that haystack as quickly as possible—is really a huge value-add. I guess, what does it look like to an attorney that's working with us, or what does that end product look like? How do they use it?

KS 06:28               Yeah. So, at the end of the day, lawyers and other companies who might work with us on this are used to this kind of thing. They do this eDiscovery process multiple times a year, and they know one of the most expensive parts is just how long it takes to sift through that haystack looking for the needles. In a previous case we were working on, this ran three or four months, and that's pretty common.

Using the solution that we're putting together, which really nobody else in the field has, we're hopefully going to be able to cut that timeline down to about six weeks end to end, rather than three or four months. So, we're cutting the costs involved in this whole eDiscovery process in half. And that's something nobody else has. And the reason nobody else has done this right now is that the technology and the partnering we're doing are pretty recent on the market. I think Snorkel has only been on the market for about nine months, and we're getting in there early. We're innovators at the end of the day. We try to find the best solutions for our clients using the newest technology and give it to them faster than everybody else. So, we're going to be setting the curve here. We are going to be well in front of the competition, and that's something we can bring in.

MJ 07:44              Cool. And in my exposure to litigation, I know certain judges are often wary of using technology, especially technology that prevents humans from looking at every single piece of paper. How is this framed within the current legal structure, and what do people think about technology like this?

SG 08:03               Well, it is new technology, and we're not replacing any of the junior lawyers or associates that are doing manual review. We're just going to be able to optimize and provide samples for them to review much faster. So, we're not entirely turning this process upside down. We're just going to be able to make the best use of the data that we have at our disposal and hasten our turnaround time.

KS 08:33               And one thing I'll add to that is the courts are becoming more accepting of these types of technologies, because the scale of the data that they're having to deal with is becoming prohibitive. If they were to rule that, yes, a lawyer has to lay eyes on every single document, these court cases could take years. Back in the early 2000s, this is what they did: they just had lawyers look at every document. But the scale of the data, the number of emails, the number of files, the number of folders that everyone is accumulating as we go on, makes it prohibitively expensive to look at all of these. So, I'd say the courts are becoming more open to it.

And as Steven mentioned, this process still keeps lawyers in the loop. All we're doing is giving them better documents to review. So rather than spending 90 percent of their time looking at documents that are not even close to being privileged or what we're looking for, we're giving them highlighted documents: here are the top, most likely documents to review, so they can get through this faster. And then we can apply a model at the end that says, "We are 95 percent certain that every document above this cutoff is going to be privileged." Eventually, lawyers will look through those.

But the other value-add that we're giving them is, "Here's where our model is not very certain. Look here. And here's where our model is very certain it's definitely not privileged." And you can use sampling techniques to prove out that, yes, within our margins of error, here are the documents that you're looking for, without us missing any.
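
[Editor's note: a hedged sketch of the triage idea Karl describes: rank documents by model score, auto-bucket the confident ends, and route only the uncertain middle band to human reviewers, spot-checking the "clean" band by sampling. The cutoffs and data below are illustrative, not values from any actual matter.]

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=1_000_000)  # stand-in for P(privileged) per document

HIGH, LOW = 0.95, 0.05                # illustrative cutoffs

likely_priv  = np.where(scores >= HIGH)[0]                    # review these first
likely_clean = np.where(scores <= LOW)[0]                     # validate by sampling
uncertain    = np.where((scores > LOW) & (scores < HIGH))[0]  # human review band

# Spot-check a random sample of the "clean" band to estimate, within a
# stated margin of error, how many privileged documents slipped through.
sample = rng.choice(likely_clean, size=1_000, replace=False)
print(len(likely_priv), len(uncertain), len(sample))
```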

MJ 10:07              Cool, and the final question on this specific project and work stream: you mentioned privilege, but is this something that could potentially have other use cases and expand to different industries as well? I'm just thinking that whoever's listening to this is maybe thinking about the cases they're working on and the number of documents they have to review. Are there other use cases and problems that this technology can solve?

SG 10:31               Yeah. Weak supervision is an incredible new technology, and it can be applied across all aspects of artificial intelligence and machine learning. Any type of task where you face the laborious job of manually reviewing and labeling documents, weak supervision has its place in all of that.

KS 10:57               Some other applications, just to put it concretely, are in forecasting. We know that at some point in time, some event happens, and so it's about having the ability to look through and abstract out why, at this point, we're saying there was fraud. Why is there fraud at this point?

Taking all those ideas and extracting them, we can then apply them in other places, such as the medical field. If you want to classify different types of documents, whether they're about cancer or other medical topics, going through and specifically labeling each one using natural language processing techniques, using the words inside them, can be very laborious. I mean, you can have paid doctors go through and review every single one, or we can use this—I'll call it label enhancement—process that speeds it all up. We can apply the same technology there. So, its applications are anywhere that you need labeled data and it takes a while to get those labels. Anywhere it's cost-prohibitive to do that, this application is perfect for it.

MJ 12:11              Perfect. That's awesome. Just transitioning a little bit from data that's been labeled to situations where you're dealing with large amounts of unstructured data. Can you talk a little bit about the project that you're doing in the crypto space with extracting information from a ton of different online forums and places like that?

SG 12:31               Yeah, sure. So, we have a project that's still in development at this stage where we're mining data on Twitter and Reddit related to scams and fraudulent activity in cryptocurrency. It's quite tricky because there's a lot of noise. It's real-time data. It's very fast moving, high volume. But we've started to use natural language processing techniques to classify these data points: where people are victims of scams and fraud in cryptocurrency, we can extract information about what happened, the amount of money that's been lost, who's the perpetrator of some of these crimes. And the end goal is to aggregate all of this information in a systematic way so that we can map the entire risk landscape of cryptocurrency. So we can say: what are the run-of-the-mill scams? What are the run-of-the-mill fraudulent activities? We've seen this type of stuff before. Then we can have a better estimator of what the risks are and try to bolster defenses or steer clear of certain scams and fraudulent activities. So, it's a very, very interesting project that's still in development, but we're seeing some fruitful results already at this point.

MJ 13:55              Very cool. And I guess at a high level, for someone who's not very technical, if they heard something like that, that you can provide a risk score across the entire industry, that seems very daunting, and that's quite a statement. Could you break it down into what steps actually take place when you go through this process? Are you starting with a certain token that you're looking for, or a hypothesis, or are you just ingesting a ton of information? And how do you extract that and group it together into things that are actually meaningful and provide some sort of metric around risk?

SG 14:29               Yeah. So, the first part of the project is just to query a large volume of data from Twitter and Reddit using specific keywords. Then the initial modeling part is developing a classification model: basically, a model that can quickly go through a large stream of information and say, "Okay, this is relevant. This is irrelevant," and separate the wheat from the chaff, so to speak, getting all of the relevant content out first and separating it, so that we know there's some likelihood that this is a relevant scam report. For example, you might get specifics in the text saying, "I just got scammed by X, and this happened to me or to us, and be careful of that."
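
[Editor's note: a minimal, hypothetical baseline for the relevance classifier described above, using TF-IDF features and logistic regression; a production system would more likely use a fine-tuned transformer, and the example posts are invented.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled sample; a real project would label far more,
# possibly with weak supervision as described earlier in the episode.
train_texts = [
    "I just got scammed by example-swap, lost 2 ETH, be careful everyone",
    "fake airdrop drained my wallet, do not click that link",
    "what a great day for a walk in the park",
    "anyone watching the game tonight?",
]
train_labels = [1, 1, 0, 0]  # 1 = relevant (possible scam report), 0 = irrelevant

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

stream = ["warning: phishing popup stole my seed phrase", "lovely sunset tonight"]
print(clf.predict_proba(stream)[:, 1])  # likelihood each post is relevant
```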

SG 15:26               These types of keywords will help the model flag whether or not it's a relevant document. And then from there, once we parse that out, we start fine-tuning on other elements contained within these sentences: dates, values, references to cryptocurrency tokens or maybe Web3 technologies such as MetaMask, and we start parsing that out into individualized component parts. There's also a final part where we'll use third-party references, people that have already documented fraudulent activity, and we'll cross-reference against that for better quality assurance on whether this is a legitimate scam, or whether we can use supplementary documents to qualify some of these statements. So, yeah, there are quite a lot of steps and stages involved in this process. [laughter]
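
[Editor's note: a sketch of the entity-extraction step, assuming spaCy's off-the-shelf English model; the crypto-specific patterns and the example sentence are hypothetical additions, not the project's actual rules.]

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf English pipeline

# Hypothetical crypto/Web3 patterns layered on top of the stock NER model.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "CRYPTO", "pattern": [{"LOWER": {"IN": ["btc", "eth", "metamask"]}}]},
])

doc = nlp("On March 3 I lost 2 ETH, about $4,000, to a fake MetaMask popup.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # dates, money amounts, crypto references
```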

MJ 16:28              Yeah. And as you're going through this, I presume it's not easy. So, I would love to hear a little bit about some of the challenges that you're coming face to face with. I presume that the number of bots on these platforms is one of them. But, yeah, any other things that stick out as huge challenges as you build this?

SG 16:46               I mean, with this type of stuff, it's always a challenge to handle deduplication. So, as you said, there are bots that will just spam this type of information over and over again, so it's hard to figure out and manage the filtering of that. I'll give you one little interesting tidbit that I didn't think of before starting this project. Particularly on Twitter, you're constrained to only 280 characters, so it's very difficult to express a scam or something complex in such a minimal amount of time—or space.
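
[Editor's note: one possible approach to the deduplication problem, not necessarily the one used here: MinHash locality-sensitive hashing over token sets, via the open-source datasketch library, to catch near-duplicate posts that bots spam with small edits. The threshold and posts are illustrative.]

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    # Hash each token into a MinHash signature for the whole post.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # illustrative similarity cutoff
posts = {
    "t1": "rugged by example-swap lost 2 eth be careful everyone",
    "t2": "rugged by example-swap lost 2 eth be careful folks",  # bot variant
    "t3": "anyone know a good pizza place downtown",
}
for key, text in posts.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(posts["t1"])))  # expected: t1 plus its near-duplicate t2
```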

But what people do to shortcut that is, they'll supplement a tweet with an image. And what's tricky is that you'll get sentences that are filled with syntactical errors, and they don't make any sense. But when you look at the text and then you look at the image, you get a full understanding of what the scam is. Maybe people will show a screenshot of their computer or their browser and how there was a phishing attempt or something of that nature. So, we're having to develop an entirely separate model to deal with the text and the images together, rather than just looking at the text on its own. This is an unexpected challenge that we're currently having to overcome. Yeah, it's quite difficult.
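
[Editor's note: a hedged sketch of scoring a tweet's attached image against scam-related descriptions, using the open-source CLIP model as a stand-in for the separate image model Steven mentions; the prompts and file name are hypothetical.]

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("tweet_attachment.png")  # hypothetical attached screenshot
labels = [
    "a screenshot of a phishing website or crypto wallet popup",
    "an unrelated photo",
]
# Score how well the image matches each text description.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```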

MJ 18:19              Yeah, it's fascinating. And I guess the output from all of this, when you're finished and you're extracting the information, aggregating it, and putting it together: what does the user see, and what is the thing that you're selling to clients?

SG 18:33               Yeah. So that's the wonderful part: the user is just going to see cleaned information, counts, and aggregated scores. So, for them, it's going to look very pretty and refined and processed. They're not going to see the underlying mechanics in the background, which are very messy and complex. And they're going to see one of two forms. One is a dashboard that will show these aggregated risk scores overall. The second option is an API that they can make automated calls to for requesting this information. And then if they want to have their own dashboards or their own type of visual interfaces, they can do with it what they please.
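
[Editor's note: a minimal sketch of the second delivery option, an API serving aggregated risk scores, written here with FastAPI; the endpoint names, categories, and scores are placeholders, not the actual product.]

```python
from fastapi import FastAPI

app = FastAPI()

# Stand-in for the aggregated scores the NLP pipeline would produce.
RISK_SCORES = {"phishing": 0.81, "rug_pull": 0.64, "fake_airdrop": 0.47}

@app.get("/risk")
def all_risks() -> dict:
    """Return the full risk landscape as one aggregated snapshot."""
    return RISK_SCORES

@app.get("/risk/{scam_type}")
def risk(scam_type: str) -> dict:
    """Return the current score for one scam category."""
    return {"scam_type": scam_type, "score": RISK_SCORES.get(scam_type)}

# Run with: uvicorn risk_api:app --reload  (module name is hypothetical)
```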

MJ 19:27              Got you. And this sounds like it overlaps, or at least this sort of technology seems like it could be very powerful in many other industries. I mean, I wonder, I'm just thinking off the top of my head, maybe like electronic medical records have a lot of just text and things like that, that you could potentially pull out and try to associate things. What are some other places that you would think to use this next or build upon what you've already got here?

KS 19:50               One application that we've thought of in the past is social media tracking for different companies. I know there are a bunch of different services that provide that. But say, in litigation, if you know in advance you're going to have some big case, you might want to track what the public opinion of it is.

So basically, changing our queries from searching specifically for crypto topics to searching for specific companies. Seeing what the sentiment is about these different companies as they get brought up. Are there any trending hashtags about them?

Other than that, we haven't thought too much about how to take the same type of technology and apply it to bigger data. I mean, at the end of the day, the technology is pretty much all the same. We're using natural language processing to parse through huge amounts of data and extract only the bits of information that we really care about. Once we have that important bit of information, then we can start developing models around it. We can start creating risk scores or other types of metrics that we can track. But in the grand scheme, we haven't thought too much about how else we can apply it. We're always open to suggestions. So, if you've got anything cool that you've thought of, feel free to let us know.
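
[Editor's note: a toy version of the sentiment-tracking idea Karl mentions above, using Hugging Face's off-the-shelf sentiment pipeline; the example posts are invented, and a real deployment would likely need a model tuned to the relevant domain.]

```python
from transformers import pipeline

# The default checkpoint is a general-purpose English sentiment model.
sentiment = pipeline("sentiment-analysis")

posts = [
    "Really impressed with how Company X handled the recall",
    "Another outage from Company X, switching providers for good",
]
print(sentiment(posts))  # e.g., one POSITIVE and one NEGATIVE label with scores
```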

MJ 21:10              Yeah, absolutely. Everyone can obviously contact you guys and see if it would work in their scenarios; that's perfect. Any other projects or things that you've been working on lately that you think are really interesting, that are representative of some of the different things that you do in your team?

KS 21:26               We actually got a cool new one last week. I think it's amazing: we'll be looking into fraud detection between big banks. And the problem is really interesting, because a specific bank can be doing something that seems illegal to get more money from the market, in my opinion pulling a fast one on all the people that they're working for, but it's entirely legal. The real part that the litigation is about is when they start colluding with each other. When more than one bank or more than one of these companies start talking to each other, it's like, "Hey, I'm going to screw over my customers." If they tell other companies to do it at the same time to enhance the amounts of money that they're going to be pulling in, that's where it becomes illegal, and that's what all the litigation is about. So even formulating the ideas of how we show the correlation that two of these companies are working together, finding that correlation and proving it, is pretty interesting.
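
[Editor's note: a deliberately simplified, synthetic illustration of the kind of coordination signal Karl describes: two banks whose submissions move together far more than an independent peer's. Real matters involve far more careful econometrics; every number below is made up.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
days = pd.date_range("2024-01-01", periods=250, freq="B")

common = rng.normal(size=250).cumsum()  # shared factor for the colluding pair
rates = pd.DataFrame({
    "bank_a": common + rng.normal(scale=0.5, size=250),
    "bank_b": common + rng.normal(scale=0.5, size=250),  # moves with bank_a
    "bank_c": rng.normal(size=250).cumsum(),             # moves independently
}, index=days)

# Correlate daily changes, not levels, to avoid spurious trend correlation.
print(rates.diff().dropna().corr().round(2))  # the a/b pair should stand out
```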

And the scale of the data is so vast, and the number of people affected is huge, so it's something pretty rewarding, at least in my opinion, that we get to go after the big, bad banks. The data is so large that, without us, I don't know how anybody would do it. So being able to use the technologies that we've put together to be the Robin Hoods and give back, I think that's pretty cool, at least in my opinion.

Steven, you got anything cool that we're working on that you want to talk about?

SG 23:05               I think what we're doing right now that's pretty interesting is that we're developing and revamping our entire machine learning ops from start to finish. MLOps: the entire pipeline for getting an idea into a finalized production environment. We're building all of the technology up for that. The idea here is: how fast and how accurately can we take a nascent idea and translate it into a machine learning solution, and automate this in a way that gives us oversight and improved accuracy over time? That's one of the things that we're doing.

So, an example of that is we're currently using a product called Weights & Biases. It's experiment-tracking software, so Karl and I can attempt every configuration imaginable to try to get the best score for a natural language processing model. We can try different data sets. We can try cleaned results, uncleaned results, hyperparameter tuning, you name it. We can track the results and select the model that performs the best. It's a really automated platform where we can leave little breadcrumbs and back-check which paths and which choices we made for a particular model. So, yeah, I would say the interesting part is the MLOps platform that we're putting together. It's very interesting for me.
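
[Editor's note: a minimal sketch of the experiment-tracking loop Steven describes, using the Weights & Biases Python API; the project name, config values, and metric are placeholders standing in for a real training run.]

```python
import random
import wandb

# One tracked run; in practice you would launch many with different
# configs and compare them side by side in the W&B dashboard.
run = wandb.init(project="nlp-experiments", config={
    "learning_rate": 3e-5,
    "dataset": "cleaned_v2",
    "max_epochs": 3,
})

for epoch in range(run.config.max_epochs):
    # Stand-in metric; a real loop would log whatever training produces.
    wandb.log({"epoch": epoch, "val_f1": 0.70 + 0.05 * epoch + random.random() * 0.01})

run.finish()
```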

KS 24:56               Yeah. The MLOps, I think, is really cool. What you always tell me, Steven, is that we're not trying to build a car; we're trying to build a factory that produces cars. So rather than creating one-off models for every different type of project that we might work on, we want to make a factory, so that you can feed your ideas in and automatically generate these models. That way we can work faster; we can work more efficiently; we can work with more people; and we can spend more time innovating, developing, and researching rather than having our hands directly on creating these one-off products.

MJ 25:33              That's awesome. And another cool project that we have been collaborating with you guys on recently is analysis of video. I didn't realize this was a law until fairly recently, but certain states have restrictions, from an employment-law perspective, on when someone gets a chair, or is required to have a chair, at their workstation. It's something like: if you're standing in a given fixed position for a certain amount of time (think a cashier, or someone working a pharmacy desk), then you're entitled to a chair. And so, rather than have an attorney with a clipboard and a stopwatch stand there and track every single moment that the person is standing, moving, reaching, bending, whatever, we could just take the security camera footage, identify where that person is, which zone they're located in inside their general workstation area, and what movement they're doing, and then perform all those calculations and provide the output, which can save tons and tons of time. But I learned recently that you guys also have an even larger project related to video analysis. Do you want to talk about that a little bit?
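
[Editor's note: a small, hypothetical sketch of the last step of the workstation analysis: once a person detector has produced per-frame positions, tally how long the worker stays inside a fixed standing zone. The detector itself, the zone coordinates, and the positions are all assumed inputs.]

```python
FPS = 30                   # frames per second of the security footage
ZONE = (100, 300, 0, 200)  # x_min, x_max, y_min, y_max of the standing zone, in pixels

def in_zone(x, y, zone=ZONE):
    x_min, x_max, y_min, y_max = zone
    return x_min <= x <= x_max and y_min <= y <= y_max

# One (x, y) position per frame from an upstream person detector
# (synthetic here: 90 s inside the zone, then 10 s away from it).
positions = [(150, 80)] * (FPS * 90) + [(400, 80)] * (FPS * 10)

frames_in_zone = sum(in_zone(x, y) for x, y in positions)
print(f"time in standing zone: {frames_in_zone / FPS:.1f} s")
```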

KS 26:41               Yeah. We recently got a lot of data, video data and audio data, from a big international nonprofit. And the scale of it is 150 terabytes. That was up front. We recently learned it's actually up to about 400 terabytes of data now that we get to go through and process. Basically, what they're looking for is, one, to host the data. Who knows how to host 400 terabytes of data? And there are other aspects to it. As part of their operations, they need to make sure they're compliant: if litigation were to come their way, they have to have this data in a nice, organized place, so that when these things come along, they're well within their compliance and can present this data.

So, with the 400 terabytes of data, what we want to do, and what we're trying to provide for this company, is to look through all the data, whether that data shows a bird or anything else they're interested in, and extract that information so that they can make it searchable.

They want to find all videos that contain birds, so that they can look through and say, "Hey, there's a specific bird that we're tracking. Or here's a specific dolphin. Or here's one of our adventures down in the Cayman Islands," something like that. So we take both the video and audio elements, pass them through an audio transcription service, then run named entity recognition and keyword extraction, so that maybe, based on the words inside the document, they can say, "Okay, I know we were in the Cayman Islands; we were talking about some specific bird. I'm going to search through the text and find this specific bird." We're providing them those capabilities so that they can search through the metadata, pull on a thread, and bring up all the documents they've got on the specific bird or project they're working on, so they can put together their marketing materials. It's really the scale of this project that's fun to handle. And working with the video data, trying to extract information from the video that we can then provide to them, that's a lot of fun. And 400 terabytes is the biggest one we've worked on for a while; it's a lot of data. [laughter]
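
[Editor's note: a hedged sketch of the transcribe-then-index pipeline Karl describes, pairing the open-source Whisper model for transcription with spaCy for named-entity extraction; the file name is hypothetical, and the actual engagement may use different services.]

```python
import spacy
import whisper

asr = whisper.load_model("base")    # open-source speech-to-text
nlp = spacy.load("en_core_web_sm")  # off-the-shelf named-entity recognition

result = asr.transcribe("cayman_dive_footage.mp4")  # hypothetical clip
transcript = result["text"]

doc = nlp(transcript)
metadata = {
    "file": "cayman_dive_footage.mp4",
    "entities": sorted({(ent.text, ent.label_) for ent in doc.ents}),
    "transcript": transcript,
}
# Indexing `metadata` into a search store turns "find every clip that
# mentions the Cayman Islands or a particular species" into a text query.
print(metadata["entities"])
```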

MJ 28:59              Very cool. I love hearing about all the different things you guys are doing. It's super innovative, and it's so fun that you can touch so many different industries and work with so many different kinds of people. In our final couple of minutes, I'd love to just nerd out a little bit and hear from you guys what you think the future of machine learning and AI is. What are some cool tools, besides, obviously, the ones you've already mentioned, like Snorkel? What else are you looking forward to in the industry, or what cool things do you think are right around the corner?

KS 29:28               So, one thing that we've already started seeing is artificial intelligence and machine learning solutions being implemented in just about every piece of software. Everybody is starting to throw that in as an extra add-on: "Hey, we just added on this very small artificial intelligence component." So it's becoming a lot more mainstream. We've got a software license with a company where, basically, a lot of what they're providing is just open-source models that they're putting into their product and saying, "Hey, we can do that too. Hey, we can do that too." Which I think is cool, for one, for normalizing the use of AI and ML, but there are some caveats with that. Don't always trust what you're getting out of this software, because often it's just a basic, general model being thrown at the problem. So, you should always be wary of some of the claims that these people are making. But I do like the idea that it's becoming more mainstream, so the ideas can start flowing, and the activation energy for adding a new machine learning project is a lot lower. People are already primed for that kind of thing.

MJ 30:38              Cool. Steven, I don't know if you had anything to add as well.

SG 30:40               Yeah, I think what's interesting—and it has been going on for a while, but I think it's changing every day—is transfer learning in artificial intelligence. That's making use of very, very large-scale natural language models that have generalized knowledge of the entire English language, applied to a particular context. I think this is becoming more accessible for people to develop with, and you're seeing rapid improvements in accuracy through the use of transfer learning. It's incredible, and it's constantly changing. Every single month, you see new models, new contexts, new transfer learning models, all basically at our disposal, which is very interesting and very cool.

KS 31:34               One thing I'll add to that quickly: the idea of transfer learning only started coming around, what, in 2017 or so? And so, any technology that we were using even before that, we had to revamp, because the use of transfer learning increased our performance by 10–15 percent right off the bat. We're in such a fast-moving field that if you're not innovating, if you're not looking at where things are going, you can easily be left in the dust. So, we always have to be at the forefront, always looking for the new technologies as they come out, and trying to put them into the products that we're offering.
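
[Editor's note: the transfer-learning pattern Karl and Steven describe, sketched here with the Hugging Face Transformers library: start from a model pretrained on general English and fine-tune it on a small task-specific sample. The model choice, texts, and labels are illustrative, not from any BRG project.]

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a model pretrained on general English; only the small
# classification head on top is newly initialized.
name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

texts = ["email seeking legal advice from counsel", "weekly cafeteria menu"]
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)
dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": labels[i]}
    for i in range(len(texts))
]

# A few epochs on a small labeled set is often enough to adapt the
# pretrained representation to the new task.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```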

MJ 32:16              Very cool. If you don't move quickly, you'll be left in the dust. Well, perfect. That's a great spot to end. Thank you both so much.

I encourage everyone to reach out to both Karl and Steven if you have any questions or want to learn more about the capabilities of machine learning and artificial intelligence here at BRG. And as always, guys, thanks. It's been a pleasure, and I can't wait until the next time.

SG 32:37               Thank you.

KS 32:38               Thanks.

MJ 32:39              Bye-bye. The views and opinions expressed in this podcast are those of the participants and do not necessarily reflect the opinions, position, or policy of Berkeley Research Group or its other employees and affiliates.